Global Epidemiology. 2025 Jan 22;9:100186. doi: 10.1016/j.gloepi.2025.100186

On the current and future potential of simulations based on directed acyclic graphs

Lutz P Breitling a, Anca D Dragomir b,c, Chongyang Duan d, George Luta c,e,f
PMCID: PMC12190896  PMID: 40568465

Abstract

Real-world data are playing an increasingly important role in regulatory decision making. Adequately addressing bias is of paramount importance in this context. Structural representations of bias using directed acyclic graphs (DAGs) provide a unified approach to conceptualize bias, distinguish between different types of bias, and identify ways to address bias. DAG-based data simulation further enhances the scope of this approach. Recently, DAGs have been used to demonstrate how missing eligibility information can compromise emulated target trial analysis, a cutting-edge approach to estimating treatment effects using real-world data. The importance of simulation for methodological research has received substantial recognition in the past few years, and others have argued that simulating data based on DAGs can be especially helpful for understanding various epidemiological concepts. In the present work, we present two concrete examples of how simulations based on DAGs can be used to gain insights into issues commonly encountered in real-world analytics, namely regression modelling to address confounding bias and the potential extent of selection bias. Increasing accessibility and extending the simulation algorithms of existing software to include longitudinal and time-to-event data are identified as priorities for further development. With such extensions, simulations based on DAGs would be an even more powerful tool to advance our understanding of the rapidly growing toolbox of real-world analytics.

Keywords: Simulation studies, Directed acyclic graphs, Selection bias, Confounding, Real world evidence, Regression modelling, Emulated target trial

Introduction

Real-world data have become a routinely used component of regulatory decision making, and emulated target trials are considered a "reliable design strategy" in this context [1]. In an emulated target trial approach, large observational datasets are analyzed in a way that approximates the causal effect estimates that would have been obtained in an explicitly defined target trial, that is, in a randomized experiment that could not be realized [2]. The emulated target trial methodology is justifiably receiving a lot of attention, as reflected by pertinent keynote presentations, workshops, and dedicated sessions at scientific meetings of both the biometrical and the epidemiological communities. At the same time, however, it is not fully understood why the emulated target trial approach performs better in some situations than in others when it comes to reproducing the causal effect estimates from corresponding randomized controlled trials. Tompsett et al. recently explored the relevance of various mechanisms of missingness of eligibility data in emulated target trials [3]. They identified one possible reason for the varying performance of emulated target trials, which obviously needs to be taken into account when judging the added value of any such study for advancing our knowledge on treatment effects. Almost as a by-product, their work features yet another application of directed acyclic graphs (DAGs), used to illustrate the causal relationships underlying their missingness simulations, including biasing paths induced by conditioning on the eligibility indicator and its missingness process.

In brief, DAGs are structural representations of causal relationships between variables, which under certain assumptions can be used to investigate issues related to bias in epidemiological studies. Although the graph-theoretical foundations underlying DAG-based approaches to the analysis of bias have been around for many decades, it took substantial efforts to make these approaches accessible and popular in the health data-analytical and epidemiological communities. In a recent publication, Levy and Keyes [4] revisit one of the truly seminal papers in this field, pointing out the great contribution this work by Hernán et al. [5] has made to our understanding of the intricacies of bias. They also identified some open issues potentially requiring additional attention, such as the extent of bias introduced by controlling for or stratifying on a collider, and what impact this has on external as opposed to internal validity. Though some of these challenges lend themselves to an algebraic or philosophical approach, most of them could probably be addressed with carefully designed DAG-based data simulations. Statistical software providing convenient interfaces to simulate data based on the causal structures of a given DAG, however, remains remarkably scarce [6].

DAG-based simulations as an accessible educational tool

Interdisciplinary teaching provides an important opportunity to promote sound approaches to data analysis among collaborating non-biometrical researchers. Fox et al. recently outlined how DAG-based data simulations can be used to aid the understanding of epidemiologic concepts [7]. They emphasized how coding the data-generating mechanisms could provide added value in itself and should be an integral component of the teaching and learning process. Although they focused on DAG-based simulations in their paper and addressed key concepts particularly suitable to DAG-based approaches, their work supports the more general notion of the role simulation approaches can play in epidemiological teaching. In a complementary paper, they illustrate the usefulness of data simulation in the context of critically mastering nondifferential misclassification and null-hypothesis testing, two fundamental concepts in epidemiology [8]. Their conclusion states that "simulation puts the ability to experiment and test methods in the students' hands," which is all the more important given the ever-increasing complexity of the data-analytical repertoire. This is reminiscent of the challenges the community faces with innovative methods such as emulated target trials.

Although programming by oneself is a very instructive and useful way to dive deeply into the various topics of interest, it may be beyond the reach of many epidemiologists and other researchers for various reasons. The process can be tedious, which may be helpful as part of the learning process, but also prohibitive in terms of the time available. Ad hoc coding from scratch may also appear somewhat intimidating and error-prone, presenting another obstacle for non-expert users. Time will tell what role generative artificial intelligence can play in this setting [9], where understanding the data-generating process is of fundamental importance. Methodologists may more often have the necessary programming and validation skills, but convenient software solutions can still save time and have the potential to enable a much wider community to employ DAG-based data simulations on a greater scale, both for research and for teaching purposes. The use of pre-implemented, flexible data-generating mechanisms could also contribute to improved reproducibility of simulation studies in the field of methods comparisons, where ensuring comprehensive reporting remains a challenge [10].

In a recent paper, we showed how DAG-based data simulations can be used to address a variety of epidemiological and regression modelling issues in an educational context, for example, classical confounding and harmful adjustment [11]. Although many of the findings obtained from such simulation studies can also be derived using regression theory, the theory may be inaccessible to many students with limited statistical training. Furthermore, some issues can be addressed much more easily using simulations. For example, in the classical confounding triangle with binary variables, it is challenging to evaluate analytically the bias due to the confounding variable, because the variance of the exposure depends on its prevalence. Very basic simulations allow exploring this issue across plausible parameter values and provide concrete and informative estimates of the bias that must be expected if the confounder cannot be adjusted for in a real-world setting [11]. This, of course, should not be misunderstood as implying that simulations can fully replace theoretical considerations. They may miss important features of bias that might be effectively revealed by theory, and one always has to remain cautious about the generalizability of simulation-based results. Every data generation mechanism ultimately corresponds only to some kind of model, and all models tend to be wrong in some way or another [12].
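As a minimal sketch of such a simulation, the following base R code (written from scratch rather than with any DAG package; the odds ratio of 2 per arc and the prevalence of 0.5 are illustrative assumptions, not values from the cited work) simulates the classical confounding triangle and contrasts the crude and confounder-adjusted logistic regression estimates:

```r
# Confounding triangle c1 -> x, c1 -> y, x -> y; all nodes binary,
# effects on the log-odds scale. Parameter values are illustrative.
set.seed(1)
n  <- 1e6
b  <- log(2)                               # assumed log-OR per arc
c1 <- rbinom(n, 1, 0.5)                    # confounder, prevalence 0.5
x  <- rbinom(n, 1, plogis(b * c1))         # exposure depends on c1
y  <- rbinom(n, 1, plogis(b * x + b * c1)) # outcome depends on x and c1

coef(glm(y ~ x,      family = binomial))["x"]  # crude estimate: biased
coef(glm(y ~ x + c1, family = binomial))["x"]  # adjusted: recovers ~log(2)
```

Wrapping this snippet in a loop over the prevalence of c1 directly produces the kind of bias-by-parameter exploration described above.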

Instructive use cases of DAG-based data simulations

At present, regression models are probably the most common tool used to address confounding in epidemiological studies and real-world data analysis. DAG-based methods to correctly identify minimal sufficient adjustment sets can be very useful in this context [13]. However, one issue that appears somewhat neglected when introducing this subject is whether different minimal sufficient adjustment sets can be seen as equivalent when trying to obtain an unbiased estimate of a causal effect using a regression model. This is not at all the case, and DAG-based simulations can be used in a straightforward fashion to demonstrate this issue. For the time being, the reader is encouraged to focus only on the upper panel of Fig. 1, showing three different DAGs with four confounding variables each. For the DAG labelled C, there is a single backdoor path from X to Y, comprising all four confounders. Based on DAG theory, adjustment for any single one of the confounding variables is sufficient to remove confounding bias, but modelling-wise, the situation is not as simple. The reader is encouraged to first reflect, based on the parameters given in the figure caption, on what kind of bias they expect to see in the regression estimates of the effect of X on Y for the different causal structures and adjustment sets, before examining the corresponding results in the bottom panel of Fig. 1. In a computer lab-like setting, a logical follow-up task would be to have the participants run the relevant simulations and explore for themselves the impact of changing the various parameters or data-analytical approaches.

Fig. 1.

Upper panel: Three different causal structures with four confounding variables each (c1 to c4). Lower panel: logistic regression model estimate of the effect of exposure x on outcome y in data simulated based on the different DAGs, either raw (without adjustment) or adjusting for any of the possible minimal sufficient adjustment sets. Simulations were done with n = 1,000,000, all nodes binary, OR(x → y) = exp(1), also OR = exp(1) for all other arcs, and setting the prevalence for each node to 0.5 in its reference category, i.e., where its parents sum to 0.

Each of the DAGs shown in Fig. 1 represents a causal structure including four variables (c1 to c4) that could be considered confounding variables based on classical criteria. For DAG A, which features four archetypical confounding variables, each having a direct effect on both the exposure and the outcome, the only minimal sufficient adjustment set (MSAS) consists of all four variables. When estimating the causal effect of x on y using a logistic regression analysis, adjusting for the MSAS in this case adequately removes all confounding bias. For DAGs B and C, the situation is somewhat different. Here, four different MSAS are identified by DAG theory, each consisting of either two (DAG B) or one (DAG C) of the four nodes. These MSAS are of equal size and of equal value in the sense that adjustment for any one of them theoretically closes all backdoor paths producing spurious associations between exposure and outcome. The estimate obtained by logistic regression will nonetheless only be unbiased if one chooses the MSAS consisting of the nodes that are direct parents of the outcome, whereas adjusting for the more distant nodes is less effective in removing confounding bias. We have also considered smaller values of the OR, i.e., 1.2, 1.5, and 2.0. Although the patterns are unaffected, the differences become smaller as the OR gets closer to 1. Commented R code for analyses performed along the lines of Fig. 1 is provided in the Supplemental Materials.
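The following base R sketch illustrates the phenomenon for one plausible reading of DAG C, namely a single backdoor path x ← c1 → c2 → c3 → c4 → y in addition to the causal arc x → y; it is a simplified stand-in for the Supplemental Materials code, with parameters as in the Fig. 1 caption:

```r
# One plausible reading of DAG C (an assumption about the figure): chain
# c1 -> c2 -> c3 -> c4, with c1 -> x and c4 -> y. Log-OR of 1 per arc and
# prevalence 0.5 in each node's reference category, as in the caption.
set.seed(2)
n  <- 1e6
c1 <- rbinom(n, 1, 0.5)
c2 <- rbinom(n, 1, plogis(c1))
c3 <- rbinom(n, 1, plogis(c2))
c4 <- rbinom(n, 1, plogis(c3))
x  <- rbinom(n, 1, plogis(c1))
y  <- rbinom(n, 1, plogis(x + c4))

coef(glm(y ~ x,      family = binomial))["x"]  # unadjusted: confounded
coef(glm(y ~ x + c1, family = binomial))["x"]  # distant node: backdoor closed, estimate still attenuated
coef(glm(y ~ x + c4, family = binomial))["x"]  # parent of y: recovers ~1
```

In this reading, adjusting for c1 closes the backdoor path in the DAG sense, but the logistic regression estimate is still pulled away from the target conditional log-OR because the strong outcome predictor c4 is omitted from the model.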

Exercises such as the one presented in the preceding paragraphs can be useful in an educational context with modelling and model selection in mind. This obviously extends to methodological work comparing the performance of various analytical approaches for estimating causal effects under different scenarios. On a different note, similar simulations can be used during the design phase of real-world studies, as it is straightforward to examine what impact the (non-)adjustment for a certain confounding variable would have. Given that measuring different variables can come at hugely different costs, it would appear almost negligent to make pertinent decisions without a systematic evaluation of costs and benefits in terms of both resources and modelling. It is worth noting that measurement error adds another layer of complexity to these considerations. It can be trivially incorporated in DAG-based simulations by including "measured value" nodes, as sketched below. It is precisely in the context of measurement error and the cost of different exposure surrogates that the matter of cost was considered by Armstrong [14] via analytical solutions, although it is natural to extend that work to simulation-based approaches when measurement errors do not conform to known theory.
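A minimal sketch of such a "measured value" node, extending the confounding triangle from above with an assumed nondifferential misclassification probability of 10% for the confounder (both the flip probability and the arc strengths are illustrative assumptions):

```r
# Confounding triangle with a mismeasured confounder node c1m.
set.seed(3)
n   <- 1e6
b   <- log(2)
c1  <- rbinom(n, 1, 0.5)                            # true confounder
c1m <- ifelse(rbinom(n, 1, 0.1) == 1, 1 - c1, c1)   # measured value: 10% of values flipped
x   <- rbinom(n, 1, plogis(b * c1))
y   <- rbinom(n, 1, plogis(b * x + b * c1))

coef(glm(y ~ x + c1,  family = binomial))["x"]  # perfectly measured: ~log(2)
coef(glm(y ~ x + c1m, family = binomial))["x"]  # mismeasured: residual confounding remains
```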

A prime example of the superiority of the DAG-based approach to confounding bias is a causal structure like the M-DAG, in which ancestors of the exposure and of the outcome are determinants of a common child node, the collider variable. In this situation, the causal effect of the exposure on the outcome is estimated without bias when no adjustment is performed, but it can easily be demonstrated that adjusting for the collider leads to a biased estimate of the causal effect [13], as illustrated in the sketch below. Although the M-DAG has consequently gained some popularity, even simpler causal structures provide useful examples of such adjustment-induced bias. Greenland et al. [13] discussed a causal structure such as the DAG shown in Fig. 2. In their example, they referred to an analysis of the effect of estrogen therapy on endometrial cancer, with bias introduced by stratification on bleeding. A more recent motivation stems from the potentially strong selection bias that must be taken into account when interpreting findings from online-administered instruments, which have become a very common tool and experienced an additional boost in popularity during the COVID-19 pandemic [15]. Such instruments often are widely and unselectively distributed to compensate for low response rates. It is rather obvious that associations estimated from such studies will be biased if both the risk factor (e.g., willingness to obey COVID-19 hygiene recommendations) and the outcome of interest (e.g., having experienced COVID-19 during a specific time frame) affect study participation. Simple DAG-based simulations such as those shown in Fig. 2 can help to put the observed associations into perspective, and more complex causal structures can easily be accommodated. Of course, the power of this tool crucially depends on what external evidence is available regarding the various variables and associations related to the specific study question.
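A minimal base R sketch of the M-DAG, with deliberately strong, illustrative arc strengths (log-OR of log(4) per arc is an assumption chosen to make the collider bias clearly visible):

```r
# M-DAG: u1 -> x, u1 -> m, u2 -> m, u2 -> y; no effect of x on y.
# Adjusting for the collider m opens the biasing path x <- u1 -> m <- u2 -> y.
set.seed(4)
n  <- 1e6
b  <- log(4)
u1 <- rbinom(n, 1, 0.5)
u2 <- rbinom(n, 1, 0.5)
m  <- rbinom(n, 1, plogis(b * u1 + b * u2))
x  <- rbinom(n, 1, plogis(b * u1))
y  <- rbinom(n, 1, plogis(b * u2))

coef(glm(y ~ x,     family = binomial))["x"]  # ~0: unadjusted analysis is unbiased
coef(glm(y ~ x + m, family = binomial))["x"]  # non-zero: collider-adjustment bias
```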

Fig. 2.

Selection bias when both exposure (x) and outcome (y) affect the participation/selection into the study (ps). Binary data were simulated with n = 10,000,000 according to the simple DAG presented, with no direct effect of x on y, and with varying effects (odds ratios, OR) of x and y on ps. The simulated sample size was further increased compared to Fig. 1 to obtain smoother curves and improve the readability of the plots. Shown is the bias (spurious risk difference) induced when regressing y on x in the subset with ps = 1 by fitting a generalized linear model with binomial distribution and identity link. The impact of varying the baseline probability of participation/selection is shown across panels.

A simple DAG like the one shown in Fig. 2 can be introduced even to non-statistical audiences as the causal structure according to which a simulation study has been carried out. In the present example, each panel allows the researcher to demonstrate in a very tangible fashion that the magnitude and direction of the spurious risk difference introduced through selection bias vary widely depending on the effects of exposure and outcome on study participation. If some consensus can be reached regarding plausible values for the various parameters related to a specific study question, one can directly read from the figure the extent of bias that can be expected to affect the causal estimates from a relevant study. If no such consensus can be reached, the conclusion could be that one cannot even predict whether negative or positive bias is expected, and that related results must be interpreted with great caution. On the programming side, a large number of scenarios can be easily investigated with simple loops around the ready-to-use DAG-based simulation functions (see Supplemental Materials).
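As a minimal base R illustration of a single such scenario (the participation ORs of 3 and the baseline participation probability of 0.2 are illustrative assumptions; in Fig. 2 these parameters are varied systematically):

```r
# Fig. 2 setup: x has no effect on y, but both increase participation ps.
set.seed(5)
n  <- 1e7
x  <- rbinom(n, 1, 0.5)
y  <- rbinom(n, 1, 0.5)  # independent of x by construction
ps <- rbinom(n, 1, plogis(qlogis(0.2) + log(3) * (x + y)))

# Risk difference among participants, via a binomial GLM with identity link:
fit <- glm(y ~ x, family = binomial(link = "identity"),
           subset = ps == 1, start = c(0.5, 0))
coef(fit)["x"]           # spurious, non-zero risk difference despite no effect of x on y
```

Placing this snippet inside nested loops over the two participation ORs and the baseline participation probability reproduces the kind of scenario grid underlying Fig. 2.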

The software used in the aforementioned work was coded and published in the form of an R package called "dagR" to facilitate general access and further development [6]. As part of the process, an important realization was that the simulation capabilities of published DAG software were somewhat limited. For example, both the initial version of dagR and other available R packages were limited to simulating binary data based on logistic regression models. This is somewhat surprising given the importance of introducing epidemiology students to different effect measures. A related phenomenon that has puzzled the uninitiated since its first description is the non-collapsibility of the odds ratio [16]. The latest version of dagR has consequently been extended to include functionalities for risk difference-based simulations, which generally expand the flexibility of the package, but may be particularly useful for teaching non-collapsibility and for differentiating this issue from confounding [17].
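To make the distinction concrete, the following base R sketch (independent of dagR; all parameter values are illustrative assumptions) shows an unconfounded setting in which the marginal odds ratio nonetheless differs from the conditional one, while the marginal risk difference equals the average of the stratum-specific risk differences:

```r
# Non-collapsibility without confounding: z predicts y but has no arc to x.
set.seed(6)
n <- 1e6
x <- rbinom(n, 1, 0.5)
z <- rbinom(n, 1, 0.5)  # independent of x: not a confounder
y <- rbinom(n, 1, plogis(-1 + 2 * x + 2 * z))

exp(coef(glm(y ~ x,     family = binomial))["x"])  # marginal OR: attenuated, yet no bias involved
exp(coef(glm(y ~ x + z, family = binomial))["x"])  # conditional OR: ~exp(2)

# The risk difference collapses: the marginal RD equals the average of the
# z-stratum RDs (z has prevalence 0.5 and is independent of x).
mean(y[x == 1]) - mean(y[x == 0])                  # marginal RD
rd_z <- tapply(y, list(z, x), mean)                # 2x2 table of risks by z and x
mean(rd_z[, "1"] - rd_z[, "0"])                    # average stratum-specific RD
```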

Discussion and conclusion

The above discourse shows how DAG-based simulations, as implemented in currently available software, can be used for diverse teaching and research purposes, ranging from applications of data-analytical theory to practical study design questions. A couple of extensions would appear helpful to make this approach suitable for addressing an even wider range of challenges, and some current "hot topics" in particular. Continuous variables and count data could be added to these simulations by using linear regression models and Poisson regression models, respectively, instead of logistic regression models. Given the great power of data from cohort/prospective studies for describing and analysing differences in health outcomes, it would be especially useful to also implement time-to-event algorithms for DAG-based software, preferably based on additive hazards or other survival analysis models that lend themselves to causal interpretations [18]. Also, future simulation algorithms should accommodate longitudinal data with repeated measurements, which would allow evaluating a greater variety of interesting modelling approaches, in particular those based on marginal structural models [19,20]. In terms of accessibility, functionalities that greatly facilitate the development of interactive web applications have recently been added to R [21]. Providing a convenient web-based interface to DAG-based data simulations could greatly enhance the use of this tool for teaching and research.
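A hedged sketch of what the proposed count-data extension could look like: the same node-by-node logic used throughout, with the outcome drawn from a Poisson regression model instead of a logistic one (the rate ratio of 1.5 per arc and the baseline rate of 0.5 are illustrative assumptions, not an existing software feature):

```r
# Node-by-node simulation with a count outcome on the log-rate scale.
set.seed(7)
n  <- 1e5
c1 <- rbinom(n, 1, 0.5)
x  <- rbinom(n, 1, plogis(log(2) * c1))
y  <- rpois(n, exp(log(0.5) + log(1.5) * x + log(1.5) * c1))  # rate ratio 1.5 per arc

exp(coef(glm(y ~ x + c1, family = poisson))["x"])  # recovers the assumed rate ratio ~1.5
```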

DAGs have been around for decades, but enhancing their usability appears more needed and exciting than ever. There seems to be exponential growth in the availability and regulatory consideration of large-scale real-world data, as well as in the use of more complex modelling approaches. Just as in the case of emulated target trial approaches, it is not sufficient to demonstrate that some algorithmic "black-box" procedure yields an association estimate close to the effect estimate obtained by some gold standard comparison approach. It is rather necessary to understand under which conditions this is the case, which means that the causal truth in the data, typically established by simulation, must be known; causal knowledge remains as fundamental for evaluating the performance of complex methods as it is for even the basic understanding of causal effects and different types of bias [4]. Interestingly, causal discovery, which attempts to go in the opposite direction, that is, to infer causal structures from data, has also recently been identified as one of the "top 10 future directions for causal inference research" [22]. Although the importance of a sound theoretical foundation for all these research fields cannot be emphasized enough, DAG-based data simulations are likely to play an important role in facilitating methodological research on these topics, in addition to providing a convenient tool to help teachers and students master the pertinent data-analytical approaches.

Funding information

No specific funding was received for this work.

CRediT authorship contribution statement

Lutz P. Breitling: Writing – original draft, Software, Methodology, Formal analysis, Conceptualization. Anca D. Dragomir: Writing – review & editing, Methodology, Conceptualization. Chongyang Duan: Writing – review & editing, Methodology, Conceptualization. George Luta: Writing – review & editing, Methodology, Conceptualization.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

The author is an Editorial Board Member/Editor-in-Chief/Associate Editor/Guest Editor for [GLOEPI] and was not involved in the editorial review or the decision to publish this article.

Appendix A. Supplementary data

Supplementary data (commented R code for DAG-based data simulations) to this article can be found online at https://doi.org/10.1016/j.gloepi.2025.100186.

Supplementary material: mmc1.docx (14.9 KB)

Data availability

No original data were used in this work.

References

1. Purpura CA, Garry EM, Honig N, et al. The role of real-world evidence in FDA-approved new drug and biologics license applications. Clin Pharmacol Ther. 2022;111:135–144. doi: 10.1002/cpt.2474.
2. Hernán MA, Robins JM. Using big data to emulate a target trial when a randomized trial is not available. Am J Epidemiol. 2016;183:758–764. doi: 10.1093/aje/kwv254.
3. Tompsett D, Zylbersztejn A, Hardelid P, De Stavola B. Target trial emulation and bias through missing eligibility data: an application to a study of palivizumab for the prevention of hospitalization due to infant respiratory illness. Am J Epidemiol. 2023;192:600–611. doi: 10.1093/aje/kwac202.
4. Levy NS, Keyes KM. Causal knowledge as a prerequisite for interrogating bias: reflections on Hernán et al. 20 years later. Am J Epidemiol. 2023;192:1797–1800. doi: 10.1093/aje/kwab274.
5. Hernán MA, Hernández-Díaz S, Werler MM, Mitchell AA. Causal knowledge as a prerequisite for confounding evaluation: an application to birth defects epidemiology. Am J Epidemiol. 2002;155:176–184. doi: 10.1093/aje/155.2.176.
6. Breitling LP, Duan C, Dragomir AD, Luta G. Using dagR to identify minimal sufficient adjustment sets and to simulate data based on directed acyclic graphs. Int J Epidemiol. 2021;50:1772–1777.
7. Fox MP, Nianogo R, Rudolph JE, Howe CJ. Illustrating how to simulate data from directed acyclic graphs to understand epidemiologic concepts. Am J Epidemiol. 2022;191:1300–1306. doi: 10.1093/aje/kwac041.
8. Rudolph JE, Fox MP, Naimi AI. Simulation as a tool for teaching and learning epidemiologic methods. Am J Epidemiol. 2021;190:900–907. doi: 10.1093/aje/kwaa232.
9. Nivard M, Wade J, Calderon S. gptstudio: Use Large Language Models Directly in your Development Environment. 2023. https://cran.r-project.org/package=gptstudio
10. Morris TP, White IR, Crowther MJ. Using simulation studies to evaluate statistical methods. Stat Med. 2019;38:2074–2102. doi: 10.1002/sim.8086.
11. Duan C, Dragomir AD, Luta G, Breitling LP. Reflection on modern methods: understanding bias and data analytical strategies through DAG-based data simulations. Int J Epidemiol. 2021;50:2091–2097. doi: 10.1093/ije/dyab096.
12. Box GE. Science and statistics. J Am Stat Assoc. 1976;71:791–799.
13. Greenland S, Pearl J, Robins JM. Causal diagrams for epidemiologic research. Epidemiology. 1999;10(1):37–48.
14. Armstrong BG. Optimizing power in allocating resources to exposure assessment in an epidemiologic study. Am J Epidemiol. 1996;144:192–197. doi: 10.1093/oxfordjournals.aje.a008908.
15. De Man J, Campbell L, Tabana H, Wouters E. The pandemic of online research in times of COVID-19. BMJ Open. 2021;11. doi: 10.1136/bmjopen-2020-043866.
16. Miettinen OS, Cook EF. Confounding: essence and detection. Am J Epidemiol. 1981;114:593–603. doi: 10.1093/oxfordjournals.aje.a113225.
17. Breitling LP, Duan C, Dragomir AD, Luta G. Non-collapsibility and conditional independence: DAG-based simulation exercise and discussion [abstract]. Presented at the 17th meeting of the German Society of Epidemiology, Greifswald, Germany, September 26–29, 2022.
18. Aalen OO, Cook RJ, Roysland K. Does Cox analysis of a randomized survival study yield a causal treatment effect? Lifetime Data Anal. 2015;21:579–593. doi: 10.1007/s10985-015-9335-y.
19. Gilsanz P, Young JG, Glymour MM, et al. Marginal structural models for life-course theories and social epidemiology: definitions, sources of bias, and simulated illustrations. Am J Epidemiol. 2022;191:349–359. doi: 10.1093/aje/kwab253.
20. Keogh RH, Seaman SR, Gran JM, Vansteelandt S. Simulating longitudinal data from marginal structural models using the additive hazard model. Biom J. 2021;63:1526–1541. doi: 10.1002/bimj.202000040.
21. Jia L, Yao W, Jiang Y, et al. Development of interactive biological web applications with R/shiny. Brief Bioinform. 2022;23. doi: 10.1093/bib/bbab415.
22. Mitra N, Roy J, Small D. The future of causal inference. Am J Epidemiol. 2022;191:1671–1676. doi: 10.1093/aje/kwac108.


