PLoS One. 2023 Apr 14;18(4):e0284443. doi: 10.1371/journal.pone.0284443

DagSim: Combining DAG-based model structure with unconstrained data types and relations for flexible, transparent, and modularized data simulation

Ghadi S Al Hajj 1,*, Johan Pensar 2, Geir K Sandve 1
Editor: Emmanuel S Adabor
PMCID: PMC10104342  PMID: 37058511

Abstract

Data simulation is fundamental for machine learning and causal inference, as it allows exploration of scenarios and assessment of methods in settings with full control of ground truth. Directed acyclic graphs (DAGs) are well established for encoding the dependence structure over a collection of variables in both inference and simulation settings. However, while modern machine learning is applied to data of an increasingly complex nature, DAG-based simulation frameworks are still confined to settings with relatively simple variable types and functional forms. We here present DagSim, a Python-based framework for DAG-based data simulation without any constraints on variable types or functional relations. A succinct YAML format for defining the simulation model structure promotes transparency, while separate user-provided functions for generating each variable based on its parents ensure simulation code modularization. We illustrate the capabilities of DagSim through use cases where metadata variables control shapes in an image and patterns in bio-sequences. DagSim is available as a Python package at PyPI. Source code and documentation are available at: https://github.com/uio-bmi/dagsim

Introduction

Data simulation is fundamental for machine learning (ML) and causal inference (CI), as it allows ML/CI methods to be evaluated in a controlled setting using a ground truth model [1–3]. For the purpose of designing flexible, controllable, and transparent simulator models, the class of directed acyclic graphs (DAGs) provides a highly useful framework. The DAG is used to encode the structure of a model involving multiple variables in a form that is both compact and intuitive to a human user. In addition, the resulting DAG-based model is modular and allows for building complex simulators from simpler local components or modules. In a purely probabilistic model, known as a Bayesian Network (BN) [4], the DAG is used to specify the dependence structure over the considered variables. In a causal model, known as a structural causal model (SCM) [5], the DAG is used to specify the causal structure of the underlying data-generating process. In either case, a simulation model is defined by specifying the functional relations between each node and its parents in the assumed graph. In a BN, these relations are typically defined as probability distributions, while an SCM typically models relations as deterministic, where the value of a node is computed based on the value of its parents and an additional exogenous random variable (often referred to as a noise variable). In terms of simulation, there is no practical distinction between the purely probabilistic (BN) and causal perspective (SCM): in either case, data is generated through direct forward sampling following a node ordering that is consistent with the given DAG (known as a topological ordering). However, the fundamental difference is that an SCM is equipped with some additional causal capabilities that go beyond those of a BN. For example, scenarios involving interventions and counterfactuals can be simulated by making simple local modifications to the original model ahead of a standard simulation.
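To make the forward-sampling perspective concrete, the following minimal sketch (plain Python with NumPy, not DagSim code; the functions and parameter values are invented for illustration) simulates a two-node SCM and then re-runs the same procedure under an intervention, implemented as a local modification of one structural assignment:

```python
import numpy as np

rng = np.random.default_rng(0)

# Structural assignments of a two-node SCM: X -> Y.
# Each node is computed from its parents plus an exogenous noise term.
def f_x():
    return rng.normal(0.0, 1.0)

def f_y(x):
    return 2.0 * x + rng.normal(0.0, 0.1)

# Observational data: forward sampling in a topological order (X before Y).
observational = []
for _ in range(1000):
    x = f_x()
    observational.append((x, f_y(x)))

# Intervention do(X = 1.5): replace X's assignment locally, keep f_y unchanged,
# and run the same forward sampling procedure.
interventional = [(1.5, f_y(1.5)) for _ in range(1000)]

print(np.mean([y for _, y in observational]))   # roughly 0, i.e. E[Y]
print(np.mean([y for _, y in interventional]))  # roughly 3, i.e. E[Y | do(X = 1.5)]
```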

While there is in principle no limitation on the types of variables or functional forms in the BN and SCM frameworks, the main emphasis has historically been on relatively small DAGs with variables of basic data types (typically ordinal/categorical scalar values) [6–12]. A visual notation known as plate notation is well-established for denoting vector-valued variables in BNs, but representing a k-dimensional tensor requires a fixed k-level nesting of plates, and there is no well-established notation for representing sets or sequences. This stands in stark contrast to the recent neural network (NN)-driven machine learning revolution, where a central advance has been the ability to learn useful representations from data of large dimensionality and complex structure [13–15]. The canonical example of this is the learning of complex signals from large, two-dimensional structures of pixel values in image analysis, as well as from sequences of words in natural language processing.

The emphasis on simple variable types and functional relations in the graphical modelling field is also apparent from the programming libraries available for structure learning, parameter inference and simulation from graphical models. For example, the seminal bnlearn R package [9] can both infer parameters and simulate data from a model, but it is restricted to numerical variables (discrete or continuous) and to full conditional probability tables and linear regression models as functional forms. DAG-based simulation is also supported in a variety of other packages, either as their main purpose or as a secondary feature (their simulation-related properties are summarised in Table 1). Many of these packages are explicitly restricted to linear relations, as in the structural equation models (SEM) framework. All the mentioned packages share with bnlearn the restriction to numerical variables and particular functional forms.

Table 1. An overview of all established frameworks that to the authors’ knowledge offer DAG-based simulation functionalities, describing for each package the main purpose, the type of data it simulates, the functional forms used, and the additional simulation utilities provided.

The bnlearn package [9] can both infer parameters and simulate data from a model, with numerical variables and functional forms restricted to full conditional probability tables and linear regression models. The pgmpy package [16] is similar to bnlearn in terms of its purpose and simulation functionalities. The package simCausal [11] is more aimed toward causal inference problems and thus focuses on simulating longitudinal data based on SEMs. The main goal of the simMixedDAG package [12] is to simulate “real life” datasets based on a learned generalised additive model or user-defined parametric linear models. The package MXM [7] simulates data from multivariate Gaussian distributions based on a user-defined or randomly generated adjacency matrix, while abn [6] simulates data from Poisson, multinomial, and Gaussian distributions based on a user-defined adjacency matrix. The packages dagitty [10], dagR [17], and lavaan [8] provide similar functionalities for simulating data based on SEMs.

Framework | Main purpose | Data type | Functional form | Distinctive features | Reference
DagSim | Data simulation | Any data type (passed directly between nodes) | Any form (custom function) | Plates, selection bias, missing values, stratification | This paper
bnlearn | Structure and parameter learning | Discrete and continuous | Categorical distribution and linear Gaussian model | Includes several off-the-shelf models | [9]
pgmpy | Model learning; approximate, exact, and causal inference | Discrete and continuous | Categorical distribution and linear Gaussian model | Includes several off-the-shelf models | [16]
simCausal | Simulation of SEM-based complex longitudinal data structures | Discrete and continuous | Linear model | Counterfactual data, interventions, time-varying nodes | [11]
simMixedDAG | Simulation of data from parametric and non-parametric DAG models | Discrete and continuous | Generalized additive model | Learns a non-parametric model from data | [12]
MXM | Feature selection | Discrete and continuous | Linear Gaussian model | Simulates a DAG with arbitrary arc density | [7]
abn | Modelling data with additive Bayesian networks | Discrete and continuous | Generalized linear model | Simulates a DAG with arbitrary arc density | [6]
dagitty | Graphical analysis of structural causal models | Binary and continuous | Linear Gaussian and logistic models | Characterisation, restructuring and random generation of DAGs | [10]
dagR | Construction and evaluation of DAGs, and data simulation | Binary and continuous | Linear Gaussian and logistic models | Includes several off-the-shelf models | [17]
lavaan | Latent variable analysis | Continuous | Linear model | Fits a latent variable model to data | [8]

We here argue for the usefulness of combining the ideas of carefully designed models of variable relations from the graphical modelling field with the complex data types that are characteristic of the current wave of NN-driven deep learning. We present DagSim, a Python library that streamlines the specification of simulation scenarios based on graphical models in which variables and functional relations can take any form. The fundamental idea of the framework is simple yet powerful: the user defines a DAG-based simulation by connecting nodes to their parents through standard Python functions with unrestricted parameter and return values. DagSim provides a range of functionality for increasing the convenience and transparency of such a simulation setup, offering a choice between an external (YAML-based) and an internal (Python-based) succinct and transparent domain-specific language (DSL) to specify simulations involving plates, mixture distributions and various missing-data schemes. It also includes functionality of specific use for simulating causal scenarios, including native support for simulating sample selection bias.

Implementation

To specify a DagSim simulation model, a user simply defines a set of nodes (variables) along with possible input nodes (parents), which together make up the backbone of the model in the form of a directed graph. The user then assigns a general Python function for simulating the value of each node given the values of parent nodes, if any. When the model has been fully specified, the package checks that the provided directed graph is acyclic, thus ensuring that the values of any input node can be sampled prior to all downstream nodes. Following a topological ordering of the nodes, DagSim then uses standard forward sampling to simulate the values of each node by calling its assigned function and providing the values of its parents (if any) as arguments. Importantly, parent values are directly passed on as native Python values, ensuring that the framework supports general data types and any functional forms. The simulated data is saved to a CSV file.
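The sampling scheme described above can be sketched in a few lines of plain Python. This is an illustration of the mechanism, not DagSim's internal code; the node names and functions are invented, and a string-valued node is included to show that parent values of any type are passed on directly:

```python
import csv
import random
from graphlib import TopologicalSorter

# Toy model: parents maps each node to its parent nodes, functions maps each
# node to the Python function that generates its value from the parent values.
parents = {"p_heads": [], "sequence": ["p_heads"], "label": ["sequence"]}
functions = {
    "p_heads": lambda: random.uniform(0.2, 0.8),
    "sequence": lambda p: "".join("H" if random.random() < p else "T" for _ in range(10)),
    "label": lambda seq: "mostly_heads" if seq.count("H") > 5 else "mostly_tails",
}

def simulate(num_samples, csv_path):
    # static_order() raises CycleError if the provided graph is not acyclic.
    order = list(TopologicalSorter(parents).static_order())
    rows = []
    for _ in range(num_samples):
        values = {}
        for node in order:
            # Parent values are passed directly as native Python objects
            # (here a float and a string), with no numeric encoding required.
            values[node] = functions[node](*(values[p] for p in parents[node]))
        rows.append(values)
    with open(csv_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=order)
        writer.writeheader()
        writer.writerows(rows)

simulate(5, "toy_simulation.csv")
```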

In addition to more standard simulation scenarios, DagSim provides additional node types that facilitate the simulation of different kinds of real-world scenarios. The Selection node, for example, allows the user to simulate selection bias [5] through a function that governs the sample selection scheme, where the arguments of that function are specified in the same way as for a standard node. The Missing node provides a convenient way to simulate missing entries in the data by specifying the variable that should have missing values together with another standard node that determines which entries are removed. Finally, the Stratify node offers a way to automatically stratify the resulting data into separate files through a single function that defines the stratum of each sample. Additionally, DagSim supports a transparent specification of simulations based on a succinct YAML format.
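The ideas behind the Selection and Missing nodes can be illustrated with a plain-Python sketch (this mimics the concepts, not DagSim's API; the variable names, functions, and probabilities are invented):

```python
import random

# Hypothetical per-sample simulation: age and a biomarker that depends on age.
def simulate_sample():
    age = random.randint(20, 90)
    biomarker = 0.05 * age + random.gauss(0, 1)
    return {"age": age, "biomarker": biomarker}

# Selection: a function returning True keeps the sample, mimicking a selection
# scheme in which older patients are over-represented in the observed data.
def selected(sample):
    return random.random() < sample["age"] / 100

# Missingness: a function deciding which entries of a target variable to blank out.
def missing(sample):
    return random.random() < 0.2  # 20% of biomarker values are removed

data = []
while len(data) < 1000:
    s = simulate_sample()
    if not selected(s):
        continue  # the sample never enters the observed dataset (selection bias)
    if missing(s):
        s["biomarker"] = None  # the sample is recorded but the entry is missing
    data.append(s)
```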

Step-by-step example

Suppose you want to simulate sequences of coin tosses, each represented as a string of 10–20 characters H (heads) and T (tails), where each sample has its own probability of heads, itself drawn from a uniform distribution. Fig 1 shows the overall workflow one would follow:

Fig 1. A typical workflow of simulating data using DagSim.

  • First, define one node for the probability of getting heads, one node for the number of coin tosses per sample, and one node for the sampled sequence itself (which has incoming edges from the two other nodes), using e.g. YAML to specify the graph.

  • Second, define the simulation instructions in the YAML file.

  • Third, define the custom function for simulating a sequence of tosses, with the other two nodes relying on existing functions, e.g. from numpy (a sketch of such a function is shown below).

  • Finally, simply run the defined simulation, e.g. from the command line.

The code and files corresponding to this example can be found in the supplementary material.
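For the third step, a minimal sketch of what the custom sequence function could look like, together with numpy-based sampling for the two parent nodes (the function and variable names here are illustrative and need not match the supplementary code):

```python
import numpy as np

rng = np.random.default_rng()

# Custom function for the sequence node: given the per-sample probability of
# heads and the number of tosses, return the tosses as a string of 'H'/'T'.
def toss_sequence(p_heads, n_tosses):
    tosses = rng.random(int(n_tosses)) < p_heads
    return "".join("H" if t else "T" for t in tosses)

# The two parent nodes can rely on existing numpy functions:
p_heads = rng.uniform(0.0, 1.0)   # sample-specific probability of heads
n_tosses = rng.integers(10, 21)   # sequence length between 10 and 20
print(toss_sequence(p_heads, n_tosses))
```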

Use

The driving motivation for DagSim is the ability to combine basic (scalar, ordinal/categorical) variables with complex or high-dimensional data types in a transparent structure as defined by a DAG, without any restrictions on the complexity of the functional relations between nodes. We illustrate these capabilities through two stylized simulation examples, where basic metadata variables control 1) shapes in an image (a two-dimensional numeric tensor), and 2) bio-sequence patterns in an adaptive immune receptor repertoire (a set of sequences). Our main emphasis is on the ease of defining simulations and the transparency of the resulting simulation models. Detailed versions of these examples can be found in S1 and S2 Figs, respectively.

The first use case is based on a study by Sani et al. [18] on causal learning as a tool for explaining the behaviour of black-box prediction algorithms. To illustrate their approach, they simulated simple images with specific shapes overlaid on a black background, where an additional set of scalar variables controlled the probability of each shape being introduced into the image. We show how such a simulation is easily reproduced using DagSim, based on a succinct, transparent, and well-modularized model specification (Fig 2A and 2B). The simulation of each node given its parents is defined by a set of Python functions provided in the supplementary material, where the main function is the one that generates an image conditioned on the scalar metadata values. If this use case were to be performed with any of the existing DAG-based simulation frameworks, the scalar values would have to be simulated separately using an appropriate DAG. A separate function that takes the variables V, C, R, and H as input and iterates over all samples would then be needed to create the desired images. This would detach the image construction process from the rest of the simulation, making the code unnecessarily complicated and less transparent.
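To give a flavour of the main function, the sketch below generates an image with shapes overlaid on a black background, conditioned on scalar probabilities. The shape types, image size, and parameter names are invented for illustration and only stand in for the V, C, R, and H logic of the actual functions in the supplementary material:

```python
import numpy as np

rng = np.random.default_rng()

# Illustrative image-generating node: a black 32x32 background on which a
# square and/or a disc are overlaid with probabilities given by scalar parents.
def make_image(p_square, p_disc, size=32):
    img = np.zeros((size, size), dtype=np.float32)
    if rng.random() < p_square:
        r, c = rng.integers(0, size - 8, size=2)
        img[r:r + 8, c:c + 8] = 1.0                       # filled square
    if rng.random() < p_disc:
        cy, cx = rng.integers(8, size - 8, size=2)
        yy, xx = np.ogrid[:size, :size]
        img[(yy - cy) ** 2 + (xx - cx) ** 2 <= 16] = 1.0  # filled disc, radius 4
    return img

image = make_image(p_square=0.7, p_disc=0.4)
print(image.shape, image.sum())
```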

Fig 2. (a-b) The YAML specification and corresponding DAG for the image simulation use case; (c-d) the YAML specification and corresponding DAG for the biosequence simulation use case.

The second use case exemplifies simulation in settings with high-dimensional biomarkers and low-dimensional patient characteristics. The considered biomarker is based on sequence patterns in a gene known as the immune receptor, which reflects what individual adaptive immune cells recognize and react to. The set of DNA sequences for this gene across all adaptive immune cells in the body is collectively known as the adaptive immune receptor repertoire (AIRR). Any disease state with immune system involvement, including infectious disease, auto-immunity and cancer, introduces sequence patterns in the AIRR. Additionally, it has been proposed that immune repertoires become less diverse with age [19] and that experimental protocols introduce particular sequence biases in observed AIRR datasets [20, 21]. Simulation of such biomarker signals allows benchmarking the ability of current methodology to infer disease state from AIRR patterns [22, 23], as well as assessing the robustness of the learning process to variability in patient characteristics and experimental biases [24]. The model specification and resulting DAG are shown in Fig 2C and 2D. The Python functions are provided in the supplementary material, where the main function simulates the AIRR for each patient conditioned on disease state, age, and the experimental protocol used. If this use case were to be performed with any of the existing DAG-based simulation frameworks, one would have to use a numeric representation of the sequences, for example with ad hoc end-of-sequence numeric codes to emulate a set of variable-length sequences. As these frameworks also do not support the specification of custom functions, one would need to supplement a DAG-based simulation of baseline sequences (in numeric representation) with post-hoc functions for implanting the desired signals.
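Again purely for illustration, the sketch below generates a repertoire of variable-length sequences conditioned on disease state, age, and protocol. The motif, implanting rate, and the exact way age and protocol enter are invented; the actual functions are provided in the supplementary material:

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Illustrative repertoire-generating node: returns a list of variable-length
# sequences conditioned on disease state, age, and experimental protocol.
def simulate_repertoire(diseased, age, protocol, n_sequences=500):
    motif = "CASSL"                                  # stand-in disease-associated pattern
    min_len, max_len = (10, 20) if protocol == "A" else (12, 18)
    n_unique = max(50, n_sequences - 3 * age)        # crude age-related diversity loss
    pool = ["".join(random.choices(AMINO_ACIDS, k=random.randint(min_len, max_len)))
            for _ in range(n_unique)]
    repertoire = [random.choice(pool) for _ in range(n_sequences)]
    if diseased:
        # implant the motif at a random position in 10% of the sequences
        for i in random.sample(range(n_sequences), k=n_sequences // 10):
            seq = repertoire[i]
            pos = random.randint(0, len(seq) - len(motif))
            repertoire[i] = seq[:pos] + motif + seq[pos + len(motif):]
    return repertoire

airr = simulate_repertoire(diseased=True, age=60, protocol="A")
print(len(airr), airr[0])
```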

Conclusion

We have here argued that DAG-based simulation should transcend the traditional setting of purely numeric-valued variables, allowing the convenience and transparency of graphical models to be exploited also in settings with more complex data types. Specifically, complex data types bring these simulations closer to the applications typically considered by modern machine learning (often deep learning) models. Hence, we consider the integration of complex data types and graphical modelling for simulation purposes as highly timely, given both the increasing inclusion of complex data types in modelling scenarios and the increasing interest in causal concepts in the ML field. In terms of the latter, there has been recent research into how underlying causal mechanisms affect ML strategies [25], research into how the underlying causal structure determines whether data from different sources can be successfully fused for learning ML models [26], research into how overlaid signals arising from distinct mechanisms can be disentangled [27, 28], calls for extending modern ML methods to directly predict the effects of interventions [29], calls for incorporating non-linear machine learning methods for causal inference in epidemiology [30], and an increasing interest in how causal mechanisms affect the stability (generalizability) of ML models when applied to new settings [24, 31]. The combination of a DAG-based model backbone with flexible data types and functional relations provides transparent and modularized simulation models in these emerging settings, where low-dimensional variables are connected to complex patterns in high-dimensional variables. Through a succinct YAML format for defining the model backbone and the use of individual, native Python functions for defining the functional relation to each node, DagSim provides a straightforward, practical implementation of such an approach. The framework also natively supports functionality that is useful when simulating data, e.g. for emulating selection bias and missing data, and could in the future be extended to natively support features such as Dynamic Bayesian Network-based simulation of time series, as well as nested and intersecting plate structures for complex modelling scenarios [4]. More important than individual features is the overall ability to exploit DAG structures to improve transparency and code modularization. The examples of simulating shapes in images and patterns in biosequences are but two illustrations of DagSim's advantages. While existing frameworks also allow transparent simulations to be defined in settings with standard functional relations between numeric scalars or vectors, only DagSim offers transparency and code modularization in the broad range of settings with complex data types and functional relations.

Supporting information

S1 Fig. DAG for use case I.

(TIF)

S2 Fig. DAG for use case II.

(TIF)

Acknowledgments

We wish to thank Victor Greiff and Anne H Schistad Solberg for their input on the manuscript text, and we wish to thank Milena Pavlović and Knut Rand for feedback after trying out the software.

Data Availability

The data used in this paper are generated using computer simulations. The reader can simulate similar data using the described tool, which is available as a Python package (https://pypi.org/project/dagsim/). The code (available as Python scripts and as Jupyter notebooks) for simulating the data used in the manuscript use cases can be found on GitHub (https://github.com/uio-bmi/dagsim/tree/main/manuscript_usecases). The user can also run these simulations online, using the free Binder service, by clicking on the "launch binder" badge provided on that page.

Funding Statement

The author(s) received no specific funding for this work.

References

  • 1. Morris TP, White IR, Crowther MJ. Using simulation studies to evaluate statistical methods. Statistics in Medicine. 2019;38(11):2074–102. doi: 10.1002/sim.8086
  • 2. Schuler A, Jung K, Tibshirani R, Hastie T, Shah N. Synth-Validation: Selecting the Best Causal Inference Method for a Given Dataset. arXiv:1711.00083 [stat] [Internet]. 2017 Oct 31 [cited 2022 Jan 27]. Available from: http://arxiv.org/abs/1711.00083
  • 3. Sandve GK, Greiff V. Access to ground truth at unconstrained size makes simulated data as indispensable as experimental data for bioinformatics methods development and benchmarking. Bioinformatics. 2022 Sep 8;btac612. doi: 10.1093/bioinformatics/btac612
  • 4. Koller D, Friedman N. Probabilistic graphical models: principles and techniques. Cambridge, MA: MIT Press; 2009. 1231 p. (Adaptive computation and machine learning).
  • 5. Pearl J. Causality. 2nd ed. Cambridge: Cambridge University Press; 2009 [cited 2021 Nov 21]. Available from: https://www.cambridge.org/core/books/causality/B0046844FAE10CBF274D4ACBDAEB5F5B
  • 6. Kratzer G, Lewis FI, Comin A, Pittavino M, Furrer R. Additive Bayesian Network Modelling with the R Package abn. arXiv:1911.09006 [cs, stat] [Internet]. 2019 Nov 20 [cited 2022 Feb 6]. Available from: http://arxiv.org/abs/1911.09006
  • 7. Lagani V, Athineou G, Farcomeni A, Tsagris M, Tsamardinos I. Feature Selection with the R Package MXM: Discovering Statistically Equivalent Feature Subsets. Journal of Statistical Software. 2017 Sep 5;80:1–25.
  • 8. Rosseel Y. lavaan: An R Package for Structural Equation Modeling. Journal of Statistical Software. 2012 May 24;48:1–36.
  • 9. Scutari M. Learning Bayesian Networks with the bnlearn R Package. Journal of Statistical Software. 2010 Jul 16;35:1–22.
  • 10. Textor J, van der Zander B, Gilthorpe MS, Liśkiewicz M, Ellison GTH. Robust causal inference using directed acyclic graphs: the R package 'dagitty'. Int J Epidemiol. 2017 Jan 15;dyw341.
  • 11. Sofrygin O, van der Laan MJ, Neugebauer R. simcausal R Package: Conducting Transparent and Reproducible Simulation Studies of Causal Effect Estimation with Complex Longitudinal Data. Journal of Statistical Software. 2017 Oct 16;81:1–47. doi: 10.18637/jss.v081.i02
  • 12. Lin I. simMixedDAG [Internet]. GitHub. [cited 2022 Feb 6]. Available from: https://github.com/IyarLin/simMixedDAG
  • 13. Prakash E, Shrikumar A, Kundaje A. Towards More Realistic Simulated Datasets for Benchmarking Deep Learning Models in Regulatory Genomics [Internet]. bioRxiv; 2021 [cited 2022 Jan 27]. p. 2021.12.26.474224. Available from: https://www.biorxiv.org/content/10.1101/2021.12.26.474224v1
  • 14. Bengio Y. Deep Learning of Representations for Unsupervised and Transfer Learning. In: Proceedings of ICML Workshop on Unsupervised and Transfer Learning [Internet]. JMLR Workshop and Conference Proceedings; 2012 [cited 2022 Jan 29]. p. 17–36. Available from: https://proceedings.mlr.press/v27/bengio12a.html
  • 15. Bengio Y, Courville A, Vincent P. Representation Learning: A Review and New Perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2013 Aug;35(8):1798–828. doi: 10.1109/TPAMI.2013.50
  • 16. Ankan A, Panda A. pgmpy: Probabilistic Graphical Models using Python. In: Proceedings of the Python in Science Conference (SciPy 2015). Austin, Texas; 2015 [cited 2022 Apr 24]. p. 6–11. Available from: https://conference.scipy.org/proceedings/scipy2015/ankur_ankan.html
  • 17. Breitling LP. dagR: A Suite of R Functions for Directed Acyclic Graphs. Epidemiology. 2010 Jul;21(4):586–7. doi: 10.1097/EDE.0b013e3181e09112
  • 18. Sani N, Malinsky D, Shpitser I. Explaining the Behavior of Black-Box Prediction Algorithms with Causal Learning. arXiv:2006.02482 [cs, stat] [Internet]. 2020 Jun 3 [cited 2021 Dec 28]. Available from: http://arxiv.org/abs/2006.02482
  • 19. Britanova OV, Putintseva EV, Shugay M, Merzlyak EM, Turchaninova MA, Staroverov DB, et al. Age-related decrease in TCR repertoire diversity measured with deep and normalized sequence profiling. J Immunol. 2014 Mar 15;192(6):2689–98. doi: 10.4049/jimmunol.1302064
  • 20. Trück J, Eugster A, Barennes P, Tipton CM, Luning Prak ET, Bagnara D, et al. Biological controls for standardization and interpretation of adaptive immune receptor repertoire profiling. Cowell L, Taniguchi T, editors. eLife. 2021 May 26;10:e66274.
  • 21. Barennes P, Quiniou V, Shugay M, Egorov ES, Davydov AN, Chudakov DM, et al. Benchmarking of T cell receptor repertoire profiling methods reveals large systematic biases. Nat Biotechnol. 2021 Feb;39(2):236–45. doi: 10.1038/s41587-020-0656-3
  • 22. Kanduri C, Pavlović M, Scheffer L, Motwani K, Chernigovskaya M, Greiff V, et al. Profiling the baseline performance and limits of machine learning models for adaptive immune receptor repertoire classification [Internet]. bioRxiv; 2021 [cited 2022 Apr 17]. p. 2021.05.23.445346. Available from: https://www.biorxiv.org/content/10.1101/2021.05.23.445346v2
  • 23. Pavlović M, Scheffer L, Motwani K, Kanduri C, Kompova R, Vazov N, et al. The immuneML ecosystem for machine learning analysis of adaptive immune receptor repertoires. Nat Mach Intell. 2021 Nov;3(11):936–44.
  • 24. Pavlović M, Al Hajj GS, Pensar J, Wood M, Sollid LM, Greiff V, et al. Improving generalization of machine learning-identified biomarkers with causal modeling: an investigation into immune receptor diagnostics. arXiv:2204.09291 [cs, q-bio] [Internet]. 2022 Apr 20 [cited 2022 Apr 24]. Available from: http://arxiv.org/abs/2204.09291
  • 25. Schölkopf B, Janzing D, Peters J, Sgouritsa E, Zhang K, Mooij J. On causal and anticausal learning. In: Proceedings of the 29th International Conference on Machine Learning. Madison, WI, USA: Omnipress; 2012. p. 459–66. (ICML'12).
  • 26. Bareinboim E, Pearl J. Causal inference and the data-fusion problem. PNAS. 2016 Jul 5;113(27):7345–52. doi: 10.1073/pnas.1510507113
  • 27. Träuble F, Creager E, Kilbertus N, Locatello F, Dittadi A, Goyal A, et al. On Disentangled Representations Learned from Correlated Data. In: Proceedings of the 38th International Conference on Machine Learning [Internet]. PMLR; 2021 [cited 2022 Apr 6]. p. 10401–12. Available from: https://proceedings.mlr.press/v139/trauble21a.html
  • 28. Wang Y, Jordan MI. Desiderata for Representation Learning: A Causal Perspective. arXiv:2109.03795 [cs, stat] [Internet]. 2022 Feb 10 [cited 2022 Apr 6]. Available from: http://arxiv.org/abs/2109.03795
  • 29. Prosperi M, Guo Y, Sperrin M, Koopman JS, Min JS, He X, et al. Causal inference and counterfactual prediction in machine learning for actionable healthcare. Nat Mach Intell. 2020 Jul;2(7):369–75.
  • 30. Balzer LB, Petersen ML. Invited Commentary: Machine Learning in Causal Inference—How Do I Love Thee? Let Me Count the Ways. American Journal of Epidemiology. 2021 Aug 1;190(8):1483–7. doi: 10.1093/aje/kwab048
  • 31. Subbaswamy A, Schulam P, Saria S. Preventing Failures Due to Dataset Shift: Learning Predictive Models That Transport. In: Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics [Internet]. PMLR; 2019 [cited 2022 Jan 31]. p. 3118–27. Available from: https://proceedings.mlr.press/v89/subbaswamy19a.html

Decision Letter 0

Emmanuel S Adabor

26 Jan 2023

PONE-D-22-32989
DagSim: Combining DAG-based model structure with unconstrained data types and relations for flexible, transparent, and modularized data simulation
PLOS ONE

Dear Dr. Al Hajj,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Mar 12 2023 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Emmanuel S Adabor

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at 

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. ‘Please include your tables as part of your main manuscript and remove the individual files. Please note that supplementary tables (should remain/ be uploaded) as separate "supporting information" files.’

3. Please update your submission to use the PLOS LaTeX template. The template and more information on our requirements for LaTeX submissions can be found at http://journals.plos.org/plosone/s/latex.

4. In your Data Availability statement, you have not specified where the minimal data set underlying the results described in your manuscript can be found. PLOS defines a study's minimal data set as the underlying data used to reach the conclusions drawn in the manuscript and any additional data required to replicate the reported study findings in their entirety. All PLOS journals require that the minimal data set be made fully available. For more information about our data policy, please see http://journals.plos.org/plosone/s/data-availability.

Upon re-submitting your revised manuscript, please upload your study’s minimal underlying data set as either Supporting Information files or to a stable, public repository and include the relevant URLs, DOIs, or accession numbers within your revised cover letter. For a list of acceptable repositories, please see http://journals.plos.org/plosone/s/data-availability#loc-recommended-repositories. Any potentially identifying patient information must be fully anonymized.

Important: If there are ethical or legal restrictions to sharing your data publicly, please explain these restrictions in detail. Please see our guidelines for more information on what we consider unacceptable restrictions to publicly sharing data: http://journals.plos.org/plosone/s/data-availability#loc-unacceptable-data-access-restrictions. Note that it is not acceptable for the authors to be the sole named individuals responsible for ensuring data access.

We will update your Data Availability statement to reflect the information you provide in your cover letter.

5. Please include captions for your Supporting Information files at the end of your manuscript, and update any in-text citations to match accordingly. Please see our Supporting Information guidelines for more information: http://journals.plos.org/plosone/s/supporting-information. 

6. Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

Additional Editor Comments:

(1) There should be description section that presents step-by-step development and also the use of a flow chart is encouraged;

(2) Supplementary Material 2 is not available

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: N/A

Reviewer #2: N/A

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: In the manuscript “DagSim: Combining DAG-based model structure with unconstrained data types and relations for flexible, transparent, and modularized data simulation”, Ghadi et al. present a python based framework for DAG-based data simulations: DagSim which integrates complex data types and graphical modelling (that is, neither restricted by variable types nor functional forms). More so, the simulation model is defined in the YAML format promoting transparency. The authors demonstrated the framework on two use cases which are available on Github (examples, where basic metadata variables are controlling the shapes in an image and patterns in bio-sequences).

Major comments:

Table 1: An overview of the current frameworks that offer simulation functionalities. The authors provided detailed information showing how these tools/frameworks differ.

1. However, they are not clear on the selection criteria they used for the presented current frameworks (are these all the existing tools/packages?)

2. When simulating simple DAGs (discrete and continuous data), how does dagSim compare to the other simulation frameworks (for example, simMixedDAG and dagR)?

Minor comment:

Table 1: An overview of the current frameworks that offer simulation functionalities.

• dagR does not have any simulation utilities, however, Duan et al. 2021 (Reflection on modern methods: understanding bias and data analytical strategies through DAG-based data simulations, https://academic.oup.com/ije/article/50/6/2091/6272915) demonstrated that dagR can be useful in addressing selection and information bias analysis. Please, clarify if dagitty and dagR doesn’t have simulation utilities

Reviewer #2: The authors introduce a novel Direct Acyclic Graph (DAG) - based data simulation Python library called DagSim. They survey existing data simulation tools, discussing them and highlighting a common limitation i.e. that their inputs are limited to numeric variables and specified functional forms. DagSim has the unique advantage of having the capacity to use variables and functional relations of any form. Furthermore, it is modular and easily updatable. An implementation of DagSim has been made available on Github.

The survey is instructive and potentially a worthy contribution. I would suggest the following modifications:

-Table 1, as provided, is a very useful summary of existing tools, their capabilities, and limitations. I would recommend an additional column in which associated citations for the tools are provided.

-In addition to Figure 1, it would be useful to present a demonstration of an example of the limitation of the other tools’ input variable limitations, and DagSim’s advantage in processing more complex data types.

Minor edits:

-pages need to be numbered

-there is a citation in the Abstract. That needs to be removed.

-In the last section under “Use”, “Additionally, it is known that e.g. the age of a patient leaves a mark on the AIRR (19)…” needs to be corrected

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2023 Apr 14;18(4):e0284443. doi: 10.1371/journal.pone.0284443.r002

Author response to Decision Letter 0


16 Feb 2023

We would like to thank the editor and the reviewers for their helpful comments. Please find the response in the file named "Response to Reviewers".

Attachment

Submitted filename: Response to Reviewers.pdf

Decision Letter 1

Emmanuel S Adabor

30 Mar 2023

DagSim: Combining DAG-based model structure with unconstrained data types and relations for flexible, transparent, and modularized data simulation

PONE-D-22-32989R1

Dear Dr. Al Hajj,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Emmanuel S Adabor

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #2: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: N/A

Reviewer #2: N/A

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: (No Response)

Reviewer #2: (No Response)

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: Yes: George K. Acquaah-Mensah

**********

Acceptance letter

Emmanuel S Adabor

5 Apr 2023

PONE-D-22-32989R1

DagSim: Combining DAG-based model structure with unconstrained data types and relations for flexible, transparent, and modularized data simulation

Dear Dr. Al Hajj:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Emmanuel S Adabor

Academic Editor

PLOS ONE
