Abstract
Understanding the demographic history of populations is a key goal in population genetics, and with improving methods and data, ever more complex models are being proposed and tested. Demographic models of current interest typically consist of a set of discrete populations, their sizes and growth rates, and continuous and pulse migrations between those populations over a number of epochs, which can require dozens of parameters to fully describe. There is currently no standard format to define such models, significantly hampering progress in the field. In particular, the important task of translating the model descriptions in published work into input suitable for population genetic simulators is labor intensive and error prone. We propose the Demes data model and file format, built on widely used technologies, to alleviate these issues. Demes provide a well-defined and unambiguous model of populations and their properties that is straightforward to implement in software, and a text file format that is designed for simplicity and clarity. We provide thoroughly tested implementations of Demes parsers in multiple languages including Python and C, and showcase initial support in several simulators and inference methods. An introduction to the file format and a detailed specification are available at https://popsim-consortium.github.io/demes-spec-docs/.
Keywords: demographic models, inference, simulation
Demographic models of increasing complexity are being inferred and simulated, yet there is no agreed-upon data format for describing these models. Gower, Ragsdale et al. present the Demes specification, a data model and file format definition built on standard technologies which provides a description of complex demographics that is both human-readable and machine-parsable. They provide thoroughly tested software components using this specification and demonstrate Demes support in several popular simulation and inference methods.
Introduction
The ever-increasing amount of genetic sequencing data from genetically and geographically diverse species and populations has allowed us to infer complex demography and study life history at fine scales. An integral component to such population genetics studies is simulation. Software to either simulate whole-genome sequences (Thornton 2014, 2019; Staab et al. 2015; Kelleher et al. 2016; Haller and Messer 2019; Baumdicker et al. 2022) or informative summary statistics of diversity (Gutenkunst et al. 2009; Jouganous et al. 2017; Kamm et al. 2017) have enabled the increasing complexity of genomic studies, with several software packages capable of handling large sample sizes, many interacting populations, and deviations from panmictic random-mating assumptions. This ability to infer and simulate such complex demographic scenarios, however, has highlighted a major shortcoming in community standards: the fragmented landscape of different ways to describe demographic models makes it difficult to compare inferences made by different methods and to reliably simulate from previously inferred models. Inference results are typically reported in publications via a combination of visual depiction, a list of key parameters in tabular form and a discussion within the text. Unfortunately, these descriptions are often ambiguous, and implementing the precise model inferred for later simulation is at best tedious and error prone (Adrion et al. 2020; Ragsdale et al. 2020), and occasionally impossible because of missing information.
Simulation is a core tool in population genetics, and many methods have been developed over the past 3 decades (Carvajal-Rodríguez 2008; Liu et al. 2008; Arenas 2012; Hoban et al. 2012; Yuan et al. 2012). Simulations are based on highly idealized population models, and one of the key uses of inferred demographic histories is to make simulations more realistic. Simulation methods take three broad approaches to specifying the demographic model to simulate, using either a command line interface (e.g. Hudson 2002; Hernandez 2008; Kern and Schrider 2016), a custom input file format (e.g. Guillaume and Rougemont 2006; Excoffier and Foll 2011; Shlyakhter et al. 2014), or an Application Programming Interface (API) to allow models to be defined programmatically (e.g. Thornton 2014; Hernandez and Uricchio 2015; Kelleher et al. 2016; Becheler et al. 2019; Haller and Messer 2019; Thornton 2019; Baumdicker et al. 2022). Command line interfaces are a concise way of expressing demographic models, and the syntax defined by ms (Hudson 2002) is used by several simulators (e.g. Chen et al. 2009; Ewing and Hermisson 2010; Staab et al. 2015). However, this conciseness means that models of even intermediate complexity are difficult for humans to understand, making errors likely. APIs are more verbose, but require a substantial time investment to learn, and as they are tied to a specific tool this knowledge is not portable to other simulators. Like APIs, input parameter file formats for simulators allow the model specification to be less terse and allow for documentation in the form of comments. Several graphical user interfaces and visualization methods have been developed, which greatly facilitate interpretation (Mailund et al. 2005; Antao et al. 2007; Parreira et al. 2009; Ewing and Hermisson 2010; Parobek et al. 2017; Zhou et al. 2018). However, these methods currently have little traction as they are all either directly coupled to an internal simulation method or to the syntax of a specific simulator. There is currently no way in which demographic models inferred by different packages can be simulated or visualized by downstream software.
Here, we present “Demes,” a data model and file format specification for complex demographic models developed by the PopSim Consortium (Adrion et al. 2020). The Demes data model precisely defines the sizes and relationships of populations, and it provides a way to explicitly encode the information relevant to demography while avoiding repetition. This data model is implemented in the widely used YAML format (Ben-Kiki et al. 2009), which is a data serialization language that provides a good balance between human and machine readability. The specification precisely defines the required behavior of implementations, ensuring that there is no ambiguity of interpretation, and includes both a reference implementation and an extensive suite of test examples and their expected output. The initial software ecosystem includes high-quality Python and C parser implementations, as well as utilities for verification and visualization of Demes models, and has been implemented in several popular inference and simulation methods (Table 1). We hope that this data model and file format will be widely adopted by the community, such that users can expect to simulate directly from inferred models with little to no programming effort.
Table 1.
Software infrastructure | |
---|---|
demes-python | A Python library for loading, saving, and working with Demes models. Includes support for converting to and from ms (Hudson 2002; https://github.com/popsim-consortium/demes-python). |
demes-c | A C library for parsing Demes YAML descriptions (https://github.com/grahamgower/demes-c). |
demes-rust | A Demes parser in Rust (https://github.com/molpopgen/demes-rs). |
demes-julia | A parser in Julia (https://github.com/apragsdale/Demes.jl). |
demesdraw | A Python library for visualizing Demes models (https://github.com/grahamgower/demesdraw). |
Methods using Demes as input/output format | |
dadi | Optimizes parameters in models of demographic history and distributions of fitness effects using SFS (Gutenkunst et al. 2009). Can simulate SFS from Demes models. |
demes-slim | Loads Demes models into the SLiM forward simulator (Haller and Messer 2019). |
fwdpy11 | Simulates the Wright–Fisher model forward in time (Thornton 2014, 2019). Demes are the preferred format for specifying a demographic model. |
GADMA | Infers models of demographic history (Noskova et al. 2020). Outputs Demes models and visualizations. |
gIMble | Fits IM-type demographic models and infers genomic barriers to geneflow (Laetsch et al. 2022). Outputs inferred models in Demes format. |
moments | Optimizes parameters in models of demographic history using SFS and linkage disequilibrium statistics (Jouganous et al. 2017; Ragsdale and Gravel 2019). Models to be optimized can be specified in Demes. |
MSMC | A script provided in the MSMC-tools repository (https://github.com/stschiff/msmc-tools) converts MSMC (Schiffels and Durbin 2014; Schiffels and Wang 2020) output to the demes format. |
msprime | Simulates population genetic models using tree sequences (Kelleher et al. 2016; Kelleher and Lohse 2020; Baumdicker et al. 2022). Demographic history models can be specified using Demes. |
We have included software infrastructure developed for working with Demes models (such as parsing, validation, and visualization) as well as downstream software that implement the specification, at the time of writing.
Demes
The design of Demes is a balance between two partially competing requirements: that (a) models should be easy for humans to understand and manipulate; and (b) software processing Demes models should be provided with an unambiguous representation that is straightforward to process. For efficiency of understanding and avoidance of model specification error, we require a data representation without redundancy (i.e. repetition of values). However, for the simplicity of software working with the Demes model (and the avoidance of programming error, or divergence in interpretations of the specification), it is preferable to have an explicit representation, in which all relevant values are readily available. Thus, Demes are composed of three entities: the Human Data Model (HDM) designed for human readability; the Machine Data Model (MDM) designed for programmatic input and processing; and the parser, which is responsible for transforming the former into the latter.
Here, we provide a brief overview of the population genetics models that Demes supports and the components of the Demes infrastructure. Complete technical details of the MDM and HDM, and the responsibilities of the parser are provided in the online Demes specification (https://popsim-consortium.github.io/demes-spec-docs/; accessed 2022 September 12). This specification rigorously defines the data model, fully describing the entities and their relationships, and the required behavior of implementations. Since the online specification is definitive, we will not recapitulate the details here, but instead focus on the high-level properties of the model and the rationale behind key design decisions.
Population genetics model
For inference and simulation software to meaningfully interoperate, there must be a shared understanding of what a demographic model is. Population genetics is a large field, and rather than attempting to capture all possible within- and between-population processes, we have instead adopted a pragmatic approach of identifying a common set of assumptions shared by many methods. We outline the processes and assumptions briefly here and in the Appendix.
Demographic models consist of one or more populations (or “demes”) defined by their size histories and the time intervals of their existence. Individuals can move between populations based on their ancestor-descendant relationships or by continuous or discrete migration events. Within a population, we assume Wright–Fisher dynamics (see the Appendix for more precise details). As described in the Scope of the Specification section below, the demographic model does not, as a deliberate simplification and separation of duties, include any information about genome biology or selection.
These basic assumptions of discrete Wright–Fisher populations connected by instantaneous or continuous migrations are shared by many inference methods (e.g. Gutenkunst et al. 2009; Li and Durbin 2011; Gravel 2012; Schiffels and Durbin 2014; Jouganous et al. 2017; Kamm et al. 2017; Ragsdale and Gravel 2019; Excoffier et al. 2021), and forwards- and backwards-time simulators (e.g. Hudson 2002; Gutenkunst et al. 2009; Excoffier and Foll 2011; Kelleher et al. 2016; Jouganous et al. 2017; Haller and Messer 2019; Thornton 2019). Demes therefore serves as “middleware” between inference methods and simulation software, capturing these common assumptions.
It is important to note that the goal of describing the basic population processes precisely is not to be proscriptive about what methods may or may not use the specification, but so that we can be clear on what situations we can expect methods to agree exactly. Arbitrary population processes—for example, within-deme continuous spatial structure (Wright 1943; Barton et al. 2002, 2010; Ringbauer et al. 2017; Battey et al. 2020)—may be layered on top of this basic description, but as dynamics diverge from the core assumptions, then of course we can expect results to differ accordingly.
Human data model
The Demes HDM is focused on efficient human understanding and avoiding errors. We have adopted the widely used YAML format (Ben-Kiki et al. 2009) as the primary interface for writing and interchanging demographic models (see the Appendix for rationale). Demographic models provide information about global features of the model (such as time units and generation times), populations (as “demes”) and their existence intervals (as “epochs”), and gene flow between populations (as continuous “migrations” or instantaneous “pulse” events). Figure 1 shows an example isolation-with-migration model in HDM format.
Structurally, the HDM encourages human understanding by avoiding redundancy in the description where possible and by providing a mechanism for specifying default values that are inherited hierarchically. For values that repeat across fields, the “defaults” mechanism may be used to implicitly assign default values to fields of the given type. A default is superseded by an explicitly provided value if given. Size values are inherited naturally following the progression of time. For example, if an epoch start_size is not provided (either directly, or via a defaults section), it is assumed to be equal to the end_size of the previous epoch. This also means that the first epoch of each population must specify the initial size (or it must be provided in a defaults section).
Avoiding redundancy in this way reduces the cognitive load on readers, by highlighting necessary parameters which may be otherwise be obscured. It is not necessary—or indeed recommended—that all models are expressed in a maximally concise form, and we wholeheartedly endorse the explicit statement of parameters where it increases model legibility.
Parsers
While the HDM is designed for human readability and conciseness, the underlying data model suitable for software implementation (the MDM) is redundant and exhaustive. Translation from the HDM to the MDM requires resolving hierarchically defined default values and verifying relationships between populations and the validity of specified parameter values. Because this translation and validation require significant programming effort, we define a standard software entity as part of the specification to perform this task (the parser), which is intended to be shared by programs that support Demes as input. The Demes specification precisely defines the required behavior of parsers, and we provide a reference implementation written in Python to resolve any potential ambiguities, as well as an extensive test suite of examples and the expected outputs. In addition, we have high-quality parser implementations in the Python, C, Rust, and Julia languages (Table 1) providing a solid foundation for the software ecosystem. By maintaining high-quality Demes parsers available as libraries, we ensure consistency across simulation and inference software. Having common parsers also benefits users by providing consistent and informative error messages for missing values or issues in formatting.
Scope of the specification
A primary design goal of Demes is to provide a means of unambiguously communicating the results of demographic model inferences to population genetic simulators. Since demography is defined in terms of groups of individuals and these groupings are influenced by genetics, it is difficult to find a simple definition that separates the two. Thus, we have attempted to be pragmatic, limiting the features that we include in Demes to those that are in practise regarded as part of a demographic model.
The model is therefore limited to features that we can expect many different demographic inference and simulation methods to share. The specification only describes demographic features at the population level. Features of genome biology are out of scope, including mutation and recombination rates, genome annotations, ploidy, and so on. Selection and dominance models are absent, as discussed in the Appendix. It is important to note, however, that Demes may be used in applications that include additional population genetic processes outside of what is explicitly modeled in the specification, such as interpreting population sizes as carrying capacities, implementations of hard selection, or layering more complicated mating or spatial structure. The Demes specification is intended to provide a basic model that can be elaborated on where necessary.
Demes are not a standard population genetic simulation specification, although it could be part of one. Since the standard is based on JSON, and JSON documents can be arbitrarily nested, we can imagine a simple specification of genome features such as mutation and recombination rates in which the demography is defined by an embedded Demes specification. Features of the simulation specification (such as defining the time and location of samples) can then refer to the Demes model. This design, in which we embed the demographic model within a larger specification rather than adding arbitrary and unrelated complexities to the demography is an essential simplification and separation of duties.
The Demes specification is static by design—we wish to unambiguously describe a demographic model with a concrete set of parameters. This simplicity means that we cannot directly specify parameter distributions or estimated confidence intervals for those parameters. While it is not difficult to imagine extending the specification in ways that would allow this, it is not clear that the benefits are worth the greatly increased parser complexity (see the Appendix).
Example: an isolation-with-migration model
In Fig. 1, we provide an example isolation-with-migration model. Models typically start with a concise description, followed by the mandatory time_units field. This model uses the defaults section to provide a default start_size of 1,000 individuals for each epoch of each deme. There are 3 demes in the model, an ancestral deme named “A” which exists arbitrarily far back into the past then ceases to exist at 100 generations ago, and demes “X” and “Y” that derive their ancestry from A when it goes extinct. Demes A and X have only one epoch, in which the population sizes are constant, whereas deme Y has 2 epochs. Deme Y’s second epoch has a different end_size than its start_size, which indicates the size grows exponentially from 1,000 individuals at 50 generations ago to 3,000 individuals at time 0 (the present). The migration section lists one migration stanza, between demes X and Y. This migration stanza does not indicate a source or destination deme, so the migration is symmetric. No migration times are specified, so migrations occur continuously at the given rate during the time interval over which both demes exist (from 100 generations ago until the present). We do not attempt a detailed explanation of all Demes features here, and readers are instead directed to the tutorial and detailed specification in the online documentation (https://popsim-consortium.github.io/demes-spec-docs/; accessed 2022 September 12).
Application: simulation using Demes
Here, we highlight the interaction between Demes and other software, including simulation and model illustration tools. Demes allow us to specify a demographic model which can be used as the input for a growing number of simulation packages (Table 1). We implemented the human two-population demographic model from Tennessen et al. (2012) inferred from European and African-American sequencing data. This model (shown in Demes format in Fig. A2) is parameterized by an ancestral population with an ancient growth, divergence into “AFR” and “EUR” that each has multiple-epoch size histories, and multiple epochs of continuous migration between the two branches (illustrated using demesdraw in Fig. 2a). The large final sizes ( individuals each) are one to three orders of magnitude larger than ancestral population sizes, reflecting the recent explosive population size increase in humans.
We used this model to simulate 20 haploid genome copies from EUR and AFR at time zero (i.e. present day) to obtain the joint site-frequency spectrum (SFS), a summary of observed allele frequencies widely used in evolutionary inference (Bustamante et al. 2001; Gutenkunst et al. 2009; Tennessen et al. 2012; Jouganous et al. 2017; Kamm et al. 2017; Kim et al. 2017). The Demes model (Figs. 2a and A2) was provided as the input demography to msprime (Baumdicker et al. 2022) to simulate a large recombining region under the mutation rate assumed in Tennessen et al. (2012), and we computed the observed SFS using tskit (Ralph et al. 2020). Using the same Demes model as input to moments (Jouganous et al. 2017), we computed the expectation of the joint SFS and compared to the msprime simulated data (Fig. 2, b and c). Figure 2d shows the code required to run the simulations in msprime and moments, and demonstrates that precisely the same input model, without modification, was provided to both packages. Such interoperability is a major gain for researchers, which we hope will become the expected norm as more packages adopt the Demes format.
Discussion
Stable and healthy software ecosystems require standard interchange formats, allowing for the development of high-quality and long-lasting tools that produce and consume the standard. Demographic models are a key part of population genetics research, and to date, the transfer of inferred models to downstream simulations has been ad hoc, and conversions between the many different ways of expressing such models are both labor intensive and error-prone. The proposed Demes standard is an attempt to bridge this gap between inference and simulation, and also to provide the foundations for a sustainable ecosystem of tools built around this data model. Table 1 shows some initial infrastructure that we have built as part of developing Demes, but many other useful tools can be envisaged that produce, consume, or transform this format.
Reproducibility is a significant problem throughout the sciences (Baker 2016), and various measures have been proposed to increase the likelihood of researchers being able to replicate results in the literature (Munafò et al. 2017). The most basic requirement for reproducibility is that we must be able to state precisely what the result in question is. The lack of standardization in how complex demographic models are communicated today, and the lack of precision in the published model descriptions means that it is difficult to replicate analyses, or reproduce those models for later simulation. Thus, we hope that the Demes standard introduced here will be widely adopted by simulation and inference methods and be used for reporting results in publications, either as supplemental material or uploaded to a data repository.
Acknowledgments
The authors would like to thank the editor and reviewers for helpful comments that have significantly improved this manuscript.
Funding
GG was supported by a Villum Fonden Young Investigator award to Fernando Racimo (project no. 00025300). RG and TS were supported by the National Institute of General Medical Sciences of the National Institutes of Health (R01GM127348 to RNG). MH was supported by a Natural Environment Research Council Independent Research Fellowship (NE/R015686/1). JK was supported by the Robertson Foundation. SS was supported by funding from the European Research Council under the European Union’s Horizon 2020 research and innovation program (grant agreement No 851511). GB was supported by funding from the European Research Council (ModelGenomLand, 757648).
Conflicts of interest
None declared.
Appendix
The Demes specification is a formal data model for describing the properties of populations over time, along with some metadata and provenance information. The data model is based on the ubiquitous JSON (Bray 2017) standard, and formally defined using JSON Schema (Wright et al. 2020). Along with the schema, full technical details of the model are provided in the online specification document (https://popsim-consortium.github.io/demes-spec-docs/).
Population genetics model details
In Demes, demographic models consist of one or more interacting populations, or “demes,” understood to be a collection of individuals that can be conveniently modeled using a defined set of rules and parameters (Gilmour and Gregor 1939; Gilmour and Heslop-Harrison 1954). To avoid confusion with the name of the specification itself, we will use the term “population” in this discussion, with the understanding that the terms are interchangeable. A population is defined as some collection of individuals that exists for some period of time, and has a well-defined size (i.e. number of individuals) during that time period. Individuals can move between populations either according to their ancestor–descendant relationships or through processes involving migrations. Few other properties of the populations are specified in the model: we are concerned primarily with defining the populations, their sizes, and the movement of individuals between those populations.
Time units
Population and event times are written as units in the past, so that time zero corresponds to the final generation or “now,” and event times in the past are values greater than zero with larger values corresponding to times in the more distant past. By having time values increase into the past, we avoid the need to choose an arbitrary point in history as “time zero.” A natural specification for time units is in generations, although other time units are permitted, such as years, accompanied by the generation time so that downstream software may convert times into generations as required.
There must be at least one population with an infinite start_time. An infinite start time may be interpreted differently depending on the simulator. In a coalescent setting, there is no upper bound for the coalescent time of lineages in this population. In a forwards-time setting, the interval of time between infinity and the oldest noninfinite model time (i.e. the “first event”) is approximated by the simulator’s burn-in phase—detailed guidance is provided in the online specification.
Sizes and epochs
Population sizes are given as numbers of individuals, and details such as ploidy levels are considered external to the model. We therefore focus on the number of individuals as opposed to the number of genome copies. Sizes and mating system details are specified for each population within population-specific epochs. Epochs are contiguous time intervals that define the existence interval of the population. Each epoch specifies the population size over that interval, which can be a constant value or a function defined by start and end sizes that must remain positive. Only exponential population size changes are currently supported, but other functions may be added to the specification over time.
Population dynamics
Within a population, we assume that allele frequency dynamics can be described by the Wright–Fisher model. Briefly, generations are nonoverlapping (all parents reproduce and die simultaneously), and for allele i currently at frequency pi, its frequency in the next generation (at birth) is expected to be , where wi and are the marginal and mean fitnesses, respectively, properly weighted according to ancestry proportions. In this framework, a forward-time simulation of finite populations is equivalent to multinomial sampling of allele frequencies each generation (Bürger 2000, pp. 29–31; Crow and Kimura 1970, pp. 179–181), and a backwards-time (coalescent) simulation follows the approximations described in Tajima (1983), Hudson (1983), and Wakeley (2008, Chapter 3). Furthermore, this model assumes “soft” selection (Christiansen 1975), meaning that the dynamics of population sizes changes are independent of the details of individual fitnesses. As such, this model excludes scenarios such as “hard selection,” in which population sizes are dependent on a population’s mean fitness, or stochastic fluctuations in population size, such as interpreting population sizes as carrying capacities. Many forwards and backwards time simulators currently implement this model (e.g. Hudson 2002; Gutenkunst et al. 2009; Excoffier and Foll 2011; Kelleher et al. 2016; Jouganous et al. 2017; Haller and Messer 2019; Thornton 2019).
Selfing and cloning
Each population has an assigned selfing rate and cloning rate, where each defines the probability that offspring are generated from one generation to the next by either self-fertilization or cloning of an individual. More specifically, for a given epoch within a population denote the clonal rate by σ and the selfing rate by S. S and σ can take any value between zero and one and can sum to more than one. Each generation a proportion of offspring σ are expected to be generated through clonal reproduction, while are expected to arise through sexual reproduction. Within the sexually reproduced offspring, a proportion S is born via self-fertilization while the rest have parents drawn at random from the previous generation. Depending on the simulator, this random drawing of parent may occur either with or without replacement. When drawing occurs with replacement, a small amount of “residual” selfing is expected, so that the realized selfing probability is instead of (so that even with σ = 0 and S = 0, selfing may still occur with probability ), although this effect is negligible in large populations (Nordborg and Donnelly 1997).
By allowing the definition of selfing and cloning probabilities, we allow many standard models to be defined. However, by parameterizing selfing and cloning as we have, we assume that these properties of populations can be specified independently from the genetics. In other words, mutations that cause selfing probabilities to fluctuate within an epoch are not considered. More details of the mathematical properties of selfing and cloning rates in a coalescent context can be found in Nordborg and Donnelly (1997) and Hartfield et al. (2016).
Relationships between populations
A population may have one or more ancestors, which are other populations that exist at the population’s start time. If one ancestor is specified, the first generation is constructed by randomly sampling parents from the ancestral population to contribute to offspring in the newly generated population. If more than one ancestor is specified, the proportions of ancestry from each contributing population must be provided, and those proportions must sum to one. In this case, parents are chosen randomly from each ancestral population with probability given by those proportions.
Individuals in a population may have parents from a different population through migrations. These can be defined as continuous migration rates over time intervals for which populations coexist or through instantaneous (or pulse) migration events at a given time. Continuous migration rates are defined as the probability that parents in the “destination” population are chosen from the “source” population. On the other hand, pulse migration events specify the instantaneous replacement of a given fraction of individuals in a destination population by individuals with parents from a source population.
Rationale for YAML
We have adopted the widely used YAML format (Ben-Kiki et al. 2009) as the recommended means of interchanging Demes models (e.g. Figs. 1 and A2). YAML is a data serialization language with an emphasis on simplicity and which interoperates well with JSON (indeed, YAML 1.2 is a superset of JSON). We chose YAML over JSON because although JSON is an excellent format for data interchange, it is ill-suited for human understanding and manipulation. We also considered other declarative data exchange formats such as TOML, but chose YAML because of its equivalence with JSON, popularity, and good software support. Since the Demes data model is defined in JSON Schema, however, there is no formal dependency on YAML and implementations may choose to use JSON directly if they wish (e.g. for greater efficiency).
Rationale for static models
The Demes specification is designed to describe demographic models defined by a fixed set of model parameters. As described in the main text, it does not include information about estimated confidence intervals or the joint distribution of parameter values. In this section, we describe the rationale for this design decision.
The parameters of demographic models are typically tightly coupled, and cases in which distributions for different parameters can be simply described are rare. In this situation, the simplest way to describe an estimated distribution is to list a large number of samples from the posterior. While writing out a large number of Demes models in YAML format may seem inefficient, it can in fact be a compact way to describe these distributions. For example, consider a one-population model with piecewise-constant sizes over 20 epochs which has ∼40 free parameters: the start_size and end_time values for each epoch. If we sample 50,000 models from the posterior distribution, the resulting multidocument YAML file is 45 MiB. This format compresses down to 8.4 MiB when gzipped or 6.2 MiB when compressed with LZMA2, which is on par with an equivalent binary representation of the free parameters ().
Similarly, one might be interested in running simulations in which the demographic model parameters are drawn from a distribution, e.g. in ABC inference (Beaumont et al. 2002). Other inference procedures based on optimizing a loss function (Gutenkunst et al. 2009; Jouganous et al. 2017; Kamm et al. 2017; Ragsdale and Gravel 2019; Excoffier et al. 2021) need users to specify parameter bounds, and possibly nonlinear or conditional constraints between parameters. Indeed, the choice of how to parameterize a model could be important for some inference methods (e.g. absolute times vs relative times between events).
Implementing the many distributions of interest and supporting a general way to describe a model’s free parameters would greatly increase the complexity of parsers, with relatively limited benefit to most users. It is unlikely that Demes could be made sufficiently flexible without implementing many features of general-purpose programming languages, such as variables, arithmetic, and flow control. Such use cases are therefore better served by writing model-generating functions in an existing programming language, for example using the Demes Python API (e.g. as implemented in moments; Jouganous et al. 2017; Ragsdale and Gravel 2019). As an intriguing possibility for developments in this direction, there exist many templating solutions for YAML and JSON that are specifically designed for extending static data in arbitrarily complex ways (e.g. YTT, Jsonnet, CUE, and Dhall).
Contributor Information
Graham Gower, Section for Molecular Ecology and Evolution, Globe Institute, University of Copenhagen, 1350 Copenhagen K, Denmark.
Aaron P Ragsdale, Department of Integrative Biology, University of Wisconsin–Madison, Madison, WI 53706, USA.
Gertjan Bisschop, Institute of Ecology and Evolution, The University of Edinburgh, Edinburgh EH9 3FL, UK.
Ryan N Gutenkunst, Department of Molecular and Cellular Biology, University of Arizona, Tucson, AZ 85721, USA.
Matthew Hartfield, Institute of Ecology and Evolution, The University of Edinburgh, Edinburgh EH9 3FL, UK.
Ekaterina Noskova, Computer Technologies Laboratory, ITMO University, 197101 Saint-Petersburg, Russia.
Stephan Schiffels, Max Planck Institute for Evolutionary Anthropology, 04103 Leipzig, Germany.
Travis J Struck, Department of Molecular and Cellular Biology, University of Arizona, Tucson, AZ 85721, USA.
Jerome Kelleher, Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford OX3 7LF, UK.
Kevin R Thornton, Department of Ecology and Evolutionary Biology, University of California, Irvine, CA 92697, USA.
Data Availability
The Demes specification and documentation are available at https://popsim-consortium.github.io/demes-spec-docs/
Literature cited
- Adrion JR, Cole CB, Dukler N, Galloway JG, Gladstein AL, Gower G, Kyriazis CC, Ragsdale AP, Tsambos G, Baumdicker F, et al. A community-maintained standard library of population genetic models. eLife. 2020;9:e54967. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Antao T, Beja-Pereira A, Luikart G.. MODELER4SIMCOAL2: a user-friendly, extensible modeler of demography and linked loci for coalescent simulations. Bioinformatics. 2007;23(14):1848–1850. [DOI] [PubMed] [Google Scholar]
- Arenas M. Simulation of molecular data under diverse evolutionary scenarios. PLoS Comput Biol. 2012;8(5):e1002495. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Baker M. 1,500 scientists lift the lid on reproducibility. Nat News. 2016;533(7604):452–454. [DOI] [PubMed] [Google Scholar]
- Barton NH, Depaulis F, Etheridge AM.. Neutral evolution in spatially continuous populations. Theor Popul Biol. 2002;61(1):31–48. [DOI] [PubMed] [Google Scholar]
- Barton NH, Kelleher J, Etheridge AM.. A new model for extinction and recolonization in two dimensions: quantifying phylogeography. Evolution. 2010;64(9):2701–2715. [DOI] [PubMed] [Google Scholar]
- Battey C, Ralph PL, Kern AD.. Space is the place: effects of continuous spatial structure on analysis of population genetic data. Genetics. 2020;215(1):193–214. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Baumdicker F, Bisschop G, Goldstein D, Gower G, Ragsdale AP, Tsambos G, Zhu S, Eldon B, Ellerman EC, Galloway JG, et al. Efficient ancestry and mutation simulation with msprime 1.0. Genetics. 2022;220(3):iyab229. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Beaumont MA, Zhang W, Balding DJ.. Approximate Bayesian computation in population genetics. Genetics. 2002;162(4):2025–2035. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Becheler A, Coron C, Dupas S.. The quetzal coalescence template library: a C++ programmers resource for integrating distributional, demographic and coalescent models. Mol Ecol Resour. 2019;19(3):788–793. [DOI] [PubMed] [Google Scholar]
- Ben-Kiki O, Evans C, Ingerson B. YAML Ain’t Markup Language (yaml™) version 1.1. Working Draft 2008-0511. 2009. https://yaml.org/spec/1.2.2/
- Bray T. The JavaScript Object Notation (JSON) Data Interchange Format. RFC 8259. 2017. https://www.rfc-editor.org/rfc/rfc8259
- Bürger R. The Mathematical Theory of Selection, Recombination, and Mutation. Chichester (UK): Wiley; 2000. [Google Scholar]
- Bustamante CD, Wakeley J, Sawyer S, Hartl DL.. Directional selection and the site-frequency spectrum. Genetics. 2001;159(4):1779–1788. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Carvajal-Rodríguez A. Simulation of genomes: a review. Curr Genomics. 2008;9(3):155–159. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Chen GK, Marjoram P, Wall JD.. Fast and flexible simulation of DNA sequence data. Genome Res. 2009;19(1):136–142. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Christiansen FB. Hard and soft selection in a subdivided population. Am Nat. 1975;109(965):11–16. [Google Scholar]
- Crow JF, Kimura M.. An Introduction to Population Genetics Theory. Caldwell; NJ: Blackburn press; 1970. [Google Scholar]
- Ewing G, Hermisson J.. MSMS: a coalescent simulation program including recombination, demographic structure, and selection at a single locus. Bioinformatics. 2010;26(16):2064–2065. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Excoffier L, Foll M.. Fastsimcoal: a continuous-time coalescent simulator of genomic diversity under arbitrarily complex evolutionary scenarios. Bioinformatics. 2011;27(9):1332–1334. [DOI] [PubMed] [Google Scholar]
- Excoffier L, Marchi N, Marques DA, Matthey-Doret R, Gouy A, Sousa VC.. fastsimcoal2: demographic inference under complex evolutionary scenarios. Bioinformatics. 2021;37(24):4882–4885. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Gilmour JSL, Gregor JW.. Demes: a suggested new terminology. Nature. 1939;144(3642):333. [Google Scholar]
- Gilmour JSL, Heslop-Harrison J.. The deme terminology and the units of micro-evolutionary change. Genetica. 1954;27(1–2):147–161. [DOI] [PubMed] [Google Scholar]
- Gravel S. Population genetics models of local ancestry. Genetics. 2012;191(2):607–619. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Guillaume F, Rougemont J.. Nemo: an evolutionary and population genetics programming framework. Bioinformatics. 2006;22(20):2556–2557. [DOI] [PubMed] [Google Scholar]
- Gutenkunst RN, Hernandez RD, Williamson SH, Bustamante CD.. Inferring the joint demographic history of multiple populations from multidimensional SNP frequency data. PLoS Genetics. 2009;5(10):e1000695. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Haller BC, Messer PW.. SLiM 3: forward genetic simulations beyond the Wright–Fisher model. Mol Biol Evol. 2019;36(3):632–637. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Hartfield M, Wright SI, Agrawal AF.. Coalescent times and patterns of genetic diversity in species with facultative sex: effects of gene conversion, population structure, and heterogeneity. Genetics. 2016;202(1):297–312. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Hernandez RD. A flexible forward simulator for populations subject to selection and demography. Bioinformatics. 2008;24(23):2786–2787. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Hernandez RD, Uricchio LH. SFS_code: more Efficient and Flexible Forward Simulations. Technical report, bioRxiv. 2015. 10.1101/025064 [DOI]
- Hoban S, Bertorelle G, Gaggiotti OE.. Computer simulations: tools for population and evolutionary genetics. Nat Rev Genet. 2012;13(2):110–122. [DOI] [PubMed] [Google Scholar]
- Hudson RR. Testing the constant-rate neutral allele model with protein sequence data. Evolution. 1983;37(1):203–217. [DOI] [PubMed] [Google Scholar]
- Hudson RR. Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics. 2002;18(2):337–338. [DOI] [PubMed] [Google Scholar]
- Jouganous J, Long W, Ragsdale AP, Gravel S.. Inferring the joint demographic history of multiple populations: beyond the diffusion approximation. Genetics. 2017;206(3):1549–1567. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kamm JA, Terhorst J, Song YS.. Efficient computation of the joint sample frequency spectra for multiple populations. J Comput Graph Stat. 2017;26(1):182–194. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kelleher J, Etheridge AM, McVean G.. Efficient coalescent simulation and genealogical analysis for large sample sizes. PLoS Comput Biol. 2016;12(5):e1004842. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kelleher J, Lohse K.. Coalescent simulation with msprime. In: Dutheil JY, editor. Statistical Population Genomics. New York (NY: ): Springer US; 2020. p. 191–230. [DOI] [PubMed] [Google Scholar]
- Kern AD, Schrider DR.. Discoal: flexible coalescent simulations with selection. Bioinformatics. 2016;32(24):3839–3841. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kim BY, Huber CD, Lohmueller KE.. Inference of the distribution of selection coefficients for new nonsynonymous mutations using large samples. Genetics. 2017;206(1):345–361. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Laetsch D. R, Bisschop G, Martin S, Aeschbacher S, Setter D, Lohse K. Demographically explicit scans for barriers to gene flow using genome-wide IM blockwise likelihood estimation: gIMble. In prep.2022. [DOI] [PMC free article] [PubMed]
- Li H, Durbin R.. Inference of human population history from individual whole-genome sequences. Nature. 2011;475(7357):493–496. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Liu Y, Athanasiadis G, Weale ME.. A survey of genetic simulation software for population and epidemiological studies. Hum Genomics. 2008;3(1):79–86. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Mailund T, Schierup MH, Pedersen CNS, Mechlenborg PJM, Madsen JN, Schauser L.. CoaSim: a flexible environment for simulating genetic data under coalescent models. BMC Bioinformatics. 2005;6:252–256. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Munafò MR, Nosek BA, Bishop DVM, Button KS, Chambers CD, du Sert NP, Simonsohn U, Wagenmakers E-J, Ware JJ, Ioannidis JPA, et al. A manifesto for reproducible science. Nat Hum Behav. 2017;1:0021. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Nordborg M, Donnelly P.. The coalescent process with selfing. Genetics. 1997;146(3):1185–1195. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Noskova E, Ulyantsev V, Koepfli K-P, O’Brien SJ, Dobrynin P.. GADMA: genetic algorithm for inferring demographic history of multiple populations from allele frequency spectrum data. GigaScience. 2020;9(3):giaa005. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Parobek CM, Archer FI, DePrenger-Levin ME, Hoban SM, Liggins L, Strand AE.. skeleSim: an extensible, general framework for population genetic simulation in R. Mol Ecol Resour. 2017;17(1):101–109. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Parreira B, Trussart M, Sousa V, Hudson R, Chikhi L.. SPAms: a user-friendly software to simulate population genetics data under complex demographic models. Mol Ecol Resour. 2009;9(3):749–753. [DOI] [PubMed] [Google Scholar]
- Ragsdale AP, Gravel S.. Models of archaic admixture and recent history from two-locus statistics. PLoS Genet. 2019;15(6):e1008204. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ragsdale AP, Nelson D, Gravel S, Kelleher J.. Lessons learned from bugs in models of human history. Am J Hum Genet. 2020;107(4):583–588. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ralph P, Thornton K, Kelleher J.. Efficiently summarizing relationships in large samples: a general duality between statistics of genealogies and genomes. Genetics. 2020;215(3):779–797. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ringbauer H, Coop G, Barton NH.. Inferring recent demography from isolation by distance of long shared sequence blocks. Genetics. 2017;205(3):1335–1351. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Schiffels S, Durbin R.. Inferring human population size and separation history from multiple genome sequences. Nat Genet. 2014;46(8):919–925. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Schiffels S, Wang K.. MSMC and MSMC2: the multiple sequentially Markovian coalescent. In: Dutheil JY, editor. Statistical Population Genomics. Vol 2090 of Methods in Molecular Biology. New York (NY: ): Springer US; 2020. p. 147–166. [DOI] [PubMed] [Google Scholar]
- Shlyakhter I, Sabeti PC, Schaffner SF.. Cosi2: an efficient simulator of exact and approximate coalescent with selection. Bioinformatics. 2014;30(23):3427–3429. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Staab PR, Zhu S, Metzler D, Lunter G.. scrm: efficiently simulating long sequences using the approximated coalescent with recombination. Bioinformatics. 2015;31(10):1680–1682. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Tajima F. Evolutionary relationship of DNA sequences in finite populations. Genetics. 1983;105(2):437–460. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Tennessen JA, Bigham AW, O'Connor TD, Fu W, Kenny EE, Gravel S, McGee S, Do R, Liu X, Jun G, et al. ; NHLBI Exome Sequencing Project. Evolution and functional impact of rare coding variation from deep sequencing of human exomes. Science. 2012;337(6090):64–69. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Thornton KR. A C++ template library for efficient forward-time population genetic simulation of large populations. Genetics. 2014;198(1):157–166. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Thornton KR. Polygenic adaptation to an environmental shift: temporal dynamics of variation under Gaussian stabilizing selection and additive effects on a single trait. Genetics. 2019;213(4):1513–1530. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Wakeley J. Coalescent Theory: An Introduction. Englewood, CO: Roberts and CompanyOxford University Press; 2008. [Google Scholar]
- Wright A, Andrews H, Hutton B, Dennis G. JSON schema: a media type for describing JSON documents. 2020. https://json-schema.org/
- Wright S. Isolation by distance. Genetics. 1943;28(2):114–138. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Yuan X, Miller DJ, Zhang J, Herrington D, Wang Y.. An overview of population genetic data simulation. J Comput Biol. 2012;19(1):42–54. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Zhou Y, Tian X, Browning BL, Browning SR.. POPdemog: visualizing population demographic history from simulation scripts. Bioinformatics. 2018;34(16):2854–2855. [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Data Availability Statement
The Demes specification and documentation are available at https://popsim-consortium.github.io/demes-spec-docs/