Author manuscript; available in PMC: 2016 Mar 23.
Published in final edited form as: Genet Epidemiol. 2014 Dec 13;39(1):2–10. doi: 10.1002/gepi.21876

Genetic data simulators and their applications: an overview

Bo Peng 1,*, Huann-Sheng Chen 2, Leah E Mechanic 4, Ben Racine 3, John Clarke 3, Elizabeth Gillanders 4, Eric J Feuer 2
PMCID: PMC4804465  NIHMSID: NIHMS767883  PMID: 25504286

Abstract

Computer simulations have played an indispensable role in the development and application of statistical models and methods for genetic studies across multiple disciplines. The need to simulate complex evolutionary scenarios and pseudo-datasets for various studies has fueled the development of dozens of computer programs with varying reliability, performance, and application areas. To help researchers compare and choose the most appropriate simulators for their studies, we have created the Genetic Simulation Resources (GSR) website, which allows authors of simulation software to register their applications and describe them with more than 160 defined attributes. This article summarizes the properties of 93 simulators currently registered at GSR and provides an overview of the development and applications of genetic simulators. Unlike other review articles that address technical issues or compare simulators for particular application areas, we focus on software development, maintenance, and features of simulators, often from a historical perspective. Publications that cite these simulators are used to summarize both the applications of genetic simulations and the utilization of simulators.

Keywords: genetic simulation, software, Genetic Simulation Resources

Introduction

In silico Monte Carlo simulations of genetic data have played an important role in the development and application of statistical methods for a wide variety of research topics in multiple disciplines [Daetwyler, et al. 2013; Epperson, et al. 2010; Hoban, et al. 2011; Marjoram and Tavare 2006]. In evolutionary applications, such simulations model the evolution of the human genome under the influence of demographic and genetic features such as population bottleneck and expansion, mutation, natural selection, and recombination. The ability to simulate complex interactions among multiple factors makes computer simulations ideal tools to predict the outcome of complex evolutionary models, and to infer demographic or genetic features from observed genetic samples [Hoban, et al. 2011]. The latter is generally achieved by simulating a large number of samples under different parameter settings and choosing the parameters that produce samples that best match the observed empirical data [Csillery, et al. 2010].
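The inference strategy described above (simulate under many parameter settings, then keep the settings whose output best matches the observed data) can be sketched as a simple rejection sampler. This is a minimal illustration and not the method of any simulator discussed here: the toy model (the number of variant alleles among sampled chromosomes), the uniform prior, the tolerance, and all function names are assumptions made for this example.

```python
import random

def simulate_count(freq, n_chrom, rng):
    """Toy simulator: number of variant alleles among n_chrom sampled chromosomes."""
    return sum(1 for _ in range(n_chrom) if rng.random() < freq)

def abc_rejection(observed, n_chrom, n_draws, tolerance, seed=0):
    """Keep prior draws whose simulated summary lies within `tolerance` of the observed one."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_draws):
        freq = rng.random()  # assumed prior: Uniform(0, 1) on the allele frequency
        simulated = simulate_count(freq, n_chrom, rng)
        if abs(simulated - observed) <= tolerance:
            accepted.append(freq)
    return accepted

# Hypothetical observation: 30 variant alleles among 100 chromosomes.
accepted = abc_rejection(observed=30, n_chrom=100, n_draws=5000, tolerance=2)
estimate = sum(accepted) / len(accepted)  # mean of the accepted parameter values
```

In practice the toy simulator would be replaced by a call to a real genetic simulator and the single count by a vector of summary statistics; the accept/reject logic stays the same.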

In genetic epidemiological studies of human traits and diseases, genotypes of evolving populations or simulated datasets are associated with phenotypes under hypothetical disease models. Through the simulation of datasets with known genotype-phenotype associations, computer simulations aid in validating statistical methods that detect disease-predisposing genes associated with human traits and diseases, and in evaluating the power of these methods [Ritchie and Bush 2010]. The roles of evolutionary modeling vary in such simulations: whereas simulations for the validation of statistical methods should match evolutionary and other assumptions of the model, the evaluation of statistical power requires more realistic datasets that are simulated independently of the statistical method. In other words, the validation of a method determines whether the method works under model assumptions, and the power evaluation determines to what extent the assumptions hold in reality and how well the method performs if the assumptions fail. Using datasets simulated under model assumptions to evaluate the power of statistical methods can lead to unfair and uninformative comparisons and unreliable conclusions on the applicability and power of these methods. In addition to these two broad types of applications, computer simulations have been used to simulate the evolution of molecular sequences and related traits along phylogenetic trees [Dalquen, et al. 2012] and outcomes of RNA-Seq [Griebel, et al. 2012] and next-generation sequencing [Griebel, et al. 2012] experiments.

A number of computer programs have been developed to perform simulations for different application areas. These simulators vary greatly in terms of their simulation targets, simulation methods, input/output formats, and features they provide. Two major simulation methods, namely backward-time and forward-time methods, have been used to model the evolutionary history of human populations. The backward-time method constructs the genealogy of samples retrospectively according to the coalescent theory of population genetics [Kingman 1982; Kingman 2000]. The forward-time method simulates the evolution of populations prospectively and draws samples from simulated populations. The pros and cons of these two methods have been discussed extensively elsewhere [Carvajal-Rodriguez 2010; Ritchie and Bush 2010; Yuan, et al. 2012]. For simulations that do not model the evolution of populations, theoretical, resampling, and gene-dropping methods have their own applications. Simulations based on phylogenetic trees are also widely used for phylogenetic studies, and are not discussed here [Arenas 2012].
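The contrast between the two major approaches can be made concrete with a small sketch. The forward-time function below evolves allele counts generation by generation under a neutral haploid Wright-Fisher model, whereas the backward-time function draws the waiting times between coalescent events for a sample under the standard Kingman coalescent. Both are deliberately simplified illustrations under assumed models (no mutation, selection, recombination, or population structure), and the function names are hypothetical rather than taken from any catalogued simulator.

```python
import random

def forward_wright_fisher(pop_size, init_freq, generations, rng):
    """Forward-time: follow an allele frequency through successive generations."""
    count = round(pop_size * init_freq)
    trajectory = [count / pop_size]
    for _ in range(generations):
        freq = count / pop_size
        # Each offspring chromosome copies a randomly chosen parental chromosome.
        count = sum(1 for _ in range(pop_size) if rng.random() < freq)
        trajectory.append(count / pop_size)
    return trajectory

def coalescent_times(sample_size, rng):
    """Backward-time: waiting times between successive coalescent events.

    Time is measured in units of N generations for a haploid population of
    size N. While k lineages remain, the waiting time to the next coalescence
    is exponential with rate k * (k - 1) / 2.
    """
    times = []
    for k in range(sample_size, 1, -1):
        rate = k * (k - 1) / 2
        times.append(rng.expovariate(rate))
    return times

rng = random.Random(1)
trajectory = forward_wright_fisher(pop_size=200, init_freq=0.5, generations=50, rng=rng)
times = coalescent_times(sample_size=10, rng=rng)
tree_height = sum(times)  # time to the most recent common ancestor of the sample
```

The forward-time function touches every chromosome in every generation, while the backward-time function only tracks the ancestry of the sample, which is one reason coalescent simulators tend to be faster for neutral models.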

Unlike previous reviews that address technical issues [Carvajal-Rodriguez 2010; Ritchie and Bush 2010] or relate simulators to particular research topics [Hoban, et al. 2011], we emphasize the development and applications of simulators themselves. We first summarized properties of 93 simulators catalogued on the Genetic Simulation Resources (GSR) website (http://popmodels.cancercontrol.cancer.gov/gsr/) [Peng, et al. 2013], including a historical view of targets of simulations and simulation methods. We then collected and analyzed all publications that cite these simulators to summarize both the applications of genetic simulations and the utilization of simulators.

Methods

Creation of the GSR Catalogue

In order to create GSR, we searched published articles for software applications that simulate genetic data for the human genome in scientific journals such as Bioinformatics, BMC Bioinformatics, Genetics, Molecular Biology and Evolution, and Genetic Epidemiology. We selected simulators that simulate genetic material of the human genome, including genetic markers, haploid and diploid DNA sequences and RNA and protein sequences. We included ecology-oriented simulators that are applicable to human populations but excluded those that are designed specifically for animal populations (e.g. AQUASPLATCHE [Neuenschwander 2006]). We excluded simulators without an accessible webpage or download link (e.g. POPSIM [Hampe, et al. 1998], simM [Lemire, et al. 2004]) and those that were designed for teaching purposes and do not aim to generate datasets to be analyzed by other applications (e.g. Populus, PopG).

Initially, we collected basic information about 80 selected simulators, including short and long descriptions, homepage, project start date, most recent version and its release date, and publications that describe the simulator. These publications are called primary citations in this paper and were used to trace the applications of simulators. We also summarized features of these 80 simulators using 168 attributes in 8 categories and 25 subcategories. These attributes range from key features such as the type of simulated genetic variations, simulation method, and input/output file formats, to development features such as programming language, supported platforms, and license. To ensure the accuracy of our data, the authors of the 80 packages were asked to verify the abstracted data, and 42 packages were confirmed by their authors. After launching the GSR website in October 2013 [Peng, et al. 2013], we invited authors of an additional 26 packages to register packages on GSR. Of those, 13 authors responded and added information about their software packages as of May 2014. The remaining 13 packages are currently pending. The GSR team is evaluating different options and might register and maintain entries of these simulators, including more classic simulators such as SIMULATE [Terwilliger, et al. 1993], as long as they are actively used in the research community.

Application and Utilization of Genetic Simulation Programs

To assess the applications of these 93 (initial 80 + new 13) simulation packages, we performed a Web of Science® (Thomson Reuters) search and collected all publications that cite the primary citations of the simulators. Due to the large number of citations (N=5,687), we were not able to verify whether these publications actually used the cited simulators to perform simulations or merely referenced the primary citations. Articles in which cited simulators were not used included review articles, application notes (cross citations from another package), and positive or negative mentions of the potential use of simulators. These articles can represent a large proportion of citing articles of some simulators according to a review of all citing articles of six simulators (20 non-application citations out of 33 total citations for CoaSim [Mailund, et al. 2005], 41 out of 88 for simuPOP [Peng and Kimmel 2005], 15 out of 26 for Hap-Sample [Wright, et al. 2007], 7 out of 23 for genomeSIMLA [Edwards, et al. 2008], 13 out of 21 for forsim [Lambert, et al. 2008], and 19 out of 30 for FREGENE [Chadeau-Hyam, et al. 2008]).

To explore the simulation tools used in recent genetic epidemiology publications and examine whether genetic simulations were performed using existing simulators catalogued in GSR, we reviewed all articles from the Sep. 2013 to Feb. 2014 issues of the journal Genetic Epidemiology to determine how simulations were performed in these articles.

Results

Characteristics of Genetic Simulation programs

We used the date of the first public release or the publication date of the primary citation as the creation date of a simulator. The creation dates of the catalogued simulators range from 1990 to 2013, but we catalogued only 4 packages that were created before 2000. The oldest simulator in the GSR catalogue is fastSLINK, which was last updated in 2010 but whose previous version, SLINK, was released in 1990. We excluded some other earlier simulation programs because they are no longer publicly available. We collected on average 2 packages per year released from 2000 to 2004, 7.75 packages per year released from 2005 to 2008, and 10.75 packages per year released from 2009 to 2012. Although we included only 5 packages for year 2013 thus far, the number of new packages for this year is expected to be more than 15 if we include packages that have not been published and the ones we have not yet added to GSR. The increasing number of new simulators over the past ten years corresponds to the increasing depth and breadth of genetic epidemiological and other studies, which should continue to drive the development of new genetic simulators.

The C/C++ programming language was used by most of the simulators (69%), followed by Python (15%), R (11%), Java (9%), Perl (5%) and other languages such as Visual Basic (10%). The dominance of C/C++ is not surprising due to its popularity in the scientific community and computational speed relative to Java, R and Python. Owing to its ‘glue-language’ nature, Python was most frequently used to create wrappers of the underlying C/C++ code [De Mita and Siol 2012] or simulation library [Peng and Kimmel 2005] to provide flexible command line or graphical interfaces to the underlying simulation engine [Mailund, et al. 2005]. Most of the simulators can be executed under Linux (84%), Mac OS X (68%), and Windows (66%); only 20% of the simulators support Solaris and other systems; and 56% of the simulators support all major operating systems.

Among the 93 catalogued simulators, 68 were released under an open-source license (62 under a GNU Public License, 5 under a BSD-style license), 4 are controlled by universities and government agencies, and 21 packages do not specify a license although 2 indicate they are free for academic use. Although academic use of all simulators is likely to be free and source code is available for most of the simulators, license terms may preclude use for commercial simulation services or data set generation. It is also worth noting that the copyright of some software packages might belong to the institutions or other grantees if the development of the software was supported by grants.

Because a command line interface is easy to implement, flexible, portable, and works easily with other tools, such an interface is preferred for scientific computation and is provided by a majority (81 out of 93) of the simulators. In comparison, only 29 of the simulators provide a graphical user interface, and most of these are thin wrappers of the corresponding command line interfaces designed to allow easy input of parameters for novice users [Peng and Liu 2010]. For complex simulations, many simulators allow the use of configuration files to enter a large number of parameters (e.g. forsim [Lambert, et al. 2008]). For even more complex simulations with conditions and loops, a scripting interface provides a more powerful, and often more readable, alternative to complex configuration files [Peng and Kimmel 2005]. Scripting interfaces in Python and R are available in 26 of the 93 simulators. Last, 5 simulators are accessible from web interfaces. These interfaces are appealing because they can save users from the difficulties of downloading and installing a simulator and from the computational burden of running the simulations locally. However, these interfaces are only suitable for small-scale simulations with small input and output files, and maintaining such services can be financially unsustainable for authors of simulators in the long run.

Attributes of simulation packages

Among the 93 genetic simulators, 75 simulate genetic markers or sequences on DNA, 11 simulate protein sequences, and 7 simulate RNA sequences. Among these 75 DNA sequence simulators, 61 simulate haploid (54) or diploid (29) sequences, 42 simulate genetic markers, and 17 simulate sex chromosomes and mitochondrial DNA. Although haploid sequence simulators (mostly coalescent-based) have been frequently used to simulate diploid individuals in a population by pairing haploid sequences randomly, we do not consider them as diploid simulators because these simulators cannot simulate diploid-specific models such as non-additive selection models.
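The random pairing of haploid sequences mentioned above can be sketched as follows. This is a minimal illustration under assumed conventions (haplotypes coded as lists of 0/1 alleles, hypothetical function names), not code from any catalogued simulator. It also shows why such pairing happens only after the evolutionary process has finished: the diploid genotype never feeds back into the simulation, so diploid-specific models such as non-additive selection cannot be expressed this way.

```python
import random

def pair_haplotypes(haplotypes, rng):
    """Shuffle haploid sequences and pair neighbours into diploid individuals."""
    shuffled = list(haplotypes)
    rng.shuffle(shuffled)
    # If the number of haplotypes is odd, the last one is left unpaired and dropped.
    return [(shuffled[i], shuffled[i + 1]) for i in range(0, len(shuffled) - 1, 2)]

def genotype(individual):
    """Number of variant alleles (0, 1 or 2) carried at each site."""
    hap_a, hap_b = individual
    return [a + b for a, b in zip(hap_a, hap_b)]

# Hypothetical output of a haploid simulator: 4 sequences, 3 sites each.
haplotypes = [[0, 1, 0], [1, 1, 0], [0, 0, 1], [1, 0, 0]]
individuals = pair_haplotypes(haplotypes, random.Random(0))
genotypes = [genotype(ind) for ind in individuals]
```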

All sequence types have been simulated by an increasing number of simulators (Fig. 1) but the actual simulated data and application areas can be quite different. For example, whereas early simulators of RNA and protein sequences such as ROSE [Stoye, et al. 1998] were used to study the evolution of RNA sequences along phylogenetic trees to evaluate the sequence alignment and phylogenetic prediction methods [Truszkowski, et al. 2012], recent RNA sequence simulators such as FLUX SIMULATOR [Griebel, et al. 2012] simulate RNA-seq experiments to understand the impact and mutual interference of biases in different experimental setups. Similarly, simulators of genetic markers simulate genetic markers in varying density, number and population frequencies, for different genetic epidemiological studies such as linkage studies [Leal, et al. 2005], genome-wide association (GWA) studies [Li and Li 2008], and rare variant association [Peng and Liu 2010] analyses.

Figure 1.

Number of simulators that simulate different types of genetic sequences in each year. X-axis: year of the initial release of the simulator. Y-axis: number of simulators. A package that simulates multiple types of sequences is displayed in multiple groups.

The forward-time simulation method was the most frequently used simulation method among all simulators (33), followed by coalescent-based backward-time (28), phylogenetic (22), resampling (12), and other methods (17). Although the first notable forward-time simulator, easyPOP [Balloux 2001], was developed earlier than the first popular coalescent simulator, ms [Hudson 2002], forward-time simulators were less popular than coalescent-based simulators until 2009, when a number of forward-time simulators such as SFS_CODE [Hernandez 2008] and ForSim [Lambert, et al. 2008] were introduced to simulate haplotypes under complex evolutionary scenarios, mostly for GWA studies (Fig. 2). Recent years have also witnessed a notable increase in the use of multiple simulation methods. For example, MSMS [Ewing and Hermisson 2010] includes the functionality of the simulator ms for modeling population structure and demography but adds a model for deme- and time-dependent selection using forward simulations.

Figure 2.

Number of newly developed simulators that use different simulation approaches in each year. X-axis: year of the initial release of the simulator. Y-axis: number of simulators. A simulator that uses multiple simulation methods is displayed in multiple groups, as well as in the “multiple methods” group.

To evaluate the modeling flexibility of catalogued simulators, we summarized evolutionary and sampling features provided by all simulators and categorized simulators by the number of features they provide (Fig. 3). Because forward-time simulation methods can in theory simulate arbitrary demographic and genetic models, it is not surprising that forward-time simulators provide many more features than other simulation methods. In contrast, because resampling based simulation methods have little control over the evolutionary forces that shape the genetic composition of the human population, they are used for specific applications with limited features.

Figure 3.

Number of evolutionary features provided by simulators using different simulation methods. Y-axis: number of features provided. X-axis: number of packages.

Utilization of genetic simulators

We searched Web of Science (WoS) for the primary citations of 93 simulators. Excluding 4 unpublished simulators (FPG, RNA Seq Simulator, rlsim, SimCopy) and 4 others that were published either too early to be indexed (GASP), too late to be indexed (SEQPower) [Wang, et al. 2014], or in journals that are not indexed by WoS (Mendel’s Accountant, Mason, SIBSIM), the remaining 85 simulators were published in Bioinformatics (41%), BMC Bioinformatics (14%), Molecular Biology and Evolution (5%), Genetics (5%), Molecular Ecology Notes (5%) and 18 other journals. The 2-page Application Note format in Bioinformatics appears to be the most popular way to publish simulation programs.

The 93 simulators were cited in 5,687 articles, spreading to 772 journals in 20 disciplines (according to categories defined by WoS). Molecular Biology and Genetics represents the largest application area (31%), followed by Environment/Ecology (23%), Plant & Animal Science (18%), Computer Science (7%), Clinical Medicine (5%), and Biology and Biochemistry (4%) (Fig. 4). Together, the categories of human (biology, medicine and genetics) and non-human (ecology, plant and animal science) genetics each represent about 40% of the publications.

Figure 4.

Distribution of publications by Discipline. Number of publications that cite the catalogued simulators in GSR in different disciplines as categorized by Web of Science.

The utilization of simulators, measured by the number of citing articles excluding review articles and cross-citations from other simulators, varies greatly among simulators. Among 69 simulators published in 2011 or before, 2 were never cited, 15 were cited less than once per year, and 70% (45) of the simulators were cited fewer than 5 times per year. Only 6 simulators were cited 20 or more times per year, and 5 of those are coalescent-based (BOTTLENECK, SIMCOAL2, cosi, ms, msHOT). In comparison, the most popular forward-time simulators SFS_CODE and simuPOP were cited about 10 and 6 times annually, respectively. Because a simulator could be cited without actually being used and could be used without being cited, these numbers are only rough estimates of the utilization of simulators.

Applications of genetic simulators catalogued in GSR in human genetics are surprisingly diverse with 2,420 articles published in 196 different journals. The top 6 journals published 11% (Molecular Biology and Evolution), 10% (Genetics), 9% (PLoS One), 6% (American Journal of Human Genetics), 4% (Molecular Phylogenetics and Evolution) and 4% (Genetic Epidemiology) of the 2,420 articles. These journals could be categorized by scientific topics of interest based on the scientific focus of the journal, for example, by evolution, genetics, epidemiology, etc. Because each scientific topic has its own simulators of choice, the popularity of simulators varied by specific journal and year.

Limiting the 2,420 articles to only those published in the journal Genetic Epidemiology, and excluding two articles that proposed new simulation methods [Xu, et al. 2013; Yang and Gu 2013], 92 articles reported the use of catalogued simulators to simulate data for various research topics (Table 1). SLINK/FastSLINK [Ott 1989], perhaps with other early simulators, were used to simulate genotypes on pedigrees for linkage analysis. Starting in 2004, coalescent simulators ms/msHOT [Hudson 2002], cosi [Schaffner, et al. 2005], GENOME [Liang, et al. 2007] and MaCS [Chen, et al. 2009] were frequently used to simulate sequence data under neutral assumptions. With the increasing popularity of GWA studies after 2004, a theoretical simulator HapSim [Montana 2005] was developed to simulate haplotypes with pre-specified allele frequency and linkage disequilibrium structure; resampling-based methods GWAsimulator [Li and Li 2008], HAPGEN [Su, et al. 2011], HAPSIMU [Zhang, et al. 2008], and HAP-SAMPLE [Wright, et al. 2007] were used to simulate “realistic” GWA samples; and SIMLA [Schmidt, et al. 2005] was used to simulate pedigree data using the gene-dropping method. Finally, when more complex evolutionary scenarios were considered, forward-time simulators such as simuPOP [Peng and Kimmel 2005], genomeSIMLA [Edwards, et al. 2008], SFS_CODE [Hernandez 2008], and FREGENE [Chadeau-Hyam, et al. 2008] were used. Most recently, srv [Peng and Liu 2010] was used to simulate rare variants for rare variant association analysis.
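The gene-dropping method mentioned above for pedigree data can be illustrated with a short sketch: founders receive alleles drawn from a population frequency, and every other pedigree member inherits one randomly chosen allele from each parent. This is a single-locus illustration under assumed conventions (a biallelic locus, no recombination or mutation, a hypothetical pedigree encoding and function name); it is not the implementation used by SIMLA or any other simulator named here.

```python
import random

def gene_drop(pedigree, founder_freq, rng):
    """Drop alleles down a pedigree at a single biallelic locus.

    `pedigree` maps each individual to a (father, mother) pair, or to None
    for a founder. Parents must be listed before their children.
    """
    genotypes = {}
    for person, parents in pedigree.items():
        if parents is None:
            # Founder: two alleles drawn independently from the population frequency.
            genotypes[person] = tuple(int(rng.random() < founder_freq) for _ in range(2))
        else:
            father, mother = parents
            # Mendelian transmission: one randomly chosen allele from each parent.
            genotypes[person] = (rng.choice(genotypes[father]), rng.choice(genotypes[mother]))
    return genotypes

# Hypothetical nuclear family with two children.
pedigree = {
    "father": None,
    "mother": None,
    "child1": ("father", "mother"),
    "child2": ("father", "mother"),
}
genotypes = gene_drop(pedigree, founder_freq=0.3, rng=random.Random(42))
```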

Table 1.

Simulators used to simulate data for a total of 92 articles in the journal Genetic Epidemiology, from year 1990 to 2013. Simulators using different simulation methods are marked in different colors, starting from the year they were created. The lengths of blue bars are proportional to the number of citations. Some articles cite more than one simulator so there are 96 citations from 92 articles.

Although 19 simulators have been used for publications in Genetic Epidemiology, the numbers of publications citing these simulators were around 10 per year in recent years (Table 1), which seems to be low compared to the number of methodology papers published in this journal each year. To further investigate the utilization of simulations reported in this journal, we reviewed 46 articles published in 5 recent issues of Genetic Epidemiology. Excluding 11 reports and theoretical and review articles, all remaining 35 method development papers used simulations to validate their methods and compare statistical power. Among these 35 papers, 17 simulated genotypes on unlinked loci using pre-specified minor allele frequency, 8 used existing simulators, 6 resampled empirical data (3 of which used data from the 1000 Genomes Project [Abecasis, et al. 2012]), 2 simulated pedigree-based data using the gene-dropping method, and 4 developed their own methods. In summary, only 23% (8 of 35) of these papers used existing simulators.
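The most common approach found in this review, simulating genotypes on unlinked loci from pre-specified minor allele frequencies, needs only a few lines of code, which may partly explain why many authors do not reach for an existing simulator. The sketch below assumes Hardy-Weinberg equilibrium (each genotype is the sum of two independent allele draws) and independence between loci; the frequencies, sample size, and function name are hypothetical and are not taken from any of the reviewed papers.

```python
import random

def simulate_unlinked_genotypes(n_individuals, mafs, rng):
    """Genotype matrix (individuals x loci) of minor-allele counts 0, 1 or 2.

    Loci are drawn independently of one another (unlinked), and the two
    alleles at a locus are drawn independently (Hardy-Weinberg equilibrium).
    """
    return [
        [int(rng.random() < maf) + int(rng.random() < maf) for maf in mafs]
        for _ in range(n_individuals)
    ]

mafs = [0.05, 0.2, 0.4]  # hypothetical minor allele frequencies
genotypes = simulate_unlinked_genotypes(2000, mafs, random.Random(7))

# Observed allele frequency per locus, for comparison with the specified values.
observed = [
    sum(row[j] for row in genotypes) / (2 * len(genotypes))
    for j in range(len(mafs))
]
```

Data generated this way contain no linkage disequilibrium and no population history, which is the simplification that the more realistic simulators are designed to remove.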

Discussion

We summarized the properties and applications of 93 simulators catalogued at the GSR website. The properties that we have discussed include programming language, supported platforms, interface, license, targets of simulation, simulation methods, and the number of evolutionary features provided by each simulator. We collected publications that cite these 93 simulators and summarized the utilization of the genetic simulators, both individually and in groups categorized by the initial release year and simulation method. We used Genetic Epidemiology as an example to study the applications of genetic simulations and simulators.

Implementing and maintaining a decent software package to be used by others requires a good initial design, enough resources to properly implement and test the features, and continued support to keep up with changing research and computing environments. Unfortunately, such requirements can rarely be met in an academic environment. A large number of simulators were designed and implemented by graduate students without formal training or the required experience in computer science and software engineering, and there were rarely enough resources for proper testing and documentation of simulation packages. Despite great efforts from the authors of packages, the usability of genetic simulators leaves much to be desired. Installing a simulator from source code is seldom a smooth process, and it can take a lot of guesswork to figure out what a simulator does, how it works, and how to use it. When a simulator actually runs, users might encounter various problems with stability, efficiency, and bugs. Moreover, the high dropout rate of simulators implies that support is unavailable for about half of the simulators. Such difficulties have likely contributed to the low utilization rate of genetic simulators.

GSR helps mitigate the challenge of identifying the right simulation tools for particular research topics by providing a basic description, a summary of features, and a collection of sample applications of simulators. It does not, however, address the issues with the quality of simulators. During a workshop titled Genetic Simulation Tools for Post-Genome Wide Association Studies of Complex Diseases, held in Bethesda, Maryland, March 11–12, 2014, a GSR certification program was proposed. This program certifies and promotes genetic simulators that meet criteria in areas such as accessibility, documentation, usability, application, and support. It aims to establish and encourage the enforcement of standards for the development of genetic simulators and can hopefully help improve the quality of genetic simulators [Huann-Sheng Chen 2014].

Whereas the availability of documentation, unit tests, examples, and sample outputs could be easily measured, real-world applicability and correctness are harder to gauge. For these issues, in-depth reviews of a smaller number of simulators [Yang, et al. 2014] complement review articles that list on-paper features without testing their usability. The GSR website allows visitors to leave comments about usability, applicability, and problems with simulators in real-world applications. The GSR certification program will also provide standard test cases to test simulators under common assumptions. Nevertheless, because of the differences in assumptions, methods, and implementations, datasets simulated by different simulators could have quite different properties [Xu, et al. 2013]. Because of a lack of detailed information about the implementation of different models, it is often challenging for users to know exactly how these simulators work and identify the sources of differences. This is another reason why some researchers tend to write their own simulators, although the quality and correctness of such disposable simulators are also questionable.

Although genetic simulations have been ubiquitously used to validate statistical methods and evaluate their power, most of them do not use existing simulators. The reasons for this phenomenon can be multifold. First, as we have discussed, because of incomplete information about available simulators and potential issues with existing simulators, researchers might not be aware of a suitable simulator, or might choose to implement their own simulation program to gain complete control over the simulation process. Second, many publications in recent issues of Genetic Epidemiology simulate highly hypothetical datasets based on simple assumptions (e.g. unlinked loci). Complex simulators are not necessary in this context. The use of overly simplified simulated datasets for rare variant association analysis reflects the status of this field and might have led to inaccurate assessments of the power of rare variant association tests and inappropriate design of sequencing-based epidemiological studies. We expect the use of more sophisticated simulation methods when new methods start to tackle more complex genotype structure and genotype-phenotype associations. Finally, a method that addresses questions under particular assumptions might need a new type of simulated data that is not supported by any of the existing simulators. In this case, a researcher might have no choice but to develop a new method to simulate the needed datasets. Demand for certain types of simulation will be a primary factor in their development and release for public use.

Whereas some simpler simulators such as ms [Hudson 2002] and cosi [Schaffner, et al. 2005] could be distributed in source code form and left untouched for years, most simulators, especially those that support multiple platforms and depend on multiple third-party libraries, need to be actively maintained to keep up with evolving computer environments and research needs. Unfortunately, authors of simulators might not be able to maintain the simulators after a few years, especially when they leave academia or switch research areas after moving to other institutions. Even authors with a stable position might not be willing to maintain a simulator, because the maintenance of a simulator could become a never-ending chore without much reward, and the workload would only increase with increasing power and popularity of the simulator. Because of the difficulty in securing funding for the continued development of software, other forms of support might be needed. For example, a “Google Summer of Code”-like program was proposed during the aforementioned workshop to give authors or their students short-term support to develop or maintain genetic simulators.

We categorized simulators according to the simulation methods used. Despite the fact that there are now more forward-time simulators than coalescent-based simulators, coalescent-based simulators are still used more often than forward-time simulators, accounting for overall 57% of all applications in comparison to 15% for forward-time simulators. Although the “market share” of coalescent-based simulators has been declining steadily, it will continue to be the leading simulation method for years to come (Fig. 5).

Figure 5.

Distribution of simulation methods for publications that cite 93 catalogued simulators, from year 1990 to 2013. We group simulators by the simulation methods they use and count the number of publications that cite simulators in each group. Articles that cite more than one simulator or a simulator using multiple methods will be counted multiple times.

A new simulator is needed if existing simulators do not provide a feature of interest. Instead of writing a simulator from scratch, it can be much easier to start from an existing simulator with proven reliability and well-documented source code. As a matter of fact, many popular simulators have been extended by others to be applied to new application areas. Examples of such extensions include msHOT and mbs [Teshima and Innan 2009] to ms [Hudson 2002], quantiNemo [Neuenschwander, et al. 2008] to Nemo [Guillaume and Rougemont 2006], and fastSimCoal [Excoffier and Foll 2011] and Bayesian Serial SimCoal to SimCoal [Excoffier, et al. 2000]. Another way to write a new simulator is to write it in a higher-level language utilizing existing simulation libraries. simuPOP [Peng and Kimmel 2005] is currently the only option for this approach. Because of its flexible design, simuPOP has been used to simulate very complex evolutionary scenarios, and used as the simulation engine for 6 simulators such as srv [Peng and Liu 2010], GENS2 [Pinelli, et al. 2012], SEQPower [Wang, et al. 2014], and Variant Simulation Tools [Peng 2014]. Because utilizing an existing library is much easier and less error-prone than writing or extending a simulator in a lower-level language, the development and use of simulation libraries are strongly encouraged.

With the rapid changes in research topics in genetic epidemiology and other fields, new simulators are being developed at an ever-increasing pace. The resources provided by GSR are expected to help users identify appropriate simulators for their research topics and to help developers promote the use of their simulators. The GSR team is working on GSR certification and other initiatives proposed during the workshop to promote the development and application of genetic simulators.

Acknowledgments

The development of GSR is supported by contract HHSN261201100558P from the National Cancer Institute. This work was supported in part by grant 1R01HG005859 from the National Institutes of Health, the Chapman Foundation, the Michael and Susan Dell Foundation (honoring Lorraine Dell), and MD Anderson Cancer Center Support Grant P30 CA016672. The GSR team thanks all authors who verified the attributes of their packages, and all visitors who have emailed us with their suggestions.

References

  1. Abecasis GR, Auton A, Brooks LD, DePristo MA, Durbin RM, Handsaker RE, Kang HM, Marth GT, McVean GA. An integrated map of genetic variation from 1,092 human genomes. Nature. 2012;491(7422):56–65. doi: 10.1038/nature11632.
  2. Arenas M. Simulation of molecular data under diverse evolutionary scenarios. PLoS Comput Biol. 2012;8(5):e1002495. doi: 10.1371/journal.pcbi.1002495.
  3. Balloux F. EASYPOP (version 1.7): a computer program for population genetics simulations. J Hered. 2001;92(3):301–302. doi: 10.1093/jhered/92.3.301.
  4. Carvajal-Rodriguez A. Simulation of genes and genomes forward in time. Curr Genomics. 2010;11(1):58–61. doi: 10.2174/138920210790218007.
  5. Chadeau-Hyam M, Hoggart CJ, O'Reilly PF, Whittaker JC, De Iorio M, Balding DJ. Fregene: simulation of realistic sequence-level data in populations and ascertained samples. BMC Bioinformatics. 2008;9:364. doi: 10.1186/1471-2105-9-364.
  6. Chen GK, Marjoram P, Wall JD. Fast and flexible simulation of DNA sequence data. Genome Res. 2009;19(1):136–142. doi: 10.1101/gr.083634.108.
  7. Csillery K, Blum MG, Gaggiotti OE, Francois O. Approximate Bayesian Computation (ABC) in practice. Trends Ecol Evol. 2010;25(7):410–418. doi: 10.1016/j.tree.2010.04.001.
  8. Daetwyler HD, Calus MP, Pong-Wong R, de Los Campos G, Hickey JM. Genomic prediction in animals and plants: simulation of data, validation, reporting, and benchmarking. Genetics. 2013;193(2):347–365. doi: 10.1534/genetics.112.147983.
  9. Dalquen DA, Anisimova M, Gonnet GH, Dessimoz C. ALF--a simulation framework for genome evolution. Mol Biol Evol. 2012;29(4):1115–1123. doi: 10.1093/molbev/msr268.
  10. De Mita S, Siol M. EggLib: processing, analysis and simulation tools for population genetics and genomics. BMC Genet. 2012;13:27. doi: 10.1186/1471-2156-13-27.
  11. Edwards T, Bush W, Turner S, Dudek S, Torstenson E, Schmidt M, Martin E, Ritchie M. Generating Linkage Disequilibrium Patterns in Data Simulations Using genomeSIMLA. In: Marchiori E, Moore J, editors. Evolutionary Computation, Machine Learning and Data Mining in Bioinformatics. Berlin Heidelberg: Springer; 2008. pp. 24–35.
  12. Epperson BK, McRae BH, Scribner KIM, Cushman SA, Rosenberg MS, Fortin M-J, James PMA, Murphy M, Manel S, Legendre P, et al. Utility of computer simulations in landscape genetics. Molecular Ecology. 2010;19(17):3549–3564. doi: 10.1111/j.1365-294X.2010.04678.x.
  13. Ewing G, Hermisson J. MSMS: a coalescent simulation program including recombination, demographic structure and selection at a single locus. Bioinformatics. 2010;26(16):2064–2065. doi: 10.1093/bioinformatics/btq322.
  14. Excoffier L, Foll M. fastsimcoal: a continuous-time coalescent simulator of genomic diversity under arbitrarily complex evolutionary scenarios. Bioinformatics. 2011;27(9):1332–1334. doi: 10.1093/bioinformatics/btr124.
  15. Excoffier L, Novembre J, Schneider S. SIMCOAL: a general coalescent program for the simulation of molecular data in interconnected populations with arbitrary demography. J Hered. 2000;91(6):506–509. doi: 10.1093/jhered/91.6.506.
  16. Griebel T, Zacher B, Ribeca P, Raineri E, Lacroix V, Guigo R, Sammeth M. Modelling and simulating generic RNA-Seq experiments with the flux simulator. Nucleic Acids Res. 2012;40(20):10073–10083. doi: 10.1093/nar/gks666.
  17. Guillaume F, Rougemont J. Nemo: an evolutionary and population genetics programming framework. Bioinformatics. 2006;22(20):2556–2557. doi: 10.1093/bioinformatics/btl415.
  18. Hampe J, Wienker T, Schreiber S, Nurnberg P. POPSIM: a general population simulation program. Bioinformatics. 1998;14(5):458–464. doi: 10.1093/bioinformatics/14.5.458.
  19. Hernandez RD. A flexible forward simulator for populations subject to selection and demography. Bioinformatics. 2008;24(23):2786–2787. doi: 10.1093/bioinformatics/btn522.
  20. Hoban S, Bertorelle G, Gaggiotti OE. Computer simulations: tools for population and evolutionary genetics. Nat Rev Genet. 2011;13(2):110–122. doi: 10.1038/nrg3130.
  21. Huann-Sheng Chen CMH, Mechanic Leah E, Amos Christopher I, Bafna Vineet, Hauser Elizabeth, Hernandez Ryan D, Li Chun, Liberles David A, McAllister Kimberly, Moore Jason H, Paltoo Dina N, Papanicolaou George, Peng Bo, Ritchie Marylyn D, Rosenfeld Gabriel, Witte John S, Gillanders Elizabeth M, Feuer Eric J. Genetic Simulation Tools for Post-Genome Wide Association Studies of Complex Diseases. Genetic Epidemiology. 2014 doi: 10.1002/gepi.21870.
  22. Hudson RR. Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics. 2002;18(2):337–338. doi: 10.1093/bioinformatics/18.2.337.
  23. Kingman JFC. The coalescent. Stochastic Processes and their Applications. 1982;13(3):235–248.
  24. Kingman JFC. Origins of the Coalescent: 1974–1982. Genetics. 2000;156(4):1461–1463. doi: 10.1093/genetics/156.4.1461.
  25. Lambert BW, Terwilliger JD, Weiss KM. ForSim: a tool for exploring the genetic architecture of complex traits with controlled truth. Bioinformatics. 2008;24(16):1821–1822. doi: 10.1093/bioinformatics/btn317.
  26. Leal SM, Yan K, Muller-Myhsok B. SimPed: a simulation program to generate haplotype and genotype data for pedigree structures. Hum Hered. 2005;60(2):119–122. doi: 10.1159/000088914.
  27. Lemire M, Roslin NM, Laprise C, Hudson TJ, Morgan K. Transmission-ratio distortion and allele sharing in affected sib pairs: a new linkage statistic with reduced bias, with application to chromosome 6q25.3. Am J Hum Genet. 2004;75(4):571–586. doi: 10.1086/424528.
  28. Li C, Li M. GWAsimulator: a rapid whole-genome simulation program. Bioinformatics. 2008;24(1):140–142. doi: 10.1093/bioinformatics/btm549.
  29. Liang L, Zollner S, Abecasis GR. GENOME: a rapid coalescent-based whole genome simulator. Bioinformatics. 2007;23(12):1565–1567. doi: 10.1093/bioinformatics/btm138.
  30. Mailund T, Schierup MH, Pedersen CN, Mechlenborg PJ, Madsen JN, Schauser L. CoaSim: a flexible environment for simulating genetic data under coalescent models. BMC Bioinformatics. 2005;6:252. doi: 10.1186/1471-2105-6-252.
  31. Marjoram P, Tavare S. Modern computational approaches for analysing molecular genetic variation data. Nat Rev Genet. 2006;7(10):759–770. doi: 10.1038/nrg1961.
  32. Montana G. HapSim: a simulation tool for generating haplotype data with pre-specified allele frequencies and LD coefficients. Bioinformatics. 2005;21(23):4309–4311. doi: 10.1093/bioinformatics/bti689.
  33. Neuenschwander S. AQUASPLATCHE: a program to simulate genetic diversity in populations living in linear habitats. Molecular Ecology Notes. 2006;6(3):583–585.
  34. Neuenschwander S, Hospital F, Guillaume F, Goudet J. quantiNemo: an individual-based program to simulate quantitative traits with explicit genetic architecture in a dynamic metapopulation. Bioinformatics. 2008;24(13):1552–1553. doi: 10.1093/bioinformatics/btn219.
  35. Ott J. Computer-simulation methods in human linkage analysis. Proc Natl Acad Sci U S A. 1989;86(11):4175–4178. doi: 10.1073/pnas.86.11.4175.
  36. Peng B. Reproducible simulations of realistic samples for next-gen sequencing studies using Variant Simulation Tools. 2014 doi: 10.1002/gepi.21867.
  37. Peng B, Chen HS, Mechanic LE, Racine B, Clarke J, Clarke L, Gillanders E, Feuer EJ. Genetic Simulation Resources: a website for the registration and discovery of genetic data simulators. Bioinformatics. 2013;29(8):1101–1102. doi: 10.1093/bioinformatics/btt094.
  38. Peng B, Kimmel M. simuPOP: a forward-time population genetics simulation environment. Bioinformatics. 2005;21(18):3686–3687. doi: 10.1093/bioinformatics/bti584.
  39. Peng B, Liu X. Simulating sequences of the human genome with rare variants. Hum Hered. 2010;70(4):287–291. doi: 10.1159/000323316.
  40. Pinelli M, Scala G, Amato R, Cocozza S, Miele G. Simulating gene-gene and gene-environment interactions in complex diseases: Gene-Environment iNteraction Simulator 2. BMC Bioinformatics. 2012;13:132. doi: 10.1186/1471-2105-13-132.
  41. Ritchie MD, Bush WS. Genome simulation approaches for synthesizing in silico datasets for human genomics. Adv Genet. 2010;72:1–24. doi: 10.1016/B978-0-12-380862-2.00001-1.
  42. Schaffner SF, Foo C, Gabriel S, Reich D, Daly MJ, Altshuler D. Calibrating a coalescent simulation of human genome sequence variation. Genome Res. 2005;15(11):1576–1583. doi: 10.1101/gr.3709305.
  43. Schmidt M, Hauser ER, Martin ER, Schmidt S. Extension of the SIMLA package for generating pedigrees with complex inheritance patterns: environmental covariates, gene-gene and gene-environment interaction. Stat Appl Genet Mol Biol. 2005;4 doi: 10.2202/1544-6115.1133. Article 15.
  44. Stoye J, Evers D, Meyer F. Rose: generating sequence families. Bioinformatics. 1998;14(2):157–163. doi: 10.1093/bioinformatics/14.2.157.
  45. Su Z, Marchini J, Donnelly P. HAPGEN2: simulation of multiple disease SNPs. Bioinformatics. 2011;27(16):2304–2305. doi: 10.1093/bioinformatics/btr341.
  46. Terwilliger JD, Speer M, Ott J. Chromosome-based method for rapid computer simulation in human genetic linkage analysis. Genet Epidemiol. 1993;10(4):217–224. doi: 10.1002/gepi.1370100402.
  47. Teshima KM, Innan H. mbs: modifying Hudson's ms software to generate samples of DNA sequences with a biallelic site under selection. BMC Bioinformatics. 2009;10:166. doi: 10.1186/1471-2105-10-166.
  48. Truszkowski J, Hao Y, Brown DG. Towards a practical O(nlogn) phylogeny algorithm. Algorithms Mol Biol. 2012;7(1):32. doi: 10.1186/1748-7188-7-32.
  49. Wang GT, Li B, Lyn Santos-Cortez RP, Peng B, Leal SM. Power analysis and sample size estimation for sequence-based association studies. Bioinformatics. 2014 doi: 10.1093/bioinformatics/btu296.
  50. Wright FA, Huang H, Guan X, Gamiel K, Jeffries C, Barry WT, de Villena FP, Sullivan PF, Wilhelmsen KC, Zou F. Simulating association studies: a data-based resampling method for candidate regions or whole genome scans. Bioinformatics. 2007;23(19):2581–2588. doi: 10.1093/bioinformatics/btm386.
  51. Xu Y, Wu Y, Song C, Zhang H. Simulating realistic genomic data with rare variants. Genet Epidemiol. 2013;37(2):163–172. doi: 10.1002/gepi.21696.
  52. Yang T, Deng HW, Niu T. Critical assessment of coalescent simulators in modeling recombination hotspots in genomic sequences. BMC Bioinformatics. 2014;15:3. doi: 10.1186/1471-2105-15-3.
  53. Yang W, Gu CC. A whole-genome simulator capable of modeling high-order epistasis for complex disease. Genet Epidemiol. 2013;37(7):686–694. doi: 10.1002/gepi.21761.
  54. Yuan X, Miller DJ, Zhang J, Herrington D, Wang Y. An overview of population genetic data simulation. J Comput Biol. 2012;19(1):42–54. doi: 10.1089/cmb.2010.0188.
  55. Zhang F, Liu J, Chen J, Deng HW. HAPSIMU: a genetic simulation platform for population-based association studies. BMC Bioinformatics. 2008;9:331. doi: 10.1186/1471-2105-9-331.
