Abstract
Summary: forqs is a forward-in-time simulation of recombination, quantitative traits and selection. It was designed to investigate haplotype patterns resulting from scenarios where substantial evolutionary change has taken place in a small number of generations due to recombination and/or selection on polygenic quantitative traits.
Availability and implementation: forqs is implemented as a command-line C++ program. Source code and binary executables for Linux, OSX and Windows are freely available under a permissive BSD license: https://bitbucket.org/dkessner/forqs.
Contact: jnovembre@uchicago.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
1 INTRODUCTION
Simulations have a long history in population genetics, both for verifying analytical results and for exploring population models that are mathematically intractable. Population genetics simulations can be broadly classified as forward-in-time (e.g. Wright–Fisher) or backward-in-time (e.g. coalescent). Coalescent simulations [e.g. ms (Hudson, 2002), MaCS (Chen et al., 2009), fastsimcoal (Excoffier and Foll, 2011)] are efficient for simulating neutral sequence data because they only need to track lineages that are ancestral to the sample. Although it is possible to simulate certain selection scenarios within the coalescent framework (Ewing and Hermisson, 2010; Hudson and Kaplan, 1988), one must turn to forward-in-time simulations to model selection in a flexible way.
Many forward-in-time simulators are currently available. Most of these simulators use a mutation-centric approach, implemented by storing the mutations carried by individuals in an array. To handle selection, the majority of these simulators assign selection coefficients to individual mutations [e.g. ForwSim (Padhukasahasram et al., 2008), Fregene (Chadeau–Hyam et al., 2008), GENOMEPOP (Carvajal–Rodriguez, 2008), SFS_CODE (Hernandez, 2008), TreesimJ (O’Fallon, 2010), SLiM (Messer, 2013)], although a few also include support for quantitative traits [e.g. ForSim (Lambert et al., 2008), quantiNemo (Neuenschwander et al., 2008), simuPOP (Peng and Kimmel, 2005)]. Hoban et al. (2011) and Yuan et al. (2012) are recent reviews providing a comprehensive comparison of these and other simulators.
In many scenarios of biological interest, substantial evolutionary change has taken place in a small number of generations due to recombination and/or selection on standing variation, rather than mutational input. For example, one may be interested in the genome-wide haplotype patterns that emerge from admixture between historically isolated populations (Wegmann et al., 2011) or from artificial selection on a quantitative trait. Studying these haplotype patterns can be difficult with existing forward-in-time simulators because detailed information about the mosaic haplotype structure of individuals is not readily available, and must be inferred from the output sequences of the simulation and/or stored recombination event data. In addition, forward-in-time simulators that store entire sequences incur a severe trade-off between the size of the genomic regions and the size of the populations simulated.
Motivated by such examples, we have implemented a new forward-in-time simulation approach that, instead of tracking single-site variants, tracks individual haplotype chunks as they recombine over multiple generations. Further, we have designed the simulator for fast simulation of quantitative traits under selection. We have labeled this software forqs (Forward-in-time simulation of Recombination, Quantitative Traits and Selection). Similar approaches have been implemented recently by Haiminen et al. (2013) and by Aberer and Stamatakis (2013) for the simple selection models with per-mutation fitness effects.
The haplotype-based design allows for fast simulation of whole genomes, with efficient memory usage. For example, forqs can easily simulate two populations (size 10 000 each) selected for different optimal trait values, where individuals have human-sized genomes (23 chromosome pairs, 100 Mb each), taking ∼2 s/generation. For comparison, existing forward simulators are limited by the amount of sequence that can be stored in arrays in memory: for the aforementioned 20 000 individuals, 16 GB of memory would permit the storage of only 3.2 million base pairs of sequence per individual, which is an order of magnitude smaller than the smallest human chromosome. The forqs’ design also preserves information about the haplotype structure of individuals, which allows for immediate identification of genomic regions where individuals share identical-by-descent haplotype tracts.
Our simulator uses a modular architecture to allow the user to flexibly specify recombination maps, mutation rates, demographic models, quantitative traits and fitness functions. This modular approach facilitates simulation of complicated scenarios and investigation of the resulting haplotype patterns. forqs is currently under active development to support ongoing projects.
2 DESIGN AND IMPLEMENTATION
forqs begins with a set of founding haplotypes carried by the individuals in the initial generation. Individuals are diploid and carry a user-specified number of chromosome pairs. By assigning a unique identifier to each founding haplotype, individual haplotype chunks are tracked as they recombine over subsequent generations (Fig. 1). For the purposes of simulation, any existing neutral variation on the haplotype chunks can be ignored, and only those loci with fitness effects need to be tracked.
forqs performs the following actions during a single cycle of the simulation: (i) generation of new populations, (ii) genotyping, (iii) quantitative trait evaluation, (iv) fitness evaluation and (v) reporting. forqs has a flexible design in which the simulator delegates specific tasks or calculations to configurable modules. The user specifies which modules to instantiate in a configuration file.
In addition to the primary modules that are used to specify demography, mutation, recombination, quantitative traits, fitness and reported output, there are several building block modules that provide basic functionality to the primary modules. For example, Trajectory modules provide a unified method for specifying values that change over time, such as population sizes or migration rates. Similarly, Distribution modules can be used to specify how to draw particular random values [e.g. quantitative trait loci (QTL) positions or allele frequencies).
As an illustration of forqs configuration, suppose that a user wanted to simulate populations undergoing neutral admixture. The user would specify a PopulationConfigGenerator module representing a stepping stone or island model with the desired population size and migration rate trajectories. However, the user would not specify any quantitative trait modules and would use the default FitnessFunction module that assigns identical fitness values to all individuals. On the other hand, to simulate an artificial selection experiment with truncation selection on a single quantitative trait, the user would specify the trait with QTLs and effect sizes, and choose a FitnessFunction module that selects the desired proportion of individuals to produce the next generation. Alternatively, the user could indicate that the QTLs and effect sizes should be drawn randomly from user-specified distributions.
The representation of chromosomes as haplotype chunks in forqs makes efficient use of memory, independent of the size of the chromosomes. On a typical laptop computer, for a population size of 1 million, simulations take ∼1.5 s/generation for neutral simulations and ∼3 s/generation with selection at a single locus. Decreasing the population size allows the simulation of a greater number of generations in a reasonable amount of time: a population size of 10 000 takes ∼3 s/100 generations (without selection, with a slight increase with selection). However, forqs’ design comes with the trade-off that memory usage grows linearly with the number of generations simulated due to recombination. Thus, for investigations focusing on mutational input over a large number of generations (e.g. studies involving demographic changes taking place over thousands of generations), forqs’ design is not as efficient as array-based implementations (e.g. SLiM or SFS_CODE) that were designed specifically for these scenarios. Similarly, we recommend that forqs be used in conjunction with a coalescent simulator to generate neutral variation, rather than running forqs for a long burn-in period to reach mutation-drift equilibrium.
forqs has been extensively tested for correctness, both at the level of individual code units and in its large-scale behavior in comparison with theoretical predictions from population genetics and quantitative genetics. Validation results, tutorials and documentation can be found in the Supplementary Information. Configuration files for all simulations mentioned in this article are included in the forqs software packages.
Supplementary Material
ACKNOWLEDGEMENTS
The authors would like to thank Alex Platt, Charleston Chiang, Eunjung Han and Diego Ortega Del Vecchyo for helpful comments on features, usability and documentation of the software.
Funding: National Institutes of Health (Training Grant in Genomic Analysis and Interpretation T32 HG002536 to D.K., R01 HG007089 to J.N.), the NSF (EF-0928690 to J.N.) and UCLA (Dissertation Year Fellowship to D.K.).
Conflict of Interest: none declared.
REFERENCES
- Aberer AJ, Stamatakis A. Rapid forward-in-time simulation at the chromosome and genome level. BMC Bioinformatics. 2013;14:216. doi: 10.1186/1471-2105-14-216. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Carvajal-Rodriguez A. GENOMEPOP: a program to simulate genomes in populations. BMC Bioinformatics. 2008;9:223. doi: 10.1186/1471-2105-9-223. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Chadeau-Hyam M, et al. Fregene: simulation of realistic sequence-level data in populations and ascertained samples. BMC Bioinformatics. 2008;9:364. doi: 10.1186/1471-2105-9-364. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Chen GK, et al. Fast and flexible simulation of DNA sequence data. Genome Res. 2009;19:136–142. doi: 10.1101/gr.083634.108. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ewing G, Hermisson J. MSMS: a coalescent simulation program including recombination, demographic structure and selection at a single locus. Bioinformatics. 2010;26:2064–2065. doi: 10.1093/bioinformatics/btq322. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Excoffier L, Foll M. Fastsimcoal: a continuous-time coalescent simulator of genomic diversity under arbitrarily complex evolutionary scenarios. Bioinformatics. 2011;27:1332–1334. doi: 10.1093/bioinformatics/btr124. [DOI] [PubMed] [Google Scholar]
- Haiminen N, et al. Efficient in silico chromosomal representation of populations via indexing ancestral genomes. Algorithms. 2013;6:430–441. [Google Scholar]
- Hernandez RD. A flexible forward simulator for populations subject to selection and demography. Bioinformatics. 2008;24:2786–2787. doi: 10.1093/bioinformatics/btn522. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Hoban S, et al. Computer simulations: tools for population and evolutionary genetics. Nat. Rev. Genet. 2011;13:110–122. doi: 10.1038/nrg3130. [DOI] [PubMed] [Google Scholar]
- Hudson RR. Generating samples under a wright-fisher neutral model of genetic variation. Bioinformatics. 2002;18:337–338. doi: 10.1093/bioinformatics/18.2.337. [DOI] [PubMed] [Google Scholar]
- Hudson RR, Kaplan NL. The coalescent process in models with selection and recombination. Genetics. 1988;120:831–840. doi: 10.1093/genetics/120.3.831. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lambert BW, et al. ForSim: a tool for exploring the genetic architecture of complex traits with controlled truth. Bioinformatics. 2008;24:1821–1822. doi: 10.1093/bioinformatics/btn317. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Messer PW. SLiM: simulating evolution with selection and linkage. Genetics. 2013;194:1037–1039. doi: 10.1534/genetics.113.152181. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Neuenschwander S, et al. QuantiNemo: an individual-based program to simulate quantitative traits with explicit genetic architecture in a dynamic metapopulation. Bioinformatics. 2008;24:1552–1553. doi: 10.1093/bioinformatics/btn219. [DOI] [PubMed] [Google Scholar]
- O’Fallon B. TreesimJ: a flexible, forward time population genetic simulator. Bioinformatics. 2010;26:2200–2201. doi: 10.1093/bioinformatics/btq355. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Padhukasahasram B, et al. Exploring population genetic models with recombination using efficient forward-time simulations. Genetics. 2008;178:2417–2427. doi: 10.1534/genetics.107.085332. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Peng B, Kimmel M. simuPOP: a forward-time population genetics simulation environment. Bioinformatics. 2005;21:3686–3687. doi: 10.1093/bioinformatics/bti584. [DOI] [PubMed] [Google Scholar]
- Wegmann D, et al. Recombination rates in admixed individuals identified by ancestry-based inference. Nat. Genet. 2011;43:847–853. doi: 10.1038/ng.894. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Yuan X, et al. An overview of population genetic data simulation. J. Comput. Biol. 2012;19:42–54. doi: 10.1089/cmb.2010.0188. [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.