Abstract
Background
The field of population genetics use the genetic composition of populations to study the effects of ecological and evolutionary factors, including selection, genetic drift, mating structure, and migration. Until recently, these studies were usually based upon the analysis of relatively few (typically 10–20) DNA markers on samples from multiple populations. In contrast, high-throughput sequencing provides large amounts of data and consequently very high resolution genetic information. Recent technological developments are rapidly making this a cost-effective alternative. In addition, sequencing allows both the direct study of genomic differences between population, and the discovery of single nucleotide polymorphism marker that can be subsequently used in high-throughput genotyping. Much of the analysis in population genetics was developed before large scale sequencing became feasible. Methods often do not take into account the characteristics of the different sequencing technologies, and consequently, may not always be well suited to this kind of data.
Results
Although the FlowSim suite of tools originally targeted simulation of de novo 454 genomics data, recent developments and enhancements makes it suitable also for simulating other kinds of data. We examine its application to population genomics, and provide examples and supplementary scripts and utilities to aid in this task.
Conclusions
Simulation is an important tool to study and develop methods in many fields, and here we demonstrate how to simulate a high-throughput sequencing dataset for population genomics.
Keywords: Simulation, Second-generation sequencing, Population genomics, Shotgun metagenomics, SNP
Background
Simulation is an important tool for developing and experimenting with methods for analysis of sequencing data. Several simulators exist, usually targeting specific data types or analyses. For instance, MetaSim [1] targets metagenomic samples, and SimSeq (St. John, unpublished) and Wgsim [2] target Illumina sequences.
As implied by the name, FlowSim [3] was originally developed for simulation of de novo genomics data on the 454 platform. Since its inception, it has grown into a flexible suite of tools that can be applied to a number of different uses, and here we demonstrate how it can simulate a population genomics data set consisting of Illumina reads.
A sequencing dataset for population genomics typically consists of reads from pools of individuals from a species, where each pool is taken from a specific populations or subpopulation of interest. By identifying and quantifying variants in the different pools, one can calculate the degree of divergence and population structure between the populations. In turn, this information can be used to study evolution [4,5], quantitative traits [6], and also constitutes an important tool for estimating biological diversity.
The FlowSim suite
The current version of FlowSim (0.3.5) consists of several independent components, as illustrated in Figure 1. Each component is implemented as a separate tool, using FASTA-formatted sequences for input and output. (The exception is flowsim, which outputs the native SFF file format. FASTA-formatted sequence can be trivially extracted, e.g. using the flower[7] tool). This makes it easy for the user to construct a custom simulation pipeline tailored to his or her needs. Here, we will make use of clonesim to generate sets of reads, mutator to simulate genetic divergence in the form of SNPs as well as simulating sequencing errors in the simulated reads, and duplicator to introduce artificial duplicates.
Methods and results
Under the current simulations, a population consists of a number of individuals with specific genetic variations. For simplicity, we will consider our populations as a sets of genome sequences, each similar to a reference genome, but differing in a set of locations with unique substitutions. We will refer to these genomes as the haplotypes of the population. Each haplotype (and thus its specific genomic variants) occurs with a specific frequency in the population as a whole.
Starting with a single haplotype (i.e., a reference genome or chromosome), we generate the new haplotypes by introducing random mutations. The mutations are identified, and noted separately. The resulting haplotypes are then concatenated in desired multiplicities into a combined genome representing each population, and sets of simulated reads are generated by selecting fragments randomly from the the population genomes. Finally, to simulate sequencing errors, artifacts [8], and the occurrence of rare variants [9], the reads have additional variations introduced. Also, a random selection of reads are output multiple times in order to simulate the occurrence of artificial duplicates [10,11].
Implementation
We will presume that our reference genome exists in a file called genome.fasta. First the set of haplotypes are generated by using mutator to randomly introduce on average five mutations per kilobase, using the option -s 0.005. To simplify analysis, we do not introduce indels (-i 0). The following script implements the analysis.
Although here we generate intermediate files, each step can also read from standard input and write to standard output. Thus, intermediate files can be omitted using UNIX pipes.
The next step simply concatenates the haplotypes in different proportions to construct the population genomes, p1.fasta and p2.fasta. Here, we combined the three haplotypes H1, H2 and H3 in proportions of 1:2:3 in population P1, and 3:2:1 in population P2, as shown in Figure 2. As a result, an allele present in H1 (i.e., H1.fasta) will have a minor allele frequency of 0.167 in population P1, and 0.5 in P2, giving it an a prioriF ST of 0.125, while variant alleles in H2 will occur with an equal minor allele frequency (of 0.333) in either population, resulting in an F st of 0.
In step three, we can use clonesim to generate reads by extracting 20 M (-c 2000000) random fragments of exactly 100 bp length (using the -l option to set the length distribution to Uniform 100 100). The generated reads are exact copies of fragments of the reference genome, and in order to simulate sequencing errors and rare variants, in step four we again apply mutator, this time allowing indels as well as substitutions. Finally, we randomly duplicate some of the reads, using the duplicator tool.
Additional analysis
FlowSim provides the basic building blocks for simulating the sequencing process, but analysis often depends on additional information, and sometimes requires intermediate steps to adapt the data.
A natural step in the analysis of sequence reads, simulated or otherwise, is to map them to a reference genome. This is also useful to verify that the data exhibits the expected properties, like coverage distribution or error rates. The simulation here produces FASTA sequences, but most short read mapping software accept FASTQ as input. Converting from FASTA to FASTQ is a simple task, here a small tool (called fasta2fastq) was written to perform this conversion.
To separate the haplotype variants from simulated sequencing errors, another small tool (snplist) were written to generate the list of variants per haplotype. This compares each haplotype with the reference genome, and outputs a list of the variant positions with reference and alternative allele. To simplify this process, it is conveneint to add the variants identification to e.g. the output from VCFtools [12] or similar variant callers, the following script can be used for this purpose:
Discussion and conclusion
As FlowSim is primarily targeted at accurate simulation of 454 sequencing, in the present study, we have applied a simplistic model for Illumina sequences. For instance, the probability of error is uniform along each read, and independent of base, and factors that can cause sequencing bias, like e.g. the read’s GC content [13] or strand [14], are not taken into account. Sometimes a simple model suffices, and it can also make analysis simpler. However, the individual components of FlowSim can easily be replaced by custom tools, and if a more accurate sequencing model is required, it can be implemented separately, and integrated into the simulation pipeline.
Similarly, we could conceive of a more realistic model for the reference genome, in order to explore properties likely to affect our analysis. For instance, repeats caused by recent duplications (common in many plants and teleosts), transposons, or low complexity regions could have dramatic impacts on analysis. Also artifacts of the reference assembly, where chimeric contigs, collapsed repeats, and contamination could have substantial effects on the result. Again, the user is free to implement appropriate designs and insert them as separate stages in the simulation pipeline.
Here we have explored the use of FlowSim for a population genetics study. A similar approach would also allow it to be used for shotgun metagenomics. In that case, the populations would consist of genomes (haplotypes) from different species, instead of originating in a single reference genome. One might also consider mutations of haplotypes in more complex arrangements to emulate evolution of closely related species.
Availability and requirements
All materials are available on-line, see Table 1 for details. The scripts as well as other tools mentioned are released into the public domain. The documentation for the FlowSim pipeline is available from the Biohaskell Wiki. FlowSim itself is available as Open Source software under the General Public License (GPL) version 2.0.
Table 1.
FlowSim source |
http://malde.org/~ketil/biohaskell/flowsim |
---|---|
code repository | |
Documentation |
http://biohaskell.org/Applications/FlowSim |
Supporting scripts | http://malde.org/~ketil/flowsim-extras |
Competing interests
The author declare that he has no competing interests.
Acknowledgements
Thanks to Kevin Glover for helpful comments on the manuscript. This article was funded in part by the Research Council of Norway through the SALMAT project (HAVBRUK 226221).
References
- Richter DC, Ott F, Auch AF, Schmid R, Huson DH. Metasim–a sequencing simulator for genomics and metagenomics. PLoS ONE. 2008;3(10):3373. doi: 10.1371/journal.pone.0003373. doi:10.1371/journal.pone.0003373. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R. 1000 Genome Project Data Processing Subgroup. The sequence alignment/map format and SAM tools. Bioinformatics. 2009;25(16):2078–2079. doi: 10.1093/bioinformatics/btp352. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Balzer S, Malde K, Lanzén A, Sharma A, Jonassen I. Characteristics of 454 pyrosequencing data - enabling realistic simulation with flowsim. Bioinformatics. 2010;26(18):i420–i425. doi: 10.1093/bioinformatics/btq365. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Tajima F. Evolutionary relationship of dna sequences in finite populations. Genetics. 1983;105(2):437–460. doi: 10.1093/genetics/105.2.437. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Turner TL, Bourne EC, Von Wettberg EJ, Hu TT, Nuzhdin SV. Population resequencing reveals local adaptation of arabidopsis lyrata to serpentine soils. Nat Genet. 2010;42(3):260–263. doi: 10.1038/ng.515. [DOI] [PubMed] [Google Scholar]
- Calvo SE, Tucker EJ, Compton AG, Kirby DM, Crawford G, Burtt NP, Rivas M, Guiducci C, Bruno DL, Goldberger OA, Redman MC, Wiltshire E, Wilson CJ, Altshuler D, Gabriel SB, Daly MJ, Thorburn DR, Mootha VK. High-throughput, pooled sequencing identifies mutations in nubpl and foxred1 in human complex i deficiency. Nat Genet. 2010;42(10):851–858. doi: 10.1038/ng.659. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Malde K. Flower: extracting information from pyrosequencing data. Bioinformatics. 2011;27(7):1041–1042. doi: 10.1093/bioinformatics/btr063. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Balzer S, Malde K, Jonassen I. Systematic exploration of error sources in pyrosequencing flowgram data. Bioinformatics. 2011;27(13):304–309. doi: 10.1093/bioinformatics/btr251. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Bhatia G, Patterson N, Sankararaman S, Price AL. Estimating and interpreting fst: the impact of rare variants. Genome Res. 2013;23(9):1514–1521. doi: 10.1101/gr.154831.113. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Gomez-Alvarez V, Teal TK, Schmidt TM. Systematic artifacts in metagenomes from complex microbial communities. ISME J. 2009;3:1314–1317. doi: 10.1038/ismej.2009.72. [DOI] [PubMed] [Google Scholar]
- Balzer S, Malde K, Grohme M, Jonassen I. Filtering duplicate reads from 454 pyrosequencing data. Bioinformatics. 2013;29(7):830–836. doi: 10.1093/bioinformatics/btt047. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Danecek P, Auton A, Abecasis G, Albers CA, Banks E, DePristo MA, Handsaker RE, Lunter G, Marth GT, Sherry ST, McVean G, Durbin R. 1000 Genomes Project Analysis Group. The variant call format and vcftools. Bioinformatics. 2011;27(15):2156–2158. doi: 10.1093/bioinformatics/btr330. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ross MG, Russ C, Costello M, Hollinger A, Lennon NJ, Hegarty R, Nusbaum C, Jaffe DB. Characterizing and measuring bias in sequence data. Genome Biol. 2013;14:R51. doi: 10.1186/gb-2013-14-5-r51. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Guo1 Y, Li J, Li C-I, Long J, Samuels DC, Shyr Y. The effect of strand bias in illumina short-read sequencing data. BMC Genomics. 2012;13:666. doi: 10.1186/1471-2164-13-666. [DOI] [PMC free article] [PubMed] [Google Scholar]