Abstract
There has been a great increase in both the number of population genetic analysis programs and the size of data sets being studied with them. Since the file formats required by the most popular and useful programs are variable, automated reformatting or conversion between them is desirable. formatomatic is an easy-to-use program that can read allelic data files in genepop, raw (csv) or convert formats and create data files in nine formats: raw (csv), arlequin, genepop, immanc/bayesass +, migrate, newhybrids, msvar, baps and structure. Use of formatomatic should greatly reduce time spent reformatting data sets and avoid unnecessary errors.
Keywords: allele, conversion, data file, format, genotype, population genetic
Population genetic analysis programs are becoming ever more sophisticated and specialized. At the same time, the size of data sets being analysed with them is also increasing, thanks to automation and high-throughput molecular techniques. This combination requires ever more manipulation of large data sets into several specialized formats to be used by a combination of programs for population genetic analysis. Illustrating the problem, a recent review by Excoffier & Heckel (2006) discusses over 20 computer programs that together provide a powerful palette of analytical options, but which between them employ dozens of file types. Formatting a large genotypic data set for use with these and other programs necessitates automated tools to avoid errors and time expenditure on repetitive tasks.
formatomatic is a simple program designed to automate the task of converting files containing diploid allelic data between formats needed for population genetic analysis. It does not, as a rule, produce files that can be used immediately by an analysis program. Usually, run options, metadata and other details required by a particular program and analysis must be set by the user. However, with formatomatic, I aim to reduce or eliminate the error-prone and time-consuming task of reformatting the actual data set. formatomatic complements the functionality offered by another specialized file conversion tool, convert (Glaubitz 2004), and the data-importation abilities of some analysis programs.
Three specific advantages of formatomatic are worth highlighting: (i) the program is cross-platform, written in the Java programming language. It runs on any computer that has the Java Virtual Machine installed, including versions of Microsoft Windows, Apple Mac OS X, Unix and GNU/Linux. (ii) formatomatic is very easy to use. It consists of a simple graphical user interface where in-file and out-file names are set and formats selected. (iii) The source code is available for inspection and modification under the terms of the GPL, or General Public License. This permits users to add and contribute functionality as well as audit exactly how their data are being reformatted.
As of this writing, formatomatic can convert from files in genepop (Raymond & Rousset 1995), raw (csv) or convert (Glaubitz 2004) formats to raw (csv), arlequin (Excoffier et al. 2005), genepop, immanc/bayesass + (Rannala & Mountain 1997), migrate (Beerli 2006), newhybrids (Anderson & Thompson 2002), msvar (Beaumont 1999), baps (Corander et al. 2004) or structure (Pritchard et al. 2000; Falush et al. 2003) files. Documentation included with the program download provides more details on running formatomatic and on the file formats it can accept or produce. The program and documentation may be downloaded from: http://taylor0.biology.ucla.edu/~manoukis/Pub_programs/Formatomatic.
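To make the kind of conversion described above concrete, the following is a minimal sketch in Java of reading genepop-style text and writing a simple comma-separated table. It is not taken from the formatomatic source code. The class name, the method names and the CSV column layout (population index, individual label, then two allele columns per locus) are assumptions made for illustration, and formatomatic's own raw (csv) layout may differ. The sketch assumes the usual genepop conventions: a title line, locus names (one per line or comma-separated), a "Pop" line before each population, and one line per individual with a label, a comma, and one 4- or 6-digit diploid genotype per locus.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only: genepop-style text to a hypothetical CSV layout. */
class GenepopToCsvSketch {

    /** Splits a diploid genotype code such as "0102" or "120124" into its two alleles. */
    static String[] splitGenotype(String code) {
        if (code.length() != 4 && code.length() != 6) {
            throw new IllegalArgumentException("Expected a 4- or 6-digit genotype: " + code);
        }
        int half = code.length() / 2;
        return new String[] { code.substring(0, half), code.substring(half) };
    }

    /** Converts genepop-style text to CSV rows: pop,individual,<locus>_1,<locus>_2,... */
    static String convert(String genepopText) {
        String[] lines = genepopText.split("\\R");
        List<String> loci = new ArrayList<>();
        StringBuilder rows = new StringBuilder();
        int pop = 0; // 0 means we are still reading locus names

        // Line 0 is the free-text title, so parsing starts at line 1.
        for (int i = 1; i < lines.length; i++) {
            String line = lines[i].trim();
            if (line.isEmpty()) {
                continue;
            }
            if (line.equalsIgnoreCase("pop")) {
                pop++; // a "Pop" line opens the next population
            } else if (pop == 0) {
                // Locus names may be one per line or comma-separated on one line.
                for (String name : line.split(",")) {
                    if (!name.trim().isEmpty()) {
                        loci.add(name.trim());
                    }
                }
            } else {
                int comma = line.indexOf(',');
                if (comma < 0) {
                    throw new IllegalArgumentException("No comma in individual line: " + line);
                }
                String label = line.substring(0, comma).trim();
                String[] genotypes = line.substring(comma + 1).trim().split("\\s+");
                if (genotypes.length != loci.size()) {
                    throw new IllegalArgumentException("Wrong number of loci for " + label);
                }
                rows.append(pop).append(',').append(label);
                for (String genotype : genotypes) {
                    String[] alleles = splitGenotype(genotype);
                    rows.append(',').append(alleles[0]).append(',').append(alleles[1]);
                }
                rows.append('\n');
            }
        }

        // Header row: two allele columns per locus.
        StringBuilder header = new StringBuilder("pop,individual");
        for (String locus : loci) {
            header.append(',').append(locus).append("_1,").append(locus).append("_2");
        }
        return header.append('\n').append(rows).toString();
    }

    public static void main(String[] args) {
        String example = "Example title\nlocA, locB\nPop\nind1 , 0102 0304\nPop\nind2 , 0000 0506\n";
        System.out.print(convert(example));
    }
}
```

Alleles are copied through unchanged, so the genepop missing-data code (00 or 000) appears as-is in the output; a real converter would need to map it to whatever code the target program expects.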
Together with the aforementioned convert and the import capability of some of the analysis programs, files in one of formatomatic's in-formats can be transformed for use with 15 of the 23 analysis programs highlighted by Excoffier & Heckel (2006) as essential. formatomatic specifically supports creation of files for immanc/bayesass +, migrate, newhybrids, msvar and baps, which are not currently supported by others. I plan to add further input or output formats as new analysis programs are released or as requested by users, when possible. Contributions to the source code are encouraged.
Acknowledgments
Thanks to C. Taylor for supporting this work. Thank you also to D. Earl for testing and suggestions. This work was supported by National Institutes of Health grants 5R01AI051633 and 5R01AI040308.
References
- Anderson EC, Thompson EA. A model-based method for identifying species hybrids using multilocus genetic data. Genetics. 2002;160:1217–1229. doi: 10.1093/genetics/160.3.1217.
- Beaumont MA. Detecting population expansion and decline using microsatellites. Genetics. 1999;153:2013–2029. doi: 10.1093/genetics/153.4.2013.
- Beerli P. Comparison of Bayesian and maximum-likelihood inference of population genetic parameters. Bioinformatics. 2006;22:341–345. doi: 10.1093/bioinformatics/bti803.
- Corander J, Waldmann P, Marttinen P, Sillanpaa MJ. baps 2: enhanced possibilities for the analysis of genetic population structure. Bioinformatics. 2004;20:2363–2369. doi: 10.1093/bioinformatics/bth250.
- Excoffier L, Heckel G. Computer programs for population genetics data analysis: a survival guide. Nature Reviews Genetics. 2006;7:745–758. doi: 10.1038/nrg1904.
- Excoffier L, Laval G, Schneider S. arlequin, version 3.0: an integrated software package for population genetics data analysis. Evolutionary Bioinformatics Online. 2005;1:47–50.
- Falush D, Stephens M, Pritchard JK. Inference of population structure using multilocus genotype data: linked loci and correlated allele frequencies. Genetics. 2003;164:1567–1587. doi: 10.1093/genetics/164.4.1567.
- Glaubitz JC. convert: A user-friendly program to reformat diploid genotypic data for commonly used population genetic software packages. Molecular Ecology Notes. 2004;4:309–310.
- Pritchard JK, Stephens M, Donnelly P. Inference of population structure using multilocus genotype data. Genetics. 2000;155:945–959. doi: 10.1093/genetics/155.2.945.
- Rannala B, Mountain JL. Detecting immigration by using multilocus genotypes. Proceedings of the National Academy of Sciences, USA. 1997;94:9197–9201. doi: 10.1073/pnas.94.17.9197.
- Raymond M, Rousset F. genepop version 1.2.: population genetics software for exact tests and ecumenicism. Journal of Heredity. 1995;86:248–249.