Abstract
The data presented in this article are related to the research article entitled “High-accuracy haplotype imputation using unphased genotype data as the references”, which reports that unphased genotype data can be used as references for haplotype imputation [1]. This article describes the pipelines used to generate different implementations and reports the results of performance comparisons among the implementations (A, B, and C) and between HiFi and three major imputation software tools. Our data showed that the three implementations are similar in accuracy, with the accuracy of implementation-B slightly but consistently higher than that of A and C. HiFi performed better on haplotype imputation accuracy, whereas the three other software tools performed slightly better on genotype imputation accuracy. These data may provide a strategy for choosing the optimal phasing pipeline and software for different studies.
Specifications Table
Subject area | Biology |
More specific subject area | Bioinformatics |
Type of data | Tables |
How data was acquired | Genotype and haplotype data were obtained from the International HapMap Project database |
Data format | Analyzed |
Experimental factors | The original data were reformatted to fit the requirements of the different software tools |
Experimental features | We generated different implementations from the HapMap data set. We then (1) compared the performance of the different implementations and (2) compared the phasing performance of HiFi, MACH 1.0, IMPUTE2 and BEAGLE. |
Data source location | Atlanta, Georgia, USA |
Data accessibility | The data are with this article |
Value of the data
• These data are beneficial to researchers who are interested in haplotyping. The data may provide guidance on how to choose the optimal phasing pipeline.
• These data are beneficial to researchers who are interested in imputation and in comparing HiFi with three major phasing software tools (MACH, Impute2 and Beagle) on accuracy and speed. The data may provide guidance on how to choose suitable software for different studies.
• These data are helpful for comparing HiFi with three major phasing software tools (MACH, Impute2 and Beagle) on their tolerance of statistical reference panels.
1. Data
The data presented are summaries of the comparison of HiFi performance across three different implementations (A, B and C) and of the comparison between HiFi and three standard imputation software tools using a molecular reference and a statistical reference. The data showed that the accuracy of implementation-B is slightly but consistently higher than that of A and C. The data also showed that HiFi performed better on haplotype imputation accuracy and speed, whereas the three other tools performed slightly better on genotype imputation.
2. Experimental design, materials and methods
2.1. Acquisition and processing of HapMap data for different implementations
We downloaded CEU (CEPH, U.S. Utah residents with ancestry from northern and western Europe) chromosome 1 genotype data and haplotype data from HapMap in text format [5], [6]. We used the original haplotype data as the molecular reference. To generate the statistical haplotype reference panel, we erased the phase information from the trio haplotypes downloaded from HapMap and then used Beagle version 3.3.2 to resolve the haplotypes from the unphased genotypes. We then generated the following three implementations using Beagle version 3.3.2: (A) Beagle statistical phasing of unrelated persons and Mendelian-inheritance-based phasing of trios, with the results then pooled together; (B) Beagle statistical phasing of pooled unrelated persons and trios, with all individuals presumed to be unrelated; and (C) Beagle statistical phasing of pooled unrelated persons and trios, with the family structure specified in the input. We chose the same six samples as in [2] for further analysis.
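The phase-erasing step described above can be illustrated with a minimal sketch. It assumes that each individual's two haplotypes are available as equal-length strings of alleles, one character per SNP; this layout and the function name `erase_phase` are hypothetical and do not reflect the HapMap file format or the scripts actually used in this work.

```python
def erase_phase(hap_a, hap_b):
    """Collapse two phased haplotypes into unphased genotypes.

    Each genotype is returned as an alphabetically sorted allele pair, so the
    information about which allele sat on which chromosome is discarded.
    The string-per-haplotype layout is an assumption made for illustration.
    """
    if len(hap_a) != len(hap_b):
        raise ValueError("haplotypes must cover the same SNPs")
    return ["".join(sorted(pair)) for pair in zip(hap_a, hap_b)]


# Two hypothetical haplotypes of one individual over four SNPs.
genotypes = erase_phase("ACGT", "ATGC")
print(genotypes)
```

Because the allele pair at each SNP is sorted, swapping the two input haplotypes yields the same genotypes, which is what makes the result unphased.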
2.2. Comparison of HiFi performances with three different implementations A, B and C
We compared the performance of HiFi with the three different implementations. Our data showed that the three implementations are similar in accuracy, with the accuracy of implementation-B slightly but consistently higher than that of A and C (Table S1).
2.3. Comparison of HiFi and three standard imputation software performances with molecular reference and statistical reference
We compared the performance of HiFi with that of three standard imputation software tools (MACH, IMPUTE2 and BEAGLE) [7], [8], [9]. HiFi performed better on haplotype imputation accuracy (Table S2) and speed (Table S4), whereas MACH, IMPUTE2 and BEAGLE performed slightly better on genotype imputation accuracy (Table S3); of these, MACH and IMPUTE2 performed best on genotype imputation.
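The difference between genotype and haplotype imputation accuracy can be made concrete with a small sketch. This section does not define the accuracy measures, so the per-SNP concordance rates below are an assumption chosen for illustration, and the function names are hypothetical; they are not the metrics behind Tables S2 and S3.

```python
def genotype_accuracy(true_haps, imputed_haps):
    """Fraction of SNPs whose unordered allele pair is imputed correctly."""
    t_a, t_b = true_haps
    i_a, i_b = imputed_haps
    n = len(t_a)
    matches = sum(
        sorted((t_a[k], t_b[k])) == sorted((i_a[k], i_b[k])) for k in range(n)
    )
    return matches / n


def haplotype_accuracy(true_haps, imputed_haps):
    """Fraction of alleles placed on the correct haplotype.

    The two haplotypes of an individual carry no intrinsic labels, so both
    labelings are scored and the better one is kept (an assumed convention).
    """
    t_a, t_b = true_haps
    i_a, i_b = imputed_haps

    def matched(x, y):
        return sum(p == q for p, q in zip(x, y))

    direct = matched(t_a, i_a) + matched(t_b, i_b)
    swapped = matched(t_a, i_b) + matched(t_b, i_a)
    return max(direct, swapped) / (2 * len(t_a))


# Hypothetical example: every genotype is right, but one SNP is mis-phased.
truth = ("ACGT", "ATGC")
imputed = ("ATGT", "ACGC")
print(genotype_accuracy(truth, imputed))
print(haplotype_accuracy(truth, imputed))
```

In this toy case the genotype accuracy is perfect while the haplotype accuracy is lower, which shows why a tool can rank differently on the two measures.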
Acknowledgments
This work was supported by the National Institutes of Health (R21HG006173, R43HG007621, HL117929, MD007602, MD005964, RR003034, U54MD07588) and the American Heart Association (Grant 09GRNT2300003).
Footnotes
Transparency data associated with this article can be found in the online version at doi:10.1016/j.dib.2016.06.029.
Supplementary data associated with this article can be found in the online version at doi:10.1016/j.dib.2016.06.029.
Contributor Information
Li Ma, Email: lma@msm.edu.
Qing Song, Email: qsong@msm.edu.
Transparency document. Supplementary material
Appendix A. Supplementary material
References
- 1. Li W., Xu W., Fu G., Ma L., Richards J., Rao W., Bythwood T., Guo S., Song Q. High-accuracy haplotype imputation using unphased genotype data as the references. Gene. 2015;572:279–284. doi: 10.1016/j.gene.2015.07.082.
- 2. Li W., Xu W., He S., Ma L., Song Q. References for haplotype imputation in the big data era. Mol. Biol. 2015. doi: 10.4172/2168-9547.1000143. (in press)
- 5. Li W., Fu G., Rao W., Xu W., Ma L., Guo S., Song Q. GenomeLaser: fast and accurate haplotyping from pedigree genotypes. Bioinformatics. 2015. doi: 10.1093/bioinformatics/btv452. pii: btv452. [Epub ahead of print]
- 6. Xu W., Ma L., Li W., Brunson T.A., Tian X., Richards J., Li Q., Bythwood T., Yuan Z., Song Q. Functional pseudogenes inhibit the superoxide production. Precis. Med. 2015;1.
- 7. Howie B.N., Donnelly P., Marchini J. A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genet. 2009;5:e1000529. doi: 10.1371/journal.pgen.1000529.
- 8. Browning S.R., Browning B.L. Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering. Am. J. Hum. Genet. 2007;81:1084–1097. doi: 10.1086/521987.
- 9. Li Y., Willer C.J., Ding J., Scheet P., Abecasis G.R. MaCH: using sequence and genotype data to estimate haplotypes and unobserved genotypes. Genet. Epidemiol. 2010;34:816–834. doi: 10.1002/gepi.20533.
Further reading
- 3. Ma Y., Zhao J., Wong J.S., Ma L., Li W., Fu G., Xu W., Zhang K., Kittles R.A., Li Y., Song Q. Accurate inference of local phased ancestry of modern admixed populations. Sci. Rep. 2014;4:5800. doi: 10.1038/srep05800.
- 4. Rao W., Ma Y., Ma L., Zhao J., Li Q., Gu W., Zhang K., Bond V.C., Song Q. High-resolution whole-genome haplotyping using limited seed data. Nat. Methods. 2013;10:6–7. doi: 10.1038/nmeth.2308.