Data in Brief. 2016 Jun 29;8:1412–1415. doi: 10.1016/j.dib.2016.06.029

Data supporting the high-accuracy haplotype imputation using unphased genotype data as the references

Wenzhi Li a,b,1, Wei Xu b,1, Shaohua He c, Li Ma b,c, Qing Song a,b,c
PMCID: PMC4995474  PMID: 27595130

Abstract

The data presented in this article are related to the research article entitled “High-accuracy haplotype imputation using unphased genotype data as the references”, which reports that unphased genotype data can be used as a reference for haplotype imputation [1]. This article reports the pipelines used to generate different implementations, the results of a performance comparison among the implementations (A, B, and C), and a comparison between HiFi and three major imputation software tools. Our data showed that the three implementations are similar in accuracy, with implementation B slightly but consistently more accurate than A and C. HiFi performed better on haplotype imputation accuracy, and the three other software tools performed slightly better on genotype imputation accuracy. These data may provide a strategy for choosing the optimal phasing pipeline and software for different studies.


Specifications Table

Subject area: Biology
More specific subject area: Bioinformatics
Type of data: Tables
How data was acquired: Genotype and haplotype data were obtained from the International HapMap Project database
Data format: Analyzed
Experimental factors: The original data were reformatted to fit the requirements of the different software tools
Experimental features: We generated different implementations from the HapMap data set. Then (1) we compared the performance of the different implementations, and (2) we compared the phasing performance of HiFi, MACH 1.0, IMPUTE2, and BEAGLE.
Data source location: Atlanta, Georgia, USA
Data accessibility: The data are with this article

Value of the data

  • These data are beneficial to researchers who are interested in haplotyping. The data may provide guidance on how to choose the optimal phasing pipeline.

  • These data are beneficial to researchers who are interested in imputation and in the comparison between HiFi and three major phasing software tools (MACH, IMPUTE2 and BEAGLE) on accuracy and speed. The data may provide guidance on how to choose suitable software for different studies.

  • These data are helpful for comparing HiFi and three major phasing software tools (MACH, IMPUTE2 and BEAGLE) on their tolerance of statistical reference panels.

1. Data

The data presented are summaries of two comparisons: the performance of HiFi with three different implementations (A, B and C), and the performance of HiFi and three standard imputation software tools with a molecular reference and a statistical reference. The data showed that the accuracy of implementation B is slightly but consistently higher than that of A and C. The data also showed that HiFi performed better on haplotype imputation accuracy and speed, whereas the three other tools performed slightly better on genotype imputation.

2. Experimental design, materials and methods

2.1. Acquisition and processing of HapMap data for different implementations

We downloaded CEU (CEPH, U.S. Utah residents with ancestry from northern and western Europe) chromosome 1 genotype and haplotype data from HapMap in text format [5], [6]. We used the original haplotype data as the molecular reference. To generate the statistical haplotype reference panel, we erased the phase information from the trio haplotypes downloaded from HapMap and then used Beagle version 3.3.2 to resolve the haplotypes from the unphased genotypes. We then generated the following three implementations with Beagle version 3.3.2: (A) Beagle statistical phasing of unrelated persons and Mendelian-inheritance-based phasing of trios, with the results pooled together; (B) Beagle statistical phasing of pooled unrelated persons and trios, treating all as unrelated; and (C) Beagle statistical phasing of pooled unrelated persons and trios, with the family structure specified in the input. We chose the same 6 samples [2] for further analysis.
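The two data-preparation ideas in this paragraph, erasing phase from a pair of haplotypes and phasing a trio child by Mendelian inheritance, can be illustrated with a minimal Python sketch. This is not the authors' pipeline and does not reproduce Beagle or the HapMap file format; the function names, the one-character-per-SNP string encoding, and the choice to return None for ambiguous sites are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

# Assumed encoding (hypothetical): a haplotype is a string with one allele
# character per SNP, and an individual is a pair of such strings.
Haplotypes = Tuple[str, str]
Genotype = Tuple[str, str]


def erase_phase(haps: Haplotypes) -> List[Genotype]:
    """Turn a phased pair of haplotypes into unphased genotypes.

    Each genotype is a sorted allele pair, so the order of the two alleles
    no longer says which haplotype they came from.
    """
    h1, h2 = haps
    if len(h1) != len(h2):
        raise ValueError("both haplotypes must cover the same SNPs")
    return [tuple(sorted((a, b))) for a, b in zip(h1, h2)]


def phase_child_site(
    child: Genotype, father: Genotype, mother: Genotype
) -> Optional[Tuple[str, str]]:
    """Phase one child genotype using only Mendelian inheritance.

    Returns (paternal_allele, maternal_allele), or None when the site is
    ambiguous (for example all three heterozygous) or inconsistent.
    """
    a, b = child
    options = set()
    # Try both assignments of the child's alleles to the two parents.
    for paternal, maternal in ((a, b), (b, a)):
        if paternal in father and maternal in mother:
            options.add((paternal, maternal))
    # Exactly one consistent assignment means the phase is determined.
    return options.pop() if len(options) == 1 else None


if __name__ == "__main__":
    print(erase_phase(("ACG", "ATG")))
    print(phase_child_site(("A", "G"), ("A", "A"), ("A", "G")))
    print(phase_child_site(("A", "G"), ("A", "G"), ("A", "G")))
```

Sites that return None here are the ones where pedigree information alone cannot decide the phase, which is where a statistical method would have to fill in.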

2.2. Comparison of HiFi performances with three different implementations A, B and C

We compared the performance of HiFi with the three different implementations. Our data showed that the three implementations are similar in accuracy, with implementation B slightly but consistently more accurate than A and C (Table S1).

2.3. Comparison of HiFi and three standard imputation software performances with molecular reference and statistical reference

We compared the performance of HiFi with that of three standard imputation software tools (MACH, IMPUTE2 and BEAGLE) [7], [8], [9]. HiFi performed better on haplotype imputation accuracy (Table S2) and speed (Table S4), whereas MACH, IMPUTE2 and BEAGLE performed slightly better on genotype imputation accuracy (Table S3); among these, MACH and IMPUTE2 performed best on genotype imputation.
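To make the distinction between genotype and haplotype imputation accuracy concrete, the following Python sketch scores one individual against a known truth. The article does not define its accuracy metrics in this section, so the per-site concordance definitions below, the function names, and the string encoding are assumptions chosen for illustration, not the measures used to produce Tables S2 and S3.

```python
from typing import Tuple

# Assumed encoding (hypothetical): one allele character per SNP.
Haplotypes = Tuple[str, str]


def _site_count(truth: Haplotypes, imputed: Haplotypes) -> int:
    """Check that all four haplotypes cover the same, non-empty set of SNPs."""
    lengths = {len(h) for h in (*truth, *imputed)}
    if len(lengths) != 1 or 0 in lengths:
        raise ValueError("haplotypes must be non-empty and of equal length")
    return lengths.pop()


def genotype_accuracy(truth: Haplotypes, imputed: Haplotypes) -> float:
    """Fraction of sites whose unordered allele pair matches the truth."""
    n = _site_count(truth, imputed)
    hits = 0
    for i in range(n):
        true_pair = sorted((truth[0][i], truth[1][i]))
        imputed_pair = sorted((imputed[0][i], imputed[1][i]))
        hits += true_pair == imputed_pair
    return hits / n


def haplotype_accuracy(truth: Haplotypes, imputed: Haplotypes) -> float:
    """Fraction of alleles placed on the correct haplotype.

    The two haplotype labels are arbitrary, so both labelings are scored
    and the better one is kept.
    """
    n = _site_count(truth, imputed)

    def matches(a: str, b: str) -> int:
        return sum(x == y for x, y in zip(a, b))

    same = matches(truth[0], imputed[0]) + matches(truth[1], imputed[1])
    swapped = matches(truth[0], imputed[1]) + matches(truth[1], imputed[0])
    return max(same, swapped) / (2 * n)


if __name__ == "__main__":
    truth = ("ACGT", "ATGA")
    imputed = ("ATGT", "ACGA")  # same genotypes, one site phased wrongly
    print(genotype_accuracy(truth, imputed))
    print(haplotype_accuracy(truth, imputed))
```

In the usage example every genotype is recovered, yet one heterozygous site sits on the wrong haplotype, so the genotype score is perfect while the haplotype score is not. This is why a tool can rank differently on the two measures.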

Acknowledgments

This work was supported by the National Institutes of Health (R21HG006173, R43HG007621, HL117929, MD007602, MD005964, RR003034, U54MD07588) and the American Heart Association (grant 09GRNT2300003).

Footnotes

Transparency document

Transparency data associated with this article can be found in the online version at doi:10.1016/j.dib.2016.06.029.

Appendix A

Supplementary data associated with this article can be found in the online version at doi:10.1016/j.dib.2016.06.029.

Contributor Information

Li Ma, Email: lma@msm.edu.

Qing Song, Email: qsong@msm.edu.

Transparency document. Supplementary material

mmc1.pdf (1.2MB, pdf)

Appendix A. Supplementary material

mmc2.zip (32KB, zip)

References

  • 1. Li W., Xu W., Fu G., Ma L., Richards J., Rao W., Bythwood T., Guo S., Song Q. High-accuracy haplotype imputation using unphased genotype data as the references. Gene. 2015;572:279–284. doi: 10.1016/j.gene.2015.07.082.
  • 2. Li W., Xu W., He S., Ma L., Song Q. References for haplotype imputation in the big data era. Mol. Biol. 2015 doi: 10.4172/2168-9547.1000143. (in press)
  • 5. Li W., Fu G., Rao W., Xu W., Ma L., Guo S., Song Q. GenomeLaser: fast and accurate haplotyping from pedigree genotypes. Bioinformatics. 2015 doi: 10.1093/bioinformatics/btv452. pii: btv452. [Epub ahead of print]
  • 6. Xu W., Ma L., Li W., Brunson T.A., Tian X., Richards J., Li Q., Bythwood T., Yuan Z., Song Q. Functional pseudogenes inhibit the superoxide production. Precis. Med. 2015;1
  • 7. Howie B.N., Donnelly P., Marchini J. A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genet. 2009;5:e1000529. doi: 10.1371/journal.pgen.1000529.
  • 8. Browning S.R., Browning B.L. Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering. Am. J. Hum. Genet. 2007;81:1084–1097. doi: 10.1086/521987.
  • 9. Li Y., Willer C.J., Ding J., Scheet P., Abecasis G.R. MaCH: using sequence and genotype data to estimate haplotypes and unobserved genotypes. Genet. Epidemiol. 2010;34:816–834. doi: 10.1002/gepi.20533.

Further reading

  • 3. Ma Y., Zhao J., Wong J.S., Ma L., Li W., Fu G., Xu W., Zhang K., Kittles R.A., Li Y., Song Q. Accurate inference of local phased ancestry of modern admixed populations. Sci. Rep. 2014;4:5800. doi: 10.1038/srep05800.
  • 4. Rao W., Ma Y., Ma L., Zhao J., Li Q., Gu W., Zhang K., Bond V.C., Song Q. High-resolution whole-genome haplotyping using limited seed data. Nat. Methods. 2013;10:6–7. doi: 10.1038/nmeth.2308.

Articles from Data in Brief are provided here courtesy of Elsevier
