Data supporting the high-accuracy haplotype imputation using unphased genotype data as the references

Wenzhi Li; Wei Xu; Shaohua He; Li Ma; Qing Song

doi:10.1016/j.dib.2016.06.029

2016 Jun 29;8:1412–1415. doi: 10.1016/j.dib.2016.06.029

PMCID: PMC4995474 PMID: 27595130

Abstract

The data presented in this article are related to the research article entitled “High-accuracy haplotype imputation using unphased genotype data as the references”, which reports that unphased genotype data can be used as a reference for haplotype imputation [1]. This article describes the pipeline used to generate different implementations and reports the results of performance comparisons among those implementations (A, B, and C) and between HiFi and three major imputation software tools. The data show that the three implementations are similar in accuracy, with implementation B slightly but consistently more accurate than A and C. HiFi performed better on haplotype imputation accuracy, whereas the other three software tools performed slightly better on genotype imputation accuracy. These data may inform the choice of phasing pipeline and software for different studies.

Specifications Table

Subject area	Biology
More specific subject area	Bioinformatics
Type of data	Tables
How data was acquired	Genotype and haplotype data were obtained from the International HapMap Project database
Data format	Analyzed
Experimental factors	The original data were reformatted to fit the requirement of different software
Experimental features	We generated different implementations from the HapMap data set. Then (1) we compared the performance of the different implementations, and (2) we compared the phasing performance of HiFi, MACH 1.0, IMPUTE2, and BEAGLE.
Data source location	Atlanta, Georgia, USA
Data accessibility	The data are with this article

Value of the data

• These data are beneficial to researchers who are interested in haplotyping. They may provide guidance on how to choose the optimal phasing pipeline.
• These data are beneficial to researchers who are interested in imputation and in comparing HiFi with three major phasing software tools (MACH, IMPUTE2, and BEAGLE) on accuracy and speed. They may provide guidance on how to choose suitable software for different studies.
• These data help compare HiFi with three major phasing software tools (MACH, IMPUTE2, and BEAGLE) on their tolerance of statistical reference panels.

1. Data

The data presented summarize two comparisons: HiFi performance under three different implementations (A, B, and C), and the performance of HiFi against three standard imputation software tools using molecular and statistical references. The data show that the accuracy of implementation B is slightly but consistently higher than that of A and C. They also show that HiFi performed better on haplotype imputation accuracy and speed, whereas the other three tools performed slightly better on genotype imputation.

2. Experimental design, materials and methods

2.1. Acquisition and processing of HapMap data for different implementations

We downloaded CEU (CEPH, U.S. Utah residents with ancestry from northern and western Europe) chromosome 1 genotype data and haplotype data from HapMap in text format [5], [6]. We used the original haplotype data as the molecular reference. To generate the statistical haplotype reference panel, we erased the phase information from the trio haplotypes downloaded from HapMap and then used Beagle version 3.3.2 to resolve the haplotypes from the unphased genotypes. We then generated the following three implementations with Beagle version 3.3.2: (A) Beagle statistical phasing of unrelated persons and Mendelian-inheritance-based phasing of trios, with the results then pooled together; (B) Beagle statistical phasing of pooled unrelated persons and trios, treating all individuals as unrelated; and (C) Beagle statistical phasing of pooled unrelated persons and trios, with the family structure specified in the input. We chose the same 6 samples [2] for further analysis.
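The phase-erasing step described above can be illustrated with a minimal sketch. It is not the authors' script: the function name `erase_phase` and the in-memory allele lists are assumptions made for illustration, and the actual HapMap text format is not parsed here. The idea shown is only that sorting the two alleles at each site discards which chromosome each allele came from.

```python
from typing import List, Tuple

Haplotype = List[str]             # one allele per site, e.g. ["A", "G", "T"]
Genotype = List[Tuple[str, ...]]  # one unordered (sorted) allele pair per site


def erase_phase(hap1: Haplotype, hap2: Haplotype) -> Genotype:
    """Collapse two phased haplotypes into unphased genotypes.

    Hypothetical helper: sorting the two alleles at every site removes
    the information about which chromosome carried which allele.
    """
    if len(hap1) != len(hap2):
        raise ValueError("haplotypes must cover the same sites")
    return [tuple(sorted(pair)) for pair in zip(hap1, hap2)]


# Two different phasings of the same individual (made-up alleles) ...
phasing_a = (["A", "G", "T"], ["C", "G", "C"])
phasing_b = (["A", "G", "C"], ["C", "G", "T"])

# ... become indistinguishable once the phase is erased.
unphased_a = erase_phase(*phasing_a)
unphased_b = erase_phase(*phasing_b)
```

Because both phasings collapse to the same genotypes, a statistical phasing tool given only the unphased data has to infer which of them is more likely.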

2.2. Comparison of HiFi performances with three different implementations A, B and C

We compared HiFi performance across the three implementations. The three implementations are similar in accuracy, with implementation B slightly but consistently more accurate than A and C (Table S1).

2.3. Comparison of HiFi and three standard imputation software performances with molecular reference and statistical reference

We compared the performance of HiFi with that of three standard imputation software tools (MACH, IMPUTE2, and BEAGLE) [7], [8], [9]. HiFi performed better on haplotype imputation accuracy (Table S2) and speed (Table S4), whereas MACH, IMPUTE2, and BEAGLE performed slightly better on genotype imputation accuracy (Table S3), with MACH and IMPUTE2 performing best on genotype imputation.
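The distinction between haplotype and genotype imputation accuracy can be made concrete with a small sketch. This article does not spell out its scoring formulas, so the two per-site measures below are assumptions chosen for illustration, not the metrics behind Tables S2 and S3; the function names and the example alleles are hypothetical as well.

```python
from typing import List, Tuple

HapPair = Tuple[List[str], List[str]]  # the two haplotypes of one individual


def genotype_accuracy(truth: HapPair, imputed: HapPair) -> float:
    """Fraction of sites whose unordered allele pair is correct."""
    sites = len(truth[0])
    hits = sum(
        sorted((t1, t2)) == sorted((i1, i2))
        for t1, t2, i1, i2 in zip(truth[0], truth[1], imputed[0], imputed[1])
    )
    return hits / sites


def haplotype_accuracy(truth: HapPair, imputed: HapPair) -> float:
    """Fraction of sites whose ordered allele pair is correct.

    The labels "first" and "second" haplotype are arbitrary, so the
    imputed pair is scored in both orientations and the better one kept.
    """
    sites = len(truth[0])

    def hits(a: List[str], b: List[str]) -> int:
        return sum(
            (t1, t2) == (i1, i2)
            for t1, t2, i1, i2 in zip(truth[0], truth[1], a, b)
        )

    return max(hits(imputed[0], imputed[1]), hits(imputed[1], imputed[0])) / sites


# Made-up example: every genotype is right, but the third site is phased wrongly.
truth = (["A", "G", "T", "C"], ["C", "G", "C", "T"])
imputed = (["A", "G", "C", "C"], ["C", "G", "T", "T"])

geno_acc = genotype_accuracy(truth, imputed)
hap_acc = haplotype_accuracy(truth, imputed)
```

In this toy case the genotype score is perfect while the haplotype score is not, which is why a tool can rank differently on the two measures.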

Acknowledgments

This work was supported by the National Institutes of Health (R21HG006173, R43HG007621, HL117929, MD007602, MD005964, RR003034, U54MD07588) and by an American Heart Association grant (09GRNT2300003).

Footnotes

Transparency document

Transparency data associated with this article can be found in the online version at doi:10.1016/j.dib.2016.06.029.

Appendix A

Supplementary data associated with this article can be found in the online version at doi:10.1016/j.dib.2016.06.029.

Contributor Information

Li Ma, Email: lma@msm.edu.

Qing Song, Email: qsong@msm.edu.

Transparency document. Supplementary material

mmc1.pdf (1.2 MB, pdf)

Appendix A. Supplementary material

mmc2.zip (32 KB, zip)

References

1. Li W., Xu W., Fu G., Ma L., Richards J., Rao W., Bythwood T., Guo S., Song Q. High-accuracy haplotype imputation using unphased genotype data as the references. Gene. 2015;572:279–284. doi: 10.1016/j.gene.2015.07.082.
2. Li W., Xu W., He S., Ma L., Song Q. References for haplotype imputation in the big data era. Mol. Biol. 2015. doi: 10.4172/2168-9547.1000143. (in press)
5. Li W., Fu G., Rao W., Xu W., Ma L., Guo S., Song Q. GenomeLaser: fast and accurate haplotyping from pedigree genotypes. Bioinformatics. 2015. doi: 10.1093/bioinformatics/btv452. pii: btv452. [Epub ahead of print]
6. Xu W., Ma L., Li W., Brunson T.A., Tian X., Richards J., Li Q., Bythwood T., Yuan Z., Song Q. Functional pseudogenes inhibit the superoxide production. Precis. Med. 2015;1.
7. Howie B.N., Donnelly P., Marchini J. A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genet. 2009;5:e1000529. doi: 10.1371/journal.pgen.1000529.
8. Browning S.R., Browning B.L. Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering. Am. J. Hum. Genet. 2007;81:1084–1097. doi: 10.1086/521987.
9. Li Y., Willer C.J., Ding J., Scheet P., Abecasis G.R. MaCH: using sequence and genotype data to estimate haplotypes and unobserved genotypes. Genet. Epidemiol. 2010;34:816–834. doi: 10.1002/gepi.20533.
