High rates of phasing errors in highly polymorphic species with low levels of linkage disequilibrium

Marek Bukowicki; Susanne U Franssen; Christian Schlötterer

doi:10.1111/1755-0998.12516

Author manuscript; available in PMC: 2017 Jul 1.

Published in final edited form as: Mol Ecol Resour. 2016 Mar 21;16(4):874–882. doi: 10.1111/1755-0998.12516

PMCID: PMC4916997 EMSID: EMS68596 PMID: 26929272

Abstract

Short read sequencing of diploid individuals does not permit the direct inference of the sequence on each of the two chromosomes. Although various phasing software packages exist, they were primarily tailored for and tested on human data, which differ from other species in factors that influence phasing, such as SNP density, amounts of linkage disequilibrium (LD) and sample sizes. Despite becoming increasingly popular for other species, the reliability of phasing in non-human data has not been evaluated to a sufficient extent.

We scrutinized the phasing accuracy for Drosophila melanogaster, a species with high polymorphism levels and reduced LD relative to humans. We phased two D. melanogaster populations and compared the results to the known haplotypes. The performance increased with size of the reference panel and was highest when the reference panel and phased individuals were from the same population. Full genomic SNP data and inclusion of sequence read information also improved phasing. Despite humans and Drosophila having similar switch error rates between polymorphic sites, the distances between switch errors were much shorter in Drosophila with only fragments < 300-1500 bp being correctly phased with ≥ 95% confidence. This suggests that the higher SNP density cannot compensate for the higher recombination rate in D. melanogaster. Furthermore, we show that populations that have gone through demographic events such as bottlenecks can be phased with higher accuracy. Our results highlight that statistically phased data are particularly error prone in species with large population sizes or populations lacking suitable reference panels.

Keywords: haplotype analysis, computational phasing, switch errors, Drosophila melanogaster, single-nucleotide polymorphism

Introduction

Current short read sequencing technology and SNP genotyping approaches do not provide information about the phase of the identified variants, and consequently it is unknown on which of the two chromosomes each variant resides. Nevertheless, many genetic analyses require phase information, and various methods have been developed to obtain the phase of diploid individuals. Experimental haplotyping requires crosses of each individual to an inbred reference individual with a known genotype (Kapun et al. 2014; Franssen et al. 2015). By allelic subtraction the non-reference haplotype can be inferred. Alternatively, individuals may be inbred (Mackay et al. 2012), or haploid embryos can be generated by means of mutants (Langley et al. 2011). These experimental approaches are not only costly and time consuming, but in many species simply not practicable. Therefore, the alternative approach of statistically phasing genotypic population data has enjoyed enormous popularity in recent years, in particular since increasingly powerful software tools are being developed – largely with application to human data sets in mind (reviewed in: Browning & Browning 2011).

The basic idea behind statistical phasing is that individuals in a population are expected to share short runs of nucleotides (haplotypes), so that each haplotype can be reconstructed as a mosaic of other haplotypes in the population (Stephens et al. 2001). Several software tools build on this idea, such as PHASE (Stephens et al. 2001), Beagle (Browning & Browning 2007), IMPUTE (Marchini et al. 2007), MaCH (Li et al. 2010) and SHAPEIT/2 (Delaneau et al. 2013a, Delaneau et al. 2013b).

While phasing algorithms were initially designed for the analysis of human data, their success in humans has led to their application to other species (e.g. Auton et al. 2012; Cadzow et al. 2014; Utsunomiya et al. 2015; Bosse et al. 2015). We are, however, lacking a thorough evaluation of the phasing accuracy in species whose genetic properties differ from those of humans. In this study we fill this gap by taking advantage of almost 200 phased haplotypes for each of two D. melanogaster populations with different demographic histories. Specifically, we evaluate the crucial parameters affecting the accuracy of phasing in a non-human species and evaluate how the accuracy of phasing varies between populations.

Material and Methods

Data sets used for phasing

Phasing was performed using Beagle 4.1 (Browning & Browning 2007) and SHAPEIT2 (Delaneau et al. 2013a) on two sets of genotypes from different natural D. melanogaster populations, the DGRP lines from Raleigh, North America (Mackay et al. 2012; http://dgrp2.gnets.ncsu.edu/data.html, accessed July 2015) and the DPGP3 lines from Siavonga, Zambia, Africa (Pool et al. 2012; Lack et al. 2015; www.dpgp.org/DPGP3_PRERELEASE/NEXUS/dpgp_files/ accessed July 2015) for chromosome 2R. For each population a set of 30 diploid genotypes was created by merging pairs of randomly selected haplotypes from the respective population. This procedure provided genotypes from a natural population with known phase.
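The construction of diploid genotypes with known phase can be illustrated with a minimal sketch. The function name, the 0/1 allele encoding and the sampling without replacement are assumptions made for illustration; this is not the authors' pipeline.

```python
import random


def make_diploid_genotypes(haplotypes, n_genotypes, seed=0):
    """Pair randomly chosen haplotypes into diploid genotypes of known phase.

    haplotypes: dict mapping a line name to a list of alleles
                (0 = reference, 1 = alternative, None = missing).
    Returns (genotypes, used_names). genotypes maps a pair of line names to
    the unphased alternative-allele count per site (0, 1 or 2; None if missing).
    """
    rng = random.Random(seed)
    # Assumption: each haplotype is used at most once (sampling without replacement).
    names = rng.sample(sorted(haplotypes), 2 * n_genotypes)
    genotypes = {}
    for a, b in zip(names[0::2], names[1::2]):
        genotypes[(a, b)] = [
            None if x is None or y is None else x + y
            for x, y in zip(haplotypes[a], haplotypes[b])
        ]
    return genotypes, names
```

The haplotypes returned in `used_names` would then be withheld from any reference panel, so that the true phase of each genotype is known but not available to the phasing software.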

Both software tools can incorporate information of reference haplotypes to improve phasing. We assembled different sets of reference haplotypes to test the effects of population origin and size of the reference panel on phasing. The different sets of reference haplotypes consisted of A) 145 haplotypes from the DGRP lines (North American population), B) 137 haplotypes from the Zambian Siavonga population of the DPGP3 lines (African population), C) a set of 122 remaining lines from the DPGP project (Pool et al. 2012) from other African populations (Congo, Cameroon, Ethiopia, Gabon, Guinea, Kenya, Nigeria, Rwanda, South Africa, Tanzania, Uganda, Zimbabwe and locations in Zambia other than Siavonga) and France, together with 48 haplotypes from a Portuguese population (Kapun et al. 2014; http://dx.doi.org/10.5061/dryad.t66ns; Franssen et al. 2015; http://dx.doi.org/10.5061/dryad.403b2/2 , http://dx.doi.org/10.5061/dryad.403b2/3; accessed July 2015). All haplotype sequences were encoded as homozygous genotypes for the reference panel. The reference panels did not contain haplotypes that were used to generate the diploid genotypes subjected to phasing.

Four different SNP sets were used for phasing depending on the reference panel and genotypes to be phased: 1) the DGRP-based genotypes with the DGRP reference, 2) the DGRP-based genotypes with a composite reference of haplotype sets A, B and C, 3) DPGP3-based genotypes with the DPGP3 reference, 4) DPGP3-based genotypes with the composite reference of sets A, B and C. For all four configurations polymorphic sites with more than 5% missing data and singletons were excluded. A summary of the different SNP data sets is provided in suppl. table S1. Remaining sites with missing data were imputed with SHAPEIT2 prior to phasing. Unless otherwise noted, we present results for the American and the African population, based on 30 genotypes and reference haplotypes from the corresponding population.
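The site filter described above can be sketched as follows. The function name and the allele encoding are hypothetical, and the sketch additionally drops monomorphic sites, which is an assumption rather than something stated in the text.

```python
def keep_site(alleles, max_missing=0.05):
    """Decide whether one site passes the filter.

    alleles: list with one entry per haplotype (0, 1, or None for missing).
    A site is dropped if more than `max_missing` of the calls are missing,
    or if the minor allele is seen fewer than twice (singleton or, as an
    assumption of this sketch, monomorphic).
    """
    n_missing = sum(a is None for a in alleles)
    if n_missing / len(alleles) > max_missing:
        return False
    n_alt = sum(a == 1 for a in alleles)
    n_ref = sum(a == 0 for a in alleles)
    return min(n_alt, n_ref) >= 2
```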

Parameter settings used for phasing with SHAPEIT2

The effective population size was set to 1,150,000, the N_e estimate of an ancestral African population (Shapiro et al. 2007). Consistent with results for humans, this parameter does not have a large influence on the results (data not shown). We used recombination rates approximated with the 6th-order polynomial (Comeron et al. 2012). We also tested a more fine-scale recombination map with 100 kb windows (~0.25 cM), but this resulted in a slightly higher switch error rate (SER) and was therefore not used. In order to obtain maximal accuracy the number of states was set to the total number of haplotypes (haplotypes to be phased + haplotypes in the reference panel). The number of states determines how many of the most similar haplotypes (within a given window) are used to phase each genotype.

Usage of sequencing reads to improve phasing using SHAPEIT2

To estimate the influence of adding sequencing read information during phasing, we downloaded the reads corresponding to the haplotypes used to generate the 30 genotypes from the NCBI SRA database (http://trace.ncbi.nlm.nih.gov/Traces/sra/, last accessed 10.09.2015): DGRP lines (study number: SRP000694) and DPGP3 lines (study number: SRP006733). The chosen DGRP lines were sequenced with Illumina paired-end technology, with read lengths between 95 and 125 bp and mean coverages between 17.7 and 57.2. DPGP3 lines had read lengths between 146 and 160 bp and mean coverages between 33 and 49.2. Reads for the DPGP3 lines were trimmed with trim-fastq.pl from the PoPoolation2 package (Kofler et al. 2011) and remapped with BWA (version 0.7.12, parameter settings “-o 1 -n 0.01 -l 200 -e 12 -n 12”) against the D. melanogaster reference genome (version 5.38). For the DGRP lines we used the original alignment files.

Duplicates were removed from the alignment files of both populations with Picard Tools (version 1.115) and subsequently filtered for a mapping quality of 20 and proper pairs with samtools (version 0.1.19). To create alignment files for the 60 artificially created genotypes of both populations, individual alignment files were sub-sampled to a desired coverage (identical for the two haplotypes being combined into one genotype) and subsequently merged for the respective haplotype pairs. Sub-sampling was performed with samtools view (-s) taking into account the mean coverage in the original bam files at the polymorphic sites.
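As an illustration of the sub-sampling step, the sketch below derives the value passed to samtools view -s from the mean coverage of an alignment file and the desired coverage. In samtools the integer part of this value seeds the random number generator and the fractional part is the fraction of read pairs kept. The function name, the default seed and the four-digit rounding are assumptions of this sketch.

```python
def subsample_arg(mean_coverage, target_coverage, seed=1):
    """Build the SEED.FRACTION value for `samtools view -s`.

    mean_coverage:   mean coverage of the original BAM at the polymorphic sites.
    target_coverage: coverage wanted for this haplotype after sub-sampling.
    """
    fraction = target_coverage / mean_coverage
    digits = round(fraction * 10000)  # keep four decimal places
    if not 0 < digits < 10000:
        raise ValueError("target coverage must be below the mean coverage")
    return f"{seed}.{digits:04d}"


# Hypothetical example: reduce a 48x alignment to 12x.
print(subsample_arg(48.0, 12.0, seed=3))  # 3.2500
```

Both haplotypes of a genotype would be sub-sampled to the same target coverage before their alignment files are merged.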

Parameter settings used for phasing with Beagle 4.1

We used 20 iterations and the error rate was set to 0.005, a value considered typical for genomic data (Browning & Browning 2013). The window size was set to 50000 and the overlap to the average number of markers in 0.5 cM, which corresponds to 2000 SNPs for data sets 2 and 4, 4000 SNPs for data set 1 and 10000 SNPs for data set 3.

Measures of phasing performance

Throughout the manuscript we focussed on the switch error rate (SER) since the mean switch distance behaved similarly. Since we compared data sets with different SNP densities and the mean switch distance could increase even with more phasing errors for low SNP densities (suppl. Fig. S1B), we considered the SER a more intuitive measure of accuracy. The SER is defined as the number of switch errors divided by the total number of SNP pairs for which the phase was estimated (Lin et al. 2002). We only calculated the SER based on sites for which no imputation was required. The mean switch distance was calculated by dividing the total chromosome length by the number of switch errors. Borders of correctly phased fragments were defined to be at half the distance between the adjacent SNPs where the first switch error occurred.
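The two measures can be written down as a short sketch. It assumes that the input has already been restricted to one individual's heterozygous, non-imputed sites; the function names are hypothetical and this is not the evaluation script used in the study.

```python
def switch_positions(true_hap, phased_hap):
    """Indices i where the phase flips between heterozygous sites i-1 and i.

    Both arguments list the 0/1 alleles of ONE haplotype at the individual's
    heterozygous sites only (the other haplotype is its complement there).
    """
    # True where the phased haplotype carries the same allele as the true one.
    agrees = [t == p for t, p in zip(true_hap, phased_hap)]
    return [i for i in range(1, len(agrees)) if agrees[i] != agrees[i - 1]]


def switch_error_rate(true_hap, phased_hap):
    """Switch errors divided by the number of adjacent heterozygous SNP pairs."""
    n_pairs = len(true_hap) - 1
    if n_pairs < 1:
        raise ValueError("need at least two heterozygous sites")
    return len(switch_positions(true_hap, phased_hap)) / n_pairs


def mean_switch_distance(chrom_length_bp, n_switch_errors):
    """Total chromosome length divided by the number of switch errors."""
    return chrom_length_bp / n_switch_errors


true_hap = [0, 1, 1, 0, 1, 0]
phased_hap = [0, 1, 0, 1, 1, 0]  # phase flips before site 2 and again before site 4
print(switch_error_rate(true_hap, phased_hap))  # 0.4
```

A fully inverted haplotype gives an SER of zero in this sketch, which matches the convention that the labelling of the two haplotypes is arbitrary.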

Detection of IBD regions

We used IBD detection for two distinct analyses: 1) to determine how IBD regions shared within a population affect phasing accuracy, and 2) to show the impact of phasing errors on haplotype-based downstream analysis.

For the first analysis IBD regions were detected in each population on the full data set containing all haplotypes, irrespective of whether they were used to generate the genotypes or served as reference panel. We used two different methods: Beagle 4.1 (Browning & Browning 2007), which performs an intermediate imputation and phasing step, and IBDseq (Browning & Browning 2013), which calculates IBD between genotypes instead of haplotypes directly and accounts for sequencing errors. In both cases default parameter settings were applied. For measuring the influence of IBD sharing on phasing performance we did not consider IBD regions shared only among the reference panel haplotypes since they do not affect the phasing.

In the second analysis IBD regions were identified separately for the true and the phased set of haplotypes to evaluate the influence of phasing errors on this type of downstream analysis. First, IBD regions were detected for the 30 artificial genotypes with known phase information for both the North American and the African population, respectively (see: “Data sets used for phasing”). Next, the identical parameter settings were applied to detect IBD for the same genotype sets but this time with the phase estimated by SHAPEIT, as described in the previous paragraphs. To exclude confounding effects of imputation, we used only sites without any missing data. IBD regions were detected with Beagle 4.1 using parameters according to guidelines for a given marker density (window=50000, overlap=6000, err=0.005, ibdtrim=450; marker density ~ 3000 markers/cM).

Linkage disequilibrium

Linkage disequilibrium was calculated as r² with VCFtools (version 0.1.12b) (Danecek et al. 2011). For population-wide LD analysis 197 haplotypes from each population were used – 8 randomly chosen haplotypes from the American population were excluded. To estimate the influence of phasing errors on LD, we used the aforementioned sets of 30 generated genotypes of each population, with either known or estimated phase. Analogously to the IBD analysis, only sites with no missing data were included.
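For reference, the haplotype-based r² between two biallelic sites can be computed as in the sketch below. It illustrates the standard definition, r² = D² / (p_A (1 - p_A) p_B (1 - p_B)) with D = p_AB - p_A p_B, and is not the VCFtools implementation; the function name and input encoding are assumptions.

```python
def r_squared(site_a, site_b):
    """r^2 between two biallelic sites from phased haplotypes.

    site_a, site_b: lists of 0/1 alleles, one entry per haplotype, same order.
    """
    n = len(site_a)
    p_a = sum(site_a) / n
    p_b = sum(site_b) / n
    # Frequency of haplotypes carrying allele 1 at both sites.
    p_ab = sum(x == 1 and y == 1 for x, y in zip(site_a, site_b)) / n
    d = p_ab - p_a * p_b
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both sites must be polymorphic")
    return d * d / denom
```

Because p_AB depends on which alleles sit on the same haplotype, switch errors between the two sites change the estimate, which is why r² is one of the downstream measures examined here.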

Results

We tested the accuracy of phasing in two different D. melanogaster populations. The first is a North American population (Mackay et al. 2012), representing cosmopolitan flies that underwent a major population bottleneck during their out-of-Africa habitat expansion (Andolfatto 2001; Kauer et al. 2003). The second population, from Africa, is representative of the ancestral variation of D. melanogaster and has not experienced bottlenecks as pronounced as the North American one (Pool et al. 2012; Lack et al. 2015). These two populations also have the largest number of sequenced haplotypes in Drosophila. A special feature of these genotype data is that they were obtained by whole-genome sequencing of either inbred or experimentally haploidized individuals. Thus, we could use high quality haplotypes to generate diploid data of known phase, which were then subjected to phasing using the SHAPEIT2 (Delaneau et al. 2013a) or Beagle 4.1 (Browning & Browning 2007) software. We evaluated the influence of several factors on the accuracy of phasing in these two populations. Because the phasing performance of SHAPEIT was better (Fig. S7; Delaneau et al. 2013b), we focused on this software in the subsequent analyses.

Reference panel

Previous work showed that the inclusion of a reference panel of haplotypes can improve the accuracy of phasing (Delaneau et al. 2013b). We tested the phasing accuracy on 30 diploid individuals from each population and varied the size and origin of the reference panel. For both populations the switch error rate (SER) decreased with an increasing size of the reference panel from the same population (Fig. 1). For the North American population a reference panel of African flies also improved the phasing accuracy, albeit to a lesser extent. However, phasing the African population with a reference panel of North American haplotypes only, did not improve the SER. Intermediate results were obtained using a third reference panel, which contained a mixture of individuals from African and cosmopolitan populations. Notably, adding more individuals from other populations to existing reference panels of the focal population resulted in a very marginal increase of the SER. In contrast, based on the shape of the decay in SER when using reference haplotypes from the same population, we think that it will be possible to increase the phasing accuracy with even larger reference samples of the focal population.

Fig. 1 — Mean switch error rates (SER) for two different populations using reference populations of different origin and size. For each scenario 30 genotypes of A) North America (DGRP lines) and B) Africa (Zambian DPGP3 lines) were jointly phased using jointly called SNP sets 2) and 4), respectively (see Material & Methods for genotype construction). Differently coloured lines indicate the origin of the used reference haplotypes, i.e. red: 145 American lines, green: 137 African lines and blue: mixture of different populations (reference set C) in Material & Methods). Phasing performance is highest when the reference haplotypes come from the same population and is generally better for American than for African genotypes. Reference haplotypes from other populations typically have a weaker positive effect except when phasing African genotypes with reference haplotypes from the American lines only. In this case no effect or even a slight negative effect is present.

SNP density

Phasing can be either performed on fully sequenced genomes or on genotypes determined for a subset of SNPs only. We evaluated how the availability of different numbers of SNPs for the same chromosomes (and thus the same recombination history) influences phasing accuracy. As expected the SER decreases with an increasing number of SNPs being genotyped (suppl. Fig. S1A). Since a lower number of SNPs genotyped also implies a larger spacing between the SNPs, the increase in SER can be attributed to two factors: fewer markers to aid phasing and higher recombination rates between markers due to the larger spacing. Thus, we corrected the observed SER by the increase due to the spacing alone. The corrected SER still showed a decrease in the SER with an increasing number of markers genotyped suggesting that more comprehensive marker sets result in a more reliable phasing and fewer switch errors per base pair (Fig. 2). This trend was present for both populations.

Fig. 2 — Decrease in phasing accuracy for different fractions of the total number of SNPs. The decrease in phasing accuracy is corrected for the larger spacing when only a fraction of SNPs was used. Specifically, the corrected SER was calculated as the difference in SER for the identical subset of SNPs (1/x) when phasing was carried out for the total set of SNPs (1/1) or the reduced SNP set (1/x). Even after correcting for the influence of different SNP spacing, the SER increases when phasing is performed for a subset of SNPs. The loss of accuracy measured in increase of SER ranges from 0.01 for 50% of the SNPs used to 0.075 when only 5% of the SNPs are used for phasing.
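The correction described in the caption can be sketched as follows: the SER is evaluated on the same subset of sites twice, once after phasing the full SNP set and once after phasing only the subset, and the difference is reported. Function names and the input layout (one haplotype per individual at its heterozygous sites) are assumptions of this sketch.

```python
def ser(true_hap, phased_hap):
    """Switch error rate over adjacent heterozygous sites of one individual."""
    agrees = [t == p for t, p in zip(true_hap, phased_hap)]
    switches = sum(agrees[i] != agrees[i - 1] for i in range(1, len(agrees)))
    return switches / (len(agrees) - 1)


def corrected_ser(true_hap, phased_full, phased_reduced, subset_idx):
    """Extra SER caused by phasing with fewer markers, at equal SNP spacing.

    true_hap, phased_full: alleles at all heterozygous sites.
    phased_reduced:        alleles at the subset sites only, phased on the subset.
    subset_idx:            indices of the subset sites within the full set.
    """
    true_sub = [true_hap[i] for i in subset_idx]
    full_sub = [phased_full[i] for i in subset_idx]
    # Both terms use identical sites, so spacing effects cancel out.
    return ser(true_sub, phased_reduced) - ser(true_sub, full_sub)
```

A positive value indicates that the reduced marker set phased the shared sites less accurately than the full set did.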

Non-independence among neighboring sites increases phasing accuracy

Recombination breaks the association of neighboring polymorphic sites. Thus, lower recombination rates are expected to increase the association between polymorphic sites, which in turn should facilitate phasing. We tested this prediction by converting the physical distance between SNPs into genetic distance based on the recombination map in D. melanogaster (Comeron et al. 2012). As expected, we find that the SER increases with genetic distance (Fig. 3). While both populations show the same trend, there is a major difference between the North American and the African population for a given genetic distance. Even though the African population harbors more variation, phasing of the North American population was clearly more accurate. The better performance of the North American population seems counterintuitive since the African one harbors more SNP markers. Nevertheless, it is possible that the population bottleneck of the North American population resulted in a higher association between linked sites. Indeed, LD in the North American population is considerably higher than in the African one (suppl. Fig. S2, Pool et al. 2012). To account for the difference in LD, we determined the phasing accuracy in relation to LD (Fig. 4). As expected, a higher level of linkage disequilibrium results in more accurate phasing. While the behavior of the two populations is much more similar for similar levels of LD, the phasing of the North American population is slightly more accurate, in particular for SNP pairs with r²<0.3 (Fig. 4).

Fig. 3 — Phasing accuracy with respect to genetic distance between SNP pairs. The SER is shown as a function of the genetic distance between SNP pairs for the 30 North American (red) and the 30 African (green) genotypes, respectively. Genetic distances were calculated with interpolated (approximate, open circles) or fine-scale (crosses) recombination maps (Comeron et al. 2012). The plotted range of genetic distances contains more than 98% of the data.

Fig. 4 — Phasing accuracy for different levels of LD. Mean SER for adjacent SNPs with different levels of LD are shown for two different populations from North America and Africa, calculated for 197 known haplotypes for each population. The SER is highest for low levels of LD and drastically decreases with rising LD. Population differences in phasing performance are strongest for low levels of LD, r²<0.3.

Another factor potentially influencing the accuracy of phasing is the length and frequency of genomic segments that are identical by descent (IBD). Phasing becomes more reliable when IBD regions exist, with the biggest increase in accuracy at the transition from zero to one copy of an IBD segment being present in the data (Fig. 5, suppl. Fig. S3). Again, for genomic regions without IBD segments, phasing is more accurate for the North American population, while in the presence of IBD segments the difference between the two populations is negligible. Overall, our analyses indicate that a higher association among neighboring SNPs, whether through lower recombination rates, linkage disequilibrium or shared IBD segments, results in a substantial improvement of phasing accuracy.

Fig. 5 — Accuracy of phasing within IBD regions with different copy numbers. The SER is shown for regions with different coverage of IBD sharing to the remaining haplotypes available for phasing. For each population IBD segments were computed with IBDseq for the 30 genotypes that were phased, together with the maximum number of haplotypes available for the same population (see Material & Methods). The SER is greatly reduced in the presence of IBD sharing to another haplotype present in the population with the strongest effect of the first copy. Population differences are greatest when no IBD sharing is present. Results for IBD regions detected with Beagle show the same pattern (suppl. Fig. S3).

Improvement of phasing through the incorporation of sequencing read information

SHAPEIT2 offers the option to incorporate information about the association of SNPs covered by read-pairs into the phasing. We observed striking differences between the two populations when read information was included (Fig. 6). While in the North American population only minor improvements were noted, in the African population the SER could be reduced by about 70%, resulting in an even lower SER than in the North American population (1.1% compared to 2.0%). We reason that the higher SNP density in the African population (suppl. Fig. S4) in combination with the longer reads in this data set may be the major factors contributing to the higher performance gain in the African population.

Fig. 6 — Increase in phasing performance with additional usage of sequencing reads. SER are shown for the 30 North American (red) and African (green) genotypes for different sequencing coverages of individual genotypes. The increase in phasing accuracy when adding sequencing read information is much more pronounced for the African population.

Length of correctly phased fragments

With phasing being used for downstream analyses, it is important to have an impression about the reliability of the phasing in a genomic region of a certain size (Fig. 7). While large fragments (0.8-1.6 Mb) without switch error can be recovered, this is extremely unlikely. Using the frequently used 5% error rate as a cutoff, only DNA fragments shorter than 1400 bp (North America) and 300 bp (Africa) meet these criteria. When sequencing read information is being used this improves to 1500 bp (North America) and 750 bp (Africa). Removal of low recombination regions does not impact the expected length of correctly phased fragments (suppl. Fig. S5). These results suggest that studies relying on the correct phasing of large genomic regions will face problems due to switch errors.

Fig. 7 — Empirical probability of correctly phased fragments. The probability of the correct phase of a fragment for a window of a given size with respect to A) physical length, and B) the number of SNPs is shown. Only very short fragments in the range of 300 – 1500 bp or 3-6 SNPs have a high level of confidence to be correctly phased (≥ 95% confidence). Results with additional usage of sequencing reads are obtained with 24× coverage.
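One way to obtain such an empirical probability is sketched below, under the assumption that windows are placed uniformly along the chromosome and that a window counts as correctly phased when it contains no switch-error border. The function name and this exact procedure are illustrative and not taken from the study.

```python
def prob_window_correct(breakpoints, chrom_length, window):
    """Fraction of window positions that contain no switch-error border.

    breakpoints:  positions (bp) of the borders between correctly phased
                  fragments, e.g. midpoints between the SNPs flanking a switch.
    chrom_length: length of the phased region in bp.
    window:       window size in bp (must be smaller than chrom_length).
    """
    edges = [0] + sorted(breakpoints) + [chrom_length]
    # A fragment of length g can host window starts over a stretch of g - window.
    free = sum(max(0, b - a - window) for a, b in zip(edges, edges[1:]))
    return free / (chrom_length - window)
```

Scanning the window size upward until this probability drops below 0.95 would give the kind of fragment-length threshold reported in the text.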

Influence of phasing errors on downstream analysis

To illustrate the influence of phasing errors on downstream analysis we tested two commonly used measures that require haplotype information: 1) linkage disequilibrium and 2) identification of IBD regions. We show that even relatively high rates of phasing errors (SER = 3.9% for the N. American and SER = 9.5% for the African population) have only a minor effect on chromosome-wide LD estimates (Fig. S6). The mean absolute change of the chromosome-wide r² is less than 0.01 for pair-wise SNP distances up to 10 kb in both populations. This difference in r² estimates increases with distance, reflecting the lower phase confidence between more distant SNPs.

Compared to the small influence of phasing errors on LD, IBD detection was strongly reduced for computationally phased data. Phased data with a SER of 3.9% and 9.5% for N. America and Africa, respectively, only recover 70-75% of the IBD regions detected in known haplotypes (Fig. 8). Higher SERs obtained with smaller reference panels lead to an even more pronounced effect. The relationship between IBD detection and SER varies between populations. The false discovery rate of IBD regions from phased data is low and does typically not exceed 15%, except for outliers with the highest SER.
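The two quantities reported here, the fraction of true IBD regions recovered and the false discovery rate, can be sketched as below. Matching segments by any overlap within the same pair of samples is an assumption of this sketch; the text does not specify the matching rule, and the names are hypothetical.

```python
def overlaps(seg_a, seg_b):
    """Segments are (pair_id, start, end); they match if the same pair overlaps."""
    return seg_a[0] == seg_b[0] and seg_a[1] < seg_b[2] and seg_b[1] < seg_a[2]


def ibd_recovery(true_segs, phased_segs):
    """Return (fraction of true segments recovered, false discovery rate)."""
    recovered = sum(any(overlaps(t, p) for p in phased_segs) for t in true_segs)
    false_pos = sum(not any(overlaps(p, t) for t in true_segs) for p in phased_segs)
    return recovered / len(true_segs), false_pos / len(phased_segs)
```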

Fig. 8 — Fraction of IBD that can be recovered with phased haplotypes with different switch error rates. IBD regions were first detected for the 30 constructed haplotypes with the known phase and then after phasing with various SERs in each population. The y-axis depicts the fraction of IBD regions that were also identified in the phased data. Different SERs were taken from results obtained with different reference panel sizes (Fig. 1). With increasing SER, fewer IBD regions are detected. Note that the results are not affected by an increase of incorrectly identified IBD regions in the phased data (1-7% for the N. American and 8-15% for the African population, with the exception of 30% for the African sample with the highest SER of 15%).

Discussion

Our study benefited from the availability of high quality haplotype data for two D. melanogaster populations with very different demographic histories. Phasing diploid individuals reconstructed from combining the reads from two haplotypes, we were able to infer parameters affecting the accuracy of phasing using real data. We find that many parameters affect the accuracy of phasing. A higher dependence of neighboring SNPs reduces the SER. This suggests that demographic events, such as bottlenecks and admixture can result in more reliable phasing. Also, genomic regions with low recombination have fewer switch errors, implying that phase reconstruction is more reliable for low recombination regions.

Apart from the population specific features, which cannot be changed, we found that many experimental parameters could be modified to improve the phasing quality. A more comprehensive genotyping effort results in more accurate phasing, which suggests that full genome sequences provide more reliable results than those based on a subset of the polymorphic sites. Thus, full genome sequencing is preferable to SNP genotyping or RADseq analysis. Consistent with previous results (Delaneau et al. 2013a) we observed that the incorporation of sequencing read information can result in a substantial improvement of the phasing accuracy. We anticipate that for highly polymorphic populations and long sequencing reads this effect will be particularly pronounced. Finally, we showed that the reference panel is crucial for the success of the experiment. The moderate size of the reference panel in Drosophila did not allow us to explore for what size of the reference panel no further improvement of the phasing accuracy can be expected. Nevertheless, we clearly show that the SER drops with an increase in panel size and a further drop can be expected when more than 100 haplotypes are used. Particularly interesting was the importance of population origin of the reference haplotypes. In both populations we found the biggest improvement for panels from the same population. In contrast, in the African population a reference panel consisting only of North American haplotypes did not improve phasing accuracy.

Interestingly, the SERs obtained in our study (1-2% with sequencing reads) are very similar to the 0.5-7% reported for humans (e.g. Browning & Browning 2007; Williams et al. 2012; Delaneau et al. 2013a). Nevertheless, we caution against concluding that phasing works equally well in both species. On the one hand, in humans the size of reference panels is one to two orders of magnitude larger than the ones we used for Drosophila, suggesting that even lower SERs should be possible in Drosophila. On the other hand, Drosophila has about 10-fold higher polymorphism levels compared to humans, implying that the average length (in bp) of fragments without switch error is much longer in humans.

Our results are of particular relevance for studies on other species that do not have the same background information as humans or Drosophila. The use of powerful phasing software may be tempting even when the obtained results have very little prospect of being accurate. Very few species enjoy the benefit of a high quality reference panel, which should be from the same population for optimal phasing performance. That fragments of less than 1 kb are recovered for African D. melanogaster, even with a reference panel of more than 100 haplotypes from the same population, casts severe doubt on the practicability of reliable phasing in non-model organisms. This applies in particular when RADseq methods (Davey & Blaxter 2010) are used rather than full genome sequencing. Moreover, downstream analyses such as IBD detection were very sensitive to phasing errors. Based on our results we strongly advocate that studies in organisms without sufficient background information should consider experimental phasing using the offspring from a cross to an inbred reference individual with a known genome (Kapun et al. 2014). Also, the analysis of parent-offspring trios could be used to provide additional confidence for statistically inferred haplotypes (Browning & Browning 2009; Williams et al. 2012; O’Connell et al. 2014). Alternatively, long read sequencing technologies are the most promising direction for future studies, and until they become routine it may be safer to avoid analyses building on statistically phased haplotypes in species with large population sizes and small reference panels.

Supplementary Material

NIHMS68596-supplement-supplement.pdf (684.4 KB, PDF)

Acknowledgments

This work was supported by the European Research Council (ERC) grant “ArchAdapt”.

Footnotes

Author Contributions

C. S. and S. U. F. designed research; M. B. and S. U. F. performed research and analyzed results; C. S., S. U. F. and M. B. wrote the article.

Data accessibility

All data used in this study have been published previously and are publicly available (see Material and Methods for the respective publication links and URLs).

References

Andolfatto P. Contrasting Patterns of X-Linked and Autosomal Nucleotide Variation in Drosophila melanogaster and Drosophila simulans. Molecular Biology and Evolution. 2001;18:279–290. doi: 10.1093/oxfordjournals.molbev.a003804.
Auton A, Fledel-Alon A, Pfeifer S, et al. A Fine-Scale Chimpanzee Genetic Map from Population Sequencing. Science. 2012;336:193–198. doi: 10.1126/science.1216872.
Bosse M, Megens H-J, Madsen O, et al. Using genome-wide measures of coancestry to maintain diversity and fitness in endangered and domestic pig populations. Genome Research. 2015. doi: 10.1101/gr.187039.114.
Browning SR, Browning BL. Rapid and Accurate Haplotype Phasing and Missing-Data Inference for Whole-Genome Association Studies By Use of Localized Haplotype Clustering. The American Journal of Human Genetics. 2007;81:1084–1097. doi: 10.1086/521987.
Browning BL, Browning SR. A unified approach to genotype imputation and haplotype-phase inference for large data sets of trios and unrelated individuals. The American Journal of Human Genetics. 2009;84:210–223. doi: 10.1016/j.ajhg.2009.01.005.
Browning SR, Browning BL. Haplotype phasing: existing methods and new developments. Nature Reviews Genetics. 2011;12:703–714. doi: 10.1038/nrg3054.
Browning BL, Browning SR. Detecting Identity by Descent and Estimating Genotype Error Rates in Sequence Data. The American Journal of Human Genetics. 2013;93:840–851. doi: 10.1016/j.ajhg.2013.09.014.
Cadzow M, Boocock J, Nguyen HT, et al. A bioinformatics workflow for detecting signatures of selection in genomic data. Frontiers in Genetics. 2014;5. doi: 10.3389/fgene.2014.00293.
Comeron JM, Ratnappan R, Bailin S. The Many Landscapes of Recombination in Drosophila melanogaster. PLoS Genet. 2012;8:e1002905. doi: 10.1371/journal.pgen.1002905.
Danecek P, Auton A, Abecasis G, et al. The variant call format and VCFtools. Bioinformatics. 2011;27:2156–2158. doi: 10.1093/bioinformatics/btr330.
Davey JW, Blaxter ML. RADSeq: next-generation population genetics. Briefings in Functional Genomics. 2010;9:416–423. doi: 10.1093/bfgp/elq031.
Delaneau O, Howie B, Cox AJ, Zagury J-F, Marchini J. Haplotype estimation using sequencing reads. The American Journal of Human Genetics. 2013a;93:687–696. doi: 10.1016/j.ajhg.2013.09.002.
Delaneau O, Zagury J-F, Marchini J. Improved whole-chromosome phasing for disease and population genetic studies. Nature Methods. 2013b;10:5–6. doi: 10.1038/nmeth.2307.
Franssen SU, Nolte V, Tobler R, Schlötterer C. Patterns of Linkage Disequilibrium and Long Range Hitchhiking in Evolving Experimental Drosophila melanogaster Populations. Molecular Biology and Evolution. 2015;32:495–509. doi: 10.1093/molbev/msu320.
Kapun M, Schalkwyk H, McAllister B, Flatt T, Schlötterer C. Inference of chromosomal inversion dynamics from Pool-Seq data in natural and laboratory populations of Drosophila melanogaster. Molecular Ecology. 2014;23:1813–1827. doi: 10.1111/mec.12594.
Kauer MO, Dieringer D, Schlötterer C. A microsatellite variability screen for positive selection associated with the “out of Africa” habitat expansion of Drosophila melanogaster. Genetics. 2003;165:1137–1148. doi: 10.1093/genetics/165.3.1137.
Kofler R, Pandey RV, Schlötterer C. PoPoolation2: identifying differentiation between populations using sequencing of pooled DNA samples (Pool-Seq). Bioinformatics. 2011;27:3435–3436. doi: 10.1093/bioinformatics/btr589.
Lack JB, Cardeno CM, Crepeau MW, et al. The Drosophila Genome Nexus: A Population Genomic Resource of 623 Drosophila melanogaster Genomes, Including 197 from a Single Ancestral Range Population. Genetics. 2015;199:1229–1241. doi: 10.1534/genetics.115.174664.
Langley CH, Crepeau M, Cardeno C, Corbett-Detig R, Stevens K. Circumventing heterozygosity: sequencing the amplified genome of a single haploid Drosophila melanogaster embryo. Genetics. 2011;188:239–246. doi: 10.1534/genetics.111.127530.
Lin S, Cutler DJ, Zwick ME, Chakravarti A. Haplotype Inference in Random Population Samples. The American Journal of Human Genetics. 2002;71:1129–1137. doi: 10.1086/344347.
Li Y, Willer CJ, Ding J, Scheet P, Abecasis GR. MaCH: using sequence and genotype data to estimate haplotypes and unobserved genotypes. Genetic Epidemiology. 2010;34:816–834. doi: 10.1002/gepi.20533.
Mackay TFC, Richards S, Stone EA, et al. The Drosophila melanogaster Genetic Reference Panel. Nature. 2012;482:173–178. doi: 10.1038/nature10811.
Marchini J, Howie B, Myers S, McVean G, Donnelly P. A new multipoint method for genome-wide association studies by imputation of genotypes. Nature Genetics. 2007;39:906–913. doi: 10.1038/ng2088.
O’Connell J, Gurdasani D, Delaneau O, et al. A General Approach for Haplotype Phasing across the Full Spectrum of Relatedness. PLoS Genet. 2014;10:e1004234. doi: 10.1371/journal.pgen.1004234.
Pool JE, Corbett-Detig RB, Sugino RP, et al. Population Genomics of Sub-Saharan Drosophila melanogaster: African Diversity and Non-African Admixture. PLoS Genet. 2012;8:e1003080. doi: 10.1371/journal.pgen.1003080.
Shapiro JA, Huang W, Zhang C, et al. Adaptive genic evolution in the Drosophila genomes. Proceedings of the National Academy of Sciences. 2007;104:2271–2276. doi: 10.1073/pnas.0610385104.
Stephens M, Smith NJ, Donnelly P. A New Statistical Method for Haplotype Reconstruction from Population Data. The American Journal of Human Genetics. 2001;68:978–989. doi: 10.1086/319501.
Utsunomiya YT, Pérez O’Brien AM, Sonstegard TS, Sölkner J, Garcia JF. Genomic data as the “hitchhiker’s guide” to cattle adaptation: tracking the milestones of past selection in the bovine genome. Frontiers in Genetics. 2015;6. doi: 10.3389/fgene.2015.00036.
Williams AL, Patterson N, Glessner J, Hakonarson H, Reich D. Phasing of Many Thousands of Genotyped Samples. The American Journal of Human Genetics. 2012;91:238–251. doi: 10.1016/j.ajhg.2012.06.013.

