A New Fast Phasing Method Based On Haplotype Subtraction

Evelina Mocci; Marija Debeljak; Alison P Klein; James R Eshleman

doi:10.1016/j.jmoldx.2018.12.004

2019 May;21(3):427–436. doi: 10.1016/j.jmoldx.2018.12.004

PMCID: PMC6504677 PMID: 30872187

Abstract

We developed a novel phasing approach, based solely on molecules and genotype frequency, that does not rely on inference of new alleles. We initiated the project because of errors that were detected in the phased 1000 Genomes Project data. The algorithm first combined identical genotypes into clusters and ranked them by descending frequency. Using alleles defined in homozygotes, it combined them pairwise to produce expected genotypes, which were dismissed, and subtracted the known alleles from the remaining genotypes to define putative new alleles. Putative alleles had to be confirmed by identifying them in independent genotypes, and the process was iterated until all alleles were identified. The new approach was validated using single-molecule sequencing of eight loci: 145 alleles (8 to 35 per locus) were identified, and an average of 98.2% (range, 95.0% to 99.9%) of 1000 Genomes Project individuals at these loci were explained. The accuracy of the new method, and of PHASE and SHAPEIT2, was assessed against the experimentally determined genotypes based on single-molecule sequencing. Our method was comparable to PHASE and SHAPEIT2 in accuracy but was, on average, 14.6- and 10.8-fold faster, respectively.

Phasing genetic variants is important in several clinical situations. In the field of human genetics, closely linked restriction fragment length polymorphism markers were historically used to perform molecular diagnostics before the disease-causing mutation could be directly detected.¹ Furthermore, phasing two different partially defective variants to the same CFTR allele, in combination with a fully defective allele, such as ΔF508, manifested as the full phenotype of cystic fibrosis.2, 3, 4 For multigene mendelian recessive syndromes, one of the genes can be implicated if two mutations can be phased to distinct alleles.⁵ In cancer genetics, documenting that two deleterious mutations are on separate alleles is useful to confirm complete inactivation of a tumor-suppressor gene in a given cancer.4, 6 Within a given HLA gene, phasing is critical for correct allele assignment.⁷

Various molecular methods have been developed over the years to solve the phasing problem. These include chromosome/allele separation strategies8, 9, 10; long-range sequencing strategies of bulk molecules, including Nanopore (Oxford, UK) and Pacific Biosciences (Menlo Park, CA) technologies11, 12, 13; and genetic phasing of families.¹⁴ The company 10x Genomics (Pleasanton, CA) has developed a phasing strategy that multiply labels large DNA molecules with the same barcode, thereby providing linked reads.¹⁵ There are also many computational strategies to perform phasing (Discussion), the most popular of which are PHASE version 2.1 (Stephens Lab, Chicago, IL) and SHAPEIT2 version 2.r671.MacOSX.16, 17, 18

Previously, >4300 regions containing nine or more single-nucleotide polymorphisms (SNPs) within a region of 300 bases were identified in the human genome, and a method was developed that uses haplotype counting for ultrasensitive (1:10,000) detection of human DNA mixes.¹⁹ It was then validated using seven additional polymorphic loci; however, because the loci selected are some of the most SNP-dense loci in the human genome, they are extremely challenging to phase.²⁰ During this process, some discrepancies were revealed between our alleles, identified experimentally by single-molecule sequencing, and those in the publicly available phased human genome data for these loci.²¹ Discrepancies in the phased 1000 Genomes Project data have also been identified recently by an independent group.²²

Accordingly, we developed a molecule-based approach that relies on genotype frequency and direct subtraction but does not rely on inference of novel alleles. This approach relies on clustering the genotypes, first defining alleles from the homozygotes, then eliminating anticipated heterozygotes, and using the remaining heterozygotes to iteratively identify and confirm additional alleles. This is accomplished by testing whether only one of the previously identified alleles can contribute to a given heterozygous genotype, and if so, subtracting it to generate a putative new allele. Developing a base-10 lookup table facilitated our analysis. The same genotypes were also phased using PHASE and SHAPEIT2, and all three methods were compared with the known alleles previously defined experimentally using single-molecule sequencing.²⁰ We conclude that our method is slightly more accurate compared with PHASE and SHAPEIT2 and is an average of 14.6-fold (range, 5.6- to 49.8-fold) and 10.8-fold (range, 2.1- to 14.4-fold) faster, respectively; and it will likely become more useful as more whole genomes become publicly available.

Materials and Methods

Comparison of 1000 Genomes Project Phased Data with Experimental Data

In previous work, the 1000 Genomes Project database was scanned for regions containing at least nine SNPs within a 300-base window, and 4349 such SNP-dense regions were identified.¹⁹ From these, eight loci were chosen to experimentally validate because they were the most informative and were located on different chromosomes. Single-molecule sequencing for these regions was performed in 45 individuals from three ethnicities from 1000 Genomes Project samples using Ion Torrent sequencing technology (Thermo Fisher Scientific, Waltham, MA), and an average of 12.4 unique alleles per locus were identified.²⁰

Of the 45 samples, 30 were in the database, and the experimentally determined phased alleles²⁰ were compared with the publicly available phased 1000 Genomes Project database for the same 30 samples. The sequences were manually aligned to minimize the number of differences, and the number of perfectly aligned alleles was recorded. For imperfect alignments, the number of SNPs on the incorrect allele, the number of events in which an SNP or a group of SNPs was on the incorrect allele, and the number of crossovers required to assemble the correct allele were recorded. Examples of this analysis are provided as Supplemental Figure S1.

Lookup Table

The genotype at each SNP was converted to a number using the following strategy: A/A = 0, C/C = 1, G/G = 2, T/T = 3, A/C = 4, A/G = 5, A/T = 6, C/G = 7, C/T = 8, and G/T = 9 (Table 1). Accordingly, the genotype containing three SNPs, A/A-C/G-G/T, would be represented as the number 079. Conversely, numerical representations can be converted back to their genotypes; for example, the number 368 converts back to the genotype T/T-A/T-C/T. If every digit in a number is 3 or below, then the genotype is homozygous; accordingly, the number 0123 would represent the homozygous genotype A/A-C/C-G/G-T/T. The same number could also be used to represent the haplotype A-C-G-T.

Table 1.

Lookup Table Converting between Genotype and Base-10 Numbers

Base or bases	Numerical representation
A or A/A	0
C or C/C	1
G or G/G	2
T or T/T	3
A/C	4
A/G	5
A/T	6
C/G	7
C/T	8
G/T	9
A + C = A/C	0 with 1->4
A + G = A/G	0 with 2->5
A + T = A/T	0 with 3->6
C + G = C/G	1 with 2->7
C + T = C/T	1 with 3->8
G + T = G/T	2 with 3->9
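The conversion in Table 1 can be sketched in a few lines of Python. This is a minimal illustration rather than the code from Supplemental Code S1; the function names (`encode`, `decode`, `is_homozygous`, `as_haplotype`) are hypothetical, and the sketch stores codes as strings (an assumption of this example) so that a leading 0, as in 079, is preserved.

```python
# Digit codes from Table 1: homozygous calls are 0-3, heterozygous calls are 4-9.
CODE = {"A/A": "0", "C/C": "1", "G/G": "2", "T/T": "3",
        "A/C": "4", "A/G": "5", "A/T": "6",
        "C/G": "7", "C/T": "8", "G/T": "9"}
DECODE = {digit: call for call, digit in CODE.items()}


def encode(genotype):
    """Convert a genotype such as 'A/A-C/G-G/T' to its digit string."""
    return "".join(CODE[call] for call in genotype.split("-"))


def decode(digits):
    """Convert a digit string such as '368' back to its genotype."""
    return "-".join(DECODE[d] for d in digits)


def is_homozygous(digits):
    """A genotype is homozygous when every digit is 3 or below."""
    return all(d in "0123" for d in digits)


def as_haplotype(digits):
    """Read a homozygous code as a haplotype, e.g. '0123' -> 'A-C-G-T'."""
    if not is_homozygous(digits):
        raise ValueError("only homozygous codes map directly to a haplotype")
    return "-".join("ACGT"[int(d)] for d in digits)


print(encode("A/A-C/G-G/T"))   # 079
print(decode("368"))           # T/T-A/T-C/T
print(as_haplotype("0123"))    # A-C-G-T
```

Because all codes at a locus have the same length, sorting the strings gives the same order as sorting the corresponding numbers.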


Phasing Methods

Genotypes were reconstructed by unphasing 1000 Genomes Project phase 3 data for these regions,²¹ and these data were the only input into the algorithm. These genotypes were phased using our method and with two other computational methods (PHASE and SHAPEIT2), and then all of the haplotypes were compared with experimental results. The phasing method is described in detail in Results. The method was not compared with other phasing methods, such as Beagle,²³ FASTPHASE,²⁴ WinHap2,²⁵ or HARSH.²⁶ The output of the new phasing algorithm is the alleles identified and their frequency. The code for this phasing approach is included as Supplemental Code S1 and has been deposited on GitHub (https://github.com/sevi2018/Project, last accessed October 22, 2018).

Experimental Confirmation of Alleles Identified in Silico

Several algorithmically predicted alleles were confirmed by identifying 1000 Genomes Project samples containing the putative new allele and performing single-molecule sequencing on the Ion Torrent Personal Genome Machine (Thermo Fisher Scientific) using primers and conditions previously described.²⁰ Samples NA18909 and NA19190 were sequenced for TFB2M; samples NA18508, NA18599, and NA19257 were sequenced at the FARP1 locus; and samples NA18538 and NA18596 were sequenced for the MT4 locus. Confirmation was limited to these seven samples because they had been obtained previously.

PHASE and SHAPEIT2

The method was compared with estimated haplotypes obtained by two computational methods, PHASE¹⁶ and SHAPEIT2.¹⁷ PHASE uses a Bayesian algorithm. Inference is based on coalescent theory, in which new haplotypes evolve through mutation and recombination; given that these events have small probabilities over short genetic distances (with the exception of hotspots), the new haplotypes will be similar to ancestral haplotypes. PHASE output gives haplotype probability. Depending on the genetic region and the sample, it is possible to obtain two or more possible haplotypes. For these samples, the haplotype with the highest probability (>0.5) was selected. Individuals with two or more haplotypes with a similar probability were reported as ambiguous; 73 individuals had ambiguous haplotypes.

The SHAPEIT2 algorithm is based on PHASE¹⁷; however, because it selects a set of candidate haplotypes for each individual, it is computationally approximately 150 times faster. To increase the accuracy of SHAPEIT2, both the number of iterations and conditioning states on which haplotype estimation is based were increased (both of these parameters were tripled compared with default settings). For each region, SHAPEIT2 was run 10 times, and the haplotype most frequently represented was reported as the final one.

The input genotypes for the three phasing methods were obtained by unphasing data from the 1000 Genomes Project phase 3 website; this project includes 84.4 million variants from 2504 unrelated individuals selected from 26 different populations (The International Genome Sample Resource, http://www.internationalgenome.org/data, last accessed October 22, 2018).²⁷

HLA-A Genotype Comparisons

For the HLA-A alleles A*01:01:01:01, A*29:01:01:01, A*23:01:01:01, and A*36:01 (hereafter simply A1, A29, A23, and A36), alignments were performed using the Immuno Polymorphism Database (International ImMunoGeneTics Project/HLA release 3.31.0; Anthony Nolan Research Institute, The Royal Free Hospital, London, UK)²⁸ for all eight exons. A given base was considered informative if any of the four alleles varied from the other three. A genotype was constructed by combining alleles A1 and A29, and a second genotype was constructed from alleles A23 and A36. The two genotypes were compared counting the number of informative SNPs that were different at the genotype level.

Results

Analysis of Existing 1000 Genomes Project Phased Data

We recently became interested in using multi-SNP haplotypes to overcome the error rates in next-generation sequencing in the analysis of mixed human DNA samples, such as those in patients after bone marrow transplantation.¹⁹ A set of eight short (222 to 395 bp long) SNP-dense loci that can be used for ultrasensitive (1:10,000) detection of DNA mixtures was validated.²⁰ These regions were sequenced in 45 individuals (15 individuals × 3 populations) using single-molecule sequencing (Ion Torrent; average depth of coverage, >1000-fold). The sequences were compared with the available phased 1000 Genomes Project data for the 30 of the 45 individuals who were included in that database.²¹ The SORCS2 locus was eliminated because some alleles contain an insertion/deletion, and it was difficult to evaluate them consistently. For the remaining seven loci, 210 (7 × 30) genotypes were examined; 69 were excluded because they were either homozygous or contained a single heterozygous SNP and, therefore, could not be used to evaluate phasing. Of the 141 remaining genotypes containing two or more heterozygous SNPs, only 20 (13%) were perfectly phased (Supplemental Tables S1 and S2). Three quantities were determined: the total number of SNPs on the incorrect allele (average, 1.2 per locus), the number of events in which an SNP or a group of SNPs was on the incorrect allele (average, 1.7 per locus), and the number of crossovers that would be required on the current phased data to assemble the correct allele (average, 2.0 per locus). Examples of this approach are illustrated in Supplemental Figure S1. In this regard, the seven selected loci were exceedingly SNP dense (average, 16.6 SNPs per locus), and so the percentage that phased correctly would likely be higher for a simpler locus that contained fewer SNPs.

Proof of Principle of the New Strategy Using HLA-A

Phasing is likely correct, with a high degree of certainty, when the individual is homozygous at a given locus, especially when multiple individuals are homozygous for the same allele. Although this approach alone would be adequate to identify high-frequency alleles, it would be insufficient to identify low-frequency alleles because homozygotes bearing these alleles would not likely be represented in a limited data set. However, these alleles might be identifiable after first identifying the high-frequency alleles from the homozygotes and subsequently examining heterozygotes.

To establish proof of principle that this could be accomplished, a genotype was constructed from the HLA-A alleles A1 and A3 (Figure 1A). A set of HLA-A alleles were examined to eliminate those that contained a base not present in the genotype (Figure 1B). Finally, all possible combinations of the remaining alleles were constructed to demonstrate that only one combination would produce the desired genotype (Figure 1C). The method fundamentally works because the number of alleles for a given set of SNPs is few compared with those theoretically possible (2ⁿ) (Discussion). Of all possible combinations of exon 3 HLA-A alleles, only one ambiguous genotype was discovered, but was distinct at the full gene level (Discussion).

Figure 1. Proof of principle of the approach using *HLA-A*. A: Construction of an *A1/A3* heterozygous (Het) genotype (**boxed area**) from alleles A1 and A3. B: Elimination of certain alleles (**red text with strikethrough**) because they contain a base at a given position that is not present in the *A1/A3* genotype. C: *In silico* heterozygotes produced from all remaining alleles in B. Genotypes with bases at a given position that do not match the *A1/A3* genotype are eliminated (**red text with strikethrough**) to demonstrate that only *A1/A3* remains (**blue boxed area**). SNP, single-nucleotide polymorphism.
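The proof-of-principle check can be sketched as follows: alleles carrying a base absent from the target genotype are eliminated, and the survivors are combined pairwise to see which pairs reproduce the genotype. The four-base sequences and the names X, Y, and Z below are toy values invented for illustration, not real HLA-A sequences, and the function names are hypothetical.

```python
from itertools import combinations


def genotype_of(h1, h2):
    """Unphased genotype of two haplotypes: one set of bases per SNP."""
    return tuple(frozenset((a, b)) for a, b in zip(h1, h2))


def compatible(allele, genotype):
    """An allele can contribute only if each of its bases occurs at that SNP."""
    return all(base in site for base, site in zip(allele, genotype))


def explaining_pairs(genotype, alleles):
    """Pairs of compatible alleles whose combination reproduces the genotype."""
    kept = {name: seq for name, seq in alleles.items() if compatible(seq, genotype)}
    return [(n1, n2) for n1, n2 in combinations(sorted(kept), 2)
            if genotype_of(kept[n1], kept[n2]) == genotype]


# Toy alleles (assumed for illustration only).
toy = {"A1": "ACGT", "A3": "GCTT", "X": "GCGT", "Y": "ATGT", "Z": "GCTA"}
target = genotype_of(toy["A1"], toy["A3"])

print(explaining_pairs(target, toy))  # [('A1', 'A3')]
```

In this toy set, Y and Z are eliminated in the first step, X survives it, and only the A1/A3 pair reproduces the target genotype.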

Novel Phasing Strategy

The overall strategy is detailed in Figure 2. Each of the 1000 Genomes Project genotypes is first converted to a number using a conversion table (Materials and Methods) (Table 1). This table allows one to translate the genotype into base-10 numbers and can be used to go back and forth between a string of SNPs in a given genotype and its corresponding number (Figure 2). For example, individuals homozygous for A/A, C/C, G/G, and T/T at a given SNP are represented by the integers 0 to 3, respectively; heterozygous SNPs A/C, A/G, A/T, C/G, C/T, and G/T are represented by numbers 4 to 9, respectively. Homozygous genotypes are then grouped, ranked by descending frequency, and assigned Hom-allele letters (a Hom-allele is defined as an allele that was identified from homozygotes) (Figure 2). From the Hom-alleles, all of the theoretical heterozygote genotypes are produced and designated in silico heterozygotes [computer-generated genotypes produced by combining two Hom-alleles (IS-Hets)] (Figure 2). They are converted to numerical form and ranked in ascending order to generate a lookup table.

Figure 2. Flowchart of new phasing approach. Nine steps outlined in Results. C-het, clustered heterozygote; het-alleles, alleles defined from heterozygotes; hom-alleles, alleles defined from homozygotes; IS-het, *in silico* heterozygote; unknown-het, a heterozygous genotype not explained by an IS-het.

Each heterozygous individual's sample in the 1000 Genomes Project is then examined to determine whether the individual is an expected IS-Het (identified in step 3) or an unanticipated heterozygote (Figure 2). The expected heterozygotes are discarded, and the unknown heterozygotes are then grouped by identical genotype and ranked by descending frequency to produce clustered heterozygotes [defined from experimental 1000 Genomes Project data as identical genotypes clustered together (C-hets)] (Figure 2). Each of the existing Hom-alleles is then compared with the most frequent clustered heterozygote genotype to determine whether it could be one of the two alleles composing that genotype (Figure 2). If one, and only one, of the Hom-alleles can contribute, then that allele is subtracted from the genotype to reveal the second (novel) allele (Figure 2). If multiple previously defined alleles could contribute to that genotype, then that genotype is ignored and the next most frequent one is analyzed. For example, if three Hom-alleles were A-C-A, A-A-T, and A-C-G and given the heterozygous genotype A-C/G-A/T, then only the first allele could be in that genotype, and subtracting the first allele would yield the new allele A-G-T. (The second allele is ineligible because the heterozygous genotype does not contain an adenine at the second position; the third is ineligible because the genotype lacks a guanine at the third position.) The putative new allele is then used in combination with known alleles to reconstruct its IS-Hets, and the new allele is confirmed as valid if it is found with a different established allele in an independent genotype (Figure 2). If confirmed, then the new allele's IS-Hets are produced and excluded from the clustered heterozygotes and the next most frequent genotype is examined. The process is iterated until all alleles are identified, although genotypes are not resolved on the basis of a single individual because of the high risk that this apparent genotype contains an error in sequencing. Finally, the data are compared with the preexisting experimentally demonstrated haplotypes.
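The subtraction step can be illustrated with the three-SNP example given in the text. This is a minimal sketch with hypothetical function names; it covers only the "one, and only one, contributing allele" rule and omits the later confirmation of the putative allele in an independent genotype.

```python
def subtract(genotype, allele):
    """Return the partner haplotype if `allele` can be one of the two alleles
    in `genotype`, otherwise None. Sites are separated by '-', and a
    heterozygous site is written as 'X/Y'."""
    partner = []
    for call, base in zip(genotype.split("-"), allele.split("-")):
        bases = call.split("/")
        if base not in bases:
            return None                      # allele cannot contribute
        others = [b for b in bases if b != base]
        # Homozygous site: the partner carries the same base.
        partner.append(others[0] if others else base)
    return "-".join(partner)


def putative_allele(genotype, known_alleles):
    """Subtract only when exactly one known allele can contribute."""
    partners = [p for p in (subtract(genotype, a) for a in known_alleles)
                if p is not None]
    return partners[0] if len(partners) == 1 else None


hom = ["A-C-A", "A-A-T", "A-C-G"]
print(putative_allele("A-C/G-A/T", hom))  # A-G-T
```

When more than one known allele could contribute, `putative_allele` returns `None`, mirroring the rule that such a genotype is skipped and the next most frequent one is analyzed.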

Blinded Analysis of TMPRSS15

The TMPRSS15 locus was analyzed and blinded (E.M. and A.P.K.) to the alleles that were previously identified experimentally using single-molecule sequencing (Figure 3). This allowed evaluation of whether this approach could be used for phasing in general. After converting to numerical form, there were 646 homozygous samples identified from the 2504 individuals (26%) in the 1000 Genomes Project data, and these were clustered to define six distinct Hom-alleles (Figure 3, A and B, and Supplemental Figure S2A). For example, there were 273 A/A homozygotes and 126 B/B homozygotes. The six Hom-alleles were combined pairwise to produce 15 distinct IS-Hets (Figure 3C and Supplemental Figure S2B), and these were converted to numerical form (Figure 3C).

Figure 3. Blind analysis (E.M. and A.P.K.) of the *TMPRSS15* locus. A: Convert genotypes to code using a lookup table (Table 1). B: Identify Hom-alleles. Shown are the six Hom-alleles (A to F), their base sequence, their numeric code, the number of individuals homozygous for that allele in 1000 Genomes Project (1000g), and the frequency (Freq) among the 2504 individuals. C: *In silico* heterozygotes (IS-Hets) produced from Hom-alleles A to F. Shown are 3 (A + C, A + B, and A + D) of the 15 possible combinations. D: Cluster 1000g Hets (C-hets), ranked by descending frequency. The three most common C-hets are present in 258, 241, and 169 individuals in 1000 Genomes Project and are explained by the three IS-Hets shown in C. The 15th most common genotype present in 18 individuals is in the **boxed area**. E: Discovery of allele G. Of the six Hom-alleles (alleles A to F), only allele A can be included in the cluster 15 genotype. Subtraction of allele A from this genotype reveals the new allele G. F: Additional IS-Hets containing the G allele. The putative allele G is confirmed because it is also found in combination with allele D in cluster 16. Nb, number of individuals in 1000 Genomes Project with that genotype. Hom-alleles, alleles defined from homozygotes.

The heterozygous individuals from 1000 Genomes Project were then clustered and sorted by decreasing frequency (Figure 3D and Supplemental Figure S2C). Among the clustered heterozygotes, cluster 1, cluster 2, and cluster 3 were present in 258, 241, and 169 individuals, respectively; and the comparison of the numbers to the IS-Hets allowed identification of these clusters as genotypes A + C (A/C), A + B (A/B), and A + D (A/D), respectively. In fact, the top 14 most frequent clustered 1000 Genomes Project heterozygotes could explain 54% of individuals, and after combining this with the 26% homozygotes, only 20% of the genotypes were unexplained (Supplemental Figure S2C). Clusters 1 to 14 and 17 were anticipated, whereas clusters 15, 16, and 18 to 28 were unanticipated.

The genotype in cluster 15 (3201213108313, where the 8 at the 10th position marks the only heterozygous SNP), present in 18 individuals, was unanticipated (Figure 3D). In this genotype, all SNPs are homozygous (represented by 0, 1, 2, or 3), except for one SNP that was heterozygous C/T. Comparing this genotype with the documented Hom-alleles implicated the A allele because none of the other Hom-alleles matched at all of the homozygous SNPs (eg, the other five Hom-alleles all start with CAG and are, therefore, excluded). The full set of the unexplained heterozygotes was compared with each of the known Hom-alleles individually to determine whether each of them could contribute to that genotype (Supplemental Figure S2D).

Subtraction of allele A (3201213103313, containing a 3 or T at the 10th position) yielded 3201213101313 (containing a 1 or C at that position), and this was designated as putative new allele G (Figure 3E and Supplemental Figure S2E). IS-Hets generated using allele G in combination with alleles A to F confirmed allele G in a second 1000 Genomes Project heterozygous cluster, cluster 16 (D/G in 15 individuals; Figure 3F), in addition to the heterozygous clusters E/G, B/G, F/G, and C/G (Supplemental Figure S2F).
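The discovery of allele G can be reproduced directly on the numeric codes, written here without digit grouping. The sketch below is illustrative only: `subtract_code` is a hypothetical helper, and the second allele code in the usage example is an invented stand-in for a Hom-allele beginning with CAG, not one of the reported sequences.

```python
# Bases present at a site for each Table 1 digit, and the reverse map
# for single bases.
SITE = {"0": "A", "1": "C", "2": "G", "3": "T", "4": "AC",
        "5": "AG", "6": "AT", "7": "CG", "8": "CT", "9": "GT"}
DIGIT = {"A": "0", "C": "1", "G": "2", "T": "3"}


def subtract_code(genotype, allele):
    """Digit-wise subtraction of a known allele code from a genotype code.
    Returns the partner allele code, or None if the allele cannot contribute."""
    out = []
    for g, a in zip(genotype, allele):
        bases, base = SITE[g], SITE[a]
        if base not in bases:
            return None
        # Heterozygous site: keep the other base; homozygous site: same base.
        partner = bases.replace(base, "") or base
        out.append(DIGIT[partner])
    return "".join(out)


cluster_15 = "3201213108313"
allele_a = "3201213103313"
print(subtract_code(cluster_15, allele_a))         # 3201213101313
print(subtract_code(cluster_15, "1021213103313"))  # None (starts with CAG)
```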

Allele E was then subtracted from cluster 23 (Supplemental Figure S2G) to reveal the new putative allele H (Supplemental Figure S2H), which was then confirmed because it was found in other genotypes (Supplemental Figure S2I), including A/H (six individuals in cluster 22), D/H (four individuals in cluster 24), and F/H (three individuals in cluster 25). The last three individuals were not explained (Supplemental Figure S2J) because their genotypes were only present once, and there can be some sequencing errors in the database. After completion of the blind analysis (based on only the 1000 Genomes Project genotypes; E.M. and A.P.K.), the alleles determined by this new method were compared with those previously demonstrated experimentally and reported (Supplemental Figure S2K).²⁰ The seven alleles (A to F and H) corresponded to alleles 1 to 7 previously assigned, whereas allele G was not identified in the previous (limited) sequencing of 30 individuals.²⁰

Blinded Analysis of the Remaining Six Loci

Having developed the strategy with HLA-A (unblinded) and after analyzing TMPRSS15 (blinded to the experimentally documented alleles; E.M. and A.P.K.), the same analysis was then performed for the remaining six loci. Table 2 summarizes the results of the analyses for all eight loci. The most prevalent alleles at a given locus were discovered in homozygotes, as expected, and all of them were observed in the 45 samples initially genotyped. Loci tended to fall into one of two categories. The first, such as CSMD1 and TMPRSS15, had six Hom-alleles, but after combining these, most of the heterozygotes were anticipated, and there were only two more Het-alleles (an allele identified from heterozygotes) that were detected from unanticipated heterozygotes. For other loci, the number of alleles identified among heterozygotes approached or exceeded those identified among homozygotes. For example, the MT4 locus located on chromosome 16 was unusually polymorphic (29 alleles), where 17 alleles were identified among Hom-alleles and 12 alleles were from Het-alleles. This degree of polymorphism (29 alleles) rivaled that of the HLA genes (21 alleles for HLA-A and 35 alleles for HLA-B).

Table 2.

Alleles in Each Locus

Locus	Chr	SNPs, n	Hom-alleles (n)	Het-alleles (n)	Total alleles, n	Confirmed by NGS
TFB2M	1	8	A-G (7)	H-L (5)	12	A/H, C/H
SORCS2	4	11	A-I (9)	J-S (10)	19
HLA-A	6	9	A-H (8)	I-U (13)	21
HLA-B	6	29	A-L (12)	K-IZ (23)	35
CSMD1	8	14	A-F (6)	G, H (2)	8
FARP1	13	13	A-F (6)	G-M (7)	13	A/G, B/H, E/H
MT4	16	12	A-Q (17)	R-CZ (12)	29	A/R, B/R
TMPRSS15	21	12	A-F (6)	G, H (2)	8


Chr, chromosome; het-alleles, alleles that were defined from heterozygotes; hom-alleles, alleles that were defined from homozygotes; NGS, next-generation sequencing; SNP, single-nucleotide polymorphism.

Experimental Confirmation of Computationally Predicted Alleles

To test whether the approach is correct, a series of alleles was selected that had been predicted using the new phasing strategy (in silico) and that had never been previously sequenced experimentally. The specific 1000 Genomes Project samples (previously purchased) predicted to contain these alleles were then identified, and PCR was performed on these seven samples, followed by Ion Torrent–based single-molecule sequencing, as previously described.²⁰ Allele H was confirmed at the TFB2M locus by identifying it as A + H in sample NA18909 and C + H in sample NA19190. For FARP1, G was demonstrated as A + G (NA18599), in addition to H as both B + H (NA19257) and E + H (NA18508). Finally, at the MT4 locus, the allele R was confirmed as both A + R (NA18538) and B + R (NA18596).

Comparison of Computational Phasing Methods

The exact same dephased 1000 Genomes Project phase 3 genotypes used in the analysis above were also phased for these loci using both PHASE and SHAPEIT2 software. To evaluate the accuracy of our method, the estimated haplotypes were compared with the experimental sequences of the 30 samples sequenced in 1000 Genomes Project for all eight loci (Table 3). The table also shows the comparison of the haplotypes inferred with PHASE and SHAPEIT2 with the experimental data. Overall, this method performs better than both PHASE and SHAPEIT2, with 93% of alleles/haplotypes in agreement with the experimental data, versus 91.7% and 88.3% for PHASE and SHAPEIT2, respectively. The comparison with the other two computational methods was then extended to the whole 1000 Genomes Project data set; among the 20,032 expected genotypes (2504 samples × 8 loci), the three methods were compared for 19,532 genotypes or 39,064 inferred haplotypes (Supplemental Table S3). This method shows high agreement with both PHASE and SHAPEIT2 estimated haplotypes, 96.8% and 96.6%, respectively; whereas the agreement between PHASE and SHAPEIT2 estimates was a little lower, 94.9%.

Table 3.

Concordance between Methods for Experimentally Confirmed Alleles

Genes	Method	Concordant alleles, n/Total n^∗	Concordant alleles, %
TFB2M (Chr1)	Our method	29/29	100
	PHASE	29/30	97
	SHAPEIT2	28/30	93
SORCS2 (Chr4)	Our method	27/29	93
	PHASE	29/30	97
	SHAPEIT2	28/30	93
HLA-A (Chr6)	Our method	24/30	80
	PHASE	24/30	80
	SHAPEIT2	23/30	77
HLA-B (Chr6)	Our method	26/28	93
	PHASE	26/30	87
	SHAPEIT2	25/30	83
CSMD1 (Chr8)	Our method	25/29	86
	PHASE	24/30	80
	SHAPEIT2	24/30	80
FARP1 (Chr13)	Our method	28/29	97
	PHASE	29/30	97
	SHAPEIT2	29/30	97
MT4 (Chr16)	Our method	27/28	96
	PHASE	29/30	97
	SHAPEIT2	30/30	100
TMPRSS15 (Chr21)	Our method	30/30	100
	PHASE	30/30	100
	SHAPEIT2	25/30	83


Chr, chromosome.

^∗ The denominator varies as our method does not evaluate genotypes present only once.
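As a consistency check, pooling the per-locus counts in Table 3 reproduces the overall concordance figures quoted in the text. The sketch below simply sums the n/Total n columns in table order (TFB2M through TMPRSS15); the dictionary layout and the function name are choices made for this example.

```python
# (concordant alleles, total alleles) per locus, in Table 3 order.
TABLE3 = {
    "Our method": [(29, 29), (27, 29), (24, 30), (26, 28),
                   (25, 29), (28, 29), (27, 28), (30, 30)],
    "PHASE":      [(29, 30), (29, 30), (24, 30), (26, 30),
                   (24, 30), (29, 30), (29, 30), (30, 30)],
    "SHAPEIT2":   [(28, 30), (28, 30), (23, 30), (25, 30),
                   (24, 30), (29, 30), (30, 30), (25, 30)],
}


def pooled_concordance(rows):
    """Percentage of concordant alleles pooled over all loci."""
    concordant = sum(n for n, _ in rows)
    total = sum(t for _, t in rows)
    return round(100 * concordant / total, 1)


for method, rows in TABLE3.items():
    print(method, pooled_concordance(rows))
# Our method 93.1
# PHASE 91.7
# SHAPEIT2 88.3
```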

Computational Time

The time each of the three methods required to complete the phasing problem was recorded for each of the eight loci (Figure 4). A bar graph of time (in seconds) was produced for comparison of our method to PHASE and SHAPEIT2. The ratio of times was calculated and averaged across all eight loci. Our method was an average of 14.6-fold (range, 5.6- to 49.8-fold) faster than PHASE and 10.8-fold (range, 2.1- to 14.4-fold) faster than SHAPEIT2.

Discussion

We demonstrate a new and faster method of phasing that is molecule and frequency based. It was validated for eight highly SNP-dense loci compared with alleles determined by single-molecule sequencing. For most loci, the new method performed comparably to PHASE and SHAPEIT2, but it was slightly better for three loci.

One step in this method overlaps with the method described by Dr. Andrew Clark²⁹ in 1990, defining an initial set of alleles from homozygotes. Our method then clusters all of the heterozygotes and discards the anticipated ones. A novel aspect of this algorithm is that it subtracts documented alleles from high-frequency genotypes to define putative new alleles. These potential alleles are only considered real if they can be confirmed in combination with an independent documented allele. Using the mentioned strategy, some of the most polymorphic regions of the human genome have been successfully phased. Applying it to less polymorphic regions would be easier as fewer individuals should be needed to identify all of the alleles. The strategy is not intrinsically connected to the 1000 Genomes Project database and could be applied to other data sets.

PHASE was used to phase the genotypes of the HapMap project,³⁰ and it is considered the most accurate method to date for building haplotypes³⁰; however, the analysis time grows exponentially with the number of SNPs.¹⁶ The SHAPEIT2 algorithm has been further developed to be faster by adding a preliminary step, which generates subsets of candidate haplotypes for each individual¹⁷ and can be run efficiently on whole chromosomes.¹⁷ A major use of SHAPEIT2 is for haplotype estimation of genome-wide association study samples to speed up imputation from a large reference panel of haplotypes, such as 1000 Genomes Project.¹⁷

This method was not able to phase all 2504 individuals for all of the eight loci. Among 20,032 expected genotypes, 360 were not identified (1.8%) (Supplemental Table S4), in part because the unknown heterozygous genotypes observed only once were eliminated (because they might represent genotyping errors). Although this method requires that an allele be represented at least twice in the data set, rare variants with a minor allele frequency as low as 0.0008 were successfully identified. However, private variants (present only once in the database) were not tested, as one cannot be sure whether they are truly a private variant or a sequencing error of a common variant.

Applying these criteria reduces the number of samples that can be phased relative to the other methods (19,869 and 20,032 for SHAPEIT2 and PHASE, respectively). However, the haplotypes estimated using this method were unambiguous, and when compared with experimental data, they showed the highest concordance rates versus the other computational methods.

The number of alleles at a given locus varied considerably. Some of the most polymorphic loci were selected among the 4300 loci previously reported to be the most SNP dense and highly polymorphic in the human genome.¹⁹ For some loci, the number of alleles readily observable was relatively few compared with the number theoretically possible. For example, at the CSMD1 locus, there are 14 biallelic SNPs, the combination of which could result in a total of 16,384 alleles (2¹⁴), but in reality, only 10 alleles were documented among the 2504 individuals analyzed.²⁰ This is consistent with observations by others.³¹ In contrast to the number of alleles at CSMD1, the MT4 locus is much more polymorphic (29 alleles among the 2504 individuals). This number of alleles (29) exceeded the number demonstrated at the HLA-A locus (21 alleles), although it is lower than the number at the HLA-B locus (35 alleles). The reason for this level of diversity in the intergenic region of a family of heavy metal binding proteins, in contrast to a locus such as HLA where a high degree of diversity was likely evolutionarily selected, is unknown.32, 33 In fact, the HLA-B gene has nearly 5000 distinct alleles.28, 34 As the number of human genomes sequenced increases, more alleles will be able to be defined in homozygotes; however, heterozygotes will always be needed to define less common alleles, which are too infrequent to be represented in homozygotes.

All of these steps were performed and automated with the help of a lookup table of 10 numbers, one for each possible unordered pair of the four DNA bases (four homozygous and six heterozygous calls). This table allows the genetic information to be stored as numbers instead of strings, thereby facilitating sorting, comparison of genotypes, and algebraic formulas. We did not find a previous report that represented genotypes as base-10 numbers, although Griffin and Smith³⁵ reported a base-3 approach to represent genotypes.
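A lookup table of this kind might be sketched as below. The digit assignment used here (A, C, G, T as 0 to 3, then A/C, A/G, A/T, C/G, C/T, G/T as 4 to 9) is an assumption made for illustration; the actual table is the one given in Table 1. The names `build_lookup`, `LOOKUP`, and `encode` are hypothetical.

```python
from itertools import combinations

BASES = "ACGT"


def build_lookup():
    """Map each unordered base pair to one digit (ASSUMED ordering, see lead-in)."""
    lookup = {}
    # Homozygous calls A, C, G, T take digits 0-3.
    for digit, base in enumerate(BASES):
        lookup[frozenset(base)] = digit
    # Heterozygous calls A/C, A/G, A/T, C/G, C/T, G/T take digits 4-9.
    for digit, pair in enumerate(combinations(BASES, 2), start=4):
        lookup[frozenset(pair)] = digit
    return lookup


LOOKUP = build_lookup()


def encode(genotype):
    """Encode a genotype such as 'A/G-C-C/T' as a string of one digit per SNP."""
    digits = []
    for site in genotype.split("-"):
        # 'A/G' and 'G/A' give the same frozenset, so base order is irrelevant.
        digits.append(str(LOOKUP[frozenset(site.split("/"))]))
    return "".join(digits)


code = encode("A/G-C-C/T")
```

The sketch returns a digit string rather than an integer so that a leading 0 (a homozygous A at the first SNP under this assumed table) is not lost; converting with `int(code)` is possible when that is not a concern.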

The main advantage of this new phasing approach is its computational speed, presumably because of the lack of the need to infer alleles. There are limitations, however. The new method has only been demonstrated on relatively short regions (longest, approximately 400 bp); it is unclear if the speed advantage will extrapolate to larger regions. In addition, gene mutations that one would like to phase are not always sufficiently close to informative SNPs. The method currently ignores recombination. Finally, it is conceivable that some genotypes are ambiguous, in the sense that they could be produced by the combination of two different sets of alleles. Among all of the IS-Hets generated, only one genotype was found that was initially thought to be ambiguous for HLA-A (genotype 542858279: A/G-A/C-G-C/T-A/G-C/T-C-C/G-G/T), which can be produced from both A1 and A29, but also from A23 and A36 (Supplemental Table S5). However, this genotype is only ambiguous when considering the nine SNPs in the distal one-half of exon 3. When all eight exons were analyzed (Materials and Methods), 52 differences (71%) were found between the two genotypes among the 73 informative SNPs. Accordingly, no ambiguous genotypes were found among these loci.
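The ambiguity check discussed above can be illustrated with a short sketch: every pair of known alleles is combined into an in silico heterozygote, and any genotype produced by more than one pair is flagged. The allele names and sequences below are toy values, and `combine` and `find_ambiguous` are hypothetical helpers rather than the authors' code.

```python
from collections import defaultdict
from itertools import combinations


def combine(allele_a, allele_b):
    """Return the unphased genotype (one unordered base pair per SNP) of two alleles."""
    return tuple(frozenset((a, b)) for a, b in zip(allele_a, allele_b))


def find_ambiguous(alleles):
    """Return genotypes that can be produced by more than one pair of alleles."""
    producers = defaultdict(list)
    named = sorted(alleles.items())
    for (name_a, seq_a), (name_b, seq_b) in combinations(named, 2):
        producers[combine(seq_a, seq_b)].append((name_a, name_b))
    return {g: pairs for g, pairs in producers.items() if len(pairs) > 1}


# Toy alleles over two SNPs (hypothetical): H1+H2 and H3+H4 give the same genotype.
alleles = {"H1": "AC", "H2": "GT", "H3": "AT", "H4": "GC", "H5": "AA"}
ambiguous = find_ambiguous(alleles)
ambiguous_pairs = list(ambiguous.values())
```

Here the genotype A/G-C/T is flagged because both H1 with H2 and H3 with H4 produce it. As in the HLA-A case described above, extending the alleles with additional informative SNPs would be one way to resolve such a tie.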

Phasing is critical in human genetics in interpreting whether a variant is likely benign or pathogenic, among other situations.4, 36 Specifically, in autosomal dominant diseases, a pathogenic mutation in trans with a variant of unknown significance indicates that the variant of unknown significance is likely benign. Conversely, in an autosomal recessive condition, a variant of unknown significance in trans with a known pathogenic allele is evidence that the variant of unknown significance is likely pathogenic. Phasing is also essential for correctly identifying HLA alleles and for studying allele-specific expression. Finally, in cancer chemotherapy, certain drugs are synthetic lethal with biallelic inactivation of their partner genes, most notably the poly(ADP-ribose) polymerase inhibitors with breast cancer gene (BRCA) and BRCA-related genes,⁶ in addition to immune checkpoint inhibitors with mismatch repair defects.37, 38

In summary, we report a novel, fast method of phasing and demonstrate its ability to phase 2504 individuals from the 1000 Genomes Project at eight highly polymorphic loci. The method was validated using 30 samples that were experimentally genotyped and by subsequently genotyping seven samples to experimentally confirm alleles predicted solely by the new method. This algorithm will be useful in distinguishing compound heterozygotes from heterozygotes with a single allele bearing two mutations. It is likely to be increasingly useful as the number of publicly available whole genomes increases.

Acknowledgments

We thank Drs. Wayne Grody, Molly Sheridan, Lori Sokoll, Briana Vecchio-Pagán, Sarah Wheelan, Rena Xian, Garry Cutting, Elizabeth Pugh, Aravinda Chakravarti, Alan Scott, Christopher Gocke, Ephraim Fuchs, Annette Jackson, Maria Bettinotti, Ming-Tseh Lin, Brian Iglehart, and Jordan Brown for helpful discussions; and the participants who generously contributed their samples to the 1000 Genomes Project.

Footnotes

Supported in part by the Sol Goldman Pancreatic Cancer Research Center (J.R.E.), the STRINGER Foundation (J.R.E.), the Michael Rolfe Pancreatic Cancer Foundation (J.R.E.), the Mary Lou Wootton Pancreatic Cancer Research Fund (J.R.E.), and The Sidney Kimmel Comprehensive Cancer Center grant NCI P30CA006973.

E.M. and M.D. contributed equally to this work.

Disclosures: None declared.

Supplemental material for this article can be found at https://doi.org/10.1016/j.jmoldx.2018.12.004.

Supplemental Data

Supplemental Figure S1

Examples of phasing error scoring. Mock alleles and genotypes are shown. The true alleles in the sample are shown as experimental. Potential phased alleles are shown as 1000 genomes (1000g), illustrating various scenarios. Crossovers (C), number of single-nucleotide polymorphisms (SNP) involved (S), and events (E) are shown. A: A single SNP on the incorrect allele. B: Two SNPs on the incorrect allele. C: A single crossover event in the middle of five SNPs. D: A single SNP on the incorrect allele in combination with a single crossover event.

mmc1.pdf (32.8 KB)

Supplemental Figure S2

Detailed step-by-step analysis of TMPRSS15, according to the steps outlined in Figure 2. A: Steps 1 and 2. B: Step 3. C: Steps 4 and 5. D–I: There are two iterations of steps 6 to 8. Allele names are indicated in capital letters. Base-10 representation of genotypes is listed as code. J: List of unexplained clustered heterozygotes (C-hets). K: Step 9. 1000g, 1000 genomes; Het, heterozygous; Het-allele, an allele defined from heterozygotes; Het-cluster, identical genotypes grouped as a cluster; Hom-allele, an allele defined from homozygotes; IS-het, in silico heterozygote; Nb, number of individuals in 1000 genomes with that genotype.

mmc2.pdf (115.2 KB)

Supplemental Table S1

mmc3.docx (11.8 KB)

Supplemental Table S2

mmc4.xlsx (23.9 KB)

Supplemental Table S3

mmc5.docx (13.9 KB)

Supplemental Table S4

mmc6.docx (15.3 KB)

Supplemental Table S5

mmc7.xlsx (8.3 KB)

Supplemental Code S1

mmc8.docx (36.1 KB)

References

1. Boehm C.D., Antonarakis S.E., Phillips J.A., 3rd, Stetten G., Kazazian H.H., Jr. Prenatal diagnosis using DNA polymorphisms: report on 95 pregnancies at risk for sickle-cell disease or beta-thalassemia. N Engl J Med. 1983;308:1054–1058. doi: 10.1056/NEJM198305053081803.
2. Kiesewetter S., Macek M., Jr., Davis C., Curristin S.M., Chu C.S., Graham C., Shrimpton A.E., Cashman S.M., Tsui L.C., Mickle J., Amos J., Highsmith W.E., Shuber A., Witt D.R., Crystal R.G., Cutting G.R. A mutation in CFTR produces different phenotypes depending on chromosomal background. Nat Genet. 1993;5:274–278. doi: 10.1038/ng1193-274.
3. Groman J.D., Hefferon T.W., Casals T., Bassas L., Estivill X., Des Georges M., Guittard C., Koudova M., Fallin M.D., Nemeth K., Fekete G., Kadasi L., Friedman K., Schwarz M., Bombieri C., Pignatti P.F., Kanavakis E., Tzetis M., Schwartz M., Novelli G., D'Apice M.R., Sobczynska-Tomaszewska A., Bal J., Stuhrmann M., Macek M., Jr., Claustres M., Cutting G.R. Variation in a repeat sequence determines whether a common variant of the cystic fibrosis transmembrane conductance regulator gene is pathogenic or benign. Am J Hum Genet. 2004;74:176–179. doi: 10.1086/381001.
4. Richards S., Aziz N., Bale S., Bick D., Das S., Gastier-Foster J., Grody W.W., Hegde M., Lyon E., Spector E., Voelkerding K., Rehm H.L., ACMG Laboratory Quality Assurance Committee. Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology. Genet Med. 2015;17:405–424. doi: 10.1038/gim.2015.30.
5. Jones S., Hruban R.H., Kamiyama M., Borges M., Zhang X., Parsons D.W., Lin J.C., Palmisano E., Brune K., Jaffee E.M., Iacobuzio-Donahue C.A., Maitra A., Parmigiani G., Kern S.E., Velculescu V.E., Kinzler K.W., Vogelstein B., Eshleman J.R., Goggins M., Klein A.P. Exomic sequencing identifies PALB2 as a pancreatic cancer susceptibility gene. Science. 2009;324:217. doi: 10.1126/science.1171202.
6. Lord C.J., Ashworth A. PARP inhibitors: synthetic lethality in the clinic. Science. 2017;355:1152–1158. doi: 10.1126/science.aam7344.
7. Carapito R., Radosavljevic M., Bahram S. Next-generation sequencing of the HLA locus: methods and impacts on HLA typing, population genetics and disease association studies. Hum Immunol. 2016;77:1016–1023. doi: 10.1016/j.humimm.2016.04.002.
8. Yan H., Papadopoulos N., Marra G., Perrera C., Jiricny J., Boland C.R., Lynch H.T., Chadwick R.B., de la Chapelle A., Berg K., Eshleman J.R., Yuan W., Markowitz S., Laken S.J., Lengauer C., Kinzler K.W., Vogelstein B. Conversion of diploidy to haploidy. Nature. 2000;403:723–724. doi: 10.1038/35001659.
9. Murphy N.M., Burton M., Powell D.R., Rossello F.J., Cooper D., Chopra A., Hsieh M.J., Sayer D.C., Gordon L., Pertile M.D., Tait B.D., Irving H.R., Pouton C.W. Haplotyping the human leukocyte antigen system from single chromosomes. Sci Rep. 2016;6:30381. doi: 10.1038/srep30381.
10. Kidd J.M., Cheng Z., Graves T., Fulton B., Wilson R.K., Eichler E.E. Haplotype sorting using human fosmid clone end-sequence pairs. Genome Res. 2008;18:2016–2023. doi: 10.1101/gr.081786.108.
11. Norris A.L., Workman R.E., Fan Y., Eshleman J.R., Timp W. Nanopore sequencing detects structural variants in cancer. Cancer Biol Ther. 2016;17:246–253. doi: 10.1080/15384047.2016.1139236.
12. Timp W., Mirsaidov U.M., Wang D., Comer J., Aksimentiev A., Timp G. Nanopore sequencing: electrical measurements of the code of life. IEEE Trans Nanotechnol. 2010;9:281–294. doi: 10.1109/TNANO.2010.2044418.
13. Rhoads A., Au K.F. PacBio sequencing and its applications. Genomics Proteomics Bioinformatics. 2015;13:278–289. doi: 10.1016/j.gpb.2015.08.002.
14. Roach J.C., Glusman G., Hubley R., Montsaroff S.Z., Holloway A.K., Mauldin D.E., Srivastava D., Garg V., Pollard K.S., Galas D.J., Hood L., Smit A.F. Chromosomal haplotypes by genetic phasing of human families. Am J Hum Genet. 2011;89:382–397. doi: 10.1016/j.ajhg.2011.07.023.
15. Zheng G.X., Lau B.T., Schnall-Levin M., Jarosz M., Bell J.M., Hindson C.M. Haplotyping germline and cancer genomes with high-throughput linked-read sequencing. Nat Biotechnol. 2016;34:303–311. doi: 10.1038/nbt.3432.
16. Stephens M., Smith N.J., Donnelly P. A new statistical method for haplotype reconstruction from population data. Am J Hum Genet. 2001;68:978–989. doi: 10.1086/319501.
17. Delaneau O., Marchini J. 1000 Genomes Project Consortium: Integrating sequence and array data to create an improved 1000 Genomes Project haplotype reference panel. Nat Commun. 2014;5:3934. doi: 10.1038/ncomms4934.
18. Niu T. Algorithms for inferring haplotypes. Genet Epidemiol. 2004;27:334–347. doi: 10.1002/gepi.20024.
19. Debeljak M., Freed D.N., Welch J.A., Haley L., Beierl K., Iglehart B.S., Pallavajjala A., Gocke C.D., Leffell M.S., Lin M.T., Pevsner J., Wheelan S.J., Eshleman J.R. Haplotype counting by next-generation sequencing for ultrasensitive human DNA detection. J Mol Diagn. 2014;16:495–503. doi: 10.1016/j.jmoldx.2014.04.003.
20. Debeljak M., Mocci E., Morrison M.C., Pallavajjalla A., Beierl K., Amiel M., Noe M., Wood L.D., Lin M.T., Gocke C.D., Klein A.P., Fuchs E.J., Jones R.J., Eshleman J.R. Haplotype counting for sensitive chimerism testing: potential for early leukemia relapse detection. J Mol Diagn. 2017;19:427–436. doi: 10.1016/j.jmoldx.2017.01.005.
21. 1000 Genomes Project Consortium. Auton A., Brooks L.D., Durbin R.M., Garrison E.P., Kang H.M., Korbel J.O., Marchini J.L., McCarthy S., McVean G.A., Abecasis G.R. A global reference for human genetic variation. Nature. 2015;526:68–74. doi: 10.1038/nature15393.
22. Belsare S., Sakin-Levy M., Mostovoy Y., Durinck S., Chaudhry S., Xiao M., Peterson A.S., Kwok P.-Y., Seshagiri S., Wall J.D. Evaluating the quality of the 1000 genomes project data. [ePub] doi: 10.1101/383950.
23. Browning B.L., Zhou Y., Browning S.R. A one-penny imputed genome from next-generation reference panels. Am J Hum Genet. 2018;103:338–348. doi: 10.1016/j.ajhg.2018.07.015.
24. Scheet P., Stephens M. A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. Am J Hum Genet. 2006;78:629–644. doi: 10.1086/502802.
25. Pan W., Zhao Y., Xu Y., Zhou F. WinHAP2: an extremely fast haplotype phasing program for long genotype sequences. BMC Bioinformatics. 2014;15:164. doi: 10.1186/1471-2105-15-164.
26. Yang W.Y., Hormozdiari F., Wang Z., He D., Pasaniuc B., Eskin E. Leveraging reads that span multiple single nucleotide polymorphisms for haplotype inference from sequencing data. Bioinformatics. 2013;29:2245–2252. doi: 10.1093/bioinformatics/btt386.
27. Sudmant P.H., Rausch T., Gardner E.J., Handsaker R.E., Abyzov A., Huddleston J. An integrated map of structural variation in 2,504 human genomes. Nature. 2015;526:75–81. doi: 10.1038/nature15394.
28. Robinson J., Halliwell J.A., Hayhurst J.D., Flicek P., Parham P., Marsh S.G. The IPD and IMGT/HLA database: allele variant databases. Nucleic Acids Res. 2015;43:D423–D431. doi: 10.1093/nar/gku1161.
29. Clark A.G. Inference of haplotypes from PCR-amplified samples of diploid populations. Mol Biol Evol. 1990;7:111–122. doi: 10.1093/oxfordjournals.molbev.a040591.
30. Marchini J., Cutler D., Patterson N., Stephens M., Eskin E., Halperin E., Lin S., Qin Z.S., Munro H.M., Abecasis G.R., Donnelly P., International HapMap Consortium. A comparison of phasing algorithms for trios and unrelated individuals. Am J Hum Genet. 2006;78:437–450. doi: 10.1086/500808.
31. Clark A.G., Weiss K.M., Nickerson D.A., Taylor S.L., Buchanan A., Stengard J., Salomaa V., Vartiainen E., Perola M., Boerwinkle E., Sing C.F. Haplotype structure and population genetic inferences from nucleotide-sequence variation in human lipoprotein lipase. Am J Hum Genet. 1998;63:595–612. doi: 10.1086/301977.
32. Quaife C.J., Findley S.D., Erickson J.C., Froelick G.J., Kelly E.J., Zambrowicz B.P., Palmiter R.D. Induction of a new metallothionein isoform (MT-IV) occurs during differentiation of stratified squamous epithelia. Biochemistry. 1994;33:7250–7259. doi: 10.1021/bi00189a029.
33. Sanchez-Mazas A., Cerny V., Di D., Buhler S., Podgorna E., Chevallier E., Brunet L., Weber S., Kervaire B., Testi M., Andreani M., Tiercy J.M., Villard J., Nunes J.M. The HLA-B landscape of Africa: signatures of pathogen-driven selection and molecular identification of candidate alleles to malaria protection. Mol Ecol. 2017;26:6238–6252. doi: 10.1111/mec.14366.
34. Xie C., Yeo Z.X., Wong M., Piper J., Long T., Kirkness E.F., Biggs W.H., Bloom K., Spellman S., Vierra-Green C., Brady C., Scheuermann R.H., Telenti A., Howard S., Brewerton S., Turpaz Y., Venter J.C. Fast and accurate HLA typing from short-read next-generation sequence data with xHLA. Proc Natl Acad Sci U S A. 2017;114:8059–8064. doi: 10.1073/pnas.1707945114.
35. Griffin T.J., Smith L.M. Genetic identification by mass spectrometric analysis of single-nucleotide polymorphisms: ternary encoding of genotypes. Anal Chem. 2000;72:3298–3302. doi: 10.1021/ac991390e.
36. Tewhey R., Bansal V., Torkamani A., Topol E.J., Schork N.J. The importance of phase information for human genomics. Nat Rev Genet. 2011;12:215–223. doi: 10.1038/nrg2950.
37. Le D.T., Durham J.N., Smith K.N., Wang H., Bartlett B.R., Aulakh L.K. Mismatch repair deficiency predicts response of solid tumors to PD-1 blockade. Science. 2017;357:409–413. doi: 10.1126/science.aan6733.
38. Dudley J.C., Lin M.T., Le D.T., Eshleman J.R. Microsatellite instability as a biomarker for PD-1 blockade. Clin Cancer Res. 2016;22:813–820. doi: 10.1158/1078-0432.CCR-15-1678.
