Fast and accurate HLA typing from short-read next-generation sequence data with xHLA

Chao Xie; Zhen Xuan Yeo; Marie Wong; Jason Piper; Tao Long; Ewen F Kirkness; William H Biggs; Ken Bloom; Stephen Spellman; Cynthia Vierra-Green; Colleen Brady; Richard H Scheuermann; Amalio Telenti; Sally Howard; Suzanne Brewerton; Yaron Turpaz; J Craig Venter

doi:10.1073/pnas.1707945114

. 2017 Jun 28;114(30):8059–8064. doi: 10.1073/pnas.1707945114

Fast and accurate HLA typing from short-read next-generation sequence data with xHLA

Chao Xie ^a,¹, Zhen Xuan Yeo ^a, Marie Wong ^a, Jason Piper ^a, Tao Long ^b, Ewen F Kirkness ^b, William H Biggs ^b, Ken Bloom ^b, Stephen Spellman ^c, Cynthia Vierra-Green ^c, Colleen Brady ^c, Richard H Scheuermann ^d,^e, Amalio Telenti ^b, Sally Howard ^b, Suzanne Brewerton ^a, Yaron Turpaz ^a,^b, J Craig Venter ^b,^d,¹

PMCID: PMC5544337 PMID: 28674023

Significance

Regulation of the human immune system is largely controlled by the HLA gene complex on chromosome 6 and is important in infectious disease immunity, graft rejection, autoimmunity, and cancer. HLA typing is traditionally performed by serotyping and/or targeted sequencing. However, the advent of precision medicine and cheaper personal genome sequencing has sprung an unmet need for a fast and accurate way of predicting HLA types from short-read sequencing data. Here, we present xHLA, an algorithm for HLA typing based on translated short reads, exhaustive multiple sequence alignment-based alignment expansion, and iterative solution set refinement that is also faster and more accurate than existing methods. Results are achievable within minutes and could greatly benefit individuals who have had their genome sequenced.

Keywords: MHC, autoimmune diseases, transplantation

Abstract

The HLA gene complex on human chromosome 6 is one of the most polymorphic regions in the human genome and contributes in large part to the diversity of the immune system. Accurate typing of HLA genes with short-read sequencing data has historically been difficult due to the sequence similarity between the polymorphic alleles. Here, we introduce an algorithm, xHLA, that iteratively refines the mapping results at the amino acid level to achieve 99–100% four-digit typing accuracy for both class I and II HLA genes, taking only $\sim$ 3 min to process a 30× whole-genome BAM file on a desktop computer.

Genes within the HLA complex play an integral role in the human adaptive immune system. Classical class I (HLA-A, -B, and -C) and class II (HLA-DR, -DP, and -DQ) HLA gene products function by presenting foreign antigens to T cells to trigger immune responses (1). HLA genes show incredible sequence diversity in the human population. For example, there are >4,000 known alleles for the HLA-B gene alone (2, 3). The genetic diversity in HLA genes in which different alleles have different efficiencies for presenting different antigens is believed to be a result of evolution conferring better population-level resistance against the wide range of different pathogens to which humans are exposed (4).

In addition to its role in infectious disease defense, HLA has been associated with >100 different diseases, including various autoimmune disorders (1). However, although it plays an important role in human health, people do not routinely have their HLA genes typed. With the current trend toward precision medicine, knowing their HLA types will be crucial in early diagnosis and management of many diseases. For example, autoimmune disorder patients often have chronic problems with no exact diagnosis for many years after repeated doctor visits (5, 6). Knowing patients’ HLA types could lead to early diagnosis and reduce the burden on both patients and the healthcare system.

In the setting of hematopoietic stem cell transplantation (HSCT), matching of HLA alloantigens between donor and recipient to mitigate an allogeneic immune response is the single most important factor dictating the successful outcome of engraftment and survival after transplantation (7). HLA typing for HSCT donor–recipient matching in the clinical laboratory, including nucleotide sequence-based methods, focuses on characterizing sequence variations (polymorphisms) in the three classical class I alpha proteins, HLA-A, -B, and -C, and two of the classical class II beta proteins, HLA-DRB1 and -DQB1.

Finally, HLA determines reactivity and hypersensitivity reactions to a number of therapeutic agents that act as haptens. The Food and Drug Administration lists a number of drugs that require or may benefit from HLA typing before prescription to avoid severe adverse reactions (8).

HLA genes are usually typed with targeted sequencing methods: either long-read sequencing or long-insert short-read sequencing. In contrast, the most common personal genome sequencing methods use short-read shotgun technologies, which have historically been more difficult for HLA-typing algorithms (9–12). Existing algorithms are either inaccurate or slow. Given the growing popularity of precision medicine approaches and the generation of personal genomes, there is a clear need for faster and more accurate HLA-typing algorithms based on short-read shotgun data.

Results

There are three main challenges for HLA typing from short-read data. First, the high degree of sequence polymorphism means there are many potential candidate alleles for each gene, which means there is a high level of sequence “noise” when trying to find the correct allele from a large search space. Many of the existing algorithms try to reduce the noise level by filtering out less common HLA alleles from their candidate allele set. For example, OptiType ignores any allele with no reported allele frequency in AlleleFrequency.net (9, 13), whereas HLA-VBSeq only considers $\sim$ 100 HLA alleles as candidates (10). Second, despite the extensive polymorphisms within each HLA gene, alleles from different HLA genes can share regions that are extremely similar to each other, especially when looking at fragments resulting from short-read sequencing. Furthermore, there are many nonfunctional pseudogenes with similar sequences to functional HLA genes. Third, there are no complete reference sequences for most HLA alleles in the HLA reference database, International ImMunoGeneTics Project (IMGT)/HLA (3). Class I genes typically have exon 2 and 3 sequences available in the database, and class II genes typically have exon 2 sequences available in the IMGT/HLA database—the so-called “core exons” that comprise the antigen-recognition domain of the molecule involved in peptide presentation and interaction with the T-cell receptor. However, the availability of sequences for other exons in the database ranges from 5 to 36%. Noncore exons contribute to HLA allele polymorphism (2), and many existing algorithms, such as OptiType and HLA*PRG, which type HLA genes with core exons only, report only a group or representative allele for each gene, even when there are full-length sequences available for certain HLA alleles in the database.

xHLA Algorithm Overview.

Most existing HLA-typing algorithms follow a two-stage framework: First, align potential HLA sequencing reads to a collection of HLA reference sequences (IMGT/HLA), and then find a combination of HLA alleles that can best explain the observed alignment results. Our algorithm, xHLA, follows the same general framework with differing details (Fig. 1).

Fig. 1. — Overview of the xHLA algorithm.

Generating Alignment Matrix.

The starting data for xHLA is a BAM file that contains all sequenced reads mapped to a reference genome. Potential HLA sequencing reads are extracted from the BAM file based on their mapping coordinates and aligned to the IMGT/HLA database. All existing algorithms perform this step with DNA-level alignment, often accepting a certain degree of mismatches. A key difference in our algorithm is the use of a fast and sensitive protein level aligner, DIAMOND (14), and the acceptance of only perfect matches for quality-trimmed reads. The rationale is that, in HLA typing, the most relevant resolution is four-digit typing, which describes the protein-level amino acid sequence differences encoded in the HLA genes. Therefore, four-digit typing is what most clinical and next-generation sequencing-based typing algorithms aim to achieve. The problem with DNA-level alignment is that it cannot distinguish synonymous from nonsynonymous mismatches. For example, it will rank five synonymous mismatches as more dissimilar than a single nonsynonymous one. An added advantage of using protein-level alignment is that the identity threshold is not arbitrary as in DNA alignment, because there is clear concordance between 100% protein sequence identity and four-digit HLA typing.

Another issue with alignment of short reads against the IMGT/HLA database is that the database is not a typical reference sequence database. Most aligners have been developed to find the best homologs in a database with many diverse sequences. However, in this case, the reference database contains tens of thousands of very similar sequences. For any short sequencing read, there are many equally good or perfect matches. Most off-the-shelf aligners are optimized to find a number of good matches, but are not exhaustive. HLA*PRG is the only existing algorithm that tries to solve this problem by using a graph reference sequence set with certain assumptions about the intronic sequences (11). However, the HLA*PRG solution is very slow, taking $\sim$ 11 h for one 30× coverage whole-genome sequence BAM file. xHLA uses a precomputed protein-level reference multiple sequence alignment (MSA) of known HLA alleles from the IMGT/HLA database to expand the initial alignment results. For example, if a read is aligned to 100 reference sequences in the initial alignment, xHLA will compare the read to all equivalent sequence segments in the MSA corresponding to the initial 100 matches. Consequently, xHLA is guaranteed to exhaustively find all equally good matches from the HLA database.

Four-Digit Typing.

After aligning reads to the HLA reference sequences, the next task is to infer the most likely HLA allele combination that best explains the alignments. Rather than treating each gene separately, and by using alleles with the largest number of aligned reads, OptiType uses integer linear programing to find the complete allele set as a single optimization problem (9). This method is a better approach than the other methods because it explicitly considers short reads that map to multiple HLA allele candidates. However, there are two issues with this approach. First, there are often many different solution sets explaining the alignments equally or nearly equally well. Second, this method does not work when different HLA allele candidates have different numbers of reference exon sequences available. For example, allele candidates with a full set of exons available will most likely have more reads aligned to it compared with allele candidates with reference sequences from only a subset of exons available. xHLA solves these problems in three steps.

An initial set of HLA allele candidates is first selected in a similar way to the integer linear programing technique used in OptiType, by using only the core exons that are available for all HLA reference alleles in the IMGT/HLA database. Then, for each allele candidate (sol) in the initial solution set (solution_set), all alternative alleles (comp) that explain the alignments nearly as well are collected (Fig. 2). The current candidate, sol, is compared with alternative candidates, comp, based on the reads aligned to all of the exons available for both alleles in the IMGT/HLA database, rather than only the core exons as in the initial round (Fig. 2). When comparing the two alleles, sol and comp, only reads that are aligned to one of the two alleles, but not both, nor any other alleles in the current solution set, are considered (“is_better” in Fig. 2). This filtering step is important because reads that align to other alleles in the solution set are not informative when comparing sol and comp because it is unclear whether those reads are derived from the alleles being compared or other alleles in the solution set. If the alternative allele can explain more aligned and informative reads than the original sol, we replace the original sol with the alternative comp and repeat the iteration with a new set of alternatives (Fig. 2). When no further optimization is found, the procedure is repeated on the next allele candidate in the solution set. This procedure is a quick and crude way of pulling noncore exons into play. However, when updating an allele candidate, it is assumed that the rest of the solution set is correct, which may not be the case because the allele candidate changes through this iterative process. This limitation means that one more iterative refinement step (Fig. 3) is required.

Fig. 2. — Iterative allele set update considering all pairwise-shared exons.

Fig. 3. — Second-round iterative allele set refinement.

The second iterative refinement step is similar to the procedure described above, except that all candidate alleles in the current solution set are compared with their respective set of alternative alleles in parallel, and only one candidate allele is updated after each iteration. The update is repeated until no further optimization is possible (Fig. 3).

For every HLA gene, if the above procedure produces two alleles, a check is made on whether they are true heterozygotes by comparing the informative reads aligned to the two alleles, but not to any other allele in the solution set (zygosity check). If one of the two alleles has five times more informative reads aligned to it, the heterozygous call is changed to a homozygous call.

Full-Resolution Typing.

The above procedure gives a four-digit HLA allele set that best explains the alignments. We can optionally infer higher-resolution HLA types using the four-digit solution set as a starting point (Fig. 4). This process is an easier task compared with the above four-digit typing task itself, because the search space contains only higher-digit HLA alleles under each set of four-digit alleles, rather than the entire IMGT/HLA database.

Fig. 4. — Full-resolution HLA typing as a downstream step after four-digit typing.

Benchmarking.

xHLA was tested on four public datasets benchmarked from HLA*PRG’s and ATHLATES’ original publications (11, 15), including two whole-genome and two exome sequencing datasets. In all cases, xHLA was more accurate than existing algorithms (Table 1). HLA*PRG performed better than other existing algorithms, but scored 83.3% (class I) and 93% (class II) accuracy for the HapMap exome data based on their own benchmark. In contrast, xHLA scored 98.3% and 100% accuracy on the same dataset. In terms of runtime, xHLA was much faster than existing algorithms, taking $\sim$ 3 min per 30× whole genome (2 × 150-bp reads) sample on a machine with 16-core CPU and 30-GB memory. In comparison, HLA*PRG was the slowest and most resource-demanding algorithm, taking $\sim$ 11 h with 80-GB memory. Other algorithms tested usually take between 15 min and 5 h for similar samples.

Table 1.

Accuracy of xHLA compared with existing algorithms

Data type	Dataset	Class (n)	xHLA, %	HLA*PRG, %	PHLAT, %	HLA Reporter, %	SOAP-HLA, %	HLA-VBSeq, %	OptiType, %	ATHLATES, %
WGS	Platinum	I (18)	100	100	100	11.2
		II (12)	100	100	100	66.3
	1000G	I (66)	100	100	63.7	4.5
		II (44)	100	97.7	70.9	62.7
	CIBMTR	I (2,928)	99.7	98.5			91.0	97.5	97.0
		II (2,872)	99.4	96.5			97.0	38.0	NA
Exome	HapMap	I (174)	98.3	83.3	87.7	26.5
		II (104)	100	93.0	87.3	71.3
	1000G	I (66)	100							98.5
		II (44)	100							100
	GeT-RM	I (646)	99.5
		II (644)	96.1

Open in a new tab

A total of six datasets were used. Platinum, 1000G (WGS), HapMap, and 1000G (Exome) were identical to the benchmarking datasets used in refs. 11 and 15. Accuracy for HLA*PRG, PHLAT, and HLA Reporter on the public datasets were obtained from the same references. For each dataset, accuracy for HLA class I and II genes is listed separately, and the number of alleles in each dataset is shown in the table.

Additionally, xHLA was tested on two much larger, private datasets. For the whole-genome dataset, obtained through collaboration with the Center for International Blood and Marrow Transplant Research (CIBMTR; a research collaboration between the National Marrow Donor Program/Be The Match and the Medical College of Wisconsin), xHLA achieved 99.7% (class I) and 99.4% (class II) accuracy, higher than all other existing algorithms tested (Table 1). For exome data, the GeT-RM dataset (cell lines obtained from wwwn.cdc.gov/clia/Resources/GetRM/), xHLA achieved 99.5% accuracy on class I HLA alleles, but 96.1% on class II. Most of the errors (3.6% of 3.9%) for class II typing in the GeT-RM dataset were due to missing heterozygous calls (i.e., only reported one of the two heterozygous alleles), suggesting differential pull-down efficiencies of the exome enrichment probes against sequences from different HLA alleles for some class II HLA types.

Because most of the gold-standard HLA types we used for benchmarking are based on core exons of the HLA genes (15–17), the benchmarking procedure considered four-digit types with the same core exon sequences as “consistent” types. This fraction ranged from 0 to 1.9% of alleles in the benchmarking datasets. The largest number of consistent, but not identical, predictions by xHLA was the HLA-DRB1 gene in the CIBMTR data: There were 13 reported DRB1*14:01 alleles predicted as DRB1*14:54 by xHLA. The two DRB1 types shared the same core exon 2 sequence, and most of the reported DRB1*14:01 before 2006 were actually DRB1*14:54 (18). This result shows that our typing algorithm works well beyond the core exons.

Discussion

As demonstrated in our experiments, we have shown that xHLA is both faster and more accurate on whole-genome and -exome sequencing data than other methods. Specifically, seven other methods were tested against xHLA on three whole-genome and three whole-exome datasets. In all cases, xHLA performed the best.

Although all of the methods use a similar process of alignment followed by choosing the alleles that best explain the alignments, there are some differences. The first main difference of xHLA is the use of protein-level alignment and MSA-based alignment expansion, which results in a cleaner and exhaustive alignment matrix and, thus, higher accuracy compared with the other HLA-typing methods, as summarized in Table 1. This finding clearly shows the advantages of using protein-level typing because a perfect match in protein space is immutable. Therefore, there is no need to filter any of the initial data or worry about DNA sequence similarity of HLA genes and pseudogenes. There is also no bias introduced due to incomplete data in the IMGT/HLA reference sequence database. In addition, xHLA runs significantly faster than the other methods on similar hardware specifications by using DIAMOND, a double-indexing strategy that enables a 20,000× speed-up as compared with BLASTX, with a similar degree of sensitivity (14). It takes xHLA only a few minutes to process a 30× genome with 150-bp reads on a modest desktop computer. However, with shorter reads—say, 75-bp long—it will take significantly longer, because the number of matches in the exhaustive alignment matrix for shorter reads increases exponentially. Although it is focused on four-digit typing, xHLA offers an optional feature to run full-resolution HLA typing as a downstream step after four-digit typing.

Another key difference of xHLA is that it works beyond the core exons of HLA genes. One interesting observation in our benchmarking was the DRB1*14:54 and DRB1*14:01 pair: Thirteen previously reported DRB1*14:01 were predicted as DRB1*14:54, which is consistent with the fact that most previously reported DRB1*14:01 were actually DRB1*14:54 (18). We excluded DPA1 and DQA1 due to the lack of truth data for these genes.

Even though xHLA performed the best, it did drop in accuracy in one of the three exome data samples that we tested. In our private GeT-RM exome samples, the typing accuracy for class II genes was 96.1%, much lower than in other samples. The dip is likely due to differential pull-down efficiencies for different alleles by the exome enrichment kit.

By considering well-documented population-specific haplotypes (13), together with derived haplotypes from our recently published large-scale whole-genome sequencing data (19), we will be able to improve our typing accuracy further, especially in exome data, where enrichment bias occurs.

With the current movement toward precision medicine, personal genome sequencing is becoming increasingly popular and cheaper. HLA is one of the most relevant regions of the genome for personalized medicine, given its diversity and role in the immune system. xHLA is both fast and accurate in identifying HLA types from personal genome data and should be used to make HLA typing more readily available to individuals and their physicians as an added benefit to having their genomes sequenced.

Materials and Methods

The xHLA Algorithm Overview.

The basic steps of the xHLA algorithm are illustrated in Fig. 1. There are three major modules: construction of alignment matrix, four-digit typing, and full-resolution typing.

Alignment Matrix.

Preprocessing.

The input data of xHLA is a BAM file where sequencing reads are mapped to the hg38 human reference assembly (excluding alt contigs). Both BWA’s mem mode (Version 0.7.15) (20) and Isaac (Version 0.14.02.06) (21) with default parameters work well with xHLA on diverse datasets. Because all genome sequencing projects produce a BAM file, the alignment step is not considered as part of xHLA. xHLA extracts relevant HLA reads from the BAM file (chromosome 6, position 29,886,751–33,090,696), then trims and filters them based on base quality scores. Trimming is based on BWA’s trimming algorithm with Phred quality cutoff 20 from the 3′ end after first trimming Ns. Reads <70 bp or with more than five positions with Phred quality score <4 after trimming are removed.

DIAMOND alignment.

The reads are then aligned to reference HLA exon protein sequences from IMGT/HLA by using DIAMOND (Version 0.8.15; parameters: blastx –index-mode 1 –seg no –min-score 10 –top 20 -c 1 -C 20000). Strict filtering is performed on the alignments. The alignments need to be 100% identical and cover the full length of the query reads, unless the unaligned region of the query extends beyond the reference sequence. Because protein-level alignment is used against exons, an additional comparison of reads against incomplete codons at exon boundaries is performed. Only the alignments with equally best scores for each query read are kept.

MSA-based alignment expansion.

The next step is to generate an exhaustive alignment matrix by using a precomputed MSA of known HLA alleles with MUSCLE (22). For example, if a read is aligned to 100 reference sequences by DIAMOND after the previous step, xHLA compares the read to the equivalent sequence segments in all other reference sequences in the MSA. All equally good reference matches are retained in the exhaustive alignment matrix. Although only six HLA genes (HLA-A, -B, -C, -DQB1, -DRB1, and -DPB1) are typed, all other HLA genes are used as alignment references in the above steps. Reads that map equally well to genes being typed and those that are not are considered as uninformative and excluded from further analysis.