Abstract
Affordable genotyping methods are essential in genomics. Commonly used genotyping methods primarily support single nucleotide variants and short indels but neglect structural variants. Additionally, read alignments to a reference genome are unreliable in highly polymorphic and repetitive regions, further impacting genotyping performance. Recent works highlight the advantage of haplotype-resolved pangenome graphs in addressing these challenges. Building on these developments, we propose a rigorous alignment-free genotyping framework. Our formulation seeks a path through the pangenome graph that maximizes the matches between the path and substrings of sequencing reads (e.g., k-mers) while minimizing recombination events (haplotype switches) along the path. We prove that this problem is NP-hard and develop efficient integer-programming solutions. We benchmarked the algorithm using downsampled short-read datasets from homozygous human cell lines with coverage ranging from 0.1× to 10×. Our algorithm accurately estimates complete major histocompatibility complex (MHC) haplotype sequences with small edit distances from the ground-truth sequences, providing a significant advantage over existing methods on low-coverage inputs. Although our algorithm is designed for haploid samples, we discuss future extensions to diploid samples.
1. Introduction
Many initiatives are in progress for building haplotype-resolved pangenome references of human and nonhuman species [22,11,36]. Among many applications, pangenome graphs can enable cost-effective genotyping and imputation of a wide spectrum of variant classes beyond single nucleotide polymorphisms (SNPs) and short indels [13]. Pangenome graphs represent sequence alignment of high-quality fully-phased genome assemblies of individuals from diverse populations [1]. A pangenome graph can be represented as either a cyclic or an acyclic directed graph whose vertices are labeled with sequences. Paths in this graph spell the reference haplotype sequences and their recombinations. The graph-based representation is flexible enough to incorporate SNPs, indels (short insertions and deletions), large structural variants (SVs), nested variants, gene presence/absence, etc. [4].
Recent works propose the use of pangenome references to improve genotyping accuracy from short-read sequencing data [9,14,12,18,2,10,33,25]. Especially for SVs, these methods are an effective alternative to the conventional genotyping methods that are based on aligning reads to a single reference because short-read alignments can be inaccurate for the reads originating from SVs [23,8]. Methods such as PRG [6], PanGenie [9], and KAGE [12] utilize k-mer statistics to infer paths in the graph that correspond to the target genome. These methods compare the k-mers surrounding a variant site in the graph with the k-mer counts in the sequencing data to calculate likelihoods of reference and alternative alleles. PanGenie and KAGE also use the long-range haplotype information available in the haplotype-resolved pangenome references. The other approach, used in methods such as Giraffe [34] and Graphtyper [10], involves aligning reads to a pangenome graph.
There have been efforts on improving the accuracy of read alignments to pangenome graphs as well. A large combinatorial search space in terms of the number of candidate paths in a pangenome graph increases ambiguity during read alignment. This issue has motivated methods that either impute a personalized reference genome [38], sample variants [29,17,37] to obtain a smaller graph, or prioritize the use of reference haplotypes in the graph during alignment [3,34,26]. Our previous work proposed haplotype-aware sequence alignment to graphs by introducing penalties for haplotype switches in an alignment [3]. A recent feature added to VG allows sampling of reference haplotypes and their recombinations from the graph that are most relevant to the target genome using a k-mer-based greedy heuristic [35].
Low-coverage sequencing, combined with genotyping and phasing, is a cost-effective approach to conduct large-scale genetic studies [31,5,20,24]. In this paper, we develop a rigorous formulation and algorithms for genotyping using pangenome references. Our framework is also applicable to low-coverage short-read sequencing data (coverage 0.1–1×). Following the standard Li and Stephens model [21], we view the target genome as an imperfect mosaic of the reference haplotypes. Our contributions are as follows.
We introduce a novel problem formulation to estimate the complete haplotype sequence of a haploid genome by determining an appropriate path in the pangenome graph. The objective is to maximize the number of shared substrings (e.g., k-mers or minimizers) between the sequencing data and the sequence spelled by the path. We permit recombinations in the path, subject to a fixed penalty per recombination. We refer to this problem as Path Inference Problem (formally defined in Section 2).
We prove that the Path Inference Problem is NP-hard, even when restricted to binary alphabets.
To solve this problem, we develop two integer-programming solutions which involve linear and quadratic constraints, respectively. The two solutions involve a tradeoff between runtime and memory usage.
We demonstrate the utility of this framework by testing it on downsampled short-read datasets from five human haploid cell lines (coverage 0.1–10×). For these five samples, complete major histocompatibility complex (MHC) haplotype sequences have been previously determined using long-read assembly [16]. As our pangenome reference, we used a haplotype-resolved pangenome directed acyclic graph (DAG) of 49 MHC haplotype sequences [19]. We chose the MHC region for evaluation because it is the most polymorphic and gene-rich region of the human genome [7]. The length of this region is about 5 Mbp.
Using datasets with 0.1× coverage, our algorithm outputs MHC sequences that are up to 99.96% identical to the ground-truth sequences. It compares favorably to the existing methods.
2. Notations and Problem Formulation
Let $G = (V, E, \sigma, \mathcal{H})$ denote a directed acyclic graph (DAG) representing a haplotype-resolved pangenome reference. Function $\sigma : V \rightarrow \Sigma^{+}$ assigns a string label over alphabet $\Sigma$ to each vertex. A path $(v_1, v_2, \ldots, v_n)$ in $G$ spells string $\sigma(v_1) \cdot \sigma(v_2) \cdots \sigma(v_n)$, where $x \cdot y$ denotes the concatenation of strings $x$ and $y$. $\mathcal{H}$ denotes a set of paths in $G$ such that each of these paths spells a reference haplotype sequence used in the pangenome reference. We refer to these paths as haplotype paths. We assume that each haplotype path $h \in \mathcal{H}$ is described by an array, i.e., $h[1]$ is the first vertex in $h$, $h[2]$ is the second vertex in $h$, etc. The length of a haplotype path $h$, that is, the count of vertices in $h$, is denoted as $|h|$. The set of haplotype paths covering vertex $v$ is denoted as $\mathcal{H}(v)$. We assume that, for each edge $(u, v) \in E$, there exists a haplotype path $h \in \mathcal{H}$ such that $u$ and $v$ are consecutive vertices in $h$. In other words, each edge is supported by at least one haplotype path.
Definition 1 (Inferred Path).
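An inferred path $P$ of length $n$ is represented as an ordered set $((v_1, h_1), (v_2, h_2), \ldots, (v_n, h_n))$, where each $(v_i, h_i)$ is a two-tuple such that $v_i$ is a vertex of the graph, $h_i$ is a haplotype path covering $v_i$, and $(v_i, v_{i+1})$ is an edge of the graph for all $1 \le i < n$. Furthermore, if $h_i = h_{i+1}$, then $v_i$ and $v_{i+1}$ should be consecutive vertices in haplotype path $h_i$.
In an inferred path, we keep track of the haplotype path indices alongside vertex indices (Figure 1). We say a recombination, or a haplotype switch, occurs between two consecutive vertices $v_i$ and $v_{i+1}$ in $P$ if $h_i \ne h_{i+1}$. We use $r(P)$ to denote the count of recombinations in $P$. With a mild abuse of notation, we denote the string spelled by $P$ as $\sigma(P)$.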
Fig. 1:
A simple illustration of a haplotype-resolved pangenome graph with two haplotype paths highlighted in pink and blue colors. An inferred path with a single recombination is shown as a dashed line.
Problem 1 (Path Inference Problem).
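Input: A haplotype-resolved pangenome DAG $G$, a set of strings $S$ from the target genome, and a non-negative integer $c$ indicating the recombination penalty.
Output: An inferred path $P$ such that
$$c \cdot r(P) + \sum_{s \in S} \delta(s, \sigma(P))$$
is minimized, where $r(P)$ is the number of recombinations in $P$, $\sigma(P)$ is the string spelled by $P$, and $\delta(s, \sigma(P)) = 0$ if string $s$ occurs as a substring of string $\sigma(P)$ and 1 otherwise.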
The intuition behind our formulation is to maximize the number of string matches along the inferred path while minimizing the number of recombinations. This approach yields an inferred path that contains the majority of the input strings as substrings while using a limited number of recombinations, as controlled by the recombination penalty. The input string set can be a set of either k-mers or minimizers observed in the sequencing reads.
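The objective of Problem 1 can be illustrated on a toy instance. The following Python sketch is not taken from the authors' software; the function names, the toy vertex labels, and the representation of an inferred path as a list of `(vertex, haplotype)` tuples are assumptions made for illustration only.

```python
def spelled_string(path, labels):
    """Concatenate the vertex labels along an inferred path."""
    return "".join(labels[v] for v, _ in path)


def count_recombinations(path):
    """Count positions where consecutive tuples use different haplotypes."""
    return sum(1 for (_, h1), (_, h2) in zip(path, path[1:]) if h1 != h2)


def path_cost(path, labels, strings, penalty):
    """Recombination penalty times the number of recombinations,
    plus one unit for every input string missing from the spelled string."""
    text = spelled_string(path, labels)
    missing = sum(1 for s in strings if s not in text)
    return penalty * count_recombinations(path) + missing


# Hypothetical toy data: four labeled vertices and three query strings.
labels = {1: "AC", 2: "G", 3: "T", 4: "CA"}
strings = {"CG", "GC", "TT"}
path_no_switch = [(1, "h1"), (2, "h1"), (4, "h1")]
path_one_switch = [(1, "h1"), (2, "h1"), (4, "h2")]

print(path_cost(path_no_switch, labels, strings, penalty=2))
print(path_cost(path_one_switch, labels, strings, penalty=2))
```

Both toy paths spell the same string, so they differ only in the recombination term; a larger penalty makes the path with the switch proportionally less attractive.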
3. Computational Complexity
Theorem 1.
Problem 1 is NP-hard. This holds for any value of and even when .
We begin with an instance of the Hamiltonian Path Problem. Let . We first create a graph where
For , let be standard binary encoding of using bits. We assign the vertex labels
We create a distinct haplotype path for each edge that supports only that edge. We define the set of strings . See Figure 1 in Appendix for a small worked example. The reduction presented above clearly runs in polynomial time for . Combined with Lemmas 1 and 2, Theorem 1 follows.
Lemma 1.
If contains a Hamiltonian path, then has an inferred path with
Proof. Let be a Hamiltonian path in . We take as our inferred path , . As every edge has its own corresponding haplotype, the number of recombinations is . Furthermore, since is a Hamiltonian path and and are included in the inferred path, all strings in occur in . Hence, the total cost is . □
Lemma 2.
If has an inferred path with , then has a Hamiltonian path.
Proof. First, we claim that and must be included in . The substrings are used as padding to prevent any string in from being matched using portions of two or more vertex labels. Therefore, if or are not included in the inferred path, at least strings from do not occur in , contradicting that . Hence, the inferred path must contain and and be of the form , , for some . Since each edge traversed corresponds to a recombination, the total number of recombinations is . The only way the is if all strings in occur as substrings in . Again, due to the padding in the vertex labels, this can only happen if for all is a vertex in for some . Furthermore, because there are vertices in that are not or , there must be exactly one such for a given . We conclude that is a Hamiltonian path in . □
4. Proposed Algorithms
Before developing our integer programming solutions to Problem 1, it is helpful to define an additional graph representation, which we call the expanded graph. In pangenome graphs, multiple haplotype paths share vertices if the sequences are conserved, whereas in the expanded graph, we split all haplotypes into separate paths (Figures 2A, 2B). The expanded graph enables us to model Problem 1 as a type of network flow problem. In particular, the inferred path will be reconstructed from a flow of value one in the expanded graph. We will assign weights to edges to account for the recombination penalty. Additional constraints will be used to capture how many of the input strings occur in the resulting inferred path.
Fig. 2:
(A) A pangenome graph with four haplotype paths. The set of haplotype paths passing through a vertex is listed below each vertex. (B) The corresponding expanded graph, which includes four disjoint paths, one for each haplotype path. The recombination edges, shown in purple, have a weight equal to the recombination penalty. We consider only the useful recombinations (Lemma 3). The edges which are not recombination edges in the expanded graph have a weight of 0. (C) The corresponding optimized expanded graph.
Lemma 3 allows us to only consider a subset of all possible recombinations in order to find an optimal solution to Problem 1. We call the type of recombination described in Lemma 3 a useful recombination.
Lemma 3.
There exists an optimal inferred path $P = ((v_1, h_1), \ldots, (v_n, h_n))$ for Problem 1 where, for all $1 \le i < n$, $h_i \ne h_{i+1}$ implies vertices $v_i$ and $v_{i+1}$ are not consecutive vertices in haplotype path $h_i$.
Proof. Suppose there is an optimal inferred path $P = ((v_1, h_1), \ldots, (v_n, h_n))$ for Problem 1 where $h_i \ne h_{i+1}$ for some $i$ such that $v_i$ and $v_{i+1}$ are consecutive vertices in haplotype path $h_i$. Furthermore, suppose we start with the smallest $i$ where this holds. We then change the haplotype path for $v_{i+1}$ to equal $h_i$. This does not increase the overall cost, since the number of input strings occurring in the string spelled by $P$ has not changed, and the number of recombinations either decreases or stays the same. Continuing this process from the next $i$ such that $h_i \ne h_{i+1}$ and $v_i$ and $v_{i+1}$ are consecutive vertices in $h_i$, we achieve an inferred path satisfying the conditions stated in the lemma after at most $n$ iterations. □
Next, we present a definition of the expanded graph where we will consider only the useful recombinations. For technical reasons, we preprocess each edge in , splitting it and adding a new vertex labeled with the empty string . Each added vertex inherits the haplotype paths which supported the edge it was formed from. This added step is to prevent recombinations from a haplotype to itself when we build our expanded graph. Now, let . For haplotype path , let denote the vertex in haplotype path . We use to denote the expanded graph. In , vertices are string-labeled and edges are weighted. Vertex set is defined as:
| (1) |
The vertex set contains a source and sink vertex, and , respectively. The vertex set also contains a set of disjoint vertices for each haplotype path in (Figure 2B). A superscript is used to indicate which haplotype path the vertex is designated to. We refer to the ordered vertex set as a haplotype path in .
We denote weighted edges in as tuples of the form (start, end, weight). The weighted edge set is
| (2) |
| (3) |
| (4) |
| (5) |
Next, we give some intuition for each line (2)-(5) in the above construction of .
(2) Weight 0 edges are created from to the start of each haplotype path in .
(3) Weight 0 edges are created from the end of each haplotype path in to .
(4) Weight 0 edges are created between adjacent vertices in each haplotype path. That is, in the path for , an edge is created from to .
(5) Weight edges are used to represent the useful recombinations described in Lemma 3. We call these recombination edges.
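The construction outlined in items (2)-(5) can be sketched in Python. This is a simplified illustration under stated assumptions, not the authors' implementation: the function name `build_expanded_graph`, the tuple `(v, h)` standing for the copy of vertex `v` on haplotype `h`, and the toy input are hypothetical. The edge-splitting preprocessing step is omitted; recombinations from a haplotype to itself are instead excluded by an explicit check.

```python
def build_expanded_graph(hap_paths, penalty):
    """Return weighted edges (start, end, weight) of a simplified expanded graph.

    hap_paths maps a haplotype id to its list of vertices in path order.
    """
    # Consecutive vertex pairs per haplotype; their union gives the graph edges,
    # since every edge is supported by at least one haplotype path.
    consecutive = {h: set(zip(p, p[1:])) for h, p in hap_paths.items()}
    graph_edges = set().union(*consecutive.values())

    # Haplotypes covering each vertex.
    covering = {}
    for h, p in hap_paths.items():
        for v in p:
            covering.setdefault(v, []).append(h)

    edges = []
    for h, p in hap_paths.items():
        edges.append(("s", (p[0], h), 0))        # source to start of haplotype path
        edges.append(((p[-1], h), "t", 0))       # end of haplotype path to sink
        for u, v in zip(p, p[1:]):
            edges.append(((u, h), (v, h), 0))    # adjacent vertices on one haplotype

    # Recombination edges, restricted to useful recombinations (Lemma 3):
    # skip a switch out of h1 along (u, v) if u and v are consecutive in h1.
    for u, v in sorted(graph_edges):
        for h1 in covering[u]:
            if (u, v) in consecutive[h1]:
                continue
            for h2 in covering[v]:
                if h2 != h1:
                    edges.append(((u, h1), (v, h2), penalty))
    return edges


# Hypothetical toy input: a single bubble with two haplotype paths.
toy_edges = build_expanded_graph({"h1": [1, 2, 4], "h2": [1, 3, 4]}, penalty=5)
for edge in toy_edges:
    print(edge)
```

In this toy bubble only the two switches at the branching vertex survive the Lemma 3 filter; a switch along an edge that the current haplotype already follows is never generated.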
We use to denote the empty string. The vertex labels are defined as follows:
| (6) |
| (7) |
(6) The vertices in a haplotype path are labeled according to the corresponding vertex label in . These labels will be used to identify matches.
(7) The source and sink vertices do not require vertex labels and are hence labeled with the empty string.
Optimizing the Expanded Graph.
One issue with the above construction is that the number of recombination edges for a given potential recombination can be quadratic in the number of haplotype paths, $O(|\mathcal{H}|^2)$, in the worst case. This occurs because we maintain a separate copy of each vertex for every haplotype path covering it. For every edge $(u, v)$ allowing a recombination, we add one recombination edge per pair of a source copy of $u$ and a destination copy of $v$. Since both $u$ and $v$ can be covered by at most $|\mathcal{H}|$ haplotype paths, any potential recombination can result in $O(|\mathcal{H}|^2)$ recombination edges in the worst case. We observe this issue in practice as well. An improvement is to represent a recombination by an intermediate vertex that stands for the edge allowing the recombination. We then create an edge to the intermediate vertex from every vertex in a haplotype path from which the recombination would start, and edges from the intermediate vertex to every vertex in a haplotype path to which the recombination would lead (Figure 2C). More formally, the modified vertex set becomes
| (8) |
We also replace Line (5) in the construction of with the Lines (9) and (10) as follows:
| (9) |
| (10) |
We now call the edges created in Lines (9) and (10) the recombination edges. After creating the edges, we delete any vertex that is isolated in the resulting graph. Finally, any remaining intermediate vertex is labeled with the empty string. Observe that the above modification allows for the same set of useful recombinations as our initial expanded graph construction. However, per potential useful recombination, the number of edges remains $O(|\mathcal{H}|)$ rather than $O(|\mathcal{H}|^2)$. Before giving the integer programming solutions, we require one additional definition.
Definition 2 (Hits).
For a string , assuming , a path in , denoted as an ordered edge , matches if , where is a suffix of and a prefix of . We use to represent the set of paths matching string in .
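To make the notion of a hit concrete, the sketch below locates occurrences of a query string along a single haplotype path and reports which vertices each occurrence spans. It is an illustration only: the function name `find_hits`, the representation of a haplotype path as a list of vertex labels, and the choice to report vertex index ranges instead of edge sequences are assumptions.

```python
from bisect import bisect_right


def find_hits(labels, query):
    """Return (first_vertex_index, last_vertex_index) for every occurrence of
    `query` in the string spelled by the vertex labels of one haplotype path."""
    if not query:
        raise ValueError("query must be non-empty")
    text = "".join(labels)

    # Offset in `text` at which each vertex label begins.
    starts, pos = [], 0
    for label in labels:
        starts.append(pos)
        pos += len(label)

    hits = []
    i = text.find(query)
    while i != -1:
        j = i + len(query) - 1                 # offset of the last matched character
        first = bisect_right(starts, i) - 1    # vertex holding the first character
        last = bisect_right(starts, j) - 1     # vertex holding the last character
        hits.append((first, last))
        i = text.find(query, i + 1)            # allow overlapping occurrences
    return hits


# Hypothetical haplotype path with one empty-labeled vertex.
path_labels = ["AC", "G", "", "CA"]
print(find_hits(path_labels, "CGC"))
print(find_hits(path_labels, "C"))
```

Vertices with empty labels share a start offset with their successor, so `bisect_right` attributes a character to the vertex that actually contains it.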
4.1. Integer Linear Programming (ILP) Formulation
We assume that the maximum length of any vertex label is upper bounded by the length of any string in , i.e., . This condition can be easily enforced in the input graph by adjusting the lengths of vertex labels, e.g., by splitting a vertex with a long label into two, while ensuring that the graph’s topology is preserved. We assume .
The basis for our solution is to find an st-flow with a flow of 1 through the expanded graph. Our integer programs utilize a binary decision variable for each edge. The variable takes the value 1 if the edge is part of the solution flow and 0 otherwise. Because these are binary variables, the flow will always be a path. From the solution path in the expanded graph, it is straightforward to recover the corresponding inferred path. We use a binary decision variable for each input string, which takes the value 1 if the solution flow includes a subpath matching that string (Definition 2). We also use a variable for each hit of each string.
Letting denote the weight of an edge , our ILP formulation is as follows:
| (11) |
subject to
| (12) |
| (13) |
| (14) |
| (15) |
In the ILP formulation, Objective (11) models the cost minimized in Problem 1. The summation over edges imposes the recombination penalty once for each recombination. This is due to the two weighted recombination edges that must be traversed when the path switches between haplotype paths in the optimized expanded graph (Figure 2C). In the second summation, each term adds a penalty of 1 to the objective for every string whose decision variable is 0. Constraint (12) enforces flow conservation, allowing a unit flow from the source vertex to the sink vertex, ensuring that the ILP formulation selects a single path in the expanded graph.
To explain the function of Constraint (13), termed as linear string-hit constraint and (14), observe that in an optimal solution, whenever possible the variable is set to 1. This is because the term () in the objective function adds a penalty of 0 whenever . However, this is only possible when is equal to 1 for some . This, in turn, is only possible if , meaning occurs as a substring in the inferred path. Also note that at most one variable can equal 1 in Constraint (14). Other variables, where and , can have a value of 0, even if , justifying the use of equality in Constraint (14).
A weakness of the proposed ILP formulation is that the number of string-hit constraints equals the total number of string matches, that is, . We design another formulation with quadratic constraints in which fewer constraints are needed.
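The roles of the objective and the flow-conservation constraint can be checked on a candidate 0/1 assignment without a solver. The sketch below is a plain-Python illustration, not a formulation for Gurobi; the function names, the dictionary encoding of edges and variables, and the toy graph are assumptions.

```python
def objective(edges, x, z):
    """Weighted sum of selected edges plus one unit per string whose variable is 0.

    edges: edge id -> (start, end, weight); x: edge id -> 0/1; z: string -> 0/1.
    """
    recombination_cost = sum(w * x[e] for e, (_, _, w) in edges.items())
    unmatched = sum(1 - value for value in z.values())
    return recombination_cost + unmatched


def is_unit_st_flow(edges, x, source="s", sink="t"):
    """Check flow conservation: net outflow is 1 at the source, -1 at the sink,
    and 0 at every other vertex."""
    balance = {}
    for e, (u, v, _) in edges.items():
        balance[u] = balance.get(u, 0) + x[e]
        balance[v] = balance.get(v, 0) - x[e]
    expected = lambda n: 1 if n == source else -1 if n == sink else 0
    return all(b == expected(n) for n, b in balance.items())


# Hypothetical toy graph: vertex "m" is an intermediate recombination vertex,
# and the two edges through it carry half of a penalty of 2 each.
edges = {
    0: ("s", "a", 0), 1: ("a", "b", 0), 2: ("a", "m", 1),
    3: ("m", "c", 1), 4: ("b", "t", 0), 5: ("c", "t", 0),
}
x_path = {0: 1, 1: 0, 2: 1, 3: 1, 4: 0, 5: 1}   # s -> a -> m -> c -> t
z = {"CG": 1, "TT": 0}                           # one string matched, one not

print(is_unit_st_flow(edges, x_path), objective(edges, x_path, z))
```

An assignment that selects only the first edge fails the check, because the flow entering vertex `a` never leaves it.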
4.2. Integer Quadratic Programming (IQP) Formulation
In our IQP formulation, Objective (11) and Constraints (12), (14), and (15) remain unchanged from the ILP formulation. Constraints in (13) are replaced by quadratic constraints defined as
| (16) |
We call Constraint (16) the quadratic string-hit constraint. Again, due to Constraint (14) at most one variable can be 1. The expression sums to 1 when the subpath is contained in the flow. In this case will take the value 1 and no penalty is paid in the objective. Conversely, if some of the edges for are not in the flow, the expression will sum to ≤ 0. If this is the case for each , then Constraint (16) can only be satisfied by setting and for each . Since , a penalty is paid in the objective. The total number of quadratic string-hit constraints is . In our experiments, we observe that IQP formulation solves the problem faster, albeit while requiring more memory.
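The behaviour described for the quadratic string-hit constraint can be checked numerically. The sketch below assumes one possible expression with the stated property, namely the sum of a hit's edge variables minus (number of edges in the hit minus 1); the exact expression used in Constraint (16) may differ, and all names are hypothetical.

```python
def hit_expression(hit_edges, x):
    """Equals 1 when every edge of the hit is selected, and is <= 0 otherwise
    (assumed form, for illustration)."""
    return sum(x[e] for e in hit_edges) - (len(hit_edges) - 1)


def quadratic_hit_sum(hits, x, z_hit):
    """Sum over hits of (hit variable) * (hit expression): the product of two
    decision variables is what makes the constraint quadratic."""
    return sum(z_hit[h] * hit_expression(edge_ids, x) for h, edge_ids in hits.items())


# Hypothetical data: two hits of the same string, given as lists of edge ids.
hits = {"p1": [0, 1], "p2": [2, 3, 4]}
x = {0: 1, 1: 1, 2: 1, 3: 0, 4: 1}      # p1 fully selected, p2 misses edge 3
z_hit = {"p1": 1, "p2": 0}              # at most one hit variable is set to 1

print(hit_expression(hits["p1"], x), hit_expression(hits["p2"], x))
print(quadratic_hit_sum(hits, x, z_hit))
```

Because all hits of a string are folded into a single sum, one constraint per string suffices, which matches the reduced constraint count discussed above.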
As a further improvement, we relax the variables for all to continuous values in Constraint (15), following Lemma 4.
Lemma 4.
An optimal solution to the IQP (or ILP) with relaxed Constraint (15) where variables lie within the continuous interval [0, 1] can be transformed in polynomial time to an optimal solution satisfying for all .
Proof. First, observe that if and only if all edges in some have their corresponding variables set to 1. This follows from Constraints (13) and (16), and the fact that at most one can be 1 for a given , by Constraint (14).
If for all in , then can be trivially obtained as a single haplotype path in without recombination penalties. In such a case, all edge variables are assigned either 0 or 1.
For the remaining cases, we introduce the following terms:
is a used hit-subpath if .
A flow between vertices and can be decomposed into -paths each assigned some positive flow and called flow subpaths.
is the first used hit-subpath if there is a flow subpath from vertex to the first vertex of without passing through another used hit-subpath.
is the last used hit-subpath if there is a flow subpath from the last vertex of to vertex without passing through another used hit-subpath.
and are consecutive used hit-subpaths if there is a flow subpath between them without passing through a third used hit-subpath, where and .
Now, if the variable of some string equals 1, there exists a used hit-subpath. We obtain an integral solution as follows. The flow used to reach the first used hit-subpath avoids recombination penalties by following a single haplotype path. Similarly, the flow from the end vertex of the last used hit-subpath to the sink avoids recombination penalties by staying on a single haplotype path. Next, consider two consecutive used hit-subpaths $p_1$ and $p_2$, with $u$ and $v$ as their respective end and start vertices. If $u$ and $v$ are on different haplotype paths, any flow subpaths between $u$ and $v$ must minimize the recombination penalty. The same minimum recombination cost can be achieved by replacing the potentially multiple fractional flow subpaths with a single path that incurs the same recombination penalty. We can select any flow subpath from $u$ to $v$ and assign its edge variables to 1. Edge variables on edges used by the flow from $u$ to $v$ and not on this selected path are set to 0. □
5. Results
Implementation Details.
We implemented our ILP and IQP solutions in C++ using Gurobi (v11.0.2) solver. We refer to our software as PHI (Pangenome-based Haplotype Inference). The user can provide a pangenome reference as either a graph (GFA format) or as a list of phased variants (VCF format). Given short-read or long-read sequencing data of either a haploid or a homozygous genome, PHI outputs the haplotype sequence associated with the optimal inferred path from the graph in FASTA format.
Given a set of reads, we compute () window minimizers [30] for identifying our hits (Definition 2). By default, and . These minimizers correspond to the set in Problem 1. Computing minimizer matches between two strings is faster than computing minimizer matches on a pangenome graph. For this reason, we find minimizer matches between reads and the sequences spelled by all the haplotype paths in the graph. This means includes only those subpaths that are completely contained in some haplotype path in (Definition 2). This restriction to also prevents us from needing to perform the additional edge splitting step described in Section 4.1. We used recombination penalty , this value was chosen empirically. We ran all our experiments on AMD EPYC 7763 processors with 512 GB RAM. We used 32 threads in all experiments.
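Window minimizers can be sketched in a few lines of Python. The version below is an illustration under assumptions: it treats a window as `w` consecutive k-mers, orders k-mers lexicographically, and breaks ties by the leftmost position, whereas practical tools typically order k-mers by a hash value. The function name and the toy parameters are hypothetical and are not the defaults used in the experiments.

```python
def window_minimizers(seq, w, k):
    """Return sorted (position, k-mer) pairs chosen as the smallest k-mer
    in each window of w consecutive k-mers."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        # Smallest k-mer in the window; the leftmost one wins ties.
        offset = min(range(w), key=lambda o: (window[o], o))
        picked.add((start + offset, window[offset]))
    return sorted(picked)


# Toy example with small, hypothetical parameters.
print(window_minimizers("ACGTACGT", w=3, k=2))
```

Adjacent windows often select the same k-mer occurrence, which is why the output is much smaller than the full set of k-mers.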
Datasets.
We evaluated our algorithm by estimating MHC sequences of five haplotypes (APD, DBB, MANN, QBL, SSTO) from homozygous human cell lines. Recently, Houwaart et al. [16,32] published complete assemblies of these MHC sequences using long and short-read sequencing. The average length of these assemblies is 4.99 Mbp. We downloaded the five short-read sequencing datasets available from this study. To evaluate our algorithm using varying sequencing coverage, we down-sampled each short-read dataset to obtain coverage of 0.1×, 0.5×, 1×, 2×, 5×, and 10×. We also used the full datasets for evaluation (coverage 12.9 − 18.2×). We used the complete assemblies of five MHC haplotypes as ground-truth to evaluate the accuracy of our estimated sequences. To quantify the accuracy, we measured edit distance between each estimated sequence and the corresponding ground-truth sequence.
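The accuracy measure used here, edit distance, can be computed with the textbook dynamic program shown below. This is a generic sketch, not the evaluation code used in this study; the quadratic running time makes it suitable only for short strings, and sequences of several Mbp would need a specialized aligner.

```python
def edit_distance(a, b):
    """Levenshtein distance (unit-cost insertions, deletions, substitutions)
    computed row by row, keeping only two rows in memory."""
    if len(a) < len(b):
        a, b = b, a                      # keep the shorter string as the row
    prev = list(range(len(b) + 1))       # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,             # delete ca
                cur[j - 1] + 1,          # insert cb
                prev[j - 1] + (ca != cb) # match or substitute
            ))
        prev = cur
    return prev[-1]


print(edit_distance("ACGTTGCA", "ACGTAGCA"))
```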
We built a haplotype-resolved pangenome graph of 49 complete MHC sequences [19] using Minigraph-Cactus [15]. These sequences were extracted from phased assemblies of 24 diploid human samples [22] and the CHM13 reference [27]. Using Minigraph-Cactus, we obtained the pangenome reference in a VCF format file. We subjected this file to further simplification steps to ensure compatibility with various tools. We show sequence similarity statistics between the complete MHC assemblies of five haplotypes (APD, DBB, MANN, QBL, SSTO) and the 49 pangenome reference haplotypes in Appendix Table 1.
Other Methods.
We compared PHI with two existing pangenome-based genotyping tools: (i) VG (v1.60) [35] and (ii) PanGenie (v3.1) [9]. VG supports sampling of relevant haplotypes from a pangenome graph by comparing k-mer counts in the reads and k-mers of a reference haplotype. The selection of haplotypes is done locally in fixed-length non-overlapping blocks. Recombinations may be introduced to create contiguous haplotypes across the blocks. The number of samples can be specified by the user. Accordingly, VG’s haplotype sampling feature can be adapted for haplotype sequence estimation by simply setting the number of desired samples to one. PanGenie supports short-read genotyping using a haplotype-resolved pangenome graph. PanGenie uses a hidden Markov model, which is similar to the standard Li and Stephens model [21]. PanGenie compares k-mer counts in the reads with the k-mers present in the graph to compute genotype likelihoods. PanGenie exhibited better genotyping accuracy and speed than other genotyping tools [9]. Because our sequencing datasets are derived from homozygous cell lines, we ignored the heterozygous genotype calls made by PanGenie (Appendix Table 3). We incorporated PanGenie’s predicted genotypes in the reference sequence to obtain the haplotype sequence. We list our commands to run PHI, VG and PanGenie in Appendix Table 2.
Genotyping performance.
We evaluated PHI, VG and PanGenie in their ability to infer the MHC sequences from short-read datasets of varying coverage (see Figure 3). Using low-coverage datasets (0.1–2×), PHI exhibits markedly higher accuracy than VG and PanGenie. VG and PanGenie may not be suitable for low-coverage sequencing. For example, the distribution of k-mer counts at low coverage can be unreliable. Distinguishing k-mers originating from unique versus repetitive regions, as required by PanGenie and VG, is also challenging at low coverage. Using coverage of 5× or more, the results of VG and PHI are comparable. PanGenie also produces comparable results using full datasets. We note that the integer programming (IQP) approach used in PHI requires more time and memory compared to the methods used in VG and PanGenie. PHI used up to 1.5 hours and 137 GB RAM in a single experiment. In contrast, VG and PanGenie required < 5 minutes and < 50 GB memory. It may be possible to optimize PHI by incorporating efficient heuristics. We show detailed performance statistics for PHI, including its runtime and memory usage, in Appendix Table 4.
Fig. 3:
Accuracy of haplotype sequences estimated by PHI, VG and PanGenie using short reads from MHC sequences of five haplotypes (APD, DBB, MANN, QBL, SSTO). The x-axes indicate the coverage of short-read data. The y-axes indicate the edit distance between the estimated haplotype sequence and the ground-truth sequence on a logarithmic scale.
Effect of our optimizations.
In PHI, we implemented both ILP-based and IQP-based solutions to solve the optimization problem. Using either solution, Gurobi solves Problem 1 to optimality. We benchmarked our ILP and IQP solutions to compare their runtime and memory usage (see Figure 4). On low-coverage datasets (0.1–1×), the runtimes are comparable. At higher coverage, the IQP solution runs faster, which is likely due to the fewer string-hit constraints used (Section 4.2), although it requires approximately 1.5 times more memory. This may be because Gurobi requires additional storage to handle quadratic constraints. Accordingly, while using PHI, the user can choose between ILP and IQP using a command line argument based on the available memory. If no choice is provided, the IQP solution is used by default. We also evaluated the advantage of relaxing edge variables to continuous values (Lemma 4) by comparing it to another version of our code where we set the edge variables to be discrete. Relaxation of variables decreases the runtime of the IQP solution by a factor of 1.6 on average (Appendix Figure 2). Not much effect on the runtime is observed in the ILP solution (Appendix Figure 3).
Fig. 4:
Performance comparison between the ILP and IQP solutions implemented in PHI. We compared their runtime and memory-usage using short-read sequencing datasets sampled from five haplotypes.
Impact of graph expansion with the addition of more genomes.
We evaluated the impact of pangenome graph expansion on PHI’s genotyping accuracy as well as runtime. To do this, we created five versions of our pangenome graph, each containing an increasing number of reference haplotypes, added progressively. The first graph comprises a single diploid sample (chosen randomly from 24 diploid samples) plus CHM13 reference, therefore, it has three reference haplotypes in total. The second graph includes two more diploid samples (chosen randomly from the remaining 23), therefore, it has seven reference haplotypes in total. Similarly, third, fourth and fifth graphs contain 13, 25 and 49 reference haplotypes, respectively. The fifth graph is equivalent to the graph used in previous experiments as well. This results in five different graphs that have 3, 7, 13, 25, and 49 reference haplotypes respectively.
We repeated our experiments with full short-read datasets using these five graphs and present results in Figure 5. We observe that edit distances between the estimated sequences and the ground truth sequences decrease with the increasing number of reference haplotypes. This is expected because more haplotypes are available to choose from when we compute our inferred path in the graph. We also observe an increase in runtime and memory usage. Runtime appears to increase superlinearly and memory appears to increase linearly with the number of reference haplotypes. This is because the size of expanded graph and the number of minimizer matches increase leading to more variables and constraints in our integer program.
Fig. 5:
Assessment of PHI’s performance with the increasing number of genomes in the pangenome graph. The left figure shows the accuracy in terms of edit distance between the output sequences and ground-truth sequences. The middle and right figures show the runtime and memory usage, respectively.
6. Discussion
Genotyping using pangenome graphs is equivalent to finding a walk in the graph that contains the sample’s variants [28]. If the sample is diploid, this becomes equivalent to finding a pair of paths. Drawing inspiration from this idea, we proposed a rigorous framework to infer a path through the graph, such that the sequence spelled by the path is consistent with the sequencing data in terms of the shared k-mers between them, while permitting a limited number of recombinations in the path, each incurring a fixed penalty. This optimization problem requires considering all possible paths in the graph. We proved that this problem is NP-hard and subsequently gave efficient integer programming solutions. As part of our methodology, we introduced the expanded graph data structure on which we could compute an appropriate st-flow of 1. Experimental results demonstrate the advantage of the proposed ILP/IQP approaches for accurate genome inference, especially with low-coverage data (coverage 0.1–1×). Thus, our algorithm can facilitate affordable genotyping and association studies of complex and repeat-rich regions of the genome.
Although our approach is currently tailored to haploid samples, it could generalize to diploid samples. This may be accomplished by finding an st-flow of 2 through the expanded graph and modifying some constraints. How well this approach genotypes and phases the genome would be interesting to explore. Another limitation of this work is that we do not capture uncertainty. For example, there may be multiple inferred paths with minimum cost. Lastly, pangenome graphs are expected to grow in the number of genomes, therefore, scaling the current approach to a large number of haplotype paths may be important. We leave these extensions to future work.
Acknowledgements
This research is funded in part by the DBT/Wellcome Trust India Alliance Fellowship (grant number IA/I/23/2/506979), the Intel India Research Fellowship, the National Institutes of Health of the USA (NIH-NIAID U01 AI090905), and the Jürgen Manchot Foundation. We utilized computing resources available at the Indian Institute of Science and the U.S. National Energy Research Scientific Computing Center.
Appendix
Fig. 1:
A small example of our reduction from the Hamiltonian Path Problem to Problem 1 (Theorem 1). (Top) The starting instance of the Hamiltonian Path Problem. (Bottom) The vertex-labeled graph constructed from . Here, and we assume , making . Each edge is supported by a unique haplotype (not shown). The string set is .
Table 1:
Additional information about the MHC sequences of five haplotypes (APD, DBB, MANN, QBL, SSTO). We show the length of the complete assembly in the second column. The third and fourth columns show edit distance statistics between the assembly and the 49 reference haplotypes included in the pangenome reference. In the last two columns, we list the SRA accession numbers and coverage of the short-read sequencing datasets.
| Haplotype | Assembly length (Mbp) | Edit distance to pangenome reference haplotypes (mean) | Edit distance to pangenome reference haplotypes (minimum) | Short-read SRA accession | Short-read coverage |
|---|---|---|---|---|---|
| APD | 4.93 | 146,423 | 37,102 | SRR17272303 | 16.26× |
| DBB | 5.05 | 174,619 | 10,380 | SRR17272302 | 12.91× |
| MANN | 5.03 | 189,464 | 58,168 | SRR17272301 | 18.20× |
| QBL | 4.90 | 159,968 | 72,293 | SRR17272300 | 12.85× |
| SSTO | 5.05 | 161,044 | 35,583 | SRR17272299 | 15.04× |
Table 2:
Commands used for running various tools
| Haplotype/Genotype Imputation | |
|---|---|
| PHI | 1) vcf2gfa.py -v multi-allelic_phased.vcf -r reference.fa > graph.gfa 2) PHI -t32 -g graph.gfa -r reads.fq -o imputed_hap.fa |
| PanGenie | PanGenie -t32 -i reads.fq -r reference.fa -v multi-allelic_phased.vcf -o out_vcf_PG |
| VG | 1) kmc -t32 -k29 -m128 -okff -hp reads.fq sample tmp_dir 2) vg haplotypes -t32 -v2 --num-haplotypes 1 -i input.hapl -k sample.kff -g sample_graph.gbz input_graph.gbz 3) vg paths -x sample_graph.gbz -F -S recombination > imputed_hap.fa |
| VCF Operations | |
| Transform VCF to have non-overlapping variants | vcfbub -l 0 -r 100000 -i input.vcf > output.vcf |
| Filter heterozygous variants | bcftools view -i 'GT="hom"' input.vcf.gz > output.vcf |
| Generate haplotype from reference genome and VCF file | bcftools consensus -f reference.fa -o imputed_hap.fa input.vcf.gz |
| Evaluation | |
| Edit distance | edlib-aligner ground-truth_hap.fa imputed_hap.fa |
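The edit distance reported by the evaluation command is the minimum number of single-character insertions, deletions, and substitutions needed to turn one sequence into the other. As a minimal sketch of the metric itself, the textbook dynamic program below computes it for short strings. This is not how edlib works internally (edlib uses a much faster bit-vector algorithm), and the quadratic-time loop here is impractical for MHC-length sequences; the function name is hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via row-by-row dynamic programming."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance between a[:i] and the empty prefix of b
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca from a
                curr[j - 1] + 1,           # insert cb into a
                prev[j - 1] + (ca != cb),  # match or substitute
            ))
        prev = curr
    return prev[-1]


# Toy sequences standing in for ground-truth and imputed haplotypes.
toy_distance = edit_distance("ACGTTGCA", "ACGTGCAA")
```

Only two rows of the table are kept in memory at a time, which is sufficient when the distance alone is needed and no alignment has to be reconstructed.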
Table 3:
Count of homozygous and heterozygous genotype calls made by PanGenie. In our benchmark, we excluded the heterozygous calls because the sequencing datasets were derived from homozygous cell lines.
| Coverage | APD Hom | APD Het | DBB Hom | DBB Het | MANN Hom | MANN Het | QBL Hom | QBL Het | SSTO Hom | SSTO Het |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.1× | 52,816 | 6,245 | 51,435 | 7,626 | 52,452 | 6,609 | 53,707 | 5,354 | 53,893 | 5,168 |
| 0.5× | 56,249 | 2,812 | 55,845 | 3,216 | 56,258 | 2,803 | 56,447 | 2,614 | 56,064 | 2,997 |
| 1× | 57,448 | 1,613 | 57,010 | 2,051 | 57,064 | 1,997 | 57,224 | 1,837 | 57,099 | 1,962 |
| 2× | 58,201 | 860 | 57,948 | 1,113 | 58,334 | 727 | 58,101 | 960 | 58,397 | 664 |
| 5× | 58,552 | 509 | 58,382 | 679 | 58,601 | 460 | 58,340 | 721 | 58,228 | 833 |
| 10× | 58,533 | 528 | 58,478 | 583 | 58,188 | 873 | 58,343 | 718 | 58,337 | 724 |
| Complete data | 58,647 | 414 | 58,457 | 604 | 58,592 | 469 | 58,457 | 604 | 58,521 | 540 |
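The hom/het split in the table depends on how each genotype (GT) field is classified. The sketch below shows one plausible rule for a VCF-style GT string: a call is homozygous when all of its alleles are identical and none is missing. It is an illustrative assumption, not the logic of PanGenie or bcftools, and the function name is hypothetical.

```python
import re
from collections import Counter


def classify_genotype(gt: str) -> str:
    """Classify a VCF GT string as 'hom', 'het', or 'missing'."""
    # Alleles are separated by '/' (unphased) or '|' (phased).
    alleles = re.split(r"[/|]", gt)
    if any(allele in (".", "") for allele in alleles):
        return "missing"
    return "hom" if len(set(alleles)) == 1 else "het"


# Tally a handful of toy genotype calls.
toy_calls = ["1/1", "0|1", "0/0", "1|0", "./."]
toy_counts = Counter(classify_genotype(gt) for gt in toy_calls)
```

Under this rule, homozygous-reference calls such as `0/0` count as homozygous alongside `1/1`.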
Table 4:
We report additional performance statistics for PHI on all our datasets. We specify the number of recombinations used in the solution in the second column. Next, we report the runtime and memory usage of PHI. In the fifth and sixth columns, we specify the edit distance and alignment identity between the output MHC sequence and the ground-truth sequence. Alignment identity is defined as the number of character matches divided by the length of the alignment. In the last three columns, we give statistics about the minimizers computed from sequencing reads. We give the count of distinct minimizers observed in the read set. A fraction of minimizers would be absent from the graph, and some fraction would be present in all reference haplotypes, making them ‘uninformative’. Only the matches of the remaining fraction of minimizers are useful while solving the optimization problem.
| Coverage | Recombinations | Time (s) | Memory (GB) | Edit distance | Alignment identity (%) | Distinct minimizers in reads | Minimizers absent from graph (%) | Uninformative minimizers (%) |
|---|---|---|---|---|---|---|---|---|
| Haplotype: APD | | | | | | | | |
| 0.1× | 3 | 1840 | 72 | 7551 | 99.85 | 33248 | 36.33 | 43.12 |
| 0.5× | 7 | 1294 | 84 | 2272 | 99.95 | 156209 | 37.90 | 41.42 |
| 1× | 7 | 2338 | 93 | 2220 | 99.95 | 289795 | 41.46 | 39.19 |
| 2× | 9 | 2702 | 108 | 1948 | 99.96 | 508720 | 46.47 | 35.84 |
| 5× | 10 | 4671 | 125 | 1779 | 99.96 | 984355 | 59.39 | 27.05 |
| 10× | 10 | 3683 | 134 | 1810 | 99.96 | 1599325 | 72.22 | 18.33 |
| 16.26× | 10 | 4536 | 134 | 1810 | 99.96 | 2288126 | 80.17 | 13.00 |
| Haplotype: DBB | | | | | | | | |
| 0.1× | 2 | 1604 | 70 | 2191 | 99.96 | 33901 | 37.28 | 41.78 |
| 0.5× | 4 | 1467 | 83 | 1415 | 99.97 | 157510 | 39.66 | 39.60 |
| 1× | 4 | 2022 | 92 | 1496 | 99.97 | 293996 | 42.54 | 37.84 |
| 2× | 4 | 2502 | 108 | 1472 | 99.97 | 518085 | 47.59 | 34.28 |
| 5× | 4 | 4175 | 126 | 1385 | 99.97 | 1015730 | 60.37 | 25.75 |
| 10× | 4 | 4525 | 132 | 1377 | 99.97 | 1660305 | 72.79 | 17.55 |
| 12.91× | 4 | 4743 | 135 | 1377 | 99.97 | 2028107 | 77.31 | 14.58 |
| Haplotype: MANN | | | | | | | | |
| 0.1× | 3 | 1680 | 67 | 41028 | 99.19 | 33614 | 34.31 | 43.07 |
| 0.5× | 7 | 1658 | 85 | 38379 | 99.24 | 153933 | 36.66 | 41.50 |
| 1× | 8 | 2183 | 94 | 37898 | 99.25 | 288713 | 39.33 | 39.76 |
| 2× | 9 | 3054 | 109 | 37728 | 99.25 | 502336 | 44.89 | 36.22 |
| 5× | 12 | 3774 | 126 | 36263 | 99.28 | 964364 | 57.71 | 27.55 |
| 10× | 14 | 5426 | 132 | 35941 | 99.29 | 1553694 | 70.85 | 18.86 |
| 18.20× | 14 | 4843 | 134 | 35940 | 99.29 | 2450244 | 81.06 | 12.15 |
| Haplotype: QBL | | | | | | | | |
| 0.1× | 3 | 2222 | 88 | 15062 | 99.69 | 32464 | 35.13 | 43.05 |
| 0.5× | 9 | 1236 | 81 | 7829 | 99.84 | 153818 | 37.47 | 41.77 |
| 1× | 10 | 2388 | 92 | 4610 | 99.91 | 284587 | 39.92 | 40.35 |
| 2× | 14 | 2981 | 109 | 3561 | 99.93 | 502087 | 46.98 | 36.14 |
| 5× | 17 | 3986 | 123 | 3349 | 99.93 | 966151 | 58.80 | 27.40 |
| 10× | 17 | 4049 | 129 | 3356 | 99.93 | 1566636 | 71.76 | 18.63 |
| 12.85× | 17 | 4113 | 131 | 3343 | 99.93 | 1862566 | 75.90 | 15.84 |
| Haplotype: SSTO | | | | | | | | |
| 0.1× | 2 | 2013 | 72 | 17626 | 99.65 | 33792 | 36.06 | 41.98 |
| 0.5× | 12 | 1812 | 84 | 10471 | 99.79 | 156473 | 37.60 | 41.12 |
| 1× | 20 | 2536 | 93 | 5150 | 99.90 | 291484 | 41.05 | 38.59 |
| 2× | 24 | 2977 | 108 | 4671 | 99.91 | 513683 | 46.50 | 35.02 |
| 5× | 24 | 5023 | 124 | 4611 | 99.91 | 992511 | 59.01 | 26.68 |
| 10× | 24 | 5021 | 132 | 4634 | 99.91 | 1609715 | 71.88 | 18.16 |
| 15.04× | 24 | 4499 | 137 | 4637 | 99.91 | 2206289 | 79.07 | 13.44 |
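The alignment identity column follows the definition in the caption: the number of matching characters divided by the alignment length. The sketch below computes that ratio from a pairwise alignment given as two equal-length gapped strings. The gapped-string input format, the `-` gap symbol, and the function name are assumptions made for illustration.

```python
def alignment_identity(aligned_a: str, aligned_b: str) -> float:
    """Fraction of alignment columns in which both sequences agree.

    Both inputs are rows of one pairwise alignment, so they must have
    equal length; '-' marks a gap and never counts as a match.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        raise ValueError("alignment must not be empty")
    matches = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-"
    )
    return matches / len(aligned_a)


# Six columns: four matches, one gap, one mismatch.
toy_identity = alignment_identity("ACG-TA", "ACGTTC")
```

Because gaps lengthen the alignment without adding matches, an insertion or deletion lowers the identity just as a substitution does.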
Fig. 2:
Evaluation of the performance of the IQP method with and without relaxation of the binary edge variables. We compared runtime using various short-read datasets.
Fig. 3:
Evaluation of the performance of the ILP method with and without relaxation of the binary edge variables. We compared runtime using various short-read datasets.
References
- 1. Baaijens J.A., Bonizzoni P., Boucher C., Della Vedova G., Pirola Y., Rizzi R., Sirén J.: Computational graph pangenomics: a tutorial on data structures and their applications. Natural Computing pp. 1–28 (2022)
- 2. Bradbury P.J., Casstevens T., Jensen S.E., Johnson L., Miller Z., Monier B., Romay M., Song B., Buckler E.S.: The practical haplotype graph, a platform for storing and using pangenomes for imputation. Bioinformatics 38(15), 3698–3702 (2022)
- 3. Chandra G., Gibney D., Jain C.: Haplotype-aware sequence alignment to pangenome graphs. Genome Research 34(9), 1265–1275 (2024)
- 4. Computational Pan-Genomics Consortium: Computational pan-genomics: status, promises and challenges. Briefings in Bioinformatics 19(1), 118–135 (2018)
- 5. Davies R.W., Kucka M., Su D., et al.: Rapid genotype imputation from sequence with reference panels. Nature Genetics 53(7), 1104–1111 (Jun 2021). 10.1038/s41588-021-00877-0
- 6. Dilthey A., Cox C., Iqbal Z., Nelson M.R., McVean G.: Improved genome inference in the MHC using a population reference graph. Nature Genetics 47(6), 682–688 (2015)
- 7. Dilthey A.T.: State-of-the-art genome inference in the human MHC. The International Journal of Biochemistry & Cell Biology 131, 105882 (2021)
- 8. Ebert P., Audano P.A., et al.: Haplotype-resolved diverse human genomes and integrated analysis of structural variation. Science 372(6537) (Apr 2021). 10.1126/science.abf7117
- 9. Ebler J., Ebert P., Clarke W.E., Rausch T., Audano P.A., Houwaart T., Mao Y., Korbel J.O., Eichler E.E., Zody M.C., et al.: Pangenome-based genome inference allows efficient and accurate genotyping across a wide spectrum of variant classes. Nature Genetics 54(4), 518–525 (2022)
- 10. Eggertsson H.P., Jonsson H., Kristmundsdottir S., et al.: Graphtyper enables population-scale genotyping using pangenome graphs. Nature Genetics 49(11), 1654–1660 (2017)
- 11. Gao Y., Yang X., Chen H., Tan X., Yang Z., Deng L., Wang B., Kong S., Li S., Cui Y., et al.: A pangenome reference of 36 Chinese populations. Nature 619(7968), 112–121 (2023)
- 12. Grytten I., Dagestad Rand K., Sandve G.K.: Kage: Fast alignment-free graph-based genotyping of SNPs and short indels. Genome Biology 23(1), 209 (2022)
- 13. Harris L., McDonagh E.M., Zhang X., Fawcett K., Foreman A., Daneck P., Sergouniotis P.I., Parkinson H., Mazzarotto F., Inouye M., et al.: Genome-wide association testing beyond SNPs. Nature Reviews Genetics pp. 1–15 (2024)
- 14. Hickey G., Heller D., Monlong J., Sibbesen J.A., Sirén J., Eizenga J., Dawson E.T., Garrison E., Novak A.M., Paten B.: Genotyping structural variants in pangenome graphs using the vg toolkit. Genome Biology 21, 1–17 (2020)
- 15. Hickey G., Monlong J., Ebler J., Novak A.M., Eizenga J.M., Gao Y., Marschall T., Li H., Paten B.: Pangenome graph construction from genome alignments with minigraph-cactus. Nature Biotechnology pp. 1–11 (2023)
- 16. Houwaart T., Scholz S., Pollock N.R., et al.: Complete sequences of six major histocompatibility complex haplotypes, including all the major MHC class II structures. HLA 102(1), 28–43 (Mar 2023). 10.1111/tan.15020
- 17. Jain C., Tavakoli N., Aluru S.: A variant selection framework for genome graphs. Bioinformatics 37(Supplement_1), i460–i467 (2021)
- 18. Letcher B., Hunt M., Iqbal Z.: Gramtools enables multiscale variation analysis with genome graphs. Genome Biology 22, 1–27 (2021)
- 19. Li H.: Sample graphs and sequences for testing sequence-to-graph alignment (2022). 10.5281/zenodo.6617246
- 20. Li J.H., Mazur C.A., Berisa T., Pickrell J.K.: Low-pass sequencing increases the power of GWAS and decreases measurement error of polygenic risk scores compared to genotyping arrays. Genome Research 31(4), 529–537 (2021)
- 21. Li N., Stephens M.: Modeling linkage disequilibrium and identifying recombination hotspots using single-nucleotide polymorphism data. Genetics 165(4), 2213–2233 (2003)
- 22. Liao W.W., Asri M., Ebler J., Doerr D., Haukness M., Hickey G., Lu S., Lucas J.K., Monlong J., Abel H.J., et al.: A draft human pangenome reference. Nature 617(7960), 312–324 (2023)
- 23. Mahmoud M., Gobet N., Cruz-Dávalos D.I., Mounier N., Dessimoz C., Sedlazeck F.J.: Structural variant calling: the long and the short of it. Genome Biology 20, 1–14 (2019)
- 24. Martin A.R., Atkinson E.G., Chapman S.B., Stevenson A., Stroud R.E., Abebe T., Akena D., Alemayehu M., Ashaba F.K., Atwoli L., et al.: Low-coverage sequencing cost-effectively detects known and novel variation in underrepresented populations. The American Journal of Human Genetics 108(4), 656–668 (2021)
- 25. Mun T., Vaddadi N.S.K., Langmead B.: Pangenomic genotyping with the marker array. Algorithms for Molecular Biology 18(1), 2 (2023)
- 26. Mustafa H., Karasikov M., Mansouri Ghiasi N., Rätsch G., Kahles A.: Label-guided seed-chain-extend alignment on annotated de Bruijn graphs. Bioinformatics 40(Supplement_1), i337–i346 (2024)
- 27. Nurk S., Koren S., Rhie A., Rautiainen M., et al.: The complete sequence of a human genome. Science 376(6588), 44–53 (Apr 2022). 10.1126/science.abj6987
- 28. Paten B., Novak A.M., Eizenga J.M., Garrison E.: Genome graphs and the evolution of genome inference. Genome Research 27(5), 665–676 (2017)
- 29. Pritt J., Chen N.C., Langmead B.: Forge: prioritizing variants for graph genomes. Genome Biology 19(1), 1–16 (2018)
- 30. Roberts M., Hayes W., Hunt B.R., Mount S.M., Yorke J.A.: Reducing storage requirements for biological sequence comparison. Bioinformatics 20(18), 3363–3369 (Jul 2004). 10.1093/bioinformatics/bth408
- 31. Rubinacci S., Ribeiro D.M., et al.: Efficient phasing and imputation of low-coverage sequencing data using large reference panels. Nature Genetics 53(1), 120–126 (Jan 2021). 10.1038/s41588-020-00756-0
- 32. Scholz S.: Complete sequences of six major histocompatibility complex haplotypes rev2 (2024). 10.5281/ZENODO.13889311
- 33. Sibbesen J.A., Maretty L., Consortium D.P.G., Krogh A.: Accurate genotyping across variant classes and lengths using variant graphs. Nature Genetics 50(7), 1054–1059 (2018)
- 34. Sirén J., Monlong J., Chang X., Novak A.M., Eizenga J.M., Markello C., Sibbesen J.A., Hickey G., Chang P.C., Carroll A., et al.: Pangenomics enables genotyping of known structural variants in 5202 diverse genomes. Science 374(6574), abg8871 (2021)
- 35. Sirén J., Eskandar P., Ungaro M.T., et al.: Personalized pangenome references. Nature Methods (Sep 2024). 10.1038/s41592-024-02407-2
- 36. Smith T.P., Bickhart D.M., Boichard D., Chamberlain A.J., Djikeng A., Jiang Y., Low W.Y., Pausch H., Demyda-Peyrás S., Prendergast J., et al.: The bovine pangenome consortium: democratizing production and accessibility of genome assemblies for global cattle breeds and other bovine species. Genome Biology 24(1), 139 (2023)
- 37. Tavakoli N., Gibney D., Aluru S.: Haplotype-aware variant selection for genome graphs. In: Proceedings of the 13th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics. pp. 1–9 (2022)
- 38. Vaddadi K., Mun T., Langmead B.: Minimizing reference bias with an impute-first approach (Dec 2023). 10.1101/2023.11.30.568362








