This is a preprint.

It has not yet been peer reviewed by a journal.

[Preprint]. 2024 Oct 29:2024.10.27.620212. [Version 1] doi: 10.1101/2024.10.27.620212

Integer programming framework for pangenome-based genome inference

Ghanshyam Chandra 1, Md Helal Hossen 2, Stephan Scholz 3,4, Alexander T Dilthey 3,4, Daniel Gibney 2, Chirag Jain 1,*
PMCID: PMC11565907  PMID: 39554168

Abstract

Affordable genotyping methods are essential in genomics. Commonly used genotyping methods primarily support single nucleotide variants and short indels but neglect structural variants. Additionally, accuracy of read alignments to a reference genome is unreliable in highly polymorphic and repetitive regions, further impacting genotyping performance. Recent works highlight the advantage of haplotype-resolved pangenome graphs in addressing these challenges. Building on these developments, we propose a rigorous alignment-free genotyping framework. Our formulation seeks a path through the pangenome graph that maximizes the matches between the path and substrings of sequencing reads (e.g., k-mers) while minimizing recombination events (haplotype switches) along the path. We prove that this problem is NP-Hard and develop efficient integer-programming solutions. We benchmarked the algorithm using downsampled short-read datasets from homozygous human cell lines with coverage ranging from 0.1× to 10×. Our algorithm accurately estimates complete major histocompatibility complex (MHC) haplotype sequences with small edit distances from the ground-truth sequences, providing a significant advantage over existing methods on low-coverage inputs. Although our algorithm is designed for haploid samples, we discuss future extensions to diploid samples.

1. Introduction

Many initiatives are in progress for building haplotype-resolved pangenome references of human and nonhuman species [22,11,36]. Among many applications, pangenome graphs can enable cost-effective genotyping and imputation of a wide spectrum of variant classes beyond single nucleotide polymorphisms (SNPs) and short indels [13]. Pangenome graphs represent sequence alignment of high-quality fully-phased genome assemblies of individuals from diverse populations [1]. A pangenome graph can be represented as either a cyclic or an acyclic directed graph whose vertices are labeled with sequences. Paths in this graph spell the reference haplotype sequences and their recombinations. The graph-based representation is flexible enough to incorporate SNPs, indels (short insertions and deletions), large structural variants (SVs), nested variants, gene absence/presence, etc. [4].

Recent works propose the use of pangenome references to improve genotyping accuracy from short-read sequencing data [9,14,12,18,2,10,33,25]. Especially for SVs, these methods are an effective alternative to conventional genotyping methods based on aligning reads to a single reference, because short-read alignments can be inaccurate for reads originating from SVs [23,8]. Methods such as PRG [6], PanGenie [9] and KAGE [12] utilize k-mer statistics to infer paths in the graph that correspond to the target genome. These methods compare the k-mers surrounding a variant site in the graph with the k-mer counts in the sequencing data to calculate likelihoods of reference and alternative alleles. PanGenie and KAGE also use the long-range haplotype information available in the haplotype-resolved pangenome references. The other approach, used in methods such as Giraffe [34] and Graphtyper [10], involves aligning reads to a pangenome graph.

There have been efforts on improving the accuracy of read alignments to pangenome graphs as well. A large combinatorial search space in terms of the number of candidate paths in a pangenome graph increases ambiguity during read alignment. This issue has motivated methods that either impute a personalized reference genome [38], sample variants [29,17,37] to obtain a smaller graph, or prioritize the use of reference haplotypes in the graph during alignment [3,34,26]. Our previous work proposed haplotype-aware sequence alignment to graphs by introducing penalties for haplotype switches in an alignment [3]. A recent feature added to VG allows sampling of reference haplotypes and their recombinations from the graph that are most relevant to the target genome using a k-mer-based greedy heuristic [35].

Low-coverage sequencing, combined with genotyping and phasing, is a cost-effective approach to conduct large-scale genetic studies [31,5,20,24]. In this paper, we develop a rigorous formulation and algorithms for genotyping using pangenome references. Our framework is also applicable to low-coverage short-read sequencing data (coverage 0.1−1×). Following the standard Li and Stephens model [21], we view the target genome as an imperfect mosaic of the reference haplotypes. Our contributions are as follows.

  • We introduce a novel problem formulation to estimate the complete haplotype sequence of a haploid genome by determining an appropriate path in the pangenome graph. The objective is to maximize the number of shared substrings (e.g., k-mers or minimizers) between the sequencing data and the sequence spelled by the path. We permit recombinations in the path, subject to a fixed penalty per recombination. We refer to this problem as the Path Inference Problem (formally defined in Section 2).

  • We prove that the Path Inference Problem is NP-hard, even when restricted to binary alphabets.

  • To solve this problem, we develop two integer-programming solutions which involve linear and quadratic constraints, respectively. The two solutions involve a tradeoff between runtime and memory usage.

  • We demonstrate the utility of this framework by testing it on downsampled short-read datasets from five human haploid cell lines (coverage 0.1−10×). For these five samples, complete major histocompatibility complex (MHC) haplotype sequences have been previously determined using long-read assembly [16]. As our pangenome reference, we used a haplotype-resolved pangenome directed acyclic graph (DAG) of 49 MHC haplotype sequences [19]. We chose the MHC region for evaluation because it is the most polymorphic and gene-rich region of the human genome [7]. The length of this region is about 5 Mbp.

  • Using datasets with 0.1× coverage, our algorithm outputs MHC sequences that are up to 99.96% identical to the ground-truth sequences. It compares favorably to the existing methods.

2. Notations and Problem Formulation

Let $G(V,E,\sigma,\mathcal{H})$ denote a directed acyclic graph (DAG) representing a haplotype-resolved pangenome reference. Function $\sigma$ assigns a string label over alphabet $\Sigma=\{A,C,G,T\}$ to each vertex. A path $(u_1,u_2,\ldots,u_n)$ in $G$ spells string $\sigma(u_1)\cdot\sigma(u_2)\cdots\sigma(u_n)$, where $s_1\cdot s_2$ denotes the concatenation of strings $s_1$ and $s_2$. $\mathcal{H}=\{h_1,h_2,\ldots,h_{|\mathcal{H}|}\}$ denotes a set of paths in $G$ such that each of these paths spells a reference haplotype sequence used in the pangenome reference. We refer to these paths as haplotype paths. We assume that each haplotype path is described by an array, i.e., $h_i[1]$ is the first vertex in $h_i$, $h_i[2]$ is the second vertex in $h_i$, etc. The length of a haplotype path $h_i$, that is, the count of vertices in $h_i$, is denoted as $|h_i|$. The set of haplotype paths covering vertex $v\in V$ is denoted as $\mathrm{haps}(v)$. We assume that, for each edge $(u,v)\in E$, there exists a haplotype path $h_i\in\mathcal{H}$ such that $u$ and $v$ are consecutive vertices in $h_i$. In other words, each edge is supported by at least one haplotype path.

Definition 1 (Inferred Path).

An inferred path $\mathcal{P}$ of length $n$ is represented as an ordered set $(a_1,a_2,\ldots,a_n)$, where each $a_i$ is a two-tuple $(u,h)$ such that $u\in V$, $h\in\mathrm{haps}(u)$, and $(a_i.u,a_{i+1}.u)\in E$ for all $i\in[1,n)$. Furthermore, if $a_i.h=a_{i+1}.h$, then $a_i.u$ and $a_{i+1}.u$ should be consecutive vertices in haplotype path $a_i.h$.

In an inferred path, we keep track of the haplotype path indices alongside vertex indices (Figure 1). We say a recombination, or a haplotype switch, occurs between two consecutive vertices $a_i.u$ and $a_{i+1}.u$ in $\mathcal{P}$ if $a_i.h\neq a_{i+1}.h$. We use $\gamma(\mathcal{P})$ to denote the count of recombinations in $\mathcal{P}$. With a mild abuse of notation, we denote the string spelled by $\mathcal{P}$ as $\sigma(\mathcal{P})$.

Fig. 1:

A simple illustration of a haplotype-resolved pangenome graph with two haplotype paths highlighted in pink and blue. An inferred path with a single recombination is shown as a dashed line.

Problem 1 (Path Inference Problem).

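  • Input: A haplotype-resolved pangenome DAG $G=(V,E,\sigma,\mathcal{H})$, a set of strings $\mathcal{S}$ from the target genome, and a non-negative integer $c$ indicating the recombination penalty.

  • Output: An inferred path $\mathcal{P}$ such that
    $$\mathrm{Cost}(\mathcal{P}) = c\cdot\gamma(\mathcal{P}) + \sum_{r\in\mathcal{S}}\bar{\chi}(r,\sigma(\mathcal{P}))$$
    is minimized, where $\bar{\chi}(r,\sigma(\mathcal{P}))=0$ if string $r$ occurs as a substring of string $\sigma(\mathcal{P})$ and $1$ otherwise.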

The intuition behind our formulation is to maximize the number of string matches along the inferred path while minimizing the number of recombinations. This approach yields an inferred path whose spelled sequence contains as many strings from $\mathcal{S}$ as possible as substrings, while the recombination penalty $c$ limits the number of recombinations. The set $\mathcal{S}$ can consist of either k-mers or minimizers observed in the sequencing reads.
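To make the objective concrete, the following minimal Python sketch evaluates the cost of a given inferred path on a toy instance. The function names (`count_recombinations`, `spell`, `path_cost`), the vertex labels, and the haplotype names are hypothetical and chosen only for illustration; the sketch does not check that the path is valid with respect to the graph's edges or haplotype sets.

```python
from typing import Dict, List, Sequence, Tuple

# An inferred path is a list of (vertex, haplotype) tuples, as in Definition 1.
InferredPath = List[Tuple[str, str]]


def count_recombinations(path: InferredPath) -> int:
    """gamma(P): number of consecutive pairs whose haplotype labels differ."""
    return sum(1 for (_, h1), (_, h2) in zip(path, path[1:]) if h1 != h2)


def spell(path: InferredPath, sigma: Dict[str, str]) -> str:
    """sigma(P): concatenation of the vertex labels along the path."""
    return "".join(sigma[u] for u, _ in path)


def path_cost(path: InferredPath, sigma: Dict[str, str],
              strings: Sequence[str], c: int) -> int:
    """Cost(P) = c * gamma(P) + number of strings in S absent from sigma(P)."""
    spelled = spell(path, sigma)
    missing = sum(1 for r in strings if r not in spelled)
    return c * count_recombinations(path) + missing


# Toy instance (hypothetical labels): a bubble v2/v3 between v1 and v4.
sigma = {"v1": "AC", "v2": "G", "v3": "T", "v4": "CA"}
strings = ["ACG", "GCA", "TCA"]
no_switch = [("v1", "h1"), ("v2", "h1"), ("v4", "h1")]
one_switch = [("v1", "h1"), ("v2", "h1"), ("v4", "h2")]

cost_no_switch = path_cost(no_switch, sigma, strings, c=2)
cost_one_switch = path_cost(one_switch, sigma, strings, c=2)
```

Both toy paths spell the same sequence, so they miss the same string; the second path additionally pays the penalty $c$ for its single haplotype switch.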

3. Computational Complexity

Theorem 1.

Problem 1 is NP-hard. This holds for any value of $c=|V|^{\Theta(1)}$ and even when $\Sigma=\{0,1\}$.

We begin with an instance $G_H(V_H,E_H)$ of the Hamiltonian Path Problem. Let $V_H=\{u_1,\ldots,u_n\}$. We first create a graph $G=(V,E)$ where

$$V=\{s\}\cup\{u_k^i \mid 1\le k\le n,\ 1\le i\le n\}\cup\{t\}$$
$$E=\{(s,u_k^1)\mid 1\le k\le n\}\cup\{(u_k^i,u_h^{i+1})\mid (u_k,u_h)\in E_H,\ 1\le i<n\}\cup\{(u_k^n,t)\mid 1\le k\le n\}$$

For $1\le x\le n+2(c(n+1)+1)$, let $\mathrm{bin}(x)$ be the standard binary encoding of $x$ using $b=\lfloor\log_2(n+2(c(n+1)+1))\rfloor+1$ bits. We assign the vertex labels

$$\sigma(u_k^i)=\mathrm{bin}(k)0^b1 \quad\text{for } 1\le i\le n,\ 1\le k\le n$$
$$\sigma(s)=\mathrm{bin}(n+1)0^b1\cdot\mathrm{bin}(n+2)0^b1\cdots\mathrm{bin}(n+c(n+1)+1)0^b1$$
$$\sigma(t)=\mathrm{bin}(n+c(n+1)+1+1)0^b1\cdot\mathrm{bin}(n+c(n+1)+1+2)0^b1\cdots\mathrm{bin}(n+2(c(n+1)+1))0^b1$$

We create a distinct haplotype path for each edge that supports only that edge. We define the set of strings $\mathcal{S}=\{\mathrm{bin}(1)0^b1,\mathrm{bin}(2)0^b1,\ldots,\mathrm{bin}(n+2(c(n+1)+1))0^b1\}$. See Figure 1 in Appendix for a small worked example. The reduction presented above clearly runs in polynomial time for $c=|V|^{\Theta(1)}$. Combined with Lemmas 1 and 2, Theorem 1 follows.
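The labels used in the reduction can be generated mechanically. The sketch below is an illustrative Python rendering of the label construction only (it does not build the graph or the haplotype paths); the function name `reduction_labels` and the toy parameters $n=2$, $c=1$ are assumptions made for the example.

```python
from typing import Dict, List, Tuple


def reduction_labels(n: int, c: int) -> Tuple[int, Dict[int, str], str, str, List[str]]:
    """Build the vertex labels and the string set S used in the reduction."""
    half = c * (n + 1) + 1      # number of encoded blocks stored in each of s and t
    total = n + 2 * half        # largest integer that gets encoded
    b = total.bit_length()      # equals floor(log2(total)) + 1

    def block(x: int) -> str:
        # bin(x) on b bits, followed by the padding 0^b 1
        return format(x, "0{}b".format(b)) + "0" * b + "1"

    sigma_u = {k: block(k) for k in range(1, n + 1)}   # label of u_k^i, for every layer i
    sigma_s = "".join(block(x) for x in range(n + 1, n + half + 1))
    sigma_t = "".join(block(x) for x in range(n + half + 1, total + 1))
    strings = [block(x) for x in range(1, total + 1)]
    return b, sigma_u, sigma_s, sigma_t, strings


# Toy parameters: a 2-vertex Hamiltonian Path instance with penalty c = 1.
b, sigma_u, sigma_s, sigma_t, strings = reduction_labels(n=2, c=1)

# Sequence spelled by the path s, u_1^1, u_2^2, t.
spelled = sigma_s + sigma_u[1] + sigma_u[2] + sigma_t
```

In this toy case every string of $\mathcal{S}$ occurs exactly once in `spelled`, and only aligned to a whole block, which is the role the $0^b1$ padding plays in the proof of Lemma 2.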

Lemma 1.

If $G_H$ contains a Hamiltonian path, then $G$ has an inferred path $\mathcal{P}$ with $\mathrm{Cost}(\mathcal{P})=c(n+1)$.

Proof. Let $u_{i_1},\ldots,u_{i_n}$ be a Hamiltonian path in $G_H$. We take as our inferred path $\mathcal{P}=s,u_{i_1}^1,u_{i_2}^2,\ldots,u_{i_n}^n,t$. As every edge has its own corresponding haplotype, the number of recombinations is $n+1$. Furthermore, since $u_{i_1},\ldots,u_{i_n}$ is a Hamiltonian path and $s$ and $t$ are included in the inferred path, all strings in $\mathcal{S}$ occur in $\sigma(\mathcal{P})$. Hence, the total cost is $c(n+1)$. □

Lemma 2.

If $G$ has an inferred path $\mathcal{P}$ with $\mathrm{Cost}(\mathcal{P})\le c(n+1)$, then $G_H$ has a Hamiltonian path.

Proof. First, we claim that $s$ and $t$ must be included in $\mathcal{P}$. The $0^b1$ substrings are used as padding to prevent any string in $\mathcal{S}$ from being matched using portions of two or more vertex labels. Therefore, if $s$ or $t$ is not included in the inferred path, at least $c(n+1)+1$ strings from $\mathcal{S}$ do not occur in $\sigma(\mathcal{P})$, contradicting $\mathrm{Cost}(\mathcal{P})\le c(n+1)$. Hence, the inferred path $\mathcal{P}$ must contain $s$ and $t$ and be of the form $s,u_{i_1}^1,\ldots,u_{i_n}^n,t$ for some $i_1,\ldots,i_n$. Since each edge traversed corresponds to a recombination, the total number of recombinations is $n+1$. The only way that $\mathrm{Cost}(\mathcal{P})\le c(n+1)$ can hold is if all strings in $\mathcal{S}$ occur as substrings in $\sigma(\mathcal{P})$. Again, due to the $0^b1$ padding in the vertex labels, this can only happen if for all $i\in[1,n]$, $u_i^k$ is a vertex in $\mathcal{P}$ for some $k$. Furthermore, because there are $n$ vertices in $\mathcal{P}$ that are not $s$ or $t$, there must be exactly one such $k$ for a given $i$. We conclude that $u_{i_1},\ldots,u_{i_n}$ is a Hamiltonian path in $G_H$. □

4. Proposed Algorithms

Before developing our integer programming solutions to Problem 1, it is helpful to define an additional graph representation, which we call the expanded graph. In pangenome graphs, multiple haplotype paths share vertices if the sequences are conserved, whereas in the expanded graph, we split all haplotypes into separate paths (Figures 2A, 2B). The expanded graph enables us to model Problem 1 as a kind of network flow problem. In particular, the inferred path will be reconstructed from a flow of value one in the expanded graph. We will assign weights to edges to account for the recombination penalty. Additional constraints will be used to capture how many strings in $\mathcal{S}$ occur in the resulting inferred path.

Fig. 2:

(A) A pangenome graph with four haplotype paths $h_1$, $h_2$, $h_3$ and $h_4$. The set of haplotype paths passing through a vertex is listed below each vertex. (B) The corresponding expanded graph, which includes four disjoint paths, one for each haplotype path. The recombination edges are shown in purple; these edges have a weight of $c$. We consider only the useful recombinations (Lemma 3). The edges which are not recombination edges in the expanded graph have a weight of 0. (C) The corresponding optimized expanded graph.

Lemma 3 allows us to only consider a subset of all possible recombinations in order to find an optimal solution to Problem 1. We call the type of recombination described in Lemma 3 a useful recombination.

Lemma 3.

There exists an optimal inferred path $\mathcal{P}=(a_1,\ldots,a_n)$ for Problem 1 where for all $i\in[1,n)$, $a_i.h\neq a_{i+1}.h$ implies vertices $a_i.u$ and $a_{i+1}.u$ are not consecutive vertices in haplotype path $a_i.h$.

Proof. Suppose there is an optimal inferred path $\mathcal{P}=(a_1,\ldots,a_n)$ for Problem 1 where for some $a_i$, $a_i.h\neq a_{i+1}.h$ such that $a_i.u$ and $a_{i+1}.u$ are consecutive vertices in haplotype path $a_i.h$. Furthermore, suppose we start with the smallest $i$ where this holds. We then change the haplotype path for $a_{i+1}$ to equal $a_i.h$. This does not increase the overall cost, since the number of strings in $\mathcal{S}$ occurring in $\sigma(\mathcal{P})$ has not changed, and the number of recombinations either decreases or stays the same. Continuing this process from the next $j>i$ such that $a_j.h\neq a_{j+1}.h$ and $a_j.u$ and $a_{j+1}.u$ are consecutive vertices in $a_j.h$, we achieve an inferred path satisfying the conditions stated in the lemma after at most $n$ iterations. □
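The rewriting argument in the proof of Lemma 3 can be sketched as a single left-to-right pass over the path. The Python below is an illustrative sketch, not part of the paper's implementation; the function name `remove_useless_recombinations`, the dictionary-based representation of haplotype paths, and the toy graph are assumptions. It relies on the graph being a DAG, so each vertex appears at most once in a haplotype path.

```python
from typing import Dict, List, Tuple

InferredPath = List[Tuple[str, str]]  # list of (vertex, haplotype) tuples


def remove_useless_recombinations(path: InferredPath,
                                  hap_paths: Dict[str, List[str]]) -> InferredPath:
    """Reassign haplotype labels so that every remaining switch is 'useful'."""
    # Position of each vertex inside each haplotype path.
    pos = {h: {v: i for i, v in enumerate(vs)} for h, vs in hap_paths.items()}
    fixed = list(path)
    for i in range(len(fixed) - 1):
        (u, h), (v, h_next) = fixed[i], fixed[i + 1]
        consecutive = (u in pos[h] and v in pos[h]
                       and pos[h][v] == pos[h][u] + 1)
        if h != h_next and consecutive:
            # Stay on the current haplotype instead of switching.
            fixed[i + 1] = (v, h)
    return fixed


def count_recombinations(path: InferredPath) -> int:
    return sum(1 for (_, h1), (_, h2) in zip(path, path[1:]) if h1 != h2)


# Toy graph: two haplotypes sharing vertices a, b and d.
hap_paths = {"h1": ["a", "b", "c", "d"], "h2": ["a", "b", "e", "d"]}
before = [("a", "h2"), ("b", "h1"), ("c", "h1"), ("d", "h2")]
after = remove_useless_recombinations(before, hap_paths)
```

Because the haplotype label of position $i+1$ is updated before position $i+1$ is examined as a left endpoint, one pass reproduces the iterative process described in the proof. The spelled sequence is unchanged since only haplotype labels are modified.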

Next, we present a definition of the expanded graph where we will consider only the useful recombinations. For technical reasons, we preprocess each edge in $E$, splitting it and adding a new vertex labeled with the empty string $\epsilon$. Each added vertex inherits the haplotype paths which supported the edge it was formed from. This added step is to prevent recombinations from a haplotype to itself when we build our expanded graph. Now, let $V=\{u_1,\ldots,u_n\}$. For haplotype path $h_j\in\mathcal{H}$, let $u_{h_j[i]}$ denote the $i$th vertex in haplotype path $h_j$. We use $G_E=(V_E,E_E,\sigma_E)$ to denote the expanded graph. In $G_E$, vertices are string-labeled and edges are weighted. Vertex set $V_E$ is defined as:

$$V_E=\{s\}\cup\{t\}\cup\{u^{j}_{h_j[i]} \mid 1\le j\le|\mathcal{H}|,\ 1\le i\le|h_j|\} \tag{1}$$

The vertex set contains a source and sink vertex, $s$ and $t$, respectively. The vertex set also contains a set of disjoint vertices for each haplotype path in $\mathcal{H}$ (Figure 2B). A superscript is used to indicate which haplotype path the vertex is designated to. We refer to the ordered vertex set $u^{j}_{h_j[1]},\ldots,u^{j}_{h_j[|h_j|]}$ as a haplotype path in $G_E$.

We denote weighted edges in $E_E$ as tuples of the form (start, end, weight). The weighted edge set is

$$E_E=\{(s,u^{j}_{h_j[1]},0) \mid 1\le j\le|\mathcal{H}|\} \tag{2}$$
$$\cup\ \{(u^{j}_{h_j[|h_j|]},t,0) \mid 1\le j\le|\mathcal{H}|\} \tag{3}$$
$$\cup\ \{(u^{j}_{h_j[i]},u^{j}_{h_j[i+1]},0) \mid 1\le j\le|\mathcal{H}|,\ 1\le i<|h_j|\} \tag{4}$$
$$\cup\ \{(u^{j}_{h_j[i]},u^{j'}_{k},c) \mid 1\le j,j'\le|\mathcal{H}|,\ (u_{h_j[i]},u_k)\in E \text{ s.t. } i=|h_j| \text{ or } h_j[i+1]\neq u_k\} \tag{5}$$

Next, we give some intuition for each line (2)-(5) in the above construction of EE.

(2) Weight 0 edges are created from s to the start of each haplotype path in GE.

(3) Weight 0 edges are created from the end of each haplotype path in GE to t.

(4) Weight 0 edges are created between adjacent vertices in each haplotype path. That is, in the path for $h_j$, an edge is created from $u_{h_j[i]}$ to $u_{h_j[i+1]}$.

(5) Weight c edges are used to represent the useful recombinations described in Lemma 3. We call these recombination edges.

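We use $\epsilon$ to denote the empty string. The vertex labels are defined as follows:

$$\sigma_E(u^{j}_{h_j[i]})=\sigma(u_{h_j[i]}) \quad\text{for } 1\le j\le|\mathcal{H}|,\ 1\le i\le|h_j| \tag{6}$$
$$\sigma_E(s)=\sigma_E(t)=\epsilon \tag{7}$$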

(6) The vertices in a haplotype path are labeled according to the corresponding vertex label in G. These labels will be used to identify matches.

(7) The source and sink do not require vertex labels and are hence labeled with the empty string $\epsilon$.
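The construction in lines (2)-(5) can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions rather than the paper's C++ implementation: the function name `build_expanded_edges` is hypothetical, expanded vertices are represented as `(vertex, haplotype)` tuples, a recombination target is any haplotype path containing the head vertex, and the edge-splitting preprocessing step is omitted for brevity.

```python
from typing import Dict, List, Set, Tuple


def build_expanded_edges(hap_paths: Dict[str, List[str]],
                         edges: Set[Tuple[str, str]],
                         c: int) -> List[tuple]:
    """Return weighted edges (start, end, weight) of the expanded graph."""
    expanded = []
    for j, path in hap_paths.items():
        expanded.append(("s", (path[0], j), 0))        # line (2)
        expanded.append(((path[-1], j), "t", 0))       # line (3)
        for a, b in zip(path, path[1:]):               # line (4)
            expanded.append(((a, j), (b, j), 0))
    for j, path in hap_paths.items():                  # line (5)
        for i, u in enumerate(path):
            successor = path[i + 1] if i + 1 < len(path) else None
            for x, v in sorted(edges):
                # Keep only useful recombinations: edges leaving u that
                # haplotype j itself does not follow (Lemma 3).
                if x != u or v == successor:
                    continue
                for j2, path2 in hap_paths.items():
                    if v in path2:
                        expanded.append(((u, j), (v, j2), c))
    return expanded


# Toy pangenome graph: a single bubble with one haplotype per branch.
hap_paths = {"h1": ["a", "b", "d"], "h2": ["a", "c", "d"]}
edges = {("a", "b"), ("b", "d"), ("a", "c"), ("c", "d")}
expanded = build_expanded_edges(hap_paths, edges, c=100)
recombination_edges = [e for e in expanded if e[2] == 100]
```

On this toy bubble only two recombination edges are created, both leaving vertex `a`; a switch on the edges entering `d` is never needed because each haplotype already continues to `d` on its own.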

Optimizing the Expanded Graph.

One issue with the above construction is that the number of recombination edges for a given potential recombination can be $O(|\mathcal{H}|^2)$ in the worst case. This occurs because we maintain $|\mathrm{haps}(v)|$ copies of each vertex $v\in V$. For every edge $(u,v)\in E$ allowing a recombination, we add $O(|\mathrm{haps}(u)|\cdot|\mathrm{haps}(v)|)$ edges to the edge set $E_E$. Since both $|\mathrm{haps}(u)|$ and $|\mathrm{haps}(v)|$ can be at most $|\mathcal{H}|$, any potential recombination can result in $O(|\mathcal{H}|^2)$ recombination edges in the worst case. We observe this issue in practice as well. An improvement is to represent a recombination by an intermediate vertex $w_e$ that represents the edge $e\in E$ allowing for the recombination. We then create an edge to $w_e$ from every vertex in a haplotype path from which the recombination would start, and edges from $w_e$ to every vertex in a haplotype path to which the recombination would lead (Figure 2C). More formally, the modified vertex set becomes

$$V_E=\{s\}\cup\{t\}\cup\{u^{j}_{h_j[i]} \mid 1\le j\le|\mathcal{H}|,\ 1\le i\le|h_j|\}\cup\{w_e \mid e\in E\} \tag{8}$$

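We also replace Line (5) in the construction of $E_E$ with Lines (9) and (10) as follows:

$$\{(u^{j}_{h_j[i]},w_e,c/2) \mid 1\le j\le|\mathcal{H}|,\ e=(u_{h_j[i]},u_k)\in E \text{ s.t. } i=|h_j| \text{ or } h_j[i+1]\neq u_k\} \tag{9}$$
$$\cup\ \{(w_e,u^{j}_{h_j[i]},c/2) \mid 1\le j\le|\mathcal{H}|,\ e=(u_k,u_{h_j[i]})\in E \text{ s.t. } i=1 \text{ or } h_j[i-1]\neq u_k\} \tag{10}$$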

We now call the edges created in Lines (9) and (10) the recombination edges. After creating the edges in $E_E$, we delete any $w_e$ vertex that is isolated in $G_E$. Finally, for any remaining $w_e$ vertices, we define $\sigma_E(w_e)=\epsilon$. Observe that the above modification allows for the same set of useful recombinations as our initial expanded graph construction. However, per potential useful recombination, the number of edges is $O(|\mathcal{H}|)$ rather than $O(|\mathcal{H}|^2)$. Before giving the integer programming solutions, we require one additional definition.

Definition 2 (Hits).

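For a string $r\in\mathcal{S}$, assuming $\max_{u\in V_E}|\sigma_E(u)|<|r|$, a path in $G_E$, denoted as an ordered sequence of edges $((u,v),(v,w),(w,x),\ldots,(y,z))$, matches $r$ if $r=\sigma'_E(u)\cdot\sigma_E(v)\cdot\sigma_E(w)\cdot\sigma_E(x)\cdots\sigma_E(y)\cdot\sigma'_E(z)$, where $\sigma'_E(u)$ is a suffix of $\sigma_E(u)$ and $\sigma'_E(z)$ a prefix of $\sigma_E(z)$. We use $\mathrm{hits}(r)$ to represent the set of paths matching string $r$ in $G_E$.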
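As an illustration of Definition 2, the sketch below locates hits of a string along one haplotype path of the expanded graph, given the ordered list of its vertex labels. It is a simplified, hypothetical helper (`hits_on_haplotype` is not a function from the paper's software): it assumes all labels are non-empty, uses plain substring search, and only reports subpaths that lie on a single haplotype path.

```python
from bisect import bisect_right
from itertools import accumulate
from typing import List, Tuple


def hits_on_haplotype(labels: List[str], r: str) -> List[Tuple[int, int]]:
    """Return (first, last) vertex-index ranges whose labels cover an occurrence of r.

    labels[i] is the label of the i-th vertex on the haplotype path; labels
    are assumed to be non-empty so that start offsets are strictly increasing.
    """
    # starts[i] = offset of the first character of labels[i] in the spelled sequence
    starts = [0] + list(accumulate(len(lab) for lab in labels))[:-1]
    seq = "".join(labels)
    found = []
    pos = seq.find(r)
    while pos != -1:
        first = bisect_right(starts, pos) - 1               # vertex holding r's first character
        last = bisect_right(starts, pos + len(r) - 1) - 1   # vertex holding r's last character
        found.append((first, last))
        pos = seq.find(r, pos + 1)
    return found


labels = ["AC", "G", "TT", "CA"]   # spells ACGTTCA
hit_cgt = hits_on_haplotype(labels, "CGT")
hit_tc = hits_on_haplotype(labels, "TC")
```

For `"CGT"` the reported range covers vertices 0 to 2, which corresponds to the edge sequence ((0,1),(1,2)): a suffix of the first label, the whole middle label, and a prefix of the last label.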

4.1. Integer Linear Programming (ILP) Formulation

We assume that the maximum length of any vertex label is upper bounded by the length of any string in $\mathcal{S}$, i.e., $\max_{u\in V_E}|\sigma_E(u)|<\min_{r\in\mathcal{S}}|r|$. This condition can be easily enforced in the input graph by adjusting the lengths of vertex labels, e.g., by splitting a vertex with a long label into two, while ensuring that the graph's topology is preserved. We assume $\min_{r\in\mathcal{S}}|r|>1$.

The basis for our solution is to find an $s$-$t$ flow of value 1 through the expanded graph $G_E$. Our integer programs utilize a binary decision variable $x_{uv}$ for each edge. The variable $x_{uv}$ takes the value 1 if edge $(u,v)\in E_E$ is part of the solution flow and 0 otherwise. Because these are binary variables, the flow will always be a path. From the solution path in $G_E$, it is straightforward to recover the corresponding inferred path $\mathcal{P}$. We use a binary decision variable $z_r$ for each string $r\in\mathcal{S}$ such that $z_r$ takes the value 1 if the solution flow includes a subpath from $\mathrm{hits}(r)$. We also use a variable $z_{r\omega}$ for each $\omega\in\mathrm{hits}(r)$, $r\in\mathcal{S}$.

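Letting $\mathrm{weight}(u,v)$ denote the weight of an edge $(u,v)\in E_E$, our ILP formulation is as follows:

$$\min \sum_{(u,v)\in E_E}\mathrm{weight}(u,v)\cdot x_{uv}+\sum_{r\in\mathcal{S}}(1-z_r), \tag{11}$$

subject to

$$\sum_{v\in\mathcal{N}^{+}(u)}x_{uv}-\sum_{v\in\mathcal{N}^{-}(u)}x_{vu}=\begin{cases}1 & \text{if } u=s,\\ -1 & \text{if } u=t,\\ 0 & \text{otherwise,}\end{cases}\quad \forall u\in V_E, \tag{12}$$
$$\sum_{(u,v)\in\omega}x_{uv}\ge|\omega|\cdot z_{r\omega},\quad z_{r\omega}\in\{0,1\},\quad \forall\omega\in\mathrm{hits}(r),\ \forall r\in\mathcal{S}, \tag{13}$$
$$\sum_{\omega\in\mathrm{hits}(r)}z_{r\omega}=z_r,\quad z_r\in\{0,1\},\quad \forall r\in\mathcal{S}, \tag{14}$$
$$x_{uv}\in\{0,1\},\quad \forall(u,v)\in E_E \tag{15}$$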

In the ILP formulation, Objective (11) models $\mathrm{Cost}(\mathcal{P})$. The summation over $\mathrm{weight}(u,v)\cdot x_{uv}$ imposes penalty $c$ for each recombination. This is due to the two $c/2$-weighted recombination edges that must be traversed when the path switches between haplotype paths in $G_E$ (Figure 2C). In the second summation, the term $(1-z_r)$ adds a penalty of 1 to the objective for every $r\in\mathcal{S}$ where $\bar{\chi}(r,\sigma(\mathcal{P}))=1$. Constraint (12) enforces flow conservation, allowing a unit flow from the source vertex $s$ to the sink vertex $t$, ensuring that the ILP formulation selects a single path in the expanded graph.

To explain the function of Constraint (13), termed the linear string-hit constraint, and Constraint (14), observe that in an optimal solution the variable $z_r$ is set to 1 whenever possible. This is because the term $(1-z_r)$ in the objective function adds a penalty of 0 whenever $z_r=1$. However, this is only possible when $z_{r\omega}$ is equal to 1 for some $\omega\in\mathrm{hits}(r)$. This, in turn, is only possible if $\sum_{(u,v)\in\omega}x_{uv}=|\omega|$, meaning $r$ occurs as a substring in the inferred path. Also note that at most one $z_{r\omega}$ variable can equal 1 in Constraint (14). Other $z_{r\omega'}$ variables, where $\omega,\omega'\in\mathrm{hits}(r)$ and $\omega\neq\omega'$, can have a value of 0 even if $\sum_{(u,v)\in\omega'}x_{uv}=|\omega'|$, justifying the use of equality in Constraint (14).

A weakness of the proposed ILP formulation is that the number of string-hit constraints equals the total number of string matches, that is, $\sum_{r\in\mathcal{S}}|\mathrm{hits}(r)|$. We design another formulation with quadratic constraints in which fewer constraints are needed.
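For very small instances, Objective (11) can be cross-checked without a solver by enumerating every $s$-$t$ path of the expanded graph. The following Python sketch does exactly that; it is a brute-force reference under illustrative assumptions, not the ILP itself and not how the paper solves the problem. The names `enumerate_st_paths` and `brute_force_objective`, the vertex names, and the toy hits are hypothetical.

```python
from typing import Dict, Iterator, List, Tuple

Edge = Tuple[str, str]


def enumerate_st_paths(adj: Dict[str, List[Tuple[str, int]]],
                       u: str = "s",
                       prefix: tuple = ()) -> Iterator[tuple]:
    """Yield every s-t path of the DAG as a tuple of (u, v, weight) edges."""
    if u == "t":
        yield prefix
        return
    for v, w in adj.get(u, []):
        yield from enumerate_st_paths(adj, v, prefix + ((u, v, w),))


def brute_force_objective(weighted_edges: List[Tuple[str, str, int]],
                          hits: Dict[str, List[Tuple[Edge, ...]]]):
    """Return (best objective value, best path) over all s-t paths."""
    adj: Dict[str, List[Tuple[str, int]]] = {}
    for u, v, w in weighted_edges:
        adj.setdefault(u, []).append((v, w))
    best = None
    for path in enumerate_st_paths(adj):
        used = {(u, v) for u, v, _ in path}
        weight = sum(w for _, _, w in path)
        # z_r can be 1 only if all edges of some hit of r are selected.
        misses = sum(1 for omegas in hits.values()
                     if not any(set(omega) <= used for omega in omegas))
        cost = weight + misses
        if best is None or cost < best[0]:
            best = (cost, path)
    return best


# Toy expanded graph: two haplotype copies of a bubble, penalty c = 2.
weighted_edges = [
    ("s", "a1", 0), ("s", "a2", 0),
    ("a1", "b1", 0), ("b1", "d1", 0), ("d1", "t", 0),
    ("a2", "c2", 0), ("c2", "d2", 0), ("d2", "t", 0),
    ("a1", "c2", 2), ("a2", "b1", 2),
]
hits = {
    "r1": [(("a1", "b1"),)],
    "r2": [(("c2", "d2"),)],
    "r3": [(("a1", "b1"), ("b1", "d1"))],
}
best_cost, best_path = brute_force_objective(weighted_edges, hits)
```

The number of $s$-$t$ paths grows exponentially with graph size, so this enumeration is only useful as a sanity check on toy inputs.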

4.2. Integer Quadratic Programming (IQP) Formulation

In our IQP formulation, Objective (11) and Constraints (12), (14) and (15) remain unchanged from the ILP formulation. The constraints in (13) are replaced by quadratic constraints defined as

$$\sum_{\omega\in\mathrm{hits}(r)}\Big(1-|\omega|+\sum_{(u,v)\in\omega}x_{uv}\Big)\cdot z_{r\omega}=z_r,\quad \forall r\in\mathcal{S}. \tag{16}$$

We call Constraint (16) the quadratic string-hit constraint. Again, due to Constraint (14), at most one $z_{r\omega}$ variable can be 1. The expression $1-|\omega|+\sum_{(u,v)\in\omega}x_{uv}$ sums to 1 when the subpath $\omega$ is contained in the flow. In this case $z_r$ will take the value 1 and no penalty is paid in the objective. Conversely, if some of the edges of $\omega$ are not in the flow, the expression will sum to a value $\le 0$. If this is the case for each $\omega\in\mathrm{hits}(r)$, then Constraint (16) can only be satisfied by setting $z_r=0$ and $z_{r\omega}=0$ for each $\omega\in\mathrm{hits}(r)$. Since $z_r=0$, a penalty is paid in the objective. The total number of quadratic string-hit constraints is $|\mathcal{S}|$. In our experiments, we observe that the IQP formulation solves the problem faster, albeit while requiring more memory.
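The behavior of the bracketed expression in Constraint (16) is easy to check numerically for binary edge values. The Python sketch below is only an illustration; the helper names `quadratic_hit_term` and `max_feasible_z_r` and the toy subpath are assumptions.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]


def quadratic_hit_term(omega: List[Edge], x: Dict[Edge, int]) -> int:
    """Evaluate 1 - |omega| + sum of x_uv over the edges of omega."""
    return 1 - len(omega) + sum(x[e] for e in omega)


def max_feasible_z_r(hits_r: List[List[Edge]], x: Dict[Edge, int]) -> int:
    """Largest z_r allowed by Constraints (14) and (16) for binary x.

    z_r = 1 requires exactly one z_{r,omega} = 1 whose bracketed term equals 1.
    """
    return 1 if any(quadratic_hit_term(omega, x) == 1 for omega in hits_r) else 0


omega = [("a", "b"), ("b", "c"), ("c", "d")]
x_all = {e: 1 for e in omega}                                   # whole subpath selected
x_partial = {("a", "b"): 1, ("b", "c"): 0, ("c", "d"): 1}      # one edge missing
x_none = {e: 0 for e in omega}                                  # nothing selected

term_all = quadratic_hit_term(omega, x_all)
term_partial = quadratic_hit_term(omega, x_partial)
term_none = quadratic_hit_term(omega, x_none)
```

With all three edges selected the term equals 1; with one edge missing it drops to 0, and with no edge selected it is negative, so only a fully selected hit can support $z_r=1$.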

As a further improvement, we relax the variables $x_{uv}$ for all $(u,v)\in E_E$ to continuous values $x_{uv}\in[0,1]$ in Constraint (15), following Lemma 4.

Lemma 4.

An optimal solution $\phi_{cont}$ to the IQP (or ILP) with relaxed Constraint (15), where variables $x_{uv}$ lie within the continuous interval $[0,1]$, can be transformed in polynomial time to an optimal solution $\phi$ satisfying $x_{uv}\in\{0,1\}$ for all $(u,v)\in E_E$.

Proof. First, observe that $z_r=1$ if and only if all edges in some $\omega\in\mathrm{hits}(r)$ have their corresponding variables set to 1. This follows from Constraints (13) and (16), and the fact that at most one $z_{r\omega}$ can be 1 for a given $r$, by Constraint (14).

If $z_r=0$ for all $r\in\mathcal{S}$ in $\phi_{cont}$, then $\phi$ can be trivially obtained as a single haplotype path in $G_E$ without recombination penalties. In such a case, all edge variables are assigned either 0 or 1.

For the remaining cases, we introduce the following terms:

  • $\omega\in\mathrm{hits}(r)$ is a used hit-subpath if $z_{r\omega}=1$.

  • A flow between vertices $u$ and $v$ can be decomposed into $u$-$v$ paths, each assigned some positive flow and called flow subpaths.

  • $\omega$ is the first used hit-subpath if there is a flow subpath from vertex $s$ to the first vertex of $\omega$ without passing through another used hit-subpath.

  • $\omega$ is the last used hit-subpath if there is a flow subpath from the last vertex of $\omega$ to vertex $t$ without passing through another used hit-subpath.

  • $\omega$ and $\omega'$ are consecutive used hit-subpaths if there is a flow subpath between them without passing through a third used hit-subpath, where $\omega\neq\omega'$ and $\omega'\in\mathrm{hits}(r')$ for some $r'\in\mathcal{S}$.

Now, if $z_r=1$ in $\phi_{cont}$ for some $r\in\mathcal{S}$, there exists a used hit-subpath. We obtain $\phi$ as follows. The flow used to reach the first used hit-subpath avoids recombination penalties by following a single haplotype path. Similarly, the flow from the end vertex of the last used hit-subpath to $t$ avoids recombination penalties by staying on a single haplotype path. Next, consider two consecutive used hit-subpaths $\omega$ and $\omega'$, with $u$ and $v$ as their respective end and start vertices. If $u$ and $v$ are on different haplotype paths, any flow subpaths between $u$ and $v$ must minimize the recombination penalty. The same minimum recombination cost can be achieved by replacing the potentially multiple fractional flow subpaths with a single path that incurs the same recombination penalty. We can select any flow subpath from $u$ to $v$ and assign its edge variables to 1. Edge variables on edges used by the flow from $u$ to $v$ and not on this selected path are set to 0. □

5. Results

Implementation Details.

We implemented our ILP and IQP solutions in C++ using the Gurobi solver (v11.0.2). We refer to our software as PHI (Pangenome-based Haplotype Inference). The user can provide a pangenome reference either as a graph (GFA format) or as a list of phased variants (VCF format). Given short-read or long-read sequencing data of either a haploid or a homozygous genome, PHI outputs the haplotype sequence associated with the optimal inferred path from the graph in FASTA format.

Given a set of reads, we compute $(w,k)$ window minimizers [30] for identifying our hits (Definition 2). By default, $w=25$ and $k=31$. These minimizers correspond to the set $\mathcal{S}$ in Problem 1. Computing minimizer matches between two strings is faster than computing minimizer matches on a pangenome graph. For this reason, we find minimizer matches between reads and the sequences spelled by all the haplotype paths in the graph. This means $\mathrm{hits}(r)$ includes only those subpaths that are completely contained in some haplotype path in $G_E$ (Definition 2). This restriction to $\mathrm{hits}(r)$ also prevents us from needing to perform the additional edge splitting step described in Section 4.1. We used a recombination penalty of $c=100$; this value was chosen empirically. We ran all our experiments on AMD EPYC 7763 processors with 512 GB RAM. We used 32 threads in all experiments.
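For readers unfamiliar with window minimizers, a minimal Python sketch follows. It assumes lexicographic ordering of k-mers with leftmost tie-breaking, which is a simplification made for this example; practical implementations often use a hash-based ordering and canonical k-mers, and the ordering used in PHI is not specified here. The function name `window_minimizers` is hypothetical.

```python
from typing import List, Tuple


def window_minimizers(seq: str, w: int, k: int) -> List[Tuple[int, str]]:
    """Return the sorted (position, k-mer) minimizers of seq.

    In every window of w consecutive k-mers, the smallest k-mer is selected
    (lexicographic order, leftmost on ties; an assumption of this sketch).
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        window = range(start, start + w)
        best = min(window, key=lambda i: (kmers[i], i))
        picked.add((best, kmers[best]))
    return sorted(picked)


# Small parameters for readability; the defaults reported above are w=25, k=31.
minimizers = window_minimizers("ACGTACGT", w=3, k=3)
```

Adjacent windows frequently select the same k-mer, which is why the minimizer set is much smaller than the set of all k-mers while still sampling every stretch of $w$ consecutive k-mers.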

Datasets.

We evaluated our algorithm by estimating MHC sequences of five haplotypes (APD, DBB, MANN, QBL, SSTO) from homozygous human cell lines. Recently, Houwaart et al. [16,32] published complete assemblies of these MHC sequences using long and short-read sequencing. The average length of these assemblies is 4.99 Mbp. We downloaded the five short-read sequencing datasets available from this study. To evaluate our algorithm using varying sequencing coverage, we down-sampled each short-read dataset to obtain coverage of 0.1×, 0.5×, 1×, 2×, 5×, and 10×. We also used the full datasets for evaluation (coverage 12.9 − 18.2×). We used the complete assemblies of five MHC haplotypes as ground-truth to evaluate the accuracy of our estimated sequences. To quantify the accuracy, we measured edit distance between each estimated sequence and the corresponding ground-truth sequence.
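The accuracy metric can be illustrated with the textbook dynamic program for edit distance (unit-cost substitutions, insertions and deletions). The sketch below is meant only to pin down the definition; its quadratic running time makes it impractical for sequences of about 5 Mbp, for which a specialized alignment library would be needed. The tool actually used for the evaluation is not stated in this section.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b using two rows of the DP table."""
    if len(a) < len(b):
        a, b = b, a                      # keep the shorter string along the row
    prev = list(range(len(b) + 1))       # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # match or substitution
        prev = curr
    return prev[-1]


distance = edit_distance("ACGTTCA", "ACGATCA")
```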

We built a haplotype-resolved pangenome graph of 49 complete MHC sequences [19] using Minigraph-Cactus [15]. These sequences were extracted from phased assemblies of 24 diploid human samples [22] and the CHM13 reference [27]. Using Minigraph-Cactus, we obtained the pangenome reference in a VCF format file. We subjected this file to further simplification steps1 to ensure compatibility with various tools. We show sequence similarity statistics between the complete MHC assemblies of five haplotypes (APD, DBB, MANN, QBL, SSTO) and the 49 pangenome reference haplotypes in Appendix Table 1.

Other Methods.

We compared PHI with two existing pangenome-based genotyping tools: (i) VG (v1.60) [35] and (ii) PanGenie (v3.1) [9]. VG supports sampling of relevant haplotypes from a pangenome graph by comparing k-mer counts in the reads and k-mers of a reference haplotype. The selection of haplotypes is done locally in fixed-length non-overlapping blocks. Recombinations may be introduced to create contiguous haplotypes across the blocks. The number of samples can be specified by the user. Accordingly, VG's haplotype sampling feature can be adapted for haplotype sequence estimation by simply setting the number of desired samples to one. PanGenie supports short-read genotyping using a haplotype-resolved pangenome graph. PanGenie uses a hidden Markov model, which is similar to the standard Li and Stephens model [21]. PanGenie compares k-mer counts in the reads with the k-mers present in the graph to compute genotype likelihoods. PanGenie exhibited better genotyping accuracy and speed than other genotyping tools [9]. Because our sequencing datasets are derived from homozygous cell lines, we ignored the heterozygous genotype calls made by PanGenie (Appendix Table 3). We incorporated PanGenie's predicted genotypes in the reference sequence to obtain the haplotype sequence. We list our commands to run PHI, VG and PanGenie in Appendix Table 2.

Genotyping performance.

We evaluated the ability of PHI, VG and PanGenie to infer the MHC sequences from short-read datasets of varying coverage (see Figure 3). Using low-coverage datasets (0.1−2×), PHI exhibits significantly higher accuracy. VG and PanGenie may not be suitable for low-coverage sequencing. For example, the distribution of k-mer counts at low coverage can be unreliable. Distinguishing k-mers originating from unique versus repetitive regions, as required by PanGenie and VG, is also challenging at low coverage. Using coverage of 5× or more, the results of VG and PHI are comparable. PanGenie also produces comparable results using the full datasets. We note that the integer programming (IQP) approach used in PHI requires more time and memory compared to the methods used in VG and PanGenie. PHI used up to 1.5 hours and 137 GB RAM in a single experiment. In contrast, VG and PanGenie required < 5 minutes and < 50 GB memory. It may be possible to optimize PHI by incorporating efficient heuristics. We show detailed performance statistics for PHI, including its runtime and memory usage, in Appendix Table 4.

Fig. 3:

Accuracy of haplotype sequences estimated by PHI, VG and PanGenie using short reads from MHC sequences of five haplotypes (APD, DBB, MANN, QBL, SSTO). The x-axes indicate the coverage of short-read data. The y-axes indicate the edit distance between the estimated haplotype sequence and the ground-truth sequence on a logarithmic scale.

Effect of our optimizations.

In PHI, we implemented both ILP-based and IQP-based solutions to solve the optimization problem. Using either solution, Gurobi solves Problem 1 to optimality. We benchmarked our ILP and IQP solutions to compare their runtime and memory usage (see Figure 4). On low-coverage datasets (0.1−1×), the runtimes are comparable. At higher coverage, the IQP solution runs faster, likely because it uses fewer string-hit constraints (Section 4.2). However, it requires approximately 1.5 times more memory, possibly because Gurobi requires additional storage to handle quadratic constraints. Accordingly, the user can choose between ILP and IQP with a command line argument, based on the available memory. If no choice is provided, the IQP solution is used by default. We also evaluated the advantage of relaxing edge variables to continuous values (Lemma 4) by comparing it to another version of our code where we set the edge variables to be discrete. Relaxation of variables decreases the runtime of the IQP solution by a factor of 1.6 on average (Appendix Figure 2). Little effect on the runtime is observed for the ILP solution (Appendix Figure 3).

Fig. 4:

Performance comparison between the ILP and IQP solutions implemented in PHI. We compared their runtime and memory-usage using short-read sequencing datasets sampled from five haplotypes.

Impact of graph expansion with the addition of more genomes.

We evaluated the impact of pangenome graph expansion on PHI's genotyping accuracy as well as runtime. To do this, we created five versions of our pangenome graph, each containing an increasing number of reference haplotypes, added progressively. The first graph comprises a single diploid sample (chosen randomly from the 24 diploid samples) plus the CHM13 reference; therefore, it has three reference haplotypes in total. The second graph includes two more diploid samples (chosen randomly from the remaining 23); therefore, it has seven reference haplotypes in total. Similarly, the third, fourth and fifth graphs contain 13, 25 and 49 reference haplotypes, respectively. The fifth graph is equivalent to the graph used in the previous experiments.

We repeated our experiments with the full short-read datasets using these five graphs and present the results in Figure 5. We observe that edit distances between the estimated sequences and the ground-truth sequences decrease as the number of reference haplotypes increases. This is expected because more haplotypes are available to choose from when we compute our inferred path in the graph. We also observe an increase in runtime and memory usage. Runtime appears to increase superlinearly and memory appears to increase linearly with the number of reference haplotypes. This is because the size of the expanded graph and the number of minimizer matches both increase, leading to more variables and constraints in our integer program.

Fig. 5:

Assessment of PHI’s performance with an increasing number of genomes in the pangenome graph. The left panel shows the accuracy in terms of edit distance between the output sequences and ground-truth sequences. The middle and right panels show the runtime and memory usage, respectively.

6. Discussion

Genotyping using pangenome graphs is equivalent to finding a walk in the graph that contains the sample’s variants [28]. If the sample is diploid, this becomes equivalent to finding a pair of paths. Drawing inspiration from this idea, we proposed a rigorous framework to infer a path through the graph, such that the sequence spelled by the path is consistent with the sequencing data in terms of the shared k-mers between them, while permitting a limited number of recombinations in the path, each incurring a fixed penalty. This optimization problem requires considering all possible paths in the graph. We proved that this problem is NP-Hard and subsequently gave efficient integer programming solutions. As part of our methodology, we introduced the expanded graph data structure on which we could compute an appropriate st-flow of 1. Experimental results demonstrate the advantage of the proposed ILP/IQP approaches for accurate genome inference, especially with low-coverage data (coverage 0.1 − 1×). Thus, our algorithm can facilitate affordable genotyping and association studies of complex and repeat-rich regions of the genome.
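To make the objective concrete, the following toy Python sketch scores each candidate path by the number of distinct read k-mers it contains minus a fixed penalty per haplotype switch, and selects the best path by brute force. It is not the ILP/IQP formulation implemented in PHI: the block-wise haplotype representation, the function names, and the exhaustive enumeration are simplifying assumptions for illustration, and the search is only feasible on tiny inputs.

```python
from itertools import product


def kmers(seq, k):
    """Return the set of distinct k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def best_path(haplotypes, read_kmers, k, penalty):
    """Enumerate every per-block haplotype choice and return the best one.

    haplotypes: list of haplotypes, each a list of block strings
                (hypothetical stand-in for haplotype paths in a graph).
    Returns (score, choice, sequence), where choice[b] is the index of the
    haplotype followed in block b.
    """
    num_blocks = len(haplotypes[0])
    best = None
    for choice in product(range(len(haplotypes)), repeat=num_blocks):
        seq = "".join(haplotypes[h][b] for b, h in enumerate(choice))
        switches = sum(1 for a, b in zip(choice, choice[1:]) if a != b)
        matched = len(read_kmers & kmers(seq, k))
        score = matched - penalty * switches
        if best is None or score > best[0]:
            best = (score, choice, seq)
    return best


# Two toy reference haplotypes split into three blocks each.
panel = [["ACGT", "TTTT", "GGCA"],
         ["ACCT", "TATA", "GGGA"]]
# The sample is a recombinant: block 0 from haplotype 0, then haplotype 1.
sample = "ACGT" + "TATA" + "GGGA"
observed = kmers(sample, 3)  # stands in for k-mers seen in the reads

cheap_switch = best_path(panel, observed, k=3, penalty=1)
costly_switch = best_path(panel, observed, k=3, penalty=5)
print(cheap_switch)
print(costly_switch)
```

With a penalty of 1 the recombinant path (haplotype 0, then haplotype 1) wins, whereas a penalty of 5 makes the switch too expensive and the search falls back to haplotype 1 alone. This illustrates the trade-off between matching k-mers and limiting recombinations.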

Although our approach is currently tailored to haploid samples, it could generalize to diploid samples. This may be accomplished by finding an st-flow of 2 through the expanded graph and modifying some constraints. How well this approach genotypes and phases the genome would be interesting to explore. Another limitation of this work is that we do not capture uncertainty. For example, there may be multiple inferred paths with minimum cost. Lastly, pangenome graphs are expected to grow in the number of genomes; therefore, scaling the current approach to a large number of haplotype paths may be important. We leave these extensions to future work.
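As a rough illustration of why an st-flow of 2 corresponds to a pair of paths, the Python sketch below peels two source-to-sink paths off an integral flow on a small acyclic graph. It assumes the input is already a valid integral flow (flow conservation holds at every internal vertex). The function name and the toy graph are hypothetical, and the sketch says nothing about the additional constraints a diploid formulation would need.

```python
def decompose_flow(flow, source, sink):
    """Split an integral st-flow on a DAG into unit paths.

    flow: dict mapping (u, v) -> non-negative integer flow on edge u -> v.
    Returns a list of paths, one per unit of flow leaving the source.
    """
    residual = dict(flow)
    paths = []
    while any(f > 0 for (u, _), f in residual.items() if u == source):
        path = [source]
        node = source
        while node != sink:
            # Follow any outgoing edge that still carries flow.
            nxt = next((v for (u, v), f in residual.items()
                        if u == node and f > 0), None)
            if nxt is None:
                raise ValueError("input is not a valid st-flow")
            residual[(node, nxt)] -= 1
            path.append(nxt)
            node = nxt
        paths.append(path)
    return paths


# Toy flow of value 2: the two units diverge at s and rejoin at c.
toy_flow = {("s", "a"): 1, ("s", "b"): 1,
            ("a", "c"): 1, ("b", "c"): 1,
            ("c", "t"): 2}
pair_of_paths = decompose_flow(toy_flow, "s", "t")
print(pair_of_paths)
```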

Acknowledgements

This research is funded in part by the DBT/Wellcome Trust India Alliance Fellowship (grant number IA/I/23/2/506979), the Intel India Research Fellowship, the National Institutes of Health of the USA (NIH-NIAID U01 AI090905), and the Jürgen Manchot Foundation. We utilized computing resources available at the Indian Institute of Science and the U.S. National Energy Research Scientific Computing Center.

Appendix

Fig. 1:

A small example of our reduction from the Hamiltonian Path Problem to Problem 1 (Theorem 1). (Top) The starting instance G of the Hamiltonian Path Problem. (Bottom) The vertex-labeled graph G constructed from G. Here, $n=4$ and we assume $c=2$, making $b=\lceil \log_2(n+2(c(n+1)+1)) \rceil+1=6$. Each edge is supported by a unique haplotype (not shown). The string set is $\mathcal{S}=\{0000010000001, 0000100000001, \ldots, 0110100000001\}$.

Table 1:

Additional information about the MHC sequences of five haplotypes (APD, DBB, MANN, QBL, SSTO). We show the length of the complete assembly in the second column. The third and fourth columns show edit distance statistics between the assembly and the 49 reference haplotypes included in the pangenome reference. In the last two columns, we list the SRA accession numbers and coverage of the short-read sequencing datasets.

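| Haplotype | Assembly length (Mbp) | Edit distance with pangenome reference haplotypes (mean) | Edit distance with pangenome reference haplotypes (minimum) | Short-read data: SRA accession | Short-read data: coverage |
|---|---|---|---|---|---|
| APD | 4.93 | 146,423 | 37,102 | SRR17272303 | 16.26× |
| DBB | 5.05 | 174,619 | 10,380 | SRR17272302 | 12.91× |
| MANN | 5.03 | 189,464 | 58,168 | SRR17272301 | 18.20× |
| QBL | 4.90 | 159,968 | 72,293 | SRR17272300 | 12.85× |
| SSTO | 5.05 | 161,044 | 35,583 | SRR17272299 | 15.04× |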

Table 2:

Commands used for running various tools

| Category | Tool / operation | Command |
|---|---|---|
| Haplotype/Genotype Imputation | PHI | 1) vcf2gfa.py -v multi-allelic_phased.vcf -r reference.fa > graph.gfa |
| | | 2) PHI -t32 -g graph.gfa -r reads.fq -o imputed_hap.fa |
| | PanGenie | PanGenie -t32 -i reads.fq -r reference.fa -v multi-allelic_phased.vcf -o out_vcf_PG |
| | VG | 1) kmc -t32 -k29 -m128 -okff -hp reads.fq sample tmp_dir |
| | | 2) vg haplotypes -t32 -v2 --num-haplotypes 1 -i input.hapl -k sample.kff -g sample_graph.gbz input_graph.gbz |
| | | 3) vg paths -x sample_graph.gbz -F -S recombination > imputed_hap.fa |
| VCF Operations | Transform VCF to have non-overlapping variants | vcfbub -l 0 -r 100000 -i input.vcf > output.vcf |
| | Filter heterozygous variants | bcftools view -i 'GT="hom"' input.vcf.gz > output.vcf |
| | Generate haplotype from reference genome and VCF file | bcftools consensus -f reference.fa -o imputed_hap.fa input.vcf.gz |
| Evaluation | Edit distance | edlib-aligner ground-truth_hap.fa imputed_hap.fa |

Table 3:

Count of homozygous and heterozygous genotype calls made by PanGenie. In our benchmark, we excluded the heterozygous calls because the sequencing datasets were derived from homozygous cell lines.

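| Coverage | APD Hom | APD Het | DBB Hom | DBB Het | MANN Hom | MANN Het | QBL Hom | QBL Het | SSTO Hom | SSTO Het |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.1× | 52,816 | 6,245 | 51,435 | 7,626 | 52,452 | 6,609 | 53,707 | 5,354 | 53,893 | 5,168 |
| 0.5× | 56,249 | 2,812 | 55,845 | 3,216 | 56,258 | 2,803 | 56,447 | 2,614 | 56,064 | 2,997 |
| | 57,448 | 1,613 | 57,010 | 2,051 | 57,064 | 1,997 | 57,224 | 1,837 | 57,099 | 1,962 |
| | 58,201 | 860 | 57,948 | 1,113 | 58,334 | 727 | 58,101 | 960 | 58,397 | 664 |
| | 58,552 | 509 | 58,382 | 679 | 58,601 | 460 | 58,340 | 721 | 58,228 | 833 |
| 10× | 58,533 | 528 | 58,478 | 583 | 58,188 | 873 | 58,343 | 718 | 58,337 | 724 |
| Complete data | 58,647 | 414 | 58,457 | 604 | 58,592 | 469 | 58,457 | 604 | 58,521 | 540 |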
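As a worked example of the filtering step, the Python sketch below takes the APD counts from the 0.1× and complete-data rows of Table 3 and computes the share of PanGenie calls that are heterozygous and therefore excluded from the benchmark. The variable and function names are illustrative.

```python
# (homozygous, heterozygous) call counts for APD, copied from Table 3.
apd_calls = {"0.1x": (52816, 6245), "complete": (58647, 414)}


def het_percentage(hom, het):
    """Percentage of genotype calls that are heterozygous."""
    return 100.0 * het / (hom + het)


for label, (hom, het) in apd_calls.items():
    share = het_percentage(hom, het)
    print(f"{label}: {hom + het} calls, {share:.2f}% heterozygous")
```

The total number of calls is the same in both rows, while the heterozygous share drops from roughly 10.6% at 0.1× to roughly 0.7% on the complete data.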

Table 4:

We report additional performance statistics for PHI on all our datasets. We specify the number of recombinations used in the solution in the second column. Next, we report the runtime and memory usage of PHI. In the fifth and sixth columns, we specify the edit distance and alignment identity between the output MHC sequence and the ground-truth sequence. Alignment identity is defined as the ratio of the number of character matches to the length of the alignment. In the last three columns, we give statistics about the minimizers computed from the sequencing reads. We give the count of distinct minimizers observed in the read set. A fraction of the minimizers would be absent from the graph, and some fraction would be present in all reference haplotypes, making them ‘uninformative’. Only the matches of the remaining fraction of minimizers are useful while solving the optimization problem.

| Haplotype | Coverage | Recombinations | Time (s) | Memory (GB) | Edit distance | Alignment identity (%) | Minimizers (reads) | Minimizers % absent | Minimizers % uninformative |
|---|---|---|---|---|---|---|---|---|---|
| APD | 0.1× | 3 | 1840 | 72 | 7551 | 99.85 | 33248 | 36.33 | 43.12 |
| APD | 0.5× | 7 | 1294 | 84 | 2272 | 99.95 | 156209 | 37.90 | 41.42 |
| APD | | 7 | 2338 | 93 | 2220 | 99.95 | 289795 | 41.46 | 39.19 |
| APD | | 9 | 2702 | 108 | 1948 | 99.96 | 508720 | 46.47 | 35.84 |
| APD | | 10 | 4671 | 125 | 1779 | 99.96 | 984355 | 59.39 | 27.05 |
| APD | 10× | 10 | 3683 | 134 | 1810 | 99.96 | 1599325 | 72.22 | 18.33 |
| APD | 16.26× | 10 | 4536 | 134 | 1810 | 99.96 | 2288126 | 80.17 | 13.00 |
| DBB | 0.1× | 2 | 1604 | 70 | 2191 | 99.96 | 33901 | 37.28 | 41.78 |
| DBB | 0.5× | 4 | 1467 | 83 | 1415 | 99.97 | 157510 | 39.66 | 39.60 |
| DBB | | 4 | 2022 | 92 | 1496 | 99.97 | 293996 | 42.54 | 37.84 |
| DBB | | 4 | 2502 | 108 | 1472 | 99.97 | 518085 | 47.59 | 34.28 |
| DBB | | 4 | 4175 | 126 | 1385 | 99.97 | 1015730 | 60.37 | 25.75 |
| DBB | 10× | 4 | 4525 | 132 | 1377 | 99.97 | 1660305 | 72.79 | 17.55 |
| DBB | 12.91× | 4 | 4743 | 135 | 1377 | 99.97 | 2028107 | 77.31 | 14.58 |
| MANN | 0.1× | 3 | 1680 | 67 | 41028 | 99.19 | 33614 | 34.31 | 43.07 |
| MANN | 0.5× | 7 | 1658 | 85 | 38379 | 99.24 | 153933 | 36.66 | 41.50 |
| MANN | | 8 | 2183 | 94 | 37898 | 99.25 | 288713 | 39.33 | 39.76 |
| MANN | | 9 | 3054 | 109 | 37728 | 99.25 | 502336 | 44.89 | 36.22 |
| MANN | | 12 | 3774 | 126 | 36263 | 99.28 | 964364 | 57.71 | 27.55 |
| MANN | 10× | 14 | 5426 | 132 | 35941 | 99.29 | 1553694 | 70.85 | 18.86 |
| MANN | 18.20× | 14 | 4843 | 134 | 35940 | 99.29 | 2450244 | 81.06 | 12.15 |
| QBL | 0.1× | 3 | 2222 | 88 | 15062 | 99.69 | 32464 | 35.13 | 43.05 |
| QBL | 0.5× | 9 | 1236 | 81 | 7829 | 99.84 | 153818 | 37.47 | 41.77 |
| QBL | | 10 | 2388 | 92 | 4610 | 99.91 | 284587 | 39.92 | 40.35 |
| QBL | | 14 | 2981 | 109 | 3561 | 99.93 | 502087 | 46.98 | 36.14 |
| QBL | | 17 | 3986 | 123 | 3349 | 99.93 | 966151 | 58.80 | 27.40 |
| QBL | 10× | 17 | 4049 | 129 | 3356 | 99.93 | 1566636 | 71.76 | 18.63 |
| QBL | 12.85× | 17 | 4113 | 131 | 3343 | 99.93 | 1862566 | 75.90 | 15.84 |
| SSTO | 0.1× | 2 | 2013 | 72 | 17626 | 99.65 | 33792 | 36.06 | 41.98 |
| SSTO | 0.5× | 12 | 1812 | 84 | 10471 | 99.79 | 156473 | 37.60 | 41.12 |
| SSTO | | 20 | 2536 | 93 | 5150 | 99.90 | 291484 | 41.05 | 38.59 |
| SSTO | | 24 | 2977 | 108 | 4671 | 99.91 | 513683 | 46.50 | 35.02 |
| SSTO | | 24 | 5023 | 124 | 4611 | 99.91 | 992511 | 59.01 | 26.68 |
| SSTO | 10× | 24 | 5021 | 132 | 4634 | 99.91 | 1609715 | 71.88 | 18.16 |
| SSTO | 15.04× | 24 | 4499 | 137 | 4637 | 99.91 | 2206289 | 79.07 | 13.44 |
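The alignment identity reported in Table 4 is the number of matching characters divided by the alignment length. A minimal Python sketch of that definition follows. It assumes the alignment is given as two equal-length gapped strings with '-' as the gap symbol. This representation and the function name are illustrative and not the output format of any particular aligner.

```python
def alignment_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of alignment columns in which both rows hold the same base."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        raise ValueError("alignment must not be empty")
    matches = sum(1 for x, y in zip(aligned_a, aligned_b)
                  if x == y and x != gap)
    return 100.0 * matches / len(aligned_a)


# Toy alignment with one mismatch and one gap: 4 matches over 6 columns.
identity = alignment_identity("ACGT-A", "ACCTTA")
print(f"{identity:.2f}%")
```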

Fig. 2:

Evaluation of the performance of the IQP method with and without relaxation of the binary edge variables xuv. We compared runtime using various short-read datasets.

Fig. 3:

Evaluation of the performance of the ILP method with and without relaxation of the binary edge variables xuv. We compared runtime using various short-read datasets.

References

  • 1. Baaijens J.A., Bonizzoni P., Boucher C., Della Vedova G., Pirola Y., Rizzi R., Sirén J.: Computational graph pangenomics: a tutorial on data structures and their applications. Natural Computing pp. 1–28 (2022)
  • 2. Bradbury P.J., Casstevens T., Jensen S.E., Johnson L., Miller Z., Monier B., Romay M., Song B., Buckler E.S.: The practical haplotype graph, a platform for storing and using pangenomes for imputation. Bioinformatics 38(15), 3698–3702 (2022)
  • 3. Chandra G., Gibney D., Jain C.: Haplotype-aware sequence alignment to pangenome graphs. Genome Research 34(9), 1265–1275 (2024)
  • 4. Computational Pan-Genomics Consortium: Computational pan-genomics: status, promises and challenges. Briefings in Bioinformatics 19(1), 118–135 (2018)
  • 5. Davies R.W., Kucka M., Su D., et al.: Rapid genotype imputation from sequence with reference panels. Nature Genetics 53(7), 1104–1111 (Jun 2021). doi: 10.1038/s41588-021-00877-0
  • 6. Dilthey A., Cox C., Iqbal Z., Nelson M.R., McVean G.: Improved genome inference in the MHC using a population reference graph. Nature Genetics 47(6), 682–688 (2015)
  • 7. Dilthey A.T.: State-of-the-art genome inference in the human MHC. The International Journal of Biochemistry & Cell Biology 131, 105882 (2021)
  • 8. Ebert P., Audano P.A., et al.: Haplotype-resolved diverse human genomes and integrated analysis of structural variation. Science 372(6537) (Apr 2021). doi: 10.1126/science.abf7117
  • 9. Ebler J., Ebert P., Clarke W.E., Rausch T., Audano P.A., Houwaart T., Mao Y., Korbel J.O., Eichler E.E., Zody M.C., et al.: Pangenome-based genome inference allows efficient and accurate genotyping across a wide spectrum of variant classes. Nature Genetics 54(4), 518–525 (2022)
  • 10. Eggertsson H.P., Jonsson H., Kristmundsdottir S., et al.: Graphtyper enables population-scale genotyping using pangenome graphs. Nature Genetics 49(11), 1654–1660 (2017)
  • 11. Gao Y., Yang X., Chen H., Tan X., Yang Z., Deng L., Wang B., Kong S., Li S., Cui Y., et al.: A pangenome reference of 36 Chinese populations. Nature 619(7968), 112–121 (2023)
  • 12. Grytten I., Dagestad Rand K., Sandve G.K.: Kage: Fast alignment-free graph-based genotyping of SNPs and short indels. Genome Biology 23(1), 209 (2022)
  • 13. Harris L., McDonagh E.M., Zhang X., Fawcett K., Foreman A., Daneck P., Sergouniotis P.I., Parkinson H., Mazzarotto F., Inouye M., et al.: Genome-wide association testing beyond SNPs. Nature Reviews Genetics pp. 1–15 (2024)
  • 14. Hickey G., Heller D., Monlong J., Sibbesen J.A., Sirén J., Eizenga J., Dawson E.T., Garrison E., Novak A.M., Paten B.: Genotyping structural variants in pangenome graphs using the vg toolkit. Genome Biology 21, 1–17 (2020)
  • 15. Hickey G., Monlong J., Ebler J., Novak A.M., Eizenga J.M., Gao Y., Marschall T., Li H., Paten B.: Pangenome graph construction from genome alignments with minigraph-cactus. Nature Biotechnology pp. 1–11 (2023)
  • 16. Houwaart T., Scholz S., Pollock N.R., et al.: Complete sequences of six major histocompatibility complex haplotypes, including all the major MHC class II structures. HLA 102(1), 28–43 (Mar 2023). doi: 10.1111/tan.15020
  • 17. Jain C., Tavakoli N., Aluru S.: A variant selection framework for genome graphs. Bioinformatics 37(Supplement_1), i460–i467 (2021)
  • 18. Letcher B., Hunt M., Iqbal Z.: Gramtools enables multiscale variation analysis with genome graphs. Genome Biology 22, 1–27 (2021)
  • 19. Li H.: Sample graphs and sequences for testing sequence-to-graph alignment (2022). doi: 10.5281/zenodo.6617246
  • 20. Li J.H., Mazur C.A., Berisa T., Pickrell J.K.: Low-pass sequencing increases the power of GWAS and decreases measurement error of polygenic risk scores compared to genotyping arrays. Genome Research 31(4), 529–537 (2021)
  • 21. Li N., Stephens M.: Modeling linkage disequilibrium and identifying recombination hotspots using single-nucleotide polymorphism data. Genetics 165(4), 2213–2233 (2003)
  • 22. Liao W.W., Asri M., Ebler J., Doerr D., Haukness M., Hickey G., Lu S., Lucas J.K., Monlong J., Abel H.J., et al.: A draft human pangenome reference. Nature 617(7960), 312–324 (2023)
  • 23. Mahmoud M., Gobet N., Cruz-Dávalos D.I., Mounier N., Dessimoz C., Sedlazeck F.J.: Structural variant calling: the long and the short of it. Genome Biology 20, 1–14 (2019)
  • 24. Martin A.R., Atkinson E.G., Chapman S.B., Stevenson A., Stroud R.E., Abebe T., Akena D., Alemayehu M., Ashaba F.K., Atwoli L., et al.: Low-coverage sequencing cost-effectively detects known and novel variation in underrepresented populations. The American Journal of Human Genetics 108(4), 656–668 (2021)
  • 25. Mun T., Vaddadi N.S.K., Langmead B.: Pangenomic genotyping with the marker array. Algorithms for Molecular Biology 18(1), 2 (2023)
  • 26. Mustafa H., Karasikov M., Mansouri Ghiasi N., Rätsch G., Kahles A.: Label-guided seed-chain-extend alignment on annotated de Bruijn graphs. Bioinformatics 40(Supplement_1), i337–i346 (2024)
  • 27. Nurk S., Koren S., Rhie A., Rautiainen M., et al.: The complete sequence of a human genome. Science 376(6588), 44–53 (Apr 2022). doi: 10.1126/science.abj6987
  • 28. Paten B., Novak A.M., Eizenga J.M., Garrison E.: Genome graphs and the evolution of genome inference. Genome Research 27(5), 665–676 (2017)
  • 29. Pritt J., Chen N.C., Langmead B.: Forge: prioritizing variants for graph genomes. Genome Biology 19(1), 1–16 (2018)
  • 30. Roberts M., Hayes W., Hunt B.R., Mount S.M., Yorke J.A.: Reducing storage requirements for biological sequence comparison. Bioinformatics 20(18), 3363–3369 (Jul 2004). doi: 10.1093/bioinformatics/bth408
  • 31. Rubinacci S., Ribeiro D.M., et al.: Efficient phasing and imputation of low-coverage sequencing data using large reference panels. Nature Genetics 53(1), 120–126 (Jan 2021). doi: 10.1038/s41588-020-00756-0
  • 32. Scholz S.: Complete sequences of six major histocompatibility complex haplotypes rev2 (2024). doi: 10.5281/ZENODO.13889311
  • 33. Sibbesen J.A., Maretty L., Consortium D.P.G., Krogh A.: Accurate genotyping across variant classes and lengths using variant graphs. Nature Genetics 50(7), 1054–1059 (2018)
  • 34. Sirén J., Monlong J., Chang X., Novak A.M., Eizenga J.M., Markello C., Sibbesen J.A., Hickey G., Chang P.C., Carroll A., et al.: Pangenomics enables genotyping of known structural variants in 5202 diverse genomes. Science 374(6574), abg8871 (2021)
  • 35. Sirén J., Eskandar P., Ungaro M.T., et al.: Personalized pangenome references. Nature Methods (Sep 2024). doi: 10.1038/s41592-024-02407-2
  • 36. Smith T.P., Bickhart D.M., Boichard D., Chamberlain A.J., Djikeng A., Jiang Y., Low W.Y., Pausch H., Demyda-Peyrás S., Prendergast J., et al.: The bovine pangenome consortium: democratizing production and accessibility of genome assemblies for global cattle breeds and other bovine species. Genome Biology 24(1), 139 (2023)
  • 37. Tavakoli N., Gibney D., Aluru S.: Haplotype-aware variant selection for genome graphs. In: Proceedings of the 13th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics. pp. 1–9 (2022)
  • 38. Vaddadi K., Mun T., Langmead B.: Minimizing reference bias with an impute-first approach (Dec 2023). doi: 10.1101/2023.11.30.568362

Articles from bioRxiv are provided here courtesy of Cold Spring Harbor Laboratory Preprints
