Exploring gene content with pangene graphs

Heng Li; Maximillian Marin; Maha R Farhat

doi:10.1093/bioinformatics/btae456

. 2024 Jul 23;40(7):btae456. doi: 10.1093/bioinformatics/btae456

Exploring gene content with pangene graphs

Heng Li ^1,^2,^3,^✉, Maximillian Marin ⁴, Maha R Farhat ^5,⁶

Editor: Peter Robinson

PMCID: PMC11286280 PMID: 39041615

Abstract

Motivation

The gene content regulates the biology of an organism. It varies between species and between individuals of the same species. Although tools have been developed to identify gene content changes in bacterial genomes, none is applicable to collections of large eukaryotic genomes such as the human pangenome.

Results

We developed pangene, a computational tool to identify gene orientation, gene order, and gene copy-number changes in a collection of genomes. Pangene aligns a set of input protein sequences to the genomes, resolves redundancies between protein sequences and constructs a gene graph with each genome represented as a walk in the graph. It additionally finds subgraphs, which we call bibubbles, that capture gene content changes. Applied to the human pangenome, pangene identifies known gene-level variations and reveals complex haplotypes that are not well studied before. Pangene also works with high-quality bacterial pangenome and reports similar numbers of core and accessory genes in comparison to existing tools.

Availability and implementation

Source code at https://github.com/lh3/pangene; prebuilt pangene graphs can be downloaded from https://zenodo.org/records/8118576 and visualized at https://pangene.bioinweb.org

1 Introduction

A human genome contains about 20 000 protein-coding genes. A small number of them have frequent copy-number or gene order changes in the human population (Sudmant et al. 2010, Handsaker et al. 2015). These genes are under fast evolution and may be responsible for immune responses, affecting brain functionality (Ju et al. 2016) and drug metabolism (Taylor et al. 2020), or associated with known diseases (Mercuri et al. 2022). They may have profound biological and biomedical implications.

Thanks to the recent advances in sequencing technologies (Wenger et al. 2019) and assembly algorithms (Nurk et al. 2020, Cheng et al. 2021, Rautiainen et al. 2023), we can routinely achieve haplotype-resolved assembly over genes under copy-number or order changes. We have also developed algorithms to construct pangenome sequence graphs that encode variations between genomes. However, how to identify these gene-level variations is not straightforward. Among the three whole-genome pangenome construction tools used by the Human Pangenome Reference Consortium (HPRC), minigraph (Li et al. 2020) and minigraph-cactus (Hickey et al. 2023) are unable to align through complex genomic regions and may miss genes in long segmental duplications; PGGB (PanGenome Graph Builder; Garrison et al. 2023) collapses paralogous genes which makes it difficult to study individual paralogs. In addition, all three tools do not directly reveal how genomic variations affect genes. To study gene-level variations, HPRC had to manually annotate genes on each haplotype (Liao et al. 2023) which is a time-consuming process. PGR-TK (PanGenomic Research ToolKit; Chin et al. 2023) reconstructs local haplotype structures from genomic sequences but it does not directly model genes and is not intended for whole-genome data. The current human pangenome tooling is not designed for studying gene variations.

In contrast, research on bacterial pangenome focuses on protein-coding genes instead of genome sequences, to the point that in the literature, a bacterial “pangenome” often refers to the collection of protein-coding genes. Several high-quality tools have been developed for constructing the gene content of bacterial genomes (Page et al. 2015, Ding et al. 2018, Gautreau et al. 2020, Tonkin-Hill et al. 2020, Zhou et al. 2020). In a nutshell, they start with ab initio gene annotation in each genome, cluster the resulted protein sequences, and then post-process clusters to identify orthologous genes and to fix issues caused by imperfect assembly, annotation or clustering (Tonkin-Hill et al. 2023). These bacterial pangenome tools however have not considered splicing, multiple isoforms, frequent segmental duplications and the much larger size of the human genome. They have not been shown to work with human pangenome data.

Here we developed pangene, a new computational tool to explore the gene content of a pangenome. Unlike bacterial pangenome pipelines, pangene effectively annotates protein-coding genes by aligning protein sequences to each genome with miniprot (Li 2023). As miniprot can align through in-frame stop codons and frameshifts, this procedure simplifies gene annotation and is robust to insertion/deletion errors in the input genomes. Furthermore, pangene constructs a bidirected gene graph and can capture inversions missed by bacterial pangenome tools. It also provides an algorithm to identify gene copy-number or gene order variations. Pangene is optimized for human genomes and also works for bacterial genomes.

2 Materials and methods

Pangene takes a set of protein sequences and multiple genome assemblies as input, and outputs a graph in the Graphical Fragament Assembly format (GFA; Li et al. 2020). It involves two steps: aligning the set of protein sequences to each input assembly with miniprot (Li 2023), and deriving a graph from the alignment with each contig encoded as a walk of genes. Pangene provides utilities to classify genes into core genes that are present in most of the input genomes, or accessory genes otherwise. Pangene can also identify generalized “bubbles” in the graph, which represent local gene order, gene copy-number or gene orientation variations among the input genomes.

Given perfect gene annotation and orthology assignment between the genes, the pangene graph construction algorithm is conceptually simple: it takes an orthologous group as a vertex and adds an edge between two groups if on an input genome, a gene in one group is adjacent to a gene in the other group (Fig. 1). The practical difficulty is to obtain accurate annotation in the presence of redundant sequences, paralogous genes and errors in assembly or alignment.

Figure 1. — Examples of pangene graphs. (a) Human haplotypes around the *HLA-DRB1* gene. (b) The pangene graph around *HLA-DRB1*. (c) Human haplotypes around the *RHD* gene. *RHD* has copy-number changes and *TMEM50A* may be inverted. (d) The corresponding pangene graph.

An important use of pangenome graphs is to identify genomic variations represented as “bubbles.” While bubbles have been extensively studied in directed graphs (Onodera et al. 2013), they have not been rigorously defined on bidirected graphs that are the necessary outputs of assemblers and pangenome constructors. If the field is to move towards the more generalized bidirected graph as a representation of genomic variation developing the concept of bubbles in this context will be necessary. Hence, how to define and to find “bubbles” in bidirected graphs will be a major topic of this article.

2.1 Defining pangene graphs

Let V be the set of genes and $X = V \times {>, <}$ be the set of oriented genes. If $v \in V$ is a gene, $x = (v, >) \in X$ , or simply $x = > v$ , denotes an oriented gene and $ν (x) = v$ gives the underlying gene behind oriented gene x. $\bar{x}$ is the reverse complement of x. i.e. if $x = > v, \bar{x} = < v$ , and vice versa. Throughout this article, we will use symbol u, v, or w to denote a gene and use x, y, or z to denote an oriented gene (Table 1).

Table 1.

Notations.

Notation	Description
V	Set of genes
v, w, u	Genes; vertices in a bidirected graph
X	Set of oriented genes; $X = V \times (>, <)$
x, y, z	Oriented genes; vertices in a directed graph
$> v, < v$	Oriented genes
$\bar{x}$	Reverse complement of x; $\bar{> v} = < v, \bar{< v} = > v$
$ν (x)$	The gene behind x; $ν (> v) = ν (< v) = v$

Open in a new tab

Let T be the set of input contigs. They harbor oriented genes in X. 2-tuple $(x, y) \in X \times X$ is said to be supported by contig $t \in T$ if y immediately follows x on t or, due to the strand symmetry of DNA, $\bar{x}$ immediately follows $\bar{y}$ .

A pangene graph is usually visualized as Fig. 2a (Wick et al. 2015) and can be mathematically defined in a few equivalent ways. It can be formulated as a directed graph $G_{D} = (X, E)$ where $E \subset X \times X$ is the set of edges. In the context of pangene graphs, (x, y) is an edge, also written as $x \to y$ , if (x, y) is supported by a contig. Notably, if $x \to y$ is an edge, $\bar{y} \to \bar{x}$ must also be an edge. In graph theory, this property is called the skew symmetry. Figure 2e shows an example of a directed pangene graph which can be described in the GFA format (Fig. 2b). In G_D, it is possible that y can reach both x and $\bar{x}$ . When this happens, x or $ν (x)$ is said to be an inversion.

Figure 2. — An example of a gene graph. (a) A gene graph for visualization. (b) The corresponding GFA format. (c) Bidirected graph. (d) Biedged graph. (e) Directed graph. (f) Net graph, which loses information. (g) A depth-first traversal of the net graph with “0” representing a super node not in the original net graph. (a) through (e) are equivalent representations.

A pangene graph can also be thought as a bidirected graph $G_{B} = (V, E)$ where the definition of E is the same as above. An edge in E is effectively associated with two directions. For example, $(> v, < w) \in E$ , also written as $v > < w$ , suggests the edge between v and w has a first direction “ $>$ ” towards vertex v and a second direction “ $<$ ” towards vertex w. Figure 2c shows the bidirected equivalence of Fig. 2a.

In a directed graph, $x \to y \leftarrow z$ is not a path because x cannot reach z. The similar constraint is applied to the bidirected graph G_B. For example, $v ≫ u ≪ w$ is not a path because both v and w go into u. $v ≫ u > < w$ is a path, which can also be expressed as a path in G_D: $(> v, > u, < w)$ .

As pangene constructs the graph from protein-to-genome alignment (Li 2023), it also attaches alignment scores to edges. Let $T_{x \to y} \subset T$ be the set of contigs that support $x \to y$ and $S (x | y, t), t \in T_{x \to y}$ , be the alignment score of x when $x \to y$ is supported by contig t. $S (x | y)$ is calculated as

S (x | y) = \frac{1}{| T_{x \to y} |} \sum_{t \in T_{x \to y}} S (x | y, t)

which is the average score of x among contigs supporting $x \to y$ .

2.2 Obtaining the protein set

Pangene requires a set of protein sequences as input which may have redundancies. In principle, if every input genome is annotated, we may merge all annotated protein sequences and align to all genomes with miniprot. This strategy works when there are a small number of diverged genomes. If there are many similar genomes, all with annotations, a faster way is to cluster protein sequences with CD-HIT (Li and Godzik 2006) or MMseqs2 (Schneider et al. 2017), and take a representative protein sequence from each cluster to generate the protein set.

Most human genomes do not have high-quality gene annotation. We may generate the protein set from the reference gene annotation supplemented with protein-coding genes or diverged alleles not present in the reference genome, such as HLA-DRB3. Pangene allows one gene to be associated with multiple protein sequences. If protein names follow format <GeneID>:<ProteinID>, pangene will select one protein for each <GeneID> and use <GeneID> as the segment name in the output GFA file.

2.3 Selecting non-orthologous genes

The central problem in graph construction is how to consistently annotate multiple genomes with a subset of input proteins. A naive way to annotate a genome is to align all protein sequences to the genome and if there are multiple proteins mapped to the same locus, select the best scoring protein as the annotation. This simple strategy does not work if there are orthologous proteins in the input protein set. For example, suppose gene v and w are orthologous to each other and are thus aligned to the same locus in each genome. When some genomes match v better and the rest match w better, both v and w will be considered as accessory genes that are not present in all genomes. However, the preferred choice is to elect one of them as a core gene.

Pangene selects non-orthologous genes as follows. For gene v, pangene inspects its best hit in each genome (hashes of gene names are used to break ties) and counts b(v) the number of genomes where v is aligned better than all other genes overlapping with v. Pangene then traverses each gene v in the descending order of current b(v). If b(v) > 0, select v as a non-orthologous gene and for each genome where the best mapping of w overlaps with v but is aligned better than v, reduce b(w) by one. This way, if v and w are orthologs, their best mapping positions likely overlap and thus only one of them will be selected; if v and w are paralogs, their best mapping positions usually differ and thus both will be selected.

2.4 Adjusting pangene graphs

Pangene constructs an initial graph using genes selected from the last section. It applies additional heuristics to improve gene annotations based on the graph topology and alignment scoring in two special cases.

First, a gene may be wrongly mapped to its paralogs if its best position is missing. Suppose a gene on chromosome X has a second best hit on an autosome. If the alignment score of the autosomal hit is lower than the hit to chromosome X by 3% miniprot will only report the hit on chromosome X. However, because a phased male human assembly does not have chromosome X, the gene will be mapped to the autosome (Fig. 3a). This false hit will lead to spurious edges in the initial graph (Fig. 3b). To solve this problem, pangene marks $x \to y$ as a false edge if there exists $x \to z$ such that y and z are on different contigs in every input genome and $S (x | y) < S (x | z) \cdot r_{1}$ (r₁ defaults to 95%). Pangene then filters out all alignment of x that are incident to a false edge $x \to y$ (Fig. 3c).

Figure 3. — Graph-based annotation adjustment. (a) *MSL3* is located on chromosome X but has a weaker hit on chromosome 2 when the gene is aligned to a male haplotype without chromosome X. (b) The initial pangene graph. Percentages indicate alignment identity. (c) The adjusted graph after filtering out the *MSL3* hit on chr2. (d) Two haplotypes involving *CYP2D6* and *CYP2D7* which are paralogous genes. The second haplotype should have *CYP2D7* only but because *CYP2D6* is longer, the *CYP2D6* alignment score is higher than the *CYP2D7* score. (e) The initial pangene graph assuming the second haplotype has *CYP2D6*. (f) The adjusted graph.

Second, similar to the first case, if the best position of a gene is missing due to a real deletion in evolution, this gene may interfere with the alignment of paralogous genes (Fig. 3d) and leads to a graph (Fig. 3e) inconsistent with manual annotation (Liao et al. 2023). To improve this case, pangene marks $x \to y$ to be low priority if there exists $x \to z$ such that y and z are on the same contig in an input genome and $S (x | y) < S (x | z) \cdot r_{2}$ (r₂ defaults to 98%). Pangene then marks all x alignment that are incident to a low-priority edge $x \to y$ to have low priority as well and prefers a gene without the low-priority mark. In Fig. 3e, $x = <$ CYP2D6, $y = <$ TCF20 and $z = <$ CYP2D7. $<$ CYP2D6 is marked to have low priority on the second contig. This results in a graph in Fig. 3f that is more consistent with the known evolution at this locus (Liao et al. 2023).

Both cases may also be caused by processed pseudogenes. When a protein-coding gene has a recent pseudogene that has a near identical sequence, the two heuristics may not work due to the score ratio threshold. By default, pangene filters out an alignment without splicing as a potential pseudogene alignment if the protein has a spliced alignment in other samples. Pangene also has an option to drop a protein if it does not have spliced alignment in any sample. For human samples, these strategies can reduce false connections between chromosomes.

2.5 Past work on bubble finding

The concept of bubble is perhaps first coined by Zerbino and Birney (2008) in the context of assembly graphs. A bubble was initially meant to be simple in that paths through a bubble do not share vertices except the start and the end vertices. Onodera et al. (2013) introduced superbubble to allow more complex acyclic topology and found a quadratic algorithm to identify all superbubbles. The time complexity was improved to $O (| E | log | E |)$ by Sung et al. (2015) and then to linear by Brankovic et al. (2016).

A subgraph induced by a superbubble is acyclic, but in a pangene graph or a variation graph, we also care about cyclic subgraph. Gärtner et al. (2018) defined weak superbubble without the acyclic condition, and presented a linear algorithm to identify all weak superbubbles (Gärtner and Stadler 2019). Interestingly, more than 20 years before this work, Johnson et al. (1994) had already come up with a linear algorithm to effectively find weak superbubbles in the context of compiler design, which will be the basis of our algorithm.

Weak superbubble is defined on directed graphs, not on bidirected graphs. If we apply the previous algorithms to directed graph G_D, we will find most superbubbles twice on opposite strands. More importantly, superbubble cannot capture the visual intuition in the presence of inversions. For example, in Fig. 4a, >A and <C enclose a subgraph that looks like a bubble, but the equivalent directed graph (Fig. 4f) does not have a superbubble when >A and <A are taken as different vertices. Dabbaghie et al. (2022) implemented the algorithm by Onodera et al. (2013) for sequence graphs, but how the issues above are handled is not apparent. The closest concept to weak superbubble in bidirected graphs is snarl (Paten et al. 2018), which leads to a bidirected subgraph that can be separated from the rest of the graph. Snarl does not require reachability which is necessary in defining superbubbles. To this end, there are no equivalent definition of weak superbubble in directed graphs.

Figure 4. — Example of gene graphs with inversions. (a) A gene graph for visualization. (`>A,<C`) is a generalized bibubble. (b) The corresponding GFA format. (c) Bidirected graph. (d) Biedged graph. (e) Net graph, which loses information. (f) Directed graph. (g) A second gene graph with (`>A,>D`) being a generalized bibubble. (`>A,<C`) is not a generalized bibubble because U(`>A,<C`)={`B, C`} but U(`>C,<A`)={`B`}. (h) A third gene graph without any generalized bibubbles.

2.6 Defining generalized bibubbles

The definition of “bubble” here tries to capture the visual intuition. It is inspired by Onodera et al. (2013). Let $U (x, y) \subset V$ be the set of vertices that are reachable from x without passing through x, $\bar{x}$ , or y. A generalized bidirected bubble, or a generalized bibubble in short, is a pair $(x, y) \in X \times X$ such that (i) $U (x, y) = U (\bar{y}, \bar{x}) = \emptyset$ ; (ii) $\forall v \in U (x, y)$ , there is a walk from x to y that passes v; (iii) there does not exist $z \in U (x, y) \times {>, <}$ that satisfies $U (x, z) = U (\bar{z}, \bar{x})$ or $U (z, y) = U (\bar{y}, \bar{z})$ . Due to the skew symmetry, if (x, y) is a generalized bibubble, $(\bar{y}, \bar{x})$ is also a generalized bibubble.

In the definition of U(x, y), we intentionally allow walks passing through $\bar{y}$ ; otherwise U(>A,<C) in Fig. 4g would become {B}, making (>A,<C) a generalized bibubble. This example also suggests (x, y) would not be a generalized bibubble if $x \in U (x, y)$ or $y \in U (x, y)$ . Condition (ii) above aims to exclude examples such as Fig. 4h. Condition (iii) requires a bibubble to be minimal. As a result, it is not possible for (x, y) and (x, z) both to be generalized bibubbles unless y = z. Generalized bibubbles are also nested in that if $(x, x')$ and $(y, y')$ are generalized bibubbles and $U (x, x') \cap U (y, y') = \emptyset$ , then $U (x, x') \subset U (y, y')$ or $U (y, y') \subset U (x, x')$ .

For a generalized bibubble (x, y), let $\bar{U} (x, y) = V ∖ U (x, y) ∖ {ν (x), ν (y)}$ . No vertex in $\bar{U} (x, y)$ can reach vertices U(x, y) because otherwise $w \in \bar{U} (x, y)$ reachable from x (or from $\bar{y}$ ) would not pass through y (or $\bar{x}$ ) and would be added to U(x, y) (or $U (\bar{y}, \bar{x})$ )—this would violate the definition of U(x, y). This observation indicates that removing $ν (x)$ and $ν (y)$ would separate U(x, y) from the rest of the graph.

A naive algorithm to identify all generalized bibubbles is to enumerate each pair of oriented genes and then test whether the pair forms a generalized bibubble based on the definition. This is an $O (| V |^{2} \cdot (| V | + | E |))$ algorithm, impractical for large graphs. To describe a new algorithm for bubble finding, we will first introduce net graphs and cycle equivalence.

2.7 Biedged graphs and net graphs

Let $Z = V \times {s, e}$ , where symbol “ $s$ ” represents the start of a gene and symbol “ $e$ ” represents the end. Given a bidirected graph $G_{B} = (V, E)$ , a biedged graph $G_{E} = (Z, E_{g}, E_{l})$ is constructed with $E_{g} = {{(v, s), (v, e)} | v \in V}$ and E_l being the set of undirected connections between starts or ends of genes in Z (Paten et al. 2018). Figure 2d shows an example. Note that a valid walk in a biedged graph cannot pass two consecutive edges in E_l. Due to this restriction, many classical algorithms in graph theory are not applicable to a biedged graph.

A net graph $G_{N} = (P, E_{n})$ is constructed by contracting all edges in E_l. As a result, a vertex in P is a component connected by edges in E_l and an edge represents a gene in V (Figs 2e and 4e). Because E_n and V have one-to-one relationship, we will also use v to denote an edge in G_N. A net graph is an undirected graph and is not equivalent to the bidirected graph G_B. For example, in Fig. 2a, there is not a walk that goes through both gene C and D, but in Fig. 2f, there is a walk involving both genes.

In graph theory, a set of edges in G_N is a cut-set if cutting through these edges partitions G_N into two disconnected subgraphs. A key observation is that if (x, y) is a bibubble in bidirected graph G_B, ${ν (x), ν (y)}$ is a cut-set in G_N.

2.8 Cycle equivalence

Two edges e₁ and e₂ in G_N are said to be cycle equivalent if every cycle containing e₁ contains e₂ and vice versa. Cycle equivalent edges form equivalent classes, which can be determined in linear time (Johnson et al. 1994).

For any pair of e₁ and e₂ that are cycle equivalent, ${e_{1}, e_{2}}$ is a cut-set. To see this, let p₁₁ and p₁₂ are the two vertices incident to e₁. Without losing generality, suppose G_n is a connected graph. If G_n remained connected after removing e₁ and e₂, there would exist a vertex q that can reach both p₁₁ and p₁₂ without passing through e₂ (because e₂ has been removed). Then e₁ would be in a cycle with q in the original G_N but without passing through e₂, violating the cycle equivalence of e₁ and e₂.

Conversely, if ${e_{1}, e_{2}}$ is a cut-set, e₁ and e₂ are cycle equivalent. To prove this, let ${e_{1}, e_{2}}$ partition the vertex set P into P₁ and P₂. If e₁ and e₂ were not cycle equivalent, there must exist a cycle that only contains e₁ but not e₂. Then cutting both e₁ and e₂ would still keep P₁ and P₂ connected due to this cycle, violating that ${e_{1}, e_{2}}$ is a cut-set.

To this end, ${e_{1}, e_{2}}$ is a cut-set if and only if e₁ and e₂ are cycle equivalent. Then if (x, y) is a generalized bibubble in G_B, $ν (x)$ and $ν (y)$ are cycle equivalent in G_N. The reverse is not always true. For example, in Fig. 2f, “C” and “D” are cycle equivalent, but (>C,>D) or in any orientation is not a generalized bibubble. Similarly, “A” and “B” are cycle equivalent in Fig. 5c but they do not enclose generalized bibubbles.

Figure 5. — Example of gene graphs that complicate net graph based bubble finding. (a) Gene annotation on a contig. (b) The corresponding pangene graph. (c) Net graph with the dummy vertex “0” connected to “1” only. (d) Net graph with the dummy vertex connected to genes at the ends of the contig.

2.9 Identifying generalized bibubbles

Given a bidirected graph G_B, pangene transforms it to a net graph G_N and assigns a class number to each gene v such that v and w are in the same class if they are cycle equivalent in G_N. For each oriented gene x that has multiple edges, pangene applies a breadth-first search (BFS) to traverse up to m genes (100 by default) reachable from x. During BFS, pangene collects y in the same equivalent class and does not traverse beyond y. It then checks each (x, y) using the definition of generalized bibubble. Because x can only appear in one generalized bibubble, pangene stops testing other $(x, \cdot)$ pairs if it has found a generalized bibubble.

The time complexity in the worst case is $O (| V | + | E | + b \cdot m^{2})$ , where b is the number of genes with multiple incoming or outgoing edges and m is effectively the maximum bibubble size. On real pangene graphs, the time complexity is close to $O (| V | + | E | + b \cdot m)$ . It takes pangene less than one second to find ∼200 generalized bibubbles in a human graph including 20 000 genes.

In addition to cycle equivalence, Johnson et al. (1994) described another linear algorithm to find single-entry-single-exist (SESE) regions in G_N provided that G_N contains a vertex p₀ that puts every other vertex in G_N in a cycle with p₀. A SESE region is intuitively a bubble-like subgraph in G_N. Pangene also implemented this algorithm. It introduced p₀ that connects all vertices with degree one in G_N and connects the first and the last gene on each reference chromosome (Figs 2f and 5d). Because a SESE region does not always correspond to a generalized bibubble in G_B, pangene still has to test each SESE region using the definition of generalized bibubble. The time complexity of this second algorithm is $O (| V | + | E | + b' m^{2})$ where $b'$ is the number of SESE regions in G_N. It is faster in theory, but pangene still uses the first algorithm by default as the placement of p₀ is heuristic which caused one false negative in a human pangene graph.

3 Results

We applied pangene to HPRC samples and manually inspected subgraphs around CYP2D6, RHD, C4, HLA-DRB1, and KIR genes which have been curated for these samples (Liao et al. 2023, Zhou et al. 2024). We also looked at many other genes that are known to have gene-level variations (Sudmant et al. 2010, Boettger et al. 2012, Handsaker et al. 2015). While these genes have not been manually annotated for HPRC samples, we could check whether pangene captured known events such as the MAPT inversion and the ORM1 copy-number changes.

Due to the lack of genome-wide ground truth in human data, we could only manually evaluate pangene in individual subgraphs. To get a better sense of overall statistics, we additionally applied pangene to two bacterial datasets and compared the results to existing bacterial pangenome tools.

3.1 Structural variants between two human genomes

We downloaded GENCODE comprehensive human annotation v45, retained the protein-coding transcripts tagged with “Ensembl_canonical” and filtered out readthrough transcripts (a transcript that joins two adjacent genes) and mitochondrial genes. Only one transcript was selected per gene. Because GENCODE does not include HLA-DRB3, HLA-DRB4 and several KIR genes, we manually added 20 HLA-DRB genes and alleles from the IPD-IMGT/HLA database and 15 KIR genes from the IPD-KIR database. In the end, we constructed a protein set with 19 421 sequences.

We aligned the proteins to the human reference genome GRCh38 (Schneider et al. 2017) and T2T-CHM13 (Nurk et al. 2022) with miniprot (Li 2023) v0.12 under option “--outs = 0.97 -Iu” and constructed a pangene graph. The resulting graph contains 19 088 genes and 19 390 edges. We identified 91 generalized bubbles in the graph that contain 100 genes or fewer. The bibubbles involved 691 genes affected by gene copy, gene order or gene orientation changes between the two genome. These genes included many known events such as PDPR, SMN2, CTAGE9, HPR, ORM1, CCL4, NCF1, and Amylase (Sudmant et al. 2010, Handsaker et al. 2015). Some of the large gene clusters affected by structural changes, such as SMN2 and Amylase, would not be easily captured by whole-genome alignment as they are not represented by colinear genome alignments.

3.2 Analyzing 100 human haplotypes

We obtained the assembly of CN1 (Yang et al. 2023), YAO (He et al. 2023) and 47 samples released by the Human Reference Pangenome Consortium (Liao et al. 2023). Together with GRCh38 and T2T-CHM13, this gave us 100 human haplotypes. We aligned the same set of proteins to the 100 haplotypes. Pangene took less than one minute to construct a graph and less than one second to identify 266 generalized bibubbles in the graph. Many of the bibubbles were supported by one contig only. Pruning edges supported by one contig resulted in a graph containing 209 generalized bibubbles. Further dropping about 2000 single-coding-exon genes led to a graph containing 163 bibubbles. These included manually confirmed gene-level variations around CYP2D6, C4 and RHD (Liao et al. 2023) and around HLA-DBR1 (Zhou et al. 2024).

Pangene effectively annotates each input genome. For a sanity check, we compared the pangene gene annotation to the GENCODE annotation. Out of 19 043 protein coding genes in both annotations, only four gene locations are different. In one example, pangene assigned NUDT4B to NUDT4. Pangene misses NUDT4B because it is a single-exon gene, and pangene puts NUDT4B at the NUDT4 locus because the protein sequence of NUDT4 is a strict substring of NUDT4B.

We also compared the pangene annotation to the T2T-CHM13 annotation generated by Ensembl. There are 41 differences out 18 676 genes present in both annotations. These include four genes CSAG2, CSAG3, MAGEA6, and MAGEA3 that are close to each other. Pangene gives better protein-to-genome alignment for the four genes. We speculate that Ensembl may be annotating these genes based on synteny with GRCh38, but synteny here might be broken due to an apparent inversion around the genes. For another example, in the RCCX gene cluster (Carrozza et al. 2021), pangene annotates CYP21A2 ahead of TNXB but Ensembl puts CYP21A2 inside an intron of TNXB, which seems to be an error. Ensembl also misses C4A, another gene in RCCX. We believe pangene is doing better in these cases. Most other genes among the 41 differences come from huge complex gene clusters. We cannot tell what the correct annotation is.

We next investigated genes with presence/absence variants (PAVs). Out of 16 996 multi-exon genes in the graph, 16 315 were on the autosomes of GRCh38. 3.4% of them were absent from one or more haplotypes and 0.4% (n = 58) were only present in $\leq$ 50% of haplotypes. Among the 58 genes, TRIM64, e.g. is a close paralog to TRIM64B. On most haplotypes, TRIM64B was aligned better than TRIM64 and pangene annotated two TRIM64B genes but no TRIM64. Therefore, TRIM64 became a PAV. When we excluded genes having paralogs of $\geq$ 95% identity from the list, only 14 were left. This list included three KIR genes and HLA-DRB4 that are known to be PAVs (Zhou et al. 2024). We checked a few remaining genes in the list. PRH1 and GSTM1 were in relatively simple regions and look like real PAVs. Both of them have paralogs of identity below 95%. GSTT2 has a close paralog GSTT2B that CD-HIT failed to identify—it seemed an algorithm glitch. The flanking region around NOTCH2NLR looked complex and was poorly assembled in most haplotypes. This may explain the low occurrence rate of this gene. Overall, almost all human genes or their close paralogs are present in the majority of human haplotypes. It is rare to see a protein-coding gene with sequences completely missing without paralogs as backups.

Our analysis is insensitive to thresholds r₁ and r₂ in Section 2.4. Reducing r₁ from 95% to 90% and r₂ from 98% to 96% only affects 9 out of 266 bibubbles.

3.3 Incorporating 10 great ape haplotypes

We also constructed a graph including ten haplotype assemblies of great apes, including chimpanzee, bonobo, gorilla, and two orangutan subspecies (Makova et al. 2024). We filtered out single-exon genes and pruned edges supported by one contig. This graph contained 239 generalized bibubbles, 131 of which were polymorphic among the 100 human samples. The fewer polymorphic bibubbles in human (in comparison to 163 identified without great ape samples) were caused by the merging of small bibubbles, by the increased difficulty in resolving paralogs given more remote outgroups, or by chromosome-scale inversions or translocations that disrupted the local bubble topology. The last cause seemed to be the major contributor. We will use the 17q21.31 inversion around MAPT as an example.

The 17q21.31 inversion is polymorphic in the human population (Boettger et al. 2012, Steinberg et al. 2012). In the graph derived from the 100 human haplotypes, pangene identified a generalized bibubble corresponding to this inversion. The bibubble contained LRRC37A and LRRC37A2 which have similar sequences. The two genes have another paralog LRRC37A3 that is ∼18 Mb away. All great apes only have one copy of the gene. Intriguingly, gorilla haplotypes are similar to human haplotypes at 17q24.1 but chimpanzee and orangutan haplotypes involve genes at 17q21.31 (Fig. 6). As a result, the pangene graph including great apes connected genes in 17q24.1 and 17q21.31 and broke the 17q21.31 bibubble in the human-only graph. Involving outgroup species may hurt the finding of localized bibubbles.

Figure 6. — Human and great ape haplotypes around LRRC37A* genes. A vertical bar indicates the two genes on each side are not adjacent on the genome. In the plot, each genome contains two blocks of genes. The genomic distance between the two blocks is 17.3 Mb in human, 61.3 Mb in bonobo, 17.6 Mb in gorilla or 2.8 Mb in orangutan. GRCh38 has *LRRC37A* and *LRRC37A2* in 17q21.31 and *LRRC37A3* in 17q24.1. HG00741 has a known inversion around *MAPT*. All great apes have only one LRRC37A* gene. Chimpanzee and bonobo have the same gene haplotype. The two orangutan subspecies also have the same haplotype. The *LRRC37A3* alignment in gorilla has two in-frame stop codons. It is challenging to distinguish LRRC37A* paralogs as they are similar in sequence and lack synteny with nearby genes.

3.4 Analyzing 152 M.tuberculosis strains

We downloaded the M.tuberculosis reference strain H37Rv and its gene annotation from RefSeq (AC: GCF_000195955.2) and obtained the complete long-read assemblies of 151 other strains (Peker et al. 2021, Marin et al. 2022, Hall et al. 2023). Following the instruction of Panaroo (Tonkin-Hill et al. 2020), we ran Prodigal (Hyatt et al. 2010) v2.6.3 on the reference strain to train the Prodigal model and ran Prokka (Seemann 2014) v1.14.6 with the pretrained model to predict protein coding genes in all strains. We used CD-HIT (Li and Godzik 2006, Fu et al. 2012) v4.8.1 with option “-c 0.98” to cluster non-reference protein sequences, which resulted in 6744 clusters. We mapped these proteins to each M.tuberculosis genome using miniprot with option “-S” to disable splicing. We finally ran pangene with “-p.001” to keep all genes regardless of their frequency in the pangenome.

Pangene constructed a graph consisting of 4216 genes, 3652 of which were present in all 152 genomes (Table 2; Fig. 7). To check if pangene captured the gene content in these strains, we compared the pangene result to Panaroo v1.3.4. We aligned the Panaroo proteins to the pangene proteins with MMseqs2 (Steinegger and Söding 2017) v13.45111 and identified 46 Panaroo proteins do not hit to pangene proteins at the default E-value threshold of $10^{- 3}$ . We mapped the 46 proteins to H37Rv with miniprot and found 43 of them can be aligned and 76% of the aligned regions overlap with annotated CDS in RefSeq. Manually investigating the overlaps revealed that most of the 43 proteins were aligned to the opposite strand of some RefSeq genes or in different reading frames. Identifying homology based on the genomic locations of input proteins, pangene did not include them into the final graph. Conversely, 40 pangene proteins do not hit to Panaroo proteins.

Table 2.

Number of detected genes in bacterial pangenomes.

	Panaroo	pangene	PPanGGOLiN
Mtb: #total genes	4207	4216	4744
Mtb: #core genes	3769	3774	3660
Ecoli: #total genes	14 342	13 902	14 291
Ecoli: #core genes	3065	3118	3038

Open in a new tab

Panaroo, pangene, and PPanGGOLiN were applied to two sets of gapless bacterial assemblies: 152 M.tuberculosis (Mtb) strains and 50 E.coli strains. Mtb genes were annotated by Prokka and Ecoli genes were annotated by NCBI and downloaded from GenBank. A “core” gene is a gene that is inferred to be present in $\geq$ 99% of assemblies in each dataset. Panaroo was invoked in the strict mode with paralogs merged (option “--clean-mode strict --merge_paralogs”).

Figure 7. — Comparing three bacterial pangenome construction tools. In each Venn diagram, pangene, Panaroo and PPanGGOLiN proteins are merged and clustered with MMseqs2 (`easy-cluster -c 0.9 --min-seq-id 0.5`). The diagram shows the number of clusters in each intersection. MMseqs2 merges paralogous genes, so the number of clusters for each tool is smaller than the number of genes in Table 2. Venn diagrams were generated with BioVenn (Hulsen *et al.* 2008).

We additionally ran PPanGGOLiN (Gautreau et al. 2020) v1.2.105 with the Prokka annotation as the input. PPanGGOLiN collected 4744 genes in the pangenome with 3463 present in all. 277 genes did not hit to genes selected by pangene. 274 of these genes could be aligned to H37Rv by miniprot and 90% of the aligned regions overlap with annotated CDS in RefSeq. We note that although overlapping genes in different open reading frames are rare in H37Rv (0.2% of coding regions), these genes may occur in other strains and may have functional implications (Vargas et al. 2023, Snobre et al. 2024). PPanGGOLiN’s preference to count such genes in the pangenome is not necessarily a weakness.

3.5 Analyzing 50 E.coli strains

M.tuberculosis has low diversity with each strain similar to each other (Marin et al. 2022). To understand how pangene performs given more diverse strains, we collected 50 E.coli genomes with complete assemblies (Shaw et al. 2021).

We followed the same procedure to run pangenome tools. To get clean graph, pangene by default filters out genes that had >10 edges or connected >3 distant loci in the graph. This filter only removed six genes in M.tuberculosis dataset but it filtered out 848 genes in E.coli. We added option “-p.001 -g50 -r10” to retain more high-copy genes and genes connecting multiple loci. Although pangene collected fewer genes (Table 2), only 72 Panaroo genes do not hit to genes collected by pangene and the pangene pangenome size after clustering is not noticeably smaller than the Panaroo and the PPanGGOLiN pangenomes (Fig. 7). The differences between the tools might be determined by subtle thresholds on how to resolve homology and may not reflect the capability of each algorithm.

4 Discussion

Pangene constructs a graph to jointly represent the gene content of multiple high-quality genomes. It provides approximate gene annotation and orthology assignment as a byproduct. Given bacterial genomes, it plays a similar role to other bacterial pangenome tools for identifying core and accessory genes (Tonkin-Hill et al. 2023). Pangene is unique in that it also works for eukaryotic genomes and captures localized gene-level variants. Pangene complements whole-genome alignment based pangenome tools and gives a more concise high-level view of gene content changes.

Constructing a pangene graph is challenging because the biologically correct graph is unknown. We tend to think a “correct” graph should model the evolution history in principle. However, evolution is complex in the presence of convergent/divergent evolution in the long time scale and non-homologous recombinations and non-allelic gene conversions (Vollger et al. 2023) in the short time scale. We do not know the evolution history of most complex loci except several well-studied cases. Furthermore, we take each gene as an atomic unit but biologically, a gene may be subjected to structural changes such as losing/gaining exons or fusing with another gene. These events are difficult to model even if we had the exact annotation in all input genomes.

The pangene graph construction algorithm is ad hoc. We have manually tuned various heuristics to correctly encode known gene-level variants in human, but may have overlooked other cases that pangene may fail to represent. Ideally, we would prefer to model graph construction to a global optimization problem. We have not been able to find such a formulation that can reliably encode known variations. This limitation is not specific to pangene; it is also applied to other pangenome construction tools.

As protein-to-genome alignment can identify orthology within mammals or even vertebrates (Li 2023), we could in principle apply pangene to more distantly related species. We did construct a gene graph from the human and mouse reference genomes. The graph revealed synteny blocks visible from the human-mouse whole-genome alignment, but the topology of the whole graph looked complex due to frequent large-scale rearrangements over long evolutionary distance. Many generalized bibubbles identified from this graph were due to real orthologous alignments missed by miniprot. We will need another way at higher level to investigate such a graph. Another practical challenge in building a cross-species graph is to select the set of input proteins. When constructing the great ape graph, we tried to merge human and great ape protein sequences. In the resultant graph, we observed thousands of alignments in the intergenic or intronic regions of GRCh38. These spurious alignments degraded the overall quality of the graph. The issue will become more prominent when we take diverse genomes without high-quality gene annotation as input. Nevertheless, we note that cross-species whole-genome alignment is also challenging and is perhaps no better than pangene. We still believe with improvements, pangene can construct a cross-species graph that helps to reveal evolution in longer term.

On the theoretical side, this article presented a rigorous definition of “bubble” in a bidirected graph but it did not find an efficient algorithm to identify such generalized bibubbles. While the current implementation in pangene works for gene graphs containing ∼20 000 genes, it will be slow for a minigraph-cactus or PGGB graph that contains tens of millions of nodes. How to efficiently identify generalized bibubbles remains an open and critical problem. Furthermore, “bubbles” cannot represent rearrangements at the chromosome scale. How to explore a multi-species gene graph or an alignment graph in general may be another interesting research topic.

Acknowledgements

We are grateful to Pouria Salehi and Luke O’Connor for the helpful discussions about the definition of variants in a sequence graph. We thank Erik Garrison for pointing us to the paper by Johnson et al. (1994).

Contributor Information

Heng Li, Department of Data Science, Dana-Farber Cancer Institute, Boston, MA 02215, United States; Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02215, Unites States; Broad Institute of MIT and Harvard, Cambridge, MA 02142, United States.

Maximillian Marin, Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02215, Unites States.

Maha R Farhat, Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02215, Unites States; Pulmonary and Critical Care Medicine, Massachusetts General Hospital, Boston, MA 02114, United States.

Conflict of interest

None declared.

Funding

This work was supported by the National Institute of Health [grant numbers R01HG010040 and U01HG010961 to H.L., T15LM007092 to M.M., and R01AI155765 to M.R.F.] and by the Chan-Zuckerberg Initiative (to H.L.).

Data availability

Pangene source code available at https://github.com/lh3/pangene; prebuilt pangene graphs can be downloaded from https://zenodo.org/records/8118576 and visualized at https://pangene.bioinweb.org.

References

Boettger LM, Handsaker RE, Zody MC. et al. Structural haplotypes and recent evolution of the human 17q21.31 region. Nat Genet 2012;44:881–5. [DOI] [PMC free article] [PubMed] [Google Scholar]
Brankovic L, Iliopoulos CS, Kundu R. et al. Linear-time superbubble identification algorithm for genome assembly. Theor Comput. Sci 2016;609:374–83. [Google Scholar]
Carrozza C, Foca L, De Paolis E. et al. Genes and pseudogenes: complexity of the RCCX locus and disease. Front Endocrinol (Lausanne) 2021;12:709758. [DOI] [PMC free article] [PubMed] [Google Scholar]
Cheng H, Concepcion GT, Feng X. et al. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat Methods 2021;18:170–5. [DOI] [PMC free article] [PubMed] [Google Scholar]
Chin C-S, Behera S, Khalak A. et al. Multiscale analysis of pangenomes enables improved representation of genomic diversity for repetitive and clinically relevant genes. Nat Methods 2023;20:1213–21. [DOI] [PMC free article] [PubMed] [Google Scholar]
Dabbaghie F, Ebler J, Marschall T.. BubbleGun: enumerating bubbles and superbubbles in genome graphs. Bioinformatics 2022;38:4217–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
Ding W, Baumdicker F, Neher RA.. Panx: pan-genome analysis and exploration. Nucleic Acids Res 2018;46:e5. [DOI] [PMC free article] [PubMed] [Google Scholar]
Fu L, Niu B, Zhu Z. et al. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 2012;28:3150–2. [DOI] [PMC free article] [PubMed] [Google Scholar]
Garrison E, Guarracino A, Heumos S. et al. Building pangenome graphs. bioRxiv, 2023, preprint: not peer reviewed.
Gärtner F, Müller L, Stadler PF.. Superbubbles revisited. Algorithms Mol Biol 2018;13:16. [DOI] [PMC free article] [PubMed] [Google Scholar]
Gärtner F, Stadler PF.. Direct superbubble detection. Algorithms 2019;12:81. [Google Scholar]
Gautreau G, Bazin A, Gachet M. et al. PPanGGOLiN: depicting microbial diversity via a partitioned pangenome graph. PLoS Comput Biol 2020;16:e1007732. [DOI] [PMC free article] [PubMed] [Google Scholar]
Hall MB, Rabodoarivelo MS, Koch A. et al. Evaluation of nanopore sequencing for Mycobacterium tuberculosis drug susceptibility testing and outbreak investigation: a genomic analysis. Lancet Microbe 2023;4:e84–92. [DOI] [PMC free article] [PubMed] [Google Scholar]
Handsaker RE, Van Doren V, Berman JR. et al. Large multiallelic copy number variations in humans. Nat Genet 2015;47:296–303. [DOI] [PMC free article] [PubMed] [Google Scholar]
He Y, Chu Y, Guo S. et al. T2T-YAO: a telomere-to-telomere assembled diploid reference genome for Han Chinese. Genom Proteomics Bioinf 2023;21:1085–100. [DOI] [PMC free article] [PubMed] [Google Scholar]
Hickey G, Monlong J, Ebler J, et al. Pangenome graph construction from genome alignments with minigraph-cactus. Nat Biotechnol 2023;42:663–73. [DOI] [PMC free article] [PubMed] [Google Scholar]
Hulsen T, de Vlieg J, Alkema W.. BioVenn—a web application for the comparison and visualization of biological lists using area-proportional venn diagrams. BMC Genom 2008;9:488. [DOI] [PMC free article] [PubMed] [Google Scholar]
Hyatt D, Chen G-L, Locascio PF. et al. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinf 2010;11:119. [DOI] [PMC free article] [PubMed] [Google Scholar]
Johnson R, Pearson D, Pingali K.. 1994. The program structure tree: computing control regions in linear time. In: Sarkar V, Ryder BG, Soffa ML (eds.), Proceedings of the ACM SIGPLAN’94 Conference on Programming Language Design and Implementation (PLDI). Orlando, Florida, USA: ACM, June 20–24, 1994, 171–85. [Google Scholar]
Ju X-C, Hou Q-Q, Sheng A-L. et al. The hominoid-specific gene TBC1D3 promotes generation of basal neural progenitors and induces cortical folding in mice. Elife 2016;5:e18197. [DOI] [PMC free article] [PubMed] [Google Scholar]
Li H. Protein-to-genome alignment with miniprot. Bioinformatics 2023;39:btad014. [DOI] [PMC free article] [PubMed] [Google Scholar]
Li H, Feng X, Chu C. et al. The design and construction of reference pangenome graphs with minigraph. Genome Biol 2020;21:265. [DOI] [PMC free article] [PubMed] [Google Scholar]
Li W, Godzik A.. CD-HIT: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 2006;22:1658–9. [DOI] [PubMed] [Google Scholar]
Liao W-W, Asri M, Ebler J. et al. A draft human pangenome reference. Nature 2023;617:312–24. [DOI] [PMC free article] [PubMed] [Google Scholar]
Makova KD, Pickett BD, Harris RS. et al. The complete sequence and comparative analysis of ape sex chromosomes. Nature2024;630:401–11. [DOI] [PMC free article] [PubMed] [Google Scholar]
Marin M, Vargas R, Harris M. et al. Benchmarking the empirical accuracy of short-read sequencing across the m. tuberculosis genome. Bioinformatics 2022;38:1781–7. [DOI] [PMC free article] [PubMed] [Google Scholar]
Mercuri E, Sumner CJ, Muntoni F. et al. Spinal muscular atrophy. Nat Rev Dis Primers 2022;8:52. [DOI] [PubMed] [Google Scholar]
Nurk S, Koren S, Rhie A. et al. The complete sequence of a human genome. Science 2022;376:44–53. [DOI] [PMC free article] [PubMed] [Google Scholar]
Nurk S, Walenz BP, Rhie A. et al. HiCanu: accurate assembly of segmental duplications, satellites, and allelic variants from high-fidelity long reads. Genome Res 2020;30:1291–305. [DOI] [PMC free article] [PubMed] [Google Scholar]
Onodera T, Sadakane K, Shibuya T. Detecting superbubbles in assembly graphs. Lect Notes Comput Sci 2013;8126:338–48. [Google Scholar]
Page AJ, Cummins CA, Hunt M. et al. Roary: rapid large-scale prokaryote pan genome analysis. Bioinformatics 2015;31:3691–3. [DOI] [PMC free article] [PubMed] [Google Scholar]
Paten B, Eizenga JM, Rosen YM. et al. Superbubbles, ultrabubbles, and cacti. J Comput Biol 2018;25:649–63. [DOI] [PMC free article] [PubMed] [Google Scholar]
Peker N, Schuele L, Kok N. et al. Evaluation of whole-genome sequence data analysis approaches for short- and long-read sequencing of mycobacterium tuberculosis. Microb Genom 2021;7:000695. [DOI] [PMC free article] [PubMed] [Google Scholar]
Rautiainen M, Nurk S, Walenz BP. et al. Telomere-to-telomere assembly of diploid chromosomes with verkko. Nat Biotechnol 2023;41:1474–82. [DOI] [PMC free article] [PubMed] [Google Scholar]
Schneider VA, Graves-Lindsay T, Howe K. et al. Evaluation of GRCh38 and de novo haploid genome assemblies demonstrates the enduring quality of the reference assembly. Genome Res 2017;27:849–64. [DOI] [PMC free article] [PubMed] [Google Scholar]
Seemann T. Prokka: rapid prokaryotic genome annotation. Bioinformatics 2014;30:2068–9. [DOI] [PubMed] [Google Scholar]
Shaw LP, Chau KK, Kavanagh J, et al. Niche and local geography shape the pangenome of wastewater- and livestock-associated enterobacteriaceae. Sci Adv 2021;7:eabe3868. [DOI] [PMC free article] [PubMed] [Google Scholar]
Snobre J, Meehan CJ, Mulders W. et al. Frameshift mutations in Rv0678 preserve bedaquiline susceptibility in Mycobacterium tuberculosis by maintaining protein integrity. In: SSRN, 2024. 10.2139/ssrn.4769101. [DOI]
Steinberg KM, Antonacci F, Sudmant PH. et al. Structural diversity and African origin of the 17q21.31 inversion polymorphism. Nat Genet 2012;44:872–80. [DOI] [PMC free article] [PubMed] [Google Scholar]
Steinegger M, Söding J.. Mmseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat Biotechnol 2017;35:1026–8. [DOI] [PubMed] [Google Scholar]
Sudmant PH, Kitzman JO, Antonacci F, et al. Diversity of human copy number variation and multicopy genes. Science 2010;330:641–6. [DOI] [PMC free article] [PubMed] [Google Scholar]
Sung W-K, Sadakane K, Shibuya T. et al. An o(m log m)-time algorithm for detecting superbubbles. IEEE/ACM Trans Comput Biol Bioinform 2015;12:770–7. [DOI] [PubMed] [Google Scholar]
Taylor C, Crosby I, Yip V. et al. A review of the important role of CYP2D6 in pharmacogenomics. Genes (Basel) 2020;11:1295. [DOI] [PMC free article] [PubMed] [Google Scholar]
Tonkin-Hill G, Corander J, Parkhill J.. Challenges in prokaryote pangenomics. Microb Genom 2023;9:mgen001021. [DOI] [PMC free article] [PubMed] [Google Scholar]
Tonkin-Hill G, MacAlasdair N, Ruis C. et al. Producing polished prokaryotic pangenomes with the panaroo pipeline. Genome Biol 2020;21:180. [DOI] [PMC free article] [PubMed] [Google Scholar]
Vargas R Jr, Luna MJ, Freschi L. et al. Phase variation as a major mechanism of adaptation in Mycobacterium tuberculosis complex. Proc Natl Acad Sci U S A 2023;120:e2301394120. [DOI] [PMC free article] [PubMed] [Google Scholar]
Vollger MR, Dishuck PC, Harvey WT, et al. Increased mutation and gene conversion within human segmental duplications. Nature 2023;617:325–34. [DOI] [PMC free article] [PubMed] [Google Scholar]
Wenger AM, Peluso P, Rowell WJ. et al. Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. Nat Biotechnol 2019;37:1155–62. [DOI] [PMC free article] [PubMed] [Google Scholar]
Wick RR, Schultz MB, Zobel J. et al. Bandage: interactive visualization of de novo genome assemblies. Bioinformatics 2015;31:3350–2. [DOI] [PMC free article] [PubMed] [Google Scholar]
Yang C, Zhou Y, Song Y. et al. The complete and fully-phased diploid genome of a male han Chinese. Cell Res 2023;33:745–61. [DOI] [PMC free article] [PubMed] [Google Scholar]
Zerbino DR, Birney E.. Velvet: algorithms for de novo short read assembly using de bruijn graphs. Genome Res 2008;18:821–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
Zhou Y, Song L, Li H.. Full resolution HLA and KIR genes annotation for human genome assemblies. Genome Res 2024;gr.278985.124. [DOI] [PMC free article] [PubMed] [Google Scholar]
Zhou Z, Charlesworth J, Achtman M.. Accurate reconstruction of bacterial pan- and core genomes with PEPPAN. Genome Res 2020;30:1667–79. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Data Availability Statement

Pangene source code available at https://github.com/lh3/pangene; prebuilt pangene graphs can be downloaded from https://zenodo.org/records/8118576 and visualized at https://pangene.bioinweb.org.

[btae456-B1] Boettger LM, Handsaker RE, Zody MC. et al. Structural haplotypes and recent evolution of the human 17q21.31 region. Nat Genet 2012;44:881–5. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btae456-B2] Brankovic L, Iliopoulos CS, Kundu R. et al. Linear-time superbubble identification algorithm for genome assembly. Theor Comput. Sci 2016;609:374–83. [Google Scholar]

[btae456-B3] Carrozza C, Foca L, De Paolis E. et al. Genes and pseudogenes: complexity of the RCCX locus and disease. Front Endocrinol (Lausanne) 2021;12:709758. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btae456-B4] Cheng H, Concepcion GT, Feng X. et al. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat Methods 2021;18:170–5. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btae456-B5] Chin C-S, Behera S, Khalak A. et al. Multiscale analysis of pangenomes enables improved representation of genomic diversity for repetitive and clinically relevant genes. Nat Methods 2023;20:1213–21. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btae456-B6] Dabbaghie F, Ebler J, Marschall T.. BubbleGun: enumerating bubbles and superbubbles in genome graphs. Bioinformatics 2022;38:4217–9. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btae456-B7] Ding W, Baumdicker F, Neher RA.. Panx: pan-genome analysis and exploration. Nucleic Acids Res 2018;46:e5. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btae456-B8] Fu L, Niu B, Zhu Z. et al. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 2012;28:3150–2. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btae456-B9] Garrison E, Guarracino A, Heumos S. et al. Building pangenome graphs. bioRxiv, 2023, preprint: not peer reviewed.

[btae456-B10] Gärtner F, Müller L, Stadler PF.. Superbubbles revisited. Algorithms Mol Biol 2018;13:16. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btae456-B11] Gärtner F, Stadler PF.. Direct superbubble detection. Algorithms 2019;12:81. [Google Scholar]

[btae456-B12] Gautreau G, Bazin A, Gachet M. et al. PPanGGOLiN: depicting microbial diversity via a partitioned pangenome graph. PLoS Comput Biol 2020;16:e1007732. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btae456-B13] Hall MB, Rabodoarivelo MS, Koch A. et al. Evaluation of nanopore sequencing for Mycobacterium tuberculosis drug susceptibility testing and outbreak investigation: a genomic analysis. Lancet Microbe 2023;4:e84–92. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btae456-B14] Handsaker RE, Van Doren V, Berman JR. et al. Large multiallelic copy number variations in humans. Nat Genet 2015;47:296–303. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btae456-B15] He Y, Chu Y, Guo S. et al. T2T-YAO: a telomere-to-telomere assembled diploid reference genome for Han Chinese. Genom Proteomics Bioinf 2023;21:1085–100. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btae456-B16] Hickey G, Monlong J, Ebler J, et al. Pangenome graph construction from genome alignments with minigraph-cactus. Nat Biotechnol 2023;42:663–73. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btae456-B17] Hulsen T, de Vlieg J, Alkema W.. BioVenn—a web application for the comparison and visualization of biological lists using area-proportional venn diagrams. BMC Genom 2008;9:488. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btae456-B18] Hyatt D, Chen G-L, Locascio PF. et al. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinf 2010;11:119. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btae456-B19] Johnson R, Pearson D, Pingali K.. 1994. The program structure tree: computing control regions in linear time. In: Sarkar V, Ryder BG, Soffa ML (eds.), Proceedings of the ACM SIGPLAN’94 Conference on Programming Language Design and Implementation (PLDI). Orlando, Florida, USA: ACM, June 20–24, 1994, 171–85. [Google Scholar]

[btae456-B20] Ju X-C, Hou Q-Q, Sheng A-L. et al. The hominoid-specific gene TBC1D3 promotes generation of basal neural progenitors and induces cortical folding in mice. Elife 2016;5:e18197. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btae456-B21] Li H. Protein-to-genome alignment with miniprot. Bioinformatics 2023;39:btad014. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btae456-B22] Li H, Feng X, Chu C. et al. The design and construction of reference pangenome graphs with minigraph. Genome Biol 2020;21:265. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btae456-B23] Li W, Godzik A.. CD-HIT: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 2006;22:1658–9. [DOI] [PubMed] [Google Scholar]

[btae456-B24] Liao W-W, Asri M, Ebler J. et al. A draft human pangenome reference. Nature 2023;617:312–24. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btae456-B25] Makova KD, Pickett BD, Harris RS. et al. The complete sequence and comparative analysis of ape sex chromosomes. Nature2024;630:401–11. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btae456-B26] Marin M, Vargas R, Harris M. et al. Benchmarking the empirical accuracy of short-read sequencing across the m. tuberculosis genome. Bioinformatics 2022;38:1781–7. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btae456-B27] Mercuri E, Sumner CJ, Muntoni F. et al. Spinal muscular atrophy. Nat Rev Dis Primers 2022;8:52. [DOI] [PubMed] [Google Scholar]

[btae456-B28] Nurk S, Koren S, Rhie A. et al. The complete sequence of a human genome. Science 2022;376:44–53. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btae456-B29] Nurk S, Walenz BP, Rhie A. et al. HiCanu: accurate assembly of segmental duplications, satellites, and allelic variants from high-fidelity long reads. Genome Res 2020;30:1291–305. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btae456-B30] Onodera T, Sadakane K, Shibuya T. Detecting superbubbles in assembly graphs. Lect Notes Comput Sci 2013;8126:338–48. [Google Scholar]

[btae456-B31] Page AJ, Cummins CA, Hunt M. et al. Roary: rapid large-scale prokaryote pan genome analysis. Bioinformatics 2015;31:3691–3. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btae456-B32] Paten B, Eizenga JM, Rosen YM. et al. Superbubbles, ultrabubbles, and cacti. J Comput Biol 2018;25:649–63. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btae456-B33] Peker N, Schuele L, Kok N. et al. Evaluation of whole-genome sequence data analysis approaches for short- and long-read sequencing of mycobacterium tuberculosis. Microb Genom 2021;7:000695. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btae456-B34] Rautiainen M, Nurk S, Walenz BP. et al. Telomere-to-telomere assembly of diploid chromosomes with verkko. Nat Biotechnol 2023;41:1474–82. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btae456-B35] Schneider VA, Graves-Lindsay T, Howe K. et al. Evaluation of GRCh38 and de novo haploid genome assemblies demonstrates the enduring quality of the reference assembly. Genome Res 2017;27:849–64. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btae456-B36] Seemann T. Prokka: rapid prokaryotic genome annotation. Bioinformatics 2014;30:2068–9. [DOI] [PubMed] [Google Scholar]

[btae456-B37] Shaw LP, Chau KK, Kavanagh J, et al. Niche and local geography shape the pangenome of wastewater- and livestock-associated enterobacteriaceae. Sci Adv 2021;7:eabe3868. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btae456-B38] Snobre J, Meehan CJ, Mulders W. et al. Frameshift mutations in Rv0678 preserve bedaquiline susceptibility in Mycobacterium tuberculosis by maintaining protein integrity. In: SSRN, 2024. 10.2139/ssrn.4769101. [DOI]

[btae456-B39] Steinberg KM, Antonacci F, Sudmant PH. et al. Structural diversity and African origin of the 17q21.31 inversion polymorphism. Nat Genet 2012;44:872–80. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btae456-B40] Steinegger M, Söding J.. Mmseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat Biotechnol 2017;35:1026–8. [DOI] [PubMed] [Google Scholar]

[btae456-B41] Sudmant PH, Kitzman JO, Antonacci F, et al. Diversity of human copy number variation and multicopy genes. Science 2010;330:641–6. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btae456-B42] Sung W-K, Sadakane K, Shibuya T. et al. An o(m log m)-time algorithm for detecting superbubbles. IEEE/ACM Trans Comput Biol Bioinform 2015;12:770–7. [DOI] [PubMed] [Google Scholar]

[btae456-B43] Taylor C, Crosby I, Yip V. et al. A review of the important role of CYP2D6 in pharmacogenomics. Genes (Basel) 2020;11:1295. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btae456-B44] Tonkin-Hill G, Corander J, Parkhill J.. Challenges in prokaryote pangenomics. Microb Genom 2023;9:mgen001021. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btae456-B45] Tonkin-Hill G, MacAlasdair N, Ruis C. et al. Producing polished prokaryotic pangenomes with the panaroo pipeline. Genome Biol 2020;21:180. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btae456-B46] Vargas R Jr, Luna MJ, Freschi L. et al. Phase variation as a major mechanism of adaptation in Mycobacterium tuberculosis complex. Proc Natl Acad Sci U S A 2023;120:e2301394120. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btae456-B47] Vollger MR, Dishuck PC, Harvey WT, et al. Increased mutation and gene conversion within human segmental duplications. Nature 2023;617:325–34. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btae456-B48] Wenger AM, Peluso P, Rowell WJ. et al. Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. Nat Biotechnol 2019;37:1155–62. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btae456-B49] Wick RR, Schultz MB, Zobel J. et al. Bandage: interactive visualization of de novo genome assemblies. Bioinformatics 2015;31:3350–2. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btae456-B50] Yang C, Zhou Y, Song Y. et al. The complete and fully-phased diploid genome of a male han Chinese. Cell Res 2023;33:745–61. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btae456-B51] Zerbino DR, Birney E.. Velvet: algorithms for de novo short read assembly using de bruijn graphs. Genome Res 2008;18:821–9. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btae456-B52] Zhou Y, Song L, Li H.. Full resolution HLA and KIR genes annotation for human genome assemblies. Genome Res 2024;gr.278985.124. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btae456-B53] Zhou Z, Charlesworth J, Achtman M.. Accurate reconstruction of bacterial pan- and core genomes with PEPPAN. Genome Res 2020;30:1667–79. [DOI] [PMC free article] [PubMed] [Google Scholar]

PERMALINK

Exploring gene content with pangene graphs

Heng Li

Maximillian Marin

Maha R Farhat

Roles

Abstract

Motivation

Results

Availability and implementation

1 Introduction

2 Materials and methods

Figure 1.

2.1 Defining pangene graphs

Table 1.

Figure 2.

2.2 Obtaining the protein set

2.3 Selecting non-orthologous genes

2.4 Adjusting pangene graphs

Figure 3.

2.5 Past work on bubble finding

Figure 4.

2.6 Defining generalized bibubbles

2.7 Biedged graphs and net graphs

2.8 Cycle equivalence

Figure 5.

2.9 Identifying generalized bibubbles

3 Results

3.1 Structural variants between two human genomes

3.2 Analyzing 100 human haplotypes

3.3 Incorporating 10 great ape haplotypes

Figure 6.

3.4 Analyzing 152 M.tuberculosis strains

Table 2.

Figure 7.

3.5 Analyzing 50 E.coli strains

4 Discussion

Acknowledgements

Contributor Information

Conflict of interest

Funding

Data availability

References

Associated Data

Data Availability Statement

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases