Abstract
We previously introduced Giraffe, a short-read-to-pangenome graph mapper available in the vg pangenomics toolkit. Giraffe was fast and accurate for mapping short reads to human-scale pangenomes, but struggled with long reads. Long reads present a unique challenge to pangenome mapping algorithms due to their length and error profile, which allow them to take more topologically complex paths through the pangenome graph and increase the possible search space for the algorithm. We present updates to Giraffe that allow it to quickly and accurately map long reads to pangenome graphs. For both short and long reads, Giraffe mapping to a pangenome containing data from more than 450 human haplotypes, generated by the Human Pangenome Reference Consortium, is comparable in speed to linear mappers to human reference genomes; Giraffe is also over an order of magnitude faster than GraphAligner, the current state-of-the-art long-read-to-pangenome mapper. Its alignments produce similar or improved small and structural variant calling results, compared to those from commonly used graph-based and linear mappers. We additionally demonstrate using Giraffe’s long read alignments in a pangenome-guided assembly workflow, which is capable of producing more contiguous local assemblies than Hifiasm in our test regions.
2. Introduction
Pangenomes are emerging as a promising alternative to the standard single linear reference genome, because they are capable of representing the genetic diversity present in a population [1, 2]. There have been several recent efforts to establish human reference pangenomes, either targeted (at particular countries [3], ancestries [4], or sets of ethnicities [5]), or global across the human population [6]. Pangenomes are commonly represented as variation graphs, where nodes represent nucleotide sequences, edges represent adjacencies between sequences, and paths through the graph represent the original haplotype sequences of the pangenome [7]. These graph structures are an efficient representation of pangenomes, but the complex topology of graphs makes them more difficult to work with than linear reference genomes, especially as the graphs grow in sequence content and genetic diversity. In particular, mapping sequencing reads, a critical first step for many common bioinformatics pipelines, requires specialized data structures and algorithms to deal efficiently with a graph reference.
Read mappers that align to a standard linear reference [8, 9, 10] commonly use a seed-and-extend approach [11]. Short seed alignments are found using an index of the reference genome. Many algorithms then partition seeds into groups that are likely to have come from the same alignment to the reference. Seeds are then extended into the full base-level alignment using dynamic programming.
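As a minimal illustration of the seeding and grouping steps on a linear reference, the sketch below indexes exact k-mers and groups seed hits by diagonal (reference position minus read position). The function names, the exact-match k-mer index, and the diagonal grouping rule are illustrative assumptions rather than the method of any mapper cited here, and the dynamic programming extension step is omitted.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map each k-mer to the list of reference positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def find_seeds(read, index, k):
    """Return (read_pos, ref_pos) pairs for every exact k-mer match."""
    seeds = []
    for read_pos in range(len(read) - k + 1):
        for ref_pos in index.get(read[read_pos:read_pos + k], []):
            seeds.append((read_pos, ref_pos))
    return seeds


def group_by_diagonal(seeds):
    """Group seeds that share a diagonal (ref_pos - read_pos).

    Seeds on one diagonal are consistent with a single gapless alignment,
    so each group is a candidate for extension (hypothetical grouping rule).
    """
    groups = defaultdict(list)
    for read_pos, ref_pos in seeds:
        groups[ref_pos - read_pos].append((read_pos, ref_pos))
    return dict(groups)


reference = "ACGTACGGTCA"
read = "CGGTC"
index = build_kmer_index(reference, k=3)
seeds = find_seeds(read, index, k=3)
groups = group_by_diagonal(seeds)
```

In this toy case all three seeds fall on the same diagonal, so a single group would be passed on to extension.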
Existing graph mappers use versions of these basic components that have been adapted to a graph context. Early graph alignment algorithms were designed to align reads to partial order alignment (POA) graphs that themselves represent multiple sequence alignments [12]. These POA aligners generalized standard dynamic programming algorithms to align to acyclic graphs [12, 13]. The original vg map algorithm further extended the POA algorithm to cyclic graphs by “unrolling” the graph to remove cycles, then applying POA to implement its extension step [7]. GraphAligner, currently the only practical tool for mapping long reads to arbitrary graphs, generalizes the linear Shift-And exact string matching algorithm [14] for its seeding step, and Myers’ bitvector alignment algorithm [15] for its extension step, to align to cyclic graphs [16].
The original short read Giraffe relied on embedded haplotype sequences to map to the graph [17]. This set of embedded haplotypes could first be personalized using haplotype sampling, to remove local haplotypes not relevant to a particular sample [18]. Next, seeds were found by using a minimizer index over the haplotypes, and clustered by using a distance index to find minimum graph distances. Seeds within promising clusters were gaplessly extended along the haplotypes, and finally the extension was completed using dynamic programming against the graph. Due to the fast gapless extension step, in which the entire alignment was usually found, Giraffe was very fast for short reads [17]. However, being designed for short reads, where gaps are rare and a read will usually follow a single known haplotype for its entire length, Giraffe quickly overwhelmed our computational resources when applied to long reads.
Long reads are another rapidly growing technology with great potential to improve genomic analyses. Currently, High Fidelity (HiFi) reads from Pacific Biosciences have lengths between 10 and 25 kbp and accuracy of up to 99.95% [19, 20]. The most recent R10 chemistry from Oxford Nanopore Technologies achieves read lengths on the order of 10-100 kbp, with ultra-long reads achieving lengths up to 882 kbp [21], and accuracy of over 99% [22, 23]. Because of their increased length, these technologies are useful for resolving complex and large structural variants, and for identifying long-range linkages between alleles. Both HiFi and R10 sequencing technologies have gained widespread use in large-scale genomics projects [24, 25] due to this improved resolution ability, as well as their lowering cost and high throughput.
Despite the growing popularity of both pangenomes and long reads, there are currently few tools for mapping long reads to pangenomes. The length and error profile of the reads makes every stage of a mapping algorithm more difficult. Gapped alignment is a particularly challenging problem with long reads: even in a linear context, long reads can produce dynamic programming matrices that are unmanageably large. The problem becomes even harder in a graph context, as aligners must also navigate potentially large and complex graph topologies.
To deal with the increased difficulty of mapping long reads, many tools [10, 26, 27, 28] build on the seed-and-extend approach by adopting a seed-and-chain or seed-chain-extend strategy [11]. After seeds are found, and before they are extended into alignments, chains of co-linear seeds are identified. Then, extension consists of smaller alignment problems between consecutive seeds in a chain.
Chaining requires an ordering and a distance metric between seeds, both of which are ambiguous and more expensive to compute in a graph. Chaining is particularly difficult in graphs that contain cycles, as a node may occur before or after itself or any other node in the cycle [29, 28, 30].
Existing graph chaining algorithms use distance calculations based on linear approximations of the graph. Some tools [29, 28, 31] are based on an algorithm from Mäkinen et al. [30] that uses a minimum path cover of the graph to determine reachability and distance. Other tools might enumerate shortest-length walks between graph positions (Minigraph [26]) or estimate linear distances based on chains in the snarl decomposition (GraphAligner [32]). As a result of these different heuristics for overcoming the challenges of long read-to-pangenome mapping, different mappers have different advantages and limitations.
GraphAligner, for example, can accurately map reads to arbitrary graphs, but it is slow, particularly as the complexity of the graph increases [32]. Minigraph [26], which was originally designed to map assemblies to graphs containing only structural variants (SVs), struggles to map to graphs containing small insertions or deletions (indels) or single nucleotide variants (SNVs). PanAligner [27] uses its own graph chaining algorithm with components from Minigraph, and GraphChainer [28] uses its own graph chaining algorithm with components from GraphAligner. Minichain uses a haplotype-aware chaining algorithm that prioritizes alignments that remain on the same haplotype [31]. PanAligner has the same limitations as Minigraph and was an order of magnitude slower than GraphAligner in our initial tests. GraphChainer and Minichain can only map to graphs without cycles.
We present an updated version of Giraffe capable of mapping both long and short reads. The long read Giraffe algorithm uses a seed-chain-extend framework, using a novel data structure for computing distances for chaining. We show that Giraffe is among the fastest graph-based or linear read mappers, while still being competitive in mapping accuracy and downstream variant calling results. Finally, we use Giraffe’s long read mappings in a haplotype-resolved pangenome-guided assembly workflow for assembling a specific region of interest, solving a relaxed version of the haplotype-resolved de novo assembly problem addressed by tools like Hifiasm [33].
3. Results
3.1. Long read Giraffe algorithm overview
The long read Giraffe algorithm is adapted from the original short read Giraffe algorithm [17] to efficiently deal with the longer sequence and higher error rate of long reads (Figure 1). The seed-and-extend short read algorithm has been expanded into a seed-chain-extend algorithm of the kind typical of long read mappers [11], with additional improvements that are useful in a long read context. For seeding, the long read algorithm uses weighted minimizer seeds, which prioritize seeds with fewer hits in the graph (Figure 1 C). For chaining, we address the challenges of complex graph topology using a novel data structure for finding graph-distances between seeds that are supported by the corresponding read-distances (Figure 1 D). Using this data structure, the algorithm computes co-linear chains of seeds that occur in the same order in the read and graph, which form the backbone of the final alignment (Figure 1 E). Finally, for extension, base-level alignment is done in between consecutive seeds in the chain using alignment algorithms that limit the problem to alignment either against linear haplotypes or within smaller bands of the alignment matrix (Figure 1 F).
Figure 1: Overview of the long read Giraffe algorithm.
(A) The inputs to the algorithm: the query read and the reference variation graph. (B) The variation graph with its corresponding snarl tree. Rectangles in the snarl tree represent chains and ovals represent snarls. Chains on the snarl tree are colored according to their node children. Nodes are not shown on the snarl tree. (C) Seeds are found using a minimizer index. Arrows on the read and reference graph represent seed alignments. (D) A zip code tree is constructed to represent the connectivity between seeds on the graph using snarl tree relationships. Nodes in the zip code tree represent either a seed on the graph, or the boundary of a snarl or chain containing a seed. The nodes are ordered according to a pre-and-post-order traversal of the snarl tree. Edges connect adjacent seeds and boundaries, and are labeled with minimum distances from the variation graph. Distances between seeds in the graph can be calculated in a traversal of the zip code tree. (E) Co-linear chaining is done by dynamic programming over the seeds to maximize the coverage over the read and minimize the gap cost of consecutive seeds. The cost of a gap between seeds is found using read distances from the read and graph distances from a traversal of the zip code tree. (F) Base-level alignment is done between adjacent seeds in a chain, and out from the first and last seeds to produce the full-length alignment.
3.2. Experiment setup: graphs, reads, and mappers
We compared Giraffe to a variety of linear mappers aligning to the CHM13 reference genome, and to graph mappers aligning to graphs constructed from the Human Pangenome Reference Consortium’s (HPRC) release 2 assemblies (Section S5.1). We used the HPRC graph constructed with just Minigraph (HPRC-Minigraph) as well as the graph constructed with the Minigraph-Cactus pipeline with allele frequency filtering (HPRC-d46) [6]. In a previous study [18], we found that short read Giraffe performed better when using personalized pangenome graphs (see Section 2). We therefore tested Giraffe on graphs that were haplotype-sampled to contain only 16 haplotypes and whichever of the two linear references was appropriate (HPRC-Sampled16). The haplotype-sampled graph used a base graph constructed with the Minigraph-Cactus pipeline with selective clipping (HPRC-clipped). Our haplotype-sampling pipeline requires k-mers from a whole-genome read set, so we generated a different sampled graph for each sample and sequencing technology, by using an appropriate set of real reads (Section S5.1). Finally, we also used Giraffe to map to a negative control variation graph containing only CHM13 (also referred to as CHM13).
We mapped real read sets from long read (HiFi and R10) and short read (Illumina and Element) sequencing technologies (Section S5.2).
We compared Giraffe to Minimap2 [10], Winnowmap [34], GraphAligner [16, 32], and Minigraph [26] for mapping HiFi and R10 reads; to PBMM2 [10, 35] for mapping HiFi reads; and to Minimap2 and BWA-MEM [9] for mapping Illumina and Element reads (Section S5.3). Giraffe was run using the long read algorithm on HiFi and R10 reads, and using the paired-end short read algorithm for Illumina and Element reads. GraphAligner was run using the default parameters on simulated reads and when aligning R10 reads to the HPRC-Minigraph graph. With the default parameters, however, we were unable to finish mapping real HiFi or R10 reads to the HPRC-d46 graph, or real HiFi reads to the HPRC-Minigraph graph, within the seven-day time limit of our servers. For these cases we instead used a parameter set recommended by the developer to optimize for speed, which we call the “fast” parameters (Section S5.3).
3.3. Mapping evaluation
We used simulated read datasets to evaluate mapper performance on HiFi, R10, Illumina, and Element reads. Each dataset contained 1 million reads simulated to match the error profiles of the real datasets (Section S5.2).
Giraffe’s accuracy was similar to that of other mappers, and it had the highest true positive rate for HiFi reads when mapping to the allele frequency-filtered HPRC-d46 graph (Figure 2, Tables S1-S4). Giraffe generally had better mapping quality calibration than other mappers (Figure S4) and the fewest incorrectly mapped reads with a mapping quality (MAPQ) of 60, meaning that it was less likely to overestimate its accuracy (Tables S1-S4).
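To make the notion of calibration concrete, recall that MAPQ is a Phred-scaled estimate of the probability that a mapping is wrong, so a MAPQ of 60 claims an error rate of about one in a million. The sketch below compares that claimed error rate with the error rate observed among simulated reads at each MAPQ. The function names and the input format of (MAPQ, correct?) pairs are assumptions made for illustration, not the evaluation code used in this study.

```python
def mapq_to_error_probability(mapq):
    """Phred-scaled conversion: MAPQ = -10 * log10(P(mapping is wrong))."""
    return 10 ** (-mapq / 10)


def calibration_table(reads):
    """Summarize observed versus claimed error per MAPQ value.

    `reads` is an iterable of (mapq, is_correct) pairs (hypothetical format).
    Returns {mapq: (read_count, observed_error_rate, claimed_error_rate)}.
    """
    counts = {}
    for mapq, is_correct in reads:
        total, wrong = counts.get(mapq, (0, 0))
        counts[mapq] = (total + 1, wrong + (0 if is_correct else 1))
    return {
        mapq: (total, wrong / total, mapq_to_error_probability(mapq))
        for mapq, (total, wrong) in sorted(counts.items())
    }


toy_reads = [(60, True), (60, True), (60, True), (20, True), (20, False)]
table = calibration_table(toy_reads)
```

A mapper is well calibrated when the observed error rate at each MAPQ is close to the claimed one; an observed rate well above the claimed rate indicates overestimated confidence.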
Figure 2: Mapping accuracy on simulated reads.

ROC plots for one million reads simulated to resemble HiFi (A), R10 (B), Illumina (C), and Element (D) reads, zoomed in to show the separation of the curves. The results are stratified by mapping quality; the size of each point represents the log-scaled number of reads with the given mapping quality. X-axes are log-scaled. Linear mappers (Minimap2, PBMM2, Winnowmap, BWA-MEM) were used to align to CHM13, Giraffe to the linear CHM13 graph, and graph mappers (Giraffe, GraphAligner, Minigraph) to the HPRC-d46 graph (Giraffe and GraphAligner), HPRC-Minigraph graph (Minigraph), and the HPRC-Sampled16 graph (Giraffe).
We also mapped real read sets for each sequencing technology (Section S5.2), to evaluate unaligned bases, and to assess speed and memory use. Because GraphAligner exceeded the seven-day time limit of our servers in some cases, we re-ran it using a parameter set recommended by the developer (Section S5.3).
We evaluated the base-level alignments of real long reads by comparing the total number of different types of unaligned bases (soft-clipped, hard-clipped, and unmapped) that each mapper produced; for short reads, we looked at the fraction of unmapped bases (Figure 3). For long reads, again, results were similar between Giraffe and most other mappers. For short reads, we found that the existing short read Giraffe algorithm could leave noticeably more reads unmapped than other mappers. We noted the same pattern of unmapped reads when mapping simulated reads. Of the simulated reads that Giraffe left unmapped, all were incorrectly mapped by BWA-MEM and Minimap2.
Figure 3: Unaligned bases.
Bar plots of the percent of base pairs that were unaligned due to being soft-clipped, hard-clipped, or in reads that were left unmapped, in real HiFi (A) and R10 (B) reads, and the percent of bases that were unmapped in Illumina (C) and Element (D) reads. Bases that were unaligned due to being in unmapped reads are represented by the lower, darker portion of each bar in (A) and (B), and are the entirety of each bar in (C) and (D).
Giraffe was about an order of magnitude faster than GraphAligner’s fast mode (Figure 4). Giraffe’s speed and memory use varied with the complexity of the reference graph it aligned to, with the simplest CHM13 graph being the fastest and lowest memory and the most complex HPRC graphs (HPRC-d46 and HPRC-Sampled16) being the slowest with the biggest memory footprint (Figure 4).
Figure 4: Runtime and memory use.
Runtime (A, B, C, and D) and memory use (E, F, G, and H) of each mapper on real PacBio HiFi (A and E), Oxford Nanopore R10 (B and F), Illumina (C and G), and Element (D and H) reads. The lower, darker portion of each runtime bar shows the part of the runtime used for loading and building indexes, as reported by each mapper, including the time for haplotype sampling for Giraffe.
* GraphAligner exceeded the 7-day time limit of our servers when run with the default settings on HiFi and R10 reads on the HPRC-d46 graph, and on HiFi reads on the HPRC-Minigraph graph. It was re-run using a parameter set (GraphAligner-fast) that was provided by the developer (Section S5.3).
3.4. Small variant calling with DeepVariant
To more robustly assess the quality of Giraffe’s long read alignments, we compared the performance of DeepVariant [36] using reads mapped with Giraffe and Minimap2. For Giraffe, PacBio and R10 reads from the HPRC (for HG002) were mapped to the HPRC-d46 graph and the haplotype-sampled HPRC-Sampled16 graph, then surjected onto CHM13. For Minimap2, these reads were mapped directly to CHM13 (see Section S5.3). Variant calling was done for both mappers using the ONT_R104 model of DeepVariant for R10 reads, and the PACBIO model for HiFi reads mapped with Minimap2. For HiFi reads mapped with Giraffe, we used a custom model trained on reads mapped by Giraffe (see Section S5.4.1). Calling accuracy was evaluated using hap.py version 0.3.15 [37] against the Genome in a Bottle (GIAB) 4.2.1 truth set [38, 39]. We did not assess short read variant calling accuracy because this has been tested extensively [17, 18, 40] and has not changed with the new release of Giraffe.
Giraffe generally had slightly higher precision, recall, and F1 scores than Minimap2, with the haplotype sampling pipeline providing further improvements (Figure 5 A-D).
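For reference, the three summary metrics can be computed from counts of true positive, false positive, and false negative calls as sketched below. The function name is an assumption, and the counts are taken as given: benchmarking tools apply their own variant-matching rules to decide what counts as a true positive, which this sketch does not attempt to reproduce.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Compute precision, recall, and F1 from call counts.

    precision = TP / (TP + FP): fraction of called variants that are real.
    recall    = TP / (TP + FN): fraction of real variants that were called.
    F1        = harmonic mean of precision and recall.
    """
    called = true_positives + false_positives
    truth = true_positives + false_negatives
    precision = true_positives / called if called else 0.0
    recall = true_positives / truth if truth else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Toy counts, not results from this study.
precision, recall, f1 = precision_recall_f1(90, 10, 30)
```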
Figure 5: Variant calling and genotyping.
SNPs (A and B) and indels (C and D) were called using DeepVariant with HiFi (A and C) and R10 (B and D) reads. SV genotyping was performed with Giraffe on the HPRC-d46 graph (Giraffe), Giraffe with a sampled graph with 16 haplotypes (Giraffe sampled), and GraphAligner’s fast mode on the HPRC-d46 graph, all with vg call. SV calling and genotyping was done using Minimap2 with Sniffles2. SV genotyping accuracy was compared for HiFi (E) and R10 (F) reads. Giraffe was used to map reads to the HPRC-d46 graph and to the HPRC-Sampled16 graph. GraphAligner’s fast mode was used to map reads to the HPRC-d46 graph. Minimap2 was used to map reads to CHM13.
3.5. Structural variant genotyping with vg call
We assessed the performance of Giraffe on structural variant genotyping with vg call using a pipeline based on Hickey et al. [41] (Section S5.5). vg call was run on Giraffe’s mappings to either the HPRC-d46 graph or the HPRC-Sampled16 graph. SV genotyping in each condition used the same graph that was used for mapping, except that for the HPRC-Sampled16 graph, a version was used with both linear references included. We compared these two Giraffe conditions against two competitors: vg call on GraphAligner’s fast mode alignments to the HPRC-d46 graph, and the linear SV caller and genotyper Sniffles2 [42, 43] on Minimap2’s alignments to the CHM13 linear reference. Genotyping results were evaluated against the Genome in a Bottle Consortium’s T2T Q100 HG002 truth set [44] using Truvari [45] (Section S5.5).
Despite being restricted to only the variants present in the graph, the Giraffe-and-vg call pipeline with the haplotype-sampled graph performed comparably with Minimap2-and-Sniffles2, with the highest recall of any of the methods tested and a higher F1 score than Minimap2-and-Sniffles2 for R10 reads (Figure 5 E and F, Tables S13, S14).
3.6. Validation and application of pangenome-guided assembly for complex regions
We developed a prototype pangenome-guided assembler (PGA) (see Section 5.3) that leverages long read Giraffe alignments for targeted regional assembly. Briefly, Giraffe is used to map long reads to a pangenome. Sites of variation within the pangenome graph, termed bubbles or snarls, are used to determine haplotypic anchors in the reads, and then these anchors and their connections in the reads are used to create a phased genome assembly. Using the anchors derived from common variant sites, the assembly process is able to assemble a sample’s haplotype sequences, including private variations not present in the mapped-to pangenome.
We initially evaluated its performance using HG002 R10 reads (Section S6.1), across six selected loci: three structurally simple regions (CHR13-1Mb, CHR8-700kb, CHR2-600kb) and three complex, medically relevant regions with segmental duplications (RCCX, PDPK, GOLGA8).
For all six regions, PGA produced accurate phased assemblies (Figures S5-S10). For comparison, we generated a complete Hifiasm assembly of the sample, aligned it to the HG002v1.0.1 reference with Minimap2, and retrieved the contigs overlapping the target region, trimming them to the aligned segments for evaluation. The evaluation showed comparable mismatch rates between both assemblies, even though PGA does not perform explicit read correction. In certain cases (Figures S6, S8), PGA produced better assembly contiguity than Hifiasm.
Following validation using the HG002 benchmark sample, we applied PGA to assemble the complex RCCX locus (Section S6.2) in two Congenital Adrenal Hyperplasia (CAH) cases, who had been analyzed in a previous study published by our group [46, 47], and compared the assemblies to those from a specialized method (Parakit) developed for analyzing this challenging locus [47]. With PGA, we were able to directly assemble the RCCX haplotypes for these two CAH probands (IDs DSDTRN09 and DSDTRN04 in Negi et al. [46]), and accurately detect the pathogenic compound heterozygous variants within these haplotype-resolved assemblies (Figure 6.2), recapitulating the earlier results obtained using Parakit and demonstrating accurate assembly of variants not found within the pangenome.
Figure 6: Pangenome-guided assembly workflow and application to rare disease cases.
1. PGA workflow overview: The inputs to the workflow are the reference variation graph and sample long reads. (A) Personalized pangenome construction using graph-unique k-mer classifications (faded node is not part of the personalized graph). (B) Reads are aligned to the personalized graph using long read Giraffe. (C) Read mappings at leaf snarls are used to generate anchors (potentially haplotype-informative k-mers enabling separation of reads). (D) The Shasta assembler constructs an anchor graph, with anchors as vertices and shared reads as edges. Its detangling algorithm traverses this graph to resolve haplotype paths and generate a phased assembly.
2. Resolving RCCX haplotypes in two probands: For each proband, the figure displays PGA assemblies, sequencing statistics, the alignment profile of the phased haplotypes against an RCCX collapsed pangenome (using Parakit’s visualization), and the final RCCX haplotype structure in the sample. In the alignment profiles, when a haplotype traverses multiple RCCX modules (nodes: red = pseudogene-specific; green = gene-specific; blue = shared between SD modules) by looping through the pangenome, the different traversals are separated vertically and shown as multiple rows. The gene annotation track shows CYP21A2 and CYP21A1P annotations. Vertical dotted lines highlight variant positions across panels. (A) Proband DSDTRN09: Hap-1 alignment through the pseudogenes followed by the genes reveals HERV-K deletion in C4A and a gene-converted pathogenic Int2G SNV (red circle). Hap-2 contains a fusion located beyond the CYP21A2 gene in TNXB (blue triangle). (B) Proband DSDTRN04: Assembly alignment reveals a gene-converted pathogenic Q319X nonsense SNV (pink circle) in Hap-1 and a hybrid 5’-CYP21A1P/CYP21A2-3’ fusion in Hap-2.
4. Discussion
As the field of pangenomics continues to expand, there is a growing need for efficient tools for working with increasingly large and complex pangenome graphs. The latest version of Giraffe, part of our vg toolkit, is now capable of mapping both short and long reads to graphs of varying complexity—including to graphs constructed from more than 450 haplotypes from the HPRC’s second release.
For both short and long reads, Giraffe achieves similar read mapping, and overall slightly better variant calling accuracy, relative to existing best-of-breed linear and pangenome mappers, while being comparable in speed to the best linear mappers. Notably, Giraffe is over an order of magnitude faster than GraphAligner, the current state-of-the-art long read-to-pangenome mapper.
To demonstrate the potential of long-read-to-pangenome mapping, we developed a prototype pangenome-guided assembly pipeline. With this pipeline, we were able to produce local, complete, phased assemblies of targeted regions of the genome, including those containing highly similar segmental duplications, that were more contiguous than the assemblies produced by Hifiasm and comparably accurate. Assembling the RCCX loci from two patients, we were able to accurately recover complex haplotypes that contain structural variations not present in the pangenome. This is in contrast to existing methods for structural variant calling against a pangenome, like the vg call-based methods evaluated here, which generally require the variation to be known in advance.
While further scaling and testing is required to convert this prototype assembler into a fully fledged tool, such a hybrid assembly process has great promise for future, efficient genome inference. It avoids the need for all-against-all read alignment (a step necessary during de novo assembly), and, by providing a strong prior for likely haplotypes, provides a framework that might facilitate more accurate assembly, particularly, we suspect, at lower sequencing coverages.
We demonstrate Giraffe’s performance on human graphs containing cycles and both large and small variants, but it is still untested on the more complex graphs constructed with the PanGenome Graph Builder (PGGB) [48] and on more genetically diverse species. PGGB’s reference-free approach to graph building provides an unbiased view of the homology within and among genomes. However, due to the complex topology of the PGGB graphs, we have not been able to construct indexes for Giraffe, and to our knowledge there are no tools capable of mapping reads to these graphs. There remains a need for indexes and algorithms that can handle the large cycles and deeply-nested variants present in PGGB graphs; we are actively working to extend Giraffe in this direction.
The construction of the graph being mapped to also has a profound effect on the downstream alignments and calling results. When graph alignments are projected back onto a linear reference, the base-level alignment of reads will often follow the alignment used to construct the graph. A pathological “spelling” of the graph’s alignment (for example, one providing multiple paths that spell the same sequence) may negatively affect downstream results, even if the read was mapped to the correct location and represents the correct sequence.
We found ourselves obliged to work around this effect (see Section S4.2). We are actively pursuing new graph construction techniques that we think will allow for building graphs that produce graph alignments that in turn produce more useful linear alignments. We also suspect that there may be an opportunity here to revive the field of indel realignment, seemingly last popular in the era of GATK 3. We hypothesize that an indel realigner with long read support, ideally working in pangenome space, would improve point variant calling accuracy from pangenome alignments. Unfortunately, we know of no suitable existing tool.
There is still work to be done to further improve the Giraffe algorithm. We currently use the original Giraffe algorithm for mapping paired-end short reads, but we expect results to improve when adding pairing support to the long read algorithm. Furthermore, while the improvements in variant calling demonstrated are currently modest, we believe that there is considerable room for further improvement. One future idea is to score recombinations, and so penalize read mappings that are haplotype inconsistent, as is often the case when mapping reads to paralogs [31]. Such recombination awareness also opens the possibility to assign reads to local haplotypes and use these local haplotypes more directly in downstream applications.
5. Online Methods
5.1. Variation graphs and indexes
Giraffe is implemented as part of the vg toolkit, which uses a variation graph model for representing pangenomes [7]. A variation graph is a bidirected graph in which nodes are labeled with nucleotide sequences, and edges represent possible adjacencies between these sequences. Nodes in a variation graph have two sides, which are arbitrarily designated as left and right. Edges connect pairs of node sides. Valid walks through the graph must enter and exit opposite sides of a node, so each node visit along the walk has an orientation. A forward (i.e. left to right) traversal of a node represents the forward sequence of the nucleotides, whereas a backwards traversal represents its reverse complement. Walks through the graph can therefore be used to represent longer sequences, found by concatenating the sequences of the nodes. We call walks used for this purpose paths, and they are usually stored along with the variation graph that they are inside of.
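The bidirected model described above can be illustrated with a small sketch that checks a walk against the edges between node sides and spells out its sequence. The data layout here (node sides as `(node_id, "L")` or `(node_id, "R")` tuples, edges as unordered pairs of sides, visits as `(node_id, is_reverse)`) is an assumption chosen for readability and is not the representation used inside vg.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(sequence):
    """Return the reverse complement of a nucleotide sequence."""
    return sequence.translate(COMPLEMENT)[::-1]


def entry_side(node_id, is_reverse):
    """A forward visit enters on the left side; a reverse visit on the right."""
    return (node_id, "R" if is_reverse else "L")


def exit_side(node_id, is_reverse):
    """A visit always leaves through the side opposite the one it entered."""
    return (node_id, "L" if is_reverse else "R")


def spell_walk(node_sequences, edges, walk):
    """Concatenate node sequences along a walk of (node_id, is_reverse) visits.

    Raises ValueError if two consecutive visits are not joined by an edge.
    """
    pieces = []
    for i, (node_id, is_reverse) in enumerate(walk):
        if i > 0:
            prev_id, prev_reverse = walk[i - 1]
            edge = frozenset(
                [exit_side(prev_id, prev_reverse), entry_side(node_id, is_reverse)]
            )
            if edge not in edges:
                raise ValueError(f"no edge between visit {i - 1} and visit {i}")
        sequence = node_sequences[node_id]
        pieces.append(reverse_complement(sequence) if is_reverse else sequence)
    return "".join(pieces)


# Toy graph: 1 -> 2 -> 3, plus an edge that enters node 3 from its right side.
node_sequences = {1: "GAT", 2: "TA", 3: "CA"}
edges = {
    frozenset([(1, "R"), (2, "L")]),
    frozenset([(2, "R"), (3, "L")]),
    frozenset([(1, "R"), (3, "R")]),
}
forward = spell_walk(node_sequences, edges, [(1, False), (2, False), (3, False)])
inverted = spell_walk(node_sequences, edges, [(1, False), (3, True)])
backward = spell_walk(node_sequences, edges, [(3, True), (2, True), (1, True)])
```

Because edges join node sides rather than ordered pairs of nodes, the same edge set supports a walk and its reverse: `backward` spells the reverse complement of `forward`.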
Paths in a variation graph usually correspond to a small number of reference sequences. In addition to the reference sequences, Giraffe also uses a larger panel of haplotype paths, stored in a Graph Burrows-Wheeler Transform or GBWT index [49]. Originally, GBWT indexes were stored separately from their graphs, but the GBZ format [50] was developed to store them together, by storing the GBWT and a “GBWT graph” of node sequences in the same file, and storing non-haplotype paths along with the haplotypes in the GBWT. Because we found GBZ graphs so easy to work with, Giraffe now always uses the GBZ format. One subtlety is that a GBZ uses the paths in the GBWT to implicitly define the edges of the graph. Thus, a GBZ graph represents the subgraph of some original sequence graph whose edges are supported by the haplotypes and reference paths in the pangenome.
Giraffe uses a distance index [51] to find minimum distances between positions in the graph. The distance index is based on decomposing the graph into snarls [52], which are sites of variation in the graph where paths can diverge to represent different alleles.
Formally, a snarl is a subgraph defined by two boundary node sides that are separable and minimal. Two node sides are separable if cutting their nodes into their component node sides makes the subgraph between them unreachable from the rest of the graph. The two boundary node sides are minimal if no other node side within the subgraph is separable with either of the boundaries. Snarls often occur consecutively, with boundaries on opposite sides of shared nodes between them.
A maximal length run of one or more nodes, with snarls between each node and the next, is called a chain. (These chains of snarls are not to be confused with the chains of alignment seeds used later.) The simplest possible chain is just a single node, so every node exists in some chain. More complex chains can be nested within snarls to represent nested variation, such as a series of SNPs that occur within an insertion. The decomposition of the graph into nested snarls and chains is described by a snarl tree (Figure 1 B). Although formally the decomposition is not unique, we usually work with one decomposition at a time.
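The nesting of snarls and chains can be pictured with a small recursive data structure. The sketch below is a simplified model with hypothetical class and field names, not the snarl tree implementation in vg. It represents a SNP inside an insertion and finds the snarls that contain no nested snarls.

```python
# Toy snarl tree: chains hold nodes and snarls, snarls hold child chains.
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Snarl:
    start: int  # boundary node at which the snarl begins
    end: int    # boundary node at which the snarl ends
    children: list[Chain] = field(default_factory=list)


@dataclass
class Chain:
    items: list[int | Snarl] = field(default_factory=list)


def leaf_snarls(chain: Chain) -> list[Snarl]:
    """Return the snarls under a chain that contain no nested snarls."""
    found = []
    for item in chain.items:
        if isinstance(item, Snarl):
            nested = [s for child in item.children for s in leaf_snarls(child)]
            # A snarl whose child chains are all plain nodes is a leaf.
            found.extend(nested if nested else [item])
    return found


# Hypothetical site: an insertion (nodes 2..5) that itself contains a SNP (3 or 4).
snp = Snarl(start=2, end=5, children=[Chain([3]), Chain([4])])
insertion = Snarl(start=1, end=6, children=[Chain([2, snp, 5])])
top_chain = Chain([1, insertion, 6])
leaves = leaf_snarls(top_chain)
```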
5.2. Giraffe algorithm
The long read Giraffe algorithm follows the common seed-chain-extend strategy [11] for mapping algorithms (Figure 1). As in the original short read algorithm, long read Giraffe uses a minimizer index over the sequences of haplotypes in the GBWT to find seed alignments (Figure 1 C). For long reads, we use weighted minimizers that prioritize those with fewer occurrences in the graph, and add a filter to skip minimizers with too many occurrences (Section S1).
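For readers unfamiliar with minimizers, the following sketch selects the lowest-hash k-mer in each window of a read and then discards k-mers with too many hits in the reference. It is a simplified stand-in under our own assumptions: it uses plain hash-based minimizers and a rarest-first ordering rather than the weighted minimizer scheme of Section S1, and the hash function, parameter values, and hit counts are hypothetical.

```python
# Toy window minimizers with an occurrence filter (illustrative only).
import hashlib


def kmer_hash(kmer: str) -> int:
    """Deterministic 64-bit hash of a k-mer."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minimizers(seq: str, k: int, w: int) -> list[tuple[int, str]]:
    """Return (position, k-mer) for the lowest-hash k-mer of every window of w k-mers."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        best = min(range(start, start + w), key=lambda i: (kmer_hash(kmers[i]), i))
        picked.add((best, kmers[best]))
    return sorted(picked)


def filter_by_occurrences(mins, hit_counts, max_hits):
    """Drop minimizers that are absent or too frequent, and order rarest first."""
    kept = [(pos, kmer) for pos, kmer in mins if 0 < hit_counts.get(kmer, 0) <= max_hits]
    return sorted(kept, key=lambda m: (hit_counts[m[1]], m[0]))


read = "ACGTACGTTGCA"
mins = minimizers(read, k=3, w=4)
# Hypothetical hit counts: every minimizer is unique except one repetitive k-mer.
hit_counts = {kmer: 1 for _, kmer in mins}
repeat = mins[0][1]
hit_counts[repeat] = 1000
usable = filter_by_occurrences(mins, hit_counts, max_hits=10)
```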
The seeds are then chained together (Figure 1 D-E, Section S3). A chain of seeds (not to be confused with the chains of snarls used previously) is an ordered set of seeds that are co-linear in the read and the reference. In a graph context, this means that each seed must be reachable in the graph from the previous seed. Giraffe tries to find optimal chains that maximize coverage of the read while minimizing the expected cost of the gaps between consecutive seeds, where a gap is the difference between the distance separating two seeds in the read and the distance separating them in the graph. We use a gap cost based on that of Minimap2 [10].
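The chaining objective can be sketched as a simple quadratic-time dynamic program. This is an illustration under stated assumptions, not Giraffe's two-pass algorithm: the gap cost below is a placeholder that merely grows with the difference between read and graph distances, and the distance function assumes a linear reference, whereas Giraffe obtains distances from the graph.

```python
# Toy co-linear chaining by dynamic programming (illustrative only).
import math
from typing import Callable, NamedTuple, Optional


class Seed(NamedTuple):
    read_start: int
    ref_start: int
    length: int


def gap_cost(read_gap: int, graph_gap: int) -> float:
    """Placeholder penalty that grows with the gap length difference (hypothetical)."""
    diff = abs(read_gap - graph_gap)
    return 0.0 if diff == 0 else 0.01 * diff + 0.5 * math.log2(diff)


def linear_distance(prev: Seed, cur: Seed) -> Optional[int]:
    """Distance from the end of prev to the start of cur, or None if unreachable."""
    gap = cur.ref_start - (prev.ref_start + prev.length)
    return gap if gap >= 0 else None


def best_chain(seeds: list[Seed], distance: Callable[[Seed, Seed], Optional[int]]) -> list[Seed]:
    """Return the highest-scoring chain of co-linear, reachable seeds."""
    if not seeds:
        return []
    seeds = sorted(seeds, key=lambda s: s.read_start)
    score = [float(s.length) for s in seeds]
    back: list[Optional[int]] = [None] * len(seeds)
    for j, cur in enumerate(seeds):
        for i in range(j):
            prev = seeds[i]
            read_gap = cur.read_start - (prev.read_start + prev.length)
            graph_gap = distance(prev, cur)
            if read_gap < 0 or graph_gap is None:
                continue  # overlapping in the read or unreachable in the reference
            candidate = score[i] + cur.length - gap_cost(read_gap, graph_gap)
            if candidate > score[j]:
                score[j], back[j] = candidate, i
    end: Optional[int] = max(range(len(seeds)), key=score.__getitem__)
    chain = []
    while end is not None:
        chain.append(seeds[end])
        end = back[end]
    return chain[::-1]


# Three consistent seeds and one off-target seed far away in the reference.
seeds = [Seed(0, 100, 10), Seed(20, 120, 10), Seed(40, 5000, 10), Seed(60, 160, 10)]
chain = best_chain(seeds, linear_distance)
```

The off-target seed is excluded because joining it to its neighbors would incur a large gap cost, and no later seed is reachable from it.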
To efficiently calculate reachability and distance in a graph, we use a novel data structure called a zip code tree that represents distances among a set of seeds (Figure 1 D). The zip code tree approximates a placement of seeds in an unrolled, acyclic view of the reference graph, providing a partial ordering of seeds that can be used to find distances for chaining (Figure S1, Section S2). Using distances calculated with the zip code tree, we are able to find reasonable chains in regions of the graph with complex topologies such as duplications (Figure S2) and inversions (Figure S3). The zip code tree is computed using small data structures called zip codes, which store, for each seed, information about the position’s placement on the snarl tree and associated distances (Section S2.3). Because these lightweight data structures are cache-efficient, sorting seeds and calculating distances among them to construct the zip code tree is fast. A zip code tree is constructed for all seeds in a single read. Chaining is done using two passes of a dynamic programming algorithm (Section S3.1).
Finally, the chain (Figure 1 E) is extended into an alignment between the read and the graph. Base-level alignment is done between consecutive seeds in the chain and out from the first and last seeds (Figure 1 F, Section S4). We use a fast wave-front algorithm [53, 54, 55] for aligning substrings of the read to the haplotype sequences in the graph, both for between-seed alignment and anchored tail alignment. If this algorithm fails to find an alignment or is predicted to use too much memory (or for any alignment problem over 233 bases), we align to the graph itself, using a banded global aligner for aligning between seeds and a Single Instruction Multiple Data (SIMD)-accelerated X-drop aligner (called Dozeu) for aligning tails. These algorithms limit the size of the dynamic programming band to constrain the amount of time and memory used.
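To illustrate how limiting the dynamic programming band constrains time and memory, the sketch below computes a banded global alignment score between two strings. It is a deliberately simplified example under our own assumptions: it uses unit edit costs and aligns string to string, whereas the aligners described above work against the graph. The band width is a hypothetical parameter.

```python
# Toy banded global alignment with unit costs (illustrative only).
from typing import Optional


def banded_edit_distance(a: str, b: str, band: int) -> Optional[int]:
    """Global edit distance using only cells with |i - j| <= band.

    Returns None when the length difference exceeds the band, in which case
    no global alignment fits inside it.
    """
    if abs(len(a) - len(b)) > band:
        return None
    inf = float("inf")
    # Row 0: aligning an empty prefix of a to the first j characters of b.
    prev = {j: j for j in range(min(len(b), band) + 1)}
    for i in range(1, len(a) + 1):
        cur = {}
        for j in range(max(0, i - band), min(len(b), i + band) + 1):
            best = prev.get(j, inf) + 1  # deletion from a
            if j == 0:
                best = min(best, i)
            else:
                mismatch = a[i - 1] != b[j - 1]
                best = min(best, prev.get(j - 1, inf) + mismatch)  # match or substitution
                best = min(best, cur.get(j - 1, inf) + 1)          # insertion into a
            cur[j] = best
        prev = cur
    return int(prev[len(b)])


one_insertion = banded_edit_distance("GATTACA", "GATTTACA", band=2)
too_different = banded_edit_distance("AAAA", "AAAAAAAA", band=2)
```

Each row stores at most 2 × band + 1 cells, so memory per row is bounded by the band width rather than by the length of the sequences.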
5.3. Pangenome-guided assembly overview
Using Giraffe’s ability to align long reads to pangenomes, we can do haplotype-resolved pangenome-guided assembly of long reads. Here we focus on local assembly.
Our method assembles a specified region of interest (ROI) using a personalized-pangenome-guided assembly workflow, depicted in Figure 6.1. The workflow takes as input a variation graph and the sample’s long reads. For this study, we used the HPRC-clipped pangenome graph.
First, the variation graph is personalized based on the input reads [18] by sampling 16 haplotypes (without using --diploid-sampling, which would further downsample to two). Input reads are then mapped to this personalized graph using long read Giraffe. For a target ROI, all reads mapping to the region, along with the subgraph and its snarl distance index, are extracted. These serve as input for the subsequent anchor generation step.
Anchors are defined as potentially haplotype-informative k-mers, and enable partitioning of reads by haplotype. They are generated from unique paths within leaf snarls (snarls that contain no other nested snarls). Anchors of a given snarl share common boundary nodes but differ at internal sentinel node(s) that represent allelic variation. An anchor is considered to be supported by a read only if the corresponding read subsequence aligns as an exact match to the anchor path. To mitigate phasing errors caused by noisy long read alignments, we perform a filtering step to remove anchors in unreliable snarls that exhibit inconsistent phasing relative to neighboring snarls. Finally, the complete set of read-supported anchors from the reliable snarls is passed to the Shasta assembler, which performs bubble cleanup and detangling to generate a phased assembly of the ROI.
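The exact-match support rule can be sketched as follows. This is a minimal example under our own assumptions: the read subsequence spanning each snarl is taken to be already extracted from the alignment, and the anchor names, sequences, and reads are hypothetical. It does not include the filtering of unreliable snarls.

```python
# Toy exact-match anchor support for a single snarl (illustrative only).
def anchor_support(anchor_seqs: dict[str, str], read_segments: dict[str, str]) -> dict[str, list[str]]:
    """Map each anchor to the reads whose spanning segment matches it exactly."""
    support: dict[str, list[str]] = {name: [] for name in anchor_seqs}
    for read_name, segment in read_segments.items():
        for anchor_name, anchor_seq in anchor_seqs.items():
            if segment == anchor_seq:
                support[anchor_name].append(read_name)
    return support


# Two hypothetical anchors that share boundaries and differ at one internal base.
anchors = {"allele_T": "GATTACA", "allele_C": "GATCACA"}
reads = {
    "read1": "GATTACA",
    "read2": "GATCACA",
    "read3": "GATGACA",  # sequencing error: supports neither anchor
    "read4": "GATTACA",
}
support = anchor_support(anchors, reads)
```

Reads that support different anchors of the same snarl are candidates for different haplotypes, while reads with errors in the anchor region contribute no support at that site.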
Supplementary Material
Acknowledgments
Funding
This work was supported in part by the National Human Genome Research Institute (NHGRI) under award numbers U01HG013748, U41HG010972 and U24HG011853. This project has received funding from the European Union’s Horizon 2020 Research and Innovation Staff Exchange programme under the Marie Skłodowska-Curie grant agreement No. 872539.
Footnotes
Competing interests
A.C., K.S., and P.-C.C. are employees of Google and own Alphabet stock as part of the standard compensation package. The remaining authors declare no competing interests.
Data and materials availability
Archived copies of the code, graphs, DeepVariant models, and simulated reads (excluding R10 reads) used in this paper are available on Zenodo [56]. All data on Zenodo is also available at https://cgl.gi.ucsc.edu/data/lr-giraffe/, along with the R10 simulated reads and HG002 real read sets used for mapping evaluations.
The real read sets used for mapping and calling evaluations are available on Zenodo, at https://cgl.gi.ucsc.edu/data/lr-giraffe/, and are publicly available at https://s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=T2T/HG002/assemblies/polishing/HG002/v1.0/mapping/hifi_revio_pbmay24/ (HiFi reads), s3://ont-open-data/giab_2025.01/ (R10 reads), gs://brain-genomics-public/research/sequencing/fastq/novaseq/wgs_pcr_free/40x/ (Illumina reads), and https://s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=T2T/scratch/HG002/sequencing/element/trio/HG002/ins500_600/ (Element reads). The read sets used for training DeepVariant are also publicly available; for details see Section S5.4.1.
The HG002 R10 reads used for the pangenome-guided assembly pipeline are publicly available from GIAB at https://epi2me.nanoporetech.com/giab-2025.01/. The CAH samples are available from the authors upon request (see Section S6.1).
The HPRC-d46 graph and HPRC-Minigraph graphs used for mapping and calling evaluations are available on Zenodo [56], at https://cgl.gi.ucsc.edu/data/lr-giraffe/, and at https://s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=pangenomes/scratch/2025_02_28_minigraph_cactus/. The personalized haplotype sampled graphs are not publicly available, but they can be recreated using HG002 read sets and the base HPRC-clipped graph, which is available on Zenodo [56] and at https://cgl.gi.ucsc.edu/data/lr-giraffe/.
The Snakemake workflow used for read mapping, alignment analyses, and variant calling is available at https://github.com/vgteam/long-read-giraffe-experiments/tree/main. The regional PGA Snakemake workflow used for HG002 testing is available at https://github.com/shlokanegi/pga_workflow/tree/lrg2025-paper. The latest version of the vg toolkit is released at https://github.com/vgteam/vg. Copies of each of these GitHub repositories are available on Zenodo [56].
References
- [1].Wang Ting, Antonacci-Fulton Lucinda, Howe Kerstin, Lawson Heather A., Lucas Julian K., Phillippy Adam M., Popejoy Alice B., Asri Mobin, Carson Caryn, Chaisson Mark J. P., Chang Xian, Cook-Deegan Robert, Felsenfeld Adam L., Fulton Robert S., Garrison Erik P., Garrison Nanibaa’ A., Graves-Lindsay Tina A., Ji Hanlee, Kenny Eimear E., Koenig Barbara A., Li Daofeng, Marschall Tobias, McMichael Joshua F., Novak Adam M., Purushotham Deepak, Schneider Valerie A., Schultz Baergen I., Smith Michael W., Sofia Heidi J., Weissman Tsachy, Flicek Paul, Li Heng, Miga Karen H., Paten Benedict, Jarvis Erich D., Hall Ira M., Eichler Evan E., Haussler David, and the Human Pangenome Reference Consortium. The Human Pangenome Project: a global resource to map genomic diversity. Nature, 604(7906):437–446, April 2022. ISSN 0028-0836, 1476-4687. doi: 10.1038/s41586-022-04601-8. URL https://www.nature.com/articles/s41586-022-04601-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [2].Miga Karen H. and Wang Ting. The Need for a Human Pangenome Reference Sequence. Annual Review of Genomics and Human Genetics, 22(1):81–102, August 2021. ISSN 1527-8204, 1545-293X. doi: 10.1146/annurev-genom-120120-081921. URL https://www.annualreviews.org/doi/10.1146/annurev-genom-120120-081921. [DOI] [Google Scholar]
- [3].Gao Yang, Yang Xiaofei, Chen Hao, Tan Xinjiang, Yang Zhaoqing, Deng Lian, Wang Baonan, Kong Shuang, Li Songyang, Cui Yuhang, Lei Chang, Wang Yimin, Pan Yuwen, Ma Sen, Sun Hao, Zhao Xiaohan, Shi Ying-bing, Yang Ziyi, Wu Dongdong, Wu Shaoyuan, Zhao Xingming, Shi Binyin, Jin Li, Hu Zhibin, Lu Yan, Chu Jiayou, Ye Kai, and Xu Shuhua. A pangenome reference of 36 Chinese populations. Nature, 619(7968):112–121, July 2023. ISSN 1476-4687. doi: 10.1038/s41586-023-06173-7. URL https://www.nature.com/articles/s41586-023-06173-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [4].Littlefield Connor, Lazaro-Guevara Jose M., Stucki Devorah, Lansford Michael, Pezzolesi Melissa H., Taylor Emma J., Wolfgramm Etoni-Ma’asi C., Taloa Jacob, Lao Kime, Dave C., Dumaguit C., Ridge Perry G., Tavana Justina P., Holland William L., Raphael Kalani L., and Pezzolesi Marcus G. A Draft Pacific Ancestry Pangenome Reference, August 2024. URL https://www.biorxiv.org/content/10.1101/2024.08.07.606392v2. [Google Scholar]
- [5].Nassir Nasna, Almarri Mohamed A., Kumail Muhammad, Mohamed Nesrin, Balan Bipin, Hanif Shehzad, AlObathani Maryam, Jamalalail Bassam, Elsokary Hanan, Kondaramage Dasuki, Shiyas Suhana, Kosaji Noor, Satsangi Dharana, Abdelmotagali Madiha Hamdi Saif, Tayoun Ahmad Abou, Zuhair Olfat, Ahmed Salem, Youssef Douaa Fathi, Suwaidi Hanan Al, Albanna Ammar, Du Plessis Stefan, Khansaheb Hamda Hassan, Alsheikh-Ali Alawi, and Uddin Mohammed. A draft Arab pangenome reference, July 2024. URL https://www.biorxiv.org/content/10.1101/2024.07.09.602638v1. [Google Scholar]
- [6].Liao Wen-Wei, Asri Mobin, Ebler Jana, Doerr Daniel, Haukness Marina, Hickey Glenn, Lu Shuangjia, Lucas Julian K., Monlong Jean, Abel Haley J., Buonaiuto Silvia, Chang Xian H., Cheng Haoyu, Chu Justin, Colonna Vincenza, Eizenga Jordan M., Feng Xiaowen, Fischer Christian, Fulton Robert S., Garg Shilpa, Groza Cristian, Guarracino Andrea, Harvey William T., Heumos Simon, Howe Kerstin, Jain Miten, Lu Tsung-Yu, Markello Charles, Martin Fergal J., Mitchell Matthew W., Munson Katherine M., Njagi Mwaniki Moses, Novak Adam M., Olsen Hugh E., Pesout Trevor, Porubsky David, Prins Pjotr, Sibbesen Jonas A., Sirén Jouni, Tomlinson Chad, Villani Flavia, Vollger Mitchell R., Antonacci-Fulton Lucinda L., Baid Gunjan, Baker Carl A., Belyaeva Anastasiya, Billis Konstantinos, Carroll Andrew, Chang Pi-Chuan, Cody Sarah, Cook Daniel E., Cook-Deegan Robert M., Cornejo Omar E., Diekhans Mark, Ebert Peter, Fairley Susan, Fedrigo Olivier, Felsenfeld Adam L., Formenti Giulio, Frankish Adam, Gao Yan, Garrison Nanibaa’ A., Giron Carlos Garcia, Green Richard E., Haggerty Leanne, Hoekzema Kendra, Hourlier Thibaut, Ji Hanlee P., Kenny Eimear E., Koenig Barbara A., Kolesnikov Alexey, Korbel Jan O., Kordosky Jennifer, Koren Sergey, Lee HoJoon, Lewis Alexandra P., Magalhães Hugo, Marco-Sola Santiago, Marijon Pierre, McCartney Ann, McDaniel Jennifer, Mountcastle Jacquelyn, Nattestad Maria, Nurk Sergey, Olson Nathan D., Popejoy Alice B., Puiu Daniela, Rautiainen Mikko, Regier Allison A., Rhie Arang, Sacco Samuel, Sanders Ashley D., Schneider Valerie A., Schultz Baergen I., Shafin Kishwar, Smith Michael W., Sofia Heidi J., Abou Tayoun Ahmad N., Thibaud-Nissen Françoise, Tricomi Francesca Floriana, Wagner Justin, Walenz Brian, Wood Jonathan M. D., Zimin Aleksey V., Bourque Guillaume, Chaisson Mark J. P., Flicek Paul, Phillippy Adam M., Zook Justin M., Eichler Evan E., Haussler David, Wang Ting, Jarvis Erich D., Miga Karen H., Garrison Erik, Marschall Tobias, Hall Ira M., Li Heng, and Paten Benedict. A draft human pangenome reference. Nature, 617(7960):312–324, May 2023. ISSN 0028-0836, 1476-4687. doi: 10.1038/s41586-023-05896-x. URL https://www.nature.com/articles/s41586-023-05896-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [7].Garrison Erik, Sirén Jouni, Novak Adam M., Hickey Glenn, Eizenga Jordan M., Dawson Eric T., Jones William, Garg Shilpa, Markello Charles, Lin Michael F., Paten Benedict, and Durbin Richard. Variation graph toolkit improves read mapping by representing genetic variation in the reference. Nature Biotechnology, 36(9):875–879, October 2018. ISSN 1546-1696. doi: 10.1038/nbt.4227. URL https://www.nature.com/articles/nbt.4227. [DOI] [Google Scholar]
- [8].Langmead Ben and Salzberg Steven L. Fast gapped-read alignment with Bowtie 2. Nature methods, 9(4):357–359, March 2012. ISSN 1548-7091. doi: 10.1038/nmeth.1923. URL https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3322381/. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [9].Li Heng. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv, May 2013. [Google Scholar]
- [10].Li Heng. New strategies to improve minimap2 alignment accuracy. Bioinformatics, 37(23):4572–4574, December 2021. ISSN 1367-4803, 1367-4811. doi: 10.1093/bioinformatics/btab705. URL https://academic.oup.com/bioinformatics/article/37/23/4572/6384570. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [11].Sahlin Kristoffer, Baudeau Thomas, Cazaux Bastien, and Marchet Camille. A survey of mapping algorithms in the long-reads era. Genome Biology, 24(1):133, 2023. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [12].Lee Christopher, Grasso Catherine, and Sharlow Mark F.. Multiple sequence alignment using partial order graphs. Bioinformatics, 18(3):452–464, March 2002. ISSN 1367-4811, 1367-4803. doi: 10.1093/bioinformatics/18.3.452. URL https://academic.oup.com/bioinformatics/article/18/3/452/236691. [DOI] [PubMed] [Google Scholar]
- [13].Grasso Catherine and Lee Christopher. Combining partial order alignment and progressive multiple sequence alignment increases alignment speed and scalability to very large alignment problems. Bioinformatics, 20(10):1546–1556, July 2004. ISSN 1367-4811, 1367-4803. doi: 10.1093/bioinformatics/bth126. URL https://academic.oup.com/bioinformatics/article/20/10/1546/237238. [DOI] [PubMed] [Google Scholar]
- [14].Baeza-Yates Ricardo and Gonnet Gaston H. A new approach to text searching. Communications of the ACM, 35(10):74–82, October 1992. ISSN 0001-0782, 1557-7317. doi: 10.1145/135239.135243. URL https://dl.acm.org/doi/10.1145/135239.135243. [DOI] [Google Scholar]
- [15].Myers Gene. A fast bit-vector algorithm for approximate string matching based on dynamic programming. Journal of the ACM, 46(3):395–415, May 1999. ISSN 0004-5411, 1557-735X. doi: 10.1145/316542.316550. URL https://dl.acm.org/doi/10.1145/316542.316550. [DOI] [Google Scholar]
- [16].Rautiainen Mikko, Mäkinen Veli, and Marschall Tobias. Bit-parallel sequence-to-graph alignment. Bioinformatics, 35(19):3599–3607, October 2019. ISSN 1367-4803, 1367-4811. doi: 10.1093/bioinformatics/btz162. URL https://academic.oup.com/bioinformatics/article/35/19/3599/5372677. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [17].Sirén Jouni, Monlong Jean, Chang Xian, Novak Adam M., Eizenga Jordan M., Markello Charles, Sibbesen Jonas A., Hickey Glenn, Chang Pi-Chuan, Carroll Andrew, Gupta Namrata, Gabriel Stacey, Blackwell Thomas W., Ratan Aakrosh, Taylor Kent D., Rich Stephen S., Rotter Jerome I., Haussler David, Garrison Erik, and Paten Benedict. Pangenomics enables genotyping of known structural variants in 5202 diverse genomes. Science, 374(6574):abg8871, December 2021. ISSN 0036-8075, 1095-9203. doi: 10.1126/science.abg8871. URL https://www.science.org/doi/10.1126/science.abg8871. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [18].Sirén Jouni, Eskandar Parsa, Ungaro Matteo Tommaso, Hickey Glenn, Eizenga Jordan M., Novak Adam M., Chang Xian, Chang Pi-Chuan, Kolmogorov Mikhail, Carroll Andrew, Monlong Jean, and Paten Benedict. Personalized pangenome references. Nature Methods, 21(11):2017–2023, November 2024. ISSN 1548-7091, 1548-7105. doi: 10.1038/s41592-024-02407-2. URL https://www.nature.com/articles/s41592-024-02407-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [19].Wenger Aaron M., Peluso Paul, Rowell William J., Chang Pi-Chuan, Hall Richard J., Concepcion Gregory T., Ebler Jana, Fungtammasan Arkarachai, Kolesnikov Alexey, Olson Nathan D., Töpfer Armin, Alonge Michael, Mahmoud Medhat, Qian Yufeng, Chin Chen-Shan, Phillippy Adam M., Schatz Michael C., Myers Gene, DePristo Mark A., Ruan Jue, Marschall Tobias, Sedlazeck Fritz J., Zook Justin M., Li Heng, Koren Sergey, Carroll Andrew, Rank David R., and Hunkapiller Michael W. Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. Nature Biotechnology, 37(10):1155–1162, October 2019. ISSN 1087-0156, 1546-1696. doi: 10.1038/s41587-019-0217-9. URL https://www.nature.com/articles/s41587-019-0217-9. [DOI] [Google Scholar]
- [20].Warburton Peter E. and Sebra Robert P.. Long-Read DNA Sequencing: Recent Advances and Remaining Challenges. Annual Review of Genomics and Human Genetics, 24(1):109–132, August 2023. ISSN 1527-8204, 1545-293X. doi: 10.1146/annurev-genom-101722-103045. URL https://www.annualreviews.org/doi/10.1146/annurev-genom-101722-103045. [DOI] [Google Scholar]
- [21].Jain Miten, Koren Sergey, Miga Karen H, Quick Josh, Rand Arthur C, Sasani Thomas A, Tyson John R, Beggs Andrew D, Dilthey Alexander T, Fiddes Ian T, Malla Sunir, Marriott Hannah, Nieto Tom, O’Grady Justin, Olsen Hugh E, Pedersen Brent S, Rhie Arang, Richardson Hollian, Quinlan Aaron R, Snutch Terrance P, Tee Louise, Paten Benedict, Phillippy Adam M, Simpson Jared T, Loman Nicholas J, and Loose Matthew. Nanopore sequencing and assembly of a human genome with ultra-long reads. Nature Biotechnology, 36(4):338–345, 2018. ISSN 1087-0156. doi: 10.1038/nbt.4060. URL https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5889714/. [DOI] [Google Scholar]
- [22].Kim Bernard Y., Gellert Hannah R., Church Samuel H., Suvorov Anton, Anderson Sean S., Barmina Olga, Beskid Sofia G., Comeault Aaron A., Crown K. Nicole, Diamond Sarah E., Dorus Steve, Fujichika Takako, Hemker James A., Hrcek Jan, Kankare Maaria, Katoh Toru, Magnacca Karl N., Martin Ryan A., Matsunaga Teruyuki, Medeiros Matthew J., Miller Danny E., Pitnick Scott, Schiffer Michele, Simoni Sara, Steenwinkel Tessa E., Syed Zeeshan A., Takahashi Aya, Wei Kevin H.-C., Yokoyama Tsuya, Eisen Michael B., Kopp Artyom, Matute Daniel, Obbard Darren J., O’Grady Patrick M., Price Donald K., Toda Masanori J., Werner Thomas, and Petrov Dmitri A.. Single-fly genome assemblies fill major phylogenomic gaps across the Drosophilidae Tree of Life. PLOS Biology, 22(7):e3002697, July 2024. ISSN 1545-7885. doi: 10.1371/journal.pbio.3002697. URL https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.3002697. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [23].Damaraju Nikhita, Miller Angela L, and Miller Danny E. Long-Read DNA and RNA Sequencing to Streamline Clinical Genetic Testing and Reduce Barriers to Comprehensive Genetic Testing. The Journal of Applied Laboratory Medicine, 9(1):138–150, January 2024. ISSN 2475-7241. doi: 10.1093/jalm/jfad107. URL https://academic.oup.com/jalm/article/9/1/138/7502994. [DOI] [PubMed] [Google Scholar]
- [24].Kolmogorov Mikhail, Billingsley Kimberley J., Mastoras Mira, Meredith Melissa, Monlong Jean, Lorig-Roach Ryan, Asri Mobin, Alvarez Jerez Pilar, Malik Laksh, Dewan Ramita, Reed Xylena, Genner Rylee M., Daida Kensuke, Behera Sairam, Shafin Kishwar, Pesout Trevor, Prabakaran Jeshuwin, Carnevali Paolo, Yang Jianzhi, Rhie Arang, Scholz Sonja W., Traynor Bryan J., Miga Karen H., Jain Miten, Timp Winston, Phillippy Adam M., Chaisson Mark, Sedlazeck Fritz J., Blauwendraat Cornelis, and Paten Benedict. Scalable Nanopore sequencing of human genomes provides a comprehensive view of haplotype-resolved variation and methylation. Nature Methods, 20(10):1483–1492, October 2023. ISSN 1548-7091, 1548-7105. doi: 10.1038/s41592-023-01993-x. URL https://www.nature.com/articles/s41592-023-01993-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [25].Mahmoud M., Huang Y., Garimella K., Audano P. A., Wan W., Prasad N., Handsaker R. E., Hall S., Pionzio A., Schatz M. C., Talkowski M. E., Eichler E. E., Levy S. E., and Sedlazeck F. J.. Utility of long-read sequencing for All of Us. Nature Communications, 15(1):837, January 2024. ISSN 2041-1723. doi: 10.1038/s41467-024-44804-3. URL https://www.nature.com/articles/s41467-024-44804-3. [DOI] [Google Scholar]
- [26].Li Heng, Feng Xiaowen, and Chu Chong. The design and construction of reference pangenome graphs with minigraph. Genome Biology, 21(1):265, October 2020. ISSN 1474-760X. doi: 10.1186/s13059-020-02168-z. URL https://doi.org/10.1186/s13059-020-02168-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [27].Rajput Jyotshna, Chandra Ghanshyam, and Jain Chirag. Co-linear chaining on pangenome graphs. Algorithms for Molecular Biology, 19(1):4, January 2024. ISSN 1748-7188. doi: 10.1186/s13015-024-00250-w. URL https://doi.org/10.1186/s13015-024-00250-w. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [28].Ma Jun, Cáceres Manuel, Salmela Leena, Mäkinen Veli, and Tomescu Alexandru I.. Chaining for accurate alignment of erroneous long reads to acyclic variation graphs. Bioinformatics, 39(8), August 2023. doi: 10.1093/bioinformatics/btad460. URL https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10423031/. [DOI] [Google Scholar]
- [29].Chandra Ghanshyam and Jain Chirag. Gap-Sensitive Colinear Chaining Algorithms for Acyclic Pangenome Graphs. Journal of Computational Biology, 30(11):1182–1197, November 2023. ISSN 1557-8666. doi: 10.1089/cmb.2023.0186. URL https://www.liebertpub.com/doi/10.1089/cmb.2023.0186. [DOI] [PubMed] [Google Scholar]
- [30].Mäkinen Veli, Tomescu Alexandru I., Kuosmanen Anna, Paavilainen Topi, Gagie Travis, and Chikhi Rayan. Sparse Dynamic Programming on DAGs with Small Width. ACM Transactions on Algorithms, 15(2):1–21, April 2019. ISSN 1549-6325, 1549-6333. doi: 10.1145/3301312. URL https://dl.acm.org/doi/10.1145/3301312. [DOI] [Google Scholar]
- [31].Chandra Ghanshyam, Gibney Daniel, and Jain Chirag. Haplotype-aware sequence alignment to pangenome graphs. Genome Research, 34(9):1265–1275, September 2024. ISSN 1088-9051, 1549-5469. doi: 10.1101/gr.279143.124. URL http://genome.cshlp.org/content/34/9/1265. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [32].Rautiainen Mikko and Marschall Tobias. GraphAligner: rapid and versatile sequence-to-graph alignment. Genome Biology, 21(1):253, September 2020. ISSN 1474-760X. doi: 10.1186/s13059-020-02157-2. URL https://doi.org/10.1186/s13059-020-02157-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [33].Cheng Haoyu, Concepcion Gregory T, Feng Xiaowen, Zhang Haowen, and Li Heng. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nature methods, 18(2):170–175, 2021. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [34].Jain Chirag, Rhie Arang, Hansen Nancy F., Koren Sergey, and Phillippy Adam M.. Long-read mapping to repetitive reference sequences using Winnowmap2. Nature Methods, 19(6):705–710, June 2022. ISSN 1548-7105. doi: 10.1038/s41592-022-01457-8. URL https://www.nature.com/articles/s41592-022-01457-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [35].PacificBiosciences/pbmm2, July 2025. URL https://github.com/PacificBiosciences/pbmm2. original-date: 2017-11-12T13:19:24Z. [Google Scholar]
- [36].Poplin Ryan, Chang Pi-Chuan, Alexander David, Schwartz Scott, Colthurst Thomas, Ku Alexander, Newburger Dan, Dijamco Jojo, Nguyen Nam, Afshar Pegah T, Gross Sam S, Dorfman Lizzie, McLean Cory Y, and DePristo Mark A. A universal SNP and small-indel variant caller using deep neural networks. Nature Biotechnology, 36(10):983–987, November 2018. ISSN 1087-0156, 1546-1696. doi: 10.1038/nbt.4235. URL https://www.nature.com/articles/nbt.4235. [DOI] [Google Scholar]
- [37].Illumina/hap.py, November 2020. URL https://github.com/Illumina/hap.py. [Google Scholar]
- [38].Krusche Peter, Trigg Len, Boutros Paul C., Mason Christopher E., De La Vega Francisco M., Moore Benjamin L., Gonzalez-Porta Mar, Eberle Michael A., Tezak Zivana, Lababidi Samir, Truty Rebecca, Asimenos George, Funke Birgit, Fleharty Mark, Chapman Brad A., Salit Marc, and Zook Justin M. Best practices for benchmarking germline small-variant calls in human genomes. Nature Biotechnology, 37(5):555–560, May 2019. ISSN 1546-1696. doi: 10.1038/s41587-019-0054-x. URL https://www.nature.com/articles/s41587-019-0054-x. [DOI] [Google Scholar]
- [39].Wagner Justin, Olson Nathan D., Harris Lindsay, Khan Ziad, Farek Jesse, Mahmoud Medhat, Stankovic Ana, Kovacevic Vladimir, Yoo Byunggil, Miller Neil, Rosenfeld Jeffrey A., Ni Bohan, Zarate Samantha, Kirsche Melanie, Aganezov Sergey, Schatz Michael C., Narzisi Giuseppe, Byrska-Bishop Marta, Clarke Wayne, Evani Uday S., Markello Charles, Shafin Kishwar, Zhou Xin, Sidow Arend, Bansal Vikas, Ebert Peter, Marschall Tobias, Lansdorp Peter, Hanlon Vincent, Mattsson Carl-Adam, Martinez Barrio Alvaro, Fiddes Ian T., Xiao Chunlin, Fungtammasan Arkarachai, Chin Chen-Shan, Wenger Aaron M., Rowell William J., Sedlazeck Fritz J., Carroll Andrew, Salit Marc, and Zook Justin M.. Benchmarking challenging small variants with linked and long reads. Cell Genomics, 2(5):100128, May 2022. ISSN 2666979X. doi: 10.1016/j.xgen.2022.100128. URL https://linkinghub.elsevier.com/retrieve/pii/S2666979X2200057X. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [40].Asri Mobin, Chang Pi-Chuan, Mier Juan Carlos, Sirén Jouni, Eskandar Parsa, Kolesnikov Alexey, Cook Daniel E., Brambrink Lucas, Hickey Glenn, Novak Adam M., Dorfman Lizzie, Webster Dale R., Carroll Andrew, Paten Benedict, and Shafin Kishwar. Pangenome-aware DeepVariant, June 2025. URL http://biorxiv.org/lookup/doi/10.1101/2025.06.05.657102. [Google Scholar]
- [41].Hickey Glenn, Heller David, Monlong Jean, Sibbesen Jonas A., Sirén Jouni, Eizenga Jordan, Dawson Eric T., Garrison Erik, Novak Adam M., and Paten Benedict. Genotyping structural variants in pangenome graphs using the vg toolkit. Genome Biology, 21(1):35, February 2020. ISSN 1474-760X. doi: 10.1186/s13059-020-1941-7. URL https://doi.org/10.1186/s13059-020-1941-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [42].Sedlazeck Fritz J., Rescheneder Philipp, Smolka Moritz, Fang Han, Nattestad Maria, von Haeseler Arndt, and Schatz Michael C. Accurate detection of complex structural variations using single-molecule sequencing. Nature Methods, 15(6):461–468, June 2018. ISSN 1548-7105. doi: 10.1038/s41592-018-0001-7. URL https://www.nature.com/articles/s41592-018-0001-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [43].Smolka Moritz, Paulin Luis F., Grochowski Christopher M., Horner Dominic W., Mahmoud Medhat, Behera Sairam, Kalef-Ezra Ester, Gandhi Mira, Hong Karl, Pehlivan Davut, Scholz Sonja W., Carvalho Claudia M. B., Proukakis Christos, and Sedlazeck Fritz J.. Detection of mosaic and population-level structural variants with Sniffles2. Nature Biotechnology, 42(10):1571–1580, October 2024. ISSN 1546-1696. doi: 10.1038/s41587-023-02024-y. URL https://www.nature.com/articles/s41587-023-02024-y. [DOI] [Google Scholar]
- [44].Zook Justin M., Hansen Nancy F., Olson Nathan D., Chapman Lesley M., Mullikin James C., Xiao Chunlin, Sherry Stephen, Koren Sergey, Phillippy Adam M., Boutros Paul C., Sahraeian Sayed Mohammad E., Huang Vincent, Rouette Alexandre, Alexander Noah, Mason Christopher E., Hajirasouliha Iman, Ricketts Camir, Lee Joyce, Tearle Rick, Fiddes Ian T., Barrio Alvaro Martinez, Wala Jeremiah, Carroll Andrew, Ghaffari Noushin, Rodriguez Oscar L., Bashir Ali, Jackman Shaun, Farrell John J, Wenger Aaron M, Alkan Can, Soylev Arda, Schatz Michael C., Garg Shilpa, Church George, Marschall Tobias, Chen Ken, Fan Xian, English Adam C., Rosenfeld Jeffrey A., Zhou Weichen, Mills Ryan E., Sage Jay M., Davis Jennifer R., Kaiser Michael D., Oliver John S., Catalano Anthony P., Chaisson Mark JP, Spies Noah, Sedlazeck Fritz J., and Salit Marc. A robust benchmark for detection of germline large deletions and insertions. Nature biotechnology, 38(11):1347–1355, November 2020. ISSN 1087-0156. doi: 10.1038/s41587-020-0538-8. URL https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8454654/. [DOI] [Google Scholar]
- [45].English Adam C., Menon Vipin K., Gibbs Richard A., Metcalf Ginger A., and Sedlazeck Fritz J.. Truvari: refined structural variant comparison preserves allelic diversity. Genome Biology, 23(1):271, December 2022. ISSN 1474-760X. doi: 10.1186/s13059-022-02840-6. URL https://doi.org/10.1186/s13059-022-02840-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [46].Negi Shloka, Stenton Sarah L., Berger Seth I., Canigiula Paolo, McNulty Brandy, Violich Ivo, Gardner Joshua, Hillaker Todd, O’Rourke Sara M., O’Leary Melanie C., Carbonell Elizabeth, Austin-Tse Christina, Lemire Gabrielle, Serrano Jillian, Mangilog Brian, VanNoy Grace, Kolmogorov Mikhail, Vilain Eric, O’Donnell-Luria Anne, Délot Emmanuèle, Miga Karen H., Monlong Jean, and Paten. Benedict, Advancing long-read nanopore genome assembly and accurate variant calling for rare disease detection. The American Journal of Human Genetics, 112(2):428–449, 2025. ISSN 0002-9297. doi: 10.1016/j.ajhg.2025.01.002. URL https://www.sciencedirect.com/science/article/pii/S0002929725000023. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [47].Monlong Jean, Chen Xiao, Barseghyan Hayk, Rowell William J, Negi Shloka, Nokoff Natalie, Mohnach Lauren, Hirsch Josephine, Finlayson Courtney, Keegan Catherine E., Almalvez Miguel, Berger Seth I., de Dios Ivan, McNulty Brandy, Robertson Alex, Miga Karen H., Speiser Phyllis W., Paten Benedict, Vilain Eric, and Délot Emmanuèle C., Long-read sequencing resolves the clinically relevant cyp21a2 locus, supporting a new clinical test for congenital adrenal hyperplasia. medRxiv, 2025. doi: 10.1101/2025.02.07.25321404. URL https://www.medrxiv.org/content/early/2025/02/10/2025.02.07.25321404. [DOI] [Google Scholar]
- [48].Garrison Erik, Guarracino Andrea, Heumos Simon, Villani Flavia, Bao Zhigui, Tattini Lorenzo, Hagmann Jörg, Vorbrugg Sebastian, Marco-Sola Santiago, Kubica Christian, Ashbrook David G., Thorell Kaisa, Rusholme-Pilcher Rachel L., Liti Gianni, Rudbeck Emilio, Golicz Agnieszka A., Nahnsen Sven, Yang Zuyu, Mwaniki Moses Njagi, Nobrega Franklin L., Wu Yi, Chen Hao, De Ligt Joep, Sudmant Peter H., Huang Sanwen, Weigel Detlef, Soranzo Nicole, Colonna Vincenza, Williams Robert W., and Prins. Pjotr, Building pangenome graphs. Nature Methods, 21(11):2008–2012, November 2024. ISSN 1548-7091, 1548-7105. doi: 10.1038/s41592-024-02430-3. URL https://www.nature.com/articles/s41592-024-02430-3. [DOI] [PubMed] [Google Scholar]
- [49].Sirén Jouni, Garrison Erik, Novak Adam M, Paten Benedict, and Durbin Richard. Haplotype-aware graph indexes. Bioinformatics, 36(2):400–407, January 2020. ISSN 1367-4803, 1367-4811. doi: 10.1093/bioinformatics/btz575. URL https://academic.oup.com/bioinformatics/article/36/2/400/5538990.
- [50].Sirén Jouni and Paten Benedict. GBZ file format for pangenome graphs. Bioinformatics, 38(22):5012–5018, November 2022. ISSN 1367-4803, 1367-4811. doi: 10.1093/bioinformatics/btac656. URL https://academic.oup.com/bioinformatics/article/38/22/5012/6731924.
- [51].Chang Xian, Eizenga Jordan, Novak Adam M, Sirén Jouni, and Paten Benedict. Distance indexing and seed clustering in sequence graphs. Bioinformatics, 36(Supplement 1):i146–i153, July 2020. ISSN 1367-4803, 1367-4811. doi: 10.1093/bioinformatics/btaa446. URL https://academic.oup.com/bioinformatics/article/36/Supplement_1/i146/5870464.
- [52].Paten Benedict, Eizenga Jordan M., Rosen Yohei M., Novak Adam M., Garrison Erik, and Hickey Glenn. Superbubbles, Ultrabubbles, and Cacti. Journal of Computational Biology, 25(7):649–663, July 2018. ISSN 1557-8666. doi: 10.1089/cmb.2017.0251. URL http://www.liebertpub.com/doi/10.1089/cmb.2017.0251.
- [53].Marco-Sola Santiago, Moure Juan Carlos, Moreto Miquel, and Espinosa Antonio. Fast gap-affine pairwise alignment using the wavefront algorithm. Bioinformatics, 37(4):456–463, May 2021. ISSN 1367-4803, 1367-4811. doi: 10.1093/bioinformatics/btaa777. URL https://academic.oup.com/bioinformatics/article/37/4/456/5904262.
- [54].Eizenga Jordan M. and Paten Benedict. Improving the time and space complexity of the WFA algorithm and generalizing its scoring, January 2022. URL http://biorxiv.org/lookup/doi/10.1101/2022.01.12.476087.
- [55].Marco-Sola Santiago, Eizenga Jordan M, Guarracino Andrea, Paten Benedict, Garrison Erik, and Moreto Miquel. Optimal gap-affine alignment in O(s) space. Bioinformatics, 39(2):btad074, February 2023. ISSN 1367-4811. doi: 10.1093/bioinformatics/btad074. URL https://academic.oup.com/bioinformatics/article/doi/10.1093/bioinformatics/btad074/7030690.
- [56].Chang Xian, Novak Adam, Eizenga Jordan, Sirén Jouni, Monlong Jean, Negi Shloka, Andreace Francesco, Nag Sagorika, Kyriakidis Konstantinos, Hickey Glenn, Hwang Stephen, Délot Emmanuèle, Carroll Andrew, Shafin Kishwar, Chang Pi-Chuan, Okamoto Faith, and Paten Benedict. Software and products for “Rapid, accurate long- and short-read mapping to large pangenome graphs with vg Giraffe”, September 2025. URL https://zenodo.org/doi/10.5281/zenodo.17169602.
- [57].Jain Chirag, Rhie Arang, Zhang Haowen, Chu Claudia, Walenz Brian P, Koren Sergey, and Phillippy Adam M. Weighted minimizer sampling improves long read mapping. Bioinformatics, 36(Supplement 1):i111–i118, July 2020. ISSN 1367-4803, 1367-4811. doi: 10.1093/bioinformatics/btaa435. URL https://academic.oup.com/bioinformatics/article/36/Supplement_1/i111/5870473.
- [58].Abraham Ittai, Delling Daniel, Goldberg Andrew V, and Werneck Renato F. Hierarchical hub labelings for shortest paths. In European Symposium on Algorithms, pages 24–35. Springer, 2012.
- [59].Nurk Sergey, Koren Sergey, Rhie Arang, Rautiainen Mikko, Bzikadze Andrey V., Mikheenko Alla, Vollger Mitchell R., Altemose Nicolas, Uralsky Lev, Gershman Ariel, Aganezov Sergey, Hoyt Savannah J., Diekhans Mark, Logsdon Glennis A., Alonge Michael, Antonarakis Stylianos E., Borchers Matthew, Bouffard Gerard G., Brooks Shelise Y., Caldas Gina V., Chen Nae-Chyun, Cheng Haoyu, Chin Chen-Shan, Chow William, De Lima Leonardo G., Dishuck Philip C., Durbin Richard, Dvorkina Tatiana, Fiddes Ian T., Formenti Giulio, Fulton Robert S., Fungtammasan Arkarachai, Garrison Erik, Grady Patrick G. S., Graves-Lindsay Tina A., Hall Ira M., Hansen Nancy F., Hartley Gabrielle A., Haukness Marina, Howe Kerstin, Hunkapiller Michael W., Jain Chirag, Jain Miten, Jarvis Erich D., Kerpedjiev Peter, Kirsche Melanie, Kolmogorov Mikhail, Korlach Jonas, Kremitzki Milinn, Li Heng, Maduro Valerie V., Marschall Tobias, McCartney Ann M., McDaniel Jennifer, Miller Danny E., Mullikin James C., Myers Eugene W., Olson Nathan D., Paten Benedict, Peluso Paul, Pevzner Pavel A., Porubsky David, Potapova Tamara, Rogaev Evgeny I., Rosenfeld Jeffrey A., Salzberg Steven L., Schneider Valerie A., Sedlazeck Fritz J., Shafin Kishwar, Shew Colin J., Shumate, Sims Ying, Smit Arian F. A., Soto Daniela C., Sović Ivan, Storer Jessica M., Streets Aaron, Sullivan Beth A., Thibaud-Nissen Françoise, Torrance James, Wagner Justin, Walenz Brian P., Wenger Aaron, Wood Jonathan M. D., Xiao Chunlin, Yan Stephanie M., Young Alice C., Zarate Samantha, Surti Urvashi, McCoy Rajiv C., Dennis Megan Y., Alexandrov Ivan A., Gerton Jennifer L., O’Neill Rachel J., Timp Winston, Zook Justin M., Schatz Michael C., Eichler Evan E., Miga Karen H., and Phillippy Adam M. The complete sequence of a human genome. Science, 376(6588):44–53, April 2022. ISSN 0036-8075, 1095-9203. doi: 10.1126/science.abj6987. URL https://www.science.org/doi/10.1126/science.abj6987.
- [60].Zook Justin M., Catoe David, McDaniel Jennifer, Vang Lindsay, Spies Noah, Sidow Arend, Weng Ziming, Liu Yuling, Mason Christopher E., Alexander Noah, Henaff Elizabeth, McIntyre Alexa B. R., Chandramohan Dhruva, Chen Feng, Jaeger Erich, Moshrefi Ali, Pham Khoa, Stedman William, Liang Tiffany, Saghbini Michael, Dzakula Zeljko, Hastie Alex, Cao Han, Deikus Gintaras, Schadt Eric, Sebra Robert, Bashir Ali, Truty Rebecca M., Chang Christopher C., Gulbahce Natali, Zhao Keyan, Ghosh Srinka, Hyland Fiona, Fu Yutao, Chaisson Mark, Xiao Chunlin, Trow Jonathan, Sherry Stephen T., Zaranek Alexander W., Ball Madeleine, Bobe Jason, Estep Preston, Church George M., Marks Patrick, Kyriazopoulou-Panagiotopoulou Sofia, Zheng Grace X. Y., Schnall-Levin Michael, Ordonez Heather S., Mudivarti Patrice A., Giorda Kristina, Sheng Ying, Rypdal Karoline Bjarnesdatter, and Salit Marc. Extensive sequencing of seven human genomes to characterize benchmark reference materials. Scientific Data, 3(1):160025, June 2016. ISSN 2052-4463. doi: 10.1038/sdata.2016.25. URL https://www.nature.com/articles/sdata201625.
- [61].Baid Gunjan, Nattestad Maria, Kolesnikov Alexey, Goel Sidharth, Yang Howard, Chang Pi-Chuan, and Carroll Andrew. An extensive sequence dataset of gold-standard samples for benchmarking and development. bioRxiv, 2020. doi: 10.1101/2020.12.11.422022. URL https://www.biorxiv.org/content/early/2020/12/11/2020.12.11.422022.
- [62].Ono Yukiteru, Asai Kiyoshi, and Hamada Michiaki. PBSIM2: a simulator for long-read sequencers with a novel generative model of quality scores. Bioinformatics, 37(5):589–595, September 2020. ISSN 1367-4803. doi: 10.1093/bioinformatics/btaa835. URL https://doi.org/10.1093/bioinformatics/btaa835.
- [63].Liu Silvia, Obert Caroline, Yu Yan-Ping, Zhao Junhua, Ren Bao-Guo, Liu Jia-Jun, Wiseman Kelly, Krajacich Benjamin J., Wang Wenjia, Metcalfe Kyle, Smith Mat, Ben-Yehezkel Tuval, and Luo Jian-Hua. Utility analyses of AVITI sequencing chemistry. BMC Genomics, 25:778, August 2024. ISSN 1471-2164. doi: 10.1186/s12864-024-10686-4. URL https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11316309/.
- [64].Olson Nathan D, Wagner Justin, McDaniel Jennifer, Stephens Sarah H, Westreich Samuel T, Prasanna Anish G, Johanson Elaine, Boja Emily, Maier Ezekiel J, Serang Omar, et al. PrecisionFDA Truth Challenge V2: Calling variants from short and long reads in difficult-to-map regions. Cell Genomics, 2(5), 2022.
- [65].github.com/vgteam/vgwdl/DeepVariant, July 2025. URL zenodo.org/doi/10.5281/zenodo.17081107.
- [66].Vivian John, Rao Arjun Arkal, Nothaft Frank Austin, Ketchum Christopher, Armstrong Joel, Novak Adam, Pfeil Jacob, Narkizian Jake, Deran Alden D, Musselman-Brown Audrey, et al. Toil enables reproducible, open source, big biomedical data analyses. Nature Biotechnology, 35(4):314–316, 2017.
- [67].Garrison Erik, Kronenberg Zev N., Dawson Eric T., Pedersen Brent S., and Prins Pjotr. A spectrum of free software tools for processing the VCF variant call format: vcflib, bio-vcf, cyvcf2, hts-nim and slivar. PLOS Computational Biology, 18(5):e1009123, May 2022. ISSN 1553-7358. doi: 10.1371/journal.pcbi.1009123. URL https://dx.plos.org/10.1371/journal.pcbi.1009123.
- [68].Carrozza Cinzia, Foca Laura, De Paolis Elisa, and Concolino Paola. Genes and pseudogenes: complexity of the RCCX locus and disease. Frontiers in Endocrinology, 12:709758, 2021.
- [69].Shiryagin Vladimir V, Devyatkin Andrey A, Fateev Oleg D, Petriaikina Ekaterina S, Bogdanov Viktor P, Antysheva Zoia G, Volchkov Pavel Yu, Yudin Sergey M, Woroncow Mary, and Skvortsova Veronika I. Genomic complexity and clinical significance of the RCCX locus. PeerJ, 12:e18243, 2024.
- [70].Claahsen-Van Der Grinten Hedi L, Speiser Phyllis W, Ahmed S Faisal, Arlt Wiebke, Auchus Richard J, Falhammar Henrik, Flück Christa E, Guasti Leonardo, Huebner Angela, Kortmann Barbara BM, et al. Congenital adrenal hyperplasia—current insights in pathophysiology, diagnostics, and management. Endocrine Reviews, 43(1):91–159, 2022.
- [71].Burch Grant H, Gong Yan, Liu Wenhui, Dettman Robert W, Curry Cynthia J, Smith Lynne, Miller Walter L, and Bristow James. Tenascin–X deficiency is associated with Ehlers–Danlos syndrome. Nature Genetics, 17(1):104–108, 1997.
- [72].Schalkwijk Joost, Zweers Manon C, Steijlen Peter M, Dean Willow B, Taylor Glen, van Vlijmen Ivonne M, van Haren Brigitte, Miller Walter L, and Bristow James. A recessive form of the Ehlers–Danlos syndrome caused by tenascin-X deficiency. New England Journal of Medicine, 345(16):1167–1175, 2001.
- [73].Kamitaki Nolan, Sekar Aswin, Handsaker Robert E, De Rivera Heather, Tooley Katherine, Morris David L, Taylor Kimberly E, Whelan Christopher W, Tombleson Philip, Loohuis Loes M Olde, et al. Complement genes contribute sex-biased vulnerability in diverse disorders. Nature, 582(7813):577–581, 2020.
- [74].Miller Walter L. Tenascin-X—discovery and early research. Frontiers in Immunology, 11:612497, 2021.





