Abstract
Pangenomes reduce reference bias by representing genetic diversity better than a single reference sequence. Yet when comparing a sample to a pangenome, variants in the pangenome that are not part of the sample can be misleading, for example, by causing false read mappings. These irrelevant variants generally have lower allele frequencies, and a common mitigation has been to filter out rare variants. However, this blunt heuristic both fails to remove some irrelevant variants and removes many relevant ones. We propose a new approach that imputes a personalized pangenome subgraph by sampling local haplotypes according to k-mer counts in the reads. We implement the approach in the vg toolkit (https://github.com/vgteam/vg) for the Giraffe short-read aligner and compare its accuracy to state-of-the-art methods using human pangenome graphs from the Human Pangenome Reference Consortium. Using the personalized pangenomes reduces small-variant genotyping errors approximately fourfold relative to the Genome Analysis Toolkit and makes short-read structural variant genotyping of known variants competitive with long-read variant discovery methods.
Pangenome graphs1 represent aligned haplotype sequences. Each node in the graph is labeled with a sequence, and each path represents the sequence obtained by concatenating the node labels. While any path in the graph is a potential haplotype, most paths are unlikely recombinations of true haplotypes.
One common application for pangenome graphs is as a reference for read mapping2–4. Graph-based aligners tend to be more accurate than aligners using a linear reference sequence when mapping reads to regions where the sequenced genome differs from the reference sequence. If the graph contains another haplotype close enough to the sequenced genome in that region, the aligner can usually find the correct mapping. On the other hand, sequence variation that is present in the graph but not in the sequenced genome can make the aligner less accurate. Such variation can imitate other regions, making incorrect mappings more likely.
Because the benefit of including the right variants in the graph outweighs the cost of including variants absent from the sequenced genome, the usual approach is to build the graph from common variants only. The vg short-read aligner2 achieved its best results with the 1000 Genomes Project (1000GP)5 graph when variants below a 1% allele-frequency threshold were left out. Pritt et al.6 developed a model for selecting variants based not only on their frequency but also on their effect on the repetitiveness of the reference.
The Giraffe short-read aligner4 uses another approach for avoiding rare variants. As a haplotype-aware aligner, Giraffe can generate synthetic haplotypes by sampling the existing ones proportionally. The algorithm considers k-node subpaths (default k = 4) and extends the haplotype paths it generates greedily: it looks at the last k − 1 nodes and all possible single-node extensions, and selects the extension that best maintains proportionality. Giraffe achieved its best results with the 1000GP graph by generating 64 haplotypes. However, frequency filtering was still used with the Human Pangenome Reference Consortium (HPRC)7 subgraph intended for Giraffe: as the initial graphs contain only 90 haplotypes, the frequency threshold was set to 10%.
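To make this greedy proportional sampling concrete, the following is a minimal sketch under simplifying assumptions: haplotypes are lists of node identifiers, generation terminates at chain ends, and the function name and tie-breaking are illustrative rather than Giraffe's actual implementation (here k is the node-window size, not a k-mer length).

```python
from collections import Counter

def proportional_haplotypes(paths, n=64, k=4):
    """Sketch of greedy proportional haplotype generation: produce n
    synthetic paths whose k-node subpath frequencies track those of
    the input haplotypes (given as lists of node IDs)."""
    # Frequencies of k-node subpaths in the input haplotypes.
    target = Counter(tuple(p[i:i + k])
                     for p in paths for i in range(len(p) - k + 1))
    total = sum(target.values())
    starts = Counter(tuple(p[:k - 1]) for p in paths if len(p) >= k)
    used = Counter()          # subpaths consumed by the output so far
    out = []
    for _ in range(n):
        path = list(max(starts, key=starts.get))
        while True:
            ctx = tuple(path[-(k - 1):])
            options = [w for w in target if w[:k - 1] == ctx]
            if not options:   # reached the end of the chain
                break
            # Greedy choice: extend with the subpath whose share of the
            # output so far lags its share of the input the most.
            nxt = min(options,
                      key=lambda w: used[w] / n - target[w] / total)
            used[nxt] += 1
            path.append(nxt[-1])
        out.append(path)
    return out
```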
Another alternative is building a personalized reference based on the reads. Dilthey et al.8 used a hidden Markov model to infer a diploid reference for the major histocompatibility complex region according to k-mer counts in the reads. Their personalized reference was a series of bubbles connected by shared sequence. The method was variant-centric, and phasing was not maintained from bubble to bubble.
Vaddadi et al.9 proposed a VCF-based pipeline for building a personalized diploid reference genome. They genotyped variants in a reference VCF using subsampled reads and imputed the haplotypes based on the reference panel. The resulting personalized VCF file could be used for building a graph for any graph-based read mapping and variant calling pipeline.
In this paper, we propose a more direct approach with less overhead. We create a personalized pangenome reference by sampling haplotypes that are similar to the sequenced genome according to k-mer counts in the reads. We work directly with assembled haplotypes and maintain phasing within 10 kbp (kilobase pair) blocks. The sampled graph we build is a subgraph of the original graph. Therefore, any alignments in the sampled graph are also valid in the original graph.
Our approach is tailored for Giraffe, as the indexes it needs for read mapping can be built quickly. We assume a graph with a linear high-level structure, such as graphs built using the Minigraph–Cactus pipeline10. We further assume that read coverage is high enough (at least 20×) that we can reliably classify k-mers into absent, heterozygous and homozygous according to k-mer counts.
When used with HPRC graphs, our sampling approach increases the overall running time of the pipeline by less than 15 minutes. We show that the sampled graph is a better mapping target than the universal frequency-filtered graph. We see a small improvement in the accuracy of calling small variants with DeepVariant11 and a large increase in the accuracy of genotyping structural variants (SVs) with vg12 and PanGenie13.
Results
Haplotype sampling pipeline
Our haplotype sampling pipeline partitions the original pangenome graph into nonoverlapping blocks with a target length of 10 kbp (Fig. 1a). In each block, we find k-mers that are specific to the block and describe the local haplotypes by the presence or absence of these k-mers (Fig. 1b). The information is stored as a binary k-mer presence matrix. This preprocessing is done once per graph.
Fig. 1 |. Illustrating haplotype sampling at adjacent blocks in the pangenome.

a, A variation graph representing adjacent locations in the pangenome, composed of a bidirected sequence graph (top) and a set of embedded reference haplotypes (below); vertical alignment and base labels indicate the correspondence between each haplotype and its path within the sequence graph; the dotted lines represent the boundary between the two blocks; for clarity, non-varying bases (those present in all haplotypes) are omitted. b, k-mers that occur once within the graph, termed graph-unique k-mers, are identified in the haplotypes; here k = 5 and graph-unique k-mers are colored red. The presence and absence of these graph-unique k-mers identify each haplotype. c, The graph-unique k-mers are counted in the reads (here each read is a rectangle, with only reads containing an informative k-mer shown) and, based on these counts, classified as present and likely heterozygous (orange), present and likely homozygous (blue) or absent (all red k-mers in b not identified in the reads). d, Using the graph-unique k-mer classifications, a subset of reference haplotypes is selected at each location, defining a personalized pangenome reference subgraph of the larger graph (grayed nodes are not part of the subgraph, and only the shown embedded haplotypes are included). Where needed, recombinations are introduced (lightning bolt) to create contiguous haplotypes.
For each sample, we count k-mers in the reads. We then use the k-mer counts to classify the k-mers in the matrices as absent, heterozygous or homozygous in the sample (Fig. 1c). We heuristically sample a number of the most relevant haplotypes in each block according to this classification. We may use the sampled haplotypes directly, or we can use them as candidates for selecting the best pair in diploid sampling. Then we find the subgraph of the original graph induced by the selected haplotypes and reference sequences (Fig. 1d). The resulting personalized pangenome can then be used as a reference for read mapping and downstream analysis.
HPRC pangenome graphs
The experiments below feature pangenome graphs constructed from the HPRC draft pangenome dataset. They were all generated with Minigraph–Cactus10 using the same 90 haplotypes (44 diploid samples, GRCh38 and T2T-CHM13), whose assemblies are described in detail in ref. 7. The graphs do not contain the samples HG001 to HG005 we use in read mapping, genotyping and variant calling experiments.
The graphs were referenced on GRCh38, meaning this genome was left acyclic and unclipped by Minigraph–Cactus. The needs of the present study partly drove the creation of the v.1.1 release of these graphs14, built with a newer version of the tools; this release is the primary version used here. Most changes to the graph construction methodology are detailed in ref. 10. One change that was not previously described but is relevant for this work is a filter that guarantees that each connected component has a single top-level chain. This logic, implemented in vg clip and activated in Minigraph–Cactus as of v.2.6.3, repeatedly removes all nodes that have one side of degree 0 (no edges attached) and that are not contained in the reference path, until none are left. Any edges on the left side of the first node of a reference path, or on the right side of its last node, are likewise removed.
The resulting graph has exactly two ‘tips’ (node sides with degree 0) per connected component, corresponding to the two ends of the reference chromosome. The snarl decomposition uses these two tips to root the snarl tree, guaranteeing a single top-level chain that can be used to guide the haplotype sampling algorithm described above. The number of nodes, tips, edges and edges removed by this procedure, as well as its impact on the number of top-level chains, is shown in Supplementary Table 3.
Allele-frequency-filtered versions of each graph were obtained from Minigraph–Cactus using a filter threshold of nine (that is, 10% of haplotypes). Sampling was only run on the unfiltered (default) graphs. Control results are presented for both the v.1.1 graphs and the v.1.0 graphs that were released alongside the HPRC data7. As the v.1.0 graphs have many top-level chains per component, they cannot be used with the haplotype sampling approach described in this paper.
Mapping reads
The following benchmarks were done on an Amazon Web Services i4i.16xlarge instance with 32 physical/64 logical CPU cores and 512 GiB of memory. We used a development version of vg that was effectively the same as v.1.52.0. All tools were allowed to use 32 threads.
We aligned 30× NovaSeq reads for the Genome in a Bottle (GIAB) HG002 sample to various references using Giraffe4 and measured the running time. We also used BWA-MEM15 with a linear reference (GRCh38) as a baseline. For haplotype sampling, we first had to create a sampled graph for HG002. We counted k-mers in the reads using KMC16. Then we took the v.1.1 default graph, containing all haplotypes, and ran haplotype sampling with four, eight, 16 and 32 haplotypes as well as diploid sampling from 32 candidates. Before mapping the reads, we also had to build a distance index and a minimizer index for the sampled graph; these steps are now integrated into Giraffe to be run automatically.
The results can be seen in Fig. 2. Giraffe is faster with all graphs than BWA-MEM is with a linear reference. Mapping the reads to the v.1.1 filtered graph took less time than mapping them to the v.1.0 filtered graph; anecdotally, we observe that faster-to-compute alignments are often of higher quality. Giraffe was even faster with the v.1.1 diploid graph, but the other steps (k-mer counting 432 s, haplotype sampling 341 s, index construction 997 s) increased the overall running time to the same level as with the v.1.0 filtered graph. For sampled graphs without diploid sampling, both the mapping time and the overall time increased with the number of haplotypes.
Fig. 2 |. Mapping 30× NovaSeq reads for HG002 to GRCh38 (with BWA-MEM) and to HPRC graphs (with Giraffe).

The graphs (y axis) are Minigraph–Cactus graphs built using GRCh38 as the reference. For the sampled graphs, we tested sampling 4, 8, 16 and 32 haplotypes. For the v.1.1 diploid graph, 32 candidate haplotypes were used for diploid sampling. We show the overall running time and the time spent for mapping only (left), and the fraction of reads with an exact, gapless, properly paired and Mapq 60 alignment (right).
The most memory-intensive part of the haplotype sampling pipeline is index construction, which takes about 60 GiB with the graphs we used. In pipelines based on a linear reference, the memory bottleneck is typically in the variant calling stage; for example, we observed DeepVariant using 20 GiB. Our memory usage is therefore three times higher than in a pipeline based on a linear reference. Figure 2 also includes some statistics on the mapped reads. Giraffe is more likely to find an exact (that is, all matches) alignment with the v.1.1 diploid graph than with the frequency-filtered graphs, and the reads are more likely to be properly paired. It is also more confident about the alignments to the v.1.1 diploid graph, as evidenced by the higher proportion of reads with mapping quality (Mapq) 60. The variant calling and genotyping results in the following sections show this confidence is well-founded.
When we use sampled graphs without diploid sampling, we find more exact and gapless alignments than with the diploid graph, and the number of such alignments increases with the number of haplotypes. On the other hand, we find fewer properly paired alignments, and a smaller fraction of the alignments get mapping quality 60. This indicates that while the additional haplotypes enable us to find better alignments, some of these alignments are likely wrong.
Calling small variants
We evaluated the impact of haplotype sampling on the accuracy of calling small variants, using the GIAB v.4.2.1 GRCh38 benchmark set17. We mapped PCR-free NovaSeq 40× reads18 with Giraffe4, called variants using DeepVariant11 following the pipeline discussed in ref. 7 and evaluated the calls against the benchmark using hap.py19. We compared the performance of different graphs, including the v.1.1 filtered graph and v.1.1 sampled graphs with four, eight, 16 and 32 haplotypes, as well as the v.1.1 diploid graph (Fig. 3a and Supplementary Table 5). The v.1.1 diploid graph consistently outperformed the other graph configurations, reducing total errors by 9.1% across samples HG001 to HG005 relative to the v.1.1 filtered graph. The performances of the v.1.0 and v.1.1 filtered graphs were similar (Supplementary Table 5). Sampling either four or eight haplotypes improved performance relative to the filtered graphs, although performance decreased as the number of sampled haplotypes grew.
Fig. 3 |. Small variants evaluation across samples HG001 to HG005.

a, The number of false positive (FP) and false negative (FN) indels and single-nucleotide polymorphisms (SNPs) across four different graphs, each using GRCh38 as the reference: v.1.1 filtered, v.1.1 sampled with four and eight haplotypes and v.1.1 diploid, using the Giraffe–DeepVariant pipeline. b, Comparing the Giraffe–DeepVariant pipeline using the v.1.1 diploid graph to the BWA-MEM–DeepVariant and GATK best-practice pipelines, both using the GRCh38 reference. c, The performance of the Giraffe–DeepVariant pipeline using the v.1.1 diploid graph with different coverage levels of NovaSeq reads (20×, 30× and 40×). d, Comparing the number of errors using either NovaSeq 40× data or Element 36× 1,000 bp insert data, in both cases using the Giraffe–DeepVariant pipeline with the v.1.1 diploid graph. HG005 Element sequencing data were not available for comparison.
We compared performance to mapping to a linear reference (GRCh38) with BWA-MEM15 and calling variants with DeepVariant, as well as to the Genome Analysis Toolkit (GATK) best-practice20 pipeline (Fig. 3b). Mapping the same NovaSeq 40× data to the v.1.1 diploid graph with Giraffe reduced errors by an average of 35% relative to BWA-MEM–GRCh38 and by 74% (a near fourfold reduction in total errors) relative to the GATK best-practice pipeline.
We examined the effect of varying read coverage on the performance of the v.1.1 diploid graph (Fig. 3c). We observed large improvements when increasing coverage from 20× to 30× (an average 39.7% reduction in errors) and more minor improvements going from 30× to 40× (an average 7.8% reduction). Notably, for all samples, Giraffe–DeepVariant with 20× data made substantially fewer errors than GATK best practice with 40× data.
To determine how a different read technology would affect the results, we repeated the analysis with sequencing data from Element Biosciences21 (Fig. 3d and Supplementary Table 5). Relative to the Illumina dataset of comparable coverage (Illumina NovaSeq 40× versus Element 36× 1,000 bp insert), we find a net reduction in errors of 22%, recapitulating previous Google–Element results21 showing these Element data to be of excellent quality.
We further assessed the specificity of variant calling by comparing the incidence of false variants among three different tools, across variants that are present in both the pangenome and the National Institute of Standards and Technology (NIST) benchmark, as well as those in the NIST benchmark but absent from the pangenome (Supplementary Fig. 3). This analysis is crucial for understanding the performance of the tools on genomic variants that do not directly correspond to those represented within the pangenome graph. For variants not included in the pangenome, the v.1.1 diploid graph shows an 82% improvement relative to the GATK best-practice pipeline and a 26% improvement relative to BWA-MEM–GRCh38.
Genotyping SVs
We assessed the performance of pangenome SV genotyping algorithms (PanGenie13 and vg call12) using different pangenome graphs. We used the GIAB Tier1 v.0.6 truth set, which comprises the confident SVs for the HG002 sample. Haplotype sampling improves genotyping performance relative to both the v.1.1 default and v.1.1 filtered graphs (Fig. 4a and Supplementary Table 6); the combination of the v.1.1 graphs and haplotype sampling also substantially improves on the performance of the v.1.0 graphs described in ref. 7 for the draft human pangenome. For example, using PanGenie, the F1 score of variant calls increased from 0.7780 with the v.1.0 filtered graph to 0.8926 with the v.1.1 diploid graph.
Fig. 4 |. SVs benchmark evaluation.

a, Precision, recall and F1 scores of both vg call and PanGenie for different pangenome reference graphs on the GIAB v.0.6 Tier1 call set. Graphs were built using GRCh38 as the reference. b, As in a, but using a benchmark set of SVs created with DipCall from the T2T v.0.9 HG002 genome assembly, comparing genome wide but excluding centromeres. c, Comparing the performance of PanGenie and vg call using the v.1.1 diploid graph to other genotyping methods. Illumina short reads were used with Delly28, SvABA32, Scalpel31, Manta30 and MetaSV29, as well as with vg call12 and PanGenie13. Also shown are long-read methods (cuteSV34, Sniffles2 (ref. 35), Hapdup33 and HPRC de novo assemblies7).
Regarding the number of haplotypes sampled, as with small variants, increasing beyond eight hinders genotyping performance, with the v.1.1 diploid graph performing best overall. In general, PanGenie outperforms vg call across different graph builds and sampling strategies, although the difference between vg call and PanGenie for the best-performing graph (the v.1.1 diploid graph) is small (F1 of 0.8926 for PanGenie versus 0.888 for vg call). vg call is also on average 30% faster (Supplementary Table 4 and Supplementary Fig. 4).
We tested the performance of haplotype sampling using the challenging medically relevant genes benchmark from GIAB22 (Supplementary Tables 7 and 9). We find similar trends to those observed genome wide, but performance drops slightly given the difficult nature of these regions. For example, the best-performing method in this benchmark, vg call with the v.1.1 diploid graph, has an F1 score of 0.8683, whereas its F1 score on the v.0.6 GIAB benchmark is 0.888.
The GIAB benchmark sets cover a limited subset of SVs in the genome. To get a more complete assessment of performance, we benchmarked our approach using a call set derived with DipCall23 from a draft telomere-to-telomere (T2T) assembly of the HG002 sample (Fig. 4b and Supplementary Table 8). Excluding centromeric regions, which cannot yet be reliably compared, we generate a call set of 25,732 variants covering 92.02% of the GRCh38 reference; in contrast, the GIAB Tier1 v.0.6 call set contains 9,575 variants covering 86.29% of the reference. Across this larger call set, we find performance patterns similar to the GIAB v.0.6 call set, with an F1 score of 0.8419 genome wide (excluding centromeres) for PanGenie with the v.1.1 diploid graph.
We finally compared our haplotype sampling results to published results24,25 using the GIAB Tier1 v.0.6 SV benchmark set (Fig. 4c and Supplementary Table 12). Relative to contemporary methods using short reads and the GRCh38 reference, our results with the v.1.1 diploid graph are dramatically better (the F1 score for the best-performing short-read method, MetaSV, is 0.29, whereas PanGenie achieves 0.89) and now come close to rivaling those using long reads.
Discussion
We have developed a k-mer-based method for sampling local haplotypes that are similar to a sequenced genome. When a subgraph based on the sampled haplotypes is used as a reference for read mapping, the results are more accurate than with the full graph or with a frequency-filtered graph containing only common variants. This translates into a small improvement in the accuracy of calling small variants and a large improvement in the accuracy of genotyping SVs relative to existing pangenome approaches. While the presented results represent an advance over previous filter-based methods, several limitations exist.
Our k-mer presence matrices are not compressed. With the HPRC v.1.1 graph in the section on ‘Mapping reads’, we used 176 million k-mers for describing the haplotypes. This corresponds to 21 MiB per haplotype, plus overhead. As the number of haplotypes in the graph grows, this can become excessive, especially as we will need more k-mers for describing the haplotypes. The solution is compressing the matrices, possibly using methods developed for compressing the color matrices in colored de Bruijn graphs26. With a suitable compression scheme, we can also take advantage of the compressed representation for faster sampling.
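Concretely, since the matrix stores one presence bit per k-mer per haplotype, the quoted figure follows directly:

$$\frac{176 \times 10^{6}\ \text{bits}}{8\ \text{bits/byte} \times 2^{20}\ \text{bytes/MiB}} \approx 21\ \text{MiB per haplotype}.$$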
We currently sample the haplotypes independently in each block and combine them arbitrarily. This is acceptable with short reads, as the blocks are much longer than the reads. In other applications such as long-read mapping, we need phased haplotypes to ensure contiguity across block boundaries. While PanGenie can infer phased haplotypes, the algorithm does not scale well with the number of haplotypes. As a possible solution, we could sample equivalence classes of (almost) identical haplotypes in each block and then choose specific haplotypes from the equivalence classes to maximize contiguity across block boundaries.
With properly phased haplotypes, we may also be able to drop some restrictions on the graph structure. As long as we sample the haplotypes independently in each block, any recombination of them should be a plausible haplotype, because the haplotypes are all constrained to visit the same blocks in order and match each other at the boundaries. But if we can infer phased haplotypes, we can use the long-distance information in them to ensure plausibility instead of relying on these constraints.
SVs have been notoriously difficult to study due to their complex and repetitive nature27, especially so when using the linear reference in combination with short reads28–32. Substantial improvements have been made with the advent of long-read sequencing7,33–35. However, long reads are comparatively expensive and often unavailable for large cohorts of human samples, such as those underpinning global-scale sequencing projects. The approach we presented, which uses knowledge from population sequencing (and therefore long reads), improves on the state of the art for short-read SV genotyping, allowing the typing of common SV variants with accuracy comparable to long-read methods.
Looking forward, we expect pangenomes from the HPRC and elsewhere to grow substantially in genome number, perhaps into the thousands of haplotypes over the next several years. Methods for personalizing pangenomes will therefore increase in importance as the fraction of all variation within them that is rare naturally expands. Conversely, as pangenomes grow, the expanding fraction of rare variation they cover should allow increasingly accurate personalized pangenomes to be imputed, providing a framework for genome imputation that includes complex structural variation. As a consequence, we expect personalization methods of the type introduced here to become increasingly vital to pangenome workflows.
Online content
Any methods, additional references, Nature Portfolio reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at https://doi.org/10.1038/s41592-024-02407-2.
Methods
Bidirected sequence graphs
Human pangenome graphs are usually based on the bidirected sequence graph model. Nodes have identifiers, and they contain a sequence. Edges are undirected and connect two node sides. A forward traversal of a node enters from the left, reads the sequence and exits from the right. A reverse traversal enters from the right, reads the reverse complement of the sequence and exits from the left.
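As a minimal sketch of this model (illustrative Python types, not vg's internal representation):

```python
from dataclasses import dataclass

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

@dataclass(frozen=True)
class Handle:
    """A node traversal: a node identifier plus an orientation."""
    node_id: int
    is_reverse: bool

def traversal_sequence(sequences: dict[int, str], handle: Handle) -> str:
    """Forward traversals read the node label; reverse traversals
    read its reverse complement."""
    seq = sequences[handle.node_id]
    return seq.translate(COMPLEMENT)[::-1] if handle.is_reverse else seq

def path_sequence(sequences: dict[int, str], path: list[Handle]) -> str:
    """A path's sequence is the concatenation of its traversals."""
    return "".join(traversal_sequence(sequences, h) for h in path)

# Example: node 1 labeled "GAT", node 2 labeled "ACA"; traversing
# node 1 forward and node 2 in reverse spells "GAT" + "TGT".
print(path_sequence({1: "GAT", 2: "ACA"}, [Handle(1, False), Handle(2, True)]))
```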
Bidirected sequence graphs can be stored in the text-based Graphical Fragment Assembly (GFA) format. The GBZ file format36 is a space-efficient binary format for pangenome graphs. It is based on the GBWT index37, which stores haplotype paths as node sequences. GBZ is compatible with a subset of GFA, and the two formats can be converted to each other efficiently.
The structure of a bidirected sequence graph can be described hierarchically by its snarl decomposition38. A snarl is a generalization of a bubble39, and denotes a site of genomic variation. It is a subgraph separated by two node sides from the rest of the graph. Each snarl must be minimal in the sense that neither of the sides defining it forms a snarl with a node side located within the subgraph. A graph can be decomposed into a set of chains, each of which is a sequence of nodes and snarls. A snarl may either be primitive, or it may be further decomposed into a set of chains.
Preprocessing haplotypes
We assume that the graph is in the GBZ format. While the format is space-efficient and supports efficient queries, it is not suitable for selecting haplotypes based on sequence similarity. We therefore need to preprocess the graph and store the haplotypes in a more appropriate format.
Our preprocessing approach is similar to that used in PanGenie13 (Fig. 1a). We assume that each weakly connected component in the graph corresponds to a single top-level chain in the snarl decomposition of the graph. While PanGenie combines bubbles that are less than k bp apart (default k = 31), we combine adjacent snarls in the top-level chains into approximately b bp blocks (default b = 10,000), using a minimum distance index40 for determining the length of a block.
Top-level chains generally correspond to chromosomes or other meaningful contigs. Because we partition each top-level chain into a sequence of blocks, the true haplotype paths corresponding to that contig visit the same blocks in the same order. If we sample the haplotypes independently in each block, we can therefore assume that any recombination of them is a plausible haplotype. Despite this linear high-level structure, the graph may contain reversals that let the same real haplotype visit the same blocks multiple times. To avoid sampling unbalanced recombinations, we consider each minimal end-to-end visit to a block a separate haplotype in that block. We find such visits efficiently by listing the paths visiting the border nodes of a block using the r-index41 and matching the node visits at both ends of the block.
We describe the haplotypes in each block in terms of k-mers that are specific to the block (Fig. 1b). While PanGenie uses k-mers (default k = 31) with at most a single hit in each haplotype (but possibly multiple hits within the bubble), we use minimizers (default k = 29, w = 11) with a single hit in the graph (but possibly multiple hits in a haplotype). We also avoid uninformative k-mers that are present in each haplotype. For each block, we then build a k-mer presence matrix: a binary matrix that marks the presence or absence of each selected graph-unique k-mer in each haplotype.
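The following sketch builds such a presence matrix under simplifying assumptions: haplotypes are plain strings, every k-mer is considered rather than minimizers, and graph uniqueness is approximated within the block; all names are illustrative.

```python
import numpy as np
from collections import Counter

def kmers(seq, k):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def presence_matrix(haplotypes, k=29):
    """Binary matrix marking which block-specific k-mers occur in
    which haplotype. Simplified: the real pipeline uses minimizers
    with a single hit in the whole graph; here we only drop
    uninformative k-mers present in every haplotype."""
    counts = Counter()
    for h in haplotypes:
        counts.update(set(kmers(h, k)))   # count haplotypes, not copies
    selected = sorted(x for x, c in counts.items() if c < len(haplotypes))
    index = {x: j for j, x in enumerate(selected)}
    matrix = np.zeros((len(haplotypes), len(selected)), dtype=np.uint8)
    for i, h in enumerate(haplotypes):
        for x in set(kmers(h, k)):
            if x in index:
                matrix[i, index[x]] = 1
    return selected, matrix
```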
Sampling haplotypes
The haplotype sampling workflow uses k-mer counts in the reads to classify the k-mers used in the matrices as absent, heterozygous, homozygous or frequent in the sample (Fig. 1c). Then it samples the most relevant haplotypes independently in each block and combines them to form full-length haplotypes (Fig. 1d).
We start by counting k-mers in the reads. Any external tool that supports the k-mer file format (KFF)42 can be used. The sampling algorithm ignores k-mers with a single occurrence in the reads, and it combines the counts of a k-mer and its reverse complement. It also ignores all k-mers not used in the k-mer presence matrices.
If k-mer coverage is not provided, we estimate it from the counts. If the most common count is above the median, we use it as the estimate. Otherwise, if there is a good enough secondary peak above the median at approximately twice the primary peak, we use it. If both attempts fail, the user must provide an estimate. Given the k-mer coverage and k-mer counts, we classify each k-mer in the matrices as absent, heterozygous, homozygous or frequent. We then ignore all frequent k-mers. See Supplementary Information Section 2.1 for further details. While this classification is inherently noisy, on aggregate, it is useful for our purposes.
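A sketch of both steps follows; the secondary-peak tolerance and the classification boundaries are illustrative assumptions, not vg's actual values.

```python
import statistics
from collections import Counter

def estimate_kmer_coverage(counts):
    """Heuristic coverage estimate following the text: use the most
    common count if it lies above the median; otherwise accept a
    secondary peak above the median near twice the primary one."""
    histogram = Counter(counts)
    median = statistics.median(counts)
    primary = histogram.most_common(1)[0][0]   # most common count value
    if primary > median:
        return primary
    above = [c for c in histogram if c > median]
    if above:
        secondary = max(above, key=lambda c: histogram[c])
        if abs(secondary - 2 * primary) <= 0.25 * primary:
            return secondary
    raise ValueError("cannot estimate coverage; please provide it")

def classify(count, coverage):
    """Classify one k-mer by its read count relative to the k-mer
    coverage; thresholds here are illustrative assumptions."""
    if count < 0.25 * coverage:
        return "absent"        # too few copies to be a real allele
    if count < 0.75 * coverage:
        return "heterozygous"  # roughly half coverage: one copy
    if count < 1.75 * coverage:
        return "homozygous"    # roughly full coverage: two copies
    return "frequent"          # repetitive; ignored downstream
```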
We sample the haplotypes independently in each block. This is in contrast to PanGenie, which attempts to infer phased haplotypes. Sampling is based on a greedy algorithm that selects the highest-scoring haplotype and then adjusts the scores for the remaining haplotypes.
Let $K$ be the set of k-mers used to describe the haplotypes in a block. For each k-mer $x \in K$, let $s(x)$ be the current score for the k-mer. Given a haplotype $H$ in the block, let $H(x)$ be a function such that $H(x) = 1$ if $x \in H$ and $H(x) = -1$ if $x \notin H$. We score each haplotype as the sum of k-mer scores:

$$\mathrm{score}(H) = \sum_{x \in K} H(x) \cdot s(x).$$

We initialize k-mer scores as $s(x) = 1$ for homozygous k-mers, $s(x) = -0.8$ for absent k-mers and $s(x) = 0$ for heterozygous and frequent k-mers. The first haplotype we sample is therefore an approximation of the consensus sequence. With the subsequent haplotypes, we aim to cover all homozygous k-mers, as we may not have any reference haplotypes that contain all the k-mers that should be homozygous according to the k-mer counts. We also aim to select haplotypes both with and without heterozygous k-mers. For this purpose, we update k-mer scores and rescore the haplotypes every time after selecting a haplotype. Let $H'$ be the haplotype we just selected.

If $x$ is a homozygous k-mer and $x \in H'$, we discount its score by a multiplicative factor: $s(x) \leftarrow 0.9 \cdot s(x)$.

If $x$ is a heterozygous k-mer, we adjust its score by an additive term to make the opposite outcome more likely: $s(x) \leftarrow s(x) - 0.05$ if $x \in H'$, and $s(x) \leftarrow s(x) + 0.05$ otherwise.
The numerical parameters were selected in a parameter sweep that aimed to maximize the F1 scores for calling single-nucleotide polymorphisms and small indels with DeepVariant11. See Supplementary Information Section 2.2 for a discussion.
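Putting the scoring and the updates together, here is a minimal sketch of the greedy sampler; the default constants mirror the description above but are assumptions, and names and data layout are illustrative, not vg's implementation.

```python
def greedy_sample(matrix, classes, n,
                  absent_score=-0.8, discount=0.9, adjustment=0.05):
    """Greedy local haplotype sampling sketch. matrix[i][j] is 1 if
    haplotype i contains k-mer j; classes[j] is the k-mer's class.
    Returns the indices of the selected haplotypes."""
    m = len(classes)
    score = [1.0 if classes[j] == "homozygous"
             else absent_score if classes[j] == "absent"
             else 0.0                     # heterozygous and frequent
             for j in range(m)]
    selected = []
    remaining = set(range(len(matrix)))
    for _ in range(min(n, len(matrix))):
        def total(i):
            # score(H) = sum over k-mers of H(x) * s(x), with H(x) = +/-1
            return sum(score[j] if matrix[i][j] else -score[j]
                       for j in range(m))
        best = max(remaining, key=total)
        remaining.remove(best)
        selected.append(best)
        for j in range(m):                # adjust scores after selection
            if classes[j] == "homozygous" and matrix[best][j]:
                score[j] *= discount
            elif classes[j] == "heterozygous":
                score[j] += -adjustment if matrix[best][j] else adjustment
    return selected
```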
After sampling a number of haplotypes (typically four to 32), we may use the selected haplotypes directly. Alternatively, we can use them as candidates in optional diploid sampling. In diploid sampling, we consider each pair of candidates $(H_1, H_2)$ and select the highest-scoring pair. The scoring is:

$$\mathrm{score}(H_1, H_2) = \sum_{x \in K} \big[\, c(x, H_1, H_2) = \hat{c}(x) \,\big],$$

where $c(x, H_1, H_2)$ is the number of times (0, 1 or 2) k-mer $x$ occurs in the haplotypes $H_1$ and $H_2$, $\hat{c}(x)$ is the number of times (again 0, 1 or 2) it should occur according to the classification of the k-mer, and $[\cdot]$ is 1 if the condition holds and 0 otherwise.
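A sketch of diploid sampling under the agreement-counting objective reconstructed above (an assumption about the exact objective); frequent k-mers are ignored, as in the text, and all names are illustrative.

```python
from itertools import combinations

EXPECTED = {"absent": 0, "heterozygous": 1, "homozygous": 2}

def diploid_sample(matrix, classes, candidates):
    """Pick the candidate pair whose k-mer occurrence counts agree
    with the expected counts for the most k-mers."""
    def pair_score(a, b):
        return sum((matrix[a][j] + matrix[b][j]) == EXPECTED[classes[j]]
                   for j in range(len(classes))
                   if classes[j] in EXPECTED)   # skip frequent k-mers
    return max(combinations(candidates, 2), key=lambda p: pair_score(*p))
```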
After sampling local haplotypes in each block of a top-level chain, we combine them to form full-length haplotypes. If the same haplotype was selected in adjacent blocks, we connect them together. Otherwise, the haplotypes we form are arbitrary recombinations of the local haplotypes we sampled. We insert the haplotypes into an empty GBWT index, along with any reference paths for that chain if we want to include them in the personalized pangenome reference.
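A sketch of this stitching step, assuming each block provides the same number of selected haplotype "slots" and representing each selection as a (haplotype name, node path) pair; the slot-wise combination is an illustrative simplification.

```python
def stitch(blocks):
    """Combine locally sampled haplotypes into full-length paths: if
    slot i in adjacent blocks selected the same original haplotype,
    the pieces join seamlessly; otherwise the join is a recombination.
    Returns (node path, number of recombinations) per output path."""
    n = min(len(b) for b in blocks)          # haplotypes per block
    full = []
    for slot in range(n):
        path, joins, prev_name = [], 0, None
        for block in blocks:
            name, piece = block[slot]
            if prev_name is not None and name != prev_name:
                joins += 1                   # recombination at boundary
            path.extend(piece)
            prev_name = name
        full.append((path, joins))
    return full
```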
The sampling process can be parallelized over the top-level chains. Because top-level chains correspond to weakly connected components in the graph, the resulting GBWT indexes are disjoint and can be merged efficiently. Once the final GBWT is made, we build the GBZ graph induced by the haplotypes in the index. Because this sampled graph is a subgraph of the original graph, any alignment in it is also valid in the original graph.
Additional information
Supplementary information The online version contains supplementary material available at https://doi.org/10.1038/s41592-024-02407-2.
Acknowledgements
This work was supported in part by the National Human Genome Research Institute and the National Institutes of Health (NIH). B.P. was partly supported by NIH grant nos. R01HG010485, U24HG010262, U24HG011853, OT3HL142481, U01HG010961 and OT2OD033761. The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.
Footnotes
Competing interests
P.-C.C. and A.C. are employees of Google LLC and own Alphabet stock as part of the standard compensation package. The other authors declare no competing interests.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
This work was done using publicly available data. HPRC v.1.1 graphs and VCF files for the variants included in them are available at https://github.com/human-pangenomics/hpp_pangenome_resources. The underlying assemblies, including GRCh38, can be found at https://github.com/human-pangenomics/HPP_Year1_Assemblies. We used Illumina and Element short reads for HG001, HG002, HG003, HG004 and HG005 available at https://console.cloud.google.com/storage/browser/brain-genomics-public/research/sequencing/fastq/novaseq/wgs_pcr_free and https://console.cloud.google.com/storage/browser/brain-genomics-public/research/element/cloudbreak_wgs, respectively. The GIAB small variant benchmark sets for the same samples can be found at https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/. The GIAB and challenging medically relevant gene SV sets for HG002 are available at the same location. The T2T assembly of HG002 is available at https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/HG002/assemblies/hg002v0.9.fasta.gz. See Supplementary Section 1 for further details.
Code availability
The haplotype sampling approach described in this article is part of the vg toolkit available under MIT license at https://github.com/vgteam/vg. There is an example dataset in directory test/haplotype-sampling. Documentation can be found at https://github.com/vgteam/vg/wiki/Haplotype-Sampling. See Supplementary Sections 4 and 5 for details on other software used.
References
- 1. Eizenga, J. M. et al. Pangenome graphs. Annu. Rev. Genomics Hum. Genet. 21, 139–162 (2020).
- 2. Garrison, E. et al. Variation graph toolkit improves read mapping by representing genetic variation in the reference. Nat. Biotechnol. 36, 875–879 (2018).
- 3. Rautiainen, M. & Marschall, T. GraphAligner: rapid and versatile sequence-to-graph alignment. Genome Biol. 21, 253 (2020).
- 4. Sirén, J. et al. Pangenomics enables genotyping of known structural variants in 5202 diverse genomes. Science 374, abg8871 (2021).
- 5. The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature 526, 68–74 (2015).
- 6. Pritt, J., Chen, N.-C. & Langmead, B. FORGe: prioritizing variants for graph genomes. Genome Biol. 19, 220 (2018).
- 7. Liao, W.-W. et al. A draft human pangenome reference. Nature 617, 312–324 (2023).
- 8. Dilthey, A., Cox, C., Iqbal, Z., Nelson, M. R. & McVean, G. Improved genome inference in the MHC using a population reference graph. Nat. Genet. 47, 682–688 (2015).
- 9. Vaddadi, K., Mun, T. & Langmead, B. Minimizing reference bias with an impute-first approach. Preprint at bioRxiv https://doi.org/10.1101/2023.11.30.568362 (2023).
- 10. Hickey, G. et al. Pangenome graph construction from genome alignments with Minigraph-Cactus. Nat. Biotechnol. https://doi.org/10.1038/s41587-023-01793-w (2023).
- 11. Poplin, R. et al. A universal SNP and small-indel variant caller using deep neural networks. Nat. Biotechnol. 36, 983–987 (2018).
- 12. Hickey, G. et al. Genotyping structural variants in pangenome graphs using the vg toolkit. Genome Biol. 21, 35 (2020).
- 13. Ebler, J. et al. Pangenome-based genome inference allows efficient and accurate genotyping across a wide spectrum of variant classes. Nat. Genet. 54, 518–525 (2022).
- 14. Human Pangenome Reference Consortium. HPRC pangenome resources. GitHub https://github.com/human-pangenomics/hpp_pangenome_resources/ (2023).
- 15. Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. Preprint at arXiv https://arxiv.org/abs/1303.3997 (2013).
- 16. Kokot, M., Długosz, M. & Deorowicz, S. KMC 3: counting and manipulating k-mer statistics. Bioinformatics 33, 2759–2761 (2017).
- 17. Wagner, J. et al. Benchmarking challenging small variants with linked and long reads. Cell Genom. 2, 100128 (2022).
- 18. Baid, G. et al. An extensive sequence dataset of gold-standard samples for benchmarking and development. Preprint at bioRxiv https://doi.org/10.1101/2020.12.11.422022 (2020).
- 19. Krusche, P. et al. Best practices for benchmarking germline small-variant calls in human genomes. Nat. Biotechnol. 37, 555–560 (2019).
- 20. Poplin, R. et al. Scaling accurate genetic variant discovery to tens of thousands of samples. Preprint at bioRxiv https://doi.org/10.1101/201178 (2018).
- 21. Carroll, A. et al. Accurate human genome analysis with Element Avidity sequencing. Preprint at bioRxiv https://doi.org/10.1101/2023.08.11.553043 (2023).
- 22. Wagner, J. et al. Curated variation benchmarks for challenging medically relevant autosomal genes. Nat. Biotechnol. 40, 672–680 (2022).
- 23. Li, H. et al. A synthetic-diploid benchmark for accurate variant-calling evaluation. Nat. Methods 15, 595–597 (2018).
- 24. Zook, J. M. et al. A robust benchmark for detection of germline large deletions and insertions. Nat. Biotechnol. 38, 1347–1355 (2020).
- 25. Kolmogorov, M. et al. Scalable nanopore sequencing of human genomes provides a comprehensive view of haplotype-resolved variation and methylation. Nat. Methods 20, 1483–1492 (2023).
- 26. Marchet, C. et al. Data structures based on k-mers for querying large collections of sequencing data sets. Genome Res. 31, 1–12 (2021).
- 27. Feuk, L., Carson, A. R. & Scherer, S. W. Structural variation in the human genome. Nat. Rev. Genet. 7, 85–97 (2006).
- 28. Rausch, T. et al. DELLY: structural variant discovery by integrated paired-end and split-read analysis. Bioinformatics 28, i333–i339 (2012).
- 29. Mohiyuddin, M. et al. MetaSV: an accurate and integrative structural-variant caller for next generation sequencing. Bioinformatics 31, 2741–2744 (2015).
- 30. Chen, X. et al. Manta: rapid detection of structural variants and indels for germline and cancer sequencing applications. Bioinformatics 32, 1220–1222 (2016).
- 31. Fang, H. et al. Indel variant analysis of short-read sequencing data with Scalpel. Nat. Protoc. 11, 2529–2548 (2016).
- 32. Wala, J. A. et al. SvABA: genome-wide detection of structural variants and indels by local assembly. Genome Res. 28, 581–591 (2018).
- 33. Kolmogorov, M., Yuan, J., Lin, Y. & Pevzner, P. A. Assembly of long, error-prone reads using repeat graphs. Nat. Biotechnol. 37, 540–546 (2019).
- 34. Jiang, T. et al. Long-read-based human genomic structural variation detection with cuteSV. Genome Biol. 21, 189 (2020).
- 35. Smolka, M. et al. Detection of mosaic and population-level structural variants with Sniffles2. Nat. Biotechnol. https://doi.org/10.1038/s41587-023-02024-y (2024).
- 36. Sirén, J. & Paten, B. GBZ file format for pangenome graphs. Bioinformatics 38, 5012–5018 (2022).
- 37. Sirén, J., Garrison, E., Novak, A. M., Paten, B. & Durbin, R. Haplotype-aware graph indexes. Bioinformatics 36, 400–407 (2020).
- 38. Paten, B. et al. Superbubbles, ultrabubbles, and cacti. J. Comput. Biol. 25, 649–663 (2018).
- 39. Zerbino, D. R. & Birney, E. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 18, 821–829 (2008).
- 40. Chang, X., Eizenga, J., Novak, A. M., Sirén, J. & Paten, B. Distance indexing and seed clustering in sequence graphs. Bioinformatics 36, i146–i153 (2020).
- 41. Gagie, T., Navarro, G. & Prezza, N. Fully functional suffix trees and optimal text searching in BWT-runs bounded space. J. ACM 67, 2 (2020).
- 42. Dufresne, Y. et al. The k-mer file format: a standardized and compact disk representation of sets of k-mers. Bioinformatics 38, 4423–4425 (2022).