Abstract
Pangenomes reduce reference bias by representing genetic diversity better than a single reference sequence. Yet when comparing a sample to a pangenome, variants in the pangenome that are not part of the sample can be misleading, for example, by causing false read mappings. These irrelevant variants generally have lower allele frequencies, and a common mitigation has been to filter out rare variants. However, this blunt heuristic both fails to remove some irrelevant variants and removes many relevant ones. We propose a new approach that imputes a personalized pangenome subgraph by sampling local haplotypes according to k-mer counts in the reads. We implement the approach in the vg toolkit (https://github.com/vgteam/vg) for the Giraffe short-read aligner and compare its accuracy to state-of-the-art methods using human pangenome graphs from the Human Pangenome Reference Consortium. Using the personalized pangenomes reduces small-variant genotyping errors approximately fourfold relative to the Genome Analysis Toolkit and makes short-read structural variant genotyping of known variants competitive with long-read variant discovery methods.
Pangenome graphs1 represent aligned haplotype sequences. Each node in the graph is labeled with a sequence, and each path represents the sequence obtained by concatenating the node labels. While any path in the graph is a potential haplotype, most paths are unlikely recombinations of true haplotypes.
One common application for pangenome graphs is as a reference for read mapping2–4. Graph-based aligners tend to be more accurate than aligners using a linear reference sequence when mapping reads to regions where the sequenced genome differs from the reference sequence. If the graph contains another haplotype close enough to the sequenced genome in that region, the aligner can usually find the correct mapping. On the other hand, sequence variation that is present in the graph but not in the sequenced genome can make the aligner less accurate. Such variation can imitate other regions, making incorrect mappings more likely.
Because the benefit of including the right variants in the graph outweighs the cost of including variants absent from the sequenced genome, the usual approach is to build the graph from common variants only. The vg short-read aligner2 achieved its best results with the 1000 Genomes Project (1000GP)5 graph when variants below a 1% allele-frequency threshold were left out. Pritt et al.6 developed a model for selecting variants based not only on their frequency but also on their effect on the repetitiveness of the reference.
The Giraffe short-read aligner4 uses another approach for avoiding rare variants. As a haplotype-aware aligner, Giraffe can generate synthetic haplotypes by sampling the existing ones proportionally. The algorithm considers k-node subpaths (default k = 4) and extends the haplotype paths it generates greedily: it looks at the last k − 1 nodes and all possible single-node extensions, and selects the extension that best maintains proportionality. Giraffe achieved its best results with the 1000GP graph by generating 64 haplotypes. However, frequency filtering was still used with the Human Pangenome Reference Consortium (HPRC)7 subgraph intended for Giraffe: as the initial graphs contain only 90 haplotypes, the frequency threshold was set to 10%.
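To make this greedy proportional sampling concrete, the following is a minimal sketch under simplifying assumptions: haplotypes are lists of node identifiers, generation terminates at chain ends, and the function name and tie-breaking are illustrative rather than Giraffe's actual implementation (here k is the node-window size, not a k-mer length).

```python
from collections import Counter

def proportional_haplotypes(paths, n=64, k=4):
    """Sketch of greedy proportional haplotype generation: produce n
    synthetic paths whose k-node subpath frequencies track those of
    the input haplotypes (given as lists of node IDs)."""
    # Frequencies of k-node subpaths in the input haplotypes.
    target = Counter(tuple(p[i:i + k])
                     for p in paths for i in range(len(p) - k + 1))
    total = sum(target.values())
    starts = Counter(tuple(p[:k - 1]) for p in paths if len(p) >= k)
    used = Counter()          # subpaths consumed by the output so far
    out = []
    for _ in range(n):
        path = list(max(starts, key=starts.get))
        while True:
            ctx = tuple(path[-(k - 1):])
            options = [w for w in target if w[:k - 1] == ctx]
            if not options:   # reached the end of the chain
                break
            # Greedy choice: extend with the subpath whose share of the
            # output so far lags its share of the input the most.
            nxt = min(options,
                      key=lambda w: used[w] / n - target[w] / total)
            used[nxt] += 1
            path.append(nxt[-1])
        out.append(path)
    return out
```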
Another alternative is building a personalized reference based on the reads. Dilthey et al.8 used a hidden Markov model to infer a diploid reference for the major histocompatibility complex region according to k-mer counts in the reads. Their personalized reference was a series of bubbles connected by shared sequence. The method was variant-centric, and phasing was not maintained from bubble to bubble.
Vaddadi et al.9 proposed a VCF-based pipeline for building a personalized diploid reference genome. They genotyped variants in a reference VCF using subsampled reads and imputed the haplotypes based on the reference panel. The resulting personalized VCF file could be used for building a graph for any graph-based read mapping and variant calling pipeline.
In this paper, we propose a more direct approach with less overhead. We create a personalized pangenome reference by sampling haplotypes that are similar to the sequenced genome according to k-mer counts in the reads. We work directly with assembled haplotypes and maintain phasing within 10 kbp (kilobase pair) blocks. The sampled graph we build is a subgraph of the original graph. Therefore, any alignments in the sampled graph are also valid in the original graph.
Our approach is tailored for Giraffe, as the indexes it needs for read mapping can be built quickly. We assume a graph with a linear high-level structure, such as graphs built using the Minigraph–Cactus pipeline10. We further assume that read coverage is high enough (at least 20×) that we can reliably classify k-mers into absent, heterozygous and homozygous according to k-mer counts.
When used with HPRC graphs, our sampling approach increases the overall running time of the pipeline by less than 15 minutes. We show that the sampled graph is a better mapping target than the universal frequency-filtered graph. We see a small improvement in the accuracy of calling small variants with DeepVariant11 and a large increase in the accuracy of genotyping structural variants (SVs) with vg12 and PanGenie13.
Results
Haplotype sampling pipeline
Our haplotype sampling pipeline partitions the original pangenome graph into nonoverlapping blocks with a target length of 10 kbp (Fig. 1a). In each block, we find k-mers that are specific to the block and describe the local haplotypes by the presence or absence of these k-mers (Fig. 1b). The information is stored as a binary k-mer presence matrix. This preprocessing is done once per graph.
Fig. 1 |. Illustrating haplotype sampling at adjacent blocks in the pangenome.

a, A variation graph representing adjacent locations in the pangenome, composed of a bidirected sequence graph (top) and a set of embedded reference haplotypes (below); vertical alignment and base labels indicate the correspondence between each haplotype and its path within the sequence graph; the dotted lines represent the boundary between the two blocks; for clarity, non-varying bases (those present in all haplotypes) are omitted. b, k-mers that occur once within the graph, termed graph-unique k-mers, are identified in the haplotypes; here k = 5 and graph-unique k-mers are colored red. The presence and absence of these graph-unique k-mers identify each haplotype. c, The graph-unique k-mers are counted in the reads (here each read is a rectangle, with only reads containing an informative k-mer shown) and, based on these counts, classified as present and likely heterozygous (orange), present and likely homozygous (blue) or absent (all red k-mers in b not identified in the reads). d, Using the graph-unique k-mer classifications, a subset of reference haplotypes is selected at each location, defining a personalized pangenome reference subgraph of the larger graph (grayed nodes are not part of the subgraph, and only the shown embedded haplotypes are included). Where needed, recombinations are introduced (lightning bolt) to create contiguous haplotypes.
For each sample, we count k-mers in the reads. We then use the k-mer counts to classify the k-mers in the matrices as absent, heterozygous or homozygous in the sample (Fig. 1c). We heuristically sample a number of the most relevant haplotypes in each block according to this classification. We may use the sampled haplotypes directly, or we can use them as candidates for selecting the best pair in diploid sampling. Then we find the subgraph of the original graph induced by the selected haplotypes and reference sequences (Fig. 1d). The resulting personalized pangenome can then be used as a reference for read mapping and downstream analysis.
HPRC pangenome graphs
The experiments below feature pangenome graphs constructed from the HPRC draft pangenome dataset. They were all generated with Minigraph–Cactus10 using the same 90 haplotypes (44 diploid samples, GRCh38 and T2T-CHM13), whose assemblies are described in detail in ref. 7. The graphs do not contain the samples HG001 to HG005 we use in read mapping, genotyping and variant calling experiments.
The graphs were referenced on GRCh38, meaning this genome was left acyclic and unclipped by Minigraph–Cactus. The needs of the present study partly drove the creation of the v.1.1 release of these graphs14, built with a newer version of the tools; this release is the primary version used here. Most changes to the graph construction methodology are detailed in ref. 10. One change that was not previously described but is relevant for this work is a filter that guarantees that each connected component has a single top-level chain. This logic, implemented in vg clip and activated in Minigraph–Cactus as of v.2.6.3, repeatedly removes all nodes that have one side of degree 0 (no edges attached) and that are not contained in the reference path, until none are left. Any edges on the left side of the first node of a reference path, or on the right side of its last node, are likewise removed.
The resulting graph has exactly two ‘tips’ (node sides with degree 0) per connected component, corresponding to the two ends of the reference chromosome. The snarl decomposition uses these two tips to root the snarl tree, guaranteeing a single top-level chain that can be used to guide the haplotype sampling algorithm described above. The number of nodes, tips, edges and edges removed by this procedure, as well as its impact on the number of top-level chains, is shown in Supplementary Table 3.
Allele-frequency-filtered versions of each graph were obtained from Minigraph–Cactus using a filter threshold of nine (that is, 10% of haplotypes). Sampling was only run on the unfiltered (default) graphs. Control results are presented for both the v.1.1 graphs and the v.1.0 graphs that were released alongside the HPRC data7. As the v.1.0 graphs have many top-level chains per component, they cannot be used with the haplotype sampling approach described in this paper.
Mapping reads
The following benchmarks were done on an Amazon Web Services i4i.16xlarge instance with 32 physical/64 logical CPU cores and 512 GiB of memory. We used a development version of vg that was effectively the same as v.1.52.0. All tools were allowed to use 32 threads.
We aligned 30× NovaSeq reads for the Genome in a Bottle (GIAB) HG002 sample to various references using Giraffe4 and measured the running time. We also used BWA-MEM15 with a linear reference (GRCh38) as a baseline. For haplotype sampling, we first had to create a sampled graph for HG002. We counted k-mers in the reads using KMC16. Then we took the v.1.1 default graph, containing all haplotypes, and ran haplotype sampling with four, eight, 16 and 32 haplotypes as well as diploid sampling from 32 candidates. Before mapping the reads, we also had to build a distance index and a minimizer index for the sampled graph; these steps are now integrated into Giraffe to be run automatically.
The results can be seen in Fig. 2. Giraffe is faster with all graphs than BWA-MEM is with a linear reference. Mapping the reads to the v.1.1 filtered graph took less time than mapping them to the v.1.0 filtered graph; anecdotally, we observe that faster-to-compute alignments are often of higher quality. Giraffe was even faster with the v.1.1 diploid graph, but the other steps (k-mer counting 432 s, haplotype sampling 341 s, index construction 997 s) increased the overall running time to the same level as with the v.1.0 filtered graph. For sampled graphs without diploid sampling, both the mapping time and the overall time increased with the number of haplotypes.
Fig. 2 |. Mapping 30× NovaSeq reads for HG002 to GRCh38 (with BWA-MEM) and to HPRC graphs (with Giraffe).

The graphs (y axis) are Minigraph–Cactus graphs built using GRCh38 as the reference. For the sampled graphs, we tested sampling 4, 8, 16 and 32 haplotypes. For the v.1.1 diploid graph, 32 candidate haplotypes were used for diploid sampling. We show the overall running time and the time spent for mapping only (left), and the fraction of reads with an exact, gapless, properly paired and Mapq 60 alignment (right).
The most memory-intensive part of the haplotype sampling pipeline is index construction, which takes about 60 GiB with the graphs we used. In pipelines based on a linear reference, the memory bottleneck is typically in the variant calling stage; for example, we observed DeepVariant using 20 GiB. Our memory usage is therefore three times higher than in a pipeline based on a linear reference. Figure 2 also includes some statistics on the mapped reads. Giraffe is more likely to find an exact (that is, all matches) alignment with the v.1.1 diploid graph than with the frequency-filtered graphs, and the reads are more likely to be properly paired. It is also more confident about the alignments to the v.1.1 diploid graph, as evidenced by the higher proportion of reads with mapping quality (Mapq) 60. The variant calling and genotyping results in the following sections show this confidence is well-founded.
When we use sampled graphs without diploid sampling, we find more exact and gapless alignments than with the diploid graph, and the number of such alignments increases with the number of haplotypes. On the other hand, we find fewer properly paired alignments, and a smaller fraction of the alignments get mapping quality 60. This indicates that while the additional haplotypes enable us to find better alignments, some of these alignments are likely wrong.
Calling small variants
We evaluated the impact of haplotype sampling on the accuracy of calling small variants, using the GIAB v.4.2.1 GRCh38 benchmark set17. We mapped PCR-free NovaSeq 40× reads18 with Giraffe4, called variants using DeepVariant11 following the pipeline discussed in ref. 7 and evaluated the calls against the benchmark using hap.py19. We compared the performance of different graphs, including the v.1.1 filtered graph and v.1.1 sampled graphs with four, eight, 16 and 32 haplotypes, as well as the v.1.1 diploid graph (Fig. 3a and Supplementary Table 5). The v.1.1 diploid graph consistently outperformed the other graph configurations, reducing total errors by 9.1% across samples HG001 to HG005 relative to the v.1.1 filtered graph. The performances of the v.1.0 and v.1.1 filtered graphs were similar (Supplementary Table 5). Sampling either four or eight haplotypes improved performance relative to the filtered graphs, although performance decreased as the number of sampled haplotypes grew.
Fig. 3 |. Small variants evaluation across samples HG001 to HG005.

a, The number of false positive (FP) and false negative (FN) indels and single-nucleotide polymorphisms (SNPs) across four different graphs, each using GRCh38 as the reference: v.1.1 filtered, v.1.1 sampled with four and eight haplotypes and v.1.1 diploid, using the Giraffe–DeepVariant pipeline. b, Comparing the Giraffe–DeepVariant pipeline using the v.1.1 diploid graph to the BWA-MEM–DeepVariant and GATK best-practice pipelines, both using the GRCh38 reference. c, The performance of the Giraffe–DeepVariant pipeline using the v.1.1 diploid graph with different coverage levels of NovaSeq reads (20×, 30× and 40×). d, Comparing the number of errors using either NovaSeq 40× data or Element 36× 1,000 bp insert data, in both cases using the Giraffe–DeepVariant pipeline with the v.1.1 diploid graph. HG005 Element sequencing data were not available for comparison.
We compared performance to mapping to a linear reference (GRCh38) with BWA-MEM15 and calling variants with DeepVariant, as well as to the Genome Analysis Toolkit (GATK) best-practice20 pipeline (Fig. 3b). Mapping the same NovaSeq 40× data to the v.1.1 diploid graph with Giraffe reduced errors by an average of 35% relative to BWA-MEM–GRCh38 and by 74% (a near fourfold reduction in total errors) relative to the GATK best-practice pipeline.
We examined the effect of varying read coverage on the performance of the v.1.1 diploid graph (Fig. 3c). We observed large improvements when increasing coverage from 20× to 30× (an average 39.7% reduction in errors) and more minor improvements going from 30× to 40× (an average 7.8% reduction). Notably, for all samples, Giraffe–DeepVariant with 20× data made substantially fewer errors than GATK best practice with 40× data.
To determine how a different read technology would affect the results, we repeated the analysis with sequencing data from Element Biosciences21 (Fig. 3d and Supplementary Table 5). Relative to the Illumina dataset of comparable coverage (Illumina NovaSeq 40× versus Element 36× 1,000 bp insert), we find a net reduction in errors of 22%, recapitulating previous Google–Element results21 showing these Element data to be of excellent quality.
We further assessed the specificity of variant calling by comparing the incidence of false variants among three different tools, across variants that are present in both the pangenome and the National Institute of Standards and Technology (NIST) benchmark, as well as those in the NIST benchmark but absent from the pangenome (Supplementary Fig. 3). This analysis is crucial for understanding the performance of the tools on genomic variants that do not directly correspond to those represented within the pangenome graph. For variants not included in the pangenome, the v.1.1 diploid graph shows an 82% improvement relative to the GATK best-practice pipeline and a 26% improvement relative to BWA-MEM–GRCh38.
Genotyping SVs
We assessed the performance of pangenome SV genotyping algorithms (PanGenie13 and vg call12) using different pangenome graphs. We used the GIAB Tier1 v.0.6 truth set, which comprises the confident SVs for the HG002 sample. Haplotype sampling improves genotyping performance relative to both the v.1.1 default and v.1.1 filtered graphs (Fig. 4a and Supplementary Table 6); the combination of the v.1.1 graphs and haplotype sampling also substantially improves on the performance of the v.1.0 graphs described in ref. 7 for the draft human pangenome. For example, using PanGenie, the F1 score of variant calls increased from 0.7780 with the v.1.0 filtered graph to 0.8926 with the v.1.1 diploid graph.
Fig. 4 |. SVs benchmark evaluation.

a, Precision, recall and F1 scores of both vg call and PanGenie for different pangenome reference graphs on the GIAB v.0.6 Tier1 call set. Graphs were built using GRCh38 as the reference. b, As in a, but using a benchmark set of SVs created with DipCall from the T2T v.0.9 HG002 genome assembly, comparing genome wide but excluding centromeres. c, Comparing the performance of PanGenie and vg call using the v.1.1 diploid graph to other genotyping methods. Illumina short reads were used with Delly28, SvABA32, Scalpel31, Manta30 and MetaSV29, as well as with vg call12 and PanGenie13. Also shown are long-read methods (cuteSV34, Sniffles2 (ref. 35), Hapdup33 and HPRC de novo assemblies7).
Regarding the number of haplotypes sampled, as with small variants, increasing beyond eight hinders genotyping performance, with the v.1.1 diploid graph performing best overall. In general, PanGenie outperforms vg call across different graph builds and sampling strategies, although the difference between vg call and PanGenie for the best-performing graph (the v.1.1 diploid graph) is small (F1 of 0.8926 for PanGenie versus 0.888 for vg call). vg call is also on average 30% faster (Supplementary Table 4 and Supplementary Fig. 4).
We tested the performance of haplotype sampling using the challenging medically relevant genes benchmark from GIAB22 (Supplementary Tables 7 and 9). We find similar trends to those observed genome wide, but performance drops slightly given the difficult nature of these regions. For example, the best-performing method in this benchmark, vg call with the v.1.1 diploid graph, has an F1 score of 0.8683, whereas its F1 score on the v.0.6 GIAB benchmark is 0.888.
The GIAB benchmark sets cover a limited subset of SVs in the genome. To get a more complete assessment of performance, we benchmarked our approach using a call set derived with DipCall23 from a draft telomere-to-telomere (T2T) assembly of the HG002 sample (Fig. 4b and Supplementary Table 8). Excluding centromeric regions, which cannot yet be reliably compared, we generate a call set of 25,732 variants covering 92.02% of the GRCh38 reference; in contrast, the GIAB Tier1 v.0.6 call set contains 9,575 variants covering 86.29% of the reference. Across this larger call set, we find performance patterns similar to the GIAB v.0.6 call set, with an F1 score of 0.8419 genome wide (excluding centromeres) for PanGenie with the v.1.1 diploid graph.
We finally compared our haplotype sampling results to published results24,25 using the GIAB Tier1 v.0.6 SV benchmark set (Fig. 4c and Supplementary Table 12). Relative to contemporary methods using short reads and the GRCh38 reference, our results with the v.1.1 diploid graph are dramatically better (the F1 score for the best-performing short-read method, MetaSV, is 0.29, whereas PanGenie achieves 0.89) and now come close to rivaling those using long reads.
Discussion
We have developed a k-mer-based method for sampling local haplotypes that are similar to a sequenced genome. When a subgraph based on the sampled haplotypes is used as a reference for read mapping, the results are more accurate than with the full graph or with a frequency-filtered graph containing only common variants. This translates into a small improvement in the accuracy of calling small variants and a large improvement in the accuracy of genotyping SVs relative to existing pangenome approaches. While the presented results represent an advance over previous filter-based methods, several limitations exist.
Our k-mer presence matrices are not compressed. With the HPRC v.1.1 graph in the section on ‘Mapping reads’, we used 176 million k-mers for describing the haplotypes. This corresponds to 21 MiB per haplotype, plus overhead. As the number of haplotypes in the graph grows, this can become excessive, especially as we will need more k-mers for describing the haplotypes. The solution is compressing the matrices, possibly using methods developed for compressing the color matrices in colored de Bruijn graphs26. With a suitable compression scheme, we can also take advantage of the compressed representation for faster sampling.
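Concretely, since the matrix stores one presence bit per k-mer per haplotype, the quoted figure follows directly:

$$\frac{176 \times 10^{6}\ \text{bits}}{8\ \text{bits/byte} \times 2^{20}\ \text{bytes/MiB}} \approx 21\ \text{MiB per haplotype}.$$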
We currently sample the haplotypes independently in each block and combine them arbitrarily. This is acceptable with short reads, as the blocks are much longer than the reads. In other applications such as long-read mapping, we need phased haplotypes to ensure contiguity across block boundaries. While PanGenie can infer phased haplotypes, the algorithm does not scale well with the number of haplotypes. As a possible solution, we could sample equivalence classes of (almost) identical haplotypes in each block and then choose specific haplotypes from the equivalence classes to maximize contiguity across block boundaries.
With properly phased haplotypes, we may also be able to drop some restrictions on the graph structure. As long as we sample the haplotypes independently in each block, any recombination of them should be a plausible haplotype, because the haplotypes are all constrained to visit the same blocks in order and match each other at the boundaries. But if we can infer phased haplotypes, we can use the long-distance information in them to ensure plausibility instead of relying on these constraints.
SVs have been notoriously difficult to study due to their complex and repetitive nature27, especially so when using the linear reference in combination with short reads28–32. Substantial improvements have been made with the advent of long-read sequencing7,33–35. However, long reads are comparatively expensive and often unavailable for large cohorts of human samples, such as those underpinning global-scale sequencing projects. The approach we presented, which uses knowledge from population sequencing (and therefore long reads), improves on the state of the art for short-read SV genotyping, allowing the typing of common SV variants with accuracy comparable to long-read methods.
Looking forward, we expect pangenomes from the HPRC and elsewhere to grow substantially in genome number, perhaps into the thousands of haplotypes over the next several years. Methods for personalizing pangenomes will therefore increase in importance as the fraction of all variation within them that is rare naturally expands. Conversely, as pangenomes grow, the expanding fraction of rare variation they cover should allow increasingly accurate personalized pangenomes to be imputed, providing a framework for genome imputation that includes complex structural variation. As a consequence, we expect personalization methods of the type introduced here to become increasingly vital to pangenome workflows.
Online content
Any methods, additional references, Nature Portfolio reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at https://doi.org/10.1038/s41592-024-02407-2.
Methods
Bidirected sequence graphs
Human pangenome graphs are usually based on the bidirected sequence graph model. Nodes have identifiers, and they contain a sequence. Edges are undirected and connect two node sides. A forward traversal of a node enters from the left, reads the sequence and exits from the right. A reverse traversal enters from the right, reads the reverse complement of the sequence and exits from the left.
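As a minimal sketch of this model (illustrative Python types, not vg's internal representation):

```python
from dataclasses import dataclass

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

@dataclass(frozen=True)
class Handle:
    """A node traversal: a node identifier plus an orientation."""
    node_id: int
    is_reverse: bool

def traversal_sequence(sequences: dict[int, str], handle: Handle) -> str:
    """Forward traversals read the node label; reverse traversals
    read its reverse complement."""
    seq = sequences[handle.node_id]
    return seq.translate(COMPLEMENT)[::-1] if handle.is_reverse else seq

def path_sequence(sequences: dict[int, str], path: list[Handle]) -> str:
    """A path's sequence is the concatenation of its traversals."""
    return "".join(traversal_sequence(sequences, h) for h in path)

# Example: node 1 labeled "GAT", node 2 labeled "ACA"; traversing
# node 1 forward and node 2 in reverse spells "GAT" + "TGT".
print(path_sequence({1: "GAT", 2: "ACA"}, [Handle(1, False), Handle(2, True)]))
```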
Bidirected sequence graphs can be stored in the text-based Graphical Fragment Assembly (GFA) format. The GBZ file format36 is a space-efficient binary format for pangenome graphs. It is based on the GBWT index37, which stores haplotype paths as node sequences. GBZ is compatible with a subset of GFA, and the two formats can be converted to each other efficiently.
The structure of a bidirected sequence graph can be described hierarchically by its snarl decomposition38. A snarl is a generalization of a bubble39, and denotes a site of genomic variation. It is a subgraph separated by two node sides from the rest of the graph. Each snarl must be minimal in the sense that neither of the sides defining it forms a snarl with a node side located within the subgraph. A graph can be decomposed into a set of chains, each of which is a sequence of nodes and snarls. A snarl may either be primitive, or it may be further decomposed into a set of chains.
Preprocessing haplotypes
We assume that the graph is in the GBZ format. While the format is space-efficient and supports efficient queries, it is not suitable for selecting haplotypes based on sequence similarity. We therefore need to preprocess the graph and store the haplotypes in a more appropriate format.
Our preprocessing approach is similar to that used in PanGenie13 (Fig. 1a). We assume that each weakly connected component in the graph corresponds to a single top-level chain in the snarl decomposition of the graph. While PanGenie combines bubbles that are less than k bp apart (default k = 31), we combine adjacent snarls in the top-level chains into approximately b bp blocks (default b = 10,000), using a minimum distance index40 for determining the length of a block.
Top-level chains generally correspond to chromosomes or other meaningful contigs. Because we partition each top-level chain into a sequence of blocks, the true haplotype paths corresponding to that contig visit the same blocks in the same order. If we sample the haplotypes independently in each block, we can therefore assume that any recombination of them is a plausible haplotype. Despite this linear high-level structure, the graph may contain reversals that let the same real haplotype visit the same blocks multiple times. To avoid sampling unbalanced recombinations, we consider each minimal end-to-end visit to a block a separate haplotype in that block. We find such visits efficiently by listing the paths visiting the border nodes of a block using the r-index41 and matching the node visits at both ends of the block.
We describe the haplotypes in each block in terms of k-mers that are specific to the block (Fig. 1b). While PanGenie uses k-mers (default k = 31) with at most a single hit in each haplotype (but possibly multiple hits within the bubble), we use minimizers (default k = 29, w = 11) with a single hit in the graph (but possibly multiple hits in a haplotype). We also avoid uninformative k-mers that are present in each haplotype. For each block, we then build a k-mer presence matrix: a binary matrix that marks the presence or absence of each selected graph-unique k-mer in each haplotype.
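The following sketch builds such a presence matrix under simplifying assumptions: haplotypes are plain strings, every k-mer is considered rather than minimizers, and graph uniqueness is approximated within the block; all names are illustrative.

```python
import numpy as np
from collections import Counter

def kmers(seq, k):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def presence_matrix(haplotypes, k=29):
    """Binary matrix marking which block-specific k-mers occur in
    which haplotype. Simplified: the real pipeline uses minimizers
    with a single hit in the whole graph; here we only drop
    uninformative k-mers present in every haplotype."""
    counts = Counter()
    for h in haplotypes:
        counts.update(set(kmers(h, k)))   # count haplotypes, not copies
    selected = sorted(x for x, c in counts.items() if c < len(haplotypes))
    index = {x: j for j, x in enumerate(selected)}
    matrix = np.zeros((len(haplotypes), len(selected)), dtype=np.uint8)
    for i, h in enumerate(haplotypes):
        for x in set(kmers(h, k)):
            if x in index:
                matrix[i, index[x]] = 1
    return selected, matrix
```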
Sampling haplotypes
The haplotype sampling workflow uses k-mer counts in the reads to classify the k-mers used in the matrices as absent, heterozygous, homozygous or frequent in the sample (Fig. 1c). Then it samples the most relevant haplotypes independently in each block and combines them to form full-length haplotypes (Fig. 1d).
We start by counting k-mers in the reads. Any external tool that supports the k-mer file format (KFF)42 can be used. The sampling algorithm ignores k-mers with a single occurrence in the reads, and it combines the counts of a k-mer and its reverse complement. It also ignores all k-mers not used in the k-mer presence matrices.
If k-mer coverage is not provided, we estimate it from the counts. If the most common count is above the median, we use it as the estimate. Otherwise, if there is a good enough secondary peak above the median at approximately twice the primary peak, we use it. If both attempts fail, the user must provide an estimate. Given the k-mer coverage and k-mer counts, we classify each k-mer in the matrices as absent, heterozygous, homozygous or frequent. We then ignore all frequent k-mers. See Supplementary Information Section 2.1 for further details. While this classification is inherently noisy, on aggregate, it is useful for our purposes.
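A sketch of both steps follows; the secondary-peak tolerance and the classification boundaries are illustrative assumptions, not vg's actual values.

```python
import statistics
from collections import Counter

def estimate_kmer_coverage(counts):
    """Heuristic coverage estimate following the text: use the most
    common count if it lies above the median; otherwise accept a
    secondary peak above the median near twice the primary one."""
    histogram = Counter(counts)
    median = statistics.median(counts)
    primary = histogram.most_common(1)[0][0]   # most common count value
    if primary > median:
        return primary
    above = [c for c in histogram if c > median]
    if above:
        secondary = max(above, key=lambda c: histogram[c])
        if abs(secondary - 2 * primary) <= 0.25 * primary:
            return secondary
    raise ValueError("cannot estimate coverage; please provide it")

def classify(count, coverage):
    """Classify one k-mer by its read count relative to the k-mer
    coverage; thresholds here are illustrative assumptions."""
    if count < 0.25 * coverage:
        return "absent"        # too few copies to be a real allele
    if count < 0.75 * coverage:
        return "heterozygous"  # roughly half coverage: one copy
    if count < 1.75 * coverage:
        return "homozygous"    # roughly full coverage: two copies
    return "frequent"          # repetitive; ignored downstream
```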
We sample the haplotypes independently in each block. This is in contrast to PanGenie, which attempts to infer phased haplotypes. Sampling is based on a greedy algorithm that selects the highest-scoring haplotype and then adjusts the scores for the remaining haplotypes.
Let $K$ be the set of k-mers used to describe the haplotypes in a block. For each k-mer $x \in K$, let $s(x)$ be the current score for the k-mer. Given a haplotype $H$ in the block, let $H(x)$ be a function such that $H(x) = 1$ if $x \in H$ and $H(x) = -1$ if $x \notin H$. We score each haplotype as the sum of k-mer scores:

$$\mathrm{score}(H) = \sum_{x \in K} H(x) \cdot s(x).$$

We initialize k-mer scores as $s(x) = 1$ for homozygous k-mers, $s(x) = -0.8$ for absent k-mers and $s(x) = 0$ for heterozygous and frequent k-mers. The first haplotype we sample is therefore an approximation of the consensus sequence. With the subsequent haplotypes, we aim to cover all homozygous k-mers, as we may not have any reference haplotypes that contain all the k-mers that should be homozygous according to the k-mer counts. We also aim to select haplotypes both with and without heterozygous k-mers. For this purpose, we update k-mer scores and rescore the haplotypes every time after selecting a haplotype. Let $H'$ be the haplotype we just selected.

If $x$ is a homozygous k-mer and $x \in H'$, we discount its score by a multiplicative factor: $s(x) \leftarrow 0.9 \cdot s(x)$.

If $x$ is a heterozygous k-mer, we adjust its score by an additive term to make the opposite outcome more likely: $s(x) \leftarrow s(x) - 0.05$ if $x \in H'$, and $s(x) \leftarrow s(x) + 0.05$ otherwise.
The numerical parameters were selected in a parameter sweep that aimed to maximize the F1 scores for calling single-nucleotide polymorphisms and small indels with DeepVariant11. See Supplementary Information Section 2.2 for a discussion.
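Putting the scoring and the updates together, here is a minimal sketch of the greedy sampler; the default constants mirror the description above but are assumptions, and names and data layout are illustrative, not vg's implementation.

```python
def greedy_sample(matrix, classes, n,
                  absent_score=-0.8, discount=0.9, adjustment=0.05):
    """Greedy local haplotype sampling sketch. matrix[i][j] is 1 if
    haplotype i contains k-mer j; classes[j] is the k-mer's class.
    Returns the indices of the selected haplotypes."""
    m = len(classes)
    score = [1.0 if classes[j] == "homozygous"
             else absent_score if classes[j] == "absent"
             else 0.0                     # heterozygous and frequent
             for j in range(m)]
    selected = []
    remaining = set(range(len(matrix)))
    for _ in range(min(n, len(matrix))):
        def total(i):
            # score(H) = sum over k-mers of H(x) * s(x), with H(x) = +/-1
            return sum(score[j] if matrix[i][j] else -score[j]
                       for j in range(m))
        best = max(remaining, key=total)
        remaining.remove(best)
        selected.append(best)
        for j in range(m):                # adjust scores after selection
            if classes[j] == "homozygous" and matrix[best][j]:
                score[j] *= discount
            elif classes[j] == "heterozygous":
                score[j] += -adjustment if matrix[best][j] else adjustment
    return selected
```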
After sampling a number of haplotypes (typically four to 32), we may use the selected haplotypes directly. Alternatively, we can use them as candidates in optional diploid sampling. In diploid sampling, we consider each pair of candidates $(H_1, H_2)$ and select the highest-scoring pair. The scoring is:

$$\mathrm{score}(H_1, H_2) = \sum_{x \in K} \big[\, c(x, H_1, H_2) = \hat{c}(x) \,\big],$$

where $c(x, H_1, H_2)$ is the number of times (0, 1 or 2) k-mer $x$ occurs in the haplotypes $H_1$ and $H_2$, $\hat{c}(x)$ is the number of times (again 0, 1 or 2) it should occur according to the classification of the k-mer, and $[\cdot]$ is 1 if the condition holds and 0 otherwise.
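A sketch of diploid sampling under the agreement-counting objective reconstructed above (an assumption about the exact objective); frequent k-mers are ignored, as in the text, and all names are illustrative.

```python
from itertools import combinations

EXPECTED = {"absent": 0, "heterozygous": 1, "homozygous": 2}

def diploid_sample(matrix, classes, candidates):
    """Pick the candidate pair whose k-mer occurrence counts agree
    with the expected counts for the most k-mers."""
    def pair_score(a, b):
        return sum((matrix[a][j] + matrix[b][j]) == EXPECTED[classes[j]]
                   for j in range(len(classes))
                   if classes[j] in EXPECTED)   # skip frequent k-mers
    return max(combinations(candidates, 2), key=lambda p: pair_score(*p))
```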
After sampling local haplotypes in each block of a top-level chain, we combine them to form full-length haplotypes. If the same haplotype was selected in adjacent blocks, we connect them together. Otherwise, the haplotypes we form are arbitrary recombinations of the local haplotypes we sampled. We insert the haplotypes into an empty GBWT index, along with any reference paths for that chain if we want to include them in the personalized pangenome reference.
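A sketch of this stitching step, assuming each block provides the same number of selected haplotype "slots" and representing each selection as a (haplotype name, node path) pair; the slot-wise combination is an illustrative simplification.

```python
def stitch(blocks):
    """Combine locally sampled haplotypes into full-length paths: if
    slot i in adjacent blocks selected the same original haplotype,
    the pieces join seamlessly; otherwise the join is a recombination.
    Returns (node path, number of recombinations) per output path."""
    n = min(len(b) for b in blocks)          # haplotypes per block
    full = []
    for slot in range(n):
        path, joins, prev_name = [], 0, None
        for block in blocks:
            name, piece = block[slot]
            if prev_name is not None and name != prev_name:
                joins += 1                   # recombination at boundary
            path.extend(piece)
            prev_name = name
        full.append((path, joins))
    return full
```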
The sampling process can be parallelized over the top-level chains. Because top-level chains correspond to weakly connected components in the graph, the resulting GBWT indexes are disjoint and can be merged efficiently. Once the final GBWT is made, we build the GBZ graph induced by the haplotypes in the index. Because this sampled graph is a subgraph of the original graph, any alignment in it is also valid in the original graph.
Additional information
Supplementary information The online version contains supplementary material available at https://doi.org/10.1038/s41592-024-02407-2.
Acknowledgements
This work was supported in part by the National Human Genome Research Institute and the National Institutes of Health (NIH). B.P. was partly supported by NIH grant nos. R01HG010485, U24HG010262, U24HG011853, OT3HL142481, U01HG010961 and OT2OD033761. The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.
Footnotes
Competing interests
P.-C.C. and A.C. are employees of Google LLC and own Alphabet stock as part of the standard compensation package. The other authors declare no competing interests.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
This work was done using publicly available data. HPRC v.1.1 graphs and VCF files for the variants included in them are available at https://github.com/human-pangenomics/hpp_pangenome_resources. The underlying assemblies, including GRCh38, can be found at https://github.com/human-pangenomics/HPP_Year1_Assemblies. We used Illumina and Element short reads for HG001, HG002, HG003, HG004 and HG005 available at https://console.cloud.google.com/storage/browser/brain-genomics-public/research/sequencing/fastq/novaseq/wgs_pcr_free and https://console.cloud.google.com/storage/browser/brain-genomics-public/research/element/cloudbreak_wgs, respectively. The GIAB small variant benchmark sets for the same samples can be found at https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/. The GIAB and challenging medically relevant gene SV sets for HG002 are available at the same location. The T2T assembly of HG002 is available at https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/HG002/assemblies/hg002v0.9.fasta.gz. See Supplementary Section 1 for further details.
Code availability
The haplotype sampling approach described in this article is part of the vg toolkit available under MIT license at https://github.com/vgteam/vg. There is an example dataset in directory test/haplotype-sampling. Documentation can be found at https://github.com/vgteam/vg/wiki/Haplotype-Sampling. See Supplementary Sections 4 and 5 for details on other software used.
References
- 1. Eizenga, J. M. et al. Pangenome graphs. Annu. Rev. Genomics Hum. Genet. 21, 139–162 (2020).
- 2. Garrison, E. et al. Variation graph toolkit improves read mapping by representing genetic variation in the reference. Nat. Biotechnol. 36, 875–879 (2018).
- 3. Rautiainen, M. & Marschall, T. GraphAligner: rapid and versatile sequence-to-graph alignment. Genome Biol. 21, 253 (2020).
- 4. Sirén, J. et al. Pangenomics enables genotyping of known structural variants in 5202 diverse genomes. Science 374, abg8871 (2021).
- 5. The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature 526, 68–74 (2015).
- 6. Pritt, J., Chen, N.-C. & Langmead, B. FORGe: prioritizing variants for graph genomes. Genome Biol. 19, 220 (2018).
- 7. Liao, W.-W. et al. A draft human pangenome reference. Nature 617, 312–324 (2023).
- 8. Dilthey, A., Cox, C., Iqbal, Z., Nelson, M. R. & McVean, G. Improved genome inference in the MHC using a population reference graph. Nat. Genet. 47, 682–688 (2015).
- 9. Vaddadi, K., Mun, T. & Langmead, B. Minimizing reference bias with an impute-first approach. Preprint at bioRxiv https://doi.org/10.1101/2023.11.30.568362 (2023).
- 10. Hickey, G. et al. Pangenome graph construction from genome alignments with Minigraph-Cactus. Nat. Biotechnol. https://doi.org/10.1038/s41587-023-01793-w (2023).
- 11. Poplin, R. et al. A universal SNP and small-indel variant caller using deep neural networks. Nat. Biotechnol. 36, 983–987 (2018).
- 12. Hickey, G. et al. Genotyping structural variants in pangenome graphs using the vg toolkit. Genome Biol. 21, 35 (2020).
- 13. Ebler, J. et al. Pangenome-based genome inference allows efficient and accurate genotyping across a wide spectrum of variant classes. Nat. Genet. 54, 518–525 (2022).
- 14. Human Pangenome Reference Consortium. HPRC pangenome resources. GitHub https://github.com/human-pangenomics/hpp_pangenome_resources/ (2023).
- 15. Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. Preprint at arXiv https://arxiv.org/abs/1303.3997 (2013).
- 16. Kokot, M., Długosz, M. & Deorowicz, S. KMC 3: counting and manipulating k-mer statistics. Bioinformatics 33, 2759–2761 (2017).
- 17. Wagner, J. et al. Benchmarking challenging small variants with linked and long reads. Cell Genom. 2, 100128 (2022).
- 18. Baid, G. et al. An extensive sequence dataset of gold-standard samples for benchmarking and development. Preprint at bioRxiv https://doi.org/10.1101/2020.12.11.422022 (2020).
- 19. Krusche, P. et al. Best practices for benchmarking germline small-variant calls in human genomes. Nat. Biotechnol. 37, 555–560 (2019).
- 20. Poplin, R. et al. Scaling accurate genetic variant discovery to tens of thousands of samples. Preprint at bioRxiv https://doi.org/10.1101/201178 (2018).
- 21. Carroll, A. et al. Accurate human genome analysis with Element Avidity sequencing. Preprint at bioRxiv https://doi.org/10.1101/2023.08.11.553043 (2023).
- 22. Wagner, J. et al. Curated variation benchmarks for challenging medically relevant autosomal genes. Nat. Biotechnol. 40, 672–680 (2022).
- 23. Li, H. et al. A synthetic-diploid benchmark for accurate variant-calling evaluation. Nat. Methods 15, 595–597 (2018).
- 24. Zook, J. M. et al. A robust benchmark for detection of germline large deletions and insertions. Nat. Biotechnol. 38, 1347–1355 (2020).
- 25. Kolmogorov, M. et al. Scalable nanopore sequencing of human genomes provides a comprehensive view of haplotype-resolved variation and methylation. Nat. Methods 20, 1483–1492 (2023).
- 26. Marchet, C. et al. Data structures based on k-mers for querying large collections of sequencing data sets. Genome Res. 31, 1–12 (2021).
- 27. Feuk, L., Carson, A. R. & Scherer, S. W. Structural variation in the human genome. Nat. Rev. Genet. 7, 85–97 (2006).
- 28. Rausch, T. et al. DELLY: structural variant discovery by integrated paired-end and split-read analysis. Bioinformatics 28, i333–i339 (2012).
- 29. Mohiyuddin, M. et al. MetaSV: an accurate and integrative structural-variant caller for next generation sequencing. Bioinformatics 31, 2741–2744 (2015).
- 30. Chen, X. et al. Manta: rapid detection of structural variants and indels for germline and cancer sequencing applications. Bioinformatics 32, 1220–1222 (2016).
- 31. Fang, H. et al. Indel variant analysis of short-read sequencing data with Scalpel. Nat. Protoc. 11, 2529–2548 (2016).
- 32. Wala, J. A. et al. SvABA: genome-wide detection of structural variants and indels by local assembly. Genome Res. 28, 581–591 (2018).
- 33. Kolmogorov, M., Yuan, J., Lin, Y. & Pevzner, P. A. Assembly of long, error-prone reads using repeat graphs. Nat. Biotechnol. 37, 540–546 (2019).
- 34. Jiang, T. et al. Long-read-based human genomic structural variation detection with cuteSV. Genome Biol. 21, 189 (2020).
- 35. Smolka, M. et al. Detection of mosaic and population-level structural variants with Sniffles2. Nat. Biotechnol. https://doi.org/10.1038/s41587-023-02024-y (2024).
- 36. Sirén, J. & Paten, B. GBZ file format for pangenome graphs. Bioinformatics 38, 5012–5018 (2022).
- 37. Sirén, J., Garrison, E., Novak, A. M., Paten, B. & Durbin, R. Haplotype-aware graph indexes. Bioinformatics 36, 400–407 (2020).
- 38. Paten, B. et al. Superbubbles, ultrabubbles, and cacti. J. Comput. Biol. 25, 649–663 (2018).
- 39. Zerbino, D. R. & Birney, E. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 18, 821–829 (2008).
- 40. Chang, X., Eizenga, J., Novak, A. M., Sirén, J. & Paten, B. Distance indexing and seed clustering in sequence graphs. Bioinformatics 36, i146–i153 (2020).
- 41. Gagie, T., Navarro, G. & Prezza, N. Fully functional suffix trees and optimal text searching in BWT-runs bounded space. J. ACM 67, 2 (2020).
- 42. Dufresne, Y. et al. The k-mer file format: a standardized and compact disk representation of sets of k-mers. Bioinformatics 38, 4423–4425 (2022).