Author manuscript; available in PMC: 2024 Dec 1.
Published in final edited form as: Nat Methods. 2024 May 10;21(6):967–970. doi: 10.1038/s41592-024-02269-8

Scalable telomere-to-telomere assembly for diploid and polyploid genomes with double graph

Haoyu Cheng 1,2, Mobin Asri 3, Julian Lucas 3, Sergey Koren 4, Heng Li 1,2,*
PMCID: PMC11214949  NIHMSID: NIHMS1995747  PMID: 38730258

Abstract

Despite advances in long-read sequencing technologies, constructing a near telomere-to-telomere assembly is still computationally demanding. Here we present hifiasm (UL), an efficient de novo assembly algorithm combining multiple sequencing technologies to scale up population-wide near telomere-to-telomere assemblies. Applied to 22 human and two plant genomes, our algorithm produces better diploid assemblies at a cost of an order of magnitude lower than existing methods and it also works with polyploid genomes.


The emergence of accurate PacBio High-Fidelity (HiFi) long reads has revolutionized the assembly of large genomes, making high-quality haplotype-resolved assembly a routine procedure1–5. However, HiFi reads are often not long enough to resolve long exact repeats, resulting in fragmented components around repeat-rich regions such as centromeres6. Recent advances by Oxford Nanopore Technologies (ONT) have enabled the generation of ultra-long reads, which are approximately 5–10 times longer than HiFi reads though at relatively lower accuracy7. The Telomere-to-Telomere (T2T) consortium has demonstrated that with careful manual curation, it is possible to assemble the homozygous CHM13 human genome from telomere to telomere by combining HiFi and ultra-long reads8.

Learning from the complete human genome assembly of CHM13, Verkko is one of the first efforts towards automated telomere-to-telomere assembly of diploid samples9. It can produce high-quality assembly when parental sequence data are available. However, as we will show later, Verkko does not fully phase a single diploid sample without parental data and thus results in incomplete assembly. It may produce relatively fragmented assembly at lower read coverage and is unable to produce haplotype-resolved assemblies of polyploid samples. Verkko is also compute intensive, making it costly to deploy Verkko to a large number of samples.

For the efficient near telomere-to-telomere assembly of diploid and polyploid samples, we developed hifiasm (UL), which tightly integrates PacBio HiFi, ONT ultra-long, Hi-C reads and trio data and produces high-quality assembly in one go. Unlike Verkko, which is based on the multiplex de Bruijn graph10,11, hifiasm (UL) represents sequences with two string graphs12 (Fig. 1a). The first string graph is built from HiFi reads (Fig. 1b), the same as the original hifiasm graph1. The second string graph is built from ultra-long reads in reduced representation (Fig. 1b–d). Hifiasm (UL) then merges the two graphs to produce the final assembly graph (Fig. 1e). The use of two assembly graphs at different scales distinguishes hifiasm (UL) from other assemblers.

Figure 1. Hybrid assembly with PacBio HiFi and ONT ultra-long reads.

(a) Overall workflow. Hifiasm (UL) corrects HiFi reads, constructs a string graph with HiFi reads alone and aligns ultra-long reads to the HiFi graph. Based on the graph alignment, hifiasm (UL) encodes an ultra-long read as a sequence of integers with each integer uniquely corresponding to a node (also known as a unitig) in the HiFi graph. It then constructs a string graph of integer-encoded ultra-long reads, and merges the HiFi graph and the ultra-long graph to generate the final assembly. Steps highlighted with the gray background are developed specifically for the hifiasm (UL) algorithm. In contrast, other steps are common to our previous hifiasm and hifiasm (Hi-C) algorithms. (b) HiFi assembly graph and ultra-long alignment. Circles in orange and blue represent heterozygous nodes constructed by HiFi reads from haplotype 1 and haplotype 2, respectively. Green circles represent homozygous nodes within the HiFi string graph. The alignment paths of ultra-long reads from haplotype 1 and haplotype 2 are represented by orange and blue lines, respectively. (c) Ultra-long reads encoded as sequences of integer unitig identifiers in the HiFi graph. Nucleotide sequences are ignored at this step. (d) Ultra-long assembly graph and the resulting contigs in the integer encoding. (e) Final assembly graph by incorporating the ultra-long contigs into the HiFi graph. From the initial HiFi graph, hifiasm (UL) removes unitigs that are present on the ultra-long contigs and adds the ultra-long contigs back together with edges between remaining unitigs and unitigs on the ultra-long contigs. Some unitigs (green circles in the example) may appear multiple times in the final graph.

To compare hifiasm (UL) with Verkko at a population scale, we evaluated both approaches using 22 human samples selected from the Human Pangenome Reference Consortium (HPRC)16. Eleven of these samples were chosen from the Year-1 dataset of the HPRC, while the remaining eleven samples were selected from the Year-2 dataset (Supplementary Table 3). We carried out trio assembly for all 22 samples but only did Hi-C-based single-sample assembly for the 11 Year-1 samples. Verkko natively supports trio binning assembly. As it does not support internal Hi-C phasing, we utilized the Hi-C phasing approach, gfase13, in combination with Verkko for the single-sample phased assembly. In total, we collected 132 assembled haplotypes for a comprehensive evaluation of hifiasm (UL) and Verkko.

For each sample, both hifiasm (UL) and Verkko yielded assemblies of similar sizes (Fig. 2a) and exhibited comparable phasing accuracy (Supplementary Table 1). However, when assembling HPRC Year-1 samples at lower HiFi and ultra-long coverage (Supplementary Table 3), hifiasm (UL) tended to produce more contiguous assemblies (Fig. 2b). It generated contiguous contigs spanning from telomere to telomere for multiple chromosomes, whereas Verkko did not produce telomere-to-telomere contigs for Year-1 samples (Supplementary Fig. 1a). The consistent improvement to assembly contiguity highlights the advantages of our approach. Although Verkko could produce scaffolds that bridge entire chromosomes (Supplementary Table 4), the assembly gaps in the scaffolds will complicate downstream analysis. In addition, Verkko could not assemble chromosome-long scaffolds for all chromosomes. The final assembly still required a Hi-C scaffolder for reliable scaffolding.

Figure 2. Statistics of different assemblies.

Hifiasm(UL)_trio and verkko_trio assemblies were generated using HiFi and ultra-long reads, along with parental short reads. Hifiasm(UL)_hic and verkko(UL)_gfase assemblies were constructed using HiFi, ultra-long, and Hi-C reads obtained from the same sample. Verkko(UL)_gfase applied the standalone Hi-C phasing algorithm, gfase13, to the Verkko assembly graph. (a) Assembly length of 11 human samples. (b) Contig N50 representing the assembly contiguity of human samples. (c) Problematic autosomal genes reported by the asmgene method14. The number for each assembly is the sum of the asmgene results for haplotypes 1 and 2. (d) Cloud computing cost for assembling human data. Only three samples were assembled by Verkko using cloud computing. (e) Assembly length of the haploid Arabidopsis thaliana sample and the autotetraploid potato sample. Hifiasm(HiFi) represents hifiasm assemblies without the ultra-long integration. (f) Contig N50 of Arabidopsis and potato assemblies after filtering out contigs shorter than 500 kb. (g) BUSCO15 scores of Arabidopsis and potato assemblies after filtering out contigs shorter than 500 kb.

For HPRC Year-2 datasets at higher coverage (Supplementary Table 1), Verkko assemblies were broadly comparable to hifiasm (UL) assemblies in terms of assembly contiguity (Fig. 2b), the number of telomere-to-telomere contigs (Supplementary Fig. 1a), and phasing accuracy (Supplementary Table 1). A noticeable difference between the two assemblers is that Verkko did not assign all contigs to specific haplotypes given Hi-C data. We observed that the majority of unassigned sequences come from unpaired sex chromosomes of male samples, but there are also relatively larger numbers of unassigned sequences from paired sex chromosomes and autosomes. Due to these unassigned contigs, Verkko assemblies missed more autosomal genes in comparison to hifiasm (UL) and were thus less complete (Fig. 2c). Meanwhile, for samples HG01099 and HG03710, Verkko produced noticeably more duplicated genes. Close inspection of these errors revealed that Verkko duplicated a few regions on one haplotype but left these regions blank on the other haplotype. Hifiasm (UL) was less affected by this issue. We assembled all Year-2 samples with hifiasm (UL) and three samples with Verkko using cloud computing and recorded the cost. Hifiasm (UL) is 8–15 times more cost-effective. The low computational cost of hifiasm (UL) is particularly important for population-scale telomere-to-telomere assembly projects.

We used all HiFi reads and ultra-long reads with a minimum length of 50 kb from the Arabidopsis thaliana (Col-0) dataset17 to evaluate the assembly results for non-human genomes (Fig. 2e–g). As an inbred plant strain, A. thaliana Col-0 has five long chromosomes with a large number of ribosomal DNAs (rDNAs) on the short arms of chromosomes 2 and 4. Hifiasm (UL) produced exactly five contigs that are 500 kb or longer. Three of them were telomere-to-telomere contigs corresponding to chromosomes 1, 3 and 5 (Supplementary Fig. 1b). The other two contigs represented the majority of chromosomes 2 and 4 except the rDNA arrays on their short arms. Hifiasm (UL) assembled tens of Mb of small contigs <500 kb (Fig. 2e). Almost all of them could be aligned to rDNA or the chloroplast DNA. Interestingly, the contig corresponding to chromosome 2 also integrated 294 kb of mitochondrial DNA towards the telomere end of the short arm. This integration is also present in the assembly done by the authors who produced the dataset17 but is absent from the A. thaliana reference genome and from the assembly done by Naish et al18. For the A. thaliana dataset, Verkko only generated one telomere-to-telomere contig corresponding to chromosome 5 (Supplementary Fig. 1b), partly due to homozygous regions that are longer than ultra-long reads but do not span entire chromosome arms. The Verkko assembly at present was less contiguous (Fig. 2f) and less complete based on the BUSCO evaluation15 (Fig. 2g). The Verkko contig corresponding to chromosome 2 was fragmented on the short arm and did not reveal the mitochondrion integration. Both the Verkko and hifiasm (UL) assemblies were more contiguous than the hifiasm HiFi-only assembly, indicating the additional power of ultra-long reads.
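
The contig N50 values reported for these assemblies are computed after discarding short contigs. As a minimal sketch of that statistic, assuming the conventional definition of N50 (the length of the contig at which the cumulative length, taken from the longest contig downwards, first reaches half of the total), the function below applies an optional minimum-length filter. The function name, the `min_len` parameter and the toy lengths are illustrative assumptions and are not taken from the paper's evaluation scripts.

```python
from typing import Iterable


def contig_n50(lengths: Iterable[int], min_len: int = 0) -> int:
    """Return the N50 of the contigs that are at least `min_len` long (0 if none remain)."""
    kept = sorted((x for x in lengths if x >= min_len), reverse=True)
    total = sum(kept)
    running = 0
    for length in kept:
        running += length
        # N50 is the first contig length at which half of the total is covered.
        if 2 * running >= total:
            return length
    return 0


# Toy contig lengths in arbitrary units (made up for illustration, not data from the paper).
toy_lengths = [30, 20, 10, 5, 1]
n50_all = contig_n50(toy_lengths)
n50_filtered = contig_n50(toy_lengths, min_len=10)
```

Filtering changes the total that the statistic is measured against, which is why the figure caption states the 500 kb cutoff explicitly.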

To evaluate polyploid assembly, we further assembled an autotetraploid potato genome19. As Verkko does not support polyploid phasing, only hifiasm (UL) and hifiasm (HiFi) were applied with all HiFi reads and ultra-long reads with a minimum length of 50 kb. By leveraging the additional genetic map information from progeny, both hifiasm (UL) and hifiasm (HiFi) could assemble four haplotypes based on the polyploidy graph-binning approach (Methods). The integration of ultra-long reads not only significantly increased assembly contiguity (Fig. 2f and Supplementary Fig. 1b) but also improved the completeness for all haplotypes (Fig. 2g). For the polyploid genome assembly, the main limitation of our current algorithm is that it requires genetic map information from progeny. To address this issue, we implemented an experimental single-sample approach using Hi-C phasing, and applied it to the autotetraploid potato dataset. This resulted in four haplotype assemblies, which have slightly worse phasing accuracy and contiguity in comparison to the genetic-map-based assemblies. However, the four Hi-C phased haplotype assemblies are imbalanced, with one assembly being 20% larger than the others. In the future, we plan to optimize our Hi-C phasing algorithms for polyploid genomes.

Another limitation of hifiasm (UL) is that it requires long reads of accuracy well beyond 99%. It currently does not work with ordinary ONT simplex reads. Nonetheless, with improved accuracy of this data type, producing near telomere-to-telomere assembly using ONT simplex reads alone may become an option. It is another direction we are actively working on.

Hifiasm (UL) combines accurate long reads, ultra-long reads and Hi-C or trio data to produce high-quality assemblies often with multiple chromosomes assembled from telomere to telomere. In comparison to Verkko, another assembler that takes multiple data types as input, hifiasm (UL) achieves comparable assembly quality at a fraction of cost. It paves the way for near telomere-to-telomere assembly at the population scale and will improve our understanding of complex genomic regions, such as centromeres and highly repetitive segmental duplications, and their biomedical relevance in the long term.

Methods

Overview of hifiasm (UL).

The main objective of hifiasm (UL) is to leverage the benefits of HiFi and ultra-long reads, simplifying the assembly graph as much as possible (Fig. 1). A complete and clean assembly graph will substantially simplify the following steps like Hi-C phasing and phased contig generation. Our previous phasing algorithms1,20 are then applied to the graph to produce haplotype-resolved telomere-to-telomere assemblies. In building the high-quality assembly graph, hifiasm (UL) generally follows the traditional hybrid assembly paradigm, which uses the accurate HiFi graph as the backbone and extends the graph by aligning ultra-long reads to it. However, unlike existing methods, hifiasm (UL) performs an additional round of ultra-long-to-HiFi alignments in advance. This provides extra information for accurately constructing the HiFi graph and alleviates the contained read problem specifically for the string graph21. We then create an integer graph for the ultra-long reads and subsequently merge it with the initial HiFi graph to produce the final assembly graph.

The advantages of hifiasm (UL) stem mainly from the double graph framework for co-assembly (Fig. 1) that fully exploits all information in reads. Our algorithm involves two main steps: (i) build two string graphs individually, one for HiFi reads (Fig. 1b) and another for ultra-long reads (Fig. 1d) and (ii) merge the two graphs to produce a final graph that combines both HiFi and ultra-long reads (Fig. 1e).

Building an accurate string graph as the backbone.

A string graph is an assembly graph that preserves the information of complete reads, where each node represents a read, and edges connecting the nodes correspond to overlaps between reads. Hifiasm (UL) builds the initial backbone graph with HiFi reads, as they are much more accurate than ultra-long reads. To further eliminate sequencing errors, all HiFi reads are self-corrected with the haplotype-resolved error correction algorithm described in the original hifiasm1. Once the graph is constructed, it is necessary to perform multiple rounds of graph cleaning to simplify the graph by removing edges that are less likely to be real.

Although the string graph has been widely utilized in many long-read assemblers, the contained read problem remains unresolved and could potentially impact the completeness of the graph21. Given two reads X and Y, if there is an overlap between X and Y that covers a part of X and the whole of Y, Y is a contained read that is totally contained in X. Extended Data Fig. 1a gives an example. Reads h11 and h12 are two contained reads covered by read h3. Practical implementations of the string graph remove all contained reads when building graphs, since the edges in the string graph correspond to the prefix-to-suffix or suffix-to-prefix overlaps between reads12. However, simply ignoring contained reads could introduce breakpoints in the string graph, especially in highly repetitive regions and homologous regions between two haplotypes. For instance, read h12 is a critical read for one haplotype (reads in blue) but is an unnecessary contained read for another haplotype (reads in orange), as shown in Extended Data Fig. 1a. Removing read h12 does not affect the haplotype in orange but leads to a breakpoint for the haplotype in blue, resulting in a fragmented assembly graph. Identifying critical contained reads and retaining them in the string graph is the primary challenge posed by the contained read problem. Several approaches have been proposed to tackle it based on simplified assumptions about read coverage or length21, which are not always reliable, especially in highly repetitive regions.
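
The definition of a contained read can be made concrete with a small sketch. The code below is illustrative only and is not how hifiasm detects containment: reads are modelled as hypothetical `(start, end)` intervals on a single shared coordinate, so that "an overlap covering the whole of Y" reduces to interval inclusion. The read names are borrowed from the text, but the coordinates are invented.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # hypothetical (start, end) placement of a read, end exclusive


def find_contained_reads(reads: Dict[str, Interval]) -> Dict[str, List[str]]:
    """Map each contained read to the list of reads that fully cover it."""
    contained: Dict[str, List[str]] = {}
    for name_y, (start_y, end_y) in reads.items():
        for name_x, (start_x, end_x) in reads.items():
            if name_x == name_y:
                continue
            covers = start_x <= start_y and end_y <= end_x
            identical = (start_x, end_x) == (start_y, end_y)
            # For identical intervals, report only one of the two reads as contained.
            if covers and (not identical or name_x < name_y):
                contained.setdefault(name_y, []).append(name_x)
    return contained


# Invented layout in which h3 covers both h11 and h12.
toy_reads = {
    "h2": (0, 100),
    "h3": (60, 180),
    "h11": (70, 120),
    "h12": (110, 170),
    "h4": (150, 260),
}
contained = find_contained_reads(toy_reads)
```

Note that this placement-based view cannot tell h11 and h12 apart: both look equally redundant, which is exactly why coverage- or length-based rules struggle to decide which contained read is critical for the other haplotype.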

Hifiasm (UL) alleviates the contained read problem within the HiFi string graph by utilizing ultra-long-to-HiFi read alignments. A HiFi read is considered a critical contained read only if it lacks sufficient informative variants to distinguish it from reads originating from other repeat copies (read h12 in Extended Data Fig. 1a). Given that ultra-long reads are frequently ten times longer than HiFi reads (with a median length exceeding 100kb), it is less probable that an ultra-long read is a critical contained read without any informative variant. As a result, when ultra-long reads are aligned to HiFi reads, the HiFi contained reads that must be covered by ultra-long read alignments are expected to be the critical reads. To this end, hifiasm (UL) theoretically constructs a HiFi string graph that includes both contained and uncontained reads (Extended Data Fig. 1b). It then employs the graph alignment to align all ultra-long reads to this graph. As shown in Extended Data Fig. 1b, the contained read h12 must be covered by the alignment paths of ultra-long reads u6 and u7, while another contained read h11 could be skipped by read h3. Consequently, hifiasm (UL) retains the critical read h12 for constructing a complete string graph of HiFi reads, while safely removing read h11 to simplify the graph.
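
The selection rule described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the hifiasm implementation: each ultra-long read is assumed to come with a single alignment path through the HiFi graph, a contained read is assumed to appear on a path only when the alignment cannot avoid it, and the `min_support` threshold is a hypothetical parameter. The toy paths reuse read names from the text, but their topology is invented.

```python
from typing import Dict, Iterable, List, Set


def select_critical_contained(
    contained_reads: Iterable[str],
    ul_paths: Dict[str, List[str]],
    min_support: int = 1,
) -> Set[str]:
    """Return the contained reads that lie on at least `min_support` ultra-long paths."""
    support = {read: 0 for read in contained_reads}
    for path in ul_paths.values():
        for node in set(path):  # count each ultra-long read at most once per node
            if node in support:
                support[node] += 1
    return {read for read, count in support.items() if count >= min_support}


# Hypothetical alignment paths: u6 and u7 pass through h12, nothing passes through h11.
toy_paths = {
    "u1": ["h2", "h3", "h4"],
    "u6": ["h9", "h12", "h10"],
    "u7": ["h12", "h10", "h13"],
}
critical = select_critical_contained(["h11", "h12"], toy_paths)
```

In this toy input h12 is retained and h11 is dropped, mirroring the outcome described for Extended Data Fig. 1b.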

The ultra-long-to-HiFi read alignment can also be used to avoid incorrect graph cleaning. In an ideal scenario where all HiFi reads are longer than any homozygous or repetitive region, each node in the string graph would have at most one edge extending to the left and one to the right. However, due to the limited length of HiFi reads, some nodes may have multiple edges, making it difficult for assemblers to determine how many edges should be retained. For instance, hifiasm and HiCanu3 utilize a length-based strategy that prioritizes the edge with the longest overlap and often removes the shorter edges. These heuristic graph-cleaning solutions may overcut real edges or retain spurious ones. If the initial backbone HiFi graph is either oversimplified or too complex, the downstream steps of hifiasm (UL) may not accurately resolve difficult-to-assemble regions. By utilizing the ultra-long-to-HiFi read alignment, hifiasm (UL) is able to ascertain the number of ultra-long reads supporting each edge, providing additional information to prevent incorrect graph cleaning (Extended Data Fig. 1b).
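
A minimal sketch of how edge support could be tallied and then used to temper a length-based cleaning rule is given below. The paper states only that the number of supporting ultra-long reads is tracked for each edge; the specific policy shown here (keep the longest-overlap edge plus every edge with at least `min_support` supporting reads) is an assumption made for illustration, as are the function names and the toy overlap lengths.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]


def count_edge_support(ul_paths: Dict[str, List[str]]) -> Counter:
    """Count, for each directed edge, how many ultra-long alignment paths traverse it."""
    support: Counter = Counter()
    for path in ul_paths.values():
        # A set ensures a read that crosses the same edge twice is counted once.
        for edge in set(zip(path, path[1:])):
            support[edge] += 1
    return support


def edges_to_keep(overlap_len: Dict[Edge, int], support: Counter, min_support: int = 1) -> Set[Edge]:
    """For the outgoing edges of one node, keep the longest overlap and any supported edge."""
    longest = max(overlap_len, key=overlap_len.get)
    protected = {edge for edge in overlap_len if support[edge] >= min_support}
    return {longest} | protected


# Ultra-long reads u4 and u5 both cross the edge h5 -> h8, so its weight is 2.
toy_paths = {"u4": ["h5", "h8"], "u5": ["h4", "h5", "h8"]}
support = count_edge_support(toy_paths)

# Hypothetical overlap lengths for the outgoing edges of h5.
h5_edges = {("h5", "h8"): 9000, ("h5", "h9"): 12000, ("h5", "h10"): 5000}
kept = edges_to_keep(h5_edges, support)
```

A purely length-based rule would keep only h5 to h9 here; the support counts additionally protect h5 to h8 from being cut.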

Integer graph with ultra-long reads.

To fully capture the length information of ultra-long reads, hifiasm (UL) constructs another string graph using only those reads. However, generating the string graph requires the computationally intensive all-versus-all pairwise read comparison, which constitutes the primary bottleneck in the long-read assembly workflow. Moreover, identifying correct overlaps among ultra-long reads is particularly challenging due to their significantly higher error rate compared to HiFi reads. Furthermore, the high frequency of recurrent sequence errors in ONT ultra-long reads makes it nearly impossible to accurately identify overlaps in difficult regions.

Hifiasm (UL) constructs a lightweight integer graph to entirely avoid the expensive all-versus-all base-level read comparison and ensure the accuracy of the ultra-long graph is comparable to that of the HiFi graph. In short, all ultra-long reads are converted from the base pair space to a low-dimensional integer space using the graph alignments of ultra-long reads. By working in the integer space, the graph construction procedure is both efficient and straightforward. The detailed steps of the integer graph construction are listed as follows.

  1. Mapping ultra-long reads into the integer space. All ultra-long reads are aligned to the HiFi graph to obtain the alignment paths (Fig. 1b). Given an ultra-long read, hifiasm (UL) first collects its linear alignments to the nodes of the HiFi graph using pairwise base-level alignments. Linear alignments are then chained in the graph space using the approach described in minigraph22. A graph alignment path is a sequence of the aligned node identifiers. For each ultra-long read, hifiasm (UL) only keeps the node identifiers and disregards all alignment details and base pairs. Node identifiers can be represented as integers, meaning that ultra-long reads of over tens of kilobases are transformed into ultra-long sequences consisting of tens of integers (Fig. 1c).

  2. Calculating overlaps among ultra-long integer sequences. To construct a string graph in the integer space, obtaining overlaps between integer sequences is essential. As the base-level sequencing errors within ultra-long reads have already been corrected through the graph alignment to the accurate HiFi graph, hifiasm (UL) only allows exact overlaps in the integer space. Notably, this step is considerably faster than the conventional all-versus-all inexact pairwise alignment.

  3. Constructing an integer graph. An integer graph is a type of string graph where each node is an integer sequence. Hifiasm (UL) constructs an integer graph by utilizing ultra-long integer sequences and their overlaps (Fig. 1d). Specifically, each node in this graph represents an ultra-long integer sequence, and the edges connecting the nodes correspond to exact overlaps between these sequences. However, even after this initial construction, multiple rounds of standard graph cleaning are still necessary to further simplify the integer graph. As ultra-long reads are typically long enough to assemble through repetitive or homozygous regions, hifiasm (UL) employs highly aggressive graph cleaning strategies to eliminate ambiguous edges associated with each node.

  4. Producing integer contigs. A contig corresponds to a non-branching path in the string graph. Given a contig in the integer graph, hifiasm (UL) produces its sequence by concatenating the subsequences of nodes within the corresponding path (Fig. 1d). After the contig generation process, each resulting contig is an integer sequence that is significantly longer than any individual ultra-long read. In fact, these integer contigs represent the paths that can untangle intricate structures within the initial HiFi graph.
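
Steps 2 to 4 above can be sketched in a few lines once the reads are integer sequences. The code below is a toy illustration under several assumptions and is not the hifiasm implementation: unitig orientation (strand) is ignored, contained integer reads are simply dropped, "aggressive cleaning" is approximated by keeping only the longest exact overlap leaving each read, and circular paths are not handled. All function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

IntRead = Tuple[int, ...]  # an ultra-long read encoded as HiFi unitig identifiers


def exact_overlap(a: IntRead, b: IntRead, min_len: int = 1) -> int:
    """Length of the longest proper suffix of `a` that exactly equals a prefix of `b`."""
    for k in range(min(len(a), len(b)) - 1, min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def is_contained(short: IntRead, long: IntRead) -> bool:
    """True if `short` occurs as a contiguous run inside `long`."""
    n = len(short)
    return any(long[i:i + n] == short for i in range(len(long) - n + 1))


def build_integer_graph(reads: List[IntRead], min_len: int = 1):
    """Drop contained reads, then keep only the longest exact overlap leaving each read."""
    kept: List[IntRead] = []
    for read in sorted(set(reads), key=len, reverse=True):
        if not any(is_contained(read, other) for other in kept):
            kept.append(read)
    succ: Dict[IntRead, Tuple[IntRead, int]] = {}
    for a in kept:
        candidates = [(exact_overlap(a, b, min_len), b) for b in kept if b != a]
        if candidates:
            ov, b = max(candidates)
            if ov > 0:
                succ[a] = (b, ov)
    return kept, succ


def integer_contigs(reads: List[IntRead], min_len: int = 1) -> List[IntRead]:
    """Walk non-branching paths and concatenate the non-overlapping part of each read."""
    kept, succ = build_integer_graph(reads, min_len)
    indegree = Counter(b for b, _ in succ.values())
    contigs = []
    for start in kept:
        if indegree[start] == 1:
            continue  # interior node of a path; it is emitted when its predecessor is walked
        contig = list(start)
        cur = start
        while cur in succ:
            nxt, ov = succ[cur]
            if indegree[nxt] != 1:
                break  # several reads point at nxt, so the contig stops here
            contig.extend(nxt[ov:])
            cur = nxt
        contigs.append(tuple(contig))
    return contigs


# Four integer-encoded reads; (2, 3) is contained in (1, 2, 3, 4).
toy_reads = [(1, 2, 3, 4), (3, 4, 5, 6), (5, 6, 7), (2, 3)]
contigs = integer_contigs(toy_reads)
```

Because every comparison is an exact match on short tuples of integers, no base-level alignment is needed at this stage, which is the source of the speed-up described in step 2.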

Building final assembly graph by graph incorporation.

The integer graph produces ultra-long integer contigs that correspond to assembly paths within the initial HiFi graph. These integer contigs represent another HiFi string graph that resolves the majority of tangles and homozygous regions within the initial HiFi graph into linear sequences. By incorporating ultra-long integer contigs into the initial HiFi graph, hifiasm (UL) can produce the final assembly. Specifically, hifiasm (UL) first removes all nodes within the initial HiFi graph that also appear in ultra-long integer contigs, and then merges the remaining nodes and overlaps with ultra-long integer contigs. Fig. 1e provides an example. In the final assembly graph, all nodes except h7 come from ultra-long integer contigs. This is because all nodes except node h7 are present in both the initial HiFi graph (Fig. 1b) and ultra-long integer contigs (Fig. 1d).
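
The incorporation step can be illustrated with a small sketch. It follows the description above (remove HiFi nodes that occur on ultra-long contigs, add the contigs back, and reconnect the remaining nodes), but the details are assumptions: each contig position becomes its own node copy so that a unitig may appear several times, and an edge between a remaining unitig and a unitig on a contig is attached to every copy of that unitig. The paper does not specify how such edges are attached, and the toy graph is invented rather than taken from Fig. 1.

```python
from typing import Dict, List, Set, Tuple

# A final-graph node is ("hifi", unitig_id) or ("ul", contig_index, position).
Node = Tuple


def incorporate(
    hifi_nodes: Set[int],
    hifi_edges: Set[Tuple[int, int]],
    ul_contigs: List[List[int]],
):
    """Merge ultra-long integer contigs into the HiFi graph (simplified sketch)."""
    on_contig = {u for contig in ul_contigs for u in contig}
    remaining = hifi_nodes - on_contig
    nodes: Set[Node] = {("hifi", u) for u in remaining}
    edges: Set[Tuple[Node, Node]] = set()
    copies: Dict[int, List[Node]] = {}

    # Add each contig as a chain of node copies.
    for ci, contig in enumerate(ul_contigs):
        prev = None
        for pos, u in enumerate(contig):
            node = ("ul", ci, pos)
            nodes.add(node)
            copies.setdefault(u, []).append(node)
            if prev is not None:
                edges.add((prev, node))
            prev = node

    # Re-attach original edges that touch at least one remaining unitig.
    for u, v in hifi_edges:
        if u in remaining and v in remaining:
            edges.add((("hifi", u), ("hifi", v)))
        elif u in remaining:
            edges.update((("hifi", u), c) for c in copies.get(v, []))
        elif v in remaining:
            edges.update((c, ("hifi", v)) for c in copies.get(u, []))
    return nodes, edges, copies


# Invented example: unitigs 3 and 4 are shared by two contigs, unitig 7 is on neither.
toy_nodes = {1, 2, 3, 4, 5, 6, 7}
toy_edges = {(1, 3), (2, 3), (3, 4), (4, 5), (4, 6), (3, 7), (7, 5)}
toy_contigs = [[1, 3, 4, 5], [2, 3, 4, 6]]
nodes, edges, copies = incorporate(toy_nodes, toy_edges, toy_contigs)
```

In the result, unitig 7 is the only node inherited directly from the HiFi graph, while unitigs 3 and 4 each appear twice, once per contig.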

Constructing haplotype-resolved assemblies.

The high-quality assembly graphs combining HiFi and ultra-long reads significantly simplify the generation of haplotype-resolved assemblies. With the addition of Hi-C23,24 or parental short reads, hifiasm (UL) can reuse previous Hi-C20 or trio-binning1 algorithms to assign haplotype-specific markers to the nodes of the assembly graph. The final haplotype-resolved assemblies are then produced using the graph-binning strategy1. For polyploid genomes, we implemented a polyploidy graph-binning approach that extends our previous diploid graph-binning method. In the polyploidy graph-binning approach, when emitting the assembly of one haplotype, all nodes with other haplotype-specific markers are discarded from the assembly graph. This is the main difference between the polyploidy graph-binning and the diploid graph-binning approaches.
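
The node-filtering rule of the polyploidy graph-binning approach can be sketched as below. This is a simplified illustration, assuming that each node carries either a single haplotype label or no label at all (`None` for nodes without haplotype-specific markers); the function name and the toy tetraploid bubble are hypothetical, and contig generation from the filtered graph is omitted.

```python
from typing import Dict, Optional, Set, Tuple


def haplotype_subgraph(
    node_hap: Dict[str, Optional[int]],
    edges: Set[Tuple[str, str]],
    target: int,
):
    """Keep nodes labelled `target` or unlabelled; discard nodes of every other haplotype."""
    kept_nodes = {n for n, hap in node_hap.items() if hap is None or hap == target}
    kept_edges = {(u, v) for u, v in edges if u in kept_nodes and v in kept_nodes}
    return kept_nodes, kept_edges


# A four-way bubble between two unlabelled (shared) nodes, as might occur in a tetraploid.
toy_labels = {"a": None, "b1": 1, "b2": 2, "b3": 3, "b4": 4, "c": None}
toy_edges = {("a", b) for b in ("b1", "b2", "b3", "b4")} | {
    (b, "c") for b in ("b1", "b2", "b3", "b4")
}
hap2_nodes, hap2_edges = haplotype_subgraph(toy_labels, toy_edges, target=2)
```

Running the same filter once per haplotype label yields one subgraph per haplotype, each of which reduces the four-way bubble to a single path.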

Optimizing for cloud computing.

To evaluate the computational cost of both hifiasm (UL) and Verkko, we assembled all human samples with hifiasm (UL) and three human samples with Verkko using the Terra platform on top of Google Cloud Platform. We further reduced the computational costs by executing assemblers with preemptible instances. A preemptible instance costs much less, but its running time generally cannot exceed 24 hours. As a result, both hifiasm (UL) and Verkko were divided into multiple short tasks, which were executed individually using preemptible instances (Supplementary Section 1.4). In order to make a fair and clear comparison, Fig. 2d presents the total computational cost for the entire assembly workflow, rather than the individual costs of each subtask.

Extended Data

Extended Data Fig. 1. Accurate HiFi string graph combining PacBio HiFi and ONT ultra-long reads.

(a) Effect of contained reads in the string graph. Rectangles in orange and blue represent heterozygous HiFi reads from haplotype 1 and haplotype 2, respectively. Green rectangles are HiFi reads originating from homozygous regions, whereas red rectangles are contained reads. The string graph is constructed using all reads, except for two contained reads.

(b) Hifiasm (UL) aligns ultra-long reads to the HiFi string graph with contained reads to alleviate the contained read problem. The alignment paths of ultra-long reads from haplotype 1 and haplotype 2 are represented by orange and blue lines, respectively. Despite being a contained read, h12 is retained as the critical read because it is covered by ultra-long reads u6 and u7. To ensure accurate graph cleaning, hifiasm (UL) also tracks the number of ultra-long reads that support each edge as its weight. For instance, the edge weight between h5 and h8 is 2 because ultra-long reads u4 and u5 cover it.

Supplementary Material

Supplementary Information

Acknowledgements

This study was supported by the US National Institutes of Health (grants R01HG010040, U01HG010971 and U41HG010972 to H.L., grant K99HG012798 to H.C.). We thank the Human Pangenome Reference Consortium for making Year-1 and Year-2 datasets publicly available.

Footnotes

Competing interests

The authors declare no competing interests.

Code availability

Hifiasm (UL), along with its source code, is freely available at https://github.com/chhylp123/hifiasm.

Reporting Summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability

Human reference genome: GRCh38 and CHM13v2; HiFi reads of HPRC Year-2 samples except HG002: https://s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=submissions/1E2DD570-3B26-418B-B50F-5417F64C5679--HIFI_DEEPCONSENSUS/; ONT ultra-long reads of HPRC Year-2 samples except HG002 (R9.4.1 flow cells): https://s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=submissions/90A1F283-2752-438B-917F-53AE76C9C43E--UCSC_HPRC_nanopore_Year2/; Hi-C reads of HPRC Year-2 samples except HG002: https://s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=submissions/4C696EB9-9AD2-47A2-8011-2F43977CC4E0--Y2-HIC/; Parental short reads of HPRC Year-2 samples except HG002: https://s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=submissions/AD30A684-C7A8-4D24-89B2-040DFF021B0C--Y2_1000G_DATA/; HG002 HiFi reads (Google Cloud Storage): gs://brain-genomics/awcarroll/t2t/fastq/q20/m64011_190830_220126.Q20.fastq.gz, gs://brain-genomics/awcarroll/t2t/fastq/q20/m64011_190901_095311.Q20.fastq.gz, gs://brain-genomics/awcarroll/t2t/fastq/q20/m64012_190920_173625.Q20.fastq.gz, gs://brain-genomics/awcarroll/t2t/fastq/q20/m64012_190921_234837.Q20.fastq.gz; HG002 ultra-long reads (R9.4.1 flow cells, pass reads only): https://s3-us-west-2.amazonaws.com/human-pangenomics/NHGRI_UCSC_panel/HG002/nanopore/ultra-long/03_08_22_R941_HG002_1_Guppy_6.1.2_5mc_cg_prom_sup.tar, https://s3-us-west-2.amazonaws.com/human-pangenomics/NHGRI_UCSC_panel/HG002/nanopore/ultra-long/03_08_22_R941_HG002_2_Guppy_6.1.2_5mc_cg_prom_sup.tar, https://s3-us-west-2.amazonaws.com/human-pangenomics/NHGRI_UCSC_panel/HG002/nanopore/ultra-long/03_08_22_R941_HG002_3_Guppy_6.1.2_5mc_cg_prom_sup.tar; HG002 Parental short reads: https://s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=working/HPRC_PLUS/HG002/raw_data/Illumina/parents/; HG002 Hi-C reads: “HG002.HiC_1*” from https://s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=working/HPRC_PLUS/HG002/raw_data/hic/downsampled/; All reads of HPRC Year-1 samples (R9.4.1 flow cells for ONT reads): https://github.com/human-pangenomics/HPP_Year1_Data_Freeze_v1.0; All reads of Arabidopsis (R9.4.1 flow cells for ONT reads): https://ngdc.cncb.ac.cn/search/?dbId=gsa&q=CRA004538; All reads of potato (R9.4.1 flow cells for ONT ultra-long reads): https://ngdc.cncb.ac.cn/gsa/browse/CRA006012; Hifiasm (UL) assemblies of HPRC Year-2 samples: “*hifiasm_v0.19.5*” from https://s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=submissions/53FEE631-4264-4627-8FB6-09D7364F4D3B--ASM-COMP/; Verkko assemblies of HPRC Year-2 samples: “*verkko_1.3.1*” from https://s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=submissions/53FEE631-4264-4627-8FB6-09D7364F4D3B--ASM-COMP/; All evaluated HPRC Year-1 and plant assemblies are available at https://zenodo.org/record/7996422 and https://zenodo.org/record/7962930, respectively.

References

  • 1. Cheng H, Concepcion GT, Feng X, Zhang H & Li H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat. Methods 18, 170–175 (2021).
  • 2. Wenger AM et al. Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. Nat. Biotechnol. 37, 1155–1162 (2019).
  • 3. Nurk S et al. HiCanu: accurate assembly of segmental duplications, satellites, and allelic variants from high-fidelity long reads. Genome Res. 30, 1291–1305 (2020).
  • 4. Kolmogorov M, Yuan J, Lin Y & Pevzner PA. Assembly of long, error-prone reads using repeat graphs. Nat. Biotechnol. 37, 540–546 (2019).
  • 5. Luo X, Kang X & Schönhuth A. phasebook: haplotype-aware de novo assembly of diploid genomes from long reads. Genome Biol. 22, 299 (2021).
  • 6. Porubsky D et al. Gaps and complex structurally variant loci in phased genome assemblies. Genome Res. (2023).
  • 7. Jain M et al. Nanopore sequencing and assembly of a human genome with ultra-long reads. Nat. Biotechnol. 36, 338–345 (2018).
  • 8. Nurk S et al. The complete sequence of a human genome. Science 376, 44–53 (2022).
  • 9. Rautiainen M et al. Telomere-to-telomere assembly of diploid chromosomes with Verkko. Nat. Biotechnol. 1–9 (2023).
  • 10. Bankevich A, Bzikadze AV, Kolmogorov M, Antipov D & Pevzner PA. Multiplex de Bruijn graphs enable genome assembly from long, high-fidelity reads. Nat. Biotechnol. 40, 1075–1081 (2022).
  • 11. Rautiainen M & Marschall T. MBG: Minimizer-based sparse de Bruijn Graph construction. Bioinformatics 37, 2476–2478 (2021).
  • 12. Myers EW. The fragment assembly string graph. Bioinformatics 21, ii79–ii85 (2005).
  • 13. Lorig-Roach R et al. Phased nanopore assembly with Shasta and modular graph phasing with GFAse. bioRxiv 2023–02 (2023).
  • 14. Li H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
  • 15. Simão FA, Waterhouse RM, Ioannidis P, Kriventseva EV & Zdobnov EM. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics 31, 3210–3212 (2015).
  • 16. Liao W-W et al. A draft human pangenome reference. Nature 617, 312–324 (2023).
  • 17. Wang B et al. High-quality Arabidopsis thaliana genome assembly with nanopore and HiFi long reads. Genomics Proteomics Bioinformatics 20, 4–13 (2022).
  • 18. Naish M et al. The genetic and epigenetic landscape of the Arabidopsis centromeres. Science 374, eabi7489 (2021).
  • 19. Bao Z et al. Genome architecture and tetrasomic inheritance of autotetraploid potato. Mol. Plant 15, 1211–1226 (2022).
  • 20. Cheng H et al. Haplotype-resolved assembly of diploid genomes without parental data. Nat. Biotechnol. 40, 1332–1335 (2022).
  • 21. Jain C. Coverage-preserving sparsification of overlap graphs for long-read assembly. Bioinformatics 39, btad124 (2023).
  • 22. Li H, Feng X & Chu C. The design and construction of reference pangenome graphs with minigraph. Genome Biol. 21, 265 (2020).
  • 23. Edge P, Bafna V & Bansal V. HapCUT2: robust and accurate haplotype assembly for diverse sequencing technologies. Genome Res. 27, 801–812 (2017).
  • 24. Martin M et al. WhatsHap: fast and accurate read-based phasing. bioRxiv (2016).

