[Preprint]. 2024 Mar 19:2024.03.15.585294. [Version 2] doi: 10.1101/2024.03.15.585294

Table 1:

Assemblies of the reference human genome HG002 using only ONT sequencing.

| Asm | Contig NG50 (Mb) | Scaffold NG50 (Mb) | Hamming error | QV | Dup Gene | Missing Gene | T2T ctgs | T2T scfs |
|---|---|---|---|---|---|---|---|---|
| Downsampled (50× Duplex / 30× ONT-UL) | | | | | | | | |
| Verkko + trio | 103.00 | 135.21 | 0.75% | 55.77 | 200 | 292 | 16 | 27/46 |
| Verkko + Pore-C | 86.69 | 135.21 | 0.75% | 55.72 | 232 | 361 | 13 | 26/46 |
| Full-coverage (70× Duplex) | | | | | | | | |
| Verkko + trio | 59.40 | 133.48 | 0.70% | 57.00 | 296 | 309 | 1 | 23/46 |
| Verkko + Pore-C | 43.16 | 133.48 | 0.77% | 56.49 | 290 | 310 | 4 | 17/46 |
| HiFi (43×/30× ONT-UL) (Cheng et al. 2023) | | | | | | | | |
| Verkko + trio | 101.76 | 121.21 | 0.17% | 59.33 | 206 | 314 | 8 | 16/46 |
| Hifiasm + trio | 101.21 | N/A | 0.20% | 60.37 | 182 | 287 | 7 | N/A/46 |

Contig NG50: the length of the shortest contig such that contigs of this length or greater contain half of the genome. No gaps are allowed, and sequences are split wherever a gap of at least 3 Ns is present. The genome size is defined as 6.08 Gbp based on the reference HG002 assembly (https://github.com/marbl/HG002/blob/main/README.md).

Scaffold NG50: same as contig NG50, but without splitting at gaps.

Hamming error: the haplotype error rate computed using yak (Liao et al. 2023) and parental short-read sequence databases, measuring the consistency of each scaffold with a single haplotype; lower is better.

QV: the Phred (Ewing and Green 1998) log-scaled quality score calculated using Merqury (Rhie et al. 2020); higher is better.

Dup/Missing Gene: duplicated or missing genes computed using compleasm (Huang and Li 2023) with the OrthoDB v10 (Waterhouse et al. 2018; Zdobnov et al. 2021) primate database; lower is better. Each haplotype was measured independently, and the reported missing and duplicated genes are the sum over both haplotypes. Since single-copy genes from Chromosome X are expected to be missing on the paternal haplotype and some genes may be true duplications, we also measured gene completeness on the HG002 v0.7 assembly (https://github.com/marbl/HG002/blob/main/README.md) (Supplementary Table 2) as a baseline. This assembly has 180 duplicated and 290 missing genes.

T2T ctgs: the count of telomere-to-telomere contigs for each assembly. A contig is defined as T2T if it has the canonical (TTAGGG) telomere sequence within 50 kbp of both its start and its end and contains no gaps; higher is better.

T2T scfs: same as T2T ctgs, but gaps are allowed; higher is better.
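
The contig and scaffold NG50 definitions above, along with the Phred scaling of QV, can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names (`split_at_gaps`, `ng50`, `contig_ng50`, `scaffold_ng50`, `qv_to_error_rate`) and the toy sequences are hypothetical, and sequences are assumed to be plain in-memory strings. The QV helper applies only the standard Phred relation QV = -10 log10(error rate). It does not reproduce Merqury's k-mer based estimate.

```python
import re

# Genome size stated in the footnote (6.08 Gbp); used only as a default here.
HG002_GENOME_SIZE_BP = 6_080_000_000


def split_at_gaps(sequence, min_gap=3):
    """Split a scaffold into contigs at every run of at least `min_gap` Ns."""
    pieces = re.split(r"[Nn]{%d,}" % min_gap, sequence)
    return [p for p in pieces if p]  # drop empty pieces from leading/trailing gaps


def ng50(lengths, genome_size=HG002_GENOME_SIZE_BP):
    """Length of the shortest sequence such that sequences of this length
    or greater together cover half of `genome_size`.

    Returns 0 if the sequences never reach half of the genome size.
    """
    target = genome_size / 2
    running_total = 0
    for length in sorted(lengths, reverse=True):
        running_total += length
        if running_total >= target:
            return length
    return 0


def contig_ng50(scaffolds, genome_size=HG002_GENOME_SIZE_BP, min_gap=3):
    """NG50 after splitting every scaffold at gaps of >= `min_gap` Ns."""
    lengths = [len(c) for s in scaffolds for c in split_at_gaps(s, min_gap)]
    return ng50(lengths, genome_size)


def scaffold_ng50(scaffolds, genome_size=HG002_GENOME_SIZE_BP):
    """NG50 over whole scaffolds, gaps included (no splitting)."""
    return ng50([len(s) for s in scaffolds], genome_size)


def qv_to_error_rate(qv):
    """Convert a Phred-scaled quality value to a per-base error probability."""
    return 10 ** (-qv / 10)


# Toy example with a made-up "genome size" of 200 bp.
toy_scaffolds = [
    "A" * 60 + "N" * 3 + "C" * 40,  # one 3-N gap: splits into contigs of 60 and 40
    "G" * 30,                        # no gap
    "T" * 10 + "NN" + "T" * 10,      # 2-N run is below the threshold: stays whole
]
toy_contig_ng50 = contig_ng50(toy_scaffolds, genome_size=200)
toy_scaffold_ng50 = scaffold_ng50(toy_scaffolds, genome_size=200)
```

With these toy inputs, the contig lengths are 60, 40, 30 and 22, so the contig NG50 is 40, while the scaffold NG50 is 103 because the first scaffold alone covers half of the 200 bp target. Under the Phred relation, a QV of 60 corresponds to roughly one error per million bases.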