Table 1.
| Dataset | Assembler | Size (Gb) | N50 (Mb) | Hamming error (%) | Multicopy genes missed (%) | Gene completeness: complete (%) | Gene completeness: duplicated (%) |
|---|---|---|---|---|---|---|---|
| HG002 (HiFi + trio/Hi-C) | hifiasm (Hi-C) | 3.075/2.909 | 50.0/55.1 | 1.42/0.82 | 19.82/19.98 | 99.28/99.08 | 0.32/0.32 |
| | Falcon-Phase (Hi-C) | 3.027/3.027 | 32.1/32.1 | 18.66/19.15 | 40.13/39.25 | 99.29/99.26 | 3.14/3.13 |
| | hifiasm (trio) | 2.936/3.033 | 57.9/57.8 | 0.75/0.74 | 21.18/16.72 | 99.17/99.24 | 0.29/0.33 |
| HG002 (HiFi only) | hifiasm (dual) | 3.033/3.015 | 57.8/44.7 | 28.25/21.59 | 18.47/20.30 | 99.11/99.04 | 0.35/0.31 |
| | hifiasm (primary/alt) | 3.112/2.910 | 89.9/0.4 | 22.30/1.99 | 13.14/32.25 | 99.44/88.10 | 0.34/2.67 |
| | HiCanu (primary/alt) | 2.960/3.143 | 48.4/0.3 | 27.76/0.68 | 34.95/20.62 | 98.88/85.63 | 0.19/5.15 |
| HG00733 (HiFi + trio/Hi-C/Strand-seq) | hifiasm (Hi-C) | 3.024/3.062 | 44.5/40.6 | 1.79/1.48 | 14.97/18.31 | 99.44/99.51 | 0.31/0.35 |
| | DipAsm (Hi-C) | 2.934/2.933 | 26.3/28.2 | 2.81/2.57 | 66.08/67.44 | 99.03/99.04 | 0.39/0.40 |
| | PGAS (Strand-seq) | 2.905/2.900 | 30.1/25.9 | 3.25/2.60 | 66.48/68.31 | 99.15/99.18 | 0.16/0.15 |
| | hifiasm (trio) | 3.047/3.026 | 52.3/45.6 | 0.78/0.99 | 14.57/18.87 | 99.50/99.28 | 0.42/0.32 |
| HG00733 (HiFi only) | hifiasm (dual) | 3.027/3.049 | 48.3/36.4 | 38.08/36.40 | 19.82/17.52 | 99.36/99.18 | 0.34/0.42 |
| | hifiasm (primary/alt) | 3.077/3.018 | 68.3/0.3 | 39.63/2.23 | 12.18/28.98 | 99.58/84.95 | 0.51/2.89 |
| | HiCanu (primary/alt) | 2.918/3.312 | 44.5/0.2 | 38.79/1.00 | 42.75/14.81 | 98.89/82.78 | 0.14/6.29 |
| European badger (HiFi + trio/Hi-C) | hifiasm (Hi-C) | 2.731/2.536 | 84.5/73.6 | 1.51/2.09 | n/a | 96.77/94.33 | 1.68/1.63 |
| | hifiasm (trio) | 2.633/2.560 | 91.5/57.2 | 0.65/3.28 | n/a | 94.44/95.11 | 1.70/1.68 |
| European badger (HiFi only) | hifiasm (dual) | 2.628/2.643 | 80.6/70.9 | 16.56/16.13 | n/a | 95.32/96.14 | 1.65/1.65 |
| | hifiasm (primary/alt) | 2.724/1.711 | 85.0/0.2 | 12.88/1.83 | n/a | 96.82/51.59 | 1.67/1.35 |
| | HiCanu (primary/alt) | 2.690/1.371 | 67.1/0.1 | 11.36/1.12 | n/a | 96.75/38.30 | 1.96/2.57 |
| Sterlet (HiFi + trio/Hi-C) | hifiasm (Hi-C) | 1.869/1.879 | 10.4/9.3 | 3.48/2.52 | n/a | 93.05/93.16 | 57.83/58.35 |
| | hifiasm (trio) | 1.865/1.853 | 11.3/11.4 | 0.75/0.44 | n/a | 93.30/93.27 | 59.15/57.91 |
| Sterlet (HiFi only) | hifiasm (dual) | 1.873/1.869 | 10.6/9.2 | 11.32/11.34 | n/a | 93.41/92.80 | 56.92/58.79 |
| | hifiasm (primary/alt) | 1.927/1.885 | 27.7/1.5 | 24.94/0.87 | n/a | 93.43/92.64 | 59.01/55.66 |
| | HiCanu (primary/alt) | 1.724/2.114 | 7.3/2.2 | 12.31/1.99 | n/a | 91.48/90.25 | 42.47/59.97 |
| South Island takahe (HiFi + trio/Hi-C) | hifiasm (Hi-C) | 1.315/1.154 | 12.5/13.2 | 0.70/0.64 | n/a | 97.01/90.27 | 0.54/0.46 |
| | hifiasm (trio) | 1.237/1.236 | 12.9/12.6 | 1.87/0.19 | n/a | 91.22/92.29 | 0.64/0.61 |
| South Island takahe (HiFi only) | hifiasm (dual) | 1.237/1.257 | 13.8/10.7 | 6.03/5.06 | n/a | 92.56/94.45 | 0.49/0.52 |
| | hifiasm (primary/alt) | 1.320/0.644 | 16.3/0.3 | 5.12/1.01 | n/a | 97.11/45.33 | 0.59/0.73 |
| Black Rhinoceros (HiFi + trio/Hi-C) | hifiasm (Hi-C) | 2.992/3.056 | 31.6/28.9 | 1.16/1.44 | n/a | 96.49/96.82 | 0.82/0.78 |
| | hifiasm (trio) | 3.014/3.050 | 30.1/31.3 | 0.93/0.33 | n/a | 96.13/96.81 | 0.89/0.90 |
| Black Rhinoceros (HiFi only) | hifiasm (dual) | 2.929/3.047 | 26.8/27.3 | 35.05/34.13 | n/a | 94.49/95.99 | 0.80/0.87 |
| | hifiasm (primary/alt) | 3.055/2.846 | 38.9/0.7 | 36.44/3.42 | n/a | 96.79/84.76 | 0.80/1.01 |
| | HiCanu (primary/alt) | 3.058/2.560 | 22.2/0.3 | 31.55/0.61 | n/a | 96.79/70.11 | 1.53/1.38 |

All assemblies of the same sample use the same HiFi and Hi-C reads, except PGAS, which relies on Strand-seq data for phasing. Each assembly consists of two sets of contigs. The two sets may represent paternal/maternal contigs with trio binning, haplotype 1/haplotype 2 with haplotype-resolved assembly or hifiasm dual assembly, or primary/alternate contigs. The two numbers in each cell give the metrics for the two sets of contigs, respectively. The FALCON-Phase HG002 assembly and the DipAsm and PGAS HG00733 assemblies were acquired from their associated publications. For South Island takahe, HiCanu could not produce an assembly within 3 weeks, so it is excluded. The N50 of an assembly is defined as the length of the shortest contig among the longest contigs that together cover at least 50% of the total assembly size. The completeness scores of all human assemblies were calculated by the asmgene method (ref. 14) with GRCh38 as the reference genome. The completeness of non-human assemblies was evaluated by BUSCO (ref. 15). All samples have parental short reads, which were used to calculate the phasing switch error rates (Supplementary Table 1) and phasing Hamming error rates with yak (ref. 5). The Hamming error rate equals $\sum_i \min\{p_i, m_i\} / \sum_i (p_i + m_i)$, where $p_i$ and $m_i$ are the numbers of paternal- and maternal-specific 31-mers on contig $i$, respectively. 'Multicopy genes missed' is the percentage of multi-copy genes in GRCh38 (multiple mapping positions at ≥99% sequence identity) that are not multi-copy in the assembly. This metric is only reported for human samples, as other species lack high-quality reference genomes and good gene annotations.
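
The N50 and Hamming error definitions above can be illustrated with a minimal Python sketch. It is not the code used by yak or any assembler: the function names `n50` and `hamming_error_rate` are hypothetical, and the inputs are assumed to be plain lists of contig lengths and of per-contig (paternal-specific, maternal-specific) 31-mer counts that have already been computed elsewhere.

```python
from typing import Iterable, Sequence, Tuple


def n50(contig_lengths: Sequence[int]) -> int:
    """Return the length of the shortest contig among the longest contigs
    that together cover at least 50% of the total assembly size."""
    if not contig_lengths:
        raise ValueError("need at least one contig")
    total = sum(contig_lengths)
    running = 0
    # Walk from the longest contig down, accumulating covered bases.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


def hamming_error_rate(kmer_counts: Iterable[Tuple[int, int]]) -> float:
    """Return sum_i min(p_i, m_i) / sum_i (p_i + m_i), where each pair
    (p_i, m_i) holds the paternal- and maternal-specific k-mer counts
    on contig i (hypothetical, precomputed inputs)."""
    minority = 0
    total = 0
    for p, m in kmer_counts:
        minority += min(p, m)  # k-mers from the less-represented parent
        total += p + m
    if total == 0:
        raise ValueError("no parent-specific k-mers found")
    return minority / total


if __name__ == "__main__":
    # Toy values chosen for illustration only; they are not from Table 1.
    print(n50([10, 8, 6, 4, 2]))
    print(100 * hamming_error_rate([(90, 10), (5, 95)]))
```

Taking the minimum per contig means a contig counts as correctly phased whichever parent dominates it, so the rate measures how much minority-parent sequence is mixed into each contig, not which haplotype label it carries.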