Table 2.
| Dataset | Assembly | Size (Gb) | NG50 (Mb) | NGA50 (Mb) | QV | Multi-copy genes retained (%) | Resolved BACs (%) | Gene completeness (asmgene): complete (%) | Gene completeness (asmgene): duplicated (%) |
|---|---|---|---|---|---|---|---|---|---|
| CHM13 (HiFi 32×) | hifiasm | 3.052 | 88.9 | 86.7 | 54.2 | 99.7 | 98.8 | 99.97 | 0.05 |
| CHM13 (HiFi 32×) | HiCanu | 3.037 | 69.7 | 67.9 | 54.1 | 98.9 | 97.6 | 99.97 | 0.04 |
| CHM13 (HiFi 32×) | Peregrine | 2.990 | 37.8 | 33.4 | 43.8 | 51.1 | 39.7 | 99.64 | 0.16 |
| CHM13 (HiFi 32×) | Falcon | 2.862 | 27.1 | 21.8 | 50.1 | 30.2 | 34.2 | 99.47 | 0.03 |
| CHM13 (ONT 120×) | Canu | 2.936 | 80.0 | 47.3 | 32.7 | 76.9 | 86.7 | 99.30 | 0.10 |
| CHM13 (ONT 120×) | Flye | 2.900 | 37.5 | 34.0 | 33.5 | 54.7 | 60.6 | 99.22 | 0.11 |
| CHM13 (ONT 120×) | Shasta | 2.820 | 41.3 | 33.4 | 30.4 | 26.7 | 27.9 | 98.05 | 0.01 |
| HG00733 (HiFi 33×) | hifiasm (purge) | 3.043 | 68.3 | 55.3 | 49.9 | 74.6 | 80.4 | 99.07 | 0.39 |
| HG00733 (HiFi 33×) | HiCanu (purge) | 2.921 | 40.5 | 34.2 | 50.5 | 55.2 | 65.9 | 98.47 | 0.32 |
| HG00733 (HiFi 33×) | Peregrine | 3.035 | 30.1 | 30.1 | 40.5 | 37.2 | 38.5 | 98.70 | 0.31 |
| HG00733 (HiFi 33×) | Falcon | 2.861 | 24.4 | 23.2 | 46.3 | 33.6 | 38.0 | 96.51 | 0.15 |
| HG00733 (ONT 50×) | Canu | 2.923 | 41.1 | 36.6 | 29.5 | 54.6 | 69.3 | 98.32 | 0.66 |
| HG00733 (ONT 50×) | Flye | 2.890 | 26.7 | 25.4 | 29.9 | 34.2 | 44.7 | 97.88 | 0.20 |
| HG00733 (ONT 50×) | Shasta | 2.805 | 21.2 | 20.8 | 30.0 | 17.0 | 22.9 | 97.19 | 0.05 |
| HG002 (HiFi 36×) | hifiasm (purge) | 3.067 | 98.2 | 64.1 | 51.5 | 75.8 | – | 99.26 | 0.32 |
| HG002 (HiFi 36×) | HiCanu (purge) | 2.953 | 48.3 | 39.4 | 52.1 | 59.7 | – | 98.71 | 0.18 |
| HG002 (HiFi 36×) | Peregrine | 3.081 | 33.4 | 32.5 | 41.3 | 42.5 | – | 99.14 | 0.36 |
| HG002 (HiFi 36×) | Falcon | 2.955 | 30.4 | 29.0 | 46.7 | 36.6 | – | 99.00 | 0.20 |

Polished ONT assemblies were generated by the Shasta developers (ref. 8). HiCanu and hifiasm were run without duplication purging for the homozygous CHM13 cell line and with purging for the heterozygous HG00733 and HG002 cell lines. The NGA50 of an assembly is the length of the correctly aligned block at 50% of the total reference genome size, which is assumed to be 3.1 Gb; it was calculated from the minigraph (ref. 26) contig-to-reference alignment. The “QV” (quality value) is the Phred-scaled contig base error rate, measured by comparing 31-mers in contigs to 31-mers in short reads from the same sample. The percentage of “multi-copy genes retained” is reported by asmgene (Online Methods): it is the percentage of multi-copy genes in the reference genome (multiple mapping positions at ≥99% sequence identity) that remain multi-copy in the assembly. A BAC is resolved if 99.5% of its bases can be mapped to the assembly. There are 330 CHM13-specific BACs, excluding those not resolved by the telomere-to-telomere (T2T) assembly, and 179 HG00733-specific BACs. HG002 does not have BAC data. Throughout the table, GRCh38 is used as the reference genome for HG00733 and HG002, and the T2T CHM13 assembly v0.9 is used as the reference for CHM13.
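
The metric definitions in this note can be illustrated with a minimal Python sketch. It is not the code used to produce the table: the function names (`ng50`, `phred_qv`, `bac_resolved`) are hypothetical, the toy inputs are invented, the 99.5% threshold is read as "at least 99.5%", and the k-mer comparison that yields the base error rate is assumed to have been done already.

```python
import math


def ng50(block_lengths, genome_size):
    """Length L such that blocks of length >= L together cover at least
    half of genome_size. Passing contig lengths gives an NG50-style value;
    passing correctly aligned block lengths gives an NGA50-style value.
    Returns 0 if the blocks never reach half of genome_size."""
    covered = 0
    for length in sorted(block_lengths, reverse=True):
        covered += length
        if covered * 2 >= genome_size:
            return length
    return 0


def phred_qv(error_rate):
    """Phred-scale a per-base error rate: QV = -10 * log10(error rate)."""
    return -10.0 * math.log10(error_rate)


def bac_resolved(mapped_bases, bac_length, threshold=0.995):
    """Assumed reading of the rule above: a BAC counts as resolved when
    at least `threshold` of its bases map to the assembly."""
    return mapped_bases >= threshold * bac_length


# Toy example (invented numbers): four blocks against a genome size of 100.
print(ng50([40, 30, 20, 10], genome_size=100))  # 30

# One error per 100,000 bases corresponds to a QV of 50.
print(round(phred_qv(1e-5), 1))  # 50.0
```

For the table itself, `genome_size` would be the assumed 3.1 Gb (3,100,000,000), and `block_lengths` would come from the contig-to-reference alignment rather than from a hand-written list.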