Table 1.
| | | NA12878 | NA12878 | NA12878 | NA12878 | NA12878 | NA12878 | NA12878 | HG00733 | HG00733 | HG00733 | NA24385 | NA24385 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | Assembler | WENGAN-M | WENGAN-A | WENGAN-D | MaSuRCA | CANU | WTDBG2 | FLYE | WENGAN-D | SHASTA | FALCON | WENGAN-D | SHASTA |
| Assembly statistics | Contigs (≥50 kb) | 490 | 425 | 364 | 1,111 | 798 | 934 | 797 | 387 | 649 | 826 | 270 | 660 |
| | Total length (Mb) | 2,779 | 2,780 | 2,823 | 2,876 | 2,824 | 2,701 | 2,898 | 2,812 | 2,802 | 2,893 | 2,871 | 2,819 |
| | NG50 (Mb) | 17.24 | 25.99 | 35.31 | 8.43 | 10.41 | 11.84 | 22.91 | 32.35 | 21.71 | 22.33 | 50.59 | 20.35 |
| Structural quality | Reference covered (%) | 94.22 | 94.30 | 95.25 | 95.80 | 95.05 | 91.70 | 95.56 | 95.12 | 94.98 | 96.06 | 96.36 | 95.61 |
| | Duplication ratio | 1.002 | 1.002 | 1.006 | 1.013 | 1.008 | 0.991 | 1.025 | 1.004 | 1.002 | 1.020 | 1.011 | 1.002 |
| | Unaligned length (Mb) | 5.21 | 4.76 | 8.63 | 24.68 | 9.42 | 32.13 | 21.29 | 6.96 | 6.49 | 15.41 | 10.54 | 6.52 |
| | NGA50 (Mb) | 11.82 | 14.34 | 16.41 | 5.69 | 7.12 | 7.38 | 12.36 | 17.31 | 12.99 | 14.61 | 24.52 | 14.32 |
| | Longest alignment (Mb) | 45.66 | 75.32 | 72.84 | 32.62 | 34.07 | 70.48 | 78.99 | 71.03 | 78.22 | 71.68 | 75.56 | 75.65 |
| | Assembly errors | 153 | 91 | 158 | 275 | 194 | 124 | 177 | 119 | 107 | 198 | 156 | 126 |
| Computational resources | CPU hours (h) | 203 | 725 | 589 | 20,000 | ~150,000 | 891 | 5,000 | 936 | ~768 | 20,000 | 963 | ~768 |
| | Maximum RAM (Gb) | 53 | 45 | 622 | 500 | – | 222 | 600 | 644 | ~966 | – | 651 | ~692 |
| Consensus accuracy | Short indels (Mb) | 1.99 | 1.72 | 0.85 | 0.56 | 0.57 | 27.16 | 37.3/2.65 | 0.64 | 3.09 | 1.19 | 0.62 | 3.38 |
| | Short indel rate (bp) | 1,381 | 1,592 | 3,252 | 4,966 | 4,828 | 98 | 74/1,047 | 4,372 | 899 | 2,323 | 4,499 | 828 |
| | Medium indels (Mb) | 0.45 | 0.43 | 0.29 | 0.38 | 0.35 | 1.85 | 2.43/0.74 | 0.27 | 0.7 | 0.28 | 0.29 | 0.73 |
| | Medium indel rate (bp) | 6,049 | 6,358 | 9,447 | 7,335 | 7,766 | 1,442 | 1,142/3,753 | 10,161 | 3,982 | 10,046 | 9,608 | 3,817 |
| | Long indels (kb) | 17.95 | 18.73 | 17.74 | 19.21 | 45.96 | 12.64 | 15.60/16.49 | 22.9 | 18.13 | 17.65 | 24.85 | 23.05 |
| | Long indel rate (kb) | 153 | 146 | 157 | 145 | 60 | 211 | 178/169 | 121 | 153 | 157 | 113 | 121 |
| | Indels per 100 kb | 102 | 90 | 53 | 47 | 55 | 1,135 | 1,471/147 | 36 | 141 | 62 | 39 | 152 |
| | 100-mer completeness (%) | 84.24 | 84.82 | 87.44 | 87.54 | 86.41 | 29.47 | 22.47/81.47 | 87.45 | 79.84 | 86.42 | 88.53 | 79.38 |
| | Median QV | 27.84 | 28.41 | 31.02 | 27.10 | 28.79 | 17.08 | 16.41/23.48 | 26.42 | 23.36 | 27.30 | – | – |
| Gene completeness | BUSCO complete (no.) | 3,884 | 3,893 | 3,898 | 3,866 | 3,882 | 1,974 | 2,268/3,680 | 3,907 | 3,788 | 3,874 | 3,904 | 3,752 |
| | BUSCO complete (%) | 94.64 | 94.86 | 94.98 | 94.20 | 94.59 | 48.10 | 55.26/89.67 | 95.20 | 92.30 | 94.40 | 95.13 | 91.42 |
Structural and consensus accuracy were determined as described in the Methods (Assembly validation). All of the assemblies were built by the assembler developers. In particular, all of the NA12878 assemblies except WTDBG2 were generated using the Nanopore (rel5) plus Illumina data at the assembly or polishing steps. The CANU assembly was hybrid polished with NANOPOLISH ×2, RACON ×2 and PILON ×2. The FLYE assembly was hybrid polished with RACON ×2 and NTEDIT ×3 (Methods). The SHASTA assemblies were polished using only Nanopore reads with HELEN and MARGINPOLISH. The FALCON assembly was polished using only PacBio CLR reads with QUIVER. The WENGAN, MaSuRCA and WTDBG2 assemblies were not polished by external tools. The reported CPU time does not include the time spent polishing the assembly with external tools. NG50 and NGA50 were computed using the total chromosome length of GRCh38 (3.088 Gb) as the genome size. Assembly errors overlapping centromeric regions or SDs of GRCh38 were excluded from the analysis. Indels were called from aligned blocks ≥1 kb with an average identity ≥99%, and were classified by length as short (1–2 bp), medium (3–50 bp) or long (>50 bp). The indel rate was computed by dividing the amount of aligned assembly sequence by the number of indels called on those alignments. The ‘indels per 100 kb’ was computed by QUAST from aligned blocks ≥0.5 kb with a minimum identity ≥80%. The 100-mer completeness is the fraction of distinct 100-mers in the GRCh38 reference (2.835 Gb) that are captured in the corresponding assembly. For the FLYE assembly, consensus statistics are reported both before and after polishing (given as before/after). The best and worst performers on each assembly metric (rows) are highlighted in bold or underlined font, respectively.
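
The metric definitions in the footnote above can be illustrated with a minimal Python sketch. It is not the validation pipeline used for the table. The function names (`ng50`, `classify_indel`, `indel_rate`, `kmer_completeness`) are hypothetical, and the k-mer function is a toy that holds whole sequences in memory and compares forward-strand k-mers only, which is an assumption made for brevity and not a statement about how the reported values were obtained.

```python
from typing import Iterable

# Genome size stated in the footnote: total chromosome length of GRCh38 (3.088 Gb).
GRCH38_TOTAL_LENGTH = 3_088_000_000


def ng50(contig_lengths: Iterable[int], genome_size: int = GRCH38_TOTAL_LENGTH) -> int:
    """Length of the contig at which the cumulative length of contigs,
    sorted longest first, reaches half of the genome size.
    Returns 0 if the assembly never covers half of the genome size."""
    half = genome_size / 2
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= half:
            return length
    return 0


def classify_indel(length_bp: int) -> str:
    """Length classes from the footnote: short (1-2 bp), medium (3-50 bp), long (>50 bp)."""
    if length_bp < 1:
        raise ValueError("indel length must be at least 1 bp")
    if length_bp <= 2:
        return "short"
    if length_bp <= 50:
        return "medium"
    return "long"


def indel_rate(aligned_bp: int, n_indels: int) -> float:
    """Aligned assembly sequence divided by the number of indels called on it,
    i.e. the average number of aligned bases per indel."""
    if n_indels == 0:
        return float("inf")
    return aligned_bp / n_indels


def kmer_completeness(reference: str, assembly: str, k: int = 100) -> float:
    """Fraction of distinct reference k-mers that also occur in the assembly.
    Toy version: forward strand only, everything held in memory (assumption)."""
    ref_kmers = {reference[i:i + k] for i in range(len(reference) - k + 1)}
    asm_kmers = {assembly[i:i + k] for i in range(len(assembly) - k + 1)}
    if not ref_kmers:
        return 0.0
    return len(ref_kmers & asm_kmers) / len(ref_kmers)
```

With toy inputs, `ng50([50, 40, 30, 20, 10], genome_size=200)` walks the sorted contigs until the running total reaches 100, and `indel_rate(1000, 4)` gives the mean spacing between indels. Under this definition a higher indel rate means fewer indels per aligned base.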