2020 Dec 14;39(4):422–430. doi: 10.1038/s41587-020-00747-w

Table 1.

WENGAN assemblies of the diploid NA12878, HG00733 and NA24385 genomes

Genome: NA12878 (first 7 columns), HG00733 (next 3 columns), NA24385 (last 2 columns)
Assembler: WENGAN-M WENGAN-A WENGAN-D MaSuRCA CANU WTDBG2 FLYE WENGAN-D SHASTA FALCON WENGAN-D SHASTA
Assembly statistics
Contigs (≥50 kb) 490 425 364 1,111 798 934 797 387 649 826 270 660
Total length (Mb) 2,779 2,780 2,823 2,876 2,824 2,701 2,898 2,812 2,802 2,893 2,871 2,819
NG50 (Mb) 17.24 25.99 35.31 8.43 10.41 11.84 22.91 32.35 21.71 22.33 50.59 20.35
Structural quality
Reference covered (%) 94.22 94.30 95.25 95.80 95.05 91.70 95.56 95.12 94.98 96.06 96.36 95.61
Duplication ratio 1.002 1.002 1.006 1.013 1.008 0.991 1.025 1.004 1.002 1.020 1.011 1.002
Unaligned length (Mb) 5.21 4.76 8.63 24.68 9.42 32.13 21.29 6.96 6.49 15.41 10.54 6.52
NGA50 (Mb) 11.82 14.34 16.41 5.69 7.12 7.38 12.36 17.31 12.99 14.61 24.52 14.32
Longest alignment (Mb) 45.66 75.32 72.84 32.62 34.07 70.48 78.99 71.03 78.22 71.68 75.56 75.65
Assembly errors 153 91 158 275 194 124 177 119 107 198 156 126
Computational resources
CPU hours (h) 203 725 589 20,000 ~150,000 891 5,000 936 ~768 20,000 963 ~768
Maximum RAM (Gb) 53 45 622 500 222 600 644 ~966 651 ~692
Consensus accuracy
Short indels (Mb) 1.99 1.72 0.85 0.56 0.57 27.16 37.3/2.65 0.64 3.09 1.19 0.62 3.38
Short indel rate (bp) 1,381 1,592 3,252 4,966 4,828 98 74/1,047 4,372 899 2,323 4,499 828
Medium indels (Mb) 0.45 0.43 0.29 0.38 0.35 1.85 2.43/0.74 0.27 0.7 0.28 0.29 0.73
Medium indel rate (bp) 6,049 6,358 9,447 7,335 7,766 1,442 1,142/3,753 10,161 3,982 10,046 9,608 3,817
Long indels (kb) 17.95 18.73 17.74 19.21 45.96 12.64 15.60/16.49 22.9 18.13 17.65 24.85 23.05
Long indel rate (kb) 153 146 157 145 60 211 178/169 121 153 157 113 121
Indels per 100 kb 102 90 53 47 55 1,135 1,471/147 36 141 62 39 152
100-mer completeness (%) 84.24 84.82 87.44 87.54 86.41 29.47 22.47/81.47 87.45 79.84 86.42 88.53 79.38
Median QV 27.84 28.41 31.02 27.10 28.79 17.08 16.41/23.48 26.42 23.36 27.30 -
Gene completeness
BUSCO complete (no.) 3,884 3,893 3,898 3,866 3,882 1,974 2,268/3,680 3,907 3,788 3,874 3,904 3,752
BUSCO complete (%) 94.64 94.86 94.98 94.20 94.59 48.10 55.26/89.67 95.20 92.30 94.40 95.13 91.42

Structural and consensus accuracy were determined as described in the Methods (Assembly validation). All of the assemblies were built by the developers of the respective assemblers. In particular, all of the NA12878 assemblies (except WTDBG2) were generated using the Nanopore (rel5) plus Illumina data at the assembly or polishing steps. The CANU assembly was hybrid polished with NANOPOLISH ×2, RACON ×2 and PILON ×2. The FLYE assembly was hybrid polished with RACON ×2 and NTEDIT ×3 (Methods). The SHASTA assemblies were polished using only Nanopore reads with HELEN and MARGINPOLISH. The FALCON assembly was polished using only PacBio CLR reads with QUIVER. The WENGAN, MaSuRCA and WTDBG2 assemblies were not polished by external tools. The reported CPU time does not include the time spent polishing the assembly with external tools. NG50 and NGA50 were computed using the total chromosome length of GRCh38 (3.088 Gb) as the genome size. Assembly errors overlapping centromeric regions or segmental duplications (SDs) of GRCh38 were excluded from the analysis. Indels were called from aligned blocks of ≥1 kb with an average identity of ≥99% and were classified by length as short (1–2 bp), medium (3–50 bp) or long (>50 bp). The indel rate was computed by dividing the amount of aligned assembly sequence by the number of indels called on those alignments. The 'indels per 100 kb' value was computed by QUAST from aligned blocks of ≥0.5 kb with a minimum identity of 80%. The 100-mer completeness is the fraction of distinct 100-mers in the GRCh38 reference (2.835 Gb) that are captured in the corresponding assembly. Consensus statistics before and after polishing are given for the FLYE assembly (values separated by a slash). The best and worst performers on each assembly metric (rows) are highlighted in bold or underlined font, respectively.
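
The metric definitions in the note above can be illustrated with a minimal Python sketch. It is not the pipeline used for the table: the function names are hypothetical, the NG50 routine follows the usual definition (the contig length at which the cumulative sum of contigs, sorted from longest to shortest, reaches half the genome size), and the k-mer routine compares plain forward-strand substrings on small strings, whereas a real analysis would presumably use a dedicated k-mer counter and handle reverse complements.

```python
from typing import Iterable

# Genome size used for NG50/NGA50 in the table note (3.088 Gb).
GRCH38_CHROM_LENGTH = 3_088_000_000


def ng50(contig_lengths: Iterable[int], genome_size: int) -> int:
    """Length of the contig at which the running total, taken from the longest
    contig downwards, first reaches half of genome_size (0 if never reached)."""
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= genome_size:
            return length
    return 0


def classify_indel(length_bp: int) -> str:
    """Size classes from the table note: short 1-2 bp, medium 3-50 bp, long >50 bp."""
    if length_bp < 1:
        raise ValueError("indel length must be at least 1 bp")
    if length_bp <= 2:
        return "short"
    if length_bp <= 50:
        return "medium"
    return "long"


def indel_rate(aligned_bp: int, n_indels: int) -> float:
    """Aligned assembly sequence divided by the number of indels called on it,
    i.e. the average number of aligned bases per indel."""
    if n_indels == 0:
        return float("inf")
    return aligned_bp / n_indels


def kmer_completeness(reference: str, assembly: str, k: int = 100) -> float:
    """Fraction of distinct reference k-mers that also occur in the assembly
    (forward strand only; a simplification made for this sketch)."""
    ref_kmers = {reference[i:i + k] for i in range(len(reference) - k + 1)}
    asm_kmers = {assembly[i:i + k] for i in range(len(assembly) - k + 1)}
    if not ref_kmers:
        return 0.0
    return len(ref_kmers & asm_kmers) / len(ref_kmers)
```

For instance, `ng50([50, 40, 30, 20, 10], 200)` returns 30, because the three longest contigs are needed to reach half of the assumed genome size of 200. Passing the lengths of aligned blocks instead of contig lengths to the same function gives an NGA50-style value. A higher indel rate means fewer indels per aligned base, which is why larger values in the rate rows indicate better consensus quality.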