Table 1. Comparison of assemblies of the E. coli K12 MG1655 benchmark dataset.
Assembler | Assembly as reported in | Contig N50 (kbp) | Scaffold N50 (kbp) | Coverage | Errors reported |
Allpaths2 | Allpaths2 | 337 | 2,680 | 99.3% | Base accuracy Q67; no misassemblies |
Soapdenovo | Soapdenovo | 89 | 105 | NR | 5 incorrect contigs |
Velvet | Allpaths2 | 62 | 298 | 97.7% | Base accuracy Q34; 6.9% of 10 kb regions misassembled |
Velvet | ABySS | 54 | NR | 98.8% | 9 incorrect contigs (mean size 33 kbp) |
Euler-SR | ABySS | 57 | NR | 99.8% | 26 incorrect contigs (mean size 52 kbp) |
Euler | Allpaths2 | 19 | 19 | 94.6% | Base accuracy Q30; 7.0% of 10 kb regions misassembled |
Meraculous | This report | 41 | 57 | 97.8% | No errors * |
Edena | ABySS | 16 | NR | 99.1% | 6 incorrect contigs (mean size 13 kbp) |
ABySS | ABySS | 45 | NR | 99.4% | 13 incorrect contigs (mean size 33 kbp) |
SSAKE | ABySS | 11 | NR | 99.99% | 38 incorrect contigs (mean size 6 kbp) |
In the ref. [9] analysis of ABySS, Velvet, Euler-SR, SSAKE, and Edena, only contigs of at least 100 bp were considered, and genome coverage was based on full-length, partial, and broken alignments with at least 95% identity to the reference. Contigs with broken alignments, or that aligned with less than 95% identity, were considered incorrect. In the ref. [23] analysis of Allpaths2, Velvet, and Euler, only contigs of at least 1 kbp were considered. Genome coverage was computed as the fraction of 100-mers in the reference sequence that are present in the assembly, allowing for multiple occurrences in the assembly. Base quality was reported as the total number of discrepancies to the reference, computed over ∼10 kb assembly segments that contain fewer than 1% such discrepancies. Misassemblies were reported as the total fraction of bases in ∼10 kb segments containing at least 1% error. In the ref. [11] summary of the Soapdenovo assembly, only contigs longer than 100 bp were reported.
NR: not reported.
*Four localized discrepancies were noted between our Meraculous assembly and the E. coli K12 MG1655 reference sequence. As described in the text, further examination showed that all four discrepancies were in fact errors in the reference (or mutations in the lineages separating the MG1655 reference sample from the short read dataset sample). Errors reported for the other assemblers have not been analysed.
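
The 100-mer genome coverage metric described in the notes above (the fraction of reference 100-mers that are present in the assembly, counting a 100-mer once however often it occurs in the assembly) can be illustrated with a minimal Python sketch. The function name `kmer_coverage` is hypothetical, and two choices are assumptions not stated in the notes: contigs are matched on both strands, and the reference is treated as linear rather than circular. This is an illustration of the metric, not the code used in ref. [23].

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def kmer_coverage(reference: str, contigs: list[str], k: int = 100) -> float:
    """Fraction of k-mers in `reference` that occur in at least one contig.

    Assumptions (not from the source): both strands of each contig are
    indexed, and the reference is treated as a linear sequence.
    """
    # Collect every k-mer in the assembly; a set ignores multiple occurrences.
    assembly_kmers = set()
    for contig in contigs:
        for strand in (contig, reverse_complement(contig)):
            for i in range(len(strand) - k + 1):
                assembly_kmers.add(strand[i:i + k])

    total = len(reference) - k + 1
    if total <= 0:
        raise ValueError("reference is shorter than k")

    # Count reference k-mers (by position) that are found in the assembly.
    found = sum(reference[i:i + k] in assembly_kmers for i in range(total))
    return found / total


if __name__ == "__main__":
    # Toy example with k=4; a real comparison would use k=100.
    print(kmer_coverage("ACGTACGGTCA", ["ACGTACG"], k=4))
```

For a genome-scale reference, holding every assembly k-mer in a Python set is memory-hungry; a dedicated k-mer counting tool would be the practical choice, but the definition of the metric is the same.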