2011 Aug 18;6(8):e23501. doi: 10.1371/journal.pone.0023501

Table 1. Comparison of assemblies of the E. coli K12 MG1655 benchmark dataset.

| Assembler | Assembly as reported in | Contig N50 (kbp) | Scaffold N50 (kbp) | Coverage (%) | Errors reported |
|---|---|---|---|---|---|
| Allpaths2 | Allpaths2 | 337 | 2,680 | 99.3 | Base accuracy Q67; no misassemblies |
| Soapdenovo | Soapdenovo | 89 | 105 | NR | 5 incorrect contigs |
| Velvet | Allpaths2 | 62 | 298 | 97.7 | Base accuracy Q34; 6.9% of 10 kb regions misassembled |
| Velvet | ABySS | 54 | NR | 98.8 | 9 incorrect contigs (mean size 33 kbp) |
| Euler-SR | ABySS | 57 | NR | 99.8 | 26 incorrect contigs (mean size 52 kbp) |
| Euler | Allpaths2 | 19 | 19 | 94.6 | Base accuracy Q30; 7.0% of 10 kb regions misassembled |
| Meraculous | This report | 41 | 57 | 97.8 | No errors* |
| Edena | ABySS | 16 | NR | 99.1 | 6 incorrect contigs (mean size 13 kbp) |
| ABySS | ABySS | 45 | NR | 99.4 | 13 incorrect contigs (mean size 33 kbp) |
| SSAKE | ABySS | 11 | NR | 99.99 | 38 incorrect contigs (mean size 6 kbp) |

In the ref. [9] analysis of ABySS, Velvet, Euler-SR, SSAKE, and Edena, only contigs of at least 100 bp were considered, and genome coverage was based on full-length, partial, and broken alignments with at least 95% identity to the reference. Contigs with broken alignments, or that aligned with less than 95% identity, were considered incorrect. In the ref. [23] analysis of Allpaths2, Velvet, and Euler, only contigs of at least 1 kbp were considered. Genome coverage was computed as the fraction of 100-mers in the reference sequence that are present in the assembly, allowing for multiple occurrences in the assembly. Base quality was reported as the total number of discrepancies to the reference, computed over ∼10 kb assembly segments that contain fewer than 1% such discrepancies. Misassemblies were reported as the total fraction of bases in ∼10 kb segments containing at least 1% error. In the ref. [11] summary of the Soapdenovo assembly, contigs >100 bp were reported.
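The 100-mer coverage measure described for ref. [23] can be illustrated with a minimal Python sketch. This is not the code used in ref. [23]. The function names (`kmer_coverage`, `phred_quality`), the `both_strands` option, and the reading of the table's Q values as Phred-scaled error rates are assumptions made for illustration only.

```python
import math

# Translation table for complementing bases; other characters pass through unchanged.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmers(seq, k):
    """Return the set of distinct k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_coverage(reference, contigs, k=100, both_strands=True):
    """Fraction of reference k-mer positions whose k-mer occurs in the assembly.

    Membership is tested against a set, so a k-mer that occurs several
    times in the assembly still counts as present.
    Assumption: with both_strands=True a contig may match either strand.
    """
    assembly = set()
    for contig in contigs:
        contig = contig.upper()
        assembly |= kmers(contig, k)
        if both_strands:
            assembly |= kmers(reverse_complement(contig), k)

    reference = reference.upper()
    positions = len(reference) - k + 1
    if positions <= 0:
        raise ValueError("reference is shorter than k")
    hits = sum(1 for i in range(positions) if reference[i:i + k] in assembly)
    return hits / positions


def phred_quality(discrepancies, bases):
    """Phred-scaled accuracy, Q = -10 * log10(discrepancies / bases).

    Assumption: the Q values in the table follow this convention.
    """
    if discrepancies == 0:
        return math.inf
    return -10 * math.log10(discrepancies / bases)
```

For example, `kmer_coverage("AAAACCCC", ["GGGGTTTT"], k=4)` counts every reference 4-mer as covered because the contig is the reverse complement of the reference, whereas passing `both_strands=False` finds none. The sketch holds all assembly k-mers in memory, so it suits small test sequences rather than whole genomes.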

NR: not reported.

*Four localized discrepancies were noted between our Meraculous assembly and the E. coli K12 MG1655 reference sequence. As described in the text, further examination showed that all four discrepancies were in fact errors in the reference (or mutations in the lineages separating the MG1655 reference sample from the short-read dataset sample). We have not analysed the errors reported for the other assemblers.