Table 3.
Human and mouse assemblies
Assemblies: | Human | | | Mouse | | | |
Assembly no.: | 1 | 2 | 3 | 4 | 5 | 6 | |
Sequence data: | Illumina | Illumina | ABI3730 | Illumina | Illumina | ABI3730 | |
Program: | ALLPATHS-LG | SOAP | Celera | ALLPATHS-LG | SOAP | ARACHNE | |
Completeness | |||||||
Covered, % | 91.1 | 74.3 | 96.2 | 88.7 | 86.2 | 94.2 | |
Captured, % | 6.6 | 18.6 | 1.3 | 8.6 | 8.0 | 3.8 | |
Uncaptured, % | 2.3 | 7.0 | 2.5 | 2.7 | 5.7 | 2.0 | |
Segmental duplication coverage, % | 41.1 | 12.1 | 62.2 | 42.3 | 27.9 | 65.7 | |
Exon bases covered, % | 95.1 | 81.2 | 96.2 | 96.7 | 92.4 | 97.3 | |
Continuity | |||||||
Contig N50, kb | 24 | 5.5 | 109 | 16 | 16 | 25 | |
Scaffold N50, kb | 11,543 | 399 | 17,646 | 7,156 | 340 | 16,871 | |
Contig accuracy | |||||||
Ambiguous bases, % | 0.08 | 0 | 0 | 0.04 | 0 | 0 | |
1-kb chunks vs. reference | NA12878 | GRC | GRC | GRC | B6 | B6 | B6 |
(I) perfect | 77.1 | 88.6 | 76.8 | 78.0 | |||
(II) ≤0.1% error rate | 8.7 | 2.5 | 2.9 | 7.0 | |||
(III) ≤1% | 10.2 | 5.7 | 6.1 | 11.7 | |||
(IV) ≤10% | 3.1 | 3.6 | 5.5 | 3.6 | 2.8 | 11.8 | 2.4 |
(V) >10% | 0.4 | 0.4 | 0.7 | 0.5 | 0.2 | 2.4 | 0.3 |
Base quality, from I–III | Q33 | Q36 | Q35 | Q33 | |||
Misassembly % of 1-kb chunks, from IV–V | 3.5 | 4.0 | 6.2 | 4.1 | 3.0 | 14.2 | 2.7 |
Scaffold accuracy | |||||||
Validity at 100 kb, % | 99.1 | 99.5 | 99.7 | 99.0 | 98.8 | 99.1 |
An evaluation of human and mouse assemblies is shown. Contigs of size <1 kb were excluded from the analysis. Reference sequences are described in SI Materials and Methods.
Assembly no.: Assemblies 1, 4, and 5 are from the data of this paper and are deposited in DDBJ/EMBL/GenBank under accession nos. AEKP00000000, AEKQ00000000, and AEKR00000000, respectively. The versions described in this paper are the first versions, AEKP01000000, AEKQ01000000, and AEKR01000000. For each ambiguity {x1, … , xn}, we inserted x1 into the FASTA sequence and referred to x2, … , xn in a note at the locus. Assemblies 2, 3, and 6 are from refs. 3, 12, and 19.
Completeness: Contigs were aligned to the reference sequence, with each contig assigned to at most one location. The covered fraction of a genome is the fraction of total bases in the reference (exclusive of gaps) that lie under a contig. The captured fraction consists of those bases that lie within a gap in a scaffold. All other bases are uncaptured. Exon coverage was computed from RefSeq gene annotations (http://genome.ucsc.edu/cgi-bin/hgTables). Segmental duplication coverage was computed from http://humanparalogy.gs.washington.edu/build36/oo.weild10kb.join.all.cull.xwparse and http://mouseparalogy.gs.washington.edu/She2008_download/WGAC.tab.gz.
Continuity: We report the N50 sizes of contigs and scaffolds, excluding gaps in the latter case.
Contig accuracy: We first report the fraction of bases labeled as ambiguous (SI Materials and Methods). We then divide the contigs into 1-kb chunks (as in ref. 9, which, however, used a chunk size of 10 kb). Each chunk is then aligned to the reference sequence using the Smith–Waterman algorithm, seeded on perfect 100-mer matches, to find the optimal placement, and the number of errors (mismatch plus indel bases) is computed. (Contigs having no 100-mer match were treated as novel sequence and ignored for purposes of this analysis. Novel sequence accounted for <1% in all cases.)
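The chunking and error-counting step described above can be illustrated with a minimal Python sketch. It is not the authors' code: the function names are hypothetical, the handling of a trailing partial chunk is an assumption (the legend does not specify it), and the alignment itself is taken as given rather than computed with a seeded Smith–Waterman search.

```python
from typing import List

CHUNK_SIZE = 1000  # 1-kb chunks, as stated in the legend


def split_into_chunks(contig: str, size: int = CHUNK_SIZE) -> List[str]:
    """Cut a contig into consecutive full-size chunks.

    Assumption: a trailing piece shorter than `size` is dropped; the
    legend does not say how partial chunks were handled.
    """
    return [contig[i:i + size] for i in range(0, len(contig) - size + 1, size)]


def count_errors(aligned_chunk: str, aligned_ref: str) -> int:
    """Count mismatch plus indel bases in a gapped pairwise alignment.

    Both strings must have equal length, with '-' marking a gap. A column
    counts as an error when the two characters differ, which covers both
    mismatches and gap columns.
    """
    if len(aligned_chunk) != len(aligned_ref):
        raise ValueError("aligned strings must have equal length")
    return sum(1 for a, b in zip(aligned_chunk, aligned_ref) if a != b)


def error_rate(aligned_chunk: str, aligned_ref: str) -> float:
    """Errors divided by the number of chunk bases.

    Assumption: the denominator is the chunk length (non-gap characters
    of the chunk row); the legend does not define it explicitly.
    """
    chunk_bases = sum(1 for a in aligned_chunk if a != "-")
    if chunk_bases == 0:
        raise ValueError("chunk row contains no bases")
    return count_errors(aligned_chunk, aligned_ref) / chunk_bases
```

For example, aligning the chunk row `ACGT-A` against the reference row `ACCTTA` gives one mismatch and one indel base, so two errors over five chunk bases.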
Each chunk is then assigned to one of five mutually exclusive classes (I–V) on the basis of its error rate. The percentages of chunks landing in each class are listed. Note that for assembly 1, contig accuracy was calculated with respect to two reference sequences (NA12878 and GRC).
Base quality: Inferred Phred quality (20) of bases in chunk classes I–III.
Misassembly %: Total fraction of bases in chunk classes IV–V.
Scaffold accuracy: Validity at 100 kb (9): We report the probability that two 100-base sequences in the assembly, separated by 100 kb, and also present in the reference, have the same orientation in the reference and are separated there by 100 kb ± 10%.
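The class assignment, misassembly percentage, and inferred base quality can be sketched as follows. This is an illustrative Python sketch, not the authors' implementation: the function names are hypothetical, the class boundaries are read off the table rows (perfect, ≤0.1%, ≤1%, ≤10%, >10%), and the Phred value is assumed to be computed from the pooled error rate of classes I–III using the standard relation Q = −10·log10(error rate).

```python
import math
from collections import Counter
from typing import Dict, Iterable, Tuple

CLASSES = ("I", "II", "III", "IV", "V")


def classify_chunk(errors: int, length: int) -> str:
    """Assign a chunk to class I-V from its error count and length.

    Integer comparisons are used so that boundary cases (for example
    exactly 0.1% error) are not affected by floating-point rounding.
    """
    if errors == 0:
        return "I"    # perfect
    if errors * 1000 <= length:
        return "II"   # <= 0.1% error rate
    if errors * 100 <= length:
        return "III"  # <= 1%
    if errors * 10 <= length:
        return "IV"   # <= 10%
    return "V"        # > 10%


def phred(errors: int, bases: int) -> float:
    """Phred-scaled quality, Q = -10 * log10(errors / bases)."""
    if bases <= 0:
        raise ValueError("no bases to evaluate")
    if errors == 0:
        return math.inf
    return -10.0 * math.log10(errors / bases)


def summarize(chunks: Iterable[Tuple[int, int]]) -> Dict[str, float]:
    """Summarize (errors, length) pairs, one pair per aligned chunk."""
    counts: Counter = Counter()
    bases: Counter = Counter()
    errs: Counter = Counter()
    for errors, length in chunks:
        cls = classify_chunk(errors, length)
        counts[cls] += 1
        bases[cls] += length
        errs[cls] += errors
    total_chunks = sum(counts.values())
    total_bases = sum(bases.values())
    if total_chunks == 0:
        raise ValueError("no chunks supplied")
    # Percentage of chunks landing in each class.
    result = {c: 100.0 * counts[c] / total_chunks for c in CLASSES}
    # Misassembly %: fraction of bases in classes IV and V.
    result["misassembly_pct"] = 100.0 * (bases["IV"] + bases["V"]) / total_bases
    # Base quality: pooled over classes I-III (an assumption of this sketch).
    good = ("I", "II", "III")
    result["phred"] = phred(sum(errs[c] for c in good), sum(bases[c] for c in good))
    return result
```

Because the chunks are of equal size, the misassembly row of the table is the sum of rows IV and V; for the first column, 3.1 + 0.4 = 3.5.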