2010 Dec 27;108(4):1513–1518. doi: 10.1073/pnas.1017351108

Table 3.

Human and mouse assemblies

Assemblies:	Human				Mouse
Assembly no.:	1		2	3	4	5	6
Sequence data:	Illumina		Illumina	ABI3730	Illumina	Illumina	ABI3730
Program:	ALLPATHS-LG		SOAP	Celera	ALLPATHS-LG	SOAP	ARACHNE
Completeness
Covered, %	91.1		74.3	96.2	88.7	86.2	94.2
Captured, %	6.6		18.6	1.3	8.6	8.0	3.8
Uncaptured, %	2.3		7.0	2.5	2.7	5.7	2.0
Segmental duplication coverage, %	41.1		12.1	62.2	42.3	27.9	65.7
Exon bases covered, %	95.1		81.2	96.2	96.7	92.4	97.3
Continuity
Contig N50, kb	24		5.5	109	16	16	25
Scaffold N50, kb	11,543		399	17,646	7,156	340	16,871
Contig accuracy
Ambiguous bases, %	0.08		0	0	0.04	0	0
1-kb chunks vs. reference	NA12878	GRC	GRC	GRC	B6	B6	B6
(I) perfect	77.1				88.6	76.8	78.0
(II) ≤0.1% error rate	8.7				2.5	2.9	7.0
(III) ≤1%	10.2				5.7	6.1	11.7
(IV) ≤10%	3.1	3.6	5.5	3.6	2.8	11.8	2.4
(V) >10%	0.4	0.4	0.7	0.5	0.2	2.4	0.3
Base quality, from I–III	Q33				Q36	Q35	Q33
Misassembly % of 1-kb chunks, from IV–V	3.5	4.0	6.2	4.1	3.0	14.2	2.7
Scaffold accuracy
Validity at 100 kb, %	99.1		99.5	99.7	99.0	98.8	99.1

An evaluation of human and mouse assemblies is shown. Contigs of size <1 kb were excluded from the analysis. Reference sequences are described in SI Materials and Methods.

Assembly no.: Assemblies 1, 4, and 5 are from the data of this paper and are deposited in DDBJ/EMBL/GenBank under accession nos. AEKP00000000, AEKQ00000000, and AEKR00000000, respectively. The versions described in this paper are the first versions, AEKP01000000, AEKQ01000000, and AEKR01000000. For each ambiguity {x₁, …, xₙ}, we inserted x₁ into the fasta sequence and referred to x₂, …, xₙ in a note at the locus. Assemblies 2, 3, and 6 are from refs. 3, 12, and 19.

Completeness: Contigs were aligned to the reference sequence, with each contig assigned to at most one location. The covered fraction of a genome is the fraction of total bases in the reference (exclusive of gaps) that lie under a contig. The captured fraction consists of those bases that lie within a gap in a scaffold. All other bases are uncaptured. Exon coverage was computed from RefSeq gene annotations (http://genome.ucsc.edu/cgi-bin/hgTables). Segmental duplication coverage was computed from http://humanparalogy.gs.washington.edu/build36/oo.weild10kb.join.all.cull.xwparse and http://mouseparalogy.gs.washington.edu/She2008_download/WGAC.tab.gz.

Continuity: We report the N50 sizes of contigs and scaffolds, excluding gaps in the latter case.

Contig accuracy: We first report the fraction of bases labeled as ambiguous (SI Materials and Methods). We then divide the contigs into 1-kb chunks (as in ref. 9, which, however, used a chunk size of 10 kb). Each chunk is then aligned to the reference sequence using the Smith–Waterman algorithm, seeded on perfect 100-mer matches, to find the optimal placement, and the number of errors (mismatch plus indel bases) is computed. (Contigs having no 100-mer match were treated as novel sequence and ignored for purposes of this analysis. Novel sequence accounted for <1% in all cases.) Each chunk is then assigned to one of five mutually exclusive classes (I–V) on the basis of its error rate. The percentages of chunks landing in each class are listed. Note that for assembly 1, contig accuracy was calculated with respect to two reference sequences (NA12878 and GRC).

Base quality: Inferred Phred quality (20) of bases in chunk classes I–III.

Misassembly %: Total fraction of bases in chunk classes IV–V.

Scaffold accuracy: Validity at 100 kb (9): We report the probability that two 100-base sequences in the assembly, separated by 100 kb, and also present in the reference, have the same orientation and are separated by 100 kb ± 10%.
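The contig-accuracy bookkeeping described in the legend can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: it assumes the per-chunk error counts (mismatch plus indel bases) have already been obtained from the alignment step, that all chunks have the same length (so the fraction of chunks equals the fraction of bases), and that the Phred score is inferred from the pooled error rate of classes I–III. The names `classify_chunk`, `summarize`, and `CLASS_BOUNDS` are hypothetical.

```python
import math

# Upper error bounds for classes I-IV, in errors per 1,000 aligned bases
# (0%, 0.1%, 1%, 10%). Anything above the last bound is class V.
CLASS_BOUNDS = (("I", 0), ("II", 1), ("III", 10), ("IV", 100))
LABELS = ("I", "II", "III", "IV", "V")


def classify_chunk(errors, chunk_len=1000):
    """Return the class label for a chunk with `errors` mismatch+indel bases."""
    for label, per_thousand in CLASS_BOUNDS:
        # Integer form of: errors / chunk_len <= per_thousand / 1000.
        if errors * 1000 <= per_thousand * chunk_len:
            return label
    return "V"


def summarize(error_counts, chunk_len=1000):
    """Return (class percentages, misassembly %, Phred score or None).

    Assumption: the Phred score is computed from the pooled error rate of
    chunks in classes I-III; it is None when those chunks have no errors.
    """
    if not error_counts:
        raise ValueError("summarize requires at least one chunk")
    counts = dict.fromkeys(LABELS, 0)
    good_errors = 0  # errors observed in classes I-III only
    for errors in error_counts:
        label = classify_chunk(errors, chunk_len)
        counts[label] += 1
        if label in ("I", "II", "III"):
            good_errors += errors
    total = len(error_counts)
    percent = {label: 100.0 * counts[label] / total for label in LABELS}
    # Misassembly % is the share of chunks in classes IV and V.
    misassembly = percent["IV"] + percent["V"]
    good_bases = chunk_len * (counts["I"] + counts["II"] + counts["III"])
    phred = None
    if good_errors > 0:
        # Phred scale: Q = -10 * log10(error probability).
        phred = -10.0 * math.log10(good_errors / good_bases)
    return percent, misassembly, phred


# Toy input: ten 1-kb chunks with made-up error counts.
percent, misassembly, phred = summarize([0] * 6 + [1, 5, 50, 200])
print(percent, misassembly, phred)
```

The same relationship can be read directly off the table: for assembly 1 against NA12878, classes IV and V hold 3.1% and 0.4% of chunks, which sum to the reported misassembly value of 3.5%.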
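The continuity rows rely on the N50 statistic, which the legend uses without defining. As a hedged sketch, the block below uses the common definition (the largest length L such that sequences of length at least L contain half or more of the total bases) and models each scaffold as a list of its contig lengths so that gaps are excluded, as the legend specifies. The function names and the input layout are assumptions made for illustration.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first length at which the running total reaches half is the N50.
        if running >= half:
            return length
    raise ValueError("n50 requires at least one length")


def scaffold_n50(scaffolds):
    """Return the scaffold N50 with gaps excluded.

    Assumption: each scaffold is given as a list of its contig lengths,
    so summing them yields the ungapped scaffold length.
    """
    return n50([sum(contigs) for contigs in scaffolds])


# Toy input: three scaffolds with ungapped lengths 30, 5, and 50.
print(n50([10, 20, 5, 40, 10]))
print(scaffold_n50([[10, 20], [5], [40, 10]]))
```

To mirror the analysis in the table, contigs shorter than 1 kb would be filtered out before either function is called.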