Nucleic Acids Research. 2002 Nov 15;30(22):5004–5014. doi: 10.1093/nar/gkf633

Comparison of whole genome assemblies of the human genome

Eric C Rouchka 1,2,a, Warren Gish 2, David J States 3,b
PMCID: PMC137179  PMID: 12434005

Abstract

A fundamental problem in the human genome project is uncovering the correct assembly of the human genome. Many studies, including transcriptional analysis, SNP detection and characterization, gene finding and EST clustering, use genome assemblies as templates so it is important to determine the consistency among the various whole genome assemblies. A comparison of the order and orientation of the GenBank entries used to construct the NCBI and UCSC Goldenpath assemblies was made. In addition, a sequence level comparison was performed using MULTI, an efficient database search tool developed to make whole genome comparisons possible. The resulting comparisons show significant discrepancies in the sequence as well as in the order and orientation of GenBank entries used in constructing the NCBI and UCSC assemblies.

INTRODUCTION

In February 2001, the ‘complete’ human genome was published to a draft sequence level (1,2). Assembly of the pieces of the human genome, whether using a clone-based method as with the public effort or through shotgun sequencing as with the private endeavor, is not a trivial task. Events such as repetitive elements (3,4), gene duplication (5) and segmental duplications (6) within the human genome contribute to the complexity of genomic assembly. A growing number of projects including transcriptional analysis (7), SNP detection and characterization (8), gene finding (9) and EST clustering (10,11) require the sequence of the human genome as a template for mining data. Therefore, a consistent and correct assembly is imperative. The comparisons outlined here are based on the August 6, 2001 release of the University of California–Santa Cruz’s Goldenpath and the National Center for Biotechnology Information (NCBI) human genome contigs build 26 (based on data through September 24, 2001).

One widely used assembly of the human genome is the Goldenpath assembly produced by a group at the University of California–Santa Cruz. The Goldenpath assembly begins with a greedy algorithm to build an initial set of sequence contigs and then uses mRNA, paired plasmid ends, ESTs, BAC end pairs, BAC fingerprint maps and additional information to order and orient these individual contigs into larger assemblies (12). The August 6, 2001 release assembled a total of 2.88 GB of sequence data into contigs representing over 92% of the human genome (see http://genome.cse.ucsc.edu/cgi-bin/crom.cgi.init).

NCBI maintains its own assembly of the human genome based on the current GenBank (13) entries. The contig assembly process it employs is discussed on its web site (http://www.ncbi.nlm.nih.gov/genome/guide/build.html#contig) and is summarized here. Individual clones are assigned to a chromosome based on three factors: annotation contained within the GenBank record, the presence of at least three STS markers that have been mapped to the same chromosome, or personal communication from a sequencing center. Conflicting sequence overlaps and redundant clones are filtered. The remaining clones are constructed into contigs based on sequence overlap. The contigs are organized into melds, which are ordered and oriented based on ESTs, mRNAs, paired plasmid reads and annotation information concerning order and orientation provided by the submitting groups. Build 26 assembled a total of 2.86 GB of sequence data into contigs (http://www.ncbi.nlm.nih.gov/genome/guide/human/HsStats.html), slightly less than the Goldenpath August 6, 2001 release.

In order to aid the detection of regions of similarities between assemblies at the sequence level, a tool called MULTI was developed. MULTI allows for the efficient searching of regions of similarity by using a deterministic finite automaton (DFA) approach.

MATERIALS AND METHODS

Data retrieval

The August 6, 2001 Goldenpath assembly by chromosome was downloaded from http://genome.cse.ucsc.edu/goldenPath/06aug2001/chromosomes/. The files downloaded include chr1.zip, chr2.zip, chr3.zip, chr4.zip, chr5.zip, chr6.zip, chr7.zip, chr8.zip, chr9.zip, chr10.zip, chr11.zip, chr12.zip, chr13.zip, chr14.zip, chr15.zip, chr16.zip, chr17.zip, chr18.zip, chr19.zip, chr20.zip, chr21.zip, chr22.zip, chrX.zip and chrY.zip. Each of these files represents a zipped fasta representation of the assembled individual chromosome sequences. In addition, the file chromAgp.zip was downloaded from http://genome.cse.ucsc.edu/goldenPath/06aug2001/bigZips/. Information pertaining to the order and orientation of individual GenBank entries used to construct each chromosome is described in this file.

The contigs for each NCBI chromosome were downloaded from the URL ftp://ncbi.nlm.nih.gov/genomes/H_sapiens/. Included were the files hs_chr1.fa.gz, hs_chr2.fa.gz, hs_chr3.fa.gz, hs_chr4.fa.gz, hs_chr5.fa.gz, hs_chr6.fa.gz, hs_chr7.fa.gz, hs_chr8.fa.gz, hs_chr9.fa.gz, hs_chr10.fa.gz, hs_chr11.fa.gz, hs_chr12.fa.gz, hs_chr13.fa.gz, hs_chr14.fa.gz, hs_chr15.fa.gz, hs_chr16.fa.gz, hs_chr17.fa.gz, hs_chr18.fa.gz, hs_chr19.fa.gz, hs_chr20.fa.gz, hs_chr21.fa.gz, hs_chr22.fa.gz, hs_chrX.fa.gz and hs_chrY.fa.gz. Each of these files represents the individual gzip compressed fasta sequences for the contigs within each respective chromosome. However, these individual files do not represent the actual ordering and orientation of the contigs relative to one another. The file seq_contig.md contains this information and was therefore downloaded. An assembly of each individual chromosome was constructed using the seq_contig.md file as a guide. In the cases where the orientation of an individual clone is unknown, the clone was left in the orientation provided by the GenBank entry. The file allcontig.agp.build26.gz was downloaded as well since it contains order and orientation information for the individual GenBank entries used to construct each contig. A list of the order and orientation of individual clones used to construct a chromosomal assembly was created using both the allcontig.agp.build26.gz and seq_contig.md files.
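The chromosome construction step described above can be sketched roughly as follows. This is a minimal illustration, not the procedure actually used: the `layout` record format, the `"?"` marker for unknown orientation and the fixed `gap_len` spacer between contigs are all assumptions made for the example, and the real seq_contig.md file carries more fields than are modeled here.

```python
# Hypothetical layout record: (contig_id, orientation), orientation in {"+", "-", "?"}.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a nucleotide string."""
    return seq.translate(_COMPLEMENT)[::-1]


def build_chromosome(layout, contigs, gap_len=100):
    """Concatenate contigs in layout order, flipping those marked "-".

    Contigs with "+" or unknown ("?") orientation are left as given,
    mirroring the rule of keeping the GenBank orientation when unknown.
    gap_len is an assumed placeholder spacer of N characters.
    """
    pieces = []
    for contig_id, orientation in layout:
        seq = contigs[contig_id]
        if orientation == "-":
            seq = reverse_complement(seq)
        pieces.append(seq)
    return ("N" * gap_len).join(pieces)
```

For example, `build_chromosome([("c1", "+"), ("c2", "-")], {"c1": "AAC", "c2": "GGT"}, gap_len=2)` places `c1` unchanged, two `N` characters, and then the reverse complement of `c2`.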

Order and orientation comparisons

The first comparison examined the order and orientation of individual GenBank entries used to assemble individual chromosomes. The NCBI data set was compared with the Goldenpath data set in an entry-by-entry fashion. Graphs were drawn to indicate the relationship between the order and orientation of GenBank entries. In addition, a table comparing the order and orientation of entries used in the assemblies was created.

Sequence comparisons

Sequence-based comparisons were made using a DFA-based approach called MULTI. MULTI sets up a DFA based on the contents of a target sequence. A query sequence is compared with the target sequence looking for exact nmers in the DFA where n is set by the user. For the genome assembly comparisons, the word size n was set to be 1000, since two separate assemblies of the human genome would be expected to contain large regions of identity. Lowering the value of n would result in a slower but more finely grained comparison. A step size of 500 was used in generating search windows. This means that, in the query sequence, windows of 1000mers are searched against all of the possible 1000mers in the target data set, shifting the window by 500 nt each time. For instance, the first 1000mer searched was the one occurring from bases 1 to 1000, the second was located at bases 501 to 1500, the third from 1001 to 2000 and so on.
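The window layout can be illustrated with a short sketch. The function name `query_windows` is hypothetical, and the offsets it yields are 0-based, so an offset of 0 corresponds to the window that begins at base 1 in the text.

```python
def query_windows(length, word_size=1000, step=500):
    """Yield 0-based start offsets of every complete search window in a query."""
    for start in range(0, length - word_size + 1, step):
        yield start


# A 2500 nt query gives four complete 1000mers.
print(list(query_windows(2500)))  # [0, 500, 1000, 1500]
```

A trailing stretch shorter than the word size produces no window in this sketch; whether MULTI handles the tail differently is not stated in the text.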

Assembly differences were classified as major if the contig or orientation differed, or if there was an insertion or deletion >1 kb. All other assembly differences were classified as minor.

MULTI

MULTI is a computationally efficient database search tool designed to make efficient use of system RAM as well as CPU resources. Like BLASTN (14), MULTI builds a suffix tree storing a set of search words derived from the query sequences and then converts this suffix tree to a DFA. After the DFA has been constructed, a target database can be scanned in linear time, simultaneously searching for matches between the target database and any of the query sequences. In searching for near identity matches, the words used to build the DFA can be made long enough that few false positive word matches will occur. As long as the DFA is resident in RAM, the time required to scan a fixed-size target database is essentially constant and does not depend on the query length or number of query words stored in the DFA. Maximal computational efficiency is achieved when the DFA occupies the maximal fraction of available physical RAM. Preprocessing scripts to aggregate short queries have been implemented (15), but it is difficult to optimize memory use in BLAST or MUMMER because the size of the DFA depends on the length and redundancy of the query sequence. MULTI optimizes use of system resources by allowing the user to specify explicitly the amount of RAM available for the DFA. The program loads multiple query sequences sequentially into the suffix tree until no additional query words can be stored in the available RAM. Redundancy between queries results in sharing of nodes in the upper levels of the suffix tree and some additional efficiency in RAM utilization. As a result, storing a set of queries with aggregate length 2L does not require twice the storage as a set of queries with aggregate length L.
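The search strategy can be sketched with a much simpler stand-in. The code below stores strided query words in a Python dictionary and then slides over the target once; it is not a DFA, does not run in strictly linear time (each lookup hashes a full word), and does no memory budgeting, so it only illustrates the idea of searching many query words simultaneously in a single pass over the target. All function and parameter names are hypothetical.

```python
def build_word_table(queries, word_size, stride):
    """Map each strided query word to the (query name, offset) pairs where it occurs."""
    table = {}
    for name, seq in queries.items():
        for start in range(0, len(seq) - word_size + 1, stride):
            word = seq[start:start + word_size]
            table.setdefault(word, []).append((name, start))
    return table


def scan_target(target, table, word_size):
    """Slide over the target once and report (query name, query offset, target offset) hits."""
    hits = []
    for t in range(len(target) - word_size + 1):
        for name, q in table.get(target[t:t + word_size], ()):
            hits.append((name, q, t))
    return hits
```

Identical words from different queries share a single dictionary key, which loosely mirrors the node sharing between redundant queries described above.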

Biological sequences are not random, and the presence of repetitive sequences can result in false positives in sequence similarity searches (16). To suppress these false positives and optimize memory use, both the query sequence and the target database are masked using XNU (16). MULTI further optimizes the use of available memory resources by allowing the user to specify the stride between search words explicitly. For a near identity search using a long word length in constructing the DFA, very little information is gained by using sequential overlapping search words. For example, if the query and target sequences are expected to be 99% identical and the search words have length 20, then there is an 82% chance that a word in a true alignment will be a perfect match, resulting in a positive identification of the alignment. Conversely, if a given search word fails to match, then there is only a 5% chance that the next sequentially derived search word will be a perfect match because a mismatch anywhere, except the first position, will be present in the overlapping search word as well. Using a stride length equal to or longer than the search word length between adjacent search words guarantees that there will be no overlap and that each word in the DFA will contribute a maximal amount of information to the search engine. For near identity searches, very high sensitivity can be achieved even when a sparse set of search words is used. The probability of missing an alignment altogether is the product of the probabilities for missing each search word individually. Thus, the search sensitivity, Sn is

$$S_n = 1 - (1 - P_m)^{N_w}$$

$$P_m = f_{id}^{L_w} \qquad (1)$$

where Nw is the number of search words in the query derived from the aligned region, Lw is the length of a search word and fid is the fraction identity in the alignment. For an EST of length 200 with search words generated every 20th nucleotide, there will be 10 search words. If the fraction identity in the alignment is expected to be 99% (the single read sequencing error rate), then the probability that any given search word will fail to be a perfect match is 18% and the probability that all 10 search words will fail to match is 4e-8. Increasing the stride between adjacent search words entered in the DFA from 1 to 20 nt decreases the number of words per query sequence by a factor of 20 and increases the number of query sequences that can be searched simultaneously by a corresponding factor of 20.
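The EST example can be reproduced numerically with a few lines. This is a sketch of the sensitivity formula only; the function names are hypothetical, and the words are treated as independent, as the formula assumes.

```python
def word_match_probability(f_id, word_len):
    """P_m: probability that one search word of length word_len matches exactly."""
    return f_id ** word_len


def miss_probability(f_id, word_len, n_words):
    """Probability that all n_words independent search words fail to match."""
    return (1.0 - word_match_probability(f_id, word_len)) ** n_words


def sensitivity(f_id, word_len, n_words):
    """S_n = 1 - (1 - P_m)^N_w."""
    return 1.0 - miss_probability(f_id, word_len, n_words)


p_m = word_match_probability(0.99, 20)   # about 0.82
p_miss = miss_probability(0.99, 20, 10)  # about 4e-8
print(f"P_m = {p_m:.3f}, miss = {p_miss:.1e}")
```

Computing the miss probability directly, rather than as one minus the sensitivity, avoids losing precision when the sensitivity is very close to 1.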

Very long query words can be used for genomic sequence assembly comparisons between a pair of finished genomic assemblies. For convenience, we selected a word size of 1000 nt with a stride of 500 nt between words. Assuming that both assemblies are accurate to the NIH error rate of less than 1e-4, a 1000mer in the query would have, at most, a 10% chance of having an error and the target word would have, at most, a 10% chance of containing a sequencing error. If the errors in the query and target words were independent, 81% of 1000mers will still match. Since the query and target assemblies are derived from the same raw data, the errors are likely to be highly correlated and this can be viewed as a lower limit. The sensitivity for detecting an overlap length of 10 kb or longer is >99.99%. An advantage of the DFA algorithm implemented in MULTI is that very long word sizes can be used in building the DFA (1000 in the case of genome assembly comparisons). As a result, matches identified by DFA hits are highly specific. The probability of an exact 1000 nt match occurring at random is essentially zero, but false matches can occur between near identical repeats in the genome.
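The arithmetic behind these bounds can be spelled out as a rough check. The block below assumes a uniform per-base error rate of exactly 10^-4 and, to keep the search words independent, counts only the ten non-overlapping 1000mers in a 10 kb overlap. Both simplifications are made here for illustration and are not stated in the text.

```latex
\begin{align*}
P(\text{error-free 1000mer}) &= (1 - 10^{-4})^{1000} \approx 0.905 \;\ge\; 0.90 \\
P_m &\ge 0.90 \times 0.90 = 0.81 \\
S_n &\ge 1 - (1 - 0.81)^{10} = 1 - 0.19^{10} \approx 1 - 6.1 \times 10^{-8}
\end{align*}
```

Under these assumptions the bound sits far above the quoted 99.99%, and including the overlapping words generated by the 500 nt stride can only add further chances of a hit.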

MULTI only attempts to identify the aligned regions and generates an output that the user can then use to retrieve and align the relevant subsequences using other programs. A dynamic programming algorithm is used to order individual word matches derived from each query–target pair into sets consistent with a continuous alignment. MULTI implements a banded search that is linear time in the combined length of the query plus target sequences. During the course of this work, several other tools for efficient genome to genome comparison were developed (17–20). A full comparison of these tools is beyond the scope of this paper.
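One way to picture the ordering of word matches is a simple chaining routine. The sketch below is a plain quadratic-time dynamic program, not the banded linear-time search MULTI implements, and the `max_diagonal_shift` band width is an assumed parameter. It keeps the longest run of same-orientation hits that advance in both sequences while staying near a common diagonal.

```python
def chain_hits(hits, max_diagonal_shift=1000):
    """Return the longest chain of (query_pos, target_pos) hits that is
    increasing in both coordinates and whose diagonal (target - query)
    drifts by at most max_diagonal_shift between consecutive hits."""
    hits = sorted(hits)
    n = len(hits)
    if n == 0:
        return []
    best = [1] * n   # best[i]: length of the longest chain ending at hit i
    prev = [-1] * n  # back-pointer used to recover that chain
    for i in range(n):
        qi, ti = hits[i]
        for j in range(i):
            qj, tj = hits[j]
            collinear = qj < qi and tj < ti
            near = abs((ti - qi) - (tj - qj)) <= max_diagonal_shift
            if collinear and near and best[j] + 1 > best[i]:
                best[i] = best[j] + 1
                prev[i] = j
    end = max(range(n), key=best.__getitem__)
    chain = []
    while end != -1:
        chain.append(hits[end])
        end = prev[end]
    return chain[::-1]
```

A hit that lands far off the diagonal, such as a match to a distant repeat copy, is left out of the chain rather than breaking it.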

MULTI is implemented in portable C++. Source code and precompiled executables are available at http://stateslab.bioinformatics.med.umich.edu/software/multi/.

RESULTS

Clone ordering

Tables 1 and 2 summarize the entries used in the Goldenpath and NCBI assemblies. Due to sequence redundancy, the assemblies were not guaranteed to use the same GenBank entries. In some cases, entries were marked as unordered, meaning they could be placed on a chromosome but could not be accurately ordered. It is also possible that entries could not be accurately assigned to a chromosome. These entries fall into the unknown category. The Goldenpath assembly ordered 237 entries that NCBI left as unordered while NCBI ordered 198 entries left unordered in the Goldenpath assembly. In addition, there were 39 entries the Goldenpath assembly used that NCBI left as unassigned and 15 entries the NCBI assembly used that were unassigned in the Goldenpath. There were 252 entries used in both assemblies that pointed to the same sequence entry via different accession identifiers. A total of 227 entries were placed on conflicting chromosomes within the assemblies. There were 28 200 entries that were used consistently within both assemblies. An additional 847 entries were found only within the Goldenpath assembly. There were 4380 entries used uniquely in the NCBI assembly.

Table 1. Summary of accessions used in the August 6, 2001 Goldenpath assembly.

Chromosome | Total used (a) | Unordered (b) | Unknown (c) | Different accession (d) | Different chromosome (e) | Unmatched accessions (f) | Matched accessions (g)
1 | 2704 | 20 | 0 | 11 | 20 | 71 | 2593
2 | 1965 | 6 | 0 | 50 | 13 | 82 | 1864
3 | 2004 | 22 | 5 | 28 | 19 | 121 | 1837
4 | 1723 | 21 | 5 | 150 | 26 | 69 | 1602
5 | 2084 | 37 | 0 | 0 | 7 | 42 | 1998
6 | 1932 | 11 | 3 | 0 | 13 | 37 | 1868
7 | 1561 | 10 | 2 | 10 | 15 | 46 | 1488
8 | 1444 | 17 | 17 | 0 | 19 | 37 | 1354
9 | 1117 | 13 | 0 | 0 | 12 | 46 | 1046
10 | 1300 | 15 | 0 | 1 | 5 | 21 | 1259
11 | 1666 | 12 | 3 | 0 | 8 | 17 | 1626
12 | 1323 | 10 | 1 | 0 | 13 | 44 | 1255
13 | 893 | 2 | 0 | 0 | 8 | 12 | 871
14 | 678 | 1 | 0 | 0 | 3 | 9 | 665
15 | 826 | 2 | 0 | 0 | 12 | 18 | 794
16 | 856 | 9 | 0 | 2 | 10 | 33 | 804
17 | 763 | 5 | 1 | 0 | 8 | 5 | 744
18 | 968 | 10 | 0 | 0 | 9 | 8 | 941
19 | 819 | 5 | 1 | 0 | 1 | 14 | 798
20 | 629 | 0 | 0 | 0 | 0 | 0 | 629
21 | 103 | 0 | 0 | 0 | 0 | 99 | 4
22 | 527 | 0 | 0 | 0 | 0 | 0 | 527
X | 1465 | 9 | 1 | 0 | 6 | 16 | 1433
Y | 200 | 0 | 0 | 0 | 0 | 0 | 200
Totals | 29 550 | 237 | 39 | 252 | 227 | 847 | 28 200

(a) The total number of GenBank sequences used by UCSC to construct the chromosome labeled in the first column.

(b) The number of sequences UCSC uses that remain unordered in the NCBI assembly.

(c) The number of sequences assigned to a chromosome by UCSC that NCBI labels as unknown.

(d) Those sequences used by both UCSC and NCBI that are identical, but different accession IDs are given.

(e) Those sequences that UCSC assigns to one chromosome and NCBI assigns to another.

(f) The total number of sequences used in the UCSC chromosome assembly that are not found anywhere in the NCBI assembly.

(g) The total number of sequences that are used in both assemblies.

Table 2. Summary of accessions used in the NCBI build 26.

Chromosome | Total used (a) | Unordered (b) | Unknown (c) | Different accession (d) | Different chromosome (e) | Unmatched accessions (f) | Matched accessions (g)
1 | 3088 | 9 | 1 | 11 | 30 | 455 | 2593
2 | 2134 | 3 | 0 | 50 | 15 | 252 | 1864
3 | 2078 | 2 | 0 | 28 | 7 | 232 | 1837
4 | 1765 | 11 | 5 | 150 | 17 | 130 | 1602
5 | 2348 | 13 | 1 | 0 | 14 | 322 | 1998
6 | 2337 | 7 | 0 | 0 | 12 | 450 | 1868
7 | 1716 | 1 | 0 | 10 | 9 | 218 | 1488
8 | 1556 | 3 | 0 | 0 | 10 | 189 | 1354
9 | 1193 | 1 | 1 | 0 | 9 | 136 | 1046
10 | 1463 | 4 | 2 | 1 | 11 | 187 | 1259
11 | 1970 | 20 | 0 | 0 | 19 | 305 | 1626
12 | 1438 | 2 | 0 | 0 | 14 | 167 | 1255
13 | 1078 | 0 | 0 | 0 | 8 | 199 | 871
14 | 817 | 3 | 0 | 0 | 5 | 144 | 665
15 | 891 | 0 | 0 | 0 | 3 | 94 | 794
16 | 894 | 4 | 0 | 2 | 5 | 71 | 804
17 | 816 | 14 | 0 | 0 | 7 | 51 | 744
18 | 1055 | 2 | 0 | 0 | 2 | 110 | 941
19 | 901 | 14 | 0 | 0 | 12 | 77 | 798
20 | 629 | 0 | 0 | 0 | 0 | 0 | 629
21 | 475 | 78 | 4 | 0 | 5 | 384 | 4
22 | 527 | 0 | 0 | 0 | 0 | 0 | 527
X | 1661 | 7 | 1 | 0 | 13 | 207 | 1433
Y | 200 | 0 | 0 | 0 | 0 | 0 | 200
Totals | 33 020 | 198 | 15 | 252 | 227 | 4380 | 28 200

(a) The total number of GenBank sequences used by NCBI to construct the chromosome labeled in the first column.

(b) The number of sequences NCBI uses that remain unordered in the UCSC assembly.

(c) The number of sequences assigned to a chromosome by NCBI that UCSC labels as unknown.

(d) Those sequences used by both UCSC and NCBI that are identical, but different accession IDs are given.

(e) Those sequences that NCBI assigns to one chromosome and UCSC assigns to another.

(f) The total number of sequences used in the NCBI chromosome assembly that are not found anywhere in the UCSC assembly.

(g) The total number of sequences that are used in both assemblies.

Figure 1 presents a graphical view of the clone-ordering comparisons. Entries found in both assemblies are drawn as polygons. If the clone occurred in the same orientation on both assemblies and was within a 10% distance threshold, the polygon was drawn in dark green. If the position was outside the distance threshold, it was drawn in light green. Entries that occurred in different orientations were drawn in dark red if they were within a 10% distance threshold, or light red if the positions disagreed by more than 10% of the length of the chromosome. Blue polygons indicate cases where the orientation of an individual entry was unknown in at least one of the assemblies. If the positioning was within 10% on both assemblies, dark blue was used; otherwise light blue was used.
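The coloring rule can be written down compactly. In the sketch below each position is normalized by the length of its own chromosome assembly before the 10% test is applied; that normalization, the `"?"` marker for unknown orientation and the function name are assumptions made for illustration, since the text does not say how positions from the two assemblies were put on a common scale.

```python
def polygon_color(pos_a, pos_b, len_a, len_b, orient_a, orient_b, threshold=0.10):
    """Pick the Figure 1 color for an entry present in both assemblies.

    pos_*: entry position in each assembly; len_*: chromosome length in
    each assembly; orient_*: "+", "-" or "?" (unknown).
    """
    if "?" in (orient_a, orient_b):
        hue = "blue"    # orientation unknown in at least one assembly
    elif orient_a == orient_b:
        hue = "green"   # same orientation in both assemblies
    else:
        hue = "red"     # opposite orientations
    within = abs(pos_a / len_a - pos_b / len_b) <= threshold
    return ("dark " if within else "light ") + hue
```

For instance, an entry at 10 Mb in one assembly and 50 Mb in the other, on a 100 Mb chromosome and in opposite orientations, comes out as "light red".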

Figure 1.

Clone ordering comparisons between NCBI build 26 and the August 6, 2001 Goldenpath assembly. Shown in each of these images is a graph relating the location of clones in the Goldenpath assembly (top) to their location in the NCBI assembly (bottom). In this figure, clone orientation data are included as well. If the orientation of both clones is the same, they are colored green. If they are different, they are colored red. If the orientation is unknown, it is drawn in blue. If the difference between the clone locations on the two assemblies differs by more than 10%, it is drawn in a lighter color.

These graphs show large areas of agreement. The assemblies for chromosomes 20, 21, 22 and Y produced the most consistent results. This is because the NCBI and UCSC assemblies of these chromosomes are based on those constructed at the centers that sequenced them (http://genome.cse.ucsc.edu/goldenPath/limits.html). The centers include the Max Planck Institute and RIKEN (chromosome 21) (21), the Sanger Centre (chromosomes 20 and 22) (22,23) and Washington University (chromosome Y). A large gap exists in the chromosome 21 accession comparison. After further examination, it was determined that the Goldenpath assembly used GenBank entries submitted for chromosome 21 that break the chromosome into 340 kb fragments while the NCBI assembly was based on the actual sequenced clones. Both produced the same sequence assembly.

For the assemblies not originating from the same source (unlike chromosomes 20, 21, 22 and Y), there were problematic regions. For instance, there were portions on chromosomes 13 and 14 where a large number of GenBank entries have been misplaced, resulting in a wide ‘X’ structure. Chromosomes 1, 9 and 16 clearly show regions where there were several entries inconsistently assigned to different chromosomal arms. The centromere was represented by a gap in each sequence.

The orientation of clones was also inconsistent. Table 3 summarizes the orientation of GenBank entries. Of the entries, 92% (25 703 out of 27 948) were consistently ordered within a 10% distance threshold in both assemblies. Only 67% (18 729 out of 27 948) of the entries were consistently oriented in the assemblies, 28% (7859 out of 27 948) were inconsistently oriented and 5% (1360 out of 27 948) had an unknown orientation. These results suggest that ordering clones is a more reliable process than orienting them.

Table 3. GenBank entry orientations.

Chromosome | Consistent orientation (a), within threshold | Consistent orientation (a), outside threshold | Inconsistent orientation (b), within threshold | Inconsistent orientation (b), outside threshold | Unknown orientation (c), within threshold | Unknown orientation (c), outside threshold
1 | 1327 | 269 | 802 | 143 | 37 | 4
2 | 1125 | 37 | 558 | 13 | 78 | 3
3 | 906 | 29 | 701 | 18 | 151 | 4
4 | 690 | 8 | 588 | 2 | 160 | 4
5 | 1008 | 60 | 676 | 40 | 176 | 38
6 | 1562 | 44 | 233 | 12 | 17 | 0
7 | 1229 | 1 | 209 | 0 | 39 | 0
8 | 623 | 123 | 395 | 100 | 81 | 32
9 | 224 | 496 | 111 | 196 | 8 | 11
10 | 856 | 43 | 298 | 14 | 46 | 1
11 | 924 | 2 | 642 | 1 | 55 | 2
12 | 728 | 8 | 385 | 11 | 122 | 1
13 | 728 | 1 | 122 | 0 | 20 | 0
14 | 582 | 36 | 18 | 25 | 4 | 0
15 | 453 | 0 | 313 | 3 | 25 | 0
16 | 351 | 195 | 155 | 81 | 7 | 13
17 | 414 | 5 | 271 | 3 | 39 | 12
18 | 481 | 36 | 324 | 22 | 71 | 7
19 | 589 | 17 | 154 | 7 | 27 | 4
20 | 628 | 0 | 0 | 0 | 1 | 0
21 | 4 | 0 | 0 | 0 | 0 | 0
22 | 527 | 0 | 0 | 0 | 0 | 0
X | 1160 | 2 | 211 | 2 | 56 | 2
Y | 198 | 0 | 0 | 0 | 0 | 2
Totals | 17 317 | 1412 | 7166 | 693 | 1220 | 140

(a) The orientation of a GenBank entry is considered consistent if the entry occurs in the same orientation in both the NCBI and Goldenpath assemblies.

(b) An inconsistent orientation occurs when the orientation of the entry differs between the two assemblies.

(c) In the case that the orientation is marked as unknown in at least one of the assemblies, the entry is marked with an unknown orientation.

The distance threshold used means that the GenBank entry positions must agree within 10% in both assemblies.

In order to determine the role of the sequence quality in determining order and orientation, the data in Table 3 were separated according to whether an individual clone sequence status was draft or finished (data not shown). Slightly more than half of the clone entries (51.7%; 14 459 out of 27 948) shared in the NCBI and Goldenpath assemblies are draft sequences while slightly less than half (48.3%; 13 489 out of 27 948) are finished.

Finished sequences are more likely to be consistently oriented than draft sequences in the two assemblies. 83.1% (11 214 out of 13 489) of all finished sequences are consistently oriented, compared with 52.0% (7515 out of 14 459) of all draft sequences. 14.0% (1888 out of 13 489) of finished sequences are inconsistently oriented while 41.3% (5971 out of 14 459) of the draft sequences show inconsistent orientation. In addition, the orientation for 2.9% (387 out of 13 489) of finished sequence and 6.7% (973 out of 14 459) of draft sequence is unknown.

Finished sequence also tends to be ordered more consistently in the assemblies. 94.1% (12 696 out of 13 489) of all finished sequences are consistently ordered within a given threshold. In comparison, 90.0% (13 007 out of 14 459) of draft sequences maintain a consistent order.

Sequence level

Figure 2 shows the results from the sequence level comparisons. The results appear to be consistent with the ordering and orienting graphs. In addition, these graphs show how repetitive chromosomal regions can be, as indicated by regions that have more than one polygon connecting them. The results for chromosome Y are particularly useful in this respect since the assemblies were identical, with the exception of the variable lengths of gaps.

Figure 2.

Sequence level comparison of NCBI build 26 with the August 2001 Goldenpath assembly. Each of these graphs shows a sequence level comparison between the August 6, 2001 Goldenpath chromosome assembly (top) and the NCBI build 26 chromosomal assembly (bottom). Each exact 1000 base match is drawn with a polygon connecting the locations of the match. If the matches are in the same orientation, they are drawn in green. If the matches are in different orientations, they are drawn in red. Exact 1000 base matches were obtained using MULTI.

Table 4 indicates the number of bases that MULTI matched where the orientation was the same in both assemblies, as well as those where the orientation was different. The MULTI results should be used with caution for two reasons. First, the orientation of almost 5% (1360 out of 27 948) of the GenBank entries used in both assemblies was unknown. Since the rules used to create the NCBI assembly keep such regions in the original orientation, false inconsistencies in the orientation in these regions can be reported. Second, MULTI allows for a single region to match multiple regions. Since the human genome is far from unique, many longer exact repeats will be reported. Over 20% of all sequence level comparisons reported in Table 4 occur between sequence fragments in opposite orientations. Bailey et al. (6) show that 10.6% of the January 2001 Goldenpath assembly consists of regions of >1 kb in length and >98% identity. Even if all of these segmental duplications occurred on the same chromosome and in different orientation, they could not account for 20% of all matched regions. Thus, there was a large amount of inconsistently oriented data between the NCBI and Goldenpath assemblies.
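The figure of just over 20% follows directly from the Table 4 totals, taking the opposite-orientation bases as a share of all matched bases:

```latex
\[
\frac{217\,436\,047}{824\,111\,055 + 217\,436\,047}
  = \frac{217\,436\,047}{1\,041\,547\,102}
  \approx 0.209
\]
```

That is roughly 20.9% of matched bases, about twice the 10.6% attributed to segmental duplications.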

Table 4. Aligned bases using MULTI.

Chromosome | Matching bases, same orientation (a) | Matching bases, different orientation (b)
1 | 61 746 063 | 20 411 421
2 | 68 880 069 | 16 725 895
3 | 46 402 958 | 18 531 575
4 | 39 549 971 | 20 747 599
5 | 40 449 965 | 21 510 030
6 | 62 936 552 | 8 228 667
7 | 53 201 681 | 5 840 056
8 | 37 439 999 | 11 180 941
9 | 34 975 746 | 9 624 503
10 | 42 060 960 | 8 006 401
11 | 37 438 424 | 13 694 342
12 | 36 328 682 | 11 486 143
13 | 35 695 803 | 4 281 521
14 | 32 428 345 | 2 117 332
15 | 19 193 109 | 8 687 356
16 | 17 702 596 | 6 568 383
17 | 15 667 567 | 6 813 709
18 | 20 677 969 | 7 429 676
19 | 12 368 846 | 3 453 858
20 | 23 484 031 | 33 978
21 | 12 994 871 | 2 498
22 | 12 412 454 | 25 480
X | 49 177 058 | 9 246 770
Y | 10 897 336 | 2 787 913
Totals | 824 111 055 | 217 436 047

(a) The number of matching bases for each chromosome where the matches occur in the same orientation.

(b) The number of matching bases for each chromosome where the matches occur in a different orientation.

Length to next major mismatch

In order to help determine the confidence in the assembly of any particular chromosome, we calculated a metric to determine the expected nucleotide length to the next major mismatch between the NCBI and UCSC assemblies. Each matching MULTI block includes beginning and end positions within the NCBI and UCSC assemblies. Consecutive matching blocks were compared to determine whether or not they should be merged together. The nucleotide distance between two consecutive blocks was calculated for both assemblies. The distance for each of these was compared. If the difference was <1 kb, then the two blocks were merged together. Otherwise they were kept separate. Once merging of consecutive blocks was finished, the length of each block was stored. For each chromosome, the percentage of nucleotides with at least 1, 10, 100, 1000, 10 000, 10^5, 10^6, 10^7 and 10^8 bases before the end of a block was calculated. This gave a measure of the agreement between the UCSC and NCBI assemblies.
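The merging rule and the resulting metric can be sketched as follows. Several details here are assumptions made for illustration: blocks are tuples `(a_start, a_end, b_start, b_end)` with half-open coordinates, only the gap-difference rule is applied (contig and orientation changes are not modeled), and a base is counted as having at least L bases before the end of its block when the stretch from that base to the block end, inclusive, is at least L long.

```python
def merge_blocks(blocks, max_gap_difference=1000):
    """Merge consecutive blocks whose gaps in the two assemblies differ by < 1 kb."""
    merged = []
    for block in sorted(blocks):
        if merged:
            a0, a1, b0, b1 = merged[-1]
            gap_a = block[0] - a1  # distance to the next block in assembly A
            gap_b = block[2] - b1  # distance to the next block in assembly B
            if abs(gap_a - gap_b) < max_gap_difference:
                merged[-1] = (a0, block[1], b0, block[3])
                continue
        merged.append(tuple(block))
    return merged


def percent_with_run(block_lengths, min_run):
    """Percentage of bases with at least min_run bases left before their block ends."""
    total = sum(block_lengths)
    covered = sum(max(0, n - min_run + 1) for n in block_lengths)
    return 100.0 * covered / total
```

With this counting convention the value at a run length of 1 is always 100%, which matches the first row of Table 5.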

The results for the chromosomes with the longest length to next mismatch (chromosome 20), shortest length to next mismatch (chromosome 4) and all chromosomes are given in Figure 3. The graphs shown indicate a length to next major mismatch when only draft sequences are considered, when only finished sequences are considered and when both draft and finished sequences are considered. The data used to construct the graphs of Figure 3 are provided in Table 5. Examination of the data in Table 5 indicates that the agreement falls off between 10 and 100 kb when only major mismatches in the draft sequence are considered, which is roughly the size of an individual clone. If only finished data are considered, the length to the next major mismatch increases considerably. In order to illustrate this point further, the N50 consistent fragment lengths were calculated (Table 6). N50 is the length L such that 50% of the sequence lies in contigs of at least L (1). The huge size of the contigs is not surprising. The data in Table 6 illustrate that the N50 lengths, when only finished data are considered, are typically several times the length of the N50 fragments when only draft sequences are considered. This suggests that regions that have been finished are more likely to maintain a consistent ordering between both the NCBI and Goldenpath assemblies. In fact, when all of the locations of major mismatches are considered (data not shown), >94% occur in regions of the assembly involving draft sequence data.
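The N50 definition quoted above translates into a few lines of code. This is a generic sketch of the statistic, applied here to consistent fragment lengths; the function name is hypothetical.

```python
def n50(lengths):
    """Return the length L such that fragments of length >= L hold at least
    half of the total sequence. Returns 0 for an empty input."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


print(n50([2, 2, 2, 3, 3, 4, 8, 8]))  # 8
```

In the example the two fragments of length 8 already account for 16 of the 32 total bases, so the N50 is 8.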

Figure 3.

Length to next major mismatch. Shown are the percentages of nucleotides (y axis) that have a length to the next major mismatch of at least as many nucleotides as specified in the x axis. The results are shown for all chromosomes as well as for chromosome 4 (the worst case) and chromosome 20 (the best case). (A) considers only major mismatches in the draft sequence. (B) considers only major mismatches in finished sequence. (C) considers a major mismatch in either draft or finished sequence. The data used to construct these graphs are provided in Table 5.

Table 5. Length to next major mismatch data.

Length to next mismatch | Chromosome 4, draft | Chromosome 4, finished | Chromosome 4, both | Chromosome 20, draft | Chromosome 20, finished | Chromosome 20, both | All chromosomes, draft | All chromosomes, finished | All chromosomes, both
1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100
10 | 99.97 | 100 | 99.97 | 100 | 100 | 100 | 99.98 | 100 | 99.98
100 | 99.63 | 99.99 | 99.62 | 100 | 100 | 100 | 99.83 | 99.99 | 99.82
1000 | 96.33 | 99.87 | 96.20 | 100 | 99.99 | 99.99 | 98.28 | 99.90 | 98.17
10 000 | 74.78 | 98.79 | 73.75 | 99.98 | 99.86 | 99.86 | 88.19 | 99.04 | 87.50
100 000 | 17.74 | 90.04 | 15.36 | 99.83 | 98.80 | 98.80 | 52.90 | 92.17 | 49.93
1 000 000 | 0 | 45.52 | 0 | 98.33 | 95.23 | 95.23 | 13.99 | 58.66 | 11.06
10 000 000 | 0 | 0 | 0 | 83.31 | 65.19 | 65.19 | 5.10 | 6.91 | 2.36
100 000 000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0

Shown in this table are the data used to construct the graphs in Figure 3. The percentage of nucleotides having a length to the next mismatch at least as large as the value in the first column is given for chromosomes 4 and 20 as well as for all of the chromosomes combined. The length to the next major mismatch is computed for three different cases: (i) only major mismatches in draft sequence are considered, (ii) only major mismatches in finished sequence are considered and (iii) major mismatches in either draft or finished sequence are considered.

Table 6. N50 consistent fragment lengths.

Chromosome | Draft | Finished | Overall
1 | 163 004 | 4 968 219 | 157 461
2 | 330 037 | 1 551 511 | 274 931
3 | 65 344 | 5 469 316 | 63 923
4 | 75 390 | 2 119 913 | 70 698
5 | 129 487 | 1 901 406 | 107 025
6 | 1 215 114 | 12 291 177 | 1 195 745
7 | 876 374 | 3 392 985 | 717 256
8 | 183 655 | 3 710 225 | 175 365
9 | 397 958 | 3 623 289 | 371 456
10 | 327 708 | 2 537 971 | 299 579
11 | 105 105 | 8 973 281 | 101 197
12 | 217 060 | 3 248 967 | 198 638
13 | 1 079 919 | 4 031 084 | 1 014 696
14 | 1 679 933 | 10 101 338 | 1 678 683
15 | 153 097 | 9 558 682 | 148 769
16 | 162 234 | 1 622 063 | 148 710
17 | 214 317 | 1 773 603 | 181 633
18 | 98 052 | 6 360 442 | 97 657
19 | 399 676 | 1 075 339 | 302 961
20 | 59 923 448 | 33 270 000 | 33 270 000
21 | 33 446 726 | 4 243 300 | 4 243 300
22 | 34 835 520 | 26 646 801 | 26 646 801
X | 719 320 | 1 422 962 | 479 655
Y | 33 924 847 | 990 110 | 990 110

Shown in this table are the N50 consistent fragment lengths for major mismatches occurring in draft sequence, finished sequence, and either draft or finished sequence. All of the clones used to construct sequence data for chromosomes 20, 21, 22 and Y are finished clones, and thus the N50 value for the draft sequence represents the length of the whole chromosome.
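The N50 values in Table 6 can be illustrated with a short Python sketch. It assumes the usual N50 definition: the largest length L such that fragments of length at least L together contain at least half of all nucleotides. The function name `n50` and the example lengths are hypothetical, and the code is not the authors' implementation.

```python
from typing import Iterable


def n50(fragment_lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of consistent fragment lengths.

    Assumption: N50 is the largest length L such that fragments of
    length >= L cover at least 50% of the total nucleotides.
    """
    lengths = sorted(fragment_lengths, reverse=True)
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        # Stop at the first fragment that brings coverage to half or more.
        if 2 * running >= total:
            return length
    return 0  # no fragments supplied


# Hypothetical fragment lengths, for illustration only.
print(n50([2, 2, 2, 3, 3, 4, 8, 8]))

# A chromosome with no major mismatches forms a single fragment, so its
# N50 equals the chromosome length (compare the draft column for
# chromosomes 20, 21, 22 and Y).
print(n50([59_923_448]))
```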

Major mismatches can result from differences in gap length, repetitive regions, and assembly errors or discrepancies. To illustrate these differences, dot plots of the NCBI versus UCSC assembly for chromosomes 5 and Y are shown in Figure 4A and B. The high degree of agreement on chromosome Y arises because both assemblies are the same. Matches on chromosome Y that lie off the major diagonal represent repetitive regions; each of these is labeled as a mismatch in the earlier analysis. The dot plot for chromosome 5 indicates a larger assembly discrepancy between UCSC and NCBI. Figure 4C shows a dot plot of the UCSC assembly against itself for comparison. Figure 4A and C together show a number of potential large-scale repeats on chromosome 5, although many of these may be due to errors in the UCSC assembly. In addition, a comparison of the major diagonals of Figure 4A and C shows quite a few breaks in agreement between the UCSC and NCBI assemblies, suggesting that care must be taken when analyzing information derived from assembly data.

Figure 4.

Chromosome dot plots. (A) and (B) show dot plots resulting from a MULTI alignment of the NCBI assembly (x axis) to the UCSC Goldenpath assembly (y axis). Shown are the results for chromosome 5 (A) and chromosome Y (B). (C) represents the dot plot from a MULTI self alignment of the chromosome 5 UCSC Goldenpath assembly.

DISCUSSION

Which assembly is better?

Due to the substantial discrepancies demonstrated between the NCBI and Goldenpath assemblies, it is important to understand the strengths and limitations of each in order to perform meaningful studies on genomic regions. Without knowing the actual correct assembly, it is difficult to determine which produces the correct results. However, confirmed mapping data can be used to measure the correctness of the two assemblies. Recent articles have discussed assembly comparisons within different regions (24,25).

Christian et al. (24) studied the order and orientation of 50 STS markers with multiple sources of mapping data anchored in a 15-Mb region of chromosome 13 linked to bipolar disorder and schizophrenia. In their analysis of these markers within the NCBI and UCSC Goldenpath assemblies, they show that the NCBI assembly (dated September 2001) orders the STS markers more accurately (50 out of 50 in the correct order for NCBI versus 40 out of 50 for UCSC Goldenpath), while the Goldenpath assembly (April 2001) is more accurate in determining the orientation. In addition, one STS is duplicated in the NCBI assembly while another is incorrectly assigned to a different chromosome in the Goldenpath assembly.

Semple et al. (25) examined six separate assemblies of a region on human chromosome 4p. Included among those studied were the April 2001 NCBI assembly and the April 2001 UCSC Goldenpath assembly. A total of 107 accurately ordered markers were compared. Both the NCBI and UCSC assemblies include the duplication of three markers. Eight markers are deleted in the NCBI assembly while ten are not found in the UCSC data. Thirteen markers are rearranged in the NCBI data set while five are rearranged in the Goldenpath. Additionally, a total of 15 Mb that did not belong to this region was removed from the original April 2001 NCBI data set. Had this sequence remained, the rate of misassemblies for the NCBI assembly would have been much greater.

Based on the descriptions of the respective assembly protocols (discussed in the Introduction), the NCBI and UCSC assemblies appear to differ in a basic way. The NCBI assembly seems to rely heavily on the individual sequencing centers and their annotation for clone order and orientation, whereas the UCSC assembly appears to use more independent verification for clone placement.

Kent and Haussler (12) tested their GigAssembler program (used to create the Goldenpath assembly) using different artificial data sets. Their simulations demonstrate that, on average, ∼10% of all fragments are oriented incorrectly while ∼13–15% are ordered incorrectly. They also suggest on their web site (http://genome.cse.ucsc.edu/FAQ.html#123) that the NCBI assembly shows slightly better local order and orientation when compared with the UCSC assembly. However, the NCBI assembly has somewhat worse tracking of chromosomal level maps.

The results of these independent studies suggest that neither of the assemblies is superior to the other. The study by Christian et al. (24) indicates the NCBI assembly may be more accurate in determining order, while the UCSC Goldenpath is more accurate in determining orientation. However, this study only looked at 50 markers within a small region of the genome and, therefore, cannot be reliably extrapolated to the genome as a whole. Whole genome assembly is a daunting task. It is safe to say that, no matter which assembly is studied, errors in order and orientation are likely to exist. In fact, it is possible that errors in the order and orientation of clones used in the NCBI and UCSC assemblies are consistent with each other, but not with the actual sequence.

Use of genome assembly information

The availability of public assemblies of the human genome has opened the door to a vast amount of research. However, as we have demonstrated, this information should be used carefully at this time since it is not a perfect product. Assembled regions containing only finished clones are more likely to maintain consistent ordering and orientation than those that also contain draft clones.

Large-scale genomic duplications within the genome present many challenges to assembly (6). Thus, it may be important to understand whether or not an assembled region being studied contains duplications. Tandem repeats pose similar problems, especially if they are under-represented in the clone libraries. If it is known that the region of interest contains tandem repeats, current assemblies should not be relied upon to determine the exact copy number accurately. This can greatly affect the distance between two markers, and skew statistics based on physical distance.

The location of gaps within the assemblies should also be kept in mind during analysis. While the length of a gap can be estimated, it cannot be determined exactly until the gap is closed. This is especially important if a feature of interest lies near the beginning or end of a contig, or if it bridges two different contigs. In the latter case, inconsistent orientation between the two contigs could cause problems.

The best method to determine whether or not the region of interest is consistent is to compare it to a map of markers that has been confirmed from multiple sources. However, this ensures only the correct order and orientation of the markers themselves; sequences falling between markers may still be incorrectly ordered or oriented.
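One way to quantify agreement between an assembly and a confirmed marker map is sketched below in Python. The sketch is not the method used by Christian et al. (24) or Semple et al. (25). It substitutes a simple longest-increasing-subsequence count, taking the larger of the forward and reversed orderings so that a region assembled in the opposite orientation is not penalized. The function names and marker labels are hypothetical.

```python
from bisect import bisect_left
from typing import List, Sequence


def _lis_length(values: Sequence[int]) -> int:
    """Length of the longest strictly increasing subsequence."""
    tails: List[int] = []
    for v in values:
        pos = bisect_left(tails, v)
        if pos == len(tails):
            tails.append(v)
        else:
            tails[pos] = v
    return len(tails)


def markers_in_consistent_order(reference: Sequence[str],
                                assembly: Sequence[str]) -> int:
    """Largest number of markers whose assembly order agrees with the map.

    Markers absent from the reference map are ignored. The whole region
    may be inverted, so both directions are tried.
    """
    rank = {marker: i for i, marker in enumerate(reference)}
    ranks = [rank[m] for m in assembly if m in rank]
    forward = _lis_length(ranks)
    backward = _lis_length([-r for r in ranks])
    return max(forward, backward)


# Hypothetical marker labels, for illustration only.
confirmed_map = ["A", "B", "C", "D", "E"]
assembled = ["A", "C", "B", "D", "E"]  # B and C are swapped
print(markers_in_consistent_order(confirmed_map, assembled))
```

As noted above, such a count says nothing about the sequence lying between markers; it only summarizes how many markers can be placed in an order compatible with the confirmed map.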

The results of our studies show greater consistency among the assemblies when only finished contigs are considered. As more finished data become available, the various assemblies seem to converge. This is most visible for chromosomes 20, 21, 22 and Y in Figure 2. However, at least 50% of the human genome is still available only at the rough-draft level. Incorporation of the rough-draft sequence data introduces more inconsistencies in both ordering and orientation. Deloukas et al. (23) compare a rough-draft assembly for chromosome 20 with the finished sequence and illustrate how errors were introduced at the rough-draft level.

The public effort to provide a single human genome assembly has been coordinated as of the December 22, 2001 UCSC Goldenpath assembly (released in February 2002) and GenBank build 28 (see http://genome.ucsc.edu/ for more details). Hopefully this collaboration on a single assembly will help to correct the ordering and orientation inconsistencies seen through our analysis with previous assemblies.

ACKNOWLEDGEMENTS

We would like to thank Drs Zhengyan Kan and Thomas Blackwell for helpful insight and review of this article. This work was supported in part through grants from the National Institutes of Health (HG-01-01391), the Department of Energy (DE-FG02-94ER61910) and the Merck Foundation for Genome Research (grant 225). The assembly comparison graphs are PNG files created using the GD.pm Perl module.

REFERENCES

  • 1. International Human Genome Sequencing Consortium (2001) Initial sequencing and analysis of the human genome. Nature, 409, 860–921.
  • 2. Venter,J.C., Adams,M.D., Myers,E.W., Li,P.W., Mural,R.J., Sutton,G.G., Smith,H.O., Yandell,M., Evans,C.A., Holt,R.A. et al. (2001) The sequence of the human genome. Science, 291, 1304–1351.
  • 3. Smit,A.F. (1999) Interspersed repeats and other mementos of transposable elements in mammalian genomes. Curr. Opin. Genet. Dev., 9, 657–663.
  • 4. Jurka,J. (1998) Repeats in genomic DNA: mining and meaning. Curr. Opin. Struct. Biol., 8, 333–337.
  • 5. Lynch,M. and Conery,J.S. (2000) The evolutionary fate and consequences of duplicate genes. Science, 290, 1151–1155.
  • 6. Bailey,J.A., Yavor,A.M., Massa,H.F., Trask,B.J. and Eichler,E.E. (2001) Segmental duplications: organization and impact within the current human genome project assembly. Genome Res., 11, 1005–1017.
  • 7. Kan,Z., Gish,W., Rouchka,E., Glasscock,J. and States,D. (2000) UTR reconstruction and analysis using genomically aligned EST sequences. ISMB, 8, 218–227.
  • 8. Blackwell,T.W., Rouchka,E.C. and States,D.J. (1999) Identity by descent genome segmentation based on single nucleotide polymorphism distributions. ISMB, 7, 54–59.
  • 9. Burge,C.B. and Karlin,S. (1998) Finding the genes in genomic DNA. Curr. Opin. Struct. Biol., 8, 346–354.
  • 10. Liang,F., Holt,I., Pertea,G., Karamycheva,S., Salzberg,S.L. and Quackenbush,J. (2000) Gene index analysis of the human genome estimates approximately 120,000 genes. Nature Genet., 25, 239–240.
  • 11. Wolfsberg,T.G. and Landsman,D. (1997) A comparison of expressed sequence tags (ESTs) to human genomic sequences. Nucleic Acids Res., 25, 1626–1632.
  • 12. Kent,W.J. and Haussler,D. (2001) Assembly of the working draft of the human genome with GigAssembler. Genome Res., 11, 1541–1548.
  • 13. Benson,D.A., Karsch-Mizrachi,I., Lipman,D.J., Ostell,J., Rapp,B.A. and Wheeler,D.L. (2000) GenBank. Nucleic Acids Res., 28, 10–14.
  • 14. Altschul,S.F., Gish,W., Miller,W., Myers,E.W. and Lipman,D.J. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403–410.
  • 15. Korf,I. and Gish,W. (2000) MPBLAST: improved BLAST performance with multiplexed queries. Bioinformatics, 16, 1052–1053.
  • 16. Claverie,J.-M. and States,D. (1993) Information enhancement methods for large scale sequence analysis. Comput. Chem., 17, 191–201.
  • 17. Zhang,Z., Schwartz,S., Wagner,L. and Miller,W. (2000) A greedy algorithm for aligning DNA sequences. J. Comput. Biol., 7, 203–214.
  • 18. Ning,Z., Cox,A.J. and Mullikin,J.C. (2001) SSAHA: a fast search method for large DNA databases. Genome Res., 11, 1725–1729.
  • 19. Delcher,A.L., Phillippy,A., Carlton,J. and Salzberg,S.L. (2002) Fast algorithms for large-scale genome alignment and comparison. Nucleic Acids Res., 30, 2478–2483.
  • 20. Kent,W.J. (2002) BLAT—the BLAST-like alignment tool. Genome Res., 12, 656–664.
  • 21. Hattori,M., Fujiyama,A., Taylor,T.D., Watanabe,H., Yada,T., Park,H.S., Toyoda,A., Ishii,K., Totoki,Y., Choi,D.K. et al. (2000) The DNA sequence of human chromosome 21. Nature, 405, 311–319.
  • 22. Dunham,I., Shimizu,N., Roe,B.A., Chissoe,S., Hunt,A.R., Collins,J.E., Bruskiewich,R., Beare,D.M., Clamp,M., Smink,L.J. et al. (1999) The DNA sequence of human chromosome 22. Nature, 402, 489–495.
  • 23. Deloukas,P., Matthews,L.H., Ashurst,J., Burton,J., Gilbert,J.G., Jones,M., Stavrides,G., Almeida,J.P., Babbage,A.K., Bagguley,C.L. et al. (2001) The DNA sequence and comparative analysis of human chromosome 20. Nature, 414, 865–871.
  • 24. Christian,S.L., McDonough,J., Liu,C.-Y., Shaikh,S., Vlamakis,V., Badner,J.A., Chakravarti,A. and Gershon,E.S. (2002) An evaluation of the assembly of an approximately 15-Mb region on human chromosome 13q32-q33 linked to bipolar disorder and schizophrenia. Genomics, 79, 635–656.
  • 25. Semple,C.A.M., Morris,S.W., Porteous,D.J. and Evans,K.L. (2002) Computational comparison of human genomic sequence assemblies for a region of chromosome 4. Genome Res., 12, 424–429.

Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press
