Abstract
The Human Genome Project has generated extensive map and sequence data for a large number of Bacterial Artificial Chromosome (BAC) clones. In order to maximize the efficient use of the data and to minimize the redundant work for the research community, The Institute for Genomic Research (TIGR) comprehensive BAC resource (cBACr) (http://www.tigr.org/tdb/BacResource/BAC_resource_intro.html) was built as an expansion of the TIGR human BAC ends database. This resource collects, integrates and reports the information on library, maps, sequence, annotation and functions for each human and mouse BAC. The current database contains 635 016 human BACs and 265 617 mouse BACs that were characterized by various approaches, among which 22 705 human clones and 1000 mouse clones have sequence and annotation data.
INTRODUCTION
Bacterial Artificial Chromosome (BAC) clones are a standard clone type for large-scale genomic projects because of their high stability and large insert sizes (80–300 kb) (1–3). Indeed, BAC mapping and sequencing have allowed the International Human Genome Sequencing Consortium to complete a working draft of the human genome (http://www.nhgri.nih.gov/NEWS/sequencing_consortium.html). GenBank currently contains ∼23 000 human sequence records for BACs from the RPCI-11 (4) and CalTech human BAC libraries. Additionally, large-scale mapping projects (Table S1 in Supplementary Material) have characterized over 630 000 human BACs by approaches such as BAC end sequencing (5–7), fingerprinting (8,9), radiation hybrid (RH) mapping, fluorescence in situ hybridization (FISH) (10) and marker screening (11).
Mice and humans share many of the same fundamental biological and behavioral processes. Consequently, the mouse is an important model organism, and the National Institutes of Health (NIH) is funding the mapping and sequencing of the mouse genome (http://www.nhgri.nih.gov/NEWS/MouseRelease.htm). Currently, the mouse BAC library RPCI-23 (12) is being end sequenced at TIGR and fingerprinted at the British Columbia Genome Sequencing Center (BCGSC) in Canada. A low-resolution map comprising 4500 markers is being generated by several groups using RPCI-23 clones (Table S2 in Supplementary Material). GenBank contains ∼20 Mb of finished sequence and 180 Mb of draft sequence for the mouse.
To make efficient use of this information, the TIGR comprehensive BAC resource (cBACr) database was formed. This database extends the TIGR human BAC ends database to include BAC ends and other mapping data, BAC sequences and annotation for both human and mouse BACs. The cBACr presents a complete picture of the library, maps, sequence and annotation for each BAC clone. The database is described at http://www.tigr.org/tdb/BacResource/BAC_resource_intro.html.
DATA COLLECTION, INTEGRATION AND CURATION IN THE cBACr
The cBACr database contains 635 016 human BACs and 265 617 mouse BACs, each with the integrated information summarized below. The information is stored in a Sybase relational database and is updated at least once every two weeks to incorporate any newly generated data.
Library
The RPCI-11 and CalTech human BAC libraries have been described (3–4). The mouse genome project currently relies mostly on the library RPCI-23 (12) (http://www.chori.org/bacpac/23framefmouse.htm). This library contains EcoRI/EcoRI methylase partially digested male C57BL/6J mouse DNA in the pBACe3.6 cloning vector. It contains a total of 170 634 clones with an average insert size of 197.5 kb (an overall coverage of 11.2X). Other mouse libraries have also been constructed (e.g. RPCI-24, RPCI-22, CalTech CitbCJ7) (Table S3 in Supplementary Material). The cBACr stores information such as source DNA, cloning vector, average insert size and coverage for these libraries.
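The quoted coverage can be checked with a short calculation. The sketch below is illustrative only: the genome size of about 3.0 Gb is an assumption that is not stated in the text, and the function and constant names are hypothetical.

```python
# Library figures quoted in the text for RPCI-23.
CLONES = 170_634
AVG_INSERT_KB = 197.5

# Assumption (not given in the text): a haploid mouse genome of about 3.0 Gb.
ASSUMED_GENOME_KB = 3_000_000


def library_coverage(clones, avg_insert_kb, genome_kb):
    """Return fold coverage: total cloned DNA divided by genome size."""
    return clones * avg_insert_kb / genome_kb


coverage = library_coverage(CLONES, AVG_INSERT_KB, ASSUMED_GENOME_KB)
print(f"{coverage:.1f}X")
```

Under that assumed genome size the result rounds to the 11.2X reported for the library; a different genome size estimate would shift the figure proportionally.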
Mapping
BAC ends. The cBACr contains 775 726 human BAC end sequences (BES) (3) and 238 143 mouse BES. The mouse BES were generated by TIGR and are from 134 404 RPCI-23 clones (Table 1). The mouse end sequencing is conducted on 3700 machines that ensure >98% of BES are associated with the correct clones, which is essential for genome assembly projects and for retrieving clones from the library based on sequence matches. TIGR is finishing the entire RPCI-23 library and will end sequence additional BACs from another library, RPCI-24. Besides those from TIGR, 493 BES from 350 clones have been submitted to GenBank by other groups. The cBACr is updated daily to include newly generated BES, and the BES are annotated for their content of repeats, ESTs and STS markers.
Table 1. TIGR RPCI-23 end sequencing effort.
BESa | Clones with BESb | Readsc | Q20 basesd | % Pairse | Tracking accuracyf |
238 143 | 134 404 | 462 bp | 382 bp | 77% | 98% |
aTotal non-redundant BAC end sequences (BES).
bTotal clones with either one end or both ends.
cRead length after vector- and quality trimming.
dThe number of bases with phred quality values ≥20 after vector- and quality trimming.
eThe percentage of clones with paired-ends.
fThe percentage of BES associated with the correct clones.
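The paired-end percentage in Table 1 follows from the first two columns. The sketch below assumes that each clone contributes at most two non-redundant BES (one per insert end); that assumption, and the function name, are not from the original text.

```python
def paired_end_breakdown(total_bes, clones_with_bes):
    """Split clones into (paired, single-ended).

    Assumes every clone has either one or two BES, so that
    total_bes = 2 * paired + single and clones_with_bes = paired + single.
    """
    paired = total_bes - clones_with_bes
    single = clones_with_bes - paired
    if paired < 0 or single < 0:
        raise ValueError("counts are inconsistent with 1-2 BES per clone")
    return paired, single


# Figures from Table 1.
TOTAL_BES = 238_143
CLONES_WITH_BES = 134_404

paired, single = paired_end_breakdown(TOTAL_BES, CLONES_WITH_BES)
pct_paired = 100 * paired / CLONES_WITH_BES
print(paired, single, f"{pct_paired:.0f}%")
```

With the Table 1 totals this gives 103 739 paired clones and 30 665 single-ended clones, which agrees with the 77% reported in the table.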
Fingerprints. Fingerprint data are used to construct BAC contig maps using FPC (fingerprinted contigs) software. As of September 26, 2000, the human fingerprints resource generated by WUGSC (Washington University Genome Sequencing Center) contains 369 610 clones, 1452 contigs and 69 507 markers (12 776 anchored) (http://genome.wustl.edu/gsc/human/human_database.shtml). Based on the data, genome wide BAC clone maps have been constructed and include 331 783 clones representing 97–98% of the genome (http://genome.wustl.edu/gsc/human/Mapping/index.shtml). The mouse fingerprints resource is being generated at BCGSC (http://www.bcgsc.bc.ca/projects/mouse_mapping/) and contains 129 442 RPCI-23 clones, 14 147 contigs and 9770 markers as of September 28, 2000. For each clone, information such as the insert size calculated from the bands, contig maps, overlapping clones, markers and chromosome locations are extracted from the FPC database every two weeks.
FISH. FISH maps BACs to the specific cytogenetic bands and the resource is useful in studies such as identifying the genes affected by chromosomal rearrangements that cause genetic disease and cancer. The FISH data are from two groups. The FISH BAC resource at University of Washington (UW) contains 7615 human clones (http://www.biotech.washington.edu/bacresource/index.shtml) and the BAC resource at Cedars-Sinai Medical Center (CSMC) contains 963 CalTech human clones (10) and 153 CitbCJ7 mouse clones (13) (http://www.csmc.edu/genetics/korenberg/korenberg.htm). The FISH bands of these clones are stored in the cBACr database and the data are updated every two weeks to incorporate the new clones.
RH. RH mapping places BACs at precise locations of the chromosomes. To date, a total of 29 120 BES generated by TIGR have been mapped at the Stanford genome center (SHGC) (http://shgc-www.stanford.edu/Mapping/index.html) and 10 452 STSs have been submitted to GenBank, which has allowed chromosome assignments for 8331 RPCI-11 clones. Besides the chromosome assignments, the RH maps of these clones will be incorporated into the cBACr database once available.
Marker screening. The cBACr contains mapping information for over 14 000 human clones and over 5000 mouse clones obtained by STS or EST marker hybridization by several groups. The CalTech BAC resource contains 3000 clones mapped using cancer-related gene markers (http://informa.bio.caltech.edu/idx_www_tree.html). The GenMapDB at the University of Pennsylvania (UPenn) contains 2454 mapped RPCI-11 clones (http://genomics.med.upenn.edu/genmapdb/). The BACPAC resource at Roswell Park Cancer Institute (RPCI) contains 8307 RPCI-11 human clones (http://bacpac.med.buffalo.edu/human/overview.html) and 3322 RPCI-23 mouse clones (http://bacpac.med.buffalo.edu/mouse/overview.html). Baylor College of Medicine (BCM) has placed 1838 RPCI-23 BACs on chromosome 11 using existing and BES-derived markers (http://www.mouse-genome.bcm.tmc.edu/BACMAPPING/BAC_Index.asp). A number of RPCI-23 clones were also mapped by the Albert Einstein College of Medicine genome center (AECOM) (http://sequence.aecom.yu.edu/mouseDB/mousePUB/mouse_welcome_all.hts). The cBACr is updated every two weeks to incorporate newly mapped clones from these sites.
Sequences
Complete BAC sequences were collected from GenBank (ftp://ncbi.nlm.nih.gov/genbank/genomes/) and deposited into the cBACr database (Tables 2 and 3). GenBank sequences represent all phases of the sequencing projects. Phase 0 is survey sequence (~0.5–1× coverage), phase 1 is unordered contigs (each >2 kb), phase 2 is ordered contigs (each >2 kb) with gaps and phase 3 is finished sequences. These sequences are being assembled and ordered at University of California Santa Cruz (UCSC) for the human genome (GoldenPath, http://genome.ucsc.edu/). The ordered and assembled sequences for the underlying BACs used for the assembly will be extracted from the assembled genome and integrated into the cBACr database. The sequences are updated every two weeks.
Table 2. Human genomic sequences in GenBank.
Phase | RPCI-11 clones | CalTech clones |
0 | 2863 | 144 |
1 | 16 035 | 1012 |
2 | 410 | 853 |
3 | 1787 | 716 |
Table 3. Mouse genomic sequences in GenBank.
Phase | Total records | RPCI-23 | RPCI-22 | CitbCJ7 |
0 | 154 | 93 | 7 | 2 |
1 | 1035 | 914 | 39 | 4 |
2 | 153 | 115 | 17 | |
3 | 131 | 11 | 13 |
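The phase definitions given in the Sequences section and the human counts in Table 2 can be tallied as follows. This is only an illustrative sketch: the dictionary layout and variable names are hypothetical, and the counts are the Table 2 snapshot rather than live GenBank figures.

```python
# Sequencing phases as defined in the text.
PHASES = {
    0: "survey sequence",
    1: "unordered contigs",
    2: "ordered contigs with gaps",
    3: "finished sequence",
}

# Table 2: phase -> (RPCI-11 clones, CalTech clones).
TABLE2 = {
    0: (2863, 144),
    1: (16035, 1012),
    2: (410, 853),
    3: (1787, 716),
}

# Total human clone records per phase, across both libraries.
per_phase = {phase: rpci11 + caltech for phase, (rpci11, caltech) in TABLE2.items()}
total = sum(per_phase.values())

for phase, count in per_phase.items():
    share = 100 * count / total
    print(f"phase {phase} ({PHASES[phase]}): {count} clones, {share:.1f}%")
```

In this snapshot most human records are phase 1 (unordered contigs), and finished phase 3 sequence accounts for roughly a tenth of the clones listed.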
Annotation
The sequences submitted to GenBank are minimally annotated for repeats, genes, STS markers and sequence variations such as SNPs. These data will be extracted and incorporated into the cBACr database. Meanwhile, BAC sequences will go through an automated annotation process at TIGR. Sequences will be repeat-masked by MaskerAid (http://sapiens.wustl.edu/maskeraid), an accelerated version of RepeatMasker in which cross_match is replaced by WU-BLAST. The sequences will then be searched against the EST, protein and STS databases and the results will be stored in the cBACr database.
Electronically mapped clones
A total of 9140 BACs have been placed onto the human genome sequence map by matching their paired-end sequences to contig sequences of ≥600 kb (http://www.tigr.org/tdb/BacResource/SearchMapping.html). The criteria for placing a BAC on the sequence map are: (i) identity of ≥95%; (ii) match length of ≥75% of the BAC end sequence length; (iii) paired ends pointing towards each other; and (iv) insert size between 1 and 400 kb (Fig. S1a and b in Supplementary Material). Once a BAC is mapped to the genome sequence, its insert sequence can be inferred. The validity of this mapping approach is indicated by four tests: (i) among 116 clones overlapping with the UW FISH resource, 114 have the same chromosome assignments; (ii) among 55 clones overlapping with the CalTech BAC resource, 55 were assigned to the same chromosomes; (iii) among 20 clones overlapping with the GenMapDB resource, 17 have the same chromosome assignments; (iv) among 414 clones overlapping with the Stanford RH resource, 396 have the same chromosome assignments. For the mouse, this approach has placed 49 RPCI-23 clones on chromosomes 6, 14 and 17. This mapping process is repeated every two weeks to incorporate new contigs into the cBACr database.
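The four placement criteria can be expressed as a simple filter over a pair of end-sequence matches. The sketch below is not the TIGR implementation: the `EndHit` record, its field names and the `place_bac` function are hypothetical, the match length is approximated by the span on the contig (alignment gaps are ignored), and both ends are assumed to have been matched to the same contig.

```python
from dataclasses import dataclass


@dataclass
class EndHit:
    """One BAC end sequence matched to a contig (hypothetical record)."""
    contig: str
    start: int       # 1-based start of the match on the contig
    end: int         # inclusive end of the match, end >= start
    strand: str      # '+' or '-'
    identity: float  # percent identity of the match
    read_len: int    # length of the BAC end sequence in bp

    @property
    def match_len(self):
        return self.end - self.start + 1


def hit_ok(hit):
    # Criteria (i) and (ii): identity >= 95% and match >= 75% of the read.
    return hit.identity >= 95.0 and hit.match_len >= 0.75 * hit.read_len


def place_bac(a, b, min_insert=1_000, max_insert=400_000):
    """Return the inferred insert size in bp, or None if the pair is rejected."""
    if a.contig != b.contig or not (hit_ok(a) and hit_ok(b)):
        return None
    # Criterion (iii): one end on each strand, facing each other.
    if {a.strand, b.strand} != {"+", "-"}:
        return None
    plus, minus = (a, b) if a.strand == "+" else (b, a)
    if plus.start >= minus.start:
        return None  # ends point away from each other
    # Criterion (iv): insert size between 1 and 400 kb.
    insert = minus.end - plus.start + 1
    return insert if min_insert <= insert <= max_insert else None


left = EndHit("contigA", 10_000, 10_450, "+", 99.0, 462)
right = EndHit("contigA", 180_000, 180_400, "-", 98.5, 462)
print(place_bac(left, right))
```

A pair that passes all four checks yields an insert size, from which the corresponding stretch of contig sequence can be taken as the inferred BAC insert; any failed check leaves the clone unplaced.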
DATA PRESENTATION AND AVAILABILITY
Data presentation
Each BAC has a clone report page (http://www.tigr.org/tdb/BacResource/BAC_Clone_Search.html) that presents information on library, maps, sequence, annotation and function studies, with links to the sites producing the resources and to other related sites. The library section reports clone availability, construction group, DNA source, cloning vector, average insert size, numbers of segments, plates and clones, and coverage. The map section includes: (i) BAC ends data as described (3); (ii) fingerprint data such as image, band sizes, markers, chromosome locations and contigs (graphical displays of the overlapping clones and markers); (iii) FISH bands; (iv) marker screening data such as marker names, accessions and chromosome locations; (v) RH data such as STS accessions, BAC end sequence names and accessions; (vi) computationally mapped data such as contig name and length, match identity and length, insert size, chromosome location and a graphical display of the matches; (vii) locations on genome-wide maps such as UCSC’s assembled sequence map and the WUGSC BAC clone and accession maps. The sequence section includes phase, GenBank accession, map and sequence. The annotation section reports the content of repeats, STSs, ESTs and protein coding regions. Lastly, clones are flagged in any of the following cases: discrepant data reported by different groups; mapping to different locations; BAC end sequences not found in the BAC sequence of the same clone.
Data availability
The comprehensive BAC resource is freely available to the public via search and FTP services on TIGR’s website. Users can search the database using a variety of parameters, including query sequences of any length and clone names. The sequence match results are displayed graphically and the sequence alignments are shown. Each matching clone is linked to the clone report page described above. The entire database can be downloaded by FTP, and multi-FASTA files of sequences, repeat-masked sequences and headers only are available. The FASTA headers contain all the information presented on the clone report page except the sequences. The database and related sites can be found at http://www.tigr.org/tdb/BacResource/BAC_resource_intro.html.
SUPPLEMENTARY MATERIAL
Supplementary Material is available at NAR Online including comprehensive BAC resource URLs, paired ends matching to human chromosomes 21 and 22 and source data URLs.
ACKNOWLEDGEMENTS
I am grateful to every individual at TIGR and other centers generating the data; to Michael Heaney, Susan Lo, Mark Sengamalay and other members at TIGR for the database support; to John Heidelberg, Herve Tettelin, Malcolm Gardner, John Quackenbush for their critical comments on the manuscript.
References
- 1. Shizuya,H., Birren,B., Kim,U.J., Mancino,V., Slepak,T., Tachiiri,Y. and Simon,M. (1992) Cloning and stable maintenance of 300-kilobase-pair fragments of human DNA in Escherichia coli using an F-factor-based vector. Proc. Natl Acad. Sci. USA, 89, 8794–8797.
- 2. Kim,U.J., Birren,B., Sheng,Y.L., Slepak,T., Mancino,V., Boysen,C., Kang,H.L., Simon,M.I. and Shizuya,H. (1996) Construction and characterization of a human Bacterial Artificial Chromosome library. Genomics, 34, 213–218.
- 3. Kim,U.J., Shizuya,H., Kang,H.-L. et al. (1996) A bacterial artificial chromosome-based framework contig map of human chromosome 22q. Proc. Natl Acad. Sci. USA, 93, 6297–6301.
- 4. Osoegawa,K., Woon,P.Y., Zhao,B., Frengen,E., Tateno,M., Catanese,J.J. and de Jong,P.J. (1998) An improved approach for construction of bacterial artificial chromosome libraries. Genomics, 52, 1–8.
- 5. Zhao,S. (2000) Human BAC Ends. Nucleic Acids Res., 28, 129–132.
- 6. Zhao,S., Malek,J., Mahairas,G., Fu,L., Nierman,W., Venter,J.C. and Adams,M.D. (2000) Human BAC Ends Quality Assessment and Sequence Analyses. Genomics, 63, 321–332.
- 7. Mahairas,G., Wallace,J., Smith,K., Swartzell,S., Holzman,T., Keller,A., Shaker,R., Furlong,J., Young,J., Zhao,S., Adams,M. and Hood,L. (1999) Sequence Tagged Connectors: A Sequence Approach to Mapping and Scanning the Human Genome. Proc. Natl Acad. Sci. USA, 96, 9739–9744.
- 8. Marra,M., Kucaba,T., Dietrich,N. et al. (1997) High throughput fingerprint analysis of large-insert clones. Genome Res., 7, 1072–1084.
- 9. Marra,M., Kucaba,T., Sekhon,M., Hillier,L., Martienssen,R., Chinwalla,A., Crockett,J., Fedele,J., Grover,H., Gund,C. et al. (1999) A map for sequence analysis of the Arabidopsis thaliana genome. Nature Genet., 22, 265–270.
- 10. Korenberg,J.R., Chen,X.N., Sun,Z., Shi,Z.Y., Ma,S., Vataru,E., Yimlamai,D., Weissenbach,J.S., Shizuya,H., Simon,M.I., Gerety,S.S., Nguyen,H., Zemsteva,I.S., Hui,L., Silva,J., Wu,X., Birren,B.W. and Hudson,T.J. (1999) Human genome anatomy: BACs integrating the genetic and cytogenetic maps for bridging genome and biomedicine. Genome Res., 9, 994–1001.
- 11. Cheung,V.G., Dalrymple,H.L., Narasimhan,S., Watts,J., Schuler,G., Raap,A.K., Morley,M. and Bruzel,A. (1999) A resource of mapped human bacterial artificial chromosome clones. Genome Res., 9, 989–993.
- 12. Osoegawa,K., Tateno,M., Woon,P.Y., Frengen,E., Mammoser,A.G., Catanese,J.J., Hayashizaki,Y. and de Jong,P.J. (2000) Bacterial artificial chromosome libraries for mouse sequencing and functional analysis. Genome Res., 10, 116–128.
- 13. Korenberg,J.R., Chen,X.N., Devon,K.L., Noya,D., Oster-Granite,M.L. and Birren,B.W. (1999) Mouse molecular cytogenetic resource: 157 BACs link the chromosomal and genetic maps. Genome Res., 9, 514–523.