Abstract
The GenBank® sequence database incorporates publicly available DNA sequences of >55 000 different organisms, primarily through direct submission of sequence data from individual laboratories and large-scale sequencing projects. Most submissions are made using the BankIt (Web) or Sequin programs and accession numbers are assigned by GenBank staff upon receipt. Data exchange with the EMBL Data Library and the DNA Data Bank of Japan helps ensure comprehensive worldwide coverage. GenBank data is accessible through NCBI’s integrated retrieval system, Entrez, which integrates data from the major DNA and protein sequence databases along with taxonomy, genome, mapping and protein structure information, plus the biomedical literature via PubMed. Sequence similarity searching is provided by the BLAST family of programs. Complete bimonthly releases and daily updates of the GenBank database are available by FTP. NCBI also offers a wide range of WWW retrieval and analysis services based on GenBank data. The GenBank database and related resources are freely accessible via the NCBI home page at http://www.ncbi.nlm.nih.gov
INTRODUCTION
GenBank (1) is a public database of all known nucleotide and protein sequences with supporting bibliographic and biological annotation, built and distributed by the National Center for Biotechnology Information (NCBI), a division of the National Library of Medicine (NLM), located on the campus of the US National Institutes of Health (NIH).
NCBI builds GenBank primarily from the direct submission of sequence data from authors. Another major source of data is bulk submission of EST and other high-throughput data from sequencing centers. The US Patent and Trademark Office (USPTO) also contributes sequence data from issued patents. The data are supplemented by sequences submitted to other public databases. Through a long-standing international collaboration with the EMBL Data Library (2) in the UK and the DNA Data Bank of Japan (DDBJ) (3), data are exchanged daily to ensure that all three sites maintain a comprehensive collection of sequence information. NCBI makes the data available at no cost over the Internet, by FTP access and by Web text and sequence similarity search services. NCBI also offers a wide range of WWW retrieval and analysis services which operate on the GenBank data (4).
ORGANIZATION OF THE DATABASE
GenBank continues to grow at an exponential rate. Over the past 12 months 2.1 million new sequences have been added. As of Release 113 in August 1999, GenBank contained over 3.4 billion nucleotide bases from 4.6 million different sequences. Complete genomes (http://www.ncbi.nlm.nih.gov/Entrez/Genome/org.html) represent a growing portion of the database, with 14 of the 22 complete genomes now in GenBank deposited over the past 2 years. Recent additions include Aeropyrum pernix K1, the deep-sea archaeon Pyrococcus abyssi and the nearly complete genome of Caenorhabditis elegans. At least 40 additional microbial genomes, plus that of Drosophila melanogaster, are being sequenced. Many of these are expected to be in the public databases over the coming year. Historically, GenBank had been doubling in size about every 18 months, but that rate has accelerated to doubling every 15 months due primarily to the enormous growth in data from expressed sequence tags (ESTs). Over 63% of the sequences in the current GenBank release are ESTs, and current EST projects for human, mouse, rat and other organisms will contribute still more data.
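The doubling times quoted above translate directly into growth factors over any interval. The following is a minimal sketch of that arithmetic, assuming simple exponential growth at a constant doubling time; the function names are illustrative and are not part of any NCBI software.

```python
def growth_factor(months: float, doubling_months: float) -> float:
    """Factor by which a quantity grows in `months` if it doubles
    every `doubling_months` (constant exponential growth assumed)."""
    return 2.0 ** (months / doubling_months)


def projected_bases(current_bases: float, months: float,
                    doubling_months: float) -> float:
    """Project a database size forward under the same assumption."""
    return current_bases * growth_factor(months, doubling_months)


if __name__ == "__main__":
    # Release 113 held over 3.4 billion bases; compare one year of growth
    # at the historical (18 month) and current (15 month) doubling times.
    for doubling in (18, 15):
        size = projected_bases(3.4e9, 12, doubling)
        print(f"doubling every {doubling} months: {size:.3g} bases after 1 year")
```

Under this model a shorter doubling time always gives a larger one-year factor, which is why the shift from 18 to 15 months is significant for capacity planning.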
Sequence-based taxonomy
Over 55 000 different species are represented in GenBank and new species are being added at the rate of 1250 per month. Human sequences constitute 56% of the total sequences (34% of all sequences are human ESTs). After Homo sapiens, the top species in GenBank in terms of the number of bases include Mus musculus, C.elegans, D.melanogaster and Arabidopsis thaliana. Database sequences are processed and can be queried using a comprehensive sequence-based taxonomy (http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html) developed by NCBI in collaboration with EMBL and DDBJ and with the valuable assistance of external advisors and curators. The NCBI taxonomy is also covered in a separate article in this issue (4).
GenBank records and divisions
Each GenBank entry includes a concise description of the sequence, the scientific name and taxonomy of the source organism, bibliographic references, and a table of features (http://www.ncbi.nlm.nih.gov/collab/FT/index.html) that identifies coding regions and other sites of biological significance, such as transcription units, repeat regions, sites of mutations or modifications and other sequence features. Protein translations for coding regions are also in the feature table.
The files in the GenBank distribution have traditionally been divided into ‘divisions’ that roughly correspond to taxonomic divisions, e.g., bacteria, viruses, primates and rodents. In recent years divisions have been added as needed for specific initiatives in biology, such as divisions for EST sequences, genome survey sequences and high throughput genomic sequences. There are currently 16 divisions. For convenience in file transfer, the larger divisions, e.g., EST and primate, are divided into multiple files when posting the bimonthly GenBank releases on NCBI’s FTP site.
Expressed sequence tag (EST) data
ESTs continue to be the major source of new sequence records and genes. Last year there were 1 765 860 sequences in the EST division of GenBank. Over the past year the number of ESTs has increased by >70% to the current total of 3 017 997 sequences representing over 180 different organisms. The top five organisms include: human, with 1 587 029 sequences (53% of the total); mouse, with 696 281 sequences (23%); rat, with 124 350 sequences (4%); nematode, with 100 844 sequences (3%); and the fruit fly, with 86 121 sequences (3%).
ESTs also continue to provide the major source of new gene discoveries. As part of its daily processing of EST data, NCBI identifies through BLAST searches all homologies for new EST sequences and incorporates that information into the companion dbEST database (5). In order to organize the EST data in a useful fashion, NCBI maintains the UniGene (http://www.ncbi.nlm.nih.gov/UniGene/) collection of unique human (6), mouse and rat genes. Additional information about UniGene is included in a separate article in this issue (4).
Sequence-tagged site (STS) data
The STS division of GenBank currently contains >83 000 sequences and includes anonymous STSs based on genomic sequence as well as gene-based STSs derived from the 3′ ends of genes and ESTs. These STS records usually include primer sequences, annotations and PCR reaction conditions.
The ultimate purpose for creating high resolution physical maps of the human genome is to create a scaffold for organizing large scale sequencing (7). Physical maps based on STS landmarks are used to develop so-called ‘sequence-ready’ clones consisting of overlapping cosmids or BACs. As the HTG sequence data derived from these clones are submitted to GenBank, STSs become crucial reference points for organizing, presenting and searching the data. NCBI uses ‘electronic PCR’ to compare all human sequences with the contents of the STS division of GenBank; this identifies primer-binding sites on the human sequences that may be amplified in a PCR reaction. This tool permits the assignment of an initial location on the map for sequence data and the association of existing GenBank entries to the new reference sequence. The electronic PCR tool is also being made publicly available on the Web to enable any researcher with a new human sequence to relate that sequence to existing maps and HTG sequence data.
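The idea behind electronic PCR can be illustrated with a small sketch: for each STS, look for the forward primer and the binding site of the reverse primer on a sequence, and report a hit when the implied product size is close to the expected one. This is only an illustration under strong simplifying assumptions (exact primer matches, forward primer on the given strand only); it is not NCBI's implementation, and the names `STS`, `Hit`, `electronic_pcr` and the `tolerance` parameter are hypothetical.

```python
from typing import List, NamedTuple

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


class STS(NamedTuple):
    name: str
    forward: str        # forward primer, 5'->3'
    reverse: str        # reverse primer, 5'->3' on the opposite strand
    product_size: int   # expected PCR product length in bases


class Hit(NamedTuple):
    sts: str
    start: int          # 0-based start of the predicted product
    end: int            # end of the predicted product (exclusive)


def find_all(text: str, pattern: str) -> List[int]:
    """All start positions of exact occurrences of pattern in text."""
    positions = []
    i = text.find(pattern)
    while i != -1:
        positions.append(i)
        i = text.find(pattern, i + 1)
    return positions


def electronic_pcr(sequence: str, markers: List[STS],
                   tolerance: int = 50) -> List[Hit]:
    """Report STSs whose primer pair could amplify a product of roughly
    the expected size from `sequence` (exact matches only)."""
    seq = sequence.upper()
    hits = []
    for sts in markers:
        fwd = sts.forward.upper()
        # The reverse primer anneals to the given strand at the
        # reverse complement of its own sequence.
        rev_site = reverse_complement(sts.reverse.upper())
        for start in find_all(seq, fwd):
            for rev_pos in find_all(seq, rev_site):
                end = rev_pos + len(rev_site)
                downstream = rev_pos >= start + len(fwd)
                if downstream and abs((end - start) - sts.product_size) <= tolerance:
                    hits.append(Hit(sts.name, start, end))
    return hits
```

A production tool would also need to search the opposite orientation and tolerate primer mismatches; both are omitted here to keep the sketch short.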
Genome Survey Sequence (GSS) data
The Genome Survey Sequences (GSS) division of GenBank has been the fastest growing division in the last year, having increased 4.5-fold to a total of 1 008 904 records with 518 917 227 nucleotides. GSS records represent ‘random’ genomic sequences, but are predominantly represented by ‘BAC ends’, which are single reads from bacterial artificial chromosomes used in a variety of genome sequencing projects, notably those of human (819 918 records), Oryza sativa (49 585 records) and Fugu rubripes (32 030 records). The human data are being used (http://www.ncbi.nlm.nih.gov/genome/clone) along with the STS records in tiling the BACs used for the Human Genome Project (8).
High throughput genomic (HTG) data
The high throughput genomic sequences in the HTG division of GenBank are unfinished large-scale genomic records that are in transition to a finished state, after which they will be placed in the appropriate organism division (9). These records are designated as Phase 0–3 depending on the quality of the data. Phase 0 records consist of survey sequences generated to characterize clones and may or may not progress to Phase 1. Phase 1 records contain unfinished sequence, and may consist of unordered, unoriented contigs with gaps. Phase 2 records contain unfinished sequence as ordered, oriented contigs, with or without gaps. Phase 3 records consist of finished sequence with no gaps and may have annotations. When an HTG record reaches Phase 3, it is moved from the HTG division into the appropriate organismic division of GenBank. It is now clear that a great number of human sequences will remain in the unfinished (HTG) division of GenBank as working draft sequence, while completed sequences will continue to move to the corresponding organismic division (PRI). Together these two divisions should add some 2000 Mb of new genomic sequences from US-sponsored laboratories within the next year.
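The phase scheme described above maps naturally onto a small enumeration. The sketch below is illustrative only: the member names and the `division_for` helper are hypothetical, and the rule it encodes is simply the one stated in the text (records stay in HTG until they reach Phase 3).

```python
from enum import IntEnum


class HtgPhase(IntEnum):
    """HTG phases as described in the text; member names are illustrative."""
    SURVEY = 0      # survey sequences used to characterize clones
    UNORDERED = 1   # unfinished; unordered, unoriented contigs with gaps
    ORDERED = 2     # unfinished; ordered, oriented contigs
    FINISHED = 3    # finished sequence with no gaps


def division_for(phase: HtgPhase, organism_division: str) -> str:
    """Division in which a record of the given phase is kept.

    Unfinished records (Phase 0-2) stay in HTG; finished records move
    to the organism's own division, e.g. PRI for human sequences.
    """
    return organism_division if phase == HtgPhase.FINISHED else "HTG"
```

For example, `division_for(HtgPhase.ORDERED, "PRI")` keeps a working draft human record in HTG, while the same record at `HtgPhase.FINISHED` is assigned to PRI.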
Sequence identifiers and accession numbers
Each GenBank DNA sequence record is assigned an accession number, which is a stable and unique identifier for the GenBank entry as a whole, and does not change, even when there is a change to the sequence or annotation. In order to identify specific sequences from different sources, as well as keep track of modifications to the actual sequence data, NCBI additionally assigns a unique identifier, termed a ‘gi’ number, to each sequence. When a change in a sequence occurs, a new gi number is assigned to the new sequence version. These gi numbers appear in the ‘NID’ (Nucleotide ID) field of a GenBank record, immediately following the ACCESSION field.
By agreement among the collaborative DNA sequence databases, a third identifier was introduced in February 1999 which consolidates the information present in both the gi and accession numbers. GenBank displays this identifier on the VERSION line, which appears below the NID line in the GenBank flat-file format and is of the form ‘Accession.version’. For example, an entry appearing in the database for the first time has a VERSION number equivalent to the ACCESSION number followed by ‘.1’ to reflect that this is the first version of the sequence in this entry, e.g.,
ACCESSION AF000001
NID g987654321
VERSION AF000001.1 GI: 987654321
The VERSION line also displays the gi number. If the nucleotide sequence changes, then so will the gi number and the version, but the accession will remain the same. Although the NID line carries redundant information, this line remains in the file to ensure compatibility with existing programs.
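Programs that track sequence changes can read the accession, version and gi number from the VERSION line alone. The following is a minimal sketch of such a parser, assuming the line layout shown in the example above and an accession made of upper-case letters followed by digits; the names `SequenceVersion`, `parse_version_line` and `sequence_changed` are hypothetical.

```python
import re
from typing import NamedTuple


class SequenceVersion(NamedTuple):
    accession: str   # stable identifier for the entry as a whole
    version: int     # incremented when the sequence changes
    gi: int          # gi number of this particular sequence version


# Assumed layout: "VERSION <accession>.<version> GI:<number>", with
# flexible whitespace between fields and after "GI:".
VERSION_RE = re.compile(r"^VERSION\s+([A-Z]+\d+)\.(\d+)\s+GI:\s*(\d+)\s*$")


def parse_version_line(line: str) -> SequenceVersion:
    """Split a VERSION line into accession, version and gi number."""
    match = VERSION_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a VERSION line: {line!r}")
    accession, version, gi = match.groups()
    return SequenceVersion(accession, int(version), int(gi))


def sequence_changed(old: SequenceVersion, new: SequenceVersion) -> bool:
    """True if two records of the same entry carry different sequence versions."""
    if old.accession != new.accession:
        raise ValueError("records belong to different entries")
    return old.version != new.version
```

Comparing the version rather than the gi number keeps the check valid even if gi numbers are eventually retired from the flat file.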
A similar system for tracking changes in the corresponding protein translations was also introduced in February 1999. Protein sequences now have identification numbers (in the format of three letters followed by five digits, e.g., AAA00001) that do not change, followed by a version number that increases with each subsequent version of the sequence. This identifier appears as a qualifier for a CDS feature in the FEATURES table portion of a GenBank entry, e.g., /protein_id="AAA00001.1"
Protein sequence translations also currently receive their own unique gi number, which appears as a second qualifier on the CDS feature: /db_xref="PID:g1234567". The letter prefix indicates the database of origin for these identifiers (d=DDBJ, e=EMBL, g=GenBank). Eventually, the gi number will be phased out since the new ‘protein_id’, complete with the version number, will represent both a unique identifier and a means to identify changes in the sequence.
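The two CDS qualifiers described above can be read with a pair of regular expressions. This is a sketch under the assumption that the qualifiers follow exactly the forms given in the text (three letters and five digits for the protein identifier, a one-letter database prefix for the PID); the function names are hypothetical, and either straight or typographic quotes around the value are accepted.

```python
import re
from typing import Optional, Tuple

# Database of origin encoded in the PID letter prefix, as listed in the text.
PID_PREFIXES = {"d": "DDBJ", "e": "EMBL", "g": "GenBank"}

# Accept straight or typographic quotes around the qualifier value.
QUOTE = "[\"'\u2018\u2019]"
PROTEIN_ID_RE = re.compile(
    r"/protein_id=" + QUOTE + r"([A-Z]{3}\d{5})\.(\d+)" + QUOTE)
PID_RE = re.compile(r"/db_xref=" + QUOTE + r"PID:([deg])(\d+)" + QUOTE)


def parse_protein_id(qualifier: str) -> Optional[Tuple[str, int]]:
    """Return (stable protein identifier, version), or None if absent."""
    match = PROTEIN_ID_RE.search(qualifier)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


def parse_pid(qualifier: str) -> Optional[Tuple[str, int]]:
    """Return (database of origin, gi number), or None if absent."""
    match = PID_RE.search(qualifier)
    if match is None:
        return None
    return PID_PREFIXES[match.group(1)], int(match.group(2))
```

Keeping the two parsers separate means that code relying on `protein_id` continues to work unchanged once the PID cross-reference is phased out.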
BUILDING THE DATABASE
The data in GenBank, and the collaborating databases EMBL and DDBJ, come from two sources: (i) individual authors who submit data directly to one of the databases, and (ii) bulk submissions from sequencing centers in the form of ESTs, STSs, GSSs or large genomic records (usually sequences from cosmids, BACs or YACs). Data are exchanged daily with DDBJ and EMBL so that the daily updates from NCBI servers incorporate the most recently available sequence data from all sources.
Direct submission
Virtually all records enter GenBank as direct electronic submissions, with the majority of authors using the BankIt or Sequin programs. Many journals require authors with sequence data to submit the data to a public database as a condition of publication.
GenBank staff can usually assign an accession number to a sequence submission within 2 working days of receipt, and do so at a rate of several hundred per day. The accession number serves as confirmation that the sequence has been submitted and allows readers of the article to retrieve the relevant data. All direct submissions receive a systematic quality assurance review including checking for vector contamination, verifying proper translation of coding regions, and checking for correct taxonomy and bibliographic citations. A draft of the GenBank record is passed back to the author for review before it enters the database. Authors have the right to request that their sequences be kept confidential until the time of publication. In these cases, authors are instructed to inform GenBank staff of the publication date of the article in which the sequence is cited in order to ensure a timely release of the data. GenBank policy requires that deposited sequence data be made public when the sequence or accession number is published. Although only the submitting scientist is permitted to modify sequence data or annotations, all users are encouraged to report lags in releasing data or possible errors or omissions to GenBank at update@ncbi.nlm.nih.gov
Several large-scale sequencing projects are producing megabases of human genomic DNA sequence. NCBI works closely with sequencing centers to ensure timely incorporation of these data into GenBank for public release. In parallel, NCBI has developed methods to integrate these sequences with genetic and physical map data and to search the sequences more effectively (e.g., through options in BLAST to mask Alu and other types of repetitive elements). GenBank offers special batch procedures for large-scale sequencing groups to facilitate data submission, including the program ‘fa2htgs’ and other tools (10).
BankIt
About 40% of individual submissions are received through a Web-based data submission tool, BankIt (http://www.ncbi.nlm.nih.gov/BankIt). With BankIt, authors enter sequence information directly into a form, edit as necessary and add biological annotation (e.g., coding regions, mRNA features). Free-form text boxes allow the submitter to further describe the sequence, without having to learn formatting rules or use restricted vocabularies. BankIt creates a draft record in GenBank flat-file format for the user to review and revise. BankIt is the tool of choice for simple submissions, especially when only one or a small number of records is submitted (9). BankIt can also be used by submitters to update their existing GenBank records.
Sequin
NCBI has developed a stand-alone multi-platform submission program called Sequin (http://www.ncbi.nlm.nih.gov/Sequin/index.html) which can also be linked online to NCBI. Sequin handles simple sequences (e.g., a cDNA), as well as long sequences and segmented entries, for which BankIt and other Web-based submission tools are not well-suited. Sequin has convenient editing and complex annotation capabilities and contains a number of built-in validation functions for enhanced quality assurance. It is also designed to facilitate the submission of sequences from phylogenetic, population and mutation studies, and can incorporate alignment data. Sequin can be used to edit and update sequence records, as well as to perform sequence analysis. For example, Sequin can now incorporate any analysis tool available on the Web that accepts FASTA or ASN.1 (Abstract Syntax Notation 1) formatted data as its input. In addition, Sequin is able to work on large records (e.g., the Escherichia coli genome at 5.6 Mb) and read in all of its annotations via simple tables. Versions for Macintosh, PC and Unix computers are available via anonymous FTP to ‘ncbi.nlm.nih.gov’ in the ‘sequin’ directory. Once a submission is completed, users can email it to gb-sub@ncbi.nlm.nih.gov. Additional information about Sequin can be found through the NCBI home page.
RETRIEVING GenBank DATA
The Entrez system
Entrez (http://www.ncbi.nlm.nih.gov/entrez/) is an integrated database retrieval system that accesses DNA and protein sequence data, genome mapping data, population sets, the NCBI taxonomy, protein structures from the Molecular Modeling Database, MMDB (11) and MEDLINE references via PubMed. The DNA and protein sequence data are integrated from a variety of sources and therefore include more sequence data than are available within GenBank alone. Entrez searching is provided on NCBI’s Web site, via the Query Email server (query@ncbi.nlm.nih.gov), and as a network client that can be downloaded by FTP. Entrez is also discussed elsewhere in this issue (4).
BLAST sequence-similarity searching
The most frequent type of analysis performed using GenBank is the search for sequences similar to a query sequence. NCBI offers the BLAST (http://www.ncbi.nlm.nih.gov/BLAST/) family of programs to locate good alignments between a query sequence and database sequences (12,13). BLAST searching is provided on NCBI’s Web site, via an Email server (blast@ncbi.nlm.nih.gov), and as a set of stand-alone programs distributed by FTP. BLAST is discussed in more detail in a separate article in this issue (4).
Obtaining GenBank by FTP
NCBI uses the ASN.1 data format for internal maintenance of GenBank, but distributes the GenBank releases in the traditional flat-file format. The full GenBank release (issued every 2 months) and the daily updates (which also incorporate sequence data from EMBL and DDBJ) are available by anonymous FTP from ‘ncbi.nlm.nih.gov’. The full release in flat-file format is available as compressed files in the directory, ‘genbank’. A cumulative update file is contained in the sub-directory, ‘daily’, and a non-cumulative set of updates is contained in ‘daily-nc’. A set of sequence-only files in FASTA format, corresponding to the GenBank database subsets searched by BLAST and including the non-redundant nucleotide and protein databases, is available in the ‘blast/db’ directory. Software developers creating their own interfaces or analysis tools for GenBank data are offered the NCBI ToolKit to assist in developing specialized applications. The ToolKit software can be found in the directory ‘toolbox/ncbi_tools’.
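The directory layout described above can be browsed programmatically with any FTP client library. The sketch below uses Python's standard `ftplib`; the host name and directory names are taken from the text and may differ on the current server, the placement of ‘daily’ and ‘daily-nc’ beneath ‘genbank’ is an assumption, and the short keys in `DIRECTORIES` as well as both function names are hypothetical.

```python
from ftplib import FTP
from typing import List

HOST = "ncbi.nlm.nih.gov"  # anonymous FTP host as given in the text

# Short, hypothetical keys mapped to the directories named in the text.
# Assumption: 'daily' and 'daily-nc' both sit beneath 'genbank'.
DIRECTORIES = {
    "release": "genbank",
    "daily": "genbank/daily",
    "daily-nc": "genbank/daily-nc",
    "blast": "blast/db",
    "toolkit": "toolbox/ncbi_tools",
}


def remote_directory(kind: str) -> str:
    """Map a short key to the remote directory that holds those files."""
    try:
        return DIRECTORIES[kind]
    except KeyError:
        raise ValueError(f"unknown distribution kind: {kind!r}") from None


def list_files(kind: str, host: str = HOST, timeout: float = 30.0) -> List[str]:
    """List file names in one distribution directory via anonymous FTP."""
    with FTP(host, timeout=timeout) as ftp:
        ftp.login()  # no arguments: anonymous login
        ftp.cwd(remote_directory(kind))
        return ftp.nlst()
```

Calling `list_files("release")` would return the names of the compressed flat files for the current release, provided the server still exposes the layout described here.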
CONTACT DETAILS
Mailing address
GenBank, National Center for Biotechnology Information, Building 38A, Room 8S-803, 8600 Rockville Pike, Bethesda, MD 20894, USA. Tel: +1 301 496 2475; Fax: +1 301 480 9241.
Electronic addresses
http://www.ncbi.nlm.nih.gov/ (NCBI Home Page).
gb-sub@ncbi.nlm.nih.gov (submission of sequence data to GenBank).
update@ncbi.nlm.nih.gov (revisions to GenBank entries and notification of release of ‘hold until published’ entries).
info@ncbi.nlm.nih.gov (general information about NCBI and services).
CITING GenBank
If you use GenBank as a tool in your published research, we ask that this paper be cited.
REFERENCES
- 1. Benson,D.A., Boguski,M.S., Lipman,D.J., Ostell,J., Ouellette,B.F.F., Rapp,B.A. and Wheeler,D.L. (1999) Nucleic Acids Res., 27, 12–17.
- 2. Stoesser,G., Tuli,M.A. and Lopez,R. (1999) Nucleic Acids Res., 27, 18–24. Updated article in this issue: Nucleic Acids Res. (2000), 28, 19–23.
- 3. Sugawara,H., Miyazaki,S., Gojobori,T. and Tateno,Y. (1999) Nucleic Acids Res., 27, 25–28. Updated article in this issue: Nucleic Acids Res. (2000), 28, 24–26.
- 4. Wheeler,D.L., Chappey,C., Lash,A., Leipe,D.D., Madden,T.L., Schuler,G.D., Tatusova,T. and Rapp,B.A. (2000) Nucleic Acids Res., 28, 10–14 (this issue).
- 5. Boguski,M.S., Lowe,T.M. and Tolstoshev,C.M. (1993) Nature Genet., 4, 332–333.
- 6. Schuler,G.D. (1997) J. Mol. Med., 75, 694–698.
- 7. Hudson,T.J., Stein,L.D., Gerety,S., Ma,J., Castle,A.B., Silva,J., Slonim,D.K., Baptista,R., Kruglyak,L., Xu,S.-H. et al. (1995) Science, 270, 1945–1954.
- 8. Smith,M.W., Holmsen,A.L., Wei,Y.H., Peterson,M. and Evans,G.A. (1994) Nature Genet., 7, 40–47.
- 9. Kans,J.A. and Ouellette,B.F.F. (1998) In Baxevanis,A. and Ouellette,B.F.F. (eds), Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins. John Wiley and Sons, Inc., New York, NY, pp. 319–353.
- 10. Ouellette,B.F.F. and Boguski,M.S. (1997) Genome Res., 7, 952–957.
- 11. Wang,Y., Addess,K.J., Geer,L., Madej,T., Marchler-Bauer,A., Zimmerman,D. and Bryant,S.H. (2000) Nucleic Acids Res., 28, 243–245 (this issue).
- 12. Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Miller,W. and Lipman,D.J. (1997) Nucleic Acids Res., 25, 3389–3402.
- 13. Zhang,Z., Schaffer,A.A., Miller,W., Madden,T.L., Lipman,D.J., Koonin,E.V. and Altschul,S.F. (1998) Nucleic Acids Res., 26, 3986–3991.