Abstract
GenBank® (https://www.ncbi.nlm.nih.gov/genbank/) is a comprehensive, public database that contains 19.6 trillion base pairs from over 2.9 billion nucleotide sequences for 504 000 formally described species. Daily data exchange with the European Nucleotide Archive (ENA) and the DNA Data Bank of Japan (DDBJ) ensures worldwide coverage. Recent updates include resources for data from the SARS-CoV-2 virus, NCBI Datasets, BLAST ClusteredNR, the Submission Portal, table2asn, a Foreign Contamination Screening tool and BioSample.
INTRODUCTION
GenBank (1) is a comprehensive public database of nucleotide sequences and supporting bibliographic and biological annotations built and distributed by the National Center for Biotechnology Information (NCBI), a division of the National Library of Medicine (NLM), located on the campus of the US National Institutes of Health (NIH) in Bethesda, MD, USA. This paper begins with a summary of developments in GenBank over the past year, followed by brief usage guidelines for submitting data to GenBank and for accessing the data. For a general overview of GenBank, we suggest readers refer to https://www.ncbi.nlm.nih.gov/genbank/.
GenBank data are collected in various divisions, the size and growth of which are shown in Table 1 and Figure 1. GenBank received almost 5 million new SARS-CoV-2 sequences over the past year, accounting for the large increase in the VRL division. Multiple complete mouse genomes (e.g. BioProject PRJEB47108) accounted for about two-thirds of the growth in the ROD division.
Table 1.
GenBank divisions
Division | Description | Base pairs* |
---|---|---|
WGS | Whole genome shotgun data | 17 511 809 676 629 |
TSA | Transcriptome shotgun data | 511 680 950 707 |
PLN | Plants | 484 803 006 831 |
INV | Invertebrates | 269 338 221 858 |
VRL | Viruses | 187 366 647 663 |
BCT | Bacteria | 166 217 792 419 |
VRT | Other vertebrates | 99 921 122 967 |
ROD | Rodents | 66 092 410 483 |
TLS | Targeted loci studies | 43 852 280 645 |
EST | Expressed sequence tags | 43 330 114 068 |
MAM | Other mammals | 41 720 029 494 |
PAT | Patent sequences | 30 938 105 095 |
HTG | High-throughput genomic | 27 801 878 633 |
GSS | Genome survey sequences | 26 380 049 011 |
PRI | Primates | 15 619 743 253 |
ENV | Environmental samples | 8 516 518 905 |
SYN | Synthetic | 8 030 787 249 |
PHG | Phages | 1 158 493 277 |
HTC | High-throughput cDNA | 740 853 492 |
STS | Sequence tagged sites | 640 923 137 |
UNA | Unannotated | 4 436 341 |
*Release 251 (8/2022).
Figure 1.
Annual increase in base pairs (bp) for each division of GenBank in release 251 (August 2022) measured relative to GenBank release 245 (August 2021). The ‘TOTAL’ bar indicates the growth for GenBank as a whole. See Table 1 for a description of the division abbreviations.
RECENT DEVELOPMENTS
SARS-CoV-2 resources
We continue to update a customized submission portal for SARS-CoV-2 sequences (https://submit.ncbi.nlm.nih.gov/sarscov2/). This portal accepts a variety of data, from unassembled reads to annotated genomes, including sequences for single genes or partial genomes, and provides accessions to submitters in 2 hours on average. We encourage submitters to use this portal, as doing so maximizes free data access, not only through the INSDC databases but also through the NCBI Virus resource (2), RefSeq (3), and BLAST (4). These resources contain genomes from 60 coronaviruses and smaller sequences from over 1200 coronaviruses. We continue to collect the latest data and resources related to SARS-CoV-2 on a single landing page (https://www.ncbi.nlm.nih.gov/sars-cov-2/). Finally, NCBI Datasets (see below) provides downloads for over 1.5 million complete SARS-CoV-2 genomes (https://www.ncbi.nlm.nih.gov/datasets/coronavirus/genomes) as well as a new taxonomy page for this virus (https://www.ncbi.nlm.nih.gov/data-hub/taxonomy/2697049/).
Monkeypox sequences
In response to the ongoing outbreak of monkeypox infections, GenBank automatically detects submissions of monkeypox sequences and processes them immediately. GenBank staff are also adding annotations to all submissions of monkeypox virus genomes to accelerate data availability. In the past year GenBank received 1200 monkeypox sequences, increasing the number of monkeypox sequences by 270%.
NCBI Datasets
NCBI Datasets (https://www.ncbi.nlm.nih.gov/datasets/) allows users to download complex genomic datasets easily using either a web interface, an API, or a UNIX/LINUX command-line tool. NCBI Datasets now provides a new genome table interface (https://www.ncbi.nlm.nih.gov/data-hub/genome/) that allows users to filter, view, and download data from multiple species or other taxonomic nodes. This interface returns genomic data in GenBank, and includes a filter for genomes annotated by their GenBank submitter.
BLAST ClusteredNR
The web interface for protein BLAST now offers the ClusteredNR database, a collection of sequences derived from the standard protein nr database by creating clusters of sequences containing members that are >90% identical to one another and that are within 90% of the length of the longest member. Searches using this database are faster and allow users to explore taxonomic diversity much more easily. The BLAST results show a representative sequence from each cluster that is generally well-annotated and that indicates the function of the protein.
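The membership rule described above can be expressed as a simple predicate. The following is a minimal sketch, not NCBI's clustering implementation: the function name and its arguments are hypothetical, the percent identity is assumed to have been computed elsewhere, and "within 90% of the length" is read here as a length of at least 90% of that of the longest member.

```python
def meets_cluster_criteria(percent_identity: float,
                           member_length: int,
                           longest_length: int) -> bool:
    """Return True if a sequence satisfies both ClusteredNR-style criteria.

    Assumptions (illustrative only): percent_identity is on a 0-100 scale,
    and lengths are residue counts with member_length <= longest_length.
    """
    if member_length <= 0 or longest_length <= 0:
        raise ValueError("sequence lengths must be positive")
    # Criterion 1: strictly more than 90% identical.
    identical_enough = percent_identity > 90.0
    # Criterion 2: at least 90% of the longest member's length.
    # Integer arithmetic avoids floating-point rounding at the boundary.
    long_enough = member_length * 10 >= longest_length * 9
    return identical_enough and long_enough
```

For example, a 450-residue sequence at 95% identity would qualify for a cluster whose longest member has 480 residues, whereas a 400-residue sequence would not, because it falls below the 432-residue length threshold.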
Submission portal
As part of our ongoing effort to centralize submission workflows, we are shifting submission workflows for eukaryotic nuclear mRNA sequences from BankIt to the Submission Portal (https://submit.ncbi.nlm.nih.gov). An interactive wizard will simplify the process, and will ultimately give submitters more control over release dates and a greater ability to edit previous submissions. The initial beta release will be most appropriate for small submissions (e.g. <10 sequences). Submitters needing to submit more sequences may continue to use BankIt.
Table2asn
The command-line tool table2asn (https://www.ncbi.nlm.nih.gov/genbank/table2asn/) is a new utility for preparing GenBank submissions and replaces the older tool tbl2asn. This new tool is not only more efficient but also offers additional functions, such as accepting annotations in GenBank-format GFF files. The above web page includes a comparison of argument values between the two tools along with links to download table2asn and access complete documentation. We encourage current users of tbl2asn to migrate to table2asn.
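As an illustration of how a table2asn run might be scripted, the sketch below assembles a command line without executing it. The file names are hypothetical, and the four arguments shown (`-t` for the submission template, `-i` for the FASTA input, `-f` for the annotation file, and `-o` for the output) reflect our reading of the table2asn documentation; submitters should confirm them against the web page cited above before use.

```python
import shlex


def build_table2asn_command(template: str, fasta: str,
                            annotation: str, output: str) -> list:
    """Assemble (but do not run) a table2asn invocation.

    All file names are placeholders supplied by the caller.
    """
    return [
        "table2asn",
        "-t", template,    # submission template file
        "-i", fasta,       # nucleotide sequences in FASTA format
        "-f", annotation,  # annotation file, e.g. GFF
        "-o", output,      # output ASN.1 submission file
    ]


if __name__ == "__main__":
    command = build_table2asn_command(
        "template.sbt", "genome.fsa", "genome.gff", "genome.sqn")
    # Print the command so that it can be reviewed before being run.
    print(shlex.join(command))
```

Keeping the arguments in a list rather than a single string means the command can later be passed directly to `subprocess.run` without shell quoting problems.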
Foreign contamination screening
In support of the Comparative Genomic Resource (CGR, https://www.ncbi.nlm.nih.gov/data-hub/cgr/data-quality-tools/), NCBI released a beta version of a Foreign Contamination Screening (FCS) tool (https://github.com/ncbi/fcs) that can assist GenBank submitters in improving the quality of their submitted data. The tool consists of two components: FCS-adaptor, which detects adaptor and vector contamination, and FCS-GX, which detects contamination from organisms other than the intended biological source. We have incorporated the FCS-GX component into the standard processing of new genome submissions to GenBank. Both components are available as Docker or Singularity images, and detailed instructions are provided on the pages linked above. We hope to release additional FCS updates in the coming year.
BioSample
BioSample recently released several new packages that we highlight here. BioSample packages serve as templates for types of BioSample records and specify the attributes that such records must contain (https://submit.ncbi.nlm.nih.gov/biosample/template/). The ‘Pathogen’ package standardizes samples of pathogenic organisms from either the clinic or the environment, such as an outbreak of food contamination. Two SARS-CoV-2 packages are available: one for clinical samples and one for samples from wastewater surveillance. We encourage submitters to check the above page for available packages that may support their submissions.
USING GENBANK
Advice for submitters
Notifying GenBank of published data
It has been a longstanding policy that GenBank will, upon request from the submitter, withhold the release of new sequence submissions either until the date that associated research is published or until a release date specified by the submitters, whichever occurs first. Submitters may set the length of these delays. To avoid additional delays in releasing data, when such research is published we urge submitters to send the full publication details to GenBank at update@ncbi.nlm.nih.gov.
Contextual metadata
As discussed previously (1), we continue to encourage submitters to provide contextual metadata, particularly data that specify the sampling location (e.g. country, latitude, and longitude) and collection date. Such reporting is becoming a requirement under the Spatio-temporal annotation policy that the INSDC announced in late 2021 (https://www.insdc.org/news/spatio-temporal-annotation-policy-18-11-2021/). We expect that this policy will be phased in during 2023 for BioSample data and genome submissions. The importance of basic geographic information, including the country codes displayed on public sequence records (https://insdc.org/country), has only grown with the urgency to verify and track the distribution of biodiversity. Including other data such as the isolate name or number and applicable museum/collection identifiers is also helpful.
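Submitters who prepare metadata programmatically may wish to check location and date values before submission. The sketch below is illustrative only: the function names are hypothetical, the latitude/longitude pattern assumes decimal degrees followed by a hemisphere letter (e.g. "38.98 N 77.11 W"), and the date check covers only ISO-style values. Other forms that the databases may accept are not handled, and the authoritative rules are those published by the INSDC and the Submission Portal.

```python
import re
from datetime import datetime

# Assumed pattern: decimal degrees, hemisphere letter, single spaces.
LAT_LON = re.compile(
    r"^(\d{1,2}(?:\.\d+)?) ([NS]) (\d{1,3}(?:\.\d+)?) ([EW])$")


def valid_lat_lon(value: str) -> bool:
    """Check format and range of a 'lat N|S lon E|W' string."""
    match = LAT_LON.match(value)
    if match is None:
        return False
    latitude = float(match.group(1))
    longitude = float(match.group(3))
    return latitude <= 90.0 and longitude <= 180.0


def valid_collection_date(value: str) -> bool:
    """Accept ISO-style dates: YYYY-MM-DD, YYYY-MM, or YYYY."""
    for date_format in ("%Y-%m-%d", "%Y-%m", "%Y"):
        try:
            datetime.strptime(value, date_format)
            return True
        except ValueError:
            continue
    return False
```

A check of this kind catches common problems, such as signed decimal coordinates or ambiguous day/month ordering, before a submission is rejected or delayed.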
Additional best practices
We encourage submitters to use evidence tags to provide information about supporting evidence for annotations (https://www.ncbi.nlm.nih.gov/genbank/evidence/), and to cite within their submission the accession numbers of any publicly available sequencing reads they used to improve the quality of their assemblies. Submitting prokaryotic genomes is more efficient when submitters create annotated genomes with NCBI’s Prokaryotic Genome Annotation Pipeline (PGAP; https://www.ncbi.nlm.nih.gov/genome/annotation_prok/).
Citing GenBank accessions
As described previously (5), when citing a given GenBank record in a publication, the best practice is to use the full accession.version identifier for the record (e.g. AF123456.2). Since citing only the accession portion (e.g. AF123456) retrieves the current version, including the version suffix ensures clarity about which version of the record the authors are referencing.
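The distinction between an accession and an accession.version identifier can be made explicit in code. The sketch below uses a deliberately simplified pattern (uppercase letters followed by digits) rather than the full set of GenBank accession formats, and the function names are hypothetical. The retrieval URL uses the NCBI E-utilities EFetch service, which is not discussed in this article; it is shown as one possible way to request a specific version of a record.

```python
import re
from urllib.parse import urlencode

# Simplified pattern: letters, digits, and an optional ".version" suffix.
ACCESSION_VERSION = re.compile(r"^([A-Z]+\d+)(?:\.(\d+))?$")

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def split_accession(identifier: str):
    """Return (accession, version), where version is None if absent."""
    match = ACCESSION_VERSION.match(identifier.strip())
    if match is None:
        raise ValueError(f"unrecognized identifier: {identifier!r}")
    accession, version = match.group(1), match.group(2)
    return accession, (int(version) if version else None)


def efetch_url(identifier: str) -> str:
    """Build an EFetch URL for a fully versioned nucleotide record."""
    accession, version = split_accession(identifier)
    if version is None:
        # Without a version, the current version would be returned.
        raise ValueError("use the full accession.version identifier")
    query = urlencode({
        "db": "nuccore",
        "id": f"{accession}.{version}",
        "rettype": "gb",
        "retmode": "text",
    })
    return f"{EFETCH}?{query}"
```

Rejecting unversioned identifiers in `efetch_url` is a design choice made for this example: it forces the caller to state which version of the record is intended, in line with the citation practice recommended above.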
Acquiring the database
NCBI provides GenBank sequence records in both the traditional flat file format and in a structured ASN.1 format by anonymous FTP at ftp.ncbi.nlm.nih.gov/genbank. For release 251 (15 August 2022) there are 5836 files requiring 2723 GB of uncompressed disk storage. In addition, daily GenBank incremental update files containing new records and those updated since the most recent release are available in flat file format at ftp.ncbi.nlm.nih.gov/genbank/daily-nc/.
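The release directory can be inspected with any FTP client. The following sketch uses the Python standard library to list the directory by anonymous FTP and to count flat files per division. The file-naming pattern assumed here ('gb', a lowercase division code, a number, and '.seq.gz', as in gbbct1.seq.gz) is an assumption about the directory layout that should be checked against the release notes, and the function names are hypothetical.

```python
import re
from collections import Counter
from ftplib import FTP

# Assumed naming pattern for compressed flat files, e.g. gbbct1.seq.gz.
FLATFILE = re.compile(r"^gb([a-z]+)\d+\.seq\.gz$")


def division_of(filename: str):
    """Return the division code (e.g. 'BCT'), or None if no match."""
    match = FLATFILE.match(filename)
    return match.group(1).upper() if match else None


def count_files_by_division(filenames) -> Counter:
    """Count flat files per division, ignoring unrecognized names."""
    divisions = (division_of(name) for name in filenames)
    return Counter(code for code in divisions if code is not None)


def list_release_files(host: str = "ftp.ncbi.nlm.nih.gov",
                       directory: str = "genbank") -> list:
    """List file names in the release directory by anonymous FTP."""
    with FTP(host) as ftp:
        ftp.login()          # no arguments: anonymous login
        ftp.cwd(directory)
        return ftp.nlst()


if __name__ == "__main__":
    counts = count_files_by_division(list_release_files())
    for code, number in sorted(counts.items()):
        print(code, number)
```

Given the size of a full release, listing the directory first and downloading only the divisions of interest can save considerable time and storage.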
Contributor Information
Eric W Sayers, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA.
Mark Cavanaugh, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA.
Karen Clark, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA.
Kim D Pruitt, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA.
Stephen T Sherry, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA.
Linda Yankie, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA.
Ilene Karsch-Mizrachi, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA.
MAILING ADDRESS
GenBank, National Center for Biotechnology Information, Building 45, Room 6AN12D-37, 45 Center Drive, Bethesda, MD 20892, USA.
ELECTRONIC ADDRESSES
www.ncbi.nlm.nih.gov: NCBI Home Page.
gb-sub@ncbi.nlm.nih.gov: Submission of sequence data to GenBank.
update@ncbi.nlm.nih.gov: Revisions to, or notification of release of, ‘confidential’ GenBank entries.
info@ncbi.nlm.nih.gov: General information about NCBI resources.
CITING GENBANK
If you use the GenBank database in your published research, we ask that this article be cited.
FUNDING
Funding for open access charge: National Center for Biotechnology Information of the National Library of Medicine, National Institutes of Health.
Conflict of interest statement. None declared.
REFERENCES
- 1. Sayers E.W., Cavanaugh M., Clark K., Pruitt K.D., Schoch C.L., Sherry S.T., Karsch-Mizrachi I. GenBank. Nucleic Acids Res. 2021; 49:D92–D96.
- 2. Brister J.R., Ako-Adjei D., Bao Y., Blinkova O. NCBI viral genomes resource. Nucleic Acids Res. 2015; 43:D571–D577.
- 3. O’Leary N.A., Wright M.W., Brister J.R., Ciufo S., Haddad D., McVeigh R., Rajput B., Robbertse B., Smith-White B., Ako-Adjei D. et al. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. 2016; 44:D733–D745.
- 4. Boratyn G.M., Camacho C., Cooper P.S., Coulouris G., Fong A., Ma N., Madden T.L., Matten W.T., McGinnis S.D., Merezhuk Y. et al. BLAST: a more efficient report with usability improvements. Nucleic Acids Res. 2013; 41:W29–W33.
- 5. Sayers E.W., Cavanaugh M., Clark K., Ostell J., Pruitt K.D., Karsch-Mizrachi I. GenBank. Nucleic Acids Res. 2020; 48:D84–D86.