Abstract
GenBank® (https://www.ncbi.nlm.nih.gov/genbank/) is a comprehensive, public database that contains 15.3 trillion base pairs from over 2.5 billion nucleotide sequences for 504 000 formally described species. Recent updates include resources for data from the SARS-CoV-2 virus, including a SARS-CoV-2 landing page, NCBI Datasets, NCBI Virus and the Submission Portal. We also discuss upcoming changes to GI identifiers, a new data management interface for BioProject, and advice for providing contextual metadata in submissions.
INTRODUCTION
GenBank (1) is a comprehensive public database of nucleotide sequences and supporting bibliographic and biological annotations built and distributed by the National Center for Biotechnology Information (NCBI), a division of the National Library of Medicine (NLM), located on the campus of the US National Institutes of Health (NIH) in Bethesda, MD, USA. After discussing updates to SARS-CoV-2 (severe acute respiratory syndrome coronavirus 2) resources, this paper summarizes the growth of GenBank in the past year and briefly reviews recent updates and developments.
SARS-CoV-2 RESOURCES
As part of our ongoing response to the COVID-19 pandemic that emerged in early 2020, NCBI continues to update several tools and interfaces to support both submitters and consumers of sequence data for SARS-CoV-2. These include the SARS-CoV-2 landing page, NCBI Datasets, NCBI Virus, and the Submission Portal.
SARS-CoV-2 landing page
The SARS-CoV-2 landing page (https://www.ncbi.nlm.nih.gov/sars-cov-2/) collects a wide variety of data and resources related to SARS-CoV-2, including all relevant data in GenBank. Of particular interest to users seeking GenBank data are links to NCBI Datasets and NCBI Virus (see below) along with a link to download the full list of nucleotide accessions for SARS-CoV-2.
NCBI Datasets
NCBI Datasets is an experimental product that allows users to download complex genomic datasets easily using a web interface, an API or a UNIX/LINUX command-line tool (https://www.ncbi.nlm.nih.gov/datasets/). The specialized coronavirus page released last year now provides downloads for almost 430 000 complete SARS-CoV-2 genomes, a 29-fold increase over the past year (https://www.ncbi.nlm.nih.gov/datasets/coronavirus/genomes). This page provides downloads of metadata tables for SARS-CoV-2 genomes as well as complete genomic datasets. Users interested in SARS-CoV-2 proteins can access these data on a separate specialized page (https://www.ncbi.nlm.nih.gov/datasets/coronavirus/proteins/). Finally, NCBI Datasets also includes a new genome interface that supports taxonomic searches and selection based on the taxonomic tree. This page may be of interest to users seeking data for other coronaviruses not included on the specialized SARS-CoV-2 pages.
NCBI Virus
The NCBI Virus resource contains a SARS-CoV-2 Hub (https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/sars-cov-2) that organizes an extensive set of data and visualizations about SARS-CoV-2 (Figure 1), including data from the Sequence Read Archive (SRA). The visualizations on the default ‘Dashboard’ view include a world map showing the geographical distribution of SARS-CoV-2 collection locations. Two interactive filters allow users to subset these data by collection date and release date, and these filters update the map display. A ‘Tabular View’ option loads an interactive table listing all SARS-CoV-2 sequences with 21 filters, including sequence length, collection date, and geographic region. Conveniently, any filters set on the Dashboard transfer to the table, allowing easy exploration of the data. Moreover, users can generate alignments and build phylogenetic trees from this table.
Figure 1.
SARS-CoV-2 Data Hub in the NCBI Virus resource.
Submission Portal
NCBI continues to update a customized submission portal for both assembled and unassembled SARS-CoV-2 sequences (https://submit.ncbi.nlm.nih.gov/sarscov2/). On average this portal provides accessions back to submitters in 1–2 h, and assembled sequences will be annotated with VADR (2). We encourage submitters to use this portal, as this ensures that sequence data are made available not only through the INSDC databases, but also through the NCBI Virus resource (3), RefSeq (4), and BLAST (5). We also encourage submitters to submit both reads and traditional GenBank sequences, and to submit data to BioProject and BioSample. We are actively updating all of these resources to support novel variants and to update the content of the SARS-CoV-2 pages discussed above.
GROWTH OF THE DATABASE
Divisions with notable increases
GenBank sequences are organized into 21 divisions, each of which is represented by a three-letter abbreviation (Table 1). As shown in Table 1, especially large increases occurred in the VRL, UNA and INV divisions. Not surprisingly, the large increase in the VRL division resulted from the many submissions of SARS-CoV-2 sequences (Figure 2).
Table 1.
Growth of GenBank Divisions
Division | Description | Base pairs (a) | Annual increase (b) |
---|---|---|---|
VRL | Viruses | 39 351 597 469 | 575.68% |
UNA | Unannotated | 4 421 782 | 550.93% |
INV | Invertebrates | 108 680 334 593 | 450.00% |
ROD | Rodents | 23 336 550 435 | 93.02% |
PRI | Primates | 15 165 437 356 | 72.97% |
WGS | Whole genome shotgun data | 13 888 187 863 722 | 57.08% |
TLS | Targeted Loci Studies | 39 930 167 315 | 43.50% |
MAM | Other mammals | 28 568 850 588 | 37.06% |
VRT | Other vertebrates | 85 320 979 451 | 34.22% |
BCT | Bacteria | 130 518 385 589 | 32.07% |
PLN | Plants | 350 590 744 188 | 30.12% |
TSA | Transcriptome shotgun data | 454 757 992 932 | 19.31% |
PHG | Phages | 935 884 237 | 19.59% |
PAT | Patent sequences | 29 588 418 021 | 11.85% |
ENV | Environmental samples | 7 394 414 660 | 9.46% |
SYN | Synthetic | 7 994 601 379 | 0.78% |
HTC | High-throughput cDNA | 737 423 641 | 0.57% |
HTG | High-throughput genomic | 27 800 219 072 | 0.07% |
EST | Expressed sequence tags | 43 324 455 796 | 0.05% |
GSS | Genome survey sequences | 26 380 049 011 | 0.01% |
STS | Sequence tagged sites | 640 923 137 | 0.00% |
TOTAL | All GenBank sequences | 15 309 209 714 374 | 54.79% |
(a) Release 245 (8/2021).
(b) Relative to release 239 (8/2020).
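The annual increases in Table 1 can be converted back into approximate sizes at release 239 with simple arithmetic. The following is a minimal sketch for illustration only: the function name `previous_size` and the selection of rows are assumptions made here, and the back-calculated values are approximate because the published percentages are rounded to two decimals.

```python
# Figures copied from Table 1 (release 245): base pairs and annual increase in percent.
TABLE_1 = {
    "VRL": (39_351_597_469, 575.68),
    "WGS": (13_888_187_863_722, 57.08),
    "TOTAL": (15_309_209_714_374, 54.79),
}


def previous_size(current_bp: int, increase_pct: float) -> float:
    """Estimate the size one year earlier from the current size and the percent increase."""
    return current_bp / (1 + increase_pct / 100)


if __name__ == "__main__":
    for division, (bp, pct) in TABLE_1.items():
        earlier_gbp = previous_size(bp, pct) / 1e9
        print(f"{division}: roughly {earlier_gbp:,.1f} Gbp one year earlier")
    # Fraction of all GenBank base pairs that sit in the WGS division.
    wgs_share = TABLE_1["WGS"][0] / TABLE_1["TOTAL"][0]
    print(f"WGS share of all base pairs: {wgs_share:.1%}")
```

The same function applies to any row of the table; a division with an increase of 0.00% simply returns its current size.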
Figure 2.
Growth of SARS-CoV-2 sequence data in GenBank. Each data point represents the cumulative number of records (left axis) or base pairs (right axis) at each date.
Handling long sequence records
As previously discussed (1), advances in sequencing technology now make it possible to produce very long sequences, some of which are longer than a signed 32-bit integer can represent (2 147 483 647 bp, about 2.1 Gbp). Submitters must split such records in order to submit them to GenBank. A recent example is chromosome 1 from the West African lungfish, Protopterus annectens. The total length of this chromosome is 5.26 Gbp, so in GenBank it is represented by three records: CM033073 (2.00 Gbp), CM033074 (2.00 Gbp) and CM033075 (1.26 Gbp). We encourage GenBank users and developers of products that rely on GenBank data to be aware of the implications of representing very long sequences and to consider preparing their tools for sequence lengths and feature locations that will require 64-bit integers.
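To illustrate the length limit described above, the following sketch checks whether a sequence length fits in a signed 32-bit integer and splits a longer length into consecutive pieces. The function names and the 2.00 Gbp piece size are assumptions chosen to mirror the lungfish example, the chromosome length is the rounded 5.26 Gbp figure from the text, and this is not a description of how GenBank or submitters actually choose split points.

```python
from __future__ import annotations

INT32_MAX = 2**31 - 1  # 2 147 483 647, the largest signed 32-bit integer


def fits_in_int32(length_bp: int) -> bool:
    """Return True if a sequence length can be stored in a signed 32-bit integer."""
    return 0 <= length_bp <= INT32_MAX


def split_length(total_bp: int, piece_bp: int) -> list[int]:
    """Split a total length into consecutive pieces of at most piece_bp bases."""
    if total_bp < 0:
        raise ValueError("total_bp must not be negative")
    if not 0 < piece_bp <= INT32_MAX:
        raise ValueError("piece_bp must be positive and fit in a signed 32-bit integer")
    pieces = []
    remaining = total_bp
    while remaining > 0:
        size = min(piece_bp, remaining)
        pieces.append(size)
        remaining -= size
    return pieces


if __name__ == "__main__":
    chromosome_bp = 5_260_000_000  # rounded length from the text, not the exact value
    print(fits_in_int32(chromosome_bp))                 # False
    print(split_length(chromosome_bp, 2_000_000_000))   # three pieces
```

Python integers are unbounded, so the check is explicit here; in languages with fixed-width integers the equivalent precaution is to store lengths and feature coordinates in a 64-bit type.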
RECENT DEVELOPMENTS
Updates to integer sequence identifiers
In addition to the above issue of handling very long individual sequences, GenBank is approaching a point where the number of sequences will exhaust the space of GI identifiers provided by 32-bit integers. To mitigate this, we are taking multiple approaches. First, we continue to recommend that users shift to using accession.version identifiers to refer to all GenBank data (6). Most external NCBI interfaces, including the Entrez web interface and the E-utilities API, now accept and return accession.version identifiers for all sequences. Second, we are transitioning our internal software to use 64-bit integers for GI identifiers. Once the transition occurs, GenBank users will encounter these identifiers in the XML and ASN.1 presentations of GenBank data provided through the Entrez web interface and in GenBank FTP products (https://ncbiinsights.ncbi.nlm.nih.gov/2021/09/02/64-bit-gis/). We encourage developers who rely on GenBank data to ensure that their software is capable of handling these 64-bit identifiers. Such identifiers are easy to recognize, as they are any integer greater than 2 147 483 647.
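As a rough illustration of the identifier types discussed above, the sketch below classifies a string as a GI that fits in 32 bits, a GI that requires 64 bits, or an accession.version identifier. The function name, the returned labels, and the regular expression are assumptions made for this example; in particular, the pattern is a simplified approximation and not the official definition of accession formats.

```python
import re

INT32_MAX = 2**31 - 1  # 2 147 483 647

# Simplified, assumed pattern: letters, an optional underscore, optional letters,
# digits, a dot, and a version number (for example "CM033073.1").
ACCESSION_VERSION = re.compile(r"^[A-Z]+_?[A-Z]*\d+\.\d+$")


def classify_identifier(identifier: str) -> str:
    """Label an identifier as 'gi-32bit', 'gi-64bit', 'accession.version' or 'unknown'."""
    text = identifier.strip()
    if text.isascii() and text.isdigit():
        # Any integer GI above INT32_MAX needs more than 32 bits.
        return "gi-64bit" if int(text) > INT32_MAX else "gi-32bit"
    if ACCESSION_VERSION.match(text):
        return "accession.version"
    return "unknown"


if __name__ == "__main__":
    # The version suffix ".1" is hypothetical and used only for illustration.
    for example in ["2147483647", "2147483648", "CM033073.1", "CM033073"]:
        print(example, "->", classify_identifier(example))
```

A bare accession without a version suffix is reported as 'unknown' here on purpose, since the recommendation in the text is to use the full accession.version form.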
BioProject data management
When submitters register sequencing projects in the BioProject database (https://www.ncbi.nlm.nih.gov/bioproject), we can create reliable linkages between such projects and the data they produce, and often to the BioSample database (7), which provides additional information about the biological materials used in the study. In many cases, submitters create BioProject records before they have collected all relevant data and published the results of the study. We have now made it easier for submitters to update their BioProject records with such information by offering a ‘Manage Data’ interface in the Submission Portal (https://dataview.ncbi.nlm.nih.gov/?archive=bioproject). Using this interface, submitters can add publications and grants or edit text metadata such as the BioProject title and description. We hope this will allow BioProject to better reflect the current state of these projects and provide a better service to the community.
Advice for submitters
Contextual metadata
As discussed previously (1), we continue to encourage submitters to provide contextual metadata, particularly data that specify the sampling location (e.g. country, latitude, and longitude). The importance of basic geographic information, such as the country codes displayed on public sequence records (https://insdc.org/country), has only grown with the urgency to verify and track the distribution of biodiversity in the current era. Including other data such as the isolate name or number and applicable museum/collection identifiers is also helpful. Where possible, adding links to permanent samples or vouchers at biorepositories provides access to sources with important, richly populated information. This facilitates replication and validation, while also allowing for analyses across scientific disciplines (8). GenBank has long followed the standard of structuring vouchers using Darwin Core formats (9), which allows us to link to specimen pages at external biorepositories using URLs curated in the NCBI BioCollections database (10). Recently, BioCollections added a new category, ‘digital repository’, that will include online data aggregators of collection data that do not include physical specimens. Additionally, to prepare the way for a more comprehensive treatment of these data elements and to make their presence required as part of the submission process, GenBank and the INSDC developed a set of standardized terms to clearly indicate when submitters cannot provide voucher information, for example in cases where the data were not collected or cannot be reported because of privacy concerns (https://www.insdc.org/missing-value-reporting).
In addition to the above, there are other ways in which submitters can enhance their data. Submitters can use evidence tags to provide information about supporting evidence for annotations (https://www.ncbi.nlm.nih.gov/genbank/evidence/). They can cite within their submission the accession numbers of any publicly available sequencing reads they used to improve the quality of their assemblies. When submitting prokaryotic genomes, they can create annotated genomes with NCBI’s Prokaryotic Genome Annotation Pipeline (PGAP; https://www.ncbi.nlm.nih.gov/genome/annotation_prok/) either by submitting FASTA files and requesting PGAP during submission of the genomes to GenBank or by running the public version of PGAP themselves and then submitting the GenBank-ready ASN.1 output file.
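The advice above on sampling location can be made concrete with a small helper that turns signed decimal coordinates into a hemisphere-labelled string. This is a sketch under assumptions: the function name is hypothetical, and the output layout (decimal degrees followed by N/S and E/W) reflects our understanding of the INSDC lat_lon qualifier style rather than anything specified in this article, so submitters should confirm the required format in the submission documentation.

```python
def format_lat_lon(latitude: float, longitude: float, digits: int = 4) -> str:
    """Format signed decimal degrees as '<lat> N|S <lon> E|W' with a fixed number of decimals."""
    if not -90 <= latitude <= 90:
        raise ValueError("latitude must be between -90 and 90 degrees")
    if not -180 <= longitude <= 180:
        raise ValueError("longitude must be between -180 and 180 degrees")
    ns = "N" if latitude >= 0 else "S"   # the sign becomes a hemisphere letter
    ew = "E" if longitude >= 0 else "W"
    return f"{abs(latitude):.{digits}f} {ns} {abs(longitude):.{digits}f} {ew}"


if __name__ == "__main__":
    # Arbitrary example coordinates, not taken from any GenBank record.
    print(format_lat_lon(47.94, 28.12, digits=2))
    print(format_lat_lon(-12.5, -45.25, digits=2))
```

Validating the ranges before formatting catches swapped latitude and longitude values in many cases, which is a common source of misplaced sampling locations.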
Acquiring the database
NCBI provides GenBank sequence records in both the traditional flat file format and in a structured ASN.1 format by anonymous FTP at ftp.ncbi.nlm.nih.gov/genbank. For release 245 (15 August 2021) there are 4032 files requiring 1888 GB of uncompressed disk storage. In addition, daily GenBank incremental update files containing new records and those updated since the most recent release are available in flat file format at ftp.ncbi.nlm.nih.gov/genbank/daily-nc/.
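For readers who want to inspect the release directory programmatically, the sketch below lists file names over anonymous FTP using Python's standard `ftplib` module. The helper names `split_location` and `list_files` are hypothetical, the code assumes the server accepts anonymous FTP connections as described above, and it makes no assumption about how the individual files are named.

```python
from __future__ import annotations

from ftplib import FTP


def split_location(location: str) -> tuple[str, str]:
    """Split 'host/path', as written in the text, into (host, '/path')."""
    host, _, path = location.partition("/")
    return host, "/" + path.strip("/")


def list_files(location: str, timeout: float = 60.0) -> list[str]:
    """Return the names in an anonymous-FTP directory (requires network access)."""
    host, path = split_location(location)
    with FTP(host, timeout=timeout) as ftp:
        ftp.login()      # no arguments means an anonymous login
        ftp.cwd(path)
        return ftp.nlst()


if __name__ == "__main__":
    names = list_files("ftp.ncbi.nlm.nih.gov/genbank/daily-nc/")
    print(f"{len(names)} entries in the daily incremental update directory")
```

Given the stated release size of 1888 GB uncompressed, listing the directory first and downloading only the divisions of interest is usually more practical than mirroring the whole release.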
CITING GENBANK
If you use the GenBank database in your published research, we ask that this article be cited.
Notes
Electronic addresses:
www.ncbi.nlm.nih.gov - NCBI Home Page.
gb-sub@ncbi.nlm.nih.gov - Submission of sequence data to GenBank.
update@ncbi.nlm.nih.gov - Revisions to, or notification of release of, ‘confidential’ GenBank entries.
info@ncbi.nlm.nih.gov - General information about NCBI resources.
Contributor Information
Eric W Sayers, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA.
Mark Cavanaugh, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA.
Karen Clark, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA.
Kim D Pruitt, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA.
Conrad L Schoch, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA.
Stephen T Sherry, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA.
Ilene Karsch-Mizrachi, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA.
FUNDING
Funding for open access charge: Intramural Research Program of the National Library of Medicine, National Institutes of Health.
Conflict of interest statement. None declared.
REFERENCES
- 1. Sayers E.W., Cavanaugh M., Clark K., Pruitt K.D., Schoch C.L., Sherry S.T., Karsch-Mizrachi I. GenBank. Nucleic Acids Res. 2021; 49:D92–D96.
- 2. Schaffer A.A., Hatcher E.L., Yankie L., Shonkwiler L., Brister J.R., Karsch-Mizrachi I., Nawrocki E.P. VADR: validation and annotation of virus sequence submissions to GenBank. BMC Bioinformatics. 2020; 21:211.
- 3. Brister J.R., Ako-Adjei D., Bao Y., Blinkova O. NCBI viral genomes resource. Nucleic Acids Res. 2015; 43:D571–D577.
- 4. O’Leary N.A., Wright M.W., Brister J.R., Ciufo S., Haddad D., McVeigh R., Rajput B., Robbertse B., Smith-White B., Ako-Adjei D. et al. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. 2016; 44:D733–D745.
- 5. Boratyn G.M., Camacho C., Cooper P.S., Coulouris G., Fong A., Ma N., Madden T.L., Matten W.T., McGinnis S.D., Merezhuk Y. et al. BLAST: a more efficient report with usability improvements. Nucleic Acids Res. 2013; 41:W29–W33.
- 6. Sayers E.W., Cavanaugh M., Clark K., Ostell J., Pruitt K.D., Karsch-Mizrachi I. GenBank. Nucleic Acids Res. 2020; 48:D84–D86.
- 7. Barrett T., Clark K., Gevorgyan R., Gorelenkov V., Gribov E., Karsch-Mizrachi I., Kimelman M., Pruitt K.D., Resenchuk S., Tatusova T. et al. BioProject and BioSample databases at NCBI: facilitating capture and organization of metadata. Nucleic Acids Res. 2012; 40:D57–D63.
- 8. Cook J.A., Arai S., Armien B., Bates J., Bonilla C.A.C., Cortez M.B.S., Dunnum J.L., Ferguson A.W., Johnson K.M., Khan F.A.A. et al. Integrating biodiversity infrastructure into pathogen discovery and mitigation of emerging infectious diseases. Bioscience. 2020; 70:531–534.
- 9. Wieczorek J., Bloom D., Guralnick R., Blum S., Doring M., Giovanni R., Robertson T., Vieglais D. Darwin Core: an evolving community-developed biodiversity data standard. PLoS One. 2012; 7:e29715.
- 10. Sharma S., Ciufo S., Starchenko E., Darji D., Chlumsky L., Karsch-Mizrachi I., Schoch C.L. The NCBI BioCollections Database. Database (Oxford). 2018; 2018:bay006.