Nucleic Acids Research. 2022 Nov 9;51(D1):D141–D144. doi: 10.1093/nar/gkac1012

GenBank 2023 update

Eric W Sayers, Mark Cavanaugh, Karen Clark, Kim D Pruitt, Stephen T Sherry, Linda Yankie, Ilene Karsch-Mizrachi
PMCID: PMC9825519  PMID: 36350640

Abstract

GenBank® (https://www.ncbi.nlm.nih.gov/genbank/) is a comprehensive, public database that contains 19.6 trillion base pairs from over 2.9 billion nucleotide sequences for 504 000 formally described species. Daily data exchange with the European Nucleotide Archive (ENA) and the DNA Data Bank of Japan (DDBJ) ensures worldwide coverage. Recent updates include resources for data from the SARS-CoV-2 virus, NCBI Datasets, BLAST ClusteredNR, the Submission Portal, table2asn, a Foreign Contamination Screening tool and BioSample.

INTRODUCTION

GenBank (1) is a comprehensive public database of nucleotide sequences and supporting bibliographic and biological annotations built and distributed by the National Center for Biotechnology Information (NCBI), a division of the National Library of Medicine (NLM), located on the campus of the US National Institutes of Health (NIH) in Bethesda, MD, USA. This paper begins with a summary of recent developments in GenBank over the past year, followed by a series of brief usage guidelines for submitting data to GenBank and for accessing the data. For a general overview of GenBank, we suggest readers refer to https://www.ncbi.nlm.nih.gov/genbank/.

GenBank data are collected in various divisions, the size and growth of which are shown in Table 1 and Figure 1. GenBank received almost 5 million new SARS-CoV-2 sequences over the past year, accounting for the large increase in the VRL division. Multiple complete mouse genomes (e.g. BioProject PRJEB47108) accounted for about two-thirds of the growth in the ROD division.

Table 1.

GenBank divisions

Division  Description                        Base pairs*
WGS       Whole genome shotgun data   17 511 809 676 629
TSA       Transcriptome shotgun data     511 680 950 707
PLN       Plants                         484 803 006 831
INV       Invertebrates                  269 338 221 858
VRL       Viruses                        187 366 647 663
BCT       Bacteria                       166 217 792 419
VRT       Other vertebrates               99 921 122 967
ROD       Rodents                         66 092 410 483
TLS       Targeted loci studies           43 852 280 645
EST       Expressed sequence tags         43 330 114 068
MAM       Other mammals                   41 720 029 494
PAT       Patent sequences                30 938 105 095
HTG       High-throughput genomic         27 801 878 633
GSS       Genome survey sequences         26 380 049 011
PRI       Primates                        15 619 743 253
ENV       Environmental samples            8 516 518 905
SYN       Synthetic                        8 030 787 249
PHG       Phages                           1 158 493 277
HTC       High-throughput cDNA               740 853 492
STS       Sequence tagged sites              640 923 137
UNA       Unannotated                          4 436 341

*Release 251 (8/2022).
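As a rough illustration of how the divisions in Table 1 compare, the following Python sketch sums the base-pair counts and reports each division's share of the total. The counts are copied from Table 1; the names DIVISION_BP and division_shares are ours and are not part of any NCBI tool.

```python
# Base-pair counts per division, copied from Table 1 (release 251).
DIVISION_BP = {
    "WGS": 17_511_809_676_629,
    "TSA": 511_680_950_707,
    "PLN": 484_803_006_831,
    "INV": 269_338_221_858,
    "VRL": 187_366_647_663,
    "BCT": 166_217_792_419,
    "VRT": 99_921_122_967,
    "ROD": 66_092_410_483,
    "TLS": 43_852_280_645,
    "EST": 43_330_114_068,
    "MAM": 41_720_029_494,
    "PAT": 30_938_105_095,
    "HTG": 27_801_878_633,
    "GSS": 26_380_049_011,
    "PRI": 15_619_743_253,
    "ENV": 8_516_518_905,
    "SYN": 8_030_787_249,
    "PHG": 1_158_493_277,
    "HTC": 740_853_492,
    "STS": 640_923_137,
    "UNA": 4_436_341,
}


def division_shares(counts):
    """Return (total base pairs, {division: fraction of the total})."""
    total = sum(counts.values())
    return total, {name: bp / total for name, bp in counts.items()}


total_bp, shares = division_shares(DIVISION_BP)

# Print the three largest divisions with their share of all base pairs.
largest = sorted(shares.items(), key=lambda item: item[1], reverse=True)[:3]
for name, share in largest:
    print(f"{name}: {share:.1%} of {total_bp:,} bp")
```

The shares are computed from the table rather than quoted from the release notes, so they are only as precise as the table itself.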

Figure 1.

Annual increase in base pairs (bp) for each division of GenBank in release 251 (August 2022) measured relative to GenBank release 245 (August 2021). The ‘TOTAL’ bar indicates the growth for GenBank as a whole. See Table 1 for a description of the division abbreviations.

RECENT DEVELOPMENTS

SARS-CoV-2 resources

We continue to update a customized submission portal for SARS-CoV-2 sequences (https://submit.ncbi.nlm.nih.gov/sarscov2/). This portal accepts a variety of data, from unassembled reads to annotated genomes, including sequences for single genes or partial genomes, and on average provides accessions to submitters in 2 hours. We encourage submitters to use this portal, as doing so maximizes free data access, both through the INSDC databases and through the NCBI Virus resource (2), RefSeq (3), and BLAST (4). These resources contain genomes from 60 coronaviruses and smaller sequences from over 1200 coronaviruses. We continue to collect the latest data and resources related to SARS-CoV-2 on a single landing page (https://www.ncbi.nlm.nih.gov/sars-cov-2/). Finally, NCBI Datasets (see below) provides downloads for over 1.5 million complete SARS-CoV-2 genomes (https://www.ncbi.nlm.nih.gov/datasets/coronavirus/genomes) as well as a new taxonomy page for this virus (https://www.ncbi.nlm.nih.gov/data-hub/taxonomy/2697049/).

Monkeypox sequences

In response to the ongoing outbreak of monkeypox infections, GenBank automatically detects submissions of monkeypox sequences and processes them immediately. GenBank staff are also adding annotations to all submissions of monkeypox virus genomes to accelerate data availability. In the past year, GenBank received 1200 monkeypox sequences, increasing their number by 270%.

NCBI Datasets

NCBI Datasets (https://www.ncbi.nlm.nih.gov/datasets/) allows users to download complex genomic datasets easily using a web interface, an API, or a UNIX/Linux command-line tool. NCBI Datasets now provides a new genome table interface (https://www.ncbi.nlm.nih.gov/data-hub/genome/) that allows users to filter, view, and download data from multiple species or other taxonomic nodes. This interface returns genomic data in GenBank and includes a filter for genomes annotated by their GenBank submitter.

BLAST ClusteredNR

The web interface for protein BLAST now offers the ClusteredNR database, a collection of sequences derived from the standard protein nr database by creating clusters of sequences containing members that are >90% identical to one another and that are within 90% of the length of the longest member. Searches using this database are faster and allow users to explore taxonomic diversity much more easily. The BLAST results show a representative sequence from each cluster that is generally well-annotated and that indicates the function of the protein.
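The two clustering criteria described above can be sketched in Python as a check on a proposed cluster. This is an illustration under stated assumptions, not the procedure used to build ClusteredNR: we read 'within 90% of the length of the longest member' as 'at least 90% as long as the longest member', we assume pairwise identity is supplied by the caller as a fraction, and the names Protein and satisfies_cluster_criteria are hypothetical.

```python
from dataclasses import dataclass
from itertools import combinations

IDENTITY_THRESHOLD = 0.90      # members must be more than 90% identical
LENGTH_RATIO_THRESHOLD = 0.90  # assumed reading: at least 90% of the longest length


@dataclass(frozen=True)
class Protein:
    accession: str  # hypothetical identifier, for illustration only
    length: int     # sequence length in residues


def satisfies_cluster_criteria(members, identity):
    """Check a non-empty list of proteins against both criteria.

    identity is a caller-supplied function (a, b) -> fraction identical;
    computing it (for example by alignment) is outside this sketch.
    """
    longest = max(protein.length for protein in members)
    # Length criterion: no member may be much shorter than the longest one.
    if any(p.length < LENGTH_RATIO_THRESHOLD * longest for p in members):
        return False
    # Identity criterion: every pair of members must exceed the threshold.
    return all(
        identity(a, b) > IDENTITY_THRESHOLD
        for a, b in combinations(members, 2)
    )
```

For example, two proteins of 100 and 95 residues that are 95% identical would pass both checks, whereas replacing the second with an 80-residue protein would fail the length criterion.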

Submission portal

As part of our ongoing effort to centralize submission workflows, we are shifting submission workflows for eukaryotic nuclear mRNA sequences from BankIt to the Submission Portal (https://submit.ncbi.nlm.nih.gov). An interactive wizard will simplify the process and will ultimately give submitters more control over release dates and a greater ability to edit previous submissions. The initial beta release will be most appropriate for small submissions (e.g. <10 sequences). Submitters needing to submit more sequences may continue to use BankIt.

Table2asn

The command-line tool table2asn (https://www.ncbi.nlm.nih.gov/genbank/table2asn/) is a new utility for preparing GenBank submissions and replaces the older tool tbl2asn. This new tool is not only more efficient but also offers additional functions, such as accepting annotations in GenBank-format GFF files. The above web page includes a comparison of argument values between the two tools along with links to download table2asn and access complete documentation. We encourage current users of tbl2asn to migrate to table2asn.

Foreign contamination screening

In support of the Comparative Genomic Resource (CGR, https://www.ncbi.nlm.nih.gov/data-hub/cgr/data-quality-tools/), NCBI released a beta version of a Foreign Contamination Screening (FCS) tool (https://github.com/ncbi/fcs) that can assist GenBank submitters in improving the quality of their submitted data. The tool consists of two components: FCS-adaptor, which detects adaptor and vector contamination, and FCS-GX, which detects contamination from organisms other than the intended biological source. We have incorporated the FCS-GX component into the standard processing of new genome submissions to GenBank. Both components are available as Docker or Singularity images, and detailed instructions are provided on the pages linked above. We hope to release additional FCS updates in the coming year.

BioSample

BioSample recently released several new packages that we highlight here. BioSample packages serve as templates for types of BioSample records and specify the attributes that such records must contain (https://submit.ncbi.nlm.nih.gov/biosample/template/). The ‘Pathogen’ package standardizes samples of pathogenic organisms from either the clinic or the environment, such as an outbreak of food contamination. Two SARS-CoV-2 packages are available: one for clinical samples and one for samples from wastewater surveillance. We encourage submitters to check the above page for available packages that may support their submissions.

USING GENBANK

Advice for submitters

Notifying GenBank of published data

It has been a longstanding policy that GenBank will, upon request from the submitter, withhold the release of new sequence submissions either until the date that associated research is published or until a release date specified by the submitters, whichever occurs first. Submitters may set the length of these delays. To avoid additional delays in releasing data, when such research is published we urge submitters to send the full publication details to GenBank at update@ncbi.nlm.nih.gov.

Contextual metadata

As discussed previously (1), we continue to encourage submitters to provide contextual metadata, particularly data that specify the sampling location (e.g. country, latitude, and longitude) and collection date. Such reporting is becoming a requirement under the Spatio-temporal annotation policy that the INSDC announced in late 2021 (https://www.insdc.org/news/spatio-temporal-annotation-policy-18-11-2021/). We expect that this policy will be phased in during 2023 for BioSample data and genome submissions. The importance of basic geographic information, such as the country codes displayed on public sequence records (https://insdc.org/country), has only grown with the urgency to verify and track the distribution of biodiversity. Including other data such as the isolate name or number and applicable museum/collection identifiers is also helpful.
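Submitters who prepare metadata programmatically may find it useful to check for these items before submission. The Python sketch below flags records that lack a sampling location or collection date. The field names (country, latitude, longitude, collection_date) are hypothetical keys chosen for illustration; they are not taken from a BioSample package or from the INSDC policy text, and the sketch checks only for presence, not for valid formats.

```python
# Hypothetical field names chosen for illustration; they are not taken from
# any NCBI submission template.
REQUIRED_CONTEXT_FIELDS = ("country", "latitude", "longitude", "collection_date")


def missing_context_fields(record):
    """Return the contextual fields that are absent or blank in a record.

    record is a plain dict describing one sample. Only presence is checked;
    validating the values themselves is outside this sketch.
    """
    missing = []
    for field in REQUIRED_CONTEXT_FIELDS:
        value = record.get(field)
        if value is None or str(value).strip() == "":
            missing.append(field)
    return missing
```

A record that supplies all four values yields an empty list, whereas a record with only a country would be reported as missing its latitude, longitude, and collection date.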

Additional best practices

We encourage submitters to use evidence tags to provide information about supporting evidence for annotations (https://www.ncbi.nlm.nih.gov/genbank/evidence/), and to cite within their submission the accession numbers of any publicly available sequencing reads they used to improve the quality of their assemblies. Submitting prokaryotic genomes is more efficient when submitters create annotated genomes with NCBI’s Prokaryotic Genome Annotation Pipeline (PGAP; https://www.ncbi.nlm.nih.gov/genome/annotation_prok/).

Citing GenBank accessions

As described previously (5), when citing a given GenBank record in a publication, the best practice is to use the full accession.version identifier for the record (e.g. AF123456.2). Since citing only the accession portion (e.g. AF123456) retrieves the current version, including the version suffix ensures clarity about which version of the record the authors are referencing.
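The distinction between the two forms of identifier can be illustrated with a short Python sketch that splits an identifier into its accession and version parts. The function names and the permissive pattern are our own assumptions for illustration; the pattern does not encode the full set of GenBank accession formats.

```python
import re

# Permissive pattern: an accession of letters, digits, or underscores,
# optionally followed by ".<version>". This is an illustrative assumption,
# not the official definition of a GenBank accession.
_IDENTIFIER_RE = re.compile(r"([A-Za-z0-9_]+)(?:\.([0-9]+))?")


def parse_identifier(identifier):
    """Split 'AF123456.2' into ('AF123456', 2); version is None if absent."""
    match = _IDENTIFIER_RE.fullmatch(identifier.strip())
    if match is None:
        raise ValueError(f"not an accession or accession.version: {identifier!r}")
    accession, version = match.groups()
    return accession, (int(version) if version is not None else None)


def has_version(identifier):
    """Return True when the identifier pins a specific version of a record."""
    return parse_identifier(identifier)[1] is not None
```

Applied to the examples above, AF123456.2 parses to the accession AF123456 with version 2, while AF123456 alone carries no version and so would be flagged as ambiguous for citation purposes.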

Acquiring the database

NCBI provides GenBank sequence records in both the traditional flat file format and a structured ASN.1 format by anonymous FTP at ftp.ncbi.nlm.nih.gov/genbank. For release 251 (15 August 2022) there are 5836 files requiring 2723 GB of uncompressed disk storage. In addition, daily GenBank incremental update files containing new records and those updated since the most recent release are available in flat file format at ftp.ncbi.nlm.nih.gov/genbank/daily-nc/.
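A minimal Python sketch of browsing these directories by anonymous FTP follows, using only the standard-library ftplib module. The host and directory paths are those given above; the function names are ours, and the average file size is a rough estimate derived from the release 251 figures quoted in this section.

```python
from ftplib import FTP

FTP_HOST = "ftp.ncbi.nlm.nih.gov"
RELEASE_DIR = "/genbank"         # full release files
DAILY_DIR = "/genbank/daily-nc"  # daily incremental updates (flat file format)


def list_directory(directory, host=FTP_HOST, timeout=60):
    """Return the names in a directory on the FTP server (anonymous login)."""
    with FTP(host, timeout=timeout) as ftp:
        ftp.login()  # no arguments means an anonymous login
        return ftp.nlst(directory)


def average_file_size_gb(total_gb, n_files):
    """Rough mean uncompressed size per file, for planning disk space."""
    return total_gb / n_files


# Figures for release 251 as quoted in the text above.
RELEASE_251_AVG_GB = average_file_size_gb(2723, 5836)
```

Calling list_directory(DAILY_DIR) requires network access and returns whatever update files are currently posted. The release 251 figures work out to a little under 0.5 GB per file on average, although individual files will differ.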

Contributor Information

Eric W Sayers, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA.

Mark Cavanaugh, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA.

Karen Clark, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA.

Kim D Pruitt, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA.

Stephen T Sherry, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA.

Linda Yankie, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA.

Ilene Karsch-Mizrachi, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA.

MAILING ADDRESS

GenBank, National Center for Biotechnology Information, Building 45, Room 6AN12D-37, 45 Center Drive, Bethesda, MD 20892, USA.

ELECTRONIC ADDRESSES

www.ncbi.nlm.nih.gov: NCBI Home Page.

gb-sub@ncbi.nlm.nih.gov: Submission of sequence data to GenBank.

update@ncbi.nlm.nih.gov: Revisions to, or notification of release of, ‘confidential’ GenBank entries.

info@ncbi.nlm.nih.gov: General information about NCBI resources.

CITING GENBANK

If you use the GenBank database in your published research, we ask that this article be cited.

FUNDING

Funding for open access charge: National Center for Biotechnology Information of the National Library of Medicine, National Institutes of Health.

Conflict of interest statement. None declared.

REFERENCES

  • 1. Sayers E.W., Cavanaugh M., Clark K., Pruitt K.D., Schoch C.L., Sherry S.T., Karsch-Mizrachi I. GenBank. Nucleic Acids Res. 2021; 49:D92–D96.
  • 2. Brister J.R., Ako-Adjei D., Bao Y., Blinkova O. NCBI viral genomes resource. Nucleic Acids Res. 2015; 43:D571–D577.
  • 3. O’Leary N.A., Wright M.W., Brister J.R., Ciufo S., Haddad D., McVeigh R., Rajput B., Robbertse B., Smith-White B., Ako-Adjei D. et al. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. 2016; 44:D733–D745.
  • 4. Boratyn G.M., Camacho C., Cooper P.S., Coulouris G., Fong A., Ma N., Madden T.L., Matten W.T., McGinnis S.D., Merezhuk Y. et al. BLAST: a more efficient report with usability improvements. Nucleic Acids Res. 2013; 41:W29–W33.
  • 5. Sayers E.W., Cavanaugh M., Clark K., Ostell J., Pruitt K.D., Karsch-Mizrachi I. GenBank. Nucleic Acids Res. 2020; 48:D84–D86.
