GenBank 2023 update

Eric W Sayers; Mark Cavanaugh; Karen Clark; Kim D Pruitt; Stephen T Sherry; Linda Yankie; Ilene Karsch-Mizrachi

doi:10.1093/nar/gkac1012

Nucleic Acids Res. 2022 Nov 9;51(D1):D141–D144. doi: 10.1093/nar/gkac1012

PMCID: PMC9825519 PMID: 36350640

Abstract

GenBank® (https://www.ncbi.nlm.nih.gov/genbank/) is a comprehensive, public database that contains 19.6 trillion base pairs from over 2.9 billion nucleotide sequences for 504 000 formally described species. Daily data exchange with the European Nucleotide Archive (ENA) and the DNA Data Bank of Japan (DDBJ) ensures worldwide coverage. Recent updates include resources for data from the SARS-CoV-2 virus, NCBI Datasets, BLAST ClusteredNR, the Submission Portal, table2asn, a Foreign Contamination Screening tool and BioSample.

INTRODUCTION

GenBank (1) is a comprehensive public database of nucleotide sequences and supporting bibliographic and biological annotations built and distributed by the National Center for Biotechnology Information (NCBI), a division of the National Library of Medicine (NLM), located on the campus of the US National Institutes of Health (NIH) in Bethesda, MD, USA. This paper begins with a summary of recent developments in GenBank over the past year, followed by a series of brief usage guidelines for submitting data to GenBank and for accessing the data. For a general overview of GenBank, we suggest readers refer to https://www.ncbi.nlm.nih.gov/genbank/.

GenBank data are collected in various divisions, the size and growth of which are shown in Table 1 and Figure 1. GenBank received almost 5 million new SARS-CoV-2 sequences over the past year, accounting for the large increase in the VRL division. Multiple complete mouse genomes (e.g. BioProject PRJEB47108) accounted for about two-thirds of the growth in the ROD division.

Table 1.

GenBank divisions

Division	Description	Base pairs*
WGS	Whole genome shotgun data	17 511 809 676 629
TSA	Transcriptome shotgun data	511 680 950 707
PLN	Plants	484 803 006 831
INV	Invertebrates	269 338 221 858
VRL	Viruses	187 366 647 663
BCT	Bacteria	166 217 792 419
VRT	Other vertebrates	99 921 122 967
ROD	Rodents	66 092 410 483
TLS	Targeted loci studies	43 852 280 645
EST	Expressed sequence tags	43 330 114 068
MAM	Other mammals	41 720 029 494
PAT	Patent sequences	30 938 105 095
HTG	High-throughput genomic	27 801 878 633
GSS	Genome survey sequences	26 380 049 011
PRI	Primates	15 619 743 253
ENV	Environmental samples	8 516 518 905
SYN	Synthetic	8 030 787 249
PHG	Phages	1 158 493 277
HTC	High-throughput cDNA	740 853 492
STS	Sequence tagged sites	640 923 137
UNA	Unannotated	4 436 341

*Release 251 (8/2022).
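As a quick check on the figures in Table 1, the following minimal sketch sums the per-division counts and computes the share held by the WGS division. The values are copied from the table; the names `BASE_PAIRS` and `parse_count` are illustrative only and are not part of any NCBI tool. Summed this way, the listed divisions come to roughly 19.5 trillion base pairs, with WGS accounting for close to 90% of the total.

```python
# Base-pair counts copied from Table 1 (release 251), keyed by division code.
BASE_PAIRS = {
    "WGS": "17 511 809 676 629",
    "TSA": "511 680 950 707",
    "PLN": "484 803 006 831",
    "INV": "269 338 221 858",
    "VRL": "187 366 647 663",
    "BCT": "166 217 792 419",
    "VRT": "99 921 122 967",
    "ROD": "66 092 410 483",
    "TLS": "43 852 280 645",
    "EST": "43 330 114 068",
    "MAM": "41 720 029 494",
    "PAT": "30 938 105 095",
    "HTG": "27 801 878 633",
    "GSS": "26 380 049 011",
    "PRI": "15 619 743 253",
    "ENV": "8 516 518 905",
    "SYN": "8 030 787 249",
    "PHG": "1 158 493 277",
    "HTC": "740 853 492",
    "STS": "640 923 137",
    "UNA": "4 436 341",
}


def parse_count(text: str) -> int:
    """Convert a whitespace-grouped count such as '4 436 341' to an int."""
    return int("".join(text.split()))


counts = {division: parse_count(value) for division, value in BASE_PAIRS.items()}
total = sum(counts.values())
wgs_share = counts["WGS"] / total

print(f"Total across divisions: {total:,} bp")
print(f"WGS share of total: {wgs_share:.1%}")
```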

Figure 1. Annual increase in base pairs (bp) for each division of GenBank in release 251 (August 2022) measured relative to GenBank release 245 (August 2021). The ‘TOTAL’ bar indicates the growth for GenBank as a whole. See Table 1 for a description of the division abbreviations.

RECENT DEVELOPMENTS

SARS-CoV-2 resources

We continue to update a customized submission portal for SARS-CoV-2 sequences (https://submit.ncbi.nlm.nih.gov/sarscov2/). This portal accepts a variety of data from unassembled reads to annotated genomes, including sequences for single genes or partial genomes, and provides accessions to submitters in 2 hours on average. We encourage submitters to use this portal, as doing so maximizes free data access, not only through the INSDC databases but also through the NCBI Virus resource (2), RefSeq (3), and BLAST (4). These resources contain genomes from 60 coronaviruses and smaller sequences from over 1200 coronaviruses. We continue to collect the latest data and resources related to SARS-CoV-2 on a single landing page (https://www.ncbi.nlm.nih.gov/sars-cov-2/). Finally, NCBI Datasets (see below) provides downloads for over 1.5 million complete SARS-CoV-2 genomes (https://www.ncbi.nlm.nih.gov/datasets/coronavirus/genomes) as well as a new taxonomy page for this virus (https://www.ncbi.nlm.nih.gov/data-hub/taxonomy/2697049/).

Monkeypox sequences

In response to the ongoing outbreak of monkeypox infections, GenBank automatically detects submissions of monkeypox sequences and processes them immediately. GenBank staff are also adding annotations to all submissions of monkeypox virus genomes to accelerate data availability. In the past year GenBank received 1200 monkeypox sequences, increasing the number of monkeypox sequences by 270%.

NCBI Datasets

NCBI Datasets (https://www.ncbi.nlm.nih.gov/datasets/) allows users to download complex genomic datasets easily using a web interface, an API, or a UNIX/LINUX command-line tool. NCBI Datasets now provides a new genome table interface (https://www.ncbi.nlm.nih.gov/data-hub/genome/) that allows users to filter, view, and download data from multiple species or other taxonomic nodes. This interface returns genomic data in GenBank, and includes a filter for genomes annotated by their GenBank submitter.

BLAST ClusteredNR

The web interface for protein BLAST now offers the ClusteredNR database, a collection of sequences derived from the standard protein nr database by creating clusters of sequences containing members that are >90% identical to one another and that are within 90% of the length of the longest member. Searches using this database are faster and allow users to explore taxonomic diversity much more easily. The BLAST results show a representative sequence from each cluster that is generally well-annotated and that indicates the function of the protein.
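The two clustering thresholds described above can be written as a simple predicate. This is only a sketch of the stated criteria, not NCBI's clustering procedure: the function name `fits_cluster` and its parameters are hypothetical, identity is assumed to be supplied as a fraction, and "within 90% of the length of the longest member" is read here as "at least 90% as long as the longest member".

```python
def fits_cluster(
    identity: float,
    length: int,
    longest_length: int,
    min_identity: float = 0.90,
    min_length_ratio: float = 0.90,
) -> bool:
    """Return True if a sequence meets both thresholds described in the text.

    identity        fractional identity to the other cluster members (assumed input)
    length          length of the candidate sequence
    longest_length  length of the longest member of the cluster
    """
    if longest_length <= 0:
        raise ValueError("longest_length must be positive")
    # Identity must exceed 90% (strictly greater, following the '>90%' wording).
    identical_enough = identity > min_identity
    # Assumed reading: the candidate is at least 90% as long as the longest member.
    long_enough = length >= min_length_ratio * longest_length
    return identical_enough and long_enough


print(fits_cluster(identity=0.95, length=95, longest_length=100))
print(fits_cluster(identity=0.95, length=80, longest_length=100))
```

Under these assumptions, the first call satisfies both thresholds and the second fails on length.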

Submission portal

As part of our ongoing effort to centralize submission workflows, we are shifting submission workflows for eukaryotic nuclear mRNA sequences from BankIt to the Submission Portal (https://submit.ncbi.nlm.nih.gov). An interactive wizard will simplify the process, and will ultimately allow submitters to have more control over release dates along with more abilities to edit previous submissions. The initial beta release will be most appropriate for small submissions (e.g. <10 sequences). Submitters needing to submit more sequences may continue to use BankIt.

Table2asn

The command-line tool table2asn (https://www.ncbi.nlm.nih.gov/genbank/table2asn/) is a new utility for preparing GenBank submissions and replaces the older tool tbl2asn. This new tool not only is more efficient but also offers additional functions, such as accepting annotations in GenBank-format GFF files. The above web page includes a comparison of argument values between the two tools along with links to download table2asn and access complete documentation. We encourage current users of tbl2asn to migrate to table2asn.

Foreign contamination screening

In support of the Comparative Genomic Resource (CGR, https://www.ncbi.nlm.nih.gov/data-hub/cgr/data-quality-tools/), NCBI released a beta version of a Foreign Contamination Screening (FCS) tool (https://github.com/ncbi/fcs) that can assist GenBank submitters in improving the quality of their submitted data. The tool consists of two components: FCS-adaptor, which detects adaptor and vector contamination, and FCS-GX, which detects contamination from organisms not arising from the intended biological source. We have incorporated the FCS-GX component into the standard processing of new genome submissions to GenBank. Both components are available as Docker or Singularity images, and detailed instructions are provided on the pages linked above. We hope to release additional FCS updates in the coming year.

BioSample

BioSample recently released several new packages that we highlight here. BioSample packages serve as templates for types of BioSample records and specify the attributes that such records must contain (https://submit.ncbi.nlm.nih.gov/biosample/template/). The ‘Pathogen’ package standardizes samples of pathogenic organisms from either the clinic or the environment, such as an outbreak of food contamination. Two SARS-CoV-2 packages are available: one for clinical samples and one for samples from wastewater surveillance. We encourage submitters to check the above page for available packages that may support their submissions.

USING GENBANK

Advice for submitters

Notifying GenBank of published data

It has been a longstanding policy that GenBank will, upon request from the submitter, withhold the release of new sequence submissions either until the date that associated research is published or until a release date specified by the submitters, whichever occurs first. Submitters may set the length of these delays. To avoid additional delays in releasing data, when such research is published we urge submitters to send the full publication details to GenBank at update@ncbi.nlm.nih.gov.

Contextual metadata

As discussed previously (1), we continue to encourage submitters to provide contextual metadata, particularly data that specifies the sampling location (e.g. country, latitude, and longitude) and collection date. Such reporting is becoming required under the Spatio-temporal annotation policy that the INSDC announced in late 2021 (https://www.insdc.org/news/spatio-temporal-annotation-policy-18-11-2021/). We expect that this policy will be phased in during 2023 for BioSample data and genome submissions. The importance of such basic geographic information, such as country codes displayed on public sequence records (https://insdc.org/country), has only grown with the urgency to verify and track distribution of biodiversity. Including other data such as the isolate name or number and applicable museum/collection identifiers is also helpful.

Additional best practices

We encourage submitters to use evidence tags to provide information about supporting evidence for annotations (https://www.ncbi.nlm.nih.gov/genbank/evidence/), and to cite within their submission the accession numbers of any publicly available sequencing reads they used to improve the quality of their assemblies. Submitting prokaryotic genomes is more efficient when submitters create annotated genomes with NCBI’s Prokaryotic Genome Annotation Pipeline (PGAP; https://www.ncbi.nlm.nih.gov/genome/annotation_prok/).

Citing GenBank accessions

As described previously (5), when citing a given GenBank record in a publication, the best practice is to use the full accession.version identifier for the record (e.g. AF123456.2). Since citing only the accession portion (e.g. AF123456) retrieves the current version, including the version suffix ensures clarity about which version of the record the authors are referencing.
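To illustrate the difference between the two citation forms, the sketch below splits an identifier into its accession and version parts and reports whether it pins a specific version. The pattern is deliberately permissive and is an assumption made for this example, not the official accession grammar; `AccessionRef`, `parse_accession`, and `is_version_specific` are hypothetical names.

```python
import re
from typing import NamedTuple, Optional


class AccessionRef(NamedTuple):
    accession: str
    version: Optional[int]  # None when no version suffix was given


# Permissive illustrative pattern: an alphanumeric accession, optionally
# followed by '.' and an integer version. Not the official format definition.
_ACCESSION_PATTERN = re.compile(r"([A-Za-z0-9_]+)(?:\.(\d+))?")


def parse_accession(identifier: str) -> AccessionRef:
    """Split 'AF123456.2' into ('AF123456', 2); 'AF123456' gives version None."""
    match = _ACCESSION_PATTERN.fullmatch(identifier.strip())
    if match is None:
        raise ValueError(f"not an accession or accession.version: {identifier!r}")
    accession, version = match.groups()
    return AccessionRef(accession, int(version) if version is not None else None)


def is_version_specific(identifier: str) -> bool:
    """True when the identifier pins a specific version of the record."""
    return parse_accession(identifier).version is not None


print(parse_accession("AF123456.2"))
print(is_version_specific("AF123456"))
```

A check like `is_version_specific` could be run over the accessions in a manuscript to flag any that omit the version suffix.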

Acquiring the database

NCBI provides GenBank sequence records in both the traditional flat file format and in a structured ASN.1 format by anonymous FTP at ftp.ncbi.nlm.nih.gov/genbank. For release 251 (15 August 2022) there are 5836 files requiring 2723 GB of uncompressed disk storage. In addition, daily GenBank incremental update files containing new records and those updated since the most recent release are available in flat file format at ftp.ncbi.nlm.nih.gov/genbank/daily-nc/.
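As a minimal sketch of anonymous FTP access, the following uses Python's standard `ftplib` to list the files in one of the directories named above. The host and directory paths come from the text; the helper names and the `.seq.gz` suffix used in the usage note are assumptions for illustration, so check the directory's README for the actual file naming before relying on them. Nothing is downloaded, and the listing function needs network access when called.

```python
from ftplib import FTP
from typing import Iterable, List

GENBANK_HOST = "ftp.ncbi.nlm.nih.gov"  # host named in the text
RELEASE_DIR = "/genbank"               # full-release files
DAILY_DIR = "/genbank/daily-nc"        # daily incremental updates


def filter_by_suffix(names: Iterable[str], suffix: str) -> List[str]:
    """Keep only the names ending with the suffix (an empty suffix keeps all)."""
    return [name for name in names if name.endswith(suffix)]


def list_remote_files(directory: str, suffix: str = "") -> List[str]:
    """List file names in a directory on the FTP host via anonymous login.

    Requires network access; raises ftplib or OSError exceptions on failure.
    """
    with FTP(GENBANK_HOST, timeout=60) as ftp:
        ftp.login()          # no arguments means anonymous login
        ftp.cwd(directory)
        return filter_by_suffix(ftp.nlst(), suffix)
```

For example, `list_remote_files(RELEASE_DIR, ".seq.gz")` would return the compressed flat files of the current release, assuming they carry that suffix, and `list_remote_files(DAILY_DIR)` would list the incremental update files.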

Contributor Information

Eric W Sayers, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA.

Mark Cavanaugh, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA.

Karen Clark, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA.

Kim D Pruitt, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA.

Stephen T Sherry, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA.

Linda Yankie, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA.

Ilene Karsch-Mizrachi, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA.

MAILING ADDRESS

GenBank, National Center for Biotechnology Information, Building 45, Room 6AN12D-37, 45 Center Drive, Bethesda, MD 20892, USA.

ELECTRONIC ADDRESSES

www.ncbi.nlm.nih.gov: NCBI Home Page.

gb-sub@ncbi.nlm.nih.gov: Submission of sequence data to GenBank.

update@ncbi.nlm.nih.gov: Revisions to, or notification of release of, ‘confidential’ GenBank entries.

info@ncbi.nlm.nih.gov: General information about NCBI resources.

CITING GENBANK

If you use the GenBank database in your published research, we ask that this article be cited.

FUNDING

Funding for open access charge: National Center for Biotechnology Information of the National Library of Medicine, National Institutes of Health.

Conflict of interest statement. None declared.

REFERENCES

1. Sayers E.W., Cavanaugh M., Clark K., Pruitt K.D., Schoch C.L., Sherry S.T., Karsch-Mizrachi I. GenBank. Nucleic Acids Res. 2021; 49:D92–D96.
2. Brister J.R., Ako-Adjei D., Bao Y., Blinkova O. NCBI viral genomes resource. Nucleic Acids Res. 2015; 43:D571–D577.
3. O’Leary N.A., Wright M.W., Brister J.R., Ciufo S., Haddad D., McVeigh R., Rajput B., Robbertse B., Smith-White B., Ako-Adjei D. et al. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. 2016; 44:D733–D745.
4. Boratyn G.M., Camacho C., Cooper P.S., Coulouris G., Fong A., Ma N., Madden T.L., Matten W.T., McGinnis S.D., Merezhuk Y. et al. BLAST: a more efficient report with usability improvements. Nucleic Acids Res. 2013; 41:W29–W33.
5. Sayers E.W., Cavanaugh M., Clark K., Ostell J., Pruitt K.D., Karsch-Mizrachi I. GenBank. Nucleic Acids Res. 2020; 48:D84–D86.

