Author manuscript; available in PMC: 2015 Mar 5.
Published in final edited form as: Curr Protoc Bioinformatics. 2010 Dec;0 9:Unit–9.13. doi: 10.1002/0471250953.bi0913s32

Table A.1B.2.

A Summary of Fields Commonly Found in GenBank Records (Fig. A.1B.2)

Field Identifier(s) in Figure A.1B.2 Contents
LOCUS 1a: Locus name Although the locus name was originally intended to identify
similar sequences, it no longer carries such significance. Each
GenBank file has a unique locus name. Often, it is either the
first letter of the genus and species followed by the accession
number, or simply the GenBank accession number of the file.
1b: Sequence length The number of nucleotide base pairs (bp) or amino acid
residues (aa) in the gene or gene product.
1c: Molecule type Identifies the type of sequence found in a particular file.
Possibilities include: genomic DNA, genomic RNA, precursor
RNA, mRNA, rRNA, tRNA, small nuclear RNA, and
cytoplasmic RNA.
1d: Molecular topology The molecule's expected topology. The options are linear and
circular.
1e: GenBank division Each GenBank sequence is currently classified in one of the
following 17 subdivisions: PRI, primates; ROD, rodents; MAM,
mammals (excluding primates and rodents); VRT, vertebrates
(excluding mammals); INV, invertebrates; PLN, plants, fungi,
and algae; BCT, bacteria; VRL, viral; PHG, bacteriophages;
SYN, synthetic; UNA, unannotated; EST, expressed sequence
tag; PAT, patent sequence; STS, sequence tagged sites; GSS,
genome survey sequence; HTG, high-throughput genomic
sequence; HTC, unfinished high-throughput cDNA sequence.
Note that the organismal subdivisions do not coincide with the
current NCBI taxonomy. They are purely historical.
1f: Modification date Indicates when the file was last revised.
DEFINITION 2 A brief description of the sequence, including the organism
source and the gene or protein name.
ACCESSION 3 A unique, stable identifier for the particular file, which is
usually a combination of one or two letters with five or six
digits.
VERSION 4 Allows users to track multiple incarnations of a given
sequence. The version number is the accession number
concatenated with a period and a number. For the first version
of a particular accession, the number following the period is
set to 1. Each time the sequence data are modified, the number
following the period is incremented by 1. The example shown
in Figure A.1B.2 is the first version of accession number
M93361.
For nucleotide sequence files, this field also contains a
GenInfo Identifier (GI). This number uniquely identifies
each nucleotide sequence in GenBank, so two sequences that
differ by even a single nucleotide receive different GI numbers.
Note that, unlike the accession number for a file, the GI
number may change when the sequence is revised.
KEYWORDS 5 A word or phrase describing the sequence. Although
frequently found in older GenBank records, this field is
generally not present in more recent GenBank files.
SOURCE 6 The first line is a free-format description of the source
organism, followed by the molecule type. The subsequent lines
contain the subfield ORGANISM, which has the complete
scientific name of the source organism and its phylogenetic
classification as given by the NCBI Taxonomy Database.
REFERENCE 7 Publications by the authors of the GenBank entry that discuss
the molecule. Multiple publications may be listed in
chronological order, ending with the most recent. Each
reference entry will contain subfields (e.g., AUTHORS, TITLE,
JOURNAL, MEDLINE) that are appropriate for the particular
publication type.
FEATURES 8 This is essentially a concise summary of the gene or protein
annotation. It offers a list of genes, gene products, and regions
of biological interest that have been identified within the
reported sequence. The first subfield in each FEATURES list is
the source subfield, which contains the length of the
sequence, the scientific name of the source organism, and the
taxon ID number. Additional subfields are given—e.g., gene,
promoter, TATA signal,
5′ UTR, 3′ UTR, and coding sequence (CDS)—depending on
the features within the sequence. For each feature, the
GenBank record provides its location within the sequence and
other pertinent information (e.g., the product or gene name,
possible function, and protein translation).
BASE COUNT 9 The count of each nucleotide base (adenine, cytosine,
guanine, and thymine or uracil) within the sequence.
ORIGIN 10 This field is often left blank. In older records, it may contain
the experimentally derived restriction cleavage site. Note that
the ORIGIN field should be included in every GenBank
record, even if it contains no information. Most parsers look
for the sequence on the first line after the word ORIGIN.
11 The sequence data, printed 60 bases (or residues) per line.
Each line is split into six groups of ten bases, with the
groups separated by spaces. The sequence ends with two
slashes (//).
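
The VERSION, BASE COUNT, and sequence conventions summarized in the table can be illustrated with a short Python sketch. It is a minimal example under stated assumptions, not a full GenBank parser: the function names, the GI value 123456, and the repeated acgtacgtac sequence are made up for illustration and do not come from Figure A.1B.2. It also assumes that each sequence line starts with a position number, which is why only alphabetic characters are kept.

```python
import re
from collections import Counter

# Hypothetical record fragment: the GI number and the sequence are invented.
SAMPLE = """\
VERSION     M93361.1  GI:123456
ORIGIN
        1 acgtacgtac acgtacgtac acgtacgtac acgtacgtac acgtacgtac acgtacgtac
       61 acgtacgtac
//
""".splitlines()


def parse_version(line):
    """Split a VERSION line into accession, version number, and optional GI."""
    match = re.match(r"VERSION\s+([A-Z_]+\d+)\.(\d+)(?:\s+GI:(\d+))?", line)
    if match is None:
        raise ValueError("not a VERSION line: %r" % line)
    accession, version, gi = match.groups()
    return {
        "accession": accession,
        "version": int(version),
        "gi": int(gi) if gi is not None else None,
    }


def read_sequence(record_lines):
    """Collect the sequence between the ORIGIN line and the closing //."""
    chunks = []
    in_sequence = False
    for line in record_lines:
        if line.startswith("ORIGIN"):
            in_sequence = True  # sequence starts on the next line
            continue
        if line.startswith("//"):
            break  # two slashes end the sequence
        if in_sequence:
            # Keep letters only, dropping position numbers and group spaces.
            chunks.append("".join(ch for ch in line if ch.isalpha()))
    return "".join(chunks)


def base_count(sequence):
    """Count how often each base letter occurs, as in the BASE COUNT field."""
    return dict(Counter(sequence.lower()))


if __name__ == "__main__":
    print(parse_version(SAMPLE[0]))
    sequence = read_sequence(SAMPLE)
    print(len(sequence), base_count(sequence))
```

For real records, an established parser such as Biopython's SeqIO module is likely a better choice than hand-written code like this.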