Author manuscript; available in PMC: 2015 Mar 5.

Published in final edited form as: Curr Protoc Bioinformatics. 2010 Dec;0 9:Unit–9.13. doi: 10.1002/0471250953.bi0913s32

Table A.1B.2.

A Summary of Fields Commonly Found in GenBank Records (Fig. A.1B.2)

Field	Identifier(s) in Figure A.1B.2	Contents
`LOCUS`	1a: Locus name	Although the locus name was originally intended to identify similar sequences, it no longer carries such significance. Each GenBank file has a unique locus name. Often, it is either the first letter of the genus and species followed by the accession number, or simply the GenBank accession number of the file.
	1b: Sequence length	The number of nucleotide base pairs (bp) or amino acid residues (aa) in the gene or gene product.
	1c: Molecule type	Identifies the type of sequence found in a particular file. Possibilities include: genomic DNA, genomic RNA, precursor RNA, mRNA, rRNA, tRNA, small nuclear RNA, and cytoplasmic RNA.
	1d: Molecular topology	The molecule's expected topology. The options are linear and circular.
	1e: GenBank division	Each GenBank sequence is currently classified in one of the following 17 subdivisions: `PRI`, primates; `ROD`, rodents; `MAM`, mammals (excluding primates and rodents); `VRT`, vertebrates (excluding mammals); `INV`, invertebrates; `PLN`, plants, fungi, and algae; `BCT`, bacteria; `VRL`, viral; `PHG`, bacteriophages; `SYN`, synthetic; `UNA`, unannotated; `EST`, expressed sequence tag; `PAT`, patent sequence; `STS`, sequence tagged sites; `GSS`, genome survey sequence; `HTG`, high-throughput genomic sequence; `HTC`, unfinished high-throughput cDNA sequence. Note that the organismal subdivisions do not coincide with the current NCBI taxonomy. They are purely historical.
	1f: Modification date	Indicates when the file was last revised.
`DEFINITION`	2	A brief description of the sequence, including the organism source and the gene or protein name.
`ACCESSION`	3	A unique, stable identifier for the particular file, which is usually a combination of one or two letters with five or six digits.
`VERSION`	4	Allows users to track multiple incarnations of a given sequence. The version number is the accession number concatenated with a period and a number. For the first version of a particular accession, the number following the period is set to 1. Each time the sequence data are modified, the number following the period is incremented by 1. The example shown in Figure A.1B.2 is the first version of accession number M93361.
		This field will also contain a GenInfo Identifier (GI) for nucleotide sequence files. This number uniquely identifies each nucleotide sequence in GenBank, even if two sequences differ by only a single nucleotide. Note that, unlike the accession number for a file, the GI number may change.
`KEYWORDS`	5	A word or phrase describing the sequence. Although frequently found in older GenBank records, this field is generally not present in more recent GenBank files.
`SOURCE`	6	The first line is a free-format description of the source organism, followed by the molecule type. The subsequent lines contain the subfield `ORGANISM`, which has the complete scientific name of the source organism and its phylogenetic classification as given by the NCBI Taxonomy Database.
`REFERENCE`	7	Publications by the authors of the GenBank entry that discuss the molecule. Multiple publications may be listed in chronological order, ending with the most recent. Each reference entry will contain subfields (e.g., `AUTHORS`, `TITLE`, `JOURNAL`, `MEDLINE`) that are appropriate for the particular publication type.
`FEATURES`	8	This is essentially a concise summary of the gene or protein annotation. It offers a list of genes, gene products, and regions of biological interest that have been identified within the reported sequence. The first subfield in each `FEATURES` list is the `source` subfield, which contains the length of the sequence, the scientific name of the source organism, and the taxon ID number. Additional subfields are given (e.g., `gene`, `promoter`, `TATA signal`, `5′ UTR`, `3′ UTR`, and coding sequence, `CDS`), depending on the features within the sequence. For each feature, the GenBank record provides its location within the sequence and other pertinent information (e.g., the product or gene name, possible function, and protein translation).
`BASE COUNT`	9	The number of adenine, cytosine, thymine (or uracil), and guanine nucleotide bases within the sequence.
`ORIGIN`	10	This field is often left blank. In older records, it may contain the experimentally derived restriction cleavage site. Note that the `ORIGIN` field should be included in every GenBank record, even if it contains no information. Most parsers look for the sequence on the first line after the word `ORIGIN`.
	11	The sequence data with 60 bases (or residues) per line. The bases on each line are presented in six groups of ten bases per group, with the groups separated by spaces. The sequence ends with two slashes (//).
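
The layout rules for the `VERSION`, `BASE COUNT`, and `ORIGIN` fields can be illustrated with a minimal Python sketch. It is not the code of any particular GenBank parser: the function names (`split_version`, `format_origin`, `parse_origin`, `base_count`) are hypothetical, and the position number written at the start of each sequence line is an assumption about the flat-file layout that the table does not describe.

```python
import re
from collections import Counter


def split_version(version):
    """Split a VERSION value such as 'M93361.1' into (accession, version number)."""
    accession, _, number = version.rpartition(".")
    return accession, int(number)


def format_origin(sequence, per_line=60, group=10):
    """Lay out a sequence as 60 bases per line in six groups of ten, ending with '//'."""
    seq = sequence.lower()
    lines = ["ORIGIN"]
    for start in range(0, len(seq), per_line):
        chunk = seq[start:start + per_line]
        groups = [chunk[i:i + group] for i in range(0, len(chunk), group)]
        # Assumption: each line begins with the 1-based position of its first base.
        lines.append(f"{start + 1:>9} " + " ".join(groups))
    lines.append("//")
    return "\n".join(lines)


def parse_origin(text):
    """Collect the sequence found between the ORIGIN line and the closing '//'."""
    bases = []
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("ORIGIN"):
            in_sequence = True  # sequence data start on the next line
            continue
        if line.startswith("//"):
            break
        if in_sequence:
            # Drop position numbers and the spaces between groups.
            bases.append(re.sub(r"[^A-Za-z]", "", line))
    return "".join(bases)


def base_count(sequence):
    """Count a, c, g, and t in a nucleotide sequence, as in the BASE COUNT field."""
    counts = Counter(sequence.lower())
    return {base: counts.get(base, 0) for base in "acgt"}
```

For example, `parse_origin(format_origin(seq))` should return the original sequence in lowercase, and `split_version("M93361.1")` separates the accession number from its version number.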