Table A.1B.2.
Field | Identifier(s) in Figure A.1B.2 | Contents |
---|---|---|
LOCUS | 1a: Locus name | Although the locus name was originally intended to identify similar sequences, it no longer carries such significance. Each GenBank file has a unique locus name. Often, it is either the first letters of the genus and species names followed by the accession number, or simply the GenBank accession number of the file. |
LOCUS | 1b: Sequence length | The number of nucleotide base pairs (bp) or amino acid residues (aa) in the gene or gene product. |
LOCUS | 1c: Molecule type | Identifies the type of sequence found in a particular file. Possibilities include genomic DNA, genomic RNA, precursor RNA, mRNA, rRNA, tRNA, small nuclear RNA, and cytoplasmic RNA. |
LOCUS | 1d: Molecular topology | The molecule's expected topology, either linear or circular. |
LOCUS | 1e: GenBank division | Each GenBank sequence is currently classified in one of the following 17 subdivisions: PRI, primates; ROD, rodents; MAM, mammals (excluding primates and rodents); VRT, vertebrates (excluding mammals); INV, invertebrates; PLN, plants, fungi, and algae; BCT, bacteria; VRL, viral; PHG, bacteriophages; SYN, synthetic; UNA, unannotated; EST, expressed sequence tag; PAT, patent sequence; STS, sequence tagged sites; GSS, genome survey sequence; HTG, high-throughput genomic sequence; HTC, unfinished high-throughput cDNA sequence. Note that the organismal subdivisions do not coincide with the current NCBI taxonomy. They are purely historical. |
LOCUS | 1f: Modification date | Indicates when the file was last revised. |
DEFINITION | 2 | A brief description of the sequence, including the organism source and the gene or protein name. |
ACCESSION | 3 | A unique, stable identifier for the particular file, usually a combination of one or two letters with five or six digits. |
VERSION | 4 | Allows users to track successive versions of a given sequence. The version number is the accession number followed by a period and a number. For the first version of a particular accession, the number following the period is 1. Each time the sequence data are modified, that number is incremented by 1. The example shown in Figure A.1B.2 is the first version of accession number M93361. For nucleotide sequence files, this field also contains a GenInfo Identifier (GI). This number uniquely identifies each nucleotide sequence in GenBank, so two sequences that differ by even a single nucleotide have different GI numbers. Note that, unlike the accession number for a file, the GI number may change. |
KEYWORDS | 5 | A word or phrase describing the sequence. Although frequently found in older GenBank records, this field is generally not present in more recent GenBank files. |
SOURCE | 6 | The first line is a free-format description of the source organism, followed by the molecule type. The subsequent lines contain the subfield ORGANISM, which has the complete scientific name of the source organism and its phylogenetic classification as given by the NCBI Taxonomy Database. |
REFERENCE | 7 | Publications by the authors of the GenBank entry that discuss the molecule. Multiple publications may be listed in chronological order, ending with the most recent. Each reference entry will contain subfields (e.g., AUTHORS, TITLE, JOURNAL, MEDLINE) that are appropriate for the particular publication type. |
FEATURES | 8 | A concise summary of the gene or protein annotation. It lists the genes, gene products, and regions of biological interest that have been identified within the reported sequence. The first subfield in each FEATURES list is the source subfield, which contains the length of the sequence, the scientific name of the source organism, and the taxon ID number. Additional subfields, such as gene, promoter, TATA signal, 5′ UTR, 3′ UTR, and coding sequence (CDS), are given depending on the features within the sequence. For each feature, the GenBank record provides its location within the sequence and other pertinent information (e.g., the product or gene name, possible function, and protein translation). |
BASE COUNT | 9 | The number of adenine, cytosine, thymine (or uracil), and guanine nucleotide bases within the sequence. |
ORIGIN | 10 | This field is often left blank. In older records, it may contain the experimentally derived restriction cleavage site. Note that the ORIGIN field should be included in every GenBank record, even if it contains no information. Most parsers look for the sequence on the first line after the word ORIGIN. |
ORIGIN | 11 | The sequence data, with 60 bases (or residues) per line. Each line is presented as six groups of ten bases, with the groups separated by spaces. The sequence ends with two slashes (//). |
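
The VERSION and sequence rows of the table describe conventions that are easy to check in code. The following is a minimal Python sketch, not a full GenBank parser. The function names `parse_version` and `extract_sequence` are illustrative, and the toy record uses a made-up accession and GI number rather than real GenBank data. It also assumes that each sequence line begins with a position number, a detail of the flat-file layout that the table does not describe.

```python
import re


def parse_version(line):
    """Split a VERSION line into (accession, version number, GI or None)."""
    fields = line.split()
    # fields[0] is the keyword VERSION; fields[1] looks like "ACCESSION.N".
    accession, _, number = fields[1].partition(".")
    gi = None
    for field in fields[2:]:
        if field.startswith("GI:"):
            gi = field[3:]
    return accession, int(number), gi


def extract_sequence(record_text):
    """Return the sequence between the ORIGIN line and the // terminator."""
    chunks = []
    in_sequence = False
    for line in record_text.splitlines():
        if line.startswith("ORIGIN"):
            # The sequence starts on the line after the word ORIGIN.
            in_sequence = True
            continue
        if line.startswith("//"):
            # Two slashes mark the end of the sequence.
            break
        if in_sequence:
            # Drop the leading position number and the spaces between groups.
            chunks.append(re.sub(r"[\d\s]", "", line))
    return "".join(chunks)


# Hypothetical toy record: the accession, GI, and bases are invented.
TOY_RECORD = """\
VERSION     X00000.1  GI:0000000
ORIGIN
        1 acgtacgtac gtacgtacgt acgtacgtac gtacgtacgt acgtacgtac gtacgtacgt
       61 ttgacc
//
"""

if __name__ == "__main__":
    version_line = TOY_RECORD.splitlines()[0]
    print(parse_version(version_line))
    print(len(extract_sequence(TOY_RECORD)))
```

Keying on the word ORIGIN and the closing slashes, rather than on line positions, mirrors the parser behavior noted in the ORIGIN row and works even when the ORIGIN field itself is empty.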