ELECTRONIC SUPPLEMENT FOR THE JOURNAL WEB SITE

- entrance    http://www.kazusa.or.jp/codon/

- data source

The latest data source available during the preparation of this manuscript was NCBI-GenBank Flat File Release 113.0 [15 August 1999]. Pri (primate sequence entries), rod (rodent sequence entries), mam (other mammalian sequence entries), vrt (other vertebrate sequence entries), inv (invertebrate sequence entries), pln (plant sequence entries), bct (bacterial sequence entries), vrl (viral sequence entries) and phg (phage sequence entries) contain the sequence data for major taxonomical groups.

Other files, such as est (EST: expressed sequence tag sequence entries), pat (patent sequence entries) and rna (structural RNA sequence entries), were not used, since they are not taxonomical collections and contain only a small number of full-length protein genes. All complete sequences of protein-coding genes (CDS's) were used. Codons containing ambiguous bases were excluded from the analysis.
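The exclusion of codons with ambiguous bases can be illustrated with a minimal Python sketch. It assumes that a CDS is available as a plain nucleotide string and that "ambiguous" means any symbol other than A, C, G or T; the function name count_codons is hypothetical and is not part of CUTG.

```python
from collections import Counter

# Assumption: only A, C, G and T count as unambiguous bases.
UNAMBIGUOUS = set("ACGT")

def count_codons(cds):
    """Count codons in a CDS string, skipping codons with ambiguous bases."""
    seq = cds.upper()
    counts = Counter()
    # Step through complete triplets only; a trailing partial codon is ignored.
    for i in range(0, len(seq) - len(seq) % 3, 3):
        codon = seq[i:i + 3]
        if set(codon) <= UNAMBIGUOUS:
            counts[codon] += 1
    return counts

# The second codon (GCN) contains an ambiguous base and is not counted.
example = count_codons("ATGGCNGCTTAA")
```

In this sketch only the affected codon is dropped; the rest of the CDS still contributes to the counts.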

- dataset on the ftp sites

The complete dataset of CUTG is available through the following URLs:

(i) Kazusa ftp://ftp.kazusa.or.jp/pub/codon/current/

(ii) DDBJ ftp://ftp.nig.ac.jp/pub/db/codon/current/

(iii) EBI ftp://ftp.ebi.ac.uk/pub/databases/cutg/

In August 1999, the construction and primary distribution site was moved to Kazusa DNA Research Institute from the DNA Information and Stock Center.

Files named gb***.codon list the codon usage of each gene registered in the selected GenBank Flat Files. The LOCUS names given in GenBank were used to designate individual genes. Each LOCUS name is followed by fields of information extracted from the FEATURES part of the CDS used to define the open reading frames analyzed here. The order of the codons in the table is the same as in the previous compilation (see the CODON_LABEL file).

To reveal the characteristics of the codon usage of a wide range of organisms, as well as viruses and organelles, the frequency (per thousand) of codon use in each organism was calculated by summing up the numbers of codons used. Files named gb***.spsum list the sum of the codon counts in each species, as well as in viruses and organelles (see the SPSUM_LABEL file).
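As a rough illustration of the per-thousand calculation, the sketch below sums per-gene codon counts and scales each codon's total by the overall number of codons. The function name per_thousand and the toy counts are illustrative assumptions, not values taken from CUTG.

```python
from collections import Counter

def per_thousand(gene_counts):
    """Sum per-gene codon counts and return each codon's frequency per 1000 codons."""
    total = Counter()
    for counts in gene_counts:
        total.update(counts)
    n = sum(total.values())
    if n == 0:
        return {}
    return {codon: 1000.0 * c / n for codon, c in total.items()}

# Toy data: two hypothetical genes of one organism.
genes = [Counter({"ATG": 1, "GCT": 3}), Counter({"ATG": 1, "GCT": 5})]
freq = per_thousand(genes)
```

With these toy counts, ATG accounts for 2 of 10 codons and GCT for 8 of 10, so the frequencies are 200 and 800 per thousand.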

The files are distributed in two forms, as gzip-compressed files and as flat files. CUTG.**.tar.gz in the "compressed" directory (** is a number which shows the GenBank major release used for the construction) is an all-in-one file for the current dataset. The file contains two "LABEL" files and all the "codon" and "spsum" files. Use "gunzip" and "tar" to extract files from the archive. If you do not need all of the sections, you can download the required file, such as the gbbct.spsum.gz or gbbct.codon.gz file for bacterial entries, from the "compressed" directory. If you do not have "gunzip" or "tar" in your local operating system, you may fetch each file in flat text format from this directory.
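Where "gunzip" and "tar" are not available but Python is, the standard library offers an equivalent. The following is a sketch only: the helper names are hypothetical, and the archive and file paths are left as parameters because they depend on the release that was downloaded.

```python
import gzip
import os
import shutil
import tarfile

def unpack_all_in_one(archive_path, dest_dir):
    """Extract every member of a .tar.gz archive; return the member names."""
    os.makedirs(dest_dir, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        names = archive.getnames()
        archive.extractall(dest_dir)
    return names

def gunzip_file(gz_path, out_path):
    """Decompress a single .gz file (e.g. a downloaded gbbct.spsum.gz)."""
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
```

For example, gunzip_file("gbbct.spsum.gz", "gbbct.spsum") would write the flat bacterial summary file next to the compressed one, assuming that file has already been fetched.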

On each ftp site, a species directory, which contains the codon usage files derived from each organism, is provided. Each file name consists of the words of the Latin name of the species joined together by underscores, followed by a dot and the GenBank division (e.g. for the codon usage list of Arabidopsis thaliana, the URL at the Kazusa ftp site is ftp://ftp.kazusa.or.jp/pub/codon/current/species/Arabidopsis_thaliana.pln).
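The naming rule can be written down as a short Python sketch. The base URL is the Kazusa one given above; the function names are hypothetical, and the sketch assumes the species name contains only words separated by spaces.

```python
KAZUSA_SPECIES_BASE = "ftp://ftp.kazusa.or.jp/pub/codon/current/species/"

def species_file_name(latin_name, division):
    """Join the words of the Latin name with underscores and append the division."""
    return "_".join(latin_name.split()) + "." + division

def species_url(latin_name, division, base=KAZUSA_SPECIES_BASE):
    """Build the URL of a species codon usage file under the given base URL."""
    return base + species_file_name(latin_name, division)

url = species_url("Arabidopsis thaliana", "pln")
```

Here url reproduces the Arabidopsis thaliana address quoted in the text.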

- species search and the alphabetical list

A query box on the top page allows one to search for the codon usage table of each organism. The default search process is case sensitive. A case-insensitive option can also be selected. No answers are returned for ambiguous query strings that result in more than 100 hits for organisms in the database. In the answer list to a query or in the alphabetical list, the name of the organism appears, followed by the name of the division of GenBank [gbbct, gbinv etc.], a colon and the number of compiled CDS's, as in the following example:

Arabidopsis thaliana [gbpln]: 13430
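An entry of this form can be split into its three fields with a regular expression. The sketch below assumes that the division always starts with "gb" followed by lowercase letters, as in the examples given, and that the organism name itself contains no "[gb...]" pattern; parse_entry is a hypothetical helper.

```python
import re

# organism name, then [division], a colon and the number of compiled CDS's
ENTRY_RE = re.compile(
    r"^(?P<organism>.+?)\s*\[(?P<division>gb[a-z]+)\]:\s*(?P<n_cds>\d+)\s*$"
)

def parse_entry(line):
    """Return (organism, division, number of CDS's) for one list entry."""
    m = ENTRY_RE.match(line)
    if m is None:
        raise ValueError("unrecognized entry: %r" % line)
    return m.group("organism"), m.group("division"), int(m.group("n_cds"))

entry = parse_entry("Arabidopsis thaliana [gbpln]: 13430")
```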

If a user selects a link to an organism, the codon usage table of that organism will appear. The table shows the frequency (per thousand) and the number of times each codon occurs as a sum of all the CDS's in the organism. When a genetic code is selected, a table that either includes the names of the amino acids or is formatted in GCG style is also shown.

By selecting the link "Codon usage of each CDS" under the table, all the codon usage tables of the CDS's of an organism can be browsed or downloaded. The table format is CUTG style (see the link below for the "CODON_LABEL" file in CUTG).

- table formatting tool

The format tool is accessible on the codon usage table page. There are two styles for the codon usage table. One is the traditional display style of the codon usage table, and the other is a style compatible with the CodonFrequency output option of the GCG Wisconsin Package™. Users must select a genetic code to display the codon usage table with amino acids or in GCG format. Users who have the GCG package in their local environment can make further analyses using this style.

- CDS search

On the codon usage table page for each species, a query box has been provided to search for a codon usage table by simple keyword. The search process is case sensitive. Queries that result in more than 1000 hits for genes give an error message. The user can select CDS's by keywords and then make codon usage tables from the selected items. This tool allows users to analyze intra-species variation in codon usage.
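The behaviour described above can be approximated offline with a small Python sketch. It is not the server's implementation: it assumes that a keyword match is a case-sensitive substring test against a free-text description of each CDS, and the record layout (a dict with "description" and "codons" keys) is hypothetical.

```python
from collections import Counter

MAX_HITS = 1000  # limit stated in the text for the CDS search

def search_cds(records, keyword, max_hits=MAX_HITS):
    """Case-sensitive keyword search; too many hits raise an error."""
    hits = [r for r in records if keyword in r["description"]]
    if len(hits) > max_hits:
        raise ValueError("query returned more than %d hits" % max_hits)
    return hits

def pooled_usage(hits):
    """Sum the codon counts of the selected CDS's into one table."""
    total = Counter()
    for r in hits:
        total.update(r["codons"])
    return total

# Toy records, not real database entries.
records = [
    {"description": "ribosomal protein L2", "codons": Counter({"GCT": 4, "AAA": 2})},
    {"description": "Ribosomal protein S1", "codons": Counter({"GCT": 1})},
    {"description": "ribosomal protein L5", "codons": Counter({"AAA": 3})},
]
table = pooled_usage(search_cds(records, "ribosomal"))
```

Because the match is case sensitive, the second toy record ("Ribosomal") is not selected. Pooled tables built from different keyword sets can then be compared to look at variation within one species.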