Abstract
The Gene3D database (http://www.biochem.ucl.ac.uk/bsm/cath_new/Gene3D/) provides structural assignments for genes within complete genomes. These are available via the internet from either the World Wide Web or FTP. Assignments are made using PSI-BLAST and subsequently processed using the DRange protocol. The DRange protocol is an empirically benchmarked method for assessing the validity of structural assignments made using sequence searching methods; appropriate assignment statistics are collected and made available. Gene3D links assignments to their appropriate entries in relevant structural and classification resources (PDBsum, the CATH database and the Dictionary of Homologous Superfamilies). Release 2.0 of Gene3D includes 62 genomes, 2 eukaryotes, 10 archaea and 40 bacteria. Currently, structural assignments can be made for between 30 and 40 percent of any given genome. In any genome, around half of the genes assigned a structural domain are assigned a single domain and the other half are assigned multiple structural domains. Gene3D is linked to the CATH database and is updated with each new update of CATH.
INTRODUCTION
Considerable progress has been made in the field of genome annotation in the past five years and it is now evident that some structural or functional annotation can be provided for most of the genes in any given organism (6–12). Currently, the state of the art allows up to 80% (7) of the genes in any given organism to be assigned functional or structural annotation. Most annotation methods rely almost solely on inheriting functional annotation via sequence comparison, but one must exercise a degree of caution when interpreting such results. This is particularly pertinent when considering the annotation of distant homologues [∼30% sequence identity, (13)]. Structural annotation is often useful when assessing the functional annotations of these homologues. Use of structural data enables 3D models to be built to inform functional predictions (14,15). Gene3D aims to provide the biologist with reliable precalculated relationships to protein structures and, as a result, the relevant links to the functional and structural data curated within the CATH domain structure classification database. These data can then be used as the starting point for homology modelling or evolutionary studies. A related resource, SUPERFAMILY (16), is linked to the SCOP structural database (17).
METHODS
The Gene3D database is derived from data produced by the DomainFinder algorithm (18) and the DRange protocol (2). This resource is created by scanning the sequences from the CATH structural domains against a large database derived from the non-redundant sequence database from GenBank, which contains the sequences from the completed genomes. The PSI-BLAST iterative database search algorithm (1,19) is used to scan CATH database sequences against the GenBank sequences. Preprocessing is carried out by DomainFinder, and the DRange protocol selects and validates the putative structural annotations suggested by DomainFinder. Gene3D and the associated DRange protocol are described below.
DomainFinder and DRange
The Gene3D population process is illustrated in Figure 1. The procedure starts with a dataset of non-identical sequences from the CATH database (CATH S95Reps). These are searched against a library of sequences (in this case the GenBank non-redundant database, which includes the sequences from the completed genomes) using PSI-BLAST (Fig. 1A), with the aim of producing a series of matches of the structural domains to the genomic sequences (Fig. 1B).
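As a rough illustration of the search step, the sketch below assembles a command line for a single PSI-BLAST run. It uses the `psiblast` program from the current NCBI BLAST+ suite, which is an assumption: the text does not give the invocation or parameters used for Gene3D. The file names, database name, iteration count and E-value cutoff are all hypothetical placeholders.

```python
import shlex


def build_psiblast_command(query_fasta, database, out_path,
                           iterations=3, evalue=1e-3):
    """Return the argument list for one PSI-BLAST search (nothing is executed).

    The iteration count and E-value cutoff are illustrative defaults,
    not the values used to build Gene3D.
    """
    return [
        "psiblast",
        "-query", query_fasta,          # one CATH representative sequence
        "-db", database,                # non-redundant library incl. genomes
        "-num_iterations", str(iterations),
        "-evalue", str(evalue),
        "-outfmt", "6",                 # tabular hits, one match per line
        "-out", out_path,
    ]


# Hypothetical file and database names, for illustration only.
cmd = build_psiblast_command("s95rep_0001.fasta", "nr_plus_genomes",
                             "s95rep_0001.tsv")
print(shlex.join(cmd))
```

Building the argument list separately from running it makes it easy to inspect or log each search before it is submitted, for example through `subprocess.run(cmd, check=True)`.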
Figure 1.
Populating the Gene3D Database. (A) CATH Representative sequences (S95Reps) are scanned against the GenBank non-redundant database containing the sequences from the completed genomes using PSI-BLAST. Search results (B) are processed by DomainFinder to generate ‘Ranges’ (C). These are ‘cleaned-up’ by the DRange package (D) and final assignments are assimilated in the Gene3D database (E).
In the subsequent step, the DomainFinder algorithm is used to convert the ‘raw’ hits into ‘Ranges’ (18). These ‘Ranges’ act as descriptors which indicate which regions of a gene are putatively assigned to which CATH Homologous Superfamilies (Fig. 1C). In the last data manipulation step (Fig. 1D), assignments are cleaned up using the DRange package and the resulting assignments are stored in the Gene3D database (Fig. 1E). The DRange package is composed of three modules: Collapse, MultiParse and CleanAssign (2). These three modules are used to verify structural domain assignments. The ‘clean-up’ procedure is a triage procedure distinguishing between probably correct and probably incorrect assignments.
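To make the idea of a ‘Range’ concrete, the sketch below models one as a region of a gene tied to a CATH superfamily and an E-value, then applies a simple greedy triage that keeps the best-scoring range wherever two overlap. This is not the Collapse, MultiParse or CleanAssign logic, which is described in (2); the record fields, the overlap rule and the example values are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Range:
    """Hypothetical record for one putative domain assignment on a gene."""
    gene_id: str
    start: int         # first matched residue, 1-based inclusive
    end: int           # last matched residue, inclusive
    superfamily: str   # CATH homologous superfamily code
    evalue: float      # PSI-BLAST E-value supporting the match


def resolve_overlaps(ranges):
    """Greedy triage (an assumption, not DRange): visit ranges from best to
    worst E-value and keep each one only if it overlaps nothing kept so far."""
    kept = []
    for r in sorted(ranges, key=lambda r: r.evalue):
        if all(r.end < k.start or r.start > k.end for k in kept):
            kept.append(r)
    return sorted(kept, key=lambda r: r.start)


def covered_residues(ranges):
    """Residues covered by a set of non-overlapping ranges."""
    return sum(r.end - r.start + 1 for r in ranges)


# Invented example: three raw ranges on one gene, two of which overlap.
raw = [
    Range("gene1", 10, 120, "3.40.50.300", 1e-30),
    Range("gene1", 100, 200, "3.40.50.720", 1e-5),
    Range("gene1", 210, 300, "2.60.40.10", 1e-12),
]
clean = resolve_overlaps(raw)
for r in clean:
    print(r.gene_id, r.start, r.end, r.superfamily)
print("covered residues:", covered_residues(clean))
```

Counting the entries in `clean` for each gene gives the number of domains assigned to that gene, and summing `covered_residues` over all genes gives a genome-wide coverage figure.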
RESULTS
Gene3D is the repository for structural assignments verified using the DRange protocol and is available on-line at http://www.biochem.ucl.ac.uk/bsm/cath_new/Gene3D/. This protocol is applied to all complete genomes released. In May 2002, Gene3D included whole genome structural assignments for 66 genomes. The data are also available via the CATH FTP site at ftp://ftp.biochem.ucl.ac.uk/pub/Gene3D/.
Assignment statistics for four typical genomes in the database, one from each of the major branches of life (one multicellular eukaryote, one unicellular eukaryote, an archaeon and a bacterium), are presented in Table 1. Across the database, between ∼22% and ∼55% of the genes in a given organism are annotated with at least one structural domain. Of these genes, usually around half are annotated with a single domain and the other half are assigned multiple domains (see Fig. 2). The figure also shows that the eukaryotic genomes have many more genes with a large number of domains; closer inspection indicates that the largest of these genes are made up of long strings of immunoglobulin-like domains and are likely to encode cell–cell signalling proteins. The percentage of residues covered (see Table 1) is often around half (and frequently lower) of the percentage of genes with an annotation. This indicates that many genes that pick up a domain are not being fully annotated and could be annotated with further domains.
Table 1. Genome assignment statistics.
| Organism | Celeg | Sacc | Mjan | Ecoli |
|---|---|---|---|---|
| Total number of genes | 17046 | 6297 | 1706 | 4289 |
| Total number of bases | 100 109 819 | 12 155 026 | 1 664 970 | 4 639 221 |
| Total number of residues | 7 689 303 | 2 973 987 | 480 140 | 1 358 990 |
| CATH covered residues | 630 655 | 346 272 | 103 259 | 354 537 |
| % CATH covered residues | 8.20 | 11.64 | 21.51 | 26.09 |
| Number of genes with CATH domains | 3641 | 2030 | 607 | 1788 |
| % Genes with CATH domains | 21.36 | 32.23 | 35.58 | 41.69 |
Statistics are shown for four representative organisms: the first two are eukaryotes, the third an archaeon and the last a bacterium. Celeg: Caenorhabditis elegans, Sacc: Saccharomyces cerevisiae, Mjan: Methanococcus jannaschii and Ecoli: Escherichia coli.
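The percentage rows of Table 1 follow directly from the count rows, and the short sketch below recomputes them. The counts are copied from Table 1; the dictionary layout and variable names are illustrative only. The recomputed values agree with the published rows to within about 0.01 percentage points.

```python
# Counts copied from Table 1 (genes, residues, CATH-covered residues,
# genes with at least one CATH domain).
table1 = {
    "Celeg": {"genes": 17046, "residues": 7689303,
              "covered": 630655, "assigned": 3641},
    "Sacc":  {"genes": 6297, "residues": 2973987,
              "covered": 346272, "assigned": 2030},
    "Mjan":  {"genes": 1706, "residues": 480140,
              "covered": 103259, "assigned": 607},
    "Ecoli": {"genes": 4289, "residues": 1358990,
              "covered": 354537, "assigned": 1788},
}


def percent(part, whole):
    """Express part as a percentage of whole."""
    return 100.0 * part / whole


summary = {}
for organism, row in table1.items():
    pct_residues = percent(row["covered"], row["residues"])
    pct_genes = percent(row["assigned"], row["genes"])
    # Ratio of residue coverage to gene coverage for this genome.
    summary[organism] = (pct_residues, pct_genes, pct_residues / pct_genes)
    print(f"{organism}: {pct_residues:.2f}% residues, "
          f"{pct_genes:.2f}% genes, ratio {summary[organism][2]:.2f}")
```

For these four genomes the ratio of residue coverage to gene coverage lies between roughly 0.36 and 0.63, which is the pattern behind the observation that many annotated genes are only partly covered.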
Figure 2.
Bar chart showing the distribution of the number of domains assigned to genes in three typical organisms: Caenorhabditis elegans, Methanococcus jannaschii and Escherichia coli. The Y axis has been truncated.
Cursory inspection of the assignment data shows that bacterial and archaeal genomes pick up approximately the same ratios of the various types of CATH domains and that no single genome appears to be strongly biased in the type of CATH domains it utilises (see Fig. 3). The eukaryotes appear to make more use of the all-beta domains in the CATH database, which is probably due to their greater use of cell–cell signalling proteins that typically use immunoglobulin-like domains.
Figure 3.
Bar chart showing the relative percentage of domains from each of the four major CATH classes for genes that have been assigned a CATH domain. The four classes are: Class 1: all alpha folds; Class 2: all beta folds; Class 3: mixed alpha and beta folds; and Class 4: folds with little secondary structure.
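A class breakdown of this kind can be derived from the assigned superfamily codes, because the first field of a CATH code is the class number. The sketch below tallies classes for a list of codes; the input list is an invented example, not Gene3D data, and the function name is hypothetical.

```python
from collections import Counter

# Class labels as described in the Figure 3 legend.
CLASS_NAMES = {
    1: "all alpha",
    2: "all beta",
    3: "mixed alpha and beta",
    4: "little secondary structure",
}


def class_percentages(cath_codes):
    """Percentage of assigned domains falling in each CATH class.

    Each code is expected in Class.Architecture.Topology.Superfamily form,
    so the class is the integer before the first dot.
    """
    counts = Counter(int(code.split(".")[0]) for code in cath_codes)
    total = sum(counts.values())
    return {cls: 100.0 * counts.get(cls, 0) / total for cls in CLASS_NAMES}


# Invented example input: one code per assigned domain.
example_codes = ["3.40.50.300", "3.40.50.720", "2.60.40.10", "1.10.10.10"]
percentages = class_percentages(example_codes)
for cls, pct in percentages.items():
    print(f"Class {cls} ({CLASS_NAMES[cls]}): {pct:.1f}%")
```

Running the same tally per genome and plotting the four percentages side by side would reproduce the layout of a chart like Figure 3.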
In the database, the eukaryotic genomes pick up the least annotation, which may be due to a prokaryotic bias in the structures that are deposited within the PDB (20).
The Gene3D Web Server
The Gene3D web server is made up of a number of interlinked web pages which allow the retrieval of data on specific genes within the represented genomes. Each genome features an entry page (Fig. 4A) with a summary of the assignment statistics and a CATH wheel (21). The CATH wheel is a pie plot indicating which folds in CATH are present in the organism. Those folds not detected in the genome are blacked out. The statistics are similar to those presented in Table 1. From this entry page, it is possible to search the genome in one of two ways. The first is by a simple keyword or gene identifier search that returns a list of matching genes. The second is to browse the complete list of genes within an organism that have had a structural assignment made to them. By either route, once a gene is selected a results page is returned (Fig. 4B). This results page presents a schematic diagram of both the gene (hatched in green) and the structural domains assigned (colour coded by domain type). Presented alongside this are the ‘ranges’ data for the assignment and the E-values from PSI-BLAST upon which the assignment was accepted. We recommend that for batch downloads users refer to the FTP site (ftp://ftp.biochem.ucl.ac.uk/pub/Gene3D/).
Figure 4.
(A) A typical entry page for a given genome (e.g. Mycoplasma genitalium), showing the statistics presented. (B) An example of the diagram and data that can be retrieved for a gene.
DISCUSSION
The data within Gene3D are there to provide biologists and bioinformaticists with an initial stepping stone from which structural, functional and evolutionary studies can begin. In future, we hope to integrate Pfam domain assignments (12) to maximise the annotated coverage of genomes and we also hope to provide alignments of the CATH or Pfam domains to the genes that they matched. It is our hope that by integrating Pfam domain assignments, we can provide the assignments for most, if not all, of the genes in the complete genomes.
That we can annotate so much of the complete genome sequences from the structure databases alone suggests that we may not need to solve structures for every sequence but rather for every sequence family containing relatives of high sequence identity (for example ∼40%). In such families, homology modelling could then be used to predict the structures of all the relatives from one representative structure. This bodes well for the success of the structural genomics projects.
REFERENCES
- 1. Altschul,S., Madden,T., Schaffer,A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.
- 2. Buchan,D., Shepherd,A., Lee,D., Pearl,F., Rison,S., Thornton,J. and Orengo,C. (2002) Gene3D: structural assignment for whole genes and genomes using the CATH domain structure database. Genome Res., 12, 503–514.
- 3. Laskowski,R. (2001) PDBsum: summaries and analyses of PDB structures. Nucleic Acids Res., 29, 221–222.
- 4. Pearl,F., Martin,N., Bray,J., Buchan,D., Harrison,A., Lee,D., Reeves,G., Shepherd,A., Sillitoe,I., Todd,A., Thornton,J. and Orengo,C. (2001) A rapid classification protocol for the CATH domain database to support structural genomics. Nucleic Acids Res., 29, 223–227.
- 5. Bray,J., Todd,A., Pearl,F., Thornton,J. and Orengo,C. (2000) The CATH Dictionary of Homologous Superfamilies (DHS): a consensus approach for identifying distant structural homologues. Protein Eng., 13, 153–165.
- 6. Gerstein,M. (1997) A structural census of genomes: comparing bacterial, eukaryotic and archaeal genomes in terms of protein structure. J. Mol. Biol., 274, 562–576.
- 7. Teichmann,S., Chothia,C. and Gerstein,M. (1999) Advances in structural genomics. Curr. Opin. Struct. Biol., 9, 390–399.
- 8. Muller,A., MacCallum,R. and Sternberg,M. (1999) Benchmarking PSI-BLAST in genome annotation. J. Mol. Biol., 293, 1257–1271.
- 9. Iliopoulos,I., Tsoka,S., Andrade,M., Janssen,P., Audit,B., Tramontano,A., Valencia,A., Leroy,C., Sander,C. and Ouzounis,C. (2001) Genome sequences and great expectations. Genome Biol., 2, Interactions0001.
- 10. Apweiler,R., Biswas,M., Fleischmann,W., Kanapin,A., Karavidopoulou,Y., Kersey,P., Kriventseva,E., Mittard,V., Mulder,N., Phan,I. and Zdobnov,E. (2001) Proteome Analysis Database: online application of InterPro and CluSTr for the functional classification of proteins in whole genomes. Nucleic Acids Res., 29, 44–48.
- 11. Kanehisa,M., Goto,S., Kawashima,S. and Nakaya,A. (2002) The KEGG databases at GenomeNet. Nucleic Acids Res., 30, 42–46.
- 12. Bateman,A., Birney,E., Cerruti,L., Durbin,R., Etwiller,L., Eddy,S., Griffiths-Jones,S., Howe,K., Marshall,M. and Sonnhammer,E. (2002) The Pfam protein families database. Nucleic Acids Res., 30, 276–280.
- 13. Todd,A., Orengo,C. and Thornton,J. (2001) Evolution of function in protein superfamilies, from a structural perspective. J. Mol. Biol., 307, 1113–1143.
- 14. Laskowski,R., Luscombe,N., Swindells,M. and Thornton,J. (1996) Protein clefts in molecular recognition and function. Protein Sci., 5, 2438–2452.
- 15. Luscombe,N., Laskowski,R. and Thornton,J. (1997) NUCPLOT: a program to generate schematic diagrams of protein–nucleic acid interactions. Nucleic Acids Res., 25, 4940–4945.
- 16. Gough,J. and Chothia,C. (2002) SUPERFAMILY: HMMs representing all proteins of known structure. SCOP sequence searches, alignments and genome assignments. Nucleic Acids Res., 30, 268–272.
- 17. Lo Conte,L., Brenner,S., Hubbard,T., Chothia,C. and Murzin,A. (2002) SCOP database in 2002: refinements accommodate structural genomics. Nucleic Acids Res., 30, 264–272.
- 18. Pearl,F., Lee,D., Bray,J., Buchan,D., Shepherd,A. and Orengo,C. (2002) The CATH extended protein-family database: providing structural annotations for genome sequences. Protein Sci., 11, 233–244.
- 19. Altschul,S., Madden,T., Schaffer,A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.
- 20. Westbrook,J., Feng,Z., Jain,S., Bhat,T., Thanki,N., Ravichandran,V., Gilliland,G., Bluhm,W., Weissig,H., Greer,D., Bourne,P. and Berman,H. (2002) The Protein Data Bank: unifying the archive. Nucleic Acids Res., 30, 245–248.
- 21. Michie,A., Orengo,C. and Thornton,J. (1996) Analysis of domain structural class using an automated class assignment protocol. J. Mol. Biol., 262, 168–185.




