Abstract
The SURFACE (SUrface Residues and Functions Annotated, Compared and Evaluated, URL http://cbm.bio.uniroma2.it/surface/) database is a repository of annotated and compared protein surface regions. SURFACE contains the results of a large-scale protein annotation and local structural comparison project. A non-redundant set of protein chains is used to build a database of protein surface patches, defined as putative surface functional sites. Each patch is annotated with sequence- and structure-derived information about function or interaction abilities. A new procedure for structure comparison is used to perform an all-versus-all patch comparison. Selecting the results obtained with stringent parameters yields a similarity score that can be used to associate different patches and allows reliable annotation by similarity. Annotation through the comparison of protein surface regions highlights similarities that cannot be recognized by other methods of sequence or structure comparison. A graphic representation of the surface patches, functional annotations and structural superpositions is available through the web interface.
INTRODUCTION
Extracting information about protein functions directly from structure is becoming a crucial task in the structural genomics era (1–3). Structural comparison may lead to the identification of functional relationships even when no clear sequence similarity is detected (4,5). A limitation of this approach is that function is very often encoded in a small number of residues: cases are known in which proteins sharing a similar fold and/or sequence have completely different functions (6,7), as well as cases in which a clear functional relationship does not involve sequence/structure similarity (8). In such cases, sequence or structure comparison is likely to be inadequate in describing or identifying protein functions and evolutionary relationships between proteins. These tools, while generally useful in protein classification, may fail to infer protein functions because they cannot spot local differences. Starting from this background, we have built a relational database to store and distribute the results of a large-scale local surface comparison experiment, allowing the scientific community to retrieve non-obvious functional similarities detected between proteins of known 3D structure.
LOCAL SURFACE COMPARISON
We developed a procedure to spot local structural similarities, which is focused on putative functional residues. This method relies on the automatic identification, annotation and structural comparison of functional sites. The first step is the identification of protein surface clefts using the SURFNET algorithm (9), on the previously demonstrated assumption that there is a clear correspondence between cleft volume and functional involvement (10). Cleft boundaries are explored to identify the residues that surround the cavity and that compose the so-called surface ‘patch’. For each patch, functional information is retrieved from the PROSITE database (11) and from the structure itself, assessing binding abilities from analysis of the bound ligands in the crystal. Integrating all this information yields a collection of annotated functional sites represented as local surface patches. This analysis generates a compendium of functional sites that can be used to scan a protein structure in order to automatically infer the function(s) of the protein.
Using a cut-off on volume size to select the biggest clefts, we identify 10 175 surface patches from a non-redundant list of 1924 protein chains whose structure is available. Each patch is composed of an average of 26.5 residues, with a residue distribution that is similar to the residue distribution on the protein surface (a higher frequency of charged and polar residues and a lower frequency of hydrophobic and bulky residues, with respect to the buried residues) (12), although some distinctive features can be detected (e.g. the W frequency in the surface clefts is lower than the W frequency on the overall surface, while the G frequency is higher: see the Statistics link on the SURFACE home page). We were able to associate at least one functional annotation with 14.4% of these hypothetical functional sites. Using a newly developed structure comparison algorithm (described below) we compare each annotated patch with the whole patch database. Algorithm parameters [such as the root mean square deviation (r.m.s.d.) and minimum similarity of the superposed residues] are set to stringent values, to find only reliable similarities. Moreover, in order to focus the comparison on the putative functional sites, the algorithm is forced to include the annotated residues in the superposition. The similarity between patches is evaluated by means of the number of superposed residues (the score), and through the statistical significance of the match, estimated from the score distribution for a given patch (the Z-score).
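The Z-score mentioned above can be illustrated with a minimal sketch. The text does not give the exact formula, so the code assumes the conventional definition (score minus the mean, divided by the standard deviation) applied to the distribution of scores that one query patch obtains against all target patches. The function name `z_scores` and the patch identifiers are hypothetical.

```python
from statistics import mean, pstdev


def z_scores(scores):
    """Convert the raw scores of one query patch into Z-scores.

    `scores` maps a target patch identifier to the number of
    superposed residues (the score). The conventional Z-score
    definition is an assumption, not taken from the SURFACE paper.
    """
    values = list(scores.values())
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        # All scores identical: no match stands out from the others.
        return {target: 0.0 for target in scores}
    return {target: (s - mu) / sigma for target, s in scores.items()}


# Hypothetical scores of one query patch against eight target patches.
example = {"a": 2, "b": 4, "c": 4, "d": 4, "e": 5, "f": 5, "g": 7, "h": 9}
z = z_scores(example)
```

A high Z-score then marks a match whose score is unusually large for that particular query patch, which makes matches comparable across patches of different size.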
To test the reliability of the procedure, we verified that for 90% of the annotated patches, at least one patch with the same annotation can be found among the highest-scoring matches (in cases where the annotation being examined is associated with at least two different patches), i.e. the algorithm is able to detect similarity between patches sharing the same annotation. We filtered the large number of results on the basis of the Z-score: for each annotation, we set the Z-score threshold to the average Z-score of the matches between patches sharing that annotation. We then selected those non-annotated patches that match at least two patches with the same annotation, with a Z-score greater than or equal to the threshold Z-score for that annotation. We used these conditions to filter the results of the annotated patches-versus-all comparison. Each match was then analyzed manually, using the literature and information derived from different sources and databases, to determine whether the detected structural similarities can be associated with a functional relationship. For each of the 426 selected matches, a functional relationship between the query and the target patch can be retrieved, validating the procedure and the filtering conditions. The reliability of the procedure in finding meaningful similarities in our non-redundant annotated patch data set has been tested using a set of benchmark cases in which cryptic similarity between unrelated proteins has been reported, such as nucleotide/nucleoside triphosphate binding related to the P-loop (8).
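The filtering conditions described above can be sketched as follows. This is one reading of the text, not the authors' code: matches are assumed to be `(query, target, z)` tuples, `annotation_of` is a hypothetical mapping from patch identifier to a set of annotations, and both function names are illustrative.

```python
from collections import defaultdict
from statistics import mean


def annotation_thresholds(matches, annotation_of):
    """Per-annotation threshold: the average Z-score of matches
    between two patches that share that annotation."""
    by_annotation = defaultdict(list)
    for query, target, z in matches:
        shared = annotation_of.get(query, set()) & annotation_of.get(target, set())
        for annotation in shared:
            by_annotation[annotation].append(z)
    return {a: mean(zs) for a, zs in by_annotation.items()}


def candidate_annotations(matches, annotation_of, thresholds):
    """Non-annotated targets matched by at least two query patches
    carrying the same annotation, each at or above its threshold."""
    hits = defaultdict(set)
    for query, target, z in matches:
        if annotation_of.get(target):
            continue  # keep only non-annotated target patches
        for annotation in annotation_of.get(query, set()):
            if annotation in thresholds and z >= thresholds[annotation]:
                hits[(target, annotation)].add(query)
    return {key: sorted(qs) for key, qs in hits.items() if len(qs) >= 2}


# Hypothetical data: three annotated patches, two non-annotated ones.
annotation_of = {"p1": {"ATP"}, "p2": {"ATP"}, "p3": {"ATP"}}
matches = [
    ("p1", "p2", 6.0), ("p2", "p1", 8.0),   # define the ATP threshold
    ("p1", "u1", 7.5), ("p2", "u1", 7.0),   # two qualifying hits on u1
    ("p3", "u1", 3.0),                      # below the threshold
    ("p1", "u2", 9.0),                      # only one hit on u2
]
thresholds = annotation_thresholds(matches, annotation_of)
candidates = candidate_annotations(matches, annotation_of, thresholds)
```

In this toy data set only `u1` survives the filter; in the procedure described in the text, such surviving matches are the ones passed on to manual analysis.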
ALGORITHM
A fast and sequence/fold-independent algorithm for local surface comparison has been developed. The algorithm is suitable for large-scale structural comparison given its speed and ability to explore all the combinations of similar/identical residues in a sequence-independent way. Two subsets of amino acids are considered to match when their superposition can be associated with a low r.m.s.d. and a good residue similarity according to a chosen substitution matrix (Fig. 1). The first step of the procedure is the reduction of the spatial information: each residue is represented as a pseudo-residue composed of two points: the Cα atom and the geometric center of the side chain atoms (Fig. 1A). The algorithm starts by comparing all possible residue pairs of the query and the target patches. Good seed matches are then selected on the basis of their r.m.s.d. and residue similarity (Fig. 1A). These initial matches are then expanded sequentially by scanning all the remaining residues within 7.5 Å of the seed match (Fig. 1B). At each step (Fig. 1C–E), a new expanded match is accepted or rejected by setting a cut-off value for the r.m.s.d. (typically 0.7 Å) and for the residue similarity (typically 1.2) according to a Dayhoff substitution matrix. The algorithm stops when all possible combinations of subsets have been explored (Fig. 1F).
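The geometric part of one expansion step can be sketched as below, under stated assumptions. Each residue is reduced to a pseudo-residue of two points, as in the text. The acceptance test, however, uses a distance r.m.s.d. (comparison of intra-set distances) as a simple stand-in that needs no superposition; the algorithm described above evaluates the r.m.s.d. of a superposition, and it also checks residue similarity against a substitution matrix, which is omitted here. All function names are hypothetical, and the treatment of glycine is an assumption.

```python
import math
from itertools import combinations


def centroid(points):
    """Geometric center of a list of (x, y, z) points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))


def pseudo_residue(ca, side_chain_atoms):
    """Reduce a residue to two points: C-alpha and side-chain centroid.

    For glycine (no side chain) the C-alpha is reused; this fallback
    is an assumption, not stated in the text.
    """
    sc = centroid(side_chain_atoms) if side_chain_atoms else tuple(ca)
    return (tuple(ca), sc)


def drmsd(points_a, points_b):
    """Distance r.m.s.d. between two equally ordered point sets."""
    pairs = list(combinations(range(len(points_a)), 2))
    squared = [
        (math.dist(points_a[i], points_a[j]) - math.dist(points_b[i], points_b[j])) ** 2
        for i, j in pairs
    ]
    return math.sqrt(sum(squared) / len(squared))


def try_extend(match, q_idx, t_idx, query, target, max_rmsd=0.7):
    """Add one residue pair to a match; return the extended match,
    or None if the deviation exceeds the cut-off (in Angstrom)."""
    candidate = match + [(q_idx, t_idx)]
    pts_q = [p for qi, _ in candidate for p in query[qi]]
    pts_t = [p for _, ti in candidate for p in target[ti]]
    return candidate if drmsd(pts_q, pts_t) <= max_rmsd else None


# Toy patches: the target is the query translated by (10, 10, 10).
query = [
    pseudo_residue((0, 0, 0), [(1, 0, 0), (1, 2, 0)]),
    pseudo_residue((4, 0, 0), [(4, 1, 0)]),
    pseudo_residue((0, 5, 0), []),
]
target = [tuple(tuple(c + 10 for c in p) for p in res) for res in query]

accepted = try_extend([(0, 0)], 1, 1, query, target)   # consistent pairing
rejected = try_extend([(0, 0)], 1, 2, query, target)   # inconsistent pairing
```

A distance r.m.s.d. cannot distinguish a set of points from its mirror image, which is one reason a superposition-based r.m.s.d. is preferable in a real implementation.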
DATABASE CONTENT AND INTERFACE
Functional annotations and structural comparison results are stored in a freely accessible database; an intuitive interface allows the user to access this information. The database is built with PostgreSQL, an open source object-relational database management system which uses SQL (Structured Query Language) as the query language. The relational structure (not shown) allows easy expansion of the annotation system: new functional annotations can be added without altering the database structure. A collection of Python scripts has been developed to query the database, as well as to create the web interface.
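The SURFACE schema is not shown, so the following is only a hypothetical illustration of how a relational layout can accept new annotation categories without structural changes. It uses Python's built-in `sqlite3` module as a stand-in for PostgreSQL so that it runs without a server; all table names, column names and values are invented placeholders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory stand-in for PostgreSQL
conn.executescript("""
CREATE TABLE patch (
    id        INTEGER PRIMARY KEY,
    pdb_chain TEXT NOT NULL
);
CREATE TABLE annotation_type (
    id   INTEGER PRIMARY KEY,
    name TEXT UNIQUE NOT NULL
);
CREATE TABLE patch_annotation (
    patch_id INTEGER NOT NULL REFERENCES patch(id),
    type_id  INTEGER NOT NULL REFERENCES annotation_type(id),
    value    TEXT NOT NULL
);
""")

conn.execute("INSERT INTO patch (id, pdb_chain) VALUES (1, 'example_chain')")
conn.executemany(
    "INSERT INTO annotation_type (name) VALUES (?)",
    [("prosite",), ("ligand",)],
)


def annotate(patch_id, type_name, value):
    """Attach an annotation of an existing type to a patch."""
    conn.execute(
        "INSERT INTO patch_annotation (patch_id, type_id, value) "
        "SELECT ?, id, ? FROM annotation_type WHERE name = ?",
        (patch_id, value, type_name),
    )


annotate(1, "prosite", "example_pattern")
annotate(1, "ligand", "example_ligand")

# A new annotation category is a new row, not a schema change.
conn.execute("INSERT INTO annotation_type (name) VALUES ('elm')")
annotate(1, "elm", "example_motif")

rows = conn.execute(
    "SELECT t.name, a.value FROM patch_annotation AS a "
    "JOIN annotation_type AS t ON t.id = a.type_id "
    "WHERE a.patch_id = ? ORDER BY t.name",
    (1,),
).fetchall()
```

Because annotation categories live in a lookup table, the query that collects the annotations of a patch stays the same when a new category is introduced.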
A PDB code can be submitted, or a PDB file can be retrieved through a keyword search. If the selected PDB chain is a representative member of the non-redundant data set, the user can access the chain data, analysis and comparison data. Otherwise the representative member of the redundancy group is proposed. The complete PDB chain data set can be accessed to select a protein. Moreover, the user can access the list of the chains that bind a specific ligand or that match a chosen PROSITE pattern. A form allows the user to submit a protein sequence, retrieving the chains with the highest similarity in the non-redundant list by means of a BLAST search (13). Once a protein chain has been selected, a schematic summary of the annotations and extracted patches is shown (Fig. 2), and a table with the complete residue-by-residue information can be accessed. Graphical representations of the protein, the extracted patches and the functional annotations are accessible through RasMol (14) and through the browser plug-in CHIME, which is based on RasMol (Fig. 2).
For each surface patch, the user can retrieve all the structural comparison results with a Z-score > 3 or > 7.0. The comparison results are divided into two blocks: the upper panel shows the patches found structurally similar using the selected patch as bait, sorted by annotation and by Z-score; the lower panel shows the patches that fish the query patch, sorted by Z-score. The similarity between patches is scored via the number of superposed residues (score). The global sequence similarity between the protein chains encompassing the patches is also displayed (sequence similarity), to help the user highlight non-obvious cases. The user can select one or more matches, and a table showing the superposed residues is displayed (Fig. 3). A graphic representation of the detected structural similarities can be visualized using CHIME or RasMol. In Figure 3 a screen snapshot shows the structural similarity between the unrelated proteins human p21 RAS (PDB code: 5p21) and the bovine mitochondrial EF-Tu protein (PDB code: 1d2e).
CONCLUSIONS AND FUTURE DIRECTIONS
Given the amount of stored data and the user-friendly web interface, the SURFACE database can be a useful resource for scientific research, providing information about protein functions inferred from different sources and allowing a structural alignment to be obtained easily. This approach has been used to infer the function(s) of a set of uncharacterized proteins whose structure has been solved in structural genomics projects. By adding new categories of functional annotations, previously undiscovered similarities can be found. The next database release will include annotations derived from the ELM functional motif database (15) and from the SwissProt database features (16), as well as protein–protein interaction information derived from multimeric complexes in the PDB (17) and from the MINT database on protein–protein interactions (18). The upload of a protein structure, its comparison against the SURFACE database and the retrieval of the similarities detected will be available soon.
ACKNOWLEDGEMENTS
This work was supported by the Telethon multi-centre project GP0101Y01, the EEC project QLRT-2001-02910 and by the AIRC (Associazione Italiana per la Ricerca sul Cancro).
REFERENCES
- 1. Schmid,M.B. (2002) Structural proteomics: the potential of high-throughput structure determination. Trends Microbiol., 10 (Suppl.), S27–S31.
- 2. Kinoshita,K., Furui,J. and Nakamura,H. (2001) Identification of protein functions from a molecular surface database, eF-site. J. Struct. Funct. Genomics, 2, 9–22.
- 3. Schmitt,S., Kuhn,D. and Klebe,G. (2002) A new method to detect related function among proteins independent of sequence and fold homology. J. Mol. Biol., 323, 387–406.
- 4. Holm,L. and Sander,C. (1996) Mapping the protein universe. Science, 273, 595–603.
- 5. Holm,L. and Sander,C. (1997) An evolutionary treasure: unification of a broad set of amidohydrolases related to urease. Proteins, 28, 72–82.
- 6. Kauvar,L.M. and Villar,H.O. (1998) Deciphering cryptic similarities in protein binding sites. Curr. Opin. Biotechnol., 9, 390–394.
- 7. Whisstock,J. and Lesk,A. (2003) Prediction of protein function from protein sequence and structure. Q. Rev. Biophys., in press.
- 8. Via,A., Ferre,F., Brannetti,B., Valencia,A. and Helmer-Citterich,M. (2000) Three-dimensional view of the surface motif associated with the P-loop structure: cis and trans cases of convergent evolution. J. Mol. Biol., 303, 455–465.
- 9. Laskowski,R.A. (1995) SURFNET: a program for visualizing molecular surfaces, cavities, and intermolecular interactions. J. Mol. Graph., 13, 307–308, 323–330.
- 10. Laskowski,R.A., Luscombe,N.M., Swindells,M.B. and Thornton,J.M. (1996) Protein clefts in molecular recognition and function. Protein Sci., 5, 2438–2452.
- 11. Falquet,L., Pagni,M., Bucher,P., Hulo,N., Sigrist,C.J., Hofmann,K. and Bairoch,A. (2002) The PROSITE database, its status in 2002. Nucleic Acids Res., 30, 235–238.
- 12. Miller,S., Janin,J., Lesk,A.M. and Chothia,C. (1987) Interior and surface of monomeric proteins. J. Mol. Biol., 196, 641–656.
- 13. Altschul,S.F., Gish,W., Miller,W., Myers,E.W. and Lipman,D.J. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403–410.
- 14. Sayle,R. and Milner-White,B.J. (1995) RasMol: biomolecular graphics for all. Trends Biochem. Sci., 20, 374.
- 15. Puntervoll,P., Linding,R., Gemuend,C., Chabanis-Davidson,S., Mattingsdal,M., Cameron,S., Martin,D.M.A., Ausiello,G., Brannetti,B., Costantini,A. et al. (2003) ELM server: a new resource for investigating short functional sites in modular eukaryotic proteins. Nucleic Acids Res., 31, 3625–3630.
- 16. Boeckmann,B., Bairoch,A., Apweiler,R., Blatter,M.C., Estreicher,A., Gasteiger,E., Martin,M.J., Michoud,K., O’Donovan,C., Phan,I. et al. (2003) The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res., 31, 365–370.
- 17. Westbrook,J., Feng,Z., Chen,L., Yang,H. and Berman,H.M. (2003) The Protein Data Bank and structural genomics. Nucleic Acids Res., 31, 489–491.
- 18. Zanzoni,A., Montecchi-Palazzi,L., Quondam,M., Ausiello,G., Helmer-Citterich,M. and Cesareni,G. (2002) MINT: a Molecular INTeraction database. FEBS Lett., 513, 135–140.