Abstract
Whole-genome sequencing projects are a major source of unknown function proteins. However, as predicting protein function from sequence remains a difficult task, research groups recently started to use 3D protein structures and structural models to bypass it. MED-SuMo compares protein surfaces analyzing the composition and spatial distribution of specific chemical groups (hydrogen bond donor, acceptor, positive, negative, aromatic, hydrophobic, guanidinium, hydroxyl, acyl and glycine). It is able to recognize proteins that have similar binding sites and thus, may perform similar functions. We present here a fine example which points out the interest of MED-SuMo approach for functional structural annotation.
Keywords: annotation, function, protein structures, prediction
Background
The number of available protein sequences has increased drastically during the last decade (472 complete genomes have been sequenced, http://www.genomesonline.org/). [1] Still, about 40% of these sequences are characterized as “unknown function” [2,3]; they represented more than 3% [2] of the Protein DataBank (PDB) structures. [4]
A majority of functional annotation methods relies on sequence similarity research, e.g. ProtoNet [5], or characterized sequence motifs, e.g. PROSITE. [6] The direct use of 3D structures or structural models to assign protein functions is an emerging field. This development is due to the increasing number of available crystallographic structures, of hypothetical proteins obtained by structural genomics consortium [7] and to new automatic crystallization methods. The first dedicated methods were directly derived from 3D local similarity methods, i.e. local rigid superimposition approaches. SuMo was one of the first software to use chemical groups description combined with fast graph comparison heuristic. [8] SiteEngine, developed later, had a comparable approach. [9] ProFunc is a popular web server composed of a compendium of structure-based and sequence-based methods. [10]
Description
Recognition of similar binding regions on the protein surface is crucial for functional classification and for functional prediction. MED-SuMo (http://www.medit.fr/) is able to recognize proteins that have similar binding sites and thus may perform similar functions. It is an improved version of SuMo (http://sumo-pbil.ibcp.fr/ [8,11]) with an updated source code; it is now faster and considers an increased amount of natural and synthetic ligands. Its heuristic is based on a unique representation of macromolecules using selected triplets of chemical groups which have their own geometry, regardless of the notions of main and lateral chains of amino acids. To extract similar sites, MED-SuMo transforms the binding site (or the full structure) of a query into a graph in which vertices are triplets of chemical group. Then, it is compared to binding sites extracted from the PDB which are already pre-assessed and stored in a database. [11]
A major drawback in functional annotation is the difficulty to identify true “unknown function” proteins. The PDB website (http://www.rcsb.org/) associates more than 1,500 structures to an “unknown function” annotation. Nevertheless, numerous can be annotated using classical approaches (high sequence identity, structural homology, residue conservation analysis, sequence motifs research, Cleft analysis). As an illustration, 3-keto-L-gulonate 6-phosphate decarboxylase, a lyase, is represented by 14 proteins in the PDB. Among this family, 4 structures are classified as “unknown function”, but their functions can be found in both, the PDB and the reference paper title (e.g. PDB code: 1XBX). Moreover, they have significant sequence identity/similarity rate and low root mean square deviation (rmsd) with 10 other protein structures.
For our study, we have selected proteins from the “Joint Center for Structural Genomics” (JCSG, http://www.jcsg.org/); they have determined more than 350 protein structures. About half of these proteins are classified as “Structural Genomics Unknown Function” but most of them share sequence or fold similarity with known proteins. Tm1012 is a hypothetical protein from Thermotoga maritime (PDB code: 2EWR) and cannot be associated to proteins with any known functions. Classical approach such as PSI-BLAST [12] launched on the NR database (via NCBI web service), or dedicated tool as ProFunc [5] could identify neither any related sequence nor any set of residues potentially implicated in known interaction or protein function.
As most of these methods, MED-SuMo does not give an all-or-nothing answer. The results are set out in a hit list, which are potentially interesting regions of the protein query, superimposed with corresponding similar regions of selected targets. Concerning the 2EWR query, the best hit of MED-SuMo results corresponds to the same protein crystallized by the same consortium under different experimental conditions (PDB code: 2FCL). The following hits are not directly related to the query (not superimposable, nor sharing any significant sequence identity, i.e. less than 20%, with 2EWR): 2CJ5, 5APR and 1OD1. 5APR and 1OD1 have a significant sequence identity rate, 38 % with a rmsd of 1.8 Å. Otherwise sequence identity rates are less than 22%. Moreover, it is not possible to superimpose any of these proteins on more than 20% of their length [13], i.e. these proteins are distinct.
Figures 1 shows two regions of interest for tm1012. The first one implicates residues 134, 135 and 138 of tm1012 corresponding to residues 17, 18 and 31 of protein 2CJ5 (cell wall invertase inhibitor from Nicotiana tabacum). Figures 1a outlines the fact that these 2 proteins cannot be globally superimposed and Figures 1b displays a closer view of the local superimposition of the corresponding residues with the ligand, an acetate ion. Local rmsd is less than 0.5 Å and the two regions correspond to the same residues (YQ–L).
Figure 1.
Examples of MED-SuMo results for hypothetical protein (tm1012) from Thermotoga maritime (in green, PDB code [4]: 2EWR, crystallized by the Joint Center for Structural Genomics, JCSG, http://www.jcsg.org/). (a) Complete superimposition of tm1012 with the cell wall invertase inhibitor from Nicotiana tabacum (in blue, PDB code: 2CJ5), (b) local view of the residues superimposed by MED-SuMo, and the ligand acetate ion. (c) Superimposition of tm1012 with rhizopuspepsin (in yellow, PDB code: 5APR), (d) with the statine ligand. (e) Superimposition of binding site of tm1012 with Endothiapepsin (PDB code: 1OD1), the MEDSuMo groups are represented; blue: HBond donor, red: HBond acceptor, dark red: Positive, purple: hydroxyl function, light grey: hydrophobic
The second region implicates more residues. Figures 1c and 1d show the superimposition of tm1012 and rhizopuspepsin, 5APR. Residues S76Y77G78D79-S81 of 5APR and R99L100E101D102-T104 of tm1012 are superimposed, the same residues are involved for 1OD1 (see Figures 1e). The local rmsd is quite small even if the residues are different. Interestingly only one residue (Aspartate) is common, whereas, 9 of the groups used by MED-SuMo (hydrophobic, negative, δ+, δ-, hydroxyl [8]) to define the ligand (statine) site of 5APR and 1OD1, are present in the binding site defined by this study for tm1012. Besides, analysis of the conservation of the rhizopuspepsin binding site motif shows that only 3 out of 5 residues are conserved (QYGT-S), but 8 of the 9 groups used by MED-SuMo are always present. These results show the interest of such an approach. They must be more deeply analyzed but gives very interesting insight for further in silico and in vivo research that could permit to functionally annotate this protein.
Conclusion
Classical sequence based approaches make possible to find more than half of the protein sequence functions. However, prediction of a protein function from its 3D structure is becoming more and more important as the worldwide structural genomics initiatives continue to solve 3D structures, many of which are unknown function proteins. We highlight the interest of MED-SuMo's heuristic, providing an application example, in which MED-SuMo is able to determine the position of a potential binding site of a hypothetical protein whereas other approaches fail. This study outlines the potential use of MED-SuMo in annotating hypothetical proteinsand for enhanced sequence annotation by coupling the MED-SuMo to homology modeling pipelines. Besides, the approach may be explored for highlighting discrepancies in functional annotation that would be blurred due to fold similarities. Future studies will involve a large-scale analysis of hypothetical as well as already annotated protein structures.
Footnotes
Citation:Doppelt et al., Bioinformation 1(9): 357-359 (2007)
References
- 1.Liolios K, et al. Nucleic Acid Res. 2006;34:D332. doi: 10.1093/nar/gkj145. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.Friedberg I, et al. Protein Science. 2006;15:1527. doi: 10.1110/ps.062158406. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Sivashankari S, Shanmughavel P. Bioinformation. 2006;1:335. doi: 10.6026/97320630001335. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Berman HM, et al. Nucleic Acid Res. 2000;28:235. doi: 10.1093/nar/28.1.235. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Sasson O, et al. Protein Science. 2006;15:1557. doi: 10.1110/ps.062185706. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Hulo N, et al. Nucleic Acid Res. 2006;34:D227. doi: 10.1093/nar/gkj063. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Chandonia JM, Brenner SE. Science. 2006;311:347. doi: 10.1126/science.1121018. [DOI] [PubMed] [Google Scholar]
- 8.Jambon M, et al. Proteins. 2003;52:137. doi: 10.1002/prot.10339. [DOI] [PubMed] [Google Scholar]
- 9.Shulman-Peleg A, et al. J Mol Biol. 2004;339:607. doi: 10.1016/j.jmb.2004.04.012. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Laskowski RA, et al. J Mol Biol. 2005;351:614. doi: 10.1016/j.jmb.2005.05.067. [DOI] [PubMed] [Google Scholar]
- 11.Jambon M, et al. Bioinformatics. 2006;21:3929. doi: 10.1093/bioinformatics/bti645. [DOI] [PubMed] [Google Scholar]
- 12.Altschul SF, et al. Nucleic Acid Res. 1997;25:3389. doi: 10.1093/nar/25.17.3389. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Tyagi M, et al. Nucleic Acids Res. 2006;34:W119. doi: 10.1093/nar/gkl199. [DOI] [PMC free article] [PubMed] [Google Scholar]

