Summary
With genomic data skyrocketing, their biological interpretation remains a serious challenge. Diverse computational methods address this problem by pointing to the existence of recurrent patterns among sequence, structure, and function. These patterns emerge naturally from evolutionary variation, natural selection, and divergence—the defining features of biological systems—and they identify molecular events and shapes that underlie specificity of function and allosteric communication. Here we review these methods, and the patterns they identify in case studies and in proteome-wide applications, to infer and rationally redesign function.
Keywords: Protein engineering, function prediction, functional sites, molecular evolution, structural genomics, networks
INTRODUCTION
Proteins remain difficult to characterize functionally despite the exponential growth in experimental data on sequence, structure, and function. There are many reasons for this persistent challenge. Proteins do not have a single molecular function but rather multiple features that cooperatively sustain their biological fitness. The details and parameters of these features, e.g. folding, dynamics, cellular targeting, molecular interactions, catalytic activity, allosteric control, post-translational modifications, and degradation, are often vague for lack of laboratory assays to measure them accurately, on a large scale, and in their relevant cellular context. As a consequence, as of March 2012, fewer than 0.1% of the 21 million protein sequences from 3173 completely sequenced genomes 1 had experimentally tested functions, and only two-thirds had at least one automated computationally inferred annotation 2–4. The fraction of genes without known function is 37% in eukaryotes, 24% in humans, 33% in the far simpler and much studied E. coli, and 40% in other bacteria 2, 5. Although most of the 4225 E. coli genes were recently assigned putative annotations of functional associations, they were not assigned biochemical function 6. Given concerns that some of these annotations may not be accurate 7, the problem of translating sequence into function, and more broadly of translating genotype into phenotype, remains daunting.
Computational methods have long sought to fill this gap. A remarkable early success was to realize that sequence and structure diverge smoothly: the root mean square deviation of protein backbones increases exponentially with the sequence divergence of evolutionarily related proteins, or homologs 8. This elegant observation is robust 9, and extends to other functional features besides folding 10 so that, in practice, it justifies homology-based predictions of structure and of function 11, arguably the two most widespread computational applications in biology. Other basic evolutionary principles are emerging from high throughput and systems biology 7. Protein mutation rate and protein expression are inversely correlated 8; biological networks obey power laws and are scale-free 12; and the evolutionary rates of orthologs follow a Gaussian spread 13. Despite their statistical power, because these principles involve ensemble averages over whole sequences, structures, families, genomes and networks, as well as very long time-scales, they carry limited information on the direct contribution of individual sequence positions to the function of a given protein.
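To make the exponential relation between sequence and structure divergence concrete, the following minimal sketch evaluates it numerically. The functional form is the one described above; the coefficients 0.40 and 1.87 are the values commonly quoted from reference 8 for core main-chain atoms and are included here as an assumption that should be checked against the original paper. The function name `expected_core_rmsd` is hypothetical.

```python
import math


def expected_core_rmsd(fraction_mutated: float, a: float = 0.40, b: float = 1.87) -> float:
    """Expected backbone RMSD (in angstroms) of the common core of two homologs.

    fraction_mutated is the fraction of non-identical residues (0 to 1).
    The defaults a and b are assumed values quoted from reference 8.
    """
    if not 0.0 <= fraction_mutated <= 1.0:
        raise ValueError("fraction_mutated must lie between 0 and 1")
    return a * math.exp(b * fraction_mutated)


if __name__ == "__main__":
    # Print the predicted deviation at a few levels of sequence divergence.
    for h in (0.0, 0.25, 0.5, 0.75):
        print(f"{h:.2f} mutated -> {expected_core_rmsd(h):.2f} A")
```

Under these assumed coefficients, the predicted deviation stays small for close homologs and grows rapidly once most residues differ, which is the behavior that justifies homology-based structure prediction.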
Single residue variations may profoundly impact function, and explain why homology-based function prediction can lead to incorrect annotations: although alike in sequence and structure, two homologs may harbor differences at one or just a few residues with disproportionate impact on function 14. The identification of such key residues is therefore essential to distinguish meaningful variations of function. Accordingly, this review focuses on methods to identify functionally relevant evolutionary patterns among sequence, structure, and function. Such patterns emerge naturally from random variations and natural selection; they identify molecular events and shapes that determine function and specificity; and they can be approached by focusing on sequences, on structures, and on evolutionary classification. In the second part of the review, the focus will shift to the combination of these techniques in a unifying Evolutionary Trace framework.
Throughout the review, we will refer to two popular functional classification systems. Gene Ontology (GO) 4 provides well-defined terms for the molecular function, cellular component, and biological process of a gene product, along with evidence codes that specify the basis for the annotation and therefore its reliability. Enzyme Commission classification designates enzymatic function into four (EC) numbers 15, indicating the mechanism of the enzyme, the type of bond, the catalyzed reaction, and the substrate, respectively.
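Because accuracies later in this review are quoted at a given number of EC digits, it can help to see how two EC numbers are compared level by level. The sketch below is illustrative only; the helper names `parse_ec` and `shared_ec_depth` are hypothetical, and a dash is treated as an unspecified level.

```python
from typing import Optional, Tuple


def parse_ec(ec: str) -> Tuple[Optional[int], ...]:
    """Split 'EC 3.1.1.1' or '3.1.1.-' into four levels; '-' becomes None."""
    text = ec.strip()
    if text.upper().startswith("EC"):
        text = text[2:].strip()
    parts = text.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four EC levels, got {ec!r}")
    return tuple(None if p == "-" else int(p) for p in parts)


def shared_ec_depth(a: str, b: str) -> int:
    """Number of leading EC levels (0 to 4) on which two annotations agree."""
    depth = 0
    for x, y in zip(parse_ec(a), parse_ec(b)):
        if x is None or y is None or x != y:
            break
        depth += 1
    return depth
```

With this convention, a prediction of 3.1.1.7 for a true 3.1.1.1 enzyme counts as correct at three digits but not at four.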
SEQUENCE-BASED PATTERNS
The simplest and most widespread evolutionary pattern for defining function is homology between proteins or domains. The rationale is that homology implies that proteins share a common ancestry and hence the function of that common ancestor. Once it is recognized by similarity searches with BLAST or PSI-BLAST 16, function is transferred between close homologs. A concern is that these homologs may have already evolved distinct functions. Thus homology-based annotation errors are not uncommon: divergence of activity has been observed even between enzymes with as much as 70% sequence identity 17. To compound this problem, these errors may in turn propagate across databases 7. To reduce incorrect annotations, multiple techniques, including GOtcha 18, ESG 19, and GOPred 20, tally the GO terms of all of the most significant sequence similarity matches and identify those with the best statistics. For example, GOtcha weighs this tally by the significance of each PSI-BLAST match to a database of proteins with GO annotations, to generate a probability that the query protein performs a particular function.
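The idea of tallying GO terms over significant matches, weighted by match significance, can be sketched as follows. This is not the GOtcha algorithm itself: the weighting by the negative logarithm of the E-value, the E-value cutoff, and the normalization to a 0 to 1 score are simplifying assumptions, and the function name `score_go_terms` is hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# One search hit: (E-value of the match, GO terms annotated on the matched protein).
Hit = Tuple[float, Iterable[str]]


def score_go_terms(hits: List[Hit], max_evalue: float = 1e-3) -> Dict[str, float]:
    """Weighted fraction of significant hits that support each GO term."""
    totals: Dict[str, float] = defaultdict(float)
    norm = 0.0
    for evalue, terms in hits:
        if evalue > max_evalue:
            continue  # ignore matches that are not significant
        # Assumed weighting: more significant matches count more.
        weight = -math.log10(max(evalue, 1e-300))
        norm += weight
        for term in set(terms):
            totals[term] += weight
    if norm == 0.0:
        return {}
    return {term: total / norm for term, total in totals.items()}
```

A term supported by every significant hit scores 1, while a term carried only by weak or few matches scores close to 0, so inconsistent annotations among homologs are down-weighted rather than copied blindly.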
Other methods go beyond whole sequence comparison to focus on alignment columns with significant conservation 21, 22. The results are generalized profiles to infer structural or functional similarities. Pfam 23 is a widely used database of Hidden Markov Model profiles generated by HMMER 24 applied to the UniProt database 2. To enhance specificity, Pfam-A uses a smaller set of almost 12,000 sequences representative of individual families that were hand-curated with functional annotations from literature references; to achieve sensitivity, Pfam-B uses a larger set of nearly 140,000 families that were clustered automatically and without dedicated annotation or reference. While Pfam and methods such as Prosite 25 and InterPro 26 focus primarily on the entire protein domain, other sources, such as the ELM database 27, focus instead on smaller motifs.
Even more refined searches focus on specific residues that together define a functional signature. Transfer of function based on these signatures can increase annotation specificity, i.e. lower false positives, by recognizing functionally inconsistent differences among key residues. Several sequence motif-based algorithms were designed specifically for this task, including ConFunc 28, DME 29, and EFICAz2 30. All rely on discovering discriminatory sequence fragments shared by proteins with identical function and not others. ConFunc applies GO terms to partition homologs into multiple subsets. The sequences of each subset are then aligned to identify conserved residues. A GO term can then be transferred to a new homolog if it shares this residue signature. Controls suggest 24% greater accuracy of annotation compared to BLAST for homologs with less than 35% sequence identity. Likewise, DME and EFICAz2 use conservation to home in on functional residues specific to given enzyme functions.
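The signature-based transfer described above can be illustrated with a toy example. The sketch assumes that sequences are already aligned to equal length and defines a signature as the columns that are invariant within a functional subset but not across the whole family. This is a simplification for illustration, not the published ConFunc procedure, and all function names are hypothetical.

```python
from typing import Dict, List


def conserved_columns(aligned: List[str]) -> Dict[int, str]:
    """Map column index to residue for columns that are invariant and gap-free."""
    conserved = {}
    for col, residues in enumerate(zip(*aligned)):
        if len(set(residues)) == 1 and residues[0] != "-":
            conserved[col] = residues[0]
    return conserved


def functional_signature(subset: List[str], family: List[str]) -> Dict[int, str]:
    """Columns conserved within the subset but not across the whole family."""
    within = conserved_columns(subset)
    across = conserved_columns(family)
    return {col: aa for col, aa in within.items() if col not in across}


def carries_signature(sequence: str, signature: Dict[int, str]) -> bool:
    """True if the aligned sequence has every signature residue (empty signatures never match)."""
    return bool(signature) and all(sequence[col] == aa for col, aa in signature.items())


if __name__ == "__main__":
    subset = ["MKSHDL", "MRSHDL"]          # homologs sharing one GO term (toy data)
    family = subset + ["MKAHNL", "MRGHNL"]  # all homologs in the family
    sig = functional_signature(subset, family)
    print(sig, carries_signature("MQSHDL", sig))
```

Columns conserved across the entire family are excluded because they cannot discriminate between subsets with different functions.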
Together these studies show that comparative sequence analyses identify evolutionary patterns at different levels of resolution, from whole sequence to profiles to motifs, that are all relevant to structure and function and useful to transfer annotations among proteins.
STRUCTURE-BASED PATTERNS
Structural information adds another dimension to the search for functionally relevant similarities among proteins. First, global structure alignments will detect homologies that elude sequence searches 8. Additionally, spatial correlation among key residues can reveal highly specific three-dimensional (3D) functional features 31. Some structural comparisons treat the structure as a rigid body, as in DALI 32 and TM-align 33, while others tolerate flexibility, as in TOPS++FATCAT 34. A challenge for these structural alignments is the lack of a universally accepted definition of structural similarity 35. In order to address this, CATH 36 and SCOP 37 created manually curated protein structure classification codes based on both domain and evolutionary similarities. These classifications enable functional inference from protein structure in many cases, but overall, and for the same reasons that a few amino acids prove determinant of function in sequence comparisons, the structure-to-function relationship over protein domains is not one-to-one 38.
This motivated searches for specific structural regions resembling previously characterized pockets for catalysis and ligand-binding or surface regions for macromolecular interactions 39. In a control set of 332 ligand-binding proteins, ConCavity 40 correctly predicted the binding site in 80% of cases by searching jointly for the local conservation of sequence and structural topology. Similar methods 41, 42 are listed in Table 1. FINDSITE 43 and 3DLigandSite 44 extend these ideas to homology models and detect the functional determinants of a ligand binding site. FINDSITE specifically creates homology models of the query, structurally aligns these to determine a likely binding site, and then suggests ligands and other GO functional annotations. In controls with less than 35% sequence identity to the nearest target protein, FINDSITE reached 67% accuracy. A related method, pevoSOAR 45, annotates structures for enzymatic function with 80% accuracy in limited controls. Together these studies show that patterns of local structural similarities add important information for functional inference.
Table 1.
Method | Website | Comments |
---|---|---|
Gene Ontology | http://www.geneontology.org | Standard representation of gene and gene product attributes |
Enzyme Nomenclature | http://www.chem.qmul.ac.uk/iubmb/enzyme | Enzyme classification |
BLAST/PSI-BLAST | http://blast.ncbi.nlm.nih.gov/Blast.cgi | Sequence comparison |
GOtcha | http://www.compbio.dundee.ac.uk/Software/GOtcha/gotcha.html | Assigns GO terms based on sequence comparison |
ESG | http://kiharalab.org/web/esg.php | Assigns GO terms based on sequence comparison |
GOPred | http://kinaz.fen.bilkent.edu.tr/gopred | Assigns GO terms based on sequence comparison |
Pfam | http://pfam.sanger.ac.uk | Database of protein families and their MSA |
HMMER | http://hmmer.janelia.org | Sequence comparison based on hidden Markov models |
Prosite | http://prosite.expasy.org | Database of protein domains, families and functional sites |
Interpro | http://www.ebi.ac.uk/interpro | Database of protein functional signatures |
ELM | http://elm.eu.org/links.html | Resource to investigate functional sites in eukaryotic proteins |
ConFunc | http://www.sbg.bio.ic.ac.uk/~confunc | Assigns GO terms based on sequence comparison |
DME | http://adios.tau.ac.il/DME11.html | Assigns full EC number based on sequence comparison |
EFICAz2 | http://cssb.biology.gatech.edu/skolnick/webservice/EFICAz2/index.html | Assigns full EC number based on sequence comparison |
Dali | http://ekhidna.biocenter.helsinki.fi/dali_server | 3D protein structure comparison |
TM-align | http://zhanglab.ccmb.med.umich.edu/TM-align | 3D protein structure comparison |
TOPS++FATCAT | http://fatcat.burnham.org/TOPS | 3D protein structure comparison |
CATH | http://www.cathdb.info | Protein domain structure classification |
SCOP | http://scop.mrc-lmb.cam.ac.uk/scop | Protein domain structure classification |
ConCavity | http://compbio.cs.princeton.edu/concavity | Predicts ligand binding sites from protein structure |
FTSite | http://ftsite.bu.edu | Predicts ligand binding sites from protein structure |
LIGSITEcsc | http://projects.biotec.tu-dresden.de/pocket | Predicts ligand binding sites from protein structure |
3DLigandSite | http://www.sbg.bio.ic.ac.uk/~3dligandsite/ | A threading-based method to predict ligand binding site |
FINDSITE | http://cssb.biology.gatech.edu/skolnick/files/FINDSITE | A threading-based method to predict binding site, ligand, and function |
pevoSOAR | http://sts.bioengr.uic.edu/pevosoar | Assigns up to four digit EC numbers based on local structure similarities |
Catalytic Site Atlas | http://www.ebi.ac.uk/thornton-srv/databases/CSA | Database of known and predicted catalytic residues in the protein structures |
FunClust | http://pdbfun.uniroma2.it/funclust | Identifies local functional motifs in the protein structures |
GASPSdb | http://gaspsdb.rbvi.ucsf.edu | Database of 3D motifs generated by GASPS algorithm |
SuMo | http://sumo-pbil.ibcp.fr/cgi-bin/sumo-welcome | 3D structure comparison based on local structure similarity |
Par-3D | http://sunserver.cdfd.org.in:8080/protease/PAR_3D/index.html | Detects active site residues using 3D templates |
PINTS | http://www.russelllab.org/cgi-bin/tools/pints.pl | 3D structure comparison based on non-sequential local motifs |
FLORA | http://www.mcsg.anl.gov/ | Assigns three digit EC numbers based on local structural similarities |
GeMMA | http://www.biochem.ucl.ac.uk/cgi-bin/dlee/GeMMA | Provides classification based on phylogenetic analysis |
SCI-PHY | http://phylogenomics.berkeley.edu/ | Provides classification based on phylogenetic analysis |
PROTONET | http://www.protonet.cs.huji.ac.il | Classifies protein sequences based on phylogenetic analysis |
SIFTER | http://sifter.berkeley.edu | Assigns GO terms based on phylogenetic analysis |
PhylomeDB | http://phylomedb.org/ | Database of phylogenetic trees with ortholog assignments |
TreeFam | http://www.treefam.org/ | Database of phylogenetic trees with ortholog assignments |
ET | http://mammoth.bcm.tmc.edu/ETserver.html | Ranks amino acids based on phylogenetic analysis |
ETA | http://mammoth.bcm.tmc.edu/eta | Assigns three digit EC numbers and GO terms based on local structural similarities |
Further following the logic of sequence comparisons, structural searches can also focus on just the few residues that mediate the most essential aspects of catalysis or binding. The example of the Ser-His-Asp catalytic triad of serine proteases illustrates that only a few amino acids in a well-defined structural conformation are sufficient to annotate function in structures 46. This suggests a general strategy in which a small but functionally essential structural motif, called a 3D template, is matched geometrically in other protein structures. A matched protein may then potentially perform the function associated with the template 47. Several methods, including FunClust 48, GASPS 49, SuMo 50, PAR-3D 51, and PINTS 52 follow this strategy. They typically rely on a source of structural motifs that are functionally relevant, such as The Catalytic Site Atlas 53 database, which compiles templates for enzyme activity taken from the experimental literature. To identify enzymatic templates more generally, FLORA defines them in terms of recurrent structural patterns in the superimposed structures of enzyme homologs 54.
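A minimal sketch of geometric 3D template matching follows. It assumes that each residue is reduced to its type and C-alpha coordinates, and that a match requires identical residue types and pairwise distances that agree within a tolerance. These choices, the brute-force search, and the name `match_template` are illustrative assumptions and do not reproduce any of the tools cited above. Comparing distances alone cannot distinguish a motif from its mirror image.

```python
import math
from itertools import combinations, permutations
from typing import List, Tuple

# A residue is (three-letter type, C-alpha coordinates in angstroms).
Residue = Tuple[str, Tuple[float, float, float]]


def match_template(template: List[Residue], structure: List[Residue],
                   tol: float = 1.0) -> List[Tuple[int, ...]]:
    """Return index tuples in `structure` whose residues reproduce the template.

    Exhaustive search; only practical for small templates and small structures.
    """
    pairs = list(combinations(range(len(template)), 2))
    matches = []
    for combo in permutations(range(len(structure)), len(template)):
        # Residue types must agree position by position.
        if any(structure[i][0] != t[0] for i, t in zip(combo, template)):
            continue
        # Every pairwise distance must agree within the tolerance.
        if all(abs(math.dist(template[a][1], template[b][1])
                   - math.dist(structure[combo[a]][1], structure[combo[b]][1])) <= tol
               for a, b in pairs):
            matches.append(combo)
    return matches


if __name__ == "__main__":
    triad = [("SER", (0.0, 0.0, 0.0)), ("HIS", (3.0, 0.0, 0.0)), ("ASP", (3.0, 4.0, 0.0))]
    protein = [("GLY", (0.0, 0.0, 0.0)), ("ASP", (13.0, 14.0, 10.0)),
               ("SER", (10.0, 10.0, 10.0)), ("HIS", (13.0, 10.0, 10.0)),
               ("SER", (50.0, 50.0, 50.0))]
    print(match_template(triad, protein))
```

Because purely geometric matches of this kind are permissive, practical methods add further filters, such as restricting templates to residues of known or predicted functional importance.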
PHYLOGENOMIC PATTERNS
Molecular function may also be inferred from phylogenomic classifications. Starting with an alignment of homologs and an associated phylogenetic tree, annotations are transferred within branches following the topology of the tree 55. Typically, uncharacterized proteins can inherit the annotation of the ortholog subfamily to which they belong. GeMMA 56, SCI-PHY 57, PROTONET 58, and SIFTER 59, 60 reflect these ideas. The phylogenetic tree of PROTONET 58 has nearly 10 million sequences, and a user can retrieve the evolutionary tree relevant to a query protein of their choice, and navigate its branches to search for functional information. In a more automated approach, SIFTER models protein evolution to propagate GO annotations within the tree 59, 60. This is a slow process, but limiting the number of possible combinations of molecular functions for individual proteins significantly raises efficiency without loss of prediction accuracy 60.
Because paralogs arise from gene duplication and usually evolve different functions, it is important to distinguish them from orthologs. Algorithms that detect orthology often rely on tree reconciliation approaches. Typically, a phylogenetic tree of homologs is compared to a speciation tree, allowing paralogs and orthologs to be identified by inferring the order of events for gene loss and duplication. TreeFam 61 provides ortholog and paralog assignments based on this approach, as well as phylogenetic trees for individual proteins for mammal families. PhylomeDB 62 uses a different species-overlap algorithm, which compares the species identity of closely related branches to decide whether their parental node is a duplication or a speciation. It provides orthology predictions, alignments, and phylogenetic trees for human, Saccharomyces cerevisiae, and Escherichia coli.
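The species-overlap idea can be sketched on a toy gene tree. The representation below is an assumption made for illustration: the tree is a nested pair of tuples, each leaf is a string of the form `gene@species`, and a node is called a duplication as soon as its two child branches share any species. Published implementations may use a graded overlap score instead of this all-or-none rule.

```python
from typing import List, Set, Tuple, Union

# A tree is either a leaf "gene@species" or a pair of subtrees.
Tree = Union[str, Tuple["Tree", "Tree"]]


def species_of(leaf: str) -> str:
    """Extract the species from a 'gene@species' leaf label."""
    return leaf.split("@", 1)[1]


def label_events(tree: Tree, events: List[Tuple[str, List[str]]]) -> Set[str]:
    """Post-order traversal that appends (event, species under node) for each internal node."""
    if isinstance(tree, str):
        return {species_of(tree)}
    left, right = tree
    left_species = label_events(left, events)
    right_species = label_events(right, events)
    # Shared species on both sides imply that the gene was duplicated before they diverged.
    event = "duplication" if left_species & right_species else "speciation"
    events.append((event, sorted(left_species | right_species)))
    return left_species | right_species


if __name__ == "__main__":
    gene_tree = (("a1@human", "a1@mouse"), ("a2@human", "a2@mouse"))
    found: List[Tuple[str, List[str]]] = []
    label_events(gene_tree, found)
    for event, species in found:
        print(event, species)
```

In this example, genes that join at a speciation node (a1 in human and mouse) are orthologs, whereas genes that join only at the duplication node (a1 and a2) are paralogs.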
SYNTHESIS THROUGH EVOLUTIONARY TRACE PATTERNS
It is possible to integrate the diverse evolutionary patterns seen in sequences, motifs, templates, and phylogenies through Evolutionary Trace (ET) analysis 63. This approach applies proteome-wide and has been extensively validated in experimental case studies. It yields tools to map functional sites in proteins, identify their key determinants, guide protein redesign studies, and extract 3D functional motifs with which to annotate protein function in novel structures. In view of this variety of applications, ET patterns arise from a surprisingly basic classification procedure.
In order to discover which residues are important to structure and function, ET systematically ranks amino acid positions by their phylogenetic patterns of variation. Starting with a protein family alignment and the corresponding evolutionary divergence tree, ET ranks residue positions better, or worse, depending on whether the substitutions in their alignment column correlate with larger, or smaller, tree divergences (Figure 1). Thus, by definition, variations of top-ranked ET residues entail big evolutionary steps, suggesting that they contribute importantly to structure and function. Variations of poorly-ranked residues, by contrast, entail small evolutionary steps and suggest at best a limited influence on structure and function. Thus, by systematizing these comparisons between alignment and tree, ET ranks residue positions relative to each other by the size of their phylogenetic variations. This procedure mimics the laboratory strategy of measuring with assays which substitutions disrupt function, replacing assays and mutations in the wet lab with divergences and variations, respectively, in silico 63.
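The ranking principle can be sketched in a few lines. The sketch assumes a gap-free alignment and a precomputed series of partitions obtained by cutting the tree into 1, 2, 3, ... branches, ordered from the root outward. A column's rank is then the smallest number of branches at which the column becomes invariant within every branch. This is a simplified integer-rank illustration rather than the published ET implementation, and the function name `et_ranks` is hypothetical.

```python
from typing import Dict, List


def et_ranks(alignment: Dict[str, str], partitions: List[List[List[str]]]) -> List[int]:
    """Rank each alignment column; rank 1 is most important (invariant in the whole family).

    `partitions[k]` lists the groups of sequence names obtained by cutting the
    tree into k + 1 branches, ordered by increasing number of branches.
    """
    length = len(next(iter(alignment.values())))
    ranks = []
    for col in range(length):
        rank = len(alignment)  # fallback: invariant only when every sequence stands alone
        for groups in partitions:
            if all(len({alignment[name][col] for name in group}) == 1 for group in groups):
                rank = len(groups)
                break
        ranks.append(rank)
    return ranks


if __name__ == "__main__":
    toy_alignment = {"A": "GKSL", "B": "GKTL", "C": "GRSI", "D": "GRSV"}
    # Successive cuts of the toy tree ((A,B),(C,D)), from the root outward.
    toy_partitions = [
        [["A", "B", "C", "D"]],
        [["A", "B"], ["C", "D"]],
        [["A"], ["B"], ["C", "D"]],
        [["A"], ["B"], ["C"], ["D"]],
    ]
    print(et_ranks(toy_alignment, toy_partitions))
```

In the toy example, the second column varies only between the two main branches and therefore ranks better than the third and fourth columns, which also vary between closely related sequences.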
A series of technical studies show that the ET rank of evolutionary importance reveals structurally and functionally relevant patterns (Table 2). First, top-ranked ET residues cluster spatially in protein structure 63–65. Second, this clustering is widespread in the structural genome and greater than expected by chance as measured with a z-score to yield an overall measure of structural clustering of important residues (Figure 2). When no structure is available, sequence-based quality measures can also assess the significance of ET patterns 66. Third, these clusters overlap with functional sites as shown in 37 of 38 proteins with known ligand binding sites, and so can yield insights into the regions of a protein that mediate function most directly 64, 67. Fourth, the ET link between sequence and structure is such that better clustering z-score strongly correlates with more accurate functional sites discovery 67, as shown in 50 diverse proteins by varying the input parameters of ET and observing correlations mostly above 0.7 68. Mapping evolutionarily important residues to the structure has also been useful in other studies. Spatial clustering of important residues formed presumed functional sites useful for protein-protein docking 69 and the prediction of catalytic residues 70. Thus phylogenetic patterns of residue variations in sequences are linked to a clustering bias in structures that reveals functional sites. As discussed next, one may then interrogate a novel structure with ET to identify its functional sites and its residue determinants. In a variety of prospective experimental case studies, this guided the design of separation-of-function mutations; the rewiring of functional specificity, such as the discovery and reprogramming of an allosteric pathway; and the design of peptide inhibitors. On a structural proteomic scale, top-ranked ET residues enable large-scale function prediction.
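A clustering z-score of the kind mentioned above can be estimated by comparing the number of contacts among selected residues to the distribution obtained for random selections of the same size. The sketch below is a simplified stand-in: it counts unweighted contacts between one representative point per residue, uses an assumed 4 angstrom cutoff, and estimates the null distribution by sampling. The published measure may weight contacts differently, and the function names are hypothetical.

```python
import math
import random
import statistics
from typing import Dict, Sequence, Tuple

Coord = Tuple[float, float, float]


def contact_count(selected: Sequence[int], coords: Dict[int, Coord], cutoff: float) -> int:
    """Number of residue pairs in `selected` that lie within `cutoff` angstroms."""
    count = 0
    for i in range(len(selected)):
        for j in range(i + 1, len(selected)):
            if math.dist(coords[selected[i]], coords[selected[j]]) <= cutoff:
                count += 1
    return count


def clustering_z(selected: Sequence[int], coords: Dict[int, Coord],
                 cutoff: float = 4.0, n_random: int = 1000, seed: int = 0) -> float:
    """Z-score of observed contacts against random residue sets of equal size."""
    rng = random.Random(seed)  # fixed seed keeps the estimate reproducible
    observed = contact_count(selected, coords, cutoff)
    pool = sorted(coords)
    samples = [contact_count(rng.sample(pool, len(selected)), coords, cutoff)
               for _ in range(n_random)]
    mean = statistics.mean(samples)
    spread = statistics.pstdev(samples)
    if spread == 0.0:
        return 0.0  # no variation among random sets: the z-score is undefined
    return (observed - mean) / spread
```

A strongly positive z-score indicates that the selected residues are in contact far more often than random residues would be, which is the signature expected of a functional site.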
Table 2. Proteomic rules.
CASE STUDIES: EVOLUTIONARY PATTERNS AND FUNCTIONAL REDESIGN
Selective separation-of-function mutations helped clarify in the eukaryotic Ku70/80 heterodimer how different and antagonistic functions co-exist in the same complex, and suggested a long-sought interaction site with the gene repressor LexA in the prokaryotic protein RecA. The former study identified two structurally distant clusters of top-ranked ET residues that suggested distinct functional sites in Ku70/80. Targeted mutations to one of the clusters disrupted end-joining but not telomere maintenance, and mutations of the other cluster did the reverse. Thus double-strand break DNA repair and telomere maintenance segregate to opposite ends of the Ku structure, which explains how both functions may be performed without risking end-to-end chromosome fusion 71. Likewise, in RecA, ET revealed a number of new functional sites that were then mutated. These mutations disrupted either DNA repair by recombination, or LexA interaction, but not both. Thus, even though RecA is a heavily mutagenized, classic example for homologous DNA repair, ET patterns of evolutionary importance revealed previously unrecognized functional regions including the potential trigger of LexA-mediated error-prone DNA repair, one of the root causes of antibiotic resistance 72.
ET patterns typically identify functional sites on protein surfaces, but they can also suggest internal mechanisms. An ET study mapped key functional residues in the seven-helical transmembrane core of G protein-coupled receptors (GPCR) and suggested that distinct internal functional modules couple allosterically the binding of extracellular ligands to intracellular signaling through G proteins or β-arrestin-mediated internalization. Consistent with predictions, mutations of top-ranked ET residues in each module variously inhibited ligand binding, caused constitutive activity 73, and could even block G protein signaling while leaving β-arrestin signaling intact 74. More recently, a difference analysis of ET applied solely to bioamine receptors and applied to all rhodopsin-related receptors suggested a set of residues uniquely important to bioamine function. Single point mutations then transferred these putative bioamine specificity determinants from the 5HT-2A serotonin receptor into the D2R dopamine receptor and, as a result, increased serotonin signaling and decreased dopamine signaling independent of changes in binding affinity 75. These mutations, located deep in the GPCR transmembrane core, show that the GPCR allosteric pathway can encode signaling response specificity independently of binding, demonstrating the concept of allosteric specificity, and that this specificity code can be traced back and rekeyed, at least in part, by swapping top-ranked ET residues between paralogs.
Besides point mutations, ET patterns have been moved whole into a new scaffold to create functional mimetics. A cluster of ET residues suggested a novel binding site on surface-exposed helices of G protein-coupled receptor kinases (GRK), proteins that phosphorylate the intracellular loops of GPCRs to regulate their activity 67. This site was then mimicked with peptides designed to keep the evolutionarily important residues intact, while less important amino acids were substituted in order to stabilize a helical structure. Some of these peptides inhibited GPCR phosphorylation by 80% 67. Together these studies show that in diverse proteins and in diverse types of experimental manipulation, top-ranked ET residues consistently identify the key determinants of functional sites. They should therefore be useful as 3D functional motifs to annotate function in novel protein structures.
ETA FUNCTIONAL ANNOTATION
In order to annotate function of novel protein structures solved by structural genomics, ET Annotation (ETA) follows the 3D motifs strategies reviewed above. Uniquely, this approach repeatedly exploits ET patterns to select motifs and to filter acceptable matches. ETA applies ET ranks to the structure of an unknown protein, the query, to identify six best clustering, top-ranked ET residues at or near a protein structure’s surface: the 3D template. Simple geometric matches of such templates to protein structures of known function, the targets, often prove too non-specific to suggest identical functions accurately. However, false positives can be reduced dramatically by requiring that the matched sites in the target be composed of top-ranked residues 76; that a 3D template from the target reciprocally match the query 77; and that a plurality of targets concur in suggesting the same function 76. If so, this functional annotation may be reliably transferred to the query in high throughput fashion, with 92% accuracy for enzymes at three-digit EC numbers; and 94% accuracy for non-enzymes at the third GO depth level in over a thousand Structural Genomics protein controls 78. These studies confirm, on a large scale, that phylogenetic residue variation patterns convey highly specific structure-function information.
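Two of the filters described above, reciprocity of template matches and plurality among targets, can be sketched as a simple voting procedure. The data structures are assumptions made for illustration: `forward` maps each structure to the set of structures matched by its own template, and `functions` holds the known annotation of each target. The name `eta_vote` is hypothetical, and the geometric matching and ET rank filter are taken as already applied.

```python
from collections import Counter
from typing import Dict, Optional, Set


def eta_vote(query: str, forward: Dict[str, Set[str]],
             functions: Dict[str, str]) -> Optional[str]:
    """Return the function supported by a plurality of reciprocal matches, if any."""
    # Keep only targets whose own template matches the query back.
    reciprocal = [target for target in forward.get(query, set())
                  if query in forward.get(target, set())]
    votes = Counter(functions[t] for t in reciprocal if t in functions)
    if not votes:
        return None  # no reciprocal match to an annotated structure
    ranked = votes.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie: no single function has a plurality
    return ranked[0][0]


if __name__ == "__main__":
    matches = {
        "query": {"t1", "t2", "t3", "t4"},
        "t1": {"query"}, "t2": {"query"}, "t3": {"query"}, "t4": set(),
    }
    known = {"t1": "3.1.1", "t2": "3.1.1", "t3": "2.7.1", "t4": "1.1.1"}
    print(eta_vote("query", matches, known))
```

Returning no annotation when the evidence is absent or tied trades coverage for accuracy, which is consistent with the emphasis on reducing false positives.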
A recent extension of ETA exploits graph-based semi-supervised learning to improve function annotation specificity and coverage. The approach ties all-against-all ETA matches among all known protein structures into a network, in which nodes represent protein structures and links indicate ETA 3D structural template matches between proteins 79. Labels that indicate function are then diffused globally following the topology of this network. Although all labels reach nearly all nodes, only a fraction does so with any statistical significance. This global analysis improves accuracy by 6% (to 96% accuracy) at 65% coverage over all four EC numbers compared to ETA, and it also performs favorably against other methods 54. As further validation, a novel and nontrivial ETA network annotation was experimentally confirmed as a carboxylesterase (EC 3.1.1.1) in a vancomycin-resistant strain of Staphylococcus aureus 79. This annotation was based on matches to three structures with sequence identities ranging between 11 and 13%. These data show that global comparison of phylogenetic variation patterns of 6 residues, in a well-defined structural arrangement, uncovers accurate and specific functional information, including the resolution of substrate specificity, far into the twilight zone of protein sequence similarity.
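Label diffusion over a network of template matches can be illustrated with a generic iterative scheme, in which each node repeatedly averages the label scores of its neighbors while annotated nodes are pulled back toward their known label. This is a standard semi-supervised propagation sketch and not the specific algorithm of reference 79. The damping parameter `alpha`, the iteration count, and the name `diffuse_labels` are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def diffuse_labels(edges: Iterable[Tuple[str, str]], seeds: Dict[str, str],
                   alpha: float = 0.8, iterations: int = 50) -> Dict[str, Dict[str, float]]:
    """Spread known labels over an undirected graph; returns per-node label scores."""
    neighbors: Dict[str, Set[str]] = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    nodes = set(neighbors) | set(seeds)
    labels = sorted(set(seeds.values()))
    # Initial scores: 1 for a node's known label, 0 otherwise.
    scores = {n: {lab: 1.0 if seeds.get(n) == lab else 0.0 for lab in labels} for n in nodes}
    for _ in range(iterations):
        updated = {}
        for n in nodes:
            linked = neighbors.get(n, set())
            updated[n] = {}
            for lab in labels:
                spread = sum(scores[m][lab] for m in linked) / len(linked) if linked else 0.0
                anchor = 1.0 if seeds.get(n) == lab else 0.0
                updated[n][lab] = alpha * spread + (1.0 - alpha) * anchor
        scores = updated
    return scores


if __name__ == "__main__":
    links = [("A", "B"), ("B", "C"), ("D", "E")]
    known = {"A": "3.1.1.1", "E": "2.7.1.1"}
    result = diffuse_labels(links, known)
    print(max(result["C"], key=result["C"].get))
```

In practice the resulting scores would still need a significance threshold, since, as noted above, nearly all labels reach nearly all nodes of a connected network to some degree.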
CONCLUSIONS
The relationship between sequence, structure and function is part of the broad effort to understand how genotype is linked to phenotype. Some approaches rely on biophysical modeling and others are purely experimental. However, because genotype information is constantly in flux and a gene’s survival depends on the fitness that it encodes, evolutionary analysis is another central approach to understand how genotype relates to phenotype. The exponential dependence of deviations in structure and function as a result of deviations in sequence among homologs suggests that evolution proceeds smoothly following regular processes over long time periods. A challenge is to complement these statistical observations of evolutionary regularity with equally precise molecular level patterns that help to recover biological meaning from high throughput sequence, structure, and function data. This review shows that different approaches that compare sequences and structures, motifs and templates, correlations and phylogenetic classification are able to identify general patterns that contain precise information on molecular function.
Many of the benefits of each of these approaches are naturally contained in Evolutionary Trace analysis. This approach scores sequence positions by their relative evolutionary impact, as judged from the size of the evolutionary steps associated with their variations. Thus, residues are ranked by how well their own evolution correlates with the evolution of all other sequence positions, represented by the phylogenetic tree. Critically, residues with variations that correlate with root divergences are more important and have remarkable structural and functional properties: they cluster structurally; these clusters map functional sites; clustering quality correlates with functional site prediction; experimental mutations at top-ranked residues control function and specificity; and their mimicry enables the transfer of function to a peptide, or to other protein structures on a proteomic scale in silico. Thus top-ranked ET residues embody features in the sequence, in the structure, in the protein function, and in the phylogeny that are reproducible across the proteome. This suggests that they capture basic patterns linking genotype to phenotype during evolution. To fully support this view, however, it remains to reframe evolutionary trace analysis in a formal and extensible framework to make explicit the genotype to phenotype relationship. Such a relationship might then, in turn, help clarify the impact of missense mutations on protein function.
Highlights.
Evolutionary patterns in sequences, structures, and phylogenomic classifications can predict some aspects of protein function.
These patterns can be global in nature, such as in folds and profiles, or local, such as in motifs and templates.
The Evolutionary Trace (ET) integrates in a single framework the analysis of these different types of functionally relevant patterns.
ET residues cluster in structures and map out catalytic sites, binding interfaces and allosteric pathways and their specificity determinants.
Acknowledgments
We wish to thank Rhonald Lua and Eric Venner for helpful discussions, and gratefully acknowledge grant support from the National Institutes of Health through R01GM079656 and R01GM066099, and from the National Science Foundation, through CCF 0905536 and NSF DBI-0851393, as well as from the Cancer Prevention Research Institute of Texas, through CPRIT RP120258.
References
- 1. Liolios K, et al. The Genomes On Line Database (GOLD) in 2009: status of genomic and metagenomic projects and their associated metadata. Nucleic Acids Res. 38:D346–354. doi: 10.1093/nar/gkp848.
- 2. The Universal Protein Resource (UniProt) 2009. Nucleic Acids Res. 2009;37:D169–174. doi: 10.1093/nar/gkn664.
- 3. Liolios K, Mavromatis K, Tavernarakis N, Kyrpides NC. The Genomes On Line Database (GOLD) in 2007: status of genomic and metagenomic projects and their associated metadata. Nucleic Acids Res. 2008;36:D475–479. doi: 10.1093/nar/gkm884.
- 4. Ashburner M, et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 2000;25:25–29. doi: 10.1038/75556.
- 5. Kuznetsova E, et al. Enzyme genomics: Application of general enzymatic screens to discover new enzymes. FEMS Microbiol Rev. 2005;29:263–279. doi: 10.1016/j.femsre.2004.12.006.
- 6. Hu P, et al. Global functional atlas of Escherichia coli encompassing previously uncharacterized proteins. PLoS Biol. 2009;7:e96. doi: 10.1371/journal.pbio.1000096.
- 7. Devos D, Valencia A. Intrinsic errors in genome annotation. Trends Genet. 2001;17:429–431. doi: 10.1016/s0168-9525(01)02348-4.
- 8. Chothia C, Lesk AM. The relation between the divergence of sequence and structure in proteins. EMBO J. 1986;5:823–826. doi: 10.1002/j.1460-2075.1986.tb04288.x.
- 9. Hegyi H, Gerstein M. The relationship between protein structure and function: a comprehensive survey with application to the yeast genome. J Mol Biol. 1999;288:147–164. doi: 10.1006/jmbi.1999.2661.
- 10. Devos D, Valencia A. Practical limits of function prediction. Proteins. 2000;41:98–107.
- 11. Roy A, Kucukural A, Zhang Y. I-TASSER: a unified platform for automated protein structure and function prediction. Nat Protoc. 5:725–738. doi: 10.1038/nprot.2010.5.
- 12. Jeong H, Tombor B, Albert R, Oltvai ZN, Barabasi AL. The large-scale organization of metabolic networks. Nature. 2000;407:651–654. doi: 10.1038/35036627.
- 13. Lemoine F, Lespinet O, Labedan B. Assessing the evolutionary rate of positional orthologous genes in prokaryotes using synteny data. BMC Evol Biol. 2007;7:237. doi: 10.1186/1471-2148-7-237.
- 14. Schmidt DM, et al. Evolutionary potential of (beta/alpha)8-barrels: functional promiscuity produced by single substitutions in the enolase superfamily. Biochemistry. 2003;42:8387–8393. doi: 10.1021/bi034769a.
- 15. Bairoch A. The ENZYME database in 2000. Nucleic Acids Res. 2000;28:304–305. doi: 10.1093/nar/28.1.304.
- 16. Altschul SF, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–3402. doi: 10.1093/nar/25.17.3389.
- 17. Rost B. Enzyme function less conserved than anticipated. J Mol Biol. 2002;318:595–608. doi: 10.1016/S0022-2836(02)00016-5.
- 18. Martin DM, Berriman M, Barton GJ. GOtcha: a new method for prediction of protein function assessed by the annotation of seven genomes. BMC Bioinformatics. 2004;5:178. doi: 10.1186/1471-2105-5-178.
- 19. Chitale M, Hawkins T, Park C, Kihara D. ESG: extended similarity group method for automated protein function prediction. Bioinformatics. 2009;25:1739–1745. doi: 10.1093/bioinformatics/btp309.
- 20. Sarac OS, Atalay V, Cetin-Atalay R. GOPred: GO molecular function prediction by combined classifiers. PLoS One. 5:e12382. doi: 10.1371/journal.pone.0012382.
- 21. Capra JA, Singh M. Predicting functionally important residues from sequence conservation. Bioinformatics. 2007;23:1875–1882. doi: 10.1093/bioinformatics/btm270.
- 22. Pei J, Grishin NV. AL2CO: calculation of positional conservation in a protein sequence alignment. Bioinformatics. 2001;17:700–712. doi: 10.1093/bioinformatics/17.8.700.
- 23•. Punta M, et al. The Pfam protein families database. Nucleic Acids Res. 40:D290–301. doi: 10.1093/nar/gkr1065. With over 8000 citations, the Pfam database is used extensively to characterize protein domains based on sequence motif matches.
- 24. Finn RD, Clements J, Eddy SR. HMMER web server: interactive sequence similarity searching. Nucleic Acids Res. 39:W29–37. doi: 10.1093/nar/gkr367.
- 25. Sigrist CJ, et al. PROSITE: a documented database using patterns and profiles as motif descriptors. Brief Bioinform. 2002;3:265–274. doi: 10.1093/bib/3.3.265.
- 26. Hunter S, et al. InterPro in 2011: new developments in the family and domain prediction database. Nucleic Acids Res. 40:D306–312. doi: 10.1093/nar/gkr948.
- 27•. Dinkel H, et al. ELM--the database of eukaryotic linear motifs. Nucleic Acids Res. 40:D242–251. doi: 10.1093/nar/gkr1064. ELM differs from other motif databases by focusing on functional regions without domain-specific considerations.
- 28. Wass MN, Sternberg MJ. ConFunc--functional annotation in the twilight zone. Bioinformatics. 2008;24:798–806. doi: 10.1093/bioinformatics/btn037.
- 29. Weingart U, Lavi Y, Horn D. Data mining of enzymes using specific peptides. BMC Bioinformatics. 2009;10:446. doi: 10.1186/1471-2105-10-446.
- 30. Arakaki AK, Huang Y, Skolnick J. EFICAz2: enzyme function inference by a combined approach enhanced by machine learning. BMC Bioinformatics. 2009;10:107. doi: 10.1186/1471-2105-10-107.
- 31. Ma B, Elkayam T, Wolfson H, Nussinov R. Protein-protein interactions: structurally conserved residues distinguish between binding sites and exposed protein surfaces. Proc Natl Acad Sci U S A. 2003;100:5772–5777. doi: 10.1073/pnas.1030237100.
- 32. Holm L, Rosenstrom P. Dali server: conservation mapping in 3D. Nucleic Acids Res. 2010;38 (Suppl):W545–549. doi: 10.1093/nar/gkq366.
- 33. Zhang Y, Skolnick J. TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res. 2005;33:2302–2309. doi: 10.1093/nar/gki524.
- 34•. Veeramalai M, Ye Y, Godzik A. TOPS++FATCAT: fast flexible structural alignment using constraints derived from TOPS+ Strings Model. BMC Bioinformatics. 2008;9:358. doi: 10.1186/1471-2105-9-358. TOPS++FATCAT is a speedy search for structural neighbors against either a filtered PDB or CASP database.
- 35. Hasegawa H, Holm L. Advances and pitfalls of protein structural alignment. Curr Opin Struct Biol. 2009;19:341–348. doi: 10.1016/j.sbi.2009.04.003.
- 36. Greene LH, et al. The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution. Nucleic Acids Res. 2007;35:D291–297. doi: 10.1093/nar/gkl959.
- 37. Andreeva A, et al. Data growth and its impact on the SCOP database: new developments. Nucleic Acids Res. 2008;36:D419–425. doi: 10.1093/nar/gkm993.
- 38. Friedberg I, Godzik A. Functional differentiation of proteins: implications for structural genomics. Structure. 2007;15:405–415. doi: 10.1016/j.str.2007.02.005.
- 39. Laskowski RA, Luscombe NM, Swindells MB, Thornton JM. Protein clefts in molecular recognition and function. Protein Sci. 1996;5:2438–2452. doi: 10.1002/pro.5560051206.
- 40. Capra JA, Laskowski RA, Thornton JM, Singh M, Funkhouser TA. Predicting protein ligand binding sites by combining evolutionary sequence conservation and 3D structure. PLoS Comput Biol. 2009;5:e1000585. doi: 10.1371/journal.pcbi.1000585.
- 41. Ngan CH, et al. FTSite: high accuracy detection of ligand binding sites on unbound protein structures. Bioinformatics. 2012;28:286–287. doi: 10.1093/bioinformatics/btr651.
- 42. Huang B, Schroeder M. LIGSITEcsc: predicting ligand binding sites using the Connolly surface and degree of conservation. BMC Struct Biol. 2006;6:19. doi: 10.1186/1472-6807-6-19.
- 43•. Brylinski M, Skolnick J. FINDSITE: a threading-based approach to ligand homology modeling. PLoS Comput Biol. 2009;5:e1000405. doi: 10.1371/journal.pcbi.1000405. FINDSITE predicts binding sites, ligands, and functional annotations using homology models of structures in complex with ligands.
- 44. Wass MN, Sternberg MJ. Prediction of ligand binding sites using homologous structures and conservation at CASP8. Proteins. 2009;77 (Suppl 9):147–151. doi: 10.1002/prot.22513.
- 45. Tseng YY, Dundas J, Liang J. Predicting protein function and binding profile via matching of local evolutionary and geometric surface patterns. J Mol Biol. 2009;387:451–464. doi: 10.1016/j.jmb.2008.12.072.
- 46. Wallace AC, Laskowski RA, Thornton JM. Derivation of 3D coordinate templates for searching structural databases: application to Ser-His-Asp catalytic triads in the serine proteinases and lipases. Protein Sci. 1996;5:1001–1013. doi: 10.1002/pro.5560050603.
- 47. Meng EC, Polacco BJ, Babbitt PC, Rigden DJ. Springer; Netherlands: 2009. pp. 187–216.
- 48. Ausiello G, et al. FunClust: a web server for the identification of structural motifs in a set of non-homologous protein structures. BMC Bioinformatics. 2008;9 (Suppl 2):S2. doi: 10.1186/1471-2105-9-S2-S2.
- 49. Polacco BJ, Babbitt PC. Automated discovery of 3D motifs for protein function annotation. Bioinformatics. 2006;22:723–730. doi: 10.1093/bioinformatics/btk038.
- 50. Jambon M, et al. The SuMo server: 3D search for protein functional sites. Bioinformatics. 2005;21:3929–3930. doi: 10.1093/bioinformatics/bti645.
- 51. Goyal K, Mohanty D, Mande SC. PAR-3D: a server to predict protein active site residues. Nucleic Acids Res. 2007;35:W503–505. doi: 10.1093/nar/gkm252.
- 52. Stark A, Russell RB. Annotation in three dimensions. PINTS: Patterns in Non-homologous Tertiary Structures. Nucleic Acids Res. 2003;31:3341–3344. doi: 10.1093/nar/gkg506.
- 53. Porter CT, Bartlett GJ, Thornton JM. The Catalytic Site Atlas: a resource of catalytic sites and residues identified in enzymes using structural data. Nucleic Acids Res. 2004;32:D129–133. doi: 10.1093/nar/gkh028.
- 54. Redfern OC, Dessailly BH, Dallman TJ, Sillitoe I, Orengo CA. FLORA: a novel method to predict protein function from structure in diverse superfamilies. PLoS Comput Biol. 2009;5:e1000485. doi: 10.1371/journal.pcbi.1000485.
- 55. Eisen JA, Sweder KS, Hanawalt PC. Evolution of the SNF2 family of proteins: subfamilies with distinct sequences and functions. Nucleic Acids Res. 1995;23:2715–2723. doi: 10.1093/nar/23.14.2715.
- 56••. Lee DA, Rentzsch R, Orengo C. GeMMA: functional subfamily classification within superfamilies of predicted protein structural domains. Nucleic Acids Res. 38:720–737. doi: 10.1093/nar/gkp1049. GeMMA provides high throughput classification of functional subfamilies through pattern recognition and clustering of local sequence conservation.
- 57. Brown DP, Krishnamurthy N, Sjolander K. Automated protein subfamily identification and classification. PLoS Comput Biol. 2007;3:e160. doi: 10.1371/journal.pcbi.0030160.
- 58••. Rappoport N, Karsenty S, Stern A, Linial N, Linial M. ProtoNet 6.0: organizing 10 million protein sequences in a compact hierarchical family tree. Nucleic Acids Res. 40:D313–320. doi: 10.1093/nar/gkr1027. ProtoNet lets users traverse hierarchically classified and annotated sequences in search of functional information.
- 59. Engelhardt BE, Jordan MI, Muratore KE, Brenner SE. Protein molecular function prediction by Bayesian phylogenomics. PLoS Comput Biol. 2005;1:e45. doi: 10.1371/journal.pcbi.0010045.
- 60. Engelhardt BE, Jordan MI, Srouji JR, Brenner SE. Genome-scale phylogenetic function annotation of large and diverse protein families. Genome Res. 21:1969–1980. doi: 10.1101/gr.104687.109.
- 61. Li H, et al. TreeFam: a curated database of phylogenetic trees of animal gene families. Nucleic Acids Res. 2006;34:D572–580. doi: 10.1093/nar/gkj118.
- 62. Huerta-Cepas J, Bueno A, Dopazo J, Gabaldon T. PhylomeDB: a database for genome-wide collections of gene phylogenies. Nucleic Acids Res. 2008;36:D491–496. doi: 10.1093/nar/gkm899.
- 63. Lichtarge O, Bourne HR, Cohen FE. Evolutionarily conserved Galphabetagamma binding surfaces support a model of the G protein-receptor complex. Proc Natl Acad Sci U S A. 1996;93:7507–7511. doi: 10.1073/pnas.93.15.7507.
- 64. Madabushi S, et al. Structural clusters of evolutionary trace residues are statistically significant and common in proteins. J Mol Biol. 2002;316:139–154. doi: 10.1006/jmbi.2001.5327.
- 65. Baranski TJ, et al. C5a receptor activation. Genetic identification of critical residues in four transmembrane helices. J Biol Chem. 1999;274:15757–15765. doi: 10.1074/jbc.274.22.15757.
- 66. Wilkins AD, Lua R, Erdin S, Ward RM, Lichtarge O. Sequence and structure continuity of evolutionary importance improves protein functional site discovery and annotation. Protein Sci. 19:1296–1311. doi: 10.1002/pro.406.
- 67. Baameur F, et al. Role for the regulator of G-protein signaling homology domain of G protein-coupled receptor kinases 5 and 6 in beta 2-adrenergic receptor and rhodopsin phosphorylation. Mol Pharmacol. 77:405–415. doi: 10.1124/mol.109.058115.
- 68. Mihalek I, Res I, Lichtarge O. A structure and evolution-guided Monte Carlo sequence selection strategy for multiple alignment-based analysis of proteins. Bioinformatics. 2006;22:149–156. doi: 10.1093/bioinformatics/bti791.
- 69. Aloy P, Querol E, Aviles FX, Sternberg MJ. Automated structure-based prediction of functional sites in proteins: applications to assessing the validity of inheriting protein function from homology in genome annotation and to protein docking. J Mol Biol. 2001;311:395–408. doi: 10.1006/jmbi.2001.4870.
- 70. Gutteridge A, Bartlett GJ, Thornton JM. Using a neural network and spatial clustering to predict the location of active sites in enzymes. J Mol Biol. 2003;330:719–734. doi: 10.1016/s0022-2836(03)00515-1.
- 71. Ribes-Zamora A, Mihalek I, Lichtarge O, Bertuch AA. Distinct faces of the Ku heterodimer mediate DNA repair and telomeric functions. Nat Struct Mol Biol. 2007;14:301–307. doi: 10.1038/nsmb1214.
- 72••. Adikesavan AK, et al. Separation of recombination and SOS response in Escherichia coli RecA suggests LexA interaction sites. PLoS Genet. 7:e1002244. doi: 10.1371/journal.pgen.1002244. A long-sought RecA site that mediates LexA proteolysis, and thus triggers error-prone DNA repair, was found with ET.
- 73. Madabushi S, et al. Evolutionary trace of G protein-coupled receptors reveals clusters of residues that determine global and class-specific functions. J Biol Chem. 2004;279:8126–8132. doi: 10.1074/jbc.M312671200.
- 74. Shenoy SK, et al. beta-arrestin-dependent, G protein-independent ERK1/2 activation by the beta2 adrenergic receptor. J Biol Chem. 2006;281:1261–1273. doi: 10.1074/jbc.M506576200.
- 75••. Rodriguez GJ, Yao R, Lichtarge O, Wensel TG. Evolution-guided discovery and recoding of allosteric pathway specificity determinants in psychoactive bioamine receptors. Proc Natl Acad Sci U S A. 107:7787–7792. doi: 10.1073/pnas.0914877107. This work demonstrates allosteric pathway specificity: single point mutations of cognate ET residues create serotonin-responsive receptor mutants with wild-type binding affinity to either dopamine or serotonin.
- 76. Kristensen DM, et al. Prediction of enzyme function based on 3D templates of evolutionarily important amino acids. BMC Bioinformatics. 2008;9:17. doi: 10.1186/1471-2105-9-17.
- 77. Ward RM, et al. De-orphaning the structural proteome through reciprocal comparison of evolutionarily important structural features. PLoS One. 2008;3:e2136. doi: 10.1371/journal.pone.0002136.
- 78. Erdin S, Ward RM, Venner E, Lichtarge O. Evolutionary trace annotation of protein function in the structural proteome. J Mol Biol. 396:1451–1473. doi: 10.1016/j.jmb.2009.12.037.
- 79•. Venner E, et al. Accurate protein structure annotation through competitive diffusion of enzymatic functions over a network of local evolutionary similarities. PLoS One. 5:e14286. doi: 10.1371/journal.pone.0014286. A diffusion model was applied to a protein network of local structural and evolutionary similarities in order to predict enzymatic function and substrate. Experiments documented accurate matches down to negligible sequence identity, in the low teens.