Abstract
As the largest fraction of any proteome does not carry out enzymatic functions, and in order to leverage 3D structural data for the annotation of ever larger volumes of sequence data, we wanted to assess the strength of the link between coarse grained structural data (i.e., homologous superfamily level) and the enzymatic versus non-enzymatic nature of protein sequences. To probe this relationship, we took advantage of 41 phylogenetically diverse (encompassing 11 distinct phyla) genomes recently sequenced within the GEBA initiative, for which we integrated structural information, as defined by CATH, with enzyme level information, as defined by Enzyme Commission (EC) numbers. This analysis revealed that only a very small fraction (about 1%) of the domain sequences occurring in the analyzed genomes was associated with homologous superfamilies strongly indicative of enzymatic function. Resorting to less stringent criteria to define enzyme versus non-enzyme biased structural classes, or excluding highly prevalent folds from the analysis, had only a modest effect on this proportion. Thus, the low genomic coverage by structurally anchored protein domains strongly associated with catalytic activities indicates that, on its own, the power of coarse grained structural information to infer the general property of being an enzyme is rather limited.
Keywords: enzyme genomics, structural genomics, structure–function relationship, fold plasticity, fold innovability, CATH, protein function, protein structure, GEBA, enzymes, homologous superfamily, genome and metagenome annotation
Introduction
The era of high-throughput sequencing has yielded a deluge of sequence data that runs the risk of translating into information deserts because of limitations in annotation protocols. While experimental annotation methods are limited by throughput, computational methods are plagued with issues like imprecise discrimination between homologous versus orthologous relationships,1 transitive error propagation,2 lack of precision and limited sensitivity.3,4 With respect to the latter, structurally informed methods have the potential to increase annotation coverage beyond the twilight zone of sequence similarity. The structure-function relationship is also an essential tenet of modern biology and the rationale behind structural genomics efforts.
On the other hand, the aim of generalizing the use of biocatalysts in various fields of application would benefit from the ability to identify enzymes among the bulk of protein encoding genes, as non-enzyme proteins represent on average about 70% of the prokaryotic coding potential,5 while roughly 80% of known enzyme functions are carried out by only about a quarter of the total number of known folds. The ability to locate putative enzyme coding sequences would also benefit efforts in enzyme genomics6 such as the enzyme function initiative.7 Because most methods for the identification of enzymes rely on rather conservative and supervised schemes, which use protein sequence and/or structure derived features to assign specific enzymatic functions8–15 (typically EC codes and GO terms), we wanted to assess whether coarse grained (i.e., superfamily level) 3D structural information, as available from reference sequence and structure databases, has on its own the power to predict the enzymatic versus non-enzymatic nature of anonymous (meta)genomic sequences.
Even though statements that knowledge of the specific fold of a protein does not directly imply a function abound in the literature,16 these either refer to a much finer grained notion of function than the one we consider here (i.e., enzyme versus non-enzyme), or pertain to observations of a limited number of large superfamilies,17–20 for example, that there are about thirty different homologous superfamilies adopting the TIM barrel fold (similar observations for the Rossmann fold are also quite frequent), which cover over 60 different EC classifications. On the other hand, very valuable resources derived from large scale structural annotation of genomes do exist,18,21–23 but these do not specifically focus on the systematic coarse grained structural discrimination of enzyme versus non-catalytic proteins. To the best of our knowledge, no report addresses this question directly, a fact that recently led Jacobson et al.24 to formulate the first of their handful of “outstanding questions” (see box 1 in Ref. 24) in the following way: “Can enzymes be readily identified from sequence or structure, compared with proteins that lack catalytic function?”
Results
First, it is apparent that about two thirds (∼66%) of the protein complement from the 41 phylogenetically diverse proteomes can be structurally annotated, with not much variation in the percentage of structural annotation observed between the various proteomes, and with the same few superfamilies dominant across the various proteomes, in a way consistent with measurements from large scale structure based annotations of genomes.21–23,25 There is a significant representation of protein structural types within the selected proteomes, which includes 766 out of 1313 distinct CATH topologies/folds and 1471 out of 2626 distinct CATH homologous superfamilies.
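As a quick check of the coverage figures quoted above, the share of CATH topologies and homologous superfamilies represented in the 41 proteomes can be recomputed from the stated counts. This is only a worked arithmetic sketch; the helper name `percent` is illustrative and not part of the original analysis.

```python
def percent(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)

# Counts quoted in the text for the 41 reference proteomes.
topology_coverage = percent(766, 1313)       # CATH topologies/folds
superfamily_coverage = percent(1471, 2626)   # CATH homologous superfamilies

print(f"Topologies represented: {topology_coverage}%")
print(f"Superfamilies represented: {superfamily_coverage}%")
```

Both ratios come out a little under 60%, which is the sense in which the representation of structural types is described as significant.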
Relationship between functional and structural space
With respect to the relationship between functional (the catalytic character of protein sequences was determined by mapping EC codes from Uniprot's SwissProt and TrEMBL resources, see Methods) and structural spaces, we observed that 57% of the domains within the reference proteomes belonged to homologous superfamilies (39.6% (583) of matched superfamilies) whose member domains occur in both enzymatic and non-enzymatic protein sequences (Class I; Table 1 and Fig. 1).
Table 1.
Distribution of Structural Superfamilies and Domain Sequences Occurring in the Analyzed Genomes into Three Structural/Functional Classes, for All Superfamilies and With the 10% Largest Superfamilies Removed, Leaving Only Superfamilies With Less Than 2% Abundance (See Text)
| Class | % Superfamilies (occurring in the reference genomes): All | % Superfamilies (occurring in the reference genomes): Top 10% removed | % Domains (occurring in the reference genomes): All | % Domains (occurring in the reference genomes): Top 10% removed |
|---|---|---|---|---|
| I Enzyme +/− | 39.6 | 38.5 | 57 | 44.9 |
| II Enzyme −/− | 51.4 | 52.9 | 8 | 11 |
| III Enzyme +/+ | 8.9 | 6.7 | 0.8 | 0.8 |
| No Structural Annotation | - | - | 34 | |
Figure 1.

Distribution of sequences within CATH homologous superfamilies.
Additionally, 8% of our reference proteome domains belonged to homologous superfamilies (51.4% (756) of matched superfamilies) not associated with sequences labelled with an EC code (Class II; Table 1). Finally, less than 1% (0.83%) of the proteome domains belonged to homologous superfamilies (8.9% (127) of matched superfamilies) that so far harbour no non-enzyme sequences, that is, are always associated with enzymatic sequences (Class III).
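The strict three-class partition described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the input is assumed to be one `(superfamily, has_ec)` pair per domain, the identifiers `SF_A`, `SF_B` and `SF_C` are made up, and the percentages are computed over structurally annotated domains only (the figures in Table 1 also account for the share of domains with no structural annotation).

```python
from collections import defaultdict

def classify_strict(domains):
    """Map each superfamily to Class "I", "II" or "III".

    domains: iterable of (superfamily_id, has_ec) pairs, one per domain.
    """
    seen = defaultdict(set)
    for superfamily, has_ec in domains:
        seen[superfamily].add(bool(has_ec))
    classes = {}
    for superfamily, flags in seen.items():
        if flags == {True}:
            classes[superfamily] = "III"  # only EC labelled sequences
        elif flags == {False}:
            classes[superfamily] = "II"   # no EC labelled sequences
        else:
            classes[superfamily] = "I"    # both kinds occur ("hybrid")
    return classes

def coverage(domains, classes):
    """Percentage of superfamilies and of domains falling in each class."""
    domains = list(domains)
    sf_counts, dom_counts = defaultdict(int), defaultdict(int)
    for cls in classes.values():
        sf_counts[cls] += 1
    for superfamily, _ in domains:
        dom_counts[classes[superfamily]] += 1
    sf_pct = {c: 100.0 * n / len(classes) for c, n in sf_counts.items()}
    dom_pct = {c: 100.0 * n / len(domains) for c, n in dom_counts.items()}
    return sf_pct, dom_pct

# Hypothetical toy data: five domains spread over three superfamilies.
toy = [("SF_A", True), ("SF_A", False), ("SF_A", False),
       ("SF_B", False), ("SF_C", True)]
toy_classes = classify_strict(toy)
sf_pct, dom_pct = coverage(toy, toy_classes)
print(toy_classes, sf_pct, dom_pct)
```

Even in this toy case the hybrid class holds one superfamily in three but three domains in five, which is the kind of gap between superfamily share and domain share that the results above report.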
Table 2.
Distribution of Structural Superfamilies and Domains Occurring in the Analyzed Genomes Based on the Percentage of Enzymatic Sequences Among Their Class Members (See Text)
| % Enzyme bins | % Superfamilies (from reference genomes) | % Domains (from reference genomes) |
|---|---|---|
| 0 (0–0.49) | 52.7 | 11.2 |
| 1–10 | 9.0 | 27.0 |
| 11–20 | 6.4 | 9.8 |
| 21–30 | 3.9 | 3.8 |
| 31–40 | 3.6 | 3.5 |
| 41–50 | 3.3 | 3.6 |
| 51–60 | 2.6 | 2.9 |
| 61–70 | 2.5 | 1.2 |
| 71–80 | 2.4 | 0.6 |
| 81–90 | 2.5 | 0.8 |
| 91–99 | 2.5 | 0.8 |
| 100 | 8.6 | 0.8 |
| No structural assignment | - | 34 |
To assess the possible impact of the criteria used to define the three differently enzyme biased classes of superfamilies, we repeated the analysis under more relaxed definitions for these (see Methods, Table 2 and Supporting Information Table II) by allowing for a growing fraction of presumed false negative enzymes in Class III and possible false positive enzymes in Class II. The adoption of such increasingly less stringent definitions of enzyme versus non-enzyme biased superfamilies results in pulling out superfamilies from the “hybrid” Class I, therefore increasing the representation of superfamilies in the Class II and Class III partitions (see Supporting Information Table II).
In terms of domain coverage of the reference proteomes, this translated into significantly reduced coverage for the “hybrid” class (Class I) and significantly increased coverage for the “non-enzyme” class (Class II). Importantly, it had only marginal effect on the domain coverage of the “enzyme biased” class (Class III), which only increases from 0.8% to 3.1% (Supporting Information Table II).
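The effect of relaxing the class definitions can be illustrated with a single tolerance parameter. The sketch below is an assumption about how such a relaxation could be expressed; the actual thresholds are those of Supporting Information Table II and are not reproduced here, and the enzyme fractions in `toy_fractions` are invented.

```python
from collections import Counter

def classify_relaxed(enzyme_fraction, tolerance=0.0):
    """Assign a class from the fraction of EC labelled member sequences.

    tolerance is the allowed fraction of presumed false negatives in
    Class III and of presumed false positives in Class II (hypothetical
    parameter; 0.0 reproduces the strict definition).
    """
    if not 0.0 <= tolerance < 0.5:
        raise ValueError("tolerance must lie in [0, 0.5)")
    if enzyme_fraction >= 1.0 - tolerance:
        return "III"
    if enzyme_fraction <= tolerance:
        return "II"
    return "I"

# Invented enzyme fractions for six superfamilies.
toy_fractions = [0.0, 0.02, 0.3, 0.6, 0.97, 1.0]
strict_counts = Counter(classify_relaxed(f) for f in toy_fractions)
relaxed_counts = Counter(classify_relaxed(f, 0.05) for f in toy_fractions)
print(strict_counts, relaxed_counts)
```

As the tolerance grows, superfamilies leave Class I for Classes II and III, mirroring the direction of the shift described in the text.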
The underrepresentation of enzyme biased classes is readily apparent from Figure 2, which illustrates the segregation of the superfamilies anchored on the reference genomes into bins according to the fraction of EC labelled sequences within their member sequences. Figure 3 illustrates a complementary view of the distribution of EC labelled sequences from a domain perspective.
Figure 2.

Segregation of CATH homologous superfamilies occurring in the analyzed genomes into classes (bins) according to the proportion of sequences labelled as enzymes among their member sequences.
Figure 3.

Breakdown of domains occurring in the analyzed genomes into classes (bins) according to the proportion of enzyme associated sequences among their sequence members.
Similarly, the results turned out to be robust with respect to the removal of all superfamilies whose abundance exceeded 2%, thus demonstrating that they were not dominated by the contribution from the most abundant folds (see Table 1).
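The abundance filter can be sketched as follows. This is a minimal illustration, assuming that abundance means the share of all structurally annotated domains assigned to a superfamily; the function name, the choice of denominator and the toy identifiers are assumptions, not details taken from the study.

```python
from collections import Counter

def drop_abundant(domain_superfamilies, max_abundance=0.02):
    """Remove every domain whose superfamily exceeds max_abundance.

    domain_superfamilies: list with one superfamily id per domain.
    """
    counts = Counter(domain_superfamilies)
    total = sum(counts.values())
    keep = {sf for sf, n in counts.items() if n / total <= max_abundance}
    return [sf for sf in domain_superfamilies if sf in keep]

# Toy proteome: one dominant superfamily plus 50 rare ones (1% each).
toy_domains = ["SF_big"] * 50 + [f"SF_{i}" for i in range(50)]
filtered = drop_abundant(toy_domains)
print(len(toy_domains), len(filtered))
```

The class percentages would then be recomputed on the filtered list, which is how the comparison between the "All" and "Top 10% removed" columns of Table 1 can be read.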
The observed low genomic coverage by structurally anchored domains strongly indicative of enzymatic activity demonstrates that the power of coarse grained structural information alone to predict the catalytic nature of anonymous sequences is limited.
Caveats
First of all, the process we relied on to label sequences and structures as possessing enzymatic activity or not depends on the correctness of functional annotations in the Uniprot resource.26 Even though some concerns have been expressed over the overall accuracy of some annotation protocols,3,27 using the Uniprot resources (Swissprot and TrEMBL) for EC labelling is probably a reasonable choice, given that the vast majority of proteins arising from sequencing projects do not have any experimental support for functional assignments. This view is supported by the fact that the most frequent annotation errors measured by Schnoes et al.3 involved “overprediction” of molecular function, which is less of a concern given the very coarse grained functional classes we considered in our analysis. Beyond the error prone assignment of EC codes from sequence alignments,13 another issue is that sequences may possess enzymatic activities that have not yet been recognized, for example, because the activity is new or because of an unrecognized relationship to functionally similar relatives. On a more fundamental level, one should also point to internal limitations of the EC classification scheme itself.28 On the other hand, it is known that a single residue mutation in the active site can destroy catalytic activity, raising the possibility that some EC labelled sequences might not be (or might no longer be) functional (see Discussion). We also have to keep in mind that protein sequences, not protein domains, are assigned EC codes. Hence, labelling domains that do not directly contribute to the functionality of an enzyme with its EC code may result in some erroneously labelled superfamilies.
Discussion
Any discussion of the observed underrepresentation of “enzyme biased” superfamilies should probably rely on the concepts of fold plasticity and evolvability,29,30 and discuss these in a superfamily dependent manner. On the other hand, a straightforward rationalization of our observations could invoke superfamily antiquity, where the abundance of a fold would correlate with its age (e.g., due to evolutionary mechanisms like gene duplication), as would the probability to generate a new biological function (enzymatic or not) within it.
Such correlations between domain age and domain abundance are documented in the literature (see Ref. 31 for a recent example). Thus, if we assume that fold level structural features are not strong determinants of bio-catalytic potential, we would expect the more abundant/old superfamilies to be enriched in structures encompassing both enzyme and non-enzyme proteins (i.e., Class I superfamilies). That this is indeed the case is apparent from Figure 1.
Other clues coherent with the antiquity-based explanation for the under-representation of Class III superfamilies can be found in Ref. 31: although the full data resulting from their fold dating process is not available, it is noticeable that all of their fifteen oldest folds contain both enzyme and non-enzyme protein members (i.e., belong to Class I). However, in order to go beyond the rather obvious observation that the most ancient and highly populated superfamilies contain domains from both enzyme and non-enzyme relatives, and to exclude the possibility that these superfamilies are actually dominating the global results, we also carried out the same analysis after excluding the 10% most abundant superfamilies, so that the abundance of any of the remaining superfamilies was less than 2%. Crucially, this only altered the original figures in a very minor way (Table 1).
The absence of a strong relationship between coarse grained structural data and the propensity of a protein to exercise a catalytic activity can be interpreted in the wider contexts of structural plasticity32 and of the balancing of innovability related versus robustness related structural features.29,30 It is now well appreciated that convergent evolution of enzyme active sites is not a rare phenomenon,33 and we measured that as many as 60% (Table 3) of the three level EC classes associated with single domain proteins from the reference genomes can be linked to more than one superfamily (Tables 3 and 4, and Supporting Information data). Three level EC classes are frequently assumed to embody the overall chemical reaction type catalyzed by a given enzyme, the fourth digit acting more or less like a serial number. This emphasizes the tremendous amount of coarse grained structural degrees of freedom that are available to assemble more constrained and finer grained structural features into working catalysts. On the other hand, the observation that 40% of superfamilies can be linked to more than one EC code (Table 4) reflects the high levels of successful balancing of innovability and robustness related structural features achieved at the fold level (see Refs. 29 and 30 for a thorough discussion of this issue). These two processes, which account for the observed rarity of invention of new folds (especially at the level of enzyme proteins), concur in blurring any straightforward fold to function relationship over time.
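The count of EC classes linked to more than one superfamily can be sketched by truncating each EC code to its first three levels and collecting the superfamilies seen for each truncated code. The sketch assumes complete four-field EC codes paired with a single superfamily per single domain protein; the pairs below are invented for illustration and incomplete codes (e.g., with a trailing dash) are not handled.

```python
from collections import defaultdict

def ec_prefix(ec_code, level=3):
    """Truncate an EC code such as "1.1.1.27" to its first `level` fields."""
    return ".".join(ec_code.split(".")[:level])

def multi_superfamily_share(pairs, level=3):
    """Return (EC classes linked to >1 superfamily, total EC classes).

    pairs: iterable of (ec_code, superfamily_id) for single domain proteins.
    """
    linked = defaultdict(set)
    for ec_code, superfamily in pairs:
        linked[ec_prefix(ec_code, level)].add(superfamily)
    multi = sum(1 for sfs in linked.values() if len(sfs) > 1)
    return multi, len(linked)

# Invented (EC code, superfamily) pairs.
toy_pairs = [("1.1.1.1", "SF_A"), ("1.1.1.27", "SF_B"),
             ("3.2.1.8", "SF_C"), ("3.2.1.4", "SF_C"),
             ("2.7.7.7", "SF_D")]
third_level = multi_superfamily_share(toy_pairs, level=3)
fourth_level = multi_superfamily_share(toy_pairs, level=4)
print(third_level, fourth_level)
```

Swapping the two members of each pair gives the converse view, that is, the number of superfamilies linked to more than one EC code.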
Table 3.
Relationship Between EC Codes and CATH Codes for Single Domain Proteins in the Analyzed Genomes
| CATH assignment | % 3rd level EC codes (assigned to single domain proteins in the reference genomes) | % 4th level EC codes (assigned to single domain proteins in the reference genomes) |
|---|---|---|
| One homologous superfamily | 40 (56/140) | 81.56 (429/526) |
| Multiple homologous superfamilies | 60 (84/140) | 18.44 (97/526) |
| One topology | 42.14 (59/140) | 81.37 (428/526) |
| Multiple topologies | 57.86 (81/140) | 18.63 (98/526) |
Absolute counts are in parentheses.
Table 4.
Relationship of CATH Codes and EC Codes for Single Domain Proteins in the Reference Genomes
| EC code assignment | % Topologies (associated to EC codes) | % Homologous Superfamilies (associated to EC codes) | % All Topologies | % All Homologous Superfamilies |
|---|---|---|---|---|
| One 3rd level EC code | 61.82 (102/165) | 70.87 (180/254) | 19.96 (102/511) | 19.93 (180/903) |
| Multiple 3rd level EC codes | 38.18 (63/165) | 29.13 (74/254) | 12.33 (63/511) | 8.20 (74/903) |
| No EC code assignment | - | - | 67.71 (346/511) | 71.87 (649/903) |
| One 4th level EC code | 49.1 (81/165) | 60.39 (154/254) | 24.28 (81/511) | 17.05 (154/903) |
| Multiple 4th level EC codes | 50.9 (84/165) | 39.61 (101/254) | 29.5 (84/511) | 11.07 (100/903) |
| No EC code assignment | - | - | 67.71 (346/511) | 71.87 (649/903) |
Figures are reported for both the subset of CATH topologies and superfamilies assigned to EC codes and all superfamilies (i.e., with or without EC assignments). Absolute counts are in parentheses.
The ubiquity of hybrid classes illustrates that both enzyme and non-enzyme proteins frequently share conserved structural cores embodying distinct functionalities. Several studies of homologous enzyme and non-enzyme pairs are documented in the literature,34–37 with Todd et al.36 showing that the most frequent evolutionary scenario involved the derivation of a non-enzyme from the ancestral enzyme rather than the reverse. One of the most striking examples of a homologous enzyme and non-enzyme pair is provided by eye crystallins derived from lactate dehydrogenases, while other examples include azurin and rusticyanin, sharing a cupredoxin fold similar to L-ascorbate oxidases and copper containing nitrate-reductase, TIM-Barrel proteins Narbonin (non-enzymatic) and xylanases, glyoxylase I and bleomycin resistance protein, and PutA proline dehydrogenase and a transcriptional repressor.20,36 Related to the latter example, a strong case for the prevalence of so-called “dead enzymes” adopting new roles as biological regulators is made by Adrain and Freeman,37 who argue that most enzyme families include inactive homologues and performed a deep analysis of inactive homologues of rhomboid proteases (iRhoms). Refs. 34 and 37 also showed how eukaryotic transcription regulators can be derived from ancient enzymatic domains, and Pils and Schultz35 discuss several examples where evolutionarily conserved inactive enzyme homologues have adopted new functions in regulatory processes.
Notwithstanding the overall low coverage achieved by superfamilies strongly indicative of enzyme function, several individual superfamilies have high predictive power on their own as they were almost exclusively associated with EC labelled sequences in the analyzed genomes (see Supporting Information Table III for a listing of such superfamilies having 75% or more EC associated sequences). The most frequent topology associated to such enzyme predictive superfamilies was the Rossmann “superfold”, accounting for about 10% of these superfamilies, while other frequently observed topologies included the TIM Barrel “superfold” and folds conferring nucleotide binding capabilities, for example, as found in DNA topoisomerase and polymerase enzymes.
In summary, the fact that less than 1% of the reference genome domains analyzed could be associated with homologous superfamilies strongly indicative of enzymatic function, and that on the other hand only 8% of the analyzed domains are clearly suggestive of non-enzymatic function, indicates that high level structure based annotation alone is of limited value to predict the enzymatic versus non-enzymatic nature of anonymous sequences. Thus, inferring enzymatic functionality will require the integration of finer grained structural features (e.g., 3D structural motifs38,39) and/or rely on more stringent primary sequence conservation.
Methods
To undertake this study, we selected a set of phylogenetically diverse proteomes from the GEBA genome initiative.40 We considered proteomes from GEBA rather than all organisms in Gene3D because the former seeks to correct the phylogenetic biases manifest in the current set of sequenced bacterial and archaeal genomes through targeted phylogenomic approaches. The GEBA proteomic data was linked to structural data using the Gene3D database,23 which relies on the CATH classification system for protein structures.25 The proteomic data was then connected to enzyme related data, that is, the EC codes, using Uniprot's Swissprot and TrEMBL resources.26 For multi-domain proteins (approximately 40%41 to 60%42 of the proteins in an average prokaryotic genome, depending on domain definition and the inclusion or not of disordered and transmembrane proteins), the occurrence of an EC code at the level of the full-length protein was propagated to each of its constituent domains. The Uniprot sequence ID along with the taxonomic ID was used to link sequence information across these resources, which resulted in integrated information for 41 (3 archaeal and 38 bacterial) out of 56 GEBA genomes (the names of the corresponding organisms are listed in the Supporting Information data).
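The propagation of protein level EC codes to constituent domains can be sketched as a simple join. The data structures and identifiers below (`P1`, `SF_A`, and so on) are hypothetical; in the study the link between resources was made through the Uniprot sequence ID and taxonomic ID, which this sketch reduces to a single protein key.

```python
def propagate_ec(protein_ec, protein_domains):
    """Label every domain with the EC status of its parent protein.

    protein_ec: {protein_id: set of EC codes} (empty or missing = no EC).
    protein_domains: {protein_id: list of superfamily ids, one per domain}.
    Returns a list of (protein_id, superfamily_id, has_ec) tuples.
    """
    labelled = []
    for protein_id, superfamilies in protein_domains.items():
        has_ec = bool(protein_ec.get(protein_id))
        for superfamily in superfamilies:
            labelled.append((protein_id, superfamily, has_ec))
    return labelled

# Hypothetical inputs: P1 is a two-domain enzyme, P2 and P3 carry no EC code.
toy_ec = {"P1": {"1.1.1.27"}, "P2": set()}
toy_domains = {"P1": ["SF_A", "SF_B"], "P2": ["SF_B"], "P3": ["SF_C"]}
labelled_domains = propagate_ec(toy_ec, toy_domains)
print(labelled_domains)
```

Here `SF_B` receives an enzyme label only because it co-occurs with another domain in `P1`, which is the labelling risk for accessory domains in multi-domain proteins.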
The structural diversity, both within and between the proteomes was studied at the domain level (using the CATH domain definition25), and the domain structures were analyzed at the top four CATH levels: Class, Architecture, Topology and Homologous superfamily.25
Finally, the aggregated structural and functional data for the GEBA proteomes was used to label all the domains with respect to their propensity to occur in enzyme (Class III herein), non-enzyme (Class II) or both (Class I) proteins on the basis of their EC content. Initially, we resorted to a stringent classification scheme requiring all members of Class III to have associated EC codes and all members of Class II to be devoid of EC codes, all remaining superfamilies being labelled Class I. Subsequently, we relaxed this stringent definition, allowing for a varying fraction of Class III members to lack an EC number, and a varying fraction of Class II members to harbour one (see Supporting Information Table II). To get a broader perspective and probe the potential effect of fixing arbitrary threshold values, we partitioned the superfamilies into compartments defined by increasing enzyme content (Fig. 2 and Table 2). The resulting histogram, as well as the complementary one depicting the corresponding domain centric view (Fig. 3), makes apparent the scarcity of superfamilies with very strong biases toward enzyme content.
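The partition into compartments of increasing enzyme content can be sketched as a binning function. The bin edges follow the row labels of Table 2, but the rounding rule is an assumption: percentages are rounded to the nearest integer, values below 0.5% go to the "0" bin, and the "100" bin is reserved for superfamilies whose members are all EC labelled. The function name and the ASCII hyphens in the labels are illustrative choices.

```python
import math

def enzyme_bin(n_ec, n_total):
    """Return the Table 2 style bin label for a superfamily.

    n_ec: number of EC labelled member sequences.
    n_total: total number of member sequences (must be > 0).
    """
    if n_ec == n_total:
        return "100"
    pct = 100.0 * n_ec / n_total
    if pct < 0.5:
        return "0"
    # Round half up, then clamp so that only exact 100% reaches the "100" bin.
    rounded = min(99, max(1, math.floor(pct + 0.5)))
    if rounded > 90:
        return "91-99"
    low = ((rounded - 1) // 10) * 10 + 1
    return f"{low}-{low + 9}"

examples = {
    (0, 10): enzyme_bin(0, 10),
    (1, 300): enzyme_bin(1, 300),
    (1, 10): enzyme_bin(1, 10),
    (2, 10): enzyme_bin(2, 10),
    (299, 300): enzyme_bin(299, 300),
    (10, 10): enzyme_bin(10, 10),
}
print(examples)
```

Counting superfamilies per label gives the superfamily centric histogram, and summing their domain counts per label gives the domain centric one.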
On the other hand, as both the distribution of the number of distinct enzymatic functions associated with different CATH folds and the genomic abundance of CATH folds are known to be skewed, we also repeated the analysis after excluding the most prevalent folds (i.e., those whose abundance is higher than 2%) in order to assess whether this known bias was driving our results (Table 1).
Subsequently, the distribution of the different domain labelled classes was calculated for all the proteomes combined, and the resulting information was used to probe the question: “Can enzymatic sequences be differentiated from non-enzymatic sequences based on high-level structure alone?”.
Evaluation of non-EC annotated sequences with gene ontology terms
Because the dependence on EC number annotation might result in under-prediction of enzymatic activity bearing homologous superfamilies, the sequences devoid of EC code were further evaluated against the functional gene ontology (GO).43 This was undertaken in order to check that the functional GO terms they are associated with are not descendants of higher level metabolic related nodes (e.g., “catalytic activity”). This resulted in the rescuing of only 20 superfamilies from the pool of non-enzyme superfamilies: 15 of these ended up in Class I and 5 in Class III.
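The GO check described above amounts to asking whether a term has a given node among its ancestors. The sketch below walks a child-to-parents mapping; `GO:0003824` is the GO identifier for "catalytic activity", while the small graph itself is a hand-written toy with simplified edges and does not reproduce the real ontology. Loading the actual GO graph is left out.

```python
CATALYTIC_ACTIVITY = "GO:0003824"

def is_descendant(term, ancestor, parents):
    """True if `ancestor` is reachable from `term` by following parent links.

    parents: {term_id: list of parent term ids}. A term counts as a
    descendant of itself in this sketch.
    """
    stack, seen = [term], set()
    while stack:
        current = stack.pop()
        if current == ancestor:
            return True
        if current in seen:
            continue
        seen.add(current)
        stack.extend(parents.get(current, ()))
    return False

# Toy child -> parents mapping (simplified, for illustration only).
toy_parents = {
    "GO:0016787": ["GO:0003824"],   # hydrolase activity -> catalytic activity
    "GO:0003677": ["GO:0003676"],   # DNA binding -> nucleic acid binding
    "GO:0003676": ["GO:0005488"],   # nucleic acid binding -> binding
}
hydrolase_is_catalytic = is_descendant("GO:0016787", CATALYTIC_ACTIVITY, toy_parents)
dna_binding_is_catalytic = is_descendant("GO:0003677", CATALYTIC_ACTIVITY, toy_parents)
print(hydrolase_is_catalytic, dna_binding_is_catalytic)
```

A sequence without an EC code but with at least one GO term passing this test would be treated as enzymatic, which is how a superfamily could move out of the non-enzyme pool.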
Glossary
- EC: Enzyme Commission
- CATH: Class, Architecture, Topology and Homologous Superfamily
- GEBA: Genomic Encyclopedia of Bacteria and Archaea
Supporting Information
Additional Supporting Information may be found in the online version of this article.
Supporting Information Figure 1.
Supporting Information Figure 2.
Supporting Information Table 1.
Supporting Information Table 2.
Supporting Information Table 3.
References
- Lee D, Redfern O, Orengo C. Predicting protein function from sequence and structure. Nature Rev Mol Cell Biol. 2007;8:995–1005. doi: 10.1038/nrm2281.
- Gilks WR, Audit B, De Angelis D, Tsoka S, Ouzounis CA. Modeling the percolation of annotation errors in a database of protein sequences. Bioinformatics. 2002;18:1641–1649. doi: 10.1093/bioinformatics/18.12.1641.
- Schnoes AM, Brown SD, Dodevski I, Babbitt PC. Annotation error in public databases: misannotation of molecular function in enzyme superfamilies. PLoS Comput Biol. 2009;5:e1000605. doi: 10.1371/journal.pcbi.1000605.
- Devos D, Valencia A. Practical limits of function prediction. Proteins. 2000;41:98–107.
- Blattner FR, Plunkett G, 3rd, Bloch CA, Perna NT, Burland V, Riley M, Collado-Vides J, Glasner JD, Rode CK, Mayhew GF, Gregor J, Davis NW, Kirkpatrick HA, Goeden MA, Rose DJ, Mau B, Shao Y. The complete genome sequence of Escherichia coli K-12. Science. 1997;277:1453–1462. doi: 10.1126/science.277.5331.1453.
- Karp PD. Call for an enzyme genomics initiative. Genome Biol. 2004;5:401. doi: 10.1186/gb-2004-5-8-401.
- Gerlt JA, Allen KN, Almo SC, Armstrong RN, Babbitt PC, Cronan JE, Dunaway-Mariano D, Imker HJ, Jacobson MP, Minor W, Poulter CD, Raushel FM, Sali A, Shoichet BK, Sweedler JV. The enzyme function initiative. Biochemistry. 2011;50:9950–9962. doi: 10.1021/bi201312u.
- [IUPAC-IUB Commission on Biochemical Nomenclature (CBN). Nomenclature of multiple enzyme types. Recommendations 1971] Hoppe Seylers Z Physiol Chem. 1972;353:852–854.
- Claudel-Renard C, Chevalet C, Faraut T, Kahn D. Enzyme-specific profiles for genome annotation: PRIAM. Nucleic Acids Res. 2003;31:6633–6639. doi: 10.1093/nar/gkg847.
- Dobson PD, Doig AJ. Distinguishing enzyme structures from non-enzymes without alignments. J Mol Biol. 2003;330:771–783. doi: 10.1016/s0022-2836(03)00628-4.
- Kumar N, Skolnick J. EFICAz2.5: application of a high-precision enzyme function predictor to 396 proteomes. Bioinformatics. 2012;28:2687–2688. doi: 10.1093/bioinformatics/bts510.
- Nagao C, Nagano N, Mizuguchi K. Prediction of detailed enzyme functions and identification of specificity determining residues by random forests. PLoS One. 2014;9:e84623. doi: 10.1371/journal.pone.0084623.
- Quester S, Schomburg D. EnzymeDetector: an integrated enzyme function prediction tool and database. BMC Bioinform. 2011;12:376. doi: 10.1186/1471-2105-12-376.
- Radivojac P, Clark WT, Oron TR, Schnoes AM, Wittkop T, Sokolov A, Graim K, Funk C, Verspoor K, Ben-Hur A, Pandey G, Yunes JM, Talwalkar AS, Repo S, Souza ML, Piovesan D, Casadio R, Wang Z, Cheng J, Fang H, Gough J, Koskinen P, Toronen P, Nokso-Koivisto J, Holm L, Cozzetto D, Buchan DW, Bryson K, Jones DT, Limaye B, Inamdar H, Datta A, Manjari SK, Joshi R, Chitale M, Kihara D, Lisewski AM, Erdin S, Venner E, Lichtarge O, Rentzsch R, Yang H, Romero AE, Bhat P, Paccanaro A, Hamp T, Kassner R, Seemayer S, Vicedo E, Schaefer C, Achten D, Auer F, Boehm A, Braun T, Hecht M, Heron M, Honigschmid P, Hopf TA, Kaufmann S, Kiening M, Krompass D, Landerer C, Mahlich Y, Roos M, Bjorne J, Salakoski T, Wong A, Shatkay H, Gatzmann F, Sommer I, Wass MN, Sternberg MJ, Skunca N, Supek F, Bosnjak M, Panov P, Dzeroski S, Smuc T, Kourmpetis YA, van Dijk AD, ter Braak CJ, Zhou Y, Gong Q, Dong X, Tian W, Falda M, Fontana P, Lavezzo E, Di Camillo B, Toppo S, Lan L, Djuric N, Guo Y, Vucetic S, Bairoch A, Linial M, Babbitt PC, Brenner SE, Orengo C, Rost B, Mooney SD, Friedberg I. A large-scale evaluation of computational protein function prediction. Nature Meth. 2013;10:221–227. doi: 10.1038/nmeth.2340.
- von Grotthuss M, Plewczynski D, Ginalski K, Rychlewski L, Shakhnovich EI. PDB-UF: database of predicted enzymatic functions for unannotated protein structures from structural genomics. BMC Bioinform. 2006;7:53. doi: 10.1186/1471-2105-7-53.
- Zhang D, Iyer LM, Burroughs AM, Aravind L. Resilience of biochemical activity in protein domains in the face of structural divergence. Curr Opin Struct Biol. 2014;26C:92–103. doi: 10.1016/j.sbi.2014.05.008.
- Furnham N, Sillitoe I, Holliday GL, Cuff AL, Laskowski RA, Orengo CA, Thornton JM. Exploring the evolution of novel enzyme functions within structurally defined protein superfamilies. PLoS Comput Biol. 2012;8:e1002403. doi: 10.1371/journal.pcbi.1002403.
- Sillitoe I, Cuff AL, Dessailly BH, Dawson NL, Furnham N, Lee D, Lees JG, Lewis TE, Studer RA, Rentzsch R, Yeats C, Thornton JM, Orengo CA. New functional families (FunFams) in CATH to improve the mapping of conserved functional sites to 3D structures. Nucleic Acids Res. 2013;41:D490–498. doi: 10.1093/nar/gks1211.
- Martinez Cuesta S, Furnham N, Rahman SA, Sillitoe I, Thornton JM. The evolution of enzyme function in the isomerases. Curr Opin Struct Biol. 2014;26C:121–130. doi: 10.1016/j.sbi.2014.06.002.
- Todd AE, Orengo CA, Thornton JM. Evolution of function in protein superfamilies, from a structural perspective. J Mol Biol. 2001;307:1113–1143. doi: 10.1006/jmbi.2001.4513.
- Fox NK, Brenner SE, Chandonia JM. SCOPe: Structural Classification of Proteins--extended, integrating SCOP and ASTRAL data and classification of new structures. Nucleic Acids Res. 2014;42:D304–309. doi: 10.1093/nar/gkt1240.
- Wilson D, Madera M, Vogel C, Chothia C, Gough J. The SUPERFAMILY database in 2007: families and functions. Nucleic Acids Res. 2007;35:D308–313. doi: 10.1093/nar/gkl910.
- Yeats C, Lees J, Reid A, Kellam P, Martin N, Liu X, Orengo C. Gene3D: comprehensive structural and functional annotation of genomes. Nucleic Acids Res. 2008;36:D414–418. doi: 10.1093/nar/gkm1019.
- Jacobson MP, Kalyanaraman C, Zhao S, Tian B. Leveraging structure for enzyme function prediction: methods, opportunities, and challenges. Trends Biochem Sci. 2014;39:363–371. doi: 10.1016/j.tibs.2014.05.006.
- Cuff AL, Sillitoe I, Lewis T, Clegg AB, Rentzsch R, Furnham N, Pellegrini-Calace M, Jones D, Thornton J, Orengo CA. Extending CATH: increasing coverage of the protein structure universe and linking structure with function. Nucleic Acids Res. 2011;39:D420–426. doi: 10.1093/nar/gkq1001.
- Magrane M, Consortium U. UniProt Knowledgebase: a hub of integrated protein data. Database. 2011;2011:bar009. doi: 10.1093/database/bar009.
- Schnoes AM, Ream DC, Thorman AW, Babbitt PC, Friedberg I. Biases in the experimental annotations of protein function and their effect on our understanding of protein function space. PLoS Comput Biol. 2013;9:e1003063. doi: 10.1371/journal.pcbi.1003063.
- Schomburg I, Chang A, Schomburg D. Standardization in enzymology—Data integration in the world's enzyme information system BRENDA. Perspect Sci. 2014;1:15–23.
- Dellus-Gur E, Toth-Petroczy A, Elias M, Tawfik DS. What makes a protein fold amenable to functional innovation? Fold polarity and stability trade-offs. J Mol Biol. 2013;425:2609–2621. doi: 10.1016/j.jmb.2013.03.033.
- Toth-Petroczy A, Tawfik DS. The robustness and innovability of protein folds. Curr Opin Struct Biol. 2014;26C:131–138. doi: 10.1016/j.sbi.2014.06.007.
- Bukhari SA, Caetano-Anolles G. Origin and evolution of protein fold designs inferred from phylogenomic analysis of CATH domain structures in proteomes. PLoS Comput Biol. 2013;9:e1003009. doi: 10.1371/journal.pcbi.1003009.
- Dessailly BH, Dawson NL, Mizuguchi K, Orengo CA. Functional site plasticity in domain superfamilies. Biochim Biophys Acta. 2013;1834:874–889. doi: 10.1016/j.bbapap.2013.02.042.
- Gherardini PF, Wass MN, Helmer-Citterich M, Sternberg MJ. Convergent evolution of enzyme active sites is not a rare phenomenon. J Mol Biol. 2007;372:817–845. doi: 10.1016/j.jmb.2007.06.017.
- Aravind L, Koonin EV. A colipase fold in the carboxy-terminal domain of the Wnt antagonists--the Dickkopfs. Curr Biol. 1998;8:R477–478. doi: 10.1016/s0960-9822(98)70309-4.
- Pils B, Schultz J. Inactive enzyme-homologues find new function in regulatory processes. J Mol Biol. 2004;340:399–404. doi: 10.1016/j.jmb.2004.04.063.
- Todd AE, Orengo CA, Thornton JM. Sequence and structural differences between enzyme and nonenzyme homologs. Structure. 2002;10:1435–1451. doi: 10.1016/s0969-2126(02)00861-4.
- Adrain C, Freeman M. New lives for old: evolution of pseudoenzyme function illustrated by iRhoms. Nature Rev Mol Cell Biol. 2012;13:489–498. doi: 10.1038/nrm3392.
- Kristensen DM, Chen BY, Fofanov VY, Ward RM, Lisewski AM, Kimmel M, Kavraki LE, Lichtarge O. Recurrent use of evolutionary importance for functional annotation of proteins based on local structural similarity. Protein Sci. 2006;15:1530–1536. doi: 10.1110/ps.062152706.
- Furnham N, Holliday GL, de Beer TA, Jacobsen JO, Pearson WR, Thornton JM. The Catalytic Site Atlas 2.0: cataloging catalytic sites and residues identified in enzymes. Nucleic Acids Res. 2014;42:D485–489. doi: 10.1093/nar/gkt1243.
- Wu D, Hugenholtz P, Mavromatis K, Pukall R, Dalin E, Ivanova NN, Kunin V, Goodwin L, Wu M, Tindall BJ, Hooper SD, Pati A, Lykidis A, Spring S, Anderson IJ, D'Haeseleer P, Zemla A, Singer M, Lapidus A, Nolan M, Copeland A, Han C, Chen F, Cheng JF, Lucas S, Kerfeld C, Lang E, Gronow S, Chain P, Bruce D, Rubin EM, Kyrpides NC, Klenk HP, Eisen JA. A phylogeny-driven genomic encyclopaedia of Bacteria and Archaea. Nature. 2009;462:1056–1060. doi: 10.1038/nature08656.
- Apic G, Gough J, Teichmann SA. Domain combinations in archaeal, eubacterial and eukaryotic proteomes. J Mol Biol. 2001;310:311–325. doi: 10.1006/jmbi.2001.4776.
- Ekman D, Bjorklund AK, Frey-Skott J, Elofsson A. Multi-domain proteins in the three kingdoms of life: orphan domains and other unassigned regions. J Mol Biol. 2005;348:231–243. doi: 10.1016/j.jmb.2005.02.007.
- Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G. Gene ontology: tool for the unification of biology. Gene Ontology Consortium. Nat Genet. 2000;25:25–29. doi: 10.1038/75556.
