Abstract
The diversity of characterized protein functions found amongst experimentally interrogated proteins suggests that a vast array of unknown functions remains undiscovered. These protein functions are imparted by specific geometric distributions of amino acid residue chemical moieties, each contributing a functional interaction. We hypothesize that individual residue function contributions are predictable through sequence analytic knowledge based algorithms, and that they can be recombined to understand composite protein function by predicting spatial relation in tertiary structure. We assess the former by training a meta-functional signature algorithm to specifically predict calcium ion binding residues from protein sequence. We estimate the latter by testing for match between predictive contribution of positions in predicted secondary structures and patterns of side chain proximity forced by secondary structure moieties. Specific training for calcium binding results in 83% area under the receiver operator characteristic curve added value over random (AUCoR) and p<10⁻³⁰⁰ significance as measured by Kendall’s τ in ten fold cross validation for parallel sets of 811 residues in 336 proteins and 696 residues in 299 proteins. Training for generalized function results in 63% AUCoR and p≅10⁻²²¹ for the same tests. Including inference of side chain proximity improves predictive ability by 2% AUCoR consistently. The results demonstrate that protein meta-functional signatures can be trained to predict specific protein functions by considering amino acid identity and structural features accessible from sequence, laying the groundwork for composite sequence based function site prediction.
Keywords: Protein sequence analysis, Protein function prediction, Calcium, Protein binding site, Functional signature
1. Introduction
The increasing abundance of genomic data calls for accurate and informative automated sequence analysis algorithms to understand biologic function. Millions of genes across 1,129 fully sequenced genomes have not been experimentally characterized beyond sequence [1]. The astronomical number of experiments necessary to characterize the organisms encoded by these genomes to match the contemporary data for Saccharomyces cerevisiae or Escherichia coli would be an unreasonable use of resources. Rather, these data demand dramatic improvements in the informatic modeling of gene function to guide bench exploration [2].
We previously demonstrated that available data describing protein function can be transferred as annotations to protein gene products without the limitations of homology mapping and in the absence of tertiary structure [3]. We use the concept of a meta-functional signature (MFS) to combine incongruent measures of functional information encoded in the protein sequence into an estimate of functional importance for each amino acid residue. Here we extend MFS to include physicochemical conservation and conservation of residues predicted to be nearby in the functional conformation, to supplement amino acid type, sequence and evolutionary conservation. Philosophically distinct conservation measures have been shown to be synergistic in predictive ability [46,3]. Thus we anticipate that physicochemical, entropic, and evolutionary conservation would be complementary. We train the combination of algorithms that estimate these parameters by logistic regression for the specific protein function of calcium binding to demonstrate specificity imparted by amino acid type and structural inferences (figure 1).
1.1. Sequence based structural inferences of function
Protein substrate specificity is governed by geometric distribution of polarity, charge, and hydrophobicity. The spatial pattern of substrate electron density is complementarily mirrored by the protein to thermodynamically favor binding [4,5]. Differences in residue identity within an otherwise similar binding site and protein scaffold facilitate metabolite preferences and variation of enzymatic reaction [6,7]. Meanwhile, since it is the variation in these sites which enables specificity, differences in residue conservation for the position are minimal, which reduces the accessibility of specific functions to prediction by automated algorithms. Thus it is not surprising that incorporation of predicted tertiary structure improves identification of functional sites [8].
1.2. Sequence based metal ion binding prediction
In the recent international blinded community wide experiment on the critical assessment of techniques for protein structure prediction (CASP8), we applied MFS as a predictive algorithm for substrate binding with a simple distance threshold to nonlocal contacts from our predicted tertiary structures. We submitted ten or fewer predicted residues for each protein without knowing the identity of the substrate ligand, or whether one was present in the crystal. These predictions matched the real metal ion binding sites with a Matthews correlation coefficient (MCC) of 0.6, and coverage of 85% (true positive predictions divided by all real function sites). The coverage for each protein correlated with the quality of the related predicted structure (Pearson’s R = 0.52). We were the third best ranked in predicting metal ion binding sites out of over one hundred participating groups from around the world [8].
1.3. Relevance of residue function prediction to tertiary structure
Accuracy of protein tertiary structure prediction without a template structure is limited, and the computation is cumbersome [9]. Contemporary bench structure assessment methods may be limited to approximately 40% of proteins. For example, 7,179 structures have been successfully characterized by the structural genomics initiatives, but work on 28,090 targets has ceased [10,11]. Analysis of protein binding reveals high entropy and enthalpy for unbound states, while binding of the physiologic partner induces stability in a thermodynamic tradeoff similar to that of folding [12,13]. The portions of metal ion binding sites that are dynamic when unbound are often observed in the difficult to model loop regions. The electronegativity required to coordinate the positively charged metal ion would be mutually repulsive for binding residues without the ion mediator, adding noise to the structure prediction process which affects the quality of the entire model. For example, specific consideration of the zinc binding loop led to the most accurate CASP8 model for target T0476, produced by the Baker group, who filtered templates for this region based on specific constraints allowing close proximity without disulfide bonds for four cysteine side chains in the loop [14].
1.4. Spatial clustering of functional residues
Protein residues do not function in isolation. Mechanisms are most commonly specified by the arrangement of spatially clustered side chains, with main chain contributions less dependent on residue identity. Therefore if side chain proximity is known (e.g. nonlocal contacts), accurate prediction for the functional contribution of one residue can be used to improve function prediction for nearby residues. Thus sequence based methods to infer structural parameters of function are desired [15,16]. We approach this goal by inferring side chain proximity from geometric features of secondary structure motifs, and consider the distribution of physicochemical properties for the same residue position in orthologs [17]. The relevance of structure to metal binding demonstrated by our predictions in the CASP8 experiment motivated us to consider a physiologically relevant specific type of metal ion binding site.
1.5. Calcium ions
Calcium is the most abundant metal and fifth most abundant element in animals, and essential for life. Protein calcium interactions mediate essential physiology including cellular trafficking via vesicle fusion, fission, secretion, and uptake; electrical impulses for cellular signaling via creation of solute gradients; biomineralization by inclusion with negatively charged salts [18]; and metabolic control via hormone sequestration such as osteocalcin binding to calcium atoms along the hydroxyapatite surface of bone [19]. Calcium binding represents a unique protein function, completely separable from organic substrates and interchangeable only with magnesium. Additionally, while the exact binding mechanisms of most ligands remain elusive, the common addition of calcium salts into the mother liquor of protein crystallization and the ease of identifying this heavy atom in the diffraction pattern give detailed experimental characterization of nearly four thousand protein calcium binding sites in the Protein Data Bank [20]. Filtering for nonspecific crystal interactions and protein redundancy yields roughly three hundred proteins to use as one benchmark set.
1.6. Previous protein calcium ion binding residue prediction methods
Previous approaches to computational prediction for mechanisms of protein function have traditionally focused on mapping annotation by detection of similar structure or sequence [8,21–25]. These methods are limited by the ability of the search engine to find a similar protein about which more is known, and also by the need for such a protein to exist [26,27]. Other automated function prediction methods do not depend on mapping. Such methods commonly exploit features derived from the protein structure such as deep pockets [28], unstable side chains thermodynamically poised for metabolite binding [12], or spatial clusters of oxygen atoms for metal ion binding [29,30]. Yet the need for an experimentally derived structure limits the application of these methods tremendously.
1.7. Sequence based protein calcium ion binding residue prediction
Sequence based approaches that measure conservation of the position amongst many similar sequences are limited by the particular feature modeled in the estimation of residue conservation, e.g. conservation throughout evolution, presence across contemporary proteins, or physicochemical conservation. We overcome this limit by designing measures different enough to be combined [3]. When using a single measure of conservation, the best scoring residues are generally catalytic, and many methods have been designed to specifically find these residues [32–33]. However, methods trained for a broad range of functions achieve similar or better performance in detecting catalytic residues [3,34]. Thus an open question is whether a sequence based method can derive better predictions for a specific application such as calcium binding.
Therefore we designed a study to create an algorithm that determines calcium binding residues in a protein sequence using regression to train the contribution of amino acid identity, with sequence, evolutionary, physicochemical, and neighbor conservation for the residues observed to bind calcium in a nonredundant set of proteins in the Protein Data Bank [20] (figure 1).
2. Research Design and Methods
We predict functional contribution to calcium binding by amino acid type, functional importance scores based on multiple sequence alignments, and the scores of residues predicted to be nearby in 3D space. We then use backwards stepwise multiple regression to remove score types that do not add weight to the prediction with statistical significance (p>0.001) and at least 2% contribution to the score, i.e. include all scores, then remove one at a time with cycles of training by logistic regression until they all add significant improvement to the training set. We employ supervised learning only by forcing the maintenance of all amino acid types, as a base from which to improve. All trained methods are tested by ten fold cross validation, within the below logistic regression equation that comprises MFSCa (figure 1).
Each component algorithm of MFSCa (HMMRE, SSR, AA, CloseSS, sMAPP, see below) is assigned a coefficient (a, b, c, d, e) trained in the regression. Sub coefficients are enumerated for each amino acid type (ci) and each of one to four positions separated in each secondary structure type (di,j). ε denotes the error term.
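As a reading aid, the regression described above might be written as follows. This is a sketch assembled only from the coefficient definitions in the text, assuming a standard logit link. P denotes the modeled probability that a residue binds calcium, i in the AA sum runs over the amino acid type variables, and in the CloseSS sum i runs over the one to four sequence separations and j over the predicted secondary structure types. Any intercept term is not shown because the text does not mention one.

```latex
\operatorname{logit}(P)
  = a\,\mathrm{HMMRE}
  + b\,\mathrm{SSR}
  + \sum_{i} c_{i}\,\mathrm{AA}_{i}
  + \sum_{i=1}^{4}\sum_{j} d_{i,j}\,\mathrm{CloseSS}_{i,j}
  + e\,\mathrm{sMAPP}
  + \varepsilon
```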
MFS1Ca refers to retraining the regression with the same algorithms (HMMRE, SSR, AA) as the original sequence based MFS, and MFS2Ca refers to the regression that includes these as well as the novel algorithms CloseSS and sMAPP.
2.1. Multiple sequence alignment analytic algorithms
We use the position specific iterative basic local alignment search tool (PSI-BLAST [35]) to find similar protein sequences from the nonredundant database [36]. More sensitive and specific methods have emerged, such as the HMM-HMM predictive comparison method (HHpred [37]) and PSI-BLAST intermediate sequence search (ISS [38]), which are reviewed by us in Horst and Samudrala, 2009 [16]. While PSI-BLAST results have inherent limitations of sensitivity that would best be avoided, we do overcome the specificity problem in part by applying the multiple sequence comparison by log-expectation algorithm (MUSCLE [39]) to the PSI-BLAST output, and filtering the top 250 nearest neighbors in the resulting multiple sequence alignment (MSA). For each protein we use a single pass of PSI-BLAST and MUSCLE calculations (each with internal iterations) to drive the entire prediction pipeline, such that predictions for thousands of proteins can be made within a day by our processor farm. Each of the following algorithms calculates functional importance in a trivial amount of time, given this single MSA.
2.1.1. HMMRE
We train a hidden Markov model (HMM) from the MSA using the Hmmer package [40], and compare emission frequency estimates from the model with the amino acid background frequency in nature given by karlin.c of the BLAST program package [35] to produce the HMM relative entropy score for each amino acid position [3,41]. Here we make a significant change by constraining the Markov chain to the architecture of the protein sequence, rather than using the chain apparent from conservation measured in the MSA, as we have done in the past.
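The relative entropy comparison can be illustrated with a short Python sketch. It assumes the score is the usual Kullback-Leibler form, the sum over amino acids of p·log2(p/q), with p the emission estimate and q the background frequency. The function name, the base 2 logarithm, and the toy two letter alphabet are illustrative assumptions, not details taken from the Hmmer package or karlin.c.

```python
import math

def relative_entropy(emission, background):
    """Relative entropy (in bits) of emission frequencies versus background.

    emission, background: dicts mapping residue letter -> probability.
    Terms with zero emission probability contribute nothing to the sum.
    """
    total = 0.0
    for aa, p in emission.items():
        if p > 0.0:
            total += p * math.log2(p / background[aa])
    return total

# Toy two-letter alphabet (hypothetical values, for illustration only).
background = {"D": 0.5, "G": 0.5}
conserved = {"D": 1.0, "G": 0.0}      # column dominated by one residue
unconserved = {"D": 0.5, "G": 0.5}    # column matching the background

print(relative_entropy(conserved, background))
print(relative_entropy(unconserved, background))
```

A column whose emission estimates match the background scores zero, and the score grows as the column departs from the background.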
2.1.2. SSR
We model the evolutionary context of each position by creating a maximum parsimony phylogenetic tree for the surrounding sequence of each position using the PHYLIP platform [42]. Each protein in the MSA is treated as a leaf in the tree, and the root represents the theoretical ancestral sequence. We quantify the evolutionary divergence of the position by taking the ratio of different amino acid states appearing at the particular position, to the total number of step changes in the modeled evolution between the input and ancestral protein within the phylogenetic tree, termed the state to step ratio (SSR) [3].
2.1.3. HMMRE vs. SSR
We previously designed these residue conservation measures to separately compare the residue position and identity to all available modern proteins via multiple sequence alignment column HMM relative entropy, and to the evolution of the protein modeled by an evolutionary tree for each position [3]. Conservation along the evolution of a protein is specific to the physiologic environment and use of the protein, whereas similarity amongst other contemporary proteins assesses the role of residues in similar functional sites in differing contexts. The methods were shown to be complementary for generalized function prediction [3].
2.1.4. sMAPP
The multivariate analysis of protein polymorphisms algorithm (MAPP) uses an MSA of protein sequence orthologs (the matching protein in another species) to estimate a mean for each of six physicochemical values for each position (MSA column) [17]. For each physicochemical value, deviation from the mean is calculated for all twenty amino acids, and a single composite value is generated by a center of mass calculation on a principal component transformation, wherein each physicochemical property is taken as a coordinate axis. Then the Euclidean distance of each amino acid from this center of mass composite value is taken to estimate the effect of a mutation at that position [17]. We calculate the geometric spread for the MAPP scores of all possible mutations for each residue position, as a novel improvement to predictive accuracy over the values given directly by MAPP (see supplementary figure 1) with the equation below. sMAPP denotes the spread of MAPP scores for a particular position, calculated as the root mean squared difference of the MAPP score for each amino acid type (i) to the arithmetic mean of the nineteen other amino acid types (j).
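Based on the verbal definition above, the sMAPP spread might be written as follows. This is a sketch, not a transcription of the original equation: it assumes the mean of the squared differences is taken over all twenty amino acid types, and MAPP_i denotes the MAPP score for a mutation to amino acid type i at the position.

```latex
\mathrm{sMAPP}
  = \sqrt{\frac{1}{20}\sum_{i=1}^{20}
      \left(\mathrm{MAPP}_{i}
        - \frac{1}{19}\sum_{j \neq i}\mathrm{MAPP}_{j}\right)^{2}}
```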
2.1.5. CloseSS
Protein residues come close together in 3D space to form functional sites. We created a method to consider the joint function of residues predicted to be close in 3D space by secondary structure prediction (close by secondary structure = CloseSS). We hypothesize that the probability of concordant function for a residue one through four positions away is related to the secondary structure predicted for the evaluated position. The standard types of predictable secondary structure motifs bring residue side chains together in 3D space in somewhat predictable ways. When considering a residue in an alpha helix, residues two positions away will not be as relevant to the functional site as if the residue were in a beta sheet. Side chains in the n+2 position of an extended beta strand will tend to be near the position, as will the side chains of n+3 and n+4 for an alpha helix (figure 5). Since the functional moieties of the residues will be close together, they may function together in the same calcium binding site. PSIPRED version 2.61 was used for secondary structure prediction [45]. We tried other freely available secondary structure prediction methods for both calcium binding prediction and generalized function prediction, and achieved similar results (data not shown).
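One way to picture the CloseSS inputs is the Python sketch below. It is not the published implementation. It assumes that each residue already carries a per-residue functional importance score, that neighbors at n−k and n+k are averaged, and that the predicted secondary structure uses the three letter codes H (helix), E (strand), and C (coil). The function and variable names are hypothetical.

```python
SS_TYPES = ("H", "E", "C")   # helix, strand, coil
OFFSETS = (1, 2, 3, 4)       # sequence separations considered

def closess_features(scores, ss):
    """Return one feature dict per residue, keyed by (ss_type, offset).

    Only the features for the residue's own predicted secondary structure
    are filled, with the mean score of the neighbors at that separation;
    all other features stay at zero.
    """
    if len(scores) != len(ss):
        raise ValueError("scores and ss must have the same length")
    features = []
    for n, state in enumerate(ss):
        row = {(t, k): 0.0 for t in SS_TYPES for k in OFFSETS}
        for k in OFFSETS:
            neighbors = [scores[m] for m in (n - k, n + k)
                         if 0 <= m < len(scores)]
            if neighbors:
                row[(state, k)] = sum(neighbors) / len(neighbors)
        features.append(row)
    return features

# Hypothetical per-residue scores and predicted secondary structure.
feats = closess_features([0, 1, 2, 3, 4, 5], "HHHEEE")
print(feats[3][("E", 2)])
```

Keeping a separate feature for every pairing of secondary structure type and separation lets the regression assign each its own coefficient, which is how a preference such as the fourth position in a helix could emerge from training.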
2.1.6. AA type
Binary dummy variables are added to represent all amino acid types except one. Identity of the corresponding amino acid results in a score of one. Alanine is represented by zero values for all amino acid variables. We force the inclusion of all nineteen variables in the reverse stepwise logistic regression model, as a foundation from which to improve. Coefficients for the amino acid type variables are trained in logistic regression.
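The encoding described above can be sketched in Python as follows. The ordering of the nineteen variables and the function name are illustrative assumptions. The text fixes only that alanine is the reference type represented by all zeros.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # the twenty standard types
REFERENCE = "A"                        # alanine: all dummy variables zero
DUMMY_ORDER = [aa for aa in AMINO_ACIDS if aa != REFERENCE]

def encode_residue(aa):
    """Return the nineteen binary dummy variables for one residue."""
    if aa not in AMINO_ACIDS:
        raise ValueError(f"unknown amino acid type: {aa!r}")
    return [1 if aa == other else 0 for other in DUMMY_ORDER]

print(encode_residue("D"))
print(encode_residue("A"))
```

With this encoding, each trained coefficient expresses the contribution of an amino acid type relative to alanine.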
2.1.7. MFS
We apply the meta-functional signature method exactly as described in our previous work [3]. The HMMRE, SSR, and amino acid type scores were combined with a logistic regression model trained on catalytic and ligand binding sites.
2.2. Data sets
We developed two parallel datasets of 336 (set “0”) and 299 (set “1”) independent protein chains with <35% sequence identity for which Xray crystal diffraction structures demonstrate direct calcium ion binding. These datasets were generated by starting with two random high resolution (<2.1 Å) proteins from the set of 3976 calcium binding chains in the Protein Data Bank (PDB [18,43]), and progressively adding proteins to maximize diversity and maintain <60% sequence identity (independence) between the two sets. Higher resolution was favored when considering addition of two similar proteins. The calcium ions in the resulting set are kinetically stable, as described by B factors less than 40 Å², where 60 Å² is widely regarded as unstable. Binding residues were defined as those with at least one side chain atom within half an Ångström plus the van der Waals radii to a calcium ion in the crystal Xray diffraction structure. While many carbonyl oxygen atoms contribute to binding, only side chains were considered for specific binding interactions. Protein chains in the sets range in length from 45 to 1332 residues. Set 0 contains 811 binding residues and 71,724 nonbinding residues in 336 proteins, and set 1 contains 696 binding residues and 62,893 residues in 299 proteins. The distribution of calcium ion binders represents roughly one binding residue for every 90 residues in these proteins.
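The binding residue definition can be illustrated with the Python sketch below. It assumes the cutoff is the sum of the two van der Waals radii plus 0.5 Å, which is one reading of the definition above. The radii in the example are placeholders chosen for easy arithmetic, not the values used to build the benchmark sets, and all names are hypothetical.

```python
import math

def is_binding_residue(side_chain_atoms, calcium_xyz,
                       vdw_radii, calcium_radius, slack=0.5):
    """True if any side chain atom lies within the contact cutoff.

    side_chain_atoms: list of (element, (x, y, z)) for one residue.
    Cutoff per atom = atom radius + calcium radius + slack (all in Å).
    """
    for element, xyz in side_chain_atoms:
        cutoff = vdw_radii[element] + calcium_radius + slack
        if math.dist(xyz, calcium_xyz) <= cutoff:
            return True
    return False

# Placeholder radii (Å), for illustration only.
PLACEHOLDER_RADII = {"O": 1.5, "N": 1.6, "C": 1.7}
PLACEHOLDER_CALCIUM_RADIUS = 2.0

carboxylate_oxygen = [("O", (0.0, 0.0, 0.0))]
print(is_binding_residue(carboxylate_oxygen, (3.9, 0.0, 0.0),
                         PLACEHOLDER_RADII, PLACEHOLDER_CALCIUM_RADIUS))
```

Main chain atoms are simply left out of `side_chain_atoms`, matching the restriction to side chain contacts.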
2.3. Evaluation analysis
2.3.1. Ten fold cross validation
Ten fold cross validation is an assessment protocol for knowledge based (informatic) algorithms in which the training set is also used for testing. The training set is divided randomly into ten roughly equivalent subsets; each of ten versions of the algorithm is trained on the remaining 90% of the data and assessed for accuracy based on predictions for the held-out 10% subset. Here, for each benchmark set we distribute all residues randomly across ten nonoverlapping test sets of similar size. The remaining residues (~90%) for each set are used to train the regression model, which is then tested on the respective test set. Since no residue is tested more than once, each test is independent and so analyses can be performed and graphed together. We also performed bootstrapped cross validation with 10 maximally different sets separating roughly half the proteins into each training or test set, but analyses of these tests cannot be graphed together and thus are described but not shown.
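The residue level split can be sketched in a few lines of Python. This is a minimal illustration of the protocol, not the code used in the study. The function name and the fixed random seed are assumptions.

```python
import random

def ten_fold_splits(n_items, n_folds=10, seed=0):
    """Yield (train_indices, test_indices) for each fold.

    Items are shuffled once, then dealt into n_folds nonoverlapping
    test sets of similar size; each item is tested exactly once.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[f::n_folds] for f in range(n_folds)]
    for f in range(n_folds):
        test = folds[f]
        train = [i for g in range(n_folds) if g != f for i in folds[g]]
        yield train, test

splits = list(ten_fold_splits(103))
print([len(test) for _, test in splits])
```

Because every index appears in exactly one test set, the predictions from all ten folds can be pooled into a single curve.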
2.3.2. ROC
The receiver operator characteristic (ROC) displays the balance of specificity and sensitivity across the range of possible score thresholds. This plot is valuable to enable critical assessment of the weaknesses of a method, demonstrating where accuracy is separable between methods, and in informing selection of a cutoff threshold appropriate to the particular application.
2.3.3. AUCoR
Accuracy across the range of thresholds is summarized by measuring the area under the ROC curve (AUC), for which 50% is random and 100% is perfect. The AUC estimates the probability of concordance between prediction and reality. We employ the term “AUCoR” as any contribution over random prediction, a fraction of perfect prediction: twice the difference between AUC and random. This AUCoR value gives a more representative estimation of added value by the algorithm than the ROC.
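The relation between AUC and AUCoR can be made concrete with a short Python sketch. It computes the AUC directly as the probability that a randomly chosen positive outscores a randomly chosen negative, counting ties as half, and then applies the definition above. The pairwise loop is quadratic and meant only for illustration, and the function names are hypothetical.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise concordance."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def aucor(scores, labels):
    """AUC over random: twice the difference between AUC and 0.5."""
    return 2.0 * (auc(scores, labels) - 0.5)

print(auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))
print(aucor([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))
```

Under this definition, an AUCoR of 83% corresponds to an AUC of 91.5%, and a random predictor scores 0% rather than 50%.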
2.3.4. Precision recall curve
This plot compares the rate of true positives amongst all positive predictions (precision) to the amount of true positive cases retrieved (recall). This analysis informs users of the proportion of positive instances retrievable at a particular precision, and vice versa.
2.3.5. MCC
The Matthews correlation coefficient (MCC), or the Φ (phi) coefficient, estimates the similarity between two data sets. Here the MCC is the resulting value of applying an equation to compare a set of binary predictions (e.g. functional or not) to the real values. The true positive (TP), false positive (FP), true negative (TN), and false negative (FN) counts are combined by the following equation (note the relation to χ² test value):
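For reference, the standard definition of the coefficient, which we assume is the form intended here, and its relation to the χ² statistic of the corresponding 2×2 contingency table with n = TP + FP + TN + FN observations, are:

```latex
\mathrm{MCC}
  = \frac{\mathrm{TP}\cdot\mathrm{TN} - \mathrm{FP}\cdot\mathrm{FN}}
         {\sqrt{(\mathrm{TP}+\mathrm{FP})(\mathrm{TP}+\mathrm{FN})
                (\mathrm{TN}+\mathrm{FP})(\mathrm{TN}+\mathrm{FN})}},
\qquad
\lvert \mathrm{MCC} \rvert = \sqrt{\frac{\chi^{2}}{n}}
```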
2.3.6. MCC distribution
We plot the Matthews correlation coefficient for each prediction score threshold. This analysis depicts the predictive value across all thresholds for each method. Complexity of the predictive distribution communicates applicability of non-linear learning methods such as decision trees and support vector machines. A Gaussian distribution would imply simple scaling of performance by threshold score. Skewness and multiple local extrema (maxima and minima) suggest complex features which might be lost to the simple regression we apply here. The threshold score with the highest correlative value is found as the highest point in the curve. We find this depiction to be more rigorous and informative than the ROC.
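The MCC distribution can be computed as in the Python sketch below, which evaluates the standard MCC formula at each threshold and then picks the highest point of the curve. It assumes that a residue is predicted positive when its score is at or above the threshold, and that the MCC is reported as zero when its denominator vanishes. Both conventions and all names are assumptions for illustration.

```python
import math

def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def mcc_distribution(scores, labels, thresholds):
    """Return a list of (threshold, MCC) points."""
    curve = []
    for t in thresholds:
        tp = fp = tn = fn = 0
        for s, y in zip(scores, labels):
            predicted = s >= t
            if predicted and y:
                tp += 1
            elif predicted:
                fp += 1
            elif y:
                fn += 1
            else:
                tn += 1
        curve.append((t, mcc(tp, fp, tn, fn)))
    return curve

def best_threshold(curve):
    """Return the (threshold, MCC) point with the highest MCC."""
    return max(curve, key=lambda point: point[1])

curve = mcc_distribution([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0], [0.0, 0.5, 1.0])
print(curve)
print(best_threshold(curve))
```

Plotting the returned points against the threshold gives the distribution described above, and local extrema in that curve are what would point toward non-linear learning methods.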
2.3.7. Kendall’s τ (tau)
This nonparametric statistical rank sum test is more efficient for large and non-normal distributions than Student’s t-test. The probabilistic prediction methods used in this experiment produce roughly normal distributions, but others such as amino acid type result in entirely non-normal score distributions. As well, the many residues tested comprise quite large sets. The resulting probability values confer a measure of stability for respective AUC values. This test is equivalent to the Mann-Whitney U test when one variable is binary, as is the case for the binder versus nonbinder residues here.
3. Results
3.1. Calcium binding residue predictions by each algorithm
The best single predictor of calcium binding is the amino acid type (figures 2 and 3). This comes as no surprise, as we only consider side chain contacts and ignore coordination by main chain carbonyls as nonspecific. Aspartic acid (29% contribution to the set 0 logistic regression model, 27% contribution to the set 1 logistic regression model), glutamic acid (21%, 20%), and asparagine (21%, 16%) are the principal amino acid types that contribute heavily to the prediction score for both data sets. The next largest contributions demonstrate separable function for the previously mentioned amino acids: glutamine gives 4.4% contribution to the set 1 logistic regression model but only 1.0% for the set 0 model, and threonine contributes 2.4% for set 0 but <2% for set 1 and therefore is removed.
Other significant contributors include the HMMRE sequence conservation score, the SSR evolutionary conservation score, the sMAPP physicochemical spread score, all three possible predicted secondary structures (α helix, β strand, random coil), and the CloseSS neighbor conservation scores (figures 2 and 3). There are substantial differences between the performance of all algorithms, except that between HMMRE and SSR for which the correlations with respect to each other are 0.36 (Pearson’s R) for set 0 and 0.32 for set 1 (figure 4). The differences in performance can be seen between the CloseSS, SSR, and HMMRE methods in the MCC distribution analysis (figure 3) which is not visible in the ROC or precision recall analysis (figure 2). This informs use of optimal threshold cutoffs.
3.2. Secondary structure as a predictor of functional specificity
Significant contributions from the CloseSS method arise from the third and fourth positions in α helices, the second position in β strands, and the third position in random coils (figure 5). Meanwhile, the only consistently positive contributions are the fourth positions in α helices and the second position in β strands, which were predicted to make the most significant contribution to active sites by side chain proximities in idealized or average secondary structure geometries. The pattern of selectivity for secondary structure positions of calcium binders is separable from that for the general function benchmark set used in the MFS publication [3]. The first position in an α helix is more likely to contribute to function in the MFS set, and the second position of a random coil becomes prominent for calcium binders (figure 5). While the other component algorithms are limited to predict importance to function, these structural features denote specificity.
The CloseSS method is the least significant contributor on its own (figures 3 and 4), but is predictive of calcium binding function (24.8% AUCoR for set 0, 27.4% for set 1), significant at the p<10⁻³⁴ level for both data sets (Kendall’s τ). When added to the logistic regression compilation (difference between MFS1Ca and MFS2Ca), CloseSS improves the accuracy of MFS2Ca with a consistent increase of 1.6% AUCoR for each set.
3.3. The combination of multiple algorithms outperforms any single method
Training the algorithm for specific rather than generalized function improves performance by 16.4% AUCoR for set 0 and 19.2% for set 1 (figure 2). Application of MFS2Ca trained on one data set to the other displays near equivalent profiles of accuracy to the applied data set (see supplementary figure 1). A ten fold bootstrap test for which we train on 50% of the benchmark set proteins and test on the remainder maintains consistent ROC AUC values: the mean AUCoR score for set 1 applied to set 0 is 82.2% with a standard deviation of 2.2%, and that for the reverse is 84.1% ± 1.9%. These analyses together demonstrate stability for the predictive ability of the method and indicate saturation of the regression models and an absence of overtraining. Thus we achieve significant improvement for a specific modality of protein function.
The ROC plots illustrate improvement in calcium binding prediction specificity by logistic regression combination over amino acid type across nearly all sensitivity levels. At 20% sensitivity we improve upon amino acid type specificity by 7.2% (which only considers aspartic acids below 50% sensitivity), reaching a nearly perfect 99.6% specificity. The specificity values of the ROC analysis are particularly relevant for biochemical analysis, as these data suggest that experiments (or further computational analyses) designed to interrogate residues scoring in this range will be prescriptive of outcome. The logistic regression combination for both data sets reaches a specificity and sensitivity combination of 87% (figure 2). By interrogating the MCC distribution, we readily observe the threshold score cutoff giving the most information, roughly 0.25 MCC at 15 of 100 for both data sets (figure 3).
3.4. Example prediction on a protein with physiologically significant calcium binding
We applied the MFS2Ca logistic regression to the 165 residues of calmodulin in a recently characterized structure (PDB id 3ewt). The method was retrained after removing the calmodulin homolog from the training set (figure 6). Calmodulin specifically binds four calcium ions in loops flanked by alpha helices. Calcium binding alters the relative stability of extended conformations, allowing greater flexibility in the central region, thereby enabling calmodulin to bind a wide variety of protein substrates affecting physiologic processes including inflammation, metabolism, apoptosis, muscle contraction, intracellular movement, memory, nerve growth and immune response [44].
The top 10% of MFS2Ca predictions include three of the four calcium binding residues for each site, excluding nearly as many nonbinding aspartic acids (figure 6). The top 20% scoring residues include all binders, again enriching over amino acid type. Comparison to the generalized function prediction method of MFS demonstrates the utility of training a functional signature for a specific protein function (figure 6).
4. Discussion
For a given protein sequence, the residues and their degree of functional importance can be thought of as a signature representing the function of the protein. We previously developed a combination of knowledge- and biophysics-based function prediction approaches to elucidate the relationships between the structural and functional roles of individual protein residues. Such a meta-functional signature (MFS) may be used to study proteins of known function in greater detail and to aid experimental characterization of proteins of unknown function [3].
In the year since publishing the MFS method, our server has been used over a thousand times by hundreds of different users. MFS was applied with an automated filter for high scoring residues close in the tertiary structure, on ligand binding sites in the blinded CASP8 function prediction experiment. The approach performed as one of the top algorithms generally and the third best for metal binding prediction. We were surprised to observe that training MFS on a set of organic ligand binders did not perform as well as when training MFS on a diverse combination of function types. Upon closer examination, we learned that the chemical moieties for which we failed to predict binding were not in the training set. Meanwhile we predicted nearly all catalytic and most metal ion binding residues. Therefore we set out to test the ability of MFS to be trained for a highly specific function type, such as a particular moiety or metal ion, here embodied as calcium for its physiologic relevance (figure 1). We also attempt to recover the gain from modeling the complete protein structure by abstracting geometric patterns in secondary structure and physicochemical conservation across orthologs.
Training for a specific type of function improves predictions by 16–19% AUCoR, with p<10⁻²⁰ MCC profile significance (figures 2 and 3). We use a p<0.0001 and 2% contribution filter for nonsignificant contributions of components in the logistic regression, applied in backwards stepwise multiple regression. This approach removes data that could lead to overtraining, and indicates the most consistently informative algorithms. Application to two parallel benchmark sets shows stability of accuracy for the method (figures 2, 3, and supplementary figure 1). We present an illustrative example of improvement over the general protein function MFS method for the physiologically ubiquitous protein calmodulin, for which we recover 3/4 of the binding site residues in the top 10% and all in the 20% (figure 6).
We attempt to design tools to replace the requirement for tertiary structure with sequence based methods. Separable patterns of amino acid type and predicted secondary structures emerge (figure 5). The CloseSS analysis matches our prediction that informative patterns of side chain proximity are carried in secondary structure. For example, the fourth position in an α helix, the second position in a β strand, and the second position in a loop contribute the most predictive value for calcium binding. Comparison of selected positions for each predicted secondary structure type to those for a large set of metabolite binding and catalytic residues (described in [3]) demonstrates structural features of these functions (figure 5). For example, the predictive ability of other residues in random coils (right column) diminishes with sequence distance for the general function set, while calcium ions are often bound by every other residue in a loop. The analysis suggests that these trends present separable predictions imparting specificity for function prediction. Protein meta-functional signature algorithms can specify particular functions using structural features to build upon separation by amino acid type.
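The idea that secondary structure forces particular sequence positions into side chain proximity can be illustrated with a short sketch. This is not the CloseSS implementation. It assumes that the "fourth" and "second" positions named above correspond to sequence offsets of 4 (helix) and 2 (strand and loop), that predicted secondary structure is given as a string of `H`, `E`, and `C`, and that a partner only counts if it lies in the same uninterrupted segment. The names `OFFSETS`, `proximity_partners`, and `neighbor_feature` are hypothetical.

```python
# Assumed sequence offsets per predicted secondary structure state:
# H = alpha helix, E = beta strand, C = coil/loop.
OFFSETS = {"H": 4, "E": 2, "C": 2}


def proximity_partners(ss, i):
    """Indices at +/- the state-specific offset from residue i that remain
    inside the same uninterrupted secondary structure segment."""
    state = ss[i]
    k = OFFSETS[state]
    partners = []
    for j in (i - k, i + k):
        if 0 <= j < len(ss):
            lo, hi = min(i, j), max(i, j)
            # Every residue between i and j must share the same state.
            if all(c == state for c in ss[lo:hi + 1]):
                partners.append(j)
    return partners


def neighbor_feature(ss, scores):
    """Mean per-residue score of each residue's proximity partners
    (0.0 when a residue has no partner in its segment)."""
    features = []
    for i in range(len(ss)):
        partners = proximity_partners(ss, i)
        if partners:
            features.append(sum(scores[j] for j in partners) / len(partners))
        else:
            features.append(0.0)
    return features


# Toy example: a two-residue loop, an eight-residue helix, a two-residue loop.
toy_ss = "CCHHHHHHHHCC"
toy_scores = [float(i) for i in range(len(toy_ss))]
toy_features = neighbor_feature(toy_ss, toy_scores)
```

A feature of this kind could then be supplied to the regression alongside the per-residue scores, so that a residue is credited when its likely spatial neighbors also score highly.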
We also present a novel analytic tool for displaying the performance of a predictive method across an equally sized array of score thresholds (figures 3 and 5). The Matthews correlation coefficient (MCC) distribution is similar to the precision recall curve (figure 3) in its realistic consideration of large negative sets, as is the case here: even in calcium binding proteins the nonbinding residues outnumber the binding residues 89 to 1. However, the MCC distribution also shows the cutoff that most enriches the information content of a prediction method. The MCC distribution conveys the complexity of predictive accuracy across the range of score thresholds, which informs the applicability of complex machine learning methods. The CloseSS method in particular displays multiple local extrema, which suggests applying decision trees, neural networks, and support vector machines to this problem. We will evaluate both the CloseSS and MCC distribution tools in more diverse situations in the future.
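An MCC distribution of the kind described above can be sketched in a few lines. The version below is an illustration under stated assumptions rather than the procedure used for the figures: it interprets the threshold array as evenly spaced between the minimum and maximum score, calls a residue positive when its score is at or above the threshold, and reports an MCC of 0 when the coefficient is undefined. The names `mcc`, `mcc_profile`, and `best_cutoff` are hypothetical.

```python
import math


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion matrix counts.
    Returns 0.0 when any marginal total is zero (coefficient undefined)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


def mcc_profile(scores, labels, n_thresholds=11):
    """List of (threshold, MCC) pairs at evenly spaced thresholds
    spanning the observed score range."""
    lo, hi = min(scores), max(scores)
    profile = []
    for step in range(n_thresholds):
        t = lo + (hi - lo) * step / (n_thresholds - 1)
        tp = fp = tn = fn = 0
        for s, y in zip(scores, labels):
            predicted = s >= t
            if predicted and y:
                tp += 1
            elif predicted and not y:
                fp += 1
            elif not predicted and y:
                fn += 1
            else:
                tn += 1
        profile.append((t, mcc(tp, fp, tn, fn)))
    return profile


def best_cutoff(profile):
    """The (threshold, MCC) pair with the highest MCC; first one wins on ties."""
    return max(profile, key=lambda pair: pair[1])


# Toy example: two binding residues among six, cleanly separated by score.
toy_scores = [0.0, 0.1, 0.2, 0.3, 0.9, 1.0]
toy_labels = [0, 0, 0, 0, 1, 1]
toy_profile = mcc_profile(toy_scores, toy_labels)
toy_best = best_cutoff(toy_profile)
```

Plotting the second element of each pair against the first gives the distribution; counting its local maxima is one simple way to judge whether a single cutoff captures the signal or whether a nonlinear classifier may be warranted.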
While we would prefer to compare this method to others, we know of no sequence based servers or software for metal ion binding prediction. Previous annotation methods did not thoroughly describe their benchmark sets. Thus we use conservation measures such as relative entropy (HMMRE) as representatives of what is available in the field.
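For readers unfamiliar with relative entropy as a conservation measure, the sketch below scores one alignment column against a background distribution. It is a simplified stand-in and not HMMRE itself, which the name suggests is computed from hidden Markov model profiles rather than raw column counts. The uniform background, the handling of gaps, and the function name `relative_entropy` are assumptions made for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def relative_entropy(column, background=None):
    """Relative entropy (in bits) of the amino acid frequencies in one
    alignment column versus a background distribution.

    column     -- string of residues observed at one alignment position
    background -- dict of amino acid -> background frequency; a uniform
                  background is assumed when none is supplied
    """
    if background is None:
        background = {aa: 1 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
    # Ignore gaps and any symbol that has no background frequency.
    residues = [c for c in column if c in background]
    if not residues:
        return 0.0
    counts = Counter(residues)
    total = len(residues)
    return sum(
        (n / total) * math.log2((n / total) / background[aa])
        for aa, n in counts.items()
    )


# Toy columns: fully conserved, split between two residues, all gaps.
conserved = relative_entropy("DDDD")
mixed = relative_entropy("DDEE")
gaps_only = relative_entropy("----")
```

Higher values indicate a column that departs more strongly from the background, which is the sense in which such a measure serves as a conservation baseline.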
The stability of performance across a battery of tests suggests validity of the MFSCa method and our analysis, but also a practical limit to the information assessed by the method. Amino acid residues that mediate specific protein functions are relatively easy to pick out when many known examples are available, as we show here for calcium ion binders. The difficulties in sequence analysis arise when attempting to identify compound functionalities of residues working together, to select amongst multiple candidate functions for a residue or group of residues, to predict function in a protein de novo (without homology), and to derive clinically useful information from this analysis. In future work we will address these challenges by incorporating conservation and identity for spatially nearby residues identified using methods from protein structure prediction, and by tuning the analysis to generate and then compare across MFS models of highly specific functions. In this work we show that protein meta-functional signatures can be successfully trained for these specific functions by considering amino acid identity and structural features accessible from sequence, and so lay the groundwork for composite function site prediction.
Acknowledgments
The authors would like to thank Orapin V. Horst, Brady Bernard, Aaron Goldman, Kai Wang, Francois Baneyx, and members of the Samudrala group for valuable discussions and comments. JAH is grateful to have been supported for this work by the University of Washington Warren G. Magnuson Scholars Award and the National Institute of Dental and Craniofacial Research Ruth L. Kirschstein Individual Predoctoral Dental Scientist Fellowship 5F30DE01752.
References
- 1. US Department of Energy Joint Genome Institute. [accessed November 18, 2009]; Integrated microbial genomes. http://img.jgi.doe.gov.
- 2. Gutteridge A, Thornton JM. Understanding nature’s catalytic toolkit. Trends Biochem Sci. 2005;30:622–629. doi: 10.1016/j.tibs.2005.09.006.
- 3. Wang K, Horst JA, Cheng G, Nickle D, Samudrala R. Protein meta-functional signatures from combining sequence, structure, evolution and amino acid property information. PLoS Comp Bio. 2008;4:e1000181. doi: 10.1371/journal.pcbi.1000181.
- 4. Jensen RA. Enzyme recruitment in evolution of new function. Annu Rev Microbiol. 1976;30:409–425. doi: 10.1146/annurev.mi.30.100176.002205.
- 5. Khersonsky O, Roodveldt C, Tawfik DS. Enzyme promiscuity: evolutionary and mechanistic aspects. Curr Opin Chem Biol. 2006;10:498–508. doi: 10.1016/j.cbpa.2006.08.011.
- 6. Ashworth J, Havranek JJ, Duarte CM, Sussman D, Monnat RJ, Stoddard BL, Baker D. Computational redesign of endonuclease DNA binding and cleavage specificity. Nature. 2006;441:656–659. doi: 10.1038/nature04818.
- 7. Jiang L, Althoff EA, Clemente FR, Doyle L, Röthlisberger D, Zanghellini A, Gallaher JL, Betker JL, Tanaka F, Barbas CF, Hilvert D, Houk KN, Stoddard BL, Baker D. De novo computational design of retro-aldol enzymes. Science. 2008;319:1387–1391. doi: 10.1126/science.1152692.
- 8. Lopez G, Ezkurdia I, Tress ML. Assessment of ligand binding residue predictions in CASP8. Proteins. 2009;77(Suppl 9):138–146. doi: 10.1002/prot.22557.
- 9. Zhang Y. Progress and challenges in protein structure prediction. Curr Opin Str Biol. 2008;18:342–348. doi: 10.1016/j.sbi.2008.02.004.
- 10. Protein Structure Initiative. Structural Genomics Knowledgebase: TargetDB Statistics Summary Report. [accessed November 11, 2009]; http://targetdb.pdb.org/statistics/TargetStatistics.html.
- 11. Chen L, Oughtred R, Berman HB, Westbrook J. TargetDB: a target registration database for structural genomics projects. Bioinformatics. 2004;20:2860–2862. doi: 10.1093/bioinformatics/bth300.
- 12. Cheng G, Qian B, Samudrala R, Baker D. Improvement in protein functional site prediction by distinguishing structural and functional constraints on protein family evolution using computational design. Nucleic Acids Res. 2005;33:5861–5867. doi: 10.1093/nar/gki894.
- 13. Shoemaker BA, Portman JJ, Wolynes PG. Speeding molecular recognition by using the folding funnel: the fly-casting mechanism. Proc Natl Acad Sci USA. 2000;97:8868–8873. doi: 10.1073/pnas.160259697.
- 14. Raman S, Vernon R, Thompson J, Tyka M, Sadreyev R, Pei J, Kim D, Kellogg E, DiMaio F, Lange O, Kinch L, Sheffler W, Kim BH, Das R, Grishin NV, Baker D. Structure prediction for CASP8 with all-atom refinement using Rosetta. Proteins. 2009;77(Suppl 9):89–99. doi: 10.1002/prot.22540.
- 15. Moult J. A decade of CASP: progress, bottlenecks and prognosis in protein structure prediction. Curr Opin Struct Biol. 2005;15:285–289. doi: 10.1016/j.sbi.2005.05.011.
- 16. Horst JA, Samudrala R. Diversity of protein structures and difficulties in fold recognition: the curious case of protein G. F1000 Biology Reports. 2009;1:69. doi: 10.3410/B1-69.
- 17. Stone EA, Sidow A. Physicochemical constraint violation by missense substitutions mediates impairment of protein function and disease severity. Genome Res. 2005;15:978–986. doi: 10.1101/gr.3804205.
- 18. Tordoff MG. Calcium: Taste, Intake, and Appetite. Physiol Rev. 2001;81:1567–1597. doi: 10.1152/physrev.2001.81.4.1567.
- 19. Lee NK, Sowa H, Hinoi E, Ferron M, Ahn JD, et al. Endocrine regulation of energy metabolism by the skeleton. Cell. 2007;130:456–469. doi: 10.1016/j.cell.2007.05.047.
- 20. Berman H, Henrick K, Nakamura H, Markley JL. The worldwide Protein Data Bank (wwPDB): ensuring a single, uniform archive of PDB data. Nucleic Acids Res. 2007;35:D301–D303. doi: 10.1093/nar/gkl971.
- 21. Fetrow JS, Skolnick J. Method for prediction of protein function from sequence using the sequence-to-structure-to-function paradigm with application to glutaredoxins/thioredoxins and T1 ribonucleases. J Mol Biol. 1998;281:949–968. doi: 10.1006/jmbi.1998.1993.
- 22. McDermott J, Bumgarner RE, Samudrala R. Functional annotation from predicted protein interaction networks. Bioinformatics. 2005;21:3217–3226. doi: 10.1093/bioinformatics/bti514.
- 23. Bork P, Dandekar T, Diaz-Lazcoz Y, Eisenhaber F, Huynen M, Yuan Y. Predicting function: from genes to genomes and back. J Mol Biol. 1998;283:707–725. doi: 10.1006/jmbi.1998.2144.
- 24. Ge H, Walhout AJ, Vidal M. Integrating ‘omic’ information: a bridge between genomics and systems biology. Trends Genet. 2003;19:551–560. doi: 10.1016/j.tig.2003.08.009.
- 25. Laskowski RA, Watson JD, Thornton JM. Protein function prediction using local 3D templates. J Mol Biol. 2005;351:614–626. doi: 10.1016/j.jmb.2005.05.067.
- 26. Reeves GA, Talavera D, Thornton JM. Genome and proteome annotation: organization, interpretation and integration. J R Soc Interface. 2009;6:129–147. doi: 10.1098/rsif.2008.0341.
- 27. Fleming K, Kelley LA, Islam SA, MacCallum RM, Muller A, Pazos F, Sternberg MJ. The proteome: structure, function and evolution. Philos Trans R Soc Lond B Biol Sci. 2006;29:441–451. doi: 10.1098/rstb.2005.1802.
- 28. Abagyan R, Kufareva I. The flexible pocketome engine for structural chemogenomics. Methods Mol Biol. 2009;575:249–279. doi: 10.1007/978-1-60761-274-2_11.
- 29. Deng H, Chen G, Yang W, Yang JJ. Predicting calcium-binding sites in proteins - a graph theory and geometry approach. Proteins. 2006;64:34–42. doi: 10.1002/prot.20973.
- 30. Wang X, Kirberger M, Qiu F, Chen G, Yang JJ. Towards predicting Ca2+-binding sites with different coordination numbers in proteins with atomic resolution. Proteins. 2009;75:787–798. doi: 10.1002/prot.22285.
- 31. Dou Y, Zheng X, Wang J. Prediction of catalytic residues using the variation of stereochemical properties. Protein J. 2009;28:29–33. doi: 10.1007/s10930-008-9161-0.
- 32. Zhang T, Zhang H, Chen K, Shen S, Ruan J, Kurgan L. Accurate sequence-based prediction of catalytic residues. Bioinformatics. 2008;24:2329–2338. doi: 10.1093/bioinformatics/btn433.
- 33. Sterner B, Singh R, Berger B. Predicting and annotating catalytic residues: an information theoretic approach. J Comput Biol. 2007;14:1058–1073. doi: 10.1089/cmb.2007.0042.
- 34. Fischer JD, Mayer CE, Söding J. Prediction of protein functional residues from sequence by probability density estimation. Bioinformatics. 2008;24:613–620. doi: 10.1093/bioinformatics/btm626.
- 35. Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–3402. doi: 10.1093/nar/25.17.3389.
- 36. Pruitt KD, Tatusova T, Maglott DR. NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucl Acids Res. 2005;33:D501–D504. doi: 10.1093/nar/gki025.
- 37. Biegert A, Söding J. Sequence context-specific profiles for homology searching. Proc Natl Acad Sci USA. 2009;106:3770–3775. doi: 10.1073/pnas.0810767106.
- 38. Margelevicius M, Venclovas C. PSI-BLAST-ISS: an intermediate sequence search tool for estimation of the position-specific alignment reliability. BMC Bioinformatics. 2005;6:185. doi: 10.1186/1471-2105-6-185.
- 39. Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004;32:1792–1797. doi: 10.1093/nar/gkh340.
- 40. Eddy SR. Profile hidden Markov models. Bioinformatics. 1998;14:755–763. doi: 10.1093/bioinformatics/14.9.755.
- 41. Wang K, Samudrala R. Incorporating background frequency improves entropy-based residue conservation measures. BMC Bioinformatics. 2006;7:385. doi: 10.1186/1471-2105-7-385.
- 42. Felsenstein J. Evolutionary trees from DNA sequences: a maximum likelihood approach. J Mol Evol. 1981;17:368–376. doi: 10.1007/BF01734359.
- 43. Protein Data Bank. [accessed July 17, 2009]; Research Collaboratory for Structural Bioinformatics. http://www.pdb.org.
- 44. O’Day DH. CaMBOT: profiling and characterizing calmodulin-binding proteins. Cell Signal. 2003;15:347–354. doi: 10.1016/s0898-6568(02)00116-x.
- 45. Jones DT. Protein secondary structure prediction based on position-specific scoring matrices. J Mol Biol. 1999;292:195–202. doi: 10.1006/jmbi.1999.3091.
- 46. Mihalek I, Res I, Lichtarge O. A family of evolution entropy hybrid methods for ranking protein residues by importance. J Mol Biol. 2004;336:1265–1282. doi: 10.1016/j.jmb.2003.12.078.