Abstract
The structural genomics projects have been accumulating an increasing number of protein structures, many of which remain functionally uncharacterized. In parallel with experimental methods, computational methods are expected to make a significant contribution to the functional elucidation of such proteins. However, conventional computational methods that transfer functions from homologous proteins do not help much for these uncharacterized protein structures because they do not have apparent structural or sequence similarity to known proteins. Here, we briefly review two avenues of computational function prediction, i.e. structure-based methods and sequence-based methods. The focus is on our recent developments of local structure-based methods and sequence-based methods, which can effectively extract function information from distantly related proteins. Two structure-based methods, Pocket-Surfer and Patch-Surfer, identify similar known ligand binding sites for pocket regions in a query protein without using global protein fold similarity information. Two sequence-based methods, PFP and ESG, make use of weakly similar sequences that are conventionally discarded in homology-based function annotation. We hope that computational methods, combined with experimental methods, will make a leading contribution to the functional elucidation of these protein structures.
Keywords: Computational protein function prediction, structure-based function prediction, sequence-based function prediction, ligand binding pocket comparison, protein surface shape comparison, Pocket-Surfer, Patch-Surfer, weakly similar sequences
INTRODUCTION
The structural genomics projects worldwide have determined an increasing number of protein tertiary structures over the past decade [1–4]. As of the writing of this article (September 2011), there are over 9000 structures in the Protein Data Bank (PDB) [5, 6] that were deposited by the structural genomics projects. Among the several objectives of these large-scale efforts, one of the major expectations is that the determined structures provide clues for elucidating the evolution and function of the proteins [7, 8]. In fact, the functions of some proteins targeted in the projects have been elucidated by global structural similarity to known characterized proteins [9–12] or by a combination of structural comparison and other sources, such as cofactors identified as bound to the structure [13] or biochemical experimental evidence [14].
Protein function should ultimately be investigated experimentally. There are indeed efforts towards systematic functional screening [15, 16] and also efforts to combine computational and experimental function assignments on a genome scale [7, 17]. In parallel with such experimental method developments, computational function prediction methods are expected to play an important role in post-structural genomics functional elucidation, because computational methods can quickly screen existing data, which is the basis of transferring function from known proteins. Computational methods alone can often provide sufficient evidence for inferring the function of proteins [9–12]. Experiments can also benefit greatly if possible functions of uncharacterized proteins are suggested by computational methods [17].
However, conventional computational methods that transfer function from obviously homologous proteins, such as BLAST [18, 19], FASTA [20], or domain searches [21], leave many protein structures with unknown function, as evidenced by the many structures of unknown function deposited in the PDB by the structural genomics projects. Methods that predict function from globally similar proteins work well when highly similar known proteins exist in the database. Protein structures solved by the structural genomics efforts that are left behind in functional annotation do not have apparent global similarity to any known protein. Therefore, to increase the annotation coverage, it is crucial to develop methods that can use local structural similarity or distantly related proteins.
In this article, we briefly review two avenues of computational function prediction, i.e. structure-based methods and sequence-based methods. The focus is on our recently developed local structure-based and sequence-based function prediction methods, which can effectively extract function information from distantly related proteins that are discarded by conventional methods. For a more general review of computational function prediction, readers are referred to recent comprehensive reviews [22–24].
Approaches for structure-based function prediction
Computational structure-based function prediction methods transfer the function of known proteins to a query protein if the known proteins have global or local structural similarity to the query. Accordingly, these methods can be divided into global and local approaches. In principle, global methods aim to find proteins of the same fold that are distantly related to a query protein and are not detectable by sequence similarity [25–27]. Any method developed for protein structure comparison can be used for this task, e.g., the Combinatorial Extension method [28], Dali [29], SSAP [30], VAST [31], and COSEC [32]. More recently, moment-based methods, the spherical harmonics and the 3D Zernike descriptors (3DZD), have been applied to describe protein surfaces [33–38]. Moment-based methods describe a protein surface as a series expansion of a 3D mathematical function that represents the protein surface shape. Thus, a structure is compactly represented as a vector of coefficients of the series expansion, which allows a fast real-time structure database search. We showed that some functional classes of proteins, e.g. DNA binding proteins, can be detected by surface comparison using the 3DZD, because the descriptors capture the surface shape similarity that is required for the function of the proteins (e.g. the saddle-like shape of DNA binding regions in DNA binding proteins) [35]. Although global structural similarity indicates functional similarity of proteins in most cases, one needs to keep in mind that there are notable exceptions, the “superfolds” [39], which are commonly occurring protein folds that are adopted by various protein families. Thus, conservation of functional residues needs to be confirmed before transferring function to a query protein from a known protein of the same global fold.
In contrast to the global structure comparison methods, local structure-based approaches compare local regions of a query protein to a database of known functional sites, e.g. active sites of enzymes. The Catalytic Site Atlas [40], AFT [41], ASSAM [42], SPASM [43], SURFACE [44], FLORA [45], CavBase [46], and SitesBase [47] search for local sites in a query protein whose residues or atoms match known functional sites. Another method, eF-seek [48], represents a protein surface as a graph with nodes characterized by local geometry and electrostatic potential, and local regions in a query protein that are similar to known functional sites are sought by a sub-graph matching algorithm. Thornton and her colleagues explored the use of the spherical harmonics in representing and comparing protein pockets [49]. Binkowski et al. developed a method that characterizes a pocket by the conserved residues at the pocket [50, 51] and also by its pairwise atom distances [52]. The ProFunc server performs various structure- and sequence-based function predictions for a query protein structure, including sequence and structural motif searches, active site identification, and global fold comparison [53]. Proknow is another method that integrates multiple lines of evidence for function prediction, such as sequence and structure similarity and protein-protein interactions [54]. The advantage of the local structure-based methods over the global structure-based methods is that the former can identify functional sites in a query protein even when the query does not have evolutionarily closely related proteins in a database. To meet the urgent task of annotating protein structures solved by the structural genomics projects that do not have apparent homology to known proteins, we have recently developed two local structure-based methods, Pocket-Surfer [55] and Patch-Surfer [56, 57]. These two methods are overviewed in the next section.
Pocket-Surfer & Patch-Surfer
Pocket-Surfer [55] and Patch-Surfer [56, 57] predict ligand molecules that bind to a query protein by comparing the geometrical shape and the physicochemical properties of a pocket region of the query protein to a database of known binding pockets. Both methods allow a quick real-time scan of the database of binding pockets since they use a rotational invariant pocket representation, which does not require time consuming pre-alignment of pockets upon comparison. Technically, the rotational invariant representation is achieved using the 3DZD, a rotation invariant series expansion of a 3D function [58, 59]. The difference between Pocket-Surfer and Patch-Surfer is that the former represents a pocket as a single object while the latter represents a pocket with a set of surface patches, each of which captures features of local regions of the pocket surface. Figure 1 illustrates the query process of Pocket-Surfer (Fig 1A) and Patch-Surfer (Fig 1B).
Figure 1.
The flow chart of pocket search with Pocket-Surfer and Patch-Surfer. A, Pocket-Surfer. Pockets are encoded by the 3DZD. Examples of the 3DZDs are shown in graphs on the right hand side. The x-axis of the plot indicates the position of terms in the series expansion and the y-axis is the value of the coefficients of the terms. The similarity of two 3DZDs is computed as the Euclidean distance of the two vectors of the 3DZDs. B, Patch-Surfer. Since a pocket is represented by a set of patches, similar patches from the two pockets are first matched using a bipartite matching algorithm. The similarity of two pockets reflects the average similarity of the matched patches, the relative position (distance) of patches within each pocket, and the size of the pockets. The weighting factors, which are trained on the known pockets in the database, are used for normalizing scores of different properties at different patches by considering the distribution of the scores.
The first step of the binding ligand prediction for a query protein by Pocket-Surfer and Patch-Surfer is to generate the surface of the protein from its PDB file and to extract binding pockets from the surface. The surface of the protein is determined as the boundary between the solvent accessible and solvent excluded regions generated by the Adaptive Poisson-Boltzmann Solver (APBS) program [60]. If the center location of binding pockets in the query protein is not known, it can be predicted with an external pocket finding program, such as VisGrid [61] or LIGSITE [62]. The extent of a pocket surface is computed by casting rays from the predetermined pocket centers and selecting the surface positions that are encountered first by the rays as the pocket surface. Once a pocket is determined, Pocket-Surfer encodes the geometrical shape and the surface electrostatic potential of the whole pocket with 3DZDs. Patch-Surfer, on the other hand, segments the pocket surface into patches. The segmentation is done by spreading seed points on the pocket surface and extracting the local surface regions that are included within a sphere of 5 Å radius centered at each seed point. For example, an adenosine triphosphate (ATP) binding pocket is represented with, on average, 29.5 overlapping patches. Then, the shape, the electrostatic potential, and the hydrophobicity of each patch are mapped onto a 3D grid, each of which is considered as a 3D function and encoded with the 3DZD. Since the 3DZD is a vector of coefficients of the terms in the series expansion, the similarity of two 3DZDs can be efficiently quantified by computing the Euclidean distance of the two vectors. Another advantage of the 3DZD is that the series expansion does not change with rotations of the target 3D object (it is rotationally invariant). Thus, time-consuming pre-alignment of pockets is not needed for comparison. For mathematical details of the 3DZD, refer to the original papers [35, 56, 58, 59].
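The patch segmentation step can be sketched as follows. This is a minimal illustration only, assuming the pocket surface is available as a list of 3D points and that seed points have already been chosen; the function name `extract_patches` and the toy coordinates are hypothetical, and how seeds are spread on the surface is not specified here.

```python
import math

def extract_patches(surface_points, seed_points, radius=5.0):
    """Return one patch per seed: the surface points within `radius` (in Å) of that seed."""
    patches = []
    for seed in seed_points:
        patch = [p for p in surface_points if math.dist(p, seed) <= radius]
        patches.append(patch)
    return patches

# Toy pocket surface: points spaced 2 Å apart along a line (illustrative data only).
surface = [(float(x), 0.0, 0.0) for x in range(0, 21, 2)]
# Hypothetical seed points placed on the surface.
seeds = [(4.0, 0.0, 0.0), (8.0, 0.0, 0.0)]

patches = extract_patches(surface, seeds)
# Neighbouring patches share surface points, i.e. they overlap.
shared = set(patches[0]) & set(patches[1])
```

Because each patch is defined by a sphere around its seed, nearby seeds naturally yield overlapping patches, which is consistent with the overlapping patch representation described above.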
Once the query pocket is encoded with the 3DZDs, it is compared to the known pockets stored in the database. Pocket-Surfer simply computes the Euclidean distance between the 3DZD of the query pocket and the 3DZD of the database pockets. The comparison of a pair of pockets is more complex for Patch-Surfer since a pocket is represented by a set of patches. Patch-Surfer first identifies pairs of similar patches from the two pockets by employing a modified bipartite matching algorithm [63]. Then, the similarity of the two pockets is quantified by a linear combination of three terms, the average similarity of matched pairs of patches, the relative distance of patches within each pocket, and the size of the pockets. Next, the existing pockets in the database are sorted by the distance to the query pocket. Finally, the top k most similar pockets are considered to make the final prediction of the binding ligand for the query pocket using a type of the k-nearest neighbour algorithm (essentially the algorithm takes weighted consensus within the k highest ranking ligands). Please refer to the original papers for the details of the algorithm [56, 57].
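The Pocket-Surfer style search can be sketched as follows: database pockets are ranked by the Euclidean distance between descriptor vectors, and the top k hits vote for a ligand. This is a sketch under stated assumptions, not the published scoring scheme: the 1/rank vote weight, the function names, and the three-dimensional toy descriptors are all hypothetical (real 3DZD vectors are much longer).

```python
import math
from collections import defaultdict

def rank_pockets(query_vec, database):
    """Sort (distance, ligand) pairs by Euclidean distance to the query descriptor."""
    scored = [(math.dist(query_vec, vec), ligand) for ligand, vec in database]
    return sorted(scored)

def predict_ligand(query_vec, database, k=3):
    """Weighted vote over the k nearest pockets (1/rank weight is an illustrative assumption)."""
    votes = defaultdict(float)
    nearest = rank_pockets(query_vec, database)[:k]
    for rank, (_, ligand) in enumerate(nearest, start=1):
        votes[ligand] += 1.0 / rank
    # Ligands sorted from the highest to the lowest accumulated vote.
    return sorted(votes.items(), key=lambda item: item[1], reverse=True)

# Toy database of (ligand type, descriptor vector) pairs; values are made up.
database = [
    ("ATP", [0.9, 0.1, 0.3]),
    ("ATP", [0.7, 0.3, 0.2]),
    ("NAD", [0.8, 0.2, 0.3]),
    ("FAD", [0.1, 0.9, 0.8]),
]
query = [0.88, 0.12, 0.3]

ranking = predict_ligand(query, database, k=3)
```

Since the descriptors are rotation invariant, no alignment step appears in this sketch; the cost of a search is one distance computation per database pocket.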
Performance of Pocket-Surfer on benchmark datasets
We benchmarked the accuracy of pocket retrieval of Pocket-Surfer on two datasets of ligand binding pockets selected from the PDB. The first dataset contains 100 proteins, each of which binds one of nine different ligand molecules, including ATP, nicotinamide adenine dinucleotide (NAD), flavin adenine dinucleotide (FAD), and glucose. The second dataset contains 175 proteins, each of which binds one of twelve different ligand molecules. There is no overlap between the ligand types and proteins in the two datasets. Pocket-Surfer identified the correct ligand type within the top 3 ranks in 75.6% of the cases for the first dataset and 61.5% for the second dataset. These results were superior to those of other similar moment-based methods compared in the study [55]. In addition, comparison with existing binding ligand prediction servers showed that Pocket-Surfer achieved the highest value for the area under the Receiver Operator Characteristic curve (AUC-ROC) [33]. Please refer to the original papers [33, 55] for more detailed results of the benchmark studies.
Patch-Surfer results on the representative binding pocket database
Binding pockets of the same ligand type do not always have similar global shape and physicochemical properties at the corresponding location in the pockets [64]. This divergence of properties of pockets can occur due to several reasons: For example, some ligand molecules can take different conformations upon binding. Also occasionally water molecules or additional ligand molecules bind at the same pocket, which results in the change of overall pocket shape, size, and properties.
The intention of the patch representation used by Patch-Surfer is to identify local surface regions that are consistent in shape and/or physicochemical properties among pockets of the same ligand type that do not have globally similar shape and properties. Overall, the performance of Patch-Surfer is better than that of Pocket-Surfer in a benchmark study we conducted on the dataset of 100 proteins that bind one of nine different ligand molecules [55, 56]. Pocket-Surfer made correct binding ligand predictions for 36.1% and 82.7% of the cases within the top-1 and top-3 predictions (i.e. the correct ligand is found within the top-1/top-3 highest scoring ligands ranked by Pocket-Surfer), whereas Patch-Surfer’s results were 45.0% and 86.0% for the top-1 and top-3 predictions, respectively. The area under the curve (AUC) value of the receiver operator characteristic (ROC), a metric to evaluate the overall database retrieval performance [65], was 0.81 for Pocket-Surfer and 0.82 for Patch-Surfer [56].
We have recently developed a larger database of representative ligand binding pockets selected from the PDB for practical use of Patch-Surfer [66]. The representative pockets were selected from the Protein-Small-Molecule DataBase (PSMDB) [67]. Among several non-redundant datasets of structures of protein-ligand complexes provided in PSMDB, we chose the list available at http://compbio.cs.toronto.edu/psmdb/downloads/CPLX_25_0.85_7HA.list, where proteins were pruned at 25% sequence identity and redundant ligands that have a Tanimoto coefficient of 0.85 or higher to other ligand molecules were filtered out. Small ligands with fewer than 7 heavy atoms were not included in this list. From this list, we further removed ligands that are too distant from the protein (more distant than 3.5 Å from any heavy atom in the protein) and also covalently bound ligands (ligands that are closer than 1.4 Å to the protein). This procedure leaves 9393 pockets (protein-ligand pairs) with 2707 ligand types.
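The distance-based filtering of protein-ligand pairs can be sketched as follows. This is a minimal illustration, assuming heavy-atom coordinates are already available as 3D tuples; the function names, the handling of values exactly at a cutoff, and the toy coordinates are assumptions rather than details taken from the database construction.

```python
import math

def min_distance(ligand_atoms, protein_atoms):
    """Smallest distance (Å) between any ligand heavy atom and any protein heavy atom."""
    return min(math.dist(a, b) for a in ligand_atoms for b in protein_atoms)

def keep_pocket(ligand_atoms, protein_atoms,
                min_heavy_atoms=7, max_contact=3.5, covalent_cutoff=1.4):
    """Apply the three filters: ligand size, too distant, and covalently bound."""
    if len(ligand_atoms) < min_heavy_atoms:
        return False  # small ligands are excluded
    d = min_distance(ligand_atoms, protein_atoms)
    # Keep ligands that are in contact with the protein but not covalently bound.
    return covalent_cutoff <= d <= max_contact

# Toy data: a single protein heavy atom at the origin and linear 7-atom ligands.
protein = [(0.0, 0.0, 0.0)]
bound_ligand = [(3.0 + i, 0.0, 0.0) for i in range(7)]     # closest atom at 3.0 Å
distant_ligand = [(5.0 + i, 0.0, 0.0) for i in range(7)]   # closest atom at 5.0 Å
covalent_ligand = [(1.2 + i, 0.0, 0.0) for i in range(7)]  # closest atom at 1.2 Å
small_ligand = [(3.0 + i, 0.0, 0.0) for i in range(3)]     # only 3 heavy atoms

kept = [keep_pocket(lig, protein)
        for lig in (bound_ligand, distant_ligand, covalent_ligand, small_ligand)]
```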
On this representative pocket database, we benchmarked the retrieval performance of Patch-Surfer using a diverse test set of query pockets that bind either FAD (10), HEM (16), NAD (15), biotin (BTN) (8), fructose 6-phosphate (F6P) (8), guanine (GUN) (10), palmitic acid (PLM) (24), or retinol (RTL) (5) (Figure 2). In the second parenthesis, the number of query pockets of that type is shown (in total 96 query pockets). For each of these query pockets, pockets in the database were ranked according to the similarity to the query. Then, the retrieval was evaluated in terms of the enrichment factor (EF), which describes the ratio of correctly retrieved pockets relative to the percentage of the database scanned [68, 69]:
$$EF = \frac{N_P / N_X}{T_P / T_{DB}} \qquad (1)$$
where T_P is the total number of pockets in the database that bind the same ligand type P as the query, T_DB is the size of the database, N_P is the number of pockets of the ligand type P ranked within the top X percent by the database search method (Patch-Surfer), and N_X is the total number of retrieved pockets ranked in the top X percent of the database. EF is a commonly used metric for evaluating retrieval from a large database, for example, in the evaluation of methods for drug database search in the cheminformatics domain.
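As a worked illustration of the enrichment factor, the sketch below assumes the standard definition EF = (N_P / N_X) / (T_P / T_DB), i.e. the fraction of correct pockets among the retrieved ones divided by their fraction in the whole database. The function name and the counts for T_P and N_P are hypothetical; only the database size of 9393 pockets is taken from the text.

```python
def enrichment_factor(n_p, n_x, t_p, t_db):
    """EF = (n_p / n_x) / (t_p / t_db), assuming the standard enrichment definition."""
    if n_x == 0 or t_p == 0 or t_db == 0:
        raise ValueError("n_x, t_p and t_db must be non-zero")
    return (n_p / n_x) / (t_p / t_db)

t_db = 9393                     # size of the representative pocket database
n_x = round(0.001 * t_db)       # top 0.1% of the database, i.e. 9 pockets
t_p = 50                        # hypothetical: database pockets binding the query's ligand
n_p = 4                         # hypothetical: correct pockets found among the top 9

ef = enrichment_factor(n_p, n_x, t_p, t_db)

# A retrieval no better than random gives EF = 1.
ef_random = enrichment_factor(1, 100, 10, 1000)
```

With these made-up counts, finding 4 correct pockets among the top 9 corresponds to an EF of about 83, which shows why EF values can be large at small retrieval percentages for ligand types that are rare in the database.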
Figure 2.
The enrichment factor for eight types of ligand molecules using Patch-Surfer scanned against the representative binding pocket database. In total 96 query pockets from eight ligand types were used. A, EF is shown relative to the percentage of top ranking pockets. The EF is averaged for the same types of ligand molecules. B, the structures of eight ligands used as queries.
The results are shown in Figure 2A. At 0.1% retrieval (i.e. considering the top 9 pockets), all of the ligands except for the two smallest ligand types, F6P and GUN, have high EF values, ranging from around 16.31 (NAD) to over 84.84 (RTL). At 1% retrieval, all of the ligands have an EF over 5.0, from 5.13 (F6P) to 49.26 (RTL). A search against this large database took, on average, 5–6 minutes per query. Having a high EF at a low percentage of retrieval, as we achieved here, is crucial when further computational or experimental validation of binding ligands is to be performed. It is entirely feasible to perform around 100 computational ligand-protein docking runs with consideration of full ligand flexibility [70] or experimental ligand screening in a realistic time. Together with the other types of structure-based function prediction methods, Patch-Surfer as well as Pocket-Surfer will be valuable tools for elucidating the function of proteins whose structures are solved by structural genomics efforts. We are currently in the process of making Patch-Surfer available for academic users [66] as a new component of the protein surface comparison server http://kiharalab.org/3d-surfer/ [34]. Users will be able to submit a protein structure, from which pocket regions will be identified and compared against the above-mentioned representative known ligand binding pockets.
Sequence-based function prediction methods that use weakly similar sequences
Sequence-based function prediction methods are applicable to a larger number of proteins than structure-based methods. This is because sequence information is available for the majority of proteins and because most function information is stored in sequence databases.
As mentioned in the Introduction, conventional methods that are based on homology [18–20] or high sequence conservation (domains, motifs) [21, 71–73] cover only a small portion of the proteins in a genome in terms of function annotation [22, 74]. In recent years, to meet the needs of systems biology approaches that deal with a large number of proteins, several novel sequence-based methods have been developed that employ not only highly similar but also weakly similar sequences as the source of function information. The development of such new-generation sequence-based approaches is supported by the realization that weakly similar sequences still share functional similarity in many cases [75–77], especially when a general-level functional category is concerned [78, 79]. Such methods include those that use BLAST or PSI-BLAST search results systematically by applying algorithmic techniques and making use of the Gene Ontology (GO) vocabulary structure [80] (e.g. Gotcha [81], GoFigure [82], OntoBlast [83], PFP [78, 79, 84, 85], ESG [86], and ConFunc [87]). Another direction of recent development is to consider phylogenetic trees, aiming at more specific function prediction among protein subfamilies (e.g. SIFTER [88] and FlowerPower [89]). JAFA is a meta-server that combines predictions from different servers [90]. A list of more methods for sequence-based function prediction can be found in our recent articles [22, 23, 91].
In the following sections, we overview two sequence-based function prediction methods developed in our group, Protein Function Prediction (PFP) [78, 79, 84] and Extended Similarity Group (ESG) [86], as examples of these recent methods. PFP makes use of strongly as well as weakly similar sequences to the query sequence and shows improved sensitivity and coverage, whereas ESG draws a consensus from multiple levels of neighborhoods of similar sequences to improve the precision of predicted GO annotations. Examples of function predictions by PFP using weakly similar sequences are provided.
The Protein Function Prediction (PFP) Algorithm
The PFP algorithm extracts function information (GO terms) from sequences retrieved by PSI-BLAST including very weakly similar sequences with an E-value of up to 100. This enables it to predict low resolution terms when there are no homologous sequences available in the database. GO terms are ranked by a raw score computed using Equation 2. The score for GO term fa is defined as
$$s(f_a) = \sum_{i=1}^{N} \sum_{j=1}^{N_{func}(i)} \left( -\log_{10}\left(E\text{-}value(i)\right) + b \right) P(f_a \mid f_j) \qquad (2)$$
where N is the number of PSI-BLAST hits obtained for a query sequence, Nfunc(i) is the number of GO annotations for the sequence hit i, E-value(i) is the PSI-BLAST E-value for the sequence hit i, fj is the j-th annotation of the sequence hit i, and the constant b takes the value 2 (= log10 100) to keep the score non-negative. The conditional probability P(fa|fj) indicates the likelihood of having function fa as an annotation for the query sequence given that fj is used to annotate the sequence (function association). This function association is computed as the ratio of the number of co-occurrences of terms fa and fj in annotations of the same proteins in the UniProt sequence database [92] to the number of times term fj is used to annotate proteins. Since the score for a GO term is basically the sum of the weights, −log(E-value), of all sequences up to very weakly similar ones, consensus annotations from the weakly similar sequence hits can have a high score even if the annotations do not exist in the top sequence hits. Figure 3 illustrates this computation of the raw score for a GO term fa.
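The raw score computation can be sketched as follows. This is a minimal illustration, assuming the score sums (−log10(E-value) + b)·P(fa|fj) over every annotation of every hit, with P(fa|fa) taken as 1; the function name `pfp_raw_score`, the placeholder term labels, and the association table are hypothetical and are not real GO data.

```python
import math

def pfp_raw_score(target, hits, assoc, b=2.0):
    """Sum (-log10(E) + b) * P(target | term) over all annotations of all hits."""
    score = 0.0
    for evalue, annotations in hits:
        weight = -math.log10(evalue) + b   # zero for E-value = 100 when b = 2
        for term in annotations:
            # A term is fully associated with itself; other pairs come from the table.
            p = 1.0 if term == target else assoc.get((target, term), 0.0)
            score += weight * p
    return score

# Toy PSI-BLAST hits: (E-value, annotated terms). Labels are placeholders.
hits = [
    (1e-3, ["term_A"]),    # significant hit annotated with the target term
    (10.0, ["term_B"]),    # weak hit annotated with an associated term
    (100.0, ["term_A"]),   # hit at the E-value limit contributes zero weight
]
# Hypothetical function association P(term_A | term_B) = 0.5.
assoc = {("term_A", "term_B"): 0.5}

score_a = pfp_raw_score("term_A", hits, assoc)
```

In this toy case the weak hit with an E-value of 10 still contributes to the score of the target term through the function association, which is the mechanism that lets consensus among weak hits raise a term's rank.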
Figure 3.
An example of raw score computation for GO term fa by the PFP algorithm.
Along with this raw score computation, PFP also transfers the scores of a GO term partially to its less specific parent GO terms in the GO hierarchy in proportion to the ratio of the number of genes annotated by the child and the parent terms. The raw scores are then converted into p-values using the background score distributions for each term and are further translated into an expected accuracy based on the benchmark dataset.
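The partial transfer of scores to parent terms can be sketched as follows. This is only an illustration under assumptions: each parent receives the child's score multiplied by the ratio of gene counts (child to parent), and only direct parents are updated in a single pass. The function name, term labels, and counts are hypothetical, and the exact scheme used by PFP may differ.

```python
def propagate_to_parents(scores, parents, gene_counts):
    """Add to each direct parent the child's score scaled by genes(child) / genes(parent)."""
    updated = dict(scores)
    for child, score in scores.items():
        for parent in parents.get(child, []):
            ratio = gene_counts[child] / gene_counts[parent]
            updated[parent] = updated.get(parent, 0.0) + score * ratio
    return updated

# Toy hierarchy: one specific term with one more general parent term.
scores = {"specific_term": 8.0}
parents = {"specific_term": ["general_term"]}
gene_counts = {"specific_term": 25, "general_term": 100}  # made-up annotation counts

propagated = propagate_to_parents(scores, parents, gene_counts)
```

Here the general term receives a quarter of the specific term's score because a quarter of the genes annotated with the parent also carry the child term in this made-up example.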
The Extended Similarity Group (ESG) Algorithm
ESG runs PSI-BLAST iteratively and advocates functional terms that occur consistently in the series of PSI-BLAST database searches. The algorithm is illustrated in Figure 4. Starting from the query sequence Q, we first obtain N sequence hits by a PSI-BLAST search, S1, S2…SN, which have E-values E1, E2…EN, respectively. Each sequence hit Si at the first level is assigned with a weight Wi given by Equation 3, which consists of a normalized E-value of Si with respect to E-values of all other sequence hits. Then, further from each of the retrieved sequences Si, PSI-BLAST is run again to retrieve a set of sequence hits, Sij. The weight Wij for each second level sequence is computed similarly to the weight Wi assigned to the first level sequences.
$$W_i = \frac{-\log(E_i) + b}{\sum_{j=1}^{N} \left( -\log(E_j) + b \right)} \qquad (3)$$
Figure 4.
Computing the probability score for GO term fa to annotate the query sequence Q using two levels of ESG. In the initial run of PSI-BLAST, N sequences are retrieved. Among them, S1, S3, and S4 are annotated with fa (colored in gray). From each of the retrieved sequences S1 to SN, PSI-BLAST is run again, retrieving the second level hits for each of them. S32, a sequence retrieved from S3, has annotation fa. The overall score for fa is the sum of the weights for S1, S3, S4, and S32.
Using the weights assigned to sequences retrieved in the first level and the second level searches, the score of a GO term fa for the query Q is computed as the sum of the weights of sequences which have function fa:
$$P_Q(f_a) = \sum_{i=1}^{N} W_i \, P_{S_i}(f_a) \qquad (4)$$
$$P_{S_i}(f_a) = v \, I_{S_i}(f_a) + (1 - v) \sum_{j} W_{ij} \, I_{S_{ij}}(f_a) \qquad (5)$$
Equation 4 shows that the score of the GO term fa for a query sequence Q (the probability that Q has the GO term fa) is the weighted sum of PSi(fa), the score for fa assigned by Equation 5 to each sequence Si retrieved in the first level. Equation 5 shows that PSi(fa) combines two contributions: the score ISi(fa), which is 1 when sequence Si is annotated with fa and 0 otherwise, and the weighted sum of the corresponding scores from the sequences Sij retrieved by the second level search from Si. The weighting factor v controls the relative contributions of the sequences retrieved in the first level and those found in the second level search.
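The two-level ESG scoring can be sketched as follows. This is a minimal illustration under stated assumptions: weights are taken as (−log10(E) + b) normalized over the hits of one search, the per-sequence score is v·I + (1 − v)·(weighted sum over second-level hits), and v = 0.5 and b = 2 are arbitrary illustrative values. The function names, data layout, and the label "fa" are hypothetical.

```python
import math

def normalized_weights(evalues, b=2.0):
    """Weights proportional to -log10(E) + b, normalized to sum to 1 (assumes a non-zero total)."""
    raw = [-math.log10(e) + b for e in evalues]
    total = sum(raw)
    return [r / total for r in raw]

def esg_score(term, first_level, v=0.5):
    """Score of `term` for the query from first-level hits and their second-level hits."""
    w1 = normalized_weights([hit["evalue"] for hit in first_level])
    score = 0.0
    for w_i, hit in zip(w1, first_level):
        own = 1.0 if term in hit["annotations"] else 0.0
        second = hit["second_level"]
        w2 = normalized_weights([e for e, _ in second]) if second else []
        from_second = sum(w for w, (_, ann) in zip(w2, second) if term in ann)
        score += w_i * (v * own + (1.0 - v) * from_second)
    return score

# Toy search results: two first-level hits, each with its own second-level hits.
first_level = [
    {"evalue": 1e-8, "annotations": {"fa"},
     "second_level": [(1e-8, {"fa"}), (1e-8, set())]},
    {"evalue": 1e-8, "annotations": set(),
     "second_level": [(1e-3, {"fa"})]},
]

p_fa = esg_score("fa", first_level)
```

In this toy case the second first-level hit lacks the annotation itself but still contributes through its second-level neighbour, which illustrates how terms that occur consistently across the iterated searches accumulate score.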
Function prediction using PFP and ESG
PFP and ESG have been thoroughly benchmarked on several datasets, including a large one with 11 complete genomes [78, 79, 86]. The benchmark studies for PFP demonstrate its ability to make correct function predictions even in cases where the query sequence only has hits with large (i.e. insignificant) E-values, e.g. an E-value of 10 or more in a PSI-BLAST search [78, 79]. By making use of weakly similar sequence hits, PFP can significantly increase the annotation coverage of a genome. When PFP was applied to 15 genome sequences, including microbial genomes and the C. elegans, mouse, Arabidopsis, and human genomes, more than two-thirds of the previously unknown proteins in each genome could be assigned a GO function term at the highest confidence level [79]. Predicted functions derived mainly from weakly similar sequence hits are often of low resolution, i.e. GO terms indicating somewhat general function categories that are located at the shallower levels of the GO hierarchy. However, these low-resolution functions will still be useful for guiding further detailed investigation of protein function.
To illustrate PFP’s ability to make correct predictions from weakly similar sequence hits, we show in Table 1 four examples of PFP predictions that were computed only from sequences with an E-value above 1.0 or 10.0. This simulates the situation in which there are no significant sequence hits in the PSI-BLAST search; note that a smaller E-value indicates a more statistically significant hit, and the commonly used E-value cutoff is 0.01 or 0.001. The first example is the function prediction made for the sequence of the outward rectifying potassium channel protein TREK-1 (UniProt ID: O95069). PFP predicts inward rectifier potassium channel activity (GO:0005242) with E-value cutoffs of both 1.0 and 10.0. Although this prediction does not exactly match this protein’s annotation, it is close in the GO hierarchy to the correct annotation, outward rectifier potassium channel activity (GO:0015271). Both terms have a common immediate parent term, voltage-gated potassium channel activity (GO:0005249). This query protein is involved in the G-protein coupled receptor protein signaling pathway (GO:0007186), for which PFP using the E-value cutoffs of 1.0 and 10.0 has captured more specialized child terms of GO:0004888 transmembrane signaling receptor activity and GO:0004930 G-protein coupled receptor activity (e.g. kappa-opioid receptor activity, GO:0004987). Overall, in this example, even using weak sequence hits, which are conventionally discarded in a homology search, PFP still managed to indicate that this protein is a potassium channel located in the inner membrane (transmembrane).
Table 1.
Examples of correct annotations predicted by PFP using weakly similar sequence hits.
| UniProt ID | GO Annotations |
Definition of the GO terms |
Relevant PFP predictions using E- values > 1.0 |
Definition of the Predicted GO terms |
Rank | Relevant PFP predictions using E- values > 10.0 |
Definition of the Predicted GO terms |
Rank |
|---|---|---|---|---|---|---|---|---|
|
O95069 Outward rectifying potassium channel protein TREK-1 |
GO:0006813 | potassium ion transport | GO:0004878 | complement component C5a receptor activity | 1 | GO:0004987 | kappa-opioid receptor activity | 1 |
| GO:0007186 | G-protein coupled receptor protein signaling pathway | GO:0001847 | opsonin receptor activity | 2 | GO:0005242 | inward rectifier potassium channel activity | 6 | |
| GO:0071805 | potassium ion transmembrane transport | GO:0005242 | inward rectifier potassium channel activity | 3 | GO:0001518 | voltage-gated sodium channel complex | 2 | |
| GO:0034765 | regulation of ion transmembrane transport | GO:0004987 | kappa-opioid receptor activity | 4 | GO:0019866 | inner membrane | 3 | |
| GO:0005249 | voltage-gated potassium channel activity | GO:0001518 | voltage-gated sodium channel complex | 1 | GO:0016020 | membrane | 4 | |
| GO:0015271 | outward rectifier potassium channel activity | GO:0017071 | intracellular cyclic nucleotide activated cation channel complex | 4 | ||||
| GO:0016020 | Membrane | GO:0019866 | inner membrane | 7 | ||||
| GO:0016021 | integral to membrane | GO:0016020 | membrane | 8 | ||||
| GO:0008076 | voltage-gated potassium channel complex | 11 | ||||||
| E1WAA4 Formate hydrogenlyase transcriptional activator | GO:0000160 | two-component signal transduction system (phosphorelay) | GO:0005488 | binding | 3 | GO:0005488 | binding | 2 |
| | GO:0006351 | transcription, DNA-dependent | GO:0016462 | pyrophosphatase activity | 9 | GO:0016462 | pyrophosphatase activity | 9 |
| | GO:0006355 | regulation of transcription, DNA-dependent | GO:0003677 | DNA binding | 11 | GO:0003677 | DNA binding | 10 |
| | GO:0000166 | nucleotide binding | GO:0003700 | transcription factor activity | 14 | GO:0006351 | transcription, DNA-dependent | 4 |
| | GO:0003677 | DNA binding | GO:0050907 | sensory transduction of chemical stimulus | 6 | | | |
| | GO:0003700 | sequence-specific DNA binding transcription factor activity | GO:0006351 | transcription, DNA-dependent | 7 | | | |
| | GO:0005524 | ATP binding | | | | | | |
| | GO:0008134 | transcription factor binding | | | | | | |
| | GO:0017111 | nucleoside-triphosphatase activity | | | | | | |
| Q8UVE6 Transcription factor AP2 alpha 1 | GO:0001501 | skeletal system development | GO:0003705 | RNA polymerase II transcription factor activity, enhancer binding | 6 | GO:0008134 | transcription factor binding | 7 |
| | GO:0006351 | transcription, DNA-dependent | GO:0003677 | DNA binding | 7 | GO:0003700 | transcription factor activity | 8 |
| | GO:0006355 | regulation of transcription, DNA-dependent | GO:0003700 | transcription factor activity | 13 | GO:0006351 | transcription, DNA-dependent | 2 |
| | GO:0007422 | peripheral nervous system development | GO:0008134 | transcription factor binding | 14 | | | |
| | GO:0014036 | neural crest cell fate specification | GO:0006351 | transcription, DNA-dependent | 3 | | | |
| | GO:0030318 | melanocyte differentiation | | | | | | |
| | GO:0060041 | retina development in camera-type eye | | | | | | |
| | GO:0003700 | sequence-specific DNA binding transcription factor activity | | | | | | |
| | GO:0005634 | nucleus | | | | | | |
| Q12386 Actin-like protein ARP8 | GO:0006312 | mitotic recombination | GO:0005488 | binding | 1 | GO:0003676 | nucleic acid binding | 4 |
| | GO:0006338 | chromatin remodeling | GO:0003676 | nucleic acid binding | 4 | GO:0003682 | chromatin binding | 7 |
| | GO:0006974 | response to DNA damage stimulus | GO:0003682 | chromatin binding | 7 | GO:0008135 | translation factor activity, nucleic acid binding | 13 |
| | GO:0006355 | regulation of transcription, DNA-dependent | GO:0003677 | DNA binding | 12 | GO:0003723 | RNA binding | 16 |
| | GO:0003729 | mRNA binding | GO:0003697 | single-stranded DNA binding | 13 | GO:0046034 | ATP metabolism | 3 |
| | GO:0043140 | ATP-dependent 3'–5' DNA helicase activity | GO:0003700 | transcription factor activity | 15 | GO:0009199 | ribonucleoside triphosphate metabolism | 9 |
| | GO:0005634 | nucleus | GO:0006351 | transcription, DNA-dependent | 2 | GO:0006351 | transcription, DNA-dependent | 13 |
| | GO:0005856 | cytoskeleton | GO:0046034 | ATP metabolism | 6 | | | |
| | GO:0031011 | Ino80 complex | GO:0009199 | ribonucleoside triphosphate metabolism | 9 | | | |

Function predictions by PFP for four proteins are shown. The first three columns from the left show annotations from the UniProt database. The next three columns (the 4th to the 6th) show predictions by PFP that are derived only from weak sequence hits with an E-value of 1.0 or larger. Only predictions relevant to the correct annotations are shown. “Rank” is the rank of the prediction based on PFP’s confidence score. Because the predicted GO terms are ranked separately for each of the three GO categories, up to three predictions can share the same rank. The last three columns (the 7th to the 9th) show predictions by PFP that use only weak sequence hits with an E-value of 10.0 or larger.
The second example, formate hydrogenlyase transcriptional activator (UniProt ID: E1WAA4), is involved in transcription, DNA-dependent (GO:0006351). PFP predicted this GO term within the top 10 ranks with both E-value cutoffs, 1.0 and 10.0. This protein is also annotated with GO:0017111 nucleoside-triphosphatase activity, for which PFP predicted a less specific parent term, GO:0016462 pyrophosphatase activity. Similar results can be seen in the last two examples, Q8UVE6 and Q12386. Using only sequence hits with an E-value of 1.0 or larger, or of 10.0 or larger, PFP correctly predicted their functional class, transcription factor. More examples can be found in the original paper [79].
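The two ideas in this paragraph and in the table caption, ranking GO terms separately within each GO category using only weak hits, and counting a less specific parent term as a relevant prediction, can be illustrated with a minimal sketch. It is not PFP's implementation: the hit list, the E-values, the weight `-log10(E) + 2` and the function names are assumptions made for illustration, and only the GO identifiers and the parent relation between GO:0016462 and GO:0017111 come from the text.

```python
import math
from collections import defaultdict

# Hypothetical search hits: (E-value, GO terms annotated to the hit sequence).
# The E-values and hit contents are invented for this example.
HITS = [
    (0.5, ["GO:0003677"]),
    (2.0, ["GO:0006351", "GO:0003677"]),
    (8.0, ["GO:0016462"]),
    (15.0, ["GO:0006351"]),
    (40.0, ["GO:0016462", "GO:0006351"]),
]

# GO category of each term (BP = biological process, MF = molecular function).
CATEGORY = {
    "GO:0006351": "BP",  # transcription, DNA-dependent
    "GO:0003677": "MF",  # DNA binding
    "GO:0016462": "MF",  # pyrophosphatase activity
    "GO:0017111": "MF",  # nucleoside-triphosphatase activity
}

# Toy child -> parents map holding only the relation mentioned in the text.
PARENTS = {"GO:0017111": ["GO:0016462"]}


def rank_weak_hits(hits, min_evalue):
    """Score GO terms from hits with E-value >= min_evalue, rank per category."""
    scores = defaultdict(float)
    for evalue, terms in hits:
        if evalue < min_evalue:
            continue  # keep only the weak hits
        weight = -math.log10(evalue) + 2.0  # assumed weight, not PFP's formula
        if weight <= 0:
            continue  # hits with E >= 100 carry no weight in this sketch
        for term in terms:
            scores[term] += weight
    by_category = defaultdict(list)
    for term, score in scores.items():
        by_category[CATEGORY[term]].append((score, term))
    ranks = {}
    for category, items in by_category.items():
        items.sort(key=lambda item: (-item[0], item[1]))
        for rank, (_, term) in enumerate(items, start=1):
            ranks[term] = (category, rank)
    return ranks


def is_ancestor(candidate, term, parents=PARENTS):
    """Return True if `candidate` is a (transitive) parent of `term`."""
    stack = list(parents.get(term, []))
    seen = set()
    while stack:
        current = stack.pop()
        if current == candidate:
            return True
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return False


if __name__ == "__main__":
    for cutoff in (1.0, 10.0):
        print(cutoff, rank_weak_hits(HITS, cutoff))
    print(is_ancestor("GO:0016462", "GO:0017111"))
```

With the cutoff of 1.0, this toy data puts one biological process term and one molecular function term both at rank 1, which is why the table can list several predictions with the same rank. The ancestor check shows how a predicted parent term such as GO:0016462 can be matched to the more specific annotation GO:0017111.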
PFP’s superior performance has also been demonstrated in community-wide assessments of computational function prediction. At the Automatic Function Prediction Special Interest Group (AFP-SIG) meeting held at the Intelligent Systems for Molecular Biology (ISMB) conference in 2005 [93] and in the function prediction category of the Critical Assessment of Techniques for Protein Structure Prediction 7 (CASP7) [94], PFP showed the best overall performance among the participants.
In contrast to PFP, which aims to increase sensitivity and thereby enlarge annotation coverage, ESG is intended to make more precise predictions through iterative database searches. In a thorough benchmark study [86], ESG was found to have higher precision than PFP and the other existing methods at a sensitivity comparable to that of PFP. ESG was also found to give more accurate predictions for multi-domain proteins, since the second round of PSI-BLAST searches is often initiated from different local regions of the query sequence.
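To make the idea of combining two rounds of searches concrete, the following is a minimal sketch of a two-level weighted vote. It is not ESG's scoring scheme, which is described in [86]: the hit weights, the mixing parameter `v`, the data layout and the function names are all assumptions made for illustration. Each first-round hit contributes its own annotations and, through its second-round hits, a consensus view of its sequence neighborhood.

```python
def normalize(weights):
    """Scale a list of positive weights so that they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]


def two_round_scores(first_round, second_round, v=0.5):
    """Combine GO term evidence from two rounds of searches.

    first_round:  list of (hit_id, weight, set of GO terms of the hit)
    second_round: dict mapping hit_id to a list of (weight, set of GO terms)
                  obtained by searching again from that hit
    v:            assumed mixing weight between a hit's own annotation and
                  the annotations of its second-round hits
    """
    scores = {}
    first_weights = normalize([w for _, w, _ in first_round])
    for (hit_id, _, terms), wi in zip(first_round, first_weights):
        # Weighted vote among the second-round hits of this first-round hit.
        local = {}
        level2 = second_round.get(hit_id, [])
        if level2:
            second_weights = normalize([w for w, _ in level2])
            for (_, terms2), wj in zip(level2, second_weights):
                for term in terms2:
                    local[term] = local.get(term, 0.0) + wj
        # Mix direct annotation and second-round support, then add to total.
        for term in set(terms) | set(local):
            direct = 1.0 if term in terms else 0.0
            support = v * direct + (1.0 - v) * local.get(term, 0.0)
            scores[term] = scores.get(term, 0.0) + wi * support
    return scores


if __name__ == "__main__":
    first = [("A", 3.0, {"GO:0003677"}), ("B", 1.0, {"GO:0003700"})]
    second = {
        "A": [(1.0, {"GO:0003677"}), (1.0, {"GO:0003677", "GO:0006351"})],
        "B": [(1.0, {"GO:0005488"})],
    }
    for term, score in sorted(two_round_scores(first, second).items()):
        print(term, round(score, 4))
```

In this toy setting a term scores highly only when it is supported both by a well-weighted first-round hit and by that hit's own neighbors, which is one way to see why an iterative scheme can trade some coverage for precision. Because each second-round search starts from a different hit, hits aligned to different regions of a query can bring in annotations for different domains.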
Availability of PFP and ESG
PFP and ESG are freely available to academic users as web servers at http://kiharalab.org/web/pfp.php and http://kiharalab.org/web/esg.php. Users can submit sequences and receive predicted GO terms for them. The stand-alone programs are available upon request.
Conclusion
Many protein structures determined by the structural genomics projects remain functionally unknown because they are not homologous to characterized proteins or lack global sequence or structural similarity to them. In this article, we have discussed structure-based and sequence-based methods developed in our group to cope with such proteins of unknown function. Two structure-based methods, Pocket-Surfer and Patch-Surfer, detect similar known binding pockets for pocket regions in a query protein without using global protein fold similarity. Two sequence-based methods, PFP and ESG, make use of weakly similar sequences that are conventionally discarded in homology-based function annotation. We hope that computational methods, combined with experimental methods, will make a leading contribution to the functional elucidation of these protein structures.
Acknowledgements
This work is supported in part by the National Institute of General Medical Sciences of the National Institutes of Health (R01GM075004, R01GM097528), the National Science Foundation (DMS0800568, EF0850009, IIS0915801) and the Showalter Trust. MC is supported by a Bilsland Dissertation Fellowship from the College of Science, Purdue University.
Abbreviations
- PDB: Protein Data Bank
- 3DZD: three-dimensional Zernike descriptor
- ATP: adenosine triphosphate
- HEM: heme
- NAD: nicotinamide adenine dinucleotide
- FAD: flavin adenine dinucleotide
- BTN: biotin
- F6P: fructose 6-phosphate
- GUN: guanine
- PLM: palmitic acid
- RTL: retinol
- AUC: area under the curve
- ROC: receiver operating characteristic
- EF: enrichment factor
- GO: Gene Ontology
- PFP: protein function prediction
- ESG: extended similarity group
- AFP-SIG: Automatic Function Prediction Special Interest Group
- ISMB: Intelligent Systems for Molecular Biology
- CASP: Critical Assessment of Techniques for Protein Structure Prediction
Reference List
- 1. Chandonia JM, Brenner SE. The impact of structural genomics: expectations and outcomes. Science. 2006;311:347–351. doi: 10.1126/science.1121018.
- 2. Norvell JC, Berg JM. Update on the protein structure initiative. Structure. 2007;15:1519–1522. doi: 10.1016/j.str.2007.11.004.
- 3. Terwilliger TC, Stuart D, Yokoyama S. Lessons from structural genomics. Annu Rev Biophys. 2009;38:371–383. doi: 10.1146/annurev.biophys.050708.133740.
- 4. Todd AE, Marsden RL, Thornton JM, Orengo CA. Progress of structural genomics initiatives: an analysis of solved target structures. J Mol Biol. 2005;348:1235–1260. doi: 10.1016/j.jmb.2005.03.037.
- 5. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE. The Protein Data Bank. Nucleic Acids Res. 2000;28:235–242. doi: 10.1093/nar/28.1.235.
- 6. Westbrook J, Feng Z, Chen L, Yang H, Berman HM. The Protein Data Bank and structural genomics. Nucleic Acids Res. 2003;31:489–491. doi: 10.1093/nar/gkg068.
- 7. Ellrott K, Zmasek CM, Weekes D, Sri KS, Bakolitsa C, Godzik A, Wooley J. TOPSAN: a dynamic web database for structural genomics. Nucleic Acids Res. 2011;39:D494–D496. doi: 10.1093/nar/gkq902.
- 8. Shin DH, Hou J, Chandonia JM, Das D, Choi IG, Kim R, Kim SH. Structure-based inference of molecular functions of proteins of unknown function from Berkeley Structural Genomics Center. J Struct Funct Genomics. 2007;8:99–105. doi: 10.1007/s10969-007-9025-4.
- 9. Teplyakov A, Pullalarevu S, Obmolova G, Doseeva V, Galkin A, Herzberg O, Dauter M, Dauter Z, Gilliland GL. Crystal structure of the YffB protein from Pseudomonas aeruginosa suggests a glutathione-dependent thiol reductase function. BMC Struct Biol. 2004;4:5. doi: 10.1186/1472-6807-4-5.
- 10. Teplyakov A, Obmolova G, Sarikaya E, Pullalarevu S, Krajewski W, Galkin A, Howard AJ, Herzberg O, Gilliland GL. Crystal structure of the YgfZ protein from Escherichia coli suggests a folate-dependent regulatory role in one-carbon metabolism. J Bacteriol. 2004;186:7134–7140. doi: 10.1128/JB.186.21.7134-7140.2004.
- 11. Li De La Sierra-Gallay, Collinet B, Graille M, Quevillon-Cheruel S, Liger D, Minard P, Blondeau K, Henckes G, Aufrere R, Leulliot N, Zhou CZ, Sorel I, Ferrer JL, Poupon A, Janin J, van TH. Crystal structure of the YGR205w protein from Saccharomyces cerevisiae: close structural resemblance to E. coli pantothenate kinase. Proteins. 2004;54:776–783. doi: 10.1002/prot.10596.
- 12. Graille M, Quevillon-Cheruel S, Leulliot N, Zhou CZ, Li de la Sierra Gallay, Jacquamet L, Ferrer JL, Liger D, Poupon A, Janin J, van TH. Crystal structure of the YDR533c S. cerevisiae protein, a class II member of the Hsp31 family. Structure. 2004;12:839–847. doi: 10.1016/j.str.2004.02.030.
- 13. Liger D, Graille M, Zhou CZ, Leulliot N, Quevillon-Cheruel S, Blondeau K, Janin J, van TH. Crystal structure and functional characterization of yeast YLR011wp, an enzyme with NAD(P)H-FMN and ferric iron reductase activities. J Biol Chem. 2004;279:34890–34897. doi: 10.1074/jbc.M405404200.
- 14. Sanishvili R, Yakunin AF, Laskowski RA, Skarina T, Evdokimova E, Doherty-Kirby A, Lajoie GA, Thornton JM, Arrowsmith CH, Savchenko A, Joachimiak A, Edwards AM. Integrating structure, bioinformatics, and enzymology to discover function: BioH, a new carboxylesterase from Escherichia coli. J Biol Chem. 2003;278:26039–26045. doi: 10.1074/jbc.M303867200.
- 15. Kuznetsova E, Proudfoot M, Sanders SA, Reinking J, Savchenko A, Arrowsmith CH, Edwards AM, Yakunin AF. Enzyme genomics: Application of general enzymatic screens to discover new enzymes. FEMS Microbiol Rev. 2005;29:263–279. doi: 10.1016/j.femsre.2004.12.006.
- 16. Fridman E, Pichersky E. Metabolomics, genomics, proteomics, and the identification of enzymes and their substrates and products. Curr Opin Plant Biol. 2005;8:242–248. doi: 10.1016/j.pbi.2005.03.004.
- 17. Roberts RJ. COMBREX: COMputational BRidge to EXperiments. Biochem Soc Trans. 2011;39:581–583. doi: 10.1042/BST0390581.
- 18. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990;215:403–410. doi: 10.1016/S0022-2836(05)80360-2.
- 19. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–3402. doi: 10.1093/nar/25.17.3389.
- 20. Pearson WR, Lipman DJ. Improved tools for biological sequence comparison. Proc Natl Acad Sci U S A. 1988;85:2444–2448. doi: 10.1073/pnas.85.8.2444.
- 21. Finn RD, Mistry J, Tate J, Coggill P, Heger A, Pollington JE, Gavin OL, Gunasekaran P, Ceric G, Forslund K, Holm L, Sonnhammer EL, Eddy SR, Bateman A. The Pfam protein families database. Nucleic Acids Res. 2010;38:D211–D222. doi: 10.1093/nar/gkp985.
- 22. Hawkins T, Kihara D. Function prediction of uncharacterized proteins. J Bioinform Comput Biol. 2007;5:1–30. doi: 10.1142/s0219720007002503.
- 23. Hawkins T, Chitale M, Kihara D. New paradigm in protein function prediction for large scale omics analysis. Mol Biosyst. 2008;4:223–231. doi: 10.1039/b718229e.
- 24. Kihara D. Protein function prediction for omics era. London: Springer; 2011.
- 25. Gherardini PF, Helmer-Citterich M. Structure-based function prediction: approaches and applications. Brief Funct Genomic Proteomic. 2008;7:291–302. doi: 10.1093/bfgp/eln030.
- 26. Martin AC, Orengo CA, Hutchinson EG, Jones S, Karmirantzou M, Laskowski RA, Mitchell JB, Taroni C, Thornton JM. Protein folds and functions. Structure. 1998;6:875–884. doi: 10.1016/s0969-2126(98)00089-6.
- 27. Thornton JM, Todd AE, Milburn D, Borkakoti N, Orengo CA. From structure to function: approaches and limitations. Nat Struct Biol. 2000;7(Suppl):991–994. doi: 10.1038/80784.
- 28. Shindyalov IN, Bourne PE. Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng. 1998;11:739–747. doi: 10.1093/protein/11.9.739.
- 29. Holm L, Sander C. Protein structure comparison by alignment of distance matrices. J Mol Biol. 1993;233:123–138. doi: 10.1006/jmbi.1993.1489.
- 30. Orengo CA, Taylor WR. SSAP: sequential structure alignment program for protein structure comparison. Methods Enzymol. 1996;266:617–635. doi: 10.1016/s0076-6879(96)66038-8.
- 31. Thompson KE, Wang Y, Madej T, Bryant SH. Improving protein structure similarity searches using domain boundaries based on conserved sequence information. BMC Struct Biol. 2009;9:33. doi: 10.1186/1472-6807-9-33.
- 32. Mizuguchi K, Go N. Comparison of spatial arrangements of secondary structural elements in proteins. Protein Eng. 1995;8:353–362. doi: 10.1093/protein/8.4.353.
- 33. Kihara D, Sael L, Chikhi R, Esquivel-Rodriguez J. Molecular surface representation using 3D Zernike descriptors for protein shape comparison and docking. Curr Protein Pept Sci. 2011;12:520–530. doi: 10.2174/138920311796957612.
- 34. La D, Esquivel-Rodriguez J, Venkatraman V, Li B, Sael L, Ueng S, Ahrendt S, Kihara D. 3D-SURFER: software for high-throughput protein surface comparison and analysis. Bioinformatics. 2009;25:2843–2844. doi: 10.1093/bioinformatics/btp542.
- 35. Sael L, Li B, La D, Fang Y, Ramani K, Rustamov R, Kihara D. Fast protein tertiary structure retrieval based on global surface shape similarity. Proteins. 2008;72:1259–1273. doi: 10.1002/prot.22030.
- 36. Sael L, Kihara D. Protein surface representation and comparison: New approaches in structural proteomics. In: Chen J, Lonardi S, editors. Biological Data Mining. Boca Raton, Florida, USA: Chapman & Hall/CRC Press; 2009. pp. 89–109.
- 37. Venkatraman V, Sael L, Kihara D. Potential for protein surface shape analysis using spherical harmonics and 3D Zernike descriptors. Cell Biochem Biophys. 2009;54:23–32. doi: 10.1007/s12013-009-9051-x.
- 38. Ritchie DW, Graham J. Fast computation, rotation, and comparison of low resolution spherical harmonic molecular surfaces. J Comp Chem. 1999;20:383–395.
- 39. Orengo CA, Jones DT, Thornton JM. Protein superfamilies and domain superfolds. Nature. 1994;372:631–634. doi: 10.1038/372631a0.
- 40. Porter CT, Bartlett GJ, Thornton JM. The Catalytic Site Atlas: a resource of catalytic sites and residues identified in enzymes using structural data. Nucleic Acids Res. 2004;32:D129–D133. doi: 10.1093/nar/gkh028.
- 41. Arakaki AK, Zhang Y, Skolnick J. Large-scale assessment of the utility of low-resolution protein structures for biochemical function assignment. Bioinformatics. 2004;20:1087–1096. doi: 10.1093/bioinformatics/bth044.
- 42. Artymiuk PJ, Poirrette AR, Grindley HM, Rice DW, Willett P. A graph-theoretic approach to the identification of three-dimensional patterns of amino acid side-chains in protein structures. J Mol Biol. 1994;243:327–344. doi: 10.1006/jmbi.1994.1657.
- 43. Kleywegt GJ. Recognition of spatial motifs in protein structures. J Mol Biol. 1999;285:1887–1897. doi: 10.1006/jmbi.1998.2393.
- 44. Ferre F, Ausiello G, Zanzoni A, Helmer-Citterich M. SURFACE: a database of protein surface regions for functional annotation. Nucleic Acids Res. 2004;32:D240–D244. doi: 10.1093/nar/gkh054.
- 45. Redfern OC, Dessailly BH, Dallman TJ, Sillitoe I, Orengo CA. FLORA: a novel method to predict protein function from structure in diverse superfamilies. PLoS Comput Biol. 2009;5:e1000485. doi: 10.1371/journal.pcbi.1000485.
- 46. Schmitt S, Kuhn D, Klebe G. A new method to detect related function among proteins independent of sequence and fold homology. J Mol Biol. 2002;323:387–406. doi: 10.1016/s0022-2836(02)00811-2.
- 47. Gold ND, Jackson RM. Fold independent structural comparisons of protein-ligand binding sites for exploring functional relationships. J Mol Biol. 2006;355:1112–1124. doi: 10.1016/j.jmb.2005.11.044.
- 48. Kinoshita K, Nakamura H. Identification of the ligand binding sites on the molecular surface of proteins. Protein Sci. 2005;14:711–718. doi: 10.1110/ps.041080105.
- 49. Morris RJ, Najmanovich RJ, Kahraman A, Thornton JM. Real spherical harmonic expansion coefficients as 3D shape descriptors for protein binding pocket and ligand comparisons. Bioinformatics. 2005;21:2347–2355. doi: 10.1093/bioinformatics/bti337.
- 50. Binkowski TA, Adamian L, Liang J. Inferring functional relationships of proteins from local sequence and spatial surface patterns. J Mol Biol. 2003;332:505–526. doi: 10.1016/s0022-2836(03)00882-9.
- 51. Binkowski TA, Freeman P, Liang J. pvSOAR: detecting similar surface patterns of pocket and void surfaces of amino acid residues on proteins. Nucleic Acids Res. 2004;32:W555–W558. doi: 10.1093/nar/gkh390.
- 52. Binkowski TA, Joachimiak A. Protein functional surfaces: global shape matching and local spatial alignments of ligand binding sites. BMC Struct Biol. 2008;8:45. doi: 10.1186/1472-6807-8-45.
- 53. Laskowski RA, Watson JD, Thornton JM. ProFunc: a server for predicting protein function from 3D structure. Nucleic Acids Res. 2005;33:W89–W93. doi: 10.1093/nar/gki414.
- 54. Pal D, Eisenberg D. Inference of protein function from protein structure. Structure (Camb) 2005;13:121–130. doi: 10.1016/j.str.2004.10.015.
- 55. Chikhi R, Sael L, Kihara D. Real-time ligand binding pocket database search using local surface descriptors. Proteins. 2010;78:2007–2028. doi: 10.1002/prot.22715.
- 56. Sael L, Kihara D. Binding ligand prediction for proteins using partial matching of local surface patches. International Journal of Molecular Sciences. 2011;11:5009–5026. doi: 10.3390/ijms11125009.
- 57. Sael L, Kihara D. Detecting Local Ligand-Binding Site Similarity in Non-Homologous Proteins by Surface Patch Comparison. Proteins. 2011 doi: 10.1002/prot.24018. in press.
- 58. Novotni M, Klein R. 3D Zernike descriptors for content based shape retrieval. ACM Symposium on Solid and Physical Modeling, Proceedings of the eighth ACM symposium on Solid modeling and applications; 2003. pp. 216–225.
- 59. Canterakis N. 3D Zernike moments and Zernike affine invariants for 3D image analysis and recognition. Proc 11th Scandinavian Conference on Image Analysis; 1999. pp. 85–93.
- 60. Baker NA, Sept D, Joseph S, Holst MJ, McCammon JA. Electrostatics of nanosystems: application to microtubules and the ribosome. Proc Natl Acad Sci U S A. 2001;98:10037–10041. doi: 10.1073/pnas.181342398.
- 61. Li B, Turuvekere S, Agrawal M, La D, Ramani K, Kihara D. Characterization of local geometry of protein surfaces with the visibility criterion. Proteins. 2007;71:670–683. doi: 10.1002/prot.21732.
- 62. Huang B, Schroeder M. LIGSITEcsc: predicting ligand binding sites using the Connolly surface and degree of conservation. BMC Struct Biol. 2006;6:19. doi: 10.1186/1472-6807-6-19.
- 63. Demange G, Gale D, Sotomayor M. Multi-item auctions. J Political Economy. 1986;94:863–872.
- 64. Kahraman A, Morris RJ, Laskowski RA, Favia AD, Thornton JM. On the diversity of physicochemical environments experienced by identical ligands in binding pockets of unrelated proteins. Proteins. 2009 doi: 10.1002/prot.22633.
- 65. Gribskov M, Robinson NL. Use of receiver operating characteristic (ROC) analysis to evaluate sequence matching. Comput Chem. 1996;20:25–33. doi: 10.1016/s0097-8485(96)80004-0.
- 66. Sael L, Kihara D. Constructing patch-based ligand-binding pocket database for predicting function of proteins. BMC Bioinformatics. 2011 doi: 10.1186/1471-2105-13-S2-S7. in press.
- 67. Wallach I, Lilien R. The protein-small-molecule database, a non-redundant structural resource for the analysis of protein-ligand binding. Bioinformatics. 2009;25:615–620. doi: 10.1093/bioinformatics/btp035.
- 68. Bender A, Glen RC. A discussion of measures of enrichment in virtual screening: comparing the information content of descriptors with increasing levels of sophistication. J Chem Inf Model. 2005;45:1369–1375. doi: 10.1021/ci0500177.
- 69. Venkatraman V, Chakravarthy PR, Kihara D. Application of 3D Zernike descriptors to shape-based ligand similarity searching. J Cheminformatics. 2009;1:19. doi: 10.1186/1758-2946-1-19.
- 70. Huang SY, Zou X. Advances and challenges in protein-ligand docking. Int J Mol Sci. 2010;11:3016–3034. doi: 10.3390/ijms11083016.
- 71. Hulo N, Bairoch A, Bulliard V, Cerutti L, De CE, Langendijk-Genevaux PS, Pagni M, Sigrist CJ. The PROSITE database. Nucleic Acids Res. 2006;34:D227–D230. doi: 10.1093/nar/gkj063.
- 72. Mulder NJ, Apweiler R, Attwood TK, Bairoch A, Bateman A, Binns D, Bradley P, Bork P, Bucher P, Cerutti L, Copley R, Courcelle E, Das U, Durbin R, Fleischmann W, Gough J, Haft D, Harte N, Hulo N, Kahn D, Kanapin A, Krestyaninova M, Lonsdale D, Lopez R, Letunic I, Madera M, Maslen J, McDowall J, Mitchell A, Nikolskaya AN, Orchard S, Pagni M, Ponting CP, Quevillon E, Selengut J, Sigrist CJ, Silventoinen V, Studholme DJ, Vaughan R, Wu CH. InterPro, progress and status in 2005. Nucleic Acids Res. 2005;33:D201–D205. doi: 10.1093/nar/gki106.
- 73. Bru C, Courcelle E, Carrere S, Beausse Y, Dalmar S, Kahn D. The ProDom database of protein domain families: more emphasis on 3D. Nucleic Acids Res. 2005;33:D212–D215. doi: 10.1093/nar/gki034.
- 74. Chitale M, Kihara D. Computational protein function prediction: Framework and challenges. In: Kihara D, editor. Protein function prediction for omics era. London: Springer; 2011. pp. 1–17.
- 75. John B, Sali A. Detection of homologous proteins by an intermediate sequence search. Protein Sci. 2004;13:54–62. doi: 10.1110/ps.03335004.
- 76. Salamov AA, Suwa M, Orengo CA, Swindells MB. Combining sensitive database searches with multiple intermediates to detect distant homologues. Protein Eng. 1999;12:95–100. doi: 10.1093/protein/12.2.95.
- 77. Park J, Teichmann SA, Hubbard T, Chothia C. Intermediate sequences increase the detection of homology between sequences. J Mol Biol. 1997;273:349–354. doi: 10.1006/jmbi.1997.1288.
- 78. Hawkins T, Luban S, Kihara D. Enhanced automated function prediction using distantly related sequences and contextual association by PFP. Protein Sci. 2006;15:1550–1556. doi: 10.1110/ps.062153506.
- 79. Hawkins T, Chitale M, Luban S, Kihara D. PFP: Automated prediction of gene ontology functional annotations with confidence scores using protein sequence data. Proteins. 2009;74:566–582. doi: 10.1002/prot.22172.
- 80. Harris MA, Clark J, Ireland A, Lomax J, Ashburner M, Foulger R, Eilbeck K, Lewis S, Marshall B, Mungall C, Richter J, Rubin GM, Blake JA, Bult C, Dolan M, Drabkin H, Eppig JT, Hill DP, Ni L, Ringwald M, Balakrishnan R, Cherry JM, Christie KR, Costanzo MC, Dwight SS, Engel S, Fisk DG, Hirschman JE, Hong EL, Nash RS, Sethuraman A, Theesfeld CL, Botstein D, Dolinski K, Feierbach B, Berardini T, Mundodi S, Rhee SY, Apweiler R, Barrell D, Camon E, Dimmer E, Lee V, Chisholm R, Gaudet P, Kibbe W, Kishore R, Schwarz EM, Sternberg P, Gwinn M, Hannick L, Wortman J, Berriman M, Wood V, de la CN, Tonellato P, Jaiswal P, Seigfried T, White R. The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res. 2004;32:D258–D261. doi: 10.1093/nar/gkh036.
- 81. Martin DM, Berriman M, Barton GJ. GOtcha: a new method for prediction of protein function assessed by the annotation of seven genomes. BMC Bioinformatics. 2004;5:178. doi: 10.1186/1471-2105-5-178.
- 82. Khan S, Situ G, Decker K, Schmidt CJ. GoFigure: automated Gene Ontology annotation. Bioinformatics. 2003;19:2484–2485. doi: 10.1093/bioinformatics/btg338.
- 83. Zehetner G. OntoBlast function: From sequence similarities directly to potential functional annotations by ontology terms. Nucleic Acids Res. 2003;31:3799–3803. doi: 10.1093/nar/gkg555.
- 84. Hawkins T, Chitale M, Kihara D. Functional enrichment analyses and construction of functional similarity networks with high confidence function prediction by PFP. BMC Bioinformatics. 2010;11:265. doi: 10.1186/1471-2105-11-265.
- 85. Si L, Yu D, Kihara D, Yi F. Combining sequence similarity scores and textual information for gene function annotation in the literature. Information Retrieval. 2008;11:389–404.
- 86. Chitale M, Hawkins T, Park C, Kihara D. ESG: Extended similarity group method for automated protein function prediction. Bioinformatics. 2009;25:1739–1745. doi: 10.1093/bioinformatics/btp309.
- 87. Wass MN, Sternberg MJ. ConFunc--functional annotation in the twilight zone. Bioinformatics. 2008;24:798–806. doi: 10.1093/bioinformatics/btn037.
- 88. Engelhardt BE, Jordan MI, Muratore KE, Brenner SE. Protein molecular function prediction by Bayesian phylogenomics. PLoS Comput Biol. 2005;1:e45. doi: 10.1371/journal.pcbi.0010045.
- 89. Krishnamurthy N, Brown D, Sjolander K. FlowerPower: clustering proteins into domain architecture classes for phylogenomic inference of protein function. BMC Evol Biol. 2007;7(Suppl 1):S12. doi: 10.1186/1471-2148-7-S1-S12.
- 90. Friedberg I, Harder T, Godzik A. JAFA: a protein function annotation meta-server. Nucleic Acids Res. 2006;34:W379–W381. doi: 10.1093/nar/gkl045.
- 91. Chitale M, Hawkins T, Kihara D. Automated prediction of protein function from sequence. In: Bujnicki J, editor. Prediction of Protein Structure, Functions, and Interactions. John Wiley & Sons Ltd.; 2009. pp. 63–86.
- 92. Uniprot Consortium. The Universal Protein Resource (UniProt) in 2010. Nucleic Acids Res. 2010;38:D142–D148. doi: 10.1093/nar/gkp846.
- 93. Friedberg I, Jambon M, Godzik A. New avenues in protein function prediction. Protein Sci. 2006;15:1527–1529. doi: 10.1110/ps.062158406.
- 94. Lopez G, Rojas A, Tress M, Valencia A. Assessment of predictions submitted for the CASP7 function prediction category. Proteins. 2007;69:165–174. doi: 10.1002/prot.21651.




