Predicting protein function and binding profile via matching of local evolutionary and geometric surface patterns

Yan Yuan Tseng; Joseph Dundas; Jie Liang

doi:10.1016/j.jmb.2008.12.072

Author manuscript; available in PMC: 2010 Mar 27.

Published in final edited form as: J Mol Biol. 2009 Jan 6;387(2):451–464. doi: 10.1016/j.jmb.2008.12.072

PMCID: PMC2670802 NIHMSID: NIHMS104909 PMID: 19154742

Abstract

Inferring protein functions from structures is a challenging task, as a large number of orphan protein structures from structural genomics projects are now solved without their biochemical functions characterized. For proteins binding to similar substrates or ligands and carrying out similar functions, their binding surfaces experience similar physicochemical constraints, and hence the sets of allowed and forbidden residue substitutions are similar. However, it is difficult to isolate such selection pressure due to protein function from selection pressure due to protein folding, and evolutionary relationships reflected by global sequence and structure similarities between proteins are often unreliable for inferring protein function. We have developed a method, called pevoSOAR (Pocket-based EVOlutionary Search Of Amino acid Residues), for predicting protein functions by solving the problem of uncovering the residue substitution pattern due to protein function and separating it from the substitution pattern due to protein folding. We incorporate evolutionary information specific to an individual binding region and match local surfaces at large scale to identify those with similar functions. Our pevoSOAR method also computes a profile that characterizes protein binding activities that may involve multiple substrates or ligands. We show that our method can be used to predict enzyme functions accurately. It can also assess enzyme binding specificity and promiscuity. In an objective large scale test of 100 enzyme families with thousands of structures, our predictions are found to be sensitive and specific: at the stringent specificity level of 99.98%, we can correctly predict enzyme functions for 80.55% of the proteins. The overall area under the Receiver Operating Characteristic curve measuring the performance of our prediction is 0.955, close to the perfect value of 1.00. The best Matthews coefficient is 86.6%. Our method also works well in predicting the biochemical functions of orphan proteins from structural genomics projects.

Keywords: protein function prediction, local binding surfaces, binding profile, Bayesian Markov Chain Monte Carlo, substitution rates

1 Introduction

Predicting the molecular functions of a protein and fully characterizing its biochemical roles is an important task. An effective and widely used computational method is to identify an evolutionary relationship between a protein of known function and the protein in question through sequence alignment. However, the reliability of this approach deteriorates rapidly when the sequence identity between the two proteins falls below 60-70%¹,². In addition, this method cannot indicate where the functionally important regions are located or what the key residues are. More sophisticated sequence-based methods employ position-specific scoring matrices, hidden Markov models, and subfamily-specific scoring methods for function prediction³-⁵.

It is well known that very remote evolutionary relationships can be recognized through analysis of protein fold structures⁶-⁹. However, knowledge of the three-dimensional fold structure does not necessarily translate into knowledge of protein functions. It is also well known that proteins of the same fold can have different biochemical functions, and proteins of different folds can have similar functions¹⁰-¹⁴. Further challenges come from structural genomics projects¹⁵, where many proteins have their structures solved first without knowledge of their biochemical functions. To derive functional information from protein structures, a recent study showed that protein functions can be accurately inferred by integrating information on fold, sequence, motif, and functional linkages¹⁶. Success in inferring functions of difficult proteins has also been achieved by analyzing the distance relationships in the protein structure space map¹⁷.

Because a protein carries out its biological roles by interacting with other molecules, binding surfaces on protein structures play important roles in determining protein functions. As functional annotation cannot be transferred reliably based on global sequence or structure similarity¹,¹⁸,¹⁹, a promising approach is to examine local spatial regions where binding occurs and to identify similar local spatial patterns on other proteins whose functions are known¹⁰,²⁰-²³,²³-³¹. This approach allows the detection of remote functional relationships for proteins in which the global similarity has evolved beyond recognition.

An example of this approach is the pvSOAR method of comparing local surfaces³¹ computed using geometric algorithms²⁸,³²,³³. It is based on the analysis of unfilled empty spaces in proteins. There are three types of empty spaces in proteins where binding interactions may occur (see Fig. 1). Voids are unfilled spaces inside the protein that are fully enclosed. Pockets on protein surfaces are caverns that open to the outside of the protein through mouths that are small relative to cavern dimensions but big enough that a solvent ball has access to the outside of the molecule. The mouth of a pocket is narrower than at least one cross section of the interior of the pocket. Depressions are concave regions on protein surfaces that have no constriction at the mouth³⁴. Pockets and voids can be computed from protein structures using the alpha shape method, with the residues forming the wall delineated and the volume measured²⁸,³³,³⁵-³⁷. In the pvSOAR method, wall residues of a pocket or a void are concatenated into a sequence fragment, regardless of the separation between residues in the primary sequence. The similarity between two surface pockets or voids is first evaluated by assessing the sequence similarity between the two sequence fragments of these surface pockets, with spatial and orientational similarity further assessed. Novel functional relationships between proteins of different families and folds were uncovered using this method³¹.
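The construction of a pocket sequence fragment can be illustrated with a minimal Python sketch. It assumes that wall residues are available as (sequence position, one-letter code) pairs and that the fragment keeps them in order of sequence position; the function name `pocket_fragment` and the example residues are hypothetical and are not taken from pvSOAR itself.

```python
from typing import Iterable, Tuple


def pocket_fragment(wall_residues: Iterable[Tuple[int, str]]) -> str:
    """Concatenate pocket wall residues into one fragment.

    Residues are ordered by sequence position (an assumption of this
    sketch); gaps between positions in the primary sequence are ignored.
    """
    ordered = sorted(set(wall_residues))
    return "".join(code for _, code in ordered)


# Hypothetical wall residues: (sequence position, one-letter amino acid code)
wall = [(440, "H"), (84, "W"), (200, "S"), (121, "Y"), (330, "F")]
fragment = pocket_fragment(wall)
print(fragment)  # WYSFH
```

Two such fragments can then be compared with ordinary sequence alignment tools, even though the residues they contain may be far apart in the full-length sequence.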

Fig 1 — Pockets and voids in proteins. There are three types of concave regions on protein surfaces: Fully enclosed *voids* with no outlet, *pockets* accessible from the outside but with constriction at mouths, and shallow *depressions*. We use the general term *surface pockets* to include both pockets and voids.

To scale up this method and to search rapidly through a database of a large number of protein surface pockets, success hinges upon the use of a scoring matrix for assessing similarity between matched local pocket sequence fragments. However, existing scoring matrices such as Blosum, Pam and Jtt³⁸ are not effective for this purpose, because they do not take into account the evolutionary history of the individual protein of interest. These precomputed matrices have implicit parameters whose values were fixed in advance, so information about the particular protein of interest has limited or no influence. In addition, the counting methods behind the derivation of some of these matrices suffer from underestimation of substitutions in certain branches of a phylogeny³⁹. Furthermore, these matrices are derived based on the assumption that the whole protein or domain experiences similar selection pressure and therefore has the same substitution rates. This is unrealistic, as residues in different environments may experience different selection pressure⁴⁰. For example, conserved residues on a binding site are under very different selection pressure than conserved residues in the folding core⁴¹.

In this study, we further improve the method of function prediction by incorporating evolutionary information specific to an individual binding surface pocket. By estimating substitution rates of the residues located on a surface pocket, we derive customized scoring matrices for assessing surface similarity for predicting and characterizing complex biochemical functions. Our approach, called pevoSOAR, can effectively separate selection pressure due to the need for binding and function from that due to the need for folding and stability. A novel development of our method is a probabilistic model called the computed binding profile, which summarizes the results of surface similarity comparison. This profile can suggest substrates and help to clarify potentially complex binding activities of a protein, as well as possible cross-reactivities. It can be used to predict protein functions with improved sensitivity and specificity. Our paper is organized as follows: We first illustrate how our method works using the example of acetylcholinesterase. We then discuss the probabilistic model for constructing the computed binding profile of a protein. This is followed by a discussion of a large scale test of protein function prediction for 100 protein families. Next, we describe results of the challenging task of predicting the functions of orphan protein structures obtained from structural genomics. We conclude with remarks on the general applicability of our method.

2 Results

2.1 Function prediction by detection of similar binding surfaces

For proteins binding similar substrates and catalyzing similar chemical reactions, the surfaces where such activities occur experience similar physical and chemical constraints. Often these surfaces have similar shape and physicochemical properties. Due to these constraints, the sets of allowed and forbidden residue substitutions also share some similarity. Our assumption is that such similarity can be detected using a sensitive computational method. We first describe how binding pockets are similar to each other in general. We then discuss how our method works by assessing similarity using the example of acetylcholinesterase and deformylase.

Sequence fragments of binding pocket and sequence of backbone

It is informative to assess how similar binding pockets of similar functions are in general. We have collected 2,196 protein structures belonging to 100 protein families, each with its own enzyme classification label⁴² and Gene Ontology descriptive terms⁴³. Figure 2a shows the distribution of identity of pairs of sequence fragments of the residues located on the surfaces of binding pockets. Here each pair comes from members of the same protein family. This distribution is characterized by a median of 60.5% for sequence identity. The overall distribution can be regarded as unimodal. Figure 2b shows the distribution of the overall backbone sequence identities of proteins from the same family for this group of 2,196 protein structures. Its median sequence identity is 39.2%, and the smallest sequence identity is 16.4%. This distribution is clearly bimodal. After removing protein pairs with > 90% full sequence identity from the data, the distribution of pocket sequence fragments has a median of 55.6% sequence identity (Figure 2c), and the distribution of the full sequences has a median identity of 34.2% (Figure 2d). Overall, the median identity of pocket sequence fragments is about 20 percentage points higher than that of full sequences.

Fig 2 — Distribution of identity values of binding surface pockets and full sequences between members of a protein family for 100 protein families (2,196 structures). Distributions of identities of fragments of residues on binding surface walls between members of the same protein family (a) before and (c) after removal of sequences with overall backbone identity > 90%. Note that there are still many instances where the identity between pocket fragments is > 90%. The median sequence identity is 60.5% and 55.6% for (a) and (c), respectively. Distributions of identities of full sequences between members of the same protein family (b) before and (d) after removal of those with overall > 90% sequence identity. The median sequence identity is 39.2% and 34.2%, respectively. Overall, binding surfaces are more conserved than the full sequences.
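As an illustration of how such identity distributions could be summarized, the following Python sketch computes the percent identity of already aligned fragment pairs and takes the median over pairs. The definition of identity used here (matches divided by aligned non-gap columns) is an assumption of this sketch, and the fragments are made-up examples rather than data from the 100 families.

```python
from statistics import median


def percent_identity(frag_a: str, frag_b: str) -> float:
    """Percent identity over aligned, non-gap columns of two equal-length strings."""
    if len(frag_a) != len(frag_b):
        raise ValueError("fragments must be aligned to the same length")
    columns = [(x, y) for x, y in zip(frag_a, frag_b) if x != "-" and y != "-"]
    if not columns:
        return 0.0
    matches = sum(1 for x, y in columns if x == y)
    return 100.0 * matches / len(columns)


# Made-up aligned pocket fragments for three pairs from one family
pairs = [("GSWYH", "GSFYH"), ("GSW-H", "GSWYH"), ("GSWYH", "ASWFH")]
identities = [percent_identity(a, b) for a, b in pairs]
family_median = median(identities)
print(identities, family_median)  # [80.0, 100.0, 60.0] 80.0
```

The same function applies unchanged to full-length aligned sequences, which is what allows the pocket-level and backbone-level distributions to be compared on a common scale.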

From these two distributions, it is clear that binding pockets in general are much more conserved than the full sequence. If we used the simple approach of transferring functional annotation between proteins that share sequence identity greater than a threshold value, then even if we went aggressively beyond the recommended threshold of 60-70%¹,², we would have failed at the 50% threshold to identify the functions of 1,394 out of the 2,196 proteins, representing a failure rate of 63.4%.

It seems that members of the same protein family can often be clustered into two groups based on backbone sequence identity. Members of one group are closely related to each other, and have relatively short evolutionary distances. Members of the other group have diverged further, and are more remotely related. The mixture of these two groups gives rise to the observed bimodal distribution. However, by the criterion of similarity among binding surfaces as measured by the identity of pairs of sequence fragments, all members of a given enzyme family appear to follow a unimodal distribution, suggesting that their functional roles are closely related.

Illustration: Predicting functions of acetylcholinesterase

We use acetylcholinesterase to illustrate our method. Acetylcholinesterase (Enzyme Commission number E.C.3.1.1.7) is found in the synapse between nerve cells and muscle cells. It breaks down acetylcholine molecules into acetic acid and choline upon stimuli. Using a template structure (pdb 1ea5), our goal is to identify other structures that are acetylcholinesterase with the same E.C. number at the level of all four digits and to locate the surface regions that are involved in enzyme activities. E.C. numbers represent a progressively finer classification of an enzyme, with the 1st digit about the basic reaction, and the last digit often about the specific functional group that is cleaved during reaction.
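Because E.C. numbers are hierarchical, agreement between two labels can be measured by the number of leading digits they share. The short Python sketch below shows one way to do this; the function name `ec_match_level` and the treatment of `-` as an unassigned digit are choices made for this illustration.

```python
def ec_match_level(ec_a: str, ec_b: str) -> int:
    """Return how many leading E.C. digits agree (0 to 4).

    A '-' marks an unassigned digit and ends the comparison.
    """
    level = 0
    for x, y in zip(ec_a.split("."), ec_b.split(".")):
        if x == "-" or y == "-" or x != y:
            break
        level += 1
    return level


print(ec_match_level("3.1.1.7", "3.1.1.7"))   # 4: identical at all four digits
print(ec_match_level("3.1.1.7", "3.1.1.1"))   # 3: same first three digits
print(ec_match_level("2.4.1.19", "3.2.1.1"))  # 0: different basic reaction class
```

A prediction counted as correct "at the level of all four digits" corresponds to a return value of 4.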

We first exhaustively compute all pockets on the template structure²⁸,⁴⁴. Based on annotation contained in the Pdb file, a pocket containing 32 residues (CastP id 79, molecular volume of 986.3 Å³)⁴⁴ is identified as the functional pocket (Fig. 3a), which contains the Ser and His residues of the active site triad.

Fig 3 — The binding profile and function prediction of acetylcholinesterase. (a) The functional pocket (castP id = 79) on a structure of acetylcholinesterase (1ea5, E.C. 3.1.1.7). It contains 32 residues and has a molecular volume of 986.3 Å³. Two residues from the catalytic triad are shown: Ser200 (red) and His440 (blue). (b) A matched binding surface on a human protein structure (2clj, castP id = 96), with 34 residues and a molecular volume of 981 Å³. (c) The multiple alignment of several orthologous sequence fragments of residues located in the binding pockets. The two triad residues Ser200 and His440 are conserved. (d) The phylogenetic tree consisting of 17 sequences of acetylcholinesterase is used for estimating substitution rates of residues at the binding pocket. (e) The structure 1ea5 is predicted to be an acetylcholinesterase, as indicated by the computed binding profile (GO_a ≃ E.C. 3.1.1.7, with its π₁ ≈ 0.99).

To construct an evolutionary model of this functional pocket, we collected a set of 17 sequences homologous to 1ea5⁴⁵ and built a phylogenetic tree (Fig. 3d)⁴⁶. The residue substitution rates on this binding surface are estimated, and scoring matrices for similarity assessment are then calculated (see Methods and reference⁴¹). We use the pvSOAR search method with these scoring matrices to search the CastP database of computed surface pockets for all PDB structures (> 30,000 structures, with > 2 million surface pockets), and declare that two proteins have the same function when, in addition, the RMSD value of their binding pocket residues is significant at a p-value of 10^-4 (see Methods section). A total of 70 Pdb structures are found to have functional surfaces similar to that of the query template 1ea5, and hence are predicted to be acetylcholinesterases. Indeed, all of them have the same E.C.3.1.1.7 label as 1ea5. The query protein and an example of a matched protein surface are shown in Fig. 3a and 3b, respectively. There are 71 Pdb entries with enzyme class label E.C.3.1.1.7 in the Enzyme Structures Database (www.ebi.ac.uk/thornton-srv, for structures of enzymes contained in the Enzyme databank)⁴². Our method successfully identified 70 of them.
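The final filtering step and the resulting recovery rate can be sketched in a few lines of Python. This is only an illustration: the candidate identifiers and their p-values are invented, the use of `<=` at the cutoff is an assumption, and the earlier sequence-fragment matching stage of pvSOAR is not modeled.

```python
RMSD_P_CUTOFF = 1e-4  # significance threshold quoted in the text


def call_hits(candidates, cutoff=RMSD_P_CUTOFF):
    """Return ids of candidate pockets whose RMSD p-value passes the cutoff."""
    return [pocket_id for pocket_id, p_value in candidates if p_value <= cutoff]


def recall(found: int, known: int) -> float:
    """Fraction of known family members that were recovered."""
    return found / known


# Invented candidates: (pocket id, p-value of the RMSD of pocket residues)
candidates = [("cand_A", 3e-6), ("cand_B", 8e-5), ("cand_C", 2e-3)]
hits = call_hits(candidates)
print(hits)  # ['cand_A', 'cand_B']

# 70 of the 71 entries labeled E.C. 3.1.1.7 were identified
ache_recall = recall(70, 71)
print(f"{ache_recall:.1%}")  # 98.6%
```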

Illustration: Predicting functions of deformylase

An alternative to using E.C. numbers for describing protein function is to use the hierarchical terms developed by the Gene Ontology consortium, where the biological role of a protein is described by terms for biochemical functions, cellular components, and biological processes. Following the same strategy as for acetylcholinesterase, we use a structure (pdb 1lm6) of deformylase from Streptococcus pneumoniae as a template and search for other protein structures with similar binding surfaces.

We evaluate the results using the three GO terms associated with the query protein structure. 1lm6 has two GO terms for biochemical functions (0005506, iron binding; 0042586, peptide deformylase activity), and one GO term for biological process (0006412, translation). With scoring matrices derived from the substitution rates of residues located on the binding pocket and a significance p-value threshold of RMSD values at 10^-4, a total of 94 protein chains are found to have functional surfaces similar to that of 1lm6. Among these, 50 chains (53%) share all three GO terms with 1lm6, and 40 (43%) have no GO annotations. The remaining 4 (4%) are found to have GO terms different from those of 1lm6, and therefore can be considered incorrect predictions (false positives). Overall, the prediction accuracy among proteins with known GO annotation is 93%. If we make the speculative but reasonable simple assumption that the 40 chains with unknown GO descriptive terms are sampled from the same distribution as the 54 chains with known GO terms, it is expected that the functions of 38 or so will be predicted correctly, and only 3 would be false positives.
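The arithmetic behind these figures can be checked with a short Python sketch. It uses only the counts quoted above and the stated assumption that unannotated chains behave like annotated ones; the variable names are chosen for this illustration.

```python
shared_all_terms = 50  # chains sharing all three GO terms with 1lm6
different_terms = 4    # chains with different GO terms (false positives)
unannotated = 40       # chains without GO annotation

annotated = shared_all_terms + different_terms  # 54 chains with known GO terms
accuracy = shared_all_terms / annotated         # about 0.926, i.e. 93% when rounded

# Extrapolation, assuming unannotated chains follow the same distribution
expected_correct = unannotated * accuracy
expected_false = unannotated - expected_correct

print(f"accuracy among annotated chains: {accuracy:.1%}")
print(f"expected correct: {expected_correct:.1f}, expected false positives: {expected_false:.1f}")
```

Under this assumption the extrapolation gives roughly 37 correct predictions and about 3 false positives among the 40 unannotated chains.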

Some of the predictions would have eluded sequence alignment methods. Among the 50 chains correctly predicted to have functions similar to that of 1lm6, 12 chains from 10 pdb structures have sequence identities > 60% with the query protein, and these would have been predicted by the sequence alignment method following the recommendation of references¹,². However, the remaining 38 chains have sequence identities < 60% (24 of which are < 50%), and their functions would be difficult to predict using the sequence alignment method. Overall, among the 94 chains where predictions are made, 32 have sequence identities > 60% with the query protein, and 62 have sequence identities < 60% (30 of which are < 50%).

Large scale enzyme function prediction

To assess the overall applicability of our method, we have carried out a large-scale study of protein function prediction using enzymes. Enzymes are among the best characterized proteins in the Pdb, and are an important class of proteins. Among >30,000 Pdb structures (version 2006/12), there are 13,877 protein structures that are annotated as enzymes and have enzyme commission (E.C.) labels. In many cases, there is no information about where the active region is located on the structure and what the important residues are.

We obtain a database of computed protein surfaces on all Pdb structures by selecting from the CastP database only surface pockets that contain 8 or more residues⁴⁴. A total of 770,466 local surface pockets are collected from 1,260 enzyme families. We then randomly select 100 enzyme families, each represented by a different E.C. label, with the criterion that there are ≥ 10 structures in each enzyme family. Altogether there are 2,196 structures in these 100 protein families. For each protein family, we take the structure with the best resolution and R-factor, and define the surface pocket containing key residues as annotated either in the Pdb records or in the feature tables of SwissProt as the canonical template of the functional surfaces of this enzyme family. We then derive a substitution rate matrix for this canonical template surface using the Bayesian Monte Carlo estimator⁴¹.
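The selection criteria in this paragraph can be sketched in Python as follows. This is an illustrative outline under stated assumptions, not the pipeline used in the study: the data layout, the function names, the fixed random seed, and the rule of ranking by resolution first and R-factor second are all choices made for this example.

```python
import random
from collections import defaultdict

MIN_POCKET_RESIDUES = 8  # pockets with fewer wall residues are dropped
MIN_STRUCTURES = 10      # families need at least this many structures


def keep_pocket(wall_residues) -> bool:
    """Keep only surface pockets with 8 or more wall residues."""
    return len(wall_residues) >= MIN_POCKET_RESIDUES


def eligible_families(structures):
    """structures: iterable of (pdb_id, ec_label) pairs."""
    by_family = defaultdict(set)
    for pdb_id, ec_label in structures:
        by_family[ec_label].add(pdb_id)
    return sorted(ec for ec, ids in by_family.items() if len(ids) >= MIN_STRUCTURES)


def sample_families(families, k=100, seed=0):
    """Randomly select up to k families (seed fixed for reproducibility)."""
    rng = random.Random(seed)
    return rng.sample(families, min(k, len(families)))


def pick_template(entries):
    """entries: (pdb_id, resolution, r_factor); lower values are better.

    Ranking by resolution first, then R-factor, is an assumption.
    """
    return min(entries, key=lambda e: (e[1], e[2]))[0]


# Toy data with placeholder ids: one family with 12 structures, one with 3
toy = [(f"a{i:03d}", "3.1.1.7") for i in range(12)]
toy += [(f"b{i:03d}", "1.8.1.9") for i in range(3)]
families = eligible_families(toy)
chosen = sample_families(families)
template = pick_template([("a000", 2.1, 0.21), ("a001", 1.8, 0.19), ("a002", 1.8, 0.22)])
print(families, chosen, template)
```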

Using customized similarity matrices derived from the estimated rate matrix, we then take each of the 100 template surfaces in turn and query exhaustively against all 770,466 surfaces in the database. For each matched surface from the 770,466 surfaces, if its cRMSD to the query canonical template surface is smaller than the threshold at the significance level of a cut-off p-value, we declare that a hit is found. This threshold is obtained as in reference³¹. We repeat this process for all 100 surface templates of the protein families. After collecting the list of hits for each of the 100 protein families, we identify the correctly predicted protein structures by comparing the E.C. labels of the hit structure and the template structure. The prediction is correct if all four digits of the two E.C. labels are identical. The results are summarized in the Receiver Operating Characteristic (ROC) curve shown in Fig 4. This is obtained by calculating the overall sensitivity and specificity of predictions for all 100 protein families at different significance p-values by cRMSD. That is, they are calculated based on the number of true positives and false negatives (for sensitivity), and the number of true negatives and false positives (for specificity) found from searches for each template of the 100 protein families against the whole set of 770,466 local surfaces from 1,260 enzyme families. Here an exact match of all 4 digits of the E.C. numbers is required for true positives. At the significance level of p = 10^-3, the specificity of predictions of the functions of all 2,196 structures from the 100 protein families is 99.98% at all 4 digits of the E.C. labels, and the sensitivity is 80.55%. The Matthews coefficient, another measure for evaluating classification quality⁴⁷, is 82.09% at this p-value. The best Matthews coefficient is 86.6% at the p-value of 10^-1. The overall area under the ROC curve is 0.955, close to the perfect value of 1.0.

Fig 4 — Results of a large scale test of protein function prediction for 100 protein families. For a declared hit of a matched surface, if it comes from a protein structure with the same Enzyme Commission (E.C.) number (up to the 4th digit) as that of the query protein, the prediction is regarded as correct. Results are summarized in the Receiver Operating Characteristic (ROC) curve, where the x-axis represents the false positive rate at different statistical significance p-values of the cRMSD measurement. Here the false positive rate is 1 - specificity, namely, 1 - TN/(TN + FP), where TN is the number of true negatives and FP the number of false positives. The y-axis represents the true positive rate or sensitivity, defined as TP/(TP + FN), where TP is the number of true positives and FN is the number of false negatives. An overall performance measure is the area under the ROC curve, which is 95.5%. At the confidence level of cRMSD p = 10^-3, the average specificity of predictions of the functions of all 2,196 proteins in these 100 protein families is 99.98%, and the average sensitivity is 80.55%. The Matthews coefficient⁴⁷ is also plotted in the inset figure.
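The performance measures used here can be written out in a short Python sketch. Sensitivity and specificity follow the definitions in the caption of Fig 4, the Matthews coefficient uses its standard formula, and the area under the ROC curve is approximated with the trapezoidal rule, which is an assumption about how the area might be computed. The confusion counts in the example are invented for illustration and are not results from this study.

```python
from math import sqrt


def sensitivity(tp: int, fn: int) -> float:
    """True positive rate: TP / (TP + FN)."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """True negative rate: TN / (TN + FP)."""
    return tn / (tn + fp)


def matthews(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient; returns 0.0 if undefined."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def roc_auc(points) -> float:
    """Trapezoidal area under (false positive rate, true positive rate) points."""
    pts = sorted(points)
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(pts, pts[1:]))


# Invented confusion counts at a single p-value cutoff
tp, fn, tn, fp = 80, 20, 9990, 10
print(sensitivity(tp, fn), specificity(tn, fp), matthews(tp, tn, fp, fn))

# One ROC point per cutoff, plus the (0, 0) and (1, 1) end points
curve = [(0.0, 0.0), (1 - specificity(tn, fp), sensitivity(tp, fn)), (1.0, 1.0)]
print(roc_auc(curve))
```

Sweeping the p-value cutoff and recomputing the counts at each value yields the full set of points from which the curve and its area are obtained.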

Consistent with what we find for the set of 2,196 protein structures, there are 1,394 instances of proteins with overall backbone sequence identity below 50%, where sequence identity is measured between a query protein and its hit. Of these, 1,058 and 608 have sequence identity below 40% and 30%, respectively. This indicates that accurately predicting the functions of these 100 protein families is challenging, as 63.4% of the proteins have below 50% sequence identity.

Predicting binding activities and profiling protein functions

The computed binding profile is a probabilistic model that can be used to identify substrates and to predict enzyme specificity. It is derived from the results of searching a template surface against a large library of protein surfaces. When using E.C. numbers, the binding profile contains a varying number of E.C. labels, each with an associated probability value π_i for the i-th label, which is interpreted as the likelihood of binding the same substrate as enzymes of that E.C. label. We can infer that the biochemical functions of certain enzymes are likely to be highly specific, namely, they act mostly on only one type of substrate and therefore may catalyze a very specific biochemical reaction. The computed binding profile of such enzymes contains only one E.C. label with a high probability value π_i.
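To give a sense of the shape of such a profile, the Python sketch below turns a list of E.C. labels of matched surfaces into label probabilities by simple frequency normalization. This is a deliberately simplified stand-in and not the probabilistic model of this study, which may weight hits differently; the function name and the hit list are hypothetical.

```python
from collections import Counter


def frequency_profile(hit_labels):
    """Map each E.C. label to its share of the hits (values sum to 1)."""
    counts = Counter(hit_labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.most_common()}


# Hypothetical E.C. labels of ten surfaces matched by one template
hit_labels = ["2.4.1.19"] * 7 + ["3.2.1.1"] * 2 + ["3.2.1.133"]
profile = frequency_profile(hit_labels)
print(profile)
```

A profile with a single label near 1 suggests a highly specific enzyme, while several labels with appreciable values suggest possible cross-reactivity.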

As an example, flavoenzyme (structure 1trb) from E. coli belongs to a subclass of oxidoreductases. The computed binding profile of flavoenzyme indicates that this protein is a thioredoxin-disulfide reductase (E.C. number 1.8.1.9) with high specificity, at π₁ ≈ 1.00 (Fig 5a).

Fig 5 — Assessing enzyme specificity and promiscuity from computed binding profiles. (a) Flavoenzyme (1trb) from *E. coli* and its computed binding profile. This protein belongs to a subclass of oxidoreductases and possesses the activity of thioredoxin-disulfide reductase. The computed binding profile gives the correct E.C. label (E.C. 1.8.1.9). It also suggests that this enzyme is highly specific (π₁ ≈ 1.00). (b) The computed binding profile of cyclodextrin glycosyltransferase using the template 1d3c. It indicates that this enzyme is promiscuous and has cross-reactivities. It has the enzyme activity of cyclodextrin glycosyltransferase (*E.C._a* = 2.4.1.19) at *π_a* ≈ 0.77, and may also bind and hence catalyze like an alpha-amylase (*E.C._b* = 3.2.1.1) at *π_b* ≈ 0.22. The computed binding profile also suggests trace amounts of other related biochemical activities (E.C. 3.2.1.135, 3.2.1.133 and 3.2.1.98).

Our method can also identify enzymes that act on multiple substrates and hence can predict possible cross-reactivities. Cyclodextrin glycosyltransferase degrades starch to cyclodextrins (circular (1,4)-linked glucoses) through cyclization of 1,4-alpha-D-glucan⁴⁸. This enzyme is also closely related to alpha-amylases and can act on glycogen, related polysaccharides, and oligosaccharides. The predicted binding profile suggests that the functional surface on the structure of 1d3c (CastP 78) from B. circulans acts like a cyclodextrin glycosyltransferase (E.C.2.4.1.19, the correct label) with probability π₁ ≈ 0.77 (Fig 5b). It also correctly indicates that this enzyme may bind and hence catalyze like an alpha-amylase to a lesser extent (E.C. 3.2.1.1, with a probability π₂ ≈ 0.22).
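Reading specificity or promiscuity off a computed binding profile can be expressed as a small Python sketch. The profile values are those quoted for 1trb and 1d3c; the cutoff of 0.05 for calling a secondary activity, and the function name `summarize_profile`, are assumptions made for this illustration and are not part of the published method. Trace activities without a reported probability are left out.

```python
def summarize_profile(profile, minor_cutoff=0.05):
    """Report the top E.C. label and any secondary labels above a cutoff."""
    ranked = sorted(profile.items(), key=lambda item: item[1], reverse=True)
    top_label, top_pi = ranked[0]
    secondary = [(label, pi) for label, pi in ranked[1:] if pi >= minor_cutoff]
    return {
        "top": top_label,
        "top_pi": top_pi,
        "secondary": secondary,
        "promiscuous": bool(secondary),
    }


# Probabilities quoted in the text for the two templates
profile_1trb = {"1.8.1.9": 1.00}
profile_1d3c = {"2.4.1.19": 0.77, "3.2.1.1": 0.22}

print(summarize_profile(profile_1trb))
print(summarize_profile(profile_1d3c))
```

With this cutoff, 1trb is reported as specific for E.C. 1.8.1.9, while 1d3c is flagged as promiscuous with E.C. 3.2.1.1 as a secondary activity.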

Predicting biochemical function of orphan protein structures: challenging examples from structural genomics

Orphan protein structures obtained from structural genomics have unknown biochemical functions. It is challenging to predict their functions. Several recent studies addressed this issue and reported success in computational prediction of functions of orphan proteins²⁴,²⁵.

BioH

The BioH protein from E. coli has unknown biological functions, but is conjectured to be involved in biotin biosynthesis⁴⁹. Inferring the functional roles of BioH is challenging, because all structural homologs have ≤ 20% sequence identity, and some sequence homologs with between 30% and 90% sequence identity are hypothetical proteins. Using a phylogenetic tree of 28 related sequences (Fig. 6), we estimated the substitution rates of residues on the predicted binding pocket (the union of pockets with castP id 28, 35 and 40, containing 35 residues with a molecular volume of 500.2 Å³), which contains the suspected triad residues (Fig. 6). Since orphan protein structures such as BioH have no related known structures, we use the oRMSD measure developed in reference³¹ instead of the cRMSD measure for shape similarity.

Fig 6 — Predicting functions of BioH obtained from structural genomics. (a) The structure of BioH (1m33), with the putative binding pocket shown. The catalytic residues (Ser82, Asp207 and His235) are located in the candidate binding pocket. (b) A similar functional surface detected from carboxylic ester hydrolases (1w76, castP id = 128, E.C. 3.1.1.7), with full sequence identity of only ≤ 20%. (c) The phylogenetic tree of 28 sequences related to BioH. Some are hypothetical proteins. (d) The computed binding profile of BioH.

The computed binding profile suggests that BioH is most likely related to a carboxylic ester hydrolase (E.C. 3.1.1.-), and more specifically, it may react as an acetylcholinesterase (E.C. 3.1.1.7, Pdb 1w76, π₁ ≈ 1.0). BioH was tested independently for 12 different enzyme activities with E.C. numbers different from our predictions, but the highest activity was found to be that of a carboxylic esterase (E.C. 3.1.1.1), which has the same first 3 digits as our prediction (E.C. 3.1.1.-)⁵⁰. Work by Sanishvili et al. also reported predictions of the functional roles of BioH, where BioH was predicted to possess lipase, protease, or esterase activities, with additional structural features suggesting possible roles as acyltransferases and thioesterases⁵⁰.

1u9d

The structure of a hypothetical protein (Fig. 7a) from Vibrio cholerae (pdb 1u9d) was solved by Binkowski et al. at the Midwest Center for Structural Genomics at Argonne National Laboratory. None of the sequence-based methods (e.g., Blast and Pfam), structural alignment methods (e.g., ce, dali and 3dpssm), structural classification systems (e.g., Scop and Cath), or the Gene Ontology database provides any information about the functional roles of this protein. All of the significant hits obtained by these comparison methods are hypothetical proteins with unknown biological functions. It is very challenging to predict the functions of this protein.

Using a method based on properties of shape and chemical texture of protein surfaces, we first identified the putative functional pocket, which is located in the homodimer interface⁵¹. This pocket is used as a template to search for similar surfaces in the database. Our results (Fig. 7c) show that 1u9d is likely to be related to phosphotransferases (with the E.C. label starting with 2.7.), at a probability of π₁ ≈ 0.95. Because the oRMSD measure is less specific than the cRMSD measure, we conservatively estimate that 1u9d has a function similar to that of the hit protein up to two E.C. digits. The other hit of 1u9d is a choline kinase (π₂ = 0.02), which is also a member of the phosphotransferases. In addition, 1u9d may also have traces of activity as a carbon-carbon lyase (E.C. 4.1.-.-). Our computed binding profile suggests a limited number of biochemical assays, which can be carried out to further determine the functional profile of 1u9d.

3 Discussion

In this study, we have significantly improved the pvSOAR method for predicting protein functions by incorporating evolutionary information specific to individual binding surfaces. This can be illustrated by the example of alpha-amylase from B. subtilis (1bag, castP id = 60). Using an updated database, our current method pevoSOAR correctly identified 131 structures as alpha-amylase, while the original pvSOAR method correctly predicted 116 structures. The additional 15 structures predicted by pevoSOAR are more challenging. They are more distantly related to the query protein, as their pairwise backbone sequence identities with 1bag are all less than 25%, with only one exception at 27%. In addition, our method can predict the profile of protein binding activities, which may involve multiple substrates or ligands. It can be used to predict protein functions, to identify potential substrates, and to assess binding specificity.

Comparison with other methods

Although sequence based methods such as psi-blast will often find many proteins homologous to a query protein, they require significant overall sequence identity (>60-70%) for confident prediction of protein functions¹,², and they do not identify the regions or residues that are functionally important. Our approach takes advantage of structural information, can directly identify functionally important local surface regions, and can confidently predict functions of proteins with low sequence identities. For example, several structures we found using the acetylcholinesterase template 1ea5 have low sequence identities with the query template but high local surface sequence identities (e.g., 1qo9, with 38% full-length and 60% functional surface identities with 1ea5 in Fig. 3).

Our pevoSOAR method shares some similarities with several recent works. The method of reference²⁴ is most similar to ours in that it uses manually constructed as well as automatically generated local 3D templates to assess the similarity in local structure for inferring protein functions. Although an exact direct comparison is difficult, as the underlying data sets and the methodologies are different, each of the two studies involves about 100 different protein families. There is an important difference in the criteria of prediction evaluation. In our study, the assignment of enzyme functions needs to be identical at all 4 digit levels of the E.C. labels, whereas the study of reference²⁴ is about prediction of the correct Cath domain labels. Although not perfect, E.C. numbers are directly related to biochemical reactions, whereas the same Cath domain classification label does not necessarily guarantee the same protein function⁵². For example, aldehyde reductase (1ads, E.C.1.1.1.21, Cath fold 3.20.20.100) has a fold structure very similar to that of phosphotriesterase (1dpm, E.C.3.1.8.1, Cath fold 3.20.20.140), yet their functions are quite different. On the other hand, aspartate aminotransferase (1yaa, E.C.2.6.1.1) has a function similar to that of D-amino acid aminotransferase (3daa, E.C.2.6.1.21), but they belong to different folds (Cath 3.90.1150.10, 3.40.640.10; Cath 3.30.470.10, 3.20.10.10, respectively). It is well known that proteins of the same Scop fold and Cath domain may have acquired different functions during evolution⁵³⁻⁵⁶.

With this difference in evaluation criteria, our results compare favorably with those of reference²⁴: the area under the ROC curve in Fig. 4 is 95.5%, compared with 82% in Fig. 4 of reference²⁴. We therefore conclude that our method can predict enzymatic functions with high accuracy.
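The 4-digit evaluation criterion described above can be made concrete with a small sketch. The helper names `ec_digits` and `ec_agree` are hypothetical and are not part of pevoSOAR; treating an unassigned field (`-`) as a non-match is an assumption made here for illustration.

```python
def ec_digits(label):
    """Split an E.C. label such as 'E.C.2.6.1.1' or '4.1.-.-' into its fields."""
    text = label.strip()
    if text.upper().startswith("E.C."):
        text = text[4:]
    return text.split(".")


def ec_agree(label_a, label_b, digits=4):
    """Return True if two E.C. labels agree on their first `digits` fields.

    Assumption: an unassigned field ('-') never counts as agreement.
    """
    a, b = ec_digits(label_a), ec_digits(label_b)
    if len(a) < digits or len(b) < digits:
        return False
    return all(x == y and x != "-" for x, y in zip(a[:digits], b[:digits]))
```

With the labels quoted in the text, `ec_agree("E.C.2.6.1.1", "E.C.2.6.1.21")` is false at the 4-digit level but true with `digits=3`, whereas `ec_agree("E.C.1.1.1.21", "E.C.3.1.8.1", digits=1)` is false even at the first digit.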

Challenges in assessing local similarity

Although the idea of inferring protein functions by assessing similarity of local spatial patterns is appealing⁵⁷, there are significant challenges. First, it is difficult to identify the small number of residues that are most informative of the function of a protein. Second, because the number of selected residues is small, it is difficult to extract evolutionary information, as the pattern of conservation is harder to detect from a smaller amount of data.

The Catalytic Site Atlas project addresses the problem of identifying key residues by painstakingly constructing a library of 3D templates of key residues important for enzyme functions. These residues are selected manually from the literature and from structural analysis⁵⁸. It is an important resource for studying enzyme function.

A difference between our method and those based on manually constructed 3D functional templates is that our method is fully automated. Because surface pockets are computed automatically, there is no need to manually construct 3D templates. The only requirement for our method is the knowledge that a specific computed surface pocket contains functionally important residues. The identification of such pockets can be obtained from information in annotation, or can be the outcome of a functional site prediction method⁵¹.

Our method also differs from several other methods based on automatically generated 3D templates. The size of the surfaces in our method can be small or large, depending on the geometry of the binding pocket, whereas methods based on 3D templates are often limited in the number of residues that can be included (e.g., a few residues)²⁴.

For uncovering evolutionary patterns from a relatively small number of residues, we have shown that the Bayesian Monte Carlo method we developed works well. By explicitly constructing a phylogenetic tree, by using a continuous-time Markov process to describe the evolutionary process, and by using a Bayesian framework and a Markov chain Monte Carlo estimator⁴¹, we showed that evolutionary information specifically relevant to binding surface residues and unaltered by other constraints such as folding stability can be obtained. We believe this approach is generally applicable to problems of assessing evolutionary patterns of small regions. It also allows estimation of selection pressure due to protein function that is unaltered by selection pressure due to protein folding.

The role of hypothetical protein sequences

A limitation of our method is that we require knowledge of the structure of the protein whose function is to be predicted. However, once the structure of one protein is known, sequences with unknown structures and unknown functions (e.g., hypothetical proteins obtained from genome sequencing projects) that can be aligned to the sequence of the known structure become an important source of information about the evolution of the protein functional surface. After the surface sequence fragment of binding site residues is extracted by geometric computation, our method does not require the availability of any other protein structures. Sequences that are used to construct the substitution rate matrix can all be of unknown structure or unknown function. As an example, several sequences contained in the phylogenetic tree in Fig. 6 are hypothetical proteins with unknown structures and unknown functions (e.g., NP_871588), but they provide critical evolutionary information for predicting protein function.

Characterizing complex protein functions

In the large-scale study, we used the E.C. label of the highest probability as the predicted enzyme function. Although enzymes are often characterized well by E.C. labels, there are several reasons why additional characterizations are important. First, protein structures may have mislabeled E.C. numbers, e.g., a domain is assigned the E.C. number of a different domain simply because they belong to the same peptide chain. Second, for many proteins, E.C. labels do not provide accurate information about the biochemical reactions: an enzyme may be able to react with multiple substrates. Such complex activities cannot be easily characterized. Third, knowledge of the E.C. label per se does not imply knowledge of the location of the active site or binding surface, nor of the identities of the key residues. The computed binding profile generated by our method provides a more realistic picture of protein activities than a single functional label, as shown in the examples of phosphoglycerate mutase, which is very specific, and cyclodextrin glycosyltransferase, which has broader cross-reactivities. The matched surface helps to locate residues important for binding and for function.

General applicability for enzyme characterization

Here we estimate the number of enzyme proteins in the Pdb databank for which our method can provide useful functional information. Among ca. 30,000 structures in the Pdb (v2006/12), we found that there are 13,877 protein structures annotated as enzymes with enzyme commission (E.C.) numbers assigned. We then selected surface pockets on the enzymes in the CastP database that contain residues annotated as functionally important either in a Pdb record or in the feature table of SwissProt. Altogether we found 3,275 enzyme structures whose surface pockets contain annotated functional residues.

For estimation, we use Blosum50 as a crude scoring matrix that does not accurately reflect the bias of residue composition in functional surfaces. This canned matrix does not account for the specific evolutionary history of an individual protein or an individual local surface. After clustering the 3,275 enzyme structures in the Pdb by E.C. labels, we obtained 343 clusters. We then selected the representative structure in each cluster by the criteria of best resolution and R-factor. Using the surface matching method with this canned matrix, we queried each of the 343 representative proteins against the surfaces in the CastP database, which contains > 30,000 proteins, and identified a total of ≈ 11,000 protein structures as hits, namely, proteins satisfying the stringent confidence criterion of p-value < 0.001 for the coordinate RMSD of aligned surfaces. The study with 100 protein families reported above shows that matching enzyme surfaces at this p-value threshold gives few false positive predictions.

Based on preliminary studies of alpha amylase and other enzymes reported in reference⁴¹, the number of proteins with related functions that can be established with confidence will be increased conservatively by a factor of about 3.0-3.4 when the evolutionary history of the functional surface is analyzed and the binding surface specific rate matrix is used. A rough estimate is that our method will characterize the functional surfaces of about 9,800-11,000 structures among the 13,877 known enzyme structures, i.e., about 70%-80% of the Pdb structures known to be enzymes. After removing mislabeled, incorrectly assigned, and low quality enzyme structures, it is likely that the percentage of enzyme structures whose functions our method will help to characterize will further increase. The binding surfaces of these proteins will also be identified. This represents a significant portion of all known enzyme structures.
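As a check on the arithmetic of this rough estimate, the quoted percentages follow directly from the quoted counts. The second line is only one possible reading of where the range comes from, namely that the factor of 3.0-3.4 is applied to the 3,275 annotated enzyme structures; the text does not state this explicitly.

```latex
\frac{9{,}800}{13{,}877} \approx 0.706, \qquad \frac{11{,}000}{13{,}877} \approx 0.793

3{,}275 \times 3.0 = 9{,}825, \qquad 3{,}275 \times 3.4 = 11{,}135
```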

Our pevoSOAR method is based on comparing similarity of protein surfaces. It builds upon three techniques: first, we use geometric algorithms to quantify accurately protein local surfaces²⁸,⁵⁹; second, we use a Bayesian Monte Carlo estimator to characterize the evolutionary history specifically for a local surface⁴¹; third, we compare surfaces by assessing evolutionary similarity of residues on local surfaces⁴¹, similarity in shape, and in orientation³¹. In principle, our method of function characterization by matching protein surfaces is general and can be applied to protein functions other than enzyme activities such as protein-protein interactions. In this case, a prerequisite is the ability to generate a library of surface patches that represent the interfaces of protein-protein interaction accurately.

4 Methods and Designs

Estimating substitution rates

The success in rapid detection of functionally related protein surfaces through the alignment of sequence fragments of binding surface residues⁶⁰ depends on the use of a scoring matrix, which determines the similarity between residues. The instantaneous rate matrix of amino acid residue substitution is the basis for developing such scoring matrices. We use a reversible continuous time Markov process as our evolutionary model³⁹,⁶¹⁻⁶³. Details of the Bayesian estimator based on the technique of Markov chain Monte Carlo, including the construction of the phylogenetic tree, have been described in reference⁴¹.

Scoring matrices of similarity for surfaces at different evolutionary time intervals

To derive scoring matrices for assessing functional similarity between two surfaces and for database search, we calculate from the rate matrix the residue similarity scores b_ij(t) between residues i and j at evolutionary time t, using the Altschul model⁶⁴:

b_{ij}(t) = \frac{1}{\lambda} \log \frac{p_{ij}(t)}{\pi_{j}} = \frac{1}{\lambda} \log \frac{m_{ij}(t)}{\pi_{i} \pi_{j}},

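where m_ij(t) = π_i p_ij(t) is the joint probability of observing residue types i and j in the two sequences separated by time t, π_i and π_j are the equilibrium frequencies of residue types i and j, and λ is a scalar. Here p_ij(t), the probability that residue type i is replaced by residue type j over time t, can be computed from the instantaneous rate matrix⁴¹.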
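The score formula above can be sketched numerically. The sketch assumes the standard relation P(t) = exp(Qt) between a rate matrix Q and the transition probabilities of a continuous time Markov process, and it uses a made-up 3-state reversible rate matrix rather than the 20 × 20 matrix estimated in reference⁴¹. The function names, the toy frequencies, and the truncated-series matrix exponential are all illustrative assumptions, and the toy matrix is not normalized to the time unit used in the paper.

```python
import math


def toy_rate_matrix(pi, s):
    """Build a reversible rate matrix q_ij = s_ij * pi_j (i != j) from
    symmetric exchangeabilities s; each diagonal makes its row sum to zero."""
    n = len(pi)
    q = [[s[i][j] * pi[j] if i != j else 0.0 for j in range(n)] for i in range(n)]
    for i in range(n):
        q[i][i] = -sum(q[i])
    return q


def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(q, t, squarings=10, terms=12):
    """Approximate exp(q * t) with a truncated Taylor series plus
    scaling and squaring (adequate for small, well-scaled matrices)."""
    n = len(q)
    scale = t / (2 ** squarings)
    a = [[q[i][j] * scale for j in range(n)] for i in range(n)]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[x / k for x in row] for row in mat_mul(term, a)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


def score_matrix(q, pi, t, lam=1.0):
    """Similarity scores b_ij(t) = (1 / lam) * log(p_ij(t) / pi_j)."""
    p = mat_exp(q, t)
    n = len(pi)
    return [[math.log(p[i][j] / pi[j]) / lam for j in range(n)] for i in range(n)]
```

For a reversible process the resulting score matrix is symmetric, because π_i p_ij(t) = π_j p_ji(t). At short time separations identities score positively and substitutions negatively, and all scores approach zero as t grows and p_ij(t) approaches π_j.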

Matching local surfaces

Because we do not know a priori how far a particular candidate protein is separated in evolutionary time from the query template protein, we calculate a series of 300 scoring matrices, each characterizing the residue substitution pattern at a different time separation, ranging from 1 time unit to 300 time units. Here 1 time unit represents the time required for 1 substitution per 100 residues⁶⁵. We use the Smith-Waterman algorithm as implemented in the Ssearch program with each of the 300 scoring matrices to align sequence fragments of candidate binding surfaces against the database of sequence fragments of protein surface pockets derived from the CastP database⁴⁴.

In addition, surfaces matched by sequence fragment similarity are subject to further shape analysis. We compare surfaces by the coordinate RMSD (cRMSD) value or, when specified, by the orientational RMSD (oRMSD) value we developed in reference³¹. Surfaces that can be superimposed onto the residues of the query surface at a statistically significant level (e.g., p-value < 0.001 by the coordinate RMSD measure) are declared hits³¹,⁴¹. The p-values for cRMSD and oRMSD are estimated through extensive randomization simulations as described in reference³¹.
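The idea of estimating a p-value from randomization can be illustrated with a generic empirical estimator. This is a minimal sketch and not the procedure of reference³¹: it assumes that a list of RMSD values from random surface matches has already been generated, and the function name `empirical_p_value` and the (r + 1)/(n + 1) estimator are choices made here for illustration.

```python
def empirical_p_value(observed_rmsd, null_rmsds):
    """Estimate the p-value of an observed RMSD from randomized matches.

    A smaller RMSD means a better superposition, so the p-value is the
    fraction of null values that are at least as small as the observed one.
    The (r + 1) / (n + 1) form keeps the estimate strictly above zero.
    """
    r = sum(1 for value in null_rmsds if value <= observed_rmsd)
    return (r + 1) / (len(null_rmsds) + 1)
```

With this estimator, reaching a threshold such as p-value < 0.001 requires at least 1,000 null samples, since the smallest attainable estimate is 1/(n + 1).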

Probabilistic model for profiling protein binding activities

We introduce a probabilistic model, the computed binding profile, for characterizing specific binding activities and for inferring protein functions. We use each of the 300 scoring matrices, representing time intervals from 1 unit to 300 units, to search the surface database in turn. Assuming each time interval is equally likely, the probability of a query protein belonging to the i-th E.C. label is calculated as:

\pi_{i} = \frac{\sum_{t} \#\mathrm{E.C.}_{i}(t)}{\sum_{t} N(t)}, \qquad (1)

where #E.C._i(t) is the number of Pdb hits belonging to the i-th E.C. label using the matrix of time distance t, and N(t) is the total number of Pdb hits with a known E.C. number using the matrix of time distance t. When a protein has a number of different hits with different E.C. labels and associated probability values, this set of E.C. labels and the corresponding π_i values provide a computed binding profile that helps to characterize the potentially complex binding activities of a protein.
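Equation (1) amounts to pooling the hits obtained with all scoring matrices and taking the fraction that carries each E.C. label. A minimal sketch follows; the function name `binding_profile` and the input layout (a mapping from time distance t to the E.C. labels of the hits found with that matrix) are assumptions for illustration, not the pevoSOAR implementation.

```python
from collections import Counter


def binding_profile(hits_by_time):
    """Compute pi_i = sum_t #E.C._i(t) / sum_t N(t).

    hits_by_time maps each time distance t to the list of E.C. labels of the
    hits (with a known E.C. number) found using the scoring matrix for t.
    Returns a dict mapping each E.C. label to its probability pi_i.
    """
    counts = Counter()
    total = 0
    for labels in hits_by_time.values():
        counts.update(labels)
        total += len(labels)
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.items()}
```

For example, with invented hits `{1: ["2.7.-.-", "2.7.-.-"], 2: ["2.7.-.-", "4.1.-.-"]}` the profile assigns 0.75 to `2.7.-.-` and 0.25 to `4.1.-.-`. Because hits are pooled before dividing, time distances that return more hits contribute more to the profile, as in Equation (1).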

5 Acknowledgment

We thank Dr. Andrew Binkowski for the previous implementation of pvSOAR, and Drs. Andrew Binkowski and Andrzej Joachimiak for suggesting BioH and 1u9d for this study. This work is supported by grants from NSF (DBI-0646035 and DMS-0800257), NIH (GM079804-01A1 and GM081682), and ONR (N000140310329).

Abbreviations

CASTP: Computed Atlas of Surface Topography of Proteins
ROC: Receiver Operating Characteristic Curve
MCC: Matthews Correlation Coefficient
E.C. number: Enzyme Commission number
RMSD: Root-Mean-Square Deviation

References

1. Rost B. Enzyme function less conserved than anticipated. J Mol Biol. 2002;318(2):595–608. doi: 10.1016/S0022-2836(02)00016-5.
2. Tian W, Skolnick J. How well is enzyme function conserved as a function of pairwise sequence identity? J Mol Biol. 2003;333(4):863–82. doi: 10.1016/j.jmb.2003.08.057.
3. Hannenhalli S, Russell R. Analysis and prediction of functional sub-types from protein sequence alignments. J Mol Biol. 2000;303(1):61–76. doi: 10.1006/jmbi.2000.4036.
4. Jensen L, Gupta R, Staerfeldt H, Brunak S. Prediction of human protein function according to gene ontology categories. Bioinformatics. 2003;19(5):635–42. doi: 10.1093/bioinformatics/btg036.
5. Sjolander K. Phylogenomic inference of protein molecular function: advances and challenges. Bioinformatics. 2004;20(2):170–9. doi: 10.1093/bioinformatics/bth021.
6. Murzin AG, Brenner SE, Hubbard T, Chothia C. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol. 1995;247:536–540. doi: 10.1006/jmbi.1995.0159.
7. Orengo CA, Michie AD, Jones S, Jones DT, Swindells MB, Thornton JM. CATH- a hierarchic classification of protein domain structures. Structure. 1997;5:1093–1108. doi: 10.1016/s0969-2126(97)00260-8.
8. Thornton JM. From genome to function. Science. 2001;292:2095–2097. doi: 10.1126/science.292.5524.2095.
9. Zarembinski T, Hung L, Dieckmann H, Kim K, Yokota H, Kim R, et al. Structure-based assignment of the biochemical function of a hypothetical protein: a test case of structural genomics. Proc Natl Acad Sci U S A. 1998;95:15189–93. doi: 10.1073/pnas.95.26.15189.
10. Russell RB. Detection of protein three-dimensional side-chain patterns: new examples of convergent evolution. J Mol Biol. 1998;279:1211–27. doi: 10.1006/jmbi.1998.1844.
11. Zhang B, Rychlewski L, Pawlowski K, Fetrow J, Skolnick J, Godzik A. From fold predictions to function predictions: automation of functional site conservation analysis for functional genome predictions. Protein Sci. 1999;8(5):1104–15. doi: 10.1110/ps.8.5.1104.
12. Copley S, Novak W, Babbitt P. Divergence of function in the thioredoxin fold suprafamily: evidence for evolution of peroxiredoxins from a thioredoxin-like ancestor. Biochemistry. 2004;43:13981–95. doi: 10.1021/bi048947r.
13. Wang K, Samudrala R. FSSA: a novel method for identifying functional signatures from structural alignments. Bioinformatics. 2005;21:2969–77. doi: 10.1093/bioinformatics/bti471.
14. Polacco B, Babbitt P. Automated discovery of 3D motifs for protein function annotation. Bioinformatics. 2006;22:723–30. doi: 10.1093/bioinformatics/btk038.
15. Chandonia J, Brenner S. The impact of structural genomics: expectations and outcomes. Science. 2006;311(5759):347–51. doi: 10.1126/science.1121018.
16. Pal D, Eisenberg D. Inference of protein function from protein structure. Structure. 2005;13:121–30. doi: 10.1016/j.str.2004.10.015.
17. Hou J, Jun S, Zhang C, Kim S. Global mapping of the protein structure space and application in structure-based inference of protein function. Proc Natl Acad Sci U S A. 2005;102:3651–6. doi: 10.1073/pnas.0409772102.
18. Wilson C, Kreychman J, Gerstein M. Assessing annotation transfer for genomics: quantifying the relations between protein sequence, structure and function through traditional and probabilistic scores. J Mol Biol. 2000;297(1):233–49. doi: 10.1006/jmbi.2000.3550.
19. Todd A, Orengo C, Thornton J. Evolution of function in protein superfamilies, from a structural perspective. J Mol Biol. 2001;307(4):1113–43. doi: 10.1006/jmbi.2001.4513.
20. Fischer D, Norel R, Wolfson H, Nussinov R. Surface motifs by a computer vision technique: searches, detection, and implications for protein-ligand recognition. Proteins. 1993;16(3):278–92. doi: 10.1002/prot.340160306.
21. Norel R, Fischer D, Wolfson HJ, Nussinov R. Molecular surface recognition by computer vision-based technique. Protein Eng. 1994;7. doi: 10.1093/protein/7.1.39.
22. Lichtarge O, Bourne HR, Cohen FE. An evolutionary trace method defines binding surfaces common to protein families. J Mol Biol. 1996;236:412–420. doi: 10.1006/jmbi.1996.0167.
23. Glaser F, Pupko T, Paz I, Bell R, Shental D, Martz E, et al. Consurf: identification of functional regions in proteins by surface-mapping of phylogenetic information. Bioinformatics. 2003;19(1):163–4. doi: 10.1093/bioinformatics/19.1.163.
24. Laskowski R, Watson J, Thornton J. Protein function prediction using local 3d templates. J Mol Biol. 2005;351(3):614–26. doi: 10.1016/j.jmb.2005.05.067.
25. Pazos F, Sternberg M. Automated prediction of protein function and detection of functional sites from structure. Proc Natl Acad Sci U S A. 2004;101(41):14754–9. doi: 10.1073/pnas.0404569101.
26. Ferre F, Ausiello G, Zanzoni A, Citterich M. Functional annotation by identification of local surface similarities: a novel tool for structural genomics. BMC Bioinformatics. 2005;6:194. doi: 10.1186/1471-2105-6-194.
27. Gold N, Jackson R. Fold independent structural comparisons of protein-ligand binding sites for exploring functional relationships. J Mol Biol. 2006;355:1112–24. doi: 10.1016/j.jmb.2005.11.044.
28. Liang J, Edelsbrunner H, Woodward C. Anatomy of protein pockets and cavities: measurement of binding site geometry and implications for ligand design. Protein Sci. 1998;7:1884–1897. doi: 10.1002/pro.5560070905.
29. Laskowski RA. Surfnet: a program for visualizing molecular surfaces, cavities, and intermolecular interactions. J Mol Graphics. 1995;13:323–330. doi: 10.1016/0263-7855(95)00073-9.
30. Laskowski RA, Luscombe NM, Swindells MB, Thornton JM. Protein clefts in molecular recognition and function. Protein Sci. 1996;5:2438–2452. doi: 10.1002/pro.5560051206.
31. Binkowski TA, Adamian L, Liang J. Inferring functional relationships of proteins from local sequence and spatial surface patterns. J Mol Biol. 2003;332:505–526. doi: 10.1016/s0022-2836(03)00882-9.
32. Edelsbrunner H, Mücke E. Three-dimensional alpha shapes. ACM Trans Graphics. 1994;13:43–72.
33. Edelsbrunner H, Facello M, Liang J. On the definition and the construction of pockets in macromolecules. Discrete Appl Math. 1998;88:83–102.
34. Liang J, Dill KA. Are proteins well-packed? Biophys J. 2001;81(2):751–766. doi: 10.1016/S0006-3495(01)75739-6.
35. Edelsbrunner H. The union of balls and its dual shape. Discrete Comput Geom. 1995;13:415–440.
36. Liang J, Edelsbrunner H, Fu P, Sudhakar PV, Subramaniam S. Analytical shape computing of macromolecules II: Identification and computation of inaccessible cavities inside proteins. Proteins. 1998;33:18–29.
37. Binkowski TA, Naghibzadeh S, Liang J. CASTp: Computed atlas of surface topography of proteins. Nucleic Acids Res. 2003;31:3352–3355. doi: 10.1093/nar/gkg512.
38. Jones DT, Taylor WR, Thornton JM. The rapid generation of mutation data matrices from protein sequences. CABIOS. 1992;8:275–282. doi: 10.1093/bioinformatics/8.3.275.
39. Whelan S, Goldman N. A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Mol Biol Evol. 2001;18(5):691–699. doi: 10.1093/oxfordjournals.molbev.a003851.
40. Tourasse N, Li W. Selective constraints, amino acid composition, and the rate of protein evolution. Mol Biol Evol. 2000;17(4):656–64. doi: 10.1093/oxfordjournals.molbev.a026344.
41. Tseng Y, Liang J. Estimation of amino acid residue substitution rates at local spatial regions and application in protein function inference: A Bayesian Monte Carlo approach. Mol Biol Evol. 2006;23(2):421–436. doi: 10.1093/molbev/msj048.
42. Bairoch A. The enzyme data bank. Nucleic Acids Res. 1993;21(13):3155–6. doi: 10.1093/nar/21.13.3155.
43. Camon E, Magrane M, Barrell D, Lee V, Dimmer E, Maslen J, et al. The gene ontology annotation (goa) database: sharing knowledge in uniprot with gene ontology. Nucleic Acids Res. 2004;32:D262–6. doi: 10.1093/nar/gkh021. Database issue.
44. Binkowski T, Naghibzadeh S, Liang J. CASTp: Computed atlas of surface topography of proteins. Nucleic Acids Res. 2003;31(13):3352–3355. doi: 10.1093/nar/gkg512.
45. Altschul S, Madden T, Schaffer A, Zhang J, Zhang Z, Miller W, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25(17):3389–402. doi: 10.1093/nar/25.17.3389.
46. Adachi J, Hasegawa M. A computer program package for molecular phylogenetics ver 2.3. 1996.
47. Matthews BW. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim Biophys Acta. 1975;405:442–451. doi: 10.1016/0005-2795(75)90109-9.
48. Uitdehaag J, Kalk K, van D, Dijkhuizen L, Dijkstra B. The cyclization mechanism of cyclodextrin glycosyltransferase (CGTase) as revealed by a gamma-cyclodextrin-CGTase complex at 1.8-A resolution. J Biol Chem. 1999;274:34868–76. doi: 10.1074/jbc.274.49.34868.
49. Sanishvili R, Yakunin AF, Laskowski RA, Evdokimova E, Skarina E, Doherty-Kirby A, et al. Integrating structure, bioinformatics, and enzymology to discover function: BioH, a new carboxylesterase from escherichia coli. J Biol Chem. 2003;278(28):26039–26045. doi: 10.1074/jbc.M303867200.
50. Sanishvili R, Yakunin A, Laskowski R, Skarina T, Evdokimova E, Kirby A, et al. Integrating structure, bioinformatics, and enzymology to discover function: Bioh, a new carboxylesterase from escherichia coli. J Biol Chem. 2003;278(28):26039–45. doi: 10.1074/jbc.M303867200.
51. Tseng Y, Liang J. Predicting enzyme functional surfaces and locating key residues automatically from structures. Ann Biomed Eng. 2007;35(6):1037–42. doi: 10.1007/s10439-006-9241-2.
52. Meng E, Polacco B, Babbitt P. Superfamily active site templates. Proteins. 2004;55:962–76. doi: 10.1002/prot.20099.
53. Wistow G, Mulders J, de J. The enzyme lactate dehydrogenase as a structural protein in avian and crocodilian lenses. Nature. 1987;326:622–4. doi: 10.1038/326622a0.
54. Acharya K, Ren J, Stuart D, Phillips D, Fenna R. Crystal structure of human alpha-lactalbumin at 1.7 A resolution. J Mol Biol. 1991;221:571–81. doi: 10.1016/0022-2836(91)80073-4.
55. Orengo C, Todd A, Thornton J. From protein structure to function. Curr Opin Struct Biol. 1999;9:374–82. doi: 10.1016/S0959-440X(99)80051-7.
56. Jeffery C. Molecular mechanisms for multitasking: recent crystal structures of moonlighting proteins. Curr Opin Struct Biol. 2004;14:663–8. doi: 10.1016/j.sbi.2004.10.001.
57. Najmanovich R, Hassani A, Morris R, Dombrovsky L, Pan P, Vedadi M, et al. Analysis of binding site similarity, small-molecule similarity and experimental binding profiles in the human cytosolic sulfotransferase family. Bioinformatics. 2007;23(2):e104–9. doi: 10.1093/bioinformatics/btl292.
58. Porter C, Bartlett G, Thornton J. The Catalytic Site Atlas: a resource of catalytic sites and residues identified in enzymes using structural data. Nucleic Acids Res. 2004;32:D129–33. doi: 10.1093/nar/gkh028.
59. Edelsbrunner H, Facello M, Liang J. On the definition and the construction of pockets in macromolecules. Discrete Appl Math. 1998;88:18–29.
60. Binkowski T, Freeman P, Liang J. pvSOAR: detecting similar surface patterns of pocket and void surfaces of amino acid residues on proteins. Nucleic Acids Res. 2004;32:W555–558. doi: 10.1093/nar/gkh390.
61. Yang Z, Nielsen R, Hasegawa M. Models of amino acid substitution and applications to mitochondrial protein evolution. Mol Biol Evol. 1998;15(12):1600–11. doi: 10.1093/oxfordjournals.molbev.a025888.
62. Felsenstein J, Churchill G. A Hidden Markov Model approach to variation among sites in rate of evolution. Mol Biol Evol. 1996;13(1):93–104. doi: 10.1093/oxfordjournals.molbev.a025575.
63. Siepel A, Haussler D. Phylogenetic estimation of context-dependent substitution rates by maximum likelihood. Mol Biol Evol. 2004;21(3):468–88. doi: 10.1093/molbev/msh039.
64. Karlin S, Altschul SF. Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc Natl Acad Sci. 1990;87:2264–2268. doi: 10.1073/pnas.87.6.2264.
65. Dayhoff MO, Schwartz RM, Orcutt BC. Atlas of protein sequence and structure. 1978;5(suppl 3):345.

[R1] 1.Rost B. Enzyme function less conserved than anticipated. J Mol Biol. 2002;318(2):595–608. doi: 10.1016/S0022-2836(02)00016-5. [DOI] [PubMed] [Google Scholar]

[R2] 2.Tian W, Skolnick J. How well is enzyme function conserved as a function of pairwise sequence identity? J Mol Biol. 2003;333(4):863–82. doi: 10.1016/j.jmb.2003.08.057. [DOI] [PubMed] [Google Scholar]

[R3] 3.Hannenhalli S, Russell R. Analysis and prediction of functional sub-types from protein sequence alignments. J Mol Biol. 2000;303(1):61–76. doi: 10.1006/jmbi.2000.4036. [DOI] [PubMed] [Google Scholar]

[R4] 4.Jensen L, Gupta R, Staerfeldt H, Brunak S. Prediction of human protein function according to gene ontology categories. Bioinformatics. 2003;19(5):635–42. doi: 10.1093/bioinformatics/btg036. [DOI] [PubMed] [Google Scholar]

[R5] 5.Sjolander K. Phylogenomic inference of protein molecular function: advances and challenges. Bioinformatics. 2004;20(2):170–9. doi: 10.1093/bioinformatics/bth021. [DOI] [PubMed] [Google Scholar]

[R6] 6.Murzin AG, Brenner SE, Hubbard T, Chothia C. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 1995;247:536–540. doi: 10.1006/jmbi.1995.0159. [DOI] [PubMed] [Google Scholar]

[R7] 7.Orengo CA, Michie AD, Jones S, Jones DT, Swindells MB, Thornton JM. CATH- a hierarchic classification of protein domain structures. Structure. 1997;5:1093–1108. doi: 10.1016/s0969-2126(97)00260-8. [DOI] [PubMed] [Google Scholar]

[R8] 8.Thornton JM. From genome to function. Science. 2001;292:2095–2097. doi: 10.1126/science.292.5524.2095. [DOI] [PubMed] [Google Scholar]

[R9] 9.Zarembinski T, Hung L, Dieckmann H, Kim K, Yokota H, Kim R, et al. Structure-based assignment of the biochemical function of a hypothetical protein: a test case of structural genomics. Proc. Natl. Acad. Sci. U. S. A. 1998;95:15189–93. doi: 10.1073/pnas.95.26.15189. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R10] 10.Russell RB. Detection of protein three-dimensional side-chain patterns: new examples of convergent evolution. J. Mol. Biol. 1998;279:1211–27. doi: 10.1006/jmbi.1998.1844. [DOI] [PubMed] [Google Scholar]

[R11] 11. Zhang B, Rychlewski L, Pawlowski K, Fetrow J, Skolnick J, Godzik A. From fold predictions to function predictions: automation of functional site conservation analysis for functional genome predictions. Protein Sci. 1999;8(5):1104–15. doi: 10.1110/ps.8.5.1104.

[R12] 12. Copley S, Novak W, Babbitt P. Divergence of function in the thioredoxin fold suprafamily: evidence for evolution of peroxiredoxins from a thioredoxin-like ancestor. Biochemistry. 2004;43:13981–95. doi: 10.1021/bi048947r.

[R13] 13. Wang K, Samudrala R. FSSA: a novel method for identifying functional signatures from structural alignments. Bioinformatics. 2005;21:2969–77. doi: 10.1093/bioinformatics/bti471.

[R14] 14. Polacco B, Babbitt P. Automated discovery of 3D motifs for protein function annotation. Bioinformatics. 2006;22:723–30. doi: 10.1093/bioinformatics/btk038.

[R15] 15. Chandonia J, Brenner S. The impact of structural genomics: expectations and outcomes. Science. 2006;311(5759):347–51. doi: 10.1126/science.1121018.

[R16] 16. Pal D, Eisenberg D. Inference of protein function from protein structure. Structure. 2005;13:121–30. doi: 10.1016/j.str.2004.10.015.

[R17] 17. Hou J, Jun S, Zhang C, Kim S. Global mapping of the protein structure space and application in structure-based inference of protein function. Proc Natl Acad Sci U S A. 2005;102:3651–6. doi: 10.1073/pnas.0409772102.

[R18] 18. Wilson C, Kreychman J, Gerstein M. Assessing annotation transfer for genomics: quantifying the relations between protein sequence, structure and function through traditional and probabilistic scores. J Mol Biol. 2000;297(1):233–49. doi: 10.1006/jmbi.2000.3550.

[R19] 19. Todd A, Orengo C, Thornton J. Evolution of function in protein superfamilies, from a structural perspective. J Mol Biol. 2001;307(4):1113–43. doi: 10.1006/jmbi.2001.4513.

[R20] 20. Fischer D, Norel R, Wolfson H, Nussinov R. Surface motifs by a computer vision technique: searches, detection, and implications for protein-ligand recognition. Proteins. 1993;16(3):278–92. doi: 10.1002/prot.340160306.

[R21] 21. Norel R, Fischer D, Wolfson HJ, Nussinov R. Molecular surface recognition by computer vision-based technique. Protein Eng. 1994;7. doi: 10.1093/protein/7.1.39.

[R22] 22. Lichtarge O, Bourne HR, Cohen FE. An evolutionary trace method defines binding surfaces common to protein families. J Mol Biol. 1996;236:412–420. doi: 10.1006/jmbi.1996.0167.

[R23] 23. Glaser F, Pupko T, Paz I, Bell R, Shental D, Martz E, et al. ConSurf: identification of functional regions in proteins by surface-mapping of phylogenetic information. Bioinformatics. 2003;19(1):163–4. doi: 10.1093/bioinformatics/19.1.163.

[R24] 24. Laskowski R, Watson J, Thornton J. Protein function prediction using local 3D templates. J Mol Biol. 2005;351(3):614–26. doi: 10.1016/j.jmb.2005.05.067.

[R25] 25. Pazos F, Sternberg M. Automated prediction of protein function and detection of functional sites from structure. Proc Natl Acad Sci U S A. 2004;101(41):14754–9. doi: 10.1073/pnas.0404569101.

[R26] 26. Ferre F, Ausiello G, Zanzoni A, Citterich M. Functional annotation by identification of local surface similarities: a novel tool for structural genomics. BMC Bioinformatics. 2005;6:194. doi: 10.1186/1471-2105-6-194.

[R27] 27. Gold N, Jackson R. Fold independent structural comparisons of protein-ligand binding sites for exploring functional relationships. J Mol Biol. 2006;355:1112–24. doi: 10.1016/j.jmb.2005.11.044.

[R28] 28. Liang J, Edelsbrunner H, Woodward C. Anatomy of protein pockets and cavities: measurement of binding site geometry and implications for ligand design. Protein Sci. 1998;7:1884–1897. doi: 10.1002/pro.5560070905.

[R29] 29. Laskowski RA. Surfnet: a program for visualizing molecular surfaces, cavities, and intermolecular interactions. J Mol Graphics. 1995;13:323–330. doi: 10.1016/0263-7855(95)00073-9.

[R30] 30. Laskowski RA, Luscombe NM, Swindells MB, Thornton JM. Protein clefts in molecular recognition and function. Protein Sci. 1996;5:2438–2452. doi: 10.1002/pro.5560051206.

[R31] 31. Binkowski TA, Adamian L, Liang J. Inferring functional relationships of proteins from local sequence and spatial surface patterns. J Mol Biol. 2003;332:505–526. doi: 10.1016/s0022-2836(03)00882-9.

[R32] 32. Edelsbrunner H, Mücke E. Three-dimensional alpha shapes. ACM Trans Graphics. 1994;13:43–72.

[R33] 33. Edelsbrunner H, Facello M, Liang J. On the definition and the construction of pockets in macromolecules. Discrete Appl Math. 1998;88:83–102.

[R34] 34. Liang J, Dill KA. Are proteins well-packed? Biophys J. 2001;81(2):751–766. doi: 10.1016/S0006-3495(01)75739-6.

[R35] 35. Edelsbrunner H. The union of balls and its dual shape. Discrete Comput Geom. 1995;13:415–440.

[R36] 36. Liang J, Edelsbrunner H, Fu P, Sudhakar PV, Subramaniam S. Analytical shape computing of macromolecules II: Identification and computation of inaccessible cavities inside proteins. Proteins. 1998;33:18–29.

[R37] 37. Binkowski TA, Naghibzadeh S, Liang J. CASTp: Computed atlas of surface topography of proteins. Nucleic Acids Res. 2003;31:3352–3355. doi: 10.1093/nar/gkg512.

[R38] 38. Jones DT, Taylor WR, Thornton JM. The rapid generation of mutation data matrices from protein sequences. CABIOS. 1992;8:275–282. doi: 10.1093/bioinformatics/8.3.275.

[R39] 39. Whelan S, Goldman N. A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Mol Biol Evol. 2001;18(5):691–699. doi: 10.1093/oxfordjournals.molbev.a003851.

[R40] 40. Tourasse N, Li W. Selective constraints, amino acid composition, and the rate of protein evolution. Mol Biol Evol. 2000;17(4):656–64. doi: 10.1093/oxfordjournals.molbev.a026344.

[R41] 41. Tseng Y, Liang J. Estimation of amino acid residue substitution rates at local spatial regions and application in protein function inference: A Bayesian Monte Carlo approach. Mol Biol Evol. 2006;23(2):421–436. doi: 10.1093/molbev/msj048.

[R42] 42. Bairoch A. The enzyme data bank. Nucleic Acids Res. 1993;21(13):3155–6. doi: 10.1093/nar/21.13.3155.

[R43] 43. Camon E, Magrane M, Barrell D, Lee V, Dimmer E, Maslen J, et al. The gene ontology annotation (GOA) database: sharing knowledge in UniProt with gene ontology. Nucleic Acids Res. 2004;32(Database issue):D262–6. doi: 10.1093/nar/gkh021.

[R44] 44. Binkowski T, Naghibzadeh S, Liang J. CASTp: Computed atlas of surface topography of proteins. Nucleic Acids Res. 2003;31(13):3352–3355. doi: 10.1093/nar/gkg512.

[R45] 45. Altschul S, Madden T, Schaffer A, Zhang J, Zhang Z, Miller W, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25(17):3389–402. doi: 10.1093/nar/25.17.3389.

[R46] 46. Adachi J, Hasegawa M. A computer program package for molecular phylogenetics ver 2.3. 1996.

[R47] 47. Matthews BW. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim Biophys Acta. 1975;405:442–451. doi: 10.1016/0005-2795(75)90109-9.

[R48] 48. Uitdehaag J, Kalk K, van D, Dijkhuizen L, Dijkstra B. The cyclization mechanism of cyclodextrin glycosyltransferase (CGTase) as revealed by a gamma-cyclodextrin-CGTase complex at 1.8-A resolution. J Biol Chem. 1999;274:34868–76. doi: 10.1074/jbc.274.49.34868.

[R49] 49. Sanishvili R, Yakunin AF, Laskowski RA, Evdokimova E, Skarina E, Doherty-Kirby A, et al. Integrating structure, bioinformatics, and enzymology to discover function: BioH, a new carboxylesterase from Escherichia coli. J Biol Chem. 2003;278(28):26039–26045. doi: 10.1074/jbc.M303867200.

[R50] 50. Sanishvili R, Yakunin A, Laskowski R, Skarina T, Evdokimova E, Kirby A, et al. Integrating structure, bioinformatics, and enzymology to discover function: BioH, a new carboxylesterase from Escherichia coli. J Biol Chem. 2003;278(28):26039–45. doi: 10.1074/jbc.M303867200.

[R51] 51. Tseng Y, Liang J. Predicting enzyme functional surfaces and locating key residues automatically from structures. Ann Biomed Eng. 2007;35(6):1037–42. doi: 10.1007/s10439-006-9241-2.

[R52] 52. Meng E, Polacco B, Babbitt P. Superfamily active site templates. Proteins. 2004;55:962–76. doi: 10.1002/prot.20099.

[R53] 53. Wistow G, Mulders J, de J. The enzyme lactate dehydrogenase as a structural protein in avian and crocodilian lenses. Nature. 1987;326:622–4. doi: 10.1038/326622a0.

[R54] 54. Acharya K, Ren J, Stuart D, Phillips D, Fenna R. Crystal structure of human alpha-lactalbumin at 1.7 A resolution. J Mol Biol. 1991;221:571–81. doi: 10.1016/0022-2836(91)80073-4.

[R55] 55. Orengo C, Todd A, Thornton J. From protein structure to function. Curr Opin Struct Biol. 1999;9:374–82. doi: 10.1016/S0959-440X(99)80051-7.

[R56] 56. Jeffery C. Molecular mechanisms for multitasking: recent crystal structures of moonlighting proteins. Curr Opin Struct Biol. 2004;14:663–8. doi: 10.1016/j.sbi.2004.10.001.

[R57] 57. Najmanovich R, Hassani A, Morris R, Dombrovsky L, Pan P, Vedadi M, et al. Analysis of binding site similarity, small-molecule similarity and experimental binding profiles in the human cytosolic sulfotransferase family. Bioinformatics. 2007;23(2):e104–9. doi: 10.1093/bioinformatics/btl292.

[R58] 58. Porter C, Bartlett G, Thornton J. The Catalytic Site Atlas: a resource of catalytic sites and residues identified in enzymes using structural data. Nucleic Acids Res. 2004;32:D129–33. doi: 10.1093/nar/gkh028.

[R59] 59. Edelsbrunner H, Facello M, Liang J. On the definition and the construction of pockets in macromolecules. Discrete Appl Math. 1998;88:18–29.

[R60] 60. Binkowski T, Freeman P, Liang J. pvSOAR: detecting similar surface patterns of pocket and void surfaces of amino acid residues on proteins. Nucleic Acids Res. 2004;32:W555–558. doi: 10.1093/nar/gkh390.

[R61] 61. Yang Z, Nielsen R, Hasegawa M. Models of amino acid substitution and applications to mitochondrial protein evolution. Mol Biol Evol. 1998;15(12):1600–11. doi: 10.1093/oxfordjournals.molbev.a025888.

[R62] 62. Felsenstein J, Churchill G. A Hidden Markov Model approach to variation among sites in rate of evolution. Mol Biol Evol. 1996;13(1):93–104. doi: 10.1093/oxfordjournals.molbev.a025575.

[R63] 63. Siepel A, Haussler D. Phylogenetic estimation of context-dependent substitution rates by maximum likelihood. Mol Biol Evol. 2004;21(3):468–88. doi: 10.1093/molbev/msh039.

[R64] 64. Karlin S, Altschul SF. Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc Natl Acad Sci. 1990;87:2264–2268. doi: 10.1073/pnas.87.6.2264.

[R65] 65. Dayhoff MO, Schwartz RM, Orcutt BC. Atlas of protein sequence and structure. 1978;5(suppl 3):345.
