Structure-based function inference using protein family-specific fingerprints

Deepak Bandyopadhyay; Jun Huan; Jinze Liu; Jan Prins; Jack Snoeyink; Wei Wang; Alexander Tropsha

doi:10.1110/ps.062189906

Protein Sci. 2006 Jun;15(6):1537–1543.

PMCID: PMC2265098 PMID: 16731985

Abstract

We describe a method to assign a protein structure to a functional family using family-specific fingerprints. Fingerprints represent amino acid packing patterns that occur in most members of a family but are rare in the background, a nonredundant subset of PDB; their information is additional to sequence alignments, sequence patterns, structural superposition, and active-site templates. Fingerprints were derived for 120 families in SCOP using Frequent Subgraph Mining. For a new structure, all occurrences of these family-specific fingerprints may be found by a fast algorithm for subgraph isomorphism; the structure can then be assigned to a family with a confidence value derived from the number of fingerprints found and their distribution in background proteins. In validation experiments, we infer the function of new members added to SCOP families and we discriminate between structurally similar, but functionally divergent TIM barrel families. We then apply our method to predict function for several structural genomics proteins, including orphan structures. Some predictions have been corroborated by other computational methods and some validated by subsequent functional characterization.

Keywords: subgraph mining, Delaunay, almost-Delaunay, protein classification, structure-based function inference, structural genomics, orphan structures

Structural genomics projects (Burley 2000) have generated structures for many proteins of unknown function. Function for these so-called hypothetical proteins is traditionally inferred from sequence similarity or overall structure similarity. Structural genomics targets, however, are selected to avoid sequence similarity so as to sample the protein “structure space”; a quarter of structural genomics proteins deposited by May 2005 had <30% sequence identity and DALI Z-scores (Holm and Sander 1996) <10 with proteins of known function (Bandyopadhyay 2005). Other inference tools are needed for these orphan protein structures.

Recently, methods have been developed to infer function from local structural similarity without relying on sequence and overall structure similarity. Aloy et al. (2001) found that conserved geometric packing patterns of a few residues are often responsible for protein function, and finding them can lead to more accurate function inference than obtained by structural homology. Laskowski et al. (2005b) developed SiteSeer's reverse template method, which also searches for conserved packing patterns within protein structures. Other recent methods find functionally important residues using computed chemical properties (Ko et al. 2005), careful alignments (Pegg et al. 2005), evolutionary information (Wang and Samudrala 2005), and computational protein design (Cheng et al. 2005). Still other methods use Gene Ontology (Gene Ontology Consortium 2004) as a reference to define function, such as ProKnow (Pal and Eisenberg 2005) and PHUNCTIONER (Pazos and Sternberg 2004). A recent review (Ofran et al. 2005) covers these and other structure-based function prediction methods.

Graph representations of protein structure allow more flexibility than rigid templates in representing and matching structural motifs. Earlier methods used graph representations to search for known structure patterns (Artymiuk et al. 1994; Stark and Russell 2003) or determine patterns with limited topology, such as cliques, from groups of proteins (Wangikar et al. 2003; Milik et al. 2003). Using frequent subgraph mining, Huan et al. (2004, 2005) defined family-specific fingerprints as packing patterns that are frequent within a family of protein structures but rare within the background. Using serine protease and kinase families, they showed that fingerprints often cover functionally important residues and can distinguish between proteins from similar families.

In this study we propose a new method for function inference that uses family-specific fingerprints automatically derived from SCOP families (Murzin et al. 1995). The method searches for fingerprints within a new structure using fast subgraph isomorphism (Ullman 1976) and assigns a significance score to family membership using the distribution of fingerprints found in members of the family and in the background. Its strength is in distinguishing proteins with related and similar functions.

Our method does not restrict pattern graph types or assume that the functional sites are known. Each fingerprint is statistically linked to its family, and our consensus approach using multiple fingerprints improves the accuracy and specificity of function inference. Families with different function but similar structure can be distinguished since the fingerprints tend to identify functionally important parts of a protein. In contrast, methods based on Gene Ontology suggest broader functional categories more than specific functional families (Pazos and Sternberg 2004; Pal and Eisenberg 2005).

Results

We derived family-specific fingerprints for proteins in 120 SCOP families using a background of 6749 nonredundant proteins, as described in the Materials and Methods section. After this, we examined the family specificity of the fingerprints, then classified new protein structures by identifying cases of functional similarity with and without overall structure similarity and inferred function for orphan structures from structural genomics targets.

Fingerprint occurrence in family and background

To test the uniqueness of a family's fingerprints and establish significance of function inference, we examined the frequency of family-specific fingerprints in the background, as described in the Materials and Methods. In most families, almost all background proteins have fewer fingerprints than the minimum found in any family member; see Figure 1, A and B, for examples of the metallo-dependent hydrolase (SCOP: 51556) and antibiotic resistance (SCOP: 54598) families. Some family members have few fingerprints; the majority of those we inspected had either a different function or mechanism from the other members or errors in the structure file that prevent the identification of fingerprints.

Figure 1. Distribution of metallo-dependent hydrolase (SCOP: 51556) (A) and antibiotic resistance (SCOP: 54598) (B) fingerprints in the background (light bars), and within the family (dark). (Inset) ROC curve showing specificity vs. sensitivity of function inference at different numbers of fingerprints. (C,D) Example of function inference: metallo-dependent hydrolase fingerprints (shown as graphs) in the metallo-dependent hydrolase 1nfg (C) and 1m65 (D) (YcdX, unknown function). (E,F) The same proteins shown as residues covered by metallo-dependent hydrolase fingerprints, color-coded by chemical properties. (G,H) Another example of function inference: residues covered by antibiotic resistance fingerprints in the family protein 1ecs (G), and 1twu (H) (Yyce, unknown function). Snapshots from kinemages viewed in KiNG (C,D), and from VMD (E–H) (Humphrey et al. 1996).

Many background proteins with a majority of the fingerprints for a family turned out to be new family members. For example, four proteins with 30 or more of the 49 metallo-dependent hydrolase fingerprints, 1un7A (48), 1rk6A (40), 1ndyA (33), and 1kcxA (32), were not included in the metallo-dependent hydrolase family in SCOP 1.65, but were in SCOP 1.67. Other high-scoring proteins had closely related enzymatic functions (e.g., phosphatases, phosphoesterases) but came from different SCOP families, e.g., metallohydrolase/oxidoreductase of TIM barrel fold (1p9e, 48) and mannose hydrolase of (βα)₇ fold (1qwn, 44).

Validation on proteins added to SCOP

To test the validity of inferring family membership, we used fingerprints derived from SCOP 1.65 families to classify proteins that were newly added to these families in SCOP 1.67. The detailed results are shown in Tables II–IV in the online Supplemental Material. Of the 442 new members added to 94 families, the number of proteins that can be inferred using fingerprints from the correct family is 316 (71%) at the sensitivity cutoff and 284 (64%) at the 99%-specificity cutoff. Most importantly, for 287 (65%) of the new members, among families with fingerprints above 95% specificity, the correct family was the choice with highest specificity. In contrast, for only 234 (53%) of the new members did a member of the correct family have the most significant sequence hit among all proteins in SCOP 1.65 with at least 40% sequence identity, which is the threshold suggested for inferring function from sequence (Wilson et al. 2000).
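The percentages quoted above follow directly from the reported counts, and the "highest specificity" rule can be written as a small selection function. The sketch below is illustrative only: the function names `percent` and `best_family`, and the use of "at least 95%" as the eligibility test, are assumptions and not part of the published implementation.

```python
# Counts reported in the text for the SCOP 1.67 validation set.
NEW_MEMBERS = 442


def percent(count, total=NEW_MEMBERS):
    """Return count/total as a whole-number percentage."""
    return round(100 * count / total)


def best_family(specificity_by_family, min_specificity=0.95):
    """Pick the family whose fingerprints give the highest specificity.

    Only families reaching min_specificity are eligible (assumed rule);
    returns None when no family qualifies.
    """
    eligible = {fam: spec for fam, spec in specificity_by_family.items()
                if spec >= min_specificity}
    if not eligible:
        return None
    return max(eligible, key=eligible.get)


if __name__ == "__main__":
    print(percent(316), percent(284), percent(287), percent(234))
    # Hypothetical specificities for one query structure.
    print(best_family({"family_a": 0.97, "family_b": 0.999, "family_c": 0.50}))
```

Running the module reproduces the four rounded percentages in the paragraph above from the raw counts.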

Discriminating between similar structures with different function

To test the discrimination power of fingerprints, we searched for the fingerprints of 20 structurally similar (super) families of the TIM barrel fold that have different functions. As shown in Figure 2, the average member of any of these families has 70%–90% of the fingerprints of its own family (orange or red, seen on the diagonal), and 0%–40% of the fingerprints of any other family (blue, seen off the diagonal). Exceptions arise from superfamily–subfamily pairs such as enolase C-terminal domains (ENC) and D-glucarate dehydratases (DGL) that share fingerprints, since their members overlap, and from families that do not have highly significant fingerprints, such as the ribulose-phosphate-binding barrels (RIB). Thus, fingerprints discriminate between functional families whose members cannot be distinguished easily by overall structure similarity.

Figure 2. Discriminating the TIM barrels using fingerprints. (A) The 20 families selected, with columns listing a three-letter abbreviation for each family, number of members and fingerprints, and maximum number of fingerprints found in a nonfamily protein of the TIM fold. Families mentioned in the last column for which fingerprints were not identified: IMP (inosine monophosphate dehydrogenase) and MAL (malate synthase). (B) Pseudo-color matrix plot showing the percentage of fingerprints of the TIM barrel family in each row found in an average member of the family in each column. High values on the diagonal (red) and low off-diagonal (blue) indicate high discrimination. Exceptions to this trend are documented in the text.
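The matrix in panel B can be computed from per-protein fingerprint counts. The following sketch assumes a nested dictionary layout and uses placeholder family names and made-up counts purely to show the arithmetic; none of the numbers are results from this study.

```python
from statistics import mean


def percent_matrix(n_fingerprints, found):
    """Percentage of each row family's fingerprints seen in an average
    member of each column family.

    n_fingerprints: {row_family: number of fingerprints for that family}
    found: {row_family: {column_family: [count per member, ...]}}
    """
    return {
        row: {col: 100.0 * mean(counts) / n_fingerprints[row]
              for col, counts in columns.items()}
        for row, columns in found.items()
    }


if __name__ == "__main__":
    # Hypothetical input: two families with two members each.
    n_fp = {"FAM_A": 20, "FAM_B": 10}
    counts = {
        "FAM_A": {"FAM_A": [18, 16], "FAM_B": [2, 4]},
        "FAM_B": {"FAM_A": [1, 1], "FAM_B": [9, 7]},
    }
    for row, cols in percent_matrix(n_fp, counts).items():
        print(row, cols)
```

High diagonal entries and low off-diagonal entries in the returned dictionary correspond to the red and blue cells described in the caption.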

Function inference for structural genomics targets

We classified Structural Genomics targets in the PDB as either proteins with known function, proteins with putative function suggested by overall structure similarity, or orphan structures. We applied our method to suggest function assignments for proteins in the last two categories. For example, strong structural similarity to the metallo-dependent phosphatase superfamily (SCOP: 56300) was found in two hypothetical proteins, 1s3l (14% sequence identity, DALI Z-score 13.1 with member 1hpu) and 1xm7 (13% identity, Z-score 10.6, 1ii7). For these proteins, we inferred metallo-dependent phosphatase function with 26 and 125 of 316 fingerprints, i.e., 100% specificity, corroborating the function inference suggested by structural similarity. More interesting are two case studies for proteins in the last category, i.e., structural orphans.

Functional inference of YcdX

The YcdX protein (PDB: 1m65, CASP5 target T0147) has a rare (βα)₇ barrel fold called the PHP domain (SCOP: 89551). It had no significant sequence or overall structure similarity with proteins of known function in 2004. We inferred that this protein has a metallo-dependent hydrolase function with 30 of 49 fingerprints from SCOP superfamily 51556, a TIM barrel family. The fingerprints are shown as subgraphs in Figure 1, C and D. The residues included in family-specific fingerprints for this target, depicted in Figure 1, E and F, are localized in space and show similar geometric arrangements and chemical properties in family and target.

Our inference was corroborated by the following: (1) active-site template and reverse-template matches on the ProFunc server (Laskowski et al. 2005a,b), (2) suggestions by the CASP5 target classifiers (Kinch et al. 2003), and (3) suggestions by the authors of the structure (Teplyakov et al. 2003), who proposed active-site residues for 1m65 that are included in many of our fingerprints as shown in the online Supplemental Material on our Web site. The PINTS-weekly service (Stark et al. 2004) found active-site patterns from many metallo-dependent hydrolases in this protein. Finally, GenProtEC, the Escherichia coli genome and proteome database (Serres et al. 2004), has annotated the YcdX gene product as belonging to the SCOP metallo-dependent hydrolase structural domain family on the basis of the SUPERFAMILY database of HMMs for SCOP families (Gough and Chothia 2002; Madera et al. 2004).

Functional inference for Yyce

Protein Yyce from Bacillus subtilis (PDB: 1twu) is unclassified in both SCOP 1.65 and 1.67 and was an orphan structure in 2004, with no significant structural similarity to structures of known function. We found 46 of 62 fingerprints from the antibiotic resistance protein family (SCOP ID: 54598) in 1twu, inferring the antibiotic resistance function with 100% specificity. Figure 1, G and H, show the residues covered by fingerprints in 1twu and in 1ecs, an antibiotic resistance protein in SCOP 1.65. Note the geometric and electrostatic similarity between the upper region covered by fingerprints in both 1twu and 1ecs, which suggests that fingerprints cover functionally important residues.

When the structural similarity of 1twu was re-evaluated in May 2005 using the current DALI database, it was found to be similar to a protein 1nki that was unclassified in SCOP 1.65, but has been added to the antibiotic resistance protein family in SCOP 1.67. This discovery of homology with a newly classified member of the family corroborates our function inference.

Discussion

Our method of using family-specific fingerprints to infer function for proteins was designed to be robust: the graph construction takes into account natural imprecision in coordinates, and the use of multiple local motifs as fingerprints accommodates remaining representation errors and flexibility in functional sites. The method is also designed to give information that is not implied by sequence patterns, structural alignments, and templates of known functional sites. Thus, not only may it succeed as a stand-alone method where other methods may fail, but it may also be profitably used in consensus with other methods.

The successful function inference for new members of SCOP families validates the predictive power of fingerprints; the success rate of 65% for choosing the correct family is high considering that there are functional outliers among SCOP family members, and that sequence methods could pick the correct family only 53% of the time.

The function discrimination within the TIM barrel fold, and the inference of YcdX as belonging to the sequence-diverse, metallo-dependent hydrolase family despite its different fold, indicate that the packing patterns in fingerprints do capture information that is specific to a functional family rather than shared structural information.

We have seen that the fingerprints detected in YcdX cover its functional regions; this can be attributed to the fact that SCOP families often share a function and superfamilies often share aspects of function. Our subgraph mining finds fingerprints that characterize the shared local structures exclusive to each family. Our method can also derive fingerprints for explicitly functional classifications, such as EC (Bairoch 2000) or GO (Gene Ontology Consortium 2004); we will report these results in the near future.

We have observed annotations that initially appear to disagree with our inferences, sometimes because the annotation was speculative and sometimes because the level of classification was too coarse or too fine. An example of both is 1m65, which is in the PHP-domain family in SCOP. We classify it as a metallo-dependent hydrolase, and the Gene Ontology Annotation (GOA) database (Camon et al. 2004) annotates it as having DNA-directed DNA polymerase activity (GO: 0003887), a putative function assignment based on electronic annotation transferred from the sequence database InterPro. The discoverers of the PHP-domain sequence family (Aravind and Koonin 1998) indicated that metallo-dependent hydrolases share active-site sequence motifs with this family, and hypothesized that bacterial and archaeal DNA polymerases possess intrinsic phosphatase activity. Since several metallo-dependent hydrolases can hydrolyze phosphoester or phosphate bonds, the assigned GO term may still support the function inferred by our method.

The designed robustness of our method suggests its use to predict function from sequence using either good quality predicted structures or sequence patterns derived from fingerprints whose sequence order is preserved within a family. Investigations in this direction are ongoing.

Our method has limitations, arising from representation choices, algorithmic issues, and the nature of the problem itself. In our representation, we use Cα coordinates to calculate graph edges and lengths; this choice captures shared topology, but may miss contacts with long side chains. Currently, we do not allow residue substitutions in patterns other than unifying V, A, I, and L. Merging commonly substituted residue types (e.g., D and E) increases the sensitivity of fingerprints but can decrease specificity; we may lose fingerprints that are no longer unique to a family. Finally, the distance edge-matching criteria may be too restrictive to find patterns with widely varying geometry or containing edges that happen to lie on bin boundaries. We are developing a new distance edge representation to fix this problem.

Algorithmically, subgraph mining involves the NP-complete problem of subgraph isomorphism. The FFSM algorithm (Huan et al. 2004) stores graph embeddings, so it does well with small isomorphic subgraphs, but can bog down with the large ones that can arise in families with very similar or identical structures.

It is part of the nature of the problem that classifications that are too fine can produce too many fingerprints due to high local similarity or small sample sizes, i.e., families with three or fewer members. Conversely, too coarse a classification can produce no fingerprints that are specific to a family; this happens with 35% of the SCOP families and superfamilies we considered, especially the latter because of their heterogeneity. Because the number, specificity, and sensitivity of fingerprints depend on the size and heterogeneity of the family, the support and background occurrence parameters must be varied to find meaningful sets of fingerprints for the maximum number of families.

In conclusion, the method identifies fingerprints for functional families with four or more representatives by finding packing patterns characteristic of each family and uses them to infer function. Structure errors, missing fragments, or mutations may lead to failure of fingerprint mining or function inference. Careful manual selection of families and fixing errors in structure files should improve the results further. Since our method infers function for many orphan proteins, the ultimate proof will come from experimental validation of its predictions.

Materials and methods

Our method first finds and calibrates fingerprints (steps 1–4) using the FFSM subgraph mining program (http://www.cs.unc.edu/~huan/FFSM/). Each function inference then takes two further steps (5 and 6), which are implemented in MATLAB.

1. Family and background selection

We selected 120 families and superfamilies from SCOP version 1.65. Though SCOP 1.67 was released in February 2005, we have retained the fingerprints derived from SCOP 1.65 to allow unbiased function prediction of structural orphans using information known at the time that they were selected and use new members added in SCOP 1.67 to validate the method. In addition to requiring better than 3 Å resolution and R-factor at most 1.0, we reduced redundancy by using PISCES (Wang and Dunbrack 2003) to select family members having at most 90% sequence identity. The same criteria when applied to the entire PDB produced a representative set of 6749 protein chains in May 2005, which we used as the background for identifying fingerprints. The lists of families, family members, and background selected are in online Supplemental Materials at http://www.cs.unc.edu/~debug/papers/FuncInf.

2. Graph representation

We represent protein structures as graphs, with nodes at each residue labeled with the amino acid type, with V, A, I, and L condensed to one type, since they frequently substitute for one another. Edges represent contact between residues defined by almost-Delaunay edges (Bandyopadhyay and Snoeyink 2004), or distance constraints between noncontacting residues. Edges are labeled with length ranges (0–4, 4–6, 6–8.5, 8.5–10.5, 10.5–12.5, and 12.5–15 Å). Fingerprints mined from this graph representation are called distance edge fingerprints. We performed some experiments (e.g., for the metallo-dependent hydrolase family) using simple edge fingerprints, which omit distance labels.
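The node and edge labeling described above can be sketched as follows. This is a minimal illustration, not the authors' code: the label name `VAIL`, the function names, and the convention that a distance exactly on a bin boundary falls in the lower bin are all assumptions, since the text does not specify them.

```python
from bisect import bisect_left

# Upper bounds (in angstroms) of the six length ranges listed in the text.
BIN_UPPER_BOUNDS = [4.0, 6.0, 8.5, 10.5, 12.5, 15.0]

# Residue types that are condensed to a single node label.
MERGED_TYPES = {"V", "A", "I", "L"}


def node_label(residue):
    """Map a one-letter residue code to its node label.

    The merged label name "VAIL" is a hypothetical choice.
    """
    return "VAIL" if residue in MERGED_TYPES else residue


def edge_label(distance):
    """Return the index (0-5) of the length range containing distance,
    or None if the distance lies outside 0-15 angstroms.

    Boundary values go to the lower bin (assumed convention).
    """
    if distance < 0 or distance > BIN_UPPER_BOUNDS[-1]:
        return None
    return bisect_left(BIN_UPPER_BOUNDS, distance)


if __name__ == "__main__":
    print(node_label("V"), node_label("D"))
    print(edge_label(3.8), edge_label(9.0), edge_label(16.0))
```

Simple edge fingerprints, as mentioned in the text, would correspond to dropping the `edge_label` value and recording only whether an edge exists.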

3. Frequent subgraph mining

We mine frequent subgraphs from the graph representation of all proteins in a family using Fast Frequent Subgraph Mining (Huan et al. 2005). We use a support value of 80% to define frequency. Frequent subgraphs are constrained to have high density by having no more than one edge missing from a clique.
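The density constraint can be stated as a simple count: a subgraph on k nodes is kept only if it has at least k(k-1)/2 - 1 edges. The check below is a hedged sketch of that criterion alone; the function name and the edge-list input format are assumptions, and it is not part of FFSM.

```python
def is_dense(nodes, edges):
    """Return True if the subgraph is a clique with at most one edge missing.

    nodes: iterable of node identifiers
    edges: iterable of (u, v) pairs; duplicates, self-loops, and edges
           touching nodes outside the subgraph are ignored
    """
    node_set = set(nodes)
    distinct = {
        frozenset((u, v))
        for u, v in edges
        if u != v and u in node_set and v in node_set
    }
    k = len(node_set)
    clique_edges = k * (k - 1) // 2
    return len(distinct) >= clique_edges - 1


if __name__ == "__main__":
    square_with_diagonals = [(1, 2), (2, 3), (3, 4), (4, 1), (1, 3), (2, 4)]
    print(is_dense([1, 2, 3, 4], square_with_diagonals))      # full clique
    print(is_dense([1, 2, 3, 4], square_with_diagonals[:4]))  # two edges missing
```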

4. Fingerprint identification

Fingerprints are defined as subgraphs found in at least 80% of the family (support), and at most 5% of the background (background occurrence). The aim for families in our data set is to have 10–1000 fingerprints; support and background occurrence are adjusted for heterogeneous or small families until the number of fingerprints is in this range.
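The two thresholds in this step amount to a filter over candidate subgraphs. The sketch below assumes that occurrence counts per pattern are already available (for example, from the mining step); the function name, argument names, and dictionary layout are hypothetical.

```python
def select_fingerprints(family_hits, background_hits, n_family, n_background,
                        min_support=0.80, max_background=0.05):
    """Keep patterns that are frequent in the family and rare in the background.

    family_hits / background_hits: {pattern_id: number of proteins containing it}
    n_family / n_background: number of proteins in each set
    """
    selected = []
    for pattern, fam_count in family_hits.items():
        support = fam_count / n_family
        occurrence = background_hits.get(pattern, 0) / n_background
        if support >= min_support and occurrence <= max_background:
            selected.append(pattern)
    return sorted(selected)


if __name__ == "__main__":
    # Hypothetical counts for a 10-member family and the 6749-chain background.
    fam = {"p1": 9, "p2": 7, "p3": 10}
    bg = {"p1": 100, "p2": 3, "p3": 400}
    print(select_fingerprints(fam, bg, n_family=10, n_background=6749))
```

Adjusting `min_support` and `max_background` per family, as the text describes, would be a loop around this function that stops once the number of selected patterns falls in the target range.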

5. Search for fingerprints in query

We use a graph similarity index to speed up the subgraph isomorphism algorithm of Ullman (1976). For each node of the fingerprints and of a query structure, we create an index vector that stores the labels of neighboring nodes and of the edges connecting to them, and we consider a query node as a candidate image of a fingerprint node only if their index vectors match. This reduces billions of potential embeddings to a handful in most cases. Ullman's algorithm then finds all embeddings of the fingerprint in the query that match node and edge labels. For further details of the index, see Bandyopadhyay et al. (2004).
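The idea of pruning candidates with a neighborhood index before searching can be sketched as below. Two caveats apply: the index here is a multiset of (edge label, neighbor label) pairs with a containment test, which is only one plausible reading of "index vectors match", and a plain backtracking search stands in for Ullman's algorithm. All names and the graph encoding are assumptions.

```python
from collections import Counter


def adjacency(edges):
    """Build {node: {neighbor: edge_label}} from {(u, v): edge_label}."""
    adj = {}
    for (u, v), label in edges.items():
        adj.setdefault(u, {})[v] = label
        adj.setdefault(v, {})[u] = label
    return adj


def index_vector(node, labels, adj):
    """Multiset of (edge label, neighbor node label) pairs around a node."""
    return Counter((elabel, labels[nbr]) for nbr, elabel in adj.get(node, {}).items())


def find_embeddings(p_labels, p_edges, q_labels, q_edges):
    """Return all label-preserving embeddings of pattern p in query q."""
    p_adj, q_adj = adjacency(p_edges), adjacency(q_edges)
    p_index = {n: index_vector(n, p_labels, p_adj) for n in p_labels}
    q_index = {n: index_vector(n, q_labels, q_adj) for n in q_labels}
    # Prefilter: same node label, and the pattern node's index must be
    # contained in the query node's index (empty difference).
    candidates = {
        p: [q for q in q_labels
            if q_labels[q] == p_labels[p] and not (p_index[p] - q_index[q])]
        for p in p_labels
    }
    order = sorted(p_labels, key=lambda p: len(candidates[p]))
    results = []

    def extend(mapping):
        if len(mapping) == len(order):
            results.append(dict(mapping))
            return
        p = order[len(mapping)]
        for q in candidates[p]:
            if q in mapping.values():
                continue
            # Every pattern edge to an already-mapped node must exist in
            # the query with the same edge label.
            if all(q_adj.get(q, {}).get(mapping[n]) == elabel
                   for n, elabel in p_adj.get(p, {}).items() if n in mapping):
                mapping[p] = q
                extend(mapping)
                del mapping[p]

    extend({})
    return results


if __name__ == "__main__":
    pattern_labels = {"a": "D", "b": "H", "c": "S"}
    pattern_edges = {("a", "b"): 1, ("b", "c"): 2, ("a", "c"): 1}
    query_labels = {1: "D", 2: "H", 3: "S", 4: "D", 5: "G"}
    query_edges = {(1, 2): 1, (2, 3): 2, (1, 3): 1, (4, 2): 1, (4, 5): 0}
    print(find_embeddings(pattern_labels, pattern_edges, query_labels, query_edges))
```

In the toy query, node 4 carries the right residue label but is rejected by the index before any search, because it has no neighbor matching the pattern's second edge.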

6. Assigning significance

We assign significance to the function inference by comparing the number of fingerprints found against the distribution of fingerprints in background proteins and in family members. Because these distributions are not normal, we calculate P-values empirically. By picking different numbers of fingerprints at which to infer family membership, we can determine the rates of true and false positives and negatives, calculate specificity and sensitivity, and draw ROC curves as shown in the inset of Figure 1, A and B. We choose two cutoff points for each family, i.e., a sensitivity cutoff to maximize sensitivity with at least 95% specificity, and a higher 99%-specificity cutoff with no constraints on sensitivity.
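The empirical calculation can be illustrated with plain counting. The sketch below assumes that a structure is called a family member when it contains at least a threshold number of fingerprints, and that each cutoff is the smallest threshold reaching the required specificity (which, because sensitivity can only fall as the threshold rises, also maximizes sensitivity). The function names and the toy data are hypothetical, not values from this study.

```python
def rates(threshold, family_counts, background_counts):
    """Sensitivity and specificity when calling 'member' at >= threshold."""
    sensitivity = sum(c >= threshold for c in family_counts) / len(family_counts)
    specificity = sum(c < threshold for c in background_counts) / len(background_counts)
    return sensitivity, specificity


def cutoff_for_specificity(family_counts, background_counts, min_specificity):
    """Smallest fingerprint count whose specificity reaches min_specificity.

    Returns (threshold, sensitivity, specificity).
    """
    highest = max(max(family_counts), max(background_counts))
    # At highest + 1 no background protein qualifies, so the loop always returns.
    for threshold in range(1, highest + 2):
        sensitivity, specificity = rates(threshold, family_counts, background_counts)
        if specificity >= min_specificity:
            return threshold, sensitivity, specificity


def empirical_p_value(observed, background_counts):
    """Fraction of background proteins with at least `observed` fingerprints."""
    return sum(c >= observed for c in background_counts) / len(background_counts)


if __name__ == "__main__":
    # Hypothetical fingerprint counts per protein.
    background = [0] * 90 + [5] * 6 + [10] * 3 + [20]
    family = [30, 35, 40, 8]
    print(cutoff_for_specificity(family, background, 0.95))
    print(cutoff_for_specificity(family, background, 0.99))
    print(empirical_p_value(30, background))
```

Sweeping `threshold` over its full range and plotting the pairs returned by `rates` would give a ROC curve of the kind shown in the figure insets.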

Electronic supplemental material

Table I describes the SCOP families for which we obtained fingerprints. Tables II–IV give results from the SCOP validation experiment. Other Supplemental data, including kinemages showing the graph representations of fingerprints found in structural genomics targets YcdX and Yyce, may be viewed at http://www.cs.unc.edu/~debug/papers/FuncInf.

Acknowledgments

D.B. and J.S. gratefully acknowledge support from NSF grants 9988742 and 0076984; J.H., J.L., J.P., and W.W. from the Microsoft Research eScience RFP award; and A.T. appreciates the support from the NSF grant ITR/MCB 011289 and a grant from North Carolina–Israel Research Partnership NCI 1999032. We thank Ruchir Shah for many useful discussions.

Footnotes

Supplemental material: see www.proteinscience.org

Reprint requests to: Alexander Tropsha, UNC School of Pharmacy, Medicinal Chemistry and Natural Products, CB# 7360 Beard Hall, Room 327A, University of North Carolina, Chapel Hill, NC 27599-7360, USA; e-mail: tropsha@email.unc.edu; fax: (919) 966-0204.

Article and publication are at http://www.proteinscience.org/cgi/doi/10.1110/ps.062189906.

References

Aloy P., Querol E., Aviles F.X., Sternberg M.J. 2001. Automated structure-based prediction of functional sites in proteins: Applications to assessing the validity of inheriting protein function from homology in genome annotation and to protein docking. J. Mol. Biol. 311: 395–408.
Aravind L. and Koonin E.V. 1998. Phosphoesterase domains associated with DNA polymerases of diverse origins. Nucleic Acids Res. 26: 3746–3752.
Artymiuk P.J., Poirrette A.R., Grindley H.M., Rice D.W., Willett P. 1994. A graph-theoretic approach to the identification of three-dimensional patterns of amino acid side-chains in protein structures. J. Mol. Biol. 243: 327–344.
Bairoch A. 2000. The enzyme database in 2000. Nucleic Acids Res. 28: 304–305.
Bandyopadhyay D. 2005. “A geometric framework for robust nearest neighbor analysis of protein structure and function.” Ph.D. thesis. University of North Carolina, Chapel Hill, NC.
Bandyopadhyay D. and Snoeyink J. 2004. Almost-Delaunay simplices: Nearest neighbor relations for imprecise points. In ACM/SIAM Symposium On Discrete Algorithms, New Orleans, LA. pp. 403–412.
Bandyopadhyay D., Huan J., Liu J., Wang W., Prins J., Snoeyink J. 2004. “Using fast subgraph isomorphism checking for protein functional annotation using SCOP and gene ontology.” UNC Computer Science Technical Report TR04-031. Department of Computer Science, University of North Carolina at Chapel Hill, Chapel Hill, NC.
Burley S.K. 2000. An overview of structural genomics. Nat. Struct. Biol. 7 (Suppl): 932–934.
Camon E., Magrane M., Barrell D., Lee V., Dimmer E., Maslen J., Binns D., Harte N., Lopez R., Apweiler R. 2004. The Gene Ontology Annotation (GOA) database: Sharing knowledge in Uniprot with Gene Ontology. Nucleic Acids Res. 32: D262–D266.
Cheng G., Qian B., Samudrala R., Baker D. 2005. Improvement in protein functional site prediction by distinguishing structural and functional constraints on protein family evolution using computational design. Nucleic Acids Res. 33: 5861–5867.
Gene Ontology Consortium. 2004. The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res. 32: D258–D261.
Gough J. and Chothia C. 2002. SUPERFAMILY: HMMs representing all proteins of known structure. SCOP sequence searches, alignments and genome assignments. Nucleic Acids Res. 30: 268–272.
Holm L. and Sander C. 1996. Mapping the protein universe. Science 273: 595–602.
Huan J., Wang W., Bandyopadhyay D., Snoeyink J., Prins J., Tropsha A. 2004. Mining protein family specific residue packing patterns from protein structure graphs. Eighth Annual International Conference on Research in Computational Molecular Biology (RECOMB), San Diego, CA. pp. 308–315.
Huan J., Bandyopadhyay D., Wang W., Snoeyink J., Prins J., Tropsha A. 2005. Comparing graph representations of protein structure for mining family-specific residue-based packing motifs. J. Comput. Biol. 12: 657–671.
Humphrey W., Dalke A., Schulten K. 1996. VMD–Visual molecular dynamics. J. Mol. Graph. 14: 33–38.
Kinch L.N., Qi Y., Hubbard T.J., Grishin N.V. 2003. CASP5 target classification. Proteins 53 (Suppl 6): 340–351.
Ko J., Murga L.F., Wei Y., Ondrechen M.J. 2005. Prediction of active sites for protein structures from computed chemical properties. Bioinformatics 21 (Suppl 1): i258–i265.
Laskowski R.A., Watson J.D., Thornton J.M. 2005a. ProFunc: A server for predicting protein function from 3D structure. Nucleic Acids Res. 33: W89–W93.
Laskowski R.A., Watson J.D., Thornton J.M. 2005b. Protein function prediction using local 3D templates. J. Mol. Biol. 351: 614–626.
Madera M., Vogel C., Kummerfeld S.K., Chothia C., Gough J. 2004. The SUPERFAMILY database in 2004: Additions and improvements. Nucleic Acids Res. 32: D235–D239.
Milik M., Szalma S., Olszewski K.A. 2003. Common structural cliques: A tool for protein structure and function analysis. Protein Eng. 16: 543–552.
Murzin A.G., Brenner S.E., Hubbard T., Chothia C. 1995. SCOP: A structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247: 536–540.
Ofran Y., Punta M., Schneider R., Rost B. 2005. Beyond annotation transfer by homology: Novel protein-function prediction methods to assist drug discovery. Drug Discov. Today 10: 1475–1482.
Pal D. and Eisenberg D. 2005. Inference of protein function from protein structure. Structure (Camb) 13: 121–130.
Pazos F. and Sternberg M.J. 2004. Automated prediction of protein function and detection of functional sites from structure. Proc. Natl. Acad. Sci. 101: 14754–14759.
Pegg S.C., Brown S., Ojha S., Huang C.C., Ferrin T.E., Babbitt P.C. 2005. Representing structure-function relationships in mechanistically diverse enzyme superfamilies. Pac. Symp. Biocomput. 358–369.
Serres M.H., Goswami S., Riley M. 2004. GenProtEC: An updated and improved analysis of functions of Escherichia coli K-12 proteins. Nucleic Acids Res. 32: D300–D302.
Stark A. and Russell R.B. 2003. Annotation in three dimensions. PINTS: Patterns in non-homologous tertiary structures. Nucleic Acids Res. 31: 3341–3344.
Stark A., Shkumatov A., Russell R.B. 2004. Finding functional sites in structural genomics proteins. Structure 12: 1405–1412.
Teplyakov A., Obmolova G., Khil P.P., Howard A.J., Camerini-Otero R.D., Gilliland G.L. 2003. Crystal structure of the Escherichia coli YcdX protein reveals a trinuclear zinc active site. Proteins 51: 315–318.
Ullman J.R. 1976. An algorithm for subgraph isomorphism. Journal of the ACM 23: 31–42.
Wang G. and Dunbrack R.L. 2003. PISCES: A protein sequence culling server. Bioinformatics 19: 1589–1591.
Wang K. and Samudrala R. 2005. FSSA: A novel method for identifying functional signatures from structural alignments. Bioinformatics 21: 2969–2977.
Wangikar P.P., Tendulkar A.V., Ramya S., Mali D.N., Sarawagi S. 2003. Functional sites in protein families uncovered via an objective and automated graph theoretic approach. J. Mol. Biol. 326: 955–978.
Wilson C.A., Kreychman J., Gerstein M. 2000. Assessing annotation transfer for genomics: Quantifying the relations between protein sequence, structure and function through traditional and probabilistic scores. J. Mol. Biol. 297: 233–249.

