Abstract
Protein domain prediction is important for protein structure prediction, structure determination, function annotation, mutagenesis analysis and protein engineering. Here we describe an accurate protein domain prediction server (DOMAC) combining both template-based and ab initio methods. The preliminary version of the server was ranked among the top domain prediction servers in the seventh edition of Critical Assessment of Techniques for Protein Structure Prediction (CASP7), 2006. DOMAC server and datasets are available at: http://www.bioinfotool.org/domac.html
INTRODUCTION
Protein domains are structural, functional and evolutionary units of proteins. The prediction of domains from sequence information can improve tertiary structure prediction (1), enhance protein function annotation (2), aid structure determination (3) and guide protein engineering (4) and mutagenesis (5).
A number of different methods have been developed to identify domains starting from primary sequences. These methods can be roughly classified into four categories: template-based methods (6–10), ab initio (template-free) methods (11–22), the hybrid approach combining template-based and ab initio methods (23), and meta-domain prediction methods (24).
Here we describe an accurate, hybrid domain prediction server (DOMAC) that integrates homology modeling, domain parsing and ab initio methods together. The preliminary implementation of the server [under the name: FOLDpro (25)] participated in the domain evaluation in the seventh edition of Critical Assessment of Techniques for Protein Structure Prediction (CASP7) (26,27). It was ranked among the top domain prediction servers in CASP7.
IMPLEMENTATION
Our hybrid approach uses the template-based method to predict domains for proteins that have homologous template structures in the Protein Data Bank (PDB) (28), and the ab initio method based on neural networks (29) to predict domains for de novo proteins. It predicts protein domains in two steps.
First, it uses PSI-BLAST (30) to search the target sequence against the NCBI Non-Redundant sequence database to construct a profile. The profile is then used to search a template structure library built from the proteins in PDB to identify templates, similar to the PDB-BLAST approach (31).
Second, if some significant templates are identified (e-value ⩽0.001), it generates a structure model for the target using Modeller (32) based on the template structures. Multiple significant templates are combined to improve model quality if available. Then it uses an accurate domain parsing tool PDP (33) to parse the model into domains. If the parsed domains do not cover the whole target sequence, DOMAC will assign uncovered regions to adjacent domains.
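The last step above, assigning uncovered regions to adjacent domains, can be sketched as follows. This is only an illustration: the text does not say which adjacent domain receives an uncovered residue, so the sketch assumes the nearest covered residue in sequence decides, with ties going to the preceding domain. The function name `fill_uncovered` and the per-residue label list are hypothetical.

```python
def fill_uncovered(labels):
    """Assign residues without a domain (None) to an adjacent domain.

    labels: list with one entry per residue, a domain id or None.
    Assumption (not stated in the text): the nearest covered residue
    decides, and ties go to the preceding (N-terminal) side.
    """
    covered = [i for i, d in enumerate(labels) if d is not None]
    if not covered:
        raise ValueError("no residue is assigned to any domain")
    filled = list(labels)
    for i, d in enumerate(labels):
        if d is None:
            # Sort key: distance first, then index, so the earlier residue wins ties.
            nearest = min(covered, key=lambda j: (abs(j - i), j))
            filled[i] = labels[nearest]
    return filled


# Toy example: a parser left the first residue, a 3-residue gap and the last residue uncovered.
parsed = [None, 1, 1, None, None, None, 2, 2, None]
filled = fill_uncovered(parsed)
```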
If no significant homologous template is found, DOMAC invokes the ab initio domain predictor DOMpro (29). DOMpro uses neural networks in conjunction with the sequence profile, predicted secondary structure and predicted relative solvent accessibility to predict domain boundaries. The secondary structure and relative solvent accessibility are predicted by SSpro (34) and ACCpro (35) in the SCRATCH suite (36). DOMpro tries to identify domain boundary positions based on the compositional bias of sequence and structural features in domain linker regions.
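The choice between the two branches can be summarized in a few lines. The sketch below is a simplified illustration and not the server's code: it assumes the template search has already produced a list of `(template_id, e_value)` pairs, and the names `choose_method` and `E_VALUE_CUTOFF` as well as the template identifiers are hypothetical. Only the 0.001 e-value threshold comes from the text.

```python
E_VALUE_CUTOFF = 0.001  # significance threshold quoted in the text


def choose_method(template_hits, cutoff=E_VALUE_CUTOFF):
    """Pick the prediction branch from template search results.

    template_hits: list of (template_id, e_value) pairs (hypothetical format).
    Returns the method name and the significant templates, best hit first.
    """
    significant = sorted(
        (hit for hit in template_hits if hit[1] <= cutoff),
        key=lambda hit: hit[1],
    )
    if significant:
        # Template branch: build a model from these templates, then parse it into domains.
        return "template-based", [template_id for template_id, _ in significant]
    # No significant template: fall back to the ab initio predictor.
    return "ab initio", []


# Made-up identifiers and e-values, for illustration only.
with_template = choose_method([("1abcA", 1e-20), ("2xyzB", 0.5)])
without_template = choose_method([("2xyzB", 0.5)])
```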
The preliminary implementation of DOMAC participated in CASP7 and was ranked first among 13 domain prediction servers. Since then, we have substantially sped up the template identification process without sacrificing accuracy, and added a module that updates the template library weekly to incorporate newly released proteins in PDB.
RESULTS
Here we first describe the performance of the preliminary implementation of DOMAC in CASP7 (under the server name FOLDpro). We compare it with 12 other server predictors in CASP7 using two evaluation metrics: the CASP evaluation metric (37) and domain number accuracy.
The CASP metric, the normalized domain overlap (NDO) score, measures the overlap between true and predicted domains without explicitly checking domain number or domain boundaries (37). It counts the residues that are correctly and incorrectly overlapped between true domains and predicted domains, and combines these counts into a single score for each target. The best score for a target is 1 and the worst score is 0. The domain number accuracy is defined as the percentage of targets with correct domain number predictions.
Table 1 reports the performance of 13 servers on 95 targets in CASP7. The CASP score is the average domain overlap score across all predicted targets. The domain number accuracy is computed by comparing the domain number predictions with the official domain definitions released by CASP7. In terms of both evaluation metrics, the preliminary implementation of DOMAC (FOLDpro) yielded the best performance.
Table 1.
The performance of 13 domain prediction servers in CASP7
Method | Target Num | Domain Num Acc. (%) | CASP7 Score |
---|---|---|---|
FOLDpro (DOMAC) | 95 | 93.7 | 0.963 |
Baker-RosettaDom (23) | 94 | 86.2 | 0.940 |
Ma-OPUS-DOM | 94 | 87.2 | 0.933 |
ROBETTA-GINZU (23) | 94 | 84.0 | 0.932 |
DomSSEA (7) | 94 | 78.7 | 0.910 |
HHpred3 (38) | 95 | 75.8 | 0.910 |
Meta-DP (24) | 95 | 74.7 | 0.907 |
HHpred1 (38) | 93 | 75.3 | 0.902 |
DomFOLD | 95 | 75.8 | 0.898 |
DPS (13) | 93 | 75.3 | 0.889 |
Chop (22) | 83 | 56.6 | 0.827 |
Distill (39) | 95 | 70.5 | 0.819 |
NN_PUT-Lab | 92 | 58.7 | 0.795 |
The second column (target num) lists the number of targets for which a predictor made predictions.
We also evaluate DOMAC on the three categories of CASP7 targets: highly homologous, homologous and analogous/ab initio. The domain number prediction accuracy of DOMAC is 96%, 94% and 88% in the three categories, respectively.
However, because the majority (68 out of 95) of CASP7 targets are single-domain proteins, the domain prediction accuracy is very likely over-estimated.
Thus, we evaluate DOMAC on a larger, balanced, high-quality dataset manually curated by Holland et al. (2). The publicly released version of Holland's benchmark2 dataset has 156 proteins, consisting of 54 single-domain proteins, 69 two-domain proteins, 25 three-domain proteins, 4 four-domain proteins, 3 five-domain proteins and 1 six-domain protein. We evaluate the template-based and ab initio methods separately on the whole dataset. Table 2 reports the specificity and sensitivity of each method in each category in terms of domain numbers. The overall domain number prediction accuracy of the template-based and ab initio methods is 75% and 46%, respectively.
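To make the concern concrete, the sketch below computes domain number accuracy, as defined earlier, for a trivial predictor that calls every target single-domain. Only the 68-of-95 split comes from the text; the domain counts of the 27 multi-domain targets are not given here, so the value 2 is used purely as a placeholder (any value other than 1 gives the same result). The function name is hypothetical.

```python
def domain_number_accuracy(true_counts, predicted_counts):
    """Percentage of targets whose predicted domain number equals the true one."""
    if len(true_counts) != len(predicted_counts):
        raise ValueError("both lists must cover the same targets")
    if not true_counts:
        raise ValueError("at least one target is required")
    correct = sum(t == p for t, p in zip(true_counts, predicted_counts))
    return 100.0 * correct / len(true_counts)


# 68 single-domain targets; the other 27 get a placeholder count of 2.
true_counts = [1] * 68 + [2] * 27
always_single = [1] * 95
baseline = domain_number_accuracy(true_counts, always_single)
print(f"{baseline:.1f}%")  # 68/95, about 71.6%
```

Under these assumptions a predictor with no real signal already reaches roughly 72% on this target set, which is why a more balanced benchmark is informative.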
Table 2.
The specificity and sensitivity of domain number prediction on Holland's dataset using the template-based and ab initio methods
Method | Measure (%) | 1-dom | 2-dom | 3-dom | 4-dom | 5-dom | 6-dom |
---|---|---|---|---|---|---|---|
Template | Sens. | 96.1 | 66.7 | 56.0 | 75.0 | 66.7 | – |
Template | Spec. | 74.2 | 88.0 | 70.0 | 42.9 | 33.3 | – |
Ab initio | Sens. | 88.5 | 31.3 | 12.0 | – | – | – |
Ab initio | Spec. | 46.5 | 48.8 | 30.0 | – | – | – |
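Per-category values of this kind could be computed along the following lines. The definitions are an assumption, since the text does not spell them out: sensitivity for category k is taken as the fraction of true k-domain proteins predicted to have k domains, and specificity as the fraction of k-domain predictions that are correct. The function name and the toy data are hypothetical and do not reproduce Table 2.

```python
from collections import Counter


def per_class_sens_spec(true_counts, predicted_counts):
    """Return {k: (sensitivity %, specificity %)} per domain-number category.

    Assumed definitions (not stated in the text):
      sensitivity = correct k-domain predictions / true k-domain proteins
      specificity = correct k-domain predictions / all k-domain predictions
    A value is None when its denominator is zero.
    """
    n_true = Counter(true_counts)
    n_pred = Counter(predicted_counts)
    n_hit = Counter(t for t, p in zip(true_counts, predicted_counts) if t == p)
    result = {}
    for k in sorted(set(n_true) | set(n_pred)):
        sens = 100.0 * n_hit[k] / n_true[k] if n_true[k] else None
        spec = 100.0 * n_hit[k] / n_pred[k] if n_pred[k] else None
        result[k] = (sens, spec)
    return result


# Toy data: six proteins with true and predicted domain numbers.
scores = per_class_sens_spec([1, 1, 1, 2, 2, 3], [1, 1, 2, 2, 1, 3])
```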
Moreover, we assess the accuracy of domain boundary prediction, which is important for generating hypotheses for crystallizing individual protein domains. Following the convention of previous work (7,22), a predicted boundary within 20 residues of a true domain boundary is considered correct.
The domain boundary specificity and sensitivity are 50% and 76.5% for the template-based method, and 27% and 14% for the ab initio method. Thus, the accuracy of the template-based method is sufficient for guiding crystallization experiments, whereas the ab initio method is not always reliable enough for general, practical use.
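A minimal sketch of the boundary evaluation under the 20-residue convention is given below. The details are assumptions rather than the exact protocol used here: the tolerance is treated as inclusive, sensitivity is the share of true boundaries matched by at least one prediction, and specificity is the share of predicted boundaries that fall near a true boundary. The function name and the example positions are hypothetical.

```python
def boundary_sens_spec(true_boundaries, predicted_boundaries, tolerance=20):
    """Return (sensitivity %, specificity %) for domain boundary positions.

    A boundary counts as matched when another boundary lies within
    `tolerance` residues (inclusive, by assumption). A value is None
    when its denominator is zero.
    """
    def near(pos, others):
        return any(abs(pos - other) <= tolerance for other in others)

    found = sum(near(t, predicted_boundaries) for t in true_boundaries)
    right = sum(near(p, true_boundaries) for p in predicted_boundaries)
    sens = 100.0 * found / len(true_boundaries) if true_boundaries else None
    spec = 100.0 * right / len(predicted_boundaries) if predicted_boundaries else None
    return sens, spec


# Made-up positions: one of two true boundaries is found, one of three predictions is correct.
sens, spec = boundary_sens_spec([100, 250], [110, 180, 400])
```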
USE OF WEB SERVICE
DOMAC is used through a simple input form. Since the reliability assessment of domain predictions is still an open issue, the user is advised to use the accuracy on Holland's dataset to decide how to use these predictions. The input form requires only three inputs: email address, target name and protein sequence. DOMAC usually makes predictions within 15 min and sends the results back to users by email.
Domain prediction results include the user-defined target name, the protein sequence, the predicted domain number, the start and end positions of each domain and the method (template-based or ab initio). For template-based prediction, it also reports the PDB codes of the templates. Figure 1 shows an output example for the CASP7 target T0324.
Figure 1.
Domain prediction result of CASP7 target T0324. The protein is predicted to have two domains. Domain 1 has two non-continuous segments, spanning from residues 1 to 16 and residues 82 to 208, respectively. Domain 2 spans from residues 17 to 81. The templates used to make the domain prediction are identified by PDB code + chain id. The chain in a single-chain protein is always assigned chain id ‘A’ instead of ‘-’.
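For downstream processing, a prediction with non-continuous domains such as this one can be held as a mapping from domain number to residue segments. The sketch below is an illustration only: the segment coordinates are taken from the caption, the sequence length of 208 is assumed from the last residue mentioned there, and the function name and data layout are hypothetical rather than the server's output format.

```python
def residue_labels(domains, length):
    """Expand {domain_id: [(start, end), ...]} into one label per residue.

    Positions are 1-based and inclusive. Raises ValueError if a segment
    leaves the sequence or two segments overlap.
    """
    labels = [None] * length
    for domain_id, segments in domains.items():
        for start, end in segments:
            for pos in range(start, end + 1):
                if not 1 <= pos <= length:
                    raise ValueError(f"position {pos} is outside the sequence")
                if labels[pos - 1] is not None:
                    raise ValueError(f"position {pos} is assigned twice")
                labels[pos - 1] = domain_id
    return labels


# Segments as described in the caption; length 208 is an assumption.
t0324 = {1: [(1, 16), (82, 208)], 2: [(17, 81)]}
labels = residue_labels(t0324, 208)
print(labels.count(1), labels.count(2))  # 143 65
```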
CONCLUSION AND FUTURE WORK
We have developed a hybrid domain prediction web service integrating template-based and ab initio methods. The template-based method is accurate enough for guiding protein structure prediction, structure determination, function annotation, mutagenesis analysis and protein engineering. However, the ab initio method still needs to be improved for practical use. Protein domain architecture is largely shaped by gene recombination events, such as gene fusion, fission, domain swapping and exon exchange. Leveraging the evolutionary gene recombination signals embedded in the multiple sequence alignment of a protein family, together with the exon boundaries (or splicing sites) in its gene structure, may therefore help improve ab initio domain prediction significantly.
ACKNOWLEDGEMENTS
J.C. is very grateful to Dr Pierre Baldi for the support during his PhD research at University of California Irvine.
Funding to pay the Open Access publication charges for this article was provided by the New Faculty Start-Up Grant at the University of Central Florida.
Conflict of interest statement. None declared.
REFERENCES
- 1. Chivian D, Kim DE, Malmstrom L, Bradley P, Robertson T, Murphy P, Strauss CE, Bonneau R, Rohl CA, et al. Automated prediction of CASP-5 structures using the Robetta server. Proteins. 2003;53(S6):524–533. doi: 10.1002/prot.10529.
- 2. Holland T, Veretnik S, Shindyalov IN, Bourne PE. A benchmark for domain assignment from protein 3-dimensional structure and its applications. J. Mol. Biol. 2006;361:562–590. doi: 10.1016/j.jmb.2006.05.060.
- 3. Campbell I, Downing A. Building protein structure and function from modular units. Trends Biotechnol. 1994;12:168–172. doi: 10.1016/0167-7799(94)90078-7.
- 4. Guerois R, Serrano L. Protein design based on folding models. Curr. Opin. Struct. Biol. 2001;11:101–106. doi: 10.1016/s0959-440x(00)00170-6.
- 5. Nielsen P, Yamada Y. Identification of cell-binding sites on the Laminin a5 n-terminal domain by site-directed mutagenesis. J. Biol. Chem. 2001;276:10906–10912. doi: 10.1074/jbc.M008743200.
- 6. Heger A, Holm L. Exhaustive enumeration of protein domain families. J. Mol. Biol. 2003;328:749–767. doi: 10.1016/s0022-2836(03)00269-9.
- 7. Marsden RL, McGuffin LJ, Jones DT. Rapid protein domain assignment from amino acid sequence using predicted secondary structure. Protein Sci. 2002;11:2814–2824. doi: 10.1110/ps.0209902.
- 8. von Ohsen N, Sommer I, Zimmer R, Lengauer T. Arby: automatic protein structure prediction using profile-profile alignment and confidence measures. Bioinformatics. 2004;20:2228–2235. doi: 10.1093/bioinformatics/bth232.
- 9. Gewehr JE, Zimmer R. SSEP-Domain: protein domain prediction by alignment of secondary structure elements and profiles. Bioinformatics. 2006;22:181–187. doi: 10.1093/bioinformatics/bti751.
- 10. Coin L, Bateman A, Durbin R. Enhanced protein domain discovery using taxonomy. BMC Bioinformat. 2004;5:56. doi: 10.1186/1471-2105-5-56.
- 11. Park J, Teichmann SA. DIVCLUS: an automatic method in the GEANFAMMER package that finds homologous domains in single- and multi-domain proteins. Bioinformatics. 1998;14:144–150. doi: 10.1093/bioinformatics/14.2.144.
- 12. Gouzy J, Corpet F, Kahn D. Whole genome protein domain analysis using a new method for domain clustering. Comput. Chem. 1999;23:333–340. doi: 10.1016/s0097-8485(99)00011-x.
- 13. Bryson K, McGuffin LJ, Marsden RL, Ward JJ, Sodhi JS, Jones DT. Protein structure prediction servers at University College London. Nucleic Acids Res. 2005;33:W36–W38. doi: 10.1093/nar/gki410.
- 14. George RA, Heringa J. SnapDRAGON: a method to delineate protein structural domains from sequence data. J. Mol. Biol. 2002;316:839–851. doi: 10.1006/jmbi.2001.5387.
- 15. Linding R, Russell RB, Neduva V, Gibson TJ. GlobPlot: Exploring protein sequences for globularity and disorder. Nucleic Acids Res. 2003;31:3701–3708. doi: 10.1093/nar/gkg519.
- 16. Nagarajan N, Yona G. Automatic prediction of protein domains from sequence information using a hybrid learning system. Bioinformatics. 2004;20:1335–1360. doi: 10.1093/bioinformatics/bth086.
- 17. Wheelan SJ, Marchler-Bauer A, Bryant SH. Domain size distributions can predict domain boundaries. Bioinformatics. 2000;16:613–618. doi: 10.1093/bioinformatics/16.7.613.
- 18. Sim J, Kim SY, Lee J. PPRODO: prediction of protein domain boundaries using neural networks. Proteins. 2005;59:627–632. doi: 10.1002/prot.20442.
- 19. Chen L, Wang W, Ling S, Jia C, Wang F. Kemadom: a web server for domain prediction using kernel machine with local context. Nucleic Acids Res. 2006;34:W158–W163. doi: 10.1093/nar/gkl331.
- 20. Adams R, Das S, Smith T. Multiple domain protein diagnostic patterns. Prot. Sci. 1996;5:1240–1249. doi: 10.1002/pro.5560050703.
- 21. George R, Heringa J. Protein domain identification and improved sequence similarity search using PSI-BLAST. Protein Struct. Funct. Genet. 2002;48:672–681. doi: 10.1002/prot.10175.
- 22. Liu J, Rost B. Sequence-based prediction of protein domains. Nucleic Acids Res. 2004;32:3522–3530. doi: 10.1093/nar/gkh684.
- 23. Kim DE, Chivian D, Malmstrom L, Baker D. Automated prediction of domain boundaries in CASP6 targets using Ginzu and RosettaDOM. Proteins. 2005;61(Suppl. 7):193–200. doi: 10.1002/prot.20737.
- 24. Saini HK, Fischer D. Meta-DP: domain prediction meta server. Bioinformatics. 2005;21:2917–2920. doi: 10.1093/bioinformatics/bti445.
- 25. Cheng J, Baldi P. A Machine Learning Information Retrieval Approach to Protein Fold Recognition. Bioinformatics. 2006;22:1456–1463. doi: 10.1093/bioinformatics/btl102.
- 26. Moult J, Fidelis K, Zemla A, Hubbard T. Critical assessment of methods of protein structure prediction (CASP)-round V. Proteins. 2003;53(Suppl. 6):334–339. doi: 10.1002/prot.10556.
- 27. Moult J, Fidelis K, Rost B, Hubbard T, Tramontano A. Critical assessment of methods of protein structure prediction (CASP) - round 6. Proteins. 2005;61(Suppl. 7):3–7. doi: 10.1002/prot.20716.
- 28. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE. The Protein Data Bank. Nucleic Acids Res. 2000;28:235–242. doi: 10.1093/nar/28.1.235.
- 29. Cheng J, Sweredoski MJ, Baldi P. DOMpro: Protein domain prediction using profiles, secondary structure, relative solvent accessibility, and recursive neural networks. Data Mining and Knowledge Discovery. 2006;13:1–10.
- 30. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller AA, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–3402. doi: 10.1093/nar/25.17.3389.
- 31. Rychlewski L, Jaroszewski L, Li W, Godzik A. Comparison of sequence profiles. Strategies for structural predictions using sequence information. Protein Sci. 2000;9:232–241. doi: 10.1110/ps.9.2.232.
- 32. Sali A, Blundell TL. Comparative protein modelling by satisfaction of spatial restraints. J. Mol. Biol. 1993;234:779–815. doi: 10.1006/jmbi.1993.1626.
- 33. Alexandrov N, Shindyalov I. PDP: protein domain parser. Bioinformatics. 2003;19:429–430. doi: 10.1093/bioinformatics/btg006.
- 34. Pollastri G, Przybylski D, Rost B, Baldi P. Improving the prediction of protein secondary structure in three and eight classes using recurrent neural networks and profiles. Proteins. 2002;47:228–235. doi: 10.1002/prot.10082.
- 35. Pollastri G, Baldi P, Fariselli P, Casadio R. Prediction of coordination number and relative solvent accessibility in proteins. Proteins. 2002;47:142–153. doi: 10.1002/prot.10069.
- 36. Cheng J, Randall AZ, Sweredoski MJ, Baldi P. SCRATCH: a protein structure and structural feature prediction server. Nucleic Acids Res. 2005;33(web server issue):W72–W76. doi: 10.1093/nar/gki396.
- 37. Tai CH, Lee WJ, Vincent JJ, Lee B. Evaluation of domain prediction in CASP6. Proteins. 2005;61(Suppl. 7):183–192. doi: 10.1002/prot.20736.
- 38. Söding J. Protein homology detection by HMM-HMM comparison. Bioinformatics. 2005;21:951–960. doi: 10.1093/bioinformatics/bti125.
- 39. Bau D, Martin AJM, Mooney C, Vullo A, Walsh I, Pollastri G. Distill: A suite of web servers for the prediction of one-, two- and three dimensional structural features of proteins. BMC Bioinformat. 2006;7:402. doi: 10.1186/1471-2105-7-402.