Abstract
A new method is presented for identifying distantly related homologous proteins that are unrecognizable by conventional sequence comparison methods. The method combines information about functionally conserved sequence patterns with information about structural context. This information is encoded in stochastic discrete state-space models (DSMs) that comprise a new family of hidden Markov models, called sequence-pattern-embedded DSMs (pDSMs). The method identifies distantly related protein family members with high sensitivity and specificity, and is illustrated with trypsin-like serine proteases and globins. The strategy for building pDSMs is presented, and the method has been validated using carefully constructed positive and negative control sets. In addition to recognizing remote homologs, pDSM sequence analysis predicts secondary structures with higher sensitivity, specificity, and Q3 accuracy than DSM analysis, which omits information about conserved sequence patterns. The identification of trypsin-like serine proteases in new genomes is discussed.
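The abstract reports sensitivity, specificity, and Q3 accuracy for secondary-structure prediction but does not define them here. The sketch below shows how such measures are commonly computed, assuming the usual three-state alphabet (H for helix, E for strand, C for coil) and the standard per-residue definitions. The function names and example strings are hypothetical, and the article itself may define these measures differently.

```python
import math


def q3_accuracy(predicted: str, observed: str) -> float:
    """Fraction of residues whose three-state label (H, E, C) matches."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("sequences must not be empty")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)


def state_sensitivity_specificity(predicted: str, observed: str, state: str):
    """Per-residue sensitivity and specificity for one state.

    Assumed definitions (not taken from the article):
      sensitivity = TP / (TP + FN), specificity = TN / (TN + FP).
    """
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    tp = fn = fp = tn = 0
    for p, o in zip(predicted, observed):
        if o == state:
            if p == state:
                tp += 1
            else:
                fn += 1
        else:
            if p == state:
                fp += 1
            else:
                tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else math.nan
    specificity = tn / (tn + fp) if (tn + fp) else math.nan
    return sensitivity, specificity


# Made-up labels for illustration only; not data from the article.
observed = "HHHHCCEEEC"
predicted = "HHHCCCEEEE"
print(q3_accuracy(predicted, observed))
print(state_sensitivity_specificity(predicted, observed, "H"))
```

In secondary-structure work, "specificity" sometimes denotes the fraction of predicted residues that are correct (TP / (TP + FP)) rather than the form used above, so the definition should be checked against the full text before comparing numbers.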