Discovering active motifs in sets of related protein sequences and using them for classification

J T Wang; T G Marr; D Shasha; B A Shapiro; G W Chirn

Nucleic Acids Res. 1994 Jul 25;22(14):2769–2775. doi: 10.1093/nar/22.14.2769

PMCID: PMC308246 PMID: 8052532

Abstract

We describe a method for discovering active motifs in a set of related protein sequences. The method is an automatic two-step process: (1) find candidate motifs in a small sample of the sequences; (2) test whether these motifs are approximately present in all the sequences. To reduce the running time, we develop two optimization heuristics based on statistical estimation and pattern matching techniques. Experimental results obtained by running these algorithms on generated data and on functionally related proteins demonstrate the good performance of the presented method compared with the visual method of O'Farrell and Leopold. By combining the discovered motifs with an existing fingerprint technique, we develop a protein classifier. When we apply the classifier to the 698 groups of related proteins in the PROSITE catalog, it gives information that is complementary to the BLOCKS protein classifier of Henikoff and Henikoff. Thus, using our classifier in conjunction with theirs, one can obtain high-confidence classifications (if BLOCKS and our classifier agree) or suggest a new hypothesis (if the two disagree).
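The two-step process described in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes fixed-length substrings as candidate motifs, uses a mismatch count (Hamming distance) as a stand-in for "approximately present", and omits the two optimization heuristics. All function and parameter names are hypothetical.

```python
from __future__ import annotations


def candidate_motifs(sample: list[str], length: int) -> set[str]:
    """Step 1: collect every substring of the given length from the sample."""
    return {
        seq[i:i + length]
        for seq in sample
        for i in range(len(seq) - length + 1)
    }


def occurs_approximately(motif: str, seq: str, max_mismatches: int) -> bool:
    """True if some window of seq differs from motif in at most
    max_mismatches positions (assumed distance measure: Hamming)."""
    m = len(motif)
    for i in range(len(seq) - m + 1):
        mismatches = sum(a != b for a, b in zip(motif, seq[i:i + m]))
        if mismatches <= max_mismatches:
            return True
    return False


def discover_motifs(
    sequences: list[str],
    sample_size: int,
    length: int,
    max_mismatches: int,
) -> list[str]:
    """Step 2: keep only the candidates that occur approximately
    in every sequence of the set."""
    # The sample is simply the first few sequences (an assumption of this sketch).
    sample = sequences[:sample_size]
    candidates = candidate_motifs(sample, length)
    return sorted(
        motif
        for motif in candidates
        if all(occurs_approximately(motif, s, max_mismatches) for s in sequences)
    )
```

Drawing candidates from a small sample keeps the candidate set small, so the more expensive approximate-matching test against the full set runs on far fewer motifs than an exhaustive enumeration would produce.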

Selected References

These references are in PubMed. This may not be the complete list of references from this article.

Bacon D. J., Anderson W. F. Multiple sequence alignment. J Mol Biol. 1986 Sep 20;191(2):153–161. doi: 10.1016/0022-2836(86)90252-4.
Bains W. MULTAN: a program to align multiple DNA sequences. Nucleic Acids Res. 1986 Jan 10;14(1):159–177. doi: 10.1093/nar/14.1.159.
Bairoch A., Boeckmann B. The SWISS-PROT protein sequence data bank. Nucleic Acids Res. 1992 May 11;20 (Suppl):2019–2022. doi: 10.1093/nar/20.suppl.2019.
Bairoch A. PROSITE: a dictionary of sites and patterns in proteins. Nucleic Acids Res. 1992 May 11;20 (Suppl):2013–2018. doi: 10.1093/nar/20.suppl.2013.
Bashford D., Chothia C., Lesk A. M. Determinants of a protein fold. Unique features of the globin amino acid sequences. J Mol Biol. 1987 Jul 5;196(1):199–216. doi: 10.1016/0022-2836(87)90521-3.
Caserta M., Zacharias W., Nwankwo D., Wilson G. G., Wells R. D. Cloning, sequencing, in vivo promoter mapping, and expression in Escherichia coli of the gene for the HhaI methyltransferase. J Biol Chem. 1987 Apr 5;262(10):4770–4777.
Doolittle R. F., Feng D. F., Johnson M. S., McClure M. A. Origins and evolutionary relationships of retroviruses. Q Rev Biol. 1989 Mar;64(1):1–30. doi: 10.1086/416128.
Goad W. B., Kanehisa M. I. Pattern recognition in nucleic acid sequences. I. A general method for finding local homologies and symmetries. Nucleic Acids Res. 1982 Jan 11;10(1):247–263. doi: 10.1093/nar/10.1.247.
Gribskov M., McLachlan A. D., Eisenberg D. Profile analysis: detection of distantly related proteins. Proc Natl Acad Sci U S A. 1987 Jul;84(13):4355–4358. doi: 10.1073/pnas.84.13.4355.
Hanks S. K., Quinn A. M., Hunter T. The protein kinase family: conserved features and deduced phylogeny of the catalytic domains. Science. 1988 Jul 1;241(4861):42–52. doi: 10.1126/science.3291115.
Henikoff S., Henikoff J. G. Automated assembly of protein blocks for database searching. Nucleic Acids Res. 1991 Dec 11;19(23):6565–6572. doi: 10.1093/nar/19.23.6565.
Hunter T. A thousand and one protein kinases. Cell. 1987 Sep 11;50(6):823–829. doi: 10.1016/0092-8674(87)90509-5.
Johnson M. S., Doolittle R. F. A method for the simultaneous alignment of three or more amino acid sequences. J Mol Evol. 1986;23(3):267–278. doi: 10.1007/BF02115583.
Krishnan G., Kaul R. K., Jagadeeswaran P. DNA sequence analysis: a procedure to find homologies among many sequences. Nucleic Acids Res. 1986 Jan 10;14(1):543–550. doi: 10.1093/nar/14.1.543.
Lipman D. J., Altschul S. F., Kececioglu J. D. A tool for multiple sequence alignment. Proc Natl Acad Sci U S A. 1989 Jun;86(12):4412–4415. doi: 10.1073/pnas.86.12.4412.
Pósfai J., Bhagwat A. S., Pósfai G., Roberts R. J. Predictive motifs derived from cytosine methyltransferases. Nucleic Acids Res. 1989 Apr 11;17(7):2421–2435. doi: 10.1093/nar/17.7.2421.
Queen C., Wegman M. N., Korn L. J. Improvements to a program for DNA analysis: a procedure to find homologies among many sequences. Nucleic Acids Res. 1982 Jan 11;10(1):449–456. doi: 10.1093/nar/10.1.449.
Roytberg M. A. A search for common patterns in many sequences. Comput Appl Biosci. 1992 Feb;8(1):57–64. doi: 10.1093/bioinformatics/8.1.57.
Römisch K., Webb J., Herz J., Prehn S., Frank R., Vingron M., Dobberstein B. Homology of 54K protein of signal-recognition particle, docking protein and two E. coli proteins with putative GTP-binding domains. Nature. 1989 Aug 10;340(6233):478–482. doi: 10.1038/340478a0.
Smith H. O., Annau T. M., Chandrasegaran S. Finding sequence motifs in groups of functionally related proteins. Proc Natl Acad Sci U S A. 1990 Jan;87(2):826–830. doi: 10.1073/pnas.87.2.826.
Smith R. F., Smith T. F. Automatic generation of primary sequence patterns from sets of related protein sequences. Proc Natl Acad Sci U S A. 1990 Jan;87(1):118–122. doi: 10.1073/pnas.87.1.118.
Smith R. F., Smith T. F. Identification of new protein kinase-related genes in three herpesviruses, herpes simplex virus, varicella-zoster virus, and Epstein-Barr virus. J Virol. 1989 Jan;63(1):450–455. doi: 10.1128/jvi.63.1.450-455.1989.
Sobel E., Martinez H. M. A multiple sequence alignment program. Nucleic Acids Res. 1986 Jan 10;14(1):363–374. doi: 10.1093/nar/14.1.363.
Som S., Bhagwat A. S., Friedman S. Nucleotide sequence and expression of the gene encoding the EcoRII modification enzyme. Nucleic Acids Res. 1987 Jan 12;15(1):313–332. doi: 10.1093/nar/15.1.313.
Staden R. Searching for patterns in protein and nucleic acid sequences. Methods Enzymol. 1990;183:193–211. doi: 10.1016/0076-6879(90)83014-z.
Sznyter L. A., Slatko B., Moran L., O'Donnell K. H., Brooks J. E. Nucleotide sequence of the DdeI restriction-modification system and characterization of the methylase protein. Nucleic Acids Res. 1987 Oct 26;15(20):8249–8266. doi: 10.1093/nar/15.20.8249.
Taylor W. R. Identification of protein sequence homology by consensus template alignment. J Mol Biol. 1986 Mar 20;188(2):233–258. doi: 10.1016/0022-2836(86)90308-6.
Vihinen M. An algorithm for simultaneous comparison of several sequences. Comput Appl Biosci. 1988 Mar;4(1):89–92. doi: 10.1093/bioinformatics/4.1.89.
Vingron M., Argos P. A fast and sensitive multiple sequence alignment algorithm. Comput Appl Biosci. 1989 Apr;5(2):115–121. doi: 10.1093/bioinformatics/5.2.115.
Wallace J. C., Henikoff S. PATMAT: a searching and extraction program for sequence, pattern and block queries and databases. Comput Appl Biosci. 1992 Jun;8(3):249–254. doi: 10.1093/bioinformatics/8.3.249.
Waterman M. S., Arratia R., Galas D. J. Pattern recognition in several sequences: consensus and alignment. Bull Math Biol. 1984;46(4):515–527. doi: 10.1007/BF02459500.
Waterman M. S. Multiple sequence alignment by consensus. Nucleic Acids Res. 1986 Nov 25;14(22):9095–9102. doi: 10.1093/nar/14.22.9095.

