Protein Science: A Publication of the Protein Society. 1997 Mar;6(3):698–705. doi: 10.1002/pro.5560060319

Embedding strategies for effective use of information from multiple sequence alignments.

S Henikoff, J G Henikoff
PMCID: PMC2143675  PMID: 9070452

Abstract

We describe a new strategy for utilizing multiple sequence alignment information to detect distant relationships in searches of sequence databases. A single sequence representing a protein family is enriched by replacing conserved regions with position-specific scoring matrices (PSSMs) or consensus residues derived from multiple alignments of family members. In comprehensive tests of these and other family representations, PSSM-embedded queries produced the best results overall when used with a special version of the Smith-Waterman searching algorithm. Moreover, embedding consensus residues instead of PSSMs improved performance with readily available single sequence query searching programs, such as BLAST and FASTA. Embedding PSSMs or consensus residues into a representative sequence improves searching performance by extracting multiple alignment information from motif regions while retaining single sequence information where alignment is uncertain.
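The consensus-embedding idea can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how consensus residues are chosen, so this example assumes a simple unweighted majority vote per column, and the function names (`column_consensus`, `embed_consensus`), the toy sequence, and the block coordinates are all hypothetical.

```python
from collections import Counter


def column_consensus(block_rows):
    """Return the most frequent residue in each column of an ungapped block.

    Assumption: plain majority vote with no sequence weighting.
    """
    width = len(block_rows[0])
    if any(len(row) != width for row in block_rows):
        raise ValueError("all rows of a block must have the same width")
    consensus = []
    for column in zip(*block_rows):
        # most_common(1) returns the top residue; ties go to the residue
        # seen first in the column.
        residue, _count = Counter(column).most_common(1)[0]
        consensus.append(residue)
    return "".join(consensus)


def embed_consensus(representative, blocks):
    """Overwrite conserved regions of a representative sequence.

    `blocks` is a list of (start, rows) pairs, where `start` is the 0-based
    offset of the block in the representative sequence and `rows` are the
    aligned segments from family members. Residues outside the blocks are
    left unchanged.
    """
    residues = list(representative)
    for start, rows in blocks:
        cons = column_consensus(rows)
        end = start + len(cons)
        if start < 0 or end > len(residues):
            raise ValueError("block falls outside the representative sequence")
        residues[start:end] = cons
    return "".join(residues)


# Toy data, invented for illustration only.
representative = "MKTAYIAKQRQISFVKSHFSRQ"
blocks = [
    (2, ["TAYIA", "TGYLA", "SGYLA"]),
    (13, ["FVKSH", "FIRSH", "YIRSH"]),
]
embedded = embed_consensus(representative, blocks)
print(embedded)
```

The resulting string is still an ordinary single sequence, which is why a query of this kind can be passed to programs that accept only single-sequence input. Embedding PSSMs instead would require a search program that can score position-specific columns, as the abstract notes.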



Articles from Protein Science : A Publication of the Protein Society are provided here courtesy of The Protein Society