Nucleic Acids Research. 1998 Sep 1;26(17):3986–3990. doi: 10.1093/nar/26.17.3986

Protein sequence similarity searches using patterns as seeds.

Z Zhang, A A Schäffer, W Miller, T L Madden, D J Lipman, E V Koonin, S F Altschul
PMCID: PMC147803  PMID: 9705509

Abstract

Protein families often are characterized by conserved sequence patterns or motifs. A researcher frequently wishes to evaluate the significance of a specific pattern within a protein, or to exploit knowledge of known motifs to aid the recognition of greatly diverged but homologous family members. To assist in these efforts, the pattern-hit initiated BLAST (PHI-BLAST) program described here takes as input both a protein sequence and a pattern of interest that it contains. PHI-BLAST searches a protein database for other instances of the input pattern, and uses those found as seeds for the construction of local alignments to the query sequence. The random distribution of PHI-BLAST alignment scores is studied analytically and empirically. In many instances, the program is able to detect statistically significant similarity between homologous proteins that are not recognizably related using traditional single-pass database search methods. PHI-BLAST is applied to the analysis of CED4-like cell death regulators, HS90-type ATPase domains, archaeal tRNA nucleotidyltransferases and archaeal homologs of DnaG-type DNA primases.
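
The seeding step described in the abstract (find the input pattern in the query, then find other instances of it in the database) can be illustrated with a minimal sketch. This is not the PHI-BLAST implementation: it assumes a simplified PROSITE-style pattern syntax, the function names `prosite_to_regex` and `find_seeds` are hypothetical, the toy sequences are made up, and the alignment-extension and statistical steps are omitted entirely.

```python
import re


def prosite_to_regex(pattern):
    """Translate a simplified PROSITE-style pattern into a Python regex string.

    Supported elements (an assumed subset): single residues, x (any residue),
    [ABC] (allowed set), {ABC} (forbidden set), and repeats (n) or (n,m).
    """
    pieces = []
    for element in pattern.strip().rstrip(".").split("-"):
        # Separate the element from an optional repeat count such as (4) or (2,4).
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        if m is None:
            raise ValueError("empty pattern element")
        core, lo, hi = m.groups()
        if core == "x":
            piece = "."
        elif core.startswith("[") and core.endswith("]"):
            piece = core
        elif core.startswith("{") and core.endswith("}"):
            piece = "[^" + core[1:-1] + "]"
        elif len(core) == 1 and core.isalpha():
            piece = core
        else:
            raise ValueError("unsupported pattern element: " + element)
        if lo is not None:
            piece += "{" + lo + ("," + hi if hi else "") + "}"
        pieces.append(piece)
    return "".join(pieces)


def find_seeds(query, pattern, database):
    """Return pattern occurrences in the query and in each database sequence.

    Spans are half-open (start, end) offsets. re.finditer reports
    non-overlapping matches only, which is a simplification.
    """
    regex = re.compile(prosite_to_regex(pattern))
    query_hits = [m.span() for m in regex.finditer(query)]
    if not query_hits:
        raise ValueError("the query must contain the pattern")
    db_hits = [
        (name, m.start(), m.end())
        for name, seq in database.items()
        for m in regex.finditer(seq)
    ]
    return query_hits, db_hits


# Toy data (hypothetical sequences, for illustration only).
query = "MKGPPGSGKSTLA"
database = {"seqA": "LLAEEEEGKTVV", "seqB": "MMMMMMMM"}
query_hits, db_hits = find_seeds(query, "[AG]-x(4)-G-K-[ST]", database)
print(query_hits)  # [(2, 10)]
print(db_hits)     # [('seqA', 2, 10)]
```

Each database hit paired with a query hit would then serve as the anchor from which a local alignment is extended in both directions; that extension and the scoring statistics are outside the scope of this sketch.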



Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press
