Abstract
The BLAST programs are widely used tools for searching protein and DNA databases for sequence similarities. For protein comparisons, a variety of definitional, algorithmic and statistical refinements described here permits the execution time of the BLAST programs to be decreased substantially while enhancing their sensitivity to weak similarities. A new criterion for triggering the extension of word hits, combined with a new heuristic for generating gapped alignments, yields a gapped BLAST program that runs at approximately three times the speed of the original. In addition, a method is introduced for automatically combining statistically significant alignments produced by BLAST into a position-specific score matrix, and searching the database using this matrix. The resulting Position-Specific Iterated BLAST (PSI-BLAST) program runs at approximately the same speed per iteration as gapped BLAST, but in many cases is much more sensitive to weak but biologically relevant sequence similarities. PSI-BLAST is used to uncover several new and interesting members of the BRCT superfamily.
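The idea of combining aligned sequences into a position-specific score matrix and then searching with that matrix can be illustrated with a minimal sketch. It is not the PSI-BLAST construction itself: the uniform background frequencies, the flat pseudocount, the unweighted sequences, the ungapped scoring, and the function names `build_pssm` and `best_ungapped_hit` are all assumptions made here for brevity.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column.

    `aligned` is a list of equal-length strings; '-' marks a gap.
    Assumption: uniform background frequencies and a flat pseudocount.
    """
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for column in zip(*aligned):
        # Count only standard residues; gaps and unknown symbols are skipped.
        counts = Counter(c for c in column if c in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def best_ungapped_hit(pssm, sequence):
    """Slide the matrix along `sequence`; return (best score, start offset)."""
    width = len(pssm)
    best = (float("-inf"), -1)
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Residues outside the 20-letter alphabet contribute a score of 0.
        score = sum(col.get(res, 0.0) for col, res in zip(pssm, window))
        best = max(best, (score, start))
    return best


if __name__ == "__main__":
    # Toy alignment, invented for illustration only.
    matrix = build_pssm(["ACDW", "ACEW", "ACDW"])
    print(best_ungapped_hit(matrix, "GGGACDWGGG"))
```

In an iterated search, the sequences found with one matrix would be aligned and used to build the matrix for the next round; the sketch covers a single round only.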