Skip to main content
Protein Science : A Publication of the Protein Society logoLink to Protein Science : A Publication of the Protein Society
. 1995 Jun;4(6):1145–1160. doi: 10.1002/pro.5560040613

Comparison of methods for searching protein sequence databases.

W R Pearson 1
PMCID: PMC2143149  PMID: 7549879

Abstract

We have compared commonly used sequence comparison algorithms, scoring matrices, and gap penalties using a method that identifies statistically significant differences in performance. Search sensitivity with either the Smith-Waterman algorithm or FASTA is significantly improved by using modern scoring matrices, such as BLOSUM45-55, and optimized gap penalties instead of the conventional PAM250 matrix. More dramatic improvement can be obtained by scaling similarity scores by the logarithm of the length of the library sequence (In()-scaling). With the best modern scoring matrix (BLOSUM55 or JO93) and optimal gap penalties (-12 for the first residue in the gap and -2 for additional residues), Smith-Waterman and FASTA performed significantly better than BLASTP. With In()-scaling and optimal scoring matrices (BLOSUM45 or Gonnet92) and gap penalties (-12, -1), the rigorous Smith-Waterman algorithm performs better than either BLASTP and FASTA, although with the Gonnet92 matrix the difference with FASTA was not significant. Ln()-scaling performed better than normalization based on other simple functions of library sequence length. Ln()-scaling also performed better than scores based on normalized variance, but the differences were not statistically significant for the BLOSUM50 and Gonnet92 matrices. Optimal scoring matrices and gap penalties are reported for Smith-Waterman and FASTA, using conventional or In()-scaled similarity scores. Searches with no penalty for gap extension, or no penalty for gap opening, or an infinite penalty for gaps performed significantly worse than the best methods. Differences in performance between FASTA and Smith-Waterman were not significant when partial query sequences were used. However, the best performance with complete query sequences was obtained with the Smith-Waterman algorithm and In()-scaling.

Full Text

The Full Text of this article is available as a PDF (4.0 MB).

Selected References

These references are in PubMed. This may not be the complete list of references from this article.

  1. Altschul S. F. A protein alignment scoring system sensitive at all evolutionary distances. J Mol Evol. 1993 Mar;36(3):290–300. doi: 10.1007/BF00160485. [DOI] [PubMed] [Google Scholar]
  2. Altschul S. F. Amino acid substitution matrices from an information theoretic perspective. J Mol Biol. 1991 Jun 5;219(3):555–565. doi: 10.1016/0022-2836(91)90193-A. [DOI] [PMC free article] [PubMed] [Google Scholar]
  3. Altschul S. F., Gish W., Miller W., Myers E. W., Lipman D. J. Basic local alignment search tool. J Mol Biol. 1990 Oct 5;215(3):403–410. doi: 10.1016/S0022-2836(05)80360-2. [DOI] [PubMed] [Google Scholar]
  4. Bairoch A., Boeckmann B. The SWISS-PROT protein sequence data bank. Nucleic Acids Res. 1991 Apr 25;19 (Suppl):2247–2249. doi: 10.1093/nar/19.suppl.2247. [DOI] [PMC free article] [PubMed] [Google Scholar]
  5. Chao K. M., Pearson W. R., Miller W. Aligning two sequences within a specified diagonal band. Comput Appl Biosci. 1992 Oct;8(5):481–487. doi: 10.1093/bioinformatics/8.5.481. [DOI] [PubMed] [Google Scholar]
  6. Collins J. F., Coulson A. F., Lyall A. The significance of protein sequence similarities. Comput Appl Biosci. 1988 Mar;4(1):67–71. doi: 10.1093/bioinformatics/4.1.67. [DOI] [PubMed] [Google Scholar]
  7. Deshpande A. S., Richards D. S., Pearson W. R. A platform for biological sequence comparison on parallel computers. Comput Appl Biosci. 1991 Apr;7(2):237–247. doi: 10.1093/bioinformatics/7.2.237. [DOI] [PubMed] [Google Scholar]
  8. Henikoff S., Henikoff J. G. Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci U S A. 1992 Nov 15;89(22):10915–10919. doi: 10.1073/pnas.89.22.10915. [DOI] [PMC free article] [PubMed] [Google Scholar]
  9. Henikoff S., Henikoff J. G. Performance evaluation of amino acid substitution matrices. Proteins. 1993 Sep;17(1):49–61. doi: 10.1002/prot.340170108. [DOI] [PubMed] [Google Scholar]
  10. Jones D. T., Taylor W. R., Thornton J. M. The rapid generation of mutation data matrices from protein sequences. Comput Appl Biosci. 1992 Jun;8(3):275–282. doi: 10.1093/bioinformatics/8.3.275. [DOI] [PubMed] [Google Scholar]
  11. Karlin S., Altschul S. F. Applications and statistics for multiple high-scoring segments in molecular sequences. Proc Natl Acad Sci U S A. 1993 Jun 15;90(12):5873–5877. doi: 10.1073/pnas.90.12.5873. [DOI] [PMC free article] [PubMed] [Google Scholar]
  12. Karlin S., Altschul S. F. Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc Natl Acad Sci U S A. 1990 Mar;87(6):2264–2268. doi: 10.1073/pnas.87.6.2264. [DOI] [PMC free article] [PubMed] [Google Scholar]
  13. Pearson W. R., Lipman D. J. Improved tools for biological sequence comparison. Proc Natl Acad Sci U S A. 1988 Apr;85(8):2444–2448. doi: 10.1073/pnas.85.8.2444. [DOI] [PMC free article] [PubMed] [Google Scholar]
  14. Pearson W. R., Miller W. Dynamic programming algorithms for biological sequence comparison. Methods Enzymol. 1992;210:575–601. doi: 10.1016/0076-6879(92)10029-d. [DOI] [PubMed] [Google Scholar]
  15. Pearson W. R. Rapid and sensitive sequence comparison with FASTP and FASTA. Methods Enzymol. 1990;183:63–98. doi: 10.1016/0076-6879(90)83007-v. [DOI] [PubMed] [Google Scholar]
  16. Pearson W. R. Searching protein sequence libraries: comparison of the sensitivity and selectivity of the Smith-Waterman and FASTA algorithms. Genomics. 1991 Nov;11(3):635–650. doi: 10.1016/0888-7543(91)90071-l. [DOI] [PubMed] [Google Scholar]
  17. Smith T. F., Waterman M. S. Identification of common molecular subsequences. J Mol Biol. 1981 Mar 25;147(1):195–197. doi: 10.1016/0022-2836(81)90087-5. [DOI] [PubMed] [Google Scholar]
  18. Vingron M., Waterman M. S. Sequence alignment and penalty choice. Review of concepts, case studies and implications. J Mol Biol. 1994 Jan 7;235(1):1–12. doi: 10.1016/s0022-2836(05)80006-3. [DOI] [PubMed] [Google Scholar]
  19. Waterman M. S., Gordon L., Arratia R. Phase transitions in sequence matches and nucleic acid structure. Proc Natl Acad Sci U S A. 1987 Mar;84(5):1239–1243. doi: 10.1073/pnas.84.5.1239. [DOI] [PMC free article] [PubMed] [Google Scholar]

Articles from Protein Science : A Publication of the Protein Society are provided here courtesy of The Protein Society

RESOURCES