Skip to main content
Proceedings of the National Academy of Sciences of the United States of America logoLink to Proceedings of the National Academy of Sciences of the United States of America
. 1993 Jun 15;90(12):5873–5877. doi: 10.1073/pnas.90.12.5873

Applications and statistics for multiple high-scoring segments in molecular sequences.

S Karlin 1, S F Altschul 1
PMCID: PMC46825  PMID: 8390686

Abstract

Score-based measures of molecular-sequence features provide versatile aids for the study of proteins and DNA. They are used by many sequence data base search programs, as well as for identifying distinctive properties of single sequences. For any such measure, it is important to know what can be expected to occur purely by chance. The statistical distribution of high-scoring segments has been described elsewhere. However, molecular sequences will frequently yield several high-scoring segments for which some combined assessment is in order. This paper describes the statistical distribution for the sum of the scores of multiple high-scoring segments and illustrates its application to the identification of possible transmembrane segments and the evaluation of sequence similarity.

Full text

PDF
5873

Selected References

These references are in PubMed. This may not be the complete list of references from this article.

  1. Altschul S. F. Amino acid substitution matrices from an information theoretic perspective. J Mol Biol. 1991 Jun 5;219(3):555–565. doi: 10.1016/0022-2836(91)90193-A. [DOI] [PMC free article] [PubMed] [Google Scholar]
  2. Altschul S. F., Erickson B. W. Significance of nucleotide sequence alignments: a method for random sequence permutation that preserves dinucleotide and codon usage. Mol Biol Evol. 1985 Nov;2(6):526–538. doi: 10.1093/oxfordjournals.molbev.a040370. [DOI] [PubMed] [Google Scholar]
  3. Altschul S. F., Gish W., Miller W., Myers E. W., Lipman D. J. Basic local alignment search tool. J Mol Biol. 1990 Oct 5;215(3):403–410. doi: 10.1016/S0022-2836(05)80360-2. [DOI] [PubMed] [Google Scholar]
  4. Altschul S. F., Lipman D. J. Protein database searches for multiple alignments. Proc Natl Acad Sci U S A. 1990 Jul;87(14):5509–5513. doi: 10.1073/pnas.87.14.5509. [DOI] [PMC free article] [PubMed] [Google Scholar]
  5. Bairoch A., Boeckmann B. The SWISS-PROT protein sequence data bank. Nucleic Acids Res. 1992 May 11;20 (Suppl):2019–2022. doi: 10.1093/nar/20.suppl.2019. [DOI] [PMC free article] [PubMed] [Google Scholar]
  6. Barker W. C., George D. G., Mewes H. W., Tsugita A. The PIR-International Protein Sequence Database. Nucleic Acids Res. 1992 May 11;20 (Suppl):2023–2026. doi: 10.1093/nar/20.suppl.2023. [DOI] [PMC free article] [PubMed] [Google Scholar]
  7. Brendel V., Bucher P., Nourbakhsh I. R., Blaisdell B. E., Karlin S. Methods and algorithms for statistical analysis of protein sequences. Proc Natl Acad Sci U S A. 1992 Mar 15;89(6):2002–2006. doi: 10.1073/pnas.89.6.2002. [DOI] [PMC free article] [PubMed] [Google Scholar]
  8. Collins J. F., Coulson A. F., Lyall A. The significance of protein sequence similarities. Comput Appl Biosci. 1988 Mar;4(1):67–71. doi: 10.1093/bioinformatics/4.1.67. [DOI] [PubMed] [Google Scholar]
  9. Feng D. F., Johnson M. S., Doolittle R. F. Aligning amino acid sequences: comparison of commonly used methods. J Mol Evol. 1984;21(2):112–125. doi: 10.1007/BF02100085. [DOI] [PubMed] [Google Scholar]
  10. Fitch W. M. Random sequences. J Mol Biol. 1983 Jan 15;163(2):171–176. doi: 10.1016/0022-2836(83)90002-5. [DOI] [PubMed] [Google Scholar]
  11. Gish W., States D. J. Identification of protein coding regions by database similarity search. Nat Genet. 1993 Mar;3(3):266–272. doi: 10.1038/ng0393-266. [DOI] [PubMed] [Google Scholar]
  12. Gonnet G. H., Cohen M. A., Benner S. A. Exhaustive matching of the entire protein sequence database. Science. 1992 Jun 5;256(5062):1443–1445. doi: 10.1126/science.1604319. [DOI] [PubMed] [Google Scholar]
  13. Heilig R., Perrin F., Gannon F., Mandel J. L., Chambon P. The ovalbumin gene family: structure of the X gene and evolution of duplicated split genes. Cell. 1980 Jul;20(3):625–637. doi: 10.1016/0092-8674(80)90309-8. [DOI] [PubMed] [Google Scholar]
  14. Henikoff S., Henikoff J. G. Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci U S A. 1992 Nov 15;89(22):10915–10919. doi: 10.1073/pnas.89.22.10915. [DOI] [PMC free article] [PubMed] [Google Scholar]
  15. Jones D. T., Taylor W. R., Thornton J. M. The rapid generation of mutation data matrices from protein sequences. Comput Appl Biosci. 1992 Jun;8(3):275–282. doi: 10.1093/bioinformatics/8.3.275. [DOI] [PubMed] [Google Scholar]
  16. Karlin S., Altschul S. F. Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc Natl Acad Sci U S A. 1990 Mar;87(6):2264–2268. doi: 10.1073/pnas.87.6.2264. [DOI] [PMC free article] [PubMed] [Google Scholar]
  17. Karlin S., Brendel V., Bucher P. Significant similarity and dissimilarity in homologous proteins. Mol Biol Evol. 1992 Jan;9(1):152–167. doi: 10.1093/oxfordjournals.molbev.a040704. [DOI] [PubMed] [Google Scholar]
  18. Karlin S., Brendel V. Chance and statistical significance in protein and DNA sequence analysis. Science. 1992 Jul 3;257(5066):39–49. doi: 10.1126/science.1621093. [DOI] [PubMed] [Google Scholar]
  19. Karlin S., Bucher P., Brendel V., Altschul S. F. Statistical methods and insights for protein and DNA sequences. Annu Rev Biophys Biophys Chem. 1991;20:175–203. doi: 10.1146/annurev.bb.20.060191.001135. [DOI] [PubMed] [Google Scholar]
  20. Kobilka B. K., Frielle T., Collins S., Yang-Feng T., Kobilka T. S., Francke U., Lefkowitz R. J., Caron M. G. An intronless gene encoding a potential member of the family of receptors coupled to guanine nucleotide regulatory proteins. Nature. 1987 Sep 3;329(6134):75–79. doi: 10.1038/329075a0. [DOI] [PubMed] [Google Scholar]
  21. McLachlan A. D. Tests for comparing related amino-acid sequences. Cytochrome c and cytochrome c 551 . J Mol Biol. 1971 Oct 28;61(2):409–424. doi: 10.1016/0022-2836(71)90390-1. [DOI] [PubMed] [Google Scholar]
  22. Michael W. M., Bowtell D. D., Rubin G. M. Comparison of the sevenless genes of Drosophila virilis and Drosophila melanogaster. Proc Natl Acad Sci U S A. 1990 Jul;87(14):5351–5353. doi: 10.1073/pnas.87.14.5351. [DOI] [PMC free article] [PubMed] [Google Scholar]
  23. Pearson W. R., Lipman D. J. Improved tools for biological sequence comparison. Proc Natl Acad Sci U S A. 1988 Apr;85(8):2444–2448. doi: 10.1073/pnas.85.8.2444. [DOI] [PMC free article] [PubMed] [Google Scholar]
  24. Simon M. A., Bowtell D. D., Rubin G. M. Structure and activity of the sevenless protein: a protein tyrosine kinase receptor required for photoreceptor development in Drosophila. Proc Natl Acad Sci U S A. 1989 Nov;86(21):8333–8337. doi: 10.1073/pnas.86.21.8333. [DOI] [PMC free article] [PubMed] [Google Scholar]
  25. Smith T. F., Waterman M. S., Burks C. The statistical distribution of nucleic acid similarities. Nucleic Acids Res. 1985 Jan 25;13(2):645–656. doi: 10.1093/nar/13.2.645. [DOI] [PMC free article] [PubMed] [Google Scholar]
  26. Smith T. F., Waterman M. S. Identification of common molecular subsequences. J Mol Biol. 1981 Mar 25;147(1):195–197. doi: 10.1016/0022-2836(81)90087-5. [DOI] [PubMed] [Google Scholar]
  27. Tomley F., Binns M., Campbell J., Boursnell M. Sequence analysis of an 11.2 kilobase, near-terminal, BamHI fragment of fowlpox virus. J Gen Virol. 1988 May;69(Pt 5):1025–1040. doi: 10.1099/0022-1317-69-5-1025. [DOI] [PubMed] [Google Scholar]
  28. Wilbur W. J. On the PAM matrix model of protein evolution. Mol Biol Evol. 1985 Sep;2(5):434–447. doi: 10.1093/oxfordjournals.molbev.a040360. [DOI] [PubMed] [Google Scholar]

Articles from Proceedings of the National Academy of Sciences of the United States of America are provided here courtesy of National Academy of Sciences

RESOURCES