Proceedings of the National Academy of Sciences of the United States of America. 1990 Mar;87(6):2264–2268. doi: 10.1073/pnas.87.6.2264

Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes.

S Karlin, S F Altschul
PMCID: PMC53667  PMID: 2315319

Abstract

An unusual pattern in a nucleic acid or protein sequence or a region of strong similarity shared by two or more sequences may have biological significance. It is therefore desirable to know whether such a pattern can have arisen simply by chance. To identify interesting sequence patterns, appropriate scoring values can be assigned to the individual residues of a single sequence or to sets of residues when several sequences are compared. For single sequences, such scores can reflect biophysical properties such as charge, volume, hydrophobicity, or secondary structure potential; for multiple sequences, they can reflect nucleotide or amino acid similarity measured in a wide variety of ways. Using an appropriate random model, we present a theory that provides precise numerical formulas for assessing the statistical significance of any region with high aggregate score. A second class of results describes the composition of high-scoring segments. In certain contexts, these permit the choice of scoring systems which are "optimal" for distinguishing biologically relevant patterns. Examples are given of applications of the theory to a variety of protein sequences, highlighting segments with unusual biological features. These include distinctive charge regions in transcription factors and protooncogene products, pronounced hydrophobic segments in various receptor and transport proteins, and statistically significant subalignments involving the recently characterized cystic fibrosis gene.
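The ideas in the abstract can be illustrated with a minimal sketch, under stated assumptions. The abstract does not give the formulas themselves, so the code below uses the standard form of the theory for scoring schemes with negative expected score: the scale parameter λ is the unique positive solution of Σ p_i·exp(λ·s_i) = 1, and the residue frequencies expected inside high-scoring segments are q_i = p_i·exp(λ·s_i). The two-class charge model (charged residues score +2 with background frequency 0.2, all others score -1), the example sequence, and every function name are hypothetical choices made only for illustration.

```python
import math

# Hypothetical two-class charge scoring scheme (an assumption, not from the paper):
# charged residues (D, E, K, R) score +2, every other residue scores -1.
CHARGED = set("DEKR")


def score_sequence(seq):
    """Map each residue to its illustrative charge score."""
    return [2 if residue in CHARGED else -1 for residue in seq]


def max_scoring_segment(scores):
    """Return (best_score, start, end) of the highest-scoring contiguous
    segment, with end exclusive, using a single linear scan."""
    best, best_start, best_end = 0, 0, 0
    running, run_start = 0, 0
    for i, s in enumerate(scores):
        if running <= 0:
            # A non-positive prefix can never help; restart the segment here.
            running, run_start = s, i
        else:
            running += s
        if running > best:
            best, best_start, best_end = running, run_start, i + 1
    return best, best_start, best_end


def solve_lambda(probs, scores, tol=1e-12):
    """Find the positive root of sum(p * exp(lam * s)) = 1 by bisection.
    Requires a negative expected score and at least one positive score."""
    expected = sum(p * s for p, s in zip(probs, scores))
    if expected >= 0 or max(scores) <= 0:
        raise ValueError("need negative expected score and a positive score")

    def f(lam):
        return sum(p * math.exp(lam * s) for p, s in zip(probs, scores)) - 1.0

    lo, hi = 0.0, 1.0
    while f(hi) <= 0:  # expand until the bracket contains the positive root
        hi *= 2.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if f(mid) > 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2.0


def target_frequencies(probs, scores, lam):
    """Frequencies q_i = p_i * exp(lam * s_i) expected in high-scoring segments."""
    return [p * math.exp(lam * s) for p, s in zip(probs, scores)]


# Illustrative background model: 20% charged (+2), 80% uncharged (-1).
probs, class_scores = [0.2, 0.8], [2, -1]
lam = solve_lambda(probs, class_scores)
q = target_frequencies(probs, class_scores, lam)
segment = max_scoring_segment(score_sequence("AAKKEEDAAAA"))
```

In this toy model the charged class is enriched from a background frequency of 0.2 to a target frequency of roughly 0.49 within high-scoring segments, which is the sense in which the theory describes segment composition. Turning a segment score into a significance estimate additionally requires a second parameter (usually written K), which this sketch does not compute.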



Articles from Proceedings of the National Academy of Sciences of the United States of America are provided here courtesy of National Academy of Sciences
