J Mol Biol. 2005 Mar 9;219(3):555–565. doi: 10.1016/0022-2836(91)90193-A

Amino acid substitution matrices from an information theoretic perspective

Stephen F Altschul 1
PMCID: PMC7130686  PMID: 2051488

Abstract

Protein sequence alignments have become an important tool for molecular biologists. Local alignments are frequently constructed with the aid of a “substitution score matrix” that specifies a score for aligning each pair of amino acid residues. Over the years, many different substitution matrices have been proposed, based on a wide variety of rationales. Statistical results, however, demonstrate that any such matrix is implicitly a “log-odds” matrix, with a specific target distribution for aligned pairs of amino acid residues. In the light of information theory, it is possible to express the scores of a substitution matrix in bits and to see that different matrices are better adapted to different purposes. The most widely used matrix for protein sequence comparison has been the PAM-250 matrix. It is argued that for database searches the PAM-120 matrix generally is more appropriate, while for comparing two specific proteins with suspected homology the PAM-200 matrix is indicated. Examples discussed include the lipocalins, human α1B-glycoprotein, the cystic fibrosis transmembrane conductance regulator and the globins.
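
As a minimal sketch of the "log-odds" reading described above, a score in bits can be written as s(a, b) = log2(q_ab / (p_a p_b)), where q_ab is the target frequency of the aligned pair and p_a, p_b are background residue frequencies. The two-letter alphabet, the probabilities, and the function names below are hypothetical values chosen for illustration only; they are not taken from the paper or from any PAM matrix.

```python
import math

def log_odds_bits(target, background):
    """Score each pair as log2(q_ab / (p_a * p_b)), i.e. in bits."""
    return {
        (a, b): math.log2(q / (background[a] * background[b]))
        for (a, b), q in target.items()
    }

def expected_score(weights, scores):
    """Average score per aligned pair under the given pair probabilities."""
    return sum(w * scores[pair] for pair, w in weights.items())

# Hypothetical toy alphabet: background frequencies p and target pair
# frequencies q (both illustrative assumptions, each summing to 1).
background = {"A": 0.5, "B": 0.5}
target = {("A", "A"): 0.4, ("A", "B"): 0.1,
          ("B", "A"): 0.1, ("B", "B"): 0.4}

scores = log_odds_bits(target, background)

# Relative entropy: average information per position, in bits, when
# aligned pairs follow the target distribution.
H = expected_score(target, scores)

# Expected score for unrelated sequences, where pairs occur with p_a * p_b.
chance = {(a, b): background[a] * background[b] for (a, b) in target}
E_random = expected_score(chance, scores)

print(f"s(A,A) = {scores[('A', 'A')]:.3f} bits")
print(f"s(A,B) = {scores[('A', 'B')]:.3f} bits")
print(f"relative entropy H = {H:.3f} bits per position")
print(f"expected score for random pairs = {E_random:.3f} bits")
```

In this toy case identities score positively and mismatches negatively, the average score under the target distribution is positive, and the average score for chance pairings is negative, which is the property a local alignment scoring scheme needs.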

Keywords: homology, sequence comparison, statistical significance, alignment algorithms, pattern recognition

Abbreviations: MSP, Maximal Segment Pair; Ig, immunoglobulin

Articles from Journal of Molecular Biology are provided here courtesy of Elsevier