Abstract
We apply a simple method for aligning protein sequences on the basis of 3D structure, on a large scale, to the proteins in the scop classification of fold families. This allows us to assess, understand, and improve our automatic method against an objective, manually derived standard, a type of comprehensive evaluation that has not yet been possible for other structural alignment algorithms. Our basic approach directly matches the backbones of two structures, using repeated cycles of dynamic programming and least-squares fitting to determine an alignment that minimizes coordinate difference. Because of its simplicity, our method can be readily modified to take into account additional features of protein structure, such as the orientation of side chains or the location-dependent cost of opening a gap. Our basic method, augmented by such modifications, can find reasonable alignments for all but 1.5% of the known structural similarities in scop, i.e., all but 32 of the 2,107 superfamily pairs. We discuss the specific protein structural features that make these 32 pairs so difficult to align and show how our procedure effectively partitions the relationships in scop into different categories, depending on which aspects of protein structure are involved (e.g., whether or not consideration of side-chain orientation is necessary for proper alignment). We also show how our pairwise alignment procedure can be extended to generate a multiple alignment for a group of related structures. We have compared these alignments in detail with corresponding manual ones culled from the literature. We find good agreement (to within 95% for the core regions), and the detailed comparison highlights how particular protein structural features (such as certain strands) are problematic to align, giving somewhat ambiguous results. With these improvements and systematic tests, our procedure should be useful for the development of scop and the future classification of protein folds.
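The cycle of dynamic programming and least-squares fitting described above can be illustrated with a minimal sketch. It is not the published implementation. The function names, the index-by-index starting alignment, the score form `top / (1 + (d / d0)**2)`, the constants `top`, `d0`, and `gap`, and the single linear gap penalty are all assumptions made for illustration; the location-dependent gap cost and side-chain terms mentioned in the abstract are omitted. Inputs are assumed to be N x 3 NumPy arrays of backbone (e.g., C-alpha) coordinates.

```python
import numpy as np

def kabsch(P, Q):
    """Least-squares fit: rotation R and translation t minimizing sum |R p + t - q|^2."""
    pc, qc = P.mean(axis=0), Q.mean(axis=0)
    H = (P - pc).T @ (Q - qc)                       # 3 x 3 covariance of centered pairs
    U, _, Vt = np.linalg.svd(H)
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T         # d = -1 prevents a reflection
    return R, qc - R @ pc

def dp_align(score, gap):
    """Global dynamic programming over a similarity matrix with a linear gap penalty."""
    n, m = score.shape
    F = np.zeros((n + 1, m + 1))
    F[:, 0] = -gap * np.arange(n + 1)
    F[0, :] = -gap * np.arange(m + 1)
    move = np.zeros((n + 1, m + 1), dtype=int)      # 0 = match, 1 = skip in A, 2 = skip in B
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            options = (F[i - 1, j - 1] + score[i - 1, j - 1],
                       F[i - 1, j] - gap,
                       F[i, j - 1] - gap)
            move[i, j] = int(np.argmax(options))
            F[i, j] = options[move[i, j]]
    pairs, i, j = [], n, m
    while i > 0 and j > 0:                          # trace back from the corner
        if move[i, j] == 0:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif move[i, j] == 1:
            i -= 1
        else:
            j -= 1
    return pairs[::-1]

def structure_align(A, B, cycles=20, d0=2.24, top=20.0, gap=10.0):
    """Alternate fitting and dynamic programming until the alignment stops changing."""
    # Assumed starting point: pair residues index by index.
    pairs = [(i, i) for i in range(min(len(A), len(B)))]
    for _ in range(cycles):
        ia, ib = map(list, zip(*pairs))
        R, t = kabsch(A[ia], B[ib])                 # fit A onto B using current pairs
        A_fit = A @ R.T + t
        d2 = ((A_fit[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
        score = top / (1.0 + d2 / d0 ** 2)          # high score for close residues
        new_pairs = dp_align(score, gap)
        if len(new_pairs) < 3 or new_pairs == pairs:
            break
        pairs = new_pairs
    ia, ib = map(list, zip(*pairs))
    R, t = kabsch(A[ia], B[ib])
    diff = A[ia] @ R.T + t - B[ib]
    rmsd = float(np.sqrt((diff ** 2).sum(axis=1).mean()))
    return pairs, rmsd
```

Calling `structure_align(A, B)` returns the list of aligned residue index pairs and the RMSD over those pairs after the final fit. Because each cycle only refines the previous alignment, the result can depend on the starting alignment, so a practical version would likely try several starting points.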