Abstract
Some proteins are homologous to others after their sequence is circularly permuted. A few such proteins have been recognized, mainly by sequence comparison, but also by comparing their three-dimensional structures. Here we report the result of a systematic search for all protein pairs in the SCOP 90% id domain database that become structurally superimposable when the sequence of one member of the pair is circularly permuted. Using a reasonable set of criteria, we find that 47% of all protein domains are superimposable on at least one other protein domain in the database after their sequence is circularly permuted. Many of these are symmetric proteins, which superimpose on another protein both with and without a circular permutation of the sequence. However, 412 of the total 3035 domains are nonsymmetric, and these become structurally superimposable on another protein only after a circular permutation of the sequence. These include most known and many previously undetected circularly permuted proteins with remote homology.
Keywords: Circular permutation, protein structure, structure alignment, gene duplication
Proteins have been circularly permuted artificially to study the folding and stability of the protein or, in protein engineering contexts, to move the N or C terminus to another position in the protein structure (Heinemann and Hahn 1995a; Baird et al. 1999; McWherter et al. 1999; Nakamura and Iwakura 1999; Iwakura et al. 2000). Circularly permuted proteins also occur in nature. Lindqvist and Schneider (1997) reviewed some eight naturally circularly permuted proteins that were known by 1997, but at least six more (Garcia-Vallve et al. 1998; Murzin 1998; Castillo et al. 1999; Jeltsch 1999; Polekhina et al. 1999; Jung and Lee 2000) have been reported since then. Circularly permuted proteins can arise from a posttranslational modification (Carrington et al. 1985; Bowles et al. 1986), but a majority probably arose from gene duplication (Luger et al. 1989; Ponting and Russell 1995; Jeltsch 1999) or exon shuffling (Doolittle 1987; Gilbert 1987) events. Natural circularly permuted proteins occur in a variety of organisms, including viruses, bacteria, plants, and higher animals. They are mostly β-sheet and α/β proteins, but saposins (Ponting and Russell 1995; Liepinsh et al. 1997) are α-helical proteins. In most known cases, the N and C termini are close to each other (Thornton and Sibanda 1983), but we have found in this work many examples wherein the two termini are not close together. Methods for detecting repeated sequence segments and circularly permuted proteins in a sequence database have been reported recently (Marcotte et al. 1999; Uliel et al. 1999). Here we report the results of a systematic search for protein pairs that have similar structures but whose structural alignment requires circular permutation of one of the sequences.
Results and Discussion
There are more than 10,000 entries in the protein structure databank (Berman et al. 2000), which consist of more than 16,000 domains according to the manual SCOP domain parsing result (Murzin et al. 1995). We selected 3035 protein domains from the SCOP domain database, version 1.41, that were at least 40 residues long and had 90% or less sequence identity between any pair of them. Attempts were made to structurally align all pairs of these domains both with and without circularly permuting one of the sequences. Two structures are said to be structurally related when they are sufficiently similar that the structural alignment produces a sufficiently large number of aligned pairs of residues (see Materials and Methods). Of the 9.2 million (3035 × 3035) possible pairs, 136,975 pairs met the criteria for a structural relation when neither sequence was permuted (unpermuted alignment), and 48,016 pairs met the criteria when one of the two sequences was circularly permuted (permuted alignment). The pairs in the latter set are said to be CP related.
The automatic procedure found most known CP relations, including those between plant lectins (Cunningham et al. 1979), bacterial glucanases (Heinemann and Hahn 1995b), (β/α)8 barrel proteins (Sergeev and Lee 1994; Jia et al. 1996; Macgregor et al. 1996; Garcia-Vallve et al. 1998), the C2 domain proteins (Nalefski and Falke 1996), ferredoxins (Jung and Lee 2000), flavin-binding β-barrel domains (Murzin 1998), the six-stranded double-ψ β-barrels (Castillo et al. 1999), and the DNA and other methyltransferases (Jeltsch 1999). Some new examples of CP-related protein pairs are shown in Figure 1. When a protein has a symmetric structure, it aligns to itself and to other structurally similar proteins both with and without circular permutation of its sequence. One can use this property to identify symmetric structures. Therefore, we operationally define a protein to be symmetric if it is related to another protein both with and without circular permutation and if the two alignments are judged to be distinct (see Materials and Methods). One feature that can be noted from the structures shown in Figure 1 is that the N and C termini are far apart in many of the structures. The proximity of the N and C termini is therefore not a prerequisite for circular permutation.
Fig. 1.
Molscript images of some symmetric (A) and nonsymmetric (B) circularly permuted protein pairs. Each structure is made of two parts, colored red and blue. For each pair, similarly colored parts match structurally, red to red and blue to blue. The red part is the N-terminal part in one protein and the C-terminal part in the other protein of each pair. Similarly, the blue part is the C-terminal part in one protein and the N-terminal part in the other. The point where the color changes is the cut position for the circular permutation.
Individual structural relations are shown in Figure 2. The numbers of relations between proteins that belong to the same or different fold, superfamily, and family, according to the SCOP classification, are shown in Figure 3. The unpermuted relations (blue and green dots in Fig. 2) are mostly between proteins in the same superfamilies (Fig. 3), indicating that our criteria for structural similarity roughly match the criteria used for the manual SCOP superfamily classification. Many relations do connect different classes (blue dots outside of the boxes in Fig. 2), but most of these involve protein domains that are small α-helical pieces or small α + β motifs, which resemble a part of many larger proteins. Most of the symmetric CP relations (green dots in Fig. 2) occur within the same SCOP folds and superfamilies (Fig. 3), but many nonsymmetric CP relations (red dots in Fig. 2) connect proteins in different superfamilies and folds (Fig. 3).
Fig. 2.
Unpermuted and circularly permuted structural relations among the 3035 protein domains. The x- and y-axes represent the proteins sorted according to the SCOP classification number. The chain of seven boxes along the diagonal indicates the seven classes—α, β, α/β, α + β, multidomain, membrane, and small proteins—of SCOP. The nested boxes in each class box, large to small, indicate fold, superfamily, and family according to the SCOP classification. A blue dot is placed when there is an unpermuted structural relation between a pair of proteins, a red dot for a CP relation, and a green dot for a symmetric CP relation. The dot pattern is asymmetric because of the use of the z score for the hit criteria and because only one protein (the x-axis protein) is permuted. The protein along the x-axis is the probe protein (protein a or a`) and that along the y-axis the target protein (protein b). Some of the fold-boxes appear solid green only because of the lack of resolution of the graphical image.
Fig. 3.
Number of unpermuted, CP, symmetric CP, and nonsymmetric CP structural relations. The dotted and solid gray areas indicate the number of relations in which the related pair belongs, respectively, to the same and different class, fold, superfamily, or family.
The number of proteins that bear a relation with another protein is listed in Table 1. Also listed are the number of families, superfamilies, folds, and classes, as defined by SCOP, which these proteins represent. Obviously the precise numbers given in the table depend on the criteria used to judge structural similarity (see Materials and Methods). The fact that structural similarity depends on an ultimately arbitrary choice of a cutoff value is somewhat unsatisfactory. However, the situation is similar in the case of the detection of sequence homology, where a similarly arbitrary cutoff value for the e-score is commonly used. The z-score that we used in this work and the e-score are closely related, being precisely interconvertible when the score distribution is Gaussian for random matches. We made numerous spot checks by visual inspection of superimposed structures and confirmed to our satisfaction that in all cases we concur with the judgment made by the automatic procedure concerning the structural similarity or the lack thereof.
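The relation between a z-score cutoff and an expected number of random hits can be made concrete with a short sketch. It assumes, as the paragraph above does, that match scores for random pairs are Gaussian, and it further assumes the usual reading of an e-score as the number of comparisons multiplied by the one-sided tail probability. The function names are illustrative and do not come from the alignment program used in this work.

```python
import math


def gaussian_tail(z):
    """One-sided probability that a standard normal variable exceeds z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def expected_random_hits(z, n_comparisons):
    """Expected number of random matches scoring at or above z.

    Assumption: an e-score is n_comparisons times the Gaussian tail
    probability; the paper does not spell out this formula.
    """
    return n_comparisons * gaussian_tail(z)


if __name__ == "__main__":
    # One probe domain compared against all 3035 domains at the z = 5.0 cutoff.
    print(expected_random_hits(5.0, 3035))
```

Under these assumptions the conversion is monotonic, so a cutoff on one score corresponds to exactly one cutoff on the other for a fixed number of comparisons.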
Table 1.
Distribution of structurally related protein domains in different structural typesᵃ
Relation | p | fm | sf | fold | c | Np |
All | 3035 | 957 | 653 | 446 | 7 | 9,211,225 |
Unpermuted | 2859 | 815 | 538 | 354 | 7 | 133,940 |
CP | 1433 | 462 | 350 | 226 | 7 | 48,016 |
sCP | 1025 | 284 | 215 | 125 | 7 | 34,581 |
nsCP | 412 | 243 | 202 | 164 | 7 | 13,435 |
a The first row of numbers (row labeled All) gives the total number of proteins (column p), families (fm), superfamilies (sf), folds (fold), and classes (c) in the reduced SCOP 90% id domain database and the total number of pair-wise structural comparisons made (Np). The next row of numbers (Unpermuted) gives the number of proteins, each of which has at least one other protein that is structurally related without permutation (column p), the number of families (fm), superfamilies (sf), folds (fold), and classes (c) to which these proteins belong, and the total number of structurally related pairs (Np). The following three rows give similar data for the CP relation (CP), for the symmetric CP relation (sCP), and for the non-symmetric CP relation (nsCP).
It can be seen from Table 1 that 47% (1433 of 3035) of the protein domains have a CP relation with at least one other known protein domain and that such proteins are not restricted to a few special folds; circularly permuted proteins occur in all structural classes and in about half (226 of 446) of all known folds. In the SCOP classification, more than one-third of the protein domains belong to the 15 largest folds (1068 out of 3035). There is at least one circularly permuted protein in each of these 15 folds and, on average, 44% of the proteins are permuted in a given fold. It has long been recognized that many multidomain proteins were generated by different combinations of a small number of domains (Patthy 1993). The finding that a large number of protein domains have circular permutation relations with other protein domains indicates that individual domains themselves are also made from a combination of smaller units.
Some 71% of the circularly permuted proteins (1025 of 1433) have symmetric structures. The number of symmetric proteins detected here is therefore 34% of the total number of proteins. These structures might have arisen from ancient gene duplication events (Lang et al. 2000). Marcotte et al. (1999) reported that duplicated gene segments occur in 14% of all protein sequences and more than 20% of all eukaryotic proteins. These must reflect relatively recent gene duplication events because they were detected by sequence homology. In the case of the symmetric structural domains detected here, the sequence homology is generally low; only 91 of the 34,581 symmetric circularly permuted pairs have >30% sequence identity between them. If the symmetry has indeed arisen from gene duplication events, therefore, most of them must be ancient events. Alternatively, one cannot rule out the possibility that at least some of these structures arose without a gene duplication event (convergent evolution).
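The percentages quoted in the two preceding paragraphs follow directly from the counts in Table 1. The short sketch below recomputes them; the counts are copied from the table and the text, and the variable names are illustrative only.

```python
# Counts taken from Table 1 and the accompanying text.
TOTAL_DOMAINS = 3035
CP_DOMAINS = 1433
SYMMETRIC_CP_DOMAINS = 1025
TOTAL_FOLDS = 446
CP_FOLDS = 226
DOMAINS_IN_15_LARGEST_FOLDS = 1068


def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


shares = {
    "domains with a CP relation": percent(CP_DOMAINS, TOTAL_DOMAINS),
    "folds containing a CP domain": percent(CP_FOLDS, TOTAL_FOLDS),
    "domains in the 15 largest folds": percent(DOMAINS_IN_15_LARGEST_FOLDS, TOTAL_DOMAINS),
    "CP domains that are symmetric": percent(SYMMETRIC_CP_DOMAINS, CP_DOMAINS),
    "all domains that are symmetric": percent(SYMMETRIC_CP_DOMAINS, TOTAL_DOMAINS),
}

if __name__ == "__main__":
    for label, value in shares.items():
        print(f"{label}: {value:.1f}%")
```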
Materials and methods
Finding circularly permuted alignment
A protein sequence was circularly permuted by deciding on a cut position and then renumbering the residues starting from the carboxy side of the cut position forward to the C terminus of the protein and then continuing to the N terminus and finishing at the amino side of the cut position. The cut position was initially chosen to be the middle of the sequence (Fig. 4). The structure of the permuted protein was then aligned to another protein, the sequence of which is not permuted, using the recently described structure–structure alignment program SHEBA (Jung and Lee 2000). This structural alignment procedure preserves connectivity so that two structures that are identical except for the numbering of the residues are considered distinct. A new cut position was then determined from the structural alignment. Let na be the number of residues that are matched in the first half (the half that contains the original C terminus) and nb the number of residues that are matched in the second half of the permuted protein. The new cut position is chosen to be next to the last residue matched in the second half if na > nb or chosen just before the first residue matched in the first half if na ≤ nb. Circular permutation using this new cut position increases the number of matched residues in the structural superposition.
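The renumbering and the cut-update rule described above can be sketched in a few lines. This is a minimal illustration and not the SHEBA implementation: the structural alignment itself is not performed, the matched residues are simply passed in as indices in the permuted numbering, and leaving the cut unchanged when the relevant half has no matched residues is an assumption that the text does not address. The function names are hypothetical.

```python
def circular_permute(residues, cut):
    """Renumber a sequence so that it starts on the carboxy side of the cut.

    `cut` is the number of residues on the amino side of the cut, so the
    result runs from residues[cut] to the original C terminus and then
    from the original N terminus to residues[cut - 1].
    """
    if not 0 <= cut <= len(residues):
        raise ValueError("cut must lie within the sequence")
    return residues[cut:] + residues[:cut]


def next_cut(matched, length, cut):
    """Choose a new cut (original numbering) from one permuted alignment.

    `matched` holds indices, in the permuted numbering, of the residues
    of the permuted protein that were aligned to the other protein.
    """
    first_len = length - cut  # first half ends at the original C terminus
    first = [i for i in matched if i < first_len]
    second = [i for i in matched if i >= first_len]
    if len(first) > len(second):  # na > nb
        if not second:
            return cut  # assumption: nothing to anchor on, keep the cut
        new_start = max(second) + 1  # next to the last match in the second half
    else:  # na <= nb
        if not first:
            return cut  # assumption: nothing to anchor on, keep the cut
        new_start = min(first)  # just before the first match in the first half
    return (new_start + cut) % length  # convert back to original numbering


if __name__ == "__main__":
    seq = "ABCDEFGHIJ"
    cut = len(seq) // 2
    print(circular_permute(seq, cut))
    print(next_cut([2, 3, 4, 5, 6, 7, 8], len(seq), cut))
```

In the usage example, the matched residues span the junction between the two halves, and the updated cut places them at the start of the newly permuted sequence as one contiguous run.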
Fig. 4.
Circularly permuted structural alignment procedure. N and C indicate the original N and C termini. (a) The two protein sequences to be aligned are shown as parallel arrows, not yet aligned. The second sequence will be permuted. The short vertical line indicates the first cut position, chosen as the middle of the sequence to be permuted. (b1,b2) Two possible outcomes of the structural alignment after the second protein has been permuted. The new cut position is indicated by the vertical line for each case. (c1,c2) Structural alignment after permutation using the second cut position.
Criteria for a structural relation
A structural alignment between two proteins, a and b, gives the match score mab, which is the fraction of matched residues in protein a. For each protein a, the mean match score ma of the random distribution was computed by averaging mab over all b proteins that are structurally unrelated (those with mab < 40%). The root-mean-square deviation σa of mab about ma was also computed. The match scores were then converted to z-score zab, which was defined as (mab − ma)/σa. For the straight structural alignment, a pair of proteins were considered to be structurally related when zab was >5.0. This z-score cutoff value is the same as that used previously for clustering protein structures into groups of similar structures (Jung and Lee 2000). This particular value was chosen primarily because the number of multimember clusters reached a plateau of maximum value at this cutoff value. Two proteins were considered to be related by circular permutation (CP related) if za`b is >5.0, where a` is the permuted protein, and if the number of matched residues of the C- and N-terminal parts of the permuted protein were both >10% of the total number of matched residues for the protein pair.
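The z-score conversion and the CP-relation test described above can be sketched as follows. The sketch assumes that the match scores for one probe protein are already available as fractions between 0 and 1, and it computes the root-mean-square deviation over the same background set used for the mean, which is one reading of the text. The function and parameter names are illustrative.

```python
import math


def z_scores(match_scores, unrelated_below=0.40):
    """Convert match scores m_ab for one probe protein a into z-scores.

    `match_scores` maps each target protein b to m_ab. The background
    mean and RMS deviation use only targets with m_ab below the cutoff.
    """
    background = [m for m in match_scores.values() if m < unrelated_below]
    if not background:
        raise ValueError("no structurally unrelated targets")
    m_a = sum(background) / len(background)
    sigma_a = math.sqrt(sum((m - m_a) ** 2 for m in background) / len(background))
    if sigma_a == 0.0:
        raise ValueError("background scores have zero spread")
    return {b: (m - m_a) / sigma_a for b, m in match_scores.items()}


def is_cp_related(z, n_matched_c_part, n_matched_n_part, z_cutoff=5.0, min_fraction=0.10):
    """Apply the CP criteria: z above the cutoff and both parts well represented."""
    total = n_matched_c_part + n_matched_n_part
    if total == 0:
        return False
    return (
        z > z_cutoff
        and n_matched_c_part > min_fraction * total
        and n_matched_n_part > min_fraction * total
    )


if __name__ == "__main__":
    scores = {"b1": 0.10, "b2": 0.20, "b3": 0.30, "b4": 0.90}
    z = z_scores(scores)
    print(z["b4"], is_cp_related(z["b4"], 50, 40))
```

The requirement that both parts contribute more than 10% of the matched residues excludes alignments in which the permutation moves only a short, mostly unmatched tail.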
Criteria for distinct alignment
Two alignments were judged to be distinct if the mean alignment shift per residue, Δr (Jung and Lee 2000), was greater than 5 positions between the two alignments.
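A rough sketch of the distinctness test and of the operational definition of a symmetric relation is given below. The exact definition of Δr is given in Jung and Lee (2000) and is not reproduced here; the sketch assumes a simple stand-in, namely the mean absolute difference in the partner residue index over residues that are aligned in both alignments. Each alignment is represented as a mapping from a residue index of protein a (original numbering) to a residue index of protein b. All names are hypothetical.

```python
import math


def mean_alignment_shift(alignment_1, alignment_2):
    """Mean absolute shift of the aligned partner, per commonly aligned residue.

    Assumption: this is a simplified stand-in for the published
    definition of the mean alignment shift per residue.
    """
    common = alignment_1.keys() & alignment_2.keys()
    if not common:
        return math.inf  # assumption: no shared residues means trivially distinct
    return sum(abs(alignment_1[i] - alignment_2[i]) for i in common) / len(common)


def are_distinct(alignment_1, alignment_2, threshold=5.0):
    """Two alignments are distinct if the mean shift exceeds the threshold."""
    return mean_alignment_shift(alignment_1, alignment_2) > threshold


def is_symmetric_relation(unpermuted_related, cp_related, unpermuted_alignment, permuted_alignment):
    """Symmetric: related both ways, and the two alignments are distinct."""
    return (
        unpermuted_related
        and cp_related
        and are_distinct(unpermuted_alignment, permuted_alignment)
    )


if __name__ == "__main__":
    straight = {i: i for i in range(20)}
    rotated = {i: (i + 10) % 20 for i in range(20)}
    print(mean_alignment_shift(straight, rotated))
```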
Acknowledgments
This study used the high-performance computational capabilities of the Biowulf Cluster at the Center for Information Technology, National Institutes of Health.
The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
Article and publication are at http://www.proteinscience.org/cgi/doi/10.1101/
References
- Baird, G.S., Zacharias, D.A., and Tsien, R.Y. 1999. Circular permutation and receptor insertion within green fluorescent proteins. Proc. Natl. Acad. Sci. USA 96 11241–11246.
- Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N., and Bourne, P.E. 2000. The Protein Data Bank. Nucleic Acids Res. 28 235–242.
- Bowles, D.J., Marcus, S.E., Pappin, D.J., Findlay, J.B., Eliopoulos, E., Maycox, P.R., and Burgess, J. 1986. Posttranslational processing of concanavalin A precursors in jackbean cotyledons. J. Cell Biol. 102 1284–1297.
- Carrington, D.M., Auffret, A., and Hanke, D.E. 1985. Polypeptide ligation occurs during post-translational modification of concanavalin A. Nature 313 64–67.
- Castillo, R.M., Mizuguchi, K., Dhanaraj, V., Albert, A., Blundell, T.L., and Murzin, A.G. 1999. A six-stranded double-ψ β barrel is shared by several protein superfamilies. Structure Fold Des. 7 227–236.
- Cunningham, B.A., Hemperly, J.J., Hopp, T.H., and Edelman, G.E. 1979. Favin versus concanavalin A: Circularly permuted amino acid sequences. Proc. Natl. Acad. Sci. USA 76 3218–3222.
- Doolittle, W.F. 1987. What introns have to tell us: Hierarchy in genome evolution. Cold Spring Harbor Symp. Quant. Biol. 52 907–913.
- Garcia-Vallve, S., Rojas, A., Palau, J., and Romeu, A. 1998. Circular permutants in β-glucosidases (family 3) within a predicted double-domain topology that includes a (β/α)8-barrel. Proteins 31 214–223.
- Gilbert, W. 1987. The exon theory of genes. Cold Spring Harbor Symp. Quant. Biol. 52 901–905.
- Heinemann, U. and Hahn, M. 1995a. Circular permutation of polypeptide chains: Implications for protein folding and stability. Prog. Biophys. Mol. Biol. 64 121–143.
- ———. 1995b. Circular permutations of protein sequence: Not so rare? Trends Biochem. Sci. 20 349–350.
- Iwakura, M., Nakamura, T., Yamane, C., and Maki, K. 2000. Systematic circular permutation of an entire protein reveals essential folding elements. Nat. Struct. Biol. 7 580–585.
- Jeltsch, A. 1999. Circular permutations in the molecular evolution of DNA methyltransferases. J. Mol. Evol. 49 161–164.
- Jia, J., Huang, W., Schorken, U., Sahm, H., Sprenger, G.A., Lindqvist, Y., and Schneider, G. 1996. Crystal structure of transaldolase B from Escherichia coli suggests a circular permutation of the α/β barrel within the class I aldolase family. Structure 4 715–724.
- Jung, J. and Lee, B. 2000. Protein structure alignment using environmental profiles. Protein Eng. 13 535–543.
- Lang, D., Thoma, R., Henn-Sax, M., Sterner, R., and Wilmanns, M. 2000. Structural evidence for evolution of the β/α barrel scaffold by gene duplication and fusion. Science 289 1546–1550.
- Liepinsh, E., Andersson, M., Ruysschaert, J.M., and Otting, G. 1997. Saposin fold revealed by the NMR structure of NK-lysin. Nat. Struct. Biol. 4 793–795.
- Lindqvist, Y. and Schneider, G. 1997. Circular permutations of natural protein sequences: Structural evidence. Curr. Opin. Struct. Biol. 7 422–427.
- Luger, K., Hommel, U., Herold, M., Hofsteenge, J., and Kirschner, K. 1989. Correct folding of circularly permuted variants of a βα-barrel enzyme in vivo. Science 243 206–210.
- Macgregor, E.A., Jespersen, H.M., and Svensson, B. 1996. A circularly permuted α-amylase-type α/β-barrel structure in glucan-synthesizing glucosyltransferases. FEBS Lett. 378 263–266.
- Marcotte, E.M., Pellegrini, M., Yeates, T.O., and Eisenberg, D. 1999. A census of protein repeats. J. Mol. Biol. 293 151–160.
- McWherter, C.A., Feng, Y., Zurfluh, L.L., Klein, B.K., Baganoff, M.P., Polazzi, J.O., Hood, W.F., Paik, K., Abegg, A.L., Grabbe, E.S., et al. 1999. Circular permutation of the granulocyte colony-stimulating factor receptor agonist domain of myelopoietin. Biochemistry 38 4564–4571.
- Murzin, A.G. 1998. Probable circular permutation in the flavin-binding domain. Nat. Struct. Biol. 5 101.
- Murzin, A.G., Brenner, S.E., Hubbard, T., and Chothia, C. 1995. SCOP: A structural classification of proteins database for the investigation of sequence and structures. J. Mol. Biol. 247 536–540.
- Nakamura, T. and Iwakura, M. 1999. Circular permutation analysis as a method for distinction of functional elements in the M20 loop of Escherichia coli dihydrofolate. J. Biol. Chem. 274 19041–19047.
- Nalefski, E.A. and Falke, J.J. 1996. The C2 domain calcium-binding motif: Structural and functional diversity. Protein Sci. 5 2375–2390.
- Patthy, L. 1993. Modular design of proteases of coagulation, fibrinolysis, and complement activation: Implications for protein engineering and structure–function studies. Methods Enzymol. 222 10–21.
- Polekhina, G., Board, P.G., Gali, R.R., Rossjohn, J., and Parker, M.W. 1999. Molecular basis of glutathione synthetase deficiency and a rare gene permutation event. EMBO J. 18 3204–3213.
- Ponting, C.P. and Russell, R.B. 1995. Swaposins: Circular permutations within genes encoding saposin homologues. Trends Biochem. Sci. 20 179–180.
- Sergeev, Y. and Lee, B. 1994. Alignment of β-barrels in (β/α)8 proteins using hydrogen bonding pattern. J. Mol. Biol. 244 168–182.
- Thornton, J.M. and Sibanda, B.L. 1983. Amino and carboxy-terminal regions in globular proteins. J. Mol. Biol. 167 443–460.
- Uliel, S., Fliess, A., Amir, A., and Unger, R. 1999. A simple algorithm for detecting circular permutations in proteins. Bioinformatics 15 930–936.