Abstract
Hetero-dimer (different monomers) interfaces are involved in catalysis and regulation through the formation of interface active sites, which is critical in cellular and molecular biology events. The physical and chemical factors determining the formation of interface active sites are often numerous, and the combined role of interacting features is frequently combinatorial and additive in nature. Therefore, it is important to determine the physical and chemical features of such interactions. A number of such features have been documented in the literature since 1975. However, the use of such interaction features in the prediction of interaction partners and sites given their sequences is still a challenge. In a non-redundant dataset of 156 hetero-dimer structures determined by X-ray crystallography, the interacting partners often vary in size, and thus size variation between subunits is an important factor in determining the mode of interface formation. The interacting protein subunits are paired as small-small, large-large, medium-medium, large-small, large-medium or small-medium. The interface formed between subunits also has physical interactions at the N terminal (N), C terminal (C) and middle (M) regions of the protein with reference to their sequences in one dimension. These features are believed to have application in the prediction of interaction partners and sites from sequences. However, the use of such features for interaction prediction from sequence is not currently clear.
Keywords: protein-protein interaction, mode of interaction, protein size, interface
Background
Protein hetero-dimer subunit interaction is important in regulation and catalysis in living cells. The modes and types of protein-protein interactions involving hetero-dimer protein complexes are numerous in both prokaryotic and eukaryotic cells. However, documented data on such interactions are inadequate in the literature [1–17]. The formation of the interface between the two subunits is governed by both biophysical and chemical features, as described in a number of studies [1–17]. The features documented thus far are based on structural datasets of protein-protein complexes of limited size. In these studies, geometrical properties (interface size, planarity, sphericity and complementarity) and chemical properties (the types of amino acid chemical groups, hydrophobicity, electrostatic interactions and H-bonds) are frequently analyzed.
Janin and colleagues have described the principles of protein-protein interaction since 1975, using datasets that grew from a modest 3 protein complexes to more than 75 complexes in recent years [1–11]. Some of the parameters described by them for understanding protein subunit interactions include close atomic packing [1,2,8], hydrophobicity and its free energy [1,2,10,16], structural mobility and interface conformation changes [4,11], an interface-area-based crude energy function [5], surface complementarity [8], interface size and chemistry [3,6,13,15,16,17], a statistically derived mean-field potential [7], polar interactions [8], the atomic nature of recognition sites [8,9,12], an interface residue propensity score [10], hydrophobic score indexes for atomic packing [10] and interface hydrophobic patches [17]. Thornton and colleagues simultaneously studied size and shape, surface complementarity, interface propensity, hydrophobicity, H-bonds, segmentation and secondary structures, and conformational changes in protein subunit complexes [12,13,14]. Nussinov and colleagues described hydrogen bonds and their geometry, salt bridges and their distributions, interface hydrophobicity and charge distributions at the interface [15,16]. These data later led to the development and benchmarking of interaction functions for protein docking predictions by Weng and colleagues [18–19]. The size of the test cases used for the identification of structural features in the development of docking scoring functions is often the challenge in protein-protein interaction prediction. Despite this progress, the use of such interaction features for the prediction of interaction partners and sites given their sequences remains a challenge. Our interest is to describe hetero-dimer protein-protein interactions using size and mode of interactions in a non-redundant dataset of 156 hetero-dimer structures determined by X-ray crystallography.
Methodology
Dataset
We used a non-redundant dataset of 156 hetero-dimers created by Zhanhua and colleagues (Table 1 in supplementary material) [20]. This non-redundant dataset was created from an initial redundant set of 2,488 hetero-dimer structures downloaded from PDB (Protein Databank) and PQS (Protein Quaternary Structure Server). The redundant entries were removed using a 30% sequence similarity cut-off. These structures were solved by X-ray crystallography with resolution ≤ 2.5 Å. Each entry in the dataset is a complex of two different monomers (different proteins) of varying lengths (Table 2 in supplementary material). The minimum and maximum lengths for subunit A are 68 and 535 residues, respectively; those for subunit B are 41 and 456 residues, respectively. The dataset means for subunits A and B are about 140 and 252 residues, respectively, and the standard deviations about the mean are about 109 and 78, respectively. The categorization of the dataset based on the source type for chains A and B is given in Table 3 (see supplementary material). The dataset is dominated by complexes in which chains A and B come from the same source, and these have regulatory/catalytic functions. The distribution of the protein-protein complexes with different A and B source types is given in Table 4 (see supplementary material).
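The length statistics quoted above (minimum, maximum, mean and standard deviation per subunit) can be reproduced with a few lines of Python. The following is a minimal sketch only: the function name `length_summary` and the toy lengths are hypothetical and are not taken from the 156-complex dataset, and the use of the sample standard deviation is an assumption, since the text does not state which estimator was used.

```python
from statistics import mean, stdev


def length_summary(lengths):
    """Return min, max, mean and sample standard deviation of chain lengths."""
    return {
        "min": min(lengths),
        "max": max(lengths),
        "mean": mean(lengths),
        "sd": stdev(lengths),  # sample SD (n - 1); an assumption, see lead-in
    }


# Toy chain lengths in residues, for illustration only (NOT dataset values).
toy_lengths_a = [68, 120, 150, 300, 535]
summary = length_summary(toy_lengths_a)
print(summary)
```

Applying the same function to the subunit A and subunit B length columns of Table 2 would be expected to give the per-subunit values reported in the text, provided the same estimator was used.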
Size based grouping of complexes
Size variation between the interacting partners is an important factor in determining the mode of interface formation. The interacting protein subunits are paired as small-small, large-large, medium-medium, large-small, large-medium or small-medium (Figure 1). Proteins of 100-300 residues are most represented for subunit A, and proteins of 50-200 residues for subunit B, in the dataset (Table 5 in supplementary material).
Figure 1.
Protein-protein interactions and protein size illustrated using examples: (a) small protein - small protein; (b) large protein - large protein; (c) large protein - small protein.
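The size-based grouping described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions: the text does not give the residue cut-offs that separate small, medium and large proteins, so the default thresholds (`small_max=100`, `medium_max=300`) and the function names are hypothetical placeholders, not the values used in the analysis.

```python
# Pair labels follow the six categories named in the text.
PAIR_LABELS = {
    frozenset(["small"]): "small-small",
    frozenset(["medium"]): "medium-medium",
    frozenset(["large"]): "large-large",
    frozenset(["large", "small"]): "large-small",
    frozenset(["large", "medium"]): "large-medium",
    frozenset(["small", "medium"]): "small-medium",
}


def size_class(length, small_max=100, medium_max=300):
    """Classify a chain by length; the cut-offs are assumed, not from the paper."""
    if length <= small_max:
        return "small"
    if length <= medium_max:
        return "medium"
    return "large"


def size_pair(len_a, len_b, small_max=100, medium_max=300):
    """Return the size category of a hetero-dimer, independent of chain order."""
    classes = frozenset(
        size_class(n, small_max, medium_max) for n in (len_a, len_b)
    )
    return PAIR_LABELS[classes]


print(size_pair(535, 41))   # a long chain paired with a short chain
print(size_pair(150, 456))  # a mid-length chain paired with a long chain
```

Using a `frozenset` as the lookup key makes the category independent of which chain is labelled A or B, which matches the unordered way the six categories are listed in the text.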
Modes of interface interaction
The interface formed between subunits A and B has physical interactions at the N terminal (N), C terminal (C) and middle (M) regions of the protein with reference to its sequence in one dimension (Figure 2). A representation of the protein-protein interface in 3D, 2D and 1D is shown for the chemotaxis proteins CheY and CheA (PDB ID: 1FFG) in Figure 2. In this dataset, nearly 33% of interfaces have the NMC-NMC mode of interaction (Table 6). However, the dataset contains a wide variety of interaction modes, such as N-N, C-C, NMC-NMC, M-M, NC-NC, NM-NM and MC-MC (Table 4 in supplementary material).
Figure 2.
Representation of protein-protein interfaces in 3D (structure complex in panel a), 2D (X-Y plots in panel b) and 1D (sequence in panel c).
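The assignment of an interaction mode such as NMC-NMC or C-M from interface residue positions can be sketched as below. This is a hedged illustration: the text does not define where the N terminal, middle and C terminal regions begin and end, so splitting each sequence into equal thirds is an assumption, and the function names are hypothetical.

```python
def region_of(position, length):
    """Map a 1-based residue position to N, M or C using equal thirds (assumed)."""
    if position <= length / 3:
        return "N"
    if position <= 2 * length / 3:
        return "M"
    return "C"


def chain_mode(interface_positions, length):
    """Return the regions of one chain that contain interface residues, e.g. 'NMC'."""
    found = {region_of(p, length) for p in interface_positions}
    return "".join(r for r in "NMC" if r in found)


def interface_mode(pos_a, len_a, pos_b, len_b):
    """Combine the per-chain modes into a label such as 'C-M' or 'NMC-NMC'."""
    return f"{chain_mode(pos_a, len_a)}-{chain_mode(pos_b, len_b)}"


# Toy positions for illustration only (not the 1FFG interface residues).
print(interface_mode([90, 99], 100, [40, 45, 60], 100))
```

Counting the labels returned by `interface_mode` over all complexes would give a frequency table of modes of the kind summarized in Table 6, subject to the region boundaries actually used.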
Interface residues
Interface residues in a protein-protein complex are identified by calculating the change in accessible surface area (delta ASA) upon complex formation from the individual monomers, using the software Surface Racer 5.0 with a probe radius of 1.4 Å in the Lee and Richards implementation [21]. The interface residues are represented as a function of residue number using delta ASA in an X-Y plot (Figure 2b).
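Once per-residue ASA values are available for a chain in isolation and in the complex, the delta ASA step reduces to a subtraction and a filter. The sketch below assumes the ASA values have already been computed elsewhere (for example by a surface-area program) and loaded into plain dictionaries keyed by residue number; it does not call Surface Racer. The function name, the `cutoff` parameter and the toy values are hypothetical, and the text does not state what minimum delta ASA, if any, was required.

```python
def interface_residues(asa_monomer, asa_complex, cutoff=0.0):
    """Return {residue_number: delta_ASA} for residues buried on complex formation.

    delta ASA = ASA in the isolated monomer - ASA in the complex (in Å^2).
    Residues with delta ASA greater than `cutoff` are reported; the cut-off
    value is an assumption, not taken from the paper.
    """
    result = {}
    for residue, asa_free in asa_monomer.items():
        delta = asa_free - asa_complex[residue]
        if delta > cutoff:
            result[residue] = delta
    return result


# Toy per-residue ASA values in Å^2, for illustration only.
toy_monomer = {1: 120.0, 2: 80.0, 3: 45.5}
toy_complex = {1: 120.0, 2: 30.0, 3: 45.0}
buried = interface_residues(toy_monomer, toy_complex, cutoff=1.0)
print(buried)
```

Plotting the returned delta ASA values against residue number gives an X-Y representation of the kind shown in Figure 2b.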
Results
Figure 1 illustrates the role of protein size in protein-protein interactions using examples. Examples of protein-protein complexes are used to show interactions between
small-small proteins;
large-large proteins; and
large-small proteins.
Thus, protein-protein complexes are formed between small, large and medium sized proteins in different combinations. The protein-protein complexes in the dataset of 156 non-redundant structures are subsequently classified based on the sizes of the proteins forming the interface (Table 5 in supplementary material). Proteins of 100-300 residues are most represented for subunit A, and proteins of 50-200 residues for subunit B, in the dataset.
Figure 2 illustrates the representation of protein-protein interfaces in 3 dimensions (structure complex in 2a), 2 dimensions (X-Y plots in 2b) and 1 dimension (sequence in 2c) using the example structural complex of CheY-CheA with the Protein Data Bank (PDB) entry 1FFG. A qualitative representation of the interface between the interacting proteins CheY and CheA is given in Figure 2a using 3-dimensional structural features. However, a quantitative understanding of the interacting residues forming the interface is not possible using this representation. Thus, we represented the interface residues as a function of residue number using delta ASA in Figure 2b. The delta ASA calculated for the complex using the Lee and Richards implementation (1977) determines the extent to which residues are involved in the formation of the interface. This shows the region (N or C terminal) in the sequence (1-dimensional) involved in the formation of the interface. Figure 2b shows that residues towards the C terminal in chain A and residues in the middle regions of chain B are involved in the formation of the interface. Figure 2c highlights (bold letters) interface residues in sequence (1 dimension). Thus, Figure 2 shows the representation of the protein-protein interface using structure (3D), X-Y plots (2D) and sequence (1D) for visual comparison in all three dimensions. We extended this principle to all 156 protein-protein complex structures in the dataset. As shown in Figure 2, the interface formed between subunits A and B has physical interactions at the N terminal (N), C terminal (C) and middle (M) regions of the protein with reference to its sequence in one dimension for all the structures (Table 6 in supplementary material). In this dataset, nearly 33% of interfaces have the NMC-NMC mode of interaction. However, the dataset contains a wide variety of interaction modes, such as N-N, C-C, NMC-NMC, M-M, NC-NC, NM-NM and MC-MC.
Discussion
The principle of protein-protein interactions in living cells has been a subject of debate and discussion for about four decades [1–18]. The importance of a protein-protein interface for catalysis and regulation has been frequently described in recent years (Table 3 in supplementary material). It is also known that many such interactions are associated with enzyme-inhibitor and antigen-antibody complexes (Table 4 in supplementary material). The formation of an interface in the cellular milieu is often dictated by biophysical features followed by chemical properties to activate a cascade of subsequent reactions. The biophysical aspects of the protein-protein interaction principles have been extensively studied using protein complexes determined by X-ray crystallography [1–16]. Janin and colleagues [1–11], Thornton and colleagues [12–14] and Nussinov and colleagues [15–16] studied protein complexes in the Protein Data Bank (PDB) and documented several structural features that are found to be significant. The features described thus far include atom packing, hydrophobicity, polarity, shape complementarity, conformational effect, interface size, residue propensity and interface hydrogen bonds. The information documented has significantly improved the understanding of protein-protein interactions using biophysical data. The use of these data to develop interaction scoring functions for application in protein docking has been shown [18–19]. Thus, it is now feasible to dock protein structures, given the structures of the interacting partners, through the identification of recognition sites in known structural partners. Nonetheless, the prediction of interaction partners and sites given their sequences is still a grand challenge. Hence, we studied a non-redundant dataset of 156 hetero-dimers to understand their modes of interaction in structure and to relate these to their corresponding sequences.
The dataset consists of 89 structures with a regulatory role and 61 with an enzyme-inhibitor function (Table 3 and Table 4 in supplementary material). The sizes of the proteins forming the complexes vary across the dataset, with a large standard deviation about the mean (Table 2 in supplementary material). Protein-protein complexes in the dataset are formed between small, large and medium sized proteins in different combinations, as shown in Figure 1. The protein-protein complexes in the dataset of 156 structures are subsequently classified based on the sizes of the proteins forming the interface (Table 5 in supplementary material). Proteins of 100-300 residues are most represented for subunit A, and proteins of 50-200 residues for subunit B, in the dataset. Thus, size variation between the interacting partners is an important factor in determining the mode of interface formation. The interacting protein subunits are paired as small-small, large-large, medium-medium, large-small, large-medium or small-medium. Nonetheless, it should be emphasized that conceptualization of size variations between and across interacting partners is a challenge. This is an important factor in hetero-dimer interaction, as proteins of different sizes are capable of interacting in specific and sensitive ways. We believe that formulating a method that incorporates protein size for potential interaction and selection will provide insight for detecting protein partners and sites from sequence. It is also important to categorize interacting proteins based on their size to relate and link their potential partners.
The future challenge is to detect interaction partners and sites from their primary sequence. This requires relating interface features in structures to sequences. Hence, we attempted to represent interface features in 3 dimensions using 2-dimensional X-Y plots, as shown in Figure 2b. In this example, the two proteins of the CheY-CheA complex (PDB ID 1FFG) are shown to interact structurally in 3 dimensions. The interface is represented qualitatively and can be appreciated visually by the reader. Thus, a qualitative representation of the interface between the interacting proteins CheY and CheA is established in Figure 2a using 3-dimensional structural features. However, this representation does not provide information on the interacting residue numbers at the interface. Thus, we represented the interface residues as a function of residue number using delta ASA in Figure 2b. The delta ASA calculated for the complex using the Lee and Richards implementation (1977) determines the extent to which residues are involved in the formation of the interface. This shows the region (N or C terminal) in the sequence (1-dimensional) involved in the formation of the interface. Figure 2b shows that residues towards the C terminal in chain A and residues in the middle regions of chain B are involved in the formation of the interface. Figure 2c highlights (bold letters) interface residues in sequence (1 dimension). Thus, Figure 2 shows the representation of the protein-protein interface using structure (3D), X-Y plots (2D) and sequence (1D) for visual comparison in all three dimensions. We then extended the same principle to all 156 protein-protein complex structures in the dataset. As shown in Figure 2, the interface formed between subunits A and B has physical interactions at the N terminal (N), C terminal (C) and middle (M) regions of the protein with reference to its sequence in one dimension for all the structures (Table 6 in supplementary material).
Nearly 33% of interfaces in the dataset have the NMC-NMC mode of interaction. However, the dataset contains a wide variety of interaction modes, such as N-N, C-C, NMC-NMC, M-M, NC-NC, NM-NM and MC-MC. Thus, the analysis provides a method to relate interface features in structure to their occurrence in sequence using delta ASA represented in X-Y plots. We believe that this procedure, when incorporated in a formulation (development of a method), will help to detect interaction partners and sites from sequences.
Conclusion
Protein hetero-dimer (different monomers in sequence composition) subunits interact in cellular systems to create interface active sites for catalysis and regulation. The formation of the interface active site is generally determined by the interface shape complementarity and chemical properties between the interacting subunits. These interactions are usually specific and sensitive between the interacting partners. The number of possible interacting partners is often huge, and its magnitude is frequently beyond our realization for both prokaryotic and eukaryotic cells. The physical and chemical properties determining subunit interactions have been shown in relatively small datasets of hetero-dimer structures determined by X-ray crystallography. However, the information gleaned thus far from such datasets is typically of limited use for the prediction of interaction partners and sites given their sequences. Hence, we analyzed a non-redundant dataset of 156 hetero-dimer structures determined by X-ray crystallography to assemble information for understanding interface types using size, shape and interaction mode, for potential application in protein-protein interaction prediction from sequence. In this analysis, the dataset mean size of subunit A (140 residues) is smaller than that of subunit B (250 residues), with a large standard deviation (109.4). Hence, the interacting partners often vary in size. Thus, size variation between the interacting partners is an important factor in determining the mode of interface formation. The interacting protein subunits are paired as small-small, large-large, medium-medium, large-small, large-medium or small-medium. Proteins of 100-300 residues are most represented for subunit A, and proteins of 50-200 residues for subunit B, in the dataset. Nonetheless, it should be emphasized that conceptualization of size variations between and across interacting partners is a challenge.
The interface formed between subunits A and B has physical interactions at the N terminal (N), C terminal (C) and middle (M) regions of the protein with reference to its sequence in one dimension. In this dataset, nearly 33% of interfaces have the NMC-NMC mode of interaction. However, the dataset contains a wide variety of interaction modes, such as N-N, C-C, NMC-NMC, M-M, NC-NC, NM-NM and MC-MC. These features are implied to have application in the prediction of interaction partners and sites. Nonetheless, how such features can be used for protein-protein interaction prediction from sequence is not explicit at the moment.
Supplementary material
Acknowledgments
We express our sincere appreciation to all members of Biomedical Informatics for many discussions on the subject of this article.
Footnotes
Citation: Vaishnavi et al, Bioinformation 4(7): 310-319 (2010)
References
- 1. Chothia C, Janin J. Nature. 1975;256:705–708. doi: 10.1038/256705a0.
- 2. Chothia C, et al. Proc Natl Acad Sci. 1976;73:3793–3797. doi: 10.1073/pnas.73.11.3793.
- 3. Miller S, et al. Nature. 1987;328:834–836. doi: 10.1038/328834a0.
- 4. Janin J, Chothia C. J Biol Chem. 1990;265:16027–16030.
- 5. Duquerroy S, et al. Proteins. 1991;11:271–280. doi: 10.1002/prot.340110406.
- 6. Janin J. Biochimie. 1995;77:497–505. doi: 10.1016/0300-9084(96)88166-1.
- 7. Robert CH, Janin J. J Mol Biol. 1998;283:1037–1047. doi: 10.1006/jmbi.1998.2152.
- 8. Lo Conte L, et al. J Mol Biol. 1999;285:2177–2198. doi: 10.1006/jmbi.1998.2439.
- 9. Chakrabarti P, Janin J. Proteins. 2002;47:334–343. doi: 10.1002/prot.10085.
- 10. Bahadur RP, et al. J Mol Biol. 2004;336:943–955. doi: 10.1016/j.jmb.2003.12.073.
- 11. Bernauer J, et al. Phys Biol. 2005;2:S17–S23. doi: 10.1088/1478-3975/2/2/S02.
- 12. Jones S, Thornton JM. Prog Biophys Mol Biol. 1995;63:31. doi: 10.1016/0079-6107(94)00008-w.
- 13. Jones S, Thornton JM. Proc Natl Acad Sci. 1996;93:13. doi: 10.1073/pnas.93.1.13.
- 14. Nooren IM, Thornton JM. J Mol Biol. 2003;325:991. doi: 10.1016/s0022-2836(02)01281-0.
- 15. Xu D, et al. Protein Engineering. 1997;10:999. doi: 10.1093/protein/10.9.999.
- 16. Tsai CJ, et al. Protein Science. 1997;6:53. doi: 10.1002/pro.5560060106.
- 17. Lijnzaad P, Argos P. Proteins. 1997;28:333.
- 18. Mintseris J, et al. Proteins. 2005;60:214–216. doi: 10.1002/prot.20560.
- 19. Hwang H, et al. Proteins. 2008;73:705–709. doi: 10.1002/prot.22106.
- 20. Zhanhua C, et al. Bioinformation. 2005;1(2):28–39. doi: 10.6026/97320630001028.
- 21. Tsodikov OV, et al. J Comput Chem. 2002;23:600–609. doi: 10.1002/jcc.10061.