Proceedings of the National Academy of Sciences of the United States of America. 2017 Nov 28;114(50):E10612–E10621. doi: 10.1073/pnas.1712021114

Patterns of coevolving amino acids unveil structural and dynamical domains

Daniele Granata a,1,2, Luca Ponzoni b,1,2, Cristian Micheletti b,2, Vincenzo Carnevale a,2
PMCID: PMC5740617  PMID: 29183970

Significance

Patterns of pairwise correlations in sequence alignments can be used to reconstruct the network of residue-residue contacts and thus the three-dimensional structure of proteins. Less explored, and yet extremely intriguing, is the functional relevance of such coevolving networks: Do they encode for the collective motions occurring in proteins at thermal equilibrium? Here, by combining coevolutionary coupling analysis with a state-of-the-art dimensionality reduction approach, we show that the network of pairwise evolutionary couplings can be analyzed to reveal communities of amino acids, which we term “evolutionary domains,” that are in striking agreement with the quasi-rigid protein domains obtained from elastic network models and molecular dynamics simulations.

Keywords: coevolution, protein domains, spectral clustering, structural dynamics, allosteric networks

Abstract

Patterns of interacting amino acids are so preserved within protein families that the sole analysis of evolutionary comutations can identify pairs of contacting residues. It is also known that evolution conserves functional dynamics, i.e., the concerted motion or displacement of large protein regions or domains. Is it, therefore, possible to use a pure sequence-based analysis to identify these dynamical domains? To address this question, we introduce here a general coevolutionary coupling analysis strategy and apply it to a curated sequence database of hundreds of protein families. For most families, the sequence-based method partitions amino acids into a few clusters. When viewed in the context of the native structure, these clusters have the signature characteristics of viable protein domains: They are spatially separated but individually compact. They have a direct functional bearing too, as shown for various reference cases. We conclude that even large-scale structural and functionally related properties can be recovered from inference methods applied to evolutionary-related sequences. The method introduced here is available as a software package and web server (spectrus.sissa.it/spectrus-evo_webserver).


A powerful paradigm in molecular biology is the flow of information from the chemical composition of proteins to their biological function, which is typically viewed as a chain of implications: The protein sequence encodes for the structure, which, in turn, assists function (1, 2). Much attention has been—and still is—paid to the two key steps in this logical ladder, namely the sequence–structure and structure–function relationships. These, however, have been mostly considered separately, and addressed with distinct conceptual frameworks and tools.

For instance, a now well-established mediator of structure and function for globular proteins is their internal dynamics. Single-molecule experiments, in fact, have provided vivid and quantitative descriptions of the dynamical basis of protein function (3–7). Computational and theoretical studies, from atomistic molecular dynamics (MD) simulations (8–10) to coarse-grained elastic networks (11–13), have also provided a detailed understanding of the strong ties between proteins’ structural architecture and internal dynamics. In particular, the secondary and higher-order structural organization of several proteins and enzymes is well suited to sustain collective conformational changes needed for function (3, 4, 14). These large-scale changes can therefore be efficiently excited by thermal fluctuations or triggered by the binding of ligands and effectors (15–31).

Efforts to clarify the sequence–structure relationship have also followed different routes: from the development of increasingly accurate force fields to be used in unbiased folding simulations (8, 9) to higher-level approaches where structural features are inferred from the sole physicochemical or statistical profiling of the primary sequence (32–36). Recent methodological breakthroughs for the latter contexts involve the application of statistical inference techniques to the analysis of multiple sequence alignments (MSAs). Correlated substitutions can help identify those sites that host coevolving mutations, and these, in turn, are an indicator of spatial proximity (37–46). The natural question posed by these parallel advances in the sequence–structure and structure–function relationships is whether or not it is at all feasible to establish a more direct connection between them (29, 47). In particular, one may ask if, going beyond the simple impact of function on sequence conservation, covariation between pairs of amino acids can be directly related to functional properties without relying on the prior knowledge of the structure. A positive answer to this question would have important practical implications. For instance, it could be used in contexts where structural information is covered more sparsely than at the sequence level. It could also clarify how local mutations in a given protein family are related to conserved global functional features, which is beyond the reach of structure-based approaches.

To our knowledge, this overarching question has been previously addressed only in specific, though important, contexts (29, 48–58), and hence a more comprehensive, general approach would be particularly valuable. Here, motivated by these studies and especially by the protein sector analysis of ref. 49, we carry out a systematic characterization of the sequence–function relationship without harnessing the wealth of dynamical properties encoded in protein structures. We shall, in fact, only rely on sequence-based coevolutionary data and use it to infer dynamical/functional domains whose organization has been conserved by evolution. We term such fundamental units “evolutionary domains” (EDs).

The presentation of the strategy is articulated in the following way. First, we introduce and apply the method to a prototypical system, namely adenylate kinase, and show the consistency of the EDs with known structural and functional units for the enzyme. Next, for a database-wide survey, we apply the method to the previously annotated set of about 800 MSAs of ref. 42. The results show that the EDs, inferred from the sole sequence comutations analysis, are compact in space and well consistent with dynamical, or quasi-rigid domains established from structural fluctuations. They demonstrate also a strong robustness with respect to the number of sequences contained in the MSA and to the methods used in the inference of the coevolutionary coupling and are in good agreement with protein sector analysis (49), as discussed in detail in Results and Discussion and SI Appendix. Finally, for a more direct functional interpretation of the EDs, we consider various representatives of the ion channels superfamily (59). For such members, that have functionally diverged while retaining the same overall structural architecture, we show that our sequence-based partitioning highlights a diverse spatial organization of the domains, allowing for comparative functional analysis.

Results and Discussion

Our strategy is based on the analysis of comutating pairs of amino acids to identify groups of residues that have putatively evolved in a concerted manner. Revealing such coevolving groups is important because they are expected to reflect the action of structural and functional constraints operating at a larger scale than the pairwise one, which can be elegantly and effectively probed by direct coupling analysis methods (39, 42, 60). Given the expected extended organization of such coevolving clusters of amino acids, we shall refer to them as EDs. A detailed description of the ED search method is given in Materials and Methods. For completeness, and to make this section self-contained, we provide a brief summary of the method before discussing its applications.

Searching for EDs: Methodological Overview.

As summarized in Fig. 1A, the input of the ED search strategy for a given protein is the matrix of the statistical couplings between all pairs of amino acid positions within the MSA of its protein family. Our method of choice for the coupling analysis is the plmDCA approach described in ref. 60, but similar results can be obtained by other approaches, such as gplmDCA and plmDCA20 (42) (Materials and Methods and SI Appendix).

Fig. 1.

(A) Schematic illustration of the steps performed for identifying the groups of coevolving residues (EDs) from a protein MSA. (B) Application to adenylate kinase. The evolutionary partitioning is compared with the subdivisions into quasi-rigid dynamical domains (DDs) obtained from the analysis of an MD simulation with the SPECTRUS webserver (61). The local maxima of the quality score guide the choice of the main sequence- and dynamics-based subdivisions, shown in two color-coded representations, both on the protein structure and on its sequence.

The evolutionary relatedness encoded by the statistical coupling is used to assign a similarity score (or evolutionary proximity) between pairs of amino acids. The subdivision of the whole sequence in multiple EDs is then the result of a clustering procedure, the spectral clustering (62), which returns an optimal set of densely connected groups. To ensure a robust domain subdivision, the similarity matrix is regularized into a k-nearest neighbors graph by retaining the top k=7 strongest evolutionary couplings of each amino acid, which have been found to maximize the clustering properties of the coupling network (as will be discussed in Dataset-Wide Survey). The k-nearest neighbors strategy has been chosen for its simplicity, but we note that the final outcomes are not significantly affected by the particular sparsification strategy used to regularize the similarity matrix (Materials and Methods and SI Appendix).
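The sparsification step described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function name `knn_graph`, the toy coupling values, and the rule used to symmetrize the graph (an edge is kept if either residue ranks the other among its top k) are assumptions made for the example.

```python
def knn_graph(couplings, k=7):
    """Return a symmetric k-nearest-neighbors graph as a dict of neighbor sets.

    couplings: square list of lists; couplings[i][j] is the coupling
    strength between positions i and j (larger means stronger).
    """
    n = len(couplings)
    neighbors = {i: set() for i in range(n)}
    for i in range(n):
        # Rank all other positions by decreasing coupling strength with i.
        ranked = sorted((j for j in range(n) if j != i),
                        key=lambda j: couplings[i][j], reverse=True)
        for j in ranked[:k]:
            # Assumed symmetrization: keep the edge if either endpoint selects it.
            neighbors[i].add(j)
            neighbors[j].add(i)
    return neighbors


# Toy 4-position coupling matrix (hypothetical values, for illustration only).
J = [
    [0.0, 0.9, 0.1, 0.2],
    [0.9, 0.0, 0.3, 0.1],
    [0.1, 0.3, 0.0, 0.8],
    [0.2, 0.1, 0.8, 0.0],
]
graph = knn_graph(J, k=1)
```

In this toy case, positions 0 and 1 select each other, as do positions 2 and 3, so the graph splits into two pairs. The resulting graph would then be passed to a spectral clustering routine, which is not shown here.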

The number of evolutionary-based subdivisions is not specified a priori. Instead, it is set by analyzing the profile of a quality score that indicates the best protein decomposition as a function of the number of domains. This provides a consistent description of the protein at both large and small scales, corresponding to low and high numbers of clusters, respectively. The clustering strategy is similar to that used by the SPECTRUS (spectral-based rigid units subdivision) algorithm (61) to determine dynamical, quasi-rigid domains in proteins or protein complexes.

A Test Case: Adenylate Kinase.

To illustrate and validate the sequence-based ED decomposition, we first apply it to Escherichia coli adenylate kinase [Protein Data Bank (PDB) ID 4AKE], a standard benchmark for domain partitioning methods. The results are given in Fig. 1B. The red curve in the upper left graph shows the quality score, Q, for subdivisions of the enzyme into an increasing number of EDs. The quality score reflects how sharply defined, according to the clustering metrics, is the returned optimal subdivision compared with random partitions. The highest scores for the sequence-based partitioning are found for Q=3, 6, and 9 EDs. The structural and sequence-wise representations of the partitions into Q = 3 and 6 domains are given in Fig. 1B. Note that EDs, which can span several intercalated stretches of the primary sequence, are nonetheless structurally compact. This is a noteworthy and intriguing result, since the evolutionary subdivisions are exclusively sequence-based, with no input about the actual structure of the protein.

As a matter of fact, the returned subdivisions are viable from the structural and functional point of view. This emerges from their comparison with quasi-rigid, dynamical domains (DDs). These were identified with the SPECTRUS web server (61) using as input the structural fluctuations observed in extensive MD simulations of adenylate kinase. As shown in Fig. 1B, the Q=3 and Q=6 evolutionary subdivisions are well consistent, both structurally and sequence-wise, with the high-scoring quasi-rigid partitionings into similar numbers of domains. In particular, for both cases, the Q=3 subdivision corresponds to the well-known partitioning into three main functional domains, namely the ATP-binding site, the AMP-binding site, and the core, shown respectively in red, gray, and blue. In addition, even the finest partitionings (Q=9 and Q=10; see SI Appendix, Fig. S1) provide consistent decompositions in the two cases and highlight structural elements that are arguably crucial for the protein functional dynamics.

The result is noteworthy because, although sequences code for both structural and functional properties, it would have been difficult to anticipate that the latter could be obtained directly from the primary sequence without additionally using a 3D conformation. In addition, although DCA is a very powerful means of extracting reliable indications of proteins’ folds, we are not aware of documented instances where DCA-derived structural information was used to infer functional movements. These considerations reinforce the significance of showing that functional and structural domains can be directly and confidently extracted from the coupling analysis of MSAs (see SI Appendix, Fig. S2).

Dataset-Wide Survey.

For a systematic characterization of the EDs, we then extended the analysis to a dataset of 813 MSAs compiled by Feinauer et al. (42). This was chosen for two main reasons. First, it gives a comprehensive coverage of several protein families, with various MSA sizes (from 16 to 65,000 entries) and protein lengths (from 30 to 500 amino acids). Second, a PDB entry is available for a representative protein of each family/MSA. This is a key element in this study because it allows us to assess the spatial compactness of ED decompositions, which are exclusively based on the sequences defining each MSA, and to compare them to the DDs subdivisions.

Evolutionary couplings: Clustering propensity and community structure.

As a preliminary step toward identifying the EDs, we first investigated if the input networks of statistical couplings Jij, obtained from coevolutionary analysis, exhibit an intrinsic propensity to be densely organized and, thus, to be clustered. As detailed in Materials and Methods, such propensity is conveniently captured by ΔC = C − Crand, that is, the difference between the clustering coefficient of the k-nearest neighbors graph, C, and that of a randomized, reshuffled version, Crand (63), where the clustering coefficient measures the probability that two neighbors of a vertex are also connected to each other. As shown in Fig. 2A, this quantity also proves useful in choosing the optimal k, since the different graphs usually show a maximum of ΔC at k=7, especially for MSAs containing a large number of sequences (see also SI Appendix, Fig. S3 for the other inference methods). Importantly, the MSA size (calculated as effective number of sequences, Nseqeff, i.e., the number of sequences in the set whose mutual identity is smaller than 90%) crucially affects the clustering propensity of the similarity graph, as clarified by the strong correlation between these two quantities shown in Fig. 2B (and SI Appendix, Fig. S4). Thus, when a large dataset is available and the reconstruction of the network of couplings is most reliable, the latter shows a high tendency to cluster and an unambiguous number of “relevant” neighbors (k=7), which is indicative of an inherent collective organization of the coevolution patterns. Strikingly, this number coincides with the average number of structural neighbors surrounding each residue in protein structures (6.75±0.04, calculated on the PDB structures of this dataset using a Cβ–Cβ distance threshold of 8.5 Å as in ref. 42). We also note that, for k=7, the percentage of true contacts (including along the sequence) is systematically larger than 50%, especially for larger MSA size (SI Appendix, Fig. S5 and Materials and Methods).
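As a rough illustration of the quantity ΔC = C − Crand, the sketch below computes the average local clustering coefficient of an undirected graph and subtracts the mean value obtained for random graphs with the same numbers of nodes and edges. The reshuffling scheme, the function names, and the number of random replicas are assumptions made for this example; the randomization actually used in the paper is described in its Materials and Methods and may differ.

```python
import random
from itertools import combinations


def clustering_coefficient(neighbors):
    """Average local clustering coefficient of an undirected graph."""
    total = 0.0
    for nbrs in neighbors.values():
        if len(nbrs) < 2:
            continue  # nodes with fewer than two neighbors contribute 0
        pairs = list(combinations(sorted(nbrs), 2))
        linked = sum(1 for a, b in pairs if b in neighbors[a])
        total += linked / len(pairs)
    return total / len(neighbors)


def random_graph_like(neighbors, rng):
    """Random graph with the same nodes and the same number of edges (assumed null model)."""
    nodes = sorted(neighbors)
    n_edges = sum(len(v) for v in neighbors.values()) // 2
    shuffled = {i: set() for i in nodes}
    for a, b in rng.sample(list(combinations(nodes, 2)), n_edges):
        shuffled[a].add(b)
        shuffled[b].add(a)
    return shuffled


def delta_c(neighbors, n_random=50, seed=0):
    """C of the graph minus the mean C over n_random randomized graphs."""
    rng = random.Random(seed)
    c = clustering_coefficient(neighbors)
    c_rand = sum(clustering_coefficient(random_graph_like(neighbors, rng))
                 for _ in range(n_random)) / n_random
    return c - c_rand


# Toy graph: two 4-node cliques joined by a single edge (illustrative only).
cliques = {i: set() for i in range(8)}
for group in (range(0, 4), range(4, 8)):
    for a, b in combinations(group, 2):
        cliques[a].add(b)
        cliques[b].add(a)
cliques[3].add(4)
cliques[4].add(3)
```

For the toy graph, `clustering_coefficient(cliques)` is 0.875 and `delta_c(cliques)` is positive, which is the kind of signal that would indicate a community structure beyond what edge density alone explains.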

Fig. 2.

(A) Histograms of the maximum adjusted clustering coefficient ΔC for plmDCA method, obtained by progressively excluding from the dataset the MSAs containing a low number of sequences. (B) Scatter plots of ΔC as a function of the corresponding MSA size (Nseqeff, effective number of sequences with sequence identity less than 90%).

Compactness of evolutionary domains.

We used the networks of evolutionary couplings, derived from each of the 813 MSAs, as input for the clustering algorithm. For an initial unsupervised overview of the EDs organization, we identified the subdivisions from Q=2 to Q=10 domains for each protein family. Next, we studied whether the sequence-based subdivisions corresponded to spatially compact domains once mapped on the available PDB structures of the MSAs’ representatives. The results are given in Fig. 3.

Fig. 3.

(A) Distribution of the average structural compactness ΩQ over the MSA dataset (in red), compared with the one computed for random partitionings of the same protein sequences (in cyan). (B) Scatter plot of the EDs structural compactness Ω, computed for each single MSA and averaged over the subdivisions into Q=2,…,10 domains, vs. the relative MSA size. The dashed line represents the average compactness for the set of random partitionings. (C–E) Structural representations of three notable examples of ED decompositions, marked by blue squares in B. (E) Views of the transcription factor IIIA, in the apo form and in complex with a 5S rRNA 55mer.

Fig. 3A presents the probability distribution of the compactness parameter, Ω, which measures the fraction of amino acids that are no further than 10 Å from most residues in their same domain (Materials and Methods). For clarity, the results are presented as aggregated over the considered values of Q; more detailed, nonaggregated representations, including those for the other inference methods, are given in SI Appendix, Figs. S7–S9. The Ω distribution of genuine ED partitions in Fig. 3A is strongly skewed toward the Ω=1 limit. In fact, the median value is 0.98, indicating that, over all considered MSAs and partitioning levels Q, very few amino acids are isolated, or at a distance larger than 10 Å from the other members of their domains. By contrast, the compactness Ω computed for random partitioning of the same entries, and into the same range of Q, follows a very different distribution that is so shifted toward lower Ω values (the mean is about 0.57) that it has negligible overlap with the ED one. The scatter plot in Fig. 3B additionally reveals a strong correlation between the number of sequences in the MSAs and the observed compactness of the inferred EDs, similarly to what is observed for the clustering coefficient. In fact, one notes that values in the left tail of the Ω distribution are typically found for MSAs featuring the smallest numbers of entries, 300 sequences or fewer. We interpret this result as an indirect indication that, when fewer than 300 sequences are used to infer the couplings, the network is less reliably reconstructed and consequently the ED subdivisions are less compact, although their compactness can still be significantly high compared with the random case. Analogous conclusions, but with even higher values of compactness Ω, can be drawn when repeating the analyses in Fig. 3 A and B for the optimal decompositions in Qopt domains, picked according to the quality score of the ED decomposition, demonstrating its relevance in the method (see SI Appendix, Fig. S10).
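To make the compactness parameter Ω more concrete, here is a minimal sketch under one possible reading of the definition given above: a residue counts as compact if more than half of the other residues in its domain lie within the cutoff distance. The exact criterion is given in the paper's Materials and Methods and may differ; the function name, the treatment of single-residue domains, and the toy coordinates are assumptions.

```python
import math


def compactness(coords, labels, cutoff=10.0):
    """Fraction of residues lying within `cutoff` of most residues of their own domain.

    coords: list of (x, y, z) positions, one per residue, in angstroms.
    labels: domain label of each residue, in the same order.
    """
    n = len(coords)
    compact = 0
    for i in range(n):
        mates = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not mates:
            continue  # assumption: a single-residue domain counts as non-compact
        close = sum(1 for j in mates
                    if math.dist(coords[i], coords[j]) <= cutoff)
        # "Most residues" is read here as a strict majority of domain mates.
        if close > len(mates) / 2:
            compact += 1
    return compact / n


# Toy example: two groups of three residues, 30 angstroms apart along x.
coords = [(0, 0, 0), (3, 0, 0), (6, 0, 0), (30, 0, 0), (33, 0, 0), (36, 0, 0)]
omega_good = compactness(coords, [0, 0, 0, 1, 1, 1])  # spatially coherent domains
omega_bad = compactness(coords, [0, 1, 0, 1, 0, 1])   # interleaved domains
```

With these toy inputs, the spatially coherent labeling gives Ω = 1.0 and the interleaved one gives Ω = 0.0, mirroring the contrast between genuine ED partitions and random partitions discussed above.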

To illustrate the concepts discussed above within the context of selected protein structures, we show a few notable examples of ED subdivisions in Fig. 3 C–E. The entry in Fig. 3C corresponds to an MSA with a large pool of sequences (14,080) and an average compactness ΩQ=0.96. The structure shown in the figure is the representative PDB entry 1NE2, and the subdivision corresponds to the optimal partitioning (Q=7). Its high degree of compactness, Ω=0.99, is readily perceived by inspecting the subdivisions that, with the sole exception of a terminal residue, are visibly compact in space. The other two examples in Fig. 3 D and E, by contrast, pertain to proteins whose average ED compactness is about 0.62±0.02, i.e., on the low side of the distribution. The first instance is the Ebola viral protein 35, represented by PDB entry 3L28, which has the smallest MSA in the dataset (nine sequences only). This entry presents a noticeable fragmentation of each of the Q=6 domains, and, indeed, its compactness value is not too dissimilar from the case of random partitions.

The second instance is a most interesting outlier, because it corresponds to one of the most numerous MSAs, specifically encoding the transcription factor IIIA (TFIIIA), a Cys2His2 (C2H2) zinc finger protein involved in nucleic acid recognition and regulation (64). The two structures in Fig. 3E represent the fingers 4 to 6, both in the free state (65) (PDB ID 2J7J), and bound to 5S rRNA 55mer (66) (PDB ID 2HGH). TFIIIA is particularly noteworthy because it contains nine C2H2 domains. As discussed by Espada et al. (67), in such cases, DCA signals can reflect correlations due to the common origin of the domains, as well as correlations due to genuine structural and functional couplings.

The optimal partitioning of TFIIIA, i.e., the one with the highest quality score, consists of Q=3 EDs. When the corresponding tripartite subdivision is superposed to the apo structure of the zinc finger, it yields spatially fragmented domains. However, differently from the previous instance in Fig. 3D, the residues in each domain are not scattered but rather are arranged in coherent structural patterns. In particular, the partitioning of a single zinc finger is consistently repeated across all three motifs. Indeed, when the same subdivision is superposed to the holo (RNA bound) form, the domains acquire a spatial organization that is functionally meaningful. Specifically, (i) the red domain outlines the binding site formed by two cysteines on the β-hairpin and two histidines in the helix (highlighted in the apo form in yellow and cyan, respectively) that coordinate the zinc ions crucial to stabilize the fold (66); (ii) the white domain sustains and locks the hairpin onto the helix (note the facing white residues, consistently present in all three helices); and, finally, (iii) the remaining blue part of the helix (referred to as “recognition helix”) contains residues forming sequence-specific contacts with the groove of the nucleic acid. Therefore, the seemingly fragmented nature of this outlier can be recapitulated in more coherent functional ways in the holo context. This suggests that, even in challenging cases where DCA reflects the presence of repeated domains, the ED analysis can still extract meaningful large-scale functional relationships.

Comparison with dynamical domains.

Motivated by these observations, we undertook a systematic comparison of the EDs and the quasi-rigid (or dynamical) domains (DDs) for each of the 813 MSAs. The DDs were obtained from the SPECTRUS decomposition tool (61), based on an elastic network model (ENM) analysis (68, 69) of the PDB structures of the MSA’s reference entries, as detailed in Materials and Methods. The structure- and dynamics-based character of the DD analysis is an apt complement of the sequence-based one of EDs. This duality makes the comparison particularly interesting and relevant for framing the sequence–structure–function relationship. The overlap of the two types of domain subdivisions was measured in terms of the adjusted mutual information (AMI), which allows for a straightforward assessment of the statistical significance of the subdivisions overlap, as described in SI Appendix, Supplementary Methods.
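For readers unfamiliar with the AMI, the following sketch shows one standard way of computing it for two labelings of the same residues: the mutual information is corrected by its expected value under random labelings with the same cluster sizes (hypergeometric model) and normalized by the arithmetic mean of the two entropies. The choice of normalization and the function names are assumptions for this example; the variant used in the paper is specified in its SI Appendix.

```python
from collections import Counter
from math import exp, lgamma, log


def entropy(counts, n):
    """Shannon entropy (in nats) of a partition with the given cluster sizes."""
    return -sum(c / n * log(c / n) for c in counts if c > 0)


def mutual_information(u, v):
    """Mutual information (in nats) between two labelings of the same items."""
    n = len(u)
    a, b, joint = Counter(u), Counter(v), Counter(zip(u, v))
    return sum(nij / n * log(n * nij / (a[i] * b[j]))
               for (i, j), nij in joint.items())


def expected_mutual_information(u, v):
    """Expected MI of random labelings that keep the same cluster sizes."""
    n = len(u)
    a, b = Counter(u).values(), Counter(v).values()
    emi = 0.0
    for ai in a:
        for bj in b:
            for nij in range(max(1, ai + bj - n), min(ai, bj) + 1):
                # Log of the hypergeometric probability of an overlap of size nij.
                log_p = (lgamma(ai + 1) + lgamma(bj + 1)
                         + lgamma(n - ai + 1) + lgamma(n - bj + 1)
                         - lgamma(n + 1) - lgamma(nij + 1)
                         - lgamma(ai - nij + 1) - lgamma(bj - nij + 1)
                         - lgamma(n - ai - bj + nij + 1))
                emi += nij / n * log(n * nij / (ai * bj)) * exp(log_p)
    return emi


def adjusted_mutual_information(u, v):
    """AMI with arithmetic-mean normalization (an assumed convention)."""
    n = len(u)
    mi = mutual_information(u, v)
    emi = expected_mutual_information(u, v)
    h_mean = 0.5 * (entropy(Counter(u).values(), n)
                    + entropy(Counter(v).values(), n))
    denom = h_mean - emi
    return (mi - emi) / denom if abs(denom) > 1e-12 else 1.0


# Hypothetical ED and DD labels for eight residues (same partition, different names).
ed_labels = [5, 5, 5, 7, 7, 7, 9, 9]
dd_labels = [0, 0, 0, 1, 1, 1, 2, 2]
ami = adjusted_mutual_information(ed_labels, dd_labels)
```

Because the AMI depends only on how residues are grouped, not on the label names, identical partitions give a value of 1 (up to rounding), while unrelated partitions give values near or below 0.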

To better illustrate the correspondence of the EDs and DDs and to give an immediate meaning to the AMI value, we discuss here two examples. Fig. 4A shows the results for SbmC protein (PDB ID 1JYH, Nseqeff=3,707) subdivided into Q=4 domains. This level of subdivision was considered because it provides the best quality score for dynamical domains. The consistency of the ED and DD subdivisions is very clearly conveyed by the structural and sequence-wise representations, which overlap almost perfectly. This consistency extends to both coarse and finer subdivisions, as highlighted by the AMI profile, which is particularly high (>0.8) for Q=2 and Q=4, and remains larger than 0.5 in all other cases as well. Likewise, for the example in Fig. 4B [ATP-binding cassette (ABC) transporter, PDB ID 2ONK, Nseqeff=17,503], a consistent overlap between EDs and DDs is observed at various levels of subdivision. In particular, we note that even the lowest AMI value of 0.5, attained for Q=4, still corresponds to a clear and satisfactory consistency of the two types of subdivisions.

Fig. 4.

(A and B) ED and DD decompositions of an SbmC protein (PDB ID 1JYH:A) and an ABC transporter permease protein (PDB ID 2ONK:C). (C) Scatter plots of the maximum (Upper) and average (Lower) AMI, over the domain number Q, between the ED and DD decompositions, as a function of the effective MSA size.

To extend considerations to the entire dataset, we computed for each MSA the average and the largest AMI between EDs and DDs, for Q in the range [2,10]. The results are presented as a function of the number of MSA sequences in the scatter plots of Fig. 4C. Interestingly, we observe again a strong dependence on Nseqeff: For MSAs with 500 sequences or more, the average values for AMImax and AMIQ are 0.62 and 0.47, compared with the corresponding values of 0.49 and 0.35, respectively, when Nseqeff<500. Clearly, when Nseqeff tends to 0, the AMI vanishes, again consistent with a random partitioning of a sequence. Slightly higher values of AMI are observed, on average, when comparing domains at the optimal ED number Qopt, as determined by the individual quality scores. However, such comparison is more delicate, because the respective Qopts for ED and DD decompositions do not generally coincide, making it advisable to consider the more stable average AMIQ. For more details, see discussion in SI Appendix, Fig. S11.

The good overlap between EDs and DDs at all levels of subdivision suggests that our clustering approach captures all of the relevant topological features from the network of statistical couplings. It thus constitutes a powerful tool for inferring meaningful structural and functional relationships, as discussed in Case Study: Comparative Analysis Across the 6TM Family of Ion Channels.

Case Study: Comparative Analysis Across the 6TM Family of Ion Channels.

To further assess the capability of ED decompositions to outline important functional properties of a protein family, we conclude by applying the ED analysis in a comparative scenario to a specific class of ion channels, the six-transmembrane-helices (6TM) superfamily, for which the sequence–function relationship has been actively investigated in a number of seminal studies (70). This superfamily is characterized by a strictly conserved tetrameric architecture. The latter is shown in Fig. 5A where different colors are used to highlight the main functional domains, including the four-helix bundle voltage sensor domain (VSD) and the pore of the ion conduction pathway, which involves two transmembrane helices and the linking reentrant pore, containing the selectivity filter. This single structural template inherited from an ancestor gene has enabled, through differentiation, an explosion of functional variability. Channels in the 6TM class, in fact, are involved, for instance, in reporting noxious environmental conditions, in shaping the neuronal action potential, and in syncing the beating of the heart (59). Since all these channels share the same architecture, different decompositions in EDs in different phylogenetic groups likely reflect distinct functional rather than structural aspects (51, 52).

Fig. 5.

EDs for Kv channels. (A) Schematic representation of biological tetrameric assembly of 6TM channels, with each color representing a single monomeric subunit (top and lateral view). For the blue subunit, the VSD is highlighted in cyan. (B) Representation of the most significant monomeric subdivision, Q=2, shown in the context of the full tetramer; see SI Appendix, Fig. S13 for the quality score. Positively charged residues responsible for voltage sensing are shown as yellow spheres. (C–E) Finer subdivisions into four and six monomeric domains.

For definiteness, we focus on three different 6TM families: the voltage-gated potassium-selective channel [Kv, PDB ID 2R9R (71)], the bacterial voltage-gated sodium-selective channel [BacNav, PDB ID 4EKW (72)], and transient receptor potential [TRP, PDB ID 3J5P (73)] channels. We analyzed the MSAs for the three families based on a pool of 800 sequences, each with 200 positions (74) from which we omitted the highly gapped regions of the alignments (typically occurring in loops between the six transmembrane helices). Although the 6TM dataset that we used is the most comprehensive available at the moment, its size is clearly limited by comparison with the much better populated cases discussed previously, and it accordingly shows a rather low ΔC (see SI Appendix, Fig. S12). To ensure a robust analysis, we decided to decompose the graph corresponding to the maximum ΔC for each MSA.

In Fig. 5 B–E, we present various subdivisions for the Kv family for increasing numbers of domains (see SI Appendix, Fig. S13 for quality score). The subdivision for Q=2 is already unexpectedly informative, since the fourth helix of the VSD (called S4) and its facing residues are associated with the pore domain rather than the rest of the VSD. This is an intriguing result because the aforementioned classic subdivision into structural domains would have kept these elements apart. From a functional point of view, however, the sequence-based subdivision of the primary (Q=2) EDs is meaningful. In fact, it agrees with the strong mechanical coupling between the pore region and S4 (75–77). We recall that the latter contains the positive residues (yellow spheres in Fig. 5B) that sense transmembrane potential variations and determine the movement of this helix across the membrane; this movement is, in turn, transmitted to the pore domain for gating. The division into Q=4 EDs, in Fig. 5C, picks up further functional features. One domain largely corresponds to the selectivity region, formed by all of the residues lining the narrow and highly conductive ion pathway (in yellow), another is associated with the gating region (red), and the other two comprise, respectively, the internal and external residues of the VSD. It is notable that a different domain assignment is found for the two faces of the pore helix, with the upper one sustaining the selectivity filter and the lower one contacting the gating domain. This Kv example is also particularly instructive regarding the multilevel description that EDs can provide about the various protein features. Finer subdivisions (Q=6) mostly return the basic structural elements of the system. In this subdivision, the pore and the voltage sensor regions are mostly assigned to different EDs, with the exception of the extracellular portion of the channel (highlighted in blue in Fig. 5 D and E), which still bridges between the two. When viewed in the context of the channel tetramer, it appears natural to speculate that this region is instrumental for the signal propagation between the loops of the voltage-sensing and the pore domains, which can indeed be modulated by external stimuli, like ligand binding. We accordingly surmise that amino acids in this region are genuinely evolutionary-related for functional reasons.

Further elements regarding the functional role of EDs emerge from the comparisons of Kv, BacNav, and TRP subdivisions, which are given in Fig. 6 and are further detailed in SI Appendix, Fig. S13. The comparison between Kv and BacNav (another tetrameric voltage-gated family, selective for sodium) reflects how the functional constraints shaped these two families along evolution, in an almost superimposable way. In fact, the S4 helix segregates with the lower part of the pore, and, together, they form the “gating domain” (in red). Similarly, the reentrant pore helix is split into the upper and lower faces, sustaining the selectivity domain (in yellow), and the rest of VSD is grouped into internal and external residues. The organization of EDs for TRP channels is, instead, totally different. Indeed, this channel family, identified only in Eukaryota, has distinct characteristics with respect to the other ones. Specifically, it is a nonselective cation channel gated by a variety of stimuli, such as temperature, pH, and ligand binding (78–81). In particular, these channels have been shown to possess two different gating regions (73, 82), which are, indeed, well captured by the ED decomposition. The division of S4 is compelling in this respect, since it is consistent with the lack of the dynamical role that, instead, characterizes it in voltage-gated ion channels: Only the C-terminal residues are associated with the gating domain (in red). The upper part of S4 is, instead, longitudinally sectioned, with the internal residues all grouped with the upper part of the rest of the VSD. The external part of S4 belongs to the extended yellow domain: The latter represents effectively a second upper gating domain, as suggested in refs. 73 and 82.
Remarkably, the yellow cavity determined by the two pore helices and the external part of S4 corresponds exactly to the location of the vanilloid pocket (82–86), which represents the main intracellular binding site for the activators of these channels.

Fig. 6.

Comparative analysis of the EDs for Kv, BacNav, and TRP channels, corresponding to subdivision Q = 4 (see quality scores in SI Appendix, Fig. S13). While Kv and BacNav show similar organization, coherent with their analogous functional requirements, TRP is characterized by a different domain pattern, consistent with its ligand-gated properties and loss of voltage-gated ones, specific to the other two channels.

The indication from the 6TM family is that EDs can single out domains that, owing to their specific functional character, are distinct from subdivisions made with static structural criteria.

Comparison with Protein Sectors Analysis.

Identifying groups of coevolving residues from patterns of correlated mutations is a long-standing issue (48) that has been tackled from various perspectives. Among the best-known and most elegant approaches are the protein sectors analysis (49) and CoeViz (87), which provides insight into the cooperative nature of residue coevolution. ED analysis is mostly complementary to these techniques, because of several methodological differences. For instance, protein sectors analysis returns a nonexhaustive coverage of the protein residues. In fact, it uses the top eigenvectors of a conservation-weighted covariance matrix built from an MSA, and typically only the 20% of residues with the largest components on one eigenvector determine a sector, i.e., a group of residues evolving concertedly. By construction, the method prioritizes the most conserved residues (88). Importantly, this nonexhaustive assignment is nonexclusive too, meaning that one residue can be part of distinct sectors. By contrast, the ED decomposition uses the entire DCA-based similarity to ensure a residue assignment that is both exhaustive and exclusive. The latter feature, in particular, is instrumental to the specific goal pursued here of comparing EDs with DDs.

DCA and statistical coupling analysis share nevertheless important conceptual similarities (89, 90), and, therefore, similarities between EDs and sectors can be expected in specific contexts. We therefore compared the two types of subdivisions for several case studies. We first considered the two datasets of ref. 49, which consist of the PDZ domain and the S1A serine protease families. The former dataset has 240 sequences and features one sector. The quality score profile of the ED analysis in Fig. 7A has an overall decreasing trend, which is typical of datasets of this size, indicating meaningful division for Q=2,3. The first subdivision features a domain that totally includes the aforementioned sector (red spheres). In the finer ED subdivisions, the protein sector is resolved into smaller and spatially coherent EDs (red and gray domains in the sequence diagram), allowing a further comparison with DDs for Q=3: the highlighted residues (and corresponding EDs) overlap with two distinct dynamical partitions of the protein. The second dataset, with a larger number of sequences (1,388), yields three sectors. The EDs quality score profile in Fig. 7B indicates that significant subdivisions are found for Q=2,3,8 domains. Two of the three sectors (red and orange in the diagram) have a good correspondence with the EDs. They are compact and both contained in the red domain for Q=2,3, and then perfectly separated for Q=8. The other sector (in gray in Fig. 7B) instead comprises scattered residues. This is consistent with previous studies that showed that this sector is more related to thermal stability than structural properties (49). Interestingly, when S1A sectors and EDs differ from DDs (again for Q=3), they are still consistent with each other. In fact, one sees in Fig. 7 that the red ED includes the orange sector but both groups differ from the blue DD. 
Overall, the comparative analysis of these two families, whose MSAs contain homogeneous sets of sequences, shows that EDs and sectors have significant similarities.

Fig. 7.

Comparison of ED decomposition and protein sector analysis (49) for (A) the PDZ domain (PDB ID 1BE9) and (B) the S1A serine proteases family (PDB ID 3TGI), also with the corresponding division in DDs. The sectors are shown as spheres in the 3D representations, and EDs and DDs are shown as different colors also in the sequence diagram.

Remarkable differences, however, are observed in the case of larger and more heterogeneous sets of sequences. In SI Appendix, Figs. S14–S16, we illustrate three examples discussed previously, namely SbmC gyrase inhibitory protein, adenylate kinase, and ABC transporter, whose MSAs have been built by including the largest number of sequences (42). While, for SbmC ($N_{\mathrm{seq}}^{\mathrm{eff}}$ = 3,714), some similarity is still noticeable between two sectors (cyan and orange in SI Appendix, Fig. S14) and the subdivisions in two DDs and EDs, for the other datasets (adenylate kinase and the ABC transporter), it is not possible to relate sectors to EDs or DDs: Protein sector analysis on these large datasets (more than 14,000 effective sequences) returns groups of residues distant in both primary and tertiary structure (see SI Appendix, Figs. S15 and S16). The fact that the differences between EDs and protein sectors are more pronounced for large datasets suggests that, when presented with highly heterogeneous sequence sets, these two algorithms highlight different aspects of residue–residue correlations. For instance, protein sectors analysis has been shown to effectively identify the groups of amino acids that experience the largest variations on passing from one phylogenetic group to another (91). On the other hand, DCA is seemingly less sensitive to the phylogenetic structure of the MSA analyzed (42). For this reason, we believe that the interpretation of EDs in terms of structural domains and DDs ought to be applicable in more general contexts, and particularly to large datasets.

Conclusions

Patterns of correlated mutations in an MSA can be used to reveal a set of pairwise statistical interactions that are often informative about the possible spatial proximity between the residues involved. Strikingly, we showed that this network of couplings has a peculiar structure with communities of residues that are more connected among themselves than with the rest of the sequence. Therefore, beyond compensatory mutations involving pairs of contacting residues, entire groups of residues appear to evolve in a concerted fashion. We characterized these communities, which we term EDs, by interpreting the statistical couplings as a measure of evolutionary proximity between residues. To this end, we used an efficient clustering framework, namely, spectral clustering. When analyzed in the context of the protein structure, these couplings show an innate tendency to segregate into spatially localized and compact groups. We explored the possible biological meaning of these subdivisions, comparing EDs with the dynamical, quasi-rigid domains identified by a recently introduced approach (61). The sequence- and structure-based subdivisions were found to be very consistent across a diverse repertoire of protein families. This is noteworthy because sequence-based approaches could be expected to give a less direct, and hence noisier, route to domain decompositions than structural ones. As a matter of fact, the two subdivision approaches provide consistent results for both small and large numbers of domains, thus indicating the viability of ED decompositions at different “spatial resolutions.” These features ought to be particularly valuable in contexts where structural data are not available or are sparser than the sequence-based data.

In these cases, detecting the hierarchical organization in domains can represent a crucial initial step for any structural modeling with an atomistic level of detail. Even more interesting is the perspective of using this approach to engineer existing proteins: Transferring EDs across proteins sharing the same architecture might enable de novo design of protein chimeras with novel biological properties.

Finally, despite the fact that the DCA analysis, used for the inference of statistical couplings, depends crucially on the effective number of sequences in the MSA, we observe robust and consistent results even for a few hundred sequences. For this reason, we believe that EDs, and therefore the topology of the network of coevolutionary couplings, are more robust with respect to the sample size than are the single interresidue couplings (41, 42). Importantly, this aspect widens de facto the scope of applicability of these inference methods beyond those cases for which thousands of sequences are available. In particular, this enables comparative studies in which homologous subfamilies, rather than the entire group, are studied separately, with the ultimate goal of highlighting the functional features distinctive to each subgroup, as shown here for the challenging case of ion channels.

Materials and Methods

SI Appendix contains supplementary results, methods, and figures. The method introduced here is available as a software package and web server at spectrus.sissa.it/spectrus-evo_webserver.

Dataset.

We used the dataset of ref. 42 consisting of 813 MSAs. The latter were obtained with HHblits (92), a homology detection scheme, using sequences of target PDB entries as input queries. The MSAs in the dataset are heterogeneous for both the number of involved sequences (16 to 65,535) and positions (30 to 494). The PDB structures associated with the query sequences were used a posteriori to assess the structural and dynamical characteristics of the EDs inferred from the sole direct coupling analysis of the MSAs (39, 93). The MSA for adenylate kinase was obtained with the approach of ref. 42, while MSAs for the three ion channels were the same ones derived in refs. 74 and 52, based on curated phylogenetic reconstructions. All of the relative couplings were obtained using three DCA methods, described in SI Appendix.

EDs and DDs.

DDs were identified with the SPECTRUS algorithm (61). The method uses interresidue distance fluctuations as input data for a clustering procedure that optimally partitions protein structures into quasi-rigid domains. Interresidue distance fluctuations for adenylate kinase were obtained from atomistic MD trajectories, while, for the other considered instances, they were computed using the ENM of ref. 68.

In brief, the DD partitioning involves the following steps: (i) An auxiliary similarity matrix, $S_{ij}$, is derived from the distance fluctuations of all pairs of amino acids, i and j; (ii) next, a spectral clustering scheme (61, 62, 94) is used to subdivide amino acids into an increasing number of domains, Q, from 2 to 10. For optimal domain discrimination, this spectral partitioning step is customarily performed with a k-medoids algorithm, after sparsifying the similarity matrix by retaining only the most significant couplings. Following ref. 61, we accordingly retained only entries corresponding to pairs of amino acids that were in contact in the reference PDB structures (Cα distances less than 10 Å). For robustness (62, 94), the stochastic k-medoids procedure was repeated 1,000 times for each value of Q, and we retained only the one with the best k-medoids score. Finally, (iii) for each partition into Q=2,…,10 domains, we computed a quality score, and used it to single out the top-most significant (top ranking) subdivision(s).
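The "repeat and keep the best score" part of step ii can be illustrated with a minimal sketch. It assumes a precomputed distance matrix between residues and uses a simple alternating k-medoids update; the function names (`k_medoids`, `best_of_restarts`), the update rule, and the use of total distance to the medoids as the score are illustrative assumptions, not the SPECTRUS implementation. The spectral embedding that would produce the distances is omitted here.

```python
import random


def k_medoids(dist, q, rng, max_iter=100):
    """One stochastic k-medoids run on a symmetric distance matrix.

    Returns (cost, labels); cost is the summed distance of each point
    to the medoid of its cluster (assumed score, lower is better).
    """
    n = len(dist)
    medoids = rng.sample(range(n), q)  # random initial medoids

    def assign(meds):
        # Each point joins the cluster of its closest medoid.
        return [min(range(q), key=lambda c: dist[i][meds[c]]) for i in range(n)]

    for _ in range(max_iter):
        labels = assign(medoids)
        new_medoids = []
        for c in range(q):
            members = [i for i in range(n) if labels[i] == c]
            if not members:  # keep the old medoid if a cluster empties
                new_medoids.append(medoids[c])
                continue
            # New medoid: the member with the smallest total distance to the others.
            new_medoids.append(
                min(members, key=lambda m: sum(dist[m][j] for j in members))
            )
        if new_medoids == medoids:
            break
        medoids = new_medoids

    labels = assign(medoids)
    cost = sum(dist[i][medoids[labels[i]]] for i in range(n))
    return cost, labels


def best_of_restarts(dist, q, n_restarts=1000, seed=0):
    """Repeat the stochastic run and keep the partition with the best score."""
    rng = random.Random(seed)
    return min((k_medoids(dist, q, rng) for _ in range(n_restarts)),
               key=lambda result: result[0])


# Toy usage: six points on a line forming two obvious groups.
positions = [0, 1, 2, 10, 11, 12]
toy_dist = [[abs(a - b) for b in positions] for a in positions]
cost, labels = best_of_restarts(toy_dist, q=2, n_restarts=50)
```

Fixing the seed makes the restarts reproducible, which is convenient when comparing partitions across different values of Q.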

EDs were identified with exactly the same method, but with two differences. First, the $S_{ij}$ entries of the similarity matrix were derived from the DCA pairwise couplings. Specifically, the couplings were first additively shifted to remove negative values and then squared to increase the contrast between weak and strong couplings. The second change concerns the sparsification of the similarity matrix in step ii, which was necessary because the structural (contact) criterion used for DDs could not be used in the sequence-based context of EDs. As the simplest choice, we adopted the symmetric k-nearest neighbor criterion, where, for each residue, only the k strongest couplings are kept; see Optimization of the k-Nearest Neighbor Network. To ensure the connectivity of the network associated with the S matrix, a requisite for meaningful subdivisions, we always retained entries corresponding to consecutive residues. We evaluated the robustness of the ED assignment procedure with respect to (i) the size of the MSA dataset (see SI Appendix, Figs. S17 and S18), (ii) the “purity” of EDs (that is, their variability over independent instances of the k-medoids assignment; see SI Appendix, Fig. S19), (iii) the specific sparsification strategy applied to the DCA-based similarity matrix (see SI Appendix, Figs. S20–S23), and (iv) the DCA method used (see SI Appendix, Figs. S24–S29). In the latter case, we also tested the domain assignment variability upon omitting residues with the weakest couplings in the network; see SI Appendix, Fig. S23. We surmise that, in future developments of the current method, an optimization of the protocol for omitting such residues could improve ED resolution.
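The construction of the sparsified similarity matrix described above can be sketched as follows. This is a minimal illustration under stated assumptions: the shift is taken to be the negative of the smallest off-diagonal coupling, and "symmetric" is read as keeping an entry if it is among the k strongest couplings of either residue. The function name `build_similarity` is hypothetical; the exact choices of the published software may differ.

```python
def build_similarity(couplings, k=7):
    """Sparsified similarity matrix from a symmetric N x N coupling matrix.

    couplings: list of lists with DCA pairwise coupling scores.
    Returns an N x N list of lists with zeros for discarded entries.
    """
    n = len(couplings)

    # Additive shift so that no off-diagonal coupling is negative (assumed form).
    lowest = min(couplings[i][j] for i in range(n) for j in range(n) if i != j)
    shift = -lowest if lowest < 0 else 0.0

    # Squaring increases the contrast between weak and strong couplings.
    squared = [
        [(couplings[i][j] + shift) ** 2 if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]

    # Symmetric k-nearest-neighbor rule: keep (i, j) if j is among the
    # k strongest partners of i, or i is among the k strongest partners of j.
    keep = set()
    for i in range(n):
        ranked = sorted((j for j in range(n) if j != i),
                        key=lambda j: squared[i][j], reverse=True)
        for j in ranked[:k]:
            keep.add((min(i, j), max(i, j)))

    # Always retain entries for consecutive residues to favor connectivity.
    for i in range(n - 1):
        keep.add((i, i + 1))

    similarity = [[0.0] * n for _ in range(n)]
    for i, j in keep:
        similarity[i][j] = similarity[j][i] = squared[i][j]
    return similarity


# Toy usage with four residues and k = 1.
toy = [[0.0, 0.9, -0.1, 0.2],
       [0.9, 0.0, 0.1, -0.3],
       [-0.1, 0.1, 0.0, 0.8],
       [0.2, -0.3, 0.8, 0.0]]
sparse = build_similarity(toy, k=1)
```

In the toy case, the entry between residues 1 and 2 survives only because they are consecutive in sequence, which is the role of the chain-connectivity rule.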

Optimization of the k-Nearest Neighbor Network.

To set k, we profiled the clustering coefficient C (95), or “cliqueness,” of the network of k-nearest neighboring residues for various values of k. The clustering coefficient measures the probability that two neighbors of a node in the network (i.e., a residue) are themselves neighbors. The local clustering coefficient of node i is defined as $C_i = 2t_i/[n_i(n_i-1)]$, where $t_i$ is the number of links between the $n_i$ neighbors of i. The global clustering coefficient is obtained by averaging $C_i$ over all nodes in the network with $n_i > 1$. In this paper, we used a generalized, weighted definition of clustering coefficient, which takes into account the weights associated with each link (63).

A commonly used term of reference is a random graph, for which $C_{\mathrm{rand}} = \bar{k}/(N-1)$, where N is the number of nodes and $\bar{k}$ is the average number of neighbors per node (96). Here we use this quantity to profile the k dependence of the clustering propensity of the network derived from the DCA. Specifically, for each of the 813 MSA entries, we computed $\Delta C = C - C_{\mathrm{rand}}$ for $k \in \{3, 5, 7, 10, 15, 20, 25, 30, 40, 55\}$. We then identified the value of k yielding the largest ΔC for each entry. The distribution of such optimal values of k is shown in Fig. 2. The distribution is visibly peaked at k=7, which we then used for the ED analysis. The peak location is robust with respect to the specific DCA method used and with respect to the MSA size (see SI Appendix, Fig. S3). The use of a suboptimal graph nevertheless leads to essentially the same subdivisions (see SI Appendix, Fig. S29). Accordingly, in Results and Discussion, we always set k=7, making our method parameter-free.
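The selection of k can be sketched in a few lines. The example below is a simplified illustration: it uses the unweighted clustering coefficient, whereas the analysis in this paper relies on the weighted generalization of ref. 63, and it represents each candidate network as a plain adjacency dictionary. The names `global_clustering`, `delta_c`, and `best_k` are hypothetical.

```python
def global_clustering(adj):
    """Average local clustering coefficient over nodes with more than one neighbor.

    adj: dict mapping each node to the set of its neighbors (undirected graph).
    """
    values = []
    for node, neighbors in adj.items():
        nbrs = sorted(neighbors)
        n_i = len(nbrs)
        if n_i < 2:
            continue  # C_i is undefined for nodes with fewer than two neighbors
        # t_i: number of links between the neighbors of this node.
        t_i = sum(1 for a in range(n_i) for b in range(a + 1, n_i)
                  if nbrs[b] in adj[nbrs[a]])
        values.append(2.0 * t_i / (n_i * (n_i - 1)))
    return sum(values) / len(values) if values else 0.0


def delta_c(adj):
    """Excess clustering with respect to a random graph of equal mean degree."""
    n = len(adj)
    mean_degree = sum(len(neighbors) for neighbors in adj.values()) / n
    c_rand = mean_degree / (n - 1)
    return global_clustering(adj) - c_rand


def best_k(networks):
    """networks: dict mapping a candidate k to its neighbor network."""
    return max(networks, key=lambda k: delta_c(networks[k]))


# Toy usage: a triangle with a pendant node versus a simple path.
triangle_plus_tail = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
chosen = best_k({3: triangle_plus_tail, 5: path})
```

The keys 3 and 5 in the toy call are arbitrary labels standing in for candidate values of k; in practice each network would be the k-nearest-neighbor graph built from the DCA couplings of one MSA.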

Structural Compactness.

The spatial compactness of EDs was assessed a posteriori by establishing whether the individual EDs consisted of two or more subdomains that are > 10 Å apart. We mapped each ED q to a graph where the $n_q$ nodes (residues) were linked only if their Cαs were within 10 Å. We then considered all $n_q(n_q-1)$ distinct pairings of the nodes and checked whether or not they can be connected by a path on the network of links. The degree of disconnectedness of the domain is then measured as $d_q = b_q/(n_q-1)$, where $b_q$ is the number of residue pairs without a connecting path. The overall compactness of a subdivision into Q domains is then defined as

$$\Omega_Q = 1 - \frac{1}{N}\sum_{q=1}^{Q} d_q, \qquad [1]$$

where N is the total number of nodes/residues. For a physical interpretation of Ω, see SI Appendix, Fig. S6.
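As an illustration of Eq. 1, the sketch below computes Ω from Cα coordinates and a domain label per residue. It assumes that the $n_q(n_q-1)$ pairings are ordered pairs, so that Ω ranges from 0 (every residue isolated within its domain) to 1 (every domain connected), and that single-residue domains contribute $d_q = 0$. The function name `compactness` and the input layout are hypothetical.

```python
import math


def compactness(coords, labels, cutoff=10.0):
    """Compactness score of a domain subdivision.

    coords: list of (x, y, z) C-alpha positions in angstroms.
    labels: domain index of each residue, same length as coords.
    """
    n_total = len(coords)
    total_d = 0.0
    for q in set(labels):
        members = [i for i, lab in enumerate(labels) if lab == q]
        n_q = len(members)
        if n_q < 2:
            continue  # assumed: a one-residue domain is trivially compact

        # Find connected components of the within-domain contact graph.
        unseen = set(members)
        connected_pairs = 0
        while unseen:
            stack = [unseen.pop()]
            size = 0
            while stack:
                u = stack.pop()
                size += 1
                near = [v for v in unseen
                        if math.dist(coords[u], coords[v]) <= cutoff]
                for v in near:
                    unseen.remove(v)
                    stack.append(v)
            connected_pairs += size * (size - 1)  # ordered pairs joined by a path

        b_q = n_q * (n_q - 1) - connected_pairs  # ordered pairs without a path
        total_d += b_q / (n_q - 1)
    return 1.0 - total_d / n_total


# Toy usage: two pairs of residues separated by a large gap.
toy_coords = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0),
              (100.0, 0.0, 0.0), (105.0, 0.0, 0.0)]
omega_one_domain = compactness(toy_coords, [0, 0, 0, 0])
omega_two_domains = compactness(toy_coords, [0, 0, 1, 1])
```

Grouping all four toy residues in one domain lowers the score because the domain splits into two disconnected halves, whereas assigning each half to its own domain gives a fully compact subdivision.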

Acknowledgments

This work was partially supported by National Institutes of Health Grants R01GM093290, S10OD020095, and P01GM055876 (to V.C.) and National Science Foundation Grant ACI-1614804 (to V.C.).

Footnotes

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1712021114/-/DCSupplemental.

References

  • 1. Alberts B, et al. Molecular Biology of the Cell. Garland Sci; New York: 2002.
  • 2. Petsko G, Ringe D. Protein Structure and Function. New Sci Press; London: 2004.
  • 3. Eisenmesser EZ, Bosco DA, Akke M, Kern D. Enzyme dynamics during catalysis. Science. 2002;295:1520–1523. doi: 10.1126/science.1066176.
  • 4. Eisenmesser EZ, et al. Intrinsic dynamics of an enzyme underlies catalysis. Nature. 2005;438:117–121. doi: 10.1038/nature04105.
  • 5. Min W, Xie XS, Bagchi B. Two-dimensional reaction free energy surfaces of catalytic reaction: Effects of protein conformational dynamics on enzyme catalysis. J Phys Chem B. 2008;112:454–466. doi: 10.1021/jp076533c.
  • 6. Min W, Xie XS, Bagchi B. Role of conformational dynamics in kinetics of an enzymatic cycle in a nonequilibrium steady state. J Chem Phys. 2009;131:08B606. doi: 10.1063/1.3207274.
  • 7. Nevin Gerek Z, Kumar S, Banu Ozkan S. Structural dynamics flexibility informs function and evolution at a proteome scale. Evol Appl. 2013;6:423–433. doi: 10.1111/eva.12052.
  • 8. Shaw DE, et al. Anton, a special-purpose machine for molecular dynamics simulation. Commun ACM. 2008;51:91–97.
  • 9. Shaw DE, et al. Atomic-level characterization of the structural dynamics of proteins. Science. 2010;330:341–346. doi: 10.1126/science.1187409.
  • 10. Granata D, Camilloni C, Vendruscolo M, Laio A. Characterization of the free-energy landscapes of proteins by NMR-guided metadynamics. Proc Natl Acad Sci USA. 2013;110:6817–6822. doi: 10.1073/pnas.1218350110.
  • 11. Zheng W, Brooks BR, Thirumalai D. Allosteric transitions in the chaperonin GroEL are captured by a dominant normal mode that is most robust to sequence variations. Biophys J. 2007;93:2289–2299. doi: 10.1529/biophysj.107.105270.
  • 12. Zen A, Micheletti C, Keskin O, Nussinov R. Comparing interfacial dynamics in protein-protein complexes: An elastic network approach. BMC Struct Biol. 2010;10:26. doi: 10.1186/1472-6807-10-26.
  • 13. Ramanathan A, Agarwal PK. Evolutionarily conserved linkage between enzyme fold, flexibility, and catalysis. PLoS Biol. 2011;9:e1001193. doi: 10.1371/journal.pbio.1001193.
  • 14. Micheletti C, Lattanzi G, Maritan A. Elastic properties of proteins: Insight on the folding process and evolutionary selection of native structures. J Mol Biol. 2002;321:909–921. doi: 10.1016/s0022-2836(02)00710-6.
  • 15. Agarwal PK, Billeter SR, Rajagopalan PR, Benkovic SJ, Hammes-Schiffer S. Network of coupled promoting motions in enzyme catalysis. Proc Natl Acad Sci USA. 2002;99:2794–2799. doi: 10.1073/pnas.052005999.
  • 16. Hammes-Schiffer S, Benkovic SJ. Relating protein motion to catalysis. Annu Rev Biochem. 2006;75:519–541. doi: 10.1146/annurev.biochem.75.103004.142800.
  • 17. Carnevale V, Raugei S, Micheletti C, Carloni P. Convergent dynamics in the protease enzymatic superfamily. J Am Chem Soc. 2006;128:9766–9772. doi: 10.1021/ja060896t.
  • 18. Chennubhotla C, Bahar I. Signal propagation in proteins and relation to equilibrium fluctuations. PLoS Comput Biol. 2007;3:e172. doi: 10.1371/journal.pcbi.0030172.
  • 19. Carnevale V, Pontiggia F, Micheletti C. Structural and dynamical alignment of enzymes with partial structural similarity. J Phys Condens Matter. 2007;19:285206.
  • 20. Zen A, Carnevale V, Lesk AM, Micheletti C. Correspondences between low-energy modes in enzymes: Dynamics-based alignment of enzymatic functional families. Protein Sci. 2008;17:918–929. doi: 10.1110/ps.073390208.
  • 21. del Sol A, Tsai CJ, Ma B, Nussinov R. The origin of allosteric functional modulation: Multiple pre-existing pathways. Structure. 2009;17:1042–1050. doi: 10.1016/j.str.2009.06.008.
  • 22. Jackson CJ, et al. Conformational sampling, catalysis, and evolution of the bacterial phosphotriesterase. Proc Natl Acad Sci USA. 2009;106:21631–21636. doi: 10.1073/pnas.0907548106.
  • 23. Teilum K, Olsen JG, Kragelund BB. Functional aspects of protein flexibility. Cell Mol Life Sci. 2009;66:2231–2247. doi: 10.1007/s00018-009-0014-6.
  • 24. Morra G, Verkhivker G, Colombo G. Modeling signal propagation mechanisms and ligand-based conformational dynamics of the hsp90 molecular chaperone full-length dimer. PLoS Comput Biol. 2009;5:e1000323. doi: 10.1371/journal.pcbi.1000323.
  • 25. Provasi D, Artacho MC, Negri A, Mobarec JC, Filizola M. Ligand-induced modulation of the free-energy landscape of G protein-coupled receptors explored by adaptive biasing techniques. PLoS Comput Biol. 2011;7:e1002193. doi: 10.1371/journal.pcbi.1002193.
  • 26. Bhabha G, et al. A dynamic knockout reveals that conformational fluctuations influence the chemical step of enzyme catalysis. Science. 2011;332:234–238. doi: 10.1126/science.1198542.
  • 27. Bavro VN, et al. Structure of a kirbac potassium channel with an open bundle crossing indicates a mechanism of channel gating. Nat Struct Mol Biol. 2012;19:158–163. doi: 10.1038/nsmb.2208.
  • 28. Glembo TJ, Farrell DW, Gerek ZN, Thorpe M, Ozkan SB. Collective dynamics differentiates functional divergence in protein evolution. PLoS Comput Biol. 2012;8:e1002428. doi: 10.1371/journal.pcbi.1002428.
  • 29. Liu Y, Bahar I. Sequence evolution correlates with structural dynamics. Mol Biol Evol. 2012;29:2253–2263. doi: 10.1093/molbev/mss097.
  • 30. Lai J, Jin J, Kubelka J, Liberles DA. A phylogenetic analysis of normal modes evolution in enzymes and its relationship to enzyme function. J Mol Biol. 2012;422:442–459. doi: 10.1016/j.jmb.2012.05.028.
  • 31. Micheletti C. Comparing proteins by their internal dynamics: Exploring structure–function relationships beyond static structural alignments. Phys Life Rev. 2013;10:1–26. doi: 10.1016/j.plrev.2012.10.009.
  • 32. Shakhnovich EI, Gutin AM. Engineering of stable and fast-folding sequences of model proteins. Proc Natl Acad Sci USA. 1993;90:7195–7199. doi: 10.1073/pnas.90.15.7195.
  • 33. Dokholyan NV, Shakhnovich EI. Understanding hierarchical protein evolution from first principles. J Mol Biol. 2001;312:289–307. doi: 10.1006/jmbi.2001.4949.
  • 34. Sali A, Potterton L, Yuan F, van Vlijmen H, Karplus M. Evaluation of comparative protein modeling by MODELLER. Proteins Struct Funct Genet. 1995;23:318–326. doi: 10.1002/prot.340230306.
  • 35. Rohl CA, Strauss CE, Misura KM, Baker D. Protein Structure Prediction Using Rosetta in Methods in Enzymology. Elsevier; San Diego: 2004. pp. 66–93.
  • 36. Roy A, Kucukural A, Zhang Y. I-TASSER: A unified platform for automated protein structure and function prediction. Nat Protoc. 2010;5:725–738. doi: 10.1038/nprot.2010.5.
  • 37. Lockless SW. Evolutionarily conserved pathways of energetic connectivity in protein families. Science. 1999;286:295–299. doi: 10.1126/science.286.5438.295.
  • 38. Dutheil J, Pupko T, Jean-Marie A, Galtier N. A model-based approach for detecting coevolving positions in a molecule. Mol Biol Evol. 2005;22:1919–1928. doi: 10.1093/molbev/msi183.
  • 39. Weigt M, White RA, Szurmant H, Hoch JA, Hwa T. Identification of direct residue contacts in protein–protein interaction by message passing. Proc Natl Acad Sci USA. 2008;106:67–72. doi: 10.1073/pnas.0805923106.
  • 40. Morcos F, et al. Direct-coupling analysis of residue coevolution captures native contacts across many protein families. Proc Natl Acad Sci USA. 2011;108:E1293–E1301. doi: 10.1073/pnas.1111471108.
  • 41. Kamisetty H, Ovchinnikov S, Baker D. Assessing the utility of coevolution-based residue–residue contact predictions in a sequence- and structure-rich era. Proc Natl Acad Sci USA. 2013;110:15674–15679. doi: 10.1073/pnas.1314045110.
  • 42. Feinauer C, Skwark MJ, Pagnani A, Aurell E. Improving contact prediction along three dimensions. PLoS Comput Biol. 2014;10:e1003847. doi: 10.1371/journal.pcbi.1003847.
  • 43. Skwark M, Raimondi D, Michel M, Elofsson A. Improved contact predictions using the recognition of protein like contact patterns. PLoS Comput Biol. 2014;10:e1003889. doi: 10.1371/journal.pcbi.1003889.
  • 44. Hayat S, Sander C, Marks D, Elofsson A. All-atom 3D structure prediction of transmembrane β-barrel proteins from sequences. Proc Natl Acad Sci USA. 2015;112:5413–5418. doi: 10.1073/pnas.1419956112.
  • 45. Dutheil J, Pupko T, Jean-Marie A, Galtier N. A model-based approach for detecting coevolving positions in a molecule. Mol Biol Evol. 2005;22:1919–1928. doi: 10.1093/molbev/msi183.
  • 46. Fodor AA, Aldrich RW. Influence of conservation on calculations of amino acid covariance in multiple sequence alignments. Proteins Struct Funct Bioinformatics. 2004;56:211–221. doi: 10.1002/prot.20098.
  • 47.Liberles DA, et al. The interface of protein structure, protein biophysics, and molecular evolution. Protein Sci. 2012;21:769–785. doi: 10.1002/pro.2071. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 48.Süel G, Lockless S, Wall M, Ranganathan R. Evolutionarily conserved networks of residues mediate allosteric communication in proteins. Nat Struct Biol. 2003;10:59–69. doi: 10.1038/nsb881. [DOI] [PubMed] [Google Scholar]
  • 49.Halabi N, Rivoire O, Leibler S, Ranganathan R. Protein sectors: Evolutionary units of three-dimensional structure. Cell. 2009;138:774–786. doi: 10.1016/j.cell.2009.07.038. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 50.Dwyer RS, Ricci DP, Colwell LJ, Silhavy TJ, Wingreen NS. Predicting functionally informative mutations in Escherichia coli BamA using evolutionary covariance analysis. Genetics. 2013;195:443–455. doi: 10.1534/genetics.113.155861. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 51.Palovcak E, Delemotte L, Klein ML, Carnevale V. Evolutionary imprint of activation: The design principles of VSDs. J Gen Physiol. 2014;143:145–156. doi: 10.1085/jgp.201311103. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 52.Palovcak E, Delemotte L, Klein ML, Carnevale V. Comparative sequence analysis suggests a conserved gating mechanism for TRP channels. J Gen Physiol. 2015;146:37–50. doi: 10.1085/jgp.201411329. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 53.Sutto L, Marsili S, Valencia A, Gervasio FL. From residue coevolution to protein conformational ensembles and functional dynamics. Proc Natl Acad Sci USA. 2015;112:13567–13572. doi: 10.1073/pnas.1508584112. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 54.Woods KN, Pfeffer J. Using THz spectroscopy evolutionary network analysis methods, and MD simulation to map the evolution of allosteric communication pathways in c-type lysozymes. Mol Biol Evol. 2015;33:40–61. doi: 10.1093/molbev/msv178. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 55.Figliuzzi M, Jacquier H, Schug A, Tenaillon O, Weigt M. Coevolutionary landscape inference and the context-dependence of mutations in beta-lactamase TEM-1. Mol Biol Evol. 2015;33:268–280. doi: 10.1093/molbev/msv211.
  • 56.Haldane A, Flynn WF, He P, Vijayan R, Levy RM. Structural propensities of kinase family proteins from a Potts model of residue co-variation. Protein Sci. 2016;25:1378–1384. doi: 10.1002/pro.2954.
  • 57.Sutto L, Marsili S, Valencia A, Gervasio F. From residue coevolution to protein conformational ensembles and functional dynamics. Proc Natl Acad Sci USA. 2015;112:13567–13572. doi: 10.1073/pnas.1508584112.
  • 58.Poon A, et al. Phylogenetic analysis of population-based and deep sequencing data to identify coevolving sites in the nef gene of HIV-1. Mol Biol Evol. 2010;27:819–832. doi: 10.1093/molbev/msp289.
  • 59.Yu FH, Catterall WA. The VGL-Chanome: A protein superfamily specialized for electrical signaling and ionic homeostasis. Sci Signaling. 2004;2004:re15. doi: 10.1126/stke.2532004re15.
  • 60.Ekeberg M, Lövkvist C, Lan Y, Weigt M, Aurell E. Improved contact prediction in proteins: Using pseudolikelihoods to infer Potts models. Phys Rev E. 2013;87:012707. doi: 10.1103/PhysRevE.87.012707.
  • 61.Ponzoni L, Polles G, Carnevale V, Micheletti C. SPECTRUS: A dimensionality reduction approach for identifying dynamical domains in protein complexes from limited structural datasets. Structure. 2015;23:1516–1525. doi: 10.1016/j.str.2015.05.022.
  • 62.Von Luxburg U. A tutorial on spectral clustering. Stat Comput. 2007;17:395–416.
  • 63.Saramäki J, Kivelä M, Onnela JP, Kaski K, Kertész J. Generalizations of the clustering coefficient to weighted complex networks. Phys Rev E. 2007;75:027105. doi: 10.1103/PhysRevE.75.027105.
  • 64.Hanas J, Hazuda D, Bogenhagen D, Wu F, Wu C. Xenopus transcription factor A requires zinc for binding to the 5 S RNA gene. J Biol Chem. 1983;258:14120–14125.
  • 65.Lu D, Klug A. Invariance of the zinc finger module: A comparison of the free structure with those in nucleic-acid complexes. Proteins Struct Funct Bioinformatics. 2007;67:508–512. doi: 10.1002/prot.21289.
  • 66.Lee BM, et al. Induced fit and “lock and key” recognition of 5S RNA by zinc fingers of transcription factor IIIA. J Mol Biol. 2006;357:275–291. doi: 10.1016/j.jmb.2005.12.010.
  • 67.Espada R, Parra RG, Mora T, Walczak AM, Ferreiro DU. Capturing coevolutionary signals in repeat proteins. BMC Bioinformatics. 2015;16:207. doi: 10.1186/s12859-015-0648-3.
  • 68.Micheletti C, Carloni P, Maritan A. Accurate and efficient description of protein vibrational dynamics: Comparing molecular dynamics and Gaussian models. Proteins Struct Funct Bioinformatics. 2004;55:635–645. doi: 10.1002/prot.20049.
  • 69.Fuglebakk E, Reuter N, Hinsen K. Evaluation of protein elastic network models based on an analysis of collective motions. J Chem Theor Comput. 2013;9:5618–5628. doi: 10.1021/ct400399x.
  • 70.Hille B, et al. Ion Channels of Excitable Membranes. Vol 507 Sinauer; Sunderland, MA: 2001.
  • 71.Long SB, Tao X, Campbell EB, MacKinnon R. Atomic structure of a voltage-dependent K+ channel in a lipid membrane-like environment. Nature. 2007;450:376–382. doi: 10.1038/nature06265.
  • 72.Payandeh J, Scheuer T, Zheng N, Catterall WA. The crystal structure of a voltage-gated sodium channel. Nature. 2011;475:353–358. doi: 10.1038/nature10238.
  • 73.Liao M, Cao E, Julius D, Cheng Y. Structure of the TRPV1 ion channel determined by electron cryo-microscopy. Nature. 2013;504:107–112. doi: 10.1038/nature12822.
  • 74.Kasimova M, Granata D, Carnevale V. Voltage-gated sodium channels: Evolutionary history and distinctive sequence features. Curr Top Membr. 2016;78:261–286. doi: 10.1016/bs.ctm.2016.05.002.
  • 75.Lu Z, Klem AM, Ramu Y. Coupling between voltage sensors and activation gate in voltage-gated K+ channels. J Gen Physiol. 2002;120:663–676. doi: 10.1085/jgp.20028696.
  • 76.Broomand A, Männikkö R, Larsson HP, Elinder F. Molecular movement of the voltage sensor in a K channel. J Gen Physiol. 2003;122:741–748. doi: 10.1085/jgp.200308927.
  • 77.Long S, Campbell E, MacKinnon R. Voltage sensor of Kv1.2: Structural basis of electromechanical coupling. Science. 2005;309:903–908. doi: 10.1126/science.1116270.
  • 78.Voets T, Talavera K, Owsianik G, Nilius B. Sensing with TRP channels. Nat Chem Biol. 2005;1:85–92. doi: 10.1038/nchembio0705-85.
  • 79.Ramsey IS, Delling M, Clapham DE. An introduction to TRP channels. Annu Rev Physiol. 2006;68:619–647. doi: 10.1146/annurev.physiol.68.040204.100431.
  • 80.Feng Q. Temperature Sensing by Thermal TRP Channels in Current Topics in Membranes. Elsevier; San Diego: 2014. pp. 19–50.
  • 81.Carnevale V, Rohacs T. TRPV1: A target for rational drug design. Pharmaceuticals. 2016;9:E52. doi: 10.3390/ph9030052.
  • 82.Cao E, Liao M, Cheng Y, Julius D. TRPV1 structures in distinct conformations reveal activation mechanisms. Nature. 2013;504:113–118. doi: 10.1038/nature12823.
  • 83.Yang F, et al. Structural mechanism underlying capsaicin binding and activation of the TRPV1 ion channel. Nat Chem Biol. 2015;11:518–524. doi: 10.1038/nchembio.1835.
  • 84.Darré L, Domene C. Binding of capsaicin to the TRPV1 ion channel. Mol Pharmaceutics. 2015;12:4454–4465. doi: 10.1021/acs.molpharmaceut.5b00641.
  • 85.Elokely K, et al. Understanding TRPV1 activation by ligands: Insights from the binding modes of capsaicin and resiniferatoxin. Proc Natl Acad Sci USA. 2015;113:E137–E145. doi: 10.1073/pnas.1517288113.
  • 86.Gao Y, Cao E, Julius D, Cheng Y. TRPV1 structures in nanodiscs reveal mechanisms of ligand and lipid action. Nature. 2016;534:347–351. doi: 10.1038/nature17964.
  • 87.Baker FN, Porollo A. Coeviz: A web-based tool for coevolution analysis of protein residues. BMC Bioinformatics. 2016;17:119. doi: 10.1186/s12859-016-0975-z.
  • 88.Teşileanu T, Colwell LJ, Leibler S. Protein sectors: Statistical coupling analysis versus conservation. PLoS Comput Biol. 2015;11:e1004091. doi: 10.1371/journal.pcbi.1004091.
  • 89.Rivoire O. Elements of coevolution in biological sequences. Phys Rev Lett. 2013;110:178102. doi: 10.1103/PhysRevLett.110.178102.
  • 90.Cocco S, Monasson R, Weigt M. From principal component to direct coupling analysis of coevolution in proteins: Low-eigenvalue modes are needed for structure prediction. PLoS Comput Biol. 2013;9:e1003176. doi: 10.1371/journal.pcbi.1003176.
  • 91.Smock RG, et al. An interdomain sector mediating allostery in hsp70 molecular chaperones. Mol Syst Biol. 2010;6:414. doi: 10.1038/msb.2010.65.
  • 92.Remmert M, Biegert A, Hauser A, Söding J. HHblits: Lightning-fast iterative protein sequence searching by HMM-HMM alignment. Nat Methods. 2011;9:173–175. doi: 10.1038/nmeth.1818.
  • 93.Lapedes A, Giraud B, Liu L, Stormo G. 1998. Correlated Mutations in Protein sequences: Phylogenetic and Structural Effects (Santa Fe Inst, Santa Fe, NM)
  • 94.Ng AY, Jordan MI, Weiss Y. On spectral clustering: Analysis and an algorithm. Adv Neural Inf Process Syst. 2002;2:849–856.
  • 95.Watts D, Strogatz S. Collective dynamics of ‘small-world’ networks. Nature. 1998;393:440–442. doi: 10.1038/30918.
  • 96.Newman M. From the Genome to the Internet. Wiley-Blackwell; Berlin: 2004. Random graphs as models of networks; pp. 35–68.
