Abstract
The essential biological properties of proteins—folding, biochemical activities, and the capacity to adapt—arise from the global pattern of interactions between amino acid residues. The statistical coupling analysis (SCA) is an approach to defining this pattern that involves the study of amino acid coevolution in an ensemble of sequences comprising a protein family. This approach indicates a functional architecture within proteins in which the basic units are coupled networks of amino acids termed sectors. This evolution-based decomposition has potential for new understandings of the structural basis for protein function. To facilitate its usage, we present here the principles and practice of the SCA and introduce new methods for sector analysis in a python-based software package (pySCA). We show that the pattern of amino acid interactions within sectors is linked to the divergence of functional lineages in a multiple sequence alignment—a model for how sector properties might be differentially tuned in members of a protein family. This work provides new tools for studying proteins and for generally testing the concept of sectors as the principal units of function and adaptive variation.
Author Summary
Proteins display the ability to fold, to carry out complex biochemical reactions, and to be adaptive to changing conditions of selection—the essential characteristics contributing to organismal fitness. A major goal is to understand how these properties emerge from the global pattern of interactions between amino acid residues. Here, we describe the principles and implementation of the statistical coupling analysis (SCA), a method to reveal this pattern through analysis of coevolution between amino acids in an ensemble of homologous sequences. The basic result is a decomposition of protein structures into groups of contiguous amino acids called “sectors” which have been linked to conserved functional properties. This work provides conceptual and practical tools for sector analysis in any sufficiently well-represented protein family, and represents a necessary basis for broadly testing the concept of protein sectors.
Introduction
The amino acid sequence of a protein reflects the selective constraints underlying its fitness and, more generally, the evolutionary history that led to its formation [1]. A central problem is to decode this information from the sequence, and thus understand both the “architecture” of natural proteins, and the process by which they evolve. With the dramatic expansion of the sequence databases, a powerful strategy is to carry out statistical analyses of the evolutionary record of a protein family [2–6]. With the assumption that the principal constraints underlying folding, function, and other aspects of fitness are conserved during evolution, the idea is to start with an ensemble of homologous sequences, make a multiple sequence alignment, and compute a matrix of correlations between sequence positions—the expected statistical signature of couplings between amino acids. Using mathematical analyses that explore different aspects of this matrix [7, 8], studies have exposed tertiary structural contacts in protein structures (Direct Coupling Analysis, or DCA, [4, 9]), determinants of binding specificity in paralogous protein complexes [5], and larger, collectively evolving functional networks of amino acids termed “protein sectors” (Statistical Coupling Analysis, or SCA [10]. These different approaches suggest a hierarchy of information contained in protein sequences that ranges from local constraints that come from direct contacts between amino acids in protein structures to global constraints that come from the cooperative action of many amino acids distributed through the protein structure. Sectors are interesting since they may represent the structural basis for functional properties such as signal transmission within [3, 6, 11–14] and between [15–17] proteins, allosteric regulation [6, 15, 18–20], the collective dynamics associated with catalytic reactions [16], and the capacity of proteins to adapt [21]. In addition, experiments show that reconstituting sectors is sufficient to build artificial proteins that fold and function in a manner similar to their natural counterparts [22–24]. Thus, the quantitative analysis of coevolution provides a powerful approach for generating new hypotheses about the physics and evolution of protein folding and function.
These results imply that together with structure determination and functional measurements, the evolution-based decomposition of proteins should be a routine process in our study of proteins. However, the analysis of coevolution poses non-trivial challenges, both conceptually and technically. Conceptually, coevolution is the statistical consequence of the cooperative contribution of amino acid positions to organismal fitness, a property whose relationship to known structural or biochemical properties of proteins remains open for study. Indeed, there is no pre-existing model of physical couplings of amino acids with which to validate patterns of coevolution. Thus, the goal of coevolution based methods is to produce models for the pattern of constraints between amino acids that can then be experimentally tested for structural, biochemical, and evolutionary meaning. Technically, the analysis of coevolution is complicated by both the limited and biased sampling of sequences comprising a protein family. Thus, empirical correlations deduced from multiple sequence alignments do not always reflect coevolution. Interestingly, the complexities in sequence sampling can represent both sources of noise and useful signal in decomposing protein structures, and it is essential to understand these issues in effectively using methods of coevolution.
The DCA approach for mapping amino acid contacts has been well-described by analogy with established theory in statistical physics [4]. Here, we present the principles and implementation of the SCA method for identifying sectors and introduce new tools for understanding the global patterns of coevolution between amino acid positions. The methods are implemented in an open-source python-based software package that is available to the scientific community, and illustrated in the main text using the small G protein family of nucleotide-dependent switches [25, 26] and the S1A family of serine proteases [27, 28]. Technical modifications from previous implementations of SCA are indicated in the main text and summarized in the the S1 Table. In prior work, we examined just the broadest level of coevolution to define sectors—quasi-independent groups of coevolving amino acids [10]. We now go beyond this top-level decomposition to reveal a more elaborate internal architecture for sectors in which subgroups of amino acids diverge along functional, and sometimes phylogenetic, subfamilies within the sequence alignment. Overall, this work provides a necessary foundation for broad testing of the concept of protein sectors.
Results
The SCA begins with an alignment (M sequences by L positions) representing a sampling of homologous sequences expected to share common selective pressures (Fig 1 and S1 Text, section A). Standard sequence database searching algorithms (BLAST, PSI-BLAST, etc.) [29] together with automated alignment tools (e.g. PROMALS [30]), or available databases of multiple sequence alignments such as PFAM [31] seem to provide suitable sequence alignments. Since SCA concentrates on conserved features of protein sequences (see below), it is relatively robust to variations in alignment quality, but will depend on the depth and diversity of sampling of homologous sequences. While an alignment of a protein family is in principle sufficient for an analysis of coevolution, taxonomic and functional annotations and atomic structures are valuable for interpretation. In this work, we will assume that an atomic structure is known for at least one sequence in the alignment. We also assume that the alignment has been subject to a number of pre-processing steps in which positions and sequences with too many gaps are removed, and a simple sequence-weighting scheme [4] is applied to correct for the trivial over-representation of high-identity sequences. Each sequence s is given a weight , where Ns is the number of sequences with an identity to s above a specified threshold (80% by default, Box 1). With sequence weights, we can compute an effective number of sequences in the alignment M′ = ∑s ws where ws is the weight for sequence s. For computational efficiency, the alignment is then down-sampled to a limit that preserves M′. More advanced methods for treating sequence relationships are possible [32] and will require further study.
BOX 1: A SUMMARY OF CALCULATIONS
Alignment preprocessing: An alignment is represented by a M × L × 20 binary array where s = 1, …, M labels the sequences, i = 1, …, L the positions, a = 1, …, 20 the amino acids, with if sequence s has amino acid a at position i and 0 otherwise. Preprocessing steps:
Truncate excessively gapped positions based on a reference sequence or by a specified gap fraction cutoff (default, 0.4);
Remove sequences with a fraction of gaps greater than a specified value γseq (default, γseq = 0.2);
Remove sequences r with Sr < Δ, where Sr if the fractional identity between r and a specified reference sequence (default, Δ = 0.2);
Compute sequence weights ws = 1/|{r: Srs > δ}| where Srs if the fractional identity between r and s (default, δ = 0.8), and truncate positions i with a frequency of gaps greater than a specified value γpos (default, γpos = 0.2);
-
Recompute the sequence weights ws for the truncated alignment, and compute the frequencies of amino acid at individual positions i as , and at pairs of positions ij as , where M′ = ∑s ws represents the effective number of sequences in the alignment and where λ is a small regularization parameter (default, λ = 0.03).
When dealing with large alignments, a sixth step may be added to speed up the subsequent calculations:
Resample M′′ sequences, with M′ < M′′ < M, by drawing them randomly from the original alignment with weights ws so as to form an alignment with a smaller number of sequences but an equivalent effective number of sequences (which may slightly exceed M′, see SI; default, M′′ = 1.5 × M′).
Structure of evolutionary conservation: For a large and diverse alignment (M′ > 100, minimally), the evolutionary conservation of each amino acid a at position i taken independently of other positions is measured by the statistical quantity , the Kullback-Leibler relative entropy of given qa, the background distribution of amino acids:
(1) |
q is computed over the non-redundant database of protein sequences. If gaps are considered, and represents the fraction of gaps in the alignment, a background frequency for gaps can be taken as , and then for the 20 amino acids. Also, defines the overall conservation of position i taking all amino acids into account. To examine the co-evolution of pairs of amino acids, we introduce a measure that reports the significance of the raw correlations, , as judged by the degree of conservation of the underlying amino acids:
(2) |
The information in the amino acid correlation matrix for each pair of positions is compressed into one number by computing the “Frobenius norm” of the 20 × 20 matrix for each (ij):
(3) |
Analysis of involves (a) spectral (or eigenvalue) decomposition of , given by , (b) determination of k* significant eigenvalues (by comparison with vertically randomized alignments), (c) a transformation of the top k* eigenvectors by independent components analysis (ICA), and (d) study of the pattern of residue contributions along independent components (ICs) 1…k*. Distinct groups of positions can emerge along the ICs for two reasons: (1) the existence of multiple independent sectors, or (2) the hierarchical breakdown of one sector into subgroups that arise from heterogeneities in the alignment.
Mapping and sector interpretation: The singular value decomposition (SVD) of the 20 × 20 matrix for each (ij), , has the property that for c ≠ 1 (S1 Fig). That is, the information in the amino acid correlation matrix for each pair of positions can be compressed into one number, the top singular value (also known as the “spectral norm”). Besides compressibility, another empirical property of the SVD of is that for a given position i, the top singular vector is (up to the sign) nearly independent of j (S3 Fig). That is, the amino acids by which a position i makes correlations with other positions j is nearly the same, and therefore is essentially a property of just position i taken independently. This defines a projection matrix with which we can reduce the M × L × 20 array to an M × L alignment matrix . The matrix xsi gives a mapping between the space of positional correlations and the space of sequence correlations. Specifically, if and are the eigenvalues and eigenvectors of , then
(4) |
represents the structure of sequence space corresponding to the positional correlations in . Also, if W is the transformation matrix derived from the ICA of , then represents the sequence space corresponding to the ICs of . This mapping between position and sequence space provides a method to study the origin of the hierarchical pattern of coevolution that underlies sectors.
In this work, we use a PFAM-based alignment of the G protein family (PFAM, PF00071, version 27.0), and an alignment of S1A serine proteases modified from Halabi et al. [10]. After the pre-processing steps with default values for thresholds, we obtain a final G protein sequence alignment of 4978 sequences by 158 positions (3364 effective sequences) and a final S1A alignment of 1344 sequences by 205 positions (928 effective sequences). In what follows, we will assume that sequence weights are applied and for simplicity we will simply denote M′ by M. No calculations below explicitly depend on its value; we shall only assume that M′ is large enough to give good estimates of amino acid frequencies (M′ > 100).
An interesting point is that for nearly all alignments, the number of “variables” (L × 20 possible amino acids) is typically on the order of or greater than the number of “samples” (M). Thus, it would seem impossible to reliably estimate the correlations between every pair of amino acids given such limited sampling. However, the sparsity of the constraints between amino acids observed both statistically [6, 10] and experimentally [33–35] effectively reduces the dimensionality of the solution, enabling practical approaches. The key issue is to propose a general approach for recognizing the “basis”—or groups of relevant amino acid positions—in which this solution largely exists. In SCA, the approach is to weight correlations by the degree of conservation of amino acids with the intuition that this fundamentally defines the relevance of features emerging from an evolutionary process. We develop this approach by first defining the first-order conservation of positions taken independently and then extending to correlations between positions.
First-order statistics: Position-specific conservation
The evolutionary conservation of a sequence position is estimated from the deviation of the observed distribution of amino acids at this position from a background distribution expected by neutral drift. A simple mathematical quantity that captures this concept is
(5) |
where is the observed frequency of amino acid a at position i in the alignment and qa is the background expectation (see S1 Text, section B for derivation). is known as the Kullback-Leibler relative entropy [36] and indicates how unlikely the observed frequency of amino acid a at position i would be if a occurred randomly with probability qa—a quantitative measure of position-specific conservation. Note that only when and increases more and more steeply as deviates from qa (Fig 2), consistent with intuition that a measure of conservation should non-linearly describe the divergence of the observed distribution of amino acids from their expected values. An underlying assumption in the derivation of the relative entropy is that the sampling of sequences in the alignment is unbiased, a condition that, to varying extent, is violated by the tree-like phylogenetic structure of real alignments. But without validated models for protein evolution that can provide a basis for more accurate measures of conservation, this choice reflects the simplest definition that satisfies the general principle of conservation. Finally, Eq (5) gives the conservation of each amino acid a at each position i, but an overall positional conservation Di can be defined following the same principles (Fig 3A, and see S1 Text, section C).
Analysis of the spatial pattern of positional conservation generally leads to a simple conclusion: the solvent inaccessible core of proteins and functional surfaces tend to be more conserved and the remainder of the surface is less conserved (Fig 3A and 3B) [10, 37, 38]. Thus, positional conservation in sequence alignments reflects well-known properties of protein three-dimensional structures.
Second-order statistics: Conserved correlations
The cooperativity of amino acids in specifying protein folding and function implies that the concept of positional conservation of individual positions should at least be extended to a concept of pairwise conservation, reporting coevolution between positions in a protein family. Given the alignment, a measure of correlation of the pair of amino acids (a, b) at positions (i, j) is given by the difference of their joint frequency and that expected in absence of correlation, . Computed for all pairs of amino acids in the alignment, this defines a covariance matrix
(6) |
Alternatively, statistical dependency can be quantified by the mutual information, whose origin is similar to the relative entropy [5, 36]. However, both the covariance matrix and the mutual information report deviations from independence given the frequencies , and do not take into account the evolutionary relevance of observing those frequencies. In the current implementation of SCA, the approach is to perform a first-order perturbation analysis on the multiple sequence alignment in which we compute the correlated conservation of pairs of amino acids. To explain, consider that many alignments A are available for the same protein family. We can then define relative entropies —our measure of positional conservation—for each alignment A, and compute their correlations over the ensemble of alignments by
(7) |
where the angled brackets indicate averages over the A alignments. In practice, many such alignments can be obtained by bootstrap resampling the original alignment [39]; for instance, a procedure known as “jackknife resampling” consists of successively removing each sequence s from the original alignment to create a collection of M sub-alignments. A perturbative expansion of as a function of for the jackknife resampling process shows that Eq (7) yields a covariance matrix that has the form
(8) |
in which is a function of the conservation of each amino acid at each position (see S1 Text, section D for derivation) [10]. That is, SCA produces a weighted covariance matrix, with the weighting function ϕ controlling the degree of emphasis on conservation. This definition of ϕ has the property of rising steeply as the frequencies of amino acids approach one. As a consequence, these weights damp correlations in arising from weakly conserved amino acids (the gradient of approaches zero as ), and emphasize conserved correlations.
Another way to understand these weights comes from considering their role in determining similarities between sequences comprising the alignment. The mathematical principles are described below, but in essence positional weights redefine the distance between sequences in a manner that emphasizes variation at more conserved positions in the alignment (see S1 Text, section I). It is logical that such a “conservation-biased” distance metric between sequences will provide a better representation of the functional differences (as opposed to historical differences) between sequences. The weighting by in Eq (8) implements the same principle applied to the correlations between positions instead of the correlations between sequences.
In principle, the specific form of ϕ should vary depending on the evolutionary history of the protein properties that are under consideration; the more conserved the properties of interest are, the more the weights should emphasize conservation [40]. Indeed, different weighting functions are possible if mathematical formalisms other than the KL entropy are proposed for defining positional conservation, or if other approaches than the first-order perturbation analysis described here are developed. For example, early versions of the SCA method [3, 6] involved slightly different weights whose technical origins are given in S1 Text, section E.
is a four-dimensional array of L positions × L positions × 20 amino acids × 20 amino acids, but we can compress it into a L × L matrix of positional correlations by taking a magnitude (the Frobenius norm) of each 20 × 20 amino acid coevolution matrix for each pair of positions (i, j):
(9) |
See S1 Text, section F and S1 Fig for additional arguments about compressibility of .
Fig 3C and 3D shows the matrix for the G protein family. As previously reported, the matrix is heterogeneous, with a small number of positions engaged in significantly higher correlations than most positions (Fig 3C, [6, 10]). Hierarchical clustering makes this heterogeneity more apparent, and reveals the existence of nested clusters of correlated positions (Fig 3D). These findings are qualitatively consistent with a sparse, hierarchical, and cooperative pattern of evolutionary constraints. As we show below, there is also modularity [10], with quasi-independent groups of positions emerging from the correlations (the sectors and their subdivisions). Unlike the interpretation of first-order conservation (Fig 3A and 3B), none of these properties is obvious in either current analyses of protein structures.
Decomposition of the SCA positional correlation matrix
How can we understand the pattern of coevolution in the matrix? The existence of correlations(Fig 3C) indicates that treating the amino acid positions as the basic units of proteins is not the most relevant representation. Instead, we should seek a transformation that re-parameterizes the protein into groups of correlated positions that are maximally independent from each other—a more natural representation that defines the units of evolutionary selection. The first step in this process is spectral (or eigenvalue) decomposition. Per this decomposition, the matrix is written as a product of three matrices:
(10) |
where is an L × L diagonal matrix of eigenvalues (ranked by magnitude) and is an L × L matrix whose columns contain the associated eigenvectors. Each eigenvalue gives the quantity of information (variance) in captured, and each associated eigenvector in gives the weights for combining sequence positions into transformed variables (or eigenmodes).
For both G protein and S1A alignments, the histogram of eigenvalues—the spectrum of —reveals a few large eigenvalues extending from a majority of small values (Fig 4A and 4B, black). To estimate the number of significant eigenvalues, we compare the actual spectrum with that for many trials of randomized alignments in which the amino acids at each position are scrambled independently [10] (Fig 4A and 4B, red line). This randomization removes true positional correlations, leaving behind the spurious correlations expected due to finite sampling in the alignment. As is the case for all practical alignments in which the number of effective sequences is not large compared to the number of amino acids, these spurious correlations account for the bulk of the spectrum. Indeed, for both alignments this analysis indicates that just the top few eigenmodes are statistically significant (k* = 4, G protein, and k* = 7, S1A; see S2 Fig for an analysis of robustness). Thus, the k* associated eigenvectors define a low dimensional space in which patterns of positional coevolution can be studied (e.g. Fig 4C and 4D). It is important to note that the precise value of k* is not a fundamental property of a protein family; it depends on protein size and the number of effective sequences. Nevertheless, with adequate sampling (M′ > 100) the analysis of sectors seems largely robust to its precise value (see DHFR tutorial, S3 Text).
Fig 4C and 4D show structure of the space spanned by the top three eigenvectors for the G protein and S1A families, respectively. In these graphs, the (Euclidean) distance of a position from the origin reports its overall contribution to the correlations, and the distance between two positions indicate their degree of correlation: strongly correlated positions appear near-by, while weakly correlated positions are far apart, or, for the majority that do not make any substantial contributions to the correlations, clustered near the origin. As a consequence, independent sets of correlated positions are expected to cluster into groups of positions at distance from the origin. When the correlations within these groups are organized hierarchically, these clusters extend radially with positions at extremity representing the core of the hierarchy, and successive layers at decreasing distance from the origin representing progressively weaker levels of the hierarchy. For both protein families, this analysis suggests a few distinct groups of positions that seem to emerge radially from the origin (Fig 4C and 4D, different colors).
The spectral decomposition is effective for dimension reduction, but the eigenmodes generally do not provide an optimal representation of groups of coevolving positions. For example, distinct groups of positions emerge along combinations of the k* top eigenvectors [10]. The reason is that just decorrelation of the positions by diagonalizing the matrix—the essence of eigenvalue decomposition—is a weaker criterion than achieving statistical independence, which demands absence of not only pairwise correlations, but lack of any higher order statistical couplings. In prior work, we managed this problem heuristically, finding combinations of eigenmodes, excluding the first, that happen to represent quasi-independent groups [10]. Here, we introduce the use of independent components analysis (ICA [41, 42])—an extension of spectral decomposition—that computationally addresses this problem. ICA uses numerical optimization to deduce a matrix W that transforms the k* top eigenmodes of a correlation matrix into k* maximally independent components (ICs, S1 Text, section G),
(11) |
The bottom line is that the k* ICs (in columns of ) should now represent a more appropriate organization of positional coevolution.
In both G proteins (Fig 4E) and S1A proteases (Fig 4F), ICA produces a representation in which the majority of positions are weakly correlated and cluster near the origin and a relatively small subset of positions comprise quasi-independent groups of amino acids emerging along separate orthonormal axes (the ICs). The ICs need not be strictly independent, a key issue in defining sectors that we discuss in detail below. Nevertheless, spectral decomposition with ICA provides the sort of transformation of protein sequences that we seek—based on their evolutionary correlations, amino acid positions are regrouped and transformed into new effective variables (the ICs) that represent collectively evolving modes of the protein under study.
Sequence-position mapping: An interpretation of the decomposition
How can we examine the relevance of the IC-based decomposition of proteins? A approach comes from understanding a fundamental mathematical relationship between the pattern of positional correlations (which defines ICs) and the structure of the sequence space spanned by the alignment (which defines sequence subfamilies) [43, 44]. The concepts underlying this mapping between positions and sequences were presented either heuristically [10] or partially [17] in prior work; here, we provide a full explanation with new mathematical methods.
Consider the two-dimensional binary matrix representation of an alignment Xsn comprised of M sequences by 20L amino acids (Figs 1C and 5A). From Xsn, we can compute two kinds of correlations: (1) a correlation matrix over rows , which represents the similarity (fraction identity) of each pair of sequences r and s (Fig 5B) and (2) a correlation matrix over columns , which represents the joint frequency of amino acids at each pair of positions (Fig 5C). F and S are intimately related to each other by a mathematical property known as the singular value decomposition (SVD). Specifically, if U represents the eigenvectors of the sequence correlation matrix S and V represents the eigenvectors of the amino acid correlation matrix F, then
(12) |
where Λ is a diagonal matrix whose entries are (up to a scaling factor) eigenvalues of both S and F. The key conceptual point is that by the SVD, the eigenvectors of S are a mapping from the eigenvectors of F, where the “map” is the alignment X itself,
(13) |
This introduces the principle of sequence-position mapping, using the full alignment matrix X to relate patterns of amino acid correlations (in V) to patterns of sequence divergence (in U). But, to study the pattern of sequence divergence associated with sectors, we need to make a similar mapping using the conservation-weighted dimension-reduced coevolution matrix (rather than the unweighted amino acid correlation matrix F). Since is a L × L positional correlation matrix, a sequence-space mapping analogous to Eq (13) requires a dimension-reduced alignment matrix in which the 20 amino acids at each position are compressed into a single value. The Supplementary Information describes a new approach for this step, effectively reducing the alignment from a M × L × 20 array to an M × L matrix xsi by projecting the amino acid dimension down to a single scalar value (S1 Text, section H and S4 Fig). By analogy with Eq (13), the reduced alignment matrix xsi then defines a mapping between the space of positional coevolution (in the top ICs of the matrix) and the corresponding sequence space. Specifically, if and are the eigenvalues and eigenvectors, respectively, of the SCA positional coevolution matrix , then
(14) |
represents the structure of the sequence space corresponding to the patterns of positional coevolution in . Furthermore, if W is the transformation matrix derived from ICA of , Eq (11), then
(15) |
represents the sequence space corresponding to , the ICs of the matrix.
Eqs (14) and (15) give us the necessary tools for interpreting the IC-based decomposition of proteins. For the S1A family, Fig 6 shows a mapping between the top six ICs and the corresponding sequence space. Sequences are colored by enzymatic function (Fig 6A–6C, the haptoglobins are non-catalytic homologs of the S1A family), by catalytic specificity (Fig 6D–6F), or by phylogenetic origin (Fig 6G–6I). The data show that ICs 1–3 correspond to essentially orthogonal divergences in the S1A protein family. IC1 (but not any of the other ICs) separates the catalytic from non-catalytic S1A proteins (Fig 6A), IC2 uniquely separates S1A proteins by their annotated primary (P1 site) catalytic specificity [28] (Fig 6D), and IC3 uniquely separates vertebrate and invertebrate sequences (Fig 6H). ICs4–6 show more subtle inhomogeneities with regard to catalytic specificity (Fig 6E and 6F), indicating finer subdivisions of the annotated sequences—well-defined predictions for further study. Thus, the ICs of the matrix contain independently evolving functional units within the S1A protease [10].
Fig 7A–7D shows the mapping between the top ICs of the G-protein family and the corresponding sequence space, colored either by functional sub-type (Fig 7A and 7B) or by taxonomic origin (Fig 7C and 7D). The data show that IC1 and IC2 separate different sub-classes of the G protein family, suggesting that like in S1A proteases, amino acid motifs in different ICs can control different functional properties (Fig 7A). In contrast, IC3 and IC4 are associated with a near homogeneous distribution of functional subtypes, suggesting either neutral or more fine variations with regard to the broad functional annotations available in this protein family (Fig 7B).
With the exception of IC3 in the S1A family (Fig 6H), none of the ICs are obviously associated with the divergence of the main taxonomic groups in the alignment; indeed, all taxa seem nearly homogeneously distributed over the sequence modes (Up) corresponding to most of the ICs. Many paralogs of the different functional classes of G proteins and S1A proteases are found in each type of organism and thus functional divergence might therefore not be expected to follow the divergence of species. In contrast, ICs are more associated with taxonomic classification for the DHFR protein family (S4 Fig and S3 Text), consistent with the fact that this core metabolic enzyme is encoded by a single ortholog in each genome.
In summary, the sequence-position mapping provides evidence that the ICs of the matrix represent conserved, differentially evolving functional units in proteins. The ICs are not distinguished by the magnitude of positional conservation (Fig 8), showing that this decomposition of proteins is fundamentally a property of correlations—the second order terms in conservation. This finding makes an important statement about the “value added” by studying coevolution, as opposed to just the first-order conservation of positions. Indeed, it is difficult to experimentally test the unique value of statistical coevolution by conventional single mutation experiments, even when conducted on a massive scale [21, 45, 46]. Coevolution implies the need for higher-order mutational studies, which are difficult to perform quantitatively and only now starting to become feasible [47]. In this regard, the functionally meaningful, quasi-independent divergence of proteins along ICs demonstrates the necessity of coevolution in providing a proper decomposition of protein structure.
Sectors
But, does the existence of k* significant ICs imply k* independent functional units (and therefore k* sectors [10])? Not necessarily. Sectors typically have an organization in which the constituent positions can be further broken up into subsets of coevolving positions. One generative mechanism for this architecture comes from the tree-like structure of the alignment in which sequences are partitioned into functional subfamilies along which portions of one sector can diverge [43, 44]. Thus, each IC could have one of two interpretations: (1) a truly independent sector associated with a distinct function, or (2) the decomposition of a single sector (representing one functional property) into separately diverging sub-parts. In this sense, the term “independent component” is something of a misnomer, but we retain the language here for consistency with the ICA method.
How can we systematically distinguish these possibilities to deduce the number and composition of sectors? We follow a simple procedure (see tutorials in S3 Text). First, we fit each IC to an empirical statistical distribution and identify the positions contributing to the top five percent of the corresponding cumulative density function (CDF, Figs 4E, 4F and S2). The t-distribution appears to generally fit the ICs well in all cases studied to date (S5 Fig), and IC composition is robust to alignment size when diversity is maintained (S6 Fig). The CDF cutoff is an adjustable parameter, but 5% seems to agree well with experimental significance in the model systems studied [16, 21]. We then construct a sub-matrix of that contains only the selected top-scoring positions for the k* ICs, ordered by their degree of contribution to each IC. For the G protein family, this corresponds to a matrix of 54 positions that contribute to the top four significant ICs (Fig 9A). This sub-matrix describes both the pattern of “internal” correlations between positions that make up each IC (the diagonal blocks), and the pattern of “external” correlations between ICs (the off-diagonal blocks). This representation shows that ICs 1, 2, and 3 display a set of transitive inter-IC correlations, with IC1 correlated to IC2 and IC2 correlated to IC3, indicating that IC1–3 together comprise the hierarchically decomposed parts of a single sector (sector 1, Fig 9A and 9B). In contrast, IC4 shows near-independence from the other ICs, suggesting that it defines a distinct sector (sector 2, Fig 9A and 9B).
These sector definitions are made exclusively from analysis of the IC-based submatrix of , but correspond to a meaningful spatial architecture in the G protein. These proteins are binary switches that display different conformations depending on the identity of their bound guanine nucleotide [25, 26]. The exchange of GTP for GDP triggers two specific conformational changes: clamping of the so-called switch I loop closer to the nucleotide binding pocket, and transit of a disordered and weakly interacting surface loop (switch II) to an ordered helix that is well-packed against the core domain (Fig 9C) [25]. Sector 1 comprises a physically contiguous group of amino acid residues that shows excellent agreement with the nucleotide-dependent allosteric mechanism [50]. The sector is compact in the GTP-bound state but partially disrupted in the GDP-state (Fig 9B and 9C), a finding consistent with the state-dependent connectivity between the nucleotide-binding pocket and the switch loops. Furthermore, the hierarchical breakdown of sector 1 into its constituent ICs 1, 2 and 3 reveals a meaningful structural organization: IC3 (cyan) defines a physically contiguous network that comprises the nucleotide binding pocket, IC1 (light blue) defines the packing interactions between switch II and the core domain, and IC2 (dark blue) represents a set of surface accessible positions (including switch I) that link to the buried core of sector 1 (Fig 9C). Nucleotide exchange substantially reorganizes the structure and connectivity of IC1 and 2, but is largely inconsequential for IC3 (Fig 9C).
Consistent with assignment as an independent sector, sector 2 (IC4, red) also comprises a mostly physically contiguous group within the core of the G protein (Fig 9B); like IC3 of sector 1 (cyan) it shows no nucleotide-dependent conformational plasticity. These results are interesting since IC1 and IC2 (but not IC3 or IC4) are associated with the divergence of functional sub-classes of G protein (Fig 7A). The data suggest that IC3 (cyan subset, sector 1) and IC4 (sector 2) are global functional modes shared by all members of the G protein family, while ICs 1 and 2 correspond to subsets of sector 1 that are specialized for tuning allosteric or effector-binding properties within sub-classes of G proteins. These observations represent new hypotheses for further study.
For the S1A family, the IC-based submatrix shows little evidence of inter-IC correlations, and thus we conservatively treat all ICs as separate sectors (S7 Fig). Each sector corresponds to a largely contiguous network of amino acids in the protease tertiary structure, a decomposition consistent with the orthogonal sequence divergences and with previous reports (S7 Fig, and [10]). Examples of sector analysis for two other protein families—the dihydrofolate reductases [51] and the class A beta-lactamases [52]—are provided in S8 Fig and in tutorials (S3 Text).
The process of sector identification presented here is heuristic, requiring the judgement of the practicing scientist to determine the grouping of ICs to form sectors. This reflects that fact that various degrees of independence between ICs are possible depending on the statistical nature of selective pressures operating on a protein family. Thus, an automated approach to interpreting the matrix awaits more broad experience with sector analysis in many protein families. Given the importance of interpreting hierarchical correlation matrices in general (e.g [53–55]), it seems reasonable that such automation might be achieved with further work.
Discussion
A fundamental goal in biology is to understand the architectural principles of proteins—the pattern of constraints on and between amino acids that underlies folding, biochemical activities, and adaptation. An emerging approach is to leverage the growing databases of protein sequences to statistically infer these constraints from large and diverse ensembles of homologous sequences. This strategy has two defining features that distinguish it from the more traditional direct physical study of specific model proteins. First, by averaging over the space of homologs, the statistical approach emphasizes the general constraints shared by many related proteins over those that are idiosyncratic to particular proteins. Second, by quantitatively examining the structure of correlations, the statistical approach provides models for the global pattern of cooperativity between amino acids. SCA adds an extra concept; by weighting correlations with a function of the evolutionary conservation of the underlying amino acids, this approach incorporates a measure of their functional relevance [56, 57]. Mathematical decomposition of the weighted coevolution matrix reveals an internal architecture for proteins in which the basic functional units are groups of amino acids called sectors. The sector architecture is consistent with two empirically known but poorly understood properties of proteins: (i) sparsity, such that only a fraction of the amino acids are functionally critical [21, 58], and (ii) distributed cooperativity, such that folding and function can depend on the coupled action of amino acids linking distantly positioned sites [59–61]. It has also revealed a previously unrecognized feature of proteins: modularity, such that multiple functionally distinct sectors are possible in a single protein domain [10].
A hierarchical model for sectors
Previous work has introduced the concept of sectors as quasi-independent units of protein structure that are associated with distinct functional properties [10], but has largely ignored their internal architecture. This work presents a more refined description in which a sector may itself be decomposed into a physically contiguous core element (e.g. IC3, Fig 9C), surrounded by peripheral elements (e.g. ICs 1 and 2, Fig 9C) that have the property of differential variation along functional branches of a protein family (Fig 7A). Thus, we propose a model that sectors are structural units of function and the ICs define patterns of variation within these units.
These observations also highlight the practical value of the mapping between positional correlations and sequence subfamilies. When functional divergences between subfamilies are annotated, the mapping can identify the positions responsible for this divergence. For example, in the Hsp70 family of chaperones, the existence of subfamilies with known differences in allosteric function led to the identification of positions involved in the underlying mechanism [17]. Turned around, when the role of specific positions in a protein is known, the mapping can help annotate sequences according to the associated functional property. For example, sequence divergence within sector positions with known function in the S1A family permitted classification of the sequence space according to that functional property [10]. In principle, high-throughput methods for functional annotation of members of a protein family should permit even more refined mappings between amino-acid variation and phylogenetic or functional divergence, a step towards relating genotype-to-phenotype at the molecular level.
Relationship to other methods
It is valuable to explain the similarities and distinctions of SCA with other analyses of coevolution in multiple sequence alignments. The direct coupling analysis (DCA [4]) and its various extensions [9, 62–64] are focused on using coevolution to determine physical contacts between amino acids within or between protein tertiary structures. As different as this problem may seem from discovering the pattern of functionally coevolving amino acids, there is a deep relationship. Recent work shows that the two approaches focus on two extremes of the same hierarchical architecture of coevolution [7, 8]. SCA focuses on the global modes of coevolution (the top eigenmodes of a conservation-weighted correlation matrix), and DCA on the minimal units of coevolution (the bottom eigenmodes of an unweighted correlation matrix). Thus, coevolving direct contacts are at one end of the hierarchy and sectors at the other. Consistent with this, coevolving direct contacts are found within sectors and outside of sectors, but not bridging two independent sectors [8]. Another approach for analyzing coevolution in protein alignments is mutual information, which has been successful at predicting the amino acids responsible for specificity in some protein-protein interactions [5]. The distinction between this method and SCA lies in the nature of the weighting function ϕ; in essence, the mutual information method uses flat positional weights (ϕ = 1), which has the effect of emphasizing more unconserved correlations and may therefore be more appropriate when studying rapidly diverging functional properties [40]. Taken together, these observations begin to clarify the relationship of the different approaches, and poses the question of the nature of physical information held at various levels of the hierarchy of coevolution, a matter for future experimentation. From a theoretical point of view, the observations highlight the need for a better, more unified framework representing the full hierarchy in amino acid correlations in proteins, a key next goal in advancing the statistical approach to the biology of proteins.
Conclusion
Sector analysis provides a representation of proteins that is distinct from the first-order analysis of positional conservation and that (so far) is not obtained from structure determination or functional mutagenesis. Thus, it provides a valuable tool for directing experimental studies of protein folding and function, and ultimately, for formulating a physical and evolutionary theory consistent with the design of natural proteins.
Materials and Methods
Multiple sequence alignments were obtained from previous work [10] or from the PFAM database (release 27.0, accession codes PF00071 (G proteins), PF00186 (DHFR), and PF13354 (class A β-lactamases)), and were subject to pre-processing with default parameter values as described in Box 1. Reference sequences/structures selected for each family were rat trypsin (PDB 3TGI), human Ras (PDBs 5P21 and 4Q21), E. coli DHFR (PDB 1RX2), and E. coli TEM-1 β-lactamase (PDB 1FQG), and with sub-sampling to the number of effective sequences, yielded the following final alignment statistics: S1A serine proteases (928 effective sequences by 205 positions), G proteins (3366 effective sequences by 158 positions), DHFR (1157 effective sequences by 151 positions), β-lactamase (497 effective sequences by 200 positions). All calculations were carried out using a new python implementation of the statistical coupling analysis (pySCA v6.2), following the algorithms described in Box 1 and in the main text. Step-by-step tutorials for executing the analysis for the four protein families are provided in the S3 Text and accompany the toolbox distribution. The pySCA toolbox is available for download through GitHub (https://github.com/reynoldsk/pySCA), and with online instructions at http://reynoldsk.github.io/pySCA.
Supporting Information
Acknowledgments
We thank J. Bloom, A. Murray, and members of the Ranganathan Lab for critical review of the manuscript and I. Junier and S. Leibler for valuable discussions.
Data Availability
All data and materials described in this paper are available from the GitHub software repository: https://github.com/reynoldsk/pySCA and http://reynoldsk.github.io/pySCA.
Funding Statement
This work was supported by Agence Nationale de la Recherche (ANR-10-PDOC-004-01, O.R.), the Gordon and Betty Moore Foundation (KAR), the National Institutes of Health (RO1EY018720-05, R.R.), the Robert A. Welch Foundation (I-1366, R.R.), and the Green Center for Systems Biology. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
References
- 1. Anfinsen CB. Principles that govern the folding of protein chains. Science. 1973. July;181(4096):223–30. 10.1126/science.181.4096.223 [DOI] [PubMed] [Google Scholar]
- 2. Lichtarge O, Bourne HR, Cohen FE. An evolutionary trace method defines binding surfaces common to protein families. J Mol Biol. 1996. March;257(2):342–58. 10.1006/jmbi.1996.0167 [DOI] [PubMed] [Google Scholar]
- 3. Lockless SW, Ranganathan R. Evolutionarily conserved pathways of energetic connectivity in protein families. Science. 1999. October;286(5438):295–9. 10.1126/science.286.5438.295 [DOI] [PubMed] [Google Scholar]
- 4. Morcos F, Pagnani A, Lunt B, Bertolino A, Marks DS, Sander C, et al. Direct-coupling analysis of residue coevolution captures native contacts across many protein families. Proc Natl Acad Sci U S A. 2011. December;108(49):E1293–301. 10.1073/pnas.1111471108 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5. Skerker JM, Perchuk BS, Siryaporn A, Lubin EA, Ashenberg O, Goulian M, et al. Rewiring the specificity of two-component signal transduction systems. Cell. 2008. June;133(6):1043–54. 10.1016/j.cell.2008.04.040 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6. Süel GM, Lockless SW, Wall MA, Ranganathan R. Evolutionarily conserved networks of residues mediate allosteric communication in proteins. Nat Struct Biol. 2003. January;10(1):59–69. 10.1038/nsb881 [DOI] [PubMed] [Google Scholar]
- 7. Cocco S, Monasson R, Weigt M. From principal component to direct coupling analysis of coevolution in proteins: low-eigenvalue modes are needed for structure prediction. PLoS Comput Biol. 2013;9(8):e1003176 10.1371/journal.pcbi.1003176 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8. Rivoire O. Elements of coevolution in biological sequences. Phys Rev Lett. 2013. April;110(17):178102 10.1103/PhysRevLett.110.178102 [DOI] [PubMed] [Google Scholar]
- 9. Marks DS, Hopf TA, Sander C. Protein structure prediction from sequence variation. Nat Biotechnol. 2012. November;30(11):1072–80. 10.1038/nbt.2419 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10. Halabi N, Rivoire O, Leibler S, Ranganathan R. Protein sectors: evolutionary units of three-dimensional structure. Cell. 2009. August;138(4):774–86. 10.1016/j.cell.2009.07.038 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11. Ferguson AD, Amezcua CA, Halabi NM, Chelliah Y, Rosen MK, Ranganathan R, et al. Signal transduction pathway of TonB-dependent transporters. Proc Natl Acad Sci U S A. 2007. January;104(2):513–8. 10.1073/pnas.0609887104 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12. Miño-Galaz GA. Allosteric Communication Pathways and Thermal Rectification in PDZ-2 Protein: A Computational Study. J Phys Chem B. 2015. May;119(20):6179–89. 10.1021/acs.jpcb.5b02228 [DOI] [PubMed] [Google Scholar]
- 13. Ota N, Agard DA. Intramolecular signaling pathways revealed by modeling anisotropic thermal diffusion. J Mol Biol. 2005. August;351(2):345–54. 10.1016/j.jmb.2005.05.043 [DOI] [PubMed] [Google Scholar]
- 14. Peterson FC, Penkert RR, Volkman BF, Prehoda KE. Cdc42 regulates the Par-6 PDZ domain through an allosteric CRIB-PDZ transition. Mol Cell. 2004. March;13(5):665–76. 10.1016/S1097-2765(04)00086-3 [DOI] [PubMed] [Google Scholar]
- 15. Lee SY, Banerjee A, MacKinnon R. Two separate interfaces between the voltage sensor and pore are required for the function of voltage-dependent K(+) channels. PLoS Biol. 2009. March;7(3):e47 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16. Reynolds KA, McLaughlin RN, Ranganathan R. Hot spots for allosteric regulation on protein surfaces. Cell. 2011. December;147(7):1564–75. 10.1016/j.cell.2011.10.049 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17. Smock RG, Rivoire O, Russ WP, Swain JF, Leibler S, Ranganathan R, et al. An interdomain sector mediating allostery in Hsp70 molecular chaperones. Mol Syst Biol. 2010. September;6:414 10.1038/msb.2010.65 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18. van den Bedem H, Fraser JS. Integrative, dynamic structural biology at atomic resolution–it’s about time. Nat Methods. 2015. April;12(4):307–18. 10.1038/nmeth.3324 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19. Shulman AI, Larson C, Mangelsdorf DJ, Ranganathan R. Structural determinants of allosteric ligand activation in RXR heterodimers. Cell. 2004. February;116(3):417–29. 10.1016/S0092-8674(04)00119-9 [DOI] [PubMed] [Google Scholar]
- 20. Smock RG, Gierasch LM. Sending signals dynamically. Science. 2009. April;324(5924):198–203. 10.1126/science.1169377 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21. McLaughlin RN Jr, Poelwijk FJ, Raman A, Gosal WS, Ranganathan R. The spatial architecture of protein function and adaptation. Nature. 2012. November;491(7422):138–42. 10.1038/nature11500 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22. Reynolds KA, Russ WP, Socolich M, Ranganathan R. Evolution-based design of proteins. Methods Enzymol. 2013;523:213–35. 10.1016/B978-0-12-394292-0.00010-2 [DOI] [PubMed] [Google Scholar]
- 23. Russ WP, Lowery DM, Mishra P, Yaffe MB, Ranganathan R. Natural-like function in artificial WW domains. Nature. 2005. September;437(7058):579–83. [DOI] [PubMed] [Google Scholar]
- 24. Socolich M, Lockless SW, Russ WP, Lee H, Gardner KH, Ranganathan R. Evolutionary information for specifying a protein fold. Nature. 2005. September;437(7058):512–8. [DOI] [PubMed] [Google Scholar]
- 25. Sprang SR. G protein mechanisms: insights from structural analysis. Annu Rev Biochem. 1997;66:639–78. [DOI] [PubMed] [Google Scholar]
- 26. Gilman AG. G proteins: transducers of receptor-generated signals. Annu Rev Biochem. 1987;56:615–49. 10.1146/annurev.bi.56.070187.003151 [DOI] [PubMed] [Google Scholar]
- 27. Hedstrom L. Trypsin: a case study in the structural determinants of enzyme specificity. Biol Chem. 1996;377(7–8):465–70. [PubMed] [Google Scholar]
- 28. Hedstrom L. Serine protease mechanism and specificity. Chem Rev. 2002. December;102(12):4501–24. 10.1021/cr000033x [DOI] [PubMed] [Google Scholar]
- 29. Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997. September;25(17):3389–402. 10.1093/nar/25.17.3389 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30. Pei J, Kim BH, Grishin NV. PROMALS3D: a tool for multiple protein sequence and structure alignments. Nucleic Acids Res. 2008. April;36(7):2295–300. 10.1093/nar/gkn072 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31. Bateman A, Coin L, Durbin R, Finn RD, Hollich V, Griffiths-Jones S, et al. The Pfam protein families database. Nucleic Acids Res. 2004. January;32(Database issue):D138–41. 10.1093/nar/gkh121 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32. Obermayer B, Levine E. Inverse Ising inference with correlated samples. New Journal of Physics. 2014;16(12):123017. [Google Scholar]
- 33. Olson CA, Wu NC, Sun R. A comprehensive biophysical description of pairwise epistasis throughout an entire protein domain. Curr Biol. 2014. November;24(22):2643–51. 10.1016/j.cub.2014.09.072 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34. Schreiber G, Fersht AR. Energetics of protein-protein interactions: analysis of the barnase-barstar interface by single mutations and double mutant cycles. J Mol Biol. 1995. April;248(2):478–86. [DOI] [PubMed] [Google Scholar]
- 35. Wells JA. Additivity of mutational effects in proteins. Biochemistry. 1990. September;29(37):8509–17. 10.1021/bi00489a001 [DOI] [PubMed] [Google Scholar]
- 36. Cover TM, Thomas JA. Elements of information theory. John Wiley & Sons; 2012. [Google Scholar]
- 37. Cygler M, Schrag JD, Sussman JL, Harel M, Silman I, Gentry MK, et al. Relationship between sequence conservation and three-dimensional structure in a large family of esterases, lipases, and related proteins. Protein Sci. 1993. March;2(3):366–82. 10.1002/pro.5560020309 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38. Glaser F, Pupko T, Paz I, Bell RE, Bechor-Shental D, Martz E, et al. ConSurf: identification of functional regions in proteins by surface-mapping of phylogenetic information. Bioinformatics. 2003. January;19(1):163–4. 10.1093/bioinformatics/19.1.163 [DOI] [PubMed] [Google Scholar]
- 39. Efron B, Efron B. The jackknife, the bootstrap and other resampling plans. vol. 38 SIAM; 1982. [Google Scholar]
- 40. Colwell LJ, Brenner MP, Murray AW. Conservation weighting functions enable covariance analyses to detect functionally important amino acids. PLoS One. 2014;9(11):e107723 10.1371/journal.pone.0107723 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 41. Bell AJ, Sejnowski TJ. An information-maximization approach to blind separation and blind deconvolution. Neural Comput. 1995. November;7(6):1129–59. 10.1162/neco.1995.7.6.1129 [DOI] [PubMed] [Google Scholar]
- 42. Hyvärinen A, Karhunen J, Oja E. Independent component analysis. vol. 46 John Wiley & Sons; 2004. [Google Scholar]
- 43. Casari G, Sander C, Valencia A. A method to predict functional residues in proteins. Nat Struct Biol. 1995. February;2(2):171–8. 10.1038/nsb0295-171 [DOI] [PubMed] [Google Scholar]
- 44. Lapedes AS, Giraud BG, Liu L, Stormo GD. Correlated mutations in models of protein sequences: phylogenetic and structural effects. Lecture Notes-Monograph Series. 1999;8:236–256. 10.1214/lnms/1215455556 [DOI] [Google Scholar]
- 45. Stiffler MA, Hekstra DR, Ranganathan R. Evolvability as a function of purifying selection in TEM-1?-lactamase. Cell. 2015. February;160(5):882–92. 10.1016/j.cell.2015.01.035 [DOI] [PubMed] [Google Scholar]
- 46. Teşileanu T, Colwell LJ, Leibler S. Protein sectors: statistical coupling analysis versus conservation. PLoS Comput Biol. 2015. February;11(2):e1004091 10.1371/journal.pcbi.1004091 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 47. Poelwijk FJ, Krishna V, Ranganathan R. The context-dependence of mutations: a linkage of formalisms. arXiv preprint arXiv:150200726 2015;. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 48. Pai EF, Krengel U, Petsko GA, Goody RS, Kabsch W, Wittinghofer A. Refined crystal structure of the triphosphate conformation of H-ras p21 at 1.35 A resolution: implications for the mechanism of GTP hydrolysis. EMBO J. 1990. August;9(8):2351–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 49. Milburn MV, Tong L, deVos AM, Brünger A, Yamaizumi Z, Nishimura S, et al. Molecular switch for signal transduction: structural differences between active and inactive forms of protooncogenic ras proteins. Science. 1990. February;247(4945):939–45. [DOI] [PubMed] [Google Scholar]
- 50. Hatley ME, Lockless SW, Gibson SK, Gilman AG, Ranganathan R. Allosteric determinants in guanine nucleotide-binding proteins. Proc Natl Acad Sci U S A. 2003. November;100(24):14445–50. 10.1073/pnas.1835919100 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 51. Schnell JR, Dyson HJ, Wright PE. Structure, dynamics, and catalytic function of dihydrofolate reductase. Annu Rev Biophys Biomol Struct. 2004;33:119–40. 10.1146/annurev.biophys.33.110502.133613 [DOI] [PubMed] [Google Scholar]
- 52. Salverda MLM, De Visser JAGM, Barlow M. Natural evolution of TEM-1 ?-lactamase: experimental reconstruction and clinical relevance. FEMS Microbiol Rev. 2010. November;34(6):1015–36. 10.1111/j.1574-6976.2010.00222.x [DOI] [PubMed] [Google Scholar]
- 53. Morrison G, Mahadevan L. Asymmetric network connectivity using weighted harmonic averages. EPL (Europhysics Letters). 2011;93(4):40002. [Google Scholar]
- 54. Ma HW, Zhao XM, Yuan YJ, Zeng AP. Decomposition of metabolic network into functional modules based on the global connectivity structure of reaction graph. Bioinformatics. 2004. August;20(12):1870–6. 10.1093/bioinformatics/bth167 [DOI] [PubMed] [Google Scholar]
- 55. Morrison G, Mahadevan L. Discovering communities through friendship. PLoS One. 2012;7(7):e38704 10.1371/journal.pone.0038704 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 56. Capra JA, Singh M. Predicting functionally important residues from sequence conservation. Bioinformatics. 2007. August;23(15):1875–82. 10.1093/bioinformatics/btm270 [DOI] [PubMed] [Google Scholar]
- 57. Ng PC, Henikoff S. Predicting the effects of amino acid substitutions on protein function. Annu Rev Genomics Hum Genet. 2006;7:61–80. 10.1146/annurev.genom.7.080505.115630 [DOI] [PubMed] [Google Scholar]
- 58. Bowie JU, Reidhaar-Olson JF, Lim WA, Sauer RT. Deciphering the message in protein sequences: tolerance to amino acid substitutions. Science. 1990. March;247(4948):1306–10. 10.1126/science.2315699 [DOI] [PubMed] [Google Scholar]
- 59. Horovitz A, Fersht AR. Co-operative interactions during protein folding. J Mol Biol. 1992. April;224(3):733–40. 10.1016/0022-2836(92)90557-Z [DOI] [PubMed] [Google Scholar]
- 60. LiCata VJ, Ackers GK. Long-range, small magnitude nonadditivity of mutational effects in proteins. Biochemistry. 1995. March;34(10):3133–9. 10.1021/bi00010a001 [DOI] [PubMed] [Google Scholar]
- 61. Luque I, Leavitt SA, Freire E. The linkage between protein folding and functional cooperativity: two sides of the same coin? Annu Rev Biophys Biomol Struct. 2002;31:235–56. 10.1146/annurev.biophys.31.082901.134215 [DOI] [PubMed] [Google Scholar]
- 62. Kamisetty H, Ovchinnikov S, Baker D. Assessing the utility of coevolution-based residue-residue contact predictions in a sequence- and structure-rich era. Proc Natl Acad Sci U S A. 2013. September;110(39):15674–9. 10.1073/pnas.1314045110 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 63. Ovchinnikov S, Kamisetty H, Baker D. Robust and accurate prediction of residue-residue interactions across protein interfaces using evolutionary information. Elife. 2014;3:e02030 10.7554/eLife.02030 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 64. Hopf TA, Colwell LJ, Sheridan R, Rost B, Sander C, Marks DS. Three-dimensional structures of membrane proteins from genomic sequencing. Cell. 2012. June;149(7):1607–21. 10.1016/j.cell.2012.04.012 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 65. Bhabha G, Ekiert DC, Jennewein M, Zmasek CM, Tuttle LM, Kroon G, et al. Divergent evolution of protein conformational dynamics in dihydrofolate reductase. Nat Struct Mol Biol. 2013. November;20(11):1243–9. 10.1038/nsmb.2676 [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Data Availability Statement
All data and materials described in this paper are available from the GitHub software repository: https://github.com/reynoldsk/pySCA and http://reynoldsk.github.io/pySCA.