Abstract
The rapid increase of “-omics” data warrants the reconsideration of experimental strategies to investigate general protein function. Studying individual members of a protein family is likely insufficient to provide a complete mechanistic understanding of family functions, especially for diverse families with thousands of known members. Strategies that exploit large amounts of available amino acid sequence data can inspire and guide biochemical experiments, generating broadly applicable insights into a given family. Here we review several methods that utilize abundant sequence data to focus experimental efforts and identify features truly representative of a protein family or domain. First, coevolutionary relationships between residues within primary sequences can be successfully exploited to identify structurally and/or functionally important positions for experimental investigation. Second, functionally important variable residue positions typically occupy a limited sequence space, a property useful for guiding biochemical characterization of the effects of the most physiologically and evolutionarily relevant amino acids. Third, amino acid sequence variation within domains shared between different protein families can be used to sort a particular domain into multiple subtypes, inspiring further experimental designs. Although generally applicable to any kind of protein domain because they depend solely on amino acid sequences, the second and third approaches are reviewed in detail because they appear to have been used infrequently and offer immediate opportunities for new advances. Finally, we speculate that future technologies capable of analyzing and manipulating conserved and variable aspects of the three-dimensional structures of a protein family could lead to broad insights not attainable by current methods.
General strategies are needed to explore functional variation within protein domains
The rapidly increasing availability of “-omics” data has revolutionized the practice of science and medicine in many ways in recent decades. Here we focus on a few strategies that capitalize on the large amount of amino acid sequence data available to inspire and guide biochemical experiments designed to yield insights broadly applicable beyond a single protein.
Domains, which are typically defined by sequence similarity, can be considered the “building blocks” of proteins. Each type of domain has a characteristic three-dimensional structure as a result of its characteristic sequence and can also be classified by structural methods [1, 2]. Approximately 20,000 different types of protein domains are currently known [3]. We use the term “architecture” to reference the ordered series of distinct domains that defines a given protein family. Multiple sequence alignments can be used to identify evolutionarily conserved residues within a domain, which have high probabilities of being important for structure, catalysis, binding, etc. Perturbing residues that exhibit various degrees of conservation and then experimentally assessing impact on phenotype has long been used to deduce common functions shared by all domain/family members.
Although primary sequence dictates the properties of a protein, predicting detailed functional characteristics from sequence data alone is a challenging and complex task. Specific (known) protein domains can be detected relatively easily from primary amino acid sequence on the basis of residue identity and positional conservation [3, 4]. Variable regions can dictate substantial differences in functional characteristics between distinct instances of a given domain, often between contexts within different protein architectures. Because many of the more obvious relationships between sequence, structure, and function have been explored experimentally, newer strategies are needed both to investigate the more complex and subtle associations that remain to be established, and to bridge the ever-increasing gap between the number of known proteins and the number that reasonably can be studied experimentally. Additionally, a complete mechanistic understanding of how a domain and/or protein family functions likely will not be achieved by studying individual cases. Rather, a generalized and agnostic approach will be needed to identify properties applicable to all members of a given subgroup. Fortunately, the surfeit of available sequence data provides numerous opportunities to address the problem. We review here several widely applicable (although somewhat underutilized) approaches for using amino acid sequence data to guide the design of future experiments, ultimately generating results that can be extrapolated to most members of a given domain or protein family:
(i) Coevolutionary analyses of aligned amino acid sequences can reveal relationships between structurally and functionally important residue positions.
(ii) Natural amino acid abundances at key variable positions can define physiologically relevant amino acids to test experimentally in the context of a common protein backbone.
(iii) High resolution clustering and dimensionality reduction techniques using amino acid sequences can reveal if a domain exhibits distinct sub-types in the contexts of different parent architectures, which can then be explored in greater detail.
A related but distinct use of extensive amino acid sequence data is for protein engineering and design, reviewed in [5] and not considered further here.
Inferring functional relevance by using amino acid sequence alignments to explore relationships between residue pairs
The Neutral Theory of Molecular Evolution contends that the functional importance of an individual residue position within a protein is inversely correlated with the speed at which that position varies over evolutionary time [6]. Despite the somewhat contentious nature of the theory within the field of evolutionary biology, critical positions in a protein family/domain are often highly conserved (invariant) and identifiable during even the most cursory of sequence analyses. Their invariant natures imply that residue changes are poorly tolerated at the most essential positions [7, 8]. Because molecular evolution is constrained by protein structure and function, invariant residues are often catalytically important and/or serve key structural roles.
In the most basic form of conservation analysis, amino acid frequencies at individual positions in an alignment are estimated, typically under the assumption that each position is independent of residues at other sites. However, protein structure and function depend on a highly complex network of interactions between multiple constituent residues. Scientists have long proposed using coevolutionary relationships between positional pairs to predict residue interactions and/or contacts for the purposes of tertiary structure prediction [7, 9–13]. The premise is relatively straightforward. Imagine two amino acids that form a structurally important (or energetically favorable) interaction within a protein. If one residue is perturbed, perhaps as a result of a random mutation, then in many cases a compensatory change at the other position will restore the interaction, thereby altering the evolutionary trajectory of both residues [14]. The coevolutionary coupling between residue pairs can be statistically quantified. Conceptually, this is the “evolutionary genetics” equivalent of selecting for allele-specific suppressor mutations and is sometimes referred to colloquially as “covariation” analysis. Numerous mathematically distinct techniques have been developed using the principle of covarying residues for various purposes (to predict structural/functional residues, to detect interaction contacts, etc.), including mutual information (MI [15]; and its successor, direct coupling analysis or DCA [16]), sparse inverse covariance estimation (PSICOV) [17], and statistical coupling analysis (SCA) [18], among many others [19–25] (reviewed extensively in [26, 27]).
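As a minimal illustration of how coupling between two alignment positions can be quantified, the sketch below computes plain mutual information between two columns of a toy alignment. The sequences are hypothetical, the function names are ours, and no correction for phylogenetic bias or finite sample size is applied, so this is far simpler than the published methods cited above.

```python
import math
from collections import Counter


def column(alignment, index):
    """Extract one alignment column as a string."""
    return "".join(seq[index] for seq in alignment)


def mutual_information(col_i, col_j):
    """Mutual information (in bits) between two equal-length alignment columns."""
    n = len(col_i)
    counts_i = Counter(col_i)
    counts_j = Counter(col_j)
    counts_ij = Counter(zip(col_i, col_j))
    mi = 0.0
    for (a, b), count in counts_ij.items():
        p_ab = count / n
        p_a = counts_i[a] / n
        p_b = counts_j[b] / n
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi


# Hypothetical toy alignment: column 0 is invariant, columns 1 and 3
# covary perfectly (D pairs with L, E pairs with V), and column 2 varies
# independently of column 1.
toy_alignment = [
    "ADKL",
    "ADRL",
    "AEKV",
    "AERV",
    "ADKL",
    "AEKV",
]

mi_covarying = mutual_information(column(toy_alignment, 1), column(toy_alignment, 3))
mi_independent = mutual_information(column(toy_alignment, 1), column(toy_alignment, 2))
mi_invariant = mutual_information(column(toy_alignment, 0), column(toy_alignment, 1))
```

Note that an invariant column yields zero mutual information with every other column, which is why conservation and covariation analyses provide complementary information.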
One of the major challenges encountered when searching for coevolutionary relationships using primary sequence alignments is a concept known as phylogenetic bias. If some species are overrepresented in a given alignment, then the corresponding sequences are not adequately independent representatives of a relevant protein/domain family, which violates a key assumption of many coevolutionary-based modeling techniques. The effect is strongly dependent on the input sequence database, and such biases can introduce substantial “false” couplings between alignment positions that lack true functional or coevolutionary relationship(s) within the input sequences [28]. Numerous attempts to correct for phylogenetic noise have been proposed to minimize the influence of closely related entries in the starting sequence alignment, with varying levels of success [16, 29–32]. Increasing availability of suitable input sequence data (i.e., less redundant and/or more representative data sets) should diminish the problem. However, efforts are ongoing to develop methods for disentangling relevant signals from phylogenetic noise, and it is still unclear just how much covariation can tell us about true molecular coevolution [33]. Nevertheless, the considerable past successes of using covariation/coevolutionary analysis to predict tertiary structure and to identify important functional residues suggest that current methods can serve as valuable tools to suggest leads for experimental studies that can be prioritized based on other information (e.g., ignore hydrophobic core residues when looking for active site residues).
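One widely used heuristic for reducing the influence of closely related entries is to down-weight each sequence by the number of near-identical sequences in the alignment. The sketch below illustrates the idea; the 80% identity threshold, the toy sequences, and the function names are assumptions chosen for illustration and are not taken from any specific method cited above.

```python
def fractional_identity(seq_a, seq_b):
    """Fraction of aligned positions at which two sequences are identical."""
    matches = sum(1 for a, b in zip(seq_a, seq_b) if a == b)
    return matches / len(seq_a)


def sequence_weights(alignment, identity_threshold=0.8):
    """Weight each sequence by 1 / (number of sequences at or above the
    identity threshold, counting the sequence itself)."""
    weights = []
    for seq in alignment:
        neighbors = sum(
            1
            for other in alignment
            if fractional_identity(seq, other) >= identity_threshold
        )
        weights.append(1.0 / neighbors)
    return weights


# Hypothetical alignment: three near-identical sequences (as if from
# closely related species) plus one divergent sequence.
toy_alignment = [
    "ACDEFGHIKL",
    "ACDEFGHIKL",
    "ACDEFGHIKV",
    "MNPQRSTVWY",
]

weights = sequence_weights(toy_alignment)
effective_sequence_count = sum(weights)
```

In this toy case the three redundant sequences together contribute about as much as the single divergent one, so the effective number of sequences is roughly two rather than four.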
Although the most common application of coevolutionary analysis is arguably in protein structure prediction [34] (reviewed in [35]), positional covariation within a multiple sequence alignment can and has also been used to identify functional/catalytic or ligand/partner binding residues [18, 36, 37] and even predict the phenotypic effects of certain mutations [38, 39].
Multiple studies have used covariation analysis to identify direct interactions between amino acids at protein/protein binding interfaces [40, 41]. For example, bacterial species often encode multiple two-component regulatory systems, each with sensor kinase and response regulator proteins that respond to unique stimuli but share common phosphotransfer mechanisms. To ensure the appropriate response to a given stimulus, the systems must be insulated from each other, so phosphoryl groups are transferred from sensor kinases to partner rather than non-partner response regulators. Coevolutionary analyses have identified residues that determine specificity and allowed rewiring of sensor kinase/response regulator pairs [42–47] and toxin/antitoxin pairs [48], or provided independent support for structural models of domain interactions [49]. Another recent use combined covariation analysis and molecular dynamics simulations to identify binding specificity among clustered protocadherin isoforms [50].
Conceptually analogous techniques can be used to infer previously unknown protein-protein interactions. Phyletic Direct Coupling Analysis (PhyDCA) detects co-occurrence patterns of domains or proteins across genomes instead of correlations between amino acids at particular positions within a protein [51] and was used to link specific architectures of CheA kinases to various types of prokaryotic chemotaxis signal transduction networks [52]. Using a similar principle, evolutionary rate covariation (ERC) posits that genes that coevolve at similar rates across genomes may encode proteins that interact [53]. ERC was used to predict and then experimentally confirm that a particular zinc transporter localizes to mitochondria [54]. Viruses contain relatively few genes, which mutate at a relatively high rate, and yet viral variants retain function. A recent study examined covariation across the entire proteome of each of five viral species and found that covariation was primarily intra-protein in some viruses and inter-protein in others [55]. Most covariation analyses to date have examined homologous proteins across species, rather than all proteins within a species as was done here, raising the possibility of new applications of covariation analysis.
Another innovative recent application of covariation analysis was to infer interaction residues for specific binding of small molecules. Quorum sensing systems of Gram-negative bacteria support communication via enzymes that synthesize acyl-homoserine lactones (AHL) and receptors that bind AHLs. The acyl chains of the AHL signaling molecules vary substantially in length and chemical modification between systems/organisms, allowing for discrimination between kin and stranger in mixed microbial communities. Because the synthase and receptor of a given system both interact with the same AHL, examining covariation across many pairs of synthase and receptor proteins identified residues responsible for ligand specificity in each [56]. Proteins with altered synthetic and response capabilities were then successfully engineered.
Prior efforts have also used sequence covariation to map out complex allosteric networks within proteins (e.g., [18]). A recent study focused on the DNA mismatch repair protein MutS, which contains two spatially distant but closely coupled active sites involved in DNA mismatch-binding and ATPase activity. SCA suggested that a subtle but functionally important web of residues allosterically links the two active sites. The candidate network was then probed computationally using molecular dynamics of MutS variants with alanine substitutions to generate experimentally testable hypotheses [57].
High throughput experimental technologies can provide a different perspective. A recent study used deep mutational scanning to find all possible amino acid substitutions that restore function following disruption of specific binding interactions involving covarying residues in a toxin-antitoxin pair. The scan identified many non-allele-specific suppressors involving non-contacting positions that were not identified by covariation analysis [58]. Another study measured rate constants for hydrolysis of ~1,300 different peptide substrates identified by phage display by each of eight different matrix metalloproteases [59]. Analysis of correlations between enzyme sequence and substrate preference identified about 50 variable residues in the proteases that collectively determine specificity. Changing four of these residues was sufficient to swap substrate specificity between two proteases. Both studies suggest that important functional interactions can involve multiple residues working together collectively. Therefore, potentially valuable future efforts might focus on the development of more robust methods to analyze and interpret covariation/coevolution of three or more positions, rather than pairs of residues, or even to apply the principles of graph theory as a way of visualizing higher order and/or indirect coevolutionary relationships using network analysis. Several studies focus on this line of inquiry [29, 37, 60–65].
Exploiting the restricted sequence space of functionally important variable positions
In the familiar paradigm of investigating one member of a protein family, genetic experiments can assess the physiological relevance of biochemical or structural results. A different approach is required when investigating an entire protein family and evolutionary genetics replace laboratory genetics: The results of evolution distinguish variable from conserved residues. Candidates for functionally important variable positions can be identified by inspection of protein structures, coevolutionary analyses, genetic methods (including screens such as alanine-scanning mutagenesis or selections in a tractable experimental system), etc. Variable positions that form networks of interacting residues within a protein can also be identified using more sophisticated computational methods [66, 67]. Finally, the functional consequences of amino acid substitutions at selected variable positions are tested in a few model proteins, with substitutions chosen to be representative of the entire protein family. Physiological relevance is incorporated into experimental design by using the results of evolution to identify the naturally most abundant amino acids or combinations of amino acids at variable positions. Basing substitutions on naturally occurring frequencies makes surveying the key features of the entire family (in the context of a few protein backbones) experimentally feasible and optimizes the utility of the resulting data. By focusing on variable residues, a virtuous cycle could be created in which sequence information motivates and informs experimental design. Experimental results in turn enhance sequence annotation and offer the ability to predict properties of uncharacterized family members from inspection of primary amino acid sequence alone. A robust predictive capability would mitigate our inability to experimentally investigate the rapidly increasing number of proteins discovered by genome sequencing.
We illustrate such a strategy using an investigation of receiver domains (PF00072), the ninth most abundant protein domain in nature [3]. Receiver domains contain five conserved active site residues (Figure 1) that catalyze autophosphorylation and autodephosphorylation reactions. Rate constants reported for wild-type receiver domains span five to six orders of magnitude for autodephosphorylation [68, 69] and two orders of magnitude for autophosphorylation (limited, as only two wild-type cases have been experimentally measured [70, 71]). Variable residues were sought that could substantially affect reaction kinetics. To facilitate the identification of equivalent positions in different members of the same family, variable positions are named with respect to the landmarks of conserved residues, with − and + indicating the number of positions in the N- or C-terminal direction respectively (Figure 1).
Figure 1. Receiver domain structure.
The five conserved residues that catalyze receiver domain autophosphorylation and autodephosphorylation reactions are shown in green. D is the site of phosphorylation. DD coordinate (orange dashed lines) the divalent metal ion, shown in yellow. K, T, and the metal ion each bind (red dashed lines) one of the phosphoryl group oxygen atoms. A stable, non-covalently bound BeF₃⁻ mimic of the PO₃²⁻ phosphoryl group is shown in cyan and yellow. Five variable residues known to affect reaction kinetics are shown in blue, with positions named in relation to the conserved residues. Black arrow indicates the required path of attack by phosphodonor or water molecule in line with P–O bond to be formed or broken respectively [100]. Based on 1FQW structure of E. coli CheY [101].
Comparison of the receiver domain structure with the amino acid composition at position T+1 (53% Ala, 22% Gly, 10% Ser, 7% Thr) led to the hypothesis that small amino acids predominate to allow steric access to the phosphorylation site [72]. The effects of various substitutions on the rate constants for autophosphorylation and autodephosphorylation of Escherichia coli CheY are consistent with the hypothesis (Figure 2A). Specifically, substitutions affect both reactions similarly, large or branched residues inhibit both reactions, and substitution effects are phosphodonor specific, suggesting direct interaction of sidechain with substrate [73]. Furthermore, T+1 substitutions have similar effects on autodephosphorylation in three different receiver domains.
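The composition at a variable position named relative to a conserved landmark (e.g., T+1) can be tabulated directly from an alignment. The following is a minimal sketch under stated assumptions: the toy sequences, the landmark column index, and the function name are all hypothetical, and gaps are simply ignored.

```python
from collections import Counter


def composition_at_offset(alignment, landmark_column, offset):
    """Fractional amino acid composition at the alignment column located
    `offset` positions from a conserved landmark column (positive offsets
    are C-terminal, negative offsets are N-terminal). Gaps are ignored."""
    column_index = landmark_column + offset
    residues = [seq[column_index] for seq in alignment if seq[column_index] != "-"]
    counts = Counter(residues)
    total = len(residues)
    return {aa: count / total for aa, count in counts.most_common()}


# Hypothetical toy alignment in which the conserved landmark residue (T)
# occupies column 2, so "T+1" corresponds to column 3.
toy_alignment = [
    "GSTAV",
    "GSTAL",
    "GSTGV",
    "GSTSI",
    "GST-V",
]

t_plus_1 = composition_at_offset(toy_alignment, landmark_column=2, offset=1)
```

Applied to a real receiver domain alignment, a tabulation of this kind is what yields figures such as the 53% Ala, 22% Gly, 10% Ser, 7% Thr distribution quoted above.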
Figure 2. CheY autophosphorylation and autodephosphorylation rate constants as a function of substitution position.
Rate constants of E. coli CheY substitution mutants are plotted for autophosphorylation with phosphoramidate (k_phos/K_S) versus autodephosphorylation with water (k_dephos). Note the logarithmic scales on both axes. Red square is wild-type CheY, with NAEPF (single letter amino acid codes, N- to C-terminal) composition for the five variable residues in Figure 1. Intersection of dashed lines indicates rate constants supported by the most abundant (~11%) combinations of five variable residues (MAKPF, MARPF, shown in Panel B) in prokaryotic receiver domains spliced onto the CheY backbone. (A) Substitutions at T+1 (aqua triangles). (B) Substitutions at D+2 (black diamonds), T+2 (brown squares), or both (blue triangles). (C) Substitutions at K+1 (black circles), K+2 (blue diamonds), or both (green triangles). Data from [71, 73–76].
Variable positions D+2 and T+2 surround the path of nucleophilic attack (Figure 1). There are 20² = 400 possible combinations of amino acids at these two positions, but 20 combinations (5% of the total) account for 50% of the combinations observed in nature. Physiologically relevant D+2 and/or T+2 substitutions affect the two reactions inversely (Figure 2B), likely due to opposing effects of hydrophobic and hydrophilic residues [71, 74, 75]. D+2 and T+2 substitutions have similar effects on autodephosphorylation in three different receiver domains.
A chain of high covariation scores links residues T+2, D+2, T+1, K+1, and K+2, suggesting positions K+1 and K+2 can also affect reaction kinetics. Position K+1 is Pro in 82% of prokaryotic receiver domains and position K+2 is Phe, Val, Ile, or Leu in 73%. K+1 and/or K+2 substitutions affect CheY autophosphorylation reactions much more than autodephosphorylation reactions (Figure 2C) [76]. The effects are independent of phosphodonor type, suggesting no direct interaction between K+1/K+2 sidechains and substrate. Instead, the kinetic effects are imposed via conformational equilibria. Receiver domains efficiently catalyze reactions only when in an active conformation. K+1 and K+2 substitutions are thought to perturb the equilibria of the non-phosphorylated protein between inactive and active conformations, substantially affecting the autophosphorylation reaction. In contrast, phosphorylation stabilizes active conformations for autodephosphorylation, so the substitutions can have only limited effects on dephosphorylation reaction rate.
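A chain of covarying positions can be recovered by treating high-scoring pairs as edges of a graph and collecting every position reachable from a starting residue. The sketch below assumes pairwise scores are already available; the scores, the threshold, the pairing of positions, and the unrelated position "X" are all hypothetical values invented for illustration and do not reproduce the published analysis.

```python
def connected_positions(pair_scores, threshold, start):
    """Return all positions reachable from `start` through pairs whose
    covariation score is at least `threshold` (depth-first traversal)."""
    neighbors = {}
    for (pos_a, pos_b), score in pair_scores.items():
        if score >= threshold:
            neighbors.setdefault(pos_a, set()).add(pos_b)
            neighbors.setdefault(pos_b, set()).add(pos_a)
    seen = {start}
    stack = [start]
    while stack:
        current = stack.pop()
        for nxt in neighbors.get(current, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


# Hypothetical scores; "X" stands for an unrelated position with only a
# weak link to the chain.
hypothetical_scores = {
    ("T+2", "D+2"): 0.90,
    ("D+2", "T+1"): 0.80,
    ("T+1", "K+1"): 0.70,
    ("K+1", "K+2"): 0.85,
    ("D+2", "X"): 0.10,
}

chain = connected_positions(hypothetical_scores, threshold=0.5, start="T+2")
```

Positions K+1 and K+2 appear in the result even though neither pairs directly with T+2, which mirrors the indirect reasoning used to nominate them as candidates.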
Substitutions at a single receiver domain position typically affect reaction rate constants by about an order of magnitude, but combinations of positions can have greater effects. The sparse occupancy of sequence space by naturally occurring proteins [43, 77, 78] makes experimental investigation of combinatorial effects feasible. For the five variable positions explored here, there are 20⁵ = 3.2 million possible combinations, but the top 100 encompass 40% of all naturally occurring combinations.
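The arithmetic behind sparse occupancy of sequence space can be sketched as follows: count the theoretically possible combinations, then ask what fraction of observed sequences is covered by the most abundant combinations. The observed list below is a hypothetical toy sample (only the combination strings MAKPF, MARPF, and NAEPF appear in the text; their counts here are invented), and the function names are ours.

```python
from collections import Counter

STANDARD_AMINO_ACIDS = 20


def theoretical_combinations(n_positions):
    """Number of possible amino acid combinations at n variable positions."""
    return STANDARD_AMINO_ACIDS ** n_positions


def coverage_of_top_combinations(observed, top_n):
    """Fraction of observed sequences accounted for by the top_n most
    abundant residue combinations."""
    counts = Counter(observed)
    covered = sum(count for _, count in counts.most_common(top_n))
    return covered / len(observed)


# Hypothetical sample of five-residue combinations (counts are invented).
observed_combinations = (
    ["MAKPF"] * 4 + ["MARPF"] * 3 + ["NAEPF"] * 2 + ["QSKPV"] * 1
)

two_position_space = theoretical_combinations(2)   # 400
five_position_space = theoretical_combinations(5)  # 3,200,000
top_two_coverage = coverage_of_top_combinations(observed_combinations, top_n=2)
```

Ranking combinations by natural abundance in this way gives a principled shortlist of substitutions to test in a common protein backbone.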
An analogous study identified a variable position near the active site of M1 family aminopeptidases [79]. Substitutions that reflected the diverse amino acids found at this position in wild-type family members were made in the context of two different aminopeptidases. Characterization of the mutant proteins revealed a two order of magnitude range in reaction rate and substrate binding constants, as well as substantial effects on substrate specificity. The variable position allows evolutionary specialization or fine tuning of different aminopeptidases for different physiological circumstances.
A conceptually related approach first used crystal structures to identify residues in the ATP binding site of protein kinases [80]. The amino acid composition at each position was then estimated using sequences for all known human protein kinases. The most variable binding residue positions were then exploited to design enzyme inhibitors with specificity for different kinases.
Categorizing the functional heterogeneity of specific protein domains
There is a long history of employing computational strategies to group protein superfamilies into progressively more granular functional categories (surveyed in [81, 82]). Such approaches have often analyzed entire protein sequences, which may at lower levels of resolution include proteins with very different architectures. Some classification methods rely heavily on structural information [83], particularly active site profiles [84, 85]. One recent example involves the CATH database, which utilizes available protein structures to group similar domains into evolutionarily related superfamilies, providing further structural and functional information. The CATH superfamilies can be further subdivided into putative functional families (FunFams; presumably domains sharing common functions) using additional sequence analysis and by predicting specificity determining residues for each group [86]. Another approach, known as genomic enzymology, integrates amino acid sequence comparisons with genomic context, i.e. the organization in prokaryotes of genes encoding related functions into operons [87]. Further efforts have even used techniques from the field of machine learning to improve functional classification of proteins, revealing relationships not previously detected by other less sophisticated methods [88, 89].
Evolutionary analyses typically involve grouping together items with similar characteristics. The choice of which items belong in which groups is somewhat subjective and depends on the desired level of resolution for a particular purpose. Current definitions of protein domains in databases such as Pfam [3] or SMART [4] are useful because they separate modular elements of evolution and capture gross sequence (and by extension structural) features. Experimentally established properties of a protein domain can often be extrapolated to predict with reasonable accuracy properties of the same domain within an uncharacterized parent protein. However, selective pressure does not stop at the protein domain level, and the same domain may diverge in the context of different proteins [90–92]. Therefore, methods that can split individual representatives of a protein domain into different categories with potentially different functional characteristics are valuable and necessary. Such methods could then be used to discern when known domain properties are applicable to uncharacterized examples of the same domain and to establish experimental priorities by identifying subgroups that have not yet been experimentally investigated, among other applications.
We recently developed a conceptually simple but potentially powerful strategy for this purpose [93]. By focusing on individual domains, our approach is complementary to the approaches for categorizing whole proteins summarized at the start of this section. Our method is based solely on amino acid sequences and is therefore generally applicable, including for domains that lack determined structures, known functions, and/or active sites. We illustrate the steps of analysis with CheW-like domains (PF01584), scaffolding entities that lack any known active site. For clarity, we use italicized terms to keep track of the classification stages.
(i) Sufficient Architectures (i.e., ordered series of domains) containing the domain in question are chosen to represent most of the natural occurrences of the domain. Analysis is simplified by the fact that far fewer domain combinations exist in natural proteins than are theoretically possible [94]. For example, 16 Architectures (out of >100 known) contain ~95% of CheW-like domains.
(ii) Amino acid sequences for the desired domain are extracted from all proteins of the chosen Architectures in a Representative Proteome [95] data set selected to yield a total number of sequences that is suitable for computational analysis. We used ~8,500 sequences of CheW-like domains from ~8,000 proteins.
(iii) Each occurrence of the domain in an Architecture (there may be more than one) is considered to be a Context. For example, the bottom Architecture in Figure 3 has two CheW-like domains and hence two Contexts. Each Context may be subject to unique evolutionary pressures.
(iv) To assess the underlying structure of the sequence data, the domain sequences are first partitioned into Clusters. Clustering is unsupervised and simply seeks to minimize the distance in multidimensional space between the sequences in a Cluster and the center of the Cluster (termed its k-medoid), while maximizing the distance between k-medoids. Because the sequences that can satisfy the functional requirements for each Context might be distinct, sequences are initially sorted into the same number of Clusters as there are Contexts.
(v) The sequences in each Cluster are then tabulated according to architectural Context. In the case of CheW-like domains, there was not a one-to-one correspondence between sequences in a Context and a Cluster as might be expected if every Context were functionally distinct. Although a given Context may contain sequences belonging to multiple Clusters, most sequences for a CheW-like Context belong to Clusters with similar properties (i.e., the k-medoids representing each Cluster are close to one another when viewed by multi-dimensional scaling). If a Context includes sequences belonging to very different Clusters, then the Context is divided accordingly into multiple Types. For CheW-like domains, this only happened for CheW proteins (top Architecture in Figure 3). Otherwise, each Context is simply renamed as a Type.
(vi) Finally, random subsets of sequences from each Type are repeatedly chosen and viewed using multi-dimensional scaling to group Types with similar properties into consistent Classes.
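Steps (iv) and (v) can be sketched with a small, self-contained example. The code below is not the published implementation: it assumes a simple Hamming distance between equal-length aligned sequences, uses a basic alternating k-medoids procedure with a deterministic farthest-point initialization, and runs on hypothetical toy sequences with placeholder Context labels.

```python
from collections import Counter, defaultdict


def hamming(seq_a, seq_b):
    """Number of aligned positions at which two sequences differ."""
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)


def k_medoids(sequences, k, max_iterations=100):
    """Basic alternating k-medoids. Returns (medoid_indices, labels)."""
    n = len(sequences)
    dist = [[hamming(sequences[i], sequences[j]) for j in range(n)] for i in range(n)]

    def assign(medoids):
        return [min(range(k), key=lambda c: dist[i][medoids[c]]) for i in range(n)]

    # Deterministic farthest-point initialization (an illustrative choice).
    medoids = [0]
    while len(medoids) < k:
        medoids.append(max(range(n), key=lambda i: min(dist[i][m] for m in medoids)))

    for _ in range(max_iterations):
        labels = assign(medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:  # keep the old medoid if a Cluster empties
                new_medoids.append(medoids[c])
                continue
            # The medoid is the member with the smallest total distance
            # to the other members of its Cluster.
            new_medoids.append(min(members, key=lambda m: sum(dist[m][j] for j in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(medoids)


def tabulate_by_context(labels, contexts):
    """Count how many sequences of each Context fall into each Cluster."""
    table = defaultdict(Counter)
    for label, context in zip(labels, contexts):
        table[context][label] += 1
    return {context: dict(counts) for context, counts in table.items()}


# Hypothetical toy data: six aligned domain sequences from three Contexts.
toy_sequences = ["AAAAAA", "AAAAAC", "AAAACA", "GGGGGG", "GGGGGT", "GGGTGG"]
toy_contexts = ["ctx1", "ctx1", "ctx2", "ctx3", "ctx3", "ctx3"]

n_clusters = len(set(toy_contexts))  # as many Clusters as Contexts
medoids, labels = k_medoids(toy_sequences, n_clusters)
context_table = tabulate_by_context(labels, toy_contexts)
```

In this toy case the sequences from ctx3 fall into a single Cluster, whereas ctx1 is split across two Clusters, one of which it shares with ctx2, illustrating that Contexts and Clusters need not correspond one-to-one.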
Figure 3. Major architectures of proteins containing CheW-like domains.
The Classes (designated by numbers from 1 to 6) of CheW-like domains are shown as a function of Architecture and Context as defined in the text. Approximately 95% of CheW-like domains occur in 16 Architectures from three protein lineages [93]. Major Architectures are shown schematically from N to C-terminal. CheW proteins contain only CheW-like domains. Based on Cluster analysis described in the text, the single Context of CheW proteins consists of three Types, each of which ultimately belongs to a different Class, as indicated by the asterisk. CheV proteins contain CheW-like and Receiver domains. CheA proteins are the most architecturally diverse. The basic CheA architecture of Hpt, Dimer, CA, and CheW-like domains is most commonly supplemented with up to 10 N-terminal Hpt domains or an additional CheW-like domain. Relationships between Classes 1 to 6 of CheW-like domains and architectural Contexts are shown [93]. The three CheA architectures shown also often feature a C-terminal Receiver domain, indicated in brackets. Domains and Pfam designations are CheW-like (PF01584), Receiver (Response_reg, PF00072), Hpt (histidine phosphotransfer, PF01627), Dimer (H-kinase_dim, PF02895), CA (catalytic & ATP binding, HATPase_c, PF02518).
CheW-like domains primarily occur in three kinds of proteins (Figure 3). CheW proteins contain only CheW-like domains. CheV proteins contain a receiver domain in addition to a CheW-like domain. CheA proteins contain multiple domains that support phosphorylation in addition to a CheW-like domain. The methodology described above yielded six Classes of CheW-like domains, with numerous implications [93]. Some ramifications (experimentally testable hypotheses) of being able to distinguish different Classes of CheW-like domains follow: (i) CheV function is poorly understood [96], but CheW-like domains in CheV and CheW are both primarily Class 1 and so likely perform similar functions. (ii) The role of the CheV receiver domain has not been established, but it probably does not directly interact with the CheW-like domain. Direct interaction would likely result in coevolution of the receiver and CheW-like domains, in turn leading to divergence from Class 1, which is not observed. (iii) Many CheA Architectures contain a receiver domain immediately C-terminal to a CheW-like domain, an arrangement superficially analogous to CheV proteins. However, the functions are likely different, because CheV proteins have Class 1 domains, whereas CheA proteins have Class 3, 4, or 5 domains.
A logical step after identifying Classes of a domain is to decipher the underlying amino acid sequence characteristics of each Class and so lay the foundation for further experimental dissection. Traditional sequence logos are inadequate to make detailed comparisons of many multiple sequence alignments over the length of an entire protein domain. Therefore, SimpLogo, a new tool that visually conveys gap frequency and amino acid abundance at every position of an alignment, was devised [93]. The R package is available at https://github.com/clayfos/SimpLogo.
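To make concrete the two quantities that such a visualization summarizes, the following minimal Python sketch computes gap frequency and amino acid abundance for each column of a multiple sequence alignment. It is not the SimpLogo implementation or API (SimpLogo is an R package); the function name column_profile, the choice of gap characters, and the toy alignment are assumptions made purely for illustration.

```python
from collections import Counter


def column_profile(alignment, gap_chars="-."):
    """Summarize each alignment column (illustrative helper, not SimpLogo).

    alignment: list of equal-length aligned sequences (strings).
    Returns a list with one (gap_frequency, abundance) tuple per column,
    where abundance maps each residue to its fraction of non-gap residues.
    """
    if not alignment:
        raise ValueError("alignment is empty")
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all sequences must have the same aligned length")

    profile = []
    for col in range(length):
        column = [seq[col].upper() for seq in alignment]
        gaps = sum(1 for c in column if c in gap_chars)
        counts = Counter(c for c in column if c not in gap_chars)
        total = sum(counts.values())
        # An all-gap column has no residues, so its abundance is left empty.
        abundance = {aa: n / total for aa, n in counts.items()} if total else {}
        profile.append((gaps / len(column), abundance))
    return profile


# Toy alignment (hypothetical sequences, not taken from any real family).
toy_alignment = ["MKD-L", "MRD-L", "MKE-I", "MKDAL"]
profile = column_profile(toy_alignment)
for position, (gap_freq, abundance) in enumerate(profile, start=1):
    print(position, gap_freq, abundance)
```

Computing such profiles separately for the sequences assigned to each Class would give one summary per Class, which is the kind of side-by-side comparison that traditional sequence logos handle poorly.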
Potential ways to further exploit three-dimensional information about protein domains
Primary amino acid sequences encode the emergent properties of secondary (α-helices, β-strands, loops), tertiary (three-dimensional fold), and quaternary (oligomeric state) protein structure. The approaches reviewed here are based on primary structure. The experimental utility of such strategies is determined by the degree to which properties of interest depend on amino acid side chains, which can easily be altered and the consequences examined experimentally. In the example of rate constants for receiver domain autocatalytic reactions, five variable positions account for three (Figure 2) out of six orders of magnitude in range. Accounting for half of the range (on a logarithmic scale) represents significant progress toward understanding the factors that determine variation in reaction kinetics between individual examples of the same protein domain. From a linear perspective, however, only 0.1% of the phenomenon is understood, with 99.9% left to explain. Additional variable positions might be identified that account for the remaining three orders of magnitude. However, minute differences in backbone and side chain positioning can also significantly affect factors of interest (rates of catalysis, binding specificity or affinity). In other words, the features that control reaction kinetics might depend heavily on the details of three-dimensional structure, which would limit the utility of strategies that alter primary amino acid sequence and thus complicate experimental dissection.
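The contrast between the logarithmic and linear views of the same result can be checked with a few lines of arithmetic. The sketch below is only a worked illustration; the function name fraction_explained is hypothetical, and the linear fraction is taken to be the simple ratio of the explained fold-range to the total fold-range.

```python
def fraction_explained(orders_explained, orders_total):
    """Return (log-scale fraction, linear-scale fraction) of a range explained.

    Both arguments are numbers of orders of magnitude (powers of ten).
    """
    log_fraction = orders_explained / orders_total
    # A span of n orders of magnitude corresponds to a 10**n fold range.
    linear_fraction = 10 ** orders_explained / 10 ** orders_total
    return log_fraction, linear_fraction


# Three of six orders of magnitude, as in the receiver domain example.
log_frac, lin_frac = fraction_explained(3, 6)
print(f"log scale: {log_frac:.0%}, linear scale: {lin_frac:.1%}")
```

For three of six orders of magnitude, the logarithmic fraction is one half, whereas the linear fraction is one part in a thousand, which is the 0.1% quoted above.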
A long-standing challenge has been how to predict tertiary protein structure from primary amino acid sequences. AlphaFold and other computational methods are now remarkably accurate in predicting protein structure [97, 98]. One can imagine future extensions of these approaches by three-dimensional analogies to the one-dimensional approaches addressed in this review. Individual examples of a particular protein domain typically exhibit relatively low sequence identity but share a three-dimensional fold. Structure defines function, so what information might be revealed by structural analyses? Comparison of three-dimensional structures of many examples of a particular protein domain could identify conserved and variable features of the backbone structure instead of conserved and variable features of amino acid sequences. For example, virtually all receiver domain structures exhibit an (αβ)5 fold, but the length and orientation of the α4 helix vary substantially [99]. The functional consequences of this structural diversity within receiver domains have not been explored. Correlations could be sought between specific backbone features and properties of interest to generate mechanistic hypotheses. To test such hypotheses would require development of methods to experimentally manipulate the backbone in desired ways, ideally with a facility comparable to current techniques for changing side chains.
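One hypothetical way to begin separating conserved from variable backbone features is sketched below in Python. It is an illustrative assumption rather than an established method or anything proposed in the cited work: each structure is represented only by Cα coordinates at positions that have already been aligned across structures, and each position is scored by how much its distances to all other positions vary from structure to structure. Because only internal distances are used, no superposition step is needed. The function name backbone_variability, the input format, and the toy coordinates are all invented for this example.

```python
import math
from statistics import pstdev


def backbone_variability(structures):
    """Score per-position backbone variability across aligned structures.

    structures: list of structures, each a list of (x, y, z) C-alpha
    coordinates for the same aligned positions (hypothetical input format).
    Returns one score per position: the mean, over all other positions j,
    of the standard deviation across structures of the i-to-j distance.
    """
    n = len(structures[0])
    if n < 2 or any(len(s) != n for s in structures):
        raise ValueError("structures must share at least two aligned positions")
    scores = []
    for i in range(n):
        spreads = [
            pstdev([math.dist(s[i], s[j]) for s in structures])
            for j in range(n)
            if j != i
        ]
        scores.append(sum(spreads) / len(spreads))
    return scores


# Toy data: three four-residue "structures" in which only the last
# position moves (coordinates in angstroms, invented for illustration).
base = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
toy_structures = [
    base,
    base[:3] + [(11.4, 2.0, 0.0)],
    base[:3] + [(11.4, 4.0, 0.0)],
]
scores = backbone_variability(toy_structures)
most_variable = max(range(len(scores)), key=scores.__getitem__)
print(scores, most_variable)
```

In the toy data the last position receives the highest score because all of its distances change between structures. With real structures, scores of this kind could then be compared against measured properties of interest to look for the correlations described above.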
Perspectives
Genome sequencing has identified far more proteins than can be experimentally characterized one at a time. There is an increasing need for experimental strategies that can efficiently identify and characterize factors leading to differences between individual representatives of a protein domain or family.
Three current strategies reviewed here exploit abundant amino acid sequence data to address this challenge: coevolutionary analyses to identify variable residues of likely functional importance, amino acid sequence composition at key variable positions to inform physiologically relevant substitutions for functional tests, and clustering and multi-dimensional scaling analyses of amino acid sequences to identify major subtypes of protein domains. Only the first strategy appears to have been widely used.
Widespread use of biochemical experiments whose design is inspired by information derived from the amino acid sequences of protein domain or family members could yield results that in turn substantially enhance annotation of individual protein sequences. Eventual creation of analogous methods applied to three-dimensional protein structures instead of one-dimensional amino acid sequences could yield future analytical and experimental breakthroughs.
Acknowledgements
This work was funded by National Institutes of Health grant GM050860 to R.B.B. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institute of General Medical Sciences or the National Institutes of Health.
Abbreviations
- AHL: acyl-homoserine lactones
- D+n/T+n/K+n: position n residues to the C-terminal side of conserved receiver domain Asp, Thr/Ser, or Lys residues, respectively
- DCA: direct coupling analysis
- ERC: evolutionary rate covariation
- MI: mutual information
- PhyDCA: phyletic direct coupling analysis
- PSICOV: sparse inverse covariation
- SCA: statistical coupling analysis
Footnotes
Competing Interests
The authors declare there are no competing interests associated with the manuscript.
References
- 1. Andreeva A, Kulesha E, Gough J and Murzin AG (2020) The SCOP database in 2020: expanded classification of representative family and superfamily domains of known protein structures. Nucleic Acids Res 48, D376–D382 doi:10.1093/nar/gkz1064
- 2. Sillitoe I, Bordin N, Dawson N, Waman VP, Ashford P, Scholes HM, Pang CSM, Woodridge L, Rauer C, Sen N, Abbasian M, Le Cornu S, Lam SD, Berka K, Varekova IH, Svobodova R, Lees J and Orengo CA (2021) CATH: increased structural coverage of functional space. Nucleic Acids Res 49, D266–D273 doi:10.1093/nar/gkaa1079
- 3. Mistry J, Chuguransky S, Williams L, Qureshi M, Salazar GA, Sonnhammer ELL, Tosatto SCE, Paladin L, Raj S, Richardson LJ, Finn RD and Bateman A (2020) Pfam: The protein families database in 2021. Nucleic Acids Res 49, D412–D419 doi:10.1093/nar/gkaa913
- 4. Letunic I, Khedkar S and Bork P (2021) SMART: recent updates, new developments and status in 2020. Nucleic Acids Res 49, D458–D460 doi:10.1093/nar/gkaa937
- 5. Clifton BE, Kozome D and Laurino P (2022) Efficient exploration of sequence space by sequence-guided protein engineering and design. Biochemistry doi:10.1021/acs.biochem.1c00757
- 6. Kimura M and Ohta T (1974) On some principles governing molecular evolution. Proc Natl Acad Sci U S A 71, 2848–2852 doi:10.1073/pnas.71.7.2848
- 7. Gobel U, Sander C, Schneider R and Valencia A (1994) Correlated mutations and residue contacts in proteins. Proteins 18, 309–317 doi:10.1002/prot.340180402
- 8. Page RDM and Holmes EC Molecular evolution: A phylogenetic approach. 1st ed. Blackwell Science: Oxford; 1998.
- 9. Atchley WR, Wollenberg KR, Fitch WM, Terhalle W and Dress AW (2000) Correlations among amino acid sites in bHLH protein domains: an information theoretic analysis. Mol Biol Evol 17, 164–178 doi:10.1093/oxfordjournals.molbev.a026229
- 10. Larson SM, Di Nardo AA and Davidson AR (2000) Analysis of covariation in an SH3 domain sequence alignment: applications in tertiary contact prediction and the design of compensating hydrophobic core substitutions. J Mol Biol 303, 433–446 doi:10.1006/jmbi.2000.4146
- 11. Olmea O, Rost B and Valencia A (1999) Effective use of sequence correlation and conservation in fold recognition. J Mol Biol 293, 1221–1239 doi:10.1006/jmbi.1999.3208
- 12. Olmea O and Valencia A (1997) Improving contact predictions by the combination of correlated mutations and other sources of sequence information. Fold Des 2, S25–32 doi:10.1016/s1359-0278(97)00060-6
- 13. Shindyalov IN, Kolchanov NA and Sander C (1994) Can three-dimensional contacts in protein structures be predicted by analysis of correlated mutations? Protein Eng 7, 349–358 doi:10.1093/protein/7.3.349
- 14. Pollock DD, Taylor WR and Goldman N (1999) Coevolving protein residues: maximum likelihood identification and relationship to structure. J Mol Biol 287, 187–198 doi:10.1006/jmbi.1998.2601
- 15. Weigt M, White RA, Szurmant H, Hoch JA and Hwa T (2009) Identification of direct residue contacts in protein-protein interaction by message passing. Proc Natl Acad Sci U S A 106, 67–72 doi:10.1073/pnas.0805923106
- 16. Morcos F, Pagnani A, Lunt B, Bertolino A, Marks DS, Sander C, Zecchina R, Onuchic JN, Hwa T and Weigt M (2011) Direct-coupling analysis of residue coevolution captures native contacts across many protein families. Proc Natl Acad Sci U S A 108, E1293–1301 doi:10.1073/pnas.1111471108
- 17. Jones DT, Buchan DW, Cozzetto D and Pontil M (2012) PSICOV: precise structural contact prediction using sparse inverse covariance estimation on large multiple sequence alignments. Bioinformatics 28, 184–190 doi:10.1093/bioinformatics/btr638
- 18. Lockless SW and Ranganathan R (1999) Evolutionarily conserved pathways of energetic connectivity in protein families. Science 286, 295–299 doi:10.1126/science.286.5438.295
- 19. Baker FN and Porollo A (2016) CoeViz: a web-based tool for coevolution analysis of protein residues. BMC Bioinformatics 17, 119 doi:10.1186/s12859-016-0975-z
- 20. Cheng RR, Morcos F, Levine H and Onuchic JN (2014) Toward rationally redesigning bacterial two-component signaling systems using coevolutionary information. Proc Natl Acad Sci U S A 111, E563–571 doi:10.1073/pnas.1323734111
- 21. Colell EA, Iserte JA, Simonetti FL and Marino-Buslje C (2018) MISTIC2: comprehensive server to study coevolution in protein families. Nucleic Acids Res 46, W323–W328 doi:10.1093/nar/gky419
- 22. Corcoran D, Maltbie N, Sudalairaj S, Baker FN, Hirschfeld J and Porollo A (2021) CoeViz 2: Protein graphs derived from amino acid covariance. Front Bioinform 1, doi:10.3389/fbinf.2021.653681
- 23. Ekeberg M, Lovkvist C, Lan Y, Weigt M and Aurell E (2013) Improved contact prediction in proteins: using pseudolikelihoods to infer Potts models. Phys Rev E Stat Nonlin Soft Matter Phys 87, 012707 doi:10.1103/PhysRevE.87.012707
- 24. Kolesov G and Mirny LA (2009) Using evolutionary information to find specificity-determining and co-evolving residues. Methods Mol Biol 541, 421–448 doi:10.1007/978-1-59745-243-4_18
- 25. Wu Z, Liu H, Xu L, Chen HF and Feng Y (2020) Algorithm-based coevolution network identification reveals key functional residues of the alpha/beta hydrolase subfamilies. FASEB J 34, 1983–1995 doi:10.1096/fj.201900948RR
- 26. Neuwald AF (2016) Gleaning structural and functional information from correlations in protein multiple sequence alignments. Curr Opin Struct Biol 38, 1–8 doi:10.1016/j.sbi.2016.04.006
- 27. Taylor WR, Kandathil S and Jones DT Sequence covariation analysis in biological polymers. In: Balding DJ, Moltke I and Marioni J, editors. 4th ed. John Wiley & Sons: Oxford; 2019.
- 28. Qin C and Colwell LJ (2018) Power law tails in phylogenetic systems. Proc Natl Acad Sci U S A 115, 690–695 doi:10.1073/pnas.1711913115
- 29. Burger L and van Nimwegen E (2010) Disentangling direct from indirect co-evolution of residues in protein alignments. PLoS Comput Biol 6, e1000633 doi:10.1371/journal.pcbi.1000633
- 30. Dunn SD, Wahl LM and Gloor GB (2008) Mutual information without the influence of phylogeny or entropy dramatically improves residue contact prediction. Bioinformatics 24, 333–340 doi:10.1093/bioinformatics/btm604
- 31. Tillier ER and Lui TW (2003) Using multiple interdependency to separate functional from phylogenetic correlations in protein alignments. Bioinformatics 19, 750–755 doi:10.1093/bioinformatics/btg072
- 32. Wollenberg KR and Atchley WR (2000) Separation of phylogenetic and functional associations in biological sequences by using the parametric bootstrap. Proc Natl Acad Sci U S A 97, 3288–3291 doi:10.1073/pnas.97.7.3288
- 33. Talavera D, Lovell SC and Whelan S (2015) Covariation Is a poor measure of molecular coevolution. Mol Biol Evol 32, 2456–2468 doi:10.1093/molbev/msv109
- 34. Marks DS, Colwell LJ, Sheridan R, Hopf TA, Pagnani A, Zecchina R and Sander C (2011) Protein 3D structure computed from evolutionary sequence variation. PLoS One 6, e28766 doi:10.1371/journal.pone.0028766
- 35. Pearce R and Zhang Y (2021) Toward the solution of the protein structure prediction problem. J Biol Chem 297, 100870 doi:10.1016/j.jbc.2021.100870
- 36. Chakrabarti S and Panchenko AR (2010) Structural and functional roles of coevolved sites in proteins. PLoS One 5, e8591 doi:10.1371/journal.pone.0008591
- 37. Lee BC, Park K and Kim D (2008) Analysis of the residue-residue coevolution network and the functionally important residues in proteins. Proteins 72, 863–872 doi:10.1002/prot.21972
- 38. Figliuzzi M, Jacquier H, Schug A, Tenaillon O and Weigt M (2016) Coevolutionary landscape inference and the context-dependence of mutations in beta-lactamase TEM-1. Mol Biol Evol 33, 268–280 doi:10.1093/molbev/msv211
- 39. Hopf TA, Ingraham JB, Poelwijk FJ, Scharfe CP, Springer M, Sander C and Marks DS (2017) Mutation effects predicted from sequence co-variation. Nat Biotechnol 35, 128–135 doi:10.1038/nbt.3769
- 40. Kamisetty H, Ovchinnikov S and Baker D (2013) Assessing the utility of coevolution-based residue-residue contact predictions in a sequence- and structure-rich era. Proc Natl Acad Sci U S A 110, 15674–15679 doi:10.1073/pnas.1314045110
- 41. Ovchinnikov S, Kamisetty H and Baker D (2014) Robust and accurate prediction of residue-residue interactions across protein interfaces using evolutionary information. Elife 3, e02030 doi:10.7554/eLife.02030
- 42. Capra EJ, Perchuk BS, Lubin EA, Ashenberg O, Skerker JM and Laub MT (2010) Systematic dissection and trajectory-scanning mutagenesis of the molecular interface that ensures specificity of two-component signaling pathways. PLoS Genet 6, e1001220 doi:10.1371/journal.pgen.1001220
- 43. McClune CJ, Alvarez-Buylla A, Voigt CA and Laub MT (2019) Engineering orthogonal signalling pathways reveals the sparse occupancy of sequence space. Nature 574, 702–706 doi:10.1038/s41586-019-1639-8
- 44. Procaccini A, Lunt B, Szurmant H, Hwa T and Weigt M (2011) Dissecting the specificity of protein-protein interaction in bacterial two-component signaling: orphans and crosstalks. PLoS One 6, e19729 doi:10.1371/journal.pone.0019729
- 45. Skerker JM, Perchuk BS, Siryaporn A, Lubin EA, Ashenberg O, Goulian M and Laub MT (2008) Rewiring the specificity of two-component signal transduction systems. Cell 133, 1043–1054 doi:10.1016/j.cell.2008.04.040
- 46. Szurmant H and Hoch JA (2010) Interaction fidelity in two-component signaling. Curr Opin Microbiol 13, 190–197 doi:10.1016/j.mib.2010.01.007
- 47. Nocedal I and Laub MT (2022) Ancestral reconstruction of duplicated signaling proteins reveals the evolution of signaling specificity. Elife 11, doi:10.7554/eLife.77346
- 48. Aakre CD, Herrou J, Phung TN, Perchuk BS, Crosson S and Laub MT (2015) Evolving new protein-protein interaction specificity through promiscuous intermediates. Cell 163, 594–606 doi:10.1016/j.cell.2015.09.055
- 49. Multamaki E, Nanekar R, Morozov D, Lievonen T, Golonka D, Wahlgren WY, Stucki-Buchli B, Rossi J, Hytonen VP, Westenhoff S, Ihalainen JA, Moglich A and Takala H (2021) Comparative analysis of two paradigm bacteriophytochromes reveals opposite functionalities in two-component signaling. Nat Commun 12, 4394 doi:10.1038/s41467-021-24676-7
- 50. Nicoludis JM, Green AG, Walujkar S, May EJ, Sotomayor M, Marks DS and Gaudet R (2019) Interaction specificity of clustered protocadherins inferred from sequence covariation and structural analysis. Proc Natl Acad Sci U S A 116, 17825–17830 doi:10.1073/pnas.1821063116
- 51. Croce G, Gueudré T, Ruiz Cuevas MV, Keidel V, Figliuzzi M, Szurmant H and Weigt M (2019) A multi-scale coevolutionary approach to predict interactions between protein domains. PLOS Comp Biol 15, e1006891 doi:10.1371/journal.pcbi.1006891
- 52. Vass LR, Bourret RB and Foster CA (2022) Analysis of CheW-like domains provides insights into organization of prokaryotic chemotaxis systems. Proteins doi:10.1002/prot.26430
- 53. Clark NL, Alani E and Aquadro CF (2012) Evolutionary rate covariation reveals shared functionality and coexpression of genes. Genome Res 22, 714–720 doi:10.1101/gr.132647.111
- 54. Kowalczyk A, Gbadamosi O, Kolor K, Sosa J, Andrzejczuk L, Gibson G, St Croix C, Chikina M, Aizenman E, Clark N and Kiselyov K (2021) Evolutionary rate covariation identifies SLC30A9 (ZnT9) as a mitochondrial zinc transporter. Biochem J 478, 3205–3220 doi:10.1042/BCJ20210342
- 55. Sruthi CK and Prakash MK (2019) Statistical characteristics of amino acid covariance as possible descriptors of viral genomic complexity. Sci Rep 9, 18410 doi:10.1038/s41598-019-54720-y
- 56. Wellington Miranda S, Cong Q, Schaefer AL, MacLeod EK, Zimenko A, Baker D and Greenberg EP (2021) A covariation analysis reveals elements of selectivity in quorum sensing systems. Elife 10, e69169 doi:10.7554/eLife.69169
- 57. Lakhani B, Thayer KM, Hingorani MM and Beveridge DL (2017) Evolutionary covariance combined with molecular dynamics predicts a framework for allostery in the MutS DNA mismatch repair protein. J Phys Chem B 121, 2049–2061 doi:10.1021/acs.jpcb.6b11976
- 58. Ding D, Green AG, Wang B, Lite TV, Weinstein EN, Marks DS and Laub MT (2022) Co-evolution of interacting proteins through non-contacting and non-specific mutations. Nat Ecol Evol 6, 590–603 doi:10.1038/s41559-022-01688-0
- 59. Ratnikov BI, Cieplak P, Gramatikoff K, Pierce J, Eroshkin A, Igarashi Y, Kazanov M, Sun Q, Godzik A, Osterman A, Stec B, Strongin A and Smith JW (2014) Basis for substrate recognition and distinction by matrix metalloproteinases. Proc Natl Acad Sci U S A 111, E4148–4155 doi:10.1073/pnas.1406134111
- 60. Clark GW, Ackerman SH, Tillier ER and Gatti DL (2014) Multidimensional mutual information methods for the analysis of covariation in multiple sequence alignments. BMC Bioinformatics 15, 157 doi:10.1186/1471-2105-15-157
- 61. Halabi N, Rivoire O, Leibler S and Ranganathan R (2009) Protein sectors: evolutionary units of three-dimensional structure. Cell 138, 774–786 doi:10.1016/j.cell.2009.07.038
- 62. Jernigan R, Jia K, Ren Z and Zhou W (2021) Large-scale multiple inference of collective dependence with applications to protein function. Ann Appl Stat 15, 902–924 doi:10.1214/20-aoas1431
- 63. Lee Y, Mick J, Furdui C and Beamer LJ (2012) A coevolutionary residue network at the site of a functionally important conformational change in a phosphohexomutase enzyme family. PLoS One 7, e38114 doi:10.1371/journal.pone.0038114
- 64. Mihaljevic L and Urban S (2020) Decoding the functional evolution of an intramembrane protease superfamily by statistical coupling analysis. Structure 28, 1329–1336 e1324 doi:10.1016/j.str.2020.07.015
- 65. Shen W and Li Y (2016) A novel algorithm for detecting multiple covariance and clustering of biological sequences. Sci Rep 6, 30425 doi:10.1038/srep30425
- 66. Neuwald AF, Aravind L and Altschul SF (2018) Inferring joint sequence-structural determinants of protein functional specificity. Elife 7, doi:10.7554/eLife.29880
- 67. Rivoire O, Reynolds KA and Ranganathan R (2016) Evolution-based functional decomposition of proteins. PLoS Comput Biol 12, e1004817 doi:10.1371/journal.pcbi.1004817
- 68. Jagadeesan S, Mann P, Schink CW and Higgs PI (2009) A novel “four-component” two-component signal transduction mechanism regulates developmental progression in Myxococcus xanthus. J Biol Chem 284, 21435–21445 doi:10.1074/jbc.M109.033415
- 69. Porter SL and Armitage JP (2002) Phosphotransfer in Rhodobacter sphaeroides chemotaxis. J Mol Biol 324, 35–45 doi:10.1016/s0022-2836(02)01031-8
- 70. Creager-Allen RL, Silversmith RE and Bourret RB (2013) A link between dimerization and autophosphorylation of the response regulator PhoB. J Biol Chem 288, 21755–21769 doi:10.1074/jbc.M113.471763
- 71. Thomas SA, Immormino RM, Bourret RB and Silversmith RE (2013) Nonconserved active site residues modulate CheY autophosphorylation kinetics and phosphodonor preference. Biochemistry 52, 2262–2273 doi:10.1021/bi301654m
- 72. Volz K (1993) Structural conservation in the CheY superfamily. Biochemistry 32, 11741–11753 doi:10.1021/bi00095a001
- 73. Immormino RM, Silversmith RE and Bourret RB (2016) A variable active site residue influences the kinetics of response regulator phosphorylation and dephosphorylation. Biochemistry 55, 5595–5609 doi:10.1021/acs.biochem.6b00645
- 74. Page SC, Immormino RM, Miller TH and Bourret RB (2016) Experimental analysis of functional variation within protein families: Receiver domain autodephosphorylation kinetics. J. Bacteriol. 198, 2483–2493 doi:10.1128/JB.00853-15
- 75. Thomas SA, Brewster JA and Bourret RB (2008) Two variable active site residues modulate response regulator phosphoryl group stability. Mol. Microbiol. 69, 453–465 doi:10.1111/j.1365-2958.2008.06296.x
- 76. Straughn PB, Vass LR, Yuan C, Kennedy EN, Foster CA and Bourret RB (2020) Modulation of response regulator CheY reaction kinetics by two variable residues that affect conformation. J. Bacteriol. 202, e00089–00020 doi:10.1128/JB.00089-20
- 77. Povolotskaya IS and Kondrashov FA (2010) Sequence space and the ongoing expansion of the protein universe. Nature 465, 922–926 doi:10.1038/nature09105
- 78. Starr TN, Picton LK and Thornton JW (2017) Alternative evolutionary histories in the sequence space of an ancient protein. Nature 549, 409–413 doi:10.1038/nature23902
- 79. Dalal S, Ragheb DRT, Schubot FD and Klemba M (2013) A naturally variable residue in the S1 subsite of M1 family aminopeptidases modulates catalytic properties and promotes functional specialization. J Biol Chem 288, 26004–26012 doi:10.1074/jbc.M113.465625
- 80. Vulpetti A and Bosotti R (2004) Sequence and structural analysis of kinase ATP pocket residues. Farmaco 59, 759–765 doi:10.1016/j.farmac.2004.05.010
- 81. Fetrow JS and Babbitt PC (2018) New computational approaches to understanding molecular protein function. PLoS Comput Biol 14, e1005756 doi:10.1371/journal.pcbi.1005756
- 82. Rauer C, Sen N, Waman VP, Abbasian M and Orengo CA (2021) Computational approaches to predict protein functional families and functional sites. Curr Opin Struct Biol 70, 108–122 doi:10.1016/j.sbi.2021.05.012
- 83. Akiva E, Brown S, Almonacid DE, Barber AE 2nd, Custer AF, Hicks MA, Huang CC, Lauck F, Mashiyama ST, Meng EC, Mischel D, Morris JH, Ojha S, Schnoes AM, Stryke D, Yunes JM, Ferrin TE, Holliday GL and Babbitt PC (2014) The Structure-Function Linkage Database. Nucleic Acids Res 42, D521–530 doi:10.1093/nar/gkt1130
- 84. Knutson ST, Westwood BM, Leuthaeuser JB, Turner BE, Nguyendac D, Shea G, Kumar K, Hayden JD, Harper AF, Brown SD, Morris JH, Ferrin TE, Babbitt PC and Fetrow JS (2017) An approach to functionally relevant clustering of the protein universe: Active site profile-based clustering of protein structures and sequences. Protein Sci 26, 677–699 doi:10.1002/pro.3112
- 85. Leuthaeuser JB, Knutson ST, Kumar K, Babbitt PC and Fetrow JS (2015) Comparison of topological clustering within protein networks using edge metrics that evaluate full sequence, full structure, and active site microenvironment similarity. Protein Sci 24, 1423–1439 doi:10.1002/pro.2724
- 86. Sillitoe I, Cuff AL, Dessailly BH, Dawson NL, Furnham N, Lee D, Lees JG, Lewis TE, Studer RA, Rentzsch R, Yeats C, Thornton JM and Orengo CA (2013) New functional families (FunFams) in CATH to improve the mapping of conserved functional sites to 3D structures. Nucleic Acids Res 41, D490–498 doi:10.1093/nar/gks1211
- 87. Zallot R, Oberg N and Gerlt JA (2021) Discovery of new enzymatic functions and metabolic pathways using genomic enzymology web tools. Curr Opin Biotechnol 69, 77–90 doi:10.1016/j.copbio.2020.12.004
- 88. Bileschi ML, Belanger D, Bryant DH, Sanderson T, Carter B, Sculley D, Bateman A, DePristo MA and Colwell LJ (2022) Using deep learning to annotate the protein universe. Nat Biotechnol 40, 932–937 doi:10.1038/s41587-021-01179-w
- 89. Littmann M, Bordin N, Heinzinger M, Schutze K, Dallago C, Orengo C and Rost B (2021) Clustering FunFams using sequence embeddings improves EC purity. Bioinformatics doi:10.1093/bioinformatics/btab371
- 90. Dong Z, Zhou H and Tao P (2018) Combining protein sequence, structure, and dynamics: A novel approach for functional evolution analysis of PAS domain superfamily. Protein Sci 27, 421–430 doi:10.1002/pro.3329
- 91. Koretke KK, Lupas AN, Warren PV, Rosenberg M and Brown JR (2000) Evolution of two-component signal transduction. Mol Biol Evol 17, 1956–1970 doi:10.1093/oxfordjournals.molbev.a026297
- 92. Pandya C, Brown S, Pieper U, Sali A, Dunaway-Mariano D, Babbitt PC, Xia Y and Allen KN (2013) Consequences of domain insertion on sequence-structure divergence in a superfold. Proc Natl Acad Sci U S A 110, E3381–3387 doi:10.1073/pnas.1305519110
- 93. Vass LR, Branscum KM, Bourret RB and Foster CA (2022) Generalizable strategy to analyze domains in the context of parent protein architecture: A CheW case study. Proteins 90, 1973–1986 doi:10.1002/prot.26390
- 94. Naveenkumar N, Kumar G, Sowdhamini R, Srinivasan N and Vishwanath S (2019) Fold combinations in multi-domain proteins. Bioinformation 15, 342–350 doi:10.6026/97320630015342
- 95. Chen C, Natale DA, Finn RD, Huang H, Zhang J, Wu CH and Mazumder R (2011) Representative proteomes: A stable, scalable and unbiased proteome set for sequence analysis and functional annotation. PLOS One 6, e18910–e18910 doi:10.1371/journal.pone.0018910
- 96. Huang Z, Pan X, Xu N and Guo M (2019) Bacterial chemotaxis coupling protein: Structure, function and diversity. Microbiol Res 219, 40–48 doi:10.1016/j.micres.2018.11.001
- 97. Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, Tunyasuvunakool K, Bates R, Zidek A, Potapenko A, Bridgland A, Meyer C, Kohl SAA, Ballard AJ, Cowie A, Romera-Paredes B, Nikolov S, Jain R, Adler J, Back T, Petersen S, Reiman D, Clancy E, Zielinski M, Steinegger M, Pacholska M, Berghammer T, Bodenstein S, Silver D, Vinyals O, Senior AW, Kavukcuoglu K, Kohli P and Hassabis D (2021) Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 doi:10.1038/s41586-021-03819-2
- 98. Varadi M, Anyango S, Deshpande M, Nair S, Natassia C, Yordanova G, Yuan D, Stroe O, Wood G, Laydon A, Zidek A, Green T, Tunyasuvunakool K, Petersen S, Jumper J, Clancy E, Green R, Vora A, Lutfi M, Figurnov M, Cowie A, Hobbs N, Kohli P, Kleywegt G, Birney E, Hassabis D and Velankar S (2022) AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Res 50, D439–D444 doi:10.1093/nar/gkab1061
- 99. Bourret RB (2010) Receiver domain structure and function in response regulator proteins. Curr Opin Microbiol 13, 142–149 doi:10.1016/j.mib.2010.01.015
- 100. Lassila JK, Zalatan JG and Herschlag D (2011) Biological phosphoryl-transfer reactions: understanding mechanism and catalysis. Annu. Rev. Biochem. 80, 669–702 doi:10.1146/annurev-biochem-060409-092741
- 101. Lee SY, Cho HS, Pelton JG, Yan D, Berry EA and Wemmer DE (2001) Crystal structure of activated CheY. Comparison with other activated receiver domains. J. Biol. Chem. 276, 16425–16431 doi:10.1074/jbc.M101002200