Abstract

Inferring the historical and biophysical causes of diversity within protein families is a complex puzzle. A key to unraveling this problem is characterizing the rugged topography of sequence–function adaptive landscapes. Using biochemical data from a 2^9 = 512 combinatorial library of tobacco 5-epi-aristolochene synthase (TEAS) mutants engineered to make the native major product of Egyptian henbane premnaspirodiene synthase (HPS) and a complementary 512-mutant HPS library, we address the question of how product specificity is controlled. These data sets reveal that HPS is far more robust and resistant to mutations than TEAS, where most mutants are promiscuous. We also combine experimental data with a sequence Potts Hamiltonian model and direct coupling analysis to quantify mutant fitness. Our results demonstrate that the Hamiltonian captures variation in product outputs across both libraries, clusters native family members based on their substrate specificities, and exposes the divergent catalytic roles of couplings between the catalytic and noncatalytic domains of TEAS versus HPS. Specifically, we found that interdomain connectivities are more important than connectivities within the catalytic domain in specifying the product output of TEAS. Although HPS is 75% identical to TEAS, it does not share this property; in HPS, connectivities within the catalytic domain are more important for specificity. By solving the X-ray crystal structure of HPS, we assessed the structural bases for these interdomain network differences. Last, we calculate the product profile Shannon entropies of the two libraries, which show that site–site connectivities also play divergent roles in catalytic accuracy.
Introduction
Represented in organisms across the tree of life, terpenoids comprise the largest family of natural products, totaling over 100,000 known compounds.1−3 This wide array of compounds arises from the condensation of isopentenyl diphosphate and dimethylallyl diphosphate to make a variety of isoprenoid substrates. These substrates are often cyclized by a family of enzymes that remarkably all share the so-called terpene synthase (TPS) fold. Class I TPSs are classified as mono-, sesqui-, and diterpene synthases according to their preferred substrates geranyl diphosphate, farnesyl diphosphate (FPP), and geranylgeranyl diphosphate, respectively. As a consequence of the labile carbocation chemistry involved after ionization of the diphosphate group, the suite of potential products is large. Product output is both narrowed and tailored, however, via stereospecific interactions with residues in the active site that raise the energy of some intermediate conformations, thereby disallowing those catalytic pathways.4 Nevertheless, TPSs are known for their promiscuity, wherein several minor products are produced in addition to the main product, for which the synthase is generally named. Plants host, by far, the largest number of terpenoids, which help ward off herbivore attacks,5 attract pollinators,6 regulate growth, and facilitate wound healing.7 Commercially, terpenoids are used in flavors and fragrances,8 as antimicrobials,9 and as therapeutics for diseases like malaria10 and cancer.11
Elucidating how this vast biosynthetic diversity emerged requires detailed mapping of biochemical and biophysical properties across the accessible sequence space of the family, defining the sequence–function landscape. We addressed this question experimentally using two sesqui-TPSs from the Solanaceae family that share 75% sequence identity: tobacco 5-epi-aristolochene synthase (TEAS) and Egyptian henbane premnaspirodiene synthase (HPS). These bind and cyclize FPP within a catalytic pocket in the C-terminal domain to generate 5-epi-aristolochene (5EA) and premnaspirodiene (PS) via a common eudesmyl carbocation intermediate (Figure 1a). Plant mono- and sesquiterpene synthases also have an N-terminal “vestigial” domain of unknown function proposed to have distinct evolutionary origins from the catalytic C-terminal domain (Figure 1b).12,13 After identifying nine divergent positions in and around the active site of TEAS and HPS, we interconverted the enzymes’ product specificities by swapping the residues at these sites (Figure 1c).14 Following this work, we constructed and assayed the product profiles of a combinatorial library consisting of all possible residue combinations (2^9 = 512) in the tobacco enzyme (denoted as the M9 library, Figure 1e,f).15 Product profiles are very sensitive to changes at these nine sites, and many HPS-like variants were mutated at combinations of sites beyond the active site surface, indicating that specificity-determining factors are not restricted to the active site. In addition, epistatic contributions to the product specificity were quantified to reveal the importance of residue–residue coupling or positional epistasis as a determinant of TPS catalytic output. While this work led to powerful insights about an interconnected set of positions in TEAS, it allowed us only a limited view of whether our conclusions about their roles would hold in the HPS background.
We address this gap by constructing the full 512-member HPS M9 library and assaying each product profile by GC–MS. Analyses of this library suggest that HPS is biased toward the accurate production of its native product, while TEAS M9 library mutants tend to be far more promiscuous. We observe that this 9-position adaptive landscape is both rugged and strongly sensitive to sequence context. In addition, this difference in robustness highlights the benefit of mutational screens that use multiple related genes: even if the genes are functionally similar, the properties of their local sequence–function landscapes can be highly context dependent.
Figure 1.
Overview of the experimental system. (A) Proposed catalytic mechanisms of TEAS and Egyptian HPS start with substrate farnesyl pyrophosphate and lead to their respective major products and 4-epi-eremophilene (4EE). (B) Structural superposition of TEAS (5EAT) and HPS (5JO7) reveals structural similarity. N-terminal and catalytic C-terminal domains are colored orange and teal, respectively. HPS is colored in more saturated tones. (C) Closeup of the TEAS and HPS active sites, highlighting the nonionizable substrate analogue, farnesyl hydroxyphosphate, the magnesium cations, and the nine positions mutated in the M9 libraries. Positions 4 and 9 (magenta) lie on the active site surface; teal residues are one layer back from the active site. (D) Primary structure schematic depicting the respective M9 mutations in TEAS and HPS. (E) Full combinatorial libraries were cloned for both TEAS and HPS. (F) Mutants were overexpressed in Escherichia coli and purified in high throughput. The enzymatic products were assessed using the vial assay.16 The products were detected by GC–MS, and the percent outputs of each detectable product were quantitated by peak integration.
The 512 × 2 = 1024 mutant library reported here emphasizes the combinatorial explosion that would occur should we wish to include more than 9 of the 135 divergent sites between TEAS and HPS or more than one amino acid change at each site. Therefore, we employed direct coupling analysis (DCA), which uses statistical inference to model the sequence space of the entire TPS family starting from a multiple sequence alignment (MSA) of native homologues. Our model assumes that the fold and functions of TPS family members are facilitated by interacting networks of residues and that, during the historical expansion of the family, a change in one member of a network had to be accompanied by changes in its partners in order to maintain functionality. This coevolutionary information-based modeling has been used to predict structure17 and dimer interfaces18 and to engineer novel LacI-based gene expression repressors,19 and it has been related to protein stability.20
Here, we find that coevolutionary information not only predicts the TPS fold but also captures catalytic features, such as substrate and product specificity. In the case of the HPS mutant library, it even captures product diversity, hinting that the structural determinants of synthase promiscuity can also be controlled by coevolving networks of amino acids. Our model also helped to elucidate the distinct roles of amino acid pairs involving sites spanning the N- and C-terminal domains. Our results show that in HPS, the intradomain pairs drive catalytic specificity, while in TEAS, interdomain pairs play a bigger role in catalytic specificity than pairs in which both positions are in the catalytic domain. In order to assess the generality of these catalytically active interdomain pairs across the family, we developed a technique similar to one previously used by Cheng et al. to identify cognate two-component signaling protein pairs.21 Our analyses reveal that although these sets of interdomain site pairs are among the highest ranked in the model, the residues at these sites are not widely conserved. To determine the structural basis for the divergence in functional roles of interdomain connectivities, we solved the X-ray crystal structure of WT HPS. Structural comparisons of HPS and TEAS reveal that the strongest interdomain pairs predicted in our model are indeed close contacts at the interface between the two domains; in addition, their influence on catalysis is plausible because they are near active site residues and M9 positions. Overall, our work demonstrates the combined power of large experimental data sets with global computational models of sequence space to characterize adaptive landscapes.
Methods
HPS and TEAS M9 Library Construction
For HPS, a total of 512 (= 2^9) constructs were made using a restriction enzyme digestion and ligation approach. All constructs were confirmed via Sanger sequencing and then cloned into the pHIS9GW vector in preparation for overexpression. For TEAS, 432 members of the M9 combinatorial library were originally obtained.15 To generate the full library, the SCOPE method was repeated from scratch,22 and all 512 TEAS mutants were identified using Sanger sequencing. Mutant TEAS genes were subcloned into pH9GW, an in-house expression vector encoding nine N-terminal histidines, via the Gateway system (Invitrogen) according to the manufacturer’s instructions. Further details are provided in Supporting Information text.
Protein Expression and Purification
HPS library mutants were overexpressed in BL21 (DE3) at 18 °C for 16 h with 0.5 mM isopropyl β-d-1-thiogalactopyranoside (IPTG) induction as previously described.23 After assessing protein solubility using SDS-PAGE, we purified 338 proteins from the HPS M9 library using a Qiagen BioRobot 8000 (Qiagen) via a modified protocol that started from soluble proteins instead of a cell culture. The mixtures of 400 μL of soluble protein, 500 μL of wash buffer (50 mM Tris, pH 8, 500 mM NaCl, 10 mM imidazole, pH 8, and 10% glycerol), and 200 μL of Ni-NTA slurry (Qiagen) were processed in 96-well plates. After two washes with 800 μL of wash buffer, proteins were purified using 2 × 150 μL of elution buffer (wash buffer with 250 mM imidazole). To remove imidazole from the purified samples, we used Zeba Spin Desalting Columns, 7K MWCO (Thermo). Concentrations of purified proteins were measured using SDS-PAGE and ImageJ.24
Expression and purification of TEAS M9 library proteins were performed as previously described.25 Mutants were expressed in BL21(DE3) cells in 5 mL of Terrific broth with kanamycin until cultures reached an OD600 value of 0.8 or greater. Protein expression was induced by addition of IPTG to 0.1 mM, followed by growth with shaking at 20 °C for 5 h. Pellets from harvested cell cultures were resuspended by adding 0.811 mL of lysis buffer containing 1 mg/mL lysozyme and 1 mM EDTA directly to frozen pellets, followed by shaking at room temperature. Proteins were purified using 96-well, 800 μL, 25–30 μm melt blown polypropylene filter plates (Whatman 7700-2804) and Ni-NTA superflow resin (Qiagen).
Measurement of Apparent kcat Values
Apparent kcat values were measured as previously described.26 Briefly, initial velocities of each purified mutant were measured at 25 μM FPP, and apparent kcat values were estimated by multiplication. Enzyme assays were carried out in 500 μL of reaction volume containing 10 nM enzyme and immediately overlaid with 500 μL of hexanes containing p-chlorotoluene (SUPELCO), the internal standard. After various time points, samples were analyzed by GC–MS. Further details are provided in Supporting Information text.
Product Selection for Analyses in Individual HPS M9 Library Mutant
The product peaks of the 512 HPS M9 library mutants from the GC–MS results were examined using MSD ChemStation E.02.02.1431 (Agilent). Several peaks that were tiny in some mutants were larger and easily identifiable in other mutants. Several peaks eluted too close together and partially overlapped, and their mass spectra are also similar.26 Therefore, we manually separated all of the peaks in the GC chromatograms of all of the mutants to annotate them accurately.
Out of 512 mutants, 33 are either insoluble or inactive and did not produce any product. The HPS M9 library produced 19 different products detectable by gas chromatography–mass spectrometry (GC–MS). WT HPS produced 16 products, and 5 mutants produced 18 products, the largest variety of products. Mutants with high solubility and/or high activity produced a large amount of total product from overnight enzyme reactions using crude E. coli extract expressing each mutant. However, many mutants with either low solubility or low activity produced a small amount of total product. In these cases, some minor products are not visible by GC, not because they are not produced but because their peaks fall below detection owing to the low overall amount of product. On the other hand, peaks that are not detected in mutants producing large amounts of product are assumed not to be produced. Based on the GC results, we decided to include all 19 products, even though some have a value of 0, for the 273 mutants that produced 13 or more products and had a total peak area of at least 6 × 10^7. Here, peak areas (intensities) are abundances of total mass-to-charge units during the scan. Other mutants produced a small overall amount and/or a small variety of products; for these, the undetected peaks were ignored in the analysis. In the samples with even lower amounts of total product, which also have a small number of products, only three major products were analyzed.
TEAS Product Identification and Quantification by GC–MS
The M9 TEAS mutant library reaction products were analyzed using a Hewlett-Packard 6890 gas chromatograph coupled to a 5973 mass selective detector outfitted with a 7683B series injector and an autosampler and equipped with either an HP-5MS or an HP-Chiral-20B column (Agilent). Product peaks were quantified by integration of peak areas using Enhanced Chemstation (version E.02.00, Agilent). Products were identified using MassFinder 4.25 (http://massfinder.com/). The GC–MS data were inspected to identify the peaks (compounds) to be quantified in the series of samples. The quantification was carried out automatically and used the mass spectra to obtain chromatograms extracted for ions (m/z) (usually 3–5) specific to each compound. First, the intensities of each extracted chromatogram were calculated using Met-Idea v2.0527 based on a collection of [retention time, m/z] pairs. The rest of the steps were carried out in Matlab 2013 (MathWorks) using scripts written in-house. For each extracted chromatogram, the intensities were corrected to take into account the percentage of the signal that the ion represented in the mass spectrum, so that the corrected intensities should be the same for all ions and represent the amount of compound present (relative quantitation). These intensities were averaged across ions. The percentage of the signal represented by each compound was then calculated. In addition, in-house scripts generated a report that provided a number of useful diagnostic tools, notably graphs showing the extracted chromatograms over the relevant RT range, as well as the correlation between the corrected intensities from different ions. These were used to detect systematic bias resulting from nonspecificity and/or interference between closely eluting compounds. When necessary, the list of ions was refined so as to limit such occurrences.
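The relative quantitation described above can be illustrated with a minimal sketch. It assumes that the correction is a simple division of each extracted-ion intensity by the fraction of the compound's mass spectrum carried by that ion; the function name, the data layout, and all numbers are hypothetical and do not come from the in-house Matlab scripts.

```python
def percent_outputs(ion_intensities, ion_fractions):
    """Return the percent of total signal per compound.

    ion_intensities: {compound: {mz: extracted-ion intensity}}
    ion_fractions:   {compound: {mz: fraction of that compound's spectrum
                      carried by the ion}} (assumed known from a reference spectrum)
    """
    corrected = {}
    for compound, ions in ion_intensities.items():
        # Dividing by the ion's share of the spectrum should give the same
        # estimate of compound amount from every ion, so the values are averaged.
        estimates = [inten / ion_fractions[compound][mz] for mz, inten in ions.items()]
        corrected[compound] = sum(estimates) / len(estimates)
    total = sum(corrected.values())
    return {compound: 100.0 * value / total for compound, value in corrected.items()}


# Made-up intensities and ion fractions for two products.
toy_intensities = {"5EA": {161: 750.0, 204: 1500.0}, "PS": {161: 250.0, 189: 125.0}}
toy_fractions = {"5EA": {161: 0.25, 204: 0.5}, "PS": {161: 0.25, 189: 0.125}}
toy_profile = percent_outputs(toy_intensities, toy_fractions)
```

In this toy case both ions of each compound agree after correction, which is the behavior the diagnostic correlation plots are meant to confirm; disagreement between ions would flag interference from a closely eluting compound.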
Model and MSA Generation
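According to direct coupling analysis, the distribution of sequences in a protein family can be assumed to be described by a Potts model, which gives the global probability of a family sequence $\sigma = (\sigma_1, \ldots, \sigma_L)$ according to

$$P(\sigma) = \frac{1}{Z}\, e^{-H(\sigma)} \qquad (1)$$

where $Z$ is the normalization constant, and H is the Hamiltonian and represents a statistical “energy” of a given sequence

$$H(\sigma) = -\sum_{i<j} e_{ij}(\sigma_i, \sigma_j) - \sum_{i} h_i(\sigma_i) \qquad (2)$$
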
The key parameters of the model are the statistical couplings between sites, eij, and local fields, hi, which capture signals of site-independent conservation. The i and j indices refer to position along the protein family sequence, and σi refers to the amino acid identity at site i. In the simplest statistical mechanical system described by a Potts model, only neighboring spins interact, making the 2 × 2 matrix easily solvable. For proteins, each position can interact with every other position, which significantly increases the size and complexity of the coupling parameter. It therefore must be estimated, which we did using the mean-field DCA implementation.17
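As an illustration of how a Hamiltonian score is evaluated once couplings and local fields are available, the following minimal sketch assumes the standard Potts sign convention, H = −Σ e_ij − Σ h_i, so that more negative values are more family-like. The dictionary layout and the toy parameters are hypothetical; real parameters would come from mean-field DCA on the MSA.

```python
from itertools import combinations


def potts_hamiltonian(seq, couplings, fields):
    """Statistical energy H = -sum_{i<j} e_ij(s_i, s_j) - sum_i h_i(s_i).

    seq       : aligned sequence, one character per MSA column
    couplings : {(i, j, a, b): e_ij(a, b)} with i < j; missing entries count as 0
    fields    : {(i, a): h_i(a)}; missing entries count as 0
    """
    pair_term = sum(
        couplings.get((i, j, seq[i], seq[j]), 0.0)
        for i, j in combinations(range(len(seq)), 2)
    )
    field_term = sum(fields.get((i, a), 0.0) for i, a in enumerate(seq))
    return -(pair_term + field_term)


# Toy three-site example with made-up parameters (not inferred from any MSA).
toy_couplings = {(0, 1, "A", "L"): 0.5, (1, 2, "L", "K"): 0.25}
toy_fields = {(0, "A"): 0.1, (2, "K"): 0.2}
h_typical = potts_hamiltonian("ALK", toy_couplings, toy_fields)
h_mutant = potts_hamiltonian("AVK", toy_couplings, toy_fields)
```

Here the substitution at the middle site removes both favorable couplings, so the mutant receives a less negative (less typical) score than the original sequence.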
The aligned sequences of TEAS (uniprotID Q40577) and HPS (uniprotID Q39978) (ClustalO, N = 556)28,29 were used as the seed for HMM profile generation by HMMbuild, and relevant sequences were collected using HMMsearch against the UniProt TrEMBL and SwissProt databases.30 Sequences with more than 10% consecutive gaps were removed to improve data quality, resulting in 5066 sequences and 1679 effective sequences (θ = 0.2).
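The effective sequence count can be sketched as follows, assuming the reweighting commonly used with mean-field DCA: each sequence is down-weighted by the number of alignment members (itself included) that lie within a fractional Hamming distance θ of it. The exact threshold convention used for the 1679 value is not stated here, so the strict inequality below is an assumption, and the toy alignment is made up.

```python
def effective_sequence_count(msa, theta=0.2):
    """Sum of weights 1/n_s, where n_s counts the sequences (s included)
    that differ from s at fewer than theta * length aligned positions."""
    length = len(msa[0])
    total = 0.0
    for s in msa:
        neighbours = sum(
            1
            for t in msa
            if sum(a != b for a, b in zip(s, t)) < theta * length
        )
        total += 1.0 / neighbours
    return total


# Two near-identical sequences share one unit of weight; the third counts fully.
toy_msa = ["AAAAAAAAAA", "AAAAAAAAAC", "CCCCCCCCCC"]
m_eff = effective_sequence_count(toy_msa)
```

This all-against-all comparison is quadratic in the number of sequences, which is acceptable for a sketch but would normally be vectorized for an alignment of several thousand sequences.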
IntraC- and Interdomain Hamiltonian Calculations
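HintraC scores were calculated according to the following equation

$$H_{\mathrm{intraC}}(\sigma) = -\sum_{i=L_D}^{L-1} \sum_{j=i+1}^{L} e_{ij}(\sigma_i, \sigma_j) - \sum_{i=L_D}^{L} h_i(\sigma_i) \qquad (3)$$

where $L_D$ represents the beginning site of the C-terminal domain and $L$ is the final position, 226 and 548 in TEAS numbering, respectively.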
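Hinter scores were calculated according to the following equation

$$H_{\mathrm{inter}}(\sigma) = -\sum_{i=L_N}^{L_D-1} \sum_{j=L_D}^{L} e_{ij}(\sigma_i, \sigma_j) - \sum_{i=L_N}^{L} h_i(\sigma_i) \qquad (4)$$

where $L_N$ is set to 38 and $L$ is 523, which excluded the first 37 and the last 25 MSA positions, mitigating impacts from their high gap frequencies.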
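The restriction of the pair sum to a chosen set of site pairs can be sketched as below. Only the coupling part of the score is shown; how local fields enter the restricted scores is not reproduced here, and the function name, index ranges, and parameters are hypothetical toy values rather than the TEAS numbering.

```python
def restricted_pair_energy(seq, couplings, first_sites, second_sites):
    """Return -sum of e_ij(s_i, s_j) over pairs i < j with i in first_sites
    and j in second_sites; couplings absent from the dict count as 0."""
    total = 0.0
    for i in first_sites:
        for j in second_sites:
            if i < j:
                total += couplings.get((i, j, seq[i], seq[j]), 0.0)
    return -total


# Toy 6-column alignment: columns 0-2 stand in for the N-terminal domain,
# columns 3-5 for the catalytic C-terminal domain.
n_domain = range(0, 3)
c_domain = range(3, 6)
toy_couplings = {
    (0, 4, "A", "D"): 0.4,  # interdomain pair
    (3, 5, "G", "E"): 0.7,  # pair inside the C-terminal domain
    (0, 1, "A", "C"): 0.2,  # pair inside the N-terminal domain (used by neither score)
}
toy_seq = "ACTGDE"
h_intra_c = restricted_pair_energy(toy_seq, toy_couplings, c_domain, c_domain)
h_inter = restricted_pair_energy(toy_seq, toy_couplings, n_domain, c_domain)
```

Passing the same range twice yields the intradomain pair sum, while passing two disjoint ranges yields the interdomain pair sum, so one helper covers both scores.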
Significance Testing for Correlation Coefficients
Pearson correlation coefficients (R) between sequence H and functional data for 1000 randomly mismatched data sets were calculated and compared to R for the real data set; the percentile for the real data is the percent of R’s from the mismatched data sets that are equal to or below it. A cutoff of the 99th percentile was used to determine statistical significance.
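A minimal sketch of this mismatch test is given below. It follows the percentile definition stated above literally; the random seed, function names, and toy data are assumptions, and how negative correlations were handled (for example, by comparing absolute values) is not specified in the text and is not implemented here.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def mismatch_percentile(h_scores, outputs, n_shuffles=1000, seed=0):
    """Return the real R and the percent of mismatched-data R values
    that are equal to or below it."""
    rng = random.Random(seed)
    real_r = pearson_r(h_scores, outputs)
    shuffled = list(outputs)
    at_or_below = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)  # randomly re-pair functional data with sequences
        if pearson_r(h_scores, shuffled) <= real_r:
            at_or_below += 1
    return real_r, 100.0 * at_or_below / n_shuffles


# Made-up, perfectly correlated data: the real R should beat every mismatch.
toy_h = [float(v) for v in range(20)]
toy_output = [2.0 * v + 1.0 for v in toy_h]
toy_r, toy_percentile = mismatch_percentile(toy_h, toy_output)
significant = toy_percentile >= 99.0
```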
Scrambled MSA Data
Scrambling of the N- and C-terminal domains of the TPS family was accomplished similarly to previous work.21 The N- and C-terminal domains were cut apart (between positions 226 and 227 in the TEAS sequence) and randomly mismatched across the MSA. The coupling and local field parameters were inferred, and H for each M9 library sequence was calculated according to eq 4. This scrambling procedure was repeated 10 times before Hamiltonian values were averaged and then correlated with functional metrics from the M9 library experimental data sets.
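The domain-mismatching step can be sketched as a random permutation of C-terminal segments across alignment rows. This is an illustrative assumption about how the mismatching could be implemented; the cut index, seed, and toy alignment are hypothetical, and a plain shuffle may occasionally leave a row paired with its own cognate segment.

```python
import random


def scramble_domains(msa, cut, seed=0):
    """Pair each N-terminal segment (columns before `cut`) with a C-terminal
    segment (columns from `cut` on) drawn from a random permutation of rows."""
    rng = random.Random(seed)
    c_segments = [row[cut:] for row in msa]
    rng.shuffle(c_segments)
    return [row[:cut] + c_seg for row, c_seg in zip(msa, c_segments)]


# Toy alignment: uppercase columns play the N-terminal domain,
# lowercase columns the C-terminal domain.
toy_msa = ["AAAccc", "GGGddd", "TTTeee", "KKKfff"]
scrambled = scramble_domains(toy_msa, cut=3)
```

Because every column keeps exactly the same set of residues, single-site frequencies are unchanged by this operation, while any covariation between the two halves that depends on cognate pairing is broken.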
Shannon Entropy Calculations
Product output values were scaled such that the total product outputs of each mutant’s profile added up to 1. These values were then input into the Shannon information equation by using log2 to calculate entropy in bits.
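As a worked illustration of this calculation, the sketch below normalizes a product profile and evaluates the Shannon entropy in bits. The function name and the example profiles are hypothetical; products with zero output are taken to contribute nothing to the sum, which is the usual convention.

```python
import math


def profile_entropy_bits(outputs):
    """Shannon entropy (bits) of a product profile given as raw outputs."""
    total = sum(outputs)
    probs = [value / total for value in outputs if value > 0]
    return -sum(p * math.log2(p) for p in probs)


single_product = profile_entropy_bits([100.0])                 # perfectly specific enzyme
two_equal = profile_entropy_bits([50.0, 50.0, 0.0])            # two equally abundant products
four_equal = profile_entropy_bits([25.0, 25.0, 25.0, 25.0])    # four equally abundant products
```

A perfectly specific enzyme scores 0 bits, and the entropy grows with the number and evenness of products, so higher values indicate a more promiscuous product profile.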
Protein Crystal Preparation and X-ray Data Collection
HPS crystals were grown in hanging drops with 100 mM piperazine-N,N′-bis(2-ethanesulfonic acid), pH 6.5, 200 mM disodium tartrate, and 23% PEG 20000 as reservoir solution. Through crystal optimization, a very large crystal was obtained at room temperature from a mixture of 1 μL of protein (HPS construct with 22 amino acids truncated at the N-terminus, 20 mg/mL) and 1 μL of reservoir solution, with 500 μL of 0.55 M NaCl in the bottom of the well. The cryoprotectant solution consisted of 85% reservoir solution and 15% glycerol. X-ray diffraction data were collected at beamline BL5.0.3 at the Lawrence Berkeley National Laboratory (LBNL). Data were processed using iMOSFLM,31 the CCP4 Program Suite,32 PHENIX,33 and COOT.34
Results and Discussion
Completion and Characterization of the TEAS and HPS M9 Libraries
The structure-based combinatorial protein engineering (SCOPE) method was previously used to obtain all possible combinations of 9 amino acid mutations in TEAS (2^9 = 512 combinations), of which 432 unique mutant constructs were successfully obtained.15,22 Here, we recloned the TEAS M9 library in order to obtain all 512 TEAS M9 mutants (Figure 1e). We then used the digestion and ligation method to obtain all of the HPS M9 library clones (see Methods). As schematized in Figure 1f, the TEAS and HPS mutant libraries were overexpressed in E. coli and purified in high throughput, and each of the 1024 mutants was individually subjected to biochemical vial assays; the output of each product as a percentage of the total was detected using GC–MS.16 Quantitative assessment of these total ion chromatograms provided a product profile for each active mutant, in total, 508 for TEAS and 450 for HPS (Figure 1e). The major products among library mutants were PS, 5EA, and 4EE; however, several minor species are also produced. Previously, these compounds were all grouped together for quantitative product profile assessments, which introduced the potential for error owing to differing ionization efficiencies.15 For this work, each detectable product of the TEAS M9 library mutants, 12 products in total, was tracked separately, thereby avoiding this error source. In addition, building on recent work characterizing the full minor product profiles of WT and M9 HPS, we followed 19 total products across the HPS M9 library, see Figure S11a.26
The distribution of product specificities across the M9 libraries is shown in Supporting Information Figure S1. The majority of TEAS M9 library mutants (69.5%) are promiscuous; that is, no single product represents 50% or greater of the total. The second most common mutants in the library are TEAS-like (20.1%), producing 5EA as 50% or more of their total output. These proportions are swapped in the HPS M9 library, where most mutants are HPS-like (73.1%), followed by promiscuous mutants (23.1%). In both TEAS and HPS libraries, EES-like enzymes, which produce a majority of 4EE, are rare, representing 2.8 and 1.3%, respectively. This difference in product distribution reveals that HPS is a far more robust enzyme than TEAS, retaining its product specificity even as mutations accumulate. These results corroborate recent observations that HPS maintains catalytic specificity better than TEAS, even during chemical and heat denaturation.26 Importantly, kinetic analyses on 335 HPS library members revealed that the majority had kcat values within 10-fold of WT HPS, demonstrating that most mutation combinations across the library left the catalytic rate largely intact (Supporting Information Figure S2).
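The 50% rule used to assign these classes can be written as a short sketch. The function name, the dictionary layout, and the fallback label for a profile dominated by some other product are hypothetical; the example profiles are made up.

```python
def classify_profile(outputs, threshold=50.0):
    """Classify a product profile given as {product name: percent of total}."""
    labels = {"5EA": "TEAS-like", "PS": "HPS-like", "4EE": "EES-like"}
    top_product = max(outputs, key=outputs.get)
    if outputs[top_product] < threshold:
        return "promiscuous"  # no single product reaches 50% of the total
    # "other-specific" is a placeholder label for a dominant minor product.
    return labels.get(top_product, "other-specific")


class_a = classify_profile({"5EA": 62.0, "PS": 20.0, "4EE": 18.0})
class_b = classify_profile({"5EA": 40.0, "PS": 35.0, "4EE": 25.0})
class_c = classify_profile({"5EA": 10.0, "PS": 81.0, "4EE": 9.0})
```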
Coevolutionary Informational Analysis of Product Profile-Sequence Landscape
We next set out to characterize the role of coevolutionary information in TPS “function,” defined here as various aspects of catalytic activity, e.g., product specificity and turnover rate.35,36 We first aligned the WT TEAS and HPS sequences and used this as a seed for the construction of a hidden Markov model of homologues built with HMMsearch from the Uniprot database,30 see Figure 2a. This procedure afforded us a diverse MSA of TPSs related to TEAS and HPS. We then assumed the global joint probability distribution of amino acids at each position in this MSA according to eq 1 (Figure 2a); the mean field DCA17 was employed to infer the parameters of the distribution, which are the site–site couplings (eij) and local fields (hi), or the amino acid biases at each site. With these parameters, one can calculate a “statistical sequence energy” for any given sequence called the Hamiltonian (H) (eq 2). Formally, H can be used to compute the probability of finding a particular sequence in the input protein family. H serves as a proxy for protein fitness37−42 and has been described as a measure of the “typicality” of a protein sequence within its family, with more negative values signifying more family like or typical sequences.43 Coevolutionary information-based scoring has also been shown to predict specificity between histidine kinases and response regulators,44 compatibility between DNA recognition and allosteric response modules in LacI-type transcription inhibitors,19 folding kinetics,20 and mutational phenotypes in protein–RNA complexes.45
Figure 2.

Computational model and correlation of functional specificity and coevolutionary information. (A) TPS family alignment was constructed using an alignment of TEAS and HPS as a seed for HMM profile building and sequence acquisition from the UniProt database. Coupling and local field parameters were then estimated using mfDCA, assuming the global sequence probability distribution of the Potts model (eq 1). The statistical energy of any sequence aligned to the family model is its Hamiltonian score (eq 2). Scatter plots of H scores for each mutant in the TEAS (top row) and HPS (bottom row) M9 libraries vs the percent product output for the native major products (B,E), the engineered products (C,F), which are PS for TEAS and 5EA for HPS, and 4EE (D,G). WTs and M9s are plotted as magenta and green asterisks, respectively. Correlation coefficients (R) shown in each plot are all statistically significant (see Methods).
To discover functional features of TPSs that are captured by coevolutionary information, we plotted the Hamiltonian values for each sequence in the TEAS and HPS M9 mutant libraries against the percent outputs of the three main products of the libraries, 5EA, PS, and 4EE (Figure 2b–g). For convenience, we will refer to 5EA as the “native” and PS as the “engineered” product of TEAS, while PS and 5EA will be the “native” and “engineered” products of HPS, respectively. The H captures product output for the native and engineered products of HPS with correlation values of −0.64 and 0.67 (Figure 2e–f). For TEAS M9 mutants, the correlations are somewhat weaker but still significant at −0.44 and 0.55 for native and engineered product outputs. Note that the native and engineered product correlations run in opposite directions. This occurs because the yield of engineered product comes at the direct expense of the native. M9 mutations influence the favorability of a methyl versus methylene shift, potentially leading to 5EA or PS, respectively (purple vs pink curved arrows, Figure 1a).15 Prediction of 4EE output, on the other hand, is poor, and it also differs substantially between TEAS and HPS (R = 0.42 (HPS), 0.044 (TEAS), see Figure 2d,g). In addition, correlations of H with other minor products are often small or statistically insignificant, likely due to the limited dynamic range of their outputs (Supporting Information Figures S3 and S4 and Tables S1 and S2).
We also assessed the correlations of H with apparent rates of catalysis for 335 members of the HPS M9 library as well as previously published data sets for the TEAS M9 library and the Artemisia annua β-farnesene synthase 6 Å library, with 36 and 101 mutants, respectively.15,25 The Hamiltonians of these mutants correlate relatively poorly with their apparent rates of catalysis (Supporting Information Figure S6). Similarly, the 64 HPS M9 library mutants displaying extremely low activity, which we deemed essentially inactive, have H probability distributions nearly identical to those of active mutants (see Supporting Information Figure S5). However, when we examined the H distributions of native TPSs, we found that TPSs with the same substrate specificities cluster together in broad but distinct groups, with the diterpene synthases being the most typical of the family and the sesquiterpene synthases being the least (Supporting Information Figure S7). Taken together, these results suggest that specificity, whether for product or for substrate, is the catalytic feature that is most strongly tied to coevolutionary information within this protein family.
Role of Intra- and Interdomain Couplings in Catalysis
The global nature of our model allows us to interrogate specific subsets of couplings and identify which are most involved in the various functional attributes of the system. Given that the C-terminal domain contains the active site pocket (Figure 3a), we reasoned that the majority of the positional cross-talk related to catalysis would be confined to pairs within that portion of the sequence. To test this idea, we first verified that the bulk of the coevolutionary information within H is driven by couplings. Hamiltonians calculated from a site-independent model, in which all couplings are set to zero, are not predictive of catalytic specificity, as demonstrated by the reduced R values ranging from −0.001 to −0.3 (see Supporting Information Figure S8). We next calculated Hamiltonian values using only pairs from the C-terminal domains of our TEAS and HPS mutants (HintraC, eq 3). For clarity in this section, we refer to H as Hfull, as it incorporates the full set of site–site couplings across the entire sequence. We observed that as anticipated, HintraC captures catalytic output of the engineered product well for HPS with an R value of 0.65, which is almost identical to that for Hfull (0.67, Figure 2e). This finding suggests that the couplings between the M9 residues and the rest of the C-terminal domain are almost entirely responsible for the observed correlation between Hfull and catalytic output. In the case of TEAS, however, we observed that HintraC does not capture the native product output of TEAS (R = −0.03), suggesting that here, the most important residues coupled with the M9 positions must lie outside the catalytic domain (Figure 3d).
Figure 3.

Comparison of intra- vs interdomain site pair roles in catalytic specificity. (A) Structural schematic of TPS family showing that couplings between site pairs within the catalytic C-terminal domain are used to calculate HintraC, while those between sites across the two domains are used to calculate Hinter. (B) Sequence positions defining pairs included in each domain are given using TEAS numbering. HintraC and Hinter are defined in eqs 3 and 4, respectively. Scatter plots of HintraC (C,D) or Hinter (E,F) vs the engineered product percent outputs of HPS (left panels) and TEAS (right panels) M9 library mutants. Coloring as in Figure 2. All correlation coefficients are statistically significant.
To test this hypothesis, we calculated a Hamiltonian for each sequence in the HPS and TEAS mutant libraries, wherein only interdomain pairs were considered (Hinter). As expected, the correlation of Hinter with output of engineered product for HPS is reduced relative to the HintraC correlation (0.29 vs 0.65, Figure 3c,e). This result confirms that for HPS, the most important pairs related to product specificity are within the catalytic domain. Remarkably, however, the correlation of Hinter with engineered product output for TEAS is starkly increased to 0.5. As the correlation with Hfull was 0.55, this result shows that almost the entirety of the coevolutionary information related to catalytic specificity for the TEAS M9 library lies in pairs spanning the N-terminal and C-terminal domains.
Discovering such a distinction in the functional assignments of site–site couplings of two otherwise highly similar enzymes is a very unexpected result that challenges our understanding of both the sequence–function landscape and the history of TPSs. The N-terminal domain has been considered essentially an evolutionary vestige46 with no known function, save for active site capping features of the unstructured N-terminal tail.47,48 Indeed, the N-terminal domain has been, to our knowledge, ignored in mutational studies assessing the sequence-specificity landscape of TPSs in favor of rational, phylogeny-guided approaches that focus on the active site region in the C-terminal domain.14,15,49−54 Therefore, our findings here represent a major shift in our understanding of the TPS catalytic architecture: for some TPSs, the two domains can work more as a unit to specify catalytic outputs than previously envisioned.
The observation of asymmetrical couplings is not merely a mathematical phenomenon or an artifact of our model but is supported by experimental evidence. To check whether the conserved exon boundaries between TEAS and HPS represented functional cassettes that controlled product specificity, Back and Chappell produced chimeric constructs that mixed and matched segments of TEAS and HPS approximately corresponding to exon boundaries.55 They saw that a chimera pairing the catalytic domain of HPS with the N-terminal domain of TEAS output PS in the same proportions as WT HPS with only a slight reduction in specific activity (28 vs 22 nmol/(mg protein·h)). On the other hand, when the catalytic domain of TEAS was paired with the N-terminal domain of HPS, no enzyme activity could be detected, indicating that interdomain communication is more important for catalytic activity in TEAS than in HPS. Our Potts model of TPS family sequence space now provides an explanation for this asymmetry and clarifies that it is not sequence segments but rather specific pairs of interactions across the entire sequence that control product specificity.
That interdomain interactions differ between HPS and TEAS leads us to ask how individualized the interdomain site pairings are across the TPS family. Consider two extreme scenarios. In the first, there are just two highly conserved classes of interaction modes in the whole family, and HPS and TEAS happen to be examples of each. In the second, there are no conserved interactions across the two domains, and each domain–domain interaction is essentially unique. To examine where the TPS family lies along this spectrum, we took a computational approach utilizing direct information. The method is similar to that of a previous study, which used a direct information score to predict cognate partners of bacterial two-component signaling proteins.21 Here, we first generated a scrambled model by splitting the protein sequences in the MSA between the N- and C-terminal domains, mixing and matching their domain partners, and using mfDCA once again to infer the Potts model parameters (Figure 4a). This procedure impacts the couplings but not the local fields, as the site amino acid frequencies remain the same after scrambling. Importantly, any generic couplings that are conserved across the TPS family interdomain interaction will be preserved, while couplings specific to the cognate domains will be lost. With these scrambled model parameters in hand, we calculated Hamiltonian values for each M9 library sequence using only interdomain couplings. This procedure was repeated 10 times, and the resulting interdomain Hamiltonians were averaged to yield Hscraminter. We then correlated these Hamiltonians with the outputs of engineered products for HPS and TEAS and compared these values to those of the previous model, which was based on intact cognate domain pairs (Figure 4b).
We observe that the Hamiltonian derived from this scrambled model fails to capture any product specificity (Supporting Information Table S1): R is reduced from 0.29 to 0.09 for HPS and from 0.5 to −0.11 for TEAS, suggesting that specific rather than generic interdomain couplings dominate the coupling matrix in our original model.
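The scrambling step described above can be sketched as follows. This is an illustrative sketch only: the integer-encoded alignment, the `boundary` column index, and the function names are hypothetical, and the mfDCA inference that would follow each scramble is not shown. The helper `column_counts` is included to show why the per-site amino acid counts, and hence the unweighted single-site frequencies, are unchanged by the shuffle.

```python
import numpy as np

def scramble_domains(msa, boundary, rng):
    """Return a copy of the alignment in which each N-terminal half is
    paired with the C-terminal half of a randomly chosen sequence."""
    msa = np.asarray(msa)
    perm = rng.permutation(msa.shape[0])      # random reassignment of partners
    scrambled = msa.copy()
    scrambled[:, boundary:] = msa[perm, boundary:]
    return scrambled

def column_counts(msa, q):
    """Per-column amino acid counts, returned with shape (L, q)."""
    return np.stack([np.bincount(col, minlength=q) for col in np.asarray(msa).T])

# Toy alignment: 50 sequences, 12 columns, 21 symbols (20 amino acids + gap).
rng = np.random.default_rng(0)
toy_msa = rng.integers(0, 21, size=(50, 12))
toy_scrambled = scramble_domains(toy_msa, boundary=6, rng=rng)
```

Because whole C-terminal segments are permuted between rows, every column keeps the same multiset of residues while the pairing between N- and C-terminal columns is randomized. Repeating the scramble with different random states and averaging the resulting interdomain scores would mirror the 10-fold procedure in the text.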
Figure 4.
Scrambling of interdomain coevolutionary information. (A) Each sequence in the alignment was split at the junction between the two domains, and then domains were mixed and matched randomly. Coupling and local field parameters were inferred from these scrambled MSAs, and new Hinter scores were calculated for each library sequence. This procedure was done 10 times, and the resulting H scores were averaged to get Hscraminter. (B) Bar charts comparing the correlation coefficients of Hinter and Hscraminter vs major product outputs for HPS and TEAS M9 mutant libraries. The arrow highlights the dramatic loss of connection between interdomain pairing information and product specificity when domains are scrambled, an indication that cognate interdomain pairs lack conservation across the family.
To compare the roles of specific and generic interdomain pairings in predicting structural contacts, we used the X-ray crystal structure of TEAS (PDB ID 5EAT) to map all pairs of residues in the structure whose closest heteroatoms lie within 8 Å of each other (Supporting Information Figure S9a,b). We find that the scrambled model retains zero interdomain predicted contacts among the top 1L unique DI pairs (668 consolidated pairs; see Methods and Supporting Information Figure S9b, red box). This loss of predicted contacts in the interdomain region occurs without the global true positive rates suffering (Supporting Information Figure S9c), confirming that our scrambled model is of high quality and only disrupts interdomain contacts. This result supports the idea that the most significant coevolutionary information across the N- and C-terminal domains is highly specific to each protein in the family. It implies that these sites are not conserved but instead change together.
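The contact criterion and the true positive rate of ranked DI pairs can be sketched as below. The sketch assumes that "heteroatom" means any non-hydrogen atom and that coordinates have already been parsed from the structure file into a dictionary keyed by residue number; both the data layout and the function names are hypothetical.

```python
import numpy as np

def min_heavy_atom_distance(coords_a, coords_b):
    """Smallest distance between any atom of residue A and any atom of residue B."""
    a = np.asarray(coords_a, dtype=float)
    b = np.asarray(coords_b, dtype=float)
    diff = a[:, None, :] - b[None, :, :]       # all atom-atom displacement vectors
    return float(np.sqrt((diff ** 2).sum(axis=-1)).min())

def true_positive_rate(ranked_pairs, residue_coords, cutoff=8.0):
    """Fraction of ranked residue pairs whose closest atoms lie within `cutoff` Å."""
    hits = [
        min_heavy_atom_distance(residue_coords[i], residue_coords[j]) <= cutoff
        for i, j in ranked_pairs
    ]
    return sum(hits) / len(hits)
```

Evaluating `true_positive_rate` on the top-ranked DI pairs from the cognate and scrambled models, first for all pairs and then for interdomain pairs only, would give the kind of comparison summarized in Supporting Information Figure S9.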
Structural Analysis of the Highest Coupled Interdomain Pairs
To investigate the structural bases for the phenomenon exposed by our computational model, we solved the X-ray crystal structure of WT HPS to 2.15 Å resolution and performed comparative structural analyses with TEAS. The X-ray crystallography data collection and refinement statistics are summarized in Supporting Information Table S3. Despite being in different domains, 19/20 of the top interdomain pairs are close contacts within 8 Å (Figure 5a). In fact, the average interheteroatom distance for these 19 pairs is 4.0 ± 1.4 and 4.2 ± 1.2 Å (standard deviation) for HPS and TEAS, respectively. Table 1 shows that our top 20 interdomain pairs are highly ranked overall, ranging from 3rd to 181st out of 152,076 total pairs. This suggests that the connections between the two domains are as important to fitness or family likeness as the connections within each domain. Comparison of the top interdomain DI pairs in HPS vs TEAS also reveals that key interdomain pairs are (i) in the vicinity of the active site pocket and the M9 residues and (ii) forming divergent chemical interactions; given the importance of steric control for product specificity and accuracy,4 we envision that these interdomain structural and possibly dynamical differences that adjoin the active site pocket could play an important role in product specificity for TEAS (Figure 5b–f and Supporting Information Figure S10).
Figure 5.
Structural analysis of interdomain contacts. (A) Superposition of TEAS and HPS (chain A) structures. The rmsd reported in the text is the average of four rmsd values calculated for the TEAS structure with each of the four chains of the HPS structure: 0.78, 0.823, 0.748, and 0.81 Å. The top 20 interdomain pairs are shown with yellow lines connecting their alpha carbons. The dotted line highlights the only pair with a minimum interheteroatom distance greater than 8 Å. Pairs where one residue differs between TEAS and HPS are highlighted as spheres in the TEAS structure shown in panel (B). M9 positions are shown as green spheres. (C–F) Closeups of divergent pairs. The first residue number and lighter colors denote TEAS, whereas the second residue number and darker colors denote HPS. Red numbers label each divergent pair by its ranking, 1–7.
Table 1. Top 20 Interdomain DI Pairs.
| interdom. rank (75,259 total) | global rank (152,076 total) | ranked DI pairs | matched to 5JO7 (HPS) | matched to 5EAT (TEAS) | HPS dist. (Å)^b | TEAS dist. (Å) | HPS residues | TEAS residues |
|---|---|---|---|---|---|---|---|---|
| 1 | 3 | 212–240 | 212–239 | 204–232 | 4.9 | 4.9 | R–D | R–D |
| 2 | 12 | 173–497 | 173–496 | 165–489 | 2.9 | 6.9 | E–K | E–K |
| 3 | 13 | 47–235 | 47–234 | 45–227 | 2.6 | 2.9 | E–R | E–R |
| 4 | 15 | 82–236 | 82–235 | 79–228 | 3.7 | 3.6 | I–F | I–F |
| 5 | 18 | 84–289 | 84–288 | 81–281 | 3.5 | 4.1 | Y–P | Y–P |
| 6 (1)^a | 20 | 224–238 | 223–237 | 216–230 | 2.4 | 2.8 | E–K | D–K |
| 7 | 27 | 167–498 | 167–497 | 159–490 | 2.6 | 3.0 | H–D | H–D |
| 8 | 46 | 213–526 | 213–525 | 205–518 | 6.3 | 5.2 | V–V | V–V |
| 9 | 74 | 82–240 | 82–239 | 79–232 | 3.9 | 4.0 | I–D | I–D |
| 10 (2) | 77 | 204–525 | 204–524 | 196–517 | 2.9 | 5.2 | Q–D | Q–E |
| 11 (3) | 83 | 209–526 | 209–525 | 201–518 | 3.8 | 4.0 | S–V | G–V |
| 12 | 88 | 85–240 | 85–239 | 82–232 | 2.5 | 2.9 | H–D | H–D |
| 13 (4) | 100 | 84–287 | 84–286 | 81–279 | 6.5 | 3.8 | Y–A | Y–F |
| 14 | 125 | 48–236 | 48–235 | 46–228 | 3.8 | 3.9 | I–F | I–F |
| 15 (5) | 143 | 208–493 | 208–492 | 200–485 | 6.4 | 5.9 | K–D | K–E |
| 16 (6) | 148 | 165–493 | 165–492 | 157–485 | 5.8 | 6.2 | R–D | R–E |
| 17 | 152 | 167–501 | 167–500 | 159–493 | 4.0 | 3.3 | H–E | H–E |
| 18 (7) | 153 | 204–529 | 204–528 | 196–521 | 2.8 | 3.9 | Q–K | Q–I |
| 19 | 179 | 166–504 | 166–503 | 158–496 | 5.1 | 4.3 | T–L | T–L |
| 20 | 181 | 53–443 | 53–442 | 51–435 | 37.9 | 37.3 | E–A | E–A |
^a DI pairs with rankings in parentheses are those in which one of the amino acid residues differs between TEAS and HPS.
^b Reported distances are from chain A and are between the closest heteroatoms of the two amino acids noted.
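The summary statistics quoted for the top interdomain pairs can be checked against the distances listed in Table 1. The sketch below copies the tabulated distances and assumes that the average is taken over the pairs within 8 Å and that the sample standard deviation is used; neither choice is stated explicitly in the text, and the function name is hypothetical.

```python
from statistics import mean, stdev

# Closest-heteroatom distances (Å) for interdomain ranks 1-20, copied from Table 1.
HPS_DIST = [4.9, 2.9, 2.6, 3.7, 3.5, 2.4, 2.6, 6.3, 3.9, 2.9,
            3.8, 2.5, 6.5, 3.8, 6.4, 5.8, 4.0, 2.8, 5.1, 37.9]
TEAS_DIST = [4.9, 6.9, 2.9, 3.6, 4.1, 2.8, 3.0, 5.2, 4.0, 5.2,
             4.0, 2.9, 3.8, 3.9, 5.9, 6.2, 3.3, 3.9, 4.3, 37.3]

def summarize_contacts(distances, cutoff=8.0):
    """Return (number of contacts, mean, sample SD) for distances within the cutoff."""
    contacts = [d for d in distances if d <= cutoff]
    return len(contacts), mean(contacts), stdev(contacts)

hps_n, hps_mean, hps_sd = summarize_contacts(HPS_DIST)
teas_n, teas_mean, teas_sd = summarize_contacts(TEAS_DIST)
```

Under these assumptions, both structures give 19 contacts, with means and standard deviations close to the values quoted in the text; because the tabulated distances are rounded to one decimal place, the last digit may differ slightly.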
Interestingly, when we computed the Shannon entropies of the TEAS and HPS library mutants’ product profiles, we once again found divergence in the role of site–site connectivities: they are important for product accuracy in HPS but not in TEAS (Supporting Information Figure S11). A more detailed summary of these results can be found in the Supporting Information Text.
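For readers unfamiliar with the measure, the Shannon entropy of a product profile can be sketched as follows. The logarithm base and the normalization of percent outputs to probabilities are assumptions, as this section does not specify them, and the function name is hypothetical.

```python
import math

def shannon_entropy(percent_outputs, base=2.0):
    """Shannon entropy of a product profile given as percent outputs.

    Outputs are normalized to probabilities; zero entries contribute nothing.
    The log base (bits by default) is an assumption, not taken from the source.
    """
    total = sum(percent_outputs)
    if total <= 0:
        raise ValueError("product profile must contain a positive output")
    probs = [x / total for x in percent_outputs if x > 0]
    return -sum(p * math.log(p, base) for p in probs)
```

A synthase that makes a single product has zero entropy, whereas one that spreads its output evenly over many products has the maximum entropy for that number of products, so lower values correspond to higher catalytic accuracy.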
In this work, we have comprehensively studied the TPS family sequence space with both experimental and global modeling approaches. The two approaches are complementary and together provide insight into the sequence dependence of function and its evolution. The exceptional robustness of HPS toward mutations reported herein is especially intriguing. We show that in the face of accumulating designed mutations, HPS variants tend to maintain native-like product profiles, while product specificity is a far rarer trait among the corresponding TEAS M9 library mutants. The adaptive landscape is thus strongly sensitive to sequence context. Owing to this difference, we propose that the evolutionary mechanisms that could cause a transition from HPS to TEAS would necessarily differ from those that might transition TEAS to HPS. For one, more mutations would, on average, need to accumulate in HPS to significantly increase a selectable new catalytic function. As a consequence, we might expect (i) more diversity in the standing variation of HPS sequences in a population and (ii) more interspecies diversity among PS synthases than would be seen for the EASs. Future work in comparative genomics and population genetics will help us test these predictions.
Previous studies have quantitated positional epistasis as it relates to mutations impacting catalytic specificity, and this remains an important approach to characterizing the fitness landscapes of TPSs and proteins in general.15,25,49 Our work highlights the power of combining these experimental approaches with computational modeling of sequence space using DCA.17 Specifically, catalytically diverse protein families like the TPSs can be successfully modeled using coevolutionary information, and the Hamiltonian score of a mutated sequence captures catalytic features like percent output of the native product and substrate specificity. In the case of HPS, the accuracy of product output, measured by the Shannon entropy of the product profile, is also captured by H, while for TEAS it is not. No relationship between H and the apparent rate of catalysis was observed either. Based on these findings, we speculate that, in nature, the Shannon entropy of the product profile and the rate of catalysis may not be as universally strong a basis for organismal survival and fitness as control over the output of a specific major product.
The global Potts model utilized here also facilitates detection of amino acid connectivities in places we might not have thought to look and would not have encountered in experimental data sets that are necessarily limited in scope. In this way, our TPS family Potts model unveiled catalytically relevant interdomain connectivity to a domain considered catalytically inert and written off as almost entirely extraneous by the TPS field. We show that the involvement of the N-terminal domain is more important for TEAS catalysis than it is for HPS, which corroborates early domain swapping experiments by Back and Chappell wherein replacing the N-terminal domain of HPS with that of TEAS had no impact on product specificity, but doing the reverse destroyed TEAS activity.55 Collectively, these results indicate that despite the high structural and sequence similarity of TPS enzymes, their N- and C-terminal domains are not interchangeable because the functional assignments of pairs connected across the domains are quite diverse. We speculate that other functions besides catalytic specificity, including conformational flexibility and thermostability, might be captured in other synthases by the top interdomain coevolutionary pairs identified here. Indeed, modifications to the interdomain region of a TPS could help in the design of novel TPSs with new substrate and product specificities, as well as in optimizing natural synthases for industrial and medicinal purposes.
Acknowledgments
The authors acknowledge research support from NSF EEC-0813570 and the Howard Hughes Medical Institute (J.P.N.), NIH R35GM133631 (C.M.N. and F.M.), and NSF MCB-1943442 (F.M.). We acknowledge support for P.E.O. from the Biotechnology and Biological Sciences Research Council (BBSRC) Institute Strategic Program Grant BB/J004561/1 (Understanding and Exploiting Plant and Microbial Secondary Metabolism).
Data Availability Statement
The MSA and biochemical data underlying this article are available in online Supporting Information. All scripts including those used for eij and hi parameter inference by DCA and to calculate Hamiltonian and interdomain Hamiltonian scores were written in MATLAB (The MathWorks, Natick, MA) and can be found at https://github.com/morcoslab/interdomain_hamiltonian.
Supporting Information Available
The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/acs.biochem.3c00578.
Product percentage and Shannon entropy of TEAS M9 library and solubility, apparent kcat, and product percentage and Shannon entropy of HPS M9 library (XLSX)
Shannon entropies of TEAS and HPS library mutants’ product profiles (TXT)
Detailed structural analysis, Shannon entropy of product profiles, methods for mutant library construction, and measurement of apparent kcat for HPS mutants; product specificity distributions, distribution of HPS M9 library kcat values, TEAS and HPS library minor products correlations with H scores, KL divergence comparisons of active vs inactive HPS mutants, apparent kcat correlations with H scores, TPS family H score distribution highlighting substrate specificities, site-independent model H correlations, contact maps, and structural analysis; and major and minor products correlated with H scores, minor products correlated with H scores, and crystal and X-ray diffraction data of HPS (PDF)
Accession Codes
The atomic coordinates and crystallographic structure factors of Egyptian henbane premnaspirodiene synthase have been deposited in the PDB (http://www.rcsb.org) with accession code 5JO7.
Author Present Address
¶ Seoul National University College of Agriculture and Life Sciences, Forestry & Bioresources and Plant Genomics and Breeding Institute and Department of Agriculture, Seoul 08826, Republic of Korea
∇ Biological Sciences, University of East Anglia, Norwich Research Park, Norwich, Norfolk, NR4 7TJ, U.K.
○ SRI International, Biosciences Division, Menlo Park, California, 92122, United States
Author Contributions
†† C.M.N. and H.J.K. contributed equally to this work.
The authors declare no competing financial interest.
References
- (1) Berthelot K.; Estevez Y.; Deffieux A.; Peruch F. Isopentenyl diphosphate isomerase: a checkpoint to isoprenoid biosynthesis. Biochimie 2012, 94, 1621–1634. 10.1016/j.biochi.2012.03.021.
- (2) Köksal M.; Hu H.; Coates R. M.; Peters R. J.; Christianson D. W. Structure and mechanism of the diterpene cyclase ent-copalyl diphosphate synthase. Nat. Chem. Biol. 2011, 7, 431–433. 10.1038/nchembio.578.
- (3) Buckingham J. Dictionary of Natural Products; Taylor & Francis Group, 2014.
- (4) Hess B. A.; Smentek L.; Noel J. P.; O’Maille P. E. Physical constraints on sesquiterpene diversity arising from cyclization of the eudesm-5-yl carbocation. J. Am. Chem. Soc. 2011, 133, 12632–12641. 10.1021/ja203342p.
- (5) Mithöfer A.; Boland W. Plant defense against herbivores: chemical aspects. Annu. Rev. Plant Biol. 2012, 63, 431–450. 10.1146/annurev-arplant-042110-103854.
- (6) Boncan D. A. T.; Tsang S. S.; Li C.; Lee I. H.; Lam H.-M.; Chan T.-F.; Hui J. H. Terpenes and terpenoids in plants: Interactions with environment and insects. Int. J. Mol. Sci. 2020, 21, 7382. 10.3390/ijms21197382.
- (7) Pichersky E.; Raguso R. A. Why do plants produce so many terpenoid compounds? New Phytol. 2018, 220, 692–702. 10.1111/nph.14178.
- (8) Schwab W.; Fuchs C.; Huang F.-C. Transformation of terpenes into fine chemicals. Eur. J. Lipid Sci. Technol. 2013, 115, 3–8. 10.1002/ejlt.201200157.
- (9) Raut J. S.; Shinde R. B.; Chauhan N. M.; Mohan Karuppayil S. Terpenoids of plant origin inhibit morphogenesis, adhesion, and biofilm formation by Candida albicans. Biofouling 2013, 29, 87–96. 10.1080/08927014.2012.749398.
- (10) Wallaart T. E.; Bouwmeester H. J.; Hille J.; Poppinga L.; Maijers N. C. Amorpha-4,11-diene synthase: cloning and functional expression of a key enzyme in the biosynthetic pathway of the novel antimalarial drug artemisinin. Planta 2001, 212, 460–465. 10.1007/s004250000428.
- (11) Sugahara S.-i.; Kajiki M.; Kuriyama H.; Kobayashi T.-r. Paclitaxel delivery systems: the use of amino acid linkers in the conjugation of paclitaxel with carboxymethyldextran to create prodrugs. Biol. Pharm. Bull. 2002, 25, 632–641. 10.1248/bpb.25.632.
- (12) Cao R.; Zhang Y.; Mann F. M.; Huang C.; Mukkamala D.; Hudock M. P.; Mead M. E.; Prisic S.; Wang K.; Lin F.-Y.; et al. Diterpene cyclases and the nature of the isoprene fold. Proteins: Struct., Funct., Bioinf. 2010, 78, 2417–2432. 10.1002/prot.22751.
- (13) Trapp S. C.; Croteau R. B. Genomic organization of plant terpene synthases and molecular evolutionary implications. Genetics 2001, 158, 811–832. 10.1093/genetics/158.2.811.
- (14) Greenhagen B. T.; O’Maille P. E.; Noel J. P.; Chappell J. Identifying and manipulating structural determinates linking catalytic specificities in terpene synthases. Proc. Natl. Acad. Sci. U.S.A. 2006, 103, 9826–9831. 10.1073/pnas.0601605103.
- (15) O’Maille P. E.; Malone A.; Dellas N.; Andes Hess B.; Smentek L.; Sheehan I.; Greenhagen B. T.; Chappell J.; Manning G.; Noel J. P. Quantitative exploration of the catalytic landscape separating divergent plant sesquiterpene synthases. Nat. Chem. Biol. 2008, 4, 617–623. 10.1038/nchembio.113.
- (16) O’Maille P. E.; Chappell J.; Noel J. P. A single-vial analytical and quantitative gas chromatography–mass spectrometry assay for terpene synthases. Anal. Biochem. 2004, 335, 210–217. 10.1016/j.ab.2004.09.011.
- (17) Morcos F.; Pagnani A.; Lunt B.; Bertolino A.; Marks D. S.; Sander C.; Zecchina R.; Onuchic J. N.; Hwa T.; Weigt M. Direct-coupling analysis of residue coevolution captures native contacts across many protein families. Proc. Natl. Acad. Sci. U.S.A. 2011, 108, E1293–E1301. 10.1073/pnas.1111471108.
- (18) Dos Santos R. N.; Khan S.; Morcos F. Characterization of C-ring component assembly in flagellar motors from amino acid coevolution. R. Soc. Open Sci. 2018, 5, 171854. 10.1098/rsos.171854.
- (19) Jiang X.-L.; Dimas R. P.; Chan C. T.; Morcos F. Coevolutionary methods enable robust design of modular repressors by reestablishing intra-protein interactions. Nat. Commun. 2021, 12, 5592–5598. 10.1038/s41467-021-25851-6.
- (20) Morcos F.; Schafer N. P.; Cheng R. R.; Onuchic J. N.; Wolynes P. G. Coevolutionary information, protein folding landscapes, and the thermodynamics of natural selection. Proc. Natl. Acad. Sci. U.S.A. 2014, 111, 12408–12413. 10.1073/pnas.1413575111.
- (21) Cheng R. R.; Morcos F.; Levine H.; Onuchic J. N. Toward rationally redesigning bacterial two-component signaling systems using coevolutionary information. Proc. Natl. Acad. Sci. U.S.A. 2014, 111, E563–E571. 10.1073/pnas.1323734111.
- (22) O’Maille P. E.; Bakhtina M.; Tsai M.-D. Structure-based combinatorial protein engineering (SCOPE). J. Mol. Biol. 2002, 321, 677–691. 10.1016/S0022-2836(02)00675-7.
- (23) O’Maille P. E.; Chappell J.; Noel J. P. Biosynthetic potential of sesquiterpene synthases: alternative products of tobacco 5-epi-aristolochene synthase. Arch. Biochem. Biophys. 2006, 448, 73–82. 10.1016/j.abb.2005.10.028.
- (24) Schneider C. A.; Rasband W. S.; Eliceiri K. W. NIH Image to ImageJ: 25 years of image analysis. Nat. Methods 2012, 9, 671–675. 10.1038/nmeth.2089.
- (25) Salmon M.; Laurendon C.; Vardakou M.; Cheema J.; Defernez M.; Green S.; Faraldos J. A.; O’Maille P. E. Emergence of terpene cyclization in Artemisia annua. Nat. Commun. 2015, 6, 6143–6210. 10.1038/ncomms7143.
- (26) Koo H. J.; Vickery C. R.; Xu Y.; Louie G. V.; O’Maille P. E.; Bowman M.; Nartey C. M.; Burkart M. D.; Noel J. P. Biosynthetic potential of sesquiterpene synthases: product profiles of Egyptian Henbane premnaspirodiene synthase and related mutants. J. Antibiot. 2016, 69, 524–533. 10.1038/ja.2016.68.
- (27) Broeckling C. D.; Reddy I. R.; Duran A. L.; Zhao X.; Sumner L. W. MET-IDEA: data extraction tool for mass spectrometry-based metabolomics. Anal. Chem. 2006, 78, 4334–4341. 10.1021/ac0521596.
- (28) Sievers F.; Wilm A.; Dineen D.; Gibson T. J.; Karplus K.; Li W.; Lopez R.; McWilliam H.; Remmert M.; Söding J.; et al. Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol. Syst. Biol. 2011, 7, 539. 10.1038/msb.2011.75.
- (29) Goujon M.; McWilliam H.; Li W.; Valentin F.; Squizzato S.; Paern J.; Lopez R. A new bioinformatics analysis tools framework at EMBL–EBI. Nucleic Acids Res. 2010, 38, W695–W699. 10.1093/nar/gkq313.
- (30) Bateman A.; Martin M. J.; Orchard S.; Magrane M.; Agivetova R.; Ahmad S.; Alpi E.; Bowler-Barnett E. H.; Britto R.; Bursteinas B.; et al. UniProt: the universal protein knowledgebase in 2021. Nucleic Acids Res. 2021, 49, D480–D489. 10.1093/nar/gkaa1100.
- (31) Battye T. G. G.; Kontogiannis L.; Johnson O.; Powell H. R.; Leslie A. G. iMOSFLM: a new graphical interface for diffraction-image processing with MOSFLM. Acta Crystallogr., Sect. D: Biol. Crystallogr. 2011, 67, 271–281. 10.1107/S0907444910048675.
- (32) Winn M. D.; Ballard C. C.; Cowtan K. D.; Dodson E. J.; Emsley P.; Evans P. R.; Keegan R. M.; Krissinel E. B.; Leslie A. G.; McCoy A.; et al. Overview of the CCP4 suite and current developments. Acta Crystallogr., Sect. D: Biol. Crystallogr. 2011, 67, 235–242. 10.1107/s0907444910045749.
- (33) Adams P. D.; Afonine P. V.; Bunkóczi G.; Chen V. B.; Davis I. W.; Echols N.; Headd J. J.; Hung L.-W.; Kapral G. J.; Grosse-Kunstleve R. W.; et al. PHENIX: a comprehensive Python-based system for macromolecular structure solution. Acta Crystallogr., Sect. D: Biol. Crystallogr. 2010, 66, 213–221. 10.1107/s0907444909052925.
- (34) Emsley P.; Cowtan K. Coot: model-building tools for molecular graphics. Acta Crystallogr., Sect. D: Biol. Crystallogr. 2004, 60, 2126–2132. 10.1107/S0907444904019158.
- (35) Keeling D. M.; Garza P.; Nartey C. M.; Carvunis A.-R. Philosophy of Biology: The meanings of ‘function’ in biology and the problematic case of de novo gene emergence. Elife 2019, 8, e47014. 10.7554/elife.47014.
- (36) Doolittle W. F. We simply cannot go on being so vague about ‘function’. Genome Biol. 2018, 19, 223. 10.1186/s13059-018-1600-4.
- (37) Cheng R. R.; Nordesjö O.; Hayes R. L.; Levine H.; Flores S. C.; Onuchic J. N.; Morcos F. Connecting the sequence-space of bacterial signaling proteins to phenotypes using coevolutionary landscapes. Mol. Biol. Evol. 2016, 33, 3054–3064. 10.1093/molbev/msw188.
- (38) Levy R. M.; Haldane A.; Flynn W. F. Potts Hamiltonian models of protein co-variation, free energy landscapes, and evolutionary fitness. Curr. Opin. Struct. Biol. 2017, 43, 55–62. 10.1016/j.sbi.2016.11.004.
- (39) Cocco S.; Feinauer C.; Figliuzzi M.; Monasson R.; Weigt M. Inverse statistical physics of protein sequences: a key issues review. Rep. Prog. Phys. 2018, 81, 032601. 10.1088/1361-6633/aa9965.
- (40) Ziegler C.; Martin J.; Sinner C.; Morcos F. Latent generative landscapes as maps of functional diversity in protein sequence space. Nat. Commun. 2023, 14, 2222. 10.1038/s41467-023-37958-z.
- (41) Bisardi M.; Rodriguez-Rivas J.; Zamponi F.; Weigt M. Modeling sequence-space exploration and emergence of epistatic signals in protein evolution. Mol. Biol. Evol. 2022, 39, msab321. 10.1093/molbev/msab321.
- (42) Figliuzzi M.; Jacquier H.; Schug A.; Tenaillon O.; Weigt M. Coevolutionary landscape inference and the context-dependence of mutations in beta-lactamase TEM-1. Mol. Biol. Evol. 2016, 33, 268–280. 10.1093/molbev/msv211.
- (43) de la Paz J. A.; Nartey C. M.; Yuvaraj M.; Morcos F. Epistatic contributions promote the unification of incompatible models of neutral molecular evolution. Proc. Natl. Acad. Sci. U.S.A. 2020, 117, 5873–5882. 10.1073/pnas.1913071117.
- (44) Sinner C.; Ziegler C.; Jung Y. H.; Jiang X.; Morcos F. ELIHKSIR Web Server: Evolutionary Links Inferred for Histidine Kinase Sensors Interacting with Response Regulators. Entropy 2021, 23, 170. 10.3390/e23020170.
- (45) Zhou Q.; Kunder N.; De la Paz J. A.; Lasley A. E.; Bhat V. D.; Morcos F.; Campbell Z. T. Global pairwise RNA interaction landscapes reveal core features of protein recognition. Nat. Commun. 2018, 9, 2511–2610. 10.1038/s41467-018-04729-0.
- (46) Starks C. M.; Back K.; Chappell J.; Noel J. P. Structural basis for cyclic terpene biosynthesis by tobacco 5-epi-aristolochene synthase. Science 1997, 277, 1815–1820. 10.1126/science.277.5333.1815.
- (47) Whittington D. A.; Wise M. L.; Urbansky M.; Coates R. M.; Croteau R. B.; Christianson D. W. Bornyl diphosphate synthase: structure and strategy for carbocation manipulation by a terpenoid cyclase. Proc. Natl. Acad. Sci. U.S.A. 2002, 99, 15375–15380. 10.1073/pnas.232591099.
- (48) Williams D. C.; McGarvey D. J.; Katahira E. J.; Croteau R. Truncation of limonene synthase preprotein provides a fully active ‘pseudomature’ form of this monoterpene cyclase and reveals the function of the amino-terminal arginine pair. Biochemistry 1998, 37, 12213–12220. 10.1021/bi980854k.
- (49) Ballal A.; Laurendon C.; Salmon M.; Vardakou M.; Cheema J.; Defernez M.; O’Maille P. E.; Morozov A. V. Sparse epistatic patterns in the evolution of terpene synthases. Mol. Biol. Evol. 2020, 37, 1907–1924. 10.1093/molbev/msaa052.
- (50) Leferink N. G. H.; Dunstan M. S.; Hollywood K. A.; Swainston N.; Currin A.; Jervis A. J.; Takano E.; Scrutton N. S. An automated pipeline for the screening of diverse monoterpene synthase libraries. Sci. Rep. 2019, 9, 11936–12012. 10.1038/s41598-019-48452-2.
- (51) Yoshikuni Y.; Martin V. J.; Ferrin T. E.; Keasling J. D. Engineering cotton (+)-δ-cadinene synthase to an altered function: germacrene D-4-ol synthase. Chem. Biol. 2006, 13, 91–98. 10.1016/j.chembiol.2005.10.016.
- (52) Kampranis S. C.; Ioannidis D.; Purvis A.; Mahrez W.; Ninga E.; Katerelos N. A.; Anssour S.; Dunwell J. M.; Degenhardt J.; Makris A. M.; et al. Rational Conversion of Substrate and Product Specificity in a Salvia Monoterpene Synthase: Structural Insights into the Evolution of Terpene Synthase Function. Plant Cell 2007, 19, 1994–2005. 10.1105/tpc.106.047779.
- (53) Segura M. J.; Jackson B. E.; Matsuda S. P. Mutagenesis approaches to deduce structure–function relationships in terpene synthases. Nat. Prod. Rep. 2003, 20, 304–317. 10.1039/B008338K.
- (54) Srividya N.; Davis E. M.; Croteau R. B.; Lange B. M. Functional analysis of (4S)-limonene synthase mutants reveals determinants of catalytic outcome in a model monoterpene synthase. Proc. Natl. Acad. Sci. U.S.A. 2015, 112, 3332–3337. 10.1073/pnas.1501203112.
- (55) Back K.; Chappell J. Identifying functional domains within terpene cyclases using a domain-swapping strategy. Proc. Natl. Acad. Sci. U.S.A. 1996, 93, 6841–6845. 10.1073/pnas.93.13.6841.