Abstract
How a protein’s function influences the shape of its fitness landscape, smooth or rugged, is a fundamental question in evolutionary biochemistry. Smooth landscapes arise when incremental mutational steps lead to a progressive change in function, as commonly seen in enzymes and binding proteins. On the other hand, rugged landscapes are poorly understood because of the inherent unpredictability of how sequence changes affect function. Here, we experimentally characterize the entire sequence phylogeny, comprising 1158 extant and ancestral sequences, of the DNA-binding domain (DBD) of the LacI/GalR transcriptional repressor family. Our analysis revealed an extremely rugged landscape with rapid switching of specificity even between adjacent nodes. Further, the ruggedness arises from the need for the repressor to simultaneously evolve specificity for asymmetric operators and to disfavor potentially adverse regulatory crosstalk. Our study provides fundamental insight into evolutionary, molecular, and biophysical rules of genetic regulation through the lens of fitness landscapes.
Keywords: ancestral sequence reconstruction (ASR), protein evolution, deep mutational scanning (DMS), epistasis, sequence space, fitness landscape
In brief
Understanding how a protein’s function shapes its fitness landscape is crucial in molecular evolution. Meger, Spence, et al. experimentally characterized the DNA specificity landscape in the LacI/GalR family of gene regulators, revealing its extreme ruggedness and providing insights into the evolutionary dynamics shaping DNA specificity and regulatory function.
INTRODUCTION
A central question in molecular evolution is how proteins acquire novel functions, such as new binding specificities, via mutation and selection. The sequence-fitness landscape is a valuable construct that has allowed us to understand and visualize the complex relationship between evolutionary sequence changes and protein function.1–7 The topology of a fitness landscape largely reflects the nature of the evolutionary process: “smooth” landscapes, with small numbers of discrete but connected peaks across which new activities can gradually evolve via additive and predictable mutational steps, are relatively well understood, with many enzymes and binding proteins as examples.2,8–12 Importantly, many of these studies span short mutational trajectories that may appear smooth. The molecular basis for the smoothness of these transitions can be partly explained by the modulation of the conformational dynamics of proteins, in which mutations can gradually shift the conformational equilibrium towards conformations better suited to the new activity.13–15 In contrast, our understanding and characterization of “rugged” fitness landscapes, in which mutations tend to have unpredictable epistatic effects that result in fitness landscapes consisting of multiple peaks separated by many non-functional sequences (“valleys”), is less well developed.16–18 One hypothesis is that the sequence-fitness landscapes, and hence evolvability, of certain proteins are inherently defined by the function and fold.19 This partially deterministic view has important implications for evolution of new activities (such as drug19,20 or vaccine resistance21) and protein engineering.22,23
To understand and characterize the role of rugged fitness landscapes in molecular evolution, we need to comprehensively map sequence-function relationships over large and diverse sequence spaces that span the full evolutionary history of protein families. While directed evolution and deep mutational scanning (DMS) experiments have provided profound insight into molecular evolution at the level of single mutational steps, they seldom explore the large evolutionary timescales and the full diversity of sequence space within protein families.3,4,9 In contrast, ancestral sequence reconstruction (ASR) can sample large spans of sequence space through the computational reconstruction of ancestral species from a phylogenetic tree and a sequence evolution model.24,25 However, ASR studies generally characterize a relatively small number of divergent ancestral sequences, which makes it difficult to deconvolute adaptive and neutral mutations and understand the stepwise additive and context-specific (epistatic) effects of mutations.26
In contrast to enzymes, which have been shown to often evolve across relatively smooth fitness landscapes in which promiscuous activities can be gradually optimized to become the primary function,13–15 regulatory proteins, such as transcription regulators, may evolve over a more rugged fitness landscape as they are generally defined by high DNA specificity, and incremental mutations leading to promiscuity could be evolutionarily disfavored. Indeed, there has been significant previous work on understanding how epistasis has shaped the evolution of eukaryotic glucocorticoid receptors27,28 and extensive investigation into the biophysical complexity of zinc finger transcriptional regulator evolution29,30. The lac repressor (LacI), which belongs to the LacI/GalR family (LGF) of prokaryotic gene regulators,31 is a well-studied model for DNA recognition and allostery.32 Proteins of the LGF show remarkable diversity in their amino acid sequences and their DNA recognition. While much is known about the operator specificity of Escherichia coli LacI (EcLacI),33–37 our understanding of the sequence-fitness landscape of DNA-binding specificity in the wider LGF, the historical evolutionary trajectory of operator recognition and the molecular mechanisms that underpin the evolution of DNA-specificity in proteins remains incomplete. The LGF is thus a compelling system to study ruggedness in sequence-fitness landscapes and how the combination of structure and function dictates the evolutionary dynamics and selective pressures exerted on these regulators.
In this study, we experimentally characterize a complete phylogenetic tree (1158 extant and ancestral sequences) of the LGF and perform DMS on extant EcLacI to reveal the sequence-fitness landscape of DNA specificity for the E. coli Lac operator. We find the landscape to be extremely rugged due to high levels of epistasis, with most sequences having no affinity for the E. coli lac operator sequence. However, the screen unearthed dozens of functional repressors from distinct phylogenetic clades, with as many as 32/60 amino acid substitutions in the DNA binding domain (DBD) compared to the EcLacI DBD. Analysis of the local evolutionary landscape within clades shows gain/loss of function between adjacent nodes, indicating rapid switches of specificity, which may be beneficial for developing orthogonal genetic regulation. The molecular basis for this observation is revealed through in vitro binding assays and simulations, which show that ruggedness of the LGF fitness landscape arises in part due to the necessity for regulators to simultaneously evolve specificity for either DNA half-site in asymmetric operators and explains why this fold and operator structure has been evolutionarily selected for genetic regulation.
RESULTS
Synthesis and characterization of a complete phylogeny of LGF DNA-binding domains.
The E. coli lac operator sequence has emerged over a ~3-billion-year evolutionary history. To generate a set of DBDs that spans the full evolutionarily accessible sequence space of the LGF, we performed phylogenetic inference to reconstruct 577 ancestral sequences from a dataset of 581 extant EcLacI homologs (Fig. 1a). Consistent with previous phylogenetic studies,38 our analysis reconstructs the family as three major lineages. When rooted at the phylogenetic midpoint, these include the LacI-clade as the most ancestral, a single common ancestor giving rise to descendant clades comprising the catabolite control protein (CcpA/RegA), ribose, maltose, sucrose, galactose, arbutin/salicin, purine and cytosine regulators (RbsR, MalR, SacR/ScrR, GalR/GalS, AscG, PurR and CytR, respectively), and an uncharacterized lineage of LacI-like proteins (Fig. 1a). Midpoint rooting, which places the root at the center of the longest branch, is a common practice in comprehensive phylogenetic analyses39–41; however, it can lead to inaccurate interpretations of phylogenetic topologies in the absence of an appropriate outgroup. The placement of the root at the midpoint of the phylogeny we present in Fig. 1a is corroborated by previous phylogenetic analyses of the LGF, which find the LacI lineage to be the most ancestral within the family when distantly homologous solute-binding proteins are included as an outgroup to root the tree.38 Moreover, as the focus of this work is on the relationships between sequence and function, which are invariant to the placement of the root, rather than an accurate reconstruction of evolutionary events within the family, the root position here is largely arbitrary.
Fig. 1. Evolution of LacO recognition.
(a), Phylogenetic inference of the LacI/GalR family of prokaryotic transcription factors. 577 sequences were ancestrally reconstructed from a dataset of 581 extant LacI homologs (Supplementary Fig. 1, see “methods”). Color gradient shows average repression scores from replicate (n=2) high-throughput screens. (b), Structure (PDB ID: 1EFA) of dimeric extant E. coli lac repressor bound to LacO (left). LBD (orange), DBD (blue), and LacO (gray) are shown in cartoon representation. Each extant and ancestral DBD was generated using multiplex micro-array DNA synthesis (right). (c), In vivo characterization of repressor function. DBDs were fused to the E. coli LacI LBD and constitutively expressed on an accessory plasmid. A reporter construct consisting of the gene encoding sfGFP under transcriptional control of LacO was integrated into the genome of E. coli to assay repressor function. (d), Flow cytometry of the pre-selected phylogenetic DBD library (top) and after two sequential sorts to isolate repression competent DBDs (bottom). The indicated gate was used for selection by FACS (Supplementary Figs. 8 and 9). RFU, relative fluorescence units.
On average, the mean posterior probability of each ancestral DBD was 93%, indicating strong statistical support (according to previous benchmarks on statistical uncertainty in ASR42) across the full dataset (Supplementary Fig. 1a). Likewise, phylogenetic branch supports were consistently high across the topology (Supplementary Fig. 1b), and the approximately unbiased test failed to reject the topology we present in Fig. 1 among 9 other tree-search replicates (P-value=0.682)43 (Supplementary Fig. 2).
One limitation of ASR is that, while it is possible to generate hundreds of ancestral sequences computationally, only a handful (<10) of evolutionarily distant nodes are typically characterized experimentally. Thus, the incremental functional adaptations associated with the full evolutionary trajectory can remain obscured by neutral variation. Here, we synthesized the DBDs of all 577 reconstructed ancestors, as well as all 581 extant sequences used in our phylogeny, generating a dataset of 1158 diverse DBDs that fully covers the phylogenetic tree constructed in this work (Fig. 1a). The pairwise sequence identity between DBDs in this dataset ranged from 27% to 99% and was on average 58% (~25 mutations over the 60-residue-long DBD; Supplementary Fig. 1c), consistent with the extensive divergence expected within the LGF. We used chip-based oligonucleotides to synthesize the DBD sequences and cloned them into a plasmid library to encode chimeric variants consisting of DBDs fused to the ligand-binding domain (LBD, residues 61–360) of EcLacI, which is induced by allolactose/IPTG (Fig. 1b). Extant sequences with DBDs exceeding 60 amino acids (396 of 581) were aligned to the EcLacI sequence and truncated to 60 amino acids from the N-terminus due to limitations in oligonucleotide chip synthesis length. We constructed chimeras with an invariant LBD for several reasons: (i) by studying the ancestral DBD in the context of the EcLacI LBD, we can be confident that the regulatory mode of the chimeras will match that of the extant EcLacI and will repress in the absence of an allosteric effector;44 (ii) since the inducer of the EcLacI LBD is known, we can study the allosteric communication between LBD and DBD in the chimeras; (iii) we can ensure that the experimental conditions we used were conducive to repression, giving accurate insight exclusively into DNA specificity in the DBD.
This approach builds on the observation that LGF LBDs and DBDs are generally modular and can be swapped within the family without qualitative changes in the function of either domain.44,45 Deep sequencing the plasmid library confirmed 100% coverage of the DBD library with minimal skew i.e., all 1158 DBDs were present (Supplementary Fig. 7).
To characterize the affinity of the DBDs within the phylogeny to the E. coli lac operator and their allosteric activation by IPTG, we used a cell-based pooled screening strategy (Fig. 1c). We inserted a single-copy GFP cassette under the control of the lac operator (LacO) into the E. coli genome. Variants that can bind to LacO with sufficient affinity repress GFP expression, while those that cannot produce high GFP expression. Using fluorescence-activated cell sorting (FACS), repression-competent variants are enriched by sorting low GFP cells. Only a small fraction (2.5–2.6%) of low-fluorescence cells were observed via flow cytometry, indicating that relatively few extant and ancestral DBDs in this library repress LacO (Fig. 1d). The subset of repression competent variants was enriched to 92.7–93.3% of the total population after two rounds of sorting (Supplementary Fig. 9). The pre- and post-sorted libraries were sequenced to estimate the enrichment ratios. A higher enrichment ratio implies greater GFP repression and is a proxy for protein-DNA affinity. The log-transformed enrichment ratios were highly correlated for the enriched mutants (R2=0.99, Spearman ρ=0.99, Supplementary Fig. 10) between independent replicates but less correlated in the full dataset (R2=0.34, Spearman ρ=0.21) due to greater measurement noise in the depleted (non-functional repressors of LacO) mutants. We identified 15 ancestral and 7 extant DBDs from the 1158-member library that were enriched after sorting. These 22 enriched DBDs comprise 1.9% of the presorted library, consistent with the fraction of the low-fluorescence population observed via flow cytometry.
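The enrichment calculation described above can be illustrated with a minimal sketch. It assumes the enrichment ratio is the post-sort read frequency of a variant divided by its pre-sort read frequency; the pseudocount, the function name `log2_enrichment`, the variant labels, and the read counts are all hypothetical and are not taken from the screen.

```python
import math

def log2_enrichment(pre_counts, post_counts, pseudocount=0.5):
    """Return log2(post-sort frequency / pre-sort frequency) per variant.

    The pseudocount (an assumption of this sketch) keeps variants with
    zero reads in one library from producing log(0) or division by zero.
    """
    variants = sorted(set(pre_counts) | set(post_counts))
    n = len(variants)
    pre_total = sum(pre_counts.get(v, 0) for v in variants) + pseudocount * n
    post_total = sum(post_counts.get(v, 0) for v in variants) + pseudocount * n
    scores = {}
    for v in variants:
        pre_freq = (pre_counts.get(v, 0) + pseudocount) / pre_total
        post_freq = (post_counts.get(v, 0) + pseudocount) / post_total
        scores[v] = math.log2(post_freq / pre_freq)
    return scores

# Toy read counts (made up for illustration only).
pre = {"variant_A": 100, "variant_B": 100, "variant_C": 100}
post = {"variant_A": 450, "variant_B": 440, "variant_C": 10}
scores = log2_enrichment(pre, post)
```

In this toy example `variant_A` and `variant_B` gain read share after sorting and receive positive scores, whereas `variant_C` is depleted and receives a negative score.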
The sequence-fitness landscape of the DNA binding domains is rugged.
A fundamental question in molecular evolution is whether the emergence of new functions is mediated by the gradual partitioning of functions (smooth) or by discrete switches (rugged). On a smooth fitness landscape, we would expect a single lineage to exhibit gradual changes in activity and maintain significant promiscuity. For example, the binding specificities of proteins within the periplasmic amino acid binding protein family (same fold as the LBD of the LGF) were shown to gradually alter over time along evolutionary lineages through the selection of promiscuous functions.26,46 In contrast, a rugged fitness landscape involving discrete specificity switches would result in different functions being sparsely dispersed across the phylogeny. Our results reveal a rugged fitness landscape for the DBDs, in which functional repressors are dispersed within multiple lineages across the phylogenetic tree (Figs. 1a and 2a–c). Indeed, LacO repression appears to have emerged in three evolutionarily distinct lineages: the EcLacI clade, the PurR/CytR clade and a third clade that we dub the functional LacI-like regulators. Despite their close homology with EcLacI, other sequences in the LacI lineage are unable to repress LacO. This observation highlights the ruggedness of the fitness landscape. Notably, these clades are evolutionarily distinct and share an ancestor only at the LCA of the full LGF, which diverged ~3 Gya.
Fig. 2. Sequence-fitness landscape and phylogenetic mapping of LacO recognition.
(a-c), Lineages containing functional repressors of LacO. Ancestrally reconstructed sequences are numbered by tree node and extant sequences are indicated with UniProt IDs. Red nodes indicate clonally validated functional repressors of LacO within the LacI lineage (a), functional LacI-like regulators (b), and PurR/CytR-like regulators (c), and gray nodes indicate non-functional repressors of LacO. Node labels are included for all repression competent variants and a subset of non-functional repressors specifically mentioned in the main text. The scale bar, shared between (a-c), is shown in (c) below the phylogeny. (d), Correlation between enrichment scores of the high-throughput screen and clonally assayed fluorescence intensities. Enrichment values were determined by comparing NGS distributions of the pre- and post-sorted populations. Error bars denote the standard deviation of replicate (n=2) sorting and NGS experiments. Mean fluorescence was normalized to OD600. The red highlighted region indicates “functional repressors” (log2(enrichment) > 0 and clonal fluorescence < 36,000 RFU OD−1). (e), Primary sequences of characterized ancestral and extant DBDs and overall structural topology of the DBD (top). (f), Sequence-fitness landscape of repressor function. Enrichment values were mapped onto the t-SNE landscape of the OHE represented phylogenetic DBD sequences (see “methods”) to assess ruggedness. RFU, relative fluorescence units. OD, optical density.
The DBD sequence makes up only a small proportion (1/6th) of the full-length regulators on which phylogenetic analysis and ASR were performed. As phylogenetic signal from the LBD, which accounts for most of the sequence, may have confounded the true divergence between DBDs, we performed an identical phylogenetic analysis on the DBD sequence alone. As with the full-length phylogeny, the DBD-only phylogeny resolves into three major lineages, and incongruencies between the trees are predominantly within each lineage where LacO repression has emerged, rather than between them, indicating that phylogenetic incongruency alone does not account for the observed ruggedness (Supplementary Fig. 2). Moreover, as the ML reconstruction is a site-wise algorithm that operates on alignment columns in isolation47, the sequence of the LBD included in the reconstruction would not have influenced the sequence of the DBD beyond implicitly altering the tree structure and branch lengths. We also find no correlation between the mean posterior probability of a DBD and its log-enrichment (Supplementary Fig. 1d), indicating that the observed ruggedness is not an artifact arising from some DBDs being statistically better supported than others. These results together suggest that ruggedness in the evolution of LacO repression is not the consequence of artifacts introduced during phylogenetic inference or ASR.
To better understand the dynamics of LacO repression divergence, we performed ancestral trait reconstruction (ATR) to estimate the expected repression competence of ancestral nodes, given the extant protein phenotypes. Our approach assumes that continuous trait (i.e. log-enrichment) evolution occurs as a diffusion process that can be modelled with Brownian motion, where differences in extant phenotypes emerge through independent, stochastic fluctuations over the phylogenetic topology48. The parameters of the continuous trait Brownian motion model are fitted by maximizing the likelihood of the extant observations over the phylogenetic topology and branch lengths48. ATR therefore provided an expected fitness landscape for LacO repression, had it diverged over a smooth and diffusive fitness landscape with respect to the phylogenetic topology. We find many topological differences in the emergence and loss of repression competence between the expected (ATR) and ground-truth fitness landscapes (Supplementary Fig. 3). For example, under the ATR model, ancestral sequences within the EcLacI lineage are expected to have gradually acquired repression competence prior to the emergence of the EcLacI fitness peak, whereas the empirical data show the loss of repression competence in the immediate ancestors of EcLacI. The former mode of gradual divergence is reminiscent of previous observations in the emergence of catalytic activity and substrate specificity from similar ASR studies 26,46,49.
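To make the Brownian-motion expectation concrete, the following minimal sketch estimates the trait value at the root of a small tree from tip values alone, using the standard inverse-variance weighting that underlies Felsenstein's pruning for continuous traits. It is an illustration under stated assumptions, not the software used in this work: the tree encoding, the function name `bm_ancestral_state`, and the toy values are hypothetical, and all branch lengths are assumed to be positive.

```python
def bm_ancestral_state(node):
    """Estimate the trait at a node under Brownian motion from its subtree.

    node is either ("tip", value) or ("node", [(child, branch_length), ...]).
    Returns (estimate, extra_variance), where extra_variance is the
    uncertainty of the estimate expressed in branch-length units.
    Branch lengths are assumed to be strictly positive.
    """
    if node[0] == "tip":
        return node[1], 0.0
    total_weight = 0.0
    weighted_sum = 0.0
    for child, length in node[1]:
        value, extra = bm_ancestral_state(child)
        # Each child is weighted by the inverse of its effective branch length.
        weight = 1.0 / (length + extra)
        total_weight += weight
        weighted_sum += weight * value
    return weighted_sum / total_weight, 1.0 / total_weight

# Toy tree: two tips with trait values 0.0 and 2.0 on branches of length 1 and 3.
cherry = ("node", [(("tip", 0.0), 1.0), (("tip", 2.0), 3.0)])
estimate, extra_variance = bm_ancestral_state(cherry)
```

The estimate is pulled toward the tip on the shorter branch, which is the smooth, diffusive behaviour that ATR uses as its expectation. For internal nodes other than the root, a full maximum-likelihood reconstruction would also use information from the rest of the tree, which this sketch omits.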
We took a graph signal processing approach to quantify ruggedness in the LGF LacO repression landscape. We measured ruggedness as the normalized Dirichlet energy over k nearest neighbor (KNN) graphs constructed from the one-hot embedding (OHE) space of the DBD sequence 50,51. In this graph, each DBD sequence shares an edge with its k closest mutational neighbors (where k is the square root of the number of nodes) and the signal over the graph is the fitness measurement. Embedding sequences in the OHE domain ensures that distances in the KNN graphs used to compute ruggedness are directly proportional to the number of discrete mutational differences between DBD sequences and do not depend on phylogenetic topology or structure. This ensures that ruggedness is phylogeny-independent and avoids the possibility of results being confounded by conflicting phylogenetic signals from the DBD and LBD. The Dirichlet energy of the graph can be interpreted as the normalized sum of squared differences in fitness between neighboring nodes over the graph, where each node is connected to its 34 (square root of the 1161-member dataset) closest neighbors. High Dirichlet energy is representative of a rugged fitness landscape 51. Indeed, the Dirichlet energy of the empirical fitness landscape (328 ± 4.81) is significantly greater than the Dirichlet energy of the expected landscape from ATR (233 ± 3.89) (P<0.0001, subsampled to 1000 replicates; Fig. 2; Supplementary Fig. 4). To eliminate the possibility that ATR expected fitness scores relevant to only the DBD were confounded by branch lengths optimized over full-length repressors (DBD and LBD), we reoptimized the LGF phylogeny branch lengths to only the DBD by ML and reanalyzed the expected fitness scores and ruggedness (Supplementary Fig. 5).
Indeed, we find that branch length re-optimization to the DBD does not reconcile the significant difference between the expected (223 ± 4.56) and observed Dirichlet energies (P-value < 0.0001; subsampled to 1000 replicates) and produces expected fitness scores by ATR that are not positively correlated with the observed fitness scores (Pearson’s R2 = −0.2352; Supplementary Fig. 5). These results together suggest that LacO repression evolves over a fitness landscape that is more rugged than expected from extant observations alone, and that this ruggedness is not an artifact arising from different phylogenetic scales shared between the LGF DBD and LBD.
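The ruggedness measure can be sketched as follows. The example builds a KNN graph from aligned sequences using Hamming distance (for one-hot encoded sequences of equal length, squared Euclidean distance is twice the Hamming distance, so neighbor rankings agree) and sums squared fitness differences over the edges. Dividing by the number of nodes is an assumed normalization, because the exact normalization is not stated here; the function names and toy sequences are hypothetical.

```python
import math

def hamming(a, b):
    """Number of mismatched positions between two aligned sequences."""
    return sum(x != y for x, y in zip(a, b))

def knn_edges(seqs, k):
    """Undirected edges linking every sequence to its k nearest neighbors."""
    edges = set()
    for i, s in enumerate(seqs):
        others = sorted((j for j in range(len(seqs)) if j != i),
                        key=lambda j: (hamming(s, seqs[j]), j))
        for j in others[:k]:
            edges.add((min(i, j), max(i, j)))
    return edges

def dirichlet_energy(seqs, fitness, k=None):
    """Sum of squared fitness differences over KNN edges, divided by n.

    k defaults to the rounded square root of the number of sequences;
    the division by n is an assumption of this sketch.
    """
    n = len(seqs)
    if k is None:
        k = max(1, round(math.sqrt(n)))
    edges = knn_edges(seqs, k)
    return sum((fitness[i] - fitness[j]) ** 2 for i, j in edges) / n

# Toy landscape: the same four sequences with a smooth and a rugged signal.
seqs = ["AAAA", "AAAT", "AATT", "ATTT"]
smooth = [0.0, 1.0, 2.0, 3.0]
rugged = [0.0, 3.0, 0.0, 3.0]
smooth_energy = dirichlet_energy(seqs, smooth)
rugged_energy = dirichlet_energy(seqs, rugged)
```

On this toy input the alternating signal gives a higher energy than the monotonic one, which is the sense in which a high Dirichlet energy indicates ruggedness.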
We additionally repeated this analysis on an independently characterized phylogeny of the sarbecovirus spike glycoprotein receptor binding domain (RBD)52 to ensure that expected and experimental fitness values were consistent when the underlying fitness landscape is not dramatically rugged over the phylogenetic topology (Supplementary Fig. 6). Indeed, in the RBD dataset we find a strong correlation between the expected and observed fitness scores (P-value=0.0111; Pearson’s R2=0.8686), indicating that the absence of a positive correlation between the expected and observed LGF DBD fitness scores (Pearson’s R2= −0.1035) is not a consequence of the ATR method. Ruggedness analysis of the sarbecovirus dataset also reveals no significant difference between the ATR expected (52.2 ± 4.91) and experimentally observed (52.3 ± 5.17) OHE graph Dirichlet energy (P-value=0.103; subsampled to 100 replicates), further validating our use of ATR in setting prior expectations for smooth fitness landscapes and highlighting the anomalous nature of the LGF DBD landscape.
We next traced the evolution of function within the clades in which LacO repression was observed, showing that recognition of LacO is sporadic within each group. For instance, in the EcLacI clade, only seven of the 67 sequences were repression competent. Additionally, we identify 32 instances where LacO recognition is either gained or lost between adjacent nodes (Fig. 2a–c). Thus, while evolutionary selection and optimization of ancestral function exist to some extent, we observe that rapid gain or loss of that same function is more dominant across the LGF phylogeny. In summary, the DBDs of the LGF appear to be evolving across a rugged fitness landscape, leading to rapid gain/loss of function transitions.
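Counting gain and loss events between adjacent nodes amounts to comparing the functional state of each node with that of its parent. The sketch below shows this on a toy tree; the node names, the `parent_of` mapping, and the functional calls are hypothetical and do not correspond to nodes in the phylogeny.

```python
def count_switches(parent_of, functional):
    """Count gains and losses of function along parent-to-child branches.

    parent_of maps each non-root node to its parent; functional maps every
    node to True (represses LacO) or False (does not).
    """
    gains = 0
    losses = 0
    for child, parent in parent_of.items():
        if functional[parent] and not functional[child]:
            losses += 1
        elif not functional[parent] and functional[child]:
            gains += 1
    return gains, losses

# Toy tree: n1 is the root; t1 and t2 are tips descending from n3.
parent_of = {"n2": "n1", "n3": "n2", "t1": "n3", "t2": "n3"}
functional = {"n1": False, "n2": True, "n3": False, "t1": True, "t2": False}
gains, losses = count_switches(parent_of, functional)
```

Here function is gained twice (n1 to n2, n3 to t1) and lost once (n2 to n3), so three of the four branches carry a switch.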
To understand how and why sequence changes altered LacO recognition, we investigated the enrichment of different sequences, which showed a strong correlation (Pearson’s R2=0.85) between the enrichment ratio and the ability of a variant to repress (Fig. 2d). Clonal screening revealed a near-binary functional switch in repression competency separating enriched and depleted variants. Of the 1158-member variant library, 7 extant and 13 ancestral sequences were clonally validated as repression competent (log2(enrichment) > 0 and clonal fluorescence < 36,000 RFU OD−1). To confirm the expression of non-functional repressors in E. coli, we performed SDS-PAGE analysis of cell lysates for a random subset of 5 extant and 5 ancestral sequences. Expression levels were comparable to E. coli lac repressor (Supplementary Fig. 11). Analysis of the sequence conservation patterns of functional sequences revealed that, among the four helices of the DBD, H2 and HH are the most conserved, as these are essential for LacO recognition and allosteric activation, respectively (Fig. 2e; Supplementary Fig. 12).53 Indeed, within the recognition helix (H2), residues Tyr17, Gln18 and Arg22, which directly interact with the lac operator and control DNA specificity54, are rarely mutated in the functionally repressive DBDs (Fig. 2e). The only exceptions are Y17H and Q18M, which have both been demonstrated to maintain lac operator binding in the genetic background of EcLacI.55,56 Ancestors 835, 850, and 851 have multiple substitutions in the HH; these variants repress GFP but are not allosterically activated by 1 mM IPTG (Supplementary Fig. 13), consistent with the role of HH in allosteric signaling.53
In summary, our analysis of LacO repressor function across the 1158 sequences of this phylogenetic tree suggests that (a) LacO recognition is an easily accessible evolutionary state (appeared independently multiple times) and (b) it simultaneously exists within narrowly confined solution spaces (the fitness peaks that do exist are very small and surrounded by inactive sequences).
Diverse substitutions shape local ruggedness among related DBDs.
To better understand how the sequence space surrounding the functional repressors affects function (i.e. ability to repress LacO in this study), we identified specific mutations that affect function. This is challenging in genetically diverse datasets because neutral genetic drift and epistasis can both mask and confound sequence-function relationships. In this dataset, the context-dependence of three amino acid positions, 18, 22, and 26, illustrates how epistasis creates ruggedness (Fig. 2a–c). First: Gln18 in the non-functional ancestor Anc849 is mutated to Met18 (previously shown to increase affinity (2.5-fold) to LacO in EcLacI56) in Anc850 (alongside Glu39Leu), resulting in gain of function, followed by the introduction of Arg48Thr, Asn50Ser, and Ala52Ile in Anc851, which retains function even though these mutations are highly represented in non-functional variants (Fig. 2c; Supplementary Fig. 15). Thus, the deleterious effects of Arg48Thr, Asn50Ser, and Ala52Ile are mitigated by the presence of Gln18Met.57 Second: Arg22 is essential for LacO binding as it forms a critical base-specific hydrogen bond (Supplementary Fig. 12d);58 while it is found in every functional variant in the screen (Fig. 2f, Supplementary Fig. 14), 1018 non-functional variants also contain Arg22. Arg22His is among the substitutions found in Anc1143, which is the closest non-functional ancestor of EcLacI (Fig. 2a). Third: Asn26 in the non-functional ancestor Anc875 is mutated to His26 in Anc876 and introduces function de novo. However, when Asn26His is introduced alongside Ile42Met in Anc877, there is no gain of function (Fig. 2b). In summary, we find that subtle mutational perturbations have large effects on function, indicating a metastable state, and that epistasis contributes to the ruggedness of the landscape.
Historical contingency and genetic drift shape evolution.
While evolutionary trajectories of proteins are constrained by biological function, stochastically sampled permissive substitutions also shape the sequence landscape and contribute to historical contingency, i.e., the dependence of evolutionary trajectories on historical mutations.28,59,60 From ASR alone, it is difficult to parse whether mutations are functionally essential, permissive, or neutral because ancestral and extant sequences are often separated by many substitutions. Therefore, to investigate the role of each DBD position, we used deep mutational scanning (DMS) to systematically assay all single-amino acid substitutions in EcLacI. By combining ASR with a DMS screen of EcLacI, we sought to disentangle conserved residues that have been fixed by adaptive evolution from those that arose through neutral drift.
To functionally characterize all single amino acid substitutions of the EcLacI DBD (1121 variants in total), we used the same technologies (oligonucleotide chip synthesis, one-pot library cloning, FACS, and deep sequencing) as described previously for our high-throughput phylogenetic screen (Supplementary Fig. 16). We sorted the low fluorescence population to enrich the repression-competent variants. The number of repression-competent DMS variants (25% of the population) was 10-fold higher than from the phylogenetic library (2.5%) (Supplementary Fig. 18a). These were enriched to ~95% of the population after two rounds of sorting (Supplementary Figs. 17 and 18). Sequencing of pre- and post-sorted libraries yielded enrichment ratios, normalized to native EcLacI and log-transformed, that were highly correlated (R2=0.96, Spearman ρ=0.93) between independent replicates (Supplementary Fig. 19)37. We clonally screened a random subset of DMS variants and observed strong correlation (R2=0.72) between enrichment and the ability of a variant to repress (Supplementary Fig. 20).
For this DMS experiment, each DBD position has a probability distribution over the 20 canonical amino acids pre- and post-selection. We used Kullback-Leibler divergence (KLD61; a statistical method for comparing two probability distributions) to discern which residues are functionally restrictive (high score) and permissive (low score). This showed that the restrictive and permissive positions are non-uniformly distributed across the DBD, with H2 and HH largely intolerant to substitutions (Fig. 3a, b), consistent with the sequencing results from the phylogeny (Fig. 2), while the C-terminus, loop 1, loop 2 and most of H3 are robust to mutation.
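A per-position KLD score can be sketched as below. The direction of the divergence (post-selection relative to pre-selection), the base-2 logarithm, the pseudocount, and the toy counts are assumptions of this illustration rather than details given in the text.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aa_distribution(counts, pseudocount=1.0):
    """Convert amino acid counts at one position into probabilities."""
    total = sum(counts.get(a, 0) for a in AMINO_ACIDS)
    total += pseudocount * len(AMINO_ACIDS)
    return [(counts.get(a, 0) + pseudocount) / total for a in AMINO_ACIDS]

def kld(post_counts, pre_counts, pseudocount=1.0):
    """Kullback-Leibler divergence D(post || pre) in bits for one position."""
    p = aa_distribution(post_counts, pseudocount)
    q = aa_distribution(pre_counts, pseudocount)
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q))

# Toy position: a uniform library before selection (counts are made up).
pre = {a: 50 for a in AMINO_ACIDS}
restrictive_post = {"R": 1000}   # only one residue survives selection
permissive_post = dict(pre)      # selection leaves the distribution unchanged
restrictive_score = kld(restrictive_post, pre)
permissive_score = kld(permissive_post, pre)
```

A position where selection collapses the distribution onto one residue scores high, and a position whose distribution is unchanged by selection scores zero.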
Fig. 3. DMS to assign evolutionary roles to each position of the DBD.
(a), Heat map showing the normalized enrichment scores (red gradient) of each EcLacI substitution after selection of repression competent DBDs. Secondary structure topology of the DBD (top) and substitution frequency (blue gradient) among functional extant and ancestral DBDs are shown. (b), Kullback-Leibler divergence (KLD) describes how much the amino acid probability distribution changes at each position in response to functional selection in both the DMS and phylogenetic screens. For the DMS screen, positions with high KLD are functionally restrictive, whereas low KLD positions are more tolerant to substitutions. (c), Assigned evolutionary roles to each DBD position based on k-means clustering of KLD scores of both DMS and phylogenetic screens (Supplementary Fig. 21). Context-independent positions (low phylogenetic KLD and high DMS KLD) contribute to stability and general HTH fold. Context-specific positions (high phylogenetic and DMS KLD) modulate LacO recognition. Historically contingent positions (high phylogenetic KLD and low DMS KLD) are not functionally constrained in the genetic background of E. coli LacI, despite sequence convergence among functional phylogenetic variants. Permissive positions (low phylogenetic and DMS KLD) are generally tolerant to substitutions.
While DMS provided key insights into the functional role of each residue, it lacks an evolutionary perspective. For instance, using DMS alone we cannot distinguish between residues essential for LacO recognition (“context-specific”) and those required for stability and general fold (“context-independent”). By combining DMS with our extensive phylogenetic screen, it is possible to define an evolutionary role for each residue (Fig. 3b). Using k-means clustering, we identified four possible combinations of DMS (high/low) and phylogenetic (high/low) KLD scores, each representing a unique evolutionary role of a residue (Fig. 3c, Supplementary Fig. 21). Context-independent functionally restrictive residues, such as those that are structurally essential, have high DMS KLD scores (functionally restrictive) and low phylogenetic KLD scores. The phylogenetic score is low because positions that are conserved across all phylogenetic variants cannot undergo a shift in probability distribution upon selection and thus have low KLD scores despite potentially being functionally relevant (Fig. 4a,b). These context-independent residues are heavily concentrated within the core of the HTH fold and across the dimerization interface of the HH, and include essential LacO-binding residues, such as Arg22 (Fig. 4a,b). Context-specific functionally restrictive positions have high KLD scores for both DMS and phylogenetic screens. These residues are critical for LacO recognition, either making direct contact with the operator DNA in both the major and minor grooves, helping to orient the DBD for LacO binding, or interacting with the LBD for allosteric communication. Context-specific residues include Tyr7 and Tyr17, which contact LacO or are associated with an epistatic change-of-function, such as in Anc877 (Fig. 4c,d). 
Historically contingent positions have low DMS and high phylogenetic KLD scores, i.e., while mutationally robust in terms of EcLacI function, they are restricted in phylogenetic analysis owing to their deterministic role in evolution (Fig. 4e,f). Many of these residues are located adjacent to context-specific functionally restrictive positions, consistent with a role in modulating their effect, such as Asn26 (historically contingent) (Fig. 4f). Functionally permissive positions have low KLD scores for both DMS and phylogenetic screens. These residues, such as Val24 and Glu44, are solvent exposed. Altogether, this analysis demonstrates that the evolutionary role of all residues within the DBD of EcLacI can be revealed through a statistical analysis of DMS and ASR data.
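The four-way assignment of roles from paired KLD scores can be illustrated with a simple threshold rule. This is a minimal sketch and not the k-means procedure used in the study: the function name `assign_role`, the cut-off values, and the example scores are all hypothetical.

```python
def assign_role(dms_kld, phylo_kld, dms_cutoff=1.0, phylo_cutoff=1.0):
    """Map a position's (DMS KLD, phylogenetic KLD) pair to an evolutionary role.

    The cut-offs are hypothetical placeholders; the study derives the four
    groups by k-means clustering rather than by fixed thresholds.
    """
    high_dms = dms_kld >= dms_cutoff
    high_phylo = phylo_kld >= phylo_cutoff
    if high_dms and not high_phylo:
        return "context-independent"
    if high_dms and high_phylo:
        return "context-specific"
    if high_phylo:
        return "historically contingent"
    return "permissive"


# Hypothetical scores for four illustrative positions: (DMS KLD, phylogenetic KLD).
example_scores = {
    "position_a": (2.0, 0.1),
    "position_b": (2.0, 2.0),
    "position_c": (0.1, 2.0),
    "position_d": (0.1, 0.1),
}
roles = {name: assign_role(*scores) for name, scores in example_scores.items()}
for name, role in roles.items():
    print(name, role)
```

A clustering approach avoids having to choose cut-offs by hand, which is presumably why it was preferred; the threshold version is shown only because it makes the high/low logic explicit.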
Fig. 4. Structural role of DBD mutations.
(a) Context-specific mutations in the LacO major groove, where Tyr7 makes essential contacts with the DNA. (b) Context-specific mutations in the minor groove. Context-specific mutations are dominated by locations that contact the DNA, particularly in the major groove. (c) Historically contingent mutations in the major groove. (d) Historically contingent mutations in the minor groove. Historically contingent mutations cluster around the context-specific mutations that they set the background for. (e) Context-independent mutations in the major groove. Context-independent mutations cluster around the core packing of the DBD. (f) Context-independent mutations in the minor groove. Context-independent mutations also include those essential for DNA recognition, such as Arg22.
Operator asymmetry contributes to the rugged fitness landscape.
Despite asymmetric DNA operator sequences having lower affinity for DBDs than symmetrical sequences,62 they are overwhelmingly dominant throughout evolution and must therefore confer some selective benefit. To investigate this, we performed in vitro DNA-binding assays using surface plasmon resonance (SPR; Fig. 5). We find that all functional variants tested (Anc880, Anc881 and Anc882) bind the symmetrical LacOsym with Kd values more than an order of magnitude lower than those for their asymmetric operator counterparts (LacO and LacO1), confirming that operator asymmetry significantly reduces a repressor’s DNA-binding capacity in ancestral, as well as extant, DBDs (Fig. 5a–c; Supplementary Fig. 22).
Fig. 5. Transcriptional regulator asymmetry and fitness landscape ruggedness.
(a) Multi-cycle kinetics SPR sensorgram for Anc882 LacO binding. Log2 dilution series of purified Anc882 (see methods) are plotted by color gradient. Higher response units (RU) correspond to more protein binding with immobilized LacO DNA. (b) Anc882 Kd values, determined by SPR for LacO1, LacOsym and LacO DNA binding when repressed (solid) and induced with 10 mM IPTG (hashed). (c) Magnified view of repressed Kd values. (d) Conformational snapshots of the most sampled conformations for monomers 1 (red) and 2 (cyan). Dominant conformations of Tyr7, Tyr17, Gln18 and Arg22 are shown from principal component analysis (PCA; panel e). (e) PCA of Anc882, showing monomers A (red) and B (blue). Components are projected from equilibrium dynamics trajectories of the functional repressor unit with LacO DNA over ~600 ns of simulation. Conformational snapshots presented in (d) are shown as circles. (f) Distribution of key protein-DNA contacts in Anc882 and LacO. DNA residues (X-axis) have been aligned for visualization. Separation between monomer unit dynamics, asymmetric conformations in essential residues during simulation and differing key interactions between the DNA operator and the contacting residues between monomers indicate asymmetry in binding.
To investigate the molecular basis for binding of asymmetrical operator sequences, we performed all-atom molecular dynamics (MD) simulations on models of Anc880 and Anc882, as well as EcLacI, complexed with LacO (Fig. 5). Each system was simulated with 10 independent 60 ns replicates (600 ns total sampling time) (Supplementary Fig. 23). Previous studies have shown that motions related to DNA-binding specificity in EcLacI occur over ns – μs timescales, consistent with the timescale of our simulations.63,64 We observe asymmetry in the protein-DNA interactions between the two monomers that constitute the functional dimer (Fig. 5d,e). Although both monomers interact with the DNA via the same residues, the relative contributions these residues make to the overall binding energy and their DNA contacts are distinct. In Anc882, which is the variant with the greatest LacO affinity, we identified Tyr7, Tyr17, Gln18 and Arg22 as being essential for DNA recognition and binding. However, the conformations of each of these residues differ between monomers to accommodate the respective operator half-sites, and principal component analysis (PCA) reveals that backbone motions differ between the two monomers (Fig. 5d–f). Identical analyses of apo-Anc882 trajectories found that in the absence of DNA, backbone motions are nearly indistinguishable between monomers (Supplementary Fig. 24). This analysis is consistent with NMR structures of the EcLacI DBD complexed with the natural LacO1 operator,65 and alternate DNA binding modes have been alluded to by analyses of LGF repressors and EcLacI mutants.66 Asymmetrical operator sequences may impose a complex selective pressure where the DBDs must be able to undergo a conformational change to bind two distinct half-sequences with physiologically relevant affinity.
The demand on the DBD to simultaneously evolve high affinity for two distinct half-site DNA sequences using a single binding site (as well as allosteric regulation with the LBD) means that the fitness landscape we observe across this phylogeny can be viewed as a composite of two or more fitness landscapes. To test this, we performed in silico evolutionary simulations on small model systems using theoretical elementary landscapes as a model (Fig. 6).67–69 Elementary landscapes derive from a robust and well-established graph-theoretic approach to combinatorial landscape optimization problems;69 specifically, they are the orthogonal eigenvectors of the graph Laplacian of a sequence space graph. Simulated fitness landscapes over the graph are therefore spectral components of the graph topology; the nth elementary landscape corresponds to the nth eigenmode of the graph Laplacian. As the order of the elementary landscape increases, the complexity (and hence ruggedness) of the fitness landscape does as well. For our purposes, elementary landscapes provide a theoretically rigorous way of testing hypotheses that pertain to the ruggedness of a fitness landscape. Using a linear combination of elementary landscapes, each representing a different hypothetical fitness function, we find that the combination of two fitness landscapes can indeed be more rugged (measured by Dirichlet energy) than either constituent fitness landscape is alone, akin to constructive/destructive wave interference (Fig. 6). This demonstrates that ruggedness can emerge from the composition of otherwise smooth fitness landscapes and predicts that ruggedness may emerge where multiple and complex fitness constraints are imposed simultaneously on a phenotype, such as repression competence in LacI.
Fig 6. Elementary landscape model simulations.
(a, b) 2nd and 3rd elementary landscapes (EL), respectively. Each vertex represents a unique genotype, colored by simulated fitness. Edges connect genotypes accessible by a single mutation. (c) A linear combination of 2nd and 3rd elementary landscapes. (d) Dirichlet energies for EL2 (a), EL3 (b) and the composite (c) landscapes. The composite landscape is characterized by an energy that is greater than either constituent landscape, and is therefore more rugged than the constituent EL2 and EL3.
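The composite-landscape argument can be sketched on a toy sequence space. The example below is illustrative only and makes several assumptions that are not taken from the study: the sequence space is a binary hypercube of four sites, the two component landscapes are Walsh functions (which are eigenvectors of the hypercube graph Laplacian), and the composite is their unnormalized sum. All function names are hypothetical.

```python
import itertools
import math


def hypercube(length):
    """Return binary genotypes and the edges joining single-mutation neighbours."""
    genotypes = list(itertools.product((0, 1), repeat=length))
    edges = [
        (g, h)
        for g, h in itertools.combinations(genotypes, 2)
        if sum(a != b for a, b in zip(g, h)) == 1
    ]
    return genotypes, edges


def walsh_landscape(genotypes, sites):
    """Unit-norm Walsh function over the chosen sites.

    On the hypercube this is a Laplacian eigenvector with eigenvalue 2 * len(sites).
    """
    scale = 1.0 / math.sqrt(len(genotypes))
    return {g: scale * math.prod(-1 if g[i] else 1 for i in sites) for g in genotypes}


def dirichlet_energy(landscape, edges):
    """Sum of squared fitness differences across edges (equal to f^T L f)."""
    return sum((landscape[g] - landscape[h]) ** 2 for g, h in edges)


genotypes, edges = hypercube(4)
smooth = walsh_landscape(genotypes, sites=(0,))      # lower-order component
rougher = walsh_landscape(genotypes, sites=(1, 2))   # higher-order component
composite = {g: smooth[g] + rougher[g] for g in genotypes}  # unnormalized sum

energies = {
    "smooth": dirichlet_energy(smooth, edges),
    "rougher": dirichlet_energy(rougher, edges),
    "composite": dirichlet_energy(composite, edges),
}
print(energies)
```

Because the two components are orthogonal eigenvectors, the energy of their unnormalized sum is the sum of the component energies, so it exceeds either one. The result depends on that choice: rescaling the composite to unit norm would place its energy between those of the components.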
DISCUSSION
These results show that the fitness landscape for operator binding by DBDs of the LGF is extremely rugged, i.e., the viable fitness space for LacO recognition and repression is narrow and highly localized. This ruggedness has long been alluded to by mutational studies, but never comprehensively observed or studied prior to our work.5,70,71 In contrast, most of the protein families studied by ASR to date demonstrate protein evolution over smooth fitness landscapes where gradual changes in function can be observed.26,46,60,72–75 Even those characterized by extensive epistasis have well-defined mutational trajectories from semi- or non-functional intermediate sequences to functional ones.27,59 The difference in evolutionary dynamics between these studies and the LGF may reflect fundamentally different biological and physical constraints. While any complex molecular trait can be described as a combination of arbitrarily smooth fitness functions (i.e., folding, substrate-recognition, electrostatic preorganization, immunogenicity, among many catalytic and structural parameters76), DNA recognition among LGF regulators is particularly rugged due to the compounding of two DNA-binding landscapes that likely have complexity in isolation; if each half-site requires the correct folding, dynamical ensemble, electrostatic and shape complementarity (among likely many other traits), a protein sequence that binds both half-sites must compound the molecular requirements for both of the half-sites.
One aspect of the rugged DBD landscape is the evolutionary metastability of repression: descendants of a repression-competent ancestor are seldom functional with the same operator sequence and mutations have extreme epistatic relationships. This indicates both that selection pressure for LacO specificity has been variable and that LacO specificity has emerged independently numerous times throughout evolutionary history, each time in different sequence contexts. Like the notion of an epistatic ratchet, where a phenotype becomes completely inaccessible within a specific background,28,77 the metastability we observe occurs through non-specific mutations that counteract or reciprocate the fitness effects of previously fixed mutations in backgrounds that are on the edge of a binary phenotype, making a few genetic changes sufficient to rewire specificity. Indeed, the influence of epistasis as a mutational ratchet has been studied extensively in eukaryotic transcriptional regulators, but not over the scale of diversity that we observe here. Using a combination of ASR and DMS, we have comprehensively mapped historical contingency within the DBD, showing that permissive sites, which are essential in an evolutionary context and underlie the epistasis and metastability, can be identified. In addition, our DMS screen showed that repressor function was impaired in more than 40% of all point mutations of the lac repressor DBD, indicating that ruggedness and metastability are not restricted to, or an artifact of, larger sequence variations between homologs.
The rugged landscape, the asymmetry of the DBD:DNA complex and the evolutionary metastability of the repressor function are all interesting observations, but how do they relate to the physiological role of these proteins? Since effective genetic regulation is paramount to organismal fitness, precise and binary interactions are essential. To impart a selective advantage, regulators must be highly specific for their cognate operator sequences. Unlike promiscuous enzymatic activity, promiscuous DNA-binding causes metabolic cross-regulation and severe disruption of cellular function.78–80 Indeed, the biophysical properties of the DBD:DNA complex can dramatically alter the evolutionary dynamics of regulatory element divergence.81 Thus, we hypothesize that this rugged fitness landscape is an intrinsic aspect of fitness for the LGF: asymmetric operator half-sites have evolved as a mechanism to minimize metabolic crosstalk between regulators and create an evolutionary dynamic in which activity is essentially binary. Over a rugged fitness landscape, as we observe, DNA-binding specificity is metastable, with rapid gain/loss of function, essentially eliminating the risk of non-specific DNA-binding and metabolic cross-regulation between diverging regulators. This suggests that inherently epistatic protein folds, such as that of the DBD, have likely been evolutionarily selected for regulatory purposes. Indeed, previous ASR studies have shown that epistasis is a hallmark of several classes of eukaryotic transcription factors,27,82–85 and recent results have highlighted the importance of molecular unpredictability in transcription factor divergence.86 Similar observations of rapid de novo promoter emergence have also recently been made, indicating that the evolutionary dynamics we observe in regulatory proteins may also be present in their DNA binding partners.87–89
Altogether, our analysis of the sequence-function landscape of the LGF DBDs has led to several valuable insights. Most importantly, we provide a molecular-level explanation for the high fidelity that has long been observed in gene regulation by proteins of the LGF family, showing that it is an intrinsic property of the rugged fitness landscape that is itself a function of the biophysical properties of DBD structure and asymmetry of DNA operator sequences. Future work could investigate whether rugged fitness landscapes, and functional partitioning through asymmetry, are common when the physiological function of the protein family makes promiscuous activity deleterious, as for these transcriptional regulators.
STAR METHODS
RESOURCE AVAILABILITY
Lead contact
Further information and requests for resources and reagents should be directed to and will be fulfilled by the lead contact, Srivatsan Raman (sraman4@wisc.edu).
Materials availability
Requests for plasmids and strains described in this study can be made to the lead contact, Srivatsan Raman (sraman4@wisc.edu).
Data and code availability
Processed data used in this study were deposited in GitHub (https://github.com/raman-lab/AncLacI; Zenodo: https://doi.org/10.5281/zenodo.10652076). Raw data obtained in this study were deposited in Zenodo (https://doi.org/10.5281/zenodo.7574310).
Code used in this study has been uploaded with documentation to GitHub and is publicly available at https://github.com/raman-lab/AncLacI (Zenodo: https://doi.org/10.5281/zenodo.10652076).
Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request.
METHOD DETAILS
Bacterial Strains
DBD libraries were expressed in E. coli DH10β (NEB) or DH10β ΔlacI::sfgfp kan reporter cells. Bacteria were grown in LB at 37°C shaking at 200 r.p.m. For protein expression and purification, E. coli BL21(DE3) (NEB) cells were used. For plates, 1.5% agar (w/v) was used. Antibiotics, carbenicillin (100μg/mL), kanamycin (30 μg/mL), and/or spectinomycin (50μg/mL) were added if appropriate for plasmid maintenance.
Phylogenetic inference
To collate an expansive dataset of the full LGF, EcLacI, RbsR, MalR, SacR/ScrR, GalR, GalS, AscG, PurR, CytR, and CcpA were each individually queried against the NCBI non-redundant protein database using pBLAST. ~250 sequences were retrieved from each individual BLAST search, using an e-value significance threshold of 1.0E-10. Redundancy in this dataset was removed to 90% sequence identity using CD-HIT93 and incomplete sequences, or sequences with poorly conserved (<1% of sequences) indels were manually removed. Sequences were aligned using the ESPRESSO protocol of T-COFFEE.94 This alignment was manually refined to remove poorly conserved sequences or insertions and benchmarked against available LGF X-ray crystal structures. Phylogenetic inference was performed in IQ-TREE.95 The ML model, LG+R9,96,97 was fitted using ModelFinder (as implemented in IQ-TREE).97 Tree-search was performed using default parameters and branch supports were computed as ultra-fast bootstrap approximations.98 Tree search was repeated for 10 independent replicates, which were tested for statistical equivalency by the approximately unbiased (AU) test conducted to 10000 replicates.43 An additional inference was performed using the same protocol, however, only including the N-terminal 60 columns of the LGF dataset to determine how much evolutionary signal the DBD provides to the full-length sequences (Supplementary Fig. 2). Ancestral sequences were reconstructed on the single topology presented in Fig. 1 using CodeML from the PAML software package.47 The sequence evolution model for ASR was manually set to LG+G496 and indel events were processed in the ancestral sequences according to the principle of parsimony. Quantitative ancestral trait reconstruction was performed using a Brownian motion model that assumes traits diverge according to stochastic diffusion over a phylogenetic topology. 
Parameters for the Brownian motion model were fitted by maximum likelihood using the R package APE, using only the extant sequence phenotypes and the phylogenetic topology presented in Fig. 1.48 Expected ancestral phenotypes were extracted from this model for quantitative ruggedness analysis.
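To make the Brownian motion reconstruction concrete, the sketch below computes the maximum likelihood root phenotype on a small tree by recursively averaging descendant values weighted by inverse branch length (Felsenstein's pruning idea). It is an illustration under stated assumptions and not the APE routine used in the study: the nested-list tree encoding, the function name `bm_ancestral_estimate`, and the example tree and phenotypes are hypothetical.

```python
def bm_ancestral_estimate(node):
    """Return (estimate, extra_variance) for a subtree under Brownian motion.

    A node is either a number (an observed tip phenotype) or a list of
    (child, branch_length) pairs. The returned estimate uses descendants only,
    so it is the maximum likelihood value at the root of the tree it is given.
    """
    if isinstance(node, (int, float)):
        return float(node), 0.0
    weighted_sum = 0.0
    precision = 0.0
    for child, branch_length in node:
        child_estimate, child_extra = bm_ancestral_estimate(child)
        # Uncertainty in the child's own estimate lengthens its effective branch.
        weight = 1.0 / (branch_length + child_extra)
        weighted_sum += weight * child_estimate
        precision += weight
    return weighted_sum / precision, 1.0 / precision


# Hypothetical tree ((A:1, B:1):1, C:2) with tip phenotypes A=1, B=3, C=5.
tree = [([(1.0, 1.0), (3.0, 1.0)], 1.0), (5.0, 2.0)]
root_estimate, root_extra_variance = bm_ancestral_estimate(tree)
print(round(root_estimate, 4))
```

Values returned for internal nodes below the root ignore information from the rest of the tree, so a full reconstruction of every ancestor requires an additional pass (or re-rooting), which dedicated packages handle.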
Construction of GFP reporter strain
The E. coli strain DH10β was modified by lambda Red recombineering to insert a superfolder GFP gene driven by the pLlacO99 promoter and a kanamycin resistance cassette at the LacI locus to generate the GFP “reporter” cell line. The temperature sensitive plasmid pKD46 (Genbank AY048746) was used for arabinose-inducible expression of Red recombinase.100 Linear double-stranded donor DNA was amplified off a pSC101 plasmid using overhang primers to add 50bp homology arms directed to the 5’ end of the LacI gene. A frozen glycerol stock of DH10β transformed with the pKD46 plasmid was struck out on LB carbenicillin (100μg/mL) and incubated at 30°C overnight. A colony was selected and inoculated into 5mL LB carbenicillin and grown for 16h in a shaking incubator at 30°C. Cells were diluted 50X into 25mL of LB carbenicillin and grown to an OD of 0.1. Red recombinase was induced with 100mM L-arabinose and cells were grown to an OD of 0.6. The cells were harvested and prepared for electroporation. 25μL of cells were transformed with 500ng of donor DNA. Cells were recovered for 2h in SOC media in a shaking incubator at 37°C and plated on LB kanamycin (30 μg/mL). The transformants were incubated at 37°C overnight to cure the cells of the temperature sensitive pKD46 plasmid. A visibly fluorescent colony was selected and inoculated into 5mL LB kanamycin to be grown for 16h in a shaking incubator at 37°C. The genome modification was confirmed via sequencing and a glycerol stock was stored at −80°C.
General library DNA assembly
Plasmids were constructed using standard molecular biology techniques of PCR and Golden Gate assembly with Kapa HiFi DNA Polymerase (KAPA Biosystems), restriction enzymes (NEB), T4 DNA Ligase (NEB), and Antarctic Phosphatase (NEB).
Oligonucleotides encoding residues 2–60 of all LGF DBDs (1158 variants in total) or all single amino acid substitutions of the E. coli LacI DBD (1121 variants in total) were synthesized as single-stranded oligonucleotide pools (Agilent Technologies). Extant DBDs exceeding 60 aa in length were truncated from the N-terminus to 60 residues in length to comply with DNA synthesis restrictions. Oligonucleotides were converted to double-stranded DNA using 15 cycles of PCR amplification and purified on a spin column (EZNA Cycle Pure kit from Omega BioTek). A pSC101 backbone containing the extant E. coli LacI gene under control of the strong pLtetO promoter and a spectinomycin resistance gene was amplified using a primer pair encoding BsaI cut sites that matched the DBD insertion location of both oligonucleotide libraries. The amplified backbone was treated with DpnI for 2h at 37°C and purified using a spin column. The backbone was treated with BsaI-HF v2 for 2h at 37°C, Antarctic phosphatase for 1h at 37°C, and subsequently purified using a spin column. A Golden Gate assembly reaction (30 cycles of 37°C for 5min and 16°C for 5min) was performed using 0.042 pmol pSC101 backbone and 0.21 pmol amplified DBD library in a 1:5 molar ratio. The assembled product was dialyzed on a semi-permeable membrane (Millipore) for 1h at 25°C against dH2O. A 25μL aliquot of electrocompetent DH10β E. coli cells was transformed with 2μL of the dialyzed product using electroporation. Cells were recovered for 1h in SOC media in a shaking incubator at 37°C and dilutions were plated on LB Spectinomycin (50 μg/mL) to calculate transformation efficiency (>10^6 CFU/mL). Remaining recovered cells were diluted 5X, incubated for 6h, and then diluted 50X for overnight growth in a shaking incubator at 37°C for 16h. Library plasmids were extracted using a DNA miniprep kit and 1μL (~100ng) was used to transform 25μL of electrocompetent reporter cells following the transformation protocol described above. 
Glycerol stocks of the reporter cells transformed by the preselected phylogenetic and DMS plasmid libraries were stored at −80°C.
Fluorescence-activated cell sorting
Thawed glycerol stocks of reporter cells containing the LGF DBD or DMS libraries were used to inoculate (50μL each) 5mL of LB kanamycin (30μg/mL) / spectinomycin (50μg/mL) in duplicate. Cells were grown in a shaking incubator at 37°C for 16h and subsequently diluted 50X in PBS (137mM NaCl, 2.7mM KCl, 10mM Na2HPO4, 1.8mM KH2PO4) for sorting. Sorting was performed with a Sony MA900 cell sorter. Cells were excited with a 488nm laser and GFP fluorescence was monitored through a 525/50 filter. Sorting gates were used to select for singlet cells (Supplementary Figs. 8 and 17). A sorting gate was drawn to isolate the low-fluorescence populations of the LGF DBD (2.5–2.6% of total population, 250,000 cells collected) and DMS (25.4–25.6% of total population, 500,000 cells collected) libraries (Supplementary Figs. 9 and 18). Cells were sorted into 1mL of LB, and the total volume was adjusted to 5mL after sorting. Cells recovered for 1h at 37°C in a shaking incubator before antibiotics were added (kanamycin and spectinomycin), and incubation was continued for another 15h. Glycerol stocks were made and stored at −80°C. Plasmids were harvested using a DNA miniprep kit and 1μL (~100ng) was used to transform 25μL of fresh electrocompetent reporter cells. The procedure described above was repeated for a second round of low-fluorescence sorting using the same sorting gates to further enrich for repression competent variants.
Next-generation sequencing
We used deep sequencing to evaluate presorted and sorted LGF and DMS populations. The DBD was amplified in two stages with Kapa HiFi DNA Polymerase (KAPA Biosystems) for tailed amplicon sequencing. The first PCR reaction was performed with overhang primers that anneal to 5’ and 3’ constant regions surrounding the DBD. The overhangs add a variable N region (to enhance nucleotide diversity), and a portion of the universal Illumina adapter. The first reaction was performed using 14 cycles with 1ng of extracted plasmid DNA (DMS or LGF library, presorted or sorted) as template. The product was purified on a spin column (EZNA Cycle Pure kit from Omega BioTek). A second PCR reaction was used to add the index (for pooled sequencing) and the Illumina ‘stem’. This amplification was performed using 10 cycles with 10ng of the purified product from the first reaction as template. The amplicons were purified and used for deep sequencing. Samples were sequenced on an Illumina MiSeq System, using a MiSeq Reagent Kit v2 (2×250 cycles) following the manufacturer’s documentation.
Paired-end Illumina sequencing reads were merged with PEAR (Paired-end read merger).101 Phred scores (Q scores) were used for quality filtering. Reads with an expected number of errors exceeding 1 were removed and total read counts for each DBD variant were computed (Supplementary Figs. 7 and 16). Raw read counts were highly correlated (R2 ≥ 0.97) for all replicate samples. Enrichment for each variant was computed using
| $E_i = \dfrac{f_i^{\mathrm{post}}}{f_i^{\mathrm{pre}}}$ | (eq. 1) |
where $f_i^{\mathrm{pre}}$ and $f_i^{\mathrm{post}}$ are the relative read count frequencies of variant $i$ before and after enrichment of repression competent variants, respectively. Higher enrichment indicates tighter repression of GFP and higher affinity to LacO. For the DMS library, enrichment scores (Escore) were computed by normalizing enrichment to WT and applying a log2 transformation. Thus, the WT DMS Escore is set to 0, Escore > 0 indicates improved function, and Escore < 0 indicates reduced function.
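The enrichment calculation can be sketched as follows. This is a minimal illustration that assumes enrichment is the ratio of post- to pre-selection relative read frequency; the function names, variant labels, and read counts are hypothetical and are not data from the study.

```python
import math


def relative_frequencies(counts):
    """Convert raw read counts per variant into relative frequencies."""
    total = sum(counts.values())
    return {variant: n / total for variant, n in counts.items()}


def enrichment(pre_counts, post_counts):
    """Post-/pre-selection frequency ratio for variants observed before selection."""
    pre = relative_frequencies(pre_counts)
    post = relative_frequencies(post_counts)
    return {v: post[v] / pre[v] for v in pre if pre[v] > 0 and v in post}


def escore(enrichments, wild_type="WT"):
    """log2 of enrichment normalized to wild type; zero enrichments are skipped."""
    reference = enrichments[wild_type]
    return {v: math.log2(e / reference) for v, e in enrichments.items() if e > 0}


# Hypothetical read counts for three variants.
pre_counts = {"WT": 1000, "mut_a": 1000, "mut_b": 1000}
post_counts = {"WT": 400, "mut_a": 100, "mut_b": 800}

scores = escore(enrichment(pre_counts, post_counts))
for variant, score in scores.items():
    print(variant, round(score, 3))
```

In this toy example the wild type scores 0 by construction, a depleted variant scores below 0, and an enriched variant scores above 0, matching the interpretation of Escore given in the text.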
Kullback-Leibler divergence of a specific position is given by the following:
| $\mathrm{KLD} = \sum_{a} p_a \log\left(\dfrac{p_a}{q_a}\right)$ | (eq. 2) |
where the sum is over all 20 canonical amino acid identities, and $p_a$ and $q_a$ are the relative frequencies of an amino acid $a$ in the repressed sorted or presorted distributions, respectively. If the relative frequency of an amino acid at a position is zero, then the value was excluded from the summation.
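A per-position KLD calculation along these lines might look like the sketch below. Two choices here are assumptions rather than details from the study: the logarithm is taken in base 2, and a term is skipped when either the sorted or the presorted frequency is zero. The toy four-residue distributions are hypothetical.

```python
import math


def position_kld(sorted_freqs, presorted_freqs):
    """Kullback-Leibler divergence of the sorted from the presorted distribution.

    Both arguments map amino acid identities to relative frequencies at one
    position. Terms with a zero frequency in either distribution are excluded.
    """
    total = 0.0
    for amino_acid, p in sorted_freqs.items():
        q = presorted_freqs.get(amino_acid, 0.0)
        if p > 0 and q > 0:
            total += p * math.log2(p / q)  # base 2 is an assumption
    return total


# Toy example with four residues: selection collapses a uniform position onto one residue.
presorted = {"A": 0.25, "C": 0.25, "D": 0.25, "E": 0.25}
restrictive = {"A": 1.0, "C": 0.0, "D": 0.0, "E": 0.0}

print(position_kld(restrictive, presorted))
print(position_kld(presorted, presorted))
```

A position whose distribution is unchanged by selection scores 0, whereas one that collapses onto a single residue scores high, which is the sense in which high KLD marks a functionally restrictive position.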
Microplate fluorescence assay for clonal characterization
The presorted and repressed sorted LGF DBD libraries were struck out on LB kanamycin (30μg/mL) / spectinomycin (50μg/mL) plates. Colonies were selected and inoculated into 150μL LB kanamycin/spectinomycin in a 96-well plate. Cells were incubated at 37°C in a microplate shaker for 8h. The cultures were diluted 50X in fresh media in a 96-well plate with varying concentrations of IPTG (0, 0.1, 0.5, 1, 5, 25, 100 and 1000μM). Diluted cultures were incubated at 37°C in a microplate shaker for 14h. GFP fluorescence and OD600 were measured in a BioTek Synergy HTX Multi-Mode 96-well plate reader. Fluorescence was normalized to OD to account for differences in cell density. Assayed colonies were sequenced to identify the DBD variant. The mean and standard deviation of normalized fluorescence for replicates (n ≥ 2) for each concentration of IPTG were used to fit these dose-response curves to the Hill–Langmuir equation:
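| $F = F_{\min} + (F_{\max} - F_{\min})\,\dfrac{[\mathrm{IPTG}]^{n}}{EC_{50}^{\,n} + [\mathrm{IPTG}]^{n}}$ | (eq. 3) |
where $F_{\min}$ is the basal fluorescence (measure of LacO affinity in the absence of inducer), $F_{\max}$ is the maximum fluorescence signal achieved at 1mM IPTG, $EC_{50}$ is the half maximal effective concentration, $[\mathrm{IPTG}]$ is the concentration of IPTG, and $n$ is the Hill coefficient.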
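For orientation, a dose-response curve of this kind can be evaluated as in the sketch below, which assumes the common four-parameter Hill form (basal level plus a saturating term). The function name and every parameter value are hypothetical; only the IPTG concentrations follow the series listed above.

```python
def hill_response(iptg, f_min, f_max, ec50, hill_n):
    """Normalized fluorescence predicted by a four-parameter Hill curve.

    iptg and ec50 share the same concentration unit (micromolar here).
    """
    saturation = iptg ** hill_n / (ec50 ** hill_n + iptg ** hill_n)
    return f_min + (f_max - f_min) * saturation


# IPTG concentrations used in the microplate assay, in micromolar.
iptg_series_uM = [0, 0.1, 0.5, 1, 5, 25, 100, 1000]

# Hypothetical parameters for illustration only.
params = {"f_min": 100.0, "f_max": 5000.0, "ec50": 25.0, "hill_n": 2.0}

curve = [hill_response(c, **params) for c in iptg_series_uM]
for concentration, value in zip(iptg_series_uM, curve):
    print(concentration, round(value, 1))
```

In practice the four parameters would be estimated by nonlinear least squares against the measured OD-normalized fluorescence; the fitting step is omitted here to keep the sketch free of external dependencies.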
Protein expression and purification
Anc880, Anc881 and Anc882 were synthesized and cloned into the NdeI/XhoI multiple cloning site of pET-28a(+) with a C-terminal 6x histidine tag. All three ancestral proteins were recombinantly expressed in BL21(DE3) lab strain E. coli at 25 °C for 16 hours in Luria-Bertani (LB) media. Protein expression was induced with 10 mM IPTG once media OD600 reached approximately 1.5 AU. Cells were pelleted by centrifugation at 5,000 RPM for 15 minutes and were lysed by sonication once resuspended in buffer A [20 mM PBS, 150 mM NaCl, 20 mM imidazole (pH 7.4)] containing the recommended amount of Serratia marcescens Turbonuclease (Sigma-Aldrich). The lysate was pelleted by centrifugation at 12,000 RPM, 4 °C for 1 hour before being filtered (0.22 μm) and loaded onto a prepacked 5 mL HisTrap FF immobilized metal ion affinity chromatography column (GE lifescience) equilibrated in buffer A. The loaded column was washed for 5 column volumes in buffer A and 3 column volumes in 94% buffer A/ 6% buffer B [20 mM PBS, 150 mM NaCl, 250 mM imidazole (pH 7.4)], before being eluted with 100% buffer B on an AKTA Start FPLC instrument (GE lifescience). IMAC purified protein was then filtered (0.22 μm) and purified by size exclusion chromatography (SEC) on a prepacked HiLoad superdex 200 16/60 SEC column (GE lifescience) equilibrated in SEC buffer [20 mM PBS, 150 mM NaCl (pH 7.4)] and run on an AKTA FPLC system (GE lifescience).
SDS-PAGE
Overnight cultures of each strain were initiated by inoculating a single colony from a streak plate into LB with the appropriate antibiotics (final working concentration of 30μg/mL kanamycin for reporter-only strains or 30μg/mL kanamycin + 50μg/mL spectinomycin for reporter + exogenous LacI strains). 1mL of each saturated overnight culture was pelleted and resuspended in 200μL Lysis Buffer (300mM NaCl, 50mM HEPES, 1mM PMSF, 1mg/mL lysozyme, 5mM β-ME, 10% glycerol, pH 7.5). Cell resuspensions were sonicated using a Q500 sonicator (Qsonica) with the cup-horn attachment 6 times (1:15 min, 25s pulse on, 30s off, 85% amp). 67μL of 4X NuPAGE LDS Sample Loading Buffer supplemented with 10% β-ME was added to the sonicated samples, which were vortexed and incubated at 95°C for 20min for denaturation. Samples were centrifuged to pellet debris and insoluble material. 12μL of each sample was loaded in a 4–20% Mini-PROTEAN TGX Precast Protein Gel (Bio-Rad) and run in 1X Tris/Glycine/SDS Running Buffer (Bio-Rad) for 50min at 140V. The gel was stained using staining solution (0.1% Coomassie Brilliant Blue R-250, 50% v/v methanol, 40% v/v water, 10% v/v glacial acetic acid) and destained using the destaining solution (50% v/v methanol, 40% v/v water, 10% v/v glacial acetic acid).
Surface Plasmon resonance
Proteins purified to homogeneity by IMAC and SEC were buffer exchanged into SPR running buffer [10 mM HEPES, 300 mM NaCl, 3 mM EDTA, 0.05% (v/v) Tween-20 (pH 7.4)] and diluted into a log2 dilution series from 250 nM – 3.91 nM for repressed binding samples and 10 μM – 78.1 nM for samples induced with 70 mM IPTG. Double-stranded DNA oligonucleotides for LacO (TGTGTGGAATTGTTATCCGCTCACAATTTCACACA), LacO1 (TGTGTGGAATTGTGAGCGGATAACAATTTCACACA), LacOsym (TGTGTGGAATTGTGAGCGCTCACAATTTCACACA), and random DNA (AGGTCAAAAAGCCAGTGGTTATTTTAAGATGTCGC) were synthesized and 5’-biotinylated (IDT). All SPR experiments were performed on a Biacore 8K instrument (GE lifescience). An SA streptavidin Biacore chip (GE lifescience) was activated with 1 M NaCl and 20 mM NaOH before four channels were charged with approximately 400 RU of each respective DNA ligand. Binding affinity was determined through multi-cycle kinetic analysis at the 7 serial log2 protein dilutions, and one buffer-only blank. All protein analytes were run at 50 μL/min for a ligand contact time of 120 seconds. After each analyte cycle, the ligand was regenerated with 3 M NaCl run at 20 μL/min for a contact time of 60 seconds. The Kd of DNA binding was determined from a multicycle model fitted to the reference-subtracted sensorgrams in the Biacore 8K evaluation software.
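The analyte concentrations in a log2 dilution series follow directly from the top concentration and the number of points. The short sketch below reproduces the repressed-binding series described above (250 nM top concentration, 7 points); the function name is hypothetical.

```python
def log2_dilution_series(top_concentration, n_points):
    """Concentrations obtained by repeatedly halving the top concentration."""
    return [top_concentration / 2 ** step for step in range(n_points)]


# Repressed-binding series in nM: 7 points starting from 250 nM.
repressed_series_nM = log2_dilution_series(250.0, 7)
print([round(c, 2) for c in repressed_series_nM])
```

The lowest point of this series rounds to 3.91 nM, in agreement with the range stated for the repressed binding samples.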
Molecular dynamics simulations
Models for Anc880 and Anc882 were generated from X-ray crystallography coordinates of EcLacI bound to LacOsym (PDB: 1EFA)102 using FoldX103. The LacOsym DNA ligand in 1EFA was mutated to the LacO sequence that was experimentally tested using X3DNA.104 To avoid DNA terminal fraying, the 5’ and 3’ termini were extended with the sequences GGAAT and TCC, respectively, in X3DNA for all models (full DNA sequence: GGAATTGTTATCCGCTCACAATTCC). Simulations were performed in the GROMACS MD engine using the Amber ff14SB force field91 with the PARMBSC1 nucleic acid parameter set92. MD integration used a timestep of 2 fs. Coulomb interactions were modelled using particle mesh Ewald (PME) with a cut-off radius of 1.0 nm. Lennard-Jones interactions used a cut-off scheme with a radius of 1.0 nm. Bonds to hydrogen atoms were constrained using the LINCS algorithm.105 For each system, 10 independent replicates of equilibration (from the NVT ensemble onwards) and production MD were performed. Each protein-DNA complex was placed in a rhombic dodecahedral box extending at least 1.0 nm from the closest atom to the box boundary in all directions. Simulation boxes were filled with TIP3P water molecules, and the charge was neutralized with sodium and chloride ions. To replicate physiological conditions, each box was additionally filled with 150 mM NaCl. All systems were energy minimized by steepest descent with a step size of 0.01 nm until the greatest force was less than 1000 kJ/mol/nm. The systems were equilibrated to 300 K in the NVT ensemble with harmonic restraints on all heavy atoms for 100 ps using a velocity-rescale thermostat and brought to 1 bar pressure in the NPT ensemble for 100 ps using a Berendsen barostat. Following temperature and pressure equilibration, unrestrained production MD was performed for 100 ns in the NPT ensemble using a velocity-rescale thermostat and a Parrinello-Rahman barostat with references of 300 K and 1 bar, respectively.
Following periodic boundary corrections, all trajectories were analyzed in MDTraj and every 10th frame was sampled. All production MD was performed on the Australian National Computing Infrastructure GADI supercomputer.
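The production-run settings described above map onto a handful of GROMACS run-parameter (.mdp) options. The sketch below collects only the values stated in the text and renders them as an .mdp fragment; it is not the authors' input file. The leap-frog `md` integrator is an assumption, and options the text does not specify (for example coupling groups, coupling time constants, compressibility and output frequencies) are deliberately left out, so the fragment is incomplete as a run input.

```python
def render_mdp(params):
    """Render a dict of GROMACS run parameters as .mdp-style 'key = value' lines."""
    return "\n".join(f"{key:<24}= {value}" for key, value in params.items()) + "\n"

dt_ps = 0.002          # 2 fs timestep, from the text
production_ns = 100    # production length, from the text
nsteps = round(production_ns * 1000 / dt_ps)  # 100 ns / 2 fs = 50,000,000 steps

production = {
    "integrator": "md",              # assumption: leap-frog integrator
    "dt": dt_ps,
    "nsteps": nsteps,
    "coulombtype": "PME",            # particle mesh Ewald
    "rcoulomb": 1.0,                 # nm
    "vdwtype": "Cut-off",
    "rvdw": 1.0,                     # nm
    "constraints": "h-bonds",
    "constraint_algorithm": "lincs",
    "tcoupl": "V-rescale",
    "ref_t": 300,                    # K
    "pcoupl": "Parrinello-Rahman",
    "ref_p": 1.0,                    # bar
}

mdp_text = render_mdp(production)
print(mdp_text)
```

The only derived quantity is `nsteps`: a 100 ns trajectory at a 2 fs timestep corresponds to 50 million integration steps.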
Elementary landscape simulations
Elementary landscapes were calculated via eigendecomposition of the graph adjacency matrix corresponding to a combinatorial sequence space over an alphabet of 3 amino acids and 5 positions. Elementary landscapes were indexed by order (e.g. first-order, second-order and so on), and composition of landscapes was performed combinatorially between 1st and 2nd order, 1st and 3rd order, and 1st and 3rd order landscapes. The composition operation was defined as a simple linear combination of landscapes (in analogy to the superposition of waves in physics). All graph construction and matrix operations were performed in NumPy and NetworkX. Graph plotting was performed in NetworkX using custom functions.
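The procedure can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' code: the adjacency matrix is built directly from Hamming distances rather than with NetworkX, and the random coefficients within each eigenspace, the unit normalization and the equal weights in `compose` are assumptions. For this sequence space (3 letters, 5 positions) the adjacency eigenvalues are 10 − 3k for k = 0, ..., 5, which is what `order_of` uses to assign an order to each eigenvector.

```python
import itertools
import numpy as np

ALPHABET_SIZE = 3
N_POSITIONS = 5

# All 3^5 = 243 sequences; two sequences are adjacent if they differ at one position.
seqs = np.array(list(itertools.product(range(ALPHABET_SIZE), repeat=N_POSITIONS)))
hamming = (seqs[:, None, :] != seqs[None, :, :]).sum(axis=2)
A = (hamming == 1).astype(float)

# Eigendecomposition of the symmetric adjacency matrix.
eigvals, eigvecs = np.linalg.eigh(A)

def order_of(eigval):
    """Map an adjacency eigenvalue (q-1)*L - q*k back to its order k."""
    return int(round(((ALPHABET_SIZE - 1) * N_POSITIONS - eigval) / ALPHABET_SIZE))

orders = np.array([order_of(v) for v in eigvals])

def elementary_landscape(order, rng):
    """Random unit-norm landscape drawn from the eigenspace of the given order."""
    basis = eigvecs[:, orders == order]
    f = basis @ rng.normal(size=basis.shape[1])
    return f / np.linalg.norm(f)

def compose(f, g, w_f=1.0, w_g=1.0):
    """Compose two landscapes as a linear combination (weights are assumptions)."""
    return w_f * f + w_g * g

rng = np.random.default_rng(0)
f1 = elementary_landscape(1, rng)   # first-order (additive) landscape
f2 = elementary_landscape(2, rng)   # second-order (pairwise) landscape
mixed = compose(f1, f2)
```

Each elementary landscape is an eigenvector of the adjacency matrix, so averaging a first-order landscape over the 10 neighbours of any sequence simply rescales it, whereas a composed landscape such as `mixed` no longer has this property.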
Landscape ruggedness calculations
The ruggedness of the LacI fitness landscape was calculated as the normalized Dirichlet energy as previously described.50 Briefly, a symmetric k-nearest neighbour (kNN) graph was constructed based on distances between LacI variants that were obtained by converting LacI amino acid sequences to one-hot representations and measuring the pairwise distance between each sequence representation. The Dirichlet energy was then calculated as:

$$E = \frac{1}{N}\, y^{\mathsf{T}} L\, y$$

where $E$ represents the normalized Dirichlet energy, $N$ is the number of LacI variants, $y$ is the fitness for each variant and $L$ is the graph Laplacian operator of the adjacency matrix from the kNN graph. The Dirichlet energy was randomly subsampled without replacement for 1000 replicates on 95% of the graph; the mean and standard deviation from subsampling are reported. The Dirichlet energy of the sarbecovirus dataset was subsampled for 100 replicates, as the dataset size is approximately an order of magnitude smaller than the LGF DBD dataset. The fitness landscape and adjacent variants were then visualized by reducing the one-hot representations into 2 dimensions with t-distributed stochastic neighbor embedding (t-SNE). Both the kNN and t-SNE algorithms were implemented in Python 3.8 using scikit-learn 1.2.1.
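A minimal NumPy sketch of this calculation follows, taking the normalized Dirichlet energy to be y^T L y divided by the number of variants. It is not the authors' implementation (which used scikit-learn): the Euclidean distance on one-hot vectors, the choice of k, symmetrization by taking the union of directed kNN edges, and subsampling by taking the induced subgraph are all assumptions, and the function names are hypothetical.

```python
import numpy as np

def one_hot(seqs, alphabet):
    """Flattened one-hot encoding of equal-length sequences."""
    index = {a: i for i, a in enumerate(alphabet)}
    X = np.zeros((len(seqs), len(seqs[0]) * len(alphabet)))
    for n, seq in enumerate(seqs):
        for pos, aa in enumerate(seq):
            X[n, pos * len(alphabet) + index[aa]] = 1.0
    return X

def symmetric_knn_adjacency(X, k):
    """Unweighted kNN graph, symmetrized by keeping an edge if either end selects it."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)              # a node is never its own neighbour
    nn = np.argsort(d, axis=1)[:, :k]
    A = np.zeros((len(X), len(X)))
    A[np.repeat(np.arange(len(X)), k), nn.ravel()] = 1.0
    return np.maximum(A, A.T)

def normalized_dirichlet_energy(A, y):
    """y^T L y / N with L = D - A the (unnormalized) graph Laplacian."""
    L = np.diag(A.sum(axis=1)) - A
    return float(y @ L @ y) / len(y)

def subsampled_energy(A, y, frac=0.95, n_rep=1000, seed=0):
    """Mean and SD of the energy over random node subsets drawn without replacement."""
    rng = np.random.default_rng(seed)
    m = int(round(frac * len(y)))
    vals = []
    for _ in range(n_rep):
        keep = rng.choice(len(y), size=m, replace=False)
        vals.append(normalized_dirichlet_energy(A[np.ix_(keep, keep)], y[keep]))
    return float(np.mean(vals)), float(np.std(vals))

# Toy example: the same graph carrying a smooth and a rugged fitness signal.
seqs = ["AAA", "AAB", "ABB", "BBB"]
A = symmetric_knn_adjacency(one_hot(seqs, "AB"), k=2)
smooth = normalized_dirichlet_energy(A, np.array([0.0, 1.0, 2.0, 3.0]))
rugged = normalized_dirichlet_energy(A, np.array([0.0, 3.0, 0.0, 3.0]))
```

Because y^T L y equals the sum of squared fitness differences across graph edges, the alternating signal gives a higher energy than the monotonic one on the same graph, which is the sense in which a larger value indicates a more rugged landscape.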
QUANTIFICATION AND STATISTICAL ANALYSIS
Statistical details where applicable can be found in the figure legends.
Supplementary Material
Key Resources Table
| REAGENT or RESOURCE | SOURCE | IDENTIFIER |
|---|---|---|
| Bacterial and virus strains | ||
| DH10β | New England Biolabs | Cat#: C3020 |
| DH10β ΔlacI::sfgfp kan reporter | This study | |
| BL21(DE3) | New England Biolabs | Cat #:C2527H |
| Chemicals, peptides, and recombinant proteins | ||
| Carbenicillin (disodium) | Goldbio | Cat#:C-103–5 |
| L-arabinose | Goldbio | Cat#:A-300–100 |
| Kanamycin monosulfate | Goldbio | Cat#:K-120–5 |
| Spectinomycin Dihydrochloride Pentahydrate | Goldbio | Cat#:S-140–5 |
| Kapa HiFi DNA Polymerase | KAPA Biosystems | Cat#:KK2101 |
| T4 DNA Ligase | New England Biolabs | Cat#:M0202 |
| Antarctic Phosphatase | New England Biolabs | Cat#:M0289 |
| DpnI | New England Biolabs | Cat#:R0176L |
| BsaI-HF v2 | New England Biolabs | Cat#:R3733L |
| NEBridge® Golden Gate Assembly Kit (BsaI-HF® v2) | New England Biolabs | Cat#:E1601L |
| IPTG | Goldbio | Cat#:I2481C |
| Serratia marcescens Turbonuclease | Sigma-Aldrich | Cat#:T4330–50KU |
| Critical commercial assays | ||
| ZR Plasmid Miniprep - Classic | Zymo | Cat#: D4016 |
| Deposited data | ||
| LGF sequences | This study | https://github.com/raman-lab/AncLacI; Zenodo: https://doi.org/10.5281/zenodo.10652076 |
| RBD sequences | Starr, T. N. et al. 202252 | https://github.com/jbloomlab/SARSr-CoV_homolog_survey |
| LGF phylogenetic trees | This study | https://github.com/raman-lab/AncLacI; Zenodo: https://doi.org/10.5281/zenodo.10652076 |
| RBD phylogenetic trees | Starr, T. N. et al. 202252 | https://github.com/jbloomlab/SARSr-CoV_homolog_survey |
| High throughput fluorescence and DMS sequencing data (processed) | This study | https://github.com/raman-lab/AncLacI; Zenodo: https://doi.org/10.5281/zenodo.10652076 |
| High throughput fluorescence and DMS sequencing data (raw) | This study | Zenodo: https://doi.org/10.5281/zenodo.7574310 |
| Oligonucleotides | ||
| DBD oligonucleotide pools | Agilent | |
| 3’ biotinylated DNA oligonucleotides | IDT | |
| Primers used in this study | IDT | Table S1 |
| Recombinant DNA | ||
| Plasmids used in this study | This study | Table S2 |
| Software and algorithms | ||
| BLAST | NCBI | https://blast.ncbi.nlm.nih.gov/Blast.cgi |
| CD-HIT | Fu et al. 2012 93 | https://sites.google.com/view/cd-hit |
| T-COFFEE | Notredame et al. 200094 | https://tcoffee.crg.eu/ |
| IQ-TREE | Nguyen et al. 201595 | http://www.iqtree.org/ |
| PAML | Yang et al. 200747 | http://abacus.gene.ucl.ac.uk/software/paml.html |
| APE | Paradis et al. 201848 | https://github.com/emmanuelparadis/ape |
| PEAR | Zhang et al. 2014101 | https://cme.hits.org/exelixis/web/software/pear/ |
| GROMACS | Pall, S. et al. 202090 | https://www.gromacs.org/ |
| Amberff14SB | Maier J. A. et al. 201591 | https://www.gromacs.org/ |
| PARMBSC1 | Ivani, I. et al. 201692 | https://mmb.irbbarcelona.org/ParmBSC1/ |
| All other custom scripts | This paper | https://github.com/raman-lab/AncLacI; Zenodo: https://doi.org/10.5281/zenodo.10652076 |
| Other | ||
| Streptavidin SA Chip | Cytiva | Cat#:29104992 |
| Biacore 8K | Cytiva | Cat#:29722782 |
Highlights
We characterized a complete phylogenetic tree to reveal the sequence-fitness landscape
We found the landscape to be extremely rugged due to high levels of epistasis
We observed rapid switches of specificity between adjacent nodes
We showed that ruggedness is necessary to prevent off-target regulation
ACKNOWLEDGEMENTS
This work was supported in part by the ARC Centre of Excellence in Synthetic Biology (CE200100029), the ARC Centre of Excellence in Peptide and Protein Science (CE200100012), the Australian National Computing Infrastructure (NCI), the NIH Director’s New Innovator Award DP2GM132682 (S.R.), and the Great Lakes Bioenergy Research Center, U.S. Department of Energy, Office of Science, Office of Biological and Environmental Research under Award Number DE-SC0018409 (S.R. and A.T.M.). The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH, Department of Defense, Department of Energy, or other federal agencies.
Footnotes
DECLARATION OF INTERESTS
The authors declare no competing interests.
Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
REFERENCES
- 1. Smith JM (1970). Natural selection and the concept of a protein space. Nature 225, 563–564.
- 2. de Visser JAGM & Krug J. (2014). Empirical fitness landscapes and the predictability of evolution. Nat. Rev. Genet 15, 480–490.
- 3. Sarkisyan KS et al. (2016). Local fitness landscape of the green fluorescent protein. Nature 533, 397–401.
- 4. Romero PA & Arnold FH (2009). Exploring protein fitness landscapes by directed evolution. Nat. Rev. Mol. Cell Biol 10, 866–876.
- 5. Poelwijk FJ, Kiviet DJ, Weinreich DM & Tans SJ (2007). Empirical fitness landscapes reveal accessible evolutionary paths. Nature 445, 383–386.
- 6. Kinney JB & McCandlish DM (2019). Massively Parallel Assays and Quantitative Sequence-Function Relationships. Annu. Rev. Genomics Hum. Genet 20, 99–127.
- 7. McCandlish DM (2011). Visualizing fitness landscapes. Evolution 65, 1544–1558.
- 8. Otwinowski J, McCandlish DM & Plotkin JB (2018). Inferring the shape of global epistasis. Proc. Natl. Acad. Sci. U. S. A 115, E7550–E7558.
- 9. Fowler DM & Fields S. (2014). Deep mutational scanning: a new style of protein science. Nat. Methods 11, 801–807.
- 10. Zhou J. et al. (2022). Higher-order epistasis and phenotypic prediction. Proc. Natl. Acad. Sci. U. S. A 119, e2204233119.
- 11. Bloom JD & Arnold FH (2009). In the light of directed evolution: pathways of adaptive protein evolution. Proc. Natl. Acad. Sci. U. S. A 106 Suppl 1, 9995–10000.
- 12. Weinreich DM, Delaney NF, Depristo MA & Hartl DL (2006). Darwinian evolution can follow only very few mutational paths to fitter proteins. Science 312, 111–114.
- 13. Campbell E. et al. (2016). The role of protein dynamics in the evolution of new enzyme function. Nature Chemical Biology 12, 944–950.
- 14. Jackson CJ et al. (2009). Conformational sampling, catalysis, and evolution of the bacterial phosphotriesterase. Proc. Natl. Acad. Sci. U. S. A 106, 21631–21636.
- 15. Kaczmarski JA et al. (2020). Altered conformational sampling along an evolutionary trajectory changes the catalytic activity of an enzyme. Nat. Commun 11, 5945.
- 16. Nahum JR et al. (2015). A tortoise-hare pattern seen in adapting structured and unstructured populations suggests a rugged fitness landscape in bacteria. Proc. Natl. Acad. Sci. U. S. A 112, 7530–7535.
- 17. Hayashi Y. et al. (2006). Experimental rugged fitness landscape in protein sequence space. PLoS One 1, e96.
- 18. Wu NC, Dai L, Olson CA, Lloyd-Smith JO & Sun R. (2016). Adaptation in protein fitness landscapes is facilitated by indirect paths. Elife 5:e16965.
- 19. Rodrigues JV et al. (2016). Biophysical principles predict fitness landscapes of drug resistance. Proc. Natl. Acad. Sci. U. S. A 113, E1470–8.
- 20. Flynn JM et al. (2022). Comprehensive fitness landscape of SARS-CoV-2 Mpro reveals insights into viral resistance mechanisms. Elife 11:e77433.
- 21. Louie RHY, Kaczorowski KJ, Barton JP, Chakraborty AK & McKay MR (2018). Fitness landscape of the human immunodeficiency virus envelope protein that is targeted by antibodies. Proc. Natl. Acad. Sci. U. S. A 115, E564–E573.
- 22. Romero PA, Krause A. & Arnold FH (2013). Navigating the protein fitness landscape with Gaussian processes. Proc. Natl. Acad. Sci. U. S. A 110, E193–E201.
- 23. Gonzalez Somermeyer L. et al. (2022). Heterogeneity of the GFP fitness landscape and data-driven protein design. Elife 11:e75842.
- 24. Spence MA, Kaczmarski JA, Saunders JW & Jackson CJ (2021). Ancestral sequence reconstruction for protein engineers. Curr. Opin. Struct. Biol 69, 131–141.
- 25. Harms MJ & Thornton JW (2010). Analyzing protein structure and function using ancestral gene reconstruction. Curr. Opin. Struct. Biol 20, 360–366.
- 26. Clifton BE et al. (2018). Evolution of cyclohexadienyl dehydratase from an ancestral solute-binding protein. Nat. Chem. Biol 14, 542–547.
- 27. Anderson DW, McKeown AN & Thornton JW (2015). Intermolecular epistasis shaped the function and evolution of an ancient transcription factor and its DNA binding sites. Elife 4:e07864.
- 28. Bridgham JT, Ortlund EA & Thornton JW (2009). An epistatic ratchet constrains the direction of glucocorticoid receptor evolution. Nature 461, 515–519.
- 29. Albà MM (2017). Zinc-finger domains in metazoans: evolution gone wild. Genome Biol 18, 168.
- 30. Blackburn MC, Petrova E, Correia BE & Maerkl SJ (2016). Integrating gene synthesis and microfluidic protein analysis for rapid protein engineering. Nucleic Acids Res. 44, e68.
- 31. Nguyen CC & Saier MH Jr. (1995). Phylogenetic, structural and functional analyses of the LacI-GalR family of bacterial transcription factors. FEBS Lett. 377, 98–102.
- 32. Monod J, Wyman J. & Changeux J-P (1965). On the nature of allosteric transitions: A plausible model. J. Mol. Biol 12, 88–118.
- 33. Marklund E. et al. (2022). Sequence specificity in DNA binding is mainly governed by association. Science 375, 442–445.
- 34. Otwinowski J. & Nemenman I. (2013). Genotype to phenotype mapping and the fitness landscape of the E. coli lac promoter. PLoS One 8, e61570.
- 35. Zuo Z, Chang Y. & Stormo GD (2015). A quantitative understanding of lac repressor’s binding specificity and flexibility. Quant. Biol 3, 69–80.
- 36. Barnes SL, Belliveau NM, Ireland WT, Kinney JB & Phillips R. (2019). Mapping DNA sequence to transcription factor binding energy in vivo. PLoS Comput. Biol 15, e1006226.
- 37. Garruss AS, Collins KM & Church GM (2021). Deep representation learning improves prediction of LacI-mediated transcriptional repression. Proc. Natl. Acad. Sci. U. S. A 118, e2022838118.
- 38. Fukami-Kobayashi K, Tateno Y. & Nishikawa K. (2003). Parallel evolution of ligand specificity between LacI/GalR family repressors and periplasmic sugar-binding proteins. Mol. Biol. Evol 20, 267–277.
- 39. Spence MA, Mortimer MD, Buckle AM, Minh BQ & Jackson CJ (2021). A Comprehensive Phylogenetic Analysis of the Serpin Superfamily. Mol. Biol. Evol 38, 2915–2929.
- 40. Burnim AA, Spence MA, Xu D, Jackson CJ & Ando N. (2022). Comprehensive phylogenetic analysis of the ribonucleotide reductase family reveals an ancestral clade. Elife 11:e79790.
- 41. Joho Y. et al. (2023). Ancestral Sequence Reconstruction Identifies Structural Changes Underlying the Evolution of Ideonella sakaiensis PETase and Variants with Improved Stability and Activity. Biochemistry 62, 437–450.
- 42. Eick GN, Bridgham JT, Anderson DP, Harms MJ & Thornton JW (2017). Robustness of Reconstructed Ancestral Protein Functions to Statistical Uncertainty. Mol. Biol. Evol 34, 247–261.
- 43. Shimodaira H. (2002). An approximately unbiased test of phylogenetic tree selection. Syst. Biol 51, 492–508.
- 44. Meinhardt S. et al. (2012). Novel insights from hybrid LacI/GalR proteins: family-wide functional attributes and biologically significant variation in transcription repression. Nucleic Acids Res. 40, 11139–11154.
- 45. Shis DL, Hussain F, Meinhardt S, Swint-Kruse L. & Bennett MR (2014). Modular, multi-input transcriptional logic gating with orthogonal LacI/GalR family chimeras. ACS Synth. Biol 3, 645–651.
- 46. Clifton BE & Jackson CJ (2016). Ancestral Protein Reconstruction Yields Insights into Adaptive Evolution of Binding Specificity in Solute-Binding Proteins. Cell Chemical Biology 23, 236–245.
- 47. Yang Z. (2007). PAML 4: Phylogenetic Analysis by Maximum Likelihood. Mol. Biol. Evol 24, 1586–1591. doi: 10.1093/molbev/msm088.
- 48. Paradis E. & Schliep K. ape 5.0: an environment for modern phylogenetics and evolutionary analyses in R. Bioinformatics 35, 526–528.
- 49. Kaczmarski JA et al. (2020). Altered conformational sampling along an evolutionary trajectory changes the catalytic activity of an enzyme. bioRxiv doi: 10.1101/2020.02.03.932491.
- 50. Castro E. et al. (2022). Transformer-based protein generation with regularized latent space optimization. Nat. Mach. Intell 4, 840–851.
- 51. Daković M, Stanković L. & Sejdić E. (2019). Local Smoothness of Graph Signals. Math. Probl. Eng 2019.
- 52. Starr TN et al. (2022). ACE2 binding is an ancestral and evolvable trait of sarbecoviruses. Nature 603, 913–918.
- 53. Swint-Kruse L. & Matthews KS (2009). Allostery in the LacI/GalR family: variations on a theme. Curr. Opin. Microbiol 12, 129–137.
- 54. Spronk CA et al. (1999). The solution structure of Lac repressor headpiece 62 complexed to a symmetrical lac operator. Structure 7, 1483–1492.
- 55. Lehming N. et al. (1987). The interaction of the recognition helix of lac repressor with lac operator. EMBO J. 6, 3145–315.
- 56. Daber R. & Lewis M. (2009). Towards evolving a better repressor. Protein Engineering, Design and Selection 22, 673–683.
- 57. Starr TN & Thornton JW (2016). Epistasis in protein evolution. Protein Sci. 25, 1204–1218.
- 58. Milk L, Daber R. & Lewis M. (2010). Functional rules for lac repressor–operator associations and implications for protein–DNA interactions. Protein Science 19, 1162–1172.
- 59. Pillai AS et al. (2020). Origin of complexity in haemoglobin evolution. Nature 581, 480–485.
- 60. Yang G. et al. (2019). Higher-order epistasis shapes the fitness landscape of a xenobiotic-degrading enzyme. Nat. Chem. Biol 15, 1120–1128.
- 61. Kullback S. & Leibler RA (1951). On Information and Sufficiency. Ann. Math. Stat 22, 79–86.
- 62. Sadler JR, Sasmor H. & Betz JL (1983). A perfectly symmetric lac operator binds the lac repressor very tightly. Proc. Natl. Acad. Sci. U. S. A 80, 6785–6789.
- 63. Liao Q. et al. (2019). Long time-scale atomistic simulations of the structure and dynamics of transcription factor-DNA recognition. J. Phys. Chem. B 123, 3576–3590.
- 64. Glasgow A, Hobbs HT, Perry ZR, Marqusee S. & Kortemme T. (2021). Ligand-induced changes in dynamics mediate long-range allostery in the lac repressor. bioRxiv doi: 10.1101/2021.11.30.470682.
- 65. Kalodimos CG et al. (2002). Plasticity in protein-DNA recognition: lac repressor interacts with its natural operator O1 through alternative conformations of its DNA-binding domain. EMBO J. 21, 2866–2876.
- 66. Zuo Z. & Stormo GD (2014). High-resolution specificity from DNA sequencing highlights alternative modes of Lac repressor binding. Genetics 198, 1329–1343.
- 67. Stadler PF (1996). Landscapes and their correlation functions. J. Math. Chem 20, 1–45.
- 68. Stadler PF (2007). Fitness landscapes. In Biological Evolution and Statistical Physics 183–204 (Springer Berlin Heidelberg, 2007).
- 69. Chicano F, Whitley LD & Alba E. (2011). A methodology to find the elementary landscape decomposition of combinatorial optimization problems. Evol. Comput 19, 597–63.
- 70. Igler C, Lagator M, Tkačik G, Bollback JP & Guet CC (2018). Evolutionary potential of transcription factors for gene regulatory rewiring. Nat Ecol Evol 2, 1633–1643.
- 71. Aguilar-Rodríguez J, Payne JL & Wagner A. (2017). A thousand empirical adaptive landscapes and their navigability. Nature Ecology & Evolution 1, 1–9.
- 72. Hadzipasic A. et al. (2020). Ancient origins of allosteric activation in a Ser-Thr kinase. Science 367, 912–917.
- 73. Kaltenbach M. et al. (2018). Evolution of chalcone isomerase from a noncatalytic ancestor. Nat. Chem. Biol 14, 548–555.
- 74. Bar-Rogovsky H, Hugenmatter A. & Tawfik DS (2013). The evolutionary origins of detoxifying enzymes. J. Biol. Chem 288, 23914–23927.
- 75. Castro-Fernandez V. et al. (2017). Reconstructed ancestral enzymes reveal that negative selection drove the evolution of substrate specificity in ADP-dependent kinases. J. Biol. Chem 292, 15598–15610.
- 76. Markin CJ et al. (2021). Revealing enzyme functional architecture via high-throughput microfluidic enzyme kinetics. Science 373, eabf8761.
- 77. Ben-David M. et al. (2020). Enzyme evolution: An epistatic ratchet versus a smooth reversible transition. Mol. Biol. Evol 37, 1133–1147.
- 78. Capra EJ, Perchuk BS, Skerker JM & Laub MT (2012). Adaptive mutations that prevent crosstalk enable the expansion of paralogous signaling protein families. Cell 150, 222–232.
- 79. Zarrinpar A, Park S-H & Lim WA (2003). Optimization of specificity in a cellular protein interaction network by negative selection. Nature 426, 676–680.
- 80. Friedlander T, Prizak R, Guet CC, Barton NH & Tkačik G. (2016). Intrinsic limits to gene regulation by global crosstalk. Nat. Commun 7, 12307.
- 81. Friedlander T, Prizak R, Barton NH & Tkačik G. (2017). Evolution of new regulatory functions on biophysically realistic fitness landscapes. Nat. Commun 8, 216.
- 82. Liu Q. et al. (2018). Ancient mechanisms for the evolution of the bicoid homeodomain’s function in fly development. Elife 7:e34594.
- 83. Starr TN, Picton LK & Thornton JW (2017). Alternative evolutionary histories in the sequence space of an ancient protein. Nature 549, 409–413.
- 84. McKeown AN et al. (2014). Evolution of DNA specificity in a transcription factor family produced a new gene regulatory module. Cell 159, 58–68.
- 85. Srivastava M. & Payne JL (2022). On the incongruence of genotype-phenotype and fitness landscapes. PLoS Comput. Biol 18, e1010524.
- 86. Park Y, Metzger BPH & Thornton JW (2022). Epistatic drift causes gradual decay of predictability in protein evolution. Science 376, 823–830.
- 87. Yona AH, Alm EJ & Gore J. (2018). Random sequences rapidly evolve into de novo promoters. Nat. Commun 9, 1530.
- 88. Lagator M. et al. (2022). Predicting bacterial promoter function and evolution from random sequences. Elife 11, e64543.
- 89. Vaishnav ED et al. (2022). The evolution, evolvability and engineering of gene regulatory DNA. Nature 603, 455–463.
- 90. Páll S. et al. (2020). Heterogeneous parallelization and acceleration of molecular dynamics simulations in GROMACS. J. Chem. Phys 153, 134110.
- 91. Maier JA et al. (2015). Ff14SB: Improving the accuracy of protein side chain and backbone parameters from ff99SB. J. Chem. Theory Comput. 11, 3696–3713.
- 92. Ivani I. et al. (2016). Parmbsc1: a refined force field for DNA simulations. Nat. Methods 13, 55–58.
- 93. Fu L, Niu B, Zhu Z, Wu S. & Li W. (2012). CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 28, 3150–3152.
- 94. Notredame C, Higgins DG & Heringa J. (2000). T-Coffee: A novel method for fast and accurate multiple sequence alignment. J. Mol. Biol 302, 205–217.
- 95. Nguyen L-T, Schmidt HA, von Haeseler A. & Minh BQ (2015). IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Mol. Biol. Evol 32, 268–274.
- 96. Le SQ & Gascuel O. (2008). An improved general amino acid replacement matrix. Mol. Biol. Evol 25, 1307–1320.
- 97. Kalyaanamoorthy S, Minh BQ, Wong TKF, von Haeseler A. & Jermiin LS (2017). ModelFinder: fast model selection for accurate phylogenetic estimates. Nat. Methods 14, 587–589.
- 98. Hoang DT, Chernomor O, von Haeseler A, Minh BQ & Vinh LS (2018). UFBoot2: Improving the Ultrafast Bootstrap Approximation. Mol. Biol. Evol 35, 518–522. doi: 10.1093/molbev/msx281.
- 99. Lutz R. & Bujard H. (1997). Independent and tight regulation of transcriptional units in Escherichia coli via the LacR/O, the TetR/O and AraC/I1-I2 regulatory elements. Nucleic Acids Res. 25, 1203–1210.
- 100. Datsenko KA & Wanner BL (2000). One-step inactivation of chromosomal genes in Escherichia coli K-12 using PCR products. Proc. Natl. Acad. Sci. U. S. A 97, 6640–6645.
- 101. Zhang J, Kobert K, Flouri T. & Stamatakis A. (2014). PEAR: a fast and accurate Illumina Paired-End reAd mergeR. Bioinformatics 30, 614–620.
- 102. Bell CE & Lewis M. (2000). A closer view of the conformation of the Lac repressor bound to operator. Nat. Struct. Biol 7, 209–214.
- 103. Delgado J, Radusky LG, Cianferoni D. & Serrano L. (2019). FoldX 5.0: working with RNA, small molecules and a new graphical interface. Bioinformatics 35, 4168–4169.
- 104. Lu X-J & Olson WK (2008). 3DNA: a versatile, integrated software system for the analysis, rebuilding and visualization of three-dimensional nucleic-acid structures. Nat. Protoc 3, 1213–1227.
- 105. Hess B, Bekker H, Berendsen HJC & Fraaije JGEM (1997). LINCS: A linear constraint solver for molecular simulations. J. Comput. Chem 18, 1463–1472.
- 106. Bodenhofer U, Bonatesta E, Horejš-Kainrath C. & Hochreiter S. (2015). msa: an R package for multiple sequence alignment. Bioinformatics 31, 3997–3999.
Data Availability Statement
Processed data used in this study were deposited in GitHub (https://github.com/raman-lab/AncLacI; Zenodo: https://doi.org/10.5281/zenodo.10652076). Raw data obtained in this study were deposited in Zenodo (https://doi.org/10.5281/zenodo.7574310).
Code used in this study has been uploaded with documentation to GitHub and is publicly available at https://github.com/raman-lab/AncLacI (Zenodo: https://doi.org/10.5281/zenodo.10652076).
Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request.






