Proceedings of the National Academy of Sciences of the United States of America. 2021 Jun 29;118(27):e2022838118. doi: 10.1073/pnas.2022838118

Deep representation learning improves prediction of LacI-mediated transcriptional repression

Alexander S Garruss a,b,c,1, Katherine M Collins b,d, George M Church a,b,c
PMCID: PMC8271634  PMID: 34187888

Significance

The understanding of protein function increases with new experimental and evolutionary datasets. A major challenge is to apply machine learning to these datasets to capture essential features of protein function. Here, we analyze the experimentally determined repression function for tens of thousands of mutants of the LacI protein. This study provides a continuous, noncategorical repression value across a majority of all single mutations and for thousands of higher-order mutations. To develop a top-performing model for the prediction of repression by LacI, we compare several leading variant effect prediction algorithms. A deep representation learning paradigm, first trained across millions of proteins from all known protein families and then fine-tuned using LacI experimental data, offers the highest predictive performance of repression function.

Keywords: machine learning, lac repressor, deep representation learning

Abstract

Recent progress in DNA synthesis and sequencing technology has enabled systematic studies of protein function at a massive scale. We explore a deep mutational scanning study that measured the transcriptional repression function of 43,669 variants of the Escherichia coli LacI protein. We analyze structural and evolutionary aspects that relate to how the function of this protein is maintained, including an in-depth look at the C-terminal domain. We develop a deep neural network to predict transcriptional repression mediated by the lac repressor of Escherichia coli using experimental measurements of variant function. When measured across 10 separate training and validation splits using 5,009 single mutations of the lac repressor, our best-performing model achieved a median Pearson correlation of 0.79, exceeding any previous model. We demonstrate that deep representation learning approaches, first trained in an unsupervised manner across millions of diverse proteins, can be fine-tuned in a supervised fashion using lac repressor experimental datasets to more effectively predict a variant’s effect on repression. These findings suggest a deep representation learning model may improve the prediction of other important properties of proteins.


The regulation of gene expression by proteins plays a fundamental role in biology. A model system for understanding this regulation is the lac repressor (LacI protein) from the bacterium Escherichia coli (1–3). Studies of lac repressor mutants have enabled the mechanism of transcriptional repression to be characterized and have served as a benchmark for computational prediction of protein function. In a landmark study, Markiewicz et al. (4) measured the degree of repression mediated by more than 4,000 protein variants using four repression categories, revealing essential positions of the protein. Poelwijk et al. (5) generated thousands of random coding variants in LacI to study their repression and induction function in varying cellular environments, analyzing constraints on activity and epistatic interactions. Previous studies have also generated thousands of variants in the LacI DNA binding site to study the sequence specificity of repressor binding (6–8). Machine learning has been applied to predict the repression effect of protein variants primarily by classifying variant function using four qualitative categories (9–11), usually further reduced to only two categories of functional or nonfunctional. More recently, Miller et al. (12) compared 16 widely used variant effect prediction tools against 103 variants whose repression function was measured experimentally on a continuous scale and found Pearson correlations up to 0.61. Large-scale DNA synthesis and high-throughput DNA sequencing were recently used to systematically characterize tens of thousands of mutations in the lac repressor, including single, double, and higher-order mutations, to decipher constraints on LacI inducer specificity (13).
Here, we focus on a collection of lac repressor variants to understand the contribution of mutation to basal repression function, the contribution of epistasis to repression, and the effectiveness of high-performance computer models to predict continuous repressor function values. To measure the depth and accuracy of our understanding of this protein, we developed and measured the performance of a model that takes as input a sequence variant of LacI and outputs a prediction for its repression value.

LacI Variant Effect on Repression

To evaluate models for prediction of lac repressor function, we first reanalyzed the raw sequencing data from the Taylor et al. (13) study, which was based on the computational design of thousands of repressor variants and their synthesis from DNA oligonucleotides. Variants consisted of individual regions of 36 amino acids (tiles) spanning the coding sequence with primarily single, double, and triple mutations, as well as other variants with three mutagenized regions that mostly contained more than three mutations. Tiles were assembled into variant constructs, and the variants were transformed as a library and used in a negative selection (Fig. 1A). While neutral variants repressed transcription of a colicin E1 importer gene (tolC), loss-of-function mutants no longer repressed the importer gene and led to slow growth or death when the colicin E1 toxin was present. The library of variants was sampled by next-generation sequencing before and after addition of the toxin. Each tile was separately processed by computing a fold change from a postselection value divided by a preselection one and then scaled such that >95% of the variants had a repression value between zero (severe loss of repression) and one (neutral; wild type) (Fig. 1B). The limits were chosen to minimize the effect of outliers and to enable comparison of values across tiles (SI Appendix, Fig. S1). In total, there were 3.7 × 10^6 variants after filtering out deletions and insertions. We further filtered using a sequencing read count threshold, resulting in a collection of 43,669 variants that we further explored.
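The per-tile scoring described above can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the pseudocount, the choice of the 2.5th and 97.5th percentiles as scaling anchors, and the function names are all assumptions, since the text states only that more than 95% of variants fall between zero and one after scaling.

```python
import math


def log2_fold_change(pre_count, post_count, pre_total, post_total, pseudocount=1.0):
    """log2 of the postselection frequency over the preselection frequency.

    The pseudocount (an assumption, not taken from the study) avoids log(0)
    for variants that drop out after selection.
    """
    pre_freq = (pre_count + pseudocount) / pre_total
    post_freq = (post_count + pseudocount) / post_total
    return math.log2(post_freq / pre_freq)


def scale_tile(log_fcs, lower_q=0.025, upper_q=0.975):
    """Linearly rescale one tile so the chosen quantiles map to 0 and 1.

    The quantile anchors are hypothetical; they are one way to keep
    roughly 95% of variants inside [0, 1] while limiting outlier influence.
    """
    ordered = sorted(log_fcs)
    lo = ordered[int(lower_q * (len(ordered) - 1))]
    hi = ordered[int(upper_q * (len(ordered) - 1))]
    if hi == lo:
        raise ValueError("tile has no spread between the chosen quantiles")
    return [(x - lo) / (hi - lo) for x in log_fcs]
```

Scaling each tile separately, as in this sketch, is what allows values from different tiles to be compared on a common scale.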

Fig. 1.


A deep mutational scanning experiment measuring transcriptional repression by LacI. (A) Experimental overview from the Taylor et al. (13) study. A total of 16 tiles were synthesized, spanning portions of the LacI coding region. Each tile was separately cloned, transformed, and subjected to a selection favoring functional repression. Nonfunctional repressor variants do not survive the selection. Deep sequencing of the input variant pool and postselection variant pool allowed for counting each variant’s frequency and the calculation of a fold change after selection. This fold change (log2) is scaled to approximately [0,1] and referred to as the repression value. (B) The overall distribution of variant repression values and sequencing read coverage. The dashed line corresponds to 10 reads in the input condition, the minimum needed to be considered for downstream analyses. Variants were also included in downstream analysis if they were counted at least once in the input condition and at least 10 times in the postselection condition. (C) Thousands of protein variants were replicated by at least two synonymous coding variants with at least 10 read counts prior to selection. There is a strong correlation between the synonymous replicates (Pearson correlation = 0.85; SI Appendix, Fig. S2). (D) Several hundred control sequences confirm the upper and lower assay limits and were used to measure the false discovery rate. Negative control sequences contained more than 20 mutations in a single tile due to frame shifting. Positive control sequences were wild-type synonymous coding variants. Boxplot notches indicate 95% confidence intervals of the median. (E) A comparison of repression values from this study versus the semiquantitative categories of Markiewicz et al. (4). 
Some differences between studies (prior versus current, respectively) are expected due to operator construct (endogenous versus synthetic), fold change of observed range (>100-fold versus 10-fold), amount of LacI expressed (single copy from genome versus expression from low-copy plasmid), and other differences affecting overall sensitivity. Despite these differences, each category is significantly different from the other by a Kolmogorov–Smirnov test, P < 0.0015 (SI Appendix, Fig. S3). (F) The distribution of protein variant repression values for 43,669 variants used in downstream analyses according to the number of mutations away from the wild-type sequence. In addition to ignoring lowly sequenced variants, we also removed variants with more than five consecutive mutations. The number shown next to each histogram is the total number of protein variants that met the minimum read threshold in either condition. The value in parentheses is the number of variants meeting the threshold in the input condition only (shown in the corresponding histogram).

To measure the experimental error, we performed a replicate analysis using protein variants whose repression value was independently measured from at least two synonymous coding variants (Fig. 1C and SI Appendix, Fig. S2). We selected the top two most abundant synonymous variants in the input condition and compared synonymous variant repression values if each coding variant was found a minimum of 10 times in the input condition. For 2,469 such cases the repression values had a Pearson correlation of 0.85, indicating a robust agreement across the full spectrum of repression values (SI Appendix, Fig. S2). We compared two sets of control sequences (Fig. 1D). As a positive control we gathered synonymous variants for wild-type protein sequences, which were nearly all functional (mean = 0.994; standard deviation = 0.200; n = 313). As a negative control, variants with more than 20 mutations from wild type were tightly packed at the nonfunctional value of zero (mean = 0.012; standard deviation = 0.109; n = 256). Wild-type synonymous variants were significantly distinct from nonfunctional variants (Welch two-sample t test, P = 1.56e-277).
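The control comparison above can be illustrated from the reported summary statistics alone. The sketch below computes the Welch t statistic and the Welch–Satterthwaite degrees of freedom; the function name is hypothetical, and converting the statistic to a P value would additionally require the t distribution, which is omitted here.

```python
import math


def welch_t_from_summary(mean1, sd1, n1, mean2, sd2, n2):
    """Welch two-sample t statistic and approximate degrees of freedom.

    Works from group means, standard deviations, and sizes, without
    assuming equal variances in the two groups.
    """
    var_of_mean1 = sd1 ** 2 / n1
    var_of_mean2 = sd2 ** 2 / n2
    t_stat = (mean1 - mean2) / math.sqrt(var_of_mean1 + var_of_mean2)
    # Welch–Satterthwaite approximation of the degrees of freedom.
    df = (var_of_mean1 + var_of_mean2) ** 2 / (
        var_of_mean1 ** 2 / (n1 - 1) + var_of_mean2 ** 2 / (n2 - 1)
    )
    return t_stat, df


# Summary statistics quoted in the text for the positive (wild-type
# synonymous) and negative (>20 mutations) control sets.
t_stat, df = welch_t_from_summary(0.994, 0.200, 313, 0.012, 0.109, 256)
```

A t statistic this far from zero on several hundred degrees of freedom is consistent with the extremely small P value reported for the two control sets.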

We compared the results of this study to those of previous large-scale studies of repression values (4, 14) by examining single variants. We found strong agreement between the four qualitative states annotated in a prior study (4) with our continuous value scoring (Fig. 1E). The most extreme categories, “total loss” versus “no loss,” were significantly different using our experimental measurements (P = 1.44e-241, Mann–Whitney test). The two intermediate categories, “major loss” versus “some loss,” were significantly distinct from each other (P = 0.01988), as were total loss versus major loss (P = 6.766e-10) and no loss versus some loss (P = 1.698e-60). Repression values for 5,009 single mutations indicated that over 70% were functional (repression value greater than 0.5) (Fig. 1F). The majority of variants with seven mutations were nonfunctional.

Repression Values at the C-Terminal Domain

We generated a visual overview of how the amino acid at each position related to function (Fig. 2A). We then mapped the average repression value for each protein position onto a three-dimensional structure of the protein (Fig. 2B). These repression values recapitulated previously described regions for which mutations were nearly completely neutral (4), as well as essential residues at the DNA interface (15) and the interior core of the protein (16). Here we consider the hydrophobic C-terminal region of the protein in more detail. The selection assay used the pLlacO-1 operator (17) containing two full operator sites allowing for DNA binding of two separate dimers in proximity to one another. The C-terminal domain, often called the tetramerization domain, contains a coiled-coil secondary structure (18) that interfaces among all monomers of the tetramer. The domain was found to be nonessential for repression (19) and was excluded from the large-scale Markiewicz et al. (4) study. To investigate this region further, we visualized the average repression value for each position across all four protein chains in a tetramer conformation (Fig. 2C). Variants outside of the coiled-coil interface were neutral, whereas variants at positions facing inward and at the interface between monomer chains were deleterious to function (Fig. 2D).

Fig. 2.


Repression values for single mutations. (A) Heatmap showing repression values for single mutations across all positions and for 20 amino acids. Gray values indicate missing or lowly sequenced variants. Black dots indicate the wild-type sequence identities. Values not covered in this study but reported by Markiewicz et al. (4) are shown using the corresponding median value from Fig. 1F. (B and C) The LacI monomer shown with the C-terminal region oriented upward (B) and colored as in A and for the LacI tetramer conformation (C). (D) A zoom-in of the C-terminal tetramerization domain. (E) Simulation and energy scoring using in silico molecular modeling of mutations. Mutations that cause a gain in free energy (log10 ΔΔG Rosetta energy units) are shown for the DNA bound monomer conformation versus the DNA bound tetramer conformation. (F) Repression values compared to the free-energy unit increase. x-axis values are plotted continuously and shown with overlaid boxplots that were binarized at repression value <0.5; notches surrounding the median line indicate the 95% confidence interval of the median.

We asked whether molecular modeling and simulation could help describe the differential effect of mutations at these positions. We used a theoretical model of the tetramer configuration previously assembled and equilibrated (20). We parameterized the DNA molecule as a ligand and constructed wild-type configurations for a DNA-bound monomer and a DNA-bound tetramer, respectively, for use in the Rosetta molecular modeling suite (21). We generated mutant structures for all single variants and measured the change in free energy compared to the wild-type configuration (ΔΔG). We considered the magnitude of free-energy changes, depending on the monomer and tetramer configuration (Fig. 2E). We found variants that showed a high increase in free energy for the tetramer configuration but a low increase in the monomer configuration, such as at positions 343 and 350. Variants at these positions had a severe loss in repressor function. A proximal position, 341, had variants that were low in free-energy increase for both monomer and tetramer forms. Variants at position 341 were experimentally measured as neutral variants, supporting the idea that changes within, but not at, the C-terminal coiled-coil interface do not create a destabilizing effect on the complex. For context, changes at position 252, which is not in the C-terminal domain, were the most destabilizing in both configurations and also uniformly led to severe loss of repression. We compared all repression values to the predicted (ΔΔG) value after mutation (Fig. 2F) for variants with an increase in free energy in the tetramer configuration. Although some distinction between function and change in free energy is apparent at the extremes, protein variants had a wide range of intermediate changes to free energy which could not generally distinguish variant function. 
Given the observation that large differences in predicted (ΔΔG) values occur between the monomer and tetramer configurations, we hypothesized that the values taken from the tetramer configuration would be more predictive of variant effect on repression than values using the monomer configuration.

Epistasis in Multiple Mutations

The use of evolutionary conservation is another paradigm to potentially best explain LacI variant effect on repression. To assess the utility of measures of evolutionary conservation, we assembled a large collection of paralogous sequences and built a position-independent model and an evolutionary coupling model, which utilizes covariation across positions in the alignment, with the EVCouplings (22, 23) package. We hypothesized that the evolutionary model utilizing position covariation would outperform the position-independent model if a high degree of covariation is present both within the experimental dataset and within the paralogous sequence alignment. To determine whether covariation was present among positions in the experimental dataset, we considered calculations of epistasis using the experimental repression values. We used a log-additive scoring method that predicts multivariants from their independent repression values (24). We focused on double mutations in the experiment that contained measurements from underlying single mutations (Fig. 3A). To determine epistasis we calculated the difference between each double mutant’s repression value and the expected value based on the additive model. Double mutants displaying strong epistasis have been reported to significantly overlap with positions making structural contact, as seen in several deep mutational scanning studies (25, 26). We visualized the top 18, or half the tile length (L/2), most excessive deviations from the additive model expectation (Fig. 3B) and found a significant overlap of strong epistatic position pairs with long-range structural contacts in four tiles (SI Appendix, Fig. S4). 
To calculate significance from a random expectation of overlap, we compared the number of highly epistatic pairs overlapping with known structural contacts less than 8 Å away in physical space, but more than five amino acids away in sequence space, by counting overlaps from randomly sampling without replacement epistasis values across pairs. We performed this sampling 100,000 times to generate an empirical cumulative distribution of overlap frequency. We then used this distribution to compute a probability value for the experimentally observed overlap frequency. The use of a stricter read count threshold of 20 for each underlying single and double mutation, as well as varying the number of top pairs chosen, also showed statistically significant overlap with known structure (SI Appendix, Figs. S5 and S6), suggesting epistasis is present within the experimental dataset.
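The epistasis score and the shuffling test described above can be sketched as follows. This is an illustration under stated assumptions rather than the study's implementation: it assumes repression values already sit on a scaled log fold-change axis with wild type at 1, so that single-mutant deficits add directly; it assumes one epistasis score per position pair; and all function names are hypothetical.

```python
import random


def additive_expectation(single_a, single_b, wild_type=1.0):
    """Expected double-mutant value if the two single-mutant deficits add."""
    return wild_type - ((wild_type - single_a) + (wild_type - single_b))


def epistasis(double_observed, single_a, single_b):
    """Positive when the double mutant represses better than expected."""
    return double_observed - additive_expectation(single_a, single_b)


def is_long_range_contact(pos_i, pos_j, distance_angstrom,
                          max_distance=8.0, min_separation=5):
    """Closer than 8 Å in space but more than five residues apart in sequence."""
    return distance_angstrom < max_distance and abs(pos_i - pos_j) > min_separation


def top_pair_overlap(scores, contacts, k):
    """Count how many of the k highest-scoring position pairs are contacts."""
    top = sorted(scores, key=scores.get, reverse=True)[:k]
    return sum(pair in contacts for pair in top)


def permutation_p_value(scores, contacts, k, n_shuffles=1000, seed=0):
    """Empirical P value for the observed overlap of top pairs with contacts.

    Epistasis values are shuffled across the same set of position pairs,
    and the overlap is recounted for each shuffle.
    """
    rng = random.Random(seed)
    observed = top_pair_overlap(scores, contacts, k)
    pairs = list(scores)
    values = list(scores.values())
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(values)
        if top_pair_overlap(dict(zip(pairs, values)), contacts, k) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_shuffles + 1)
```

For a 36-residue tile, `k` would be 18 (L/2), and `n_shuffles` would be raised to 100,000 to match the sampling depth reported in the text.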

Fig. 3.


Exploration of epistasis in double mutants. (A) The experimentally measured repression value for double mutants compared to the additive model predicted value from underlying single-mutation repression values. (B) Excessive positive epistasis significantly overlaps with long-range structural contacts. Positive epistasis occurs when the experimentally measured value is more than the additive expectation. Upper triangular area shows the top (positive) and bottom 18 (L/2) most epistatic pairs, as well as the top L/2 evolutionary coupled pairs. Each tile was compared separately. Lower triangular area shows the frequency of double mutations analyzed at each position pair. We assembled an empirical random distribution of structural contact overlaps with top L/2 position pairs by shuffling the epistasis values among the same double-mutation identities and counting the occurrence of overlaps for 100,000 shuffles.

Strong evolutionary coupling detected from covarying positions in a sequence alignment also frequently overlaps with structural contacts (27). To determine whether the lac repressor sequence alignment contains such a signature, we considered the top L/2 most evolution-coupled position pairs for each tile. We found the overlap of these top couplings with long-range contacts to be highly significant (Fig. 3B and SI Appendix, Fig. S6) by comparing the top couplings and frequency of structural contact overlap to a distribution of overlap frequency generated by randomly shuffling the rank order of coupling values in each tile. We additionally compared a variational autoencoder, DeepSequence, to measure higher-order couplings beyond pairwise within the local evolutionary record, which has been shown to improve function prediction in several other deep mutational scanning studies (28).

Prediction of Repression

To investigate how effectively we could model LacI-mediated transcriptional repression, we established a 10-fold cross-validation paradigm that measured the agreement between predictions and experimentally derived repression values. All 5,009 single mutations from this study were split 10 separate times into a 90% training set and a 10% validation set. We collected several metrics about performance, such as the Pearson and Spearman correlations, and mean-square error of predictions versus experimentally measured repression values.
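The evaluation scheme above can be sketched in plain Python. This is one reading of the description, assuming a standard 10-fold partition in which every variant appears in exactly one validation set; the function names are illustrative, and any model can be plugged in between the split and the metrics.

```python
import math
import random


def k_fold_splits(n_items, k=10, seed=0):
    """Yield (train_indices, validation_indices) for a shuffled k-fold partition."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        train = [j for f, fold in enumerate(folds) if f != held_out for j in fold]
        yield train, folds[held_out]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            result[order[pos]] = mean_rank
        start = end + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


def mean_square_error(predicted, measured):
    """Average squared difference between predictions and measurements."""
    return sum((p - m) ** 2 for p, m in zip(predicted, measured)) / len(measured)
```

With 5,009 single mutations and `k=10`, each validation fold holds 500 or 501 variants, matching the 90%/10% split described above; the median of each metric across folds is then reported.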

In addition to theoretical energy values and measures of conservation, we considered several other model baselines. We include baseline linear models, called residue-embedding models: two-layer neural networks that take a full-length protein sequence as input and jointly learn an amino acid embedding layer followed by a single, fully connected, linear output layer for repression value prediction. The simplest version of such a model, with an embedding-layer output shape of (360,1), is simply the average amino acid effect multiplied by the average effect per position of the training set repression values. An embedding layer output shape of (360,10), for example, has the effect of learning an optimized embedding reducing the amino acid dimensions from 20 to 10.
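The forward pass of such a residue-embedding baseline can be sketched as below. This is a hedged illustration only: the initialization, the dictionary layout, and the function names are assumptions, and training (fitting the embedding and output weights to repression values) is omitted.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def init_model(seq_len=360, embed_dim=10, seed=0):
    """Randomly initialized parameters: a shared embedding plus a linear output layer."""
    rng = random.Random(seed)
    embedding = {aa: [rng.gauss(0, 0.1) for _ in range(embed_dim)]
                 for aa in AMINO_ACIDS}
    # One weight vector per position; together these act on the flattened
    # (seq_len, embed_dim) embedding output.
    weights = [[rng.gauss(0, 0.1) for _ in range(embed_dim)]
               for _ in range(seq_len)]
    return {"embedding": embedding, "weights": weights, "bias": 0.0}


def predict(model, sequence):
    """Embed each residue, then apply the linear output layer."""
    total = model["bias"]
    for position, aa in enumerate(sequence):
        vector = model["embedding"][aa]
        total += sum(w * v for w, v in zip(model["weights"][position], vector))
    return total
```

With `embed_dim=1`, each term in the sum is a single per-amino-acid scalar multiplied by a single per-position weight, which is the sense in which the (360,1) model factorizes into an amino acid effect and a position effect. Larger values of `embed_dim` let each amino acid carry several learned properties that positions can weight differently.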

We also considered deep representation learning approaches inspired by natural language processing that combine unsupervised pretraining with task-specific fine-tuning. Models attempt to automatically learn features and patterns of protein sequences by observing millions of natural examples. An end-to-end differentiable model can reduce the sequence dimensionality in an automatic, unsupervised manner to learn general patterns and features of protein sequence. Such models, for example, have been shown to develop automatic pattern detection of protein secondary structure (29), thermostability (30), and localization (31). A so-called sequence-to-sequence approach (32) using the UniRep implementation was pretrained using 24 million diverse protein sequences (29). UniRep takes as input a protein sequence and performs a 10-dimensional residue embedding, which is then passed through a recurrent 1,900-cell multiplicative-long short-term memory (mLSTM) layer to predict the next amino acid in the sequence. To perform the next amino acid prediction task effectively, the weights of the model develop the capacity to detect multiscale patterns sequential to protein sequences. After pretraining, a new query sequence can be input to the model to generate a protein representation corresponding to the average output of each cell in the recurrent layer as the model sequentially passes over the query. The 1,900-dimensional representation of each input sequence was used for downstream regression and visualization. The bidirectional encoder representations from transformers (BERT) model (30, 33) first trains a predictor to accurately fill in randomly hidden residues from a sequence using all surrounding residues. We used a pretrained BERT model (30) to generate average attention-weighted representations for each sequence after every identity was hidden once. This representation was 768 dimensions per sequence and fine-tuned for repression value prediction. 
A document-to-vector (34) approach for protein sequence representations was also used (31) from pretrained 64-dimensional embedding layer weights corresponding to each sequence of interest. The embedding layer weights are based on unsupervised pretraining on millions of protein sequences to correctly predict a given k-mer based on several adjacent k-mers. The 64-dimensional representation is then fitted to the repression value through the use of supervised fine-tuning of the top layer.

We fine-tuned protein representations using several different top layers and found an ensemble of decision trees using representations from UniRep performed best at the repression value prediction task, reaching the highest average Pearson and Spearman correlations (Fig. 4A) and the lowest mean-square error (Fig. 4B). The top performance of UniRep was confirmed using single variants with at least 100 read counts preselection (SI Appendix, Fig. S9). As a negative control, we shuffled the repression values across the representations for training and measured the predictive performance; no correlation and high mean-squared error (MSE) were found. We found that Rosetta energy values based on the tetramer configuration were significantly more correlated to repression values than when using a monomer configuration. We also found the independent level of conservation, as measured by the relative frequency in the population of aligned sequences, performed significantly worse than evolutionary coupling methods. The deep evolutionary method using a variational autoencoder was not significantly different from the pairwise coupling method and also an improvement over independent conservation measures. The doc2vec approach (31, 34) improved upon Rosetta energy calculations and independent conservation scores but performed worse than evolutionary coupling scores, variational autoencoder scores, and the two other representation learning approaches.

Fig. 4.


Representation learning improves the prediction of repression. (A) Comparison of correlation between function prediction and repression value using 10-fold cross-validation of all 5,009 single mutations measured in this study. (B) The prediction of repressor values measured by mean-square error of prediction versus experimental value. We performed a similar cross-validation using single mutations with at least 100 read counts preselection (SI Appendix, Fig. S9), confirming the top-performing models at a higher-quality threshold.

Model Explanation and Validation

We investigated the protein variant representations in more detail by collapsing the full set of 43,669 variants into two dimensions using the t-Distributed Stochastic Neighbor Embedding (t-SNE) (35) algorithm from Euclidean distances between each sequence’s 1,900-dimensional value (SI Appendix, Fig. S7). Distance in this two-dimensional space approximates similarity in the higher-dimensional space. We visualized the two-dimensional representation for each single-mutant variant (Fig. 5A) and wondered how the representation reflected repression values. We focused on a grouping of variants in the C-terminal domain that severely affected repression. The positions of C-terminal domain variants were segmented according to both position and amino acid as they related to the C-terminal domain function (Fig. 5B). Neighboring positions in the C-terminal domain containing proline and tryptophan, for example, were more easily separable by a downstream regression task based on the underlying model’s notion of secondary structure and helix composition (29). The t-SNE dimension reduction generally showed grouping by position varied for each of the single mutations but also contained local clusters (Fig. 5C), which diverged from the expected distances for a variety of position and repression value combinations.

Fig. 5.


Exploration of variant representations. A deep representation learning model was pretrained on >25 million protein sequences from all protein families to predict the next amino acid of a random protein sequence fragment (29). After pretraining, the model was used to generate a 1,900-dimensional representation for every sequence of interest in this study by taking the average internal state of the model while evaluating each position of a given input sequence. (A) The 1,900-dimensional representations are clustered and visualized in two dimensions using t-SNE (35). Each sequence is colored by repression value. (B) A zoomed-in view of a region of the representation landscape. Points are labeled for positions 340 and above, for amino acids proline and tryptophan, and colored by repression value. (C) Variant representations of all single mutations colored by the position mutated. (D) For external validation we collected a set of 103 variants from a separate study where repression fold change was experimentally measured and compared to predictions from 16 widely used variant effect prediction tools (12). The highest-performing tool in that study reached a Pearson correlation of 0.61. To compare with the best models identified in our study, we generated representations for each of the 103 sequences and performed regression with an ensemble of decision trees. The median Pearson correlation from a 10-fold cross-validation was 0.74; the median Spearman correlation was 0.733; and the median MSE was 0.803. (E) Using a single train–test split (67 to 33%) of raw signal values from the previous study (12), we used 1,900-dimensional representations for each of 103 sequences and performed a Gaussian process regression. The validation set values are shown versus the model prediction values and colored by the internal standard deviation of the Gaussian process regressor, a metric of the predictor’s uncertainty.

To perform an independent validation of the UniRep model on an external dataset of repression values, we considered the rheostat-12 variant set from a recent study (12). The authors assayed LacI repression with a reporter for beta-galactosidase activity and then compared 16 widely used computational predictors for variant effect prediction. We collected 103 sequence variants and corresponding fold-change values and performed 10-fold cross validation. We used the same pretrained UniRep model used to fine-tune prediction values from our study, but instead fine-tuned using only data from the external source. We measured the Pearson correlation for each cross-validation set to compare performance to the results of the prior computational predictor evaluation. Using deep representations from UniRep and a top layer consisting of an ensemble of decision trees, we achieved a median Pearson correlation of 0.74, exceeding the previously reported correlation of 0.61 (Fig. 5D and SI Appendix, Fig. S8). As a complementary approach to validating our best models of repression, we took a single train–test split (67 to 33%) of raw assay values from the same study (12) and performed a Gaussian process regression (31, 36, 37). We visualized the model predictions and used the standard deviation of the model’s underlying process to measure the uncertainty of the predictor (Fig. 5E).

As a final model comparison across variants with higher-order mutations, we created a training set based on all variants in this study with three or fewer mutations and tested the prediction on all variants with more than three mutations (Table 1 and SI Appendix, Fig. S7). The ensemble UniRep predictor performed the best, as measured by highest Pearson correlation and lowest mean-square error.
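The mutation-count split used for this comparison can be sketched as follows. The helper names and the toy sequences in the usage example are hypothetical; only the threshold of three mutations comes from the text, and the sketch assumes substitution-only variants of the same length as the wild type.

```python
def count_mutations(variant, wild_type):
    """Number of positions at which a variant differs from the wild type."""
    if len(variant) != len(wild_type):
        raise ValueError("variant and wild type must have the same length")
    return sum(a != b for a, b in zip(variant, wild_type))


def split_by_mutation_count(variants, wild_type, max_train_mutations=3):
    """Train on variants with few mutations; test on higher-order variants.

    `variants` is an iterable of (sequence, repression_value) pairs.
    """
    train, test = [], []
    for sequence, value in variants:
        if count_mutations(sequence, wild_type) <= max_train_mutations:
            train.append((sequence, value))
        else:
            test.append((sequence, value))
    return train, test


# Toy usage with a made-up nine-residue "wild type".
wild_type = "ACDEFGHIK"
variants = [("ACDEFGHIK", 1.0), ("AADEFGHIK", 0.9),
            ("AAAAFGHIK", 0.4), ("AAAAAGHIK", 0.1)]
train, test = split_by_mutation_count(variants, wild_type)
```

Because the test set contains only mutation combinations of a higher order than anything seen in training, this split probes extrapolation rather than interpolation.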

Table 1.

Comparison of performance for predicting repression values

Model                          MSE      Pearson   Spearman
MLP (20,10,1)                  0.148    0.584     0.614
MLP (512,1)                    0.125    0.616     0.634*
UniRep (linear)                0.115    0.613     0.591
UniRep (Lasso)                 0.119    0.566     0.546
UniRep (ensemble)              0.107*   0.637*    0.619
BERT (linear)                  0.136    0.538     0.517
BERT (Lasso)                   0.139    0.497     0.471
BERT (ensemble)                0.116    0.609     0.602
Rosetta (tetramer)             0.177    0.164     0.292
Rosetta (monomer)              0.178    0.132     0.196
EVCouplings (independent)      0.153    0.324     0.290
EVCouplings (epistasis)        0.251    0.249     0.215
DeepSequence (score)           0.239    0.231     0.191
One-hot                        0.218    0.574     0.556
One-hot (subnet)               0.194    0.585     0.580
One-hot (logistic)             0.174    0.474     0.542
Residue embedding (360,1)      0.156    0.355     0.316
Residue embedding (360,5)      0.175    0.558     0.542
Residue embedding (360,10)     0.188    0.562     0.555

Training set, ≤3 mutations; test set, >3 mutations. An asterisk indicates the top performance for each metric (lowest MSE, highest correlation). MSE, mean-square error; MLP, multilayer perceptron.
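The evaluation summarized in Table 1 can be sketched as follows: variants are split by mutation count, and predictions on the held-out set are scored by mean-square error, Pearson correlation, and Spearman correlation. This is a minimal illustration rather than the code used in the study; the function names and the `(n_mutations, value)` tuple layout are assumptions made for this example.

```python
from math import sqrt
from statistics import mean

def split_by_mutation_count(variants, max_train_mutations=3):
    """Split (n_mutations, repression_value) tuples into train and test sets.

    The tuple layout is a hypothetical convenience for this sketch.
    """
    train = [v for v in variants if v[0] <= max_train_mutations]
    test = [v for v in variants if v[0] > max_train_mutations]
    return train, test

def mse(y_true, y_pred):
    """Mean-square error between measured and predicted values."""
    return mean((t - p) ** 2 for t, p in zip(y_true, y_pred))

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))
```

Spearman correlation depends only on rank order, which is why a model can lead on Pearson correlation without leading on Spearman correlation, as seen for the UniRep ensemble and MLP (512,1) rows.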

Discussion

We reanalyzed a deep mutational scanning study of LacI-mediated transcriptional repression to further the understanding of the protein and to develop advanced computational models of repression function. The experimental dataset showed strong agreement between synonymous coding variants (Pearson correlation = 0.85; Fig. 1C), contained meaningful functional boundaries as seen at both positive and negative control variants (P < 1.57e-277; Fig. 1D), and significantly agreed with prior large-scale analyses of variant repressor function (P < 0.02; Fig. 1E). An analysis of protein variant effect across all 360 positions of LacI revealed interesting repression patterns in the C-terminal tetramerization domain (Fig. 2). A comparison between theoretical free-energy calculations for a DNA-bound monomer versus DNA-bound tetramer shows that severe loss-of-repression variants in the C-terminal domain are significantly destabilizing in the tetramer configuration but not destabilizing in the monomer configuration (Fig. 2E), indicating C-terminal variants affect oligomerization rather than protein stability or operator affinity. The mechanism by which a single mutation in the C-terminal domain can cause a total loss of repression is likely driven by energetically unfavorable structural contacts within the tetramerization interface, drastically increasing the dissociation of dimers (19, 38). Multivalent LacI tetramers can also form DNA loops (39) that increase repressive strength manyfold but can be disrupted by variation in the tetramerization domain. The loss of repressive function due to some changes in the domain is likely caused by diminished oligomerization; however, greater understanding and precise manipulation of the C-terminal domain may yield additional opportunities for variant design. For example, mutations in the tetramerization domain have been shown to rescue distal destabilizing mutations, such as an S354F mutation that reversed the dimerization defect of Y282D, supporting that mutations in the C-terminal domain can profoundly alter repressive function of LacI in either direction (16). Taylor et al. (13) found that a single change in the C-terminal domain, L356G or P339R, enabled LacI induction by a new ligand, gentiobiose, suggesting new inducer recognition and the associated regulation of repression can be potentially modulated solely within the C-terminal domain.

A collection of thousands of double mutants in the dataset, along with functional values for most underlying single mutations, presents additional opportunities to understand the role of position covariation and epistasis in LacI (Fig. 3). We found statistically significant overlap between the most epistatic position pairs in our experimental measurements and position pairs known to have long-range contact, as elucidated from crystal structures (P < 0.05; Fig. 3B and SI Appendix, Figs. S4–S6). Finding a significant overlap from a relatively small sampling of all possible double mutations suggests that excessive functional epistasis as it relates to structural contact is present in our deep mutational scanning dataset, similar to other proteins analyzed (25). A separate ranking of excessive epistasis based on evolutionary coupling scores, calculated from the frequency of cooccurrence of residues in an alignment of evolutionarily related sequences, showed significant overlap with structural contacts as well (P < 0.0002; Fig. 3B and SI Appendix, Figs. S4–S6).

Molecular simulation and alignment-based evolutionary methods could predict an unseen single-mutation effect with only modest performance (Fig. 4). Molecular simulation could distinguish some special cases that grossly affect folding or basal free energy but did not discriminate variant function generally. Evolutionary analyses may have difficulty predicting the effect of deviations to a single protein’s function since many mutations arising from a synthetic library may not yet be part of the extant record, have differing native inducer recognition in their organism despite overall sequence similarity, and/or be associated with differences at the DNA operator level in a different species and are therefore nonspecific to functional conservation (4). To enable effective modeling when a domain is sparsely sampled in an alignment or for a mutation that is well outside of the natural evolutionary trajectory, protein sequence representations borrow from general knowledge of protein domains and allow for improvement in the prediction of function (Fig. 5). We found that a protein representation learning model could be fine-tuned to predict LacI repression function using only 68 task-specific training examples, using an external dataset to validate our best-performing model (Fig. 5D). Biswas et al. (40) recently reported a high-performing deep representation learning model using as few as 24 protein sequences for training, highlighting a unique advantage of this modeling paradigm over other approaches requiring extensive datasets.

More effective prediction of mutational effect on basal repression will improve the understanding and design of custom intracellular biosensors. A representation learning approach addresses the complexity of a multifunctional protein such as LacI. General properties of protein sequences are first learned in an unsupervised fashion from massive protein databases, which allows for the characterization of and inference about mutations well outside the local alignment space. We show that the representation learning paradigm performed best at predicting repression function for unseen single mutations, had the highest Pearson correlation for unseen variants containing more than three mutations, and, when independently fine-tuned on an external experimental dataset of LacI, achieved a higher Pearson correlation than any previous model. The representation learning approach is thus a promising computational model to gain understanding and design insight for LacI and the consequence of variation for other protein functions.

Materials and Methods

Variant Library and Negative Selection.

This study utilized lac repressor variants from the Taylor et al. study (13), including 10 tiles of synthesized single mutations and 6 tiles designed from Rosetta simulations for the goals of that study considering inducer recognition. For this study we instead analyze those mutations with the goal to understand and better predict basal transcriptional repression function after mutation.

Sequencing Data Analysis.

Sequencing data analysis utilized raw read counts and did not utilize the previous rank normalization approach of Taylor et al. (13). Briefly, paired-end 300-bp reads were designed to completely overlap each tile to mitigate sequencing artifacts and were collapsed using PEAR v.0.9.8, using default settings (11). Collapsed reads were trimmed to remove barcodes and primer sites. We collected the sequencing read interior to the sequencing primer locations on each end and placed this subsection at the appropriate position in the full coding region. Full-length coding sequence counts for both pre- and postselection conditions were merged. A pseudocount of 0.5 was added to each sequence count. A fold change was computed by dividing the postselection sequencing count by the preselection count. To compare values across tiles, we scaled the log2 fold changes by min-max scaling. The maximum value (upper limit) for each tile was the mean log2 fold change for variants counted more than 100 times postselection. The minimum value (lower limit) for each tile was the mean log2 fold change for variants having at least 10 counts in the preselection condition but fewer than 3 counts in the postselection condition. These values correspond to the dynamic limits of the assay, as the log2 fold changes level off even as the sequencing depth increases (SI Appendix, Fig. S1). For analyses at the protein sequence level, the sum of synonymous sequencing read counts was used.
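The per-tile normalization described above can be sketched in a few lines. This is an illustrative reading of the text, not the study's pipeline: the function names and the `counts` dictionary layout are assumptions, the sketch assumes every tile contains variants in both reference groups, and it does not clip scaled values to the 0 to 1 range because the text does not say whether clipping was applied.

```python
from math import log2
from statistics import mean

PSEUDOCOUNT = 0.5  # added to every sequence count, as stated in the text

def log2_fold_change(pre_count, post_count):
    """log2 of the postselection count divided by the preselection count."""
    return log2((post_count + PSEUDOCOUNT) / (pre_count + PSEUDOCOUNT))

def scale_tile(counts):
    """Min-max scale log2 fold changes within one tile.

    counts: dict mapping a variant identifier to (pre_count, post_count).
    The dict layout is a hypothetical convenience for this sketch.
    """
    lfc = {v: log2_fold_change(pre, post) for v, (pre, post) in counts.items()}
    # Upper limit: mean over variants counted more than 100 times postselection.
    upper = mean(lfc[v] for v, (pre, post) in counts.items() if post > 100)
    # Lower limit: mean over variants with at least 10 preselection counts
    # but fewer than 3 postselection counts.
    lower = mean(
        lfc[v] for v, (pre, post) in counts.items() if pre >= 10 and post < 3
    )
    return {v: (x - lower) / (upper - lower) for v, x in lfc.items()}

# Toy example with made-up counts for three variants in one tile.
scaled = scale_tile({"high": (100, 400), "low": (50, 0), "mid": (100, 20)})
```

In the toy example the single strongly enriched variant defines the upper limit and the single depleted variant defines the lower limit, so they map to 1 and 0, and the intermediate variant falls between them.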

Structural Modeling.

We obtained a previously constructed theoretical model of the tetrameric lac repressor bound to DNA (20). The model consisted of portions of three substructures: Protein Data Bank (PDB):1LBH (18), PDB:1EFA (41), and PDB:1CJG (42). Briefly, Villa et al. (20) took single protein chains from substructures, solvated with explicit water and total charge neutralizing ions. The system was equilibrated for 2 ns while gradually lifting constraints at areas of substructure intersection and eventually allowing all-atom equilibration for 0.75 ns of simulation. We constructed Rosetta-compatible coordinate and parameter files containing protein and DNA chains of interest. We used the Rosetta pmut scan protocol (43) with default sampling to measure the average change in free energy and average total energy for each variant of interest. The theoretical model was also used to determine structural contacts for epistasis analyses.

Epistasis Measurement.

We used an additive model to measure epistasis in our experimental data (24). The model first subtracts the wild-type value from every variant, producing an adjusted repression value. We took the adjusted repression value of each double or higher mutant and subtracted the sum of its underlying single-mutation adjusted repression values. For structural contact overlap, we ignored residue pairs less than five amino acids away in sequence space.
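The additive model above amounts to a short calculation. The sketch below follows the text directly; the function names and the example values are illustrative assumptions, and the sign convention (observed minus expected) is the one implied by the subtraction order described.

```python
def epistasis(variant_value, single_values, wild_type_value):
    """Deviation of a multi-mutant from the additive expectation.

    variant_value: repression value of the double or higher-order mutant.
    single_values: repression values of its underlying single mutations.
    wild_type_value: repression value of the wild-type sequence.
    """
    # Adjusted value of the multi-mutant: subtract the wild-type value.
    observed = variant_value - wild_type_value
    # Additive expectation: sum of adjusted single-mutation values.
    expected = sum(v - wild_type_value for v in single_values)
    return observed - expected

def far_in_sequence(pos_i, pos_j, min_separation=5):
    """Keep only residue pairs at least five positions apart in sequence."""
    return abs(pos_i - pos_j) >= min_separation

# Made-up values: wild type 1.0, singles 0.8 and 0.7, double mutant 0.2.
score = epistasis(0.2, [0.8, 0.7], 1.0)
```

Here the singles predict an adjusted value of -0.5 for the double mutant, the observed adjusted value is -0.8, and the epistasis score is therefore -0.3, indicating that the double mutant represses less than the additive expectation.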

Evolution-Related Sequences.

We used EVCouplings (23) to perform a sequence search and alignment using the 360-amino acid wild-type sequence as a query. Using default search thresholds, the alignment contained 143,119 related sequences. We generated a rank ordering of strength-of-coupling scores using plmc (22), using run parameters “-le 36 -h 0.01 -m 200 -t 0.2 -g,” and stored the inferred model parameters. Using EVmutation (22) we loaded model parameters and scored variants of interest according to a position-independent level of sequence conservation or epistatic covariation. We also used the full-length alignment with DeepSequence (28) to train a variational autoencoder and extracted a corresponding function prediction score for each sequence of interest, with the following hyperparameters: two encoding layers each with 1,500 dimensions and rectified linear unit activations, a 30-dimensional latent variable layer, a decoder layer of 100 dimensions followed by another decoding layer of 500 dimensions, a 100-dimensional convolutional pattern layer, and a 40-dimensional convolutional decoder layer with sigmoid activations.

Model Baselines.

We evaluated a variety of neural networks trained only on sequence-value pairs from this study with no pretraining. The baseline one-hot model is a full-length sequence input layer followed by a single, fully connected output layer. The residue embedding models use an additional internal, learned layer that reduces the amino acid dimensionality for generalization. Top-performing multilayered perceptron architectures in this work consisted of a flattened one-hot input layer, followed by either a single hidden layer of 512 dimensions or two hidden layers of sizes 20 and 10, with tanh activations, and then a fully connected linear output layer. As a negative control during modeling, we shuffled repression value labels and sequences during training.
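The flattened one-hot input used by these baselines can be illustrated as follows. This is a sketch under stated assumptions: a 20-letter amino acid alphabet in an arbitrary fixed order (the text does not give the alphabet or ordering used), with hypothetical function names. Under that assumption a 360-residue sequence yields a 7,200-dimensional input vector.

```python
# Assumed alphabet and ordering; not specified in the text.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot_flat(sequence):
    """Flattened one-hot encoding: one block of 20 entries per position."""
    width = len(AMINO_ACIDS)
    vec = [0.0] * (len(sequence) * width)
    for pos, aa in enumerate(sequence):
        vec[pos * width + INDEX[aa]] = 1.0
    return vec

def linear_output(features, weights, bias):
    """A single fully connected output unit, as in the baseline one-hot model."""
    return sum(w * x for w, x in zip(weights, features)) + bias

# Toy two-residue example; a full-length LacI sequence would have 360 residues.
encoded = one_hot_flat("AC")
prediction = linear_output(encoded, [0.5] * len(encoded), 0.1)
```

With a single linear output layer, each weight corresponds to one residue at one position, so the baseline can only capture additive, position-specific effects; the hidden layers in the multilayered perceptrons are what allow interactions between positions.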

Representations.

We configured a doc2vec (34) model implementation and corresponding pretrained weights (31) to generate representations for each sequence of interest. We selected weights for k = 3, w = 7 for the original, scrambled, and random sequence training sets. We also compared two additional models with high performance on previous tasks: a model with k = 4, w = 1 for a uniform sequence training set and a model using k = 5, w = 7 for the original sequence training set. Model representations were 64-dimensional (31). For the transformer approach we utilized a BERT implementation from Tasks Assessing Protein Embeddings (TAPE) (30) with the following hyperparameters: hidden layer size of 768, intermediate layer size of 3,072, maximum of 8,192 positional embeddings, and 12 attention heads. We collected a 768-dimensional value corresponding to the average, attention-weighted hidden-layer weights of the model per sequence. To compare repression value prediction performance using the seq2seq (32) approach, we obtained UniRep code and downloaded pretrained model weights for the 1,900-dimensional model (29). We generated representations by averaging the mLSTM cell output values across all positions of the input sequence of interest. As a negative control, we shuffled the association between sequence and representation within the training sets. Validation and testing sets were not shuffled.
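Averaging per-position outputs into one fixed-length representation, and the shuffled negative control, can be sketched as below. The sketch does not call UniRep or TAPE; it assumes the per-position hidden states have already been computed by some model, and the function names are hypothetical.

```python
import random

def mean_pool(hidden_states):
    """Average per-position vectors into one fixed-length representation.

    hidden_states: list of L vectors, one per sequence position, each of
    dimension D (D would be 1,900 for the UniRep model described above).
    """
    n_positions = len(hidden_states)
    dim = len(hidden_states[0])
    return [sum(h[d] for h in hidden_states) / n_positions for d in range(dim)]

def shuffle_representations(representations, seed=0):
    """Negative control: break the pairing between sequences and representations.

    Returns the same representations in a permuted order, to be paired with
    the unshuffled training sequences and labels. Apply to training data only.
    """
    shuffled = list(representations)
    random.Random(seed).shuffle(shuffled)
    return shuffled

# Toy example: two positions, two hidden dimensions.
pooled = mean_pool([[1.0, 2.0], [3.0, 4.0]])
control = shuffle_representations([[0.1], [0.2], [0.3], [0.4]])
```

Mean pooling makes the representation length independent of sequence length, which is what allows a simple top-layer regressor to be trained on it.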

Model Fine-Tuning.

We used Python and scikit-learn v.0.23.2 for most top-layer regression models. Sequence-only models, e.g., one-hot, and multilayered perceptrons were implemented in TensorFlow v.2.2.0 and/or Keras v.2.4.0, minimizing the mean-square error using the Adam optimizer. Performance evaluation of single mutants used 10-fold cross-validation stratified at repression value 0.5 such that each fold contained a similar mixture of repression values to that of the original distribution. Performance evaluation for the external validation set used a 10-fold cross-validation stratified at raw fold-change value 200, as was done by the prior authors to classify functional from nonfunctional (12). Gaussian process regression (36) used the Matern kernel (nu = 5/2; length scale = 1; length-scale bounds = 0.1 to 10). For the ensemble prediction we used the average prediction from 100 extremely randomized trees (44).
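The top-layer evaluation described above can be sketched with scikit-learn. This is an illustration under several assumptions, not the study's code: the data are synthetic stand-ins for representations and repression values, the stratification label is taken as "value at or above the threshold", and fold shuffling, random seeds, and `normalize_y=True` are choices made for this example only.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern
from sklearn.model_selection import StratifiedKFold

def median_pearson_cv(X, y, threshold=0.5, n_splits=10, seed=0):
    """Median Pearson correlation over stratified cross-validation folds."""
    # Binary label used only for stratification (assumed >= threshold).
    strata = (y >= threshold).astype(int)
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in folds.split(X, strata):
        # Ensemble prediction: average of 100 extremely randomized trees.
        model = ExtraTreesRegressor(n_estimators=100, random_state=seed)
        model.fit(X[train_idx], y[train_idx])
        pred = model.predict(X[test_idx])
        scores.append(np.corrcoef(y[test_idx], pred)[0, 1])
    return float(np.median(scores))

def gp_predict_with_uncertainty(X_train, y_train, X_test):
    """Gaussian process regression with a Matern kernel (nu = 5/2)."""
    kernel = Matern(length_scale=1.0, length_scale_bounds=(0.1, 10.0), nu=2.5)
    gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True, random_state=0)
    gp.fit(X_train, y_train)
    # return_std=True yields the predictive standard deviation per test point.
    return gp.predict(X_test, return_std=True)

# Synthetic stand-in data: 200 "variants" with 16-dimensional representations.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))
y = 1.0 / (1.0 + np.exp(-X[:, 0]))  # values between 0 and 1

score = median_pearson_cv(X, y)
gp_mean, gp_std = gp_predict_with_uncertainty(X[:134], y[:134], X[134:])
```

Stratifying on a thresholded label keeps the proportion of functional and nonfunctional variants similar across folds, which matters when repression values are unevenly distributed. The Gaussian process standard deviation gives a per-variant uncertainty estimate of the kind plotted for the external dataset.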

Supplementary Material

Supplementary File
pnas.2022838118.sapp.pdf (10.9MB, pdf)

Acknowledgments

We thank Jim Collins, Sahil Loomba, Surojit Biswas, Ethan Alley, Sam Meier, Chirag Patel, Jack Szostak, Adam Wright, John Aach, and members of the Church Laboratory for helpful discussions. We also thank Noah Taylor, Srivatsan Raman, and Stan Fields for their inspiration, mentoring, and efforts to create the experimental dataset analyzed in this work. A.S.G. is supported by Department of Energy Grant DE-FG02-02ER63445, National Human Genome Research Institute Grant 5T32HG002295-12, and Biomedical Informatics and Data Science Research Training Grant T15LM007092. Computational resource grants were provided by Nvidia, Amazon Web Services, and Google Cloud. Research funding and support was also provided by the Wyss Institute for Biologically Inspired Engineering at Harvard University.

Footnotes

Competing interest statement: G.M.C. is a cofounder of Nabla Bio, Inc., in which he has related financial interests. A full list of G.M.C.’s technology transfer, advisory roles, and funding sources is available at arep.med.harvard.edu/gmc/tech.html. A.S.G. and K.M.C. declare no competing interest.

This article is a PNAS Direct Submission.

This article contains supporting information online at https://www.pnas.org/lookup/suppl/doi:10.1073/pnas.2022838118/-/DCSupplemental.

Data Availability

Raw sequencing reads are available at NCBI Gene Expression Omnibus (accession nos. GSE75009 and GSE175456). Code and processed data are available in the GitHub repository, https://github.com/churchlab/lac_repression/.

References

1. Jacob F., Monod J., Genetic regulatory mechanisms in the synthesis of proteins. J. Mol. Biol. 3, 318–356 (1961).
2. Gilbert W., Muller-Hill B., Isolation of the lac repressor. Proc. Natl. Acad. Sci. U.S.A. 56, 1891–1898 (1966).
3. Wilson C. J., Zhan H., Swint-Kruse L., Matthews K. S., The lactose repressor system: Paradigms for regulation, allosteric behavior, and protein folding. Cell. Mol. Life Sci. 64, 3–16 (2006).
4. Markiewicz P., Kleina L. G., Cruz C., Ehret S., Miller J. H., Genetic studies of the lac repressor. XIV. Analysis of 4000 altered Escherichia coli lac repressors reveals essential and non-essential residues, as well as “spacers” which do not require a specific sequence. J. Mol. Biol. 240, 421–433 (1994).
5. Poelwijk F. J., De Vos M. G. J., Tans S. J., Tradeoffs and optimality in the evolution of gene regulation. Cell 146, 462–470 (2011).
6. Otwinowski J., Nemenman I., Genotype to phenotype mapping and the fitness landscape of the E. coli lac promoter. PloS One 8, e61570 (2013).
7. Zuo Z., Chang Y., Stormo G. D., A quantitative understanding of lac repressor’s binding specificity and flexibility. Quant. Biol. 3, 69–80 (2015).
8. Barnes S. L., Belliveau N. M., Ireland W. T., Kinney J. B., Phillips R., Mapping DNA sequence to transcription factor binding energy in vivo. PLoS Comput. Biol. 15, e1006226 (2019).
9. Krishnan V. G., Westhead D. R., A comparative study of machine-learning methods to predict the effects of single nucleotide polymorphisms on protein function. Bioinformatics 19, 2199–2209 (2003).
10. Masso M., Hijazi K., Parvez N., Vaisman I. I., “Computational mutagenesis of E. coli lac repressor: Insight into structure-function relationships and accurate prediction of mutant activity” in Bioinformatics Research and Applications, Mandoiu I., Sunderraman R., Zelikovsky A., Eds. (Springer, Berlin; ), pp. 390–401.
11. Zhang J., Kobert K., Flouri T., Stamatakis A., PEAR: A fast and accurate illumina paired-end reAd mergeR. Bioinformatics 30, 614–620 (2013).
12. Miller M., Bromberg Y., Swint-Kruse L., Computational predictors fail to identify amino acid substitution effects at rheostat positions. Sci. Rep. 7, 41329 (2017).
13. Taylor N. D., et al., Engineering an allosteric transcription factor to respond to new ligands. Nat. Methods 13, 177–183 (2015).
14. Sousa F. L., et al., AlloRep: A repository of sequence, structural and mutagenesis data for the LacI/GalR transcription regulators. J. Mol. Biol. 428, 671–678 (2016).
15. Suckow J., et al., Genetic studies of the lac repressor XV: 4000 single amino acid substitutions and analysis of the resulting phenotypes on the basis of the protein structure. J. Mol. Biol. 261, 509–523 (1996).
16. Swint-Kruse L., Plasticity of quaternary structure: Twenty-two ways to form a LacI dimer. Protein Sci. 10, 262–276 (2001).
17. Lutz R., Independent and tight regulation of transcriptional units in Escherichia coli via the LacR/o, the TetR/o and AraC/i1-i2 regulatory elements. Nucleic Acids Res. 25, 1203–1210 (1997).
18. Lewis M., et al., Crystal structure of the lactose operon repressor and its complexes with DNA and inducer. Science 271, 1247–1254 (1996).
19. Chen J., Matthews K. S., Subunit dissociation affects DNA binding in a dimeric lac repressor produced by C-terminal deletion. Biochemistry 33, 8728–8735 (1994).
20. Villa E., Balaeff A., Schulten K., Structural dynamics of the lac repressor-DNA complex revealed by a multiscale simulation. Proc. Natl. Acad. Sci. U.S.A. 102, 6783–6788 (2005).
21. Leaver-Fay A., et al., Rosetta3: An object-oriented software suite for the simulation and design of macromolecules. Methods Enzymol. 487, 545–574 (2011).
22. Hopf T. A., et al., Mutation effects predicted from sequence co-variation. Nat. Biotechnol. 35, 128–135 (2017).
23. Hopf T. A., et al., The EVCouplings python framework for coevolutionary sequence analysis. Bioinformatics 35, 1582–1584 (2018).
24. Sarkisyan K. S., et al., Local fitness landscape of the green fluorescent protein. Nature 533, 397 (2016).
25. Rollins N. J., et al., Inferring protein 3d structure from deep mutation scans. Nat. Genet. 51, 1170–1176 (2019).
26. Schmiedel J. M., Lehner B., Determining protein structures using deep mutagenesis. Nat. Genet. 51, 1177–1186 (2019).
27. Marks D. S., et al., Protein 3d structure computed from evolutionary sequence variation. PloS One 6, e28766 (2011).
28. Riesselman A. J., Ingraham J. B., Marks D. S., Deep generative models of genetic variation capture the effects of mutations. Nat. Methods 15, 816–822 (2018).
29. Alley E. C., Khimulya G., Biswas S., AlQuraishi M., Church G. M., Unified rational protein engineering with sequence-based deep representation learning. Nat. Methods 16, 1315–1322 (2019).
30. Rao R., et al., “Evaluating protein transfer learning with tape” in Advances in Neural Information Processing Systems (2019).
31. Rao R., et al., Evaluating protein transfer learning with TAPE. arXiv [Preprint] (2019). https://arxiv.org/abs/1906.08230 (Accessed 1 March 2020).
32. Sutskever I., Vinyals O., Le Q. V., Sequence to sequence learning with neural networks. arXiv [Preprint] (2014). https://arxiv.org/abs/1409.3215 (Accessed 1 March 2020).
33. Devlin J., Chang M.-W., Lee K., Toutanova K., BERT: Pre-training of deep bidirectional transformers for language understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 10.18653/v1/N19-1423 (2019).
34. Mikolov T., Chen K., Corrado G., Dean J., Efficient estimation of word representations in vector space. arXiv [Preprint] (2013). arXiv:1301.3781.
35. Van Der Maaten L. J. P., Hinton G. E., Visualizing high-dimensional data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008).
36. Rasmussen C., Gaussian Processes for Machine Learning (MIT Press, Cambridge, MA, 2006).
37. Romero P. A., Krause A., Arnold F. H., Navigating the protein fitness landscape with Gaussian processes. Proc. Natl. Acad. Sci. U.S.A. 110, E193–E201 (2012).
38. Barry J. K., Matthews K. S., Thermodynamic analysis of unfolding and dissociation in lactose repressor protein. Biochemistry 38, 6520–6528 (1999).
39. Becker N. A., Maher L. J., High-resolution mapping of architectural DNA binding protein facilitation of a DNA repression loop in Escherichia coli. Proc. Natl. Acad. Sci. U.S.A. 112, 7177–7182 (2015).
40. Biswas S., Khimulya G., Alley E. C., Esvelt K. M., Church G. M., Low-N protein engineering with data-efficient deep learning. Nat. Methods 18, 389–396 (2021).
41. Lewis M., Bell C. E., A closer view of the conformation of the lac repressor bound to operator. Nat. Struct. Biol. 7, 209–214 (2000).
42. Spronk C. A. E. M., et al., The solution structure of lac repressor headpiece 62 complexed to a symmetrical lac operator. Structure 7, 1483–1492 (1999).
43. Kuhlman B., et al., Design of a novel globular protein fold with atomic-level accuracy. Science 302, 1364–1368 (2003).
44. Geurts P., Ernst D., Wehenkel L., Extremely randomized trees. Mach. Learn. 63, 3–42 (2006).


Articles from Proceedings of the National Academy of Sciences of the United States of America are provided here courtesy of National Academy of Sciences
