eLife. 2022 Jan 26;11:e64543. doi: 10.7554/eLife.64543

Predicting bacterial promoter function and evolution from random sequences

Mato Lagator 1,2,†, Srdjan Sarikas 2,3, Magdalena Steinrueck 2, David Toledo-Aparicio 2, Jonathan P Bollback 4, Calin C Guet 2, Gašper Tkačik 2
Editors: Sandeep Krishna5, Aleksandra M Walczak6
PMCID: PMC8791639  PMID: 35080492

Abstract

Predicting function from sequence is a central problem of biology. Currently, this is possible only locally in a narrow mutational neighborhood around a wildtype sequence rather than globally from any sequence. Using random mutant libraries, we developed a biophysical model that accounts for multiple features of σ70 binding bacterial promoters to predict constitutive gene expression levels from any sequence. We experimentally and theoretically estimated that 10–20% of random sequences lead to expression and ~80% of non-expressing sequences are one mutation away from a functional promoter. The potential for generating expression from random sequences is so pervasive that selection acts against σ70-RNA polymerase binding sites even within inter-genic, promoter-containing regions. This pervasiveness of σ70-binding sites implies that emergence of promoters is not the limiting step in gene regulatory evolution. Ultimately, the inclusion of novel features of promoter function into a mechanistic model enabled not only more accurate predictions of gene expression levels, but also identified that promoters evolve more rapidly than previously thought.

Research organism: E. coli

Introduction

Describing the relationship between sequence (genotype) and function (phenotype) lies at the heart of understanding biology and evolution. Direct experimental characterizations of genotype-phenotype mapping abound (Lehner, 2013; Kemble et al., 2019); however, existing technology limits experimental exploration to only a tiny fraction of all possible sequences (Sarkisyan et al., 2016). Due to these technological limitations, there is a need to develop theoretical approaches capable of predicting how any genotype maps onto phenotype (Yi and Dean, 2019). Predictive genotype-phenotype maps are rare and incomplete: predicting protein structure from sequence is possible only for a small number of mutations around a well-characterized wildtype (Kuhlman and Bradley, 2019), while predicting RNA folding from sequence (Schuster, 2006) generally lacks a relevant reference to the function of that secondary structure.

Gene expression is one of the most fundamental processes of life, and tuning expression levels underpins complex biological function. The critical process in the expression of most bacterial genes is the recruitment of σ70-RNA polymerase (RNAP) to a stretch of DNA – the promoter (Jacob and Monod, 1961). Computational and theoretical attempts to predict the relationship between genotype (promoter sequence) and its phenotype (gene expression level) have adopted, broadly speaking, three approaches. (i) Bioinformatics – identifies promoters based on sequence homology to the σ70-RNAP consensus site (Mustonen and Lässig, 2005), which consists of –10 and –35 elements (TATAAT and TTGACA in Escherichia coli, respectively) separated by a spacer with canonical length of 17 bp, but does not predict gene expression from them (Anzolini Cassiano and Silva-Rocha, 2020). (ii) Machine learning – predicts gene expression patterns in cells (Beer and Tavazoie, 2004; Hossain et al., 2020), but lacks direct links to the underlying biological mechanisms, limiting insights into promoter structure and evolution (Libbrecht and Noble, 2015). (iii) Biophysical models – the most successful of these predict gene expression levels based on the thermodynamic properties of σ70-RNAP binding at a promoter in equilibrium (Bintu et al., 2005), but they do not generalize well to random sequences (Vilar, 2010). In sum, we lack a generalizable and predictive theoretical and biological understanding of the relationship between promoter genotype and its function (gene expression phenotype) even for constitutive promoters, where σ70-RNAP binding (as opposed to the binding of transcription factors) is considered to be the major determinant of gene expression levels (Forcier et al., 2018). Without such a canonical model, we cannot properly understand either promoter function or promoter evolution.

The standard thermodynamic model (Bintu et al., 2005; Kinney et al., 2010) considers the energy of binding between σ70-RNAP complex and DNA to identify, typically, the single strongest binding site, with the expression level proportional to the equilibrium occupancy of σ70-RNAP to that binding site (‘Standard’ model). To calculate equilibrium occupancy, the Standard model assumes that each position in the binding site independently and additively contributes to the binding energy of σ70-RNAP. When constructing the Standard model, the energy contribution of each residue (A, C, G, or T) at each position in the binding site is inferred directly from experimental data and represented in the form of an energy matrix. Constructed in this way, the Standard model represents the state-of-the-art mechanistic model of constitutive gene expression, as well as the traditional view of promoters as containing only a single σ70-RNAP binding site – a view that is at the heart of all bioinformatics approaches to identifying promoters (Anzolini Cassiano and Silva-Rocha, 2020).
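The core of the Standard model described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the three-position `toy_matrix`, its values, and the function names are hypothetical, energies are taken to be in units of kBT, and the occupancy is assumed to follow the usual single-site form 1 / (1 + exp(E − μ)).

```python
import math

BASES = "ACGT"

def site_energy(site, energy_matrix):
    """Sum the independent, additive per-position contributions (in kBT)."""
    return sum(energy_matrix[i][BASES.index(b)] for i, b in enumerate(site))

def strongest_site(seq, energy_matrix):
    """Return (energy, start) of the lowest-energy window on the forward strand."""
    width = len(energy_matrix)
    return min(
        (site_energy(seq[s:s + width], energy_matrix), s)
        for s in range(len(seq) - width + 1)
    )

def occupancy(energy, mu):
    """Equilibrium occupancy of a single site (assumed form)."""
    return 1.0 / (1.0 + math.exp(energy - mu))

# Hypothetical toy matrix: rows are positions, columns are A, C, G, T.
# The best residue at each position is set to zero, as in an energy matrix
# normalized to the strongest-binding residue.
toy_matrix = [
    [1.0, 1.0, 1.0, 0.0],  # T preferred
    [0.0, 1.0, 1.0, 1.0],  # A preferred
    [1.0, 1.0, 1.0, 0.0],  # T preferred
]

energy, start = strongest_site("GGTATCC", toy_matrix)
print(energy, start, round(occupancy(energy, mu=0.0), 3))
```

In this sketch the expression level would be taken as proportional to the occupancy of the single strongest site; every other window is ignored, which is exactly the simplification the Extended model relaxes.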

To predict gene expression levels from any random sequence, we developed a mechanistic and generalizable model based on statistical thermodynamics that expands on the Standard model by accounting for several structural features of bacterial promoters. We use this model, as well as several large promoter mutant libraries, to study the promoter function among random sequences as well as the evolutionary dynamics of bacterial promoters.

Results

Extended thermodynamic model

We developed a comprehensive and generalizable thermodynamic model (‘Extended model’) that expands on the Standard model by accounting for six essential structural features of bacterial promoters that are not present in the Standard model (Figure 1A and B). (i) The possibility that σ70-RNAP binds the promoter region in multiple configurations that independently and cumulatively contribute to gene expression (Storz, 2014; Belliveau et al., 2018), as commonly observed in transcriptomics studies (Srikumar et al., 2015). We assume that σ70-RNAP concentration is low enough so that the binding of multiple molecules to the promoter has an approximately additive effect on expression. (ii) Spacer length flexibility, assigning an energy penalty for changing the distance between –10 and –35 elements away from the optimum. (iii) Occlusive unproductive binding, occurring when the initiation start site excludes a part of the ribosomal binding site (RBS) resulting in a non-translated transcript. (iv) Occlusive binding on the reverse complement strand, which inhibits productive binding at the promoter (Brophy and Voigt, 2016). (v) Dinucleotide interactions between promoter nucleotide residues that are in direct contact with σ70-RNAP. (vi) Clearance rate of RNAP from the promoter. We do not account for the UP element (Ross et al., 1993), as it has an independent role in determining expression (Einav and Phillips, 2019) and was beyond the scope of this study. We experimentally verified features (i), (iii), and (iv), as, unlike feature (ii), they have not been characterized before (Figure 2A–D). The experimental verification of unproductive binding on the reverse complement was inconclusive – two sets of promoter mutants changed expression in response to increasing strength of binding on the reverse complement, while the other two sets did not (Figure 2—figure supplement 1). However, the inclusion of this feature into the Extended model was justified because it led to a significant increase in predictability (especially of the 36N dataset) (Figure 2D).
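To make features (i) to (iv) concrete, the following sketch combines several binding configurations into a single probability of productive binding. It is a simplified illustration under stated assumptions, not the equation used in the paper: the `Config` class and all names are hypothetical, the functional form `W_on / (1 + W_on + W_block)` is an assumed mutually exclusive binding form, and dinucleotide interactions and the clearance rate (features v and vi) are left out. The 11 bp cutoff is the experimentally determined distance to the RBS reported in the Figure 2B caption.

```python
import math
from dataclasses import dataclass

@dataclass
class Config:
    energy: float          # energy-matrix contribution of the contacts, in kBT
    spacer_penalty: float  # extra energy for a non-optimal spacer, in kBT
    dist_to_rbs: int       # bp from the -10 end of the site to the RBS start
    reverse: bool = False  # True if bound on the reverse complement strand

RBS_CUTOFF = 11  # sites 11 bp or closer to the RBS are treated as unproductive

def is_productive(c):
    """Forward-strand sites far enough from the RBS count as productive."""
    return (not c.reverse) and c.dist_to_rbs > RBS_CUTOFF

def boltzmann_weight(c, mu):
    """Statistical weight of one configuration at chemical potential mu."""
    return math.exp(-(c.energy + c.spacer_penalty - mu))

def p_on(configs, mu):
    """Probability of productive binding (assumed form, see lead-in)."""
    w_on = sum(boltzmann_weight(c, mu) for c in configs if is_productive(c))
    w_block = sum(boltzmann_weight(c, mu) for c in configs if not is_productive(c))
    return w_on / (1.0 + w_on + w_block)

# Hypothetical example: one productive site and one occlusive site near the RBS.
productive = Config(energy=0.0, spacer_penalty=0.0, dist_to_rbs=20)
occlusive = Config(energy=0.0, spacer_penalty=0.0, dist_to_rbs=5)
print(p_on([productive], mu=0.0), p_on([productive, occlusive], mu=0.0))
```

In this toy case, adding the occlusive site lowers the probability of productive binding from 1/2 to 1/3, which mirrors the qualitative behavior reported for occlusive unproductive binding.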

Figure 1. Standard and Extended models.

(A) The standard thermodynamic model assumes only one (strongest) σ70-RNAP binding configuration at the promoter, which generates a productive transcript. We refer to ‘promoter’ as the entire cis-regulatory element, while ‘binding site’ refers to the RNA polymerase (RNAP) contact residues of a specific binding configuration (colored area in ‘RNAP binding sites’). The Extended model incorporates structural features of bacterial promoters into the thermodynamic framework: (i) cumulative binding, permitting σ70-RNAP to bind multiple binding sites independently on the same promoter (binding configurations 1–4); (ii) spacer length flexibility (difference between configurations 1 and 2); (iii) occlusive unproductive binding (configuration 3); (iv) occlusive binding on the reverse complement (RC) (configuration 4). To predict gene expression levels, we calculate the probability of productive σ70-RNAP binding (Pon), where μ is the chemical potential related to RNAP concentration, Eα the energy of binding for binding state α, R the clearance rate relative to σ70-RNAP recruitment rate, and Wblock the binding free energy of unproductive states, which is calculated in the same way as Won but for unproductive as opposed to productive (on) states. The Standard model does not allow for any unproductive binding, and it considers only the single strongest binding site (α*). How the structural features are incorporated into the Extended model is shown in orange. (B) Example model output for a selected PR mutant (stars mark mutated positions) for the Standard and the Extended model, showing binding configuration on the forward strand. Pixels forming the background grid indicate –10 end-points of binding sites, with the intensity of color corresponding to the strength of binding in that configuration (green productive, red unproductive binding). 
States that are bound strongly enough to independently lead to measurable expression are framed (one for the Standard model, three for the Extended model – two productive, one unproductive). For illustration purposes, the pixel corresponding to the binding energy of State 1 in panel A is marked with an arrow. (C) Main biophysical parameters of the Extended model, fitted from sort-seq data. Energy matrix shows the effect of every possible binding site residue on the binding energy between σ70-RNAP and DNA (strongest binding indicated by green squares). The optimal energy matrix consists of the –10 and –35 elements (underlined), positions outside the canonical elements that significantly affect quantitative predictions of gene expression levels (dotted underline), and spacer of optimal length 9 bp (corresponding to the canonical 17 bp between –10 and –35 elements). Strongest stabilizing (green) and destabilizing (red) interactions between dinucleotides are shown, with line thickness indicating the deviation from independent energy contribution to binding (range 0.15–0.38kBT). For other model parameters and all significant dinucleotide interactions, see Figure 1—source data 1. Figure 1—figure supplement 1 describes the experimental system and protocol; Figure 1—figure supplement 2 shows the comparison between our and previously obtained energy matrix for σ70-RNAP.

Figure 1—source data 1. Inferred values for model parameters.
For each position in the energy matrix, the energy penalty is normalized to the lowest energy (strongest binding) residue in that position, which is set to zero. The same applies for the spacer variation penalty, where the optimal spacer length is set to zero. Orange background marks the canonical –10 and –35 binding sites of σ70-RNAP. For pairwise interaction, ‘identity’ refers to the two positions and specific residues that have a significant positive (destabilizing) or negative (stabilizing) interaction, with numbers indicating the position of the residue in the matrix shown in Figure 1C. Shaded cells indicate the strongest interactions, which are shown in Figure 1C. The numbers indicate the position in the energy matrix, starting from the most upstream to the most downstream (left to right in the energy matrix shown in Figure 1C).

Figure 1—figure supplement 1. Experimental plasmid systems and protocol.

(A) For the PR and PL libraries, the synthetic construct used to detect the effects of promoter mutations consisted of a yellow fluorescent marker (venus-yfp), preceded by a ribosomal binding site (RBS), and under the control of either the PR or PL promoter (or a PR or PL promoter mutant). The system was isolated from the rest of the plasmid by a T1 terminator (hairpin). This construct was placed on a small copy number pZS* plasmid (SC101* origin) with kanamycin resistance, with Escherichia coli MG1655 as host. (B) The expression of a green fluorescence protein (gfp) was under the control of a random 100 bp sequence consisting of: two 32-bp-long random, non-expressing flanking sequences that were not mutated; and a 36-bp-long sequence that was mutated randomly, with each nucleotide having 25% chance of being found at each position. This construct was placed on a pUA66 plasmid (SC101 origin), with E. coli NEB5α as a host. (C) Promoter mutants were cloned into the plasmid system using restriction/modification. The mutations were introduced at random, using pre-synthesized oligonucleotides with a fixed mutation rate (12% for the PR, 9% for the PL, and fully random for the 36N mutant library). The plasmids carrying mutant promoters were cloned either into MG1655 (PR and PL libraries) or NEB5α (36N library). (D) Each random mutant library was sorted through fluorescence activated cell sorting (FACS) based on the fluorescence intensity detected at the single cell level. Mutants in PR and PL libraries were sorted into four, while the 36N library was sorted into 12 equidistant bins. 150-bp-long fragments containing the promoter region of each sorted sub-library were PCR-tagged, and each library sequenced in bulk with 5 million total reads per library. (E) We screened each sequence library for only those mutants that had at least 30× coverage, and obtained fluorescence distributions of each mutant across the bins. 
(F) Flow cytometry measurements of 1 million mutants from each library showing distributions of fluorescence (as proxy for gene expression levels). The vertical red dotted line separates the mutants with no measurable expression (corresponding to Figure 4A). The red dotted line and the solid lines separate the four bins used to sort the PR and PL libraries (no, low, intermediate, and high expression, from left to right). The dotted lines mark the boundaries of the additional bins used to sort the 36N library. (G) Mutation frequencies in the three experimental libraries are shown as fraction of mutants with a given nucleotide at each position for PR, PL, and 36N libraries. We did not observe any bias in the mutagenesis of libraries. The consensus sequence for each library is provided underneath each plot. (H) Number of sequences in each library containing a spacer of specific length. The reported counts are based only on the spacer length of the strongest binding site identified in each sequence. Note that the Extended model accounts for cumulative binding between all possible σ70-RNAP configurations binding to a given sequence, meaning that our libraries contained a much greater number of sequences with each spacer length than shown here. In fact, because here we considered only the single strongest binding site, the counts over-represent the optimal spacer length because it has the lowest energy and, hence, binding configurations with that spacer length are more likely to be most strongly bound.
Figure 1—figure supplement 2. Comparison of σ70-RNAP (RNA polymerase) energy matrix values.

Correlation between corresponding energy matrix values in the matrix obtained in this study (Figure 1C) to that obtained by Kinney et al., 2010. Points shown in red correspond to the canonical –10 and –35 sites, while the points in black represent other entries in the matrix.

Figure 2. Experimental validation of structural promoter features.

(A) Cumulative binding affects expression in most tested sequences. We experimentally created 29 promoters with the following property: the Standard model predicts no measurable expression from these promoters, while the Extended model predicts measurable expression due to the existence of multiple σ70-RNAP (RNA polymerase) binding sites. Fluorescence measurements are shown in gray bars, with error bars indicating standard error of the mean from three replicate biological measurements. The gray horizontal bar indicates the detectability (‘no measurable expression’) threshold, determined for plate reader measurements as the mean fluorescence/OD600 value across all replicates of the population carrying the plasmid without any fluorescence markers. All promoters but 10, 13, and 21 exhibited significant measurable expression. We introduced additional mutations into promoters 1–7, in order to remove the secondary binding site(s) without affecting the strongest binding site. White bars show the expression levels of these additional mutants. Only the mutated promoter 5 exhibited significant measurable expression. (B) Characterizing sequence determinants of occlusive unproductive binding. (Top left panel) We created 20 promoter sequences for which the Extended model that accounts for occlusive unproductive binding predicted no measurable expression, while the model which did not account for occlusive unproductive binding predicted measurable expression. Bars are mean fluorescence measured from three biological replicates, and error bars are standard error of the mean. The gray shaded area indicates the detectability (‘no measurable expression’) threshold. Only mutants 7 and 20 exhibited significant measurable expression. (Top right panel) We inserted mutations into the wildtype PR promoter to gradually introduce an additional binding site that was predicted to bind in an occlusive unproductive manner. 
These mutations were not predicted to significantly alter σ70-RNAP binding to the existing dominant PR binding site. As mutations are introduced into the promoter, they generate stronger binding to the new site, which lowers gene expression levels. (Bottom left panel) We mutated three promoters (originally found in the PR mutant library) to gradually remove their existing, predicted occlusive unproductive binding sites. As the predicted occlusive unproductive sites were removed, we measured a significant increase in gene expression levels. (Bottom right panel) In order to experimentally verify the occlusive unproductive binding cutoff distance from the –10 end of the binding site to the beginning of the ribosomal binding site (RBS), we started with the same three promoters as in bottom left panel. We used the Extended model to identify the predicted occlusive unproductive binding site, and then we moved the site upstream and downstream to increase or decrease the distance from the RBS, respectively. We identified that a binding site that is 11 or fewer base pairs away from the RBS acts as an unproductive site, while those that are 12 or more base pairs away productively and cumulatively contributed to gene expression levels. (C) Mixed support for the role of unproductive binding on the reverse complement in driving expression. We identified four promoter sequences for which we introduced up to eight mutations that would not alter predicted gene expression levels if the model did not account for unproductive binding on the reverse complement, but would if the Extended model was used. In other words, the eight introduced mutations would gradually increase the strength of binding on the reverse complement while having a minimal effect on the strength of binding on the productive strand. 
Colored points indicate mutants whose measured expression changes in line with Extended model predictions, while black points are mutants whose expression does not change compared to the original promoter. Individual responses of each mutant are shown in Figure 2—figure supplement 1. (D) Improvement to predictability based on each promoter feature. Each structural promoter feature was added to the simpler iteration of the model, starting from the Standard model and building progressively toward the Extended. The clearance rate and dinucleotide interactions were included only in the model fitted on all three libraries. The ‘unproductive binding’ term includes the combined contribution of the occlusive unproductive binding and the unproductive binding on the reverse complement, both of which individually provided a small but significant improvement to model predictions. The values are the fraction of variance explained on the evaluation dataset. Figure 2—figure supplement 2 contains additional verifications of model predictions.

Figure 2—figure supplement 1. Unproductive binding on the reverse complement.

We identified four promoter sequences for which we could introduce up to eight mutations that would not alter predicted gene expression levels if the model did not account for unproductive binding on the reverse complement, but would if the Extended model was used (Figure 2C). Introduction of these mutations reduced the measured gene expression levels for two promoters (1 and 3) but had no effect on the expression levels from promoters 2 and 4.
Figure 2—figure supplement 2. Validation of model predictions.

(A) Verifying model predictions on 115-bp-long sequences. In order to verify the ability of the Extended model to predict expression levels from 115-bp-long promoters, and in particular to verify the model prediction of the ease of generating promoters from random non-expressing sequences, we generated 20 pairs of promoters. These pairs consisted of a randomly generated non-expressing sequence, and a sequence exactly one point mutation away that was predicted to have measurable expression. White bars are mean expression levels of three biological replicate measurements of non-expressing promoters; gray bars are the promoters with a single point mutation. Error bars are standard error of the mean. The gray horizontal bar indicates the detectability (‘no measurable expression’) threshold. (B) Model determination of occlusive unproductive binding sites. We evaluated the Extended model fitted on the PR training dataset, with varying thresholds between productive and occlusive unproductive bindings. Shown is the change in the negative log likelihood on the dataset indicated in the legend. Red arrow indicates the actual threshold ultimately used in the model. Because this modeling only provided a range for the productive/unproductive cutoff distance between the RNA polymerase (RNAP) binding site and the ribosomal binding site (RBS), we carried out dedicated experiments to systematically validate and probe occlusive unproductive binding (Figure 2B). (C) Optimal value of the clearance rate. We scanned through the possible values of the relative clearance rate of the σ70-RNAP complex from the promoter, using the Extended model fitted on the training subsets of all mutant libraries. For each value, we refit chemical potential and hyperparameters of logistic regression using the training dataset. Shown is the change in negative log likelihood on the training data. The optimal value is indicated with the red arrow, though a wider range of values is compatible with the data. 
For the majority of values in the compatible range including the optimal value, the model performance improves also on the validation dataset.

Experimental mutant libraries and the model

Both the Standard and the Extended model were fitted by subsampling a library of >12,000 constitutively expressed random mutants of one of the strongest and best characterized promoters, bacteriophage Lambda PR, that controlled the expression of a yfp reporter gene from a small copy number plasmid in E. coli (Figure 1—figure supplement 1A). In this library, each position in the promoter had a 4% chance of containing each of the non-wildtype bases, resulting in a 12% mutation rate per position. Sort-seq experiments were used to measure gene expression levels of mutants (Kinney and McCandlish, 2019): mutants were separated using a cell sorter into four phenotypic bins (‘no’, ‘low’, ‘medium’, and ‘high’ expression), PCR-tagged according to the sorted bin, and bulk sequenced to obtain the counts of each mutant sequence in the sorted bins (Figure 1—figure supplement 1C-E). The library was randomly split into three subsets: ‘training’ (60% of the sequences), used for training model parameters; ‘validation’ (20%) used for model selection whenever necessary; and ‘evaluation’ (20%) used exclusively for final evaluation and visualization of model performance.
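The mutation scheme and the 60/20/20 split described above can be mimicked in silico. The sketch below is illustrative only: `toy_wildtype` is a placeholder sequence rather than the real PR promoter, the function names are hypothetical, and the fixed seeds are there purely for reproducibility.

```python
import random

BASES = "ACGT"

def mutate(wildtype, per_base_rate=0.04, rng=random):
    """Mutate each position with probability 3 * per_base_rate (12% by default),
    choosing uniformly among the three non-wildtype bases (4% each)."""
    mutant = []
    for base in wildtype:
        if rng.random() < 3 * per_base_rate:
            mutant.append(rng.choice([b for b in BASES if b != base]))
        else:
            mutant.append(base)
    return "".join(mutant)

def split_library(sequences, fractions=(0.6, 0.2), seed=0):
    """Shuffle and split into training, validation, and evaluation subsets."""
    shuffled = list(sequences)
    random.Random(seed).shuffle(shuffled)
    n_train = round(fractions[0] * len(shuffled))
    n_val = round(fractions[1] * len(shuffled))
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )

rng = random.Random(42)
toy_wildtype = "ACGT" * 10  # placeholder, not the PR sequence
library = [mutate(toy_wildtype, rng=rng) for _ in range(1000)]
train, validation, evaluation = split_library(library)
print(len(train), len(validation), len(evaluation))
```

Keeping the evaluation subset untouched until the end, as the authors describe, is what allows the reported fraction of variance explained to be read as out-of-sample performance.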

We inferred model parameters independently for the Standard and the Extended models. For the Standard model, all parameters were inferred under an assumption that only a single (strongest) binding site contributes to expression and that the binding to that site can only contribute positively to expression (i.e. binding of RNAP is never occlusive or unproductive). Note that the Standard model allowed for different spacer lengths but assumed that the energy of binding was not affected by such variations. This already gives our Standard model much more fitting power relative to typical approaches that use a single RNAP energy matrix with a fixed spacer, thereby raising the bar for the Extended model. To infer the parameters for the Extended model, we calculated the binding strengths of σ70-RNAP to DNA in all possible configurations allowed by the structural features incorporated into the Extended model. This meant accounting for the energy impact of spacer length variations, the possibility of a promoter containing multiple σ70-RNAP binding sites, and the possibility that binding to a given site can contribute both positively and negatively to expression. Then, we integrated over all positive (‘on’) and negative (‘off’) configurations to obtain the probability (Pon) of productive σ70-RNAP binding, which, in this model, solely determines expression levels. For both models, we searched for optimal model parameters (Figure 1C) by maximizing the likelihood of the weighted multinomial logistic regression of log10Pon to the median observed expression bin for a training subset (60%) of the PR mutant library. The energy matrix values obtained by fitting the Extended model correlated strongly with those from a previously published energy matrix of σ70-RNAP, but also exhibited some systematic differences (r² = 0.72; p < 0.001) (Figure 1—figure supplement 2; Kinney et al., 2010).
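The objective used for fitting, a weighted multinomial logistic regression of log10Pon onto the observed expression bin, can be sketched as a negative log likelihood. This is a minimal illustration under assumptions: the per-bin `slopes` and `intercepts` parameterization, the function names, and the toy numbers are hypothetical, and the source does not specify how the weights or regression parameters are defined in detail.

```python
import math

def bin_probabilities(x, slopes, intercepts):
    """Softmax over bins, with logit_k = slopes[k] * x + intercepts[k]."""
    logits = [a * x + b for a, b in zip(slopes, intercepts)]
    top = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_nll(log10_p_on, bins, weights, slopes, intercepts):
    """Weighted negative log likelihood of the observed bins."""
    nll = 0.0
    for x, k, w in zip(log10_p_on, bins, weights):
        nll -= w * math.log(bin_probabilities(x, slopes, intercepts)[k])
    return nll

# Hypothetical toy data: two mutants, four bins (0 = 'no', 3 = 'high').
x_values = [-3.0, -1.0]   # log10 of P_on from the thermodynamic model
observed = [0, 3]         # median observed expression bin
weights = [1.0, 2.0]      # data-point weights
slopes = [0.0, 1.0, 2.0, 3.0]
intercepts = [0.0, 2.0, 4.0, 6.0]
print(weighted_nll(x_values, observed, weights, slopes, intercepts))
```

Minimizing this quantity over both the regression parameters and the thermodynamic parameters that determine Pon (energy matrix, spacer penalties, chemical potential) is equivalent to the likelihood maximization described above; any general-purpose optimizer could be used for that step.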

Both models, which were fitted on the training subset of the PR dataset only, reproduced the evaluation subset of the PR mutant library with high accuracy, with the Extended model significantly improving on the performance of the Standard model (Figure 3A and B). A closer look at the inferred model parameters (Figure 1—source data 1) provided novel insights into the mechanisms of RNAP binding, including: (i) the energetic cost of altering the length of the spacer between the –10 and –35 DNA binding elements, which appears to be approximately quadratic for each deviation away from the optimal 17 bp (spacer penalties in Figure 1C); (ii) more precise definition of the binding footprint of RNAP compared to previous RNAP energy matrices (Kinney et al., 2010), in particular identifying 11 positions outside the canonical –10 and –35 elements that directly and independently affect σ70-RNAP (energy matrix in Figure 1C); (iii) the role of occlusive unproductive binding of RNAP to the promoter, which reduces overall expression, and whose inclusion into the model significantly improves predictability (Figure 2D).

Figure 3. Model performance.

(A) Model-to-data correlation for the Standard (blue) and the Extended model (orange) trained on PR only, shown for the evaluation subset (20%) of the PR mutant library. The experimentally measured fluorescence of each mutant is the mean bin (out of four) across all sequenced reads of that mutant (we only consider mutants with at least 30× coverage). Best-fit line (dashed for the Standard model, solid black for Extended) and the instrument detectability threshold, which we refer to as the ‘measurable expression’ and which marks the 99th percentile of plasmid-free strain (red dashed line), are shown. Marker sizes indicate data-point weights used in fits. We assume that the model predictions are independent of the instrumentally determined measurable expression threshold. (B) Model performance on mutant libraries (PR, PL, 36N), shown as fraction of variance explained on evaluation data. Arrows indicate which bars correspond to the correlation plot shown in (A). (C) Cartoon of the three mutant libraries: PR and PL sample locally around a wildtype, with each position having a 12% or 9% chance, respectively, of containing the non-wildtype residue; 36N library contains random 36-bp-long sequences, meaning that it uniformly samples the full 36-bp-long genotypic space (see Figure 3—figure supplement 2G). Colored circle and triangle represent the wildtype PR and PL sequences, respectively. Figure 3—source data 1, Figure 3—source data 2 provide additional details on the processing of mutant libraries, as does the Figure 3—figure supplement 2. Figure 3—figure supplement 1 shows the performance of the two models on previously published datasets of promoter mutants. Figure 3—figure supplement 3 shows the plate reader validation of 36N library data processing.

Figure 3—source data 1. Processing of the mutant libraries and sizes of datasets after splits.
The table shows the number of reads remaining in the datasets following each step of data processing, from original sequenced library down to the final library used for model fitting and evaluation.
Figure 3—source data 2. Number of mutants per expression bin for each split of the PR, PL, and 36N dataset.
Bins are no (‘0’), low (‘1’), intermediate (‘2’), and high (‘3’) for the PR and PL libraries, and are ordered from lowest (‘0’) to highest (‘11’) for the 36N library.

Figure 3—figure supplement 1. Performance of Standard and Extended model on previously published datasets.

Expression level predictions from the Standard and the Extended model were correlated to measured expression levels in the promoter mutant libraries published by (A) Johns et al., 2018; (B) Hossain et al., 2020; (C) Urtecho et al., 2019. Red line shows the line of best fit, resulting in the reported correlation coefficient, r².
Figure 3—figure supplement 2. Processing of mutant libraries.

(A–E) Processing of PR and PL libraries. (A) All reads in the PR and PL libraries (gray), from which we take only those reads that are ±4 bp away in length from the wildtype sequence (dark gray). (B) Inverse cumulative distribution function (normalized to the total number of sequences), with shading indicating the sequences we removed due to having less than 10× coverage. (C) We removed sequences that had 20 or more single point mutations compared to their respective wildtype sequence. Note that this mainly affected the PL library (orange), as the original plasmid from which the libraries were cloned contained the wildtype PR sequence. (D) Cumulative distribution function (CDF) of standard deviation of expression bin numbers, with shading indicating the sequences we removed from subsequent analyses. (E) Box plots indicating the distributions of mean values (in bin units) for a given mode (in bin units), before (left) and after (right) selecting for only those where mean, mode, and median are within 0.5. (F–J) Processing of the 36N library. (F) Average histogram of alignment similarity for the 1000 most covered sequences (shaded area indicates 95% confidence interval). We used the similarity threshold of 0.7 between low- and high-scoring modes to select for unique sequences and eliminate sequencing errors. (G) Histogram of coverage (black line), with highlighted contributions of the noise cloud around the reference sequence (dark red), and the clouds around the 10, 100, and 1,000 most abundant sequences (from darkest to lightest shade of red, respectively). (H) Histogram of counts for the reference sequence per bin, used to debias all other distributions. (I) Template probability distribution functions (PDFs) obtained as averages of PDFs that have the same mode (indicated by color). The inferred fluorescence activated cell sorting (FACS) noise background is shown as a thick gray line. Given a distribution, we only accepted values in the bins in which the appropriate reference was three times above the inferred background. Such a filter is shown in (J).
Figure 3—figure supplement 3. Plate reader validation of 36N data processing.

Seventy-seven mutants (with an approximately equal number of mutants selected from each of the 12 bins) were selected randomly and their expression levels measured in a plate reader. We correlated their expression measured in the plate reader with our estimates in fluorescence activated cell sorting (FACS) units (left) and bin units (right). The vertical red dotted line marks the measurable expression threshold in the flow cytometer. Measured expression in FACS units and expression estimate are shown in log scale.

Predicting expression from random sequences

To assess the cross-dataset predictability of the two models, we created two additional mutant libraries (Figure 3C) with >10,000 mutants each and analyzed them in sort-seq experiments. First, we created a library consisting of random mutants of the Lambda PL promoter, which shares sequence homology to PR. Second, in order to determine model performance across the entire unconstrained genotypic landscape, we built a library consisting of completely random 36 nucleotides controlling the expression of a gfp reporter gene (‘36N’). In other words, each of the 36 positions had a 25% chance of being either adenine, cytosine, guanine, or thymine. This library was created in a different strain (Figure 1—figure supplement 1B), and the expression was sorted into 12 bins instead of 4, giving greater precision in testing the model-to-data fit. Interestingly, ~10% of experimentally measured, random 36-bp-long sequences led to measurable expression (Figure 1—figure supplement 1F).

While the Extended model trained on the PR library only modestly improved predictability of the PR and local cross-predictability (PL) datasets compared to the Standard model, it dramatically improved predictions of gene expression levels from random sequences (Figure 3A and B). This improved performance of the Extended model compared to the traditional view of promoters (the Standard model) indicates that the structural features we accounted for are key components of promoter function. In fact, their inclusion helps overcome one of the major shortcomings of thermodynamic models of gene expression – their low predictability on novel datasets (Vilar, 2010), which is observed in the Standard model’s low predictability of the 36N library.

Each of the structural features included in the Extended model (Figure 1A) significantly improved predictability (Figure 2D). The contribution of each structural feature to predictability differed between mutant libraries, but, in general, accounting for the energy cost of variable spacer length improved expression level predictions the most. Fitting the Extended model by subsampling all three libraries (>30,000 mutants), instead of just the PR library, led to further improvements in predictability of expression from random sequences (Figure 3B). This resulted in r² ~0.8, which corresponds to a mean error in expression predictions of 2.8-fold, across a 1000-fold expression range. Fitting the model on all libraries also allowed for a more reliable estimation of dinucleotide interactions between promoter positions that contact σ70-RNAP (Figure 1C). Dinucleotide interactions are moderate contributors to promoter function (Otwinowski and Nemenman, 2013; Figure 2D), and appear to be stronger when at least one residue is outside of the canonical –10 and –35 binding sites.

The Standard model explicitly assumes that only a single binding position of RNAP governs promoter activity, which we model by identifying the single strongest binding site present in each random sequence in the 36N library (in contrast to the Extended model, which considers all possible σ70-RNAP binding configurations cumulatively). Because of this assumption of the Standard model, its low cross-predictability (i.e. its performance on fully random sequences – Figure 3B) is often attributed to context dependence: the poorly understood effect of positions flanking the –10 and –35 elements (Vilar, 2010; Forcier et al., 2018). In this study, the ability of RNAP to bind in multiple configurations to the same promoter, rather than a single one, led to the improved predictability of the Extended compared to the Standard model (Figure 3). As such, the Extended model explains away the extensive context dependence as a consequence of the promoter structural features that we account for (Figure 1A).
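The contrast between the two assumptions can be illustrated with a minimal sketch. It assumes a generic thermodynamic occupancy of the form P_on = Z / (1 + Z) with Boltzmann weights exp(−(E − μ)) in units of k_BT; this functional form, the function names, and all energy values below are illustrative assumptions, not the fitted model from this study.

```python
import math

def p_on_standard(energies, mu):
    """Occupancy using only the single strongest (lowest-energy) site."""
    w = math.exp(-(min(energies) - mu))
    return w / (1.0 + w)

def p_on_extended(energies, mu):
    """Occupancy summing Boltzmann weights over all binding configurations."""
    z = sum(math.exp(-(e - mu)) for e in energies)
    return z / (1.0 + z)

# Made-up energies (k_BT units) for the candidate binding configurations
# of two hypothetical sequences; mu is an assumed chemical potential.
weak_promoter = [4.0, 4.2, 4.5, 4.1]     # several comparably weak sites
strong_promoter = [-3.0, 4.2, 4.5, 4.1]  # one dominant site
mu = 0.0

fold_weak = p_on_extended(weak_promoter, mu) / p_on_standard(weak_promoter, mu)
fold_strong = p_on_extended(strong_promoter, mu) / p_on_standard(strong_promoter, mu)
print(f"fold increase, weak promoter:   {fold_weak:.2f}")
print(f"fold increase, strong promoter: {fold_strong:.4f}")
```

In this toy case the cumulative sum changes the prediction several-fold for the weak sequence and barely at all for the strong one, which is the qualitative behavior the text attributes to cumulative binding.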

Testing model generalizability

In order to test how well the Extended model predicts gene expression levels from any sequence, we utilized three published, large-scale, promoter mutant libraries whose expression was measured in E. coli (Johns et al., 2018; Urtecho et al., 2019; Hossain et al., 2020). Each study also developed a model to predict the variation in gene expression levels observed in their respective datasets, enabling a comparison of performance to our model. As they were generated for different purposes, these libraries contain promoter variants with different features, including but not limited to the features we included in the Extended model (Figure 1). We used the Standard and the Extended models with chemical potential as the only physical parameter refitted, meaning that all other model parameters were fitted only from the libraries developed in our study.

The Johns et al., 2018 library consists of ~15,000 promoter sequences harnessed from a wide range of bacterial species and expressed in E. coli and contains the most random-like distribution of mutations. As expected, the Extended model (r² = 0.626) outperformed both the Standard model (r² = 0.535) and the model presented in the paper (r² = 0.47) (Figure 3—figure supplement 1A). The Hossain et al., 2020 library contained 4350 sequences with the consensus –10 and –35 elements, as well as a fixed spacer length, with all sequence variability contained within the upstream, downstream, and spacer regions. Both the Extended (r² = 0.602) and the Standard (r² = 0.598) models had high performance on the dataset compared to the model presented in the paper (r² = 0.45), indicating that while the conserved consensus –10 and –35 elements drove the bulk of the observed expression variation, we nevertheless extracted significant predictability gains due to the extended RNAP binding footprint (Figure 3—figure supplement 1B). Urtecho et al., 2019, developed ~11,000 promoters consisting of all combinations of a set of modules: eight –35 elements, eight –10 elements, three UP elements, eight spacers, and eight backgrounds. The Extended model (r² = 0.599) slightly outperformed the Standard model (r² = 0.588), suggesting that the eight background sequences contained different secondary σ70-RNAP binding sites that cumulatively contributed to expression (Figure 3—figure supplement 1C). However, both models underperformed the machine learning approach presented in the paper (r² = 0.955) as the expression variation from these modular promoters was driven by interactions between components (Urtecho et al., 2019), which were not captured by the Extended model. It is important to emphasize, however, that the machine learning model developed by Urtecho et al. considers only combinations of a small number of predefined sequence elements, and as such cannot predict expression level from arbitrary sequences.

Together, these analyses point to two important aspects of promoter function. First, the high performance of the Extended model on most tested datasets without refitting (comparable to the performance on the 36N library of the Extended model trained only on the PR library), combined with its high predictability of expression from random sequences (the 36N library), suggests that the binding energies and the structural promoter features described in this study (Figure 1) are important quantitative determinants of promoter function. Second, while these features are sufficient to predict expression levels from many sequences, especially if sequences are random, the Extended model failed to capture the variation arising from the interactions between various promoter features (Urtecho et al., 2019). The mechanisms that govern the way in which different promoter features – beyond those captured by the Extended model – interact with each other remain largely unclear – but see Einav and Phillips, 2019.

Constitutive promoters evolve readily from random sequences

Having a biophysical model capable of predicting gene expression levels from random sequences enabled us to computationally sample the entire genotypic space in order to describe the genotype-phenotype mapping of constitutive promoters. Approximately 20% of random 115-bp-long sequences (average inter-genic region length in E. coli) are predicted to have measurable expression, compared to ~8% for the Standard model (Figure 4A). Surprisingly, for ~82% of non-expressing random sequences, at least one point mutation could be found that would lead to measurable expression (Figure 4B) – a finding that we verified experimentally (Figure 2—figure supplement 2A). In total, more than 1.5% of all possible point mutations converted a random non-expressing sequence into a functional promoter (Figure 4C), indicating that promoter sequences readily emerge, as previously suggested based on a limited sample of ~40 sequences (Yona et al., 2018).
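The point-mutation scan described above can be sketched as follows. The scoring function here is a deliberately crude, hypothetical stand-in (best match count to a single hexamer), not the Extended model; only the enumeration of all 3 × L single substitutions and the thresholding logic are meant to carry over.

```python
BASES = "ACGT"

def point_mutants(seq):
    """Yield all 3*L single-nucleotide substitutions of seq."""
    for i, wt in enumerate(seq):
        for b in BASES:
            if b != wt:
                yield seq[:i] + b + seq[i + 1:]

def toy_score(seq, motif="TATAAT"):
    """Hypothetical stand-in for a model: best match count to one motif."""
    k = len(motif)
    return max(sum(a == b for a, b in zip(seq[i:i + k], motif))
               for i in range(len(seq) - k + 1))

def fraction_activating(seq, predict, threshold):
    """Fraction of point mutants whose predicted score reaches threshold."""
    mutants = list(point_mutants(seq))
    hits = sum(predict(m) >= threshold for m in mutants)
    return hits / len(mutants)

# Toy non-expressing sequence: its best window matches 5 of 6 motif positions.
seq = "GGGGTATCATGGGG"
frac = fraction_activating(seq, toy_score, threshold=6)
print(f"{frac:.3%} of point mutations reach the threshold")
```

Swapping `toy_score` for any callable that maps a sequence to predicted expression gives the same scan for a real model.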

Figure 4. Evolution of promoters.

(A) Probability density function (PDF) for the flow cytometry measurement of the 36N library (gray dashed line) compared to the flow cytometry fluorescence intensities simulated from 10⁶ randomly generated 115-bp-long sequences using the Extended model fitted on all three libraries. Red dotted line marks the cutoff for ‘measurable expression’, estimated from experimental data. Measurable expression is defined to correspond to the 99th percentile of fluorescence measurements of the experimental strain carrying no plasmid and, hence, no fluorescence. Inset: cumulative distribution function (CDF) for the same comparison. (B) Density heat map (brighter color represents higher density), showing, for every simulated random sequence (expression on x-axis), the expression of a single point mutant with the largest positive effect on predicted expression (Pon) (y-axis). For 82% of non-expressing random sequences (sequences to the left of the dotted line on the x-axis), that mutation led to measurable expression (gray area). (C) Box plot showing the percentage of all possible point mutations predicted to convert a given random non-expressing sequence into one with measurable expression (obtained from 10⁵ random sequences). (D) Increase in rates of adaptive evolution of the Extended relative to the Standard model. Evolution to either weakest measurable expression, or high (PR) expression levels was modeled through single point mutations. Evolution was simulated 100 independent times for each of the 100 random 115-bp-long starting sequences, by mutating the central contiguous part of the indicated length. Evolving promoters would almost never reach high expression levels when only a region smaller than the RNA polymerase (RNAP) binding site (30 bp) was allowed to mutate. 
(E) For evidence of selection against σ70-RNAP binding sites, we compared the free energy per bp between either the inter-genic (typically, promoter containing) or the within-genic regions of the Escherichia coli genome, and that of a random sequence with the GC% of the corresponding region (note that higher energy means weaker binding and hence lower expression). At lower binding energies (corresponding to stronger binding), the actual number of binding sites in the E. coli genome (teal) is lower than expected based on random sequences (gray). Associated p-values are also shown. The total number of binding sites increases with binding energy (i.e. there are a lot more weaker than stronger binding sites), explaining the variability in p-values. (F) CDFs for predicted binding strengths of different E. coli promoters, obtained from RegulonDB. Figure 4—figure supplement 1 shows further details on how promoter evolution was modeled. Figure 4—figure supplement 2 contains the information about the contribution of cumulative binding to expression. Figure 4—figure supplement 3 shows additional tests for selection against σ70-RNAP binding sites.

Figure 4—figure supplement 1. Modeling evolution.

(A) Cumulative distribution function (CDF) of the median times for promoter evolution under the Extended (orange) and Standard (blue) models for s = 1, N = 10⁴, length of central mutagenized region of 70 bp and high target expression level (that of wildtype PR). Evolution was simulated 100 independent times for each of the 100 starting random sequences. We present this specific set of parameters as this is the case where the largest fraction of simulations stopped at 10 N iterations (our simulation limit), before reaching the target expression. For all parameter combinations, including the one shown here, more Standard model simulations terminate at 10 N iterations compared to Extended model simulations. Taking a ratio of the mean time under this CDF for the Extended model over that for the Standard model therefore represents a conservative lower bound for the speedup in promoter evolution. (B) Selection at two different population sizes (top panel: N = 10³; bottom panel: N = 10⁴) using the Strong-Selection-Weak-Mutation model at two selection strengths (s) and selecting to either PR-levels of expression or any measurable expression. Selection was simulated through 100 independent runs for each of the 100 random starting sequences, with different lengths of the sequence allowed to mutate. Error bars are standard errors of the mean across all replicates and starting sequences. Indicated selection refers to the selection on the phenotype difference (Δlog10Pon).
Figure 4—figure supplement 2. Cumulative binding contributes more to expression at weak promoters.

For 100,000 random 100-bp-long sequences, we calculated the fold increase in predicted gene expression levels of the Extended model compared to the model that is constrained to only the single strongest σ70-RNAP binding site. Predicted expression levels from stronger promoters (higher log10Pon) were determined primarily by binding to the strongest σ70-RNAP binding site. In contrast, predicted gene expression levels at weak promoters were more likely to be determined by σ70-RNAP binding at multiple sites. The orange line is the trend line obtained through non-linear regression.
Figure 4—figure supplement 3. Further tests of selection against σ70-RNAP (RNA polymerase) binding sites.

(A) To provide an alternative measure to that presented in Figure 4E, instead of creating a random sequence and comparing the number of predicted σ70-RNAP binding sites in it and in the Escherichia coli genome, here we created 100 shuffled σ70-RNAP energy matrices and used each of them to predict the expression from every single position in the E. coli genome. For each shuffle, we constructed cumulative histograms of free energy for inter-genic and within-genes regions. For each bin, we then calculated the p-value of the Extended model that used the actual σ70-RNAP energy matrix, assuming a normal distribution with mean and standard deviation given by the set of models with shuffled matrices. This is a conservative estimate, as for energies ΔE < 1, the assumption of Gaussian distribution leads to overestimates of standard deviation. The matrices were shuffled per position, that is, an energy matrix of dimension 4× L, with L being the length of the binding site, is shuffled by randomly reordering the L columns while leaving the energy entries in each column unchanged in order and magnitude. Gray lines represent 95% confidence intervals. (B) For evidence of selection against σ70-RNAP binding sites only in the inter-genic regions that contain experimentally confirmed promoters (based on RegulonDB), we compared model-predicted binding energy across the region to the expected binding for a 108 bp random sequence with the GC% of the corresponding region. Also shown is the selection against binding sites within genes (same as in Figure 4E). Gray shaded areas are 95% confidence intervals.
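The column shuffle and the normal-approximation p-value described in (A) can be sketched as below. The matrix values are made up, the function names are hypothetical, and the choice of a one-sided lower tail is an assumption for illustration; the caption does not state which tail was used.

```python
import math
import random
import statistics

def shuffle_columns(matrix, rng):
    """Reorder the L columns of a 4 x L matrix; entries within a column stay put."""
    order = list(range(len(matrix[0])))
    rng.shuffle(order)
    return [[row[j] for j in order] for row in matrix]

def lower_tail_p(observed, null_values):
    """P(X <= observed) for a normal fitted to the shuffled-matrix results.

    Assumption: one-sided lower tail, mean and sample SD from null_values.
    """
    mean = statistics.mean(null_values)
    sd = statistics.stdev(null_values)
    z = (observed - mean) / sd
    return 0.5 * math.erfc(-z / math.sqrt(2))

# Made-up 4 x 5 energy matrix (rows: A, C, G, T; columns: positions).
matrix = [
    [0.0, 1.0, 2.0, 3.0, 4.0],
    [0.5, 1.5, 2.5, 3.5, 4.5],
    [0.1, 1.1, 2.1, 3.1, 4.1],
    [0.9, 1.9, 2.9, 3.9, 4.9],
]
shuffled = shuffle_columns(matrix, random.Random(0))
```

Because whole columns move together, each shuffled matrix keeps the same per-position energy distribution while scrambling which position it applies to.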

To contextualize these findings, we simulated evolution under directional selection for expression from random non-expressing sequences using a Strong-Selection-Weak-Mutation model (Gillespie, 1983) adapted from Tuğrul et al., 2015. In short, this model considers one mutation at a time: as one mutation arises randomly in the sequence, the model determines the probability of that mutation getting fixed in the population or eliminated due to selection and drift, and only then is the fate of the next random mutation considered. This evolutionary scenario probes the upper speed limit for the emergence of new functional promoters. The Extended model predicts, on average, more rapid evolution from a non-expressing sequence (Figure 4—figure supplement 1A), toward either weakest measurable expression or strong (wildtype PR level) expression, when compared to the Standard model (Figure 4D, Figure 4—figure supplement 1B). The difference in rates of evolution becomes even more pronounced as the mutating sequence gets shorter (simulating stronger constraints due to, for example, existence of other transcription factor binding sites) (Figure 4D). The increased rate of promoter evolution predicted by the Extended, as compared to the Standard model, means that promoters can evolve even more rapidly than expected based on the standard RNAP consensus sequence or the corresponding energy matrix. Under certain conditions, the rate of increase is several orders of magnitude. This increase arises from the inclusion of the new structural promoter features (Figure 1A), which dramatically enlarges the number of possible productive binding configurations – a phenomenon particularly evident at lower expression levels, where cumulative binding across multiple binding sites substantially contributes to overall expression (Figure 4—figure supplement 2).
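A minimal sketch of a one-mutation-at-a-time simulation in this spirit is shown below. It assumes a Kimura-type fixation probability for a haploid population with selection coefficient s·Δ(phenotype); that formula, the toy phenotype (best match count to a single hexamer), and all names are illustrative assumptions rather than the implementation used in this study.

```python
import math
import random

BASES = "ACGT"

def fixation_probability(delta, s, n):
    """Assumed Kimura-type fixation probability for phenotype change delta."""
    sel = s * delta
    if sel == 0:
        return 1.0 / n               # neutral mutation
    x = -2.0 * n * sel
    if x > 700:                      # strongly deleterious; avoids overflow
        return 0.0
    return math.expm1(-2.0 * sel) / math.expm1(x)

def toy_phenotype(seq, motif="TATAAT"):
    """Hypothetical stand-in for log10(Pon): best match count to one motif."""
    k = len(motif)
    return max(sum(a == b for a, b in zip(seq[i:i + k], motif))
               for i in range(len(seq) - k + 1))

def evolve(seq, phenotype, target, s, n, max_steps, rng):
    """Number of proposed mutations until phenotype >= target, else None."""
    current = phenotype(seq)
    if current >= target:
        return 0
    for step in range(1, max_steps + 1):
        i = rng.randrange(len(seq))
        b = rng.choice([c for c in BASES if c != seq[i]])
        mutant = seq[:i] + b + seq[i + 1:]
        new = phenotype(mutant)
        # The fate of this mutation is decided before the next one arises.
        if rng.random() < fixation_probability(new - current, s, n):
            seq, current = mutant, new
            if current >= target:
                return step
    return None

n = 1000
steps = evolve("G" * 12, toy_phenotype, target=6, s=1.0, n=n,
               max_steps=10 * n, rng=random.Random(0))
print("proposed mutations until target:", steps)
```

The `max_steps=10 * n` cap mirrors the 10 N iteration limit mentioned in Figure 4—figure supplement 1; a run that hits the cap returns `None`.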

Selection against RNAP binding sites in E. coli promoters

The pervasiveness of σ70-RNAP binding sites among random DNA sequences has surprising implications for understanding the evolution of gene regulation: the critical question is no longer how do promoters arise, but rather how does a cell avoid expressing everything, all the time. To address this, we asked if there was evidence of selection against σ70-RNAP binding sites in the E. coli genome (Figure 4E, Figure 4—figure supplement 3A; Mustonen et al., 2008). While identifying strong selection against σ70-RNAP binding sites within genes (Yona et al., 2018), surprisingly, we also found evidence of selection against binding sites in inter-genic regions (p < 10⁻⁴⁰ across all inter-genic regions). This negative selection was observed even when only the known, experimentally determined σ70 promoters (obtained from RegulonDB; Salgado et al., 2013) were considered (Figure 4—figure supplement 3B), most of which strongly bind σ70-RNAP and hence lead to intermediate and high expression levels (Figure 4F). Thus, there is selection against σ70-RNAP binding sites not only within genes, but also in the inter-genic regions that contain active promoters. This selection against too many binding sites in promoters could have arisen from a direct competition between σ70-RNAP and other transcription factors, and/or in response to the potential, and poorly understood, cost for a promoter to contain multiple weaker (as opposed to fewer stronger) σ70-RNAP binding sites.
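The comparison against GC-matched random sequences (Figure 4E) can be sketched in outline as follows. The "energy" is a hypothetical toy (mismatches to one hexamer) standing in for the fitted energy matrix, and the observed region is a made-up string; only the construction of the GC-matched null and the counting of sites below an energy cutoff are illustrated.

```python
import random

MOTIF = "TATAAT"  # toy stand-in for an energy matrix

def gc_matched_random(length, gc, rng):
    """Random sequence in which each base is G or C with probability gc."""
    out = []
    for _ in range(length):
        out.append(rng.choice("GC") if rng.random() < gc else rng.choice("AT"))
    return "".join(out)

def toy_energy(window):
    """Hypothetical energy: mismatches to MOTIF (lower means stronger binding)."""
    return sum(a != b for a, b in zip(window, MOTIF))

def count_sites(seq, cutoff):
    """Number of windows whose toy energy is at or below cutoff."""
    k = len(MOTIF)
    return sum(toy_energy(seq[i:i + k]) <= cutoff
               for i in range(len(seq) - k + 1))

rng = random.Random(1)
observed = "GGTATAATGGCGCGTATCATGG"  # made-up region, not genomic data
gc = sum(b in "GC" for b in observed) / len(observed)

observed_count = count_sites(observed, cutoff=1)
null_counts = [count_sites(gc_matched_random(len(observed), gc, rng), cutoff=1)
               for _ in range(1000)]
```

Comparing `observed_count` with the distribution in `null_counts` at each cutoff is the kind of test the text describes; the toy numbers here carry no biological meaning.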

Discussion

While DNA sequence ultimately encodes all the genetic elements for the function of living systems, decoding this function directly from sequence remains a grand challenge. Despite limited success, for example, in predicting the effects of mutations on protein or RNA secondary structures (Schuster, 2006; Dill and MacCallum, 2012), we lack a quantitative link between most cellular processes and DNA sequence.

One of the most fundamental cellular processes is the regulation of gene expression, which is governed by the binding of σ factors, RNAP, and transcription factors to promoters. Yet, we lack a way to decode promoters: a code or dictionary that would map promoter sequences into their function – the level of gene expression (Kinkhabwala and Guet, 2008). This mapping is fundamental for understanding gene regulatory network structure and their evolution (Voordeckers et al., 2015; Lässig et al., 2017), and for engineering synthetic biological systems (Kim et al., 2009). Moreover, gene expression is a phenotype that captures our intuitive notion of promoter function, enabling simulation of evolutionary scenarios that select for that function – a feat that appears unrealistic for cases where a phenotype has no obvious functional significance. Accounting for structural features of promoters (Figure 1A) improved the predictions of gene expression levels from any DNA sequence, by extending the applicability of thermodynamic modeling toward low expression levels and beyond local neighborhoods of known promoters (Vilar, 2010; Forcier et al., 2018), to capture the entire genotypic space. This is a necessary precondition to ask questions about de novo promoter evolution. The Extended model also allowed us to better understand key mechanisms of promoter function, such as the existence of multiple binding sites that cumulatively contribute to expression (Figure 4—figure supplement 2). Recently, a complementary approach described two further structural features of DNA-RNAP interaction, the avidity between –10 and –35 binding sites and the UP element (Einav and Phillips, 2019), yet this approach was not designed to predict expression from random sequences. Integrating these and other additional mechanisms of RNAP and promoter function (Hawley and McClure, 1983; Roy et al., 1998) into our Extended model could further improve promoter sequence-to-function predictions. 
Furthermore, the thermodynamic framework can be extended to model complex, regulated promoters that bind transcription factors, as long as their biophysical parameters (energy matrices) are known (Bintu et al., 2005; Saiz and Vilar, 2008).

The flexibility of σ70-RNAP binding ensures the proximity of any DNA sequence to a constitutive promoter (Figure 3). This permits easier evolutionary tuning of gene expression levels, while also allowing σ70-RNAP to function as a de facto global regulator (Salgado et al., 2013; Igler et al., 2018). Selection can thus sustain the binding specificity of required transcription factors and reduce crosstalk while maintaining expression (Friedlander et al., 2016). The ease of evolving constitutive promoters, which has been experimentally hinted at using a very limited number of mutants (Wolf et al., 2015; Yona et al., 2018), remained theoretically puzzling (Tuğrul et al., 2015). In fact, Tuğrul et al. failed to identify mechanistic or evolutionary factors that would increase the predicted speed of binding site evolution and bring it closer to experimental observations. To address this discrepancy, we linked an extended biophysical model of promoter function, inferred directly from large mutant libraries, with a quantitative evolutionary framework to show that promoters indeed evolve more rapidly than expected based on existing models of protein-DNA interactions. Strikingly, this increase in the rates of promoter evolution, which can be orders of magnitude faster in the Extended compared to the Standard model, arises from the same promoter features that account for only a 2-to-3-fold increase in the likelihood of random sequences resulting in measurable expression. We hypothesize that this potentiation comes about due to a vast expansion in the number of accessible evolutionary paths that lead from a random initial sequence to a functional promoter, likely created by the flexible spacer, which is an interesting question for future research. Another key question that arises from our results is how σ70-RNAP navigates the fine balance between evolvability of binding sites and their specificity, the latter hinging on selection against spurious binding. 
Our study shows how biophysical models that accurately capture mechanisms of biological function offer a robust method for addressing key questions in biology and evolution.

Materials and methods

Experimental systems and mutant libraries

The PR system consisted of venus-yfp (Nagai et al., 2002) under the control of the Lambda bacteriophage PR promoter (Figure 1—figure supplement 1A). The system was isolated from the rest of the plasmid with two strong terminators, T1 and T17, obtained from the iGEM Parts Registry. The OR3 site of PR was removed in order to remove the PRM promoter. The RBS (carrying the AGGAGG Shine-Dalgarno sequence) was 28 bp away from the downstream end of the strongest σ70-RNAP binding site in the PR promoter. The entire cassette was inserted into a low-copy number plasmid backbone SC101* carrying a kanamycin resistance gene (Lutz and Bujard, 1997). The random PR mutant library was created by cloning custom-made oligonucleotides (IDT Technologies), which had a 12% mutation rate for each of the 67 positions in the PR promoter (4% mutation chance for each possible mutation away from the wildtype). The mutagenized region concluded 8 bp upstream of the wildtype PR promoter transcriptional start site, and hence that start site was not mutated. Ligated plasmids were electroporated into One-Shot Top10 electrocompetent E. coli cells (Life Technologies, Carlsbad, CA). We used these cells to maximize library diversity because of their high competency. Following electroporation, cells were grown for 1 hr in LB broth and plated on selective LB plates with 50 μg/mL kanamycin to allow single colony formation and minimize resource competition, and were then grown overnight. To ensure large coverage, we estimated the number of colonies on each selective plate and repeated the above cloning procedure until each library contained at least 30,000 individual colonies (uniquely transformed individuals). Using chilled LB media, colonies were washed off plates and collected. 
Plasmids were isolated from this collection in bulk using a Qiagen Mediprep kit, and transformed into strain K12-MG1655. The same wildtype layout and mutagenesis protocol were used to create the PL mutant library, with the exception that the mutation rate was 9% per nucleotide (3% for each possible non-wildtype base). Cells were always grown in a shaking incubator at 37°C.
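As a quick check on what these mutation rates imply, the following sketch simulates doped oligonucleotides in silico. The wildtype used here is a random 67-nt placeholder, not the actual PR sequence, and `mutagenize` is a hypothetical helper. With a 12% per-position rate over 67 positions, the expected number of mutations per sequence is 67 × 0.12 ≈ 8.

```python
import random

BASES = "ACGT"

def mutagenize(wildtype, rate, rng):
    """Mutate each position with probability rate, choosing uniformly
    among the three non-wildtype bases (rate / 3 per alternative)."""
    out = []
    for base in wildtype:
        if rng.random() < rate:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)

rng = random.Random(0)
# Placeholder wildtype: a random 67-mer standing in for the PR promoter.
wildtype = "".join(rng.choice(BASES) for _ in range(67))
library = [mutagenize(wildtype, 0.12, rng) for _ in range(5000)]

mean_mutations = sum(
    sum(a != b for a, b in zip(mutant, wildtype)) for mutant in library
) / len(library)
print(f"mean mutations per sequence: {mean_mutations:.2f}")
```

Passing `rate=0.09` gives the corresponding simulation for the PL library settings.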

The 36N library was placed in pUA66-lacZ plasmid backbone carrying Kan resistance (Zaslaver et al., 2006), by cloning a 100-bp-long oligonucleotide containing 36 randomized base pairs (each of the 36 positions had a 25% chance of being either adenine, thymine, cytosine, or guanine), surrounded on each side by 32 bp of randomly generated, unmutated, non-expressing DNA sequence (Figure 1—figure supplement 1B). The two 32-bp-long segments had different sequences, as each was randomly generated. This 100 bp sequence was placed upstream of the RBS, controlling the expression of a gfp gene (Zaslaver et al., 2006). The rest of the plasmid had the same components as the one used for the PR library. The 36N plasmid DNA library was generated using a Q5 site directed mutagenesis kit (New England Biolabs, Ipswich, MA). For amplification, we used the reference plasmid as a template (a plasmid with a random, non-expressing 100 bp fragment) and two pools of primers with a constant 3’ end and an 18N random 5’ end. We cloned the mutagenesis products into electro-competent NEB5α cells. Following electroporation, cells were grown for 1 hr in LB and then plated overnight on selective kanamycin plates. Cells were collected in bulk to form the 36N mutant library. We used different strains between the PR and the 36N libraries in order to better evaluate and increase the generalizability of our model. The mutation rates of all three libraries (Figure 1—figure supplement 1G), as well as the sequence-level randomness of the 36N library, were checked based on the sort-seq data. The libraries also contained promoter sequences with a range of spacer lengths (Figure 1—figure supplement 1H).

Sort-seq Experiments

Prior to sorting, cells were grown in M9 minimal medium with 0.2% (w/v) Casamino acids, 0.2% (w/v) glucose and 50 μg/mL kanamycin. Frozen aliquots of each mutant library were diluted 1:100 and grown overnight. Prior to sorting, overnight cultures were diluted again 1:100 and grown until exponential phase (for 3 hr in M9 media). We repeated the sorting of each library in three biological replicates.

Fluorescence activated cell sorting (FACS) was performed on a FACS Aria III flow cytometer (BD Biosciences, San Jose, CA) with a 70 μm nozzle. A 488 nm laser was used to detect forward scatter (FSC) and side scatter (SSC) with a 488/10 band-pass filter. FITC channel was used for excitation of either YFP (PR and PL libraries) or GFP (36N library). The flow rate was set to 1.0 and samples were diluted to obtain a cell count of approximately 2000 events/s. Cells for sorting were manually gated on the densest population in an FSC/SSC scatter plot, which comprised 95.5% of all events exceeding a threshold of 1000 on the SSC axis. For the PR and PL libraries, four sorting gates were set on FITC: no-expression gate, capturing >99% of all measurements from a non-expressing plasmid (control plasmid not containing yfp); high-expression gate, capturing >99% of all measurements from the wildtype PR plasmid; two gates equidistant in fluorescence between the no- and high-expression gates (Figure 1—figure supplement 1F). The boundary for the no-expression gate (obtained from the SC101* strain without the fluorescence marker) was used to define the ‘measurable expression’ threshold for PR and PL libraries. The 36N library was sorted into 12 gates as follows: The upper boundary of the lowest gate corresponded to the median of an auto-fluorescence control sample (plasmid-free Top10 cells). The upper boundary of the second-lowest gate, which also defined the ‘measurable expression’ threshold for the 36N library, was the 99th percentile of the fluorescence-free cells. The lower boundary of the highest gate (B12) was set to 2 × 10⁴. Distances between the remaining intermediate nine gate boundaries were of equal size on the log-scale FITC histogram (Figure 1—figure supplement 1F). For PR and PL libraries we could sort into all four bins simultaneously, and hence we sorted 1 million cells for each of the three biological replicates. 
For the 36N library, we first recorded 10⁵ reads, and then the number of cells sorted into each of the 12 bins corresponded to the number of cells recorded in each of the bins. The recipient plate was cooled to 4°C to halt growth while sorting to other wells was still going on. Only for the 36N library, after sorting we added 1000 cells with the reference plasmid (experimental PR plasmid) into each of the 12 sorted populations. We did this to maximize the precision of our experimental measurements for the 36N library, as it allowed us to normalize the number of mutants in each bin and thereby determine gene expression levels more accurately (see section Processing of the 36N mutant library).
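The 12-gate layout for the 36N library can be sketched numerically. The reading adopted here, that the nine gates between the ‘measurable expression’ threshold and 2 × 10⁴ have equal width on a log scale, is an interpretation of the description above, and the two control-derived values are hypothetical placeholders rather than measured numbers.

```python
def log_spaced_boundaries(low, high, n_gates):
    """Boundaries splitting [low, high] into n_gates equal widths on a log scale."""
    ratio = (high / low) ** (1.0 / n_gates)
    return [low * ratio ** k for k in range(n_gates + 1)]

autofluor_median = 50.0   # hypothetical: median of the auto-fluorescence control
threshold_p99 = 200.0     # hypothetical: 99th percentile of fluorescence-free cells
top_boundary = 2e4        # lower boundary of the highest gate, from the text

# Nine log-equal gates between the threshold and the top boundary (assumed reading).
inner = log_spaced_boundaries(threshold_p99, top_boundary, 9)

# Eleven boundaries in total separate the 12 gates.
boundaries = [autofluor_median] + inner
for k, b in enumerate(boundaries, start=1):
    print(f"boundary between gate {k} and gate {k + 1}: {b:,.0f}")
```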

Cells from each sorted bin were grown overnight. We isolated plasmids from the sorted populations. We used high-fidelity PCR (Phusion, New England Biolabs) to amplify 150 bp containing the mutagenized region, and barcoded the primers according to the sorted bin. Four sets of barcoded primers were used for the PR and PL libraries, and 12 for the 36N library. PCR products were column-purified (Zymo Research, Irvine, CA) and eluted in 30 μL, of which 2 μL were run on an agarose gel for product quantification based on band fluorescence. PCR products were pooled to reach approximately equimolar concentrations of each bin, separately for each mutant library. No additional clonal amplification steps were conducted prior to sequencing. Each library was sequenced with millions of reads using 135 bp paired-end Illumina sequencing (HiSeq).

The nature of the 36N library – it containing completely random 36-bp-long sequences – means that we might be measuring not only the effects of each random sequence on σ70-RNAP binding, but also possibly its effects on other factors that impact gene expression levels, the most important of which could be mRNA stability. Our experimental setup does not allow us to directly disentangle the effects of each individual sequence on mRNA stability versus binding energy. On the other hand, the PR and PL libraries have a clearly defined wildtype sequence around which all mutations were introduced. Introducing a relatively small number of mutations into these promoters is unlikely to dramatically alter mRNA stability (Mohanty and Kushner, 2016). The fact that the Extended model trained only on the PR dataset performs so well on the 36N library (Figure 3B) makes us confident that the majority of 36N mutants predominantly affect σ70-RNAP binding rather than mRNA stability. In other words, while the primary mechanism linking sequence to gene expression might be mRNA stability for some mutants in the 36N library, on average the effect of mutants on binding energy dominates.

Processing of PR and PL mutant libraries

Here, we describe the data processing pipeline for the PR and PL libraries, from the initial sequence reads to a dataset we use for fitting and evaluating our model. For each library, we obtained millions of reads (Figure 3—source data 1), which we paired (using the illumina-utils package; Eren et al., 2013), discarding reads with any mismatch. Each read contained a tag on both ends with information of the expression bin the sequence was sorted into, and we only included reads which have the same tag on both ends.

In each library, the remaining ~2.5 million reads contained more than 300,000 unique sequences, which we filtered based on length, coverage, and the position of the RBS (Shine-Dalgarno sequence that was the same for all sequences in the library as it was not mutated) (Figure 3—figure supplement 2A,B). We further required the sequences not to be too different from the ancestral sequence (PR and PL, respectively) (Figure 3—figure supplement 2C). This step removed the remaining PR sequence and a small cloud of sequencing errors around it from the PL library. PR wildtype sequences were present in the PL library because both libraries were made using the PR wildtype sequence as the starting point for cloning and ligation – a technique that is nearly but not exactly 100% efficient. Even though we used high-fidelity polymerase for the PCR, some sequencing errors remained (as evident from the ‘cloud’ of sequencing errors around the PR wildtype in the PL library), but their frequency was lower than 0.01%. Finally, we required the distribution of expression bins to be as unambiguous as possible (Figure 3—figure supplement 2D,E). This requirement was introduced because sort-seq experiments involve several passages of bacterial cultures after they have been sorted into their corresponding bins. This passaging can introduce a bias in the form of over- or under-representation of a given mutant in one bin compared to its abundance in the adjacent bins. Approximately 0.5% of all sequences had highly unequal distributions between adjacent bins, and were thus excluded.

We observed a discrepancy in the number of unique sequences in the two libraries (Figure 3—source data 1) that arose due to the higher mutation rate of the PR library. This is why we took a more conservative approach and additionally raised the coverage threshold to 30 for all our analyses.

Processing of the 36N mutant library

The original sequence library contained more than 10 million paired reads, almost all of which paired without mismatches (Figure 3—source data 1). Out of those, more than a million contained the reference plasmid sequence, which we extracted and treated separately. We required sufficient mapping of both flanking regions (higher than 0.75 similarity, using the local alignment function from the pairwise2 module in Biopython, with symmetric gap open and extension penalty of $(2L)^{-1}$ and a matching score of $L^{-1}$, where $L$ is the length of the mapping region), and that the length of the core region is within 2 bp of the canonical one (36 bp). Most of those reads were unique and most probably errors, so we originally required coverage of at least two (Figure 3—source data 1).

At this point we set out to estimate the abundance of spurious sequences (e.g. sequencing or PCR errors), by cross-mapping the abundant sequences (with the highest coverage) against all other sequences, using the same scores as defined above (Figure 3—figure supplement 2F). In every case the distribution of mapping scores was bimodal, with an overwhelming number of low scores and a small number of very high scores, suggesting that abundant sequences come with a cloud of errors, which are similar to the investigated sequence. Compiling a distribution of similarity scores for the 1000 most covered sequences showed a clear threshold of 0.7 between the low- and high-scored modes. We then accumulated all the sequences that appeared to be associated with a number of high-covered sequences and noted that the histogram of coverages of only those sequences closely matches the low-end part of the histogram of all sequences. Extrapolating from this, we estimated that requiring a coverage of at least 30 reads would lower the number of spurious sequences to only a few dozen (Figure 3—figure supplement 2G).

In contrast to the PR and PL libraries, where FACS was performed simultaneously into four bins, here the experiment involved sequential sorting of the same number of cells into one of the 12 bins (Figure 1—figure supplement 1F). Naturally, less abundant expression bins had to be sorted for a longer time, which introduced a systematic bias for higher expressing bins. To account for this bias toward higher-expressing sequences, we introduced during the cell sorting step to each bin a known number (1000) of cells carrying the reference PR plasmid. Note that this was not the same plasmid as the one used for cloning the 36N library, allowing us to easily distinguish between the reference plasmid and the wildtype plasmid that was not cloned successfully (i.e. was not carrying any mutations).

To debias, we divided each distribution by the distribution for the reference sequence (Figure 3—figure supplement 2H), and normalized. We then constructed a set of average distributions for all sequences that have the same mode. This allowed us to fit the background noise (intrinsic to FACS measurements) for all distributions. Next, we cleaned the sequence-specific distribution of potential outliers that may distort the estimate of the mean: for each mode-specific template distribution, we defined a filter that selected only bins in which the background is at most ⅓ of the value of the template. For each sequence-specific distribution, we selected a filter based on its mode, which nullified values in bins defined as outliers (Figure 3—figure supplement 2I,J). Afterward, we renormalized the distribution. This filtering is especially important for the higher-expressing sequences, where a single count in the lower bins would be greatly amplified by debiasing, drastically skewing the expression estimate (Figure 3—source data 1).

We used the debiased and filtered distribution over bins, $\alpha = [\alpha_0, \ldots, \alpha_{11}]$, to produce two estimates of gene expression. In the units of bin index, we estimate expression as $\varepsilon_{bin} = \sum_{i=0}^{11} \alpha_i\, i$, while in the units of luminosity measured in FACS as $\varepsilon_{facs} = \sum_{i=0}^{11} \alpha_i \log_{10} m_i$, where $m_i$ is the median value of measurements in the $i$-th bin. The two estimates are linearly related.
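As an illustration of the debiasing step and of the two expression estimates defined above, the following is a minimal Python sketch. The function names, the toy counts, and the toy bin medians are assumptions made for this example and are not taken from the original analysis code; the outlier filter is omitted, and the reference counts are assumed to be non-zero in every bin.

```python
import math

def debias(counts, reference_counts):
    """Divide per-bin counts by the reference-plasmid counts and normalize.

    Assumes every reference count is non-zero (hypothetical toy setting).
    """
    ratios = [c / r for c, r in zip(counts, reference_counts)]
    total = sum(ratios)
    return [x / total for x in ratios]

def expression_bin(alpha):
    """Mean expression in units of bin index: sum_i alpha_i * i."""
    return sum(a * i for i, a in enumerate(alpha))

def expression_facs(alpha, bin_medians):
    """Mean expression in log10 fluorescence: sum_i alpha_i * log10(m_i)."""
    return sum(a * math.log10(m) for a, m in zip(alpha, bin_medians))

# Toy example with 12 bins and a uniform reference distribution.
counts = [0, 0, 10, 40, 10, 0, 0, 0, 0, 0, 0, 0]
reference = [1000] * 12
alpha = debias(counts, reference)
# Bin medians equally spaced on a log scale (illustrative values only).
medians = [10 ** (1 + 0.25 * i) for i in range(12)]
print(expression_bin(alpha), expression_facs(alpha, medians))
```

When the bin medians are equally spaced on the log scale, as in this toy example, the second estimate is an exact linear function of the first, which is the sense in which the two estimates are linearly related.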

To validate our data processing, we randomly picked eight mutants from each of the 12 bins, sequenced them and measured in a plate reader. Out of 96 mutants, 79 were unique, and 2 did not exist in the 36N sort-seq library. We compared the measurements of the remaining 77 sequences obtained from a plate reader to the estimate of their expression obtained from the sort-seq experiment following the above-described debiasing and filtering (Figure 3—figure supplement 3). Plate reader measurements were conducted in the following manner: mutants were grown overnight in M9 minimal media supplemented with 0.2% (w/v) CAS, 0.2% (w/v) glucose, and 50 μg/mL kanamycin. The overnight populations were diluted 1000-fold, grown until OD600 of approximately 0.1, and their fluorescence measured in a Bio-Tek Synergy H1 plate reader. Three replicates of each mutant were measured.

Data splits and fitting procedure

We define 60:20:20 percentage splits of each of our libraries into three disjoint datasets (Figure 3—source data 2). The first (‘Training’ dataset) and second (‘Validation’ dataset) were used for training and model selection respectively, while the last (‘Evaluation’ dataset) was used exclusively for final evaluation and visualization of our models. Sequences were randomly divided into three splits. Note that each sequence in our library is unique, thus no sequences were repeated between datasets.
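A random 60:20:20 split of this kind could be sketched as follows. The function name, the fixed seed, and the rounding rule are assumptions for illustration; the text does not specify how the random assignment was implemented.

```python
import random

def split_library(sequences, fractions=(0.6, 0.2, 0.2), seed=0):
    """Randomly split unique sequences into training/validation/evaluation sets."""
    unique = sorted(set(sequences))   # each sequence appears in exactly one split
    rng = random.Random(seed)         # fixed seed only for reproducibility here
    rng.shuffle(unique)
    n_train = round(fractions[0] * len(unique))
    n_val = round(fractions[1] * len(unique))
    training = unique[:n_train]
    validation = unique[n_train:n_train + n_val]
    evaluation = unique[n_train + n_val:]
    return training, validation, evaluation

# Toy library of 100 distinct placeholder "sequences".
library = [f"seq{i:03d}" for i in range(100)]
training, validation, evaluation = split_library(library)
print(len(training), len(validation), len(evaluation))
```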

The central quantity of our modeling approach is the proportion of time RNAP spends bound to the DNA in the ‘on’ configuration, $P_{on}$, that is, in a configuration that can yield productive mRNA and thus lead to protein expression. Our quantification of expression, which we define as log fluorescence, will then be proportional to $\log_{10} P_{on}$ (Bintu et al., 2005; Kinney et al., 2010).

The three libraries were different in terms of their output. While for the PR and PL libraries we used only four bins and chose the median bin for each mutant (more conservative statistic), we obtained more plausible estimates of expression for the mutants in the 36N library (as explained above). To keep the procedure the same across all three libraries, for the 36N library we rounded off the estimate of the mean expression in bin units to the closest integer. This allowed us to use multinomial logistic regression for all three datasets and the associated log-likelihood as the objective function.

Concretely, given a set of parameters (binding matrix, spatial penalties, chemical potential, potential dinucleotide interactions), our model produces $\log_{10} P_{on}$ for each sequence, which we use to fit observed bin expression levels using multinomial logistic regression (from scikit-learn, with L-BFGS-B optimization; Byrd et al., 1995). When implementing multinomial logistic regression, we treat $\log_{10} P_{on}$ as the independent variable ($x = \log_{10} P_{on}$), so that the log-odds of a sequence falling into bin $i$ is a linear function of $x$: $\log\frac{\pi_i}{1-\pi_i} = a_i x + b_i$. This yields individual probabilities $\pi_i = \frac{1}{1+e^{-(a_i x + b_i)}}$. In the case of multi-label classification, the $\pi_i$ need to be further normalized to sum to 1, so that the probability associated with $x$ being measured in bin $i$ is $P_i = \frac{\pi_i}{\sum_j \pi_j}$. Additionally, we explicitly required a balanced fit by applying weights inversely proportional to the number of observations in each bin for each dataset. This is especially important for the 36N library, due to the highly disproportionate numbers of observations in the 12 bins.
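The per-bin probabilities and the balancing weights described above can be written out directly. The sketch below only mirrors the formulas in the text; it is not the scikit-learn implementation, and the function names and example coefficients are hypothetical.

```python
import math

def bin_probabilities(x, slopes, intercepts):
    """Per-bin probabilities for x = log10(P_on).

    Each bin i gets pi_i = 1 / (1 + exp(-(a_i * x + b_i))); the pi_i are then
    normalized so that the returned probabilities sum to 1.
    """
    pis = [1.0 / (1.0 + math.exp(-(a * x + b))) for a, b in zip(slopes, intercepts)]
    total = sum(pis)
    return [p / total for p in pis]

def balanced_weights(bin_labels):
    """Weight each observation inversely to the number of observations in its bin."""
    counts = {}
    for label in bin_labels:
        counts[label] = counts.get(label, 0) + 1
    return [1.0 / counts[label] for label in bin_labels]

# Illustrative coefficients for four bins (not fitted values).
probs = bin_probabilities(-1.5, slopes=[-2.0, -0.5, 0.5, 2.0], intercepts=[-3.0, 0.0, 0.5, 2.0])
print(probs, sum(probs))
```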

The main metric for our optimization and model performance is the likelihood of the logistic regression. In the interest of higher interpretability, we also report the $r^2$ value of a linear fit where the independent variable is $\log_{10} P_{on}$ and the dependent variable is the log fluorescence estimate, using the same weights as for the logistic regression. The fluorescence estimate is the median for the PR and PL libraries, as a robust measure in the absence of high bin resolution; and mean for the 36N library, as having 12 bins allowed us to accurately estimate the mean. The multinomial logistic regression does not necessarily yield a linear dependence between $\log_{10} P_{on}$ and observations. In fact, across the range of reasonable values, the log-likelihood depends on the chemical potential only weakly, yet it may change the correlation coefficient by several percent. For this reason, in Figure 3B, we show correlation coefficients ($r^2$) for evaluation datasets with chemical potentials re-optimized on the respective training datasets. Raw $r^2$ values are similar and are reported in Figure 2D.
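For a linear fit with one independent variable, the weighted $r^2$ equals the squared weighted Pearson correlation. A minimal sketch of that calculation is given below; the function name and the toy data are assumptions, and the original analysis may have computed this quantity differently.

```python
def weighted_r2(x, y, weights):
    """Squared weighted Pearson correlation between x and y."""
    total = sum(weights)
    mean_x = sum(w * xi for w, xi in zip(weights, x)) / total
    mean_y = sum(w * yi for w, yi in zip(weights, y)) / total
    cov = sum(w * (xi - mean_x) * (yi - mean_y) for w, xi, yi in zip(weights, x, y))
    var_x = sum(w * (xi - mean_x) ** 2 for w, xi in zip(weights, x))
    var_y = sum(w * (yi - mean_y) ** 2 for w, yi in zip(weights, y))
    return cov * cov / (var_x * var_y)

# x plays the role of log10(P_on), y of the log fluorescence estimate (toy values).
x = [-3.0, -2.0, -1.0, -0.5]
y = [1.1, 1.9, 3.2, 3.4]
print(weighted_r2(x, y, weights=[1.0, 1.0, 0.5, 0.5]))
```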

We start from the Standard model, and progress toward the Extended model sequentially, including one structural feature at a time. We fit the model parameters using only the training dataset of the PR library, and only search in the vicinity of the previous best fit. We assess such a model on the PR evaluation and all three datasets of the PL library without any adjustments, since the two libraries were obtained following the same experimental protocol in the same cells. In that sense, the whole PL library played the role of an evaluation set. To evaluate this model on the 36N library, which was obtained using a different strain and through different FACS thresholds, we used the 36N training set to refit the chemical potential and the hyperparameters of the logistic regression. Therefore, in the context of evaluating a model fitted on the PR library alone, both Validation and Evaluation 36N datasets can be considered true validation datasets.

Ultimately, we wanted to pool the training datasets of all three libraries to fit a unique model to give us the best set of parameters. In this case, all Validation and Evaluation datasets were data-naive. By doing this, we increased the inference power of our models, allowing us to fit pairwise (dinucleotide) interactions between nucleotides and the rate of RNAP clearance from the initial binding site.

Standard thermodynamic model

In thermodynamic models of gene expression, the amount of protein is directly proportional to the fraction of time RNAP is bound to the promoter sequence. In the simplest case, where RNAP can be either bound (‘on’) or not bound (‘off’-state), the probability of the on-state is given by the formula:

$$P_{on} = \frac{1}{1 + e^{(E-\mu)/k_BT}} \tag{1}$$

where $E$ is the binding energy, and $\mu$ the chemical potential of RNAP in the cytoplasm. All our experiments were performed at room temperature, thus $k_BT = 0.59$ kcal/mol. For simplicity, we define energy and chemical potential in those units, and drop $k_BT$ from subsequent formulas.

We further expressed binding energy as an independent sum of local interactions between RNAP and individual base pairs in the promoter region, with each of the four possible bases (A,C,G,T) contributing differently depending on the position in the promoter. Such a $4 \times l$ matrix is referred to as the binding energy matrix. The values of the energy matrix are determined up to an arbitrary offset per position. We set the zero value as the binding energy of the wildtype PR sequence.

In the case when the promoter sequence (of length $L$) is longer than the binding matrix (of length $l$), the number of states is $N = L - l + 1$, and the on-probability takes the form

$$P_{on} = \frac{\sum_{i=1}^{N} e^{-(E_i-\mu)}}{1 + \sum_{i=1}^{N} e^{-(E_i-\mu)}} \tag{2}$$

The Standard model assumes that only one element (i.e. binding site) in the sum dominates, and we follow other studies (Bintu et al., 2005; Kinney et al., 2010) when formulating our standard thermodynamic model as

$$P_{on} = \frac{1}{1 + e^{\min_i E_i - \mu}} \tag{3}$$
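Equations 1–3 can be illustrated with a short Python sketch that slides an energy matrix along a sequence and evaluates both the full sum (Equation 2) and the dominant-site approximation (Equation 3). The two-position toy matrix, the sequence, and all function names are hypothetical and chosen only to keep the example small; energies and chemical potential are in units of $k_BT$.

```python
import math

def site_energies(sequence, energy_matrix):
    """Binding energy of every window of the sequence.

    energy_matrix is a list with one dict per matrix position, mapping each
    base to its energy contribution; there are L - l + 1 windows.
    """
    l = len(energy_matrix)
    return [
        sum(energy_matrix[j][sequence[i + j]] for j in range(l))
        for i in range(len(sequence) - l + 1)
    ]

def p_on_sum(energies, mu):
    """Equation 2: thermodynamic sum over all binding positions."""
    z = sum(math.exp(-(e - mu)) for e in energies)
    return z / (1.0 + z)

def p_on_standard(energies, mu):
    """Equation 3: only the strongest (lowest-energy) site is considered."""
    return 1.0 / (1.0 + math.exp(min(energies) - mu))

# Toy matrix of length 2 (illustrative values, not fitted parameters).
toy_matrix = [
    {"A": 0.0, "C": 1.0, "G": 1.0, "T": 1.0},
    {"A": 1.0, "C": 1.0, "G": 1.0, "T": 0.0},
]
energies = site_energies("CATG", toy_matrix)
print(energies, p_on_standard(energies, mu=0.0), p_on_sum(energies, mu=0.0))
```

With a single binding position the two expressions coincide, and with several positions the full sum is never smaller than the dominant-site value.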

Toward the Extended model: spacer flexibility

Based on the common observation that RNAP has a flexible spacer length, we extended the Standard model by allowing for spacer lengths that differ from the canonical by up to 2 bp. This effectively increased the number of possible configurations by a factor of 5. The key difference that we introduced was to assign an energy penalty for each non-canonical configuration. Increasing the spacer flexibility beyond ±2 bp did not yield further benefits to predictability. Note that it is in principle possible that the σ70-RNAP energy matrix depends on spacer length, so that a different energy matrix would need to be inferred for each spacer length. Exploring this possibility was beyond the scope of this project, as the experimental datasets would be too small for reliable inference if conditioned on individual spacer lengths.
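One way to picture spacer flexibility is to enumerate every placement of two sub-matrices separated by a spacer that deviates from the canonical length by up to 2 bp, adding a penalty for each non-canonical length. The sketch below assumes, purely for illustration, that the energy matrix splits into two blocks (called `matrix_35` and `matrix_10` here) with no sequence-dependent contribution from the spacer; the layout, names, and penalty values are hypothetical and may differ from the fitted model.

```python
def spacer_configurations(sequence, matrix_35, matrix_10, canonical_spacer, penalties):
    """Return (start, spacer_offset, energy) for every allowed configuration.

    penalties maps a spacer offset (e.g. -2..2) to its energy penalty; the
    canonical spacer (offset 0) would normally carry a penalty of 0.
    """
    configs = []
    len_35, len_10 = len(matrix_35), len(matrix_10)
    for offset, penalty in penalties.items():
        spacer = canonical_spacer + offset
        footprint = len_35 + spacer + len_10
        for start in range(len(sequence) - footprint + 1):
            e35 = sum(matrix_35[j][sequence[start + j]] for j in range(len_35))
            start_10 = start + len_35 + spacer
            e10 = sum(matrix_10[j][sequence[start_10 + j]] for j in range(len_10))
            configs.append((start, offset, e35 + e10 + penalty))
    return configs

# Toy blocks with zero energies, so each energy equals the spacer penalty.
zero = {base: 0.0 for base in "ACGT"}
toy_penalties = {-2: 2.0, -1: 1.0, 0: 0.0, 1: 1.0, 2: 2.0}
configs = spacer_configurations("A" * 10, [zero] * 2, [zero] * 2, 3, toy_penalties)
print(len(configs), min(energy for _, _, energy in configs))
```

The resulting list of energies can be fed into the thermodynamic sum in the same way as the energies of a single rigid matrix.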

Toward the Extended model: cumulative binding

In order to account for multiple RNAP binding configurations (different positions along the promoter and different spacer lengths) that can lead to a productive transcript, we saw the necessity of performing the thermodynamic sum in Equation 2 instead of just extracting the dominant binding as was done for the Standard model. To do this, we assume that σ70-RNAP concentration is not high enough for two molecules to be simultaneously present at the promoter before they initiate transcription. Fully embracing the thermodynamic description at this point provided us with a natural language for all further extensions.

Toward the Extended model: occlusive unproductive binding

We accommodated the occlusive unproductive binding states naturally in the thermodynamic description:

$$P_{on} = \frac{\sum_{i \in \{on\}} e^{-(E_i-\mu)}}{1 + \sum_{i \in \{np\}} e^{-(E_i-\mu)} + \sum_{i \in \{on\}} e^{-(E_i-\mu)}} \tag{4}$$

The main challenge here was to find the exact position after which the produced transcripts would not contain a full RBS (i.e. would be non-functional). To answer this question, we checked computationally whether there was a natural region where this would be the case. First, we aligned all sequences with respect to the position of the RBS (Shine-Dalgarno sequence ‘AGGAG’). Then, using the parameters from the previous model iteration (with cumulative binding), we assessed performance with different positions separating the productive and unproductive binding. This identified an 8-bp-long region, where the separating position was to be expected (Figure 2B, Figure 2—figure supplement 2B). To determine the location more precisely, we set up a separate experiment (see section Verifying promoter structural features).
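The effect of unproductive ('np') states in Equation 4 can be seen in a small numerical sketch: they enter only the denominator, so they compete with productive binding and lower $P_{on}$. The function name and the example energies are assumptions for illustration; energies are in units of $k_BT$.

```python
import math

def p_on_occlusive(on_energies, np_energies, mu):
    """Equation 4: productive states over all states (unbound, np, on)."""
    z_on = sum(math.exp(-(e - mu)) for e in on_energies)
    z_np = sum(math.exp(-(e - mu)) for e in np_energies)
    return z_on / (1.0 + z_np + z_on)

# One productive site alone, then the same site with a strong occlusive site.
print(p_on_occlusive([0.0], [], mu=0.0))
print(p_on_occlusive([0.0], [-2.0], mu=0.0))
```

Without any unproductive states the expression reduces to the thermodynamic sum over productive positions only.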

Toward the Extended model: occlusive binding on the reverse complement

When bound to the reverse complement, we considered RNAP to effectively act as its own repressor. We accommodated this in the model by expanding the set of ‘np’ states in Equation 4 to all reverse complement configurations. Including this effect led to a slight but significant increase in performance even without refitting, after which we also re-optimized the parameters locally.

Toward the Extended model: dinucleotide interactions

The previous components of the model were fit exclusively on the PR dataset. In our search for dinucleotide interactions and when exploring departures from thermodynamic equilibrium, we pool the datasets in search for the best fit, as only then could we get a significant estimate of the interactions.

A common assumption of thermodynamic modeling of gene expression is that the binding energy of any particular transcription factor-promoter complex can be expressed as an independent sum of interactions between individual base pairs and the transcription factor (RNAP in our case) residues. A naive inclusion of all possible dinucleotide interactions between the contact residues of RNAP and the promoter would inflate the number of parameters to thousands, rendering their simultaneous estimation hopeless, even before considering overfitting issues.

To overcome the pitfalls of overfitting and to estimate the potential importance of dinucleotide interactions, we first included each interaction independently, and required that the best fit increased log-likelihood by at least 2-fold. This reduced the number of interactions from 4416 to 892. Then, we drew 10 random subsets of 20 interactions, and jointly optimized the interactions in the vicinity of the previously obtained values, over 10 cross-validation splits (50:50) of the training and validation datasets and using L1 regularization. This way, we obtained estimates (and associated error) for the interaction value in 10 different ‘backgrounds’. For each interaction, we considered valid only those estimates that were larger in magnitude than a defined threshold, and combined them into a single estimate (mean and standard deviation), allowing for a small amount of leniency due to rather strict L1 regularization. We chose 0.002kBT as the threshold for acceptance and made sure the results were robust for thresholds of 0.001 and 0.005. We then filtered for only those interactions that were non-compatible with zero at confidence level of 2σ. This brought down the number of interactions to 250. In the next step, we again jointly inferred subsets of interactions: this time 12 subsets of 50 interactions. For each interaction, we then estimated the single mean and standard deviation from the 12 values, and conservatively selected only those that were inconsistent with zero at more than 3σ, which brought down their number to 77. Finally, we sequentially include dinucleotide interactions starting with the most and moving toward the least significant, requiring that each contributed to at least 3σ-level in the background of all accepted up to that point. This left 31 interactions (Figure 2D).

Toward the Extended model: clearance rate

Finally, we brought our framework closer to the real biology by noting that RNAP is not like any other DNA binding protein, insofar as it is in fact not in thermodynamic equilibrium: to make the transcript, it needs to leave its original binding site, that is, if it binds too strongly it cannot make the transcript. We model this by introducing a parameter $R$, the rate with which RNAP is cleared away from the binding site relative to the rate with which it is bound, in the limit of very strong binding. It effectively sets an upper limit on the amount of time RNAP persists on the binding site, and should make a difference only for strong binding sites.

Consider a system where RNAP can exist in only three states: (i) bound at a productive position, (ii) bound at an unproductive position, and (iii) unbound (‘off’); and consider the transition rates among them (Scheme 1):

Scheme 1. Modelling RNA polymerase binding.


The transition rate from unbound to any of the bound states, $k^+$, is the same and depends only on the concentration of free RNAP in the vicinity of the sites. The reverse transitions depend on the sequence (σ) through the energy of binding, $k^-_{on(b)} \propto e^{E_{on(b)}}$. We modeled RNAP clearance as an independent rate $r$, which depletes the bound states on and b, and eventually contributes to repopulation of RNAP in the cell. We do not model time dependence, and are interested only in the stationary state, thus we do not need to model the time delay between leaving the bound state and reappearing in the cell.

In stationarity we have:

$$(r + k^-_{on})\,P_{on} = c\,k^+ P_{off} = (r + k^-_{b})\,P_{b} \tag{5}$$

Introducing a relative clearance rate $R = \frac{r}{c k^+}$, casting the known rates and RNAP concentrations in terms of binding energy, $\frac{k^-}{c k^+} = e^{E-\mu}$, and considering that $P_{on} + P_b + P_{off} = 1$, we obtained:

$$P_{on} = \frac{(R + e^{E_{on}-\mu})^{-1}}{1 + (R + e^{E_{on}-\mu})^{-1} + (R + e^{E_b-\mu})^{-1}} \tag{6}$$

This is the same formula as Equation 4, with $e^{-(E_i-\mu)} \to (R + e^{E_i-\mu})^{-1}$. Considering the clearance rate effectively introduced a cutoff on the binding energy, $E_{min} = \mu + \log R$, below which the energy of binding is not important anymore. Note that in the case of $R \to 0$, the formula reduces to Equation 4. In the case of a single (on) binding site, $R$ is a completely degenerate degree of freedom, which can be manifestly removed by a transformation $\mu \to \mu - \log(1+R)$.
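The limiting behaviour described above can be checked numerically. The sketch below evaluates the clearance-rate formula with each Boltzmann weight replaced by $(R + e^{E_i-\mu})^{-1}$; the function name and the numerical values are assumptions for illustration, with energies in units of $k_BT$. For a single productive site, the substitution $\mu \to \mu - \log(1+R)$ reproduces the $R = 0$ result up to a constant factor $(1+R)^{-1}$, which only shifts $\log_{10} P_{on}$ by a constant.

```python
import math

def p_on_clearance(on_energies, np_energies, mu, R):
    """P_on with relative clearance rate R; each state has weight 1/(R + exp(E - mu))."""
    w_on = sum(1.0 / (R + math.exp(e - mu)) for e in on_energies)
    w_np = sum(1.0 / (R + math.exp(e - mu)) for e in np_energies)
    return w_on / (1.0 + w_np + w_on)

E, mu, R = -1.0, 0.5, 0.31

# R = 0 recovers the equilibrium expression for one 'on' and one 'np' state.
print(p_on_clearance([0.0], [0.0], 0.0, 0.0))

# Single productive site: shifting mu absorbs R up to a constant prefactor.
shifted = p_on_clearance([E], [], mu - math.log(1.0 + R), R) * (1.0 + R)
equilibrium = 1.0 / (1.0 + math.exp(E - mu))
print(shifted, equilibrium)

# Very strong sites saturate: the weight cannot exceed 1/R.
print(1.0 / (R + math.exp(-50.0 - mu)), 1.0 / R)
```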

The scan over the values of R (only on the training dataset) is shown in Figure 2—figure supplement 2B. For each value, we refit the chemical potentials. We obtained the best fit value for the relative clearance rate of R* = 0.31, and a large uncertainty region, yet the model quite robustly preferred a non-zero clearance rate. Here, we consider the clearance rate to be a property of the RNAP molecule itself, and independent of sequence. Including the sequence dependence for the clearance rate can be done, however disentangling the exact nature of such dependence would require a dedicated set of experiments that is beyond the scope of this work.

Verifying promoter structural features

We incorporated several structural features of bacterial promoters into the thermodynamic modeling framework (Figure 1B). When estimated on all three mutant libraries, each of these features independently increased model predictability, providing evidence of its biological role. Nevertheless, we conducted additional, hypothesis-driven experiments to verify the role of these structural features. For all mutant measurements performed for these validation experiments (outlined below), we measured expression levels of three biological replicates of each mutant in a plate reader. The mutants, the PR wildtype, and the negative control (the strain carrying the plasmid without a fluorescence marker in order to define ‘measurable expression’ – i.e. background fluorescence) were grown overnight in M9 minimal media supplemented with 0.2% (w/v) CAS, 0.2% (w/v) glucose, and 50 μg/mL kanamycin. The overnight populations were diluted 1000-fold, grown to OD600 of ~0.1, and their fluorescence measured in a Bio-Tek Synergy H1 plate reader. All measurements were masked from the experimentalist performing them during the data allocation step. All the mutants were generated on the wildtype PR plasmid backbone, using oligonucleotide cloning. We did not conduct tests for the variable spacer, because changes in spacer length in σ70-RNAP binding are well documented, so we assigned them their accurate biophysical meaning in the form of a binding energy penalty to altering spacer length away from the lowest energy spacer (9 bp, corresponding to 17 bp between –10 and –35 sites). We did not independently verify the clearance rate, as we use a simplified description of clearance rate that attempts to capture its effects in a single parameter, rather than explicitly accounting for the mechanisms of σ70-RNAP clearance.

Cumulative binding

To test for the role of multiple independent σ70-RNAP binding on gene expression levels, we created 29 specific promoters, which were derived from the PR wildtype promoter. These promoters were selected so that the prediction of their gene expression levels would be different between two models: (i) the Standard model, which did not include cumulative binding, predicted no measurable expression for these promoters; (ii) a model that included only cumulative binding as a structural promoter feature, and which predicted measurable expression. We measured gene expression levels of these promoters in a plate reader as described above, compared the expression to the negative control through FDR-corrected t-tests, and found that the model with cumulative binding described experimental observations systematically better (Figure 2A).
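The text does not state which FDR procedure was applied to the t-test p-values. As one common possibility, and purely as an assumption, the sketch below implements the Benjamini-Hochberg adjustment for a list of p-values that would come from comparing each promoter to the negative control; the function name and example p-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted

# Hypothetical p-values from four promoter-versus-control t-tests.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
significant = [q < 0.05 for q in adjusted]
print(adjusted, significant)
```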

We introduced additional mutations into a subset of these promoters in order to remove the additional σ70-RNAP binding sites. To do this, we used the Extended model to determine which mutations in the 29 promoters would reduce binding to any but the strongest binding site. Finding mutations that removed the secondary binding site(s), but that do not affect the binding to the strongest σ70-RNAP binding site, was possible for only 7 of the 29 promoters. Removal of the additional binding site(s) in this manner led to a reduction of gene expression levels, and six of the seven mutants exhibited no measurable expression – that is, what was predicted by the Standard model which accounted for only a single binding site (Figure 2A).

Occlusive unproductive binding

We performed several different tests to validate various aspects of this structural feature. First, we created 20 promoters to compare predictions from two models: (i) model that allowed for cumulative binding, but all binding had a positive impact on expression; (ii) model that allowed for cumulative binding, but considered every binding that would not transcribe a complete RBS as having a negative effect on expression. Model (i) predicted all 20 sequences to have measurable expression, while Model (ii), which accounted for occlusive unproductive binding, predicted no measurable expression for any of the sequences. By comparing the measured expression of each promoter to the negative control through FDR-corrected t-tests, we found that only 2/20 mutants exhibited measurable expression, in better agreement with the model that accounted for occlusive unproductive binding (Figure 2—figure supplement 1B top left panel).

Second, we wanted to test if our model accurately predicted how expression levels change when an additional occlusive binding site is introduced. To this end, we started with a wildtype PR, and used the Extended model to generate a series of single point mutations that were predicted to gradually introduce a new binding site. σ70-RNAP binding to the newly introduced sites would result in transcripts without a complete RBS. By constructing these mutants in the lab, we found that introduction of such a site into the PR promoter indeed led to a significant reduction in expression levels (Figure 2B top right panel). Note that the strongest σ70-RNAP binding in the wildtype PR promoter is 28 bp upstream of the RBS.

We also did the reverse – starting with three promoters from the PR mutant library that had a strong occlusive unproductive binding site, we used the Extended model to predict a series of single point mutants that would gradually remove this binding site. Experimental measurements of those mutants, when compared to the expression level of the starting promoter sequence through FDR-corrected t-tests, exhibited a significant increase in the expression levels as the binding site was removed – an increase that was consistent with that site acting in an occlusive manner, and not a cumulative one (Figure 2B bottom left panel).

Finally, we wanted to determine the exact distance between the –10 element and the RBS that turned an additional binding site from being productive (cumulative) to being occlusive. To do this, we started with the same three promoters from the PR mutant library that had a strong occlusive unproductive binding site as above, but this time we shifted the occlusive position of that binding site relative to the RBS. We shifted the binding site up to two positions closer to the RBS and up to seven positions further away from it, moving the binding site one position at a time – creating nine mutants of each of the three starting promoters. We used FDR-corrected t-tests to compare the measured expression of each mutant to the original sequence they were mutated from. We found support for the hypothesis that, as the occlusive binding site was moved further away from the RBS, it became cumulative (Figure 2B bottom right panel). This position matched the area determined with the model (see section Toward the Extended model: occlusive unproductive binding).

Occlusive binding on the reverse complement

In order to conduct an independent verification of this structural feature of promoters, we identified four promoters from the 36N mutant library that the Extended model predicted to have strong occlusive binding on the reverse complement. We introduced up to eight mutations into each of these four promoters that would progressively eliminate the occlusive site on the reverse complement. The introduced mutations had minimal predicted effect on binding on the productive strand (Figure 2C, Figure 2—figure supplement 1). We measured gene expression levels of these mutants as described above and performed a linear regression in order to correlate the measurements with the Extended model predictions of expression. For two sets of mutants, we found that the Extended model, which accounted for occlusive binding on the reverse complement, accurately predicted gene expression levels. For the other two mutant sets, removing the predicted binding site on the reverse complement had no measurable effect on expression levels (Figure 2C, Figure 2—figure supplement 1). These data show that occlusive binding on the reverse complement is a more complex promoter feature than what we accounted for in the Extended model. Nevertheless, including this promoter structural feature into the model led to a significant increase in predictability (especially of the 36N dataset), which justified its inclusion into the Extended model.

Verifying model predictions

Arguably the most surprising prediction arising from the Extended model is the ease of generating promoters from random non-expressing sequences. Specifically, we wanted to verify that single point mutations on random non-expressing 115-bp-long sequences could generate dramatic shifts in expression levels. To do this, we experimentally created 20 pairs of promoters, each pair consisting of (i) a random non-expressing sequence; and (ii) the same sequence but with one point mutation that is predicted to lead to expression. We created these promoters on the PR plasmid background and measured gene expression levels as described in the section Verifying promoter structural features. By conducting a series of FDR-corrected, paired t-tests we found that, indeed, single point mutations improved gene expression levels for all but two of the 20 promoters (Figure 2—figure supplement 2A). Of the 20 original promoters, which were predicted not to have any expression, only two exhibited expression (Figure 2—figure supplement 2A), confirming the accuracy of the Extended model on sequences of 115 bp length.
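The multiple-testing step can be illustrated with a minimal sketch. The text states only that the paired t-tests were FDR-corrected; the Benjamini-Hochberg procedure used below is an assumption, and the p-values are hypothetical placeholders rather than measured results. The paired t-tests themselves are not reproduced here.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order.

    Assumption: the source says 'FDR-corrected' without naming a procedure;
    Benjamini-Hochberg is used here purely for illustration.
    """
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values, one per promoter pair (not data from the study).
p = [0.001, 0.008, 0.039, 0.041, 0.6]
q = benjamini_hochberg(p)
significant = [qi < 0.05 for qi in q]
```

A pair would then be called significantly different when its adjusted p-value falls below the chosen FDR level (0.05 in this sketch).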

Testing model generalizability with published datasets

In order to validate the generalizability of the Extended model, we tested how well it predicts gene expression levels obtained from RNA-seq experiments. To do this, we selected three published, large-scale, promoter mutant libraries whose expression was measured in E. coli (Johns et al., 2018; Urtecho et al., 2019; Hossain et al., 2020). These libraries had different properties in terms of how they generated promoter mutants and variants (as discussed in the main text).

Each library consisted of promoter sequences matched to their measured gene expression levels. For each library, we refitted only a single parameter in the Extended model: the chemical potential. All other parameters were fitted only on our three libraries (PR, PL, and 36N). We calculated Pon as a proxy for the gene expression level of each promoter in the three tested libraries, and correlated our model predictions with the experimental measurements to obtain the squared correlation coefficient (r²). For the purposes of comparison, we did the same for the Standard model. This approach means that the predictive power of the Extended model would very likely increase if its parameters were refitted for each individual dataset. Furthermore, we did not include any genetic context around each promoter. In other words, we evaluated the Standard and the Extended models based only on the actual promoter sequence.
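The single-parameter refit can be sketched as follows. This is a toy illustration under stated assumptions, not the fitting code used in the study: the form Pon = 1/(1 + exp(F − μ)) with the free energy F in units of kT, the grid search, the use of log10(Pon), and all numerical values are assumptions made here for the example.

```python
import math


def p_on(free_energy, mu):
    """Assumed occupancy form: Pon = 1 / (1 + exp(F - mu)), F in units of kT."""
    return 1.0 / (1.0 + math.exp(free_energy - mu))


def r_squared(x, y):
    """Squared Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


def refit_mu(free_energies, measured, mu_grid):
    """Pick the chemical potential on the grid that maximizes r^2.

    All other model parameters (already folded into free_energies) stay fixed.
    """
    def score(mu):
        predicted = [math.log10(p_on(f, mu)) for f in free_energies]
        return r_squared(predicted, measured)

    best = max(mu_grid, key=score)
    return best, score(best)


# Hypothetical data: free energies per promoter and synthetic 'measurements'
# generated with mu = 3.0, so the refit should recover that value.
free_energies = [float(i) for i in range(9)]
measured = [math.log10(p_on(f, 3.0)) for f in free_energies]
mu_grid = [i * 0.5 for i in range(13)]
best_mu, best_r2 = refit_mu(free_energies, measured, mu_grid)
```

Because only one parameter is adjusted per dataset, the reported r² values reflect how well the remaining, fixed parameters transfer between libraries.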

Evolution simulations

We chose 100 random starting sequences of length 115 bp whose predicted expression was below the measurable level under both the Extended and the Standard model. We implemented a Gillespie-type simulation under the assumptions of Strong Selection and Weak Mutation, that is, a regime in which we can assume no clonal interference. We defined the time scale in units of the inverse mutation rate 1/μ, so that in each iteration of the algorithm we simulated a single mutation that appears and becomes fixed in the population with the probability given by the Kimura formula

$$P_{\mathrm{fix}} = \frac{1 - e^{-2 s\,\delta\phi}}{1 - e^{-4 N s\,\delta\phi}}$$

where N is the population size, δϕ is the change in the phenotypic value (gene expression level) due to the mutation, and s is the selection strength on the phenotype. Note that phenotype and selection are degenerate parameters: only their product represents the actual selection coefficient. We postulate ϕ = log10(Pon), so the typical mutational effects are in the range of 10⁻³ to 10⁻². Under this regime, a selection strength of 100 means a selection coefficient in the range 0.1–1.
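As a worked illustration of the Kimura formula, the following sketch evaluates the fixation probability for a given phenotypic change. The function name, the numerical guard, and the example values of N are assumptions introduced here; the population size used in the study is not stated in this section.

```python
import math


def fixation_probability(delta_phi, s, N):
    """Kimura fixation probability (1 - e^{-2 s dphi}) / (1 - e^{-4 N s dphi}).

    delta_phi: change in phenotype phi = log10(Pon) caused by the mutation
    s:         selection strength on the phenotype
    N:         population size (the value used below is hypothetical)
    """
    x = s * delta_phi  # effective selection coefficient
    if x == 0.0:
        return 1.0 / (2.0 * N)  # neutral limit of the formula
    if -4.0 * N * x > 700.0:
        return 0.0  # strongly deleterious: ratio is vanishingly small, avoid overflow
    # expm1 keeps precision when the exponents are close to zero.
    return math.expm1(-2.0 * x) / math.expm1(-4.0 * N * x)


# Selection strength 100 and a mutational effect of 1e-2 give s * dphi = 1.
p_beneficial = fixation_probability(0.01, 100.0, 1000)
p_neutral = fixation_probability(0.0, 100.0, 1000)
p_deleterious = fixation_probability(-0.01, 100.0, 1000)
```

In the strong-selection regime described above, beneficial mutations of typical size fix with appreciable probability, whereas deleterious mutations of the same size essentially never fix.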

We went through a sufficient number of iterations (new mutation events) until a threshold expression was reached by the population. We considered two thresholds: measurable expression we identified in our experiments to model the emergence of novel (weak) constitutive promoters; and the expression of the wildtype Lambda PR promoter, as an example of a very strong promoter with high constitutive expression levels. For each of the 100 starting sequences, we performed 100 independent evolution runs, to obtain an estimate of the mean exit time (and the associated standard error of the mean) under the Standard and Extended models.

We also varied the number of nucleotides in the 115-bp-long sequence that were allowed to mutate. The mutagenized region was always in the center of the sequence, over which we always integrated to obtain expression. In simulations with a more constrained region that was allowed to mutate, the simulation would take too long until full convergence. To be robust against these cases, we stopped each simulation after 10N steps, truncating the distributions of evolution times for those sequences. Importantly, doing this introduced a bias that affected the Standard model more so than the Extended (because the Standard model took longer to reach the threshold), leading to an underestimate of how much faster the Extended model is compared to the Standard one. As such, what we report as the increase in the rates of evolution of the Extended model compared to the Standard one is likely the lower bound.
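The simulation loop, including the truncation at a maximum number of steps, can be sketched as below. This is a simplified discrete-step illustration and not the deposited simulation code: the toy phenotype (a count of matches to a hypothetical six-base motif) stands in for the Extended or Standard model prediction of log10(Pon), and the sequence length, population size, and step limit are hypothetical.

```python
import math
import random
import statistics


def fixation_probability(delta_phi, s, N):
    """Kimura fixation probability, with guards for the neutral and overflow cases."""
    x = s * delta_phi
    if x == 0.0:
        return 1.0 / (2.0 * N)
    if -4.0 * N * x > 700.0:
        return 0.0
    return math.expm1(-2.0 * x) / math.expm1(-4.0 * N * x)


TARGET = "TATAAT"  # hypothetical motif, used only by the toy phenotype


def toy_phenotype(seq):
    """Toy stand-in for phi = log10(Pon): 0.01 per base matching TARGET."""
    return 0.01 * sum(a == b for a, b in zip(seq, TARGET))


def evolve(start, threshold, s, N, max_steps, rng):
    """Run one trajectory; return (steps_taken, truncated)."""
    seq = list(start)
    phi = toy_phenotype(seq)
    for step in range(1, max_steps + 1):
        # One iteration = one new mutation (time measured in units of 1/mu).
        pos = rng.randrange(len(seq))
        old = seq[pos]
        seq[pos] = rng.choice([b for b in "ACGT" if b != old])
        new_phi = toy_phenotype(seq)
        if rng.random() < fixation_probability(new_phi - phi, s, N):
            phi = new_phi      # mutation fixed
        else:
            seq[pos] = old     # mutation lost
        if phi >= threshold:
            return step, False
    return max_steps, True     # stopped early: exit time is truncated


rng = random.Random(0)
threshold = toy_phenotype(TARGET)
runs = [evolve("GCGCGC", threshold, 100.0, 1000, 10_000, rng) for _ in range(100)]
exit_times = [steps for steps, _ in runs]
mean_exit = statistics.mean(exit_times)
sem_exit = statistics.stdev(exit_times) / math.sqrt(len(exit_times))
```

Runs flagged as truncated contribute the step limit rather than their true exit time, which is why truncation biases the mean exit time downward for whichever model is slower to reach the threshold.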

Insights from applying the model to the E. coli genome

For all analyses described in this section, we used the E. coli K12 MG1655 genome (NCBI reference: NC_000913). Based on the genome annotations, we assigned one of two identities for each of the nucleotide positions in that genome: (i) ‘within genes’ (intragenic) – if the position was a part of the following annotation types: CDS or gene; (ii) inter-genic – if none of the following annotation types were present at the position: misc_feature, mobile_element, repeat_region, tRNA, Sequence Tag Sites (STSs), tmRNA, rRNA, CDS/gene, ncRNA. ‘Within genes’ region accounted for 89.6% of the genome with 51.8% GC content, while ‘inter-genic’ regions formed 9.6% of the genome with the GC content of 41.1%. Because the inter-genic category defined in this way does not guarantee that the sequence actually has a promoter function, we also considered only those sequences that have been experimentally identified as promoters, as found on RegulonDB.
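The position labelling described above can be sketched as follows. The feature tuples, coordinate convention (0-based, end-exclusive), the feature key "STS", and the toy sequence are assumptions for illustration; a real analysis would read the features from the NC_000913 annotation.

```python
GENIC_TYPES = {"CDS", "gene"}
# Annotation types whose presence prevents a position from being inter-genic.
EXCLUDED_TYPES = {
    "misc_feature", "mobile_element", "repeat_region", "tRNA", "STS",
    "tmRNA", "rRNA", "CDS", "gene", "ncRNA",
}


def label_positions(genome_length, features):
    """Label every position as 'within_genes', 'inter_genic', or 'other'.

    features: iterable of (start, end, feature_type), 0-based, end-exclusive
    (a hypothetical layout chosen for this sketch).
    """
    types_at = [set() for _ in range(genome_length)]
    for start, end, feature_type in features:
        for pos in range(start, end):
            types_at[pos].add(feature_type)
    labels = []
    for types in types_at:
        if types & GENIC_TYPES:
            labels.append("within_genes")
        elif not (types & EXCLUDED_TYPES):
            labels.append("inter_genic")
        else:
            labels.append("other")  # e.g. tRNA or rRNA positions
    return labels


def gc_content(sequence, labels, wanted):
    """Fraction of G or C among positions carrying the wanted label."""
    bases = [b for b, lab in zip(sequence, labels) if lab == wanted]
    return sum(b in "GC" for b in bases) / len(bases)


# Toy 20-bp 'genome' with one CDS and one tRNA (not real annotation).
genome = "ATGCGCGCTTAATTTTGGCC"
labels = label_positions(len(genome), [(0, 8, "CDS"), (12, 15, "tRNA")])
gc_genic = gc_content(genome, labels, "within_genes")
gc_intergenic = gc_content(genome, labels, "inter_genic")
```

Positions that fall into neither category (labelled 'other' here) explain why the two reported genome fractions do not sum to 100%.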

We calculated the binding energy for each of the 5 × G configurations (five spacer lengths; G is the number of positions, i.e. base pairs, in the genome), and then performed a thermodynamic sum over all spacer lengths for each position, aligned so that each configuration coincided with the last base pair of the –10 element of the binding matrix. In this way we obtained a free binding energy for each nucleotide (for each of the G positions) in the genome. We offset all the obtained free binding energy values so as to set the minimum to zero, for easier readability. Perhaps surprisingly, the nearest-neighbor correlation of the free energy values is negligible, so we did not subsample but kept all values.

To assess whether there was selection against σ70-RNAP binding sites, we constructed a synthetic genome of 100 million base pairs with the same GC content as the corresponding region (inter-genic or within-genes) of the E. coli genome. We computed cumulative histograms of the free energies and normalized them to the number of base pairs (inter-genic or within-genes) in the real E. coli sequence. For the p-value plot, we calculated the cumulative distribution function of the Poisson distribution, evaluated at the value of the real histogram, with the mean given by the synthetic histogram value. This corresponds to the one-tailed p-value.
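The Poisson-based p-value for a single histogram bin can be sketched as follows. The counts and region size are hypothetical, and the function names are introduced here for illustration only.

```python
import math


def poisson_cdf(k, mean):
    """P(X <= k) for X ~ Poisson(mean), with each term evaluated in log space."""
    if k < 0:
        return 0.0
    if mean <= 0.0:
        return 1.0
    total = 0.0
    for i in range(int(k) + 1):
        total += math.exp(i * math.log(mean) - mean - math.lgamma(i + 1))
    return min(total, 1.0)


def depletion_p_value(real_count, synthetic_count, real_bp, synthetic_bp):
    """One-tailed p-value that the real count is this low or lower.

    The synthetic count is rescaled to the size of the real region and used
    as the Poisson mean.
    """
    expected = synthetic_count * real_bp / synthetic_bp
    return poisson_cdf(real_count, expected)


# Hypothetical bin: 50,000 sites in a 100-Mbp synthetic genome, and 150 sites
# observed in a 400-kbp real region (expected count: 200).
p = depletion_p_value(150, 50_000, 400_000, 100_000_000)
```

A small value indicates that the real sequence contains fewer binding sites at or below that energy than expected from GC content alone.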

We also developed an alternative method of evaluating selection against σ70-RNAP binding sites, in order to strengthen the validity of our claims. This time, instead of creating a random synthetic genome and comparing the predicted expression levels across that genome using the σ70-RNAP energy matrix (Figure 1C), we created 100 shuffled energy matrices and evaluated the free energies of such models across the actual E. coli genome. Matrices were permuted per position, meaning that the columns of the matrix were shuffled without altering the internal structure of each column. For each such matrix, we evaluated the model at every single position in the E. coli genome and calculated the cumulative histogram, as in the previous paragraph. The p-values were calculated assuming a normal distribution per bin. This is in fact a conservative estimate of the p-value because, for bins with lower means, the Gaussian approximation overestimates the variance and hence the p-value. In the plot we illustrate the 95% confidence interval by showing explicitly the 3rd and the 97th percentile.
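The column shuffling and the per-bin normal approximation can be sketched as below. The matrix layout (rows as nucleotides, columns as positions), the toy values, and the choice of the lower tail are assumptions made for this illustration.

```python
import math
import random
import statistics


def shuffle_columns(matrix, rng):
    """Permute the column order of an energy matrix, keeping each column intact.

    matrix: list of rows (assumed here: one row per nucleotide A, C, G, T),
    each row holding one value per position.
    """
    order = list(range(len(matrix[0])))
    rng.shuffle(order)
    return [[row[j] for j in order] for row in matrix]


def normal_lower_tail(observed, null_values):
    """P(X <= observed) under a normal fit to the shuffled-matrix values of one bin."""
    mean = statistics.mean(null_values)
    sd = statistics.stdev(null_values)
    z = (observed - mean) / sd
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Toy 4 x 5 matrix (hypothetical energies), shuffled 100 times.
rng = random.Random(0)
matrix = [[r * 10 + c for c in range(5)] for r in range(4)]
shuffled = [shuffle_columns(matrix, rng) for _ in range(100)]

# Hypothetical bin counts from three shuffled matrices versus a real count.
p_bin = normal_lower_tail(6.0, [8.0, 10.0, 12.0])
```

Shuffling whole columns preserves the energy distribution at each matrix position while destroying the positional arrangement that defines the σ70 motif.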

To predict the expression from known E. coli promoters, we assigned a free energy to each of the 1951 promoters in RegulonDB by integrating over a symmetric 40 bp region around the reported transcript start. For a fair comparison in the plot, we also integrated in the same way over 40 bp chunks of all the inter-genic regions described above (red line in Figure 4F). For each promoter, we then searched through RegulonDB for all transcription factors that bind it. If all of them were activators, we flagged that promoter as ‘activatable’, and similarly for ‘repressible’. If we found a mixture of activators and repressors affecting a given promoter, we flagged the promoter as ‘dual’. The promoters with no known transcription factor binding were flagged as ‘no info’.
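The flagging rule can be written compactly, as in the sketch below. The effect labels "activator" and "repressor" and the input format are hypothetical, and treating any other combination as 'dual' is an assumption of this example; the actual RegulonDB fields may be encoded differently.

```python
def flag_promoter(tf_effects):
    """Classify a promoter from the effects of the transcription factors binding it.

    tf_effects: iterable of strings, one per transcription factor, each either
    "activator" or "repressor" (hypothetical labels for this sketch).
    """
    effects = set(tf_effects)
    if not effects:
        return "no info"        # no known transcription factor binding
    if effects == {"activator"}:
        return "activatable"
    if effects == {"repressor"}:
        return "repressible"
    return "dual"               # mixture of activators and repressors


flags = [
    flag_promoter([]),
    flag_promoter(["activator", "activator"]),
    flag_promoter(["repressor"]),
    flag_promoter(["activator", "repressor"]),
]
```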

Acknowledgements

We thank Hande Acar, Nicholas H Barton, Rok Grah, Tiago Paixao, Maros Pleska, Anna Staron, and Murat Tugrul for insightful comments and input on the manuscript. This work was supported by: Sir Henry Dale Fellowship jointly funded by the Wellcome Trust and the Royal Society (grant number 216779/Z/19/Z) to ML; IPC Grant from IST Austria to ML and SS; European Research Council Funding Programme 7 (2007–2013, grant agreement number 648440) to JPB.

Funding Statement

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Contributor Information

Mato Lagator, Email: mato.lagator@manchester.ac.uk.

Sandeep Krishna, National Centre for Biological Sciences-Tata Institute of Fundamental Research, India.

Aleksandra M Walczak, École Normale Supérieure, France.

Funding Information

This paper was supported by the following grants:

  • Royal Society / Wellcome Trust 216779/Z/19/Z to Mato Lagator.

  • European Research Council 648440 to Jonathan P Bollback.

  • IPC Grant from IST Austria to Mato Lagator, Srdjan Sarikas.

Additional information

Competing interests

No competing interests declared.

Author contributions

Conceptualization, Investigation, Methodology, Writing – original draft, Writing – review and editing.

Conceptualization, Formal analysis, Methodology, Writing – review and editing.

Conceptualization, Investigation, Methodology.

Investigation, Writing – review and editing.

Writing – review and editing.

Conceptualization, Investigation, Methodology, Writing – review and editing.

Conceptualization, Formal analysis, Investigation, Writing – review and editing.

Additional files

Transparent reporting form

Data availability

Source data files have been provided for Figures 1 and 3. Code and data have been deposited on GitHub: https://github.com/szarma/Thermoters (copy archived at swh:1:rev:61fe2f54941966469dad801efe06e1c879f27530).

References

  1. Anzolini Cassiano MH, Silva-Rocha R. Benchmarking Available Bacterial Promoter Prediction Tools. bioRxiv. 2020 doi: 10.1101/2020.05.05.079335.
  2. Beer MA, Tavazoie S. Predicting gene expression from sequence. Cell. 2004;117:185–198. doi: 10.1016/s0092-8674(04)00304-6.
  3. Belliveau NM, Barnes SL, Ireland WT, Jones DL, Sweredoski MJ, Moradian A, Hess S, Kinney JB, Phillips R. Systematic approach for dissecting the molecular mechanisms of transcriptional regulation in bacteria. PNAS. 2018;115:E4796–E4805. doi: 10.1073/pnas.1722055115.
  4. Bintu L, Buchler NE, Garcia HG, Gerland U, Hwa T, Kondev J, Phillips R. Transcriptional regulation by the numbers: models. Current Opinion in Genetics & Development. 2005;15:116–124. doi: 10.1016/j.gde.2005.02.007.
  5. Brophy JAN, Voigt CA. Antisense transcription as a tool to tune gene expression. Molecular Systems Biology. 2016;12:854. doi: 10.15252/msb.20156540.
  6. Byrd RH, Lu P, Nocedal J, Zhu C. A Limited Memory Algorithm for Bound Constrained Optimization. SIAM Journal on Scientific Computing. 1995;16:1190–1208. doi: 10.1137/0916069.
  7. Dill KA, MacCallum JL. The protein-folding problem, 50 years on. Science. 2012;338:1042–1046. doi: 10.1126/science.1219021.
  8. Einav T, Phillips R. How the avidity of polymerase binding to the -35/-10 promoter sites affects gene expression. PNAS. 2019;116:13340–13345. doi: 10.1073/pnas.1905615116.
  9. Eren AM, Vineis JH, Morrison HG, Sogin ML. A filtering method to generate high quality short reads using illumina paired-end technology. PLOS ONE. 2013;8:e66643. doi: 10.1371/journal.pone.0066643.
  10. Forcier TL, Ayaz A, Gill MS, Jones D, Phillips R, Kinney JB. Measuring cis-Regulatory Energetics in Living Cells Using Allelic Manifolds. eLife. 2018;7:e40618. doi: 10.7554/eLife.40618.
  11. Friedlander T, Prizak R, Guet CC, Barton NH, Tkačik G. Intrinsic limits to gene regulation by global crosstalk. Nature Communications. 2016;7:12307. doi: 10.1038/ncomms12307.
  12. Gillespie JH. Some Properties of Finite Populations Experiencing Strong Selection and Weak Mutation. The American Naturalist. 1983;121:691–708. doi: 10.1086/284095.
  13. Hawley DK, McClure WR. Compilation and analysis of Escherichia coli promoter DNA sequences. Nucleic Acids Research. 1983;11:2237–2255. doi: 10.1093/nar/11.8.2237.
  14. Hossain A, Lopez E, Halper SM, Cetnar DP, Reis AC, Strickland D, Klavins E, Salis HM. Automated design of thousands of nonrepetitive parts for engineering stable genetic systems. Nature Biotechnology. 2020;38:1466–1475. doi: 10.1038/s41587-020-0584-2.
  15. Igler C, Lagator M, Tkačik G, Bollback JP, Guet CC. Evolutionary potential of transcription factors for gene regulatory rewiring. Nature Ecology & Evolution. 2018;2:1633–1643. doi: 10.1038/s41559-018-0651-y.
  16. Jacob F, Monod J. Genetic regulatory mechanisms in the synthesis of proteins. Journal of Molecular Biology. 1961;3:318–356. doi: 10.1016/s0022-2836(61)80072-7.
  17. Johns NI, Gomes ALC, Yim SS, Yang A, Blazejewski T, Smillie CS, Smith MB, Alm EJ, Kosuri S, Wang HH. Metagenomic mining of regulatory elements enables programmable species-selective gene expression. Nature Methods. 2018;15:323–329. doi: 10.1038/nmeth.4633.
  18. Kemble H, Nghe P, Tenaillon O. Recent Insights Into the Genotype–Phenotype Relationship from Massively Parallel Genetic Assays. Evolutionary Applications. 2019;8:312–322. doi: 10.1111/eva.12846.
  19. Kim HD, Shay T, O’Shea EK, Regev A. Transcriptional regulatory circuits: predicting numbers from alphabets. Science. 2009;325:429–432. doi: 10.1126/science.1171347.
  20. Kinkhabwala A, Guet CC. Uncovering cis regulatory codes using synthetic promoter shuffling. PLOS ONE. 2008;3:e2030. doi: 10.1371/journal.pone.0002030.
  21. Kinney JB, Murugan A, Callan CGJ, Cox EC. Using deep sequencing to characterize the biophysical mechanism of a transcriptional regulatory sequence. PNAS. 2010;107:9158–9163. doi: 10.1073/pnas.1004290107.
  22. Kinney JB, McCandlish DM. Massively Parallel Assays and Quantitative Sequence-Function Relationships. Annual Review of Genomics and Human Genetics. 2019;20:99–127. doi: 10.1146/annurev-genom-083118-014845.
  23. Kuhlman B, Bradley P. Advances in protein structure prediction and design. Nature Reviews. Molecular Cell Biology. 2019;20:681–697. doi: 10.1038/s41580-019-0163-x.
  24. Lässig M, Mustonen V, Walczak AM. Predicting evolution. Nature Ecology & Evolution. 2017;1:77. doi: 10.1038/s41559-017-0077.
  25. Lehner B. Genotype to phenotype: lessons from model organisms for human genetics. Nature Reviews. Genetics. 2013;14:168–178. doi: 10.1038/nrg3404.
  26. Libbrecht MW, Noble WS. Machine learning applications in genetics and genomics. Nature Reviews. Genetics. 2015;16:321–332. doi: 10.1038/nrg3920.
  27. Lutz R, Bujard H. Independent and tight regulation of transcriptional units in Escherichia coli via the LacR/O, the TetR/O and AraC/I1-I2 regulatory elements. Nucleic Acids Research. 1997;25:1203–1210. doi: 10.1093/nar/25.6.1203.
  28. Mohanty BK, Kushner SR. Regulation of mRNA Decay in Bacteria. Annual Review of Microbiology. 2016;70:25–44. doi: 10.1146/annurev-micro-091014-104515.
  29. Mustonen V, Lässig M. Evolutionary population genetics of promoters: predicting binding sites and functional phylogenies. PNAS. 2005;102:15936–15941. doi: 10.1073/pnas.0505537102.
  30. Mustonen V, Kinney JB, Callan CGJ, Lässig M. Energy-dependent fitness: a quantitative model for the evolution of yeast transcription factor binding sites. PNAS. 2008;105:12376–12381. doi: 10.1073/pnas.0805909105.
  31. Nagai T, Ibata K, Park ES, Kubota M, Mikoshiba K, Miyawaki A. A variant of yellow fluorescent protein with fast and efficient maturation for cell-biological applications. Nature Biotechnology. 2002;20:87–90. doi: 10.1038/nbt0102-87.
  32. Otwinowski J, Nemenman I. Genotype to phenotype mapping and the fitness landscape of the E. coli lac promoter. PLOS ONE. 2013;8:e61570. doi: 10.1371/journal.pone.0061570.
  33. Ross W, Gosink KK, Salomon J, Igarashi K, Zou C, Ishihama A, Severinov K, Gourse RL. A third recognition element in bacterial promoters: DNA binding by the alpha subunit of RNA polymerase. Science. 1993;262:1407–1413. doi: 10.1126/science.8248780.
  34. Roy S, Garges S, Adhya SL. Activation and repression of transcription by differential contact: two sides of a coin. The Journal of Biological Chemistry. 1998;273:14059–14062. doi: 10.1074/jbc.273.23.14059.
  35. Saiz L, Vilar JMG. Ab initio thermodynamic modeling of distal multisite transcription regulation. Nucleic Acids Research. 2008;36:726–731. doi: 10.1093/nar/gkm1034.
  36. Salgado H, Peralta-Gil M, Gama-Castro S, Santos-Zavaleta A, Muñiz-Rascado L, García-Sotelo JS, Weiss V, Solano-Lira H, Martínez-Flores I, Medina-Rivera A, Salgado-Osorio G, Alquicira-Hernández S, Alquicira-Hernández K, López-Fuentes A, Porrón-Sotelo L, Huerta AM, Bonavides-Martínez C, Balderas-Martínez YI, Pannier L, Olvera M, Labastida A, Jiménez-Jacinto V, Vega-Alvarado L, Del Moral-Chávez V, Hernández-Alvarez A, Morett E, Collado-Vides J. RegulonDB v8.0: omics data sets, evolutionary conservation, regulatory phrases, cross-validated gold standards and more. Nucleic Acids Research. 2013;41:D203–D213. doi: 10.1093/nar/gks1201.
  37. Sarkisyan KS, Bolotin DA, Meer MV, Usmanova DR, Mishin AS, Sharonov GV, Ivankov DN, Bozhanova NG, Baranov MS, Soylemez O, Bogatyreva NS, Vlasov PK, Egorov ES, Logacheva MD, Kondrashov AS, Chudakov DM, Putintseva EV, Mamedov IZ, Tawfik DS, Lukyanov KA, Kondrashov FA. Local fitness landscape of the green fluorescent protein. Nature. 2016;533:397–401. doi: 10.1038/nature17995.
  38. Schuster P. Prediction of RNA secondary structures: from theory to models and real molecules. Reports on Progress in Physics. 2006;69:1419–1477. doi: 10.1088/0034-4885/69/5/R04.
  39. Srikumar S, Kröger C, Hébrard M, Colgan A, Owen SV, Sivasankaran SK, Cameron ADS, Hokamp K, Hinton JCD. RNA-seq Brings New Insights to the Intra-Macrophage Transcriptome of Salmonella Typhimurium. PLOS Pathogens. 2015;11:e1005262. doi: 10.1371/journal.ppat.1005262.
  40. Storz G. Global Transcriptional Start Site Mapping Using Differential RNA Sequencing Reveals Novel Antisense RNAs in Escherichia coli. Journal of Bacteriology. 2014;197:18–28. doi: 10.1128/JB.02096-14.
  41. Tuğrul M, Paixão T, Barton NH, Tkačik G. Dynamics of Transcription Factor Binding Site Evolution. PLOS Genetics. 2015;11:e1005639. doi: 10.1371/journal.pgen.1005639.
  42. Urtecho G, Tripp AD, Insigne KD, Kim H, Kosuri S. Systematic Dissection of Sequence Elements Controlling σ70 Promoters Using a Genomically Encoded Multiplexed Reporter Assay in Escherichia coli. Biochemistry. 2019;58:1539–1551. doi: 10.1021/acs.biochem.7b01069.
  43. Vilar JMG. Accurate prediction of gene expression by integration of DNA sequence statistics with detailed modeling of transcription regulation. Biophysical Journal. 2010;99:2408–2413. doi: 10.1016/j.bpj.2010.08.006.
  44. Voordeckers K, Pougach K, Verstrepen KJ. How do regulatory networks evolve and expand throughout evolution? Current Opinion in Biotechnology. 2015;34:180–188. doi: 10.1016/j.copbio.2015.02.001.
  45. Wolf L, Silander OK, van Nimwegen E. Expression noise facilitates the evolution of gene regulation. eLife. 2015;4:987. doi: 10.7554/eLife.05856.
  46. Yi X, Dean AM. Adaptive Landscapes in the Age of Synthetic Biology. Molecular Biology and Evolution. 2019;36:890–907. doi: 10.1093/molbev/msz004.
  47. Yona AH, Alm EJ, Gore J. Random sequences rapidly evolve into de novo promoters. Nature Communications. 2018;9:1530. doi: 10.1038/s41467-018-04026-w.
  48. Zaslaver A, Bren A, Ronen M, Itzkovitz S, Kikoin I, Shavit S, Liebermeister W, Surette MG, Alon U. A comprehensive library of fluorescent transcriptional reporters for Escherichia coli. Nature Methods. 2006;3:623–628. doi: 10.1038/nmeth895.

Editor's evaluation

Sandeep Krishna

This paper builds a biophysical model to understand the contribution of the DNA sequence in σ70-RNA polymerase binding activity. The authors provide evidence that taking into account that RNA polymerase can bind in multiple configurations, including non-productive ones, significantly improves predictions. They also confirm and extend previous observations that functional promoter sequences are relatively abundant in random sequences. This work represents an important advance to the field, and will hopefully serve as the stepping stone for better models that include transcription factors and eukaryotic enhancers.

Decision letter

Editor: Sandeep Krishna
Reviewed by: Erik van Nimwegen, Tal Einav

In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.

Decision letter after peer review:

Thank you for submitting your article "Predicting Promoter Function and Evolution from Random Sequence" for consideration by eLife. Your article has been reviewed by 3 peer reviewers, and the evaluation has been overseen by a Reviewing Editor and Aleksandra Walczak as the Senior Editor. The following individuals involved in review of your submission have agreed to reveal their identity: Erik van Nimwegen (Reviewer #3); Tal Einav (Reviewer #4).

The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.

While the reviewers found that the manuscript contains potentially interesting results, as you can see from the reviews appended below, they have many concerns. Most importantly, they are not fully convinced that the model accurately predicts expression levels and would like to see much more analysis of which extensions of the previous thermodynamic models are important for the accuracy of such predictions. They were also concerned that the result that promoters are common among random sequences is not novel and does not warrant the emphasis that the manuscript puts on this result. Therefore, we suggest that in a revised manuscript you:

(i) focus primarily on the claim that the extensions to the previous thermodynamic model are important for explaining the expression levels from random sequences and real promoters, and show in greater detail how each extension is supported by instances in your dataset;

(ii) quantify more explicitly the accuracy of predictions from your model following the suggestions of the reviewers, and specifically quantify the importance of the extensions to previous models for predicting the expression levels in your libraries and in native constitutive promoters;

(iii) test your model on at least one (preferably more) of the additional data sets suggested by Reviewer 4, and provide a clear summary of how your results compare with the data.

Please do also go through the reviews below and try to address all the additional points raised by the reviewers.

Reviewer #2:

This paper builds a biophysical model to understand the contribution of the DNA sequence in σ70-RNA polymerase binding activity. To train their model, which takes into account intra-motif spacing as well as dependencies, they generate large mutant promoter libraries and measure expression from those. The title is a bit too general since the data and model are for a specific constitutive bacteriophage promoter, but nevertheless, the study has novel implications in evolution of gene regulation.

Lagator et al. have modeled the binding of σ70-RNA polymerase (RNAP) to a constitutive promoter to understand the effect of the promoter sequence in regulating gene expression. This is a biophysical model that extends the standard thermodynamic model of RNAP binding with its cognate -35 and -10 boxes in several ways: the aim is to account for a variable spacer between the two boxes, binding at non-cognate regions, hindering transcription by binding on the negative strand (opposite orientation), occlusive binding (causing the transcript to begin upstream and not contain the RBS), and relaxing the assumption of independence within the energy matrix by modelling pair-wise nucleotide dependencies. They create three mutant libraries originating from strong λ pR and pL promoters as well as a completely random (25% probability of each nucleotide) set of oligonucleotides to train their model parameters. The model shows an impressive predictive power on a held-out set and predicts expression patterns of completely random sequences to a remarkable degree.

They further conclude that many non-promoter E. coli regions are only one bp mutation away from giving rise to measurable expression. They validate their predictions on promoters where the standard model predicts a different expression pattern from their model. The data from their experiments will be a valuable addition to the community in case people want to develop new models or test other hypotheses.

A few things are unclear:

1. The thermodynamic model still makes some assumptions which need to be clearly mentioned. The sum suggests there is at most a single RNAP molecule (or a binding site at a time) present on each promoter.

2. The authors learn the models by using multinomial logistic regression (MLR), where the response variable is the median observed expression for each bin (which are four/12 in total, depending on the promoter class). This is a little unusual, since the four expression values are not "categories", which is typically the expectation in MLR. In other words there is a definite order (or dependence) within the categories, the first bin having a lower expression than the second, etc. That information is lost/unused in MLR.

3. It is not clear whether there are enough instances of spacer width varying in the mutant pR / pL libraries. If the mutations are single nucleotide changes (not deletions/insertions), would it not be highly unlikely that the library contains instances of the two boxes having different spacing?

It will be best to mention clearly the assumptions in the cumulative model, mathematically. A more detailed explanation of the sum in equation 2 will be useful. See, for example, the Bintu et al. paper cited in the same section. Similarly, the MLR model should be described with an equation.

Figure 1 C shows the learned energy matrix. There are other studies which use the standard model to learn their own matrix for RNAP from large mutant libraries. A comparison between the matrices will be useful, even if the promoters are different. In the same vein, this new model can be applied on those data to see if the expression can be better predicted. This will show the generality of the model.

Does sequence in pR (and pL) contain also the annotated transcription start site (TSS)? Is there any chance that the TSS also gets mutated in their libraries? Since the -10 box is relative to the TSS the distance between the box and TSS is presumably important, which can be defined only with respect to the nucleotide (or a small neighborhood) at the TSS. Does the final energy matrix contain those regions? Figure 1C suggests not.

I would have expected to see the Y axis of Figure 2B to be stratified across 12 bins. But that is not the case, which means I am missing something.

Is it not possible that the energy matrix values (those at the two boxes) change with the spacer? I understand that is not allowed in the current model.

Finally, the contribution of the dinucleotide interactions is not convincing. Perhaps the prediction accuracy on the held out set based on with and without including these terms will help clear that.

Reviewer #3:

Lagator et al. extend previous biophysical models to predict gene expression of constitutively expressing E. coli promoters from their sequence. They provide evidence that taking into account that RNA-polymerase can bind in multiple configurations, including non-productive ones, significantly improves predictions. They also confirm and extend previous observations that functional promoter sequences are relatively abundant in random sequences.

The aim of this study is to develop a realistic biophysical model for predicting the expression of (constitutive) E. coli promoters from their sequence. From my understanding of the presentation, the main new results that the authors claim are:

1. They extend existing biophysical models to a more realistic model that takes into account that the spacers between the -35 and -10 sites are flexible, that one should not just focus on the best site in a region but take into account cumulative binding to all possible sites, and that one should also explicitly model that binding to some sites is not productive (e.g. binding to reverse-complemented sites on the opposite strand).

2. They confirm experimentally that these extensions can indeed have a major impact on expression, for selected promoter sequences.

3. This biophysical model accurately predicts the gene expression level of any sequence.

4. They show that a substantial fraction of short random sequences will act as promoters, i.e. drive significant expression.

5. There is significant evidence for selection against ubiquitous expression, i.e. sigma70 binding sites are depleted both in gene bodies and in intergenic regions.

In my opinion the strongest parts of the paper are points 1 and 2. The extensions of the 'standard' biophysical model are all very sensible and the model is implemented in a mostly reasonable way. None of the ingredients in the model are particularly novel or surprising, but that's not a bad thing. I do wonder why the authors did not include features such as allowing only one of the two 'feet' of the sigma70 to bind (like in the Einav and Phillips PNAS 2019 paper) or non-specific binding (although maybe the clearing rate that the authors introduce plays a role analogous to non-specific binding). It also was not entirely clear to what extent binding at one site precludes binding at other sites in the model. This could be explained better. But apart from this the model is very reasonable. The only part I am sceptical about is the dinucleotide modeling, which is based on a very ad hoc and rather baroque procedure.

I particularly liked the experimental confirmations (Figure 1 – supplement 1) that support that several extensions of the model can indeed have a major impact on expression for selected promoter sequences. In my opinion these results should probably have been the main focus of the paper, maybe together with some more rigorous quantification of what role the various extensions play in predicting the expression levels of the random sequences. If, in addition, the authors had provided evidence that these extensions are also important for expression in native constitutive E. coli promoters, I think this could have been a very nice paper.

However, currently the focus seems to be on two aspects that I am much less convinced about. The authors in several places make the claim (point 3 above), that their model accurately predicts expression levels. I do not feel this claim is properly substantiated. First, to give some context, there is a very large body of work going back to the 1980s that aims to predict promoter sequences in E. coli directly from sequence, and with just a little bit of googling one will uncover plenty of works in the literature that appear to claim to `solve' this problem, e.g. Cassiano and Silva-Rocha mSystems 2020 present a recent benchmarking of a whole host of promoter prediction tools. I understand that these methods do not specifically aim to predict promoter strength but presumably one could use the 'score' that they calculate as a predictor of promoter strength and it is not clear to me how poorly such predictors would perform in comparison with the model presented here. In addition, there are also several previous papers that specifically claim to successfully predict promoter strength from sequence using either biophysical models close to the ones used here (e.g. Brewster et al., PLoS Comp Biol 2012, Einav and Phillips PNAS 2019) or more general machine learning methods, e.g. Mulligan et al., Nucl Acids Res 1984, Weiler and Wecknagel J Theo Biol 1994, DeMey et al., BMC Biotechnology 2007, Meng et al., Quantitative Biology 2017, and the recent Zhao et al., BioRxiv 2020.06.25.170365. Does the model of the authors significantly improve over these approaches? It is not clear to me. Especially the Einav and Phillips method is curtly dismissed in the discussion, but it is not clear to me whether that model, if trained on this data, would really perform much worse than the current model.

I understand it is kind of a pain to do an extensive comparison with previous methods, but in order to make real progress on understanding what determines the expression levels of constitutive E. coli promoters, at a minimum I would have expected a very clear quantification of exactly how well the model performs. What one would want to know is: If I feed the model a random sequence, how accurately does it predict its expression level? Or, how well can the model predict the variation in expression levels of constitutive promoters across the E. coli genome? How much accuracy is lost when individual extensions are removed, either on native promoters or on random sequences?

However, we do not really get such an analysis. The authors only present one scatter and Pearson correlations between the predicted log[Pon] and measured log[expression] for various versions of the model. Crucially, I do not think that a correlation of r2 ~ 0.75 on a dataset where expression levels vary over a 1000-fold range can be reasonably described as `accurate prediction'. Just as a back-of-the-envelope, if both data and model have a standard-deviation in log-expression levels of about log[100], then r2 = 0.75 corresponds to the error between model and data having standard-deviation of log[10]. That is, unless I made a mistake in my back-of-the-envelope, the model's predictions would typically be off by 10-fold (and the scatter does seem sort of consistent with this). It is also in line with the fact that the bins for the P_R data are 10-fold apart in expression level, i.e. the training of the model is done on a dataset in which promoters whose expression levels are within 10-fold of each other are effectively treated as equally expressed. I do not think that this level of precision can be called accurate prediction of expression levels. For example, using very similar reporter constructs, the difference in expression of the median and 95th percentile of native E. coli promoters is about 20-fold (i.e. see Wolf et al., Figure 1B), so this model could barely tell these apart.
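The back-of-the-envelope estimate above can be checked numerically. The following is a minimal sketch, assuming log10 units, an equal spread of 2 decades (log[100]) for data and model, and r2 = 0.75 as quoted in the comment. The function names are illustrative and do not come from the manuscript. It reports two readings of "error": the residual spread around a least-squares line, and the spread of the raw difference between two equally dispersed variables.

```python
import math

def residual_sd(sd_data, r2):
    """SD of residuals around a least-squares line: sd * sqrt(1 - r^2)."""
    return sd_data * math.sqrt(1.0 - r2)

def difference_sd(sd, r2):
    """SD of (data - model) when both have the same SD and a positive correlation r."""
    r = math.sqrt(r2)
    return math.sqrt(2.0 * sd ** 2 * (1.0 - r))

sd_log10 = math.log10(100)  # assumed spread of 2 decades, as in the comment
r2 = 0.75                   # value quoted in the comment

res = residual_sd(sd_log10, r2)
diff = difference_sd(sd_log10, r2)
print(f"residual SD:   {res:.2f} decades")
print(f"difference SD: {diff:.2f} decades")
```

Under these assumptions both readings come out at roughly one decade, consistent with the reviewer's estimate of a typical 10-fold error.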

Another issue is that it is not clear which extensions of the model are the most crucial for the accuracy of the predictions. I was surprised to see (in Figure 2 – source data 1) that, apparently, by far the biggest improvement on the random sequences (36N dataset) comes from including flexible spacers and that cumulative binding and occlusive binding adds rather little. This appears somewhat at odds with the narrative that the authors present in the paper that stresses the importance of cumulative binding.

In short, although I would like to believe that this reasonable biophysical model improves a lot over previous approaches in the literature, I don't think a convincing case has been made that this is really a major advance in accuracy of promoter strength prediction.

Finally, regarding points 4 and 5, I do not understand why the authors place so much emphasis on the observation that functional promoters (i.e. sequences that drive gene expression) are common among random sequences, because this is neither novel nor surprising. It has long been known that the 'position specific weight matrices' that model sigma70 binding sites have low information content, e.g. Shultzaberger et al., Nucl Acids Res 2007 estimate it to be about 6.5 bits. Given this, one would very roughly expect a site every 100 bp in random sequences. In fact, this was one of the reasons why, in Wolf et al., (eLife 2015), we were confident that we would easily find a large diversity of functional promoters in a library of random sequences. This expectation was confirmed by our measurements as one can see from Figure 1B of Wolf et al. That is, after selecting the 5% of highest expressing cells from the random library, most of them express at a reasonable level. At the time it didn't even occur to us to mention this as a surprising observation. When the paper of Yona et al., (Nat Comm 2018) appeared, we realized that we misjudged this, and that for many people it WAS surprising that functional promoters are common even in relatively short random sequences. However, now that Yona et al., have an entire paper focused on this observation, including showing depletion of sigma70 sites within gene regions, there is no more novelty to this observation.
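The "one site every 100 bp" figure follows from the quoted information content. As a rough sketch, a motif carrying I bits matches a given position of a random sequence with probability of about 2^-I, so matches are expected about every 2^I positions. This heuristic assumes uniform base composition and a single strand; the function names are illustrative.

```python
def match_probability(information_bits):
    """Approximate chance that a random position matches a motif of the given information content."""
    return 2.0 ** (-information_bits)

def expected_spacing_bp(information_bits):
    """Approximate number of positions between chance matches in random sequence."""
    return 1.0 / match_probability(information_bits)

bits = 6.5  # information content quoted in the comment above
spacing = expected_spacing_bp(bits)
print(f"expected spacing: about {spacing:.0f} bp per site")
```

With 6.5 bits this gives a spacing of roughly 90 bp, in line with the order-of-magnitude estimate in the comment.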

The observations about the frequency of occurrence of promoters in random sequences could have been more interesting if it had been more meaningfully quantified. That is, any sequence will presumably drive expression at some nonzero rate. An interesting (and meaningful) quantitative question is how likely it is that a random sequence drives expression at rate > x, as a function of x. Maybe the authors' data and model could give a reasonable answer to this question. However, instead the authors simply treat this problem as if there is just a binary question whether a promoter is expressing or non-expressing. Moreover, they define the threshold between expressing and non-expressing in a way that is not biologically meaningful but seems determined by instrument precision. I actually had a hard time understanding how precisely 'measurable expression' was defined. My understanding is that, for the P_R and P_L libraries, it is defined as any expression higher than the 99th percentile of measurements for cells carrying no YFP. But how it is defined for the 36N library (with another reporter) was unclear to me. In any case, it appears that 'measurable expression' is defined in terms of the auto-fluorescence level that the cells have in the FACS machine. Unfortunately, this is not a biologically meaningful cut-off. In fact, a substantial fraction of E. coli promoters express below the detection limit in the FACS (i.e. not significantly above auto-fluorescence).
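The quantity requested above, the probability that a random sequence drives expression at a rate greater than x as a function of x, is an empirical complementary cumulative distribution. A minimal sketch follows; the expression values are toy numbers invented purely for illustration, not data from the manuscript, and the function name is hypothetical.

```python
def fraction_above(log_expression, thresholds):
    """For each threshold x, return the fraction of sequences with expression strictly above x."""
    n = len(log_expression)
    return [sum(value > x for value in log_expression) / n for x in thresholds]

# Toy log10 expression values (illustrative only, not measured data)
toy_log_expression = [0.1, 0.4, 0.5, 1.2, 1.9, 2.3, 3.0, 0.2, 0.3, 0.8]
thresholds = [0.0, 1.0, 2.0, 3.0]

curve = fraction_above(toy_log_expression, thresholds)
for x, frac in zip(thresholds, curve):
    print(f"P(log10 expression > {x:.1f}) = {frac:.2f}")
```

Reporting the whole curve avoids committing to any single detection threshold, which is the concern raised in the comment.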

In summary, I think there are some very good ideas in this paper. I in principle like the biophysical model, and the experimental validations of the model's extensions I thought were very nice and, when extended, could make a compelling paper. Unfortunately, the current focus of the paper appears to be on the frequent occurrence of promoters in random sequences (which is not novel) and claims that the model is a major advance in predicting gene expression from sequence, which I do not feel are substantiated by the results presented.

1. I had real trouble understanding the processing of the 36N dataset. I understand that, because in the 12-bin dataset you sort until you have a certain number of cells in the bin, you need to correct for this when estimating the relative fractions of cells with a given promoter that go to different bins. But I do not understand the approach with the spike-ins. Aren't there, in addition to the spike ins, also an unknown number of reference promoters coming from the sorting?

2. line 81: "even for constitutive promoters, where σ70-RNAP binding solely determines gene expression levels" Has this really been established? There could be effects of differential binding of topoisomerases that undo supercoiling.. or differential weak binding by other TFs.. and so on. I don't think we really know these levels are solely determined by sigma70-RNAP.

3. Figure 1 – supplement 1.

Why is detectability threshold higher in B than in A? I found it generally quite frustrating that it is hard to figure out how exactly this 'measurable expression' threshold, which plays such a crucial role, is defined.

4. 515-516: Finally, we required the distribution of expression bins to be as unambiguous as possible (Figure 2 —figure supplement 1D,E).

Why is this done? The way I understand it, this throws out promoters whose expression levels straddle two bins. I guess you want to throw those out because fitting the model to observations in only 4 bins is more accurate when using only promoters that fall sort of in the middle of those bins, but I find this quite unsatisfactory. Would it not have been possible to fit a log-normal to the relative fractions of counts in the different bins and get continuous expression estimates?

5. 596: "We define 60:20:20% splits of each of our libraries into three disjunct datasets Figure 2 – Source Data 3."

Are these splits random? A notorious problem in cross-validation based fitting is when highly correlated datapoints occur in both training and test data. For example, in this case you could have promoters that are virtually identical, and with equal expression, in both training and test datasets. This could lead to misleadingly high performance on the test dataset. I actually doubt this is happening here, but maybe it is worthwhile checking this.

6. Did you try adding a non-specific binding term to the model? Or the ability for sigma70 to bind through only one of its two feet? If not, why not? Do the authors think sigma70 cannot be bound only at the -10 or -35 sites?

7. I did not like the way multinomial logistic regression was used to fit the model. Expression should go up monotonically with Pon. It makes no sense to treat the bins as unrelated 'classes'.

8. Comparing evolution of promoters de novo under the extended model against the 'standard model' is a straw man in my opinion. The standard model is an approximation to the thermodynamic model that is valid when there is one site that dominates the overall affinity of a promoter. I agree that it is often assumed that this is a good approximation for native promoters that have been shaped by selection. But I do not think there is anybody that thinks this is a good model for how promoters appear in random sequences. Obviously weak promoters can then appear anywhere in the sequence.

9. For Figure 3A, I think it is essential to also predict such a curve for the sequences in each of the libraries and see how well these predicted curves match the observed distributions for these libraries. Otherwise it is hard to judge how meaningful this predicted distribution is.

10. Many intergenic sequences in E. coli will be constrained to have all kinds of TFBSs, precluding sigma70 sites from occurring in those positions. So to check whether there is really depletion of sigma70 sites, comparing with random sequence of the same nucleotide composition is not quite correct in my opinion. First, you should distinguish intergenic regions that are upstream of two operons from those that are upstream of one, and those downstream of two promoters. Those 3 classes of intergenic regions are also known to have different nucleotide composition. The best regions for tests would be downstream regions, and regions that are upstream of an operon that is constitutive.
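Regarding point 4 above, the suggestion of fitting a log-normal to the per-bin read fractions can be sketched as an interval-censored maximum-likelihood fit in log10 expression. This is only an illustration under stated assumptions: the bin edges, counts and parameter grids below are hypothetical, a plain grid search stands in for a proper optimizer, and none of it is taken from the manuscript's pipeline.

```python
import math

def normal_cdf(x, mu, sigma):
    """CDF of a normal distribution; handles infinite x via math.erf."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def bin_log_likelihood(counts, edges, mu, sigma):
    """Log-likelihood of bin counts if log10 expression is normal(mu, sigma).

    `edges` has len(counts) + 1 entries; the outer edges may be infinite.
    """
    total = 0.0
    for n, low, high in zip(counts, edges[:-1], edges[1:]):
        p = normal_cdf(high, mu, sigma) - normal_cdf(low, mu, sigma)
        total += n * math.log(max(p, 1e-300))  # guard against log(0)
    return total

def fit_lognormal_from_bins(counts, edges, mu_grid, sigma_grid):
    """Grid-search maximum-likelihood estimate of (mu, sigma)."""
    best = max(
        (bin_log_likelihood(counts, edges, mu, sigma), mu, sigma)
        for mu in mu_grid
        for sigma in sigma_grid
    )
    return best[1], best[2]

# Hypothetical promoter whose reads straddle the two middle bins of four.
edges = [-math.inf, 1.0, 2.0, 3.0, math.inf]  # assumed log10 bin boundaries
counts = [5, 45, 45, 5]                       # toy read counts per bin

mu_grid = [i / 20 for i in range(81)]               # 0.00 to 4.00
sigma_grid = [0.10 + 0.05 * i for i in range(40)]   # 0.10 to 2.05

mu_hat, sigma_hat = fit_lognormal_from_bins(counts, edges, mu_grid, sigma_grid)
print(f"estimated mean log10 expression: {mu_hat:.2f}, spread: {sigma_hat:.2f}")
```

In this toy case the promoter is assigned a continuous estimate on the boundary between the two middle bins rather than being discarded as ambiguous.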

Reviewer #4:

The manuscript by Lagator et al., examines a general framework to predict the constitutive level of gene expression for bacterial DNA, which constitutes one of the most fundamental processes in biology. Whereas previous work has typically been restricted to a limited sampling around a well-studied promoter (such as the lac operon), the authors push towards a model that can characterize all possible sequences, which they support with libraries containing 10,000 mutants each around two strong promoters (PR and PL) as well as a library containing 10,000 random sequences (36N). This work represents an important advance to the field, and will hopefully serve as the stepping stone for better models that include transcription factors and eukaryotic enhancers.

While the authors provide a clear and compelling story using their impressive data set, the one piece that I felt was missing was a connection with large gene expression data sets from other groups. How robust will their already-trained model be when applied in another context, where the experimental design is somewhat different? This will help other groups understand the potential gain in implementing this model for their own ends, and if any fine-tuning or model adjustments need to be performed, such information would be important to add. To this end, I suggest a few possible data sets below. The authors do not need to analyze all of them (although more is better), and they are welcome to substitute other data sets of their choice:

1) Hossain 2020 [https://doi.org/10.1038/s41587-020-0584-2] analyzed 4,350 bacterial promoters with a broad range of gene expression measurements varying by 820,000-fold. These promoters should be constitutive, although their random DNA sequences might occasionally give rise to transcription factor binding sites (as in the 36N library). The sequences and read counts are available in their Supplementary Data 3.

2) Urtecho 2019 [https://doi.org/10.1021/acs.biochem.7b01069] analyzed the constitutive expression from 10,000 promoters. Their setup ensures that no transcription factors are involved, providing a clean test for the model. You can find their expression data at the GEO Accession Number GSE108535 [https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE108535]. The file “GSE108535_sigma70_variant_data.txt.gz” contains the names of promoter variants and the average RNA/DNA expression levels of two biological replicates, while the file “GSE108535_barcode_mapping.txt.gz” contains the name of each promoter variant and its corresponding sequence (where the sequence is specified in the column ‘most_common’). Figure S12 and Table S1 from their supplementary information provide a visual schematic for how to put the pieces together.

3) Johns 2018 [http://doi.org/10.1038/nmeth.4633] analyzed 30,000 promoters mined from prokaryotic genomes, which would be a very interesting avenue to explore. These promoters may involve transcription factors (although their mining strategy might bias them towards constitutive promoters). Their sequences are given in Supplementary Table 1 and their expression measurements for E. coli are provided in Supplementary Table 3.

I suspect the authors are already planning to do this, but it would be very helpful if a program implementing the fully-trained model was provided (along with some examples of its usage), to help the community utilize these results.

eLife. 2022 Jan 26;11:e64543. doi: 10.7554/eLife.64543.sa2

Author response


While the reviewers found that the manuscript contains potentially interesting results, as you can see from the reviews appended below, they have many concerns. Most importantly, they are not fully convinced that the model accurately predicts expression levels and would like to see much more analysis of which extensions of the previous thermodynamic models are important for the accuracy of such predictions. They were also concerned that the result that promoters are common among random sequences is not novel and does not warrant the emphasis that the manuscript puts on this result. Therefore, we suggest that in a revised manuscript you:

(i) focus primarily on the claim that the extensions to the previous thermodynamic model are important for explaining the expression levels from random sequences and real promoters, and show in greater detail how each extension is supported by instances in your dataset;

(ii) quantify more explicitly the accuracy of predictions from your model following the suggestions of the reviewers, and specifically quantify the importance of the extensions to previous models for predicting the expression levels in your libraries and in native constitutive promoters;

(iii) test your model on at least one (preferably more) of the additional data sets suggested by Reviewer 4, and provide a clear summary of how your results compare with the data.

Please do also go through the reviews below and try to address all the additional points raised by the reviewers.

We thank the Editor and the Reviewers for their comments and suggestions. We have addressed each comment raised by the Reviewers (see below). In particular, we analyzed the performance of our model on three datasets suggested by Reviewer 4, and compared our model performance to that of the models presented in various papers, in order to demonstrate the cross-predictability of our model. We have also expanded on the description of different extensions we added to the existing thermodynamic model, and explored in more detail how each contributed to predictability.

However, while we appreciate that the Reviewers found the ‘evolutionary’ part of the manuscript less novel, we continue to strongly believe that it constitutes an important contribution to the field. This is because the question of whether expression emerges easily from random sequences is a quantitative – not a qualitative – one, as we argue below.

A few previous studies (Wolf, Silander, Nimwegen 2015 and Yona et al., 2018) have shown that random sequences often lead to expression, but have not offered a quantitative understanding of the phenomenon. For example, Yona et al., 2018, used a small dataset of only 40 sequences to show that ~10% of them have promoter activity. The small dataset inherently prevents one from drawing quantitative conclusions about just how likely random sequences are to lead to expression. To obtain a more quantitative understanding of this problem, one could use a standard energy matrix (such as the one obtained by Kinney et al., 2010) over random sequences and identify the likelihood of random sequences leading to expression. This is, in fact, exactly what we did in this study by using the Standard model, which represents a conservative null expectation for our studies.

In contrast to this null expectation, we find that the Extended model, which accounts much better for the experimental data, also predicts a significantly higher likelihood of random sequences leading to expression. This quantitative (as opposed to qualitative) understanding of σ70-RNA polymerase binding also allows us to make more precise statements about how promoters evolve from random (and non-random) sequences, allowing us to claim not only that promoters evolve rapidly, but that they evolve much more rapidly than previously thought. Similarly, we identified stronger selection against σ70-RNAP binding sites than was previously thought, as we find evidence of selection against binding sites even within promoter sequences (previous studies have reported selection against binding sites within coding regions only). Importantly, these effects are not small: the fraction of sequences that express above a detectability level is more than double using the Extended compared to Standard model, and the predicted rates of evolution of functional promoters can be faster by orders of magnitude (!) using the Extended compared to the Standard model.

Put together, our study achieves something unique: it uses a large-scale dataset to develop a model with higher predictability compared to state-of-the-art mechanistic models of gene expression, and then uses that model to improve our quantitative understanding of promoter evolution. In doing that, it points to a potentially huge effect that molecular details of binding of a protein (RNAP) could have on the evolvability of gene expression.

We have made numerous changes to the text to reflect and better emphasize what is novel, and in particular to emphasize the fact that we predict promoter evolution to be not “just” fast in general, but indeed much faster than what the previous state-of-the-art models would predict. We elaborate further on why we think this evolutionary aspect of our study is important in responses to the individual points raised by the Reviewers. The lines referenced in this letter correspond to the revised manuscript document with tracked changes.

Finally, we would like to apologize to the Editor and the Reviewers for the long time it took us to address all the comments. A combination of the co-authors living in different locations and changing work places (and career paths in some cases), as well as the Covid-19 pandemic introduced substantial delays.

Reviewer #2:

This paper builds a biophysical model to understand the contribution of the DNA sequence in σ70-RNA polymerase binding activity. To train their model, which takes into account intra-motif spacing as well as dependencies, they generate large mutant promoter libraries and measure expression from those. The title is a bit too general since the data and model are for a specific constitutive bacteriophage promoter, but nevertheless, the study has novel implications in evolution of gene regulation.

We thank the Reviewer for the positive assessment of our manuscript. We have amended the title to be more specific, clarifying that we focus on bacterial promoters.

Lagator et al., have modeled the binding of σ70-RNA polymerase (RNAP) to a constitutive promoter to understand the effect of the promoter sequence in regulating gene-expression. This is a biophysical model that extends the standard thermodynamic model of RNAP binding with its cognate -35 and -10 boxes in several ways: the aim is to account for a variable spacer between the two boxes, binding at non-cognate regions, hindering transcription by binding on the negative strand (opposite orientation), occlusive binding (causing the transcript to begin upstream and not contain the RBS), and relaxing the assumption of independence within the energy matrix by modelling pair-wise nucleotide dependencies. They create three mutant libraries originating from strong λ pR and pL promoters as well as a completely random (25% probability of each nucleotide) set of oligonucleotides to train their model parameters. The model shows an impressive predictive power on a held out set and predicts expression patterns of completely random sequences to a remarkable degree.

They further conclude that many non-promoter E. coli regions are only one bp mutation away from giving rise to measurable expression. They validate their predictions on promoters where the standard model predicts a different expression pattern from their model. The data from their experiments will be a valuable addition to the community in case people want to develop new models or test other hypotheses.

A few things are unclear:

1. The thermodynamic model still makes some assumptions which need to be clearly mentioned. The sum suggests there is at the most, a single RNAP molecule (or a binding site at a time) present on each promoter.

The Reviewer is correct to point out that our ansatz assumes that only a single RNAP molecule can be present at the promoter at a given time in a productive configuration. Once bound, the RNAP molecule initiates transcription and leaves a promoter faster than another RNAP molecule can find and bind to the promoter. In other words, we assume that RNAP concentration is not high enough for two molecules to be bound to the promoter simultaneously before they initiate transcription and clear the promoter. We explain this assumption in the main text (L117-119) and in the Methods (L804-806).

2. The authors learn the models by using multinomial logistic regression (MLR), where the response variable is the median observed expression for each bin (which are four/12 in total, depending on the promoter class). This is a little unusual, since the four expression values are not "categories", which is typically the expectation in MLR. In other words there is a definite order (or dependence) within the categories, the first bin having a lower expression than the second, etc. That information is lost/unused in MLR.

We implemented one-vs-rest MLR as a conservative choice, which would allow for a non-monotonic relationship between P_on and gene expression. An ordinal logistic model would force monotonicity by definition. In particular, for all the models without the clearance rate, if there were a binding state so strong that the expression is actually lower than for slightly weaker binding, we would be able to detect it (and we actively looked for any such effect; none exists). Despite using MLR, we find a monotonic relationship between P_on and expression, which we took to be one of our internal consistency checks.

3. It is not clear whether there are enough instances of spacer width varying in the mutant pR / pL libraries. If the mutations are single nucleotide changes (not deletions/insertions), would it not be highly unlikely that the library contains instances of the two boxes having different spacing?

We now include a table in Figure 1 —figure supplement 1, as well as text in L496-498, which provide further information about the variability of spacer lengths in our datasets. It is worth noting, however, that the best we can do on fully random sequences is show the spacer for the single strongest binding site in each sequence. The whole premise of the thermodynamic modelling used in this work is that many binding configurations on each sequence can matter. So, while the numbers presented in the figure may appear low e.g. for spacer lengths of -2 and +2, states with those spacers always contribute to P_on; the table simply reports cases where that contribution is the strongest. Note that, when fitting the Extended model, we validate that including spacer lengths of -2 and +2 significantly improves predictability.
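The single-occupancy assumption discussed under point 1 above, and the statement that many binding configurations contribute to P_on, can be illustrated with a generic textbook-style occupancy calculation. This is a sketch, not the fitted model of the manuscript: it assumes a hypothetical list of binding energies (in units of k_BT, measured relative to the chemical potential), one per candidate configuration on a single sequence, and it ignores clearance and occlusion. Because at most one RNAP is bound at a time, the bound states are mutually exclusive and their Boltzmann weights simply add.

```python
import math

def p_on_best_site(energies):
    """Occupancy if only the single strongest (lowest-energy) configuration is counted."""
    weight = math.exp(-min(energies))
    return weight / (1.0 + weight)

def p_on_cumulative(energies):
    """Occupancy summed over all mutually exclusive single-RNAP configurations."""
    total_weight = sum(math.exp(-e) for e in energies)
    return total_weight / (1.0 + total_weight)

# Hypothetical energies for four candidate configurations on one sequence
toy_energies = [2.0, 2.5, 3.0, 3.0]

print(f"best site only: {p_on_best_site(toy_energies):.3f}")
print(f"cumulative:     {p_on_cumulative(toy_energies):.3f}")
```

When several configurations have comparable energies, the cumulative occupancy exceeds the best-site value; when one site dominates, the two coincide.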

It will be best to mention clearly the assumptions in the cumulative model, mathematically. A more detailed explanation of the sum in equation 2 will be useful. See for example, in the Bintu et al., paper cited in the same section. Similarly, the MLR model should be described with an equation.

We thank the Reviewer for the suggestion. We have now included a set of summary equations in Figure 1, in order to better explain what the model is actually doing. We have also provided equations that describe the MLR model (L712-721).

Figure 1 C shows the learned energy matrix. There are other studies which use the standard model to learn their own matrix for RNAP from large mutant libraries. A comparison between the matrices will be useful, even if the promoters are different. In the same vein, this new model can be applied on those data to see if the expression can be better predicted. This will show the generality of the model.

We have included a comparison of our final energy matrix (obtained by fitting the Extended model on all three datasets) and those obtained by Kinney et al., 2010 (Figure 1 —figure supplement 2; for a visual representation see Author response image 1). The elements of the two matrices correlate strongly (r2=0.83). Note that the Standard model used in this study is, by construction, the same as that used by Kinney et al., meaning that, if we trained our models on their data, we would observe qualitatively the same improvement of the Extended over the Standard model as we observed in the PR library (Figure 3).

Author response image 1.


Does sequence in pR (and pL) contain also the annotated transcription start site (TSS)? Is there any chance that the TSS also gets mutated in their libraries? Since the -10 box is relative to the TSS the distance between the box and TSS is presumably important, which can be defined only with respect to the nucleotide (or a small neighborhood) at the TSS. Does the final energy matrix contain those regions? Figure 1C suggests not.

We thank the Reviewer for raising this complex question. On one hand, the original TSS for the PR and PL promoters was not mutated. We clarify this point in the Methods (L456-457). However, we explicitly model the possibility that RNAP can bind anywhere along the sequence, and some of those binding configurations would have a TSS in the mutated region. We do not model how the changes in the TSS might affect expression levels. However, the fact that the Extended model trained only on the PR library exhibits high predictability on the random sequences (36N library) gives us confidence that the effect of mutagenizing the TSS plays a smaller role in determining gene expression levels compared to the binding energy between RNAP and DNA, at least within our sequences.

I would have expected to see the Y axis of Figure 2B to be stratified across 12 bins. But that is not the case, which means I am missing something.

We thank the Reviewer for pointing out this omission in our explanation of Figure 2B (now Figure 3A). For the 36N library, because we have 12 bins, we could determine the mean expression (rather than using the median bin as we do in the PR library). We now clarify this in the figure legend (L1451-1453).

Is it not possible that the energy matrix values (those at the two boxes) change with the spacer? I understand that is not allowed in the current model.

This is a very good point, and we agree with the Reviewer that energy matrix values might change depending on the spacer length. More broadly, any two (or more) parameters in our model could depend on each other. However, understanding such dependencies in a highly multidimensional space is beyond the scope of our work, except for the dinucleotide interactions which we account for. Specifically, for the question raised by the referee, we expect that we would not have enough data to infer energy matrices reliably conditional on individual spacer lengths. We discuss this in more detail in the Methods section (L794-798).

Finally, the contribution of the dinucleotide interactions is not convincing. Perhaps the prediction accuracy on the held out set based on with and without including these terms will help clear that.

In line with Reviewers’ comment 14, we expanded the text and provided a Figure 2D, which includes detailed information about how each extension to the model (including dinucleotide interactions) individually affects predictability on the held out dataset.

Reviewer #3:

Lagator et al., extend previous biophysical models to predict gene expression of constitutively expressing E. coli promoters from their sequence. They provide evidence that taking into account that RNA-polymerase can bind in multiple configurations, including non-productive ones, significantly improves predictions. They also confirm and extend previous observations that functional promoter sequences are relatively abundant in random sequences.

The aim of this study is to develop a realistic biophysical model for predicting the expression of (constitutive) E. coli promoters from their sequence. From my understanding of the presentation, the main new results that the authors claim are:

1. They extend existing biophysical models to a more realistic model that takes into account that the spacers between the -35 and -10 sites are flexible, that one should not just focus on the best site in a region but take into account cumulative binding to all possible sites, and that one should also explicitly model that binding to some sites is not productive (e.g. binding to reverse-complemented sites on the opposite strand).

2. They confirm experimentally that these extensions can indeed have a major impact on expression, for selected promoter sequences.

3. This biophysical model accurately predicts the gene expression level of any sequence.

4. They show that a substantial fraction of short random sequences will act as promoters, i.e. drive significant expression.

5. There is significant evidence for selection against ubiquitous expression, i.e. sigma70 binding sites are depleted both in gene bodies and in intergenic regions.

In my opinion the strongest parts of the paper are points 1 and 2. The extensions of the 'standard' biophysical model are all very sensible and the model is implemented in a mostly reasonable way. None of the ingredients in the model are particularly novel or surprising, but that's not a bad thing. I do wonder why the authors did not include features such as allowing only one of the two 'feet' of the sigma70 to bind (like in the Einav and Phillips PNAS 2019 paper) or non-specific binding (although maybe the clearing rate that the authors introduce plays a role analogous to non-specific binding).

The reviewer has a valid point. Extending the configuration space even further to allow binding of single feet would indeed be possible, and had we started to work on this model after the Einav and Phillips publication, we would have included it in our model. However, by 2019, we had already fitted our model and completed all downstream research, and had no indication that the effect of including single-foot binding would be large.

To investigate the potential contribution of this effect at this stage, we split up the effects coming from each of the feet by calculating the free energy of binding of each foot individually across each sequence in our 36N library, and compared it to the binding energy when both feet bind. Author response image 2 shows the result of this analysis: the FVE in the case of both feet binding (right plot) is much larger than the FVE of either of the single feet binding. We report raw, non-weighted r2’s, as well as r2’s weighted in the same manner as throughout the manuscript, so that the weights of data points in each bin are inversely proportional to the number of sequences that fall in that bin.
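The bin-balanced weighting described above can be illustrated with a minimal sketch. It assumes that each sequence carries a bin label, a predicted value and a measured value; the function names and the toy numbers are hypothetical and are not taken from the analysis code used for the manuscript.

```python
from collections import Counter


def bin_balanced_weights(bin_ids):
    """Give each data point a weight of 1 / (number of points in its bin),
    so that every bin contributes the same total weight."""
    counts = Counter(bin_ids)
    return [1.0 / counts[b] for b in bin_ids]


def weighted_r2(x, y, w):
    """Squared weighted Pearson correlation between x and y."""
    total = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / total
    my = sum(wi * yi for wi, yi in zip(w, y)) / total
    cov = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y)) / total
    vx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x)) / total
    vy = sum(wi * (yi - my) ** 2 for wi, yi in zip(w, y)) / total
    return cov * cov / (vx * vy)


if __name__ == "__main__":
    # Hypothetical toy data: three sequences in bin 0, one sequence in bin 1.
    bins = [0, 0, 0, 1]
    predicted = [0.1, 0.3, 0.2, 1.5]
    measured = [0.2, 0.2, 0.4, 1.4]
    weights = bin_balanced_weights(bins)
    print("raw r2:     ", weighted_r2(predicted, measured, [1.0] * len(bins)))
    print("weighted r2:", weighted_r2(predicted, measured, weights))
```

With equal weights the function reduces to the ordinary squared Pearson correlation, so the raw and weighted values can be compared directly.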

Author response image 2.

Our analysis of course does not dismiss the reasonable idea that the binding of one foot can individually lead to transcription, nor does it address the possibility of avidity between the binding of two feet (as described by Einav and Phillips), but rather suggests that it may have a significant – but maybe not very large – influence, comparable to other mechanisms that we included. These results are also consistent with the fact that the energy penalties in the -10 box have a higher mean value than those in the -35 box (2.02 vs 1.61), and hence contribute more to the explanatory power.

It also was not entirely clear to what extent binding at one site precludes binding at other sites in the model. This could be explained better.

Our model assumes that RNAP concentrations are not high enough for two molecules to be bound to the promoter simultaneously before they initiate transcription and clear the promoter; under this assumption the Pon can be written as a sum over single-molecule Boltzmann weights. We do, however, explicitly sum over all single-RNAP binding configurations on the piece of DNA that constitutes our promoter; that includes binding in configurations that are productive and the ones that are not (e.g., binding on the reverse complement). The logic here is that, on the productive strand, all bound RNAP molecules are released quickly and move in the same direction. In contrast, the released RNAP on the reverse-complement could clash with the productive RNAP. Formally, if the unproductive binding interferes with the probability of RNAP to bind productively via exclusion, the unproductive terms in the partition sum are still linear order in concentration (as opposed to terms where two RNAP are bound simultaneously, which are quadratic and are thus negligible corrections in the regime we consider). Note that the unproductive binding on the reverse complement has the smallest impact on predictability. We provide further clarification on this in the main text (L117-119) and methods (L804-806).
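The single-occupancy argument above can be written schematically as follows. This is only a sketch in assumed notation: the symbols $w_c$, $E_c$, $\mu$, $\mathcal{C}$ and $\mathcal{C}_{\mathrm{prod}}$ are introduced here for illustration and may differ from those in the Methods, and promoter clearance is not represented.

```latex
\[
w_c = e^{-(E_c - \mu)/k_B T}, \qquad
Z = 1 + \sum_{c \in \mathcal{C}} w_c, \qquad
P_{\mathrm{on}} = \frac{\sum_{c \in \mathcal{C}_{\mathrm{prod}}} w_c}{Z}
\]
```

Here $\mathcal{C}$ denotes all single-RNAP binding configurations (productive and unproductive) and $\mathcal{C}_{\mathrm{prod}} \subset \mathcal{C}$ the productive ones. Each weight $w_c$ is proportional to $e^{\mu/k_B T}$, and is therefore linear in RNAP concentration, whereas a term with two RNAP molecules bound simultaneously would carry a product $w_c w_{c'}$ and be quadratic.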

But apart from this the model is very reasonable. The only part I am sceptical about is the dinucleotide modeling, which is based on a very ad hoc and rather baroque procedure.

We agree with the Reviewer’s assessment that the dinucleotide modelling is not carried out using a clean “textbook” inference procedure that we used for other model extensions. This is because of the extremely large number of possible dinucleotide interactions which we did not manage to learn simultaneously in a robust way even with regularization; each pairwise interaction also adds non-trivial computational time and including dozens of interactions already proved computationally limiting. In turn, we devised a more ad hoc method that first effectively performs feature selection by finding significant interactions, after which it attempts to estimate their effect. Our main objective when investigating pairwise interactions was to assess how much they could affect predictability and see whether there is any obvious pattern between interactions that we could identify. First, we found that the effect of dinucleotide interactions on the predictability is fairly limited (Figure 2D). Second, there appears to be an interesting pattern where interactions tend to feature basepairs outside of the canonical -10 and -35 feet, as we mention in the main text.

I particularly liked the experimental confirmations (Figure 1 – supplement 1) that support that several extensions of the model can indeed have a major impact on expression for selected promoter sequences. In my opinion these results should probably have been the main focus of the paper, maybe together with some more rigorous quantification of what role the various extensions play in predicting the expression levels of the random sequences. If, in addition, the authors had provided evidence that these extensions are also important for expression in native constitutive E. coli promoters, I think this could have been a very nice paper.

We thank the Reviewer for pointing out to us the importance of the various extensions we included in the model. We now include a new figure (Figure 2) to highlight the experimental confirmations. We quantify their relative impact on predictability in a new table (Figure 2D) and include a brief discussion of this in the main text (L129-137, L226-229).

However, currently the focus seems to be on two aspects that I am much less convinced about. The authors in several places make the claim (point 3 above), that their model accurately predicts expression levels. I do not feel this claim is properly substantiated. First, to give some context, there is a very large body of work going back to the 1980s that aim to predict promoter sequences in E. coli directly from sequence, and with just a little bit of googling one will uncover plenty of works in the literature that appear to claim to `solve' this problem, e.g. Cassiano and Silva-Rocha mSystems 2020 present a recent benchmarking of a whole host of promoter prediction tools. I understand that these methods do not specifically aim to predict promoter strength but presumably one could use the 'score' that they calculate as a predictor of promoter strength and it is not clear to me how poorly such predictors would perform in comparison with the model presented here. In addition, there are also several previous papers that specifically claim to successfully predict promoter strength from sequence using either biophysical models close to the ones used here (e.g. Brewster et al., PLoS Comp Biol 2012, Einav and Phillips PNAS 2019) or more general machine learning methods, e.g. Mulligan et al., Nucl Acids Res 1984, Weiler and Wecknagel J Theo Biol 1994, DeMey et al., BMC Biotechnology 2007, Meng et al., Quantitative Biology 2017, and the recent Zhao et al., BioRxiv 2020.06.25.170365. Does the model of the authors significantly improve over these approaches? It is not clear to me. Especially the Einav and Phillips method is curtly dismissed in the discussion, but it is not clear to me whether that model, if trained on this data, would really perform much worse than the current model.

We agree with the Reviewers that we needed to do a better job in demonstrating the predictive power of our model. To this end, we have now analyzed three published datasets – Johns et al., 2018, Urtecho et al., 2019, and Hossain et al., 2020. We selected these datasets because they involved large mutant libraries with 1000s of sequences, and because each of them developed a model to try to predict expression levels directly from sequence. We found that our model led to significant improvements in predictability compared to the models presented by Johns et al., and Hossain et al., while underperforming compared to the machine learning approach used by Urtecho et al. It is worth pointing out that this machine learning model is not designed to predict expression from any sequence, as the dataset they based the model on is modular (i.e. they make all possible combinations of pre-set -10, -35, UP element, and spacer sequences). The same applies for the Einav and Phillips 2019 model, which in principle cannot predict expression levels from random sequences because the model is based on the Urtecho et al., dataset. We discuss these findings in the new section of the manuscript (Testing Model Generalizability starting at L256).

We would specifically like to highlight that the tests of our model did not include refitting the model to different published data sets, but solely an evaluation of the performance of the already fitted model (with the single parameter readjustment for possibly different RNAP concentrations that was inferred for each data set separately). Given our previous attempts (unpublished) with machine-learning-derived models, such across-dataset generalization (as opposed to within-dataset random splits into training and validation data) is a very powerful test that expressive machine-learning models can easily fail even though they show high within-dataset generalization.

We also analyzed the Zhao et al., dataset, but identified some issues with the data. In particular, we found that a large fraction of the reported mutations occur in and around the ribosomal binding sites. Our model does not account for the effects of mutations in ribosomal binding sites, resulting in low predictability. For this reason, we decided not to include this analysis in the manuscript, but provide Author response image 3 for your and Reviewers’ reference.

Author response image 3.

We did not analyze any of the other datasets mentioned by the Reviewer above, because all of them contained only a small number of promoters (not more than 100), preventing the authors from developing persuasive models for predicting gene expression levels from sequence. Finally, we did not develop a predictive method based on bioinformatics approaches that aim to identify promoters (as opposed to predict gene expression levels), because these models (as outlined by Cassiano and Silva-Rocha 2020) assume that a promoter has a single dominant RNAP binding site and do not account for the energy impact of spacer length variability. While constructing a model along these lines would be possible, it would be very labor intensive and would most likely look very similar to the Standard model we used.

More broadly, we would like to point out – and we now emphasize much more clearly throughout the manuscript (for example, L89-99, L157-170) – that our study implicitly already compares the performance of the Extended model to various other studies (such as Brewster et al., 2012 and Kinney et al., 2010), through the comparison with the Standard model. The Standard model works under the same assumptions used by all existing mechanistic models aimed at predicting gene expression levels from any sequence (like Brewster et al., 2012 and Kinney et al., 2010). Similarly, it works under the same assumptions as most bioinformatics models for predicting promoters, as they assume that a single RNAP binding site is what defines a promoter. Hence, the comparison between the Extended and the Standard model implicitly already shows that the Extended model performs better than the existing mechanistic approaches. Moreover, the Standard model is given the “best chance” since we (i) fully optimize it on our datasets; (ii) allow it to use a flexible spacer (but with no energy penalty for different spacer lengths); (iii) allow it to use a larger binding footprint than the standard 6+6 basepairs or the matrix by Kinney et al. In this sense, any excess performance by the Extended model over the Standard model that we report is a conservative estimate for what could be expected if we were to use a previously published mechanistic model with standard features.

This reasoning, of course, is true only for a comparison with mechanistic approaches. While some machine learning approaches might outperform our model on the specific datasets on which they were trained, we are not aware of such models that work for any random sequence and generalize well across datasets; in particular, irrespective of their predictive power, such models would lack the mechanistic interpretability that we desired.

In response to this, and other, Reviewer comments, we have softened our language and removed the use of the subjective term ‘accurate’.

I understand it is kind of a pain to do extensive comparison with previous methods but in order to make real progress on understanding what determines the expression levels of constitutive E. coli promoters, then at a minimum I would have expected a very clear quantification of exactly how well the model performs. What one would want to know is: If I feed the model a random sequence, how accurately does it predict its expression level? Or, how well can the model predict the variation in expression levels of constitutive promoters across the E. coli genome? How much accuracy will be lost when various of the extensions are removed, either on native promoters, or on random sequences.

We agree with the Reviewer that we should have been clearer and done a more thorough job in validating our model. The Reviewer raises three ways in which we can provide more details and validations of Extended model performance:

a) “If I feed the model a random sequence, how accurately does it predict expression levels” – we explicitly test this by validating the Extended model on the validation portion of the 36N library (and now clarify in various locations throughout the manuscript that this is explicitly what we have done).

b) “How well can the model predict the variation in expression levels of constitutive promoters across the E. coli genome” – to this end, we now include a comparison of the Extended model performance on three previously published datasets, and find that the model performs well on all of them (see section Testing Model Generalizability). It is worth noting that this performance was achieved without any fitting of the Extended model to these new datasets (with the exception of chemical potential). We did not examine the performance of our model on constitutive E. coli promoters because the existing datasets measuring expression levels of constitutive promoters come from RNA-seq experiments, which are notoriously imprecise.

c) “How much accuracy will be lost when various of the extensions are removed” – we provide more detail on how each model extension contributes to overall predictability (Figure 3D, L129-137, L217-229).

In order not to overstate our findings, we toned down the language, especially by minimizing the use of the phrase ‘accurate predictions’ and instead only talking about ‘predictions’.

However, we do not really get such an analysis. The authors only present one scatter and Pearson correlations between the predicted log[Pon] and measured log[expression] for various versions of the model. Crucially, I do not think that a correlation of r2 ~ 0.75 on a dataset where expression levels vary over a 1000-fold range can be reasonably described as `accurate prediction’. Just as a back-of-the-envelope, if both data and model have a standard-deviation in log-expression levels of about log[100], then r2 = 0.75 corresponds to the error between model and data having standard-deviation of log[10]. That is, unless I made a mistake in my back-of-the-envelope, the model’s predictions would typically be off by 10-fold (and the scatter does seem sort of consistent with this). This is also in line with the fact that the bins for the P_R data are 10-fold apart in expression level, i.e. the training of the model is done on a dataset in which promoters whose expression levels are within 10-fold of each other are effectively treated as equally expressed. I do not think that this level of precision can be called accurate prediction of expression levels. For example, using very similar reporter constructs, the difference in expression of the median and 95th percentile of native E. coli promoters is about 20-fold (i.e. see Wolf et al., Figure 1B), so this model could barely tell these apart.

The reviewer correctly notices that the PR (and PL) datasets consist of wide bins, which encompass sequences potentially 10-fold apart in expression level. Indeed, being able to correctly predict bin identity of a sequence with a 10-fold error should not have been particularly surprising. However – and in part because of that – we also apply the same biophysical model to an independently obtained and more precise dataset (36N), where we report higher performance, suggesting that the performance on PR / PL datasets is limited by the measurements themselves. Moreover, even a model fitted only on the PR dataset performs well on the dense and independent 36N library, indicating that the error of the model predictions themselves cannot be as high as 10-fold. Below, we make the argument more precisely for the 36N dataset.

Let V be the variance of the measured log10(expression), and MSE the mean square error of the prediction of the same quantity. The fraction of variance explained (r2 in the case of a linear fit) is then FVE = 1-MSE/V, so a single theory is evaluated differently depending on the variance of observations V. Our ideal dataset is uniformly sampled across all possible expressions (in particular, the distribution over expressions is not Gaussian), and to approximate that, we introduce weights so that the total weight in each log-equidistant expression bin is equal. For our random library (36N) weighted in this way, the variance of log10(expression) V = 1.016. A value of FVE=0.75 yields the MSE of ~0.25, i.e. the error of ~0.5, which corresponds to a 3-fold error in expression. This might not be outstanding, but one should think of it in the context of the 1000-fold dynamic range of measurable expressions that we probe.

For reference, a uniform distribution of log10(expression) spanning a 1000-fold range has V=0.75; FVE of 0.75 in this case would yield an error of 0.43 for log10(expression), i.e. 2.7-fold for expression; this is close to the calculation for our data, above. The reviewer does the same back-of-the-envelope calculation, assuming standard deviation log10(100), i.e. V=4, thus arriving at a 10-fold error. We agree such a precision would be much less remarkable than what we can report.
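The arithmetic in the two preceding paragraphs can be reproduced with a short sketch. It uses only the relation FVE = 1 - MSE/V stated above and takes the square root of the MSE as the typical error in log10(expression); the function name is hypothetical.

```python
import math


def fold_error(fve, variance):
    """Typical fold error in expression implied by a fraction of variance
    explained (fve) and a variance of log10(expression), using
    FVE = 1 - MSE / V and error = sqrt(MSE)."""
    mse = (1.0 - fve) * variance
    return 10.0 ** math.sqrt(mse)


if __name__ == "__main__":
    # Variance of a uniform distribution over a 1000-fold range of log10(expression).
    uniform_variance = math.log10(1000.0) ** 2 / 12.0  # = 0.75

    print(fold_error(0.75, 1.016))             # weighted 36N library, about 3.2-fold
    print(fold_error(0.75, uniform_variance))  # uniform reference, about 2.7-fold
    print(fold_error(0.75, 4.0))               # reviewer's assumption, 10-fold
```

The three calls differ only in the assumed variance V, which is why the same FVE of 0.75 maps onto errors ranging from roughly 3-fold to 10-fold.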

At this point we would like to thank the reviewer for challenging us to think of performance in terms of the errors, rather than r2. Consequently, we have included this discussion in the main manuscript (L232-234) in order to explicitly clarify to readers what our r2 = 0.8 means for individual errors and model accuracy. We make it explicit that it corresponds to a 3-fold error in a dataset spanning expression levels from very strong promoters down to undetectable. Among reported promoters in E. coli, which naturally have a narrower range of expression, the same error yields a much lower FVE.

Another issue is that it is not clear which extensions of the model are the most crucial for the accuracy of the predictions. I was surprised to see (in Figure 2 – source data 1) that, apparently, by far the biggest improvement on the random sequences (36N dataset) comes from including flexible spacers and that cumulative binding and occlusive binding adds rather little. This appears somewhat at odds with the narrative that the authors present in the paper that stresses the importance of cumulative binding.

We expanded the discussion of the relative contribution of each extension (L129-137, L217-223), added a new figure to the main text (Figure 2, especially panel D) and have also altered the language throughout the manuscript so as not to insinuate that cumulative binding is more important than other extensions, especially when compared to spacer flexibility. It is worth pointing out that cumulative binding enables multiple spacer length configurations to co-exist within the model, compared to the Standard model which allows only for a single configuration. Hence, the two are intrinsically linked. However, the Reviewer is right to point out that when evaluated individually, spacer flexibility contributes more to overall predictability of our datasets. We see cumulative binding as a necessary basis on which other effects may or may not build.

Importantly, cumulative binding plays a bigger role in determining lower expression levels (see Figure 4 —figure supplement 2). This means that the total r2 contribution across all sequences could be small while the impact of cumulative binding on weakly expressing sequences is large. Furthermore, cumulative binding has an effect of shifting the mean expression of a random sequence to higher values, with clear evolutionary consequences that we explored in Figure 4. Specifically, the contribution of cumulative binding to low expression levels plays a key role in promoter evolution by providing a greater likelihood of a random mutation leading to an increase in expression, especially when evolving from non-functional sequences (L347-351).

In short, although I would like to believe that this reasonable biophysical model improves a lot over previous approaches in the literature, I don't think a convincing case has been made that this is really a major advance in accuracy of promoter strength prediction.

We hope that the additional analyses included in the revised manuscript persuade the Reviewer that our model either performs better than the existing approaches, provides more mechanistic insights into the functioning of promoters, or both. We have expanded the explanation of the model to include the arguments presented by the Reviewer throughout our revised manuscript.

Finally, regarding points 4 and 5, I do not understand why the authors place so much emphasis on the observation that functional promoters (i.e. sequences that drive gene expression) are common among random sequences, because this is neither novel nor surprising. It has been long-known that the 'position specific weight matrices' that model sigma70 binding sites have low information content, e.g. Schultzaberger et al., Nucl Acids Res 2007 estimate it to be about 6.5 bits. Given this, one would very roughly expect a site every 100 bp in random sequences. In fact, this was one of the reasons why, in Wolf et al., (eLife 2015), we were confident that we would easily find a large diversity of functional promoters in a library of random sequences. This expectation was confirmed by our measurements as one can see from Figure 1B of Wolf et al. That is, after selecting the 5% of highest expressing cells from the random library, most of them express at a reasonable level. At the time it didn’t even occur to us to mention this as a surprising observation. When the paper of Yona et al., (Nat Comm 2018) appeared, we realized that we misjudged this, and that for many people it WAS surprising that functional promoters are common even in relatively short random sequences. However, now that Yona et al. have an entire paper focused on this observation, including showing depletion of sigma70 sites within gene regions, there is no more novelty to this observation.

We appreciate the concerns that the Reviewer has regarding the novelty of the ‘evolutionary’ set of results presented in our manuscript. It is clear that in the original submission we failed to appropriately emphasize the evolutionary relevance of our results.

First, and most importantly, we specifically compare the predictions about promoter evolution derived from the Extended model to the predictions from the Standard model. This is precisely because the Standard model already captures exactly the assumption that the Reviewer and Yona et al., are talking about – it reflects the standard, textbook view of how bacterial promoters work. Namely, the prediction that you would get roughly 1 RNAP binding site in every 100bp is based on the standard position weight matrices of σ70-RNAP and the expression expectation is based on a straightforward view of how binding maps into expression. The Extended model, which captures various additional features of promoter architecture, predicts many more expressive binding configurations compared to the Standard model, which shares the assumptions discussed by the Reviewer. We are thus not making a verbal claim about whether there are “a lot” or “a few” expressing sequences among random sequences per se (such a qualitative statement is anyway somewhat arbitrary in what is considered “a lot”); rather, we are making a quantitative and comparative statement that the Extended model predicts much more expression from random sequence relative to the Standard model – even if the latter already predicts “a lot”. RNAP binding is more promiscuous than predicted by the Standard model. For quantitative details on how much faster promoter evolution is when predicted with the Extended model, see Figure 4D.

Second, evolutionary simulations in Figure 4D show further that such differences between the Standard and Extended model can be drastically potentiated by the evolutionary dynamics. Even though we only expect ~3-fold change between Standard and Extended model in the fraction of sequences that measurably express (Figure 4A), this can map into multiple-order-of-magnitude changes in the time-to-evolve under the SSWM evolutionary model. This is because of the multitude of expressing states and evolutionary paths that are available to the Extended model but forbidden to the Standard model. This is most clearly seen when expression has to evolve under constraints (Figure 4D, when only a small region is mutated). This point is not trivial at all: in Tugrul et al., 2015 we showed that the predicted dynamics of evolution of “standard” TF binding sites (with no spacer flexibility and other features of the Extended model) is worryingly slow and difficult to square with the reported estimates for the timescales of regulatory evolution based on comparative genomics studies. There, the authors failed to identify biophysical mechanisms that would enable faster evolution of regulatory sequence. Here, in contrast, starting with data, we see that the particular properties of RNAP binding captured by the Extended model do result in much faster promoter evolution (compared to the Standard model expectation). Therefore, we believe that these results provide a much wider relevance to our work than simply reporting improved predictability due to the inclusion of a host of biophysical mechanisms into the model.

We have now changed the text in multiple places to reflect and emphasize these points (most substantively in L342-351).

It is also worth pointing out that we provide a deeper basis for the ‘evolutionary’ findings that we report compared to Yona et al., while also exploring several novel aspects of promoter evolution. For example, we show not only that σ70-RNAP sites are selected against in the coding regions (which was shown by Yona et al., as well) but also within promoter regions themselves, even when we only use the list of verified promoter sequences from RegulonDB. To us, this seems like a novel finding, that corroborates the promiscuity of RNAP binding under the Extended model. Similarly, it seems relevant to report findings about promoter evolution even if they show the same results as Yona et al., did, because we base our findings on a vastly larger dataset combined with a model capable of actually explaining much of the diversity we observe in that data, as opposed to being based on a sample size of ~40 mutants.

For all of these reasons, we believe there is substantial novelty in our results. We have changed the manuscript in a variety of places to better emphasize what is novel about our results and to properly explain that the Standard model actually represents the traditional view of promoters (most substantively in L89-99).

The observations about the frequency of occurrence of promoters in random sequences could have been more interesting if it had been more meaningfully quantified. That is, any sequence will presumably drive expression at some nonzero rate. An interesting (and meaningful) quantitative question is how likely it is that a random sequence drives expression at rate > x, as a function of x. Maybe the authors' data and model could give a reasonable answer to this question.

This is a great suggestion and we have now included an additional plot in Figure 4A, which shows the cumulative distribution function (CDF) for the Extended model predictions of expression from random sequences (as well as the CDF for the experimental flow cytometry measurement of the 36N library). This plot shows the likelihood of finding a random sequence with measurable expression above any set threshold.
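As a purely illustrative aid, the quantity the Reviewer asks about (the fraction of random sequences expressing above a threshold x, as a function of x) is the complement of an empirical CDF. The following is a minimal sketch, not the code used for Figure 4A; the function name and the toy expression values are hypothetical and are not taken from our data.

```python
from bisect import bisect_right


def fraction_above(values, x):
    """Fraction of sequences whose expression strictly exceeds x (1 - empirical CDF)."""
    ordered = sorted(values)
    n_at_or_below = bisect_right(ordered, x)  # number of values <= x
    return (len(ordered) - n_at_or_below) / len(ordered)


# Made-up expression values in arbitrary units, for illustration only.
toy_expression = [0.1, 0.2, 0.2, 0.5, 1.0, 2.0, 4.0, 8.0]

# Evaluate the curve at a few thresholds.
survival_curve = [(x, fraction_above(toy_expression, x)) for x in (0.2, 1.0, 4.0)]
```

Evaluating the function on a grid of thresholds gives the full curve, so no single cut-off between expressing and non-expressing sequences needs to be chosen in advance.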

However, instead the authors simply treat this problem as if there is just a binary question whether a promoter is expressing or non-expressing. Moreover, they define the threshold between expressing and non-expressing in a way that is not biologically meaningful but seems determined by instrument precision. I actually had a hard time understanding how precisely 'measurable expression' was defined. My understanding is that, for the P_R and P_L libraries, it is defined as any expression higher than the 99th percentile of measurements for cells carrying no YFP. But how it is defined for the 36N library (with another reporter) was unclear to me.

We apologize that our explanation of how we defined ‘measurable expression’ wasn’t clear. The threshold is defined the same way for all three libraries – as the 99th percentile of the distribution of fluorescence measurements from cells carrying no fluorescence marker of any sort. The only difference is that in the 36N library, we include one additional cell sorting gate below the detectability threshold. In modelling, we had to assume that the thermodynamic assumptions behind the model apply the same way to sequences below measurable threshold. We have now clarified these points, both as they relate to experiments and model, in the legends to Figures 3 and 4, as well as the Sort-seq experiments section of the Methods (L520-523, L526-528).
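The threshold definition described above can be sketched as follows. This is an illustration under stated assumptions, not our analysis code: the nearest-rank percentile convention, the function names, and the toy control values are all hypothetical choices made for the example.

```python
import math


def percentile_nearest_rank(values, q):
    """q-th percentile (0 < q <= 100) using the nearest-rank convention (an assumption)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def is_measurable(fluorescence, control_fluorescence, q=99):
    """True if a measurement exceeds the q-th percentile of the marker-free control."""
    threshold = percentile_nearest_rank(control_fluorescence, q)
    return fluorescence > threshold


# Toy control population (cells with no fluorescence marker), arbitrary units.
toy_control = list(range(1, 101))
threshold = percentile_nearest_rank(toy_control, 99)
```

Other percentile conventions (for example, linear interpolation) would shift the threshold slightly but would not change the logic.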

We do appreciate the Reviewer’s comment that the simplistic language of whether a promoter is present or not (as opposed to quantitative distinctions in its strength) is inappropriate, and we have carefully reworded a number of sentences to clarify.

In any case, it appears that 'measurable expression' is defined in terms of the auto-fluorescence level that the cells have in the FACS machine. Unfortunately, this is not a biologically meaningful cut-off. In fact, a substantial fraction of E. coli promoters express below the detection limit in the FACS (i.e. not significantly above auto-fluorescence).

We agree with the Reviewer, and have now added a caveat explaining this point in the main text. Specifically, we explain now in the main text that we assume that the energetics of σ70-RNAP binding to DNA drive promoter activity below the instrument detectability threshold as they do above it (Figure 3 legend – L1456-1458).

In summary, I think there are some very good ideas in this paper. I in principle like the biophysical model, and the experimental validations of the model's extensions I thought were very nice and, when extended, could make a compelling paper. Unfortunately, the current focus of the paper appears to be on the frequent occurrence of promoters in random sequences (which is not novel) and claims that the model is a major advance in predicting gene expression from sequence, which I do not feel are substantiated by the results presented.

1. I had real trouble understanding the processing of the 36N dataset. I understand that, because in the 12-bin dataset you sort until you have a certain number of cells in the bin, you need to correct for this when estimating the relative fractions of cells with a given promoter that go to different bins. But I do not understand the approach with the spike-ins. Aren't there, in addition to the spike ins, also an unknown number of reference promoters coming from the sorting?

We understand the Reviewer’s confusion about the spike ins, as we hadn’t explained this procedure well enough in the original submission. Now we clarify in more detail in the Methods (L645-648) how we introduced the spike ins – the spike ins were done using a reference Λ PR promoter, while the plasmid used for cloning the 36N library (wildtype plasmid) did not carry the Λ PR promoter. Hence, we could easily distinguish between the reference plasmid we added, and the wildtype plasmid that was not cloned properly and was hence not carrying promoter mutations.

2. line 81: "even for constitutive promoters, where σ70-RNAP binding solely determines gene expression levels" Has this really been established? There could be effects of differential binding of topoisomerases that undo supercoiling.. or differential weak binding by other TFs.. and so on. I don't think we really know these levels are solely determined by sigma70-RNAP.

We now discuss in the main text the caveat identified by the Reviewer (L80).

3. Figure 1 – supplement 1.

Why is detectability threshold higher in B than in A? I found it generally quite frustrating that it is hard to figure out how exactly this 'measurable expression' threshold, which plays such a crucial role, is defined.

We thank the Reviewer for spotting this error. There was a formatting error with the figure which we have now fixed. We have also included an explanation for how ‘measurable expression’ was defined for this specific set of measurements in the legend to Figure 2 (which used to be Figure 1 —figure supplement 1), as it is defined differently for populations measured in a flow cytometer and those measured in a plate reader. Having said that, the logic behind the definition is the same – measurable expression being defined by the fluorescence level of the strain carrying no fluorescence markers.

4. 515-516: Finally, we required the distribution of expression bins to be as unambiguous as possible (Figure 2 —figure supplement 1D,E).

Why is this done? The way I understand it, this throws out promoters whose expression levels straddle two bins. I guess you want to throw those out because the fitting of the model to only observations in 4 bins only is more accurate when using only promoters that fall sort of in the middle of those bins, but I find this quite unsatisfactory. Would it not have been possible to fit a log-normal to the relative fractions of counts in the different bins and get continuous expression estimates?

By requiring unambiguity, we only meant that we discard sequences with standard deviations larger than 0.5, followed by the requirement that the mean, median, and mode of the distribution across bins should point to the same bin. For the PR library, the first step reduces the number of sequences from 29020 to 22884, and the second one further to 22769. The corresponding numbers for the PL library are 6415, 4239 and 4222. We’ve amended the text to make this fact more explicit (L699-706).
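The two filtering steps can be illustrated with a short sketch. It assumes that each sequence is summarized by its read counts per expression bin, that the standard deviation is taken over bin indices, and that the mean "points to" a bin when it rounds to that bin; these details, along with all names below, are assumptions for illustration rather than a description of the deposited code.

```python
def bin_stats(counts):
    """Mean, standard deviation, median and mode of bin indices weighted by read counts."""
    total = sum(counts)
    mean = sum(b * c for b, c in enumerate(counts)) / total
    variance = sum(c * (b - mean) ** 2 for b, c in enumerate(counts)) / total
    # Median: first bin at which the cumulative count reaches half of all reads.
    cumulative = 0
    median = len(counts) - 1
    for b, c in enumerate(counts):
        cumulative += c
        if 2 * cumulative >= total:
            median = b
            break
    mode = max(range(len(counts)), key=counts.__getitem__)
    return mean, variance ** 0.5, median, mode


def is_unambiguous(counts, max_sd=0.5):
    """Step 1: discard if SD > max_sd. Step 2: mean, median and mode must agree on a bin."""
    mean, sd, median, mode = bin_stats(counts)
    return sd <= max_sd and round(mean) == median == mode


# Toy read counts across the four bins ('0' to '3'); not real data.
tight = [0, 9, 1, 0]   # reads concentrated in bin 1
spread = [5, 0, 0, 5]  # reads split between the extreme bins
```

In this toy example `tight` passes both steps, while `spread` is rejected at the first step because its standard deviation is 1.5.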

As for the possibility of using the relative fractions, we opted not to follow that route because of the abundant possibilities for introducing biases during library preparation. From the moment the cells are sorted into their appropriate vials to when we obtain their sequences, they undergo several passages (overnight growth and subsequent dilution) to allow populations to grow to large enough numbers for sequencing to be more robust. However, such passaging can introduce a bias in the form of over- or under-representation of a mutant in one bin compared to other surrounding bins. PCR adds further potential for departure from the assumption that the reads are independent and identically distributed. An additional benefit of our approach is that it simplifies the analysis and interpretation. This explanation has now been added to Materials and methods (L596-602).

5. 596: "We define 60:20:20% splits of each of our libraries into three disjunct datasets Figure 2 – Source Data 3."

Are these splits random? A notorious problem in cross-validation based fitting is when highly correlated datapoints occur in both training and test data. For example, in this case you could have promoters that are virtually identical, and with equal expression, in both training and test datasets. This could lead to misleadingly high performance on the test dataset. I actually doubt this is happening here, but maybe it is worthwhile checking this.

Due to the nature in which we process our libraries, we are dealing with unique sequences. Namely, we pick each random sequence that appears in the entire sequencing output only once, and use the median bin to represent its fluorescence (in the PR and PL libraries) or the actual mean bin (36N library). This sequence then gets assigned either to the testing or training data, but not both; hence, in our dataset, the splits are disjunct by design.

Sequences in the testing and training sets are correlated (but not identical) in the PR and PL libraries by construction, since all sequences are derived from the same wildtype; sequences are uncorrelated in the 36N library. The fact that our model performs well when tested also on different libraries from the PL on which it was fit should provide strong evidence against any effects that the referee is referring to.

We clarify this point in more detail in the Methods section data splits and fitting procedure (L689-691).
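A disjunct 60:20:20 split of unique sequences can be sketched as below. This is a minimal illustration, not the procedure in the deposited code: the random shuffling, the fixed seed, the labels "training", "validation" and "test" for the three parts, and the toy sequences are all assumptions.

```python
import random


def split_60_20_20(sequences, seed=0):
    """Split unique sequences into three disjunct parts of roughly 60%, 20% and 20%."""
    unique = sorted(set(sequences))  # each sequence is kept exactly once
    random.Random(seed).shuffle(unique)
    n_first = (6 * len(unique)) // 10
    n_second = (2 * len(unique)) // 10
    first = unique[:n_first]
    second = unique[n_first:n_first + n_second]
    third = unique[n_first + n_second:]
    return first, second, third


# Ten distinct toy sequences plus one duplicate, for illustration only.
toy_sequences = ["ACGT" + suffix for suffix in "ACGTACGTAC"[:0]] or [
    "AAAA", "AAAC", "AAAG", "AAAT", "AACA",
    "AACC", "AACG", "AACT", "AAGA", "AAGC", "AAAA",
]
training, validation, test_set = split_60_20_20(toy_sequences)
```

Because duplicates are collapsed before shuffling, no sequence can appear in more than one part; this does not remove the correlation between near-identical mutants of the same wildtype that is discussed above.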

6. Did you try adding a non-specific binding term to the model? Or the ability for sigma70 to bind through only one of its two feet? If not, why not? Do the authors think sigma70 cannot be bound only at the -10 or -35 sites?

To properly address this concern, we need to consider two different scenarios. First, that the binding of σ70 through only, for example, -10 foot, does not alter in any way its binding preference. In other words, the energy matrix for the -10 foot is the same no matter whether σ70 is bound only with the -10 foot or with both, -10 and -35 feet. Our model allows for this scenario fully. In fact, our model does not ‘know’ whether the binding occurs predominantly through one foot or the other. All it cares about is the total binding energy, which one can achieve through ‘mix and match’ of positions throughout the entire energy matrix. If, however, this assumption is not true, then the sequence preference of, for example, the -10 foot, would depend on whether and to what extent the -35 foot is bound. To explore for this possibility, we would need to re-fit all model parameters under at least two additional binding configurations – -10 foot bound alone and -35 foot bound alone, and this kind of analysis was beyond the scope of our study. Just to illustrate how much more difficult and time-intensive this task would be, consider that the Einav and Phillips 2020 PNAS paper was entirely dedicated to fitting only this one aspect of polymerase binding, and it did so with only eight -10 and eight -35 variants. We have thousands of sequences variants, which would make this work substantially more complicated. However, it would be very interesting to pursue this line down the road.

7. I did not like the way multinomial logistic regression was used to fit the model. Expression should go up monotonically with Pon. It makes no sense to treat the bins as unrelated 'classes'.

This comment is very similar to that of Reviewer 1, so here we repeat our response: We implemented one-vs-rest MLR as a conservative choice, which allows for a non-monotonic relationship between P_on and expression. An ordinal logistic model would enforce monotonicity by definition. In particular, for all the models without the clearance rate, if there were a binding state so strong that expression is actually lower than for somewhat weaker binding, we would be able to detect it (and we actively looked for any such effect; none exists). Despite using MLR, we find a monotonic relationship between P_on and expression, which we took to be one of our internal consistency checks.

8. Comparing evolution of promoters de novo under the extended model against the 'standard model' is a straw man in my opinion. The standard model is an approximation to the thermodynamic model that is valid when there is one site that dominates the overall affinity of a promoter. I agree that it is often assumed that this is a good approximation for native promoters that have been shaped by selection. But I do not think there is anybody that thinks this is a good model for how promoters appear in random sequences. Obviously weak promoters can then appear anywhere in the sequence.

We are glad the Reviewer pointed out this lack of clarity in our explanation of how we actually use the Standard model. We do not assume that it has to bind to a specific location in the sequence. In fact, we scan the entire sequence for all possible binding configurations, just like we do with the Extended model, but with the Standard one we then only pick the single strongest binding configuration (as opposed to summing up across all binding configurations like with the Extended model). As such, when simulating promoter evolution, we do not require the Standard model to have a fixed position. Rather, we require only that it have a single binding site, no matter where that site might appear in the sequence. We now clarify this point further in the Legend to Figure 4, as well as the main text (L242-245).
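The distinction between picking the single strongest configuration and summing over all configurations can be illustrated with a toy calculation. The sketch below assumes a simple two-state thermodynamic form, P_on = Z / (1 + Z), with Boltzmann weights exp(-(E - μ)) and energies in units of k_BT; this functional form, the function names, and the toy energies are illustrative assumptions and omit the other features of the Extended model (spacer flexibility, clearance rate, and so on).

```python
import math


def p_on_all_configurations(energies, mu=0.0):
    """Sum Boltzmann weights over every binding configuration (Extended-style)."""
    z = sum(math.exp(-(e - mu)) for e in energies)
    return z / (1.0 + z)


def p_on_strongest_only(energies, mu=0.0):
    """Keep only the lowest-energy (strongest) configuration (Standard-style)."""
    z = math.exp(-(min(energies) - mu))
    return z / (1.0 + z)


# Toy binding energies (in k_BT) for configurations found by scanning a sequence.
toy_energies = [2.0, 2.0, 2.0]
summed = p_on_all_configurations(toy_energies)
strongest = p_on_strongest_only(toy_energies)
```

With several comparably weak configurations, the summed occupancy exceeds the strongest-site occupancy; with a single configuration the two coincide. In neither case is the position of the site fixed in advance.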

9. For Figure 3A, I think it is essential to also predict such a curve for the sequences in each of the libraries and see how well these predicted curves match the observed distributions for these libraries. Otherwise it is hard to judge how meaningful this predicted distribution is.

We have now predicted the distribution for random sequences using the Extended model and compare it to the experimentally measured distribution of the 36N library, as shown in Figure 4A.

10. Many intergenic sequences in E. coli will be constrained to have all kinds of TFBSs, precluding sigma70 to occur in those positions. So to check whether there is really depletion of sigma70 sites, comparing with random sequence of the same nucleotide composition is not quite correct in my opinion. First, you should distinguish intergenic regions that are upstream of two operons from those that are upstream of one, and those downstream of two promoters. Those 3 classes of intergenic regions are also known to have different nucleotide composition. The best regions for tests would be downstream regions, and regions that are upstream of an operon that is constitutive.

We agree with the Reviewer that the way in which we defined promoters in Figure 4E is neither ideal nor comprehensive. What we wanted to do is test our hypothesis at the two ends of the spectrum – a relaxed definition of what a promoter is (which is what we used) and a very conservative definition (only experimentally confirmed promoters). Our relaxed definition includes all three ‘classes’ of promoters, including those that are found downstream of operons and hence are unlikely to act as promoters. In order to eliminate all intergenic regions that do not contain a promoter (including all the ‘downstream’ promoters mentioned by the Reviewer), we use the conservative definition that includes only the intergenic regions with known promoter function taken from RegulonDB. We use this conservative definition because we are asking whether there is selection against σ70-RNAp binding sites, without discriminating the source of that selection, which can be the existence of one or more transcription factor binding sites and/or selection against having too many σ70-RNAp binding sites.

To supplement our approach, the Reviewer suggested to do the following: to split intergenic regions into three categories and to look for evidence of selection against σ70-RNAp binding sites in two of those categories. First is the promoters downstream of operons. These are intergenic regions expected not to have any promoter activity. It is unclear to us how providing evidence of selection against σ70-RNAp in such regions would provide novel insights compared to what we already identify by examining selection in coding regions. Furthermore, we already include these regions in our relaxed definition of a promoter in which we demonstrate strong evidence of selection against σ70-RNAp binding sites.

The second set of promoters that the Reviewer suggests examining is the upstream regions of constitutive operons. This definition poses a few technical difficulties for our analysis. First, it is difficult to define a constitutive promoter. Definitions based on experimental evidence are inevitably at the mercy of the specific media conditions used, which in turn define the transcriptome and hence define which promoters might be defined as constitutive. The set of promoters deemed constitutive would therefore differ considerably depending on how many and which environments are tested. As carrying out a comprehensive experiment in all possible environments is, obviously, not possible, the best we can do is to look at all promoters reported in RegulonDB or other similar aggregate databases that accumulate information from multiple published sources. ~80% of all σ70 E. coli promoters reported in RegulonDB have at least one associated transcription factor, meaning that following the Reviewer’s suggestion would result in us analyzing only ~20% of the sequences included in our conservative promoter list. This number of promoters is very low, meaning that we would not have enough statistical power to discern any but the strongest selection against σ70-RNAp. Importantly, we set out to analyze the evidence for selection against σ70-RNAp sites, irrespective of the source of that selection, so we consider it important to include the promoters that bind transcription factors rather than focusing only on constitutive ones.

We hope that the fact that we found the same evidence for selection against binding sites even within only the known, experimentally confirmed promoters obtained from RegulonDB (Figure 4 —figure supplement 3B) reassures the Reviewer that our findings do hold no matter how we define a promoter. We have added a discussion about the conclusions that can be drawn from our approach with respect to the source of the selection we report (L368-372).

Reviewer #4:

The manuscript by Lagator et al., examines a general framework to predict the constitutive level of gene expression for bacterial DNA, which constitutes one of the most fundamental processes in biology. Whereas previous work has typically been restricted to a limited sampling around a well-studied promoter (such as the lac operon), the authors push towards a model that can characterize all possible sequences, which they support with libraries containing 10,000 mutants each around two strong promoters (PR and PL) as well as a library containing 10,000 random sequences (36N). This work represents an important advance to the field, and will hopefully serve as the stepping stone for better models that include transcription factors and eukaryotic enhancers.

We thank the Reviewer for the encouraging comments.

While the authors provide a clear and compelling story using their impressive data set, the one piece that I felt was missing was a connection with large gene expression data sets from other groups. How robust will their already-trained model be when applied in another context, where the experimental design is somewhat different? This will help other groups understand the potential gain in implementing this model for their own ends, and if any fine-tuning or model adjustments need to be performed, such information would be important to add. To this end, I suggest a few possible data sets below. The authors do not need to analyze all of them (although more is better), and they are welcome to substitute other data sets of their choice:

1) Hossain 2020 [https://doi.org/10.1038/s41587-020-0584-2] analyzed 4,350 bacterial promoters with a broad range of gene expression measurements varying by 820,000-fold. These promoters should be constitutive, although their random DNA sequences might occasionally give rise to transcription factor binding sites (as in the 36N library). The sequences and read counts are available in their Supplementary Data 3.

2) Urtecho 2019 [https://doi.org/10.1021/acs.biochem.7b01069] analyzed the constitutive expression from 10,000 promoters. Their setup ensures that no transcription factors are involved, providing a clean test for the model. You can find their expression data at the GEO Accession Number GSE108535 [https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE108535]. The file "GSE108535_sigma70_variant_data.txt.gz" contains the names of promoter variants and the average RNA/DNA expression levels of two biological replicates, while the file "GSE108535_barcode_mapping.txt.gz" contains the name of each promoter variant and its corresponding sequence (where the sequence is specified in the column 'most_common'). Figure S12 and Table S1 from their supplementary information provide a visual schematic for how to put the pieces together.

3) Johns 2018 [http://doi.org/10.1038/nmeth.4633] analyzed 30,000 promoters mined from prokaryotic genomes, which would be a very interesting avenue to explore. These promoters may involve transcription factors (although their mining strategy might bias them towards constitutive promoters). Their sequences are given in Supplementary Table 1 and their expression measurements for E. coli are provided in Supplementary Table 3.

As suggested by the Reviewer, we have now analyzed the three datasets and evaluated the performance of our model. We did so without any additional fitting (except for the chemical potential, which depends on the strain, growth conditions and other factors influencing the concentration of RNAp inside the cells), and we were very encouraged by the high performance of our model across all three datasets, especially as they were all developed to address very different questions. These datasets were also especially useful as each had a model developed by the respective authors, allowing us to compare our model to their performance. The manuscript now contains a new section discussing these results (Testing Model Generalizability).

I suspect the authors are already planning to do this, but it would be very helpful if a program implementing the fully-trained model was provided (along with some examples of its usage), to help the community utilize these results.

The code related to this manuscript has been deposited on github and can be found at https://github.com/szarma/Thermoters.

Associated Data

    Supplementary Materials

    Figure 1—source data 1. Inferred values for model parameters.

    For each position in the energy matrix, the energy penalty is normalized to the lowest energy (strongest binding) residue in that position, which is set to zero. The same applies to the spacer variation penalty, where the optimal spacer length is set to zero. Orange background marks the canonical –10 and –35 binding sites of σ70-RNAP. For pairwise interactions, ‘identity’ refers to the two positions and specific residues that have a significant positive (destabilizing) or negative (stabilizing) interaction, with numbers indicating the position of the residue in the matrix shown in Figure 1C. Shaded cells indicate the strongest interactions, which are shown in Figure 1C. The numbers indicate the position in the energy matrix, starting from the most upstream to the most downstream (left to right in the energy matrix shown in Figure 1C).

    Figure 3—source data 1. Processing of the mutant libraries and sizes of datasets after splits.

    The table shows the number of reads remaining in the datasets following each step of data processing, from original sequenced library down to the final library used for model fitting and evaluation.

    Figure 3—source data 2. Number of mutants per expression bin for each split of the PR, PL, and 36N dataset.

    Bins are no (‘0’), low (‘1’), intermediate (‘2’), and high (‘3’) for the PR and PL libraries, and are ordered from lowest (‘0’) to highest (‘11’) for the 36N library.

    Transparent reporting form

    Data Availability Statement

    Source data files have been provided for Figures 1 & 3. Code and data have been deposited in GitHub: https://github.com/szarma/Thermoters (copy archived at swh:1:rev:61fe2f54941966469dad801efe06e1c879f27530).


    Articles from eLife are provided here courtesy of eLife Sciences Publications, Ltd
