ABSTRACT
Predicting the location and strength of promoters from genomic sequence requires accurate sequence-based promoter models. We present the first model of a full-length bacterial promoter, encompassing both upstream sequences (UP-elements) and core promoter modules, based on a set of 60 promoters dependent on σE, an alternative ECF-type σ factor. UP-element contribution, best described by the length and frequency of A- and T-tracts, in combination with a PWM-based core promoter model, accurately predicted promoter strength both in vivo and in vitro. This model also distinguished active from weak/inactive promoters. Systematic examination of promoter strength as a function of RNA polymerase (RNAP) concentration revealed that UP-element contribution varied with RNAP availability and that the σE regulon is comprised of two promoter types, one of which is active only at high concentrations of RNAP. Distinct promoter types may be a general mechanism for increasing the regulatory capacity of the ECF group of alternative σ's. Our findings provide important insights into the sequence requirements for the strength and function of full-length promoters and establish guidelines for promoter prediction and for forward engineering promoters of specific strengths.
INTRODUCTION
Bacterial genome sequences are being completed at an exponentially increasing rate but the ability to use these sequences as genomic blueprints requires accurate prediction of promoters and of their strength. Detecting the existence and elements of a promoter is a challenge because promoters are constructed of multiple poorly conserved motifs separated by variable length spacers. Moreover, once promoters are identified, it is challenging to predict their maximal initiation rates because transcription initiation is comprised of multiple kinetic steps including initial binding of RNA polymerase (RNAP) and subsequent ‘melting’ (strand-opening) of the DNA. Here, we develop the first sequence-based model of a full-length bacterial promoter, identify the key determinants of promoter sequence that correlate with promoter strength and use these features to establish a predictive model for promoter strength.
Bacterial promoters are comprised of a core promoter region recognized by σ, and an upstream region, termed the UP-element, recognized by the α subunits of RNAP (Figure 1A). With the exception of σ54-dependent promoters, the core promoter region is comprised of the −35 and −10 motifs, located, respectively, at these distances upstream of the transcription start point. Each σ factor recognizes core promoter motifs with distinct sequences. Thus, specific promoter recognition is determined by the σ factor, which binds to RNAP to form holoenzyme and directs the complex to its target promoter sequences. The σ70 family of σ factors is comprised of four phylogenetically related groups (1). Group 1 comprises the housekeeping σs; these are essential and recognize thousands of promoters. Groups 2–4 are the alternative σs, which recognize discrete sets of promoters enabling specialized responses to environmental stresses and developmental cues. σs are modular proteins, comprised of a variable number of domains. The Group 4 or extracytoplasmic function (ECF) σs have only two domains, one recognizing the −10 motif and the other recognizing the −35 motif (1). Importantly, these σs are the most abundant alternative σs and are involved in regulating a diverse repertoire of stress and developmental responses (2–4). Their promoters provide an ideal test-bed for constructing and improving promoter models, as they are simple and relatively well-conserved.
The UP-element increases promoter strength (5–8). It is composed of alternating A- and T-tracts (6,9–11) and has two subsites, each recognized by the C-terminal domain of one of the two α subunits of RNAP (αCTDs: αCTD I and αCTD II) (see Figure 1A) (6,9,10,12). Each αCTD binds the DNA minor groove via a helix-hairpin-helix motif; however, there are no direct sequence-specific contacts. Instead, the preference for A- and T-tract DNA reflects a structural property of these sequences: the narrower minor grooves of A-tract DNA facilitate optimal insertion of the R265 side chain of αCTD (13–15). The αCTDs are mobile, being connected to their N-terminal domain and hence the main body of RNAP via a flexible linker (16). While αCTD I is primarily located on the promoter-proximal subsite, which is centered near −43 and adjacent to the −35 motif enabling interaction with select σs (17–19), αCTD II is mobile. αCTD II not only binds predominantly to the promoter distal subsite centered near −53 (9), but can also transiently occupy multiple minor groove sites on the same face of the helix further upstream centered at positions −63, −73, −83 and/or −93 (20–22).
Position weight matrices (PWMs) are typically used to model and predict transcription factor binding sites and promoters; however, their predictions can suffer from many false positives (23,24). Previously, we tested the utility of PWMs to predict the strength of core promoter sequences (from −35 to +20) using a library of 60 promoters recognized by σE, an Escherichia coli ECF σ (25) and separate PWMs for each core promoter motif [−35 motif, spacer, −10 motif, discriminator, start and initial transcribed region (ITR); Figure 1B]. Our best model summed the PWM scores of select motifs that positively correlated with promoter strength (−35, −10, discriminator and start) and included a penalty term applied for non-optimal spacer and discriminator lengths. This demonstrates the utility of PWMs to predict the strength of core promoter sequences. In addition, applying minimal scores for each motif enabled successful discrimination between active and weak/inactive promoters, which is critical for accurate promoter prediction. In contrast, the contribution of UP-element sequences to promoter strength is not expected to be well-described by PWMs. Natural UP-elements display little position-specific sequence conservation (Figure 1B), and the known requirement of narrowed minor grooves for optimal α-binding suggests a dependency between adjacent nucleotides. Such dependencies are not captured by PWMs, which assume that the binding energy contribution of each nucleotide position is independent and additive.
In this work, we successfully model UP-element contributions to promoter strength, developing a sequence-based model that captures both the structural features of the DNA subsites and the multiple binding locations of α. We demonstrate that the UP-element model can be combined with our core promoter model (25) to generate the first model that accurately identifies full-length promoters and estimates their strength. We also find that the contribution of the UP-elements to promoter strength depends both on RNAP levels and properties of the core promoter. Since ECF σ promoters are predominantly regulated by the availability of their active cognate σ, and hence by the availability of σ-specific RNAP holoenzyme, this finding has significant implications for understanding the regulation of ECF σ promoters in vivo. Our work suggests strategies for improving the accuracy of full-length promoter prediction and for forward engineering promoters of specific strengths for use in synthetic biology.
MATERIALS AND METHODS
Strains, plasmids and growth conditions
All strains were grown in M9 complete minimal medium supplemented with appropriate antibiotics at 30°C with shaking. M9 complete minimal medium was prepared as described (26) and supplemented with 0.2% glucose, 1 mM MgSO4, vitamins and all amino acids (40 μg/ml). Media were supplemented with 30 μg/ml kanamycin and/or 100 μg/ml ampicillin as required. All assay strains are derivatives of E. coli K-12 strains MG1655 (27) and MC1061 (28) and are listed in Supplementary Table S1. Assays with basal levels of σE were performed in derivatives of CAG45113 (MG1655) transformed with derivatives of the GFP expression vector, pUA66 (29), carrying the long and short σE promoter libraries (8) (Supplementary Table S1). Assays with over-expression of σE were performed in derivatives of CAG58200 (MG1655 ΔlacX74 [ΦλrpoHP3::lacZ]) carrying the plasmid pLC245 (23) expressing rpoE from the IPTG inducible Ptrc promoter, and pUA66 carrying the long and short σE promoter libraries (8) (Supplementary Table S1). σE-independent promoter activities were determined in derivatives of CAG22216 (MC1061 ΔlacX74 [ΦλrpoHP3::lacZ] rpoE::ΩCm) (30) carrying the long and short σE promoter libraries (8) (Supplementary Table S1).
All σE promoter constructs were carried on the low copy vector, pUA66, driving the expression of GFP from the reporter gene gfpmut2, and were constructed as described in (8). Briefly, full-length (long) promoter sequences from −65 to +20 with respect to the transcription start site were cloned into the XhoI–BamHI sites of pUA66, creating derivatives pUA66 E1-E60 (Supplementary Table S1). The core (short) promoter library contained sequences from the −35 motif to +20 cloned into the XhoI–BamHI sites of pUA66, creating derivatives pUA66 Et1-Et60 (Supplementary Table S1). Note that due to variable spacer lengths between the promoter motifs, the cloned upstream position was always taken as the first G of the GGAACTT −35 motif, so that all core promoter sequences contain a full-length −35 motif with no additional upstream sequences.
In vivo promoter assays
All in vivo promoter assays were performed as described in ref. (8) with the exception that strains were grown in M9 complete minimal medium instead of LB. The lower autofluorescence of M9 medium compared to LB permits more accurate quantification of GFP and hence promoter activity under basal σE levels. Strain derivatives of CAG45113 and CAG22216 were grown at 30°C for measurements of promoter strength when σE is expressed at the basal level or in the absence of σE (these latter strains have a suppressor of σE essentiality to permit their growth); CAG58200 derivatives were supplemented with 100 µM IPTG to induce rpoE expression for measurements of promoter strength under high (over-expression) levels of σE. Briefly, overnight cultures of the strains in 96-well microplates were diluted 200-fold to an OD450 ∼0.03 in fresh medium ±100 µM IPTG for CAG58200 derivatives. Strains in the covered 96-well microplates were incubated in a multimode microplate reader-incubator shaker (Varioskan; Thermo Fisher Scientific), and measurement of optical density (OD450 nm) and fluorescence (relative fluorescence units or RFU; excitation = 481 nm; emission = 507 nm) was performed every 15 min. Promoter strength was taken as the slope of the change in GFP fluorescence as a function of cell growth during exponential growth phase (OD450 = 0.2−0.45) as described in ref. (8).
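The slope calculation described above can be sketched as a least-squares fit of fluorescence against OD450, restricted to the exponential window. The function name, the choice of ordinary least squares and the example readings are illustrative assumptions; the analysis of ref. (8) may differ in detail.

```python
def promoter_strength(od, rfu, od_min=0.2, od_max=0.45):
    """Slope of RFU versus OD450 for readings inside the exponential window."""
    # Keep only the time points whose OD falls inside the window.
    pts = [(x, y) for x, y in zip(od, rfu) if od_min <= x <= od_max]
    if len(pts) < 2:
        raise ValueError("need at least two readings inside the OD window")
    n = len(pts)
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    if sxx == 0:
        raise ValueError("OD readings inside the window are all identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    return sxy / sxx  # RFU per OD unit


# Hypothetical plate-reader series (one reading every 15 min).
od = [0.05, 0.10, 0.20, 0.30, 0.40, 0.45, 0.60]
rfu = [3.0, 6.0, 10.0, 15.0, 20.0, 22.5, 40.0]
strength = promoter_strength(od, rfu)
```

Readings outside OD450 = 0.2 to 0.45 are ignored, so lag-phase and late-growth points do not influence the estimate.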
Promoter templates for in vitro transcription
Linear promoter fragments for in vitro transcription assays were generated by PCR from the pUA66 E1-E60 and pUA66 Et1-Et60 plasmid templates (Supplementary Table S1) as described in (25) (primer sequences available on request). Briefly, promoter fragments were generated from the plasmid templates using upstream primers and a downstream primer that incorporates the highly efficient rpoC terminator sequence (31). This generated promoter fragments from −203 to +145 containing vector sequence upstream and vector sequence + rpoC terminator sequence downstream of promoter sequences −65 to +20 for full-length promoters and −35 to +20 for core promoters (Supplementary Figure S1A, B). Both promoter libraries generate a 118 nt mRNA transcript. The competitor promoter fragment contained PrpoH core promoter and generates a 149 nt mRNA transcript (Supplementary Figure S1C) (25).
Purification of RNA polymerase core enzyme and σE
RNA polymerase core enzyme was purified as described in ref. (32). N-terminally His6-tagged σE was purified as described in ref. (33) from soluble cell lysates of strain BL21λDE3 (pLysS, pRER76) with the following modifications: A 500 ml culture of BL21λDE3 (pLysS, pRER76) was grown at 25°C in LB + 100 µg/ml ampicillin and 50 µg/ml chloramphenicol until OD600 = 0.5. The culture was supplemented with an additional 100 µg/ml ampicillin and induced with 1 mM IPTG for 2 h with shaking at 25°C. Cells were harvested by centrifugation and resuspended in 10 ml Lysis Buffer (50 mM NaH2PO4(H2O) pH 8.0, 500 mM NaCl, 10% w/v glycerol, 10 mM imidazole). Cells were lysed by sonication and the lysate centrifuged at 10 000g for 10 min at 4°C. The majority of σE was present in the soluble fraction and was purified using a QIAGEN Ni2+ affinity column under native conditions as per manufacturer's instructions (Valencia, CA, USA). Our modified lysis buffer was used in the loading and wash steps (+20 mM Imidazole), and σE was eluted using modified elution buffer [50 mM NaH2PO4(H2O) pH 8.0, 500 mM NaCl, 10% w/v glycerol] with a stepwise imidazole gradient from 20 to 200 mM in 20 mM increments. σE eluted between 60 and 100 mM imidazole and was essentially pure. The σE containing fractions were pooled and dialyzed into storage buffer [20 mM Tris–HCl (pH 7.9 at 4°C), 500 mM NaCl, 1 mM EDTA, 50% w/v glycerol, 1 mM DTT] and then stored at −80°C.
In vitro transcription assays
Multi-round transcription assays were used to measure promoter strength (rate of mRNA production) and were performed in triplicate with 10 and 50 nM RNAP. To facilitate accurate determination of promoter strength, the assays were modified in four ways: (i) Inclusion of the highly efficient rpoC terminator at the end of the promoter templates ensured specific transcript termination, rather than termination by RNA polymerase ‘running’ off the end of the template. This gave a 5-fold increase in specific transcript signal strength (25). (ii) With test promoters, transcript generation during incubation with RNA polymerase was carefully evaluated to ensure proper ‘multi-rounds’ and that there was no depletion of NTPs (data not shown). (iii) To facilitate rapid analysis of large numbers of promoters, assays were performed ‘high-throughput’ in 96-well plates and loaded on a standard S2 sequencing gel poured with 32-well combs using a standard 12-channel multichannel pipette. (iv) Inclusion of a control promoter in each assay enabled the test promoter transcript to be normalized against the control promoter to account for variations in RNA polymerase activity and gel loading, enabling comparison of test promoter activities. The transcription reactions (6 µl) contained 0.5 nM test promoter and 0.5 nM competitor promoter DNA, 5–50 nM core RNA polymerase with 2-fold excess σE in 1× Binding Buffer (5% glycerol, 20 mM Tris pH 8.0, 300 mM KAc, 5 mM MgAc, 0.1 mM EDTA, 1 mM DTT, 50 µg/ml BSA, 0.05% Tween), 150 µM GTP/ATP/UTP, 10 µM CTP, 0.5 µCi α32P-CTP (3000 Ci/mmol; 110 TBq/mmol) and incubated at 37°C for 10 min. Reactions were terminated by addition of 4.5 µl Stop Solution (20 mM EDTA, 80% deionized formamide and 0.1% [w/v] bromophenol blue and xylene cyanol). 
Transcripts were resolved on a 6% denaturing polyacrylamide sequencing gel (see example in Supplementary Figure S1D), visualized using a Molecular Dynamics Storm 560 Phosphorimager scanning system (Sunnyvale, CA, USA), and quantified using the software ImageQuant v5.2 (G.E. Healthcare Life Sciences). For each assay, promoter strength = (test promoter–background)/(control promoter–background).
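As a worked illustration of the normalization above, the following minimal sketch applies the formula to band intensities from one gel lane. The function name and the intensity values are hypothetical; only the formula itself comes from the text.

```python
def normalized_strength(test_band, control_band, background):
    """(test - background) / (control - background) for a single lane."""
    denominator = control_band - background
    if denominator <= 0:
        raise ValueError("control band must exceed background")
    return (test_band - background) / denominator


# Hypothetical triplicate lanes: (test band, control band, background).
lanes = [(1200.0, 2200.0, 200.0), (700.0, 1200.0, 200.0), (1700.0, 3200.0, 200.0)]
per_lane = [normalized_strength(*lane) for lane in lanes]
mean_strength = sum(per_lane) / len(per_lane)
```

Dividing by the control promoter signal in the same lane cancels lane-to-lane differences in RNAP activity and gel loading, which is why the three hypothetical lanes agree despite different raw intensities.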
Promoter strengths and UP-effects used for modeling
All σE-dependent and independent promoter strengths (background subtracted) determined in vivo and in vitro together with their calculated UP-effects are presented in Supplementary Table S2. UP-effects were calculated as log2([full-length promoter activity]/[core promoter activity]). To prevent extreme UP-effect ratios generated from very weak promoter activities, promoter strengths <2-fold of the background were reset to 2-fold above background. For example, all in vivo promoter activities less than 2 were reset to 2; all in vitro promoter activities less than 0.05 were reset to 0.05. Full-length promoter models were constructed using active full-length promoters defined as 2-fold above background, with the exception of in vivo basal promoter activities. Here, a lower cutoff was used (>1.5) to increase the number of active promoters in the model from 15 to 18. The active in vivo basal full-length promoter set excluded three promoters that exhibited significant σE-independent activity (plsB, yfjO, yecI; Supplementary Table S2). ybcR was also excluded from the basal in vivo active dataset since it was unusually active in this and no other condition (in vitro or in vivo; Supplementary Table S2). Although ybcR exhibited negligible σE-independent activity in M9 medium, significant independent activity was observed in LB (8), suggesting additional or spurious regulation of this promoter.
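The UP-effect calculation, including the reset of very weak activities, can be sketched as follows. The function name is an assumption; the floor values of 2 (in vivo) and 0.05 (in vitro) are those stated above, and the example activities are hypothetical.

```python
import math


def up_effect(full_activity, core_activity, floor):
    """log2(full-length / core) after resetting weak activities to the floor."""
    # Activities below the floor are raised to it, which caps the ratio.
    full = max(full_activity, floor)
    core = max(core_activity, floor)
    return math.log2(full / core)


# Hypothetical in vivo pair (floor = 2) and in vitro pair (floor = 0.05).
in_vivo = up_effect(16.0, 0.5, floor=2.0)
in_vitro = up_effect(0.01, 0.2, floor=0.05)
```

Without the floor, the first example would give log2(16/0.5) = 5 rather than 3, showing how a near-background core activity would otherwise inflate the ratio.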
Scoring promoter sequences
Sequence logos of aligned promoter motifs were generated using WebLogo v2.8 [http://weblogo.berkeley.edu//; (34)]. Position weight matrices were constructed using the method of (35) with aligned sequences for each motif. Core promoter motifs (−35, spacer, −10, discriminator, start and initial transcribed region; see Figure 1B and sequences listed in Supplementary Table S3) were defined and scored using PWMs as described in (25). A combined spacer and discriminator length penalty score (S + D pen) was applied to promoters with suboptimal spacing between the +1, −10 and −35 motifs based on the observed spacing frequency for σE promoters (Figure S3) as described in (25). A core promoter score, C, was derived by summing select motif PWMs and S + D pen scores. Upstream sequences (listed in Supplementary Table S3) were scored as described in Supplementary Figure S2 to derive an upstream score, U.
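For readers unfamiliar with PWM scoring, the sketch below builds a log-odds matrix from aligned motif instances and scores a candidate site. It assumes a uniform background and a pseudocount of 1, which may differ from the method of ref. (35); the aligned sequences are hypothetical variants of the GGAACTT −35 motif.

```python
import math

BASES = "ACGT"


def build_pwm(aligned, pseudocount=1.0, background=0.25):
    """Log2-odds PWM: one dict of base -> score per aligned column."""
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("aligned sequences must have equal length")
    pwm = []
    for i in range(length):
        column = [s[i] for s in aligned]
        total = len(column) + pseudocount * len(BASES)
        pwm.append({
            b: math.log2((column.count(b) + pseudocount) / total / background)
            for b in BASES
        })
    return pwm


def pwm_score(pwm, site):
    """Sum of per-position scores; positions are treated as independent."""
    if len(site) != len(pwm):
        raise ValueError("site length must match PWM length")
    return sum(col[base] for col, base in zip(pwm, site))


# Hypothetical aligned -35 motif instances.
aligned = ["GGAACTT", "GGAACTT", "GGAACTA", "GAAACTT"]
pwm = build_pwm(aligned)
consensus_score = pwm_score(pwm, "GGAACTT")
mismatch_score = pwm_score(pwm, "CCTTGAA")
```

Because the score is a plain sum over positions, it cannot represent dependencies between neighbouring nucleotides.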
Total promoter score was taken as the sum of the core and upstream scores: Sp = U + C. This assumes that the upstream and core promoter scores contribute with equal weight to the total promoter score. However, given that the core scores (C) are based on PWMs and that the upstream scores (U) are A- and T-tract counts of length or frequency, this may not be the case. Accordingly, Partial Least Squares Regression (PLSR) (36) was used to solve for the upstream and core model coefficients as described in (25) using the model: Sp = xU·U + xC·C; where xU and xC are x-value coefficients applied to promoter model scores, U and C. The model was solved using a matrix of y-values (Sp) and x-variables (U and C) with the software package ‘The Unscrambler v9.8’ (CAMO Software AS, Norway; http://www.camo.no).
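To illustrate what solving for the two coefficients involves, the sketch below fits Sp = xU·U + xC·C by ordinary least squares without an intercept. This is a deliberately simpler stand-in for PLSR, not the procedure run in The Unscrambler, and the scores and target values are hypothetical.

```python
def fit_coefficients(u, c, y):
    """Least-squares xU, xC for y ~ xU*U + xC*C (no intercept)."""
    if not (len(u) == len(c) == len(y)) or len(u) < 2:
        raise ValueError("u, c and y must have equal length of at least 2")
    # Entries of the 2x2 normal equations.
    suu = sum(a * a for a in u)
    scc = sum(b * b for b in c)
    suc = sum(a * b for a, b in zip(u, c))
    suy = sum(a * t for a, t in zip(u, y))
    scy = sum(b * t for b, t in zip(c, y))
    det = suu * scc - suc * suc
    if det == 0:
        raise ValueError("U and C are collinear; coefficients are not unique")
    # Cramer's rule for the two unknowns.
    x_u = (suy * scc - suc * scy) / det
    x_c = (suu * scy - suc * suy) / det
    return x_u, x_c


# Hypothetical upstream scores, core scores and target values.
u = [1.0, 2.0, 3.0, 0.0]
c = [2.0, 1.0, 4.0, 5.0]
y = [4.5, 3.0, 9.5, 10.0]  # constructed as 0.5*U + 2*C
x_u, x_c = fit_coefficients(u, c, y)
```

Setting both coefficients to 1 recovers the unweighted sum Sp = U + C.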
Promoter score, Sp, was taken to be proportional to the log of promoter strength, Sp ∝ log2(Ka), where Ka is occupancy or promoter strength (37). The fit of Sp with log2(Ka) was assessed by Pearson's correlation coefficient (R) and significance (p) determined using a two-tailed test (http://www.danielsoper.com/statcalc3). Similar assumptions were applied to the correlation of upstream scores (U) with UP-effect, E (E = log2([full-length promoter activity]/[core promoter activity])), to give U ∝ E. Outliers in the correlation of promoter score (Sp) with strength (Ka) were identified as having both high residual y-variance and high leverage values using the software ‘The Unscrambler v9.8’. Promoter models were tested for over-fitting using 10-fold cross-validation as described in ref. (25).
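The correlation test can be sketched with the standard library alone. The code computes Pearson's R and the t-statistic on n − 2 degrees of freedom from which a two-tailed p-value is obtained; looking up p itself requires a t-distribution table or a statistics package and is omitted here. The scores and strengths are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need two series of equal length, at least 3 points")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t = R * sqrt((n - 2) / (1 - R^2)), with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))


# Hypothetical promoter scores and log2 promoter strengths.
scores = [1.0, 2.0, 3.0, 4.0, 5.0]
log2_strength = [1.1, 1.9, 3.2, 3.8, 5.0]
r = pearson_r(scores, log2_strength)
t = t_statistic(r, len(scores))
```

The same R is more significant with more promoters, because the t-statistic grows with n.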
RESULTS
Quantifying the effects of upstream sequences on promoter strength
We determined the contribution of upstream sequences to promoter strength (UP-effect) using our previously characterized library of 60 natural σE-dependent promoters from E. coli and Salmonella enterica (Figure 2, Supplementary Table S1) (8,25). The UP-effect is calculated as the log2 ratio of the activity of the full-length promoter, containing sequences from −65 to +20, as compared to its core promoter derivative, containing sequences from −35 to +20 and an upstream sequence derived from a common vector sequence (Supplementary Figure S1A and B). Promoter strength was measured both in vitro (Figure 2A) and in vivo (Figure 2B) under limiting σE-RNAP, which mimics the basal expression levels of the σE system. In vitro measurements utilized competitive multi-round transcription assays with limiting amounts of RNAP. Each reaction contained two promoter templates on separate linear fragments: a strong competitor promoter and a test promoter (Figure S1; also see Materials and Methods). In vivo assays used promoters expressing a GFP reporter in cells expressing basal levels of σE, performed as described previously (8), except that cells were grown in M9-glucose medium to reduce autofluorescence and enable more accurate measurement of low σE activity. We define active promoters as those with activity ≥ 2-fold above background. Under our stringent assay conditions with low σE levels, 30 out of the 60 promoters tested were active in vitro, and a largely overlapping set of 22 promoters were active in vivo. Four of the promoters active in vivo were excluded from subsequent analysis because they exhibited σE independent activity, limiting the in vivo set to 18 promoters (see ‘Materials and Methods’ section; Supplementary Table S2). Promoter models were constructed from the active promoters. 
Calculation of the UP-effect (log2[full-length promoter activity / core promoter activity]) revealed a range of positive and negative effects on activity of the core promoter in vitro (Figure 3A) and in vivo (Figure 3B). Positive UP-effects were associated with A- and T-tracts indicative of UP-elements, and UP-effects were usually greater in vitro than in vivo.
Modeling the UP-effect
We divided the upstream regions of promoters into three putative α-binding subsites: proximal (−46 to −35), distal (−57 to −47) and far-distal (−64 to −58) (illustrated in Figures 1 and 3). We tested whether various models were successful, based on the correlation (R) of subsite scores predicted by the model with UP-effect (Table 1). Using aligned upstream sequences, we first tested PWM models (Table 1). As expected, PWMs performed poorly, even with over-represented motifs identified within each subsite using MEME or WCONSENSUS, or by incorporating adjacent nucleotide dependencies using dinucleotide PWMs (data not shown). The promoter library is too small for statistically meaningful analysis of trinucleotide models. Together these results suggest higher order dependencies are required for UP-element function.
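The subsite boundaries can be turned into sequence slices as sketched below. The sketch assumes an upstream sequence whose first character is position −65 and whose coordinates need no realignment; in practice variable spacer lengths mean the sequences would first have to be aligned, and the helper names and test sequence are hypothetical.

```python
UPSTREAM_START = -65  # assumed first position of the cloned full-length sequence

# Inclusive subsite coordinates, as listed in the text.
SUBSITES = {
    "far_distal": (-64, -58),
    "distal": (-57, -47),
    "proximal": (-46, -35),
}


def subsite_sequences(seq, start=UPSTREAM_START):
    """Slice each subsite out of a sequence whose first base is at `start`."""
    out = {}
    for name, (first, last) in SUBSITES.items():
        lo = first - start       # zero-based index of the first base
        hi = last - start + 1    # exclusive end index
        if lo < 0 or hi > len(seq):
            raise ValueError(f"sequence does not cover subsite {name}")
        out[name] = seq[lo:hi]
    return out


# Synthetic 85 nt sequence (-65 to +20) in which each subsite is a homopolymer.
seq = "N" + "A" * 7 + "C" * 11 + "G" * 12 + "T" * 54
parts = subsite_sequences(seq)
```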
Table 1. Correlation (R) of model score with UP-effecta. The Far-distal, Distal and Proximal columns give subsite score correlations; the FD + D, D + P and All 3 columns give combined subsite score correlations.
Promoter model | Far-distal | Distal | Proximal | FD + Db | D + Pc | All 3d |
---|---|---|---|---|---|---|
In vitro multi-rounds at 10 nM RNAP (30 promoters) | ||||||
PWM | −0.02 | 0.43* | 0.16 | 0.48** | 0.33 | |
Percentage AT content | 0.18 | 0.32 | 0.51** | 0.37 | 0.55** | 0.54** |
A-tract/T-tract counts (3 nt) | 0.23 | 0.57** | 0.43* | 0.68**** | 0.66**** | 0.74**** |
A-tract/T-tract length | −0.19 | 0.56** | 0.17 | 0.34 | 0.55** | 0.38* |
A-tract/T-tract length ± 1 nt | −0.01 | 0.56** | 0.28 | 0.42 | 0.57** | 0.48** |
In vivo at basal levels of σE (18 promoters)e | ||||||
PWM | −0.47* | 0.25 | 0.16 | 0.26 | −0.08 | |
Percentage AT content | −0.29 | 0.24 | 0.10 | 0.01 | 0.19 | 0.07 |
A-tract/T-tract counts (3 nt) | −0.08 | 0.62* | 0.44 | 0.38 | 0.78*** | 0.66* |
A-tract/T-tract length | −0.06 | 0.49* | 0.36 | 0.36 | 0.55* | 0.46 |
A-tract/T-tract length ± 1 nt | −0.27 | 0.47* | 0.34 | 0.17 | 0.53* | 0.29 |
Cluster 1 at 10 nM RNAP (20 promoters) | ||||||
PWM | −0.46* | 0.26 | −0.16 | 0.12 | −0.25 | |
Percentage AT content | −0.24 | 0.31 | 0.33 | 0.14 | 0.42 | 0.29 |
A-tract/T-tract counts (3 nt) | 0.12 | 0.64** | 0.49* | 0.57** | 0.78**** | 0.74*** |
A-tract/T-tract length | −0.27 | 0.62** | 0.32 | 0.34 | 0.66** | 0.46* |
A-tract/T-tract length ± 1 nt | −0.28 | 0.60** | 0.38 | 0.33 | 0.65** | 0.43 |
Cluster 2 at 50 nM RNAP (9 promoters) | ||||||
PWM | −0.05 | −0.23 | 0.03 | −0.13 | −0.17 | |
Percentage AT content | 0.47 | 0.49 | 0.23 | 0.71* | 0.49 | 0.57 |
A-tract/T-tract counts (3 nt) | 0.84** | 0.03 | 0.29 | 0.34 | 0.26 | 0.46 |
A-tract/T-tract length | 0.13 | −0.32 | −0.35 | −0.21 | −0.38 | −0.30 |
A-tract/T-tract length ± 1 nt | 0.49 | −0.27 | −0.13 | 0.18 | −0.26 | 0.09 |
aThe highest subsite or combined subsite model correlation for each promoter group is indicated in bold. Significant correlations (R) by two-tailed test are indicated: p < 0.05*; p < 0.01**; p < 0.001***; p < 0.0001****. bCombined Far-Distal + Distal model subsites. cCombined Distal + Proximal model subsites. dCombined Far-Distal + Distal + Proximal model subsites. eActive basal promoters in cluster 1: degP, ddg, rpoE, rseA, ygiM, yraP, rybB, STM1676, micA; and in cluster 2: yicJ.
We then built models for each subsite based on known features of UP-elements (A/T content, A- and T-tract length/frequency) (Table 1; see Figure S2 for model descriptions). Our best model counted the number of overlapping A- and T-tracts 3 nt in length. This model is based on the finding that near maximal narrowing of the minor groove is achieved after a run of three A residues (15). The highest correlation with in vitro UP-effects was obtained by counting the number of overlapping 3 nt A- and T-tracts (e.g. 1, 2 or 3) across all three subsites (R = 0.74; p = 3×10−6), while the highest correlation in vivo was achieved by combining counts of overlapping 3 nt A- and T-tracts in just the distal and proximal subsites (R = 0.78; p = 1×10−4). A more complex model mimicked the features of A- and T-tract distribution observed at strong UP-elements (9,10). Here, the far-distal and distal subsites were scored by the length of contiguous A-tract followed by T-tract that overlapped the center of the α-binding sites (i.e. the number of nucleotides), and the proximal subsite was scored by the length of A- or T-tract that overlapped the center of the proximal α-binding site (Supplementary Figure S2). Overall, this model performed slightly less well than the simpler overlapping 3 nt A- and T-tract counts (Table 1). To test the importance of exact placement of the A- and T-tracts, a variation of the A- and T-tract length model was used in which the tracts only had to be within 1 nt of the center of the α-binding sites. This gave little difference in score correlations, suggesting that close proximity of A- and T-tracts to the binding sites is sufficient to capture UP-effects (Table 1). We also tested the correlation of A/T content with UP-effect (Table 1). This yielded only slightly weaker correlations in vitro, but for the in vivo promoters this approach performed poorly, suggesting that both A- and T-tract length and number provide more information than simple A/T content.
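One plausible reading of the overlapping 3 nt tract count is sketched below: every window of three consecutive A residues or three consecutive T residues within a subsite contributes one count, so a run of four A residues scores 2. The exact scoring rules are given in Supplementary Figure S2 and may differ; the function names and sequences here are hypothetical.

```python
def tract_count(subsite, k=3):
    """Number of overlapping k-nt windows that are all A or all T."""
    seq = subsite.upper()
    count = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window == "A" * k or window == "T" * k:
            count += 1
    return count


def upstream_score(subsites):
    """Upstream score U as the sum of tract counts over the chosen subsites."""
    return sum(tract_count(s) for s in subsites)


# Hypothetical distal and proximal subsite sequences.
distal = "GCAAATTTGC"
proximal = "AAAA"
u_score = upstream_score([distal, proximal])
```

Passing only the distal and proximal subsites corresponds to the combination that correlated best in vivo, whereas passing all three corresponds to the best in vitro combination.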
In summary, the success of overlapping 3 nt A- and T-tract counts strongly indicates the importance of minor groove width for UP-effect. Interestingly, the distal and proximal subsites scores generated the strongest correlations across all models, suggesting that these subsites are the main contributors to the UP-effect.
Modeling full-length promoters
We derived a model describing the strength of full-length promoters by first determining which UP-effect model provided the best correlation with full-length promoter strength and then combining this model with the core promoter model. In the in vivo case, the UP effect model that correlated best with full length promoter strength in vivo was the same model that best described UP-effects alone (proximal and distal overlapping 3 nt A- and T-tract counts, Table 1). For in vitro measurements, the distal site A-tract/T-tract length (±1 nt) model correlated best with full length promoter strength; we note that this was one of the top subsite models that best described the UP-effects (Table 1). The core promoter model (25) (see ‘Materials and Methods’ section) was based on PWMs constructed for each core promoter motif (Figure 1B) and a penalty term applied for non-optimal spacer and discriminator lengths (S + D penalty, see Supplementary Figure S3). Figure 4A and B illustrates the initial correlation of each UP and core promoter module with full-length promoter strength for both in vitro and in vivo data. Most module scores positively correlate with promoter strength, except the far-distal UP-subsite, and the spacer and discriminator motifs. The best full-length promoter models were constructed by summing the scores of select modules that positively correlated with promoter strength (indicated with asterisks in Figure 4A and B). This generated good correlation of promoter score with promoter strength both in vitro, R = 0.71 (p = 1×10−5; Figure 4C) and in vivo, R = 0.85 (p = 8×10−6; Figure 4D) (summarized in Table 2). All other module combinations and models resulted in lower correlations (data not shown). These initial models were optimized based on the assumption that a small number of promoters may have unusual sequence properties that detract from the model and therefore present as outliers in the correlation of score with promoter strength. 
Five outliers from the in vitro promoter model and three from the in vivo model were detected and removed based on their high residuals and leverage properties on the general fit of the model. Subsequent analysis revealed that outliers generally have at least one unusually low scoring module, validating the idea that their properties differ from the other promoters (see Discussion). The remaining promoters were used to construct optimized models using the same selection of modules as in the initial models (see optimized modules, Figure 4A and B), resulting in an improved fit of R = 0.90 (p < 1×10−6) (in vitro) and a slightly improved fit of R = 0.91 (p = 3×10−6) in vivo (Table 2; Figure 4E and F). Each optimized full-length model was tested for over-fitting using 10-fold cross-validation. The validated promoter scores still correlated well with promoter strength (R = 0.81 and 0.86 for in vitro and in vivo, respectively; Table 2; Figure S4), demonstrating that each model has good predictive utility. As the full-length in vitro and in vivo promoter models combine different scoring systems for the UP and core models, we explored whether using partial least squares regression (PLSR; see ‘Materials and Methods’ section) (36) to calculate optimal coefficients to scale their contributions improved fit. Although this procedure increased fit, cross-validation of these models resulted in lower validation scores compared to models with no coefficients, suggestive of over-fitting (data not shown). Therefore, our final models do not have coefficients to scale contributions of UP and core promoter modules.
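The two diagnostics used for outlier detection, residual and leverage, can be illustrated for a simple regression of promoter strength on promoter score. The cut-offs below (residual beyond 2 residual standard deviations, leverage above twice the mean leverage) are common rules of thumb chosen for this sketch; the criteria applied by The Unscrambler are not stated in the text, and the data are hypothetical.

```python
import math


def fit_diagnostics(x, y):
    """Residuals and leverages for a straight-line fit of y on x."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need two series of equal length, at least 3 points")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    if sxx == 0:
        raise ValueError("all x values are identical")
    slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / sxx
    intercept = mean_y - slope * mean_x
    residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]
    # Leverage grows with the distance of x from the mean of x.
    leverage = [1.0 / n + (a - mean_x) ** 2 / sxx for a in x]
    return residuals, leverage


def flag_outliers(x, y, resid_sd=2.0, leverage_factor=2.0):
    """Indices of points with both a large residual and high leverage."""
    residuals, leverage = fit_diagnostics(x, y)
    n = len(x)
    sd = math.sqrt(sum(r * r for r in residuals) / (n - 2))
    h_cut = leverage_factor * 2.0 / n  # mean leverage is 2/n for two parameters
    return [i for i in range(n)
            if abs(residuals[i]) > resid_sd * sd and leverage[i] > h_cut]


# Hypothetical promoter scores and log2 strengths.
scores = [1.0, 2.0, 3.0, 4.0, 10.0]
strengths = [1.0, 2.0, 3.0, 4.0, 2.0]
residuals, leverage = fit_diagnostics(scores, strengths)
flagged = flag_outliers(scores, strengths)
```

Requiring both conditions is deliberately conservative: a point far from the other scores pulls the fitted line toward itself, which shrinks its own residual.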
Table 2. Correlation (R) between promoter strength and promoter model scores. The Init, Opt and Val columns give the correlation with total promoter scorea; the UP model and Core model columns give the correlation with the optimized sub-model scores.
Data set | Initb | Optc | Vald | UP model | Core modele | Outliers |
---|---|---|---|---|---|---|
MR 10 nM RNAPf | 0.71 (p = 1×10−5) | 0.90 (p < 1×10−8) | 0.81 (p = 9×10−7) | 0.69g (p = 1×10−4) | 0.64 (p = 6×10−4) | ygiM, rseA, ompX, yfeK, sbmA |
In vivo basalh | 0.85 (p = 8×10−6) | 0.91 (p = 3×10−6) | 0.86 (p = 4×10−5) | 0.88i (p = 2×10−5) | 0.74 (p = 2×10−3) | degP, clpX, rpoE |
aCorrelation between promoter strength and total promoter score generated by summing UP and core model scores. bInitial fit. cOptimized fit after removal of outliers and rebuilding model. dValidated promoter scores fit. eCore model: S + D pen + PWM−35 + PWM−10 + PWMstart. fIn vitro promoter activities using 30 active promoters (Figure 2A). gUP model: distal subsite A-tract/T-tract length (±1 nt) scores. hIn vivo promoter activities using 18 active promoters (Figure 2B). iUP model: distal + proximal A-tract/T-tract counts (3 nt).
Accurate promoter prediction requires correct identification of functional promoters with few false positives from similar, but non-functional sequences. Our datasets of active and inactive promoters (Figure 2) are ideal to identify features that distinguish active promoters. We previously showed that effective discrimination between active and inactive core promoters required a minimum threshold score for each motif based on the lowest score of that motif in the active promoter set (25). This rule also distinguishes active and inactive full-length promoters. Active and inactive full-length promoters have similar sequence logos (Supplementary Figure S5), and applying our full-length models to score the inactive promoters poorly distinguished between active and inactive promoters (Figure 4G and H). This demonstrates that overall sequence conservation is a poor indicator of promoter function. However, applying minimum module score thresholds for each motif based on the lowest score of that motif in the active promoter set successfully identified 77% of all inactive promoters in vitro and 95% of inactive promoters in vivo (Table 3; Supplementary Figure S6). The combined −10/−35 PWM scores identified the largest number of inactive promoters with 57% in vitro and 78% in vivo scoring below threshold, and in addition the discriminator and ITR motifs scored below threshold in many inactive promoters (Table 3; Supplementary Figure S6). These results show that individual core motif cutoff thresholds are required to accurately distinguish inactive full-length promoters, demonstrating the essentiality of these modules for promoter function. In contrast, the UP models did not distinguish any inactive promoters. This is because several active promoters had no score for their upstream sequences, demonstrating that UP-elements are not essential for promoter function.
Table 3.
Percentage of inactive promoters scoring below module cut-off thresholds.

Data set | Number of inactive promoters | S + D pen | −35 | Spacer | −10 | Disc | Start | ITR | −10/−35ᵃ | Full-length scoreᵇ | All modulesᶜ |
---|---|---|---|---|---|---|---|---|---|---|---|
MR 10 nM RNAP | 30 | 3 | 33 | 7 | 10 | 13 | 7 | 27 | 57 | 33 | 77 |
In vivo basal | 37 | 3 | 54 | 3 | 22 | 24 | 0 | 51 | 78 | 59 | 95 |

ᵃCombined −10 and −35 PWM score threshold. ᵇFull-length model score threshold comprised of core and UP models. ᶜCumulative effect of applying all module cut-off thresholds.
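The minimum-threshold rule used to separate active from inactive promoters can be illustrated with a short Python sketch. It assumes each promoter is represented as a dictionary of module scores; the module names, the scores and the function names are hypothetical placeholders chosen for illustration and are not values or code from this study.

```python
def module_thresholds(active_promoters):
    """Return, for each module, the lowest score seen among active promoters."""
    modules = active_promoters[0].keys()
    return {m: min(p[m] for p in active_promoters) for m in modules}


def modules_below_threshold(promoter, thresholds):
    """List the modules whose score falls below the active-set minimum."""
    return [m for m, cutoff in thresholds.items() if promoter[m] < cutoff]


# Hypothetical module scores (illustrative only, not data from the paper).
active = [
    {"-35": 2.0, "-10": 1.5, "spacer": 0.0},
    {"-35": 1.2, "-10": 2.5, "spacer": -0.5},
]
thresholds = module_thresholds(active)

candidate = {"-35": 0.8, "-10": 1.9, "spacer": -0.5}
failed = modules_below_threshold(candidate, thresholds)
# A promoter is predicted inactive if any single module is below its cutoff.
is_predicted_inactive = bool(failed)
print(failed, is_predicted_inactive)
```

By construction every active promoter passes all cut-offs, so under this rule the thresholds can only remove candidates that score lower than any known active promoter for at least one module.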
Promoter strength at different concentrations of RNAP
Strong UP-elements contribute to promoter strength predominantly by enhancing the binding of RNAP (7,38), suggesting that their relative contribution will decrease as RNAP availability increases. This is a particularly important consideration for ECF σs, where regulation is based on increasing the free pool of σ by releasing this σ from its inhibitory interaction with its cognate anti-σ, enabling formation of the transcriptionally active σ-specific RNAP holoenzyme (2,3). However, there has been no systematic analysis of the behavior of ECF σ regulon promoters as a function of σ-specific RNAP levels. Here, we used our σE promoter library and promoter models to examine the effects of σE-RNAP levels on promoter activity and also to determine how the UP-effect alters as a function of RNAP concentration.
The activity of full-length and core promoters across a range of σE-RNAP levels is displayed as a heat map (Figure 5A and B). The first two columns indicate promoter behavior in vivo for cells expressing basal (low) σE as compared to those overexpressing σE. The other columns indicate effects in vitro using a multi-round assay with a competitor promoter template (PrpoH) across a range of RNAP concentrations (5–50 nM; see ‘Materials and Methods’ section) to mimic promoter competition in vivo. The data are clustered across the different conditions to identify promoters that respond similarly to increasing concentrations of holoenzyme. As expected, the number of active promoters increased with σE-RNAP levels. The UP-effects of the promoters across the different conditions are similarly displayed in Figure 5C. Notably, the UP-effects were highly variable across the different conditions, suggesting a complex response to RNAP levels. Because the in vitro assays are more controlled than those performed in vivo, they may enable us to identify the origin of these variable effects. We performed hierarchical clustering (see ‘Materials and Methods’ section) to identify groups of promoters with similar patterns of activity based on the in vitro competitive multi-round data. Two discrete clusters were identified: Cluster 1, in which the UP-effect decreased as RNAP concentration increased (20 promoters); and Cluster 2, in which the UP-effect increased as RNAP concentration increased (9 promoters) (Figure 6). These opposing UP-effects suggest that additional factors, such as properties of the core promoter, may also influence the UP-effect.
Contribution of the UP-element depends on the competitiveness of the core promoter
The contribution of the UP-element is likely to depend on the binding strength of the core promoter: strong binding core promoters would be expected to relieve the requirements for UP-elements under high levels of RNAP. We therefore examined whether the Cluster 1 and Cluster 2 promoters differed in the binding strengths of their core promoters. The competitive multi-round assays provide a crude measure of relative promoter binding strength. Because the activity of the test promoter is displayed relative to that of the competitor, at low RNAP concentrations test promoters with tight binding will compete well with the strong competitor control promoter and appear more active (e.g. micA and rybB, see Figure 5B). Accordingly, promoters with similar or higher levels of activity relative to the control promoter at low RNAP concentrations were termed strong competitors, whereas those that showed little activity at low RNAP concentration and increased their relative activity only at high RNAP concentrations were termed weak competitors. A competitive index (CI) was derived to describe promoter competitiveness based on promoter activity under low and high RNAP levels: CI = (activity [low RNAP])/(activity [high RNAP]) (Figure 7A). Using the CI, promoters from Cluster 1 were found to be significantly enriched for strong competitor core promoters compared to Cluster 2 (Figure 7B; p = 0.0003 using t-test). Thus, Cluster 1 promoters are likely to become saturated as RNAP levels increase, explaining why their upstream sequences have little effect on promoter strength at high RNAP. Conversely, as Cluster 2 promoters contained weak competitor core promoters that were active only at high RNAP, their UP-effect was only apparent at high levels of RNAP (Figure 7B).
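As a minimal sketch, the competitive index can be computed as the ratio of relative activity at low RNAP to relative activity at high RNAP. The promoter names and activity values below are hypothetical, and no cut-off separating strong from weak competitors is assumed because none is stated here; the promoters are simply ranked by CI.

```python
def competitive_index(activity_low_rnap, activity_high_rnap):
    """CI = activity at low RNAP / activity at high RNAP."""
    if activity_high_rnap <= 0:
        raise ValueError("activity at high RNAP must be positive")
    return activity_low_rnap / activity_high_rnap


# Hypothetical activities relative to the competitor promoter:
# (activity at low RNAP, activity at high RNAP).
relative_activity = {
    "promoter_A": (1.5, 3.0),
    "promoter_B": (0.25, 2.0),
}

ci = {
    name: competitive_index(low, high)
    for name, (low, high) in relative_activity.items()
}

# Higher CI means the promoter retains more of its activity at low RNAP,
# i.e. it behaves as a stronger competitor.
ranked = sorted(ci, key=ci.get, reverse=True)
print(ci, ranked)
```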
Cluster 1 and Cluster 2 upstream sequences differ in their composition
We applied the different UP models to the upstream sequences of the Cluster 1 and 2 promoters, using conditions of maximal UP-effect (Cluster 1: 10 nM RNAP; Cluster 2: 50 nM RNAP), to determine if they differed in their composition or function (Table 1; Supplementary Figures S7 and S8). The best models of UP-effect for Cluster 1 promoters were the same as those identified for all 30 active promoters at 10 nM RNAP (overlapping 3 nt A- and T-tract counts; compare first and second third row of Table 1). In contrast, the best UP-element models for Cluster 2 promoters differed from those derived for all 30 promoters. For Cluster 2 promoters, the far-distal A- and T-tract counts generated the highest correlation with UP-effects (R = 0.84; p = 5 × 10⁻³), while the proximal and distal subsite A- and T-tract models performed poorly (Table 1). Also, the percentage AT content generated moderate correlations with UP-effect. Interestingly, the Cluster 2 upstream regions are more AT-rich than the Cluster 1 regions (70% versus 61%, respectively). These results suggest that the α-subunits have different requirements for activity at Cluster 2 promoters.
Modeling the strength of promoters at different RNAP concentrations
Up to this point, our models were constructed from the promoter subset active at low RNAP concentration and the activity of the full-length promoter was modeled at low RNAP. However, we have identified two distinct clusters of promoters with differential UP-effects depending on RNAP concentration. We therefore expanded our modeling to examine the properties of full-length promoters at high concentrations of RNAP. For each dataset we applied our full-length promoter modeling approach, constructing models only from the active promoters and testing their ability to predict promoter strength and to accurately distinguish inactive from active promoters under each condition (Figure 8). As expected, models constructed on datasets with low RNAP levels (in vitro, 10 nM RNAP; in vivo, basal σE levels) performed better at predicting full-length promoter strength and distinguishing inactive promoters than models using datasets with high RNAP levels (in vitro, 50 nM RNAP; in vivo, over-expression of σE). Indeed, models constructed from all 46 promoters active at 50 nM RNAP could not be optimized and correlated only poorly with full-length promoter strength (R = 0.44) [constructing models from the top 30 active promoters yielded an improved correlation (R = 0.69)]. Also, models based on in vivo over-expression of σE identified only 59% of inactive promoters.
Significantly, all models based on datasets obtained at low RNAP levels are comprised of similar modules, suggesting a similarity of promoter function across these conditions. In contrast, models derived from datasets at high RNAP levels tended to have different solutions, suggesting altered motif requirements under these conditions. For example, under these conditions, there is no UP model for Cluster 1 promoters since none of the UP models improved overall model performance and the UP-effect is minimal at high RNAP levels; for Cluster 2 promoters the model is comprised both of different core modules, and an UP-model solution that differs from the best UP-effect models for the same set of promoters (compare with Table 1). The different solutions at high RNAP levels are likely due to the active promoters being comprised of a mixed population of weak or strong UP-effects and also weak or strong competitive core regions. In contrast, at low RNAP levels only promoters that have high binding affinities (high KB) and therefore are strong competitors for RNAP are active. These differences are likely to reduce the efficacy of a single predictive model of the type described here to capture promoter strength at higher RNAP levels.
DISCUSSION
This work presents the quantitative sequence requirements and descriptive promoter strength models of full-length promoters for the first time. By comparing the strength of core and full-length σE-dependent promoter sequences, we have been able to quantify the contribution of upstream sequences to promoter strength. We find a large range of UP-effects across active promoters. Significantly, their effects can be successfully modeled using descriptions of A- and T-tract frequency and length, which are consistent with the known binding requirements of the α subunits. We also find that full-length promoters can be modeled by combining the UP-element models with core models comprised of PWMs of key motifs and penalties for suboptimal motif positions. These models provide important metrics with which to dissect the requirements of promoter structure and function. In addition, we have identified important properties of UP-element and core promoters that modify the behavior of promoters across a range of RNAP concentrations. These findings impact both the design of models for promoter prediction and the design of promoters for applications in synthetic biology.
UP-element structure
Our understanding of UP-element structure was previously based on SELEX studies, which found that long continuous A- and T-tracts had strong UP-element function (9,10). This finding is consistent with the fact that optimal α-DNA interactions require a narrowed minor groove (13–15). However, naturally occurring UP-elements, which consist of shorter A- and T-tracts that differ both in length and frequency, had not been modeled (Figure 3). Our modeling of natural UP-elements showed that models based on frequency and length of A- and T-tracts capture important properties of natural UP-elements. Our best overall UP-element model consisted of the frequency of overlapping 3 nt A- and T-tracts. The 3 nt tract is the minimum A-tract length for maximal minor groove narrowing (15), and is most likely to capture the contribution of these more ‘broken’ A- and T-tracts in natural promoters. In addition, our models show that the distal and proximal subsites are the most significant contributors to the UP-effect at most promoters. This likely reflects the optimal binding location of the α subunits at activator-independent promoters (6,9,10,12). Finally, the best subsite models were for the distal subsite: measuring the frequency of overlapping 3 nt A- and T-tracts, and also the length of A-tract followed by T-tract, a feature characteristic of optimized distal subsites (9). It is likely that this combination of A- followed by T-tract provides a stable region of narrowed minor groove, since the flanking A- and T-tracts will generate minor groove narrowing from both directions of the DNA. The key remaining questions are the effects of different tract lengths and locations, single nucleotide interruptions and the composition of flanking sequences. Answering such questions requires much larger sequence libraries to provide the fine resolution necessary for addressing these issues.
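Counting overlapping 3 nt A- and T-tracts in an upstream sequence can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, the example sequence is invented, and summing the A-tract and T-tract counts into a single number is an assumption rather than the scoring scheme used for the models reported here.

```python
def count_overlapping_tracts(seq, base, k=3):
    """Count overlapping runs of `base` of length k (e.g. AAA) in seq."""
    seq = seq.upper()
    tract = base.upper() * k
    return sum(
        1 for i in range(len(seq) - k + 1) if seq[i:i + k] == tract
    )


def up_tract_count(seq, k=3):
    """Combined count of overlapping k-nt A-tracts and T-tracts (assumed sum)."""
    return (count_overlapping_tracts(seq, "A", k)
            + count_overlapping_tracts(seq, "T", k))


# Invented upstream sequence: AAAA contains two overlapping AAA windows,
# TTT contains one TTT window.
upstream = "AAAATTTGC"
a_count = count_overlapping_tracts(upstream, "A")
t_count = count_overlapping_tracts(upstream, "T")
print(a_count, t_count, up_tract_count(upstream))
```

Because windows overlap, a continuous tract of length n contributes n − 2 counts for k = 3, so longer tracts score higher while short ‘broken’ tracts still contribute.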
Full-length promoter models
Combining the UP-element model with our previously described core promoter model enabled us to evaluate the contributions of all promoter motifs to promoter strength (Figures 4A and B and 8). This analysis revealed that although distal and proximal UP-elements are major contributors to promoter strength, the motifs important for strength of the core promoter [PWMs of the −35, −10 and start motifs, and spacer penalties; (25)] are also important for strength of the full promoter. Thus, UP-elements strongly contribute to promoter strength, but they do not mask the requirement for core promoter motifs. Indeed, similar modules distinguish inactive from active promoters in both core and full promoters (25) (Table 3; Supplementary Figure S6). Therefore, it is the properties of the core promoter that determine promoter functionality in ECF σ-type promoters as UP-elements are not required for promoter function and do not compensate for the poor functional characteristics of core promoters.
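The additive structure of the full-length score (core module scores plus an UP score) can be sketched as follows. The log-odds PWM construction with a pseudocount and uniform background is a common textbook approach and is an assumption here, not necessarily the procedure used for the models in this work; the aligned sites, penalty values and UP score are invented for illustration.

```python
import math


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log2-odds PWM from equal-length aligned sites."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    pwm = []
    for pos in range(length):
        counts = {b: pseudocount for b in "ACGT"}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append(
            {b: math.log2((counts[b] / total) / background) for b in "ACGT"}
        )
    return pwm


def pwm_score(pwm, motif):
    """Sum the per-position log-odds for a motif of matching length."""
    if len(motif) != len(pwm):
        raise ValueError("motif length must match PWM length")
    return sum(column[base] for column, base in zip(pwm, motif))


def full_length_score(core_module_scores, up_score):
    """Total promoter score = sum of core module scores + UP model score."""
    return sum(core_module_scores.values()) + up_score


# Invented aligned -35 sites (illustrative only).
pwm_35 = build_pwm(["GAAC", "GAAC", "GAAT"])

core = {
    "S + D pen": -0.5,                    # hypothetical spacer/discriminator penalty
    "PWM-35": pwm_score(pwm_35, "GAAC"),
    "PWM-10": 1.0,                        # hypothetical
    "PWMstart": 0.5,                      # hypothetical
}
total = full_length_score(core, up_score=3.0)  # hypothetical UP score
print(round(total, 2))
```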
Our results suggest that too many strong interactions between RNAP and the promoter hinder escape of RNAP from σE-dependent promoters, as was previously found for near-consensus σ70 promoters (39–41). The strongest σE-dependent promoters were composed of combinations of high and low scoring motifs and the strengths of the −10 and −35 motifs are negatively correlated (Supplementary Figure S6; data not shown), as expected if strong interactions are carefully calibrated for maximal activity. Intriguingly, although there is very little sequence conservation in the spacer between the −10 and −35 motifs, except for a minor enrichment of A/T residues in its central portion, the spacer sequence is exceptionally strongly negatively correlated with promoter strength (Figure 8 and Supplementary Figure S6). We suggest that the spacer sequence affects promoter strength by altering DNA flexibility since this region undergoes large conformational changes during the formation of the initiation complex (reviewed in (42)). Suboptimal spacer sequences, likely to lack enriched A/T residues, may destabilize the bound RNAP initiation complex. At the strongest promoters, the non-conserved spacer may balance the requirement for strong interactions that promote RNAP binding with unstable interactions to facilitate promoter escape.
Promoter strength across different levels of RNAP
We report the first systematic analysis of promoter strength as a function of RNAP concentration. Our models for both the core promoter and the UP-element capture parameters related to binding, rather than subsequent steps of transcription initiation. The fact that these models performed very well only at low RNAP concentration indicates that such parameters dominate promoter strength under these conditions. Thus, we suggest that binding affinity (KB) governs the activity of σE promoters at low concentrations of RNAP. In accord with this conclusion, most promoters active at low RNAP fall into our ‘competitive’ Cluster 1 promoter class (Figure 7). Moreover, these promoters show decreasing UP-element stimulation at higher RNAP concentration, consistent with expectations for core promoters having tight binding. As RNAP increases, binding of the core promoter to RNAP will saturate, limiting the effect of additional binding conferred by the UP-element.
Importantly, systematic modeling across RNAP concentrations revealed that the σE regulon has a second distinct promoter type, which differs in both properties and sequence from the predominant Cluster 1 promoters. Cluster 2 promoters are likely to be weaker binding: their core sequences compete poorly at low RNAP concentration; however, they show increasing stimulation by UP-elements with increasing RNAP concentration. Consequently, most Cluster 2 promoters were only active at high RNAP concentrations. Cluster 1 and Cluster 2 promoters differ in their −35 sequences. Although both contain the consensus core element of the −35 element (GAAC) (43), their flanking sequences differ (Supplementary Figure S7). In particular, the downstream T-tract is present only in Cluster 1. T-tracts provide a rigid structural unit and the first T is involved in non-specific contacts with σE R149 (43). Thus, Cluster 1 and 2 promoters could have an altered trajectory of DNA, possibly explaining why the most important descriptor of UP-effect in Cluster 2 promoters is the far-distal UP-site. Interestingly, the presence of sequences in the far upstream UP-element region has been shown to be important for increasing the rate of isomerization of the open complex (38,44), suggesting that UP sites at Cluster 2 promoters may facilitate a step subsequent to initial promoter binding.
Our initial σE promoter identification efforts were directed at identifying all sequences able to function as σE promoters, so that we could examine different categories of promoters. Thus, in vivo detection used the sensitive 5′ RACE technique following σE overexpression, and in vitro detection employed high levels of RNA polymerase with no competitor template (8,23). This important decision allowed us to dissect the promoter properties of diverse sequences and enabled us to classify promoters as Cluster 1-type, Cluster 2-type and weak/inactive. Although Cluster 1-type promoters are generally active at low RNAP concentration, a subset (7/20) are active only at high RNAP concentration. These promoters typically have weak or medium strength competitive core promoters and low scoring UP-elements, suggesting that these promoters also have weak overall binding strengths. Cluster 2 promoters are active only at high RNAP concentration and most (6/9) have weakly competitive core promoters and relatively strong UP-elements. Finally, promoters classified as inactive even under high RNAP have a very low scoring core motif and often weak UP-elements, suggestive of function in vivo only when levels of free σE are extremely high (8).
The discovery of the distinct Cluster 1 and Cluster 2 promoter types is important for our general understanding of ECF σ responses. The activity of most ECF σ promoters is primarily regulated by changes in the concentration of the active σ. Our studies suggest that only select promoters will be active at low to moderate levels of the ECF σ and that regulons may have additional promoter types designed to be active only under extreme conditions. Such a tiered response increases the regulatory capacity of ECF σs.
Distinct properties of alternative σ promoters
Although we were able to model the strength of near-consensus UP-elements at σ70 promoters (9,10) using overlapping A- and T-tract 3 nt counts and length (Supplementary Figure S9; Supplementary Table S4), there are important differences between the UP-element performance at housekeeping (σ70) promoters and at σE promoters. UP-elements recruit σ70 holoenzyme to weak promoters, thereby dramatically increasing promoter strength (6,11,45). In contrast, our data indicate that the presence of UP-elements at σE promoters does not significantly relieve the requirement for well-conserved core promoter sequences. Moreover, the behavior of Cluster 2 promoters indicates that there is a minimum RNAP concentration requirement for UP-element function: below this concentration there is insufficient promoter occupancy to enable isomerization. Thus, at σE promoters, UP-elements are subsidiary to the core promoter elements whereas at σ70 promoters, they may be able to substitute, in part, for core promoter elements.
These findings are consistent with an emerging view of the differences between the housekeeping σs and the diverged alternative σs (ECF, Group 4 σs and Group 3 σs). Whereas housekeeping σs recognize thousands of promoters genome-wide that are comprised of partially redundant, poorly conserved −35, −10 and extended −10 motifs (42), the diverged alternative σs recognize 10- to 100-fold fewer promoters, which are comprised of more highly conserved core promoters that require every promoter motif for function. Recent studies indicate that a major contributing factor to this differential promoter usage is that housekeeping σs contain key aromatic residues that facilitate promoter melting (46,47), whereas most alternative σs lack some of these key residues, resulting in a suboptimal melting capacity (48). Consequently, the strong melting capacity of housekeeping σs enables their tolerance of poorly conserved promoters, since only transient occupancy of the promoter is sufficient to enable melting. In contrast, the weak melting ability of alternative σs results in slow isomerization to open complex: consequently, only near-consensus promoters provide a sufficiently slow dissociation rate to enable melting to occur (48,49). The difference in promoter melting ability between the housekeeping and alternative σs has important implications for the effect of UP-elements on promoter strength. The strong melting capacity of housekeeping σs enables weak promoters to be strongly activated by UP-elements that ‘recruit’ RNAP to these sequences. In contrast, the weak melting capacity of alternative σs confines UP-element function to well-conserved promoters capable of supporting open complex formation.
Constructing models for promoter prediction
Our modeling efforts suggest a pathway for predicting the biological circuits of newly discovered ECF σs. First, promoters should be identified using high-throughput experiments performed at low concentrations of the σ (e.g. basal or near basal conditions) as the tools currently available are best suited to model promoters under binding limited conditions. Second, outliers to general trends should be removed as they detract from the general predictive value of our models. Importantly, 7/8 promoters identified as outliers in our experiments contained at least one very low scoring module (five different modules in total; Supplementary Figure S6). These promoters are likely to contain sequences that functionally compensate for the particular low scoring module in that promoter. The fact that there are different low scoring modules suggests that there may be several solutions to compensate for suboptimal modules. As the lowest z-score for each module varies in active promoters (see Supplementary Figure S6), the best approach for optimizing new models would be to remove promoters containing modules with discrepantly low z-scores. Third, all core promoter motifs should be used to discriminate functional (e.g. active) from inactive or very weak promoters. Finally, only select motifs (including UP-elements) should be used to estimate promoter strength. Our results make it clear that there is a need for new approaches that model additional facets of the transcription process beyond the initial binding step. We envision that DNA structure and flexibility will provide additional readouts to help model such steps.
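The outlier-removal step suggested above can be sketched as a per-module z-score filter. This is a hedged illustration: computing the z-score across the promoter set with the sample standard deviation, the cut-off of −2.0, and the promoter names and scores are all assumptions made for the example, not parameters reported in this study.

```python
from statistics import mean, stdev


def module_z_scores(scores_by_promoter):
    """z-score of one module's score for each promoter across the set."""
    names = list(scores_by_promoter)
    values = [scores_by_promoter[n] for n in names]
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation; needs >= 2 values
    return {n: (v - mu) / sigma for n, v in zip(names, values)}


def flag_low_outliers(scores_by_promoter, z_cutoff=-2.0):
    """Promoters whose module z-score is below the (assumed) cut-off."""
    z = module_z_scores(scores_by_promoter)
    return [name for name, value in z.items() if value < z_cutoff]


# Hypothetical -10 module scores for seven promoters (illustrative only).
minus10_scores = {
    "P1": 2.0, "P2": 2.1, "P3": 1.9, "P4": 2.0,
    "P5": 2.2, "P6": 1.8, "P7": -1.0,
}
flagged = flag_low_outliers(minus10_scores)
print(flagged)
```

In practice the filter would be applied to each module in turn and the model rebuilt from the remaining promoters; with small promoter sets a single extreme value inflates the standard deviation, so the cut-off would need to be chosen with care.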
Designing promoters for synthetic biology
Synthetic biology designs genetic circuits for particular outputs, including transplantation of metabolic pathways into suitable hosts (50–52). These systems require careful engineering such that the expression levels of the genetic components are tuned to an appropriate input level for the next section of the circuit. This ensures desired circuit behavior and reduces toxicity of pathway intermediates (53). An example of such ‘tunability’ has been achieved by altering ribosome binding site (rbs) sequences to adjust rbs strength and hence protein translation using an ‘rbs calculator’ based on thermodynamic principles (54). This and our previous work (25) provide the foundation for developing an analogous ‘promoter calculator’ for engineering promoters of specific strengths for genetic circuits. We suggest that a minimum predictable promoter unit should extend from −65 to +20. This will include the UP-elements that dramatically affect promoter strength even of alternative sigmas, and also the downstream +1 and initial transcribed region (ITR) that can affect promoter function by modifying open complex stability and promoter escape (25,55).
SUMMARY
Many of our findings on modeling σE promoters will be applicable to other alternative σs. Their core promoter motifs are well conserved, making them tractable to modeling, and the sequence requirements of the UP-elements are conserved across bacteria. The rapid application of next generation sequencing to RNA-seq is now providing a wealth of high-resolution information on transcript start sites at a genomic level (56–58). This dramatically simplifies identifying promoter sequences, which are located directly upstream of start sites, enabling construction of descriptive promoter models for entire genomes. RNA-seq also provides quantitative information on transcript abundance and hence promoter strength, which would enable optimization of promoter models with strength. This will enable the construction of promoter strength models that can then be used for promoter predictions in closely related genomes that share orthologous σs, thereby rapidly expanding the characterization of transcriptional networks across bacteria.
SUPPLEMENTARY DATA
Supplementary Data are available at NAR Online: Supplementary Figures 1–9, Supplementary Tables 1–4, and Supplementary References [9,10].
FUNDING
Funding for open access charge: National Institutes of Health (GM57755 to C.A.G.).
Conflict of interest statement. None declared.
ACKNOWLEDGEMENTS
The authors thank Adam Arkin, Steve Busby, Rick Gourse, Hao Li, Wilma Ross and Chris Voigt for many helpful discussions, and one of our reviewers who suggested using overlapping 3 nt A- and T-tract counts rather than the non-overlapping count we previously employed.
REFERENCES
- 1. Gruber TM, Gross CA. Multiple sigma subunits and the partitioning of bacterial transcription space. Annu. Rev. Microbiol. 2003;57:441–466. doi: 10.1146/annurev.micro.57.030502.090913.
- 2. Helmann JD. The extracytoplasmic function (ECF) sigma factors. Adv. Microb. Physiol. 2002;46:47–110. doi: 10.1016/s0065-2911(02)46002-x.
- 3. Staron A, Sofia HJ, Dietrich S, Ulrich LE, Liesegang H, Mascher T. The third pillar of bacterial signal transduction: classification of the extracytoplasmic function (ECF) sigma factor protein family. Mol. Microbiol. 2009;74:557–581. doi: 10.1111/j.1365-2958.2009.06870.x.
- 4. Young BA, Gruber TM, Gross CA. Minimal machinery of RNA polymerase holoenzyme sufficient for promoter melting. Science. 2004;303:1382–1384. doi: 10.1126/science.1092462.
- 5. Gourse RL, Ross W, Gaal T. UPs and downs in bacterial transcription initiation: the role of the alpha subunit of RNA polymerase in promoter recognition. Mol. Microbiol. 2000;37:687–695. doi: 10.1046/j.1365-2958.2000.01972.x.
- 6. Ross W, Gosink KK, Salomon J, Igarashi K, Zou C, Ishihama A, Severinov K, Gourse RL. A third recognition element in bacterial promoters: DNA binding by the alpha subunit of RNA polymerase. Science. 1993;262:1407–1413. doi: 10.1126/science.8248780.
- 7. Rao L, Ross W, Appleman JA, Gaal T, Leirmo S, Schlax PJ, Record MT, Jr, Gourse RL. Factor independent activation of rrnB P1. An “extended” promoter with an upstream element that dramatically increases promoter strength. J. Mol. Biol. 1994;235:1421–1435. doi: 10.1006/jmbi.1994.1098.
- 8. Mutalik VK, Nonaka G, Ades SE, Rhodius VA, Gross CA. Promoter strength properties of the complete sigma E regulon of Escherichia coli and Salmonella enterica. J. Bacteriol. 2009;191:7279–7287. doi: 10.1128/JB.01047-09.
- 9. Estrem ST, Ross W, Gaal T, Chen ZW, Niu W, Ebright RH, Gourse RL. Bacterial promoter architecture: subsite structure of UP elements and interactions with the carboxy-terminal domain of the RNA polymerase alpha subunit. Genes Dev. 1999;13:2134–2147. doi: 10.1101/gad.13.16.2134.
- 10. Estrem ST, Gaal T, Ross W, Gourse RL. Identification of an UP element consensus sequence for bacterial promoters. Proc. Natl Acad. Sci. USA. 1998;95:9761–9766. doi: 10.1073/pnas.95.17.9761.
- 11. Aiyar SE, Gourse RL, Ross W. Upstream A-tracts increase bacterial promoter activity through interactions with the RNA polymerase alpha subunit. Proc. Natl Acad. Sci. USA. 1998;95:14652–14657. doi: 10.1073/pnas.95.25.14652.
- 12. Murakami K, Kimura M, Owens JT, Meares CF, Ishihama A. The two alpha subunits of Escherichia coli RNA polymerase are asymmetrically arranged and contact different halves of the DNA upstream element. Proc. Natl Acad. Sci. USA. 1997;94:1709–1714. doi: 10.1073/pnas.94.5.1709.
- 13. Ross W, Ernst A, Gourse RL. Fine structure of E. coli RNA polymerase-promoter interactions: alpha subunit binding to the UP element minor groove. Genes Dev. 2001;15:491–506. doi: 10.1101/gad.870001.
- 14. Benoff B, Yang H, Lawson CL, Parkinson G, Liu J, Blatter E, Ebright YW, Berman HM, Ebright RH. Structural basis of transcription activation: the CAP-alpha CTD-DNA complex. Science. 2002;297:1562–1566. doi: 10.1126/science.1076376.
- 15. MacDonald D, Herbert K, Zhang X, Pologruto T, Lu P. Solution structure of an A-tract DNA bend. J. Mol. Biol. 2001;306:1081–1098. doi: 10.1006/jmbi.2001.4447.
- 16. Jeon YH, Yamazaki T, Otomo T, Ishihama A, Kyogoku Y. Flexible linker in the RNA polymerase alpha subunit facilitates the independent motion of the C-terminal activator contact domain. J. Mol. Biol. 1997;267:953–962. doi: 10.1006/jmbi.1997.0902.
- 17. Ross W, Schneider DA, Paul BJ, Mertens A, Gourse RL. An intersubunit contact stimulating transcription initiation by E. coli RNA polymerase: interaction of the alpha C-terminal domain and sigma region 4. Genes Dev. 2003;17:1293–1307. doi: 10.1101/gad.1079403.
- 18. Chen H, Tang H, Ebright RH. Functional interaction between RNA polymerase alpha subunit C-terminal domain and sigma70 in UP-element- and activator-dependent transcription. Mol. Cell. 2003;11:1621–1633. doi: 10.1016/s1097-2765(03)00201-6.
- 19. Typas A, Hengge R. Differential ability of sigma(s) and sigma70 of Escherichia coli to utilize promoters containing half or full UP-element sites. Mol. Microbiol. 2005;55:250–260. doi: 10.1111/j.1365-2958.2004.04382.x.
- 20. Naryshkin N, Revyakin A, Kim Y, Mekler V, Ebright RH. Structural organization of the RNA polymerase-promoter open complex. Cell. 2000;101:601–611. doi: 10.1016/s0092-8674(00)80872-7.
- 21. Newlands JT, Josaitis CA, Ross W, Gourse RL. Both fis-dependent and factor-independent upstream activation of the rrnB P1 promoter are face of the helix dependent. Nucleic Acids Res. 1992;20:719–726. doi: 10.1093/nar/20.4.719.
- 22. Meng W, Belyaeva T, Savery NJ, Busby SJ, Ross WE, Gaal T, Gourse RL, Thomas MS. UP element-dependent transcription at the Escherichia coli rrnB P1 promoter: positional requirements and role of the RNA polymerase alpha subunit linker. Nucleic Acids Res. 2001;29:4166–4178. doi: 10.1093/nar/29.20.4166.
- 23. Rhodius VA, Suh WC, Nonaka G, West J, Gross CA. Conserved and Variable Functions of the sigma(E) Stress Response in Related Genomes. PLoS Biol. 2006;4:43–59. doi: 10.1371/journal.pbio.0040002.
- 24. Stormo GD. DNA binding sites: representation and discovery. Bioinformatics. 2000;16:16–23. doi: 10.1093/bioinformatics/16.1.16.
- 25. Rhodius VA, Mutalik VK. Predicting strength and function for promoters of the Escherichia coli alternative sigma factor, sigmaE. Proc. Natl Acad. Sci. USA. 2010;107:2854–2859. doi: 10.1073/pnas.0915066107.
- 26. Sambrook J, Fritsch EF, Maniatis T. Molecular Cloning. A Laboratory Manual. 2nd edn. New York: Cold Spring Harbor Laboratory Press; 1989.
- 27. Jensen KF. The Escherichia coli K-12 “wild types” W3110 and MG1655 have an rph frameshift mutation that leads to pyrimidine starvation due to low pyrE expression levels. J. Bacteriol. 1993;175:3401–3407. doi: 10.1128/jb.175.11.3401-3407.1993.
- 28. Casadaban MJ, Cohen SN. Analysis of gene control signals by DNA fusion and cloning in Escherichia coli. J. Mol. Biol. 1980;138:179–207. doi: 10.1016/0022-2836(80)90283-1.
- 29. Zaslaver A, Mayo AE, Rosenberg R, Bashkin P, Sberro H, Tsalyuk M, Surette MG, Alon U. Just-in-time transcription program in metabolic pathways. Nat. Genet. 2004;36:486–491. doi: 10.1038/ng1348.
- 30. De Las Penas A, Connolly L, Gross CA. SigmaE is an essential sigma factor in Escherichia coli. J. Bacteriol. 1997;179:6862–6864. doi: 10.1128/jb.179.21.6862-6864.1997.
- 31. McDowell JC, Roberts JW, Jin DJ, Gross C. Determination of intrinsic transcription termination efficiency by RNA polymerase elongation rate. Science. 1994;266:822–825. doi: 10.1126/science.7526463.
- 32. Young BA, Anthony LC, Gruber TM, Arthur TM, Heyduk E, Lu CZ, Sharp MM, Heyduk T, Burgess RR, Gross CA. A coiled-coil from the RNA polymerase beta' subunit allosterically induces selective nontemplate strand binding by sigma(70). Cell. 2001;105:935–944. doi: 10.1016/s0092-8674(01)00398-1.
- 33. Rouviere PE, De Las Penas A, Mecsas J, Lu CZ, Rudd KE, Gross CA. rpoE, the gene encoding the second heat-shock sigma factor, sigma E, in Escherichia coli. EMBO J. 1995;14:1032–1042. doi: 10.1002/j.1460-2075.1995.tb07084.x.
- 34.Crooks GE, Hon G, Chandonia JM, Brenner SE. WebLogo: a sequence logo generator. Genome Res. 2004;14:1188–1190. doi: 10.1101/gr.849004. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35.Stormo GD. Consensus patterns in DNA. Methods Enzymol. 1990;183:211–221. doi: 10.1016/0076-6879(90)83015-2. [DOI] [PubMed] [Google Scholar]
- 36.Wold S, Sjostrom M, Eriksson L. PLS-regression: a basic tool of chemometrics. Chemometr. Intell. Lab. 2001;58:109–130. [Google Scholar]
- 37.Berg OG, von Hippel PH. Selection of DNA binding sites by regulatory proteins. Statistical-mechanical theory and application to operators and promoters. J. Mol. Biol. 1987;193:723–750. doi: 10.1016/0022-2836(87)90354-8. [DOI] [PubMed] [Google Scholar]
- 38.Ross W, Gourse RL. Sequence-independent upstream DNA-alphaCTD interactions strongly stimulate Escherichia coli RNA polymerase-lacUV5 promoter association. Proc. Natl Acad. Sci. USA. 2005;102:291–296. doi: 10.1073/pnas.0405814102. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 39.Miroslavova NS, Busby SJ. Investigations of the modular structure of bacterial promoters. Biochem. Soc. Symp. 2006:1–10. doi: 10.1042/bss0730001. [DOI] [PubMed] [Google Scholar]
- 40.Ellinger T, Behnke D, Knaus R, Bujard H, Gralla JD. Context-dependent effects of upstream A-tracts. Stimulation or inhibition of Escherichia coli promoter function. J. Mol. Biol. 1994;239:466–475. doi: 10.1006/jmbi.1994.1389. [DOI] [PubMed] [Google Scholar]
- 41.Ellinger T, Behnke D, Bujard H, Gralla JD. Stalling of Escherichia coli RNA polymerase in the +6 to +12 region in vivo is associated with tight binding to consensus promoter elements. J. Mol. Biol. 1994;239:455–465. doi: 10.1006/jmbi.1994.1388. [DOI] [PubMed] [Google Scholar]
- 42.Hook-Barnard IG, Hinton DM. Transcription initiation by mix and match elements: flexibility for polymerase binding to bacterial promoters. Gene Regul. Syst. Biol. 2007;1:275–293. [PMC free article] [PubMed] [Google Scholar]
- 43.Lane WJ, Darst SA. The structural basis for promoter -35 element recognition by the group IV sigma factors. PLoS Biol. 2006;4:e269. doi: 10.1371/journal.pbio.0040269. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 44.Davis CA, Capp MW, Record MT, Jr, Saecker RM. The effects of upstream DNA on open complex formation by Escherichia coli RNA polymerase. Proc. Natl Acad. Sci. USA. 2005;102:285–290. doi: 10.1073/pnas.0405779102. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 45.Ross W, Aiyar SE, Salomon J, Gourse RL. Escherichia coli promoters with UP elements of different strengths: modular structure of bacterial promoters. J. Bacteriol. 1998;180:5375–5383. doi: 10.1128/jb.180.20.5375-5383.1998. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 46.Schroeder LA, Gries TJ, Saecker RM, Record MT, Jr, Harris ME, DeHaseth PL. Evidence for a tyrosine-adenine stacking interaction and for a short-lived open intermediate subsequent to initial binding of Escherichia coli RNA polymerase to promoter DNA. J. Mol. Biol. 2009;385:339–349. doi: 10.1016/j.jmb.2008.10.023. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 47.Tomsic M, Tsujikawa L, Panaghie G, Wang Y, Azok J, deHaseth PL. Different roles for basic and aromatic amino acids in conserved region 2 of Escherichia coli sigma(70) in the nucleation and maintenance of the single-stranded DNA bubble in open RNA polymerase-promoter complexes. J. Biol. Chem. 2001;276:31891–31896. doi: 10.1074/jbc.M105027200. [DOI] [PubMed] [Google Scholar]
- 48.Koo BM, Rhodius VA, Nonaka G, deHaseth PL, Gross CA. Reduced capacity of alternative sigmas to melt promoters ensures stringent promoter recognition. Genes Dev. 2009;23:2426–2436. doi: 10.1101/gad.1843709. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 49.Feklistov A, Darst SA. Promoter recognition by bacterial alternative sigma factors: the price of high selectivity? Genes Dev. 2009;23:2371–2375. doi: 10.1101/gad.1862609. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 50.Steen EJ, Kang Y, Bokinsky G, Hu Z, Schirmer A, McClure A, Del Cardayre SB, Keasling JD. Microbial production of fatty-acid-derived fuels and chemicals from plant biomass. Nature. 2010;463:559–562. doi: 10.1038/nature08721. [DOI] [PubMed] [Google Scholar]
- 51.Tamsir A, Tabor JJ, Voigt CA. Robust multicellular computing using genetically encoded NOR gates and chemical ‘wires’. Nature. 2011;469:212–215. doi: 10.1038/nature09565. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 52.Elowitz MB, Leibler S. A synthetic oscillatory network of transcriptional regulators. Nature. 2000;403:335–338. doi: 10.1038/35002125. [DOI] [PubMed] [Google Scholar]
- 53.Lucks JB, Qi L, Whitaker WR, Arkin AP. Toward scalable parts families for predictable design of biological circuits. Curr. Opin. Microbiol. 2008;11:567–573. doi: 10.1016/j.mib.2008.10.002. [DOI] [PubMed] [Google Scholar]
- 54.Salis HM, Mirsky EA, Voigt CA. Automated design of synthetic ribosome binding sites to control protein expression. Nat. Biotechnol. 2009;27:946–950. doi: 10.1038/nbt.1568. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 55.Hsu LM, Cobb IM, Ozmore JR, Khoo M, Nahm G, Xia L, Bao Y, Ahn C. Initial transcribed sequence mutations specifically affect promoter escape properties. Biochemistry. 2006;45:8841–8854. doi: 10.1021/bi060247u. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 56.Sharma CM, Hoffmann S, Darfeuille F, Reignier J, Findeiss S, Sittka A, Chabas S, Reiche K, Hackermuller J, Reinhardt R, et al. The primary transcriptome of the major human pathogen Helicobacter pylori. Nature. 2010;464:250–255. doi: 10.1038/nature08756. [DOI] [PubMed] [Google Scholar]
- 57.Cho BK, Zengler K, Qiu Y, Park YS, Knight EM, Barrett CL, Gao Y, Palsson BO. The transcription unit architecture of the Escherichia coli genome. Nat. Biotechnol. 2009;27:1043–1049. doi: 10.1038/nbt.1582. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 58.Guell M, van Noort V, Yus E, Chen WH, Leigh-Bell J, Michalodimitrakis K, Yamada T, Arumugam M, Doerks T, Kuhner S, et al. Transcriptome complexity in a genome-reduced bacterium. Science. 2009;326:1268–1271. doi: 10.1126/science.1176951. [DOI] [PubMed] [Google Scholar]