Abstract
We present PromoterPredict, a dynamic multiple regression approach to predict the strength of Escherichia coli promoters binding the σ70 factor of RNA polymerase. σ70 promoters are ubiquitously used in recombinant DNA technology, but characterizing their strength is demanding in terms of both time and money. We parsed a comprehensive database of bacterial promoters for the −35 and −10 hexamer regions of σ70-binding promoters and used these sequences to construct the respective position weight matrices (PWM). Next we used a well-characterized set of promoters to train a multivariate linear regression model and learn the mapping between PWM scores of the −35 and −10 hexamers and the promoter strength. We found that the log of the promoter strength is significantly linearly associated with a weighted sum of the −10 and −35 sequence profile scores. We applied our model to 100 sets of 100 randomly generated promoter sequences to generate a sampling distribution of mean strengths of random promoter sequences and obtained a mean of 6E-4 ± 1E-7. Our model was further validated by cross-validation and on independent datasets of characterized promoters. PromoterPredict accepts −10 and −35 hexamer sequences and returns the predicted promoter strength. It is capable of dynamic learning from user-supplied data to refine the model construction and yield more robust estimates of promoter strength. PromoterPredict is available as both a web service (https://promoterpredict.com) and standalone tool (https://github.com/PromoterPredict). Our work presents an intuitive generalization applicable to modelling the strength of other promoter classes.
Keywords: Regression modelling, Promoter sequences, Promoter strength prediction, Sigma70 promoters, Genetic engineering, Weak promoters, PWM construction, Data mining, Software tools
Introduction
The primary Escherichia coli promoter-specificity factor and the one widely used in recombinant DNA technology is the σ70 factor. Promoters recognized by σ70-containing RNA polymerase are called core promoters and share the following features: two conserved hexamer sequences, separated by a non-specific spacer of ideally 17 nucleotides. The two hexamers are located ∼35 and ∼10 bp upstream of the transcription start site, and are called the −35 and −10 sequences, respectively (Maquat & Reznikoff, 1978; Bujard, 1980; Paget & Helmann, 2003; Kadonaga, 2012). −35 and −10 sequences matching the consensus motifs (TTGACA and TATAAT, respectively) are known as canonical hexamers (Galas, Eggert & Waterman, 1985; Deuschle et al., 1986; Stormo, 1990). The conserved hexamer regions are vital for recognizing and optimizing the interactions between DNA and the RNA polymerase (Hawley & McClure, 1983; Knaus & Bujard, 1990; Hook-Barnard, Johnson & Hinton, 2006; Feklistov & Darst, 2011; Basu et al., 2014).
Theory has yielded a linear relationship between the total promoter score and the natural log of promoter strength (Berg & Von Hippel, 1987; Li & Zhang, 2014). Nucleotide occurrence frequencies were first used by Weller & Recknagel (1994) in promoter strength prediction. Additivity in promoter-polymerase interaction has been affirmed by Benos, Bulyk & Stormo (2002). Patterns in σ70 promoters have been quantified by Huerta & Collado-Vides (2003). The strength of E. coli σE RNA polymerase promoters was studied by Rhodius & Mutalik (2010). The complexity of E. coli σ70 promoter sequences has been treated from an information theoretic standpoint by Shultzaberger et al. (2007). More recently, a support vector machine (SVM) model has been successfully applied to predicting the strength of a mutation library of E. coli Trc promoter sequences (Meng et al., 2017). One drawback of an SVM or artificial neural network (ANN) machine learning model is the ‘black-box’ approach; that is, the absence of any mechanistic insights that could be gleaned with respect to the relationship between promoter sequence and strength. Such an understanding could be vital in the prediction of promoter strengths in different contexts, as well as the forward design of promoters in finely-tuned genetic circuits (see Endy, 2005; De Mey et al., 2007; Salis, Mirsky & Voigt, 2009; Li & Zhang, 2014). Many freely available resources predict the location of promoters in a genomic sequence mainly by identifying the −10 and −35 regulatory sequences (De Jong et al., 2012), but very few tools are available to predict the strength of such sequences. One tool provides qualitative predictions (‘strong’ or not) of promoter strength based on the occurrence of a triad pattern (Dekhtyar, Morin & Sakanyan, 2008), and is available as a macro.
Here, we present a two-step approach to the predictive modelling of the strength of σ70 core promoters, and a companion web-based platform and Python standalone tool that implement our method along with the option to dynamically include user data into the prediction model. Our implementation is the first freely available tool/web-server for the quantitative prediction of promoter strength.
Methods
Generative model of promoter sequences
A generative model of the −10 and −35 promoter sequences is constructed using two position weight matrices (PWM–10 and PWM–35) in the following manner. A comprehensive set of σ70-binding promoter sequences was extracted from RegulonDB (Gama-Castro et al., 2016). For each promoter sequence, we extracted a −35 region of 13 nucleotides centred at the −35 position, and a −10 region of 13 nucleotides centred at the −10 position, to allow for uncertainties in the precise position of occurrence of the hexamers. For each −35 region, we used FIMO (Grant, Bailey & Noble, 2011) to find the best match to the consensus −35 motif, and similarly for the −10 regions, to obtain a dataset of −35 and −10 hexamer sequences. This dataset was then filtered for only significant hits to the consensus motifs (p-value < 0.05) and the resulting dataset was used to determine the weights of each nucleotide at each position of the −35 and −10 hexamers. Nucleotide-wise counts at each position of the hexamer motifs were augmented by a pseudo-count prior to correct for the E. coli GC content of 50.8% and the resulting frequency matrices were converted into log-odds matrices. Biopython routines (www.biopython.org) were used.
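The PWM construction described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' Biopython code: the function names `build_log_odds` and `score`, the total pseudo-count of 1.0 split according to the background frequencies, the base-2 logarithm, and the toy hexamers are all hypothetical choices made for the example.

```python
import math

# Background nucleotide frequencies implied by a GC content of 50.8%.
GC = 0.508
BACKGROUND = {"A": (1 - GC) / 2, "C": GC / 2, "G": GC / 2, "T": (1 - GC) / 2}


def build_log_odds(hexamers, pseudocount=1.0):
    """Return one dict of log2-odds weights per motif position.

    The pseudocount value and the log base are assumptions of this sketch.
    """
    length = len(hexamers[0])
    matrix = []
    for pos in range(length):
        # Start each count at the background-weighted pseudocount.
        counts = {nt: pseudocount * BACKGROUND[nt] for nt in "ACGT"}
        for seq in hexamers:
            counts[seq[pos].upper()] += 1
        total = sum(counts.values())
        matrix.append(
            {nt: math.log2((counts[nt] / total) / BACKGROUND[nt]) for nt in "ACGT"}
        )
    return matrix


def score(matrix, hexamer):
    """Sum the log-odds weights of the nucleotides observed in the hexamer."""
    return sum(column[nt] for column, nt in zip(matrix, hexamer.upper()))


# Toy -10 hexamers for illustration only (not the RegulonDB dataset).
toy_minus10 = ["TATAAT", "TATAAT", "TACAAT", "TATGAT", "GATAAT", "TATACT"]
pwm_minus10 = build_log_odds(toy_minus10)

print(score(pwm_minus10, "TATAAT"))  # the consensus gets the highest score
print(score(pwm_minus10, "GCGCGC"))  # a poor match gets a low (negative) score
```

The score of a hexamer is simply the sum of its per-position weights, which is what makes the two PWM scores usable as additive features in the regression.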
Linear modelling of promoter strength
Following Berg & Von Hippel (1987), we modelled the relationship between the promoter sequences and the ln of the promoter strength using multiple linear regression. The training set of 18 promoters is drawn from the Anderson library of activator-independent plasmid tet promoter variants maintained at the Registry of Standard Biological Parts (http://parts.igem.org/Promoters/Catalog/Anderson). Each promoter sequence is scored with respect to the generative models of the −10 and −35 motifs (i.e. the PWM–10 and PWM–35 matrices) and the two scores obtained formed the feature space of the regression modelling. The regression coefficients to be determined represent the weights of the −10 and −35 regions in the regression analysis. The Anderson library provided promoter strengths spanning two orders of magnitude and normalized in the range 0.00–1.00 with respect to the strongest (i.e. reference) promoter. It was noted that the normalisation step would not affect a linear relationship, altering only the constant of the regression. The normalised strength values were log-transformed to obtain the required response variable values. Since the ln function rapidly descends towards −Inf with decreasing promoter strength, we capped the infimum of promoter strength at 0.0001 prior to log-transformation. The least-squares cost function was minimized using iterative gradient descent. The model parameters were assessed using t-statistics, and the overall model was assessed using the F-statistic and the adjusted multiple coefficient of determination given by:
$$R^2_{adj} = 1 - \frac{(1 - R^2)(n - 1)}{n - m - 1} \tag{1}$$
where m is the number of features and n is the number of instances. The adjustment is a penalty for increasing model complexity.
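Two small steps of this section lend themselves to a worked example: flooring the strength before the log-transformation, and the adjustment applied to R2. The sketch below uses the standard adjusted-R2 formula; the helper names `log_strength` and `adjusted_r2` are hypothetical, and the input R2 ≈ 0.69 with n = 18 and m = 2 is taken from the values reported later in the paper.

```python
import math


def log_strength(strength, floor=1e-4):
    """Natural log of a normalised strength, floored to avoid ln(0) = -inf."""
    return math.log(max(strength, floor))


def adjusted_r2(r2, n, m):
    """Adjusted R^2 for n instances and m features."""
    return 1 - (1 - r2) * (n - 1) / (n - m - 1)


print(round(log_strength(0.0), 3))         # -9.21, i.e. ln(0.0001)
print(round(adjusted_r2(0.69, 18, 2), 2))  # 0.65
```

With 18 instances and two features the penalty factor is 17/15, so an R2 of about 0.69 shrinks to about 0.65.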
Model validation
The model of promoter strength was validated in three ways:
The model was validated using leave-one-out cross-validation (LOOCV).
We generated 100 sets of 100 random promoter sequences each, using the sample function in Python. From the resulting sampling distribution of mean strengths of random promoter sequences, we calculated the estimate of the true mean strength of a random promoter sequence, together with its standard error.
We further validated our model on independent datasets of characterized promoters available in Davis, Rubin & Sauer (2011), Dekhtyar, Morin & Sakanyan (2008), and Dayton et al. (1984).
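The random-sequence benchmark can be sketched as follows. This is an illustration of the sampling procedure only: `toy_predict` is a hypothetical stand-in for the fitted model, `random.choices` (sampling with replacement) is an assumption about how the hexamers were drawn, and the standard error is taken as the standard deviation of the set means divided by the square root of the number of sets, which may differ from the authors' exact computation.

```python
import math
import random
import statistics


def random_hexamer(rng):
    """Draw six nucleotides uniformly, with replacement."""
    return "".join(rng.choices("ACGT", k=6))


def sampling_distribution(predict, n_sets=100, set_size=100, seed=0):
    """Return the mean of the set means and the standard error of that mean."""
    rng = random.Random(seed)
    set_means = []
    for _ in range(n_sets):
        strengths = [
            predict(random_hexamer(rng), random_hexamer(rng))
            for _ in range(set_size)
        ]
        set_means.append(statistics.mean(strengths))
    grand_mean = statistics.mean(set_means)
    std_error = statistics.stdev(set_means) / math.sqrt(n_sets)
    return grand_mean, std_error


def toy_predict(minus35, minus10):
    """Hypothetical model: strength grows with matches to the consensus."""
    matches = sum(a == b for a, b in zip(minus35 + minus10, "TTGACATATAAT"))
    return math.exp(matches - 12)  # 1.0 for a perfect consensus match


mean_strength, std_error = sampling_distribution(toy_predict)
print(mean_strength, std_error)
```

Replacing `toy_predict` with the trained regression model would give the corresponding estimate for that model; the numbers printed here describe the toy model only.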
Results
The entire datasets of 1,004 −35 hexamers and 1,046 −10 hexamers parsed out of RegulonDB are available as Supplementary Information. The conservation profiles of the extracted −35 and −10 hexamer sequences of the promoters in the RegulonDB were visualized and shown in Fig. 1. Based on these PWMs, the site scores of each promoter sequence in the Anderson library were regressed on the corresponding ln of the promoter strength. A summary of this process with the training data, log-transformation of the promoter strength and predicted response values is presented in Table 1. The modelling process converged within 105 iterations by tuning the gradient descent to a learning rate (α) of 0.015, and the following model was obtained:
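$$\ln(\text{promoter strength}) \approx -5.105 + 0.427 \cdot \mathrm{PWM}_{-35} + 0.273 \cdot \mathrm{PWM}_{-10} \tag{2}$$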
Table 1. Summary of promoter information.
Promoter | −35 hexamer | −10 hexamer | Promoter activity | ln (Promoter activity) | Predicted ln (Promoter activity) |
---|---|---|---|---|---|
BBa_J23100 | TTGACG | TACAGT | 1 | 0 | −1.6336486579 |
BBa_J23101 | TTTACA | TATTAT | 0.7 | −0.35667494 | 0.0555718065 |
BBa_J23102 | TTGACA | TACTGT | 0.86 | −0.15082289 | −1.0957849491 |
BBa_J23104 | TTGACA | TATTGT | 0.72 | −0.32850407 | 0.1647181133 |
BBa_J23105 | TTTACG | TACTAT | 0.24 | −1.42711636 | −2.2871659092 |
BBa_J23106 | TTTACG | TATAGT | 0.47 | −0.75502258 | −1.3174788735 |
BBa_J23107 | TTTACG | TATTAT | 0.36 | −1.02165125 | −1.0266628468 |
BBa_J23108 | CTGACA | TATAAT | 0.51 | −0.67334455 | −0.4282477098 |
BBa_J23109 | TTTACA | GACTGT | 0.04 | −3.21887582 | −3.3693144659 |
BBa_J23110 | TTTAGG | TACAAT | 0.33 | −1.10866262 | −3.3946866337 |
BBa_J23111 | TTGACG | TATAGT | 0.58 | −0.54472718 | −0.3731455955 |
BBa_J23112 | CTGATA | GATTAT | 0.01 | −4.60517019 | −3.1533888284 |
BBa_J23113 | CTGATG | GATTAT | 0.01 | −4.60517019 | −4.2356234817 |
BBa_J23114 | TTTATG | TACAAT | 0.1 | −2.30258509 | −2.5943689001 |
BBa_J23115 | TTTATA | TACAAT | 0.15 | −1.89711998 | −1.5121342469 |
BBa_J23116 | TTGACA | GACTAT | 0.16 | −1.83258146 | −1.5897942167 |
BBa_J23117 | TTGACA | GATTGT | 0.06 | −2.81341072 | −1.1644781255 |
BBa_J23118 | TTGACG | TATTGT | 0.56 | −0.5798185 | −0.91751654 |
Note:
The promoter activities (strengths) are seen to span two orders of magnitude in the range (0.0, 1.0). The promoters follow the naming in the Anderson dataset.
We derived an independent solution of the multiple regression using R (www.r-project.org) and obtained a correlation coefficient of 0.998 between the fitted values of the two models. The interval estimates of the regression coefficients were computed in R using confint(fit, level = 0.95), which yielded the following 95% confidence intervals:
Intercept : (−6.4974449, −3.7118421)
PWM_35 : (0.2445358, 0.6095848)
PWM_10 : (0.1434939, 0.4017307)
The interval estimates did not include zero, and this implied that the coefficients were significant at the 0.05 level. In fact, all the three estimates were significant at a p-value of 1E-3. The F-statistic of the overall regression was significant at a p-value of 2E-4 and adj. R2 was ≈0.65. The plane of best fit corresponding to the above model is visualized in Fig. 2.
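As a cross-check, the regression can be refitted from the PWM scores and log strengths listed in Table 2. The sketch below minimises the least-squares cost by batch gradient descent; standardising the features, the learning rate of 0.1 and the 5,000 iterations are assumptions of this example (the scaling details behind the reported α = 0.015 and 105 iterations are not given in the text), and `fit_gradient_descent` is a hypothetical name.

```python
import statistics

# (PWM_35 score, PWM_10 score, ln strength) for the 18 training promoters,
# copied from Table 2.
DATA = [
    (6.5966, 2.398, 0.0), (6.9195, 8.089, -0.357), (9.1308, 0.402, -0.151),
    (9.1308, 5.025, -0.329), (4.3854, 3.465, -1.427), (4.3854, 7.022, -0.755),
    (4.3854, 8.089, -1.022), (4.5119, 10.086, -0.673), (6.9195, -4.474, -3.219),
    (4.3854, 5.462, -1.109), (6.5966, 7.022, -0.545), (2.5179, 3.213, -4.605),
    (-0.0162, 3.213, -4.605), (2.3914, 5.462, -2.303), (4.9255, 5.462, -1.897),
    (9.1308, -1.411, -1.833), (9.1308, 0.15, -2.813), (6.5966, 5.025, -0.58),
]


def fit_gradient_descent(data, alpha=0.1, n_iter=5000):
    """Fit y = b0 + b1*x35 + b2*x10 by batch gradient descent."""
    x35 = [row[0] for row in data]
    x10 = [row[1] for row in data]
    y = [row[2] for row in data]
    n = len(data)
    # Standardise the features so a single learning rate suits both.
    mu35, sd35 = statistics.mean(x35), statistics.pstdev(x35)
    mu10, sd10 = statistics.mean(x10), statistics.pstdev(x10)
    z = [((a - mu35) / sd35, (b - mu10) / sd10) for a, b in zip(x35, x10)]
    w0 = w1 = w2 = 0.0
    for _ in range(n_iter):
        errors = [w0 + w1 * u + w2 * v - t for (u, v), t in zip(z, y)]
        # Gradient of the mean squared error (up to a constant factor).
        g0 = sum(errors) / n
        g1 = sum(e * u for e, (u, _) in zip(errors, z)) / n
        g2 = sum(e * v for e, (_, v) in zip(errors, z)) / n
        w0, w1, w2 = w0 - alpha * g0, w1 - alpha * g1, w2 - alpha * g2
    # Convert the weights back to the original feature scale.
    b1, b2 = w1 / sd35, w2 / sd10
    b0 = w0 - b1 * mu35 - b2 * mu10
    return b0, b1, b2


b0, b1, b2 = fit_gradient_descent(DATA)
print(b0, b1, b2)
```

If the tabulated scores are the ones used for training, the three estimates should fall inside the 95% confidence intervals listed above, close to their midpoints.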
The model was then cross-validated using an 18-fold LOOCV (similar to jack-knife). Cross-validation yielded a correlation coefficient of ∼0.76 (Table 2). We sought to benchmark our model on a negative test set by generating random −35 and −10 hexamer sequences. To this end, we applied our model to 100 sets of 100 random promoter sequences each (available in Supplementary Information) and estimated the true mean of the sampling distribution as 0.00055. The standard error of the estimate was 1.04E-7. The low predicted strength along with the very small standard error indicated that the model predicted these instances to be non-promoter sequences with good certainty. This affirmed the specificity of our model for true promoters.
Table 2. Cross-validation results.
Fold | PWM_35 | PWM_10 | Combined | logStrength | cvpred | cvres |
---|---|---|---|---|---|---|
1 | 6.5966 | 2.398 | 9 | 0 | −1.757 | 1.757 |
2 | 6.9195 | 8.089 | 15.01 | −0.357 | 0.145 | −0.50 |
3 | 9.1308 | 0.402 | 9.53 | −0.151 | −1.3 | 1.15 |
4 | 9.1308 | 5.025 | 14.16 | −0.329 | 0.286 | −0.62 |
5 | 4.3854 | 3.465 | 7.85 | −1.427 | −2.36 | 0.93 |
6 | 4.3854 | 7.022 | 11.41 | −0.755 | −1.377 | 0.62 |
7 | 4.3854 | 8.089 | 12.47 | −1.022 | −1.027 | 0.00 |
8 | 4.5119 | 10.086 | 14.6 | −0.673 | −0.362 | −0.31 |
9 | 6.9195 | −4.474 | 2.45 | −3.219 | −3.463 | 0.24 |
10 | 4.3854 | 5.462 | 9.85 | −1.109 | −1.792 | 0.68 |
11 | 6.5966 | 7.022 | 13.62 | −0.545 | −0.349 | −0.20 |
12 | 2.5179 | 3.213 | 5.73 | −4.605 | −2.847 | −1.76 |
13 | −0.0162 | 3.213 | 3.2 | −4.605 | −3.977 | −0.63 |
14 | 2.3914 | 5.462 | 7.85 | −2.303 | −2.646 | 0.34 |
15 | 4.9255 | 5.462 | 10.39 | −1.897 | −1.485 | −0.41 |
16 | 9.1308 | −1.411 | 7.72 | −1.833 | −1.518 | −0.32 |
17 | 9.1308 | 0.15 | 9.28 | −2.813 | −0.796 | −2.02 |
18 | 6.5966 | 5.025 | 11.62 | −0.58 | −0.944 | 0.36 |
Note:
In each fold of cross-validation, the instance corresponding to the fold was designated as the test instance while the prediction model was built using the rest of the instances. This process was repeated 18 times, once for each test instance and the cross-validation (CV) residuals were obtained. combined, sum of the PWM scores; cvpred, predicted log strength of the test instance; cvres, cross-validation residual.
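The leave-one-out procedure described in the note can be sketched with the data of Table 2. For brevity each fold is fitted with the closed-form least-squares solution rather than gradient descent, which is an assumption of this example; `fit_ols`, `loocv_predictions` and `pearson` are hypothetical helper names.

```python
import math

# (PWM_35 score, PWM_10 score, ln strength), copied from Table 2.
DATA = [
    (6.5966, 2.398, 0.0), (6.9195, 8.089, -0.357), (9.1308, 0.402, -0.151),
    (9.1308, 5.025, -0.329), (4.3854, 3.465, -1.427), (4.3854, 7.022, -0.755),
    (4.3854, 8.089, -1.022), (4.5119, 10.086, -0.673), (6.9195, -4.474, -3.219),
    (4.3854, 5.462, -1.109), (6.5966, 7.022, -0.545), (2.5179, 3.213, -4.605),
    (-0.0162, 3.213, -4.605), (2.3914, 5.462, -2.303), (4.9255, 5.462, -1.897),
    (9.1308, -1.411, -1.833), (9.1308, 0.15, -2.813), (6.5966, 5.025, -0.58),
]


def fit_ols(rows):
    """Closed-form least squares for y = b0 + b1*x + b2*z."""
    n = len(rows)
    mx = sum(r[0] for r in rows) / n
    mz = sum(r[1] for r in rows) / n
    my = sum(r[2] for r in rows) / n
    sxx = sum((r[0] - mx) ** 2 for r in rows)
    szz = sum((r[1] - mz) ** 2 for r in rows)
    sxz = sum((r[0] - mx) * (r[1] - mz) for r in rows)
    sxy = sum((r[0] - mx) * (r[2] - my) for r in rows)
    szy = sum((r[1] - mz) * (r[2] - my) for r in rows)
    det = sxx * szz - sxz ** 2
    b1 = (sxy * szz - sxz * szy) / det
    b2 = (sxx * szy - sxz * sxy) / det
    return my - b1 * mx - b2 * mz, b1, b2


def loocv_predictions(rows):
    """Predict each instance from a model trained on all the others."""
    predictions = []
    for i, (x, z, _) in enumerate(rows):
        b0, b1, b2 = fit_ols(rows[:i] + rows[i + 1:])
        predictions.append(b0 + b1 * x + b2 * z)
    return predictions


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    var_a = sum((p - ma) ** 2 for p in a)
    var_b = sum((q - mb) ** 2 for q in b)
    return cov / math.sqrt(var_a * var_b)


cv_pred = loocv_predictions(DATA)
cv_res = [row[2] - p for row, p in zip(DATA, cv_pred)]
print(pearson(cv_pred, [row[2] for row in DATA]))
```

The printed value is the correlation between the held-out predictions and the observed log strengths, the quantity reported as ∼0.76 in the text.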
To validate our model further on true promoter sequences and experimentally characterized promoter strengths, we used datasets available in the literature and compared the predicted strength with the experimental results and examined their concordance. The following results were obtained:
For the 10 promoters discussed by Davis, Rubin & Sauer (2011), we ranked the promoters in Table 1 of the same reference according to their strengths and observed a 1,000-fold span of promoter strengths, 1E-3 to 1 (Table 3). Promoters 2 and 3 were identically strong, hence we took the average of their predicted strengths in ranking the promoters. With this arrangement, we found that the predicted order of promoters in terms of strength exactly reproduced the experimentally characterized order. Although the Anderson library and these promoters were characterized and normalized using different systems, the model was able to predict surprisingly well across a promoter strength spectrum spanning three orders of magnitude.
Next, we applied our model to the set of 13 strong promoter candidates of Thermotoga maritima discussed in Dekhtyar, Morin & Sakanyan (2008). Using the hexamer sequences provided in Fig. 5 of the same reference, we applied our model and obtained quantitative predictions of promoter strengths (Table 4). Almost all the promoters had predicted strengths >0.38 and promoters with canonical hexamers even had strengths >1.00. One promoter (TM0032) was predicted as ‘weak’ with a strength ∼0.056 and seemed to point to an apparent anomaly in the relationship between promoter sequence and strength, possibly highlighting the need for further experimentation on this promoter. Our observations were corroborated by Fig. 4 in the same reference that showed the least and greatly reduced expression from this particular promoter. These results taken in conjunction with the results on random promoter sequences affirmed the ability of our model to discriminate between promoters at opposite ends of the strength spectrum.
We also applied our model on the five promoters discussed in Dayton et al. (1984). Of these, the first three are known as ‘major’ promoters that are active even at low concentrations of the polymerase, whereas the last two are ‘minor’, less strong promoters that are only active when the polymerase is present at high concentrations. We applied our model on the promoter sequences found in Fig. 5 of the same reference and found the predictions in line with the nature of these promoters (Table 5). The activity of the least strong ‘major’ promoter is about two times more than the activity of the strongest ‘minor’ promoter. Hence, our modelling approach was able to discriminate between major and minor promoters.
Table 3. Validation results: using data of Davis, Rubin & Sauer (2011).
Actual rank | Promoter | −35 sequence | −10 sequence | Strength | Predicted exp(logStrength) | Predicted rank |
---|---|---|---|---|---|---|
1 | pro1 | tttacg | gtatct | 0.009 | 0.0079073845 | 1 |
2.5 | pro2 | gcggtg | tataat | 0.017 | 0.0306978849 | 2.5 |
2.5 | pro3 | ttgacg | gaggat | 0.017 | 0.0306978849 | 2.5 |
4 | proA | tttacg | taggct | 0.03 | 0.0482647297 | 4 |
5 | pro4 | tttacg | gatgat | 0.033 | 0.0809816409 | 5 |
6 | pro5 | tttacg | taggat | 0.05 | 0.0867400443 | 6 |
7 | proB | tttacg | taatat | 0.119 | 0.1534857959 | 7 |
8 | pro6 | tttacg | taaaat | 0.193 | 0.2645364297 | 8 |
9 | proC | tttacg | tatgat | 0.278 | 0.3059490889 | 9 |
10 | proD | tttacg | tataat | 1 | 0.6173668247 | 10 |
Note:
The promoters were ordered based on the rank of their strength, and given as input to our model. The predicted promoter log strengths were then examined for agreement with the actual rank and the ordering obtained matched the original ordering. The individual predicted values for pro2 and pro3 were 0.0024 and 0.059, respectively.
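The rank comparison in Table 3 can be reproduced with a small helper that assigns tied values their mean rank. The function name `average_ranks` is hypothetical; the measured and predicted values are copied from the table, with pro2 and pro3 both carrying the averaged prediction.

```python
def average_ranks(values):
    """Rank values in ascending order, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


measured = [0.009, 0.017, 0.017, 0.03, 0.033, 0.05, 0.119, 0.193, 0.278, 1.0]
predicted = [
    0.0079073845, 0.0306978849, 0.0306978849, 0.0482647297, 0.0809816409,
    0.0867400443, 0.1534857959, 0.2645364297, 0.3059490889, 0.6173668247,
]

print(average_ranks(measured))
print(average_ranks(measured) == average_ranks(predicted))  # True
```

The agreement relies on the averaging step: with the individual predictions for pro2 and pro3 given in the note (0.0024 and 0.059), the two rankings would no longer coincide exactly.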
Table 4. Validation with T. maritima strong promoter candidates.
Promoter | −35 sequence | −10 sequence | Strength | Predicted exp(logStrength) | Predicted class |
---|---|---|---|---|---|
TM0373 | ttgaca | tataat | Strong | 4.6845788997 | Strong |
TM1016 | ttgaat | tttaat | Strong | 0.3808572257 | Strong |
TM1272 | ttgaca | tttaat | Strong | 1.6386551999 | Strong |
TM1429 | ttgaca | tataat | Strong | 4.6845788997 | Strong |
TM1667 | ttgaaa | tataat | Strong | 2.5859432664 | Strong |
TM1780 | ttcata | tataat | Strong | 0.463878289 | Strong |
Tmt11 | ttgaat | taaaat | Strong | 0.4665383797 | Strong |
TM0032 | tcgaaa | cataat | Strong | 0.0562167049 | Weak |
TM0477 | ttgaat | tataat | Strong | 1.0887926414 | Strong |
TM1067 | ttgacc | tattat | Strong | 0.7046782664 | Strong |
TM1271 | ttgaca | tataat | Strong | 4.6845788997 | Strong |
Tmt45 | ttgaac | tataat | Strong | 0.670434893 | Strong |
TM1490 | ttgact | taaaat | Strong | 0.8451600149 | Strong |
Table 5. Validation with major (A1, A2, A3) and minor (C, D) promoters.
Promoter | −35 sequence | −10 sequence | Strength | Predicted exp(logStrength) | Predicted class |
---|---|---|---|---|---|
A1 | ttgact | gatact | strong | 0.2904988307 | Medium |
A2 | ttgaca | taagat | strong | 0.9947607331 | Strong |
A3 | ttgaca | tacgat | strong | 0.658183377 | Strong |
C | ttgacg | tagtct | minor | 0.1452865585 | Minor |
D | ttgact | taggct | minor | 0.1541996302 | Minor |
Discussion
In addition to the independent contributions of −35 and −10 sites to promoter strength, we were interested in exploring if any interactions between them could contribute to promoter strength. To this end, we examined the following model in R:
lm(logStrength ~ PWM35 * PWM10)
where PWM35 and PWM10 represent the corresponding site scores. This model resulted in a lower adj. R2-value than the model without interactions. Further, the PWM10 score was no longer significant (p-value: 0.31), and the interaction term was far from significant (p-value: 0.97). On this basis, the null hypothesis of no interaction could not be rejected, and we concluded that there is little evidence in the present dataset for interaction between the −35 and −10 sites in contributing to promoter strength.
Our model assumed that both the predictors carried independent information about the promoter strength, and together they are able to provide sufficient information about the strength. The basis of this assumption was probed to determine if both predictors are necessary to the model. Could one predictor provide sufficient information about the promoter strength in the absence of the other? There are at least three angles to address this question, and all of them were considered to interpret the model better.
- Comparing the raw, unadjusted R2 with the adjusted R2. The corresponding values were:
- R2 ≈ 0.69
- Adj. R2 ≈ 0.65
- Since there is not much difference between R2 and adj. R2, we could say that both predictors contribute substantially to the response variable (promoter strength) and account for about 65% of its variance.
- Since the p-values of both predictors are significant, it would be interesting to observe their effect on the response variable in more detail. This was performed using the effects package in R:
- library(effects)
- fit = lm(logStrength ~ PWM35 + PWM10, data)
- plot(allEffects(fit))
- The results are shown in Fig. 3 where the PWM scores are plotted against the level of confidence in the predicted response. Confidence in the effect of −35 site increases with the score from 0 to about 7, and then is susceptible to edge effects as the score reaches 8. Confidence in the effect of the −10 site increases with the score from −4 to about 5, and then is susceptible to edge effects as the score reaches 10.
Another way to address the question is to compute the correlation coefficients between all the variables of interest, including a variable with the combined effects of the −35 and −10 sites. This is shown in Table 6. Three features were used, namely the PWM−10 score, the PWM−35 score, and the combined score (i.e. PWM−10 + PWM−35). These feature variables were correlated with two response variables, namely promoter strength and its corresponding log-transformation. It was first observed that the PWM−10 and PWM−35 scores were anti-correlated with each other (correlation coefficient = −0.37), thus supporting the hypothesis that they are two independent features that could compensate for each other in determining promoter strength. Notably, each feature was better correlated with the log of the strength than with the strength itself. We tried to regress the strength on the PWM scores, but the model had a very low adj. R2 (≈0.40) and the intercept term was not significant at the 0.05 level. Further, the highest correlation between the features and response variable was observed between the combined score and the log of the promoter strength (∼0.79), whereas the combined score showed only a moderate correlation with the promoter strength prior to log-transformation (∼0.63). This was in keeping with similar observations for the strength of σE promoters (Rhodius & Mutalik, 2010) and underscored the logarithmic dependence between the promoter strength and sequence.
Table 6. Correlation matrix of features and response variables.
Correlation coefficient | PWM–35 | PWM–10 | Combined | Strength | Log-strength |
---|---|---|---|---|---|
PWM–35 | 1 | −0.3715610 | 0.3401672 | 0.4558838 | 0.5153622 |
PWM–10 | −0.3715610 | 1 | 0.7466500 | 0.3025062 | 0.4115533 |
Combined | 0.3401672 | 0.7466500 | 1 | 0.6330488 | 0.7861173 |
Strength | 0.4558838 | 0.3025062 | 0.6330488 | 1 | 0.8665495 |
Log-strength | 0.5153622 | 0.4115533 | 0.7861173 | 0.8665495 | 1 |
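Entries of Table 6 can be recomputed from the tabulated data. The sketch below is illustrative: it takes the PWM scores and log strengths from Table 2 and the untransformed strengths from Table 1 (in the same promoter order), and `pearson` is a hypothetical helper implementing the usual Pearson coefficient.

```python
import math

# PWM scores and ln strength from Table 2; strengths from Table 1.
PWM35 = [6.5966, 6.9195, 9.1308, 9.1308, 4.3854, 4.3854, 4.3854, 4.5119, 6.9195,
         4.3854, 6.5966, 2.5179, -0.0162, 2.3914, 4.9255, 9.1308, 9.1308, 6.5966]
PWM10 = [2.398, 8.089, 0.402, 5.025, 3.465, 7.022, 8.089, 10.086, -4.474,
         5.462, 7.022, 3.213, 3.213, 5.462, 5.462, -1.411, 0.15, 5.025]
LOG_STRENGTH = [0.0, -0.357, -0.151, -0.329, -1.427, -0.755, -1.022, -0.673,
                -3.219, -1.109, -0.545, -4.605, -4.605, -2.303, -1.897, -1.833,
                -2.813, -0.58]
STRENGTH = [1, 0.7, 0.86, 0.72, 0.24, 0.47, 0.36, 0.51, 0.04,
            0.33, 0.58, 0.01, 0.01, 0.1, 0.15, 0.16, 0.06, 0.56]


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    var_a = sum((p - ma) ** 2 for p in a)
    var_b = sum((q - mb) ** 2 for q in b)
    return cov / math.sqrt(var_a * var_b)


# Combined score: sum of the two PWM scores for each promoter.
COMBINED = [a + b for a, b in zip(PWM35, PWM10)]

print("PWM35 vs PWM10:          ", pearson(PWM35, PWM10))
print("combined vs strength:    ", pearson(COMBINED, STRENGTH))
print("combined vs log-strength:", pearson(COMBINED, LOG_STRENGTH))
```

Comparing the last two lines shows the point made in the text: the combined score tracks the log of the strength more closely than the strength itself.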
Finally, the assumptions of linear modelling were investigated with reference to our problem. Model diagnostics of four basic assumptions were plotted (shown in Fig. 4). Specifically:
Plot A: The residuals were plotted against the fitted values. No trend was visible in the plot, indicating the residuals did not increase with the fitted values and followed a random pattern about zero. This validated the assumption that the errors were independent.
Plot B: The square root of the relative error (standardized residual) was plotted against the fitted value. An almost flat trend was observed, indicating that the standardized residual did not vary with the fitted value. This further validated the assumption that the errors were independent.
Plot C: To test the assumption that the errors were normally distributed, the standardized residuals were plotted against the theoretical quantiles of a normal distribution. The residual distribution closely followed the theoretical quantiles, except for minor deviations towards the tails of the distribution.
Plot D: Since the least-squares cost function is sensitive to outliers, the number of outliers should be kept to a minimum. This was investigated by plotting the standardized residual against the corresponding instance’s model leverage. This plot showed that there were no significant outliers in the dataset that could exert an undue influence on the regression parameters.
An alternative univariate regression model using only the combined score of the PWMs found the coefficient of regression and the F-statistic significant (both p-values ≈ 1E-4). However, the adj. R2 of the model (≈0.59) was much lower than that for Eq. (2), so the original multiple linear regression model was retained for the estimation of the promoter strength.
In summary, our model performed equally well on datasets of strong promoter sequences and datasets of weak random promoter sequences. Our model was consistent in detecting promoter strengths across a 1,000-fold span of promoter strengths in E. coli as well as the promoter strengths of a different species, T. maritima. The model was further able to discriminate between the major and minor promoters of bacteriophage T7.
Based on these results, an open-access, open-source web server and standalone tool offering the prediction service have been implemented. Since the linear modelling results are dependent on the dataset, our implementation provides a facility to augment the learning based on user-provided inputs. The web interface is based on the Python web module web.py and an nginx server. The computational layer is based on numpy, Biopython and matplotlib. The user is provided with an option to add any number of promoter instances with −10 and −35 sequences and the corresponding strengths to augment the training data of the supervised model. The measurement of promoter strength could be done in the manner of Kelly et al. (2009), where the GFP (reporter gene) synthesis rate is measured per unit biomass, and this could be normalized relative to the reference promoter. In order to assess the goodness of fit of the updated model, the R2-value is re-computed, along with the 3D plot of the regression surface. This would enable the user to decide whether the data added to the model has improved its performance for further experiments with the software. Based on the trained model, the user could predict the strength of an uncharacterised promoter given its −10 and −35 hexamers.
Conclusion
The following important conclusions were drawn from our study. (1) Sequence-based modelling yielded a non-linear, logarithmic dependence between promoter strength and sequence. (2) The model was able to discriminate equally well between strong/major promoters and weak/minor/random promoter sequences, indicating successful learning of the essential features of promoter strength prediction. (3) The combined score (PWM–35 + PWM–10) emerged as the single most important predictor of the promoter strength. Our model yielded robust quantitative prediction across a 1,000-fold span of promoter strengths. It is straightforward to extend our methodology to the study of new promoter classes of other σ factors. Our implementation and web service could be useful in characterizing promoters identified in genome sequencing projects as well as in engineering promoters for the design of finely-tuned genetic circuits in synthetic biology. The dynamic feature of our implementation would enable users to incorporate their own data into the model and obtain more reliable estimates of promoter strength. The service will be periodically updated based on the availability of new training instances, user input data and/or models for promoters of other σ factors.
Acknowledgments
We would like to thank the reviewers for helping improve an earlier version of the manuscript. We are grateful to SASTRA Deemed University for computing facilities and support.
Funding Statement
The authors received no funding for this work.
Additional Information and Declarations
Competing Interests
The authors declare that they have no competing interests.
Author Contributions
Ramit Bharanikumar performed the experiments, analysed the data, contributed reagents/materials/analysis tools, prepared figures and/or tables, approved the final draft.
Keshav Aditya R. Premkumar performed the experiments, analysed the data, contributed reagents/materials/analysis tools, prepared figures and/or tables, approved the final draft.
Ashok Palaniappan conceived and designed the experiments, performed the experiments, analysed the data, contributed reagents/materials/analysis tools, prepared figures and/or tables, authored or reviewed drafts of the paper, approved the final draft.
Data Availability
The following information was supplied regarding data availability:
Raw code: https://github.com/PromoterPredict/PromoterStrengthPredictor
Web service: https://promoterpredict.com/
Palaniappan, Ashok; Aditya, Keshav; Bharanikumar, Ramit (2018): PromoterPredict: sequence-based modelling of promoter strength: supplementary information. figshare. Fileset. https://doi.org/10.6084/m9.figshare.6794939.v1
References
- Basu et al. (2014). Basu RS, Warner BA, Molodtsov V, Pupov D, Esyunina D, Fernández-Tornero C, Kulbachinskiy A, Murakami KS. Structural basis of transcription initiation by bacterial RNA polymerase holoenzyme. Journal of Biological Chemistry. 2014;289(35):24549–24559. doi: 10.1074/jbc.m114.584037.
- Benos, Bulyk & Stormo (2002). Benos PV, Bulyk ML, Stormo GD. Additivity in protein-DNA interactions: how good an approximation is it? Nucleic Acids Research. 2002;30(20):4442–4451. doi: 10.1093/nar/gkf578.
- Berg & Von Hippel (1987). Berg OG, Von Hippel PH. Selection of DNA binding sites by regulatory proteins. Statistical-mechanical theory and application to operators and promoters. Journal of Molecular Biology. 1987;193(4):723–750. doi: 10.1016/0022-2836(87)90354-8.
- Bujard (1980). Bujard H. The interaction of E. coli RNA polymerase with promoters. Trends in Biochemical Sciences. 1980;5(10):274–278. doi: 10.1016/0968-0004(80)90036-5.
- Crooks et al. (2004). Crooks GE, Hon G, Chandonia JM, Brenner SE. WebLogo: a sequence logo generator. Genome Research. 2004;14(6):1188–1190. doi: 10.1101/gr.849004.
- Davis, Rubin & Sauer (2011). Davis JH, Rubin AJ, Sauer RT. Design, construction and characterization of a set of insulated bacterial promoters. Nucleic Acids Research. 2011;39(3):1131–1141. doi: 10.1093/nar/gkq810.
- Dayton et al. (1984). Dayton CJ, Prosen DE, Parker KL, Cech CL. Kinetic measurements of Escherichia coli RNA polymerase association with bacteriophage T7 early promoters. Journal of Biological Chemistry. 1984;259:1616.
- De Jong et al. (2012). De Jong A, Pietersma H, Cordes M, Kuipers OP, Kok J. PePPER: a webserver for prediction of prokaryote promoter elements and regulons. BMC Genomics. 2012;13(1):299. doi: 10.1186/1471-2164-13-299.
- De Mey et al. (2007). De Mey M, Maertens J, Lequeux GJ, Soetaert WK, Vandamme EJ. Construction and model-based analysis of a promoter library for E. coli: an indispensable tool for metabolic engineering. BMC Biotechnology. 2007;7(1):34. doi: 10.1186/1472-6750-7-34.
- Dekhtyar, Morin & Sakanyan (2008). Dekhtyar M, Morin A, Sakanyan V. Triad pattern algorithm for predicting strong promoter candidates in bacterial genomes. BMC Bioinformatics. 2008;9(1):233. doi: 10.1186/1471-2105-9-233.
- Deuschle et al. (1986). Deuschle U, Kammerer W, Gentz R, Bujard H. Promoters of Escherichia coli: a hierarchy of in vivo strength indicates alternate structures. EMBO Journal. 1986;5(11):2987–2994. doi: 10.1002/j.1460-2075.1986.tb04596.x.
- Endy (2005). Endy D. Foundations for engineering biology. Nature. 2005;438(7067):449–453. doi: 10.1038/nature04342.
- Feklistov & Darst (2011). Feklistov A, Darst SA. Structural basis for promoter −10 element recognition by the bacterial RNA polymerase σ subunit. Cell. 2011;147(6):1257–1269. doi: 10.1016/j.cell.2011.10.041.
- Galas, Eggert & Waterman (1985). Galas DJ, Eggert M, Waterman MS. Rigorous pattern-recognition methods for DNA sequences. Analysis of promoter sequences from Escherichia coli. Journal of Molecular Biology. 1985;186(1):117–128. doi: 10.1016/0022-2836(85)90262-1.
- Gama-Castro et al. (2016). Gama-Castro S, Salgado H, Santos-Zavaleta A, Ledezma-Tejeida D, Muñiz-Rascado L, García-Sotelo JS, Alquicira-Hernández K, Martínez-Flores I, Pannier L, Castro-Mondragón JA, Medina-Rivera A, Solano-Lira H, Bonavides-Martínez C, Pérez-Rueda E, Alquicira-Hernández S, Porrón-Sotelo L, López-Fuentes A, Hernández-Koutoucheva A, Del Moral-Chávez V, Rinaldi F, Collado-Vides J. RegulonDB version 9.0: high-level integration of gene regulation, coexpression, motif clustering and beyond. Nucleic Acids Research. 2016;44(D1):D133–D143. doi: 10.1093/nar/gkv1156.
- Grant, Bailey & Noble (2011). Grant CE, Bailey TL, Noble WS. FIMO: scanning for occurrences of a given motif. Bioinformatics. 2011;27(7):1017–1018. doi: 10.1093/bioinformatics/btr064.
- Hawley & McClure (1983). Hawley DK, McClure WR. Compilation and analysis of Escherichia coli promoter DNA sequences. Nucleic Acids Research. 1983;11(8):2237–2255. doi: 10.1093/nar/11.8.2237.
- Hook-Barnard, Johnson & Hinton (2006). Hook-Barnard I, Johnson XB, Hinton DM. Escherichia coli RNA polymerase recognition of a σ70-dependent promoter requiring a −35 DNA element and an extended −10 TGn motif. Journal of Bacteriology. 2006;188(24):8352–8359. doi: 10.1128/jb.00853-06.
- Huerta & Collado-Vides (2003). Huerta AM, Collado-Vides J. Sigma70 promoters in Escherichia coli: specific transcription in dense regions of overlapping promoter-like signals. Journal of Molecular Biology. 2003;333(2):261–278. doi: 10.1016/j.jmb.2003.07.017.
- Kadonaga (2012). Kadonaga JT. Perspectives on the RNA polymerase II core promoter. Wiley Interdisciplinary Reviews: Developmental Biology. 2012;1(1):40–51. doi: 10.1002/wdev.21.
- Kelly et al. (2009). Kelly JR, Rubin AJ, Davis JH, Ajo-Franklin CM, Cumbers J, Czar MJ, De Mora K, Glieberman AL, Monie DD, Endy D. Measuring the activity of BioBrick promoters using an in vivo reference standard. Journal of Biological Engineering. 2009;3(1):4. doi: 10.1186/1754-1611-3-4.
- Knaus & Bujard (1990). Knaus R, Bujard H. Principles governing the activity of E. coli promoters. In: Eckstein F, Lilley DMJ, editors. Nucleic Acids and Molecular Biology. Vol. 4. Berlin: Springer-Verlag; 1990. pp. 110–122.
- Li & Zhang (2014). Li J, Zhang Y. Relationship between promoter sequence and its strength in gene expression. European Physical Journal E. 2014;37(9):44. doi: 10.1140/epje/i2014-14086-1.
- Maquat & Reznikoff (1978). Maquat LE, Reznikoff WS. In vitro analysis of the Escherichia coli RNA polymerase interaction with wild-type and mutant lactose promoters. Journal of Molecular Biology. 1978;125(4):467–490. doi: 10.1016/0022-2836(78)90311-x.
- Meng et al. (2017). Meng H, Ma Y, Mai G, Wang Y, Liu C. Construction of precise support vector machine based models for predicting promoter strength. Quantitative Biology. 2017;5(1):90–98. doi: 10.1007/s40484-017-0096-3.
- Paget & Helmann (2003). Paget MS, Helmann JD. The σ70 family of sigma factors. Genome Biology. 2003;4(1):203. doi: 10.1186/gb-2003-4-1-203.
- Rhodius & Mutalik (2010). Rhodius VA, Mutalik VK. Predicting strength and function for promoters of the Escherichia coli alternate sigma factor, σE. Proceedings of the National Academy of Sciences of the United States of America. 2010;107(7):2854–2859. doi: 10.1073/pnas.0915066107.
- Salis, Mirsky & Voigt (2009). Salis HM, Mirsky EA, Voigt CA. Automated design of synthetic ribosome binding sites to control protein expression. Nature Biotechnology. 2009;27(10):946–950. doi: 10.1038/nbt.1568.
- Shultzaberger et al. (2007). Shultzaberger RK, Chen Z, Lewis KA, Schneider TD. Anatomy of Escherichia coli sigma70 promoters. Nucleic Acids Research. 2007;35:771–788. doi: 10.1093/nar/gkl956.
- Stormo (1990). Stormo GD. Consensus patterns in DNA. In: Doolittle RF, editor. Methods in Enzymology, Vol. 183. Molecular Evolution: Computer Analysis of Protein and Nucleic Acid Sequences. San Diego: Academic Press; 1990. pp. 211–221.
- Weller & Recknagel (1994). Weller K, Recknagel RD. Promoter strength prediction based on occurrence frequencies of consensus patterns. Journal of Theoretical Biology. 1994;171(4):355–359. doi: 10.1006/jtbi.1994.1239.
Data Availability Statement
The following information was supplied regarding data availability:
Raw code: https://github.com/PromoterPredict/PromoterStrengthPredictor
Web service: https://promoterpredict.com/
Palaniappan, Ashok; Aditya, Keshav; Bharanikumar, Ramit (2018): PromoterPredict: sequence-based modelling of promoter strength: supplementary information. figshare. Fileset. https://doi.org/10.6084/m9.figshare.6794939.v1