Avoidance of stochastic RNA interactions can be harnessed to control protein expression levels in bacteria and archaea

Sinan Uğur Umu; Anthony M Poole; Renwick CJ Dobson; Paul P Gardner

doi:10.7554/eLife.13479

. 2016 Sep 20;5:e13479. doi: 10.7554/eLife.13479

Avoidance of stochastic RNA interactions can be harnessed to control protein expression levels in bacteria and archaea

Sinan Uğur Umu ^1,², Anthony M Poole ^1,², Renwick CJ Dobson ^1,^2,³, Paul P Gardner ^1,^2,^4,^*

Editor: Karen Adelman⁵

PMCID: PMC5028192 PMID: 27642845

Abstract

A critical assumption of gene expression analysis is that mRNA abundances broadly correlate with protein abundance, but these two are often imperfectly correlated. Some of the discrepancy can be accounted for by two important mRNA features: codon usage and mRNA secondary structure. We present a new global factor, called mRNA:ncRNA avoidance, and provide evidence that avoidance increases translational efficiency. We also demonstrate a strong selection for the avoidance of stochastic mRNA:ncRNA interactions across prokaryotes, and that these have a greater impact on protein abundance than mRNA structure or codon usage. By generating synonymously variant green fluorescent protein (GFP) mRNAs with different potential for mRNA:ncRNA interactions, we demonstrate that GFP levels correlate well with interaction avoidance. Therefore, taking stochastic mRNA:ncRNA interactions into account enables precise modulation of protein abundance.

DOI: http://dx.doi.org/10.7554/eLife.13479.001

Research Organism: E. coli, Other

eLife digest

Many genes carry information for making proteins. To make a protein, a working copy of the information stored in DNA is first copied into a molecule of messenger RNA. These RNA messages are then interpreted by the ribosome, the molecular machine that makes proteins. Many messages are produced from each gene, and each message can be read multiple times. Thus, it should follow that the number of messages produced dictates the number of proteins made. However, this is not the case and the number of proteins produced cannot be completely predicted from knowing the number of messenger RNAs.

Cells control how much of a given protein they produce through interactions between the messenger RNAs and other regulatory RNAs. The regulatory RNAs bind directly to a message and impede protein production. Because there are millions of RNAs in a cell, these interactions have evolved to be highly specific. Nevertheless, it seems inevitable that messenger RNAs would encounter other RNAs too, which could short-circuit gene regulation and lead to less protein being produced.

Umu et al. have now asked if such short-circuit events are selected against during evolution. Computational tools were used to predict the strength of binding between the RNAs found in the dominant forms of microbial life on Earth: the bacteria and the archaea. This approach revealed that the majority of messenger RNAs bind more weakly to the most common RNA molecules found in cells than would be expected by chance. Weakened binding should prevent the RNA molecules from becoming tangled with each other and ensure that protein levels are not perturbed by unintended interactions between highly expressed messages and other RNAs.

To test this hypothesis further, Umu et al. generated versions of the gene for a green fluorescent protein that differed only in how well their messenger RNAs could avoid interacting with the most abundant RNAs in E. coli cells. Those messengers that were designed to avoid interacting with other RNAs yielded far more protein than those that were not. The findings show that taking this kind of avoidance into account can improve predictions about how much protein will be produced and should therefore make it easier to control protein production in experimental systems.

Finally, the messenger RNAs of some bacteria do not show such clear avoidance. However, these bacteria have a more complex internal cell structure. This finding hints at an alternative means for avoiding short-circuiting events that could be used by more complicated cells, such of those of animals and plants, which also contain much larger numbers of RNAs.

DOI: http://dx.doi.org/10.7554/eLife.13479.002

Introduction

It should in principle be possible to predict protein abundance from genomic data. However, protein and mRNA levels are not strongly correlated (de Sousa Abreu et al., 2009; Vogel and Marcotte, 2012; Kwon et al., 2014; Maier et al., 2011; Lu et al., 2007; Taniguchi et al., 2010; Chen et al., 2016), which is a major barrier to precision bioengineering and quantification of protein levels. mRNA secondary structure (Pelletier and Sonenberg, 1987; Chamary and Hurst, 2005), codon usage (Ikemura, 1981; Sharp and Li, 1987; Andersson and Kurland, 1990), and mRNA (and protein) degradation rates (Maier et al., 2011) are commonly invoked to explain this discrepancy (Boël et al., 2016-21). Yet, at best, these features account for only 40% of variation, and in some instances explain very little of the observed variation (Kudla et al., 2009; Maier et al., 2011; Plotkin and Kudla, 2011; Goodman et al., 2013; Chen et al., 2016). Here we show that crosstalk interactions between ncRNAs and mRNAs also impact protein abundance, and that such interactions have a greater effect than either mRNA secondary structure or codon usage. We measured interactions between a set of evolutionarily conserved core mRNAs and ncRNAs from 1700 prokaryotic genomes using minimum free energy (MFE) models. For 97% of species, we find a reduced capacity for interaction between native RNAs relative to controls. Furthermore, by generating synonymously variant GFP mRNAs that differ in their potential to interact with core ncRNAs, we demonstrate that GFP expression levels can be both predicted and controlled. Our results demonstrate that there is strong selection for the avoidance of stochastic mRNA:ncRNA interactions across prokaryotes. Applying this knowledge to mRNA design will enable precise control of protein abundance through the incorporation or exclusion of inhibitory interactions with native ncRNAs.

Results and discussion

To examine if avoidance of stochastic mRNA:ncRNA interactions is a feature of transcriptomes in bacteria and archaea, we estimated the strength of all possible intermolecular RNA interactions using a minimum free energy (MFE) model (Mückstein et al., 2006) using core ncRNAs and mRNAs. In this work the core ncRNAs are six well conserved and highly expressed tRNA, rRNA, RNase P RNA, SRP RNA, tmRNA and 6S RNA families annotated by Rfam (Gardner et al., 2011; Nawrocki et al., 2015), the core mRNAs are 114 well conserved mRNAs found across bacteria, 40 of which are also conserved across archaea (Wu et al., 2013).

If stochastic interactions are selected against, because of the capacity for abundant ncRNAs (Lindgreen et al., 2014; Deutscher, 2006; Giannoukos et al., 2012) to impact translation (Waters and Storz, 2009; Storz et al., 2011), such negative selection would be most comparable between species and readily detected for broadly conserved ncRNAs and mRNAs. Under-representation of interactions has been considered for the specific case of Shine-Dalgarno-like (SD-like) sequences and the ribosome (Li et al., 2012; Woolstenhulme et al., 2015; Borg and Ehrenberg, 2015; Diwan and Agashe, 2016) and between microRNAs and 3' UTRs (Bartel and Chen, 2004; Farh, 2005; Stark et al., 2005; van Dongen et al., 2008). We computed the free energy distribution of interactions between highly conserved mRNA:ncRNA pairs and compared this to a number of negative control interactions, which serve to show the expected distribution of binding energy values (Figure 1A). The initiation of translation has been shown to be the rate limiting step for translation (Tuller and Zur, 2015; Plotkin and Kudla, 2011; Nakahigashi et al., 2014), therefore, we focus our analysis on the first 21 nucleotides of the mRNA coding sequence (CDS). This has the further advantage of reducing computational complexity. We also test a variety of negative control mRNA regions, which are unlikely to play a functional role in RNA:ncRNA interactions. The mRNA controls include (1) di-nucleotide preserving shuffled sequences (Workman, 1999) (orange, Figure 1A), (2) homologous mRNAs from another phylum (with a compatible guanine-cytosine (G+C) content) (purple), (3) downstream regions 100 base pairs (bps) within the CDS (pink), (4) the reverse complement of the 5´ of CDSs (green), and lastly (5) unannotated (intergenic) genomic regions (yellow). Our interaction predictions in a single model strain show that native interactions consistently have higher (i.e. less stable) free energies than expected when compared to the five different mRNA negative controls: that is, there is a reduced capacity for native mRNAs and native ncRNAs to interact. We also compared different energy models and confirm that the MFE shift is a result of intermolecular binding (Figure 1—figure supplement 1A–C). We subsequently deployed the most conservative negative control (i.e. di-nucleotide preserving shuffle) and free energy model (Figure 1—figure supplement 1C) to detect if this shift for less stable binding of mRNA:ncRNA is true of all bacteria and archaea.

Figure 1. — (A) Native core mRNA:ncRNA binding energies (green line; mean = −3.21 kcal/mol) are significantly higher than all mRNA negative control binding energies (dashed lines; mean binding energies are -3.62, -5.21, -4.13, -3.86 & -3.92 kcal/mol respectively) in pairwise comparisons (p<2.2 × 10⁻¹⁶ for all pairs, one-tailed Mann-Whitney U test) for *Streptococcus suis* RNAs. (B) The difference between the density distributions of native mRNA:ncRNA binding energies and dinucleotide preserved shuffled mRNA:ncRNA controls as a function of binding energy for different taxonomic phyla. Each coloured curve illustrates the degree of extrinsic avoidance for different bacterial phyla or the archaea. Positive differences indicate an excess in native binding for that energy value, negative differences indicate an excess of interactions in the shuffled controls. The dashed black line shows the expected result if no difference exists between these distributions and the dashed grey lines show empirical differences for shuffled vs shuffled densities from 100 randomly selected bacterial strains. (C) This box and whisker plot shows −log₁₀(P) distributions for each phylum and the archaea, the p-values are derived from a one-tailed Mann-Whitney U test for each genome of native mRNA:ncRNA versus shuffled mRNA:ncRNA binding energies. The black dashed line indicates the significance threshold (p<0.05). (D) A high intrinsic avoidance strain (*Thermodesulfobacterium sp.* OPB45) shows a clear separation between the G+C distribution of mRNAs and ncRNAs (p=9.2 × 10⁻²⁵, two-tailed Mann-Whitney U test), and a low intrinsic avoidance strain (*Mycobacterium sp.* JDM601) has no G+C difference between mRNAs and ncRNAs (p=0.54, two-tailed Mann-Whitney U test). (E) The x-axis shows −log₁₀(P) for our test of extrinsic avoidance using binding energy estimates for both native and shuffled controls, while the y-axis shows −log₁₀(P) for our intrinsic test of avoidance based upon the difference in G+C contents of ncRNAs and mRNAs. Two perpendicular dashed black lines show the threshold of significance for both avoidance metrics. 97% of bacteria and archaea are significant for at least one of these tests of avoidance.

**DOI:** http://dx.doi.org/10.7554/eLife.13479.003

Figure 1—figure supplement 1. — (A) Native core mRNA:ncRNA binding energies (green line; mean = −3.21 kcal/mol) are significantly higher than all mRNA negative control binding energies (dashed lines; mean binding energies are -3.62, -5.21, -4.13, -3.86 & -3.92 kcal/mol respectively) in pairwise comparisons (p<2.2 × 10⁻¹⁶ for all pairs, one-tailed Mann-Whitney U test) for *Streptococcus suis* RNAs. (B) The difference between the density distributions of native mRNA:ncRNA binding energies and dinucleotide preserved shuffled mRNA:ncRNA controls as a function of binding energy for different taxonomic phyla. Each coloured curve illustrates the degree of extrinsic avoidance for different bacterial phyla or the archaea. Positive differences indicate an excess in native binding for that energy value, negative differences indicate an excess of interactions in the shuffled controls. The dashed black line shows the expected result if no difference exists between these distributions and the dashed grey lines show empirical differences for shuffled vs shuffled densities from 100 randomly selected bacterial strains. (C) This box and whisker plot shows −log₁₀(P) distributions for each phylum and the archaea, the p-values are derived from a one-tailed Mann-Whitney U test for each genome of native mRNA:ncRNA versus shuffled mRNA:ncRNA binding energies. The black dashed line indicates the significance threshold (p<0.05). (D) A high intrinsic avoidance strain (*Thermodesulfobacterium sp.* OPB45) shows a clear separation between the G+C distribution of mRNAs and ncRNAs (p=9.2 × 10⁻²⁵, two-tailed Mann-Whitney U test), and a low intrinsic avoidance strain (*Mycobacterium sp.* JDM601) has no G+C difference between mRNAs and ncRNAs (p=0.54, two-tailed Mann-Whitney U test). (E) The x-axis shows −log₁₀(P) for our test of extrinsic avoidance using binding energy estimates for both native and shuffled controls, while the y-axis shows −log₁₀(P) for our intrinsic test of avoidance based upon the difference in G+C contents of ncRNAs and mRNAs. Two perpendicular dashed black lines show the threshold of significance for both avoidance metrics. 97% of bacteria and archaea are significant for at least one of these tests of avoidance.

**DOI:** http://dx.doi.org/10.7554/eLife.13479.003

In terms of stoichiometry, the model we use assumes that ncRNA expression levels are vastly in excess of mRNA expression levels (i.e. [ncRNA] >> [mRNA]) (Giannoukos et al., 2012; Deutscher, 2006). This is generally a biologically reasonable assumption when focussing on core genes based upon past analysis and our own work with RNA-seq data from a range of bacteria and archaea (Figure 4) (Lindgreen et al., 2014). Consequently, any potential mRNA interaction regions are saturated with ncRNA, therefore a summative model of interaction energies is a reasonable approximation to the estimated impact of excess hybridization. If modelling ncRNAs that are not so abundant, then a model weighted by expression level may be advantageous, but it is difficult to assess these across all conditions and developmental stages that are evolutionarily relevant. In order to ensure that our analysis is comparable across all bacteria and archaea we have focussed on just the most highly conserved ncRNA and protein-coding genes. Although, many of the ncRNAs are highly structured and are bound by RNA-binding proteins this is not the case during either synthesis and degradation of these products, furthermore, a fraction of the RNA components of these genes will be exposed. Therefore, we expect these will form useful datasets for initial testing of our hypothesis.

In order to assess whether mRNA:ncRNA avoidance is an evolutionarily conserved phenomenon, we calculated intermolecular binding energies for conserved ncRNAs and mRNAs from 1,582 bacterial and 118 archaeal genomes and compared these to a negative control dataset derived using a di-nucleotide frequency preserving shuffling procedure (Workman, 1999). This measures a property that we call the ‘extrinsic avoidance‘ of mRNA:ncRNA interactions, yet this approach may fail to identify genuine avoidance in cases when the G+C content differences between interacting RNAs is extreme. Measuring only extrinsic avoidance (using shuffled mRNAs as negative controls), we found that stochastic mRNA:ncRNA interactions are significantly underrepresented in most (73%) of the prokaryotic phyla (p<0.05, one-tailed Mann-Whitney U test) (Figure 1B,C and Figure 1—figure supplement 2). This indicates that there is selection against stochastic interactions in both bacteria and archaea.

We next sought to establish the degree to which intrinsic G+C features of RNAs lead to avoidance of stochastic interactions (Figure 1D). A similar idea has been proposed which suggests that purine loading in thermophilic bacteria may limit mRNA:mRNA interactions (Lao and Forsdyke, 2000). A test of G+C composition revealed a significant difference (p<0.05, two-tailed Mann-Whitney U test) between mRNAs and ncRNAs for 95% of bacteria and archaea (Figure 1D,E). Therefore, either extrinsic or intrinsic avoidance signals indicate that selection against stochastic interactions and it is near-universal for the prokaryotes (97% of all strains) (Figure 1E and Supplementary file 1A and B).

Our results clearly establish a signature of selection that acts to minimise stochastic mRNA:ncRNA interactions. However, with thousands of potential interacting RNA species in even simple prokaryotic systems (Vivancos et al., 2010; Sharma et al., 2010), the complete avoidance of stochastic interactions is combinatorially unlikely. Therefore, there ought to be a tradeoff between avoidance and optimal expression. To assess this, we examined the relationship between potential stochastic interactions and the variation between mRNA and cognate protein levels for four previously published endogenous mass spectrometry datasets from Escherichia coli (E. coli) and Pseudomonas aeruginosa (P. aeruginosa) (Laurent et al., 2010; Kwon et al., 2014; Lu et al., 2007). We computed Spearman’s correlation coefficients between protein abundances and extrinsic avoidance, 5´ end internal mRNA secondary structure and codon usage. Of the three measures, avoidance is significantly correlated in all four datasets (Spearman’s rho values are between 0.11–0.17 and corresponding p-values are between 0.01 and 1.3 × 10^–12). In contrast, 5´ end mRNA structure significantly correlates in two datasets, and codon usage significantly correlates in all four datasets. This indicates that, despite strong selection against stochastic interactions, such interactions do significantly impact the proteome (Figure 2A and Supplementary file 3). We have also conducted an ‘outlier analysis’ on one of the E. coli datasets (Laurent et al., 2010). We have selected the top and bottom-most expressed genes relative to mRNA expression levels and computed Z-scores for each of codon-usage, internal secondary structure and avoidance measures. We found that avoidance measures show the most extreme shifts downwards for the bottom-most expressed genes and is shifted the highest for the top-most genes (Figure 2—figure supplement 4).

Figure 2. — (A) This heatmap summarizes the effect sizes of four mRNA attributes (avoidance of mRNA:ncRNA interaction, 5´ end secondary structure, codon bias and mRNA abundance) on protein expression as Spearman’s correlation coefficients, which are represented in gradient colors, while a starred block shows if the associated correlation is significant (p<0.05). (B) GFP expression correlates with optimized codon selection, measured by CAI (R_s = 0.29, p=0.016). (C) GFP expression correlates with 5 $^{'}$ end secondary structure of mRNAs, measured by 5’ end intramolecular folding energy (R_s = 0.34, p=0.006). (D) GFP expression correlates with avoidance, measured by mRNA:ncRNA binding energy (R_s = 0.56, p=6.9 × 10⁻⁶). (E) Each cartoon illustrates the corresponding hypothesis; (1) optimal codon distribution (corresponding tRNAs are available for translation), (2) low 5´ end RNA structure (high folding energy of 5´ end) and (3) avoidance (fewer crosstalk interactions) lead to faster translation.

**DOI:** http://dx.doi.org/10.7554/eLife.13479.006

Figure 2—figure supplement 1. — (A) This heatmap summarizes the effect sizes of four mRNA attributes (avoidance of mRNA:ncRNA interaction, 5´ end secondary structure, codon bias and mRNA abundance) on protein expression as Spearman’s correlation coefficients, which are represented in gradient colors, while a starred block shows if the associated correlation is significant (p<0.05). (B) GFP expression correlates with optimized codon selection, measured by CAI (R_s = 0.29, p=0.016). (C) GFP expression correlates with 5 $^{'}$ end secondary structure of mRNAs, measured by 5’ end intramolecular folding energy (R_s = 0.34, p=0.006). (D) GFP expression correlates with avoidance, measured by mRNA:ncRNA binding energy (R_s = 0.56, p=6.9 × 10⁻⁶). (E) Each cartoon illustrates the corresponding hypothesis; (1) optimal codon distribution (corresponding tRNAs are available for translation), (2) low 5´ end RNA structure (high folding energy of 5´ end) and (3) avoidance (fewer crosstalk interactions) lead to faster translation.

**DOI:** http://dx.doi.org/10.7554/eLife.13479.006

We also test how mRNA:ncRNA crosstalk impacts the translation of transformed mRNAs that have not coevolved with the ncRNA repertoire (low avoidance mRNAs are rare in native datasets). We examined two available E. coli-based GFP experimental datasets (Goodman et al., 2013; Kudla et al., 2009), where synonymous mRNAs are generated for a GFP reporter gene. This enables the assessment of the impact of synonymous changes on protein abundance using fluorescence. Avoidance and mRNA secondary structure are both significantly correlated with fluorescence, whereas codon usage is not (Spearman’s rho values are 0.11 and 0.65, the corresponding p-values are 3.17 × 10⁻⁴¹ and 1.69^–20) (Figure 2A). Note that one of the GFP datasets (Goodman et al., 2013) uses native E. coli mRNA 5´ ends for their constructs, whereas the other GFP dataset (Kudla et al., 2009) is randomly generated. We observe that the influence of avoidance on gene expression for randomly sampled synonymous mRNAs is strong (Figure 2—figure supplement 3), while endogenous gene expression is limited. Presumably, due to negative selection pruning low avoidance mRNAs from the gene pool (Figure 2A).

For each of the seven datasets described above we have tested linear models of measures of mRNA levels, codon usage, internal secondary structure and avoidance (Figure 2—figure supplement 3 and Supplementary file 5). Avoidance alone explains around 35% of variance in GFP datasets where extreme mRNA compositions can be explored, whereas in native mass-spec derived datasets 2–3% of the variance is explained by avoidance alone. Codon usage describes 2% to −0.5% of variance in GFP data, and 19% to 0.3% of variance in mass-spec derived datasets. Internal secondary structure 33% to 10% in GFP datasets, and 0.2% to 0% of the variance in mass-spec derived datasets. Using all four measures in combination across the seven datasets between 70% and 42% of variation in protein levels can be explained, removing avoidance from the model reduces these estimates by between 56% and 0.7%. Thus, avoidance is at least as good an explanation of variation in protein abundance as either codon usage and internal mRNA secondary structure.

Our results indicate that crosstalk between mRNAs and ncRNAs can impact protein expression levels. We therefore predict that taking crosstalk into account will enable the design of constructs where protein expression levels can be precisely controlled. To test this, we generated GFP constructs based on the following constraints: codon bias, 5´ end mRNA secondary structure stability and crosstalk avoidance (see Materials and methods). Our constructs are designed to capture the extremes of one variable, while controlling other variables (e.g. high or low avoidance and near-average codon bias and mRNA secondary structure). The G+C content, a known confounding factor, was also strictly controlled for each construct. We selected a commercial service to perform our GFP transformations to avoid possible bias and increase the robustness of our approach (Ioannidis and Khoury, 2011). We predicted that a construct where all three parameters are optimised will result in a higher expression. Consistent with predictions, our optimised construct had maximal expression (Figure 2—figure supplement 1). Of the three parameters, avoidance showed the largest range, suggesting that tuning this parameter permits expression levels to be finely controlled (R_s = 0.56, p=6.9 × 10⁻⁶) (Figure 2B–D and Figure 2—figure supplements 1–4).

For a final confirmation of the avoidance hypothesis, we tested the Thermus thermophilus (T. thermophilus) HB8 SSU ribosomal RNA, which is a component of one of the most complete prokaryotic ribosomal structures available in the PDB (Rozov et al., 2015). We identified the regions of the SSU rRNA that had the least capacity to interact with T. thermophilus core mRNAs and found that these regions were generally not bound to either ribosomal proteins or other ncRNAs, such as the LSU rRNA (p=2.49 × 10⁻¹⁷, Fisher’s exact test) (Figure 3; see ‘Materials and methods’). The influence of internal SD-like regions on translation pausing have been described elsewhere (Li et al., 2012), in addition we note that the anti-SD region on SSU rRNA is one of the RNA avoidance regions (Figure 3A).

Figure 3. — (A) In the upper bar, the regions of the *T. thermophilus* SSU rRNA that are under-represented in stable interactions with mRNAs (p<0.05) are highlighted in red. In the lower bar, the inaccessible residues (<3.4 Angstroms from other nucleotides or amino acids in the PDB structure 4WZO). (B) The 3 dimensional structure of the *T. thermophilus* ribosome includes 5S, SSU and LSU rRNA, 48 ribosomal proteins, 4 tRNA and a bound mRNA (PDB ID: 4WZO) (Rozov et al., 2015). We have highlighted the most avoided regions of the SSU rRNA in red (based upon the fewest stable interactions with *T. thermophilus* mRNAs (p<0.05). Two different orientations are shown on the left and right, the upper structure shows just the SSU rRNA and mRNA structures, the lower includes the ribosomal proteins (coloured blue). Bottom right, a view of the ribosome that also includes the LSU rRNA (green) is also shown. There is a significant correspondence between the accessibility of a region of SSU rRNA and the degree to which it is avoided (p=2.5 × 10⁻¹⁷, Fisher’s exact test).

**DOI:** http://dx.doi.org/10.7554/eLife.13479.012

Figure 3—figure supplement 1. — (A) In the upper bar, the regions of the *T. thermophilus* SSU rRNA that are under-represented in stable interactions with mRNAs (p<0.05) are highlighted in red. In the lower bar, the inaccessible residues (<3.4 Angstroms from other nucleotides or amino acids in the PDB structure 4WZO). (B) The 3 dimensional structure of the *T. thermophilus* ribosome includes 5S, SSU and LSU rRNA, 48 ribosomal proteins, 4 tRNA and a bound mRNA (PDB ID: 4WZO) (Rozov et al., 2015). We have highlighted the most avoided regions of the SSU rRNA in red (based upon the fewest stable interactions with *T. thermophilus* mRNAs (p<0.05). Two different orientations are shown on the left and right, the upper structure shows just the SSU rRNA and mRNA structures, the lower includes the ribosomal proteins (coloured blue). Bottom right, a view of the ribosome that also includes the LSU rRNA (green) is also shown. There is a significant correspondence between the accessibility of a region of SSU rRNA and the degree to which it is avoided (p=2.5 × 10⁻¹⁷, Fisher’s exact test).

**DOI:** http://dx.doi.org/10.7554/eLife.13479.012

This study focusses on the 5´ ends of the CDS as this region is important for the initiation of translation (Plotkin and Kudla, 2011; Tuller and Zur, 2015) and is a consistent feature of all the genomic, transcriptomic, proteomic and GFP expression datasets that we have evaluated in this work. In smaller-scale tests we have observed similar conserved avoidance signals within the entire CDSs (Figure 3—figure supplement 1) and within the 5´UTRs (Figure 3—figure supplement 2). Furthermore, we predict that similar signals can be observed for mRNA:mRNA and ncRNA:ncRNA avoidance. Although the impacts of these features are challenging to validate, interactions between clustered regularly interspaced short palindromic repeats (CRISPR) spacer sequences (Bhaya et al., 2011) and core ncRNAs are good candidates to test ncRNA:ncRNA avoidance.

In conclusion, our results indicate that the specificity of prokaryotic ncRNAs for target mRNAs is the result of selection both for a functional interaction and against stochastic interactions. Our experimental results support the view that stochastic interactions are selected against, due to deleterious outcomes on expression. We suspect avoidance of crosstalk interactions has several evolutionary consequences. First, as transcriptional outputs become more diverse in evolution, we expect that the probability of stochastic interactions for both new ncRNAs and mRNAs becomes higher. This will impact the emergence of new, high abundance RNAs, since selection for high abundance may be mitigated by deleterious crosstalk events. Second, we predict that stochastic interactions limit the number of simultaneously transcribed RNAs, since the combinatorics of RNA:RNA interactions imply that eventually stochastic interactions cannot be avoided. This may in turn drive selection for forms of spatial or temporal segregation of transcripts. Finally, taking codon usage, mRNA secondary structure and potential mRNA:ncRNA interactions into account allows better prediction of proteome outputs from genomic data, and informs the precise control of protein levels via manipulation of synonymous mRNA sequences (Figure 2—figure supplement 5).

Materials and methods

Here we summarize the data sources, materials and methods corresponding to our manuscript. We performed all statistical analyses in R, and all other computational methods in Python 2.7 or Bash shell scripts. We explicitly cite all the bioinformatics tools and their versions. All tables (Supplementary files 1–5) are available as supporting online material. All of our own sequences, scripts and R workspace images are available on Github including the supplementary files (http://github.com/UCanCompBio/Avoidance). The other datasets are cited in the manuscript (Supplementary file 3).

Evolutionary conservation

If excessive interactions between messenger RNAs (mRNAs) and non-coding RNAs (ncRNAs) are detrimental to cellular function, then we expect the signature of selection against interactions (avoidance) to be a conserved feature of prokaryotic genomes. In the following, we describe where the data used to test the evolutionary conservation of avoidance was acquisitioned, the models that we use to test avoidance and the negative controls in detail for evolutionary conservation predictions. We also investigate detect regions of avoidance on one of the core ncRNAs, the ribosomal small subunit (SSU) RNA.

Data sources for bacterial genomes

The bacterial genomes and annotations that we used for investigating mRNA:ncRNA interactions were acquired from the EBI nucleotide archive (2564 sequenced bacterial genomes available on August 2013; http://www.ebi.ac.uk/genomes/bacteria.html). We selected an evolutionarily conserved (core) group of 114 mRNAs from PhyEco (Wu et al., 2013) and an evolutionarily conserved (core) group of ncRNAs (Hoeppner et al., 2012). PhyEco markers are based on a set of profile HMMs that correspond to highly conserved bacterial protein coding genes (these include ribosomal proteins, tRNA synthetases as well as other components of translation machinery, DNA repair and polymerases) (Wu et al., 2013). The HMMer package (version 3.1b1) (Eddy, 2011) was used to extract the mRNAs corresponding to these marker genes from genome files. We removed genome sequences that host fewer than 90% of the marker genes; leaving 1582 bacterial genome sequences and 176,704 core mRNAs that spanned these. We extracted the first to the 21st nucleotide of the core mRNAs. As this region showed the strongest signal in a small-scale analysis (Figure 3—figure supplement 1A), this region has also been shown to have an unusual codon distribution in previous work (Tuller and Zur, 2015; Goodman et al., 2013) as explained in the main text. We obtained ncRNA annotations using the Rfam database (version 11.0) (Gardner et al., 2011) for the well conserved and highly expressed tRNA, rRNA, RNase P RNA, SRP RNA, tmRNA and 6S RNA families (Rfam accessions: RF00001, RF00005, RF00010, RF00011, RF00013, RF00023, RF00169, RF01854, RF00177). The redundant annotations were filtered for overlapping and identical paralogous sequences, leaving 99,281 core ncRNA that spanned 1582 bacterial genomes.

Data sources for archaeal genomes

We followed a similar pipeline for archaeal genomes as described for bacterial genomes. In total we processed 240 archaeal genomes, and after filtering those that had fewer than 90% of the marker genes, we had 118 archaeal genomes for further analysis (genomes available on August 2013) (http://www.ebi.ac.uk/genomes/archaea.html). These genomes host 12,370 and 10,804 core mRNAs and core ncRNAs respectively.

Test of an (extrinsic) avoidance model

We used RNAup (version 2.0.7) (Lorenz et al., 2011) to calculate the binding minimum (Gibbs) free energy (MFE) values of mRNA:ncRNA interactions. The RNAup algorithm combines the intramolecular energy necessary to open binding sites with intermolecular energy gained from hybridization (Mückstein et al., 2006). In other words, this approach minimizes the sum of opening intramolecular energies and the intermolecular energy (Figure 1—figure supplement 1C). In our model of avoidance, we test for a reduction in absolute binding MFE relative to negative controls as a measure of avoidance. After testing a variety of negative controls (e.g. dinucleotide preserved shuffled mRNAs, the 5´ end of homologous mRNAs from a different bacterial phylum, 100 nucleotides downstream of designated interaction region, reverse complements, and identically sized intergenic regions), we selected the dinucleotide frequency preserved shuffled sequences as our negative control since this displayed the most conservative interaction MFE distribution (Figure 3—figure supplement 1A–C). In more detail, to serve as a negative control we compute the interaction MFE between each of the core ncRNAs and 200 dinucleotide-preserved shuffled versions of the 5´ end mRNAs. A dinucleotide frequency preserving shuffling procedure is used, as Gibbs free energies are computed over base pair stacks, i.e. a dinucleotide alphabet, therefore this method has been shown to be important in order to minimise incorrect conclusions (Workman, 1999). We tested if the energy difference between native and shuffled interaction distributions is statistically significant using the nonparametric one-tailed Mann-Whitney U test, which returns a single p-value per genome (Figure 1C). If the distribution of native interaction energies for a genome is significantly higher (i.e. fewer stable interactions) than the negative control, this is an indication that the genome has undergone selection for mRNA:ncRNA avoidance. To create the background density difference lines (seen in grey at Figure 1B), we randomly selected 100 bacterial strains and plot differences between the densities of shuffled interactions.

Test of an intrinsic avoidance model

The energy-based avoidance model that we defined above is opaque to cases of ‘intrinsic avoidance’. These are where the intrinsic properties of mRNA and ncRNA sequences restrict their ability to interact. For an extreme example, if ncRNAs are composed entirely of guanine and cytosine nucleotides, whilst mRNAs are composed entirely of adenine and uracil nucleotides, then these will rarely interact. Therefore, our energy-based avoidance measures for native and shuffled interactions will both be near zero, and thus will not detect a significant energy shift between the native and control sequences. In order to account for some of these issues, we compared the G+C difference between core ncRNAs and core mRNAs. We used a nonparametric two-tailed Mann-Whitney U test to determine if there is a statistically significant G+C difference between the two samples: G+C of ncRNAs vs G+C of 5´ end mRNAs (Figure 1D,E).

Sliding window analysis to detect regions of significance for avoidance on SSU ribosomal RNA

We hypothesise that heterogeneous signals of avoidance within ncRNA sequences may correspond to the accessibility of different ncRNA regions. For example, are highly avoided regions of abundant ncRNAs more accessible than those that are avoided less? To create an avoidance profile, we tested binding MFEs of native and shuffled interactions throughout the full-length SSU ribosomal RNA of T. thermophilus, using a one tailed Mann-Whitney U tests to evaluate the degree of avoidance for each nucleotide in the SSU rRNA (Figure 3) with a window size of 10 and step size of 1 (Supplementary file 4). We selected the protein data bank (PDB) entry (4WZO) as it is one of the few ribosomal structures with associated protein, mRNA, tRNA and LSU binding data (Rozov et al., 2015). The native interactions are the interactions between T. thermophilus core mRNAs and SSU ribosomal RNA. The shuffled controls are derived from 200 dinucleotide preserved shuffled versions of the RNAs. We created a 2 × 2 contingency table which separates the counts of residues that either host a strong avoidance signal or little avoidance signal (regions with p<0.001, Mann-Whitney U test) and residues that we predict to either be in contact (<3.4 Angstroms between atoms) with ribosomal proteins or ribosomal, transfer or messenger RNAs or not in contact with other molecules (i.e. accessible) (Figure 3). We applied a Fisher’s exact test Fisher (1992) to these groups to and discovered a statistically significant relationship between avoidance and accessibility (p=2.5 × 10⁻¹⁷).

We have applied the same analysis to the other T. thermophilus core ncRNA genes (tRNAs, tmRNA, RNase P RNA and SRP RNA) in order to determine regions of avoidance (Figure 3—figure supplement 3). Since there are more than one tRNAs, we aligned the cellular RNAs to the associated Rfam model (RF00005) (Gardner et al., 2011; Nawrocki et al., 2015) using the cmalign tool (Nawrocki and Eddy, 2013).

Sliding window analysis to detect regions of significance for avoidance on mRNAs

In order to identify a region of mRNA that is consistent and unique in the datasets that we applied evolutionary and expression analyses to we created an avoidance profile from the previously published GFP mRNAs (Kudla et al., 2009). We calculated binding MFEs using a window size of 21 with a 1 nucleotide step size, and for each region we computed the associated Spearman’s correlation coefficients with p-values. This analysis revealed the significance of the first 21 nucleotides on expression, this is consistent with previous results that identify initiation as the rate limiting step for translation (Tuller and Zur, 2015; Plotkin and Kudla, 2011). It also revealed other statistically significant regions with high correlation correlation coefficient throughout the GFP mRNAs (Figure 3—figure supplement 1A).

Proteomics/Transcriptomics and GFP expression

We predict that mRNAs with low avoidance values will produce fewer proteins for each mRNA transcript than those with high avoidance. In order to test this, we conducted a meta-analysis of proteomics and transcriptomics data and the relationship between this data and measures of mRNA and ncRNA avoidance. In the following section we describe the origins of the data we have used and the statistical analysis we use to test whether avoidance influences gene expression.

Data sources and statistics for mRNA, protein abundance and GFP expression

We compiled our data from five protein and mRNA quantification datasets, which consist of three E. coli (Laurent et al., 2010; Goodman et al., 2013; Lu et al., 2007) and two P. aeruginosa (Laurent et al., 2010; Kwon et al., 2014) (Supplementary file 3). We calculated Spearman’s correlation coefficients (and associated p-values) among the protein abundances and 5´ end secondary structure (measured by intermolecular MFE), codon bias (measured by codon adaptation index (CAI)) and avoidance (Figure 2A). We have created single and multiple regression models to determine the explained variances by these parameters (Figure 2—figure supplement 3 and Supplementary file 5). These models show that avoidance explains more variance on average than secondary structure or codon bias. Up to 70 percent of the variation in GFP expression can be explained by including all the parameters and mRNA abundances (Figure 2—figure supplement 3). CAI metric defines how well mRNAs are optimised for codon bias (Sharp and Li, 1987). The CAI values were determined based on codon distribution patterns acquired from the core protein coding genes of E. coli BL21(DE3) (Accession: AM946981.2) (Wu et al., 2013) using Biopython libraries (version 1.6) (Cock et al., 2009). The folding MFE predicts how stable the secondary structure of an RNA can be. The folding MFEs of GFP mRNAs were calculated using the RNAfold algorithm (version 2.0.7) (Lorenz et al., 2011). We restricted folding energy to first 37 nucleotides because the most significant correlation was previously reported for this region (Kudla et al., 2009). We acquired previously published GFP data, associated fluorescence values and mRNA quantifications (Kudla et al., 2009) via personal communication. Our avoidance model showed the highest and most significant correlation with GFP expression in that dataset (R_s = 0.65, p=1.69 × 10⁻²⁰) (Figure 2A and Figure 2—figure supplement 2D,E,F). 5´ end secondary structure (R_s = 0.62, p=5.73 × 10⁻¹⁸) correlates slightly less than avoidance, while CAI does not correlate significantly (R_s = 0.02, p=0.4).

mRNA design

We have shown that avoidance is a broadly evolutionary conserved phenomenon and that it is significantly correlated with protein abundance relative to mRNA abundance. We now wish to test if avoidance can be used to design mRNA sequences that modulate the abundance of corresponding protein in a predictable fashion. We use a set of GFP mRNA constructs that all maintain the same G+C content, codon adaptation index (CAI) and internal secondary structure but host either very high or very low avoidance values. This procedure was repeated for the CAI and internal secondary structure values while maintaining a constant avoidance. The resulting 13 constructs were synthesised, transformed and expressed by commercial services. In the following paragraphs, we explained how we design our GFP constructs, the experimental set-up and statistical analyses.

Green fluorescence protein (GFP) mRNA design

We sampled 537,000 synonymous mRNA variants of a GFP mRNA (the 239 AA, 720 nucleotide long, with accession AHK23750, can be encoded by 7.62 × 10¹¹¹ possible unique mRNA variants). In brief, these mRNA variants were scored based upon (1) CAI, (2) mRNA secondary structure in their 5´ end region, and (3) mRNA:ncRNA interaction avoidance in their 5 end region. The genome of E. coli BL21 encodes 52 unique core ncRNAs (Gardner et al., 2011; Nawrocki et al., 2015), to estimate the level of ncRNA avoidance for each GFP mRNA, we sum the binding MFEs. For example, for each GFP mRNA we compute 52 independent binding MFE values for each ncRNA. In short, a higher summed MFE score for a GFP mRNA implies a higher avoidance, while a lower summed MFE score implies a lower avoidance. This approach assumes that the ncRNAs are expressed at much higher levels than GFP mRNAs (i.e. [ncRNA] >> [mRNA]) (Figure 4). Consequently, any potential interaction site on GFP mRNAs are likely to be saturated with ncRNA. Finally, we selected 13 GFP mRNA constructs, while controlling the range of G+C values. These GFP mRNAs were designed to have four different aspects; extreme 5 end secondary structure (2 minimum and 2 maximum folding MFE constructs), extreme codon bias (2 maximum and 2 minimum CAI constructs), extreme interaction avoidance (2 minimum and 2 maximum binding MFE constructs) and an ‘optimal’ construct. The optimal construct was selected for a high CAI, low 5´ end structure and high avoidance. All extreme GFP mRNA constructs have near identical G+C content (between 0.468–0.480) and identical G+C contents at the 5´ end (0.48). Each of the sampled GFP mRNAs is separated from other mRNAs by at least 112 nucleotide substitutions and 122 nucleotide substitutions on average (Figure 2—figure supplement 1).

Figure 4. — To create this plot, we used mean mapped reads per gene length (i.e. mean read depth per position) of each core gene. The expression data are compiled from 5 archaeal and 37 bacterial strains from a previous study (Lindgreen et al., 2014).

**DOI:** http://dx.doi.org/10.7554/eLife.13479.016

Extreme GFP transformations, determining fluorescence levels and RT-qPCR analyses

Both GFP expression assays and RT-qPCR analyses were performed as part of a commercial service offered by the University of Queensland, Protein Expression Facility and Real-Time PCR Facility. Plasmid DNA from each construct was transformed into an expression strain of E. coli BL21(DE3). Starter cultures were grown in quadruplicate from single colonies in 0.5 mL of TB kanamycin 30 μg/mL media in a 96 deep-well microplate and incubated at 30°C, 400 rpm (3 mm shaking throw). Each starter culture was used to inoculate 1.0 mL of the same media at a ratio of 1:50, each in a single well of a 96 deep-well plate. The cultures were incubated at 30°C, 400 rpm for 1 hr, at this point the cultures were chilled for 5 min then induced into 0.2 mM IPTG and incubated at 20°C. For analysis, culture samples of 100 μL were taken at 1 hr, 2 hr, 3 hr, 4 hr and 22 hr (overnight) hours post-induction (HPI) for fluorescence and optical density analysis. Samples were collected in PetriWell 96-well flat bottom, black upper, lidded microplates (Genetix). Cell density of fluorescence measurements was performed on a Spectramax M5 Microplate Reader using SMP software v 5.2 (Molecular Devices). For fluorescence intensity measurements, samples were collected in the 96-well plate listed above. Samples were analysed by bottom-read, 10 reads per well at an excitation wavelength = 488 nm, emission wavelength = 509 nm with an automatic cut-off at 495 nm and measured as relative fluorescence units (RFU). The raw RFU values were normalised by subtracting the averaged baseline values obtained from untransformed BL21(DE3) at the same time point. All samples at the 22 HPI time point were diluted 1:4 in TB kanamycin 30 μg/mL media before measurement. Total RNA was purified from induced 0.5 mL of BL21(DE3) cultures on MaxwellÂ 16 robot (Promega) using LEV simplyRNA Tissue Kit (Promega). RNA concentrations were assessed on Qubit 3.0 Fluorometer (Thermo Fisher Scientific). cDNA synthesis was done using ProtoScript II First Strand cDNA Synthesis Kit (NEB) according to manufacturer protocol using random primer. The rpsL gene was selected as the reference gene (internal control). RT-qPCR was performed in 384-well plates with a ViiA 7 Real-Time PCR System (Thermo Fisher Scientific) using Life Technology SYBR Green-based PCR assay. The data analysis was performed using Applied Biosystems QuantStudio software (Thermo Fisher Scientific). The total volume of reaction was 10 μL including 0.2 μM of each primer as a final concentration. The following PCR conditions were used: 95°C for 10 min, followed by 40 cycles of 95°C for 15 s and 60°C for 1 min. The melting curves were analyzed at 60–95°C after 40 cycles. RNA concentrations were subsequently estimated using the approach (Schmittgen and Livak, 2008). We shared the raw data, oligos and primers in the supplementary files (Supplementary file 2A,B).

Statistical analyses of extreme GFP data

As described, we designed extreme GFP mRNA constructs, and measured the associated fluorescence. A Kruskal-Wallis test (nonparametric alternative of ANOVA) shows a statistically significant difference between the fluorescence of GFP mRNA groups (p=1.35 × 10^–5) (Figure 2—figure supplement 1). Our pairwise comparison of GFP groups using a Kruskal-Nemenyi test (a nonparametric alternative of the Student’s t-test) for fluorescence difference also reveals a statistically significant difference in fluorescence between high avoidance constructs and low avoidance constructs (p=0.00036). We computed the Spearman’s correlation coefficients (and associated p-values) between GFP expression and each of the following measures; CAI (R_s = 0.29, p=0.016), intramolecular folding energy (R_s = 0.34, p=0.006), avoidance (intermolecular binding energy) (R_s = 0.29, p=6.9 × 10⁻⁶) and mRNA concentration (R_s = 0.73, p=3.2 × 10⁻³) to predict effect size of each predictor. Our avoidance model resulted in the highest correlation with GFP expression (Figure 2B–D).

Acknowledgements

Thanks to Grzegorz Kudla for sharing the GFP expression data from Kudla et al. (2009) and Cindy Chang, Emilyn Tan, Michael Nefedov from the RT-PCR and the PEF facilities at the University of Queensland for assistance with generating GFP expression data. We also acknowledge Jeppe Vinther, Lukasz Kielpinski, Anders Krogh and the attendees of the 2012 and 2015 Benasque RNA conference for stimulating discussions.

Funding Statement

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Funding Information

This paper was supported by the following grants:

Royal Society of New Zealand to Sinan Uğur Umu, Anthony M Poole, Renwick CJ Dobson, Paul P Gardner.
University of Canterbury Biomolecular Interaction Centre, Joint PhD Scholarship to Sinan Uğur Umu.
Royal Society of New Zealand RutherfordDiscovery Fellowships to Anthony M Poole, Paul P Gardner.
Army Research Office to Renwick CJ Dobson.

Additional information

Competing interests

The authors declare that no competing interests exist.

Author contributions

SUU, Conception and design, Acquisition of data, Analysis and interpretation of data, Drafting or revising the article, Contributed unpublished essential data or reagents.

AMP, Conception and design, Analysis and interpretation of data, Drafting or revising the article.

RCJD, Conception and design, Drafting or revising the article.

PPG, Conception and design, Acquisition of data, Analysis and interpretation of data, Drafting or revising the article, Contributed unpublished essential data or reagents.

Additional files

Supplementary file 1. The levels of extrinsic and intrinsic avoidance for each publicly available bacterial and archaeal genome sequence.

Column A contains the species and strain names; Column B contains the ENA accession; Column C contains the phylum name; Column D contains the extrinsic avoidance p-value (i.e. the difference between RNAup Gibbs’ free-energy of interaction distributions between native and randomized sequences, one-tailed Mann-Whitney U test); Column E contains the genomic G+C content; Column F contains the genome size in nucleotides; Column G contains the average G+C content of the first 21 nucleotides of 114 ‘core’ (highly conserved) mRNAs; Column F contains the average G+C content of six core highly expressed ncRNAs (tRNA, rRNA, RNase P RNA, SRP RNA, tmRNA and 6S RNA); Column F contains the intrinsic avoidance p-value (i.e. the difference in G+C content between core mRNAs and ncRNAs, two-tailed Mann- Whitney U test). Table A contains the data for bacterial species, Table B contains the data for archaeal species.

DOI: http://dx.doi.org/10.7554/eLife.13479.017

elife-13479-supp1.xlsx^{(161.7KB, xlsx)}

DOI: 10.7554/eLife.13479.017

Supplementary file 2. Data for the extreme GFP transformations, including fluorescence levels and RT-qPCR analyses.

Table A: Column A contains a unique identifier for each mRNA sequence; Column B contains a unique numeric code for each mRNA sequence; Column C contains a colour used for visualizing the datasets; Column D contains the sum of RNAup Gibbs’ free-energy of interactions between nucleotides 1 to 21 of the GFP mRNA and the core ncRNAs of E. coli BL21(DE3); Column E contains the RNAfold Gibbs’ free-energy of folding for nucleotides 1 to 37; Column F contains the CAI for each GFP mRNA, using codon distribution patterns acquired from the core protein coding genes of E. coli BL21(DE3); Column G contains the GFP flourescence values (four replicates for each mRNA); Column H contains ΔΔC_T values from RT-qPCR results, these can be used to quantify mRNA concentrations; Column I contains mean GFP fluorescence values; Column J contains mRNA abundances; Table B: contains the oligonucleotide sequences used to gather RT-qPCR results.

DOI: http://dx.doi.org/10.7554/eLife.13479.018

elife-13479-supp2.xlsx^{(7.8KB, xlsx)}

DOI: 10.7554/eLife.13479.018

Supplementary file 3. Data corresponding to the heatmap in Figure 2A.

Each row corresponds to a different dataset containing protein and mRNA expression levels. Column A contains a brief summary of the type of dataset; Column B contains Spearman correlation coefficients (and corresponding p-values) between avoidance measurements and protein abundance; Column C contains Spearman correlation coefficients (and corresponding p-values) between mRNA secondary structure measurements measurements and protein abundance; Column D contains Spearman correlation coefficients (and corresponding p-values) between codon adaptation index measurements and protein abundance; Column E contains Spearman correlation coefficients (and corresponding p-values) between mRNA and protein abundances; Column F contains the species names of the organism each dataset was collected in; Column G contains the size of the dataset; Column H contains a reference to the manuscript each dataset was first published in; Column H contains a link to the Pubmed entry for each manuscript;

DOI: http://dx.doi.org/10.7554/eLife.13479.019

elife-13479-supp3.xlsx^{(5.3KB, xlsx)}

DOI: 10.7554/eLife.13479.019

Supplementary file 4. Data corresponding to Figure 3.

<3.4 For each position in the SSU rRNA sequence provided in PDB structure 4WZO, in Column A the accessibility of the region is evaluated (i.e. ’1’ if the residue is Angstroms from another nucleotide or aminoacid residue and ’0’ otherwise). Column B contains local SSU:mRNA avoidance p-values corresponding to differences in RNAup Gibbs’ free-energy of interaction distributions between native and randomized sequences using Mann-Whitney U tests.

DOI: http://dx.doi.org/10.7554/eLife.13479.020

elife-13479-supp4.xlsx^{(33.5KB, xlsx)}

DOI: 10.7554/eLife.13479.020

Supplementary file 5. This table contains the results of R² values for linear regression models between protein levels and the different predictors of expression level that are illustrated in Figure 2—figure supplement 3.

Column A contains the model (e.g. ”Protein abundance Avoidance + Folding Energy + CAI + mRNA abundance” corresponds to the relationship between protein abundance, mRNA avoidance, mRNA folding, codon usage and mRNA abundance; Column B contains the R² value for the relationship; Column C contains the reference the corresponding protein and mRNA data came from; Column D contains the type of data and the size of the dataset;

DOI: http://dx.doi.org/10.7554/eLife.13479.021

elife-13479-supp5.xlsx^{(6.6KB, xlsx)}

DOI: 10.7554/eLife.13479.021

References

Andersson SG, Kurland CG. Codon preferences in free-living microorganisms. Microbiological Reviews. 1990;54:198–210. doi: 10.1128/mr.54.2.198-210.1990. [DOI] [PMC free article] [PubMed] [Google Scholar]
Bartel DP, Chen CZ. Micromanagers of gene expression: the potentially widespread influence of metazoan microRNAs. Nature Reviews Genetics. 2004;5:396–400. doi: 10.1038/nrg1328. [DOI] [PubMed] [Google Scholar]
Bhaya D, Davison M, Barrangou R. CRISPR-Cas systems in bacteria and archaea: versatile small RNAs for adaptive defense and regulation. Annual Review of Genetics. 2011;45:273–297. doi: 10.1146/annurev-genet-110410-132430. [DOI] [PubMed] [Google Scholar]
Borg A, Ehrenberg M. Determinants of the rate of mRNA translocation in bacterial protein synthesis. Journal of Molecular Biology. 2015;427:1835–1847. doi: 10.1016/j.jmb.2014.10.027. [DOI] [PubMed] [Google Scholar]
Boël G, Letso R, Neely H, Price WN, Wong K-H, Su M, Luff JD, Valecha M, Everett JK, Acton TB, Xiao R, Montelione GT, Aalberts DP, Hunt JF. Codon influence on protein expression in E. coli correlates with mRNA levels. Nature. 2016-21;529:358–363. doi: 10.1038/nature16509. [DOI] [PMC free article] [PubMed] [Google Scholar]
Chamary JV, Hurst LD. Evidence for selection on synonymous mutations affecting stability of mRNA secondary structure in mammals. Genome Biology. 2005;6:e13479. doi: 10.1186/gb-2005-6-9-r75. [DOI] [PMC free article] [PubMed] [Google Scholar]
Chen WH, van Noort V, Lluch-Senar M, Hennrich ML, Wodke JA, Yus E, Alibés A, Roma G, Mende DR, Pesavento C, Typas A, Gavin AC, Serrano L, Bork P. Integration of multi-omics data of a genome-reduced bacterium: Prevalence of post-transcriptional regulation and its correlation with protein abundances. Nucleic Acids Research. 2016;44:1192–1202. doi: 10.1093/nar/gkw004. [DOI] [PMC free article] [PubMed] [Google Scholar]
Cock PJ, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A, Friedberg I, Hamelryck T, Kauff F, Wilczynski B, de Hoon MJ. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics. 2009;25:1422–1423. doi: 10.1093/bioinformatics/btp163. [DOI] [PMC free article] [PubMed] [Google Scholar]
de Sousa Abreu R, Penalva LO, Marcotte EM, Vogel C. Global signatures of protein and mRNA expression levels. Molecular BioSystems. 2009;5:1512–1526. doi: 10.1039/b908315d. [DOI] [PMC free article] [PubMed] [Google Scholar]
Deutscher MP. Degradation of RNA in bacteria: comparison of mRNA and stable RNA. Nucleic Acids Research. 2006;34:659–666. doi: 10.1093/nar/gkj472. [DOI] [PMC free article] [PubMed] [Google Scholar]
Diwan GD, Agashe D. The frequency ofinternal shine-dalgarno-like motifs in prokaryotes. Genome Biology and Evolution. 2016;8:1722–1733. doi: 10.1093/gbe/evw107. [DOI] [PMC free article] [PubMed] [Google Scholar]
Eddy SR. Accelerated Profile HMM Searches. PLoS Computational Biology. 2011;7:e13479. doi: 10.1371/journal.pcbi.1002195. [DOI] [PMC free article] [PubMed] [Google Scholar]
Farh KK, Grimson A, Jan C, Lewis BP, Johnston WK, Lim LP, Burge CB, Bartel DP. The widespread impact of mammalian MicroRNAs on mRNA repression and evolution. Science. 2005;310:1817–1821. doi: 10.1126/science.1121158. [DOI] [PubMed] [Google Scholar]
Fisher RA. On the interpretation of χ2 from contingency tables, and the calculation of P. Journal of the Royal Statistical Society. 1922;85:87–94. doi: 10.2307/2340521. [DOI] [Google Scholar]
Gardner PP, Daub J, Tate J, Moore BL, Osuch IH, Griffiths-Jones S, Finn RD, Nawrocki EP, Kolbe DL, Eddy SR, Bateman A. Rfam: Wikipedia, clans and the "decimal" release. Nucleic Acids Research. 2011;39:D141–D145. doi: 10.1093/nar/gkq1129. [DOI] [PMC free article] [PubMed] [Google Scholar]
Giannoukos G, Ciulla DM, Huang K, Haas BJ, Izard J, Levin JZ, Livny J, Earl AM, Gevers D, Ward DV, Nusbaum C, Birren BW, Gnirke A. Efficient and robust RNA-seq process for cultured bacteria and complex community transcriptomes. Genome Biology. 2012;13: R23.. doi: 10.1186/gb-2012-13-3-r23. [DOI] [PMC free article] [PubMed] [Google Scholar]
Goodman DB, Church GM, Kosuri S. Causes and effects of N-terminal codon bias in bacterial genes. Science. 2013;342:475–479. doi: 10.1126/science.1241934. [DOI] [PubMed] [Google Scholar]
Hoeppner MP, Gardner PP, Poole AM. Comparative analysis of RNA families reveals distinct repertoires for each domain of life. PLoS Computational Biology. 2012;8:e13479. doi: 10.1371/journal.pcbi.1002752. [DOI] [PMC free article] [PubMed] [Google Scholar]
Ikemura T. Correlation between the abundance of Escherichia coli transfer RNAs and the occurrence of the respective codons in its protein genes: a proposal for a synonymous codon choice that is optimal for the E. coli translational system. Journal of Molecular Biology. 1981;151:389–409. doi: 10.1016/0022-2836(81)90003-6. [DOI] [PubMed] [Google Scholar]
Ioannidis JP, Khoury MJ. Improving validation practices in "omics" research. Science. 2011;334:1230–1232. doi: 10.1126/science.1211811. [DOI] [PMC free article] [PubMed] [Google Scholar]
Kudla G, Murray AW, Tollervey D, Plotkin JB. Coding-sequence determinants of gene expression in Escherichia coli. Science. 2009;324:255–258. doi: 10.1126/science.1170160. [DOI] [PMC free article] [PubMed] [Google Scholar]
Kwon T, Huse HK, Vogel C, Whiteley M, Marcotte EM. Protein-to-mRNA ratios are conserved between Pseudomonas aeruginosa strains. Journal of Proteome Research. 2014;13:2370–2380. doi: 10.1021/pr4011684. [DOI] [PMC free article] [PubMed] [Google Scholar]
Lao PJ, Forsdyke DR. Thermophilic bacteria strictly obey Szybalski's transcription direction rule and politely purine-load RNAs with both adenine and guanine. Genome Research. 2000;10:228–236. doi: 10.1101/gr.10.2.228. [DOI] [PMC free article] [PubMed] [Google Scholar]
Laurent JM, Vogel C, Kwon T, Craig SA, Boutz DR, Huse HK, Nozue K, Walia H, Whiteley M, Ronald PC, Marcotte EM. Protein abundances are more conserved than mRNA abundances across diverse taxa. Proteomics. 2010;10:4209–4212. doi: 10.1002/pmic.201000327. [DOI] [PMC free article] [PubMed] [Google Scholar]
Li GW, Oh E, Weissman JS. The anti-Shine-Dalgarno sequence drives translational pausing and codon choice in bacteria. Nature. 2012;484:538–541. doi: 10.1038/nature10965. [DOI] [PMC free article] [PubMed] [Google Scholar]
Lindgreen S, Umu SU, Lai AS, Eldai H, Liu W, McGimpsey S, Wheeler NE, Biggs PJ, Thomson NR, Barquist L, Poole AM, Gardner PP. Robust identification of noncoding RNA from transcriptomes requires phylogenetically-informed sampling. PLoS Computational Biology. 2014;10:e13479. doi: 10.1371/journal.pcbi.1003907. [DOI] [PMC free article] [PubMed] [Google Scholar]
Lorenz R, Bernhart SH, Höner Zu Siederdissen C, Tafer H, Flamm C, Stadler PF, Hofacker IL. ViennaRNA Package 2.0. Algorithms for Molecular Biology. 2011;6:26. doi: 10.1186/1748-7188-6-26. [DOI] [PMC free article] [PubMed] [Google Scholar]
Lu P, Vogel C, Wang R, Yao X, Marcotte EM. Absolute protein expression profiling estimates the relative contributions of transcriptional and translational regulation. Nature Biotechnology. 2007;25:117–124. doi: 10.1038/nbt1270. [DOI] [PubMed] [Google Scholar]
Maier T, Schmidt A, Güell M, Kühner S, Gavin AC, Aebersold R, Serrano L. Quantification of mRNA and protein and integration with protein turnover in a bacterium. Molecular Systems Biology. 2011;7:511. doi: 10.1038/msb.2011.38. [DOI] [PMC free article] [PubMed] [Google Scholar]
Mückstein U, Tafer H, Hackermüller J, Bernhart SH, Stadler PF, Hofacker IL. Thermodynamics of RNA-RNA binding. Bioinformatics. 2006;22:1177–1182. doi: 10.1093/bioinformatics/btl024. [DOI] [PubMed] [Google Scholar]
Nakahigashi K, Takai Y, Shiwa Y, Wada M, Honma M, Yoshikawa H, Tomita M, Kanai A, Mori H. Effect of codon adaptation on codon-level and gene-level translation efficiency in vivo. BMC Genomics. 2014;15:1115. doi: 10.1186/1471-2164-15-1115. [DOI] [PMC free article] [PubMed] [Google Scholar]
Nawrocki EP, Burge SW, Bateman A, Daub J, Eberhardt RY, Eddy SR, Floden EW, Gardner PP, Jones TA, Tate J, Finn RD. Rfam 12.0: updates to the RNA families database. Nucleic Acids Research. 2015;43:D130–D137. doi: 10.1093/nar/gku1063. [DOI] [PMC free article] [PubMed] [Google Scholar]
Nawrocki EP, Eddy SR. Infernal 1.1: 100-fold faster RNA homology searches. Bioinformatics. 2013;29:2933–2935. doi: 10.1093/bioinformatics/btt509. [DOI] [PMC free article] [PubMed] [Google Scholar]
Pain A, Ott A, Amine H, Rochat T, Bouloc P, Gautheret D. An assessment of bacterial small RNA target prediction programs. RNA Biology. 2015;12:509–513. doi: 10.1080/15476286.2015.1020269. [DOI] [PMC free article] [PubMed] [Google Scholar]
Pelletier J, Sonenberg N. The involvement of mRNA secondary structure in protein synthesis. Biochemistry and Cell Biology. 1987;65:576–581. doi: 10.1139/o87-074. [DOI] [PubMed] [Google Scholar]
Plotkin JB, Kudla G. Synonymous but not the same: the causes and consequences of codon bias. Nature Reviews Genetics. 2011;12:32–42. doi: 10.1038/nrg2899. [DOI] [PMC free article] [PubMed] [Google Scholar]
Rozov A, Demeshkina N, Westhof E, Yusupov M, Yusupova G. Structural insights into the translational infidelity mechanism. Nature Communications. 2015;6:e13479. doi: 10.1038/ncomms8251. [DOI] [PMC free article] [PubMed] [Google Scholar]
Schmittgen TD, Livak KJ. Analyzing real-time PCR data by the comparative C(T) method. Nature Protocols. 2008;3:1101–1108. doi: 10.1038/nprot.2008.73. [DOI] [PubMed] [Google Scholar]
Sharma CM, Hoffmann S, Darfeuille F, Reignier J, Findeiss S, Sittka A, Chabas S, Reiche K, Hackermüller J, Reinhardt R, Stadler PF, Vogel J. The primary transcriptome of the major human pathogen Helicobacter pylori. Nature. 2010;464:250–255. doi: 10.1038/nature08756. [DOI] [PubMed] [Google Scholar]
Sharp PM, Li WH. The codon Adaptation Index--a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Research. 1987;15:1281–1295. doi: 10.1093/nar/15.3.1281. [DOI] [PMC free article] [PubMed] [Google Scholar]
Stark A, Brennecke J, Bushati N, Russell RB, Cohen SM. Animal MicroRNAs confer robustness to gene expression and have a significant impact on 3'UTR evolution. Cell. 2005;123:1133–1146. doi: 10.1016/j.cell.2005.11.023. [DOI] [PubMed] [Google Scholar]
Storz G, Vogel J, Wassarman KM. Regulation by small RNAs in bacteria: expanding frontiers. Molecular Cell. 2011;43:880–891. doi: 10.1016/j.molcel.2011.08.022. [DOI] [PMC free article] [PubMed] [Google Scholar]
Taniguchi Y, Choi PJ, Li GW, Chen H, Babu M, Hearn J, Emili A, Xie XS, Li G. Quantifying E. coli proteome and transcriptome with single-molecule sensitivity in single cells. Science. 2010;329:533–538. doi: 10.1126/science.1188308. [DOI] [PMC free article] [PubMed] [Google Scholar]
Tuller T, Zur H. Multiple roles of the coding sequence 5' end in gene expression regulation. Nucleic Acids Research. 2015;43:13–28. doi: 10.1093/nar/gku1313. [DOI] [PMC free article] [PubMed] [Google Scholar]
van Dongen S, Abreu-Goodger C, Enright AJ. Detecting microRNA binding and siRNA off-target effects from expression data. Nature Methods. 2008;5:1023–1025. doi: 10.1038/nmeth.1267. [DOI] [PMC free article] [PubMed] [Google Scholar]
Vivancos AP, Güell M, Dohm JC, Serrano L, Himmelbauer H. Strand-specific deep sequencing of the transcriptome. Genome Research. 2010;20:989–999. doi: 10.1101/gr.094318.109. [DOI] [PMC free article] [PubMed] [Google Scholar]
Vogel C, Marcotte EM. Insights into the regulation of protein abundance from proteomic and transcriptomic analyses. Nature Reviews Genetics. 2012;13:227–232. doi: 10.1038/nrg3185. [DOI] [PMC free article] [PubMed] [Google Scholar]
Waters LS, Storz G. Regulatory RNAs in bacteria. Cell. 2009;136:615–628. doi: 10.1016/j.cell.2009.01.043. [DOI] [PMC free article] [PubMed] [Google Scholar]
Woolstenhulme CJ, Guydosh NR, Green R, Buskirk AR. High-precision analysis of translational pausing by ribosome profiling in bacteria lacking EFP. Cell Reports. 2015;11:13–21. doi: 10.1016/j.celrep.2015.03.014. [DOI] [PMC free article] [PubMed] [Google Scholar]
Workman C, Krogh A. No evidence that mRNAs have lower folding free energies than random sequences with the same dinucleotide distribution. Nucleic Acids Research. 1999;27:4816–4822. doi: 10.1093/nar/27.24.4816. [DOI] [PMC free article] [PubMed] [Google Scholar]
Wu D, Jospin G, Eisen JA. Systematic identification of gene families for use as "markers" for phylogenetic and phylogeny-driven ecological studies of bacteria and archaea and their major subgroups. PLoS One. 2013;8:e13479. doi: 10.1371/journal.pone.0077033. [DOI] [PMC free article] [PubMed] [Google Scholar]

eLife. 2016 Sep 20;5:e13479. doi: 10.7554/eLife.13479.022

Decision letter

Editor: Karen Adelman¹

In the interests of transparency, eLife includes the editorial decision letter and accompanying author responses. A lightly edited version of the letter sent to the authors after peer review is shown, indicating the most substantive concerns; minor comments are not usually included.

Thank you for submitting your article "Avoidance of stochastic RNA interactions can be harnessed to control protein expression levels in bacteria and archaea" for consideration by eLife. Your article has been reviewed by three peer reviewers, one of whom is a member of our Board of Reviewing Editors, and the evaluation has been overseen by Naama Barkai as the Senior Editor. The reviewers have opted to remain anonymous.

The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.

Summary:

This manuscript from Umu et al. provides evidence that stochastic base pairing between highly abundant non-coding RNAs (ncRNAs) such as tRNAs, rRNAs etc. and sequences within the 5'end of messenger RNAs (mRNAs) can be deleterious to the translation of the mRNAs. They suggest that this base pairing causes some of the recognized disconnect between mRNA and protein levels, and that these stochastic interactions can have a bigger effect on protein production than either mRNA structure or codon bias. They propose a factor called 'avoidance', which refers to the potential of sequences within the first 21 nt of the mRNA to hybridize to ncRNAs, with lower potential for interaction yielding higher overall expression of the gene product.

This is an interesting premise, and if fully fleshed out, could provide new insights into mRNA sequence evolution (especially of the 5'-UTR) and an additional explanation for the imperfect correlation between mRNA and protein levels. Overall, the data indicating that interactions between ncRNAs and mRNAs are avoided and could repress translation are generally convincing. Careful controls are employed for the computational and experimental studies.

However, the reviewers found that the data were largely correlative at this stage, and lacking in development of specific examples of avoidance (or the lack thereof) affecting mRNA expression. The point was also raised that small non-coding RNA molecules that are not tightly bound by proteins would be better case studies than the SSU rRNA.

Thus, we invite a revision that addresses these issues specifically, as described in more detail below.

Essential revisions:

1) The authors stopped short of providing the most interesting and important result: showing that they could better explain the relationship between mRNA levels and protein levels upon consideration of avoidance. The authors use this as motivation for their studies, but don't bring their story full circle and show that integrating avoidance into the equation helps us understand the non-uniform relationship between mRNA and protein. They find that avoidance is correlated with higher protein expression in several data sets, but fail to dig deeper. Without this demonstration of utility, we are concerned about how broad the readership for this manuscript will be.

a) The authors should investigate deeper into where native mRNAs with particularly high or low avoidance scores fall on the spectrum of mRNA level vs. protein level.

b) They should attempt to define how much better proteome predictions could be if taking avoidance into account.

2) The rationale for including some of the well-conserved ncRNAs such as RNase P RNA and 6S RNA that are generally tightly bound by proteins and largely unavailable for RNA-RNA interactions is not clear. An example of an abundant non-coding species that was freely available for stochastic interactions would be stronger.

3) The authors should present the specifics for several mRNAs to illustrate the impact of the three parameters.

eLife. 2016 Sep 20;5:e13479. doi: 10.7554/eLife.13479.023

Author response

Essential revisions:

We have added results from using linear models combining avoidance, internal secondary structure, codon usage and messenger RNAs. The R² when combining all the parameters range from 0.7 (our data) to 0.2 (Goodman et al. data). By removing the avoidance measure from the linear model the R² values decrease between 19% and 56% from the R² values in the GFP datasets. In the mass-spec data the R2 values decrease between 3 and 0.2%. Compare this with when we remove internal secondary structure information the R² values decrease between 18% and -1% for GFP data, and between 1.5% and -0.02% for the mass-spec data. When we remove codon usage statistics the R² values decrease between -0.05% and -0.4% for GFP data, and between 9% and -0.06% for the mass-spec data. This is consistent with the avoidance model being at least as good an explanation for variation in protein expression as either internal secondary structure and codon usage statistics. The manuscript has been updated to reflect some of this new analysis. See Figure 2figure supplement 3 in the new manuscript.

a) The authors should investigate deeper into where native mRNAs with particularly high or low avoidance scores fall on the spectrum of mRNA level vs. protein level.

For one of the datasets we have investigated the distribution of protein-per-mRNA ratios (log[protein-per-mRNA]) (Laurent et al. 2010). We have taken the top and bottom 10 most abundant or scarce genes based upon this metric. Then for each of the codon usage, internal secondary structure and avoidance measures we have computed Z-scores for the 20 extreme genes (using means and standard deviations derived from the total dataset). We have found that avoidance is extremely high in the group of 10 abundant genes and is extremely low in the 10 scarce datasets. The shift in Z-scores relative to the null distribution is more modest in the other two measures. Figure 2figure supplement 4 has been added to the manuscript and the discussion has been updated.

b) They should attempt to define how much better proteome predictions could be if taking avoidance into account.

See the discussion and the additional figure in response to point 1, above.

We found that avoidance alone can explain more than 36 percent of variation in synthetic mRNA expression. Avoidance outperforms the other two global factors in synthetic mRNA expression.

However, it can only explain around 2 percent of variation in native mRNA expression, which still makes avoidance a comparable factor with codon usage and secondary structure.

It is plausible to accept that native mRNAs are already selected for their high avoidance, which we detected as a conserved avoidance signal in core mRNA genes. A similar observation is also true for secondary structure. It also marginally influences protein expression in synthetic mRNA expression, but does not have similar effect on native mRNA expression.

There are relatively few ncRNA genes that we can say are never bound by protein and are broadly conserved across the range of species that we have analysed. There is increasing evidence that evolutionary turnover of ncRNA genes is rapid (Lindgreen et al. 2014). Furthermore, during assembly and degradation of ribonucleoparticles, presumably most components of these abundant ncRNAs will be accessible at some time. The tRNAs may be the least bound, by proteins (other than a transient interactions with tRNA synthetases and the ribosome). Furthermore, even for highly structured and bound ncRNAs such as the SSU rRNA, we see strong signatures of avoidance. We have included the below text in the manuscript to hopefully make this justification clearer:

In order to ensure that our analysis is comparable across all bacteria and archaea we have focussed on just the most highly conserved ncRNA and protein-coding genes. Although, many of the ncRNAs are highly structures and are bound by RNA-binding proteins this is not the case during either synthesis and degradation of these products, furthermore, a fraction of the RNA components of these genes will be exposed. Therefore we expect these will form useful datasets for initial testing of our hypothesis.

We have also included an additional figure (Figure 3figure supplement 3) that shows the regions of T. thermophilus tRNAs, tmRNA, RNase P and SRP RNA that have the fewest interactions with mRNAs (i.e. are the most avoided).

3) The authors should present the specifics for several mRNAs to illustrate the impact of the three parameters.

The outlier analysis described in response to 1a) hopefully addresses this point.

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supplementary file 1. The levels of extrinsic and intrinsic avoidance for each publicly available bacterial and archaeal genome sequence.

DOI: http://dx.doi.org/10.7554/eLife.13479.017

elife-13479-supp1.xlsx^{(161.7KB, xlsx)}

DOI: 10.7554/eLife.13479.017

Supplementary file 2. Data for the extreme GFP transformations, including fluorescence levels and RT-qPCR analyses.

DOI: http://dx.doi.org/10.7554/eLife.13479.018

elife-13479-supp2.xlsx^{(7.8KB, xlsx)}

DOI: 10.7554/eLife.13479.018

Supplementary file 3. Data corresponding to the heatmap in Figure 2A.

DOI: http://dx.doi.org/10.7554/eLife.13479.019

elife-13479-supp3.xlsx^{(5.3KB, xlsx)}

DOI: 10.7554/eLife.13479.019

Supplementary file 4. Data corresponding to Figure 3.

DOI: http://dx.doi.org/10.7554/eLife.13479.020

elife-13479-supp4.xlsx^{(33.5KB, xlsx)}

DOI: 10.7554/eLife.13479.020

DOI: http://dx.doi.org/10.7554/eLife.13479.021

elife-13479-supp5.xlsx^{(6.6KB, xlsx)}

DOI: 10.7554/eLife.13479.021

[bib1] Andersson SG, Kurland CG. Codon preferences in free-living microorganisms. Microbiological Reviews. 1990;54:198–210. doi: 10.1128/mr.54.2.198-210.1990. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib2] Bartel DP, Chen CZ. Micromanagers of gene expression: the potentially widespread influence of metazoan microRNAs. Nature Reviews Genetics. 2004;5:396–400. doi: 10.1038/nrg1328. [DOI] [PubMed] [Google Scholar]

[bib3] Bhaya D, Davison M, Barrangou R. CRISPR-Cas systems in bacteria and archaea: versatile small RNAs for adaptive defense and regulation. Annual Review of Genetics. 2011;45:273–297. doi: 10.1146/annurev-genet-110410-132430. [DOI] [PubMed] [Google Scholar]

[bib4] Borg A, Ehrenberg M. Determinants of the rate of mRNA translocation in bacterial protein synthesis. Journal of Molecular Biology. 2015;427:1835–1847. doi: 10.1016/j.jmb.2014.10.027. [DOI] [PubMed] [Google Scholar]

[bib5] Boël G, Letso R, Neely H, Price WN, Wong K-H, Su M, Luff JD, Valecha M, Everett JK, Acton TB, Xiao R, Montelione GT, Aalberts DP, Hunt JF. Codon influence on protein expression in E. coli correlates with mRNA levels. Nature. 2016-21;529:358–363. doi: 10.1038/nature16509. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib6] Chamary JV, Hurst LD. Evidence for selection on synonymous mutations affecting stability of mRNA secondary structure in mammals. Genome Biology. 2005;6:e13479. doi: 10.1186/gb-2005-6-9-r75. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib7] Chen WH, van Noort V, Lluch-Senar M, Hennrich ML, Wodke JA, Yus E, Alibés A, Roma G, Mende DR, Pesavento C, Typas A, Gavin AC, Serrano L, Bork P. Integration of multi-omics data of a genome-reduced bacterium: Prevalence of post-transcriptional regulation and its correlation with protein abundances. Nucleic Acids Research. 2016;44:1192–1202. doi: 10.1093/nar/gkw004. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib8] Cock PJ, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A, Friedberg I, Hamelryck T, Kauff F, Wilczynski B, de Hoon MJ. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics. 2009;25:1422–1423. doi: 10.1093/bioinformatics/btp163. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib9] de Sousa Abreu R, Penalva LO, Marcotte EM, Vogel C. Global signatures of protein and mRNA expression levels. Molecular BioSystems. 2009;5:1512–1526. doi: 10.1039/b908315d. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib10] Deutscher MP. Degradation of RNA in bacteria: comparison of mRNA and stable RNA. Nucleic Acids Research. 2006;34:659–666. doi: 10.1093/nar/gkj472. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib11] Diwan GD, Agashe D. The frequency ofinternal shine-dalgarno-like motifs in prokaryotes. Genome Biology and Evolution. 2016;8:1722–1733. doi: 10.1093/gbe/evw107. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib12] Eddy SR. Accelerated Profile HMM Searches. PLoS Computational Biology. 2011;7:e13479. doi: 10.1371/journal.pcbi.1002195. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib13] Farh KK, Grimson A, Jan C, Lewis BP, Johnston WK, Lim LP, Burge CB, Bartel DP. The widespread impact of mammalian MicroRNAs on mRNA repression and evolution. Science. 2005;310:1817–1821. doi: 10.1126/science.1121158. [DOI] [PubMed] [Google Scholar]

[bib14] Fisher RA. On the interpretation of χ2 from contingency tables, and the calculation of P. Journal of the Royal Statistical Society. 1922;85:87–94. doi: 10.2307/2340521. [DOI] [Google Scholar]

[bib15] Gardner PP, Daub J, Tate J, Moore BL, Osuch IH, Griffiths-Jones S, Finn RD, Nawrocki EP, Kolbe DL, Eddy SR, Bateman A. Rfam: Wikipedia, clans and the "decimal" release. Nucleic Acids Research. 2011;39:D141–D145. doi: 10.1093/nar/gkq1129. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib16] Giannoukos G, Ciulla DM, Huang K, Haas BJ, Izard J, Levin JZ, Livny J, Earl AM, Gevers D, Ward DV, Nusbaum C, Birren BW, Gnirke A. Efficient and robust RNA-seq process for cultured bacteria and complex community transcriptomes. Genome Biology. 2012;13: R23.. doi: 10.1186/gb-2012-13-3-r23. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib17] Goodman DB, Church GM, Kosuri S. Causes and effects of N-terminal codon bias in bacterial genes. Science. 2013;342:475–479. doi: 10.1126/science.1241934. [DOI] [PubMed] [Google Scholar]

[bib18] Hoeppner MP, Gardner PP, Poole AM. Comparative analysis of RNA families reveals distinct repertoires for each domain of life. PLoS Computational Biology. 2012;8:e13479. doi: 10.1371/journal.pcbi.1002752. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib19] Ikemura T. Correlation between the abundance of Escherichia coli transfer RNAs and the occurrence of the respective codons in its protein genes: a proposal for a synonymous codon choice that is optimal for the E. coli translational system. Journal of Molecular Biology. 1981;151:389–409. doi: 10.1016/0022-2836(81)90003-6. [DOI] [PubMed] [Google Scholar]

[bib20] Ioannidis JP, Khoury MJ. Improving validation practices in "omics" research. Science. 2011;334:1230–1232. doi: 10.1126/science.1211811. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib21] Kudla G, Murray AW, Tollervey D, Plotkin JB. Coding-sequence determinants of gene expression in Escherichia coli. Science. 2009;324:255–258. doi: 10.1126/science.1170160. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib22] Kwon T, Huse HK, Vogel C, Whiteley M, Marcotte EM. Protein-to-mRNA ratios are conserved between Pseudomonas aeruginosa strains. Journal of Proteome Research. 2014;13:2370–2380. doi: 10.1021/pr4011684. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib23] Lao PJ, Forsdyke DR. Thermophilic bacteria strictly obey Szybalski's transcription direction rule and politely purine-load RNAs with both adenine and guanine. Genome Research. 2000;10:228–236. doi: 10.1101/gr.10.2.228. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib24] Laurent JM, Vogel C, Kwon T, Craig SA, Boutz DR, Huse HK, Nozue K, Walia H, Whiteley M, Ronald PC, Marcotte EM. Protein abundances are more conserved than mRNA abundances across diverse taxa. Proteomics. 2010;10:4209–4212. doi: 10.1002/pmic.201000327. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib25] Li GW, Oh E, Weissman JS. The anti-Shine-Dalgarno sequence drives translational pausing and codon choice in bacteria. Nature. 2012;484:538–541. doi: 10.1038/nature10965. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib26] Lindgreen S, Umu SU, Lai AS, Eldai H, Liu W, McGimpsey S, Wheeler NE, Biggs PJ, Thomson NR, Barquist L, Poole AM, Gardner PP. Robust identification of noncoding RNA from transcriptomes requires phylogenetically-informed sampling. PLoS Computational Biology. 2014;10:e13479. doi: 10.1371/journal.pcbi.1003907. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib27] Lorenz R, Bernhart SH, Höner Zu Siederdissen C, Tafer H, Flamm C, Stadler PF, Hofacker IL. ViennaRNA Package 2.0. Algorithms for Molecular Biology. 2011;6:26. doi: 10.1186/1748-7188-6-26. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib28] Lu P, Vogel C, Wang R, Yao X, Marcotte EM. Absolute protein expression profiling estimates the relative contributions of transcriptional and translational regulation. Nature Biotechnology. 2007;25:117–124. doi: 10.1038/nbt1270. [DOI] [PubMed] [Google Scholar]

[bib29] Maier T, Schmidt A, Güell M, Kühner S, Gavin AC, Aebersold R, Serrano L. Quantification of mRNA and protein and integration with protein turnover in a bacterium. Molecular Systems Biology. 2011;7:511. doi: 10.1038/msb.2011.38. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib30] Mückstein U, Tafer H, Hackermüller J, Bernhart SH, Stadler PF, Hofacker IL. Thermodynamics of RNA-RNA binding. Bioinformatics. 2006;22:1177–1182. doi: 10.1093/bioinformatics/btl024. [DOI] [PubMed] [Google Scholar]

[bib31] Nakahigashi K, Takai Y, Shiwa Y, Wada M, Honma M, Yoshikawa H, Tomita M, Kanai A, Mori H. Effect of codon adaptation on codon-level and gene-level translation efficiency in vivo. BMC Genomics. 2014;15:1115. doi: 10.1186/1471-2164-15-1115. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib32] Nawrocki EP, Burge SW, Bateman A, Daub J, Eberhardt RY, Eddy SR, Floden EW, Gardner PP, Jones TA, Tate J, Finn RD. Rfam 12.0: updates to the RNA families database. Nucleic Acids Research. 2015;43:D130–D137. doi: 10.1093/nar/gku1063. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib33] Nawrocki EP, Eddy SR. Infernal 1.1: 100-fold faster RNA homology searches. Bioinformatics. 2013;29:2933–2935. doi: 10.1093/bioinformatics/btt509. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib34] Pain A, Ott A, Amine H, Rochat T, Bouloc P, Gautheret D. An assessment of bacterial small RNA target prediction programs. RNA Biology. 2015;12:509–513. doi: 10.1080/15476286.2015.1020269. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib35] Pelletier J, Sonenberg N. The involvement of mRNA secondary structure in protein synthesis. Biochemistry and Cell Biology. 1987;65:576–581. doi: 10.1139/o87-074. [DOI] [PubMed] [Google Scholar]

[bib36] Plotkin JB, Kudla G. Synonymous but not the same: the causes and consequences of codon bias. Nature Reviews Genetics. 2011;12:32–42. doi: 10.1038/nrg2899. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib37] Rozov A, Demeshkina N, Westhof E, Yusupov M, Yusupova G. Structural insights into the translational infidelity mechanism. Nature Communications. 2015;6:e13479. doi: 10.1038/ncomms8251. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib38] Schmittgen TD, Livak KJ. Analyzing real-time PCR data by the comparative C(T) method. Nature Protocols. 2008;3:1101–1108. doi: 10.1038/nprot.2008.73. [DOI] [PubMed] [Google Scholar]

[bib39] Sharma CM, Hoffmann S, Darfeuille F, Reignier J, Findeiss S, Sittka A, Chabas S, Reiche K, Hackermüller J, Reinhardt R, Stadler PF, Vogel J. The primary transcriptome of the major human pathogen Helicobacter pylori. Nature. 2010;464:250–255. doi: 10.1038/nature08756. [DOI] [PubMed] [Google Scholar]

[bib40] Sharp PM, Li WH. The codon Adaptation Index--a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Research. 1987;15:1281–1295. doi: 10.1093/nar/15.3.1281. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib41] Stark A, Brennecke J, Bushati N, Russell RB, Cohen SM. Animal MicroRNAs confer robustness to gene expression and have a significant impact on 3'UTR evolution. Cell. 2005;123:1133–1146. doi: 10.1016/j.cell.2005.11.023. [DOI] [PubMed] [Google Scholar]

[bib42] Storz G, Vogel J, Wassarman KM. Regulation by small RNAs in bacteria: expanding frontiers. Molecular Cell. 2011;43:880–891. doi: 10.1016/j.molcel.2011.08.022. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib43] Taniguchi Y, Choi PJ, Li GW, Chen H, Babu M, Hearn J, Emili A, Xie XS, Li G. Quantifying E. coli proteome and transcriptome with single-molecule sensitivity in single cells. Science. 2010;329:533–538. doi: 10.1126/science.1188308. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib44] Tuller T, Zur H. Multiple roles of the coding sequence 5' end in gene expression regulation. Nucleic Acids Research. 2015;43:13–28. doi: 10.1093/nar/gku1313. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib45] van Dongen S, Abreu-Goodger C, Enright AJ. Detecting microRNA binding and siRNA off-target effects from expression data. Nature Methods. 2008;5:1023–1025. doi: 10.1038/nmeth.1267. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib46] Vivancos AP, Güell M, Dohm JC, Serrano L, Himmelbauer H. Strand-specific deep sequencing of the transcriptome. Genome Research. 2010;20:989–999. doi: 10.1101/gr.094318.109. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib47] Vogel C, Marcotte EM. Insights into the regulation of protein abundance from proteomic and transcriptomic analyses. Nature Reviews Genetics. 2012;13:227–232. doi: 10.1038/nrg3185. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib48] Waters LS, Storz G. Regulatory RNAs in bacteria. Cell. 2009;136:615–628. doi: 10.1016/j.cell.2009.01.043. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib49] Woolstenhulme CJ, Guydosh NR, Green R, Buskirk AR. High-precision analysis of translational pausing by ribosome profiling in bacteria lacking EFP. Cell Reports. 2015;11:13–21. doi: 10.1016/j.celrep.2015.03.014. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib50] Workman C, Krogh A. No evidence that mRNAs have lower folding free energies than random sequences with the same dinucleotide distribution. Nucleic Acids Research. 1999;27:4816–4822. doi: 10.1093/nar/27.24.4816. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib51] Wu D, Jospin G, Eisen JA. Systematic identification of gene families for use as "markers" for phylogenetic and phylogeny-driven ecological studies of bacteria and archaea and their major subgroups. PLoS One. 2013;8:e13479. doi: 10.1371/journal.pone.0077033. [DOI] [PMC free article] [PubMed] [Google Scholar]

PERMALINK

Avoidance of stochastic RNA interactions can be harnessed to control protein expression levels in bacteria and archaea

Sinan Uğur Umu

Anthony M Poole

Renwick CJ Dobson

Paul P Gardner

Roles

Abstract

eLife digest

Introduction

Results and discussion

Figure 1. mRNA:ncRNA avoidance is a conserved feature of bacteria and archaea.

Figure 1—figure supplement 1. Applying different energy models of intramolecular and intermolecular interactions for native sequences and various negative controls.

Figure 1—figure supplement 2. The top and the bottom panels show bacterial phyla and archaeal phyla respectively.

Figure 2. mRNA attributes have different impacts on protein abundance.

Figure 2—figure supplement 1. GFP mRNA constructs have an unbiased design that produces different protein expressions.

Figure 2—figure supplement 2. The scatter-plots of protein abundances (as log-fluorescences) summarize the effect of general factors for extreme GFP and previously published GFP datasets.

Figure 2—figure supplement 3. In the lower four panels we show the R2 values for linear regression models between measures of each of avoidance, internal secondary structure, codon usage and mRNA levels for each of seven independent protein and mRNA expression datasets Supplementary file 5).

Figure 2—figure supplement 4. An outlier analysis of E. coli protein-per-mRNA ratios and avoidance, codon usage and internal mRNA secondary structure statistics.

Figure 2—figure supplement 5. Overview of mRNA:ncRNA avoidance analysis and results.

Figure 3. The most under-represented mRNA:rRNA interactions correspond to exterior regions of the ribosome.

Figure 3—figure supplement 1. Avoidance pattern and its correlation with protein expression vary on mRNAs.

Figure 3—figure supplement 2. Comparison of different regions for evolutionary conservation analyses.

Figure 3—figure supplement 3. The most avoided regions of selected T. thermophilus non-coding RNAs.

Materials and methods

Evolutionary conservation

Data sources for bacterial genomes

Data sources for archaeal genomes

Test of an (extrinsic) avoidance model

Test of an intrinsic avoidance model

Sliding window analysis to detect regions of significance for avoidance on SSU ribosomal RNA

Sliding window analysis to detect regions of significance for avoidance on mRNAs

Proteomics/Transcriptomics and GFP expression

Data sources and statistics for mRNA, protein abundance and GFP expression

mRNA design

Green fluorescence protein (GFP) mRNA design

Figure 4. The median expression of core ncRNA genes (n = 325 data points) in prokaryotic genomes is nearly two orders of magnitude greater than core mRNAs (n = 8086 data points) which proves that ncRNAs constitute most of the cellular RNAs.

Extreme GFP transformations, determining fluorescence levels and RT-qPCR analyses

Statistical analyses of extreme GFP data

Acknowledgements

Funding Statement

Funding Information

Additional information

Competing interests

Author contributions

Additional files

References

Decision letter

Roles

Author response

Associated Data

Supplementary Materials

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases

Figure 2—figure supplement 3. In the lower four panels we show the R² values for linear regression models between measures of each of avoidance, internal secondary structure, codon usage and mRNA levels for each of seven independent protein and mRNA expression datasets Supplementary file 5).