Abstract
In order to grow and replicate, living cells must express a diverse array of proteins, but the process by which proteins are made includes a great deal of inherent randomness. Understanding this randomness—whether it arises from the discrete stochastic nature of chemical reactivity (“intrinsic” noise), or from cell-to-cell variability in the concentrations of molecules involved in gene expression or the timings of important cell-cycle events like DNA replication or cell division (“extrinsic” noise)—remains a challenge. In this article we analyze a model of gene expression that accounts for several extrinsic sources of noise, including those associated with chromosomal replication, cell division, and variability in the numbers of RNA polymerase, ribonuclease E, and ribosomes. We then attempt to fit our model to a large proteomics and transcriptomics data set, and find that only through the introduction of a few key correlations among the extrinsic noise sources can we accurately recapitulate the experimental data. These include significant correlations between the rate of mRNA degradation (mediated by ribonuclease E) and the rates of both transcription (RNA polymerase) and translation (ribosomes), and strikingly, an anticorrelation between the transcription and translation rates themselves.
I. INTRODUCTION
Over the last 15 years, experiments have repeatedly shown that seemingly identical cells (e.g. cells belonging to a clonal population grown in a well-stirred environment) can differ significantly in their gene expression states [1–3]. How stochastic gene expression (SGE) impacts the fitness of a cell remains a fertile area of research, and as a result stochastic modeling has grown into a cornerstone of biological physics. SGE can impart some advantages; it has been shown, for example, that in E. coli SGE gives rise to a diverse array of behavioral phenotypes [4–9], and can enable populations to quickly adapt to environmental niches [10, 11]. Nevertheless, SGE has also been shown to decrease overall growth rates and natural selection efficacy [12]. Various models of the different ways in which gene expression noise arises and how cells have evolved to control it (either amplifying or attenuating it) have been explored [13–17].
With the notable exception of a 2002 article by Swain, Elowitz and Siggia [18], little attention has been paid until recently to the effect that DNA replication has on gene expression variability [19]. During replication, as the DNA polymerases progress along the chromosome, every gene is systematically copied. Because of this, depending on their stage in the cell cycle, cells can have a different number of copies of a given gene, and this cell-to-cell variation in copy number can impact gene expression stochasticity in important ways. Jones et al. [20] showed that the noise associated with gene replication represents a major component of the total mRNA variability. Building on this work, Peterson et al. [21] showed that the messenger degradation rate, which defines the timescale at which the mean messenger count “relaxes” from its low state before to its high state after gene replication, plays an critical role in accurately describing messenger noise. Earlier analytical models of gene expression either neglect DNA replication entirely, or fail to account for the mRNA relaxation by either tacitly ignoring it or significantly overestimating the messenger degradation rate (which in turn effectively ignores the relaxation).
In this article we investigate several extrinsic sources of protein expression noise (defined as Var[p]/E[p]2). We begin by deriving expressions for the protein mean and variance assuming a simple constitutive model of gene expression that explicitly accounts for gene replication, and show that these expressions agree with simulations that exactly sample the chemical master equation (CME) for the modeled system. We then extend our considerations to account for other extrinsic sources of noise, including variability in transcription, translation, and messenger degradation rates, as well as variability in the timing of gene replication and the cell cycle duration. We find that the contribution of gene replication-associated noise to the total protein noise is significant, by itself accounting for roughly as much noise as any other extrinsic source. More importantly, we find that measurements of mRNA and protein expression in E. coli (specifically the famed Taniguchi et al. data set [2]) preclude versions of our model in which the extrinsic noise sources are assumed to act independently. In such cases, the predicted protein noise is far greater than that measured, especially among highly expressed genes. Only through the inclusion of correlations among the extrinsic noise sources is our model able to accurately describe the experimental data. We sample the space of possible correlations and find that the sets that best recover the experimentally measured protein statistics tend to include significant correlations between the mRNA degradation rate (kd) and both the transcription (kt) and translation (kr) rates, as well as anticorrelations between the transcription and translation rates themselves—a finding corroborated by an recent investigation of the correlations observed between protein and mRNA expression in E. coli [22]. We use our model to estimate transcription, translation, and messenger degradation rates for 585 E. coli genes, and show that the use of an earlier model of gene expression that does not account for gene replication and the other extrinsic noise sources leads to median relative errors of ∼ 23%, ∼ 21% and ∼ 36% in the predicted transcription, translation, and mRNA degradation rates. Finally we show that the our model tends to predict larger mRNA copy numbers than appear in the Taniguchi data set, which we attribute to a widely-used but likely underestimated literature value for the total mRNA content of E. coli.
II. MODEL
We begin by considering the simplest model of constitutive gene expression (see Figure 1). We assume mRNA (denoted m) is transcribed at rate kt from a gene (denoted D), and that the mRNA can either degrade at rate kd or be translated at rate kr to form a protein (denoted p):
(1) |
Importantly, in the above equation we have expressly noted the time-dependence of the gene copy number. For our purposes, D(t) can either take the value 1 for t less than the gene replication time tr, or 2 after the gene has been copied. This system can be described by a chemical master equation (CME, see Equation A2), which describes the time evolution of the probability that a cell is in a given chemical state. In this description, the cell can transition between states in discrete jumps; it may, for example, go from having m messengers to m − 1 as the result of an mRNA degradation event, or p proteins to p + 1 through a translation event. From the CME we can derive ODEs for the instantaneous mRNA and protein means, and variances, and the instantaneous mRNA and protein covariance (see Appendix A for details):
(2) |
Worth noting is that this model does not explicitly assume a protein degradation rate. Because proteins generally degrade on time-scales longer than the cell cycle, we expect that the main avenue by which protein concentrations are attenuated is through dilution as the cells grow and divide. We therefore posit that at the cell division time, tD, the existing proteins and mRNA are distributed to daughter cells with equal probabilities according to the binomial distribution. This assumption yields the constraints (see Equations A38 and A39):
(3) |
For constitutively expressed genes, the solutions for the messenger mean and variance are known [21], allowing for the simultaneous solution of Equations 2 and 3 (see Equations A12 and A41). Normalizing by cell size and time-averaging over the cell cycle (accounting for the fact that log-phase cells are known to have exponentially distributed ages [23, 24], and grow exponentially during the cell cycle [25]) then yields closed form solutions for the messenger and protein means and variances, E[m], Var[m], E[p], and Var[p], that depend on kt, kr, kd, tr, and tD. The expressions are cumbersome and will not be reproduced here (although E[m] and E[p] appear in Equation A48), but they can easily be computed using Mathematica [26, 27]. We note that all five model parameters should be considered stochastic variables—cells can, after all, have different numbers of RNA polymerases, ribosomes, or ribonucleases, as examples, which can affect their respective transcription, translation, or mRNA degradation rates. Accounting for these types of extrinsic variability is accomplished in Section III.
In order to get a feel for how DNA replication impacts gene expression variability, we can consider an idealized “median gene”—that is, a gene with median values for its messenger and protein copy numbers (approximately 0.064 and 18.2 per cell, respectively [2]), a median mRNA half life (approximately 2.4 minutes [2]), and a gene loci situated half-way between the origin and terminus of replication. We note that the cells used in the Taniguchi study had doubling times in the vicinity of 120 minutes (although some strain-to-strain variability, ranging between approximately 110 and 150 minutes, have been reported [4, 28]). E. coli with similar doubling times have recently been measured to have a B-period (the portion of the cell cycle prior to replication initiation) of around 42.2 minutes, and a C-period (the portion of the cell cycle during which the chromosome is being replicated) around 42.4 minutes [29], although other studies have reported slightly shorter and longer B- and C-periods, respectively [30, 31]. These values lead to a median gene replication time of tr = 42.2 + 0.5 × 42.4 = 63.4 minutes). We can then solve for estimates of the transcription and translation rates for our median gene (kt = 0.014 and kr = 1.6) and compute the protein copy number variance we should expect it to have. The result is a value of Var[p] = 115.3, corresponding to a noise level of Var[p]/E[p]2 = 0.35. Fixing all other parameters and scanning over gene loci (denoted χ, the fraction of the gene’s position along the chromosome measured from origin to terminus, which we assume affects the gene replication time as tr = 42.2+χ×42.4) shows that the noise level of our median gene can vary between 0.31 and 0.39—a relative difference of as much as 20%. Explicit stochastic simulations (using the stochastic simulation algorithm (SSA) of Gillespie [32, 33]) show outstanding agreement with these results (see Figure 2).
Despite the agreement between our simulations and analysis, the noise level we have computed, 0.35, should make us somewhat wary—we have not included any extrinsic sources of gene expression noise other than DNA replication and already our model appears to account for more than the entirety of the noise-floor observed in the Taniguchi study (approximately 0.09 for proteins expressed at levels above 10 [2]). If we do account for the other extrinsic noise sources, will our model be able to accurately describe the protein data?
III. ACCOUNTING FOR EXTRINSIC NOISE SOURCES
We can extend our model to include extrinsic sources of noise, such as variability in RNA polymerase (RNAP), ribosome, or ribonuclease E (Rne) copy numbers (which can affect the transcription, translation, and messenger degradation rates, respectively), or variability in the cell cycle duration, tD, or the timing of gene replication, tr. In each case, the effect of randomness in a given parameter can be estimated by Taylor expanding about the mean parameter value (see Equation B1).
We can (at least roughly) estimate the variance in each parameter in our model. By noting that the rates of transcription, translation, and mRNA degradation are proportional to the concentrations of RNAP, ribosomes, and Rne, and that these macromolecules tend to be highly expressed (and therefore likely to have noise levels of around 0.1) we can estimate Var[kt], Var[kr], and Var[kd] as , , and , respectively (where we now understand kt, kr, and kd to represent the mean transcription, translation, and mRNA degradation rates). Variability in the cell cycle duration is estimated to be around 10% [29], and so we might expect Var[tD] ≈ 122 min2 (assuming a 120 minute cell doubling time). Finally, using published values of the variability in the B- and C-periods of the cell cycle (52% [29], and 16% [30], respectively), we can estimate the variance in a gene’s replication time as Var[tr] ≈ (0.52 × 42.2)2 + χ2(0.16 × 42.4)2 min2.
It is fairly common in models of this type to assume that the extrinsic noise sources act independently (i.e. the matrix ρ that describes the correlations among the extrinsic noise sources in Equation B1 is the identity matrix, 𝟙). Under this assumption, Figure 3A shows the total protein noise broken down into contributions from each of its contributing sources for the same “median” gene modeled previously. For the sake of comparison, we have distinguished the extrinsic noise associated with gene replication (denoted “DNA rep”) from the intrinsic noise associated with the biochemical network (denoted Γ) by subtracting from our expressions the noise predicted by a model that does not include gene duplication [2, 5, 34]. We find that gene replication contributes a comparable amount of gene expression noise (~0.1) as variability in any of kt, kr, or kd, while variability in the timing of DNA replication and cell division, conversely, contribute very little noise (~0.02). Importantly, the total noise we find, 0.69, is significantly larger than the noise measured by Taniguchi et al. [2] for the majority of proteins with mean expression levels around 18. We can forge ahead and try to fit every point in the Taniguchi data set (see Figure 4A, and Appendix C for details on the fitting procedure) but this only confirms our fears—when (independent) extrinsic noise sources are accounted for, our model overestimates protein expression variability, and simply can not describe most of the data.
There have been a number of different mechanisms proposed by which gene-expression noise may be attenuated, including negative feedback, near-saturated signaling cascades, and forms of post-transcriptional regulation [13, 16, 17]. While undoubtedly some fraction of the genes in the Taniguchi data set are controlled through these mechanisms, the problem we face is that significant noise attenuation is required for our model to fit most of the data, and so we wish to find an explanation that applies to most—if not all—E. coli proteins. One possibility is that extrinsic noise sources should not in general be assumed to be independent. Taniguchi et al. found that the fluctuations of highly-expressed proteins (both RNAP and Rne are expressed in thousands per cell) can have correlation coefficients of as much as 0.66 [2]. Similarly, the timing of DNA replication has long been believed to be correlated with the cell cycle duration [35], and a recent study found a correlation coefficient of as high as 0.79 between the B-period and the doubling time of E. coli [30]. We can investigate the effect that extrinsic noise correlations have on our model’s ability to match the Taniguchi data by simply including the cross terms in our Taylor expansions that depend on the covariance of the model parameters.
IV. FINDING EXTRINSIC NOISE CORRELATIONS THAT FIT EXPERIMENTAL PROTEIN AND MRNA STATISTICS
Because our model overestimates protein noise when extrinsic sources are treated independently, finding correlations coefficients that lead to noise attenuation is an important part of fitting our model to the Taniguchi data. In general, the noise will be attenuated by the negative cross terms in our Taylor expansion of Var[p] (Equation B1). These arise when either: 1) the correlation coefficient between two extrinsic noise sources is negative and the partial derivatives of Var[p] with respect to these sources have the same sign; or 2) the correlation between the sources is positive and the partial derivatives with respect to them have opposite signs. An example of the former might be if the transcription and translation rates were anticorrelated (since the derivatives with respect to kt and kr are both positive), while an example of the latter might be if the transcription and mRNA degradation rates were positively correlated (because the derivative with respect to kd is negative). Importantly, finding noise-attenuating correlations can not be done arbitrarily. Any matrix ρ that describes the possible correlations among kt, kr, kd, tr, and tD must be both positive definite (which bars cases in which, for example, noise sources A and B are strongly correlated, and A and C are strongly correlated, but B and C are strongly anti-correlated) and have ones on diagonal. Such a matrix can be constructed as ρ = LLT, where L is a lower triangular matrix with diagonal values greater than 0 and whose squared row elements sum to 1 (i.e. for each row i).
Although it may be tempting to try to search for correlation matrix that minimizes the mean squared fitting error (denoted 〈Δ(ρ)〉), nonlinear fits of this type—especially those involving large parameter spaces—are notoriously difficult [36]. Moreover, even if an optimal ρ could be found, without knowledge of the shape of 〈Δ(ρ)〉, it’s difficult to say with confidence that other—possibly very different—correlation matrices could not yield fitting errors of comparable size. Here we take a more circumspect approach, opting for questions like “what correlation coefficients are likely to occur in matrices consistent with the gene expression data?” To that end, we constructed a set of 50,000 random correlation matrices with approximately uniformly-distributed off-diagonal elements (see Appendix D for details). These matrices’ associated mean squared fitting errors ranged between 5.76 and 1125.90. Focusing on the top 0.5 percentile (the set of 250 matrices with the lowest associated errors, ranging up to 8.78, and denoted {ρ*}), a number of clear trends emerged (see Figure 5A).
The median values of the correlation coefficients in our best-performing matrices, with their associated median absolute deviations (MAD), are represented in Equation 4. These matrices tend to include significant positive correlations between the mRNA degradation rates and both the transcription and translation rates, as well as significant anticorrelations between the transcription and translation rates themselves. While some matrices defy one or another of these general trends (approximately 23% of {ρ*} include negative correlations between kd and kt, 16% include negative correlations between kd and kr, and 28% include positive correlations between kt and kr), none defy two or more of them simultaneously.
Based on these results, we can say the true extrinsic noise correlation matrix—whatever it may be—likely includes correlations between the mRNA degradation rate and the transcription and translation rates. This is in keeping with what is known about highly-expressed cellular components. Rne, RNAP, and ribosomes all occur in large concentrations within the cell, and as such some correlation among their numbers should naturally arise [2]. Indeed, one can easily imagine how cells with relatively high transcription rates (due to high copy numbers of RNAP) or high translation rates (high copy numbers of ribosomes) would express high numbers of Rne, and in turn have relatively high mRNA degradation rates. These two correlations are extremely important for the overall fitting of the model to the data, and represent the largest two sources of noise attenuation among our {ρ*} matrices (see Figure 3B).
The tendency of our best-performing matrices to include anticorrelations between the transcription and translation rates is less intuitive, but considerably more interesting. Naively one would expect that a cell with greater numbers of RNAP would transcribe more ribosomes, and similarly, a cell with greater numbers of ribosomes would translate more RNAP, together giving rise to a positive correlation between the cell’s transcription and translation rates; but the correlation matrices that elicit the best fits to the data actually predict the opposite. A skeptic might attribute these results to the necessity of our matrices to include noise-attenuating terms (ρ(kt, kr) represents the third-largest overall source of noise attenuation, see Figure 3B), but there are compelling reasons to believe that this anticorrelation might be real. It has been shown that translation of the rpoB. These three correlation coefficients tend to compensate for each other; matrices with higher transcription-translation correlations, for example, also have correspondingly higher mRNA degradation-transcription and mRNA degradation-translation correlations (see Figure 5B). Other statistical enhancements are somewhat less pronounced. The transcription rate tends to be positively correlated with the timing of gene replication, and both the transcription and translation rates tend to be negatively correlated with the timing of cell division. Finally, we note that the distributions of ρ(kr, tr), ρ(kd, tr), ρ(kd, tD), and ρ(tr, tD) show only weak biases, and are relatively widely dispersed.
(4) |
mRNA (encoding the β-subunit of RNAP) is inhibited by the 50S ribosomal protein L1 [37–39]. This means that cells with high ribosomal protein copy numbers should exhibit low RNAP translation rates, and in turn suppressed transcription rates. Given the current context, it is possible that this regulatory mechanism may have evolved in order to suppress overall protein noise. Moreover, an elegant recent article by Hilfinger, Norman, and Paulsson [22] analyzed the space of all possible gene-expression models and found that only models in which the transcription and translation rates were anticorrelated could give rise to the negligible mRNA-protein correlation coefficients seen experimentally [2, 22].
The enhancements in anticorrelations between the timing of cell division and both the transcription and translation rates observed among our {ρ*} matrices are consistent with the current leading models of bacterial cell size control and division. Although it remains an active area of research, a number of theories have been posited to understand how cell division timing is regulated. These include “accumulation” models involving the buildup of a critical number of initiator molecules (such as FtsZ [40]) before division is triggered, “adder” models in which cells attempt to add a fixed volume before dividing, mixtures thereof (including the particularly compelling “multiple origins accumulation” model [24]), as well as the earlier “sizer” models (in which cells divide at a critical size) and “timer” models (wherein cells attempt to maintain fixed cell-cycle intervals). In our context, cells with high transcription or translation rates (or both) should be expected to grow faster and accumulate greater numbers of initiator molecules at earlier times. As a result—at least according to the accumulation, adder, and sizer models—these cells should divide sooner and exhibit the types of kt–tD and kr–tD anticorrelations seen in our data.
In contrast, the bias in {ρ*} toward positive correlations between the transcription rate and the timing of gene replication is considerably stronger than was expected based on the biochemical literature. Its known, for example, that accumulation of DnaA to the origin of replication plays an integral role in replication initiation. One might surmise, then, that cells with relatively high transcription rates would produce DnaA at correspondingly faster rates, leading to earlier replication times (and small or negative correlations). Similarly, its been shown that high transcriptional activity also gives rise to net negative chromosomal supercoiling, especially near the origin of replication where several highly-expressed rRNA genes reside. This supercoiling should facilitate DNA melting, again leading to earlier replication times [41]. In light of these considerations, we anticipated lower ρ(kt, tr) values than were in fact observed among our best-performing matrices.
Finally, we note that although positive correlations between the timing of DNA replication and cell division have been measured [30], no enhancement among positive values was observed in {ρ*}. We attribute this to the fact that variability in neither tr nor td contribute significantly to the overall protein noise, and as a result, correlations among them contribute correspondingly small amounts (see Figure 3). This means that the matrices that are most consistent with the experimental data—those that enable the greatest noise attenuation—show little bias in their tr–tD correlation coefficients.
V. FAILURE TO ACCOUNT FOR EXTRINSIC NOISE LEADS TO UNDERESTIMATION OF THE TRANSCRIPTION AND MRNA DEGRADATION RATES
We can compare the transcription, translation, and mRNA degradation rates fit using the theory developed here (using our set of best-performing correlation matrices) with those fit using the gamma distribution [2, 5, 34]. As before, we simultaneously fit kt, kr, and kd to the measured mRNA means, protein means and variances, and mRNA degradation rates from the Taniguchi data set using the expressions:
(5) |
where we have substituted ln(2)/tD for the dilution rate that would normally appear. We find that for the majority of genes in the data set, the transcription and mRNA degradation rates extracted using our model are significantly higher than those extracted using the gamma distribution (with median fold-changes, calculated over all genes and {ρ*} matrices, of approximately 1.23 for kt and 1.36 for kd, see Figure 6A & C). For many genes the effect can be dramatic, resulting in order of magnitude or more differences in the predicted kt and kd rates. In contrast, the translation rates predicted by our model tended to be lower (approximately 0.79-fold) than those predicted by the gamma distribution (see Figure 6B). These observations highlight the necessity of a careful accounting of extrinsic noise sources when fitting rates using gene expression variability data.
VI. THE TANIGUCHI DATASET APPEARS TO UNDERESTIMATE MRNA COPY NUMBERS
We compared the mean mRNA copy numbers extracted by fitting our model (again using our best-performing correlation matrices) with those reported in the Taniguchi data set. For most (over 75%) of the genes considered, our fit median E[m]ext. nse. was larger than the measured value (by a median fold-change, over all genes and {ρ*} matrices, of approximately 1.68, see Figure 7). As has been fairly common with quantifying copy numbers using RNA-seq, Taniguchi et al. scaled their relative measurements such that the total mRNA per cell was 1, 350, a value that derives from [42] and is based in part on total mRNA mass. More recent studies employ “spiked” samples with additional calibration mRNA added in known quantities prior to RNA-seq which serve to more directly measure the concentrations of the cellular transcripts. These studies have yielded estimates of the total mRNA content of an E. coli cell in glucose minimal medium to be approximately 2, 400 transcripts [43, 44]. Had the Taniguchi mRNA data been normalized to this value, it would have increased each mRNA count by 1.78-fold, and brought their measurements and our fits into very close agreement.
VII. CONCLUSIONS
Building on prior work by us and other authors [18, 20, 21], we have derived expressions for mRNA and protein statistics assuming a simple constitutive model of gene expression that accounts for chromosome replication. We are not the first to consider this effect [19], but to our knowledge we are the first to carefully understand it in the context of other sources of extrinsic noise, and more importantly, critically compare our model with experimental results. While we did find that the noise contribution associated with gene replication was of comparable size to those associated with variability in RNAP, Rne, and ribosome copy numbers, it turned out that this was only part of the story. As is so often the case, much more interesting results emerged when our model failed to match experimental data. Under the assumption of independent extrinsic variability (a fairly routine approximation in models of this kind) we vastly overestimated the protein noise. This in turn led to an investigation of how the extrinsic noise sources might be correlated, and ultimately to several important results. These included 1) mRNA degradation rates likely correlate with both the transcription and translation rates, perhaps through the natural correlations that emerge among highly-expressed cellular components like Rne, RNAP, and ribosomes; 2) transcription and translation rates in E. coli likely anticorrelate, possibly through the suppression of rpoB translation by the large ribosomal protein L1, although other explanations have been posited [22]; 3) accounting for extrinsic noise when extracting kinetic parameters from gene expression data consistently (and often significantly) impacts the results; and 4) the total mRNA content of E. coli appears to be greater than previously assumed literature values estimate.
Ultimately, the determination of the true extrinsic noise correlation coefficients must be an empirical exercise. As such, we note that some relatively straightforward experiments can be conducted to directly test our predictions. For example, researchers have already counted RNAP and ribosome copy numbers individually [45, 46]; by measuring both simultaneously in a two-color experiment, the anticorrelation we predict between the transcription and translation rates could be observed.
Acknowledgments
This work was supported by the National Institutes of Health Center for Macromolecular Modeling and Bioinformatics at the Beckman Institute, UIUC grant number 9 P41 GM104601-23, and by the National Science Foundation grant numbers MCB 12-44570, (CPLC) PHY-1430124, and (iPOLS) PHY-1505008.
Appendix A: Derivation of the protein copy number Mean and Variance
Here we derive expressions for the protein copy number mean and variance. Due to the potentially broad applicability of our results, including to researchers outside the traditional physics community, we have made a concerted effort not to “skip steps.” As a result, this derivation likely includes details that may seem obvious to the more seasoned reader.
We consider the system:
(A1) |
where D(t) is the time-dependent gene copy number, either one before the gene replication time, tr, or two after it. Assuming that after replication both copies can be transcribed independently and with equal rates, we can write the master equation for this system as:
(A2) |
where:
(A3) |
From this we can derive differential equation for the mean and variance of the mRNA count (see [21]), and the protein count. We consider the mean protein count, first:
(A4) |
We can insert the RHS of our master equation (Equation A2) for and evaluate term-by-term. The first term is:
(A5) |
where we have used the fact that cells can not have negative mRNA (or protein) copy numbers, and so the probability of being in a state with m = −1 is 0. The second term gives:
(A6) |
The third term gives:
(A7) |
The fourth term gives:
(A8) |
The fifth term gives:
(A9) |
where represents the time-dependent mean mRNA count. Finally, the sixth term gives:
(A10) |
Now, simply pulling this all together with appropriate signs leaves us with the expression:
(A11) |
Now, inserting Equation 2 from [21] into Equation A11 and requiring yields the solution:
(A12) |
where
(A13) |
We can now begin to consider the differential equation for the variance of the protein count:
(A14) |
As before, we can insert the RHS of Equation A2 into Equation A14 and evaluate it term by term. The first term gives:
(A15) |
The second term gives:
(A16) |
The third term gives:
(A17) |
The fourth term gives:
(A18) |
The fifth term gives:
(A19) |
And finally the sixth term gives:
(A20) |
The final term on the RHS of Equation A14 can be evaluated by inserting Equation A11:
(A21) |
Now, putting these all together (with appropriate signs) yields:
(A22) |
Evidently we need an equation for the mRNA and protein covariance. This can be easily written down:
(A23) |
Again, we insert the RHS of Equation A2, and evaluate term by term. The first term gives:
(A24) |
The second term gives:
(A25) |
The third term gives:
(A26) |
The fourth term gives:
(A27) |
The fifth term gives
(A28) |
And finally the sixth term gives
(A29) |
The final two terms on the RHS of Equation A23 give:
(A30) |
and:
(A31) |
so, finally, pulling this all together yields:
(A32) |
where the last line follows from Equation 2 in [21].
We can solve Equation A22 for Cov[m, p](t) and insert it and Equation A11 into Equation A32 to yield:
(A33) |
which immediately gives:
(A34) |
where c0 is an arbitrary integration constant that will be determined shortly. Inserting this into Equation A22 gives:
(A35) |
For which the general solution is:
(A36) |
The expressions for and can be inserted (see Equation A12 and Equation 2 of [21], respectively), and the integral can be evaluated in closed form. We require a few more constraints, however, in order to set c0 and c1. We assume that at cell division the contents (messengers and proteins) of the mother cell is distributed among the two daughters randomly but with equal probability. We can then write:
(A37) |
Where Pbinom(p|q) represents the probability that p proteins are distributed to a daughter cell given that the mother cell contains q proteins at division time. This is obviously just the binomial distribution with the probability of a successful Bernoulli trial equal to 0.5. Pmother(q) in the above equation represents the probability that the mother contains q proteins.
From this we can compute the relationship between the protein variance immediately before and immediately after cell division:
(A38) |
where Varbinom[p|q] and Ebinom[p|q] represent the variance and mean of the number of successful Bernoulli trials, p, given q attempts. We can also compute:
(A39) |
We can insert Equations A12, A38, and A39 into Equation A34 in order to solve for c0; this yields:
(A40) |
which, along with Equation A38, allows us to write:
(A41) |
Now, deriving the population mean and variance is simply a matter of integrating out the time variable according to the prescription of [21] (see equations S23 and S25 therein). It is well established that populations of log-phase cells have exponentially distributed ages [23, 24]:
(A42) |
and so we can write:
(A43) |
The resulting expression for the population mean is relatively simple:
(A44) |
while the expression for the population variance is quite long and cumbersome (and as such, will not be reproduced here) although it can be expressed in closed form.
It it fairly common in single cell proteomics measurements to report size-normalized protein distributions [2, 3]. Deriving the size-normalized protein statistics can be accomplished with only a minor revision to our formulae. Assuming cells grow exponentially during the cell cycle, we can write the a cell’s size, s, as:
(A45) |
Now we can compute the average cell size, , as:
(A46) |
Then solving for s0 such that is 1 average cell gives s0 = 1/(2ln(2)). We can use this to write the size-normalized messenger mean and protein mean simply by dividing (as computed in [21]) and by the instantaneous cell size, and we can also write the protein variance by dividing by the squared instantaneous cell size:
(A47) |
Again omitting the expression for the protein variance (we have included a Mathematica workbook that includes it as part of the Supplemental Material [47]), we find:
(A48) |
Finally, we note that a time-dependent expression for the protein messenger covariance was derived en route to the protein mean and variance (see Equations A34 and A40). We can use this to compute the size-normalized time-averaged mRNA-protein covariance and Pearson correlation coefficient:
(A49) |
Appendix B: Correcting for extrinsic noise
We can consider the effects of extrinsic noise in the parameters in our model. Following the prescription of [21] we can Taylor expand about the mean values of each parameter:
(B1) |
where E[p|x1, …xn], for example, represents the expression for the mean protein count (e.g. Equation A47), evaluated with parameters x1, …xn. An analogous expression can also be written down for Cov[m, p] in order to compute the effect of extrinsic noise on the mRNA-protein covariance and Pearson correlation.
Appendix C: Fitting Our Model to the Taniguchi Data Set
We attempted to find the transcription, translation, and mRNA degradation rates that minimize the squared error (denoted Δi) when fitting our model to the experimental data in [2]. The data includes measured protein variances and means (represented below as and for each protein i), their respective error estimates ( and ), mRNA means ( ), and mRNA lifetimes (the inverse of the mRNA degradation rates, denoted below as ) for 585 E. coli genes.
We pose the set of optimization problems:
(C1) |
where Varext. nse.[pi](kt,i, kr,i, kd,i; ρ), Eext. nse.[pi](kt,i, kr,i, kd,i; ρ), and Eext. nse.[mi](kt,i, kd,i; ρ) represent our theoretical expressions for the protein variance, protein mean, and mRNA mean, respectively (Equations B1). Here, ρ represents a matrix describing correlations among the various extrinsic noise sources; when they are assumed to be independent, ρ = 𝟙. Because [2] does not report errors for the messenger mean and degradation rates, and were set equal to , and (the third and fourth terms in Δi therefore represent squared relative deviations). Its important to note that each Δi is a fit of our model to four measured values by allowing only three to vary (kt,i, kr,i, kd,i); enabling more parameters to vary, or fitting to three or two measured values leaves the system underconstrained, and is not a meaningful test of the model’s ability to recapitulate the data. Also of note, the dependence Varext. nse.[pi], Eext. nse.[pi], and Eext. nse.[mi] on tr and tD have been suppressed above. We assume during the fitting that cells have average doubling times of tD = 120 minutes and each gene, i, has its own average replication time that depends on it location along the chromosome as tr,i = 42.2 + χi × 42.4 minutes. Each χi was computed as the fraction of the given gene’s locus along the E. coli chromosome as measured from origin to terminus [27, 48].
We performed the 585 optimizations using the SUBPLEX [49] method as implemented in the freely available nlopt software package [50].
Appendix D: Sampling Correlation matrices with Approximately Uniformly-Distributed Off-Diagonal Elements
We constructed 50,000 random correlation matrices with approximately uniformly-distributed off-diagonal terms by first constructing a large set of random matrices and then pruning the ones from over-represented regions of the correlation matrix space.
The random matrices were constructed by sampling the elements of a lower triangular matrix, L, such that the squared elements of each row sum to 1 and the diagonal terms are non-negative. The elements of each row, i, live on an i-dimensional half-sphere of radius 1; we can evenly sample the surface of each of these half-spheres by sampling the elements in L from a standard normal distribution, or from the positive half of the standard normal distribution if the element is on-diagonal, and then normalizing each row element by . The product, ρ = LLT, is then positive definite with ones on diagonal (as all correlation matrices must be). In practice, this will lead to a set of correlation matrices with very-different distributions of off-diagonal terms. Randomly permuting the indices of the rows and columns of these matrices yields off-diagonal distributions of similar shape, but they remain non-uniform.
In order to ensure approximate uniformity in the off-diagonal terms, we generated 100,000 random matrices, and assigned to each a score, S, representing the degree to which its off-diagonal terms are over- or under-represented. This was accomplished by first histogramming (with a bin width of 0.001) the 100,000 occurrences of each off-diagonal term and using the results as “frequency” functions, fi,j(ρ), that represent the number of random matrices with i, j elements within the same bin as the given matrix, ρ. Using these, a score was computed for each matrix as:
(D1) |
This score tends to be larger for ρ matrices in which most terms are under-represented in our set of 100,000 random matrices, and smaller for ρs in which most terms are over-represented. Taking the 50,000 matrices with the largest scores yielded approximately uniformly-distributed off-diagonal terms (see Figure 5A, blue histograms).
Contributor Information
John Cole, Department of Physics, University of Illinois, Urbana-Champaign.
Zaida Luthey-Schulten, Department of Chemistry, University of Illinois, Urbana-Champaign.
References
- 1.Elowitz MB, Levine AJ, Siggia ED, Swain PS. Science. 2002;297:1183. doi: 10.1126/science.1070919. [DOI] [PubMed] [Google Scholar]
- 2.Taniguchi Y, Choi PJ, Li GW, Chen H, Babu M, Hearn J, Emili A, Xie XS. Science. 2010;329:533. doi: 10.1126/science.1188308. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Newman JR, Ghaemmaghami S, Ihmels J, Breslow DK, Noble M, DeRisi JL, Weissman JS. Nature. 2006;441:840. doi: 10.1038/nature04785. [DOI] [PubMed] [Google Scholar]
- 4.Labhsetwar P, Cole JA, Roberts E, Price ND, Luthey-Schulten ZA. Proc Natl Acad Sci U S A. 2013;110:14006. doi: 10.1073/pnas.1222569110. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Friedman N, Cai L, Xie XS. Phys Rev Lett. 2006;97:168302. doi: 10.1103/PhysRevLett.97.168302. [DOI] [PubMed] [Google Scholar]
- 6.Schultz D, Jacob EB, Onuchic JN, Wolynes PG. Proc Natl Acad Sci U S A. 2007;104:17582. doi: 10.1073/pnas.0707965104. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Choi PJ, Cai L, Frieda K, Xie XS. Science. 2008;322:442. doi: 10.1126/science.1161427. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Assaf M, Roberts E, Luthey-Schulten Z, Goldenfeld N. Phys Rev Lett. 2013;111:058102. doi: 10.1103/PhysRevLett.111.058102. [DOI] [PubMed] [Google Scholar]
- 9.Lu M, Onuchic J, Ben-Jacob E. Phys Rev Lett. 2014;113:078102. doi: 10.1103/PhysRevLett.113.078102. [DOI] [PubMed] [Google Scholar]
- 10.Acar M, Mettetal JT, van Oudenaarden A. Nat Genet. 2008;40:471. doi: 10.1038/ng.110. [DOI] [PubMed] [Google Scholar]
- 11.MacNeil LT, Walhout AJ. Genome Res. 2011;21:645. doi: 10.1101/gr.097378.109. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Wang Z, Zhang J. Proc Natl Acad Sci U S A. 2011;108:E67. doi: 10.1073/pnas.1100059108. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Rao CV, Wolf DM, Arkin AP. Nature. 2002;420:231. doi: 10.1038/nature01258. [DOI] [PubMed] [Google Scholar]
- 14.Raser JM, O’Shea EK. Science. 2005;309:2010. doi: 10.1126/science.1105891. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Raser JM, O’Shea EK. Science. 2004;304:1811. doi: 10.1126/science.1098641. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Swain PS. J Mol Biol. 2004;344:965. doi: 10.1016/j.jmb.2004.09.073. [DOI] [PubMed] [Google Scholar]
- 17.Thattai M, van Oudenaarden A. Biophys J. 2002;82:2943. doi: 10.1016/S0006-3495(02)75635-X. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Swain PS, Elowitz MB, Siggia ED. Proc Natl Acad Sci U S A. 2002;99:12795. doi: 10.1073/pnas.162041399. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Soltani M, Singh A. R Soc Open Sci. 2016;3:160578. doi: 10.1098/rsos.160578. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Jones DL, Brewster RC, Phillips R. Science. 2014;346:1533. doi: 10.1126/science.1255301. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Peterson JR, Cole JA, Fei J, Ha T, Luthey-Schulten ZA. Proc Natl Acad Sci U S A. 2015;112:15886. doi: 10.1073/pnas.1516246112. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Hilfinger A, Norman TM, Paulsson J. Cell Syst. 2016;2:251. doi: 10.1016/j.cels.2016.04.002. [DOI] [PubMed] [Google Scholar]
- 23.Powell EO. J Gen Microbiol. 1956;15:492. doi: 10.1099/00221287-15-3-492. [DOI] [PubMed] [Google Scholar]
- 24.Ho PY, Amir A. Front Microbiol. 2015;6 doi: 10.3389/fmicb.2015.00662. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.Mir M, Wang Z, Shen Z, Bednarz M, Bashir R, Golding I, Prasanth SG, Popescu G. Proc Natl Acad Sci U S A. 2011;108:13124. doi: 10.1073/pnas.1100506108. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Wolfram Research, Inc. Mathematica, Version 10.1. 2015. [Google Scholar]
- 27.(), see Supplemental Material at [URL will be inserted by publisher] for an Excel spreadsheet that includes the χ values for each of the 585 genes considered in this article.
- 28.Chen H. personal communication 2011 [Google Scholar]
- 29.Earnest TM, Cole JA, Peterson JR, Hallock MJ, Kuhlman TE, Luthey-Schulten Z. Biopolymers. 2016;105:735. doi: 10.1002/bip.22892. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30.Adiciptaningrum A, Osella M, Moolman MC, Lagomarsino MC, Tans SJ. Sci Rep. 2015;5 doi: 10.1038/srep18261. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31.Michelsen O, De Mattos MJT, Jensen PR, Hansen FG. Microbiology. 2003;149:1001. doi: 10.1099/mic.0.26058-0. [DOI] [PubMed] [Google Scholar]
- 32.Gillespie DT. J Phys Chem. 1977;81:2340. [Google Scholar]
- 33.Gillespie DT. J Phys Chem B. 2009;113:1640. doi: 10.1021/jp806431b. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34.Shahrezaei V, Swain PS. Proc Natl Acad Sci U S A. 2008;105:17256. doi: 10.1073/pnas.0803850105. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35.Cooper S, Helmstetter CE. J Mol Biol. 1968;31:519. doi: 10.1016/0022-2836(68)90425-7. [DOI] [PubMed] [Google Scholar]
- 36.Transtrum MK, Machta BB, Sethna JP. Phys Rev Lett. 2010;104:060201. doi: 10.1103/PhysRevLett.104.060201. [DOI] [PubMed] [Google Scholar]
- 37.Baughman G, Nomura M. Cell. 1983;34:979. doi: 10.1016/0092-8674(83)90555-x. [DOI] [PubMed] [Google Scholar]
- 38.Baughman G, Nomura M. Proc Natl Acad Sci U S A. 1984;81:5389. doi: 10.1073/pnas.81.17.5389. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 39.Keseler IM, Mackie A, Santos-Zavaleta A, Billington R, Bonavides-Martínez C, Caspi R, Fulcher C, Gama-Castro S, Kothari A, Krummenacker M, et al. Nucleic Acids Res. 2016 doi: 10.1093/nar/gkw1003. gkw1003. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40.Chien AC, Hill NS, Levin PA. Current biology. 2012;22:R340. doi: 10.1016/j.cub.2012.02.032. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 41.Magnan D, Bates D. J Bacteriol. 2015;197:3370. doi: 10.1128/JB.00446-15. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 42.Ingraham JL, Maaloe O, Neidhardt FC, editors. Growth of the Bacterial Cell. 1st Sinauer Association, Inc.; Sunderland, MA: 1983. p. 3. [Google Scholar]
- 43.Bartholomäus A, Fedyunin I, Feist P, Sin C, Zhang G, Valleriani A, Ignatova Z. Philos Trans A Math Phys Eng Sci. 2016;374:20150069. doi: 10.1098/rsta.2015.0069. [DOI] [PubMed] [Google Scholar]
- 44.Milo R. In: Cell Biology by the Numbers. 1st Milo R, Phillips R, editors. Garlan Science, Taylor & Francis Group, LLC; New York, NY: 2016. p. 120. Chap. 2. [Google Scholar]
- 45.Bakshi S, Choi H, Weisshaar JC. Front Microbiol. 2015;6 doi: 10.3389/fmicb.2015.00636. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 46.Bakshi S, Siryaporn A, Goulian M, Weisshaar JC. Mol Microbiol. 2012;85:21. doi: 10.1111/j.1365-2958.2012.08081.x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 47.(), see Supplemental Material at [URL will be inserted by publisher] for a Mathematica notebook that includes several tedious calculations relevant to this article.
- 48.Blattner FR, Plunkett G, Bloch CA, Perna NT, Burland V, Riley M, Collado-Vides J, Glasner JD, Rode CK, Mayhew GF, et al. Science. 1997;277:1453. doi: 10.1126/science.277.5331.1453. [DOI] [PubMed] [Google Scholar]
- 49.Rowan TH. Ph D thesis. University of Texas at Austin; 1990. Functional stability analysis of numerical algorithms. [Google Scholar]
- 50.Johnson SG. The nlopt nonlinear-optimization package. 2014 http://ab-initio.mit.edu/nlopt.