Author manuscript; available in PMC: 2017 Sep 16.
Published in final edited form as: Phys Rev Lett. 2016 Sep 13;117(12):128101. doi: 10.1103/PhysRevLett.117.128101

Stochastic Kinetics of Nascent RNA

Heng Xu 1,2,3,*, Samuel O Skinner 1,2,3, Anna Marie Sokac 3, Ido Golding 1,2,3,*
PMCID: PMC5033037  NIHMSID: NIHMS816487  PMID: 27667861

Abstract

The stochastic kinetics of transcription is typically inferred from the distribution of RNA numbers in individual cells. However, cellular RNA reflects additional processes downstream of transcription, hampering this analysis. In contrast, nascent (actively transcribed) RNA closely reflects the kinetics of transcription. We present a theoretical model for the stochastic kinetics of nascent RNA, which we solve to obtain the probability distribution of nascent RNA per gene. The model allows us to evaluate the kinetic parameters of transcription from single-cell measurements of nascent RNA. The model also predicts surprising discontinuities in the distribution of nascent RNA, a feature which we verify experimentally.


Transcription, the production of RNA from a gene, is a stochastic process consisting of multiple single-molecule events [1,2]. The inference of transcription kinetics is typically addressed as an inverse problem, using the ergodic assumption that population statistics contain the signature of single-cell kinetics. Specifically, the number of RNA molecules from the gene is measured in many individual cells simultaneously using microscopy-based methods [3–5], and the measured RNA copy-number distribution is then compared to the prediction from a stochastic model for transcription kinetics [6–9]. This approach has been successfully used to demonstrate the bursty, non-Poissonian nature of transcription [6–8] and to examine how transcription kinetics are modulated by transcription factors [10–12].

However, mapping cellular RNA number to the underlying kinetics of transcription is hampered by the fact that this number reflects additional processes downstream of transcription, such as RNA degradation and its partitioning during cell division. The stochasticity of both processes may mask that of the transcription process [13,14]. Moreover, cellular RNA represents the combined contributions from multiple copies of the same gene, whose number changes through the cell cycle [14,15] and whose activity may be correlated [15–18].

In contrast to total cellular RNA, nascent RNA (the RNA molecules still actively transcribed at the gene) is not subject to these effects, and therefore bears more closely the signature of the transcription process. Recent progress in fluorescence microscopy has allowed measuring the amount of nascent RNA at individual genes in single cells [8,15,16,19–24]. However, the theoretical modeling of nascent RNA kinetics is only in its infancy [8,16,23,25–27]. We still lack a theoretical framework for mapping the single-cell measurements back to the stochastic kinetics of transcription. The goal of this paper is to develop such a framework.

The model

We model the kinetics of nascent RNA as consisting of four steps (Fig. 1(a)): Gene activation, transcription initiation, RNA synthesis (elongation), and release [8,16,23]. The gene fluctuates between two states, active (state 1), where transcription initiation is allowed, and inactive (state 0), where it is forbidden. Transitions between states and the initiation of transcription in the active state are modeled as Poisson processes, with rates k01, k10 and kINI, respectively [6,9,28,29]. Following initiation, RNA synthesis proceeds with a constant elongation speed VEL [25,30], to a final length L. The completed RNA molecule remains on the gene for a (deterministic) duration TS before being released [22,31]. See Supplemental Material [32] for a detailed discussion of model assumptions and of possible extensions to the model.
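The stochastic part of the model (gene switching and initiation) can be simulated directly. The following Python sketch is a minimal illustration under stated assumptions, not the authors' simulation code: it uses a standard Gillespie scheme, starts the gene in its steady-state distribution, and returns the initiation times. The function name `simulate_initiations` and its arguments are hypothetical. Elongation and release are deterministic in the model, so they need no random draws.

```python
import random

def simulate_initiations(k01, k10, k_ini, t_total, seed=0):
    """Gillespie simulation of the two-state gene; returns the initiation times."""
    rng = random.Random(seed)
    # Draw the initial gene state from its steady-state distribution.
    state = 1 if rng.random() < k01 / (k01 + k10) else 0
    t = 0.0
    initiations = []
    while True:
        # Rates of the events possible in the current gene state.
        rate_switch = k10 if state == 1 else k01
        rate_total = rate_switch + (k_ini if state == 1 else 0.0)
        t += rng.expovariate(rate_total)       # waiting time to the next event
        if t >= t_total:
            break
        if rng.random() < rate_switch / rate_total:
            state = 1 - state                  # gene switches between 0 and 1
        else:
            initiations.append(t)              # initiation (active state only)
    return initiations
```

Each returned time t_i marks an RNA that, in the model, stays on the gene until t_i + TRES.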

FIG. 1. A stochastic model of nascent RNA kinetics.

(a) Model schematic. (b) Different experimental observables that can be described by the model: The number of RNA polymerases (RNAPs) on the gene (green), the amount of nascent RNA (red), and the signal from single-molecule fluorescence in situ hybridization (smFISH) probes (blue). (c) The contribution function corresponding to the three observables in panel b. In all cases, TS = 0.

The state of the system is defined by two random variables, the gene state n (n = 0, 1) and the amount of nascent RNA m (m ≥ 0). m is obtained by summing over all nascent RNA molecules present at the gene, and is measured in units of a single complete (mature) RNA [8,14,16]. Since nascent RNAs may be incomplete [14,22], m can have non-integer values. Here we generalize m to represent the experimentally measured signal from the nascent RNA. The actual value of m thus depends on the specific experimental observable (Fig. 1(b)). For example, in the case of single-molecule fluorescence in situ hybridization (smFISH, [3–5]), commonly used for RNA detection, m corresponds to the fluorescent signal emitted by oligonucleotide probes bound to the RNA. In all cases, the signal m at time t is determined by initiation events happening within a time window $T_{\mathrm{RES}} = L/V_{\mathrm{EL}} + T_S$ (the residence time of RNA at the gene) prior to t, and the contribution from each nascent RNA molecule depends only on its length at time t. We define the contribution function G(l) to describe the signal from a single RNA of length l [23]. Since l is determined by the difference between the RNA initiation time $t_i$ and the observation time t, we can rewrite G as a function of this time difference, g(τ) = G(l(τ)), with $\tau = t_i - t$ ($-T_{\mathrm{RES}} \le \tau \le 0$) and $l(\tau) = \min\{L, -V_{\mathrm{EL}}\tau\}$ [16]. The observed signal is then given by $m(t) = \sum_{t - T_{\mathrm{RES}} \le t_i \le t} g(t_i - t)$. The form of g(τ) reflects the experimental observable. A few examples are depicted in Fig. 1(c) and discussed in more detail below. In all cases, g(τ) is non-increasing, with the delay TS in RNA release represented as a time period with g = 1.
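The sum defining m(t) is easy to make concrete. The Python sketch below evaluates the signal for a list of initiation times and a chosen contribution function, with TS = 0. All function names are hypothetical. The probe-based g is an assumption made for illustration: probes are taken to be equally bright, and a probe contributes as soon as its target position (given as a fraction of the gene length) has been transcribed.

```python
def g_rnap(tau, t_res=1.0):
    """g = 1: every RNAP on the gene contributes one unit."""
    return 1.0

def g_length(tau, t_res=1.0):
    """g = -tau / T_RES: signal proportional to the length transcribed so far."""
    return -tau / t_res

def make_g_probes(probe_positions):
    """Build g for smFISH probes at fractional positions 0..1 along the gene."""
    n_probes = len(probe_positions)
    def g(tau, t_res=1.0):
        transcribed = -tau / t_res     # fraction of the gene already transcribed
        bound = sum(1 for x in probe_positions if x <= transcribed)
        return bound / n_probes
    return g

def nascent_signal(initiation_times, t_obs, g, t_res=1.0):
    """m(t): sum of g(t_i - t) over initiations with t - T_RES <= t_i <= t."""
    signal = 0.0
    for t_i in initiation_times:
        tau = t_i - t_obs
        if -t_res <= tau <= 0.0:
            signal += g(tau, t_res)
    return signal
```

With initiation times 9.25, 9.5 and 9.75 observed at t = 10, the three choices give different signals for the same set of RNAs, which is the point of Fig. 1(b).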

General approach to solving the model

Because m exhibits a finite deterministic memory (over duration TRES), we cannot easily write the master equation for the probability distribution P(n, m). To overcome this problem and solve for the state of the system at time t, we first define the pseudo-observables 𝓷(τ, t) ≡ n(t + τ), which indicates the gene state n at t + τ, and $𝓶(\tau, t) = \sum_{t - T_{\mathrm{RES}} \le t_i \le t + \tau} g(t_i - t)$, which describes the accumulation of m over the history from t − TRES to t + τ. Here, τ varies from −TRES to 0. Notably, 𝓶 = 0 for τ = −TRES and 𝓶 = m for τ = 0. Next, we write the master equation for the probability distribution P(𝓷, 𝓶) [16]:

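$$\frac{d\mathbf{P}(𝓶)}{d\tau} = (\mathbf{K} - \mathbf{K}_{\mathrm{INI}})\,\mathbf{P}(𝓶) + \mathbf{K}_{\mathrm{INI}}\,\mathbf{P}(𝓶 - g(\tau)). \tag{1}$$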

Here, $\mathbf{K} = \begin{bmatrix} -k_{01} & k_{10} \\ k_{01} & -k_{10} \end{bmatrix}$, $\mathbf{K}_{\mathrm{INI}} = \begin{bmatrix} 0 & 0 \\ 0 & k_{\mathrm{INI}} \end{bmatrix}$, and $\mathbf{P}(𝓶) = \begin{bmatrix} P(0, 𝓶) \\ P(1, 𝓶) \end{bmatrix}$. Note that we allow 𝓶 to be negative, but Eq. (1) guarantees that P(𝓶 < 0) = 0 as long as the initial conditions satisfy that condition. To obtain the distribution of the true observables (n, m), we solve Eq. (1) for the pseudo-observables (𝓷, 𝓶) and substitute τ = 0. (Alternatively, Eq. (1) can be used to derive an equation for P(n, m); see Supplemental Material [32].)

We focus on the steady-state behavior of P(n, m). Using the definition of 𝓶 and the (easily calculable) steady-state distribution for the gene state n, we obtain the initial condition $\mathbf{P}_{\tau=-T_{\mathrm{RES}}}(𝓶) = \frac{\delta(𝓶)}{k_{01}+k_{10}} \begin{bmatrix} k_{10} \\ k_{01} \end{bmatrix}$. To solve Eq. (1), we transform P(𝓷, 𝓶) to its characteristic function $\Psi(𝓷, \omega) \equiv \int_0^{\infty} e^{i\omega 𝓶} P(𝓷, 𝓶)\, d𝓶$ [46] to obtain

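$$\frac{d\mathbf{\Psi}(\omega)}{d\tau} = \left(\mathbf{K} + \left(e^{i\omega g(\tau)} - 1\right)\mathbf{K}_{\mathrm{INI}}\right)\mathbf{\Psi}(\omega), \tag{2}$$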

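with $\mathbf{\Psi}(\omega) = \begin{bmatrix} \Psi(0, \omega) \\ \Psi(1, \omega) \end{bmatrix}$ and the initial condition $\mathbf{\Psi}_{\tau=-T_{\mathrm{RES}}}(\omega) = \frac{1}{k_{01}+k_{10}} \begin{bmatrix} k_{10} \\ k_{01} \end{bmatrix}$. Eq. (2) is analogous to a quantum mechanical spin system with a time-dependent interaction term. Its solution is therefore given by the Dyson series [54]: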

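$$\mathbf{\Psi}_{\tau=0}(\omega) = \left\{ \mathbf{I} + \sum_{N=1}^{\infty} \int_{-T_{\mathrm{RES}}}^{0} d\tau_1 \cdots \int_{-T_{\mathrm{RES}}}^{\tau_{N-1}} d\tau_N \prod_{\tau_1 > \cdots > \tau_i > \cdots > \tau_N} e^{i\omega g(\tau_i)}\, \mathbf{V}(\tau_i) \right\} e^{(\mathbf{K} - \mathbf{K}_{\mathrm{INI}}) T_{\mathrm{RES}}}\, \mathbf{\Psi}_{\tau=-T_{\mathrm{RES}}}, \tag{3}$$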

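where $\mathbf{V}(\tau) = e^{-(\mathbf{K} - \mathbf{K}_{\mathrm{INI}})\tau}\, \mathbf{K}_{\mathrm{INI}}\, e^{(\mathbf{K} - \mathbf{K}_{\mathrm{INI}})\tau}$. Applying the inverse transformation, we obtain the steady-state distribution,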

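$$\mathbf{P}(m) = \frac{1}{2\pi} \int_{-\infty}^{+\infty} e^{-i m \omega}\, \mathbf{\Psi}_{\tau=0}(\omega)\, d\omega = \sum_{N=0}^{\infty} \mathbf{P}_N(m) = \left\{ \delta(m) + \sum_{N=1}^{\infty} \int_{-T_{\mathrm{RES}}}^{0} d\tau_1 \cdots \int_{-T_{\mathrm{RES}}}^{\tau_{N-1}} d\tau_N\, \delta\!\left(m - \sum_{i=1}^{N} g(\tau_i)\right) \mathcal{T}\!\left[\prod_{i=1}^{N} \mathbf{V}(\tau_i)\right] \right\} e^{(\mathbf{K} - \mathbf{K}_{\mathrm{INI}}) T_{\mathrm{RES}}}\, \mathbf{\Psi}_{\tau=-T_{\mathrm{RES}}}, \tag{4}$$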

where 𝒯 is the time-ordering operator. $\mathbf{P}_N(m) = \begin{bmatrix} P(0, m \mid N) \\ P(1, m \mid N) \end{bmatrix}$ is the vectorized probability of observing m, given that the number of initiation events in the time interval −TRES ≤ τ ≤ 0 was exactly N. In the general case, PN(m) depends on the contribution function g(τ), and therefore solving Eq. (4) requires knowing the specific form of g(τ). Below we describe the solution for a number of experimentally-relevant examples. A closed-form solution may not always be possible, but P(n, m) can be calculated numerically using the finite state projection method [16,23,33,47] (Supplemental Material [32]). For the purpose of comparing with experimental data, the calculated distribution is typically marginalized over n, i.e. $P(m) = \sum_n P(n, m)$. The moments of P(m) can be directly calculated from Eq. (2) (Supplemental Material [32]):

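$$\langle m^N \rangle = \mathbf{u} \cdot (-i)^N \frac{d^N}{d\omega^N} \mathbf{\Psi}_{\tau=0}(0) = \mathbf{u} \cdot \sum_{I=1}^{N} \sum_{\substack{0 = k_0 < \cdots < k_i < \cdots < k_I = N \\ i = 1, \ldots, I-1}} \int_{-T_{\mathrm{RES}}}^{0} d\tau_1 \cdots \int_{-T_{\mathrm{RES}}}^{\tau_{I-1}} d\tau_I\, \mathcal{T}\!\left[\prod_{i=1}^{I} \binom{k_i}{k_{i-1}}\, g(\tau_i)^{k_i - k_{i-1}}\, \mathbf{W}(\tau_i)\right] \mathbf{\Psi}_{\tau=-T_{\mathrm{RES}}}(0), \tag{5}$$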

with u = (1, 1) and $\mathbf{W}(\tau) = e^{-\mathbf{K}\tau}\, \mathbf{K}_{\mathrm{INI}}\, e^{\mathbf{K}\tau}$. Below we use these moments to explore the shape of P(m) as a function of model parameters.
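As a worked example, the first moment (N = 1) can be written out explicitly. This is a sketch, reading W(τ) as exp(−Kτ) K_INI exp(Kτ). Only the term I = 1 contributes. The columns of K sum to zero, so u·K = 0 and u·exp(−Kτ) = u. The initial vector is the steady state of K, so exp(Kτ) leaves it unchanged.

```latex
\begin{aligned}
\langle m \rangle
  &= \mathbf{u} \cdot \int_{-T_{\mathrm{RES}}}^{0} d\tau\, g(\tau)\,
     e^{-\mathbf{K}\tau}\, \mathbf{K}_{\mathrm{INI}}\, e^{\mathbf{K}\tau}\,
     \mathbf{\Psi}_{\tau=-T_{\mathrm{RES}}}(0) \\
  &= \left( \mathbf{u} \cdot \mathbf{K}_{\mathrm{INI}}\,
     \mathbf{\Psi}_{\tau=-T_{\mathrm{RES}}}(0) \right)
     \int_{-T_{\mathrm{RES}}}^{0} g(\tau)\, d\tau \\
  &= \frac{k_{\mathrm{INI}}\, k_{01}}{k_{01} + k_{10}}
     \int_{-T_{\mathrm{RES}}}^{0} g(\tau)\, d\tau .
\end{aligned}
```

With TRES = 1 and TS = 0, this gives kINI k01/(k01 + k10) for g = 1 and half that value for g = −τ: the mean initiation rate multiplied by the average contribution of one RNA over its residence time.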

Solutions for specific contribution functions

Case 1: g = 1

This corresponds to measuring the number of RNA polymerases (RNAPs) currently transcribing the gene (Fig. 1(c), panel I), or, equivalently, the number of nascent RNA molecules present, irrespective of their lengths [8]. Here and below we assume for simplicity that TS = 0 (i.e. RNA is released from the gene immediately upon completion [16,19]), and (without loss of generality) set TRES = 1. Since g in this case does not take fractional values, we replace the characteristic functions with generating functions, $F_n(z, \tau) \equiv \sum_{𝓶=0}^{\infty} z^{𝓶} P_\tau(n, 𝓶)$ and F(z, τ) ≡ F0(z, τ) + F1(z, τ), and transform Eq. (1) to obtain

$$\ddot{F} + \left(k_{01} + k_{10} + (1 - z)\,k_{\mathrm{INI}}\right)\dot{F} + (1 - z)\,k_{\mathrm{INI}}\,k_{01}\,F = 0, \tag{6}$$

with the initial conditions F(z, −1) = 1, $\dot{F}(z, -1) = (z - 1)\,\frac{k_{\mathrm{INI}}\,k_{01}}{k_{01} + k_{10}}$. Solving Eq. (6) and performing the inverse transformation allows us to calculate the marginal probability distribution of m (see Supplemental Material [32]),

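$$P(m) = \frac{e^{-\frac{k_{01} + k_{10} + k_{\mathrm{INI}}}{2}}}{m!} \left(\frac{k_{\mathrm{INI}}}{2}\right)^{m} \left\{ \left[\frac{k_{01} + k_{10} + k_{\mathrm{INI}}}{2} - \frac{k_{\mathrm{INI}}\,k_{01}}{k_{01} + k_{10}}\right] \sum_{i=0}^{m} \binom{m}{i} M_{1,i} + \sum_{i=0}^{m} \binom{m}{i} M_{0,i} + m\,\frac{k_{01} - k_{10}}{k_{01} + k_{10}} \sum_{i=0}^{m-1} \binom{m-1}{i} M_{1,i} \right\}, \tag{7}$$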

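with $M_{s,i} = \sum_{2l \ge i}\ \sum_{w = \max(0,\, i-l)}^{\min(l,\, i)} \binom{l}{w} \binom{l}{i-w} \frac{(-1)^{i}\, i!}{(2l+s)!} \left(\frac{k_{\mathrm{INI}} + \kappa_1}{2}\right)^{l-w} \left(\frac{k_{\mathrm{INI}} + \kappa_2}{2}\right)^{l-i+w}$ and $\kappa_{1,2} = k_{10} - k_{01} \pm 2\mathrm{i}\sqrt{k_{10}\,k_{01}}$ (here $\mathrm{i}$ is the imaginary unit). Eq. (7) provides the exact solution for the distribution of the number of transcribing RNAPs at the gene.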
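Eq. (7) can be evaluated term by term with nothing beyond the Python standard library. The sketch below is one possible implementation, not the authors' code: the names `M` and `p_rnap` are hypothetical, and the infinite sum over l in M_{s,i} is truncated at `l_max`, an assumption that should be adequate for moderate rates but needs checking for large kINI. Time is in units of TRES.

```python
from functools import lru_cache
from math import comb, exp, factorial, sqrt

@lru_cache(maxsize=None)
def M(s, i, k01, k10, k_ini, l_max=60):
    """Coefficient M_{s,i} of Eq. (7), with the sum over l truncated at l_max."""
    kappa1 = complex(k10 - k01, 2.0 * sqrt(k10 * k01))
    a = (k_ini + kappa1) / 2.0
    b = a.conjugate()                      # equals (k_INI + kappa_2) / 2
    total = 0j
    for l in range((i + 1) // 2, l_max + 1):           # all l with 2l >= i
        for w in range(max(0, i - l), min(l, i) + 1):
            total += (comb(l, w) * comb(l, i - w)
                      * a ** (l - w) * b ** (l - i + w) / factorial(2 * l + s))
    # Terms with w and i - w swapped are complex conjugates, so the sum is real.
    return (-1) ** i * factorial(i) * total.real

def p_rnap(m, k01, k10, k_ini):
    """P(m) of Eq. (7): probability of m RNAPs on the gene (T_RES = 1, T_S = 0)."""
    k_sum = k01 + k10
    prefactor = exp(-(k_sum + k_ini) / 2.0) * (k_ini / 2.0) ** m / factorial(m)
    bracket = (k_sum + k_ini) / 2.0 - k_ini * k01 / k_sum
    s1 = sum(comb(m, i) * M(1, i, k01, k10, k_ini) for i in range(m + 1))
    s0 = sum(comb(m, i) * M(0, i, k01, k10, k_ini) for i in range(m + 1))
    s2 = sum(comb(m - 1, i) * M(1, i, k01, k10, k_ini) for i in range(m))
    return prefactor * (bracket * s1 + s0 + m * (k01 - k10) / k_sum * s2)
```

Two quick consistency checks are that the probabilities sum to one and that the mean equals kINI k01/(k01 + k10), the steady-state initiation rate multiplied by TRES = 1.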

Figure 2(a) depicts P(m), calculated from Eq. (7), for a few parameter values. Stochastic simulations of the model, also shown, agree with the analytical calculation (Supplemental Material [32]). For insight into the shape of P(m), we first note that gene-state transitions are typically believed to be slow compared to both the rate of initiation and the time to complete one RNA [8,16,55]. Specifically, in the limit (k01 & k10) ≪ kINI and (k01 or k10) ≪ 1, Eq. (7) can be written as the weighted sum of two Poisson distributions, with rates 0 and kINI (Supplemental Material [32]). In this limit, P(m) is also identical to the solution for the commonly used two-state model for cellular RNA kinetics [6,8,9,29], if we replace the residence time TRES with the inverse of the RNA degradation rate, 1/kD. Outside that limiting case, however (as e.g. in [16]), the two distributions can be quite different (Fig. S1 in the Supplemental Material [32]).
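The slow-switching limit is simple enough to write down directly. The sketch below assumes the natural reading of that limit: the gene stays in one state for the whole residence window, so the RNAP count is Poisson with mean 0 (inactive) or mean kINI·TRES (active), weighted by the steady-state occupancies of the two gene states. The function name is hypothetical.

```python
from math import exp, factorial

def poisson_mixture_limit(m, k01, k10, k_ini, t_res=1.0):
    """Slow-switching approximation of P(m): a two-component Poisson mixture."""
    p_on = k01 / (k01 + k10)                # steady-state probability of state 1
    mean_on = k_ini * t_res                 # mean RNAP count if active throughout
    poisson_on = exp(-mean_on) * mean_on ** m / factorial(m)
    poisson_off = 1.0 if m == 0 else 0.0    # a Poisson with mean 0 is a spike at 0
    return (1.0 - p_on) * poisson_off + p_on * poisson_on
```

For k01 = k10 = 0.01 and kINI = 20, this gives a spike at m = 0 with weight close to 0.5 and a Poisson peak around m = 20, the bimodal shape expected for slow gene switching.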

FIG. 2. The probability distribution for the number of RNAPs at the gene.

(a) The exact solution for P(m) (binned to integer values, red) for a few parameter values. Also shown are the results of stochastic simulations (gray). (b) The bimodality coefficient β as a function of k01, k10 and kINI was calculated and thresholded (βth = 5/9, bottom, red surface) to classify P(m) as either bimodal or unimodal. The unimodal distributions were further classified based on the peak position. Parameter values corresponding to panel a are marked as gray circles.

To map how the shape of P(m) varies with transcription parameters, we defined the bimodality coefficient, β ≡ 1/(κ − γ²), where γ is the skewness and κ the kurtosis of P(m) [51]. Calculating β over a broad range of kinetic rates, and using a threshold of βth = 5/9 (corresponding to a uniform distribution, see Supplemental Material [32]), we found that P(m) is bimodal for k01 ~ k10 ≲ 1 and kINI ≳ 1, and unimodal outside this region (Fig. 2(b)). The unimodal region can be further divided based on the position of the distribution peak, at m = 0 or m > 0 (Fig. 2(b)).
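The bimodality coefficient is straightforward to compute from any tabulated distribution. The following Python sketch takes β = 1/(κ − γ²), with κ the ordinary (non-excess) kurtosis, which is the form that gives 5/9 for a uniform distribution. The function name is hypothetical, and the input is assumed to be a list of probabilities (or unnormalized weights) for m = 0, 1, 2, ...

```python
def bimodality_coefficient(pmf):
    """beta = 1 / (kappa - gamma**2) for weights given at m = 0, 1, 2, ..."""
    total = sum(pmf)
    probs = [p / total for p in pmf]        # normalize, in case weights are given
    mean = sum(m * p for m, p in enumerate(probs))

    def central_moment(order):
        return sum((m - mean) ** order * p for m, p in enumerate(probs))

    variance = central_moment(2)
    gamma = central_moment(3) / variance ** 1.5    # skewness
    kappa = central_moment(4) / variance ** 2      # kurtosis (not excess)
    return 1.0 / (kappa - gamma ** 2)
```

A Poisson distribution gives β = 1/3, below the 5/9 threshold (unimodal), whereas a distribution concentrated on two values gives β = 1, the largest possible value.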

Case 2: g = − τ

This corresponds to measuring the total length of nascent RNA, summed over multiple molecules present at the gene (Fig. 1(c), panel II). Experimentally, this is achieved by using multiple smFISH probes covering the length of the target gene [4]. In contrast to Case 1 above, m is now continuous, and Eq. (2) can be transformed to a single equation for Ψ(1, ω) (Supplemental Material [32]):

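$$\ddot{\Psi}(1, \omega) + \left[k_{01} + k_{10} + k_{\mathrm{INI}}\left(1 - e^{-i\omega\tau}\right)\right]\dot{\Psi}(1, \omega) - \left[k_{\mathrm{INI}}\,(k_{01} - i\omega)\,e^{-i\omega\tau} - k_{\mathrm{INI}}\,k_{01}\right]\Psi(1, \omega) = 0, \tag{8}$$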

with the initial conditions $\Psi_{\tau=-1}(1, \omega) = \frac{k_{01}}{k_{01} + k_{10}}$, $\dot{\Psi}_{\tau=-1}(1, \omega) = \frac{k_{01}\,k_{\mathrm{INI}}}{k_{01} + k_{10}}\left(e^{i\omega} - 1\right)$. By solving Eq. (8), we obtain the exact expression for Ψ(ω) ≡ Ψ(0, ω) + Ψ(1, ω) as a combination of confluent hypergeometric functions. Since transforming Ψ(ω) back to an analytical form of P(m) is challenging, we proceed to calculate P(m) using finite state projection [16,23,33]. The calculated P(m) exhibits the same three characteristic shapes as in Case 1, but the boundaries in parameter space between regions exhibiting different shapes are shifted by up to 2-fold (Fig. S2 in the Supplemental Material [32]). Thus, the difference in contribution functions can lead to different shapes of P(m) for the same transcription parameters (another example of this effect is described below).
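For readers who want a quick numerical look at Case 2, Eq. (1) can be integrated on a grid. The Python sketch below is not the finite state projection method cited above. It is a plain first-order (Euler) discretization, chosen for brevity: m is binned with width `dm`, τ is stepped from −1 to 0, and an initiation at time τ moves probability up by g(τ) = −τ, rounded to whole bins. The function name and the grid parameters are assumptions.

```python
def nascent_length_distribution(k01, k10, k_ini, m_max=15.0, dm=0.02, n_steps=1000):
    """Probability mass per m-bin for g(tau) = -tau, with T_RES = 1 and T_S = 0."""
    n_bins = int(round(m_max / dm)) + 1
    p_off = [0.0] * n_bins             # P(0, m) on the grid m = j * dm
    p_on = [0.0] * n_bins              # P(1, m)
    # Initial condition at tau = -1: all mass at m = 0, gene state at steady state.
    p_off[0] = k10 / (k01 + k10)
    p_on[0] = k01 / (k01 + k10)
    dtau = 1.0 / n_steps
    for step in range(n_steps):
        tau = -1.0 + (step + 0.5) * dtau
        shift = int(round(-tau / dm))  # g(tau) expressed in bins
        # Initiation moves probability from (1, m - g) to (1, m).
        gain = [0.0] * shift + p_on[:n_bins - shift]
        new_off = [po + dtau * (-k01 * po + k10 * pn)
                   for po, pn in zip(p_off, p_on)]
        new_on = [pn + dtau * (k01 * po - (k10 + k_ini) * pn + k_ini * g)
                  for po, pn, g in zip(p_off, p_on, gain)]
        p_off, p_on = new_off, new_on
    m_grid = [j * dm for j in range(n_bins)]
    return m_grid, [po + pn for po, pn in zip(p_off, p_on)]
```

The returned values are probability masses per bin rather than densities, and the first bin carries the weight of the δ(m) term. The mean should come out close to kINI k01/(2(k01 + k10)).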

Inferring transcription kinetics from single-cell measurements of nascent RNA

To demonstrate how the model can be used to interpret experimental data, we first examined the transcription of the hunchback (hb) gene in embryos of the fruit fly, Drosophila melanogaster ([16] and Supplemental Material [32]). Early in development, hb is regulated by the transcription factor Bicoid (Bcd), whose concentration forms a gradient along the embryo [56] (Fig. 3(a)). We measured the amount of nascent RNA at individual copies of the hb gene [16], and examined the distribution of nascent RNA over all cell nuclei within a given region of the embryo (corresponding to a given Bcd concentration) (Fig. 3(a)). Next, we solved Eq. (1) using g(τ) that corresponds to the set of smFISH probes used in the experiment [16], and used maximum likelihood estimation to fit the model to the experimental data. The model was able to capture the change in P(m) shape along the embryo (Fig. 3(a)). We found that the regulatory effect of Bcd is to increase k01 (>50 fold along a single embryo) while k10 and kINI remain almost unchanged (Fig. 3(a)). Thus, the model allowed us to identify what aspect of hb kinetics is modulated during gene regulation [16].

FIG. 3. Estimating transcription kinetics from experimental data.

(a) Regulation of the hb gene by Bcd. Top left, Bcd forms a concentration gradient along the anterior-posterior axis of the Drosophila embryo. Grey circles indicate individual cell nuclei. Three representative regions of the embryo are highlighted in pink, corresponding to high (I), medium (II) and low (III) Bcd concentrations. Right, the measured distribution of nascent hb RNA at each region (smFISH data from a single embryo, >200 data points per histogram, bin width = 3), and the corresponding theoretical fit (red). Bottom left, the estimated transcription parameters (dots), superimposed on the modality phase plane of P(m) calculated as in Fig. 2(b). (b) The effect of smFISH probe positions. Two different sets of probes were designed against the bcd3-lacZ reporter gene, targeting the first half (blue) and second half (magenta) of the gene. The two sets yielded different distributions of nascent RNA (top and bottom, >250 data points from a single embryo, at 0.2–0.3 embryo length, bin width = 4). Using the contribution functions calculated from the probe positions on the gene (insets) yielded a good fit between the model and experimental data.

In the second example, we labeled the two halves of the same gene using two different smFISH probe sets carrying two different fluorescent dyes (Fig. 3(b) and Supplemental Material [32]). In the experiment, the two probe sets yielded very different signal distributions P(m) (both normalized to the signal from a single full-length RNA). In particular, the signal from the first half of the gene was spread ~2 fold wider on the m axis than that from the second half (Fig. 3(b)). Since both probe sets label the same gene, the two data sets should be describable using the same kinetic parameters, the only difference being the form of g(τ) , which we calculated directly from the probe positions on the gene (Fig. 3(b)). In agreement with this hypothesis, we were able to fit the two experimental distributions (as well as the joint distribution) using a single set of transcription parameters (Fig. 3(b) and Supplemental Material [32]).

Discontinuities in P(m)

As noted above, a distinctive feature of nascent RNA, in contrast to mature cellular RNA, is that it can be approximated as continuous [4,5,16]. When examining the behavior of our model in the case g = −τ (i.e. measuring the total amount of nascent RNA at the gene), we found that, for multiple parameter choices, P(m) appears discontinuous at integer values of m (insets of Fig. S2(a) in the Supplemental Material [32]). This discontinuity was consistent with the appearance of terms of order 1/ω in the characteristic function Ψ(ω) [57]. The source of the discontinuity can be understood by noting that, in Eq. (4), P(m) is written as the sum of PN(m), the probabilities of observing m given that the number of initiation events in the time interval −TRES ≤ τ ≤ 0 is N (equivalently, the number of RNAPs present at the gene is N). Since, for a given N, m cannot exceed N, the result may be a discontinuity of P(m) or its derivatives at integer values. Specifically, since P0(m) ∝ δ(m), P(m) has an infinite discontinuity at m = 0. P1(m) is nonzero only for m ≤ 1, hence P(m) has a jump discontinuity at m = 1. For higher values of N, it can be shown that the (N−1)th derivative of P(m) has a jump discontinuity at m = N (Supplemental Material [32]). For each point of discontinuity, the magnitude of the jump is

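$$\Delta P_N = \left.\frac{d^{N-1} P(m)}{dm^{N-1}}\right|_{m = N^-} - \left.\frac{d^{N-1} P(m)}{dm^{N-1}}\right|_{m = N^+} = \frac{(-1)^{N-1}}{N!}\, \mathbf{u} \cdot e^{(\mathbf{K} - \mathbf{K}_{\mathrm{INI}})}\, \mathbf{K}_{\mathrm{INI}}^{N}\, \mathbf{\Psi}_{\tau=-1}. \tag{9}$$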
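The jump magnitude only requires the exponential of a 2×2 matrix. The Python sketch below reads Eq. (9) with a prefactor (−1)^(N−1)/N!, acting on the steady-state gene vector, with TRES = 1. The helper names are hypothetical, and the matrix exponential is computed with a simple scaled Taylor series to keep the example free of external libraries.

```python
from math import factorial

def mat_mul(A, B):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def mat_exp(A, squarings=10, terms=20):
    """exp(A) for a 2x2 matrix: Taylor series of A / 2**squarings, then squaring."""
    scale = 2.0 ** squarings
    A_scaled = [[x / scale for x in row] for row in A]
    result = [[1.0, 0.0], [0.0, 1.0]]
    term = [[1.0, 0.0], [0.0, 1.0]]
    for n in range(1, terms + 1):
        term = [[x / n for x in row] for row in mat_mul(term, A_scaled)]
        result = [[result[i][j] + term[i][j] for j in range(2)] for i in range(2)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result

def jump_magnitude(N, k01, k10, k_ini):
    """Delta P_N of Eq. (9) for T_RES = 1."""
    E = mat_exp([[-k01, k10], [k01, -k10 - k_ini]])    # exp(K - K_INI)
    # K_INI**N acting on the steady-state vector keeps only the active-state entry.
    active = k_ini ** N * k01 / (k01 + k10)
    # u = (1, 1) sums the two components of E applied to (0, active).
    return (-1) ** (N - 1) / factorial(N) * (E[0][1] + E[1][1]) * active
```

For an always-active gene (k10 = 0) the result reduces to ΔP1 = kINI·exp(−kINI): the probability density of a single initiation at the start of the window followed by none for the rest of it.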

We explore this feature in Fig. 4. For the parameters used (k01 = k10 = 0.1, kINI = 50), zooming in to the low range of m reveals a sharp drop of P(m) at m = 1 (Fig. 4(a)). At higher integer values of m, the drop becomes smaller and is shifted to the left (Fig. 4(a)). The drop reflects the discontinuity of P(m) (or its derivatives) at integer values of m. Each drop is preceded by an increase of P(m), resulting in a peak at m → N⁻ (Fig. 4(b)). This peak, in turn, is due to the fact that, when kINI ≫ N and gene transitions are slow (k01, k10 ≤ 1), the two most probable ways of observing exactly N initiation events are for the gene to be active only at the beginning (τ → −TRES⁺) or the end (τ → 0) of the time window, resulting in maxima of PN(m) at m → N⁻ and m → 0⁺, respectively (Fig. 4(b)).

FIG. 4. Discontinuities in nascent RNA distribution at integer m values.

(a) The calculated distribution of nascent RNA at small values of m, for k01 = k10 = 0.1, kINI = 50. A larger range of m is shown in the inset. The range of m was divided into windows covering −0.5 to 0.5 around each integer (colored shading). (b) The origin of discontinuity at m = 1. The total probability of observing m is a marginalization over different numbers of RNAPs on the gene (plotted for N = 1, 2, 3). (c) The discontinuity factor r as a function of k01, k10 and kINI was calculated and thresholded (rth = 0.1, left, red surface). Black dot indicates the experimental data analyzed in panel d. (d) The experimental signature of P(m) discontinuity. Nascent RNA from bcd3-lacZ was measured using smFISH (at 0.1–0.3 embryo length, 23 embryos). The distribution of m0, the deviation of m from the nearest integer, was calculated (gray, 3.5 ≤ m < 6.5, ~500 data points, bin width = 0.1) and compared to model predictions with (red) and without (dashed blue) incorporating the effect of finite probe binding probability p0.

To ask whether these features of P(m) can be detected experimentally, we first defined the discontinuity factor r ≡ ΔP1/P(m = 1⁺) to characterize the magnitude of the jump in the distribution of nascent RNA. Calculating r over a wide range of kinetic rates indicated that it would be high (> 0.1) for k01 ≲ 101 (Fig. 4(c)). This range covers the estimated parameters in multiple biological systems [16,23,58], including our measurements in Drosophila (Fig. 3 above). To then try and detect this feature in our experimental data, we focused on the small m (< 6.5) range, where the peaks in P(m) are expected to be the highest (Fig. 4(a)). To improve data sampling, we defined the variable m0 = m − [m] (where [·] denotes the nearest integer) such that all m values are mapped into the range [−0.5, 0.5). Using this procedure, we detected a peak to the left of m0 = 0, as predicted by the model (Fig. 4(d)). Allowing for the finite binding probability of smFISH probes [16,53], we were able to successfully reproduce the shape of the folded probability distribution (Fig. 4(d), see Supplemental Material [32]). Thus, the experimental data supports the theoretical prediction of discontinuity in the distribution of nascent RNA. The periodic discontinuities can be used to identify the signal intensity corresponding to a single RNA, thus improving the precision of RNA counting using smFISH [3,5,14,16].
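The folding step is a one-line transformation followed by a histogram. The Python sketch below illustrates it. The function names are hypothetical, and the default range and bin width are taken from the caption of Fig. 4(d). Half-integers are mapped to −0.5 so that the result lies in [−0.5, 0.5).

```python
import math

def fold_to_nearest_integer(m):
    """m0 = m - [m], with [m] the nearest integer; the result lies in [-0.5, 0.5)."""
    return m - math.floor(m + 0.5)

def folded_histogram(values, m_min=3.5, m_max=6.5, bin_width=0.1):
    """Histogram of m0 for all measurements with m_min <= m < m_max."""
    n_bins = int(round(1.0 / bin_width))
    counts = [0] * n_bins
    for m in values:
        if m_min <= m < m_max:
            m0 = fold_to_nearest_integer(m)
            # Clamp the index to guard against rounding at the upper edge.
            index = min(int((m0 + 0.5) / bin_width), n_bins - 1)
            counts[index] += 1
    return counts
```

Pooling the windows around m = 4, 5 and 6 in this way triples the number of data points per bin, at the cost of averaging over peaks whose height and position differ slightly between integers.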

Conclusion

We presented a theoretical framework for connecting the stochastic kinetics of transcription with the resulting probability distribution of nascent RNA at the gene. By changing the form of the contribution function g(τ) , the model can be used to describe different experimental observables. The model allowed us to interpret experimental data, extract the kinetic parameters of gene activity, and identify how the kinetics vary under the regulatory influence of a transcription factor. The model also predicted a hitherto unobserved feature of discontinuities and periodic peaks in nascent RNA distribution, which we were able to validate experimentally. To further improve the estimation of transcription parameters, the model for nascent RNA can be combined with one for the total cellular RNA [15] and compared to experimental measurements of both species simultaneously [3,6,8,15] (Supplemental Material [32]). Beyond the steady-state distribution discussed here, solving for the time-dependent behavior of the model (Supplemental Material [32]) can allow a direct comparison with live-cell measurements of nascent RNA [19,21,22].

Supplementary Material

Supplemental Material

Acknowledgments

We are grateful to the following people for generous advice: H. Garcia, D. Larson, H. Levine, A. Sanchez and N. Wingreen. Work in the Golding lab is supported by grants from NIH (R01 GM082837), NSF (PHY 1147498, PHY 1430124 and PHY 1427654), The Welch Foundation (Q-1759) and The John S. Dunn Foundation (Collaborative Research Award). H.X. is supported by the Burroughs Wellcome Fund Career Award at the Scientific Interface. A.M.S. is supported by a grant from the NIH (R01 GM115111). We gratefully acknowledge the computing resources provided by the CIBR Center of Baylor College of Medicine.

References
