Nucleic Acids Research. 2025 Apr 16;53(7):gkaf295. doi: 10.1093/nar/gkaf295

Cell-cycle dependence of bursty gene expression: insights from fitting mechanistic models to single-cell RNA-seq data

Augustinas Sukys, Ramon Grima
PMCID: PMC12000877  PMID: 40240003

Abstract

Bursty gene expression is characterized by two intuitive parameters, burst frequency and burst size, the cell-cycle dependence of which has not been extensively profiled at the transcriptome level. In this study, we estimate the burst parameters per allele in the G1 and G2/M cell-cycle phases for thousands of mouse genes by fitting mechanistic models of gene expression to messenger RNA count data, obtained by sequencing of single cells whose cell-cycle position has been inferred using a deep-learning method. We find that upon DNA replication, the median burst frequency approximately halves, while the burst size remains mostly unchanged. Genome-wide distributions of the burst parameter ratios between the G2/M and G1 phases are broad, indicating substantial heterogeneity in transcriptional regulation. We also observe a significant negative correlation between the burst frequency and size ratios, suggesting that regulatory processes do not independently control the burst parameters. We show that to accurately estimate the burst parameter ratios, mechanistic models must explicitly account for gene copy number variation and extrinsic noise due to the coupling of transcription to cell age across the cell cycle, but corrections for technical noise due to imperfect capture of RNA molecules in sequencing experiments are less critical.

Introduction

Gene expression occurs in random bursts of transcription, associated with alternating active and inactive promoter states [1, 2]. By tagging messenger RNA (mRNA) molecules using fluorescent reporters, live-cell imaging reveals a train of pulses of fluorescent intensity whose mean height is proportional to the mean number of transcripts produced when the gene is actively transcribing (the mean burst size) and whose frequency reflects how often transcription occurs (the burst frequency) [3]. These two burst parameters offer a simple, intuitive, and practical quantitative description of gene expression and hence their estimation has been the focus of many studies (for reviews see [4–6]).

While live-cell measurements are ideal to understand transcription, they are challenging because they are low-throughput and require genome editing [7, 8]. To overcome this challenge, one can leverage the inherent heterogeneity generated by bursting within a population of cells. Acquiring a distribution of mRNA counts per cell is nowadays relatively straightforward through techniques such as single-molecule fluorescence in situ hybridization (smFISH) and single-cell sequencing. By fitting these measured distributions to the distributions predicted by mechanistic models of gene expression (where the rate parameters determine the speed of molecular processes), one can then infer the burst size and frequency. Common examples of such models are the two-state telegraph model [1, 9] that predicts both unimodal and bimodal mRNA count distributions, and the simpler bursty model that predicts unimodal negative binomial (NB) distributions [10].

The vast majority of studies reporting estimates of the burst size and frequency have used smFISH data [11–14] because these are widely considered to be the gold standard for the accurate measurement of the number of RNA molecules for a specific gene in single cells. However, many of these studies suffer from a major limitation—they implicitly assume that the burst parameters are the same for each cell in the sample. Equivalently, it is assumed that the observed inter-cell differences in the transcript numbers are generated by the uncertainty in the timing of biochemical events (intrinsic noise) [15, 16]. However, generally this is not the case—burst parameters vary from one cell to another because of differences in the number of cellular components such as RNA polymerases, transcription factors, and other molecules playing key roles in transcription. These parameter fluctuations are commonly referred to as extrinsic noise, since they arise independently of a gene of interest but act on it. Thus, the differences in transcript numbers between cells are due to both intrinsic and extrinsic noise, although the latter has been suggested to be the dominant source of heterogeneity [17, 18].

One might suppose that the burst parameters estimated using models that assume perfectly identical cells can be approximately interpreted as the parameter values averaged over all cells. However, this is not the case: previous work [19] demonstrated systematic biases in the parameters inferred by fitting the telegraph model to simulated single-allele mRNA data for a population of cells with extrinsic noise in one of the transcriptional parameters. Whether the burst parameters are over- or underestimated was found to depend on the source of extrinsic noise (which parameter is the most variable amongst cells) and the mode of transcriptional activity. In reality, these systematic biases are likely to be larger when non-allele-specific mRNA count data are used for inference, because cell-to-cell differences in gene copy numbers due to DNA replication (since cells are likely to be found at different stages of the cell cycle) can be a strong determinant of gene expression noise [20].

Together, these studies show that it is difficult to obtain a reliable quantitative picture of gene expression by fitting models to single-cell data, unless the inference procedure appropriately corrects for the changes of gene copy number with the cell-cycle phase and for extrinsic noise due to other exogenous factors that vary within each phase. Several smFISH-based studies have suggested that the burst frequency is principally affected by the cell-cycle phase [20–22] while the burst size scales with cell volume [21, 23], leading to concentration homeostasis [24]. Note that cell volume varies considerably within a cell-cycle phase and hence it is possibly a dominant source of extrinsic noise within each phase; other extrinsic noise sources include mitochondrial variability [25] and Ca2+ signalling [26]. However, even if sophisticated modelling frameworks that include all the principal sources of extrinsic noise can be developed and used to fit the mRNA count distributions measured using smFISH, this would not lead to a transcriptome-wide quantitative understanding of gene expression, as such data are typically limited to a relatively small number of genes of interest.

A possible way to circumvent this problem involves the use of data generated by single-cell RNA sequencing (scRNA-seq). This allows comprehensive profiling of large numbers of cells on a genome-wide level, thus providing an unprecedented insight into cell-to-cell variability [27–29]. While it has primarily been used to distinguish between different cell types in a sample [30–33], to analyse the dynamics of cell state switching [34–36], and to reconstruct gene regulatory networks [37–41], in recent years it has also been used to study the inherent stochasticity of gene expression [19, 42–48]. Despite their potential for transcriptome-wide inference, scRNA-seq data have one major disadvantage compared with smFISH data, namely the high amount of technical (non-biological) noise [49, 50]. In droplet-based sequencing, this can arise from the imperfect capture of all RNA molecules (especially from lowly expressed genes), amplification errors, and non-uniquely identifiable reads. Although technical noise is reduced by the use of data that incorporate unique molecular identifiers (UMIs) [51, 52]—random nucleotide sequences that tag an individual mRNA molecule and thus allow separation between the original molecules and their duplicates derived from the complementary DNA library amplification [53–55]—this is not a panacea. For example, a previous comparison of the probability distributions of mRNA counts for five genes measured using both smFISH and an UMI-based scRNA-seq platform has shown that there is a non-negligible fraction of transcripts in each cell that are not captured by sequencing (see extended figure 7 of [43]). Considering the limitations of inference already identified using smFISH data, it is clear that the deduction of quantitative models of bursty gene expression using scRNA-seq data would need an inference procedure that corrects for the cell-cycle phase, extrinsic noise within each phase, and technical noise. 
This three-fold correction is missing from the literature. Current methods of inferring burst parameters from scRNA-seq data either ignore or cannot properly account for extrinsic and technical noise [42, 43], or correct only for extrinsic noise [19], or consider only certain types of extrinsic and technical noise but not the variation in gene copy number over the cell cycle [44, 45, 48, 56].

In this paper, we develop an inference framework that takes into account the cell-cycle phase, extrinsic noise due to factors varying within each phase, and technical noise, and use it to investigate how the burst frequency and size are modulated over the cell cycle in eukaryotic genes. While previous studies sought to answer this question using smFISH-based approaches for a few tens of genes, we use previously published scRNA-seq data [57] to understand the cell-cycle dependence of the bursty expression of about 1000 mouse genes. The overall approach is illustrated in Fig. 1. Sequencing data provide spliced and unspliced read counts for over 10^4 genes in undifferentiated mouse embryonic stem cells (mESCs). DeepCycle [57] uses a deep-learning approach to map the cycling patterns in the unspliced–spliced RNA space of cell-cycle-related genes to a parameter θ for each cell, which gives the cell age: θ = 0 means a newborn cell, while θ = 1 indicates a cell that is about to divide via mitosis (hence θ is a rough proxy for cell size). From this information, it is also possible to determine the cell-cycle phase, which agrees with the phase assignment based on DNA content measured by fluorescence-activated cell sorting (as discussed in [57]). Given this joint information of mRNA counts, cell age θ, and cell-cycle phase for each cell, we fit two classes of mechanistic models of stochastic transcription (age-independent and age-dependent models) to the data, from which we estimate the burst parameters. We show that the most robust inference strategy involves using age-dependent models to estimate the ratios of the burst parameters between the G1 and G2/M phases, rather than their phase-specific values, which are highly sensitive to the model type and the influence of technical noise due to imperfect capture of transcripts. 
From the obtained burst parameter ratios, we deduce how the burst frequency and size vary with cell-cycle phase, thus providing an unprecedented quantitative view of the dynamics of gene expression in mammalian cells.

Figure 1.

Schematic of the single-cell data acquisition, processing, and parameter inference pipeline. (A) Our mechanistic analysis is built upon the study of Riba et al. [57], which generated an scRNA-seq dataset of over 5000 mouse embryonic stem cells and developed a deep-learning approach called DeepCycle to estimate the cell age and cell-cycle phase for each cell in the population. Specifically, DeepCycle fits the gene expression patterns in the spliced–unspliced space to estimate the cell age θ ∈ [0, 1) associated with the relative position of a cell in its progression along the cell cycle from G1 to mitosis. In addition, certain values of θ can be linked to transitions between the main cell-cycle phases, and hence all cells in the dataset can be grouped by their cell-cycle phase (G1, S, or G2/M). (B) An example showing a subset of the processed data for the Mdh1 gene. The bottom graph shows how the mean spliced mRNA counts (average over cells falling in each bin of θ values) evolve as the cells go through G1, S, and G2/M phases of the cell cycle. The histograms above show the spliced count distributions for three different values of the cell age θ; the vertical lines denote the mean values (also indicated on the bottom plot). (C) In this paper, we use the processed data and the associated cell-cycle information to perform maximum likelihood-based model inference on the experimentally observed (spliced) mRNA count distributions, taking into account intrinsic, extrinsic, and technical noise sources. For inference, we use two classes of mechanistic models of stochastic transcription (age-independent models and age-dependent models). As an example, maximum likelihood (MLE) fits for the optimal age-dependent model describing the stochastic gene expression of gene Mdh1 are shown in the upper graphs in panel (B). The burst parameters (burst size and frequency) are then computed from the optimal model parameters. For technical details, see the ‘Materials and methods’ section.

Materials and methods

Processing of scRNA-seq data

In our analysis, we use the scRNA-seq dataset of a population of mouse embryonic stem cells with inhibited differentiation generated by Riba et al. [57] using the 10x Genomics Chromium platform, containing spliced and unspliced transcript counts for over 5000 cells and >10 000 genes. Each cell in the population has an associated cell age θ (originally referred to as the ‘transcriptional phase’ in [57]) which characterizes its position along the cell cycle and can also be used to determine the cell-cycle phase (G1, S, or G2/M). The ‘cell age’ parameter θ ∈ [0, 1) is a continuous periodic variable where θ = 0 indicates the beginning of the G1 phase of a newly formed cell and θ = 1 corresponds to cell division. In the following, we reiterate the main steps of the data processing and downstream analysis published in [57] and discuss how it is reprocessed for the purposes of this paper.

In order to infer the age of each cell, Riba et al. developed a neural network-based method called DeepCycle. Their approach is motivated by the concept of RNA velocity, which can be used to estimate the future transcriptional state of a cell from unspliced/spliced mRNA counts [35, 58]. Namely, the authors proposed to order cells along the cell-cycle trajectory by fitting the circular expression patterns of cell-cycle-related genes in the unspliced–spliced space using an autoencoder. The data are initially processed using scVelo [36]: the spliced/unspliced counts are normalized to the median of total counts per cell; a nearest-neighbour graph is constructed from principal component analysis on logarithmized counts (using the first 20 principal components); and spliced/unspliced counts are smoothed for each cell by averaging the counts over its 30 nearest neighbours. The next step in the DeepCycle procedure is to identify the cell-cycle-related genes that exhibit clearly cycling transcriptional dynamics in the unspliced–spliced space. The z-scored spliced and unspliced counts of the selected genes for each cell comprise the training data for the autoencoder, which learns to denoise and reconstruct the expression patterns of the cycling genes. This way, the single latent variable of the autoencoder, the cell age parameter θ, effectively characterizes the relative position of a cell along the cell cycle. The variable is then discretized as θ ∈ [0, 0.01, … , 0.99] and all cells are assigned to their corresponding bins based on the autoencoder predictions. Finally, specific values of θ can be associated with the transitions between the main cell-cycle phases based on the expression of cell-cycle marker genes and the total transcripts per cell; this allows the cells to be mapped to the G1, S, and G2/M cell-cycle phases and θ to be aligned so that cell division occurs at θ = 1. Specifically, the G1/S and S/G2 transitions occur at θ values of 0.26–0.27 and 0.63–0.64, respectively. 
Although RNA velocity-based methods have various shortcomings [59], Riba et al. performed a number of self-consistency checks that establish the validity of DeepCycle for the considered datasets. For instance, the authors verified that the expected expression patterns of marker genes are recovered and also provide complementary validation using flow cytometry analysis.

As the processed mESC dataset published in [57] does not contain the discrete UMI counts needed for our inference and these cannot be easily retrieved, we start from the raw FASTQ data generated by the authors and reprocess it following the same methodology. The reads were aligned to the mouse reference transcriptome (mm10-2020-A) using the Cell Ranger count pipeline (v7.1.0), and the spliced and unspliced transcripts were quantified using velocyto (v0.17.17) [35]. Low-quality cells and genes were filtered out using scVelo (v0.2.5) [36]. More specifically, we select genes that are expressed in at least 200 cells and cells that have a total number of UMIs >10 000, express >3000 unique genes and have a low percentage of mitochondrial genes (<15%); the resulting dataset contains 5686 cells and 11 544 genes. Next, as described in the paragraph above, we process the data further using scVelo and run DeepCycle to assign the transcriptional phase θ ∈ [0, 0.01, … , 0.99] and the cell-cycle phase (G1, S, or G2/M) for each cell. Although not exactly equivalent, the obtained phase characterization is similar both quantitatively and qualitatively to the original processed data in [57].

In DeepCycle, the transition between mitosis and G1 is defined so that θ = 0.99 corresponds to the peak in the average total RNA counts per cell, whereas θ = 0 coincides with their immediate drop. Although one may expect to observe a sharp halving in the total counts upon division, this decline is spread over a wider range of θ bins assigned to the G1 phase (figure 3C of [57]), and it remains unclear how many cells in this range are misclassified by DeepCycle as belonging to the G1 phase. Therefore, as a precaution, we choose to remove all cells assigned to the [0, 0.09] range of θ values and rescale the remaining θ ∈ [0.1, 0.11, … , 0.99] into the same range (0–0.99) as θ↦(θ − 0.1) × 0.99/0.89. The truncated dataset contains 5294 cells in total, of which 941 are assigned to the G1 phase, 2240 to the S phase, and 2113 to the G2/M phase. Finally, we also remove genes with low expression, i.e. genes that in either G1, S, or G2/M phases have a mean transcript abundance below one. The parameter inference is then performed on the spliced mRNA counts of the remaining 3792 genes.
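The truncation and rescaling of θ described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `rescale_theta` and its default arguments are assumptions chosen to mirror the map θ↦(θ − 0.1) × 0.99/0.89.

```python
def rescale_theta(theta, cut=0.1, theta_max=0.99):
    """Map theta in [cut, theta_max] linearly onto [0, theta_max].

    Hypothetical helper: cells with theta below `cut` are assumed to have
    been discarded beforehand, as in the text.
    """
    if theta < cut:
        raise ValueError("cells with theta below the cut are discarded")
    return (theta - cut) * theta_max / (theta_max - cut)


# Rescale the retained bins 0.10, 0.11, ..., 0.99
bins = [round(0.10 + 0.01 * i, 2) for i in range(90)]
rescaled = [rescale_theta(t) for t in bins]
```

The first retained bin maps to 0 and the last bin stays at 0.99, so the rescaled ages cover the same range as the original θ values.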

Mechanistic models of stochastic gene expression

Age-independent telegraph, NB, and Poisson models

In this paper, we consider the standard model of stochastic gene expression (the telegraph model) [9] and its variants. The telegraph model is defined in terms of the following effective reactions:

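$$G \xrightarrow{\sigma_0} G^{*}, \quad G^{*} \xrightarrow{\sigma_1} G, \quad G \xrightarrow{\rho} G + M, \quad M \xrightarrow{d} \varnothing \tag{1}$$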

where a gene switches between the active (G) and the inactive (G*) promoter states with rates σ0 and σ1, and in the active state produces spliced mRNA (M) with rate ρ, which subsequently degrades with rate d. Transcriptional bursting is characterized by the burst frequency f and burst size b, which for the telegraph model are given by f = σ1 and b = ρ/σ0. Note that the burst size is the mean number of mRNA molecules produced when the gene is active.

The chemical master equation (CME) describing the telegraph model can be solved exactly in the steady state for the probability distribution of mRNA counts m [1, 9]:

$$P(m) = \frac{(\rho/d)^{m}}{m!}\,\frac{(\sigma_1/d)_m}{(\sigma_0/d + \sigma_1/d)_m}\;{}_1F_1\!\left(\frac{\sigma_1}{d} + m;\ \frac{\sigma_0}{d} + \frac{\sigma_1}{d} + m;\ -\frac{\rho}{d}\right) \tag{2}$$

where (x)_m = Γ(x + m)/Γ(x) is the Pochhammer symbol and ₁F₁ is Kummer’s (confluent hypergeometric) function. Alternatively, equation (2) can be represented by a beta-Poisson distribution [42, 60]. The rate parameters (σ0, σ1, and ρ), as they appear in the function, are divided by the mRNA degradation rate d, and hence their values cannot be determined using only static snapshot data without explicit measurements of the degradation rate. For generality, we estimate the normalized parameter values, i.e. the ratios of the rate parameters and the mRNA degradation rate, and hence we set d = 1 in the following discussion of the age-independent models. However, note that for the age-dependent models and their analysis, we consider only those genes for which experimentally measured gene-specific degradation rates are available, as discussed in the ‘Gene filtering’ section.
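As an illustration of how the steady-state telegraph distribution in equation (2) might be evaluated numerically, the sketch below uses only the Python standard library. It assumes normalized rates (d = 1) and applies Kummer's transformation, ₁F₁(a; b; −ρ) = e^{−ρ} ₁F₁(b − a; b; ρ), so that the series contains only positive terms. The function names are hypothetical, and a production implementation would more likely call a library routine for ₁F₁.

```python
import math


def hyp1f1_pos(a, b, z, tol=1e-16, max_terms=100000):
    """Power series for 1F1(a; b; z), intended for a, b, z > 0."""
    term, total = 1.0, 1.0
    for k in range(max_terms):
        term *= (a + k) / (b + k) * z / (k + 1)
        total += term
        if term < tol * total:
            break
    return total


def telegraph_pmf(m, sigma0, sigma1, rho):
    """Steady-state telegraph P(m) with rates normalized by d (d = 1)."""
    log_pref = (m * math.log(rho) - math.lgamma(m + 1)
                + math.lgamma(sigma1 + m) - math.lgamma(sigma1)
                - math.lgamma(sigma0 + sigma1 + m) + math.lgamma(sigma0 + sigma1)
                - rho)  # the e^{-rho} factor comes from Kummer's transformation
    return math.exp(log_pref) * hyp1f1_pos(sigma0, sigma0 + sigma1 + m, rho)


# Illustrative parameter values, not taken from the paper
pmf = [telegraph_pmf(m, 2.0, 1.0, 10.0) for m in range(100)]
mean = sum(m * p for m, p in enumerate(pmf))
```

For these values the distribution is normalized over the truncated support and its mean matches ρσ1/(σ0 + σ1), the product of the transcription rate and the fraction of time the gene is active.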

In the limit σ0 ≫ σ1, the gene is mostly inactive and transcription only occurs in short bursts, and the steady-state solution of the telegraph model reduces to a NB distribution:

$$P(m) = \mathrm{NB}(m; f, p) = \frac{\Gamma(m + f)}{m!\,\Gamma(f)}\, p^{f} (1 - p)^{m}, \qquad p = \frac{1}{1 + b} \tag{3}$$

parametrized by the burst frequency f = σ1 and the burst size b = ρ/σ0. Note that this corresponds to the steady-state CME solution of the following reaction network [10]:

$$G \xrightarrow{f} G + rM, \quad M \xrightarrow{d} \varnothing \tag{4}$$

where r is a geometrically distributed random variable with mean b.
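A short sketch of the NB distribution parametrized directly by the burst frequency f and burst size b (with d = 1) may help make the link to the burst parameters concrete. The success probability p = 1/(1 + b) used here is one common convention and is an assumption of this example, as is the function name.

```python
import math


def nb_pmf(m, f, b):
    """NB probability of m transcripts for burst frequency f and burst size b."""
    p = 1.0 / (1.0 + b)
    log_p = (math.lgamma(m + f) - math.lgamma(f) - math.lgamma(m + 1)
             + f * math.log(p) + m * math.log(1.0 - p))
    return math.exp(log_p)


# Illustrative values: f = 2.5 bursts per mRNA lifetime, b = 4 transcripts per burst
pmf = [nb_pmf(m, 2.5, 4.0) for m in range(600)]
mean = sum(m * q for m, q in enumerate(pmf))
var = sum((m - mean) ** 2 * q for m, q in enumerate(pmf))
```

With this parametrization the mean is f·b and the Fano factor (variance over mean) is 1 + b, so the burst size alone sets the deviation from Poisson statistics.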

In the limit σ1 ≫ σ0, the gene is mostly active and the model reduces to a simple birth–death process:

$$G \xrightarrow{\rho} G + M, \quad M \xrightarrow{d} \varnothing \tag{5}$$

which in steady-state conditions is described by the Poisson distribution P(m) = Pois(m; ρ). The burst frequency and burst size cannot be defined in this case as the distribution is characterized by a single parameter, the transcription rate ρ, which equals the mean mRNA count.

Extending the models to account for zero inflation of scRNA-seq data

Substantial technical noise in scRNA-seq experiments [49, 61] can lead to the presence of excess zero counts in the data. The treatment of such non-biological zeros has been a subject of active debate [50, 62]; although it is commonly taken into account by considering zero-inflated models, a number of recent studies have argued that zero inflation is largely avoided in UMI count experiments [50–52, 63, 64]. Nevertheless, to allow for more flexibility in our inference procedure, for age-independent models we model zero inflation explicitly as follows:

$$P_{\mathrm{ZI}}(m) = p_0\,\delta_{m,0} + (1 - p_0)\,P(m) \tag{6}$$

where the parameter p0 ∈ [0, 1] describes the probability of zero inflation (fraction of zeros that are non-biological). For parameter inference, we specifically consider age-independent telegraph, NB, and Poisson models, and their zero-inflated counterparts, i.e. six models illustrated in Fig. 2B.
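Zero inflation of this kind can be sketched as a wrapper around any count distribution. The example below assumes the standard mixture form, in which an extra probability mass p0 is placed at zero and the base distribution is scaled by 1 − p0; the Poisson base model and the function names are illustrative choices.

```python
import math


def poisson_pmf(m, rho):
    """Poisson probability of m counts with mean rho."""
    return math.exp(-rho + m * math.log(rho) - math.lgamma(m + 1))


def zero_inflate(pmf, p0):
    """Mix a point mass at zero (weight p0) with the base distribution."""
    out = [(1.0 - p0) * q for q in pmf]
    out[0] += p0
    return out


base = [poisson_pmf(m, 3.0) for m in range(60)]
zip_pmf = zero_inflate(base, 0.2)  # zero-inflated Poisson (ZIP)
zip_mean = sum(m * q for m, q in enumerate(zip_pmf))
```

The same wrapper applies unchanged to the NB and telegraph distributions, which gives the ZINB and ZI-telegraph variants.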

Figure 2.

The relationship between cell-cycle phases and the burst parameters. (A) Histogram of the ratio of mean mRNA counts in G2/M and G1 phases, Λ = 〈mG2/M〉/〈mG1〉, for all 1760 bursty genes in the dataset. Ratio of 1 implies a mechanism that perfectly compensates for the doubling of the mRNA number due to the doubling of the gene dosage during replication, while a ratio of 2 indicates the opposite. (B) The model fitting and selection algorithm uses six different stochastic models of transcription: Poisson, ZIP, NB, ZINB, telegraph, and ZI-telegraph models. The NB model, most often selected as the optimal model for bursty genes, is illustrated in more detail. The rate parameters (normalized by the mRNA degradation rate) are inferred per gene copy. The mRNA distribution in G1, PG1, is given by a convolution of two NB models PNB (with rate parameters having a subscript G1). In the G2/M phase, due to the doubling of gene copy number in the S phase, the mRNA distribution, PG2/M, becomes a convolution of four NB models (with rate parameters subscripted by G2/M). Plots to the right show the mRNA distributions of gene Nme1 in G1 and G2/M phases and the corresponding best fits using maximum likelihood estimation (MLE fit), where the vertical lines indicate the mean values. (C) Ratios of the burst frequency per allele (top) and burst size per allele (bottom) in the G2/M and G1 phases, denoted by Qf and Qb, respectively, for five different genes in the dataset. Error bars indicate the 95% confidence intervals. The dashed lines indicate no change in the burst parameter values upon cell-cycle progression. (D) Scatter plot of the burst size ratio between G2/M and G1 phases versus the burst frequency ratio for all 1760 genes (each dot indicates a single gene). (E) Raincloud plots [99] of the burst frequency (Qf) and burst size (Qb) ratios. Each raincloud plot combines a smoothed histogram, a box plot, and the jittered raw values of all data points.

Accounting for multiple gene copy numbers

The models considered above describe the transcription from a single allele. However, the mESC dataset that we are analysing in this paper is non-allele-specific, and mouse embryonic stem cells possess two alleles which are replicated during the S phase of the cell cycle. Therefore, we need to consider the combined expression of two gene copies in the G1 phase and four gene copies in the G2/M phase. By assuming that the expression of the alleles is identical and independent from each other, we can model the total count distribution in the G1 phase by convolving the allele-specific count distribution P with itself:

$$P_{\mathrm{G1}}(m) = (P * P)(m) = \sum_{k=0}^{m} P(k)\,P(m - k) \tag{7}$$

Similarly, the total count distribution in the G2/M phase will be given by the convolution of P with itself four times:

$$P_{\mathrm{G2/M}}(m) = (P * P * P * P)(m) \tag{8}$$

The allele-specific distribution P(m) is given by the telegraph, NB, or Poisson models, or their zero-inflated counterparts, with cell-cycle phase-specific parameters—this is schematically illustrated in Fig. 2B. Note that closed-form expressions are known for the convolutions of Poisson and NB distributions, e.g. Pois(ρ)∗Pois(ρ) = Pois(2ρ) and NB(f, p)∗NB(f, p) = NB(2f, p), which considerably simplifies the numerics.
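The convolution construction and the closed-form NB identity can be checked numerically with a few lines of code. This is a sketch on a truncated support, with the NB distribution written in terms of f and b (p = 1/(1 + b) is an assumed convention); the helper names are hypothetical.

```python
import math


def nb_pmf(m, f, b):
    """NB probability of m transcripts for burst frequency f and burst size b."""
    p = 1.0 / (1.0 + b)
    return math.exp(math.lgamma(m + f) - math.lgamma(f) - math.lgamma(m + 1)
                    + f * math.log(p) + m * math.log(1.0 - p))


def convolve(p, q):
    """Discrete convolution truncated to the support 0..len(p)-1."""
    return [sum(p[k] * q[m - k] for k in range(m + 1)) for m in range(len(p))]


n, f, b = 60, 1.5, 3.0
allele = [nb_pmf(m, f, b) for m in range(n)]
g1 = convolve(allele, allele)  # two gene copies in G1
g2m = convolve(g1, g1)         # four gene copies in G2/M
```

Because entry m of a convolution only involves entries 0..m of its inputs, the truncated result is exact on the retained support, and `g1` and `g2m` agree with NB(2f, p) and NB(4f, p), respectively.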

Extended age-dependent model with bursty transcription

As discussed in the ‘Results’ section, age-independent (steady-state) gene expression models need to be extended to incorporate the exponential dependence of the mean expression level with cell age θ. Here we define a stochastic age-dependent transcriptional bursting model that explicitly incorporates the cell-cycle dynamics (visualized in Fig. 3B), and solve it approximately for the mRNA distribution at any cell age θ, under the assumption that θ is equivalent to the normalized time within a cell cycle of duration T, so that θ = t/T ∈ [0, 1]. The analytic model solution can then be utilized to fit the experimental count data.

Figure 3.

(A) Analysis of spliced mRNA counts (from 1000 cells) for three genes shows cell-age-dependent expression patterns. Each bold line corresponds to the mean mRNA counts (an average over cells falling into each bin of θ values). (B) Schematic of the age-dependent model describing bursty gene transcription with burst frequency and size that vary with cell age and cell-cycle phase. The model also incorporates the doubling of gene copy number when DNA replication occurs and mRNA dilution when cell division occurs. (C) For most genes, there are four main dynamical phases, as shown by a plot of the total mRNA counts versus age (top left). The three vertical lines show the G1–S transition, the point half-way through the S phase, and the S–G2/M transition, respectively. An exponential curve ke^{βθ} is fitted to the data from each gene in each of the four phases using non-linear regression (top right). Correlation coefficients and the estimated parameters across phases are visualized for 1760 genes (same as used for Fig. 2) using raincloud plots. (D) Comparison of the mean, variance, and distributions of mRNA counts predicted by the model with parameters inferred using maximum likelihood (MLE) and the statistics calculated from scRNA-seq data for three sample genes.

In the case of transcriptional bursting, the mRNA expression from each allele copy can be described by the following reaction scheme:

$$G \xrightarrow{f} G + rM, \quad M \xrightarrow{d} \varnothing \tag{9}$$

where r is an integer sampled from the geometric distribution with mean b (the mean burst size), f is the burst frequency, and d is the mRNA degradation rate—this is equivalent to the bursty limit of the telegraph model given by equation (4). The age-dependent mean count 〈m〉(θ) is given by the rate equation

$$\frac{\mathrm{d}\langle m\rangle(\theta)}{\mathrm{d}\theta} = T\left[\mu(\theta) - d\,\langle m\rangle(\theta)\right] \tag{10}$$

where μ(θ) is the effective age-dependent mRNA production rate, which equals the product of the burst frequency and the mean burst size at cell age θ. In Fig. 3C, we have demonstrated that the mean mRNA counts in each cell-cycle phase can be well described by an exponential curve ke^{βθ}, which suggests a functional form for the effective production rate μ(θ) in equation (10): to enforce 〈m〉(θ) ∝ e^{βθ} in our model, it is easy to show that μ(θ) likewise has to be proportional to e^{βθ}, and hence it follows that either the burst frequency or the mean burst size must be proportional to e^{βθ}, with both β and the proportionality constant being dependent only on the cell-cycle phase.
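The proportionality argument can be made explicit in one line. The derivation below is a sketch that assumes the rate equation takes the form d〈m〉/dθ = T[μ(θ) − d〈m〉(θ)], i.e. the standard mean equation rewritten in terms of θ = t/T, and that the mean is exactly exponential within a phase.

```latex
\langle m\rangle(\theta) = k\,e^{\beta\theta}
\;\Longrightarrow\;
\beta k\,e^{\beta\theta} = T\left[\mu(\theta) - d\,k\,e^{\beta\theta}\right]
\;\Longrightarrow\;
\mu(\theta) = \left(d + \frac{\beta}{T}\right) k\,e^{\beta\theta}
```

Under these assumptions the production rate carries the same exponent β as the mean, with a prefactor that depends on the degradation rate and the cell-cycle duration.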

To satisfy this parametric constraint, we choose to model the burst frequency f as a cell-cycle phase-specific constant, and introduce an exponential scaling of the mean burst size with cell age θ, so that b = ρe^{βθ}, where β is a gene-specific age scaling exponent. Note that the proportionality constant ρ in this context has a different interpretation from the transcription rate ρ of the standard telegraph model. As discussed in the main text, the choice of functional form here is guided by previous experimental findings. As the burst size b is given for each θ, we also define the burst size averaged over all cells in a specific cell-cycle phase as:

$$b_{\mathrm{phase}} = \sum_{\theta = \theta_i}^{\theta_f} \rho\, e^{\beta\theta}\, g(\theta) \tag{11}$$

where θ varies from θi to θf, and g(θ) is the distribution of θ bins over all cells assigned to that phase (the age distribution in the cell-cycle phase). Note that in the main text we report and discuss the averaged burst sizes in G1 and G2/M, i.e. bG1 and bG2/M, where the parameters ρ and β in equation (11) are the piecewise constants ρ1 and β1 for cells in G1, and ρ2 and β4 for cells in G2/M.
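The phase-averaged burst size can be computed directly from the per-cell θ assignments. The sketch below assumes that g(θ) is the normalized empirical frequency of each θ bin among the cells of the phase, so that the average is a weighted sum of ρe^{βθ} over bins; the function name and the toy inputs are hypothetical.

```python
import math
from collections import Counter


def phase_averaged_burst_size(thetas, rho, beta):
    """Average rho * exp(beta * theta) over the age distribution of a phase.

    `thetas` holds the (binned) cell ages of all cells assigned to the phase.
    """
    counts = Counter(thetas)
    n_cells = len(thetas)
    return sum(rho * math.exp(beta * theta) * (count / n_cells)
               for theta, count in counts.items())


# Toy example: three cells in two age bins
b_avg = phase_averaged_burst_size([0.0, 0.0, 1.0], rho=1.0, beta=math.log(2.0))
```

In this toy case two cells have burst size 1 and one has burst size 2, so the average is 4/3; for β = 0 the function simply returns ρ.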

We construct the age-dependent model in a piecewise manner in order to account for the extrinsic noise due to the cell-cycle phase and the doubling of gene copy number upon DNA replication at θ = θr (see Fig. 3B for a visual summary). Namely, we assume that the allele-specific burst parameters f and ρ are given by the respective constants f1 and ρ1 before replication, and f2 and ρ2 after replication. Furthermore, the parameter β that characterizes the exponential scaling of transcription with θ is piecewise parametrized by four constants: β1 in the G1 phase for θ = [0, θG1), β2 in the S phase before replication for θ = [θG1, θr), β3 in the S phase after replication for θ = [θr, θS), and β4 in the G2/M phase for θ = [θS, 1), where θG1 and θS are the transition time points between the G1/S and S/G2 phases respectively (as determined by DeepCycle). In summary, the age-dependent mRNA production rate in equation (10) is equal to

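$$\mu(\theta) = \begin{cases} f_1 \rho_1 e^{\beta_1 \theta}, & \theta \in [0, \theta_{\mathrm{G1}}), \\ f_1 \rho_1 e^{\beta_2 \theta}, & \theta \in [\theta_{\mathrm{G1}}, \theta_r), \\ f_2 \rho_2 e^{\beta_3 \theta}, & \theta \in [\theta_r, \theta_{\mathrm{S}}), \\ f_2 \rho_2 e^{\beta_4 \theta}, & \theta \in [\theta_{\mathrm{S}}, 1). \end{cases} \tag{12}$$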

Note that the parameter β is segmented into four parts in order to introduce more model flexibility and better fit the non-monotonic trend of the mean counts with respect to cell age θ observed in the S phase. While we could similarly consider the parameter ρ to be a different constant in each of the four phases, we only allow it to take two values (one prior to and another post-replication). This choice is motivated by the fact that β is the parameter which most strongly varies between the phases (bottom row of Fig. 3C), and by the practical desire to reduce the number of model parameters that need to be estimated for each gene.

To inform the model, we use a fixed cell-cycle duration T, experimentally determined as the median cell-cycle length of mESCs in the pluripotent ground state [65] (grown in medium with two inhibitors and leukemia inhibitory factor (2i+LIF), similarly to the data generated by Riba et al. [57]). Gene-specific degradation rates d are obtained from an mRNA decay database for differentiating mESCs [66], discussed in more detail in the ‘Gene filtering’ section. Although gene-specific DNA replication times θr could in principle be extracted from the existing DNA replication timing profiles for mESCs [67], to our knowledge, these are not readily available in the experimental literature without additional processing and technical considerations. Hence, for simplicity, we fix the replication time for all genes in the middle of the S phase at θr = (θG1 + θS)/2. In Supplementary Text S3, we argue in favour of this assumption by performing parameter inference using an alternative age-dependent model formulation that avoids setting a fixed replication time, but still leads to qualitatively similar observations on the transcriptome level.

Using equations (10) and (12), we can derive piecewise differential equations that describe the time evolution of the mean mRNA counts due to all (independent) alleles in each of the four cell-cycle phases:

graphic file with name TM00013.gif (13)
graphic file with name TM00014.gif (14)
graphic file with name TM00015.gif (15)
graphic file with name TM00016.gif (16)

where the additional factors of two and four stem from the number of alleles in each cell-cycle phase and the subscript i in 〈mi〉(θ) denotes the mean mRNA count in the i-th cell-cycle phase. There are four boundary conditions:

graphic file with name TM00017.gif (17)
graphic file with name TM00018.gif (18)
graphic file with name TM00019.gif (19)
graphic file with name TM00020.gif (20)

The first three boundary conditions ensure that the mean mRNA count is continuous as the cell-cycle phase progresses from G1 to G2, while the last boundary condition models dilution due to the binomial partitioning of mRNA molecules at cell division with probability 1/2, i.e. on average only half of the mRNA molecules from generation i are present when the next generation i + 1 starts. Note that here we consider steady-state growth conditions, i.e. enough time has passed such that the probability that a cell of age θ has a given number of mRNA molecules is independent of which generation it belongs to [68]—this means that there is no generation index in our equations.

Similarly, from the CME for reaction scheme (9), we can derive equations for the variance of the mRNA counts due to all (independent) alleles in each cell-cycle phase:

graphic file with name TM00021.gif (21)
graphic file with name TM00022.gif (22)
graphic file with name TM00023.gif (23)
graphic file with name TM00024.gif (24)

The boundary conditions satisfied by these ordinary differential equations (ODEs) are

graphic file with name TM00025.gif (25)
graphic file with name TM00026.gif (26)
graphic file with name TM00027.gif (27)
graphic file with name TM00028.gif (28)

As for the mean, the first three conditions stem from continuity of the variance at the transition points between the cell-cycle phases and the gene replication point, while the last boundary condition is due to binomial partitioning of the mRNA molecules at division with probability 1/2. This last condition can be derived from the equation linking the mRNA distribution at the beginning and the end of the cell cycle [68]:

graphic file with name TM00029.gif (29)

where P(i, θ) is the probability of observing i mRNA molecules at time θ.
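As a hedged worked step (in our own notation, since the displayed equations are not reproduced here), the effect of binomial partitioning on the first two moments follows from the laws of total expectation and total variance. The only assumption is that each molecule present at the end of the cycle is inherited by a given daughter cell independently with probability 1/2.

```latex
% m: mRNA count at the end of the cycle; m': count inherited by one daughter.
% Assumption: m' | m ~ Binomial(m, 1/2).
\begin{align}
\mathbb{E}[m'] &= \mathbb{E}\big[\mathbb{E}[m' \mid m]\big]
               = \tfrac{1}{2}\,\mathbb{E}[m], \\
\operatorname{Var}[m'] &= \mathbb{E}\big[\operatorname{Var}[m' \mid m]\big]
                        + \operatorname{Var}\big[\mathbb{E}[m' \mid m]\big]
                        = \tfrac{1}{4}\,\mathbb{E}[m] + \tfrac{1}{4}\,\operatorname{Var}[m].
\end{align}
```

In the notation of the text this would read 〈m〉(0) = 〈m〉(1)/2 and σ^2(0) = σ^2(1)/4 + 〈m〉(1)/4, where the argument 1 denotes the limit at the end of the cell cycle. Whether the final boundary conditions are written in exactly this form cannot be confirmed from the text available here.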

We can solve the ODEs for the means (given by equations (13)–(16)) and variances (equations (21)–(24)) with the corresponding boundary conditions (equations (17)–(20) and (25)–(28)) using Mathematica. As the obtained expressions are lengthy, we omit them from the manuscript—the complete Mathematica notebook is given in the associated GitHub repository (see ‘Data availability’ section).

Finally, we assume that the distribution of mRNA counts at any cell age θ is well approximated by a NB distribution whose first and second moments agree with the exact moment solutions derived above:

graphic file with name TM00030.gif (30)
graphic file with name TM00031.gif (31)
graphic file with name TM00032.gif (32)
graphic file with name TM00033.gif (33)

Note that this approximation is based on previous work which showed that in complex models of bursty expression including cell division, DNA replication and other phenomena, the time-dependent distributions were generally well approximated by the NB distribution [69]. This assumption is further supported by the model selection results using steady-state age-independent models, revealing that bursty genes are in almost all cases optimally fit by the NB model. Moreover, in Supplementary Text S2 (and Supplementary Fig. S2) we benchmark the performance of our NB approximation using synthetically generated count data, further validating its accuracy.
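Matching an NB distribution to a given mean and variance can be sketched as follows. The function names are hypothetical and the (r, p) parametrisation is one common convention, chosen so that r plays the role of the burst frequency and p = (1 + b)^−1; the paper's own code may be organised differently.

```python
import math

def nb_from_moments(mean, variance):
    """Return (r, p) of an NB distribution with the given mean and variance.

    Parametrisation: P(m) = Gamma(m + r) / (m! Gamma(r)) * p**r * (1 - p)**m,
    so that mean = r (1 - p) / p and variance = mean / p.
    """
    if mean <= 0 or variance <= mean:
        raise ValueError("NB requires variance > mean > 0")
    p = mean / variance
    r = mean * p / (1.0 - p)   # equivalently mean**2 / (variance - mean)
    return r, p

def nb_logpmf(m, r, p):
    """Log-probability of observing count m under NB(r, p)."""
    return (math.lgamma(m + r) - math.lgamma(m + 1) - math.lgamma(r)
            + r * math.log(p) + m * math.log1p(-p))
```

The guard on `variance <= mean` reflects that an NB distribution is always overdispersed relative to a Poisson distribution, so the approximation only applies where the model variance exceeds the model mean.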

Parameter inference

We use Julia [70] for parameter inference and subsequent analysis, utilising Distributions.jl [71] in the implementation and Makie.jl [72] for the visualisation of results, together with a number of numerical optimization packages listed below.

Maximum likelihood estimation

Let Inline graphic denote a vector of spliced mRNA counts for one gene over n cells in a specific cell-cycle phase. The likelihood of observing the data Inline graphic for a model with a count distribution Inline graphic given a set of parameters Inline graphic is defined by

graphic file with name TM00038.gif (34)

In practice, to find the maximum likelihood estimate (MLE) of the model parameters, Inline graphic, we minimize the negative log-likelihood:

graphic file with name TM00040.gif (35)

Note that for the age-dependent model we have to maximise the likelihood by explicitly taking into account the age θ of each cell. Over the entire cell cycle, with the associated θ bins in the range from θi = 0 to θf = 0.99, the optimal parameters will be given by

graphic file with name TM00041.gif (36)

where nθ is the number of cells in a given θ bin and miθ is the mRNA count in cell i with the same θ value.

We fit the steady-state age-independent gene expression models to the scRNA-seq data for all 3792 genes using maximum likelihood. Note that each model here refers to the total count distribution given by a k-fold convolution of the allele-specific count distribution with itself, where k is the number of gene copies in a given cell-cycle phase; throughout the paper we report the burst parameters per gene copy. We compute the count distributions explicitly (as given in the ‘Mechanistic models of stochastic gene expression’ section) to obtain the likelihood of each model. Although some studies have highlighted the computational challenges of evaluating the steady-state solution of the telegraph model involving a confluent hypergeometric function [42], we found our direct implementation using log-probabilities to be sufficiently numerically stable and more efficient than the corresponding beta-Poisson formulation using Gaussian quadratures [43, 73]. Note that parameter estimation using the age-dependent gene expression model is performed only for 1351 bursty genes, as covered in the ‘Gene filtering’ section.
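For the NB model the k-fold convolution has a simple closed form: the sum of k independent NB(f, p) variables is NB(kf, p). The sketch below checks this numerically and uses it in a negative log-likelihood. All function names are hypothetical, and the code illustrates the idea only; it is not the Julia implementation used in the study.

```python
import math

def nb_logpmf(m, r, p):
    """Log-probability of count m under NB(r, p)."""
    return (math.lgamma(m + r) - math.lgamma(m + 1) - math.lgamma(r)
            + r * math.log(p) + m * math.log1p(-p))

def convolve(a, b):
    """Discrete convolution of two pmf vectors, truncated to len(a)."""
    out = [0.0] * len(a)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            if i + j < len(out):
                out[i + j] += ai * bj
    return out

def k_fold_pmf(r, p, k, max_count):
    """pmf of the sum of k independent NB(r, p) copies on 0..max_count."""
    single = [math.exp(nb_logpmf(m, r, p)) for m in range(max_count + 1)]
    total = single
    for _ in range(k - 1):
        total = convolve(total, single)
    return total

def neg_log_likelihood(counts, f, b, k):
    """Negative log-likelihood of total counts from k gene copies,
    each bursting with per-copy frequency f and mean burst size b."""
    p = 1.0 / (1.0 + b)
    return -sum(nb_logpmf(m, k * f, p) for m in counts)
```

Here k = 2 would correspond to G1 and k = 4 to G2/M, so the fitted f and b remain per-copy quantities in both phases.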

To minimize the negative log-likelihood, we initially perform a global optimization using the adaptive differential evolution optimiser with radius limited sampling, as implemented in BlackBoxOptim.jl [74]. To ensure convergence, we then use the candidate solution as the initial value for a local search algorithm, for which we employ the BFGS optimiser implemented in Optim.jl [75]. The global and local optimization steps are each terminated after 60 s, although they generally converge much faster for the majority of genes. The optimiser is quite sensitive to the initial condition: for the telegraph-like steady-state models or the age-dependent model, it may get stuck in a bad local minimum or fail to converge by the designated time due to parameter unidentifiability. We found that simply restarting the inference procedure for the problematic genes typically resolved the issue. For this reason, we automate the procedure further by rerunning the inference for all genes with a different random initial condition for the global optimiser five times using the age-independent models (10 times for the age-dependent model) and choosing the fit with the best likelihood value.

The optimization is performed in linear, log- or logit-transformed parameter space with model-specific parameter bounds. Note that for the Poisson model the solution is simply given by the sample mean, whereas in the zero-inflated case the rate is constrained to ρ ∈ [−9, 7] on the log-scale. The search space for the NB model, NB(f, p = (1 + b)^−1), is constrained to f ∈ [−10, 10] on the log-scale and p ∈ [−30, 30] on the logit-scale (as p is a probability from zero to one). For the telegraph model, we constrain σ0, σ1, ρ ∈ [−9, 7] on the log-scale. The zero-inflated counterparts of the mentioned models have exactly the same parameter constraints, and we set the zero-inflation probability to p0 ∈ [−30, 30] on the logit-scale. The search space for the piecewise parameters characterizing the age-dependent model is constrained to f1, f2, ρ1, ρ2 ∈ [−7, 7] on the log-scale and β1, β2, β3, β4 ∈ [−100, 100] on the linear scale, in order to allow for the negative exponential scaling of transcription with the cell age θ observed in the S phase.
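The log and logit transforms mentioned above let an optimiser work with unbounded coordinates while the model parameters stay in their valid ranges. A minimal sketch for the NB model follows; the helper names are ours and the mapping is the standard one, not code taken from the study.

```python
import math

def logit(p):
    """Map a probability in (0, 1) to the real line."""
    return math.log(p) - math.log1p(-p)

def expit(x):
    """Numerically stable inverse of logit."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def nb_params_from_unconstrained(log_f, logit_p):
    """Map optimiser coordinates back to (f, p, b) for NB(f, p = 1/(1 + b))."""
    f = math.exp(log_f)
    p = expit(logit_p)
    b = math.exp(-logit_p)    # b = (1 - p) / p
    return f, p, b
```

On this scale the quoted bound p ∈ [−30, 30] corresponds to mean burst sizes between roughly e^−30 and e^30, which is effectively unconstrained for count data.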

Age-independent model selection

We determine the optimal (steady-state) age-independent model for each gene using the Bayesian information criterion (BIC):

$$\mathrm{BIC} = k \ln(n) - 2 \ln \hat{L} \qquad (37)$$

where k is the number of model parameters, n is the number of data points (cells), and Inline graphic denotes the maximized likelihood function for the data vector Inline graphic. Given two models M1 and M2, the difference between their respective BIC values Inline graphic and Inline graphic, Inline graphic, reflects the strength of evidence in favour of M2 over M1, with Δ > 10 constituting ‘very strong’ evidence for model M2 over M1 [76, 77]. We utilize this criterion to iteratively compare six different models (from the simplest to the most complex): Poisson, zero-inflated Poisson (ZIP), NB, zero-inflated negative binomial (ZINB), telegraph, and zero-inflated telegraph (ZI-telegraph) models.

We start by computing the difference in the BIC values between the Poisson model and all the other candidates, i.e. Inline graphic. If all five Δ values are <10, we pick the Poisson model as the optimal choice; otherwise, we proceed to compute the difference in the BIC values between the ZIP and the remaining (more complex) models, repeating the procedure until the best candidate is found. This allows us to filter out the models with practically unidentifiable parameters and select the simplest gene expression model that still provides a good fit to the mRNA count data. We verify the robustness of our model selection procedure in Supplementary Text S1 (also see Supplementary Fig. S1).
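The iterative comparison can be sketched as below. The BIC is computed with its standard definition, k ln(n) − 2 ln(maximized likelihood), and Δ is taken as the BIC of the simpler model minus that of the more complex one. The function names and the dictionary interface are illustrative assumptions rather than the authors' code.

```python
import math

# Candidate models ordered from the simplest to the most complex, as in the text.
MODELS = ["Poisson", "ZIP", "NB", "ZINB", "telegraph", "ZI-telegraph"]

def bic(num_params, num_cells, max_log_likelihood):
    """Standard Bayesian information criterion (lower is better)."""
    return num_params * math.log(num_cells) - 2.0 * max_log_likelihood

def select_model(bic_values, threshold=10.0):
    """Return the simplest model that no more complex model improves on
    by at least `threshold` BIC units.

    bic_values: dict mapping each name in MODELS to its BIC.
    """
    for i, simple in enumerate(MODELS):
        deltas = [bic_values[simple] - bic_values[other]
                  for other in MODELS[i + 1:]]
        if all(delta < threshold for delta in deltas):
            return simple
    return MODELS[-1]
```

Because the scan starts from the Poisson model, a more complex model is chosen only when it lowers the BIC by at least 10 relative to every simpler candidate that was considered before it.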

The model selection is performed for all 3792 genes in the dataset in both G1 and G2/M phases. As the burst frequency and size parameters cannot be extracted from the Poisson and ZIP models, all genes that either in G1 or in G2/M are best fit by the Poisson or ZIP models are excluded from further analysis. After this filtering step applied to the age-independent models, we are left with 1760 genes that in both cell-cycle phases are best fit by either the NB, ZINB or telegraph models. Namely, we find that the overwhelming majority of the remaining genes are best fit by the NB model; the ZINB is the optimal model for 22 genes in G1 and 15 genes in G2/M; the telegraph model fits only 1 gene in G1 and 2 genes in G2/M; and the ZI-telegraph model was not found to be optimal for any genes.

Gene filtering for the age-dependent model

Following the ‘model selection’ performed for (steady-state) age-independent models, we are left with 1760 bursty genes that are mostly best fit by the NB model. To proceed with the analysis of these bursty genes using the age-dependent model, we perform a series of gene filtering steps prior to and following parameter inference, as outlined below.

The age-dependent model requires us to provide an explicit mRNA degradation rate for each individual gene. We use the mRNA decay rates from Supplementary Table S2 [66] obtained for mESC lines MC1 (129S6/SvEvTac) and MC2-B6 (C57BL/6J) grown in the presence of Leukemia Inhibitory Factor (LIF): if the gene-specific decay rate is negative for MC1 due to experimental constraints, we use the decay rate obtained for MC2 cells if it was found to be positive (as the observed correlation between the strains on average is high), and consider the decay rate to be unknown otherwise. Out of the 1760 initial bursty genes we thus discard 334 genes that do not have an associated mRNA degradation rate.

Next, we filter out 75 genes that display a negative correlation between the mean mRNA transcription and cell age θ either in the G1 or G2/M phases. Most genes in this category tend to switch on in the S phase or display a downward trend in the mean counts in the G1 phase with respect to θ—these genes are outliers in terms of their phase-specific transcriptional profiles, and hence for simplicity we choose to discard them.

After the two filtering steps, we fit the age-dependent model to mRNA count data for the remaining 1351 bursty genes using maximum likelihood. We then filter out 291 genes with poor model fits, where the predicted mean expression in the G1 or G2/M phases negatively correlated with cell age θ, whereas the observed count data showed the opposite trend.

Lastly, we compute the coefficient of determination R^2 for the predicted mean and variance of age-dependent model fits over the entire cell cycle to identify any other extreme outliers. We define the coefficient of determination as R^2 = 1 − SSres/SStot, where SSres and SStot stand for the residual sum of squares and the total sum of squares, respectively. We choose to discard 16 genes that have negative associated R^2 values, which can occur when fitting a non-linear model without an intercept term, and which usually indicate bad model fits [78].
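A short sketch shows why this definition of R^2 can turn negative: whenever the model predictions fit worse than the constant sample mean, SSres exceeds SStot. The function name is ours and the inputs are generic sequences, for example the observed and predicted mean counts per θ bin.

```python
def r_squared(observed, predicted):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot
```

For instance, predictions that reverse the observed trend, such as `r_squared([1, 2, 3], [3, 2, 1])`, give a negative value, which is the kind of fit flagged as an outlier here.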

Following all filtering steps, we are left with 1044 genes, which we investigate further in the main text using the age-dependent model.

Confidence intervals

To quantify the uncertainty in the best-fit model estimates, we compute the 95% confidence intervals for each model parameter using profile likelihood [79, 80]. If we partition a vector of model parameters Inline graphic into a parameter of interest ϕ and a vector of nuisance parameters Inline graphic, i.e. Inline graphic, the profile likelihood of the parameter ϕ given a set of observations Inline graphic can be defined as:

graphic file with name TM00054.gif (38)

where we maximize the likelihood keeping ϕ fixed. It can be shown that the approximate 100(1 − α)% confidence interval for ϕ is given by [79]:

graphic file with name TM00055.gif (39)

where Inline graphic is the maximized likelihood (with all parameters allowed to vary) and Inline graphic is the (1 − α) quantile of the χ^2 distribution with one degree of freedom.

We can generalize the approach above to compute the confidence intervals for some function of the model parameters, Inline graphic, by defining the prediction profile likelihood [80, 81]:

graphic file with name TM00059.gif (40)

where we maximize the likelihood under the constraint that Inline graphic equals the value Inline graphic. Similarly to the confidence interval for a single model parameter, the 100(1 − α)% confidence interval for the set of predictions Inline graphic is given by:

graphic file with name TM00063.gif (41)

We can also find the confidence intervals for functions of the parameters of multiple different models by considering their joint likelihood. For example, consider the ratio of the burst frequency between the G2/M and G1 phases, Qf = fG2/M/fG1 (examined in Fig. 2C). Let Inline graphic and Inline graphic be the vectors of mRNA counts in G1 and G2/M, respectively, and similarly define Inline graphic and Inline graphic as the phase-specific model parameters. The joint likelihood is then given by

graphic file with name TM00068.gif (42)

and the prediction profile likelihood can be written as

graphic file with name TM00069.gif (43)

so that the confidence interval for Qf is similar in form to equation (41).

The outlined approach is used throughout the paper to compute the confidence intervals for any considered quantity Inline graphic. We implement this numerically by simply increasing (decreasing) Inline graphic in small steps starting from its maximum likelihood value, and recomputing the prediction profile likelihood at each step until the threshold defining its upper (lower) confidence interval limit is reached. We perform the constrained optimization using the Ipopt algorithm [82] through the MathOptInterface.jl [83] and Optimization.jl packages. The parameter bounds for each model are the same as previously used for maximum likelihood estimation.
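The stepwise scan can be illustrated on a deliberately simple toy problem. The sketch below uses a Gaussian model, not any of the gene expression models in this paper: the mean is the parameter of interest and the variance is the nuisance parameter, which can be profiled out analytically, so no constrained optimiser such as Ipopt is needed. The cutoff uses the standard likelihood-ratio threshold of half the 0.95 quantile of the χ^2 distribution with one degree of freedom. All names are hypothetical.

```python
import math

CHI2_95_DF1 = 3.841458820694124  # 0.95 quantile of chi-squared with 1 d.o.f.

def profile_loglik(mu, data):
    """Gaussian log-likelihood at fixed mu, maximised over the variance."""
    n = len(data)
    var_hat = sum((x - mu) ** 2 for x in data) / n
    return -0.5 * n * (math.log(2.0 * math.pi * var_hat) + 1.0)

def profile_ci(data, step=1e-3):
    """Scan outwards from the MLE until the profile drops below the cutoff."""
    mu_hat = sum(data) / len(data)
    cutoff = profile_loglik(mu_hat, data) - 0.5 * CHI2_95_DF1
    lower = upper = mu_hat
    while profile_loglik(lower - step, data) >= cutoff:
        lower -= step
    while profile_loglik(upper + step, data) >= cutoff:
        upper += step
    return lower, upper
```

The step size sets the resolution of the interval limits. For the models in the paper each scan step would instead require a constrained re-optimisation over the nuisance parameters.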

Binomial downsampling of count data

In this section, we describe the implementation details of our binomial downsampling strategy for scRNA-seq data, which is analysed in the main text. In brief, we simulate the true mRNA count data and downsample it based on a binomial mRNA capture model [84, 85] (visualized in Fig. 4D left) assuming that there is a finite capture probability p of detecting each mRNA molecule in each cell in an scRNA-seq experiment (where p varies from cell to cell according to a probability distribution).

Figure 4.

Effect of extrinsic noise due to age-dependent transcription rates and technical noise on the burst parameter inference. (A) Burst frequency (top) and burst size (bottom) ratios between θ-dependent (θ-dep) and θ-independent (θ-ind) model estimates for five different genes in the dataset. The ratios are inferred both for G1 and G2/M phases. The error bars show 95% confidence intervals. The dashed lines indicate no change in the parameter values between the two cell-cycle phases. (B) Raincloud plots of burst frequency (top) and burst size (bottom) ratios between θ-dep and θ-ind model estimates for all genes, inferred for G1 and G2/M phases, separately. (C) Scatter plot of the burst size ratio, Inline graphic, versus the burst frequency ratio, Inline graphic, between the G2/M and G1 phases for all 1044 genes (each dot indicates a single gene) estimated with the θ-dep bursty model. The smoothed histogram on the right represents the distribution of burst size ratios, and the distribution of burst frequency ratios is given on the top. The equivalent distributions of burst frequency and burst size ratios between G2/M and G1 obtained using the θ-ind model (shown in Fig. 2D), i.e. Inline graphic and Inline graphic, are superimposed as transparent smoothed histograms with darker contours for comparison. (D) Left: Illustration of the binomial model of mRNA capture. In a sequencing experiment, only a fraction of the original mRNA counts will be detected due to the finite capture probability p that varies between cells. This can be simulated by sampling the true counts from the age-dependent model and subsequently downsampling them using a binomial model (with a cell-specific value of p) to mimic the observed data.
Right: Scatter plot of the burst frequency and burst size ratios between the G2/M and G1 phases estimated by fitting the age-dependent model to the generated (true) count data and the corresponding downsampled (observed) data, obtained by simulating binomial transcript capture (see text for details).

The mESC library published by Riba et al. [57] is generated using the 10x Genomics Single Cell 3’ Reagent Kit (v3), which reportedly (according to the manufacturer) captures 30–32% of transcripts per cell, and hence we assume that the mean capture efficiency is given by 〈p〉 = 0.3. To address the heterogeneity in capture efficiencies amongst cells, we add small noise by assigning the capture efficiency p for each cell to be a random number drawn from a Beta(30, 70) distribution with mean 〈p〉 and coefficient of variation CVp ≈ 0.15. Although the assumed noise model is quite simplistic, a more accurate assignment of capture efficiencies would require spike-in data (unavailable for this mESC experiment) and careful statistical analysis to correct for other potential cell- or gene-specific biases. To our knowledge, studies focusing on mRNA capture efficiency are quite limited and make additional modelling assumptions [84, 85], while more relevant analyses for the 10x Genomics v3 Kit do not appear to be publicly available.

We consider 200 genes randomly chosen from the subset of all bursty genes fit by the age-dependent model whose burst frequency and burst size ratios between the G2/M and G1 phases, i.e. Qf and Qb values, lie in their respective 5–95 percentile ranges (Fig. 4C). For each gene, we use the best-fit age-dependent model and rescale its parameters ρ1 and ρ2 by the inverse of the average capture efficiency 〈p〉, thus rescaling the mean burst size so that the downsampled counts resemble the experimental data. The observed mRNA count x for each gene in each cell at a certain cell age θ is obtained by sampling the true count y from the age-dependent NB mRNA distribution of the rescaled model and subsequently downsampling y so that x ∼ Binomial(y, p). By repeating this process for each of 5294 cells, we construct a downsampled dataset equivalent in structure to the original data with the same number of data points in each θ bin. We use this procedure to generate 100 such datasets and fit the age-dependent model to each of them. In Fig. 4D right, we plot the observed median burst parameter ratios between the G2/M and G1 phases with their respective interquartile ranges for the 200 selected genes against the ground-truth estimates fit to the original count data (for further discussion see the Results section).
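The downsampling step can be sketched with the standard library alone. For brevity the example draws true counts from a single NB distribution (as a gamma-Poisson mixture) instead of the age-dependent NB used in the paper, so it illustrates the capture model only. The function names and the parameter values in the usage note are hypothetical.

```python
import math
import random

def sample_poisson(lam, rng):
    """Poisson sample via Knuth's multiplication method (fine for moderate lam)."""
    limit, k, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def sample_nb(r, b, rng):
    """NB sample with mean r * b, drawn as a gamma-Poisson mixture."""
    return sample_poisson(rng.gammavariate(r, b), rng)

def downsample(true_count, capture_prob, rng):
    """Binomial thinning: each molecule is captured with probability capture_prob."""
    return sum(rng.random() < capture_prob for _ in range(true_count))

def simulate_observed_counts(r, b_observed, n_cells, rng, mean_capture=0.3):
    """Rescale the burst size by 1 / mean_capture, sample true counts, then thin."""
    b_true = b_observed / mean_capture
    observed = []
    for _ in range(n_cells):
        p = rng.betavariate(30, 70)              # cell-specific capture efficiency
        y = sample_nb(r, b_true, rng)            # true count
        observed.append(downsample(y, p, rng))   # observed count
    return observed
```

A call such as `simulate_observed_counts(2.0, 3.0, 5000, random.Random(0))` yields counts whose mean stays close to r × b_observed, which is the purpose of rescaling the burst size before thinning.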

Results

Inference of cell-cycle dependent burst parameters using age-independent models

To understand whether the burst parameters vary across the cell cycle, we calculate the mean mRNA counts of each gene using data from all cells that are in G1 (〈mG1〉) and, similarly, using data from all cells in G2/M (〈mG2/M〉). In Fig. 2A, we show the distribution of the ratio of the means (Λ = 〈mG2/M〉/〈mG1〉). Note that the mass of the distribution is concentrated between Λ = 1 and Λ = 2 with a median Λ = 1.40. The case Λ = 2 is expected if there is no change in the rate parameters from G1 to G2/M, since the doubling of the gene copies would simply imply a corresponding doubling of the mean mRNA count. In contrast, the case Λ = 1 is what can be described as perfect gene dosage compensation, in the sense that the rate parameters have changed in such a way that the mean mRNA count remains unaltered as the cell cycle progresses. For the vast majority of genes, Λ is neither 1 nor 2, implying that there is a change in the rate parameters between the two cell-cycle phases but it is insufficient to perfectly compensate for the doubling of the gene copy number during DNA replication. Hence, we conclude that partial dosage compensation is the norm across the transcriptome.

Given the foregoing conclusion that for most genes there is a change in the rate parameters as the cell cycle progresses, we next infer the burst parameter values in each phase. We assume that in each of the two phases, the distribution of spliced mRNA counts is well described by the steady-state distribution of a stochastic model of gene expression describing transcription from N independent promoters, where N = 2 in the G1 phase and N = 4 in the G2/M phase. The mRNA count distribution of this model is obtained by convolving the steady-state mRNA distribution of a stochastic model of gene expression for one promoter with itself N times; this is equivalent to assuming that expression from individual gene copies is independent of each other, a property of many eukaryotic genes [22, 34, 43] (this assumption can be avoided if the scRNA-seq data is allele-specific [43]). We refer to this type of model as an age-independent model to emphasize that here we assume that all cells within each cell-cycle phase, independent of their cell age, have the same burst size and frequency; the burst parameters only change with cell-cycle phase. Using the model count distribution and the transcript count data for each cell, we use the method of maximum likelihood to estimate the rate parameters (normalized by the mRNA degradation rate) in each cell-cycle phase. We fit six different gene expression models (telegraph model, NB model, Poisson model, and their equivalents with zero inflation) separately to G1 and G2/M count data from 3792 genes that satisfied quality control criteria, and then select the optimal model using the BIC. More details regarding the age-independent models, parameter inference and model selection are provided in the Materials and methods section.

We note that in our repertoire of models we have not included models more complex than the telegraph model. Some (non-sequencing) studies that can directly measure the distribution of the times spent in the active and inactive gene state suggest the three-state model as the optimal model of gene expression in mammalian cells [2, 86]. Models that are more complex than the (two-state) telegraph model can be fit to snapshot scRNA-seq data to estimate the burst parameters [87]. However, as the steady-state mRNA count distributions of the three-state and the telegraph models can often equally well fit snapshot count data from a few thousand cells [88–90], a model selection algorithm is unlikely to select a three-state model over a two-state model because the former has more parameters than the latter; this issue is particularly pronounced for bursty gene expression, since both models reduce to the NB model in this case.

We find that 55% of genes are best fit by a Poisson or a ZIP model, which describe non-regulated expression from an active gene state. The rest of the genes are best fit by models that implicitly or explicitly assume gene state switching between active and inactive states, and hence we call them bursty genes since their expression occurs in bursts when the gene is active. Most of these genes are fit by the NB model (44%) and very few (1%) by the rest of the models. The model that most successfully describes bursty genes, the NB model (also called the bursty model), is illustrated in Fig. 2B. Note that in these age-independent models, the burst parameters are normalized by the mRNA degradation rate (‘Materials and methods’ section).

For each of the bursty genes, of which there are 1760, we calculate Qf = fG2/M/fG1 and Qb = bG2/M/bG1—the ratios of the burst frequency and burst size (per gene copy) in the G2/M and G1 cell-cycle phases, respectively. In Fig. 2C, we show these ratios for five genes. Note that there appears to be a tendency for the burst frequency in the G1 phase to be greater than that in the G2/M phase, and the opposite tendency is observed for the burst size. In Fig. 2D, we show the distributions of Qf and Qb for all genes, confirming that for most genes there is an attenuation of the burst frequency and a corresponding amplification of the burst size as the cell cycle progresses. Specifically, the distribution of Qf has a median of 0.49 with the first and third quartiles of 0.44 and 0.56, while the distribution of Qb has a median of 1.41 with the first and third quartiles of 1.23 and 1.60.
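A brief worked relation may help connect these ratios to the ratio of means Λ discussed earlier. It is our own sketch and assumes that the mean of the age-independent NB model with N independent gene copies is N f b, with f and b the per-copy burst parameters as reported (N = 2 in G1 and N = 4 in G2/M).

```latex
\Lambda = \frac{\langle m_{G2/M} \rangle}{\langle m_{G1} \rangle}
        = \frac{4\, f_{G2/M}\, b_{G2/M}}{2\, f_{G1}\, b_{G1}}
        = 2\, Q_f\, Q_b .
```

Inserting the reported medians gives 2 × 0.49 × 1.41 ≈ 1.38, close to the median Λ = 1.40. Medians of ratios need not multiply exactly, so this is only a consistency check and not an identity between the reported summary statistics.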

Note that as a first step, we have fitted age-independent models to cell-cycle phase-specific data because their steady-state mRNA count distributions are all known in closed form and they are straightforward to fit using the method of maximum likelihood. However, these models suffer from two major disadvantages: (i) they implicitly assume that the cells in each of the cell-cycle phases are identical, i.e. the burst parameters within each cell-cycle phase do not vary from cell to cell, and hence cannot be age-dependent; (ii) they assume that shortly after a cell transitions from one cell-cycle phase into another, a new steady state is quickly reached. Regarding (i), as mentioned in the ‘Introduction’ section, there is evidence from smFISH studies of extrinsic noise within the cell cycle for various types of mammalian cells; however, we do not know of studies specific to mESCs. Regarding (ii), rough estimates can be made to see whether the steady-state assumption generally holds. The median mRNA half-life is estimated to be 7.1 h [66], while the median cell-cycle duration is estimated to be 13.25 h for mESCs with blocked differentiation [65]. Since the median half-life is comparable to or longer than the time spent in each cell-cycle phase, it is unlikely that a steady state is reached within each cell-cycle phase for most mRNA species. Evidence supporting this hypothesis is shown in Fig. 3A and C (top left plot): the spliced mRNA counts tend to increase with cell age θ within the cell cycle. It is hence possible that, due to the aforementioned limitations of age-independent models, the inference procedure has introduced biases in parameter estimation.
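The rough estimate in point (ii) can be made concrete with a small calculation. It assumes first-order mRNA decay and a production rate that changes in a single step and then stays constant, which is a simplification of the models in this paper; the function name is ours.

```python
import math

def fraction_relaxed(elapsed_hours, half_life_hours):
    """Fraction of the gap to a new steady-state mean that is closed a given
    time after a step change in production, for first-order decay:
    1 - exp(-d t) with d = ln(2) / half-life."""
    decay_rate = math.log(2.0) / half_life_hours
    return 1.0 - math.exp(-decay_rate * elapsed_hours)
```

With the quoted medians, `fraction_relaxed(13.25, 7.1)` is roughly 0.73, so even a full cell cycle would close only about three quarters of the gap to a new steady state, and any single phase correspondingly less.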

Constructing a sophisticated cell-age and cell-cycle phase-dependent model

A schematic of a cell-age dependent model of bursty gene expression that overcomes the limitations of previous models is shown in Fig. 3B. At its core, the model assumes that the transcription of mRNA M from each allele copy G occurs in bursts at an associated burst frequency f, where the number of molecules produced in a burst is a geometrically distributed variable with the mean burst size b, and the mRNA decays at a constant rate d [14]. We assume that all parts of the cell cycle are proportional to the total division time, akin to the stretched cell-cycle model proposed in [91, 92]. Specifically, we let θ = t/T, where T is the median cell-cycle duration, assuming that the cell age is equivalent to the normalized time within the cell cycle. It follows that the cell cycle starts at θ = 0, changes from G1 to S phase at θ = θG1, from pre-DNA replication to post-DNA replication at θ = θr and from S to G2/M phase at θ = θS. Note that since we do not know the precise cell age at which each gene replicates, for simplicity, for all genes we choose θr to lie exactly in the middle of the S phase (while this assumption may appear rough, as we show later on, it has only a minor influence on the inference results). Furthermore, it is assumed that the number of gene copies changes from 2 to 4 upon DNA replication, and that the mRNA copies are randomly partitioned with probability 1/2 between the two daughter cells upon (symmetric) cell division. What remains for full model specification is to devise functions that describe how the burst size and frequency change as a function of θ within the four ranges of the cell-cycle phase in our model (G1, S phase prior to replication, S phase post-replication and G2/M).

To obtain some insight into how we can choose these functions in a manner that is biologically realistic, we plot the total mRNA counts versus age (top left corner of Fig. 3C). The plot shows four main dynamical phases (increasing/decreasing/increasing/increasing), which correspond to the G1, early S, late S, and G2/M phases. It is not clear what causes the monotonic decrease in total mRNA counts with age during the early S phase, although there is limited evidence that transcription of a gene is downregulated before its replication occurs [93]. Using non-linear regression, the temporal variation of the mean count data in each of these four phases is found to be well fit by an exponential law k e^{βθ}, where k and β are gene- and phase-specific constants (top right corner and bottom of Fig. 3C).

In the ‘Materials and methods’ section, we argue that the age-dependent mean mRNA predicted by our model will follow the empirical piecewise exponential law if the burst frequency or the mean burst size are chosen to vary exponentially with cell age. We make the following specific choices for the burst parameters. The burst frequency is assumed to change with cell age in a piecewise manner, from a constant f1 (for θ < θr) to another constant f2 (for θ ≥ θr). The mean burst size is assumed to be given by b = ρ e^{βθ}, where ρ and β are piecewise constant functions of θ. The functions specifying the variation of the burst frequency and the mean burst size with θ are illustrated on the right-hand side of Fig. 3B.

Our functional choice for the burst parameters is in line with the findings from smFISH studies. These studies find that the burst frequency decreases upon DNA replication [20, 21] but does not change with cell volume, whereas the burst size increases with cell volume [21]. Given that cell age can be interpreted as a rough proxy for cell volume, these findings suggest that the burst size should depend on cell age. Specifically, for exponentially growing cells that perfectly halve in size upon division and for which the growth rate is constant throughout the cell cycle, it is straightforward to show that V(θ) = V0 e^{log(2) θ}, where V(θ) is the cell volume at age θ and V0 is the cell birth volume. Since the burst size increases with cell volume, we choose the burst size to be b = ρ e^{βθ}, where we allow ρ and β to be some constants that are cell-cycle phase-dependent. This choice increases the flexibility of the model to fit the data and allows it to accommodate deviations due to incorrect or rough assumptions (such as constant growth rate). We note that the dependence of the burst size on cell age is a source of cell-to-cell variation within a cell-cycle phase, since there is a wide distribution of cell age within each phase. Hence, the age-dependent model overcomes the two limitations of the age-independent models mentioned earlier.

The equations for the temporal variation of the mean 〈m〉(θ) and the variance σ²(θ) of mRNA counts of the age-dependent model can be solved exactly using a computer algebra system (see the ‘Materials and methods’ section). The distribution of mRNA counts at normalized time θ is then assumed to be given by a NB distribution with mean 〈m〉(θ) and variance σ²(θ); note that while the exact time-dependent distribution is not a NB, this distribution is known to provide an excellent approximation for similar complex models of gene expression [69]. This choice is also motivated by the fact that the NB model was the most commonly selected model when the scRNA-seq data were fit by the age-independent model (see the ‘Materials and methods’ section).

Finally, we use the method of maximum likelihood, with the likelihood being the aforementioned age-dependent NB, to fit the model to cell-age resolved data and estimate the parameters f1, f2, ρ1, ρ2, β1, β2, β3, β4 (Fig. 3B) for each gene. In Fig. 3D, we show that this procedure leads to a good model fit. Specifically, we show that for three sample genes the model predictions of the temporal dependence of the mean, variance and distribution of mRNA counts (obtained by evaluating the model using the estimated parameters) are in good agreement with the same statistics computed from the data. The accuracy of the inference procedure is further validated using simulated scRNA-seq data, where the ground truth parameters are known (Supplementary Text S2 and Supplementary Fig. S2).
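
As an illustration of the likelihood described above, the following is a minimal Python sketch, not the authors' implementation. It moment-matches a NB distribution to a given mean and variance and sums the log-probabilities over cells. The function names and the callables `mean_fn` and `var_fn` (standing in for the model's 〈m〉(θ) and σ²(θ)) are hypothetical, and the sketch assumes the variance exceeds the mean.

```python
import math


def nb_logpmf(k, mean, var):
    """Log-probability of count k under a NB with the given mean and variance.

    Assumes var > mean (overdispersion), as required for a NB distribution.
    """
    r = mean * mean / (var - mean)  # NB shape parameter from moment matching
    p = mean / var                  # NB success probability from moment matching
    return (
        math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
        + r * math.log(p) + k * math.log1p(-p)
    )


def log_likelihood(counts, ages, mean_fn, var_fn):
    """Sum of NB log-probabilities over cells.

    counts: observed mRNA counts, one per cell.
    ages:   normalized cell ages theta, one per cell.
    mean_fn, var_fn: hypothetical callables returning the model mean and
    variance at a given theta for one candidate parameter set.
    """
    return sum(
        nb_logpmf(k, mean_fn(theta), var_fn(theta))
        for k, theta in zip(counts, ages)
    )
```

Maximizing `log_likelihood` over the model parameters with any numerical optimizer would then give the point estimates; the choice of optimizer is left open here.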

Refinement of burst parameter inference using the cell-age dependent model

It is worthwhile to compare the burst parameters for each gene using the age-independent and the age-dependent models, since in this way we can determine the impact of the two main assumptions of the former age-independent model on the inference results. Because the two models differ in their parameterisation, a meaningful comparison must be done carefully, as follows.

Inference using the age-independent model led to a point estimate for the normalized burst frequency and normalized burst size in G1 and G2/M; multiplying these by the experimentally measured gene-specific mRNA degradation rate [66] (where available) leads to the absolute values of the burst parameters in the G1 and G2/M phases. In the age-dependent model, the burst frequency in G1 and G2/M phases simply corresponds to the parameters f1 and f2, respectively. To calculate the mean burst size for a gene in the G1 (G2/M) phase, we average the age-dependent burst size b(θ) = ρe^{βθ}, with ρ and β taking their values for that phase, over the cell age distribution in this phase. For more details, see Materials and methods.
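
The phase averaging of the burst size can be sketched as follows. This is an illustrative Python example rather than the authors' code: it assumes a single pair of constants (`rho`, `beta`) applies to all cells passed in, and it uses the empirical ages of the cells in a phase as the cell age distribution. The function name is hypothetical.

```python
import math


def phase_mean_burst_size(ages, rho, beta):
    """Average b(theta) = rho * exp(beta * theta) over the given cell ages.

    ages: normalized ages theta of the cells assigned to one phase
          (used here as an empirical stand-in for the age distribution).
    rho, beta: burst-size constants assumed valid for all of these cells.
    """
    if not ages:
        raise ValueError("need at least one cell age")
    return sum(rho * math.exp(beta * theta) for theta in ages) / len(ages)
```

For example, with `beta = 0` the burst size does not depend on age and the function simply returns `rho`.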

In Fig. 4A, we contrast the phase-dependent burst parameters computed using the two models for 5 genes. Note that for these genes, the burst frequency estimated from the age-dependent model is larger than from the age-independent model. In contrast, the burst sizes appear to be smaller, although it is difficult to ascertain because of the large errors in the parameter estimates. In Fig. 4B, we confirm that for the vast majority of bursty genes, using an age-independent model leads to a systematic underestimation of the burst frequency in both G1 and G2/M. Regarding the burst size, the age-independent model systematically overestimates it in G2/M; in G1, the burst size can be underestimated or overestimated with little influence on average. These results are broadly in agreement with a recent moment-based study that showed that extrinsic noise in parameter rates leads to systematic biases if they are inferred using a steady-state model that accounts only for intrinsic noise within each cell-cycle phase [19].

A main result of inference using the age-independent model was that the burst frequency decreases while the burst size increases as the cell cycle progresses from G1 to G2/M (Fig. 2D). Specifically, we inferred that the genome-wide distribution of the ratio of burst frequency between the G2/M and G1 phases for 1760 bursty genes has a median of 0.49 (1st and 3rd quartiles are 0.44 and 0.56), while for the burst size the median is 1.41 (1st and 3rd quartiles are 1.23 and 1.60). In contrast, the age-dependent model (for the subset of 1044 genes which can be described by this model; see Materials and methods) predicts that the burst frequency decreases, while the burst size remains approximately the same as the cell cycle progresses from G1 to G2/M (Fig. 4C); the burst frequency and size ratios between the G2/M and G1 phases have a median of 0.62 (1st and 3rd quartiles are 0.43 and 0.90) and of 0.86 (1st and 3rd quartiles are 0.64 and 1.15), respectively. Hence, the main difference between the predictions of the two models lies in how the burst size changes with cell-cycle phase.

In Supplementary Text S3, we redo the parameter inference using an alternative age-dependent model that, unlike the model in Fig. 3B, does not assume a fixed cell age at which genes replicate in the S phase. The fits to the mean, variance and the distributions of mRNA counts as a function of the cell age of the two age-dependent models are of comparable quality, although the alternative model gives larger confidence intervals (data not shown). The burst frequency and size ratios between the G2/M and G1 phases estimated using the alternative age-dependent model are in good agreement with the original model estimates (Supplementary Fig. S3), leading to the same conclusion as earlier, namely that the burst frequency decreases while the burst size remains approximately the same as cells transition from G1 to G2/M. Specifically, the alternative age-dependent model (with age-dependent burst size and variable replication timing) estimates that the burst frequency and size ratios between the G2/M and G1 phases have a median of 0.56 (1st and 3rd quartiles are 0.36 and 0.90) and of 1.05 (1st and 3rd quartiles are 0.71 and 1.57), respectively.

In Supplementary Text S4, we test yet another alternative age-dependent model, where we now enforce an exponential scaling of transcription with cell age through an age-dependent burst frequency (and piecewise constant burst size), in contrast to the original model (Fig. 3B) with an age-dependent burst size (and piecewise constant burst frequency). As shown in Supplementary Fig. S4, this model formulation results in largely similar fits to the moments and distributions of the mRNA counts as the original model, and the estimates of the burst parameter ratios between the G2/M and G1 phases are also in close agreement. Specifically, the alternative age-dependent model (with age-dependent burst frequency and fixed replication timing) estimates that the burst frequency and size ratios between the G2/M and G1 phases have a median of 0.50 (1st and 3rd quartiles are 0.36 and 0.69) and of 1.09 (1st and 3rd quartiles are 0.86 and 1.31), respectively. These findings suggest that placing the exponential scaling with cell age on either the burst size or the burst frequency has little influence on the results of our transcriptome-wide analysis. Nevertheless, we choose to focus on the model with an age-dependent burst size throughout the paper, as such parametrisation has grounding in previous work (discussed in the last section).

We note that inference using both age-independent and age-dependent models leads to a negative correlation between the burst frequency and burst size ratios (this is clear from the downward-right trend of the grey scatter points in Figs 2D and 4C). However, for the 1044 genes used in the age-dependent model inference, the correlation coefficient between the two ratios is −0.51 (the two alternative age-dependent models presented above have very similar associated correlation coefficients), while using the age-independent model for the same set of genes leads to a correlation coefficient of −0.82. This suggests that while some of the anti-correlation is an artefact of the assumptions behind the model used for inference, the rest may reflect underlying transcriptional mechanisms that co-regulate the burst parameters.

In parameter estimation using age-independent models, technical noise was to some extent taken into account by fitting models that allow for zero inflation. However, for the vast majority of genes, the zero-inflated models were rarely chosen by the model selection algorithm (Materials and methods). Whilst this implies that zero inflation is likely not an issue for the 10x sequencing dataset we have analysed (in agreement with Ref. [51]), technical noise introduced by the inherent downsampling of mRNA counts may still affect the presented analysis. For example, the real transcript count for a particular gene in a cell might be 10 molecules but only 6 of these are captured — in this case there is no inflation of the zero count, but clearly technical noise has impacted the number of measured counts. This implies that downsampling could have possibly introduced a systematic bias in our estimation of the ratio of burst parameters in the two cell-cycle phases.

To address this concern, using a binomial mRNA capture model [84, 85] (illustrated in Fig. 4D left), we developed a stochastic simulation-based strategy to estimate the bias introduced by ignoring this important source of technical noise in sequencing data. The idea is to simulate the true mRNA count data and then downsample it to resemble a virtual scRNA-seq experiment, fit the age-dependent model to the downsampled data and finally compare the estimated ratios of the burst parameters in G1 and G2/M (Qf and Qb) to those obtained with the true count data. To mimic the true count data for a gene in a cell measured at a certain cell age, we sample the NB distribution solution of the age-dependent model (Materials and methods). In this way, we generate data for 200 genes (each with a different set of rate parameters) in a population of cells, where the number of cells and the cell-age distribution match those of our mouse embryonic dataset. To mimic the observed scRNA-seq count data x for a particular gene and cell, we downsample the true count y using the binomial model, i.e. x ∼ Binomial(y, p), where p is the capture probability for the cell, assumed to be a random number drawn from the Beta distribution with mean 〈p〉 = 0.3 and coefficient of variation CVp ≈ 0.15. In Fig. 4D right, we show a scatter plot of the burst parameter ratios between the G2/M and G1 phases estimated using the true count data and the corresponding downsampled (observed) data. Each point in the plot is a median over 100 repeats of the procedure, where the error bars indicate the lower and upper quartiles. Remarkably, our results show that downsampling due to a finite capture probability has very little effect on the estimated ratio of burst parameters in the two cell-cycle phases, and hence our inference results are robust with respect to technical noise.
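
The downsampling step of this strategy can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the function names are hypothetical, the Beta shape parameters are obtained by matching the stated mean and coefficient of variation, and each count is thinned by independent Bernoulli trials, which is equivalent to a Binomial(y, p) draw.

```python
import random


def beta_shapes_from_mean_cv(mean, cv):
    """Shape parameters (alpha, beta) of a Beta distribution with given mean and CV."""
    var = (cv * mean) ** 2
    total = mean * (1.0 - mean) / var - 1.0  # equals alpha + beta
    return mean * total, (1.0 - mean) * total


def downsample_cell(true_counts, mean_p=0.3, cv_p=0.15, rng=random):
    """Thin the true counts of all genes in one cell with a shared capture probability.

    true_counts: true mRNA counts y, one per gene, for a single cell.
    Returns (observed_counts, p), where each observed count x ~ Binomial(y, p)
    and p is drawn once per cell from the Beta distribution.
    """
    alpha, beta = beta_shapes_from_mean_cv(mean_p, cv_p)
    p = rng.betavariate(alpha, beta)
    observed = [sum(rng.random() < p for _ in range(y)) for y in true_counts]
    return observed, p
```

Calling `downsample_cell` once per simulated cell yields a virtual scRNA-seq count matrix to which the age-dependent model can be refitted for comparison with the fit to the true counts.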

Discussion

In this paper, by fitting a variety of mechanistic models of stochastic gene expression to single-cell sequencing data that is cell-cycle specific, we have estimated the dependence of the burst parameters (the burst frequency and size) on the cell-cycle phase for about 1000 bursty genes in undifferentiated mouse embryonic stem cells. Inference using more sophisticated age-dependent models that account for noise due to transcriptional bursting, DNA replication, cell division and the coupling of gene expression with cell-cycle age reveals that, as the cell cycle progresses from G1 to G2/M, the median burst frequency decreases by about a half, while the median burst size remains approximately the same. However, the genome-wide distributions of the inferred ratios of burst frequency/size in G1 and G2/M are broad, suggesting a large degree of heterogeneity in transcriptional regulation patterns.

We note that whilst the burst parameters have previously been estimated using model-based inference applied to smFISH and scRNA-seq data, in the vast majority of cases, these were not estimated for individual cell-cycle phases [11–14, 19, 42, 44, 45, 48]. However, there are a few published studies which have reported this. An scRNA-seq-based study reported a comparison of single-allele burst parameters in G1, S and G2/M phases estimated using the telegraph model (see extended data fig. 10 in [43]); because of the small numbers of cells in each phase (165, 28, and 31 in G1, S, and G2/M, respectively) the errors in these parameters are necessarily large, and hence it is difficult to come to any reasonable conclusion on how the burst parameters vary across the cell cycle. Two smFISH-based studies [20, 22] used model-based inference to estimate the burst parameters before and after replication, but these were limited to only three eukaryotic genes and did not account for extrinsic noise due to factors varying within a cell-cycle phase, such as cell size. In Ref. [21], the burst parameter dependence with cell-cycle phase and cell size for about 25 mammalian genes was roughly estimated directly from the number of active transcription sites and their fluorescent intensity in smFISH measurements (without the use of model-based inference). Therefore, our study is the first to our knowledge that uses mechanistic model-based inference to reliably estimate the ratio of the burst parameters pre- and post-replication for over a thousand bursty eukaryotic genes.

Our results quantitatively agree with the findings of other studies and extend their predictions to a much larger number of genes. We found that partial dosage compensation is the norm across the transcriptome. This is because the distribution of the ratio of the mean counts in G2/M and G1 had a median close to 1.40, while perfect gene dosage compensation would be consistent with 1 and no dosage compensation with 2. This is in agreement with the findings of previous smFISH studies investigating gene expression in a few eukaryotic genes: in [22] it was found that for the Oct4 and Nanog genes, the fold change in the nascent mRNA level between the G1 and G2/M phases was 1.28 and 1.51, respectively, while in [20] for the GAL10 gene the fold change was measured to be 1.7. We inferred the transcriptome-wide distribution of Qf, the ratio of the burst frequencies per allele in G2/M and G1, and estimated its median to be equal to 0.56 or 0.60, depending on the choice of age-dependent model (referring to values given in Supplementary Text S3 both for the original model with age-dependent burst size with fixed replication timing and the alternative model with age-dependent burst size without fixed replication timing). This is in good agreement with the few cases in which these were estimated in previous smFISH studies of eukaryotic gene expression. In [22], for the Oct4 and Nanog genes in mouse embryonic stem cells, it was found that the burst frequency ratio was Qf = 0.63 ± 0.06, while in [20] for the GAL10 gene in yeast, the frequency ratio was found to be Qf = 0.66 ± 0.17. In Ref. [21], from smFISH measurements of the ratio of the number of active transcription sites in a cell and the total number of gene copies, it was estimated that Qf ≈ 0.5 for a few tens of genes in human primary fibroblast and lung cancer lines. 
We also found that the transcriptome-wide distribution of Qb, the ratio of the burst size per allele in G2/M and G1, has a median equal to 0.85 or 1.05 (reported in Supplementary Text S3 both for the original age-dependent model and the alternative model without fixed replication timing). This is in agreement with what was estimated in [21] for a few tens of mammalian genes.

Our study also sheds light on the relationship between the choice of model and the results of parameter inference. While we found that the values of the burst parameters are quite sensitive to the choice of the mechanistic model used for inference, the ratio of the burst frequency in the G1 and G2/M phases is a robust estimate. In contrast, the inference of the burst size ratio in G1 and G2/M is more sensitive to the chosen model and is best estimated by an age-dependent model, which accounts for cell-cycle dynamics and the coupling of the burst size to cell age. Surprisingly, we found that technical noise does not affect the estimation of the burst frequency and size ratios. We also found that the use of (age-independent) models that assume steady state within each cell-cycle phase and no extrinsic noise due to the coupling of transcription to cell age led to a strong artificial correlation between the ratio of the burst frequency in G1 and G2/M and the corresponding burst size ratio; models that do not make these assumptions (age-dependent models) find a smaller but sizeable correlation, thus suggesting that this may be due to underlying biological mechanisms.

In the future, the availability of high-quality sequencing data from a large population of cells of one type (tens of thousands) may make it possible to fit more realistic models of gene expression than we have considered here. Examples of features that could be included are stochasticity in the duration of the cell cycle [94], cell-size homeostasis mechanisms [95, 96] and multiple transcription and mRNA degradation steps [97], all of which have some impact on the mRNA count distribution and therefore could potentially lead to further refinements of the inferred burst parameter values.

In conclusion, we have devised a simple, tractable and robust approach to estimate from scRNA-seq data the dependence of the transcriptional burst parameters on the cell-cycle phase. While the use of mechanistic models for the extraction of parameters controlling gene expression has become commonplace in fluorescence-based single-cell transcriptomics, their use in the field of single-cell sequencing has severely lagged behind [98]. This is largely due to the significant technical noise inherent to the sequencing technology and the difficulty of quantifying the sources of extrinsic noise. However, as we have shown here, with a carefully designed model-based inference approach, these issues present no impediment to building quantitative models of gene expression at the transcriptome level.

Acknowledgements

The authors would like to thank Kaan Öcal and Abhyudai Singh for useful discussions, as well as Michael Stumpf and the members of Theoretical Systems Biology Group at the University of Melbourne for their support. Lastly, the presented research was made possible by The University of Melbourne’s Research Computing Services and the Petascale Campus Initiative.

Author contributions: Augustinas Sukys (Methodology, Software, Formal analysis, Visualization, Writing—original draft, Writing—review & editing) and Ramon Grima (Conceptualization, Methodology, Supervision, Writing—original draft, Writing—review & editing)

Contributor Information

Augustinas Sukys, School of Biological Sciences, University of Edinburgh, Edinburgh EH9 3JH, United Kingdom; The Alan Turing Institute, London NW1 2DB, United Kingdom; School of BioSciences, University of Melbourne, Parkville, Victoria 3052, Australia.

Ramon Grima, School of Biological Sciences, University of Edinburgh, Edinburgh EH9 3JH, United Kingdom.

Supplementary data

Supplementary data is available at NAR online.

Conflict of interest

None declared.

Funding

This work was supported through an Alan Turing Institute Doctoral Studentship (EPSRC grant EP/N510129/1) and an Australian Research Council (ARC) Laureate Fellowship (FL220100005) for A.S., and a Leverhulme Trust grant (Grant No. RPG-2020-327) for R.G. Funding to pay the Open Access publication charges for this article was provided by University of Edinburgh.

Data availability

The original scRNA-seq dataset can be obtained from Riba et al. [57]. The processed data has been deposited at https://doi.org/10.5281/zenodo.10467234 and the code for this paper is available at https://github.com/augustinas1/CellCycle-RNAseq and https://zenodo.org/records/15096048.

References

  • 1. Raj  A, Peskin  CS, Tranchina  D  et al.  Stochastic mRNA synthesis in mammalian cells. PLoS Biol. 2006; 4:e309. 10.1371/journal.pbio.0040309. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2. Suter  DM, Molina  N, Gatfield  D  et al.  Mammalian genes are transcribed with widely different bursting kinetics. Science. 2011; 332:472–4. 10.1126/science.1198817. [DOI] [PubMed] [Google Scholar]
  • 3. Donovan  BT, Huynh  A, Ball  DA  et al.  Live-cell imaging reveals the interplay between transcription factors, nucleosomes, and bursting. EMBO J. 2019; 38:e100809. 10.15252/embj.2018100809. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4. Rodriguez  J, Larson  DR  Transcription in living cells: molecular mechanisms of bursting. Annu Rev Biochem. 2020; 89:189–212. 10.1146/annurev-biochem-011520-105250. [DOI] [PubMed] [Google Scholar]
  • 5. Tunnacliffe  E, Chubb  JR  What is a transcriptional burst?. Trends Genet. 2020; 36:288–97. 10.1016/j.tig.2020.01.003. [DOI] [PubMed] [Google Scholar]
  • 6. Sanchez  A, Golding  I  Genetic determinants and cellular constraints in noisy gene expression. Science. 2013; 342:1188–93. 10.1126/science.1242975. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7. Brouwer  I, Patel  HP, Meeussen  JVW  et al.  Single-molecule fluorescence imaging in living Saccharomycescerevisiae cells. STAR Protoc. 2020; 1:100142. 10.1016/j.xpro.2020.100142. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8. Lenstra  TL, Larson  DR  Single-molecule mRNA detection in live yeast. Curr Protoc Mol Biol. 2016; 113:14–24. 10.1002/0471142727.mb1424s113. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9. Peccoud  J, Ycart  B  Markovian modeling of gene-product synthesis. Theor Popul Biol. 1995; 48:222–34. 10.1006/tpbi.1995.1027. [DOI] [Google Scholar]
  • 10. Paulsson  J, Berg  OG, Ehrenberg  M  Stochastic focusing: fluctuation-enhanced sensitivity of intracellular regulation. Proc Natl Acad Sci USA. 2000; 97:7148–53. 10.1073/pnas.110057697. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11. Zenklusen  D, Larson  DR, Singer  RH  Single-RNA counting reveals alternative modes of gene expression in yeast. Nat Struct Mol Biol. 2008; 15:1263–71. 10.1038/nsmb.1514. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12. Senecal  A, Munsky  B, Proux  F  et al.  Transcription factors modulate c-Fos transcriptional bursts. Cell Rep. 2014; 8:75–83. 10.1016/j.celrep.2014.05.053. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13. Ochiai  H, Sugawara  T, Sakuma  T  et al.  Stochastic promoter activation affects Nanog expression variability in mouse embryonic stem cells. Sci Rep. 2014; 4:7125. 10.1038/srep07125. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14. Golding  I, Paulsson  J, Zawilski  SM  et al.  Real-time kinetics of gene activity in individual bacteria. Cell. 2005; 123:1025–36. 10.1016/j.cell.2005.09.031. [DOI] [PubMed] [Google Scholar]
  • 15. Rosenfeld  N, Young  JW, Alon  U  et al.  Gene regulation at the single-cell level. Science. 2005; 307:1962–5. 10.1126/science.1106914. [DOI] [PubMed] [Google Scholar]
  • 16. Gillespie  DT  Stochastic simulation of chemical kinetics. Annu Rev Phys Chem. 2007; 58:35–55. 10.1146/annurev.physchem.58.032806.104637. [DOI] [PubMed] [Google Scholar]
  • 17. Elowitz  MB, Levine  AJ, Siggia  ED  et al.  Stochastic gene expression in a single cell. Science. 2002; 297:1183–6. 10.1126/science.1070919. [DOI] [PubMed] [Google Scholar]
  • 18. Raser  JM, O’Shea  EK  Control of stochasticity in eukaryotic gene expression. Science. 2004; 304:1811–4. 10.1126/science.1098641. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19. Grima  R, Esmenjaud  PM  Quantifying and correcting bias in transcriptional parameter inference from single-cell data. Biophys J. 2023; 123:4–30. 10.1016/j.bpj.2023.10.021. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20. Fu  X, Patel  HP, Coppola  S  et al.  Quantifying how post-transcriptional noise and gene copy number variation bias transcriptional parameter inference from mRNA distributions. Elife. 2022; 11:e82493. 10.7554/eLife.82493. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21. Padovan-Merhar  O, Nair  GP, Biaesch  AG  et al.  Single mammalian cells compensate for differences in cellular volume and DNA copy number through independent global transcriptional mechanisms. Mol Cell. 2015; 58:339–52. 10.1016/j.molcel.2015.03.005. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22. Skinner  SO, Xu  H, Nagarkar-Jaiswal  S  et al.  Single-cell analysis of transcription kinetics across the cell cycle. Elife. 2016; 5:e12175. 10.7554/eLife.12175. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23. Sun  XM, Bowman  A, Priestman  M  et al.  Size-dependent increase in RNA polymerase II initiation rates mediates gene expression scaling with cell size. Curr Biol. 2020; 30:1217–30. 10.1016/j.cub.2020.01.053. [DOI] [PubMed] [Google Scholar]
  • 24. Berry  S, Pelkmans  L  Mechanisms of cellular mRNA transcript homeostasis. Trends Cell Biol. 2022; 32:655–68. 10.1016/j.tcb.2022.05.003. [DOI] [PubMed] [Google Scholar]
  • 25. Johnston  IG, Gaal  B, Neves  RPd  et al.  Mitochondrial variability as a source of extrinsic cellular noise. PLoS Comput Biol. 2012; 8:e1002416. 10.1371/journal.pcbi.1002416. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26. Foreman  R, Wollman  R  Mammalian gene expression variability is explained by underlying cell state. Mol Syst Biol. 2020; 16:e9146. 10.15252/msb.20199146. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27. Tang  F, Barbacioru  C, Wang  Y  et al.  mRNA-Seq whole-transcriptome analysis of a single cell. Nat Methods. 2009; 6:377–82. 10.1038/nmeth.1315. [DOI] [PubMed] [Google Scholar]
  • 28. Svensson  V, Vento-Tormo  R, Teichmann  SA  Exponential scaling of single-cell RNA-seq in the past decade. Nat Protoc. 2018; 13:599–604. 10.1038/nprot.2017.149. [DOI] [PubMed] [Google Scholar]
  • 29. Zheng  GX, Terry  JM, Belgrader  P  et al.  Massively parallel digital transcriptional profiling of single cells. Nat Commun. 2017; 8:14049. 10.1038/ncomms14049. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30. Yao  Z, Liu  H, Xie  F  et al.  A transcriptomic and epigenomic cell atlas of the mouse primary motor cortex. Nature. 2021; 598:103–10. 10.1038/s41586-021-03500-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31. Chen  S, Rivaud  P, Park  JH  et al.  Dissecting heterogeneous cell populations across drug and disease conditions with PopAlign. Proc Natl Acad Sci USA. 2020; 117:28784–94. 10.1073/pnas.2005990117. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32. Aibar  S, González-Blas  CB, Moerman  T  et al.  SCENIC: single-cell regulatory network inference and clustering. Nat Methods. 2017; 14:1083–6. 10.1038/nmeth.4463. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33. Chari  T, Gorin  G, Pachter  L  Biophysically interpretable inference of cell types from multimodal sequencing data. Nat Comput Sci. 2024; 4:677–89. 10.1038/s43588-024-00689-2. [DOI] [PubMed] [Google Scholar]
  • 34. Deng  Q, Ramsköld  D, Reinius  B  et al.  Single-cell RNA-seq reveals dynamic, random monoallelic gene expression in mammalian cells. Science. 2014; 343:193–6. 10.1126/science.1245316. [DOI] [PubMed] [Google Scholar]
  • 35. La  Manno G, Soldatov  R, Zeisel  A  et al.  RNA velocity of single cells. Nature. 2018; 560:494–8. 10.1038/s41586-018-0414-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36. Bergen  V, Lange  M, Peidli  S  et al.  Generalizing RNA velocity to transient cell states through dynamical modeling. Nat Biotechnol. 2020; 38:1408–14. 10.1038/s41587-020-0591-3. [DOI] [PubMed] [Google Scholar]
  • 37. Jaitin  DA, Kenigsberg  E, Keren-Shaul  H  et al.  Massively parallel single-cell RNA-seq for marker-free decomposition of tissues into cell types. Science. 2014; 343:776–9. 10.1126/science.1247651. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38. Patel  AP, Tirosh  I, Trombetta  JJ  et al.  Single-cell RNA-seq highlights intratumoral heterogeneity in primary glioblastoma. Science. 2014; 344:1396–401. 10.1126/science.1254257. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39. Trapnell  C, Cacchiarelli  D, Grimsby  J  et al.  The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nat Biotechnol. 2014; 32:381–6. 10.1038/nbt.2859. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40. Treutlein  B, Brownfield  DG, Wu  AR  et al.  Reconstructing lineage hierarchies of the distal lung epithelium using single-cell RNA-seq. Nature. 2014; 509:371–75. 10.1038/nature13173. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41. Xue  Z, Huang  K, Cai  C  et al.  Genetic programs in human and mouse early embryos revealed by single-cell RNA sequencing. Nature. 2013; 500:593–7. 10.1038/nature12364. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 42. Kim  JK, Marioni  JC  Inferring the kinetics of stochastic gene expression from single-cell RNA-sequencing data. Genome Biol. 2013; 14:R7. 10.1186/gb-2013-14-1-r7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 43. Larsson  AJM, Johnsson  P, Hagemann-Jensen  M  et al.  Genomic encoding of transcriptional burst kinetics. Nature. 2019; 565:251–4. 10.1038/s41586-018-0836-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 44. Tang  W, Jørgensen  ACS, Marguerat  S  et al.  Modelling capture efficiency of single cell RNA-sequencing data improves inference of transcriptome-wide burst kinetics. Bioinformatics. 2023; 39:btad395. 10.1093/bioinformatics/btad395. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 45. Öcal  K  Incorporating extrinsic noise into mechanistic modelling of single-cell transcriptomics. bioRxiv2 October 2023, preprint: not peer reviewed 10.1101/2023.09.30.560282. [DOI]
  • 46. Luo S, Wang Z, Zhang Z et al. Genome-wide inference reveals that feedback regulations constrain promoter-dependent transcriptional burst kinetics. Nucleic Acids Res. 2023; 51:68–83. 10.1093/nar/gkac1204.
  • 47. Gorin G, Vastola JJ, Fang M et al. Interpretable and tractable models of transcriptional noise for the rational design of single-molecule quantification experiments. Nat Commun. 2022; 13:7620. 10.1038/s41467-022-34857-7.
  • 48. Jiang Y, Zhang NR, Li M. SCALE: modeling allele-specific gene expression by single-cell RNA sequencing. Genome Biol. 2017; 18:74. 10.1186/s13059-017-1200-8.
  • 49. Kim JK, Kolodziejczyk AA, Ilicic T et al. Characterizing noise structure in single-cell RNA-seq distinguishes genuine from technical stochastic allelic expression. Nat Commun. 2015; 6:8687. 10.1038/ncomms9687.
  • 50. Jiang R, Sun T, Song D et al. Statistics or biology: the zero-inflation controversy about scRNA-seq data. Genome Biol. 2022; 23:31. 10.1186/s13059-022-02601-5.
  • 51. Svensson V. Droplet scRNA-seq is not zero-inflated. Nat Biotechnol. 2020; 38:147–50. 10.1038/s41587-019-0379-5.
  • 52. Cao Y, Kitanovski S, Küppers R et al. UMI or not UMI, that is the question for scRNA-seq zero-inflation. Nat Biotechnol. 2021; 39:158–9. 10.1038/s41587-020-00810-6.
  • 53. Kivioja T, Vähärautio A, Karlsson K et al. Counting absolute numbers of molecules using unique molecular identifiers. Nat Methods. 2012; 9:72–4. 10.1038/nmeth.1778.
  • 54. Ziegenhain C, Vieth B, Parekh S et al. Comparative analysis of single-cell RNA sequencing methods. Mol Cell. 2017; 65:631–43. 10.1016/j.molcel.2017.01.023.
  • 55. Grün D, Kester L, Van Oudenaarden A. Validation of noise models for single-cell transcriptomics. Nat Methods. 2014; 11:637–40. 10.1038/nmeth.2930.
  • 56. Volteras D, Shahrezaei V, Thomas P. Global transcription regulation revealed from dynamical correlations in time-resolved single-cell RNA sequencing. Cell Syst. 2024; 15:694–708. 10.1016/j.cels.2024.07.002.
  • 57. Riba A, Oravecz A, Durik M et al. Cell cycle gene regulation dynamics revealed by RNA velocity and deep-learning. Nat Commun. 2022; 13:2865. 10.1038/s41467-022-30545-8.
  • 58. Bergen V, Soldatov RA, Kharchenko PV et al. RNA velocity—current challenges and future perspectives. Mol Syst Biol. 2021; 17:e10282. 10.15252/msb.202110282.
  • 59. Gorin G, Fang M, Chari T et al. RNA velocity unraveled. PLOS Comput Biol. 2022; 18:e1010492. 10.1371/journal.pcbi.1010492.
  • 60. Stinchcombe AR, Peskin CS, Tranchina D. Population density approach for discrete mRNA distributions in generalized switching models for stochastic gene expression. Phys Rev E. 2012; 85:061919. 10.1103/PhysRevE.85.061919.
  • 61. Hicks SC, Townes FW, Teng M et al. Missing data and technical variability in single-cell RNA-sequencing experiments. Biostatistics. 2018; 19:562–78. 10.1093/biostatistics/kxx053.
  • 62. Choi K, Chen Y, Skelly DA et al. Bayesian model selection reveals biological origins of zero inflation in single-cell transcriptomics. Genome Biol. 2020; 21:183. 10.1186/s13059-020-02103-2.
  • 63. Chen W, Li Y, Easton J et al. UMI-count modeling and differential expression analysis for single-cell RNA sequencing. Genome Biol. 2018; 19:70. 10.1186/s13059-018-1438-9.
  • 64. Townes FW, Hicks SC, Aryee MJ et al. Feature selection and dimension reduction for single-cell RNA-seq based on a multinomial model. Genome Biol. 2019; 20:295. 10.1186/s13059-019-1861-6.
  • 65. Waisman A, Sevlever F, Elías Costa M et al. Cell cycle dynamics of mouse embryonic stem cells in the ground state and during transition to formative pluripotency. Sci Rep. 2019; 9:8051. 10.1038/s41598-019-44537-0.
  • 66. Sharova LV, Sharov AA, Nedorezov T et al. Database for mRNA half-life of 19 977 genes obtained by DNA microarray analysis of pluripotent and differentiating mouse embryonic stem cells. DNA Res. 2009; 16:45–58.
  • 67. Hiratani I, Ryba T, Itoh M et al. Global reorganization of replication domains during embryonic stem cell differentiation. PLoS Biol. 2008; 6:e245.
  • 68. Berg OG. A model for the statistical fluctuations of protein numbers in a microbial population. J Theor Biol. 1978; 71:587–603.
  • 69. Cao Z, Grima R. Analytical distributions for detailed models of stochastic gene expression in eukaryotic cells. Proc Natl Acad Sci USA. 2020; 117:4682–92.
  • 70. Bezanson J, Edelman A, Karpinski S et al. Julia: a fresh approach to numerical computing. SIAM Rev. 2017; 59:65–98.
  • 71. Besançon M, Papamarkou T, Anthoff D et al. Distributions.jl: definition and modeling of probability distributions in the JuliaStats ecosystem. J Stat Softw. 2021; 98:1–30. 10.18637/jss.v098.i16.
  • 72. Danisch S, Krumbiegel J. Makie.jl: flexible high-performance data visualization for Julia. J Open Source Softw. 2021; 6:3349. 10.21105/joss.03349.
  • 73. Vu TN, Wills QF, Kalari KR et al. Beta-Poisson model for single-cell RNA-seq data analyses. Bioinformatics. 2016; 32:2128–35. 10.1093/bioinformatics/btw202.
  • 74. Feldt R, Stukalov A. BlackBoxOptim.jl. GitHub; https://github.com/robertfeldt/BlackBoxOptim.jl.git.
  • 75. Mogensen PK, Riseth AN. Optim: a mathematical optimization package for Julia. J Open Source Softw. 2018; 3:615. 10.21105/joss.00615.
  • 76. Raftery AE. Bayesian model selection in social research. Sociol Methodol. 1995; 25:111–63. 10.2307/271063.
  • 77. Neath AA, Cavanaugh JE. The Bayesian information criterion: background, derivation, and applications. WIREs Comput Stat. 2012; 4:199–203. 10.1002/wics.199.
  • 78. Kvålseth TO. Cautionary note about R². Am Stat. 1985; 39:279–85.
  • 79. Pawitan Y. In All Likelihood. 2013; Oxford, England, UK: Oxford University Press.
  • 80. Kreutz C, Raue A, Kaschek D et al. Profile likelihood in systems biology. FEBS J. 2013; 280:2564–71. 10.1111/febs.12276.
  • 81. Kreutz C, Raue A, Timmer J. Likelihood based observability analysis and confidence intervals for predictions of dynamic models. BMC Syst Biol. 2012; 6:120. 10.1186/1752-0509-6-120.
  • 82. Wächter A, Biegler LT. On the implementation of an interior-point filter line-search algorithm for large-scale nonlinear programming. Math Program. 2006; 106:25–57. 10.1007/s10107-004-0559-y.
  • 83. Legat B, Dowson O, Dias Garcia J et al. MathOptInterface: a data structure for mathematical optimization problems. INFORMS J Comput. 2021; 34:672–89. 10.1287/ijoc.2021.1067.
  • 84. Klein AM, Mazutis L, Akartuna I et al. Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells. Cell. 2015; 161:1187–201. 10.1016/j.cell.2015.04.044.
  • 85. Tang W, Bertaux F, Thomas P et al. BayNorm: Bayesian gene expression recovery, imputation and normalization for single-cell RNA-sequencing data. Bioinformatics. 2020; 36:1174–81. 10.1093/bioinformatics/btz726.
  • 86. Wan Y, Anastasakis DG, Rodriguez J et al. Dynamic imaging of nascent RNA reveals general principles of transcription dynamics and stochastic splice site selection. Cell. 2021; 184:2878–95. 10.1016/j.cell.2021.04.012.
  • 87. Luo S, Zhang Z, Wang Z et al. Inferring transcriptional bursting kinetics from single-cell snapshot data using a generalized telegraph model. Roy Soc Open Sci. 2023; 10:221057. 10.1098/rsos.221057.
  • 88. Cao Z, Filatova T, Oyarzún DA et al. A stochastic model of gene expression with polymerase recruitment and pause release. Biophys J. 2020; 119:1002–14. 10.1016/j.bpj.2020.07.020.
  • 89. Nicoll AG, Szavits-Nossan J, Evans MR et al. Transient power-law behaviour following induction distinguishes between competing models of stochastic gene expression. Nat Commun. 2025; 16:2833.
  • 90. Wang Y, Szavits-Nossan J, Cao Z et al. Joint distribution of nuclear and cytoplasmic mRNA levels in stochastic models of gene expression: analytical results and parameter inference. bioRxiv, 15 October 2024, preprint: not peer reviewed. 10.1101/2024.04.29.591679.
  • 91. Dowling MR, Kan A, Heinzel S et al. Stretched cell cycle model for proliferating lymphocytes. Proc Natl Acad Sci USA. 2014; 111:6377–82. 10.1073/pnas.1322420111.
  • 92. Jia C, Grima R. Coupling gene expression dynamics to cell size dynamics and cell cycle events: exact and approximate solutions of the extended telegraph model. iScience. 2023; 26:105746. 10.1016/j.isci.2022.105746.
  • 93. Tsirkas I, Dovrat D, Thangaraj M et al. Transcription-replication coordination revealed in single live cells. Nucleic Acids Res. 2022; 50:2143–56. 10.1093/nar/gkac069.
  • 94. Perez-Carrasco R, Beentjes C, Grima R. Effects of cell cycle variability on lineage and population measurements of messenger RNA abundance. J R Soc Interface. 2020; 17:20200360. 10.1098/rsif.2020.0360.
  • 95. Jia C, Grima R. Frequency domain analysis of fluctuations of mRNA and protein copy numbers within a cell lineage: theory and experimental validation. Phys Rev X. 2021; 11:021032. 10.1103/PhysRevX.11.021032.
  • 96. Jia C, Singh A, Grima R. Concentration fluctuations in growing and dividing cells: insights into the emergence of concentration homeostasis. PLoS Comput Biol. 2022; 18:e1010574.
  • 97. Weidemann DE, Holehouse J, Singh A et al. The minimal intrinsic stochasticity of constitutively expressed eukaryotic genes is sub-Poissonian. Sci Adv. 2023; 9:eadh5138.
  • 98. Gorin G, Vastola JJ, Pachter L. Studying stochastic systems biology of the cell with single-cell genomics data. Cell Syst. 2023; 14:822–43.
  • 99. Allen M, Poggiali D, Whitaker K et al. Raincloud plots: a multi-platform tool for robust data visualization. Wellcome Open Res. 2021; 4:63. 10.12688/wellcomeopenres.15191.2.

Associated Data

Supplementary Materials

gkaf295_Supplemental_File

Data Availability Statement

The original scRNA-seq dataset can be obtained from Riba et al. [57]. The processed data have been deposited at https://doi.org/10.5281/zenodo.10467234, and the code for this paper is available at https://github.com/augustinas1/CellCycle-RNAseq and https://zenodo.org/records/15096048.


Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press
