Author manuscript; available in PMC: 2024 Oct 18.
Published in final edited form as: Cell Syst. 2023 Sep 25;14(10):822–843.e22. doi: 10.1016/j.cels.2023.08.004

Studying stochastic systems biology of the cell with single-cell genomics data

Gennady Gorin 1, John J Vastola 2, Lior Pachter 3,4,5,*
PMCID: PMC10725240  NIHMSID: NIHMS1935058  PMID: 37751736

Abstract

Recent experimental developments in genome-wide RNA quantification hold considerable promise for systems biology. However, rigorously probing the biology of living cells requires a unified mathematical framework that accounts for single-molecule biological stochasticity in the context of technical variation associated with genomics assays. We review models for a variety of RNA transcription processes, as well as the encapsulation and library construction steps of microfluidics-based single-cell RNA sequencing, and present a framework to integrate these phenomena by the manipulation of generating functions. Finally, we use simulated scenarios and biological data to illustrate the implications and applications of the approach.

Graphical Abstract

[Graphical abstract figure]

1. INTRODUCTION

In his classic systems biology textbook1, D. J. Wilkinson notes that “Improvements in experimental technology are enabling quantitative real-time imaging of expression at the single-cell level, and improvement in computing technology is allowing modelling and stochastic simulation of such systems at levels of detail previously impossible. The message that keeps being repeated is that the kinetics of biological processes at the intra-cellular level are stochastic, and that cellular function cannot be properly understood without building that stochasticity into in silico models”. From this perspective, systems biology studies control over randomness, and the ways in which living cells exploit variability to grow and function. Counterintuitively, this stochastic weltanschauung relies on mental models that are inherently deterministic: differentiation landscapes26, gene expression manifolds7, cellular state graphs8,9, gene regulatory networks10,11, and kinetic parameters12. Analysis of experimental data therefore requires reconciling underlying deterministic structure with biological stochasticity and experimental technical variability, or noise. In particular, distinguishing technical noise from biological stochasticity involves the statistical modeling of experimental readouts, expected noise sources, and the signal-to-noise ratio, and requires consideration of the theoretical and computational tractability of the model.

How can we model these features—latent deterministic structure, biological stochasticity, and technical noise—in a way that balances our models’ ability to adequately describe available data with our own ability to adequately understand the mathematical behavior and biological interpretation of our models? Answering this question is particularly challenging in the context of single-cell genomics, where datasets are large and sparse, the signal-to-noise ratio is low, and stochasticity is one of the defining features of the underlying biophysics1315. Here, we explain why many naïve approaches to understanding the stochastic systems biology of single cells fall short, and describe a theoretical framework that can serve as an alternative. Our framework extends recent work on the mechanistic modeling of single-cell RNA count distributions1621, and addresses both how models can be efficiently fit to single-cell data, and what features of the underlying biology we can hope to learn.

After introducing the general framework, we illustrate its consequences through a series of vignettes. In each case, we consider modeling particular aspects of biological and technical noise, and ask: (1) What do our models help us learn about the underlying biology? and (2) What could go wrong if we ignored these features of our data? We find that certain kinds of noise must be carefully modeled, others are poorly identifiable, while others still cannot be identified at all and can be safely ignored.

2. SYSTEMS BIOLOGY AND SINGLE-CELL GENOMICS

2.1. Standard approaches to systems biology

If an experiment has ample controls and provides a readout with a high signal-to-noise ratio in the relevant variables, coarse-grained, moment-based models can be ideal. For example, investigations of cell growth have effectively used least-squares regression to fit scaling relationships between cell volume and molecular abundance that hold on average22,23. Analogously, experiments leveraging the integration of multiple fluorescent reporters have successfully decomposed molecular noise sources into intrinsic and extrinsic components24, leading to numerous analytical2528 and experimental2931 extensions that leverage the lower moments of poorly-characterized biological drivers to describe or delimit the system variability. These approaches, which have found application to new experimental techniques, have origins in the Onsager and Langevin theories of the early twentieth century32, which specify the moment behaviors of near-equilibrium statistical thermodynamic systems using Gaussian terms.

Alongside studying biology on a gene-by-gene basis, considerable effort has been dedicated to the discovery of regulatory networks. This problem is considerably more challenging: the number of candidate network modules rapidly grows with the number and size of motifs of interest, and simple moment-based models risk distorting key qualitative features, such as multistability. From the perspective of statistics, network inference requires specifying or bypassing likelihood functions for joint gene expression, which may combine various noise sources in addition to the “signal” of regulation. Typical ways of addressing this challenge include33,34:

  1. The purely descriptive approach, which interprets an expression correlation matrix as a graph, but does not provide an easily interpretable way to extract its “signal.”

  2. Thresholding, which bins the unknown observed distribution to obtain a known, but lower-information distribution, as with binarization used to construct Boolean networks35 or implement the phixer algorithm36.

  3. Distributional assertion, which fits static observations by assuming statistics or observations are Gaussian, as in a variety of popular Bayesian34, information-theoretic37, and regression-based38 methods; this assumption may39 or may not40 provide accurate results.

  4. The dynamic approach, which fits a time-dependent trajectory to data assuming Gaussian residuals; this assumption may reflect stochastic differential equation dynamics41 or isotropic observation noise added to a latent process4244.

This overview is far from exhaustive, but it demonstrates a key theme: relatively robust signal, such as the lower moments or the absence/presence of gene expression, can be treated using fairly simple models that rely on highly optimized, well-understood methods and algorithms developed in the context of signal processing and dynamical systems analysis. Which simple model may perform best is not known a priori, and heavily depends on the task33. Ideally, methods are benchmarked on simulated39,45 or well-characterized “gold standard”33,46 datasets to glean partial insight about their performance and limitations. In this framework, improving the signal-to-noise ratio requires either designing more precise readouts or sacrificing a portion of the obtained data.

2.2. The challenge of single-cell data

Advances in sequencing technologies, most dramatically the rapid commercialization and adoption of single-cell RNA sequencing (scRNA-seq), which can profile millions of cells on a genome-wide scale47,48, have been heralded as a promising frontier for systems biology4951. This potential is more striking yet due to simultaneous advances in multiomics, or the measurement of multiple modalities (transient and non-coding RNA species, DNA methylation, chromatin accessibility, surface protein abundance) in individual cells52,53, facilitating “integrated” analysis5456. The “big data” from single-cell sequencing have thus served as substrate for a plethora of investigations which are, at first glance, analogous to the research program of systems biology at large: the identification of cell types; their aggregation into trajectories; the discovery of gene modules that consistently differ between cell types or throughout a differentiation trajectory; and the visualization of low-dimensional summaries reflecting some component of the data structure.

To identify these coarse-grained motifs in the structure of single-cell datasets, it is common practice to construct cell–cell graphs from measures of expression similarity and to use them to identify cliques (cell types), shortest paths (trajectories), and neighborhood-preserving low-dimensional embeddings (visualizations). In addition, relatively simple parametric distributions are widely used, with the Gaussian assumption popular for the lower moments (e.g., to compute measures of differential expression), and the lognormal or negative binomial used to describe count distributions57,58. Standard single-cell RNA sequencing data provide snapshots of processes, rendering dynamical analysis fairly complex, but it is common to fit a “pseudotemporal” curve through the dataset by minimizing a Gaussian error term between this curve and some transformation of the cells’ expression levels59,60.

Here, however, the underlying assumptions break down. Single-cell data are intrinsically and qualitatively different from readouts of typical systems biology experiments, with drastic implications for analysis. Single-cell data are large and sparse, with a preponderance of technical noise effects, poorly characterized batch- and gene-level biases, and low per-cell copy numbers1315. Improving the signal-to-noise ratio by designing more targeted experiments is challenging, as commercial technology is designed to quantify molecules on a genome-wide scale. More problematically, typical distributional assumptions and data transformations risk losing a considerable amount of signal in the low-copy number regime. This challenge informs part of the broader discussion of the relative roles of data analysis and mechanistic hypotheses in genomics19,20,61, as analyses are not constrained by mechanism or theory and may contradict existing knowledge.

More specific critiques have considered whether various analyses are appropriate or excessively heavy-handed. For example, sparsity has led to ad hoc procedures to “correct” the data, which may in turn lead to incorrect conclusions6264. Normalization and log-transformation, which attempt to remove technical biases and prepare the data for dimensionality reduction, rely on assumptions, such as high copy numbers and homogeneity, that are routinely violated in single-cell datasets65,66. Dimensionality reduction risks distorting both local and global relationships between data points19,67,68. Finally, the use of cell–cell graphs constructed from noisy data reifies relationships which may not reflect those in the original tissue, and risks introducing hard-to-diagnose errors into downstream analysis19,69. Although these issues span the entire process of analysis, all, at least partially, trace back to uncomfortable compromises in the treatment of uncertainty and variation in a regime unforgiving of approximations.

2.3. Stochastic modeling of intracellular network dynamics

Stochasticity is, then, mandatory, and we ignore it at our own risk. Therefore, we advocate for probabilistic alternatives to the “extraction” of signal from scRNA-seq datasets. Since biology is stochastic, the noise is the signal. To quantify and characterize the components of deterministic mental models—differentiation landscapes, kinetic parameters, and similar low-dimensional abstractions70—in a principled way, we need to combine them with stochastic terms which result from specific hypotheses about the underlying biophysics and chemistry20, or risk confirmation bias19.

The development of stochastic models offers advantages beyond loss function book-keeping. If multiomic data are available, there is typically a self-consistent way to extend the models accordingly71. Although likelihoods induced by stochastic processes are challenging to analyze and implement, they provide appealing statistical properties. When the data are sufficiently informative, full distributions provide better estimates than moments40. When they are not, probabilistic approaches are appropriately conservative, as they report, rather than elide, the parameter degeneracies. A thorough mathematical understanding of model behaviors—i.e., precisely which parameters are identifiable and which are degenerate, as well as how much data must be collected—enables the design of informative experiments20,72. Finally, the use of mechanistic models, parametrized by rate constants, allows us to draw conclusions about the mechanistic bases and effects of perturbations73.

These principles have guided fluorescence-based single-cell transcriptomics for nearly twenty years. To obtain as much information as possible from entire copy-number distributions40,74, the field has developed a considerable arsenal of theoretical tools75,76 and solution strategies7779. It is, then, particularly natural to build scRNA-seq models that extend processes consistent with fluorescence imaging: this approach allows us to leverage existing theory, as well as encode the intuition that technology-dependent effects should be independent from biological ones. A particularly popular class of models involves the bursty production of RNA and its Markovian degradation73,80, which can be analyzed in the chemical master equation (CME) framework81,82. The key theoretical points have already been applied in the context of single-cell sequencing; for example, the Poisson, Poisson-gamma, and Poisson-beta distributions, which are common in sequencing analyses58,63,83,84, are three of the limiting distributions induced by this class of models20,80,85. However, this possible mechanistic basis is only rarely84,8688 invoked in the development of analysis methods.

2.4. Outlook

Unfortunately, we cannot simply apply existing methods from fluorescence transcriptomics; the scale and chemistry of single-cell technologies create additional desiderata. General CME solutions are computationally prohibitive and challenging to scale to thousands of genes89, requiring careful study of narrow model classes with tractable solutions17,20. In addition, connecting biological models to observations requires explicitly representing the experimental process. The existing models for fluorescence data are sophisticated79, but cannot be directly applied to sequencing data. Although a variety of models have been proposed for technical noise in single-cell technologies13,14,90,91, their chemical foundations, as well as implications for biological parameter identifiability, have been understudied21.

In light of this lacuna, we seek to produce a mathematical framework that (1) integrates biological and technical variability in a coherent, modular way; (2) scales to large, multimodal data; (3) can be used to simulate datasets and make testable, quantitative predictions; and (4) affords a thorough mathematical analysis of its components, if not the entire model.

3. STOCHASTIC MODELING OF SINGLE-CELL BIOLOGY

Constructing a general-purpose framework for the stochastic modeling of single-cell biology necessitates working at a relatively high level of abstraction, since we would in principle like to account for a range of processes with one formalism. In this section, we motivate our abstract formalism using a collection of concrete, biologically relevant examples.

One of the simplest models of transcription is the constitutive model, which assumes RNA is produced at a constant rate20,92. It is defined by the chemical reactions

$$\varnothing \xrightarrow{K} \mathcal{X}, \qquad \mathcal{X} \xrightarrow{\gamma} \varnothing, \tag{6}$$

where 𝓧 is a single species of RNA, K is the (constant) transcription rate, and γ is the degradation rate. The CME that corresponds to this system is

$$\frac{\partial P(x,t)}{\partial t} = K\left[P(x-1,t) - P(x,t)\right] + \gamma\left[(x+1)P(x+1,t) - xP(x,t)\right], \tag{7}$$

where P(x,t) is the probability that the system has x ≥ 0 RNA molecules at time t. Solving the above master equation allows us to compare its predictions with experimental scRNA-seq data. There are several theoretical approaches for doing this—including using a special ansatz85, the Poisson representation93, the Doi-Peliti path integral17,94–96, and operator techniques97—but we would like to highlight a straightforward method that we know works for far more general problems. The idea is to consider a certain transformed version of the probability distribution, which satisfies a partial differential equation (PDE) instead of a differential-difference equation. This PDE, for a large class of biologically relevant systems, can then be solved using the method of characteristics98, which converts the problem of solving a PDE into integrating a system of ordinary differential equations (ODEs). This is mathematically equivalent to using certain path integral methods17,20,99.

Define the generating functions (GFs)

$$G(g,t) := \sum_{x=0}^{\infty} g^x P(x,t) \qquad \text{and} \qquad \phi(u,t) := \log G(g,t), \tag{8}$$

where g is on the complex unit circle and u := g − 1. It is easy to show that G and ϕ satisfy the PDEs

$$\frac{\partial G}{\partial t} = (g-1)\left[KG - \gamma \frac{\partial G}{\partial g}\right], \qquad \frac{\partial \phi}{\partial t} = Ku - \gamma u \frac{\partial \phi}{\partial u}. \tag{9}$$

We can use the method of characteristics to find that

$$\phi(u,t) = \phi_0(U(t)) + K\int_0^t U(s)\,ds, \qquad \frac{dU}{ds} = -\gamma U, \tag{10}$$

where the U(s) ODE has initial condition U(s = 0) = u, and where ϕ0 is the initial (log-) generating function of the system. In order to determine P(x,t) from ϕ(u,t) = log G(g,t), we can use an inverse Fourier transform:

$$P(x,t) = \oint \frac{dg}{2\pi i}\, \frac{G(g,t)}{g^{x+1}} = \int_{-\pi}^{\pi} \frac{d\theta}{2\pi}\, e^{-i\theta x}\, G(e^{i\theta},t),$$

where we integrate over all g on the complex unit circle. In practice, this step is done numerically using an inverse fast Fourier transform.
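To make this concrete, the following minimal sketch (our own illustrative code with made-up parameter values, not the authors' implementation) evaluates the constitutive-model generating function on the unit circle and recovers P(x,t) with a fast Fourier transform, checking the result against the Poisson distribution expected for this model when the system starts with zero RNA.

```python
# Minimal sketch of the GF -> inverse FFT step for the constitutive model.
# Assumes the system starts with zero RNA, so phi_0 = 0 and
# phi(u, t) = (K / gamma) * u * (1 - exp(-gamma * t)) with u = g - 1.
import numpy as np
from scipy.stats import poisson

K, gamma, t = 5.0, 1.0, 2.0          # illustrative transcription rate, degradation rate, time
N = 64                               # number of unit-circle quadrature points / states retained

g = np.exp(2j * np.pi * np.arange(N) / N)   # points on the complex unit circle
u = g - 1.0
G = np.exp((K / gamma) * u * (1.0 - np.exp(-gamma * t)))

# Discrete analogue of the inverse Fourier transform above: P(x, t) for x = 0, ..., N-1.
P = np.fft.fft(G).real / N

# For this model the distribution is Poisson with a time-dependent mean, which we can verify.
mean = (K / gamma) * (1.0 - np.exp(-gamma * t))
assert np.allclose(P[:20], poisson.pmf(np.arange(20), mean), atol=1e-8)
print(P[:5])
```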

The constitutive model, which produces Poisson distributions at steady state, is too simple for single-cell biology20. But fortunately, the technique we have just described can be adapted to predict the behavior of substantially more complex models.

Multiple types of RNA.

One possible generalization of the constitutive model is to so-called monomolecular systems17,85, which allow phenomena like RNA splicing to be accommodated. An example is the addition of splicing to the constitutive model:

$$\varnothing \xrightarrow{K} \mathcal{X}_N, \qquad \mathcal{X}_N \xrightarrow{\beta} \mathcal{X}_M, \qquad \mathcal{X}_M \xrightarrow{\gamma} \varnothing. \tag{11}$$

In general, any number of production, conversion, and degradation reactions can be modeled:

$$\varnothing \xrightarrow{K_i} \mathcal{X}_i, \qquad \mathcal{X}_i \xrightarrow{c_{ij}} \mathcal{X}_j, \qquad \mathcal{X}_i \xrightarrow{c_{i0}} \varnothing. \tag{12}$$

Using the same technique described earlier, the probability P(x,t) that the system is in state x ∈ ℕ₀ⁿ at time t can be shown to have the (log-)generating function

$$\phi(u,t) = \phi_0(U(t)) + \int_0^t K^T U(s)\,ds, \qquad \frac{dU}{ds} = C\,U, \tag{13}$$

where U(s = 0) = u, and the C matrix is defined via

$$C_{ij} = c_{ij} \quad (i \neq j), \qquad C_{ii} = -\sum_{j=0}^{n} c_{ij}, \tag{14}$$

and where c_ii := 0 by convention.
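As a small illustration of Equation 14 (a sketch with an arbitrary reaction list, using the sign convention reconstructed above, with index 0 denoting degradation), the function below assembles C from a dictionary of rates c_ij and reproduces the splicing model of Equation 11:

```python
# Minimal sketch: build the matrix C of Equation 14 from conversion/degradation rates c_ij,
# where species are indexed 1..n and index 0 denotes degradation.
import numpy as np

def build_C(rates, n):
    """rates: dict mapping (i, j) -> c_ij with 1 <= i <= n, 0 <= j <= n, and j != i."""
    C = np.zeros((n, n))
    for (i, j), c in rates.items():
        if j != 0:
            C[i - 1, j - 1] = c            # off-diagonal: C_ij = c_ij
        C[i - 1, i - 1] -= c               # diagonal: C_ii = -sum_j c_ij (including c_i0)
    return C

# The splicing model of Equation 11: species 1 = nascent, 2 = mature.
beta, gamma = 2.0, 1.0
print(build_C({(1, 2): beta, (2, 0): gamma}, n=2))
# expected: [[-beta, beta], [0, -gamma]]
```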

Multiple gene states.

Although the monomolecular model is a step forward, it still does not account for nontrivial transcription rate dynamics. One possibility is that there are multiple gene states, as in the telegraph model76,97,100:

$$\mathcal{S}_{\text{off}} \underset{k_{\text{off}}}{\overset{k_{\text{on}}}{\rightleftharpoons}} \mathcal{S}_{\text{on}}, \qquad \mathcal{S}_{\text{on}} \xrightarrow{k_{\text{init}}} \mathcal{S}_{\text{on}} + \mathcal{X}, \qquad \mathcal{X} \xrightarrow{\gamma} \varnothing. \tag{15}$$

The corresponding three-variable generating function is

$$\phi(u,u_{\text{on}},u_{\text{off}},t) = \phi_0\big(U(t),U_{\text{on}}(t),U_{\text{off}}(t)\big), \tag{16}$$
$$\frac{dU}{ds} = -\gamma U, \qquad \frac{dU_{\text{off}}}{ds} = -k_{\text{on}}\left(U_{\text{off}} - U_{\text{on}}\right), \qquad \frac{dU_{\text{on}}}{ds} = -k_{\text{off}}\left(U_{\text{on}} - U_{\text{off}}\right) + k_{\text{init}}\left(U_{\text{on}}+1\right)U,$$

where U(0) = u, U_off(0) = u_off, and U_on(0) = u_on. If we want to marginalize over gene state, which we usually do since it is not observable, we can set u_off = u_on = 0. Notice that the relevant ODEs are now nonlinear (Riccati-type) equations, which makes them difficult to solve by hand. In general, considering multiple gene states, or other kinds of added complexity like autocatalytic reactions, yields nonlinear characteristic ODEs. This is no obstacle for numerical integration, however.
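As an illustration of this numerical route, the sketch below (our own illustrative code, not the paper's implementation) integrates the telegraph-model characteristics of Equation 16 with a fixed-step RK4 scheme, vectorized over points on the unit circle, and recovers the RNA marginal by inverse FFT; it assumes the system starts in the off state with zero RNA, so that ϕ_0 = log(1 + u_off).

```python
# Minimal sketch: numerically integrate the telegraph-model characteristics (Equation 16)
# and recover the RNA copy-number distribution at time t by inverse FFT.
import numpy as np

k_on, k_off, k_init, gamma = 0.3, 0.5, 10.0, 1.0     # illustrative parameters
t_final, n_steps, N = 5.0, 2000, 128

u = np.exp(2j * np.pi * np.arange(N) / N) - 1.0       # spectral variable for the RNA species
U = u.copy()
U_on = np.zeros(N, complex)                           # u_on = u_off = 0 marginalizes
U_off = np.zeros(N, complex)                          # over the gene state

def rhs(U, U_on, U_off):
    dU = -gamma * U
    dU_on = -k_off * (U_on - U_off) + k_init * (U_on + 1.0) * U
    dU_off = -k_on * (U_off - U_on)
    return dU, dU_on, dU_off

dt = t_final / n_steps
for _ in range(n_steps):                              # classical RK4, vectorized over all points
    k1 = rhs(U, U_on, U_off)
    k2 = rhs(U + dt / 2 * k1[0], U_on + dt / 2 * k1[1], U_off + dt / 2 * k1[2])
    k3 = rhs(U + dt / 2 * k2[0], U_on + dt / 2 * k2[1], U_off + dt / 2 * k2[2])
    k4 = rhs(U + dt * k3[0], U_on + dt * k3[1], U_off + dt * k3[2])
    U = U + dt / 6 * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0])
    U_on = U_on + dt / 6 * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1])
    U_off = U_off + dt / 6 * (k1[2] + 2 * k2[2] + 2 * k3[2] + k4[2])

# Initial condition: all cells in the off state with zero RNA, so G_0 = g_off and
# phi_0 = log(1 + u_off); evaluating it at the integrated characteristics gives G = 1 + U_off.
G = 1.0 + U_off
P = np.fft.fft(G).real / N                            # P(x, t_final) for x = 0, ..., N-1
print(P[:10], P.sum())
```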

Gene regulation.

Another possibility we would like to account for is nontrivial gene regulation. In previous work20, we considered two models of transcription rate variation: the gamma Ornstein–Uhlenbeck (Γ-OU) model, which assumes variation is due to changes in the mechanical state of DNA; and the Cox–Ingersoll–Ross (CIR) model, which assumes it is due to fluctuations in the concentration of an abundant regulator molecule. Analyzing them can be mathematically challenging, since the discrete stochastic dynamics of RNA production and degradation are coupled to the continuous stochastic process that controls the transcription rate. Fortunately, both models and many generalizations of them can be solved using the method of characteristics. For example, the CIR model (assuming two RNA species) is defined by a stochastic differential equation (SDE81) and three reactions:

$$\frac{dK}{dt} = a\theta - \kappa K + \sqrt{2\kappa\theta K}\,\xi(t), \qquad \varnothing \xrightarrow{K(t)} \mathcal{X}_N, \qquad \mathcal{X}_N \xrightarrow{\beta} \mathcal{X}_M, \qquad \mathcal{X}_M \xrightarrow{\gamma} \varnothing, \tag{17}$$

and its solution is20

$$\phi(u_N,u_M,u_K,t) = \phi_0\big(U_N(t),U_M(t),U_K(t)\big) + a\theta \int_0^t U_K(s;\,u_N,u_M,u_K)\,ds, \tag{18}$$
$$\frac{dU_M}{ds} = -\gamma U_M, \quad U_M(0) = u_M; \qquad \frac{dU_N}{ds} = \beta\left(U_M - U_N\right), \quad U_N(0) = u_N; \qquad \frac{dU_K}{ds} = U_N - \kappa U_K + \kappa\theta U_K^2, \quad U_K(0) = u_K.$$

Thus, it is straightforward to couple dynamics defined on different types of state spaces: categorical (e.g., gene states), continuous (e.g., transcription rates), and discrete (e.g., RNA counts), using the generating function approach. In all cases, one obtains a generating function solution in terms of a finite set of (possibly nonlinear) ODEs. The total number of ODEs is equal to the total number of degrees of freedom.

One feature of single-cell biology that is challenging to capture using this approach is feedback. For example, proteins expressed by a gene may affect the transcription rate of that gene. Although exact solutions for systems involving feedback are available in certain simple cases101104, particularly when there is only one chemical species, more general results have proven elusive. From the point of view of our approach, including chemical reactions that involve feedback yields generating function PDEs which are not first order, and cannot be solved in terms of ODEs via the method of characteristics (as explored in more detail in the supplemental information).

Transient effects.

In the context of development or reprogramming, we are especially interested in using single-cell genomics data to study transient processes. In particular, certain cell types or subtypes (like neural progenitor cells) only exist for a certain window of time, and by collecting single-cell data we are taking a snapshot of many cells, each of which may be in a different part of the process. How does this affect observed RNA counts?

Different cells being observed at different times means we are not interested in P(x,t), but P(x,t) averaged over some distribution that indicates how likely we are to sample different times. The shape of the sampling distribution f(t) depends on when cells tend to exit a given state (e.g., by differentiating into a different cell type). Nontrivial sampling distributions are compatible with our generating function approach, since we can simply modify the distribution that appears. For a model with one discrete species, we can write the full generating function Gtot as

$$G_{\text{tot}}(g) = \sum_{x=0}^{\infty} g^x \int_0^T P(x,t)\,f(t)\,dt = \int_0^T G(g,t)\,f(t)\,dt,$$

i.e., we can obtain it by integrating the generating function that captures intrinsic noise.
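As a concrete illustration (a sketch with made-up parameters, not the paper's implementation), the snippet below averages the constitutive-model generating function over an assumed uniform sampling distribution f(t) = 1/T on [0, T] by trapezoidal quadrature, then inverts the result to obtain the snapshot distribution.

```python
# Minimal sketch: average G(g, t) over a sampling distribution f(t) (here uniform on [0, T])
# and recover the resulting snapshot copy-number distribution by inverse FFT.
import numpy as np

K, gamma, T = 5.0, 1.0, 3.0          # illustrative parameters
N, n_t = 64, 400                     # unit-circle points and time-grid points

u = np.exp(2j * np.pi * np.arange(N) / N) - 1.0
t_grid = np.linspace(0.0, T, n_t)

# Constitutive-model GF at each time point (zero RNA at t = 0): G(g, t) = exp(phi(u, t)).
G_t = np.exp((K / gamma) * u[None, :] * (1.0 - np.exp(-gamma * t_grid[:, None])))

f = np.full(n_t, 1.0 / T)            # uniform sampling distribution f(t)
w = np.full(n_t, t_grid[1] - t_grid[0])
w[0] *= 0.5
w[-1] *= 0.5                         # trapezoidal quadrature weights
G_tot = (w[:, None] * f[:, None] * G_t).sum(axis=0)

P = np.fft.fft(G_tot).real / N       # snapshot distribution over sampled cells
print(P[:5], P.sum())
```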

Technical noise.

In single-cell genomics experiments, we do not directly observe a given cell’s RNA counts, but those numbers filtered through a noisy sequencing process21. In microfluidics-based sequencing, noise can come from some combination of droplets not capturing all molecules (especially types of RNA with low copy numbers), errors in amplification, and reads not being uniquely identifiable. We would like to account for these kinds of technical noise in a way that is both principled, and compatible with our generating function approach to modeling intrinsic noise.

Consider a simple example, in which the relevant biology is described by the one-species constitutive model (Equation 7), and each RNA molecule is observed independently with probability p. The probability of observing xobs molecules of RNA, given a biological distribution P(x,t), is

$$P(x_{\text{obs}},t) = \sum_{x=0}^{\infty} P(x_{\text{obs}} \mid x)\,P(x,t) = \sum_{x=0}^{\infty} \binom{x}{x_{\text{obs}}}\,p^{x_{\text{obs}}}(1-p)^{x-x_{\text{obs}}}\,P(x,t). \tag{19}$$

The corresponding generating function Gtot is

$$G_{\text{tot}}(g,t) = \sum_{x=0}^{\infty} \sum_{x_{\text{obs}}=0}^{x} g^{x_{\text{obs}}}\,P(x_{\text{obs}} \mid x)\,P(x,t) = \sum_{x=0}^{\infty} \left[gp + (1-p)\right]^x P(x,t), \tag{20}$$

i.e., the result is the same as without technical noise, except that we have replaced g → gp + (1 − p). In general, including technical noise requires us to replace the usual g^x factor with G_noise(g,x), the generating function associated with the observation model:

$$G_{\text{tot}}(g,t) = \sum_{x=0}^{\infty} G_{\text{noise}}(g,x)\,P(x,t). \tag{21}$$

For certain common observation models, like the Bernoulli model just described, or a Poisson noise model, we can say more: since

$$G_{\text{noise}}(g,x) = G_*(g)^x \tag{22}$$

for some G*, including technical noise amounts to replacing g with G*, so that G_tot = G(G*) is a composition of generating functions. We typically assume that all technical noise models satisfy Equation 22 for some G*.
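The composition can be carried out numerically in a single substitution; the sketch below (illustrative values, not the authors' code) applies Bernoulli capture with probability p to the steady-state constitutive model and confirms that a Poisson distribution with mean K/γ is thinned to a Poisson with mean pK/γ.

```python
# Minimal sketch: technical noise as a composition of generating functions,
# G_tot(g) = G_bio(G_*(g)), with a Bernoulli observation model G_*(g) = 1 - p + p g.
import numpy as np
from scipy.stats import poisson

K, gamma, p, N = 5.0, 1.0, 0.3, 64   # illustrative parameters
g = np.exp(2j * np.pi * np.arange(N) / N)

G_star = 1.0 - p + p * g                              # observation-layer GF
G_bio = lambda z: np.exp((K / gamma) * (z - 1.0))     # steady-state constitutive GF (Poisson)
G_obs = G_bio(G_star)                                 # composed GF of the observed counts

P_obs = np.fft.fft(G_obs).real / N
assert np.allclose(P_obs[:20], poisson.pmf(np.arange(20), p * K / gamma), atol=1e-10)
print(P_obs[:5])
```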

4. RESULTS

4.1. Theoretical framework for stochastic systems biology

We are ready to present our general framework for stochastic systems biology, which accommodates all of the sources of stochasticity described in the preceding section: intrinsic noise, transient effects, and technical noise. In order to balance the amount of biology our models can capture with the mathematical tractability of those models, we restrict our analysis to a fairly general class of systems that can be solved using the method of characteristics. For such systems, we can obtain likelihoods by integrating characteristic ODEs, using the obtained characteristics to construct the generating function, and then doing an inverse (fast) Fourier transform.

This class of systems permits gene state interconversion, as well as the production and processing of RNA and proteins, which can be treated as discrete or continuous variables depending on their concentration. We allow zero- and first-order reactions, including state-dependent bursting, interconversion, degradation, and catalysis. However, we disallow higher-order reactions (e.g., binding reactions A + B → C), including feedback-based regulation like protein–promoter binding. Therefore, our analysis focuses on Markovian systems that possess N categorical degrees of freedom, corresponding to gene states; n discrete ones, corresponding to low-copy number molecular species; and m continuous ones, corresponding to transcription rates or high-concentration species. This class of reactions is schematically represented in Figure 1a; crucially, it consists of distinct “upstream” and “downstream” components.

Figure 1.

The biophysical and chemical phenomena of interest, as well as the relationships between their generating functions.

a. The biological phenomena of interest: cell influx and efflux into a tissue observed by sequencing; the time-dependent transcriptional regulation of one or more genes; downstream continuous and discrete processes.

b. The technical phenomena of interest: the encapsulation of cells and cell debris; cDNA library construction; the loss of information in transcript identification (GF: generating function; RTase: reverse transcriptase).

c. The structure of the full generating function of the system in a and b: to obtain the solution, we variously compose, integrate, and multiply the generating functions of the constituent processes.

d. The stochastic and statistical properties of four components of the full system: the background debris, the transcriptional regulation, the cell/tissue relationship, and the technical noise mechanism.

Given all of a model’s possible reactions, one can write down a corresponding master equation that keeps track of how microstate probabilities change with time:

$$\frac{dP(s,x,y,t)}{dt} = \psi(s,x,y,t), \tag{27}$$

where each microstate consists of s, the categorical dimension; x ∈ ℕ₀ⁿ, the n discrete dimensions; and y ∈ ℝᵐ, the m continuous dimensions. The generally complicated function ψ aggregates all reaction rates. Master equations like Equation 27 typically consist of an infinite system of coupled ODEs, and hence are difficult to solve in general. This is one reason we chose a particular class of systems: to solve Equation 27 using the method of characteristics, and hence determine a given model’s predictions, all we need to do is solve (a finite number of) ODEs satisfied by the characteristics and GF.

The N-dimensional GF G = (G_1, …, G_N)^T of the system, which is a function of spectral variables g and h, is defined by

$$G_s(g,h,t) := \int_y \sum_x g^x\, e^{h^T y}\, P(s,x,y,t)\,dy. \tag{28}$$

Equation 27 can be converted into a PDE satisfied by G:

$$\frac{\partial \mathbf{G}}{\partial t} = \mathcal{M}(u,t)\,\mathbf{G} + J\left[Cu + \operatorname{diag}(u)\,Du\right], \qquad \mathcal{M}(u,t) := H(t)^T \odot \mathcal{A}(u,t), \qquad u := \begin{bmatrix} g-1 \\ h \end{bmatrix}, \tag{29}$$

where ⊙ is the Hadamard/elementwise matrix product, J is the Jacobian matrix of the generating function with respect to u, and u combines the discrete and continuous degrees of freedom. The time-dependent matrix H contains the kinetics of state transitions, whereas the operator 𝓐 describes the drift and bursty production processes, which may depend on state. Therefore, the operator ℳ aggregates the upstream components of the system. The matrix C contains interconversion, degradation, and mean reversion-like terms, whereas D contains the catalysis and square-root noise terms. ℳ, C, and D encode a quasi-linear, deterministic, and first-order N-component system of partial differential equations in n + m spectral variables.

Applying the method of characteristics to solve Equation 29 tells us that the downstream part of the system is fully determined by a set of characteristics U, which are defined by the ODEs

$$\frac{dU(s)}{ds} = C\,U(s) + \operatorname{diag}(U(s))\,D\,U(s), \tag{30}$$

where s is an integration variable, and U(s = 0) = u. Meanwhile, the generating function G can be determined from

$$\frac{d\mathbf{G}(s)}{ds} = \mathcal{M}\big(U(s),\,t-s\big)\,\mathbf{G}, \tag{31}$$

which has initial condition G_0(U(t)), where G_0 is the generating function of the initial distribution. The upstream components describe how molecule production occurs, and hence depend on ℳ; their influence on the final answer enters through the integration of this ODE.

The detailed form of Equation 27 is complicated, and the arithmetic exercise of converting it into Equation 29 is tedious. We show how to construct the biological master equation in Section 6.1, write it out in full in Section 6.2, and discuss at a high level how to solve it using our generating function approach in Section 6.3. The terms of the full master equation are annotated in Table S1, and the solution process is described in more detail in supplemental information.

In special cases, the ODEs we obtain can be solved exactly. For example, whenever D=0, the downstream ODE system can be solved analytically by eigendecomposition. If, in addition, only a single gene state is present, H vanishes and the upstream component can be evaluated by numerical integration16. Finally, in the simplest case of a linear operator 𝓐, we obtain an analytically tractable system equivalent to a deterministic system of reaction rate equations17,85.
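For instance, under the assumption of a purely linear downstream network (a sketch with illustrative parameters, not the paper's code), the splicing model's characteristics can be propagated with a single matrix exponential rather than a numerical ODE solve:

```python
# Minimal sketch of the D = 0 special case: the characteristics dU/ds = C U are linear,
# so U(t) = expm(C t) u can be computed by eigendecomposition / matrix exponential.
# The matrix below encodes the splicing model (nascent -> mature -> degradation).
import numpy as np
from scipy.linalg import expm

beta, gamma, t = 2.0, 1.0, 1.5                  # illustrative rates and time
C = np.array([[-beta, beta],
              [0.0, -gamma]])                   # dU_N/ds = beta (U_M - U_N), dU_M/ds = -gamma U_M

u0 = np.array([0.1 + 0.2j, -0.05 + 0.1j])       # an arbitrary spectral-domain point (u_N, u_M)
U_t = expm(C * t) @ u0                          # characteristic evaluated at s = t
print(U_t)
```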

Although this formulation nominally describes a single gene, it may be exploited to represent multi-gene systems. Conceptually, this strategy entails constructing a model where the transcription of multiple species is controlled by a common regulator. We discuss potential candidate models in Section 6.4; these models instantiate hypotheses to produce ℳ and U that represent co-regulation.

To explain the observation of transient processes, such as the simultaneous capture of progenitor and descendant cells from a differentiation process, we take inspiration from previous work in sequencing86 as well as chemical reactor modeling105,106, and extend the theoretical framework originally proposed in our recent RNA velocity analysis19. In brief, the simplest model that accounts for such desynchronization proposes that cells enter a tissue, receive a signal that triggers time-dependent changes in transcriptional rates, and leave at some later point. Sequencing is the observation of cells within the tissue; to find the distribution of RNA counts, we need to condition on the distribution of times since receiving the signal.

As we discuss in Section 6.5, this latter distribution is not arbitrary, and reflects the kinetics of cell entry and exit. In the parlance of chemical reaction engineering, the times are drawn from f(t), the internal-age distribution induced by those kinetics105,106. This model affords a particularly simple representation of the generating function:

$$G = \int_t \sum_s G_s(t)\,f(t)\,dt, \tag{32}$$

where we marginalize over the gene state, which is typically not observable. Conveniently, this model possesses time symmetry: even though the cells within the tissue are all out of equilibrium, the tissue as a whole is at steady state.

We consider the technical noise phenomena shown in Figure 1b, i.e., the encapsulation of cells and background debris into droplets, as well as the stochasticity in cDNA library construction and sequencing. Under the assumption of independent encapsulation, the generating function of molecule count distributions on a per-droplet level takes the following form:

$$G_{\text{tot}} = G_{\text{enc}}(G)\,G_{\text{bg}}(G), \tag{33}$$

where G_enc is the generating function of the number of cells per droplet, whereas G_bg is the generating function of the number of background molecules per droplet, which depends on the entire cell population (Section 6.6). Finally, to represent sequencing variability and uncertainty, we evaluate the generating function at a set of transformed coordinates:

$$G_{\text{tot,ta}} = G_{\text{tot}}\big(G_t^*(G_a^*(u))\big), \tag{34}$$

where Gt* reflects the distribution of cDNA produced per molecule of RNA (e.g., Bernoulli, as in Tang et al.107,108), whereas Ga* reflects the distribution of ambiguous sequenced fragments, which depends on transformed variables u (Section 6.7 and supplemental information).

The full generating function of the molecule distribution is given by the composition and integration of the model components, as shown in Figure 1c. To evaluate this generating function, it is necessary to specify all components that make up the model. In the analysis below, we take advantage of the modularity of the system definition to investigate four kinds of modeling choices, their statistical implications, and their compatibility with sequencing data. Specifically, we treat the subsystems illustrated in Figure 1d: background noise in single droplets, stochastic transcription rate models, sampling from a transient process, and variation in technical noise.

4.2. Empty droplets

One of the first steps in scRNA-seq data analysis is cell quality control, which excludes cell barcodes that appear to originate from empty droplets from further analysis57. For computational tractability, this procedure typically relies on “hard” assignment, such that barcodes associated with a total molecule count above some threshold are treated as cells, whereas barcodes below the threshold are treated as empty droplets. Threshold selection is necessary because even “empty” droplets contain ambient RNA. This ambient RNA, which appears to originate from cells lysed in the preparation process, contaminates empty and cell-containing droplets alike57.

The observation of ambient RNA resulting in unwanted molecule counts has led to the development of statistical methods for removing this source of noise, either by estimating and subtracting it109 or incorporating it into a stochastic model110112. Conceptually, Equation 33 reflects the latter approach: each droplet contains one or more cells, each with biological generating function G, and background, with a generating function Gbg that depends on G. To accurately model the background counts, we need to propose and justify a specific functional form for Gbg. Thus, under the assumption that empty and cell-containing droplets are similarly susceptible to contamination, the former provide a reasonable estimate of ambient distributions in the latter109.

The simplest model holds Gbg to be equivalent to a “pseudobulk” experiment, with molecules randomly sampled from the lysed cell population. If each cell is equally likely to contribute to the pool of free RNA, and diffusion occurs by a simple independent arrival process, we find that the distribution of background should be Poisson, with the mean for each species proportional to its mean in the original cell population, as in, e.g., Fleming et al.110 This functional form immediately induces a set of testable predictions: not only are the distributions Poisson, but they are independent Poisson, with no meaningful statistical structure remaining between transcripts of a single gene, as well as between different genes, as illustrated in Fig. 2a.

Figure 2.

The pseudo-bulk model of background noise is quantitatively consistent with counts from the pbmc_1k_v3 dataset.

a. The simplest explanatory model for background noise invokes the lysis of cells (green), which creates a pool of RNA that reflects the overall transcriptome composition but retains none of the cell-level information. If the loose RNA molecules diffuse into droplets (blue) according to a memoryless and independent arrival process, the resulting background distribution (purple: higher probability mass; white: lower probability mass) observed in empty droplets should be a series of mutually independent Poisson distributions, with the mean controlled by the composition in non-empty droplets.

b. The mature transcriptome in empty droplets has a mean-variance relationship near identity (gray points, n = 12,298), consistent with Poisson statistics (blue line); the non-empty droplets demonstrate considerable overdispersion (red points, n = 17,393).

c. The mature and nascent transcripts in empty droplets have sample correlation coefficients ρ near zero, consistent with distributional independence (gray histogram, n = 9,362); the non-empty droplets demonstrate nontrivial statistical relationships (red histogram, n = 14,365).

d. The mature transcripts of different genes in empty droplets have sample correlation coefficients ρ near zero, consistent with distributional independence (gray histogram, n = 75,614,253); the non-empty droplets demonstrate nontrivial statistical relationships (red histogram, n = 151,249,528).

e. When both are nonzero, the mature count mean in empty droplets is highly correlated with the mean in the non-empty droplets, consistent with the pseudo-bulk interpretation (black points, n = 12,107; dashed line: identity).

To characterize the accuracy of these predictions, we inspected six datasets (Table S2) pseudoaligned with kallisto | bustools113, and compared the data for barcodes passing bustools quality control to data for barcodes which were filtered out. As a shorthand, we call the former “non-empty” and the latter “empty” droplets, keeping in mind that this identification is approximate. We fully describe the analysis procedure in Section 6.8.2, illustrate the results for the human blood dataset pbmc_1k_v3, and display the results for all datasets in supplemental information.

As shown in Figure 2b, data from non-empty droplets are substantially overdispersed relative to Poisson, whereas data from empty droplets are largely consistent with the Poisson identity mean–variance relationship. However, a small number of relatively high-expression genes are overdispersed. In addition, intra-gene (Figure 2c) and inter-gene (Figure 2d) correlations are typically nontrivial in non-empty droplets, but consistently near zero for empty droplets, supporting distributional independence of the background counts. Finally, the mean expression in empty droplets is highly correlated with mean expression in non-empty droplets, albeit lowered by approximately four orders of magnitude (Figure 2e), supporting the assumption that the original cells are lysed in a uniform fashion.
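The per-gene checks summarized in Figure 2b–d are simple to compute from a count matrix; the sketch below (simulated stand-in data and hypothetical variable names, not the paper's analysis code) illustrates the Fano-factor and correlation calculations for droplets classified as empty.

```python
# Minimal sketch of the empty-droplet checks: under the pseudo-bulk model, per-gene
# Fano factors should cluster near 1 (Poisson) and pairwise gene-gene correlations near 0.
# `empty_counts` stands in for a droplets-by-genes UMI count matrix from empty droplets.
import numpy as np

rng = np.random.default_rng(0)
empty_counts = rng.poisson(0.05, size=(5000, 200)).astype(float)   # simulated stand-in data

means = empty_counts.mean(axis=0)
variances = empty_counts.var(axis=0)
fano = np.divide(variances, means, out=np.ones_like(means), where=means > 0)

corr = np.corrcoef(empty_counts.T)                 # gene-gene correlation matrix
offdiag = corr[np.triu_indices_from(corr, k=1)]

print(f"median Fano factor: {np.median(fano):.2f}")
print(f"median |gene-gene correlation|: {np.median(np.abs(offdiag)):.3f}")
```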

To characterize the deviations from the pseudo-bulk model, we identified the genes that demonstrated overdispersion in empty droplets (Table S3). A considerable fraction of these genes were associated with mitochondria or blood cells. For example, of the 21 annotated genes overdispersed in the empty droplets of the mouse neuron dataset neuron_1k_v3, nine were mitochondrial (mt-Nd1, mt-Nd2, mt-Co1, mt-Co2, mt-Atp6, mt-Co3, mt-Nd3, mt-Nd4, and mt-Cytb), three coded for hemoglobin subunits (Hba-a1, Hba-a2, and Hbb-bs), and two coded for blood cell-specific proteins (Bsg, Vwf)114,115. On the other hand, of the 10 annotated genes overdispersed in the empty droplets of the desai_dmso dataset, generated from cultured mouse embryonic stem cells116, six (mt-Nd1, mt-Co2, mt-Atp6, mt-Co3, mt-Nd4, mt-Cytb) were mitochondrial and none were blood cell-specific114 (Table S4).

Since overdispersion implies that contamination involves non-independent encapsulation of these molecules, the results suggest that the cell-free debris contain, among other structures, entire mitochondria or erythrocytes, when they are present in the source tissue. These membrane-bound structures may diffuse into droplets, then lyse and release all of their contents at once. In other words, empty droplets do not merely have disproportionally high mitochondrial content, as has been noted previously110,117,118; they have nontrivially distributed mitochondrial content, which can hint at the mechanism of its incorporation, and improve interpretation where simple thresholds may be misleading118. We hypothesize that cases where the model fails can be leveraged to discover more complicated forms of contamination, such as molecular aggregates112.

In addition, we examined the total UMI counts in empty droplets, which should be Poisson (Fano = 1) if each individual gene’s distribution is Poisson. For the human blood dataset demonstrated in Figure 2, the empty droplets had fairly significant overdispersion (Fano ≈ 43), which decreased, but did not disappear (Fano ≈ 7.6), once the 53 significantly overdispersed genes were excluded. This result suggests that, although the pseudo-bulk model is approximately valid, some residual variance, possibly due to variability in per-droplet capture rates, is present and needs to be modeled to fully describe the stochasticity in single-cell datasets.

4.3. Noise-corrupted candidate models of transcriptional variation

A considerable fraction of the variability in single-cell datasets arises from cell-to-cell and time-dependent variation in the transcription rates. These sources of variation control distribution shapes. By carefully analyzing candidate models, we can characterize the prospects for model selection: for example, if different models produce nearly identical distributions, selection is impossible and the choice of model is somewhat arbitrary. More interestingly, such analysis can guide the design of experiments: models may be indistinguishable based on some kinds of data, but not others20. This perspective has guided the interest in characterizing noise behaviors74,119: distributions provide strictly more information than averages, and allow us to distinguish between regulatory mechanisms. Similarly, multivariate distributions provide more information than marginal distributions. Obtaining different data (multiple molecular modalities) is qualitatively more useful than obtaining more data (a larger number of cells) or better data (observations less corrupted by noise).

We illustrate this key point using the simple model system depicted in Figure 3a, which features intrinsic, extrinsic, and technical noise. The continuous stochastic process denoted by K drives the rate of transcription of nascent RNA. We consider three different possibilities for K: the gamma Ornstein–Uhlenbeck process, which models DNA winding and relaxation; the Cox–Ingersoll–Ross process, which models the fluctuations in a high-copy number activator20; and the telegraph process, which models variation due to random exposure of the locus to transcriptional initiation76,97,100. All three transcription rate models are described by three parameters20,100. After a Markovian delay, nascent RNA are converted to mature RNA; after another Markovian delay, the mature RNA are degraded. When the system reaches steady state, it is sequenced; each biological molecule has a probability p of being observed in the final dataset. We seek to use imperfect count data to fit parameters and distinguish models. We fully describe the procedures in Section 6.8.3.

Figure 3.

The stochastic analysis of biological and technical phenomena facilitates the identification and inference of transcriptional models.

a. A minimal model that accounts for intrinsic (single-molecule), extrinsic (cell-to-cell), and technical (experimental) variability: one of three time-varying transcriptional processes K generates molecules, which are spliced with rate β, degraded with rate γ, and observed with probability p. Given a set of observations, we can use statistics to narrow down the range of consistent models.

b. Given a particular model, parameter regimes indistinguishable using a single modality become distinguishable with two. The mixture-like and burst-like regimes both produce negative binomial marginal distributions, but have different correlation structures (Left: data likelihoods over the parameter space, computed from 200 simulated cells; Γ-OU ground truth; red point: true parameter set in the mixture-like regime; color: log-likelihood of data, yellow is higher, 90th percentile marked with magenta hatching; blue: an illustrative parameter set in a burst-like parameter regime with a similar nascent marginal but drastically different joint structure. Right: nascent marginal and joint distributions at the points indicated on the left. Nascent distributions nearly overlap).

c. Given a location in parameter space, models are easier to distinguish using multiple modalities. However, the performance varies widely based on the location in parameter space and the specific candidate models: for example, the telegraph model has a well-distinguishable bimodal limit when the process autocorrelation is slower than RNA dynamics. In addition, all else held equal, drop-out noise effectively decreases the noise intensity, lowering identifiability (Left: Γ-OU Akaike weights under Γ-OU ground truth, average of n = 50 replicates using 200 simulated cells; color: Akaike weight of correct model, yellow is higher, regions with weight < 0.5 marked with black hatching; large circles: illustrative parameter sets; smaller circles: distributions obtained by applying p = 50%, 75%, and 85% dropout to illustrative parameter sets while keeping the averages constant. Right: the three candidate models’ nascent marginal distributions at the large points indicated on the left).

Even if we have perfect information about the true averages of the transcriptional strength and the molecular species, the systems can exhibit a wide variety of distribution shapes and statistical behaviors. This variety can be summarized by a two-dimensional parameter space, which was introduced in Fig. 2 of Gorin and Vastola et al.20 The “timescale separation” governs the relative timescales of the transcriptional and molecular processes; if it is high, the transcriptional process is faster than RNA turnover. The “noise intensity” governs the variability in the transcriptional process: if it is high, the process exhibits substantial variability that translates to overdispersion in the RNA distributions. The bottom edge of this parameter space produces Poisson distributions of RNA, the top left corner produces Poisson mixtures of the law of K, and the top right corner yields bursty dynamics that do not typically have simple analytical solutions20.

Although these regimes reflect very different transcriptional kinetics, they can produce indistinguishable distributions. The first panel of Figure 3b demonstrates the likelihood landscape of a dataset generated from the gamma Ornstein–Uhlenbeck (Γ-OU) transcriptional model, evaluated using the nascent marginal and p=1. The mixture-like true parameters are indicated by a red point and the top decile of likelihoods is indicated by hatching. The Γ-OU model’s transcription rate has a gamma stationary distribution, which produces approximately Poisson-gamma, or negative binomial, RNA marginals in this regime. However, the bursty regime, indicated by a blue point, also yields a negative binomial-like marginal20, preventing us from identifying the kinetics.

On the other hand, if we evaluate likelihoods using the entire two-species dataset, we obtain the landscape in the second panel of Figure 3b: the symmetry is broken, and the parameters can be localized to the mixture-like regime. The source of this improved performance is evident from examining the distributions, shown in the third and fourth panels of Figure 3b. The nascent marginals are essentially identical; no amount of purely nascent count data can distinguish between them. However, the bivariate distributions show subtle differences, such as higher nascent/mature correlations in the true regime, which can be used for inference. This approach is analogous to Fig. 4b of Gorin et al.21, where bivariate data are used to disambiguate differences which would otherwise be indistinguishable due to the degeneracies of steady-state distributions.

In addition, the timescale separation and noise intensity determine the model distinguishability. To quantify this, we use the Akaike weight wϖ, which transforms log-likelihood differences into model probabilities120. For example, if the Akaike weight is near 1/3, the models are indistinguishable; if the correct model’s weight is near 1, we can confidently identify the model from the data. The first panel of Figure 3c demonstrates the average Akaike weight landscape of datasets generated from the Γ-OU model, computed using the nascent distribution at the same coordinate. We indicate the region wϖ<1/2 by hatching. As the Akaike weight may be interpreted as a posterior model probability120, this somewhat arbitrary threshold gives even odds for choosing the correct model, on average.
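For concreteness, the following sketch (with made-up log-likelihood values) shows how maximized log-likelihoods for the three candidate transcription-rate models are converted into Akaike weights; since all three models here have three parameters, the AIC differences reduce to log-likelihood differences.

```python
# Minimal sketch: Akaike weights from maximized log-likelihoods (illustrative values).
import numpy as np

log_liks = np.array([-1520.3, -1522.1, -1527.8])   # e.g., Gamma-OU, CIR, telegraph fits
n_params = np.array([3, 3, 3])                     # all three candidates have three parameters

aic = 2 * n_params - 2 * log_liks
delta = aic - aic.min()
weights = np.exp(-0.5 * delta)
weights /= weights.sum()                           # weights sum to 1; the largest indicates the best-supported model

print(dict(zip(["Gamma-OU", "CIR", "telegraph"], np.round(weights, 3))))
```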

The intermediate regime, indicated by a large olive green point, tends to yield fairly high Akaike weights, consistent with the two-model case explored in Fig. 3a of Gorin and Vastola et al.20 On the other hand, the burst-like regime, indicated by a large pink point, provides considerably less ability to distinguish the models. As expected, the situation improves somewhat when using bivariate data (second panel of Figure 3c): the Akaike weights increase throughout the parameter space, and the bursty regime data move closer to even odds for model selection.

To illustrate the source of the identifiability challenges, we plot the nascent marginals of the models at the two points. In the intermediate regime, the Γ-OU and CIR models yield moderately different distributions, whereas the telegraph model is immediately distinguishable by its bimodality (third panel of Figure 3c). In contrast, in the bursty regime, the distributions are all unimodal and less identifiable (fourth panel of Figure 3c); the Γ-OU and telegraph marginals are particularly similar, as they converge to the same negative binomial limit20.

Interestingly, this formulation fully characterizes the effect of certain forms of technical noise. If the transcriptional and observed molecular averages are fixed, but the experiment fails to capture some molecules, the distributions are identical to those obtained by deflating the transcriptional noise intensity. In other words, even though technical noise affects the molecules, its theoretical effects are indistinguishable from decreasing the variability of the transcriptional process. As the noise levels increase, the RNA distributions are pushed toward the indistinguishable Poisson limit at the bottom edge of the reduced parameter space. We quantify how rapidly the information degrades by plotting smaller circles on the first and second panels of Figure 3c to indicate the effect of 50%, 75%, and 85% dropout, in that order from top to bottom.

4.4. Distributions obtained from a transient process

Due to the interest in understanding developmental processes, the characterization of transient process dynamics is a key problem in single-cell analyses. The use of mechanistic models with multimodal data, which we emphasize here, was originally pioneered in the context of the RNA velocity framework, which attempts to exploit the causal relationship between nascent and mature RNA to fit transient processes86. However, the implementations proposed so far use relatively simple noise behaviors59,86,121, which do not recapitulate the bursty transcription observed in living cells. As discussed in our recent analysis of RNA velocity methods19, this leads us to hold some reservations about the robustness and appropriate interpretation of results obtained by this class of methods.

The inference of transient dynamics from snapshot data is a formidable problem due to a combination of theoretical and practical factors. Most fundamentally, it is not precisely clear what a snapshot is: how does a single measurement simultaneously capture the early and late states in a differentiation process? To develop an explanatory model, we take inspiration from the existing work on cyclostationary processes122,123, cell cycle ensemble measurement modeling124–126, Markov chain occupation measure theory127–129, and chemical reactor engineering105,106. In the typical stochastic modeling context, we fit count data using stationary distributions P(x), obtained as the limit of P(x,t) as t → ∞. By the ergodic theorem130–132, this distribution, when it exists, coincides with the occupation measure lim_{T→∞} (1/T) ∫₀ᵀ P(x,t) dt, i.e., observations drawn from a single trajectory over a sufficiently long time horizon, rather than from multiple trajectories at once. Conveniently, the ergodic limit has time symmetry with respect to measurement: the distribution does not depend on the timing of the experiment. In the transient case, we cannot take these limits. However, we can retain time symmetry by proposing that the experiment samples cells at almost surely finite times t since the beginning of the process. Therefore, we conceptualize data as coming from a set of cells indexed by c, such that each cell’s time t_c is sampled from f(t), and counts are drawn from some distribution P(x,t_c), which is not typically available in closed form. This formulation yields Equation 32, which requires specifying the distribution f.

We illustrate some of the challenges and implications using the model system shown at the bottom of Figure 4a. The underlying transient structure involves transitions through three cell types, each characterized by a particular transcriptional burst size. The transient transcription process produces nascent and mature RNA trajectories for each cell; however, we only obtain a single data point per trajectory. Even if we have perfect information about the cell times, it is far from clear that we can accurately reconstruct the transcriptional dynamics from snapshot data (center of Figure 4a).

Figure 4.

Given ordered and labeled snapshot data obtained from a transient differentiation process, we can typically fit the copy number data, but identifying the mechanism of the snapshot is more challenging.

a. A minimal model that accounts for the observation of transient differentiation processes in scRNA-seq: cells enter a “reactor” and receive a signal to begin transitioning from cell type A through B and to C. The change in cell type is accompanied by a step change in the burst size, which leads to variation in the nascent and mature RNA copy numbers over time. Given information about the cell type abundances and the cells’ time along the process, we may fit a dynamic process to snapshot data and attempt to identify the underlying reactor type, which determines the probability of observing a cell at a particular time since the beginning of the process.

b. In spite of the considerable differences between the reactor architectures, they produce nearly identical molecular count marginals (histogram: data simulated from the Dirac model, 200 cells; colored lines: analytical distributions at the maximum likelihood transcriptional parameter fits for each of the three reactor models. Analytical distributions nearly overlap).

c. The true reactor model may be identified from molecule count data, but statistical performance is typically poor (points: Akaike weight values for n = 50 independent rounds of simulation and inference under a single set of parameters; blue markers and vertical lines: mean and standard deviation at each number of cells; blue line connects markers to summarize the trends; red lines: the Akaike weight values 1/3, which contains no information for model selection, and 1/2, which gives even odds for the correct model; two-species data generated from the Dirac model; uniform horizontal jitter added).

d. The reactor models are poorly identifiable across a range of parameters, and rarely produce Akaike weights above 1/2 (histogram: Akaike weight values for n = 200 independent rounds of parameter generation, simulation, and inference under the true Dirac model; red lines: the Akaike weight values 1/3 and 1/2; two-species data for 200 cells generated from the Dirac model; parameters were restricted to the low-expression regime $\mu + 4\sigma \leq 25$ for both species).

e. The challenges in reactor identification arise because all three models produce similar likelihoods (histograms: likelihood differences between candidate models and the true Dirac model for n = 200 independent rounds of parameter generation, simulation, and inference; red line: no likelihood difference; two-species data for 200 cells generated from the Dirac model; parameters were restricted to the low-expression regime $\mu + 4\sigma \leq 25$ for both species).

In addition, we wish to know whether we can identify the mechanism of the snapshot collection. We can imagine cells entering and exiting the observed tissue in multiple ways, which correspond to different choices of f(t). Some natural choices are uniform, which implies the cells stay in the tissue for a deterministic time86; decreasing over time, so cells can exit immediately; or uniform, then decreasing, so cells must stay in the tissue for some duration but are free to leave afterward. These choices can be modeled by Dirac, exponential, and Pareto residence distributions. In the parlance of chemical reactor engineering, these configurations are known as the plug flow reactor, the continuously-stirred tank reactor, and the laminar flow reactor, respectively. Their f(t), which are the reactor internal-age distributions, are well-known in the chemical engineering literature105,106, and shown at the top of Figure 4a. It is not a priori obvious the configurations are mutually distinguishable from count data. If they are not, the choice of f(t) is immaterial for inference.

We generated snapshot data from the Dirac model and fit it under all three models. To efficiently evaluate snapshot distributions, we designed an algorithm that essentially "recycles" the sampled times $t_c$ for trapezoidal quadrature. The method is fully described in Section 6.8.4. As shown in Figure 4b, despite only having access to a single observation per time point, all models yield results visually close to the true marginals. However, despite these superficial similarities, quantitative model identification is possible: for the simulated dataset shown, the true Dirac model achieves an Akaike weight of $w \approx 79\%$, whereas the exponential and Pareto models both achieve $\approx 10\%$. Decreasing the dataset size substantially degrades the identifiability (Figure 4c). Even at higher sizes, the spread is considerable; for example, a 150-cell dataset gives approximately even odds ($w > 1/2$) on average, but individual realizations vary from confidently correct ($w \approx 1$) to confidently wrong ($w \approx 0$).
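As an illustration of the mixing step that underlies this procedure, the snapshot distribution is simply the transient distribution averaged against the internal-age distribution $f$. Below is a minimal Python sketch, assuming a generic transient PMF solver; the function names are placeholders rather than the Section 6.8.4 implementation, which additionally reuses the sampled $t_c$ values.

```python
import numpy as np
from scipy import stats

def snapshot_pmf(pmf_at_time, f, t_grid):
    """Approximate the snapshot PMF, the integral of P(x, t) f(t) dt, by trapezoidal quadrature.

    pmf_at_time: callable returning the transient PMF array P(., t) at time t
                 (placeholder for a CME or generating-function solver).
    f:           callable internal-age density, evaluated on the same grid.
    t_grid:      1D array of quadrature nodes spanning the support of f.
    """
    weights = f(t_grid)
    pmf_slices = np.stack([pmf_at_time(t) for t in t_grid])        # shape (T, n_states)
    mixed = np.trapz(pmf_slices * weights[:, None], t_grid, axis=0)
    return mixed / mixed.sum()            # renormalize to absorb quadrature error

# Toy example: a Poisson law whose mean ramps up along the process, observed
# through a uniform (plug-flow-like) internal-age distribution on [0, 10].
x = np.arange(50)
pmf = snapshot_pmf(lambda t: stats.poisson.pmf(x, mu=1.0 + 2.0 * t),
                   lambda t: np.full_like(t, 1.0 / 10.0),
                   np.linspace(0.0, 10.0, 201))
```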

To understand the robustness of model identifiability, we generated 200 synthetic datasets at random parameter values, constrained to have fairly low expression. We observed poor identifiability, with even or better odds for the correct model in only 20% of the cases (Figure 4d). This performance appears to be attributable to quantitative similarities between all three models’ likelihoods. As shown in Figure 4e, given data of this quality, we cannot even narrow the scope down to two models, as neither of the candidate models performs conspicuously worse than the true Dirac configuration. Therefore, it is possible to fit snapshot data approximately equally well using a variety of models; candidates for f(t) are identifiable in principle, but challenging to distinguish from any particular dataset. This simulated analysis implies that the details of the reactor configuration may not matter much, providing a basis for omitting this model identification problem for real data.

4.5. Variability in library construction

To properly interpret single-cell data, we need to exhibit caution regarding the technical noise behaviors and consider multiple possible candidate models. However, before fitting distributions, we must fully characterize the models and understand which of their parameters are actually identifiable with the data at hand. For example, the two-species models explored in Section 4.3 produce distributional forms that are closed under the assumption $p_N = p_M = p$, i.e., the magnitude of the observation probability $p$ is impossible to identify from count data alone. Interestingly, when $p_N \neq p_M$ (that is, when nascent and mature RNA may have different observation probabilities), what we can learn about technical noise heavily depends on the form of the biological noise. For example, under slow transcriptional variation (as in the mixture and Poisson limits of the models explored in Section 4.3), the RNA distributions contain no identifiable information whatsoever about the technical noise, regardless of the amount of data. On the other hand, if transcription is bursty, the distributions depend on the ratio of $p_N$ and $p_M$, but not their absolute values (Section 6.8.5). This theoretical result calls for further investigation: how much information can we obtain in practice, given finite data?

To understand the prospects for distinguishing parameters, we consider the simple model system shown in Figure 5a, which involves bursty transcription with average burst size $b$, splicing, degradation, and molecular capture with species-specific probabilities. To characterize how much information about $p_M/p_N$ we can identify from count data, we simulated 200 datasets at the ratio values 1/4, 1, and 4, and calculated their likelihoods over $(10^{-2}, 10^{2})$. We repeated this analysis using synthetic datasets with 20, 50, 100, and 200 cells, and plotted the average of the posterior distributions for each condition. As shown in Figure 5b, color-coded by the ground truth $p_M/p_N$ and intensity-coded by the number of cells, the posteriors are, on average, consistent with the true value. However, even with perfect information about the averages and the nascent RNA distribution, the uncertainty is considerable; at larger dataset sizes, we can typically localize the ratio to an order of magnitude, but not much further.
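The grid-based profiling used here can be sketched compactly: evaluate the joint likelihood on a log-spaced grid of ratio values, normalize under a flat prior on the log-ratio, and average the resulting posteriors over replicate datasets. In the sketch below, the joint nascent/mature likelihood is left abstract, and the names loglik and datasets are placeholders rather than parts of the accompanying code.

```python
import numpy as np

def ratio_posterior(loglik, data, ratio_grid):
    """Normalize a gridded likelihood profile for a single parameter (e.g., p_M / p_N).

    loglik: callable (data, ratio) -> total log-likelihood over the dataset
            (placeholder for the nascent/mature joint likelihood).
    """
    ll = np.array([loglik(data, r) for r in ratio_grid])
    ll -= ll.max()                                       # avoid overflow in exp
    post = np.exp(ll)
    return post / np.trapz(post, np.log(ratio_grid))     # flat prior on log(ratio)

ratio_grid = np.geomspace(1e-2, 1e2, 201)
# average_posterior = np.mean(
#     [ratio_posterior(loglik, d, ratio_grid) for d in datasets], axis=0)
```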

Figure 5.


Technical noise models may be identified from count data, either by direct application of statistics or by imposing informal priors about the biological variability.

a. A minimal model that accounts for non-homogeneous noise: transcriptional events occur with frequency $\alpha$, generating geometrically-distributed bursts $B$ with mean size $b$; the molecules are spliced with rate $\beta$ and degraded with rate $\gamma$. Nascent molecules are observed with probability $p_N$ and mature molecules are observed with probability $p_M$.

b. Given information about the nascent distribution and the mature mean, it is possible to use joint distributions to obtain information about the ratio of observation probabilities (curves: average posterior likelihoods, computed from 200 independent synthetic datasets; color: true value of pM/pN, blue: 1/4, red: 1, purple: 4; dashed lines: location of each true value; color intensity: from lightest to darkest, synthetic datasets with 20, 50, 100, and 200 cells).

c. Two models considered in Gorin et al.21: the species-independent bias model for length dependence in averages, which proposes that nascent and mature RNA are sampled with equal probabilities, and the species-dependent bias model, which proposes that the nascent RNA sampling rate scales with length (top, gold: kinetics of species-independent model; bottom, blue: kinetics of species-dependent model; center, green: the source RNA molecules used to template cDNA).

d. A variety of single-cell datasets produce consistent and counterintuitive length-dependent trends in nascent RNA observations (lines: average per-species gene expression, binned by gene length; red: nascent RNA observations; gray: mature RNA statistics; data for 2,500 genes analyzed in Gorin et al.21).

e. Fits to the species-independent model show a strong positive gene length dependence for inferred burst sizes, whereas fits to the species-dependent model show a modest negative gene length dependence, which is more coherent with orthogonal data (lines: average per-gene burst size inferred by Monod161, binned by gene length; gold: results for species-independent model; blue: results for species-dependent model; data for genes analyzed in Gorin et al.21 after goodness-of-fit).

Given the statistical challenges illustrated by simulations, we speculate that it may be more fruitful to use prior information about biology and physical intuition about sequencing to construct technical noise models. For example, in a recent paper21, we fit models that represent two competing hypotheses (Figure 5c). The first has identical, gene-specific observation probabilities p for the nascent and mature species. In this model, the inferred burst size is bp, as these two parameters are not mutually identifiable. The second has a gene length-dependent technical noise term for the nascent species, which coarsely represents a higher rate of priming for long molecules with abundant intronic poly(A) tracts, and a shared genome-wide term for the mature species, which represents priming at the poly(A) tail. In this model, the inferred burst size is b.

These models attempt to explain the trend summarized in Figure 5d: across a wide range of datasets, nascent RNA averages exhibit a pronounced length dependence133 not evident in mature RNA134. The first model explains the trend by a species-independent bias, as b and p control nascent as well as mature RNA levels. Conversely, the second model explains it by a species-dependent bias. Both models produce fair fits to the data (as demonstrated, e.g., by the low rate of rejection by goodness-of-fit in Sections S7.4 and S7.5.2 of Gorin et al.21).

However, the trends in the resulting inferred parameters are strikingly different: the species-independent bias model predicts that longer genes have higher bp. Ascribing this trend to the b term—longer genes have higher burst sizes—contradicts burst size trends from fluorescence microscopy135. Ascribing it to the p term—longer genes have higher sampling probabilities—is physically unrealistic, because mature RNA molecules are depleted of the internal poly(A) tracts necessary for priming136. On the other hand, the species-dependent model predicts a modest negative relationship between length and burst size, which is more coherent with orthogonal data.

This technical noise model is a relatively simplistic low-order approximation, since all genes have the same mature molecule capture rate λM and length scaling CN. Nevertheless, it foregrounds a key modeling principle of the investigation: in the absence of prior information, biological parameters need to be fit on a gene-by-gene basis, but technical noise should be constructed using a common genome-wide model that varies in a mechanistic rather than arbitrary way. In sum, the mathematics enable us to define and fit systems, but to understand whether the fits are sensible, we need to contextualize and compare them with previous results and physical intuition.

5. DISCUSSION

The results we have derived provide a blueprint for the holistic modeling of single-cell biology and sequencing experiments. First, we have outlined a generic mathematical framework for treating stochasticity in living cells. By exploiting the generating function representation, we reduce discrete, continuous, and mixed reactions to operators in a system of differential equations. These ODEs can be straightforwardly solved via numerical integration to compute model properties, including likelihoods. This approach recapitulates and subsumes a wide range of previous results16,17,20,21,75,76,85,100,137,138.

By treating the discrete and continuous degrees of freedom on equal footing, our approach makes certain otherwise challenging problems straightforward to solve, as illustrated in Section 6.8.1. By making simplifying assumptions—chiefly, the assumption of independent and identically distributed sampling—we reduce the modeling of technical variation to the composition of generating functions. Our framework may be used in its current form, or as a substrate for developing more sophisticated models of transcriptional regulation and sequencing that subsume it in turn. This process simply involves instantiating hypotheses, converting them into probabilistic models, and constructing model solutions using a procedure analogous to the one presented in Figure 1c.

We believe this framework comprises a productive vision for the interpretation of large datasets, but many technological and mathematical challenges remain. For example, the library construction biases are dependent on molecule-specific factors that we do not yet fully understand, because their effect is heavily convolved with biological variability. In Figure 5, we considered two extreme cases, where the noise strength/length scaling is either unconstrained or forced to be identical for all genes. We anticipate that careful investigation of technical biases will be necessary to construct models that constrain the technical biases based on RNA chemistry, while allowing for gene-to-gene and droplet-to-droplet variability.

In Section 6.7 and supplemental information, we discuss the challenges associated with modeling ambiguous species, motivated by the limitations of short-read sequencing for distinguishing between spliced and unspliced forms of the same RNA gene product139. It is worth noting that even the spliced/unspliced binary is a convenient simplification primarily adopted because of data availability86,113; we stress that a truly comprehensive treatment requires defining intermediate states19, their relationships, and their mutually indistinguishable classes. These computational foundations do not yet exist, although we have attempted a partial solution in recent work16 and outlined some promising directions in supplemental information. Therefore, despite our immediate interest in bivariate RNA distributions, our framework is designed to generalize to other modalities as they become practical to quantify. In addition, although we focus on Markovian systems here, non-Markovian processing can be represented by appropriately defining U140, which suggests avenues for the treatment of systems with molecular memory141,142.

The full generating function solutions we have outlined here are typically not computable directly. By construction, the generating function needs to be evaluated on a grid; Fourier inversion produces a grid of microstate probabilities, which needs to be quite large to avoid artifacts138. If the grid dimension is $s_i$ for each discrete species $i$, the overall state space size is $s = N\prod_i s_i$. Even in the simplest case, where we only quantify and fit discrete counts, evaluating the probability mass function requires storing and inverting an $n$-dimensional array, which usually has size $s$ far too large to be practical (e.g., Fig. S5b of our prior work on bursty models16).

When applicable, the generating function approach has numerical advantages over the stochastic simulation algorithm (SSA)143–145, which approximates distributions by the empirical distributions of trajectories, and finite state projection (FSP)78, which directly integrates a version of the master equation confined to a finite $s$. Specifically, if we only care about a particular species $i$, we can evaluate its marginal using a grid of size $N s_i$ with $s\log s$ time complexity. In the worst-case scenario, FSP requires a grid of size $s$ with $s^3$ time complexity, as evaluating a particular marginal requires explicitly evaluating the probabilities for the entire grid, then marginalizing. Similarly, SSA requires explicitly simulating the entire system to obtain the marginals, and has the drawback of the usual inverse square root Monte Carlo convergence146,147. In addition, FSP is not compatible with the generating function manipulations used to represent technical noise, SSA is relatively challenging to adapt to time-dependent rates148, and neither FSP nor SSA is readily compatible with continuous stochastic processes (although exact20 and approximate20,149,150 hybrid schema can be constructed with some work). In the future, the "curse of dimensionality"— the reliance on grid evaluation—may be possible to bypass altogether by training neural networks to predict probability distributions, but this approach is still in its infancy151–154 and will require considerable further development to apply to general systems.

Nevertheless, SSA and FSP are substantially more general than the approach we outline here. The simulation- and matrix-based methods only require a list of reactions, whereas the generating function methods also require those reactions to produce readily solvable partial differential equations. We have omitted phenomena which would be trivial to treat using FSP and SSA, such as regulation involving feedback. (In principle, one can always construct “synthetic likelihoods” for inference by fitting a function approximator to the results of stochastic simulations, even for highly nonlinear and chaotic systems155157.) To our knowledge, these phenomena, which are mathematically analogous to adding multi-molecular interaction terms, cannot be directly treated with the method of characteristics. Instead, a mathematically precise treatment of them requires perturbative methods77 or fairly complicated special function manipulations101104, which do not easily generalize. We illustrate the challenges in supplemental information, using the example of downstream species catalyzing gene state transitions.

On the other hand, there are a number of ways to treat systems involving feedback approximately. Approaches like the linear-mapping approximation158 permit the derivation of approximate but accurate generating functions for such systems, which can then be used in standard inference pipelines. Alternatively, using only the results presented here, the net effect of feedback can be captured in the time-dependence of certain parameters (e.g., burst sizes) if dynamics are sufficiently chaotic, or if the time scale of feedback is slow compared to other system time scales.

We have, until now, stressed applications to “snapshot” single-cell data from dissociated tissues; however, our framework may be extended to spatial single-cell data; for instance, we can define transcriptional parameters that depend on the cell’s coordinates in the tissue. In this case, the typical systems biology goals translate to fitting a time- and space-dependent function that governs these parameters. However, the generating function formulation relies on the assumption of cells being stochastically independent; it is far from clear that this should hold for densely sampled spatial data, and more sophisticated alternatives, such as agent-based models, may be needed159,160.

Despite these challenges, the framework is already quantitatively useful. To fully “explain” a dataset, we need to fit gene-specific transcriptional mechanisms, genome-wide technical noise and co-expression parameters, and cell type structure, while controlling for potential misspecification. However, at this time, it may be more fruitful to focus on narrower questions, using assumptions, orthogonal data, or simulated benchmarking to justify omitting some parts of the problem19. We have applied this “bottom-up” approach to single-cell data, considering, in turn, the estimation of transcriptional kinetics and technical noise21,161, the identification of transcriptional models20, the analysis of co-regulation patterns16, and the determination of nuclear transport kinetics140. Conversely, it may be valuable to apply a “top-down” approach, augmenting an existing method with biophysically meaningful noise, as we have proposed in the context of transient processes19 and neural network dimensionality reduction71.

We anticipate that making meaningful progress on the stochastic modeling project championed by Wilkinson will require extended “real contact”162 between systems biology, genomics, and mathematics. The general framework we propose, which unifies a variety of previous work, represents one step towards this synthesis. The role of mathematics here is key; as Wilkinson noted, the stochastic systems biology of single cells cannot be “properly understood” without stochastic mathematical models.

RESOURCE AVAILABILITY

Lead Contact

Further information and requests for resources and reagents should be directed to and will be fulfilled by the lead contact, Lior Pachter (lpachter@caltech.edu).

Materials Availability

This study did not generate new materials.

Data and Code Availability

  • This paper analyzes existing, publicly available data. The accession numbers for the datasets are listed in the key resources table. Pseudoaligned count matrices in the mtx format have been deposited at the Zenodo package 8132976. The data, Monod fits, and analysis scripts used to generate Figure 5d-e, originating from Gorin et al.21, were previously deposited as the Zenodo package 7388133.

  • All original code has been deposited at https://github.com/pachterlab/GVP_2023 and the Zenodo package 8132976, and is publicly available as of the date of publication. DOIs are listed in the key resources table.

  • Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request.

KEY RESOURCES TABLE

REAGENT or RESOURCE | SOURCE | IDENTIFIER
Deposited Data | |
H. sapiens peripheral blood 10x v3 scRNA-seq data | 178 | pbmc_1k_v3
M. musculus heart 10x v3 scRNA-seq data | 179 | heart_1k_v3
M. musculus neuron 10x v3 scRNA-seq data | 180 | neuron_1k_v3
M. musculus cultured embryonic stem cells treated with DMSO 10x v2 scRNA-seq data | Desai et al. | desai_dmso
H. sapiens peripheral blood 10x v2 scRNA-seq data (technical replicate of pbmc_1k_v3) | 181 | pbmc_1k_v2
M. musculus neuron 10x v3 snRNA-seq data | 182 | brain_nuc_5k_v3
Supporting data for GP_2021_3 | Gorin and Pachter | Zenodo: dataset 7388133
Software and Algorithms | |
Python | python.org | 3.9.1
NumPy | numpy.org | 1.22.1
SciPy | scipy.org | 1.7.3
pandas | pandas.pydata.org | 1.2.4
kallisto|bustools | Melsted and Booeshaghi et al. | 0.26.0
Monod | Gorin and Pachter | 2.5.0
Other | |
Count matrices for all datasets | This manuscript | Zenodo: dataset 8132976
Custom analysis notebooks | This manuscript | GitHub: https://github.com/pachterlab/GVP_2023 (version of record deposited at Zenodo: dataset 8132976)

6. METHODS

6.1. Master equation models of transcription

We are interested in continuous-time stochastic processes that combine categorical, nonnegative discrete, and (usually nonnegative) continuous degrees of freedom. To solve these systems, we begin by separately defining their allowed transitions and converting them to master equation forms.

The categorical variable, denoted by $s \in \{1,\dots,N\}$, represents the instantaneous state of a multi-state gene. By assuming that the state interconversions are Markovian and independent of all other components of the system, we can define $H_{ij}$, the rates of transitioning from state $i$ to state $j$:

$$\mathcal{S}_i \xrightarrow{\;H_{ij}\;} \mathcal{S}_j. \tag{35}$$

These rates can be summarized in the state transition matrix $H \in \mathbb{R}^{N\times N}$, such that $H_{ii} = -\sum_{j\neq i} H_{ij}$ and $\sum_j H_{ij} = 0$ to enforce the conservation of probability. This set of transitions can be represented by a master equation involving finitely many ODEs, which tracks the probabilities of each state $s$ at a time $t$:

$$\frac{\partial P(s,t)}{\partial t} = \sum_{i=1}^{N} H_{is}\, P(i,t), \quad\text{or more compactly}\quad \frac{\partial \mathbf{P}(t)}{\partial t} = H^T \mathbf{P}. \tag{36}$$

As this system is expressed in terms of a differential equation for an arbitrary time t, the relation holds for time-dependent H. For simplicity, we assume that H is deterministic.
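For constant $H$, Equation 36 is solved by the matrix exponential. The following minimal sketch propagates a two-state telegraph gene; the rate values are illustrative, not drawn from the text.

```python
import numpy as np
from scipy.linalg import expm

# Two-state (telegraph) gene: off -> on at rate k_on, on -> off at rate k_off.
k_on, k_off = 0.5, 2.0
H = np.array([[-k_on,  k_on],
              [ k_off, -k_off]])        # each row sums to zero (probability conservation)

P0 = np.array([1.0, 0.0])               # start in the "off" state
t = 1.5
Pt = expm(H.T * t) @ P0                 # P(t) = exp(H^T t) P(0) for time-independent H
print(Pt, Pt.sum())                     # state probabilities; the total stays 1
```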

The nonnegative discrete variables, denoted by $\mathbf{x} \in \mathbb{Z}_{\geq 0}^{n}$, represent molecular copy numbers. We assume that $n$ molecular species participate in four classes of transitions, and can summarize their effect by considering their reaction schema and effect on $x_i$, the number of molecules of species $i$:

$$\mathcal{X}_i \xrightarrow{\;c_{ij}\;} \mathcal{X}_j, \qquad \mathcal{X}_i \xrightarrow{\;c_{i0}\;} \varnothing, \qquad \mathcal{X}_i \xrightarrow{\;q_{ij}\;} \mathcal{X}_i + \mathcal{X}_j, \qquad \varnothing \xrightarrow{\;\alpha_\omega\;} B_{i_1}\mathcal{X}_{i_1} + \dots + B_{i_\omega}\mathcal{X}_{i_\omega}. \tag{37}$$

First, species $i$ can be converted to species $j$ with rate $c_{ij} x_i$. Second, species $i$ can spontaneously degrade with rate $c_{i0} x_i$. These classes of monomolecular transitions, which either maintain or reduce the total number of molecules in the system, can be summarized in the matrix $C^{dd} \in \mathbb{R}^{n\times n}$, such that $C^{dd}_{ij} = c_{ji}$ and $C^{dd}_{ii} = -c_{i0} - \sum_{j\neq i} c_{ij}$; $C^{dd}$ is the matrix governing the associated reaction rate equations17,85. Third, species $i$ participates in autocatalysis at the rate $q_{ii}$, or catalysis of species $j$ at the rate $q_{ij}$. These reactions can be summarized by the matrix $Q^d \in \mathbb{R}_{\geq 0}^{n\times n}$, such that $Q^d_{ij} = q_{ji}$. Finally, molecules can be produced. In the general case, a burst of production simultaneously creates molecules of $\omega$ discrete species $\{i_1,\dots,i_\omega\}$. We assume bursts are described by a Poisson arrival process, with burst frequency $\alpha^d_\omega$ and the nontrivial $\omega$-variate joint distribution $p^d_\omega(\mathbf{z})$ of non-negative burst sizes $\{B_{i_1},\dots,B_{i_\omega}\}$16. This formulation includes the trivial case of Poisson point process production of species $i$, for which $\omega = 1$ and $p^d_\omega(\mathbf{z})$ is the degenerate distribution located at unity for species $i$ and zero for all other species.

This mass action model, which tracks molecule counts, can be represented by an equivalent discrete chemical master equation, which tracks the probability of each microstate x:

$$\begin{aligned} \frac{\partial P(\mathbf{x},t)}{\partial t} ={}& \sum_{i=1}^{n} c_{i0}\left[(x_i+1)\,P(x_i+1,t) - x_i\,P(\mathbf{x},t)\right] + \sum_{i,j=1}^{n} c_{ij}\left[(x_i+1)\,P(x_i+1, x_j-1, t) - x_i\,P(\mathbf{x},t)\right] \\ &+ \sum_{i=1}^{n} Q^d_{ii}\left[(x_i-1)\,P(x_i-1,t) - x_i\,P(\mathbf{x},t)\right] + \sum_{i,j=1}^{n} Q^d_{ji}\left[x_i\,P(x_j-1,t) - x_i\,P(\mathbf{x},t)\right] \\ &+ \sum_{\omega} \alpha^d_\omega \left[\sum_{\mathbf{z}} p^d_\omega(\mathbf{z})\,P(\mathbf{x}-\mathbf{z},t) - P(\mathbf{x},t)\right]. \end{aligned} \tag{38}$$

For simplicity of notation, species that do not occur in a reaction are elided from the master equation, as in previous work on modeling bursty transcription16. As above, this equation holds even if the rates are time-dependent. For the purposes of this report, we assume only αω and pω can vary over time.

The nonnegative continuous variables, denoted by $\mathbf{y} \in \mathbb{R}_{\geq 0}^{m}$, represent concentrations or coarsely-modeled noise sources. We assume that these variables are governed by Ornstein–Uhlenbeck-type stochastic differential equations:

$$d\mathbf{y}_t = C^{cc}\mathbf{y}_t\,dt + \mathcal{Q}^c(\mathbf{y}_t)\,d\mathbf{W}_t + \sum_\omega dL_\omega(t), \tag{39}$$

where $\mathbf{y}_t$ is a realization of the process, $\mathbf{W}_t$ is a $w$-dimensional Brownian motion, and $L_\omega$ is a subordinator. The matrix $C^{cc} \in \mathbb{R}^{m\times m}$ sets the mean-reversion terms, whereas the operator $\mathcal{Q}^c(\mathbf{y}_t): \mathbb{R}_{\geq 0}^{m} \to \mathbb{R}_{\geq 0}^{m\times w}$ sets the level of noise. We assume that each $L_\omega$ only includes drift or compound Poisson terms. The drift terms have the form $\alpha^c_i\,dt$ for dimension $i$. To slightly lighten the notation, we can aggregate all drift terms under $\omega = 1,\dots,m$, as $\{\alpha^c_1\,dt, \dots, \alpha^c_m\,dt\}$; some of these entries may be zero. The compound Poisson terms have the form $\sum_{k=0}^{N_\omega(t)} (B_\omega)_k$163, such that $N_\omega(t)$ is a Poisson random variable with mean $\alpha^c_\omega t$ and $(B_\omega)_k$ is a set of independent and identically distributed realizations of the random variable $B_\omega$. This random variable has a nontrivial $\omega$-variate joint density $p^c_\omega(\mathbf{z})$ on $\mathbb{R}_{\geq 0}^{m}$, with the remaining $m-\omega$ dimensions concentrated at zero. We note that this formulation entails a slight abuse of notation, as $\omega$ is used to index over discrete burst processes as well as continuous drift and jump components.

For simplicity, we assume the noise term takes the form of an uncoupled square-root diffusion, such that $w = m$ and $\mathcal{Q}^c(\mathbf{y}_t) = \operatorname{diag}(\boldsymbol{\sigma} \circ \sqrt{\mathbf{y}_t})$. The symbol $\circ$ denotes the elementwise/Hadamard product of two vectors, the square root should be interpreted as elementwise, and all elements of the constant volatility vector $\boldsymbol{\sigma}$ are non-negative. Although this choice of $\mathcal{Q}^c$ is somewhat restrictive, it produces a particularly simple diffusion tensor $\Sigma$:

$$\Sigma(\mathbf{y}) := \tfrac{1}{2}\,\mathcal{Q}^c(\mathbf{y})\,\mathcal{Q}^c(\mathbf{y})^T = \tfrac{1}{2}\operatorname{diag}\!\left(\boldsymbol{\sigma}^2 \circ \mathbf{y}\right), \tag{40}$$

where the square σ2 should be interpreted as elementwise. This formulation can be reframed as a Fokker-Planck equation164, which tracks the probability density of each microstate y:

$$\frac{\partial P}{\partial t} = -\sum_{i,j=1}^{m} C^{cc}_{ji}\,\frac{\partial}{\partial y_j}\left[y_i P\right] + \frac{1}{2}\sum_{i=1}^{m}\sigma_i^2\,\frac{\partial^2}{\partial y_i^2}\left[y_i P\right] - \sum_{i=1}^{m}\alpha^c_i\,\frac{\partial P}{\partial y_i} + \sum_{\omega > m} \alpha^c_\omega\left[\int_{\mathbf{z}} p^c_\omega(\mathbf{z})\,P(\mathbf{y}-\mathbf{z},t)\,d\mathbf{z} - P(\mathbf{y},t)\right]. \tag{41}$$

As above, we assume that only the components of Lω vary in time.

In addition to these discrete- and continuous-only terms, we need to account for these components’ interactions. For example, we may want to represent the production of a discrete species controlled by a continuous variable, e.g., a time-varying transcription rate20:

$$y_i \xrightarrow{\;c_{ij}\;} \mathcal{X}_j. \tag{42}$$

This reaction has the rate $y_i c_{ij}$. This class of reactions can be summarized in the matrix $C^{cd} \in \mathbb{R}_{\geq 0}^{m\times n}$, such that $C^{cd}_{ij} = c_{ji}$. In other words, this class of reactions contributes the following terms to the overall master equation:

$$\sum_{i=1}^{m}\sum_{j=1}^{n} C^{cd}_{ji}\left[y_i\,P(x_j - 1, \mathbf{y}, t) - y_i\,P(\mathbf{x},\mathbf{y},t)\right]. \tag{43}$$

Finally, we may want to represent the production of a continuous species from a discrete one, e.g., the rapid translation of high-abundance protein from low-abundance RNA138. This class of reactions simply adds a term proportional to $C^{dc}\mathbf{x}\,dt$ to the expression for $\mathbf{y}_t$. The matrix $C^{dc} \in \mathbb{R}_{\geq 0}^{m\times n}$ contains the relevant rates, such that $C^{dc}_{ij}$ is the rate of producing the continuous species $i$ from discrete species $j$. Therefore, we append a set of drift-like terms to the Fokker-Planck equation:

$$-\sum_{i=1}^{n}\sum_{j=1}^{m} C^{dc}_{ji}\,x_i\,\frac{\partial P(\mathbf{x},\mathbf{y},t)}{\partial y_j}. \tag{44}$$

To construct the full master equation, we need to define a system of N coupled equations. To do so, we essentially add Equations 36, 38, 41, 43, and 44, replacing all instances of P with P(s,x,y,t). However, to account for differences in transcription between gene states, we allow the ω-associated terms to vary with s. The full master equation is reported below in Equation 45.

6.2. The full master equation

The full master equation for P(s,x,y,t) is:

$$\begin{aligned} \frac{\partial P}{\partial t} ={}& \sum_{i=1}^{N} H_{is}(t)\,P(i,\mathbf{x},\mathbf{y},t) \\ &+ \sum_{i=1}^{n} c_{i0}\left[(x_i+1)\,P(s,x_i+1,\mathbf{y},t) - x_i\,P(s,\mathbf{x},\mathbf{y},t)\right] + \sum_{i,j=1}^{n} c_{ij}\left[(x_i+1)\,P(s,x_i+1,x_j-1,\mathbf{y},t) - x_i\,P(s,\mathbf{x},\mathbf{y},t)\right] \\ &+ \sum_{i=1}^{n} Q^d_{ii}\left[(x_i-1)\,P(s,x_i-1,\mathbf{y},t) - x_i\,P(s,\mathbf{x},\mathbf{y},t)\right] + \sum_{i,j=1}^{n} Q^d_{ji}\left[x_i\,P(s,x_j-1,\mathbf{y},t) - x_i\,P(s,\mathbf{x},\mathbf{y},t)\right] \\ &+ \sum_{\omega} \alpha^d_{s,\omega}(t)\left[\sum_{\mathbf{z}} p^d_{s,\omega}(\mathbf{z},t)\,P(s,\mathbf{x}-\mathbf{z},\mathbf{y},t) - P(s,\mathbf{x},\mathbf{y},t)\right] \\ &- \sum_{i,j=1}^{m} C^{cc}_{ji}\,\frac{\partial}{\partial y_j}\left[y_i\,P(s,\mathbf{x},\mathbf{y},t)\right] + \frac{1}{2}\sum_{i=1}^{m}\sigma_i^2\,\frac{\partial^2}{\partial y_i^2}\left[y_i\,P(s,\mathbf{x},\mathbf{y},t)\right] - \sum_{i=1}^{m}\alpha^c_{s,i}(t)\,\frac{\partial P(s,\mathbf{x},\mathbf{y},t)}{\partial y_i} \\ &+ \sum_{\omega>m}\alpha^c_{s,\omega}(t)\left[\int_{\mathbf{z}} p^c_\omega(\mathbf{z})\,P(\mathbf{y}-\mathbf{z},t)\,d\mathbf{z} - P(\mathbf{y},t)\right] \\ &+ \sum_{i=1}^{m}\sum_{j=1}^{n} C^{cd}_{ji}\left[y_i\,P(x_j-1,\mathbf{y},t) - y_i\,P(\mathbf{x},\mathbf{y},t)\right] - \sum_{i=1}^{n}\sum_{j=1}^{m} C^{dc}_{ji}\,x_i\,\frac{\partial P(\mathbf{x},\mathbf{y},t)}{\partial y_j}. \end{aligned} \tag{45}$$

We annotate the terms in Table S1.

6.3. Generating function methods for biological stochasticity

The full master equation is fairly cumbersome and challenging to analyze directly. Therefore, analysis has to proceed by spectral methods. We use the generating function (GF), a length-N vector function G, such that each component is

$$G_s(\mathbf{g},\mathbf{h},t) = \int_0^\infty\!\!\cdots\!\int_0^\infty \sum_{x_1=0}^{\infty}\cdots\sum_{x_n=0}^{\infty}\left(\prod_{i=1}^n g_i^{x_i}\right)\left(\prod_{i=1}^m e^{h_i y_i}\right) P(s,\mathbf{x},\mathbf{y},t)\, dy_m \cdots dy_1 := \int_{\mathbf{y}}\sum_{\mathbf{x}} \mathbf{g}^{\mathbf{x}}\, e^{\mathbf{h}^T\mathbf{y}}\, P(s,\mathbf{x},\mathbf{y},t)\, d\mathbf{y},$$

where the final expression is the definition written in useful shorthand notation. Formally, the generating function is the combination of a probability-generating function (PGF) in the discrete variables and a moment-generating function (MGF) in the continuous variables. The arguments $\mathbf{g}$ (of length $n$) and $\mathbf{h}$ (of length $m$) are spectral variables. By computing the generating function of both sides of Equation 45, we find (see supplemental information) that the master equation is equivalent to a much more compact system of partial differential equations:

$$\frac{\partial \mathbf{G}}{\partial t} = H^T\mathbf{G} + \mathbf{G}\circ\boldsymbol{\mathcal{A}}(\mathbf{u}) + \mathbf{J}\left[C\mathbf{u} + \operatorname{diag}(\mathbf{u})\,D\mathbf{u}\right]. \tag{46}$$

This formulation relies on defining the unified variables u:

$$\mathbf{u} := \begin{bmatrix} \mathbf{g} - 1 \\ \mathbf{h} \end{bmatrix} \quad\text{and}\quad J_{si} = \frac{\partial G_s}{\partial u_i}, \tag{47}$$

as well as unified matrices:

$$C := \begin{bmatrix} (C^{dd})^T + (Q^d)^T & (C^{dc})^T \\ (C^{cd})^T & (C^{cc})^T \end{bmatrix}, \qquad D := \begin{bmatrix} (Q^d)^T & (C^{dc})^T \\ 0 & \tfrac{1}{2}\operatorname{diag}\boldsymbol{\sigma}^2 \end{bmatrix} := \begin{bmatrix} (Q^d)^T & (C^{dc})^T \\ 0 & (Q^c)^T \end{bmatrix}. \tag{48}$$

Each entry of the length-$N$ vector function $\boldsymbol{\mathcal{A}}$ consists of the burst and drift terms:

$$\mathcal{A}_s = (\boldsymbol{\alpha}^d_s)^T\left(\mathbf{F}_s(\mathbf{u}+1) - 1\right) + (\boldsymbol{\alpha}^c_s)^T\left(\mathbf{M}_s(\mathbf{u}) - 1\right). \tag{49}$$

The vector $\boldsymbol{\alpha}^d_s$ contains the frequencies of all discrete burst processes for state $s$. The first $m$ entries of $\boldsymbol{\alpha}^c_s$ contain the continuous species' drifts in state $s$. The remaining entries contain the corresponding rates of continuous burst processes. The vector $\boldsymbol{\alpha}_s$ aggregates these quantities. The vector function $\mathbf{F}_s$ contains the joint PGF of the discrete burst processes, and only depends on the first $n$ variables. The vector function $\mathbf{M}_s$ contains the drift terms, as well as the joint MGF of the continuous burst processes, and only depends on the last $m$ variables. The parameters of the $\mathcal{A}_s$ operator may vary in time.

To obtain the generating function at t, we apply the method of characteristics. First, we calculate the characteristics parametrized by the scalar variable 𝗌 :

$$T(\mathsf{s}) = t - \mathsf{s}, \qquad \frac{d\mathbf{U}(\mathsf{s})}{d\mathsf{s}} = C\,\mathbf{U}(\mathsf{s}) + \operatorname{diag}\!\left(\mathbf{U}(\mathsf{s})\right) D\,\mathbf{U}(\mathsf{s}), \tag{50}$$

where $\mathbf{U}(\mathsf{s}=0) = \mathbf{u}$. This is the "downstream" ODE, which governs abundances in isolation from production and regulation.

Therefore, G is governed by the following system of ordinary differential equations:

$$\frac{d\mathbf{G}(\mathbf{U}(\mathsf{s}),T(\mathsf{s}))}{d\mathsf{s}} = -H(T(\mathsf{s}))^T\mathbf{G} - \mathbf{G}\circ\boldsymbol{\mathcal{A}}(\mathbf{U}(\mathsf{s}),T(\mathsf{s})). \tag{51}$$

To obtain G at t, we integrate this matrix system from 𝗌=t to 𝗌=0. We use G0(U(t)) as the initial condition, where G0 is the generating function of the initial distribution. This is the “upstream” ODE, which governs the full generating function.

In the general case, evaluating this system requires two applications of quadrature: first, solving the n+m-dimensional downstream system to obtain the values of characteristics U at a set of grid points over [0, t]; then, solving the N-dimensional upstream system to obtain the value of the generating function.

Some special cases afford simpler solutions. If $D \neq 0$, the downstream ODE takes a Riccati-like form and generally resists exact analysis17,165. However, if $D = 0$ and $C$ is diagonalizable, the system takes the tractable linear form

$$\frac{d\mathbf{U}(\mathsf{s})}{d\mathsf{s}} = C\,\mathbf{U}(\mathsf{s}) := V^{-1}\Lambda V\,\mathbf{U}(\mathsf{s}), \quad\text{with the solution}\quad \mathbf{U}(\mathsf{s}) = V^{-1}e^{\Lambda\mathsf{s}}V\,\mathbf{u}, \tag{52}$$

whenever all eigenvalues of C are distinct. When they are not, the ODE can be solved in a similar way using generalized eigenvectors. Practically, this means that only one application of quadrature is required.
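As a concrete illustration, the following minimal sketch evaluates Equation 52 for a user-supplied $C$ with distinct eigenvalues; the two-species example corresponds to splicing followed by degradation, and the numerical values are arbitrary.

```python
import numpy as np

def characteristics(C, u, s_grid):
    """Evaluate U(s) = V^{-1} exp(Lambda s) V u on a grid (cf. Equation 52).

    Assumes C is diagonalizable with distinct eigenvalues; W below holds the
    right eigenvectors as columns, so it plays the role of V^{-1} in the text.
    """
    lam, W = np.linalg.eig(C)                # C = W diag(lam) W^{-1}
    coeffs = np.linalg.solve(W, u)           # transform the initial condition once
    return np.array([W @ (np.exp(lam * s) * coeffs) for s in s_grid])

# Nascent -> mature -> degradation: eigenvalues -beta and -gamma (distinct here).
beta, gamma = 1.0, 0.5
C = np.array([[-beta, beta],
              [0.0,  -gamma]])
U = characteristics(C, np.array([0.3, -0.2]), np.linspace(0.0, 10.0, 50))
```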

If, in addition, N=1, the upstream ODE reduces to a single integral:

$$\phi(t) = \phi_0(\mathbf{U}(t)) + \int_t^0 \frac{d\phi(\mathbf{U}(\mathsf{s}),T(\mathsf{s}))}{d\mathsf{s}}\,d\mathsf{s} = \phi_0(\mathbf{U}(t)) + \int_0^t \mathcal{A}(\mathbf{U}(\mathsf{s}),T(\mathsf{s}))\,d\mathsf{s}, \tag{53}$$

where $\phi := \log G$, $\phi_0 := \log G_0$, and the generating function $G$ is no longer boldfaced because only a single gene state exists.

If $\mathcal{A}$ is a linear operator $a_1 u_1 + \dots + a_{n+m} u_{n+m}$, the system is in the drift-only regime; no bursting occurs. In this case, the system reduces to

$$\phi(t) = \phi_0(\mathbf{U}(t)) + \sum_{i=1}^{n+m}\int_0^t a_i(t-\mathsf{s})\,U_i(\mathsf{s})\,d\mathsf{s}, \tag{54}$$

where $U_i$ are the components of $\mathbf{U}$. As each $U_i$ is, in turn, a weighted sum of $u_i$, the second term of the log-generating function is given by a sum of fairly simple convolutions that scale as $\int_0^t a_i(t-\mathsf{s})\,e^{\lambda_j \mathsf{s}}\,d\mathsf{s}$.

Finally, in the simplest case, if all eigenvalues $\lambda_i$ of $C$ are negative, the transient part of Equation 54 vanishes as $t\to\infty$ and the stationary log-generating function is a linear combination of $u_i$. This implies that the distribution converges to a product of independent Poisson distributions17,85.

6.4. Coupling multiple genes

These results solve master equations with abstracted production and processing reactions. To connect them to systems phenomena, such as the co-regulation of multiple genes, we need to specify how upstream interactions lead to co-expression. As the simplest illustrative model system, we can consider the co-regulation of two genes, indexed by $j$, with $U_j = u_j e^{-\gamma_j \mathsf{s}}$. We outline several relatively simple classes of candidate models which induce expression coupling.

In the simplest case, $\mathcal{A}(\mathbf{u},t) = \sum_j \mathcal{A}_j(u_j,t)$. In other words, the genes' dynamics are fully separable, and produce solutions in the form $G(\mathbf{u},t) = \prod_j G_j(u_j,t)$. This formulation produces independent distributions at each $t$, but the trajectories may possess nontrivial statistical relationships. For example, if both genes start at $x_1 = x_2 = 0$, their trajectories will be correlated over a finite timespan $[0, T]$, with the correlation decaying as $T\to\infty$.

In the next simplest case, co-regulation is the consequence of parameter differences in subpopulations. For example, the full cell population may consist of cell types indexed by κ. If we suppose each cell type has the abundance πκ and transcriptional parameters Θκ, we obtain

$$G(\mathbf{u},t) = \sum_\kappa \pi_\kappa\, G(\mathbf{u},t;\Theta_\kappa) = \sum_\kappa \pi_\kappa \prod_j G_j(u_j,t;\Theta_{j,\kappa}); \tag{55}$$

i.e., the generating function decomposes into a product of independent generating functions conditional on a particular cell type, but not globally. In other words, even if transcriptional processes are independent, cell type structure can produce nontrivial relationships between genes.

Alternatively, we can propose a model of co-regulation by the categorical variables. For example, two neighboring genes may prefer to have the same or opposite accessibility, depending on the polymeric properties of DNA. Assuming, for the purposes of illustration, that the system is symmetric, we obtain the following N=4 form:

$$H = \begin{bmatrix} -2k_{\mathrm{on}} & k_{\mathrm{on}} & k_{\mathrm{on}} & 0 \\ \varepsilon^{-1}k_{\mathrm{off}} & -\varepsilon^{-1}(k_{\mathrm{on}}+k_{\mathrm{off}}) & 0 & \varepsilon^{-1}k_{\mathrm{on}} \\ \varepsilon^{-1}k_{\mathrm{off}} & 0 & -\varepsilon^{-1}(k_{\mathrm{on}}+k_{\mathrm{off}}) & \varepsilon^{-1}k_{\mathrm{on}} \\ 0 & k_{\mathrm{off}} & k_{\mathrm{off}} & -2k_{\mathrm{off}} \end{bmatrix} \qquad \boldsymbol{\mathcal{A}} = \begin{bmatrix} 0 \\ k_{\mathrm{init}} u_1 \\ k_{\mathrm{init}} u_2 \\ k_{\mathrm{init}}(u_1+u_2) \end{bmatrix}. \tag{56}$$

This form encodes the co-regulation of two genes, such that $s \in$ {both off, gene 1 on, gene 2 on, both on}. If $\varepsilon \ll 1$, the intermediate states are unstable and the genes tend to be either both on or both off. If $\varepsilon \gg 1$, the intermediate states are particularly stable, and only one of the genes tends to be on at a time. If $\varepsilon = 1$, we recover the independent case.
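The structure of Equation 56 is easy to verify numerically: each row of $H$ sums to zero, and the stationary occupancy of the four gene states follows from the null space of $H^T$. The sketch below uses arbitrary rate values.

```python
import numpy as np
from scipy.linalg import null_space

def coupled_gene_H(k_on, k_off, eps):
    """Assemble the four-state transition matrix of Equation 56
    (states: both off, gene 1 on, gene 2 on, both on)."""
    r = 1.0 / eps
    H = np.array([
        [-2.0 * k_on,   k_on,                 k_on,                 0.0],
        [ r * k_off,   -r * (k_on + k_off),   0.0,                  r * k_on],
        [ r * k_off,    0.0,                 -r * (k_on + k_off),   r * k_on],
        [ 0.0,          k_off,                k_off,               -2.0 * k_off],
    ])
    assert np.allclose(H.sum(axis=1), 0.0)      # probability conservation
    return H

H = coupled_gene_H(k_on=1.0, k_off=0.5, eps=0.05)   # eps << 1: genes tend to switch together
pi = null_space(H.T).ravel()
pi /= pi.sum()                                      # stationary occupancy of the four states
```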

We can define a similar model for co-regulation by a continuous variable y1. For example, there may be a latent regulator, such as the concentration of an activator, that controls multiple loci: if it is high, both have a high transcription rate; otherwise, both are inactive20. This amounts to appending the following reactions to the master equation:

$$C^{cd}_{j1}\, y_1 \left[P(x_j - 1) - P(x_j)\right], \tag{57}$$

where the Ccd matrix encodes the relationship between the concentration and the transcription rate. Therefore, the genes become mutually correlated through the trajectory of y1, although the extent of correlation depends on the dynamics.

If the categorical or continuous driving process is bursty, we can approximate it by a co-bursting module. For example, in the limit of $\varepsilon \to 0$, the dynamics of the system in Equation 56 converge to the $N = 2$ formulation

$$H = \begin{bmatrix} -k^*_{\mathrm{on}} & k^*_{\mathrm{on}} \\ k^*_{\mathrm{off}} & -k^*_{\mathrm{off}} \end{bmatrix} \quad\text{and}\quad \boldsymbol{\mathcal{A}} = \begin{bmatrix} 0 \\ k_{\mathrm{init}}(u_1+u_2) \end{bmatrix}, \quad\text{where } k^*_{\mathrm{on}} = \frac{2k_{\mathrm{on}}^2}{k_{\mathrm{on}}+k_{\mathrm{off}}} \text{ and } k^*_{\mathrm{off}} = \frac{2k_{\mathrm{off}}^2}{k_{\mathrm{on}}+k_{\mathrm{off}}}. \tag{58}$$

If, in addition, $k^*_{\mathrm{off}}, k_{\mathrm{init}} \to \infty$, we obtain the $N = 1$ module characterized by

$$\mathcal{A} = k^*_{\mathrm{on}}\left[\frac{1}{1 - b(u_1+u_2)} - 1\right], \tag{59}$$

where $b := k_{\mathrm{init}}/k^*_{\mathrm{off}}$16. This is the bursty limit of Equation 56. Interestingly, that mechanism also possesses a slow mixture limit. If $\varepsilon \to \infty$ while $k_{\mathrm{on}}, k_{\mathrm{off}} \to 0$, we obtain a special case of Equation 55, with $\pi_\kappa = 1/2$ and mutually exclusive expression in the "cell types," or long-lived gene states.

Even when we restrict our analysis to simple feed-forward regulation, this outline of motifs is nowhere near exhaustive. Nevertheless, the “mixture” and “bursty” limits are particularly natural starting points, as their distributions are straightforward to construct. In other words, we speculate that the careful analysis of co-expression models can distinguish relationships due to “slow” variation between cell types and “fast” variation due to coupled transcriptional events.

6.5. Transient phenomena

This result yields a fairly simple numerical recipe for the determination of probabilities at a particular time $t$. Typically, analysis proceeds by assuming $H$ and $\boldsymbol{\mathcal{A}}$ are time-independent and letting $t\to\infty$, i.e., considering the stationary limit of the process. However, this may not be strictly justifiable: much of single-cell analysis involves the determination of trajectories from intrinsically transient data representing differentiation pathways166. If the transient process occurs on a timescale comparable to RNA turnover, using a stationary model may not be appropriate16.

To rigorously fit transient data, we need to posit just how a snapshot of cells may capture multiple cell states, such that some states are the progenitors of others. The solution is not yet clear, and multiple reasonable explanations exist; for example, we may suppose that the differentiation process “lags” in certain cells (in the vein of the models of variability proposed in Stumpf et al.44 for development, and in Sanders et al.167 and Perez-Carrasco et al.125 for the cell cycle). In other words, all cells are captured at a time t since the beginning of a process, but H and 𝓐 have different time-dependence for different cells. Although such an explanatory model can be instantiated, it may be too challenging to fit. Further, it does not appear to be compatible with processes that operate continuously; the choice of t becomes somewhat challenging to motivate.

We propose that the simplest model for observations relies on minimal synchronization between the biology and the experimental process. To mathematically formalize it, we take inspiration from the theory of reactor modeling in chemical engineering105 and extend preliminary work from our recent RNA velocity methods analysis19. A cell enters a medium; this entrance triggers a chemical signal that begins a transient process. The dynamics of this transient process are only dependent on time since receiving the signal, and identical between cells. After a delay, the cells exit the medium. In this framework, sequencing is the uniform random sampling of cells present within this medium. Although this formulation is admittedly simplistic—it excludes the cell cycle and stochastic driving—it allows us to take the first steps with a systematic study of using snapshot data to fit transient stochastic processes. This toy model is numerically tractable, which is useful for its simulation and characterization, and possesses a stationary state that is independent of the time at which the experiment is performed, which is useful for biological admissibility and realism.

Therefore, to marginalize over $t$, we need to augment the model with an additional property: the relationship between time along a transient process and the probability of capturing a cell. In the parlance of reactor engineering, this relationship is given by the internal-age distribution $f$. The simulations of transient processes in La Manno et al.86 and Bergen et al.59 implicitly adopt this model and assume a particular functional form of $f$. We might suppose cells enter the observation window at $t = 0$ and leave it at $t = T$, with a Dirac residence time distribution $\delta(t-T)$ and uniform sampling throughout this window. The resulting age distribution is uniform, with $f = T^{-1}$, and formally corresponds to the ideal plug flow reactor (PFR) architecture105. As $T\to\infty$, we obtain the $t\to\infty$ ergodic limit, if such a limit exists. On the other hand, if $f = \delta(t-T)$, we recover the instantaneous distribution at time $T$; this limit formally corresponds to the batch reactor (BR).

To obtain the generating function for the cells inside a tissue, we represent the tissue as a reactor, specify its influx and efflux properties, and solve for the internal-age distribution f. This internal-age distribution yields the occupation measure of the process times, as discussed in our RNA velocity review19, and induces the following reactor-wide generating function:

$$G = \int_t G(t)\, f(t)\,dt, \quad\text{where}\quad G(t) = \sum_s G_s(t). \tag{60}$$

We have marginalized over the instantaneous gene state s because this variable is typically not observable.

6.6. Droplet encapsulation noise

The generating function G describes the biological variability due to molecular processes, transcriptional driving, and the capture of cells from a reaction medium. However, single-cell RNA sequencing data do not quantify cells—they quantify barcodes. Cells are randomly encapsulated into droplets with barcoded beads; to avoid the formation of “doublets,” with two cells per droplet, the microfluidic protocols typically have a fairly low encapsulation rate. If we assume that a droplet may have either zero or one cells, we obtain the following generating function for the distribution of RNA on a per-barcode level:

$$G_{\mathrm{enc}} = p_1 G + p_0 = pG + (1-p) = G_{\mathrm{bc}}(G), \tag{61}$$

where $G_{\mathrm{bc}}$ is the PGF of the Bernoulli distribution, with $p_1 = p$ the probability of capturing a single cell and $p_0 = 1-p$ that of capturing none. Analogously, if we assume that doublets can occur, and the encapsulation of cells is independent and identically distributed (i.i.d.), we find

$$G_{\mathrm{enc}} = p_2 G^2 + p_1 G + p_0 = p^2 G^2 + 2p(1-p)G + (1-p)^2 = \left[pG + (1-p)\right]^2 = G_{\mathrm{bc}}(G), \tag{62}$$

where Gbc is now the PGF of the binomial distribution. It is straightforward to extend this to the unconstrained case, with per-cell encapsulation rate λ, and obtain the analogous expression

$$G_{\mathrm{enc}} = p_0 + p_1 G + p_2 G^2 + p_3 G^3 + \dots = e^{\lambda(G-1)} = G_{\mathrm{bc}}(G), \tag{63}$$

where Gbc is the PGF of the Poisson distribution.
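Numerically, these compositions are convenient because the per-barcode probabilities can be recovered by evaluating the composed PGF on the unit circle and inverting with a fast Fourier transform. The sketch below uses a negative binomial stand-in for the cell-level law $G$ and arbitrary parameter values.

```python
import numpy as np

def pgf_nb(g, r, p):
    """PGF of a negative binomial law, used here as a stand-in for the cell-level G."""
    return (p / (1.0 - (1.0 - p) * g)) ** r

M = 128                                        # grid size; should exceed the support of interest
g = np.exp(2j * np.pi * np.arange(M) / M)      # evaluation points on the unit circle

G_cell = pgf_nb(g, r=2.0, p=0.3)
G_barcode = np.exp(0.1 * (G_cell - 1.0))       # Poisson encapsulation (Equation 63), lambda = 0.1

pmf = np.real(np.fft.fft(G_barcode)) / M       # invert the PGF to per-barcode probabilities
pmf = np.clip(pmf, 0.0, None)
```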

However, even empty droplets typically contain some “background” molecules. Removing the empty droplets by filtering for cells with relatively high expression, as well as correcting for the background, is a standard part of sequencing workflows57,109112. To model the joint distribution of biological and background RNA, we need to instantiate a mechanistic hypothesis about its source. The simplest hypothesis consists of two parts. First, we impose the pseudobulk interpretation of background: we assume that a fraction of the cells loaded in the library construction step are lysed, and produce a pool of loose molecules. Next, we assume that these molecules are free to be encapsulated into the droplets in an i.i.d. fashion. This implies the Poisson functional form for the distribution of debris entering each droplet:

$$G_{\mathrm{bg}} = \exp\!\left(c\sum_i \mu_i u_i\right), \tag{64}$$

where $c$ is some shared constant that reflects the pool size and the rate of diffusion, whereas $\mu_i = \left.\frac{\partial G}{\partial u_i}\right|_{\mathbf{u}=\mathbf{0}}$ is the expectation of species $i$ over the entire cell population. This simplest model assumes that all cells are equally likely to lyse and release their contents; if this assumption is violated, $\mu_i$ needs to be obtained by computing an expectation with respect to a measure biased toward the less stable cells. Finally, the full per-droplet distribution of molecules is

$$G_{\mathrm{tot}} = G_{\mathrm{bc}}\, G_{\mathrm{bg}}, \tag{65}$$

i.e., each droplet contains contributions from the encapsulated cells, as well as the background. With some abuse of notation, we occasionally use the expression Gbc(G)Gbg(G), where the first argument denotes composition, whereas the second denotes functional dependence.

6.7. Library construction and sequencing noise

We cannot observe the biological molecule content of each droplet: we are restricted to analyzing counts of complementary DNA (cDNA). In a typical dual-index 3’ microfluidic workflow (e.g., the commercialized 10x chemistry48), these cDNA are quantified by the following sequence of reactions. First, a synthetic primer captures a poly(A) stretch in RNA, which may be an endogenous molecule or a synthetic tag168. The primer contains a poly(dT) oligonucleotide, a sequencing primer, a cell barcode, and a unique molecular identifier (UMI). Next, reverse transcriptase (RTase) attaches to the RNA-primer complex and synthesizes the complementary strand. When the first strand is complete, a template-switching oligonucleotide (TSO) attaches to the end, allowing RT to synthesize the second strand of cDNA. After library construction, the droplet emulsion is broken, producing a pool of long cDNA; polymerase chain reaction (PCR) is used to amplify this pool. The long cDNA molecules are enzymatically fragmented, and another sequencing primer is attached at the end of the molecule that formerly contained the TSO. Finally, another round of PCR amplifies the pool and appends sample indices and Illumina adaptors to both sides of the molecule. The pool of cDNA is loaded onto a sequencing machine and sequenced from both sides, producing two reads. One read contains the barcode and UMI bases, whereas the other contains partial information about the 3’ end of the molecule, beginning at the fragmentation site. This sequence of reactions represents the ideal-case scenario, and the products may well include artifacts due to off-target reactions169.

To understand the effect of technical variability on the per-barcode distributions, we need to summarize this workflow in a mechanistic model. First, we assume that the library preparation reactions occur in an i.i.d. fashion relative to each RNA molecule in the droplet, allowing us to construct a separate description of technical noise for each discrete molecular species indexed by i. At this stage, we omit the modeling of continuous species. As we quantify the number of UMIs, we can considerably simplify the description by splitting the workflow into the initial cDNA synthesis and all downstream steps. For the cDNA synthesis, we may choose one of two models:

$$\mathcal{X}_i \rightarrow \mathcal{X}_i + \mathcal{T}_i \quad\text{or}\quad \mathcal{X}_i \rightarrow \mathcal{T}_i. \tag{66}$$

In the first model, the formation of a UMI-tagged cDNA 𝓣i is non-sequestering, and the template RNA 𝒳i can participate in further cDNA synthesis. In other words, a single RNA molecule can produce more than one cDNA with distinct UMIs. In the second model, the cDNA synthesis is sequestering, and each RNA can template at most one cDNA with a particular UMI. For the downstream steps, if we assume the PCR and sequencing steps produce results that are reasonably faithful to their templates, we are essentially restricted to a single model:

$$\mathcal{T}_i \rightarrow \varnothing. \tag{67}$$

In other words, the sequence of steps after the formation of cDNA 𝓣i may lose some UMIs, but it cannot create them. Aggregating these steps, we find the shifted per-molecule generating function for technical noise:

$$G^*_{t_i} = G_{t_i} - 1 = \begin{cases} e^{\lambda_i(g_i - 1)} - 1 = e^{\lambda_i u_i} - 1 & \text{(non-sequestering)} \\ p_i g_i + (1-p_i) - 1 = p_i u_i & \text{(sequestering)}, \end{cases} \tag{68}$$

where $\lambda_i = \lambda_{i,c}\, p_{i,p}$ and $p_i = p_{i,c}\, p_{i,p}$. Here, $\lambda_{i,c}$ is the overall Poisson rate of the catalytic production of cDNA $\mathcal{T}_i$ with distinct UMIs, $p_{i,c}$ is the probability of producing a single cDNA $\mathcal{T}_i$ in a non-catalytic fashion, and $p_{i,p}$ is the probability of retaining a molecule of $\mathcal{T}_i$ through the PCR steps. It is straightforward to use a Taylor expansion to observe that the limit $\lambda_{i,c} \ll 1$ yields the Bernoulli form: if non-sequestering sequencing is relatively slow or inefficient, the probability of obtaining multiple cDNA from a single RNA is low, and the mathematically simpler Bernoulli noise form approximately holds16,161.
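In simulation, the two per-molecule models of Equation 68 amount to binomial thinning (sequestering) versus a Poisson number of distinct-UMI cDNAs per molecule (non-sequestering). The sketch below uses arbitrary parameters and a negative binomial stand-in for the biological counts.

```python
import numpy as np

rng = np.random.default_rng(0)

def observe_sequestering(x_true, p):
    """Each RNA molecule yields at most one UMI: binomial thinning with probability p."""
    return rng.binomial(x_true, p)

def observe_nonsequestering(x_true, lam):
    """Each RNA molecule yields a Poisson(lam) number of distinct-UMI cDNAs."""
    return rng.poisson(lam * x_true)

x_true = rng.negative_binomial(5, 0.3, size=10_000)   # stand-in for biological counts
seq = observe_sequestering(x_true, p=0.1)
nonseq = observe_nonsequestering(x_true, lam=0.1)
# For lam << 1, the two observed laws nearly coincide, per the Taylor argument above.
```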

Using the properties of PGFs21, we find that the overall generating function is given by a simple composition, substituting $G_{t_i}$ for $g_i$:

$$G_{\mathrm{tot},t} = G_{\mathrm{tot}}(\mathbf{G}^*_t), \tag{69}$$

where we use the $G_{\mathrm{tot}}(\mathbf{u})$ parametrization, and each entry of $\mathbf{G}^*_t$ contains the shifted generating function $G^*_{t_i}$ for a particular species $i$.

Finally, the reads associated with each cDNA $\mathcal{T}$ are not always uniquely identifiable: for example, the sequence content is typically sufficient to identify the gene, but if a read only covers an exonic portion of the gene, it is impossible to distinguish whether or not the original molecule has been spliced139. To correctly represent this ambiguity, we need to transform the arguments of the generating function from a length-$n$ vector to a length-𝓃 vector, such that 𝓃 is the total number of mutually distinguishable classes of molecules. The simplest form of this transformation is a linear categorical partition:

$$\mathbf{g} = \mathcal{P}^{a}\, 𝓰, \tag{70}$$

where $\mathcal{P}^{a}$ is an $n \times$ 𝓃 ambiguity matrix, with $\mathcal{P}^{a}_{i,𝓲}$ giving the probability of molecule $i$ being identifiable in the equivalence class 𝓲. We assume that each molecule can be assigned to at least one class, implying $\sum_{𝓲} \mathcal{P}^{a}_{i,𝓲} = 1$. In principle, only the constraint $\sum_{𝓲} \mathcal{P}^{a}_{i,𝓲} \leq 1$ is mandatory, but the loss of molecules can be equivalently reframed as a technical noise component in $\mathbf{G}^*_t$.

We discuss the general case of this model component in Section S3. In summary, the entries of $\mathcal{P}^{a}$ are challenging to identify, but it may be possible to exploit genomic information, polymer physics, and orthogonal long-read sequencing data to construct it from first principles. This formulation admits several special cases. For example, if we cannot distinguish any distinct species at all and can only quantify the total RNA content, 𝓃 $= 1$ and $\mathcal{P}^{a}_{i,𝓲} = 1$ for each $i$. Then we obtain

$$(\mathbf{g})_i = 𝓰 \;\text{ for all } i \quad\text{and}\quad G(𝓰) = G\!\left(\begin{bmatrix} 𝓰 \\ \vdots \\ 𝓰 \end{bmatrix}\right). \tag{71}$$

On the other hand, if all species are perfectly identifiable, we obtain 𝓃 $= n$ and $\mathcal{P}^{a} = I_n$, the $n$-dimensional identity matrix. If, say, we have $n = 2$ but 𝓃 $= 3$, as in the case of nascent, mature, and ambiguous molecules described in La Manno et al.86 and Eldjárn Hjörleifsson et al.139, we obtain

$$G(𝓰) = G\!\left(\begin{bmatrix} \mathcal{P}^{a}_{1,1}\,𝓰_1 + \mathcal{P}^{a}_{1,3}\,𝓰_3 \\ \mathcal{P}^{a}_{2,2}\,𝓰_2 + \mathcal{P}^{a}_{2,3}\,𝓰_3 \end{bmatrix}\right), \tag{72}$$

where 𝓰1 and 𝓰2 correspond to two unambiguously identifiable species, whereas 𝓰3 corresponds to ambiguous cDNA which may have come from either. In the general case, we find

$$\mathbf{u} = \mathcal{P}^{a} 𝓰 - 1 = \mathcal{P}^{a}(𝓾 + 1) - 1 = \mathcal{P}^{a} 𝓾 = \mathbf{G}^{a}(𝓾) - 1 := \mathbf{G}^{a*}(𝓾), \tag{73}$$

where each entry of the vector Ga contains the generating function of the relevant categorical distribution that governs how species i is parsed as one of the 𝓃 identifiable species:

$$\left(\mathbf{G}^{a}(𝓾)\right)_i = \sum_{𝓲} \mathcal{P}^{a}_{i,𝓲}\, 𝓰_{𝓲}. \tag{74}$$

Therefore, the overall GF takes the following form:

$$G^{a}_{\mathrm{tot},t} = G_{\mathrm{tot},t}\!\left(\mathbf{G}^{a*}(𝓾)\right). \tag{75}$$
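At the level of simulated molecules, the linear categorical partition amounts to multinomially splitting each species' counts among the observable classes according to the corresponding row of the ambiguity matrix. The sketch below uses an illustrative $n = 2$, 𝓃 $= 3$ matrix (nascent, mature, and ambiguous classes); the probabilities are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)

# Rows: true species (nascent, mature); columns: observable classes
# (unambiguously nascent, unambiguously mature, ambiguous). Each row sums to 1.
P_a = np.array([[0.7, 0.0, 0.3],
                [0.0, 0.8, 0.2]])

def assign_classes(x_true, P_a):
    """Multinomially partition each species' molecules among observable classes."""
    counts = np.zeros(P_a.shape[1], dtype=int)
    for i, x in enumerate(x_true):
        counts += rng.multinomial(x, P_a[i])
    return counts

observed = assign_classes(np.array([12, 30]), P_a)    # e.g., 12 nascent and 30 mature molecules
```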

6.8. Example systems

The equation above provides a generic, modular framework for characterizing variability in sequencing experiments. To fit it to data, we need to specify a particular set of models for each step of the process. To do so, we should first strive to understand which modular components are realistic based on relatively simple summaries of data. Further, the process of evaluating and fitting these models is fairly involved, and often requires substantial up-front work to design scalable solvers. Therefore, it is useful to understand their qualitative behaviors relevant to statistical inquiry. In the current section, we characterize some analytically tractable systems, as well as their identifiability properties, such as our ability to distinguish between different models and parameter regimes. To illustrate these points, we apply the models to real and simulated data and speculate about their implications and physical relevance.

6.8.1. Special theoretical cases

We revisit Section 6.3 to emphasize the implications and advantages of unifying the discrete and continuous degrees of freedom of the biological model in a common framework. The similarity of the discrete and continuous generating function terms is not accidental, and follows directly from the Poisson representation93. Occasionally, we can exploit this representation to bypass calculations for discrete processes by referring to results from the study of continuous processes, and vice versa. This approach consists of writing down the generating function PDE for a discrete process, identifying a continuous process governed by the same PDE, obtaining its solution from the stochastic process literature, and asserting that the discrete process distribution is given by compounding a Poisson distribution with the continuous law.

For instance, we may consider the case of a system with constitutive transcription at rate α, autocatalysis at rate q, and degradation at rate γ(N=1,n=1,m=0):

$$\varnothing \xrightarrow{\;\alpha\;} \mathcal{X}, \qquad \mathcal{X} \xrightarrow{\;\gamma\;} \varnothing, \qquad \mathcal{X} \xrightarrow{\;q\;} 2\mathcal{X}. \tag{76}$$

We can represent these reactions by the matrices $C = -\gamma + q$ and $D = q$, as well as the operator $\mathcal{A}(u) = \alpha u$. This system was introduced, but not treated, in Jahnke and Huisinga85, and, to our knowledge, first solved with master equation and generating function calculations by Vastola17. However, we can also solve it merely by matching terms, without any new calculations. We provide the full details of the parameter-matching process in Method S2.1. The derivation consists of noticing that the functional form of $C$, $D$, and $\mathcal{A}$ can also arise from an $N=1$, $n=0$, $m=1$ system with drift $\alpha$, square-root noise $\sigma = \sqrt{2q}$, and mean-reversion at the rate $\gamma - q$. This is the Cox–Ingersoll–Ross (CIR) process, a popular mathematical finance model of interest rates170,171. Its stationary distribution is gamma with shape $\alpha/q$ and scale $\frac{q}{\gamma - q}$. This immediately implies the distribution of the discrete process is negative binomial with the same shape and scale. This matches the result obtained by directly solving the master equation18. We find, then, that autocatalysis with constitutive transcription yields a stationary distribution equivalent to bursty transcription with no autocatalysis.
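This correspondence is easy to check numerically by way of the Poisson representation: mixing a Poisson distribution over the gamma stationary law of the matched CIR process reproduces the negative binomial law predicted for the discrete system. The parameter values below are arbitrary, and scipy's nbinom uses the success probability $1/(1+\text{scale})$.

```python
import numpy as np
from scipy import stats

alpha, q, gamma_ = 2.0, 0.3, 1.0                 # require gamma_ > q for a stationary law
shape, scale = alpha / q, q / (gamma_ - q)

rng = np.random.default_rng(2)
lam = rng.gamma(shape, scale, size=200_000)      # CIR stationary law
x = rng.poisson(lam)                             # Poisson representation of the discrete process

grid = np.arange(30)
empirical = np.bincount(x, minlength=grid.size)[: grid.size] / x.size
predicted = stats.nbinom.pmf(grid, shape, 1.0 / (1.0 + scale))
print(np.max(np.abs(empirical - predicted)))     # small, up to Monte Carlo error
```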

Obtaining this result, we may ask how the distribution changes if the molecules are produced in geometric bursts B with mean size b:

$$\varnothing \xrightarrow{\;\alpha\;} B\times\mathcal{X}, \qquad \mathcal{X} \xrightarrow{\;\gamma\;} \varnothing, \qquad \mathcal{X} \xrightarrow{\;q\;} 2\mathcal{X}. \tag{77}$$

By changing the drift operator to a jump operator, we obtain a PDE with $\mathcal{A}(u) = \alpha\left[\frac{1}{1 - bu} - 1\right]$. In other words, the continuous version of this process is a combination of CIR and gamma Ornstein–Uhlenbeck (Γ-OU) processes20, with the mean-reversion terms of both, the square-root noise of the former, and the exponentially-distributed jumps of the latter.

Define the parameter combinations

$$c := \gamma - q, \qquad v := \frac{\alpha b}{bc - q}. \tag{78}$$

By direct integration, we find the characteristic and the stationary distribution

$$U(\mathsf{s}) = \frac{c\,u\,e^{-c\mathsf{s}}}{c + qu\left(e^{-c\mathsf{s}} - 1\right)}, \qquad G = \exp\left[\alpha\int_0^\infty \frac{bU(\mathsf{s})}{1 - bU(\mathsf{s})}\,d\mathsf{s}\right] = \left(\frac{1 - qc^{-1}u}{1 - bu}\right)^{v}. \tag{79}$$

Curiously, this distribution exactly matches the transient MGF of the Γ-OU process, as well as the equivalent transient PGF of the bursty transcription process with no autocatalysis16:

$$G = \left(\frac{1 - bu\,e^{-\kappa\tau}}{1 - bu}\right)^{v}; \tag{80}$$

we may take advantage of the fact that $qc^{-1}$ can be equivalently expressed as $be^{-\kappa\tau} < b$ for some positive $\kappa$ and $\tau$, because $bc - q > 0$ is required to have a steady state (i.e., positive $v$). In the continuous setting, this process is known172 to have a law consisting of a mixture of gamma distributions with scale $be^{-\kappa\tau}$ and shape $k$; in turn, $k$ is drawn from a negative binomial distribution with shape $v$ and scale $(1 + e^{-\kappa\tau})^{-1}$. This immediately implies that the distribution of the corresponding discrete process is a negative binomial-negative binomial mixture with equivalent parameters, which may be confirmed by the considerably more involved direct derivation in Method S2.2. Although this distribution cannot be expressed in closed form, its construction makes the simulation of the bursty transient and stationary autocatalytic processes trivial, and suggests that simple finite approximations (i.e., up to a modest $k$) may be developed.

The continuous formulation is a way to exploit existing quantitative results, but does not typically make problems easier. For example, we may be interested in solving an RNA/protein system with transcription, catalytic translation (at rate q), and the degradation of both species (at respective rates γR and γP). Without specifying the transcriptional dynamics, we find that the downstream ODEs have a nontrivial D matrix, i.e.,

C = (C^{dd})^T = \begin{bmatrix} -\gamma_R & q \\ 0 & -\gamma_P \end{bmatrix} \quad\text{and}\quad D = (Q^{d})^T = \begin{bmatrix} 0 & q \\ 0 & 0 \end{bmatrix}. \qquad (81)

Although these matrices can be exploited to obtain both characteristics, the solution depends on special functions and is thus challenging to manipulate77. Instead, we may ask whether we can simplify the problem by eliding all stochasticity in the protein species and assuming it may be described by a continuous process. Defining the variables for this system, we find:

C = \begin{bmatrix} (C^{dd})^T & (C^{dc})^T \\ 0 & (C^{cc})^T \end{bmatrix} = \begin{bmatrix} -\gamma_R & q \\ 0 & -\gamma_P \end{bmatrix}, \qquad D = \begin{bmatrix} 0 & (C^{dc})^T \\ 0 & 0 \end{bmatrix} = \begin{bmatrix} 0 & q \\ 0 & 0 \end{bmatrix}, \qquad (82)

i.e., in spite of this supposed simplification, the problem is precisely as challenging as it was before. This provides an immediate and intuitive explanation for a range of results, such as the observation that the stationary distribution of proteins under constitutive transcription has a complicated solution in terms of Kummer’s hypergeometric function even if one uses a leading-order approximation (cf. Eqns. 34 and 50 of Bokes138).

6.8.2. Empty droplets

Model definition.

In Equation 64, we propose the simplest nontrivial model for the background distribution of RNA molecules in each droplet: the RNA content for each species i is described by a set of independent Poisson distributions whose mean is proportional to the mean in the entire cell population. Per Equation 65, the distribution of background is convolved with the endogenous RNA distribution of cell-containing droplets, making it challenging to distinguish technical and biological contributions. However, we can make predictions about the empty droplets, which have G_{bc} = 1, and compare these predictions to real datasets.

First, we define a baseline n=2 model of biology, such that

\varnothing \xrightarrow{K} \mathcal{X}_N \xrightarrow{\beta} \mathcal{X}_M \xrightarrow{\gamma} \varnothing, \qquad (83)

where K is a generic, but non-constant (bursty, multistate, or SDE-controlled) transcription process, 𝒳_N is a nascent transcript, 𝒳_M is a mature transcript, and β and γ are Markovian splicing and degradation rates, respectively. Whereas constant K yields Poisson distributions of 𝒳_N and 𝒳_M, variable K induces overdispersed distributions of RNA in droplets containing one or more cells. Further, it implies that certain correlations are nonzero. For a given gene j, the correlation between counts of 𝒳_{j,N} and 𝒳_{j,M} should be nonzero, as the latter is, conceptually, the moving average of the former. Further, the correlation between the counts of a given species for different genes should be nonzero, as it reflects cell type heterogeneity and gene co-regulation16 (see Section 6.4).

This model describes the biology in living cells; to connect it to UMI measurements, we assume that G_t^* is an approximately linear map, i.e., library construction is either sequestering or non-sequestering and slow. Further, we assume G_a^* is a linear map, as in Equation 74. Therefore, for each species i, we have a per-cell biological distribution with mean μ_i. In a droplet containing a single cell, the mean becomes (1 + c)μ_i p_i ≈ μ_i p_i, where p_i is the overall probability of capturing, retaining, sequencing, and identifying each molecule (Section 6.7). In a droplet with no cells, the mean is cμ_i p_i. We assume the number of doublets is negligible.

Under the foregoing assumptions, we predict that the empty-droplet marginal per-gene UMI distribution is Poisson with mean cμ_i p_i. This mean is proportional to the mean in non-empty droplets with a small coefficient of proportionality c. Further, we should observe zero correlations on an intra-gene basis, between counts of 𝒳_{j,N} and 𝒳_{j,M}, and on an inter-gene basis, e.g., between counts of 𝒳_{j1,M} and 𝒳_{j2,M}. However, it is not a priori clear that this model should even approximately describe real data, even in the case of empty droplets. For example, these data may exhibit considerable “read depth” variability65,83, or, in our framework, inter-droplet variation in the probability p_i, which would induce overdispersion or genome-wide correlations between molecule counts. By inspecting the distributional properties of empty droplet data, we can attempt to qualitatively motivate or raise doubts regarding the Poisson model.

Data processing.

To build references and pseudoalign datasets, we used kallisto | bustools 0.26.0. We downloaded pre-built H. sapiens and M. musculus genomes from https://support.10xgenomics.com/single-cell-gene-expression/software/downloads/latest (10x Genomics, GRCh38 and mm10, 2020-A versions). Next, we used the kb ref function with the --lamanno option to build references. We obtained the raw FASTQ files for the six datasets reported in Table S2. Then, we used the kb count function with the --lamanno option, as well as the appropriate technology option -x (10x v2 or v3), to quantify the datasets, outputting unspliced and spliced RNA matrices. The unspliced counts correspond to molecular barcodes containing introns, whereas the spliced counts correspond to molecular barcodes not containing introns139,173. For the reasons outlined in Section S6 of Carilli and Gorin et al.71, we identify unspliced counts with “nascent” RNA species and spliced counts with “mature” RNA species, and elide any ambiguity.

Data analysis.

We split the datasets into two categories. The “non-empty” droplets were retained after the bustools filter; the “empty” category contains barcodes that were discarded by the filter. Although this split is fairly coarse, as the filtering choices are heuristic, it is coherent with typical processing workflows and allows us to inspect the broad trends of distributional properties.

To investigate the overdispersion, or lack thereof, we separately computed the mean and variance of nascent and mature UMI counts for each gene in each set of cells. We plotted these quantities on a log-log scale, omitting the data points where one or both of these quantities were zero. Under the pseudobulk model, we expect the non-empty droplets to exhibit overdispersion and the empty droplets to lie near the identity line, as the model encodes Poisson statistics for the latter.
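The mean-variance computation can be sketched as follows, assuming counts is a dense (cells × genes) array for one droplet class; the function and variable names are placeholders rather than those of our analysis scripts.

import numpy as np

def log_mean_variance(counts):
    # Per-gene mean and variance of UMI counts for the overdispersion plots
    mu = counts.mean(axis=0)
    var = counts.var(axis=0)
    keep = (mu > 0) & (var > 0)   # omit genes where either quantity is zero
    return np.log10(mu[keep]), np.log10(var[keep])

Under the empty-droplet Poisson model, the returned log-variances should lie near the log-means (the identity line).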

To investigate the intra-gene correlation structure, we computed the Pearson correlation coefficient ρ between nascent and mature UMI counts for each gene in each set of cells. We plotted the histograms of these values, as well as their relationship to the mature UMI mean, omitting the data points where ρ was undefined. To investigate the inter-gene correlation structure, we computed the Pearson correlation coefficient between the nascent UMI counts for each pair of genes in each set of cells, and repeated the analysis for mature count data. We plotted the histograms of these values, omitting the data points where ρ was undefined. As the number of gene pairs is fairly large, we first excluded all genes that were not expressed in the dataset. We expect both measures of correlation to be substantial for non-empty droplets and near zero for the empty droplets, as the model encodes statistical independence between marginal distributions for the latter.
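A minimal sketch of these correlation computations, again assuming dense (cells × genes) arrays N (nascent) and M (mature) for one droplet class; these names are placeholders.

import numpy as np

def intra_gene_correlations(N, M):
    # Pearson correlation between nascent and mature counts, gene by gene
    out = np.full(N.shape[1], np.nan)
    for j in range(N.shape[1]):
        if N[:, j].std() > 0 and M[:, j].std() > 0:
            out[j] = np.corrcoef(N[:, j], M[:, j])[0, 1]
    return out

def inter_gene_correlations(X):
    # Pearson correlations between all expressed gene pairs for one species
    X = X[:, X.std(axis=0) > 0]
    R = np.corrcoef(X, rowvar=False)
    return R[np.triu_indices_from(R, k=1)]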

To investigate the relationship between the empty and non-empty droplet averages, we plotted the mean mature UMI count for each gene in empty droplets against the mean mature UMI count in cell-containing droplets. As we plotted these quantities on a log-log scale, we omitted the data points where one or both of these quantities were zero. We repeated the analysis for nascent RNA data. We expect these averages to be highly correlated, as the pseudobulk model proposes that the background RNA are sampled from a pool representative of the cell population.

Next, we computed and reported the Pearson correlation coefficient between the (well-defined) log-means. To characterize and explain deviations from Poisson behavior, we selected all genes with overdispersion in the mature RNA count distributions in empty droplets (σ_M² > 2×μ_M) and reported their identities. Finally, to quantify the variation not included in the model, we computed the mean and variance of total mature UMI counts in empty droplets, with and without the overdispersed genes. As the sum of independent Poisson distributions is Poisson, we expect the total per-cell UMI count distributions to have a variance approximately equal to the mean.

6.8.3. Noise-corrupted candidate models of transcriptional variation

Model definition.

We would like to characterize the mutual distinguishability of superficially similar transcriptional models. In particular, we are interested in the benefits of multimodal data collection and the effects of technical noise.

As above, we begin by defining a baseline n=2 model of biology, such that

\varnothing \xrightarrow{K} \mathcal{X}_N \xrightarrow{\beta} \mathcal{X}_M \xrightarrow{\gamma} \varnothing, \qquad (84)

where K represents one of three candidate transcriptional models. The discrete dynamics are summarized by

C^{dd} = \begin{bmatrix} -\beta & 0 \\ \beta & -\gamma \end{bmatrix}, \qquad U_M = u_M e^{-\gamma s}, \qquad U_N = u_N e^{-\beta s} + u_M\frac{\beta}{\beta - \gamma}\left(e^{-\gamma s} - e^{-\beta s}\right). \qquad (85)

The first transcriptional model is the Γ-OU process, with N=1 and m=1:

dy_t = -\kappa y_t\, dt + dZ_t, \qquad (86)

where Zt is a subordinator with arrival rate a and exponentially distributed jumps with mean size θ. This system is characterized by

u = \begin{bmatrix} u_N & u_M & u_K \end{bmatrix}, \qquad C^{cc} = -\kappa, \qquad C^{cd} = \begin{bmatrix} 1 & 0 \end{bmatrix}, \qquad \mathcal{A}(u) = a\left[\frac{1}{1 - \theta u_K} - 1\right], \qquad (87)

with all other matrices and operators set to zero.

The second is the CIR process, with N=1 and m=1:

dy_t = \left(a\theta - \kappa y_t\right) dt + \sqrt{2\kappa\theta y_t}\, dW_t. \qquad (88)

This system is characterized by

u = \begin{bmatrix} u_N & u_M & u_K \end{bmatrix}, \qquad C^{cc} = -\kappa, \qquad C^{cd} = \begin{bmatrix} 1 & 0 \end{bmatrix}, \qquad Q^{c} = \kappa\theta, \qquad \mathcal{A}(u) = a\theta u_K, \qquad (89)

with all other matrices and operators set to zero.

We previously proposed the Γ-OU and CIR processes as potential explanatory models for gamma-distributed stochastic variability in transcription rates, solved them, and investigated the implications of their kinetics on the model properties and distinguishability20. The stationary distribution of the Γ-OU and CIR processes is gamma, with shape a/κ and scale θ, i.e., mean aθ/κ and variance aθ²/κ. In addition, their (appropriately normalized) autocorrelation function is e^{-κt}.

Finally, the third is the telegraph process100, with N=2 and m=0. This system is characterized by

u = \begin{bmatrix} u_N & u_M \end{bmatrix}, \qquad H = \begin{bmatrix} -k_{\mathrm{on}} & k_{\mathrm{on}} \\ k_{\mathrm{off}} & -k_{\mathrm{off}} \end{bmatrix}, \qquad \text{and} \qquad \mathcal{A}(u) = \begin{bmatrix} 0 & k_{\mathrm{init}} u_N \end{bmatrix}. \qquad (90)

The stationary distribution of this process is Bernoulli scaled by k_init, with mean k_on k_init/(k_on + k_off) and variance k_on k_off k_init²/(k_on + k_off)². Its autocorrelation function is e^{-(k_on + k_off)t}81.

For all three models, assuming a Bernoulli observation model (i.e., that each molecule has an independent probability p of being observed) is equivalent to a parameter redefinition. For the Γ-OU and CIR models, this redefinition is θ → pθ; for the telegraph model, we have analogously that k_init → pk_init.

Let us see why this is true. Recall from Section 3 that the Bernoulli technical noise model amounts to a redefinition u_N → pu_N, u_M → pu_M. For the Γ-OU model, the steady-state (log-) GF is

\phi_{ss}(u_N, u_M) = a\int_0^\infty \frac{\theta U_K(s; u_N, u_M)}{1 - \theta U_K(s; u_N, u_M)}\, ds, \qquad (91)

where UK(𝗌;uN,uM) is the exponential sum solution of

\frac{dU_K}{ds} = U_N - \kappa U_K, \qquad U_K(0) = 0, \qquad (92)

and where the characteristics U_N and U_M are as in Equation 85. Because the U_K ODE is linear, U_K depends linearly on u_N and u_M (and hence on p). But φ_ss only depends on U_K through the combination θU_K, so the problem with technical noise is equivalent to redefining θ → pθ.
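This equivalence is easy to verify numerically. The sketch below (with illustrative parameter values, not the paper's implementation) solves the linear U_K characteristic with SciPy and evaluates the steady-state log-GF of Equations 91-92 by trapezoidal quadrature, confirming that rescaling (u_N, u_M) by p matches rescaling θ by p.

import numpy as np
from scipy.integrate import solve_ivp

kappa, a, theta, beta, gamma_ = 0.5, 0.4, 1.0, 0.8, 0.9   # illustrative values

def U_N(s, uN, uM):
    # Nascent characteristic from Equation 85
    return uN * np.exp(-beta * s) + uM * beta / (beta - gamma_) * (np.exp(-gamma_ * s) - np.exp(-beta * s))

def log_gf_gou(uN, uM, theta):
    # Steady-state log-GF of the Gamma-OU model (Equations 91-92)
    s_max = 30.0 / min(kappa, beta, gamma_)
    s = np.linspace(0.0, s_max, 4000)
    UK = solve_ivp(lambda t, y: U_N(t, uN, uM) - kappa * y, (0.0, s_max), [0.0],
                   t_eval=s, rtol=1e-10, atol=1e-12).y[0]
    return a * np.trapz(theta * UK / (1.0 - theta * UK), s)

p, uN, uM = 0.3, -0.6, -0.4                # u = g - 1, with g in [0, 1)
print(log_gf_gou(p * uN, p * uM, theta))   # technical noise applied to u
print(log_gf_gou(uN, uM, p * theta))       # equivalent redefinition of theta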

For the CIR model, the steady-state (log-) GF is

\phi_{ss}(u_N, u_M) = a\theta\int_0^\infty U_K(s; u_N, u_M)\, ds, \qquad (93)

where

\frac{dU_K}{ds} = U_N - \kappa U_K + \kappa\theta U_K^2, \qquad U_K(0) = 0.

The technical noise causes U_N → pU_N. Dividing the equation through by p moves the factor of p elsewhere; we can see that

\phi_{ss}(u_N, u_M) = a\,(p\theta)\int_0^\infty \frac{U_K(s; u_N, u_M)}{p}\, ds, \qquad \frac{d(U_K/p)}{ds} = U_N - \kappa\,(U_K/p) + \kappa\,(p\theta)\,(U_K/p)^2, \qquad U_K(0) = 0 \qquad (94)

is equivalent, i.e., that again the technical noise problem is equivalent to a non-technical-noise problem with θ → pθ.

For the telegraph model, the steady-state (log-) GF is

\phi_{ss}(u_N, u_M) = \phi_0\left(U_N(\infty), U_M(\infty), U_{\mathrm{on}}(\infty), U_{\mathrm{off}}(\infty)\right), \qquad \frac{dU_{\mathrm{off}}}{ds} = -k_{\mathrm{on}}\left(U_{\mathrm{off}} - U_{\mathrm{on}}\right), \qquad \frac{dU_{\mathrm{on}}}{ds} = -k_{\mathrm{off}}\left(U_{\mathrm{on}} - U_{\mathrm{off}}\right) + k_{\mathrm{init}}\left(U_{\mathrm{on}} + 1\right)U_N, \qquad (95)

where U_off(0) = U_on(0) = 0. Since U_N(∞) = U_M(∞) = 0, the values of U_N(s) only affect φ_ss through the combination k_init U_N that appears in the U_on ODE; this means we can just redefine k_init → pk_init, as promised, to get a completely equivalent problem.

Model analysis.

Formally, these models have five parameters each: three for the upstream transcriptional dynamics and two for the downstream molecular conversion. However, their qualitative behaviors at steady state can be effectively summarized by fixing μK,β, and γ, and varying two key parameters, the timescale separation and the noise intensity. From a statistical point of view, μK/β and μK/γ are easily and robustly identifiable from the mean molecular counts; from an experimental point of view, β and γ can, in principle, be fit by orthogonal experiments86. At steady state, the value of μK is a somewhat arbitrary scaling factor.

For the two-species SDE driver models, the qualitative parameters take the following form:

\text{timescale separation} := x = \frac{\kappa}{\kappa + \beta + \gamma}, \qquad \text{noise intensity} := y = \frac{\theta}{a + \theta}. \qquad (96)

These parameters both range in (0,1). When the timescale separation approaches zero, the transcriptional variation is much slower than the turnover, and the distribution of RNA is given by a simple Poisson mixture of the law of K. When the noise intensity approaches zero, the law of K degenerates and the distribution of RNA becomes Poisson. Most interestingly, when the timescale separation and the noise intensity are both high, the system exhibits bursty transcription20.

Equation 96 is defined with reference to the process parameters of the Γ-OU and CIR drivers20. It remains to define κ,θ, and a in terms of kon,koff, and kinit for the telegraph process. The correct identification is:

\kappa = k_{\mathrm{on}} + k_{\mathrm{off}} \ \text{is the autocorrelation timescale}, \qquad a = \frac{k_{\mathrm{on}}\kappa}{k_{\mathrm{off}}} \ \text{is the process scaling, and} \qquad \theta = \frac{k_{\mathrm{off}}k_{\mathrm{init}}}{\kappa} \ \text{is the gain}. \qquad (97)

These identifications are not arbitrary, as they endow the system with lower moments that match the SDE formulation: autocorrelation function e^{-κt}, mean aθ/κ, and variance aθ²/κ. In addition, the system has the correct geometric burst limit (k_init, k_off → ∞) with burst size θ/κ → k_init/k_off and burst frequency a → k_on73; this limit matches the Γ-OU one20.

Given any combination of {x,y,μK,β,γ}, we can identify the transcriptional parameters:

\kappa = (\beta + \gamma)\frac{x}{1 - x}, \qquad \frac{a}{\theta} = \frac{k_{\mathrm{on}}\kappa}{k_{\mathrm{off}}}\cdot\frac{\kappa}{k_{\mathrm{off}}k_{\mathrm{init}}} = \frac{k_{\mathrm{on}}\kappa^2}{k_{\mathrm{off}}^2 k_{\mathrm{init}}}, \qquad y = \frac{1}{1 + a/\theta} \ \text{or} \ \frac{a}{\theta} = \frac{1}{y} - 1,
\mu_K \kappa^{-1}\left(\frac{1}{y} - 1\right) = \frac{k_{\mathrm{init}}k_{\mathrm{on}}}{\kappa^2}\cdot\frac{k_{\mathrm{on}}\kappa^2}{k_{\mathrm{off}}^2 k_{\mathrm{init}}} = \frac{k_{\mathrm{on}}^2}{k_{\mathrm{off}}^2} = \left(\frac{k_{\mathrm{on}}}{k_{\mathrm{off}}}\right)^2 =: c,
\text{giving} \quad k_{\mathrm{on}} = \frac{\sqrt{c}\,\kappa}{\sqrt{c} + 1}, \qquad k_{\mathrm{off}} = \frac{\kappa}{\sqrt{c} + 1}, \qquad \text{and} \qquad k_{\mathrm{init}} = \frac{\mu_K \kappa}{k_{\mathrm{on}}}. \qquad (98)

This allows us to define a particular set of {μK,β,γ}, vary x and y over the constrained domain (0,1) × (0,1), and compare the model properties for each (x, y) tuple. If we are interested in a one-species model, we simply replace each instance of β+γ with β. Since the construction in Equation 98 is bijective, if we fairly densely sample the square, we can be confident that the results fully encompass the range of behaviors under a particular set of averages.
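A minimal helper implementing this parameter identification (Equations 96-98) might look as follows; the function names are ours, not part of an existing package.

import numpy as np

def sde_driver_params(x, y, mu_K, beta, gamma_):
    # (x, y) -> (kappa, a, theta) for the Gamma-OU/CIR drivers, holding mu_K = a*theta/kappa fixed
    kappa = (beta + gamma_) * x / (1.0 - x)
    r = 1.0 / y - 1.0                  # r = a / theta
    a = np.sqrt(mu_K * kappa * r)      # from a*theta = mu_K*kappa and a/theta = r
    theta = np.sqrt(mu_K * kappa / r)
    return kappa, a, theta

def telegraph_params(x, y, mu_K, beta, gamma_):
    # (x, y) -> (k_on, k_off, k_init) via the identifications in Equations 97-98
    kappa = (beta + gamma_) * x / (1.0 - x)
    c = (mu_K / kappa) * (1.0 / y - 1.0)        # c = (k_on / k_off)^2
    k_on = np.sqrt(c) * kappa / (np.sqrt(c) + 1.0)
    k_off = kappa / (np.sqrt(c) + 1.0)
    k_init = mu_K * kappa / k_on
    return k_on, k_off, k_init

For a one-species model, beta + gamma_ is replaced by beta, as noted above.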

Simulated data analysis.

To evaluate PMFs, we used trapezoidal quadrature for the Γ-OU generating function integral, the Runge-Kutta method for the CIR characteristic U_K and trapezoidal quadrature for the generating function integral, and the Runge-Kutta method for the telegraph model’s coupled differential equations18,20. We marginalized over the continuous and categorical dimensions. We evaluated all PMFs on (x_N, x_M) ∈ [0, …, 49] × [0, …, 50]. To generate synthetic data, we sampled with replacement from the 2,550 microstates in the domain, using P(x_N, x_M) as sampling probabilities.
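Sampling microstates from an evaluated joint PMF can be done directly with NumPy; the sketch below uses a placeholder PMF (a product of Poissons) standing in for the model-specific PMFs.

import numpy as np
from scipy.stats import poisson

P = np.outer(poisson.pmf(np.arange(50), 5.0), poisson.pmf(np.arange(51), 8.0))
P /= P.sum()                                  # placeholder joint PMF on the 50 x 51 grid

rng = np.random.default_rng(0)
idx = rng.choice(P.size, size=200, replace=True, p=P.ravel())
xN, xM = np.unravel_index(idx, P.shape)       # sampled (nascent, mature) microstates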

To investigate parameter identifiability, we generated 200 realizations from the Γ-OU model under κ = 0.1, a = 0.4, θ = 1, β = 0.8, and γ = 0.9. These parameters lie in the “mixture-like” regime, where the transcriptional process is slower than the RNA turnover process. Next, we constructed a uniformly spaced 14 × 15 grid of x and y, evaluated at the true values of μ_K, β, and γ and bounded by [0.01, 0.99]. In statistical terms, this model formulation is the best-case scenario where no noise exists and uncertainty in the fixed parameters is negligible.

To investigate the statistical properties of one-species data, we evaluated the log-likelihood log L of the nascent marginal of the data at each of the 210 x, y coordinates (with the true value being x = 1/9 and y = 5/7). Next, we plotted log L as a heatmap over x, y. The coordinates with high log L are not readily distinguishable, i.e., these parameters produce very similar distributions to the data. We highlighted the coordinates in the 90th percentile of log L—the least distinguishable region—using hatching. To illustrate a case where the one-species data are relatively uninformative, we considered a point with x = 9/10 and y = 5/7, which lies in the qualitatively different “burst-like” regime (κ=7.2) but closely resembles the “mixture-like” data at steady state.

To investigate the statistical properties of two-species data, we repeated the analysis above, computing the joint likelihood rather than the marginal likelihood. In the two-species model, the true “mixture-like” parameter set has x=1/18 and the illustrative “burst-like” parameter set has x ≈ 0.81; the other parameters do not change. To demonstrate the source of failure to distinguish between these parameter regimes, we plotted the PMFs in both. We used a transparent bar plot for the nascent PMFs and a heatmap for the joint PMFs, with darker colors representing a higher probability mass.

To investigate the mutual identifiability of models, we computed their Akaike weights over the x, y landscape. The Akaike weight of model ϖ is defined as follows:

w_{\varpi} = \frac{e^{-\frac{1}{2}\Delta_{\varpi}}}{\sum_k e^{-\frac{1}{2}\Delta_k}}, \quad \text{where} \quad \Delta_k = \mathrm{AIC}_k - \mathrm{AIC}_{\min}, \quad \mathrm{AIC}_{\min} = \min_k \mathrm{AIC}_k, \quad \text{and} \quad \mathrm{AIC}_k := -2\log L_k(\hat{\Theta}_k) + 2\varsigma_k. \qquad (99)

Thus, AIC_k is the Akaike information criterion (AIC) for model k. The AIC depends on the model log-likelihood log L_k at the maximum likelihood estimate Θ̂_k, as well as the number of model parameters ς_k120. Therefore, the Akaike weight essentially transforms and combines the models’ relative likelihoods to provide a measure of their agreement with the data.

Although this measure has its caveats and limitations—for example, it cannot account for uncertainty in the model-specific parameters Θ_k—it is a fairly conventional criterion for model selection. Most usefully for our investigation, it admits a simple interpretation: if the Akaike weight of the true model w_ϖ ≈ 1/3, there is essentially no basis for choosing a particular model, since the models’ distributions are not practically distinguishable. If w_ϖ > 1/2, we have a basis for model discrimination: the odds in favor of the correct model are at least even. In the three-model case, this may reflect both, or only one, of the competing hypotheses being substantially worse at describing the data, so a more careful examination of the w_k values is necessary to judge the models.
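In practice, the weights require only a few lines of NumPy; the following is a generic sketch rather than our analysis code, with hypothetical input values.

import numpy as np

def akaike_weights(log_likelihoods, n_params):
    # Equation 99: log_likelihoods[k] is log L_k at the MLE, n_params[k] the parameter count
    aic = -2.0 * np.asarray(log_likelihoods, float) + 2.0 * np.asarray(n_params, float)
    delta = aic - aic.min()
    w = np.exp(-0.5 * delta)
    return w / w.sum()

print(akaike_weights([-1502.3, -1503.1, -1510.8], [5, 5, 5]))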

To investigate model identifiability, we constructed a uniformly spaced 14×15 grid of x and y, bounded by [0.01,0.99]. At each grid point, we generated 200 realizations from the Γ-OU model under μK=5,β=0.8, and γ=0.9. Next, we computed the log Lk of each model using the nascent marginal and the full data, and used the relative likelihoods to compute the Akaike weights of the Γ-OU model under these two scenarios. Finally, to reduce the impact of stochastic sampling variability, we repeated the process 50 times and computed their average. In other words, we generated 50 independent datasets at each of the 210 grid points, evaluated likelihoods of all models, computed the univariate and bivariate Γ-OU Akaike weight of each, then aggregated the 50 trials at each grid point to obtain two “average-case” performance measures. In statistical terms, this model formulation represents the best-case scenario where the parameters are perfectly known, and the problem solely consists of distinguishing between the models, as in the Γ-OU/CIR case considered in Fig. 3 of Gorin and Vastola et al.20

To visualize the behavior of the Akaike weights under these assumptions, we plotted its value as a heatmap over x, y. We highlighted the coordinates with w_ϖ < 1/2—the poorly distinguishable region—using hatching. To illustrate a case where the one-species data are relatively uninformative, we compared a point with one-species coordinates x, y = (0.4, 0.9), which lies in the “mixture-like” regime, to one with x, y = (0.9, 0.8), which lies in the “burst-like” regime. We visualized these points on the x, y axes using large, color-coded circles. From Gorin and Vastola et al.20 and the properties of low-x processes outlined in the definition of x, we expect the former regime to be highly distinguishable, particularly since the telegraph process converges to a bimodal Bernoulli mixture as κ → 0. On the other hand, we expect the latter regime to be somewhat less distinguishable; in this limit, the Γ-OU and telegraph models both converge to the bursty model discussed in Singh and Bokes137. We repeated this analysis for two-species Akaike weights, transforming the coordinates appropriately (i.e., x ≈ 0.24 for the mixture-like regime and x ≈ 0.81 for the burst-like regime).

To demonstrate the basis of statistical distinguishability properties, we plotted the PMFs of the three models in the two parameter regimes. To simultaneously display them, we plotted marginal distributions of the nascent species as line charts, color-coded by the model identity.

To investigate the effect of drop-out technical noise, we did not perform dedicated simulations; instead, we exploited the result, derived above, that the functional form of the solutions is closed under downsampling. In other words, all distributional properties of a system with gain θ and the technical noise parameter p are identical to those of a system with gain pθ and no technical noise. These properties include the model distinguishability. To illustrate this result, we represented Bernoulli technical noise by arrows in the negative y direction, with small circles located on an arrow corresponding to 50%, 75%, and 85% dropout. To compute the y value under dropout, we use:

y^* = \frac{p\theta}{p^{-1}a + p\theta}, \quad \text{since} \quad \mu_K = \frac{a\theta}{\kappa} = \frac{(p^{-1}a)(p\theta)}{\kappa} = \text{const}. \qquad (100)

The arrows begin at 0% dropout, corresponding to the illustrative base cases (large circles) described above. This demonstrates that increasing the drop-out rate while holding the averages constant leads to the molecular distributions’ degeneration to the Poisson limit. If we do not hold the averages constant, we simply obtain the decreased y^* = pθ/(a + pθ) on the (less identifiable) x, y landscape with mean transcription rate pμ_K.

6.8.4. Distributions obtained from a transient process

Model definition.

As motivated in our RNA velocity review19, understanding transient developmental processes that occur on a timescale comparable to RNA lifetimes requires fitting transient probabilistic models. Even under the considerable simplifications made in Section 6.5, fully treating transient transcriptional phenomena requires identifying the a priori unknown (1) internal-age distribution f(t) as well as (2) process parameters for G(t). As the time since process start t can be conceptualized as a cell-specific latent variable, this problem can be treated by an expectation–maximization (EM) algorithm, which may proceed by probabilistically constraining the unknown (3) cell-specific times tc.

Since parameter inference is mandatory for the expectation step of the EM algorithm, we begin by characterizing the upper limit on its performance. In particular, previous attempts to treat the problem have assumed simple Gaussian or Poisson error terms59,86,121, or applied graph methods174. These approaches do not recapitulate19 the discrete stochasticity and bursting observed in transient biophysical processes125,175. However, the transient distributions of bursty processes are not available in closed form, and require new algorithms. Therefore, we treat the simplest nontrivial formulation, which combines points (1) and (2), while omitting (3): if we have perfect information about the cells’ relative times, can we satisfactorily fit a bursty transcriptional model and use the results as a basis for distinguishing between internal-age distributions?

We define a baseline N=1,n=2,m=0 model of biology with no technical noise, with the reaction schema

\varnothing \xrightarrow{\alpha} B\times\mathcal{X}_N \xrightarrow{\beta} \mathcal{X}_M \xrightarrow{\gamma} \varnothing, \qquad (101)

representing bursty transcription with stochastic burst sizes B drawn from a geometric distribution with time-dependent mean b(t):

u = \begin{bmatrix} u_N & u_M \end{bmatrix}, \qquad C^{dd} = \begin{bmatrix} -\beta & 0 \\ \beta & -\gamma \end{bmatrix}, \qquad U = \begin{bmatrix} U_N \\ U_M \end{bmatrix} = \begin{bmatrix} u_N e^{-\beta s} + u_M\frac{\beta}{\beta - \gamma}\left(e^{-\gamma s} - e^{-\beta s}\right) \\ u_M e^{-\gamma s} \end{bmatrix}, \qquad \mathcal{A}(u) = \alpha\left[\frac{1}{1 - b(t)u_N} - 1\right], \qquad (102)

with all other operators set to zero. To specify b(t), we define a three-stage model of cell type transitions, such that

b(t) = \begin{cases} b_1 & t < \tau_1 \\ b_2 & t \in (\tau_1, \tau_2) \\ b_3 & t > \tau_2, \end{cases} \qquad (103)

i.e., a transition is accompanied by a rapid change in burst size at a deterministic time after starting the process.

Next, we propose candidate internal-age distributions. Drawing on the chemical engineering literature105,106, we outline one-parameter reactor models, such that t = 0 corresponds to the cell entering the reactor; after some residence time t′, which depends on the reactor architecture and is drawn from the distribution f_res, the cell exits. The internal-age distribution is given by

f(t) = \frac{1}{T}\int_t^\infty f_{\mathrm{res}}(t')\, dt'. \qquad (104)

The plug flow reactor (PFR) is the model implicit in previous studies59,86. Formally, it represents each cell entering a reactor, then exiting after some deterministic time T. Its residence-time distribution is Dirac or degenerate, with f_res(t′) = δ(t′ − T), so

f(t) = \frac{1}{T}\int_t^\infty f_{\mathrm{res}}(t')\, dt' = \frac{1}{T}\int_t^\infty \delta(t' - T)\, dt' = \frac{I(t < T)}{T}, \qquad (105)

the expected uniform distribution. This distribution has the CDF and inverse CDF

F(t) = \frac{t}{T}\, I(t < T) \quad\text{and}\quad F^{-1}(p) = pT. \qquad (106)

The continuously stirred tank reactor (CSTR) represents a cell entering a homogeneous reactor, then exiting after a random time, in a memoryless fashion. Therefore, the residence-time distribution f_res(t′) = (1/T)e^{-t′/T} is memoryless or exponential, yielding

f(t) = \frac{1}{T}\int_t^\infty f_{\mathrm{res}}(t')\, dt' = \frac{1}{T^2}\int_t^\infty e^{-t'/T}\, dt' = \frac{1}{T}e^{-t/T}; \qquad (107)

i.e., memorylessness implies that the properties inside the reactor—including the age distribution—are identical to the properties of the efflux stream. We obtain the CDF and inverse CDF

F(t) = 1 - e^{-t/T} \quad\text{and}\quad F^{-1}(p) = -T\ln(1 - p). \qquad (108)

The laminar-flow reactor (LFR) is a configuration between these two extremes: it represents a cell entering a reactor, remaining in it for some deterministic time, then being able to exit after a power-law delay. Its residence-time distribution f_res(t′) = (T²/2) t′^{-3} I(t′ > T/2) is Pareto, yielding

f(t) = \frac{1}{T}\int_t^\infty f_{\mathrm{res}}(t')\, dt' = \frac{T}{2}\int_{\max(t,\,T/2)}^\infty t'^{-3}\, dt' = \begin{cases} \frac{T}{2}\int_t^\infty t'^{-3}\, dt' = \frac{T}{4t^2} & t > \frac{T}{2} \\ \frac{T}{2}\int_{T/2}^\infty t'^{-3}\, dt' = \frac{1}{T} & t < \frac{T}{2}. \end{cases} \qquad (109)

The PDF can be integrated to yield the CDF and inverse CDF

F(t) = \begin{cases} \frac{t}{T} & t < \frac{T}{2} \\ 1 - \frac{T}{4t} & t > \frac{T}{2} \end{cases} \quad\text{and}\quad F^{-1}(p) = \begin{cases} pT & p < \frac{1}{2} \\ \frac{T}{4(1 - p)} & p > \frac{1}{2}. \end{cases} \qquad (110)

We are interested in the CDFs and inverse CDFs of the internal-age distributions because “perfect information about the cells’ relative times” properly requires specifying {F_ϖ(t_c)} and {F_ϖ(τ_i)} under the true model ϖ rather than the raw {t_c} and {τ_i} values. Otherwise, the model selection problem becomes somewhat trivial; for example, if we know the mean residence time is T and that one of the t_c exceeds T, we can immediately eliminate the PFR configuration without performing any calculations.
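The inverse CDFs above also make inverse transform sampling of cell ages straightforward; a minimal sketch, with T the mean residence time and p uniform on (0, 1):

import numpy as np

def pfr_inverse_cdf(p, T):
    return p * T                                  # Equation 106

def cstr_inverse_cdf(p, T):
    return -T * np.log(1.0 - p)                   # Equation 108

def lfr_inverse_cdf(p, T):
    p = np.asarray(p, dtype=float)
    return np.where(p < 0.5, p * T, T / (4.0 * (1.0 - p)))   # Equation 110

rng = np.random.default_rng(0)
p = rng.uniform(size=5)
print(pfr_inverse_cdf(p, 5.0), cstr_inverse_cdf(p, 5.0), lfr_inverse_cdf(p, 5.0))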

A synthetic dataset consists of observations xN,c,xM,c for each cell c, generated from the true model ϖ at the true time point tc. The log-likelihood of parameters Θk={b1,b2,b3,α,β,γ}k for model k takes the form

\log L_{k,c}\left(\Theta_k \mid x_{N,c}, x_{M,c}\right) = \log P\left(x_{N,c}, x_{M,c}, t_{c,k} \mid \Theta_k; \{\tau_i\}_k\right), \qquad (111)

where t_{c,k} := F_k^{-1}(F_ϖ(t_c)) and {τ_i}_k := {F_k^{-1}(F_ϖ(τ_i))} are the transformed times. This yields the full log-likelihood under the assumption of independence:

\log L_k(\Theta_k) = \sum_c \log L_{k,c}\left(\Theta_k \mid x_{N,c}, x_{M,c}, \{\tau_i\}_k\right) = \sum_c \log P\left(x_{N,c}, x_{M,c}, t_{c,k} \mid \Theta_k, \{\tau_i\}_k\right). \qquad (112)

The problem of identifying the maximum likelihood parameter set consists of optimizing Equation 112 with respect to Θk. The problem of reactor identification consists of using the resulting reactor-specific maximum likelihood value log Lk(Θ^k) with Equation 99 to obtain the Akaike weights of each reactor configuration.

Simulated data analysis.

To generate the illustrations in Figure 4a, we directly simulated cells entering and exiting each reactor configuration. First, we sampled arrival times from a uniform distribution on [0, 100]. Next, we sampled residence times by inverse transform sampling from the inverse CDF corresponding to each f_res, using the mean residence time T = 2. We arbitrarily selected the observation time 75 and selected all cells which had arrived but not exited at this time. We computed the cell age by subtracting the arrival time from the current time. We repeated this procedure 10^7 times for each reactor to obtain the internal-age distribution. Next, we computed the histogram of the distribution on [0, 10], using 200 bins. To account for the fact that this histogram only contains part of the CSTR and LFR densities, we rescaled the bins by the internal-age distribution’s CDF value at t = 10. Finally, we plotted the rescaled histogram as a bar plot, and the analytical f as a line plot for comparison.

To understand the actionable differences between reactors, we simulated data from a single reactor model, then fit all three models to the obtained counts. First, we sampled 200 true reaction times {t_c} under the PFR model with T = 5 and sorted them. To generate synthetic data, we used Gillespie’s stochastic simulation algorithm140,144 with a time-dependent burst size, storing the state of the system at {t_c}. We generated 200 realizations, using only one realization per time point to fit the models. To simulate, we used the parameters Θ_ϖ = {b_1, b_2, b_3, α, β, γ}_ϖ = {2, 5, 1, 0.8, 1.2, 3.14}. We set {τ_1, τ_2} to {1, 3}. We started the system in a bivariate Poisson initial distribution with λ_N^0 = αb_1/β nascent and λ_M^0 = αb_1/γ mature molecules on average. Although this initial condition is somewhat arbitrary, as it is out of equilibrium, it is readily tractable and yields a constant mean over the first stage of the process.

The instantaneous probability P(x_{N,c}, x_{M,c}, t_{c,k} ∣ Θ_k, {τ_i}_k) is not available in closed form, and needs to be obtained by inverting the generating function for each t_{c,k}16,20,137:

G(u, t_{c,k}) = \exp\left(\lambda_N^0 U_N(u, t_{c,k}) + \lambda_M^0 U_M(u, t_{c,k}) + \alpha\int_0^{t_{c,k}}\left[\frac{1}{1 - b(t_{c,k} - s)U_N(u, s)} - 1\right] ds\right) \qquad (113)

where we elide the dependence of b on the model-specific {τi}k. For a given value of u, it is straightforward to propagate the initial condition. However, it is impractical to compute the integral separately for each c. We can bypass this bottleneck by reusing quadrature points. Conceptually, we define the quadrature matrices

T_Q = \begin{bmatrix} b(t_0 - t_0) & 0 & \cdots & 0 & 0 \\ b(t_1 - t_0) & b(t_1 - t_1) & \cdots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ b(t_{\eta-1} - t_0) & b(t_{\eta-1} - t_1) & \cdots & b(t_{\eta-1} - t_{\eta-1}) & 0 \\ b(t_\eta - t_0) & b(t_\eta - t_1) & \cdots & b(t_\eta - t_{\eta-1}) & b(t_\eta - t_\eta) \end{bmatrix}, \qquad D_Q = \mathrm{diag}\left[U_N(u, t_{0,k}),\, U_N(u, t_{1,k}),\, \ldots,\, U_N(u, t_{\eta-1,k}),\, U_N(u, t_{\eta,k})\right] \qquad (114)

in the general case with η cells. We prepended the starting grid point t_{0,k} := 0 to properly integrate from zero. We use the notation T_Q because this matrix is Toeplitz in the narrow, but numerically relevant19, case of a uniformly spaced grid approximating sampling from a PFR. To lighten the notation, we drop the subscript k from the time points in the definition of T_Q. D_Q is diagonal and does not need to be constructed explicitly; to obtain the product T_Q D_Q, we broadcast T_Q with the vector used in the definition of D_Q. Then, we computed M_Q = (1 − T_Q D_Q)^{(−1)} − 1, where ^{(−1)} is to be interpreted as the elementwise (Hadamard) inverse of the matrix. Finally, we approximated the integral by applying the NumPy quadrature algorithm trapz along the rows of M_Q, using {t_{c,k}} as the integration grid176. The GF evaluation grid size was set to [0, …, max x_N + 4] × [0, …, max x_M + 4], where max x_i is the highest RNA count observed for species i over the entire simulation, in all cells.
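A minimal NumPy sketch of this construction, assuming t is the grid of cell times with t[0] = 0 prepended, UN holds U_N(u, t_i) on the same grid for one value of u, and burst_size implements b(·); these names are placeholders. The result is the integral in Equation 113 for every cell time at once, up to the factor α and the initial-condition terms.

import numpy as np

def burst_integrals(t, UN, burst_size):
    dt = np.clip(t[:, None] - t[None, :], 0.0, None)    # pairwise differences t_i - t_j
    TQ = np.tril(burst_size(dt))                        # lower-triangular T_Q of Equation 114
    MQ = 1.0 / (1.0 - TQ * UN[None, :]) - 1.0           # broadcast D_Q, Hadamard inverse, subtract 1
    return np.trapz(MQ, t, axis=1)                      # one trapezoidal integral per row

t = np.linspace(0.0, 5.0, 6)                            # toy grid
UN = -0.5 * np.exp(-0.8 * t)                            # toy characteristic values
print(burst_integrals(t, UN, lambda dt: 2.0 * np.ones_like(dt)))   # constant burst size b = 2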

Next, we used the SciPy routine scipy.optimize.minimize177 to minimize the negative log-likelihood of the data under all three models, and obtain a satisfactory set of parameters. Specifically, we varied the 6-dimensional vector log_10 Θ, with each log-parameter’s bounds set to (−1.5, 1.5). We optimized with the L-BFGS-B solver for a maximum of 20 steps. Since we are primarily interested in the models’ relative performance at their maximum likelihood estimates (MLEs), rather than the process of obtaining these estimates, we initialized each search at the parameters used to generate the data.

Next, we sought to illustrate the fit performance and the differences between the models’ distributions. We plotted the marginals of the simulated data at each time point tc as bar plots, now using the counts from all 200 cells to demonstrate the full transient distribution. Next, we plotted the marginal PMFs of the three models at the corresponding time points tc,k as color-coded line charts. We expect the true reactor configuration (PFR) to closely agree with the distribution shapes; however, we have no a priori information regarding how well other reactor architectures can recapitulate the same data. To quantify the prospects for model selection, we inserted the optimal log-likelihoods into Equation 99 and calculated the Akaike weights of the model candidates.

To characterize the identifiability properties, we reproduced the simulation and analysis process using the same parameters, but varying the dataset size, with η = { 20, 40, 60, 80, 100, 150, 200 }. For each η, we generated 50 synthetic datasets, fit them, and computed the Akaike weights of the models. We plotted all wϖ as a function of the number of cells, adding uniform jitter to facilitate inspection. To visualize the trends in model identifiability, we plotted the mean and standard deviations of all wϖ for a given η, connecting them with a line to guide the eye. We do not a priori know whether the reactor configurations are meaningfully distinguishable, but if they are, we expect them to become more so with more data.

Next, we sought to characterize the prospects for distinguishing reactor models for a broader range of transcriptional parameters. We used rejection sampling to draw Θ_ϖ. First, we drew log_10 b_i from a normal distribution with mean 0.8 and standard deviation 1, and all other log-parameters from a normal distribution with mean 0 and standard deviation 1. The parameters were clipped to stay in the domain [10^−1.4, 10^1.4] to avoid “trivial” regimes with excessive timescale separation relative to the reactor residence time. Next, we found the highest b_i, computed the nascent and mature mean and standard deviation corresponding to this set of b_i, α, β, γ137, and kept the proposed Θ_ϖ if μ_N + 4σ_N and μ_M + 4σ_M were both lower than 25. Otherwise, we regenerated Θ_ϖ. This is an ad hoc way to limit the state space size for PMF evaluation: although we do not know what the maximum observed counts will be until we simulate the system, μ + 4σ typically provides a reasonable estimate97. Rejecting parameters in this fashion approximately limited the state space size to 25 × 25. In this way, we simulated, fit, and computed the Akaike weights for 200 parameter sets. All used the PFR ground truth model, {τ_1, τ_2} = {1, 3}, and T = 5 as above.

To summarize the model identifiability over this domain of synthetic parameters, we plotted the distribution of AIC weights w_ϖ. Finally, to characterize the relationships between the models, we plotted the distributions of log-likelihood differences log L_k(Θ̂_k) − log L_ϖ(Θ̂_ϖ), where k corresponds to the CSTR and LFR models, as transparent histograms color-coded by k. If such a histogram is skewed toward negative values, the model k produces consistently worse fits than the true PFR model. On the other hand, if it is centered at zero, then model k is typically easily confused with the true model. We restricted this visualization to (−5, 5) to compensate for potential failure to converge, which produces inflated likelihood differences. This visualization provides a basis for explaining the distribution of w_ϖ.

6.8.5. Variability in library construction

Model definition.

In section 6.8.3, we considered the parameter and model identifiability for a two-stage model of RNA processing, and found that several interesting distributions are closed under downsampling, so long as the downsampling is Bernoulli with equal parameters for both species. However, this assumption may be too restrictive in practice: for example, nascent RNA may be more or less likely to be captured than mature RNA, depending on the poly(A) content of their introns. In the current section, we investigate the behavior of models with differences in capture probabilities or rates.

The identifiability properties are highly model-dependent. For example, if we consider the Γ-OU or CIR models, with N=1,n=2,m=1, such that

\varnothing \xrightarrow{K} \mathcal{X}_N \xrightarrow{\beta} \mathcal{X}_M \xrightarrow{\gamma} \varnothing, \qquad (115)

where the autocorrelation rate of K satisfies κ ≪ β, γ, the stationary distribution of K is gamma with shape v = a/κ and scale θ. We find that the stationary RNA generating function is bivariate negative binomial, with

G = \left(\frac{1}{1 - \frac{\theta u_N}{\beta} - \frac{\theta u_M}{\gamma}}\right)^{v}, \qquad (116)

which is outlined in the supplemental section 2.5.2 of Gorin and Vastola et al.20 Under sampling, the distribution stays bivariate negative binomial, with GF

G = \left(\frac{1}{1 - \frac{\theta p_N u_N}{\beta} - \frac{\theta p_M u_M}{\gamma}}\right)^{v}. \qquad (117)

In other words, even if we have perfect information about this distribution’s three parameters v, θp_N/β, and θp_M/γ, we cannot conclude anything about the magnitudes of p_N and p_M, as they are degenerate with θ, β, and γ. If K is telegraph (i.e., N=2, n=2, m=0), we obtain a finite Poisson mixture:

G = \frac{k_{\mathrm{off}}}{\kappa} + \frac{k_{\mathrm{on}}}{\kappa}\exp\left(\frac{k_{\mathrm{init}}p_N u_N}{\beta} + \frac{k_{\mathrm{init}}p_M u_M}{\gamma}\right), \qquad (118)

which exhibits the same degeneracy with respect to k_init, β, and γ. Entirely analogously, if the system is in the Poisson limit (y → 0) with average transcriptional strength μ_K, we find that sampling yields

G = \exp\left(\frac{\mu_K p_N u_N}{\beta} + \frac{\mu_K p_M u_M}{\gamma}\right), \qquad (119)

which is non-identifiable.

Interestingly, the bursty regime is partially identifiable. We begin by defining a baseline N=1,n=2,m=0 model of biology with technical noise but no ambiguity, such that

\varnothing \xrightarrow{\alpha} B\times\mathcal{X}_N \xrightarrow{\beta} \mathcal{X}_M \xrightarrow{\gamma} \varnothing \qquad (120)

representing bursty transcription with stochastic burst sizes B drawn from a geometric distribution with constant mean b. Further, we assume that a molecule 𝓧i is retained with probability pi, yielding:

G_t^*(u) = \begin{bmatrix} p_N u_N & p_M u_M \end{bmatrix}, \qquad C^{dd} = \begin{bmatrix} -\beta & 0 \\ \beta & -\gamma \end{bmatrix}, \qquad U(G_t^*(u), s) = \begin{bmatrix} p_N u_N e^{-\beta s} + p_M u_M\frac{\beta}{\beta - \gamma}\left(e^{-\gamma s} - e^{-\beta s}\right) \\ p_M u_M e^{-\gamma s} \end{bmatrix}, \qquad \mathcal{A}(u) = \alpha\left[\frac{1}{1 - bu_N} - 1\right]. \qquad (121)

In other words, the stationary generating function is given by

\exp\left(\int_0^\infty \mathcal{A}\left(U(G_t^*(u), s)\right) ds\right). \qquad (122)

In principle, this quantity can be integrated, inverted, and optimized with respect to the parameters. However, to be thorough, we need to reformulate the optimization problem in the most compact form available, which involves identifying the distribution’s degeneracies. Although this system formally has six parameters b,α,β,γ,pN,pM, at steady state only four are identifiable. This is made clear by examining the integrand:

bU_N = bp_N u_N e^{-\beta s} + bp_M u_M\frac{\beta}{\beta - \gamma}\left(e^{-\gamma s} - e^{-\beta s}\right) = bp_N\left[u_N e^{-\beta s} + \frac{p_M}{p_N} u_M\frac{\beta}{\beta - \gamma}\left(e^{-\gamma s} - e^{-\beta s}\right)\right], \qquad (123)

i.e., the characteristic is invariant so long as bp_N and p_M/p_N are constant. By plugging in zero for u_N or u_M, we observe that the marginal characteristics take the functional form of the characteristics of the noise-free system, implying that different values of p_N and p_M may give indistinguishable marginal distributions. Therefore, identifying the relationship between p_N and p_M requires bivariate data. To quantitatively characterize how identifiable p_N and p_M are, we need to use simulations.

However, challenges particular to single-cell technologies arise when attempting to apply this model to large datasets with many genes. Although the Bernoulli model is a useful approximation, considering the sequencing process suggests that the non-sequestering technical noise model is more realistic: there is no chemical barrier to an RNA molecule being captured multiple times. In this formulation, each gene’s technical noise is parametrized by the species’ overall capture rates λN and λM, which produce the Bernoulli limit when both of these parameters are small.

Furthermore, it appears implausible that λ_{j,N} and λ_{j,M}, where j indexes over genes, vary arbitrarily. In a previous report21, we found that the model λ_{j,N} = C_N L_j and λ_{j,M} = λ_M performs satisfactorily. In this model, the nascent species are identified with unspliced molecules, which are considerably longer than spliced molecules and contain a large number of internal poly(A) priming sites. To a first-order approximation, we may propose that nascent species are captured at a rate proportional to the gene length L_j, where the constant of proportionality C_N is a dataset-wide technical noise parameter. Analogously, we identify the mature species with fully spliced, poly(A)-tailed molecules, and make the zeroth-order approximation that poly(A) tails are chemically identical. The capture rate λ_M is, then, also dataset-wide. Although this model is relatively simplistic, it foregrounds a key challenge. Even if we assume different genes’ transcriptional processes are independent, we cannot fit their distributions independently, as we need to account for coupling through the technical noise parameters.

Data analysis.

To illustrate the identifiability of p_M/p_N under the Bernoulli noise model, we considered the likelihood landscape for the simplest one-parameter formulation. We fixed the parameters α = 1, bp_N = 4.9, μ_N = αbp_N/β = 7, and μ_M = αbp_M/γ = 10; in other words, the nascent RNA distribution is negative binomial with shape α/β ≈ 1.43 and scale bp_N. We simulated data at p_M/p_N ∈ {1/4, 1, 4}, with η = {20, 50, 100, 200} simulated cells. For each of the true p_M/p_N and η values, we generated 200 datasets by sampling from the PMF on [0, …, 99] × [0, …, 99]. To evaluate the PMF for p_M > p_N, we set p_M to unity with no loss of generality. To evaluate it for p_N > p_M, we set p_N to unity. This yields b = (bp_N)/p_N and γ = αbp_M/μ_M. Next, we computed the likelihood of the data under log_10 p_M/p_N ∈ [−2, 2], keeping α, bp_N, μ_N, and μ_M constant, using the evaluation grid size [0, …, max x_N + 3] × [0, …, max x_M + 3], where max x_i is the maximum observed for each species in the simulation. We used 200 grid points for log_10 p_M/p_N, evenly spaced throughout the domain. Next, we computed the posteriors over the grid by dividing each likelihood vector by its sum. Finally, we plotted the average posterior distribution using line charts, with the color indicating the true value of p_M/p_N and the intensity indicating the number of cells, with more saturated colors corresponding to more simulated cells. For ease of comparison, we plotted the true values using dashed lines. From a statistical perspective, this analysis summarizes the parameter identifiability conditional on perfect information about the nascent marginal and the species averages. As we do not a priori know whether the differences in the PGF are actionable, the analysis illustrates the sample sizes required to fit the parameter to a particular degree of precision.
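Normalizing a vector of grid likelihoods into a posterior is a one-liner; here is a numerically stable version of the division-by-the-sum step described above (a generic sketch, not our analysis script).

import numpy as np

def grid_posterior(log_likelihoods):
    # Posterior over a parameter grid under a flat prior, via the log-sum-exp trick
    ll = np.asarray(log_likelihoods, dtype=float)
    w = np.exp(ll - ll.max())
    return w / w.sum()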

We previously motivated and fit the Poisson model of technical noise21,161. In Gorin et al.21, we inspected a variety of datasets, and observed a pronounced length bias in the nascent RNA count data, which did not appear in mature RNA counts (Section S7.3 of Gorin et al.21). This bias may be explained by three naïve models of biology.

The first model posits that the nascent RNA molecules are in the process of being transcribed; higher amounts of nascent RNA for longer genes simply reflect longer elongation delays. Although this explanation is superficially plausible, it is not borne out by the data. The model predicts a geometric-Poisson distribution of nascent RNA and zero correlation between nascent and mature counts140,142. Real data, on the other hand, have distinctly negative binomial-like marginals (as evident in, e.g., the third column of Fig. 4b of our recent work on delay CMEs140, which shows consistently inferior fits under the delay model), and nontrivial nascent/mature correlations (as in the red histogram in Figure 2b).

The second model posits that the differences in expression reflect real differences in the underlying biological parameters, and technical noise may be neglected. However, fitting this model produces pervasive length biases in the parameter values (Section S7.4 of Gorin et al.21), which are inconsistent with trends observed in orthogonal data. This is the model we explored in Gorin et al.21

The third model posits that technical noise does occur, but takes the species-independent form pN=pM. This formulation is mathematically identical to the second model, but proposes that an apparent length bias in the burst size is actually a length bias in bp. This model partially bypasses the objection raised for the second model by proposing that p is gene length-dependent, identical for nascent and mature species, and higher for longer genes. However, this model is implausible on physical grounds, as mature transcripts do not have the intronic poly(A) content necessary to produce this length dependence. This is indirectly implied by the consistently low fraction of exonic reads in sequencing datasets, in contrast to introns and the 3’ untranslated region136.

These biases can be largely eliminated by proposing a length-dependent sampling rate for nascent RNA counts, suggesting that this technical noise model is more coherent with known biology. To illustrate this process, we summarize the key results from Gorin et al.21

We obtained the raw data for the twelve 10x v3 datasets reported in Table S4 of Gorin et al.21. The raw data consisted of nascent and mature count matrices for 2,500 genes per dataset. The counts were generated by running the kallisto|bustools 0.26.0 kb count command on the raw FASTQs with the --lamanno option, using an intronic/exonic index built from the GRCh38 and mm10 reference genomes, as described in Section 6.8.2. The datasets were filtered to remove low-expression droplets, first using the default bustools filter, then using the manually selected knee plot thresholds shown in Table S5 of Gorin et al.21 Next, they were filtered for the top 2,500 moderate- to high-expression genes using the procedure in Section S4.3.1 of Gorin et al.21 To visualize the broad trends in count averages, we obtained the gene lengths L_j, then binned the values of log_10 L_j into ten bins, with the edges given by the deciles d_0, d_1, …, d_10. Next, we computed the average log_10 mean of nascent and mature expression levels for genes falling into each bin. Finally, we plotted these mean levels at each bin center d_k + (d_{k+1} − d_k)/2, connecting the values with a line to guide the eye. We repeated this analysis for all twelve datasets, distinguishing the nascent and mature statistics by color.

Next, we obtained the fit results for these datasets. The fits were performed using the Monod 0.2.5.0 Python package161 as described in Gorin et al.21 Fitting the model with no technical noise entailed gradient optimization over the per-gene joint distributions to fit b_j, β_j, and γ_j. Although the model did not explicitly include technical noise, the theoretical discussion above implies that the results can be interpreted as those from a p = p_N = p_M model, with the inferred “burst size” corresponding to b_j p_j for gene j. Fitting the model with technical noise entailed scanning over a grid of C_N and λ_M, obtaining per-gene maximum likelihood estimates of b_j, β_j, and γ_j conditional on the technical parameter values at the grid point, then identifying the grid point which produced the lowest sum of Kullback-Leibler divergences over all genes. In both cases, the genes underwent a round of goodness-of-fit filtering to remove fits that did not accurately recapitulate the data, as in Section S4.3.5 of Gorin et al.21 We then computed the average inferred log_10 burst size for the genes falling into each length bin. As with the means, we plotted the average burst sizes at each bin center, connecting the values with a line to guide the eye. We repeated this analysis for all twelve datasets, distinguishing the results fit with and without a technical noise component by color.

Supplementary Material

Supplementary material
Supplementary Table 3

Table S3 Genes discovered to be overdispersed (σ2>2μ) in empty droplets for each dataset in Table S2.

Supplementary Table 4

Table S4 Genes discovered to be overdispersed (σ2>2μ) in empty droplets for the neuron_1k_v3 and desai_dmso datasets, with function annotations.

Box 1. Generating function methods for studying stochastic biological systems.

Generating functions are ubiquitous tools in stochastic modeling. They are central to the analysis of discrete master equations, as they cast difficult-to-solve infinite-dimensional systems to partial differential equations, which can be treated using standard analytical or numerical methods. A (one-variable) probability distribution P(x) and its generating function G(g) are related according to the formulas:

G(g) = \sum_{x=0}^{\infty} g^x P(x), \qquad P(x) = \oint \frac{dg}{2\pi i}\,\frac{G(g)}{g^{x+1}} = \int_{-\pi}^{\pi} \frac{d\theta}{2\pi}\, e^{-i\theta x}\, G(e^{i\theta}). \qquad (1)

In the stochastic modeling of transcription, certain distributions, such as the Poisson and negative binomial, frequently appear. Because G uniquely specifies P, we can often invert G simply by recognizing its form and matching terms. Below are some generating functions of common distributions (Bernoulli, Poisson, geometric, and negative binomial):

P(x) = (1 - p)\,\delta_{0x} + p\,\delta_{1x}, \qquad G(g) = 1 - p + pg, \qquad (2)
P(x) = \frac{\lambda^x e^{-\lambda}}{x!}, \qquad G(g) = e^{\lambda(g - 1)}, \qquad (3)
P(x) = \frac{1}{1 + \theta}\left(\frac{\theta}{1 + \theta}\right)^{x}, \qquad G(g) = \frac{1}{1 - \theta(g - 1)}, \qquad (4)
P(x) = \frac{\Gamma(v + x)}{x!\,\Gamma(v)}\left(\frac{1}{1 + \theta}\right)^{v}\left(\frac{\theta}{1 + \theta}\right)^{x}, \qquad G(g) = \left(\frac{1}{1 - \theta(g - 1)}\right)^{v}. \qquad (5)

The generating function expressions can often be made more compact by applying the substitution u := g − 1.

Box 2. An illustration of the solution procedure.

Here, we will illustrate how to solve two simple transcription models using our framework. We assume that RNA is produced with burst event frequency α and degrades at a rate γ. In the constitutive model, each transcription event creates one RNA. In the bursty model, each transcription event creates a random number of RNA, distributed according to a geometric random variable with mean b. Both models have N=1,n=1, and m=0. Since these models are one-dimensional, the C and D matrices are 1 × 1. For both of them, C=[γ] and D=[0]. The ODE for the single characteristic U (with initial condition U(𝗌=0)=u) is

\frac{dU(u, s)}{ds} = -\gamma U(u, s) \quad\Longrightarrow\quad U(s) = u e^{-\gamma s}. \qquad (23)

For a general burst distribution p(z), the transcriptional evolution operator is 𝓐(u) = α(F(1 + u) − 1), where F is the GF of the number of molecules produced per transcription event. For our two models, we have

p(z) = \delta_{1,z}, \qquad F(1 + u) = 1 + u, \qquad \mathcal{A}(u) = \alpha u, \qquad (24)
p(z) = \frac{b^z}{(1 + b)^{z + 1}}, \qquad F(1 + u) = \frac{1}{1 - bu}, \qquad \mathcal{A}(u) = \frac{\alpha b u}{1 - bu}. \qquad (25)

To compute the stationary log-generating functions logG, we evaluate the integrals:

\log G = \int_0^\infty \alpha u e^{-\gamma s}\, ds = \frac{u\alpha}{\gamma} \quad \text{for the constitutive model, and} \qquad (26)
\log G = \int_0^\infty \alpha\left[\frac{1}{1 - bu e^{-\gamma s}} - 1\right] ds = -\frac{\alpha}{\gamma}\log(1 - bu) \quad \text{for the bursty model.}

The constitutive model yields a Poisson distribution with mean α/γ (c.f. Equation 3), whereas the bursty model yields a negative binomial distribution with shape α/γ and scale b (c.f. Equation 5).
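As a numerical check of this Box, one can invert the bursty model’s stationary GF on the unit circle by FFT and compare with the negative binomial PMF; the rate values below are illustrative.

import numpy as np
from scipy.stats import nbinom

alpha, gamma_, b = 1.5, 1.0, 4.0     # illustrative burst frequency, degradation rate, burst size
M = 256                              # support size used for the inversion

g = np.exp(2j * np.pi * np.arange(M) / M)             # PGF arguments on the unit circle
logG = -(alpha / gamma_) * np.log(1.0 - b * (g - 1.0))
pmf_fft = np.real(np.fft.fft(np.exp(logG))) / M       # P(x) = (1/M) sum_k G(g_k) exp(-2*pi*i*k*x/M)

pmf_exact = nbinom.pmf(np.arange(M), alpha / gamma_, 1.0 / (1.0 + b))
print(np.abs(pmf_fft - pmf_exact).max())              # near machine precision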

ACKNOWLEDGMENTS

G.G. and L.P. were partially funded by NIH 5UM1HG01207702 and NIH U19MH114830. J.V. was partially funded by NIH 1U19NS118246-01. The RNA, DNA, and cDNA illustrations were derived from the DNA Twemoji by Twitter, Inc., used under the CC-BY 4.0 license. The authors thank Dr. A. Sina Booeshaghi, Maria Carilli, Tara Chari, Taleen Dilanyan, Dr. Kristján Eldjárn Hjörleifsson, Meichen Fang, Catherine Felce, and Delaney Sullivan for fruitful discussions of co-regulation, contamination, transient behaviors, catalysis, fragmentation, genomic alignment, and a variety of other phenomena and processes. Part of this work was performed during G.G.’s Data Sciences Co-op with Celsius Therapeutics, Inc.

Footnotes

DECLARATION OF INTERESTS

The authors declare no competing interests.


REFERENCES


Data Availability Statement

  • This paper analyzes existing, publicly available data. The accession numbers for the datasets are listed in the key resources table. Pseudoaligned count matrices in the mtx format have been deposited at the Zenodo package 8132976. The data, Monod fits, and analysis scripts used to generate Figure 5d-e, originating from Gorin et al.21, were previously deposited as the Zenodo package 7388133.

  • All original code has been deposited at https://github.com/pachterlab/GVP_2023 and the Zenodo package 8132976, and is publicly available as of the date of publication. DOIs are listed in the key resources table.

  • Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request.

KEY RESOURCES TABLE

REAGENT or RESOURCE SOURCE IDENTIFIER
Deposited Data
H. sapiens peripheral blood 10x v3 scRNA-seq data 178 pbmc_1k_v3
M. musculus heart 10x v3 scRNA-seq data 179 heart_1k_v3
M. musculus neuron 10x v3 scRNA-seq data 180 neuron_1k_v3
M. musculus cultured embryonic stem cells treated with DMSO 10x v2 scRNA-seq data Desai et al. desai_dmso
H. sapiens peripheral blood 10x v2 scRNA-seq data (technical replicate of pbmc_1k_v3) 181 pbmc_1k_v2
M. musculus neuron 10x v3 snRNA-seq data 182 brain_nuc_5k_v3
Supporting data for GP_2021_3 Gorin and Pachter Zenodo: dataset 7388133
Software and Algorithms
Python python.org 3.9.1
NumPy numpy.org 1.22.1
SciPy scipy.org 1.7.3
pandas pandas.pydata.org 1.2.4
kallisto | bustools Melsted and Booeshaghi et al. 0.26.0
Monod Gorin and Pachter 2.5.0
Other
Count matrices for all datasets This manuscript Zenodo: dataset 8132976
Custom analysis notebooks This manuscript GitHub: https://github.com/pachterlab/GVP_2023 (version of record deposited at Zenodo: dataset 8132976)

RESOURCES