Author manuscript; available in PMC: 2024 Oct 18.
Published in final edited form as: Cell Syst. 2023 Sep 25;14(10):822–843.e22. doi: 10.1016/j.cels.2023.08.004

Studying stochastic systems biology of the cell with single-cell genomics data

Gennady Gorin 1, John J Vastola 2, Lior Pachter 3,4,5,*
PMCID: PMC10725240  NIHMSID: NIHMS1935058  PMID: 37751736

Abstract

Recent experimental developments in genome-wide RNA quantification hold considerable promise for systems biology. However, rigorously probing the biology of living cells requires a unified mathematical framework that accounts for single-molecule biological stochasticity in the context of technical variation associated with genomics assays. We review models for a variety of RNA transcription processes, as well as the encapsulation and library construction steps of microfluidics-based single-cell RNA sequencing, and present a framework to integrate these phenomena by the manipulation of generating functions. Finally, we use simulated scenarios and biological data to illustrate the implications and applications of the approach.

Graphical Abstract

[Graphical abstract figure]

1. INTRODUCTION

In his classic systems biology textbook1, D. J. Wilkinson notes that “Improvements in experimental technology are enabling quantitative real-time imaging of expression at the single-cell level, and improvement in computing technology is allowing modelling and stochastic simulation of such systems at levels of detail previously impossible. The message that keeps being repeated is that the kinetics of biological processes at the intra-cellular level are stochastic, and that cellular function cannot be properly understood without building that stochasticity into in silico models”. From this perspective, systems biology studies control over randomness, and the ways in which living cells exploit variability to grow and function. Counterintuitively, this stochastic weltanschauung relies on mental models that are inherently deterministic: differentiation landscapes26, gene expression manifolds7, cellular state graphs8,9, gene regulatory networks10,11, and kinetic parameters12. Analysis of experimental data therefore requires reconciling underlying deterministic structure with biological stochasticity and experimental technical variability, or noise. In particular, distinguishing technical noise from biological stochasticity involves the statistical modeling of experimental readouts, expected noise sources, and the signal-to-noise ratio, and requires consideration of the theoretical and computational tractability of the model.

How can we model these features—latent deterministic structure, biological stochasticity, and technical noise—in a way that balances our models’ ability to adequately describe available data with our own ability to adequately understand the mathematical behavior and biological interpretation of our models? Answering this question is particularly challenging in the context of single-cell genomics, where datasets are large and sparse, the signal-to-noise ratio is low, and stochasticity is one of the defining features of the underlying biophysics1315. Here, we explain why many naïve approaches to understanding the stochastic systems biology of single cells fall short, and describe a theoretical framework that can serve as an alternative. Our framework extends recent work on the mechanistic modeling of single-cell RNA count distributions1621, and addresses both how models can be efficiently fit to single-cell data, and what features of the underlying biology we can hope to learn.

After introducing the general framework, we illustrate its consequences through a series of vignettes. In each case, we consider modeling particular aspects of biological and technical noise, and ask: (1) What do our models help us learn about the underlying biology? and (2) What could go wrong if we ignored these features of our data? We find that certain kinds of noise must be carefully modeled, others are poorly identifiable, while others still cannot be identified at all and can be safely ignored.

2. SYSTEMS BIOLOGY AND SINGLE-CELL GENOMICS

2.1. Standard approaches to systems biology

If an experiment has ample controls and provides a readout with a high signal-to-noise ratio in the relevant variables, coarse-grained, moment-based models can be ideal. For example, investigations of cell growth have effectively used least-squares regression to fit scaling relationships between cell volume and molecular abundance that hold on average22,23. Analogously, experiments leveraging the integration of multiple fluorescent reporters have successfully decomposed molecular noise sources into intrinsic and extrinsic components24, leading to numerous analytical2528 and experimental2931 extensions that leverage the lower moments of poorly-characterized biological drivers to describe or delimit the system variability. These approaches, which have found application to new experimental techniques, have origins in the Onsager and Langevin theories of the early twentieth century32, which specify the moment behaviors of near-equilibrium statistical thermodynamic systems using Gaussian terms.

Alongside studying biology on a gene-by-gene basis, considerable effort has been dedicated to the discovery of regulatory networks. This problem is considerably more challenging: the number of candidate network modules rapidly grows with the number and size of motifs of interest, and simple moment-based models risk distorting key qualitative features, such as multistability. From the perspective of statistics, network inference requires specifying or bypassing likelihood functions for joint gene expression, which may combine various noise sources in addition to the “signal” of regulation. Typical ways of addressing this challenge include33,34:

  1. The purely descriptive approach, which interprets an expression correlation matrix as a graph, but does not provide an easily interpretable way to extract its “signal.”

  2. Thresholding, which bins the unknown observed distribution to obtain a known, but lower-information distribution, as with binarization used to construct Boolean networks35 or implement the phixer algorithm36.

  3. Distributional assertion, which fits static observations by assuming statistics or observations are Gaussian, as in a variety of popular Bayesian34, information-theoretic37, and regression-based38 methods; this assumption may39 or may not40 provide accurate results.

  4. The dynamic approach, which fits a time-dependent trajectory to data assuming Gaussian residuals; this assumption may reflect stochastic differential equation dynamics41 or isotropic observation noise added to a latent process4244.

This overview is far from exhaustive, but it demonstrates a key theme: relatively robust signal, such as the lower moments or the absence/presence of gene expression, can be treated using fairly simple models that rely on highly optimized, well-understood methods and algorithms developed in the context of signal processing and dynamical systems analysis. Which simple model may perform best is not known a priori, and heavily depends on the task33. Ideally, methods are benchmarked on simulated39,45 or well-characterized “gold standard”33,46 datasets to glean partial insight about their performance and limitations. In this framework, improving the signal-to-noise ratio requires either designing more precise readouts or sacrificing a portion of the obtained data.

2.2. The challenge of single-cell data

Advances in sequencing technologies, most dramatically the rapid commercialization and adoption of single-cell RNA sequencing (scRNA-seq), which can profile millions of cells on a genome-wide scale47,48, have been heralded as a promising frontier for systems biology4951. This potential is more striking yet due to simultaneous advances in multiomics, or the measurement of multiple modalities (transient and non-coding RNA species, DNA methylation, chromatin accessibility, surface protein abundance) in individual cells52,53, facilitating “integrated” analysis5456. The “big data” from single-cell sequencing have thus served as substrate for a plethora of investigations which are, at first glance, analogous to the research program of systems biology at large: the identification of cell types; their aggregation into trajectories; the discovery of gene modules that consistently differ between cell types or throughout a differentiation trajectory; and the visualization of low-dimensional summaries reflecting some component of the data structure.

To identify these coarse-grained motifs in the structure of single-cell datasets, it is common practice to construct cell–cell graphs from measures of expression similarity and to use them to identify cliques (cell types), shortest paths (trajectories), and neighborhood-preserving low-dimensional embeddings (visualizations). In addition, relatively simple parametric distributions are widely used, with the Gaussian assumption popular for the lower moments (e.g., to compute measures of differential expression), and the lognormal or negative binomial used to describe count distributions57,58. Standard single-cell RNA sequencing data provide snapshots of processes, rendering dynamical analysis fairly complex, but it is common to fit a “pseudotemporal” curve through the dataset by minimizing a Gaussian error term between this curve and some transformation of the cells’ expression levels59,60.

Here, however, the underlying assumptions break down. Single-cell data are intrinsically and qualitatively different from readouts of typical systems biology experiments, with drastic implications for analysis. Single-cell data are large and sparse, with a preponderance of technical noise effects, poorly characterized batch- and gene-level biases, and low per-cell copy numbers1315. Improving the signal-to-noise ratio by designing more targeted experiments is challenging, as commercial technology is designed to quantify molecules on a genome-wide scale. More problematically, typical distributional assumptions and data transformations risk losing a considerable amount of signal in the low-copy number regime. This challenge informs part of the broader discussion of the relative roles of data analysis and mechanistic hypotheses in genomics19,20,61, as analyses are not constrained by mechanism or theory and may contradict existing knowledge.

More specific critiques have considered whether various analyses are appropriate or excessively heavy-handed. For example, sparsity has led to ad hoc procedures to “correct” the data, which may in turn lead to incorrect conclusions6264. Normalization and log-transformation, which attempt to remove technical biases and prepare the data for dimensionality reduction, rely on assumptions, such as high copy numbers and homogeneity, that are routinely violated in single-cell datasets65,66. Dimensionality reduction risks distorting both local and global relationships between data points19,67,68. Finally, the use of cell–cell graphs constructed from noisy data reifies relationships which may not reflect those in the original tissue, and risks introducing hard-to-diagnose errors into downstream analysis19,69. Although these issues span the entire process of analysis, all, at least partially, trace back to uncomfortable compromises in the treatment of uncertainty and variation in a regime unforgiving of approximations.

2.3. Stochastic modeling of intracellular network dynamics

Stochasticity is, then, mandatory, and we ignore it at our own risk. Therefore, we advocate for probabilistic alternatives to the “extraction” of signal from scRNA-seq datasets. Since biology is stochastic, the noise is the signal. To quantify and characterize the components of deterministic mental models—differentiation landscapes, kinetic parameters, and similar low-dimensional abstractions70—in a principled way, we need to combine them with stochastic terms which result from specific hypotheses about the underlying biophysics and chemistry20, or risk confirmation bias19.

The development of stochastic models offers advantages beyond loss function book-keeping. If multiomic data are available, there is typically a self-consistent way to extend the models accordingly71. Although likelihoods induced by stochastic processes are challenging to analyze and implement, they provide appealing statistical properties. When the data are sufficiently informative, full distributions provide better estimates than moments40. When they are not, probabilistic approaches are appropriately conservative, as they report, rather than elide, the parameter degeneracies. A thorough mathematical understanding of model behaviors—i.e., precisely which parameters are identifiable and which are degenerate, as well as how much data must be collected—enables the design of informative experiments20,72. Finally, the use of mechanistic models, parametrized by rate constants, allows us to draw conclusions about the mechanistic bases and effects of perturbations73.

These principles have guided fluorescence-based single-cell transcriptomics for nearly twenty years. To obtain as much information as possible from entire copy-number distributions40,74, the field has developed a considerable arsenal of theoretical tools75,76 and solution strategies7779. It is, then, particularly natural to build scRNA-seq models that extend processes consistent with fluorescence imaging: this approach allows us to leverage existing theory, as well as encode the intuition that technology-dependent effects should be independent from biological ones. A particularly popular class of models involves the bursty production of RNA and its Markovian degradation73,80, which can be analyzed in the chemical master equation (CME) framework81,82. The key theoretical points have already been applied in the context of single-cell sequencing; for example, the Poisson, Poisson-gamma, and Poisson-beta distributions, which are common in sequencing analyses58,63,83,84, are three of the limiting distributions induced by this class of models20,80,85. However, this possible mechanistic basis is only rarely84,8688 invoked in the development of analysis methods.

2.4. Outlook

Unfortunately, we cannot simply apply existing methods from fluorescence transcriptomics; the scale and chemistry of single-cell technologies create additional desiderata. General CME solutions are computationally prohibitive and challenging to scale to thousands of genes89, requiring careful study of narrow model classes with tractable solutions17,20. In addition, connecting biological models to observations requires explicitly representing the experimental process. The existing models for fluorescence data are sophisticated79, but cannot be directly applied to sequencing data. Although a variety of models have been proposed for technical noise in single-cell technologies13,14,90,91, their chemical foundations, as well as implications for biological parameter identifiability, have been understudied21.

In light of this lacuna, we seek to produce a mathematical framework that (1) integrates biological and technical variability in a coherent, modular way; (2) scales to large, multimodal data; (3) can be used to simulate datasets and make testable, quantitative predictions; and (4) affords a thorough mathematical analysis of its components, if not the entire model.

3. STOCHASTIC MODELING OF SINGLE-CELL BIOLOGY

Constructing a general-purpose framework for the stochastic modeling of single-cell biology necessitates working at a relatively high level of abstraction, since we would in principle like to account for a range of processes with one formalism. In this section, we motivate our abstract formalism using a collection of concrete, biologically relevant examples.

One of the simplest models of transcription is the constitutive model, which assumes RNA is produced at a constant rate20,92. It is defined by the chemical reactions

$$\varnothing \xrightarrow{K} \mathcal{X}, \qquad \mathcal{X} \xrightarrow{\gamma} \varnothing, \tag{6}$$

where 𝓧 is a single species of RNA, K is the (constant) transcription rate, and γ is the degradation rate. The CME that corresponds to this system is

$$\frac{\partial P(x,t)}{\partial t} = K\left[P(x-1,t) - P(x,t)\right] + \gamma\left[(x+1)P(x+1,t) - xP(x,t)\right], \tag{7}$$

where P(x,t) is the probability that the system has x ≥ 0 RNA molecules at time t. Solving the above master equation allows us to compare its predictions with experimental scRNA-seq data. There are several theoretical approaches for doing this—including using a special ansatz85, the Poisson representation93, the Doi-Peliti path integral17,94–96, and operator techniques97—but we would like to highlight a straightforward method that we know works for far more general problems. The idea is to consider a certain transformed version of the probability distribution, which satisfies a partial differential equation (PDE) instead of a differential-difference equation. This PDE, for a large class of biologically relevant systems, can then be solved using the method of characteristics98, which converts the problem of solving a PDE into integrating a system of ordinary differential equations (ODEs). This is mathematically equivalent to using certain path integral methods17,20,99.

Define the generating functions (GFs)

$$G(g,t) := \sum_{x=0}^{\infty} g^x P(x,t) \qquad \text{and} \qquad \phi(u,t) := \log G(g,t), \tag{8}$$

where g is on the complex unit circle and u := g − 1. It is easy to show that G and ϕ satisfy the PDEs

$$\frac{\partial G}{\partial t} = (g-1)\left[KG - \gamma \frac{\partial G}{\partial g}\right], \qquad \frac{\partial \phi}{\partial t} = Ku - \gamma u \frac{\partial \phi}{\partial u}. \tag{9}$$

We can use the method of characteristics to find that

$$\phi(u,t) = \phi_0(U(t)) + K\int_0^t U(s)\,ds, \qquad \frac{dU}{ds} = -\gamma U, \tag{10}$$

where the U(s) ODE has initial condition U(s = 0) = u, and where ϕ0 is the initial (log-) generating function of the system. In order to determine P(x,t) from ϕ(u,t) = log G(g,t), we can use an inverse Fourier transform:

$$P(x,t) = \oint \frac{dg}{2\pi i}\, \frac{G(g,t)}{g^{x+1}} = \int_{-\pi}^{\pi} \frac{d\theta}{2\pi}\, e^{-i\theta x}\, G(e^{i\theta},t),$$

where we integrate over all g on the complex unit circle. In practice, this step is done numerically using an inverse fast Fourier transform.
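To make this concrete, the following minimal sketch (our own illustrative code with made-up parameter values, not the authors' implementation) evaluates the constitutive-model generating function on the unit circle and recovers P(x,t) with a fast Fourier transform, checking the result against the Poisson distribution expected for this model when the system starts with zero RNA.

```python
# Minimal sketch of the GF -> inverse FFT step for the constitutive model.
# Assumes the system starts with zero RNA, so phi_0 = 0 and
# phi(u, t) = (K / gamma) * u * (1 - exp(-gamma * t)) with u = g - 1.
import numpy as np
from scipy.stats import poisson

K, gamma, t = 5.0, 1.0, 2.0          # illustrative transcription rate, degradation rate, time
N = 64                               # number of unit-circle quadrature points / states retained

g = np.exp(2j * np.pi * np.arange(N) / N)   # points on the complex unit circle
u = g - 1.0
G = np.exp((K / gamma) * u * (1.0 - np.exp(-gamma * t)))

# Discrete analogue of the inverse Fourier transform above: P(x, t) for x = 0, ..., N-1.
P = np.fft.fft(G).real / N

# For this model the distribution is Poisson with a time-dependent mean, which we can verify.
mean = (K / gamma) * (1.0 - np.exp(-gamma * t))
assert np.allclose(P[:20], poisson.pmf(np.arange(20), mean), atol=1e-8)
print(P[:5])
```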

The constitutive model, which produces Poisson distributions at steady state, is too simple for single-cell biology20. But fortunately, the technique we have just described can be adapted to predict the behavior of substantially more complex models.

Multiple types of RNA.

One possible generalization of the constitutive model is to so-called monomolecular systems17,85, which allow phenomena like RNA splicing to be accommodated. An example is the addition of splicing to the constitutive model:

$$\varnothing \xrightarrow{K} \mathcal{X}_N, \qquad \mathcal{X}_N \xrightarrow{\beta} \mathcal{X}_M, \qquad \mathcal{X}_M \xrightarrow{\gamma} \varnothing. \tag{11}$$

In general, any number of production, conversion, and degradation reactions can be modeled:

$$\varnothing \xrightarrow{K_i} \mathcal{X}_i, \qquad \mathcal{X}_i \xrightarrow{c_{ij}} \mathcal{X}_j, \qquad \mathcal{X}_i \xrightarrow{c_{i0}} \varnothing. \tag{12}$$

Using the same technique described earlier, the probability P(x,t) that the system is in state x ∈ ℕ₀ⁿ at time t can be shown to have the (log-)generating function

$$\phi(u,t) = \phi_0(U(t)) + \int_0^t K^T U(s)\,ds, \qquad \frac{dU}{ds} = C\,U, \tag{13}$$

where U(s = 0) = u, and the C matrix is defined via

$$C_{ij} = c_{ij} \quad (i \neq j), \qquad C_{ii} = -\sum_{j=0}^{n} c_{ij}, \tag{14}$$

and where c_ii := 0 by convention.
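As a small illustration of Equation 14 (a sketch with an arbitrary reaction list, using the sign convention reconstructed above, with index 0 denoting degradation), the function below assembles C from a dictionary of rates c_ij and reproduces the splicing model of Equation 11:

```python
# Minimal sketch: build the matrix C of Equation 14 from conversion/degradation rates c_ij,
# where species are indexed 1..n and index 0 denotes degradation.
import numpy as np

def build_C(rates, n):
    """rates: dict mapping (i, j) -> c_ij with 1 <= i <= n, 0 <= j <= n, and j != i."""
    C = np.zeros((n, n))
    for (i, j), c in rates.items():
        if j != 0:
            C[i - 1, j - 1] = c            # off-diagonal: C_ij = c_ij
        C[i - 1, i - 1] -= c               # diagonal: C_ii = -sum_j c_ij (including c_i0)
    return C

# The splicing model of Equation 11: species 1 = nascent, 2 = mature.
beta, gamma = 2.0, 1.0
print(build_C({(1, 2): beta, (2, 0): gamma}, n=2))
# expected: [[-beta, beta], [0, -gamma]]
```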

Multiple gene states.

Although the monomolecular model is a step forward, it still does not account for nontrivial transcription rate dynamics. One possibility is that there are multiple gene states, as in the telegraph model76,97,100:

$$\mathcal{S}_{\text{off}} \underset{k_{\text{off}}}{\overset{k_{\text{on}}}{\rightleftharpoons}} \mathcal{S}_{\text{on}}, \qquad \mathcal{S}_{\text{on}} \xrightarrow{k_{\text{init}}} \mathcal{S}_{\text{on}} + \mathcal{X}, \qquad \mathcal{X} \xrightarrow{\gamma} \varnothing. \tag{15}$$

The corresponding three-variable generating function is

$$\phi(u,u_{\text{on}},u_{\text{off}},t) = \phi_0\big(U(t),U_{\text{on}}(t),U_{\text{off}}(t)\big), \tag{16}$$
$$\frac{dU}{ds} = -\gamma U, \qquad \frac{dU_{\text{off}}}{ds} = -k_{\text{on}}\left(U_{\text{off}} - U_{\text{on}}\right), \qquad \frac{dU_{\text{on}}}{ds} = -k_{\text{off}}\left(U_{\text{on}} - U_{\text{off}}\right) + k_{\text{init}}\left(U_{\text{on}}+1\right)U,$$

where U(0) = u, U_off(0) = u_off, and U_on(0) = u_on. If we want to marginalize over gene state, which we usually do since it is not observable, we can set u_off = u_on = 0. Notice that the relevant ODEs are now nonlinear (Riccati-type) equations, which makes them difficult to solve by hand. In general, considering multiple gene states, or other kinds of added complexity like autocatalytic reactions, yields nonlinear characteristic ODEs. This is no obstacle for numerical integration, however.
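As an illustration of this numerical route, the sketch below (our own illustrative code, not the paper's implementation) integrates the telegraph-model characteristics of Equation 16 with a fixed-step RK4 scheme, vectorized over points on the unit circle, and recovers the RNA marginal by inverse FFT; it assumes the system starts in the off state with zero RNA, so that ϕ_0 = log(1 + u_off).

```python
# Minimal sketch: numerically integrate the telegraph-model characteristics (Equation 16)
# and recover the RNA copy-number distribution at time t by inverse FFT.
import numpy as np

k_on, k_off, k_init, gamma = 0.3, 0.5, 10.0, 1.0     # illustrative parameters
t_final, n_steps, N = 5.0, 2000, 128

u = np.exp(2j * np.pi * np.arange(N) / N) - 1.0       # spectral variable for the RNA species
U = u.copy()
U_on = np.zeros(N, complex)                           # u_on = u_off = 0 marginalizes
U_off = np.zeros(N, complex)                          # over the gene state

def rhs(U, U_on, U_off):
    dU = -gamma * U
    dU_on = -k_off * (U_on - U_off) + k_init * (U_on + 1.0) * U
    dU_off = -k_on * (U_off - U_on)
    return dU, dU_on, dU_off

dt = t_final / n_steps
for _ in range(n_steps):                              # classical RK4, vectorized over all points
    k1 = rhs(U, U_on, U_off)
    k2 = rhs(U + dt / 2 * k1[0], U_on + dt / 2 * k1[1], U_off + dt / 2 * k1[2])
    k3 = rhs(U + dt / 2 * k2[0], U_on + dt / 2 * k2[1], U_off + dt / 2 * k2[2])
    k4 = rhs(U + dt * k3[0], U_on + dt * k3[1], U_off + dt * k3[2])
    U = U + dt / 6 * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0])
    U_on = U_on + dt / 6 * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1])
    U_off = U_off + dt / 6 * (k1[2] + 2 * k2[2] + 2 * k3[2] + k4[2])

# Initial condition: all cells in the off state with zero RNA, so G_0 = g_off and
# phi_0 = log(1 + u_off); evaluating it at the integrated characteristics gives G = 1 + U_off.
G = 1.0 + U_off
P = np.fft.fft(G).real / N                            # P(x, t_final) for x = 0, ..., N-1
print(P[:10], P.sum())
```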

Gene regulation.

Another possibility we would like to account for is nontrivial gene regulation. In previous work20, we considered two models of transcription rate variation: the gamma Ornstein–Uhlenbeck (Γ-OU) model, which assumes variation is due to changes in the mechanical state of DNA; and the Cox–Ingersoll–Ross (CIR) model, which assumes it is due to fluctuations in the concentration of an abundant regulator molecule. Analyzing them can be mathematically challenging, since the discrete stochastic dynamics of RNA production and degradation are coupled to the continuous stochastic process that controls the transcription rate. Fortunately, both models and many generalizations of them can be solved using the method of characteristics. For example, the CIR model (assuming two RNA species) is defined by a stochastic differential equation (SDE81) and three reactions:

$$\frac{dK}{dt} = a\theta - \kappa K + \sqrt{2\kappa\theta K}\,\xi(t), \qquad \varnothing \xrightarrow{K(t)} \mathcal{X}_N, \qquad \mathcal{X}_N \xrightarrow{\beta} \mathcal{X}_M, \qquad \mathcal{X}_M \xrightarrow{\gamma} \varnothing, \tag{17}$$

and its solution is20

$$\phi(u_N,u_M,u_K,t) = \phi_0\big(U_N(t),U_M(t),U_K(t)\big) + a\theta \int_0^t U_K(s;\,u_N,u_M,u_K)\,ds, \tag{18}$$
$$\frac{dU_M}{ds} = -\gamma U_M, \quad U_M(0) = u_M; \qquad \frac{dU_N}{ds} = \beta\left(U_M - U_N\right), \quad U_N(0) = u_N; \qquad \frac{dU_K}{ds} = U_N - \kappa U_K + \kappa\theta U_K^2, \quad U_K(0) = u_K.$$

Thus, it is straightforward to couple dynamics defined on different types of state spaces: categorical (e.g., gene states), continuous (e.g., transcription rates), and discrete (e.g., RNA counts), using the generating function approach. In all cases, one obtains a generating function solution in terms of a finite set of (possibly nonlinear) ODEs. The total number of ODEs is equal to the total number of degrees of freedom.

One feature of single-cell biology that is challenging to capture using this approach is feedback. For example, proteins expressed by a gene may affect the transcription rate of that gene. Although exact solutions for systems involving feedback are available in certain simple cases101104, particularly when there is only one chemical species, more general results have proven elusive. From the point of view of our approach, including chemical reactions that involve feedback yields generating function PDEs which are not first order, and cannot be solved in terms of ODEs via the method of characteristics (as explored in more detail in the supplemental information).

Transient effects.

In the context of development or reprogramming, we are especially interested in using single-cell genomics data to study transient processes. In particular, certain cell types or subtypes (like neural progenitor cells) only exist for a certain window of time, and by collecting single-cell data we are taking a snapshot of many cells, each of which may be in a different part of the process. How does this affect observed RNA counts?

Different cells being observed at different times means we are not interested in P(x,t), but P(x,t) averaged over some distribution that indicates how likely we are to sample different times. The shape of the sampling distribution f(t) depends on when cells tend to exit a given state (e.g., by differentiating into a different cell type). Nontrivial sampling distributions are compatible with our generating function approach, since we can simply modify the distribution that appears. For a model with one discrete species, we can write the full generating function Gtot as

$$G_{\text{tot}}(g) = \sum_{x=0}^{\infty} g^x \int_0^T P(x,t)\,f(t)\,dt = \int_0^T G(g,t)\,f(t)\,dt,$$

i.e., we can obtain it by integrating the generating function that captures intrinsic noise.
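As a concrete illustration (a sketch with made-up parameters, not the paper's implementation), the snippet below averages the constitutive-model generating function over an assumed uniform sampling distribution f(t) = 1/T on [0, T] by trapezoidal quadrature, then inverts the result to obtain the snapshot distribution.

```python
# Minimal sketch: average G(g, t) over a sampling distribution f(t) (here uniform on [0, T])
# and recover the resulting snapshot copy-number distribution by inverse FFT.
import numpy as np

K, gamma, T = 5.0, 1.0, 3.0          # illustrative parameters
N, n_t = 64, 400                     # unit-circle points and time-grid points

u = np.exp(2j * np.pi * np.arange(N) / N) - 1.0
t_grid = np.linspace(0.0, T, n_t)

# Constitutive-model GF at each time point (zero RNA at t = 0): G(g, t) = exp(phi(u, t)).
G_t = np.exp((K / gamma) * u[None, :] * (1.0 - np.exp(-gamma * t_grid[:, None])))

f = np.full(n_t, 1.0 / T)            # uniform sampling distribution f(t)
w = np.full(n_t, t_grid[1] - t_grid[0])
w[0] *= 0.5
w[-1] *= 0.5                         # trapezoidal quadrature weights
G_tot = (w[:, None] * f[:, None] * G_t).sum(axis=0)

P = np.fft.fft(G_tot).real / N       # snapshot distribution over sampled cells
print(P[:5], P.sum())
```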

Technical noise.

In single-cell genomics experiments, we do not directly observe a given cell’s RNA counts, but those numbers filtered through a noisy sequencing process21. In microfluidics-based sequencing, noise can come from some combination of droplets not capturing all molecules (especially types of RNA with low copy numbers), errors in amplification, and reads not being uniquely identifiable. We would like to account for these kinds of technical noise in a way that is both principled, and compatible with our generating function approach to modeling intrinsic noise.

Consider a simple example, in which the relevant biology is described by the one-species constitutive model (Equation 7), and each RNA molecule is observed independently with probability p. The probability of observing xobs molecules of RNA, given a biological distribution P(x,t), is

$$P(x_{\text{obs}},t) = \sum_{x=0}^{\infty} P(x_{\text{obs}} \mid x)\,P(x,t) = \sum_{x=0}^{\infty} \binom{x}{x_{\text{obs}}}\,p^{x_{\text{obs}}}(1-p)^{x-x_{\text{obs}}}\,P(x,t). \tag{19}$$

The corresponding generating function Gtot is

$$G_{\text{tot}}(g,t) = \sum_{x=0}^{\infty} \sum_{x_{\text{obs}}=0}^{x} g^{x_{\text{obs}}}\,P(x_{\text{obs}} \mid x)\,P(x,t) = \sum_{x=0}^{\infty} \left[gp + (1-p)\right]^x P(x,t), \tag{20}$$

i.e., the result is the same as without technical noise, except that we have replaced g → gp + (1 − p). In general, including technical noise requires us to replace the usual g^x factor with G_noise(g,x), the generating function associated with the observation model:

$$G_{\text{tot}}(g,t) = \sum_{x=0}^{\infty} G_{\text{noise}}(g,x)\,P(x,t). \tag{21}$$

For certain common observation models, like the Bernoulli model just described, or a Poisson noise model, we can say more: since

$$G_{\text{noise}}(g,x) = G_*(g)^x \tag{22}$$

for some G*, including technical noise amounts to replacing g with G*, so that G_tot = G(G*) is a composition of generating functions. We typically assume that all technical noise models satisfy Equation 22 for some G*.
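The composition can be carried out numerically in a single substitution; the sketch below (illustrative values, not the authors' code) applies Bernoulli capture with probability p to the steady-state constitutive model and confirms that a Poisson distribution with mean K/γ is thinned to a Poisson with mean pK/γ.

```python
# Minimal sketch: technical noise as a composition of generating functions,
# G_tot(g) = G_bio(G_*(g)), with a Bernoulli observation model G_*(g) = 1 - p + p g.
import numpy as np
from scipy.stats import poisson

K, gamma, p, N = 5.0, 1.0, 0.3, 64   # illustrative parameters
g = np.exp(2j * np.pi * np.arange(N) / N)

G_star = 1.0 - p + p * g                              # observation-layer GF
G_bio = lambda z: np.exp((K / gamma) * (z - 1.0))     # steady-state constitutive GF (Poisson)
G_obs = G_bio(G_star)                                 # composed GF of the observed counts

P_obs = np.fft.fft(G_obs).real / N
assert np.allclose(P_obs[:20], poisson.pmf(np.arange(20), p * K / gamma), atol=1e-10)
print(P_obs[:5])
```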

4. RESULTS

4.1. Theoretical framework for stochastic systems biology

We are ready to present our general framework for stochastic systems biology, which accommodates all of the sources of stochasticity described in the preceding section: intrinsic noise, transient effects, and technical noise. In order to balance the amount of biology our models can capture with the mathematical tractability of those models, we restrict our analysis to a fairly general class of systems that can be solved using the method of characteristics. For such systems, we can obtain likelihoods by integrating characteristic ODEs, using the obtained characteristics to construct the generating function, and then doing an inverse (fast) Fourier transform.

This class of systems permits gene state interconversion, as well as the production and processing of RNA and proteins, which can be treated as discrete or continuous variables depending on their concentration. We allow zero- and first-order reactions, including state-dependent bursting, interconversion, degradation, and catalysis. However, we disallow higher-order reactions (e.g., binding reactions A + B → C), including feedback-based regulation like protein–promoter binding. Therefore, our analysis focuses on Markovian systems that possess N categorical degrees of freedom, corresponding to gene states; n discrete ones, corresponding to low-copy number molecular species; and m continuous ones, corresponding to transcription rates or high-concentration species. This class of reactions is schematically represented in Figure 1a; crucially, it consists of distinct “upstream” and “downstream” components.

Figure 1.

The biophysical and chemical phenomena of interest, as well as the relationships between their generating functions.

a. The biological phenomena of interest: cell influx and efflux into a tissue observed by sequencing; the time-dependent transcriptional regulation of one or more genes; downstream continuous and discrete processes.

b. The technical phenomena of interest: the encapsulation of cells and cell debris; cDNA library construction; the loss of information in transcript identification (GF: generating function; RTase: reverse transcriptase).

c. The structure of the full generating function of the system in a and b: to obtain the solution, we variously compose, integrate, and multiply the generating functions of the constituent processes.

d. The stochastic and statistical properties of four components of the full system: the background debris, the transcriptional regulation, the cell/tissue relationship, and the technical noise mechanism.

Given all of a model’s possible reactions, one can write down a corresponding master equation that keeps track of how microstate probabilities change with time:

$$\frac{dP(s,x,y,t)}{dt} = \psi(s,x,y,t), \tag{27}$$

where each microstate consists of s, the categorical dimension; x ∈ ℕ₀ⁿ, the n discrete dimensions; and y ∈ ℝᵐ, the m continuous dimensions. The generally complicated function ψ aggregates all reaction rates. Master equations like Equation 27 typically consist of an infinite system of coupled ODEs, and hence are difficult to solve in general. This is one reason we chose a particular class of systems: to solve Equation 27 using the method of characteristics, and hence determine a given model’s predictions, all we need to do is solve (a finite number of) ODEs satisfied by the characteristics and GF.

The N-dimensional GF G = (G_1, …, G_N)^T of the system, which is a function of spectral variables g and h, is defined by

$$G_s(g,h,t) := \int_y \sum_x g^x\, e^{h^T y}\, P(s,x,y,t)\,dy. \tag{28}$$

Equation 27 can be converted into a PDE satisfied by G:

$$\frac{\partial \mathbf{G}}{\partial t} = \mathcal{M}(u,t)\,\mathbf{G} + J\left[Cu + \operatorname{diag}(u)\,Du\right], \qquad \mathcal{M}(u,t) := H(t)^T \odot \mathcal{A}(u,t), \qquad u := \begin{bmatrix} g-1 \\ h \end{bmatrix}, \tag{29}$$

where ⊙ is the Hadamard/elementwise matrix product, J is the Jacobian matrix of the generating function with respect to u, and u combines the discrete and continuous degrees of freedom. The time-dependent matrix H contains the kinetics of state transitions, whereas the operator 𝓐 describes the drift and bursty production processes, which may depend on state. Therefore, the operator ℳ aggregates the upstream components of the system. The matrix C contains interconversion, degradation, and mean reversion-like terms, whereas D contains the catalysis and square-root noise terms. ℳ, C, and D encode a quasi-linear, deterministic, and first-order N-component system of partial differential equations in n + m spectral variables.

Applying the method of characteristics to solve Equation 29 tells us that the downstream part of the system is fully determined by a set of characteristics U, which are defined by the ODEs

$$\frac{dU(s)}{ds} = C\,U(s) + \operatorname{diag}(U(s))\,D\,U(s), \tag{30}$$

where s is an integration variable, and U(s = 0) = u. Meanwhile, the generating function G can be determined from

$$\frac{d\mathbf{G}(s)}{ds} = \mathcal{M}\big(U(s),\,t-s\big)\,\mathbf{G}, \tag{31}$$

which has initial condition G_0(U(t)), where G_0 is the generating function of the initial distribution. The upstream components describe how molecule production occurs, and hence depend on ℳ; their influence on the final answer enters through the integration of this ODE.

The detailed form of Equation 27 is complicated, and the arithmetic exercise of converting it into Equation 29 is tedious. We show how to construct the biological master equation in Section 6.1, write it out in full in Section 6.2, and discuss at a high level how to solve it using our generating function approach in Section 6.3. The terms of the full master equation are annotated in Table S1, and the solution process is described in more detail in supplemental information.

In special cases, the ODEs we obtain can be solved exactly. For example, whenever D=0, the downstream ODE system can be solved analytically by eigendecomposition. If, in addition, only a single gene state is present, H vanishes and the upstream component can be evaluated by numerical integration16. Finally, in the simplest case of a linear operator 𝓐, we obtain an analytically tractable system equivalent to a deterministic system of reaction rate equations17,85.
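For instance, under the assumption of a purely linear downstream network (a sketch with illustrative parameters, not the paper's code), the splicing model's characteristics can be propagated with a single matrix exponential rather than a numerical ODE solve:

```python
# Minimal sketch of the D = 0 special case: the characteristics dU/ds = C U are linear,
# so U(t) = expm(C t) u can be computed by eigendecomposition / matrix exponential.
# The matrix below encodes the splicing model (nascent -> mature -> degradation).
import numpy as np
from scipy.linalg import expm

beta, gamma, t = 2.0, 1.0, 1.5                  # illustrative rates and time
C = np.array([[-beta, beta],
              [0.0, -gamma]])                   # dU_N/ds = beta (U_M - U_N), dU_M/ds = -gamma U_M

u0 = np.array([0.1 + 0.2j, -0.05 + 0.1j])       # an arbitrary spectral-domain point (u_N, u_M)
U_t = expm(C * t) @ u0                          # characteristic evaluated at s = t
print(U_t)
```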

Although this formulation nominally describes a single gene, it may be exploited to represent multi-gene systems. Conceptually, this strategy entails constructing a model where the transcription of multiple species is controlled by a common regulator. We discuss potential candidate models in Section 6.4; these models instantiate hypotheses to produce ℳ and U that represent co-regulation.

To explain the observation of transient processes, such as the simultaneous capture of progenitor and descendant cells from a differentiation process, we take inspiration from previous work in sequencing86 as well as chemical reactor modeling105,106, and extend the theoretical framework originally proposed in our recent RNA velocity analysis19. In brief, the simplest model that accounts for such desynchronization proposes that cells enter a tissue, receive a signal that triggers time-dependent changes in transcriptional rates, and leave at some later point. Sequencing is the observation of cells within the tissue; to find the distribution of RNA counts, we need to condition on the distribution of times since receiving the signal.

As we discuss in Section 6.5, this latter distribution is not arbitrary, and reflects the kinetics of cell entry and exit. In the parlance of chemical reaction engineering, the times are drawn from f(t), the internal-age distribution induced by those kinetics105,106. This model affords a particularly simple representation of the generating function:

$$G = \int_t \sum_s G_s(t)\,f(t)\,dt, \tag{32}$$

where we marginalize over the gene state, which is typically not observable. Conveniently, this model possesses time symmetry: even though the cells within the tissue are all out of equilibrium, the tissue as a whole is at steady state.

We consider the technical noise phenomena shown in Figure 1b, i.e., the encapsulation of cells and background debris into droplets, as well as the stochasticity in cDNA library construction and sequencing. Under the assumption of independent encapsulation, the generating function of molecule count distributions on a per-droplet level takes the following form:

$$G_{\text{tot}} = G_{\text{enc}}(G)\,G_{\text{bg}}(G), \tag{33}$$

where G_enc is the generating function of the number of cells per droplet, whereas G_bg is the generating function of the number of background molecules per droplet, which depends on the entire cell population (Section 6.6). Finally, to represent sequencing variability and uncertainty, we evaluate the generating function at a set of transformed coordinates:

$$G_{\text{tot,ta}} = G_{\text{tot}}\big(G_t^*(G_a^*(u))\big), \tag{34}$$

where Gt* reflects the distribution of cDNA produced per molecule of RNA (e.g., Bernoulli, as in Tang et al.107,108), whereas Ga* reflects the distribution of ambiguous sequenced fragments, which depends on transformed variables u (Section 6.7 and supplemental information).

The full generating function of the molecule distribution is given by the composition and integration of the model components, as shown in Figure 1c. To evaluate this generating function, it is necessary to specify all components that make up the model. In the analysis below, we take advantage of the modularity of the system definition to investigate four kinds of modeling choices, their statistical implications, and their compatibility with sequencing data. Specifically, we treat the subsystems illustrated in Figure 1d: background noise in single droplets, stochastic transcription rate models, sampling from a transient process, and variation in technical noise.

4.2. Empty droplets

One of the first steps in scRNA-seq data analysis is cell quality control, which excludes cell barcodes that appear to originate from empty droplets from further analysis57. For computational tractability, this procedure typically relies on “hard” assignment, such that barcodes associated with a total molecule count above some threshold are treated as cells, whereas barcodes below the threshold are treated as empty droplets. Threshold selection is necessary because even “empty” droplets contain ambient RNA. This ambient RNA, which appears to originate from cells lysed in the preparation process, contaminates empty and cell-containing droplets alike57.

The observation of ambient RNA resulting in unwanted molecule counts has led to the development of statistical methods for removing this source of noise, either by estimating and subtracting it109 or incorporating it into a stochastic model110112. Conceptually, Equation 33 reflects the latter approach: each droplet contains one or more cells, each with biological generating function G, and background, with a generating function Gbg that depends on G. To accurately model the background counts, we need to propose and justify a specific functional form for Gbg. Thus, under the assumption that empty and cell-containing droplets are similarly susceptible to contamination, the former provide a reasonable estimate of ambient distributions in the latter109.

The simplest model holds Gbg to be equivalent to a “pseudobulk” experiment, with molecules randomly sampled from the lysed cell population. If each cell is equally likely to contribute to the pool of free RNA, and diffusion occurs by a simple independent arrival process, we find that the distribution of background should be Poisson, with the mean for each species proportional to its mean in the original cell population, as in, e.g., Fleming et al.110 This functional form immediately induces a set of testable predictions: not only are the distributions Poisson, but they are independent Poisson, with no meaningful statistical structure remaining between transcripts of a single gene, as well as between different genes, as illustrated in Fig. 2a.

Figure 2.

The pseudo-bulk model of background noise is quantitatively consistent with counts from the pbmc_1k_v3 dataset.

a. The simplest explanatory model for background noise invokes the lysis of cells (green), which creates a pool of RNA that reflects the overall transcriptome composition but retains none of the cell-level information. If the loose RNA molecules diffuse into droplets (blue) according to a memoryless and independent arrival process, the resulting background distribution (purple: higher probability mass; white: lower probability mass) observed in empty droplets should be a series of mutually independent Poisson distributions, with the mean controlled by the composition in non-empty droplets.

b. The mature transcriptome in empty droplets has a mean-variance relationship near identity (gray points, n = 12,298), consistent with Poisson statistics (blue line); the non-empty droplets demonstrate considerable overdispersion (red points, n = 17,393).

c. The mature and nascent transcripts in empty droplets have sample correlation coefficients ρ near zero, consistent with distributional independence (gray histogram, n = 9,362); the non-empty droplets demonstrate nontrivial statistical relationships (red histogram, n = 14,365).

d. The mature transcripts of different genes in empty droplets have sample correlation coefficients ρ near zero, consistent with distributional independence (gray histogram, n = 75,614,253); the non-empty droplets demonstrate nontrivial statistical relationships (red histogram, n = 151,249,528).

e. When both are nonzero, the mature count mean in empty droplets is highly correlated with the mean in the non-empty droplets, consistent with the pseudo-bulk interpretation (black points, n = 12,107; dashed line: identity).

To characterize the accuracy of these predictions, we inspected six datasets (Table S2) pseudoaligned with kallisto | bustools113, and compared the data for barcodes passing bustools quality control to data for barcodes which were filtered out. As a shorthand, we call the former “non-empty” and the latter “empty” droplets, keeping in mind that this identification is approximate. We fully describe the analysis procedure in Section 6.8.2, illustrate the results for the human blood dataset pbmc_1k_v3, and display the results for all datasets in supplemental information.

As shown in Figure 2b, data from non-empty droplets are substantially overdispersed relative to Poisson, whereas data from empty droplets are largely consistent with the Poisson identity mean–variance relationship. However, a small number of relatively high-expression genes are overdispersed. In addition, intra-gene (Figure 2c) and inter-gene (Figure 2d) correlations are typically nontrivial in non-empty droplets, but consistently near zero for empty droplets, supporting distributional independence of the background counts. Finally, the mean expression in empty droplets is highly correlated with mean expression in non-empty droplets, albeit lowered by approximately four orders of magnitude (Figure 2e), supporting the assumption that the original cells are lysed in a uniform fashion.
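The per-gene checks summarized in Figure 2b–d are simple to compute from a count matrix; the sketch below (simulated stand-in data and hypothetical variable names, not the paper's analysis code) illustrates the Fano-factor and correlation calculations for droplets classified as empty.

```python
# Minimal sketch of the empty-droplet checks: under the pseudo-bulk model, per-gene
# Fano factors should cluster near 1 (Poisson) and pairwise gene-gene correlations near 0.
# `empty_counts` stands in for a droplets-by-genes UMI count matrix from empty droplets.
import numpy as np

rng = np.random.default_rng(0)
empty_counts = rng.poisson(0.05, size=(5000, 200)).astype(float)   # simulated stand-in data

means = empty_counts.mean(axis=0)
variances = empty_counts.var(axis=0)
fano = np.divide(variances, means, out=np.ones_like(means), where=means > 0)

corr = np.corrcoef(empty_counts.T)                 # gene-gene correlation matrix
offdiag = corr[np.triu_indices_from(corr, k=1)]

print(f"median Fano factor: {np.median(fano):.2f}")
print(f"median |gene-gene correlation|: {np.median(np.abs(offdiag)):.3f}")
```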

To characterize the deviations from the pseudo-bulk model, we identified the genes that demonstrated overdispersion in empty droplets (Table S3). A considerable fraction of these genes were associated with mitochondria or blood cells. For example, of the 21 annotated genes overdispersed in the empty droplets of the mouse neuron dataset neuron_1k_v3, nine were mitochondrial (mt-Nd1, mt-Nd2, mt-Co1, mt-Co2, mt-Atp6, mt-Co3, mt-Nd3, mt-Nd4, and mt-Cytb), three coded for hemoglobin subunits (Hba-a1, Hba-a2, and Hbb-bs), and two coded for blood cell-specific proteins (Bsg, Vwf)114,115. On the other hand, of the 10 annotated genes overdispersed in the empty droplets of the desai_dmso dataset, generated from cultured mouse embryonic stem cells116, six (mt-Nd1, mt-Co2, mt-Atp6, mt-Co3, mt-Nd4, mt-Cytb) were mitochondrial and none were blood cell-specific114 (Table S4).

Since overdispersion implies that contamination involves non-independent encapsulation of these molecules, the results suggest that the cell-free debris contain, among other structures, entire mitochondria or erythrocytes, when they are present in the source tissue. These membrane-bound structures may diffuse into droplets, then lyse and release all of their contents at once. In other words, empty droplets do not merely have disproportionally high mitochondrial content, as has been noted previously110,117,118; they have nontrivially distributed mitochondrial content, which can hint at the mechanism of its incorporation, and improve interpretation where simple thresholds may be misleading118. We hypothesize that cases where the model fails can be leveraged to discover more complicated forms of contamination, such as molecular aggregates112.

In addition, we examined the total UMI counts in empty droplets, which should be Poisson (Fano = 1) if each individual gene’s distribution is Poisson. For the human blood dataset demonstrated in Figure 2, the empty droplets had fairly significant overdispersion (Fano ≈ 43), which decreased, but did not disappear (Fano ≈ 7.6), once the 53 significantly overdispersed genes were excluded. This result suggests that, although the pseudo-bulk model is approximately valid, some residual variance, possibly due to variability in per-droplet capture rates, is present and needs to be modeled to fully describe the stochasticity in single-cell datasets.

4.3. Noise-corrupted candidate models of transcriptional variation

A considerable fraction of the variability in single-cell datasets arises from cell-to-cell and time-dependent variation in the transcription rates. These sources of variation control distribution shapes. By carefully analyzing candidate models, we can characterize the prospects for model selection: for example, if different models produce nearly identical distributions, selection is impossible and the choice of model is somewhat arbitrary. More interestingly, such analysis can guide the design of experiments: models may be indistinguishable based on some kinds of data, but not others20. This perspective has guided the interest in characterizing noise behaviors74,119: distributions provide strictly more information than averages, and allow us to distinguish between regulatory mechanisms. Similarly, multivariate distributions provide more information than marginal distributions. Obtaining different data (multiple molecular modalities) is qualitatively more useful than obtaining more data (a larger number of cells) or better data (observations less corrupted by noise).

We illustrate this key point using the simple model system depicted in Figure 3a, which features intrinsic, extrinsic, and technical noise. The continuous stochastic process denoted by K drives the rate of transcription of nascent RNA. We consider three different possibilities for K: the gamma Ornstein–Uhlenbeck process, which models DNA winding and relaxation; the Cox–Ingersoll–Ross process, which models the fluctuations in a high-copy number activator20; and the telegraph process, which models variation due to random exposure of the locus to transcriptional initiation76,97,100. All three transcription rate models are described by three parameters20,100. After a Markovian delay, nascent RNA are converted to mature RNA; after another Markovian delay, the mature RNA are degraded. When the system reaches steady state, it is sequenced; each biological molecule has a probability p of being observed in the final dataset. We seek to use imperfect count data to fit parameters and distinguish models. We fully describe the procedures in Section 6.8.3.

Figure 3.

The stochastic analysis of biological and technical phenomena facilitates the identification and inference of transcriptional models.

a. A minimal model that accounts for intrinsic (single-molecule), extrinsic (cell-to-cell), and technical (experimental) variability: one of three time-varying transcriptional processes K generates molecules, which are spliced with rate β, degraded with rate γ, and observed with probability p. Given a set of observations, we can use statistics to narrow down the range of consistent models.

b. Given a particular model, parameter regimes indistinguishable using a single modality become distinguishable with two. The mixture-like and burst-like regimes both produce negative binomial marginal distributions, but have different correlation structures (Left: data likelihoods over the parameter space, computed from 200 simulated cells; Γ-OU ground truth; red point: true parameter set in the mixture-like regime; color: log-likelihood of data, yellow is higher, 90th percentile marked with magenta hatching; blue: an illustrative parameter set in a burst-like parameter regime with a similar nascent marginal but drastically different joint structure. Right: nascent marginal and joint distributions at the points indicated on the left. Nascent distributions nearly overlap).

c. Given a location in parameter space, models are easier to distinguish using multiple modalities. However, the performance varies widely based on the location in parameter space and the specific candidate models: for example, the telegraph model has a well-distinguishable bimodal limit when the process autocorrelation is slower than RNA dynamics. In addition, all else held equal, drop-out noise effectively decreases the noise intensity, lowering identifiability (Left: Γ-OU Akaike weights under Γ-OU ground truth, average of n = 50 replicates using 200 simulated cells; color: Akaike weight of correct model, yellow is higher, regions with weight < 0.5 marked with black hatching; large circles: illustrative parameter sets; smaller circles: distributions obtained by applying p = 50%, 75%, and 85% dropout to illustrative parameter sets while keeping the averages constant. Right: the three candidate models’ nascent marginal distributions at the large points indicated on the left).

Even if we have perfect information about the true averages of the transcriptional strength and the molecular species, the systems can exhibit a wide variety of distribution shapes and statistical behaviors. This variety can be summarized by a two-dimensional parameter space, which was introduced in Fig. 2 of Gorin and Vastola et al.20 The “timescale separation” governs the relative timescales of the transcriptional and molecular processes; if it is high, the transcriptional process is faster than RNA turnover. The “noise intensity” governs the variability in the transcriptional process: if it is high, the process exhibits substantial variability that translates to overdispersion in the RNA distributions. The bottom edge of this parameter space produces Poisson distributions of RNA, the top left corner produces Poisson mixtures of the law of K, and the top right corner yields bursty dynamics that do not typically have simple analytical solutions20.

Although these regimes reflect very different transcriptional kinetics, they can produce indistinguishable distributions. The first panel of Figure 3b demonstrates the likelihood landscape of a dataset generated from the gamma Ornstein–Uhlenbeck (Γ-OU) transcriptional model, evaluated using the nascent marginal and p=1. The mixture-like true parameters are indicated by a red point and the top decile of likelihoods is indicated by hatching. The Γ-OU model’s transcription rate has a gamma stationary distribution, which produces approximately Poisson-gamma, or negative binomial, RNA marginals in this regime. However, the bursty regime, indicated by a blue point, also yields a negative binomial-like marginal20, preventing us from identifying the kinetics.

On the other hand, if we evaluate likelihoods using the entire two-species dataset, we obtain the landscape in the second panel of Figure 3b: the symmetry is broken, and the parameters can be localized to the mixture-like regime. The source of this improved performance is evident from examining the distributions, shown in the third and fourth panels of Figure 3b. The nascent marginals are essentially identical; no amount of purely nascent count data can distinguish between them. However, the bivariate distributions show subtle differences, such as higher nascent/mature correlations in the true regime, which can be used for inference. This approach is analogous to Fig. 4b of Gorin et al.21, where bivariate data are used to disambiguate differences which would otherwise be indistinguishable due to the degeneracies of steady-state distributions.

In addition, the timescale separation and noise intensity determine the model distinguishability. To quantify this, we use the Akaike weight wϖ, which transforms log-likelihood differences into model probabilities120. For example, if the Akaike weight is near 1/3, the models are indistinguishable; if the correct model’s weight is near 1, we can confidently identify the model from the data. The first panel of Figure 3c demonstrates the average Akaike weight landscape of datasets generated from the Γ-OU model, computed using the nascent distribution at the same coordinate. We indicate the region wϖ<1/2 by hatching. As the Akaike weight may be interpreted as a posterior model probability120, this somewhat arbitrary threshold gives even odds for choosing the correct model, on average.
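For concreteness, the following sketch (with made-up log-likelihood values) shows how maximized log-likelihoods for the three candidate transcription-rate models are converted into Akaike weights; since all three models here have three parameters, the AIC differences reduce to log-likelihood differences.

```python
# Minimal sketch: Akaike weights from maximized log-likelihoods (illustrative values).
import numpy as np

log_liks = np.array([-1520.3, -1522.1, -1527.8])   # e.g., Gamma-OU, CIR, telegraph fits
n_params = np.array([3, 3, 3])                     # all three candidates have three parameters

aic = 2 * n_params - 2 * log_liks
delta = aic - aic.min()
weights = np.exp(-0.5 * delta)
weights /= weights.sum()                           # weights sum to 1; the largest indicates the best-supported model

print(dict(zip(["Gamma-OU", "CIR", "telegraph"], np.round(weights, 3))))
```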

The intermediate regime, indicated by a large olive green point, tends to yield fairly high Akaike weights, consistent with the two-model case explored in Fig. 3a of Gorin and Vastola et al.20 On the other hand, the burst-like regime, indicated by a large pink point, provides considerably less ability to distinguish the models. As expected, the situation improves somewhat when using bivariate data (second panel of Figure 3c): the Akaike weights increase throughout the parameter space, and the bursty regime data move closer to even odds for model selection.

To illustrate the source of the identifiability challenges, we plot the nascent marginals of the models at the two points. In the intermediate regime, the Γ-OU and CIR models yield moderately different distributions, whereas the telegraph model is immediately distinguishable by its bimodality (third panel of Figure 3c). In contrast, in the bursty regime, the distributions are all unimodal and less identifiable (fourth panel of Figure 3c); the Γ-OU and telegraph marginals are particularly similar, as they converge to the same negative binomial limit20.

Interestingly, this formulation fully characterizes the effect of certain forms of technical noise. If the transcriptional and observed molecular averages are fixed, but the experiment fails to capture some molecules, the distributions are identical to those obtained by deflating the transcriptional noise intensity. In other words, even though technical noise affects the molecules, its theoretical effects are indistinguishable from decreasing the variability of the transcriptional process. As the noise levels increase, the RNA distributions are pushed toward the indistinguishable Poisson limit at the bottom edge of the reduced parameter space. We quantify how rapidly the information degrades by plotting smaller circles on the first and second panels of Figure 3c to indicate the effect of 50%, 75%, and 85% dropout, in that order from top to bottom.

4.4. Distributions obtained from a transient process

Due to the interest in understanding developmental processes, the characterization of transient process dynamics is a key problem in single-cell analyses. The use of mechanistic models with multimodal data, which we emphasize here, was originally pioneered in the context of the RNA velocity framework, which attempts to exploit the causal relationship between nascent and mature RNA to fit transient processes86. However, the implementations proposed so far use relatively simple noise behaviors59,86,121, which do not recapitulate the bursty transcription observed in living cells. As discussed in our recent analysis of RNA velocity methods19, this leads us to hold some reservations about the robustness and appropriate interpretation of results obtained by this class of methods.

The inference of transient dynamics from snapshot data is a formidable problem due to a combination of theoretical and practical factors. Most fundamentally, it is not precisely clear what a snapshot is: how does a single measurement simultaneously capture the early and late states in a differentiation process? To develop an explanatory model, we take inspiration from the existing work on cyclostationary processes122,123, cell cycle ensemble measurement modeling124–126, Markov chain occupation measure theory127–129, and chemical reactor engineering105,106. In the typical stochastic modeling context, we fit count data using stationary distributions P(x), obtained as the limit of P(x,t) as t → ∞. By the ergodic theorem130–132, this distribution, when it exists, coincides with the occupation measure lim_{T→∞} (1/T) ∫₀ᵀ P(x,t) dt, i.e., observations drawn from a single trajectory over a sufficiently long time horizon, rather than from multiple trajectories at once. Conveniently, the ergodic limit has time symmetry with respect to measurement: the distribution does not depend on the timing of the experiment. In the transient case, we cannot take these limits. However, we can retain time symmetry by proposing that the experiment samples cells at almost surely finite times t since the beginning of the process. Therefore, we conceptualize data as coming from a set of cells indexed by c, such that each cell’s time t_c is sampled from f(t), and counts are drawn from some distribution P(x,t_c), which is not typically available in closed form. This formulation yields Equation 32, which requires specifying the distribution f.

We illustrate some of the challenges and implications using the model system shown at the bottom of Figure 4a. The underlying transient structure involves transitions through three cell types, each characterized by a particular transcriptional burst size. The transient transcription process produces nascent and mature RNA trajectories for each cell; however, we only obtain a single data point per trajectory. Even if we have perfect information about the cell times, it is far from clear that we can accurately reconstruct the transcriptional dynamics from snapshot data (center of Figure 4a).

Figure 4.

Given ordered and labeled snapshot data obtained from a transient differentiation process, we can typically fit the copy number data, but identifying the mechanism of the snapshot is more challenging.

a. A minimal model that accounts for the observation of transient differentiation processes in scRNA-seq: cells enter a “reactor” and receive a signal to begin transitioning from cell type A through B and to C. The change in cell type is accompanied by a step change in the burst size, which leads to variation in the nascent and mature RNA copy numbers over time. Given information about the cell type abundances and the cells’ time along the process, we may fit a dynamic process to snapshot data and attempt to identify the underlying reactor type, which determines the probability of observing a cell at a particular time since the beginning of the process.

b. In spite of the considerable differences between the reactor architectures, they produce nearly identical molecular count marginals (histogram: data simulated from the Dirac model, 200 cells; colored lines: analytical distributions at the maximum likelihood transcriptional parameter fits for each of the three reactor models. Analytical distributions nearly overlap).

c. The true reactor model may be identified from molecule count data, but statistical performance is typically poor (points: Akaike weight values for n = 50 independent rounds of simulation and inference under a single set of parameters; blue markers and vertical lines: mean and standard deviation at each number of cells; blue line connects markers to summarize the trends; red lines: the Akaike weight values 1/3, which contains no information for model selection, and 1/2, which gives even odds for the correct model; two-species data generated from the Dirac model; uniform horizontal jitter added).

d. The reactor models are poorly identifiable across a range of parameters, and rarely produce Akaike weights above 1/2 (histogram: Akaike weight values for n = 200 independent rounds of parameter generation, simulation, and inference under the true Dirac model; red lines: the Akaike weight values 1/3 and 1/2; two-species data for 200 cells generated from the Dirac model; parameters were restricted to the low-expression regime $\mu + 4\sigma \leq 25$ for both species).

e. The challenges in reactor identification arise because all three models produce similar likelihoods (histograms: likelihood differences between candidate models and the true Dirac model for n = 200 independent rounds of parameter generation, simulation, and inference; red line: no likelihood difference; two-species data for 200 cells generated from the Dirac model; parameters were restricted to the low-expression regime $\mu + 4\sigma \leq 25$ for both species).

In addition, we wish to know whether we can identify the mechanism of the snapshot collection. We can imagine cells entering and exiting the observed tissue in multiple ways, which correspond to different choices of f(t). Some natural choices are uniform, which implies the cells stay in the tissue for a deterministic time86; decreasing over time, so cells can exit immediately; or uniform, then decreasing, so cells must stay in the tissue for some duration but are free to leave afterward. These choices can be modeled by Dirac, exponential, and Pareto residence distributions. In the parlance of chemical reactor engineering, these configurations are known as the plug flow reactor, the continuously-stirred tank reactor, and the laminar flow reactor, respectively. Their f(t), which are the reactor internal-age distributions, are well-known in the chemical engineering literature105,106, and shown at the top of Figure 4a. It is not a priori obvious the configurations are mutually distinguishable from count data. If they are not, the choice of f(t) is immaterial for inference.

We generated snapshot data from the Dirac model and fit it under all three models. To efficiently evaluate snapshot distributions, we designed an algorithm that essentially "recycles" the sampled times $t_c$ for trapezoidal quadrature. The method is fully described in Section 6.8.4. As shown in Figure 4b, despite only having access to a single observation per time point, all models yield results visually close to the true marginals. However, despite these superficial similarities, quantitative model identification is possible: for the simulated dataset shown, the true Dirac model achieves an Akaike weight of $w \approx 79\%$, whereas the exponential and Pareto models both achieve $\approx 10\%$. Decreasing the dataset size substantially degrades the identifiability (Figure 4c). Even at higher sizes, the spread is considerable; for example, a 150-cell dataset gives approximately even odds ($w > 1/2$) on average, but individual realizations vary from confidently correct ($w \approx 1$) to confidently wrong ($w \approx 0$).
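As an illustration of the mixing step that underlies this procedure, the snapshot distribution is simply the transient distribution averaged against the internal-age distribution $f$. Below is a minimal Python sketch, assuming a generic transient PMF solver; the function names are placeholders rather than the Section 6.8.4 implementation, which additionally reuses the sampled $t_c$ values.

```python
import numpy as np
from scipy import stats

def snapshot_pmf(pmf_at_time, f, t_grid):
    """Approximate the snapshot PMF, the integral of P(x, t) f(t) dt, by trapezoidal quadrature.

    pmf_at_time: callable returning the transient PMF array P(., t) at time t
                 (placeholder for a CME or generating-function solver).
    f:           callable internal-age density, evaluated on the same grid.
    t_grid:      1D array of quadrature nodes spanning the support of f.
    """
    weights = f(t_grid)
    pmf_slices = np.stack([pmf_at_time(t) for t in t_grid])        # shape (T, n_states)
    mixed = np.trapz(pmf_slices * weights[:, None], t_grid, axis=0)
    return mixed / mixed.sum()            # renormalize to absorb quadrature error

# Toy example: a Poisson law whose mean ramps up along the process, observed
# through a uniform (plug-flow-like) internal-age distribution on [0, 10].
x = np.arange(50)
pmf = snapshot_pmf(lambda t: stats.poisson.pmf(x, mu=1.0 + 2.0 * t),
                   lambda t: np.full_like(t, 1.0 / 10.0),
                   np.linspace(0.0, 10.0, 201))
```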

To understand the robustness of model identifiability, we generated 200 synthetic datasets at random parameter values, constrained to have fairly low expression. We observed poor identifiability, with even or better odds for the correct model in only 20% of the cases (Figure 4d). This performance appears to be attributable to quantitative similarities between all three models’ likelihoods. As shown in Figure 4e, given data of this quality, we cannot even narrow the scope down to two models, as neither of the candidate models performs conspicuously worse than the true Dirac configuration. Therefore, it is possible to fit snapshot data approximately equally well using a variety of models; candidates for f(t) are identifiable in principle, but challenging to distinguish from any particular dataset. This simulated analysis implies that the details of the reactor configuration may not matter much, providing a basis for omitting this model identification problem for real data.

4.5. Variability in library construction

To properly interpret single-cell data, we need to exhibit caution regarding the technical noise behaviors and consider multiple possible candidate models. However, before fitting distributions, we must fully characterize the models and understand which of their parameters are actually identifiable with the data at hand. For example, the two-species models explored in Section 4.3 produce distributional forms that are closed under the assumption $p_N = p_M = p$, i.e., the magnitude of the observation probability $p$ is impossible to identify from count data alone. Interestingly, when $p_N \neq p_M$ (that is, when nascent and mature RNA may have different observation probabilities), what we can learn about technical noise heavily depends on the form of the biological noise. For example, under slow transcriptional variation (as in the mixture and Poisson limits of the models explored in Section 4.3), the RNA distributions contain no identifiable information whatsoever about the technical noise, regardless of the amount of data. On the other hand, if transcription is bursty, the distributions depend on the ratio of $p_N$ and $p_M$, but not their absolute values (Section 6.8.5). This theoretical result calls for further investigation: how much information can we obtain in practice, given finite data?

To understand the prospects for distinguishing parameters, we consider the simple model system shown in Figure 5a, which involves bursty transcription with average burst size $b$, splicing, degradation, and molecular capture with species-specific probabilities. To characterize how much information about $p_M/p_N$ we can identify from count data, we simulated 200 datasets at the ratio values 1/4, 1, and 4, and calculated their likelihoods over $(10^{-2}, 10^{2})$. We repeated this analysis using synthetic datasets with 20, 50, 100, and 200 cells, and plotted the average of the posterior distributions for each condition. As shown in Figure 5b, color-coded by the ground truth $p_M/p_N$ and intensity-coded by the number of cells, the posteriors are, on average, consistent with the true value. However, even with perfect information about the averages and the nascent RNA distribution, the uncertainty is considerable; at larger dataset sizes, we can typically localize the ratio to an order of magnitude, but not much further.
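The grid-based profiling used here can be sketched compactly: evaluate the joint likelihood on a log-spaced grid of ratio values, normalize under a flat prior on the log-ratio, and average the resulting posteriors over replicate datasets. In the sketch below, the joint nascent/mature likelihood is left abstract, and the names loglik and datasets are placeholders rather than parts of the accompanying code.

```python
import numpy as np

def ratio_posterior(loglik, data, ratio_grid):
    """Normalize a gridded likelihood profile for a single parameter (e.g., p_M / p_N).

    loglik: callable (data, ratio) -> total log-likelihood over the dataset
            (placeholder for the nascent/mature joint likelihood).
    """
    ll = np.array([loglik(data, r) for r in ratio_grid])
    ll -= ll.max()                                       # avoid overflow in exp
    post = np.exp(ll)
    return post / np.trapz(post, np.log(ratio_grid))     # flat prior on log(ratio)

ratio_grid = np.geomspace(1e-2, 1e2, 201)
# average_posterior = np.mean(
#     [ratio_posterior(loglik, d, ratio_grid) for d in datasets], axis=0)
```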

Figure 5.


Technical noise models may be identified from count data, either by direct application of statistics or by imposing informal priors about the biological variability.

a. A minimal model that accounts for non-homogeneous noise: transcriptional events occur with frequency $\alpha$, generating geometrically-distributed bursts $B$ with mean size $b$; the molecules are spliced with rate $\beta$ and degraded with rate $\gamma$. Nascent molecules are observed with probability $p_N$ and mature molecules are observed with probability $p_M$.

b. Given information about the nascent distribution and the mature mean, it is possible to use joint distributions to obtain information about the ratio of observation probabilities (curves: average posterior likelihoods, computed from 200 independent synthetic datasets; color: true value of pM/pN, blue: 1/4, red: 1, purple: 4; dashed lines: location of each true value; color intensity: from lightest to darkest, synthetic datasets with 20, 50, 100, and 200 cells).

c. Two models considered in Gorin et al.21: the species-independent bias model for length dependence in averages, which proposes that nascent and mature RNA are sampled with equal probabilities, and the species-dependent bias model, which proposes that the nascent RNA sampling rate scales with length (top, gold: kinetics of species-independent model; bottom, blue: kinetics of species-dependent model; center, green: the source RNA molecules used to template cDNA).

d. A variety of single-cell datasets produce consistent and counterintuitive length-dependent trends in nascent RNA observations (lines: average per-species gene expression, binned by gene length; red: nascent RNA observations; gray: mature RNA statistics; data for 2,500 genes analyzed in Gorin et al.21).

e. Fits to the species-independent model show a strong positive gene length dependence for inferred burst sizes, whereas fits to the species-dependent model show a modest negative gene length dependence, which is more coherent with orthogonal data (lines: average per-gene burst size inferred by Monod161, binned by gene length; gold: results for species-independent model; blue: results for species-dependent model; data for genes analyzed in Gorin et al.21 after goodness-of-fit).

Given the statistical challenges illustrated by simulations, we speculate that it may be more fruitful to use prior information about biology and physical intuition about sequencing to construct technical noise models. For example, in a recent paper21, we fit models that represent two competing hypotheses (Figure 5c). The first has identical, gene-specific observation probabilities p for the nascent and mature species. In this model, the inferred burst size is bp, as these two parameters are not mutually identifiable. The second has a gene length-dependent technical noise term for the nascent species, which coarsely represents a higher rate of priming for long molecules with abundant intronic poly(A) tracts, and a shared genome-wide term for the mature species, which represents priming at the poly(A) tail. In this model, the inferred burst size is b.

These models attempt to explain the trend summarized in Figure 5d: across a wide range of datasets, nascent RNA averages exhibit a pronounced length dependence133 not evident in mature RNA134. The first model explains the trend by a species-independent bias, as b and p control nascent as well as mature RNA levels. Conversely, the second model explains it by a species-dependent bias. Both models produce fair fits to the data (as demonstrated, e.g., by the low rate of rejection by goodness-of-fit in Sections S7.4 and S7.5.2 of Gorin et al.21).

However, the trends in the resulting inferred parameters are strikingly different: the species-independent bias model predicts that longer genes have higher bp. Ascribing this trend to the b term—longer genes have higher burst sizes—contradicts burst size trends from fluorescence microscopy135. Ascribing it to the p term—longer genes have higher sampling probabilities—is physically unrealistic, because mature RNA molecules are depleted of the internal poly(A) tracts necessary for priming136. On the other hand, the species-dependent model predicts a modest negative relationship between length and burst size, which is more coherent with orthogonal data.

This technical noise model is a relatively simplistic low-order approximation, since all genes have the same mature molecule capture rate λM and length scaling CN. Nevertheless, it foregrounds a key modeling principle of the investigation: in the absence of prior information, biological parameters need to be fit on a gene-by-gene basis, but technical noise should be constructed using a common genome-wide model that varies in a mechanistic rather than arbitrary way. In sum, the mathematics enable us to define and fit systems, but to understand whether the fits are sensible, we need to contextualize and compare them with previous results and physical intuition.

5. DISCUSSION

The results we have derived provide a blueprint for the holistic modeling of single-cell biology and sequencing experiments. First, we have outlined a generic mathematical framework for treating stochasticity in living cells. By exploiting the generating function representation, we reduce discrete, continuous, and mixed reactions to operators in a system of differential equations. These ODEs can be straightforwardly solved via numerical integration to compute model properties, including likelihoods. This approach recapitulates and subsumes a wide range of previous results16,17,20,21,75,76,85,100,137,138.

By treating the discrete and continuous degrees of freedom on equal footing, our approach makes certain otherwise challenging problems straightforward to solve, as illustrated in Section 6.8.1. By making simplifying assumptions—chiefly, the assumption of independent and identically distributed sampling—we reduce the modeling of technical variation to the composition of generating functions. Our framework may be used in its current form, or as a substrate for developing more sophisticated models of transcriptional regulation and sequencing that subsume it in turn. This process simply involves instantiating hypotheses, converting them into probabilistic models, and constructing model solutions using a procedure analogous to the one presented in Figure 1c.

We believe this framework comprises a productive vision for the interpretation of large datasets, but many technological and mathematical challenges remain. For example, the library construction biases are dependent on molecule-specific factors that we do not yet fully understand, because their effect is heavily convolved with biological variability. In Figure 5, we considered two extreme cases, where the noise strength/length scaling is either unconstrained or forced to be identical for all genes. We anticipate that careful investigation of technical biases will be necessary to construct models that constrain the technical biases based on RNA chemistry, while allowing for gene-to-gene and droplet-to-droplet variability.

In Section 6.7 and supplemental information, we discuss the challenges associated with modeling ambiguous species, motivated by the limitations of short-read sequencing for distinguishing between spliced and unspliced forms of the same RNA gene product139. It is worth noting that even the spliced/unspliced binary is a convenient simplification primarily adopted because of data availability86,113; we stress that a truly comprehensive treatment requires defining intermediate states19, their relationships, and their mutually indistinguishable classes. These computational foundations do not yet exist, although we have attempted a partial solution in recent work16 and outlined some promising directions in supplemental information. Therefore, despite our immediate interest in bivariate RNA distributions, our framework is designed to generalize to other modalities as they become practical to quantify. In addition, although we focus on Markovian systems here, non-Markovian processing can be represented by appropriately defining U140, which suggests avenues for the treatment of systems with molecular memory141,142.

The full generating function solutions we have outlined here are typically not computable directly. By construction, the generating function needs to be evaluated on a grid; Fourier inversion produces a grid of microstate probabilities, which needs to be quite large to avoid artifacts138. If the grid dimension is $s_i$ for each discrete species $i$, the overall state space size is $s = N\prod_i s_i$. Even in the simplest case, where we only quantify and fit discrete counts, evaluating the probability mass function requires storing and inverting an $n$-dimensional array, which usually has size $s$ far too large to be practical (e.g., Fig. S5b of our prior work on bursty models16).

When applicable, the generating function approach has numerical advantages over the stochastic simulation algorithm (SSA)143–145, which approximates distributions by the empirical distributions of trajectories, and finite state projection (FSP)78, which directly integrates a version of the master equation confined to a finite $s$. Specifically, if we only care about a particular species $i$, we can evaluate its marginal using a grid of size $N s_i$ with $s\log s$ time complexity. In the worst-case scenario, FSP requires a grid of size $s$ with $s^3$ time complexity, as evaluating a particular marginal requires explicitly evaluating the probabilities for the entire grid, then marginalizing. Similarly, SSA requires explicitly simulating the entire system to obtain the marginals, and has the drawback of the usual inverse square root Monte Carlo convergence146,147. In addition, FSP is not compatible with the generating function manipulations used to represent technical noise, SSA is relatively challenging to adapt to time-dependent rates148, and neither FSP nor SSA is readily compatible with continuous stochastic processes (although exact20 and approximate20,149,150 hybrid schema can be constructed with some work). In the future, the "curse of dimensionality"— the reliance on grid evaluation—may be possible to bypass altogether by training neural networks to predict probability distributions, but this approach is still in its infancy151–154 and will require considerable further development to apply to general systems.

Nevertheless, SSA and FSP are substantially more general than the approach we outline here. The simulation- and matrix-based methods only require a list of reactions, whereas the generating function methods also require those reactions to produce readily solvable partial differential equations. We have omitted phenomena which would be trivial to treat using FSP and SSA, such as regulation involving feedback. (In principle, one can always construct “synthetic likelihoods” for inference by fitting a function approximator to the results of stochastic simulations, even for highly nonlinear and chaotic systems155157.) To our knowledge, these phenomena, which are mathematically analogous to adding multi-molecular interaction terms, cannot be directly treated with the method of characteristics. Instead, a mathematically precise treatment of them requires perturbative methods77 or fairly complicated special function manipulations101104, which do not easily generalize. We illustrate the challenges in supplemental information, using the example of downstream species catalyzing gene state transitions.

On the other hand, there are a number of ways to treat systems involving feedback approximately. Approaches like the linear-mapping approximation158 permit the derivation of approximate but accurate generating functions for such systems, which can then be used in standard inference pipelines. Alternatively, using only the results presented here, the net effect of feedback can be captured in the time-dependence of certain parameters (e.g., burst sizes) if dynamics are sufficiently chaotic, or if the time scale of feedback is slow compared to other system time scales.

We have, until now, stressed applications to “snapshot” single-cell data from dissociated tissues; however, our framework may be extended to spatial single-cell data; for instance, we can define transcriptional parameters that depend on the cell’s coordinates in the tissue. In this case, the typical systems biology goals translate to fitting a time- and space-dependent function that governs these parameters. However, the generating function formulation relies on the assumption of cells being stochastically independent; it is far from clear that this should hold for densely sampled spatial data, and more sophisticated alternatives, such as agent-based models, may be needed159,160.

Despite these challenges, the framework is already quantitatively useful. To fully “explain” a dataset, we need to fit gene-specific transcriptional mechanisms, genome-wide technical noise and co-expression parameters, and cell type structure, while controlling for potential misspecification. However, at this time, it may be more fruitful to focus on narrower questions, using assumptions, orthogonal data, or simulated benchmarking to justify omitting some parts of the problem19. We have applied this “bottom-up” approach to single-cell data, considering, in turn, the estimation of transcriptional kinetics and technical noise21,161, the identification of transcriptional models20, the analysis of co-regulation patterns16, and the determination of nuclear transport kinetics140. Conversely, it may be valuable to apply a “top-down” approach, augmenting an existing method with biophysically meaningful noise, as we have proposed in the context of transient processes19 and neural network dimensionality reduction71.

We anticipate that making meaningful progress on the stochastic modeling project championed by Wilkinson will require extended “real contact”162 between systems biology, genomics, and mathematics. The general framework we propose, which unifies a variety of previous work, represents one step towards this synthesis. The role of mathematics here is key; as Wilkinson noted, the stochastic systems biology of single cells cannot be “properly understood” without stochastic mathematical models.

RESOURCE AVAILABILITY

Lead Contact

Further information and requests for resources and reagents should be directed to and will be fulfilled by the lead contact, Lior Pachter (lpachter@caltech.edu).

Materials Availability

This study did not generate new materials.

Data and Code Availability

  • This paper analyzes existing, publicly available data. The accession numbers for the datasets are listed in the key resources table. Pseudoaligned count matrices in the mtx format have been deposited at the Zenodo package 8132976. The data, Monod fits, and analysis scripts used to generate Figure 5d-e, originating from Gorin et al.21, were previously deposited as the Zenodo package 7388133.

  • All original code has been deposited at https://github.com/pachterlab/GVP_2023 and the Zenodo package 8132976, and is publicly available as of the date of publication. DOIs are listed in the key resources table.

  • Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request.

KEY RESOURCES TABLE

REAGENT or RESOURCE | SOURCE | IDENTIFIER
Deposited Data | |
H. sapiens peripheral blood 10x v3 scRNA-seq data | 178 | pbmc_1k_v3
M. musculus heart 10x v3 scRNA-seq data | 179 | heart_1k_v3
M. musculus neuron 10x v3 scRNA-seq data | 180 | neuron_1k_v3
M. musculus cultured embryonic stem cells treated with DMSO 10x v2 scRNA-seq data | Desai et al. | desai_dmso
H. sapiens peripheral blood 10x v2 scRNA-seq data (technical replicate of pbmc_1k_v3) | 181 | pbmc_1k_v2
M. musculus neuron 10x v3 snRNA-seq data | 182 | brain_nuc_5k_v3
Supporting data for GP_2021_3 | Gorin and Pachter | Zenodo: dataset 7388133
Software and Algorithms | |
Python | python.org | 3.9.1
NumPy | numpy.org | 1.22.1
SciPy | scipy.org | 1.7.3
pandas | pandas.pydata.org | 1.2.4
kallisto|bustools | Melsted and Booeshaghi et al. | 0.26.0
Monod | Gorin and Pachter | 2.5.0
Other | |
Count matrices for all datasets | This manuscript | Zenodo: dataset 8132976
Custom analysis notebooks | This manuscript | GitHub: https://github.com/pachterlab/GVP_2023 (version of record deposited at Zenodo: dataset 8132976)

6. METHODS

6.1. Master equation models of transcription

We are interested in continuous-time stochastic processes that combine categorical, nonnegative discrete, and (usually nonnegative) continuous degrees of freedom. To solve these systems, we begin by separately defining their allowed transitions and converting them to master equation forms.

The categorical variable, denoted by $s \in \{1,\dots,N\}$, represents the instantaneous state of a multi-state gene. By assuming that the state interconversions are Markovian and independent of all other components of the system, we can define $H_{ij}$, the rates of transitioning from state $i$ to state $j$:

$$\mathcal{S}_i \xrightarrow{\;H_{ij}\;} \mathcal{S}_j. \tag{35}$$

These rates can be summarized in the state transition matrix $H \in \mathbb{R}^{N\times N}$, such that $H_{ii} = -\sum_{j\neq i} H_{ij}$ and $\sum_j H_{ij} = 0$ to enforce the conservation of probability. This set of transitions can be represented by a master equation involving finitely many ODEs, which tracks the probabilities of each state $s$ at a time $t$:

$$\frac{\partial P(s,t)}{\partial t} = \sum_{i=1}^{N} H_{is}\, P(i,t), \quad\text{or more compactly}\quad \frac{\partial \mathbf{P}(t)}{\partial t} = H^T \mathbf{P}. \tag{36}$$

As this system is expressed in terms of a differential equation for an arbitrary time t, the relation holds for time-dependent H. For simplicity, we assume that H is deterministic.
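For constant $H$, Equation 36 is solved by the matrix exponential. The following minimal sketch propagates a two-state telegraph gene; the rate values are illustrative, not drawn from the text.

```python
import numpy as np
from scipy.linalg import expm

# Two-state (telegraph) gene: off -> on at rate k_on, on -> off at rate k_off.
k_on, k_off = 0.5, 2.0
H = np.array([[-k_on,  k_on],
              [ k_off, -k_off]])        # each row sums to zero (probability conservation)

P0 = np.array([1.0, 0.0])               # start in the "off" state
t = 1.5
Pt = expm(H.T * t) @ P0                 # P(t) = exp(H^T t) P(0) for time-independent H
print(Pt, Pt.sum())                     # state probabilities; the total stays 1
```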

The nonnegative discrete variables, denoted by $\mathbf{x} \in \mathbb{Z}_{\geq 0}^{n}$, represent molecular copy numbers. We assume that $n$ molecular species participate in four classes of transitions, and can summarize their effect by considering their reaction schema and effect on $x_i$, the number of molecules of species $i$:

$$\mathcal{X}_i \xrightarrow{\;c_{ij}\;} \mathcal{X}_j, \qquad \mathcal{X}_i \xrightarrow{\;c_{i0}\;} \varnothing, \qquad \mathcal{X}_i \xrightarrow{\;q_{ij}\;} \mathcal{X}_i + \mathcal{X}_j, \qquad \varnothing \xrightarrow{\;\alpha_\omega\;} B_{i_1}\mathcal{X}_{i_1} + \dots + B_{i_\omega}\mathcal{X}_{i_\omega}. \tag{37}$$

First, species $i$ can be converted to species $j$ with rate $c_{ij} x_i$. Second, species $i$ can spontaneously degrade with rate $c_{i0} x_i$. These classes of monomolecular transitions, which either maintain or reduce the total number of molecules in the system, can be summarized in the matrix $C^{dd} \in \mathbb{R}^{n\times n}$, such that $C^{dd}_{ij} = c_{ji}$ and $C^{dd}_{ii} = -c_{i0} - \sum_{j\neq i} c_{ij}$; $C^{dd}$ is the matrix governing the associated reaction rate equations17,85. Third, species $i$ participates in autocatalysis at the rate $q_{ii}$, or catalysis of species $j$ at the rate $q_{ij}$. These reactions can be summarized by the matrix $Q^d \in \mathbb{R}_{\geq 0}^{n\times n}$, such that $Q^d_{ij} = q_{ji}$. Finally, molecules can be produced. In the general case, a burst of production simultaneously creates molecules of $\omega$ discrete species $\{i_1,\dots,i_\omega\}$. We assume bursts are described by a Poisson arrival process, with burst frequency $\alpha^d_\omega$ and the nontrivial $\omega$-variate joint distribution $p^d_\omega(\mathbf{z})$ of non-negative burst sizes $\{B_{i_1},\dots,B_{i_\omega}\}$16. This formulation includes the trivial case of Poisson point process production of species $i$, for which $\omega = 1$ and $p^d_\omega(\mathbf{z})$ is the degenerate distribution located at unity for species $i$ and zero for all other species.

This mass action model, which tracks molecule counts, can be represented by an equivalent discrete chemical master equation, which tracks the probability of each microstate x:

$$\begin{aligned} \frac{\partial P(\mathbf{x},t)}{\partial t} ={}& \sum_{i=1}^{n} c_{i0}\left[(x_i+1)\,P(x_i+1,t) - x_i\,P(\mathbf{x},t)\right] + \sum_{i,j=1}^{n} c_{ij}\left[(x_i+1)\,P(x_i+1, x_j-1, t) - x_i\,P(\mathbf{x},t)\right] \\ &+ \sum_{i=1}^{n} Q^d_{ii}\left[(x_i-1)\,P(x_i-1,t) - x_i\,P(\mathbf{x},t)\right] + \sum_{i,j=1}^{n} Q^d_{ji}\left[x_i\,P(x_j-1,t) - x_i\,P(\mathbf{x},t)\right] \\ &+ \sum_{\omega} \alpha^d_\omega \left[\sum_{\mathbf{z}} p^d_\omega(\mathbf{z})\,P(\mathbf{x}-\mathbf{z},t) - P(\mathbf{x},t)\right]. \end{aligned} \tag{38}$$

For simplicity of notation, species that do not occur in a reaction are elided from the master equation, as in previous work on modeling bursty transcription16. As above, this equation holds even if the rates are time-dependent. For the purposes of this report, we assume only αω and pω can vary over time.

The nonnegative continuous variables, denoted by $\mathbf{y} \in \mathbb{R}_{\geq 0}^{m}$, represent concentrations or coarsely-modeled noise sources. We assume that these variables are governed by Ornstein–Uhlenbeck-type stochastic differential equations:

$$d\mathbf{y}_t = C^{cc}\mathbf{y}_t\,dt + \mathcal{Q}^c(\mathbf{y}_t)\,d\mathbf{W}_t + \sum_\omega dL_\omega(t), \tag{39}$$

where $\mathbf{y}_t$ is a realization of the process, $\mathbf{W}_t$ is a $w$-dimensional Brownian motion, and $L_\omega$ is a subordinator. The matrix $C^{cc} \in \mathbb{R}^{m\times m}$ sets the mean-reversion terms, whereas the operator $\mathcal{Q}^c(\mathbf{y}_t): \mathbb{R}_{\geq 0}^{m} \to \mathbb{R}_{\geq 0}^{m\times w}$ sets the level of noise. We assume that each $L_\omega$ only includes drift or compound Poisson terms. The drift terms have the form $\alpha^c_i\,dt$ for dimension $i$. To slightly lighten the notation, we can aggregate all drift terms under $\omega = 1,\dots,m$, as $\{\alpha^c_1\,dt, \dots, \alpha^c_m\,dt\}$; some of these entries may be zero. The compound Poisson terms have the form $\sum_{k=0}^{N_\omega(t)} (B_\omega)_k$163, such that $N_\omega(t)$ is a Poisson random variable with mean $\alpha^c_\omega t$ and $(B_\omega)_k$ is a set of independent and identically distributed realizations of the random variable $B_\omega$. This random variable has a nontrivial $\omega$-variate joint density $p^c_\omega(\mathbf{z})$ on $\mathbb{R}_{\geq 0}^{m}$, with the remaining $m-\omega$ dimensions concentrated at zero. We note that this formulation entails a slight abuse of notation, as $\omega$ is used to index over discrete burst processes as well as continuous drift and jump components.

For simplicity, we assume the noise term takes the form of an uncoupled square-root diffusion, such that $w = m$ and $\mathcal{Q}^c(\mathbf{y}_t) = \operatorname{diag}(\boldsymbol{\sigma} \circ \sqrt{\mathbf{y}_t})$. The symbol $\circ$ denotes the elementwise/Hadamard product of two vectors, the square root should be interpreted as elementwise, and all elements of the constant volatility vector $\boldsymbol{\sigma}$ are non-negative. Although this choice of $\mathcal{Q}^c$ is somewhat restrictive, it produces a particularly simple diffusion tensor $\Sigma$:

$$\Sigma(\mathbf{y}) := \tfrac{1}{2}\,\mathcal{Q}^c(\mathbf{y})\,\mathcal{Q}^c(\mathbf{y})^T = \tfrac{1}{2}\operatorname{diag}\!\left(\boldsymbol{\sigma}^2 \circ \mathbf{y}\right), \tag{40}$$

where the square σ2 should be interpreted as elementwise. This formulation can be reframed as a Fokker-Planck equation164, which tracks the probability density of each microstate y:

$$\frac{\partial P}{\partial t} = -\sum_{i,j=1}^{m} C^{cc}_{ji}\,\frac{\partial}{\partial y_j}\left[y_i P\right] + \frac{1}{2}\sum_{i=1}^{m}\sigma_i^2\,\frac{\partial^2}{\partial y_i^2}\left[y_i P\right] - \sum_{i=1}^{m}\alpha^c_i\,\frac{\partial P}{\partial y_i} + \sum_{\omega > m} \alpha^c_\omega\left[\int_{\mathbf{z}} p^c_\omega(\mathbf{z})\,P(\mathbf{y}-\mathbf{z},t)\,d\mathbf{z} - P(\mathbf{y},t)\right]. \tag{41}$$

As above, we assume that only the components of Lω vary in time.

In addition to these discrete- and continuous-only terms, we need to account for these components’ interactions. For example, we may want to represent the production of a discrete species controlled by a continuous variable, e.g., a time-varying transcription rate20:

$$y_i \xrightarrow{\;c_{ij}\;} \mathcal{X}_j. \tag{42}$$

This reaction has the rate $y_i c_{ij}$. This class of reactions can be summarized in the matrix $C^{cd} \in \mathbb{R}_{\geq 0}^{m\times n}$, such that $C^{cd}_{ij} = c_{ji}$. In other words, this class of reactions contributes the following terms to the overall master equation:

$$\sum_{i=1}^{m}\sum_{j=1}^{n} C^{cd}_{ji}\left[y_i\,P(x_j - 1, \mathbf{y}, t) - y_i\,P(\mathbf{x},\mathbf{y},t)\right]. \tag{43}$$

Finally, we may want to represent the production of a continuous species from a discrete one, e.g., the rapid translation of high-abundance protein from low-abundance RNA138. This class of reactions simply adds a term proportional to $C^{dc}\mathbf{x}\,dt$ to the expression for $\mathbf{y}_t$. The matrix $C^{dc} \in \mathbb{R}_{\geq 0}^{m\times n}$ contains the relevant rates, such that $C^{dc}_{ij}$ is the rate of producing the continuous species $i$ from discrete species $j$. Therefore, we append a set of drift-like terms to the Fokker-Planck equation:

$$-\sum_{i=1}^{n}\sum_{j=1}^{m} C^{dc}_{ji}\,x_i\,\frac{\partial P(\mathbf{x},\mathbf{y},t)}{\partial y_j}. \tag{44}$$

To construct the full master equation, we need to define a system of N coupled equations. To do so, we essentially add Equations 36, 38, 41, 43, and 44, replacing all instances of P with P(s,x,y,t). However, to account for differences in transcription between gene states, we allow the ω-associated terms to vary with s. The full master equation is reported below in Equation 45.

6.2. The full master equation

The full master equation for P(s,x,y,t) is:

$$\begin{aligned} \frac{\partial P}{\partial t} ={}& \sum_{i=1}^{N} H_{is}(t)\,P(i,\mathbf{x},\mathbf{y},t) \\ &+ \sum_{i=1}^{n} c_{i0}\left[(x_i+1)\,P(s,x_i+1,\mathbf{y},t) - x_i\,P(s,\mathbf{x},\mathbf{y},t)\right] + \sum_{i,j=1}^{n} c_{ij}\left[(x_i+1)\,P(s,x_i+1,x_j-1,\mathbf{y},t) - x_i\,P(s,\mathbf{x},\mathbf{y},t)\right] \\ &+ \sum_{i=1}^{n} Q^d_{ii}\left[(x_i-1)\,P(s,x_i-1,\mathbf{y},t) - x_i\,P(s,\mathbf{x},\mathbf{y},t)\right] + \sum_{i,j=1}^{n} Q^d_{ji}\left[x_i\,P(s,x_j-1,\mathbf{y},t) - x_i\,P(s,\mathbf{x},\mathbf{y},t)\right] \\ &+ \sum_{\omega} \alpha^d_{s,\omega}(t)\left[\sum_{\mathbf{z}} p^d_{s,\omega}(\mathbf{z},t)\,P(s,\mathbf{x}-\mathbf{z},\mathbf{y},t) - P(s,\mathbf{x},\mathbf{y},t)\right] \\ &- \sum_{i,j=1}^{m} C^{cc}_{ji}\,\frac{\partial}{\partial y_j}\left[y_i\,P(s,\mathbf{x},\mathbf{y},t)\right] + \frac{1}{2}\sum_{i=1}^{m}\sigma_i^2\,\frac{\partial^2}{\partial y_i^2}\left[y_i\,P(s,\mathbf{x},\mathbf{y},t)\right] - \sum_{i=1}^{m}\alpha^c_{s,i}(t)\,\frac{\partial P(s,\mathbf{x},\mathbf{y},t)}{\partial y_i} \\ &+ \sum_{\omega>m}\alpha^c_{s,\omega}(t)\left[\int_{\mathbf{z}} p^c_\omega(\mathbf{z})\,P(\mathbf{y}-\mathbf{z},t)\,d\mathbf{z} - P(\mathbf{y},t)\right] \\ &+ \sum_{i=1}^{m}\sum_{j=1}^{n} C^{cd}_{ji}\left[y_i\,P(x_j-1,\mathbf{y},t) - y_i\,P(\mathbf{x},\mathbf{y},t)\right] - \sum_{i=1}^{n}\sum_{j=1}^{m} C^{dc}_{ji}\,x_i\,\frac{\partial P(\mathbf{x},\mathbf{y},t)}{\partial y_j}. \end{aligned} \tag{45}$$

We annotate the terms in Table S1.

6.3. Generating function methods for biological stochasticity

The full master equation is fairly cumbersome and challenging to analyze directly. Therefore, analysis has to proceed by spectral methods. We use the generating function (GF), a length-N vector function G, such that each component is

$$G_s(\mathbf{g},\mathbf{h},t) = \int_0^\infty\!\!\cdots\!\int_0^\infty \sum_{x_1=0}^{\infty}\cdots\sum_{x_n=0}^{\infty}\left(\prod_{i=1}^n g_i^{x_i}\right)\left(\prod_{i=1}^m e^{h_i y_i}\right) P(s,\mathbf{x},\mathbf{y},t)\, dy_m \cdots dy_1 := \int_{\mathbf{y}}\sum_{\mathbf{x}} \mathbf{g}^{\mathbf{x}}\, e^{\mathbf{h}^T\mathbf{y}}\, P(s,\mathbf{x},\mathbf{y},t)\, d\mathbf{y},$$

where the final expression is the definition written in useful shorthand notation. Formally, the generating function is the combination of a probability-generating function (PGF) in the discrete variables and a moment-generating function (MGF) in the continuous variables. The arguments $\mathbf{g}$ (of length $n$) and $\mathbf{h}$ (of length $m$) are spectral variables. By computing the generating function of both sides of Equation 45, we find (see supplemental information) that the master equation is equivalent to a much more compact system of partial differential equations:

$$\frac{\partial \mathbf{G}}{\partial t} = H^T\mathbf{G} + \mathbf{G}\circ\boldsymbol{\mathcal{A}}(\mathbf{u}) + \mathbf{J}\left[C\mathbf{u} + \operatorname{diag}(\mathbf{u})\,D\mathbf{u}\right]. \tag{46}$$

This formulation relies on defining the unified variables u:

$$\mathbf{u} := \begin{bmatrix} \mathbf{g} - 1 \\ \mathbf{h} \end{bmatrix} \quad\text{and}\quad J_{si} = \frac{\partial G_s}{\partial u_i}, \tag{47}$$

as well as unified matrices:

$$C := \begin{bmatrix} (C^{dd})^T + (Q^d)^T & (C^{dc})^T \\ (C^{cd})^T & (C^{cc})^T \end{bmatrix}, \qquad D := \begin{bmatrix} (Q^d)^T & (C^{dc})^T \\ 0 & \tfrac{1}{2}\operatorname{diag}\boldsymbol{\sigma}^2 \end{bmatrix} := \begin{bmatrix} (Q^d)^T & (C^{dc})^T \\ 0 & (Q^c)^T \end{bmatrix}. \tag{48}$$

Each entry of the length-$N$ vector function $\boldsymbol{\mathcal{A}}$ consists of the burst and drift terms:

$$\mathcal{A}_s = (\boldsymbol{\alpha}^d_s)^T\left(\mathbf{F}_s(\mathbf{u}+1) - 1\right) + (\boldsymbol{\alpha}^c_s)^T\left(\mathbf{M}_s(\mathbf{u}) - 1\right). \tag{49}$$

The vector $\boldsymbol{\alpha}^d_s$ contains the frequencies of all discrete burst processes for state $s$. The first $m$ entries of $\boldsymbol{\alpha}^c_s$ contain the continuous species' drifts in state $s$. The remaining entries contain the corresponding rates of continuous burst processes. The vector $\boldsymbol{\alpha}_s$ aggregates these quantities. The vector function $\mathbf{F}_s$ contains the joint PGF of the discrete burst processes, and only depends on the first $n$ variables. The vector function $\mathbf{M}_s$ contains the drift terms, as well as the joint MGF of the continuous burst processes, and only depends on the last $m$ variables. The parameters of the $\mathcal{A}_s$ operator may vary in time.

To obtain the generating function at t, we apply the method of characteristics. First, we calculate the characteristics parametrized by the scalar variable 𝗌 :

$$T(\mathsf{s}) = t - \mathsf{s}, \qquad \frac{d\mathbf{U}(\mathsf{s})}{d\mathsf{s}} = C\,\mathbf{U}(\mathsf{s}) + \operatorname{diag}\!\left(\mathbf{U}(\mathsf{s})\right) D\,\mathbf{U}(\mathsf{s}), \tag{50}$$

where $\mathbf{U}(\mathsf{s}=0) = \mathbf{u}$. This is the "downstream" ODE, which governs abundances in isolation from production and regulation.

Therefore, G is governed by the following system of ordinary differential equations:

$$\frac{d\mathbf{G}(\mathbf{U}(\mathsf{s}),T(\mathsf{s}))}{d\mathsf{s}} = -H(T(\mathsf{s}))^T\mathbf{G} - \mathbf{G}\circ\boldsymbol{\mathcal{A}}(\mathbf{U}(\mathsf{s}),T(\mathsf{s})). \tag{51}$$

To obtain G at t, we integrate this matrix system from 𝗌=t to 𝗌=0. We use G0(U(t)) as the initial condition, where G0 is the generating function of the initial distribution. This is the “upstream” ODE, which governs the full generating function.

In the general case, evaluating this system requires two applications of quadrature: first, solving the n+m-dimensional downstream system to obtain the values of characteristics U at a set of grid points over [0, t]; then, solving the N-dimensional upstream system to obtain the value of the generating function.

Some special cases afford simpler solutions. If $D \neq 0$, the downstream ODE takes a Riccati-like form and generally resists exact analysis17,165. However, if $D = 0$ and $C$ is diagonalizable, the system takes the tractable linear form

$$\frac{d\mathbf{U}(\mathsf{s})}{d\mathsf{s}} = C\,\mathbf{U}(\mathsf{s}) := V^{-1}\Lambda V\,\mathbf{U}(\mathsf{s}), \quad\text{with the solution}\quad \mathbf{U}(\mathsf{s}) = V^{-1}e^{\Lambda\mathsf{s}}V\,\mathbf{u}, \tag{52}$$

whenever all eigenvalues of C are distinct. When they are not, the ODE can be solved in a similar way using generalized eigenvectors. Practically, this means that only one application of quadrature is required.
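As a concrete illustration, the following minimal sketch evaluates Equation 52 for a user-supplied $C$ with distinct eigenvalues; the two-species example corresponds to splicing followed by degradation, and the numerical values are arbitrary.

```python
import numpy as np

def characteristics(C, u, s_grid):
    """Evaluate U(s) = V^{-1} exp(Lambda s) V u on a grid (cf. Equation 52).

    Assumes C is diagonalizable with distinct eigenvalues; W below holds the
    right eigenvectors as columns, so it plays the role of V^{-1} in the text.
    """
    lam, W = np.linalg.eig(C)                # C = W diag(lam) W^{-1}
    coeffs = np.linalg.solve(W, u)           # transform the initial condition once
    return np.array([W @ (np.exp(lam * s) * coeffs) for s in s_grid])

# Nascent -> mature -> degradation: eigenvalues -beta and -gamma (distinct here).
beta, gamma = 1.0, 0.5
C = np.array([[-beta, beta],
              [0.0,  -gamma]])
U = characteristics(C, np.array([0.3, -0.2]), np.linspace(0.0, 10.0, 50))
```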

If, in addition, N=1, the upstream ODE reduces to a single integral:

$$\phi(t) = \phi_0(\mathbf{U}(t)) + \int_t^0 \frac{d\phi(\mathbf{U}(\mathsf{s}),T(\mathsf{s}))}{d\mathsf{s}}\,d\mathsf{s} = \phi_0(\mathbf{U}(t)) + \int_0^t \mathcal{A}(\mathbf{U}(\mathsf{s}),T(\mathsf{s}))\,d\mathsf{s}, \tag{53}$$

where $\phi := \log G$, $\phi_0 := \log G_0$, and the generating function $G$ is no longer boldfaced because only a single gene state exists.

If $\mathcal{A}$ is a linear operator $a_1 u_1 + \dots + a_{n+m} u_{n+m}$, the system is in the drift-only regime; no bursting occurs. In this case, the system reduces to

$$\phi(t) = \phi_0(\mathbf{U}(t)) + \sum_{i=1}^{n+m}\int_0^t a_i(t-\mathsf{s})\,U_i(\mathsf{s})\,d\mathsf{s}, \tag{54}$$

where $U_i$ are the components of $\mathbf{U}$. As each $U_i$ is, in turn, a weighted sum of $u_i$, the second term of the log-generating function is given by a sum of fairly simple convolutions that scale as $\int_0^t a_i(t-\mathsf{s})\,e^{\lambda_j \mathsf{s}}\,d\mathsf{s}$.

Finally, in the simplest case, if all eigenvalues $\lambda_i$ of $C$ are negative, the transient part of Equation 54 vanishes as $t\to\infty$ and the stationary log-generating function is a linear combination of $u_i$. This implies that the distribution converges to a product of independent Poisson distributions17,85.

6.4. Coupling multiple genes

These results solve master equations with abstracted production and processing reactions. To connect them to systems phenomena, such as the co-regulation of multiple genes, we need to specify how upstream interactions lead to co-expression. As the simplest illustrative model system, we can consider the co-regulation of two genes, indexed by $j$, with $U_j = u_j e^{-\gamma_j \mathsf{s}}$. We outline several relatively simple classes of candidate models which induce expression coupling.

In the simplest case, $\mathcal{A}(\mathbf{u},t) = \sum_j \mathcal{A}_j(u_j,t)$. In other words, the genes' dynamics are fully separable, and produce solutions in the form $G(\mathbf{u},t) = \prod_j G_j(u_j,t)$. This formulation produces independent distributions at each $t$, but the trajectories may possess nontrivial statistical relationships. For example, if both genes start at $x_1 = x_2 = 0$, their trajectories will be correlated over a finite timespan $[0, T]$, with the correlation decaying as $T\to\infty$.

In the next simplest case, co-regulation is the consequence of parameter differences in subpopulations. For example, the full cell population may consist of cell types indexed by κ. If we suppose each cell type has the abundance πκ and transcriptional parameters Θκ, we obtain

$$G(\mathbf{u},t) = \sum_\kappa \pi_\kappa\, G(\mathbf{u},t;\Theta_\kappa) = \sum_\kappa \pi_\kappa \prod_j G_j(u_j,t;\Theta_{j,\kappa}); \tag{55}$$

i.e., the generating function decomposes into a product of independent generating functions conditional on a particular cell type, but not globally. In other words, even if transcriptional processes are independent, cell type structure can produce nontrivial relationships between genes.

Alternatively, we can propose a model of co-regulation by the categorical variables. For example, two neighboring genes may prefer to have the same or opposite accessibility, depending on the polymeric properties of DNA. Assuming, for the purposes of illustration, that the system is symmetric, we obtain the following N=4 form:

$$H = \begin{bmatrix} -2k_{\mathrm{on}} & k_{\mathrm{on}} & k_{\mathrm{on}} & 0 \\ \varepsilon^{-1}k_{\mathrm{off}} & -\varepsilon^{-1}(k_{\mathrm{on}}+k_{\mathrm{off}}) & 0 & \varepsilon^{-1}k_{\mathrm{on}} \\ \varepsilon^{-1}k_{\mathrm{off}} & 0 & -\varepsilon^{-1}(k_{\mathrm{on}}+k_{\mathrm{off}}) & \varepsilon^{-1}k_{\mathrm{on}} \\ 0 & k_{\mathrm{off}} & k_{\mathrm{off}} & -2k_{\mathrm{off}} \end{bmatrix} \qquad \boldsymbol{\mathcal{A}} = \begin{bmatrix} 0 \\ k_{\mathrm{init}} u_1 \\ k_{\mathrm{init}} u_2 \\ k_{\mathrm{init}}(u_1+u_2) \end{bmatrix}. \tag{56}$$

This form encodes the co-regulation of two genes, such that $s \in$ {both off, gene 1 on, gene 2 on, both on}. If $\varepsilon \ll 1$, the intermediate states are unstable and the genes tend to be either both on or both off. If $\varepsilon \gg 1$, the intermediate states are particularly stable, and only one of the genes tends to be on at a time. If $\varepsilon = 1$, we recover the independent case.
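The structure of Equation 56 is easy to verify numerically: each row of $H$ sums to zero, and the stationary occupancy of the four gene states follows from the null space of $H^T$. The sketch below uses arbitrary rate values.

```python
import numpy as np
from scipy.linalg import null_space

def coupled_gene_H(k_on, k_off, eps):
    """Assemble the four-state transition matrix of Equation 56
    (states: both off, gene 1 on, gene 2 on, both on)."""
    r = 1.0 / eps
    H = np.array([
        [-2.0 * k_on,   k_on,                 k_on,                 0.0],
        [ r * k_off,   -r * (k_on + k_off),   0.0,                  r * k_on],
        [ r * k_off,    0.0,                 -r * (k_on + k_off),   r * k_on],
        [ 0.0,          k_off,                k_off,               -2.0 * k_off],
    ])
    assert np.allclose(H.sum(axis=1), 0.0)      # probability conservation
    return H

H = coupled_gene_H(k_on=1.0, k_off=0.5, eps=0.05)   # eps << 1: genes tend to switch together
pi = null_space(H.T).ravel()
pi /= pi.sum()                                      # stationary occupancy of the four states
```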

We can define a similar model for co-regulation by a continuous variable y1. For example, there may be a latent regulator, such as the concentration of an activator, that controls multiple loci: if it is high, both have a high transcription rate; otherwise, both are inactive20. This amounts to appending the following reactions to the master equation:

$$C^{cd}_{j1}\, y_1 \left[P(x_j - 1) - P(x_j)\right], \tag{57}$$

where the Ccd matrix encodes the relationship between the concentration and the transcription rate. Therefore, the genes become mutually correlated through the trajectory of y1, although the extent of correlation depends on the dynamics.

If the categorical or continuous driving process is bursty, we can approximate it by a co-bursting module. For example, in the limit of $\varepsilon \to 0$, the dynamics of the system in Equation 56 converge to the $N = 2$ formulation

$$H = \begin{bmatrix} -k^*_{\mathrm{on}} & k^*_{\mathrm{on}} \\ k^*_{\mathrm{off}} & -k^*_{\mathrm{off}} \end{bmatrix} \quad\text{and}\quad \boldsymbol{\mathcal{A}} = \begin{bmatrix} 0 \\ k_{\mathrm{init}}(u_1+u_2) \end{bmatrix}, \quad\text{where } k^*_{\mathrm{on}} = \frac{2k_{\mathrm{on}}^2}{k_{\mathrm{on}}+k_{\mathrm{off}}} \text{ and } k^*_{\mathrm{off}} = \frac{2k_{\mathrm{off}}^2}{k_{\mathrm{on}}+k_{\mathrm{off}}}. \tag{58}$$

If, in addition, $k^*_{\mathrm{off}}, k_{\mathrm{init}} \to \infty$, we obtain the $N = 1$ module characterized by

$$\mathcal{A} = k^*_{\mathrm{on}}\left[\frac{1}{1 - b(u_1+u_2)} - 1\right], \tag{59}$$

where $b := k_{\mathrm{init}}/k^*_{\mathrm{off}}$16. This is the bursty limit of Equation 56. Interestingly, that mechanism also possesses a slow mixture limit. If $\varepsilon \to \infty$ while $k_{\mathrm{on}}, k_{\mathrm{off}} \to 0$, we obtain a special case of Equation 55, with $\pi_\kappa = 1/2$ and mutually exclusive expression in the "cell types," or long-lived gene states.

Even when we restrict our analysis to simple feed-forward regulation, this outline of motifs is nowhere near exhaustive. Nevertheless, the “mixture” and “bursty” limits are particularly natural starting points, as their distributions are straightforward to construct. In other words, we speculate that the careful analysis of co-expression models can distinguish relationships due to “slow” variation between cell types and “fast” variation due to coupled transcriptional events.

6.5. Transient phenomena

This result yields a fairly simple numerical recipe for the determination of probabilities at a particular time $t$. Typically, analysis proceeds by assuming $H$ and $\boldsymbol{\mathcal{A}}$ are time-independent and letting $t\to\infty$, i.e., considering the stationary limit of the process. However, this may not be strictly justifiable: much of single-cell analysis involves the determination of trajectories from intrinsically transient data representing differentiation pathways166. If the transient process occurs on a timescale comparable to RNA turnover, using a stationary model may not be appropriate16.

To rigorously fit transient data, we need to posit just how a snapshot of cells may capture multiple cell states, such that some states are the progenitors of others. The solution is not yet clear, and multiple reasonable explanations exist; for example, we may suppose that the differentiation process “lags” in certain cells (in the vein of the models of variability proposed in Stumpf et al.44 for development, and in Sanders et al.167 and Perez-Carrasco et al.125 for the cell cycle). In other words, all cells are captured at a time t since the beginning of a process, but H and 𝓐 have different time-dependence for different cells. Although such an explanatory model can be instantiated, it may be too challenging to fit. Further, it does not appear to be compatible with processes that operate continuously; the choice of t becomes somewhat challenging to motivate.

We propose that the simplest model for observations relies on minimal synchronization between the biology and the experimental process. To mathematically formalize it, we take inspiration from the theory of reactor modeling in chemical engineering105 and extend preliminary work from our recent RNA velocity methods analysis19. A cell enters a medium; this entrance triggers a chemical signal that begins a transient process. The dynamics of this transient process are only dependent on time since receiving the signal, and identical between cells. After a delay, the cells exit the medium. In this framework, sequencing is the uniform random sampling of cells present within this medium. Although this formulation is admittedly simplistic—it excludes the cell cycle and stochastic driving—it allows us to take the first steps with a systematic study of using snapshot data to fit transient stochastic processes. This toy model is numerically tractable, which is useful for its simulation and characterization, and possesses a stationary state that is independent of the time at which the experiment is performed, which is useful for biological admissibility and realism.

Therefore, to marginalize over $t$, we need to augment the model with an additional property: the relationship between time along a transient process and the probability of capturing a cell. In the parlance of reactor engineering, this relationship is given by the internal-age distribution $f$. The simulations of transient processes in La Manno et al.86 and Bergen et al.59 implicitly adopt this model and assume a particular functional form of $f$. We might suppose cells enter the observation window at $t = 0$ and leave it at $t = T$, with a Dirac residence time distribution $\delta(t-T)$ and uniform sampling throughout this window. The resulting age distribution is uniform, with $f = T^{-1}$, and formally corresponds to the ideal plug flow reactor (PFR) architecture105. As $T\to\infty$, we obtain the $t\to\infty$ ergodic limit, if such a limit exists. On the other hand, if $f = \delta(t-T)$, we recover the instantaneous distribution at time $T$; this limit formally corresponds to the batch reactor (BR).

To obtain the generating function for the cells inside a tissue, we represent the tissue as a reactor, specify its influx and efflux properties, and solve for the internal-age distribution f. This internal-age distribution yields the occupation measure of the process times, as discussed in our RNA velocity review19, and induces the following reactor-wide generating function:

$$G = \int_t G(t)\, f(t)\,dt, \quad\text{where}\quad G(t) = \sum_s G_s(t). \tag{60}$$

We have marginalized over the instantaneous gene state s because this variable is typically not observable.

6.6. Droplet encapsulation noise

The generating function G describes the biological variability due to molecular processes, transcriptional driving, and the capture of cells from a reaction medium. However, single-cell RNA sequencing data do not quantify cells—they quantify barcodes. Cells are randomly encapsulated into droplets with barcoded beads; to avoid the formation of “doublets,” with two cells per droplet, the microfluidic protocols typically have a fairly low encapsulation rate. If we assume that a droplet may have either zero or one cells, we obtain the following generating function for the distribution of RNA on a per-barcode level:

$$G_{\mathrm{enc}} = p_1 G + p_0 = pG + (1-p) = G_{\mathrm{bc}}(G), \tag{61}$$

where $G_{\mathrm{bc}}$ is the PGF of the Bernoulli distribution, with $p_1 = p$ the probability of capturing a single cell and $p_0 = 1-p$ that of capturing none. Analogously, if we assume that doublets can occur, and the encapsulation of cells is independent and identically distributed (i.i.d.), we find

$$G_{\mathrm{enc}} = p_2 G^2 + p_1 G + p_0 = p^2 G^2 + 2p(1-p)G + (1-p)^2 = \left[pG + (1-p)\right]^2 = G_{\mathrm{bc}}(G), \tag{62}$$

where Gbc is now the PGF of the binomial distribution. It is straightforward to extend this to the unconstrained case, with per-cell encapsulation rate λ, and obtain the analogous expression

$$G_{\mathrm{enc}} = p_0 + p_1 G + p_2 G^2 + p_3 G^3 + \dots = e^{\lambda(G-1)} = G_{\mathrm{bc}}(G), \tag{63}$$

where Gbc is the PGF of the Poisson distribution.
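Numerically, these compositions are convenient because the per-barcode probabilities can be recovered by evaluating the composed PGF on the unit circle and inverting with a fast Fourier transform. The sketch below uses a negative binomial stand-in for the cell-level law $G$ and arbitrary parameter values.

```python
import numpy as np

def pgf_nb(g, r, p):
    """PGF of a negative binomial law, used here as a stand-in for the cell-level G."""
    return (p / (1.0 - (1.0 - p) * g)) ** r

M = 128                                        # grid size; should exceed the support of interest
g = np.exp(2j * np.pi * np.arange(M) / M)      # evaluation points on the unit circle

G_cell = pgf_nb(g, r=2.0, p=0.3)
G_barcode = np.exp(0.1 * (G_cell - 1.0))       # Poisson encapsulation (Equation 63), lambda = 0.1

pmf = np.real(np.fft.fft(G_barcode)) / M       # invert the PGF to per-barcode probabilities
pmf = np.clip(pmf, 0.0, None)
```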

However, even empty droplets typically contain some “background” molecules. Removing the empty droplets by filtering for cells with relatively high expression, as well as correcting for the background, is a standard part of sequencing workflows57,109112. To model the joint distribution of biological and background RNA, we need to instantiate a mechanistic hypothesis about its source. The simplest hypothesis consists of two parts. First, we impose the pseudobulk interpretation of background: we assume that a fraction of the cells loaded in the library construction step are lysed, and produce a pool of loose molecules. Next, we assume that these molecules are free to be encapsulated into the droplets in an i.i.d. fashion. This implies the Poisson functional form for the distribution of debris entering each droplet:

$$G_{\mathrm{bg}} = \exp\!\left(c\sum_i \mu_i u_i\right), \tag{64}$$

where $c$ is some shared constant that reflects the pool size and the rate of diffusion, whereas $\mu_i = \left.\frac{\partial G}{\partial u_i}\right|_{\mathbf{u}=\mathbf{0}}$ is the expectation of species $i$ over the entire cell population. This simplest model assumes that all cells are equally likely to lyse and release their contents; if this assumption is violated, $\mu_i$ needs to be obtained by computing an expectation with respect to a measure biased toward the less stable cells. Finally, the full per-droplet distribution of molecules is

$$G_{\mathrm{tot}} = G_{\mathrm{bc}}\, G_{\mathrm{bg}}, \tag{65}$$

i.e., each droplet contains contributions from the encapsulated cells, as well as the background. With some abuse of notation, we occasionally use the expression Gbc(G)Gbg(G), where the first argument denotes composition, whereas the second denotes functional dependence.

6.7. Library construction and sequencing noise

We cannot observe the biological molecule content of each droplet: we are restricted to analyzing counts of complementary DNA (cDNA). In a typical dual-index 3’ microfluidic workflow (e.g., the commercialized 10x chemistry48), these cDNA are quantified by the following sequence of reactions. First, a synthetic primer captures a poly(A) stretch in RNA, which may be an endogenous molecule or a synthetic tag168. The primer contains a poly(dT) oligonucleotide, a sequencing primer, a cell barcode, and a unique molecular identifier (UMI). Next, reverse transcriptase (RTase) attaches to the RNA-primer complex and synthesizes the complementary strand. When the first strand is complete, a template-switching oligonucleotide (TSO) attaches to the end, allowing RT to synthesize the second strand of cDNA. After library construction, the droplet emulsion is broken, producing a pool of long cDNA; polymerase chain reaction (PCR) is used to amplify this pool. The long cDNA molecules are enzymatically fragmented, and another sequencing primer is attached at the end of the molecule that formerly contained the TSO. Finally, another round of PCR amplifies the pool and appends sample indices and Illumina adaptors to both sides of the molecule. The pool of cDNA is loaded onto a sequencing machine and sequenced from both sides, producing two reads. One read contains the barcode and UMI bases, whereas the other contains partial information about the 3’ end of the molecule, beginning at the fragmentation site. This sequence of reactions represents the ideal-case scenario, and the products may well include artifacts due to off-target reactions169.

To understand the effect of technical variability on the per-barcode distributions, we need to summarize this workflow in a mechanistic model. First, we assume that the library preparation reactions occur in an i.i.d. fashion relative to each RNA molecule in the droplet, allowing us to construct a separate description of technical noise for each discrete molecular species indexed by i. At this stage, we omit the modeling of continuous species. As we quantify the number of UMIs, we can considerably simplify the description by splitting the workflow into the initial cDNA synthesis and all downstream steps. For the cDNA synthesis, we may choose one of two models:

$$\mathcal{X}_i \rightarrow \mathcal{X}_i + \mathcal{T}_i \quad\text{or}\quad \mathcal{X}_i \rightarrow \mathcal{T}_i. \tag{66}$$

In the first model, the formation of a UMI-tagged cDNA 𝓣i is non-sequestering, and the template RNA 𝒳i can participate in further cDNA synthesis. In other words, a single RNA molecule can produce more than one cDNA with distinct UMIs. In the second model, the cDNA synthesis is sequestering, and each RNA can template at most one cDNA with a particular UMI. For the downstream steps, if we assume the PCR and sequencing steps produce results that are reasonably faithful to their templates, we are essentially restricted to a single model:

$$\mathcal{T}_i \rightarrow \varnothing. \tag{67}$$

In other words, the sequence of steps after the formation of cDNA 𝓣i may lose some UMIs, but it cannot create them. Aggregating these steps, we find the shifted per-molecule generating function for technical noise:

$$G^*_{t_i} = G_{t_i} - 1 = \begin{cases} e^{\lambda_i(g_i - 1)} - 1 = e^{\lambda_i u_i} - 1 & \text{(non-sequestering)} \\ p_i g_i + (1-p_i) - 1 = p_i u_i & \text{(sequestering)}, \end{cases} \tag{68}$$

where $\lambda_i = \lambda_{i,c}\, p_{i,p}$ and $p_i = p_{i,c}\, p_{i,p}$. Here, $\lambda_{i,c}$ is the overall Poisson rate of the catalytic production of cDNA $\mathcal{T}_i$ with distinct UMIs, $p_{i,c}$ is the probability of producing a single cDNA $\mathcal{T}_i$ in a non-catalytic fashion, and $p_{i,p}$ is the probability of retaining a molecule of $\mathcal{T}_i$ through the PCR steps. It is straightforward to use a Taylor expansion to observe that the limit $\lambda_{i,c} \ll 1$ yields the Bernoulli form: if non-sequestering sequencing is relatively slow or inefficient, the probability of obtaining multiple cDNA from a single RNA is low, and the mathematically simpler Bernoulli noise form approximately holds16,161.
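In simulation, the two per-molecule models of Equation 68 amount to binomial thinning (sequestering) versus a Poisson number of distinct-UMI cDNAs per molecule (non-sequestering). The sketch below uses arbitrary parameters and a negative binomial stand-in for the biological counts.

```python
import numpy as np

rng = np.random.default_rng(0)

def observe_sequestering(x_true, p):
    """Each RNA molecule yields at most one UMI: binomial thinning with probability p."""
    return rng.binomial(x_true, p)

def observe_nonsequestering(x_true, lam):
    """Each RNA molecule yields a Poisson(lam) number of distinct-UMI cDNAs."""
    return rng.poisson(lam * x_true)

x_true = rng.negative_binomial(5, 0.3, size=10_000)   # stand-in for biological counts
seq = observe_sequestering(x_true, p=0.1)
nonseq = observe_nonsequestering(x_true, lam=0.1)
# For lam << 1, the two observed laws nearly coincide, per the Taylor argument above.
```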

Using the properties of PGFs21, we find that the overall generating function is given by a simple composition, substituting $G_{t_i}$ for $g_i$:

$$G_{\mathrm{tot},t} = G_{\mathrm{tot}}(\mathbf{G}^*_t), \tag{69}$$

where we use the $G_{\mathrm{tot}}(\mathbf{u})$ parametrization, and each entry of $\mathbf{G}^*_t$ contains the shifted generating function $G^*_{t_i}$ for a particular species $i$.

Finally, the reads associated with each cDNA $\mathcal{T}$ are not always uniquely identifiable: for example, the sequence content is typically sufficient to identify the gene, but if a read only covers an exonic portion of the gene, it is impossible to distinguish whether or not the original molecule has been spliced139. To correctly represent this ambiguity, we need to transform the arguments of the generating function from a length-$n$ vector to a length-𝓃 vector, such that 𝓃 is the total number of mutually distinguishable classes of molecules. The simplest form of this transformation is a linear categorical partition:

$$\mathbf{g} = \mathcal{P}^{a}\, 𝓰, \tag{70}$$

where $\mathcal{P}^{a}$ is an $n \times$ 𝓃 ambiguity matrix, with $\mathcal{P}^{a}_{i,𝓲}$ giving the probability of molecule $i$ being identifiable in the equivalence class 𝓲. We assume that each molecule can be assigned to at least one class, implying $\sum_{𝓲} \mathcal{P}^{a}_{i,𝓲} = 1$. In principle, only the constraint $\sum_{𝓲} \mathcal{P}^{a}_{i,𝓲} \leq 1$ is mandatory, but the loss of molecules can be equivalently reframed as a technical noise component in $\mathbf{G}^*_t$.

We discuss the general case of this model component in Section S3. In summary, the entries of $\mathcal{P}^{a}$ are challenging to identify, but it may be possible to exploit genomic information, polymer physics, and orthogonal long-read sequencing data to construct it from first principles. This formulation admits several special cases. For example, if we cannot distinguish any distinct species at all and can only quantify the total RNA content, 𝓃 $= 1$ and $\mathcal{P}^{a}_{i,𝓲} = 1$ for each $i$. Then we obtain

$$(\mathbf{g})_i = 𝓰 \;\text{ for all } i \quad\text{and}\quad G(𝓰) = G\!\left(\begin{bmatrix} 𝓰 \\ \vdots \\ 𝓰 \end{bmatrix}\right). \tag{71}$$

On the other hand, if all species are perfectly identifiable, we obtain 𝓃 $= n$ and $\mathcal{P}^{a} = I_n$, the $n$-dimensional identity matrix. If, say, we have $n = 2$ but 𝓃 $= 3$, as in the case of nascent, mature, and ambiguous molecules described in La Manno et al.86 and Eldjárn Hjörleifsson et al.139, we obtain

$$G(𝓰) = G\!\left(\begin{bmatrix} \mathcal{P}^{a}_{1,1}\,𝓰_1 + \mathcal{P}^{a}_{1,3}\,𝓰_3 \\ \mathcal{P}^{a}_{2,2}\,𝓰_2 + \mathcal{P}^{a}_{2,3}\,𝓰_3 \end{bmatrix}\right), \tag{72}$$

where 𝓰1 and 𝓰2 correspond to two unambiguously identifiable species, whereas 𝓰3 corresponds to ambiguous cDNA which may have come from either. In the general case, we find

$$\mathbf{u} = \mathcal{P}^{a} 𝓰 - 1 = \mathcal{P}^{a}(𝓾 + 1) - 1 = \mathcal{P}^{a} 𝓾 = \mathbf{G}^{a}(𝓾) - 1 := \mathbf{G}^{a*}(𝓾), \tag{73}$$

where each entry of the vector Ga contains the generating function of the relevant categorical distribution that governs how species i is parsed as one of the 𝓃 identifiable species:

$$\left(\mathbf{G}^{a}(𝓾)\right)_i = \sum_{𝓲} \mathcal{P}^{a}_{i,𝓲}\, 𝓰_{𝓲}. \tag{74}$$

Therefore, the overall GF takes the following form:

$$G^{a}_{\mathrm{tot},t} = G_{\mathrm{tot},t}\!\left(\mathbf{G}^{a*}(𝓾)\right). \tag{75}$$
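At the level of simulated molecules, the linear categorical partition amounts to multinomially splitting each species' counts among the observable classes according to the corresponding row of the ambiguity matrix. The sketch below uses an illustrative $n = 2$, 𝓃 $= 3$ matrix (nascent, mature, and ambiguous classes); the probabilities are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)

# Rows: true species (nascent, mature); columns: observable classes
# (unambiguously nascent, unambiguously mature, ambiguous). Each row sums to 1.
P_a = np.array([[0.7, 0.0, 0.3],
                [0.0, 0.8, 0.2]])

def assign_classes(x_true, P_a):
    """Multinomially partition each species' molecules among observable classes."""
    counts = np.zeros(P_a.shape[1], dtype=int)
    for i, x in enumerate(x_true):
        counts += rng.multinomial(x, P_a[i])
    return counts

observed = assign_classes(np.array([12, 30]), P_a)    # e.g., 12 nascent and 30 mature molecules
```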

6.8. Example systems

The equation above provides a generic, modular framework for characterizing variability in sequencing experiments. To fit it to data, we need to specify a particular set of models for each step of the process. To do so, we should first strive to understand which modular components are realistic based on relatively simple summaries of data. Further, the process of evaluating and fitting these models is fairly involved, and often requires substantial up-front work to design scalable solvers. Therefore, it is useful to understand their qualitative behaviors relevant to statistical inquiry. In the current section, we characterize some analytically tractable systems, as well as their identifiability properties, such as our ability to distinguish between different models and parameter regimes. To illustrate these points, we apply the models to real and simulated data and speculate about their implications and physical relevance.

6.8.1. Special theoretical cases

We revisit Section 6.3 to emphasize the implications and advantages of unifying the discrete and continuous degrees of freedom of the biological model in a common framework. The similarity of the discrete and continuous generating function terms is not accidental, and follows directly from the Poisson representation93. Occasionally, we can exploit this representation to bypass calculations for discrete processes by referring to results from the study of continuous processes, and vice versa. This approach consists of writing down the generating function PDE for a discrete process, identifying a continuous process governed by the same PDE, obtaining its solution from the stochastic process literature, and asserting that the discrete process distribution is given by compounding a Poisson distribution with the continuous law.

For instance, we may consider the case of a system with constitutive transcription at rate α, autocatalysis at rate q, and degradation at rate γ(N=1,n=1,m=0):

$$\varnothing \xrightarrow{\;\alpha\;} \mathcal{X}, \qquad \mathcal{X} \xrightarrow{\;\gamma\;} \varnothing, \qquad \mathcal{X} \xrightarrow{\;q\;} 2\mathcal{X}. \tag{76}$$

We can represent these reactions by the matrices $C = -\gamma + q$ and $D = q$, as well as the operator $\mathcal{A}(u) = \alpha u$. This system was introduced, but not treated, in Jahnke and Huisinga85, and, to our knowledge, first solved with master equation and generating function calculations by Vastola17. However, we can also solve it merely by matching terms, without any new calculations. We provide the full details of the parameter-matching process in Method S2.1. The derivation consists of noticing that the functional form of $C$, $D$, and $\mathcal{A}$ can also arise from an $N=1$, $n=0$, $m=1$ system with drift $\alpha$, square-root noise $\sigma = \sqrt{2q}$, and mean-reversion at the rate $\gamma - q$. This is the Cox–Ingersoll–Ross (CIR) process, a popular mathematical finance model of interest rates170,171. Its stationary distribution is gamma with shape $\alpha/q$ and scale $\frac{q}{\gamma - q}$. This immediately implies the distribution of the discrete process is negative binomial with the same shape and scale. This matches the result obtained by directly solving the master equation18. We find, then, that autocatalysis with constitutive transcription yields a stationary distribution equivalent to bursty transcription with no autocatalysis.
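This correspondence is easy to check numerically by way of the Poisson representation: mixing a Poisson distribution over the gamma stationary law of the matched CIR process reproduces the negative binomial law predicted for the discrete system. The parameter values below are arbitrary, and scipy's nbinom uses the success probability $1/(1+\text{scale})$.

```python
import numpy as np
from scipy import stats

alpha, q, gamma_ = 2.0, 0.3, 1.0                 # require gamma_ > q for a stationary law
shape, scale = alpha / q, q / (gamma_ - q)

rng = np.random.default_rng(2)
lam = rng.gamma(shape, scale, size=200_000)      # CIR stationary law
x = rng.poisson(lam)                             # Poisson representation of the discrete process

grid = np.arange(30)
empirical = np.bincount(x, minlength=grid.size)[: grid.size] / x.size
predicted = stats.nbinom.pmf(grid, shape, 1.0 / (1.0 + scale))
print(np.max(np.abs(empirical - predicted)))     # small, up to Monte Carlo error
```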

Obtaining this result, we may ask how the distribution changes if the molecules are produced in geometric bursts B with mean size b:

$$\varnothing \xrightarrow{\;\alpha\;} B\times\mathcal{X}, \qquad \mathcal{X} \xrightarrow{\;\gamma\;} \varnothing, \qquad \mathcal{X} \xrightarrow{\;q\;} 2\mathcal{X}. \tag{77}$$

By changing the drift operator to a jump operator, we obtain a PDE with $\mathcal{A}(u) = \alpha\left[\frac{1}{1 - bu} - 1\right]$. In other words, the continuous version of this process is a combination of CIR and gamma Ornstein–Uhlenbeck (Γ-OU) processes20, with the mean-reversion terms of both, the square-root noise of the former, and the exponentially-distributed jumps of the latter.

Define the parameter combinations

$$c := \gamma - q, \qquad v := \frac{\alpha b}{bc - q}. \tag{78}$$

By direct integration, we find the characteristic and the stationary distribution

$$U(\mathsf{s}) = \frac{c\,u\,e^{-c\mathsf{s}}}{c + qu\left(e^{-c\mathsf{s}} - 1\right)}, \qquad G = \exp\left[\alpha\int_0^\infty \frac{bU(\mathsf{s})}{1 - bU(\mathsf{s})}\,d\mathsf{s}\right] = \left(\frac{1 - qc^{-1}u}{1 - bu}\right)^{v}. \tag{79}$$

Curiously, this distribution exactly matches the transient MGF of the Γ-OU process, as well as the equivalent transient PGF of the bursty transcription process with no autocatalysis16:

$$G = \left(\frac{1 - bu\,e^{-\kappa\tau}}{1 - bu}\right)^{v}; \tag{80}$$

we may take advantage of the fact that $qc^{-1}$ can be equivalently expressed as $be^{-\kappa\tau} < b$ for some positive $\kappa$ and $\tau$, because $bc - q > 0$ is required to have a steady state (i.e., positive $v$). In the continuous setting, this process is known172 to have a law consisting of a mixture of gamma distributions with scale $be^{-\kappa\tau}$ and shape $k$; in turn, $k$ is drawn from a negative binomial distribution with shape $v$ and scale $(1 + e^{-\kappa\tau})^{-1}$. This immediately implies that the distribution of the corresponding discrete process is a negative binomial-negative binomial mixture with equivalent parameters, which may be confirmed by the considerably more involved direct derivation in Method S2.2. Although this distribution cannot be expressed in closed form, its construction makes the simulation of the bursty transient and stationary autocatalytic processes trivial, and suggests that simple finite approximations (i.e., up to a modest $k$) may be developed.

The continuous formulation is a way to exploit existing quantitative results, but does not typically make problems easier. For example, we may be interested in solving an RNA/protein system with transcription, catalytic translation (at rate q), and the degradation of both species (at respective rates γR and γP). Without specifying the transcriptional dynamics, we find that the downstream ODEs have a nontrivial D matrix, i.e.,

C = (C^{dd})^T = \begin{bmatrix} -\gamma_R & q \\ 0 & -\gamma_P \end{bmatrix} \quad\text{and}\quad D = (Q^{d})^T = \begin{bmatrix} 0 & q \\ 0 & 0 \end{bmatrix}. \qquad (81)

Although these matrices can be exploited to obtain both characteristics, the solution depends on special functions and is thus challenging to manipulate77. Instead, we may ask whether we can simplify the problem by eliding all stochasticity in the protein species and assuming it may be described by a continuous process. Defining the variables for this system, we find:

C = \begin{bmatrix} (C^{dd})^T & (C^{dc})^T \\ 0 & (C^{cc})^T \end{bmatrix} = \begin{bmatrix} -\gamma_R & q \\ 0 & -\gamma_P \end{bmatrix}, \qquad D = \begin{bmatrix} 0 & (C^{dc})^T \\ 0 & 0 \end{bmatrix} = \begin{bmatrix} 0 & q \\ 0 & 0 \end{bmatrix}, \qquad (82)

i.e., in spite of this supposed simplification, the problem is precisely as challenging as it was before. This provides an immediate and intuitive explanation for a range of results, such as the observation that the stationary distribution of proteins under constitutive transcription has a complicated solution in terms of Kummer’s hypergeometric function even if one uses a leading-order approximation (cf. Eqns. 34 and 50 of Bokes138).

6.8.2. Empty droplets

Model definition.

In Equation 64, we propose the simplest nontrivial model for the background distribution of RNA molecules in each droplet: the RNA content for each species i is described by a set of independent Poisson distributions whose mean is proportional to the mean in the entire cell population. Per Equation 65, the distribution of background is convolved with the endogenous RNA distribution of cell-containing droplets, making it challenging to distinguish technical and biological contributions. However, we can make predictions about the empty droplets, which have G_{bc} = 1, and compare these predictions to real datasets.

First, we define a baseline n=2 model of biology, such that

\varnothing \xrightarrow{K} \mathcal{X}_N \xrightarrow{\beta} \mathcal{X}_M \xrightarrow{\gamma} \varnothing, \qquad (83)

where K is a generic, but non-constant (bursty, multistate, or SDE-controlled) transcription process, 𝒳_N is a nascent transcript, 𝒳_M is a mature transcript, and β and γ are Markovian splicing and degradation rates, respectively. Whereas constant K yields Poisson distributions of 𝒳_N and 𝒳_M, variable K induces overdispersed distributions of RNA in droplets containing one or more cells. Further, it implies that certain correlations are nonzero. For a given gene j, the correlation between counts of 𝒳_{j,N} and 𝒳_{j,M} should be nonzero, as the latter is, conceptually, the moving average of the former. Further, the correlation between the counts of a given species for different genes should be nonzero, as it reflects cell type heterogeneity and gene co-regulation16 (see Section 6.4).

This model describes the biology in living cells; to connect it to UMI measurements, we assume that G_t^* is an approximately linear map, i.e., library construction is either sequestering or non-sequestering and slow. Further, we assume G_a^* is a linear map, as in Equation 74. Therefore, for each species i, we have a per-cell biological distribution with mean μ_i. In a droplet containing a single cell, the mean becomes (1 + c)μ_i p_i ≈ μ_i p_i, where p_i is the overall probability of capturing, retaining, sequencing, and identifying each molecule (Section 6.7). In a droplet with no cells, the mean is cμ_i p_i. We assume the number of doublets is negligible.

Under the foregoing assumptions, we predict that the empty-droplet marginal per-gene UMI distribution is Poisson with mean cμ_i p_i. This mean is proportional to the mean in non-empty droplets with a small coefficient of proportionality c. Further, we should observe zero correlations on an intra-gene basis, between counts of 𝒳_{j,N} and 𝒳_{j,M}, and on an inter-gene basis, e.g., between counts of 𝒳_{j1,M} and 𝒳_{j2,M}. However, it is not a priori clear that this model should even approximately describe real data, even in the case of empty droplets. For example, these data may exhibit considerable “read depth” variability65,83, or, in our framework, inter-droplet variation in the probability p_i, which would induce overdispersion or genome-wide correlations between molecule counts. By inspecting the distributional properties of empty droplet data, we can attempt to qualitatively motivate or raise doubts regarding the Poisson model.

Data processing.

To build references and pseudoalign datasets, we used kallisto | bustools 0.26.0. We downloaded pre-built H. sapiens and M. musculus genomes from https://support.10xgenomics.com/single-cell-gene-expression/software/downloads/latest (10x Genomics, GRCh38 and mm10, 2020-A versions). Next, we used the kb ref function with the --lamanno option to build references. We obtained the raw FASTQ files for the six datasets reported in Table S2. Then, we used the kb count function with the --lamanno option, as well as the appropriate technology option -x (10x v2 or v3), to quantify the datasets, outputting unspliced and spliced RNA matrices. The unspliced counts correspond to molecular barcodes containing introns, whereas the spliced counts correspond to molecular barcodes not containing introns139,173. For the reasons outlined in Section S6 of Carilli and Gorin et al.71, we identify unspliced counts with “nascent” RNA species and spliced counts with “mature” RNA species, and elide any ambiguity.

Data analysis.

We split the datasets into two categories. The “non-empty” droplets were retained after the bustools filter; the “empty” category contains barcodes that were discarded by the filter. Although this split is fairly coarse, as the filtering choices are heuristic, it is coherent with typical processing workflows and allows us to inspect the broad trends of distributional properties.

To investigate the overdispersion, or lack thereof, we separately computed the mean and variance of nascent and mature UMI counts for each gene in each set of cells. We plotted these quantities on a log-log scale, omitting the data points where one or both of these quantities were zero. Under the pseudobulk model, we expect the non-empty droplets to exhibit overdispersion and the empty droplets to lie near the identity line, as the model encodes Poisson statistics for the latter.
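The mean-variance computation can be sketched as follows, assuming counts is a dense (cells × genes) array for one droplet class; the function and variable names are placeholders rather than those of our analysis scripts.

import numpy as np

def log_mean_variance(counts):
    # Per-gene mean and variance of UMI counts for the overdispersion plots
    mu = counts.mean(axis=0)
    var = counts.var(axis=0)
    keep = (mu > 0) & (var > 0)   # omit genes where either quantity is zero
    return np.log10(mu[keep]), np.log10(var[keep])

Under the empty-droplet Poisson model, the returned log-variances should lie near the log-means (the identity line).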

To investigate the intra-gene correlation structure, we computed the Pearson correlation coefficient ρ between nascent and mature UMI counts for each gene in each set of cells. We plotted the histograms of these values, as well as their relationship to the mature UMI mean, omitting the data points where ρ was undefined. To investigate the inter-gene correlation structure, we computed the Pearson correlation coefficient between the nascent UMI counts for each pair of genes in each set of cells, and repeated the analysis for mature count data. We plotted the histograms of these values, omitting the data points where ρ was undefined. As the number of gene pairs is fairly large, we first excluded all genes that were not expressed in the dataset. We expect both measures of correlation to be substantial for non-empty droplets and near zero for the empty droplets, as the model encodes statistical independence between marginal distributions for the latter.
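A minimal sketch of these correlation computations, again assuming dense (cells × genes) arrays N (nascent) and M (mature) for one droplet class; these names are placeholders.

import numpy as np

def intra_gene_correlations(N, M):
    # Pearson correlation between nascent and mature counts, gene by gene
    out = np.full(N.shape[1], np.nan)
    for j in range(N.shape[1]):
        if N[:, j].std() > 0 and M[:, j].std() > 0:
            out[j] = np.corrcoef(N[:, j], M[:, j])[0, 1]
    return out

def inter_gene_correlations(X):
    # Pearson correlations between all expressed gene pairs for one species
    X = X[:, X.std(axis=0) > 0]
    R = np.corrcoef(X, rowvar=False)
    return R[np.triu_indices_from(R, k=1)]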

To investigate the relationship between the empty and non-empty droplet averages, we plotted the mean mature UMI count for each gene in empty droplets against the mean mature UMI count in cell-containing droplets. As we plotted these quantities on a log-log scale, we omitted the data points where one or both of these quantities were zero. We repeated the analysis for nascent RNA data. We expect these averages to be highly correlated, as the pseudobulk model proposes that the background RNA are sampled from a pool representative of the cell population.

Next, we computed and reported the Pearson correlation coefficient between the (well-defined) log-means. To characterize and explain deviations from Poisson behavior, we selected all genes with overdispersion in the mature RNA count distributions in empty droplets (σ_M² > 2×μ_M) and reported their identities. Finally, to quantify the variation not included in the model, we computed the mean and variance of total mature UMI counts in empty droplets, with and without the overdispersed genes. As the sum of independent Poisson distributions is Poisson, we expect the total per-cell UMI count distributions to have a variance approximately equal to the mean.

6.8.3. Noise-corrupted candidate models of transcriptional variation

Model definition.

We would like to characterize the mutual distinguishability of superficially similar transcriptional models. In particular, we are interested in the benefits of multimodal data collection and the effects of technical noise.

As above, we begin by defining a baseline n=2 model of biology, such that

\varnothing \xrightarrow{K} \mathcal{X}_N \xrightarrow{\beta} \mathcal{X}_M \xrightarrow{\gamma} \varnothing, \qquad (84)

where K represents one of three candidate transcriptional models. The discrete dynamics are summarized by

C^{dd} = \begin{bmatrix} -\beta & 0 \\ \beta & -\gamma \end{bmatrix}, \qquad U_M = u_M e^{-\gamma s}, \qquad U_N = u_N e^{-\beta s} + u_M\frac{\beta}{\beta - \gamma}\left(e^{-\gamma s} - e^{-\beta s}\right). \qquad (85)

The first transcriptional model is the Γ-OU process, with N=1 and m=1:

dy_t = -\kappa y_t\, dt + dZ_t, \qquad (86)

where Zt is a subordinator with arrival rate a and exponentially distributed jumps with mean size θ. This system is characterized by

u = \begin{bmatrix} u_N & u_M & u_K \end{bmatrix}, \qquad C^{cc} = -\kappa, \qquad C^{cd} = \begin{bmatrix} 1 & 0 \end{bmatrix}, \qquad \mathcal{A}(u) = a\left[\frac{1}{1 - \theta u_K} - 1\right], \qquad (87)

with all other matrices and operators set to zero.

The second is the CIR process, with N=1 and m=1:

dy_t = \left(a\theta - \kappa y_t\right) dt + \sqrt{2\kappa\theta y_t}\, dW_t. \qquad (88)

This system is characterized by

u = \begin{bmatrix} u_N & u_M & u_K \end{bmatrix}, \qquad C^{cc} = -\kappa, \qquad C^{cd} = \begin{bmatrix} 1 & 0 \end{bmatrix}, \qquad Q^{c} = \kappa\theta, \qquad \mathcal{A}(u) = a\theta u_K, \qquad (89)

with all other matrices and operators set to zero.

We previously proposed the Γ-OU and CIR processes as potential explanatory models for gamma-distributed stochastic variability in transcription rates, solved them, and investigated the implications of their kinetics on the model properties and distinguishability20. The stationary distribution of the Γ-OU and CIR processes is gamma, with shape a/κ and scale θ, i.e., mean aθ/κ and variance aθ²/κ. In addition, their (appropriately normalized) autocorrelation function is e^{-κt}.

Finally, the third is the telegraph process100, with N=2 and m=0. This system is characterized by

u = \begin{bmatrix} u_N & u_M \end{bmatrix}, \qquad H = \begin{bmatrix} -k_{\mathrm{on}} & k_{\mathrm{on}} \\ k_{\mathrm{off}} & -k_{\mathrm{off}} \end{bmatrix}, \qquad \text{and} \qquad \mathcal{A}(u) = \begin{bmatrix} 0 & k_{\mathrm{init}} u_N \end{bmatrix}. \qquad (90)

The stationary distribution of this process is Bernoulli scaled by k_init, with mean k_on k_init/(k_on + k_off) and variance k_on k_off k_init²/(k_on + k_off)². Its autocorrelation function is e^{-(k_on + k_off)t}81.

For all three models, assuming a Bernoulli observation model (i.e., that each molecule has an independent probability p of being observed) is equivalent to a parameter redefinition. For the Γ-OU and CIR models, this redefinition is θ → pθ; for the telegraph model, we have analogously that k_init → pk_init.

Let us see why this is true. Recall from Section 3 that the Bernoulli technical noise model amounts to a redefinition u_N → pu_N, u_M → pu_M. For the Γ-OU model, the steady-state (log-) GF is

\phi_{ss}(u_N, u_M) = a\int_0^\infty \frac{\theta U_K(s; u_N, u_M)}{1 - \theta U_K(s; u_N, u_M)}\, ds, \qquad (91)

where UK(𝗌;uN,uM) is the exponential sum solution of

\frac{dU_K}{ds} = U_N - \kappa U_K, \qquad U_K(0) = 0, \qquad (92)

and where the characteristics U_N and U_M are as in Equation 85. Because the U_K ODE is linear, U_K depends linearly on u_N and u_M (and hence on p). But φ_ss only depends on U_K through the combination θU_K, so the problem with technical noise is equivalent to redefining θ → pθ.
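This equivalence is easy to verify numerically. The sketch below (with illustrative parameter values, not the paper's implementation) solves the linear U_K characteristic with SciPy and evaluates the steady-state log-GF of Equations 91-92 by trapezoidal quadrature, confirming that rescaling (u_N, u_M) by p matches rescaling θ by p.

import numpy as np
from scipy.integrate import solve_ivp

kappa, a, theta, beta, gamma_ = 0.5, 0.4, 1.0, 0.8, 0.9   # illustrative values

def U_N(s, uN, uM):
    # Nascent characteristic from Equation 85
    return uN * np.exp(-beta * s) + uM * beta / (beta - gamma_) * (np.exp(-gamma_ * s) - np.exp(-beta * s))

def log_gf_gou(uN, uM, theta):
    # Steady-state log-GF of the Gamma-OU model (Equations 91-92)
    s_max = 30.0 / min(kappa, beta, gamma_)
    s = np.linspace(0.0, s_max, 4000)
    UK = solve_ivp(lambda t, y: U_N(t, uN, uM) - kappa * y, (0.0, s_max), [0.0],
                   t_eval=s, rtol=1e-10, atol=1e-12).y[0]
    return a * np.trapz(theta * UK / (1.0 - theta * UK), s)

p, uN, uM = 0.3, -0.6, -0.4                # u = g - 1, with g in [0, 1)
print(log_gf_gou(p * uN, p * uM, theta))   # technical noise applied to u
print(log_gf_gou(uN, uM, p * theta))       # equivalent redefinition of theta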

For the CIR model, the steady-state (log-) GF is

\phi_{ss}(u_N, u_M) = a\theta\int_0^\infty U_K(s; u_N, u_M)\, ds, \qquad (93)

where

\frac{dU_K}{ds} = U_N - \kappa U_K + \kappa\theta U_K^2, \qquad U_K(0) = 0.

The technical noise causes U_N → pU_N. Dividing the equation through by p moves the factor of p elsewhere; we can see that

\phi_{ss}(u_N, u_M) = a\,(p\theta)\int_0^\infty \frac{U_K(s; u_N, u_M)}{p}\, ds, \qquad \frac{d(U_K/p)}{ds} = U_N - \kappa\,(U_K/p) + \kappa\,(p\theta)\,(U_K/p)^2, \qquad U_K(0) = 0 \qquad (94)

is equivalent, i.e., that again the technical noise problem is equivalent to a non-technical-noise problem with θ → pθ.

For the telegraph model, the steady-state (log-) GF is

\phi_{ss}(u_N, u_M) = \phi_0\left(U_N(\infty), U_M(\infty), U_{\mathrm{on}}(\infty), U_{\mathrm{off}}(\infty)\right), \qquad \frac{dU_{\mathrm{off}}}{ds} = -k_{\mathrm{on}}\left(U_{\mathrm{off}} - U_{\mathrm{on}}\right), \qquad \frac{dU_{\mathrm{on}}}{ds} = -k_{\mathrm{off}}\left(U_{\mathrm{on}} - U_{\mathrm{off}}\right) + k_{\mathrm{init}}\left(U_{\mathrm{on}} + 1\right)U_N, \qquad (95)

where U_off(0) = U_on(0) = 0. Since U_N(∞) = U_M(∞) = 0, the values of U_N(s) only affect φ_ss through the combination k_init U_N that appears in the U_on ODE; this means we can just redefine k_init → pk_init, as promised, to get a completely equivalent problem.

Model analysis.

Formally, these models have five parameters each: three for the upstream transcriptional dynamics and two for the downstream molecular conversion. However, their qualitative behaviors at steady state can be effectively summarized by fixing μK,β, and γ, and varying two key parameters, the timescale separation and the noise intensity. From a statistical point of view, μK/β and μK/γ are easily and robustly identifiable from the mean molecular counts; from an experimental point of view, β and γ can, in principle, be fit by orthogonal experiments86. At steady state, the value of μK is a somewhat arbitrary scaling factor.

For the two-species SDE driver models, the qualitative parameters take the following form:

\text{timescale separation} := x = \frac{\kappa}{\kappa + \beta + \gamma}, \qquad \text{noise intensity} := y = \frac{\theta}{a + \theta}. \qquad (96)

These parameters both range in (0,1). When the timescale separation approaches zero, the transcriptional variation is much slower than the turnover, and the distribution of RNA is given by a simple Poisson mixture of the law of K. When the noise intensity approaches zero, the law of K degenerates and the distribution of RNA becomes Poisson. Most interestingly, when the timescale separation and the noise intensity are both high, the system exhibits bursty transcription20.

Equation 96 is defined with reference to the process parameters of the Γ-OU and CIR drivers20. It remains to define κ,θ, and a in terms of kon,koff, and kinit for the telegraph process. The correct identification is:

\kappa = k_{\mathrm{on}} + k_{\mathrm{off}} \ \text{is the autocorrelation timescale}, \qquad a = \frac{k_{\mathrm{on}}\kappa}{k_{\mathrm{off}}} \ \text{is the process scaling, and} \qquad \theta = \frac{k_{\mathrm{off}}k_{\mathrm{init}}}{\kappa} \ \text{is the gain}. \qquad (97)

These identifications are not arbitrary, as they endow the system with lower moments that match the SDE formulation: autocorrelation function e^{-κt}, mean aθ/κ, and variance aθ²/κ. In addition, the system has the correct geometric burst limit (k_init, k_off → ∞) with burst size θ/κ → k_init/k_off and burst frequency a → k_on73; this limit matches the Γ-OU one20.

Given any combination of {x,y,μK,β,γ}, we can identify the transcriptional parameters:

\kappa = (\beta + \gamma)\frac{x}{1 - x}, \qquad \frac{a}{\theta} = \frac{k_{\mathrm{on}}\kappa}{k_{\mathrm{off}}}\cdot\frac{\kappa}{k_{\mathrm{off}}k_{\mathrm{init}}} = \frac{k_{\mathrm{on}}\kappa^2}{k_{\mathrm{off}}^2 k_{\mathrm{init}}}, \qquad y = \frac{1}{1 + a/\theta} \ \text{or} \ \frac{a}{\theta} = \frac{1}{y} - 1,
\mu_K \kappa^{-1}\left(\frac{1}{y} - 1\right) = \frac{k_{\mathrm{init}}k_{\mathrm{on}}}{\kappa^2}\cdot\frac{k_{\mathrm{on}}\kappa^2}{k_{\mathrm{off}}^2 k_{\mathrm{init}}} = \frac{k_{\mathrm{on}}^2}{k_{\mathrm{off}}^2} = \left(\frac{k_{\mathrm{on}}}{k_{\mathrm{off}}}\right)^2 =: c,
\text{giving} \quad k_{\mathrm{on}} = \frac{\sqrt{c}\,\kappa}{\sqrt{c} + 1}, \qquad k_{\mathrm{off}} = \frac{\kappa}{\sqrt{c} + 1}, \qquad \text{and} \qquad k_{\mathrm{init}} = \frac{\mu_K \kappa}{k_{\mathrm{on}}}. \qquad (98)

This allows us to define a particular set of {μK,β,γ}, vary x and y over the constrained domain (0,1) × (0,1), and compare the model properties for each (x, y) tuple. If we are interested in a one-species model, we simply replace each instance of β+γ with β. Since the construction in Equation 98 is bijective, if we fairly densely sample the square, we can be confident that the results fully encompass the range of behaviors under a particular set of averages.
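A minimal helper implementing this parameter identification (Equations 96-98) might look as follows; the function names are ours, not part of an existing package.

import numpy as np

def sde_driver_params(x, y, mu_K, beta, gamma_):
    # (x, y) -> (kappa, a, theta) for the Gamma-OU/CIR drivers, holding mu_K = a*theta/kappa fixed
    kappa = (beta + gamma_) * x / (1.0 - x)
    r = 1.0 / y - 1.0                  # r = a / theta
    a = np.sqrt(mu_K * kappa * r)      # from a*theta = mu_K*kappa and a/theta = r
    theta = np.sqrt(mu_K * kappa / r)
    return kappa, a, theta

def telegraph_params(x, y, mu_K, beta, gamma_):
    # (x, y) -> (k_on, k_off, k_init) via the identifications in Equations 97-98
    kappa = (beta + gamma_) * x / (1.0 - x)
    c = (mu_K / kappa) * (1.0 / y - 1.0)        # c = (k_on / k_off)^2
    k_on = np.sqrt(c) * kappa / (np.sqrt(c) + 1.0)
    k_off = kappa / (np.sqrt(c) + 1.0)
    k_init = mu_K * kappa / k_on
    return k_on, k_off, k_init

For a one-species model, beta + gamma_ is replaced by beta, as noted above.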

Simulated data analysis.

To evaluate PMFs, we used trapezoidal quadrature for the Γ-OU generating function integral, the Runge-Kutta method for the CIR characteristic U_K and trapezoidal quadrature for the generating function integral, and the Runge-Kutta method for the telegraph model’s coupled differential equations18,20. We marginalized over the continuous and categorical dimensions. We evaluated all PMFs on (x_N, x_M) ∈ [0, …, 49] × [0, …, 50]. To generate synthetic data, we sampled with replacement from the 2,550 microstates in the domain, using P(x_N, x_M) as sampling probabilities.
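Sampling microstates from an evaluated joint PMF can be done directly with NumPy; the sketch below uses a placeholder PMF (a product of Poissons) standing in for the model-specific PMFs.

import numpy as np
from scipy.stats import poisson

P = np.outer(poisson.pmf(np.arange(50), 5.0), poisson.pmf(np.arange(51), 8.0))
P /= P.sum()                                  # placeholder joint PMF on the 50 x 51 grid

rng = np.random.default_rng(0)
idx = rng.choice(P.size, size=200, replace=True, p=P.ravel())
xN, xM = np.unravel_index(idx, P.shape)       # sampled (nascent, mature) microstates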

To investigate parameter identifiability, we generated 200 realizations from the Γ-OU model under κ = 0.1, a = 0.4, θ = 1, β = 0.8, and γ = 0.9. These parameters lie in the “mixture-like” regime, where the transcriptional process is slower than the RNA turnover process. Next, we constructed a uniformly spaced 14 × 15 grid of x and y, evaluated at the true values of μ_K, β, and γ and bounded by [0.01, 0.99]. In statistical terms, this model formulation is the best-case scenario where no noise exists and uncertainty in the fixed parameters is negligible.

To investigate the statistical properties of one-species data, we evaluated the log-likelihood log L of the nascent marginal of the data at each of the 210 x, y coordinates (with the true value being x = 1/9 and y = 5/7). Next, we plotted log L as a heatmap over x, y. The coordinates with high log L are not readily distinguishable, i.e., these parameters produce very similar distributions to the data. We highlighted the coordinates in the 90th percentile of log L—the least distinguishable region—using hatching. To illustrate a case where the one-species data are relatively uninformative, we considered a point with x = 9/10 and y = 5/7, which lies in the qualitatively different “burst-like” regime (κ=7.2) but closely resembles the “mixture-like” data at steady state.

To investigate the statistical properties of two-species data, we repeated the analysis above, computing the joint likelihood rather than the marginal likelihood. In the two-species model, the true “mixture-like” parameter set has x=1/18 and the illustrative “burst-like” parameter set has x ≈ 0.81; the other parameters do not change. To demonstrate the source of failure to distinguish between these parameter regimes, we plotted the PMFs in both. We used a transparent bar plot for the nascent PMFs and a heatmap for the joint PMFs, with darker colors representing a higher probability mass.

To investigate the mutual identifiability of models, we computed their Akaike weights over the x, y landscape. The Akaike weight of model ϖ is defined as follows:

w_{\varpi} = \frac{e^{-\frac{1}{2}\Delta_{\varpi}}}{\sum_k e^{-\frac{1}{2}\Delta_k}}, \quad \text{where} \quad \Delta_k = \mathrm{AIC}_k - \mathrm{AIC}_{\min}, \quad \mathrm{AIC}_{\min} = \min_k \mathrm{AIC}_k, \quad \text{and} \quad \mathrm{AIC}_k := -2\log L_k(\hat{\Theta}_k) + 2\varsigma_k. \qquad (99)

Thus, AIC_k is the Akaike information criterion (AIC) for model k. The AIC depends on the model log-likelihood log L_k at the maximum likelihood estimate Θ̂_k, as well as the number of model parameters ς_k120. Therefore, the Akaike weight essentially transforms and combines the models’ relative likelihoods to provide a measure of their agreement with the data.

Although this measure has its caveats and limitations—for example, it cannot account for uncertainty in the model-specific parameters Θ_k—it is a fairly conventional criterion for model selection. Most usefully for our investigation, it admits a simple interpretation: if the Akaike weight of the true model w_ϖ ≈ 1/3, there is essentially no basis for choosing a particular model, since the models’ distributions are not practically distinguishable. If w_ϖ > 1/2, we have a basis for model discrimination: the odds in favor of the correct model are at least even. In the three-model case, this may reflect both, or only one, of the competing hypotheses being substantially worse at describing the data, so a more careful examination of the w_k values is necessary to judge the models.
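In practice, the weights require only a few lines of NumPy; the following is a generic sketch rather than our analysis code, with hypothetical input values.

import numpy as np

def akaike_weights(log_likelihoods, n_params):
    # Equation 99: log_likelihoods[k] is log L_k at the MLE, n_params[k] the parameter count
    aic = -2.0 * np.asarray(log_likelihoods, float) + 2.0 * np.asarray(n_params, float)
    delta = aic - aic.min()
    w = np.exp(-0.5 * delta)
    return w / w.sum()

print(akaike_weights([-1502.3, -1503.1, -1510.8], [5, 5, 5]))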

To investigate model identifiability, we constructed a uniformly spaced 14×15 grid of x and y, bounded by [0.01,0.99]. At each grid point, we generated 200 realizations from the Γ-OU model under μK=5,β=0.8, and γ=0.9. Next, we computed the log Lk of each model using the nascent marginal and the full data, and used the relative likelihoods to compute the Akaike weights of the Γ-OU model under these two scenarios. Finally, to reduce the impact of stochastic sampling variability, we repeated the process 50 times and computed their average. In other words, we generated 50 independent datasets at each of the 210 grid points, evaluated likelihoods of all models, computed the univariate and bivariate Γ-OU Akaike weight of each, then aggregated the 50 trials at each grid point to obtain two “average-case” performance measures. In statistical terms, this model formulation represents the best-case scenario where the parameters are perfectly known, and the problem solely consists of distinguishing between the models, as in the Γ-OU/CIR case considered in Fig. 3 of Gorin and Vastola et al.20

To visualize the behavior of the Akaike weights under these assumptions, we plotted its value as a heatmap over x, y. We highlighted the coordinates with w_ϖ < 1/2—the poorly distinguishable region—using hatching. To illustrate a case where the one-species data are relatively uninformative, we compared a point with one-species coordinates x, y = (0.4, 0.9), which lies in the “mixture-like” regime, to one with x, y = (0.9, 0.8), which lies in the “burst-like” regime. We visualized these points on the x, y axes using large, color-coded circles. From Gorin and Vastola et al.20 and the properties of low-x processes outlined in the definition of x, we expect the former regime to be highly distinguishable, particularly since the telegraph process converges to a bimodal Bernoulli mixture as κ → 0. On the other hand, we expect the latter regime to be somewhat less distinguishable; in this limit, the Γ-OU and telegraph models both converge to the bursty model discussed in Singh and Bokes137. We repeated this analysis for two-species Akaike weights, transforming the coordinates appropriately (i.e., x ≈ 0.24 for the mixture-like regime and x ≈ 0.81 for the burst-like regime).

To demonstrate the basis of statistical distinguishability properties, we plotted the PMFs of the three models in the two parameter regimes. To simultaneously display them, we plotted marginal distributions of the nascent species as line charts, color-coded by the model identity.

To investigate the effect of drop-out technical noise, we did not perform dedicated simulations; instead, we exploited the result, derived above, that the functional form of the solutions is closed under downsampling. In other words, all distributional properties of a system with gain θ and the technical noise parameter p are identical to those of a system with gain pθ and no technical noise. These properties include the model distinguishability. To illustrate this result, we represented Bernoulli technical noise by arrows in the negative y direction, with small circles located on an arrow corresponding to 50%, 75%, and 85% dropout. To compute the y value under dropout, we use:

y^* = \frac{p\theta}{p^{-1}a + p\theta}, \quad \text{since} \quad \mu_K = \frac{a\theta}{\kappa} = \frac{(p^{-1}a)(p\theta)}{\kappa} = \text{const}. \qquad (100)

The arrows begin at 0% dropout, corresponding to the illustrative base cases (large circles) described above. This demonstrates that increasing the drop-out rate while holding the averages constant leads to the molecular distributions’ degeneration to the Poisson limit. If we do not hold the averages constant, we simply obtain the decreased y^* = pθ/(a + pθ) on the (less identifiable) x, y landscape with mean transcription rate pμ_K.

6.8.4. Distributions obtained from a transient process

Model definition.

As motivated in our RNA velocity review19, understanding transient developmental processes that occur on a timescale comparable to RNA lifetimes requires fitting transient probabilistic models. Even under the considerable simplifications made in Section 6.5, fully treating transient transcriptional phenomena requires identifying the a priori unknown (1) internal-age distribution f(t) as well as (2) process parameters for G(t). As the time since process start t can be conceptualized as a cell-specific latent variable, this problem can be treated by an expectation–maximization (EM) algorithm, which may proceed by probabilistically constraining the unknown (3) cell-specific times tc.

Since parameter inference is mandatory for the expectation step of the EM algorithm, we begin by characterizing the upper limit on its performance. In particular, previous attempts to treat the problem have assumed simple Gaussian or Poisson error terms59,86,121, or applied graph methods174. These approaches do not recapitulate19 the discrete stochasticity and bursting observed in transient biophysical processes125,175. However, the transient distributions of bursty processes are not available in closed form, and require new algorithms. Therefore, we treat the simplest nontrivial formulation, which combines points (1) and (2), while omitting (3): if we have perfect information about the cells’ relative times, can we satisfactorily fit a bursty transcriptional model and use the results as a basis for distinguishing between internal-age distributions?

We define a baseline N=1,n=2,m=0 model of biology with no technical noise, with the reaction schema

\varnothing \xrightarrow{\alpha} B\times\mathcal{X}_N \xrightarrow{\beta} \mathcal{X}_M \xrightarrow{\gamma} \varnothing, \qquad (101)

representing bursty transcription with stochastic burst sizes B drawn from a geometric distribution with time-dependent mean b(t):

u = \begin{bmatrix} u_N & u_M \end{bmatrix}, \qquad C^{dd} = \begin{bmatrix} -\beta & 0 \\ \beta & -\gamma \end{bmatrix}, \qquad U = \begin{bmatrix} U_N \\ U_M \end{bmatrix} = \begin{bmatrix} u_N e^{-\beta s} + u_M\frac{\beta}{\beta - \gamma}\left(e^{-\gamma s} - e^{-\beta s}\right) \\ u_M e^{-\gamma s} \end{bmatrix}, \qquad \mathcal{A}(u) = \alpha\left[\frac{1}{1 - b(t)u_N} - 1\right], \qquad (102)

with all other operators set to zero. To specify b(t), we define a three-stage model of cell type transitions, such that

b(t) = \begin{cases} b_1 & t < \tau_1 \\ b_2 & t \in (\tau_1, \tau_2) \\ b_3 & t > \tau_2, \end{cases} \qquad (103)

i.e., a transition is accompanied by a rapid change in burst size at a deterministic time after starting the process.

Next, we propose candidate internal-age distributions. Drawing on the chemical engineering literature105,106, we outline one-parameter reactor models, such that t = 0 corresponds to the cell entering the reactor; after some residence time t′, which depends on the reactor architecture and is drawn from the distribution f_res, the cell exits. The internal-age distribution is given by

f(t) = \frac{1}{T}\int_t^\infty f_{\mathrm{res}}(t')\, dt'. \qquad (104)

The plug flow reactor (PFR) is the model implicit in previous studies59,86. Formally, it represents each cell entering a reactor, then exiting after some deterministic time T. Its residence-time distribution is Dirac or degenerate, with f_res(t′) = δ(t′ − T), so

f(t) = \frac{1}{T}\int_t^\infty f_{\mathrm{res}}(t')\, dt' = \frac{1}{T}\int_t^\infty \delta(t' - T)\, dt' = \frac{I(t < T)}{T}, \qquad (105)

the expected uniform distribution. This distribution has the CDF and inverse CDF

F(t) = \frac{t}{T}\, I(t < T) \quad\text{and}\quad F^{-1}(p) = pT. \qquad (106)

The continuously stirred tank reactor (CSTR) represents a cell entering a homogeneous reactor, then exiting after a random time, in a memoryless fashion. Therefore, the residence-time distribution f_res(t′) = (1/T)e^{-t′/T} is memoryless or exponential, yielding

f(t) = \frac{1}{T}\int_t^\infty f_{\mathrm{res}}(t')\, dt' = \frac{1}{T^2}\int_t^\infty e^{-t'/T}\, dt' = \frac{1}{T}e^{-t/T}; \qquad (107)

i.e., memorylessness implies that the properties inside the reactor—including the age distribution—are identical to the properties of the efflux stream. We obtain the CDF and inverse CDF

F(t) = 1 - e^{-t/T} \quad\text{and}\quad F^{-1}(p) = -T\ln(1 - p). \qquad (108)

The laminar-flow reactor (LFR) is a configuration between these two extremes: it represents a cell entering a reactor, remaining in it for some deterministic time, then being able to exit after a power-law delay. Its residence-time distribution f_res(t′) = (T²/2) t′^{-3} I(t′ > T/2) is Pareto, yielding

f(t) = \frac{1}{T}\int_t^\infty f_{\mathrm{res}}(t')\, dt' = \frac{T}{2}\int_{\max(t,\,T/2)}^\infty t'^{-3}\, dt' = \begin{cases} \frac{T}{2}\int_t^\infty t'^{-3}\, dt' = \frac{T}{4t^2} & t > \frac{T}{2} \\ \frac{T}{2}\int_{T/2}^\infty t'^{-3}\, dt' = \frac{1}{T} & t < \frac{T}{2}. \end{cases} \qquad (109)

The PDF can be integrated to yield the CDF and inverse CDF

F(t) = \begin{cases} \frac{t}{T} & t < \frac{T}{2} \\ 1 - \frac{T}{4t} & t > \frac{T}{2} \end{cases} \quad\text{and}\quad F^{-1}(p) = \begin{cases} pT & p < \frac{1}{2} \\ \frac{T}{4(1 - p)} & p > \frac{1}{2}. \end{cases} \qquad (110)

We are interested in the CDFs and inverse CDFs of the internal-age distributions because “perfect information about the cells’ relative times” properly requires specifying {F_ϖ(t_c)} and {F_ϖ(τ_i)} under the true model ϖ rather than the raw {t_c} and {τ_i} values. Otherwise, the model selection problem becomes somewhat trivial; for example, if we know the mean residence time is T and that one of the t_c exceeds T, we can immediately eliminate the PFR configuration without performing any calculations.
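The inverse CDFs above also make inverse transform sampling of cell ages straightforward; a minimal sketch, with T the mean residence time and p uniform on (0, 1):

import numpy as np

def pfr_inverse_cdf(p, T):
    return p * T                                  # Equation 106

def cstr_inverse_cdf(p, T):
    return -T * np.log(1.0 - p)                   # Equation 108

def lfr_inverse_cdf(p, T):
    p = np.asarray(p, dtype=float)
    return np.where(p < 0.5, p * T, T / (4.0 * (1.0 - p)))   # Equation 110

rng = np.random.default_rng(0)
p = rng.uniform(size=5)
print(pfr_inverse_cdf(p, 5.0), cstr_inverse_cdf(p, 5.0), lfr_inverse_cdf(p, 5.0))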

A synthetic dataset consists of observations xN,c,xM,c for each cell c, generated from the true model ϖ at the true time point tc. The log-likelihood of parameters Θk={b1,b2,b3,α,β,γ}k for model k takes the form

\log L_{k,c}\left(\Theta_k \mid x_{N,c}, x_{M,c}\right) = \log P\left(x_{N,c}, x_{M,c}, t_{c,k} \mid \Theta_k; \{\tau_i\}_k\right), \qquad (111)

where t_{c,k} := F_k^{-1}(F_ϖ(t_c)) and {τ_i}_k := {F_k^{-1}(F_ϖ(τ_i))} are the transformed times. This yields the full log-likelihood under the assumption of independence:

\log L_k(\Theta_k) = \sum_c \log L_{k,c}\left(\Theta_k \mid x_{N,c}, x_{M,c}, \{\tau_i\}_k\right) = \sum_c \log P\left(x_{N,c}, x_{M,c}, t_{c,k} \mid \Theta_k, \{\tau_i\}_k\right). \qquad (112)

The problem of identifying the maximum likelihood parameter set consists of optimizing Equation 112 with respect to Θk. The problem of reactor identification consists of using the resulting reactor-specific maximum likelihood value log Lk(Θ^k) with Equation 99 to obtain the Akaike weights of each reactor configuration.

Simulated data analysis.

To generate the illustrations in Figure 4a, we directly simulated cells entering and exiting each reactor configuration. First, we sampled arrival times from a uniform distribution on [0, 100]. Next, we sampled residence times by inverse transform sampling from the inverse CDF corresponding to each f_res, using the mean residence time T = 2. We arbitrarily selected the observation time 75 and selected all cells which had arrived but not exited at this time. We computed the cell age by subtracting the arrival time from the current time. We repeated this procedure 10^7 times for each reactor to obtain the internal-age distribution. Next, we computed the histogram of the distribution on [0, 10], using 200 bins. To account for the fact that this histogram only contains part of the CSTR and LFR densities, we rescaled the bins by the internal-age distribution’s CDF value at t = 10. Finally, we plotted the rescaled histogram as a bar plot, and the analytical f as a line plot for comparison.

To understand the actionable differences between reactors, we simulated data from a single reactor model, then fit all three models to the obtained counts. First, we sampled 200 true reaction times {t_c} under the PFR model with T = 5 and sorted them. To generate synthetic data, we used Gillespie’s stochastic simulation algorithm140,144 with a time-dependent burst size, storing the state of the system at {t_c}. We generated 200 realizations, using only one realization per time point to fit the models. To simulate, we used the parameters Θ_ϖ = {b_1, b_2, b_3, α, β, γ}_ϖ = {2, 5, 1, 0.8, 1.2, 3.14}. We set {τ_1, τ_2} to {1, 3}. We started the system in a bivariate Poisson initial distribution with λ_N^0 = αb_1/β nascent and λ_M^0 = αb_1/γ mature molecules on average. Although this initial condition is somewhat arbitrary, as it is out of equilibrium, it is readily tractable and yields a constant mean over the first stage of the process.

The instantaneous probability P(x_{N,c}, x_{M,c}, t_{c,k} ∣ Θ_k, {τ_i}_k) is not available in closed form, and needs to be obtained by inverting the generating function for each t_{c,k}16,20,137:

G(u, t_{c,k}) = \exp\left(\lambda_N^0 U_N(u, t_{c,k}) + \lambda_M^0 U_M(u, t_{c,k}) + \alpha\int_0^{t_{c,k}}\left[\frac{1}{1 - b(t_{c,k} - s)U_N(u, s)} - 1\right] ds\right) \qquad (113)

where we elide the dependence of b on the model-specific {τi}k. For a given value of u, it is straightforward to propagate the initial condition. However, it is impractical to compute the integral separately for each c. We can bypass this bottleneck by reusing quadrature points. Conceptually, we define the quadrature matrices

T_Q = \begin{bmatrix} b(t_0 - t_0) & 0 & \cdots & 0 & 0 \\ b(t_1 - t_0) & b(t_1 - t_1) & \cdots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ b(t_{\eta-1} - t_0) & b(t_{\eta-1} - t_1) & \cdots & b(t_{\eta-1} - t_{\eta-1}) & 0 \\ b(t_\eta - t_0) & b(t_\eta - t_1) & \cdots & b(t_\eta - t_{\eta-1}) & b(t_\eta - t_\eta) \end{bmatrix}, \qquad D_Q = \mathrm{diag}\left[U_N(u, t_{0,k}),\, U_N(u, t_{1,k}),\, \ldots,\, U_N(u, t_{\eta-1,k}),\, U_N(u, t_{\eta,k})\right] \qquad (114)

in the general case with η cells. We prepended the starting grid point t_{0,k} := 0 to properly integrate from zero. We use the notation T_Q because this matrix is Toeplitz in the narrow, but numerically relevant19, case of a uniformly spaced grid approximating sampling from a PFR. To lighten the notation, we drop the subscript k from the time points in the definition of T_Q. D_Q is diagonal and does not need to be constructed explicitly; to obtain the product T_Q D_Q, we broadcast T_Q with the vector used in the definition of D_Q. Then, we computed M_Q = (1 − T_Q D_Q)^{(−1)} − 1, where ^{(−1)} is to be interpreted as the elementwise (Hadamard) inverse of the matrix. Finally, we approximated the integral by applying the NumPy quadrature algorithm trapz along the rows of M_Q, using {t_{c,k}} as the integration grid176. The GF evaluation grid size was set to [0, …, max x_N + 4] × [0, …, max x_M + 4], where max x_i is the highest RNA count observed for species i over the entire simulation, in all cells.
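A minimal NumPy sketch of this construction, assuming t is the grid of cell times with t[0] = 0 prepended, UN holds U_N(u, t_i) on the same grid for one value of u, and burst_size implements b(·); these names are placeholders. The result is the integral in Equation 113 for every cell time at once, up to the factor α and the initial-condition terms.

import numpy as np

def burst_integrals(t, UN, burst_size):
    dt = np.clip(t[:, None] - t[None, :], 0.0, None)    # pairwise differences t_i - t_j
    TQ = np.tril(burst_size(dt))                        # lower-triangular T_Q of Equation 114
    MQ = 1.0 / (1.0 - TQ * UN[None, :]) - 1.0           # broadcast D_Q, Hadamard inverse, subtract 1
    return np.trapz(MQ, t, axis=1)                      # one trapezoidal integral per row

t = np.linspace(0.0, 5.0, 6)                            # toy grid
UN = -0.5 * np.exp(-0.8 * t)                            # toy characteristic values
print(burst_integrals(t, UN, lambda dt: 2.0 * np.ones_like(dt)))   # constant burst size b = 2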

Next, we used the SciPy routine scipy.optimize.minimize177 to minimize the negative log-likelihood of the data under all three models, and obtain a satisfactory set of parameters. Specifically, we varied the 6-dimensional vector log_10 Θ, with each log-parameter’s bounds set to (−1.5, 1.5). We optimized with the L-BFGS-B solver for a maximum of 20 steps. Since we are primarily interested in the models’ relative performance at their maximum likelihood estimates (MLEs), rather than the process of obtaining these estimates, we initialized each search at the parameters used to generate the data.

Next, we sought to illustrate the fit performance and the differences between the models’ distributions. We plotted the marginals of the simulated data at each time point tc as bar plots, now using the counts from all 200 cells to demonstrate the full transient distribution. Next, we plotted the marginal PMFs of the three models at the corresponding time points tc,k as color-coded line charts. We expect the true reactor configuration (PFR) to closely agree with the distribution shapes; however, we have no a priori information regarding how well other reactor architectures can recapitulate the same data. To quantify the prospects for model selection, we inserted the optimal log-likelihoods into Equation 99 and calculated the Akaike weights of the model candidates.

To characterize the identifiability properties, we reproduced the simulation and analysis process using the same parameters, but varying the dataset size, with η = { 20, 40, 60, 80, 100, 150, 200 }. For each η, we generated 50 synthetic datasets, fit them, and computed the Akaike weights of the models. We plotted all wϖ as a function of the number of cells, adding uniform jitter to facilitate inspection. To visualize the trends in model identifiability, we plotted the mean and standard deviations of all wϖ for a given η, connecting them with a line to guide the eye. We do not a priori know whether the reactor configurations are meaningfully distinguishable, but if they are, we expect them to become more so with more data.

Next, we sought to characterize the prospects for distinguishing reactor models for a broader range of transcriptional parameters. We used rejection sampling to draw Θ_ϖ. First, we drew log_10 b_i from a normal distribution with mean 0.8 and standard deviation 1, and all other log-parameters from a normal distribution with mean 0 and standard deviation 1. The parameters were clipped to stay in the domain [10^−1.4, 10^1.4] to avoid “trivial” regimes with excessive timescale separation relative to the reactor residence time. Next, we found the highest b_i, computed the nascent and mature mean and standard deviation corresponding to this set of b_i, α, β, γ137, and kept the proposed Θ_ϖ if μ_N + 4σ_N and μ_M + 4σ_M were both lower than 25. Otherwise, we regenerated Θ_ϖ. This is an ad hoc way to limit the state space size for PMF evaluation: although we do not know what the maximum observed counts will be until we simulate the system, μ + 4σ typically provides a reasonable estimate97. Rejecting parameters in this fashion approximately limited the state space size to 25 × 25. In this way, we simulated, fit, and computed the Akaike weights for 200 parameter sets. All used the PFR ground truth model, {τ_1, τ_2} = {1, 3}, and T = 5 as above.

To summarize the model identifiability over this domain of synthetic parameters, we plotted the distribution of AIC weights w_ϖ. Finally, to characterize the relationships between the models, we plotted the distributions of log-likelihood differences log L_k(Θ̂_k) − log L_ϖ(Θ̂_ϖ), where k corresponds to the CSTR and LFR models, as transparent histograms color-coded by k. If such a histogram is skewed toward negative values, the model k produces consistently worse fits than the true PFR model. On the other hand, if it is centered at zero, then model k is typically easily confused with the true model. We restricted this visualization to (−5, 5) to compensate for potential failure to converge, which produces inflated likelihood differences. This visualization provides a basis for explaining the distribution of w_ϖ.

6.8.5. Variability in library construction

Model definition.

In section 6.8.3, we considered the parameter and model identifiability for a two-stage model of RNA processing, and found that several interesting distributions are closed under downsampling, so long as the downsampling is Bernoulli with equal parameters for both species. However, this assumption may be too restrictive in practice: for example, nascent RNA may be more or less likely to be captured than mature RNA, depending on the poly(A) content of their introns. In the current section, we investigate the behavior of models with differences in capture probabilities or rates.

The identifiability properties are highly model-dependent. For example, if we consider the Γ-OU or CIR models, with N=1,n=2,m=1, such that

\varnothing \xrightarrow{K} \mathcal{X}_N \xrightarrow{\beta} \mathcal{X}_M \xrightarrow{\gamma} \varnothing, \qquad (115)

where the autocorrelation rate of K satisfies κ ≪ β, γ, the stationary distribution of K is gamma with shape v = a/κ and scale θ. We find that the stationary RNA generating function is bivariate negative binomial, with

G = \left(\frac{1}{1 - \frac{\theta u_N}{\beta} - \frac{\theta u_M}{\gamma}}\right)^{v}, \qquad (116)

which is outlined in the supplemental section 2.5.2 of Gorin and Vastola et al.20 Under sampling, the distribution stays bivariate negative binomial, with GF

G = \left(\frac{1}{1 - \frac{\theta p_N u_N}{\beta} - \frac{\theta p_M u_M}{\gamma}}\right)^{v}. \qquad (117)

In other words, even if we have perfect information about this distribution’s three parameters v, θp_N/β, and θp_M/γ, we cannot conclude anything about the magnitudes of p_N and p_M, as they are degenerate with θ, β, and γ. If K is telegraph (i.e., N=2, n=2, m=0), we obtain a finite Poisson mixture:

G = \frac{k_{\mathrm{off}}}{\kappa} + \frac{k_{\mathrm{on}}}{\kappa}\exp\left(\frac{k_{\mathrm{init}}p_N u_N}{\beta} + \frac{k_{\mathrm{init}}p_M u_M}{\gamma}\right), \qquad (118)

which exhibits the same degeneracy with respect to k_init, β, and γ. Entirely analogously, if the system is in the Poisson limit (y → 0) with average transcriptional strength μ_K, we find that sampling yields

G = \exp\left(\frac{\mu_K p_N u_N}{\beta} + \frac{\mu_K p_M u_M}{\gamma}\right), \qquad (119)

which is non-identifiable.

Interestingly, the bursty regime is partially identifiable. We begin by defining a baseline N=1,n=2,m=0 model of biology with technical noise but no ambiguity, such that

\varnothing \xrightarrow{\alpha} B\times\mathcal{X}_N \xrightarrow{\beta} \mathcal{X}_M \xrightarrow{\gamma} \varnothing \qquad (120)

representing bursty transcription with stochastic burst sizes B drawn from a geometric distribution with constant mean b. Further, we assume that a molecule 𝓧i is retained with probability pi, yielding:

G_t^*(u) = \begin{bmatrix} p_N u_N & p_M u_M \end{bmatrix}, \qquad C^{dd} = \begin{bmatrix} -\beta & 0 \\ \beta & -\gamma \end{bmatrix}, \qquad U(G_t^*(u), s) = \begin{bmatrix} p_N u_N e^{-\beta s} + p_M u_M\frac{\beta}{\beta - \gamma}\left(e^{-\gamma s} - e^{-\beta s}\right) \\ p_M u_M e^{-\gamma s} \end{bmatrix}, \qquad \mathcal{A}(u) = \alpha\left[\frac{1}{1 - bu_N} - 1\right]. \qquad (121)

In other words, the stationary generating function is given by

\exp\left(\int_0^\infty \mathcal{A}\left(U(G_t^*(u), s)\right) ds\right). \qquad (122)

In principle, this quantity can be integrated, inverted, and optimized with respect to the parameters. However, to be thorough, we need to reformulate the optimization problem in the most compact form available, which involves identifying the distribution’s degeneracies. Although this system formally has six parameters b,α,β,γ,pN,pM, at steady state only four are identifiable. This is made clear by examining the integrand:

bU_N = bp_N u_N e^{-\beta s} + bp_M u_M\frac{\beta}{\beta - \gamma}\left(e^{-\gamma s} - e^{-\beta s}\right) = bp_N\left[u_N e^{-\beta s} + \frac{p_M}{p_N} u_M\frac{\beta}{\beta - \gamma}\left(e^{-\gamma s} - e^{-\beta s}\right)\right], \qquad (123)

i.e., the characteristic is invariant so long as bp_N and p_M/p_N are constant. By plugging in zero for u_N or u_M, we observe that the marginal characteristics take the functional form of the characteristics of the noise-free system, implying that different values of p_N and p_M may give indistinguishable marginal distributions. Therefore, identifying the relationship between p_N and p_M requires bivariate data. To quantitatively characterize how identifiable p_N and p_M are, we need to use simulations.

However, challenges particular to single-cell technologies arise when attempting to apply this model to large datasets with many genes. Although the Bernoulli model is a useful approximation, considering the sequencing process suggests that the non-sequestering technical noise model is more realistic: there is no chemical barrier to an RNA molecule being captured multiple times. In this formulation, each gene’s technical noise is parametrized by the species’ overall capture rates λN and λM, which produce the Bernoulli limit when both of these parameters are small.

Furthermore, it appears implausible that λ_{j,N} and λ_{j,M}, where j indexes over genes, vary arbitrarily. In a previous report21, we found that the model λ_{j,N} = C_N L_j and λ_{j,M} = λ_M performs satisfactorily. In this model, the nascent species are identified with unspliced molecules, which are considerably longer than spliced molecules and contain a large number of internal poly(A) priming sites. To a first-order approximation, we may propose that nascent species are captured at a rate proportional to the gene length L_j, where the constant of proportionality C_N is a dataset-wide technical noise parameter. Analogously, we identify the mature species with fully spliced, poly(A)-tailed molecules, and make the zeroth-order approximation that poly(A) tails are chemically identical. The capture rate λ_M is, then, also dataset-wide. Although this model is relatively simplistic, it foregrounds a key challenge. Even if we assume different genes’ transcriptional processes are independent, we cannot fit their distributions independently, as we need to account for coupling through the technical noise parameters.

Data analysis.

To illustrate the identifiability of p_M/p_N under the Bernoulli noise model, we considered the likelihood landscape for the simplest one-parameter formulation. We fixed the parameters α = 1, bp_N = 4.9, μ_N = αbp_N/β = 7, and μ_M = αbp_M/γ = 10; in other words, the nascent RNA distribution is negative binomial with shape α/β ≈ 1.43 and scale bp_N. We simulated data at p_M/p_N ∈ {1/4, 1, 4}, with η = {20, 50, 100, 200} simulated cells. For each of the true p_M/p_N and η values, we generated 200 datasets by sampling from the PMF on [0, …, 99] × [0, …, 99]. To evaluate the PMF for p_M > p_N, we set p_M to unity with no loss of generality. To evaluate it for p_N > p_M, we set p_N to unity. This yields b = (bp_N)/p_N and γ = αbp_M/μ_M. Next, we computed the likelihood of the data under log_10 p_M/p_N ∈ [−2, 2], keeping α, bp_N, μ_N, and μ_M constant, using the evaluation grid size [0, …, max x_N + 3] × [0, …, max x_M + 3], where max x_i is the maximum observed for each species in the simulation. We used 200 grid points for log_10 p_M/p_N, evenly spaced throughout the domain. Next, we computed the posteriors over the grid by dividing each likelihood vector by its sum. Finally, we plotted the average posterior distribution using line charts, with the color indicating the true value of p_M/p_N and the intensity indicating the number of cells, with more saturated colors corresponding to more simulated cells. For ease of comparison, we plotted the true values using dashed lines. From a statistical perspective, this analysis summarizes the parameter identifiability conditional on perfect information about the nascent marginal and the species averages. As we do not a priori know whether the differences in the PGF are actionable, the analysis illustrates the sample sizes required to fit the parameter to a particular degree of precision.
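Normalizing a vector of grid likelihoods into a posterior is a one-liner; here is a numerically stable version of the division-by-the-sum step described above (a generic sketch, not our analysis script).

import numpy as np

def grid_posterior(log_likelihoods):
    # Posterior over a parameter grid under a flat prior, via the log-sum-exp trick
    ll = np.asarray(log_likelihoods, dtype=float)
    w = np.exp(ll - ll.max())
    return w / w.sum()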

We previously motivated and fit the Poisson model of technical noise21,161. In Gorin et al.21, we inspected a variety of datasets, and observed a pronounced length bias in the nascent RNA count data, which did not appear in mature RNA counts (Section S7.3 of Gorin et al.21). This bias may be explained by three naïve models of biology.

The first model posits that the nascent RNA molecules are in the process of being transcribed; higher amounts of nascent RNA for longer genes simply reflect longer elongation delays. Although this explanation is superficially plausible, it is not borne out by the data. The model predicts a geometric-Poisson distribution of nascent RNA and zero correlation between nascent and mature counts140,142. Real data, on the other hand, have distinctly negative binomial-like marginals (as evident in, e.g., the third column of Fig. 4b of our recent work on delay CMEs140, which shows consistently inferior fits under the delay model), and nontrivial nascent/mature correlations (as in the red histogram in Figure 2b).

The second model posits that the differences in expression reflect real differences in the underlying biological parameters, and technical noise may be neglected. However, fitting this model produces pervasive length biases in the parameter values (Section S7.4 of Gorin et al.21), which are inconsistent with trends observed in orthogonal data. This is the model we explored in Gorin et al.21

The third model posits that technical noise does occur, but takes the species-independent form pN=pM. This formulation is mathematically identical to the second model, but proposes that an apparent length bias in the burst size is actually a length bias in bp. This model partially bypasses the objection raised for the second model by proposing that p is gene length-dependent, identical for nascent and mature species, and higher for longer genes. However, this model is implausible on physical grounds, as mature transcripts do not have the intronic poly(A) content necessary to produce this length dependence. This is indirectly implied by the consistently low fraction of exonic reads in sequencing datasets, in contrast to introns and the 3’ untranslated region136.

These biases can be largely eliminated by proposing a length-dependent sampling rate for nascent RNA counts, suggesting that this technical noise model is more coherent with known biology. To illustrate this process, we summarize the key results from Gorin et al.21

We obtained the raw data for the twelve 10x v3 datasets reported in Table S4 of Gorin et al.21. The raw data consisted of nascent and mature count matrices for 2,500 genes per dataset. The counts were generated by running the kallisto|bustools 0.26.0 kb count command on the raw FASTQs with the --lamanno option, using an intronic/exonic index built from the GRCh38 and mm10 reference genomes, as described in Section 6.8.2. The datasets were filtered to remove low-expression droplets, first using the default bustools filter, then using the manually selected knee plot thresholds shown in Table S5 of Gorin et al.21 Next, they were filtered for the top 2,500 moderate- to high-expression genes using the procedure in Section S4.3.1 of Gorin et al.21 To visualize the broad trends in count averages, we obtained the gene lengths L_j, then binned the values of log_10 L_j into ten bins, with the edges given by the deciles d_0, d_1, …, d_10. Next, we computed the average log_10 mean of nascent and mature expression levels for genes falling into each bin. Finally, we plotted these mean levels at each bin center d_k + (d_{k+1} − d_k)/2, connecting the values with a line to guide the eye. We repeated this analysis for all twelve datasets, distinguishing the nascent and mature statistics by color.

Next, we obtained the fit results for these datasets. The fits were performed using the Monod 0.2.5.0 Python package161 as described in Gorin et al.21 Fitting the model with no technical noise entailed gradient optimization over the per-gene joint distributions to fit b_j, β_j, and γ_j. Although the model did not explicitly include technical noise, the theoretical discussion above implies that the results can be interpreted as those from a p = p_N = p_M model, with the inferred “burst size” corresponding to b_j p_j for gene j. Fitting the model with technical noise entailed scanning over a grid of C_N and λ_M, obtaining per-gene maximum likelihood estimates of b_j, β_j, and γ_j conditional on the technical parameter values at the grid point, then identifying the grid point which produced the lowest sum of Kullback-Leibler divergences over all genes. In both cases, the genes underwent a round of goodness-of-fit filtering to remove fits that did not accurately recapitulate the data, as in Section S4.3.5 of Gorin et al.21 We then computed the average inferred log_10 burst size for the genes falling into each length bin. As with the means, we plotted the average burst sizes at each bin center, connecting the values with a line to guide the eye. We repeated this analysis for all twelve datasets, distinguishing the results fit with and without a technical noise component by color.

Supplementary Material

Supplementary material
Supplementary Table 3

Table S3 Genes discovered to be overdispersed (σ2>2μ) in empty droplets for each dataset in Table S2.

Supplementary Table 4

Table S4 Genes discovered to be overdispersed (σ2>2μ) in empty droplets for the neuron_1k_v3 and desai_dmso datasets, with function annotations.

Box 1. Generating function methods for studying stochastic biological systems.

Generating functions are ubiquitous tools in stochastic modeling. They are central to the analysis of discrete master equations, as they cast difficult-to-solve infinite-dimensional systems to partial differential equations, which can be treated using standard analytical or numerical methods. A (one-variable) probability distribution P(x) and its generating function G(g) are related according to the formulas:

G(g) = \sum_{x=0}^{\infty} g^x P(x), \qquad P(x) = \oint \frac{dg}{2\pi i}\,\frac{G(g)}{g^{x+1}} = \int_{-\pi}^{\pi} \frac{d\theta}{2\pi}\, e^{-i\theta x}\, G(e^{i\theta}). \qquad (1)

In the stochastic modeling of transcription, certain distributions, such as the Poisson and negative binomial, frequently appear. Because G uniquely specifies P, we can often invert G simply by recognizing its form and matching terms. Below are some generating functions of common distributions (Bernoulli, Poisson, geometric, and negative binomial):

P(x) = (1 - p)\,\delta_{0x} + p\,\delta_{1x}, \qquad G(g) = 1 - p + pg, \qquad (2)
P(x) = \frac{\lambda^x e^{-\lambda}}{x!}, \qquad G(g) = e^{\lambda(g - 1)}, \qquad (3)
P(x) = \frac{1}{1 + \theta}\left(\frac{\theta}{1 + \theta}\right)^{x}, \qquad G(g) = \frac{1}{1 - \theta(g - 1)}, \qquad (4)
P(x) = \frac{\Gamma(v + x)}{x!\,\Gamma(v)}\left(\frac{1}{1 + \theta}\right)^{v}\left(\frac{\theta}{1 + \theta}\right)^{x}, \qquad G(g) = \left(\frac{1}{1 - \theta(g - 1)}\right)^{v}. \qquad (5)

The generating function expressions can often be made more compact by applying the substitution u := g − 1.

Box 2. An illustration of the solution procedure.

Here, we will illustrate how to solve two simple transcription models using our framework. We assume that RNA is produced with burst event frequency α and degrades at a rate γ. In the constitutive model, each transcription event creates one RNA. In the bursty model, each transcription event creates a random number of RNA, distributed according to a geometric random variable with mean b. Both models have N=1,n=1, and m=0. Since these models are one-dimensional, the C and D matrices are 1 × 1. For both of them, C=[γ] and D=[0]. The ODE for the single characteristic U (with initial condition U(𝗌=0)=u) is

\frac{dU(u, s)}{ds} = -\gamma U(u, s) \quad\Longrightarrow\quad U(s) = u e^{-\gamma s}. \qquad (23)

For a general burst distribution p(z), the transcriptional evolution operator is 𝓐(u) = α(F(1 + u) − 1), where F is the GF of the number of molecules produced per transcription event. For our two models, we have

p(z) = \delta_{1,z}, \qquad F(1 + u) = 1 + u, \qquad \mathcal{A}(u) = \alpha u, \qquad (24)
p(z) = \frac{b^z}{(1 + b)^{z + 1}}, \qquad F(1 + u) = \frac{1}{1 - bu}, \qquad \mathcal{A}(u) = \frac{\alpha b u}{1 - bu}. \qquad (25)

To compute the stationary log-generating functions logG, we evaluate the integrals:

\log G = \int_0^\infty \alpha u e^{-\gamma s}\, ds = \frac{u\alpha}{\gamma} \quad \text{for the constitutive model, and} \qquad (26)
\log G = \int_0^\infty \alpha\left[\frac{1}{1 - bu e^{-\gamma s}} - 1\right] ds = -\frac{\alpha}{\gamma}\log(1 - bu) \quad \text{for the bursty model.}

The constitutive model yields a Poisson distribution with mean α/γ (c.f. Equation 3), whereas the bursty model yields a negative binomial distribution with shape α/γ and scale b (c.f. Equation 5).
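As a numerical check of this Box, one can invert the bursty model’s stationary GF on the unit circle by FFT and compare with the negative binomial PMF; the rate values below are illustrative.

import numpy as np
from scipy.stats import nbinom

alpha, gamma_, b = 1.5, 1.0, 4.0     # illustrative burst frequency, degradation rate, burst size
M = 256                              # support size used for the inversion

g = np.exp(2j * np.pi * np.arange(M) / M)             # PGF arguments on the unit circle
logG = -(alpha / gamma_) * np.log(1.0 - b * (g - 1.0))
pmf_fft = np.real(np.fft.fft(np.exp(logG))) / M       # P(x) = (1/M) sum_k G(g_k) exp(-2*pi*i*k*x/M)

pmf_exact = nbinom.pmf(np.arange(M), alpha / gamma_, 1.0 / (1.0 + b))
print(np.abs(pmf_fft - pmf_exact).max())              # near machine precision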

ACKNOWLEDGMENTS

G.G. and L.P. were partially funded by NIH 5UM1HG01207702 and NIH U19MH114830. J.V. was partially funded by NIH 1U19NS118246-01. The RNA, DNA, and cDNA illustrations were derived from the DNA Twemoji by Twitter, Inc., used under the CC-BY 4.0 license. The authors thank Dr. A. Sina Booeshaghi, Maria Carilli, Tara Chari, Taleen Dilanyan, Dr. Kristján Eldjárn Hjörleifsson, Meichen Fang, Catherine Felce, and Delaney Sullivan for fruitful discussions of co-regulation, contamination, transient behaviors, catalysis, fragmentation, genomic alignment, and a variety of other phenomena and processes. Part of this work was performed during G.G.’s Data Sciences Co-op with Celsius Therapeutics, Inc.

Footnotes

DECLARATION OF INTERESTS

The authors declare no competing interests.


REFERENCES


Data Availability Statement

  • This paper analyzes existing, publicly available data. The accession numbers for the datasets are listed in the key resources table. Pseudoaligned count matrices in the mtx format have been deposited at the Zenodo package 8132976. The data, Monod fits, and analysis scripts used to generate Figure 5d-e, originating from Gorin et al.21, were previously deposited as the Zenodo package 7388133.

  • All original code has been deposited at https://github.com/pachterlab/GVP_2023 and the Zenodo package 8132976, and is publicly available as of the date of publication. DOIs are listed in the key resources table.

  • Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request.

KEY RESOURCES TABLE

REAGENT or RESOURCE SOURCE IDENTIFIER
Deposited Data
H. sapiens peripheral blood 10x v3 scRNA-seq data 178 pbmc_1k_v3
M. musculus heart 10x v3 scRNA-seq data 179 heart_1k_v3
M. musculus neuron 10x v3 scRNA-seq data 180 neuron_1k_v3
M. musculus cultured embryonic stem cells treated with DMSO 10x v2 scRNA-seq data Desai et al. desai_dmso
H. sapiens peripheral blood 10x v2 scRNA-seq data (technical replicate of pbmc_1k_v3) 181 pbmc_1k_v2
M. musculus neuron 10x v3 snRNA-seq data 182 brain_nuc_5k_v3
Supporting data for GP_2021_3 Gorin and Pachter Zenodo: dataset 7388133
Software and Algorithms
Python python.org 3.9.1
NumPy numpy.org 1.22.1
SciPy scipy.org 1.7.3
pandas pandas.pydata.org 1.2.4
kallisto | bustools Melsted and Booeshaghi et al. 0.26.0
Monod Gorin and Pachter 2.5.0
Other
Count matrices for all datasets This manuscript Zenodo: dataset 8132976
Custom analysis notebooks This manuscript GitHub: https://github.com/pachterlab/GVP_2023 (version of record deposited at Zenodo: dataset 8132976)

RESOURCES