Abstract
The rise of single-cell data highlights the need for a nondeterministic view of gene expression, while offering new opportunities regarding gene regulatory network inference. We recently introduced two strategies that specifically exploit time-course data, where single-cell profiling is performed after a stimulus: HARISSA, a mechanistic network model with a highly efficient simulation procedure, and CARDAMOM, a scalable inference method seen as model calibration. Here, we combine the two approaches and show that the same model driven by transcriptional bursting can be used simultaneously as an inference tool, to reconstruct biologically relevant networks, and as a simulation tool, to generate realistic transcriptional profiles emerging from gene interactions. We verify that CARDAMOM quantitatively reconstructs causal links when the data is simulated from HARISSA, and demonstrate its performance on experimental data collected on in vitro differentiating mouse embryonic stem cells. Overall, this integrated strategy largely overcomes the limitations of disconnected inference and simulation.
Author summary
Gene regulatory network (GRN) inference is an old problem, to which single-cell data has recently offered new challenges and breakthrough potential. Many GRN inference methods based on single-cell transcriptomic data have been developed over the last few years, while GRN simulation tools have also been proposed for generating synthetic datasets with realistic features. However, except for benchmarking purposes, these two fields remain largely disconnected. In this work, building on a combination of two methods we recently described, we show that a particular GRN model can be used simultaneously as an inference tool, to reconstruct a biologically relevant network from time-course single-cell gene expression data, and as a simulation tool, to generate realistic transcriptional profiles in a non-trivial way through gene interactions. This integrated strategy demonstrates the benefits of using the same executable model for both simulation and inference.
Introduction
Cell decision making as a response to exogenous or endogenous stimuli (e.g., differentiation, proliferation, cell death or biological activity modulation) is often supported by time-dependent modulation of gene expression upon stimulation. Understanding how and why gene expression changes as a function of time in response to specific stimuli is therefore critical to understand the underlying biological processes.
The “how” question can now be approached using single-cell-based technologies, offering an unprecedented resolution and a much finer view than population-based measures [1, 2]. The “why” question relates to the functioning of an underlying gene regulatory network (GRN) which describes interactions between genes through their expression products. GRNs are thus a central notion for understanding and predicting cellular behavior, but their construction from literature is a very laborious task, sometimes even impossible due to the lack of knowledge.
Reconstructing most-likely GRNs from transcriptomic datasets has therefore become a major goal in systems biology [3] but is also notoriously difficult, especially in the case of single-cell transcriptomic data. Indeed, the bursty synthesis of mRNAs, now clearly evidenced [4, 5], gives rise to highly variable and non-Gaussian expression data [1, 6], and current GRN inference methods employ a wide range of statistical and modeling tools [7]. Methods based on a specific dynamical model, called here GRN models, have the great advantage of providing biological interpretability, since each inferred interaction between genes can be understood in terms of model behavior. Moreover, such an approach generally provides interactions with their direction and intensity, which is not the case for most purely statistical methods.
In this article, we make a distinction between mechanistic GRN models (e.g., built on the biological understanding of differentiation mechanisms), for which cell behavior appears as an emergent property of gene interactions, and phenomenological models, for which the expected outcome is directly prescribed by some dedicated parameters. In the latter case, although such parameters can still have a biological meaning, the cellular behavior is not biologically emerging but rather ‘hard-coded’ by the model. As developed afterwards, many GRN models fall in between: some aspects of gene expression patterns are then hard-coded instead of emerging from interactions between genes, and gene expression stochasticity is often assumed to be driven by Gaussian white noise only, requiring ad hoc additional noise to match the data.
Moreover, the results of a method based on a mechanistic model can only be considered relevant if the model is able to correctly reproduce single-cell datasets. For instance, it is now widely accepted that the transcriptional bursting phenomenon is associated with specific patterns of gene expression products [8, 9], making continuous single-cell data close to Gamma distributions [10] and discrete data close to negative binomial distributions [11], the latter being themselves mathematically equivalent to Poisson distributions with Gamma-distributed random parameters. Thus, executable network models should at least be able to generate these patterns in their marginal distributions. In any case, the use of a mechanistic model-based method requires prior strong evidence that the underlying model is relevant for simulating realistic single-cell transcriptomic datasets.
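The equivalence between negative binomial counts and Poisson counts with Gamma-distributed rates can be checked numerically. The following is a minimal sketch using only the Python standard library; the shape and scale values are illustrative assumptions and are not fitted to any dataset discussed here. For a rate drawn from a Gamma distribution with shape a and scale θ, the resulting counts have mean aθ and variance aθ(1 + θ), so the variance always exceeds the mean.

```python
import math
import random


def sample_poisson(rng, lam):
    """Draw one Poisson(lam) variate (Knuth's method, adequate for moderate lam)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def sample_gamma_poisson(rng, shape, scale, n):
    """Draw n counts: rate ~ Gamma(shape, scale), then count ~ Poisson(rate)."""
    return [sample_poisson(rng, rng.gammavariate(shape, scale)) for _ in range(n)]


def negative_binomial_moments(shape, scale):
    """Theoretical mean and variance of the Gamma-Poisson (negative binomial) mixture."""
    mean = shape * scale
    variance = mean * (1.0 + scale)
    return mean, variance


rng = random.Random(0)
shape, scale = 0.5, 4.0  # illustrative values only (assumption)
counts = sample_gamma_poisson(rng, shape, scale, 50_000)

emp_mean = sum(counts) / len(counts)
emp_var = sum((c - emp_mean) ** 2 for c in counts) / len(counts)
theo_mean, theo_var = negative_binomial_moments(shape, scale)

print(f"empirical mean/variance:   {emp_mean:.2f} / {emp_var:.2f}")
print(f"theoretical mean/variance: {theo_mean:.2f} / {theo_var:.2f}")
```

The empirical moments should be close to the theoretical ones, with a variance well above the mean, which is the overdispersion signature that a purely Poisson model cannot produce.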
We recently developed several methods for inferring GRNs from single-cell data based on a particular mechanistic network model, defined as a ‘multi-agent’ generalization of the well-known two-state stochastic model of gene expression [8] where genes are now being described by interacting two-state models [6]. These methods are well suited for single-cell RNA-seq (scRNA-seq) time-course data, each dataset being considered as a partial observation of the model at a certain time. Crucially, they do not require the observation of cell trajectories, whose inference is a problem in itself [12, 13], but only that the cells sampled at each timepoint are driven by the same dynamical process, i.e., resulting from the same GRN. Our first proposal was called WASABI [14], which uses a divide-and-conquer approach where the problem of GRN inference is solved one gene at a time. Although able to propose relevant GRNs, this approach suffered from two drawbacks: it required days of computation for a GRN with 50 genes, and proposed a potentially long list of candidate networks. We therefore developed two other methods: HARISSA [6], a GRN simulation algorithm based on the mechanistic model together with a proof-of-concept inference method derived from likelihood maximization, and CARDAMOM [15], a simplified and scalable alternative for the GRN inference part that crucially exploits the notions of landscape and metastability.
In this work, we sought to investigate the benefits of using this model as an integrated tool for both GRN inference and data simulation. We therefore assessed its ability to allow for efficient network reconstruction from time-course scRNA-seq data, while accurately reproducing the main features of the dataset from the functioning of the inferred network. Note that to the best of our knowledge, this is not performed by existing GRN-based simulation tools, which are generally based on more phenomenological than mechanistic models, with at least some important aspects of gene expression patterns, such as transitions between cell types [16] or gene expression variability [16, 17], being hard-coded instead of arising from biological mechanisms.
After introducing the setup of our benchmark made from in silico datasets generated with the mechanistic model, we first evaluate the performances of HARISSA and CARDAMOM together with four state-of-the-art GRN inference algorithms: GENIE3 [18], PIDC [19], SINCERITIES [20] and SCRIBE [21]. We study the limits of the different categories of inference methods in the case of transcriptional bursting, and verify that the two model-based methods perform better than the others on these datasets. CARDAMOM appears as the best performing algorithm during this benchmark step, which only considers network structures. Importantly, the output of this algorithm is not only a matrix of interaction scores, but also a set of quantitative parameters that can be plugged into the GRN model for simulations.
In a second step, we use CARDAMOM to calibrate the model with a real time-stamped scRNA-seq dataset of differentiating mouse embryonic stem (ES) cells [22]. We demonstrate the ability of the model to reproduce the global features of real time-course transcriptomic profiles. We also show that most of the inferred interactions are indeed supported by biological evidence such as ChIP-seq experiments, although this evidence was not used during the inference process. Altogether, these results establish the ability of an executable network model not only to simulate realistic single-cell datasets, but also to provide an effective reverse-engineering algorithm capable of reproducing the main gene expression patterns of an experimental dataset as emergent properties of the underlying GRN.
Results
HARISSA simulates single-cell datasets from a mechanistic GRN model
We first wanted to benchmark the ability of the different inference algorithms to reconstruct correct network structures from in silico generated datasets, i.e., when the ground truth is known. For this, we used the simulation module of HARISSA [6], which generates trajectories of a mechanistic model describing gene expression dynamics (both mRNA and the corresponding proteins) within a single cell, these dynamics being influenced by an underlying GRN and driven by transcriptional bursting (see Simulation of the inferred network reproduces the original dataset and S1 Fig). As shown in previous work, this model is indeed able to generate scRNA-seq datasets with realistic marginal distributions [23].
We simulated nine datasets corresponding to different network structures (Fig 1): a network of 4 genes with a branching structure and inhibition feedback loop (FN4); a network of 5 genes with a cycling structure (CN5); a network of 8 genes with multiple branching structure and feedback loops (FN8); a network of 8 genes with branching trajectories (BN8); networks with a tree structure of 5, 10, 20, 50, and 100 genes (Trees). These networks represent the main types of network structures that have been used for benchmarking GRN inference algorithms [17]. Overall, the objective was to reproduce time-course experiments in which single-cell profiling is performed after a given stimulus, typically a change of medium [22, 24, 25]. This stimulus was therefore taken into account in all the simulations, in the form of a virtual gene defined as being inactive before the beginning of the experiment and fully activated afterwards.
For each network structure (Fig 1A), the transcriptional bursting model implies that typical single-cell trajectories do not follow a diffusion-like process (at least in the space of mRNA levels), and differ strongly from the more usual and intuitive population-average trajectory (Fig 1B). The practical datasets were obtained by sampling independent cells at a specific sequence of timepoints, therefore not keeping the real cell trajectories but rather considering different cells at each timepoint, forming time-stamped snapshots (Fig 1C). Interestingly, both feedback networks (FN4 and especially FN8) produce a recognizable “differentiation trajectory” across the UMAP space with a clear temporal order of cells. Due to the stochastic nature of cell trajectories generated by the mechanistic model, branching trajectories in snapshots only appear in specific cases, generally when a toggle-switch is dominating the GRN structure and then generates distinct branches in the UMAP representation (see BN8 in Fig 1).
As mentioned previously, HARISSA consists of two modules, one for simulation and one for inference. Whereas the original inference module of HARISSA was limited to a few genes [6], it recently integrated an effective CARDAMOM-inspired simplification [23] that makes it possible to infer networks with a much larger number of genes. We therefore also benchmarked this method along with the others.
CARDAMOM quantitatively reconstructs causal GRN links
We then inferred GRN structures from the in silico generated datasets using the six algorithms presented in the Simulation of the inferred network reproduces the original dataset section (HARISSA, CARDAMOM, GENIE3, PIDC, SINCERITIES, and SCRIBE). Note that neither GENIE3 nor PIDC is able to use the temporal information (except for the stimulus state information, which they are also provided with), giving them a disadvantage compared to the other algorithms. They were nevertheless used in the benchmark as they are considered to be among the best algorithms for single-cell data, and given that very few algorithms are specifically adapted to time-stamped datasets. Indeed, most methods are limited to static data, and those that are not (such as SCRIBE) require temporally-ordered cell trajectories instead of independent snapshots, thus requiring a pre-processing step that can itself be subject to errors. Moreover, it was not known how they would fare in a time-course setting with transcriptional bursting, which was an interesting question per se.
We also emphasize that among these algorithms, only CARDAMOM and HARISSA have the significant advantage of providing biological interpretability, thanks to the mechanistic model on which they are based: here the network parameters are not mere interaction scores, but quantitative parameters that can be plugged into the model for simulations. Besides, the main objective is really to reverse engineer such a model: from this perspective, despite the obvious advantage of CARDAMOM and HARISSA being built on the same mathematical framework as the one used to generate the data, even similar performances compared to the other algorithms would be satisfying.
Inference was performed on ten independent datasets for each network, and the results were merged into the area under the precision-recall curve (AUPR) which measures the quality of the inferred GRN structure. We also compared the inferred GRNs with a naive method consisting in assigning to each edge of the network the value given by the Pearson correlation coefficient between the corresponding genes (abbreviated as PEARSON): this comparison with Pearson coefficients makes it possible to verify, when the algorithms show good performances, that these are not only due to highly correlated data which are thus not difficult to analyze. The results are presented in Fig 2A and 2B for the first five algorithms. We present the results for SCRIBE separately in Fig 2C because this algorithm requires temporally or pseudo-temporally ordered trajectories, and the results then depend on the pre-processing that is applied on the time-stamped data.
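As an illustration of how a matrix of interaction scores can be compared with a known network structure, the sketch below computes the AUPR as an average precision, in both the directed and undirected settings. It is a simplified stand-in and not the evaluation code used for Fig 2: the convention that entry (i, j) describes the edge i → j, the exclusion of self-interactions, the handling of ties by sort order, and the toy matrices are all assumptions made for the example.

```python
def edge_list(scores, truth, directed=True):
    """Pair each candidate edge's absolute score with its ground-truth label."""
    n = len(scores)
    pairs = []
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # self-interactions are ignored in this sketch
            if directed:
                pairs.append((abs(scores[i][j]), bool(truth[i][j])))
            elif i < j:
                # undirected: keep the strongest of the two directions
                s = max(abs(scores[i][j]), abs(scores[j][i]))
                pairs.append((s, bool(truth[i][j] or truth[j][i])))
    return pairs


def aupr(scores, truth, directed=True):
    """Area under the precision-recall curve, computed as average precision."""
    pairs = sorted(edge_list(scores, truth, directed), key=lambda p: -p[0])
    n_pos = sum(label for _, label in pairs)
    if n_pos == 0:
        raise ValueError("ground truth contains no edge")
    hits, area = 0, 0.0
    for rank, (_, label) in enumerate(pairs, start=1):
        if label:
            hits += 1
            area += hits / rank  # precision at the rank of each true edge
    return area / n_pos


# Toy ground truth: a chain 0 -> 1 -> 2 (hypothetical example)
truth = [[0, 1, 0],
         [0, 0, 1],
         [0, 0, 0]]

# Signed, asymmetric scores, as a model-based method could return
asymmetric = [[0.0, 0.9, 0.1],
              [0.2, 0.0, -0.8],
              [0.0, 0.1, 0.0]]

# Symmetric scores, as a correlation-based method would return
symmetric = [[0.0, 0.9, 0.3],
             [0.9, 0.0, 0.8],
             [0.3, 0.8, 0.0]]

print(aupr(asymmetric, truth, directed=True))
print(aupr(symmetric, truth, directed=False))
print(aupr(symmetric, truth, directed=True))
```

In this toy case the symmetric scores recover the undirected structure perfectly but necessarily lose precision in the directed setting, since each true edge i → j receives the same score as the absent edge j → i.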
CARDAMOM and HARISSA appeared to outperform the other algorithms for most of these datasets. In particular, in terms of directed interactions, these two methods always clearly performed better than the others. The undirected networks for which GENIE3 and PIDC have similar performances (CN5 and BN8) correspond to cases where the Pearson correlation method is also accurate.
Also, although GENIE3 and PIDC represent an improvement over the Pearson correlation method, they seem to perform poorly when the correlation between genes is not sufficient to infer a reliable GRN. More precisely, we observe that GENIE3 and PIDC are accurate for tree-like networks (Trees), and even for networks with bifurcating trajectories (BN8) and cycling (CN5), which was not the case in [17]. On the contrary, SINCERITIES performs very poorly for these types of networks, but seems competitive for networks with feedback loops (FN4 and FN8) where GENIE3 and PIDC have lower performances. These networks are more difficult to reconstruct. Indeed, as visible in Fig 1, the population-average trajectories of some genes are very similar. Some genes also have the same marginal distribution of mRNA levels: for example in the network FN4, gene 2 and gene 3 have the same input (gene 1), so their marginal distributions evolve similarly at each timepoint. SINCERITIES, which bases the inference procedure on the approximate distribution for each gene, then fails to make this subtle distinction, illustrating the improvement that is typically expected from HARISSA and CARDAMOM. On all the networks, GENIE3 fails to reliably infer the direction of the interaction, i.e., to distinguish the interaction i → j from the interaction j → i. On the contrary, because of their mechanistic assumptions, CARDAMOM, HARISSA and SINCERITIES always have quite similar results for directed and undirected inference. Finally, we observed that CARDAMOM outperforms HARISSA on most of the networks.
Regarding SCRIBE, which is a trajectory-based method, we tested its performances in three scenarios (Fig 2C):
When we have access to real trajectories (Real traj.): each cell at each timepoint is associated with a real ancestor at the previous timepoint and a real descendant at the following one. Such knowledge can of course only be accessed with in silico generated datasets or in vitro for a very limited number of genes by using live-cell imaging of short-lived transcriptional reporters [26];
When the dataset is the same as for the other methods (i.e., no access to real trajectories), and each cell at each timepoint is associated to a pseudo-ancestor at the previous timepoint and a pseudo-descendant at the following one, using the Waddington-OT method described in [27] (Coupling);
When the dataset is the same as for the other methods (i.e., no access to real trajectories), and the algorithm SLINGSHOT [28] is used for reconstructing a pseudo-temporally ordered trajectory (Pseudotime).
We observed that SCRIBE performs well in scenario 1, but poorly in scenarios 2 and 3, at least on the tested networks (Fig 2C). These poor performances are due to the loss of temporal coupling between measurements of genes that interact. They suggest that neither optimal coupling nor pseudotime reconstruction is sufficiently accurate for GRN inference in the case of transcriptional bursting. Concerning the optimal coupling method, we note that this might be due to the “movement by diffusion” assumption on which the Waddington-OT method is built, which does not take into account the constraints on the trajectories imposed by the GRN.
When computing the average runtime of each algorithm on the tree-like networks, we observed that except for SCRIBE, all algorithms are suitable for inferring GRNs with a realistic number of genes (see S1 Table). Thus, due to this computational limit and its poor performances when using time-stamped data, we did not consider SCRIBE for further analysis.
We then investigated the limit performances with respect to the number of cells and/or timepoints. We observed that the performances of the first five algorithms decrease for the tree-like networks when the number of genes increases (Fig 2). This can be due to three main factors:
A sequence of timepoints too coarse in relation to the dynamics would directly lead to a lack of inference accuracy;
A sequence of timepoints which is too restricted may not reveal interactions involving some genes that are regulated late in the process. For example, in Fig 2, we observe that the inference on the Trees networks is very poor for more than ten genes: this comes from the fact that some genes are never activated before 96h;
The number of cells at each timepoint can simply not be enough to infer a reliable GRN.
We therefore investigated the effects of these three factors on the accuracy of the algorithms by studying their performances in terms of AUPR for ten datasets generated from ten randomly-generated tree-like networks of ten genes, when varying the number of cells at each timepoint (Fig 3A), the length of the interval for a fixed time gap between each timepoint (Fig 3B), and the density of the sequence of timepoints for an interval with fixed length (Fig 3C). As anticipated, all these conditions have an impact on the quality of the inference: increasing their values tends to produce a better quality of inference. We also observed that the number of sampled cells seems less critical than the other factors, confirming that few cells at a sequence of timepoints which is dense and long enough is preferable to many cells on a sequence of timepoints which is too coarse and/or too short. This should be kept in mind when designing single-cell transcriptomics experiments aiming at GRN inference.
Hence, both CARDAMOM and HARISSA, with an advantage for CARDAMOM, efficiently reconstructed network structures by reverse engineering the generative model on which they are based. We then needed to test the ability of this model to reproduce an experimental dataset from the literature after network inference.
Application to a real dataset yields a biologically relevant network
As a test case, we used a time-stamped in vitro dataset from Semrau et al. [22] obtained by scRNA-seq of a retinoic acid (RA)-induced differentiation of mouse ES cells (see Simulation of the inferred network reproduces the original dataset). This well-characterized model system of in vitro differentiation recapitulates the transition from pluripotent embryonic stem cells towards two cellular lineages (ectoderm- and extraembryonic endoderm-like cells), all characterized by well-established molecular markers that were further used in GRN inference.
In order to interpret the resulting GRN, we sought to assess whether the inferred interactions are supported by known biochemical evidence of physical interaction between regulators and regulated genes (Fig 4). For this, we annotated the inferred edges coming from genes encoding known transcription regulators (i.e., transcription factors and cofactors) included in the network and for which ChIP-seq data are currently available in ES or the closely related embryonic carcinoma (EC) cell system. Since the RA stimulus exerts its differentiating effect mainly through the members of the RA-activated nuclear receptors subfamily RAR (NR1B) that encompasses 3 paralogs (i.e., RARα, RARβ and RARγ), the annotation of the interaction edges linking the stimulus and the regulated genes was based on the presence/absence of ChIP-seq peaks for any RAR paralog at less than 10 kb upstream or downstream of the annotated transcription start site (TSS) in RA-stimulated ES or EC cells [29, 30]. Although arbitrary, the chosen distance between TSS and DNA binding site for the indicated transcription regulator is relatively conservative, since transcriptional effects can be exerted from greater distances, up to megabases [31]. The absence of a supporting peak as defined here should therefore not be interpreted as a proof of absence of any direct modulating effect. Similarly, the edges supported by physical interaction data for Sox2 and Pou5f1, or Jarid2 were extracted from [32, 33].
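The annotation rule based on a 10 kb window around the TSS can be sketched as follows. This is only an illustration under stated assumptions: peaks are given as (start, end) intervals on the same chromosome as the gene, the distance is measured from the TSS to the nearest point of the peak interval (the text does not specify whether the summit or the boundary is used), and the function name and coordinates are hypothetical.

```python
def supported_by_peak(tss, peaks, window=10_000):
    """Return True if any peak (start, end) lies at less than `window` bp from the TSS."""
    for start, end in peaks:
        if start <= tss <= end:
            distance = 0  # the TSS falls inside the peak
        else:
            # distance from the TSS to the closest boundary of the peak
            distance = min(abs(tss - start), abs(tss - end))
        if distance < window:
            return True
    return False


# Hypothetical coordinates, for illustration only
tss = 1_000_000
peaks = [(1_004_200, 1_004_650), (1_250_000, 1_250_300)]
print(supported_by_peak(tss, peaks))  # True: the first peak is 4,200 bp away
```

With such a rule, an inferred edge from a regulator to a gene would be annotated as supported when the function returns True for the ChIP-seq peaks of that regulator; widening or narrowing `window` directly changes how conservative the annotation is.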
Using these known physical interactions as a ground truth, we compared the receiver operating characteristic (ROC) and precision-recall (PR) curves related to the network structures inferred by the four algorithms (Fig 4A and 4B). We observed that, in accordance with the previous results, CARDAMOM and HARISSA appear as the top-ranked algorithms, with very similar abilities to infer known edges.
We then examined the structure of the network inferred by CARDAMOM (Fig 5). Importantly, in agreement with its differentiating effect in ES/EC cell systems, we observed that the RA stimulus is densely connected with genes involved in pluripotency maintenance, as supported by multiple biological analyses [29, 30, 34], and to a lesser extent with nodes corresponding to genes associated with specific cell fates, this latter observation likely reflecting how the stimulus is modeled (see Simulation of the inferred network reproduces the original dataset). Notably, these last nodes also exhibit a relatively high interconnectivity (e.g., endodermal differentiation) as compared to intergroup connectivity. Although biologically interesting, these observations illustrate our previous conclusion and likely mirror the unbalanced experimental design, characterized by a dense sequence of timepoints during the early phase (0h to 36h) of the differentiation process analysed and a coarser sequence of timepoints in the mid (36h to 48h) and late (48h to 96h) phases of the process [22].
Most notably, the overwhelming majority (85%) of the inferred edges that involve the RA stimulus are supported by biochemical evidence (Fig 4C). Similarly, the edges inferred from Pou5f1, Sox2, and Jarid2 nodes are globally supported by physical interaction (2/3 for Pou5f1, 2/3 for Sox2, and 1/1 for Jarid2).
We also observed that some inferred edges are not supported by documented physical interactions, as expected for genes encoding proteins unable to directly interact with DNA. As an example of such a node, Sparc (also known as Osteonectin) appears highly connected to genes associated with all four cell states despite its inability to directly interact with the basal transcription machinery (i.e., RNA polymerase complex). However, the inferred edges are clearly in agreement with its documented role in promoting endodermal differentiation [35]. Additionally, unsupported inferred edges may reflect the lag between the expression of a regulator (and therefore its physical interaction with the regulated genes) and the observed transcriptional effect. By contrast with TFs that establish contact with the transcription machinery, modifying cofactors often catalyse deposition/erasure of epigenetic marks (e.g., acetylation/methylation of histones, DNA methylation) that will likely modulate transcription in a longer lasting manner. In this respect it is interesting to note that the Dnmt3b gene negatively interacts with many other genes in the network, which mirrors the fact that Dnmt3b is a de novo DNA methyltransferase, and has an indirect effect on gene regulation through CpG methylation, a well-documented epigenetic mark generally associated with gene expression silencing. Altogether, this illustrates that our GRN model does incorporate various epigenetic information or indirect effects and is not restricted to physical interactions between transcription factors and their target genes.
While most inferred edges involving genes that encode TFs appear to be supported by physical interactions, many physical interactions detected by ChIP-seq are missed by CARDAMOM (e.g., 50% for RA, see Fig 4C). This observation is however not necessarily the sign of a lack of accuracy of the inferred GRN, since the detection of a physical interaction is not per se the hallmark of a modulating effect on the transcription level of the target gene [30]. Additionally, some specific regulatory structures are notoriously difficult to infer, as illustrated by the high failure rate (96%) in inferring edges from Jarid2, a component of a repressive complex expressed in the pluripotent state and directly involved in the silencing of differentiation-associated genes. Interestingly, the interactions between Jarid2 and most of its physical targets present in our GRN were instead wrongly detected as inhibitory effects of the regulated genes on their regulator. This is due to the fact that CARDAMOM works by going forward in time, and thus fails to capture an inhibition that acts at the beginning of the process and can only be detected later: instead, it is prone to interpret the increase of the repressed genes as the effect of other intermediate genes, and the decrease of the repressor as an inhibitory effect of the repressed genes. We discuss further such bias of the algorithm in the Simulation of the inferred network reproduces the original dataset section.
For a better understanding of the inferred GRN dynamics, we also examined a dynamical network representation, where each edge appears at the timepoint for which it was detected with the strongest intensity by the inference algorithm (Fig 6). Unsurprisingly, the RA stimulus is detected at the earliest timepoint of the response (6h) and then ceases to influence the signal, which propagates in waves through the network as we described in a previous study [14]. For example, we do clearly observe the late increase of interactions for genes involved in specifying the extraembryonic endoderm.
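The construction of such a dynamical representation can be sketched as follows, assuming that the inference step yields one matrix of signed interaction intensities per timepoint. This data layout, the function name, the optional threshold, and the toy values are assumptions made for illustration and do not describe the actual output format of the inference algorithm.

```python
def strongest_timepoint(timepoints, matrices, threshold=0.0):
    """Map each edge (i, j) to the timepoint where its absolute intensity is largest.

    `matrices[k][i][j]` is the intensity of edge i -> j at `timepoints[k]`.
    Edges whose largest absolute intensity does not exceed `threshold` are dropped.
    """
    n = len(matrices[0])
    assignment = {}
    for i in range(n):
        for j in range(n):
            best = max(range(len(timepoints)), key=lambda k: abs(matrices[k][i][j]))
            if abs(matrices[best][i][j]) > threshold:
                assignment[(i, j)] = timepoints[best]
    return assignment


# Toy example with two nodes and three timepoints (hypothetical values)
timepoints = [0, 6, 12]
matrices = [
    [[0.0, 0.0], [0.0, 0.0]],   # 0h: no interaction detected
    [[0.0, 0.8], [0.0, 0.0]],   # 6h: edge 0 -> 1 is strongest here
    [[0.0, 0.3], [-0.5, 0.0]],  # 12h: inhibition 1 -> 0 appears
]
print(strongest_timepoint(timepoints, matrices))
```

In this toy case, edge 0 → 1 is drawn at 6h and edge 1 → 0 at 12h, which is the kind of wave-like ordering of edges that the dynamical representation is meant to reveal.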
The network inferred by SINCERITIES is shown in S7 Fig, which can be compared with Fig 5. Although a detailed analysis of this network is left to the interested reader, we observe that while some important characteristics of the network structure are similar to the ones of CARDAMOM (RA stimulus highly connected with genes involved in pluripotency maintenance, high connectivity of Sparc and Pou5f1), the correspondence ratios with ChIP-seq data are not as good as for CARDAMOM (in agreement with Fig 4A and 4B) as SINCERITIES seems to infer more interactions at the cost of more errors.
Simulation of the inferred network reproduces the original dataset
While inferring the GRN structure, CARDAMOM also inferred all the other parameters of the model, as described in Simulation of the inferred network reproduces the original dataset, except the mRNA degradation rates d0,i for each gene i. These parameters d0,i are not negligible as they scale the dynamics of the process. To address this problem, we used values from the literature that can be found in [36] (see Simulation of the inferred network reproduces the original dataset and S2 Table). Once the model has been calibrated, one can simulate an in silico dataset and sample the nine timepoints corresponding to the in vitro experiment. More precisely, we simulated two different datasets after calibrating the mechanistic model: one actually using the inferred network interactions, and one corresponding to the “null network” defined by removing the interactions (i.e., all genes individually calibrated with the same parameters but kept independent).
We first decided to verify, as advocated by Soneson et al. [37], the suitability of our generated synthetic data. We used countsimQC [37], a recent tool for comparing multivariate single-cell datasets, already used for benchmarking synthetic scRNA-seq data [38] (see Simulation of the inferred network reproduces the original dataset for more details). The synthetic dataset indeed mimics experimental data for a large number of tested characteristics (S2 Fig). However, we observed that except for correlations, the features considered by countsimQC are also well reproduced by the dataset simulated with the null network. This suggests that countsimQC features are not sufficient for measuring the accuracy of the dataset reproduction.
We then explored the ability of the synthetic dataset to recapitulate more sophisticated dimensions of the experimental data. At that stage, the critical question concerns the temporality of the synthetic data. Indeed, as developed in [15] and [23], none of the algorithms used for the benchmark (and presented in Simulation of the inferred network reproduces the original dataset) allows to take into account the real temporality of the data: the only information they use is the order of the sequence of timepoints at which the cells are measured. It is therefore not necessarily expected that a dataset simulated with the network inferred by CARDAMOM can reproduce the data distribution exactly at the same timepoints. The temporality is taken into account in a second time, by setting the value of the degradation rates from the literature. However, the hypothesis that these degradation rates are not time-dependent (which of course oversimplifies the biological reality [39]) may prevent us from being able to perfectly fit the time-dependent evolution of the data.
We observed that this hypothesis indeed limits our ability to simulate the true dynamics at the last timepoint. In particular, the process seems to accelerate between 72h and 96h, and with a single set of degradation rates the model cannot match both the dynamics between 0h and 72h and those between 72h and 96h. It is important to note that such a global variation in degradation rates has been observed experimentally during the differentiation of chicken erythroid progenitors (see [24] and data at https://osf.io/k2q5b/). We thus decided to multiply the degradation rates by a scaling factor after 72h, allowing the process to reach its final state in time.
We compared these datasets using different metrics. First, we examined the extent to which the simulations matched the experimental marginal distributions of each gene. In Fig 7A, we represent the time-dependent evolution of the p-value of a Kolmogorov-Smirnov test between the GRN-generated distributions and the experimental dataset. Some genes are better fitted than others, as exemplified by the Esrrb gene, which is correctly fitted, while the Sparc gene seems more difficult to capture (see Fig 7B). We nevertheless observe that for most genes and timepoints, the p-value is above 5%, meaning that the marginal distributions of the experimental data are quite well reproduced by the GRN model. This observation is confirmed by the smaller p-values (i.e., significant discrepancies from the experimental data) obtained for many more genes and timepoints when removing interactions between genes (S6 Fig). We compare in Fig 7C the mean Earth Mover's Distance (EMD), and in Fig 7D the mean p-value of the Kolmogorov-Smirnov test applied on the 41 genes at each timepoint, between the empirical distributions of the experimental dataset and the two simulated datasets (with and without GRN). We observe that without GRN, the distance between the distributions generated by the model and the experimental ones is constantly increasing, that is, the simulated distributions diverge from the experimental dataset (Fig 7C). This is corroborated by the fact that the mean p-values are decreasing monotonically (i.e., the model’s output is more and more significantly different from the experimentally observed distributions, see Fig 7D). The behavior of the GRN-simulated dataset is much closer to the experimental one, as seen from a smaller (and constant) EMD as well as a larger mean p-value. However, we observe in Fig 7A that the mid-timepoints (the central portion of the dynamics) seem to be the most difficult to capture, since mid-time p-values are often lower than at the beginning or end of the dynamics.
This is corroborated by S3 Fig, where we plotted the temporal marginal distributions of six genes, and where Sox2 and Sparc in particular appear to have a final distribution close to the experimental one but not the correct transient behavior.
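The two per-gene metrics used above can be sketched in a few lines. The following is a minimal, standard-library illustration and not the code used for Fig 7: the function names, the dictionary layout `{gene: list of counts}` and the asymptotic p-value formula (with a standard small-sample correction) are assumptions made for this example, and the p-value is only approximate for small samples or count data with many ties.

```python
import math


def ks_two_sample(x, y):
    """Two-sample Kolmogorov-Smirnov statistic with an asymptotic p-value."""
    xs, ys = sorted(x), sorted(y)
    n, m = len(xs), len(ys)
    i = j = 0
    d = 0.0
    # Sweep the pooled values and track the gap between the two empirical CDFs.
    while i < n and j < m:
        v = min(xs[i], ys[j])
        while i < n and xs[i] <= v:
            i += 1
        while j < m and ys[j] <= v:
            j += 1
        d = max(d, abs(i / n - j / m))
    # Asymptotic Kolmogorov law; only approximate for small or tied samples.
    en = math.sqrt(n * m / (n + m))
    lam = (en + 0.12 + 0.11 / en) * d
    if lam < 0.3:  # the alternating series is unreliable here and p is ~1
        return d, 1.0
    p = 2.0 * sum((-1) ** (k - 1) * math.exp(-2.0 * (k * lam) ** 2)
                  for k in range(1, 101))
    return d, min(1.0, max(0.0, p))


def emd_1d(x, y):
    """1D Earth Mover's Distance: integral of |F_x - F_y| over the pooled support."""
    xs, ys = sorted(x), sorted(y)
    n, m = len(xs), len(ys)
    grid = sorted(set(xs) | set(ys))
    i = j = 0
    total = 0.0
    for left, right in zip(grid[:-1], grid[1:]):
        while i < n and xs[i] <= left:
            i += 1
        while j < m and ys[j] <= left:
            j += 1
        total += abs(i / n - j / m) * (right - left)
    return total


def compare_marginals(experimental, simulated):
    """Per-gene (KS p-value, EMD) for two {gene: list of counts} dictionaries."""
    return {gene: (ks_two_sample(experimental[gene], simulated[gene])[1],
                   emd_1d(experimental[gene], simulated[gene]))
            for gene in experimental}
```

Applying `compare_marginals` at each timepoint and averaging over genes would give curves of the kind summarized in Fig 7C and 7D. In practice, an established statistics library would normally be preferred over these hand-written versions.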
Finally, we were interested in how well we could capture the joint distributions. For this, we compared UMAP representations of the experimental dataset (Fig 7E) and the datasets simulated from the inferred network (Fig 7F) and from the null network (Fig 7G). These three datasets were projected on the same pseudo-axes based on the UMAP computed from the experimental dataset, using the methodology described by McInnes et al. [40]. This common projection allows a better side-by-side comparison between datasets. The GRN-generated data points closely mimic the experimental data points, and this resemblance is completely lost if all interactions are removed. The fact that UMAP is not linear requires some precautions, which we discuss in the Comparing datasets using UMAP section.
To conclude, we observed that the mechanistic model can reproduce the major characteristics of the gene expression patterns observed during a differentiation process examined at the single cell level. It also appears clearly that simulating the model with the network inferred by CARDAMOM significantly improves the fit to the experimental dataset compared to the simulation with the null network (see Fig 7 and S5 Fig).
Discussion
The major interest of the method we proposed in this work is that it uses the same model for both inferring and simulating a gene regulatory network. The simulation part is not a novelty: a growing number of algorithms are proposed for simulating realistic single-cell gene expression datasets [38], and some have already been used for benchmarking GRN reconstruction methods [16, 17, 41].
Part of the success of our GRN model lies in its ability to reproduce the main characteristics of cell-cell heterogeneity observed in experimental datasets by making stochasticity an inherent part of the model, instead of adding a noise term a posteriori (see Relevance to biological data). This is not the case for most algorithms used to simulate gene expression data associated with a regulatory network for benchmarking purposes [42, 43], even when they are based on underlying mechanistic models [16]. This point is also crucial for deriving analytical results that can be used for the reverse engineering of the model, as has been done for HARISSA and CARDAMOM.
Among the recently described algorithms, SERGIO [16] and BoolODE [17] are the closest to our work. Nevertheless, we want to highlight some key differences between these two approaches and ours:
SERGIO and BoolODE mechanistic models are based on SDEs, treating noise as a Gaussian white noise. This is clearly insufficient to capture the biological zeros and Gamma-shaped variability, which in these methods arise only from the addition of technical noise. In our modeling scheme, these features arise naturally from the transcriptional bursting phenomenon.
As stressed by the authors, both methods seek to simulate data with an explicit GRN as an input, a forward simulation goal, rather than attempt to estimate it from data, a reverse engineering goal. This is a fundamental difference with our work where we seek simultaneously to infer and to simulate data. To the best of our knowledge, the ability to do so using a mechanistic model is a true novelty of our work.
Our modeling scheme is amenable to in-depth mathematical analysis [6, 15, 23, 44], which is not the case of SERGIO. For BoolODE, a similar mathematical analysis of the SDE system should be possible, but the fact that gene expression stochasticity is driven only by Gaussian white noise would limit the application of such analysis.
The use of a specific module to add technical noise, as well as SERGIO’s ability to generate both spliced and unspliced versions of mRNAs, are welcome innovations that we will consider in future versions of our work.
We also mention that there has been a recent surge of interest in using generative adversarial networks (GANs) for producing realistic new single-cell transcriptomics data [45]. Although it can be an efficient strategy for data generation and augmentation, it behaves as a black box regarding the underlying biology. Our main added value here is that our model is based on the biophysical reality of the cell and provides a clear materialistic explanation for generated data. Since we have shown that this generative model can be calibrated from single-cell datasets, it can also be used for control purposes, aiming at controlling the cellular phenotype by interfering with the GRN behavior.
While the test case was made using a dataset obtained during a differentiation sequence, one should note that our approach can be applied to any biological process for which time-stamped single-cell transcriptomic data are obtained after applying a given stimulus. When such time-stamped snapshots are not available, the algorithm could in principle take as an input time-reconstructed data (i.e., artificially ordered snapshots). In that case, the quality of the inference will strictly depend on the effectiveness of the time reconstruction algorithm.
Although efficient and promising, the model and the method we presented here have some clearly identified limitations, which should guide future research efforts. First, we consider that the burst frequencies, which are critical parameters of the regulation, are sigmoid functions of protein levels. This implies that the distribution of a gene generally cannot have more than two modes [15], one associated with a low frequency of bursts and one with a high frequency of bursts, which correspond to the regions of the gene expression space where the sigmoid is relatively flat. Thus, if the distribution of a gene is more complex than a mixture of these two modes, the model is not expected to reproduce its dynamics accurately, since minor regulatory interactions might go undetected (especially if a third “hidden” mode is close to one of the two main modes). This seems to be the case, for example, for the slight decrease of Sparc at t = 24h, which would have been better captured by adding an extra mode. To address this kind of error, it may be necessary to increase the complexity of the model, for example by modeling the burst rate functions with a multi-layer perceptron rather than a sigmoid, as proposed in [15].
We also observe that, as for Sparc, there is a positive correlation between the connectivity of a gene in the network and the complexity of its marginal distribution, leading to lower p-values in Fig 7A for genes that are highly connected. This could be because genes with complex behavior have to be regulated by potentially many other genes for that behavior to be explained, and variations in their expression make them good candidates for explaining variations in the expression of other genes.
Second, CARDAMOM uses the temporality linearly, by taking into account the timepoints one after the other without any possibility of a backward step. This explains why the algorithm is unable to detect the known fact that a gene like Jarid2 inhibits some Extraembryonic and Neuroectoderm genes: indeed, these inhibitions could only be detected by going back in time when the target genes see their expression increase, in order to find the real cause of this effect. Instead, we have seen that the algorithm interprets it as an activation of some other genes. We believe that this limitation could be tackled by taking inspiration from the theory of Recurrent Neural Networks (RNNs), provided that the interpretability of the results is preserved. Note that we already developed a similar analogy for the regression step at each timepoint, which can be interpreted as the learning step of a perceptron [15].
From that point of view, if the information was indeed transmitted forward, the network would be supposed to be completed step by step until reaching its complete form. However, two types of incompatibilities may still occur:
A direct incompatibility occurs when an edge which has been inferred at a certain timepoint is chosen to be reset to a value close to 0 or even to change sign at another one.
An indirect incompatibility corresponds to the case where the effect of an edge (j → i), which has been inferred at a certain timepoint, is compensated by another edge (k → i) inferred at a following timepoint, while the expression products of gene k were already high enough at the previous timepoint to thwart the effect of the relation (j → i).
This explains why, for the experimental dataset analyzed here, the model is not able to reproduce the behavior of some genes. One good example is Pou5f1 between t = 0h and t = 24h: indeed, the edges that could generate the slight increase of Pou5f1 at t = 6h are thwarted by the edges inferred at the following timepoints, which lead to the strong decrease of Pou5f1 after t = 48h. However, these incompatibilities could also have a biological meaning, and be impossible to solve by modifying the regression problems. To go further, it is necessary to discuss the notions of structure and states of the network, which are related to the importance of possibly hidden variables. If the network incorporates all critical nodes, then the structure should not change. The situation is different if there are hidden variables, such as genes whose levels are not measured, which results in apparent modifications of the network structure. For example, the problem of Pou5f1 presented above could be explained in the following way: at t = 0h, a hidden variable may act on the interaction by preventing it, before the hidden variable disappears at t = 24h. If the hidden variable were integrated, the network structure would no longer change, but in our case we should probably consider a modification of the network structure at t = 24h.
Third, the model does not allow the synthesis and degradation rates to vary over time. We have already mentioned that this was a problem for simulating the passage from t = 72h to t = 96h, and we decided to speed up the last time step for the model to reach its stationary distribution at 96h. We believe that most of the errors observed in the simulation with respect to the experimental dataset could be solved by finding appropriate degradation rates. Thus, a significant improvement would consist in taking into account, at each regression step, the size of the time interval, and not only its order in the series of timepoints as is currently the case. This could make it possible to find a most-likely GRN in accordance with the degradation rate at each timepoint, or even to infer a most-likely degradation rate for each timepoint. However, while the latter case may provide new information on the variation of degradation rates as well as better accuracy on the relative importance of interactions in the GRN, it could also accentuate the problems of identifiability, and should therefore be studied carefully.
Fourth, the model takes into account neither proliferation nor apoptosis while studying the stochasticity of the differentiation processes, nor the regulation of the proliferation rate by gene expression products. When sampling a distribution of n cells at a time t, the initial condition is built by sampling n cells uniformly among the set of cells at 0h, and their evolution is then simulated for t hours. However, if some cells have a higher death rate and others a higher division rate, the process should preferentially evolve in a certain direction in the gene expression space, which is ignored in the current version of CARDAMOM. Taking into account these characteristics is a notoriously difficult task: a significant improvement has been recently achieved with Waddington-OT [27, 46], where a stochastic diffusion process models gene expression dynamics. Extending this kind of approach to the mechanistic model will be the subject of future work.
Finally, future versions of our method may consider additional biological features such as spatial cell-cell communication. The advent of multiomics datasets should provide data making it possible to analyze the effect of these processes on differentiation, which is not possible with scRNA-seq data alone without seriously compromising identifiability. We believe that the work presented here could serve as a basis for developing multiscale approaches to differentiation processes.
Methods
Mechanistic model of gene regulatory networks
The model used throughout this article is based on a hybrid version of the well-established two-state model of gene expression [6], where a gene is described by the state of a promoter, which can be either on or off. If the promoter is on, mRNAs are transcribed at a rate s0, and are then translated into proteins at a rate s1. Degradation of mRNAs and proteins occurs at rates d0 and d1, respectively. The transitions between the off and on states occur at rates kon (off to on) and koff (on to off). We consider the bursty regime of this model (kon ≪ koff), corresponding to short active periods with high transcription rates, as experimentally observed [47–50]. In this regime, mRNA is transcribed in bursts of tens to hundreds of molecules. The random times at which these bursts occur are still described by an exponential distribution of parameter kon, and their random size by an exponential distribution with mean s0/koff. This model is compatible with experimental single-cell data, as steady-state mRNA levels follow for each gene a Gamma distribution, in line with continuous single-cell data [10].
The key idea is to incorporate this model into a network: the burst rate for each gene i is given by a gene-specific function kon,i(P), where P is the vector of protein quantities (S1 Fig). This function depends on proteins through a GRN, represented by an n-by-n matrix θ = (θij) where n is the number of genes in the network. The value of kon,i(P) then corresponds to the transcriptional burst frequency of gene i given protein levels P. Each parameter θij encodes the interaction j → i with its direction, sign, and intensity. Recent work suggests that burst sizes are smaller and more uniform than previously anticipated [49], therefore leaving more room for burst frequency modulation [51] as a mechanism for gene expression regulation. We therefore consider that interactions come mainly from the modulation of burst frequencies and that for any gene i, the rates koff,i do not depend on P. The burst frequencies can be represented by sigmoid functions [23] as a simplification of the mechanistic form used in [6, 14]:

$$k_{\mathrm{on},i}(P) = \frac{k_{0,i} + k_{1,i}\exp\left(\beta_i + \sum_{j=1}^{n}\theta_{ij}P_j\right)}{1 + \exp\left(\beta_i + \sum_{j=1}^{n}\theta_{ij}P_j\right)}$$

where k0,i (resp. k1,i) is the minimal (resp. maximal) burst frequency of gene i and βi is the basal activity of gene i, which can be also considered as the constant activity of a set of genes that are not measured but act on the network.
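As an illustration of how such a burst frequency function can be evaluated, here is a minimal sketch assuming a logistic sigmoid that interpolates between k0,i and k1,i, with argument βi plus the sum over j of θij Pj. The function name `burst_frequency` and the list-of-lists layout for θ (row i holds the effects of every gene j on gene i) are choices made for this example, not the HARISSA API.

```python
import math


def _sigmoid(s):
    """Numerically stable logistic function."""
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)


def burst_frequency(i, proteins, k0, k1, beta, theta):
    """Burst frequency of gene i given protein levels (assumed sigmoid form).

    theta[i][j] encodes the interaction j -> i, as in the text.
    """
    s = beta[i] + sum(theta[i][j] * p for j, p in enumerate(proteins))
    # Interpolate between the minimal (k0) and maximal (k1) burst frequencies.
    return k0[i] + (k1[i] - k0[i]) * _sigmoid(s)


# Hypothetical two-gene example: gene 1 activates gene 0.
k0, k1, beta = [0.0, 0.0], [2.0, 2.0], [0.0, 0.0]
theta = [[0.0, 10.0], [0.0, 0.0]]
low = burst_frequency(0, [0.0, 0.0], k0, k1, beta, theta)
high = burst_frequency(0, [0.0, 5.0], k0, k1, beta, theta)
```

With a zero argument the sigmoid sits halfway between k0,i and k1,i, while a strongly expressed activator pushes the burst frequency toward k1,i, which is the flat region of the sigmoid discussed in the text.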
Simulation of time-stamped datasets
In order to simulate the mechanistic model, we used the simulation module of the HARISSA package [23]. One computational advantage of this method, which consists in sampling burst times with maximum rate and then deciding with an appropriate rule which ones to keep, is that it is guaranteed to be exact without requiring any numerical integration.
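The idea of sampling candidate burst times at the maximal rate and then keeping only some of them (a thinning, or acceptance-rejection, scheme) can be sketched for a single gene as follows. This is an illustrative toy version under stated assumptions, not the HARISSA implementation: the function names, the default parameter values and the restriction to one gene with d0 ≠ d1 are all hypothetical.

```python
import math
import random


def flow(m, p, dt, d0, d1, s1):
    """Exact deterministic evolution between two bursts (requires d0 != d1)."""
    m_new = m * math.exp(-d0 * dt)
    p_new = (p * math.exp(-d1 * dt)
             + s1 * m * (math.exp(-d0 * dt) - math.exp(-d1 * dt)) / (d1 - d0))
    return m_new, p_new


def simulate_gene(t_end, kon, k_max, burst_mean,
                  d0=1.0, d1=0.2, s1=0.2, rng=None):
    """Return (mRNA, protein) at time t_end, starting from (0, 0).

    kon is a function of the protein level with values in [0, k_max].
    """
    rng = rng or random.Random()
    t, m, p = 0.0, 0.0, 0.0
    while True:
        # Candidate burst time drawn at the maximal rate k_max.
        dt = rng.expovariate(k_max)
        if t + dt >= t_end:
            return flow(m, p, t_end - t, d0, d1, s1)
        m, p = flow(m, p, dt, d0, d1, s1)
        t += dt
        # Keep the candidate with probability kon(p) / k_max.
        if rng.random() < kon(p) / k_max:
            m += rng.expovariate(1.0 / burst_mean)  # exponential burst size


# Hypothetical usage: constant burst frequency, mean burst size of 50.
rng = random.Random(0)
cells = [simulate_gene(20.0, lambda p: 2.0, 2.0, 50.0, rng=rng) for _ in range(2000)]
```

Because the flow between bursts has a closed form and rejected candidates leave the state unchanged, no numerical integration is needed, which is the property emphasized above.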
To simulate discrete “count” data that are produced by current scRNA-seq technologies, each mRNA level is generated by sampling from a Poisson distribution whose mean is the simulated expression level. An important mathematical observation is that the resulting cell profiles are then exactly (resp. approximately) distributed according to the discrete-valued “Gillespie” version of the mechanistic model in the absence (resp. presence) of interactions between genes [52].
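The Poisson layer described above can be sketched with the standard library alone. The helper names below are hypothetical, and the sampler (counting unit-rate exponential arrivals in an interval whose length is the expression level) is only one simple exact way to draw Poisson variables; a numerical library would normally be used instead.

```python
import random


def poisson_sample(mean, rng):
    """Poisson draw: number of unit-rate arrivals falling in [0, mean)."""
    count = 0
    t = rng.expovariate(1.0)
    while t < mean:
        count += 1
        t += rng.expovariate(1.0)
    return count


def to_counts(expression, rng=None):
    """Turn continuous mRNA levels (cells x genes) into integer counts."""
    rng = rng or random.Random()
    return [[poisson_sample(level, rng) for level in cell] for cell in expression]


# Hypothetical usage on two simulated cells and three genes.
counts = to_counts([[0.0, 12.5, 103.2], [4.1, 0.0, 87.9]], random.Random(0))
```

A level of exactly zero always yields a zero count, so the zeros produced by the bursty dynamics are preserved and further zeros appear for lowly expressed genes.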
In order to reproduce in vitro experiments for a specific GRN, we use the following method: (1) let the model run for t < 0h until its (stochastic) steady state is reached; (2) introduce at t = 0h a virtual stimulus gene with a constant maximal value for its protein. Such stimulus represents a perturbation in the environment of the cells, inducing them to evolve towards a new (stochastic) steady state. For example, in the case of mouse ES cell differentiation, this corresponds to the addition of all-trans RA in the medium. A time-stamped dataset corresponds to the sampling of independent cells at a specific sequence of timepoints (therefore “killing” sampled cells at each timepoint) starting from t = 0h. Namely, for the benchmark of Fig 2, the sequence of 10 timepoints was set to 0, 6, 12, 24, 36, 48, 60, 72, 84, and 96h.
The GRN model parameters (k0,i, k1,i, s0,i/koff,i, βi and θij) as well as the degradation rates used for simulating the datasets of Fig 1 can be found online with the code of the CARDAMOM method. For every gene i, we set k0,i = 0, k1,i = 2 and s0,i/koff,i = 50.
Relevance to biological data
The exact probability distribution associated to the mechanistic model remains unknown for general networks. However, the analysis developed in [15] suggests that the marginal on mRNAs of the distribution at each time t can be reasonably approximated by a Gamma mixture:
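$$p(x; t) \approx \sum_{z \in Z} \mu_t(z) \prod_{i=1}^{n} \mathrm{Gamma}\left(x_i;\ \frac{k_{z,i}}{d_{0,i}},\ \frac{k_{\mathrm{off},i}}{s_{0,i}}\right) \tag{1}$$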
where Z denotes the set of cell types seen as the basins of attraction of the GRN model, kz,i denotes the mode of burst frequency associated to gene i within basin z ∈ Z, and μt is a probability vector describing the relative weight of the basins in the process at time t.
The Poissonian layer transforms the Gamma distributions into negative binomial (NB) distributions, which gives:
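$$p(x; t) \approx \sum_{z \in Z} \mu_t(z) \prod_{i=1}^{n} \mathrm{NB}\left(x_i;\ \frac{k_{z,i}}{d_{0,i}},\ \frac{k_{\mathrm{off},i}}{s_{0,i}}\right) \tag{2}$$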
Such mixture distributions are known to be compatible with discrete single-cell data [1, 11]. In particular, we recover the second order relationship between pairs of variables that are characteristic of experimental datasets. Indeed, we remark that the mean of a negative binomial distribution NB(a, b) is and its variance , which implies that:
Thus, for every gene i, by replacing b by , we see that the relation , which is characteristic of cell-cell heterogeneity in single-cell data, is well verified by the mechanistic model provided that koff,i does not depend on protein levels. This also argues in favor of the assumption that a GRN does not affect significantly the burst size. Note however that following this criterion, any model generating negative binomial distributions could be considered as realistic. Such criterion is therefore not sufficient for characterizing the accuracy of gene expression models, especially when the simulated distributions arise from a phenomenological “ad hoc” noise term added to fit experimental datasets [16].
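To make the mean-variance argument concrete, here is a short worked derivation under an explicit assumption: NB(a, b) is taken to be the Poisson mixture of a Gamma distribution with shape a and rate b (this parametrization is assumed for the illustration and may differ from the convention of the original equations). Writing X for the count and Λ for the underlying Gamma variable:

$$\mathbb{E}[X] = \mathbb{E}[\Lambda] = \frac{a}{b}, \qquad \mathrm{Var}(X) = \mathbb{E}[\Lambda] + \mathrm{Var}(\Lambda) = \frac{a}{b} + \frac{a}{b^2} = \frac{a(b+1)}{b^2}$$

$$\frac{\mathrm{Var}(X)}{\mathbb{E}[X]} = 1 + \frac{1}{b}, \qquad \log \mathrm{CV}^2 = \log\frac{\mathrm{Var}(X)}{\mathbb{E}[X]^2} = -\log \mathbb{E}[X] + \log\left(1 + \frac{1}{b}\right)$$

Under this convention, taking b = koff,i/s0,i gives a Fano factor of 1 + s0,i/koff,i that depends only on the mean burst size. The shape parameter a, which carries the burst frequency and hence the effect of the GRN, moves the mean along a line of slope −1 in the log-log plane without changing the offset.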
Tested algorithms
The six algorithms used for the benchmark represent together the main categories of GRN inference methods presented in [7]:
GENIE3 [18], which computes the regulatory network for each gene independently, using tree-based ensemble methods to predict the expression profile of each target gene from the profiles of all the other genes;
PIDC [19], which infers an undirected network using the notion of mutual information;
SINCERITIES [20], which uses Granger causality after computing temporal changes in gene expression through the distance between two consecutive timepoints of the marginal distributions;
SCRIBE [21], which is based on the notion of conditioned Restricted Directed Information and ideally needs real cell trajectories, which is unrealistic experimentally. We therefore pseudo-temporally ordered the time-stamped synthetic data used for the benchmark with two methods, one using a pseudotime algorithm and the other using an optimal coupling method with optimal transport, following the idea developed in [27]. We also tested real trajectories in order to compare the performances. The results of this algorithm are presented separately, due to the difference in the information that is needed.
HARISSA [6, 23] and CARDAMOM [15], which are based on the mechanistic model presented above. Both algorithms reconstruct the network by solving a set of regression problems, based on two distinct mathematical analyses of the same model: HARISSA solves a maximum likelihood problem for the protein distributions after estimating a most-probable position for the protein levels in each cell; CARDAMOM compares the function kon to the modes of a joint mRNA distribution previously inferred, in a two-step procedure. Although there are few differences from previous publications regarding these tools, they have been slightly improved here by taking into account the advantages of each, to make them more efficient and compatible with each other. The differences are described in the file cardamom_vignette.pdf on the associated Git repository.
Measuring algorithm performance for the benchmark
We evaluated the GRN inference algorithms on simulated datasets using the area under the precision-recall curve (AUPR). We do not take into account the diagonal coefficients of the GRN matrix, which correspond to self-regulations, since inferring these coefficients is a notoriously difficult task [17]. Interestingly, for the datasets used in the benchmark (see Fig 1), CARDAMOM generally assigns a significantly higher diagonal value to genes with self-regulation than to genes without self-regulation. However, the values inferred for genes without self-regulation remain high relative to the other inferred interactions, which pulls down the AUPR scores while in practice not affecting the associated GRN dynamics. Since this effect of diagonal coefficients tends to lower the AUPR scores for all methods in a similar way, the choice not to take them into account is well justified. Note also that we chose precision-recall (PR) curves rather than receiver operating characteristic (ROC) curves because of the well-known class imbalance problem. Indeed, the sparsity hypothesis suggests that the number of interactions expected for a network of size n is smaller than half of the total number of possible interactions (n²): it is then natural to focus on minimizing false positives (interactions that are detected but not present) rather than false negatives (interactions that are present but not detected), which explains the preference of PR over ROC.
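A minimal sketch of such an evaluation is given below. It computes the step-wise average precision, one common estimator of the AUPR, over off-diagonal entries only. The function name, the use of absolute score values to rank signed interactions and the handling of ties are assumptions made for this illustration; the benchmark code may use a different interpolation of the precision-recall curve.

```python
def aupr_off_diagonal(true_net, scores):
    """Average precision of |scores| against true edges, ignoring the diagonal.

    Both arguments are n-by-n nested lists using the same (i, j) convention.
    """
    n = len(true_net)
    pairs = [(abs(scores[i][j]), 1 if true_net[i][j] != 0 else 0)
             for i in range(n) for j in range(n) if i != j]
    pairs.sort(key=lambda item: -item[0])
    n_pos = sum(label for _, label in pairs)
    if n_pos == 0:
        raise ValueError("the reference network has no off-diagonal edge")
    ap, tp, k = 0.0, 0, 0
    while k < len(pairs):
        # Process tied scores together so that their order does not matter.
        end, tp_group = k, 0
        while end < len(pairs) and pairs[end][0] == pairs[k][0]:
            tp_group += pairs[end][1]
            end += 1
        tp += tp_group
        precision = tp / end
        ap += precision * tp_group / n_pos  # precision times recall increment
        k = end
    return ap


# Hypothetical 3-gene example with two true edges and large diagonal scores.
truth = [[1, 1, 0], [0, 0, 1], [0, 0, 1]]
good = [[9.0, 0.9, 0.1], [0.2, 9.0, -0.8], [0.1, 0.3, 9.0]]
bad = [[9.0, 0.1, 0.9], [0.8, 9.0, -0.2], [0.7, 0.6, 9.0]]
```

In this toy example the diagonal scores are deliberately large and have no effect on the result, and the negative score in `good` still counts as a detected interaction because edges are ranked by absolute value.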
Experimental dataset
We used data collected from a differentiation experiment of mouse embryonic stem cells induced by all-trans retinoic acid treatment [22]. This scRNA-seq dataset consists of 9 timepoints (0, 6, 12, 24, 36, 48, 60, 72, and 96h), each timepoint containing between 137 and 335 sampled cells after pre-processing (272 on average, for a total number of 2449 cells). To limit artificial correlations between genes (due to a multiplicative cell-specific technical factor mainly related to the reverse-transcription step), we selected cells with a total number of UMI counts ≥ 2000 in line with Semrau et al. [22], which resulted in keeping only 2449 out of 3456 measured cells.
On the other hand, we did not normalize cells by their respective total UMI counts and argue that this type of normalization is hazardous in the case of single-cell data. Indeed, such “library sizes” are small compared to bulk data (because 1 sample = 1 cell instead of many cells) and are in fact biologically fluctuating, likely reflecting the transcriptional bursting phenomenon (this is easily seen when simulating “perfect” data from the mechanistic model, see S2 Fig). In practice, since the CARDAMOM inference method starts with a binarization step (applying a specific, statistically derived threshold to each gene based on the mechanistic model), a multiplicative factor on each cell should not have too much impact as long as the number of cells is large enough. More generally, we argue that such normalization of cells should rather be “soft-coded” as a random factor to be estimated within a statistical framework.
The total number of genes measured in this experiment is 17452, which is much larger than in our benchmark. As they are unlikely to all be important in characterizing the differentiation process, we decided to restrict our analysis to a panel of 41 genes that had previously been identified as key marker genes for pluripotency, post-implantation epiblast, neuroectoderm and extraembryonic endoderm [22]. This number of genes makes it possible to infer a network rich enough to make cell types emerge in a non-trivial way, while keeping a reasonable statistical power regarding the number of sampled cells. Note that the speed of the algorithms (S1 Table) would allow a much larger subset of genes to be used: the limiting factor here is not computational speed but statistical power resulting from the number of cells available (see Fig 2).
Calibration of the mechanistic model
The principle of CARDAMOM is based on a two-step procedure:
In a first step, we find the set of parameters α defining the mixture of negative binomial distributions (2) that best fits the data, which yields, for each gene i, estimates of the mean burst size s0,i/koff,i and of the values k0,i and k1,i of the typical modes associated with the function kon,i. The only parameter that can vary over time is the mixture proportion parameter μt. Note that all model parameters are estimated except the degradation rates, which are constant for each gene and scale the dynamics of gene expression. See [15, Appendix A] for a full description of these parameters and their precise links with α.
In a second step, we calibrate the basal activity and interaction parameters βi and θij in order to approximate this mixture distribution. The interaction parameters θij are then updated at each timepoint sequentially to match the mixture parameters.
The degradation rates are set to values found in tables from the literature [36]. Since many genes do not appear in these tables, we decide to set the same value for all the genes belonging to the same functional group identified in [22] (see S2 Table). Details concerning the two steps can be found in [15]. Note that the first step has been modified since the original publication in order to replace the MCMC algorithm, which was used to find the parameters of the negative binomial distributions associated to each cell type, by a faster variational method.
As mentioned previously, for the experimental dataset the model cannot match both the dynamics from 0h to 72h and those from 72h to 96h with the same degradation rates, due to the acceleration of the process between 72h and 96h. For this reason, we decided for our simulation to multiply the synthesis and degradation rates between the last two timepoints by a factor f = 6 (which is equivalent to lengthening the last time interval by the same factor), large enough to reach the steady state between 72h and 96h. This factor was found empirically as the minimum integer factor such that the stationary distribution is reached at 96h: thus any factor greater than 6 would lead to the same results.
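The effect of this scaling can be illustrated by mapping experimental time to the time elapsed in the model. The sketch below assumes that every rate of the model is multiplied by the same factor after 72h, so that the scaling amounts to a pure change of time scale; the function name and default arguments are hypothetical.

```python
def model_time(t_exp, t_switch=72.0, factor=6.0):
    """Time elapsed in the model (hours) for an experimental time t_exp (hours)."""
    if t_exp <= t_switch:
        return t_exp
    # After the switch, each experimental hour counts for `factor` model hours.
    return t_switch + factor * (t_exp - t_switch)


# Experimental timepoints of the dataset and their model-time counterparts.
timepoints = [0, 6, 12, 24, 36, 48, 60, 72, 96]
simulated_times = [model_time(t) for t in timepoints]
```

Under this assumption, the 24 experimental hours between 72h and 96h correspond to 144 hours of model dynamics with f = 6, which is why the last interval is long enough for the simulated cells to approach the stationary distribution.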
Comparing datasets using countsimQC
We used countsimQC [37] to compare the experimental dataset, the dataset simulated with the inferred network and the dataset simulated with the null network (S2 Fig). Although there is no significant difference between the first two datasets, we observe in S2F Fig that the correlations between genes are not perfectly reproduced (although they are clearly more accurate than for the dataset simulated with the null network). This gap between the correlations between genes is also illustrated in S4 Fig, which compares joint distributions between the simulated dataset (with the inferred network) and the experimental one for three pairs of genes at the final timepoint. We observe that although the global form of the correlations is respected, they are not as strong in the simulated dataset as in the experimental dataset. This suggests that the inferred GRN recovers the true correlations but not with the right intensity, which may be due to the sensitivity of the model to the value of its parameters.
The fact that, except for the correlations (sample-sample and feature-feature), the statistical characteristics explored by countsimQC are also well reproduced by the dataset simulated with the null network suggests that they are generally not sufficient for measuring the accuracy of a dataset reproduction. This is partly due to the fact that any calibration of the model with the right scaling parameters but not the right GRN should match most of these characteristics. In that sense, the successes of simulation algorithms prior to our work are limited when measured with similar criteria. Our methodology, for which we used distinct criteria that are particularly well illustrated in S5 Fig and Fig 7, then appears as a significant improvement in the field of executable GRN inference.
Comparing datasets using UMAP
Since UMAP is not linear, the projections of datasets shown in Fig 7E, 7F and 7G are likely to force the projected data to be artificially close to the reference dataset. Thus, we decided to present two figures similar to Fig 7E, 7F and 7G, but where instead of projecting the simulated dataset on the pseudo-axes corresponding to the projection of the experimental dataset, we project both datasets together and show separately the cells corresponding to each dataset. Using this methodology, we represented in S5A Fig the UMAP projection of the experimental dataset and in S5B Fig that of the dataset simulated with the inferred network. We did the same for the experimental dataset (S5C Fig) and the dataset simulated with the null network (S5D Fig). Although they look different, S5A and S5C Fig represent the same dataset, and the difference comes from the second dataset (the simulated one) with which the reduction has been performed. This emphasizes that the representation of a distribution of cells with UMAP is very sensitive to the choice of the data that are integrated in the projection. Once again, we observed that the dataset simulated using the inferred network is much closer to the experimental dataset than the one simulated with the null network. In particular, S5B Fig shows that the UMAP projection of the dataset simulated with the inferred network is close to that of the experimental dataset, both in the arrangement of the cells between the different timepoints and in the general form of the subspace occupied by the cells, which is not the case for the UMAP projection of the dataset simulated with the null network, represented in S5D Fig.
Acknowledgments
We would especially like to thank Christophe Arpin, Thomas Lepoutre, Anton Crombach and Arnaud Bonnaffoux for critical reading of the manuscript. We also thank all members of the SBDM and Dracula teams, and of the SingleStatOmics project, for providing such a stimulating working environment. We finally thank the BioSyL Federation, the LabEx Ecofect (ANR-11-LABX-0048) and the LabEx Milyon of the University of Lyon for inspiring scientific events.
Data Availability
CARDAMOM is available at https://github.com/eliasventre/cardamom along with the code to generate all the results in this paper. HARISSA is available at https://github.com/ulysseherbach/harissa.
Funding Statement
This work was supported by funding from the French agency ANR (SingleStatOmics; ANR-18-CE45-0023-03) to OG. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
References
- 1. Mar JC. The rise of the distributions: why non-normality is important for understanding the transcriptome and beyond. Biophysical Reviews. 2019;11:89–94. doi: 10.1007/s12551-018-0494-4 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2. Coskun AF, Eser U, Islam S. Cellular identity at the single-cell level. Mol Biosyst. 2016;12:2965–2979. doi: 10.1039/C6MB00388E [DOI] [PubMed] [Google Scholar]
- 3. Huynh-Thu VA, Sanguinetti G. Gene Regulatory Network Inference: An Introductory Survey. Methods Mol Biol. 2019;1883:1–23. doi: 10.1007/978-1-4939-8882-2_1 [DOI] [PubMed] [Google Scholar]
- 4. Zong C, So LH, Sepúlveda LA, Skinner SO, Golding I. Lysogen stability is determined by the frequency of activity bursts from the fate-determining gene. Molecular Systems Biology. 2010;6:440. doi: 10.1038/msb.2010.96 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5. Ochiai H, Sugawara T, Sakuma T, Yamamoto T. Stochastic promoter activation affects Nanog expression variability in mouse embryonic stem cells. Scientific reports. 2014;4:1–9. doi: 10.1038/srep07125 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6. Herbach U, Bonnaffoux A, Espinasse T, Gandrillon O. Inferring gene regulatory networks from single-cell data: a mechanistic approach. BMC Systems Biology. 2017;11:1–15. doi: 10.1186/s12918-017-0487-0 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7. Akers K, Murali TM. Gene regulatory network inference in single-cell biology. Current Opinion in Systems Biology. 2021;26:87–97. doi: 10.1016/j.coisb.2021.04.007 [DOI] [Google Scholar]
- 8. Shahrezaei V, Swain PS. Analytical distributions for stochastic gene expression. PNAS. 2008;105:17256–17261. doi: 10.1073/pnas.0803850105 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9. Friedman N, Cai L, Xie XS. Linking stochastic dynamics to population distribution: an analytical framework of gene expression. Phys Rev Lett. 2006;97(16):168302. doi: 10.1103/PhysRevLett.97.168302 [DOI] [PubMed] [Google Scholar]
- 10. Albayrak C, Jordi CA, Zechner C, Lin J, Bichsel CA, Khammash M, et al. Digital Quantification of Proteins and mRNA in Single Mammalian Cells. Molecular Cell. 2016;61:914–924. doi: 10.1016/j.molcel.2016.02.030 [DOI] [PubMed] [Google Scholar]
- 11. Singer ZS, Yong J, Tischler J, Hackett JA, Altinok A, Surani MA, et al. Dynamic heterogeneity and DNA methylation in embryonic stem cells. Mol Cell. 2014;55(2):319–331. doi: 10.1016/j.molcel.2014.06.029 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12. Schiebinger G. Reconstructing developmental landscapes and trajectories from single-cell data. Current Opinion in Systems Biology. 2021;27:100351. doi: 10.1016/j.coisb.2021.06.002 [DOI] [Google Scholar]
- 13. Deconinck L, Cannoodt R, Saelens W, Deplancke B, Saeys Y. Recent advances in trajectory inference from single-cell omics data. Current Opinion in Systems Biology. 2021;27:100344. doi: 10.1016/j.coisb.2021.05.005 [DOI] [Google Scholar]
- 14. Bonnaffoux A, Herbach U, Richard A, Guillemin A, Gonin-Giraud S, Gros PA, et al. WASABI: a dynamic iterative framework for gene regulatory network inference. BMC Bioinformatics. 2019;20:1–19. doi: 10.1186/s12859-019-2798-1 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15. Ventre E. Reverse engineering of a mechanistic model of gene expression using metastability and temporal dynamics. In Silico Biology. 2021;14:89–113. doi: 10.3233/ISB-210226 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16. Dibaeinia P, Sinha S. SERGIO: A Single-Cell Expression Simulator Guided by Gene Regulatory Networks. Cell Systems. 2020;11(3):252–271. doi: 10.1016/j.cels.2020.08.003 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17. Pratapa A, Jalihal AP, Law JN, Bharadwaj A, Murali TM. Benchmarking algorithms for gene regulatory network inference from single-cell transcriptomic data. Nature Methods. 2020;17(2):147–154. doi: 10.1038/s41592-019-0690-6 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18. Huynh-Thu VA, Irrthum A, Wehenkel L, Geurts P. Inferring regulatory networks from expression data using tree-based methods. PLOS One. 2010;5(9):e12776. doi: 10.1371/journal.pone.0012776 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19. Chan TE, Stumpf MPH, Babtie AC. Gene Regulatory Network Inference from Single-Cell Data Using Multivariate Information Measures. Cell Systems. 2017;5(3):251–267. doi: 10.1016/j.cels.2017.08.014 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20. Papili Gao N, Ud-Dean SMM, Gandrillon O, Gunawan R. SINCERITIES: Inferring gene regulatory networks from time-stamped single cell transcriptional expression profiles. Bioinformatics. 2017;34(2):258–266. doi: 10.1093/bioinformatics/btx575 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21. Qiu X, Rahimzamani A, Wang L, Ren B, Mao Q, Durham T, et al. Inferring Causal Gene Regulatory Networks from Coupled Single-Cell Expression Dynamics Using Scribe. Cell Systems. 2020;10:1–10. doi: 10.1016/j.cels.2020.02.003 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22. Semrau S, Goldmann JE, Soumillon M, Mikkelsen TS, Jaenisch R, van Oudenaarden A. Dynamics of lineage commitment revealed by single-cell transcriptomics of differentiating embryonic stem cells. Nat Commun. 2017;8:1–16. doi: 10.1038/s41467-017-01076-4 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Herbach U. Gene regulatory network inference from single-cell data using a self-consistent proteomic field. arXiv. 2021;2109.14888:1–21.
- 24. Richard A, Boullu L, Herbach U, Bonnafoux A, Morin V, Vallin E, et al. Single-Cell-Based Analysis Highlights a Surge in Cell-to-Cell Molecular Variability Preceding Irreversible Commitment in a Differentiation Process. PLoS Biol. 2016;14:e1002585. doi: 10.1371/journal.pbio.1002585 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25. Stumpf PS, Smith RCG, Lenz M, Schuppert A, Müller FJ, Babtie A, et al. Stem Cell Differentiation as a Non-Markov Stochastic Process. Cell Systems. 2017;5:268–282. doi: 10.1016/j.cels.2017.08.009 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26. Phillips NE, Mandic A, Omidi S, Naef F, Suter DM. Memory and relatedness of transcriptional activity in mammalian cell lineages. Nat Commun. 2019;10:1–12. doi: 10.1038/s41467-019-09189-8 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27. Schiebinger G, Shu J, Tabaka M, Cleary B, Subramanian V, Solomon A, et al. Optimal-Transport Analysis of Single-Cell Gene Expression Identifies Developmental Trajectories in Reprogramming. Cell. 2019;176:928–943.e22. doi: 10.1016/j.cell.2019.01.006 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28. Street K, Risso D, Fletcher RB, Das D, Ngai J, Yosef N, et al. Slingshot: cell lineage and pseudotime inference for single-cell transcriptomics. BMC Genomics. 2018;19:1–16. doi: 10.1186/s12864-018-4772-0 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29. Moutier E, Ye T, Choukrallah MA, Urban S, Osz J, Chatagnon A, et al. Retinoic acid receptors recognize the mouse genome through binding elements with diverse spacing and topology. J Biol Chem. 2012;287(31):26328–26341. doi: 10.1074/jbc.M112.361790 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30. Chatagnon A, Veber P, Morin V, Bedo J, Triqueneaux G, Semon M, et al. RAR/RXR binding dynamics distinguish pluripotency from differentiation associated cis-regulatory elements. Nucleic Acids Res. 2015;43(10):4833–4854. doi: 10.1093/nar/gkv370 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31. Chua EHZ, Yasar S, Harmston N. The importance of considering regulatory domains in genome-wide analyses—the nearest gene is often wrong! Biol Open. 2022;11(4):bio059091. doi: 10.1242/bio.059091 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32. Chen X, Xu H, Yuan P, Fang F, Huss M, Vega VB, et al. Integration of external signaling pathways with the core transcriptional network in embryonic stem cells. Cell. 2008;133(6):1106–1117. doi: 10.1016/j.cell.2008.04.043 [DOI] [PubMed] [Google Scholar]
- 33. Li G, Margueron R, Ku M, Chambon P, Bernstein BE, Reinberg D. Jarid2 and PRC2, partners in regulating gene expression. Genes & development. 2010;24:368–380. doi: 10.1101/gad.1886410 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34. Mahony S, Mazzoni EO, McCuine S, Young RA, Wichterle H, Gifford DK. Ligand-dependent dynamics of retinoic acid receptor binding during early neurogenesis. Genome biology. 2011;12:1–15. doi: 10.1186/gb-2011-12-1-r2 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35. Hrabchak C, Ringuette M, Woodhouse K. Recombinant mouse SPARC promotes parietal endoderm differentiation and cardiomyogenesis in embryoid bodies. Biochemistry and Cell Biology. 2008;86:487–499. doi: 10.1139/O08-141 [DOI] [PubMed] [Google Scholar]
- 36. Schwanhausser B, Busse D, Li N, Dittmar G, Schuchhardt J, Wolf J, et al. Global quantification of mammalian gene expression control. Nature. 2011;473:337–42. doi: 10.1038/nature10098 [DOI] [PubMed] [Google Scholar]
- 37. Soneson C, Robinson MD. Towards unified quality verification of synthetic count data with countsimQC. Bioinformatics. 2018;34:691–692. doi: 10.1093/bioinformatics/btx631 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38. Crowell HL, Leonardo SXM, Soneson C, Robinson MD. Built on sand: the shaky foundations of simulating single-cell RNA sequencing data. bioRxiv. 2021; p. 1–18. [Google Scholar]
- 39. Manning KS, Cooper TA. The roles of RNA processing in translating genotype to phenotype. Nat Rev Mol Cell Biol. 2017;18:102–114. doi: 10.1038/nrm.2016.139 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40.McInnes L, Healy J, Melville J. UMAP: uniform manifold approximation and projection for dimension reduction. arXiv. 2020;1802.03426:1–63.
- 41. Cannoodt R, Saelens W, Deconinck L, Saeys Y. Spearheading future omics analyses using dyngen, a multi-modal simulator of single cells. Nature Communications. 2021;12(1):1–9. doi: 10.1038/s41467-021-24152-2 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 42. Mizeranschi A, Zheng H, Thompson P, Dubitzky W. Evaluating a common semi-mechanistic mathematical model of gene-regulatory networks. BMC Systems Biology. 2015;9:1–12. doi: 10.1186/1752-0509-9-S5-S2 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 43. Matsumoto H, Kiryu H, Furusawa C, Ko MS, Ko SB, Gouda N, et al. SCODE: An efficient regulatory network inference algorithm from single-cell RNA-Seq during differentiation. Bioinformatics. 2017;33(15):2314–2321. doi: 10.1093/bioinformatics/btx194 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 44. Ventre E, Espinasse T, Bréhier CE, Calvez V, Lepoutre T, Gandrillon O. Reduction of a stochastic model of gene expression: Lagrangian dynamics gives access to basins of attraction as cell types and metastabilty. Journal of Mathematical Biology. 2021;83:1–63. doi: 10.1007/s00285-021-01684-1 [DOI] [PubMed] [Google Scholar]
- 45. Marouf M, Machart P, Bansal V, Kilian C, Magruder DS, Krebs CF, et al. Realistic in silico generation and augmentation of single-cell RNA-seq data using generative adversarial networks. Nat Commun. 2020;11(1):1–12. doi: 10.1038/s41467-019-14018-z [DOI] [PMC free article] [PubMed] [Google Scholar]
- 46.Lavenant H, Zhang S, Kim YH, Schiebinger G. Towards a mathematical theory of trajectory inference. arXiv preprint arXiv:210209204. 2021; p. 1–62.
- 47. Suter DM, Molina N, Gatfield D, Schneider K, Schibler U, Naef F. Mammalian genes are transcribed with widely different bursting kinetics. Science. 2011;332(6028):472–474. doi: 10.1126/science.1198817 [DOI] [PubMed] [Google Scholar]
- 48. Nicolas D, Phillips NE, Naef F. What shapes eukaryotic transcriptional bursting? Mol Biosyst. 2017;13:1280–1290. doi: 10.1039/C7MB00154A [DOI] [PubMed] [Google Scholar]
- 49. Rodriguez J, Larson DR. Transcription in Living Cells: Molecular Mechanisms of Bursting. Annu Rev Biochem. 2020;89:189–212. doi: 10.1146/annurev-biochem-011520-105250 [DOI] [PubMed] [Google Scholar]
- 50. Tunnacliffe E, Chubb JR. What Is a Transcriptional Burst? Trends Genet. 2020;36:288–297. doi: 10.1016/j.tig.2020.01.003 [DOI] [PubMed] [Google Scholar]
- 51. Li C, Cesbron F, Oehler M, Brunner M, Hofer T. Frequency Modulation of Transcriptional Bursting Enables Sensitive and Rapid Gene Regulation. Cell Syst. 2018;6(4):409–423. doi: 10.1016/j.cels.2018.01.012 [DOI] [PubMed] [Google Scholar]
- 52. Herbach U. Stochastic gene expression with a multistate promoter: breaking down exact distributions. SIAM Journal on Applied Mathematics. 2019;79(3):1007–1029. doi: 10.1137/18M1181006 [DOI] [Google Scholar]