Abstract
Motivation
The analysis of RNA-Seq data from individual differentiating cells enables us to reconstruct the differentiation process and the degree of differentiation (in pseudo-time) of each cell. Such analyses can reveal detailed expression dynamics and functional relationships for differentiation. To further elucidate differentiation processes, more insight into gene regulatory networks is required. The pseudo-time can be regarded as time information and, therefore, single-cell RNA-Seq data are time-course data with high time resolution. Although time-course data are useful for inferring networks, conventional inference algorithms for such data suffer from high time complexity when the number of samples and genes is large. Therefore, a novel algorithm is necessary to infer networks from single-cell RNA-Seq during differentiation.
Results
In this study, we developed the novel and efficient algorithm SCODE to infer regulatory networks, based on ordinary differential equations. We applied SCODE to three single-cell RNA-Seq datasets and confirmed that SCODE can reconstruct observed expression dynamics. We evaluated SCODE by comparing its inferred networks with use of a DNaseI-footprint based network. The performance of SCODE was best for two of the datasets and nearly best for the remaining dataset. We also compared the runtimes and showed that the runtimes for SCODE are significantly shorter than for alternatives. Thus, our algorithm provides a promising approach for further single-cell differentiation analyses.
Availability and Implementation
The R source code of SCODE is available at https://github.com/hmatsu1226/SCODE
Supplementary information
Supplementary data are available at Bioinformatics online.
1 Introduction
Conventional bulk RNA-Seq reveals the average gene expression of an ensemble of cells, and therefore does not permit the analysis of detailed states of individual cells. With the advancement of single-cell RNA-Seq (scRNA-Seq), we can now quantify the expression of individual cells and analyze detailed differences among cells (Kolodziejczyk et al., 2015). This enables several analyses such as the identification of cell types (Buettner et al., 2015; Zeisel et al., 2015), especially of rare cells (Grun et al., 2015; Jiang et al., 2016) and the estimation of cellular lineages (Burns, 2015; Treutlein et al., 2014).
In analyses by scRNA-Seq, the reconstruction of cellular differentiation processes attracts attention as a novel approach to revealing differentiation mechanisms (Trapnell, 2015). The differentiation process can be reconstructed using dimension reduction (Ji and Ji, 2016; Trapnell et al., 2014) and stochastic processes (Matsumoto and Kiryu, 2016), for example, and the degree of differentiation (in pseudo-time) of each cell is characterized by the position in the reconstructed process. By investigating the expression pattern in pseudo-time, genes can be clustered into multiple groups with different biological functions (Trapnell et al., 2014). Moreover, the regulatory cascade of cellular state transitions, such as differentiation, can be inferred by comparing the timings of up- and down-regulation (Eckersley-Maslin et al., 2016; Li et al., 2016; Matsumoto and Kiryu, 2016).
In addition, scRNA-Seq also enables the calculation of accurate correlations of expression between genes because scRNA-Seq can distinguish the detailed states of individual cells without contamination from multiple cell types. The accurate co-expression pattern of each cell type (progenitor cells and multiple types of differentiated cells) can reveal the key regulatory factors for lineage programming (Pina et al., 2015).
In this way, expression dynamics in pseudo-time and accurate relationships among genes can be inferred from scRNA-Seq data. For the next step in differentiation analyses using scRNA-Seq, it is important to reveal the regulatory interactions among genes that bring about the observed expression dynamics during differentiation, namely, gene regulatory network (GRN) inference from scRNA-Seq data. Pseudo-time can be regarded as time information, and hence, scRNA-Seq performed on cells undergoing differentiation can be regarded as time-course expression data at a high temporal resolution. Although several algorithms have been proposed to reconstruct GRN from time-course data (Lee and Tzou, 2009), most of them are not suitable for scRNA-Seq data, such as that collected over continuous time and with a large number of samples. Moreover, time complexity is a serious problem, and runtime becomes infeasibly long with large numbers of samples and genes for the network inference from time-course data.
Recently, Boolean network-based algorithms have been proposed for inferring GRN from single-cell data (Chen et al., 2015; Lim et al., 2016; Moignard et al., 2015). Although these algorithms have revealed some interesting regulatory relationships, their time complexity increases significantly as the number of genes and cells increases, and they have thus been applied to data with a small number of genes. In addition, the expression data must first be converted into binary data for Boolean network inference, and therefore the relationship between networks and the underlying dynamics becomes obscured (Woodhouse et al., 2016).
As another approach, ordinary differentiation equations (ODEs) have been used to describe regulatory network and expression dynamics. ODEs can describe continuous variables over continuous time and the underlying physical phenomena, and therefore they are suitable for inferring GRN from scRNA-Seq during differentiation. Although several ODE-based network-inference algorithms have been proposed (Di Bernardo et al., 2004; Gardner et al., 2003), most of them are not suitable for the differentiation case because these algorithms assume a steady-state condition. There are some ODE-based algorithms that infer GRNs such that the observed expression dynamics can be reconstructed from the optimized ODE (Bansal et al., 2006). However, time complexity is still a serious problem for such ODE-based algorithms (Lee and Tzou, 2009). Previous research has described optimizing an ODE by using single-cell data and pseudo-time to infer key GRNs (Ocone et al., 2015). Although it is a suggestive approach, the optimization assumes that the GRNs are given and learns the ODE for a specific GRN. Therefore, a novel and efficient algorithm is necessary to learn GRNs from ODEs designed for scRNA-Seq performed on differentiating cells and for a large number of samples and genes.
Accordingly, we developed an approach to describe regulatory networks and expression dynamics with linear ODEs as well as a novel, highly efficient optimization algorithm, SCODE, for scRNA-Seq performed on differentiating cells by integrating the transformation of linear ODEs and linear regression. In the Methods section, we show that linear ODEs can be transformed from fixed-parameter linear ODEs if they satisfy a relational expression. We also show that the relational expression can be estimated analytically and efficiently by linear regression. In addition, SCODE uses a small number of factors to reconstruct expression dynamics, which results in a marked reduction of time complexity. In the Results sections, we described the application of SCODE for three scRNA-Seq datasets during differentiation. First, we validated that the optimized ODEs can reconstruct observed expression dynamics accurately. Second, we evaluated the inferred network by comparing it to the transcription factor (TF) regulatory network database based on DNaseI footprints and transcription factor binding motifs. SCODE performed best with two of the datasets and was the close second best algorithm for the remaining dataset. Third, we compared the runtimes of the algorithms, and SCODE was significantly faster than previous algorithm that was designed for time-course data. Moreover, SCODE is faster than some algorithms that do not use time parameters. These results illustrate the remarkable efficiency of SCODE. Lastly, we analyzed the network inferred from a dataset and determined that the de novo methyltransferases Dnmt3a and Dnmt3b might be key regulators of differentiation.
In this paper, we propose a novel algorithm for scRNA-Seq performed on differentiating cells to reconstruct expression dynamics and infer regulatory networks with a highly efficient optimization method. We believe that our approach will substantially advance the development of regulatory network inference and promote the development of further single-cell differentiation analyses and bioinformatics methods.
2 Materials and methods
2.1 Describing regulatory networks and expression dynamics with linear ODEs
In this research, we focus on TFs and inferring TF regulatory networks. First, we describe TF expression dynamics throughout differentiation with linear ODEs:
(1) |
where is a vector of length G (G is the number of TFs) that denotes the expression of TFs and is a square matrix with dimensions equal to G that denotes the regulatory network among TFs. We infer the TF regulatory network by optimizing such that the ODE can successfully describe the observed expression data.
The above model and most of the algorithms which infer GRN from RNA-Seq make no distinction between mRNA and corresponding TF levels. Although several studies have shown poor correlation between mRNA and protein levels (Maier et al., 2009), the correlation increases in the case of differentially expressed mRNAs (Koussounadis et al., 2015), and changes of mRNA and protein levels are highly correlated (Lundberg et al., 2010). Therefore, TF levels can be represented by mRNA levels with respect to differentially expressed TF genes during differentiation. The delay due to translation and translocation also may impact our model. The time scales of translation and translocation are seconds (Morisaki et al., 2016) and several tens of minutes (Aymoz et al., 2016), which are significantly shorter than that of differentiation (days). Therefore, our assumption that identifies mRNA level with corresponding TF levels will be appropriate to describe the regulatory relationships and dynamics during differentiation.
The observed expression data consist of a G × C matrix (), where C is the number of cells. In addition, the time parameter of a cell c is given as tc. Therefore, our objective is to optimize such that can properly represent at a corresponding time point.
Here, contains G × G parameters and an efficient parameter optimization algorithm is necessary for large values of G. This is because the time complexity is typically for operation on a G × G matrix, and it will exceed to optimize with a general algorithm. As experimental technologies have advanced, the number of cells that may be subjected to scRNA-Seq has been increasing, and hence C can be quite large. Therefore, we developed a novel algorithm to optimize efficiently, even if both G and C are large, by integrating the transformation of linear ODEs and linear regression.
2.1.1 Deriving from a linear ODE transformation
At first, we consider the following linear ODE:
(2) |
where is a vector of length G and is a known square matrix. If we know a matrix that satisfies , we can derive the ODE of by transforming the ODE of as follows:
(3) |
Therefore, if the parameter of and the relationship are given, we can derive from .
2.1.2 Estimating using linear regression
To infer , we have to estimate a matrix that satisfies . Here, we assume that the problem of inference can be regarded as a linear regression problem. Initially, from , we calculate at t = tc for each cell and generate a G × C matrix () (Fig. 1(a)). With this , we optimize to successfully represent the relationship , which results in . The above problem can be regarded as solving the linear regression for each gene, as follows:
(4) |
where ϵ is a noise term. Therefore, can be optimized analytically and efficiently by linear regression for each TF (Fig. 1(b)), and can be efficiently calculated from .
Fig. 1.
Abstract illustration of SCODE. (a) Sample from the ODE of (b) Estimate based on linear regression. (c) Optimize iteratively. (d) Infer from optimized and . (e) The expression dynamics can be reconstructed from the optimized ODE of
2.1.3 Dimension reduction of
The basic idea of reduction is that the patterns of expression dynamics are limited and expression dynamics can be reconstructed with a small number of patterns. For the next step, we consider a small vector to represent the original expression dynamics. Hereafter, is a vector of length D, with . In this case, is a G × D matrix, and hence we used a pseudo-inverse matrix instead of the inverse matrix, and is derived from . The matrix is estimated as before, via linear regression. By using a small vector , the time complexity of estimation of becomes much lower.
Recently, such dimensionality reduction approach has also been proposed to infer network (Wang et al., 2016). Although it is a sophisticated algorithm, it is designed for discrete time-course data and small samples, and is not suitable for scRNA-Seq data.
2.1.4 Optimizing
Thus far, we have assumed is given. To represent the original expression dynamics with small values of D, we optimize for the next step. We suppose that the appropriate value of satisfies the condition that the generated from can predict with accurately. Therefore, we evaluate the appropriateness of the matrix with the following residual sum of squares (RSS):
(5) |
In this research, we assume is a diagonal matrix and the elements satisfy (we set and to -10 and 2, respectively). This limitation is acceptable because large and small values of represent a dynamics of sharp change and seem to be an inefficient basis for reconstructing the expression dynamics.
We optimize by random sampling and iterative optimization so that the RSS decreases (Fig. 1(c)). The brief pseudocode is given below (see the Supplementary text for the detailed procedure).
Algorithm 1.
Iterative optimization of
Initialize a diagonal matrix randomly
for do
Generate from
Solution of linear regression ()
ifthen
end if
uniform random value
uniform random value
end for
After the above optimization, is inferred with (Fig. 1(d)).
2.1.5 Time complexity
The time complexity of optimizing and is , where I is the number of iterations of optimization. The time complexity of calculating is . Because we assume that D is small, the total time complexity is about . As matrix operations on , such as multiplication, have a time complexity of , our algorithm is highly efficient to infer regulatory network even though it integrates time-course information into the model.
2.2 Other network inference approaches
For comparison, we also developed a simple network inference algorithm based on linear regression that predicts expression of a particular TF from the expression of the remaining TFs as follows:
(6) |
With this method, the optimized is regarded as a regulatory network. In this research, we optimized using two criteria. The first criterion is based on normal linear regression, and we estimated with the lm function in R. The second criterion is based on lasso regression, and we estimated with the msgps package in R, which automatically selects the optimal degrees of freedom (Hirose, 2012). We used the Bayesian information criterion for model selection in msgps.
In addition, we inferred networks with GENIE3 (Huynh-Thu et al., 2010), which also predicts TF expression from the expression of other TFs by using regression trees. The performance of GENIE3 was best in the DREAM5 Network Inference challenge for population data (Marbach et al., 2012).
We also inferred networks with Jump3 (Huynh-Thu and Sanguinetti, 2015), which is the expansion of GENIE3 for reconstructing a network from time-course expression data. Jump3 is based on jump trees and showed high performance for multiple time-course datasets.
2.3 Dataset
We analyzed three time-course scRNA-Seq datasets by the following procedures. First, transcripts per million reads (TPM) and fragments per millions of kilobases mapped (FPKM) were transformed as log(TPM + 1) and log(FPKM + 1), and we regarded these log-transformed values as the expression value. Next, we calculated the averaged expression of each TF at each time point and calculated the variance of the averaged expression for each TF. For TF data, we used Riken TFdb for mouse (Kanamori et al., 2004), and animalTFDB for human (Zhang et al., 2015). (Riken TFdb contains not only TFs but also their related genes, and we called all genes in the database transcription factors in this study.) Lastly, we regarded the TFs with large variances as variable TFs during differentiation. Hereafter, we used the top 100 variable TFs for network inference. For these 100 TFs, we estimated pseudo-time (tc) with Monocle (Trapnell et al., 2014). We also excluded 100 randomly selected cells from the training data in order to use them as test data to evaluate adequate sizes of (D).
2.3.1 Data1: mouse ES cells to primitive endoderm cells
The first time-course scRNA-Seq dataset (at 0, 12, 24, 48 and 72 h) analyzed was derived from primitive endoderm (PrE) cells differentiated from mouse ES cells (by using G6GR ES cells (Shimosato et al., 2007)), containing 456 cells. This dataset was produced with RamDA-Seq, a novel scRNA-Seq protocol developed by our laboratory (in submission).
2.3.2 Data2: mouse embryonic fibroblast cells to myocytes
The second dataset was derived from scRNA-Seq data obtained to examine direct reprogramming from mouse embryonic fibroblast (MEF) cells to myocytes at days 0, 2, 5 and 22 (Treutlein et al., 2016). This dataset contained 405 cells.
2.3.3 Data3: human ES cells to definitive endoderm cells
The third dataset was a scRNA-Seq time course (at 0, 12, 24, 36, 72 and 96 h) derived from definitive endoderm (DE) cells differentiated from human ES cells, containing 758 cells (Chu et al., 2016).
2.4 Network validation method
To validate the inferred networks, we used the Transcription Factor Regulatory Network database (http://www.regulatorynetworks.org), which was constructed from DNaseI footprints and TF-binding motifs (Neph et al., 2012; Stergachis et al., 2014). We integrated the TF regulatory networks of all cells for human and mouse, and extracted 100 × 100 TF regulatory networks for each dataset. We regarded these TF regulatory networks as correct networks for each dataset and calculated the AUC values of the inferred networks. The AUC values were calculated by regarding the directed edges that show higher absolute values as representing reliable regulatory relationships. We removed self-loop regulation and TFs that do not have an edge in the correct network from AUC calculation in order to avoid biases.
3 Results
3.1 Selection of the size of (D) and reproducibility of
Our model was overfitted to the training data, and the inferred was unstable with needlessly large D. Additionally, the model cannot reconstruct expression dynamics with insufficiently small values of D. Therefore, the selection of appropriate values for D is necessary, and we applied SCODE to training data and evaluated the validity of the optimized model on the basis of the RSS of independent test data for various values of D (D = 2, 4, 6 and 8). For each D, we executed SCODE 100 times independently, and the first, second and third quantiles of the RSS values of test data are shown in Figure 2(a). For every dataset, the median of RSS is almost saturated at D = 4.
Fig. 2.
The first, second and third quantiles of the RSS values of test data (a) and the correlations among optimized of the top 50 replicates (b) for each D (D = 2, 4, 6 and 8) for each dataset
Because we used random sampling during optimization, we validated the reproducibility of the optimized . We calculated the correlation coefficient among optimized for the top 50 replicates (in ascending order of RSS values) of test data for each D. The corresponding first, second and third quantiles of correlation coefficients are shown in Figure 2(b). For D = 4, the medians of the correlation coefficients are 0.71, 0.94 and 0.88 for each dataset. The medians tend to decrease for large D because the matrix is unstable with needlessly large D. The medians also decrease for small D, possibly because the optimized is trapped in local optima. In summary, the correlations among replicates are high, and therefore, an optimized matrix is stable for D = 4.
Because the RSS values for test data are almost saturated and the estimated are stable with D = 4, we used D = 4 unless otherwise specified. For optimized of each D, we used the mean of optimized of the top 50 replicates, hereafter.
3.2 Validation of optimization with simulation data
Next, we investigated whether SCODE can infer genuine by using simulated data. Because the dynamics of become unrealistic with randomly determined , we used previously inferred (for D = 4) as genuine and simulated data with the same condition for each dataset (such as the same pseudo-time). We also added uniform random numbers () to simulated data as a noise term. We optimized for each simulated dataset 100 times, and Figure 3 shows the first, second and third quantiles of the correlation coefficients between the genuine and optimized for each D. The medians are 0.70, 0.71 and 0.91 for D = 4, and 0.61, 0.48 and 0.49 for D = 6. Therefore, SCODE can accurately infer the genuine with appropriate D, and can roughly infer with slightly different D values unless we set extremely large or small D.
Fig. 3.
The first, second and third quantiles of the correlation coefficients between genuine and inferred for each D
3.3 Reconstruction of expression dynamics
Although RSS values for test data were almost saturated at D = 4, this does not necessarily mean that SCODE can successfully learn the dynamics. Next, we investigated whether the optimized ODE can accurately reconstruct observed expression dynamics to verify the optimization of SCODE (Fig. 1(e)). For each set of dynamics, the initial values ( at t = 0) were set to the mean expression of 0-h or day 0 cells. At first, we compared the reconstructed dynamics with observed data in the principal component analysis (PCA) space (Fig. 4). For every dataset, SCODE was able to reconstruct the dynamics with .
Fig. 4.
PCA of scRNA-Seq data for each dataset. Each circle represents a cell, and its color represents experimental time (from light gray to black). The reconstructed expression dynamics are projected onto PCA space and are represented by colored lines (green, yellow, orange and red correspond to D = 2, 4, 6 and 8, respectively)
Next, we compared the reconstructed dynamics with observed expression dynamics for some TFs (Sox2, Utf1, Epas1 and Foxq1) in Data1 (Fig. 5). The analysis for every TF and dataset is described in the Supplementary text. Although the reconstructed dynamics of SCODE with D = 2 differ from the observed data, the model with successfully reconstructed complicated dynamics, such as transient patterns. Therefore, we concluded that SCODE can successfully optimize and learn the ODE of
Fig. 5.
Observed expression of four TFs and reconstructed dynamics for each D (green, yellow, orange and red correspond to D = 2, 4, 6 and 8, respectively). The x-axis represents pseudo-time and y-axis represents log(TPM + 1)
3.4 Validation of inferred network
We also evaluated the inferred network of each algorithm including the correlation network by comparing them to TF regulatory networks based on DNaseI footprints and TF-binding motifs (see Section 2.4). Because the runtimes of Jump3 are significantly large for large numbers of cells, we used 25 cells at even intervals in the pseudo-time order as the data for Jump3. The AUC values of each method for each dataset are shown in Table 1.
Table 1.
The AUC values of each method for each dataset
SCODE | lm | msgps | Cor | GENIE3 | Jump3 | |
---|---|---|---|---|---|---|
Data1 | 0.536 | 0.480 | 0.510 | 0.505 | 0.474 | 0.504 |
Data2 | 0.581 | 0.489 | 0.516 | 0.492 | 0.472 | 0.492 |
Data3 | 0.523 | 0.480 | 0.499 | 0.524 | 0.522 | 0.501 |
Note: Cor is the correlation network.
For Data1 and Data2, the AUC values of SCODE are significantly larger than those of the other algorithms. This is because our model considers the dynamics of expression and fully uses time information. Although Jump3 is also designed for time-course expression data, the AUC values are not high. This is because Jump3 is not designed for scRNA-Seq conducted during differentiation, but is designed for multiple time-course data. This suggests the necessity of a novel computational algorithm designed for scRNA-Seq data.
The performance of SCODE is second, but almost equal to the best performance for Data3. Given that the reconstructed path in PCA space is a little out of alignment for Data3 (Fig. 4), our model based on linear ODEs might be slightly insufficient to describe the expression dynamics of Data3.
In summary, our algorithm can infer TF regulatory networks with high performance in comparison to other network inference algorithms, especially for Data1 and Data2. This result implies the importance of time parameters in network inference and the necessity of a novel network inference algorithm designed for scRNA-Seq data obtained during differentiation.
We also compared the result of SCODE of Data3 with TRRUST, which is a human transcriptional regulatory network database based on small-scale experimental studies (Han et al., 2015). As a result, SCODE successfully identified the relationship included in TRRUST that GATA3, which is the known marker for the differentiation of Data3, positively regulate ZEB1 (see Supplementary text for the detailed explanation).
3.5 Runtimes
We investigated the runtime of each method and the runtimes for Data1 are shown in Table 2. The runtime of Jump3 is calculated using the data from 25 cells as stated above. The runtime of SCODE is 11 seconds and is significantly smaller than that of Jump3. Moreover, the runtime of SCODE is smaller than those of msgps and GENIE3, which do not consider time dynamics. These results show that SCODE can infer regulatory networks efficiently, even though it considers a time parameter in its model.
Table 2.
The runtimes of each method for Data1 (456 cells)
SCODE | lm | msgps | Cor | GENIE3 | Jump3 | |
---|---|---|---|---|---|---|
Runtime (s) | 11 | 1.0 | 40 | 0.73 | * |
Note: The runtime of Jump3 is calculated using data from 25 cells. The computations were performed on a MacBook Pro equipped with a 3.1 GHz Intel Core i7 processor and 16 GB of 1867 MHz DDR3 RAM.
3.6 Network analysis
Lastly, we investigated the structure of the inferred regulatory network of Data1. At first, we defined the threshold α as the value of the 1000th largest absolute value in , and we counted the number of positive edges () and negative edges () for TF j. Figure 6(a) shows the total counts for each TF in decreasing order. About 39% of edges are included in the top 10 TFs, and this result implies the existence of key regulators for differentiation. Interestingly, most TFs mainly have either positive or negative edges, and this result suggests that TFs might mainly work as either activators or inhibitors in differentiation. This tendency was shared with Data3, but was not seen in Data2 (see Supplementary text). This result might reflect a difference in the systems; Data1 and Data3 represent differentiation from ES cells, while Data2 represents direct reprogramming from MEF cells.
Fig. 6.
(a) Bar graph of positive and negative edges of each TF in decreasing order. For visibility, only the top 60 TFs are shown (see Supplementary text for plot of all TFs). (b) Bar graph of the top 10 TFs
We also visualized the top 10 TF results Figure 6(b). Interestingly, Dnmt3a and Dnmt3b, which are the de novo DNA methyltransferase have several positive edges. Data1 is derived from scRNA-Seq obtained from cells differentiating from mouse ES cells into PrE cells. To maintain the pluripotency of ES cells, Dnmt3a and Dnmt3b seem dispensable, and these genes must be unimportant for ES cells (Tsumura et al., 2006). However, several studies have suggested the importance of Dnmt3a and Dnmt3b in differentiation. For example, these TFs restrict the lineage-specific function of TFs during differentiation via DNA methylation (Oda et al., 2013). In addition, Dnmt3a is essential for hematopoietic stem cell differentiation and it seems to enhance differentiation by epigenetic silencing of multipotency genes (Challen et al., 2012). Thus, Dnmt3a and Dnmt3b are known to affect differentiation based on de novo DNA methylation.
In this study, these genes were inferred to regulate several TFs positively. Because DNA methylation essentially silences expression, these TFs might be regulated positively indirectly via the inactivation of negative regulators of these TFs. Although the direct targets of Dnmt3a and Dnmt3b are obscure, our result suggests that they are the key regulators of this differentiation.
4 Discussion
The advancement of scRNA-Seq and the analysis of differentiation reconstruction and pseudo-time have elucidated differentiation mechanisms. The inference of regulatory networks associated with differentiation is necessary to further our understanding of differentiation and development. In the inference of regulatory networks, it is important to fully use pseudo-time information and expression dynamics. However, there are no efficient algorithms for inferring the regulatory networks of many TFs from continuous time expression data. Thus, we developed SCODE, an efficient algorithm based on linear ODEs. SCODE is based on the transformation of linear ODEs and linear regression, and the time complexity is significantly small.
We applied SCODE to three scRNA-Seq datasets during differentiation and showed that SCODE can successfully optimize ODEs so that these ODEs can reconstruct observed expression dynamics. In the validation of the inferred network, the AUC values of SCODE were higher than those of other methods in almost of all cases. The runtime of SCODE is significantly smaller than that of Jump3, which also infers networks from time-course data. Additionally, SCODE is faster than GENIE3, which does not use time information. These performance results show the efficiency of SCODE.
Single-cell sequencing technologies are developing rapidly, and the number of scRNA-Seq datasets produced from differentiating cells will therefore increase. Our novel and efficient method for inferring regulatory networks demonstrated high performance and will therefore enhance the analysis of regulatory networks.
Moreover, our model can reconstruct expression dynamics accurately. This means that we can simulate expression dynamics (such as those associated with the knockout of a TF) by using an optimized model, and such simulation-based analyses will be useful for many types of research, such as detection of drivers of differentiation. Thus, SCODE is useful not only for regulatory network inference, but also for various analyses using simulation, and therefore, our research is a promising computational tool for further single-cell sequence analyses.
Supplementary Material
Acknowledgements
The authors thank Yohei Sasagawa, Hiroki Danno, Masashi Ebisawa, Mana Umeda and Haruka Ozaki for assistance in this study. We also thank Tsukasa Fukunaga for critically reading the manuscript and Suguru Yaginuma for helpful discussions about our algorithm.
Funding
This work was supported by a Grant-in-Aid for Japan Society for the Promotion of Science (JSPS) Fellows, JSPS KAKENHI (Grant Number 16J05079, 15H01465, 25134701, 25870190, and 16H01532), and Core Research for Evolutional Science and Technology (CREST) “Creation of Fundamental Technologies for Understanding and Control of Biosystem Dynamics” from the Japan Science and Technology Agency (JST).
Conflict of Interest: none declared.
References
- Aymoz D. et al. (2016) Real-time quantification of protein expression at the single-cell level via dynamic protein synthesis translocation reporters. Nat. Commun., 7, 11304.. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Bansal M. et al. (2006) Inference of gene regulatory networks and compound mode of action from time course gene expression profiles. Bioinformatics, 22, 815–822. [DOI] [PubMed] [Google Scholar]
- Buettner F. et al. (2015) Computational analysis of cell-to-cell heterogeneity in single-cell RNA-sequencing data reveals hidden subpopulations of cells. Nat. Biotechnol., 33, 155–160. [DOI] [PubMed] [Google Scholar]
- Burns J.C. a.o. (2015) Single-cell RNA-Seq resolves cellular complexity in sensory organs from the neonatal inner ear. Nat Commun., 6, 8557.. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Challen G.A. et al. (2012) Dnmt3a is essential for hematopoietic stem cell differentiation. Nat. Genet., 44, 23–31. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Chen H. et al. (2015) Single-cell transcriptional analysis to uncover regulatory circuits driving cell fate decisions in early mouse development. Bioinformatics, 31, 1060–1066. [DOI] [PubMed] [Google Scholar]
- Chu L.F. et al. (2016) Single-cell RNA-seq reveals novel regulators of human embryonic stem cell differentiation to definitive endoderm. Genome Biol., 17, 173.. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Di Bernardo D. et al. (2004) Robust identification of large genetic networks. Pac. Symp. Biocomput., 9, 486–497. [DOI] [PubMed] [Google Scholar]
- Eckersley-Maslin M.A. et al. (2016) MERVL/Zscan4 network activation results in transient genome-wide DNA demethylation of mESCs. Cell Rep., 17, 179–192. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Gardner T.S. et al. (2003) Inferring genetic networks and identifying compound mode of action via expression profiling. Science, 301, 102–105. [DOI] [PubMed] [Google Scholar]
- Grun D. et al. (2015) Single-cell messenger RNA sequencing reveals rare intestinal cell types. Nature, 525, 251–255. [DOI] [PubMed] [Google Scholar]
- Han H. et al. (2015) TRRUST: a reference database of human transcriptional regulatory interactions. Sci. Rep., 5, 11432.. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Hirose K. (2012) msgps: Degrees of freedom of elastic net, adaptive lasso and generalized elastic net. R package version 1.3.
- Huynh-Thu V.A., Sanguinetti G. (2015) Combining tree-based and dynamical systems for the inference of gene regulatory networks. Bioinformatics, 31, 1614–1622. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Huynh-Thu V.A. et al. (2010) Inferring regulatory networks from expression data using tree-based methods. PLoS ONE, 5, e12776. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ji Z., Ji H. (2016) TSCAN: Pseudo-time reconstruction and evaluation in single-cell RNA-seq analysis. Nucleic Acids Res., 44, e117.. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Jiang L. et al. (2016) GiniClust: detecting rare cell types from single-cell gene expression data with Gini index. Genome Biol., 17, 144.. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kanamori M. et al. (2004) A genome-wide and nonredundant mouse transcription factor database. Biochem. Biophys. Res. Commun., 322, 787–793. [DOI] [PubMed] [Google Scholar]
- Kolodziejczyk A.A. et al. (2015) The technology and biology of single-cell RNA sequencing. Mol. Cell, 58, 610–620. [DOI] [PubMed] [Google Scholar]
- Koussounadis A. et al. (2015) Relationship between differentially expressed mRNA and mRNA-protein correlations in a xenograft model system. Sci. Rep., 5, 10775.. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lee W.P., Tzou W.S. (2009) Computational methods for discovering gene networks from expression data. Brief. Bioinf., 10, 408–423. [DOI] [PubMed] [Google Scholar]
- Li J. et al. (2016) Systematic reconstruction of molecular cascades regulating GP development using single-cell RNA-Seq. Cell Rep., 15, 1467–1480. [DOI] [PubMed] [Google Scholar]
- Lim C.Y. et al. (2016) BTR: training asynchronous Boolean models using single-cell expression data. BMC Bioinformatics, 17, 355.. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lundberg E. et al. (2010) Defining the transcriptome and proteome in three functionally different human cell lines. Mol. Syst. Biol., 6, 450.. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Maier T. et al. (2009) Correlation of mRNA and protein in complex biological samples. FEBS Lett., 583, 3966–3973. [DOI] [PubMed] [Google Scholar]
- Marbach D. et al. (2012) Wisdom of crowds for robust gene network inference. Nat. Methods, 9, 796–804. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Matsumoto H., Kiryu H. (2016) SCOUP: a probabilistic model based on the Ornstein-Uhlenbeck process to analyze single-cell expression data during differentiation. BMC Bioinformatics, 17, 232.. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Moignard V. et al. (2015) Decoding the regulatory network of early blood development from single-cell gene expression measurements. Nat. Biotechnol., 33, 269–276. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Morisaki T. et al. (2016) Real-time quantification of single RNA translation dynamics in living cells. Science, 352, 1425–1429. [DOI] [PubMed] [Google Scholar]
- Neph S. et al. (2012) Circuitry and dynamics of human transcription factor regulatory networks. Cell, 150, 1274–1286. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ocone A. et al. (2015) Reconstructing gene regulatory dynamics from high-dimensional single-cell snapshot data. Bioinformatics, 31, 89–96. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Oda M. et al. (2013) DNA methylation restricts lineage-specific functions of transcription factor Gata4 during embryonic stem cell differentiation. PLoS Genet., 9, e1003574.. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Pina C. et al. (2015) Single-cell network analysis identifies DDIT3 as a nodal lineage regulator in hematopoiesis. Cell. Rep., 11, 1503–1510. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Shimosato D. et al. (2007) Extra-embryonic endoderm cells derived from ES cells induced by GATA factors acquire the character of XEN cells. BMC Dev. Biol., 7, 80.. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Stergachis A.B. et al. (2014) Conservation of trans-acting circuitry during mammalian regulatory evolution. Nature, 515, 365–370. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Trapnell C. (2015) Defining cell types and states with single-cell genomics. Genome Res., 25, 1491–1498. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Trapnell C. et al. (2014) The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nat. Biotechnol., 32, 381–386. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Treutlein B. et al. (2014) Reconstructing lineage hierarchies of the distal lung epithelium using single-cell RNA-seq. Nature, 509, 371–375. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Treutlein B. et al. (2016) Dissecting direct reprogramming from fibroblast to neuron using single-cell RNA-seq. Nature, 534, 391–395. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Tsumura A. et al. (2006) Maintenance of self-renewal ability of mouse embryonic stem cells in the absence of DNA methyltransferases Dnmt1, Dnmt3a and Dnmt3b. Genes Cells, 11, 805–814. [DOI] [PubMed] [Google Scholar]
- Wang D. et al. (2016) DREISS: Using State-Space Models to Infer the Dynamics of Gene Expression Driven by External and Internal Regulatory Networks. PLoS Comput. Biol., 12, e1005146.. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Woodhouse S. et al. (2016) Processing, visualising and reconstructing network models from single-cell data. Immunol. Cell Biol., 94, 256–265. [DOI] [PubMed] [Google Scholar]
- Zeisel A. et al. (2015) Brain structure. Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq. Science, 347, 1138–1142. [DOI] [PubMed] [Google Scholar]
- Zhang H.M. et al. (2015) AnimalTFDB 2.0: a resource for expression, prediction and functional study of animal transcription factors. Nucleic Acids Res., 43, 76–81. [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.