Abstract
The rapid advancement of single-cell technologies has shed new light on the complex mechanisms of cellular heterogeneity. However, compared to bulk RNA sequencing (RNA-seq), single-cell RNA-seq (scRNA-seq) suffers from higher noise and lower coverage, which brings new computational difficulties. Based on statistical independence, the cell-specific network (CSN) method can quantify the overall associations between genes for each cell, yet it overestimates these associations because indirect effects are included. To overcome this problem, we propose the c-CSN method, which constructs a conditional cell-specific network (CCSN) for each cell. c-CSN measures the direct associations between genes by eliminating the indirect associations, and it can be used for network-based cell clustering and dimension reduction of single cells. Intuitively, each CCSN can be viewed as the transformation from less “reliable” gene expression to more “reliable” gene–gene associations in a cell. Based on CCSN, we further design network flow entropy (NFE) to estimate the differentiation potency of a single cell. A number of scRNA-seq datasets were used to demonstrate the advantages of our approach: 1) one direct association network is generated for each cell; 2) most existing scRNA-seq methods designed for gene expression matrices are also applicable to c-CSN-transformed degree matrices; and 3) CCSN-based NFE helps resolve the direction of differentiation trajectories by quantifying the potency of each cell. c-CSN is publicly available at https://github.com/LinLi-0909/c-CSN.
Keywords: Network flow entropy, Cell-specific network, Single-cell network, Direct association, Conditional independence
Introduction
With the development of high-throughput single-cell RNA sequencing (scRNA-seq), novel cell populations in complex tissues [1], [2], [3], [4], [5] can be identified and the differentiation trajectory of cell states [6], [7], [8] can be obtained, which opens a new way to understand the heterogeneity and transition of cells [9], [10], [11]. However, compared to traditional bulk RNA-seq data, the prevalence of high technical noise and dropout events is a major problem in scRNA-seq [12], [13], [14], [15], [16], [17], which raises substantial challenges for data analysis. To analyze high-dimensional scRNA-seq data, principal component analysis (PCA), nonnegative matrix factorization (NMF), and t-distributed Stochastic Neighbor Embedding (t-SNE) are widely used for dimension reduction. Subsequently, clustering methods such as hierarchical clustering, K-means, SNN-Cliq [18], Corr [19], SC3 [20], and SIMLR [21] can be applied to identify potential cell types, which are further corroborated with known marker genes. For developmental or differentiation studies, trajectory inference methods such as Monocle [22], TSCAN [23], and DPT [24] can be used to order cells along a pseudotemporal trajectory. Besides these approaches, several methods have been developed to offer special treatments of the dropouts in scRNA-seq data. One way is to explicitly model the dropout events during dimension reduction, e.g., the zero-inflated factor analysis model developed in ZIFA [25]. Another way is to incorporate biological information, especially functional gene–gene association networks. In this direction, SCRL [26] takes another step forward by leveraging gene–gene interactions to learn a more meaningful low-dimensional projection. A recent method, netNMF-sc [27], derives a factorization or clustering that is robust to dropouts by regularizing the original NMF model with a given gene correlation network.
Furthermore, gene–gene correlations can also be employed to directly estimate the ‘true’ expression values for observed zero counts, which is known as the data imputation approach, as exemplified by several well-known methods including SAVER [28], MAGIC [29], and scImpute [30]. However, data imputation has some limitations, such as over-imputation of genes that are unexpressed in certain cell types and the introduction of artificial effects that may confound downstream analyses [31].
Several network inference algorithms have also been developed for scRNA-seq. MTGO-SC [32] can detect the network modules of genes for each cell cluster by combining information from network structures and gene annotations. SCODE [33] can construct regulatory networks and expression dynamics through linear ordinary differential equations (ODEs). These methods only infer the network of a group or cluster of cells, and do not construct networks for individual cells. Recently, the cell-specific network (CSN) method was proposed [34], which elegantly infers a network for each cell from scRNA-seq data. Moreover, unlike imputation methods, CSN employs a data transformation strategy, and successfully transforms the noisy and “unreliable” gene expression data to the more “reliable” gene association data, thereby alleviating the dropout problem to a certain extent. The network degree matrix (NDM) derived from CSN can be further applied in downstream single-cell analyses, where it performs better than traditional expression-based methods in terms of robustness and accuracy. CSN is able to identify the dependency between two genes from single-cell data based on statistical independence. However, CSN overestimates gene–gene associations because it includes both direct and indirect associations, the latter arising from interactive effects of other genes in the network. In other words, a gene pair without direct association can be falsely identified as linked just because both genes have true associations with some other genes. Thus, the gene–gene network of a cell constructed by CSN may be much denser than the real molecular network in this cell, in particular when there are many complex associations among genes.
To overcome these shortcomings of CSN, we introduce a novel computational method c-CSN, which can construct a conditional cell-specific network (CCSN) from scRNA-seq data. Specifically, c-CSN identifies direct associations between genes by filtering out indirect associations in the gene–gene network based on conditional independence. Thus, c-CSN can transform the original gene expression data of each cell to the direct and robust gene–gene association data (or network data) of the same cell. In this study, we first demonstrate that the transformed gene–gene association data not only are fully compatible with traditional analyses such as dimension reduction and clustering, but also enable us to delineate the CSN topology and its dynamics along developmental trajectories. Then, by defining the network flow entropy (NFE) on the gene–gene association data of each cell based on c-CSN, we estimate the differentiation potency of individual cells. We show that NFE can illustrate the lineage dynamics of cell differentiation by quantifying the differentiation potency of cells, which is also one of the most challenging tasks in developmental biology.
Method
Assume that x and y are two random variables, and z is a third random variable. If x and y are independent, then
$$p(x, y) = p(x)\,p(y) \tag{1}$$
where $p(x, y)$ is the joint probability distribution of x and y, and $p(x)$ and $p(y)$ are the marginal probability distributions of x and y, respectively.
If x and y are conditionally independent given z, then
$$p(x, y \mid z) = p(x \mid z)\,p(y \mid z) \tag{2}$$
where $p(x, y \mid z)$ is the joint probability distribution of x and y given z, and $p(x \mid z)$ and $p(y \mid z)$ are the conditional marginal probability distributions. Note that Equations (1) and (2) are necessary and sufficient conditions for mutual independence and conditional independence, respectively. Here, we define
$$\rho_{xy} = p(x, y) - p(x)\,p(y) \tag{3}$$
$$\rho_{xy \mid z} = p(x, y \mid z) - p(x \mid z)\,p(y \mid z) \tag{4}$$
The original CSN method uses $\rho_{xy}$ to distinguish the independency and association between x and y (File S1 Note 1). However, if two variables x and y without a direct association are both associated with a third random variable z, $\rho_{xy}$ cannot measure the direct independency because there is an indirect association between x and y. In other words, the associations defined by CSN or Equation (3) include both direct and indirect dependencies, thus resulting in the overestimation of gene–gene associations. To overcome this problem of CSN, we develop a novel method, c-CSN, which measures the direct gene–gene associations based on the conditional independency $\rho_{xy \mid z}$, i.e., Equation (4), by filtering out the indirect associations in the reconstructed network. The computational framework of c-CSN is shown in Figure 1, and the method is described in the following sections.
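The overestimation can be made concrete with a toy distribution. The following is a minimal sketch, not taken from the paper: z is assumed to be a fair binary variable, and x and y each copy z with probability 0.9, independently of each other given z. The function names `rho_xy` and `rho_xy_given_z` are illustrative and evaluate the quantities of Equations (3) and (4) exactly.

```python
from itertools import product

# Toy model (an assumption for illustration): z is a fair coin, and x and y
# each equal z with probability COPY, independently of each other given z.
P_Z = {0: 0.5, 1: 0.5}
COPY = 0.9

def p_given_z(v, z):
    """P(v | z) for a variable that copies z with probability COPY."""
    return COPY if v == z else 1.0 - COPY

def joint(x, y, z):
    """P(x, y, z) under the toy model."""
    return P_Z[z] * p_given_z(x, z) * p_given_z(y, z)

def rho_xy(x, y):
    """Equation (3): p(x, y) - p(x) p(y), marginalizing over z."""
    pxy = sum(joint(x, y, z) for z in (0, 1))
    px = sum(joint(x, b, z) for b, z in product((0, 1), repeat=2))
    py = sum(joint(a, y, z) for a, z in product((0, 1), repeat=2))
    return pxy - px * py

def rho_xy_given_z(x, y, z):
    """Equation (4): p(x, y | z) - p(x | z) p(y | z)."""
    pz = sum(joint(a, b, z) for a, b in product((0, 1), repeat=2))
    pxy_z = joint(x, y, z) / pz
    px_z = sum(joint(x, b, z) for b in (0, 1)) / pz
    py_z = sum(joint(a, y, z) for a in (0, 1)) / pz
    return pxy_z - px_z * py_z

print(rho_xy(1, 1))             # about 0.16: indirect association through z
print(rho_xy_given_z(1, 1, 1))  # zero up to floating-point rounding
```

In this toy case the unconditional measure reports an association that exists only through z, while the conditional measure vanishes, which is the behavior c-CSN relies on.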
Probability distribution estimation
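We numerically estimate the value of $\rho_{xy \mid z}$ by making a scatter diagram based on gene expression data. Suppose there are m genes and n cells in the data. We depict the expression values of gene x, gene y, and the conditional gene z in a three-dimensional space (Figure S1A–G), where each dot represents one cell. First, we draw two parallel planes orthogonal to the z axis near the dot k to represent the upper and lower bounds of the neighborhood of $z_k$. The number of dots in the space between the two parallel planes (i.e., the neighborhood of $z_k$) is $n_z^{(k)}$ (Figure S1D). This gives a subspace conditioned on gene z. Then, we draw four other planes near the dot k, where two planes are orthogonal to the x axis and the other two planes are orthogonal to the y axis. We obtain the neighborhoods of $(x_k, z_k)$, $(y_k, z_k)$, and $(x_k, y_k, z_k)$ from the intersection space of the six planes (Figure S1E–G), where the numbers of dots are $n_{xz}^{(k)}$, $n_{yz}^{(k)}$, and $n_{xyz}^{(k)}$, respectively. Then, we can get the estimation of probability distributions:
$$p(x \mid z) \approx \frac{n_{xz}^{(k)}}{n_z^{(k)}}, \qquad p(y \mid z) \approx \frac{n_{yz}^{(k)}}{n_z^{(k)}}, \qquad p(x, y \mid z) \approx \frac{n_{xyz}^{(k)}}{n_z^{(k)}}$$
Based on Equation (4), we construct a statistic
$$\rho_{xy \mid z}^{(k)} = \frac{n_{xyz}^{(k)}}{n_z^{(k)}} - \frac{n_{xz}^{(k)}\,n_{yz}^{(k)}}{\left(n_z^{(k)}\right)^{2}} \tag{5}$$
to measure the conditional independence between gene x and gene y on the condition of gene z in cell k. When gene x and gene y are conditionally independent given gene z, the expectation $\mu_{xy \mid z}^{(k)}$ and standard deviation $\sigma_{xy \mid z}^{(k)}$ (File S1) of the statistic can be obtained:
$$\mu_{xy \mid z}^{(k)} = 0, \qquad \sigma_{xy \mid z}^{(k)} = \sqrt{\frac{n_{xz}^{(k)}\,n_{yz}^{(k)}\left(n_z^{(k)} - n_{xz}^{(k)}\right)\left(n_z^{(k)} - n_{yz}^{(k)}\right)}{\left(n_z^{(k)}\right)^{4}\left(n_z^{(k)} - 1\right)}}$$
Then, we normalize the statistic as
$$\hat{\rho}_{xy \mid z}^{(k)} = \frac{\rho_{xy \mid z}^{(k)} - \mu_{xy \mid z}^{(k)}}{\sigma_{xy \mid z}^{(k)}} = \frac{\sqrt{n_z^{(k)} - 1}\,\left(n_z^{(k)}\,n_{xyz}^{(k)} - n_{xz}^{(k)}\,n_{yz}^{(k)}\right)}{\sqrt{n_{xz}^{(k)}\,n_{yz}^{(k)}\left(n_z^{(k)} - n_{xz}^{(k)}\right)\left(n_z^{(k)} - n_{yz}^{(k)}\right)}} \tag{6}$$
If gene x and gene y are conditionally independent on the condition of gene z, it can be proved that the normalized statistic follows the standard normal distribution (File S1 Note 1; Figure S2), and it is less than or equal to 0 when gene x and y are conditionally independent (File S1 Note 2).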
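The box-counting procedure can be sketched as follows. This is a minimal illustration that assumes the count-based estimates described above (the conditional probabilities are approximated by ratios of neighborhood counts) and the standard deviation obtained from a hypergeometric counting argument. The fixed `half_width` parameter is a hypothetical simplification; the neighborhood sizes actually used by c-CSN are described in File S1 and are not reproduced here.

```python
import math

def conditional_statistic(x, y, z, k, half_width):
    """Raw and normalized conditional-independence statistics for cell k.

    x, y, z are equal-length lists of expression values (one entry per cell).
    half_width is a hypothetical fixed half-width for every neighborhood.
    Returns (rho, rho_hat); rho_hat is 0.0 when the variance is degenerate.
    """
    in_x = [abs(v - x[k]) <= half_width for v in x]
    in_y = [abs(v - y[k]) <= half_width for v in y]
    in_z = [abs(v - z[k]) <= half_width for v in z]

    # Neighborhood counts; cell k always lies in its own neighborhoods.
    n_z = sum(in_z)
    n_xz = sum(a and c for a, c in zip(in_x, in_z))
    n_yz = sum(b and c for b, c in zip(in_y, in_z))
    n_xyz = sum(a and b and c for a, b, c in zip(in_x, in_y, in_z))

    # Raw statistic: estimate of p(x,y|z) - p(x|z) p(y|z).
    rho = n_xyz / n_z - (n_xz * n_yz) / n_z ** 2

    if n_z < 2:
        return rho, 0.0
    # Variance of rho under conditional independence.
    var = n_xz * n_yz * (n_z - n_xz) * (n_z - n_yz) / (n_z ** 4 * (n_z - 1))
    if var == 0:
        return rho, 0.0
    return rho, rho / math.sqrt(var)
```

Calling `conditional_statistic(x, y, z, k, half_width)` once per gene triple and cell yields the value that is later compared with the normal quantile; returning 0.0 for degenerate neighborhoods is a design choice of this sketch, not a rule stated in the text.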
Construction of CCSN
To estimate the conditional independency of gene x and gene y given the conditional gene z in cell k, we first use CSN or Equation (3) to test the independence of gene x and gene y, and we then use the following hypothesis test. $H_0$: gene x and gene y are conditionally independent given gene z in cell k. $H_1$: gene x and gene y are conditionally dependent given gene z in cell k. If the normalized statistic $\hat{\rho}_{xy \mid z}^{(k)}$ is larger than $N_\alpha$ (where $\alpha$ is the significance level and $N_\alpha$ is the upper $\alpha$ quantile of the standard normal distribution), the null hypothesis will be rejected and then $\mathrm{edge}_{xy \mid z}^{(k)} = 1$ ($\mathrm{edge}_{xy \mid z}^{(k)}$ is the edge weight of genes x and y on condition of gene z).
$$\mathrm{edge}_{xy \mid z}^{(k)} = \begin{cases} 1, & \hat{\rho}_{xy \mid z}^{(k)} > N_\alpha \\ 0, & \text{otherwise} \end{cases} \tag{7}$$
All gene pairs can be tested for conditional independence given gene z in cell k, which yields the CCSN of cell k given conditional gene z.
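The edge decision is a one-sided test against a standard normal quantile. A minimal sketch follows, assuming the threshold is the upper-alpha point of the standard normal distribution; the function name `edge_from_statistic` and the default `alpha=0.01` are illustrative choices, not values quoted from the text.

```python
from statistics import NormalDist

def edge_from_statistic(rho_hat, alpha=0.01):
    """Return 1 if conditional independence is rejected at level alpha, else 0.

    rho_hat is the normalized statistic, assumed standard normal under the
    null hypothesis. alpha = 0.01 is an illustrative default.
    """
    threshold = NormalDist().inv_cdf(1.0 - alpha)  # upper-alpha quantile
    return 1 if rho_hat > threshold else 0
```

With the default level the threshold is roughly 2.33, so a statistic of 3.0 produces an edge and a statistic of 1.0 does not.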
Then, to estimate the direct association between a pair of genes in a cell, theoretically we should use all the remaining m − 2 genes as conditional genes, which is computationally intensive. Suppose there are m genes in our analysis, then m(m − 1)/2 gene pairs should be tested. Fortunately, a molecular network is generally sparse, which means that a pair of genes (i.e., genes x and y) are expected to have a very small number of commonly interactive genes (as conditional genes z). In other words, numerically we can use a small number of conditional genes to identify the direct association between a pair of genes in a cell, which can significantly reduce the computational cost (File S1 Note 3; Table S1). For each gene pair in a cell, we choose G (1 ≤ G ≤ m − 2) genes as the conditional genes to test if the gene pair is conditionally independent or not. Generally, the conditional genes may be the key regulatory genes in a biological process, such as transcription factor genes and kinase genes. From a network viewpoint, these genes are usually hub genes in the gene–gene network, and the network degrees of these genes would be higher.
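The saving from using only G conditional genes can be illustrated with a rough count. The sketch below assumes one test per gene pair and conditional gene, and it ignores that a conditional gene cannot condition a pair it belongs to; the function name `n_conditional_tests` is hypothetical.

```python
def n_conditional_tests(m, G=None):
    """Approximate number of conditional-independence tests for one cell.

    m is the number of genes; G is the number of conditional genes per pair.
    With G=None, every remaining gene (m - 2) serves as a conditional gene.
    """
    pairs = m * (m - 1) // 2
    per_pair = (m - 2) if G is None else G
    return pairs * per_pair

print(n_conditional_tests(1000))       # all m - 2 conditional genes
print(n_conditional_tests(1000, G=2))  # two conditional genes
```

For m = 1000 this gives 498,501,000 tests per cell with all conditional genes versus 999,000 with G = 2.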
Practically, the conditional genes could be obtained from many available methods, such as highly expressed genes, highly variable genes, key transcription factor genes, and the hub genes in the CSN. For the c-CSN method, the conditional gene sets were defined by CSN. Two steps were used to obtain the conditional genes although other appropriate schemes can also be used.
1) For a given cell, we first construct a CSN without the consideration of conditional genes, where the edge between gene x and gene y in cell k is determined by the following hypothesis test:
$H_0$: gene x and gene y are independent in cell k.
$H_1$: gene x and gene y are dependent in cell k.
The statistic $\hat{\rho}_{xy}^{(k)}$ can be used to measure the independency of genes x and y (File S1 Note 1). If $\hat{\rho}_{xy}^{(k)}$ is larger than the threshold set by the significance level, we reject the null hypothesis and set $\mathrm{edge}_{xy}^{(k)} = 1$; otherwise $\mathrm{edge}_{xy}^{(k)} = 0$.
Then we use $D_z^{(k)}$ to measure the importance of conditional gene z in cell k:
$$D_z^{(k)} = \sum_{y = 1,\, y \neq z}^{m} \mathrm{edge}_{zy}^{(k)} \tag{8}$$
Equation (8) means that if a gene is connected to more other genes, this gene is more important.
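2) For a given cell k, we choose the G genes with the largest ‘importance’ as the conditional genes. We assume that the conditional gene set is $\{z_g,\ g = 1, \ldots, G\}$, and CCSN $C_{z_g}^{(k)}$ is obtained for cell k given conditional gene $z_g$. The CCSNs of the cell k on the condition of the gene set are $\{C_{z_g}^{(k)},\ g = 1, \ldots, G\}$. Then, we use
$$\bar{C}^{(k)} = \frac{1}{G} \sum_{g=1}^{G} C_{z_g}^{(k)} \tag{9}$$
to represent the degrees of gene–gene interaction network of cell k, where $c_{ij}^{(k)}$ for $i, j = 1, \ldots, m$ is the (i, j) element of the matrix $\bar{C}^{(k)}$.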
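The selection of conditional genes in the two steps above can be sketched as a ranking by CSN degree. This is a minimal illustration that assumes the CSN of one cell is given as a symmetric 0/1 adjacency matrix with a zero diagonal; the function name `top_conditional_genes` and the tie-breaking rule are assumptions of this sketch.

```python
def top_conditional_genes(csn, G):
    """Rank genes by CSN degree and return the indices of the top G genes.

    csn is a symmetric 0/1 adjacency matrix (list of lists) for one cell,
    with a zero diagonal. Ties are broken by gene index, which is an
    arbitrary choice made for this sketch.
    """
    degree = [sum(row) for row in csn]  # importance of each gene
    ranked = sorted(range(len(csn)), key=lambda g: (-degree[g], g))
    return ranked[:G]

# Toy CSN with four genes: gene 0 is the hub.
toy_csn = [
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
]
print(top_conditional_genes(toy_csn, 2))  # [0, 1]
```

Each selected gene would then serve as the conditional gene z when constructing one CCSN for the cell.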
For scRNA-seq data with all n cells, we can construct n CCSNs, which can be used for further dimension reduction and clustering. In other words, instead of the originally measured gene expression data with n cells, we use the n transformed CCSNs for further analysis.
Network degree matrix from CCSN
CCSN could be used for various biological studies by exploiting the gene–gene conditional association network from a network viewpoint. We transform Equation (9) to a conditional network degree vector based on the following transformation
$$\nu_i^{(k)} = \sum_{j=1}^{m} c_{ij}^{(k)} \tag{10}$$
where $c_{ij}^{(k)}$ is the (i, j) element of the network matrix of cell k in Equation (9). Then, for $k = 1, \ldots, n$, an m × n conditional network degree matrix (CNDM) is obtained.
$$\mathrm{CNDM} = \left(\nu_i^{(k)}\right)_{m \times n} \tag{11}$$
The matrix has the same dimension as the gene expression matrix (GEM), i.e., GEM = $\left(x_i^{(k)}\right)$ (with i = 1,⋯,m; k = 1,⋯,n), but CNDM can reflect the gene–gene direct association in terms of interaction degrees. Moreover, this CNDM matrix after normalization could be further analyzed by most traditional scRNA-seq methods for dimension reduction and clustering analysis. The input/output settings as well as application fields of our c-CSN method are listed in File S1 Note 4.
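The degree transformation can be sketched in a few lines. The example assumes each cell's network is available as an m × m matrix (list of lists) and stacks the per-cell row sums into a genes-by-cells matrix; the function names `degree_vector` and `build_cndm` are hypothetical.

```python
def degree_vector(network):
    """Row sums of one cell's m x m network: one degree per gene."""
    return [sum(row) for row in network]

def build_cndm(networks):
    """Stack per-cell degree vectors into an m x n matrix (genes x cells)."""
    columns = [degree_vector(net) for net in networks]
    m = len(columns[0])
    return [[col[i] for col in columns] for i in range(m)]

# Two toy cells with three genes each.
cell_1 = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]
cell_2 = [[0, 0, 0], [0, 0, 1], [0, 1, 0]]
print(build_cndm([cell_1, cell_2]))  # [[2, 0], [1, 1], [1, 1]]
```

Because the result has the same genes-by-cells layout as an expression matrix, it can be passed to tools that expect a GEM after whatever normalization those tools require.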
Network analysis of c-CSN
The relationship between gene pairs can be obtained by c-CSN at the single-cell level. c-CSN also provides a new way to build a gene–gene interaction network for each cell. The CNDM derived from CCSNs can be further used in dimension reduction, clustering, and NFE analysis by many existing methods.
Dimension reduction
We used PCA [35] and t-SNE [36] which respectively represent linear and nonlinear methods, to perform dimension reduction on public scRNA-seq datasets with known cell types.
Clustering
To evaluate the performance of c-CSN in clustering analysis, several traditional clustering methods, such as K-means, hierarchical clustering, and K-medoids, were applied. Furthermore, state-of-the-art scRNA-seq clustering methods such as SC3, SIMLR, and Seurat [20], [21], [37] were also used for comparison.
NFE analysis
Quantifying the differentiation potency of a single cell is one of the important tasks in scRNA-seq studies [15], [38], [39]. A recent study developed SCENT [40], which uses protein–protein interaction (PPI) network and gene expression data as input to obtain the potency of cells. However, SCENT depends on the PPI network, which may ignore many important relationships between genes in specific cells. In this study, we developed NFE to estimate the differentiation potency of a cell from its CSN or CCSN, which is constructed for each cell. The normalized gene expression profile and CSN/CCSN are used when we compute the NFE. The value of NFE is expected to be lower for differentiated cells, since differentiation is accompanied by activation of a specific subnetwork, which actually diverts the signaling flux from other parts of the network.
Estimating NFE requires a background network, which could be provided by CSN or CCSN. Based on CSN or CCSN, we know whether or not there is an edge between gene i and gene j. We assume that the weight of an edge between gene i and gene j, $\omega_{ij}$, is proportional to the normalized expression levels of gene i and gene j, $x_i$ and $x_j$, that is $\omega_{ij} \propto x_i x_j$ with $\omega_{ii} = 0$. These weights are interpreted as interaction probabilities. Then, we normalize the weighted network as a stochastic matrix, P = $(p_{ij})$ with
$$p_{ij} = \frac{a_{ij}\,x_j}{\sum_{l \in N(i)} x_l}$$
where $N(i)$ contains the neighbors of gene i, and A = $(a_{ij})$ is the CSN or CCSN ($a_{ij} = 1$ if i and j are connected, otherwise $a_{ij} = 0$).
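Then, we define the NFE as:
$$\mathrm{NFE} = -\sum_{i,j} x_i\,p_{ij} \log\left(x_i\,p_{ij}\right) \tag{12}$$
where $x_i$ is the normalized gene expression of gene i and $p_{ij}$ is the (i, j) element of the stochastic matrix P. From the definition, NFE is clearly different from network entropy.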
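A minimal sketch of the NFE computation for one cell is given below. It assumes that the entropy takes the form of a sum of $-x_i p_{ij} \log(x_i p_{ij})$ over all gene pairs, with $p_{ij}$ obtained by normalizing expression-weighted edges over the neighbors of gene i; both the function name `network_flow_entropy` and the handling of genes without expressed neighbors are assumptions of this sketch rather than details confirmed by the text.

```python
import math

def network_flow_entropy(expr, adj):
    """NFE of one cell from normalized expression and a 0/1 network.

    expr: list of non-negative normalized expression values, one per gene.
    adj:  symmetric 0/1 adjacency matrix (CSN or CCSN) with a zero diagonal.
    Assumed form: NFE = -sum_ij x_i p_ij log(x_i p_ij), where
    p_ij = a_ij x_j / sum_l a_il x_l. Zero-flow terms contribute nothing.
    """
    m = len(expr)
    nfe = 0.0
    for i in range(m):
        denom = sum(adj[i][l] * expr[l] for l in range(m))
        if denom == 0:
            continue  # isolated gene or silent neighborhood: no outgoing flow
        for j in range(m):
            flow = expr[i] * adj[i][j] * expr[j] / denom
            if flow > 0:
                nfe -= flow * math.log(flow)
    return nfe
```

For two connected genes with normalized expression 0.5 each, every gene sends all of its flow to the other, and the sketch returns log 2.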
Datasets used
Twelve scRNA-seq datasets and one bulk RNA-seq dataset [15], [40], [41], [42], [43], [44], [45], [46], [47] were used to validate our c-CSN method. The number of cells in these datasets ranges from 100 to 20,000. Table S2 gives a brief introduction of these datasets.
Results and discussion
Visualization and clustering of scRNA-seq datasets with CNDM
Characterizing cell heterogeneity is one of the important tasks for scRNA-seq data analysis. To test whether CCSN-transformed network data can help segregate cell types, we performed dimension reduction and clustering on the CNDMs of gold-standard scRNA-seq datasets, using algorithms widely employed in scRNA-seq studies. The numbers of conditional genes used in CCSN construction are listed in Table S2.
For visualizing the structure of these datasets in a two-dimensional space, we used the representative linear and nonlinear dimension reduction methods, PCA [48] and t-SNE [36], respectively. As shown in Figure 2 and Figure S3, CNDMs can separate different cell types clearly in the low-dimensional space by both PCA and t-SNE. Notably, they generally perform even better than GEM (Figure 2, Figure S3). Hence, the network data of CNDMs contain sufficient information for separating cell types in scRNA-seq datasets.
To quantitatively evaluate the power of CNDMs in cell type identification, we performed clustering on CNDMs and computed the adjusted Rand index (ARI) for each dataset against the known cell-type labels (File S1 Note 5; Figure S4). As shown in Table 1 and Figure S5, CNDM outperforms GEM in most method–dataset combinations. These results strongly support the notion that the CCSN-transformed network data are highly informative for characterizing single-cell populations. Interestingly, when further compared to NDM, CNDM also shows good performance (Table 2; Figure S6).
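For reference, the ARI between a clustering and the known labels can be computed from pair counts. The sketch below implements the standard pair-counting definition of the ARI; it is not the authors' evaluation script, and the function name `adjusted_rand_index` is illustrative.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two flat clusterings of the same cells."""
    n = len(labels_true)
    if n < 2:
        return 1.0  # no pairs to compare

    pair_counts = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)

    # Number of cell pairs placed together in both, in truth, and in prediction.
    sum_cells = sum(comb(c, 2) for c in pair_counts.values())
    sum_true = sum(comb(c, 2) for c in true_counts.values())
    sum_pred = sum(comb(c, 2) for c in pred_counts.values())

    expected = sum_true * sum_pred / comb(n, 2)
    max_index = 0.5 * (sum_true + sum_pred)
    if max_index == expected:
        return 1.0  # degenerate case: both partitions are trivial
    return (sum_cells - expected) / (max_index - expected)
```

An ARI of 1 indicates identical partitions up to label names, and values near 0 indicate agreement no better than chance.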
Table 1.
Method | Input | Buettner [15] | Kolodziejczyk [41] | Gokce [46] | Chu-time [42] | Chu-type [42] | Kim [43] |
---|---|---|---|---|---|---|---|
K-means | GEM | 0.29 | 0.54 | 0.42 | 0.17 | 0.22 | 0.20 |
K-means | CNDM | 0.87 | 0.85 | 0.75 | 0.45 | 0.57 | 0.81 |
Hierarchical | GEM | 0.32 | 0.49 | 0.47 | 0.22 | 0.22 | 0.12 |
Hierarchical | CNDM | 0.73 | 0.65 | 0.92 | 0.47 | 0.61 | 0.77 |
K-means (t-SNE) | GEM | 0.41 | 0.87 | 0.43 | 0.33 | 0.55 | 0.53 |
K-means (t-SNE) | CNDM | 0.95 | 0.91 | 0.36 | 0.56 | 0.70 | 0.93 |
Hierarchical (t-SNE) | GEM | 0.55 | 0.99 | 0.50 | 0.39 | 0.67 | 0.73 |
Hierarchical (t-SNE) | CNDM | 0.95 | 0.99 | 0.39 | 0.61 | 0.80 | 0.95 |
K-medoids | GEM | 0.23 | 0.29 | 0.40 | 0.33 | 0.33 | 0.79 |
K-medoids | CNDM | 0.53 | 0.63 | 0.81 | 0.17 | 0.38 | 0.61 |
SC3 | GEM | 0.89 | 1 | 0.56 | 0.66 | 0.78 | 0.89 |
SC3 | CNDM | 0.98 | 0.72 | 0.72 | 0.63 | 0.98 | 0.96 |
SIMLR | GEM | 0.89 | 0.49 | 0.43 | 0.30 | 0.48 | 0.38 |
SIMLR | CNDM | 0.63 | 0.52 | 0.85 | 0.58 | 0.54 | 0.95 |
Seurat | GEM | 0.67 | 0.43 | 0.35 | 0.52 | 0.52 | 0.41 |
Seurat | CNDM | 0.90 | 0.56 | 0.32 | 0.56 | 0.69 | 0.84 |
Note: The performance of clustering is evaluated by ARI. Hierarchical (t-SNE) and K-means (t-SNE) indicate clustering after t-SNE. CNDM, conditional network degree matrix; GEM, gene expression matrix; ARI, adjusted Rand index; t-SNE, t-distributed Stochastic Neighbor Embedding. Bold font (ARI) indicates that CNDM performs better.
Table 2.
Method | Input | Buettner [15] | Kim [43] | Wang [45] | Gokce [46] | Tabula Muris [47] (aorta) | Tabula Muris [47] (limb muscle) |
---|---|---|---|---|---|---|---|
K-means | NDM | 0.50 | 0.50 | 0.30 | 0.79 | 0.21 | 0.58 |
K-means | CNDM | 0.87 | 0.81 | 0.45 | 0.75 | 0.63 | 0.66 |
Hierarchical | NDM | 0.69 | 0.59 | 0.38 | 0.95 | 0.12 | 0.65 |
Hierarchical | CNDM | 0.73 | 0.77 | 0.45 | 0.92 | 0.75 | 0.76 |
K-means (t-SNE) | NDM | 0.83 | 0.84 | 0.61 | 0.38 | 0.46 | 0.62 |
K-means (t-SNE) | CNDM | 0.95 | 0.93 | 0.67 | 0.36 | 0.61 | 0.65 |
Hierarchical (t-SNE) | NDM | 0.89 | 0.98 | 0.58 | 0.47 | 0.50 | 0.66 |
Hierarchical (t-SNE) | CNDM | 0.95 | 0.95 | 0.72 | 0.39 | 0.50 | 0.66 |
K-medoids | NDM | 0.26 | 0.49 | 0.31 | 0.60 | 0.35 | 0.14 |
K-medoids | CNDM | 0.53 | 0.61 | 0.21 | 0.81 | 0.53 | 0.39 |
SC3 | NDM | 0.67 | 1 | 0.70 | 0.45 | 0.29 | 0.66 |
SC3 | CNDM | 0.98 | 0.96 | 0.86 | 0.72 | 0.73 | 0.76 |
SIMLR | NDM | 0.64 | 0.75 | 0.29 | 0.74 | 0.40 | 0.60 |
SIMLR | CNDM | 0.63 | 0.95 | 0.60 | 0.85 | 0.70 | 0.71 |
Seurat | NDM | 0.82 | 0.97 | 0.59 | 0.44 | 0.45 | 0.66 |
Seurat | CNDM | 0.90 | 0.84 | 0.59 | 0.32 | 0.76 | 0.75 |
Note: The performance of clustering is evaluated by ARI. Hierarchical (t-SNE) and K-means (t-SNE) indicate clustering after t-SNE. NDM, network degree matrix. Bold font (ARI) indicates that CNDM performs better.
We further evaluated the performance of c-CSN on larger datasets. The Tabula Muris droplet1 dataset [47], comprising more than 20,000 cells from three tissues (bladder, trachea, and spleen), was tested. The Seurat package was used to perform dimension reduction and clustering analysis on the CNDM [37]. The cells were clearly segregated into three dominant groups in the t-SNE map, which were largely defined by their tissues of origin (ARI = 0.73; Figure S7). This indicates that CCSN can be effectively extended to larger datasets in addition to the relatively small gold-standard datasets benchmarked above.
CCSN reveals network structure and dynamics on a single-cell basis
In this study, we applied c-CSN to the Wang dataset [45], which comes from a study of neural progenitor cells (NPCs) that differentiate into mature neurons. The dataset contains six time points over a 30-day period.
CSN and c-CSN were applied to a single cell (Day 0, RHB1742_d0) using 195 transcription factors that are differentially expressed across all the cell subpopulations and all time points. For the CCSN, two genes with high coefficients of variation (CV), HMGB1 and SOX11, were chosen as the conditional genes. The results (Figure 3A) illustrate that the CCSN is much sparser than the CSN. There are three modules in the CCSN, whereas the CSN forms a single dense network. Furthermore, three hub genes were identified in the three modules of the CCSN. One of the hub genes is ASCL1, which plays an important role in neural development [13], [49]. Thus, by removing indirect associations, c-CSN can extract a more informative network structure than CSN, which could improve the characterization of key regulatory factors in individual cells.
c-CSN also reveals the network dynamics over the differentiation trajectory. As illustrated in Figure 3B, a core neural differentiation network composed of eight regulatory genes was dynamically modulated through the temporal progression of NPC differentiation. At Day 0, the associations among these genes were the strongest, consistent with the high potency of progenitor cells. As NPCs differentiated, the network became much sparser, suggesting more specified cell fates. In addition, when constructing CCSNs from all genes, the degrees of MEIS2, PBX1, and POU3F2 were also larger at Day 0 and quickly decreased afterward (Figure 3C). This indicates that these genes are highly connected with other genes in NPCs, which is consistent with their known important roles in early differentiation of NPCs [45].
Both theoretically and computationally, c-CSN can also construct a gene–gene network for a single bulk RNA-seq sample, in addition to a single cell. To validate this biologically, we applied c-CSN to the TCGA lung adenocarcinoma (LUAD) RNA-seq dataset. The t-SNE plot based on CNDM reveals two obvious clusters, which correspond to normal adjacent lung tissues and lung tumors, respectively (Figure S8A), supporting the effective application of c-CSN to bulk RNA-seq data as well. Moreover, the EGFR pathway, a well-known oncogenic driver pathway for LUAD [50], [51], [52], was densely connected in tumor samples but not in benign tissues, as illustrated by the representative single-sample EGFR networks (Figure S8B) and the CCSN degrees of EGF and EGFR in each normal and tumor sample (Figure S8C). These data demonstrate that c-CSN extends well to single-sample bulk RNA-seq data analysis and uncovers important biological connections related to disease states.
CCSN-based NFE analysis
To quantify the differentiation state of cells, we further develop a new method, NFE, to estimate the differentiation potency of cells by exploiting the gene–gene network constructed by c-CSN.
To assess the performance of NFE, we applied it to two datasets. In the Wang dataset [45], there were 483 cells from 6 stages (Day 0, Day 1, Day 5, Day 7, Day 10, and Day 30), and the CCSNs with one conditional gene were used to compute the NFE. We compared NPCs at Day 0 and Day 1 with mature neurons at Day 30 (Figure 4A). In the Yang dataset [44], we compared the cells at embryonic day 10 (E10) with those at embryonic day 17 (E17) during the differentiation of mouse hepatoblasts (Figure 4B), and the CSN was used to compute the NFE. In both datasets, NFE assigned significantly higher scores to the progenitors than to the differentiated cells (one-sided Wilcoxon rank-sum test, P = 2.062E−12 in the Wang dataset, P = 3.756E−19 in the Yang dataset).
To further validate the accuracy of NFE, we generated a three-dimensional representation of the cell-lineage trajectory for the Wang dataset [45]. In the time-course differentiation experiment of NPCs into neurons [45], NFE correctly predicted a gradual decrease in differentiation potency (Figure 5). Therefore, NFE is effectively applicable to single-cell differentiation studies and highly predictive of developmental states and directions.
Conclusion
Estimating functional gene networks from noisy single-cell data has been a challenging task. Motivated by network-based data transformation, we previously developed CSN to construct cell-specific networks and successfully applied it to extract biologically important gene interactions. However, CSN does not distinguish between direct and indirect associations and thus suffers from the so-called overestimation problem. In this study, we propose a more sophisticated approach termed c-CSN, which constructs the direct gene–gene association network of each cell by eliminating false connections introduced by indirect effects.
c-CSN can transform GEM to CNDM for downstream dimension reduction and clustering analysis. This allows us to identify cell populations, generally better than with GEM in the datasets tested above. In addition, c-CSN also shows good performance when compared to CSN. Moreover, we can construct one direct gene–gene association network for each cell based on c-CSN. From the networks of the individual cells, we can obtain the dynamically changing networks. As shown in Figure 3B, the CCSNs of these cells changed dynamically across time points, and the network at Day 0 shows the strongest associations. Moreover, the hub genes of the networks constructed by the c-CSN method may play an important role in biological processes. As shown in Figure 3A, the hub genes of the three modules in the network constructed by c-CSN play a vital role in neural development. These results demonstrate the advantages of CCSN. In addition, individual networks of cells constructed by c-CSN can also be applied to construct network biomarkers [53], [54] for accurate disease diagnosis/prognosis, or dynamic network biomarkers [55], [56], [57], [58], [59] for reliable disease prediction.
According to Waddington’s landscape model of cellular differentiation, cellular differentiation potency decreases as a pluripotent cell “rolls” down from a “hill” to nearby “valleys”, and cell fate transitions could be modeled as “canalization” events [60], [61], [62]. The differentiation potency quantifies the relative number of fate choices that a cell may have and provides a useful indicator of cellular “stemness”. Recently, SCENT [40] and MCE [63] used a PPI network and gene expression data as input to estimate the potency of cells. However, these methods estimate the entropy of cells based on the available PPI network across various tissues, which may ignore many important relationships between genes in specific cells. Here, we develop the NFE to integrate the scRNA-seq profile of a cell with its gene–gene association network, and the results show that NFE performs well in distinguishing cells of different potency.
Nonetheless, the computational cost of c-CSN is generally about G times that of the original CSN because of the G conditional genes. Thus, a parallel computation scheme is desired to reduce the computation time. Also, c-CSN is not designed to construct causal gene association networks, and the directions of the gene associations cannot be obtained. These could be our future research topics.
Code availability
c-CSN is available at https://github.com/LinLi-0909/c-CSN.
CRediT author statement
Lin Li: Conceptualization, Methodology, Writing - review & editing. Hao Dai: Methodology, Writing - review & editing. Zhaoyuan Fang: Conceptualization, Methodology, Writing - review & editing. Luonan Chen: Conceptualization, Writing - review & editing. All authors read and approved the final manuscript.
Competing interests
The authors have declared no competing interests.
Acknowledgments
This work was supported by the National Key R&D Program of China (Grant No. 2017YFA0505500), the National Natural Science Foundation of China (Grant Nos. 31771476 and 31930022), and the Shanghai Municipal Science and Technology Major Project, China (Grant No. 2017SHZDZX01).
Footnotes
Peer review under responsibility of Beijing Institute of Genomics, Chinese Academy of Sciences / China National Center for Bioinformation and Genetics Society of China.
Supplementary data to this article can be found online at https://doi.org/10.1016/j.gpb.2020.05.005.
Contributor Information
Zhaoyuan Fang, Email: fangzhaoyuan@sibs.ac.cn.
Luonan Chen, Email: lnchen@sibs.ac.cn.
References
- 1. Yan L., Yang M., Guo H., Yang L., Wu J., Li R., et al. Single-cell RNA-Seq profiling of human preimplantation embryos and embryonic stem cells. Nat Struct Mol Biol. 2013;20:1131–1139. doi: 10.1038/nsmb.2660.
- 2. Treutlein B., Brownfield D.G., Wu A.R., Neff N.F., Mantalas G.L., Espinoza F.H., et al. Reconstructing lineage hierarchies of the distal lung epithelium using single-cell RNA-seq. Nature. 2014;509:371–375. doi: 10.1038/nature13173.
- 3. Zeisel A., Munoz-Manchado A.B., Codeluppi S., Lonnerberg P., La Manno G., Jureus A., et al. Brain structure. Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq. Science. 2015;347:1138–1142. doi: 10.1126/science.aaa1934.
- 4. Fuzik J., Zeisel A., Mate Z., Calvigioni D., Yanagawa Y., Szabo G., et al. Integration of electrophysiological recordings with single-cell RNA-seq data identifies neuronal subtypes. Nat Biotechnol. 2016;34:175–183. doi: 10.1038/nbt.3443.
- 5. Scialdone A., Tanaka Y., Jawaid W., Moignard V., Wilson N.K., Macaulay I.C., et al. Resolving early mesoderm diversification through single-cell expression profiling. Nature. 2016;535:289–293. doi: 10.1038/nature18633.
- 6. Bendall S.C., Davis K.L., Amir el A.D., Tadmor M.D., Simonds E.F., Chen T.J., et al. Single-cell trajectory detection uncovers progression and regulatory coordination in human B cell development. Cell. 2014;157:714–725. doi: 10.1016/j.cell.2014.04.005.
- 7. Nestorowa S., Hamey F.K., Pijuan Sala B., Diamanti E., Shepherd M., Laurenti E., et al. A single-cell resolution map of mouse hematopoietic stem and progenitor cell differentiation. Blood. 2016;128:e20–e31. doi: 10.1182/blood-2016-05-716480.
- 8. Woyke T., Doud D.F.R., Schulz F. The trajectory of microbial single-cell sequencing. Nat Methods. 2017;14:1045–1054. doi: 10.1038/nmeth.4469.
- 9. Jaitin D.A., Kenigsberg E., Keren-Shaul H., Elefant N., Paul F., Zaretsky I., et al. Massively parallel single-cell RNA-seq for marker-free decomposition of tissues into cell types. Science. 2014;343:776–779. doi: 10.1126/science.1247651.
- 10. Shapiro E., Biezuner T., Linnarsson S. Single-cell sequencing-based technologies will revolutionize whole-organism science. Nat Rev Genet. 2013;14:618–630. doi: 10.1038/nrg3542.
- 11. Grun D., Lyubimova A., Kester L., Wiebrands K., Basak O., Sasaki N., et al. Single-cell messenger RNA sequencing reveals rare intestinal cell types. Nature. 2015;525:251–255. doi: 10.1038/nature14966.
- 12. Kuznetsov V.A., Knott G.D., Bonner R.F. General statistics of stochastic process of gene expression in eukaryotic cells. Genetics. 2002;161:1321–1332. doi: 10.1093/genetics/161.3.1321.
- 13. Kim J.K., Marioni J.C. Inferring the kinetics of stochastic gene expression from single-cell RNA-sequencing data. Genome Biol. 2013;14:R7. doi: 10.1186/gb-2013-14-1-r7.
- 14. Kharchenko P.V., Silberstein L., Scadden D.T. Bayesian approach to single-cell differential expression analysis. Nat Methods. 2014;11:740–742. doi: 10.1038/nmeth.2967.
- 15. Buettner F., Natarajan K.N., Casale F.P., Proserpio V., Scialdone A., Theis F.J., et al. Computational analysis of cell-to-cell heterogeneity in single-cell RNA-sequencing data reveals hidden subpopulations of cells. Nat Biotechnol. 2015;33:155–160. doi: 10.1038/nbt.3102.
- 16. Daigle B.J., Jr., Soltani M., Petzold L.R., Singh A. Inferring single-cell gene expression mechanisms using stochastic simulation. Bioinformatics. 2015;31:1428–1435. doi: 10.1093/bioinformatics/btv007.
- 17. Vu T.N., Wills Q.F., Kalari K.R., Niu N., Wang L., Rantalainen M., et al. Beta-Poisson model for single-cell RNA-seq data analyses. Bioinformatics. 2016;32:2128–2135. doi: 10.1093/bioinformatics/btw202.
- 18. Xu C., Su Z. Identification of cell types from single-cell transcriptomes using a novel clustering method. Bioinformatics. 2015;31:1974–1980. doi: 10.1093/bioinformatics/btv088.
- 19. Jiang H., Sohn L.L., Huang H., Chen L. Single cell clustering based on cell-pair differentiability correlation and variance analysis. Bioinformatics. 2018;34:3684–3694. doi: 10.1093/bioinformatics/bty390.
- 20. Kiselev V.Y., Kirschner K., Schaub M.T., Andrews T., Yiu A., Chandra T., et al. SC3: consensus clustering of single-cell RNA-seq data. Nat Methods. 2017;14:483–486. doi: 10.1038/nmeth.4236.
- 21. Wang B., Zhu J., Pierson E., Ramazzotti D., Batzoglou S. Visualization and analysis of single-cell RNA-seq data by kernel-based similarity learning. Nat Methods. 2017;14:414–416. doi: 10.1038/nmeth.4207.
- 22. Trapnell C., Cacchiarelli D., Grimsby J., Pokharel P., Li S., Morse M., et al. The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nat Biotechnol. 2014;32:381–386. doi: 10.1038/nbt.2859.
- 23. Ji Z., Ji H. TSCAN: pseudo-time reconstruction and evaluation in single-cell RNA-seq analysis. Nucleic Acids Res. 2016;44:e117. doi: 10.1093/nar/gkw430.
- 24. Angerer P., Haghverdi L., Buttner M., Theis F.J., Marr C., Buettner F. Destiny: diffusion maps for large-scale single-cell data in R. Bioinformatics. 2016;32:1241–1243. doi: 10.1093/bioinformatics/btv715.
- 25. Pierson E., Yau C. ZIFA: Dimensionality reduction for zero-inflated single-cell gene expression analysis. Genome Biol. 2015;16:241. doi: 10.1186/s13059-015-0805-z.
- 26. Li X., Chen W., Chen Y., Zhang X., Gu J., Zhang M.Q. Network embedding-based representation learning for single cell RNA-seq data. Nucleic Acids Res. 2017;45:e166. doi: 10.1093/nar/gkx750.
- 27. Elyanow R., Dumitrascu B., Engelhardt B.E., Raphael B.J. netNMF-sc: leveraging gene-gene interactions for imputation and dimensionality reduction in single-cell expression analysis. Genome Res. 2020;30:195–204. doi: 10.1101/gr.251603.119.
- 28. Huang M., Wang J., Torre E., Dueck H., Shaffer S., Bonasio R., et al. SAVER: gene expression recovery for single-cell RNA sequencing. Nat Methods. 2018;15:539–542. doi: 10.1038/s41592-018-0033-z.
- 29. van Dijk D., Sharma R., Nainys J., Yim K., Kathail P., Carr A.J., et al. Recovering gene interactions from single-cell data using data diffusion. Cell. 2018;174:716–29.e27. doi: 10.1016/j.cell.2018.05.061.
- 30. Li W.V., Li J.J. An accurate and robust imputation method scImpute for single-cell RNA-seq data. Nat Commun. 2018;9:997. doi: 10.1038/s41467-018-03405-7.
- 31. Andrews T.S., Hemberg M. False signals induced by single-cell imputation. F1000Res. 2018;7:1740. doi: 10.12688/f1000research.16613.1.
- 32. Nazzicari N., Vella D., Coronnello C., Di Silvestre D., Bellazzi R., Marini S. MTGO-SC, a tool to explore gene modules in single-cell RNA sequencing data. Front Genet. 2019;10:953. doi: 10.3389/fgene.2019.00953.
- 33. Matsumoto H., Kiryu H., Furusawa C., Ko M.S.H., Ko S.B.H., Gouda N., et al. SCODE: an efficient regulatory network inference algorithm from single-cell RNA-Seq during differentiation. Bioinformatics. 2017;33:2314–2321. doi: 10.1093/bioinformatics/btx194.
- 34. Dai H., Li L., Zeng T., Chen L. Cell-specific network constructed by single-cell RNA sequencing data. Nucleic Acids Res. 2019;47:e62. doi: 10.1093/nar/gkz172.
- 35. Jolliffe I.T., Cadima J. Principal component analysis: a review and recent developments. Philos Trans A Math Phys Eng Sci. 2016;374:20150202. doi: 10.1098/rsta.2015.0202.
- 36. van der Maaten L., Hinton G.E. Visualizing data using t-SNE. J Mach Learn Res. 2008;9:2579–2605.
- 37. Stuart T., Butler A., Hoffman P., Hafemeister C., Papalexi E., Mauck W.M., 3rd, et al. Comprehensive integration of single-cell data. Cell. 2019;177:1888–902.e21. doi: 10.1016/j.cell.2019.05.031.
- 38. MacArthur B.D., Lemischka I.R. Statistical mechanics of pluripotency. Cell. 2013;154:484–489. doi: 10.1016/j.cell.2013.07.024.
- 39. Stegle O., Teichmann S.A., Marioni J.C. Computational and analytical challenges in single-cell transcriptomics. Nat Rev Genet. 2015;16:133–145. doi: 10.1038/nrg3833.
- 40. Teschendorff A.E., Enver T. Single-cell entropy for accurate estimation of differentiation potency from a cell's transcriptome. Nat Commun. 2017;8:15599. doi: 10.1038/ncomms15599.
- 41. Kolodziejczyk A.A., Kim J.K., Tsang J.C., Ilicic T., Henriksson J., Natarajan K.N., et al. Single cell RNA-sequencing of pluripotent states unlocks modular transcriptional variation. Cell Stem Cell. 2015;17:471–485. doi: 10.1016/j.stem.2015.09.011.
- 42. Chu L.F., Leng N., Zhang J., Hou Z., Mamott D., Vereide D.T., et al. Single-cell RNA-seq reveals novel regulators of human embryonic stem cell differentiation to definitive endoderm. Genome Biol. 2016;17:173. doi: 10.1186/s13059-016-1033-x.
- 43. Kim K.T., Lee H.W., Lee H.O., Song H.J., Jeong da E., Shin S., et al. Application of single-cell RNA sequencing in optimizing a combinatorial therapeutic strategy in metastatic renal cell carcinoma. Genome Biol. 2016;17:80. doi: 10.1186/s13059-016-0945-9.
- 44. Yang L., Wang W.H., Qiu W.L., Guo Z., Bi E., Xu C.R. A single-cell transcriptomic analysis reveals precise pathways and regulatory mechanisms underlying hepatoblast differentiation. Hepatology. 2017;66:1387–1401. doi: 10.1002/hep.29353.
- 45. Wang J., Jenjaroenpun P., Bhinge A., Angarica V.E., Del Sol A., Nookaew I., et al. Single-cell gene expression analysis reveals regulators of distinct cell subpopulations among developing human neurons. Genome Res. 2017;27:1783–1794. doi: 10.1101/gr.223313.117.
- 46. Gokce O., Stanley G.M., Treutlein B., Neff N.F., Camp J.G., Malenka R.C., et al. Cellular taxonomy of the mouse striatum as revealed by single-cell RNA-Seq. Cell Rep. 2016;16:1126–1137. doi: 10.1016/j.celrep.2016.06.059.
- 47. Tabula Muris Consortium, Overall coordination, Logistical coordination, Organ collection and processing, Library preparation and sequencing, Computational data analysis, et al. Single-cell transcriptomics of 20 mouse organs creates a Tabula Muris. Nature. 2018;562:367–72.
- 48. Baglama J., Reichel L. Augmented implicitly restarted Lanczos bidiagonalization methods. SIAM J Sci Comput. 2005;27:19–42.
- 49. Ming G.L., Song H. Adult neurogenesis in the mammalian brain: significant answers and significant questions. Neuron. 2011;70:687–702. doi: 10.1016/j.neuron.2011.05.001.
- 50. Ohsaki Y., Tanno S., Fujita Y., Toyoshima E., Fujiuchi S., Nishigaki Y., et al. Epidermal growth factor receptor expression correlates with poor prognosis in non-small cell lung cancer patients with p53 overexpression. Oncol Rep. 2000;7:603–607. doi: 10.3892/or.7.3.603.
- 51. Nicholson R.I., Gee J.M., Harper M.E. EGFR and cancer prognosis. Eur J Cancer. 2001;37:S9–15. doi: 10.1016/s0959-8049(01)00231-3.
- 52. Sharma S.V., Bell D.W., Settleman J., Haber D.A. Epidermal growth factor receptor mutations in lung cancer. Nat Rev Cancer. 2007;7:169–181. doi: 10.1038/nrc2088.
- 53. Zhang W., Zeng T., Liu X., Chen L. Diagnosing phenotypes of single-sample individuals by edge biomarkers. J Mol Cell Biol. 2015;7:231–241. doi: 10.1093/jmcb/mjv025.
- 54. Liu X., Wang Y., Ji H., Aihara K., Chen L. Personalized characterization of diseases using sample-specific networks. Nucleic Acids Res. 2016;44:e164. doi: 10.1093/nar/gkw772.
- 55. Yang B., Li M., Tang W., Liu W., Zhang S., Chen L., et al. Dynamic network biomarker indicates pulmonary metastasis at the tipping point of hepatocellular carcinoma. Nat Commun. 2018;9:678. doi: 10.1038/s41467-018-03024-2.
- 56. Liu R., Chen P., Chen L. Single-sample landscape entropy reveals the imminent phase transition during disease progression. Bioinformatics. 2020;36:1522–1532. doi: 10.1093/bioinformatics/btz758.
- 57. Liu R., Wang J., Ukai M., Sewon K., Chen P., Suzuki Y., et al. Hunt for the tipping point during endocrine resistance process in breast cancer by dynamic network biomarkers. J Mol Cell Biol. 2019;11:649–664. doi: 10.1093/jmcb/mjy059.
- 58. Liu X., Chang X., Leng S., Tang H., Aihara K., Chen L. Detection for disease tipping points by landscape dynamic network biomarkers. Natl Sci Rev. 2018;6:775–785. doi: 10.1093/nsr/nwy162.
- 59. Chen C., Li R., Shu L., He Z., Wang J., Zhang C., et al. Predicting future dynamics from short-term time series using an Anticipated Learning Machine. Natl Sci Rev. 2020;7:1079–1091. doi: 10.1093/nsr/nwaa025.
- 60. Moris N., Pina C., Arias A.M. Transition states and cell fate decisions in epigenetic landscapes. Nat Rev Genet. 2016;17:693–703. doi: 10.1038/nrg.2016.98.
- 61. Laurenti E., Gottgens B. From haematopoietic stem cells to complex differentiation landscapes. Nature. 2018;553:418–426. doi: 10.1038/nature25022.
- 62. Lang A.H., Li H., Collins J.J., Mehta P. Epigenetic landscapes explain partially reprogrammed cells and identify key reprogramming genes. PLoS Comput Biol. 2014;10:e1003734. doi: 10.1371/journal.pcbi.1003734.
- 63. Shi J., Teschendorff A.E., Chen W., Chen L., Li T. Quantifying Waddington’s epigenetic landscape: a comparison of single-cell potency measures. Brief Bioinform. 2018;21:248–261. doi: 10.1093/bib/bby093.