Nucleic Acids Research. 2021 Jul 22;49(18):e104. doi: 10.1093/nar/gkab601

Codependency and mutual exclusivity for gene community detection from sparse single-cell transcriptome data

Natsu Nakajima 1, Tomoatsu Hayashi 2, Katsunori Fujiki 3, Katsuhiko Shirahige 4, Tetsu Akiyama 5, Tatsuya Akutsu 6, Ryuichiro Nakato 7
PMCID: PMC8501962  PMID: 34291282

Abstract

Single-cell RNA-seq (scRNA-seq) can be used to characterize cellular heterogeneity in thousands of cells. The reconstruction of a gene network based on coexpression patterns is a fundamental task in scRNA-seq analyses, and the mutual exclusivity of gene expression can be critical for understanding such heterogeneity. Here, we propose an approach for detecting communities from a genetic network constructed on the basis of coexpression properties. The community-based comparison of multiple coexpression networks enables the identification of functionally related gene clusters that cannot be fully captured through differential gene expression-based analysis. We also developed a novel metric referred to as the exclusively expressed index (EEI) that identifies mutually exclusive gene pairs from sparse scRNA-seq data. EEI quantifies and ranks the exclusive expression levels of all gene pairs from binary expression patterns while maintaining robustness against a low sequencing depth. We applied our methods to glioblastoma scRNA-seq data and found that gene communities were partially conserved after serum stimulation despite a considerable number of differentially expressed genes. We also demonstrate that the identification of mutually exclusive gene sets with EEI can improve the sensitivity of capturing cellular heterogeneity. Our methods complement existing approaches and provide new biological insights, even for a large, sparse dataset, in the single-cell analysis field.

INTRODUCTION

Single-cell RNA sequencing (scRNA-seq) enables us to explore and characterize variability in individual cells. At the same time, it can provide information on the regulatory relationships between genes and how genes interact with each other at the single-cell level. Since scRNA-seq experiments profile the dynamics and variation of gene expression in different cell states, such as the cell cycle, cell division and cell differentiation, across thousands of cells, the reconstruction of gene regulatory networks (GRNs) helps us to understand gene functions and cell type- or state-specific variability in genetic interactions.

Many mathematical methods have been proposed to infer GRNs from bulk transcriptome data, including the use of Boolean networks (1–4), Bayesian networks (5–8), mutual information (9–11) and linear regression (12,13). Although most of these methods enable us to capture codependency and regulatory interactions from a dataset with a limited sample size, they are not suitable for inferring regulatory relationships on the basis of temporal information or sparse expression. In addition, they are not applicable to large-scale networks due to their high computational complexity.

With the development of high-throughput sequencing techniques, expression can now be measured simultaneously in a large number of cells. In particular, while droplet-based single-cell RNA sequencing can measure expression levels in thousands of cells, it exhibits lower sensitivity for each gene compared to other scRNA-seq methods (14). This leads to an excessive number of zero read counts, which is likely to affect downstream analysis. Previously reported methods (15,16) are useful for integrated datasets generated by distinct platforms, and the lower sensitivity of sparse scRNA-seq might be improved by integrating datasets from several such platforms. However, it would be better to develop methods that capture specific expression from sparse scRNA-seq data obtained through a single platform.

For the inference of GRNs from scRNA-seq data, numerous methods have been developed (17–22). These methods are mostly aimed at reconstructing the statistical dependency between genes based on the ordering of single cells according to the time information underlying dynamic processes from a dataset obtained from high-quality cells. The variability in gene expression can be revealed by taking advantage of underlying temporal information for individual cells. In addition, an approach based on information theory estimates non-linear dependencies with pairwise joint probability distributions if the sample size is sufficient (17). However, the sparsity of single-cell data might mean that these data are not inherently suitable for the inference of GRNs, and GRN algorithms are mostly effective for up to a thousand genes (23).

Moreover, the community detection of coexpression networks is important for the identification of groups of functionally related genes. To address the community detection problem, many methods have been studied (24–28). The Girvan-Newman method basically decomposes a network to maximize the modularity (24,29). To reduce the computational complexity, the Leading eigenvector method, which partitions a network based on the spectral optimization of modularity according to the eigenvector, was developed (27).

Similar to codependency, mutual exclusivity is an inherent expression pattern in gene expression. At single-cell resolution, mutually exclusive genetic alterations are likely to be one cause of cellular heterogeneity not only within a tumour but also during cell development. Notably, several approaches have been developed for detecting mutually exclusive gene sets associated with cancer driver mutations. For example, mutual exclusivity modules (MEMo) are used to identify gene sets that belong to the same pathway by extracting all groups of functionally related genes from the Human Reference Network using multiple somatic mutation datasets (30). Other methods predict whether the cause of an amino acid change is associated (or not) with cancer based on a random forest with somatic missense mutations (31,32). Although these methods are suitable for detecting independent alterations in cancer driver genes from mutation data, the modelling of mutual exclusivity from scRNA-seq data has yet to be studied.

In this paper, we focus on the modelling of one-to-one relationships between two genes, such as codependency and mutual exclusivity, because codependency and mutual exclusivity are inherent and specific gene expression patterns. The purpose of this paper is twofold. First, we develop an approach to detect communities of coexpression networks with the co-dependency index (CDI) (33). We evaluate the effectiveness of community detection based on glioblastoma scRNA-seq data and show that this approach finds not only densely connected subgraphs but also functionally related networks. Community-based comparisons provide information on the similarities and differences in coexpressed genes when applied to multiple samples. Second, we develop a novel metric, the exclusively expressed index (EEI), for identifying mutually exclusive gene sets from sparse scRNA-seq data. According to the idea of detecting sparse expression (33), EEI enables us to quantify mutually exclusive expression due to negative correlations and genetic alterations not only in cancer driver genes but also in cell-type specific marker genes if the input is scRNA-seq data. We apply our method to glioblastoma scRNA-seq data and show that it outperforms existing methods while maintaining robustness against the sequencing read depth, and mutually exclusive gene sets can improve the sensitivity of the identification of cellular heterogeneity.

MATERIALS AND METHODS

Co-dependency index score

A previous study proposed a method, the co-dependency index (CDI), for quantifying the codependency relationship between two genes, gene i and gene j, across thousands of cells from scRNA-seq data (33).

Assuming that gene i and gene j are independently expressed, P(gi = 1) denotes the probability of observing nonzero expression of gene i. P(gi = 1, gj = 1) denotes the joint probability of observing that both gene i and gene j exhibit coincident nonzero expression values in the same cell and is formulated as follows:

$$P(g_i = 1, g_j = 1) = P(g_i = 1)\,P(g_j = 1) = \frac{N_i}{N} \cdot \frac{N_j}{N} = \pi_{ij} \qquad (1)$$

where $N_i$ and N represent the number of cells in which gene i presents nonzero values and the total number of cells, respectively. Under this assumption, $p(e_{ij})$, defined as follows, is the probability of observing a test statistic at least as extreme as the observed one under the null hypothesis that gene i and gene j are independent:

$$p(e_{ij}) = P(X \ge e_{ij}) = \sum_{x = e_{ij}}^{N} \mathrm{Bin}(x \mid N, \pi_{ij}) \qquad (2)$$

where $\mathrm{Bin}(x \mid N, \pi_{ij})$ represents the probability of observing x successes in N trials if the probability of success is πij, and $e_{ij}$ is the number of cells in which gene i and gene j simultaneously present nonzero values. Then, CDI is defined as follows:

$$\mathrm{CDI}(g_i, g_j) = -\log_{10} p(e_{ij}) \qquad (3)$$
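The calculation described above can be sketched in a few lines of Python. This is a minimal illustration, assuming that CDI is the negative base-10 logarithm of the binomial upper-tail probability of the observed co-expression count; the function names (`log_binom_upper_tail`, `cdi`) and the toy expression vectors are hypothetical and are not taken from the original study.

```python
import math

def log_binom_pmf(x, n, p):
    """Natural log of the binomial probability of x successes in n trials."""
    if p <= 0.0:
        return 0.0 if x == 0 else -math.inf
    if p >= 1.0:
        return 0.0 if x == n else -math.inf
    log_comb = math.lgamma(n + 1) - math.lgamma(x + 1) - math.lgamma(n - x + 1)
    return log_comb + x * math.log(p) + (n - x) * math.log1p(-p)

def log_binom_upper_tail(k, n, p):
    """Natural log of P(X >= k) for X ~ Bin(n, p), summed in log space."""
    if k <= 0:
        return 0.0
    logs = [log_binom_pmf(x, n, p) for x in range(k, n + 1)]
    peak = max(logs)
    if peak == -math.inf:
        return -math.inf
    return min(0.0, peak + math.log(sum(math.exp(v - peak) for v in logs)))

def cdi(expr_i, expr_j):
    """Assumed CDI: -log10 of the tail probability of the co-expression count."""
    n = len(expr_i)
    n_i = sum(1 for v in expr_i if v > 0)      # cells with nonzero gene i
    n_j = sum(1 for v in expr_j if v > 0)      # cells with nonzero gene j
    both = sum(1 for a, b in zip(expr_i, expr_j) if a > 0 and b > 0)
    pi_ij = (n_i / n) * (n_j / n)              # joint probability under independence
    log_p = log_binom_upper_tail(both, n, pi_ij)
    return 0.0 if log_p == 0.0 else -log_p / math.log(10)

# Toy data (made up): 10 cells, gene i nonzero in 5 cells, gene j in 4 of those.
gene_i = [3, 1, 2, 5, 1, 0, 0, 0, 0, 0]
gene_j = [2, 4, 1, 1, 0, 0, 0, 0, 0, 0]
score = cdi(gene_i, gene_j)   # approximately 0.92 for this toy pair
```

Working in log space keeps the tail probability from underflowing when thousands of cells are involved, which is the regime the text describes.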

Exclusively expressed index score

Definition

In this study, we propose a novel metric, the exclusively expressed index (EEI), for the quantification of mutual exclusivity between two genes according to the concept of CDI from sparse scRNA-seq data. Mutually exclusive expression between two specific genes, gene i and gene j, can be divided into two cases: A and B. In the case of A, gene i shows zero expression, while gene j shows nonzero expression, and the probability that A will occur can be denoted by P(gi = 0, gj = 1). In the case of B, gene i exhibits nonzero expression, gene j exhibits zero expression, and P(gi = 1, gj = 0) indicates the probability of B. EEI is computed for each pair of genes for all possible combinations. Under the null hypothesis that gene i and gene j are independent, the probability of the occurrence of A is defined as follows:

[Equation (4); formula image not available]

where kij represents the number of cells in which gene i presents zero values and gene j presents nonzero values. Similarly,

[Equation (5); formula image not available]

where hij represents the number of cells in which gene i presents nonzero values and gene j presents zero values. Then, EEI is defined as follows:

[Equation (6); formula image not available]

Although CDI and EEI do not impute technical zeros, these metrics quantify the likelihood of codependency and mutual exclusivity by counting the number of cells according to binary quantification, provided that the sample size is sufficient. Gene pairs with low CDI scores are not necessarily gene pairs with high EEI scores. Since CDI captures coexpression that is unlikely to be a coincidence, gene pairs that are expressed in all cells (i.e. housekeeping genes) exhibit low CDI and low EEI scores. A high EEI score directly indicates that a gene pair exhibits mutually exclusive expression. The advantages of EEI are twofold. First, EEI infers marker genes specific to each cell cluster. Second, mutually exclusive gene sets detected by EEI can be applied to improve the sensitivity of clustering of cells.
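The two cases A and B can be illustrated with a short Python sketch. It rests on several assumptions: that each case is scored, by analogy with CDI, as the negative base-10 logarithm of a binomial upper-tail probability under independence, and that kij and hij are the observed counts for the two cases. How the two cases are combined into a single EEI value is not assumed here, so the hypothetical function `eei_components` simply returns both scores.

```python
import math

def log_binom_upper_tail(k, n, p):
    """Natural log of P(X >= k) for X ~ Bin(n, p), summed in log space."""
    if k <= 0:
        return 0.0
    if p <= 0.0:
        return -math.inf
    if p >= 1.0:
        return 0.0
    logs = []
    for x in range(k, n + 1):
        log_comb = math.lgamma(n + 1) - math.lgamma(x + 1) - math.lgamma(n - x + 1)
        logs.append(log_comb + x * math.log(p) + (n - x) * math.log1p(-p))
    peak = max(logs)
    return min(0.0, peak + math.log(sum(math.exp(v - peak) for v in logs)))

def neg_log10(log_p):
    """Convert a natural-log probability into a -log10 score."""
    return 0.0 if log_p == 0.0 else -log_p / math.log(10)

def eei_components(expr_i, expr_j):
    """Scores for case A (i zero, j nonzero) and case B (i nonzero, j zero)."""
    n = len(expr_i)
    p_i = sum(1 for v in expr_i if v > 0) / n
    p_j = sum(1 for v in expr_j if v > 0) / n
    k_ij = sum(1 for a, b in zip(expr_i, expr_j) if a == 0 and b > 0)
    h_ij = sum(1 for a, b in zip(expr_i, expr_j) if a > 0 and b == 0)
    pi_a = (1.0 - p_i) * p_j    # P(gi = 0, gj = 1) under independence
    pi_b = p_i * (1.0 - p_j)    # P(gi = 1, gj = 0) under independence
    score_a = neg_log10(log_binom_upper_tail(k_ij, n, pi_a))
    score_b = neg_log10(log_binom_upper_tail(h_ij, n, pi_b))
    return score_a, score_b

# Toy data (made up): a perfectly exclusive pair and a "housekeeping" pair.
exclusive = eei_components([1] * 5 + [0] * 5, [0] * 5 + [1] * 5)
housekeeping = eei_components([1] * 10, [1] * 10)
```

In this toy example the exclusive pair receives a positive score in both cases, whereas the pair expressed in every cell scores zero, which matches the statement that housekeeping genes obtain low scores.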

Application of EEI to single-cell clustering

Mutually exclusive expression can be observed as genetic alterations (34–37) and negative correlations (38,39) between two specific genes. Genetic alterations lead to protein production because mutually exclusive gene pairs might be amplified in different cell populations (34). Notably, tumour cell heterogeneity might be considered to be caused by mutations in driver genes or to result from tumour cell progression. At single-cell resolution, mutually exclusive expression means that it is highly possible that two genes are exclusively expressed in the different cell types. Since this property is specific to single-cell populations, we attempt to apply the expression of mutually exclusive gene sets to improve the sensitivity of the identification of cellular heterogeneity. In this subsection, we introduce a novel strategy for the classification of single cells with EEI gene sets.

Clustering analysis with EEI can be summarized as follows:

  1. EEI is calculated for all gene pairs from scRNA-seq data, and the 1000 gene sets with the highest EEI scores are extracted.

  2. To generate a feature matrix, the expression ratio (i.e. the proportion of the expression values) is calculated for all extracted pairs using normalized (i.e. the gene expression in each cell is divided by the total read count of that cell) and log-transformed expression values. Then, the features of expression are merged with the log-transformed expression values in the scRNA-seq dataset, as shown in Figure 1A.

  3. The merged data can be regarded as an input, and classification is performed via dimensionality reduction methods. In this experiment, we reduce the high-dimensional data to 40 dimensions via singular value decomposition (SVD), and then UMAP is used to reduce 40 dimensions to 2 dimensions.
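Step 2 of the procedure above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the scale factor of 10,000, the use of log(1 + x), the definition of the ratio as x_i / (x_i + x_j), and the value 0.0 when both genes are zero are all choices made here for the example. The SVD and UMAP steps are omitted.

```python
import math

def log_normalize(counts, scale=1e4):
    """Divide each cell (column) by its total count, rescale, and apply log1p."""
    n_cells = len(counts[0])
    totals = [sum(row[c] for row in counts) for c in range(n_cells)]
    return [
        [math.log1p(scale * row[c] / totals[c]) if totals[c] > 0 else 0.0
         for c in range(n_cells)]
        for row in counts
    ]

def ratio_features(log_expr, pairs):
    """One feature row per gene pair: share of gene i in the pair's expression."""
    features = []
    for i, j in pairs:
        row = []
        for a, b in zip(log_expr[i], log_expr[j]):
            row.append(a / (a + b) if (a + b) > 0 else 0.0)
        features.append(row)
    return features

def build_feature_matrix(counts, top_pairs):
    """Stack the log-expression rows and the pair-ratio rows (genes x cells)."""
    log_expr = log_normalize(counts)
    return log_expr + ratio_features(log_expr, top_pairs)

# Toy count matrix (made up): 3 genes x 3 cells, one extracted pair (gene 0, gene 1).
counts = [[5, 0, 3],
          [0, 4, 1],
          [2, 2, 2]]
features = build_feature_matrix(counts, top_pairs=[(0, 1)])
```

The resulting matrix has one row per gene plus one row per extracted pair, and it would then be passed to the dimensionality reduction in step 3.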

Figure 1.


(A) Overview of the generation of a feature matrix from scRNA-seq data. (i, ii) If a gene-cell expression matrix is provided, EEI is calculated, and highly mutually exclusive gene pairs are extracted. (iii) The feature matrix is generated by merging the expression ratio matrix for EEI pairs with the gene-cell expression matrix. (iv) Dimension reduction is performed using SVD and UMAP with the feature matrix as an input. (B) Summary of the six scRNA-seq datasets. This contains the number of genes that are expressed in at least one cell, the number of cells, the number of total reads, the units of transcript counts and the reference.

Community detection algorithm

Community detection can extract topological characteristics from complex networks. The Girvan–Newman method is a basic method that decomposes a network iteratively by removing the edges connecting communities that present the highest edge betweenness (24,29) based on maximizing modularity, Q (24):

$$Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j) \qquad (7)$$

where m represents the total number of edges, Aij is the weight of the edge between nodes i and j, ki is the sum of the weights of edges attached to node i or the degree of node i, ci is the community to which node i belongs, and δ is defined as δ(u, v) = 1 if u = v and as 0 otherwise.
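As a concrete check of the modularity definition, the following Python sketch evaluates Q for a small unweighted graph. The function name `modularity` and the example graph (two triangles joined by a single edge) are illustrative choices, not part of the original study.

```python
def modularity(edges, community):
    """Modularity Q of a partition of an unweighted, undirected graph.

    edges: list of (u, v) tuples, each undirected edge listed once.
    community: dict mapping each node to its community label.
    """
    m = len(edges)
    degree = {}
    adjacency = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
        adjacency[(u, v)] = 1
        adjacency[(v, u)] = 1
    nodes = list(degree)
    total = 0.0
    for i in nodes:
        for j in nodes:
            if community[i] == community[j]:   # delta(c_i, c_j) = 1
                total += adjacency.get((i, j), 0) - degree[i] * degree[j] / (2 * m)
    return total / (2 * m)

# Two triangles (0-1-2 and 3-4-5) connected by the edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
partition = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
q = modularity(edges, partition)   # equals 5/14 for this partition
```

Placing every node in a single community gives Q = 0, so the positive value for the two-triangle split reflects more within-community edges than expected from the node degrees alone.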

The Leading eigenvector method (27) is based on the spectral optimization of modularity, obtained by reformulating modularity in terms of the eigenvalues and eigenvectors of a matrix (40). It calculates the eigenvector of the modularity matrix that corresponds to the largest positive eigenvalue, and the signs of its elements partition the network into two communities. This algorithm runs faster than other modularity optimization methods and performs slightly better for large-scale networks and a wide variety of networks (41,42).

This method maximizes modularity, which is defined in terms of a matrix based on the eigenvalues and eigenvectors, called the modularity matrix, B (29):

$$B_{ij} = A_{ij} - \frac{k_i k_j}{2m} \qquad (8)$$

where A is the adjacency matrix. Then, modularity is defined as follows:

$$Q = \frac{1}{2m} \mathrm{Tr}\left(S^{T} B S\right) \qquad (9)$$

where S is an index matrix to detect c (c ≥ 2) communities. Each column of this matrix is an index vector of (0, 1) elements. Writing B = UDU^T, where U = (u1|u2|…) is the matrix of eigenvectors of B and D is the diagonal matrix of eigenvalues Dii = βi, and letting s be an index vector of (−1, 1) elements for a division into two communities, whose ith element is assigned according to $u^{(1)}_i$, the ith element of the leading eigenvector u1 of B, Q is defined as follows:

$$Q = \frac{1}{4m}\, s^{T} B s = \frac{1}{4m} \sum_{i} \left(u_i^{T} s\right)^{2} \beta_i \qquad (10)$$
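A single bisection step of the spectral approach can be sketched in plain Python. The sketch assumes an unweighted graph with at least one edge, uses shifted power iteration to find the leading eigenvector of the modularity matrix, and assigns nodes by the sign of each eigenvector element; the function names and the fixed iteration count are illustrative choices, and a production analysis would normally rely on an established graph library.

```python
import math

def modularity_matrix(adj):
    """B_ij = A_ij - k_i * k_j / (2m) for a dense adjacency matrix."""
    n = len(adj)
    k = [sum(row) for row in adj]
    two_m = sum(k)
    b = [[adj[i][j] - k[i] * k[j] / two_m for j in range(n)] for i in range(n)]
    return b, two_m

def leading_eigenvector(mat, iters=500):
    """Power iteration on (mat + shift * I), which makes all eigenvalues non-negative."""
    n = len(mat)
    shift = max(sum(abs(x) for x in row) for row in mat)  # bound on the spectral radius
    v = [float(i + 1) for i in range(n)]                  # arbitrary non-zero start
    for _ in range(iters):
        w = [sum(mat[i][j] * v[j] for j in range(n)) + shift * v[i] for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def bisect(adj):
    """Split nodes by the sign of the leading eigenvector; return (s, Q)."""
    b, two_m = modularity_matrix(adj)
    u1 = leading_eigenvector(b)
    s = [1 if x > 0 else -1 for x in u1]
    n = len(adj)
    s_b_s = sum(s[i] * b[i][j] * s[j] for i in range(n) for j in range(n))
    return s, s_b_s / (2 * two_m)   # Q = s^T B s / (4m)

# Two triangles (0-1-2 and 3-4-5) connected by the edge 2-3.
adj = [[0, 1, 1, 0, 0, 0],
       [1, 0, 1, 0, 0, 0],
       [1, 1, 0, 1, 0, 0],
       [0, 0, 1, 0, 1, 1],
       [0, 0, 0, 1, 0, 1],
       [0, 0, 0, 1, 1, 0]]
labels, q = bisect(adj)
```

For this graph the sign pattern separates the two triangles, and the resulting Q agrees with the value obtained from the pairwise definition of modularity.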

For a different approach, the Louvain method (29) is an agglomerative hierarchical clustering method to maximize modularity by local optimization. It aggregates each cluster into a single node until the modularity does not increase, which leads to a small-sized network and fast performance (43).

In this study, we applied the Leading eigenvector method because it is a fundamental method based on spectral optimization that partitions a network into clusters with eigenvalue decomposition.

Dataset

To evaluate the performance of EEI and the community detection of coexpression networks, we used six scRNA-seq datasets.

Glioblastoma scRNA-seq data: We generated single-cell RNA-seq data from glioblastoma stem-like cells before (stem) and after (serum+) the addition of serum, collected at 0 and 12 hours, respectively (see ‘Sample preparation for glioblastoma scRNA-seq data’). Glioblastoma stem-like cells are subsets of glioblastoma cells that possess self-renewal ability and exhibit extensive tumorigenicity. The datasets obtained at 0h and 12h contained 2102 and 2209 single cells in total, respectively.

Human ES progenitor scRNA-seq data: This dataset was published by Chu et al. (44) and provides snapshots of lineage-specific progenitor cells differentiated from human ES cells. The progenitor cells consisted of 1018 single cells in total and included cell types such as neural progenitor cells, endoderm cells, endothelial cells and trophoblast-like cells. Library preparation was performed using the Fluidigm C1 system (45).

Human ESC-derived neuron scRNA-seq data: This dataset was published by La Manno et al. (46) and provides the transcriptomes of human ventral midbrain single cells. These cells, 1715 in total, included cell types such as oculomotor and trochlear nucleus, serotonergic and medial neuroblasts.

Mouse cortex scRNA-seq data: This dataset was published by Zeisel et al. (47) and provides the transcriptomes of mouse cortex and hippocampal cells. These cells, 3,005 in total, included cell types such as interneurons, oligodendrocytes and microglial cells.

PBMC CELseq2 scRNA-seq data: This dataset was published by Mereu et al. (48) and provides a reference sample containing human peripheral blood mononuclear cells that were generated with the CELseq2 protocol. These cells, 1083 in total, included cell types such as B cells, NK cells and CD14 monocytes.

PBMC MARSseq scRNA-seq data: This dataset was published by Mereu et al. (48) and provides a reference sample containing human peripheral blood mononuclear cells that were generated with the MARSseq protocol. These cells, 1481 in total, included cell types such as CD4 T cells, FCGR3A monocytes and dendritic cells.

Sample preparation for glioblastoma scRNA-seq data

The establishment and characterization of glioblastoma stem-like cells (GSCs) have been previously reported (49). Briefly, GSCs were cultured in Dulbecco’s modified Eagle’s medium (DMEM)/F12 (Life Technologies) containing a B27 supplement minus vitamin A (Life Technologies), epidermal growth factor, and fibroblast growth factor 2 (20 ng/ml each; Wako Pure Chemicals Industries). For in vitro differentiation, GSCs were cultured in Dulbecco’s modified Eagle’s medium/F-12 medium (Life Technologies) containing 10% foetal bovine serum for the indicated times. Single-cell suspensions of GSCs or serum-induced differentiated GSCs were subjected to droplet-based scRNA-seq library preparation with the Chromium Single Cell 3’ Reagent Kit v2 (10× Genomics), aiming for an estimated 2,000 cells per library and following the manufacturer’s instructions. The libraries were checked with a BioAnalyzer High Sensitivity Chip (Agilent), quantified with a KAPA Library Quantification Kit (Roche), and then sequenced on the Illumina HiSeq 2500 platform in rapid mode.

Performance evaluation

We evaluated the effectiveness of EEI compared to four existing methods: the Pearson correlation coefficient (referred to as Pearson), minet (50), GENIE3 (51) and PIDC (17).

The Pearson correlation is a basic correlation measure for a linear relationship between two variables ranging from –1.0 to 1.0. Note that a mutually exclusive relationship corresponds to a negative Pearson coefficient, that is, a value between –1.0 and 0.0. minet and PIDC are methods for inferring GRNs based on mutual information. minet infers the non-linear relationship between two genes from microarray data, with scores ranging from 0.0 to 1.0. PIDC is an inference algorithm used to quantify the statistical relationships between triplets of genes based on the conditional mutual information from scRNA-seq data as positive values. GENIE3 infers the non-linear interactions between two genes based on random forest regression, with scores ranging from 0.0 to 1.0. Since a very large amount of computational time is required for large-scale datasets, we performed parallel computing using 25 cores.
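The remark about negative Pearson coefficients can be illustrated with a small Python example. The binary expression vectors below are made up for illustration: the two genes are never detected in the same cell, and one cell expresses neither.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

# Binarized expression across five cells (toy data).
gene_i = [1, 1, 0, 0, 0]
gene_j = [0, 0, 1, 1, 0]
r = pearson(gene_i, gene_j)   # negative, but not equal to -1
```

The coefficient is negative because the genes never co-occur, yet it stays above –1.0 because of the cell in which both are zero, which hints at why a correlation measure can be blurred by sparse data.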

Performance metrics

We evaluated the performance of EEI on the basis of the area under the precision-recall curve (AUPR) and average precision. The PR curve is plotted as the precision against recall. It is appropriate for binary classification with imbalanced data in which the number of positive samples is lower than the number of negative samples because the PR curve is sensitive to class distribution. Average precision (AP) is the weighted mean of the precisions at each threshold, with the increase in recall from the previous threshold used as the weight (see Supplementary Appendix).
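Average precision as described above can be computed directly from ranked prediction scores. The sketch below assumes binary labels (1 for a positive pair, 0 for a negative pair), at least one positive, and no tied scores; the function name and the toy scores are illustrative.

```python
def average_precision(scores, labels):
    """AP = sum over thresholds of (recall increase) x (precision at that threshold)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_pos = sum(labels)
    hits = 0
    total = 0.0
    for rank, idx in enumerate(order, start=1):
        if labels[idx] == 1:
            hits += 1
            # Recall rises by 1 / n_pos at each positive; precision is hits / rank.
            total += hits / rank
    return total / n_pos

# Toy ranking (made up): positives at ranks 1, 3 and 4.
scores = [0.9, 0.8, 0.7, 0.6, 0.5]
labels = [1, 0, 1, 1, 0]
ap = average_precision(scores, labels)   # (1/1 + 2/3 + 3/4) / 3
```

Recall only changes when a positive is encountered, so the weighted sum reduces to the mean of the precision values at the ranks of the positive samples.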

To validate the performance for clustering of single cells, we adopted the Adjusted Rand Index (ARI) and the silhouette coefficient. ARI measures the similarity between predicted and true cluster labels; it equals 1.0 for identical partitions, is close to 0.0 for random labelling and can take negative values. The silhouette coefficient measures the cluster cohesion and separation and ranges from –1.0 to 1.0. If the distance between one cluster and the other cluster is large, the silhouette coefficient is high. When evaluating the community detection of coexpression networks, we used the Szymkiewicz-Simpson coefficient and Jaccard index to measure the similarity between two sets of nodes, which ranged from 0.0 to 1.0.
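The set-similarity measures and the ARI mentioned above have short closed forms, sketched here in Python. The function names and toy inputs are illustrative, the set functions assume non-empty sets, and the ARI function assumes a non-degenerate case (it would divide by zero if, for example, both labellings placed all cells in one cluster).

```python
from collections import Counter
from math import comb

def jaccard(a, b):
    """|A and B| / |A or B| for two sets of nodes."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def szymkiewicz_simpson(a, b):
    """Overlap coefficient: |A and B| / min(|A|, |B|)."""
    a, b = set(a), set(b)
    return len(a & b) / min(len(a), len(b))

def adjusted_rand_index(true_labels, pred_labels):
    """ARI computed from the contingency table of two labellings."""
    n = len(true_labels)
    sum_cells = sum(comb(c, 2) for c in Counter(zip(true_labels, pred_labels)).values())
    sum_true = sum(comb(c, 2) for c in Counter(true_labels).values())
    sum_pred = sum(comb(c, 2) for c in Counter(pred_labels).values())
    expected = sum_true * sum_pred / comb(n, 2)
    max_index = (sum_true + sum_pred) / 2
    return (sum_cells - expected) / (max_index - expected)

# Toy examples (made up).
community_1 = {"g1", "g2", "g3", "g4"}
community_2 = {"g3", "g4", "g5"}
j = jaccard(community_1, community_2)               # 2 / 5
o = szymkiewicz_simpson(community_1, community_2)   # 2 / 3
ari_partial = adjusted_rand_index([0, 0, 1, 1], [0, 0, 1, 2])   # 4 / 7
ari_crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])   # -0.5
```

The overlap coefficient is never smaller than the Jaccard index for the same pair of sets, which is useful when one community is much smaller than the other.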

RESULTS

Comparison of exclusively expressed genes

Comparison of mutually exclusive gene sets

Initially, we evaluated the effectiveness of EEI for the detection of mutually exclusive gene pairs by comparing it with four existing methods, the Pearson correlation coefficient, minet, GENIE3 and PIDC, using glioblastoma stem-like cell scRNA-seq data. Since minet, GENIE3 and PIDC cannot be applied to large-scale datasets, we used the expression data of the top 5000 highly variable genes (HVG dataset) detected by Seurat (52). Gene sets with mutually exclusive expression may exhibit genetic alterations and negative correlations. Highly mutually exclusive means that there is a high possibility of exclusive expression between two genes, rather than being a coincidence.

Since the five methods commonly output the prediction score for each gene pair, we assessed the performance of these methods for a binary classification problem. In this classification, we prepared 17 mutually exclusive gene pairs that are reported in the literature as positive samples (Supplementary Table S1). We also prepared 50 negative gene pairs that were randomly sampled from 5,000 highly variable genes. To evaluate the performance of these methods, we used AUPR and AP scores (see ‘Materials and Methods’ section).

Table 1 and Supplementary Figure S1(A) summarize the performances of the five methods. The AUPR and AP of EEI were the highest among all methods. While EEI could capture mutually exclusive patterns according to binary quantification, minet, GENIE3 and PIDC exhibited lower AUPR values, indicating that these methods could not correctly capture the exclusivity of expression due to excessive zero read counts.

Table 1.

The prediction accuracies of five methods

| Metric | EEI | Pearson | minet | GENIE3 | PIDC |
|---|---|---|---|---|---|
| AUPR | 0.51 | 0.47 | 0.45 | 0.40 | 0.42 |
| AP | 0.52 | 0.48 | 0.43 | 0.42 | 0.44 |

As shown in Supplementary Figure S1B, a few gene pairs had high EEI scores, while many other gene pairs presented lower EEI scores. Since six positive gene pairs were among the top pairs, actual mutually exclusive gene pairs could be predicted with moderate and high scores by using EEI. Notably, the PDGFRA and MET gene pair showed the highest EEI of 12.4. Although minet and GENIE3 also produced their highest scores for that pair, these methods inferred lower scores for other gene pairs. Therefore, these results indicate that our method enables us to identify mutually exclusive gene sets independent of the sequencing depth in sparse scRNA-seq data.

Robustness analysis of read depth in scRNA-seq data

Since EEI identifies mutually exclusive gene pairs without taking into account technical zeros caused by dropouts, we examined the robustness of EEI against an insufficient read depth by comparing four methods (EEI, Pearson, minet and GENIE3). PIDC was excluded from this comparison because it could not read the expression files. We generated synthetic datasets from glioblastoma scRNA-seq data by randomly decreasing the total number of read counts from the original data by 10%.
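One way to generate such reduced-depth datasets is binomial thinning, in which every read is kept independently with a fixed probability. The source does not state which sampling scheme was used, so the scheme, the function name `downsample_counts` and the fixed seed are assumptions made for this sketch.

```python
import random

def downsample_counts(counts, keep_fraction, seed=0):
    """Keep each read independently with probability keep_fraction.

    counts: genes x cells matrix of non-negative integer read counts.
    """
    rng = random.Random(seed)
    return [
        [sum(1 for _ in range(c) if rng.random() < keep_fraction) for c in row]
        for row in counts
    ]

# Toy count matrix (made up): 3 genes x 4 cells.
counts = [[12, 0, 7, 3],
          [0, 25, 1, 0],
          [4, 4, 0, 9]]
thinned = downsample_counts(counts, keep_fraction=0.1, seed=1)
```

Thinning can only turn nonzero entries into zeros, never the reverse, so lower keep fractions produce progressively sparser matrices of the kind used to test robustness.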

First, we analysed the robustness for common gene pairs using the HVG dataset. We calculated the Pearson coefficient of the expression of two genes detected by minet and GENIE3 and regarded the top 500 gene pairs with negative coefficients as mutually exclusive gene pairs. Since the four methods shared 270 common gene pairs among the top 500 mutually exclusive gene pairs, we prepared these gene pairs as the positive samples and 500 negative samples that were randomly generated from genes that did not have zero expression. Figure 2 and Supplementary Table S2 summarize the performances of the four methods, and EEI showed the best performance among them. Existing methods showed lower accuracy due to insufficient read counts as the read depth decreased. In contrast, EEI exhibited better performance at a lower sequencing read depth and was not strongly affected by a decrease in the total number of read counts.

Figure 2.


Comparison of the performances of the four methods with common gene sets from the glioblastoma scRNA-seq dataset. r0.1 represents the synthetic dataset in which 90% of read counts present zero expression compared to the original data. The average AUROC (A), AUPR (B) and AP (C) were calculated by repeating each simulation 10 times.

Second, we also analysed the robustness for gold standard gene pairs using two types of datasets: the HVG dataset and an expression dataset consisting of 18,597 genes that are expressed in at least one cell (NTZ dataset). We prepared 17 and 29 positive samples reported in the literature (listed in Supplementary Table S1) for the HVG dataset and NTZ dataset, respectively, and 50 negative samples. Since minet and GENIE3 cannot be applied to large-scale datasets, we examined only the performances of the EEI and Pearson methods for the NTZ dataset. As the gene pairs of negative samples may also correspond to mutually exclusive gene pairs in some cases, we calculated only the AUPR and AP for the evaluation. As shown in Figure 3 and Supplementary Tables S3 and S4, EEI exhibited the best performance among all methods in both the HVG and NTZ datasets. For the HVG dataset, the AUPR and AP of Pearson, minet and GENIE3 were low and decreased as the sequencing depth decreased. In contrast, even when the sequence read depth decreased by 90% (r0.1 in Figure 3 and Supplementary Tables S3 and S4), the AUPR and AP of EEI showed moderate values.

Figure 3.


Comparison of the performances of the four methods with gold standard gene pairs. The AUPR (A) and AP (B) were calculated using the NTZ dataset and the AUPR (C) and AP (D) were calculated using the HVG dataset.

In particular, since the NTZ dataset contains various types of genes, including not only highly variable genes but also genes with low expression, EEI could comprehensively capture various genes that were specifically expressed in each cell type. EEI can be applied to a large-scale dataset and can extensively capture mutually exclusive expression even if the dataset is sparse. While the other methods were sensitive to the depth of sequencing, EEI was not strongly affected by a decrease in the sequencing depth. These results suggest that our method enables us to comprehensively detect mutually exclusive gene sets while maintaining robustness against the sequencing read depth in sparse scRNA-seq data.

Identification of cell marker genes

To examine whether EEI can identify marker genes, we assessed its performance for detecting marker genes by comparing it with a marker selection method and with marker databases for single cells. SCMarker (53) is an unsupervised marker selection method that identifies genes that are discriminatively expressed across cell types based on a mixture distribution model and are co- or mutually exclusively expressed. We calculated the prediction accuracy of EEI and SCMarker in terms of the positive and negative samples in Supplementary Table S1C using human ES cells (44) and glioblastoma scRNA-seq datasets. The marker genes of human ES cells and cancer stem cells in glioblastoma have been reported in previous studies (44,54–56), respectively. Note that since the two methods detected different numbers of genes, we generated the same number of false positives and true negatives as with SCMarker by randomly selecting genes detected by EEI.

For the glioblastoma dataset, EEI identified 5,354 genes above the threshold, 1.0, and SCMarker identified 456 genes. Table 2 shows that EEI outperformed SCMarker for detecting marker genes from the datasets that contain both sparse and sufficient read counts. We also analysed the detected gene pairs in glioblastoma in two public databases for cell type markers, CellMarker (57) and PanglaoDB (58), by EEI. The results showed that PDGFRA was included in CellMarker and that PDGFRA, MET, MEF2C, OLIG1, SDC2, A2M, CHL1, MEG3 and SLC1A3 were included in PanglaoDB. These markers are expressed in specific cell types in brain tissue. These results suggest that EEI has the possibility of detecting not only mutually exclusive gene pairs but also cell type marker genes.

Table 2.

Prediction accuracies of EEI and SCMarker

(a) Human ES dataset

| Metric | EEI | SCMarker |
|---|---|---|
| Accuracy | 1.00 | 0.824 |
| Precision | 1.00 | 0.00284 |
| Sensitivity | 0.750 | 0.313 |
| F1 score | 0.857 | 0.00564 |

(b) Glioblastoma dataset

| Metric | EEI | SCMarker |
|---|---|---|
| Accuracy | 0.687 | 0.682 |
| Precision | 0.0194 | 0.00439 |
| Sensitivity | 0.643 | 0.143 |
| F1 score | 0.0377 | 0.00851 |
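The F1 scores in Table 2 are the harmonic mean of precision and sensitivity, which can be checked with a few lines of Python. The function name `f1_score` is illustrative, and the inputs below are the rounded values printed in the table, so small differences in the last digit are expected.

```python
def f1_score(precision, sensitivity):
    """Harmonic mean of precision and sensitivity (recall)."""
    if precision + sensitivity == 0:
        return 0.0
    return 2 * precision * sensitivity / (precision + sensitivity)

# Rounded values for EEI taken from Table 2.
f1_human_es = f1_score(1.00, 0.750)          # close to the reported 0.857
f1_glioblastoma = f1_score(0.0194, 0.643)    # close to the reported 0.0377
```

Because the harmonic mean is dominated by the smaller of its two inputs, the low precision on the glioblastoma dataset keeps the F1 score low despite a sensitivity above 0.6.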

Application of EEI to single-cell clustering

In the classification of single cells, feature selection is an important step. Notably, it is crucial for there to be an association between gene expression features extracted from scRNA-seq data and the clustering of single cells. At the single-cell level, mutually exclusive gene sets due to genetic alterations might be considered to be expressed exclusively in different types of cells, which leads to tumour heterogeneity in cancer progression. This means that mutually exclusive gene sets can be used as features for the clustering of single cells. To evaluate the effectiveness of EEI, we compared the performances of the five methods using the five scRNA-seq datasets (‘Materials and Methods’ section). The feature matrix was generated by merging the ratio matrix of the top 1,000 mutually exclusive gene pairs by each method listed in Supplementary Table S5 and the expression matrix of highly variable genes.

Figure 4 shows the UMAP results with EEI, the Pearson correlation, minet, GENIE3 and PIDC. For both the human ESC-derived neuron (46) and PBMC_CELseq2 (48) datasets, clustering with EEI showed that each cell type was detected as a separate cluster and produced compact clusters for all cell types. We also performed k-means clustering to evaluate the clustering results of the five methods. As shown in Table 3, the ARI of EEI was the highest among the methods for these two datasets and for PBMC_MARSseq. In particular, the ARIs of EEI were higher than those of the original data, indicating that mutually exclusive gene sets can be effective for improving the clustering performance. As shown in Supplementary Figures S3 and S5, the silhouette coefficients of EEI were greater than those of the other methods for both datasets. For the other datasets shown in Supplementary Figures S2, S4 and S6, the silhouette coefficients of PIDC and minet were higher than those of EEI for the human ES (44), mouse brain (47) and PBMC_MARSseq (48) datasets, respectively. In particular, since the PBMC_CELseq2, PBMC_MARSseq and human ES datasets were generated by CELseq2, MARSseq and Smartseq2 protocols, not by a droplet-based protocol, they contained sufficient read counts. In most cases, clustering with EEI displayed distinct clusters of cell types in these datasets. This means that EEI tends to be effective not only for sparse scRNA-seq data but also for data with a sufficient sequencing depth. Therefore, EEI enables us to capture the exclusive expression between two genes displaying intercell-type (but not intracell-type) heterogeneity.

Figure 4.

Comparison of the EEI (i), Pearson correlation (ii), minet (iii), GENIE3 (iv) and PIDC (v) UMAP results using human ES cell (A) and PBMC_CELseq2 (B) scRNA-seq datasets.

Table 3.

Comparison of the clustering results of five methods

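| Dataset | Original data | EEI | Pearson | Minet | GENIE3 | PIDC |
| --- | --- | --- | --- | --- | --- | --- |
| Human ES (44) | 0.60 | 0.98 | 0.97 | 0.98 | 0.98 | 0.98 |
| Human ES (46) | 0.25 | 0.43 | 0.23 | 0.24 | 0.25 | 0.27 |
| Mouse Brain | 0.35 | 0.66 | 0.51 | 0.67 | 0.51 | 0.42 |
| PBMC_CELseq2 | 0.12 | 0.39 | 0.38 | 0.36 | 0.36 | 0.36 |
| PBMC_MARSseq | 0.19 | 0.32 | 0.31 | 0.30 | 0.29 | 0.30 |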

In addition, we examined the effectiveness of EEI using human MEP scRNA-seq data (59). This dataset consists of normalized data, including negative values, and does not contain any zero values. While CDI identified 854 gene pairs under a threshold of 0.01, EEI output no gene pairs. One reason for this result is that, because the dataset does not include any zero values, EEI could not capture exclusive patterns, as it relies on counting the number of cells in each binarized expression state. Therefore, these results suggest that EEI is widely applicable to scRNA-seq datasets generated from distinct platforms, provided that they include zero expression values, and that the mutually exclusive gene sets detected by EEI can be applied to improve the sensitivity of the identification of cellular heterogeneity.
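The role of zero values can be illustrated with the binarization step alone. The sketch below is not the EEI statistic itself (its definition is given in the ‘Materials and Methods’ section); it only counts, for one gene pair, the cells in which exactly one of the two genes is detected, assuming that a gene is treated as expressed whenever its value is non-zero. The function name and the toy vectors are hypothetical.

```python
def exclusive_cell_counts(expr_a, expr_b):
    """Count cells where only gene A, or only gene B, has a non-zero value."""
    if len(expr_a) != len(expr_b):
        raise ValueError("both genes must be measured in the same cells")
    only_a = sum(1 for a, b in zip(expr_a, expr_b) if a != 0 and b == 0)
    only_b = sum(1 for a, b in zip(expr_a, expr_b) if a == 0 and b != 0)
    return only_a, only_b


# Hypothetical sparse counts: the two genes are detected in different cells.
sparse_a = [3, 0, 5, 0, 0, 2]
sparse_b = [0, 4, 0, 1, 0, 0]
print(exclusive_cell_counts(sparse_a, sparse_b))  # (3, 2)

# Hypothetical normalized values without zeros: no cell is "off" for either gene.
dense_a = [1.2, -0.4, 0.9, -1.1, 0.3, 0.7]
dense_b = [-0.8, 1.5, -0.2, 0.6, -0.9, -0.3]
print(exclusive_cell_counts(dense_a, dense_b))  # (0, 0)
```

Under this binarization, a dataset without zeros yields no exclusively expressing cells for any pair, which is consistent with EEI returning no gene pairs for the MEP data.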

Comparison of community detection

Community detection using human ES cell scRNA-seq data

First, we evaluated the performance of community detection with CDI for human embryonic stem cell scRNA-seq data (44) with a sufficient sequencing depth. For the differentiation of ES cells into specific cell types, several marker genes that are expressed in each cell type as reported in (44) were examined. The coexpression network was constructed under a CDI threshold of 10.0. A total of 3270 genes were included in this network, and we identified 102 communities in total, including small-sized communities. Since the modularity score was 0.64, the network contained densely connected communities. Regarding known marker genes, we identified 13 markers described in (44) and analysed which markers were included in individual communities. Table 4 shows the marker genes and the cell types in which the corresponding markers were expressed.

Table 4.

List of human ES cell type-specific genes included in each community and the corresponding cell types. Six genes specific to DE cells are included in the c0 community

| Cell type name | Marker gene (community no.) |
| --- | --- |
| H1 cells | DNMT3B(c97) |
| NPCs | SOX2(c97), PAX6(c0), MAP2(c97) |
| ECs | CD34(c98) |
| TBs | GATA3(c98), HAND1(c99) |
| DECs | CER1(c0), EOMES(c0), GATA6(c0), LEFTY1(c0), SOX17(c0), CXCR4(c0) |

The detected marker genes were confined to communities c0, c97, c98 and c99. In particular, six genes (CER1, EOMES, GATA6, LEFTY1, SOX17 and CXCR4) specific to definitive endoderm (DE) cells were identified in the c0 community. Since human ES cells can differentiate towards DE cells, it is possible that c0 contains not only marker genes but also some genes associated with endoderm development. Similarly, other markers were mainly included in the c97 community, and it is possible that c97 contains sets of genes that exhibit functions specific to H1 cells and NPCs. Therefore, community detection enables us to identify groups of functionally related genes, in contrast to the analysis of individual genes using a coexpression network. These results suggest that CDI can capture coexpression patterns not only from sparse datasets but also from different types of datasets with a sufficient read depth.

Comparison of coexpressed gene sets

Second, we evaluated the performance of the community detection of gene coexpression networks from sparse scRNA-seq data. Since CDI is applicable to large-scale networks, we focused on detecting communities of coexpression networks using the Leading eigenvector method (see ‘Materials and Methods’ section). We compared the codependent gene sets obtained by CDI using glioblastoma 0h scRNA-seq data with those obtained by the Pearson correlation and cosine similarity. In this experiment, we selected the top 11 100 gene pairs under a CDI threshold of 10.0, and for the other methods we selected the same number of gene pairs with positive coefficients in descending order. The numbers of genes included in the resulting networks associated with CDI, the Pearson correlation and cosine similarity were 835, 1221 and 436, respectively. The modularity score and the number of communities detected under each method are summarized in Table 5.
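The edge selection step described above can be sketched as follows, assuming that a gene pair is retained as an edge when its score is at least the threshold (the direction of the comparison is an assumption here) and that retained pairs are ranked in descending order of score. The function names, gene labels and scores are hypothetical.

```python
def select_edges(pair_scores, threshold, top_n=None):
    """Keep pairs scoring at least `threshold`, highest scores first."""
    kept = [(pair, s) for pair, s in pair_scores.items() if s >= threshold]
    kept.sort(key=lambda item: item[1], reverse=True)
    if top_n is not None:
        kept = kept[:top_n]
    return [pair for pair, _ in kept]


def genes_in_network(edges):
    """Set of genes that appear in at least one retained edge."""
    return {gene for edge in edges for gene in edge}


# Hypothetical scores for five gene pairs.
scores = {
    ("G1", "G2"): 25.0,
    ("G1", "G3"): 12.5,
    ("G2", "G3"): 10.0,
    ("G4", "G5"): 9.9,
    ("G3", "G6"): 3.1,
}
edges = select_edges(scores, threshold=10.0)
print(edges)
print(sorted(genes_in_network(edges)))
```

Because the number of genes in a network depends on how the selected pairs overlap, the same number of edges can yield networks of different sizes, as observed for the three methods.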

Table 5.

Comparison of modularity scores and the numbers of communities under three methods applied to glioblastoma scRNA-seq data

| | CDI | Pearson correlation | Cosine similarity |
| --- | --- | --- | --- |
| Modularity | 0.38 | 0.35 | 0.09 |
| The number of communities | 8 | 147 | 29 |

The highest modularity score was 0.38 for CDI, and it was observed that the coexpression network according to CDI was divided into 6 medium-sized communities composed of approximately 10–300 genes (c0, c3, c4, c5, c6 and c7), as shown in Supplementary Figure S7. In particular, the 0h_c3 community consisted of many genes that exhibited a high degree and were densely connected. At 0h, c0, c4 and c5 contained fewer hub genes than other communities. In contrast, the network obtained by the Pearson correlation was divided into several medium-sized and many small communities. Networks with high modularity scores presented sets of nodes that were densely connected.
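For reference, a modularity score can be computed directly from an edge list and a community assignment. The sketch below assumes the standard Newman modularity for an undirected, unweighted network, Q = sum over communities c of [L_c/m - (d_c/2m)^2], where m is the number of edges, L_c is the number of edges inside community c and d_c is the summed degree of its nodes. The function name and the toy network are hypothetical.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Newman modularity Q of a partition of an undirected, unweighted graph."""
    m = len(edges)
    if m == 0:
        raise ValueError("the network has no edges")
    internal_edges = defaultdict(int)  # edges with both ends in community c
    total_degree = defaultdict(int)    # summed degree of nodes in community c
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        total_degree[cu] += 1
        total_degree[cv] += 1
        if cu == cv:
            internal_edges[cu] += 1
    return sum(
        internal_edges[c] / m - (total_degree[c] / (2 * m)) ** 2
        for c in total_degree
    )


# Hypothetical network: two triangles joined by a single edge.
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"),
             ("d", "e"), ("e", "f"), ("d", "f"),
             ("c", "d")]
toy_partition = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
print(modularity(toy_edges, toy_partition))  # 5/14, about 0.357
```

Placing all nodes in a single community gives Q = 0, so a positive score indicates more intra-community edges than expected for nodes of the same degrees connected at random.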

To evaluate the properties of the communities obtained with the three methods, we calculated the Jaccard index to measure the similarity of the sets of genes in the community under each method. We regarded 30 genes included in the glioma pathway (hsa05214) in the Kyoto Encyclopedia of Genes and Genomes (KEGG) (60) as the gold standard: EGF, TGFA, PDGF, IGF1, EGFR, PDGFRA, IGF1R, PLCG1, SHC1, GRB2, CALM, PRKCA, SOS, CAMK1, HRAS, PIK3CA, PTEN, BRAF, AKT, MAP2K1, MTOR, ERK, MDM2, TP53, P21, P16, CCND1, CDK4, RB1 and E2F1. Since there were some variations in community size, we used only communities containing more than eight genes. The largest coefficients were 0.0156 for c0 and 0.0115 for c4. On the other hand, no genes identified according to the Pearson correlation and cosine similarity were shared with the KEGG pathway. The c0 community consisted of 165 genes, including the stemness marker genes PDGFRA, A2M and NEAT1, and the genes shared with the KEGG pathway were PDGFRA, IGF1R and EGFR. Interestingly, all three of these genes were included in the same community. We also performed Gene Ontology (GO) term enrichment analysis (61,62), and 24 functional pathways were found to be significantly enriched (with a p-value of less than 1e−04 for three shared genes), as listed in Supplementary Figure S8. The top 3 pathways were transmembrane receptor protein tyrosine kinase activity, positive regulation of DNA replication and tyrosine-protein kinase, catalytic domain. Although there were only a few shared genes between the 0h_c0 community and the KEGG pathway, many different pathways were enriched compared to other CDI communities. There is a possibility that several specific genes associated with glioblastoma and various biological functions were included in the 0h_c0 community. CDI can capture coexpression patterns from sparse scRNA-seq data if the sample size is sufficient. However, it must be noted that CDI does not necessarily indicate that two genes are positively correlated. These results suggest that the coexpression network associated with CDI contained densely connected subnetworks and that community detection enables us to identify not only possible densely connected subgraphs but also biologically functional networks.
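As a worked check of the coefficient reported for c0: the community contains 165 genes, the gold standard contains 30 genes and three genes are shared, so the Jaccard index is 3/(165 + 30 − 3) = 3/192 ≈ 0.0156. A minimal sketch of this calculation is shown below; the function name is illustrative, and placeholder names stand in for the genes that are not shared.

```python
def jaccard_index(set_a, set_b):
    """Size of the intersection divided by the size of the union."""
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


shared = {"PDGFRA", "IGF1R", "EGFR"}
# Placeholder names (hypothetical) for the genes that are not shared.
kegg_glioma = shared | {f"kegg_only_{i}" for i in range(27)}   # 30 genes
community_c0 = shared | {f"c0_only_{i}" for i in range(162)}   # 165 genes

print(round(jaccard_index(community_c0, kegg_glioma), 4))  # 0.0156
```

Because the union is dominated by the larger set, the Jaccard index stays small whenever a large community is compared with a short gene list, even if several pathway genes are recovered.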

Comparison of coexpression networks for multiple samples

To validate the effectiveness of community detection when multiple samples are provided, we performed a comparative analysis of coexpression networks from glioblastoma scRNA-seq data at different time points. The dataset for 0 h consisted of 18 597 genes and 2102 cells, and the dataset from 12 h consisted of 18 163 genes and 2209 cells. Using these datasets, we reconstructed coexpression networks based on a CDI threshold of 10.0. After decomposition of the networks using the Leading eigenvector method, to compare extracted communities, we calculated the Szymkiewicz–Simpson coefficient to evaluate the similarity between two communities at different time points.
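The Szymkiewicz–Simpson coefficient (also called the overlap coefficient) divides the number of shared elements by the size of the smaller set, so it is not penalized when the two communities differ in size. A minimal sketch, in which the function name and the gene labels are hypothetical:

```python
def overlap_coefficient(set_a, set_b):
    """Szymkiewicz-Simpson coefficient: |A & B| / min(|A|, |B|)."""
    smaller = min(len(set_a), len(set_b))
    if smaller == 0:
        return 0.0
    return len(set_a & set_b) / smaller


# Hypothetical communities detected at two time points.
community_0h = {"g1", "g2", "g3", "g4", "g5", "g6"}
community_12h = {"g3", "g4", "g5", "g6", "g7", "g8", "g9", "g10"}

print(overlap_coefficient(community_0h, community_12h))  # 4 / 6
```

The coefficient equals 1 whenever one community is entirely contained in the other, which makes it suitable for tracking a community that grows or shrinks between samples.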

As shown in Figure 5, every network exhibited a power-law degree distribution. The scale-free network presented an uneven node degree distribution and contained a few hub nodes. With regard to the hub genes, CDK1, PBK, KIAA0101, GTSE1, MKI67, UBE2C, SGOL1, TPX2, AURKB and RRM2 were identified in the 0-h network, and TOP2A, PBK, TYMS, NUSAP1, BIRC5, UBE2C, TPX2, GTSE1, ATAD2 and MKI67 were identified in the 12-h network. Additionally, the results showed that four communities including more than 100 genes were extracted for each sample. We observed high similarity for shared communities: 0.83 for 0h_c3 and 12h_c6 and 0.74 for 0h_c4 and 12h_c7 (see Supplementary Table S6). In contrast, 0h_c0 and 12h_c8 as well as 0h_c5 and 12h_c0 were considered to be specific communities in response to external stimuli. For example, it has been reported that the known stemness marker genes PDGFRA and MET exhibit different expression patterns (34). While PDGFRA and neighbouring genes existed in 0h_c0, PDGFRA did not exist in any community at 12 h, although some neighbouring genes remained in 12h_c8. MET and neighbouring genes existed in both 0h_c5 and 12h_c0. EEI results showed that PDGFRA and MET were expressed in a mutually exclusive manner, indicating that these two genes are expressed in different cell populations (34). This means that although there are some variations in the expression of individual genes, the sets of genes that constitute the community are partially conserved regardless of the effects of external stimuli. Therefore, these results suggest that community-based comparisons of coexpression networks enable us to detect similarities and differences in coexpressed genes across multiple samples.
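The degree distribution and the hub genes discussed above can both be derived from the edge list of a coexpression network. The sketch below assumes that hubs are ranked simply by node degree, which may differ from the exact criterion used in the analysis; the function names and the toy network are hypothetical.

```python
from collections import Counter


def node_degrees(edges):
    """Degree of every node in an undirected edge list."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return degree


def degree_distribution(degree):
    """Map each degree k to the number of nodes with that degree."""
    return dict(sorted(Counter(degree.values()).items()))


def top_hubs(degree, n=10):
    """The n highest-degree nodes, ties broken alphabetically."""
    ranked = sorted(degree.items(), key=lambda item: (-item[1], item[0]))
    return [node for node, _ in ranked[:n]]


# Hypothetical network with one highly connected node.
toy_edges = [("hub", "a"), ("hub", "b"), ("hub", "c"), ("hub", "d"), ("a", "b")]
degree = node_degrees(toy_edges)
print(degree_distribution(degree))  # {1: 2, 2: 2, 4: 1}
print(top_hubs(degree, n=2))        # ['hub', 'a']
```

In a scale-free network, plotting this distribution on log-log axes gives an approximately straight line, with the hubs forming the sparse high-degree tail.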

Figure 5.

Comparative analysis of coexpression networks at different time points. The coexpression network constructed at each time point (A) and degree distribution (B). The blue and yellow nodes represent the high- and low-degree nodes, respectively. After decomposition of the coexpression networks, four communities were extracted for each sample, and the values between the communities at 0 and 12 hours represent the Szymkiewicz-Simpson coefficients (C).

DISCUSSION AND CONCLUSION

In this study, we developed a novel metric, EEI, to comprehensively quantify mutual exclusivity between two genes from sparse scRNA-seq data. A comparison with existing methods using glioblastoma scRNA-seq data suggested that EEI identified gene sets due to genetic alterations and negative correlations.

In particular, our findings show that EEI is effective for detecting mutually exclusive gene sets, while maintaining robustness against the sequencing read depth in droplet-based scRNA-seq data. We also applied EEI to improve the sensitivity of the classification of single cells. The results suggest that exclusive expression can be introduced to identify intercell-type heterogeneity based on the feature matrix.

We also examined the performance of coexpression networks from glioblastoma scRNA-seq data in community detection. Although the Louvain method is faster, we used the Leading eigenvector method, which is fundamental and sufficiently applicable to large-scale networks. The results suggested that the communities detected from CDI contained more densely connected subgraphs than those detected with existing methods, and some marker genes associated with specific pathways in glioma were identified. Community detection enables us to identify candidate marker genes from known marker genes. A community-based comparison provides information not only on functionally related genes but also on the similarities and differences in coexpressed genes when multiple samples are provided.

Since EEI does not impute technical zeros and captures mutually exclusively expressed genes without discriminating between biological and technical zeros, imputed data might improve the detection of gene pairs. Although EEI and CDI can be applied to large-scale datasets, they require considerable computational time; however, their computation can be effectively parallelized. In addition, while the mutually exclusive gene sets identified by EEI can improve the sensitivity of the identification of cell-to-cell heterogeneity, this approach is not suitable for datasets containing excessive zeros, and another future goal will be to improve the extraction of expression features.

DATA AVAILABILITY

The code to calculate CDI and EEI is available at https://github.com/Natsu01/EEISP. The raw sequencing data and processed files of the glioblastoma scRNA-seq are available at the NCBI Gene Expression Omnibus (GEO) under accession number GSE144623.


ACKNOWLEDGEMENTS

This research was partially supported by Platform Project for Supporting Drug Discovery and Life Science Research (Basis for Supporting Innovative Drug Discovery and Life Science Research (BINDS)) from AMED under grant number JP19am0101105.

Author contributions: N.N. performed the implementation and computational studies; T. Akutsu and R.N. organized the project; T.H. and T. Akiyama prepared the glioblastoma scRNA-seq data; K.F. and K.S. sequenced the scRNA-seq data; N.N. and R.N. wrote the draft of the manuscript, and all authors approved the submitted version.

Notes

Present address: Natsu Nakajima, Institute of Medical Science, The University of Tokyo, General Research Building 8th floor, 4-6-1 Shirokanedai, Minato-ku, Tokyo 108-8639, Japan.

Contributor Information

Natsu Nakajima, Institute for Quantitative Biosciences, The University of Tokyo, 1-1-1, Yayoi, Bunkyo-ku, Tokyo 113-0032, Japan.

Tomoatsu Hayashi, Institute for Quantitative Biosciences, The University of Tokyo, 1-1-1, Yayoi, Bunkyo-ku, Tokyo 113-0032, Japan.

Katsunori Fujiki, Institute for Quantitative Biosciences, The University of Tokyo, 1-1-1, Yayoi, Bunkyo-ku, Tokyo 113-0032, Japan.

Katsuhiko Shirahige, Institute for Quantitative Biosciences, The University of Tokyo, 1-1-1, Yayoi, Bunkyo-ku, Tokyo 113-0032, Japan.

Tetsu Akiyama, Institute for Quantitative Biosciences, The University of Tokyo, 1-1-1, Yayoi, Bunkyo-ku, Tokyo 113-0032, Japan.

Tatsuya Akutsu, Bioinformatics Center, Institute for Chemical Research, Kyoto University, Gokasho, Uji, Kyoto 611-0011, Japan.

Ryuichiro Nakato, Institute for Quantitative Biosciences, The University of Tokyo, 1-1-1, Yayoi, Bunkyo-ku, Tokyo 113-0032, Japan.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

FUNDING

Scientific Research [17H06331, 20K19916 to N.N., R.N. and 17H06325 to T.H. and T. Akiyama]; P-CREATE (Project for Cancer Research and Therapeutic Evolution, no. 19cm0106103h0004); Japan Agency for Medical Research and Development; T. Akutsu was partially supported by JSPS Grant [18H04113]. Funding for open access charge: Scientific Research [20K19916].

Conflict of interest statement. None declared.

REFERENCES

  • 1. Akutsu T., Miyano S., Kuhara S. Inferring qualitative relations in genetic networks and metabolic pathways. Bioinformatics. 2000; 16:727–734.
  • 2. Hickman G.J., Hodgman T.C. Inference of gene regulatory networks using boolean-network inference methods. J. Bioinform. Comput. B. 2009; 7:1013–1029.
  • 3. Barman S., Kwon Y.K. A Boolean network inference from time-series gene expression data using a genetic algorithm. Bioinformatics. 2018; 34:1927–1933.
  • 4. Chen L., Kulasiri D., Samarasinghe S. A novel data-driven boolean model for genetic regulatory networks. Front. Physiol. 2018; 25:1328.
  • 5. Penfold C.A., Shifaz A., Brown P.E., Nicholson A., Wild D.L. CSI: a nonparametric Bayesian approach to network inference from multiple perturbed time series gene expression data. Stat. Appl. Genet. Mol. Biol. 2015; 14:307–310.
  • 6. Yu B., Xu J.M., Li S., Chen C., Chen R.X., Wang L., Zhang Y., Wang M.H. Inference of time-delayed gene regulatory networks based on dynamic Bayesian network hybrid learning method. Oncotarget. 2017; 8:80373–80392.
  • 7. Sanchez-Castillo M., Blanco D., Tienda-Luna I.M., Carrion M.C., Huang Y. A Bayesian framework for the inference of gene regulatory networks from time and pseudo-time series data. Bioinformatics. 2017; 34:964–970.
  • 8. Xing L., Guo M., Liu X., Wang C., Wang L., Zhang Y. An improved Bayesian network method for reconstructing gene regulatory network based on candidate auto selection. BMC Genomics. 2017; 18:844.
  • 9. Villaverde A.F., Ross J., Morán F., Banga J.R. MIDER: network inference with mutual information distance and entropy reduction. PLoS One. 2014; 9:e96732.
  • 10. Margolin A.A., Nemenman I., Basso K., Wiggins C., Stolovitzky G., Dalla Favera R., Califano A. ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinformatics. 2006; 7:S7.
  • 11. Liang K.C., Wang X. Gene regulatory network reconstruction using conditional mutual information. EURASIP J. Bioinformatics Syst. Biol. 2008; 2008:253894.
  • 12. Pirgazi J., Khanteymoori A.R. A robust gene regulatory network inference method base on kalman filter and linear regression. PLoS One. 2018; 13:e0200094.
  • 13. Kim H., Lee J.K., Park T. Inference of large-scale gene regulatory networks using regression-based network approach. J. Bioinform. Comput. B. 2009; 7:717–735.
  • 14. Kharchenko P.V., Silberstein L., Scadden D.T. Bayesian approach to single-cell differential expression analysis. Nat. Methods. 2014; 11:740–742.
  • 15. Kashima Y., Suzuki A., Liu Y., Hosokawa M., Matsunaga H., Shirai M., Arikawa K., Sugano S., Kohno T., Takeyama H. et al. Combinatory use of distinct single-cell RNA-seq analytical platforms reveals the heterogeneous transcriptome response. Sci. Rep. 2018; 8:3482.
  • 16. Haber A.L., Biton M., Rogel N., Herbst R.H., Shekhar K., Smillie C., Burgin G., Delorey T.M., Howitt M.R., Katz Y. et al. A single-cell survey of the small intestinal epithelium. Nature. 2017; 551:333–339.
  • 17. Chan T.E., Stumpf M.P.H., Babtie A.C. Gene regulatory network inference from single-cell data using multivariate information measures. Cell Syst. 2019; 5:251–267.
  • 18. Aibar S., González-Blas C.B., Moerman T., Huynh-Thu V.A., Imrichova H., Hulselmans G., Rambow F., Marine J.C., Geurts P., Aerts J. et al. SCENIC: single-cell regulatory network inference and clustering. Nat. Methods. 2017; 14:1083–1086.
  • 19. Specht A.T., Li J. LEAP: constructing gene co-expression networks for single-cell RNA-sequencing data using pseudotime ordering. Bioinformatics. 2017; 33:764–766.
  • 20. Matsumoto M., Kiryu H., Furusawa C., Ko M.S.H., Ko S.B.H., Gouda N., Hayashi T., Nikaido I. SCODE: an efficient regulatory network inference algorithm from single-cell RNA-Seq during differentiation. Bioinformatics. 2017; 33:2314–2321.
  • 21. Gao N.P., Ud-Dean S.M.M., Gandrillon O., Gunawan R. SINCERITIES: Inferring gene regulatory networks from time-stamped single cell transcriptional expression profiles. Bioinformatics. 2018; 34:258–266.
  • 22. Aubin-Frankowski P.C., Vert J.P. Gene regulation inference from single cell RNA-seq data with linear differential equations and velocity inference. Bioinformatics. 2020; 36:4774–4780.
  • 23. Pratapa A., Jalihal A.P., Law J.N., Bharadwaj A., Murali T.M. Benchmarking algorithms for gene regulatory network inference from single-cell transcriptomic data. Nat. Methods. 2020; 17:147–154.
  • 24. Girvan M., Newman M.E.J. Community structure in social and biological networks. Proc. Natl. Acad. Sci. U.S.A. 2002; 99:7821–7826.
  • 25. Clauset A., Newman M.E.J., Moore C. Finding community structure in very large networks. Phys. Rev. E. 2005; 70:066111.
  • 26. Blondel V.D., Guillaume J.L., Lambiotte R., Lefebvre E. Fast unfolding of communities in large networks. J. Stat. Mech. Theory Exp. 2008; 2008:P10008.
  • 27. Newman M.E.J. Finding community structure in networks using the eigenvectors of matrices. Phys. Rev. E. 2006; 74:036104.
  • 28. Reichardt J., Bornholdt S. Statistical mechanics of community detection. Phys. Rev. E. 2006; 74:016110.
  • 29. Rahiminejad S., Maurya M.R., Subramaniam S. Topological and functional comparison of community detection algorithms in biological networks. BMC Bioinformatics. 2019; 20:212.
  • 30. Ciriello G., Cerami E., Sander C., Schultz N. Mutual exclusivity analysis identifies oncogenic network modules. Genome Res. 2012; 22:398–406.
  • 31. Kaminker J.S., Zhang Y., Watanabe C., Zhang Z. CanPredict: a computational tool for predicting cancer-associated missense mutations. Nucleic Acids Res. 2007; 35:W595–W598.
  • 32. Carter H., Chen S., Isik L., Tyekucheva S., Velculescu V.E., Kinzler K.W., Vogelstein B., Karchin R. Cancer-specific high-throughput annotation of somatic mutations: computational prediction of driver missense mutations. Cancer Res. 2009; 69:6660–6667.
  • 33. Mohammadi S., Velderrain J.D., Kellis M., Grama A. DECODE-ing sparsity patterns in single-cell RNA-seq. 2018; bioRxiv doi: 10.1101/241646, 09 March 2018, preprint: not peer reviewed.
  • 34. Snuderl M., Fazlollahi L., Le L.P., Nitta M., Zhelyazkova B.H., Davidson C.J., Akhavanfard S., Cahill D.P., Aldape K.D., Betensky R.A. et al. Mosaic amplification of multiple receptor tyrosine kinase genes in glioblastoma. Cancer Cell. 2011; 20:810–817.
  • 35. Soroceanu L., Kharbanda S., Chen R., Soriano R.H., Aldape K., Misra A., Zha J., Forrest W.F., Nigro J.M., Modrusan Z. et al. Identification of IGF2 signaling through phosphoinositide-3-kinase regulatory subunit 3 as a growth-promoting axis in glioblastoma. Proc. Natl. Acad. Sci. U.S.A. 2007; 104:3466–3471.
  • 36. Watanabe K., Tachibana O., Sata K., Yonekawa Y., Kleihues P., Ohgaki H. Overexpression of the EGF receptor and p53 mutations are mutually exclusive in the evolution of primary and secondary glioblastomas. Brain Pathol. 1996; 6:217–223.
  • 37. Johansson P., Krona C., Kundu S., Doroszko M., Baskaran S., Schmidt L., Vinel C., Almstedt E., Elgendy R., Elfineh L. et al. A patient-derived cell atlas informs precision targeting of glioblastoma. Cell Rep. 2020; 32:107897.
  • 38. Suh S.S., Yoo Y.J., Nuovo G.J., Jeon Y.J., Kim S., Lee T.J., Kim T., Bakàcs A., Alder H., Kaur B. et al. MicroRNAs/TP53 feedback circuitry in glioblastoma multiforme. Proc. Natl. Acad. Sci. U.S.A. 2012; 109:5316–5321.
  • 39. Ying M., Tilghman J., Wei Y., Guerrero-Cazares H., Quinones-Hinojosa A., Ji H., Laterra J. Kruppel-like factor-9 (KLF9) inhibits glioblastoma stemness through global transcription repression and integrin α 6 inhibition. J. Biol. Chem. 2014; 289:32742–32756.
  • 40. Newman M.E.J. Modularity and community structure in networks. Proc. Natl. Acad. Sci. U.S.A. 2006; 103:8577–8582.
  • 41. Fortunato S. Community detection in graphs. Phys. Rep. 2010; 486:75–174.
  • 42. Tautenhain C.P.S., Nascimento M.C.V. Spectral algorithm for line graphs to find overlapping communities in social networks. Proceedings of the 11th International Conference on Agents and Artificial Intelligence. 2019; 2:306–317.
  • 43. Lee Y., Lee Y., Seong J., Stanescu A., Hwang C.S. A comparison of network clustering algorithms in keyword network analysis: a case study with geography conference presentations. Int. J. Geospatial Environ. Res. 2020; 7:1.
  • 44. Chu L.F., Leng N., Zhang J., Hou Z., Mamott D., Vereide D.T., Choi J., Kendziorski C., Stewart R., Thomson J.A. Single-cell RNA-seq reveals novel regulators of human embryonic stem cell differentiation to definitive endoderm. Genome Biol. 2016; 17:173.
  • 45. Leng N., Chu L.F., Barry C., Choi Y.L.J., Jiang X.L.P., Stewart R.M., Thomson J.A., Kendziorski C. Oscope identifies oscillatory genes in unsynchronized single-cell RNA-seq experiments. Nat. Methods. 2015; 12:947–950.
  • 46. Manno G.L., Gyllborg D., Codeluppi S., Nishimura K., Salto C., Zeisel A., Borm L.E., Stott S.R.W., Toledo E.M., Villaescusa J.C. et al. Molecular diversity of midbrain development in mouse, human, and stem cells. Cell. 2016; 167:566–580.
  • 47. Zeisel A., Muñoz-Manchado A.B., Codeluppi S., Lönnerberg P., Manno G.L., Juréus A., Marques S., Munguba H., He L., Betsholtz C. et al. Brain structure. Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq. Science. 2015; 347:1138–1142.
  • 48. Mereu E., Lafzi A., Moutinho C., Ziegenhain C., McCarthy D.J., Álvarez-Varela A., Batlle E., Sagar, Grün D., Lau J.K. et al. Benchmarking single-cell RNA-sequencing protocols for cell atlas projects. Nat. Biotechnol. 2020; 38:747–755.
  • 49. Koyama-Nasu R., Nasu-Nishimura Y., Todo T., Ino Y., Saito N., Aburatani H., Funato K., Echizen K., Sugano H., Haruta R. et al. The critical role of cyclin D2 in cell cycle progression and tumorigenicity of glioblastoma stem cells. Oncogene. 2013; 32:3840–3845.
  • 50. Meyer P.E., Lafitte F., Bontempi G. minet: A R/Bioconductor package for inferring large transcriptional networks using mutual information. BMC Bioinformatics. 2008; 9:461.
  • 51. Huynh-Thu V.A., Irrthum A., Wehenkel L., Geurts P. Inferring regulatory networks from expression data using tree-based methods. BMC Bioinformatics. 2010; 5:e12776.
  • 52. Butler A., Hoffman P., Smibert P., Papalex E., Satija R. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat. Biotechnol. 2018; 36:411–420.
  • 53. Wang F., Liang S., Kumar T., Navin N., Chen K. SCMarker: ab initio marker selection for single cell transcriptome profiling. PLoS Comput. Biol. 2019; 15:e1007445.
  • 54. Couturier C.P., Ayyadhury S., Le P.U., Nadaf J., Monlong J., Riva G., Allache R., Baig S., Yan X., Bourgey M. et al. Single-cell RNA-seq reveals that glioblastoma recapitulates a normal neurodevelopmental hierarchy. Nat. Commun. 2020; 11:3406.
  • 55. Mukherjee S. Quiescent stem cell marker genes in glioma gene networks are sufficient to distinguish between normal and glioblastoma (GBM) samples. Sci. Rep. 2020; 10:10937.
  • 56. Cheng Q., Li J., Fan F., Cao H., Dai Z.Y., Wang Z.Y., Feng S.S. Identification and analysis of glioblastoma biomarkers based on single cell sequencing. Front Bioeng. Biotechnol. 2020; 8:167.
  • 57. Zhang Z., Lan Y., Xu J., Quan F., Zhao E., Deng C., Luo T., Xu L., Liao G., Yan M. et al. CellMarker: a manually curated resource of cell markers in human and mouse. Nucleic Acids Res. 2019; 47:D721–D728.
  • 58. Franzén O., Gan L.M., Björkegren J.L.M. PanglaoDB: a web server for exploration of mouse and human single-cell RNA sequencing data. Database. 2019; 2019:baz046.
  • 59. Psaila B., Barkas N., Iskander D., Roy A., Anderson S., Ashley N., Caputo V.S., Lichtenberg J., Loaiza S., Bodine D.M. et al. Single-cell profiling of human megakaryocyte-erythroid progenitors identifies distinct megakaryocyte and erythroid differentiation pathways. Genome Biol. 2016; 17:83.
  • 60. Kanehisa M., Furumichi M., Tanabe M., Sato Y., Morishima K. KEGG: new perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Res. 2017; 45:D353–D361.
  • 61. Huang D.W., Sherman B.T., Lempicki R.A. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat. Protoc. 2009; 4:44–57.
  • 62. Huang D.W., Sherman B.T., Lempicki R.A. Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res. 2009; 37:1–13.



Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press
