Abstract
Accurate identification of cell types from single-cell RNA sequencing (scRNA-seq) data plays a critical role in a variety of scRNA-seq analysis studies. This task corresponds to solving an unsupervised clustering problem, in which the similarity measurement between cells affects the result significantly. Although many approaches for cell type identification have been proposed, their accuracy still needs to be improved. In this study, we propose a novel single-cell clustering framework based on similarity learning, called SSRE. SSRE models the relationships between cells based on the subspace assumption and generates a sparse representation of the cell-to-cell similarity, which retains the most similar neighbors for each cell. In addition, three classical pairwise similarities are incorporated with a gene selection and enhancement strategy to further improve the effectiveness of SSRE. Tested on ten real scRNA-seq datasets and five simulated datasets, SSRE achieved superior performance in most cases compared with several state-of-the-art single-cell clustering methods. In addition, SSRE can be extended to visualization of scRNA-seq data and identification of differentially expressed genes. The MATLAB and Python implementations of SSRE are available at https://github.com/CSUBioGroup/SSRE.
Keywords: Single-cell RNA sequencing, Clustering, Cell type, Similarity learning, Enhancement
Introduction
With the recent emergence of single-cell RNA sequencing (scRNA-seq) technology, numerous scRNA-seq datasets have been generated, which brings unique challenges for advanced omics data analysis [1], [2]. Unlike bulk sequencing, which averages expression over a large number of cells, scRNA-seq quantifies gene expression at single-cell resolution. Single-cell techniques have advanced the study of a wide variety of biological topics such as cell heterogeneity, cell fate decision, and disease pathogenesis [3], [4], [5]. Among all the applications, cell type identification plays a fundamental role, and its performance has a substantial impact on downstream studies [6]. However, identifying cell types from scRNA-seq data is still a challenging problem. Traditional clustering methods do not work well on scRNA-seq data because of the high noise level and high dropout rate [7]. Therefore, new efficient and reliable clustering methods for cell type identification are urgently needed.
In recent studies, several novel clustering approaches for detecting cell types from scRNA-seq data have been proposed. In these methods, cell types are mainly determined on the basis of a learned cell-to-cell similarity. For example, single-cell interpretation via multikernel learning (SIMLR) [8] visualizes and clusters cells using multi-kernel similarity learning [9], which performs well on grouping cells. Shared nearest neighbor (SNN)-Cliq [10] first constructs a distance matrix based on the Euclidean distance, and then introduces the shared k-nearest neighbors (KNN) model to redefine the similarity. SNN-Cliq provides both an estimate of the cluster number and the clustering results by searching for quasi-cliques. Moreover, Corr [11] defines the cell-pair differentiability correlation instead of computing a primary (dis)similarity such as Pearson correlation or Euclidean distance. RAFSIL [12] divides genes into multiple clusters and performs dimension reduction on each gene cluster. Then, RAFSIL concatenates the informative features obtained from each gene cluster. Finally, RAFSIL applies a random forest to calculate the similarities for each cell recursively. In addition, nonnegative matrix factorization (NMF) determines the cell types in the latent space [13], while SinNLRR [14] and AdaptiveSSC [15] learn the similarity matrix with nonnegative low-rank and sparse constraints. Instead of learning a specific similarity, some researchers have turned to ensemble learning, which focuses on the consensus of multiple clustering methods [16], [17].
Even though many approaches have been applied to cell type identification, most of them are sensitive to noise, especially for high-dimensional data. They generally compute the similarity between two cells by considering only the gene expression of these two cells [18]. In this study, we developed SSRE, a novel method for cell type identification. It focuses on similarity learning, in which the cell-to-cell similarity is measured by also considering the similar neighbors of each cell. SSRE computes the linear representation between cells based on sparse subspace theory, and thus generates a sparse representation of cell-to-cell similarity [19]. Moreover, motivated by the observation that each similarity measurement can represent data from a different aspect [16], [20], SSRE incorporates three classical pairwise similarities into similarity learning. To reduce the effect of irrelevant features and improve the overall accuracy, SSRE adopts a two-step procedure, i.e., 1) adaptive gene selection and 2) similarity enhancement. The experimental results show that, when combined with spectral clustering, the similarities learned by SSRE can reliably reveal the block structure of scRNA-seq data. The results on ten real and five simulated scRNA-seq datasets also show that SSRE achieves higher accuracy of cell type detection in most cases than the popular approaches compared. Moreover, SSRE can be easily extended to other scRNA-seq tasks such as differential expression analysis and data visualization.
Method
Framework of SSRE
We first give a brief overview of SSRE. A schematic diagram of SSRE is shown in Figure 1, and the detailed steps are introduced later in this section. Given a scRNA-seq expression matrix, SSRE first removes genes whose expression levels are zero in all cells. Then, the informative genes are selected based on the sparse subspace representation (SSR), Pearson correlation, Spearman correlation, and cosine similarity. With the preprocessed gene expression matrix, SSRE learns the SSR for all cells simultaneously. Then, SSRE derives an enhanced similarity matrix from the learned SSR similarity and the other three pairwise similarities. Finally, SSRE uses the enhanced similarity to identify cell types and visualize data.
Sparse subspace representation
The estimation of the similarity (or distance) matrix is a crucial step in clustering [8]. If the similarity matrix is well estimated, the clusters become considerably easier to distinguish. In this study, we adopted sparse subspace theory [19] to compute the linear representation between cells and generate a sparse representation of the cell-to-cell similarity. Subspace-based clustering methods have been successfully applied in the computer vision field and have proved highly robust to corrupted data [21], [22]. For scRNA-seq data, the sparse representation of cell-to-cell similarity is measured by considering the linear combination of similar neighbors. This tends to capture global structure information and generate more reliable similarities than traditional pairwise similarity measurements. The specific calculation is described as follows.
Mathematically, a scRNA-seq dataset with genes and cells can be denoted as , where indicates the expression profiles of the genes in cell . Its linear representation coefficient matrix satisfies the equation . According to the assumption that the expression of a cell can be represented by other cells in the same type, only the similarity of cells in the same cluster is non-zero. It also means that the coefficient matrix is usually sparse. With the relaxed sparse constraint, the coefficient matrix can be computed by solving an optimization problem as follows:
(1) |
where denotes the Frobenius norm which calculates the square root of sum of all squared elements, and constraint prevents the cells from being represented by themselves, while is a penalty factor. An efficient approach to solve Equation (1) is the alternating direction method of multipliers (ADMM) [23]. We rewrite Equation (1) as follows:
(2) |
where is an auxiliary matrix. According to the model of ADMM, the augmented Lagrangian with auxiliary matrix and penalty parameter () > 0 for the optimization Equation (2) is
(3) |
where is the dual variable. The derivation of its update can be found in section 1 of File S1. Matrix is the target sparse representation matrix. To keep the symmetry and nonnegative nature of similarity matrix, the element of SSR is calculated as .
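The idea of a sparse self-representation with a zero diagonal can be illustrated with a minimal sketch. Several choices here are assumptions made for illustration and are not taken from the paper: the objective is written as 0.5*||X - XC||_F^2 + lam*||C||_1 (the exact weighting in Equation (1) is not reproduced), the solver is plain proximal gradient descent (ISTA) rather than the ADMM scheme described above, and the similarity is symmetrized as |C| + |C|^T, which is the usual choice in sparse subspace clustering. The function names and the toy data are hypothetical.

```python
import numpy as np


def soft_threshold(v, t):
    """Element-wise soft-thresholding: the proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def sparse_self_representation(X, lam=0.1, n_iter=500):
    """Approximately solve min_C 0.5*||X - XC||_F^2 + lam*||C||_1, diag(C) = 0.

    X has shape (genes, cells). ISTA is used here as a simple stand-in
    for the ADMM solver described in the text.
    """
    n = X.shape[1]
    G = X.T @ X                          # Gram matrix between cells
    step = 1.0 / np.linalg.norm(G, 2)    # 1 / Lipschitz constant of the gradient
    C = np.zeros((n, n))
    for _ in range(n_iter):
        grad = G @ C - G                 # gradient of 0.5*||X - XC||_F^2
        C = soft_threshold(C - step * grad, step * lam)
        np.fill_diagonal(C, 0.0)         # a cell may not represent itself
    return C


def ssr_similarity(C):
    """Symmetric, nonnegative similarity built from the coefficients."""
    A = np.abs(C)
    return A + A.T


# Toy data: cells 0-4 express only genes 0-2, cells 5-9 only genes 3-5,
# so the two groups lie in orthogonal subspaces.
rng = np.random.default_rng(0)
X = np.zeros((6, 10))
X[:3, :5] = rng.uniform(0.5, 1.5, size=(3, 5))
X[3:, 5:] = rng.uniform(0.5, 1.5, size=(3, 5))

C = sparse_self_representation(X)
S = ssr_similarity(C)
```

On this toy matrix the similarity between the two groups is exactly zero while cells within a group remain connected, which is the block structure that the subspace assumption is meant to expose.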
Data preprocessing and gene selection
Before used to calculate SSR, the original data needs to be preprocessed. Various data preprocessing methods have been used in the previous studies, such as gene filtering [12], [16], feature selection [24], [25], and imputation [26], [27]. In this study, we first removed genes with zero expression in all of cells and applied -norm to each cell to eliminate the expression scale difference between different cells. Then, we computed the preliminary SSR with the normalized gene expression matrix, and adopted the Laplacian score [28] on SSR to assess the contribution that genes make to cell-to-cell similarity learning. According to the Laplacian scores, we selected significant genes for the following study. Genes with higher Laplacian scores are considered as more informative in distinguishing cell types [8]. Besides the SSR, we also considered three additional pairwise similarities, i.e., Pearson correlation, Spearman correlation, and cosine similarity, to evaluate the importance of genes (denoted as , , and , respectively). For each similarity, we ranked genes in descending order by the Laplacian score and selected the top genes as an important gene set that is denoted by . The determination of the threshold can be formulated as
(4) |
where = [] and denote two gene sets divided by . The and are the Laplacian scores of genes in sets and , respectively, and is the cardinality of a set. The indicates variance of a set while is the number of genes. Finally, we recomputed , , , and based on the intersection of four selected important gene sets. In the next section, we introduce an enhancement strategy to further improve the learned SSR .
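As a rough illustration of scoring genes against a cell-to-cell similarity matrix, the sketch below computes the Laplacian score in its textbook form (He et al. [28]). It is an assumption that this matches the scoring used by SSRE: in the textbook definition a smaller score means that a gene varies smoothly over the similarity graph, whereas the text above ranks genes by higher scores, so the actual implementation presumably uses a different sign convention or transform that is not reproduced here. The threshold selection of Equation (4) is also not covered. Function and variable names are hypothetical.

```python
import numpy as np


def laplacian_scores(X, S):
    """Textbook Laplacian score of every gene.

    X: (genes, cells) expression matrix.
    S: (cells, cells) symmetric, nonnegative similarity matrix.
    """
    d = S.sum(axis=1)                       # degree of each cell
    L = np.diag(d) - S                      # unnormalized graph Laplacian
    # Remove the degree-weighted mean of each gene.
    Xc = X - (X @ d / d.sum())[:, None]
    num = np.einsum("gi,ij,gj->g", Xc, L, Xc)   # f^T L f for each gene
    den = (Xc ** 2) @ d                          # f^T D f for each gene
    return num / np.maximum(den, 1e-12)


# Toy example: two groups of four cells, fully connected within a group.
S = np.zeros((8, 8))
S[:4, :4] = 1.0
S[4:, 4:] = 1.0
np.fill_diagonal(S, 0.0)

X = np.array([
    [1, 1, 1, 1, 0, 0, 0, 0],   # gene 0 follows the group structure
    [1, 0, 1, 0, 1, 0, 1, 0],   # gene 1 ignores the group structure
], dtype=float)

scores = laplacian_scores(X, S)
```

Here gene 0 receives a score of zero because it is constant within each group, while gene 1 receives a clearly larger score.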
Similarity enhancement
The SSR may suffer from the high-level technical noise in the data resulting in underestimation. Inspired by the consensus clustering and resource allocation, we further enhanced by integrating multiple pairwise similarities including , , and . These pairwise similarities partially reveal the local information between cells.
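The three classical pairwise similarities named above can be computed between cells as in the following sketch. It is illustrative only: the function names are hypothetical, the expression matrix is assumed to have genes in rows and cells in columns, and the Spearman variant uses ordinal ranks, which ignores ties. Real scRNA-seq data contain many tied zeros, so an average-rank implementation would be needed in practice.

```python
import numpy as np


def pearson_similarity(X):
    """Pearson correlation between cells (columns of X)."""
    return np.corrcoef(X.T)


def spearman_similarity(X):
    """Spearman correlation between cells: Pearson on per-cell gene ranks.

    Ordinal ranks are used, so ties are broken arbitrarily (a simplification).
    """
    ranks = X.argsort(axis=0).argsort(axis=0).astype(float)
    return np.corrcoef(ranks.T)


def cosine_similarity(X):
    """Cosine similarity between cells (columns of X)."""
    norms = np.linalg.norm(X, axis=0, keepdims=True)
    Xn = X / np.maximum(norms, 1e-12)
    return Xn.T @ Xn


rng = np.random.default_rng(1)
X = rng.random((50, 6))        # 50 genes, 6 cells
X[:, 1] = 2.0 * X[:, 0]        # cell 1 is a scaled copy of cell 0

P = pearson_similarity(X)
R = spearman_similarity(X)
K = cosine_similarity(X)
```

Because cell 1 is a scaled copy of cell 0, all three measures report a similarity of 1 for that pair, whereas unrelated cells receive lower values that differ between the measures.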
We imputed the missing values in according to their nearest neighbors’ information. We firstly defined a target similarity matrix as follows:
(5) |
where indicates the KNN of cell . Then we marked the similarity between cells and as a missing value when it is zero in the but in at least one pairwise similarity matrix. Let denotes the initial matrix to be imputed where n indicates the number of cells. For a marked missing value, the similarity was computed by the modified Weighted Adamic/Adar [29], [30]. It was formulated as follows:
(6) |
where indicates the number of neighbors of cell , and denotes the set of common neighbors of cell and . Note that the imputed similarity is zero when . At the end, an enhanced and more comprehensive SSR matrix was computed as + .
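The enhancement idea can be sketched as follows, under loudly stated assumptions. The sketch uses the classical, unweighted Adamic/Adar index (the sum of 1/log(degree) over common neighbors) on a symmetrized KNN graph, not the modified weighted variant of Equation (6), and it marks a pair as missing when its SSR similarity is zero but the pair is a KNN edge in at least one pairwise similarity. All function names, the value of k, and the toy matrices are hypothetical.

```python
import numpy as np


def knn_mask(S, k):
    """M[i, j] is True if cell j is among the k most similar cells to cell i."""
    S = np.array(S, dtype=float)
    np.fill_diagonal(S, -np.inf)                 # a cell is not its own neighbor
    idx = np.argsort(-S, axis=1)[:, :k]
    M = np.zeros(S.shape, dtype=bool)
    M[np.arange(S.shape[0])[:, None], idx] = True
    return M


def adamic_adar(adj):
    """Classical Adamic/Adar index for every pair of nodes of an undirected graph."""
    adj = adj.astype(float)
    deg = adj.sum(axis=1)
    w = np.zeros_like(deg)
    w[deg > 1] = 1.0 / np.log(deg[deg > 1])      # a common neighbor has degree >= 2
    score = (adj * w) @ adj.T                    # sum_z adj[i, z] * w[z] * adj[j, z]
    np.fill_diagonal(score, 0.0)
    return score


def enhance(S_ssr, pairwise, k):
    """Fill zero entries of S_ssr that are supported by a pairwise KNN graph."""
    adj = knn_mask(S_ssr, k)
    adj = adj | adj.T
    support = np.zeros_like(adj)
    for P in pairwise:
        m = knn_mask(P, k)
        support |= m | m.T
    missing = (S_ssr == 0) & support
    np.fill_diagonal(missing, False)
    return S_ssr + np.where(missing, adamic_adar(adj), 0.0)


S_ssr = np.array([[0.0, 0.9, 0.0, 0.0],
                  [0.9, 0.0, 0.8, 0.7],
                  [0.0, 0.8, 0.0, 0.0],
                  [0.0, 0.7, 0.0, 0.0]])
P = np.array([[1.00, 0.20, 0.90, 0.10],
              [0.20, 1.00, 0.30, 0.80],
              [0.90, 0.30, 1.00, 0.15],
              [0.10, 0.80, 0.15, 1.00]])

E = enhance(S_ssr, [P], k=1)
```

In this toy case cells 0 and 2 have zero SSR similarity but are nearest neighbors under the pairwise similarity and share cell 1 as a common neighbor, so their entry is imputed; cells 2 and 3 are not supported by the pairwise graph and stay at zero.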
Spectral clustering
Spectral clustering is a typical clustering technique that divides multiple objects into disjoint clusters depending on the spectrum of the similarity matrix [31]. Compared with traditional clustering algorithms, spectral clustering has advantages in model simplicity and robustness. In this study, we performed spectral clustering on the final enhanced SSR. The inputs of spectral clustering are the cell-to-cell similarity matrix and the cluster number. A detailed introduction to and analysis of spectral clustering can be found in previous studies [31], [32].
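For readers unfamiliar with the procedure, a minimal sketch of spectral clustering on a precomputed similarity matrix is given below. It follows the common normalized variant (symmetric normalized Laplacian, row-normalized eigenvectors, then k-means); whether SSRE uses exactly this variant is an assumption, and the small k-means routine with farthest-point initialization is a simplification chosen to keep the example deterministic. All names are hypothetical.

```python
import numpy as np


def spectral_embedding(S, k):
    """Rows of the k eigenvectors of the normalized Laplacian with smallest eigenvalues."""
    d = S.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    L_sym = np.eye(len(S)) - d_inv_sqrt[:, None] * S * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L_sym)              # eigenvalues in ascending order
    U = vecs[:, :k]
    return U / np.maximum(np.linalg.norm(U, axis=1, keepdims=True), 1e-12)


def kmeans(U, k, n_iter=100):
    """Plain Lloyd iterations with deterministic farthest-point initialization."""
    centers = [U[0]]
    for _ in range(1, k):
        dist = np.min([np.sum((U - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(U[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(n_iter):
        d2 = ((U[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = np.argmin(d2, axis=1)
        new_centers = np.array([
            U[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels


def spectral_clustering(S, k):
    return kmeans(spectral_embedding(S, k), k)


# Toy similarity: two groups (4 and 3 cells) with weak cross-group similarity.
S = np.full((7, 7), 0.01)
S[:4, :4] = 1.0
S[4:, 4:] = 1.0

labels = spectral_clustering(S, k=2)
```

The two inputs match those named in the text: a cell-to-cell similarity matrix and the number of clusters.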
Datasets
The datasets used in this study consist of two parts: real scRNA-seq datasets and simulated scRNA-seq datasets. We collected ten real scRNA-seq datasets that vary in terms of species, tissues, and biological processes from public databases or published studies. The scale of these ten datasets varies from dozens to thousands of cells, and their gene expression levels are measured in different units. The details of these real datasets are described in Table 1. Four of the ten datasets (i.e., Treutlein [33], Deng [34], Ting [35], and Macosko [36]) were downloaded from the data subdirectory of the MPSSC tool (https://github.com/ishspsy/project/tree/master/MPSSC). The Yan [37] and Goolam [38] datasets were collected from the popular single-cell consensus clustering (SC3) software package (https://github.com/hemberg-lab/SC3). The Song [39], Engel [40], and Haber [41] datasets were obtained from the Gene Expression Omnibus [42] database (GEO: GSE85908, GSE74597, and GSE92332, respectively; https://www.ncbi.nlm.nih.gov/geo/), and the Vento [43] dataset was downloaded from ArrayExpress [44] (ArrayExpress: E-MTAB-6678; https://www.ebi.ac.uk/arrayexpress/). In addition, we used Splatter [45] to simulate five scRNA-seq datasets for a more comprehensive analysis. These datasets differ in either size or sparsity. We set group.prob to (0.65, 0.25, 0.1) for all simulated datasets, and changed the scale and sparsity by adjusting nCells and dropout.mid, respectively. The other parameters were set to their defaults. The sample sizes of the five simulated datasets are 1000, 1000, 1000, 500, and 1500, and the corresponding sparsity is 0.61, 0.8, 0.94, 0.94, and 0.94, respectively.
Table 1.
Note: FPKM, fragments per kilobase of exon model per million mapped fragments; RPKM, reads per kilobase of exon model per million mapped reads; CPM, counts of exon model per million mapped reads; RPM, reads of exon model per million mapped reads; TPM, transcripts per kilobase of exon model per million mapped reads; UMI, unique molecular identifier.
scRNA-seq clustering methods
For performance comparison, we compared SSRE with the original SSR, native spectral clustering (SC), and eight state-of-the-art clustering methods (i.e., SIMLR [8], MPSSC [20], Corr [11], SNN-Cliq [10], NMF [13], SC3 [16], dropClust [46], and Seurat [47]). Among these methods, SIMLR, MPSSC, Corr, and SNN-Cliq focus on similarity learning. Both SIMLR and MPSSC learn a representative similarity matrix from multiple Gaussian kernels with different resolutions. Corr introduces a cell-pair differentiability correlation to relieve the effect of dropouts. SNN-Cliq applies the SNN to redefine the pairwise similarity. NMF detects the type of cells by projecting the high-dimensional data into a latent space, in which each dimension denotes a specific type. SC3 is a typical and powerful consensus clustering method. It obtains multiple clusterings by applying different upstream processing steps and derives the final clusters from their consensus. DropClust is a clustering algorithm designed for large-scale single-cell data, and it exploits an approximate nearest neighbor search technique to reduce the time complexity of analyzing large-scale data. Seurat, a popular R package for single-cell data analysis, obtains cell groups based on a KNN graph and Louvain clustering. Moreover, SC [32] with the Pearson correlation is considered as a baseline.
Metric of performance evaluation
We evaluated the proposed approach using two common metrics, i.e., normalized mutual information (NMI) [48] and adjusted Rand index (ARI) [49], which have been widely used to assess clustering performance. Both NMI and ARI evaluate the consistency between the obtained clustering and the pre-annotated labels, with slightly different emphases [50]. Given the real labels and the clustering labels , NMI is calculated as
(7) |
is the mutual information between and , and H denotes entropy. For ARI, given and , it is computed as
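(8) | $\mathrm{ARI} = \dfrac{\sum_{ij}\binom{n_{ij}}{2} - \left[\sum_{i}\binom{a_i}{2}\sum_{j}\binom{b_j}{2}\right]\big/\binom{n}{2}}{\frac{1}{2}\left[\sum_{i}\binom{a_i}{2} + \sum_{j}\binom{b_j}{2}\right] - \left[\sum_{i}\binom{a_i}{2}\sum_{j}\binom{b_j}{2}\right]\big/\binom{n}{2}}$
where $n_{ij}$ is the number of cells in both group $i$ of the real labels and group $j$ of the clustering labels. The $a_i$ and $b_j$ denote the number of cells in group $i$ and group $j$, respectively, and $n$ is the total number of cells.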
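Both metrics can be computed directly from the label vectors, as in the following standard-library sketch. ARI is written in its usual contingency-table form. For NMI, the normalization used in Equation (7) is not reproduced here, so the normalizer is exposed as a parameter (`"max"`, `"sqrt"`, or `"mean"` of the two entropies); this parameter and all function names are assumptions made for illustration.

```python
from collections import Counter
from math import comb, log, sqrt


def adjusted_rand_index(true, pred):
    """ARI from the contingency table of two labelings (needs at least two cells)."""
    n = len(true)
    sum_ij = sum(comb(c, 2) for c in Counter(zip(true, pred)).values())
    sum_a = sum(comb(c, 2) for c in Counter(true).values())
    sum_b = sum(comb(c, 2) for c in Counter(pred).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:          # degenerate case, e.g. a single cluster
        return 1.0
    return (sum_ij - expected) / (max_index - expected)


def entropy(labels):
    n = len(labels)
    return -sum(c / n * log(c / n) for c in Counter(labels).values())


def mutual_information(true, pred):
    n = len(true)
    a, b = Counter(true), Counter(pred)
    joint = Counter(zip(true, pred))
    return sum(c / n * log(c * n / (a[u] * b[v])) for (u, v), c in joint.items())


def normalized_mutual_information(true, pred, norm="max"):
    """NMI with a selectable normalizer; the choice of `norm` is an assumption."""
    h_t, h_p = entropy(true), entropy(pred)
    denom = {"max": max(h_t, h_p),
             "sqrt": sqrt(h_t * h_p),
             "mean": 0.5 * (h_t + h_p)}[norm]
    return mutual_information(true, pred) / denom if denom > 0 else 1.0


truth = [0, 0, 1, 1]
same_partition = [1, 1, 0, 0]      # identical grouping, labels renamed
crossed = [0, 1, 0, 1]             # grouping unrelated to the truth

ari_same = adjusted_rand_index(truth, same_partition)
ari_crossed = adjusted_rand_index(truth, crossed)
nmi_same = normalized_mutual_information(truth, same_partition)
nmi_crossed = normalized_mutual_information(truth, crossed)
```

Both metrics are invariant to renaming of the cluster labels, which is why the relabeled but identical partition scores 1 on each.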
Results and discussion
SSRE can greatly improve the clustering accuracy
To evaluate the performance of SSRE comprehensively, we first applied it to ten pre-annotated real scRNA-seq datasets and compared its performance with the original SSR, SC, and eight state-of-the-art clustering methods (see details in the Method section). Then, we tested all these methods on five simulated datasets for further comparison. For a fair comparison, we set the number of clusters to the number of pre-annotated types for all methods except SNN-Cliq and Seurat, which do not take the number of clusters as input. The other parameters of all methods were set to the defaults described in the original papers. Table 2 and Table 3 summarize the NMI and ARI values of all methods on the ten real scRNA-seq datasets, respectively. The results of Corr on large datasets are unavailable because of its high computational complexity. As shown in Table 2 and Table 3, SSRE outperformed the other methods in most cases. SSRE ranked first or tied for first on seven datasets in terms of both NMI and ARI. Meanwhile, SSRE ranked second on three datasets based on NMI and two datasets based on ARI. This demonstrates that SSRE obtains reliable results largely independent of the scale and the biological conditions of scRNA-seq data. Moreover, SSRE performed better than SSR on nine of the ten datasets in terms of NMI and ARI, which illustrates the effectiveness of the enhancement strategy in SSRE. Results of the simulation experiments are shown in Tables S1 and S2. SSRE achieved better performance overall, which shows its good stability. SSRE is slightly more time-consuming than some methods such as SC and dropClust, but its running time is still in a reasonable range. More detailed descriptions can be found in section 2 of File S1.
Table 2.
Note: SC, native spectral clustering; SNN, shared nearest neighbor; SIMLR, single-cell interpretation via multikernel learning; SC3, single-cell consensus clustering; NMF, nonnegative matrix factorization; SSR, sparse subspace representation; SSRE, single-cell clustering framework based on similarity learning. “–” indicates that the result is unavailable. The bold value is the highest value in each column.
Table 3.
Note: The bold value is the highest value in each column.
Estimating the number of clusters is another key step in most clustering methods and affects clustering accuracy. SSRE applies the eigengap heuristic [32] to the learned similarity matrix to estimate the number of clusters. Eigengap is a typical cluster number estimation method that determines the number of clusters by finding the largest gap between consecutive eigenvalues of a Laplacian matrix. To assess the reliability of the estimation in different methods, we compared the estimated numbers with the pre-annotated numbers. The results are summarized in Table S3. Besides SSRE and SSR, four other methods that also focus on similarity learning were selected for comparison. More experimental details can be found in section 3 of File S1.
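The eigengap heuristic can be sketched in a few lines. The version below uses the symmetric normalized Laplacian and searches only among the smallest `max_k + 1` eigenvalues; both choices, as well as the function name and the toy similarity matrix, are assumptions for illustration rather than details taken from the paper.

```python
import numpy as np


def estimate_n_clusters(S, max_k=10):
    """Number of clusters = position of the largest gap among the smallest
    eigenvalues of the symmetric normalized Laplacian of S."""
    d = S.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    L_sym = np.eye(len(S)) - d_inv_sqrt[:, None] * S * d_inv_sqrt[None, :]
    vals = np.linalg.eigvalsh(L_sym)[: max_k + 1]   # ascending order
    gaps = np.diff(vals)                            # gaps[i] = vals[i+1] - vals[i]
    return int(np.argmax(gaps)) + 1


# Toy similarity with three groups (4, 3, and 5 cells) and weak cross-group links.
sizes = [4, 3, 5]
S = np.full((12, 12), 0.01)
start = 0
for size in sizes:
    S[start:start + size, start:start + size] = 1.0
    start += size

k_hat = estimate_n_clusters(S)
```

With three well-separated groups, the Laplacian has three eigenvalues close to zero followed by a large jump, so the estimate is three.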
Analysis of parameter setting
In SSRE, four parameters are required to be set by users, i.e., penalty coefficients and in solving SSR , gene selection threshold , and the number of nearest neighbors in similarity enhancement procedure. In this study, the selection of the threshold was determined adaptively by solving Equation (4). The number of nearest neighbors was set to ( is the number of cells) for small datasets with less than 5000 cells and set to for other larger datasets. The other two parameters and in augmented Lagrangian (we used and in the coding implementation) were proportionally set as:
(9) |
where is the element of matrix . The is equivalent to the cosine similarity between cells and . This is same as previous work [19]. In our experiments, was set to a constant. For a given dataset, the larger value of leads to the larger value of , which will result in the sparser matrix C. It means that the value of can control the sparsity of matrix C adaptively in different datasets. Moreover, to validate the effect of penalty coefficient in clustering results, we tested SSRE with from 2 to 30 with the increment of 2 on all real datasets. We found that SSRE’s performance was basically stable when is in the interval of 6 and 20. The results are shown in Figure 2 and Figure S1. In our study, we set to 10 and as default for all datasets.
Application of SSRE in visualization
One of the most valuable aims in single-cell analysis is to identify new cell types or subtypes [6]. Visualization is an effective tool to intuitively display subgroups of cells. The t-distributed stochastic neighbor embedding (t-SNE) [51] is one of the most popular visualization methods and has proved powerful for scRNA-seq data. In our study, we performed a modified t-SNE on the similarities learned by different methods for visualization. We focused on two datasets, Goolam and Yan, and selected the native t-SNE, Corr, SIMLR, MPSSC, SSR, and SSRE for comparison. In the Goolam dataset [38], cells were derived from mouse embryos in five differentiation stages: 2-cell, 4-cell, 8-cell, 16-cell, and 32-cell. As shown in Figure 3A, SSRE placed cells of the same type together and clearly separated cells of different types. Although SIMLR also clearly distinguished groups from each other, some cells of the same type were separated. The second dataset, Yan [37], was obtained from human preimplantation embryos. It involves seven primary stages of preimplantation development: metaphase II oocyte, zygote, 2-cell, 4-cell, 8-cell, morula, and late blastocyst. As shown in Figure 3B, Corr, SIMLR, and SSRE had a better overall performance than the other methods. However, four cell types, i.e., oocyte, zygote, 2-cell, and 4-cell, were mixed completely by Corr and partially by SIMLR. Moreover, SIMLR also divided cells of the same type into different groups that were generally far away from each other. SSRE separated oocyte, 2-cell, and the other cell types more accurately than the competing methods.
Application of SSRE in identifying differentially expressed genes
The predicted clusters may enable enhanced downstream scRNA-seq data analysis and provide biological insights. As a demonstration, we aimed to detect significantly differentially expressed genes (DEGs) based on the clustering results. Specifically, we applied the Kruskal-Wallis test [52] to the gene expression profiles with the inferred labels. The Kruskal-Wallis test, a non-parametric method, is often used for testing whether two or more groups are from the same distribution. We used the R function kruskal.test to perform the Kruskal-Wallis test. Then we detected DEGs according to the P value. A significant P value (P < 0.01) of a gene indicates that the gene’s expression in at least one group stochastically dominates one other group. We took the Yan [37] dataset as an example to analyze the DEGs. The details of the Yan dataset have been introduced above. Figure S2 shows the heat map of gene expression of the top 50 most significant DEGs identified. Notably, genes NLRP11, NLRP4, CLEC10A, H1FOO, GDF9, OTX2, ACCSL, TUBB8, and TUBB4Q, which have been reported in previous studies [37], [53], were also identified by SSRE. Genes CLEC10A, H1FOO, and ACCSL were reported as markers of 1-cell stage cells (zygote) of human early embryos, while NLRP11 and TUBB4Q are markers of 4-cell stage cells [54]. Genes GDF9 and OTX2 are markers of germ cells and primitive endoderm cells, respectively [55], [56]. Genes H1FOO and GDF9 were marked as potential stage-specific genes in the oocyte and the blastomere of 4-cell stage embryos [57]. Certain PRAMEF family genes were reported to have transiently enhanced transcription activity in the 8-cell stage. MBD3L family genes were identified as 8-cell stage-specific genes during human embryo development in previous studies [58], [59]. All these genes are among the top 50 significant DEGs detected by SSRE.
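The per-gene test can be sketched in Python with `scipy.stats.kruskal`, which plays the role that the R function kruskal.test plays in the study. This is an illustrative equivalent, not the authors' script: the function name `kruskal_degs`, the layout of the expression matrix (genes in rows), and the toy data are assumptions, and no multiple-testing correction is applied because none is described in the text.

```python
import numpy as np
from scipy.stats import kruskal


def kruskal_degs(X, labels, alpha=0.01):
    """Return (gene index, P value) pairs with P < alpha, most significant first.

    X: (genes, cells) expression matrix; labels: inferred cluster of each cell.
    """
    labels = np.asarray(labels)
    groups = np.unique(labels)
    pvals = np.ones(X.shape[0])
    for g in range(X.shape[0]):
        if np.ptp(X[g]) == 0:
            continue                      # constant gene: the test is undefined
        samples = [X[g, labels == c] for c in groups]
        pvals[g] = kruskal(*samples).pvalue
    order = np.argsort(pvals)
    return [(int(g), float(pvals[g])) for g in order if pvals[g] < alpha]


labels = np.repeat([0, 1, 2], 10)
X = np.vstack([
    np.arange(30),                 # gene 0: expression increases from cluster to cluster
    np.ones(30),                   # gene 1: constant expression
    np.tile(np.arange(10), 3),     # gene 2: identical distribution in every cluster
]).astype(float)

degs = kruskal_degs(X, labels)
```

Only gene 0 is reported in this toy example, since it is the only gene whose distribution differs between the inferred clusters.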
Conclusion
Identifying cell types from single-cell transcriptome data is a meaningful but challenging task because of the high-level noise and high dimension. Accurate identification of cell types enables more reliable characterization of a biological process or phenomenon, whereas inaccurate identification introduces additional biases. Many approaches from different perspectives have been proposed recently, but the accuracy of cell type identification is still far from expectation. In this study, we presented SSRE, a similarity learning-based computational framework for cell type identification. Besides three classical pairwise similarities, SSRE computes the SSR of cells based on the subspace theory. Moreover, a gene selection process and an enhancement strategy were designed based on the characteristics of different similarities to learn more reliable similarities. SSRE greatly improved the clustering performance by appropriately combining multiple similarity measurements and adopting the embedding of sparse structure. The systematic performance evaluations on multiple scRNA-seq datasets showed that SSRE achieves superior performance in most cases compared with the competing methods. Furthermore, the downstream analyses demonstrated that the learned similarity and inferred clusters can potentially be applied to more exploratory analyses, such as identifying gene markers and detecting new cell subtypes. In addition, for more flexible use, users can choose one or two of the three pairwise similarities mentioned in this study to perform the gene selection and similarity enhancement procedures; all three are used by default. Nonetheless, the proposed computational framework can still be improved in future studies. One limitation of SSRE is that it is relatively time-consuming on large-scale datasets; therefore, parallel computing is a possible strategy to accelerate the framework [60]. In addition, more informative genes can be extracted, or other biological information, such as gene functions [61] and gene regulatory relationships [62], [63], can be incorporated to distinguish cell types. Finally, with the emergence of single-cell multi-omics data, designing multi-view clustering models that integrate multi-omics data for cell type identification is a likely trend [64], [65].
Code availability
The MATLAB and Python implementations of SSRE are available at https://github.com/CSUBioGroup/SSRE.
CRediT author statement
Zhenlan Liang: Conceptualization, Methodology, Validation, Writing - original draft. Min Li: Supervision, Methodology, Writing - review & editing. Ruiqing Zheng: Methodology, Writing - original draft. Yu Tian: Data curation, Validation. Xuhua Yan: Software. Jin Chen: Writing - review & editing. Fang-Xiang Wu: Writing - review & editing. Jianxin Wang: Writing - review & editing. All authors read and approved the final manuscript.
Competing interests
The authors have declared that they have no competing interests.
Acknowledgments
This work was supported in part by the National Natural Science Foundation of China (NSFC)-Zhejiang Joint Fund for the Integration of Industrialization and Information (Grant No. U1909208); the 111 Project, China (Grant No. B18059); the Hunan Provincial Science and Technology Program, China (Grant No. 2019CB1007); the Fundamental Research Funds for the Central Universities-Freedom Explore Program of Central South University, China (Grant No. 2019zzts592); and the National Science Foundation, USA (Grant No. 1716340).
Handled by Luonan Chen
Footnotes
Peer review under responsibility of Beijing Institute of Genomics, Chinese Academy of Sciences / China National Center for Bioinformation and Genetics Society of China.
Supplementary data to this article can be found online at https://doi.org/10.1016/j.gpb.2020.09.004.
References
- 1.Saliba A.E., Westermann A.J., Gorski S.A., Vogel J. Single-cell RNA-seq: advances and future challenges. Nucleic Acids Res. 2014;42:8845–8860. doi: 10.1093/nar/gku555. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.Stegle O., Teichmann S.A., Marioni J.C. Computational and analytical challenges in single-cell transcriptomics. Nat Rev Genet. 2015;16:133–145. doi: 10.1038/nrg3833. [DOI] [PubMed] [Google Scholar]
- 3.Buettner F., Natarajan K.N., Casale F.P., Proserpio V., Scialdone A., Theis F.J., et al. Computational analysis of cell-to-cell heterogeneity in single-cell RNA-sequencing data reveals hidden subpopulations of cells. Nat Biotechnol. 2015;33:155–160. doi: 10.1038/nbt.3102. [DOI] [PubMed] [Google Scholar]
- 4.Guo G., Huss M., Tong G.Q., Wang C., Sun L.L., Clarke N.D., et al. Resolution of cell fate decisions revealed by single-cell gene expression analysis from zygote to blastocyst. Dev Cell. 2010;18:675–685. doi: 10.1016/j.devcel.2010.02.012. [DOI] [PubMed] [Google Scholar]
- 5.Papalexi E., Satija R. Single-cell RNA sequencing to explore immune cell heterogeneity. Nat Rev Immunol. 2018;18:35–45. doi: 10.1038/nri.2017.76. [DOI] [PubMed] [Google Scholar]
- 6.Kiselev V.Y., Andrews T.S., Hemberg M. Challenges in unsupervised clustering of single-cell RNA-seq data. Nat Rev Genet. 2019;20:273–282. doi: 10.1038/s41576-018-0088-9. [DOI] [PubMed] [Google Scholar]
- 7.Elowitz M.B., Levine A.J., Siggia E.D., Swain P.S. Stochastic gene expression in a single cell. Science. 2002;297:1183–1186. doi: 10.1126/science.1070919. [DOI] [PubMed] [Google Scholar]
- 8.Wang B., Zhu J., Pierson E., Ramazzotti D., Batzoglou S. Visualization and analysis of single-cell RNA-seq data by kernel-based similarity learning. Nat Methods. 2017;14:414–416. doi: 10.1038/nmeth.4207. [DOI] [PubMed] [Google Scholar]
- 9.Lanckriet G.R., De Bie T., Cristianini N., Jordan M.I., Noble W.S. A statistical framework for genomic data fusion. Bioinformatics. 2004;20:2626–2635. doi: 10.1093/bioinformatics/bth294.
- 10.Xu C., Su Z. Identification of cell types from single-cell transcriptomes using a novel clustering method. Bioinformatics. 2015;31:1974–1980. doi: 10.1093/bioinformatics/btv088.
- 11.Jiang H., Sohn L., Huang H., Chen L. Single cell clustering based on cell-pair differentiability correlation and variance analysis. Bioinformatics. 2018;34:3684–3694. doi: 10.1093/bioinformatics/bty390.
- 12.Pouyan M.B., Kostka D. Random forest based similarity learning for single cell RNA sequencing data. Bioinformatics. 2018;34:i79–i88. doi: 10.1093/bioinformatics/bty260.
- 13.Shao C., Höfer T. Robust classification of single-cell transcriptome data by nonnegative matrix factorization. Bioinformatics. 2017;33:235–242. doi: 10.1093/bioinformatics/btw607.
- 14.Zheng R., Li M., Liang Z., Wu F.X., Pan Y., Wang J. SinNLRR: a robust subspace clustering method for cell type detection by non-negative and low-rank representation. Bioinformatics. 2019;35:3642–3650. doi: 10.1093/bioinformatics/btz139.
- 15.Zheng R., Liang Z., Chen X., Tian Y., Cao C., Li M. An adaptive sparse subspace clustering for cell type identification. Front Genet. 2020;11:407. doi: 10.3389/fgene.2020.00407.
- 16.Kiselev V.Y., Kirschner K., Schaub M.T., Andrews T., Yiu A., Chandra T., et al. SC3: consensus clustering of single-cell RNA-seq data. Nat Methods. 2017;14:483–486. doi: 10.1038/nmeth.4236.
- 17.Yang Y., Huh R., Culpepper H.W., Lin Y., Love M.I., Li Y. SAFE-clustering: Single-cell Aggregated (from Ensemble) clustering for single-cell RNA-seq data. Bioinformatics. 2019;35:1269–1277. doi: 10.1093/bioinformatics/bty793.
- 18.Lin P., Troup M., Ho J.W. CIDR: ultrafast and accurate clustering through imputation for single-cell RNA-seq data. Genome Biol. 2017;18:59. doi: 10.1186/s13059-017-1188-0.
- 19.Elhamifar E., Vidal R. Sparse subspace clustering: algorithm, theory, and applications. IEEE Trans Pattern Anal Mach Intell. 2013;35:2765–2781. doi: 10.1109/TPAMI.2013.57.
- 20.Park S., Zhao H. Spectral clustering based on learning similarity matrix. Bioinformatics. 2018;34:2069–2076. doi: 10.1093/bioinformatics/bty050.
- 21.Elhamifar E., Vidal R. Sparse subspace clustering: algorithm, theory, and applications. IEEE Trans Pattern Anal Mach Intell. 2013;35:2765–2781. doi: 10.1109/TPAMI.2013.57.
- 22.Vidal R., Favaro P. Low rank subspace clustering (LRSC). Pattern Recognit Lett. 2014;43:47–61.
- 23.Boyd S., Parikh N., Chu E., Peleato B., Eckstein J. Distributed optimization and statistical learning via the alternating direction method of multipliers. Now Publishers Inc; 2011.
- 24.Feng Z., Wang Y. Elf: extract landmark features by optimizing topology maintenance, redundancy, and specificity. IEEE-ACM Trans Comput Biol Bioinform. 2020;17:411–421. doi: 10.1109/TCBB.2018.2846225.
- 25.Feng Z., Ren X., Fang Y., Yin Y., Huang C., Zhao Y., et al. scTIM: Seeking Cell-Type-Indicative Marker from single cell RNA-seq data by consensus optimization. Bioinformatics. 2020;36:2474–2485. doi: 10.1093/bioinformatics/btz936.
- 26.Huang M., Wang J., Torre E., Dueck H., Shaffer S., Bonasio R., et al. SAVER: gene expression recovery for single-cell RNA sequencing. Nat Methods. 2018;15:539–542. doi: 10.1038/s41592-018-0033-z.
- 27.Van Dijk D., Sharma R., Nainys J., Yim K., Kathail P., Carr A., et al. Recovering gene interactions from single-cell data using data diffusion. Cell. 2018;174:716–729. doi: 10.1016/j.cell.2018.05.061.
- 28.He X., Cai D., Niyogi P. Laplacian score for feature selection. Adv Neural Inf Process Syst. 2005;18:507–514.
- 29.Murata T., Moriyasu S. Link prediction of social networks based on weighted proximity measures. IEEE WIC ACM Int Conf Web Intell; 2007. pp. 85–88.
- 30.Pech R., Hao D., Cheng H., Zhou T. Enhancing subspace clustering based on dynamic prediction. Front Comput Sci. 2019;13:802–812.
- 31.Bach F.R., Jordan M.I. Learning spectral clustering. Adv Neural Inf Process Syst. 2004;16:305–312.
- 32.von Luxburg U. A tutorial on spectral clustering. Stat Comput. 2007;17:395–416.
- 33.Treutlein B., Brownfield D.G., Wu A.R., Neff N.F., Mantalas G.L., Espinoza F.H., et al. Reconstructing lineage hierarchies of the distal lung epithelium using single-cell RNA-seq. Nature. 2014;509:371–375. doi: 10.1038/nature13173.
- 34.Deng Q., Ramsköld D., Reinius B., Sandberg R. Single-cell RNA-seq reveals dynamic, random monoallelic gene expression in mammalian cells. Science. 2014;343:193–196. doi: 10.1126/science.1245316.
- 35.Ting D.T., Wittner B.S., Ligorio M., Jordan N.V., Shah A.M., Miyamoto D.T., et al. Single-cell RNA sequencing identifies extracellular matrix gene expression by pancreatic circulating tumor cells. Cell Rep. 2014;8:1905–1918. doi: 10.1016/j.celrep.2014.08.029.
- 36.Macosko E.Z., Basu A., Satija R., Nemesh J., Shekhar K., Goldman M., et al. Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell. 2015;161:1202–1214. doi: 10.1016/j.cell.2015.05.002.
- 37.Yan L., Yang M., Guo H., Yang L., Wu J., Li R., et al. Single-cell RNA-Seq profiling of human preimplantation embryos and embryonic stem cells. Nat Struct Mol Biol. 2013;20:1131–1139. doi: 10.1038/nsmb.2660.
- 38.Goolam M., Scialdone A., Graham S.J., Macaulay I.C., Jedrusik A., Hupalowska A., et al. Heterogeneity in Oct4 and Sox2 targets biases cell fate in 4-cell mouse embryos. Cell. 2016;165:61–74. doi: 10.1016/j.cell.2016.01.047.
- 39.Song Y., Botvinnik O.B., Lovci M.T., Kakaradov B., Liu P., Xu J.L., et al. Single-cell alternative splicing analysis with expedition reveals splicing dynamics during neuron differentiation. Mol Cell. 2017;67:148–161. doi: 10.1016/j.molcel.2017.06.003.
- 40.Engel I., Seumois G., Chavez L., Samaniego-Castruita D., White B., Chawla A., et al. Innate-like functions of natural killer T cell subsets result from highly divergent gene programs. Nat Immunol. 2016;17:728–739. doi: 10.1038/ni.3437.
- 41.Haber A.L., Biton M., Rogel N., Herbst R.H., Shekhar K., Smillie C., et al. A single-cell survey of the small intestinal epithelium. Nature. 2017;551:333–339. doi: 10.1038/nature24489.
- 42.Edgar R., Domrachev M., Lash A.E. Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res. 2002;30:207–210. doi: 10.1093/nar/30.1.207.
- 43.Vento-Tormo R., Efremova M., Botting R.A., Turco M.Y., Vento-Tormo M., Meyer K.B., et al. Single-cell reconstruction of the early maternal–fetal interface in humans. Nature. 2018;563:347–353. doi: 10.1038/s41586-018-0698-6.
- 44.Brazma A., Parkinson H., Sarkans U., Shojatalab M., Vilo J., Abeygunawardena N., et al. ArrayExpress—a public repository for microarray gene expression data at the EBI. Nucleic Acids Res. 2003;31:68–71. doi: 10.1093/nar/gkg091.
- 45.Zappia L., Phipson B., Oshlack A. Splatter: simulation of single-cell RNA sequencing data. Genome Biol. 2017;18:1–15. doi: 10.1186/s13059-017-1305-0.
- 46.Sinha D., Kumar A., Kumar H., Bandyopadhyay S., Sengupta D. dropClust: efficient clustering of ultra-large scRNA-seq data. Nucleic Acids Res. 2018;46:e36. doi: 10.1093/nar/gky007.
- 47.Butler A., Hoffman P., Smibert P., Papalexi E., Satija R. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat Biotechnol. 2018;36:411–420. doi: 10.1038/nbt.4096.
- 48.Strehl A., Ghosh J. Cluster ensembles–a knowledge reuse framework for combining multiple partitions. J Mach Learn Res. 2002;3:583–617.
- 49.Wagner S., Wagner D. Comparing clusterings: an overview. Karlsruhe: Universität Karlsruhe, Fakultät für Informatik; 2007. pp. 1–19.
- 50.Romano S., Vinh N.X., Bailey J., Verspoor K. Adjusting for chance clustering comparison measures. J Mach Learn Res. 2016;17:4635–4666.
- 51.van der Maaten L., Hinton G. Visualizing data using t-SNE. J Mach Learn Res. 2008;9:2579–2605.
- 52.Kruskal W.H., Wallis W.A. Use of ranks in one-criterion variance analysis. J Am Stat Assoc. 1952;47:583–621.
- 53.Madissoon E., Töhönen V., Vesterlund L., Katayama S., Unneberg P., Inzunza J., et al. Differences in gene expression between mouse and human for dynamically regulated genes in early embryo. PLoS One. 2014;9:e102949. doi: 10.1371/journal.pone.0102949.
- 54.Xue Z., Huang K., Cai C., Cai L., Jiang C.Y., Feng Y., et al. Genetic programs in human and mouse early embryos revealed by single-cell RNA sequencing. Nature. 2013;500:593–597. doi: 10.1038/nature12364.
- 55.Pennetier S., Uzbekova S., Perreau C., Papillier P., Mermillod P., Dalbiès-Tran R. Spatio-temporal expression of the germ cell marker genes MATER, ZAR1, GDF9, BMP15, and VASA in adult bovine tissues, oocytes, and preimplantation embryos. Biol Reprod. 2004;71:1359–1366. doi: 10.1095/biolreprod.104.030288.
- 56.Petropoulos S., Edsgärd D., Reinius B., Deng Q., Panula S.P., Codeluppi S., et al. Single-cell RNA-seq reveals lineage and X chromosome dynamics in human preimplantation embryos. Cell. 2016;165:1012–1026. doi: 10.1016/j.cell.2016.03.023.
- 57.Tang F., Barbacioru C., Nordman E., Li B., Xu N., Bashkirov V.I., et al. RNA-Seq analysis to capture the transcriptome landscape of a single cell. Nat Protoc. 2010;5:516–535. doi: 10.1038/nprot.2009.236.
- 58.Wang Y., Zhao C., Hou Z., Yang Y., Bi Y., Wang H., et al. Unique molecular events during reprogramming of human somatic cells to induced pluripotent stem cells (iPSCs) at naïve state. Elife. 2018;7:e29518. doi: 10.7554/eLife.29518.
- 59.Töhönen V., Katayama S., Vesterlund L., Sheikhi M., Antonsson L., Filippini-Cattaneo G., et al. Transcription activation of early human development suggests DUX4 as an embryonic regulator. bioRxiv. 2017:123208.
- 60.Kumar S., Singh M. A novel clustering technique for efficient clustering of big data in Hadoop Ecosystem. Big Data Min Anal. 2019;2:240–247.
- 61.Li H.-D., Xu Y., Zhu X., Liu Q., Omenn G.S., Wang J. Clustermine: a knowledge-integrated clustering approach based on expression profiles of gene sets. J Bioinform Comput Biol. 2020;18:2040009. doi: 10.1142/S0219720020400090.
- 62.Zheng R., Li M., Chen X., Zhao S., Wu F., Pan Y., et al. An ensemble method to reconstruct gene regulatory networks based on multivariate adaptive regression splines. IEEE-ACM Trans Comput Biol Bioinform. 2021;18:347–354. doi: 10.1109/TCBB.2019.2900614.
- 63.Aibar S., González-Blas C.B., Moerman T., Imrichova H., Hulselmans G., Rambow F., et al. SCENIC: single-cell regulatory network inference and clustering. Nat Methods. 2017;14:1083–1086. doi: 10.1038/nmeth.4463.
- 64.Duren Z., Chen X., Zamanighomi M., Zeng W., Satpathy A.T., Chang H.Y., et al. Integrative analysis of single-cell genomics data by coupled nonnegative matrix factorizations. Proc Natl Acad Sci U S A. 2018;115:7723–7728. doi: 10.1073/pnas.1805681115.
- 65.Welch J.D., Kozareva V., Ferreira A., Vanderburg C., Martin C., Macosko E.Z. Single-cell multi-omic integration compares and contrasts features of brain cell identity. Cell. 2019;177:1873–1887. doi: 10.1016/j.cell.2019.05.006.
Associated Data