PeerJ. 2023 Jan 23;11:e14706. doi: 10.7717/peerj.14706

A clustering method for small scRNA-seq data based on subspace and weighted distance

Zilan Ning 1,2, Zhijun Dai 1, Hongyan Zhang 2, Yuan Chen 1, Zheming Yuan 1
Editor: Gökhan Karakülah
PMCID: PMC9879162  PMID: 36710872

Abstract

Background

Identifying cell types using unsupervised methods is essential for scRNA-seq research. However, conventional similarity measures struggle with single-cell data clustering because the data are high-dimensional, noisy, and have a high dropout rate.

Methods

We proposed a clustering method for small scRNA-seq data based on Subspace and Weighted Distance (SSWD), which follows the assumption that gene subspaces composed of genes with similar density distributions can better distinguish cell groups. To accurately capture the intrinsic relationship among cells or genes, a new distance metric that combines Euclidean and Pearson distance through a weighting strategy was proposed. The relative Calinski-Harabasz (CH) index was used instead of the CH index to estimate the number of clusters because it is comparable across degrees of freedom.

Results

We compared SSWD with seven prevailing methods on eight publicly available scRNA-seq datasets. The experimental results show that SSWD achieves better clustering accuracy and a better ability to partition cell groups. SSWD can be downloaded at https://github.com/ningzilan/SSWD.

Keywords: scRNA-seq, Consensus clustering, Subspace, EP_dis, Marker gene

Introduction

Single-cell RNA-sequencing (scRNA-seq) technologies capture cellular heterogeneity among single cells, which allows researchers to dissect complex biological samples with detailed information about the transcriptome, thereby changing our understanding of biological systems (Tang et al., 2009; Jaitin et al., 2014; Praktiknjo et al., 2020). Identifying cell types is essential in analyzing scRNA-seq data, and its quality directly affects downstream single-cell analysis (Kharchenko, Silberstein & Scadden, 2014). Unsupervised clustering is one of the most widely used methods for identifying cell groups in scRNA-seq data (Ji & Ji, 2016; Žurauskiene & Yau, 2016; Kiselev, Andrew & Hemberg, 2019; Peyvandipour et al., 2020; Qi et al., 2020). However, the high dimensionality, noise, and dropout of scRNA-seq data pose a challenge for traditional clustering methods (Elowitz et al., 2002; Stegle, Teichman & Marioni, 2015). Therefore, it is important to develop efficient and reliable clustering algorithms to identify cell groups.

Recently, many novel clustering methods have been developed for identifying cell groups in scRNA-seq data. Most of them focus on computing more accurate and robust similarity measures between cells (Taiyun et al., 2018; Peng et al., 2020). Single-cell Interpretation via Multi-kernel LeaRning (SIMLR) (Wang et al., 2017) chooses the most appropriate distance measure through multiple kernel learning and uses k-means to determine the cell groups. Seurat (Satija et al., 2015; Butler et al., 2018) and SNN-Cliq (Xu & Su, 2015) are graph-based clustering methods. Seurat constructs a k-nearest neighbor (KNN) graph with Euclidean distance in PCA space (Jolliffe, 2002). SNN-Cliq combines a previously developed clustering algorithm with a shared nearest neighbor (SNN)-based similarity measure, which determines cell groups automatically but requires three parameters to be specified. SC3 (Kiselev et al., 2017) employs consensus clustering to merge the clustering results obtained under Euclidean distance, Pearson’s correlation, and Spearman’s correlation to improve performance. However, SC3 is not scalable (Kiselev, Andrew & Hemberg, 2019). In addition, nonnegative matrix factorization, imputation, dimensionality reduction-based methods, and mixture model ensembles have been used to assess cellular heterogeneity (Grün et al., 2015; Lin, Troup & Ho, 2017; Shao & Höfer, 2017; Yang et al., 2019; Huh et al., 2020; Venkatasubramanian et al., 2020).

Subspace clustering is an efficient technique for mitigating noise and has been applied in various fields (Chen, Nasrabad & Tran, 2011; Ekström & Hagen, 2019). SinNLRR (Zheng et al., 2019) treats cell clustering as a sparse subspace clustering (SSC) problem and uses the alternating direction method of multipliers to solve the optimization problem. S3C2 (Zhuang et al., 2021) combines enhanced SSC and low-rank completion algorithms in an optimization framework. DSCD (Wang et al., 2020) discovers the low-dimensional latent structure from the compressed representation of scRNA-seq data and learns global relationships among single cells via a novel self-expressive denoising layer.

As the previous methods highlight, calculating the similarity (distance) matrix of cells and reducing noise interference are crucial in clustering. This paper proposed a clustering method for small scRNA-seq data based on Subspace and Weighted Distance (SSWD), which assumed that gene subspaces composed of genes with similar kernel density distributions could distinguish cell groups better. We proposed a new distance metric, EP_dis, which integrates Euclidean and Pearson distance through a weighting strategy. Furthermore, we used the relative Calinski-Harabasz (RCH) index instead of CH to determine the number of clusters because it is comparable across degrees of freedom. SSWD also included a consensus clustering process: the clustering results of the gene subspaces were summarized in a consensus matrix, which was then clustered by PAM. We applied SSWD to eight public scRNA-seq datasets and compared it with seven widely used scRNA-seq clustering methods. The results show that SSWD reduces the influence of noise in clustering and better captures intrinsic relationships among cells or genes, yielding higher clustering accuracy and a better ability to partition cell groups.

Materials & Methods

Datasets

Simulated datasets

This paper used six simulated datasets to demonstrate the effect of EP_dis and RCH in the improved k-means algorithm. D1 and D2 were synthesized using different mathematical models (Zhang, Yue & Zhang, 2014) (Table 1). D1 contains five clusters with 420 (60, 80, 90, 90, 100) samples and 30 features. D2 contains four clusters with 300 (60, 70, 80, 90) samples and 10 features. Furthermore, four Gaussian datasets (D3-D6) (Fig. S1) (Liu et al., 2010; Hussain & Haris, 2019) were used to illustrate the behavior of RCH with respect to monotonicity, noise, density, and subclusters.

Table 1. Mathematical models of D1 and D2.

D1 contains five clusters with 420 (60, 80, 90, 90, 100) samples and 30 features. D2 contains four clusters with 300 (60, 70, 80, 90) samples and 10 features. i, j, and k represent the cluster id, the feature id, and the sample id, respectively. ξ: the random error.

Cluster | D1 | D2
cluster 1 | 0.1+ sinj3 + ξ_{i,j,k}, ξ ~ N(0,1) | expj1000 + ξ_{i,j,k}, ξ ~ N(0,1)
cluster 2 | 1.2sin2j52 + ξ_{i,j,k}, ξ ~ N(0,1) | j6.6 + ξ_{i,j,k}, ξ ~ N(0,2)
cluster 3 | 1.5sinj33.5 + ξ_{i,j,k}, ξ ~ N(0,1) | 5j42 maxj42 + ξ_{i,j,k}, ξ ~ N(0,2)
cluster 4 | 0.5sin2j52.2 + ξ_{i,j,k}, ξ ~ N(0,1) | sin(j) + ξ_{i,j,k}, ξ ~ N(0,1)
cluster 5 | 0.6sinj33.8 + ξ_{i,j,k}, ξ ~ N(0,1) |

UCI datasets

Six real datasets (Table 2) from UCI (University of California Irvine) (https://archive.ics.uci.edu/) were used to validate the performance of RCH.

Table 2. Description of the six UCI datasets.

UCI (University of California Irvine) machine learning repository: https://archive.ics.uci.edu/.

Datasets No. of samples No. of features No. of categories
Dermatology 366 33 6
Seed 569 7 3
Sensor 5,456 24 4
Statlog 2,000 36 6
Waveform 5,000 21 3
Yeast 1,484 8 10

scRNA-seq datasets

We downloaded eight scRNA-seq datasets from GEO (https://www.ncbi.nlm.nih.gov/geo/) to validate the effectiveness of SSWD; for all of them, the cell types were declared in the original publications. These datasets cover human and mouse species, involve various tissues and biological processes such as cell development and differentiation, and use different expression units, e.g., RPKM and FPKM. Specifically, the datasets of Biase, Cao & Sheng (2014), Yan et al. (2013), and Qiaolin et al. (2014) consist of transcriptomes of human or mouse embryonic cells at crucial developmental stages. Treutlein et al. (2014) contains 201 cells in four developmental stages of mouse lung epithelial cells. Patel et al. (2014) contains 430 glioblastoma cells from five patients. Li et al. (2016) is a human islet cell dataset, which contains alpha (n = 18), beta (n = 12), pp (n = 9), acinar (n = 11), and ductal (n = 8) cell subtypes. Tian307 and Tian305 (Tian et al., 2019) include lung adenocarcinoma cells from five patients. A detailed description of the datasets is given in Table 3.

Table 3. The details of eight scRNA-seq datasets.
Datasets Groups Variables Cells Units Species Protocol Reference
Biase 3 25737 49 FPKM Mus musculus Smart-Seq Biase, Cao & Sheng (2014)
Li 5 180253 58 RPKM Homo sapiens Smart-Seq2 Li et al. (2016)
Patel 5 5948 430 TPM Homo sapiens Smart-Seq Patel et al. (2014)
Deng 7 12735 135 RPKM Mus musculus Smart-Seq2 Qiaolin et al. (2014)
Treutlein 4 11245 201 FPKM Mus musculus SMARTer Treutlein et al. (2014)
Yan 7 12325 90 FPKM Homo sapiens Smart-Seq2 Yan et al. (2013)
Tian307 5 13800 307 UMI Homo sapiens CEL-Seq2 Tian et al. (2019)
Tian305 5 13137 305 UMI Homo sapiens CEL-Seq2 Tian et al. (2019)

Notes.

FPKM
fragments per kilobase of transcript per million mapped reads
RPKM
reads per kilobase of transcript per million mapped reads
TPM
transcripts per million mapped reads
UMI
unique molecular identifiers

The improved k-means algorithm with EP_dis and relative CH (RCH)

K-means is a widely used clustering algorithm (MacQueen, 1967; Jain, Murt & Flynn, 1999; Jain, 2008). The algorithm requires the user to provide the cluster initialization, the distance metric, and the number of clusters as parameters (Chiang & Mirkin, 2010). Here we designed an improved k-means algorithm by introducing EP_dis, which measures the similarity between two cells more appropriately, and RCH, which automatically determines the number of clusters.

EP_dis metric

Euclidean distance (E) is the most commonly used distance metric in traditional k-means; it characterizes the global correlation between samples in high-dimensional space. However, it loses the correlation information between samples (cells or genes) when they follow the same trend (Taiyun et al., 2018). Pearson distance (P) is another commonly used distance metric in clustering, which can capture the locally variable trend between samples (cells or genes), where P = (1 − R), and R is the Pearson correlation coefficient (Fulekar, 2009). Here, we combined Euclidean and Pearson distances through a weighting strategy and defined a new distance, the EP_dis metric (Ning et al., 2022). It was defined as follows:

$$\mathrm{EP\_dis} = wE + (1 - w)P \tag{1}$$

A larger EP_dis indicates a weaker similarity between samples. The weight w ranges from 0 to 1: if w = 0, EP_dis is the Pearson distance; if w = 1, it is the Euclidean distance. The matrices E and P must be min-max normalized before calculating EP_dis because their ranges differ. A suitable w is determined by a step-by-step search that takes the maximum SSB/SSW as the criterion. Here $SSB = \sum_{i=1}^{k} n_i \lVert c_i - \bar{c} \rVert^2$ is the sum of squares between clusters and $SSW = \sum_{i=1}^{k} \sum_{j=1}^{n_i} \lVert x_j - c_i \rVert^2$ is the sum of squares within clusters, where the inner sum runs over the samples $x_j$ in cluster $V_i$; k is the number of clusters; $n_i$ is the number of samples in cluster $V_i$; $c_i$ is the centroid of cluster $V_i$; $\bar{c} = \frac{1}{N}\sum_{i=1}^{N} x_i$ is the overall mean; and N is the number of samples. We adopt the maximum technique (Fränti & Sieranoja, 2019) to obtain the initial cluster centroids and ensure clustering stability.
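As an illustration of the weighting strategy, the following minimal Python sketch computes an EP_dis matrix for a small set of samples. It is not the authors' implementation. The function names are hypothetical, the min-max scaling is applied over the whole matrix (the text does not say whether scaling is global or per row), and a constant sample is assigned a Pearson distance of 1; all three are assumptions made here.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pearson_distance(a, b):
    """Pearson distance P = 1 - R, with R the Pearson correlation coefficient."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 1.0  # assumption: treat an undefined correlation as R = 0
    return 1.0 - cov / math.sqrt(var_a * var_b)

def min_max(matrix):
    """Scale all entries of a matrix to [0, 1] (assumption: global scaling)."""
    flat = [v for row in matrix for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [[0.0 for _ in row] for row in matrix]
    return [[(v - lo) / (hi - lo) for v in row] for row in matrix]

def ep_dis(samples, w):
    """EP_dis = w * E + (1 - w) * P on min-max normalized E and P."""
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    n = len(samples)
    E = [[euclidean(samples[i], samples[j]) for j in range(n)] for i in range(n)]
    P = [[pearson_distance(samples[i], samples[j]) for j in range(n)] for i in range(n)]
    E, P = min_max(E), min_max(P)
    return [[w * E[i][j] + (1 - w) * P[i][j] for j in range(n)] for i in range(n)]

# Samples 0 and 1 share the same trend, sample 2 has the opposite trend.
samples = [[1, 2, 3], [2, 4, 6], [3, 2, 1]]
D = ep_dis(samples, 0.5)
```

With w = 0 the first two samples are at distance 0 because they are perfectly correlated, whereas with w = 1 they are separated by their difference in magnitude; intermediate weights blend the two views.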

Determine the number of clusters

The k-means algorithm requires the number of clusters to be specified. Clustering internal validation (CIV) indices, such as the Calinski-Harabasz (CH) index (Caliński & Harabasz, 1974), the Silhouette (Sil) index (Rousseeuw, 1987), and the Gap Statistic (Tibshirani & Hastie, 2001), can be used to estimate the number of clusters. The CH index has been reported to perform best in estimating the number of clusters (Milligan & Cooper, 1985; Chiang & Mirkin, 2010). It is defined as:

$$CH = \frac{SSB/(k-1)}{SSW/(N-k)}, \quad k = 2, 3, \ldots, NC \tag{2}$$

where N is the number of samples and NC is the largest number of clusters considered. The k with the maximum CH is taken as the suitable number of clusters. However, CH values obtained for different k are not directly comparable because the degrees of freedom vary. We therefore designed a new index, relative CH (RCH) (Ning et al., 2022), that is relatively comparable across different k:

$$RCH_k = \frac{CH_k}{F_{\alpha,\,k-1,\,N-k}} \tag{3}$$

The workflow of the improved k-means algorithm with EP_dis and RCH is shown in Fig. 1.
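To make the two indices concrete, here is a small Python sketch that computes SSB, SSW, and CH for a labeled set of points and divides CH by an F critical value to obtain RCH. The function names are hypothetical, and the critical value is passed in by the caller (for example, read from an F table) rather than computed; this is a sketch under those assumptions, not the published code.

```python
def centroid(points):
    """Component-wise mean of a list of points."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]

def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def ssb_ssw(points, labels):
    """Return (SSB, SSW, k) for a labeled partition of the points."""
    overall = centroid(points)
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    ssb = ssw = 0.0
    for members in clusters.values():
        c = centroid(members)
        ssb += len(members) * sq_dist(c, overall)   # between-cluster term
        ssw += sum(sq_dist(x, c) for x in members)  # within-cluster term
    return ssb, ssw, len(clusters)

def ch_index(points, labels):
    """CH = (SSB / (k - 1)) / (SSW / (N - k)); needs k >= 2 and SSW > 0."""
    ssb, ssw, k = ssb_ssw(points, labels)
    n = len(points)
    return (ssb / (k - 1)) / (ssw / (n - k))

def rch_index(points, labels, f_critical):
    """RCH = CH divided by the F threshold for (k - 1, N - k) degrees of freedom.

    f_critical is supplied by the caller (assumption of this sketch).
    """
    return ch_index(points, labels) / f_critical

points = [[0, 0], [0, 2], [10, 0], [10, 2]]
labels = [0, 0, 1, 1]
ch = ch_index(points, labels)            # SSB = 100, SSW = 4, so CH = 50
rch = rch_index(points, labels, 18.51)   # 18.51: tabulated F(0.05; 1, 2)
```

In a search over k = 2, ..., NC, one would compute RCH for each candidate partition with the critical value matching that k and keep the k with the largest RCH.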

Figure 1. The workflow of improved k-means algorithm with EP_dis and relative CH.

Figure 1

NC is the largest number of clusters; N is the number of samples; w_step is the search step; F (0.05, k − 1, N − k) is the corresponding F-test threshold at the significance level of 0.05.

The overview of the SSWD

In the scRNA-seq data matrix $X_{G \times N} = \{x_{ij} \mid 1 \le i \le G,\ 1 \le j \le N\}$, rows represent genes and columns represent cells; $x_{ij}$ is the value of gene i in the jth cell. The framework of SSWD is depicted in Fig. 2.

Figure 2. The SSWD framework for clustering scRNA-seq data.

Figure 2

(A) clustering by the improved k-means with EP_dis and RCH; (B) retaining the d-dimension with the elbow method; (C) visualization of Tian307 gene expression profile under clustering results; *: each element represents a cell, and different colors represent different clusters under clustering.

Step 1 filtering genes

Since rare and ubiquitous genes provide insufficient information for clustering, we retained only the v genes (default: 1,000) with the highest variance after log transformation. Specifically, when the maximum value in X is greater than 10,000, X′ = log10(X + 1); otherwise, X′ = log2(X + 1). The resulting gene subset X′ was the input of the second step.
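The filtering rule can be sketched in a few lines of Python. The function names are hypothetical, the matrix is a plain list of gene rows, and the sample variance (n − 1 denominator) is used; the text does not say which variance estimator the authors used, so that choice is an assumption.

```python
import math

def log_transform(X):
    """Apply log10(x + 1) if the maximum exceeds 10,000, else log2(x + 1)."""
    max_value = max(v for row in X for v in row)
    log = math.log10 if max_value > 10000 else math.log2
    return [[log(v + 1) for v in row] for row in X]

def variance(values):
    """Sample variance (assumption: n - 1 denominator); needs >= 2 values."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)

def top_variance_genes(X, v=1000):
    """Return (kept row indices, kept log-transformed rows) for the v
    highest-variance genes; rows of X are genes, columns are cells."""
    Xp = log_transform(X)
    order = sorted(range(len(Xp)), key=lambda i: variance(Xp[i]), reverse=True)
    keep = sorted(order[:v])  # restore the original gene order
    return keep, [Xp[i] for i in keep]

# Three genes by four cells; the first gene is never expressed.
X = [[0, 0, 0, 0], [1, 3, 7, 15], [1, 1, 3, 3]]
kept, X_filtered = top_variance_genes(X, v=2)
```

Here the unexpressed gene has zero variance and is dropped, which is the behavior the step relies on to discard uninformative genes.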

Step 2 partition genes with subspace

In scRNA-seq data, each subspace represents a group of genes. A gene subspace made up of genes with similar densities can distinguish informative features from noise (Song et al., 2021). We used the function density in R to calculate each gene’s density. Specifically, the kernel density function spreads the mass of the gene’s empirical distribution over a regular grid of 512 points and convolves this approximation with a discretized version of the kernel using a fast Fourier transform. The function then uses linear approximation to evaluate the density at each point (Sheather & Jones, 1991). In the density matrix $E_{G' \times 512}$, rows represent genes and columns represent density values. The improved k-means algorithm with EP_dis and RCH was employed to group genes with similar densities in the matrix E. Then, $X_{G \times N}$ was separated into several gene subspaces, each containing all cells and a subset of the genes: $X_{G \times N} = \bigcup_{i=1,2,\ldots,c} \mathrm{subspace}_i$, $\mathrm{subspace}_i = X_{N \times G_i}$.
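For readers without R at hand, the sketch below shows the idea of turning one gene's expression values into a 512-point density vector. It is a deliberately simplified stand-in for R's density function: it evaluates a Gaussian kernel directly on the grid instead of using the FFT-based binning described above, and the bandwidth rule and the grid range (three bandwidths beyond the data) are assumptions of this sketch.

```python
import math
import statistics

def gene_density(values, grid_size=512):
    """Gaussian kernel density of one gene evaluated on a regular grid.

    Assumptions: bandwidth = 0.9 * sd * n^(-1/5), and a grid that extends
    three bandwidths beyond the data range.
    """
    n = len(values)
    bw = 0.9 * statistics.stdev(values) * n ** (-0.2)
    if bw == 0:
        raise ValueError("a constant gene has no usable density estimate")
    lo, hi = min(values) - 3 * bw, max(values) + 3 * bw
    step = (hi - lo) / (grid_size - 1)
    grid = [lo + i * step for i in range(grid_size)]
    norm = 1.0 / (n * bw * math.sqrt(2 * math.pi))
    dens = [
        norm * sum(math.exp(-0.5 * ((g - v) / bw) ** 2) for v in values)
        for g in grid
    ]
    return grid, dens

# One gene measured in five cells.
grid, dens = gene_density([0.0, 1.0, 2.0, 3.0, 4.0])
```

Stacking the `dens` vectors of all retained genes row by row gives a genes-by-512 matrix analogous to the density matrix described in this step, which can then be clustered to form gene subspaces.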

Step 3 cell clustering in subspace

Only gene subspaces containing more than three genes were kept. For each of them, we used PCA for dimensionality reduction and retained the first d dimensions according to the Elbow method (Thorndike, 1953). Then the improved k-means algorithm with EP_dis and RCH was employed to obtain the clustering result $Y_{\mathrm{subspace}_i}$ of each gene subspace.

Step 4 consensus clustering

The cluster-based similarity partitioning algorithm (CSPA) was used to compute the consensus matrix M (Strehl & Ghosh, 2002), $M_{N \times N} = \{M_{ij} \mid M_{ij} = num\}$, $i, j = 1, 2, \ldots, N$, based on the clustering results from the gene subspaces. Here num is the number of subspaces in which cells i and j are assigned to the same cluster. If num = 0, cells i and j are never assigned to the same cluster in any subspace. Because M is a discrete matrix, the improved k-means with EP_dis and RCH was unsuitable, whereas the PAM algorithm was applicable. PAM is a variation of the k-means clustering algorithm that uses medoids (actual data points) as cluster centers rather than means and minimizes a sum of pairwise dissimilarities instead of a sum of squared Euclidean distances as the objective function (Park & Jun, 2009). PAM is more robust to noise and outliers than k-means. Finally, we used the Sil index (Rousseeuw, 1987) to estimate the number of clusters (cell groups).
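The construction of the consensus matrix can be illustrated with a short Python sketch. The function name is hypothetical and the input is assumed to be one label list per gene subspace, all covering the same cells in the same order.

```python
def consensus_matrix(partitions):
    """Count, for every pair of cells, the number of subspace clusterings
    in which the two cells share a cluster.

    partitions: list of label lists, one per gene subspace (assumed to be
    aligned so that position i always refers to the same cell).
    """
    n = len(partitions[0])
    M = [[0] * n for _ in range(n)]
    for labels in partitions:
        if len(labels) != n:
            raise ValueError("every partition must label all cells")
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    M[i][j] += 1
    return M

# Three subspace clusterings of four cells (label values are arbitrary).
partitions = [[0, 0, 1, 1], [0, 0, 0, 1], [1, 1, 0, 0]]
M = consensus_matrix(partitions)
```

In this toy case cells 0 and 1 co-cluster in all three subspaces, so M[0][1] is 3, while cells 0 and 3 never co-cluster, so M[0][3] is 0. The resulting matrix would then be passed to a medoid-based method such as PAM.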

Time complexity of SSWD

The most time-consuming step of SSWD is clustering with the improved k-means using EP_dis and RCH. In step 2 (see The overview of the SSWD), the improved k-means algorithm is applied to the density matrix. Let n denote the number of samples, m the number of features, k the number of clusters, NC the largest number of clusters considered, l the number of iterations needed to determine the cluster centers, and w_step the search step. Since k << n and NC << n, the time complexity of step 2 is about O(lmn). In step 3, the improved k-means algorithm is applied to each subspace after PCA. Let d denote the dimension retained after PCA and s the number of gene subspaces. The time complexity of SSWD is then roughly O(lmn + lnds). Since d << m and s << m, it simplifies to approximately O(lmn).

Biological insights

We transformed the clustering results into “one-against-the-rest” binary labels, one for each cell group. Then, for each gene, we performed the Wilcoxon rank-sum test between its expression values and the binary cluster label and adjusted the p-values based on FDR. Genes with an adjusted p-value < 0.001 were retained as differential genes. Next, we used the AUC score to evaluate how well each gene distinguishes different cell types. Since AUC is only suitable for dichotomous problems, we constructed a binary classifier based on the mean expression value of each gene and compared the processed values with the binary cluster label. We defined genes with AUC > 0.85 and p-value < 0.001 as marker genes.
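The marker-gene criterion can be sketched as follows. This Python example computes a one-against-the-rest AUC directly from the expression values through pairwise comparisons (the Mann-Whitney formulation), which is a swapped-in simplification rather than the mean-expression classifier described above. The function names are hypothetical, and the adjusted p-value is assumed to come from a separate test.

```python
def auc_score(expression, in_cluster):
    """AUC of one gene for separating a cluster from all other cells.

    Equals the fraction of (cluster cell, other cell) pairs in which the
    cluster cell has the higher expression; ties count as one half.
    """
    pos = [x for x, flag in zip(expression, in_cluster) if flag]
    neg = [x for x, flag in zip(expression, in_cluster) if not flag]
    if not pos or not neg:
        raise ValueError("both groups must contain at least one cell")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def is_marker(auc, adjusted_p, auc_cut=0.85, p_cut=0.001):
    """Apply the thresholds stated in the text (AUC > 0.85, p < 0.001)."""
    return auc > auc_cut and adjusted_p < p_cut

expression = [5, 6, 7, 1, 2, 6]
in_cluster = [True, True, True, False, False, False]
auc = auc_score(expression, in_cluster)  # 7.5 of 9 pairs favor the cluster
```

A gene such as CTSH in Table 6 (AUROC 0.997, adjusted p-value 2.56E−06) passes both thresholds under `is_marker`.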

Evaluation metrics

Two external validation indices, ARI (Adjusted Rand Index) and NMI (Normalized Mutual Information), were used to evaluate the effectiveness of the clustering methods.

ARI (Hubert & Arabie, 1985) is a widely used external validation index in clustering, and it is defined as follows:

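$$ARI(R, C) = \frac{\sum_{ij}\binom{n_{ij}}{2} - \left[\sum_{i}\binom{a_i}{2}\sum_{j}\binom{b_j}{2}\right] \Big/ \binom{n}{2}}{\frac{1}{2}\left[\sum_{i}\binom{a_i}{2} + \sum_{j}\binom{b_j}{2}\right] - \left[\sum_{i}\binom{a_i}{2}\sum_{j}\binom{b_j}{2}\right] \Big/ \binom{n}{2}} \tag{4}$$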

where R and C are the published and predicted clusterings, respectively. The overlap of samples between R and C can be summarized in a contingency table: $n_{ij}$ is the number of samples that fall in both the ith cluster of R and the jth cluster of C, $a_i$ is the sum of the ith row of the contingency table, $b_j$ is the sum of the jth column, n is the total number of samples, and $\binom{\cdot}{2}$ denotes the binomial coefficient.
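A direct translation of the ARI definition into Python is given below as a minimal sketch. The function name is hypothetical, and the degenerate case in which both labelings consist of a single cluster is mapped to 1.0 by convention, which is an assumption of this sketch.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(reference, predicted):
    """ARI computed from the contingency table of two labelings (n >= 2)."""
    n = len(reference)
    n_ij = Counter(zip(reference, predicted))  # contingency table cells
    a = Counter(reference)                     # row sums
    b = Counter(predicted)                     # column sums
    sum_ij = sum(comb(v, 2) for v in n_ij.values())
    sum_a = sum(comb(v, 2) for v in a.values())
    sum_b = sum(comb(v, 2) for v in b.values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:
        return 1.0  # assumption: identical trivial partitions score 1
    return (sum_ij - expected) / (max_index - expected)

# Label names do not matter, only the grouping does.
same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
partial = adjusted_rand_index([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 2, 2])
```

The first call compares two identical groupings with swapped label names and yields 1; the second splits the cells differently and yields 0.8/3.3, about 0.242.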

NMI (Strehl & Ghosh, 2002) is defined as follows:

$$NMI(R, C) = \frac{2\,I(R, C)}{H(R) + H(C)} \tag{5}$$

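where $I(R, C) = \sum_{i=1}^{|R|}\sum_{j=1}^{|C|} p_{ij} \log\frac{p_{ij}}{p_i p_j}$ is the mutual information between R and C, $H(R) = -\sum_{i=1}^{|R|} p_i \log p_i$ is the entropy of R, $H(C) = -\sum_{j=1}^{|C|} p_j \log p_j$ is the entropy of C, and $p_{ij} = n_{ij}/n$ is the probability that a cell belongs to both the ith cluster in R and the jth cluster in C. NMI ranges over [0, 1]; ARI has an upper bound of 1 and is close to 0, or occasionally slightly negative, for random partitions. A larger ARI (NMI) represents better clustering performance.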
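The NMI definition can be sketched in Python in the same spirit. The function names are hypothetical, natural logarithms are used (the base cancels in the ratio), and the case in which both labelings are a single cluster is mapped to 1.0 as an assumption.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a labeling, using natural logarithms."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def mutual_information(reference, predicted):
    """Mutual information between two labelings of the same cells."""
    n = len(reference)
    joint = Counter(zip(reference, predicted))
    p_ref, p_pred = Counter(reference), Counter(predicted)
    mi = 0.0
    for (r, c), n_rc in joint.items():
        p_rc = n_rc / n
        mi += p_rc * math.log(p_rc / ((p_ref[r] / n) * (p_pred[c] / n)))
    return mi

def nmi(reference, predicted):
    """NMI = 2 * I(R, C) / (H(R) + H(C))."""
    denom = entropy(reference) + entropy(predicted)
    if denom == 0:
        return 1.0  # assumption: two single-cluster labelings agree fully
    return 2 * mutual_information(reference, predicted) / denom

identical = nmi([0, 0, 1, 1], [1, 1, 0, 0])    # same grouping
unrelated = nmi([0, 0, 1, 1], [0, 1, 0, 1])    # statistically independent
```

The first pair of labelings describes the same grouping and scores 1; the second pair is independent, so the mutual information and hence NMI are 0.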

Reference methods

In this article, seven prevailing clustering algorithms were used as reference methods. SC3 v.1.22.0 (Kiselev et al., 2017), CIDR v.0.1.5 (Lin, Troup & Ho, 2017), Seurat v.4.1.1 (Satija et al., 2015), and SIMLR v.1.20.0 (Wang et al., 2017) were run with their original R packages in RStudio 4.0. SinNLRR (Zheng et al., 2019) (https://github.com/zrq0123/SinNLRR) and S3C2 (Zhuang et al., 2021) (https://github.com/Cuily-v/S3C2) were run in Matlab 2017a. SNN-Cliq (Xu & Su, 2015) (https://github.com/BIOINSu/SNN-Cliq) was run in Matlab 2017a and Python 3.8. SIMLR and SNN-Cliq used the same log transformation as this paper. SinNLRR and S3C2 were given the correct number of cell groups for clustering. Unless specified otherwise, the default parameters suggested in the original papers were used.

Results

Performance evaluation and comparison with reference methods

We compared the performance of SSWD with seven prevailing clustering methods on eight scRNA-seq datasets (Table 4). SSWD achieved the best clustering performance with an average ARI of 0.791, which was 0.143 higher than that of the second-ranked SC3, whereas SNN-Cliq performed poorly (average ARI of 0.364). SSWD ranked in the top three for ARI on all datasets except Yan. SSWD attained the best NMI on three datasets (Li, Tian305, Tian307) and the second-best on four datasets (Biase, Yan, Deng, Treutlein). The average NMI of SSWD was the highest (0.850). Seurat had the lowest average NMI (0.579) because it failed on Biase and its NMI on Li was only 0.122.

Table 4. The performance of SSWD.

—: the method failed in clustering. (): the actual number of cell groups was provided as a prior parameter. The best accuracy and the correct number of clusters (cell groups) are marked in bold for each dataset.

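Datasets | Actual cell groups | Measure | SSWD | SC3 | CIDR | Seurat | SIMLR | SNN-Cliq | SinNLRR | S3C2
Biase | 3 | k | 3 | 3 | 5 | — | 7 | 7 | (3) | (3)
Biase | 3 | ARI | 0.948 | 0.948 | 0.795 | — | 0.521 | 0.445 | 1.00 | 0.948
Biase | 3 | NMI | 0.929 | 0.929 | 0.860 | — | 0.610 | 0.672 | 1.00 | 0.929
Li | 5 | k | 5 | 3 | 9 | 2 | 9 | 7 | (5) | (5)
Li | 5 | ARI | 0.967 | 0.292 | 0.072 | 0.045 | 0.317 | 0.746 | 0.057 | 0.080
Li | 5 | NMI | 0.964 | 0.449 | 0.288 | 0.122 | 0.504 | 0.835 | 0.177 | 0.191
Patel | 5 | k | 5 | 18 | 7 | 6 | 5 | 26 | (5) | (5)
Patel | 5 | ARI | 0.776 | 0.445 | 0.744 | 0.689 | 0.809 | 0.278 | 0.849 | —
Patel | 5 | NMI | 0.762 | 0.668 | 0.846 | 0.680 | 0.849 | 0.463 | 0.823 | —
Deng | 7 | k | 4 | 5 | 5 | 4 | 9 | 16 | (7) | (7)
Deng | 7 | ARI | 0.526 | 0.530 | 0.513 | 0.390 | 0.484 | 0.346 | 0.272 | 0.387
Deng | 7 | NMI | 0.751 | 0.738 | 0.725 | 0.602 | 0.755 | 0.639 | 0.505 | 0.609
Treutlein | 4 | k | 6 | 7 | 4 | 4 | 10 | 19 | (4) | (4)
Treutlein | 4 | ARI | 0.607 | 0.724 | 0.188 | 0.531 | 0.353 | 0.209 | 0.583 | 0.475
Treutlein | 4 | NMI | 0.732 | 0.850 | 0.304 | 0.648 | 0.534 | 0.505 | 0.664 | 0.644
Yan | 7 | k | 10 | 6 | 5 | 3 | 10 | 13 | (7) | (7)
Yan | 7 | ARI | 0.591 | 0.650 | 0.602 | 0.685 | 0.473 | 0.568 | 0.782 | 0.718
Yan | 7 | NMI | 0.803 | 0.784 | 0.718 | 0.784 | 0.744 | 0.802 | 0.783 | 0.829
Tian307 | 5 | k | 5 | 7 | 5 | 5 | 8 | 42 | (5) | (5)
Tian307 | 5 | ARI | 0.958 | 0.745 | 0.651 | 0.910 | 0.576 | 0.154 | 0.915 | 0.955
Tian307 | 5 | NMI | 0.945 | 0.836 | 0.714 | 0.885 | 0.733 | 0.546 | 0.888 | 0.938
Tian305 | 5 | k | 5 | 8 | 6 | 6 | 10 | 45 | (5) | (5)
Tian305 | 5 | ARI | 0.948 | 0.841 | 0.585 | 0.802 | 0.396 | 0.148 | 0.593 | 0.694
Tian305 | 5 | NMI | 0.909 | 0.872 | 0.655 | 0.906 | 0.644 | 0.531 | 0.692 | 0.819
Category correct ratio (%) | | | 62.5 | 12.5 | 25.0 | 25.0 | 12.5 | 0 | |
Average | | ARI | 0.791 | 0.647 | 0.519 | 0.507 | 0.491 | 0.364 | 0.631 | 0.608
Average | | NMI | 0.850 | 0.766 | 0.639 | 0.579 | 0.672 | 0.624 | 0.692 | 0.709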

We further demonstrated the performance of SSWD by ranking the clustering accuracies on the eight datasets (Fig. 3). For ARI (Fig. 3A), SSWD ranked better than the seven reference methods (median rank of 2). SNN-Cliq performed the worst, with a median rank of 7. For NMI (Fig. 3B), SSWD was also superior to the others, and CIDR, Seurat, and SNN-Cliq all performed poorly. Furthermore, the one-sided Wilcoxon signed-rank test was used to assess the statistical difference between SSWD and the reference methods. Except for SC3 (ARI and NMI) and SinNLRR (ARI), all p-values are less than 0.05, which shows that SSWD is significantly better than the other methods (Table 5).

Figure 3. The ranking performance of eight clustering methods on eight datasets.

Figure 3

Each method is ranked according to ARI (A) and NMI (B) for eight datasets. A lower rank represents better performance (1 is the best and 8 is the worst). Ties are replaced by the mean of their ranks.
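The ranking rule in this caption (rank 1 for the best score, tied methods sharing the mean of their ranks) can be sketched in Python as follows. The function name is hypothetical, and the example uses the Deng ARI row of Table 4 in the column order SSWD, SC3, CIDR, Seurat, SIMLR, SNN-Cliq, SinNLRR, S3C2.

```python
def average_ranks(scores):
    """Rank scores so that the highest score gets rank 1.

    Tied scores all receive the mean of the ranks they would occupy.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next score equals the current one.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        mean_rank = (pos + 1 + end + 1) / 2
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks

# Deng ARI values from Table 4, in the table's method order.
deng_ari = [0.526, 0.530, 0.513, 0.390, 0.484, 0.346, 0.272, 0.387]
deng_ranks = average_ranks(deng_ari)
```

On this row SC3 receives rank 1 and SSWD rank 2. How failed runs (marked — in Table 4) were ranked is not stated in the text, so they are left out of this sketch.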

Table 5. The results of the Wilcoxon signed-rank test conducted on SSWD versus the reference algorithms.

A p-value < 0.05 indicates a significant difference between SSWD and the reference algorithm.

Measure SC3 CIDR Seurat SIMLR SNN-Cliq SinNLRR S3C2
ARI 0.074 0.004 0.014 0.004 0.002 0.150 0.012
NMI 0.074 0.010 0.002 0.014 0.002 0.049 0.012

SSWD was also better at estimating the number of cell groups. It recovered the correct number for five of the eight datasets (Biase, Patel, Li, Tian307, Tian305). For Tian305, only SSWD estimated the correct number of cell groups, and on both Tian307 and Tian305 it achieved the best ARI (0.958 and 0.948, respectively). Deng contains seven cell groups, and none of the methods that estimate the number of clusters identified it correctly. For Treutlein, CIDR and Seurat estimated the correct number of cell groups, but their ARI (0.188 and 0.531) and NMI (0.304 and 0.648) were lower than those of SSWD (0.607 and 0.732). For Yan, SinNLRR and S3C2 performed very well when given the correct number of cell groups.

Annotate the clusters

We illustrated the effectiveness of cluster annotation with PanglaoDB (Oscar, Li-Ming & Johan, 2019), taking the Li dataset as an example. Li is a human pancreatic islet cell dataset containing five subtypes (alpha, beta, pp, acinar, and ductal) (Li et al., 2016). According to the AUC score (see Biological insights), we obtained the marker genes for each cluster identified by SSWD. Figure 4 shows the expression heatmap of the top 10 marker genes for each cluster; it is divided into five clear modules, indicating that these marker genes distinguish the clusters well. Keratin 8 (KRT8) in cluster 1; transthyretin (TTR) and glucagon (GCG) in cluster 2; insulin (INS) in cluster 3; pancreatic polypeptide (PPY) in cluster 4; and REG1B, REG1A, and CTRB2 in cluster 5 were all reported in the original publication. We also annotated the clusters with PanglaoDB, and the resulting annotations are consistent with the cell annotations in the original publication (Table 6).

Figure 4. The expression heatmap of the top 10 marker genes for each cluster in Li.

Figure 4

Rows represent genes, columns represent cells.

Table 6. Cluster annotation with the top marker genes and the PanglaoDB for the Li dataset.

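SSWD results | Marker genes | AUROC | Adjusted p-value | Cell type annotation with PanglaoDB
cluster 1 | CTSH | 0.997 | 2.56E−06 | ductal cells
cluster 1 | KRT8 | 0.995 | 5.86E−06 | ductal cells
cluster 1 | ANXA4 | 0.989 | 3.89E−06 | ductal cells
cluster 2 | TTR | 1.00 | 1.52E−09 | alpha cells
cluster 2 | GCG | 0.951 | 1.52E−09 | alpha cells
cluster 2 | PEMT | 0.904 | 3.70E−08 | alpha cells
cluster 2 | FXYD5 | 0.886 | 1.37E−07 | alpha cells
cluster 3 | INS | 1.00 | 1.23E−07 | beta cells
cluster 3 | NPTX2 | 0.946 | 1.67E−06 | beta cells
cluster 3 | IAPP | 0.933 | 4.69E−06 | beta cells
cluster 3 | PDX1 | 0.924 | 1.25E−06 | beta cells
cluster 3 | ERO1B | 0.917 | 3.83E−07 | beta cells
cluster 3 | PCSK1 | 0.889 | 1.67E−06 | beta cells
cluster 3 | G6PC2 | 0.886 | 3.13E−07 | beta cells
cluster 4 | PPY | 1.00 | 2.31E−06 | pp cells
cluster 4 | ETV1 | 0.960 | 4.31E−06 | pp cells
cluster 4 | FXYD2 | 0.955 | 2.12E−05 | pp cells
cluster 4 | MEIS1 | 0.934 | 3.73E−05 | pp cells
cluster 5 | REG1B | 1.00 | 8.22E−07 | acinar cells
cluster 5 | REG1A CTRB1 | 0.996 | 8.22E−07 | acinar cells
cluster 5 | CTRB2 | 0.996 | 9.13E−07 | acinar cells
cluster 5 | RARRES2 | 0.996 | 1.25E−06 | acinar cells
cluster 5 | SPINK1 | 0.977 | 1.01E−06 | acinar cells
cluster 5 | CPA2 | 0.977 | 1.25E−06 | acinar cells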

Discussion

Role of the EP_dis metric

EP_dis was used in SSWD to assess the similarity between cells or genes. According to the EP_dis definition (see Materials & Methods), the optimal w was determined by SSB/SSW using a search strategy. When w = 1, EP_dis equals the Euclidean distance; when w = 0, it is the Pearson distance. We used two simulated datasets, D1 and D2, to show the impact of EP_dis and explain the process of optimizing w by SSB/SSW. Figure 5 shows the clustering accuracy on D1 (Fig. 5A) and D2 (Fig. 5B) under different w. The highest scores (ARI, NMI, and SSB/SSW) on D1 and D2 do not occur at the endpoints but at w = 0.6 for D1 and w = 0.8 for D2, which indicates that EP_dis captures more information between samples than either distance alone.

Figure 5. SSB/SSW, ARI, and NMI values on D1 (A) and D2 (B) under different w.

Figure 5

The left vertical axis in each subplot represents the values of ARI and NMI indices, and the right axis represents the SSB/SSW.

Role of the relative CH

The RCH in the improved k-means algorithm was used to determine the number of clusters. In SSWD, we employed RCH to estimate the number of gene subspaces and to guide the grouping of cells within each gene subspace. The capability of RCH therefore directly affects the performance of SSWD. We used the simulated datasets D3–D6 with different characteristics, six UCI datasets, and three scRNA-seq datasets to illustrate the properties of RCH and compare it with CH (Table 7). CH and RCH were consistent on D3–D6, indicating that both perform well on simulated datasets. On the UCI datasets, RCH estimated the correct number of clusters except for Dermatology and Yeast, and even for these two datasets its estimates were closer to the true value than those of CH. For the scRNA-seq datasets, both RCH and CH failed, possibly because of the characteristics of scRNA-seq data. Nonetheless, the RCH estimates were at least as close to the true values as those of CH.

Table 7. Comparison of the estimated cluster numbers between the CH and RCH under simulated and real datasets.

The correct number is marked as bold for each dataset.

Datasets | True cluster number | CH | RCH
D3 | 5 | 5 | 5
D4 | 5 | 5 | 5
D5 | 2 | 2 | 2
D6 | 5 | 5 | 5
Dermatology | 6 | 4 | 5
Seed | 3 | 2 | 3
Sensor | 4 | 2 | 4
Statlog | 6 | 3 | 6
Waveform | 3 | 2 | 3
Yeast | 10 | 7 | 9
Biase | 3 | 2 | 2
Tian307 | 5 | 2 | 3
Yan | 7 | 2 | 9

Role of the subspace

After steps 1 and 2 of SSWD (see “Materials & Methods”), the Li dataset was separated into eight gene subspaces, seven of which participated in consensus clustering (Fig. 6, Fig. S2). The expression heatmaps of the best three gene subspaces display clear patterns (Figs. 6A–6C), and their EP_dis heatmaps (Figs. 6D–6F) effectively group cells with similar expression patterns. Compared with the EP_dis heatmap computed from 1,000 genes (Fig. 6G), the consensus matrix built from the gene subspaces (Fig. 6H) strengthens the similarity signal between cells. Clustering on the consensus matrix gave better results (ARI of 0.967, NMI of 0.964) than clustering on the 1,000-gene EP_dis matrix (ARI of 0.386, NMI of 0.579) because the latter could not distinguish alpha and pp cells well.

Figure 6. Three subspace expression heatmaps and distance heatmaps for the Li dataset.

Figure 6

(A–C) are the best three subspaces expression heatmap; rows represent genes, columns represent cells; (D–F) are the EP_dis distance heatmap relating to (A–C); (G) is the EP_dis distance heatmap with 1,000 genes; (H) is the consensus matrix heatmap; in (D–H), both rows and columns represent cells. In (G) and (H), the first color bar is the cell groups after clustering by SSWD, and the second is the actual cell types.

Discussion of prevailing methods

We also provide further discussion of Tables 4 and 5. The average performance of SC3 was second only to that of SSWD, and its results were not significantly different from those of SSWD in the one-sided Wilcoxon signed-rank test. SC3 combines multiple similarity measures (Euclidean, Pearson, Spearman) in clustering. It uses a consensus matrix to integrate the multiple clustering results, and the consensus matrix strengthens the consensus signal between cells. At the same time, on Deng and Treutlein, where SC3 achieved the best ARI among all methods, it did not obtain the correct number of cell groups. Although SC3 estimated the correct number of cell groups for Biase, one cell was misclassified, while SinNLRR classified all cells accurately. Both SinNLRR and S3C2 introduce the idea of subspace clustering. Their average ARI was better than that of the other methods except for SSWD and SC3. However, this result was obtained with the number of cell groups provided. Estimating the number of clusters is an important aspect of clustering methods. Although SinNLRR can estimate the number of clusters by other methods, its accuracy is still unsatisfactory (Zheng et al., 2019).

SNN-Cliq had the lowest average ARI (ARI = 0.364, NMI = 0.624) and did not estimate the correct number of cell groups on any of the eight datasets. SNN-Cliq tended to produce too many clusters, probably because the method requires three suitable parameters and its results depend on the graph representation of the data. CIDR used an implicit imputation approach to reduce the impact of dropout in scRNA-seq data and used CH to estimate the number of cell groups. It determined the correct number of cell groups in Treutlein and Tian307, but its clustering accuracy on them was poor. SIMLR adopts a multi-kernel strategy to adaptively select an appropriate distance metric and automatically determine the cell groups. However, this method achieved good performance only on Patel because it used Euclidean distance as the metric to construct a Gaussian kernel function (Taiyun et al., 2018). Seurat failed on Biase, and its ARI on Li was only 0.045 (Table 4). The results suggest that Seurat may be unsuitable for small datasets, consistent with the literature (Kiselev, Andrew & Hemberg, 2019).

SSWD had the best overall performance in the experiments. However, its performance on Patel and Yan was mediocre. Although SSWD estimated the correct number of cell groups for Patel, its clustering accuracy ranked only third (in ARI) and fourth (in NMI), probably because the Patel dataset contains negative values. None of the methods estimated the correct number of cell groups in Yan, and the poor performance of SSWD on Yan arose because its estimated number of cell groups (10) was far from the actual number (7).

We can draw the following conclusions from the above observations: (1) Due to the complex structure of scRNA-seq data, developing an optimal clustering method for all situations is impossible. (2) Determining the cluster numbers is difficult, so assigning cells to appropriate types is more important. (3) Selecting suitable similarity measures and using subspace in single-cell clustering help obtain better clustering results.

Conclusions

The identification of cell types is a fundamental problem in scRNA-seq data analysis. In recent years, many clustering methods have been proposed, most of which focus on computing more accurate and robust similarity measures between cells. However, conventional similarity measures face challenges in single-cell data clustering because of the high dimensionality, high noise, and high dropout rate of the data. This study proposed a clustering method for small scRNA-seq data, named SSWD, based on sets of gene subspaces and a weighted distance. First, an improved k-means with EP_dis and RCH was applied to partition the genes into sets of gene subspaces with similar density distributions, which better identify distinct cell groups. Second, cell clustering was performed in each of these gene subspaces. Finally, ensemble clustering with PAM was conducted on the consensus matrix built from the gene subspace clustering results. The results on eight scRNA-seq datasets showed that SSWD could effectively reduce the influence of noise in clustering and better capture the intrinsic relationships between cells or genes, thereby achieving more robust and accurate clustering results.
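To make the idea of a weighted distance concrete, the sketch below blends Euclidean distance with Pearson distance (one minus the Pearson correlation) using a fixed weight. This is only an illustration under stated assumptions: the weight `w`, the `scale` divisor, and the simple convex combination are hypothetical choices for the example and do not reproduce the EP_dis weighting strategy defined in the Methods.

```python
from math import sqrt
from typing import Sequence


def euclidean(x: Sequence[float], y: Sequence[float]) -> float:
    """Euclidean distance between two expression profiles."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """1 - Pearson correlation: 0 for perfectly correlated, 2 for anti-correlated."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 1.0  # correlation is undefined for a constant profile; treat as uncorrelated
    return 1.0 - cov / sqrt(var_x * var_y)


def weighted_distance(x: Sequence[float], y: Sequence[float],
                      w: float = 0.5, scale: float = 1.0) -> float:
    """Hypothetical blend: w * Euclidean / scale + (1 - w) * Pearson distance.

    `w` and `scale` are illustrative assumptions, not the paper's EP_dis weights.
    """
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("profiles must be non-empty and of equal length")
    if not 0.0 <= w <= 1.0 or scale <= 0:
        raise ValueError("w must lie in [0, 1] and scale must be positive")
    return w * euclidean(x, y) / scale + (1.0 - w) * pearson_distance(x, y)


# Two toy profiles with the same shape but different magnitude.
cell_a = [1.0, 2.0, 3.0]
cell_b = [2.0, 4.0, 6.0]
d_blend = weighted_distance(cell_a, cell_b, w=0.5)
d_shape_only = weighted_distance(cell_a, cell_b, w=0.0)
```

In this toy case the Pearson term sees the two profiles as identical in shape while the Euclidean term still separates them by magnitude, which is the kind of complementary information a combined metric is meant to capture.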

Supplemental Information

Supplemental Information 1. The source code of SSWD.
DOI: 10.7717/peerj.14706/supp-1
Supplemental Information 2. Supplemental Figures.
DOI: 10.7717/peerj.14706/supp-2

Funding Statement

This work was supported by the Natural Science Foundation of Hunan Province (2021JJ30351), the Scientific Research Project of Hunan Provincial Department of Education (21B0187), and the Special Funds for Construction of Innovative Provinces in Hunan Province (2021NK1011). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Contributor Information

Yuan Chen, Email: Chenyuan0510@126.com.

Zheming Yuan, Email: zhmyuan@sina.com.

Additional Information and Declarations

Competing Interests

The authors declare there are no competing interests.

Author Contributions

Zilan Ning conceived and designed the experiments, performed the experiments, analyzed the data, prepared figures and/or tables, authored or reviewed drafts of the article, and approved the final draft.

Zhijun Dai performed the experiments, authored or reviewed drafts of the article, and approved the final draft.

Hongyan Zhang analyzed the data, prepared figures and/or tables, and approved the final draft.

Yuan Chen analyzed the data, prepared figures and/or tables, authored or reviewed drafts of the article, and approved the final draft.

Zheming Yuan conceived and designed the experiments, authored or reviewed drafts of the article, and approved the final draft.

Data Availability

The following information was supplied regarding data availability:

The raw data is available at NCBI GEO: GSE57249, GSE73727, GSE57872, GSE45719, GSE52583, GSE118767, GSE36552.

The code of SSWD described in this article is available at GitHub: https://github.com/ningzilan/SSWD; Zilan Ning. (2022). SSWD: The clustering method for small scRNA-seq data based on subspace and weighted distance. Zenodo. https://doi.org/10.5281/zenodo.7471227.

References

  • Biase, Cao & Sheng (2014). Biase FH, Cao X, Sheng Z. Cell fate inclination within 2-cell and 4-cell mouse embryos revealed by single-cell RNA sequencing. Genome Research. 2014;24(11):1787–1796. doi: 10.1101/gr.177725.114.
  • Butler et al. (2018). Butler A, Hoffman P, Smibert P, Papalexi E, Satija R. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nature Biotechnology. 2018;36(5):411–420. doi: 10.1038/nbt.4096.
  • Caliński & Harabasz (1974). Caliński T, Harabasz J. A dendrite method for cluster analysis. Communications in Statistics. 1974;3(1):1–27.
  • Chen, Nasrabad & Tran (2011). Chen Y, Nasrabad NMT, Tran D. Hyperspectral image classification using dictionary-based sparse representation. IEEE Transactions on Geoscience and Remote Sensing. 2011;49(10):3973–3985. doi: 10.1109/TGRS.2011.2129595.
  • Chiang & Mirkin (2010). Chiang MT, Mirkin B. Intelligent choice of the number of clusters in k-means clustering: an experimental study with different cluster spreads. Journal of Classification. 2010;27(1):3–40. doi: 10.1007/s00357-010-9049-5.
  • Ekström & Hagen (2019). Ekström A, Hagen G. Global sensitivity analysis of bulk properties of an atomic nucleus. Physical Review Letters. 2019;123(25):252501. doi: 10.1103/PhysRevLett.123.252501.
  • Elowitz et al. (2002). Elowitz MB, Levine AJ, Siggia ED, Swain PS. Stochastic gene expression in a single cell. Science. 2002;297(5584):1183–1186. doi: 10.1126/science.1070919.
  • Fränti & Sieranoja (2019). Fränti P, Sieranoja S. How much k-means can be improved by using better initialization and repeats? Pattern Recognition. 2019;93:95–112. doi: 10.1016/j.patcog.2019.04.014.
  • Fulekar (2009). Fulekar MH. Bioinformatics: applications in life and environmental sciences. Springer; Dordrecht: 2009.
  • Grün et al. (2015). Grün D, Lyubimova A, Kester L, Wiebrands K, Basak O, Sasaki N, Clevers H, Oudenaarden AV. Single-cell messenger RNA sequencing reveals rare intestinal cell types. Nature. 2015;525:251–255. doi: 10.1038/nature14966.
  • Hubert & Arabie (1985). Hubert L, Arabie P. Comparing partitions. Journal of Classification. 1985;2(1):193–218. doi: 10.1007/BF01908075.
  • Huh et al. (2020). Huh R, Yang Y, Jiang Y, Shen Y, Li Y. SAME-clustering: Single-cell Aggregated clustering via Mixture Model Ensemble. Nucleic Acids Research. 2020;48(1):86–95. doi: 10.1093/nar/gkz959.
  • Hussain & Haris (2019). Hussain SF, Haris M. A k-means based co-clustering (kCC) algorithm for sparse, high dimensional data. Expert Systems with Applications. 2019;118(15):20–34. doi: 10.1016/j.eswa.2018.09.006.
  • Jain (2008). Jain AK. Data clustering: 50 years beyond K-means. In: Daelemans W, Goethals B, Morik K, editors. Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2008. vol. 5211. Springer; Berlin, Heidelberg: 2008. (Lecture Notes in Computer Science).
  • Jain, Murt & Flynn (1999). Jain AK, Murt MN, Flynn PJ. Data clustering: a review. ACM Computing Surveys. 1999;31(3):264–323. doi: 10.1145/331499.331504.
  • Jaitin et al. (2014). Jaitin DA, Kenigsberg E, Keren-Shaul H, Elefant N, Paul F, Zaretsky I, Mildner A, Cohen N, Jung S, Tanay A. Massively parallel single-cell RNA-Seq for marker-free decomposition of tissues into cell types. Science. 2014;343(6172):776–779. doi: 10.1126/science.1247651.
  • Ji & Ji (2016). Ji Z, Ji H. TSCAN: pseudo-time reconstruction and evaluation in single-cell RNA-seq analysis. Nucleic Acids Research. 2016;44(13):e117–e117. doi: 10.1093/nar/gkw430.
  • Jolliffe (2002). Jolliffe IT. Principal component analysis. Journal of Marketing Research. 2002;25(4):513.
  • Kharchenko, Silberstein & Scadden (2014). Kharchenko PV, Silberstein L, Scadden DT. Bayesian approach to single-cell differential expression analysis. Nature Methods. 2014;11(7):740. doi: 10.1038/nmeth.2967.
  • Kiselev, Andrew & Hemberg (2019). Kiselev VY, Andrew TS, Hemberg M. Challenges in unsupervised clustering of single-cell RNA-seq data. Nature Reviews Genetics. 2019;20:273–282. doi: 10.1038/s41576-018-0088-9.
  • Kiselev et al. (2017). Kiselev VY, Kirschner K, Schaub MT, Andrews T, Yiu A, Chandra T, Natarajan KN, Reik W, Barahona M, Green AR. SC3: consensus clustering of single-cell RNA-seq data. Nature Methods. 2017;14(5):483–486. doi: 10.1038/nmeth.4236.
  • Li et al. (2016). Li J, Klughammer J, Farlik M, Penz T, Spittler A, Barbieux C, Berishvili E, Bock C, Kubicek S. Single-cell transcriptomes reveal characteristic features of human pancreatic islet cell types. EMBO Reports. 2016;17(2):178–187. doi: 10.15252/embr.201540946.
  • Lin, Troup & Ho (2017). Lin P, Troup M, Ho J. CIDR: ultrafast and accurate clustering through imputation for single-cell RNA-seq data. Genome Biology. 2017;18(1):59. doi: 10.1186/s13059-017-1188-0.
  • Liu et al. (2010). Liu Y, Li Z, Hui X, Gao X, Wu J. Understanding of internal clustering validation measures. 2010 IEEE international conference on data mining; Piscataway: IEEE; 2010. pp. 911–916.
  • MacQueen (1967). MacQueen J. Some methods for classification and analysis of multivariate observations. Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, Oakland CA, USA. 1967.
  • Milligan & Cooper (1985). Milligan GW, Cooper MC. An examination of procedures for determining the number of clusters in a data set. Psychometrika. 1985;50(2):159–179. doi: 10.1007/BF02294245.
  • Ning et al. (2022). Ning Z, Chen J, Huang J, Sabo UJ, Yuan Z, Dai Z. WeDIV–an improved k-means clustering algorithm with a weighted distance and a novel internal validation index. Egyptian Informatics Journal. 2022;23(4):133–144.
  • Oscar, Li-Ming & Johan (2019). Oscar F, Li-Ming G, Johan B. PanglaoDB: a web server for exploration of mouse and human single-cell RNA sequencing data. Database the Journal of Biological Databases & Curation. 2019;2019:baz046. doi: 10.1093/database/baz046.
  • Park & Jun (2009). Park H-S, Jun C-H. A simple and fast algorithm for K-medoids clustering. Expert Systems with Applications. 2009;36(Part 2):3336–3341. doi: 10.1016/j.eswa.2008.01.039.
  • Patel et al. (2014). Patel AP, Tirosh I, Trombett JJ, Shalek AK. Single-cell RNA-seq highlights intratumoral heterogeneity in primary glioblastoma. Science. 2014;344(6190):1396–1401. doi: 10.1126/science.1254257.
  • Peng et al. (2020). Peng L, Tian X, Tian G, Xu J, Huang X, Weng Y, Yang J, Zhou L. Single-cell RNA-seq clustering: datasets, models, and algorithms. RNA Biology. 2020;17(6):765–783. doi: 10.1080/15476286.2020.1728961.
  • Peyvandipour et al. (2020). Peyvandipour A, Shafi A, Saberian N, Draghici S. Identification of cell types from single cell data using stable clustering. Scientific Reports. 2020;10(1):12349. doi: 10.1038/s41598-020-66848-3.
  • Praktiknjo et al. (2020). Praktiknjo SD, Obermayer B, Zhu Q, Fang L, Rajewsky N. Tracing tumorigenesis in a solid tumor model at single-cell resolution. Nature Communications. 2020;11(1):991. doi: 10.1038/s41467-020-14777-0.
  • Qiaolin et al. (2014). Qiaolin D, Daniel R, Bjöern R, Rickard S. Single-cell RNA-seq reveals dynamic, random monoallelic gene expression in mammalian cells. Science. 2014;343(6167):193–196. doi: 10.1126/science.1245316.
  • Qi et al. (2020). Qi R, Ma A, Ma Q, Zou Q. Clustering and classification methods for single-cell RNA-sequencing data. Briefings in Bioinformatics. 2020;21(4):1196–1208. doi: 10.1093/bib/bbz062.
  • Rousseeuw (1987). Rousseeuw PJ. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics. 1987;20:53–65. doi: 10.1016/0377-0427(87)90125-7.
  • Satija et al. (2015). Satija R, Farrell JA, Gennert D, Schie AF, Regev A. Spatial reconstruction of single-cell gene expression data. Nature Biotechnology. 2015;33(5):495–502. doi: 10.1038/nbt.3192.
  • Shao & Höfer (2017). Shao C, Höfer T. Robust classification of single-cell transcriptome data by nonnegative matrix factorization. Bioinformatics. 2017;33(2):235–242. doi: 10.1093/bioinformatics/btw607.
  • Sheather & Jones (1991). Sheather SJ, Jones MC. A reliable data-based bandwidth selection method for kernel density estimation. Journal of the Royal Statistical Society. Series B: Methodological. 1991;53(3):683–690.
  • Song et al. (2021). Song J, Liu Y, Zhang X, Wu Q, Yang C. Entropy subspace separation-based clustering for noise reduction (ENCORE) of scRNA-seq data. Nucleic Acids Research. 2021;49(3):18. doi: 10.1093/nar/gkaa1157.
  • Stegle, Teichman & Marioni (2015). Stegle O, Teichman SA, Marioni JC. Computational and analytical challenges in single-cell transcriptomics. Nature Reviews Genetics. 2015;16(3):133–145. doi: 10.1038/nrg3833.
  • Strehl & Ghosh (2002). Strehl A, Ghosh J. Cluster ensembles - a knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research. 2002;3(3):583–617.
  • Taiyun et al. (2018). Taiyun K, Chen IR, Lin Y, Wang YY, Yang J, Yang P. Impact of similarity metrics on single-cell RNA-seq data clustering. Briefings in Bioinformatics. 2018;20:2316–2326. doi: 10.1093/bib/bby076.
  • Tang et al. (2009). Tang F, Barbacioru C, Wang Y, Nordman E, Surani MA. mRNA-Seq whole-transcriptome analysis of a single cell. Nature Methods. 2009;6(5):377–382. doi: 10.1038/nmeth.1315.
  • Thorndike (1953). Thorndike R. Who belongs in the family? Psychometrika. 1953;18(4):267–276. doi: 10.1007/BF02289263.
  • Tian et al. (2019). Tian L, Dong X, Freytag S, Cao K, Ritchie ME. Benchmarking single cell RNA-sequencing analysis pipelines using mixture control experiments. Nature Methods. 2019;16(6):479–487. doi: 10.1038/s41592-019-0425-8.
  • Tibshirani & Hastie (2001). Tibshirani R, Hastie WT. Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society B. 2001;63(2):411–423. doi: 10.1111/1467-9868.00293.
  • Treutlein et al. (2014). Treutlein B, Brownfield DG, Wu AR, Neff NF, Mantalas GL, Espinoza FH, Desai TJ, Krasno MA, Quake SR. Reconstructing lineage hierarchies of the distal lung epithelium using single-cell RNA-seq. Nature. 2014;509(7500):371–375. doi: 10.1038/nature13173.
  • Venkatasubramanian et al. (2020). Venkatasubramanian M, Chetal K, Schnell DJ, Atluri G, Salomonis N. Resolving single-cell heterogeneity from hundreds of thousands of cells through sequential hybrid clustering and NMF. Bioinformatics. 2020;36(12):3773–3780. doi: 10.1093/bioinformatics/btaa201.
  • Wang et al. (2017). Wang B, Zhu J, Pierson E, Ramazzotti D, Batzoglou S. Visualization and analysis of single-cell RNA-seq data by kernel-based similarity learning. Nature Methods. 2017;14(4):414. doi: 10.1038/nmeth.4207.
  • Wang et al. (2020). Wang Z, Lu Y, Yu C, Zhou T, Li R, Hou S. DSCD: a novel deep subspace clustering denoise network for single-cell clustering. IEEE Access. 2020;8:109857–109865. doi: 10.1109/ACCESS.2020.3001986.
  • Xu & Su (2015). Xu C, Su Z. Identification of cell types from single-cell transcriptomes using a novel clustering method. Bioinformatics. 2015;31(12):1974–1980. doi: 10.1093/bioinformatics/btv088.
  • Yan et al. (2013). Yan L, Yang M, Guo H, Yang L, Wu J. Single-cell RNA-Seq profiling of human preimplantation embryos and embryonic stem cells. Nature Structural & Molecular Biology. 2013;20(9):1131–1139. doi: 10.1038/nsmb.2660.
  • Yang et al. (2019). Yang Y, Huh R, Culpepper HW, Lin Y, Lov MI, Li Y. SAFE-clustering: single-cell aggregated (from ensemble) clustering for single-cell RNA-seq data. Bioinformatics. 2019;35(8):1269–1277. doi: 10.1093/bioinformatics/bty793.
  • Zhang, Yue & Zhang (2014). Zhang J, Yue C, Zhang Y. Comparison of cluster analysis methods for gene expression profile. Journal of Nanjing Agricultural University. 2014;37(6):1–6.
  • Zheng et al. (2019). Zheng R, Li M, Liang Z, Wu F-X, Pan Y, Wang J. SinNLRR: a robust subspace clustering method for cell type detection by non-negative and low-rank representation. Bioinformatics. 2019;35(19):3642–3650. doi: 10.1093/bioinformatics/btz139.
  • Zhuang et al. (2021). Zhuang J, Cui L, Qu T, Ren C, Xu J, Li T, Tian G, Yang J. A streamlined scRNA-Seq data analysis framework based on improved sparse subspace clustering. IEEE Access. 2021;9:9719–9727. doi: 10.1109/ACCESS.2021.3049807.
  • Žurauskiene & Yau (2016). Žurauskiene J, Yau C. pcaReduce: hierarchical clustering of single cell transcriptional profiles. BMC Bioinformatics. 2016;17:140. doi: 10.1186/s12859-016-0984-y.

Articles from PeerJ are provided here courtesy of PeerJ, Inc
