Abstract
Single-cell omics is the fastest-growing type of genomics data in the literature and public genomics repositories. Leveraging the growing repository of labeled datasets and transferring labels from existing datasets to newly generated datasets will empower the exploration of single-cell omics data. However, the current label transfer methods have limited performance, largely due to the intrinsic heterogeneity among cell populations and extrinsic differences between datasets. Here, we present a robust graph artificial intelligence model, single-cell Graph Convolutional Network (scGCN), to achieve effective knowledge transfer across disparate datasets. Through benchmarking with other label transfer methods on a total of 30 single cell omics datasets, scGCN consistently demonstrates superior accuracy on leveraging cells from different tissues, platforms, and species, as well as cells profiled at different molecular layers. scGCN is implemented as an integrated workflow as a python software, which is available at https://github.com/QSong-github/scGCN.
Subject terms: Computational models, Software, Sequencing, Transcriptomics
Making sense of the rapidly growing single-cell omics datasets available is limited by difficulties in leveraging disparate datasets in analyses. Here, the authors present scGCN, a graph based convolutional network to allow effective knowledge transfer across omics datasets.
Introduction
Single-cell omics technologies are increasingly used in biomedical research to provide high resolution insights into the complex cellular ecosystem and underlying molecular interconnectedness1–3. Leading this wave of omics is single-cell RNA sequencing (scRNA-seq), which allows measurement of the transcriptome in thousands of single cells from multiple biological samples under various conditions4–8. Other single-cell-based assays, include Single-cell Assay for Transposase-Accessible Chromatin using sequencing (scATAC-seq), profile cellular heterogeneity at the epigenetic level9–11, which further elucidates transcriptional regulators8,12. These technological developments allow profiling of multiple molecular layers at single-cell resolution and assaying cells from multiple samples under different conditions.
The rapid advances in single-cell technologies have led to remarkable growth of single cell omics data. As more and more single-cell datasets become available, there is an urgent need to leverage existing and newly generated data in a reliable and reproducible way, learning from the established single-cell data with well-defined labels as reference, and transferring labels to newly generated datasets to assign cell-level annotations10,11. However, existing datasets and newly generated datasets are often collected from different tissues and species13,14, under various experimental conditions, generated by different platforms15,16, and in different omics types17. Thus a reliable and accurate knowledge transfer method must overcome the following challenges: (1) the unique technical issues of single-cell data (e.g., dropouts and dispersion)18–21; (2) batch effects that arise from different operators, experimental protocols16, and technical variation (e.g., mRNA quality, pre-amplification efficiency, technical settings during data generation)22–24; and (3) intrinsic biological variances associated with different tissues, species, and molecular layers such as RNA-seq and ATAC-seq.
To address these challenges in transferring labels across different datasets, several methods have been developed. The most commonly used are Seurat v325,26 and the recently reported Conos27, scmap28, and CHETAH29. Seurat is a well-established, widely used toolkit for single cell genomics25,26. Recently an anchor-based label transfer method across substantially different single-cell samples has been proposed in Seurat v326. Conos generates a joint graph representation by pairwise alignments of samples, to propagate labels from one sample to another. scmap learns cell types by measuring the maximum similarity between the reference dataset with well-annotated cells and unknown datasets30. Guided by the reference data, CHETAH identifies a classification tree for a top-to-bottom classification in unannotated data. Whereas these methods are valuable in different settings, they exhibit limited capability and performance, partially due to the fact that they only extract shared information from individual cells but ignore higher-order relations between cells. Such topological cell relation can be well captured by the Graph Convolutional Networks (GCN)31. Recently, GCN and its related methods have been successfully applied in single cells and in diseases32–36, showing that inclusion of GCN enables learning of high-order representation and topological relations of cells that improve performance.
Here, we present a graph-based artificial intelligence model, termed single-cell Graph Convolutional Network (scGCN). We provide evidence that scGCN allows reliable and reproducible integration of single-cell datasets and transferring labels across studies. Thus, knowledge learned from well-characterized datasets in previous studies can be transferred to and provides insights in studies. Using a wide range of different single-cell omics datasets, including data from different tissues, species, sequencing platforms, and molecular layers (such as RNA-seq and ATAC-seq), we demonstrate that scGCN outperforms other methods in accuracy and reproducibility. In addition, we provide the implementation software of scGCN, which is compatible with various single-cell datasets for accurate cell type identification.
Results
Overview of the scGCN
Knowledge learned from existing single-cell datasets is often represented as cell labels. Examples of cell labels are cell type, developmental state, activation status, cellular functionality, and signaling pattern. The scGCN approach leverages well-characterized single-cell data as reference to infer such cellular-level knowledge in query dataset, i.e., label transfer, through semi-supervised learning (Fig. 1). First, scGCN learns a sparse and hybrid graph of both inter-dataset and intra-dataset cell mappings using mutual nearest neighbors of canonical correlation vectors that project different datasets onto a correlated low-dimensional space, which enables the identification and propagation of shared information between reference and query data. Subsequently, based on the constructed graph, semi-supervised GCN is used to project cells of both reference and new datasets onto the same latent space so that cells with the same labels present in the same population. Consequently, cell labels in query data are predicted and learned from reference data. Details are described in “Methods” section.
Performance of label transfer within datasets
To evaluate the performance of scGCN, we benchmark it against other methods, including Seurat v326, Conos27, scmap28, and CHETAH29. For quantitative benchmarks, we first use ten scRNA-seq datasets that vary in cell numbers, tissues, species, and sequencing technologies. These datasets represent different scenarios and challenges in cell label transfer. For each dataset, we randomly select 50% of its cells as the reference data and apply the above methods to learn the labels of the other 50% of cells as the query data.
We evaluate the performance of each method using the accuracy score (Acc), which is defined as the proportion of correctly predicted cells among all cells in the query data. Our results show that scGCN consistently demonstrates better performance than other methods across datasets and achieves the highest accuracy (mean Acc = 91%, Fig. 2a), which is significantly higher than other methods (Seurat v3: 87%; Conos: 83%; scmap: 82%; CHETAH: 77%) (P value = 0.0019, 0.0058, 0.0039, and 0.002, respectively). scGCN demonstrates particularly higher accuracy relative to these methods in some datasets. For example, Seurat v3 shows a relatively lower accuracy score in GSE99254 dataset (Seurat v3: 77%; scGCN: 85%), while Conos performs poorly on the GSE108989 dataset (Conos: 53%; scGCN: 86%), and CHETAH performs poorly on the SRP073767 dataset (CHETAH: 62%; scGCN: 90%). Therefore, scGCN achieves the best performance in transferring labels accurately in the ten benchmarking datasets.
To highlight the comparison regarding specific cell types, we use the SRP073767 dataset as an example, which has two subtypes of T cells (CD4+ T helper2 cells, CD4+/CD25 Treg cells) that every benchmarked method cannot distinguish accurately whereas scGCN performs relatively better (Fig. 2b and Supplementary Fig. 1a). In addition, Seurat v3 and Conos cannot discriminate between CD4+ T helper2 cells and CD4+/CD45RA+/CD2− Naive T cells, nor the CD8+/CD45RA+ Naive Cytotoxic cells and CD8+ Cytotoxic T cells. scmap and CHETAH also inaccurately assigns CD4+ T cells to CD8+ T cells, and has low accuracy for monocyte classification and unassigned cells. scGCN, in contrast, performs better than other methods in discerning these similar cell types.
Transfer labels across datasets of different platforms
As emerging single cell data are generated by different experimental platforms, we test whether scGCN can be used to accurately transfer labels across datasets from different platforms. Here we include 12 paired reference-query datasets. Each pair of reference-query datasets is profiled using different scRNA-seq technologies.
Similarly, we use the accuracy score to evaluate the performance of each method. Based on the 12 reference-query datasets, the accuracy score of scGCN (mean Acc = 87%, Fig. 3a) is consistently higher than Seurat v3 (mean Acc = 82.2%) and Conos (mean Acc = 82.3%), and also significantly better than scmap (mean Acc = 66%; P value = 0.001433) and CHETAH (mean Acc = 58%, P value = 2.219e−05). Specifically, when querying the PBMC Smart-seq2 data from the reference of PBMC Dropseq data, both Seurat v3 (Acc = 68%) and scmap (Acc = 60%) show relatively lower accuracy scores than scGCN (Acc = 77%). Similarly, Conos (Acc = 71%) and CHETAH (Acc = 32%) present lower accuracy than scGCN (Acc = 76%) when mapping MCA Smart-seq2 from MCA 10× data. Notably, when annotating the PBMC 10× V3 data using the PBMC Cel-seq data as reference, scGCN is distinctively superior to other methods. Although these two datasets are generated using different platforms, Uniform Manifold Approximation and Projection (UMAP)37 reveals highly consistent transferred labels by scGCN for the cell populations of the PBMC 10× V3 query data (Fig. 3b). The Sankey diagram shows that the reference dataset is much smaller than the query dataset (Fig. 3c), suggesting the effectiveness of scGCN even with a small reference dataset.
We next use heatmap to depict the accuracy of each cell type, including B cells, CD4+ T cells, and cytotoxic T cells identified by different methods (Fig. 3d and Supplementary Fig. 1b). Seurat v3 and CHETAH incorrectly assign most CD4+ T cells and natural killer cells to cytotoxic T cells. Some megakaryocytes and CD16+ monocytes are assigned to other cell types by Conos and scmap. The closely related cell types such as natural killer cells and T cells, as well as CD14+ monocytes and CD16+ monocytes, are correctly discriminated by scGCN.
Transfer labels across datasets of different species
We next evaluate the performance of scGCN across datasets from different species. We apply all benchmarking methods to four paired reference-query datasets. For each paired dataset, one consists of cells from mouse and the other from human tissues.
We first apply all five methods to identify the labels of query data in the four pairs of datasets. Then we compare the visualization of cells using the aggregated reference-query data by different methods (Fig. 4a). Because CHETAH does not provide the aggregated data, we omit this method in the following comparisons of cell visualization. Due to the inherent noise and batch effects in the raw data, cells are not separated well, particularly for dataset 3, i.e., phs001790 (mouse)—GSE115746 (human), and dataset 4, i.e., GSE115746 (mouse)—phs001790 (human). Similarly, in the UMAP projections of Seurat v3 and Conos, cells are not distinguished explicitly in datasets 3 and 4. scmap shows far less discernable results in these paired datasets (Supplementary Fig. 1c). In contrast, when using the aggregated data generated by scGCN, cell subpopulations are clearly discerned in the UMAP projections for all four scenarios.
The UMAP projections suggest that the aggregated cells are aligned better by scGCN than the other methods. This observation is further confirmed by the accuracy score (Fig. 4b). Specifically, scGCN shows the best performance with higher average accuracy (87%), compared with Conos (71%), Seurat v3 (67%), scmap (48%), and CHETAH (20%). We find that Conos shows higher accuracy than Seurat v3 in datasets 1 and 2, but lower in datasets 3 and 4. Quantitatively, in the four paired datasets, scmap and CHETAH produce less accurate results than those of Seurat v3 and Conos. Together, these results show that scGCN performs consistently better in transferring labels (e.g., Seurat v3 and Conos) across different species.
Transfer labels across datasets of different types of omics
We next examine how well different algorithms transfer labels across different types of omics. Here we apply scGCN, Seurat v3, and Conos, but omit scmap and CHETAH as they are only designed for scRNA-seq data. Four open accessible paired datasets (A549, brain, kidney, and lung tissues) with scRNA-seq data as the reference and scATAC-seq data as the query dataset, respectively, are included for comparison. We use two evaluation metrics including the batch mixing entropy and the silhouette coefficient. The batch mixing entropy shows the mixing level of cells in the aggregated profiles of reference and query data38. A higher entropy value means better intermingling of cells from different batches, whereas the scRNA-seq and scATAC-seq data are regarded as two batches. The silhouette coefficient quantifies the separation of different cell types by calculating the silhouette widths of each cell type38,39. A higher silhouette coefficient represents more differences between cell types and fewer variances within each cell type.
Using these two metrics, we evaluate the transferred labels in each query dataset, i.e., scATAC-seq data. Compared to the other methods, scGCN shows better performance with higher mixing entropy (Fig. 5a). Specifically, Seurat v3 has higher mixing entropy than Conos in A549 data, but is lower than Conos in the other three datasets. In the brain dataset, Conos shows comparable mixing entropy with scGCN. We then calculate the silhouette coefficients for all three methods (Fig. 5b). Compared with Seurat v3 and Conos, scGCN has significantly and consistently higher silhouette coefficients. In most datasets, Seurat v3 shows slightly higher silhouette coefficients than Conos.
To further visualize the joint alignment of scRNA-seq data and scATAC-seq data, we show the UMAP projection of the aggregated A549 cells with labels learned by different methods (Fig. 5c). The cell subpopulations (0, 1, and 3 h cells after 100 nM dexamethasone (DEX) treatment) are clearly discerned in the UMAP projection when using the aggregated data generated by scGCN. In contrast, Seurat v3 and Conos can not explicitly distinguish different cell subpopulations (Fig. 5c and Supplementary Fig. 2a).
To characterize the differential accessible loci and uncover the transcriptional regulatory mechanisms in the scATAC-seq data of A549 cells, we perform the motif enrichment analysis to discover the cis-regulatory DNA sequences that differentially regulate the 3 h cells after DEX treatment. After treatment, the specific loci of 3 h cells are enriched with the binding motifs of FOXO3, REBB1, and ELF1 compared to 0 h cells (Fig. 5d). Moreover, the transcription factor FOXO3 is upregulated in 3 h cells (Fig. 5e), which is a validated regulator that drives cell progression under the DEX treatment40,41. KRT7 and WDR60 also show higher expression in 3 h cells, which occurs in vivo with DEX treatment42,43 (Supplementary Fig. 2b). RREB1 and ELF1 exhibit upregulation in 3 h cells compared to 0 h cells (Fig. 5e). These results suggest the potential roles of these identified transcriptional factors in maintaining and establishing the chromatin accessibility to express functional genes after treatment. We perform similar motif analysis in mouse brain dataset, based on the integrated scRNA-seq and scATAC-seq datasets from the 10× Genomics Chromium system (Supplementary Fig. 3). In these data, overrepresented DNA motifs are identified in L4-specific accessibility peaks, with Foxp1, Egr3, and Smad3 motifs as the highly enriched motifs (Supplementary Fig. 3a), which also exhibit upregulated expression in L4 cell subtype (Supplementary Fig. 3b). Altogether, these results suggest that scGCN outperforms other methods when transferring labels between single cell transcriptomics and epigenomics data.
Discussion
Single-cell omics technologies have allowed biologists to gain insights into the individual cellular components of complex biological ecosystems44–46. Given the explosive growth in single-cell data, there is a critical need to leverage the existing, well-characterized datasets as references to ensure reliable and consistent annotations of data. In this study, we report a graph-based artificial intelligence model, single-cell Graph Convolutional Network (scGCN) that allows researchers to use reference single-cell omics data to annotate data through robust knowledge transfer approach. We provide evidence that scGCN allows for reliable and reproducible cell type transferring across datasets from different tissues, species, sequencing platforms, and molecular layers (such as RNA-seq and ATAC-seq). The scGCN software, which is publicly accessible, is compatible with various single-cell datasets for accurate label transfer.
Apart from the technical aspects of computational analysis, the accuracy, robustness, and sensitivity of label transfer also rely on the quality of the reference datasets. Reference datasets used in this study are well-characterized in literature, covering various types of samples and application scenarios in biomedical research. Whereas the current study does not address the quality issues of the reference datasets, it is well accepted that accumulating data in the field will lead to development and definition of higher quality and experimental evidence-based reference datasets. Additionally, when annotating the PBMC 10X V3 data, scGCN reveals accurate cell mapping even when the reference dataset has few cells, suggesting that scGCN is less insensitive to cell number than other methods, which can be further validated with more datasets for a systematic and stringent analysis.
scGCN consistently outperform four commonly used algorithms, namely Seurat v3, Conos, scmap, and CHETAH that possess the knowledge transfer functionalities. scGCN can also overcome batch effects compared to CCA (Supplementary Figs. 4–5 and Supplementary Note 1). In addition to the datasets used in Figs. 2–5, we further include a well-recognized benchmarking collection of single-cell datasets covering 13 major platforms47 and completely examine all possible reference-query combinations, which allows unbiased evaluation of scGCN (Supplementary Fig. 6–7 and Supplementary Note 2).
With the development of single-cell sequencing technology and growing single-cell data sizes, we show the scalability of scGCN in large-size datasets. Specifically, the memory usage and the computing time are profiled with respect to different sample sizes (from 100 k up to 1 million cells), using the dataset generated by Cao et al48. We compare the computational costs of scGCN with Seurat v3 on a computer with 64 GB memory and 3.6 GHz Intel Core i9 processor. We show the computational time and average memory usage for different sample sizes (i.e., 100 k, 200 k, …, 1 millon) in Supplementary Figs. 8–9 and Supplementary Note 3. scGCN is comparable to Seurat v3 in computational time and memory usage49, demonstrating its efficiency and scalability for large-size single-cell datasets.
From a technical perspective, scGCN has some major advantages. First, unlike the other methods, scGCN simultaneously utilizes features, graphic structures, and reference labels to overcome batch effects, protocol differences, and other intrinsic differences (e.g., different species and omics types) among datasets. Second, scGCN transfers labels from labeled samples to unlabeled samples in a semi-supervised manner, which is more desirable than other unsupervised methods. Third, in every layer, scGCN nonlinearly propagates feature information from neighboring cells in the hybrid graph, which learns the topological cell relations and improves the performance of transferring labels by considering higher-order relations between cells.
Despite these successful results, there are several aspects in which scGCN can be improved. First, as an artificial intelligence (AI) model, scGCN shows not only the merits of its kind, but also some limitations including the black-box nature of AI models50–52. These can be addressed through downstream analysis such as differential gene identification and enrichment analysis, that can ameliorate some of the problems and bring insights into the labeled cells. Second, as a graph model, improving the graph construction can further boost model performance. Our graph construction based on mutual nearest neighbors reflects the state-of-art in single cell graph representation. As a fast-growing research field, graph construction approaches are emerging that we will test and adapt in future versions of scGCN.
Methods
Data preprocessing
For each input data, we denote the dataset with known cell labels as reference data and the dataset that needs to be annotated as query data , where is the number of common gene features shared by and , , and are the number of cells in reference data and query data . First, we identify the gene features that exhibit the most variability across different cell types in the reference data, which can be represented as heterogenous and prioritized features. As there are multiple cell types present in the reference data (the number of different cell types is annotated by ), we perform multi-class differential expression analysis to using analysis of variance (ANOVA) to identify the most variable genes across different cell types. Bonferroni correction is used to select the top = 2000 genes in reference data with significantly adjusted P values. In scGCN, all input data filters out non-variable gene features, thus the reference data and the query data become and .
Graph construction
We use the mutual nearest neighbor (MNN)38 approach and canonical correlation analysis53 to construct a hybrid graph capturing the topological characteristics of single cells in both reference and query datasets. First, the reference dataset as well as the query dataset are standardized. Then a reference-to-query graph (inter-dataset) as well as an internal query graph (intra-dataset) are constructed. The final hybrid graph is composed of these two graphs. The hybrid graph is then used as input for the scGCN model.
Standardization transformation
We first perform the standardized transformation for reference data, i.e.,
1 |
where is the raw value of input reference data, is the mean of , and is the standard deviation of . Thus is the standardized value of feature and cell , where and . After standardization, reference data is represented as In the same way, raw data in is processed in the same manner that becomes .
Construction of the reference-to-query graph
For the scGCN method, one critical step is to construct an effective graph, which is represented as an adjacent matrix that best leverages the reference data and query data. Here, we use the MNN38 concept and search for the MNNs from both reference and query data after simultaneous dimensionality reduction of reference and query datasets through canonical correlation analysis.
The goal of canonical correlation analysis is to simultaneously project the high-dimensional reference and query data into the same low dimensional space through two dataset-specific linear transformations. Thus, molecular patterns in both reference and query datasets that share the same biological meaning can be captured and represented uniformly in the low dimensional space. represents the standardized reference data and represents the query data. In order to project these two matrixes to a -dimensional space where , we need to identify pair of canonical correlation vectors of dimension and of dimension, where , to maximize
2 |
s.t. and . We use singular value decomposition to calculate the canonical correlation vector pairs associated with the largest eigenvalues. Each pair of canonical correlation vector and thus can be used to project the original data and to the -th dimension in the -dimensional space, respectively. In this study we set .
The reference-to-query graph is constructed with the MNN approach based on the projected reference and query data. For a cell in the reference data and another cell in the query data, if and only if cell belongs to the nearest neighbors of cell , and at the same time cell is also one of the nearest neighbors of cell . Otherwise . Thus, we construct the inter-dataset cell mapping between reference data and query data.
Construction of the query internal graph
Similarly, we also construct the internal graph for cells in the query dataset using the MNN approach and canonical correlation analysis with .
Construction of the hybrid graph
The final hybrid graph is then constructed by combining the reference-to-query graph () and the query internal graph (). The hybrid graph is used as the input adjacent matrix for our scGCN model. With this constructed hybrid graph, we validate its effectiveness by comparing it with other different graph construction methods (see Supplementary Figs. 10–12 and Supplementary Note 4).
scGCN method
We utilize the GCN31 for semi-supervised learning and transferring labels from reference data to query data. Each cell is viewed as a node. The annotations of cell types in the reference data are the known labels. The goal of GCN is to predict the cell type annotations of the query data by using not only the features of each cell but also the information leveraging reference and query data, which are characterized as the above hybrid graph . Explicitly, the GCN model takes two inputs. One input is the hybrid graph structure learned above, which is represented as the adjacent matrix (see the “Graph construction” section). The other is the feature matrix , where is the total number of cells and is the number of variable features selected in preprocessing. If the reference data is denoted as and the query data as , the input data matrix is extended to:
3 |
Herein, both features of reference and query data, as well as the information leveraging reference and query data are utilized in our model. With these two inputs, the GCN model is constructed with multiple convolutional layers.
For efficient training of GCN introduced in31, the adjacent matrix is modified as:
4 |
where is the identity matrix, , and is the diagonal degree matrix of .
Specifically, each layer is defined as:
5 |
where is the input and is the weight matrix of the -th layer, is the non-linear activation function, and the input layer . The labels of cells in the reference data are represented as a class indicator for cell and label , where represents nodes with known labels, represents the total number of different labels, and indicates cell has the -th label, and as not of this label.
Specifically, for a three-layer GCN with distinct labels, the forward propagation is realized as:
6 |
where is the input-to-hidden weight matrix projecting the features input data into an dimension hidden layer, stands for the rectified linear unit activation function, is a hidden-to-output weight matrix, and is the predicted probabilities of cell labels. The softmax activation function is defined as
7 |
Given the reference data, we use the cross-entropy error the evaluate the predictions, i.e.,
8 |
After training the model we have
9 |
The prediction is the probability of each cell within each class. The final label for cell is determined as the -th label when where .
When applying the scGCN model, we randomly split the reference data as training (80%), test (10%), and validation set (10%), while the query data is unlabeled that can be predicted through the above semi-supervised GCN model. For datasets in this study, we train three-layer scGCN models for a maximum of 200 epochs using Adam with a learning rate of 0.01 and early stopping with a window size of 10. For the number of hidden units, we respectively check 32, 64, 128, 256, 528, 1024, and select the optimal one.
In addition to the semi-supervised GCN31, we also evaluate the other different graph neural network methods, including HYPERGCN54, GAT55, GWNN56, GraphSAGE57, and ChebyNet58, based on our constructed hybrid graph. GCN shows the best overall performance; HYPERGCN, GAT, and GWNN are also good alternatives. We also include the HYPERGCN, GAT, and GWNN models in our scGCN tool to provide more options to users. Results of the comparisons can be found in Supplementary Fig. 13 and Supplementary Note 5.
Detection of unknown cells
To identify potential unknown cells in query data, we provide a screening step in our scGCN model using two statistical metrics, entropy score and enrichment score, representing mixtureness and enrichment. Specifically, all cells in query data are grouped to different clusters by modularity-based community detection59. For each query cluster, we measure its mixtureness and enrichment based on the inter-data graph of scGCN. Our rationale is that a query cluster of unknown cell type is more likely to have random links to different cell types in the reference data, while a query cluster of known cell type is more likely to link to a specific cell type in the reference data. In this way, unknown cells can be identified by the two statistical metrics. (1) Entropy score: For a cluster in the query data, the mixtureness of this cluster is defined by the information entropy of normalized enrichment scores. That is,
10 |
where is a specific cell type, is the set of all cell types in reference data, is the number of cells in query cluster that are linked to cell type in reference data by the inter-data graph of scGCN, and is the number of cells belonging to cell type in reference data. (2) Enrichment score: For a cluster in the query data, the enrichment score is defined by the normalized enrichment of the most enriched cell type. That is,
11 |
For a cluster in the query data, the entropy score describes whether this cluster in query data is dominated by a specific cell type, and the enrichment score describes how strong this cell type is enriched. Thus, if the query cell cluster has higher entropy and lower enrichment, these cells should be assigned as unknown cell types. Detailed performance evaluation is shown in Supplementary Fig. 14 and Supplementary Note 6.
Cross-species classification
We use the HomoloGene databases provided by NCBI (Build 68) to identify homologous genes between humans and mice, and keep only genes that have a one-to-one correspondence, which serves as a look-up table. After obtaining the intersection gene set between the reference data and the look-up table, the gene names are then converted to human gene names to obtain a compatible input for classification.
Methods comparison
For the comparisons with Seurat v3, we used the most recent CRAN release version 3.2.1 of Seurat26, with its anchoring framework in Seurat v3 for transfer labels across datasets. Specifically, we follow the Seurat v3 vignette about Multiple Dataset Integration and Label Transfer at https://satijalab.org/seurat/v3.2/integration.html, and use the FindTransferAnchors and TransferData functions to transfer cell type labels from a reference dataset onto a query dataset. For the FindTransferAnchors function, we use the default parameters, i.e., reduction = pcaproject, npcs = 30, dims = 1:30, k.anchor = 5, k.filter = 200, k.score = 30, max.features = 200, nn.method = rann, normalization.method = LogNormalize, approx.pca = TRUE. For the TransferData function, we also use the default parameters, i.e., weight.reduction = pcaproject, dims = 1:30, k.weight = 50, sd.weight = 1.
As above, Seurat v3 provides the options of using PCA as the default (PCA-based Seurat v3) and CCA as an alternative (CCA-based Seurat v3) to project the structure of a reference onto the query, which are used in the anchor weighting and label transfer steps26. We compare scGCN with CCA-based Seurat v3 in Supplementary Fig. 15 and Supplementary Note 7. For benchmarking with scmap, we identify the classification of query data using the scmapCluster function of scmap. For CHETAH, we identify the classification of query data using the CHETAHclassifier function. All methods are applied with default parameters. In this study, statistical significance is defined using the two-sided Wilcoxon rank test.
A549 data
The A549 dataset is downloaded from GEO, which includes 3260 cells profiled using the sci-CAR protocol60. scRNA-seq and scATAC-seq data can be accessed with accession number of GSM3271040 and GSM3271041. After quality control, the resultant data consists of 2641 cells in our analysis.
Mouse brain data
The single-cell ATAC-seq dataset of adult mouse brain cells is provided by 10× Genomics, which is available through the 10× Genomics website (http://cf.10xgenomics.com/samples/cell-atac/1.1.0/atac_v1_adult_brain_fresh_5k/atac_v1_adult_brain_fresh_5k_filtered_peak_bc_matrix.h5). scRNA-seq data is obtained from the same biological system (the adult mouse brain), which can be downloaded from Allen Institute website (http://celltypes.brain-map.org/api/v2/well_known_file_download/694413985).
Kidney data
The kidney dataset is downloaded from GEO, which includes 11, 296 single cells from the mammalian kidney using the sci-CAR protocol60. The scATAC-seq data and scRNA-seq data that can be accessed with accession number of GSM3271045 and GSM3271044 are profiled using the sci-CAR protocol60. Duplicate genes are merged by maximum value and cells that are labeled as NA are removed. Moreover, we remove cells with less than 500 expressed genes and genes expressed in fewer than ten cells from the scRNA-seq data. Cells with less than 200 accessible loci and loci opened in fewer than ten cells are removed from scATAC-seq data. The final dataset with 8837 cells is used in our analysis.
Lung data
The lung dataset contains lung count matrices from three scRNA datasets (two lung samples from Tabula Muris [https://figshare.com/projects/Tabula_Muris_Transcriptomic_characterization_of_20_organs_and_tissues_from_Mus_musculus_at_single_cell_resolution/27733], one lung sample from GSE108097, and two lung sci-ATAC-seq replicates from GSE68103.
Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.
Supplementary information
Acknowledgements
The work is supported in part by the Bioinformatics Shared Resources under the NCI Cancer Center Support Grant to the Comprehensive Cancer Center of Wake Forest University Health Sciences (P30CA012197). WZ is supported by the Hanes and Willis Family Professorship in Cancer. Additional support for Q.S. and W.Z. is provided by a Fellowship to W.Z. from the National Foundation for Cancer Research. This work is also partially supported by the Indiana University Precision Health Initiative to J.S.
Source data
Author contributions
Q.S., J.S., and W.Z. developed the structure and arguments and wrote the manuscript. All the authors reviewed and approved the final manuscript.
Data availability
Source data are provided with this paper. All datasets analyzed in the current study are publicly available and can be downloaded from their public accessions, including GSE108989, GSE115746, GSE118389, GSE72056, GSE98638, GSE99254, GSM3271044, GSM3271045, phs001790, SRP073767 [http://support.10xgenomics.com/single-cell/datasets], E-MTAB-5061, GSE84133, GSE81608, and GSE85241. The PBMC data of six different sequencing protocols are available from the Broad Institute Single Cell portal (https://portals.broadinstitute.org/single_cell/study/SCP424/single-cell-comparison-pbmc-data) and the Zenodo repository (10.5281/zenodo.3357167)30. We select these single-cell data as they have been frequently used to evaluate the performance of transferring labels by different methods25,26,28,30. The following shows the details of scRNA-seq and scATAC-seq data used in this study. Source data are provided with this paper.
Code availability
All the functions mentioned above are implemented as a python software that is available at Github and Zenodo (https://github.com/QSong-github/scGCN)61.
Competing interests
W.Z. is a consultant for Astellas Pharma US, Inc. The other authors have no competing interests to declare.
Footnotes
Peer review information Nature Communications thanks the anonymous reviewers for their contribution to the peer review of this work. Peer reviewer reports are available.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Contributor Information
Jing Su, Email: su1@iu.edu.
Wei Zhang, Email: wezhang@wakehealth.edu.
Supplementary information
The online version contains supplementary material available at 10.1038/s41467-021-24172-y.
References
- 1.Baron M, et al. A single-cell transcriptomic map of the human and mouse pancreas reveals inter- and intra-cell population structure. Cell Syst. 2016;3:346–360. doi: 10.1016/j.cels.2016.08.011. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.Puram SV, et al. Single-cell transcriptomic analysis of primary and metastatic tumor ecosystems in head and neck cancer. Cell. 2017;171:1611–1624. doi: 10.1016/j.cell.2017.10.044. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Athanasiadis EI, et al. Single-cell RNA-sequencing uncovers transcriptional states and fate decisions in haematopoiesis. Nat Commun. 2017;8:2045. doi: 10.1038/s41467-017-02305-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Azizi E, et al. Single-cell map of diverse immune phenotypes in the breast tumor microenvironment. Cell. 2018;174:1293–1308. doi: 10.1016/j.cell.2018.05.060. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Cusanovich DA, et al. A single-cell atlas of in vivo mammalian chromatin accessibility. Cell. 2018;174:1309–1324. doi: 10.1016/j.cell.2018.06.052. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Muraro MJ, et al. A single-cell transcriptome atlas of the human pancreas. Cell Syst. 2016;3:385–394. doi: 10.1016/j.cels.2016.09.002. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Tabula Muris C, et al. Single-cell transcriptomics of 20 mouse organs creates a Tabula Muris. Nature. 2018;562:367–372. doi: 10.1038/s41586-018-0590-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Buenrostro JD, et al. Integrated single-cell analysis maps the continuous regulatory landscape of human hematopoietic differentiation. Cell. 2018;173:1535–1548. doi: 10.1016/j.cell.2018.03.074. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Macaulay IC, Ponting CP, Voet T. Single-cell multiomics: multiple measurements from single cells. Trends Genet. 2017;33:155–168. doi: 10.1016/j.tig.2016.12.003. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Mezger A, et al. High-throughput chromatin accessibility profiling at single-cell resolution. Nat. Commun. 2018;9:1–6. doi: 10.1038/s41467-018-05887-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Buenrostro JD, et al. Single-cell chromatin accessibility reveals principles of regulatory variation. Nature. 2015;523:486–490. doi: 10.1038/nature14590. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Satpathy AT, et al. Massively parallel single-cell chromatin landscapes of human immune cell development and intratumoral T cell exhaustion. Nat. Biotechnol. 2019;37:925–936. doi: 10.1038/s41587-019-0206-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Schaum N, et al. Single-cell transcriptomics of 20 mouse organs creates a Tabula Muris. Nature. 2018;562:367–372. doi: 10.1038/s41586-018-0590-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Regev A, et al. Science forum: the human cell atlas. Elife. 2017;6:e27041. doi: 10.7554/eLife.27041. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Brennecke P, et al. Accounting for technical noise in single-cell RNA-seq experiments. Nat. Methods. 2013;10:1093–1095. doi: 10.1038/nmeth.2645. [DOI] [PubMed] [Google Scholar]
- 16.Lun ATL, Marioni JC. Overcoming confounding plate effects in differential expression analyses of single-cell RNA-seq data. Biostatistics. 2017;18:451–464. doi: 10.1093/biostatistics/kxw055. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Colomé-Tatché M, Theis FJ. Statistical single cell multi-omics integration. Curr. Opin. Syst. Biol. 2018;7:54–59. doi: 10.1016/j.coisb.2018.01.003. [DOI] [Google Scholar]
- 18.Svensson V. Droplet scRNA-seq is not zero-inflated. Nat. Biotechnol. 2020;38:147–150. doi: 10.1038/s41587-019-0379-5. [DOI] [PubMed] [Google Scholar]
- 19.Lun AT, Bach K, Marioni JC. Pooling across cells to normalize single-cell RNA sequencing data with many zero counts. Genome Biol. 2016;17:75. doi: 10.1186/s13059-016-0947-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Vieth B, Ziegenhain C, Parekh S, Enard W, Hellmann I. powsimR: power analysis for bulk and single cell RNA-seq experiments. Bioinformatics. 2017;33:3486–3488. doi: 10.1093/bioinformatics/btx435. [DOI] [PubMed] [Google Scholar]
- 21.Grun D, Kester L, van Oudenaarden A. Validation of noise models for single-cell transcriptomics. Nat Methods. 2014;11:637–640. doi: 10.1038/nmeth.2930. [DOI] [PubMed] [Google Scholar]
- 22.Cao J, et al. Comprehensive single-cell transcriptional profiling of a multicellular organism. Science. 2017;357:661–667. doi: 10.1126/science.aam8940. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Rosenberg AB, et al. Single-cell profiling of the developing mouse brain and spinal cord with split-pool barcoding. Science. 2018;360:176–182. doi: 10.1126/science.aam8999. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.Macosko EZ, et al. Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell. 2015;161:1202–1214. doi: 10.1016/j.cell.2015.05.002. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.Butler A, Hoffman P, Smibert P, Papalexi E, Satija R. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat. Biotechnol. 2018;36:411–420. doi: 10.1038/nbt.4096. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Stuart T, et al. Comprehensive integration of single-cell data. Cell. 2019;177:1888–1902. doi: 10.1016/j.cell.2019.05.031. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27.Barkas N, et al. Joint analysis of heterogeneous single-cell RNA-seq dataset collections. Nat. Methods. 2019;16:695–698. doi: 10.1038/s41592-019-0466-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28.Kiselev VY, Yiu A, Hemberg M. scmap: projection of single-cell RNA-seq data across data sets. Nat. Methods. 2018;15:359–362. doi: 10.1038/nmeth.4644. [DOI] [PubMed] [Google Scholar]
- 29.de Kanter JK, Lijnzaad P, Candelli T, Margaritis T, Holstege FCP. CHETAH: a selective, hierarchical cell type identification method for single-cell RNA sequencing. Nucleic Acids Res. 2019;47:e95–e95. doi: 10.1093/nar/gkz543. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30.Abdelaal T, et al. A comparison of automatic cell identification methods for single-cell RNA sequencing data. Genome Biol. 2019;20:194. doi: 10.1186/s13059-019-1795-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31.Kipf, T. N. & Welling, M. Semi-supervised classification with graph convolutional networks. In The International Conference on Learning Representations (ICLR) (2017).
- 32.Yuan, Y. & Bar-Joseph, Z. GCNG: Graph convolutional networks for inferring cell-cell interactions. Preprint at bioRxiv, 10.1101/2019.12.23.887133 (2019).
- 33.Li J, et al. Neural inductive matrix completion with graph convolutional networks for miRNA-disease association prediction. Bioinformatics. 2020;36:2538–2546. doi: 10.1093/bioinformatics/btz965. [DOI] [PubMed] [Google Scholar]
- 34.Zeng, Y., Zhou, X., Rao, J., Lu, Y. & Yang, Y. Accurately clustering single-cell RNA-seq data by capturing structural relations between cells through graph convolutional network. In 2020 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 519–522. IEEE, (2020).
- 35.Zhao, T., Hu, Y., Valsdottir, L. R., Zang, T. & Peng, J. Identifying drug–target interactions based on graph convolutional network and deep neural network. Brief. Bioinform.10.1093/bib/bbaa044 (2020). [DOI] [PubMed]
- 36.Fang C, Xu D, Su J, Dry JR, Linghu B. DeePaN: deep patient graph convolutional network integrating clinico-genomic evidence to stratify lung cancers for immunotherapy. NPJ Digit. Med. 2021;4:14. doi: 10.1038/s41746-021-00381-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 37.Becht E, et al. Dimensionality reduction for visualizing single-cell data using UMAP. Nat. Biotechnol. 2019;37:38–44. doi: 10.1038/nbt.4314. [DOI] [PubMed] [Google Scholar]
- 38.Haghverdi L, Lun ATL, Morgan MD, Marioni JC. Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nat. Biotechnol. 2018;36:421–427. doi: 10.1038/nbt.4091. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 39.Rousseeuw PJ. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Math. 1987;20:53–65. doi: 10.1016/0377-0427(87)90125-7. [DOI] [Google Scholar]
- 40.Lützner N, Kalbacher H, Krones-Herzig A, Rösl F. FOXO3 is a glucocorticoid receptor target and regulates LKB1 and its own expression based on cellular AMP levels via a positive autoregulatory loop. PLoS ONE. 2012;7:e42166. doi: 10.1371/journal.pone.0042166. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 41.Mostafa MM, et al. Glucocorticoid-driven transcriptomes in human airway epithelial cells: commonalities, differences and functional insight from cell lines and primary cells. BMC Med. Genomics. 2019;12:1–21. doi: 10.1186/s12920-018-0467-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 42.Bhadri VA, Cowley MJ, Kaplan W, Trahair TN, Lock RB. Evaluation of the NOD/SCID xenograft model for glucocorticoid-regulated gene expression in childhood B-cell precursor acute lymphoblastic leukemia. BMC Genomics. 2011;12:565. doi: 10.1186/1471-2164-12-565. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 43.Caldwell JM, et al. Glucocorticoid-regulated genes in eosinophilic esophagitis: a role for FKBP51. J. Allergy Clin. Immunol. 2010;125:879–888. doi: 10.1016/j.jaci.2010.01.038. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 44.Song, Q. & Su, J. DSTG: deconvoluting spatial transcriptomics data through graph-based artificial intelligence. Brief. Bioinform.10.1093/bib/bbaa414 (2021). [DOI] [PMC free article] [PubMed]
- 45.Song Q, et al. Dissecting intratumoral myeloid cell plasticity by single cell RNA-seq. Cancer Med. 2019;8:3072–3085. doi: 10.1002/cam4.2113. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 46.Song, Q., Su, J., Miller, L. D. & Zhang, W. scLM: Automatic detection of consensus gene clusters across multiple single-cell datasets. Genom. Proteom. Bioinform.10.1016/j.gpb.2020.09.002 (2020). [DOI] [PMC free article] [PubMed]
- 47.Mereu E, et al. Benchmarking single-cell RNA-sequencing protocols for cell atlas projects. Nat. Biotechnol. 2020;38:747–755. doi: 10.1038/s41587-020-0469-4. [DOI] [PubMed] [Google Scholar]
- 48.Cao, J. et al. A human cell atlas of fetal gene expression. Science370, 6518 (2020). [DOI] [PMC free article] [PubMed]
- 49.Tran HTN, et al. A benchmark of batch-effect correction methods for single-cell RNA sequencing data. Genome Biol. 2020;21:1–32. doi: 10.1186/s13059-019-1850-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 50.Ghosh A, Kandasamy D. Interpretable artificial intelligence: why and when. Am. J. Roentgenol. 2020;214:1137–1138. doi: 10.2214/AJR.19.22145. [DOI] [PubMed] [Google Scholar]
- 51.Moore JH, et al. Preparing next-generation scientists for biomedical big data: artificial intelligence approaches. Per Med. 2019;16:247–257. doi: 10.2217/pme-2018-0145. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 52.Filipp FV. Opportunities for artificial intelligence in advancing precision medicine. Curr. Genet. Med. Rep. 2019;7:208–213. doi: 10.1007/s40142-019-00177-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 53.Dudoit S, Fridlyand J, Speed TP. Comparison of discrimination methods for the classification of tumors using gene expression data. J. Am. Stat. Assoc. 2002;97:77–87. doi: 10.1198/016214502753479248. [DOI] [Google Scholar]
- 54.Chami, I., Ying, R., Ré, C. & Leskovec, J. Hyperbolic graph convolutional neural networks. Advances in Neural Information Processing Systems 32 (NeurIPS) (2019). [PMC free article] [PubMed]
- 55.Veličković, P. et al. Graph attention networks. In The International Conference on Learning Representations (ICLR) (2018).
- 56.Xu, B., Shen, H., Cao, Q., Qiu, Y. & Cheng, X. Graph wavelet neural network. In The International Conference on Learning Representations (ICLR) (2019).
- 57.Hamilton, W. L., Ying, R. & Leskovec, J. Inductive representation learning on large graphs. Advances in Neural Information Processing Systems 30 (NIPS) (2017).
- 58.Defferrard M, Bresson X, Vandergheynst P. Convolutional neural networks on graphs with fast localized spectral filtering. Adv. Neural Inf. Process. Syst. 2016;29:3844–3852. [Google Scholar]
- 59.Waltman L, Van Eck NJ. A smart local moving algorithm for large-scale modularity-based community detection. Eur. Phys. J. B. 2013;86:1–14. doi: 10.1140/epjb/e2013-40829-0. [DOI] [Google Scholar]
- 60.Cao J, et al. Joint profiling of chromatin accessibility and gene expression in thousands of single cells. Science. 2018;361:1380–1385. doi: 10.1126/science.aau0730. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 61.Song, Q., Su, J. & Zhang, W. scGCN is a graph convolutional networks algorithm for knowledge transfer in single cell omics. QSong-github/scGCN, 10.5281/zenodo.4758844 (2021). [DOI] [PMC free article] [PubMed]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Data Availability Statement
Source data are provided with this paper. All datasets analyzed in the current study are publicly available and can be downloaded from their public accessions, including GSE108989, GSE115746, GSE118389, GSE72056, GSE98638, GSE99254, GSM3271044, GSM3271045, phs001790, SRP073767 [http://support.10xgenomics.com/single-cell/datasets], E-MTAB-5061, GSE84133, GSE81608, and GSE85241. The PBMC data of six different sequencing protocols are available from the Broad Institute Single Cell portal (https://portals.broadinstitute.org/single_cell/study/SCP424/single-cell-comparison-pbmc-data) and the Zenodo repository (10.5281/zenodo.3357167)30. We select these single-cell data as they have been frequently used to evaluate the performance of transferring labels by different methods25,26,28,30. The following shows the details of scRNA-seq and scATAC-seq data used in this study. Source data are provided with this paper.
All the functions mentioned above are implemented as a python software that is available at Github and Zenodo (https://github.com/QSong-github/scGCN)61.