Abstract
Background
Recent advancements in single-cell RNA sequencing have greatly expanded our knowledge of the heterogeneous nature of tissues. However, robust and accurate cell type annotation continues to be a major challenge, hindered by issues such as marker specificity, batch effects, and a lack of comprehensive spatial and interaction data. Traditional annotation methods often fail to adequately address the complexity of cellular interactions and gene regulatory networks.
Results
We propose scMCGraph, a comprehensive computational framework that integrates gene expression with pathway activity to accurately annotate cell types within diverse scRNA-seq datasets. The model first constructs multiple pathway-specific views using various pathway databases, which reflect both gene expression and pathway activities. These pathway-specific views are then integrated into a consensus graph, which is subsequently used to reconstruct the multiple pathway views. The model demonstrated robustness and accuracy across various analyses, including cross-platform, cross-time, cross-sample, and clinical dataset evaluations.
Conclusions
scMCGraph represents a significant advance in cell type annotation. The experiments demonstrate that introducing pathway information significantly improves the learning of cell–cell graphs, and that the resulting consensus graph enhances cell type prediction performance. Different pathway databases provide complementary information, and increasing the number of pathways can further boost model performance. Extensive testing shows that scMCGraph is consistently accurate and robust across various cross-dataset application scenarios.
Keywords: Single-cell RNA sequencing, Cell type annotation, Consensus graph, Cellular communication, Pathway integration
Background
Single-cell RNA sequencing (scRNA-seq) has revolutionized genomics by offering a high-resolution view of cellular composition in complex tissues, transforming the analysis of cellular heterogeneity [1, 2]. Unlike bulk RNA sequencing, which pools gene expression from multiple cells, scRNA-seq resolves transcriptomic landscapes at the single-cell level, enabling nuanced analyses across various biological states and conditions [3, 4]. A significant challenge in harnessing scRNA-seq data is the accurate and robust annotation of cell types [5, 6]. This process is crucial for understanding cellular dynamics and gene functionality. The accuracy of annotation affects how well cellular functions can be interpreted, disease states predicted, and developmental trajectories understood. Despite advancements, accurately annotating cell types in large scRNA-seq datasets remains a critical challenge, necessitating innovative solutions [7].
Advancements in cell type annotation have transitioned from manual, expert-driven methodologies to automated, algorithmic strategies. These advances address the labor-intensive and biased nature of manual annotation by offering rapid and accurate cell identification, democratizing the analysis for researchers across disciplines. Automated annotation in scRNA-seq has evolved into three primary strategies: marker-, correlation-, and model-based methodologies [8]. Marker-based techniques, such as scTyper [9], Digital Cell Sorter [10], SCINA [11], SCSA [12], CellAssign [13], scCATCH [14], MarkerCount [15], scClassifR [16], Garnett [17], and scSorter [18], use established cell markers for identification. However, their effectiveness is limited by the availability and specificity of these markers, often failing to identify cell types without known markers or in the presence of novel cells. Correlation-based methodologies, such as scmap-cell [19], CHETAH [20], and SingleR [21], use reference datasets to align and annotate cell types based on gene expression similarity but struggle with batch effects when reference and query datasets are from different batches. Model-based methodologies, including SciBet [22], scNym [23], SingleCellNet [24], scPred [25], and Moana [26], leverage algorithmic learning to refine cell type annotation, addressing data noise and batch effects. Nevertheless, the inherent limitations of scRNA-seq technology preclude capturing certain critical biological information, such as spatial orientation, cellular interactions, and gene regulatory networks (GRNs), thus constraining these models.
In recent years, more sophisticated models have emerged, utilizing deep learning strategies to address these challenges. For example, scBERT [27] adapts the BERT architecture from natural language processing, pretraining on large, unlabeled datasets and fine-tuning for specific annotation tasks, which enhances accuracy and generalizability across diverse datasets. However, one of its key limitations is its non-interpretable nature, which can make it challenging to understand the reasoning behind its predictions. scTab [28], by comparison, utilizes a large-scale training set of over 22 million cells, enabling robust cross-tissue cell type classification and demonstrating scalability for large datasets. While scTab excels in scalability, it does not integrate biological pathway information, which can limit its ability to provide insights into underlying biological processes. SIMS [29], on the other hand, incorporates a self-attention mechanism that improves label transfer, even with smaller training sets, allowing for the identification of misannotated cells and genetic variations in complex biological systems. SIMS is recognized for its interpretability, which is valuable for understanding model predictions in biological research. Additionally, scArches [30] leverages transfer learning to efficiently map query datasets to large reference atlases, enabling iterative updates and contextualization of new datasets without sharing raw data, which is particularly useful in dealing with batch effects and preserving biological state information. CellTypist [31], a logistic regression-based framework, is specifically designed for accurate and automated cell type annotation across tissues, contributing to the accurate dissection of immune cell populations in large-scale studies. These models focus on learning patterns and relationships within gene expression data using deep learning techniques, without relying on graph-based representations [32].
However, despite the success of these deep learning-based approaches, their reliance on large training datasets and their inability to fully capture the complexity of cellular interactions and GRNs limit their effectiveness in modeling biological systems. In contrast, graph-based computational models address these limitations by directly incorporating critical biological features, such as cellular interactions and GRNs, into the annotation process. By employing three main strategies for graph construction (cell-gene, gene–gene, and cell–cell), graph-based models provide a more nuanced approach to cell type annotation. Cell-gene graphs are effective in representing cellular diversity. However, they increase computational demands, which limits their application in large datasets. For instance, scDeepSort [33] constructs a cell-gene graph from scRNA-seq data, with cells and genes as nodes connected according to expression. scDeepSort uses a weighted graph neural network (GNN), pre-trained on cell atlases, for precise cell type annotation. Gene–gene graphs, which prioritize GRNs, often overlook critical intercellular communications essential for understanding complex tissue dynamics. An example is scGraph [34], which employs GNNs to discern cell types through a gene–gene interaction graph derived from scRNA-seq data, refining cell type annotation by incorporating gene expression data and pre-existing GRN information. Meanwhile, cell–cell graphs leverage gene expression similarities to enhance robustness and biological relevance, further refined by techniques like GNNs and attention methods. These models, exemplified by SCEA [35], HNNVAT [36], and scAGN [37], illustrate the diverse approaches to capturing and analyzing cellular interactions. Specifically, in the SCEA method, cells are established as nodes with edges assigned weights determined by Pearson correlation coefficients, quantifying gene expression similarity between cells [38].
SCEA enhances graph robustness through a noise reduction technique called network enhancement, identifying stronger and more biologically relevant correlations. Similarly, HNNVAT represents cells as nodes with edges informed by gene expression similarities, refined through GNNs pre-trained on comprehensive cell atlases, enabling effective message passing techniques. scAGN constructs its graph by representing cells as nodes and connecting each cell to its closest counterparts based on gene expression similarity using a k-nearest neighbor method. The attention mechanism within scAGN dynamically adjusts edge weights, focusing on significant expression patterns for cell type annotation.
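The cell–cell graph constructions described above (Pearson-weighted edges, k-nearest-neighbor connectivity) can be illustrated with a minimal sketch. It is not the implementation of SCEA, HNNVAT, or scAGN; the function name `pearson_knn_graph`, the value of `k`, and the random toy data are assumptions made purely for illustration.

```python
import numpy as np

def pearson_knn_graph(expr, k=2):
    """Symmetric kNN cell-cell graph with Pearson-correlation edge weights.

    expr: (n_cells, n_genes) expression matrix (rows are cells).
    Returns an (n_cells, n_cells) weighted adjacency matrix.
    """
    corr = np.corrcoef(expr)              # cell-by-cell Pearson correlation
    ranked = corr.copy()
    np.fill_diagonal(ranked, -np.inf)     # a cell is never its own neighbour
    n = corr.shape[0]
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        # indices of the k cells most correlated with cell i
        mask[i, np.argsort(ranked[i])[-k:]] = True
    mask |= mask.T                        # keep an edge if either endpoint chose it
    return np.where(mask, corr, 0.0)

# Toy data (hypothetical): 6 cells, 20 genes
rng = np.random.default_rng(0)
toy_expr = rng.normal(size=(6, 20))
adj = pearson_knn_graph(toy_expr, k=2)
```

Symmetrizing by union keeps every cell connected to at least `k` neighbors; an intersection rule would give a sparser graph and is an equally plausible choice.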
In summary, existing graph-based models for cell type annotation offer intricate representations of cellular interactions but often neglect the contextual richness derived from biological pathways. To address this gap, distinct graphs have been constructed articulating the full spectrum of cell–cell relationships, supplemented by a consensus graph concept to synthesize varied biological interactions. Integrating pathway information into models is indispensable, augmenting the comprehension of cell signaling pathways and fortifying biological network construction. This approach, enriched by multiple pathway databases, captures the complex nature of cellular activities. Through various graph structures and the innovative consensus graph concept, a foundation has been laid for an expansive representation of cellular relationships. Graph convolutional neural networks (GCNs) effectively capture high-order relational data between cells, leading to low-dimensional, biologically informative representations. This graph construction methodology redefines the benchmark for biological data representation, ensuring retention of intricate details from cellular pathways and culminating in more precise and biologically pertinent predictions from GCNs.
This work investigates the creation of biologically significant cell–cell graphs from pathway datasets. The analysis acknowledges distinct gene regulatory patterns unique to each cell type and diverse gene interaction relationships elucidated by various pathway databases, reflecting complex cellular life activities. Multiple graph representations, each based on different pathway datasets, have been engineered to encapsulate complete cell–cell relationships. The introduction of a consensus graph concept amalgamates these representations into a unified model capturing the essence of cellular interplay. The hypothesis is that by merging data from multiple pathway databases, a multi-view graph can be crafted, encapsulating cellular relationships involved in various biological processes. This multi-view pathway analysis enhances existing techniques by integrating data from various pathway databases, constructing an integrated graph that provides a holistic view of cellular processes and serves as a robust framework. This integrated approach, combined with the consensus graph concept, captures detailed cellular functions and regulatory networks, thereby improving cell type annotation.
Results
Overview description of scMCGraph model
A comprehensive computational framework that integrates gene expression with pathway activity was developed to annotate cell types within diverse scRNA-seq datasets, utilizing multiple pathway databases. Figure 1 provides a schematic overview of the scMCGraph model, illustrating the major stages of this computational framework. This method facilitated the development of a consensus representation of cell–cell interactions, which are derived from signaling pathways, thereby enabling precise cell type annotation. Affinity matrices, created from gene expression and pathway activity data, captured intricate intercellular relationships and were integrated using advanced computational techniques to form a robust model for cell type annotation.
Fig. 1.
Schematic overview of the scMCGraph model. The figure illustrates the workflow of the scMCGraph computational framework, from the initial generation of scRNA-seq expression matrices to the final cell type annotation. Key stages include the construction of matrices based on gene expression, calculation of pathway activity from six databases via the AUCell algorithm, integration and refinement of data through the graph fusion module, and application of the graph refinement module to highlight significant interactions. The process culminates with the extraction of low-dimensional embeddings by a GNN-encoder and the reconstruction of the data structure by a parallel decoder, enabling accurate cell type annotation
The process began with the independent construction of cell–cell affinity matrices for both reference and query datasets, based on their respective gene expressions. To further enhance the model’s ability to capture subtle signals from low-expressing cell types while filtering out noise, pathway activity data was integrated with gene expression matrices. The AUCell algorithm, which assesses the activation states of cellular pathways, was employed as a means to reduce the impact of non-essential genes. This pathway-focused approach prioritizes biologically relevant signals, thus enhancing the overall sensitivity of the model. By integrating pathway activity, even subtle biological signals from low-expressing cells are preserved, and noise from irrelevant genes is effectively reduced. To explore pathway-specific cellular functions in more detail, we utilized six different pathway databases. The AUCell algorithm was then applied to each dataset, analyzing these six databases. This analysis produced six pathway-cell affinity matrices for each dataset, illustrating the activation status of individual cells within each pathway. These matrices were then transformed into cell–cell affinity matrices based on pathways, capturing similarities in pathway activation across cells and providing a comprehensive map of cellular interactions grounded in shared biological processes.
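To make the pathway-scoring step concrete, the following is a simplified sketch of an AUCell-style score followed by a conversion to a pathway-based cell–cell affinity matrix. It is not the AUCell package: the function names, the `top_frac` window, and the use of cosine similarity for the cell–cell step are assumptions, since this section does not state which similarity measure was used.

```python
import numpy as np

def aucell_like_scores(expr, gene_sets, top_frac=0.05):
    """Rank-based gene-set activity per cell (simplified AUCell-style score).

    expr: (n_cells, n_genes); gene_sets: list of integer index arrays.
    Returns (n_cells, n_sets) scores in [0, 1].
    """
    n_cells, n_genes = expr.shape
    max_rank = max(1, int(np.ceil(top_frac * n_genes)))
    # rank 0 = most highly expressed gene within each cell
    order = np.argsort(-expr, axis=1, kind="stable")
    ranks = np.empty_like(order)
    rows = np.arange(n_cells)[:, None]
    ranks[rows, order] = np.arange(n_genes)[None, :]
    scores = np.zeros((n_cells, len(gene_sets)))
    for j, genes in enumerate(gene_sets):
        r = ranks[:, genes]
        # area under the recovery curve: a gene at rank r inside the window
        # is counted at every position from r to the window end
        area = np.where(r < max_rank, max_rank - r, 0).sum(axis=1)
        # best case: the set's genes occupy the top ranks
        best = sum(max_rank - t for t in range(min(len(genes), max_rank)))
        scores[:, j] = area / best
    return scores

def cosine_affinity(features):
    """Cell-cell affinity as cosine similarity of per-cell pathway scores."""
    norm = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.maximum(norm, 1e-12)
    return unit @ unit.T

# Toy data (hypothetical): two cells, each expressing one gene set only
expr = np.zeros((2, 40))
expr[0, :10] = 5.0
expr[1, 10:25] = 3.0
gene_sets = [np.arange(0, 10), np.arange(10, 25)]
scores = aucell_like_scores(expr, gene_sets, top_frac=0.25)
affinity = cosine_affinity(scores)
```

Because the score depends only on within-cell gene ranks, it is insensitive to per-cell scaling of expression, which is one reason rank-based scoring is attractive for noisy single-cell data.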
Following the creation of pathway-based cell–cell affinity matrices for each dataset, the analysis was refined by applying similarity network fusion (SNF [39]) to each dataset’s primary cell–cell affinity matrix along with its corresponding pathway-specific matrices as part of the graph fusion module. This was performed for both reference and query datasets, resulting in six SNF-enhanced cell–cell affinity matrices per dataset, each providing an integrated view of gene expression and pathway-specific interactions. By integrating multiple views from pathway-specific data, the SNF module further enhances the model’s ability to capture intercellular relationships and subtle signals from low-expressing cell types, improving the robustness of cell type annotation.
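The fusion of a gene-expression affinity matrix with one pathway-specific affinity matrix can be sketched as a two-view cross-diffusion in the spirit of SNF. This is a simplified illustration and not the reference SNF implementation: the normalization applied after each iteration, the neighborhood size `k`, the iteration count, and all function names are assumptions.

```python
import numpy as np

def full_kernel(W):
    """Row-normalised kernel: off-diagonal mass 1/2, self-similarity 1/2."""
    W = np.array(W, dtype=float)
    np.fill_diagonal(W, 0.0)
    P = W / (2.0 * W.sum(axis=1, keepdims=True))
    np.fill_diagonal(P, 0.5)
    return P

def local_kernel(W, k):
    """Sparse kernel keeping only each cell's k strongest neighbours."""
    W = np.array(W, dtype=float)
    np.fill_diagonal(W, 0.0)
    S = np.zeros_like(W)
    for i in range(W.shape[0]):
        nbrs = np.argsort(W[i])[-k:]
        S[i, nbrs] = W[i, nbrs]
    return S / S.sum(axis=1, keepdims=True)

def snf_two_views(W_expr, W_path, k=3, n_iter=10):
    """Fuse an expression view and one pathway view by cross-diffusion."""
    P1, P2 = full_kernel(W_expr), full_kernel(W_path)
    S1, S2 = local_kernel(W_expr, k), local_kernel(W_path, k)
    for _ in range(n_iter):
        # each view diffuses the *other* view's status through its own local graph
        P1_next = S1 @ P2 @ S1.T
        P2_next = S2 @ P1 @ S2.T
        P1, P2 = full_kernel(P1_next), full_kernel(P2_next)
    fused = (P1 + P2) / 2.0
    return (fused + fused.T) / 2.0   # symmetrise the final matrix

# Toy data (hypothetical): two random positive symmetric affinity matrices
rng = np.random.default_rng(0)

def random_affinity(n):
    A = rng.random((n, n)) + 0.1
    return (A + A.T) / 2.0

fused = snf_two_views(random_affinity(6), random_affinity(6), k=3, n_iter=5)
```

Under this sketch, repeating the call once per pathway database would give the six fused matrices per dataset described above.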
To synthesize these insights into a unified framework, the similarity subspace matrices fusion (SSMF), also within the graph fusion module, was employed for each dataset. This method merged the six SNF-enhanced matrices into a single, comprehensive cell–cell affinity matrix, encapsulating a holistic overview of cellular interactions. Each individual pathway view contributed to the final consensus map by enhancing its ability to prioritize biologically relevant signals and filter out noise. By dynamically adjusting the weights in each pathway view to optimize the consensus map, the model effectively reduces biases and errors that may arise from relying on a single data source. This approach resulted in two composite matrices, one for each of the reference and query datasets, representing a unified map of intercellular and pathway-driven affinities. To enhance the structural detail within these composite matrices, positive pointwise mutual information (PPMI) was applied as part of the graph refinement module, improving the representation of significant associations and reducing data noise. Consequently, two refined, unified cell–cell similarity matrices were obtained, each offering a detailed and integrated view of cellular interactions and pathway networks.
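The PPMI refinement step can be illustrated with the standard definition, PPMI(i, j) = max(0, log(p_ij / (p_i p_j))), applied to a non-negative affinity matrix. The sketch below assumes that formulation; whether the model applies PPMI directly to the consensus matrix or to a derived co-occurrence matrix is not specified in this section, and the function name and toy matrix are illustrative.

```python
import numpy as np

def ppmi(A):
    """Positive pointwise mutual information of a non-negative affinity matrix."""
    A = np.asarray(A, dtype=float)
    total = A.sum()
    row = A.sum(axis=1, keepdims=True)
    col = A.sum(axis=0, keepdims=True)
    expected = row @ col / total          # affinity expected under independence
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(A / expected)
    pmi[~np.isfinite(pmi)] = 0.0          # zero affinities carry no association
    return np.maximum(pmi, 0.0)           # keep only positive associations

# Toy affinity matrix (hypothetical): cells 0 and 1 are similar, cell 2 is distinct
A = np.array([[4.0, 4.0, 0.5],
              [4.0, 4.0, 0.5],
              [0.5, 0.5, 6.0]])
refined = ppmi(A)
```

In the toy matrix the weak cross-group affinities (0.5) fall below their expected value and are clipped to zero, which is the noise-suppressing behavior the refinement step relies on.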
The refined matrices served as the foundation for the cell type annotation model. A GNN-encoder was used to derive a low-dimensional embedding for each cell, crucial for the multi-class classification task. Utilizing known cell labels, a multi-class classification loss was defined to guide model training. Concurrently, a parallel decoder was employed to reconstruct the six pathway-specific cell–cell similarity matrices initially integrated by SNF, defining a reconstruction loss. To enhance the model’s predictive accuracy for cell type annotation, a loss based on Kullback-Leibler divergence was included, and the classification, reconstruction, and Kullback-Leibler losses were combined in a joint optimization framework. This integrative approach ensured that the model not only accurately annotates cell types but also captures the complexity of pathway-specific interactions, key for understanding the nuanced biological processes involved.
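As a rough illustration of how the three training terms might be combined, the sketch below sums a cross-entropy classification loss, a mean-squared reconstruction loss over the pathway views, and a Kullback-Leibler term. The weights `w_rec` and `w_kl`, the choice of mean squared error, and all function names are assumptions. This section does not state which two distributions the Kullback-Leibler term compares, so they are left as caller-supplied inputs `p` and `q`.

```python
import numpy as np

def cross_entropy(probs, labels):
    """Mean negative log-probability of the true class."""
    eps = 1e-12
    return float(-np.mean(np.log(probs[np.arange(len(labels)), labels] + eps)))

def reconstruction_loss(recon_views, target_views):
    """Mean squared error, averaged over the pathway-specific views."""
    return float(np.mean([np.mean((r - t) ** 2)
                          for r, t in zip(recon_views, target_views)]))

def kl_divergence(p, q):
    """KL(p || q) averaged over rows; p and q are row-stochastic matrices."""
    eps = 1e-12
    return float(np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=1)))

def joint_loss(probs, labels, recon_views, target_views, p, q,
               w_rec=1.0, w_kl=1.0):
    """Weighted sum of the three terms (weights are hypothetical)."""
    return (cross_entropy(probs, labels)
            + w_rec * reconstruction_loss(recon_views, target_views)
            + w_kl * kl_divergence(p, q))

# Toy check (hypothetical): uniform predictions over two classes,
# perfect reconstruction of six views, identical distributions
probs = np.full((4, 2), 0.5)
labels = np.array([0, 1, 0, 1])
views = [np.eye(4) for _ in range(6)]
loss = joint_loss(probs, labels, views, views, probs, probs)
```

With perfect reconstruction and identical distributions only the classification term remains, so the toy loss equals the entropy of a fair two-class guess.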
The scMCGraph model demonstrated enhanced robustness and accuracy in handling diverse batch effects, as evidenced by its performance across a spectrum of analyses, including cross-platform, cross-time, cross-sample, and clinical dataset analyses. In this section, the dataset preparation is extensively detailed. Furthermore, the performance of the methods was evaluated using multiple metrics, including the accuracy score (ACC), weighted F1 score, and balanced accuracy score (BA), with each providing complementary insights into the model’s overall performance. This comprehensive evaluation showcased the model’s ability to maintain consistent accuracy not only across different developmental stages and sequencing technologies but also in clinical settings. These attributes highlight scMCGraph’s utility in automated cell type annotation and its potential as a powerful tool for complex single-cell sequencing data analysis.
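The three evaluation metrics can be computed as in the sketch below, which follows their standard definitions: ACC is the fraction of correctly labeled cells, weighted F1 averages per-class F1 by class support, and BA is the mean per-class recall. The function name and toy labels are illustrative and do not come from the evaluation code of the study.

```python
from collections import Counter

def annotation_metrics(y_true, y_pred):
    """Accuracy, support-weighted F1, and balanced accuracy for label lists."""
    n = len(y_true)
    support = Counter(y_true)
    pairs = list(zip(y_true, y_pred))
    acc = sum(t == p for t, p in pairs) / n
    recalls, weighted_f1 = [], 0.0
    for c in sorted(support):
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = support[c] - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn)
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        recalls.append(recall)
        weighted_f1 += f1 * support[c] / n   # weight each class by its size
    return {"ACC": acc,
            "weighted_F1": weighted_f1,
            "BA": sum(recalls) / len(recalls)}

# Toy labels (hypothetical): one alpha cell is mislabelled as beta
y_true = ["alpha"] * 4 + ["beta"] * 2
y_pred = ["alpha", "alpha", "alpha", "beta", "beta", "beta"]
metrics = annotation_metrics(y_true, y_pred)
```

In the toy case ACC is 5/6 while BA is 0.875, showing how BA gives the smaller class equal weight and can therefore diverge from ACC on imbalanced data.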
Dataset preparation
To evaluate the proposed model comprehensively, we implemented a multifaceted experimental framework encompassing cross-platform, cross-time, cross-sample, and clinical dataset analyses. To further validate its scalability to larger datasets, we incorporated the breast invasive carcinoma E-MTAB-8107 [40] dataset, demonstrating the model’s capacity to handle more extensive biological data.
Cross-platform analysis
We utilized human pancreatic datasets (Baron Human [41], Muraro [42], Segerstolpe [43], and Xin [44]) alongside PBMC [45] datasets from both high-throughput (10Xv2, 10Xv3, Drop-Seq, inDrop, Seq-Well) and low-throughput (CEL-Seq2, Smart-Seq2) platforms. This yielded twelve reference-query pairs from the pancreatic datasets and forty-two from the PBMC datasets. The design excluded the reference dataset’s platform from the query sets to minimize platform-specific biases.
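The reported pair counts are consistent with taking every ordered (reference, query) combination of distinct datasets: 4 × 3 = 12 for the pancreatic datasets and 7 × 6 = 42 for the PBMC platforms. The sketch below assumes that enumeration; the helper name is illustrative.

```python
from itertools import permutations

pancreas = ["Baron Human", "Muraro", "Segerstolpe", "Xin"]
pbmc_platforms = ["10Xv2", "10Xv3", "Drop-Seq", "inDrop",
                  "Seq-Well", "CEL-Seq2", "Smart-Seq2"]

def reference_query_pairs(datasets):
    """All ordered (reference, query) pairs; a dataset never queries itself."""
    return list(permutations(datasets, 2))

pancreas_pairs = reference_query_pairs(pancreas)
pbmc_pairs = reference_query_pairs(pbmc_platforms)
print(len(pancreas_pairs), len(pbmc_pairs))  # 12 42
```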
Cross-time analysis
We used the GSE132188 [46] single-cell dataset, containing mouse embryonic pancreatic epithelial cells at stages E13.5, E14.5, and E15.5 (GSM3852753, GSM3852754, GSM3852755), to create two reference-query pairs. This approach allowed us to assess the consistency of the model’s cell type annotations across developmental stages.
Cross-sample analysis
We analyzed PBMC datasets from diverse samples sequenced on various platforms (CEL-Seq2, Drop-Seq, inDrop, Seq-Well, Smart-Seq2), establishing five reference-query pairs to evaluate the model’s robustness against biological variability.
Clinical dataset validation
We extended our evaluation to clinical settings by incorporating single-cell transcriptomic data from patients with atherosclerosis and osteoarthritis. The Human Artery dataset (GSE159677 [47]) included samples from the calcified core and adjacent non-lesioned arterial tissue from endarterectomy patients. The Human Bone dataset (GSE152805 [48]) contrasted chondrocytes from diseased medial and healthy lateral tibial plateaus of osteoarthritis patients. From these datasets, we collected 20 unique reference-query pairs to compare intersubject variability and four pairs to assess differences between disease states, thereby enhancing the model’s clinical applicability.
These diverse experimental setups provide a robust framework to evaluate the model’s ability to accurately annotate cell types across various conditions and clinical states.
Pathway databases
We integrated data from established pathway databases to enrich the model with detailed biological pathways and gene functions. This included the KEGG [49] database for metabolic and cellular processes; the PathwayCommons12 series [50], encompassing the humancyc_hgnc [51], panther_hgnc [52], and pathbank_hgnc [53] subsets; and the Reactome [54] and Wikipathways [55] databases for comprehensive coverage of human and cross-species biological pathways. This strategic integration enhanced the dataset with a multidimensional biological context, improving the model’s performance.
Robustness and accuracy of scMCGraph across diverse batch-effect conditions
To advance single-cell sequencing analysis, the scMCGraph model was developed, exemplifying versatility and accuracy. This model surpasses traditional batch-effect limitations, illustrating innovation and exceptional performance in automated cell type annotation. A series of experiments detailed in this section, including cross-platform, cross-time, and cross-sample analyses, comprehensively assess the model’s robustness and accuracy. To evaluate the performance of scMCGraph, we classify the comparison methods into three categories: (a) correlation-based methods, (b) marker-based methods, and (c) model-based methods. These categories will be used to compare against our proposed model to assess its performance comprehensively. As illustrated in Fig. 2a, the scMCGraph model achieves the highest mean ACC across most datasets, outperforming other methods, except for the dataset of homologous mouse embryonic pancreatic epithelial cells across different embryonic stages (cross-time), where scMCGraph secured the third-highest mean ACC, only surpassed by SingleR and TOSICA. This performance highlights its superior accuracy and generalization capability in cell type annotation tasks.
Fig. 2.
Visual and quantitative assessment of scMCGraph’s cell type annotation accuracy. a Line chart of the classification accuracy of scMCGraph compared to other methods. The chart demonstrates scMCGraph’s superior performance in accuracy. b Box plot comparing the ACC of the scMCGraph model with other methods across multiple human pancreatic dataset pairs. c Comparison of scMCGraph with 17 other methods on the PBMC dataset. d Line graph depicting the cross-time and cross-sample ACC, highlighting the stability of the scMCGraph model across different conditions. e t-SNE plots before and after batch correction, as well as post-feature aggregation, display scMCGraph’s refinement in cell clustering. f t-SNE plots of cell embeddings mapped by scMCGraph, compared to other models. These plots highlight scMCGraph’s superior ability to cluster cell populations within human pancreas and PBMC dataset pairs
The model’s efficacy in annotating cell types across diverse sequencing platforms was rigorously tested. It processed pancreatic dataset pairs and PBMC dataset pairs, covering both high-throughput and low-throughput technologies. This extensive cross-platform assessment, depicted in Fig. 2b and c, showcases scMCGraph’s ability to consistently mitigate platform-specific biases and maintain high performance across all datasets. Specifically, Fig. 2b compares scMCGraph with other methods on the pancreatic dataset pairs, where our model achieved a mean ACC of 0.9801 with a variance of 0.0106, both ranking first. In Fig. 2c, the model is compared with other methods across the PBMC dataset pairs, evaluating multiple performance metrics including ACC, BA, and F1 score. Our model achieved the highest average ACC of 0.8077, with the lowest standard deviation of 0.0723, highlighting its superior stability and performance. Furthermore, our model ranked fifth in average BA at 0.7257, demonstrating its strong performance in handling class imbalance. It also ranked second in average F1 score (0.7709), just behind CellTypist, underscoring its robustness and accuracy across multiple performance metrics.
The effectiveness of the scMCGraph model was further evaluated in both cross-time and cross-sample scenarios. For the cross-time analysis, the model was applied to datasets derived from mouse embryonic pancreatic epithelial cells across three developmental stages, where it demonstrated notable accuracy. Its robustness was further evidenced by its application to multiple PBMC datasets on the same sequencing platform. Figure 2d displays line graphs that illustrate the outcomes from both the cross-time and cross-sample experiments. Notably, in a specific cross-time pair (reference dataset GSM3852754 and query dataset GSM3852755), scMCGraph outperformed competing methods, achieving the highest accuracy. In the cross-sample analysis, it recorded the highest accuracy in four out of five reference-query pairs, involving platforms such as CEL-Seq2, inDrop, Seq-Well, and Smart-Seq2. These results collectively underscore scMCGraph’s strong generalizability and robustness across varying experimental setups, establishing its utility as a reliable tool in automated cell type annotation and its adaptability to diverse platforms and temporal stages.
Within the scMCGraph framework, the Harmony algorithm [56] was implemented to address batch effects across diverse single-cell datasets, effectively mitigating batch discrepancies during the integration of reference-query dataset pairs. Subsequently, a two-layer GCN aggregated features from two-hop neighbors within the graph, leveraging learned node embeddings for precise cell type annotation. To demonstrate the model’s capabilities concretely, t-SNE visualizations were conducted on the original gene expression data, the batch-effect-corrected features, and the node embeddings post-GCN for the dataset pair GSM3852754 and GSM3852755 (Fig. 2e). These visualizations clearly illustrate the initial disparities in gene expression between the datasets. After applying Harmony, the datasets converge, displaying emerging clustering trends within individual cell populations. Further processing through the GCN highlights these trends, systematically refining cellular features for cell type annotation, thus emphasizing the model’s exceptional proficiency in batch correction and cell type discrimination.
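The statement that a two-layer GCN aggregates features from two-hop neighbors can be checked with a minimal forward pass. The sketch uses the common symmetric normalization with self-loops and a ReLU hidden layer; these choices, the layer sizes, and the random weights are assumptions rather than the configuration used in the study.

```python
import numpy as np

def normalize_adjacency(A):
    """Symmetric normalisation D^{-1/2} (A + I) D^{-1/2}."""
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    return A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_forward(A, X, W1, W2):
    """Two-layer GCN: each layer mixes a node with its direct neighbours."""
    A_norm = normalize_adjacency(A)
    H = np.maximum(A_norm @ X @ W1, 0.0)           # layer 1: one-hop aggregation + ReLU
    logits = A_norm @ H @ W2                       # layer 2: reaches two-hop neighbours
    logits = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)  # softmax over cell types

# Toy graph (hypothetical): a path of four cells, 0 - 1 - 2 - 3
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 5))
W1 = rng.normal(size=(5, 8))
W2 = rng.normal(size=(8, 3))

probs = gcn_forward(A, X, W1, W2)

# Cell 3 is three hops from cell 0, so changing its features
# cannot affect cell 0's prediction after only two layers.
X_changed = X.copy()
X_changed[3] += 10.0
probs_changed = gcn_forward(A, X_changed, W1, W2)
```

Comparing `probs[0]` with `probs_changed[0]` shows the two-hop receptive field directly: the prediction for cell 0 is unchanged when only cell 3 is perturbed.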
In this work, the aim was to showcase the superior capability of the proposed model in automating cell type annotation and clustering cellular embedding features. Two dataset pairs from the human pancreas—Segerstolpe as reference with Baron as query and Muraro as reference with Segerstolpe as query—were meticulously evaluated. Additionally, two pairs from the PBMC dataset were scrutinized—Drop-Seq as reference with 10Xv2 and 10Xv3 as respective queries. The clustering of cell types across these four reference-query pairs was visualized using t-SNE plots, as shown (Fig. 2f). These findings reveal that conventional methods fail to segregate cell types based on the original gene expression. In the Segerstolpe-Baron pancreas dataset pair, the CHETAH algorithm was unable to effectively separate gamma from alpha cell types as well as delta from beta cell types. The scPred method struggled to discriminate beta from gamma cell types, and the SingleR approach was similarly unable to cleanly segregate gamma from alpha cell types. For the Muraro-Segerstolpe dataset pair, CHETAH again fell short in distinguishing gamma from alpha cell types, and scPred showed persistent admixture of a small number of beta cells in regions rich with alpha cells; SingleR exhibited difficulties in separating beta from delta cell types.
When analyzing the selected PBMC dataset pairs, the results show that in the Drop-Seq-10Xv2 comparison, both CHETAH and SingleR methods failed to adequately resolve CD4+ T cells from cytotoxic T cells as well as cytotoxic T cells from natural killer (NK) cells. The scPred algorithm could not satisfactorily separate CD14+ monocytes from CD16+ monocytes and also struggled with the separation of CD4+ T cells from cytotoxic T cells. In the Drop-Seq-10Xv3 dataset pair, CHETAH and SingleR again faced challenges in segregating CD4+ T cells from cytotoxic T cells, and scPred had difficulty distinguishing certain areas of CD16+ monocytes, cytotoxic T cells, and CD4+ T cells. However, the t-SNE projections generated by the model exhibited clear boundaries between different cell types, with minimal overlap in regions rich in specific cell types. This visually demonstrates that the proposed model has a significant advantage in discerning the gene expression differences among various cell types, thereby more effectively accomplishing the task of cell type annotation.
Clinical validation of scMCGraph for cell type identification across individual and disease states
In this section, the scMCGraph model underwent rigorous validation using clinical datasets from two diseases characterized by significant cellular heterogeneity and clinical importance: atherosclerosis and osteoarthritis. The validation confirmed the model’s efficacy, demonstrating its ability to accurately delineate cell types with clinical relevance. For atherosclerosis, single-cell RNA-sequencing data (GSE159677) from both the calcified core of atherosclerotic plaques and adjacent non-lesioned tissue were used to compare pathological and normal cell environments. In osteoarthritis, chondrocytes from both diseased and healthy states within individuals (GSE152805) were analyzed. Utilizing these datasets, 12 reference-query pairs in osteoarthritis and 8 in atherosclerosis were used for cross-individual comparisons, and 4 pairs in osteoarthritis were used for cross-health condition comparisons, enabling detailed comparisons across subjects and disease states.
The experimental results are depicted in Fig. 3a and b, where scMCGraph is shown to outperform advanced comparative methodologies in intersubject experiments. Although scMCGraph’s maximum accuracy in the inter-disease state experiments was slightly lower than that of SingleR, its minimum accuracy was notably higher, highlighting its robust performance and reliability.
Fig. 3.
Efficacy and accuracy of scMCGraph in cell type classification across clinical datasets. a The accuracy of the proposed model was compared with other advanced methods on the cross-individual dataset. This comparison demonstrates that scMCGraph achieves superior performance in both accuracy and robustness. b The accuracy of scMCGraph was compared with other methods on the cross-health condition dataset. It can be seen that scMCGraph demonstrates superior reliability and better performance in terms of minimum accuracy. c Visualization of cell type classification accuracies uses a Sankey diagram for GSM4837524-GSM4837528 and a Chord diagram for GSM4837526-GSM4837524, highlighting scMCGraph’s robust performance. d Heat maps show scMCGraph’s cell type discrimination for GSM4837527-GSM4837525 and intercellular relationship insights critical for tissue microenvironment analysis
The Sankey and Chord diagrams (Fig. 3c) depict the classification results for the GSM4837524-GSM4837528 and GSM4837526-GSM4837524 dataset pairs, respectively. Although scMCGraph faced challenges in accurately annotating rare cell types like B cells and mast cells, it classified most cells correctly, with only minimal mislabeling of mesenchymal stem (MSC) cells as fibroblasts. In contrast, other methods such as MarkerCount misclassified numerous MSC cells, and scLearn and scPred had difficulties with myeloid cells, smooth muscle (SMC) cells, and other cell types. SingleR also misclassified MSC cells. The Chord diagram further elucidates the prediction challenges faced by comparative methods, with MarkerCount and scLearn leaving many cells unannotated and scPred underperforming in classifying NK cells. The scMCGraph model, however, demonstrated robust performance across most cell types, with NK cells being the primary exception.
The heat map (Fig. 3d) of cell type classification for the GSM4837527-GSM4837525 dataset pair is presented alongside a cell correlation heatmap. This comparison reveals that although the model had some difficulties with MSC and plasma cells, it was effective for other cell types. Comparative methods like MarkerCount and scmapCell struggled particularly with SMC, plasma, NK, and MSC cells. scPred showed poor performance for plasma, NK, and MSC cells, with a significant number of cells remaining unannotated across several methods. The cell correlation heatmap serves to elucidate the relationships between cell types, which is imperative for studying cell-to-cell communication and understanding the tissue microenvironment’s complexity.
Pathway analysis and expression profiling in single-cell data
Cellular processes are governed by a complex array of biochemical reactions and tightly controlled pathways. These pathways, which are fundamental to the flow of cellular information, serve as the blueprints for critical functions such as gene interactions, metabolic processes, and signal transduction. Their comprehensive analysis is indispensable for unraveling the complexities of cellular functions and disease mechanisms. In this study, pathway analysis was performed using cell embeddings generated by our model based on the GSM4626768 dataset. t-SNE was applied to these embeddings to identify the pathways that best differentiate distinct cell types within the embedding space. One representative pathway was selected from each of six different pathway databases. From KEGG, the human ribosome pathway (hsa03010) was selected, which elucidates the assembly and functions of ribosomal subunits, vital for the translation of mRNA into proteins. The glycolysis/gluconeogenesis pathway (hsa00010) from humancyc_hgnc was chosen to represent metabolic flux in transitioning between glucose breakdown and synthesis. The vasopressin synthesis pathway from panther_hgnc was included, detailing the regulatory process from gene transcription to hormone secretion. From pathbank_hgnc, the protein synthesis pathway was selected, emphasizing the intricate process of translating genetic information into functional proteins. The Reactome database contributed the peptide chain elongation pathway (R-HSA-156902), a critical step in protein synthesis involving ribosomal catalysis. Lastly, the Wikipathways database provided insights into the cytoplasmic ribosomal proteins pathway, focusing on the protein components of the ribosome. These pathways were then visualized using a t-SNE plot, and the resulting figure (Fig. 4a) highlights the AUCell scores of the selected pathways across different cells, represented through color-coding and contour lines.
Fig. 4.
Pathway analysis across single-cell RNA-seq datasets. a t-SNE plot visualization of pathway expression within the GSM4626771 single-cell dataset. Pathways selected from six distinct databases are color-coded to represent their relative expression levels across different cell embeddings. b Heat map showing the AUCell scores of the top 10 pathways from each database across the GSM4626768 (health) and GSM4626771 (disease) datasets. The consistency of pathway expression across datasets and cell types demonstrates the robustness of pathway analysis in single-cell RNA-seq data
Simultaneously, it is important to consider that single-cell data inherently possess challenges such as high noise levels and dropout events that can obscure true biological signals. Fortunately, the impact of these factors on pathway-level analyses is relatively mitigated due to the nature of pathway information. To assess the robustness of the pathway analysis against these challenges, we first computed AUCell scores for each of the six pathway databases separately across the two datasets, GSM4626768 (health) and GSM4626771 (disease). Both datasets include seven cell types: HomC, RegC, RepC, HTC, FC, preFC, and preHTC. For each dataset, we selected the top 10 pathways with the highest AUCell scores from each pathway database. The heat map in Fig. 4b shows that the AUCell scores of these top 10 pathways are highly consistent across the two datasets (healthy and diseased) for each corresponding cell type, demonstrating the robustness of the pathway analysis in capturing reliable biological insights despite the inherent challenges of single-cell data.
Comprehensive evaluation and optimization of scMCGraph
A series of parameter selection experiments, ablation studies, and additional evaluations were conducted to optimize and validate scMCGraph for cell type annotation. We evaluated model performance under various parameter settings and performed ablation experiments to assess the contributions of key components, including SSMF, PPMI, the pathway databases, and KL divergence. Beyond optimization, we also evaluated the model’s computational efficiency, scalability to larger datasets, sensitivity to pathway sparsity, and predictive reliability using uncertainty quantification techniques. Because parameter choices strongly affect model performance, a human pancreatic dataset was used to investigate the percentage of neighbor nodes used in the construction of the k-NN graphs (i.e., the k value). The results are shown as a box plot (Fig. 5a). A k value of 2% outperformed the 1% and 5% alternatives: at 1%, accuracy was suboptimal, and at 5%, the expanded neighborhood adversely affected performance.
Fig. 5.
Visualization of parameter selection and ablation study results. a The box plot represents the variation in accuracy with different k values in the k-NN graph construction using human pancreatic datasets. b A comparative analysis of the SSMF and SUM methods was conducted using the PBMC dataset from the CEL-Seq2 platform, which underscored the enhanced performance of SSMF in integrating cell similarity graphs with pathway information. c The PPMI ablation study on the Human Bone dataset, showcasing the benefits of PPMI in enhancing the integration of cell similarity graphs by emphasizing biologically relevant pathway information. d The removal of individual sources from the pathway database leads to diminished model performance, as revealed by analyses using the PBMC dataset from the Smart-Seq2 platform, thus underscoring the critical role of pathway diversity. e The evaluation of KL divergence in optimizing model accuracy was demonstrated across 42 PBMC dataset pairings, affirming its essential contribution to enhancing performance. f Training times for different methods. g Memory usage for different methods. h Predictive performance of the scMCGraph method on small datasets at varying proportions. i Comparison of predictive performance of the scMCGraph model with varying degrees of sparsity in pathway databases
In the SSMF ablation experiment, the PBMC dataset from the CEL-Seq2 platform was used as the reference, with the datasets from the six other platforms serving as query datasets. The SUM method, which aggregates the graphs by direct summation, was used as the baseline. The results, shown as a slide bead diagram (Fig. 5b), indicate that SSMF consistently outperformed SUM across all dataset pairs. The gap was largest when the Drop-Seq dataset was the query, where the ACC of SSMF exceeded that of SUM by a substantial margin. This suggests that SSMF integrates the individual graphs into a more precise and coherent unified graph structure than simple summation. In the PPMI ablation experiment on the Human Bone dataset, five reference-query dataset pairs were formed, and the results are shown in Fig. 5c. PPMI consistently improved model performance across all pairs, with the most pronounced improvement for the GSM4626767-GSM4626769 reference-query pair. This supports the conclusion that PPMI improves the model’s data integration capabilities by contributing to a more comprehensive global graph structure.
In the pathway database ablation study, the PBMC dataset from the Smart-Seq2 platform was used as the reference dataset, with the datasets from the six other platforms serving as query datasets, giving six reference-query dataset pairs. For each pair, one pathway database at a time was removed, and the results were visualized (Fig. 5d). This systematic elimination revealed the individual contribution of each database to the model’s performance. Whenever any single pathway database was removed, performance was consistently lower than when the full set of databases was used, indicating that integrating multiple sources of pathway information is crucial. These results underscore the importance of a consensus representation approach, in which multiple pathway databases together support a more robust and accurate integration of diverse biological data. The KL divergence ablation study used the PBMC dataset from each platform in turn as the reference dataset, with the datasets from the remaining six platforms serving as query datasets, resulting in 42 reference-query dataset pairs. The comparisons were visualized using scatter plots, in which each dot represents an individual dataset pair and a solid black line indicates the average ACC across all pairs (Fig. 5e). The average ACC with KL divergence was clearly higher than without it, indicating that incorporating KL divergence substantially improves the ACC of cell type annotation.
In the computational complexity analysis, we compared scMCGraph with 17 other methods in terms of runtime and memory usage, using the PBMC dataset from the CEL-Seq2 platform as the reference dataset and the PBMC dataset from the 10Xv2 platform as the query dataset. As shown in Fig. 5f and g, singleCellNet was the fastest method and scAGN used the least memory, whereas scBERT was the most demanding in both time and memory. Correlation-based methods took 32.79 s on average and required 2730.67 MB of RAM. Marker-based methods averaged 10.44 s and 645.79 MB of RAM. Model-based methods took considerably longer, averaging 212.95 s, and used more RAM, averaging 3297.29 MB. Thus, marker-based methods were the least resource-demanding, followed by correlation-based methods, with model-based methods consuming the most resources. scMCGraph ran in 43.84 s and required 1977.31 MB of memory. Among the 14 model-based methods, scMCGraph ranked 6th in speed and 5th in memory efficiency, indicating a balanced trade-off between execution time and memory usage.
To investigate the minimum number of cells required for effective training of scMCGraph, additional experiments were performed using the PBMC dataset from the Smart-Seq2 platform, the smallest dataset in our study, as the reference dataset, with the datasets from the six other platforms serving as query datasets. Cells were randomly sampled from the reference dataset at different proportions (0.2, 0.4, 0.6, 0.8, and 1.0) for each cell type. At the 1.0 proportion, the dataset contained 253 cells in total: 117 cytotoxic T cells, 58 CD4+ T cells, 34 CD14+ monocytes, 22 B cells, 14 megakaryocytes, and 8 CD16+ monocytes. The ACC was calculated for each sampling proportion. As shown in Fig. 5h, the average ACC was 0.1162 at both the 0.2 and 0.4 proportions, rose to 0.6529 at 0.6, and improved further to 0.7529 at 0.8. The highest average ACC, 0.7996, was achieved when all available cells (1.0 proportion) were used. The standard deviations were 0.0364 at 0.2 and 0.4, 0.0599 at 0.6, 0.0392 at 0.8, and 0.0447 at 1.0, indicating broadly stable performance at larger sample sizes. Training effects therefore become apparent at the 0.6 proportion (149 cells), with the best performance at the full sample size (253 cells). These results suggest that approximately 250 cells in total, with roughly 10 cells per cell type, are needed for reliable model performance, which is a relatively small number of cells. This experiment shows that the model can handle small datasets while maintaining robust cell type annotation.
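The cell counts behind each sampling proportion can be checked with a short sketch. The per-type counts are those listed above; rounding down within each cell type is an assumption, chosen because it reproduces the 149 cells reported for the 0.6 proportion. The function name `sampled_counts` is hypothetical.

```python
import math

# Per-type cell counts of the Smart-Seq2 PBMC reference, as listed in the text.
CELL_COUNTS = {
    "cytotoxic T": 117,
    "CD4+ T": 58,
    "CD14+ monocyte": 34,
    "B": 22,
    "megakaryocyte": 14,
    "CD16+ monocyte": 8,
}


def sampled_counts(counts, proportion):
    """Cells kept per type when every type is subsampled at `proportion`.

    Flooring is an assumption (not stated in the text); it is the rounding
    that gives 149 cells at the 0.6 proportion.
    """
    return {name: math.floor(n * proportion) for name, n in counts.items()}


for p in (0.2, 0.4, 0.6, 0.8, 1.0):
    kept = sampled_counts(CELL_COUNTS, p)
    # Total cells and the size of the rarest type at this proportion.
    print(p, sum(kept.values()), min(kept.values()))
```

Under this assumption, the rarest cell type contributes only a handful of cells at the lower proportions, which is one plausible reason for the low ACC at 0.2 and 0.4.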
To evaluate the scalability and robustness of scMCGraph on larger and more complex datasets, we tested the model on five datasets from the E-MTAB-8107 collection, each containing between 2000 and 4000 cells: sc5rJUQ024, sc5rJUQ026, sc5rJUQ033, sc5rJUQ050, and sc5rJUQ060. To simulate a larger dataset, we concatenated four of these datasets to form the reference (training) set and used the remaining dataset as the query (test) set, rotating the held-out dataset and calculating the model’s ACC in each case. With sc5rJUQ026, sc5rJUQ033, sc5rJUQ050, and sc5rJUQ060 as the reference (12,218 cells) and sc5rJUQ024 as the query (3426 cells), scMCGraph achieved an accuracy of 0.8573. With sc5rJUQ024, sc5rJUQ033, sc5rJUQ050, and sc5rJUQ060 as the reference (13,428 cells) and sc5rJUQ026 as the query (2216 cells), the accuracy was 0.8150. With sc5rJUQ024, sc5rJUQ026, sc5rJUQ050, and sc5rJUQ060 as the reference (11,796 cells) and sc5rJUQ033 as the query (3848 cells), the accuracy was 0.8132. With sc5rJUQ024, sc5rJUQ026, sc5rJUQ033, and sc5rJUQ060 as the reference (12,626 cells) and sc5rJUQ050 as the query (3018 cells), the accuracy reached 0.9248. Finally, with sc5rJUQ024, sc5rJUQ026, sc5rJUQ033, and sc5rJUQ050 as the reference (12,508 cells) and sc5rJUQ060 as the query (3136 cells), the accuracy was 0.7985. Accuracy exceeded 0.80 in four of the five configurations and was 0.7985 in the remaining one. These results indicate that scMCGraph scales to larger datasets without compromising accuracy.
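The leave-one-dataset-out design can be laid out explicitly. The sketch below uses only the query sizes and accuracies reported above; the function name `leave_one_out` and the dictionary layout are illustrative choices, not part of scMCGraph.

```python
# Per-dataset cell counts (the reported query sizes) and the reported ACC
# obtained when each dataset is held out as the query.
CELLS = {
    "sc5rJUQ024": 3426,
    "sc5rJUQ026": 2216,
    "sc5rJUQ033": 3848,
    "sc5rJUQ050": 3018,
    "sc5rJUQ060": 3136,
}
ACC = {
    "sc5rJUQ024": 0.8573,
    "sc5rJUQ026": 0.8150,
    "sc5rJUQ033": 0.8132,
    "sc5rJUQ050": 0.9248,
    "sc5rJUQ060": 0.7985,
}


def leave_one_out(cells):
    """For each held-out query, list the four reference datasets and sizes."""
    total = sum(cells.values())
    return {
        query: {
            "reference": sorted(name for name in cells if name != query),
            "reference_cells": total - size,
            "query_cells": size,
        }
        for query, size in cells.items()
    }


splits = leave_one_out(CELLS)
for query, info in splits.items():
    print(query, info["reference_cells"], info["query_cells"], ACC[query])

mean_acc = sum(ACC.values()) / len(ACC)
print("mean ACC over the five configurations:", round(mean_acc, 4))
```

Each reference size is the total cell count minus the held-out query, so the five reference sizes quoted in the text follow directly from the five query sizes.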
To investigate the sensitivity of scMCGraph to pathway sparsity, we conducted additional experiments using varying proportions of pathways from all pathway databases. In these experiments, we used the Smart-Seq2 platform PBMC dataset as the reference, with the datasets from each of the other six platforms serving as query datasets. For each pathway database, we randomly selected subsets representing 0.2, 0.4, 0.6, 0.8, and 1.0 proportions of the total number of pathways and assessed the model’s predictive performance. The results, shown in Fig. 5i, reveal that scMCGraph’s performance generally improves as the number of pathways increases. The average accuracies ranged from 0.7017 (with a 0.2 proportion of the pathways) to 0.8162 (with all pathways), with standard deviations of 0.0592, 0.0544, 0.0656, 0.0647, and 0.0631, respectively. Interestingly, some dataset pairs, such as PBMC1_Smart-PBMC1_inDrop, showed fluctuations in performance between the 0.4 and 0.6 proportions, but the model reached stable performance once the pathway proportion reached 0.8. These findings suggest that while reducing the number of pathways decreases accuracy, performance tends to stabilize when at least 80% of the pathways are retained. Lower proportions may result in an overly sparse pathway network, which could hinder the model’s ability to represent cellular states and degrade predictive performance.
Additionally, we adopted the post-hoc uncertainty quantification approach for multi-class problems proposed by Khatri et al. [57] to evaluate the predictive reliability of scMCGraph. Specifically, we first calculated nonconformity scores for each sample in the reference dataset based on the prediction scores. These scores were then used to establish a threshold, selecting the top 0.025 of the nonconformity scores. This threshold, derived from the training set, was applied to determine the confidence sets for each prediction score in the query dataset. The model’s performance was assessed using two key metrics: coverage and efficiency. Coverage indicates the proportion of true labels included within the confidence sets, reflecting the model’s reliability, while efficiency measures the average size of these confidence sets, aiming for smaller sets with maintained high coverage. To assess the model’s performance across different platforms, we used the PBMC dataset from the Seq-Well platform as the reference and employed PBMC datasets from the three largest platforms in terms of cell count among the remaining six platforms—namely, 10Xv2, 10Xv3, and Drop-Seq—as query datasets. These datasets contain 5398, 2700, and 2835 cells, respectively. The model demonstrated an impressive average coverage of 98.48%, indicating high reliability in predicting true labels across datasets. In terms of efficiency, the model achieved an average value of 2.97, suggesting it effectively minimizes the size of the confidence sets while maintaining robust predictive performance. For individual datasets, the 10Xv2 dataset achieved the highest coverage of 99.17% with an efficiency of 2.47, followed by the 10Xv3 dataset, which showed a coverage of 98.81% and an efficiency of 3.19, and the Drop-Seq dataset, which achieved a coverage of 97.46% with an efficiency of 3.24. 
These results highlight the model’s strong performance in terms of both reliability and efficiency, making it well-suited for applications requiring post-hoc uncertainty quantification in single-cell transcriptomic analyses.
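The calibration-and-evaluation loop described above can be sketched as a generic split-conformal procedure. This is a minimal sketch and not necessarily the exact procedure of Khatri et al. [57]: the nonconformity score (one minus the predicted probability of a class) is an assumed, commonly used choice, the function names are hypothetical, and the toy probabilities are made up for illustration.

```python
import numpy as np


def nonconformity(probs, labels):
    """Assumed score: 1 - predicted probability of the given label."""
    return 1.0 - probs[np.arange(len(labels)), labels]


def calibrate_threshold(ref_probs, ref_labels, alpha=0.025):
    """Score exceeded by only the top `alpha` fraction of reference cells."""
    scores = nonconformity(ref_probs, ref_labels)
    return np.quantile(scores, 1.0 - alpha)


def confidence_sets(query_probs, threshold):
    """Boolean matrix; entry (i, c) is True if class c is in cell i's set."""
    return (1.0 - query_probs) <= threshold


def coverage_and_efficiency(sets, query_labels):
    """Coverage: share of true labels inside their set. Efficiency: mean set size."""
    coverage = sets[np.arange(len(query_labels)), query_labels].mean()
    efficiency = sets.sum(axis=1).mean()
    return coverage, efficiency


# Toy prediction scores (illustrative only; 3 classes).
ref_probs = np.array([[0.8, 0.1, 0.1], [0.6, 0.3, 0.1],
                      [0.2, 0.7, 0.1], [0.1, 0.2, 0.7]])
ref_labels = np.array([0, 0, 1, 2])
query_probs = np.array([[0.7, 0.2, 0.1], [0.4, 0.35, 0.25], [0.1, 0.1, 0.8]])
query_labels = np.array([0, 1, 2])

# alpha is enlarged here only because the toy reference has four cells.
threshold = calibrate_threshold(ref_probs, ref_labels, alpha=0.25)
sets = confidence_sets(query_probs, threshold)
print(coverage_and_efficiency(sets, query_labels))
```

With `alpha=0.025`, as in the text, the threshold is set so that only the 2.5% most nonconforming reference cells exceed it, which is why coverage on the query data is expected to be high while the confidence sets may contain several labels.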
Discussion
In this study, we utilized the capabilities of consensus representation in multi-view learning to enhance the accuracy and robustness of cell type annotation. This approach involves analyzing complex data from multiple analytical perspectives, crucially incorporating insights from cell–cell graphs derived from gene signaling pathways. The efficacy of this consensus representation within our enhanced multi-view learning framework is evident in its ability to handle variability within a single data type, thus promoting models that demonstrate remarkable generalizability across various datasets and resistance to overfitting.
The benefits of the consensus representation strategy within a multi-view learning framework are multifaceted. It deepens our understanding of the cellular landscape by discriminating between meaningful biological signals and potential noise or outliers. Moreover, it refines cell type inference within heterogeneous populations, a task complicated by the complexity of biological systems. By incorporating a consensus representation of cellular graphs from multiple analytical views of the same dataset, our model captures subtle yet significant biological patterns, substantially improving the accuracy of cell type annotations.
However, our approach does have certain limitations. Firstly, the reliance on AUCell scoring to integrate prior knowledge of gene pathways introduces fixed correlations between cells and pathways, potentially limiting the model’s ability to learn these relationships dynamically. In future work, we plan to incorporate adaptive learning mechanisms, such as attention networks, to allow the model to automatically determine the most relevant pathways for each cell type. Secondly, while our method works effectively with broad pathway databases, the granularity of pathways within these databases remains underexplored. Future research could benefit from considering more detailed sub-classifications within pathways, offering deeper insights into specific biological functions and regulatory processes. In conclusion, the integration of consensus representation in multi-view learning significantly improves the accuracy and robustness of cell type annotation, particularly in complex biological datasets. Despite some limitations, our approach provides a promising framework for single-cell analysis, with potential for further refinement and application across a broader range of biological questions.
Conclusions
As we chart the course for future research, we acknowledge the challenges posed by the growing complexity and volume of biological data. Scaling consensus representation strategies within multi-view learning algorithms to manage this increase and enhancing model interpretability are essential goals. The integration of diverse data types and platforms presents both challenges and opportunities for advancing consensus-based multi-view learning frameworks. The evolution of these methodologies promises to enhance our investigative capabilities into biological systems, potentially revealing insights at an unprecedented level of detail.
In summary, the strategy that combines consensus representation with multi-view learning offers a powerful approach for integrating diverse analytical perspectives within a single data modality, leading to more precise cell type annotation and a nuanced understanding of pathway analysis. These advancements hold significant potential for unraveling complex biological systems and guiding the development of new therapeutic strategies. As we continue to refine this approach, it is expected to yield deeper insights into the vast and intricate tapestry of life, contributing to the broader goals of biomedicine and biomarker discovery.
Methods
Enhancing cellular profiling with pathway-informed gene expression
In the initial phase, cell–cell affinity matrices are constructed from the gene expression profiles of the reference and query datasets, respectively. The affinity score between a pair of cells is defined as:
| 1 |
where is the total number of genes.
Subsequently, to delineate pathway-specific cellular functions, the AUCell algorithm was applied to both datasets using six distinct pathway databases. This yielded six pathway-cell matrices for the reference dataset and six for the query dataset. These matrices were then transformed into pathway-based cell–cell affinity matrices.
The pathway-based affinity score between a pair of cells is defined as:
| 2 |
Within this analytical framework, multimodal integration techniques are used to consolidate the different affinity matrices. Specifically, the similarity network fusion (SNF) method merges multiple graph structures into a single, robust network graph that retains the integrity of the original data while improving noise resilience and interpretability. SNF is used to integrate the gene expression-based affinity matrix of each dataset with each of its pathway-based affinity matrices. This results in a series of integrated matrices for the reference dataset and for the query dataset, each encapsulating both gene expression and pathway information. The corresponding similarity matrices are iteratively updated according to the following formulas:
| 3 |
| 4 |
After the iterations, SNF outputs the fused pathway matrix as:
| 5 |
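The fusion step can be sketched for a single pair of views (one gene expression affinity matrix and one pathway affinity matrix). This is a minimal sketch of the generic two-view SNF scheme, not the authors’ implementation: the neighbourhood size `k`, the number of iterations, the per-iteration renormalisation, and all function names are assumptions, and strictly positive affinities are assumed so that no row sums to zero.

```python
import numpy as np


def full_kernel(W):
    """Row-normalised affinity with 1/2 on the diagonal (global SNF kernel)."""
    W = W.astype(float).copy()
    np.fill_diagonal(W, 0.0)
    P = W / (2.0 * W.sum(axis=1, keepdims=True))
    np.fill_diagonal(P, 0.5)
    return P


def local_kernel(W, k):
    """Affinity restricted to each cell's k nearest neighbours, row-normalised."""
    W = W.astype(float).copy()
    np.fill_diagonal(W, 0.0)
    S = np.zeros_like(W)
    nearest = np.argsort(-W, axis=1)[:, :k]
    rows = np.arange(W.shape[0])[:, None]
    S[rows, nearest] = W[rows, nearest]
    return S / S.sum(axis=1, keepdims=True)


def _stabilise(P):
    """Row-normalise and symmetrise after each update (an assumed choice)."""
    P = P / P.sum(axis=1, keepdims=True)
    return (P + P.T) / 2.0


def snf_two_views(W_gene, W_pathway, k=3, n_iter=10):
    """Cross-diffuse two affinity matrices and average the results."""
    P_g, P_p = full_kernel(W_gene), full_kernel(W_pathway)
    S_g, S_p = local_kernel(W_gene, k), local_kernel(W_pathway, k)
    for _ in range(n_iter):
        # Each view is updated by diffusing the *other* view's status matrix
        # through its own local neighbourhood structure.
        P_g_next = _stabilise(S_g @ P_p @ S_g.T)
        P_p_next = _stabilise(S_p @ P_g @ S_p.T)
        P_g, P_p = P_g_next, P_p_next
    return (P_g + P_p) / 2.0


def toy_affinity(n, rng):
    """Random symmetric, strictly positive affinity matrix (illustration only)."""
    M = rng.random((n, n)) + 0.1
    return (M + M.T) / 2.0


rng = np.random.default_rng(0)
fused = snf_two_views(toy_affinity(12, rng), toy_affinity(12, rng))
print(fused.shape)
```

In the setting described above, this function would be called once per pathway database, pairing the gene expression affinity matrix with each pathway-based affinity matrix in turn.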
Finally, a k-NN graph is constructed for both the reference and query datasets, with k set to 2% of the total number of samples in each dataset. This transforms the integrated matrices into a series of cell similarity graphs with pathway implications for the reference dataset and for the query dataset. The k-NN graph construction is defined as:
| 6 |
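A minimal sketch of the k-NN step, with k set to 2% of the number of cells, is given here. Symmetrising the graph and excluding self-neighbours are assumptions rather than details stated in the text, and the function name `knn_graph` is hypothetical.

```python
import numpy as np


def knn_graph(similarity, fraction=0.02):
    """Binary k-NN graph from a cell-cell similarity matrix.

    k is `fraction` of the number of cells (at least 1). The result is
    symmetrised, so an edge exists if either cell selects the other.
    """
    n = similarity.shape[0]
    k = max(1, int(round(fraction * n)))
    S = similarity.astype(float).copy()
    np.fill_diagonal(S, -np.inf)              # a cell is not its own neighbour
    nearest = np.argsort(-S, axis=1)[:, :k]   # k most similar cells per row
    A = np.zeros((n, n), dtype=int)
    A[np.arange(n)[:, None], nearest] = 1
    return np.maximum(A, A.T), k


rng = np.random.default_rng(0)
sim = rng.random((100, 100))
sim = (sim + sim.T) / 2.0
graph, k = knn_graph(sim)
print(k, graph.sum())
```

For a dataset of 100 cells this gives k = 2, which illustrates how sparse the resulting graphs are at the 2% setting.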
Synthesizing integrated cellular similarity graphs with pathway implications
After the cell similarity graphs have been generated for the reference and query datasets, SSMF is applied to integrate them. SSMF retains the essential structural features of each individual graph while using structural similarity to guide a robust fusion into a composite graph structure. It integrates multiple adjacency matrices by exploiting the inherent similarities between them. It begins by extracting spectral features from each matrix, using the eigenvalues as a measure of similarity. These eigenvalues are used to create a distance matrix, which quantifies how far apart the matrices are.
From this, a similarity matrix is constructed, reflecting the interconnected relationships among the matrices. This matrix subsequently directs the fusion of the adjacency matrices into a comprehensive matrix that encapsulates the aggregated information. The fusion is realized through a weighted approach that effectively amalgamates the similarity matrix with the original adjacency matrices, culminating in the integrated matrix. SSMF’s strategic use of leading eigenvalues for thresholding enhances the efficacy of the fusion, ensuring a more accurate and interpretable representation of the integrated data. This approach significantly improves data analysis, yielding a refined and reliable synthesis of multimodal data.
For the reference and query datasets, each cell similarity graph with pathway implications is subjected to spectral analysis to calculate its eigenvalues:
| 7 |
Subsequently, a distance matrix is calculated based on the spectral eigenvalues:
| 8 |
Building upon the distance matrix, a similarity matrix is then computed:
| 9 |
Finally, the matrices are fused:
| 10 |
Here, is the resultant fused matrix, and represents the cell similarity graph structure with pathway implications.
Thresholding is then applied to the integrated adjacency matrix to produce a binarized adjacency matrix, as follows:
| 11 |
In this procedure, the threshold is set to the value corresponding to 2% of the total number of samples in the reference and query datasets, respectively. This produces the integrated, fused matrix for the reference dataset and for the query dataset.
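The sequence of steps (eigenvalues, distance matrix, similarity matrix, weighted fusion, thresholding) can be sketched as follows. This is only an illustration of the general idea: the number of leading eigenvalues `r`, the Euclidean distance between spectra, the exponential similarity, the weights taken from normalised row sums, and the top-k-per-row thresholding are all assumptions, and `ssmf_fuse` is a hypothetical name.

```python
import numpy as np


def leading_eigenvalues(A, r):
    """The r largest eigenvalues of a symmetric matrix."""
    values = np.linalg.eigvalsh(A.astype(float))  # returned in ascending order
    return values[::-1][:r]


def ssmf_fuse(graphs, r=5, fraction=0.02):
    """Fuse symmetric adjacency matrices using spectral similarity weights."""
    spectra = np.array([leading_eigenvalues(A, r) for A in graphs])
    # Distance matrix between graphs, computed from their spectra (assumed Euclidean).
    diff = spectra[:, None, :] - spectra[None, :, :]
    D = np.linalg.norm(diff, axis=2)
    scale = D.mean() if D.mean() > 0 else 1.0
    # Similarity matrix between graphs (assumed exponential kernel).
    S = np.exp(-D / scale)
    # Graphs that resemble the others more receive a larger weight.
    weights = S.sum(axis=1) / S.sum()
    fused = sum(w * A for w, A in zip(weights, graphs))
    # Binarise: keep the k strongest entries in each row, k = fraction of n.
    n = fused.shape[0]
    k = max(1, int(round(fraction * n)))
    ranked = fused.astype(float).copy()
    np.fill_diagonal(ranked, -np.inf)
    nearest = np.argsort(-ranked, axis=1)[:, :k]
    B = np.zeros((n, n), dtype=int)
    B[np.arange(n)[:, None], nearest] = 1
    return B, weights


rng = np.random.default_rng(0)
toy_graphs = []
for _ in range(3):
    M = (rng.random((50, 50)) < 0.1).astype(float)
    M = np.maximum(M, M.T)
    np.fill_diagonal(M, 0.0)
    toy_graphs.append(M)

binary, weights = ssmf_fuse(toy_graphs)
print(weights, binary.sum())
```

Compared with direct summation (the SUM baseline in the ablation study), a scheme of this kind lets the fusion weights depend on how structurally similar each graph is to the others.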
Constructing consensus graphs to enhance global structure clarity in the fused matrix
The positive pointwise mutual information (PPMI) matrix is used to refine the fused cell similarity graphs of the reference and query datasets by emphasizing rare yet significant node interactions, which often carry key biological information. Applying PPMI suppresses noise from frequent interactions, sharpening the focus on genuine associations within the graph. This results in globally coherent structures for both datasets that more accurately capture the complex interplay of cellular pathways and offer a deeper understanding of the underlying biological processes.
Initially, a frequency matrix was computed from the incidence rates of node pairs in paths generated by random walks on the graph. The state at a given time step is the node at which the random walker is located. The probability of transition from the current node to an adjacent node at the subsequent time step is:
| 12 |
By conducting random walks with each node as the root, numerous paths were generated; cell pairs along each path were sampled, and the corresponding entries of the frequency matrix were incremented accordingly. After the frequency matrix was obtained, the PPMI matrix was computed as follows:
| 13 |
| 14 |
| 15 |
| 16 |
Here, the joint term estimates the probability of a cell occurring in a given context, while the two marginal terms represent the probabilities of the cell and of the context, respectively. Each PPMI matrix entry quantifies the association strength between a pair of nodes, with zero indicating statistical independence.
The PPMI-refined graph matrices of the reference and query datasets provide a representation of global structural properties. This approach emphasizes critical but infrequent node interactions, thereby enriching the graph’s portrayal of cellular connections. The resulting structures give a more precise and holistic view of the network, facilitating a deeper understanding of the biological interactions.
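The random-walk counting and the PPMI computation can be sketched in a few lines. The walk length, number of walks per node, and co-occurrence window are illustrative assumptions, the function names are hypothetical, and the graph is assumed to have no isolated nodes so that every transition probability is defined.

```python
import numpy as np


def random_walk_counts(A, walk_length=5, walks_per_node=10, window=2, seed=0):
    """Frequency matrix of node pairs co-occurring on random-walk paths."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    # Transition probability to a neighbour is proportional to the edge weight.
    T = A / A.sum(axis=1, keepdims=True)
    F = np.zeros((n, n))
    for root in range(n):
        for _ in range(walks_per_node):
            path = [root]
            for _ in range(walk_length - 1):
                path.append(rng.choice(n, p=T[path[-1]]))
            # Count pairs that appear within `window` steps of each other.
            for i, u in enumerate(path):
                for v in path[i + 1:i + 1 + window]:
                    F[u, v] += 1
                    F[v, u] += 1
    return F


def ppmi(F):
    """Positive pointwise mutual information of a frequency matrix."""
    total = F.sum()
    p_joint = F / total
    p_row = F.sum(axis=1, keepdims=True) / total   # marginal of the cell
    p_col = F.sum(axis=0, keepdims=True) / total   # marginal of the context
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_joint / (p_row * p_col))
    pmi[~np.isfinite(pmi)] = 0.0                   # pairs that never co-occur
    return np.maximum(pmi, 0.0)


# Toy graph: a ring of six cells.
ring = np.zeros((6, 6))
for i in range(6):
    ring[i, (i + 1) % 6] = ring[(i + 1) % 6, i] = 1.0

F = random_walk_counts(ring)
P = ppmi(F)
print(P.round(2))
```

Because PPMI divides each joint frequency by the product of its marginals, pairs that co-occur often merely because both cells are highly connected are down-weighted, whereas rare but specific co-occurrences are retained.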
Unveiling cellular identity through encoder-decoder architectural innovations
Using the PPMI-refined graph structures of the reference and query datasets, the model introduces a tailored multi-view architecture for cell type annotation. Within this framework, the GNN-encoder processes the PPMI-enhanced graph representations of the reference and query datasets separately. A parallel decoder is then used to reconstruct the original SNF-derived similarity graphs of the reference dataset and of the query dataset. Following a consensus representation strategy, this architecture consolidates shared representations from a variety of graph structures and node features, using a single GNN-encoder and multiple decoders to reconstruct the individual views. This consensus representation of multiple cell–cell graphs from gene signaling pathways is critical for enhancing cell type annotation accuracy. The framework simplifies learning from complex multi-view data and addresses the common challenges of learning shared representations and reducing noise, which is an advantage in multi-view cell type annotation tasks.
Unifying Graph Representations GNN-Encoder: Within the model, the PPMI-refined graphs are treated as the principal graph structures for the reference and query datasets, respectively. These refined structures are considered to contain the most critical and comprehensive information and therefore serve as the most informative consensus representation within this framework. Node attributes are derived by first isolating the top 2000 highly variable genes (HVGs) within the reference and query datasets and then identifying the HVGs common to both sets. This process defines shared node characteristics across both datasets. A graph convolutional network (GCN) was then adopted to aggregate features from adjacent nodes within the graph, with the GCN layers acting as the encoder through the spectral convolution function:
| 17 |
where represents node embeddings learned by the layer, with , , and is the activation function. The two-layer GCN of encoder’s structure is constructed as:
| 18 |
is the activation function. The output dimensions of the two layers of the GCN are set to 32 dimensions and 16 dimensions, respectively. Finally, a dense layer is incorporated to map the feature representations to the label space dimensionality. This layer serves as a transformation stage, effectively adapting the dimensionality of the node embeddings produced by the two-layer GCN to the required number of labels, thus facilitating the subsequent classification task.
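The encoder described above can be illustrated with a small NumPy sketch. This is only a minimal forward pass under stated assumptions, not the authors' implementation: the ReLU activation, the random toy graph, the weight initialisation, and all function and variable names (`normalize_adjacency`, `gcn_encoder`, `w_dense`) are hypothetical choices made for illustration. Only the symmetric normalisation with self-loops and the 32- and 16-dimensional layer widths come from the text.

```python
import numpy as np

def normalize_adjacency(adj: np.ndarray) -> np.ndarray:
    """Return D^{-1/2} (A + I) D^{-1/2} for a dense, non-negative adjacency matrix."""
    a_tilde = adj + np.eye(adj.shape[0])      # add self-loops
    deg = a_tilde.sum(axis=1)                 # degrees are > 0 thanks to the self-loops
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return a_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def relu(x: np.ndarray) -> np.ndarray:
    # Assumed activation; the text only says "activation function".
    return np.maximum(x, 0.0)

def gcn_encoder(adj, features, w1, w2, w_dense):
    """Two GCN layers (32 then 16 units) followed by a dense map to label space."""
    a_bar = normalize_adjacency(adj)
    h1 = relu(a_bar @ features @ w1)          # shape (n_cells, 32)
    z = relu(a_bar @ h1 @ w2)                 # shape (n_cells, 16), shared embedding
    logits = z @ w_dense                      # shape (n_cells, n_labels)
    return z, logits

# Toy data: 6 cells, 10 shared HVGs, 3 cell types (all values are synthetic).
rng = np.random.default_rng(0)
n_cells, n_genes, n_labels = 6, 10, 3
adj = rng.random((n_cells, n_cells))
adj = (adj + adj.T) / 2                       # symmetric toy graph
features = rng.random((n_cells, n_genes))
w1 = rng.normal(scale=0.1, size=(n_genes, 32))
w2 = rng.normal(scale=0.1, size=(32, 16))
w_dense = rng.normal(scale=0.1, size=(16, n_labels))

z, logits = gcn_encoder(adj, features, w1, w2, w_dense)
```

In practice the weights would be learned by gradient descent in a deep learning framework; the sketch only shows how the normalised graph, the node features, and the layer widths fit together.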
Diversified Reconstruction Parallel Decoder: The parallel decoder guides the GNN-encoder toward a shared representation by reconstructing the multiple graph views from the shared representation $Z$. The decoder consists of $M$ view-specific decoders, each predicting the existence of links between nodes in view $m$ with view-specific weights $W_{m}$. The prediction is executed through a link prediction layer:
$$\hat{A}_{m} = \mathrm{sigmoid}\left(Z W_{m} Z^{\top}\right), \quad m = 1, \dots, M \tag{19}$$
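As a rough illustration of a view-specific link prediction layer, the sketch below scores every pair of cells with a bilinear form followed by a sigmoid, one weight matrix per view. The bilinear form, the names `decode_views` and `view_weights`, and the toy dimensions are assumptions made for this example and are not taken from the released code.

```python
import numpy as np

def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))

def decode_views(z: np.ndarray, view_weights: list) -> list:
    """Return one edge-probability matrix per view, computed as sigmoid(Z W_m Z^T)."""
    return [sigmoid(z @ w_m @ z.T) for w_m in view_weights]

# Toy shared embedding: 5 cells, 16 dimensions, 3 pathway views (synthetic values).
rng = np.random.default_rng(1)
n_cells, dim, n_views = 5, 16, 3
z = rng.normal(size=(n_cells, dim))
view_weights = [rng.normal(scale=0.1, size=(dim, dim)) for _ in range(n_views)]

recon = decode_views(z, view_weights)
```

Because every decoder reads the same embedding `z`, any structure needed to reconstruct all views has to be captured by the single encoder, which is the intuition behind the shared representation.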
Reconstruction loss
To improve the predictive accuracy and biological relevance of the model in cell type annotation, the loss function combines three terms: $L_{rec}$ to preserve the integrity of graph reconstruction, $L_{ce}$ to incorporate supervision from labeled cells, and $L_{KL}$ to align probabilistic distributions using Kullback–Leibler (KL) divergence.
$L_{rec}$: The aim is to minimize the reconstruction error across all graph views. $L_{rec}$ is defined as:
$$L_{rec} = \sum_{m=1}^{M} L_{rec}^{(m)} \tag{20}$$
where $L_{rec}^{(m)}$ represents the reconstruction loss for view $m$, and $L_{rec}$ is the total reconstruction loss over all views. During backpropagation, the gradients from the parallel decoder encourage the encoder to extract representations common to all views, which are then used in the next forward pass. The model can be seen as a form of multi-task learning, in which the parallel decoder provides supervisory signals for the encoder and thereby facilitates the extraction of a more comprehensive and generalized shared representation.
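The summation over views can be sketched as follows. The text does not state which per-view loss is used, so binary cross-entropy between each target graph and its reconstruction is an assumption here, as are the function names and the toy matrices.

```python
import numpy as np

def view_reconstruction_loss(target: np.ndarray, recon: np.ndarray, eps: float = 1e-12) -> float:
    """Mean binary cross-entropy between a target graph and its reconstruction (assumed loss)."""
    recon = np.clip(recon, eps, 1.0 - eps)    # avoid log(0)
    bce = target * np.log(recon) + (1.0 - target) * np.log(1.0 - recon)
    return float(-np.mean(bce))

def total_reconstruction_loss(targets: list, recons: list) -> float:
    """Sum the per-view losses over all views."""
    return sum(view_reconstruction_loss(t, r) for t, r in zip(targets, recons))

# Two toy views over 3 cells.
targets = [np.eye(3), np.ones((3, 3))]
perfect = total_reconstruction_loss(targets, targets)                     # near zero
noisy = total_reconstruction_loss(targets, [np.full((3, 3), 0.5)] * 2)    # 2 * ln(2)
```

An uninformative reconstruction of 0.5 everywhere costs ln(2) per view regardless of the target, which gives a convenient baseline when monitoring training.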
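$L_{ce}$: The goal is to minimize the cross-entropy error for supervised learning, thus sharpening the model's predictive accuracy on annotated cell types. $L_{ce}$ is formulated as follows:
$$L_{ce} = -\sum_{i \in \mathcal{Y}_{L}} \sum_{u=1}^{C} y_{iu} \log \hat{y}_{iu} \tag{21}$$
where $\mathcal{Y}_{L}$ is the set of labeled cells, $C$ is the number of cell types, $y_{iu}$ equals 1 if cell $i$ is annotated as cell type $u$ and 0 otherwise, and $\hat{y}_{iu}$ is the predicted probability that cell $i$ belongs to cell type $u$.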
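A minimal sketch of the supervised term is given below, assuming that the dense layer outputs unnormalised scores that are converted to probabilities with a softmax and that the loss is averaged over labeled cells. Both choices, and the names `softmax` and `cross_entropy`, are assumptions made for illustration; averaging rather than summing only rescales the loss.

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Row-wise softmax with the usual max-shift for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def cross_entropy(logits: np.ndarray, labels: np.ndarray) -> float:
    """Mean negative log-probability assigned to the annotated cell type."""
    probs = softmax(logits)
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.mean(np.log(picked)))

# Two labeled cells, three cell types (synthetic scores).
logits = np.array([[2.0, 0.0, 0.0],
                   [0.0, 3.0, 0.0]])
labels = np.array([0, 1])

loss_ce = cross_entropy(logits, labels)
uniform = cross_entropy(np.zeros((2, 3)), labels)   # equals ln(3) for a uniform guess
```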
$L_{KL}$: The objective is to minimize the KL divergence between the predicted soft label distribution $Q$ and a target distribution $P$ that is built from highly confident assignments, which is intended to make cell type inference more robust. $L_{KL}$ is defined as follows:
$$L_{KL} = KL(P \parallel Q) = \sum_{i} \sum_{u} p_{iu} \log \frac{p_{iu}}{q_{iu}} \tag{22}$$
where $KL(\cdot \parallel \cdot)$ is the KL divergence, $p_{iu}$ is the target distribution for node $i$ and cell type $u$, and $q_{iu}$ is the predicted probability that node $i$ belongs to cell type $u$.
The similarity measure used to calculate $q_{iu}$ is based on Student's t-distribution, employed to assess the similarity between the embedding of node $i$, $z_{i}$, and the centroid of cell type $u$, $\mu_{u}$:
$$q_{iu} = \frac{\left(1 + \lVert z_{i} - \mu_{u} \rVert^{2}\right)^{-1}}{\sum_{k} \left(1 + \lVert z_{i} - \mu_{k} \rVert^{2}\right)^{-1}} \tag{23}$$
This approach results in a soft assignment of nodes to cell types, which is then refined by the target distribution $P$, defined as:
$$p_{iu} = \frac{q_{iu}^{2} / f_{u}}{\sum_{k} q_{ik}^{2} / f_{k}} \tag{24}$$
where $f_{u} = \sum_{i} q_{iu}$ is the frequency of each cell type in the predicted distribution, which normalizes the contribution of each cell type and addresses the imbalance that arises from cell type frequency in the data. Minimizing $L_{KL}$ pulls the predicted soft assignments toward this sharpened, high-confidence target, which is intended to improve the precision of cell type annotation.
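The soft assignment and target distribution described above follow a pattern that is common in deep clustering. The sketch below shows that pattern in NumPy: a Student's t kernel with one degree of freedom between embeddings and centroids, a target obtained by squaring the assignments and dividing by the per-type frequency, and the resulting KL divergence. The function names and the four-cell toy embedding are hypothetical, and how the centroids are obtained is not covered here.

```python
import numpy as np

def soft_assignment(z: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """q[i, u]: Student's t similarity of cell i to centroid u, normalised per cell."""
    dist_sq = ((z[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    kernel = 1.0 / (1.0 + dist_sq)
    return kernel / kernel.sum(axis=1, keepdims=True)

def target_distribution(q: np.ndarray) -> np.ndarray:
    """p[i, u]: squared assignments divided by the per-type frequency, renormalised per cell."""
    weight = q ** 2 / q.sum(axis=0)           # q.sum(axis=0) is the soft frequency of each type
    return weight / weight.sum(axis=1, keepdims=True)

def kl_loss(p: np.ndarray, q: np.ndarray) -> float:
    """KL(P || Q) summed over cells and cell types."""
    return float(np.sum(p * np.log(p / q)))

# Four cells in a 2-D embedding, two cell type centroids (synthetic values).
z = np.array([[0.0, 0.0], [0.2, 0.1], [3.0, 3.0], [2.9, 3.2]])
centroids = np.array([[0.0, 0.0], [3.0, 3.0]])

q = soft_assignment(z, centroids)
p = target_distribution(q)
loss_kl = kl_loss(p, q)
```

Squaring the assignments makes the target more peaked than the prediction, so minimising the divergence nudges each cell toward the cell type it already favours, while the frequency term keeps abundant cell types from dominating.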
Overall objective function: $L_{rec}$, $L_{ce}$, and $L_{KL}$ are jointly optimized, and the total loss function is defined as:
$$L = \alpha L_{rec} + \beta L_{ce} + \gamma L_{KL} \tag{25}$$
where $\alpha$, $\beta$, and $\gamma$ are hyperparameters that control the relative contribution of each loss component to the overall optimization objective, with values set to 0.1, 1, and 0.1, respectively. Additionally, the learning rate is set to 0.005.
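The weighted combination can be written in a few lines. The sketch assumes that the three weights (0.1, 1, and 0.1) apply to the reconstruction, cross-entropy, and KL terms in the order in which the losses are listed in the text; the function and argument names are hypothetical.

```python
def total_loss(loss_rec: float, loss_ce: float, loss_kl: float,
               w_rec: float = 0.1, w_ce: float = 1.0, w_kl: float = 0.1) -> float:
    """Weighted sum of the three loss terms (weight-to-term mapping is assumed)."""
    return w_rec * loss_rec + w_ce * loss_ce + w_kl * loss_kl

# Illustrative values only; real values come from the three loss computations.
example = total_loss(loss_rec=2.0, loss_ce=0.5, loss_kl=1.0)
```

With these weights the supervised term dominates, while the reconstruction and KL terms act as lighter regularisers on the shared embedding.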
Acknowledgements
We would like to express our sincere thanks to all the authors for their contributions to the study.
Abbreviations
- scRNA-seq: Single-cell RNA sequencing
- GRNs: Gene regulatory networks
- GNN: Graph neural network
- GCNs: Graph convolutional neural networks
- SNF: Similarity network fusion
- SSMF: Similarity subspace matrices fusion
- PPMI: Positive pointwise mutual information
- ACC: Accuracy score
- BA: Balanced accuracy score
- HVGs: Highly variable genes
- MSC: Mesenchymal stem
- SMC: Smooth muscle
- NK: Natural killer
Authors' contributions
Y.A.H., Y.C.L., and Z.H.Y. conceived the study. Y.A.H. and Y.C.L. designed the algorithm and wrote and revised the paper. L.H., P.W.H., L.W., Y.Z.P., and Z.A.H. contributed ideas and revised the paper. All authors read and approved the final manuscript.
Funding
This research was funded by the National Science Fund for Distinguished Young Scholars of China under grant number 62325308; the Science and Technology Innovation 2030–New Generation Artificial Intelligence Major Project under grant number 2018AAA0100103; the National Natural Science Foundation of China under grant numbers 62002297, 61722212, 62072378, 62273284, 62472353, 62302495, and 62172338; the Natural Science Foundation of Shaanxi Province under grant number 2022JQ-700; the Fundamental Research Funds for the Central Universities under grant numbers D5000230199 and G2023KY05102; and the Guangdong Basic and Applied Basic Research Foundation under grant number 2024A1515011984.
Data availability
The datasets underpinning the findings of this study are publicly accessible. Single-cell RNA-seq datasets on human pancreatic cells are available in the GEO repository under accession numbers GSE84133 [58] (Baron Human), GSE85241 [59] (Muraro), and GSE81608 [60] (Xin), and in the ArrayExpress repository under accession number E-MTAB-5061 [61] (Segerstolpe). E-MTAB-8107 [62] is a single-cell RNA-seq dataset of breast invasive carcinoma samples, available in the ArrayExpress repository. PBMC datasets are available in GEO under GSE132044 [63]. Mouse embryonic pancreatic epithelial cell data for developmental stages E13.5, E14.5, and E15.5 are accessible under GEO accession numbers GSM3852753 to GSM3852755 [64]. Clinical datasets used for validating our model include the Human Artery dataset from GEO under accession number GSE159677 [65], which examines the single-cell transcriptome of entire calcified atherosclerotic core plaques and patient-matched proximal adjacent portions of carotid artery tissue from patients undergoing carotid endarterectomy. Another clinical dataset, the Human Bone dataset from GEO under accession number GSE152805 [66], features chondrocytes from osteoarthritic and healthy tibial plateaus. The source code is available at https://github.com/LiYuechao1998/scMCGraph [67].
Declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Footnotes
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Yu-An Huang and Yue-Chao Li contributed equally to this work.
Contributor Information
Yu-An Huang, Email: yuanhuang@nwpu.edu.cn.
Zhu-Hong You, Email: zhuhongyou@gmail.com.
Zhi-An Huang, Email: huang.za@cityu-dg.edu.cn.
References
- 1. Ding S, Chen X, Shen K. Single-cell RNA sequencing in breast cancer: understanding tumor heterogeneity and paving roads to individualized therapy. Cancer Commun. 2020;40(8):329–44.
- 2. Potter SS. Single-cell RNA sequencing for the study of development, physiology and disease. Nat Rev Nephrol. 2018;14(8):479–92.
- 3. Kang HM, Subramaniam M, Targ S, Nguyen M, Maliskova L, McCarthy E, Wan E, Wong S, Byrnes L, Lanata CM. Multiplexed droplet single-cell RNA-sequencing using natural genetic variation. Nat Biotechnol. 2018;36(1):89–94.
- 4. Wang X, Park J, Susztak K, Zhang NR, Li M. Bulk tissue cell type deconvolution with multi-subject single-cell expression reference. Nat Commun. 2019;10(1):380.
- 5. Wu Y, Zhang K. Tools for the analysis of high-dimensional single-cell RNA sequencing data. Nat Rev Nephrol. 2020;16(7):408–21.
- 6. Risso D, Perraudeau F, Gribkova S, Dudoit S, Vert J-P. A general and flexible method for signal extraction from single-cell RNA-seq data. Nat Commun. 2018;9(1):284.
- 7. Heumos L, Schaar AC, Lance C, Litinetskaya A, Drost F, Zappia L, Lücken MD, Strobl DC, Henao J, Curion F. Best practices for single-cell analysis across modalities. Nat Rev Genet. 2023;24(8):550–72.
- 8. Du ZH, Hu WL, Li JQ, Shang X, You ZH, Chen ZZ, Huang YA. scPML: pathway-based multi-view learning for cell type annotation from single-cell RNA-seq data. Commun Biol. 2023;6(1):1268.
- 9. Choi JH, In Kim H, Woo HG. scTyper: a comprehensive pipeline for the cell typing analysis of single-cell RNA-seq data. BMC Bioinformatics. 2020;21:1–8.
- 10. Domanskyi S, Hakansson A, Bertus TJ, Paternostro G, Piermarocchi C. Digital Cell Sorter (DCS): a cell type identification, anomaly detection, and Hopfield landscapes toolkit for single-cell transcriptomics. PeerJ. 2021;9:e10670.
- 11. Zhang Z, Luo D, Zhong X, Choi JH, Ma Y, Wang S, Mahrt E, Guo W, Stawiski EW, Modrusan Z. SCINA: a semi-supervised subtyping algorithm of single cells and bulk samples. Genes. 2019;10(7):531.
- 12. Cao Y, Wang X, Peng G. SCSA: a cell type annotation tool for single-cell RNA-seq data. Front Genet. 2020;11:524690.
- 13. Zhang AW, O’Flanagan C, Chavez EA, Lim JL, Ceglia N, McPherson A, Wiens M, Walters P, Chan T, Hewitson B. Probabilistic cell-type assignment of single-cell RNA-seq for tumor microenvironment profiling. Nat Methods. 2019;16(10):1007–15.
- 14. Shao X, Liao J, Lu X, Xue R, Ai N, Fan X. scCATCH: automatic annotation on cell types of clusters from single-cell RNA sequencing data. iScience. 2020;23(3):100882.
- 15. Kim H, Lee J, Kang K, Yoon S. MarkerCount: a stable, count-based cell type identifier for single-cell RNA-seq experiments. Comput Struct Biotechnol J. 2022;20:3120–32.
- 16. Nguyen V, Griss J. scClassifR: Framework to accurately classify cell types in single-cell RNA-sequencing data. bioRxiv. 2020:2020.12.22.424025. 10.1101/2020.12.22.424025.
- 17. Pliner HA, Shendure J, Trapnell C. Supervised classification enables rapid annotation of cell atlases. Nat Methods. 2019;16(10):983–6.
- 18. Guo H, Li J. scSorter: assigning cells to known cell types according to marker genes. Genome Biol. 2021;22(1):69.
- 19. Kiselev VY, Yiu A, Hemberg M. scmap: projection of single-cell RNA-seq data across data sets. Nat Methods. 2018;15(5):359–62.
- 20. De Kanter JK, Lijnzaad P, Candelli T, Margaritis T, Holstege FC. CHETAH: a selective, hierarchical cell type identification method for single-cell RNA sequencing. Nucleic Acids Res. 2019;47(16):e95–e95.
- 21. Aran D, Looney AP, Liu L, Wu E, Fong V, Hsu A, Chak S, Naikawadi RP, Wolters PJ, Abate AR. Reference-based analysis of lung single-cell sequencing reveals a transitional profibrotic macrophage. Nat Immunol. 2019;20(2):163–72.
- 22. Li C, Liu B, Kang B, Liu Z, Liu Y, Chen C, Ren X, Zhang Z. SciBet as a portable and fast single cell type identifier. Nat Commun. 2020;11(1):1818.
- 23. Kimmel JC, Kelley DR. Semisupervised adversarial neural networks for single-cell classification. Genome Res. 2021;31(10):1781–93.
- 24. Tan Y, Cahan P. SingleCellNet: a computational tool to classify single cell RNA-Seq data across platforms and across species. Cell Syst. 2019;9(2):207–13 e202.
- 25. Alquicira-Hernandez J, Sathe A, Ji HP, Nguyen Q, Powell JE. scPred: accurate supervised method for cell-type classification from single-cell RNA-seq data. Genome Biol. 2019;20:1–17.
- 26. Wagner F, Yanai I. Moana: a robust and scalable cell type classification framework for single-cell RNA-Seq data. bioRxiv. 2018:456129. 10.1101/456129.
- 27. Yang F, Wang W, Wang F, Fang Y, Tang D, Huang J, Lu H, Yao J. scBERT as a large-scale pretrained deep language model for cell type annotation of single-cell RNA-seq data. Nat Mach Intell. 2022;4(10):852–66.
- 28. Fischer F, Fischer DS, Mukhin R, Isaev A, Biederstedt E, Villani A-C, Theis FJ. scTab: scaling cross-tissue single-cell annotation models. Nat Commun. 2024;15(1):6611.
- 29. Gonzalez-Ferrer J, Lehrer J, O’Farrell A, Paten B, Teodorescu M, Haussler D, Jonsson VD, Mostajo-Radji MA. SIMS: a deep-learning label transfer tool for single-cell RNA sequencing analysis. Cell Genom. 2024;4(6). 10.1016/j.xgen.2024.100581.
- 30. Lotfollahi M, Naghipourfar M, Luecken MD, Khajavi M, Büttner M, Wagenstetter M, Avsec Ž, Gayoso A, Yosef N, Interlandi M. Mapping single-cell data to reference atlases by transfer learning. Nat Biotechnol. 2022;40(1):121–30.
- 31. Domínguez Conde C, Xu C, Jarvis L, Rainbow D, Wells S, Gomes T, Howlett S, Suchanek O, Polanski K, King H. Cross-tissue immune cell analysis reveals tissue-specific features in humans. Science. 2022;376(6594):eabl5197.
- 32. Boiarsky R, Singh N, Buendia A, Getz G, Sontag D. A deep dive into single-cell RNA sequencing foundation models. bioRxiv. 2023:2023.10.19.563100. 10.1101/2023.10.19.563100.
- 33. Shao X, Yang H, Zhuang X, Liao J, Yang Y, Yang P, Cheng J, Lu X, Chen H, Fan X. Reference-free cell-type annotation for single-cell transcriptomics using deep learning with a weighted graph neural network. bioRxiv. 2020:2020.05.13.094953. 10.1101/2020.05.13.094953.
- 34. Yin Q, Liu Q, Fu Z, Zeng W, Zhang B, Zhang X, Jiang R, Lv H. scGraph: a graph neural network-based approach to automatically identify cell types. Bioinformatics. 2022;38(11):2996–3003.
- 35. Abadi SAR, Laghaee SP, Koohi S. An optimized graph-based structure for single-cell RNA-seq cell-type classification based on non-linear dimension reduction. BMC Genomics. 2023;24(1):227.
- 36. Wang K, Li Z, You ZH, Han P, Nie R. Adversarial dense graph convolutional networks for single-cell classification. Bioinformatics. 2023;39(2):btad043.
- 37. Bhadani R, Chen Z, An L. Attention-based graph neural network for label propagation in single-cell omics. Genes. 2023;14(2):506.
- 38. Sedgwick P. Pearson’s correlation coefficient. BMJ. 2012;345:e4483.
- 39. Wang B, Mezlini AM, Demir F, Fiume M, Tu Z, Brudno M, Haibe-Kains B, Goldenberg A. Similarity network fusion for aggregating data types on a genomic scale. Nat Methods. 2014;11(3):333–7.
- 40. Qian J, Olbrecht S, Boeckx B, Vos H, Laoui D, Etlioglu E, Wauters E, Pomella V, Verbandt S, Busschaert P. A pan-cancer blueprint of the heterogeneous tumor microenvironment revealed by single-cell profiling. Cell Res. 2020;30(9):745–62.
- 41. Baron M, Veres A, Wolock SL, Faust AL, Gaujoux R, Vetere A, Ryu JH, Wagner BK, Shen-Orr SS, Klein AM. A single-cell transcriptomic map of the human and mouse pancreas reveals inter-and intra-cell population structure. Cell Syst. 2016;3(4):346–60 e344.
- 42. Muraro MJ, Dharmadhikari G, Grün D, Groen N, Dielen T, Jansen E, Van Gurp L, Engelse MA, Carlotti F, De Koning EJ. A single-cell transcriptome atlas of the human pancreas. Cell Syst. 2016;3(4):385–394 e383.
- 43. Segerstolpe Å, Palasantza A, Eliasson P, Andersson E-M, Andréasson A-C, Sun X, Picelli S, Sabirsh A, Clausen M, Bjursell MK. Single-cell transcriptome profiling of human pancreatic islets in health and type 2 diabetes. Cell Metab. 2016;24(4):593–607.
- 44. Xin Y, Kim J, Okamoto H, Ni M, Wei Y, Adler C, Murphy AJ, Yancopoulos GD, Lin C, Gromada J. RNA sequencing of single human islet cells reveals type 2 diabetes genes. Cell Metab. 2016;24(4):608–15.
- 45. Ding J, Adiconis X, Simmons SK, Kowalczyk MS, Hession CC, Marjanovic ND, Hughes TK, Wadsworth MH, Burks T, Nguyen LT. Systematic comparison of single-cell and single-nucleus RNA-sequencing methods. Nat Biotechnol. 2020;38(6):737–46.
- 46. Bastidas-Ponce A, Tritschler S, Dony L, Scheibner K, Tarquis-Medina M, Salinno C, Schirge S, Burtscher I, Böttcher A, Theis FJ. Comprehensive single cell mRNA profiling reveals a detailed roadmap for pancreatic endocrinogenesis. Development. 2019;146(12):dev173849.
- 47. Alsaigh T, Evans D, Frankel D, Torkamani A. Decoding the transcriptome of calcified atherosclerotic plaque at single-cell resolution. Commun Biol. 2022;5(1):1084.
- 48. Chou C-H, Jain V, Gibson J, Attarian DE, Haraden CA, Yohn CB, Laberge R-M, Gregory S, Kraus VB. Synovial cell cross-talk with cartilage plays a major role in the pathogenesis of osteoarthritis. Sci Rep. 2020;10(1):10868.
- 49. Kanehisa M, Goto S. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 2000;28(1):27–30.
- 50. Cerami EG, Gross BE, Demir E, Rodchenkov I, Babur Ö, Anwar N, Schultz N, Bader GD, Sander C. Pathway commons, a web resource for biological pathway data. Nucleic Acids Res. 2010;39(suppl_1):D685–90.
- 51. Trupp M, Altman T, Fulcher CA, Caspi R, Krummenacker M, Paley S, Karp PD. Beyond the genome (BTG) is a (PGDB) pathway genome database: HumanCyc. Genome Biol. 2010;11:1–1.
- 52. Thomas PD, Ebert D, Muruganujan A, Mushayahama T, Albou LP, Mi H. PANTHER: making genome-scale phylogenetics accessible to all. Protein Sci. 2022;31(1):8–22.
- 53. Wishart DS, Li C, Marcu A, Badran H, Pon A, Budinski Z, Patron J, Lipton D, Cao X, Oler E. PathBank: a comprehensive pathway database for model organisms. Nucleic Acids Res. 2020;48(D1):D470–8.
- 54. Gillespie M, Jassal B, Stephan R, Milacic M, Rothfels K, Senff-Ribeiro A, Griss J, Sevilla C, Matthews L, Gong C. The reactome pathway knowledgebase 2022. Nucleic Acids Res. 2022;50(D1):D687–92.
- 55. Martens M, Ammar A, Riutta A, Waagmeester A, Slenter DN, Hanspers K, Miller RA, Digles D, Lopes EN, Ehrhart F. WikiPathways: connecting communities. Nucleic Acids Res. 2021;49(D1):D613–21.
- 56. Korsunsky I, Millard N, Fan J, Slowikowski K, Zhang F, Wei K, Baglaenko Y, Brenner M, Loh PR, Raychaudhuri S. Fast, sensitive and accurate integration of single-cell data with harmony. Nat Methods. 2019;16(12):1289–96.
- 57. Khatri R, Bonn S. Uncertainty Estimation for Single-cell Label Transfer. In: Conformal and Probabilistic Prediction with Applications. Brighton: PMLR. 2022. p. 109–128.
- 58. A single-cell transcriptomic map of the human and mouse pancreas reveals inter- and intra-cell population structure. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE84133. Accessed 10 Oct 2023.
- 59. A single-cell transcriptome atlas of the human pancreas. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE85241. Accessed 10 Oct 2023.
- 60. RNA sequencing of single human islet cells reveals type 2 diabetes genes. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE81608. Accessed 10 Oct 2023.
- 61. Single-cell RNA-seq analysis of human pancreas from healthy individuals and type 2 diabetes patients. https://www.ebi.ac.uk/biostudies/arrayexpress/studies/E-MTAB-5061. Accessed 10 Oct 2023.
- 62. Single-cell RNA sequencing of ovarian, colorectal and breast cancer. https://www.ebi.ac.uk/biostudies/arrayexpress/studies/E-MTAB-8107. Accessed 10 Oct 2023.
- 63. Systematic comparative analysis of single cell RNA-sequencing methods. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE132044. Accessed 10 Oct 2023.
- 64. Comprehensive single-cell mRNA profiling reveals a detailed roadmap for pancreatic endocrinogenesis. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE132188. Accessed 10 Oct 2023.
- 65. Decoding the transcriptome of calcified atherosclerotic plaque at single-cell resolution. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE159677. Accessed 10 Oct 2023.
- 66. Synovial cell cross-talk with cartilage plays a major role in the pathogenesis of osteoarthritis. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE152805. Accessed 10 Oct 2023.
- 67. Huang YA, Li YC, You ZH, Hu L, Hu PW, Wang L, Peng Y, Huang ZA. Consensus representation of multiple cell-cell graphs from gene signaling pathways for cell type annotation. GitHub. 2024. https://github.com/LiYuechao1998/scMCGraph.