scPLAN: a hierarchical computational framework for single transcriptomics data annotation, integration and cell-type label refinement

Qirui Guo; Musu Yuan; Lei Zhang; Minghua Deng

doi:10.1093/bib/bbae305

Brief Bioinform. 2024 Jun 27;25(4):bbae305. doi: 10.1093/bib/bbae305

PMCID: PMC11209730 PMID: 38935069

Abstract

Motivation

In the past decade, single-cell RNA sequencing (scRNA-seq) has emerged as a pivotal method for transcriptomic profiling in biomedical research. Precise cell-type identification is crucial for subsequent analysis of single-cell data, and the integration and refinement of annotated data are essential for building comprehensive databases. However, prevailing annotation techniques often overlook the hierarchical organization of cell types, resulting in inconsistent annotations. Meanwhile, most existing integration approaches fail to integrate datasets with different annotation depths, and none of them can enhance the labels of outdated data with lower annotation resolutions using more intricately annotated datasets or novel biological findings.

Results

Here, we introduce scPLAN, a hierarchical computational framework designed for scRNA-seq data analysis. scPLAN annotates unlabeled scRNA-seq data using a reference dataset structured along a hierarchical cell-type tree. It identifies potential novel cell types in a systematic, layer-by-layer manner. Additionally, scPLAN integrates annotated scRNA-seq datasets with varying levels of annotation depth, ensuring consistent refinement of cell-type labels across datasets with lower resolutions. Extensive annotation and novel cell detection experiments demonstrate the efficacy of scPLAN. Two case studies showcase how scPLAN integrates datasets with diverse cell-type label resolutions and refines their cell-type labels.

Availability

https://github.com/michaelGuo1204/scPLAN

Keywords: cell-type annotation; dataset integration; single-cell transcriptome; partial label learning

Introduction

Over the past decade, single-cell RNA sequencing (scRNA-seq) has rapidly become a transformative approach for transcriptomic profiling in biomedical research. Recent technological innovations have facilitated the interrogation of transcriptomic landscapes with greater depth and breadth. Novel high-throughput approaches, including droplet-based 10x Genomics, Drop-seq and plate-based Smart-seq2, which can profile hundreds of thousands of cells in one run, have revealed unprecedented cellular heterogeneity. Accurate cell-type identification is essential for downstream analysis of single-cell data. However, the sheer scale of scRNA-seq data makes this task both time-consuming and computationally intensive. Meanwhile, despite the accumulation of annotated public scRNA-seq data, constructing an extensive database remains challenging due to the diversity of sequencing platforms and variations in annotation resolutions across datasets. Furthermore, with increasingly deeper investigation into various diseases and related datasets, many novel cell subtypes have been discovered and many formerly identified cell subtypes have undergone further subclassification, presenting a new challenge in integrating novel cell types into existing databases.

Recently, many computational tools have been developed to address the scRNA-seq annotation task. These methods can be broadly categorized into two classes: marker gene-based methods and label transfer methods. Marker gene-based methods first cluster the cell population into small subgroups, followed by identifying marker genes specific to each cluster through differential expression analysis. Finally, the cells are annotated according to the ontological functions of their genes. Seurat [1] employs the Louvain and Leiden community discovery algorithms to segment the shared neighbor graph among cells, while SC3 [2] utilizes k-means clustering on the cell similarity matrix obtained by consensus learning for cluster segmentation. However, given the dramatic variation in marker gene utilization across different experiments, such methods generally demonstrate poor generalization performance.

Label transfer methods address this issue by projecting target samples onto a well-annotated reference dataset and transferring label information from the reference data, thus avoiding explicit utilization of non-unified marker gene information. In recent years, apart from conventional label transfer methods such as scmap [3] and Symphony [4], which utilize linear projections to reduce the dimension of scRNA-seq data, several deep learning approaches have also been proposed. MARS [5] employs a deep autoencoder and learns cell-type landmarks and nonlinear cell embeddings with deep neural networks, assigning target samples to the cell type with the closest landmark. ItClust [6] utilizes distances between cell embeddings and cluster centers to obtain cell-type assignment vectors. scMRA [7] can annotate a target dataset using multiple reference datasets, aligning all datasets by minimizing the gap between prototype adjacency matrices of each reference or target dataset. scArches [8] proposes a transfer learning and fine-tuning strategy to leverage existing conditional neural network models and transfer them to new datasets. However, these methods do not currently consider the hierarchical nature of cell types found in reference data. The inclusion of such hierarchical ontological information could enhance the consistency of annotations across different depths and consequently decrease prediction error rates. Moreover, many current methods fail to effectively handle the discovery of novel cells, often simply categorizing all newly identified cells under a single subgroup labeled as ‘unknown’.

Label transfer methods can be regarded as a specialized category of scRNA-seq dataset integration techniques that align unlabeled scRNA-seq data with well-annotated references. In addition to these approaches, numerous algorithms have been proposed to tackle the integration challenge of scRNA-seq data under varying conditions. Representative methods such as Liger [9], scAlign [10] and Harmony [11] adopt unsupervised strategies to integrate unlabeled datasets. Conversely, methods like LambDA [12] and SIDA [13] leverage annotation information to perform supervised integration of annotated scRNA-seq datasets. By incorporating informative cell-type labels, these methods often outperform unsupervised integration algorithms and can be employed in constructing large-scale databases using aggregated scRNA-seq datasets. While these methods effectively address batch effects between datasets arising from diverse sequencing platforms, they ignore the fact that many datasets are annotated at different resolutions. Consequently, they are unable to properly align, for instance, ‘CD4+ memory cells’ in one dataset with the ‘CD4+ cells’ in another dataset. scHPL [14] addresses this issue with a hierarchical progressive learning method that can simultaneously utilize multiple reference datasets with different annotation depths to annotate an unlabeled dataset. It integrates reference datasets with diverse annotation depths and can effectively learn the hierarchical structure of cell-type labels of various resolutions. However, it does not refine lower resolution labels, and thus cannot improve low-resolution annotations using more intricately annotated high-resolution data or novel cell-type discoveries within specific biological or disease domains.

Recently, partial label learning (PLL) [15] has been gaining popularity in the field of deep learning. In the PLL framework, each training example is equipped with a set of candidate labels instead of the definitive ground truth label. The primary task of PLL is to conduct label disambiguation while concurrently learning a reliable latent representation of samples. The PLL task aligns well with our objective of ascertaining the precise subtype of a target sample given knowledge of the broader cell type to which it belongs.

Here, we propose scPLAN (single-cell Partial Label ANnotator), a hierarchical computational framework for scRNA-seq data analysis. scPLAN can assign unlabeled samples to different resolutions of cell ontology classes consistently and hierarchically detect potential novel cell types. scPLAN can also integrate annotated scRNA-seq datasets of diverse annotation resolutions and refine the lower resolution cell-type labels with the higher resolution data. Specifically, scPLAN adopts a denoising Zero-Inflated Negative Binomial (ZINB) autoencoder to reduce the dimension of scRNA-seq data in a nonlinear manner. With a contrastive-style PLL loss, scPLAN balances cell-type label disambiguation and representation learning. The layer-wise ensemble expert system enables scPLAN to detect potential novel cells at different resolutions of cell ontology classes automatically and accurately. Extensive real-world experiments demonstrate the effectiveness and robustness of scPLAN in annotating scRNA-seq data. On a human lung and a human PBMC dataset, we show that scPLAN can simultaneously detect novel cells and find the relationship between these cells and existing cell types. For the dataset integration task, we apply scPLAN to a newly proposed pan-cancer scRNA-seq dataset of NK cells and a popular published PBMC dataset, Zheng68K. Similar experiments are conducted integrating two PBMC datasets with diverse annotation depths. We illustrate that scPLAN can successfully refine the cell-type labels in existing datasets with higher quality data or new discoveries in cell typing.

Material and Methods

We begin by establishing the problem setting and notations. scPLAN depends on a label hierarchy that can be organized as a directed acyclic graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$. The node set $\mathcal{V}$ includes a leaf set $\mathcal{V}_{l}$ representing specific cell types and a set $\mathcal{V}_{b}$ representing broad labels. For clarity we further assume that the descendants of each broader cell type are all leaf nodes, as shown in Figure 1(B). During training, scPLAN first focuses on broad types and then on detailed labels. This two-layer design can be easily generalized to a label hierarchy with multiple layers.

Figure 1. Overview of scPLAN: (A) scPLAN employs a ZINB-model-based autoencoder to cast input expression data into a low-dimensional latent space, together with a momentum encoder for contrastive learning. (B) A diagram of the adaptive hierarchical classifier. Based on the label hierarchy, we organize candidate labels and prototypes of different resolutions and refine these elements progressively.

For the annotation task, suppose we have a reference dataset $X^{r}$ containing well-annotated cells, along with cell-type labels $Y^{r}$. These labels constitute the detailed cell types in the label hierarchy, and we further introduce virtual broad nodes to organize these labels. Under such a scenario our objective is to transfer labels across both levels of the hierarchy from the reference dataset to a target dataset $X^{t}$ and identify potential novel cell types among the transfers. For the integration task between two datasets with diverse annotation depths, we can regard it as a special case where both reference and target datasets have labels, but not all of these labels are positioned at leaf nodes $\mathcal{V}_{l}$ within the hierarchy, i.e. some of them are broad cell types in $\mathcal{V}_{b}$ which are to be further specified into leaf-node cell types. Accordingly, our goal is to refine all broad types by selecting the most appropriate descendant label within the hierarchy for these samples. The construction of the hierarchy in both tasks is illustrated in the following section.
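The integration setting described above maps naturally onto partial label learning: a cell with a broad label has all descendants of that label as candidates, a cell with a leaf label has a single candidate, and an unlabeled cell has every leaf as a candidate. The following is a minimal sketch of that bookkeeping, not the authors' implementation; the hierarchy contents and the names `HIERARCHY` and `candidate_mask` are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical two-layer hierarchy: broad label -> leaf cell types.
HIERARCHY = {
    "CD4+ T cell": ["CD4+ naive T", "CD4+ memory T", "CD4+ regulatory T"],
    "Monocyte": ["CD14+ monocyte", "CD16+ monocyte"],
}
LEAVES = [leaf for leaves in HIERARCHY.values() for leaf in leaves]
LEAF_INDEX = {leaf: i for i, leaf in enumerate(LEAVES)}


def candidate_mask(label):
    """Return a 0/1 vector over leaf types marking the candidate labels.

    A broad label yields all of its descendants, a leaf label yields itself,
    and None (an unlabeled cell) yields every leaf.
    """
    if label is None:
        candidates = LEAVES
    elif label in HIERARCHY:
        candidates = HIERARCHY[label]
    elif label in LEAF_INDEX:
        candidates = [label]
    else:
        raise KeyError(f"unknown label: {label!r}")
    mask = np.zeros(len(LEAVES), dtype=int)
    mask[[LEAF_INDEX[c] for c in candidates]] = 1
    return mask
```

Refining a broad label then amounts to choosing one entry among the non-zero positions of the mask.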

Label Hierarchy Construction

As the ‘naming’ of cell types in the reference dataset is highly subjective and variable (for instance, NK cells may be labeled as ‘CD56NK cell’ in one dataset and ‘Natural Killer Cell’ in another), it is impractical to construct a hierarchy of these labels based solely on biological ontology. For the annotation task, we utilize the expression similarity of the labeled data to construct the graph $\mathcal{G}$. Using gene expression of the reference dataset, we first calculate the cluster means for each category and then perform hierarchical clustering on these centroids to produce a dendrogram that includes all types. We flatten the dendrogram using an inconsistency coefficient to obtain candidate first-level clusters. We then compute the silhouette score for each flattened result and select the graph that yields the highest score. For integration tasks where all input datasets are already labeled, we refer to scHPL [14] for organizing the graph with annotations at various resolutions.
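The construction for the annotation task can be sketched as follows. This is an illustrative sketch under several assumptions rather than the authors' code: average linkage, the list of inconsistency thresholds, and computing the silhouette on the centroids are all choices made here, and the function names are hypothetical. The flattening uses SciPy's `fcluster` with `criterion="inconsistent"`.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def mean_silhouette(points, groups):
    """Mean silhouette coefficient with Euclidean distances (plain NumPy)."""
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    scores = []
    for i, g in enumerate(groups):
        same = groups == g
        if same.sum() == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        a = dist[i, same].sum() / (same.sum() - 1)
        b = min(dist[i, groups == o].mean() for o in np.unique(groups) if o != g)
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))


def build_label_hierarchy(expr, cell_types, thresholds=(0.7, 0.9, 1.0, 1.1)):
    """Group cell types into broad clusters from their expression centroids."""
    cell_types = np.asarray(cell_types)
    names = sorted(set(cell_types.tolist()))
    # One centroid (mean expression profile) per annotated cell type.
    centroids = np.vstack([expr[cell_types == n].mean(axis=0) for n in names])
    tree = linkage(centroids, method="average")  # assumed linkage method
    best_score, best_groups = -np.inf, None
    for t in thresholds:  # assumed grid of inconsistency cut-offs
        groups = fcluster(tree, t=t, criterion="inconsistent")
        if not 2 <= len(np.unique(groups)) < len(names):
            continue  # silhouette is only defined for 2..n-1 clusters
        score = mean_silhouette(centroids, groups)
        if score > best_score:
            best_score, best_groups = score, groups
    if best_groups is None:
        raise ValueError("no threshold produced a usable flat clustering")
    hierarchy = {}
    for name, g in zip(names, best_groups):
        hierarchy.setdefault(f"broad_{g}", []).append(name)
    return hierarchy
```

The returned dictionary maps each virtual broad node to the leaf cell types beneath it, which is the two-layer structure assumed in the problem setting.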

Denoising ZINB autoencoder

We employ a widely used ZINB [16] model-based autoencoder (referred to as ‘Encoder’ and ‘Decoder’ in Figure 1) to project raw scRNA-seq counts onto a lower dimensional manifold space, thereby capturing the intrinsic characteristics of the data. Unlike a standard autoencoder, the ZINB autoencoder estimates three parameters: mean $\mu$, dispersion $\theta$ and dropout probability $\pi$. These parameters define the ZINB distribution, which has been proven effective in modeling scRNA-seq count data characterized by overdispersion and sparsity. The probability mass functions for the Negative Binomial (NB) and ZINB distributions are as follows:

$$\mathrm{NB}(x \mid \mu, \theta) = \frac{\Gamma(x+\theta)}{x!\,\Gamma(\theta)} \left(\frac{\theta}{\theta+\mu}\right)^{\theta} \left(\frac{\mu}{\theta+\mu}\right)^{x}$$

$$\mathrm{ZINB}(x \mid \pi, \mu, \theta) = \pi\,\delta_{0}(x) + (1-\pi)\,\mathrm{NB}(x \mid \mu, \theta)$$

Given the input raw count data $X$, the ZINB autoencoder produces the corresponding parameters $\mu$, $\theta$ and $\pi$ through a bottleneck encoder $f_q$, which maps the counts onto a low-dimensional latent space. To evaluate the model’s ability to capture the intrinsic characteristics of the input data, we compute the negative log-likelihood of the estimated ZINB distribution:

$$\mathcal{L}_{\mathrm{ZINB}} = -\log \mathrm{ZINB}(X \mid \pi, \mu, \theta)$$
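As a numerical illustration of the ZINB negative log-likelihood, the sketch below evaluates the standard NB and ZINB probability mass functions with NumPy. It is a plain reference computation, not the training code of scPLAN; the function names and the small constant `eps` are assumptions introduced here.

```python
import math

import numpy as np

_lgamma = np.vectorize(math.lgamma)


def nb_log_pmf(x, mu, theta):
    """Log NB pmf with mean `mu` and dispersion `theta` (all values > 0)."""
    x, mu, theta = (np.asarray(v, dtype=float) for v in (x, mu, theta))
    return (
        _lgamma(x + theta) - _lgamma(theta) - _lgamma(x + 1.0)
        + theta * np.log(theta / (theta + mu))
        + x * np.log(mu / (theta + mu))
    )


def zinb_nll(x, mu, theta, pi, eps=1e-10):
    """Negative log-likelihood of counts `x` under ZINB(pi, mu, theta)."""
    x = np.asarray(x, dtype=float)
    nb = np.exp(nb_log_pmf(x, mu, theta))
    # Zeros come either from dropout (pi) or from the NB component itself.
    prob = np.where(x == 0, pi + (1.0 - pi) * nb, (1.0 - pi) * nb)
    return float(-np.log(prob + eps).sum())
```

With `pi = 0` the function reduces to the ordinary NB negative log-likelihood, which is a convenient sanity check.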

Contrastive Momentum Encoder

In scPLAN we adopt MoCo’s [17] setting, where for input data $x$, we generate an augmented count data $\tilde{x}$ and project it into the latent space using an auxiliary encoder $f_k$ (referred to as ‘Encoder’ in Figure 1), which has an identical structure to the primary ZINB autoencoder’s encoder $f_q$. The parameters of this auxiliary encoder are updated from the main encoder using a momentum-based approach, without backpropagation of gradients. We draw on the SupCon [18] approach for constructing the contrastive loss.
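The momentum-based update of the auxiliary encoder is, in MoCo, an exponential moving average of the main encoder's parameters. A minimal sketch follows; representing the parameters as dictionaries of arrays and the default momentum `m=0.99` are assumptions made here for illustration.

```python
import numpy as np


def momentum_update(key_params, query_params, m=0.99):
    """EMA update: key <- m * key + (1 - m) * query, with no gradient step."""
    for name, q in query_params.items():
        key_params[name] = m * key_params[name] + (1.0 - m) * q
    return key_params
```

Because the auxiliary encoder changes slowly, the latent features it has produced over recent batches remain comparable, which is what makes a pool of past representations usable.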

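For an input batch $X$ and its augmentation $\tilde{X}$, we first compute $\ell_2$-normalized latent features, denoted by $q$ and $k$, respectively. To enhance sample diversity, we maintain a dynamic pool $A$, which includes latent representations from the most recent batches. The overall supervised contrastive loss is formulated as

$$\mathcal{L}_{\mathrm{cont}} = -\frac{1}{|P(q)|} \sum_{k_{+} \in P(q)} \log \frac{\exp(q^{\top} k_{+} / \tau)}{\sum_{k' \in A(q)} \exp(q^{\top} k' / \tau)}$$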

Here, $\tau$ is a scalar temperature parameter, $A(q)$ represents the active set for $q$ within the pool and $P(q)$ denotes the positive set for latent $q$, containing latents that are more similar to $q$. We employ the hierarchical classifier to aid in the selection of the positive set; given a latent $q$ and its classifier prediction $\hat{y}$, the positive set is defined as

$$P(q) = \{\, k' \in A(q) : \hat{y}' = \hat{y} \,\}$$

where $\hat{y}'$ is the classifier prediction for $k'$.

Considering that the contrastive loss depends on classifier predictions, which may not be reliable in the early stages of training, we initially apply a self-supervised contrastive loss. During the initial epochs, this loss only considers the counterpart from the auxiliary encoder, $k$, as the positive sample:

$$\mathcal{L}_{\mathrm{self}} = -\log \frac{\exp(q^{\top} k / \tau)}{\sum_{k' \in A(q)} \exp(q^{\top} k' / \tau)}$$
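Both contrastive losses described above share one computation and differ only in which pool entries count as positives. The sketch below makes this explicit with a positive mask. It follows the usual SupCon and MoCo formulation and is not taken from the scPLAN code; the function names, the temperature default and the convention that the first pool entries are the momentum-encoder views of the current batch are assumptions.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit Euclidean norm."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def contrastive_loss(queries, pool, positive_mask, tau=0.07):
    """Mean over queries of -1/|P(q)| * sum_{k in P(q)} log softmax(q.k / tau)."""
    logits = queries @ pool.T / tau
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    n_pos = np.maximum(positive_mask.sum(axis=1), 1)
    return float((-(positive_mask * log_prob).sum(axis=1) / n_pos).mean())


def supervised_mask(query_pred, pool_pred):
    """Positives are pool entries sharing the query's predicted label."""
    return (query_pred[:, None] == pool_pred[None, :]).astype(float)


def warmup_mask(n_queries, n_pool):
    """Warm-up phase: only the momentum-encoder view of the same cell is positive.

    Assumes the first `n_queries` pool entries are those views, in batch order.
    """
    mask = np.zeros((n_queries, n_pool))
    mask[np.arange(n_queries), np.arange(n_queries)] = 1.0
    return mask
```

Switching from `warmup_mask` to `supervised_mask` after the initial epochs reproduces the two-phase schedule described in the text.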

Adaptive Hierarchical Classifier

A central challenge in scPLAN is to devise a classifier that can provide accurate predictions while preserving the hierarchical structure inherent to various cell types. Drawing inspiration from PiCO [15], we introduce an adaptive classifier that is assisted by latent prototypes. For each level of the label hierarchy shown in Figure 1B, we assign sample $x$ with corresponding latent vector $q$ a mutable candidate vector $s$, where each component $s_{c}$ reflects the confidence that $x$ belongs to class $c$ on that level. We train the classifier $g$ using a cross-entropy loss defined as

$$\mathcal{L}_{\mathrm{cls}} = -\sum_{c} s_{c} \log g_{c}(q) \qquad (1)$$

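Initialization of $s$ is elaborated in the supplementary files for both the annotation and integration scenarios. Throughout training on different levels within the hierarchy $\mathcal{G}$, we update the candidate vector $s$ of $x$ by its similarity with the cell-type latent prototypes $p_{c}$ in a moving-average scheme:

$$s \leftarrow \phi\, s + (1-\phi)\, z, \qquad z_{c} = \begin{cases} 1 & \text{if } c = \arg\max_{j} q^{\top} p_{j} \\ 0 & \text{otherwise} \end{cases} \qquad (2)$$

Here $z$ indicates the cell type whose prototype has the highest similarity with $q$, and $\phi$ is the momentum parameter controlling the update step size. In cooperation with the classifier, we update these prototypes on each level by

$$p_{c} \leftarrow \mathrm{Normalize}\big(\gamma\, p_{c} + (1-\gamma)\, q\big), \qquad c = \arg\max_{j} g_{j}(q) \qquad (3)$$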
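The two moving-average updates described above can be sketched in a few lines. The sketch follows the PiCO-style scheme that the text cites as inspiration and should be read as an illustration, not as the authors' implementation; the function names, the `allowed` mask restricting the arg-max to each cell's candidate labels, and the defaults for the momentum parameters `phi` and `gamma` are assumptions.

```python
import numpy as np


def update_candidates(candidates, latents, prototypes, allowed, phi=0.9):
    """s <- phi * s + (1 - phi) * onehot(most similar allowed prototype)."""
    sims = latents @ prototypes.T
    sims = np.where(allowed > 0, sims, -np.inf)  # ignore non-candidate classes
    nearest = sims.argmax(axis=1)
    onehot = np.eye(prototypes.shape[0])[nearest]
    return phi * candidates + (1.0 - phi) * onehot


def update_prototypes(prototypes, latents, predicted, gamma=0.99):
    """Move the predicted class prototype toward each latent, then renormalize."""
    prototypes = prototypes.copy()
    for z, c in zip(latents, predicted):
        p = gamma * prototypes[c] + (1.0 - gamma) * z
        prototypes[c] = p / np.linalg.norm(p)
    return prototypes
```

Because the candidate vector only drifts by a factor of `1 - phi` per step, a few noisy early predictions cannot fix a cell's label prematurely.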

We also introduce a cross-entropy loss $\mathcal{L}_{\mathrm{proto}}$ on each annotation level to ensure the accuracy of these prototypes:

(4)

Novel cell type perception

Identifying potential novel cell types in the target dataset is crucial for downstream analysis. Following the approach outlined in scEmail [19], we recognize novel cells using the E-score, which is derived from metrics based on the classifier output for a latent $q$. These metrics are designed to assess the uncertainty of the classifier’s output:

A higher E-score suggests greater certainty in the classifier’s prediction, indicating a higher likelihood that sample $x$ and its corresponding latent $q$ belong to a well-annotated cell type. To establish a decision threshold for novel cell detection, we employ a mix-up strategy to simulate potential novel cells. We generate synthetic samples by mixing pairs of latent representations and compute a threshold based on the E-scores of these samples. Using this threshold, we can partition samples into known and unknown categories. Meanwhile, to promote consistency in the classifier’s predictions for similar samples, we introduce an additional loss term, defined as

where $g(q)$ is the classifier output and $\mathcal{N}(q)$ denotes the nearest-neighbor features of $q$. The prototype soft-label here is the similarity between the latent feature and the prototypes of each class.
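The mix-up thresholding used for novel cell detection can be illustrated with a stand-in certainty score. The exact definition of the E-score is not reproduced in this text, so the sketch below substitutes one minus the normalized entropy of the classifier output; this substitute, the choice of a score quantile as the threshold, and all function names and defaults are assumptions made purely for illustration.

```python
import numpy as np


def softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def confidence_score(probs, eps=1e-12):
    """Stand-in certainty score: 1 - normalized entropy (NOT the paper's E-score)."""
    entropy = -(probs * np.log(probs + eps)).sum(axis=1)
    return 1.0 - entropy / np.log(probs.shape[1])


def mixup_threshold(latents, labels, classifier, rng, n_mix=200, lam=0.5, quantile=0.95):
    """Mix latents of differently labeled cells; use a score quantile as cut-off."""
    i = rng.integers(0, len(latents), size=n_mix)
    j = rng.integers(0, len(latents), size=n_mix)
    keep = labels[i] != labels[j]  # only mix across classes
    mixed = lam * latents[i[keep]] + (1.0 - lam) * latents[j[keep]]
    return float(np.quantile(confidence_score(classifier(mixed)), quantile))


def flag_novel(latents, classifier, threshold):
    """Cells whose certainty does not exceed the mix-up threshold are flagged."""
    return confidence_score(classifier(latents)) <= threshold
```

The idea is that a mixture of two known types mimics a cell the classifier has never seen, so real cells that look at least as ambiguous as these mixtures are treated as potentially novel.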

Finally, the total loss for scPLAN is given by the following equation. Details of scPLAN’s implementation are provided as pseudocode in the supplementary file.

Results

scPLAN hierarchically annotates scRNA-seq data

In this section, we evaluate the annotation performance of scPLAN and several popular annotation tools on six scRNA-seq or snRNA-seq datasets from different tissues and sequencing platforms. The competing methods include Seurat [1], SingleR [20], scmap [3], scArches [8], Symphony [4], Liger [9] and expiMap [21]. All these methods are executed with default settings. The training details are listed in the supplementary file. The six datasets to be annotated include four human pancreatic datasets: Enge (Smart-Seq2) [22], Segerstolpe (Smart-Seq2) [23], Lawlor (SMARTER) [24] and Muraro (CEL-Seq2) [25]. To annotate these datasets, Baron (inDrop-seq) [26] is used as the reference. The other two datasets are a human kidney dataset, Wu (10X snRNA-seq) [27], and a human peripheral blood mononuclear cell dataset, PBMC10Xv2 (10X scRNA-seq) [28]; the corresponding reference datasets are Park (10X scRNA-seq) [29] and the FACS-sorted PBMC dataset FACS (10X scRNA-seq) [30]. The detailed information and the preprocessing procedures of these datasets are also listed in the supplementary file. The annotation performance of each method is evaluated by annotation accuracy and by the UMAP [31] visualization of extracted features in the classification space.

We present the annotation accuracy of scPLAN in comparison with the competing methods in Table 1. To emphasize the improvements brought about by hierarchical label assignment, we also include a version of scPLAN without label hierarchy, referred to as scPLAN*. scPLAN ranks among the top two methods in all six annotation tasks. scPLAN* ranks among the top two in three experiments (Enge, Segerstolpe and PBMC10Xv2), while scArches, SingleR and expiMap each do so in one experiment (Muraro, Wu and Lawlor, respectively). These six annotation experiments demonstrate the efficacy and robustness of scPLAN. Additionally, in all six annotation tasks, scPLAN outperforms scPLAN*, with accuracy gains ranging from 0.87 to 12.51 percentage points, demonstrating that introducing hierarchical information of cell types improves annotation performance.

Table 1. Annotation accuracy (%) of scPLAN compared with other methods

Dataset	scPLAN*	scPLAN	Seurat	SingleR	scmap	scArches	Symphony	Liger	expiMap
Enge	96.10	96.97	95.61	91.17	78.26	94.83	35.14	90.97	95.31
Segerstolpe	95.19	96.63	92.79	93.65	76.44	91.34	30.58	89.42	91.15
Lawlor	95.94	97.02	96.21	92.43	90.81	92.70	34.68	90.54	96.22
Muraro	91.18	94.91	87.65	91.19	61.16	95.29	23.51	88.74	89.67
Wu	79.99	85.36	41.44	82.11	3.05	80.17	35.15	4.26	75.68
PBMC10Xv2	68.95	81.46	39.96	53.83	40.09	64.74	22.98	57.03	54.70
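The rankings and accuracy gains quoted from Table 1 can be checked directly from the table values. The short script below copies the numbers from Table 1; the helper names `top_two` and `hierarchy_gain` are introduced here only for this check.

```python
METHODS = ["scPLAN*", "scPLAN", "Seurat", "SingleR", "scmap",
           "scArches", "Symphony", "Liger", "expiMap"]

# Accuracy (%) per dataset, in the column order of METHODS (values from Table 1).
TABLE = {
    "Enge":        [96.10, 96.97, 95.61, 91.17, 78.26, 94.83, 35.14, 90.97, 95.31],
    "Segerstolpe": [95.19, 96.63, 92.79, 93.65, 76.44, 91.34, 30.58, 89.42, 91.15],
    "Lawlor":      [95.94, 97.02, 96.21, 92.43, 90.81, 92.70, 34.68, 90.54, 96.22],
    "Muraro":      [91.18, 94.91, 87.65, 91.19, 61.16, 95.29, 23.51, 88.74, 89.67],
    "Wu":          [79.99, 85.36, 41.44, 82.11, 3.05, 80.17, 35.15, 4.26, 75.68],
    "PBMC10Xv2":   [68.95, 81.46, 39.96, 53.83, 40.09, 64.74, 22.98, 57.03, 54.70],
}


def top_two(row):
    """Names of the two most accurate methods for one dataset."""
    ranked = sorted(zip(row, METHODS), reverse=True)
    return [name for _, name in ranked[:2]]


def hierarchy_gain(row):
    """Accuracy gain of scPLAN over scPLAN* in percentage points."""
    return round(row[1] - row[0], 2)


if __name__ == "__main__":
    for dataset, row in TABLE.items():
        print(dataset, top_two(row), hierarchy_gain(row))
```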


In particular, in the Segerstolpe experiment both scPLAN and scPLAN* surpass all other methods with an accuracy exceeding 95%. Prior to annotation, a label hierarchy is constructed on the reference Baron dataset, as shown in Figure 2(A). Starting from data with clear distinctions, scPLAN initially aligns samples into major groups on the label hierarchy and subsequently categorizes them into smaller clusters. Figure 2(B) and Figure S2 show the UMAP visualizations of the original data and the latent features encoded by scPLAN, colored by dataset or cell type. We can see that the broad clusters identified by scPLAN fit well with the cell-type clusters in the reference data, and the subsequent categorization accurately divides these broad clusters into their respective subtypes.

Figure 2. scPLAN hierarchically annotates scRNA-seq data: (A) Constructed label hierarchy of the Baron dataset. (B) UMAP visualizations of latent representations during annotation of the Segerstolpe dataset. *upper left*: UMAP visualization of the raw data. *upper right*: UMAP visualization of the latent representation during scPLAN’s broad-level annotation. *lower*: UMAP visualization of scPLAN’s annotation result. scPLAN first performs higher level alignment and assigns precise labels afterwards. (C) Sankey diagram of the annotation result on the Segerstolpe dataset by scPLAN. *left*: ground truth label; *right*: scPLAN predictions. (D) Differential expression analysis on the CD4+ T group in the PBMC10Xv2 dataset with respect to scPLAN predictions. Differences in key marker genes are revealed. (E) UMAP visualization of annotation results and CD8A/B expression on CD4+ labeled samples. *left*: UMAP of raw PBMC10Xv2 data; *right*: UMAP of the scPLAN-encoded latent representation. (F) scPLAN hierarchically identifies novel cells during annotation of the Xin dataset. (G) UMAP visualization of scPLAN’s first-level annotation on the Lung dataset; latent representations of epithelial cells are assigned to an isolated cluster, indicating that epithelial cells are distinct from the other types. (H) Differential expression analysis of identified novel cells and ground truth novel epithelial cells. The identified novel cells exhibit high consistency with epithelial cells in differentially expressed genes. (I) scPLAN hierarchically identifies novel cells during annotation of the Lung dataset.

We display the UMAP visualizations of the latent representations of the competing methods on the Segerstolpe dataset in Figures S2 and S3, colored by data source and by cell-type label, respectively. As observed, Symphony fails to correct batch effects between datasets, leading to subpar annotation metrics. While Liger effectively bridges the gap between the two datasets, it fails to extract intrinsic signals among cell types to form compact clusters. Seurat, expiMap and scArches do show alignment of datasets. However, the label assignment of these methods in the aligned latent space is less convincing compared with scPLAN, with the correspondence between some of the cell types appearing confused. For instance, these methods struggle to distinguish between certain pancreatic acinar and ductal cells in the latent space. Among the annotations, scPLAN and scPLAN* exhibit remarkable capabilities in overcoming batch effects and recognizing expression representations of different cell types across datasets. Hierarchical learning also plays a crucial role in achieving convincing alignment: during the annotation of the human kidney dataset Wu, scPLAN* fails to eliminate gaps between some minor groups and their corresponding groups in the reference dataset (see Figure S1), resulting in lower annotation accuracy. In contrast, by hierarchically annotating from major groups to individual clusters, scPLAN yields more compact clusters and a more reasonable gap between datasets. These characteristics make the latent representations learned by scPLAN more suitable for subsequent annotation analysis.

We further show the Sankey diagram of each method applied to the six datasets to investigate the cell-type assignment of each method. In the Segerstolpe experiment, compared with other methods, scPLAN and scPLAN* offer a more credible label flux in the Sankey diagram. During the annotation of the Lawlor dataset, most methods, including scPLAN*, incorrectly assign some of the target samples to endothelial cells, a category not present in Lawlor’s annotation (see Figure S1). However, due to the hierarchical update of the classifier, scPLAN reduces these misclassifications, as endothelial cells lie in clusters isolated from other cell types in the initial stage. Similarly, scPLAN provides more accurate annotations for pancreatic ductal cells in the Lawlor dataset compared with scPLAN*, where ambiguous classification occurs between ductal cells and type B pancreatic cells or PP cells. Assisted by hierarchical annotation, scPLAN gains the ability to correct these errors at the first level of annotation, thereby achieving superior accuracy.

Another case further emphasizes scPLAN’s annotation efficacy. During annotation of the PBMC10Xv2 dataset, a noticeable drop in the performance of scPLAN* and other methods catches our attention. As shown in the Sankey diagram (Figure S1), scPLAN* fails to distinguish between CD4+ and CD8+ T cells; a similar pattern is also observed in scPLAN’s predictions. To understand the reason behind this misclassification, we conduct a differential expression analysis on the samples originally labeled as CD4+ T cells with respect to scPLAN’s predictions (Figure 2(D)). We observe that, compared with cells correctly recognized as CD4+ T cells, cells classified as CD8+ T cells by scPLAN indeed have higher CD8A, CD8B and NKG gene expression along with lower CD4 expression, which matches the cell-type marker pattern of CD8+ T cells. Based on the results of differential expression, there appears to be potential mislabeling within the CD4+ T cell group. To further investigate this, we visualize the predictions of scPLAN alongside the CD8A/B expression in the originally labeled CD4+ samples. We also include the corresponding latent features encoded by scPLAN, using UMAP for visual representation (see Figure 2(E)). From the raw expression data, we observe a subtle diversity within the CD4+ labeled population. scPLAN successfully captures and amplifies this heterogeneity, identifying samples with elevated CD8A/B expression as CD8+ T cells. These samples are subsequently assigned to a separate CD8+ cluster, distinct from the CD4+ cluster. This phenomenon is also confirmed by [14], which states that some of the CD4+ T cells in the PBMC10Xv2 dataset should be relabeled as CD8+ T cells. The success of scPLAN in identifying these mislabeled populations underscores its robustness and ability to extract informative features from expression data.

In conclusion, following a comparison with other methods and scPLAN without label hierarchy, we demonstrate that scPLAN has gained the ability to accurately annotate a dataset with a reference through contrastive and hierarchical learning. Hierarchical label assignment in scPLAN not only prevents significant misclassifications but also enhances the disambiguation between samples with similar features.

scPLAN hierarchically detects potential novel cells

scPLAN also demonstrates the ability to detect potential novel cells in a layer-by-layer manner. In this section, we apply scPLAN and the previously mentioned competing methods to two scRNA-seq datasets that contain ‘unseen’ cells compared with the corresponding reference datasets. The first dataset is a human lung dataset, LungSS2 (SmartSeq2) [32], and the reference dataset we select is another human lung dataset, Lung10X [32], sequenced on the 10X Chromium platform. The ‘epithelial cell of lung’ type, which accounts for 6.8% of LungSS2, is not reported in the Lung10X dataset. The second dataset is a human pancreatic dataset, Xin (SMARTER) [33], and the reference dataset is a modified Baron dataset sequenced on the inDrop platform. Pancreatic PP and pancreatic D cells (5.4% and 4.6%, respectively) are private cell types that only appear in Xin.

As depicted in the Sankey diagram of the Lung dataset annotation (Figure 2(I)), scPLAN successfully identifies lung epithelial cells as novel cells via the uncertainty metric, known as the E-score (Figure S5), and achieves a total annotation accuracy of 86.2%. Assuming the null hypothesis that all samples are known, the novel sample perception of scPLAN has a false positive rate (FPR) of 23.9% and a false negative rate (FNR) of 6.8%. During annotation, most of these epithelial samples are extracted at the first stage of annotation. Interestingly, we find that during that period of major group assignment, the encoded latent features of these epithelial cells on the cell type colored UMAP (Figure 2(G)) form a compact cluster distinct from other feature groups. This suggests potential isolation in the cell type hierarchy. We further conduct differential expression analysis on the identified novel samples and specific lung epithelial cells compared with other samples. The results (Figure 2(H)) show significant expression differences between lung epithelial cells and other samples in some typical epithelial marker genes, such as Wfdc2 [34], Cbr2 [35] and the Sftp [36] families. This accounts for the potential isolation of these epithelial cells on the UMAP or label hierarchy. Additionally, the differential expression results also demonstrate consistency between the epithelial cells and identified novel cells on these markers, validating the perception result of scPLAN. These findings fully reflect scPLAN’s ability to efficiently detect novel cells and to indicate the potential relation of these novel samples to other cell groups. Unlike the Lung dataset, novel cells in the Xin dataset are not recognized until the second stage of annotation (Figure 2(F)). With a total annotation accuracy of 91.5%, the novel cell perception in the Xin dataset has an FPR of 24.6% and an FNR of 6.4%. 
Unlike the epithelial cells in the Lung dataset, the novel cells (pancreatic D/PP) in Xin do not exhibit a strong distinction from other samples, as suggested by the differential expression result (Figure S6). As a result, scPLAN assigns them to Cluster 1 in stage 1 and distinguishes them from pancreatic A cells and type B pancreatic cells in the following stage 2. This indicates that the detected novel cells have an expression pattern or function similar to that of pancreatic A cells and type B pancreatic cells, which is further illustrated by the differential expression analysis shown in Figure S6. This case further supports the robustness of scPLAN’s hierarchical novel cell perception under different dataset scenarios. Meanwhile, compared with other methods (Table S1, Supplementary SS4), scPLAN still balances annotation and novel cell discovery well.
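For reference, the false positive and false negative rates of novel cell perception can be computed as in the sketch below. It assumes the conventional definitions with "novel" as the positive class, i.e. FPR is the fraction of known cells flagged as novel and FNR is the fraction of truly novel cells that are missed; whether the reported figures use exactly these denominators is not stated in the text, and the function name is hypothetical.

```python
import numpy as np


def novelty_rates(is_novel_true, is_novel_pred):
    """Return (FPR, FNR), treating 'novel' as the positive class."""
    truth = np.asarray(is_novel_true, dtype=bool)
    pred = np.asarray(is_novel_pred, dtype=bool)
    fp = np.sum(~truth & pred)   # known cells flagged as novel
    tn = np.sum(~truth & ~pred)  # known cells kept as known
    fn = np.sum(truth & ~pred)   # novel cells that were missed
    tp = np.sum(truth & pred)    # novel cells correctly flagged
    return float(fp / (fp + tn)), float(fn / (fn + tp))
```

For example, with four known cells of which one is flagged and two novel cells of which one is missed, the function returns an FPR of 0.25 and an FNR of 0.5.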

In summary, the results from both datasets underscore scPLAN’s ability to balance the detection of novel samples and their annotation, thereby providing convincing results in these areas. Furthermore, the hierarchical information uniquely suggested by scPLAN for these novel cells can provide researchers with insights into the characteristics of novel cells or their relationships with other cell categories. This capability is biologically meaningful and can potentially lead to novel discoveries.

scPLAN integrates datasets of various resolutions

In this section, we focus on scPLAN’s ability to integrate datasets with different levels of annotation, as well as to enhance low-level annotations based on detailed cell-type labels found in another dataset. We have employed scPLAN on two pairs of 10X sequenced PBMC datasets, namely the eQTL [37]/FACS [30] and Zheng68K [30]/NKAtlas [38] datasets. In each pair, both datasets are labeled, but they exhibit different levels of resolution in cell-type identification. As mentioned in the preceding section, we use the existing labels to create a hierarchical categorization with the aid of scHPL prior to integration, and then proceed to execute the standard scPLAN procedure.

We first highlight scPLAN’s functionality in refining low-resolution cell-type labels with data annotated at higher resolution. Specifically, the FACS dataset categorizes CD4+ T cells into memory, regulatory and naive subtypes, while all these cells are labeled as ‘CD4+ T cell’ in eQTL. Conversely, for NK cells and monocytes, the FACS dataset provides lower resolution annotations, labeling all corresponding cells as ‘NK cell’ or ‘Monocyte’, whereas the eQTL dataset further categorizes them into more specific sub-cell types, as depicted in Figure 3(A). scPLAN can refine all low-resolution annotations in both datasets into precise cell-type labels.
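One way to picture the refinement problem described above is to treat a coarse label as a set of candidate fine labels, namely the leaves below it in the label hierarchy. The sketch below is only an illustration of that idea under stated assumptions: the dictionary layout, the function name and the exact label strings are hypothetical and do not reproduce the scHPL tree or the scPLAN implementation.

```python
# Hypothetical fragment of a label hierarchy: parent -> list of children.
children = {
    "CD4+ T cell": [
        "memory CD4+ T cell",
        "regulatory CD4+ T cell",
        "naive CD4+ T cell",
    ],
}


def leaf_candidates(label, children):
    """Return the leaf labels below `label` (or [label] if it is a leaf)."""
    kids = children.get(label, [])
    if not kids:
        return [label]
    leaves = []
    for kid in kids:
        leaves.extend(leaf_candidates(kid, children))
    return leaves


# A cell labeled only 'CD4+ T cell' has three candidate fine labels,
# while a cell that already carries a fine label keeps that label.
print(leaf_candidates("CD4+ T cell", children))
print(leaf_candidates("naive CD4+ T cell", children))
```

Refinement then amounts to choosing one label from each candidate set, which is why cells that already carry a detailed label in the other dataset are informative for the coarsely labeled ones.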

Figure 3. scPLAN integrates datasets with different annotation resolutions. (A) Label hierarchy during integration between PBMCeQTL and PBMCFACS. (B) Visualizations of the integration results for the PBMCeQTL and PBMCFACS datasets. During this process, samples with rough labels are first well mixed with the data carrying detailed labels and then assigned to refined clusters. (C) UMAP visualization of the integration result of NK cells in the PBMC68k dataset with the pan-cancer NK Atlas. (D) Differential expression results for the CD4+ T cell groups from the eQTL dataset with respect to the integration results. (E) Expression of *GZMK*, *GZMB* and *FGFBP2* in the assigned groups from the NK cells of the PBMC68k dataset. Identification of CD56-bright NK cells could be verified by the higher expression level of *GZMK*. (F) NK population composition of the NKAtlas dataset (outer ring) and the scPLAN-refined PBMC68k dataset (inner ring). Most of the NK cells in the PBMC68k dataset are recognized as CD56-dim NK cells. (G) Detailed differential expression analysis of the refined NK groups in the PBMC68k dataset.


Figure 3(B) shows the UMAP visualizations of raw data and latent representations learned by scPLAN on both resolutions. During the broad level integration, we observe efficient mixing of CD4+ T cell latent features from both datasets, indicating effective batch effect correction. Finally, these broad clusters are further divided into clearly separated groups, which means that samples within the broad cell type are assigned to specific cell types. To assess the reliability of this refinement, we conduct a differential expression analysis of the newly categorized CD4+ T cells from the eQTL dataset (Figure 3(D)). Notably, the NT5DC2 gene, part of the NT5DC superfamily, shows higher expression in the regulatory and memory CD4+ T cell subgroups. This correlates with the family’s known association with immune cell infiltration in various cancers, highlighting the role of these cells in maintaining immune response [39, 40]. Additionally, elevated expression of CLEC7A in the memory subgroup supports its predicted role in initiating immune responses [41], while overexpression of EPHX2 in naive CD4+ T cells validates their classification [42]. Similar to the annotation result of the PBMC10Xv2 dataset shown before, misclassification of CD8+ T cells as CD4+ T cells is also observed based on the expression of the CD8+ associated markers C1QC [42] and INPP4B [43]. We also illustrate these genes’ expression levels on a UMAP diagram (Figure S8), revealing distinct expression patterns among the four newly identified groups. Hence, scPLAN not only facilitates the integration of CD4+ T samples from different datasets but also enhances biological interpretation within the CD4+ T cell group of the eQTL dataset.

Next, we illustrate scPLAN’s capacity to enhance annotation resolution in an atlas dataset using newly discovered cell types. Zheng68K is a commonly used atlas dataset for PBMC annotation, and contains 8744 samples labeled as Natural Killer (NK) cells. However, as our understanding of NK cells has deepened, their heterogeneity has become evident. According to the expression level of CD56, NK cells can be divided into CD56-dim and CD56-bright clusters [44–46]. A recent study on NK cell typing in pan-cancer tissues [38] further categorizes CD56-dim NK cells into five groups and CD56-bright NK cells into nine groups, and provides a well-annotated NKAtlas for researchers. Here, we apply scPLAN to Zheng68K and NKAtlas to update the Zheng68K dataset with these new findings in NK cell typing, further dividing the NK cells in Zheng68K into precise sub-cell types.

As depicted in the UMAP (Figure 3(C)) and Sankey diagram (Figure S9), scPLAN successfully corrects the batch effect between the two datasets and divides the original NK samples into the corresponding subtypes labeled in NKAtlas. During the refinement, scPLAN classifies only 1.7% (148 out of 8744) of the original NK samples as CD56-bright. To validate this result, we conduct a differential expression analysis on the identified CD56-dim/bright subgroups (Figure 3(E)) and find a higher expression of GZMK, a granzyme uniquely expressed in CD56-bright NK cells [47], among the identified CD56-bright samples. This result, together with the higher expression of the CD56-dim marker GZMB [47], confirms the accuracy of scPLAN’s refinement at the CD56-bright/dim level. Simultaneously, the Sankey (Figure S9) and pie (Figure 3(F)) diagrams reveal that the majority (95.9%, 8388 out of 8744) of NK cells are identified as the CD56-dim-NR4A3 subtype, with elevated expression levels of the cytotoxic granzymes GZMB and GZMH and the critical marker FGFBP2 (Figure 3(G)). This observation underscores the differences in NK cell composition between healthy individuals and those afflicted with tumors, corroborating prior hypotheses regarding the cytotoxic functionality [38, 44] of CD56-dim NK cells. Furthermore, after relaxing the integration to a single NK group annotation task, we compare the refinement performance of scPLAN with other annotation methods on the Zheng68K NK cluster. Among these results, scPLAN shows the best performance in batch effect correction and subgroup identification (Figures S10–S12, Supplementary SS5). This advantage highlights scPLAN’s capability in identifying expression patterns of arbitrary groups. With these detailed annotations, new findings can be introduced into an existing atlas dataset to keep its labels up to date.
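The proportions reported above follow directly from the cell counts. The short check below recomputes them and derives the number of NK cells that fall outside the two groups named in the text, a figure the text does not state explicitly. The variable names are illustrative.

```python
total_nk = 8744  # NK-labeled cells in Zheng68K
counts = {"CD56-bright": 148, "CD56-dim-NR4A3": 8388}

# Percentage of the original NK cells assigned to each named group.
shares = {name: round(100 * n / total_nk, 1) for name, n in counts.items()}

# Cells assigned to any other subtype.
remaining = total_nk - sum(counts.values())
remaining_share = round(100 * remaining / total_nk, 1)

print(shares)
print(remaining, remaining_share)
```

The two named groups therefore account for all but 208 cells (about 2.4%) of the original NK population.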

In conclusion, scPLAN demonstrates a robust ability to efficiently and accurately integrate datasets with varying annotation resolutions, and to introduce new insights into existing datasets. This feature not only allows for a more profound understanding of refined cell populations within a dataset, but also provides a practical method to update annotations in current atlas datasets. By leveraging scPLAN, researchers can effectively enhance the resolution of cell type annotations and uncover previously unobserved biological insights, contributing to the ongoing advancement of single-cell research.

Ablation studies

To explore the sensitivity of scPLAN to the hyperparameters it introduces, namely the contrastive loss weight and the reconstruction loss weight, we conduct a series of experiments with fixed target data (Enge) and reference data (Baron human), adjusting these hyperparameters to observe any changes in scPLAN’s performance (see Figure S13). Initially, we fix the reconstruction loss weight and vary the contrastive loss weight within the range (0.1, 1.0). As shown in Figure S13(A), the overall accuracy fluctuates only slightly across values. We then select the contrastive loss weight that yields the highest annotation accuracy and vary the reconstruction loss weight (Figure S13(B)). We do not observe any significant changes in accuracy. Overall, scPLAN is robust to the choice of these hyperparameters.
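The two-stage procedure described above (vary the contrastive loss weight with the reconstruction loss weight fixed, keep the best value, then vary the reconstruction loss weight) can be sketched as a small harness. Everything here is an assumption made for illustration: `evaluate` stands in for a full train-and-score run, the grids are arbitrary, and `toy_accuracy` is a made-up function rather than a measured result.

```python
def two_stage_sweep(evaluate, contrastive_grid, recon_grid, recon_default):
    """Coordinate-wise sensitivity sweep over two loss weights.

    `evaluate(contrastive_w, recon_w)` is a hypothetical callable that
    trains a model with the given weights and returns its accuracy.
    """
    # Stage 1: fix the reconstruction weight, vary the contrastive weight.
    stage1 = {w: evaluate(w, recon_default) for w in contrastive_grid}
    best_contrastive = max(stage1, key=stage1.get)
    # Stage 2: keep the best contrastive weight, vary the reconstruction weight.
    stage2 = {w: evaluate(best_contrastive, w) for w in recon_grid}
    return best_contrastive, stage1, stage2


def toy_accuracy(contrastive_w, recon_w):
    # Made-up stand-in: nearly flat in both weights, peaked at 0.5.
    return 0.9 - 0.01 * (contrastive_w - 0.5) ** 2


grid = [0.1, 0.3, 0.5, 0.7, 1.0]
best, stage1, stage2 = two_stage_sweep(toy_accuracy, grid, grid, recon_default=1.0)
spread = max(stage2.values()) - min(stage2.values())
print(best, spread)
```

A small spread in each stage is what "robust to the choice of hyperparameters" means in practice; the sweep does not explore interactions between the two weights.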

We also design experiments on several datasets to illustrate that the active set selection strategy in the contrastive learning module aids in correcting batch effects. In these ablation experiments, the active set of a feature contains only latent features from the same dataset. As demonstrated in Figure S14, we conduct this ablation on both the annotation task and the integration task. During annotation on the Enge dataset, the modified scPLAN loses its ability to align samples from different datasets but still gives correct predictions. However, on the PBMC10Xv2 annotation task and the integration task between eQTL and FACS, the unresolved batch effect causes a dramatic drop in performance. These results confirm that, without a cross-dataset contrastive basis, scPLAN cannot minimize the gap between different datasets and therefore loses its ability to effectively align these samples. Such a failure can be catastrophic for an integration task or for an annotation task on a complex dataset. Consequently, it is essential to include samples from different origins in contrastive learning.
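The difference between the full and the ablated setting can be illustrated with the pool of latent features that an anchor is allowed to be contrasted against. The sketch below only shows how that pool changes when it is restricted to the anchor's own dataset. The function name, the flag and the dataset tags are hypothetical, and any label-based filtering that the actual contrastive module applies on top of this pool is omitted.

```python
def active_set(anchor, dataset_ids, same_dataset_only=False):
    """Indices of latent features available for contrast with `anchor`.

    With same_dataset_only=True (the ablated setting), candidates from
    other datasets are dropped.
    """
    return [
        j
        for j, dataset in enumerate(dataset_ids)
        if j != anchor
        and (not same_dataset_only or dataset == dataset_ids[anchor])
    ]


# Two reference cells followed by three target cells.
dataset_ids = ["reference", "reference", "target", "target", "target"]

full = active_set(0, dataset_ids)
ablated = active_set(0, dataset_ids, same_dataset_only=True)
print(full)     # candidates drawn from both datasets
print(ablated)  # candidates drawn from the reference dataset only
```

In the ablated setting no contrastive pair ever spans the two datasets, so nothing in this loss pulls their latent features together, which is consistent with the loss of alignment reported above.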

Discussion

In this work, we propose scPLAN, a hierarchical computational framework for scRNA-seq data. First, we evaluate annotation performance on six scRNA-seq or snRNA-seq datasets from different tissues and different platforms. We demonstrate its annotation superiority over existing non-hierarchical annotation methods as well as over a non-hierarchical version of scPLAN itself, suggesting that the hierarchical structure can enhance annotation consistency and thus improve annotation accuracy. We further conduct novel cell annotation experiments on two other scRNA-seq datasets and demonstrate that scPLAN can not only find novel cells but also reveal the relationship between novel cells and known cell types. Next, we evaluate the performance of scPLAN when integrating annotated scRNA-seq datasets with diverse annotation depths. We show that scPLAN can refine broad cell-type labels using the detailed cell-type information in the other dataset.

More specifically, for the annotation task, we first highlight scPLAN’s annotation performance by comparing it with seven competing methods. As shown before, scPLAN can efficiently align samples from different datasets and annotate them with higher precision than other methods. We also demonstrate scPLAN’s ability to learn effective expression patterns through a case study on PBMC10Xv2 annotation, where scPLAN successfully identifies misclassified CD4+ samples. Through two annotation experiments on the Lung and Xin datasets, we further exhibit scPLAN’s capability to identify novel cells together with a potential hierarchy for the novel types. In the Lung dataset annotation, scPLAN successfully identifies ‘epithelial cell of lung’ at the first stage and suggests that these novel cells have expression patterns distinct from those of other known cell types, which is confirmed by differential expression analysis. In the Xin dataset annotation, scPLAN initially fails to recognize the novel cells because of their similarity to known cells. However, in the second stage, scPLAN identifies these novel cells and forms compact clusters for them. This result is consistent with the similarity in expression between novel and known cells. For the integration task, we underscore scPLAN’s capability to integrate datasets of different annotation resolutions and refine broader labels using two pairs of PBMC datasets. On the eQTL and FACS datasets, we have shown that scPLAN effectively integrates CD4+ T samples that have different annotation depths across the datasets. Meanwhile, scPLAN refines the rough CD4+ T annotation into more detailed labels, whose biological significance has been confirmed by differential expression analysis. We also showcase scPLAN’s integration of the Zheng68K and NKAtlas datasets. By tracking the refinement of natural killer cells in Zheng68K, we further exhibit scPLAN’s power to introduce novel observations into a current atlas dataset.

The results discussed in this work demonstrate that scPLAN is well suited for scRNA-seq data annotation and multi-dataset integration. However, scPLAN still has limitations. For instance, during novel cell perception in the annotation task, the FNR of scPLAN’s predictions could be further improved. Meanwhile, integration with scPLAN still requires an auxiliary label hierarchy from scHPL, which breaks scPLAN’s end-to-end property.

In conclusion, scPLAN is a versatile framework that can hierarchically annotate cells with known type labels, detect potential novel cells and integrate datasets with various annotation resolutions. We used scPLAN to analyze scRNA-seq and snRNA-seq data in this work. The framework, which combines a deep autoencoder with contrastive partial learning techniques, can be extended to other measurement modalities such as spatial transcriptomics or multiplexed immunofluorescence data, which is an interesting direction for future work. The relationship between detected novel cells and existing cell types could also be investigated in more detail.

Key Points

We present scPLAN, a hierarchical computational framework for scRNA-seq data. scPLAN can annotate unlabeled scRNA-seq data in a systematic, layer-by-layer manner, bringing about higher annotation consistency and accuracy.
scPLAN can hierarchically detect potential novel cells, offering additional information about the relationship between novel cells and existing cell types in reference data.
scPLAN can integrate annotated datasets with different annotation depths. Two case studies illustrate that scPLAN can refine cell-type labels using higher quality data or newly discovered cell types.

Supplementary Material

scPLAN_supp_revised_bbae305

scplan_supp_revised_bbae305.pdf (9.3 MB, PDF)

Acknowledgements

This work was supported by the National Key Research and Development Program of China (No. 2021YFF1200500, 2021YFF1200902) and the National Natural Science Foundation of China (No. 12225102, T2321001, 31871342 and 12288101).

Contributor Information

Qirui Guo, Center for Quantitative Biology, Peking University, Yiheyuan Road, 100871, Beijing, China.

Musu Yuan, Center for Quantitative Biology, Peking University, Yiheyuan Road, 100871, Beijing, China.

Lei Zhang, Center for Quantitative Biology, Peking University, Yiheyuan Road, 100871, Beijing, China; Beijing International Center for Mathematical Research, Peking University, Yiheyuan Road, 100871, Beijing, China; Center for Machine Learning Research, Peking University, Yiheyuan Road, 100871, Beijing, China.

Minghua Deng, Center for Quantitative Biology, Peking University, Yiheyuan Road, 100871, Beijing, China; School of Mathematical Sciences, Peking University, Yiheyuan Road, 100871, Beijing, China; Center for Statistical Science, Peking University, Yiheyuan Road, 100871, Beijing, China.

References

1. Hao Y, Stuart T, Kowalski MH, et al. Dictionary learning for integrative, multimodal and scalable single-cell analysis. Nat Biotechnol 2024; 42:293–304. 10.1038/s41587-023-01767-y.
2. Kiselev VY, Kirschner K, Schaub MT, et al. SC3: consensus clustering of single-cell RNA-seq data. Nat Methods 2017; 14:483–6. 10.1038/nmeth.4236.
3. Kiselev VY, Yiu A, Hemberg M. scmap: projection of single-cell RNA-seq data across data sets. Nat Methods 2018; 15:359–62. 10.1038/nmeth.4644.
4. Kang JB, Nathan A, Weinand K, et al. Efficient and precise single-cell reference atlas mapping with Symphony. Nat Commun 2021; 12:5890. 10.1038/s41467-021-25957-x.
5. Brbić M, Zitnik M, Wang S, et al. MARS: discovering novel cell types across heterogeneous single-cell experiments. Nat Methods 2020; 17:1200–6. 10.1038/s41592-020-00979-3.
6. Hu J, Li X, Hu G, et al. Iterative transfer learning with neural network for clustering and cell type classification in single-cell RNA-seq analysis. Nat Mach Intell 2020; 2:607–18. 10.1038/s42256-020-00233-7.
7. Yuan M, Chen L, Deng M. scMRA: a robust deep learning method to annotate scRNA-seq data with multiple reference datasets. Bioinformatics 2022; 38:738–45. 10.1093/bioinformatics/btab700.
8. Lotfollahi M, Naghipourfar M, Luecken MD, et al. Mapping single-cell data to reference atlases by transfer learning. Nat Biotechnol 2022; 40:121–30. 10.1038/s41587-021-01001-7.
9. Lu L, Welch JD. PyLiger: scalable single-cell multi-omic data integration in Python. Bioinformatics 2022; 38:2946–8. 10.1093/bioinformatics/btac190.
10. Johansen N, Quon GT. scAlign: a tool for alignment, integration, and rare cell identification from scRNA-seq data. Genome Biol 2019; 20:166. 10.1186/s13059-019-1766-4.
11. Korsunsky I, Fan J, Slowikowski K, et al. Fast, sensitive, and accurate integration of single cell data with Harmony. Nat Methods 2019; 16:1289–96. 10.1038/s41592-019-0619-0.
12. Johnson TS, Huang Z, Yu CY, et al. LAmbDA: label ambiguous domain adaptation dataset integration reduces batch effects and improves subtype detection. Bioinformatics 2019; 35:4696–706. 10.1093/bioinformatics/btz295.
13. Sun Y, Qiu P. Domain adaptation for supervised integration of scRNA-seq data. Commun Biol 2023; 6:274. 10.1038/s42003-023-04668-7.
14. Michielsen L, Reinders MJT, Mahfouz A. Hierarchical progressive learning of cell identities in single-cell data. Nat Commun 2021; 12:2799. 10.1038/s41467-021-23196-8.
15. Wang H, et al. PiCO+: contrastive label disambiguation for robust partial label learning. IEEE Trans Pattern Anal Mach Intell 2024; 46(5):3183–98. 10.1109/TPAMI.2023.3342650.
16. Eraslan G, Simon LM, Mircea M, et al. Single-cell RNA-seq denoising using a deep count autoencoder. Nat Commun 2019; 10:390. 10.1038/s41467-018-07931-2.
17. He K, Fan H, Wu Y, et al. Momentum contrast for unsupervised visual representation learning. 2020; arXiv:1911.05722 [cs].
18. Khosla P, Teterwak P, Wang C, et al. Supervised contrastive learning. 2021; arXiv:2004.11362 [cs, stat].
19. Wan H, Chen L, Deng M. scEMAIL: universal and source-free annotation method for scRNA-seq data with novel cell-type perception. Genomics Proteomics Bioinf 2022; 20:939–58. 10.1016/j.gpb.2022.12.008.
20. Aran D, Looney AP, Liu L, et al. Reference-based analysis of lung single-cell sequencing reveals a transitional profibrotic macrophage. Nat Immunol 2019; 20:163–72. 10.1038/s41590-018-0276-y.
21. Lotfollahi M, Rybakov S, Hrovatin K, et al. Biologically informed deep learning to query gene programs in single-cell atlases. Nat Cell Biol 2023; 25:337–50. 10.1038/s41556-022-01072-x.
22. Enge M, Arda HE, Mignardi M, et al. Single-cell analysis of human pancreas reveals transcriptional signatures of aging and somatic mutation patterns. Cell 2017; 171:321–330.e14. 10.1016/j.cell.2017.09.004.
23. Segerstolpe S, Palasantza A, Eliasson P, et al. Single-cell transcriptome profiling of human pancreatic islets in health and type 2 diabetes. Cell Metab 2016; 24:593–607. 10.1016/j.cmet.2016.08.020.
24. Lawlor N, George J, Bolisetty M, et al. Single-cell transcriptomes identify human islet cell signatures and reveal cell-type-specific expression changes in type 2 diabetes. Genome Res 2017; 27:208–22. 10.1101/gr.212720.116.
25. Muraro M, Dharmadhikari G, Grün D, et al. A single-cell transcriptome atlas of the human pancreas. Cell Syst 2016; 3:385–394.e3. 10.1016/j.cels.2016.09.002.
26. Baron M, Veres A, Wolock SL, et al. A single-cell transcriptomic map of the human and mouse pancreas reveals inter- and intra-cell population structure. Cell Syst 2016; 3:346–360.e4. 10.1016/j.cels.2016.08.011.
27. Wu H, Uchimura K, Donnelly EL, et al. Comparative analysis and refinement of human PSC-derived kidney organoid differentiation with single-cell transcriptomics. Cell Stem Cell 2018; 23:869–881.e8. 10.1016/j.stem.2018.10.010.
28. Ding J, Adiconis X, Simmons SK, et al. Systematic comparison of single-cell and single-nucleus RNA-sequencing methods. Nat Biotechnol 2020; 38:737–46. 10.1038/s41587-020-0465-8.
29. Park J, Shrestha R, Qiu C, et al. Single-cell transcriptomics of the mouse kidney reveals potential cellular targets of kidney disease. Science 2018; 360:758–63. 10.1126/science.aar2131.
30. Zheng GXY, Terry JM, Belgrader P, et al. Massively parallel digital transcriptional profiling of single cells. Nat Commun 2017; 8:14049. 10.1038/ncomms14049.
31. McInnes L, Healy J, Melville J. UMAP: uniform manifold approximation and projection for dimension reduction. 2020; arXiv:1802.03426 [cs, stat].
32. Schaum N, Karkanias J, Neff NF, et al. Single-cell transcriptomics of 20 mouse organs creates a Tabula Muris. Nature 2018; 562:367–72.
33. Xin Y, Kim J, Okamoto H, et al. RNA sequencing of single human islet cells reveals type 2 diabetes genes. Cell Metab 2016; 24:608–15. 10.1016/j.cmet.2016.08.018.
34. Bingle L, Cross SS, High AS, et al. WFDC2 (HE4): a potential role in the innate immunity of the oral cavity and respiratory tract and the development of adenocarcinomas of the lung. Respir Res 2006; 7:61. 10.1186/1465-9921-7-61.
35. Mutze K, Vierkotten S, Milosevic J, et al. Enolase 1 (ENO1) and protein disulfide-isomerase associated 3 (PDIA3) regulate Wnt/β-catenin-driven trans-differentiation of murine alveolar epithelial cells. Dis Model Mech 2015; 8:877–90. 10.1242/dmm.019117.
36. Bi G, Wu L, Huang P, et al. Up-regulation of SFTPB expression and attenuation of acute lung injury by pulmonary epithelial cell-specific NAMPT knockdown. FASEB J 2018; 32:3583–96. 10.1096/fj.201701059R.
37. van der Wijst MGP, Brugge H, de Vries DH, et al. Single-cell RNA sequencing identifies celltype-specific cis-eQTLs and co-expression QTLs. Nat Genet 2018; 50:493–7. 10.1038/s41588-018-0089-9.
38. Tang F, Li J, Qi L, et al. A pan-cancer single-cell panorama of human natural killer cells. Cell 2023; 186:4235–4251.e20. 10.1016/j.cell.2023.07.034.
39. Jia Y, Li J, Wu H, et al. Comprehensive analysis of NT5DC family prognostic and immune significance in breast cancer. Medicine 2023; 102:e32927. 10.1097/MD.0000000000032927.
40. Zhu Z, Hou Q, Guo H. NT5DC2 knockdown inhibits colorectal carcinoma progression by repressing metastasis, angiogenesis and tumor-associated macrophage recruitment: a mechanism involving VEGF signaling. Exp Cell Res 2020; 397:112311. 10.1016/j.yexcr.2020.112311.
41. Al Madhoun A, Kochumon S, Al-Rashed F, et al. Dectin-1 as a potential inflammatory biomarker for metabolic inflammation in adipose tissue of individuals with obesity. Cells 2022; 11:2879. 10.3390/cells11182879.
42. Pan Q, Cheng Y, Cheng D. Identification of CD8+ T cell-related genes: correlations with immune phenotypes and outcomes of liver cancer. J Immunol Res 2021; 9960905. 10.1155/2021/9960905.
43. Rose JR, Akdogan-Ozdilek B, Rahmberg AR, et al. Distinct transcriptomic and epigenomic modalities underpin human memory T cell subsets and their activation potential. Commun Biol 2023; 6:1–19. 10.1038/s42003-023-04747-9.
44. Cooper MA, Fehniger TA, Caligiuri MA. The biology of human natural killer-cell subsets. Trends Immunol 2001; 22:633–40. 10.1016/S1471-4906(01)02060-9.
45. Poli A, Michel T, Thérésine M, et al. CD56bright natural killer (NK) cells: an important NK cell subset. Immunology 2009; 126:458–65. 10.1111/j.1365-2567.2008.03027.x.
46. Caligiuri MA. Human natural killer cells. Blood 2008; 112:461–9. 10.1182/blood-2007-09-077438.
47. Bade B, Boettcher HE, Lohrmann J, et al. Differential expression of the granzymes A, K and M and perforin in human peripheral blood lymphocytes. Int Immunol 2005; 17:1419–28. 10.1093/intimm/dxh320.

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

scPLAN_supp_revised_bbae305

scplan_supp_revised_bbae305.pdf^{(9.3MB, pdf)}

[ref1] 1. Hao Y, Stuart T, Kowalski MH.. et al. Dictionary learning for integrative, multimodal and scalable single-cell analysis. Nat Biotechnol 2024; 42:293–304. 10.1038/s41587-023-01767-y. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref2] 2. Kiselev VY, Kirschner K, Schaub MT.. et al. SC3: consensus clustering of single-cell RNA-seq data. Nat Methods 2017; 14:483–6. 10.1038/nmeth.4236. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref3] 3. Kiselev VY, Yiu A, Hemberg M. Scmap: projection of single-cell RNA-seq data across data sets. Nat Methods 2018; 15:359–62. 10.1038/nmeth.4644. [DOI] [PubMed] [Google Scholar]

[ref4] 4. Kang JB, Nathan A, Weinand K.. et al. Efficient and precise single-cell reference atlas mapping with Symphony. Nat Commun 2021; 12:5890. 10.1038/s41467-021-25957-x. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref5] 5. Brbić M, Zitnik M, Wang S.. et al. MARS: discovering novel cell types across heterogeneous single-cell experiments. Nat Methods 2020; 17:1200–6. 10.1038/s41592-020-00979-3. [DOI] [PubMed] [Google Scholar]

[ref6] 6. Hu J, Li X, Hu G.. et al. Iterative transfer learning with neural network for clustering and cell type classification in single-cell RNA-seq analysis. Nat Mach Intell 2020; 2:607–18. 10.1038/s42256-020-00233-7. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref7] 7. Yuan M, Chen L, Deng M. scMRA: a robust deep learning method to annotate scRNA-seq data with multiple reference datasets. Bioinformatics 2022; 38:738–45. 10.1093/bioinformatics/btab700. [DOI] [PubMed] [Google Scholar]

[ref8] 8. Lotfollahi M, Naghipourfar M, Luecken MD.. et al. Mapping single-cell data to reference atlases by transfer learning. Nat Biotechnol 2022; 40:121–30. 10.1038/s41587-021-01001-7. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref9] 9. Lu L, Welch JD. PyLiger: scalable single-cell multi-omic data integration in python. Bioinformatics 2022; 38:2946–8. 10.1093/bioinformatics/btac190. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref10] 10. Johansen N, Quon GT. Scalign: a tool for alignment, integration, and rare cell identification from scrna-seq data. Genome Biol 2019; 20:166. 10.1186/s13059-019-1766-4. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref11] 11. Korsunsky I, Fan J, Slowikowski K.. et al. Fast, sensitive, and accurate integration of single cell data with harmony. Nat Methods 2019; 16:1289–96. 10.1038/s41592-019-0619-0. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref12] 12. Johnson TS, Huang Z, Yu CY.. et al. Lambda: label ambiguous domain adaptation dataset integration reduces batch effects and improves subtype detection. Bioinformatics 2019; 35:4696–706. 10.1093/bioinformatics/btz295. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref13] 13.Sun Y, Qiu P. Domain adaptation for supervised integration of scRNA-seq data. Commun Biol 2023;6:274. 10.1038/s42003-023-04668-7. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref14] 14. Michielsen L, Reinders MJT, Mahfouz A. Hierarchical progressive learning of cell identities in single-cell data. Nat Commun 2021; 12:2799. 10.1038/s41467-021-23196-8. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref15] 15.Wang H. et al. PiCO+: Contrastive Label Disambiguation for Robust Partial Label Learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 2024;46:5, 3183–3198. 10.1109/TPAMI.2023.3342650. [DOI] [PubMed] [Google Scholar]

[ref16] 16. Eraslan G, Simon LM, Mircea M.. et al. Single-cell RNA-seq denoising using a deep count autoencoder. Nat Commun 2019; 10:390. 10.1038/s41467-018-07931-2. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref17] 17. He K, Fan H, Wu Y.. et al. Momentum contrast for unsupervised visual representation learning. 2020; arXiv:1911.05722 [cs].

[ref18] 18. Khosla P, Teterwak P, Wang C.. et al. Supervised contrastive learning. 2021; arXiv:2004.11362 [cs, stat].

[ref19] 19. Wan H, Chen L, Deng M. scEMAIL: universal and source-free annotation method for scRNA-seq data with novel cell-type perception. Genomics Proteomics Bioinf 2022; 20:939–58. 10.1016/j.gpb.2022.12.008. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref20] 20. Aran D, Looney AP, Liu L.. et al. Reference-based analysis of lung single-cell sequencing reveals a transitional profibrotic macrophage. Nat Immunol 2019; 20:163–72. 10.1038/s41590-018-0276-y. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref21] 21. Lotfollahi M, Rybakov S, Hrovatin K.. et al. Biologically informed deep learning to query gene programs in single-cell atlases. Nat Cell Biol 2023; 25:337–50. 10.1038/s41556-022-01072-x. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref22] 22. Enge M, Arda HE, Mignardi M. et al. Single-cell analysis of human pancreas reveals transcriptional signatures of aging and somatic mutation patterns. Cell 2017; 171:321–330.e14. 10.1016/j.cell.2017.09.004.

[ref23] 23. Segerstolpe S, Palasantza A, Eliasson P. et al. Single-cell transcriptome profiling of human pancreatic islets in health and type 2 diabetes. Cell Metab 2016; 24:593–607. 10.1016/j.cmet.2016.08.020.

[ref24] 24. Lawlor N, George J, Bolisetty M. et al. Single-cell transcriptomes identify human islet cell signatures and reveal cell-type-specific expression changes in type 2 diabetes. Genome Res 2017; 27:208–22. 10.1101/gr.212720.116.

[ref25] 25. Muraro M, Dharmadhikari G, Grün D. et al. A single-cell transcriptome atlas of the human pancreas. Cell Syst 2016; 3:385–394.e3. 10.1016/j.cels.2016.09.002.

[ref26] 26. Baron M, Veres A, Wolock SL. et al. A single-cell transcriptomic map of the human and mouse pancreas reveals inter- and intra-cell population structure. Cell Syst 2016; 3:346–360.e4. 10.1016/j.cels.2016.08.011.

[ref27] 27. Wu H, Uchimura K, Donnelly EL. et al. Comparative analysis and refinement of human PSC-derived kidney organoid differentiation with single-cell transcriptomics. Cell Stem Cell 2018; 23:869–881.e8. 10.1016/j.stem.2018.10.010.

[ref28] 28. Ding J, Adiconis X, Simmons SK. et al. Systematic comparison of single-cell and single-nucleus RNA-sequencing methods. Nat Biotechnol 2020; 38:737–46. 10.1038/s41587-020-0465-8.

[ref29] 29. Park J, Shrestha R, Qiu C. et al. Single-cell transcriptomics of the mouse kidney reveals potential cellular targets of kidney disease. Science 2018; 360:758–63. 10.1126/science.aar2131.

[ref30] 30. Zheng GXY, Terry JM, Belgrader P. et al. Massively parallel digital transcriptional profiling of single cells. Nat Commun 2017; 8:14049. 10.1038/ncomms14049.

[ref31] 31. McInnes L, Healy J, Melville J. UMAP: uniform manifold approximation and projection for dimension reduction. 2020; arXiv:1802.03426 [cs, stat].

[ref32] 32. Schaum N, Karkanias J, Neff NF. et al. Single-cell transcriptomics of 20 mouse organs creates a Tabula Muris. Nature 2018; 562:367–72.

[ref33] 33. Xin Y, Kim J, Okamoto H. et al. RNA sequencing of single human islet cells reveals type 2 diabetes genes. Cell Metab 2016; 24:608–15. 10.1016/j.cmet.2016.08.018.

[ref34] 34. Bingle L, Cross SS, High AS. et al. WFDC2 (HE4): a potential role in the innate immunity of the oral cavity and respiratory tract and the development of adenocarcinomas of the lung. Respir Res 2006; 7:61. 10.1186/1465-9921-7-61.

[ref35] 35. Mutze K, Vierkotten S, Milosevic J. et al. Enolase 1 (ENO1) and protein disulfide-isomerase associated 3 (PDIA3) regulate Wnt/β-catenin-driven trans-differentiation of murine alveolar epithelial cells. Dis Model Mech 2015; 8:877–90. 10.1242/dmm.019117.

[ref36] 36. Bi G, Wu L, Huang P. et al. Up-regulation of SFTPB expression and attenuation of acute lung injury by pulmonary epithelial cell-specific NAMPT knockdown. FASEB J 2018; 32:3583–96. 10.1096/fj.201701059R.

[ref37] 37. van der Wijst MGP, Brugge H, de Vries DH. et al. Single-cell RNA sequencing identifies celltype-specific cis-eQTLs and co-expression QTLs. Nat Genet 2018; 50:493–7. 10.1038/s41588-018-0089-9.

[ref38] 38. Tang F, Li J, Qi L. et al. A pan-cancer single-cell panorama of human natural killer cells. Cell 2023; 186:4235–4251.e20. 10.1016/j.cell.2023.07.034.

[ref39] 39. Jia Y, Li J, Wu H. et al. Comprehensive analysis of NT5DC family prognostic and immune significance in breast cancer. Medicine 2023; 102:e32927. 10.1097/MD.0000000000032927.

[ref40] 40. Zhu Z, Hou Q, Guo H. NT5DC2 knockdown inhibits colorectal carcinoma progression by repressing metastasis, angiogenesis and tumor-associated macrophage recruitment: a mechanism involving VEGF signaling. Exp Cell Res 2020; 397:112311. 10.1016/j.yexcr.2020.112311.

[ref41] 41. Al Madhoun A, Kochumon S, Al-Rashed F. et al. Dectin-1 as a potential inflammatory biomarker for metabolic inflammation in adipose tissue of individuals with obesity. Cells 2022; 11:2879. 10.3390/cells11182879.

[ref42] 42. Pan Q, Cheng Y, Cheng D. Identification of CD8+ T cell-related genes: correlations with immune phenotypes and outcomes of liver cancer. J Immunol Res 2021; 2021:9960905. 10.1155/2021/9960905.

[ref43] 43. Rose JR, Akdogan-Ozdilek B, Rahmberg AR. et al. Distinct transcriptomic and epigenomic modalities underpin human memory T cell subsets and their activation potential. Commun Biol 2023; 6:1–19. 10.1038/s42003-023-04747-9.

[ref44] 44. Cooper MA, Fehniger TA, Caligiuri MA. The biology of human natural killer-cell subsets. Trends Immunol 2001; 22:633–40. 10.1016/S1471-4906(01)02060-9.

[ref45] 45. Poli A, Michel T, Thérésine M. et al. CD56^bright natural killer (NK) cells: an important NK cell subset. Immunology 2009; 126:458–65. 10.1111/j.1365-2567.2008.03027.x.

[ref46] 46. Caligiuri MA. Human natural killer cells. Blood 2008; 112:461–9. 10.1182/blood-2007-09-077438.

[ref47] 47. Bade B, Boettcher HE, Lohrmann J. et al. Differential expression of the granzymes A, K and M and perforin in human peripheral blood lymphocytes. Int Immunol 2005; 17:1419–28. 10.1093/intimm/dxh320.
