Briefings in Bioinformatics. 2024 Feb 22;25(2):bbae047. doi: 10.1093/bib/bbae047

Continually adapting pre-trained language model to universal annotation of single-cell RNA-seq data

Hui Wan 1, Musu Yuan 2, Yiwei Fu 3, Minghua Deng 4,5,6
PMCID: PMC10883808  PMID: 38388681

Abstract

Motivation

Cell-type annotation of single-cell RNA-sequencing (scRNA-seq) data is a cornerstone of biomedical research and clinical application. Current annotation tools usually assume that all well-annotated data are acquired simultaneously and lack the ability to expand their knowledge from new data. Such tools are thus ill-suited to the continuous emergence of scRNA-seq data, calling for a continual cell-type annotation model. In addition, owing to their powerful capacity for information integration and their model interpretability, transformer-based pre-trained language models have led to breakthroughs in single-cell biology research. Systematically combining continual learning with pre-trained language models for cell-type annotation tasks is therefore a natural and necessary step.

Results

We herein propose a universal cell-type annotation tool, called CANAL, that continuously fine-tunes a pre-trained language model trained on a large amount of unlabeled scRNA-seq data, as new well-labeled data emerges. CANAL essentially alleviates the dilemma of catastrophic forgetting, both in terms of model inputs and outputs. For model inputs, we introduce an experience replay schema that repeatedly reviews previous vital examples in current training stages. This is achieved through a dynamic example bank with a fixed buffer size. The example bank is class-balanced and proficient in retaining cell-type-specific information, particularly facilitating the consolidation of patterns associated with rare cell types. For model outputs, we utilize representation knowledge distillation to regularize the divergence between previous and current models, resulting in the preservation of knowledge learned from past training stages. Moreover, our universal annotation framework considers the inclusion of new cell types throughout the fine-tuning and testing stages. We can continuously expand the cell-type annotation library by absorbing new cell types from newly arrived, well-annotated training datasets, as well as automatically identify novel cells in unlabeled datasets. Comprehensive experiments with data streams under various biological scenarios demonstrate the versatility and high model interpretability of CANAL.

Availability

An implementation of CANAL is available from https://github.com/aster-ww/CANAL-torch.

Contact

dengmh@pku.edu.cn

Supplementary information

Supplementary data are available at Briefings in Bioinformatics online.

Keywords: cell-type annotation, continual learning, pre-trained language model, single-cell transcriptome

INTRODUCTION

Single-cell RNA sequencing (scRNA-seq) has been extensively utilized to quantify single-cell transcriptional profiling and reveal tissue heterogeneity at the cellular level [1]. Accurate cell-type identification is a crucial step in scRNA-seq data analysis, as it can elucidate complex stochastic biological processes and improve our understanding of the evolutionary mechanisms that shape cellular form and function.

Ever since their introduction to scRNA-seq annotation, deep learning-based methods have quickly gained popularity. These methods primarily utilize knowledge of gene expression profiles from well-annotated source datasets and then transfer labels to unlabeled target datasets. Compared with traditional annotation approaches, deep learning methods are more automated with improved pattern recognition capabilities and superior performance. Existing algorithms mainly employ traditional network architectures such as autoencoders [1, 2] or variational autoencoders [3]. However, the dimensionality reduction and nonlinear aggregation layers of traditional deep neural networks make the final learned representations abstract and hard to trace back to the original inputs, which leads to a loss of interpretability [4]. Moreover, their network architectures do not effectively capture biological gene–gene interactions. Recently, owing to the powerful self-attention mechanism and ability to integrate information, transformer-based language models have attracted considerable attention. As a prime example, scBERT [5] is based on bidirectional encoder representations from transformers (BERT) [6] and follows the classic pre-training and fine-tuning paradigm used in natural language processing. scBERT first leverages a substantial amount of unlabeled scRNA-seq data to learn the general patterns of gene–gene interactions. It then transfers the pre-trained language model (PTLM) to various downstream annotation tasks in a supervised fine-tuning manner. scBERT has achieved state-of-the-art performance on cell-type annotation tasks with gene-level interpretability, demonstrating the promise of pre-training strategies for single-cell biology research.

However, scBERT assumes that all well-annotated data are available simultaneously for offline learning. It lacks the ability to incrementally incorporate newly collected data and thus has no way of extending the acquired knowledge for future learning. This means that each time new datasets become available, the fine-tuning process begins anew, making this practice inefficient and labor-intensive given the continuous arrival of single-cell data, not to mention infeasible owing to data-privacy issues and storage limitations. This calls for the continuous updating of a pre-trained model, along with the ability to expand knowledge from new data. A natural approach is to simply fine-tune the model sequentially on new data, but this results in the dilemma known as catastrophic forgetting (CF) [7]: performance on previously learned data typically degrades after the model is updated on recent data. This reflects a more general issue in deep learning known as the stability–plasticity dilemma, wherein stability refers to the ability to consolidate past knowledge, and plasticity refers to the capacity for fast adaptation to integrate novel information [8]. Therefore, an algorithm that strikes a balance between stability and plasticity is desirable, retaining previous knowledge while also incorporating new information.

Some recently proposed analytic tools for scRNA-seq data attempt to incorporate the concept of continual learning (also known as online learning). For example, online iNMF [9] extends the non-negative matrix factorization approach based on LIGER [10] to iteratively integrate single-cell datasets by decoupling shared and dataset-specific factors related to cell identities. Additionally, two deep learning data integration methods, scArches [11] and SCALEX [12], are based on the framework of variational autoencoders and also allow for the continuous arrival of datasets. For iterative reference building, scArches implements an architecture surgery approach whereby new study labels are incorporated as new input nodes, and it only fine-tunes the parameters connecting newly added studies. By contrast, the encoder of SCALEX is composed of a data projection function that only preserves batch-invariant biological data components, thus requiring no retraining on new data. However, the representation ability and generalization capability of these methods, based on either traditional statistical methods or neural networks, are still not comparable with a PTLM. Moreover, although the above algorithms implement an online framework, they do not address the CF issue from which continual-learning models severely suffer in practice. Also, most current online methods aim at unlabeled single-cell data integration; none is customized for cell-type annotation tasks.

Thus, we believe an explicit and systematic framework of continual learning, based on a PTLM, should be established, with tailored designs to overcome CF. Utilizing such a framework would allow us to leverage the representation and generalization capabilities of the PTLM, continuously fine-tuning the pre-trained model to obtain high annotation accuracy on new datasets while avoiding rapid deterioration in performance on previous datasets. Accordingly, we herein introduce a novel paradigm, termed CANAL (Continual ANnotation framework via Adapting pre-trained Language model), which aims at the continuous adaptation of a PTLM for universal annotation of scRNA-seq data, with the pre-trained scBERT model as initialization. CANAL essentially alleviates CF in terms of both model inputs and outputs:

  • For model inputs, we design an experience replay schema that systematically revisits crucial examples from prior training stages within the ongoing training stages, utilizing a dynamic example bank. This example bank is maintained within a class-balanced paradigm, which proves advantageous in consolidating knowledge pertaining to rare cell types.

  • For model outputs, we employ representation knowledge distillation to regularize the discrepancy between previous and current models. Specifically, we impose constraints on the model outputs at the intermediate layer, thereby constraining the extent to which the new model deviates from its predecessor.

Furthermore, our universal annotation framework facilitates the seamless integration of new cell types throughout both the fine-tuning and testing stages. This implies that we can continuously expand the cell-type annotation library by assimilating new cell types from recently arrived, well-annotated training datasets, and also identify novel cells within unlabeled test datasets. To the best of our knowledge, we are the first to adapt a powerful PTLM to practical continual-learning conditions, and the first to consider the CF issue in the field of scRNA-seq data analysis. Extensive experiments conducted on diverse datasets encompassing various biological scenarios demonstrate the superiority of CANAL over existing online methods by leveraging the benefits of the PTLM while continuously carrying its accumulated knowledge forward to future learning. CANAL excels not only in providing robust and accurate cell-type annotations, with a particular proficiency in discriminating rare cell types, but also in upholding a high level of gene-level interpretability. This attribute empowers the identification of cell-type-specific discriminative genes, thereby offering crucial insights for subsequent biological investigations.

Figure 1.

Overview of CANAL. The well-annotated data stream evolves over time, and the PTLM undergoes continuous adaptation at different training stages. The cell-type sets need not fully overlap across stages. When new cell types emerge in a new stage, our classifier is updated to encompass the entire set of encountered cell types. The fine-tuned model, denoted as $M_t$, exhibits sufficient generalizability to directly annotate diverse unlabeled test datasets, thereby achieving universal annotation.

METHODS

We begin with the problem setting and notations. Suppose that a PTLM $M$ is updated sequentially over each well-annotated dataset $D_t$ at stage $t$. $D_t$ consists of $n_t$ well-annotated cells with gene expression vectors $x_i$ and labels $y_i$. $M$ is a BERT-based transformer (see Supplementary Materials Section 2 for the detailed network structure), and for each cell the learned representation is $h_i = M(x_i)$. A linear classifier $C$ is then plugged on top of $M$ to produce the output logits $C(h_i)$. At stage $t$, $M$ has no access to the full earlier corpora $D_1, \dots, D_{t-1}$ in the data stream. This requirement reflects practical constraints, such as the computational burden of training over all datasets in the stream, as well as privacy concerns around storing previous data. The fine-tuned language model right after updating on $D_t$ is denoted $M_t$, and the initial model $M_0$ is fed with pre-trained scBERT weights in our study.

For supervised cell-type prediction, we first employ a cross-entropy loss at every stage $t$:

$$\mathcal{L}_{\mathrm{cls}} = -\frac{1}{n_t} \sum_{i=1}^{n_t} y_i^{\top} \log \hat{y}_i, \qquad (1)$$

where $y_i$ is the one-hot encoded label of cell $i$, and $\hat{y}_i$ is the prediction vector obtained from the classifier output $C(M(x_i))$ after softmax normalization. To alleviate CF with limited memory and computation power, we consider solving this issue from the perspective of model inputs and outputs.
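As a concrete reference, Equation (1) is the standard mean cross-entropy between one-hot labels and softmax-normalized logits. A minimal numpy sketch (illustrative only; CANAL computes this loss in PyTorch on scBERT's classifier outputs):

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(logits, labels_onehot):
    # Equation (1): average over cells of -y_i^T log(y_hat_i),
    # with y_hat_i the softmax of the classifier logits.
    probs = softmax(logits)
    return float(-(labels_onehot * np.log(probs + 1e-12)).sum(axis=-1).mean())
```

A correct, confident prediction drives the per-cell loss toward zero; a confident wrong one makes it large.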

Figure 2.

Detailed illustration of experience replay. Whenever the pre-trained model undergoes fine-tuning, we employ a dynamic example bank with a fixed buffer size. It efficiently integrates new data and removes old data in a class-balanced fashion, based on their rank. This approach enables us to train CANAL using datasets from the current stage while also revisiting crucial examples from previous stages.

Class-balanced Experience Replay

In terms of model inputs, we introduce an example bank $B$ that stores a subset of earlier examples for reuse, complementing learning in the following stages [13]. Prior studies have shown that the weights of a well-trained classifier can be leveraged as class prototypes [14]. Suppose the cell-type set of dataset $D_t$ is $S_t$, and $|S_t|$ is the number of cell types in $D_t$. We can then extract the corresponding weight vectors $\{w_k\}_{k \in S_t}$ of the classifier as the cell-type prototypes at the $t$-th stage, and select the top-$m$ samples most similar to each cell-type prototype as the most representative samples to construct the bank,

$$B_t = \bigcup_{k \in S_t} \operatorname{Top}\text{-}m\bigl(\{x_i \in D_t : y_i = k\};\ \operatorname{sim}(h_i, w_k)\bigr), \qquad (2)$$

where $\operatorname{Top}\text{-}m$ selects the gene expression of the $m$ cells whose latent representations are most similar to each cell-type prototype. We emphasize that the selected cells are stored in descending order of similarity, which permits fast membership adjustment of the example bank $B$. At each training stage, we regard both the newly available samples and the old samples from $B$ as training data to fine-tune the model. The cell-type prediction loss in Equation (1) thus becomes

$$\mathcal{L}_{\mathrm{cls}} = -\frac{1}{n_t + |B_{t-1}|} \sum_{x_i \in D_t \cup B_{t-1}} y_i^{\top} \log \hat{y}_i. \qquad (3)$$

Considering the limitations of computation and storage, we fix the size of the example bank at a constant buffer size $b$. To overcome the negative effects of class imbalance and recency bias, we treat each cell type equally; that is, we maintain an equal number of examples for each cell type and for each training stage in which it has appeared. By doing so, we reduce the negative impact of class imbalance and place more focus on rare cell types or cell types that exist only at particular stages. Once a training stage finishes, we update $B$ for use at the next training stage. We begin by calculating, under the class-balanced principle, the number of examples that should be preserved for each cell type at each stage. Then, for repeated review, we prioritize retaining earlier examples closer to their prototypes, as they are more representative and hold greater significance. Previous examples with lower similarity, indicated by higher ranks, are removed, and new samples from the current stage fill the vacancies. The Supplementary Materials (Section 3) describe the technical details of the experience-replay scheme, and our full workflow to dynamically construct and maintain the example bank $B$ is displayed in Algorithm S1.
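The maintenance loop just described — equal quotas per cell type, storage ranked by similarity to the prototype, and tail-first eviction — can be sketched as follows. The dict-based structure and function name are illustrative assumptions, not CANAL's actual implementation (Algorithm S1):

```python
def update_bank(bank, cells, labels, sims, budget):
    """Class-balanced example-bank update (illustrative sketch).

    bank   : dict mapping cell-type label -> list of (similarity, cell),
             kept in descending similarity (rank 1 = most prototypical).
    sims   : similarity of each new cell's latent representation to the
             prototype (classifier weight vector) of its cell type.
    budget : fixed total buffer size b.
    """
    for cell, lab, sim in zip(cells, labels, sims):
        bank.setdefault(lab, []).append((sim, cell))
    # Equal room for every cell type encountered so far.
    per_class = budget // max(len(bank), 1)
    for lab in bank:
        # Evict from the tail: lowest-similarity (highest-rank) examples go first.
        bank[lab] = sorted(bank[lab], key=lambda p: -p[0])[:per_class]
    return bank
```

With a budget of 4 and two cell types, each type keeps its 2 most prototype-like cells regardless of abundance — the class balancing that protects rare populations.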

Representation Distillation

To preserve knowledge from previous tasks in terms of model outputs, we employ knowledge distillation. In this process, the model that has learned the last task is designated the ‘teacher’, and the model trained on the current task the ‘student’. The parameters of the model from the previous stage are stored as $M_{t-1}$. We aim to minimize the difference in latent representations between the previous model $M_{t-1}$ and the current model $M_t$. To achieve this, we extract the representation of the $i$-th cell from both models, denoted as $h_i^{(t-1)} = M_{t-1}(x_i)$ and $h_i^{(t)} = M_t(x_i)$, respectively. The mean squared error is then computed as the distillation loss:

$$\mathcal{L}_{\mathrm{distill}} = \frac{1}{n_t} \sum_{i=1}^{n_t} \bigl\lVert h_i^{(t)} - h_i^{(t-1)} \bigr\rVert_2^2. \qquad (4)$$

Therefore, the overall loss function at the $t$-th stage is

$$\mathcal{L} = \mathcal{L}_{\mathrm{cls}} + \lambda \mathcal{L}_{\mathrm{distill}}, \qquad (5)$$

where the parameter $\lambda$ controls the relative weight assigned to the two losses, taking into account the ratio of previously distilled knowledge to the novel knowledge that emerges in the current stage. At the first stage, there is no teacher model, so we train $M$ for classification only.
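The combined objective of Equations (4) and (5) amounts to a few lines; a hedged numpy sketch (CANAL itself computes these quantities on scBERT's intermediate-layer representations in PyTorch):

```python
import numpy as np

def distillation_loss(h_student, h_teacher):
    # Equation (4), up to normalization: mean squared difference between
    # current ("student") and previous-stage ("teacher") representations.
    return float(((h_student - h_teacher) ** 2).mean())

def total_loss(cls_loss, h_student, h_teacher, lam):
    # Equation (5): classification loss plus lambda-weighted distillation.
    # At the first stage there is no teacher, corresponding to lam = 0.
    return cls_loss + lam * distillation_loss(h_student, h_teacher)
```

When the student's representations match the teacher's exactly, the distillation term vanishes and only the classification loss remains.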

Universal Annotation Framework

The label sets of each dataset, comprising both training and test data, are often not perfectly overlapping. As a result, we must account for potential novel cells in the upcoming training and testing stages.

During the fine-tuning stage

At stage $t$, the output probability vector of the classifier $C$ covers all cell types encountered in the first $t-1$ stages. On the one hand, if new cell types emerge in the current stage, i.e. $S_t \not\subseteq \bigcup_{s<t} S_s$, we update and expand the classifier $C$ with newly added dimensions corresponding to these new cell types. On the other hand, some old cell types may not appear in the current stage, i.e. $\bigcup_{s<t} S_s \not\subseteq S_t$. As we maintain an example bank for each of these older cell types, we continue to train the respective classifier weights using previous examples in order to strengthen existing knowledge.
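Expanding the classifier when new cell types appear amounts to appending freshly initialized output dimensions while leaving existing weights untouched. A minimal numpy sketch; the function name and initialization scale are illustrative assumptions (CANAL operates on scBERT's classification head in PyTorch):

```python
import numpy as np

def expand_classifier(W, b, n_new, rng=None):
    """Append n_new output rows for newly encountered cell types.

    W : (K, d) weight matrix of the linear classifier, b : (K,) bias.
    Old rows are preserved, so knowledge of previously seen types is kept.
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    d = W.shape[1]
    W_new = rng.normal(scale=0.01, size=(n_new, d))  # small random init
    return np.vstack([W, W_new]), np.concatenate([b, np.zeros(n_new)])
```

Keeping the old rows intact means logits for previously learned cell types are unchanged at the moment of expansion; only the new dimensions must be learned.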

During the testing stage

To annotate unlabeled test cells $\{x_j\}_{j=1}^{n'}$, the final annotation label is determined by the maximum output component of the classifier $C$. However, test data may contain novel cell types not yet included in the existing cell-type annotation library. To detect novel cells, both energy (EN) and confidence (CF) scores are adopted (see Supplementary Materials Section 4 for their detailed definitions), and the final uncertainty score of test cell $x_j$ is defined as

$$u_j = \operatorname{EN}(x_j) + \operatorname{CF}(x_j). \qquad (6)$$

To determine a reasonable threshold, we adopt an adaptive, data-driven threshold via manifold mixup, in the same way as scEMAIL [15]. Synthetic manifolds are constructed by arbitrarily mixing up pairs of test cells, and the final threshold is decided automatically by averaging the uncertainty score over all possible pairs:

$$\delta = \frac{2}{n'(n'-1)} \sum_{j < j'} u\bigl(\operatorname{mix}(x_j, x_{j'})\bigr), \qquad (7)$$

where $\operatorname{mix}(\cdot,\cdot)$ denotes the manifold-mixup interpolation of two test cells.

Therefore, we predict that an unlabeled test cell $x_j$ belongs to a novel cell type if $u_j > \delta$.
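The thresholding logic can be sketched as below. The exact EN and CF scores are defined in Supplementary Materials Section 4 and are not reproduced here; an illustrative stand-in (one minus the maximum softmax probability) serves as the uncertainty score, and the pair averaging follows the manifold-mixup idea described above:

```python
import numpy as np

def uncertainty(probs):
    # Illustrative stand-in score: low maximum probability -> high uncertainty.
    return 1.0 - probs.max(axis=-1)

def mixup_threshold(probs, n_pairs=200, alpha=0.5, seed=0):
    """Adaptive threshold: average uncertainty of randomly mixed test pairs."""
    rng = np.random.default_rng(seed)
    n = len(probs)
    i = rng.integers(0, n, n_pairs)
    j = rng.integers(0, n, n_pairs)
    mixed = alpha * probs[i] + (1 - alpha) * probs[j]  # synthetic manifolds
    return float(uncertainty(mixed).mean())

def is_novel(probs, threshold):
    # A test cell is flagged as novel when its score exceeds the threshold.
    return uncertainty(probs) > threshold
```

Because the threshold is derived from the test data itself, no subjective manual tuning is required — the property the Results section contrasts against competing methods.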

RESULTS

To comprehensively evaluate the ability of online learning and universal annotation of CANAL, we applied it under various biological scenarios. As comparative benchmarks, we also measured the performance of relevant models, namely Online iNMF, SCALEX, scArches and scBERT. Online iNMF, SCALEX and scArches are amenable to online training, while we adapted scBERT into an online version, mirroring the approach employed by CANAL. To ensure robustness, we conducted multiple validations for all competing methods by executing each method 10 times using the same 10 random seeds (seeds 1 to 10). The evaluation of cell-type annotation performance is based on metrics such as accuracy (ACC), Macro F1 score, and adjusted Rand index (ARI). A higher score for each metric indicates superior performance. Further implementation details of each method and experimental procedures can be found in the Supplementary Materials (Sections 5–7).
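Among the reported metrics, Macro F1 is the one sensitive to rare cell types, since it averages per-class F1 scores without abundance weighting. A minimal pure-Python sketch (in practice scikit-learn's `f1_score` with `average='macro'` is the standard implementation):

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores; rare classes count equally."""
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        f1s.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(f1s) / len(f1s)
```

Predicting every cell as the majority type can still score high accuracy, but Macro F1 collapses because each missed rare class contributes zero to the average.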

CANAL realizes accurate annotation with data streams from various batches

We conducted a series of experiments utilizing data streams from different batches to assess the continual learning capability of each online tool. The initial experiment focused on four datasets derived from the human pancreas. Specifically, the datasets Muraro [16], Enge [17] and Baron human [18] constituted three distinct training stages. For each training dataset, we retained 500 cells, while a new dataset, Segerstolpe [19], served as the test data. Throughout the training stages, there were common cell types present, as well as cell types unique to specific stages. Notably, acinar, alpha, beta, delta and ductal cell types were shared across all stages, whereas activated stellate and quiescent stellate cell types were exclusive to the Baron human dataset (stage 3). Overall, CANAL demonstrated superior performance compared with competing approaches, with the mean values of all evaluation metrics exceeding 0.9 (Figure 3A, Supplementary Figure S1). Furthermore, the results obtained by CANAL exhibited minimal variance, indicating its robustness. SCALEX, scArches and scBERT exhibited relatively stable performance, while Online iNMF demonstrated poorer performance with significant fluctuations. The lower F1 scores observed for SCALEX and Online iNMF indicated a trade-off between total annotation accuracy and the accurate identification of rare cell populations, such as activated stellate and gamma cells (Supplementary Figure S2). This limitation was also present, to varying degrees, in the other methods except for CANAL. The primary cause of this issue lies in the imbalanced distribution of cell types within the training data stream, where larger cell types exert a dominant influence during the training process. Consequently, patterns associated with rare cell populations are often overlooked. 
This can be attributed to the limited sample size, resulting in insufficient information, or to the fact that traditional classification loss treats each sample equally, potentially sacrificing the performance of rare populations in favor of overall annotation accuracy. CANAL, on the other hand, effectively circumvents this issue by maintaining a class-balanced example bank that allocates equal room to each cell type, irrespective of their overall abundance. This approach explicitly modifies the composition of cell types in the model inputs and enhances the importance of rare cell populations.

Figure 3.

Analysis on pancreas and human immune experiments. A. Performance of competing methods on pancreas and human immune experiments using random seeds 1 to 10, with accuracy and F1 score as the metrics. Box plots display the median (center lines), interquartile range (hinges), 1.5-times the interquartile range (whiskers) and all ‘outlying’ points individually. B. A three-stage Sankey plot illustrates the mapping relationship among the prediction results of the first stage, second stage and final prediction, in comparison with the ground truth. C. UMAP visualization plots for pancreas experiments calculated using CANAL’s latent representations, colored by cell types. D. UMAP visualization plots for test data of human immune experiments calculated using CANAL’s latent representations, colored by CANAL’s predictions after each training stage as well as ground truth labels. Main prediction differences exist in the marked regions. MK progenitors: Megakaryocyte progenitors; Mono DC: Monocyte-derived dendritic cells; PDC: Plasmacytoid dendritic cells.

A three-stage Sankey plot visually represented the mapping relationship obtained by CANAL among the prediction results of the first stage, second stage, final prediction and ground truth (Figure 3B). The analysis of the plot revealed a consistent improvement in the model’s performance throughout the three stages of training, leading to enhanced predictive accuracy on the test data. During the initial stage, beta cells and alpha cells were misclassified, erroneously categorized as gamma cells. However, these misclassifications were rectified in the second stage. Notably, the final stage introduced new cell types, effectively addressing misclassifications from the earlier stages. In this final stage, both quiescent and activated stellate cells, which were previously misclassified as mesenchymal cells, were accurately identified. This earlier misclassification aligns well with previous observations that emphasize the role of stellate cells in fibrogenesis, considering the fibrogenic potential also reported in bone marrow mesenchymal cells [20]. The classification of stellate cells as resident mesenchymal cells in both the liver and pancreas has been substantiated by existing literature. Additionally, research studies have provided evidence of the involvement of pancreatic stellate cells in facilitating the process of epithelial–mesenchymal transition in pancreatic cancer cells [21], suggesting an interaction between stellate cells and mesenchymal cells. The UMAP visualization plots derived from the latent representations of each method clearly demonstrated that CANAL effectively discriminated all 10 cell populations, enabling the clustering of cells based on their biological characteristics rather than batch effects (Figure 3C, Supplementary Figures S3 and S4).
Conversely, alternative methods tended to group together smaller cell types, such as mesenchymal, activated stellate and quiescent stellate cells, indicating their limited ability to extract sufficient specific information from a smaller number of samples to differentiate these cell types.

The second experiment involved human immune datasets, comprising 10 different batches [22]. The integrated datasets, namely Oetjen, Sun and 10X, were utilized in three distinct training stages. For each training dataset, we retained 1000 cells, while two new datasets, Freytag and Villani, were kept as test data. All training stages shared the same 10 cell types. Boxplots and visualization plots once again demonstrated the superior performance of CANAL in extracting biologically specific information while remaining unaffected by complex batch effects (Figure 3A, Supplementary Figures S1 and S5). Moreover, CANAL exhibited strong generalization ability, accurately annotating data from previously unseen batches, such as Freytag and Villani, which were not part of the training process. Although alternative methods also maintained stable performance, their annotation accuracies still lagged behind CANAL's. The prediction changes observed after each training stage illustrated that CANAL continuously refines the model’s understanding of fuzzy cell-type boundaries during the learning process (Figure 3D). Concretely, the second stage of learning improved the prediction of the junction area between CD4+ T cells and CD8+ T cells (marked in dashed ellipse), as well as NK cells and NKT cells (marked in red rectangle), while the third stage of learning further adjusted the prediction of the junction region between CD14+ Monocytes and Monocyte-derived dendritic cells (marked in green square). This improvement arose because CANAL facilitates the gradual acquisition of distinctive patterns that were not fully learned previously. Through knowledge distillation and iterative review of crucial examples, CANAL improves upon its previously weaker points.

We also conducted a series of experiments on the Zheng68K dataset, consisting of six training stages [23]. In this particular scenario, the data stream exhibited minimal heterogeneity, and the non-online learning method, scBERT, demonstrated its efficacy, highlighting the superior feature mining and information integration capabilities of PTLMs compared with traditional network structures. Supplementary Figure S6, S7 provided comprehensive dataset information and presented the detailed results for further examination.

CANAL achieves superior performance on cross-tissue experiments

To assess the performance of CANAL when applied to data streams originating from diverse tissues, we conducted cross-tissue experiments utilizing the Tabula Muris Consortium dataset [24]. The training data stream was generated using the 10X Genomics platform, with separate data streams obtained from lung, mammary gland, limb muscle and spleen tissues, for each of the four stages. Two test datasets were generated using 10X Genomics and Smart-seq2 (SS2), respectively. These datasets encompassed all the tissues included in the training data, allowing for a comprehensive assessment of the knowledge retention, adaptation and generalization capabilities of each tool. The cell-type set of each training stage only partially overlapped, adding an additional layer of complexity to this task.

In general, CANAL exhibited clear advantages across all metrics (Figure 4A, Supplementary Figure S8). The results obtained by CANAL showed little variance across random seeds, and it consistently surpassed the second-best method by at least 10 points in all metrics. Conversely, scBERT’s performance was unsatisfactory, exhibiting significant fluctuations. Due to the absence of a mechanism to handle CF, scBERT quickly forgot previously acquired knowledge, making it challenging for the model to adapt continuously to highly heterogeneous data streams. The other three tools demonstrated relatively good performance on the test data from the 10X platform, thereby showcasing their capacity for online learning. However, when employed to annotate the test data derived from SS2, there remained scope for enhancing their generalization ability, indicating a limited capacity to extrapolate the acquired distinctive patterns to diverse downstream datasets. The visualization plots of the raw data from both the test and training stages revealed prominent batch effects, attributable to the generating platforms and source tissues (Figure 4B, Supplementary Figure S9). However, the latent representations learned by CANAL retained cell-type-specific information, regardless of the tissue or platform of origin of the cells. Notably, certain cell types that exclusively appeared in specific tissues were well-preserved within the representation. Examples included luminal epithelial cells and basal cells in the mammary gland, as well as mesenchymal stem cells and skeletal muscle satellite cells in the limb muscle (Figure 4C). However, the other methods failed to form a clear class structure. More specifically, Online iNMF, SCALEX and scArches tended to partition cells of the same cell type into multiple sub-clusters, while the latent space learned by scBERT lacked adequate dispersion.
In contrast to CANAL, they demonstrated limited efficacy in extracting shared patterns and mitigating batch effects observed across disparate tissue samples (Supplementary Figure S10).

Figure 4.

Analysis on cross-tissue experiments.  A. Performance of competing methods on test data from 10X and SS2 using random seeds 1 to 10, with accuracy and F1 score as the metrics. Box plots display the median (center lines), interquartile range (hinges), 1.5-times the interquartile range (whiskers) and all ‘outlying’ points individually. B. UMAP visualization plots calculated using raw data and CANAL’s latent representation, colored by tissues. C. UMAP visualization plots calculated using latent representations of various tools, colored by cell types. D. Sankey plots illustrate the mapping relationship between true (left) and predicted cell types (right) of competing methods. SS2: Smart-seq2; LECMG: Luminal epithelial cell of mammary gland; MSC: Mesenchymal stem cell; NK cell: Natural killer cell; SMSC: Skeletal muscle satellite cell.

We also presented Sankey plots to exhibit the mapping between true and predicted cell types for each method (Figure 4D, Supplementary Figure S11). These plots visually depicted the proportions of each cell type, allowing us to compare annotation performance across cell types of different sizes, particularly for the identification of rare cell types. For the rather large cell type ‘basal cell’, CANAL maintained almost perfect identification, while alternative methods classified this type as either ‘B cell’ or ‘stromal cell’. Moreover, CANAL’s superiority was particularly evident in three smaller cell types: LECMG (specific to the Mammary gland dataset), MSC and SMSC (both specific to the Limb muscle dataset). Given their limited abundance and occurrence in only one stage, the classification performance of scArches and scBERT was poor for the cell type LECMG, whereas all four competing methods failed to correctly classify the cell types MSC and SMSC. Only CANAL consistently and accurately identified these three tissue-specific cell types simultaneously. This can be attributed to the training mechanism employed by CANAL, which prioritizes the inclusion of infrequent clusters during experience replay, thereby allocating greater attention to these rare cell types. For instance, if a rare cell type A consists of 50 cells, while a more prevalent cell type B encompasses 5000 cells, CANAL ensures that the entire set of cell type A is retained in the sample collection for subsequent review, whereas only a small fraction of the most representative cells from cell type B are included. This balanced approach proves advantageous for the accurate identification of rare cell types.
By examining the prediction changes made by CANAL for all test data at different training stages (Supplementary Figure S12), we observed that our model continuously updates its learned knowledge throughout the training process, gradually consolidating a comprehensive cell-type annotation library, and consequently achieving more precise results. Moreover, the confusion matrices further validated that CANAL consistently maintains a high level of annotation accuracy, irrespective of whether the cells belong to shared or unique, large or small categories (Supplementary Figure S13).

CANAL identifies novel cells during the testing stage

In addition to allowing new cell types to appear continuously during the training stages, CANAL can also identify novel cells during the testing stage, distinguishing unknown cells from the cell types that appeared during training. Here we adopted the pancreas datasets for two additional novel cell-type detection experiments. In two separate runs, we artificially removed a relatively large cell type (‘alpha’) and a small cell type (‘delta’) from the training data, so that each served as a novel cell type during testing. Besides total annotation accuracy, the H-score, the harmonic mean of the accuracies on ‘known’ and ‘unknown’ cells, was adopted to evaluate the trade-off between the two. scBERT detects novel cells based on a confidence score, and scArches calculates an uncertainty score for each cell in the query dataset using its set of closest neighbors in the reference dataset. We applied scArches’ novel cell detection strategy to SCALEX and Online iNMF, which originally lack the ability to identify novel cells. All the above methods, except CANAL, require a subjective threshold; we therefore swept the threshold over a range of values at a fixed interval and recorded the result of the best-performing model.
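The H-score used here is simply the harmonic mean of the two accuracies; a minimal computation:

```python
def h_score(acc_known, acc_unknown):
    """Harmonic mean of the accuracies on 'known' and 'unknown' cells.
    It rewards a balanced trade-off: a method that sacrifices novel-cell
    detection for known-cell accuracy (or vice versa) scores low."""
    if acc_known + acc_unknown == 0:
        return 0.0
    return 2 * acc_known * acc_unknown / (acc_known + acc_unknown)
```

For example, a method with 1.0 accuracy on known cells but 0.0 on novel cells receives an H-score of 0, whereas a balanced 0.8/0.8 method scores 0.8.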

CANAL performed the best, irrespective of the size of the novel population, with high total accuracy. H-scores for CANAL rose above 0.8, indicating that it can readily balance known cell-type annotation and unknown cell-type identification (Table 1). The empirical distributions determined by CANAL showed a clear separation between the uncertainty scores of unknown and known cell types, forming two peaks, which supports using confidence and energy scores to measure the possibility that a cell belongs to an unknown cell type (Figure 5). In addition, the automatically selected threshold, marked by the red line, provided the desired division, achieving a high ability to identify unknown cells while minimizing the misclassification of known cell types as unknown. scBERT performed second only to CANAL on this task, which may be attributed to the PTLM’s ability to integrate and master the learned patterns of different cell types and, hence, better differentiate them from cell types that have not appeared. The relatively suboptimal performance of the other methods can be explained by their neighbor-based identification of novel cell types: errors are likely when new and known cell types are similar and their boundaries are fuzzy. In contrast, CANAL achieved a low false negative rate, with almost perfect identification of all unknown cells, as shown by the Sankey plots for the rather small delta cell type (Supplementary Figure S14). Compared with the other methods, CANAL also found the correct correspondence for known cells, validating its ability of universal annotation.
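As a concrete illustration of the two uncertainty measurements and the adaptive threshold, the sketch below derives an energy score and a low-confidence score from classifier logits, then places a data-driven cut between the two empirical peaks. The Otsu-style threshold and all names here are our illustrative stand-ins under stated assumptions, not CANAL's exact procedure:

```python
import numpy as np

def uncertainty_scores(logits):
    """Two per-cell uncertainty measurements computed from logits
    (shape: cells x known types); higher values suggest a novel cell."""
    m = logits.max(axis=1, keepdims=True)
    logsumexp = m.squeeze(1) + np.log(np.exp(logits - m).sum(axis=1))
    energy = -logsumexp                                  # free-energy score
    probs = np.exp(logits - m) / np.exp(logits - m).sum(axis=1, keepdims=True)
    low_confidence = 1.0 - probs.max(axis=1)             # 1 minus max softmax
    return energy, low_confidence

def bimodal_threshold(scores, bins=64):
    """Pick the cut between the two empirical peaks by maximizing
    between-class variance (Otsu's method), so no manual threshold
    is needed when the score distribution is clearly bimodal."""
    hist, edges = np.histogram(scores, bins=bins)
    p = hist / hist.sum()
    centers = 0.5 * (edges[:-1] + edges[1:])
    best_t, best_var = centers[0], -1.0
    for i in range(1, bins):
        w0, w1 = p[:i].sum(), p[i:].sum()
        if w0 == 0 or w1 == 0:
            continue
        m0 = (p[:i] * centers[:i]).sum() / w0
        m1 = (p[i:] * centers[i:]).sum() / w1
        var = w0 * w1 * (m0 - m1) ** 2   # between-class variance
        if var > best_var:
            best_var, best_t = var, edges[i]
    return best_t
```

Cells whose scores fall above the threshold would be labeled ‘unknown’; the rest are assigned their most probable known type.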

Table 1.

Performance comparisons on novel cell-type detection

Novel cell type    Alpha          Delta
Online iNMF        0.556/0.565    0.672/0.684
SCALEX             0.769/0.769    0.674/0.648
scArches           0.783/0.781    0.677/0.733
scBERT             0.832/0.822    0.753/0.808
CANAL              0.838/0.845    0.826/0.837

Note: entries give the H-score and total annotation accuracy (H-score/total accuracy); the highest score in each column is achieved by CANAL.

Figure 5.

Novel cell-type detection on the pancreas experiments. Empirical distributions and thresholds determined by CANAL with novel cell types ‘alpha’ and ‘delta’, respectively.

CANAL effectively alleviates CF

As discussed in the Introduction, CF poses a significant challenge in continual learning. To intuitively assess the ability of competing methods to preserve old knowledge, we evaluated the annotation performance of each model on the test data from the initial fine-tuning dataset (referred to as ‘Lung’ and ‘Muraro’ for the cross-tissue and pancreas experiments, respectively) after each training stage. Since SCALEX trains in a single stage, we compared the remaining four methods, which share more similar model settings. For the unsupervised Online iNMF and scArches, after the completion of each training stage we trained a weighted kNN classifier on the latent-space representation of the current training data to classify labels for the corresponding test data. The line plots clearly show that CF was negligible for CANAL (Figure 6): accuracy on the initial dataset remained largely unchanged, owing to our novel example bank and knowledge distillation, which set CANAL apart from scBERT. In contrast, Online iNMF and scArches focus on data integration without employing a classifier and lack the ability to retain the patterns of specific cell types from previous data. Occasionally the line plots exhibited a slight increase, as in stage 4 of Online iNMF, scBERT and scArches in the left graph, and stage 3 of Online iNMF in the right graph. This increase can be attributed to the training data at these stages sharing common features (or a reduced batch effect) with the data from the initial fine-tuning stage; however, the improvement was quite limited and did not approach the performance of our method. For additional experiments comparing CANAL, sequential learning and offline learning in terms of both classification performance and time efficiency, please refer to the Supplementary Materials (Section 8).
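The representation knowledge distillation that underpins this result can be sketched as a penalty on the divergence between the previous (frozen) and current models' embeddings of replayed cells. MSE is used here as one common choice of divergence, and `lam` is a hypothetical weighting parameter; this is a minimal sketch, not CANAL's exact objective:

```python
import numpy as np

def distillation_penalty(curr_repr, prev_repr, lam=1.0):
    """Representation-level knowledge distillation: penalize divergence
    between the frozen previous-stage model's embeddings (prev_repr)
    and the current model's embeddings (curr_repr) on the same replayed
    cells, weighted by lam."""
    return lam * np.mean((curr_repr - prev_repr) ** 2)

# During stage-t fine-tuning, the total objective would then look like
#   total_loss = classification_loss + distillation_penalty(z_curr, z_prev)
# where z_prev is computed once by the stage-(t-1) model and not updated.
```

If the current model drifts away from the previous model's representation of old cells, this term grows, discouraging forgetting while the classification loss adapts the model to new data.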

Figure 6.

Intuitive evaluation of CF. After each stage of training, we assess the performance of different methods on the test data obtained from the initial dataset for both cross-tissue (left) and pancreas (right) experiments.

CANAL maintains high model interpretability

Compared with conventional machine learning methods, CANAL leverages a transformer backbone that maintains gene-level interpretability through the attention mechanism, facilitating the identification of cell-type-specific markers. The detailed process of attention estimation is described in the Supplementary Materials (Section 9). To illustrate this, we first used the pancreas datasets as an example and presented the genes with the top three attention scores for each cell type (Figure 7A). The heatmap demonstrated that CANAL automatically found important features with cell-type-specific patterns. These genes serve as valuable references for marker genes associated with each cell type. Genes enclosed in red boxes have been previously reported by the PanglaoDB [25] or CellMarker [26] databases as classic specific markers that can be used to define the corresponding cell type in mouse pancreas. While marker genes for certain cell populations, such as activated stellate cells, gamma cells, mesenchymal cells and quiescent stellate cells, have not been extensively studied, we believe that the listed genes are strong candidates for further investigation. For example, activated pancreatic stellate cells are the main effector cells in the process of fibrosis, a vital pathological characteristic of pancreatic diseases including chronic pancreatitis and pancreatic cancer. Bioinformatics studies have already identified MMP14 as an immune-related biomarker associated with pancreatic adenocarcinoma prognosis [27, 28]. To explore whether the top attention genes exhibit differential expression among each cell population, we visualized their expression distribution (Figure 7B). Regardless of the cell population size, we observed distinct differential expression patterns of these selected genes.
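The heatmap in Figure 7A can be read as the output of a simple aggregation: average each cell's per-gene attention weights within a cell type, then rank. A minimal sketch of that aggregation, assuming per-cell attention vectors have already been extracted from the model (the function name and inputs are illustrative):

```python
import numpy as np

def top_attention_genes(attn, labels, gene_names, k=3):
    """Average per-gene attention weights (attn: cells x genes) within
    each cell type and return the k genes with the highest mean
    attention; these are candidate type-specific markers."""
    markers = {}
    for c in np.unique(labels):
        mean_attn = attn[labels == c].mean(axis=0)
        top = np.argsort(mean_attn)[::-1][:k]
        markers[c] = [gene_names[i] for i in top]
    return markers

# Toy example: alpha cells attend mostly to GCG, beta cells to INS.
genes = ["GCG", "INS", "SST", "KRT19"]
attn = np.array([[0.90, 0.05, 0.03, 0.02],
                 [0.80, 0.10, 0.05, 0.05],
                 [0.10, 0.85, 0.03, 0.02]])
labels = np.array(["alpha", "alpha", "beta"])
markers = top_attention_genes(attn, labels, genes, k=1)
```

The resulting per-type gene lists can then be cross-checked against marker databases such as PanglaoDB or CellMarker, as done above.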

Figure 7.

Illustration of model interpretability.  A. A heatmap shows the attention weights provided by CANAL on the pancreas experiments. For each cell type, the top three genes with the highest attention weights are listed. B. The stacked violin plots display the expression distribution of the selected genes on the pancreas experiments. C. UMAP visualization plots display the gene expression distribution of the top attention gene for five cell types (B cell, basal cell, endothelial cell, NK cell and SMSC) on the cross-tissue experiments. NK cell: Natural killer cell; SMSC: Skeletal muscle satellite cell.

Furthermore, we presented the genes with the top three attention scores for each cell type obtained from CANAL on cross-tissue experiments (Supplementary Figure S15). The heatmap demonstrated CANAL’s capability to automatically identify significant features displaying cell-type-specific patterns. Additionally, we visualized the gene expression distribution of the top attention gene for each cell type using UMAP visualization (Figure 7C, Supplementary Figure S16). Almost all of the top-attention genes, excluding the known markers, exhibited differential expression and held potential as novel markers. A detailed analysis and discussion can be found in the Supplementary Materials (Section 10). These results collectively demonstrated that CANAL has the ability to (1) advance the understanding of single-cell-specific representation and (2) provide valuable insights for downstream analysis and further biological research.

Ablation studies validating the primary innovative contributions introduced in CANAL were presented in the Supplementary Materials (Section 11). The results demonstrated the effectiveness of our pre-training strategy, our schemes to address the CF issue and our two uncertainty measurements; removing any of these components led to a degradation in performance. Additionally, we conducted robustness experiments on two hyper-parameters of our model: the size of the example bank and the weight of the distillation loss. The performance of CANAL remained stable across a wide range of variations in both parameters. Furthermore, we performed down-sampling experiments to investigate the sample size requirements of CANAL. Detailed information on these experimental results can be found in the Supplementary Materials (Section 12).

CONCLUSION

PTLMs show promise in message passing and information integration. Moreover, the rapid development of sequencing technology and the continuous emergence of well-annotated data streams call for annotation tools that operate in an online manner. Accordingly, we propose CANAL, a PTLM-based algorithm for online cell-type annotation of scRNA-seq data. To alleviate CF as training progresses, we first design a class-balanced example bank, based on similarity to cell-type prototypes, to preserve vital examples for future review. This example bank can be efficiently updated at the completion of each training stage, incurring minimal computational overhead. Furthermore, we employ knowledge distillation between the previous and current models to retain knowledge acquired from previous stages. CANAL accommodates novel cell types in both training and testing stages. Our cell-type annotation library is continuously updated during the training stages, enabling the automatic differentiation of ‘unknown’ from ‘known’ cells when annotating unlabeled cells. Consequently, the annotation framework of CANAL is highly versatile and universally applicable.

CANAL demonstrates its applicability across a wide range of scenarios. Firstly, our experimental results validate its ability to handle data streams encompassing different tissues and batches, including non-overlapping cell-type datasets. Secondly, it effectively identifies ‘novel’ cells that cannot be classified into any known cell type when annotating unlabeled data. Furthermore, CANAL’s training strategy facilitates the learning of cell-type-specific representations, irrespective of their sample size, thereby capturing the often overlooked biological characteristics of rare cell types. Moreover, CANAL’s reliance on the network construction of PTLM lends itself to natural interpretability. It enables the identification of potential marker genes in sparsely studied cell types, offering valuable guidance for subsequent downstream biological analysis. In light of these achievements, we would like to highlight potential areas for improvement and suggest several feasible directions for future research.

I. The current framework employs a well-labeled data stream for supervised fine-tuning and utilizes the fine-tuned model to annotate diverse unlabeled test data without additional adaptation. However, if new unlabeled test data become available in the future, we can exploit its inherent structure and perform feature alignment between the labeled and unlabeled data. Specifically, we can incorporate semi-supervised and self-supervised training methods into the existing CANAL framework. This extension would enable the utilization of both labeled and unlabeled data to enhance the model’s performance and capture more comprehensive underlying representations.

II. In this article, CANAL primarily focuses on the annotation of scRNA-seq data streams. However, the emerging field of single-cell multi-omics analysis allows for the simultaneous quantification of multiple modalities, thereby effectively capturing the intricate nature of complex molecular processes and cellular heterogeneity. For instance, the synergistic integration of single-cell ATAC sequencing with scRNA-seq enables the utilization of partial regulatory data pertaining to enhancer regions crucial for maintaining cell-type identities. Additionally, integrating cellular protein and transcriptome measurements at the single-cell level provides additional phenotypic information relevant to cell–cell interactions and signaling processes. This comprehensive approach captures both molecular profiles and functional characteristics, enhancing our understanding of cellular behavior and the underlying mechanisms involved. A promising avenue for future research involves extending and applying the CANAL framework to single-cell multi-omics data streams, thereby providing a more comprehensive and abundant dataset to unravel the heterogeneity of diverse cell types.

III. The recent advancements in spatially resolved transcriptomics technologies have significantly contributed to the investigation of gene expression within a spatial context. Given that individual cells are influenced and molded by their specific local spatial niches, the spatial positioning of cells is highly informative. Incorporating this spatial information into our framework is a natural extension, which can be achieved, for instance, by introducing loss terms that penalize the distances between cells in the latent space when they are situated in close proximity within the spatial locations. This integration of spatial information enables a more comprehensive analysis of cellular identity and its spatial dependencies.
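One concrete form the suggested spatial loss term could take is sketched below: for each cell, penalize its latent-space distance to its k nearest spatial neighbors, so that spatially adjacent cells receive similar embeddings. This is a speculative illustration of the future direction, not part of CANAL; the brute-force pairwise distance computation is for clarity only and would need a kNN index at scale:

```python
import numpy as np

def spatial_smoothness_loss(latent, coords, k=4):
    """Penalize latent-space distance between spatially adjacent cells.
    latent: cells x d embeddings; coords: cells x 2 spatial positions.
    For each cell, its k nearest spatial neighbors are pulled closer
    in latent space."""
    pairwise = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    total = 0.0
    for i in range(len(coords)):
        nbrs = np.argsort(pairwise[i])[1:k + 1]   # skip the cell itself
        total += np.linalg.norm(latent[i] - latent[nbrs], axis=1).mean()
    return total / len(coords)
```

Added to the fine-tuning objective with a small weight, such a term would encode the intuition that cells shaped by the same local niche should share representation.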

CANAL offers a holistic solution for both laboratory researchers and computational biologists. With the ever-increasing abundance of diverse datasets and data types, we believe that the utility of our toolkit will continue to advance over time.

Key Points

  • We present CANAL, a universal cell-type annotation tool for scRNA-seq data that continually adapts a PTLM as new well-labeled data become available.

  • CANAL essentially alleviates the dilemma of CF. For model inputs, we maintain a dynamic and class-balanced example bank that repeatedly revisits crucial past examples. For model outputs, we employ representation knowledge distillation to retain knowledge acquired during previous training stages.

  • During both the fine-tuning and testing stages, CANAL effectively handles the emergence of novel cell types. The cell-type annotation library expands continuously through the integration of newly discovered cell types during the training stages. Moreover, CANAL enables the automatic identification of novel cells in unlabeled datasets based on two uncertainty measurements and an adaptive, data-driven threshold.

  • Comprehensive experimental evaluations demonstrate that CANAL delivers robust and accurate cell-type annotations while preserving high gene-level interpretability, thereby advancing our understanding of cell-type-specific patterns.

Supplementary Material

CANAL_final_supplementary_bbae047

ACKNOWLEDGEMENTS

This work was supported by the National Key Research and Development Program of China (2021YFF1200902) and the National Natural Science Foundation of China (31871342).

Author Biographies

Hui Wan: Ph.D. candidate. Her main research interests include applications of deep learning and statistical methods in biological data analysis.

Musu Yuan: Ph.D. candidate. His main research interests include applied statistics, machine learning and bioinformatics.

Yiwei Fu: Ph.D. candidate. Her main research interests include bioinformatics and deep learning.

Minghua Deng: PhD, professor, PhD supervisor. His main research interests include bioinformatics, computational biology and biostatistics.

Contributor Information

Hui Wan, School of Mathematical Sciences, Peking University, Beijing, China, 100871.

Musu Yuan, Center for Quantitative Biology, Peking University, Beijing, China, 100871.

Yiwei Fu, School of Mathematical Sciences, Peking University, Beijing, China, 100871.

Minghua Deng, School of Mathematical Sciences, Peking University, Beijing, China, 100871; Center for Quantitative Biology, Peking University, Beijing, China, 100871; Center for Statistical Science, Peking University, Beijing, China, 100871.

References

  • 1. Wan H, Chen L, Deng M. scNAME: neighborhood contrastive clustering with ancillary mask estimation for scRNA-seq data. Bioinformatics 2022;38(6):1575–83.
  • 2. Chen L, He Q, Zhai Y, Deng M. Single-cell RNA-seq data semi-supervised clustering and annotation via structural regularized domain adaptation. Bioinformatics 2021;37(6):775–84.
  • 3. Xu C, Lopez R, Mehlman E, et al. Probabilistic harmonization and annotation of single-cell transcriptomics data with deep generative models. Mol Syst Biol 2021;17(1):e9620.
  • 4. Chen J, Hao X, Tao W, et al. Transformer for one stop interpretable cell type annotation. Nat Commun 2023;14(1):223.
  • 5. Yang F, Wang W, Wang F, et al. scBERT as a large-scale pretrained deep language model for cell type annotation of single-cell RNA-seq data. Nat Mach Intell 2022;4(10):852–66.
  • 6. Devlin J, Chang M-W, Lee K, Toutanova K. BERT: pre-training of deep bidirectional transformers for language understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies 2019;1:4171–86.
  • 7. Mai Z, Li R, Jeong J, et al. Online continual learning in image classification: an empirical survey. Neurocomputing 2022;469:28–51.
  • 8. Parisi GI, Kemker R, Part JL, et al. Continual lifelong learning with neural networks: a review. Neural Netw 2019;113:54–71.
  • 9. Gao C, Liu J, Kriebel AR, et al. Iterative single-cell multi-omic integration using online learning. Nat Biotechnol 2021;39(8):1000–7.
  • 10. Welch JD, Kozareva V, Ferreira A, et al. Single-cell multi-omic integration compares and contrasts features of brain cell identity. Cell 2019;177(7):1873–87.
  • 11. Lotfollahi M, Naghipourfar M, Luecken MD, et al. Mapping single-cell data to reference atlases by transfer learning. Nat Biotechnol 2022;40(1):121–30.
  • 12. Xiong L, Tian K, Li Y, et al. Online single-cell data integration through projecting heterogeneous datasets into a common cell-embedding space. Nat Commun 2022;13(1):6118.
  • 13. Rebuffi S-A, Kolesnikov A, Sperl G, et al. iCaRL: incremental classifier and representation learning. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 2017, pp. 5533–42.
  • 14. Liu C, Wang L, Lyu L, et al. DEJA VU: continual model generalization for unseen domains. The Eleventh International Conference on Learning Representations (ICLR 2023), Kigali, Rwanda, 2023.
  • 15. Wan H, Chen L, Deng M. scEMAIL: universal and source-free annotation method for scRNA-seq data with novel cell-type perception. Genomics Proteomics Bioinformatics 2022;20(5):939–58.
  • 16. Muraro MJ, Dharmadhikari G, Grün D, et al. A single-cell transcriptome atlas of the human pancreas. Cell Syst 2016;3(4):385–94.
  • 17. Enge M, Arda HE, Mignardi M, et al. Single-cell analysis of human pancreas reveals transcriptional signatures of aging and somatic mutation patterns. Cell 2017;171(2):321–30.
  • 18. Baron M, Veres A, Wolock SL, et al. A single-cell transcriptomic map of the human and mouse pancreas reveals inter- and intra-cell population structure. Cell Syst 2016;3(4):346–60.
  • 19. Segerstolpe Å, Palasantza A, Eliasson P, et al. Single-cell transcriptome profiling of human pancreatic islets in health and type 2 diabetes. Cell Metab 2016;24(4):593–607.
  • 20. Kordes C, Sawitza I, Götze S, et al. Stellate cells are mesenchymal stem cells. Eur J Med Res 2014;19:1–2.
  • 21. Kikuta K, Masamune A, Watanabe T, et al. Pancreatic stellate cells promote epithelial-mesenchymal transition in pancreatic cancer cells. Biochem Biophys Res Commun 2010;403(3-4):380–4.
  • 22. Luecken MD, Büttner M, Chaichoompu K, et al. Benchmarking atlas-level data integration in single-cell genomics. Nat Methods 2022;19(1):41–50.
  • 23. Zheng GXY, Terry JM, Belgrader P, et al. Massively parallel digital transcriptional profiling of single cells. Nat Commun 2017;8(1):14049.
  • 24. Schaum N, Karkanias J, Neff NF, et al. Single-cell transcriptomics of 20 mouse organs creates a Tabula Muris: The Tabula Muris Consortium. Nature 2018;562(7727):367.
  • 25. Franzén O, Gan L-M, Björkegren JLM. PanglaoDB: a web server for exploration of mouse and human single-cell RNA sequencing data. Database 2019;2019:baz046.
  • 26. Hu C, Li T, Xu Y, et al. CellMarker 2.0: an updated database of manually curated cell markers in human/mouse and web tools based on scRNA-seq data. Nucleic Acids Res 2023;51(D1):D870–6.
  • 27. Jin G, Hong W, Guo Y, et al. Molecular mechanism of pancreatic stellate cells activation in chronic pancreatitis and pancreatic cancer. J Cancer 2020;11(6):1505.
  • 28. Li Y, Zhou S, Wei B, et al. Bioinformatics analysis identified MMP14 and COL12A1 as immune-related biomarkers associated with pancreatic adenocarcinoma prognosis. Math Biosci Eng 2021;18(5):5921–42.


Articles from Briefings in Bioinformatics are provided here courtesy of Oxford University Press
