Abstract
The proliferation of single-cell RNA-seq data has greatly enhanced our ability to comprehend the intricate nature of diverse tissues. However, accurately annotating cell types in such data, especially when handling multiple reference datasets and identifying novel cell types, remains a significant challenge. To address these issues, we introduce Single Cell annotation based on Distance metric learning and Optimal Transport (scDOT), an innovative cell-type annotation method adept at integrating multiple reference datasets and uncovering previously unseen cell types. scDOT introduces two key innovations. First, by incorporating distance metric learning and optimal transport, it presents a novel optimization framework. This framework effectively learns the predictive power of each reference dataset for new query data and simultaneously establishes a probabilistic mapping between cells in the query data and reference-defined cell types. Secondly, scDOT develops an interpretable scoring system based on the acquired probabilistic mapping, enabling the precise identification of previously unseen cell types within the data. To rigorously assess scDOT’s capabilities, we systematically evaluate its performance using two diverse collections of benchmark datasets encompassing various tissues, sequencing technologies and diverse cell types. Our experimental results consistently affirm the superior performance of scDOT in cell-type annotation and the identification of previously unseen cell types. These advancements provide researchers with a potent tool for precise cell-type annotation, ultimately enriching our understanding of complex biological tissues.
Keywords: cell-type annotation, novel cell-type identification, optimal transport, distance metric learning, single-cell RNA sequencing
INTRODUCTION
Single-cell RNA sequencing (scRNA-seq) technologies enable the measurement of gene expression profiles in individual cells, significantly advancing our comprehension of complex tissues and organisms by unveiling the cellular composition within these structures. One pivotal phase in the analysis of scRNA-seq data is cell-type annotation. It is foundational for unraveling the functional attributes of the tissue and a prerequisite for various downstream analyses, including trajectory analysis and the investigation of cell–cell interactions [1–3].
Cell-type annotation methods generally fall into two categories: unsupervised clustering-based and supervised classification-based approaches. Unsupervised clustering methods first apply clustering algorithms to group cells based on their gene expression profiles. Cell types are then manually assigned to these clusters using a set of carefully curated marker genes [4–8]. However, accurately grouping cells into different clusters remains a persistent challenge. Moreover, marker gene selection relies on researchers’ prior knowledge, making it susceptible to biases and errors. This process is often time-consuming and lacks reproducibility across different experiments and research groups [9–11]. In contrast, supervised classification-based methods establish the underlying relationships between gene expression measures of cells and their respective cell types by harnessing well-annotated reference data. These methods efficiently transfer label information from the reference data to the query data, bypassing the need for manual marker gene selection. The supervised strategy for cell-type annotation can be broadly categorized into two main groups. The first category relies on statistical metrics to quantify the similarity between cells, and includes scmap [12], Seurat [13], scMRMA [14], ScType [15], scAnnotate [16] and others. These methods use statistical measures to assess the likeness of cells, facilitating the assignment of cell types based on predefined metrics. The second category adopts deep learning methodologies for cell-type annotation, encompassing ItClust [17], scANVI [18], mtANN [19], scDeepSort [20], CellBlast [21] and scBERT [22], among others. These approaches leverage deep learning techniques to discern intricate patterns and relationships within scRNA-seq data, offering a more nuanced approach to cell-type identification.
These supervised approaches often outperform unsupervised clustering-based techniques, giving rise to the development of various reference-based supervised learning methods in recent years [23–27].
Despite the advantages of these supervised methods, two critical issues remain inadequately resolved. First, with the exponential growth of publicly available sequencing data, numerous well-annotated datasets become accessible for a given tissue. These datasets often provide similar gene expression profiles and label information as the query dataset. However, they may vary in noise, encompass different cell types, and exhibit technological and biological differences compared to the query dataset. For previous methods that rely on a single reference dataset, such as ItClust and TACCO [28], selecting the optimal reference from this vast pool without prior guidance is a challenge. Even for methods that can use multiple reference datasets, effectively integrating the complementary information offered by various references is still uncharted territory. Some approaches, such as Seurat, exist for merging multiple reference datasets into a single reference. However, these methods can be sensitive to batch effects, making it difficult to pre-emptively choose an appropriate batch correction method [29, 30]. Thus, there is a pressing need for new methods that can harness the collective strength of diverse reference datasets and avoid the manual selection of an optimal reference.
Secondly, due to variations in the design of experiments, sample acquisition methods and sequencing technologies, query data may contain unseen cell types that are absent in the reference dataset. For example, when annotating a query dataset from a cancer sample using references collected from normal samples, these unseen cell types may appear. Overlooking these cell types might lead to reduced annotation accuracy and result in missing out on new biological insights. To ensure credible cell-type annotation, it is crucial to recognize these elusive cell types and categorize them as ‘unseen’. Current methods attempt to identify these cell types by evaluating the similarities between query data and reference data [12, 13, 31]. However, these methods often conduct individual cell-to-cell comparisons, which can make them sensitive to potential interferences from other cell types within the reference data. This interference is particularly noticeable when dealing with closely related cell types or those originating from the same lineage. In such cases, the methods may not effectively distinguish between these similar cell types, potentially leading to inaccurate annotations.
To overcome these challenges, we introduce a novel approach for scRNA-seq data, Single Cell annotation based on Distance metric learning and Optimal Transport (scDOT). scDOT leverages optimal transport to construct a probabilistic mapping linking unannotated cells in the query dataset to cell types defined in the reference data. Diverging from traditional optimal transport methods that generate a single cost matrix between query cells and reference class means, scDOT adopts a distance metric learning approach. It calculates multiple cost matrices between query cells and reference class means obtained from various references and amalgamates them using weighted averages to create the final cost matrix for optimal transport. scDOT simultaneously learns the weights assigned to each reference and the probabilistic mapping. This enables us to perform cell-type annotation and identify unseen cell types based on the learned probabilistic mapping. It also allows us to analyze the predictive power of different references based on the learned weights. The efficacy of scDOT is rigorously validated through benchmarking on two distinct collections of benchmark datasets that span various tissues, sequencing technologies, and encompass diverse cell types. Across 19 benchmark tests, scDOT consistently demonstrates its superior performance in cell-type annotation and the identification of previously undiscovered cell types. Additionally, we further evaluated the adaptability of our approach by annotating cancer samples using normal references and categorizing patients’ tumor cells. These advances provide researchers with a robust tool for precise data annotation.
METHODS
Overview of scDOT
scDOT is designed for annotating newly sequenced (query) data with the assistance of multiple well-annotated scRNA-seq datasets (Figure 1). For query cells belonging to cell types present in the references, scDOT assigns them corresponding cell types, whereas query cells not associated with reference cell types are annotated as ‘unseen’. scDOT harnesses an optimal transport model to learn a probabilistic map linking unannotated query cells with cell types in the reference data, facilitating query cell annotation and unseen cell-type identification.
Figure 1.
Schematic representation of the scDOT workflow. (A) scDOT takes multiple well-annotated single-cell sequencing datasets and a query dataset as input. (B) It computes cost matrices between the query cells and cell types defined in various reference datasets. These cost matrices are then imputed based on reference dataset similarity and combined into a final cost matrix using weighted averaging for optimal transport. (C) scDOT introduces an optimization model that concurrently determines the weights assigned to each reference and the probabilistic mapping. The optimal probabilistic mapping is obtained by iteratively updating the mapping matrix and reference weights. (D) A novel scoring approach is employed to identify unseen cell types within the query data, while the seen cells are annotated with cell types exhibiting the highest mapping probabilities.
In contrast to traditional optimal transport, which calculates a cost matrix prior to probabilistic map learning, scDOT borrows from the concept of distance metric learning to simultaneously derive the cost matrix and probabilistic mapping between query cells and reference cell types. To eliminate the need for manual selection of an optimal reference from multiple datasets and to integrate the complementary information provided by various references, scDOT initially computes multiple cost matrices between query cells and reference cell types from different references. These matrices are combined using weighted averages to produce the final cost matrix for optimal transport. Subsequently, a mathematical model is established to concurrently determine the weights assigned to each reference and the probabilistic mapping. Once the probabilistic mapping matrix is obtained, we introduce a scoring approach to identify unseen cell types. For query cells identified as belonging to unseen cell types, scDOT annotates them as ‘unseen’, while the remaining cells are annotated with cell types exhibiting the highest mapping probabilities.
Mathematically, scDOT operates on a collection of annotated reference datasets, represented as $\{(X^{(m)}, y^{(m)})\}_{m=1}^{M}$, alongside a query dataset denoted as $X^{q}$ (Figure 1A). Within this framework, $X^{(m)}$ signifies the gene expression matrix of the $m$th reference dataset, where rows correspond to cells and columns denote genes. Cell-type information is stored in $y^{(m)}$. Similarly, $X^{q}$ represents the gene expression matrix of the query dataset, which is an $n \times g$ matrix with $n$ and $g$ denoting the number of cells and genes of the query dataset, respectively. Let $\mathcal{C}^{(m)}$ denote the set of cell types observed in $X^{(m)}$ and $\mathcal{C} = \bigcup_{m=1}^{M} \mathcal{C}^{(m)}$ denote all cell types present in all reference datasets. Thus, the number of cell types observed in all reference datasets can be denoted as $K = |\mathcal{C}|$, and the $k$-th cell type can be denoted as $c_k$. The method’s workflow commences by calculating a series of cost matrices $\{D^{(m)}\}_{m=1}^{M}$, where each $D^{(m)}$ is an $n \times K$ matrix that captures the relationship between the query cells and the cell types defined in the $m$th reference dataset (Figure 1B). These matrices are subsequently combined to form the final cost matrix, computed as $D = \sum_{m=1}^{M} w_m D^{(m)}$. Here, $w = (w_1, \dots, w_M)$ signifies a series of weights assigned to reference datasets. Subsequently, through a coordinated learning process, the method determines the weights, $w$, and the probabilistic mapping matrix, referred to as $T$, which is an $n \times K$ matrix indicating the probability that query cells belong to a certain cell type (Figure 1C). Cell-type annotation and the identification of previously unseen cell types are accomplished based on the information derived from $T$ (Figure 1D).
The following sections begin with a brief review of unbalanced optimal transport (UOT), a fundamental concept used in our method. Following that, we explain the process of computing cost matrices in the context of different references and their combination into a final cost matrix using a weighted averaging approach for optimal transport. We then present the scDOT methodology based on distance metric learning and optimal transport, elaborating on how it simultaneously learns the weights assigned to different references and the probabilistic mapping matrix. Finally, we introduce the optimization algorithm and the approach to identify unseen cell types and annotate cell types.
Unbalanced optimal transport
In the scDOT methodology, the core of our approach lies in the UOT model [32]. UOT aims to establish a probabilistic mapping matrix, denoted as $T$, which links unannotated query cells with cell types from well-annotated reference datasets. The primary objective is to minimize the following mathematical expression:

$$\min_{T \ge 0}\ \langle T, D \rangle + \epsilon \sum_{i,k} T_{ik} \log T_{ik} + \lambda\, \mathrm{KL}\left(T\mathbf{1} \,\|\, a\right) + \lambda\, \mathrm{KL}\left(T^{\top}\mathbf{1} \,\|\, b\right) \tag{1}$$

In the first term, $\langle T, D \rangle$ is defined as $\sum_{i,k} T_{ik} D_{ik}$. This term signifies the transportation cost, measuring the total cost of mapping query cells to reference cell types. It is the term through which the information contained in the distance matrix, $D$, enters the model. The second term introduces entropy regularization, which adds smoothness and strict convexity to the optimization problem. The third and fourth terms involve Kullback–Leibler (KL) divergence penalties. These penalties are fundamental elements in the UOT framework. They play a crucial role in enabling the calculation of approximate mappings between distributions that do not possess the same quantity of mass, represented by $a$ and $b$. The UOT model distinguishes itself by its ability to relax the constraints of traditional optimal transport: the marginal constraints are replaced by penalties within the optimization objective. Consequently, the model can learn precise mappings even when the marginal distributions are not precisely defined.
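To make the role of the entropy and KL terms concrete, here is a minimal NumPy sketch of the standard scaling (Sinkhorn-type) iteration for entropic unbalanced optimal transport. It is an illustration rather than scDOT’s own solver; the function name `unbalanced_sinkhorn`, the parameter names `eps` and `lam`, and their default values are assumptions made for this example. The POT library offers a comparable solver in its `ot.unbalanced` module.

```python
import numpy as np

def unbalanced_sinkhorn(cost, a, b, eps=0.05, lam=1.0, n_iter=500):
    """Entropic unbalanced OT with KL-relaxed marginals (scaling iterations).

    cost : (n_cells, n_types) cost matrix
    a    : (n_cells,) target row marginal
    b    : (n_types,) target column marginal
    eps  : entropy regularization strength (illustrative default)
    lam  : weight of the KL marginal penalties (illustrative default)
    """
    kernel = np.exp(-cost / eps)      # Gibbs kernel
    power = lam / (lam + eps)         # exponent < 1 softens the marginal constraints
    u = np.ones_like(a, dtype=float)
    v = np.ones_like(b, dtype=float)
    for _ in range(n_iter):
        u = (a / (kernel @ v)) ** power
        v = (b / (kernel.T @ u)) ** power
    return u[:, None] * kernel * v[None, :]

rng = np.random.default_rng(0)
cost = rng.random((6, 3))
a = np.full(6, 1 / 6)
b = np.full(3, 1 / 3)
plan = unbalanced_sinkhorn(cost, a, b)                # soft marginals
plan_tight = unbalanced_sinkhorn(cost, a, b, lam=1e3) # nearly balanced
```

As `lam` grows, the exponent approaches 1 and the iteration approaches the balanced Sinkhorn algorithm, so the marginals of the plan are matched ever more strictly; a small `lam` lets the plan deviate from `a` and `b` when the cost makes that worthwhile.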
Calculating cost matrix
To calculate the cost matrix for cell-type annotation in scDOT, we compute cosine distances to establish relationships between cells in the query data and cell types in the reference data.
We begin by determining the cosine distance from the processed query data to the means of cell types computed in the reference data. These means represent the average expression profiles of cells belonging to specific cell types and can be calculated as follows:

$$U^{(m)} = \left[\mu^{(m)}_1, \dots, \mu^{(m)}_K\right], \qquad \mu^{(m)}_k = \frac{1}{\left|\{i : y^{(m)}_i = c_k\}\right|} \sum_{i\,:\,y^{(m)}_i = c_k} \left(x^{(m)}_i\right)^{\top} \tag{2}$$

where $U^{(m)}$ represents the average expression matrix of the cell types of the $m$th reference dataset, with rows corresponding to genes and columns denoting cell types, and $\mu^{(m)}_k$ is a column vector denoting the expression profile of the $k$-th cell type. $x^{(m)}_i$ denotes a row vector representing the expression profile of the $i$th cell in the $m$th reference, and $y^{(m)}_i$ is its cell-type label. If a reference $m$ does not include a specific cell type $c_k$, we set $\mu^{(m)}_k$ as a zero vector. In that case, we impute the corresponding element in the computed cost matrix by borrowing information from other references. Next, we calculate the distance matrix as follows:

$$D^{(m)}_{ik} = 1 - \frac{x^{q}_i\, \mu^{(m)}_k}{\left\|x^{q}_i\right\|_2 \left\|\mu^{(m)}_k\right\|_2} \tag{3}$$

where $x^{q}_i$ is the row vector of the $i$th query cell, $\|\cdot\|_2$ is the 2-norm of a vector, and $D^{(m)}_{ik}$ denotes the cosine distance from cell $i$ in the query data to the mean expression of cell type $c_k$ in the $m$th reference data.
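The two steps described above (class means per reference, then cosine distance from each query cell to each mean) can be sketched in NumPy as follows. The function name `cosine_cost_to_class_means` and the toy data are hypothetical; the sketch assumes no query cell has an all-zero expression vector, and it marks cell types missing from the reference with NaN rather than using a zero vector, so that the entries needing imputation are easy to spot.

```python
import numpy as np

def cosine_cost_to_class_means(query, ref, ref_labels, all_types):
    """Cosine distance from every query cell to every reference class mean.

    query      : (n_query, n_genes) expression matrix
    ref        : (n_ref, n_genes) expression matrix of one reference
    ref_labels : (n_ref,) cell-type label of each reference cell
    all_types  : cell types pooled over all references (column order)
    """
    cost = np.full((query.shape[0], len(all_types)), np.nan)
    query_norm = np.linalg.norm(query, axis=1)
    for k, cell_type in enumerate(all_types):
        members = ref[ref_labels == cell_type]
        if members.shape[0] == 0:
            continue                       # type absent here: leave NaN for imputation
        mean = members.mean(axis=0)        # average expression profile of this type
        cost[:, k] = 1.0 - (query @ mean) / (query_norm * np.linalg.norm(mean))
    return cost

ref = np.array([[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]])
ref_labels = np.array(["alpha", "alpha", "beta"])
query = np.array([[2.0, 0.0], [0.0, 5.0]])
cost = cosine_cost_to_class_means(query, ref, ref_labels, ["alpha", "beta", "delta"])
```

In this toy case the first query cell points in the same direction as the "alpha" mean (distance 0) and is orthogonal to the "beta" mean (distance 1); the "delta" column stays NaN because the reference contains no such cells.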
Given the potential variance in cell types across different reference datasets, some elements in $D^{(m)}$ cannot be computed, namely those for which reference $m$ does not encompass the cell type $c_k$. To maximize the information retained from the reference data, we implement an imputation strategy based on correlations between the reference datasets.
Initially, we calculate the Maximum Mean Discrepancy (MMD) [33] based on the gene expression profiles of cells that are common between pairs of reference datasets. The MMD calculation is as follows:

$$\mathrm{MMD}\left(X^{(m)}, X^{(m')}\right) = \left\| \frac{1}{n_m} \sum_{i=1}^{n_m} \phi\left(x^{(m)}_i\right) - \frac{1}{n_{m'}} \sum_{j=1}^{n_{m'}} \phi\left(x^{(m')}_j\right) \right\|_{\mathcal{H}} \tag{4}$$

Here $\phi(\cdot)$ represents a feature mapping function that maps the data to a reproducing kernel Hilbert space (RKHS) [33], $\|\cdot\|_{\mathcal{H}}$ indicates that the norm is calculated in the RKHS, and $n_m$ and $n_{m'}$ represent the number of cells shared between the two reference datasets. To transform the MMD into a measure of similarity, we introduce the following equation:
![]() |
(5) |
This equation quantifies the correlation between the $m$th reference data and the $m'$th reference data. This correlation is then used to impute the distance matrix $D^{(m)}$ with the following equation:
![]() |
(6) |
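This process yields the $m$th imputed distance matrix, $\tilde{D}^{(m)}$. Depending on the presence of cell types in the considered reference $m$, we define the final distance matrix as follows:

$$\hat{D}^{(m)}_{ik} = \begin{cases} D^{(m)}_{ik}, & c_k \in \mathcal{C}^{(m)} \\ \tilde{D}^{(m)}_{ik}, & c_k \notin \mathcal{C}^{(m)} \end{cases} \tag{7}$$

where $\mathcal{C}^{(m)}$ is the set of cell types present in reference $m$. In this context, for cell types not found in the considered reference $m$, we impute the distance values by borrowing strength from other similar references. For simplicity, we still denote the imputed distance matrix $\hat{D}^{(m)}$ as $D^{(m)}$.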
In scenarios where multiple reference datasets are utilized, we employ a weighted averaging strategy to combine these distance matrices into a final distance matrix, denoted as $D$:

$$D = \sum_{m=1}^{M} w_m D^{(m)} \tag{8}$$

Here, $w_m$ represents the weight assigned to the $m$th of the $M$ references; these weights are learned automatically by scDOT. This approach allows us to account for variations in the predictive capabilities of individual references with respect to the query data, resulting in a comprehensive and integrated distance matrix.
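As a small illustration of the weighted averaging step, the sketch below combines a stack of per-reference cost matrices with a weight vector. The function name `combine_costs` and the toy values are hypothetical, and the weights are assumed to be non-negative and to sum to one.

```python
import numpy as np

def combine_costs(costs, weights):
    """Weighted average of per-reference cost matrices.

    costs   : array-like of shape (n_refs, n_cells, n_types)
    weights : array-like of shape (n_refs,), assumed to sum to 1
    """
    costs = np.asarray(costs, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # Contract the reference axis: sum_m weights[m] * costs[m]
    return np.tensordot(weights, costs, axes=1)

costs = [[[0.0, 1.0]], [[1.0, 0.0]]]   # two references, one query cell, two types
combined = combine_costs(costs, [0.75, 0.25])
```

With weights 0.75 and 0.25, the combined cost of the single query cell is 0.25 for the first type and 0.75 for the second, so the more heavily weighted reference dominates the final matrix.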
Formulation of scDOT
Unlike the traditional UOT model provided in Equation (1), which computes a fixed cost matrix before learning the probabilistic mapping, scDOT introduces a novel approach inspired by distance metric learning. In this approach, scDOT simultaneously determines the weights needed to compute the final cost matrix and the probabilistic mapping between query cells and reference cell types. To achieve this, we replace the fixed cost matrix in Equation (1) with a weighted combination of cost matrices from multiple references, as defined in Equation (8). This modification leads to the following optimization model:
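$$\min_{T \ge 0,\, w}\ \sum_{m=1}^{M} w_m \langle T, D^{(m)} \rangle + \epsilon \sum_{i,k} T_{ik} \log T_{ik} + \lambda\, \mathrm{KL}\left(T\mathbf{1} \,\|\, a\right) + \lambda\, \mathrm{KL}\left(T^{\top}\mathbf{1} \,\|\, b\right) + \eta \sum_{m=1}^{M} w_m \log w_m \quad \text{s.t.}\ \sum_{m=1}^{M} w_m = 1,\ w_m \ge 0 \tag{9}$$

In this model, the first term represents the transportation cost. The significant departure from the traditional UOT model is that the cost matrix is no longer fixed; instead, it becomes a weighted combination of multiple cost matrices from different references, and these weights need to be learned. The second to fourth terms in this model align with those of the traditional UOT model in Equation (1). The fifth term introduces negative entropic regularization for the weights, aimed at preventing the final result from overfitting to the result obtained from a single reference dataset.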
The margins, denoted as
and
, respectively represent sums over cells and cell types. In alignment with the approach described in [28], we set the marginal distribution over cell types,
, equal to the total number of counts per cell
, where
. The marginal distribution over cells,
, is an outcome of the annotation process and is not known in advance. When no prior information is available, we default to setting
as the fraction of cells belonging to the
th cell type in the reference datasets.
,
and
are tuning parameters that regulate the relative significance of various terms within the optimization model. By default, in our implementation, we set
and
. We will elaborate on the method for determining
at the end of the next section.
Minimizing (9) with respect to $w$ enables the learning of weights assigned to each reference dataset. Larger $w_m$ values indicate that the $m$th reference dataset contributes more to the final cost matrix and may have higher predictive power concerning the query data. By minimizing (9) with respect to $T$, we can learn the mapping matrix $T$ used for unseen cell-type identification and cell-type annotation.
Optimization algorithm
Solving the optimization problem outlined in Equation (9) requires estimating two sets of model parameters: $T$ and $w$. To tackle this challenge, we develop a coordinate descent algorithm. This algorithm iteratively updates one parameter while keeping the other constant.
We initiate the optimization process by estimating $T$ while maintaining $w$ at fixed values. This stage involves solving the UOT problem (1), with the cost matrix defined in Equation (8). To accomplish this, we utilize the POT Python library [34].
With $T$ fixed, we proceed to compute $w$. The optimization problem outlined in Equation (9) reduces to the following expression:

$$\min_{w}\ \sum_{m=1}^{M} w_m \langle T, D^{(m)} \rangle + \eta \sum_{m=1}^{M} w_m \log w_m \quad \text{s.t.}\ \sum_{m=1}^{M} w_m = 1,\ w_m \ge 0 \tag{10}$$

Utilizing the method of Lagrange multipliers, we can derive a closed-form solution for each $w_m$:

$$w_m = \frac{\exp\left(-\langle T, D^{(m)} \rangle / \eta\right)}{\sum_{m'=1}^{M} \exp\left(-\langle T, D^{(m')} \rangle / \eta\right)} \tag{11}$$
We iterate through the coordinate descent algorithm until the relative change of the objective function in Equation (9) is less than
.
In our implementation, we initialize $w_m = 1/M$ for the first update, meaning each reference dataset begins with equal weights. After obtaining the initial estimate of $T$, we empirically set the tuning parameter $\eta$ to the median of the dot product of a single cost matrix and the mapping matrix:

$$\eta = \operatorname{median}_{m}\ \langle T, D^{(m)} \rangle \tag{12}$$
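The weight-update half of the coordinate descent can be illustrated with a short sketch. It relies on a standard fact: minimizing a linear cost plus a negative-entropy penalty over weights that sum to one yields a softmax of the negated per-reference costs. The function name `update_reference_weights` and the toy inputs are hypothetical, and the sketch assumes the weights are constrained to the probability simplex; it is not the authors’ code.

```python
import numpy as np

def update_reference_weights(plan, costs, eta=None):
    """Closed-form weight update for a fixed transport plan.

    plan  : (n_cells, n_types) current mapping matrix
    costs : (n_refs, n_cells, n_types) per-reference cost matrices
    eta   : entropy strength; if None, use the median per-reference cost
    """
    costs = np.asarray(costs, dtype=float)
    # Transport cost of the current plan under each reference's cost matrix
    transport = np.array([(plan * c).sum() for c in costs])
    if eta is None:
        eta = np.median(transport)
    logits = -transport / eta
    logits -= logits.max()            # shift for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(), eta

plan = np.array([[0.5, 0.0], [0.0, 0.5]])
costs = [
    [[0.2, 1.0], [1.0, 0.2]],   # reference A: cheap for this plan (cost 0.2)
    [[0.6, 1.0], [1.0, 0.6]],   # reference B: more expensive (cost 0.6)
]
weights, eta = update_reference_weights(plan, costs)
```

Reference A explains the current plan at lower cost and therefore receives the larger weight. In a full coordinate descent loop this update would alternate with a UOT solve on the re-weighted cost matrix until the objective stops changing.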
Unseen cell-type identification and cell-type annotation
Once we have the mapping matrix $T$ at our disposal, we can identify query cells that belong to unseen cell types. Our first step is to perform row normalization on the estimated $T$, resulting in a normalized mapping matrix denoted as $P$. In this context, $P_{ik}$ can be understood as the probability that query cell $i$ belongs to cell type $c_k$. Essentially, cells that belong to predefined cell types in the reference data will have a high probability of being assigned to those cell types, whereas cells not fitting within these predefined types will exhibit similar, low probabilities across all possibilities.
To identify unseen cell types, we introduce a scoring mechanism based on
:
![]() |
(13) |
Here,
is the
-th row vector of matrix
which represents the probability vector that query cell
is predicted to be all cell types in
, and
represents the maximum probability associated with the transport of the
th cell. It serves as a quantifiable measure of how closely the
-th cell aligns with a specific cell type. A higher
implies a stronger likelihood that the cell belongs to a particular known cell type. In contrast, a lower
suggests reduced confidence in matching the cell to any known type and raises the possibility of it belonging to an unseen cell type. In our experiments, we employ a threshold of 0.9 on this score to identify unseen cell types.
For query cells identified as belonging to unseen cell types, we annotate them as ‘unseen’. For the remaining cells, we proceed with annotation based on the cell types with the highest mapping probability:

$$\hat{y}_i = c_{k^{*}}, \qquad k^{*} = \arg\max_{k}\ P_{ik} \tag{14}$$

where $P$ is the row-normalized mapping matrix.
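The final labeling step can be sketched as follows: row-normalize the mapping matrix, take the most probable cell type per query cell, and overwrite the label for cells flagged as unseen. The function name `annotate` is hypothetical, and the sketch deliberately takes the unseen flags as an input (`unseen_mask`) instead of computing the score, so it makes no assumption about the scoring formula.

```python
import numpy as np

def annotate(plan, cell_types, unseen_mask=None):
    """Assign each query cell its most probable reference cell type.

    plan        : (n_cells, n_types) mapping matrix with positive row sums
    cell_types  : names of the columns of `plan`
    unseen_mask : optional boolean array; True marks cells to label 'unseen'
    """
    probs = plan / plan.sum(axis=1, keepdims=True)            # row normalization
    labels = np.asarray(cell_types, dtype=object)[probs.argmax(axis=1)]
    if unseen_mask is not None:
        labels[np.asarray(unseen_mask, dtype=bool)] = "unseen"
    return probs, labels

plan = np.array([[0.30, 0.10], [0.05, 0.15], [0.20, 0.20]])
probs, labels = annotate(plan, ["alpha", "beta"], unseen_mask=[False, False, True])
```

The third cell has identical probabilities for both types, the kind of flat profile the text associates with unseen cell types, and is labeled ‘unseen’ because the caller flagged it.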
Data acquisition
We use two collections of publicly available scRNA-seq datasets (Supplementary Tables S1–S2), spanning different tissues [peripheral blood mononuclear cells (PBMC) and pancreas], cell populations and sequencing technologies, to benchmark scDOT and other methods. The PBMC collection comprises 12 datasets curated from Mereu et al. [35], sequenced using 12 different sequencing technologies. The pancreas collection comprises four datasets curated from Baron et al. [36], Muraro et al. [37], Segerstolpe et al. [38] and Xin et al. [39], sequenced using four different technologies. All data in the pancreas collection can be downloaded from http://husch.comp-genomics.org [40].
RESULTS
Comparison of methods and evaluation metrics
We conducted a comprehensive comparative analysis of scDOT with six cutting-edge cell-type annotation methods, namely scmap-clust [12], scmap-cell [12], Seurat [13], ItClust [17], scANVI [18] and TACCO [28]. These methods represent a diverse array of approaches within the field. For instance, scmap-clust, scmap-cell and Seurat are correlation-based techniques, initially seeking neighboring cells in the reference datasets for the query cells and subsequently annotating the query cells based on these identified neighbors. In contrast, ItClust and scANVI are deep learning-based methods. Similar to our scDOT, TACCO leverages the optimal transport framework for cell-type annotation (for details, please refer to the ‘Comparison methods’ section in the Supplementary materials).
To evaluate the performance of cell-type annotation, we utilized the accuracy metric. For assessing the effectiveness in identifying unseen cell types, we used ROC (Receiver Operating Characteristic) curves and computed the AUC scores (Area Under the ROC curve) along with F1 scores. For more detailed information, please refer to the ‘Performance assessment’ section in the Supplementary materials.
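For readers who want a concrete reference point, the sketch below computes plain annotation accuracy and the F1 score for detecting the ‘unseen’ label from two lists of labels. The function names and toy labels are hypothetical, and the exact definitions used in the benchmark are those given in the Supplementary materials, which may differ in detail (for example, in how unseen cells enter the accuracy).

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of cells whose predicted label equals the true label."""
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)

def f1_for_label(true_labels, predicted_labels, label="unseen"):
    """F1 score for detecting one label (here: the 'unseen' class)."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

true_labels = ["alpha", "beta", "unseen", "unseen", "alpha"]
predicted = ["alpha", "unseen", "unseen", "beta", "alpha"]
acc = accuracy(true_labels, predicted)
f1 = f1_for_label(true_labels, predicted)
```

In this toy example three of five cells are labeled correctly (accuracy 0.6), and one of the two truly unseen cells is detected at the price of one false alarm (precision 0.5, recall 0.5, F1 0.5).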
Benchmark of cell-type annotation
To compare the performance of scDOT and other methods in annotating new query data using multiple references, we selected two data collections from different tissues with distinct sequencing technologies: the PBMC data collection and the pancreas data collection. For each data collection, we performed a set of experiments where each dataset alternated as the query data while the others served as reference data. This approach yielded 12 sets of experiments using the PBMC data collection and 4 sets of experiments using the pancreas data collection. Our evaluation was based on the annotation accuracy of the entire query dataset.
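The rotation protocol described above, in which each dataset serves once as the query while the rest act as references, can be written as a short generator. The function name `leave_one_out_splits` is hypothetical; the dataset names are those of the pancreas collection.

```python
def leave_one_out_splits(dataset_names):
    """Yield (query, references) pairs, using each dataset once as the query."""
    for query in dataset_names:
        references = [name for name in dataset_names if name != query]
        yield query, references

pancreas = ["Baron", "Muraro", "Segerstolpe", "Xin"]
splits = list(leave_one_out_splits(pancreas))
for query, references in splits:
    print(f"query={query}, references={references}")
```

Applied to a collection of 12 datasets the same generator yields 12 experiments, and applied to four datasets it yields four, matching the counts reported for the PBMC and pancreas collections.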
Table 1 presents the annotation accuracy of each method when each of the 12 datasets in the PBMC data collection was used as query data. In 11 out of the 12 experiments, scDOT achieved the highest accuracy. When the C1HT-medium data served as the query data, scDOT ranked second, closely following the leader, Seurat. In the pancreas dataset collection, scDOT also outperformed the comparison methods in three out of the four experiments (see Table 2). Notably, scDOT showed superior performance compared to TACCO, although TACCO also uses optimal transport for cell-type annotation. This difference can be attributed to scDOT’s capacity to integrate multiple reference datasets with distinct weights, as opposed to simply concatenating multiple reference datasets as TACCO does. Additionally, scDOT can dynamically adjust the weight assigned to each reference dataset based on its predictive performance. These results affirm scDOT’s superiority in cell-type annotation.
Table 1.
Annotation accuracy of each method in PBMC dataset collection
| Methods | Chromium | inDrop | C1HT-medium | C1HT-small | CEL-Seq2 | ddSEQ | Drop-Seq | ICELL8 | MARS-Seq | Quartz-Seq2 | mcSCRB-Seq | Smart-Seq2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| scmap-clust | 0.82 | 0.63 | 0.56 | 0.39 | 0.53 | 0.25 | 0.48 | 0.14 | 0.14 | 0.73 | 0.17 | 0.63 |
| scmap-cell | 0.77 | 0.73 | 0.47 | 0.41 | 0.53 | 0.69 | 0.72 | 0.41 | 0.43 | 0.80 | 0.45 | 0.66 |
| Seurat | 0.87 | 0.82 | **0.74** | 0.72 | 0.43 | 0.84 | 0.86 | 0.54 | 0.60 | 0.87 | 0.53 | 0.81 |
| scANVI | 0.87 | 0.83 | 0.67 | 0.63 | 0.56 | 0.86 | 0.89 | 0.57 | 0.71 | 0.86 | 0.45 | 0.84 |
| ItClust | 0.77 | 0.75 | 0.45 | 0.39 | 0.71 | 0.70 | 0.74 | 0.33 | 0.46 | 0.86 | 0.40 | 0.57 |
| TACCO | 0.62 | 0.75 | 0.53 | 0.68 | 0.51 | 0.79 | 0.75 | 0.52 | 0.56 | 0.48 | 0.67 | 0.78 |
| scDOT | **0.90** | **0.86** | 0.70 | **0.74** | **0.83** | **0.89** | **0.90** | **0.59** | **0.78** | **0.90** | **0.73** | **0.85** |
Note: In the table, the first column specifies the comparison method, while the first row in other columns denotes the dataset utilized as query data. The values in the table represent the accuracy achieved by each method in the respective experiment. Entries in bold highlight the highest accuracy achieved in each experiment.
Table 2.
Annotation accuracy of each method in Pancreas dataset collection
| Methods | Baron | Muraro | Segerstolpe | Xin |
|---|---|---|---|---|
| scmap-clust | 0.46 | 0.91 | 0.97 | 0.96 |
| scmap-cell | 0.90 | 0.87 | 0.92 | **0.99** |
| Seurat | 0.88 | 0.85 | 0.93 | **0.99** |
| scANVI | 0.47 | 0.92 | 0.94 | 0.98 |
| ItClust | 0.43 | 0.67 | 0.79 | 0.79 |
| TACCO | 0.86 | 0.90 | 0.88 | 0.98 |
| scDOT | **0.96** | **0.94** | **0.98** | 0.89 |
Note: The first column specifies the method, and the header of each remaining column gives the dataset used as query data. The values are the accuracy achieved by each method in the corresponding experiment. Entries in bold indicate the highest accuracy in each experiment.
Benchmark of unseen cell-type identification
Unseen cell types, those present in the query data but not in the reference dataset, are a common occurrence due to various experimental factors such as sequencing technologies, tissue sources, and experimental goals. Accurate identification of unseen cell types is a pivotal aspect of cell-type annotation. To assess scDOT’s performance in identifying unseen cell types, we conducted experiments using the pancreas data collection. In this collection, the ‘Xin’ dataset comprises only four cell types (Alpha, Beta, Delta and pp). We alternated between selecting one dataset from ‘Baron’, ‘Muraro’ and ‘Segerstolpe’ as the query data, while the remaining datasets were employed as references. To simulate scenarios with unseen cell types, we retained only the common cell types shared by different reference datasets. Any additional cell types present in ‘Baron’, ‘Muraro’ and ‘Segerstolpe’ were considered as unseen cell types, enabling the evaluation of each method’s performance. We excluded the PBMC collection from this evaluation due to its datasets containing the same cell types, making it infeasible to simulate scenarios where query data might include unseen cell types.
The ROC curves and corresponding AUC values for each method are presented in Figure 2(A). The results highlight a significant advantage of scDOT in identifying unseen cell types within the ‘Baron’ dataset. When it comes to recognizing unseen cell types in the ‘Muraro’ and ‘Segerstolpe’ datasets, both scDOT and Seurat demonstrate high accuracy. In practical applications, it is common to apply a threshold for unseen cell-type identification. To assess performance under these conditions, we compared the F1 scores of scDOT with scmap-cell, scmap-clust and Seurat since these methods offer default thresholds to identify unseen cell types. Figure 2(B) illustrates the F1 scores for each method. The results indicate that when utilizing each method’s default threshold for identifying unseen cell types, scDOT consistently exhibits superior accuracy in all three experiments.
Figure 2.
Performance in unseen cell-type identification. (A) ROC curves and their corresponding AUC scores for various methods. Each panel’s name indicates the dataset employed as the query data. The color of each line and bar corresponds to a specific method. (B) F1 scores for scDOT, scmap-cell, scmap-clust and Seurat. Each bar’s color corresponds to a specific method.
Additionally, we visually illustrated the effectiveness of the metrics proposed by each method in distinguishing between common and unseen cell types in Figure 3. It is evident that scDOT’s metric, which is based on the mapping matrix of optimal transport, offers superior differentiation between shared and unseen cell types in various experiments. In contrast, Seurat’s metric occasionally confuses shared and unseen cell types in the ‘Baron’ dataset. This could be partially attributed to Seurat’s approach of consolidating multiple reference datasets into a single reference dataset, which might diminish the specificity and diversity of the reference data, affecting the identification of unseen cell types. These results highlight how scDOT leverages complementary information from different reference datasets, thereby enhancing the precise identification of unseen cell types across diverse experimental settings.
Figure 3.

Distribution of metric scores defined by different methods to distinguish between common and unseen cell types. The name of each row indicates the dataset used as the query data, and the name of each panel indicates the method. The y-axis represents the density of the score distribution, and the x-axis shows the score. The histogram is color-coded to distinguish between common and unseen cell types, and the position of the red vertical line on the x-axis indicates the threshold value for each method.
DISCUSSION
In this study, we introduced scDOT, a novel cell-type annotation method that harnesses distance metric learning and optimal transport. scDOT brings two key innovations. (i) Integration of reference datasets with predictive power quantification. Through distance metric learning and optimal transport, scDOT introduces a novel optimization framework that concurrently learns the weights assigned to individual references and the probabilistic mapping matrix between cells in the query data and cell types defined in the references. Consequently, scDOT can automatically select the most suitable reference datasets through an adaptable weighting scheme, thereby enhancing annotation accuracy. (ii) Interpretable and effective scores for unseen cell-type identification. To identify previously uncharacterized cell types within query data, scDOT introduces an interpretable scoring approach. These scores assess how well query cells align with the reference data, enabling the precise identification of previously unseen cell types. Together, these innovations offer improved flexibility, accuracy and interpretability in cell-type annotation.
We systematically evaluated scDOT’s performance through 19 experiments on diverse datasets encompassing various tissues, cell populations and sequencing technologies. Our findings support scDOT’s ability to integrate multiple reference datasets for accurate cell-type annotation and precise identification of unseen cell types in query data. When we compared scDOT run with multiple references against scDOT run with a single reference, the integrated results consistently outperformed those obtained from individual reference datasets (Supplementary Figure S1). We also observed a positive correlation between the weights scDOT assigns to reference datasets and their annotation accuracy (Supplementary Figure S2), highlighting scDOT’s effectiveness in quantifying predictive power and enhancing cell-type annotation through multiple references (please refer to the ‘Effectiveness of integrating multiple references’ section in the Supplementary Materials). Despite integrating multiple reference datasets, scDOT’s memory usage remains comparable to that of the comparison methods, and its runtime is shorter (Supplementary Figure S3).
In practical applications, annotating clinical or experimental samples often necessitates using a reference atlas constructed from healthy or wildtype samples. In such scenarios, the influence of disease states on various cell types poses a challenge, complicating the annotation of query cells. To showcase scDOT’s potential in addressing the impact of such confounding variables, we conducted experiments using datasets from normal samples [41] to annotate query data from breast cancer samples [42] (refer to the ‘Annotating cancer samples using normal reference’ section in the Supplementary Materials). The experimental results indicate that our scDOT accurately annotates the query datasets, identifies unseen cell types unique to the query data and notably outperforms the comparison methods (Supplementary Figure S4).
Furthermore, the heterogeneity and dynamic plasticity observed in cancer cells within a tumor significantly influence outcomes and drug responses. Accurate classification of these cells facilitates patient stratification and the prediction of suitable pharmacological treatments. Hence, we conducted experiments with a breast cancer single-cell atlas [43] and single-cell data from five TNBC patients [44] to evaluate scDOT’s ability to classify cancer cells following the experiments by Gambardella et al. [43] (refer to the ‘Automatic classification of patients’ tumor cells’ section in the Supplementary Materials). The results demonstrate that our scDOT effectively classifies cancer cells into their respective cell lines and further categorizes them into cancer subtypes (Supplementary Figures S5–S6). This outcome underscores scDOT’s promising potential in the realm of precision medicine.
In scRNA-seq experiments, doublets or multiplets occur when two or more cells share the same identifier during library construction, leading them to be incorrectly interpreted as a single cell. As a result, doublets introduce expression profiles that do not authentically represent individual cells and may deviate from genuine cell types. Depending on the cellular composition of these doublets, scDOT might misclassify them as unseen cell types. For instance, if a doublet contains two cells of different types that are both present in the reference data, but their combined expression deviates from every known type, scDOT may classify it as unseen. This could introduce bias and mislead the annotation process. Because scDOT is not designed to identify doublets in scRNA-seq data, failing to remove them before applying scDOT could lead to over-annotation. Various methods have been proposed to detect doublets [45, 46], and we highly recommend applying them to eliminate doublets from the query data before using scDOT for annotation. Looking forward, we aim to integrate precise and robust doublet detection directly into scDOT. This integration will streamline and automate doublet detection, cell-type annotation and identification of unseen cell types, which we anticipate will simplify analysis workflows and reduce potential biases in scRNA-seq analysis.
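The doublet scenario described above can be illustrated with a toy calculation. In this sketch, the two-gene centroids, the averaging used to simulate a doublet and the use of Euclidean distance to the nearest centroid are all simplifying assumptions; the distance is only a stand-in for "poor alignment with the reference" and is not scDOT's score.

```python
import math

# Hypothetical mean expression profiles (two genes) for two reference cell types.
centroids = {"type_A": [10.0, 0.0], "type_B": [0.0, 10.0]}


def nearest_type(profile, centroids):
    """Return (label, distance) of the closest reference centroid."""
    return min(
        ((name, math.dist(profile, c)) for name, c in centroids.items()),
        key=lambda pair: pair[1],
    )


# A genuine single cell close to type_A.
singlet = [9.0, 1.0]
# Simulated heterotypic doublet: the average of one type_A and one type_B profile
# (a simplification of what normalization does to summed counts).
doublet = [(a + b) / 2 for a, b in zip(centroids["type_A"], centroids["type_B"])]

singlet_label, singlet_dist = nearest_type(singlet, centroids)
doublet_label, doublet_dist = nearest_type(doublet, centroids)
```

Here the simulated doublet lies far from both centroids even though each constituent cell type is present in the reference, which is the situation in which a distance-based unseen score would tend to flag it.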
In Equation (9), we assigned weights to individual references to account for their varying predictive capabilities concerning the query data. We proposed optimization models to automatically learn these weights in a data-driven manner, demonstrating that the learned weights correlate positively with the references’ predictive power. While our approach assigns weights at the reference level, there is merit in exploring more granular weight assignments, like per cell type per reference weighting. For instance, in scenarios with the presence of rare cell types, our method might erroneously classify query cells of rare types into more prevalent cell types. To mitigate this, assigning distinct weights to different cell types could reduce the distance between query cells and rare cell types, enhancing the correct annotation of rare cell types. Moreover, considering confounding variables like the disease state mentioned earlier might alter the distances between query cells and cell types defined in references differently. In such cases, adjusting weights per cell type could rectify the distance matrices influenced by these variables. In this study, we did not adopt these more complex methods due to the potential introduction of numerous parameters, which could lead to overfitting without an effective regularization term for these weights. Moving forward, we aim to investigate effective regularization strategies to enable per cell type per reference weight assignments, thereby augmenting the potency of our scDOT.
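The difference between reference-level and per cell type per reference weighting can be sketched as follows. This is an illustrative assumption about the general form (a weighted sum of per-reference distance matrices), not a reproduction of Equation (9); the function names, toy distances and weights are hypothetical.

```python
def combine_per_reference(dists, weights):
    """Weighted sum with one weight per reference.

    dists[r][i][k] is the distance from query cell i to cell type k
    under reference r; weights[r] is the weight of reference r.
    """
    n_cells, n_types = len(dists[0]), len(dists[0][0])
    return [
        [sum(w * d[i][k] for w, d in zip(weights, dists)) for k in range(n_types)]
        for i in range(n_cells)
    ]


def combine_per_type(dists, type_weights):
    """Weighted sum with one weight per (reference, cell type) pair.

    type_weights[r][k] is the weight of reference r for cell type k.
    """
    n_cells, n_types = len(dists[0]), len(dists[0][0])
    return [
        [sum(tw[k] * d[i][k] for tw, d in zip(type_weights, dists)) for k in range(n_types)]
        for i in range(n_cells)
    ]


# Two references, one query cell, two cell types (index 0: common, index 1: rare).
dists = [
    [[1.0, 4.0]],  # reference A represents the rare type poorly
    [[3.0, 1.0]],  # reference B represents the rare type well
]

coarse = combine_per_reference(dists, [0.75, 0.25])
fine = combine_per_type(dists, [[0.75, 0.0], [0.25, 1.0]])
# coarse == [[1.5, 3.25]]: the cell is closest to the common type.
# fine   == [[1.5, 1.0]]:  the cell is closest to the rare type.
```

With R references and K cell types, the reference-level scheme has R weights while the finer scheme has R × K, which illustrates why the finer scheme would need regularization to avoid overfitting.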
There are a couple of limitations in our work that warrant further consideration. First, it is essential to address the issue of inconsistent terminology for cell types across reference datasets. Currently, a standardized system for defining, naming and organizing cell types is lacking, leading to variations in cell-type nomenclature across datasets [47]. These differences can result in inconsistent labels when reference datasets are integrated. Our present approach mitigates this concern by primarily selecting single-cell data from the same database or laboratory. In future work, we plan to adopt a common organizational framework, such as the Cell Ontology [48–50], which can automatically identify and label cells within the same category across reference datasets.
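As a minimal illustration of the label-harmonization problem, the sketch below maps dataset-specific labels to a shared vocabulary before references are combined. The synonym table and the `harmonize` helper are hypothetical and hand-written for this example; a real workflow would more likely derive the mapping from Cell Ontology terms.

```python
# Hypothetical synonym table: lower-cased dataset label -> canonical label.
SYNONYMS = {
    "alpha": "alpha cell",
    "alpha cell": "alpha cell",
    "beta": "beta cell",
    "beta cell": "beta cell",
    "pp": "pancreatic polypeptide cell",
    "gamma": "pancreatic polypeptide cell",
}


def harmonize(labels, synonyms):
    """Map labels to canonical names; return the mapped list and any unmapped labels."""
    mapped, unmapped = [], set()
    for label in labels:
        key = label.strip().lower()
        if key in synonyms:
            mapped.append(synonyms[key])
        else:
            mapped.append(label)  # keep the original label rather than guess
            unmapped.add(label)
    return mapped, unmapped


ref_a_labels = ["alpha", "beta", "gamma"]
ref_b_labels = ["Alpha cell", "Beta cell", "PP", "mast"]

harmonized_a, missing_a = harmonize(ref_a_labels, SYNONYMS)
harmonized_b, missing_b = harmonize(ref_b_labels, SYNONYMS)
```

Labels that cannot be mapped are returned separately so that they can be reviewed manually instead of being silently merged with another category.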
Another limitation relates to the handling of unseen cell types in query data. Real-world scenarios often encompass a diverse array of cell types that may not be represented in reference datasets. These unrepresented cells can vary in number and category, presenting challenges. At present, we uniformly label these cells as ‘unseen’. Future efforts will focus on developing specialized methods for clustering these ‘unseen’ cells and conducting marker gene comparisons. This will provide a more in-depth characterization of these previously unidentified cell types, significantly enhancing the comprehensiveness and depth of biological insights derived from scDOT analyses [44].
Key Points
A novel cell-type annotation method, scDOT, is proposed to address the challenge of accurately annotating cell types in scRNA-seq data. scDOT is specifically engineered to integrate multiple reference datasets and facilitate the identification of previously unseen cell types.
scDOT combines distance metric learning and optimal transport in a framework that quantifies the predictive power of reference datasets and enables both cell-type annotation and precise identification of previously unseen cell types.
Rigorous evaluation of scDOT showcases its superior performance in cell-type annotation, particularly in identifying unseen cell types, providing a potent tool for researchers to enhance their understanding of complex biological tissues.
Supplementary Material
Author Biographies
Yi-Xuan Xiong is a PhD student at School of Mathematics and Statistics, Central China Normal University, Wuhan, China. Her research interests include bioinformatics and machine learning.
Xiao-Fei Zhang is a professor at School of Mathematics and Statistics, Central China Normal University, Wuhan, China. His current research interests include data mining, machine learning and bioinformatics.
Contributor Information
Yi-Xuan Xiong, School of Mathematics and Statistics, Central China Normal University, Wuhan 430079, China; Key Laboratory of Nonlinear Analysis & Applications (Ministry of Education), Central China Normal University, Wuhan 430079, China.
Xiao-Fei Zhang, School of Mathematics and Statistics, Central China Normal University, Wuhan 430079, China; Key Laboratory of Nonlinear Analysis & Applications (Ministry of Education), Central China Normal University, Wuhan 430079, China.
FUNDING
This work was supported by the National Natural Science Foundation of China (12271198 and 11871026).
References
- 1. Jaakkola MK, Seyednasrollah F, Mehmood A, Elo LL. Comparison of methods to detect differentially expressed genes between single-cell populations. Brief Bioinform 2017;18(5):735–43.
- 2. Saelens W, Cannoodt R, Todorov H, Saeys Y. A comparison of single-cell trajectory inference methods. Nat Biotechnol 2019;37(5):547–54.
- 3. Liu Z, Sun D, Wang C. Evaluation of cell-cell interaction methods by integrating single-cell RNA sequencing data with spatial information. Genome Biol 2022;23(1):1–38.
- 4. Kiselev VY, Kirschner K, Schaub MT, et al. SC3: consensus clustering of single-cell RNA-seq data. Nat Methods 2017;14(5):483–6.
- 5. Wang B, Zhu J, Pierson E, et al. Visualization and analysis of single-cell RNA-seq data by kernel-based similarity learning. Nat Methods 2017;14(4):414–6.
- 6. Brbić M, Zitnik M, Wang S, et al. MARS: discovering novel cell types across heterogeneous single-cell experiments. Nat Methods 2020;17(12):1200–6.
- 7. Shao X, Liao J, Xiaoyan L, et al. scCATCH: automatic annotation on cell types of clusters from single-cell RNA sequencing data. Iscience 2020;23(3):100882.
- 8. Liang Z, Li M, Ruiqing Zheng Y, et al. SSRE: cell type detection based on sparse subspace representation and similarity enhancement. Genom Proteom Bioinform 2021;19(2):282–91.
- 9. Miao Z, Moreno P, Huang N, et al. Putative cell type discovery from single-cell gene expression data. Nat Methods 2020;17(6):621–8.
- 10. Kiselev VY, Andrews TS, Hemberg M. Challenges in unsupervised clustering of single-cell RNA-seq data. Nat Rev Genet 2019;20(5):273–82.
- 11. Ren P, Shi X, Zhiguang Y, et al. Single-cell assignment using multiple-adversarial domain adaptation network with large-scale references. Cell Rep Methods 2023;3(9):100577.
- 12. Kiselev VY, Yiu A, Hemberg M. scmap: projection of single-cell RNA-seq data across data sets. Nat Methods 2018;15(5):359–62.
- 13. Stuart T, Butler A, Hoffman P, et al. Comprehensive integration of single-cell data. Cell 2019;177(7):1888–1902.e21.
- 14. Li J, Sheng Q, Shyr Y, Liu Q. scMRMA: single cell multiresolution marker-based annotation. Nucleic Acids Res 2022;50(2):e7–7.
- 15. Ianevski A, Giri AK, Aittokallio T. Fully-automated and ultra-fast cell-type identification using specific marker combinations from single-cell transcriptomic data. Nat Commun 2022;13.
- 16. Ji X, Tsao D, Bai K, et al. scAnnotate: an automated cell-type annotation tool for single-cell RNA-sequencing data. Bioinform Adv 2023;3(1):vbad030.
- 17. Jian H, Li X, Gang H, et al. Iterative transfer learning with neural network for clustering and cell type classification in single-cell RNA-seq analysis. Nat Mach Intell 2020;2(10):607–18.
- 18. Chenling X, Lopez R, Mehlman E, et al. Probabilistic harmonization and annotation of single-cell transcriptomics data with deep generative models. Mol Syst Biol 2021;17(1):e9620.
- 19. Xiong Y-X, Wang M-G, Chen L, Zhang X-F. Cell-type annotation with accurate unseen cell-type identification using multiple references. PLoS Comput Biol 2023;19(6):e1011261.
- 20. Shao X, Yang H, Zhuang X, et al. scDeepSort: a pre-trained cell-type annotation method for single-cell transcriptomics using deep learning with a weighted graph neural network. Nucleic Acids Res 2021;49(21):e122–2.
- 21. Cao Z-J, Wei L, Shen L, et al. Searching large-scale scRNA-seq databases via unbiased cell embedding with Cell BLAST. Nat Commun 2020;11(1):3458.
- 22. Yang F, Wang W, Wang F, et al. scBERT as a large-scale pretrained deep language model for cell type annotation of single-cell RNA-seq data. Nat Mach Intell 2022;4(10):852–66.
- 23. Ma F, Pellegrini M. ACTINN: automated identification of cell types in single cell RNA sequencing. Bioinformatics 2020;36(2):533–8.
- 24. Domínguez Conde C, Xu C, Jarvis LB, et al. Cross-tissue immune cell analysis reveals tissue-specific features in humans. Science 2022;376.
- 25. Franchini M, Pellecchia S, Viscido G, Gambardella G. Single-cell gene set enrichment analysis and transfer learning for functional annotation of scRNA-seq data. NAR Genomics Bioinf 2023;5.
- 26. Chen J, Hao X, Tao W, et al. Transformer for one stop interpretable cell type annotation. Nat Commun 2023;14.
- 27. Liu H, Li H, Sharma A, et al. scAnno: a deconvolution strategy-based automatic cell type annotation tool for single-cell RNA-sequencing data sets. Brief Bioinform 2023;24(3):bbad179.
- 28. Mages S, Moriel N, Avraham-Davidi I, et al. TACCO unifies annotation transfer and decomposition of cell identities for single-cell and spatial omics. Nat Biotechnol 2023;1–9.
- 29. Liu Y, Yan H, Shen L-C, Dong-Jun Y. Learning cell annotation under multiple reference datasets by multisource domain adaptation. J Chem Inf Model 2022;63(1):397–405.
- 30. Yuan M, Chen L, Deng M. scMRA: a robust deep learning method to annotate scRNA-seq data with multiple reference datasets. Bioinformatics 2022;38(3):738–45.
- 31. Jing X, Zhang A, Liu F, et al. CIForm as a transformer-based model for cell-type annotation of large-scale single-cell RNA-seq data. Brief Bioinform 2023;bbad195.
- 32. Chizat L, Peyré G, Schmitzer B, Vialard F-X. Scaling algorithms for unbalanced optimal transport problems. Math Comput 2018;87(314):2563–609.
- 33. Smola A, Gretton A, Song L, Schölkopf B. A Hilbert space embedding for distributions. In: International Conference on Algorithmic Learning Theory, pp. 13–31. Berlin, Heidelberg: Springer Berlin Heidelberg, 2007.
- 34. Flamary R, Courty N, Gramfort A, et al. POT: Python Optimal Transport. J Mach Learn Res 2021;22(78):1–8.
- 35. Mereu E, Lafzi A, Moutinho C, et al. Benchmarking single-cell RNA-sequencing protocols for cell atlas projects. Nat Biotechnol 2020;38(6):747–55.
- 36. Baron M, Veres A, Wolock SL, et al. A single-cell transcriptomic map of the human and mouse pancreas reveals inter-and intra-cell population structure. Cell Syst 2016;3(4):346–360.e4.
- 37. Muraro MJ, Dharmadhikari G, Grün D, et al. A single-cell transcriptome atlas of the human pancreas. Cell Syst 2016;3(4):385–394.e3.
- 38. Segerstolpe Å, Palasantza A, Eliasson P, et al. Single-cell transcriptome profiling of human pancreatic islets in health and type 2 diabetes. Cell Metab 2016;24(4):593–607.
- 39. Xin Y, Kim J, Okamoto H, et al. RNA sequencing of single human islet cells reveals type 2 diabetes genes. Cell Metab 2016;24(4):608–15.
- 40. Shi X, Zhiguang Y, Ren P, et al. HUSCH: an integrated single-cell transcriptome atlas for human tissue gene expression visualization and analyses. Nucleic Acids Res 2023;51(D1):D1029–37.
- 41. Bhat-Nakshatri P, Gao H, Sheng L, et al. A single-cell atlas of the healthy breast tissues reveals clinically relevant clusters of breast epithelial cells. Cell Rep Med 2021;2(3):100219.
- 42. Wu SZ, Al-Eryani G, Roden DL, et al. A single-cell and spatially resolved atlas of human breast cancers. Nat Genet 2021;53(9):1334–47.
- 43. Gambardella G, Viscido G, Tumaini B, et al. A single-cell analysis of breast cancer cell lines to study tumour heterogeneity and drug response. Nat Commun 2022;13(1):1714.
- 44. Gao R, Bai S, Henderson YC, et al. Delineating copy number and clonal substructure in human tumors from single-cell transcriptomes. Nat Biotechnol 2021;39(5):599–608.
- 45. Dahlin JS, Hamey FK, Pijuan-Sala B, et al. A single-cell hematopoietic landscape resolves 8 lineage trajectories and defects in kit mutant mice. Blood 2018;131(21):e1–11.
- 46. Xi NM, Li JJ. Benchmarking computational doublet-detection methods for single-cell RNA sequencing data. Cell Syst 2021;12(2):176–194.e6.
- 47. Domcke S, Shendure J. A reference cell tree will serve science better than a reference cell atlas. Cell 2023;186(6):1103–14.
- 48. Bard J, Rhee SY, Ashburner M. An ontology for cell types. Genome Biol 2005;6(2):R21–5.
- 49. Diehl AD, Meehan TF, Bradford YM, et al. The cell ontology 2016: enhanced content, modularization, and ontology interoperability. J Biomed Semantics 7(1–10):2016.
- 50. Wang S, Pisco AO, McGeever A, et al. Leveraging the cell ontology to classify unseen cell types. Nat Commun 2021;12.