Communications Biology
. 2025 Sep 25;8:1360. doi: 10.1038/s42003-025-08745-x

Evaluation of cell type annotation reliability using a large language model-based identifier

Wenjin Ye 1,2,#, Yuanchen Ma 1,2,#, Junkai Xiang 3, Hongjie Liang 2, Jintian Luo 4, Yuantao Li 2,5, Tao Wang 2,5, Qiuling Xiang 2,6, Wu Song 1,, Weiqiang Li 1,2,5,, Weijun Huang 1,2,5,
PMCID: PMC12462508  PMID: 40998963

Abstract

Ensuring accurate cell type annotation in single-cell RNA sequencing data is a significant challenge, as both expert and automated methods can be biased or constrained by their training data, leading to errors and time-consuming revisions. To address this, we developed LICT (Large Language Model-based Identifier for Cell Types), a tool that leverages multi-model integration and a “talk-to-machine” approach. Validated across diverse datasets, LICT consistently aligns with expert annotations. With its objective framework for assessing annotation reliability, LICT can interpret cases where a single cell population exhibits multifaceted traits, allowing researchers to focus on the underlying biological insights. Comparisons with existing tools highlight LICT’s superiority in efficiency, consistency, accuracy, and reliability, establishing it as a powerful tool for single-cell RNA sequencing analysis. Furthermore, its independence from reference data emphasizes LICT’s generalizability, enhancing reproducibility and ensuring more reliable results in cellular research.

Subject terms: Data mining, Software


The authors developed LICT, an LLM-based tool for cell type annotation combining multi-model integration and a “talk-to-machine” approach. LICT delivers interpretable, reliable results and outperforms existing tools in accuracy, efficiency, and consistency.

Introduction

Cell type annotation is crucial for understanding cellular composition and function in single-cell RNA sequencing (scRNA-seq) data, making it an indispensable step in data exploration1. Traditionally, this annotation process has been performed either manually or with automated tools. Manual annotation benefits from expert knowledge but is inherently subjective and highly dependent on the annotator’s experience. Automated tools, on the other hand, provide greater objectivity but often depend on reference datasets, which can limit their accuracy and generalizability (Supplementary Fig. 1)2–4. As a result, ensuring the reliability of cell type annotation remains a persistent challenge in cellular functional research, one that can introduce downstream errors into analyses and experiments and consume time in subsequent corrections.

Recent advancements in artificial intelligence (AI) have opened new possibilities for cell type annotation. One promising development is GPTCelltype, a tool leveraging the large language model (LLM) ChatGPT, which demonstrates that LLMs can autonomously perform cell type annotations without requiring extensive domain expertise or reference datasets1. We believe their most significant contribution lies in providing a practical solution to the critical challenge of objectively assessing the annotation reliability in the interpretation of scRNA-seq data. However, since LLMs are not specifically designed for cell type annotation and are trained on diverse data sources, only a few are well-suited for this task. Even among these, no single model can accurately annotate all cell types. Moreover, the standardized data format encoded within LLMs limits their ability to adapt to the dynamic and complex nature of biological data, where different pieces of evidence may converge on the same conclusion or diverge5. Thus, new strategies are needed to enhance LLMs’ adaptability and improve their performance in cell type annotation, potentially through advanced self-learning mechanisms and ongoing model updates6.

To address the limitations of LLM-based cell type annotation, we systematically evaluated existing models to identify those most suitable for this task. Building on this analysis, we developed three complementary strategies to enhance annotation performance and improve result interpretability. First, a multi-model integration strategy leverages the complementary strengths of multiple LLMs to reduce uncertainty and increase annotation reliability. Second, the “talk-to-machine” strategy iteratively enriches model input with contextual information, mitigating ambiguous or biased outputs. Third, an objective credibility evaluation strategy assesses annotation reliability based on marker gene expression within the input dataset, enabling reference-free, unbiased validation. On this foundation, we developed LICT (LLM-based Identifier for Cell Types), a software package that integrates the most effective LLMs through these three strategies. We further benchmarked LICT against existing supervised machine learning–based annotation tools to evaluate its performance and assess the generalizability of our approach.

Results

Identification of top-performing LLMs for cell type annotation

To identify the most effective LLMs for cell type annotation, we initially evaluated 77 publicly available models7 using a benchmark scRNA-seq dataset of peripheral blood mononuclear cells (PBMCs; GSE164378)8. PBMCs were chosen due to their widespread use in evaluating automated annotation tools9. Standardized prompts incorporating the top ten marker genes for each cell subset were used to elicit annotations, following the benchmarking methodology proposed by Wenpin Hou et al., which assesses agreement between manual and automated annotations1. Based on accessibility and annotation accuracy, we selected five top-performing LLMs for further analysis: GPT-4 (ref. 10), LLaMA-3 (ref. 11), Claude 3 (ref. 12), Gemini (ref. 13), and the Chinese language model ERNIE 4.0 (ref. 14) (Table 1 and Supplementary Data 1).
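The standardized prompting setup can be sketched as follows. This is a minimal illustration only: the function name and prompt wording are assumptions, not the exact text used in the study.

```python
# Illustrative sketch: building a standardized annotation prompt from the
# top ten marker genes of each cluster. The wording is hypothetical; the
# study's exact prompt text is not reproduced here.
def build_prompt(cluster_markers: dict[str, list[str]], tissue: str) -> str:
    lines = [
        f"Identify the cell type of each cluster from {tissue} "
        "single-cell RNA-seq data using the marker genes below. "
        "Give one cell type per cluster."
    ]
    for cluster, genes in cluster_markers.items():
        # Only the top ten markers per cluster are included.
        lines.append(f"Cluster {cluster}: {', '.join(genes[:10])}")
    return "\n".join(lines)

prompt = build_prompt(
    {"0": ["CD3D", "CD3E", "IL7R", "CCR7"], "1": ["CD14", "LYZ", "S100A8"]},
    tissue="PBMC",
)
```

The same prompt template would be sent to each candidate LLM so that responses are comparable across models.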

Table 1.

The 5 LLMs most closely aligned with expert annotations

| Model | Company | Website | Number of cell types | Response number | Free | API | Match | Mismatch |
|---|---|---|---|---|---|---|---|---|
| Claude 3 Opus | Anthropic | claude.ai | 31 | 31 | No | Yes | 26 | 5 |
| Llama 3 70B | Meta | llama.meta.com | 31 | 31 | No | No | 25 | 6 |
| ERNIE-4.0 | Baidu | qianfan.cloud.baidu.com | 31 | 31 | No | Yes | 25 | 6 |
| GPT-4 | OpenAI | openai.com | 31 | 31 | No | Yes | 24 | 7 |
| Gemini 1.5 Pro | Google | deepmind.google/technologies/gemini/pro | 31 | 31 | Yes | Yes | 24 | 7 |

Performance of LLMs diminishes when annotating less heterogeneous datasets

To comprehensively evaluate the annotation capabilities of the five selected LLMs, we validated their performance across four scRNA-seq datasets representing diverse biological contexts: normal physiology (PBMCs8), developmental stages (human embryos15), disease states (gastric cancer16), and low-heterogeneity cellular environments (stromal cells in mouse organs17). Cell type annotation and benchmarking followed the same standardized methodology described above, ensuring consistency across all analyses.

The results showed that all selected LLMs excelled in annotating highly heterogeneous cell subpopulations, such as those in PBMCs and gastric cancer samples, with Claude 3 demonstrating the highest overall performance (Supplementary Fig. 2a, b). However, annotations of less heterogeneous subpopulations, such as those in human embryos and stromal cells, diverged substantially from manual annotations. Among the top-performing models, Gemini 1.5 Pro achieved only 39.4% consistency with manual annotations for embryo data, while Claude 3 reached only 33.3% consistency for fibroblast data (Supplementary Fig. 2c, d). The average annotation consistency scores are shown in Supplementary Fig. 2e. These findings highlight the importance of integrating multiple LLMs to achieve more comprehensive and reliable cell annotations18,19.

Strategy I: multi-model integration strategy

To enhance LLM performance—particularly for low-heterogeneity datasets—we developed a multi-model integration strategy. Instead of conventional approaches like majority voting or relying on a single top-performing model, we select the best-performing results from five LLMs, effectively leveraging their complementary strengths to improve annotation accuracy and consistency across diverse cell types (Fig. 1a). The same four datasets and benchmarking methodology described in the preceding section were used to ensure methodological consistency and comprehensive evaluation.
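The selection step of this strategy can be sketched as follows. The numeric scoring scheme (full match = 2, partial match = 1, mismatch = 0) and the function name are illustrative assumptions, not the paper's exact implementation:

```python
# Illustrative sketch of best-result selection across models: each candidate
# annotation is scored against the benchmark and the highest-scoring one is
# kept per cluster. Scores here are hypothetical (2 = full, 1 = partial,
# 0 = mismatch).
def select_best(candidates: dict[str, str], score: dict[str, int]) -> str:
    # candidates: model name -> proposed cell type
    # score: model name -> match score for that proposal
    return candidates[max(candidates, key=lambda m: score[m])]

candidates = {"GPT-4": "T cell", "Claude 3": "Naive CD4+ T cell", "Gemini": "NK cell"}
scores = {"GPT-4": 1, "Claude 3": 2, "Gemini": 0}
best = select_best(candidates, scores)  # -> "Naive CD4+ T cell"
```

Unlike majority voting, this keeps a correct annotation even when only one of the five models produces it.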

Fig. 1. The multi-model integration strategy improves the annotation performance of LLM.

Fig. 1

a Workflow for the multi-model integration strategy. The process consists of three main steps: (1) LLM consultation—five distinct LLMs are prompted with the top ten marker genes for each cell cluster to generate candidate cell type annotations. (2) Benchmarking—the LLM-generated annotations are compared against manual annotations. Agreement is visualized by color: dark green indicates a full match, light green a partial match, and white a mismatch. (3) Final output—the annotation with the highest agreement across the five LLMs is selected as the final output. b–e Bar plot illustrating the performance of ChatGPT-4 and the integrated set of five LLMs across PBMC, gastric cancer, embryo, and fibroblast datasets. The x-axis shows the match percentage between LLM and expert annotations, while the y-axis, from top to bottom, represents the performance of ChatGPT-4 and the integrated performance of five LLMs. Bar colors represent match categories: full match (dark green), partial match (light green), and mismatch (gray). f Heatmap summarizing match rates for ChatGPT-4 and the integrated set of five LLMs. The x-axis represents the performance of ChatGPT-4 and the integrated performance of five LLMs, while the y-axis corresponds to four independent datasets. Colors range from red (high match rate score) to light red (low match rate score), providing a visual representation of model performance across datasets.

This strategy significantly reduced the mismatch rate in highly heterogeneous datasets—from 21.5% to 9.7% for PBMCs and from 11.1% to 8.3% for gastric cancer data—compared to GPTCelltype (Fig. 1b, c). For low-heterogeneity datasets, the improvement was even more pronounced, with match rates (counting both full and partial matches) increasing to 48.5% for embryo and 43.8% for fibroblast data (Fig. 1d, e). Despite these gains, discrepancies remain: over 50% of annotations for low-heterogeneity cells still do not match manual results. The average annotation consistency scores are shown in Fig. 1f. This suggests that richer informational contexts in high-heterogeneity data may contribute to more robust model training and annotation accuracy20, while also highlighting ongoing opportunities for refining the annotation process.

Strategy II: “talk-to-machine” strategy

To address the limitations of LLM performance in annotating low-heterogeneity cell types, we implemented a “talk-to-machine” strategy to enhance annotation precision. This human-computer interaction process involves the following steps (Fig. 2a):

  1. Marker gene retrieval: the LLM is queried to provide a list of representative marker genes for each predicted cell type based on the initial annotations.

  2. Expression pattern evaluation: the expression of these marker genes is assessed within the corresponding clusters in the input dataset.

  3. Validation: an annotation is considered valid if more than four marker genes are expressed in at least 80% of cells within the cluster. Otherwise, it is classified as a validation failure.

  4. Iterative feedback: for failed validations, a structured feedback prompt is generated containing (i) the expression validation results and (ii) additional differentially expressed genes (DEGs) from the dataset. This prompt is used to re-query the LLM, prompting it to revise or confirm its previous annotation.
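The validation rule in step 3 can be sketched as a simple threshold check. This is a minimal sketch assuming per-cluster expression fractions have already been computed (e.g., from a Seurat object); the function name and data layout are illustrative:

```python
# Illustrative sketch of the step-3 validation rule: an annotation passes
# if more than four proposed marker genes (i.e., at least five) are each
# expressed in at least 80% of cells in the cluster.
def validate_annotation(expr_fraction: dict[str, float],
                        markers: list[str],
                        min_genes: int = 5,
                        min_fraction: float = 0.8) -> bool:
    # expr_fraction: gene -> fraction of cells in the cluster expressing it
    expressed = [g for g in markers
                 if expr_fraction.get(g, 0.0) >= min_fraction]
    return len(expressed) >= min_genes

fractions = {"CD3D": 0.95, "CD3E": 0.9, "IL7R": 0.85,
             "CCR7": 0.8, "CD2": 0.82, "CD7": 0.1}
validate_annotation(fractions, ["CD3D", "CD3E", "IL7R", "CCR7", "CD2", "CD7"])
```

Annotations failing this check would proceed to the iterative-feedback step with the validation results attached.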

Fig. 2. Integrating both the multi-model integration and ‘talk-to-machine’ strategies further enhances the annotation performance of LLM.

Fig. 2

a Workflow for the “talk-to-machine” strategy. This four-step process refines annotations from Strategy I that did not achieve a full match with manual annotations. (1) Marker gene retrieval—the LLM is prompted to provide a list of canonical marker genes corresponding to its own annotation. (2) Expression pattern evaluation—these proposed markers are cross-validated against the original scRNA-seq data to assess their expression within the relevant cell cluster. (3) Validation—an annotation is considered reliable if at least four of its proposed markers are confirmed as expressed; otherwise, it is flagged as unreliable. (4) Iterative feedback—unreliable annotations are returned to the LLM along with validation feedback, prompting it to generate an alternative annotation. b–e Bar plot illustrating the performance of ChatGPT-4, the multi-model integration strategy alone, and the integration of both strategies across PBMC, gastric cancer, embryo, and fibroblast datasets. The x-axis shows the match percentage between LLM and expert annotations, while the y-axis, from top to bottom, represents the performance of ChatGPT-4, the multi-model integration strategy alone, and the integration of both strategies. Bar colors represent match categories: full match (dark green), partial match (light green), and mismatch (gray). f Heatmap summarizing match rates for ChatGPT-4, the multi-model integration strategy alone, and the integration of both strategies. The x-axis represents the performance of ChatGPT-4, the multi-model integration strategy alone, and the integration of both strategies, while the y-axis corresponds to four independent datasets. Colors range from red (high match rate score) to light red (low match rate score), providing a visual representation of model performance across datasets.

Under this optimization strategy, the alignment between our annotation results and manual annotations improved significantly. In highly heterogeneous cell datasets, the full match rate reached 34.4% for PBMC and 69.4% for gastric cancer data, with mismatch rates reduced to 7.5% and 2.8%, respectively (Fig. 2b, c). In low-heterogeneity cell datasets, the full match rate for embryo data improved 16-fold over GPT-4 alone, reaching 48.5%, while the full match rate for fibroblast data remained at 43.8%; mismatch rates decreased to 42.4% for embryo data and remained at 56.2% for fibroblast data (Fig. 2d, e). The average annotation consistency scores are shown in Fig. 2f. These results demonstrate that our interactive LLM strategy successfully enhances annotation accuracy for both high- and low-heterogeneity datasets21–23. However, further effort is needed, as over 50% of annotations in low-heterogeneity data remain inconsistent with manual results. How should the observed discrepancies be interpreted? Do they reflect the superiority of expert judgment, highlight methodological limitations in our approach, or stem from confounding factors, such as inherent constraints on input data quality?

Strategy III: objective credibility evaluation

Discrepancies between LLM-generated and manual annotations do not necessarily indicate reduced reliability of LLM-based methods. Manual annotations often exhibit inter-rater variability and systematic biases, particularly in datasets with ambiguous cell clusters24–26. This highlights the need for an objective framework to distinguish discrepancies caused by annotation methodology from those due to intrinsic limitations in the dataset itself. To address this, we implemented an objective credibility evaluation strategy to assess annotation credibility through the following steps (Fig. 3a):

  1. Marker gene retrieval: for each predicted cell type, the LLM is queried to generate representative marker genes based on the initial annotation.

  2. Expression pattern evaluation: the expression of these marker genes is analyzed within the corresponding cell clusters in the input dataset.

  3. Credibility assessment: an annotation is deemed reliable if more than four marker genes are expressed in at least 80% of cells within the cluster; otherwise, it is classified as unreliable.
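Applying the step-3 credibility rule to both the LLM and expert labels of a mismatched cluster yields the four categories reported in Fig. 3. A minimal sketch, with the function name assumed for illustration:

```python
# Illustrative sketch: classifying a mismatched cluster into the four
# reliability categories of Fig. 3, given the outcome of the credibility
# rule (more than four markers expressed in >= 80% of cells) for each side.
def reliability_category(llm_reliable: bool, expert_reliable: bool) -> str:
    if llm_reliable and expert_reliable:
        return "both reliable"
    if llm_reliable:
        return "only LLM reliable"
    if expert_reliable:
        return "only expert reliable"
    return "both unreliable"

reliability_category(True, False)  # -> "only LLM reliable"
```

The "both reliable" case is biologically informative: two different labels can each be supported by the data when a population carries multifaceted traits.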

Fig. 3. Reliability of LLM and expert annotations within mismatch results.

Fig. 3

a Workflow for objective credibility evaluation. This strategy comprises three main steps: (1) Marker gene retrieval—for each predicted cell type, the LLM is prompted to generate representative marker genes based on its initial annotation. (2) Expression pattern evaluation—the expression of these marker genes is assessed within the corresponding cell cluster in the input dataset. (3) Credibility assessment—an annotation is considered reliable if more than four marker genes are expressed in at least 80% of cells within the cluster; otherwise, it is classified as unreliable. b–e The top bar plot illustrates the performance of integrating both strategies across PBMC, gastric cancer, embryo, and fibroblast datasets. The bottom bar plot shows the reliability distribution of LLM and expert annotations. Bar colors denote match categories: full match (dark green), partial match (light green), and mismatch (gray). Reliability categories are indicated as follows: both reliable (dark blue), only LLM reliable (light blue), only expert reliable (dark yellow), and both unreliable (light yellow).

Credibility assessment results showed comparable annotation reliability between LLM-generated and manual annotations in the gastric cancer dataset (Fig. 3c). In contrast, for the PBMC and low-heterogeneity datasets, LLM-generated annotations outperformed manual ones (Fig. 3b, d, e). Specifically, in the embryo dataset, 50% of mismatched LLM-generated annotations were deemed credible, compared to only 21.3% for expert annotations. For the stromal cell dataset, 29.6% of LLM-generated annotations were considered credible, whereas none of the manual annotations met the credibility threshold (Fig. 3d, e). The results demonstrate that, based on our input scRNA-seq datasets, Strategy III provides greater confidence in identifying reliably annotated cell types for downstream analysis. These findings also highlight the limitations of relying solely on expert judgment for cell type annotation.

Moreover, our objective evaluation identified cases where LLM and manual annotations differed but were both classified as reliable, accounting for 14.3% of the embryo dataset. These results highlight how LLMs can capture valid yet distinct perspectives on cell identity, offering complementary insights into complex biological systems.

Superior performance of LICT

Experimental results from the three strategies demonstrate that: (i) the cell type annotation performance of public LLMs can be substantially enhanced through Strategies I and II, and (ii) reliance on expert annotations can be markedly reduced via Strategy III. By integrating these strategies, we developed LICT (LLM-based identifier for cell types), a public LLM-powered annotation tool. LICT is characterized by its user-friendly design, enabling straightforward installation and seamless compatibility with Seurat, a widely used R-based single-cell analysis package. The LICT analysis pipeline comprises the following steps (Fig. 4):

  1. Preprocessing of the input data using Seurat;

  2. Initial annotation using LICT;

  3. Assessment of annotation reliability;

  4. Output of annotations deemed reliable;

  5. For cell subpopulations with low reliability, execution of a second-round LICT iterative annotation;

  6. Reassessment of reliability for the re-annotated results;

  7. Final output of annotation results, accompanied by detailed reliability information.
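The seven steps above can be sketched as a two-round loop. This is a minimal sketch of the control flow only; `annotate`, `is_reliable`, and `refine` are placeholder callables standing in for the package's internals, which are not specified here:

```python
# Illustrative sketch of the LICT control flow (steps 2-7): annotate each
# cluster, check reliability (Strategy III), re-annotate unreliable clusters
# once with feedback (Strategy II), then report the final reliability.
def lict_pipeline(clusters, annotate, is_reliable, refine):
    results = {}
    for c in clusters:
        ann = annotate(c)                 # first-round annotation
        if not is_reliable(c, ann):       # reliability assessment
            ann = refine(c, ann)          # second-round iterative annotation
        results[c] = {"annotation": ann,
                      "reliable": is_reliable(c, ann)}  # reassessment
    return results
```

In line with the paper's observation that repeated inquiry yields diminishing returns, the sketch performs at most one refinement round per cluster.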

Fig. 4. Workflow of LICT.

Fig. 4

(1) Preprocessing of input data: the workflow begins with standard preprocessing of the input data using the Seurat package with default settings. (2) First-round initial annotation: LICT then performs a first-pass annotation by prompting a panel of five distinct LLMs with the top 10 marker genes for each cluster. (3) Assessment of annotation reliability: all initial annotations are evaluated for their reliability using Strategy III, an objective credibility evaluation framework. (4) Output of reliable annotations: annotations that meet the reliability threshold are outputted directly. (5) Second-round iterative annotation: for cell annotations with low reliability, LICT initiates a second round of iterative annotation by providing the respective LLMs with feedback, including expression validation results and additional DEGs from the dataset. (6) Reassessment of reliability: the revised annotations from the second round undergo another reliability assessment. (7) Final output: the final output comprises all reliable annotation results, accompanied by detailed reliability information.

We conducted a comprehensive benchmarking of LICT against conventional supervised machine learning–based annotation tools, evaluating five key performance dimensions: Usability (intuitive interface and minimal user training), Computational efficiency (processing speed and resource usage), Annotation consistency (stable performance across varying data quality levels), Biological accuracy (agreement with established ground truth), and Interpretability (reliability evaluation).

Most existing annotation tools, such as CellBlast27, require the pre-installation of multiple software packages with strict version dependencies. Others, like scBERT28, are limited by rigid input data formats and high technical barriers, often necessitating advanced programming proficiency. These constraints hinder accessibility for users without computational expertise (Table S1)27–41. Unlike many existing tools, LICT operates without requiring additional external reference datasets, predefined marker gene sets, or complex programming skills, offering a streamlined and accessible user experience. Results can be quickly generated using fixed commands detailed in a clear user manual.

After a usability screening, 12 of 15 annotation tools were selected for benchmarking on the PBMC dataset, a widely adopted standard for evaluating automated annotation tools9. First, LICT demonstrated the shortest average runtime for cell type annotation, highlighting its superior efficiency (Fig. 5a). Second, LICT’s independence from reference data ensures high stability; annotations generated at two different time points achieved a Cohen’s kappa of 0.642, indicating substantial agreement. In contrast, most reference-dependent tools—except scClassify42—showed low consistency across datasets (Fig. 5b). Third, LICT exhibited high annotation accuracy, achieving significantly better alignment with manual annotations and yielding the highest average match score among all tools (Fig. 5c). Finally, in terms of reliability, LICT outperformed all other methods when evaluated under the same objective assessment framework (Fig. 5d).
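The consistency metric used here, Cohen's kappa, corrects observed agreement between two annotation runs for agreement expected by chance. A standard textbook formulation (not LICT's code) can be computed as:

```python
from collections import Counter

# Cohen's kappa between two annotation runs over the same clusters:
# kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and
# p_e is the agreement expected by chance from the label frequencies.
def cohens_kappa(a: list[str], b: list[str]) -> float:
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n            # observed agreement
    ca, cb = Counter(a), Counter(b)
    pe = sum(ca[l] * cb[l] for l in set(a) | set(b)) / n**2  # chance agreement
    return (po - pe) / (1 - pe)

cohens_kappa(["T", "T", "B", "B"], ["T", "T", "B", "NK"])  # -> 0.6
```

By convention, values above roughly 0.6 are read as substantial agreement, which is how the reported 0.642 is interpreted.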

Fig. 5. Superior Performance of LICT.

Fig. 5

a Average runtime of 13 annotation tools on the PBMC and embryo datasets. The x-axis represents different tools, while the y-axis indicates the average runtime. b Consistency evaluation of 13 annotation tools. LICT’s annotation results are shown for two different time points, while the other 12 tools’ results are based on two separate reference datasets. The x-axis represents different tools, and the y-axis represents the consistency score. c Accuracy evaluation of 13 annotation tools. The x-axis represents the query datasets (PBMC and embryo) and the reference datasets. The y-axis shows the average match rate score; LICT’s annotation results are shown with a dashed line. d Reliability of 13 annotation tools. The color bar above indicates the query datasets and their corresponding references. The y-axis represents the 13 tools, and the x-axis shows the percentage of reliable results. Bar colors indicate reliable results (blue), unreliable results (yellow), and no result (gray). e UMAP plot showing the detailed multiple annotation results of LICT within the embryo dataset (E-MTAB-9388). f Pie chart illustrating the proportion of multiple annotation results of LICT within the embryo dataset.

Collectively, these results highlight LICT’s advantages in efficiency, consistency, accuracy, and interpretability, establishing it as a robust and accessible solution for scRNA-seq data analysis.

Interestingly, when annotating embryonic data, we found that some cell populations exhibit characteristics of multiple cell types. LICT uniquely interprets these complexities and helps uncover the nuanced properties of these cells (Fig. 5e). Due to the lack of high-quality reference data, many tools struggle to accurately annotate embryonic datasets. In contrast, LICT successfully annotated 72.7% of subpopulations (Fig. 5f), highlighting its ability to operate independently of reference data. This independence underscores LICT’s superior generalizability, making it a more versatile and robust tool for a broader range of applications.

Generalizability of our optimization strategy

To assess the generalizability of our annotation strategy, we applied it to a broader panel of LLMs capable of cell annotation. A significant majority (25 out of 27) of these LLMs showed improved reliability, with gains ranging from 3.2% to 41.9% (Fig. 6).

Fig. 6. Generalizability of optimization strategy.

Fig. 6

The performance of the optimization strategy on 27 LLMs. The x-axis represents the 27 models, while the y-axis indicates the number of reliable results. Blue points represent results before optimization, and red points represent results after optimization.

Additionally, we applied it to two freely available LLMs, LLaMA-3 and Gemini, using the same optimization techniques as those applied to LICT. This approach significantly enhanced annotation reliability, with consistency with manual annotations improving by 5.5% (gastric cancer) to 15.2% (human embryo, Supplementary Fig. 3a–d). While the performance of these two models individually did not match the results obtained with the five-model integration, they still outperformed single models like GPT-4, demonstrating the effectiveness of our strategy. This highlights LICT’s flexibility in supporting subset-based LLM execution and underscores the importance of both the number and quality of LLMs in achieving reliable cell type annotations.

Discussion

In summary, we developed LICT (LLM-based identifier for cell types), a novel tool that integrates information from multiple models, employs a “talk-to-machine” interface, and provides an objective framework for evaluating annotation reliability. Comparative analyses with existing supervised machine learning-based annotation tools demonstrate that LICT outperforms them in usability, computational efficiency, annotation consistency, biological accuracy, and interpretability, establishing it as a robust resource for scRNA-seq analysis. Notably, LICT also enables the interpretation of cell populations exhibiting features of multiple cell types, thereby facilitating deeper biological insights into cellular heterogeneity.

While LLMs can be useful for cell annotation, our evaluation found that only a few closely matched expert annotations. Even these high-performing models did not consistently deliver reliable results across all cell types42–45, as each LLM excels in different areas46. For example, Claude 3 performed well on PBMC and gastric cancer datasets but was less effective on human embryo data. Gemini and the Chinese LLM ERNIE 4.0 were most effective for low-heterogeneity datasets. These variations in performance likely arise from differences in data sources, the amount of training data used, and the rigid input formats employed during LLM training, which can distort or lose information in complex, highly variable biological data. Our tests also showed that LLMs are less consistent with low-heterogeneity data, possibly due to information degradation caused by rigid input formats. Based on these findings, relying on a single LLM for cell annotation is unlikely to be optimal.

Instead, we propose a multi-model integration strategy, selecting the five LLMs that currently perform best in cell annotation. Rather than relying on conventional approaches such as majority voting or a single top-performing model, our method harnesses the complementary strengths of multiple LLMs to improve annotation accuracy and consistency across diverse cell subsets. This approach capitalizes on the strengths of each model and has proven effective, increasing the match rate between our annotations and expert annotations by at least 2.8% for gastric cancer data and up to 20.9% for human embryo data. However, it is crucial to recognize that current LLMs are not specifically designed for cell type annotation. To fully harness the benefits of a model integration strategy, the development of specialized models tailored for cell annotation, such as scFoundation30, is essential. Unfortunately, the availability of such models remains limited. Furthermore, cell annotation models encounter challenges similar to those faced by LLMs, including biased data sources3. Consequently, relying solely on a model integration strategy does not adequately address the challenges of cell annotation.

To overcome these limitations, we developed a “talk-to-machine” strategy that equips LLMs with self-correction capabilities, reducing ambiguous or biased annotations. By providing LLMs with additional context and information, this approach enables them to assess the credibility of annotation results and revise those deemed unreliable, effectively minimizing discrepancies between LLM-generated and expert annotations. For example, when this strategy is combined with multi-model integration, the full match rate between annotations and expert results increases by at least 20.9% for fibroblast data and up to 45.5% for human embryo data compared to using GPT-4 alone. Interestingly, we found that current LLMs achieve optimal results when this strategy is applied only once; repeated applications do not significantly enhance the credibility of the annotation outcomes. This diminishing effectiveness with repeated inquiries is commonly observed in other LLM applications and may be related to inherent limitations of LLM architectures and their interaction styles47.

It is important to emphasize that, in presenting the results of Strategies I and II, we used concordance with expert annotations as the benchmark for assessing annotation accuracy. This approach follows the evaluation framework proposed by Wenpin Hou et al. 9 and was applied solely for benchmarking purposes—to demonstrate that the core strategies of LICT substantially outperform standalone LLM-based annotation methods (e.g., ChatGPT alone). As mentioned in the Introduction, while manual annotation benefits from domain expertise, it remains inherently subjective and heavily dependent on the annotator’s experience. To address this limitation, it is essential to establish an objective framework for evaluating the reliability of annotation outcomes. By building upon the “talk-to-machine” strategy, LICT effectively fulfills this need and enables more reproducible and interpretable single-cell annotation.

The full workflow of our credibility evaluation for annotation outcomes comprises three key components: (i) marker gene information is automatically generated by LLMs, eliminating the need for manual curation or subjective bias; (ii) all evaluations are systematically validated against the input scRNA-seq data to ensure that conclusions are grounded in empirical evidence; and (iii) credibility is assessed using consistent, quantitative criteria. Collectively, these steps establish a fully objective and reproducible framework for evaluating annotation reliability, aligned with principles of evidence-based decision-making48. By leveraging the data itself to characterize cell identities, this approach avoids the pitfalls of subjective interpretation49 and ensures that LLM outputs are rational, transparent, and verifiable.

Our analysis further demonstrated that discrepancies between LLM-generated and expert annotations do not necessarily indicate lower reliability of the LLM results. In several datasets, such as PBMC, human embryo, and fibroblasts, LLM-generated annotations that differed from manual labels were in fact more reliable under our objective evaluation framework. This outcome is reasonable, given that expert annotation, while valuable, is inherently prone to subjective bias. Moreover, our credibility assessment identified cases in which the LLM and expert annotations differed yet were each independently deemed reliable. Such occurrences likely reflect the multifaceted nature of cell identity or variations in annotation criteria. Our method does not diminish the value of expert annotations; rather, by providing an objective basis for evaluating these discrepancies, it allows researchers to move beyond debates about annotation accuracy and focus on the underlying biological insights.

Unexpectedly, LICT outperforms existing supervised machine learning-based annotation tools across several critical dimensions, including usability, computational efficiency, annotation consistency, biological accuracy, and interpretability. In practice, however, the utility of many existing tools is constrained by their reliance on user-provided reference datasets for training48. This dependency introduces significant variability, as the quality of annotation is directly tied to the completeness and accuracy of the reference data. Moreover, it increases the complexity of use and may compromise the reproducibility and trustworthiness of the results. In contrast, LICT eliminates the need for additional external reference data, delivering accurate, detailed, and consistent annotations independent of user expertise. This reference-free approach enhances generalizability and positions LICT as a more robust and adaptable tool for a wide range of scRNA-seq applications.

Overall, our optimization strategies for LLMs have proven effective in developing LICT, a high-quality cell type annotation tool with three key advantages: a reference-free methodology, evidence-based decision-making, and expertise-independent consistency. Importantly, LICT also demonstrates that high-quality, accessible bioinformatics solutions can be achieved using publicly available LLMs—provided that appropriate optimization strategies are employed.

Online methods

Dataset collection

Four datasets were used in this study: human peripheral blood mononuclear cells (PBMCs, GSE164378)8, gastric tumor samples (GSE206785)16, human embryo data (http://www.human-gastrula.net)15, and fibroblasts from various organs (https://fibroXplorer.com)17. We systematically assessed the performance of LICT’s three core strategies across these four distinct datasets to ensure both methodological rigor and generalizability.

The PBMC dataset was additionally used on its own for (i) the initial screening of LLMs for annotation suitability and (ii) comparative benchmarking against existing tools.

Dataset processing

For each dataset, we performed data processing steps—including quality control, dimensionality reduction, and clustering—following the protocols outlined in the respective original articles.

Differentially expressed gene (DEG) calculations were conducted for each cell cluster using the FindAllMarkers function in the Seurat package (version 4.3.0). We selected genes with a log2 fold change greater than 0.5 and an expression percentage above 25% for further analysis.

These genes were ranked in descending order based on their log2 fold changes. In cases where multiple genes had the same average log2 fold change, they were further ranked by their p-values.
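The selection and tie-breaking rules above can be sketched as follows. This is a minimal illustration in Python; the actual pipeline operates on Seurat's FindAllMarkers output in R, and the field names used here (`pct_expr`, `p_val`) are placeholders for the corresponding Seurat columns.

```python
def select_and_rank_degs(markers, min_log2fc=0.5, min_pct=0.25):
    # Keep genes passing the fold-change and expression-percentage cutoffs,
    # then rank within each cluster: descending log2 fold change,
    # ties broken by ascending p-value.
    kept = [m for m in markers
            if m["avg_log2FC"] > min_log2fc and m["pct_expr"] > min_pct]
    return sorted(kept, key=lambda m: (m["cluster"], -m["avg_log2FC"], m["p_val"]))

markers = [
    {"cluster": 0, "gene": "CD3D",  "avg_log2FC": 1.2, "pct_expr": 0.90, "p_val": 1e-10},
    {"cluster": 0, "gene": "CD3E",  "avg_log2FC": 1.2, "pct_expr": 0.80, "p_val": 1e-8},
    {"cluster": 0, "gene": "GNLY",  "avg_log2FC": 0.3, "pct_expr": 0.40, "p_val": 1e-3},
    {"cluster": 1, "gene": "MS4A1", "avg_log2FC": 2.0, "pct_expr": 0.95, "p_val": 1e-20},
]
ranked = select_and_rank_degs(markers)
# GNLY is filtered out; CD3D precedes CD3E because of its smaller p-value.
```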

Screening of LLMs for annotation suitability

To identify top-performing LLMs for cell type annotation, we evaluated 77 publicly available models (Supplementary Data 1) using a benchmark PBMC scRNA-seq dataset (GSE164378)8. Standardized prompts incorporating the top 10 marker genes per cell subset were used to elicit annotations, following the benchmarking approach of Hou et al., which quantifies agreement between manual and automated annotations. Each annotation—manual or LLM-derived—was mapped to a unique Cell Ontology (CL) term and, when relevant, to a broad cell type category. Annotation accuracy was scored using the following system:

Full match (score = 1): exact match at the cell type or CL term level.

Partial match (score = 0.5): automated annotation matched or was a subclass of the manual label.

Mismatch (score = 0): no concordance in cell type, CL term, or category.
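The three-level scoring rule can be sketched as below. This is an illustrative simplification: the real evaluation first maps both labels to Cell Ontology terms, and the subclass table here is hypothetical.

```python
def match_score(auto_label, manual_label, subclasses):
    # Full match = 1.0; partial match = 0.5 when the automated label is a
    # subclass of the manual label; mismatch = 0.0 otherwise.
    if auto_label == manual_label:
        return 1.0
    if auto_label in subclasses.get(manual_label, set()):
        return 0.5
    return 0.0

# Hypothetical subclass table: CD14+ and CD16+ monocytes are
# subclasses of "monocyte".
subclasses = {"monocyte": {"CD14+ monocyte", "CD16+ monocyte"}}
```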

Model and API selection

Five LLMs were selected for cell type annotation based on their accessibility and efficiency: GPT-4 Turbo Preview (June 2024 version), Gemini 1.5 Pro (June 2024 version), Claude 3 Opus (June 2024 version), Llama 3 70B (June 2024 version), and ERNIE-4.0-8K (June 2024 version). We chose these models for their advanced natural language processing capabilities and robust APIs, which facilitate seamless integration with dataset annotation tasks.

Implementation of a multi-model integration strategy

The multi-model integration strategy consists of three steps: Consulting LLMs, Benchmarking, and Generating Final Outputs.

  • (i)

    Consulting LLMs:

DEG information calculated using the Seurat package was converted into standardized prompts. These prompts were then submitted to five separate LLMs via independent functions designed to query each model:

  1. For general cell type annotation

    “Identify cell types of ‘species tissuename’ using the following markers. Identify one cell type for each row. Just reply with the cell type; there is no need to reply to the reasoning section or explanation section. \n ‘GeneList’ \n Reply in the following format: \n 1: xx \n 2: xx \n N: xx \n

    N is the line number, xx is a phrase including only cell types, such as pluripotent stem cells and smooth muscle cells.”

  2. For fibroblast data annotation

“You are a cell classification expert who uses the following markers to identify which tissue the ‘species celltype’ originate from using the following markers. Identify one tissue for each of the following rows of marker genes, providing only one tissue. Just reply to the tissue, no need to reply to the reasoning section or explanation section: \n ‘GeneList’ \n Reply in the following format:\n1: xx\n2: xx\nN: xx\nN is the line number, xx is a phrase that only includes tissue, such as Bone and Lung. Do not add additional text!”

In these prompts, “\n” represents a newline character. The placeholders ‘species’, ‘celltype’, and ‘tissuename’ are replaced with actual species (e.g., human, mouse), cell type (e.g., fibroblast), and tissue names (e.g., brain). ‘GeneList’ contains the differential genes, with genes for each cell population separated by commas and different populations separated by newline characters.

  • (ii)

    Benchmarking:

    As described in the Dataset Collection section, we systematically evaluated the performance of LICT’s three core strategies across four distinct datasets to ensure both methodological rigor and generalizability. Benchmarking followed the same evaluation framework used in the Screening of LLMs for Annotation Suitability section, maintaining consistency across analyses.

  • (iii)

    Generating final outputs:

For each cell subpopulation, the highest-scoring annotation among the five LLM outputs was selected as the final annotation. These selections were then aggregated to generate the final annotation result for the entire dataset.
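The prompt assembly of step (i) and the output selection of step (iii) can be outlined as follows. This is a minimal Python sketch with hypothetical helper names; the general-annotation template is abridged from the Methods, and the released LICT package implements these steps in its own interface.

```python
def build_prompt(species, tissuename, gene_lists):
    # Gene lists are joined by commas; clusters are separated by newlines,
    # matching the 'GeneList' convention described in the Methods.
    gene_block = "\n".join(", ".join(genes) for genes in gene_lists)
    return (
        f"Identify cell types of '{species} {tissuename}' using the following "
        "markers. Identify one cell type for each row. Just reply with the cell "
        "type; there is no need to reply to the reasoning section or explanation "
        f"section. \n{gene_block}\n"
        "Reply in the following format: \n1: xx \n2: xx \nN: xx"
    )

def aggregate_annotations(candidates):
    # For each cluster, keep the highest-scoring annotation among the
    # model outputs; `candidates` maps cluster id -> [(model, label, score)].
    return {cluster: max(options, key=lambda t: t[2])[1]
            for cluster, options in candidates.items()}

prompt = build_prompt("human", "PBMC",
                      [["CD3D", "CD3E", "IL7R"], ["MS4A1", "CD79A"]])
final = aggregate_annotations({
    0: [("gpt-4", "NK cell", 1.0), ("gemini", "T cell", 0.5)],
    1: [("gpt-4", "B cell", 0.5), ("claude", "naive B cell", 1.0)],
})
```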

Implementation of the “talk-to-machine” strategy

For cell subpopulations annotated using the multi-model integration strategy that received a “partial match” or “mismatch” score, we applied the “talk-to-machine” strategy to optimize the annotations. This strategy comprises four steps: Marker Gene Retrieval, Expression Pattern Evaluation, Validation, and Iterative Feedback.

  • (i)

    Marker gene retrieval:

GPT-4 is queried to generate representative marker genes for each initially predicted cell type:

  1. For general data validation

    “Provide key marker genes for the following ‘species tissuename’, with 15 key marker genes per cell type. Provide only the abbreviated gene names of key marker genes, full names are not required: \n ‘CellType’ \n”

  2. For fibroblast data validation

“Provide key marker genes for the following ‘species celltype’, which are from different tissues, with 15 key marker genes per tissue ‘species celltype’. Provide only the abbreviated gene names of key marker genes, full names are not required: \n ‘Tissues’ \n”

Here, ‘CellType’ refers to the annotated cell type results from the LLMs, and ‘Tissues’ refers to the annotated tissue results from the LLMs.

  • (ii)

    Expression pattern evaluation:

    The expression of these marker genes is assessed within the corresponding clusters in the input dataset.

  • (iii)

    Validation:

    An annotation is deemed valid if at least four marker genes are expressed in ≥80% of the cells in a given cluster; otherwise, it is classified as a failed validation.

  • (iv)

    Iterative feedback:

For failed validations, a structured prompt is generated that includes (1) the validation results and (2) 10 additional DEGs (ranked 11th–20th by log2 fold change). This prompt is used to re-query the LLM, prompting it to revise or confirm its previous annotation:

‘Positive GeneList is expressed in the %d row\n Negative GeneList is expressed in the %d row \nBased on the additional information above, modify my previous response and list all cell types, including those that have not been modified. Reply in the following format: \n1: xx \n2: xx \nN: xx \nN is the line number, xx is the cell type.’

The optimized annotations were also benchmarked using the same four datasets and methodology described earlier to ensure consistency and to evaluate improvements in annotation accuracy.

Implementation of objective credibility evaluation

Building on the “talk-to-machine” strategy, we developed a credibility evaluation framework to interpret discrepancies between LLM-generated and manual annotations. This framework comprises three steps: marker gene retrieval, expression pattern evaluation, and credibility assessment.

  • (i)

    Marker gene retrieval:

    The LLM is queried to generate representative marker genes for each predicted cell type based on its annotation.

  • (ii)

    Expression pattern evaluation:

    The expression of these marker genes is assessed within the corresponding cell clusters in the input dataset.

  • (iii)

    Credibility assessment:

An annotation is considered reliable if at least four marker genes are expressed in ≥80% of the cells in the cluster; otherwise, it is deemed unreliable.
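The credibility rule (shared with the validation step of the “talk-to-machine” strategy) reduces to a simple threshold check, sketched below. The per-gene expression fractions are assumed to be precomputed from the query cluster; in the actual pipeline they come from the Seurat object.

```python
def annotation_is_reliable(expr_fraction, marker_genes,
                           min_markers=4, min_fraction=0.8):
    # Reliable if at least `min_markers` of the LLM-proposed marker genes
    # are expressed in >= `min_fraction` of the cells in the cluster.
    hits = sum(1 for g in marker_genes
               if expr_fraction.get(g, 0.0) >= min_fraction)
    return hits >= min_markers

# Four of the five T-cell markers pass the 80% threshold -> reliable.
expr = {"CD3D": 0.95, "CD3E": 0.90, "CD2": 0.85, "IL7R": 0.82, "CCR7": 0.40}
reliable = annotation_is_reliable(expr, ["CD3D", "CD3E", "CD2", "IL7R", "CCR7"])
```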

LICT pipeline

By integrating the above strategies, we developed LICT (LLM-based Identifier for Cell Types), a publicly available LLM-powered annotation tool. LICT is designed for ease of use, offering simple installation and seamless compatibility with Seurat. The LICT analysis pipeline consists of the following steps:

  • (i)

    Preprocessing with Seurat:

    Data are preprocessed using Seurat’s default tutorial parameters up to the FindAllMarkers() step. During quality control, it is critical to ensure that data quality meets expectations and that stringent QC measures are applied.

  • (ii)

    First-round annotation:

    Marker gene results from Seurat are input into LICT. The tool automatically formats prompts (as described in the multi-model integration strategy) to query five LLMs and retrieves their responses without manual input.

  • (iii)

    Credibility assessment:

    LICT applies an objective evaluation framework to assess the reliability of each annotation generated by the five LLMs and identifies the most credible result.

  • (iv)

    Output of reliable annotations:

    Credible annotation results are directly returned to the user.

  • (v)

    Second-round annotation for low-confidence results:

    For cell subpopulations with low reliability scores, LICT performs a second-round annotation using the “talk-to-machine” strategy. Prompts are generated (as described in the fourth step of that strategy) and sent to the LLMs for refined responses.

  • (vi)

    Reassessment of re-annotated results:

    The updated annotations are re-evaluated using the same credibility framework to determine their reliability.

  • (vii)

    Final output:

LICT outputs the complete annotation results from all LLMs, along with their corresponding credibility assessments.

Comprehensive benchmarking of LICT against automated tools

  • (i)

    Supervised machine learning–based annotation tools

    SingleR34 (version 1.4.1), scmap-cell33 (version 1.22.3), CellAssign36 (version 0.99.2), CHETAH39 (version 1.16.0), scClassify42 (version 1.5.1), sciBET41 (version 1.0), CellTypist43 (version 1.6.3), TOSICA44 (version 1.0.0), scPred37 (version 1.9.2), singleCellNet38 (version 0.1.0), CellBlast40 (version 0.5.1), and Garnett35 (version 0.2.20) were used to perform cell type annotations with default settings. It is worth noting that CellAssign was implemented as a class from the scvi-tools submodule50. For cell type annotation using CellAssign, marker genes were input based on the top 3 marker genes identified from the reference dataset using the FindAllMarkers function in the Seurat package.

  • (ii)

    Datasets

For PBMC annotation, two reference datasets from the Gene Expression Omnibus (GEO), accession numbers GSE13435551 and GSE10701152, were used. For annotation of the embryo dataset, the reference dataset was GSE15732953.

  • (iii)

    Runtime calculation

The runtime of each annotation tool was measured from loading the reference dataset to completing the annotation, across three independent experiments: annotating the PBMC data with each of two different reference datasets and annotating the embryo dataset with one reference dataset. The reported runtime is the mean of these three experiments. For LICT, which requires no reference dataset, runtime was measured from loading the query dataset to completing the annotation.

  • (iv)

    Consistency calculation

Cohen’s kappa score was used to measure consistency. For supervised automated annotation tools, consistency was evaluated by comparing the match scores obtained for the PBMC data with two different reference datasets, using the same match evaluation method described earlier. Because LICT is reference-free, its consistency was instead evaluated by comparing results gathered at two different time points (June 2024 and December 2024).

  • (v)

    Reliability calculation

GPT-4 Turbo Preview was used to evaluate the reliability of each automated annotation tool. For each cell type annotation, GPT-4 Turbo Preview provides signature genes that are commonly believed to be predominantly expressed for that cell type. The current query cell cluster is then checked to see if these genes are expressed in over 80% of the cells within the cluster. If enough (default: 4) marker genes meet the threshold, the cell type annotation is considered reliable; otherwise, it is deemed unreliable.
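The consistency metric in (iv) is a standard Cohen's kappa; a minimal implementation is sketched below, applied here to raw cluster labels for simplicity (the paper applies it to match scores derived from two runs, which follows the same formula).

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    # Observed agreement p_o minus chance agreement p_e, normalized by 1 - p_e.
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    ca, cb = Counter(labels_a), Counter(labels_b)
    p_e = sum(ca[k] * cb[k] for k in ca) / n ** 2
    return (p_o - p_e) / (1 - p_e)

# Two annotation runs over four clusters, agreeing on three of them.
kappa = cohens_kappa(["T", "T", "B", "NK"], ["T", "T", "B", "B"])
```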

Statistics and reproducibility

All computational experiments and benchmarking analyses were conducted in triplicate to ensure robustness and reproducibility. Each replicate was defined as a complete, independent execution of an annotation tool’s analysis pipeline on a given dataset. Reported quantitative results—including performance scores, runtimes, and reliability percentages—represent the mean values across the three replicates.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Supplementary information

Supplementary Information (637.9KB, pdf)
42003_2025_8745_MOESM2_ESM.docx (14.3KB, docx)

Description of Additional Supplementary Files

Supplementary Data 1 (15.5KB, xlsx)
Reporting Summary (1.2MB, pdf)

Acknowledgements

This work was supported by grants from the National Key Research and Development Program of China (2023YFC2506100, 2021YFA1100600, and 2022YFA1104100); the National Natural Science Foundation of China (82270566, 82471462, and 32130046); the Science and Technology Program of Guangzhou (202206060003); and the Pioneering Talents Project of Guangzhou Development Zone (2017-L163).

Author contributions

W.Y., Y.M., and W.H. conceptualized and designed the project. W.L., W.S., and W.H. supervised the research. J.X. performed data collection. W.Y., J.X., H.L., Y.L., and J.L. performed experiments and data analysis. W.Y., J.X., and H.L. designed the LICT packages. Y.M. and W.H. drafted the manuscript. Q.X., W.L., W.S., Y.M., T.W., and W.H. revised the manuscript. All authors read and approved the final manuscript.

Peer review

Peer review information

Communications Biology thanks the anonymous reviewers for their contribution to the peer review of this work. Primary Handling Editors: Chien-Yu Chen and Laura Rodríguez Pérez. A peer review file is available.

Data availability

All datasets analyzed during the current study are publicly available. The four primary datasets used to assess the performance of LICT’s core strategies are: human peripheral blood mononuclear cells (PBMCs), available from the GEO under accession number GSE164378; gastric tumor samples, available from GEO under accession number GSE206785; human embryo data, accessible at http://www.human-gastrula.net; and a cross-organ fibroblast dataset, accessible at https://fibroXplorer.com. For the comparative benchmarking against supervised annotation tools, the following publicly available reference datasets were used: for PBMC and embryo data annotation, GEO accession numbers GSE134355, GSE107011, and GSE157329, respectively. The source data generated during this study, including the data underlying the figures and the analysis outputs, have been deposited in the Figshare repository and are publicly available at 10.6084/m9.figshare.29816852 (ref. 54).

Code availability

The LICT package (v0.1.0) is provided as an open-source software package with a detailed user manual, available in the GitHub repository at https://github.com/Glowworm-cell/LICT (ref. 55). All code to reproduce the presented analyses is publicly available in the GitHub repository at https://github.com/Glowworm-cell/LICT_paper (ref. 56).

Competing interests

All authors declare they have no competing interests.

Footnotes

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

These authors contributed equally: Wenjin Ye, Yuanchen Ma.

These authors jointly supervised this work: Wu Song, Weiqiang Li, Weijun Huang.

Contributor Information

Wu Song, Email: songwu@mail.sysu.edu.cn.

Weiqiang Li, Email: liweiq6@mail.sysu.edu.cn.

Weijun Huang, Email: hweijun@mail.sysu.edu.cn.

Supplementary information

The online version contains supplementary material available at 10.1038/s42003-025-08745-x.

References

  • 1. Hou, W. & Ji, Z. Assessing GPT-4 for cell type annotation in single-cell RNA-seq analysis. Nat. Methods 21, 1462–1465 (2024).
  • 2. Clarke, Z. A. et al. Tutorial: guidelines for annotating single-cell transcriptomic maps using automated and manual methods. Nat. Protoc. 16, 2749–2764 (2021).
  • 3. Pasquini, G., Rojo Arias, J. E., Schäfer, P. & Busskamp, V. Automated methods for cell type annotation on scRNA-seq data. Comput. Struct. Biotechnol. J. 19, 961–969 (2021).
  • 4. Abdelaal, T. et al. A comparison of automatic cell identification methods for single-cell RNA sequencing data. Genome Biol. 20, 1–19 (2019).
  • 5. Ge, Y. et al. OpenAGI: when LLM meets domain experts. In Advances in Neural Information Processing Systems, Vol. 36 (2024).
  • 6. Ferdinan, T., Kocoń, J. & Kazienko, P. Into the unknown: self-learning large language models. In Proceedings of the 2024 IEEE International Conference on Data Mining Workshops (ICDMW) 423–432 (IEEE, Abu Dhabi, United Arab Emirates, 2024).
  • 7. Minaee, S. et al. Large language models: a survey. Preprint at https://arxiv.org/abs/2402.06196 (2024).
  • 8. Hao, Y. et al. Integrated analysis of multimodal single-cell data. Cell 184, 3573–3587 (2021).
  • 9. Abdelaal, T. et al. A comparison of automatic cell identification methods for single-cell RNA sequencing data. Genome Biol. 20, 194 (2019).
  • 10. Achiam, J. et al. GPT-4 technical report. Preprint at https://arxiv.org/abs/2303.08774 (2023).
  • 11. Dubey, A. et al. The Llama 3 herd of models. Preprint at https://arxiv.org/abs/2407.21783 (2024).
  • 12. Kevian, D. et al. Capabilities of large language models in control engineering: a benchmark study on GPT-4, Claude 3 Opus, and Gemini 1.0 Ultra. Preprint at https://arxiv.org/abs/2404.03647 (2024).
  • 13. Reid, M. et al. Gemini 1.5: unlocking multimodal understanding across millions of tokens of context. Preprint at https://arxiv.org/abs/2403.05530 (2024).
  • 14. Wang, Y., Sun, Y., Ma, Z., Gao, L. & Xu, Y. An ERNIE-based joint model for Chinese named entity recognition. Appl. Sci. 10, 5711 (2020).
  • 15. Tyser, R. C. V. et al. Single-cell transcriptomic characterization of a gastrulating human embryo. Nature 600, 285–289 (2021).
  • 16. Kang, B. et al. Parallel single-cell and bulk transcriptome analyses reveal key features of the gastric tumor microenvironment. Genome Biol. 23, 265 (2022).
  • 17. Buechler, M. B. et al. Cross-tissue organization of the fibroblast lineage. Nature 593, 575–579 (2021).
  • 18. Wan, F. et al. Knowledge fusion of large language models. In International Conference on Learning Representations (2024).
  • 19. Mavromatis, C., Karypis, P. & Karypis, G. Pack of LLMs: model fusion at test-time via perplexity optimization. Preprint at https://arxiv.org/abs/2404.11531 (2024).
  • 20. Li, B., Jiang, Y., Gadepally, V. & Tiwari, D. LLM inference serving: survey of recent advances and opportunities. Preprint at https://arxiv.org/abs/2407.12391 (2024).
  • 21. Aponte, R. et al. A framework for fine-tuning LLMs using heterogeneous feedback. Preprint at https://arxiv.org/abs/2408.02861 (2024).
  • 22. Bill, D. & Eriksson, T. Fine-tuning a LLM using reinforcement learning from human feedback for a therapy chatbot application. Bachelor’s thesis, KTH Royal Institute of Technology. https://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-331920 (2023).
  • 23. Lee, H. et al. RLAIF vs. RLHF: scaling reinforcement learning from human feedback with AI feedback. In Proceedings of the 41st International Conference on Machine Learning 26874–26901 (PMLR, 2024).
  • 24. Traag, V. A., Waltman, L. & van Eck, N. J. From Louvain to Leiden: guaranteeing well-connected communities. Sci. Rep. 9, 5233 (2019).
  • 25. Duò, A., Robinson, M. D. & Soneson, C. A systematic performance evaluation of clustering methods for single-cell RNA-seq data. F1000Research 7, 1141 (2018).
  • 26. Freytag, S., Tian, L., Lönnstedt, I., Ng, M. & Bahlo, M. Comparison of clustering tools in R for medium-sized 10x Genomics single-cell RNA-sequencing data. F1000Research 7, 1141 (2018).
  • 27. Cao, Z.-J., Wei, L., Lu, S., Yang, D.-C. & Gao, G. Searching large-scale scRNA-seq databases via unbiased cell embedding with Cell BLAST. Nat. Commun. 11, 3458 (2020).
  • 28. Yang, F. et al. scBERT as a large-scale pretrained deep language model for cell type annotation of single-cell RNA-seq data. Nat. Mach. Intell. 4, 852–866 (2022).
  • 29. Shao, X. et al. scDeepSort: a pre-trained cell-type annotation method for single-cell transcriptomics using deep learning with a weighted graph neural network. Nucleic Acids Res. 49, e122 (2021).
  • 30. Hao, M. et al. Large-scale foundation model on single-cell transcriptomics. Nat. Methods 21, 1–11 (2024).
  • 31. Kiselev, V. Y., Yiu, A. & Hemberg, M. scmap: projection of single-cell RNA-seq data across data sets. Nat. Methods 15, 359–362 (2018).
  • 32. Aran, D. et al. Reference-based analysis of lung single-cell sequencing reveals a transitional profibrotic macrophage. Nat. Immunol. 20, 163–172 (2019).
  • 33. Pliner, H. A., Shendure, J. & Trapnell, C. Supervised classification enables rapid annotation of cell atlases. Nat. Methods 16, 983–986 (2019).
  • 34. Zhang, A. W. et al. Probabilistic cell-type assignment of single-cell RNA-seq for tumor microenvironment profiling. Nat. Methods 16, 1007–1015 (2019).
  • 35. Alquicira-Hernandez, J., Sathe, A., Ji, H. P., Nguyen, Q. & Powell, J. E. scPred: accurate supervised method for cell-type classification from single-cell RNA-seq data. Genome Biol. 20, 264 (2019).
  • 36. Tan, Y. & Cahan, P. SingleCellNet: a computational tool to classify single cell RNA-seq data across platforms and across species. Cell Syst. 9, 207–213 (2019).
  • 37. de Kanter, J. K., Lijnzaad, P., Candelli, T., Margaritis, T. & Holstege, F. C. P. CHETAH: a selective, hierarchical cell type identification method for single-cell RNA sequencing. Nucleic Acids Res. 47, e95 (2019).
  • 38. Li, C. et al. SciBet as a portable and fast single cell type identifier. Nat. Commun. 11, 1818 (2020).
  • 39. Lin, Y. et al. scClassify: sample size estimation and multiscale classification of cells using single and multiple reference. Mol. Syst. Biol. 16, e9389 (2020).
  • 40. Domínguez Conde, C. et al. Cross-tissue immune cell analysis reveals tissue-specific features in humans. Science 376, eabl5197 (2022).
  • 41. Chen, J. et al. Transformer for one stop interpretable cell type annotation. Nat. Commun. 14, 223 (2023).
  • 42. Wei, J. et al. Measuring and reducing LLM hallucination without gold-standard answers. Preprint at https://arxiv.org/abs/2402.10412 (2024).
  • 43. Wang, X. et al. Self-consistency improves chain of thought reasoning in language models. In International Conference on Learning Representations (2023).
  • 44. Ji, Z. et al. Survey of hallucination in natural language generation. ACM Comput. Surv. 55, 1–38 (2023).
  • 45. Maynez, J., Narayan, S., Bohnet, B. & McDonald, R. On faithfulness and factuality in abstractive summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics 1906–1919 (2020).
  • 46. Ye, R. et al. X-MAS: towards building multi-agent systems with heterogeneous LLMs. Preprint at https://arxiv.org/abs/2505.16997 (2025).
  • 47. Yadkori, Y. A., Kuzborskij, I., György, A. & Szepesvári, C. To believe or not to believe your LLM. Preprint at https://arxiv.org/abs/2406.02543 (2024).
  • 48. Heumos, L. et al. Best practices for single-cell analysis across modalities. Nat. Rev. Genet. 24, 550–572 (2023).
  • 49. Chen, G. H., Chen, S., Liu, Z., Jiang, F. & Wang, B. Humans or LLMs as the judge? A study on judgement bias. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing 8301–8327 (2024).
  • 50. Gayoso, A. et al. A Python library for probabilistic analysis of single-cell omics data. Nat. Biotechnol. 40, 163–166 (2022).
  • 51. Han, X. et al. Construction of a human cell landscape at single-cell level. Nature 581, 303–309 (2020).
  • 52. Xu, W. et al. Mapping of γ/δ T cells reveals Vδ2+ T cells resistance to senescence. EBioMedicine 39, 44–58 (2019).
  • 53. Xu, Y. et al. A single-cell transcriptome atlas profiles early organogenesis in human embryos. Nat. Cell Biol. 25, 604–615 (2023).
  • 54. Ye, W. et al. Evaluation of cell type annotation reliability using a large language model-based identifier. Figshare https://doi.org/10.6084/m9.figshare.29816852 (2025).
  • 55. Glowworm-cell. Glowworm-cell/LICT_paper: code. Zenodo https://doi.org/10.5281/zenodo.16762016 (2025).
  • 56. Glowworm-cell. Glowworm-cell/LICT: LICTv0.1.0. Zenodo https://doi.org/10.5281/zenodo.16761626 (2025).
