Journal of Translational Medicine. 2025 Oct 27;23:1176. doi: 10.1186/s12967-025-07091-0

Transformative advances in single-cell omics: a comprehensive review of foundation models, multimodal integration and computational ecosystems

Taylor Yiu 1,2,#, Bin Chen 1,#, Haoyu Wang 1,#, Genyi Feng 3, Qiangqiang Fu 3, Huijing Hu 4,
PMCID: PMC12560279  PMID: 41146276

Abstract

Recent advances in single-cell multi-omics technologies have revolutionized cellular analysis, enabling comprehensive exploration of cellular heterogeneity, developmental trajectories, and disease mechanisms at unprecedented resolution. Foundation models, originally developed for natural language processing, are now driving transformative approaches to high-dimensional, multimodal single-cell data analysis. Frameworks such as scGPT and scPlantFormer excel in cross-species cell annotation, in silico perturbation modeling, and gene regulatory network inference. Multimodal integration approaches, including pathology-aligned embeddings and tensor-based fusion, harmonize transcriptomic, epigenomic, proteomic, and spatial imaging data to delineate multilayered regulatory networks across biological scales. Federated computational platforms facilitate decentralized data analysis and standardized, reproducible workflows, fostering global collaboration. Challenges persist, including technical variability across platforms, limited model interpretability, and gaps in translating computational insights into clinical applications. Overcoming these hurdles demands standardized benchmarking, multimodal knowledge graphs, and collaborative frameworks that integrate artificial intelligence with human expertise. This review synthesizes recent technological advancements and proposes actionable strategies to bridge single-cell multi-omics innovations with mechanistic biology and precision medicine.

Keywords: Single-cell omics, Foundation models, Multimodal integration, Computational ecosystems, Cell type annotation, Perturbation modeling, Data harmonization

Introduction

The advent of single-cell omics technologies has revolutionized the ability to investigate biological systems at the cellular level, offering unprecedented insights into cellular heterogeneity, developmental pathways, and disease mechanisms. Technologies such as single-cell RNA sequencing (scRNA-seq), spatial transcriptomics, and epigenomic profiling produce vast datasets that capture molecular states across millions of individual cells. While these advances have driven significant breakthroughs in precision medicine and systems biology, they have also exposed critical limitations in computational methodologies. Traditional analytical pipelines, typically designed for low-dimensional or single-modality data, are ill-equipped to handle the complexity of modern single-cell datasets, which are characterized by high dimensionality, technical noise, and multimodal data. This mismatch has led to the development of foundation models–large, pretrained neural networks–that serve as transformative tools for decoding cellular complexity. To improve readability for a broad audience, we define key acronyms at first use: single-cell RNA sequencing (scRNA-seq), single-cell assay for transposase-accessible chromatin using sequencing (scATAC-seq), and spatial transcriptomics (ST). Throughout, we use “single-cell foundation models (scFMs)” to refer to large, pretrained models that support zero/few-shot transfer across tasks and modalities. We also use “batch effect” to mean technical variation across protocols, instruments, or centers that is not of biological interest.

Foundation models, originally developed in natural language processing, are now transforming single-cell omics by learning universal representations from large and diverse datasets. Models such as scGPT [1], pretrained on over 33 million cells, demonstrate exceptional cross-task generalization capabilities, enabling zero-shot cell type annotation and perturbation response prediction. Unlike traditional single-task models, these architectures utilize self-supervised pretraining objectives–including masked gene modeling, contrastive learning, and multimodal alignment–allowing them to capture hierarchical biological patterns. For example, scPlantFormer [2] integrates phylogenetic constraints into its attention mechanism, achieving 92% cross-species annotation accuracy in plant systems, while Nicheformer [3] employs graph transformers to model spatial cellular niches across 53 million spatially resolved cells. These advancements signify not incremental improvements, but rather a paradigm shift toward scalable, generalizable frameworks capable of unifying diverse biological contexts [4, 5].
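
As a minimal sketch of the masked gene modeling objective mentioned above, the following Python example hides a random subset of one cell's expression values and scores a prediction only on the hidden positions. The function names, the mask value, the 30% mask fraction, and the toy expression vector are illustrative assumptions, not details taken from scGPT or any other cited model.

```python
import random


def mask_genes(expression, mask_fraction=0.15, mask_value=-1.0, seed=0):
    """Hide a random subset of gene values; return (masked_input, targets).

    targets maps each masked position to its original value, which a model
    would be trained to reconstruct from the genes that remain visible.
    mask_value is a hypothetical sentinel (expression values are non-negative).
    """
    rng = random.Random(seed)
    n_masked = max(1, round(mask_fraction * len(expression)))
    masked_positions = sorted(rng.sample(range(len(expression)), n_masked))
    masked_input = list(expression)
    targets = {}
    for pos in masked_positions:
        targets[pos] = expression[pos]
        masked_input[pos] = mask_value
    return masked_input, targets


def reconstruction_loss(predictions, targets):
    """Mean squared error computed over the masked positions only."""
    errors = [(predictions[pos] - true) ** 2 for pos, true in targets.items()]
    return sum(errors) / len(errors)


# Toy cell with ten genes (made-up values for illustration).
cell = [0.0, 3.2, 1.1, 0.0, 5.4, 2.0, 0.0, 0.7, 4.1, 1.9]
masked_cell, targets = mask_genes(cell, mask_fraction=0.3)

# A trivial stand-in "model": predict the mean of the visible values everywhere.
visible = [v for i, v in enumerate(masked_cell) if i not in targets]
baseline = [sum(visible) / len(visible)] * len(cell)
loss = reconstruction_loss(baseline, targets)
```

In an actual pretraining run, the stand-in baseline would be replaced by a transformer that sees the unmasked genes, and the loss would be averaged over many cells.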

The integration of multimodal data has become a cornerstone of next-generation single-cell analysis, fueled by the convergence of transcriptomic, epigenomic, proteomic, and imaging modalities. Notable breakthroughs, such as PathOmCLIP [6], which aligns histology images with spatial transcriptomics via contrastive learning, and GIST [7, 8], which combines histology with multi-omic profiles for 3D tissue modeling, demonstrate the power of cross-modal alignment. However, technical challenges persist in harmonizing heterogeneous data types–from sparse scATAC-seq matrices to high-resolution microscopy images–while preserving biological relevance. Innovations such as StabMap’s [9, 10] mosaic integration for non-overlapping features and TMO-Net’s [11] pan-cancer multi-omic pretraining represent progress toward robust multimodal frameworks. Here, “mosaic integration” simply means aligning datasets that do not measure the same features (e.g., different gene panels) by leveraging shared cell neighborhoods or robust cross-modal anchors rather than strict feature overlaps. These approaches not only enhance data completeness but also facilitate the discovery of context-specific regulatory networks, such as chromatin accessibility patterns that govern lineage commitment in hematopoiesis.

The development of computational ecosystems has become equally critical to sustaining progress in single-cell omics. Platforms such as BioLLM [12] provide universal interfaces for benchmarking more than 15 foundation models, while DISCO [13] and CZ CELLxGENE Discover [14] aggregate over 100 million cells for federated analysis. Open-source architectures like scGNN+ [15] leverage large language models (LLMs) to automate code optimization, thus democratizing access for non-computational researchers. Despite these advancements, ecosystem fragmentation remains a significant challenge: inconsistent evaluation metrics, unreproducible pretraining protocols, and limited model interoperability hinder cross-study comparisons. Initiatives such as the Human Cell Atlas [16] illustrate the potential of global collaboration, yet sustainable infrastructure for model sharing and version control–similar to Hugging Face in natural language processing–is urgently required.

This review synthesizes three interconnected revolutions in single-cell omics: (1) architectural innovations in foundation models, (2) strategies for multimodal integration, and (3) the development of computational ecosystems. A critical evaluation of 86 seminal studies is presented, highlighting methodological innovations, from transformer-based attention mechanisms to spatially aware graph neural networks, and biological applications across oncology, developmental biology, and immunology. The analysis uncovers emerging best practices, such as hybrid pretraining with biological prior knowledge [17], and identifies persistent challenges, including batch effect propagation in transfer learning [18]. By mapping the current landscape and proposing standardized evaluation frameworks, this work aims to accelerate the translation of computational advances into mechanistic insights and clinical applications, ultimately bridging the gap between cellular omics and actionable biological understanding.

Search methodology

A comprehensive literature search was conducted across Scopus, IEEE Xplore, arXiv, and the bioRxiv/medRxiv preprint servers. The search strategy employed the Boolean combination ("single-cell omics" OR "single cell omics") AND ("deep learning" OR "deep neural network") AND ("foundation model" OR "foundation models") to identify relevant publications. The scope was confined to English-language articles published between January 2022 and December 2024, encompassing both peer-reviewed publications and preprints that describe the development and application of foundation models in single-cell omics analysis. Preprint repositories were included because foundation models in computational biology evolve rapidly; this captures methodological advances prior to formal peer review, while scientific rigor was maintained through careful evaluation of technical validity and reproducibility. Among the included studies, a clear majority release code and trained weights, whereas full pretraining corpora and standardized inference scripts are less frequently available.
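
The logic of the Boolean search string can be sketched in Python as three AND-ed groups of OR-ed phrases. This is only an illustration of the query structure: each database applies its own query syntax and field matching, and the function name and the case-insensitive substring matching used here are assumptions.

```python
def matches_query(text):
    """Return True if text satisfies all three AND-ed groups of OR-ed phrases.

    Matching is a plain case-insensitive substring test, which is a
    simplification of how bibliographic databases evaluate queries.
    """
    groups = [
        ("single-cell omics", "single cell omics"),
        ("deep learning", "deep neural network"),
        ("foundation model", "foundation models"),
    ]
    lowered = text.lower()
    return all(any(phrase in lowered for phrase in group) for group in groups)


# Hypothetical abstracts used only to exercise the filter.
relevant = "A foundation model for single-cell omics built with deep learning."
irrelevant = "Deep learning for bulk RNA-seq deconvolution."
hits = [matches_query(relevant), matches_query(irrelevant)]
```

Under substring matching the plural "foundation models" is already covered by the singular phrase; both are kept to mirror the query as written.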

The initial search yielded 692 potentially relevant articles, which were subsequently evaluated following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines. After removing duplicates and preliminary screening based on titles and abstracts, 157 articles were selected for full-text review. Following detailed assessment against the inclusion criteria, which mandated original research contributions integrating single-cell omics data with deep learning foundation models, 141 articles were ultimately incorporated into this analysis. The screening process was independently executed by two reviewers to ensure systematic and unbiased selection, with any discrepancies resolved through collaborative discussion to establish consensus. The process is shown in Fig. 1.
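
The stage-wise exclusions implied by the counts above can be derived with a few lines of arithmetic. The variable names are illustrative; only the three counts come from the text, and the split between duplicate removal and title/abstract screening is not reported, so the two are combined.

```python
identified = 692   # records returned by the initial search
full_text = 157    # records retained for full-text review
included = 141     # records incorporated into the analysis

# Duplicates and title/abstract exclusions combined (the text does not separate them).
removed_before_full_text = identified - full_text   # 535
# Records excluded after full-text assessment against the inclusion criteria.
excluded_at_full_text = full_text - included         # 16
# Share of initially identified records that were ultimately included.
inclusion_rate = included / identified                # about 0.204
```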

Fig. 1.

PRISMA flow diagram of study selection process. Schematic representation of the systematic literature search and study selection workflow. The diagram tracks the progression from initial database identification through screening to final inclusion, detailing the number of studies at each stage and reasons for exclusion. PRISMA, Preferred Reporting Items for Systematic Reviews and Meta-Analyses

Core advances in foundation models for single-cell omics

Recent breakthroughs in foundation models for single-cell omics have revolutionized the analysis of complex biological data. These advances are driven by significant innovations in model architectures, multimodal integration, and computational ecosystems. The developments are organized into three interconnected domains: architectural innovations and pretraining strategies, multimodal and spatial integration, and interpretability-driven optimization with ecosystem development. Details of recent works are illustrated in Tables 1, 2 and Figs. 2, 3, 4.

Table 1.

Snapshot of representative tools with key performance dimensions (where reported)

| Tool | Category | Typical strengths | Notes (metrics as reported) |
|---|---|---|---|
| scGPT [1] | Foundation model | Zero-shot annotation; perturbation | Large-scale pretraining; heterogeneous tasks |
| Nicheformer [3] | Spatial transformer | Niche context; spatial integration | Massive spatial corpora |
| PathOmCLIP [6] | Cross-modal alignment | Histology–gene mapping | Requires paired datasets |
| StabMap [9] | Mosaic integration | Non-overlap feature alignment | Robust under feature mismatch |
| sysVI [23] | Batch integration | Biology preservation | cVAE-based; batch-aware |
| EpiAgent [20] | Epigenomic FM | cCRE reconstruction | ATAC-centric zero-shot |
EpiAgent [20] Epigenomic FM cCRE reconstruction ATAC-centric zero-shot

Table 2.

Summary of recent advances in foundation models for single-cell omics

| No. | Author(s) | Year | Key features |
|---|---|---|---|
| 1 | Cui et al. [1] | 2024 | Introduces scGPT, a generative pretrained transformer foundation model for single-cell multi-omics analysis. Trained on 33 M+ cells, it demonstrates superior performance in cell type annotation, multi-omic integration, and gene network inference |
| 2 | Cui [24] | 2024 | Presents a PhD thesis framework for single-cell omics foundation models, emphasizing pretraining and transfer learning. Highlights applications in multi-batch integration, perturbation prediction, and spatial transcriptomics |
| 3 | Ma et al. [31] | 2024 | Discusses the potential and challenges of foundation models in single-cell omics, emphasizing progress, limitations, and best practices for downstream tasks |
| 4 | Wagle et al. [41] | 2024 | Addresses interpretability challenges in deep learning for single-cell omics, emphasizing the need for transparent models to identify molecular regulators |
| 5 | Zhang et al. [2] | 2024 | Develops scPlantFormer, a lightweight foundation model for plant single-cell omics, pretrained on 1 M Arabidopsis thaliana cells. Excels in cross-species data integration and cell-type annotation |
| 6 | Schaar et al. [3] | 2024 | Proposes Nicheformer, a transformer-based model for single-cell and spatial omics, trained on 57 M dissociated and 53 M spatially resolved cells. Enables spatial context prediction and integration |
| 7 | Lee et al. [6] | 2024 | Introduces PathOmCLIP, a contrastive learning model connecting tumor histology with spatial gene expression. Validated across five tumor types, it enhances gene expression prediction from histology images |
| 8 | Qiu et al. [12] | 2024 | Presents BioLLM, a standardized framework for integrating and benchmarking single-cell foundation models. Provides a universal interface for streamlined access and evaluation of scFMs, highlighting scGPT’s robust performance |
| 9 | Palayew et al. [42] | 2024 | Introduces scMPT, a model combining scGPT with large language models (LLMs) to leverage textual biological insights. Demonstrates consistent performance improvements over standalone models |
| 10 | Ge et al. [7] | 2024 | Introduces GIST, a deep learning framework integrating histology and transcriptomics for spatial cellular profiling. Validated on human cancer datasets, it improves spatial domain identification and gene expression analysis |
| 11 | Wu et al. [43] | 2024 | Proposes an AI-driven multi-scale modeling framework for predicting causal genotype-environment-phenotype relationships. Integrates multi-omics data across biological levels and species to identify novel biomarkers and therapies |
| 12 | Chen et al. [20] | 2024 | Introduces EpiAgent, a foundation model for single-cell epigenomic data, pretrained on 5 M cells. Excels in feature extraction, cell annotation, and perturbation response prediction, enabling zero-shot capabilities |
| 13 | Maleki et al. [18] | 2024 | Demonstrates efficient fine-tuning of single-cell foundation models for zero-shot molecular perturbation prediction. Introduces a drug-conditional adapter, achieving state-of-the-art results in generalization tasks |
| 14 | Wu et al. [25] | 2024 | Introduces CellPatch, a lightweight foundation model for single-cell transcriptomics using heuristic patching. Achieves state-of-the-art performance with ultra-low computational costs, enhancing downstream task efficiency |
| 15 | Wu et al. [44] | 2025 | Proposes an AI-driven framework for multi-omics integration to predict genotype-environment-phenotype relationships. Highlights challenges in causal inference and generalization, with applications in biomarker and drug discovery |
| 16 | Li et al. [26] | 2024 | Introduces scMonica, an LSTM-transformer hybrid model for single-cell mosaic omics integration. Enhances clustering and analysis of heterogeneous datasets, with applications in developmental biology and oncology |
| 17 | Tang et al. [34] | 2024 | Presents Vec3D, an explainable multimodal model for exploring structured molecular landscapes in single-cell multi-omics data. Demonstrates applications in cell state prediction and molecular dynamics during lung development |
| 18 | Chen et al. [45] | 2024 | Discusses foundation models in bioinformatics, focusing on transformer-based architectures for sequence and non-sequence data. Highlights challenges and future directions for bioinformatics-tailored models |
| 19 | Wen [46] | 2024 | Proposes CellPLM, a transformer-based foundation model for single-cell multi-omics and spatial omics. Demonstrates the potential of foundation models for knowledge transfer and multimodal feature integration |
| 20 | Rood et al. [16] | 2024 | Explores the Human Cell Atlas as a foundation model, integrating molecular and spatial profiling data. Highlights applications in cell census, 3D mapping, and genotype-phenotype relationships |
| 21 | Ho et al. [21] | 2024 | Introduces AIDO.Cell, a dense transformer model for single-cell transcriptomics. Pretrained on 50 M cells, it achieves state-of-the-art results in zero-shot clustering and perturbation modeling |
| 22 | Nam et al. [47] | 2024 | Reviews AI-driven multimodal omics integration for precision medicine. Discusses methodologies, challenges, and the role of large language models in advancing biomedical research |
| 23 | Polychronidou et al. [48] | 2023 | Discusses the future of single-cell biology, highlighting challenges in understanding cell heterogeneity, cell-cell interactions, and the integration of multi-modal data. Emphasizes the role of AI and deep learning in advancing the field |
| 24 | Heryanto et al. [49] | 2024 | Proposes scVQC, a supervised representation learning method for tissue-specific cell type annotation. Outperforms unsupervised foundation models like scBERT and scGPT in cell-type annotation tasks |
| 25 | Hsieh et al. [50] | 2024 | Introduces scEMB, a transformer-based model for learning gene context representations from large-scale single-cell transcriptomics. Excels in batch integration, clustering, and in silico perturbation analysis |
| 26 | Wong et al. [51] | 2025 | Demonstrates that simple baseline methods outperform state-of-the-art deep learning models in predicting genetic perturbation responses. Highlights the utility of foundation models for fine-tuning and generalization |
| 27 | Wang et al. [11] | 2024 | Develops TMO-Net, a pre-trained multi-omics model for oncology. Integrates pan-cancer datasets for cross-omics learning and incomplete data inference, enhancing downstream tasks like tumor classification and biomarker discovery |
| 28 | Wen et al. [22] | 2023 | Introduces CellPLM, a pre-trained model for single-cell data that treats cells as tokens and tissues as sentences. Outperforms existing models in downstream tasks with 100x faster inference speed |
| 29 | Zhao et al. [17] | 2024 | Proposes LangCell, a language-cell pre-training framework for understanding cell identity. Integrates transcriptomics with natural language to enable zero-shot and few-shot cell identity understanding tasks |
| 30 | Wang et al. [52] | 2024 | Introduces CellMemory, a bottlenecked transformer for interpreting out-of-distribution cells. Excels in spatial transcriptomics analysis and disease characterization, outperforming pre-trained large language models |
| 31 | Li et al. [29] | 2024 | Conducts a systematic comparison of single-cell perturbation prediction models. Highlights the strengths and limitations of foundation models in capturing heterogeneous cellular responses to perturbations |
| 32 | Hingerl et al. [53] | 2024 | Introduces scooby, a model for predicting single-cell RNA-seq and ATAC-seq profiles from DNA sequence. Demonstrates applications in gene regulation analysis and variant effect prediction at single-cell resolution |
| 33 | Schäfer et al. [54] | 2024 | Discusses the integration of single-cell multi-omics data with prior biological knowledge to characterize the immune system. Highlights applications in gene regulation and cell-cell interactions |
| 34 | Krishnan et al. [38] | 2024 | Introduces Proximogram, a graph-based framework for integrating single-cell omics and spatial data. Demonstrates improved classification of pancreatic diseases using deep learning models |
| 35 | Jiang et al. [15] | 2024 | Presents scGNN+, a platform integrating ChatGPT for code optimization and tutorial generation in single-cell analysis. Enhances reproducibility and accessibility for non-programmers |
| 36 | Fang et al. [55] | 2024 | Introduces ChatCell, a framework for single-cell analysis using natural language. Leverages vocabulary adaptation and sequence generation to enable intuitive exploration of single-cell data |
| 37 | Ahlmann-Eltze et al. [56] | 2024 | Benchmarks deep learning models against simple linear methods for predicting gene perturbation effects. Finds that linear models often outperform or match deep learning approaches |
| 38 | Zhao et al. [27] | 2024 | Introduces SC-MAMBA2, a state-space model for ultra-long single-cell transcriptome modeling. Pretrained on 57 M cells, it achieves state-of-the-art performance in downstream tasks |
| 39 | Waqas [57] | 2024 | Proposes a graph-based deep learning framework for multimodal cancer analysis. Highlights applications in single-cell data integration and robust network modeling |
| 40 | Li et al. [13] | 2025 | Introduces DISCO, a platform for rediscovering publicly available single-cell data. Provides tools for data retrieval, integration, and analysis, hosting over 100 M cells |
| 41 | Manchel et al. [40] | 2024 | Discusses the use of single-cell multi-omics data for systems pathophysiological modeling. Highlights opportunities for building multi-organ models and patient-specific simulations |
| 42 | Xu et al. [58] | 2024 | Applies deep learning to single-cell RNA-seq analysis of double-negative T cells. Identifies novel markers and subgroups, validated by flow cytometry |
| 43 | CZI Cell Science Program et al. [14] | 2025 | Introduces CZ CELLxGENE Discover, a platform for exploring and analyzing aggregated single-cell data. Hosts over 93 M cells and provides tools for cross-corpus analysis |
| 44 | Liu et al. [59] | 2025 | Develops CSI-GEP, a GPU-based unsupervised learning approach for recovering gene expression programs in single-cell RNA-seq data. Outperforms state-of-the-art methods in scalability and consistency |
| 45 | Schuster et al. [60] | 2024 | Introduces multiDGD, a deep generative model for multi-omics data. Demonstrates applications in data integration, reconstruction, and gene-regulatory network inference |
| 46 | Zhai et al. [61] | 2024 | Proposes scBOL, a universal framework for cell type identification in single-cell and spatial transcriptomics data. Addresses challenges in handling novel cell types and spatial organization |
| 47 | Ren et al. [62] | 2024 | Introduces twGCN, a Wasserstein graph convolutional network with attention for imbalanced scRNA-seq data. Demonstrates superior performance in discovering single-cell patterns |
| 48 | Kobayashi-Kirschvink et al. [35] | 2024 | Introduces Raman2RNA, a method for predicting single-cell RNA expression profiles using Raman microscopy. Demonstrates applications in live-cell analysis and gene expression prediction |
| 49 | Hai et al. [63] | 2024 | Introduces SCBC, a supervised single-cell classification method for ATAC-seq data. Uses batch correction to improve data consistency and classification accuracy |
| 50 | Wu et al. [64] | 2025 | Proposes small-sample learning for human health risk assessment using AI, exposome data, and systems biology. Highlights the need for efficient methods to handle limited sample sizes |
| 51 | Baek et al. [28] | 2024 | Introduces CRADLE-VAE, a counterfactual reasoning-based model for single-cell gene perturbation modeling. Improves treatment effect estimation and generative quality |
| 52 | Ghazanfar et al. [9] | 2024 | Introduces StabMap, a mosaic data integration technique for single-cell omics. Exploits non-overlapping features to stabilize data integration and improve downstream analysis |
| 53 | Xia et al. [33] | 2024 | Introduces DECIPHER, a model for generating high-fidelity disentangled cellular embeddings from large-scale spatial omics data. Demonstrates superior performance in downstream tasks |
| 54 | Niu et al. [65] | 2025 | Introduces scPairing, a variational autoencoder for single-cell multiomics data integration and generation. Enables the creation of novel multiomics datasets from unimodal data |
| 55 | Yin et al. [66] | 2024 | Introduces Scope+, an open-source architecture for single-cell RNA-seq atlases. Enables efficient access and meta-analysis of large-scale single-cell datasets |
| 56 | Palma et al. [36] | 2024 | Introduces CFGen, a flow-based generative model for multi-modal single-cell counts. Improves data augmentation and rare cell type classification |
| 57 | Gong et al. [67] | 2024 | Introduces xTrimoGene, a scalable transformer for single-cell RNA-seq data. Achieves state-of-the-art performance in downstream tasks with reduced computational costs |
| 58 | Piran et al. [68] | 2024 | Introduces biolord, a deep generative model for disentangling single-cell multi-omics data. Enables the generation of experimentally inaccessible samples and improves perturbation predictions |
| 59 | Hrovatin et al. [23] | 2024 | Proposes sysVI, a cVAE-based model for integrating single-cell RNA-seq datasets with substantial batch effects. Improves biological preservation and batch correction |
| 60 | Gao et al. [69] | 2024 | Introduces UniCoord, a joint-VAE model for building a universal coordinate system for single-cell atlases. Enables interpretable data representation and generation |
| 61 | Cao [70] | 2023 | Presents deep representation learning frameworks for single-cell sequencing data analysis. Highlights applications in confounding-free representation learning and multi-modal data integration |
| 62 | Schaefer et al. [71] | 2024 | Introduces CellWhisperer, a multimodal model for interactive single-cell RNA-seq data exploration using natural language. Enables chat-based interrogation of transcriptome data |
| 63 | Bi et al. [72] | 2024 | Surveys the application of large language models (LLMs) in biomedicine. Highlights their potential for analyzing textual data, biological sequences, and brain signals |
| 64 | Singhal et al. [37] | 2024 | Introduces BANKSY, an algorithm for unifying cell typing and tissue domain segmentation in spatial omics data. Demonstrates scalability and improved performance on diverse datasets |

Fig. 2.

Workflow for Single-Cell Omics Analysis: Antibody Staining, Data Augmentation, and Model Fine-Tuning. a Experimental Setup for Infinity Antibody Staining and Data Acquisition: Single-cell samples are stained with a “Backbone” antibody panel (15 markers, including CD45, CD3, CD19, and CD14) and unique “Infinity” antibodies (300 markers per well). Flow cytometry is used to generate a data matrix containing both Backbone and Infinity markers. Missing data are predicted using nonlinear regression and augmented to enhance dataset completeness [19]. b Computational Workflow for Data Embedding and Model Fine-Tuning: The augmented data matrix is input into a pre-trained masked-attention transformer model (scGPT) with separate embedding layers for Backbone and Infinity markers. Fine-tuning incorporates task-specific supervision and multi-modal biological data (DNA, RNA, protein). The model performs downstream tasks such as gene expression prediction, perturbation modeling, and cross-species adaptation, leveraging multi-head attention mechanisms to integrate complex biological features for advanced analysis [1]. The schematics were adapted from [19] and [1] with permission

Fig. 3.

Overview of Methodological Workflows for Multi-Omics and Spatial Transcriptomics Analysis. a Nicheformer Model for Gene Expression Integration: The Nicheformer model processes tokenized gene expression data and assay-specific markers using transformer embeddings, producing unified outputs for gene ranking and modality integration. This enables accurate predictions for gene regulatory networks (GRN) and drug response analysis [3]. b LocalCLiP for Spatial Transcriptomics: LocalCLiP utilizes a local transformer model to integrate spatial transcriptomics data, using KNN for image patch analysis and gene expression prediction, providing insights into tissue-specific molecular patterns [6]. c BioTask Executor for Task-Specific Analysis: The BioTask Executor handles various biological tasks, from zero-shot learning to GRN inference and drug response prediction, by preprocessing data, initializing pretrained models (e.g., scGPT, Geneformer), and fine-tuning them for task-specific applications [12]. d Human-scATAC-Corpus for Multi-Tissue Analysis: The Human-scATAC-Corpus dataset, with 5 million cells from 31 tissues, is used to train models for gene expression prediction and cCRE signal reconstruction, enabling comprehensive analysis of tissue-specific regulatory elements [20]. The schematics were adapted from [3, 6, 12] and [20]

Fig. 4.

Model Architecture and Self-Supervised Pretraining for Single-Cell Transcriptomics. a Transcriptome-Scale Cell Foundation Model: A pretrained model capturing upregulated and repressed gene expression across 50 million human cells, using a 20,000-gene context length and hierarchical downsampling. The transformer-based architecture supports transfer learning for perturbation modeling, employing masked language modeling to predict high read-depth expression values and representing the full human transcriptome [21]. b Self-Supervised Pretraining: The self-supervised pretraining process using scRNA-seq and spatial transcriptomics data, with a masked gene expression embedding approach. Latent gene expressions are estimated through a Gaussian mixture model prior, enabling robust feature reconstruction and learning unmeasured or masked gene data [22]. c scRNA-seq Data and Metadata Integration: Overview of integrating single-cell RNA-seq data with metadata (cell type, stage, disease) for contextualized gene expression analysis. Rank value encoding normalizes data for accurate disease and cell-type-specific modeling [17]. The schematics were adapted from [21, 22] and [17]
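
The rank value encoding named in panel c can be illustrated with a short sketch: each cell is represented by its expressed genes ordered from highest to lowest scaled expression. The function name, the per-gene scaling factors, and the toy counts are assumptions made for illustration; the exact normalization used in [17] is not specified here.

```python
def rank_value_encode(expression, gene_names, gene_scale=None, max_length=None):
    """Encode one cell as gene names ordered by descending scaled expression.

    expression: per-gene counts for a single cell.
    gene_scale: optional per-gene normalization factors (hypothetical here,
        e.g. a typical expression level of each gene across a corpus).
    Genes with zero counts are dropped from the encoding.
    """
    if gene_scale is None:
        gene_scale = [1.0] * len(expression)
    scaled = [
        (value / scale, name)
        for value, scale, name in zip(expression, gene_scale, gene_names)
        if value > 0
    ]
    # Highest scaled expression first; ties broken by gene name for determinism.
    scaled.sort(key=lambda pair: (-pair[0], pair[1]))
    tokens = [name for _, name in scaled]
    return tokens if max_length is None else tokens[:max_length]


genes = ["CD3E", "CD19", "ACTB", "MS4A1", "GAPDH"]
counts = [4, 0, 50, 0, 30]                  # toy counts for one cell
typical = [2.0, 1.0, 40.0, 1.0, 35.0]       # made-up scaling factors

raw_order = rank_value_encode(counts, genes)
scaled_order = rank_value_encode(counts, genes, gene_scale=typical)
```

With these toy values, the unscaled ordering is led by the highly expressed housekeeping genes, while the scaled ordering places CD3E first. This shows how dividing by a typical expression level can emphasize genes that distinguish a cell state.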

Cross-cutting trends and research gaps

Trends: (i) Transformers and state-space models (SSMs) dominate large-scale pretraining; (ii) cross-modal alignment and spatial niche modeling grow fastest; (iii) lightweight adapters and parameter-efficient fine-tuning gain traction for clinical scenarios. Gaps: (i) understudied modalities (spatial proteomics, metabolomics) and time-resolved data; (ii) systematic, prospective validations beyond benchmarks; (iii) sustainable model/version registries with transparent data provenance; (iv) community-agreed, biologically faithful metrics with uncertainty reporting.

Architectural innovations and pretraining strategies

Transformer-based models have become the dominant architecture in the field, with scGPT [1] setting a new benchmark by pretraining on a massive dataset of 33 million cells for multi-omic tasks. This milestone not only demonstrated the power of transformers in handling complex omics data but also laid the foundation for transfer learning frameworks [24] that extend the model’s applicability across diverse biological contexts. The transfer learning approach improves the model’s ability to adapt to new datasets with minimal retraining, enhancing its robustness and versatility in single-cell analysis.

In contrast, lightweight models such as scPlantFormer [2] and CellPatch [25] offer computational efficiency by reducing model complexity while maintaining competitive performance. scPlantFormer, specifically designed for plant single-cell omics, addresses the unique challenges posed by plant biology. Meanwhile, CellPatch introduces patch-based learning techniques that enable efficient processing of single-cell transcriptomic profiles, reducing computational costs by up to 80%. This efficiency makes such models particularly useful for large-scale analyses.

Hybrid architectures are also gaining traction. For example, scMonica’s fusion of LSTM and transformer models [26] combines the strengths of both architectures, enabling the model to capture temporal dynamics and sequential patterns in biological data. Similarly, LangCell’s cross-modal integration of language processing and transcriptomics [17] bridges the gap between computational biology and natural language processing, providing new insights into gene expression through a language-like modeling approach.

Efforts to scale these models have reached unprecedented levels, exemplified by Nicheformer’s training on 110 million cells [3], which has set new records for the size of datasets that can be processed. This breakthrough allows for robust zero-shot capabilities, enabling the model to generalize effectively to novel biological contexts without the need for extensive retraining. Similarly, SC-MAMBA2’s state-space modeling approach [27] offers a powerful framework for capturing the complexity of high-dimensional biological data.
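
To make the state-space idea concrete, the sketch below runs a diagonal linear state-space recurrence over a sequence of values, carrying a small hidden state forward one step at a time. It uses fixed, hand-picked parameters and is not SC-MAMBA2's implementation; Mamba-style models typically make these parameters input-dependent and use more efficient scan algorithms. All names and numbers are illustrative assumptions.

```python
def ssm_scan(inputs, a, b, c):
    """Run a diagonal linear state-space model over a sequence.

    State update: h_t[k] = a[k] * h_{t-1}[k] + b[k] * x_t
    Readout:      y_t    = sum_k c[k] * h_t[k]
    a, b, c are equal-length lists with one entry per hidden state channel.
    """
    state = [0.0] * len(a)
    outputs = []
    for x in inputs:
        state = [a_k * h_k + b_k * x for a_k, b_k, h_k in zip(a, b, state)]
        outputs.append(sum(c_k * h_k for c_k, h_k in zip(c, state)))
    return outputs


# Toy sequence of per-gene values; the readout uses only the first channel.
sequence = [1.0, 0.0, 0.0, 2.0]
outputs = ssm_scan(sequence, a=[0.5, 0.9], b=[1.0, 1.0], c=[1.0, 0.0])
# outputs == [1.0, 0.5, 0.25, 2.125]
```

Each step touches only the fixed-size state, so the cost of the recurrence grows linearly with sequence length, which is one reason this family of models is attractive for very long gene sequences.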

To address specific biological challenges, specialized pretraining frameworks have emerged. EpiAgent [20] focuses on epigenomics, enabling the capture of regulatory mechanisms at the epigenetic level. CRADLE-VAE [28] is tailored to perturbation modeling, learning how cellular states respond to experimental perturbations. Moreover, cross-species adaptation frameworks [2] are increasingly important for transferring insights from model organisms to humans, ensuring the broad applicability of these models across species.

Systematic comparisons [29] of these models have underscored their strengths and weaknesses, particularly in how they capture cellular heterogeneity, further driving the development of more accurate and generalizable models. Insights from methodological surveys [30, 31] and theoretical foundations [24, 32] continue to inform the design of more effective and interpretable models.

Multimodal and spatial integration

Recent advancements in spatial context modeling are exemplified by Nicheformer’s integration of 53 million spatially resolved cells [3], providing an unprecedented ability to model cellular architecture and interactions within tissues. The inclusion of spatial context is critical for understanding how cellular functions and interactions are organized in complex biological systems. DECIPHER’s disentangled embeddings [33] further enhance spatial analysis by enabling a clearer identification of distinct cellular structures and interactions within tissue samples.

PathOmCLIP [6] represents a significant innovation in multimodal integration, aligning histological features with transcriptomic data across multiple tumor types. This alignment provides a deeper understanding of the molecular basis of cancer by linking tissue morphology with gene expression, offering a powerful tool for cancer diagnosis and therapeutic targeting.
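
The alignment idea can be illustrated with a generic CLIP-style objective. The sketch below is not PathOmCLIP's published loss, which the text above does not specify; it assumes paired histology and transcriptome embeddings where row i of one modality matches row i of the other, and the function names and temperature value are illustrative.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def symmetric_info_nce(image_emb, rna_emb, temperature=0.1):
    """Mean symmetric InfoNCE loss; row i of each modality is a matched pair."""
    a = [l2_normalize(v) for v in image_emb]
    b = [l2_normalize(v) for v in rna_emb]
    n = len(a)
    # Cosine similarities scaled by the temperature act as logits.
    logits = [
        [sum(x * y for x, y in zip(a[i], b[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy_on_diagonal(rows):
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for a stable log-sum-exp
            log_z = m + math.log(sum(math.exp(v - m) for v in row))
            total += log_z - row[i]
        return total / len(rows)

    transposed = [list(col) for col in zip(*logits)]
    # Average the image-to-RNA and RNA-to-image directions.
    return 0.5 * (cross_entropy_on_diagonal(logits) + cross_entropy_on_diagonal(transposed))


# Toy embeddings (hypothetical): correctly paired versus deliberately mispaired.
paired_image = [[1.0, 0.0], [0.0, 1.0]]
paired_rna = [[2.0, 0.0], [0.0, 3.0]]
loss_aligned = symmetric_info_nce(paired_image, paired_rna)
loss_shuffled = symmetric_info_nce(paired_image, paired_rna[::-1])
```

The loss is low when matched pairs are the most similar entries in each row and column, and high when pairings are scrambled, which is why this family of objectives depends on reliable co-registration.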

Multimodal frameworks are expanding rapidly. TMO-Net [11] facilitates pan-cancer integration by combining multi-omic data across diverse cancer types, creating a unified molecular profile for improved diagnostics and therapeutic decision-making. Vec3D [34] offers an explainable framework for visualizing molecular landscapes in three dimensions, providing clearer insights into the complex spatial arrangements of cells. GIST [7] further enhances this by enabling high-resolution 3D profiling of cellular structures, facilitating a deeper understanding of cellular organization in health and disease.

Comparative evidence across heterogeneous datasets indicates that no single integration strategy dominates: contrastive alignment excels when paired data exist (e.g., histology–transcriptome with stable co-registration), tensor/low-rank fusion benefits partially matched modalities with moderate sample sizes, whereas graph-based alignment or optimal-transport schemes are more robust under severe feature mismatch or sparse modalities (e.g., scATAC-seq). Reported failure modes include negative transfer under severe domain shift, inflated performance on in vitro cell lines with shared backgrounds, and loss of rare-cell structure when optimizing purely for batch mixing. We therefore recommend mixed metrics (Section “Community Benchmarks...”) and stratified evaluations by tissue, platform, and rarity.
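
A stratified evaluation of the kind recommended above can be sketched in a few lines. The record keys (`tissue`, `platform`, `truth`, `predicted`) and the toy records are assumptions made for illustration, not a schema from any cited benchmark.

```python
from collections import defaultdict


def stratified_accuracy(records, keys=("tissue", "platform")):
    """Return annotation accuracy per stratum defined by the given keys."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for record in records:
        stratum = tuple(record[k] for k in keys)
        totals[stratum] += 1
        hits[stratum] += int(record["predicted"] == record["truth"])
    return {stratum: hits[stratum] / totals[stratum] for stratum in totals}


# Hypothetical predictions: pooled accuracy is 0.75, yet one stratum fails entirely.
records = [
    {"tissue": "lung", "platform": "10x", "truth": "T cell", "predicted": "T cell"},
    {"tissue": "lung", "platform": "10x", "truth": "B cell", "predicted": "B cell"},
    {"tissue": "lung", "platform": "10x", "truth": "T cell", "predicted": "T cell"},
    {"tissue": "gut", "platform": "smartseq", "truth": "T cell", "predicted": "B cell"},
]
per_stratum = stratified_accuracy(records)
pooled = sum(r["predicted"] == r["truth"] for r in records) / len(records)
```

Reporting `per_stratum` alongside `pooled` exposes failures that a single pooled number would hide; a rarity stratum can be added by passing a cell-abundance bin as an extra key.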

Cross-modal generation techniques are also advancing, with models such as Raman2RNA [35] enabling the prediction of gene expression profiles from microscopy images. This ability to derive molecular insights from non-transcriptomic data opens new avenues for integrating diverse biological modalities. Similarly, CFGen’s multi-modal count synthesis [36] allows for the synthesis of gene expression data from various modalities, enhancing the robustness of single-cell analyses.

Spatial analysis tools such as BANKSY’s unified cell typing [37] and Proximogram’s disease classification [38] are transforming how we identify cell types and classify diseases based on spatially resolved data. These tools integrate cellular identity with spatial information to improve diagnostic accuracy and facilitate a deeper understanding of tissue architecture and disease progression.

To address integration challenges, StabMap’s mosaic techniques [9] and sysVI’s batch correction methods [23] provide solutions to harmonize data from different experimental conditions, ensuring that multi-omic and spatial data are consistently represented across studies.

The growing importance of clinical translation is evident in roadmaps for applying these advances in clinical settings [39] and the development of multi-organ modeling frameworks [40], which aim to extend these technologies to organismal-level analyses and improve clinical diagnostics and treatment strategies.

Beyond listing models, we systematically compare representative scFMs on scalability, cross-task generalization, bias sources, and openness. Table 3 contrasts scGPT, Nicheformer, scPlantFormer, and EpiAgent, highlighting training scale, supported modalities, zero-shot capabilities, cross-species transfer, and whether pretraining code/data are openly released. While some papers report task-specific metrics (e.g., scPlantFormer’s 92% cross-species annotation accuracy in plants), heterogeneous datasets and objectives caution against declaring a single “best” model. We therefore emphasize task- and data-matched evaluation with uncertainty reporting.
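
Uncertainty reporting for a task metric can be as simple as a percentile bootstrap. The following is a minimal sketch under the assumption that per-cell correctness is available as a list of 0/1 values; the data are a toy example and do not reproduce any reported result.

```python
import random


def bootstrap_accuracy_ci(correct, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for accuracy from per-cell 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(correct)
    stats = []
    for _ in range(n_boot):
        # Resample cells with replacement and recompute accuracy.
        resample = [correct[rng.randrange(n)] for _ in range(n)]
        stats.append(sum(resample) / n)
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return sum(correct) / n, lower, upper


# Toy outcomes (hypothetical): 46 of 50 held-out cells annotated correctly.
outcomes = [1] * 46 + [0] * 4
point, lower, upper = bootstrap_accuracy_ci(outcomes)
```

Resampling individual cells treats them as independent; when cells come from a few donors or batches, resampling at the donor or batch level gives a more honest interval.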

Table 3.

Representative scFMs compared on scale, modalities, generalization, and openness

| Model | Pretraining scale | Modalities | Notable capabilities | Bias risks (reported) | Openness |
| --- | --- | --- | --- | --- | --- |
| scGPT [1] | 33 M+ cells | RNA (+ multi-omics via adapters) | Zero/few-shot annotation; perturbation modeling | Batch propagation in transfer; dataset imbalance | Code + weights public; pretraining corpus partially aggregated |
| Nicheformer [3] | 57 M dissociated + 53 M spatial | RNA + spatial | Spatial niche reasoning; cross-modality alignment | Tissue/histology imbalance; spatial sampling bias | Code public; large-scale corpora curated |
| scPlantFormer [2] | 1 M cells (plants) | RNA (plants) | Cross-species annotation (reported 92%) | Species shift; gene orthology mapping | Code public; plant corpora subsets |
| EpiAgent [20] | 5 M cells | ATAC/epigenome | cCRE reconstruction; zero-shot epigenomic transfer | Peak calling/platform bias | Code public; weights available |

Translational impact and clinical readiness

Single-cell multimodal AI is beginning to yield translational insights, yet most reports remain preclinical or retrospective. We summarize representative case studies in Table 4, covering biomarker nomination from spatial niches, integrative histology–omics risk stratification, and perturbation-informed target prioritization. Notably, only a subset proceeds to prospective wet-lab or clinical validation, reflecting costs, data-access constraints, and regulatory hurdles.

Table 4.

Selected translational case vignettes linking multimodal single-cell AI to biomarkers/targets and validation status

| Use case | AI/methodological core | Validation and notes |
| --- | --- | --- |
| Histology–transcriptome risk stratification | Contrastive alignment (e.g., PathOmCLIP) with spatial priors | Retrospective validation across tumor types; prospective trials needed; potential domain shift across scanners |
| Spatial niche biomarkers | Spatial transformers/graph models (e.g., Nicheformer) to map microenvironmental cues | Orthogonal IHC/ISH validation possible; generalization to inflamed/treated tissues pending |
| Perturbation-informed targets | Foundation models fine-tuned for perturbation prediction; counterfactual VAEs | Benchmarks promising; requires CRISPR/Perturb-seq validation and toxicity screens |
| Noncoding variant prioritization | scFM-informed regulatory priors integrated with WGS (e.g., CWAS-Plus pipeline) | Supports cell-type–specific hypotheses; wet-lab enhancer perturbation remains a bottleneck |

Regulatory pathways for AI-enabled single-cell diagnostics/prognostics will likely follow FDA GMLP and IMDRF SaMD guidance, requiring clear intended use, locked models or controlled updates, uncertainty quantification, bias assessment across subpopulations, and real-world monitoring. We outline an actionable path: (i) prospective pre-registration of analysis plans; (ii) multi-center external validation with standardized SOPs; (iii) integration with EHR/pathology workflows; (iv) triangulation with perturbation assays (CRISPR/Perturb-seq) to close the mechanistic loop.

Importantly, the interpretation of challenging noncoding WGS variants can benefit from cell-type–specific priors learned by scFMs. Recent methods such as CWAS-Plus (PMID: 38966948) integrate single-cell functional maps with WGS to pinpoint regulatory variants; scFMs can enrich this pipeline by providing transferable, cell-state–aware embeddings and enhancer–gene linkage priors to prioritize variants in relevant cellular contexts.

Interpretability, optimization, and ecosystem development

The push for greater model transparency has led to the development of frameworks for molecular regulator identification [41], Visible Neural Networks [73], and biolord’s disentangled representations [68], which aim to make model predictions more interpretable and biologically meaningful. These advances are crucial for building trust in AI-driven analyses and for bridging the gap between computational predictions and experimental biology. For interpretability, disentangled generative models (e.g., biolord) can align latent factors with known pathways, enabling hypothesis-driven perturbations; in immune profiling, factors capturing cytotoxic vs. exhaustion programs can be validated via marker panels and functional assays. Static embeddings, however, struggle with dynamics (activation, differentiation, tumor evolution). Hybrid approaches that couple scFMs with neural ODEs/controlled differential equations and RNA-velocity priors may better capture time-varying processes, offering causal hypotheses testable by longitudinal sampling. Table 5 compares major computational ecosystems on interoperability, scalability, accessibility, and sustainability to guide platform selection and long-term maintenance.

Table 5.

Computational ecosystems compared on interoperability, scalability, accessibility, and sustainability

| Platform | Interoperability | Scalability | Accessibility | Sustainability/governance |
| --- | --- | --- | --- | --- |
| CZ CELLxGENE Discover [14] | AnnData/h5ad; APIs | Large atlas hosting; cloud-native | Web UI + programmatic | Philanthropic + open-source; strong community |
| DISCO [13] | Cross-study integration tools | 100 M+ cells; batch-aware | Portal + SDK | Consortium-led; data-curation focus |
| BioLLM [12] | Unified interface for scFMs | Depends on backend models | Python SDK + recipes | Research-led; benchmarking emphasis |
| scGNN+ [15] | Graph ML + LLM aids | GPU-accelerated graphs | Tutorials + auto code | Academic open-source; education-friendly |
| Scope+ [66] | Atlas architecture | Distributed access | Meta-analysis tooling | Open-source; atlas sustainability |

Enhancing computational efficiency is a key area of focus, with innovations such as drug-conditional adapters [18] and heuristic patching [25] optimizing model performance while minimizing resource requirements. These methods make large-scale single-cell analyses more feasible, enabling faster and more efficient processing of complex datasets. Furthermore, GPU-accelerated techniques [59] are significantly reducing the time required for model training, thereby accelerating scientific discovery.

The development of integrated computational ecosystems is exemplified by platforms like ChatCell [55] and CellWhisperer [71], which enable natural language interaction with single-cell data, making these powerful tools more accessible to biologists. Additionally, code optimization platforms such as scGNN+ [15] are streamlining the development and deployment of graph-based models for single-cell data analysis.

Data discovery platforms like DISCO [13] and CZ CELLxGENE [14] are enhancing the accessibility of single-cell datasets, facilitating the integration of data across studies and enabling new insights. Standardization efforts, such as BioLLM’s universal interface [12], are establishing common frameworks for sharing and utilizing single-cell data, promoting interoperability across platforms.

Domain-specific applications are flourishing, with significant progress in areas such as liver disease [74], brain metastasis [75], and plant biology [76], supported by foundational resources like the Human Cell Atlas [16, 77]. These applications demonstrate the growing potential of single-cell technologies in both human and non-human biological research [78, 79].

As the field advances, challenges remain in developing unified evaluation metrics [31], ensuring biologically meaningful interpretability [71], and establishing sustainable model-sharing infrastructure [22]. Balancing innovative model architectures with biological validation is critical for ensuring that these computational tools can translate into actionable mechanistic insights and clinical applications. The integration of foundation models with emerging single-cell technologies promises to unlock new dimensions of cellular complexity, deepening our understanding of developmental, physiological, and pathological processes.

Challenges

The application of machine learning (ML) to single-cell omics has opened a new era of discovery, offering unprecedented potential to decode the complexity of cellular systems. However, this promise is still held back by persistent challenges that hinder the broader integration of these techniques into clinical and biomedical research. These challenges, which span data quality, model generalization, interpretability, and the integration of multimodal datasets, must be addressed systematically if ML and advanced computational ecosystems are to be used to full effect.

Data quality and biological relevance

A critical challenge in applying ML to single-cell omics is the variability and noise inherent in the data. Despite substantial advancements in sequencing technologies and data acquisition methods, the lack of standardization across experimental protocols remains a significant obstacle. Variability in sample preparation, platform-specific biases, and the absence of universal data preprocessing workflows contribute to inconsistent data quality. In precision medicine and clinical applications, where minute variations in cellular states can determine therapeutic outcomes, these inconsistencies are particularly problematic. In disease studies, such as cancer immunology, where subtle differences in immune cell populations can profoundly impact treatment decisions, unreliable data introduce significant risks of misinterpretation, potentially leading to misguided clinical strategies. The reliance on ML models to correct or “learn through” this noise further amplifies the risks, as these algorithms can inadvertently emphasize spurious correlations, compromising the biological fidelity of the resulting models.

Additionally, the biological complexity of human samples exacerbates these challenges. Inter-individual variability, genetic diversity, and disease heterogeneity introduce layers of complexity that are often poorly represented in the typically small, homogeneous datasets on which many models are trained. ML models trained on such data may perform well in controlled environments but often fail to generalize across the broader, more diverse clinical population. This limitation underscores the need for large-scale, well-annotated, and representative datasets that accurately capture the biological diversity seen in real-world clinical settings.

Overfitting and lack of generalizability

Overfitting remains a pervasive issue in the application of ML to single-cell omics, particularly within clinical research. The challenge of working with small, highly specific datasets, a common limitation in medical studies due to recruitment difficulties or ethical concerns, heightens the risk of models memorizing noise rather than learning robust, generalizable patterns. This risk is particularly pronounced when models are trained on patient cohorts that fail to represent the genetic, demographic, and environmental diversity of broader populations. In cancer research, for example, models trained on immune response data from a single cohort may identify predictive biomarkers that fail to replicate in independent patient groups, leading to erroneous conclusions about their clinical utility. The resulting biomarkers often lack the predictive power needed for personalized medicine, limiting their potential to inform individualized therapeutic strategies.

Furthermore, many current ML approaches treat biological systems as static, with little regard for their dynamic nature. Diseases such as cancer or autoimmune disorders, which involve rapid and unpredictable cellular changes, challenge the validity of models trained on snapshot datasets. This static modeling approach does not account for the temporal fluctuations in cellular states, impeding the ability of ML models to accurately predict disease progression or therapeutic responses. More sophisticated models that incorporate temporal dynamics and account for cellular plasticity will be necessary for future breakthroughs in personalized medicine.

Interpretability crisis in ML models

A central limitation in the application of ML to single-cell data is the opacity of many deep learning models. In basic medical research, where the goal is often to elucidate the mechanisms underlying disease or cellular behavior, the interpretability of ML models is crucial. For example, in autoimmune diseases, where small changes in immune cell populations can drive pathology, understanding how a model arrives at its conclusions is vital for hypothesis generation and further experimentation. However, deep learning models, especially those used in flow cytometry analysis, often operate as “black boxes,” offering little transparency into the reasoning behind their predictions. This lack of interpretability undermines the scientific value of these models, as researchers and clinicians cannot discern which biological features are driving the model’s outcomes. In clinical contexts, such as predicting patient responses to immunotherapies, the inability to explain a model’s decision-making process becomes a safety concern. For ML to be safely integrated into clinical decision support systems, models must be interpretable, with clear explanations of how features contribute to predictions, ensuring that healthcare professionals can trust the underlying biological rationale.

Overpromising and under-delivering in ML applications

Despite the transformative potential of ML, the field often faces inflated expectations, which can hinder progress by obscuring the limitations of current techniques. The widespread excitement surrounding dimensionality-reduction and visualization methods such as t-SNE and UMAP, which are frequently employed alongside unsupervised clustering to explore cellular heterogeneity, is a prime example of overpromising. These methods, while useful, are far from foolproof; their two-dimensional embeddings often fail to preserve meaningful biological relationships present in the high-dimensional data, so that apparent clusters can be misleading or biologically irrelevant. As a result, researchers may pursue false leads, wasting time and resources on findings that do not hold up under closer biological scrutiny. Moreover, the growing reliance on ML algorithms to replace traditional expertise in experimental design and data interpretation risks oversimplifying the complex nature of single-cell analysis. Flow cytometry and omics data interpretation require deep biological knowledge and nuanced understanding, and no algorithm can replace the critical role played by skilled researchers in ensuring experimental rigor and biological relevance.

Integration of multimodal data for disease mechanisms

A transformative opportunity for ML in single-cell omics lies in the integration of multimodal datasets, such as combining flow cytometry data with genomics, transcriptomics, and proteomics. This integration, essential for understanding complex disease mechanisms, remains in its infancy, particularly when it comes to harmonizing data from different platforms. The integration of high-dimensional flow cytometry with other omics data, such as gene expression or epigenetic profiles, is fraught with challenges. These datasets, each generated with varying levels of noise and on different scales, require sophisticated harmonization techniques to enable meaningful ML analysis. Without proper integration, critical biological insights are often lost. In addition, incorporating patient-specific factors such as genetic mutations and environmental exposures into integrated models remains a significant hurdle, particularly when dealing with the heterogeneity seen in diseases like cancer or autoimmune disorders. As such, meaningful multimodal integration will be essential to advancing personalized medicine but requires substantial progress in computational tools, data standardization, and model robustness.

Lack of standardization and benchmarking

A lack of standardized protocols for data analysis and benchmarking of ML models in single-cell omics represents a significant barrier to progress. The absence of universally accepted evaluation frameworks leads to fragmented research, with different groups claiming superior performance for their models without a clear method for comparison. In clinical research, where reproducibility and consistency are paramount, this lack of standardization undermines the credibility of ML approaches. Models that are touted as state-of-the-art often fail to perform under real-world conditions, especially when applied to larger, more diverse clinical cohorts.

Failure modes of batch correction under transfer include: (i) negative transfer when source-target biology diverges; (ii) overmixing that erodes rare-cell structure; (iii) adversarial leakage where batch cues persist in higher-order interactions. Solutions include adversarial debiasing with biology-preserving constraints, invariant risk minimization or meta-learning to select stable features, and adapter-based finetuning that preserves source representations while minimizing batch-sensitive drift. We also recommend stratified reporting by tissue/platform and rarity-aware metrics (see Table 6).

To move forward, the field requires a concerted effort to establish standardized protocols for model training, evaluation, and validation, ensuring that the claims of model efficacy are robust, reproducible, and clinically relevant.

Table 6.

Common metrics: purpose, typical use, known biases, and scalability notes

| Metric | Purpose/use | Known biases/limitations | Scalability notes |
| --- | --- | --- | --- |
| ARI/NMI | Clustering vs. ground truth | Sensitive to class imbalance; favors many small clusters | Contingency-table based; scales well |
| Silhouette/ASW (cell/batch) | Separation/cohesion; batch vs. biology | ASW-batch may reward overmixing; silhouette poor in non-convex manifolds | Needs distance graph; scalable with ANN |
| kBET/iLISI | Batch mixing | kBET sampling unstable in tiny clusters; iLISI can mask biology if overmixed | Subsampling advisable at atlas scale |
| Graph connectivity | Preservation of cell-type neighborhoods | Can favor overly connected graphs; depends on kNN construction | kNN graph scales with ANN/IVF-PQ |
| Label transfer accuracy | Biological conservation across datasets | Depends on reference bias; inflated on in vitro lines | Fast with nearest-neighbor inference |
| Trajectory correlation (Kendall’s $\tau$) | Pseudotime/lineage preservation | Requires reliable trajectory; sensitive to branching | Uses reduced graphs; scalable |
| Perturbation $R^2$/AUROC | Counterfactual prediction | Inflated if splits leak donors/batches | Requires careful split; scalable minibatches |
| Pearson’s delta | Batch removal | Can be inflated on homogeneous cell lines | Cheap; interpret with caution |
| Cross-view retrieval (Recall@k) | Multimodal alignment fidelity | Sensitive to class prevalence and pairing noise | ANN-based retrieval scales |
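
To make the "contingency-table based" note for ARI concrete, the sketch below computes the adjusted Rand index from pair counts using only the Python standard library. It is an illustrative reimplementation of the standard formula, not code from any of the benchmarked tools.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index computed from contingency-table pair counts."""
    n = len(labels_true)
    if n < 2:
        return 1.0
    # Pairs of cells that share both a true label and a predicted cluster.
    same_both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    same_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    same_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = same_true * same_pred / comb(n, 2)
    max_index = 0.5 * (same_true + same_pred)
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster on both sides
    return (same_both - expected) / (max_index - expected)


# Cluster names are arbitrary, so a pure relabeling scores 1.0.
ari_relabeled = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# A partition that cuts across the truth scores below zero.
ari_crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

Because only label counts are needed, the cost grows with the number of cells and distinct label pairs rather than with pairwise distances, which is why ARI remains cheap at atlas scale.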

Community benchmarks and reporting standards for single-cell AI

A challenge-style benchmark suite

We propose a community “SCFM-DREAM” suite with task-specific tracks: (i) batch integration (hold-out labs); (ii) cross-modal alignment (paired/mosaic); (iii) zero-shot cell annotation (novel tissues/species); (iv) perturbation prediction (unseen compounds/genes); (v) spatial domain segmentation and niche reasoning; and (vi) cross-species transfer. Each track includes standardized data partitions, blinded test servers, and biological sanity checks.
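
A hold-out-lab partition for track (i) could be constructed as follows. This is a sketch under the assumption that each cell carries a lab (or donor) identifier; the function name and the lab labels are hypothetical.

```python
import random


def group_holdout_split(groups, test_fraction=0.25, seed=0):
    """Split cell indices so that no group (lab, donor) spans train and test."""
    unique_groups = sorted(set(groups))
    rng = random.Random(seed)
    rng.shuffle(unique_groups)
    n_test = max(1, round(test_fraction * len(unique_groups)))
    test_groups = set(unique_groups[:n_test])
    train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
    test_idx = [i for i, g in enumerate(groups) if g in test_groups]
    return train_idx, test_idx, test_groups


# Hypothetical lab identifiers, one per cell.
labs = ["lab_a", "lab_a", "lab_b", "lab_b", "lab_c", "lab_d", "lab_d", "lab_d"]
train_idx, test_idx, held_out = group_holdout_split(labs)
```

Splitting by group rather than by cell is what prevents the donor or batch leakage that can inflate perturbation and annotation scores.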

MINASCO: a minimal reporting standard

We suggest MINASCO (Minimum Information for AI in Single-Cell Omics) covering: data accession and preprocessing; model version/parameters/seeds; compute budget; training/evaluation splits; metrics with confidence intervals and ablations; bias and subgroup analyses; code/weights and inference scripts; and long-term model cards with provenance. Consortia such as the Human Cell Atlas can steward data/metadata schemas and model registries.
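
One way such a checklist could be made machine-checkable is a small record type with a completeness check. The field names below are our illustrative mapping of the items listed above and are not part of any published MINASCO schema; all values are placeholders.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class MinascoRecord:
    """Hypothetical container for the reporting items proposed above."""
    data_accessions: list = field(default_factory=list)
    preprocessing: str = ""
    model_version: str = ""
    hyperparameters: dict = field(default_factory=dict)
    random_seeds: list = field(default_factory=list)
    compute_budget: str = ""
    split_description: str = ""
    metrics_with_ci: dict = field(default_factory=dict)
    ablations: list = field(default_factory=list)
    subgroup_analyses: list = field(default_factory=list)
    code_and_weights: str = ""
    model_card: str = ""

    def missing_fields(self):
        """Names of items that are still empty."""
        return [name for name, value in asdict(self).items() if not value]


# Placeholder entries only; seeds and later items are deliberately left out.
draft = MinascoRecord(
    data_accessions=["ACCESSION-PLACEHOLDER"],
    preprocessing="log-normalized counts, 2000 highly variable genes",
    model_version="example-model v0.1",
    hyperparameters={"layers": 12},
)
todo = draft.missing_fields()
```

A registry or journal submission system could reject records for which `missing_fields()` is non-empty, turning the checklist into an enforceable gate.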

Metrics: purposes, biases, and scalability

Mixed metrics are essential: batch mixing (iLISI/kBET/graph connectivity), biological conservation (NMI/ARI/ASW-cell), structure preservation (KNN-purity, trajectory correlations), cross-modal fidelity (cross-view retrieval), and task-specific scores (label transfer accuracy, perturbation $R^2$/AUROC). Importantly, certain metrics can be biased toward specific data types (e.g., Pearson’s delta inflated on homogeneous cell-line panels) or become costly at atlas scale; therefore, we recommend scalable approximations and uncertainty reporting with stratification (by tissue/platform/rarity).
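
The intuition behind LISI-style batch-mixing scores can be shown with a simplified variant. Published iLISI weights neighbors with a perplexity-calibrated kernel; the sketch below instead assumes uniform weights over the k nearest neighbors and a brute-force neighbor search, so it is a teaching approximation only.

```python
import math
from collections import Counter


def knn_inverse_simpson(points, batches, k=3):
    """Per-cell inverse Simpson index of batch labels among k nearest neighbors."""
    scores = []
    for i, p in enumerate(points):
        # Brute-force neighbor search; an ANN index would replace this at scale.
        ranked = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        neighbor_batches = [batches[j] for _, j in ranked[:k]]
        size = len(neighbor_batches)
        simpson = sum((c / size) ** 2 for c in Counter(neighbor_batches).values())
        scores.append(1.0 / simpson)
    return scores


# Two tight groups of four cells in a toy 2-D embedding.
coords = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
unmixed = knn_inverse_simpson(coords, ["A"] * 4 + ["B"] * 4)
mixed = knn_inverse_simpson(coords, ["A", "B"] * 4)
```

A score of 1 means a cell sees a single batch in its neighborhood, and the score approaches the number of batches under perfect mixing. A high score alone does not show that biology was preserved, which is why it should be read alongside conservation metrics.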

Future directions

Next-generation deep learning architectures

The rapid evolution of computational frameworks for high-dimensional single-cell cytometry necessitates the development of architectures that transcend conventional machine learning paradigms. Emerging hybrid models, which integrate hierarchical attention mechanisms, geometric deep learning, and transformer-based representations, show considerable promise in resolving ambiguities within overlapping cell populations and detecting ultra-rare phenotypes with unprecedented single-cell precision. Notably, the integration of physics-informed neural networks presents an exciting opportunity to address enduring challenges, such as batch effect correction and technical variability, by embedding domain-specific constraints derived from cytometer physics and antibody-antigen kinetics.

Self-supervised contrastive learning frameworks, particularly those leveraging multi-instance learning strategies, are poised to revolutionize the analysis of unannotated datasets by disentangling biological signals from experimental artifacts, free from manual gating biases. These advanced architectures, in combination with innovative regularization techniques and adaptive optimization strategies, offer remarkable potential for capturing complex cellular hierarchies and developmental trajectories. Furthermore, the integration of curriculum learning approaches stands to enhance model robustness, providing the capacity for generalization across diverse experimental conditions and data sources.

Virtual cell and multi-scale biology simulation

A pivotal frontier in the future of single-cell analysis lies in the development of virtual cell systems coupled with multi-scale biological simulations. The concept of the virtual cell, whereby each individual cell is modeled as a dynamic, self-contained computational entity, enables the simulation of cellular behaviors and interactions at an unprecedented level of granularity. By integrating molecular, cellular, and tissue-level models, this approach can facilitate the exploration of complex biological phenomena, such as cellular differentiation, immune response, and tumor progression, in a highly controlled computational environment.

Multi-scale biology simulation, incorporating cellular dynamics, tissue architecture, and organ systems, offers the potential to predict how cellular processes at the single-cell level influence overall organismal function. The integration of high-throughput single-cell omics data with advanced computational models will allow for the simulation of cellular ecosystems that mirror the complexity of real biological systems. These simulations, powered by exascale computing platforms and cutting-edge machine learning techniques, could provide invaluable insights into disease mechanisms, therapeutic interventions, and personalized medicine strategies. This area is poised to become a transformative force, facilitating the development of predictive models that integrate molecular dynamics, gene regulation, and spatial-temporal organization within living systems.

Exascale computing and latency-optimized workflows

The exponential growth of cytometry datasets necessitates a paradigm shift toward exascale-compatible computational strategies. Future platforms must incorporate sparsity-aware algorithms and quantized neural networks to efficiently process billion-cell datasets with sublinear memory scaling. Heterogeneous computing architectures, which combine GPU-accelerated tensor decomposition with FPGA-based streaming analytics, will enable real-time classification at cytometer acquisition speeds (<1 ms/cell), essential for closed-loop adaptive sampling in rare event detection.
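
As one concrete instance of quantization, the sketch below applies symmetric per-tensor int8 quantization to a list of weights. The choice of scheme is our assumption; the text does not prescribe a particular method, and production systems typically use per-channel scales and calibration data.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization of floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Map the stored integers back to approximate float weights."""
    return [q * scale for q in quantized]


# Toy weights (hypothetical values).
weights = [0.02, -1.27, 0.5, 0.731, -0.004]
q_weights, q_scale = quantize_int8(weights)
restored = dequantize(q_weights, q_scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight is stored in one byte instead of four, and the rounding error per weight is bounded by half the scale, which is the trade-off between memory footprint and precision that such schemes make.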

The implementation of distributed computing frameworks and optimization of computational resources will be crucial to managing the growing complexity of cytometry data. Federated learning approaches with differential privacy guarantees could facilitate collaborative analysis of distributed clinical datasets, all while ensuring patient confidentiality. The development of efficient load balancing strategies and fault-tolerant processing pipelines will ensure consistent performance in high-throughput environments.
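
The core aggregation step of federated learning with differential-privacy-style protection can be sketched as clipped averaging with Gaussian noise. This is a simplified illustration: the function names and parameter values are assumptions, and no privacy accounting is performed, so the code by itself does not certify any privacy budget.

```python
import math
import random


def clip_update(update, max_norm):
    """Rescale a site's update so its L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(u * u for u in update))
    factor = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [u * factor for u in update]


def private_federated_average(site_updates, max_norm=1.0, noise_multiplier=1.0, seed=0):
    """Average clipped per-site updates after adding Gaussian noise to their sum."""
    rng = random.Random(seed)
    clipped = [clip_update(u, max_norm) for u in site_updates]
    n_sites, dim = len(clipped), len(clipped[0])
    summed = [sum(u[d] for u in clipped) for d in range(dim)]
    sigma = noise_multiplier * max_norm  # noise scaled to the clipping bound
    return [(s + rng.gauss(0.0, sigma)) / n_sites for s in summed]


# Hypothetical model updates from two hospitals; raw data never leaves a site.
updates = [[3.0, 4.0], [0.3, 0.4]]
noiseless = private_federated_average(updates, noise_multiplier=0.0)
noisy = private_federated_average(updates, noise_multiplier=1.0)
```

Clipping bounds how much any single site can influence the aggregate, and the noise masks individual contributions; translating `noise_multiplier` into a formal guarantee requires a dedicated privacy accountant.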

Cross-modal data fusion and causal representation learning

A transformative opportunity exists in the development of unified embedding spaces that reconcile flow cytometry with CyTOF, scRNA-seq, and spatial proteomics using multimodal foundation models. Advanced manifold alignment techniques, such as optimal transport-based integration and contrastive domain adaptation, could overcome platform-specific signal nonlinearities and detector saturation effects. The incorporation of causal discovery algorithms offers the potential to identify invariant cellular signatures across diverse experimental conditions, distinguishing technical confounders from biologically meaningful variations.
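
Optimal transport-based integration can be illustrated with entropic regularization solved by Sinkhorn iterations. The sketch assumes a precomputed cost matrix between cells of two platforms; the toy one-dimensional "marker intensities", the regularization strength, and the function name are illustrative choices.

```python
import math


def sinkhorn_plan(cost, row_mass, col_mass, epsilon=0.5, n_iter=200):
    """Entropy-regularized transport plan obtained by Sinkhorn scaling."""
    n, m = len(cost), len(cost[0])
    kernel = [[math.exp(-cost[i][j] / epsilon) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iter):
        # Alternately rescale rows and columns to match the target marginals.
        u = [row_mass[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [col_mass[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


# Toy marker intensities for two cells per platform, listed in a different order.
platform_a = [0.0, 1.0]
platform_b = [1.05, 0.05]
cost = [[(a - b) ** 2 for b in platform_b] for a in platform_a]
plan = sinkhorn_plan(cost, [0.5, 0.5], [0.5, 0.5])
```

Most of the transported mass links each cell to its closest counterpart on the other platform, while the entropy term keeps the coupling soft; larger `epsilon` values blur the matching, and smaller ones sharpen it at the cost of slower convergence.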

The integration of multi-view learning strategies and transfer learning mechanisms will further enhance comprehensive cellular phenotyping across complementary single-cell technologies. The development of robust batch correction methods and harmonization protocols will be critical for establishing reliable cross-platform comparisons. These advancements will facilitate the construction of comprehensive cellular atlases that leverage the strengths of each modality while compensating for their respective limitations.

Clinical-grade standardization and regulatory-compliant AI

Translational applications of single-cell technologies will require the establishment of rigorous validation frameworks that align with IVDR/FDA-AI guidelines, integrating uncertainty-aware classification and explainability-by-design principles. The development of certified reference materials and synthetic digital twins of cytometry data could standardize benchmarking across institutions, ensuring consistency and reproducibility. Automated quality metrics, including those quantifying signal-to-noise ratios, spillover propagation errors, and classifier calibration drift, will be essential for clinical deployment.
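
Classifier calibration drift can be tracked with the expected calibration error (ECE) computed on successive batches. The following minimal sketch assumes that per-sample confidences and correctness flags are available; the equal-width binning and the toy batches are illustrative.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean gap between confidence and accuracy over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(n_bins - 1, int(conf * n_bins))
        bins[idx].append((conf, hit))
    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


# Hypothetical batches: same stated confidence, lower accuracy after drift.
reference_ece = expected_calibration_error([0.9] * 10, [1] * 9 + [0])
drifted_ece = expected_calibration_error([0.9] * 10, [1] * 6 + [0] * 4)
calibration_drift = drifted_ece - reference_ece
```

Monitoring `calibration_drift` against a predefined tolerance gives a simple trigger for recalibration or review before predictions reach clinical users.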

The establishment of comprehensive validation protocols and performance standards will pave the way for regulatory approval and clinical adoption. Continuous monitoring systems and automated quality control pipelines will be critical to maintaining consistent performance in clinical environments. Standardized reporting frameworks and documentation procedures will support transparency and reproducibility in clinical applications while ensuring compliance with evolving regulatory requirements.

Bayesian uncertainty quantification and error-aware pipelines

Next-generation analytical pipelines must integrate probabilistic reasoning at multiple hierarchical levels, ranging from photodetector noise modeling to population-level confidence estimation. Bayesian neural networks, particularly those utilizing hierarchical priors, could propagate measurement uncertainties through feature extraction and clustering stages, while conformal prediction frameworks may provide prediction sets or intervals with finite-sample coverage guarantees for diagnostic predictions. Adversarial robustness testing will be crucial to ensuring stability against instrument drift and reagent lot variability.
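To make the conformal prediction idea concrete, the following is a minimal sketch of split conformal classification. It assumes a classifier that already outputs class probabilities and a held-out calibration set that is exchangeable with future samples; the nonconformity score (one minus the probability of the true class), the function names, and the default `alpha` are illustrative assumptions.

```python
import math


def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Score quantile that yields roughly (1 - alpha) marginal coverage."""
    # Nonconformity score: one minus the probability given to the true class.
    scores = sorted(1.0 - probs[label] for probs, label in zip(cal_probs, cal_labels))
    n = len(scores)
    # Finite-sample corrected rank; the small offset guards against float round-up.
    rank = max(1, math.ceil((n + 1) * (1.0 - alpha) - 1e-12))
    if rank > n:
        # Too few calibration samples for this alpha: every label must be kept.
        return float("inf")
    return scores[rank - 1]


def prediction_set(probs, threshold):
    """All class indices whose nonconformity score does not exceed the threshold."""
    return [k for k, p in enumerate(probs) if 1.0 - p <= threshold]
```

Under the exchangeability assumption, sets built this way contain the true label with probability of at least 1 - alpha on average. Ambiguous cells receive larger sets instead of a single overconfident label, which is the behaviour an error-aware diagnostic pipeline would want to surface.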

Advanced statistical frameworks, including ensemble methods and probabilistic calibration techniques, will enhance the reliability of automated analysis systems. The development of comprehensive error propagation models and uncertainty visualization tools will facilitate informed decision-making in clinical settings, providing meaningful confidence metrics for clinical interpretation.

Decentralized data ecosystems and metadata-rich repositories

The establishment of FAIR-compliant knowledge graphs, which integrate raw FCS files, experimental metadata, and clinical outcomes, holds the potential to catalyze large-scale scientific discovery. Blockchain-based provenance tracking and distributed storage solutions could address concerns related to data sovereignty, particularly in multinational research consortia. Lightweight containerized analysis modules could democratize access to advanced computational methods, bridging the gap between high-performance computing resources and resource-limited clinical settings.
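The provenance idea can be sketched with a simple hash chain, in which each log entry commits to the digest of a data file, its metadata, and the hash of the previous entry. This is a deliberately reduced stand-in for blockchain-based tracking: it has no consensus protocol or distributed replication, and the field names (`file_digest`, `metadata`, `previous_hash`, `entry_hash`) and helper functions are hypothetical.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first entry


def record_hash(record):
    """SHA-256 over a canonical (sorted-key) JSON encoding of the record."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def append_entry(chain, file_bytes, metadata):
    """Append a provenance entry linking a file digest to the previous entry."""
    body = {
        "file_digest": hashlib.sha256(file_bytes).hexdigest(),
        "metadata": metadata,
        "previous_hash": chain[-1]["entry_hash"] if chain else GENESIS_HASH,
    }
    chain.append({**body, "entry_hash": record_hash(body)})
    return chain


def verify_chain(chain):
    """Return True only if every entry is intact and correctly linked."""
    previous = GENESIS_HASH
    for entry in chain:
        body = {key: entry[key] for key in ("file_digest", "metadata", "previous_hash")}
        if entry["previous_hash"] != previous or entry["entry_hash"] != record_hash(body):
            return False
        previous = entry["entry_hash"]
    return True
```

Because each entry hash covers the previous one, editing the metadata or file digest of any earlier entry invalidates every later link, so tampering can be detected without sharing the raw files themselves.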

The implementation of standardized data formats and interoperable software interfaces will streamline data exchange and facilitate seamless integration of analysis workflows. Development of ontology-driven annotation tools, supported by language model-assisted metadata curation, will ensure consistency across evolving technological platforms. These advancements will support collaborative research efforts while ensuring data security, privacy, and compliance with regulatory frameworks.

Conclusion

The rapid evolution of single-cell omics, fueled by foundation models, multimodal integration, and the development of computational ecosystems, has revolutionized the study of cellular complexity. Large-scale pretrained neural networks have transformed single-cell analysis, enabling unprecedented precision in tasks such as cell type annotation and perturbation response prediction. These advances, along with innovative multimodal integration strategies that combine transcriptomics, epigenomics, proteomics, and imaging data, have enabled the creation of comprehensive cellular atlases, providing a more nuanced understanding of cellular heterogeneity and function. Despite these strides, challenges related to model interoperability, standardized evaluation metrics, and data harmonization remain, emphasizing the need for continued innovation in computational methodologies and data standards.

Looking forward, the integration of causal inference and advanced statistical techniques will further enhance the robustness and clinical applicability of single-cell analyses. As computational power continues to grow, the development of exascale computing platforms and real-time workflows will be essential for managing the increasing scale and complexity of single-cell datasets. These transformative advances in single-cell omics are poised to reshape our understanding of cellular biology and drive innovations in precision medicine, enabling more personalized, accurate treatment strategies that will ultimately improve patient outcomes across diverse diseases. Finally, clinically credible deployment will hinge on explainability-by-design and calibrated uncertainty, alongside community benchmarks and reporting standards that privilege biological relevance over purely statistical performance.

Author contributions

Conceptualization: T.Y., B.C., H.H. Methodology: T.Y., B.C., H.W. Software: T.Y., H.W. Validation: G.F., Q.F. Formal Analysis: T.Y., B.C. Investigation: All authors contributed to the investigation and literature search. Resources: H.H. Data Curation: T.Y., B.C., H.W. Writing – Original Draft Preparation: T.Y., B.C., H.W. Writing – Review & Editing: All authors contributed to the review and editing of the manuscript. Visualization: T.Y., H.W. Supervision: H.H. Project Administration: H.H. Funding Acquisition: H.H. Equal Contribution: T.Y., B.C., and H.W. contributed equally to this work.

Funding

This work is supported by the National Natural Science Foundation of China (NSFC) under Grant No. 62403319 and Grant No. 62473116.

Data availability

The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.

Declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

The authors claim that none of the material in the paper has been published or is under consideration for publication elsewhere.

Conflict of interest

The authors declare no conflict of interest or other affiliations that could be perceived to influence the interpretation of the findings presented in this article.

Footnotes

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Taylor Yiu, Bin Chen and Haoyu Wang have contributed equally to this work.

References

  • 1. Cui H, Wang C, Maan H, Pang K, Luo F, Duan N, Wang B. scgpt: toward building a foundation model for single-cell multi-omics using generative AI. Nature Methods 2024;1–11.
  • 2. Zhang X, Xu J, Chen D, Chen L-N. scplantformer: a lightweight foundation model for plant single-cell omics analysis. 2024.
  • 3. Schaar AC, Tejada-Lapuerta A, Palla G, Gutgesell R, Halle L, Minaeva M, Vornholz L, Dony L, Drummer F, Bahrami M, et al. Nicheformer: a foundation model for single-cell and spatial omics. 2024. bioRxiv, 2024–04.
  • 4. Zhao H, Liu T, Li K, Wang Y, Li H. Evaluating the utilities of large language models in single-cell data analysis. 2023.
  • 5. Liu J, Yang M, Yu Y, Xu H, Li K, Zhou X. Large language models in bioinformatics: applications and perspectives. 2024. arXiv preprint arXiv:2401.04155.
  • 6. Lee Y, Liu X, Hao M, Liu T, Regev A. Pathomclip: connecting tumor histology with spatial gene expression via locally enhanced contrastive learning of pathology and single-cell foundation model. 2024. bioRxiv, 2024–12.
  • 7. Ge Y, Leng J, Tang Z, Wang K, U K, Zhang SM, Han S, Zhang Y, Xiang J, Yang S, et al. Deep learning-enabled integration of histology and transcriptomics for analyzing single-cell spatial profiles. Research.
  • 8. Liu J, Cen X, Yi C, Wang F-A, Ding J, Cheng J, et al. Challenges in AI-driven biomedical multimodal data fusion and analysis. Genomics Prot Bioinform. 2025;23(1):011.
  • 9. Ghazanfar S, Guibentif C, Marioni JC. Stabilized mosaic single-cell data integration using unshared features. Nat Biotechnol. 2024;42(2):284–92.
  • 10. Fang R, Preissl S, Li Y, Hou X, Lucero J, Wang X, et al. Comprehensive analysis of single cell ATAC-seq data with SnapATAC. Nat Commun. 2021;12(1):1337.
  • 11. Wang F-A, Zhuang Z, Gao F, He R, Zhang S, Wang L, et al. Tmo-net: an explainable pretrained multi-omics model for multi-task learning in oncology. Genome Biol. 2024;25(1):1–24.
  • 12. Wang F-A, Zhuang Z, Gao F, He R, Zhang S, Wang L, et al. Tmo-net: an explainable pretrained multi-omics model for multi-task learning in oncology. Genome Biol. 2024;25(1):1–24.
  • 13. Li M, Ang KS, Teo B, Rom U, Nguyen MN, Maurer-Stroh S, et al. Rediscovering publicly available single-cell data with the disco platform. Nucleic Acids Res. 2025;53(D1):932–8.
  • 14. Program CCS, Abdulla S, Aevermann B, Assis P, Badajoz S, Bell SM, et al. Cz cellxgene discover: a single-cell data platform for scalable exploration, analysis and modeling of aggregated data. Nucleic Acids Res. 2025;53(D1):886–900.
  • 15. Jiang Y, Wang S, Feng S, Wang C, Wu W, Huang X, Ma Q, Wang J, Ma A. scgnn+: adapting chatgpt for seamless tutorial and code optimization. 2024. bioRxiv, 2024–09.
  • 16. Rood JE, Wynne S, Robson L, Hupalowska A, Randell J, Teichmann SA, Regev A. The human cell atlas from a cell census to a unified foundation model. Nature 2024;1–2.
  • 17. Zhao S, Zhang J, Wu Y, Luo Y, Nie Z. Langcell: language-cell pre-training for cell identity understanding. 2024. arXiv preprint arXiv:2405.06708.
  • 18. Maleki S, Huetter J-C, Chuang KV, Scalia G, Biancalani T. Efficient fine-tuning of single-cell foundation models enables zero-shot molecular perturbation prediction. 2024. arXiv preprint arXiv:2412.13478.
  • 19. Becht E, Tolstrup D, Dutertre C-A, Morawski PA, Campbell DJ, Ginhoux F, et al. High-throughput single-cell quantification of hundreds of proteins using conventional flow cytometry and machine learning. Sci Adv. 2021;7(39):0505.
  • 20. Chen X, Li K, Cui X, Wang Z, Jiang Q, Lin J, Li Z, Gao Z, Jiang R. Epiagent: foundation model for single-cell epigenomic data. 2024. bioRxiv, 2024–12.
  • 21. Ho N, Ellington CN, Hou J, Addagudi S, Mo S, Tao T, Li D, Zhuang Y, Wang H, Cheng X, et al. Scaling dense representations for single cell with transcriptome-scale context. 2024. bioRxiv, 2024–11.
  • 22. Wen H, Tang W, Dai X, Ding J, Jin W, Xie Y, Tang J. Cellplm: pre-training of cell language model beyond single cells. 2023. bioRxiv, 2023–10.
  • 23. Hrovatin K, Moinfar AA, Zappia L, Lapuerta AT, Lengerich B, Kellis M, Theis FJ. Integrating single-cell rna-seq datasets with substantial batch effects. 2024. bioRxiv, 2023–11.
  • 24. Cui H. Foundation model for single-cell omics. PhD thesis, University of Toronto (Canada). 2024.
  • 25. Wu H-J, Zheng X, Ma Z, Zhu H, Yuan Y, Yang J, Cai K, Wei N, Zhang S, Wang L, et al. Cellpatch: a highly efficient foundation model for single-cell transcriptomics with heuristic patching. 2024. bioRxiv, 2024–11.
  • 26. Li X, Zhang R, Aslam S, Li H, Chen Y, Zhang Z, Huang R-S, Wu H. scmonica: single-cell mosaic omics nonlinear integration and clustering analysis. In: 2024 IEEE International conference on bioinformatics and biomedicine (BIBM). IEEE; 2024. pp. 1579–83.
  • 27. Zhao Y, Zhao B, Zhang F, He C, Wu W, Lai L. Sc-mamba2: leveraging state-space models for efficient single-cell ultra-long transcriptome modeling. 2024. bioRxiv, 2024–09.
  • 28. Baek S, Park S, Chok YT, Lee J, Park J, Gim M, Kang J. Cradle-vae: enhancing single-cell gene perturbation modeling with counterfactual reasoning-based artifact disentanglement. 2024. arXiv preprint arXiv:2409.05484.
  • 29. Li L, You Y, Liao W, Fan X, Lu S, Cao Y, Li B, Ren W, Fu Y, Kong J, et al. A systematic comparison of single-cell perturbation response prediction models. 2024. bioRxiv, 2024–12.
  • 30. Szałata A, Hrovatin K, Becker S, Tejada-Lapuerta A, Cui H, Wang B, et al. Transformers in single-cell omics: a review and new perspectives. Nat Methods. 2024;21(8):1430–43.
  • 31. Ma Q, Jiang Y, Cheng H, Xu D. Harnessing the deep learning power of foundation models in single-cell omics. Nat Rev Mol Cell Biol. 2024;25(8):593–4.
  • 32. He Y, Huang F, Jiang X, Nie Y, Wang M, Wang J, Chen H. Foundation model for advancing healthcare: challenges, opportunities, and future directions. 2024. arXiv preprint arXiv:2404.03264.
  • 33. Xia C-R, Cao Z-J, Gao G. High-fidelity disentangled cellular embeddings for large-scale heterogeneous spatial omics via decipher. 2024. bioRxiv, 2024–11.
  • 34. Tang H, Zhong J-Y, Yu X-T, Chai H, Liu R, Zeng T. Exploring structured molecular landscape from single-cell multi-omics data by an explainable multimodal model. iScience 2024;27(12).
  • 35. Kobayashi-Kirschvink KJ, Comiter CS, Gaddam S, Joren T, Grody EI, Ounadjela JR, Zhang K, Ge B, Kang JW, Xavier RJ, et al. Prediction of single-cell rna expression profiles in live cells by raman microscopy with raman2rna. Nat Biotechnol 2024;1–9.
  • 36. Palma A, Richter T, Zhang H, Lubetzki M, Tong A, Dittadi A, Theis F. Generating multi-modal and multi-attribute single-cell counts with cfgen. 2024. arXiv preprint arXiv:2407.11734.
  • 37. Singhal V, Chou N, Lee J, Yue Y, Liu J, Chock WK, et al. Banksy unifies cell typing and tissue domain segmentation for scalable spatial omics data analysis. Nat Genet. 2024;56(3):431–41.
  • 38. Krishnan SN, Ji S, Elhossiny AM, Rao A, Frankel TL, Rao A. Proximogram-a multi-omics network-based framework to capture tissue heterogeneity integrating single-cell omics and spatial profiling. Comput Biol Med. 2024;182:109082.
  • 39. Pentimalli TM, Karaiskos N, Rajewsky N. Challenges and opportunities in the clinical translation of high-resolution spatial transcriptomics. Ann Rev Pathol Mech Disease 202420.
  • 40. Manchel A, Gee M, Vadigepalli R. From sampling to simulating: single-cell multiomics in systems pathophysiological modeling. iScience. 2024.
  • 41. Wagle MM, Long S, Chen C, Liu C, Yang P. Interpretable deep learning in single-cell omics. Bioinformatics 2024;374.
  • 42. Palayew S, Wang B, Bader GD. scmpt: towards applying large language models to complement single-cell foundation models.
  • 43. Wu Y, Xie L. Ai-driven multi-omics integration for multi-scale predictive modeling of causal genotype-environment-phenotype relationships. 2024. arXiv preprint arXiv:2407.06405.
  • 44. Wu Y, Xie L. Ai-driven multi-omics integration for multi-scale predictive modeling of genotype-environment-phenotype relationships. Comput Struct Biotechnol J. 2025.
  • 45. Chen Z, Wei L, Gao G. Foundation models for bioinformatics. Quant Biol. 2024;12(4):339–44.
  • 46. Wen H. Single cells are biological tokens: towards cell language models. PhD thesis, Michigan State University. 2024.
  • 47. Nam Y, Kim J, Jung S-H, Woerner J, Suh EH, Lee D-G, Shivakumar M, Lee ME, Kim D. Harnessing artificial intelligence in multimodal omics data integration: paving the path for the next frontier in precision medicine. Ann Rev Biomed Data Sci 2024;7.
  • 48. Polychronidou M, Hou J, Babu MM, Liberali P, Amit I, Deplancke B, Lahav G, Itzkovitz S, Mann M, Saez-Rodriguez J, et al. Single-cell biology: what does the future hold? 2023.
  • 49. Heryanto YD, Zhang Y-Z, Imoto S. Tissue-specific cell type annotation with supervised representation learning using split vector quantization and its comparisons with single-cell foundation models. 2024. bioRxiv, 2024–12.
  • 50. Hsieh K-L, Chu Y, Li X, Pilié PG, Dai Y. scemb: learning context representation of genes based on large-scale single-cell transcriptomics. 2024. bioRxiv.
  • 51. Wong DR, Hill A, Moccia R. Simple controls exceed best deep learning algorithms and reveal foundation model effectiveness for predicting genetic perturbations. 2025. bioRxiv, 2025–01.
  • 52. Wang Q, Zhu H, Hu Y, Chen Y, Wang Y, Zhang X, Zou J, Kellis M, Li Y, Liu D, et al. Cellmemory: hierarchical interpretation of out-of-distribution cells using bottlenecked transformer. 2024. bioRxiv, 2024–12.
  • 53. Hingerl JC, Martens LD, Karollus A, Manz T, Buenrostro JD, Theis FJ, Gagneur J. scooby: modeling multi-modal genomic profiles from dna sequence at single-cell resolution. 2024. bioRxiv.
  • 54. Schafer PSL, Dimitrov D, Villablanca EJ, Saez-Rodriguez J. Integrating single-cell multi-omics and prior biological knowledge for a functional characterization of the immune system. Nat Immunol. 2024;25(3):405–17.
  • 55. Fang Y, Liu K, Zhang N, Deng X, Yang P, Chen Z, Tang X, Gerstein M, Fan X, Chen H. Chatcell: facilitating single-cell analysis with natural language. 2024. arXiv preprint arXiv:2402.08303.
  • 56. Ahlmann-Eltze C, Huber W, Anders S. Deep learning-based predictions of gene perturbation effects do not yet outperform simple linear methods. 2024. bioRxiv, 2024–09.
  • 57. Waqas A. From graph theory for robust deep networks to graph learning for multimodal cancer analysis. PhD thesis, University of South Florida. 2024.
  • 58. Xu T, Xu Q, Lu R, Oakland DN, Li S, Li L, et al. Application of deep learning models on single-cell rna sequencing analysis uncovers novel markers of double negative t cells. Sci Rep. 2024;14(1):31158.
  • 59. Liu X, Chapple RH, Bennett D, Wright WC, Sanjali A, Culp E, Zhang Y, Pan M, Geeleher P. Csi-gep: a gpu-based unsupervised machine learning approach for recovering gene expression programs in atlas-scale single-cell rna-seq data. Cell Genomics. 2025;5(1).
  • 60. Schuster V, Dann E, Krogh A, Teichmann SA. multiDGD: a versatile deep generative model for multi-omics data. Nat Commun. 2024;15(1):10031.
  • 61. Zhai Y, Chen L, Deng M. scBOL: a universal cell type identification framework for single-cell and spatial transcriptomics data. Brief Bioinform. 2024;25(3):188.
  • 62. Ren J, Han H. Wasserstein graph convolutional network with attention for imbalanced scrna-seq data knowledge discovery. In: Southwest data science conference. Springer; 2024. pp. 1–16.
  • 63. Hai J, Xie Z, Liu N, Yuan Y. Scbc: a supervised single-cell classification method based on batch correction for atac-seq data. In: Pacific rim international conference on artificial intelligence. Springer; 2024. pp. 61–72.
  • 64. Wu T, Zhao L, Ren M, He S, Zhang L, Fang M, Wang B. Small-sample learning for next-generation human health risk assessment: harnessing AI, exposome data, and systems biology. Environ Sci Technol. 2025.
  • 65. Niu J, Ding J. Single-cell multiomics data integration and generation with scpairing. 2025. bioRxiv, 2025–01.
  • 66. Yin D, Cao Y, Chen J, Mak CL, Yu KH, Zhang J, Li J, Lin Y, Ho JW, Yang JY. Scope+: an open source generalizable architecture for single-cell rna-seq atlases at sample and cell levels. Bioinformatics 2024;727.
  • 67. Gong J, Hao M, Cheng X, Zeng X, Liu C, Ma J, Zhang X, Wang T, Song L. xtrimogene: an efficient and scalable representation learner for single-cell rna-seq data. Adv Neural Inf Process Syst. 2024;36.
  • 68. Piran Z, Cohen N, Hoshen Y, Nitzan M. Disentanglement of single-cell data with biolord. Nat Biotechnol. 2024;1–6.
  • 69. Gao H, Hua K, Wu X, Wei L, Chen S, Yin Q, et al. Building a learnable universal coordinate system for single-cell atlas with a joint-vae model. Commun Biol. 2024;7(1):977.
  • 70. Cao Y. Deep representation learning for single-cell sequencing data analysis. Irvine: University of California; 2023.
  • 71. Schaefer M, Peneder P, Malzl D, Peycheva M, Burton J, Hakobyan A, Sharma V, Krausgruber T, Menche J, Tomazou EM, et al. Multimodal learning of transcriptomes and text enables interactive single-cell rna-seq data exploration with natural-language chats. 2024. bioRxiv, 2024–10.
  • 72. Bi Z, Dip SA, Hajialigol D, Kommu S, Liu H, Lu M, Wang X. Ai for biomedicine in the era of large language models. 2024. arXiv preprint arXiv:2403.15673.
  • 73. Selby DA, Jakhmola R, Sprang M, Grossmann G, Raki H, Maani N, Pavliuk D, Ewald J, Vollmer SJ. Visible neural networks for multi-omics integration: a critical review. 2024. bioRxiv, 2024–12.
  • 74. Ghosh S, Zhao X, Alim M, Brudno M, Bhat M. Artificial intelligence applied to ‘omics data in liver disease: towards a personalised approach for diagnosis, prognosis and treatment. Gut. 2025;74(2):295–311.
  • 75. Dong W, Sheng J, Cui JZ, Zhao H, Wong ST. Systems immunology insights into brain metastasis. Trends Immunol. 2024;45(11):903–16.
  • 76. Lam HYI, Ong XE, Mutwil M. Large language models in plant biology. Trends Plant Sci. 2024.
  • 77. Parums DV. The human cell atlas. what is it and where could it take us? Med Sci Monit Int Med J Exp Clin Res. 2025;31:947707.
  • 78. Xu K, Ding Y. Zeromics: toward general models for single-cell analysis with instruction tuning. 2025.
  • 79. Ali S, Qadri YA, Ahmad K, Lin Z, Leung M-F, Kim SW, et al. Large language models in genomics-a perspective on personalized medicine. Bioengineering. 2025;12(5):440.

Articles from Journal of Translational Medicine are provided here courtesy of BMC
