Abstract
Large‐scale single‐cell RNA sequencing (scRNA‐seq) and spatial transcriptomics (ST) have transformed biomedical research into a data‐driven field, enabling the creation of comprehensive atlases. These methodologies facilitate detailed understanding of biology and pathophysiology; however, the complexity and sheer volume of data present analytical challenges, particularly in robust cell typing, integration, and understanding complex spatial relationships of cells. To address these challenges, CELLama (Cell Embedding Leverage Language Model Abilities) develops a framework that leverage language model to transform cell data into “sentences” that encapsulate gene expressions and metadata, enabling universal cell embedding. CELLama, serving as a foundation model, supports flexible applications ranging from cell typing to analysis of spatial contexts, independent of complex dataset‐specific analysis workflows by using a large cell atlas. The results demonstrate that CELLama has significant potential to transform cellular analysis in various contexts, from determining cell types using multi‐tissue atlases and their interactions to unraveling intricate tissue dynamics.
Keywords: artificial intelligence, natural language model, single cell RNA‐sequencing, spatial transcriptomics, transformer
CELLama is created, a framework that harnesses language models to convert cellular data into “sentences” that represent gene expression and metadata, enabling a universal embedding of cells. Unlike most single‐cell foundation models, CELLama supports scalable analysis and offers flexible applications including spatial transcriptomics. Its capacity to automate cell typing with atlas‐scale datasets underscores its practical value in simplifying complex workflows.

1. Introduction
Recent advancements in single‐cell RNA‐seq (scRNA‐seq) and spatial transcriptomics (ST) have facilitated the creation of extensive data atlases under normal and disease conditions.[ 1 , 2 , 3 , 4 ] Understanding these large datasets through an integrative approach, coupled with detailed molecular expression data at the cellular level, paves the way for a deeper understanding of diseases and transforms biomedical research into a data‐driven field. These cellular atlases and recently developed high resolution ST‐based data are invaluable resources for dissecting pathophysiological processes and discovering new therapeutic targets that influence drug development and enhance our understanding of disease mechanisms and cellular functions.[ 5 , 6 ] Yet, the wealth of data generated by such studies presents significant challenges, especially when it comes to the robust determination of cell types, the analysis of their functions and perturbations, and the interpretation of their spatial arrangements within tissues.[ 7 , 8 , 9 ] These challenges stem from the complexity and volume of atlas‐level datasets that are difficult to handle effectively with traditional analytics tools. More specifically, scRNA‐seq and ST analysis for pathophysiology and comprehensive understanding of biological phenomena require the adoption of analysis methods that leverage large, atlas‐scale data.[ 10 , 11 , 12 ] These methods should integrate seamlessly with existing atlas‐level knowledge and flexibly accommodate in specific tasks including robust cell typing, spatial context analyses and multi‐batch, platform integration.
Given the challenges, there has been significant interest in developing foundation models for scRNA‐seq and ST data. In this regard, models based on transformer architecture, widely utilized in natural language processing (NLP) and vision processing in recent AI developments,[ 13 ] have been proposed, including scGPT[ 14 ] and Geneformer.[ 15 ] These models provide robust frameworks for integrating and interpreting large‐scale molecular expression data at the cellular level. While these models have pioneered the use of foundation models in cellular analysis by embedding data for various applications, they require extensive training and relatively lack flexibility in terms of usage and adaptation, as they primarily focus on embedding cells based on their gene expression data. For instance, when attempting to integrate these data with other variables like disease conditions or spatial context, which serve as metadata for cellular data, these models have limitations in directly applying to such “novel” or “unseen” information. An alternative approach, which leverages the intrinsic properties of cellular data, involves generating “sentences” from cells based on their gene expression ranks and associated metadata (e.g., tissue origin, sample condition, data platform, and spatial context). This approach hypothesizes that transforming cellular information into a language‐like format could harness the power of language models to create a universal embedding space for cells.
Building on this concept, we develop a framework CELLama, Cell Embedding Leverage Language Model Abilities, designed to adapt flexibly to various cellular datasets for general‐purpose applications like cell typing at atlas and multi‐tissue levels without manual intervention such as searching appropriate reference cell data. CELLama transforms scRNA‐seq data into natural language sentences, capturing the unique transcriptomic signature of each cell. In addition, CELLama can utilize pretrained models that cover general NLP processes for embedding, and it can also be fine‐tuned using large‐scale cellular data by generating sentences and their similarity metrics. This approach can be flexibly applied to various tasks, including integrating niche cell information to characterize the spatial context of cells. By generating sentences from spatial contexts, it pushes the boundaries of how we interpret complex biological data. CELLama showcases remarkable flexibility in applications ranging from cell embedding and cell typing to analyzing spatial contexts. Its ability to automate cell typing using large‐scale, atlas‐level data highlights its practical utility in streamlining complex analyses.
2. Results
2.1. Sentence Generation for Each Cell Data and Embedding using Sentence Transformer
CELLama aims to utilize sentence transformers by generating sentences from cell data, incorporating metadata as well as gene expression. To transform the gene expression values of individual cells into a format suitable for language‐based embedding models, the scRNA‐seq data were first converted into a ranked list of genes based on their expression levels. For each cell, the top expressed genes (top‐k is a parameter for number of top genes) are ordered sequentially to form a comprehensive sentence. In other words, the derived sentence would list these genes in descending order of expression. This conversion process is depicted in Figure 1a, illustrating how gene expression values are translated into textual data. In addition to gene expression data, cell‐specific metadata can be integrated into the sentence generation process. Information such as tissue origin, disease state, or experimental condition can be added to the gene expression sentences. For example, if we aim to embed cells with information of metadata, key descriptors like tissue type or disease status can be included in the sentence. This process involved adding phrases such as “[tissue type] of this cell is [value]” or “[disease status] of this cell is [value]” after the gene rank‐based description. In practice, “[tissue type]” can be specified using the corresponding metadata column (e.g., in a Scanpy/AnnData object, adata.obs.columns), while “[value]” represents the specific metadata entry as a string. The generated sentences for cells effectively encapsulate the expression profile of each cell as a simple, coherent narrative, which is then fed into a sentence transformer model.[ 16 ] This model, designed for NLP, embeds each cell data into a high‐dimensional space in a flexible manner. Additionally, the training, fine‐tuning, and inference processes of NLP‐based packages can be effectively leveraged.
Figure 1.

CELLama sentence transformation and embedding process. a) Schematic illustration of the conversion of gene expression values into sentences for embedding. First, single‐cell RNA‐seq (scRNA‐seq) data were transformed into a ranked list of gene expressions for each cell. The top expressed genes were sequentially ordered to form a sentence, which lists these genes in descending order of their expression levels. Additional metadata could also be integrated, enriching the context of each generated sentence. These sentences included the unique expression profile of each cell and were prepared for input into a sentence transformer model. The widely used sentence transformer model for natural language embedding was utilized without additional training. b) UMAP visualization showed a baseline analysis of scRNA‐seq data from multiple tissues (subset of Tabula Sapiens data) using traditional preprocessing methods. This provided a baseline reference for comparing with CELLama embedding capabilities. c) UMAP visualization showed cell data embedded using CELLama with the sentence transformer model. Notably, the visualization highlighted how the sentence generation, which includes specific metadata such as the tissue of origin, impacted the clustering patterns of cells, demonstrating the adaptability of CELLama to utilize both gene expression and contextual metadata for effective embedding in a high‐dimensional space.
2.2. Embedding and Zero‐Shot Cell Typing from Reference scRNA‐seq Data of Multiple Tissues and Samples
To evaluate CELLama using a general‐purpose sentence transformer model, we utilized scRNA‐seq data from multiple tissues and samples as a reference. The goal of this task of CELLama was to map test scRNA‐seq data to large‐scale, multi‐tissue datasets without the need for manual reference selection of similar scRNA‐seq data. For this purpose, the Tabula Sapiens dataset was employed.[ 10 ] We used a subsample comprising 1/10 of the Tabula Sapiens data. The test set was defined based on the origin of the sample, with cells from one subject in the Tabula Sapiens dataset designated as the test data, while scRNA‐seq data from other subjects were used as the reference set. This test data included cells from 5 tissues including the bladder, blood, lung, muscle, and pancreas, encompassing 57 distinct cell types from a different subject than those in the reference set. To assess the multi‐tissue scRNA‐seq data, as a baseline study, we conducted a standard analytical pipeline that included preprocessing to identify highly variable genes, followed by PCA calculation, and subsequently visualizing the data using UMAP as depicted in Figure 1b.[ 17 ]
Both the reference and test data were embedded into the same space using CELLama, which leverages a general‐purpose, pretrained sentence transformer model—specifically, the “all‐MiniLM‐L12‐v2” model, producing 384‐D embedding vectors. These vectors were then projected onto a UMAP and visualized in Figure 1c. Notably, the way of sentence generation can influence the embedding results. For instance, if we aim at embedding incorporated with specific metadata such as the “tissue organ” origin of each cell, this information can be included in the sentence generation process. As illustrated in Figure 1c, the embedded vectors and their clustering patterns varied according to the metadata information, demonstrating the capacity of CELLama to adapt embeddings based on the critical metadata provided. Furthermore, within the same tissue organ, cell types were distinctly separated on the UMAP (Figure 1c), suggesting that after metadata incorporation, CELLama's embedding effectively differentiated cell types based on top‐k gene sentences.
After embedding, we assessed the cell type classification performance using the embedded vectors. We assigned cell types to the test set data based on the nearest neighborhood to the reference set. The accuracy of cell type assignments made by CELLama was evaluated against the ground truth cell labels, as shown in Figure 2a. Overall accuracy, precision, and recall metrics were calculated to gauge the performance of cell typing. The performance of zero‐shot CELLama in cell typing varied depending on the top‐k parameters and whether metadata was included. These results were benchmarked against scGPT,[ 14 ] a foundation model trained on 33 million cells and specialized for scRNA‐seq analysis. When tissue origin metadata was incorporated into the sentence generation, CELLama's performance exceeded that of scGPT across all three metrics (Figure 2b). We also compared our CELLama's performance with Geneformer,[ 15 ] C2S,[ 18 ] SingleR,[ 19 ] AIDO.Cell,[ 20 ] UCE,[ 21 ] Transcriptformer,[ 22 ] and scVI with logistic regression.[ 23 ] Geneformer failed in the zero‐shot approach, unable to predict cell types without further training. However, after training with the reference scRNA‐seq data from Tabula Sapiens, Geneformer achieved a similar accuracy to CELLama (77.7% vs 78.7% for Geneformer and CELLama, respectively). Despite this, Geneformer demonstrated much lower precision and recall compared to CELLama and scGPT, even though both of these methods used only the zero‐shot approach (Figure 2b). Furthermore, we explored the impact of excluding tissue type from the sentence generation. In this situation, CELLama's accuracy in cell typing was comparable to that of scGPT, even though both precision and recall were slightly lower than those observed with scGPT when metadata was omitted (Figure 2c). We also evaluated SingleR,[ 19 ] a conventional unbiased reference‐based cell typing method, which demonstrated lower accuracy (54.2%) compared to other zero‐shot transformer‐based models, including CELLama (Figure 2b,c). When compared with other recent single‐cell foundation models and the conventional method scVI, these approaches, unlike CELLama, either showed clear disadvantages in computing time that raised scalability issues (i.e., SingleR, C2S, AIDO.Cell, UCE, TranscriptFormer, scVI), or failed to incorporate metadata during embedding, which limited their applicability to flexible settings such as ST (i.e., scGPT, Geneformer, SingleR, AIDO.Cell, UCE, scVI) (Figure 2b,c). The performance comparison results are presented in Table S1 (Supporting Information).
Figure 2.

Performance evaluation of CELLama in cell typing based on atlas‐level multi‐tissue dataset. a) Confusion matrix of cell type classification accuracy using CELLama and the true label for cell types. b) Comparison of CELLama's cell typing performance with the inclusion of tissue origin metadata, benchmarked against scGPT. The graph shows the accuracy, precision, recall, F1‐score (macro), and total time (s) of CELLama‐based cell typing according to the top‐k variable, the number of genes for sentence generation. The results were compared with those metrics of scGPT, Geneformer, SingleR, C2S (160m), AIDO.Cell (3m), UCE (4‐layer), TranscriptFormer (TF‐Sapiens), and scVI. Notably, Geneformer failed in zero‐shot prediction and required training with reference data before being applied to cell type prediction. Also, CELLama operates on a time scale comparable to scGPT and Geneformer, whereas the other methods exhibit significantly higher time complexity, limiting their scalable analysis. The raw benchmarking data can be found in Table S1 (Supporting Information). c) Impact of excluding tissue type metadata on the performance of CELLama. When tissue origin metadata was excluded during the sentence generation, the performance metrics including the accuracy, precision, recall, F1‐score (macro), and total time were altered. The results for C2S and TranscriptFormer are shown with metadata involved in embedding. The raw benchmarking data can be found in Table S1 (Supporting Information).
Zero‐shot cell embedding and typing were also evaluated using a COVID‐19 dataset from multiple samples (sourced from the study by Lotfollahi et al.).[ 24 ] This data includes 18 distinct batches from lung tissues. It was divided into two sets, reference and test sets (15,997 cells for reference and 4,003 cells for test). The embedding results varied depending on whether metadata was included in the sentence generation (Figure S1, Supporting Information). Notably, while metadata inclusion led to tissue‐specific separation in the embedding space, individual cell types remained distinctly clustered within their respective metadata‐defined groups, demonstrating that CELLama effectively preserves fine‐grained cell‐type resolution even when broader metadata categories influence the overall structure of the embedding space (Figure S1, Supporting Information). For the test dataset, cell type mapping using CELLama was compared against the ground truth (Figure S2a, Supporting Information). Performance was further benchmarked against a general‐purpose sentence transformer‐based embedding, which demonstrated slightly better precision and recall than scGPT, though the accuracy was marginally lower than that of scGPT. Without the inclusion of metadata, the performance of CELLama‐based cell typing slightly decreased (Figure S2b, Supporting Information). In this COVID‐19 dataset as well, CELLama demonstrated both its efficiency in computing time and its capability to incorporate metadata (Figure S2, Supporting Information), with the results presented in Table S3 (Supporting Information).
Note that CELLama utilizes reference‐based similarity comparisons for cell embedding and annotation and therefore does not qualify as a pure zero‐shot model, in which no reference dataset is needed. While this design sacrifices autonomy from reference data, it offers stronger alignment to known biological structures and ensures higher embedding fidelity in downstream analyses. In addition, the reported performance metrics do not represent the accuracy of an additional trained classifier; rather, they directly evaluate how well the embedding features are preserved and subsequently used for cell labeling. Specifically, these metrics assess the quality of feature extraction by evaluating how well the embeddings represent cell type–specific characteristics across datasets, without requiring additional classifier training. To concretely demonstrate this, we evaluated transfer performance by integrating two distinct datasets (Tabula Sapiens and COVID‐19 data), thereby highlighting the capacity of the embedding space to align with biologically meaningful cell type structures.
2.3. Human Pancreas Data from Multiple Platforms Mapped onto Atlas Cell Data
For simulating generalized cell typing leveraging a single large atlas as a reference, human pancreas data from multiple platforms were used as a practical example of mapping new single‐cell data cell types using a general‐purpose, multi‐tissue scRNA‐seq dataset as a reference.[ 25 ] We utilized human pancreas scRNA‐seq data from nine different batches, generated on various platforms, as the query dataset. Cell types were mapped using a subsampled Tabula Sapiens dataset as a “general purpose, multi‐tissue reference.” When we applied a conventional analytical workflow based on PCA, the human pancreas scRNA‐seq data from various batches appeared separated according to the batches (Figure 3a). However, when embedded by CELLama, the data clustered primarily according to their cell types (Figure 3b). The parameter top‐k was selected as 20 for this experiment. To map this data onto the reference Tabula Sapiens data, sentences were generated incorporating the organ types from the data. As a result, the query human pancreas data were effectively aligned with the pancreas tissue cell data of Tabula Sapiens (Figure 3c; Figure S3, Supporting Information). The confusion matrix demonstrated a strong match with the original labels of the human pancreas scRNA‐seq data (Figure 3d). Notably, although the cell proportions of these two datasets varied widely (Table S2, Supporting Information), the CELLama embedding showed accurate mapping results, achieving 98% and 88% recall rates for alpha and beta cells, respectively. An additional application of CELLama was to measure the embedded distance from query cell data to the nearest cell in the reference dataset, revealing metrics for uncertainty or out‐of‐distribution for query cell data (Figure 3e).
Figure 3.

Zero‐shot mapping and cell typing of human pancreas scRNA‐seq data with multiple platforms using CELLama. a) Conventional analytic flows of scRNA‐seq data from multiple platforms for human pancreas scRNA‐seq data from nine different batches demonstrated batch‐specific clustering. b) UMAP projection showing the CELLama‐embedded data, where cells predominantly cluster according to cell type rather than batch, illustrating effective cell‐type‐based grouping. c) Alignment of query pancreas data with the multi‐tissue reference from Tabula Sapiens, highlighting the effective mapping of cell types across different datasets. d) A confusion matrix showed the robustness of CELLama‐based cell types mapping using atlas‐level reference dataset without manual selection of reference scRNA‐seq data. e) Analysis of embedded distances from query cell data to the nearest cell in the reference dataset, utilized to identify out‐of‐distribution cells and assess uncertainty in cell type prediction. Here, the embedded distance refers to the distance to the nearest reference within the embedding space. This metric demonstrates the capability to provide insights into the reliability of the embedding process, especially for cells with low representation or outlier characteristics. We fit a k‐NN index using scikit‐learn's NearestNeighbors (metric='cosine') on the embedding vectors and queried n_neighbors=1. Reported distances correspond to the nearest‐neighbor cosine distance for each query point.
Notably, CELLama metadata‐leveraging cell embedding distinguishes it from other foundation embedding methods. For example, scGPT‐embedded results from human pancreas data across multiple platforms were generally clustered by cell type, except for the “SMARTER” batch, which formed a distinct cluster (Figure S4a, Supporting Information). When referencing the general whole‐cell data from Tabula Sapiens, the embedding of human pancreas data demonstrated a generalized cell typing approach, utilizing a single large atlas as a reference. To illustrate this, human pancreas data from multiple platforms were mapped using a general‐purpose, multi‐tissue scRNA‐seq dataset. However, pancreas cells from Tabula Sapiens (blue spots, Figure S4b, Supporting Information) were distinctly separated from human pancreas data when the data were embedded by scGPT, highlighting challenges in cross‐dataset integration. As a result, when performing zero‐shot cell typing using the general reference dataset, the model failed to accurately identify minor cell types. For instance, beta cell annotation achieved only 28% accuracy, while delta cells were identified with less than 1% accuracy, indicating limitations in mapping rare cell populations without dataset‐specific adaptations (Figure S4c, Supporting Information).
2.4. Finetuning of CELLama Results for Better Performance
Though previous results utilized a general‐purpose sentence transformer directly after generating sentences from cell data, sentences can also be generated specifically from single‐cell data for model fine‐tuning. This two‐step process involves generating sentence pairs from single‐cell or spatial data, followed by training the transformer with these pairs. We tested this finetuning process with the COVID‐19 dataset (Figure 4a). First, sentences were constructed to represent each cell by ranking gene expressions and formatting them into descriptive sentences. These sentences were then paired, with pairs assigned a label based on their cosine similarity from gene expression data. This similarity is calculated using PCA components derived from high‐variable gene expressions to ascertain the cosine distances between cells. We aimed for a balanced dataset by selecting pairs that are either closely related (cells with top cosine similarity for a given cell) or randomly assorted. Additionally, if the features of metadata differ (e.g., cells from different conditions or tissues), the cosine similarity was adjusted to zero to facilitate clustering of similar samples and to ensure separation based on distinct metadata (Figure 4b). The transformer was then fine‐tuned using these generated pairs. The reference COVID‐19 dataset was free of predefined labels, allowing for the generation of ≈160,000 sentence pairs, which were then used for fine‐tuning the model. After applying finetuning to the sentence transformer‐based embedding model, both reference and test data were embedded (Figure 4c). Cell types for the test set were then annotated using the nearest neighbor approach, and accuracy was subsequently measured. As a result, compared to Figure S2 (Supporting Information), the fine‐tuned model demonstrated enhanced cell typing performance, significantly surpassing both the original CELLama and scGPT. The results showed accuracies of 86.7%, 85.9%, and 86.7% for finetuned CELLama, original CELLama, and scGPT respectively. Precision scores were 61.0%, 54.3%, and 54.5%, while recall scores were 59.5%, 52.4%, and 49.7%, respectively (Figure 4d).
Figure 4.

Fine‐tuning CELLama with sentence pairs derived from cell data and their similarity. a) Illustration of the COVID‐19 dataset for finetuning. UMAP showed embedded results based on principal components after preprocessing of scRNA‐seq data. b) Visualization of sentence pairing and cosine similarity calculations used for training the model. Cosine similarities were calculated based on PCA components of high‐variable gene expressions, with pairs assigned labels that reflect their similarity. Similarity labels are adjusted to zero if metadata features differ, promoting accurate and relevant cellular representation in embeddings. c) Embedding of both reference and test datasets using fine‐tuned CELLama model. With the metadata information, cells were separated according to the “condition” of data and reference/query datasets were integrated. d) Performance metrics of cell type annotation post‐fine‐tuning. The plot compares the accuracy, precision, and recall of CELLama before and after fine‐tuning as well as scGPT. The metrics showed the effectiveness of fine‐tuning in enhancing the performance of CELLama embeddings.
2.5. Analyzing Cells with Spatial Context by Niche Cell Information on ST with CELLama
We further explored the capability of CELLama to perform spatial mapping of cell types on ST data. Cell type mapping for image‐based ST is challenging due to limited gene panels, which restrict the annotation of cell types. Additionally, we investigated whether CELLama could analyze spatial contexts of cells considering their niche. The parameter top‐k was selected as 20 for this experiment. We applied scRNA‐seq data from lung cancer[ 26 ] to image‐based ST data of lung cancer (Xenium data).[ 27 ] We used genes common to both the scRNA‐seq and ST panel to generate sentences, which were then embedded (Figure 5a). To benchmark against other cell type mapping tools specialized for ST, we compared the results of CELLama with that of TACCO.[ 28 ] The spatial maps of cell type annotations (Figure 5b) displayed similar results between two annotation methods. The confusion matrix also showed similar results, while it indicated a higher presence of rare cell types like NK cells in the TACCO‐based labels (Figure 5c). We also evaluated CELLama‐based spatial cell type mapping by comparing it with Tangram[ 29 ] and manual cell typing following clustering with marker genes. The results demonstrated that CELLama‐based annotations were highly consistent with manual cell typing, whereas Tangram produced differing results (Figure S5, Supporting Information).
Figure 5.

Spatial mapping and cell type annotation using CELLama on ST data. a) Process of generating sentences using common genes between scRNA‐seq and ST panels from lung cancer data. CELLama co‐embedded scRNA‐seq and ST data with limited number of gene panel. b) Comparison of spatial cell type maps generated by CELLama and TACCO, a specialized cell type mapping tool for image‐based ST. c) A confusion matrix comparing the cell type annotations between CELLama and TACCO, highlighting the overall consistency.
As an additional application of CELLama to ST data, we tested the ability to capture spatial context by integrating sentences with additional metadata. For this, niche cell type information, determined by the top nearest cell types for a given cell, was incorporated into the sentence generation process (Figure 6a). These sentences were then used to embed ST cell data, resulting in distinct subtyping on UMAP even within the same cell types (Figure 6b). Specifically, we selected a subset of cells (in this case, fibroblasts) and analyzed their clustering based on the embedded results from CELLama, which included niche cell information (Figure 6c). The spatial mapping of subtypes of fibroblasts showed patterns of fibroblasts according to the cell locations, for instance, fibroblasts near epithelial cells and fibroblasts clustering with similar cells nearby (Figure 6d). These niche information‐enhanced fibroblast subclusters were characterized by the average abundance of niche cells, revealing that “cluster 0” was associated with high concentrations of T‐ and B‐lymphocytes, “cluster 1” with more epithelial cells, “cluster 2” with myeloid cells, “cluster 4” with MAST cells, and “cluster 5” primarily comprised of fibroblasts (Figure 6e). By identifying markers for these fibroblast subtypes, we could define “spatial pattern”‐based fibroblast categorizations using CELLama embedding, offering unique insights for analyses that consider spatial patterns (Figure 6f). Although the result was solely from a single sample, for instance, SFRP4+ fibroblasts, prominently observed in ‘cluster 1’, have been well‐documented in cancer‐associated EMT progression.[ 30 , 31 ] These findings highlight the capability in niche analysis in ST data, expanding beyond traditional cell type annotation methods. Additional analysis of a different cell type showed subclusters of epithelial cells according to their niche information (Figure S6, Supporting Information).
Figure 6.

Advanced spatial context analysis and niche‐based subtyping using CELLama. a) Illustration of the process of integrating additional metadata about the top nearest cell types into sentence generation for ST data. As CELLama embedding could flexibly use various information for sentence generation, spatial context in ST data could be leveraged for niche analysis in the microenvironment. b) UMAP visualization showing distinct subtyping even in the same cell types based on the CELLama embeddings with spatial context. This embedding results demonstrated the ability of CELLama to discriminate subpopulations with niche information. c) Focused analysis on a subset of cells, specifically fibroblasts in the ST data of lung cancer, depicting their clustering based on niche‐informed CELLama embeddings. d) Spatial distribution map of fibroblast subtypes, highlighting the spatial relationships and clustering patterns in relation to other cell types of fibroblasts within the tumor microenvironment. e) Characterization of fibroblast subclusters by the average abundance of nearby niche cells, providing insights into the cellular microenvironment of fibroblasts in different subclusters. f) Marker genes of fibroblast subclusters using niche information from CELLama embeddings. These differentially expressed genes according to the spatial context could provide a novel approach to understanding spatial patterns within disease tissue.
3. Discussion
Large‐scale scRNA‐seq and ST have shifted biomedical research paradigms toward data‐driven approaches.[ 4 ] Experiments on diseased tissues or experimental samples now routinely generate transcriptomic data from hundreds of thousands of cells, necessitating a universal analytic workflow that leverages the power of recently developed AI algorithms.[ 32 , 33 ] The development and application of CELLama have demonstrated significant strides in addressing the complexities of interpreting large‐scale scRNA‐seq and ST data. By harnessing a language model framework, CELLama effectively embedded cellular data into a common data space, which not only streamlines the cell typing process across various tissue samples without reliance on manual reference data but also introduces a novel approach for integrating cellular metadata including spatial context. This integration allows for more nuanced analysis and understanding of cellular behaviors and interactions within their respective microenvironments. The flexibility of cell embedding, leveraging NLP techniques, highlights the advantages and scalability of recent transformer‐based models in biomedical fields.[ 34 , 35 , 36 , 37 ] These models benefit from long‐term memory capabilities inherent in neural networks and can handle variable token lengths.[ 38 ] Previous studies, including C2S and its scaled variant C2S‐Scale[ 18 , 39 ]—which both rely on top‐expressing gene lists—as well as scMulan, scGPT, and Geneformer,[ 14 , 15 , 40 ] assumed that general‐purpose language models would treat gene symbols as “out‐of‐vocabulary words” and thus fail to capture biological patterns. To address this, they introduced additional fine‐tuning steps or even trained models from scratch using cell‐based data. In contrast, CELLama demonstrates that massively pretrained sentence transformers already possess sufficient biological intuition to perform cell typing and a variety of downstream bioinformatics tasks—provided that the gene input size is appropriately constrained. This feature allows for the creation of flexible sentences from single‐cell and spatial data, facilitating their embedding into a common space. Such capabilities are particularly advantageous in the development of CELLama, enabling adaptable cell embedding that considers both gene expression data and various metadata.
A key distinction between CELLama and existing foundation model approaches such as scGPT lies in the explicit transformation of cellular data into a structured language format prior to embedding.[ 14 , 15 ] While other methods such as scGPT and Geneformer operates by directly encoding gene expression values into an embedding space, CELLama constructs natural language‐based representations of cells, incorporating both gene expression ranks and metadata descriptors. This approach enables several advantages: 1) CELLama facilitates niche analysis, allowing spatial and contextual information to be seamlessly integrated into the embedding process, 2) it enables embeddings considering multiple organs or conditions, ensuring diverse biological contexts during embedding process, and (3) its zero‐shot capabilities allow for flexible reference mapping without dataset‐specific retraining. These features make CELLama particularly well‐suited for ST, cross‐dataset integration, and general‐purpose single‐cell analysis, distinguishing it from previous transformer‐based approaches. In addition, CELLama embedding‐based approach differs fundamentally from direct usage of large language models, such as ChatGPT, for annotation methods, which infer cell types from a predefined list of marker genes.[ 41 ] While LLM can classify cells through text‐based queries, it lacks the ability to embed cells into a structured latent space, making it less suited for large‐scale dataset integration, zero‐shot annotation, and ST applications. In contrast, CELLama provides an adaptable embedding framework, allowing seamless generalization across diverse single‐cell datasets without requiring manual marker selection, reinforcing its role as a foundation model for single‐cell and ST.
The findings from this study highlight the potential to analyze large‐scale cell data using NLP methodologies, thereby enhancing the precision of cell type identification and enabling the integration of single‐cell and ST with atlas‐level data for more flexible applications. The employment of sentence transformers to embed cell data, incorporating both gene expression and metadata, deepens our understanding of cell subtypes in relation to their biological and environmental contexts without the need for tailored analytic workflows. Additionally, the use of NLP frameworks based on the transformer architecture offers the opportunity to fine‐tune the model, thereby enhancing performance by optimizing cell data embedding relative to a specific reference dataset. For example, our results demonstrate that CELLama, when fine‐tuned with generated sentence pairs from COVID‐19 cellular data as a reference, surpasses other foundational models like scGPT in terms of accuracy, precision, and recall, especially when metadata is included. This indicates that the design of CELLama is particularly effective at adapting to the complex nuances and variations present within biological datasets.
Our investigation into the spatial mapping capabilities of CELLama underscores its potential in ST, a field that often struggles with limited gene panels and challenges in accurately mapping cell types to their spatial coordinates.[ 42 , 43 ] In addition, the application of CELLama to incorporate niche cell type information and generate context‐aware embeddings presents a promising path for detailed spatial analysis.[ 44 , 45 , 46 ] The relevance of spatial location is crucial for understanding the microenvironment of diseased tissues, and insights into cellular spatial interactions can add a new dimension to the pathophysiology of various diseases.[ 32 ] Leveraging text data derived from cellular spatial features, CELLama has identified niche cell type‐related patterns in ST data, providing unique insights into the tissue microenvironment. Such analyses can reveal the spatial organization of cell types within tissues, potentially offering deeper insights into tissue architecture and the cellular interactions that govern biological functions and disease progression.
Another significant advantage of the cell embedding by common embedding based on CELLama is its ability to provide distances from the cellular data, which can indicate uncertainty or out‐of‐distribution measures from the reference cell data,[ 47 ] as introduced in Figure 3e. This capability can be applied to various purposes, such as creating ‘spatio‐temporal trajectories’ if it is hypothesized that the distance increases with advancing pathophysiological processes. For example, as shown in Figure S5 (Supporting Information), lung cancer ST data were mapped against a lung atlas excluding tumor tissue cells, using normal lung tissue cells. The CELLama embedding identified the nearest cell types, revealing that most cancerous epithelial cells were mapped to alveolar epithelium from the normal lung atlas data, although they were distinguished in the UMAP (Figure S7a,b, Supporting Information). The distance maps from the normal lung cell atlas, measured by the nearest cells, illustrate how the spatial data of lung tumor tissues diverge from the normal lung cell atlas[ 48 ] (Figure S7c,d, Supporting Information). While this type of analysis requires further exploration to comprehend its significance and potential clinical applications, it represents a viable method for understanding how disease tissues and their cellular and microenvironmental contexts evolve from normal cell atlas through the common embedding methods offered by CELLama.
Despite the various applications of CELLama utilizing NLP processes for universal cell data embedding, several challenges persist. While CELLama demonstrates considerable potential in managing diverse and complex datasets, keeping quantitative gene expression may capture more diverse features for small‐sample embeddings within the same batch as there is a loss of information by using only top‐ranked genes. Its performance and accuracy depend on key parameters such as top‐k, the number of genes used to generate sentences, and the selection of metadata for sentence generation. However, this is the only parameter that can be modified when we directly used publicly available sentence transformer models. Based on the benchmarking results from Figure 2, we adopted top‐k = 20 for human pancreas and ST datasets, as it provided an optimal balance between embedding resolution and performance. While dataset‐specific variations exist, our findings suggest that a range of 16‐24 genes is generally suitable for maintaining annotation accuracy across diverse single‐cell datasets. More specifically, the selection of genes and top‐k values can be particularly sensitive in ST that use selected gene panels, which may not adequately represent the complexity needed to effectively detect rare or detailed cell subtypes. Additionally, the interpretation of embeddings and the translation of these findings into biologically meaningful insights necessitate careful consideration and rigorous validation. Importantly, CELLama enables zero‐shot cell type prediction by embedding cells into a shared latent space without requiring task‐specific fine‐tuning, leveraging reference data only for post‐hoc similarity‐based mapping rather than supervised training. Future efforts will concentrate on refining CELLama to improve its sensitivity to subtle variations in gene expression and to broaden its applicability across a diverse range of conditions and tissue types. Moreover, extensive re‐training with large‐scale data, as exemplified by the fine‐tuning methods described in Figure 4, will further enhance overall performance. Of note, while CELLama serves as a foundation model for single‐cell and ST by providing a universal embedding framework, it differs from generative models in its architectural design. Specifically, CELLama does not include a decoding or generative layer, making it less applicable to tasks such as perturbation prediction or gene regulatory network inference. Instead, its strength lies in metadata‐driven flexible cell embeddings, enabling applications such as niche analysis in ST and cell typing with the atlas‐level single reference. This distinction highlights the diversity of transformer‐based approaches in single‐cell analysis, with different models tailored for specific analytical objectives. Finally, as shown in Figure 1c, cells tend to cluster according to metadata, and incorrectly assigned metadata can therefore lead to misleading groupings as a “stress test” (Figure S8a, Supporting Information). Nevertheless, this effect is not sufficiently strong to obscure underlying gene expression variation, and it can be mitigated by excluding outliers identified on the basis of expression profiles. In addition, when metadata fields contain missing values, the embedding may produce an artificial cluster corresponding to the “missing” category (Figure S8b, Supporting Information). This highlights the importance of applying stringent quality‐control procedures to ensure metadata integrity when using CELLama with metadata.
By leveraging the abilities of language models, CELLama has demonstrated its flexibility and scalability in embedding cell data, even within large‐scale, atlas‐level datasets integrated with specific scRNA‐seq or ST data. As we continue to explore and expand the capabilities of CELLama, the integration of computational biology with machine learning is poised to radically transform our understanding of the cellular landscape, ultimately facilitating the development of novel therapeutic targets.
4. Experimental Section
Collection of scRNA‐seq and ST Datasets
To support the development and validation of CELLama, we meticulously collected and curated several datasets across diverse platforms and biological conditions. These datasets encompass a broad range of cell types and conditions, providing a rich resource for testing and refining our model.
The single‐cell RNA sequencing datasets were sourced from public repositories. First, Tabula Sapiens dataset was utilized for a normal, multi‐tissue reference atlas for CELLama.[ 10 ] A subsampled version of the Tabula Sapiens dataset was made, accessed from Tabula Sapiens Portal, reducing it to 10% of its original size for experiment. In addition, donor 1 (TSP1) data was isolated as a test set to evaluate the performance of CELLama in cell typing as well as integration.
As another dataset, a COVID‐19 dataset was employed to demonstrate the capability in zero‐shot reference mapping scenarios as well as finetuing for specific purpose and specific organ data.[ 24 ] Derived from the previous study, it includes data from 18 distinct batches, reflecting the heterogeneous nature of lung tissue responses to COVID‐19. The reference set consists of 15,997 cells, and the query set includes 4,003 cells.
Human pancreas data across multiple platforms, detailed in the study by Luecken et al., include six different technologies (CEL‐seq, CEL‐seq2, Smart‐seq2, inDrop, Fluidigm C1, and SMARTER‐seq), providing a robust framework for data integration and cell type mapping under various platforms.[ 25 ] The data consists of 16,382 cells and 18,771 genes and was structured as an AnnData object by previous research, single‐cell integration benchmarking.
As another detailed atlas for human lung data, human lung cell atlas dataset was also used.[ 48 ] This comprehensive dataset of lung cells obtained from Human Lung Cell Atlas Portal, providing extensive coverage of cell types within the human lung and provided various level cell types.
ST datasets were specifically chosen to test the CELLama to integrate spatial metadata and gene expression data, particularly in cellular resolution image‐based ST data. The dataset was downloaded from 10x Xenium example dataset. This dataset was obtained by multi‐tissue panel of Xenium and provided tissue‐specific expression patterns in the context of lung adenocarcinoma. This dataset contains 162,254 cells, offering a detailed view of cellular heterogeneity and spatial distribution within the tissue. For cell type mapping, we utilized a scRNA‐seq dataset from primary lung tumors (GSE131907),[ 26 ] which comprises 45,149 cells. This data was integrated with Xenium lung cancer data for comprehensive analysis.
Data Processing and Annotation
All datasets were processed to Hierarchical Data Format (HDF5) and read by Scanpy of python.[ 17 ] The primary data format used was through the AnnData object structure, which is suited for handling large‐scale single‐cell data efficiently. Metadata such as cell type, tissue origin, and experimental conditions were stored in “AnnData.obs” to flexibly read each dataset. As a gold standard for cell type annotation, we used the cell types labeled by the original authors.
Data Preprocessing and Sentence Generation
The initial phase of CELLama involved transforming gene expression data from individual cells into a structured textual format, leveraging their gene expression profiles. For each cell, gene expression data was extracted, and the genes were ranked based on their expression levels. The top‐k genes were determined by each cell. This subset of genes forms the basis of the sentence construction, mimicking natural language patterns by listing these genes in a descending order of expression. For example, “Top genes are (gene list)”.
Additionally, metadata associated with each cell, such as tissue origin, disease state, or experimental conditions, were integrated into these sentences. This metadata enrichment allowed for a more detailed representation of each cell, capturing not only its molecular profile but also its context. For instance, if tissue type or disease status is available, the sentence might be extended to include these details, formatted as “Top genes are (gene list). (feature) of this cell is (value).” These feature values were incorporated by adding structured phrases to the generated sentences, such as “[tissue type] of this cell is [value]” or “[disease status] of this cell is [value]”, following the ranked gene expression description. This allowed for context‐aware embeddings by integrating biological metadata directly into the textual representation of each cell. In practice, “[tissue type]” corresponds to a metadata category, which can be extracted from the relevant metadata column in a single‐cell dataset. For instance, in a Scanpy/AnnData object (adata), the appropriate metadata can be accessed via adata.obs.columns, where each column represents an annotation variable (e.g., tissue type, disease condition, or sample origin). The corresponding “[value]” is then dynamically assigned based on the specific metadata entry for each cell, ensuring that embeddings incorporate biologically meaningful contextual information.
Embedding Generation
Using the generated sentences, each cell was embedded into a high‐dimensional space using a pretrained sentence transformer model. This model was selected from publicly available, general‐purpose NLP transformers, with the option to be fine‐tuned using sentences generated from cell data. The embedding process converts natural language representations of cells into numerical vectors, capturing semantic relationships based on gene expression and contextual metadata. For this study, we primarily used “all‐MiniLM‐L12‐v2”, a compact and efficient sentence transformer model optimized for short text inputs (available at Huggingface). After obtaining the embedding vectors, dimensionality reduction was applied using UMAP to facilitate visualization and further downstream analyses. CELLama does not utilize a fully connected layer for direct classification. As a downstream analysis, cell type annotations were determined by performing a nearest‐neighbor search in the embedding space, described in the next section. This approach allows CELLama to function in a zero‐shot manner, generalizing across diverse datasets without explicit classification training.
Integration and Nearest Neighbor Search and Cell Typing
For experiments involving multiple datasets, an integration step was simply performed by concatenating embedded data after CELLama‐embedding. Because sentences were generated using the same gene sets across datasets and only rank data were utilized to describe cells, CELLama inherently co‐embedded the data, thereby minimizing batch effects that typically arise from variations in transcript count scales. This integrated embedding allowed for the direct comparison and cell type maps based on the reference single cell atlas. With the embedded data, a nearest neighbor search was used to transfer cell types by comparing embedded vectors of test cells against a reference dataset. This approach used a K‐nearest neighbor algorithm with cosine similarity to identify the most similar cells within the reference set. The cell types from the nearest neighbors were then assigned to the new cells.
Accuracy Measurement for Cell typing and Comparison with Other Methods
To assess the performance of CELLama in mapping cell types, we utilized standard evaluation metrics such as accuracy, precision, and recall, calculated through the scikit‐learn module. These performance metrics were directly compared to those of scGPT and Geneformer as a different foundation model specialized for scRNA‐seq data. For the comparison, CELLama, scGPT, and Geneformer were applied to the same datasets, Tabula Sapiens and COVID‐19 described above, where embeddings were generated for each cell within the reference and query datasets. Annotations were then transferred from reference to query, enabling a comprehensive evaluation of transferring cell types under the same experimental conditions. Additionally, SingleR, a conventional automated reference‐based cell typing method, was included in the comparison.[ 19 ] The same datasets were used in SingleR to map cell types in the Tabula Sapiens dataset, allowing for a direct comparison of whether foundation models outperformed this conventional unbiased automated cell typing approach.
Finetuning of CELLama by Generated Sentences
The finetuning process based on a specific cell dataset could be performed to better align with the specific linguistic patterns derived from cell data. Finetuning of CELLama leveraged the inherent capability of CELLama to generate descriptive sentences that capture the gene expression profiles and metadata characteristics of single cells, forming a natural language depiction of cellular states. First, a given single cell dataset was used to generate sentences and compute their similarities for finetuning the sentence transformer.[ 16 ] More specifically, the high‐dimensional transcriptomic data were processed by applying normalization and logarithmic transformation to manage the scale of expression counts, followed by the selection of highly variable genes as conventional processing approaches.[ 17 ] These genes were then used to generate sentences that succinctly described each cell based on their top expressed genes and any additional observational features, such as tissue type or experimental condition, when available.
For the finetuning process, sentences were paired based on their cosine similarity, which quantified the similarity between cell profiles and served as the training label. To provide robust learning based on similar and non‐similar sentences, random sentence pairs were generated, enhancing the ability to differentiate and accurately embed varying degrees of cellular similarity. Given the limited number of “similar” cells in extensive single‐cell datasets, the training set included sentence pairs with high similarity with specific proportions. These pairs were selected from sentences that exhibited closely matched gene expression profiles, as determined through PCA analysis, while also aligning on specified metadata features to ensure biological relevance. If the metadata for cell pairs differed, the similarity label was set to zero in the training sets to distinguish between distinct biological conditions.
| (1) |
| (2) |
Once the training data consisting of sentence pairs and their corresponding similarity scores is compiled, the CELLama model, based on the sentence transformer, was fine‐tuned using this dataset. The training process involved adjusting the embeddings to minimize the cosine distance for similar pairs while maximizing it for dissimilar ones, guided by a loss function specifically designed for this purpose. Upon completion of the training phase, the fine‐tuned model was applied as an embedder for cell data with gene expression profiles.
CELLama Analysis with Spatial Context
The CELLama framework was extended to analyze spatial context. Image‐based ST data such as Xenium was used to test whether CELLama could extract spatial context of cells. Following data preparation as aforementioned, Xenium data from lung adenocarcinoma was prepared, and CELLama was applied to annotate cell types in the ST data by referencing the annotated lung cancer tumor scRNA‐seq data (GSE131907). Cell type annotation for Xenium dataset was based on CELLama using the sentence transformer with “all‐MiniLM‐L12‐v2” model. The results were visualized using spatial plots, highlighting the distribution and localization of annotated cell types across the tissue sample. In addition, for the comparison with other cell type mapping specialized for image‐based ST data, TACCO algorithm with default hyperparameters was used for cell type transfer from the reference scRNA‐seq data. We also compared the cell type mapping methods with Tangram[ 29 ] and manual labeling. For further details, Tangram was applied following the publicly available implementation, using the same reference scRNA‐seq data as TACCO.
For manual cell labeling, we first performed clustering using Scanpy with Leiden clustering. A resolution of 4.0 was applied to ensure detailed cell‐type resolution, and cluster annotations were assigned based on canonical marker gene expression. Each cluster was manually labeled as one of the following cell types: B lymphocytes, Endothelial cells, Epithelial cells, Fibroblasts, Mast cells, Myeloid cells, NK cells, and T lymphocytes. The following marker genes were used for cell type identification:
B lymphocytes: MS4A1, CD79A, CD79B; Endothelial cells: PECAM1, CDH5, VWF; Epithelial cells: EPCAM, KRT19, KRT8; Fibroblasts: COL1A1, COL3A1, PDGFRA; Mast cells: TPSAB1, KIT; Myeloid cells: LYZ, CD14, ITGAM; NK cells: NKG7, GNLY, KLRD1; T lymphocytes: CD3E, CD2, CD3G
Clusters were assigned based on the highest expression levels of these markers, ensuring robust and biologically meaningful classification of cell types.
To include the spatial context in CELLama framework, we conducted niche analysis, which focuses on the cellular microenvironment. This involved identifying and ordering neighboring cells for a given cell based on their proximity. For this purpose, the nearest neighbors were identified for each cell, and this information was incorporated into CELLama embedding process, enriching the data with spatial relational context. More specifically, during sentence generation, neighborhood cells for a given cell were added to the sentence. The enriched embeddings allowed for a detailed sub‐cluster analysis within a specific cell type population, for instance, fibroblast in lung cancer. This analysis identified niche‐specific subtypes of fibroblasts based on their spatial relations and proximity to other cell types. This analysis was complemented by clustering algorithms and UMAP visualization to discern distinct spatial patterns and interactions among fibroblasts. The clustering results were further analyzed by calculating differential expression genes among these spatial context‐based subclusters, providing insights into the molecular basis of the observed spatial patterns.
Statistics
CELLama generates cell‐wise embeddings by feeding both the top‐k expressing genes and associated metadata (if any) into a sentence‐embedder. Gene expression values were normalized using `sc.pp.normalize_total` followed by `sc.pp.log1p` in Python Scanpy (with default parameters). For benchmarking CELLama performance, we used scGPT (whole‐human), Geneformer (Geneformer‐V2‐104 M), SingleR (2.10.0), C2S (160m), AIDO.Cell (3m), UCE (4‐layer), TranscriptFormer (TF‐Sapiens), and scVI, all built under a GPU environment with PyTorch 2.3.1 + CUDA 12.1. SingleR did not have a separate embedding step, while for the other methods we reported using the post‐embedding label prediction strategy that yielded higher performance: embedding + KNN (i.e., CELLama, C2S), embedding + logistic regression (i.e., scGPT, Geneformer, AIDO.Cell, UCE, TranscriptFormer), and scVI pretraining → scANVI transfer (i.e., scVI). The raw benchmarking data shown in Figure 2 and Figure S2 (Supporting Information) can be found in Tables S1 and S3 (Supporting Information), respectively.
Conflict of Interest
H.C. and D.S.L. are co‐founders of Portrai, Inc. Other authors declare no conflict of interest.
Author Contributions
H.C. designed the study and primarily developed algorithm. J.P. and D.L. contributed to concept of the algorithm. J.P., S.K., D.L., J.K., S.B., and H.S. contributed to application and the analysis of computational experiments. J.P. and H.C. drafted the manuscript and all authors contributed to revision of the work. All authors contributed to the interpretation of the data and wrote the paper.
Supporting information
Supporting Information
Supplemental Tables 1‐3
Acknowledgements
This research was supported by the National Research Foundation of Korea (2023R1A2C2006636 and 2022M3A9D3016848), the NAVER Digital Bio Innovation Research Fund, funded by the NAVER Corporation (Grant No. 3720230020), a grant from the Korea Technology and Information Promotion Agency (RS‐2023‐00304101), and Korea Health Industry Development Institute (KHIDI) funded by the Ministry of Health & Welfare, Republic of Korea (RS‐2025‐02223280).
Park J., Kim S., Kim J., et al. “CELLama: Foundation Model for Single Cell and Spatial Transcriptomics by Cell Embedding Leveraging Language Model Abilities.” Adv. Sci. 13, no. 7 (2026): e13210. 10.1002/advs.202513210
Data Availability Statement
The data that support the findings of this study are available on request from the corresponding author. The data are not publicly available due to privacy or ethical restrictions.
References
- 1. Baysoy A., Bai Z., Satija R., Fan R., Nat. Rev. Mol. Cell Biol. 2023, 24, 695. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2. Rood J. E., Maartens A., Hupalowska A., Teichmann S. A., Regev A., Nat. Med. 2022, 28, 2486. [DOI] [PubMed] [Google Scholar]
- 3. Quake S. R., Trends Genet. 2022, 38, 805. [DOI] [PubMed] [Google Scholar]
- 4. Rao A., Barkley D., França G. S., Yanai I., Nature 2021, 596, 211. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5. Zhang L., Chen D., Song D., Liu X., Zhang Y., Xu X., Wang X., Signal Transduction Targeted Ther. 2022, 7, 111. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6. Heath J. R., Ribas A., Mischel P. S., Nat. Rev. Drug Discovery 2016, 15, 204. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7. Lähnemann D., Köster J., Szczurek E., McCarthy D. J., Hicks S. C., Robinson M. D., Vallejos C. A., Campbell K. R., Beerenwinkel N., Mahfouz A., Genome Biol. 2020, 21, 1. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8. Argelaguet R., Cuomo A. S., Stegle O., Marioni J. C., Nat. Biotechnol. 2021, 39, 1202. [DOI] [PubMed] [Google Scholar]
- 9. Fang S., Chen B., Zhang Y., Sun H., Liu L., Liu S., Li Y., Xu X., Genomics, Proteomics Bioinformatics 2023, 21, 24. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10. Consortium T. T. S., Jones R. C., Karkanias J., Krasnow M. A., Pisco A. O., Quake S. R., Salzman J., Yosef N., Bulthaup B., Brown P., Science 2022, 376, abl4896. [Google Scholar]
- 11. Osumi‐Sutherland D., Xu C., Keays M., Levine A. P., Kharchenko P. V., Regev A., Lein E., Teichmann S. A., Nat. Cell Biol. 2021, 23, 1129. [DOI] [PubMed] [Google Scholar]
- 12. Biology C. S.‐C., Abdulla S., Aevermann B., Assis P., Badajoz S., Bell S. M., Bezzi E., Cakir B., Chaffer J., Chambers S., bioRxiv 2023, 10.1101/2023.10.30.563174. [DOI] [Google Scholar]
- 13. Vaswani A., Shazeer N., Parmar N., Uszkoreit J., Jones L., Gomez A. N., Kaiser Ł., Polosukhin I., Adv. Neural Inform. Processing Syst. 2017, 30. [Google Scholar]
- 14. Cui H., Wang C., Maan H., Pang K., Luo F., Duan N., Wang B., Nat. Methods 2024, 21, 1470. [DOI] [PubMed] [Google Scholar]
- 15. Theodoris C. V., Xiao L., Chopra A., Chaffin M. D., Al Sayed Z. R., Hill M. C., Mantineo H., Brydon E. M., Zeng Z., Liu X. S., Nature 2023, 618, 616. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16. Reimers N., Gurevych I., arXiv 2019, 10.48550/arXiv.1908.10084. [DOI] [Google Scholar]
- 17. Wolf F. A., Angerer P., Theis F. J., Genome Biol. 2018, 19, 15. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18. Levine D., Rizvi S. A., Lévy S., Pallikkavaliyaveetil N., Zhang D., Chen X., Ghadermarzi S., Wu R., Zheng Z., Vrkic I., BioRxiv 2024, 557287. [Google Scholar]
- 19. Aran D., Looney A. P., Liu L., Wu E., Fong V., Hsu A., Chak S., Naikawadi R. P., Wolters P. J., Abate A. R., Nat. Immunol. 2019, 20, 163. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20. Ho N., Ellington C. N., Hou J., Addagudi S., Mo S., Tao T., Li D., Zhuang Y., Wang H., Cheng X., bioRxiv 2024, 625303. [Google Scholar]
- 21. Rosen Y., Roohani Y., Agarwal A., Samotorčan L., Consortium T. S., Quake S. R., Leskovec J., bioRxiv 2023, 10.1101/2023.11.28.568918. [DOI] [Google Scholar]
- 22. Pearce J. D., Simmonds S. E., Mahmoudabadi G., Krishnan L., Palla G., Istrate A.‐M., Tarashansky A., Nelson B., Valenzuela O., Li D., bioRxiv 2025, 650731. [Google Scholar]
- 23. Lopez R., Regier J., Cole M. B., Jordan M. I., Yosef N., Nat. Methods 2018, 15, 1053. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24. Lotfollahi M., Naghipourfar M., Luecken M. D., Khajavi M., Büttner M., Wagenstetter M., Avsec Ž., Gayoso A., Yosef N., Interlandi M., Nat. Biotechnol. 2022, 40, 121. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25. Luecken M. D., Büttner M., Chaichoompu K., Danese A., Interlandi M., Müller M. F., Strobl D. C., Zappia L., Dugas M., Colomé‐Tatché M., Nat. Methods 2022, 19, 41. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26. Kim N., Kim H. K., Lee K., Hong Y., Cho J. H., Choi J. W., Lee J.‐I., Suh Y.‐L., Ku B. M., Eum H. H., Nat. Commun. 2020, 11, 2285. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27. Janesick A., Shelansky R., Gottscho A. D., Wagner F., Williams S. R., Rouault M., Beliakoff G., Morrison C. A., Oliveira M. F., Sicherman J. T., Nat. Commun. 2023, 14, 8353. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28. Mages S., Moriel N., Avraham‐Davidi I., Murray E., Watter J., Chen F., Rozenblatt‐Rosen O., Klughammer J., Regev A., Nitzan M., Nat. Biotechnol. 2023, 41, 1465. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29. Biancalani T., Scalia G., Buffoni L., Avasthi R., Lu Z., Sanger A., Tokcan N., Vanderburg C. R., Segerstolpe Å., Zhang M., Nat. Methods 2021, 18, 1352. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30. Pawar N. M., Rao P., Cellular Signal. 2018, 45, 63. [DOI] [PubMed] [Google Scholar]
- 31. Andersen M. K., Krossa S., Midtbust E., Pedersen C. A., Wess M., Høiem T. S., Viset T., Størkersen Ø., Nervik I., Sandsmark E., Commun. Biol. 2024, 7, 1462. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32. Longo S. K., Guo M. G., Ji A. L., Khavari P. A., Nat. Rev. Genet. 2021, 22, 627. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 33. Heumos L., Schaar A. C., Lance C., Litinetskaya A., Drost F., Zappia L., Lücken M. D., Strobl D. C., Henao J., Curion F., Nat. Rev. Genet. 2023, 24, 550. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34. Zhang S., Fan R., Liu Y., Chen S., Liu Q., Zeng W., Bioinformat. Adv. 2023, 3, vbad001. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35. Luo R., Sun L., Xia Y., Qin T., Zhang S., Poon H., Liu T.‐Y., Briefings Bioinforma. 2022, 23, bbac409. [DOI] [PubMed] [Google Scholar]
- 36. Jin Q., Kim W., Chen Q., Comeau D. C., Yeganova L., Wilbur W. J., Lu Z., Bioinformatics 2023, 39, btad651. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 37. Lee J., Yoon W., Kim S., Kim D., Kim S., So C. H., Kang J., Bioinformatics 2020, 36, 1234. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38. Vig J., Belinkov Y., arXiv 2019, 10.48550/arXiv.1906.04284. [DOI] [Google Scholar]
- 39. Rizvi S. A., Levine D., Patel A., Zhang S., Wang E., He S., Zhang D., Tang C., Lyu Z., Darji R., bioRxiv 2025, 648850. [Google Scholar]
- 40. Bian H., Chen Y., Dong X., Li C., Hao M., Chen S., Hu J., Sun M., Wei L., Zhang X., in Int. Conf. on Research in Computational Molecular Biology , Springer, Cham: 2024, 479–482. [Google Scholar]
- 41. Hou W., Ji Z., Nat. Methods 2024, 21, 1462. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 42. Smith K. D., Prince D. K., MacDonald J. W., Bammler T. K., Akilesh S., Glomerular Diseases 2024, 4, 49. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 43. Zhang Y., Petukhov V., Biederstedt E., Que R., Zhang K., Kharchenko P. V., Genome Biol. 2024, 25, 35. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 44. Baccin C., Al‐Sabah J., Velten L., Helbling P. M., Grünschläger F., Hernández‐Malmierca P., Nombela‐Arrieta C., Steinmetz L. M., Trumpp A., Haas S., Nat. Cell Biol. 2020, 22, 38. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 45. Ren Y., Huang Z., Zhou L., Xiao P., Song J., He P., Xie C., Zhou R., Li M., Dong X., Nat. Commun. 2023, 14, 1028. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 46. Mason K., Sathe A., Hess P. R., Rong J., Wu C.‐Y., Furth E., Susztak K., Levinsohn J., Ji H. P., Zhang N., Genome Biol. 2024, 25, 14. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 47. Cao Z.‐J., Wei L., Lu S., Yang D.‐C., Gao G., Nat. Commun. 2020, 11, 3458. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 48. Travaglini K. J., Nabhan A. N., Penland L., Sinha R., Gillich A., Sit R. V., Chang S., Conley S. D., Mori Y., Seita J., Nature 2020, 587, 619. [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Supporting Information
Supplemental Tables 1‐3
Data Availability Statement
The data that support the findings of this study are available on request from the corresponding author. The data are not publicly available due to privacy or ethical restrictions.
