Abstract
Single-cell sequencing is frequently affected by “omission” due to limitations in sequencing throughput, yet bulk RNA-seq may contain these ostensibly “omitted” cells. Here, we introduce the single cell trajectory blending from Bulk RNA-seq (BulkTrajBlend) algorithm, a component of the OmicVerse suite that leverages a Beta-Variational AutoEncoder for data deconvolution and graph neural networks for the discovery of overlapping communities. This approach effectively interpolates and restores the continuity of “omitted” cells within single-cell RNA sequencing datasets. Furthermore, OmicVerse provides an extensive toolkit for both bulk and single cell RNA-seq analysis, offering seamless access to diverse methodologies, streamlining computational processes, fostering exquisite data visualization, and facilitating the extraction of significant biological insights to advance scientific research.
Subject terms: Computational models, Computational platforms and environments, Bioinformatics
Single cell sequencing often omits certain cell types due to technological constrains. Here the authors report an algorithm called BulkTrajBlend, part of OmicVerse, which fills gaps in single-cell RNA-seq by using Bulk RNAseq data to restore missing cells and improve analysis.
Introduction
Single-cell RNA sequencing (scRNA-seq) and bulk RNA sequencing (RNA-seq) have emerged as essential techniques for exploring cellular heterogeneity, differentiation, and disease mechanisms1–6. These technologies facilitate numerous applications, including converting bulk-seq data into single-seq analyses7, performing differential expression analysis8, pathway enrichment9, gene co-expression network analysis in bulk RNA-seq10, cell annotation11, cell interaction analysis12, cell-trajectory inference13, evaluating cell-state in gene sets, and predicting drug response in scRNA-seq14. Many of these approaches rely on open-source algorithms contributed by the research community15,16.
Nevertheless, the growing diversity and abundance of omics algorithms pose challenges in selecting tools that are accurate, user-friendly, and appropriate for specific analyses. Learning to use various algorithms often leads to computational inefficiencies, as users are required to adapt to various systems. Moreover, for analyses involving low data quantities, researchers commonly employ web servers and the R language17, whereas Python is preferred for processing large-scale datasets18.
Integrating single-cell and bulk sequencing results can be intricate, producing complex, multi-layered data sets that challenge the extraction of meaningful biological insights. A recognized impediment in single-cell sequencing is the “omission”—the omission of certain cell types due to technological constraints on the sequencing platform and interruption of the trajectory of cell differentiation, such as the enzymatic lysis-related loss of podocytes and intercalated cells19. For example, the differentiation from hematopoietic cells (HPC) to podocytes was interrupted, and the filtering-induced absence of neutrophils, cardiomyocytes, neuronal cells, and megakaryocytes and the differentiation from neural intermediate progenitor cells (nIPC) to neurons was interrupted20,21. The BD Rhapsody™ single-cell platform overcomes granulocyte loss by accommodating their natural sedimentation22. Conversely, bulk RNA-seq of whole tissues intrinsically includes these “omitted” cells. It should be acknowledged that there is no existing algorithm that can directly solve the “omitted” cell problem. However, similar to this problem, there are some deconvolution algorithms, such as TAPE23, CIBERSORT (CS)24, MuSiC25, CIBERSORTx (CSx)26, and Bisque27, which are not really effective in solving the “omitted” cell problem because they lack a generative capability. This suggests that Generative Adversarial Networks (GANs) may be the best solution to the “omitted” cell problem.
To address these challenges, we have developed OmicVerse (https://omicverse.readthedocs.io/), a comprehensive Python library designed for transcriptomic research. OmicVerse streamlines access to a spectrum of models and algorithms for bulk-seq and scRNA-seq analyses, improving computational efficiency and visual engagement. Rewritten models and algorithms and integrated different pre-processing options stem from benchmark testing28 (Supplementary Note 1). Moreover, OmicVerse features single cell trajectory blending from Bulk RNA-seq (BulkTrajBlend), a specialized algorithm for addressing “omission” in single-cell data. BulkTrajBlend employs a beta-variational autoencoder and graph neural network-based algorithm to deconvolve single-cell data from bulk RNA-seq, facilitating the identification of “omitted” cells within the reconstructed single-cell landscape.
Results
Design concept of BulkTrajBlend and Benchmarking
The conceptualization of BulkTrajBlend draws upon prior research, proposing that Bulk RNA-seq data is a composite of scRNA-seq data through a nonlinear superposition mechanism29,30. Central to this notion is the implementation of the beta-variational autoencoder (β-VAE), a potent tool for approximating Bulk RNA-seq data to scRNA-seq representation31,32. Integrating the β-VAE enables the construction of an encoder and decoder from single-cell data, traditionally characterized by unconstrained attributes.
BulkTrajBlend advances the foundational structure of autoencoders (AE) and β-VAE. These enhancements involve (1) employing an AE to construct a Bulk RNA-seq generator analogous to real Bulk RNA-seq inspired by TAPE23. We modeled the cellular proportion space of Bulk RNA-seq on the output of the Encoder, the input of the Decoder. Subsequently utilizing ground truth bulk RNA-seq generated from single cell RNA-seq as input of Encoder for calculating the true cellular fractions. (2) When we trained β-VAE using real single cell RNA-seq, the Encoder outputs were V (cell type fraction) and W (cell type correlated generative factor). We added a loss function to minimize the relationship between V and the real cell type fraction. We obtained W for each cell at the end of model training and averaged W for each cell type to represent that cell type. (3) We used the true cell type fraction V calculated by AE with the cell type-associated generating factor W obtained by β-VAE as input to β-VAE for generating single-cell data, and deploying unsupervised clustering to denoise and refine the outcomes of the β-VAE. (4) We employed a graph neural network (GNN) to sample the generated single-cell data, thereby identifying overlapping cell communities. Sampling the overlapping communities of cells helps us to insert “omitted” cells without losing cell continuity.
The methodology based on β-VAE approximates the joint distribution of data and latent generating factors by estimating the probability distribution relative to the true posterior Here, denotes gene expression data, and characterizes the normally distributed parameters of post-sampling. It is noteworthy that this approximation introduces a level of noise and bias into the generated data (Supplementary Fig. 1e). Consequently, unsupervised clustering is employed as a data refinement strategy to mitigate the impact of noise and enhance data robustness. We use unsupervised clustering to filter out “noisy” cells, which are identified using community size.
Another notable limitation of β-VAE lies in the unconstrained nature of the decoder’s output. This contrasts with the real Bulk environment, where the cellular ratios are not strictly fixed. To address this discrepancy, a simulated Bulk environment is constructed through the sampling of single-cell data, with the procedural details outlined in the “Methods” section. This process is facilitated by a deep neural network (DNN)-based autoencoder model, where the simulated Bulk serves as input, the encoder’s output reflects the proportions of actual cells, and the simulated Bulk constitutes the decoder’s output. Mean absolute error (MAE) is used as the evaluation metric for both the encoder and decoder. Subsequent to model convergence, the real Bulk data is utilized as input for the AE model, with the critical requirement being the alignment of the generation, based on the best-pretrained decoder, with the real Bulk data. At this point, the cell proportions output by the encoder accurately reflect the cell proportions of the actual Bulk (Fig. 1a).
Given that BulkTrajBlend’s primary objective is to interpolate data from original scRNA-seq data, the focus shifts to the targeted extraction of cells from the generated single-cell data. Considering the inherent challenges associated with cell annotation, the input single-cell data containing diverse cell types is expected to exhibit overlaps in real-world scenarios. The “omitted” cells we need to recover should maintain the continuous state of the cells, the traditional community discovery algorithms cannot identify the overlapping cell communities. Cells generated by β-VAE are directly restored to the original single-cell data, which will lose the continuous state of the cell. To solve this problem, we introduce NOCD, a GNN-based algorithm for identifying overlapping communities that achieves the best performance among existing baselines33. Utilizing NOCD enables the identification of overlapping cell communities. We also use the “omitted” cell in the overlapping community state as the target cells for recovery. This insight is crucial for the subsequent task of recovering and reconstructing cell differentiation trajectories within the single-cell sequencing data (Fig. 1b).
To assess the efficacy and accuracy of BulkTrajBlend in the context of cell differentiation trajectory recovery, a rigorous benchmarking exercise is undertaken. The VAE module within BulkTrajBlend is systematically compared against alternative generative models, including conditional generative adversarial networks (CGAN)34 and auxiliary conditional GANs (ACGAN)35. This benchmarking exercise involves assessing a range of performance metrics for generating scRNA-seq features and trajectory inference features. These metrics include the correlation of cell-type marker gene expression, marker gene similarity (quantified via cosine similarity), probability of trajectory conversion post-interpolation, and the degree of data variability following interpolation. Notably, the findings consistently demonstrate BulkTrajBlend’s superior performance, characterized by heightened correlations in marker gene expression, marker gene similarity, trajectory conversion probabilities, and minimal post-interpolation data variability in the generated single-cell data (Fig. 1c–f, Supplementary Note 2, Supplementary Figs. 2–4).
Impact of varied hyperparameters on interpolation performance in BulkTrajBlend
This study explores the effect of varying hyperparameter settings on the performance of BulkTrajBlend, a tool reconstruction OPC trajectories in the Dentate gyrus dataset and interpolating Basophil within the HPC dataset. We analyzed the impact of hyperparameter variations by examining five key factors: (1) the number of interpolated cells, (2) the correlation of marker gene expression between interpolated and actual cells, (3) marker gene similarity, (4) transition probabilities following interpolation, and (5) the prevalence of noise clusters.
Initially, the effect of changing the size of the input single-cell data, ranging from 1000 to 20,000 cells, was investigated. An increase in data size resulted in higher correlations of marker gene expression and improved single-cell similarity as performed by BulkTrajBlend (Fig. 2a, b). The transition probabilities, however, were only slightly better (Fig. 2c). Notably, an inverse relationship was found between the saturation of cell numbers and the frequency of noise clusters (Fig. 2d).
Next, the effect of interpolation size was examined, with sizes ranging from 1 to 10 times the original number of target “omitted” cells. Marker gene correlation and single-cell similarity improved significantly within the 1-4x interpolation range, outperforming the 6-10x range. Conversely, larger interpolation sizes were correlated with a notable increase in noise clusters (Fig. 2e–h).
Contrary to expectations, a detailed analysis of the number of neurons in BulkTrajBlend’s hidden layer, with a range from 64 to 1024, revealed that a hidden layer with only 64 neurons exhibited the highest marker gene correlation, similarity, and transition probability for interpolated single cells, while also reducing noise cluster occurrences (Fig. 2i–l).
In conclusion, the ideal hyperparameter setting involves using the entire single-cell dataset, interpolating at a scale of 2x or 4x, and configuring a hidden layer with 64 neurons. Under these optimal hyperparameters, BulkTrajBlend effectively reconstructs the nIPC-OPC developmental flow pattern in dentate gyrus datasets and the HSC-Basophil flow pattern in hematopoietic system development datasets (Fig. 2m,n). It is important to note that using the full single-cell dataset improves accuracy, which also significantly increases computational demands (Fig. 2o).
Proficient reconstruction of cell developmental trajectories in simulated “omission” single-cell profiles
Our study extended beyond evaluating BulkTrajBlend’s ability to reconstruct developmental trajectories in real datasets, by also examining its performance within simulated datasets. We crafted three simulated datasets with specific “omission”: the first omitted a subset of Ngn3High endocrine progenitor-precursor (Ngn3High EP) cells in mouse pancreas development, the second removed immature granules from mouse dentate gyrus neurons development, and the third excluded hematopoietic stem cells (HSC) mesomorphic cells from human bone marrow development. These cells were successfully recognized in the reconstructed developmental trajectories within these simulated “omission” datasets (Fig. 3a–c, Supplementary Fig. 5a–c, Supplementary Fig. 5g–i).
In the mouse pancreatic development dataset, PAGA plots illustrated a baseline probability of 0.04 for Ngn3High EP cells differentiating into Pre-endocrine cells. In the corresponding “omission” dataset, this probability was 0. BulkTrajBlend interpolation increased the probability to 0.035 (Fig. 3d–g, Supplementary Fig. 6a–c). In the mouse dentate gyrus neurons development, Granule Immature cells had baseline differentiation probability to Granule Mature cells of 0.018, while no probability was observed in simulated “omission” dataset. BulkTrajBlend’s interpolation resulted in a probability increase to 0.019 (Fig. 3g, Supplementary Fig. 5d–f, Supplementary Fig. 6d–f). In human bone marrow development, hematopoietic stem cells stage 2 (HSC 2) cells showed a differentiation probability into monocytes of 0.082, compared to 0 in the simulated “omission” dataset. Following BulkTrajBlend interpolation, the probability increase to 0.079 (Fig. 3g, Supplementary Fig. 5j–l, Supplementary Fig. 6g–i). Notably, the original pseudotime variability in the three datasets was preserved after interpolation (Supplement Note 3). These analyses collectively highlight BulkTrajBlend’s effectiveness in accurately reconstructing authentic developmental trajectories.
OmicVerse provides a comprehensive analysis platform for Bulk RNA-seq data
Bulk RNA-seq is an established method for investigating the transcriptome of combined cellular samples, tissue or biopsies6. It probes gene expression, isoform variations, alternative splicing, and single-nucleotide polymorphisms, revealing critical biological information such as copy number variations, microbial contamination, transposable elements, cell-types deconvolution, and neoantigens. Advances in bioinformatics have enhanced the ability to reveal these hidden dimensions in Bulk RNA-seq data, expanding its analytical applications.
OmicVerse integrates an extensive collection of Bulk RNA-seq analysis algorithms, previously developed mostly in R but now increasingly in Python, to promote their utilization and interconnectivity36. Our integration enhances the existing repertoire of analysis algorithms catering to single-cell, spatial transcriptomics, as well as machine learning and deep learning models37.
The platform hosts a comprehensive assortment of Bulk RNA-seq algorithms, including pyComBat38 for batch correction, pyDEG for differential expression analysis using Deseq239, t-test, and Wilcoxon tests, pyPPI for protein-protein interaction network using STRING web API40, pyWGCNA for gene co-expression network41, pyGSEA for gene set enrichment analysis42, and pyTCGA for The Cancer Genome Atlas (TCGA) data analysis, complete with survival analysis (Fig. 4a).
To evaluate the OmicVerse’s analytical pipeline, we analyzed Alzheimer’s disease (AD) data, beginning with pyDEG to identify differential expressed genes between AD patients and controls, highlighting the top 10 foldchange genes. Then, we conducted Gene Set Enrichment Analysis at the gene level using pyGSEA, ordering genes according to p-values derived from pyDEG’s differential expression analysis. We further built a co-expression network from the top 5000 genes exhibiting the highest absolute median difference (MAE), selecting the most differential expression module for visualization (See Supplementary Note 4 for Methods).
OmicVerse’s workflow simplifies Bulk RNA-seq analyses with minimal coding required (Fig. 4b). Parameter adjustments may enhance visual outputs. Our analysis revealed 56 genes differentially expressed in AD: 48 upregulated and 8 downregulated. Box plots showcased the most altered genes (Fig.4c–e). Gene Set Enrichment Analysis exposed over-represented pathways relevant to Alzheimer’s, consistent with established literature (Fig. 4f,g). Moreover, we focused on the most variable genes from the top 5000, discerning 12 modules through pyWGCNA at 5 soft threshold. Notably, modules 4 and 5 showed the highest rates of differential gene expression, with module 5 containing APP proteins. Further probing of these modules provides insight into their network connectivity (Fig. 4h–j).
OmicVerse provides a versatile multifaceted framework for Single-Cell RNA-Seq Analysis
Single-cell RNA-seq is a powerful high-throughput technique that enables the measurement of gene expression patterns and cell types at the single-cell level. It has become a crucial technique for delineating cellular heterogeneity, differentiation, and disease mechanisms, particularly in cancer research. scRNA-seq unravels tumor cell diversity and tracks tumor progression to anticipate cellular deterioration43. The breadth of scRNA-seq data analysis facilitated by OmicVerse includes cell annotation, examination of cell interactions, trajectory inference, states evaluation within gene sets, and drug responses prediction44. The framework supports Anndata-standardized data processing for integrated downstream analysis and benefits from benchmarked data transformations28. Preprocessing methods in OmicVerse feature optimal logarithmic transformation with pseudo-count addition, principal-component analysis (PCA), and Pearson residual normalization. For visualizing reduced dimensions, it employs GPU-accelerated Uniform Manifold Approximation and Projection (UMAP) through pymde45.
Incorporating a suite of state-of-the-art scRNA-seq algorithms, OmicVerse’s integrated toolset includes pyHarmony46, pyCombat38, scanorama47 for batch correction, pySCSA48, updated with CellMarker49 2.0 and CancerSEA50 for enhanced cell-type annotation, CellPhoneDB12 for cell-cell interactions analysis pyVIA13 for trajectory inference, AUCell for geneset score evaluates based on Area Under the Curve51, and scDrug for drug prediction14 (Fig. 5a). The OmicVerse framework also introduces SEACells for metacell analysis, effectively minimizing single cell profile noise52. Importantly, the data format input for all the aforementioned methods is consistent, enabling users to conduct analyses using Anndata format, with significantly improved visualization for more elegant results. OmicVerse’s user-friendly nature and straightforward application are exemplified in Fig. 5b.
Illustrating Omicverse’s practical application in scRNA-seq, we analyzed a colorectal cancer (CRC) dataset, emphasizing the tumor microenvironment (TME) cell atlas integration53,54. Beginning with automatic cell annotation via pySCSA, the results showed high concordance with manual annotations (Fig. 5c), with an f1_score of 0.856, highlighting OmicVerse’s annotation accuracy (Fig. 5d). Using AUCell, we confirmed the expected signaling pathway enrichment in cell-specific receptor pathways: the B-cell receptor signaling pathway was prominence in B cells, while the T-cell receptor signaling pathway was most pronounced in T cells and NK cells (Fig. 5e). In addressing the sparsity inherent in previous CRC single-cell data analysis and to enhance resolution and depth, we utilized SEACells to extract metacells from the scRNA-seq data. After 39 epochs, the metacell aggregation iteration converged, attaining a high cell purity of 0.98, with compactness and separation values closely approximating 0 (Fig. 5f, Supplementary Fig. 7a–c). The SEACells algorithm enhanced cell type differentiation, with the signal intensity for receptor pathways being significantly accentuated (Supplementary Fig. 7d).
Furthermore, we traced epithelial-to-cancer cell differentiation trajectories using pyVIA and annotated cancer cell types within the epithelial population with pySCSA, identifying distinct pathways including Epithelial-to-Mesenchymal Transition (EMT) and Metastasis. This analysis provided deep insights into cancer progression (Fig. 5g). By commencing the trajectory with stemness as the starting point, we delineated the pseudotime trajectory of cancer cell differentiation, revealing three distinctive directions: EMT-Differentiation and Metastasis, representing two stages in the transition from epithelial cells to cancer cells. This analysis provided deep insights into the dynamics of cancer evolution. In a parallel approach, metacells within the epithelial cell subpopulation were subjected to further aggregative analysis. Due to the inherent similarities among epithelial cells, the average cell purity of the metacells obtained was reduced to 0.9, while compactness and separation values remained in close proximity to 0 (Supplementary Fig. 7e, f). Consequently, we extrapolated the metacells of epithelial cells into trajectories, revealing that EMT-differentiation and Metastasis served as the two primary differentiation pathways, aligning with the analysis conducted on all cells (Supplementary Fig. 7g–i).
Finally, to investigate the interaction network between epithelial cells and other TME cells, we established a CRC cell communication network using CellPhoneDB (Fig. 5h). The analysis included immune cells, including B-cells, T-cells, NK-cells, and plasma cells, exploring their interactions with eight subtypes of epithelial cells. The analysis revealed that PPIA-BSG and LTB-LTBR were recurrent ligand-receptor pairs mediating the recognition of cancer epithelial cells by immune cells (Fig. 5i). Notably, PPIA-BSG and LTB-LTBR have been linked to a positive correlation in various cancers and are associated with poor prognosis55,56. OmicVerse’s data harmonization significantly streamlines this comprehensive analysis, enabling researchers to delve into personalized explorations as outlined in our detailed tutorial (Refer to Supplementary Note 5 for the Methods).
OmicVerse performed multi-omics analysis with MOFA and GLUE
Single-cell sequencing advancements enable the investigation of biological systems across different tissue levels. A key element in scRNA-seq is understanding the impact of chromatin accessibility variation, which is quantified by Single-cell sequencing assay for transposase-accessible chromatin (scATAC-seq). The conjoined analysis of scATAC-seq and scRNA-seq data is critical for unraveling transcriptional regulatory complexities. While scNMT-seq can capture both modalities simulaneously, obtaining unpaired data from identical tissues is more common57. Addressing this disparity, Graphical Linkage Unified Embedding (GLUE) offers a Graphical Linkage Unified Embedding solution for integrating unpaired data58, and Multi-Omics Factor Analysis (MOFA) elucidates the variations within omics data59. OmicVerse utilizes both GLUE and MOFA to reveal transcriptional regulatory dynamics.
Within OmicVerse, the GLUE_pair algorithm leverages the Pearson correlation coefficient to compute cell similarity between scRNA-seq and scATAC-seq base on embedding from GLUE (Fig. 6a). The accuracy of GLUE_pair is verified using the Adjusted Rand Score (ARI) to confirm cell type congruence post-normalization. For the analysis of paired cell modalities, OmicVerse applies MOFA’s core algorithm, simplifying ensuing data analysis and visualization tasks (Fig. 6a), all achievable with minimal coding (Fig. 6b).
Demonstrating the integration of GLUE and MOFA, we analyzed simultaneous single-nucleus RNA-seq (snRNA-seq) and single-nucleus ATAC-seq (snATAC-seq) data from cortical regions of Alzheimer’s disease patients60. Our analysis of aligned cell types uncovered consistent patterns indicative of common cellular states (Fig. 6c,d). From a random subset of 5000 paired cells, MOFA unveiled 13 factors (Fig. 6e,f). The factors 1-6 accounted for RNA-related variance, while the second for ATAC-related variance. The interaction among these factors and cell types revealed significant associations: EX-signature with Factor 1, PER.END-signature with Factor 5, ASC-signature with Factor 2, MG-signature with Factor 3, and INH-signature jointly detailed by Factors 6 and 4. Additionally, gene weights for each factor uncovered genes with the most considerable influence on their respective signatures (Refer to Supplementary Note 6 for the Methods, Supplementary Fig. 8a–c).
Discussion
The innovative fusion of the variational autoencoder and graph neural networks combined in the creation of the BulkTrajBlend framework. This framework aims to deconvolve scRNA-seq data within Bulk RNA-seq and elucidate precise cell-specific developmental trajectories in scRNA-seq. It demonstrates significant accuracy and robustness, due in large part to the unique integration of the topological overlap community in graph neural networks, which skillfully addresses the potential bias introduced by unsupervised clustering in the single-cell data outcomes.
A conceptual parallel exists between back-calculating cell proportions in Bulk RNA-seq from scRNA-seq and using Bulk RNA-seq as a scaffold for interpolating scRNA-seq. However, the latter is inherently more challenging due to the need to accurately interpolat the inadequate target cell type. While numerous single-cell generators perform well in generating scRNA-seq data, the incorporation of unknown information remains an intrinsic challenge. For example, scDesign3 is a proficient statistical simulator that creates realistic single-cell data by learning interpretable parameters from actual scRNA-seq data. Nevertheless, reconstructing cell developmental trajectories often requires elusive parameters, which necessarily leverages known data from Bulk RNA-seq61. Hence, BulkTrajBlend is meticulously crafted based on the principles of scDesign361 and scGen32, with the state space and parameters being informed by Bulk RNA-seq. Notably, cell categorization in the resulting single-cell data often relies on unsupervised annotation. By introducing GNN, BulkTrajBlend effectively reduces resolution-dependent issues associated with unsupervised clustering.
While BulkTrajBlend can efficiently extract the state space of cells from Bulk RNA-seq and interpolate the original scRNA-seq data, this interpolation relies on the selection of the reference scRNA-seq versus the reference Bulk RNA-seq data. We suggest that users can adopt an additional comprehensive single-cell profile to train BulkTrajBlend and then perform interpolation of their data, thereby avoiding generating BulkTrajBlend without information about the target cells.
Upon devising the interpolation algorithm for Bulk RNA-seq in scRNA-seq, it became apparent that a unified Python-based framework for comprehensive dual analysis of these platforms was missing. To fill this void, we developed OmicVerse, seamlessly integrating single-seq and bulk-seq. OmicVerse introduces a specialized analysis object for each omics layer, facilitating streamlined analysis and ensuring an intuitive user experience. OmicVerse not only has a well-established scRNA-seq ecosystem like Seurat, which complements Scanpy, but also features a unique Bulk RNA-seq ecosystem, thus offering a consistent and user-friendly interface (Supplement Note 7).
As an integrated framework for both Bulk and single-cell RNA-seq analysis, OmicVerse offers a suite of analytical tools that include, but are not limited to:
(1) Bulk RNA-seq: OmicVerse provides comprehensive functionalities, including multi-sample integration, batch effect correction, differential gene expression analysis, gene set enrichment analysis, protein interaction network construction, the identification of gene co-expression modules, and TCGA database preprocessing.
(2) Single-cell RNA-seq: OmicVerse offers robust features, including multi-sample quality control, batch effect removal and integration, automated cell type annotation (with multiple databases support) and migration annotation, cell type and gene set enrichment analysis, developmental trajectory reconstruction, metacell identification, cellular interaction network analysis, and drug response prediction. It also covers scATAC-seq integration and multi-omics analysis, inherently linked to RNA-seq.
(3) Bulk RNA-seq to scRNA-seq: OmicVerse enhances the deconvolution of Bulk RNA-seq, cell proportions estimation, interpolation the scRNA-seq data and the recovery of developmental trajectories within scRNA-seq. Acting as a critical bridge in the transition from Bulk to single-cell RNA-seq.
The OmicVerse documentation provides a detailed Application Programming Interface (API) reference for each algorithm, coupled with tutorials that clarify their functions, limitations, and synergies with other bulk and single-seq analysis tools. These resources are accessible via Google Colab, offering a free computational workspace for pipeline examinations. OmicVerse also has comprehensive developer documentation that makes it easy for users to add tools to the ecosystem following a consistent development logic.
Our primary goal was to foster an ecosystem replete with visually engaging and insightful visualizations, fully integrated within the Python programming environment. OmicVerse allows users to perform extensive transcriptome analysis using a single programming language, tapping into the collective machine-learning knowledge and models available within the Python community. We anticipate that OmicVerse will continue to grow, with updates introducing additional algorithms, features, and models. Ultimately, OmicVerse aims to act as a driving force for the bulk and single-seq community, encouraging the prototyping of various models, establishing standards for RNA-omics analysis, and expanding the potential for scientific exploration.
Methods
Methods for BulkTrajBlend
BulkTrajBlend is primarily designed to address the issue of “omitted” cells in single-cell data, making the inference of developmental or differentiation trajectories continuous. To achieve this goal, we designed BulkTrajBlend to generate potential “missing” cells from bulk RNA-seq data for inferring pseudo-time cell trajectories. This process consists of the following four steps (where communities represent cell types):
Cell proportion calculation
To estimate the proportion of cells in Bulk RNA-seq, we first annotated the single-cell data with respective cell types and aggregate the gene counts of single cells by cell type, resulting in an matrix, where represents the number of cell types and represents the number of genes. We define this matrix as the simulated Bulk RNA-seq cell type matrix, and then we sum columns of each row to get the simulated Bulk RNA-seq , and we input the simulated Bulk RNA-seq into the self-encoder of AE. In the self-encoder, we define the output of the encoder as , and we make close to , i.e., Cell Proportion, by training AE. We then define the output of the generator as and we make and close to each other by MAE as an evaluation. After training the optimal AE, we change the input to real Bulk RNA-seq , at which time the output of the encoder, , is the Cell Proportion corresponding to real Bulk, which we use as the range of the generator space for the subsequent β-VAE.
Generation of single-cell data
Given a dataset , where the vector in the gene expression matrix represents gene expression vector of a cell, the vector in the matrix represents cell type proportion, satisfying , where is restricted by a loss function:
1 |
Here is the predicted proportions of certain cell type.
The vector in the matrix represents conditionally correlated generative factor. The factor is obtained from the same class of cells through the β-VAE Encoder. For each class of cells, the average value after model training represents a class of cell-specific , and it is not restriected by adding a loss function. According to Higgins et al.26, we hypothesize that gene expression vectors are generated by a probability model , where represents the generative model parameters. The model learns the joint distribution of the data and a set of latent variables (, where ) for generating observed data , i.e., , and approximates the true posterior distribution with an approximate posterior distribution that is easier to compute. Our goal is to ensure that the inferred latent variables capture the generative factors in a disentangled manner. A disentangled representation implies that individual latent unit is sensitive to variations in a single generative factor while being relatively invariant to variations in other factors. In a disentangled representation, knowledge of one factor can be generalized to new configurations of other factors. The conditionally correlated generative factors can remain entangled in a separate subset of and are not used to represent .
To achieve this, we minimize the KL divergence between the approximate posterior and the true posterior:
2 |
Here, is the variational lower bound and can be written as:
3 |
We introduce a constraint to shape the inferred posterior and match it with a prior that controls the capacity of the latent information bottleneck. We set the prior as an isotropic unit Gaussian, . The constrained optimization problem can be written as:
4 |
Here, is the strength of the applied constraint. With this optimization based on MLE, the latent variable can reflect the character of the ground truth data with lower error. According to β-VAE model31, we can rewrite the problem in Lagrangian form:
5 |
where is the regularization coefficient of the constraint, which limits the capacity of and imposes an implicit pressure for independence in learning the posterior distribution due to the isotropic nature of the Gaussian prior . In this model, different values of can alter the level of learning pressure imposed during training, encouraging the learning of different representations. We assume a disentangled representation of the conditional independent data generative factors and therefore set to apply a stronger constraint on the latent variable information bottleneck, exceeding the constraint of the original VAE. These constraints restrict the capacity of and, combined with the pressure to maximize the log-likelihood of the training data , encourage the model to learn the most efficient representation of the data.
Computation of single-cell neighborhood graph
Here, we used the scanpy.pp.neighbors function from Scanpy to compute the cell neighborhood graph. For detailed mathematical description, please refer to the relevant papers and documentation of nearest neighbor descent in Scanpy and PyNNDescent62.
Community detection and generation of overlapping cell communities
We performed community detection on the cell neighborhood graph using a Graph Neural Network (GNN) model to find overlapping cell communities33. GNN can learn relationships between nodes and divide them into different communities based on their similarities. Specifically, we used GCN, which is one of the basic models in GNN, to generate an affinity matrix , which represents the degree of association between cells. The computation is as follows:
6 |
Here, is the adjacency matrix of the cell neighborhood graph, and represents cell type as the node feature. To ensure non-negativity of , we applied element-wise ReLU non-linear activation function to the output layer. For detailed information about the GNN architecture,
7 |
Here, is the normalized adjacency matrix, is the adjacency matrix with self-loops, and is the diagonal degree matrix of the adjacency matrix with self-loops. We considered other GNN architectures and deeper models but did not observe significant improvements. Two main differences between our model and the standard GCN are: (1) batch normalization applied after the first graph convolutional layer, and (2) L2 regularization applied to all weight matrices. We found that both modifications significantly improved the performance.
We measured the fit between the generated affinity matrix and the neighborhood graph using the negative log-likelihood function of the Bernoulli-Poisson model:
8 |
Here, represents the set of edges in the graph. Since neighborhood graphs of single-cell data are typically sparse, the second term in the third sum contributes more to the loss. To balance these two terms, we adopted a standard technique known as balanced classification18, and defined the loss function as follows:
9 |
Here, and represent uniform distributions over edges and non-edges, respectively.
Instead of directly optimizing the affinity matrix as in traditional methods, we search for the optimal neural network parameters to minimize the (balanced) negative log-likelihood function:
10 |
Through these steps, the BulkTrajBlend model computes overlapping communities in single-cell data, which can be used to infer “omission” cells in the original single-cell data. It can help reveal cell type transitions and dynamics, and model and analyze cell developmental trajectories.
Community trajectory inference
Here, we inserted the overlapping communities of target cells into the original single-cell data and used PyVIA to infer pseudo-temporal trajectories of cell differentiation. For detailed inference methods, please refer to the mathematical description of PyVIA. Additionally, researchers can also use CellRank for community trajectory re-inference.
CGAN and ACGAN model description
CGAN (Conditional Generative Adversarial Nets) is a GAN (Generative Adversarial Nets) based model that generates data by training the generator and discriminator with the data and corresponding labels. The training process can be split into 2 parts. In the first part, latent variables are generated by standardized normal distribution and its generated class labels are input into the generator to get the generated data. Here the generator can be summarized as a function , where are the parameters of the MLP and there are 6 layers in that each layer is normalized. The the hidden dimensions are 128*256*512*1024 and the activation function is LeakyRelu. After getting the generated data , there will be a discriminator , where are the parameters of the MLP and there are 4 layers in each layer the hidden dimension is 512, dropout rate is 0.4 and the activation function is LeakyRelu, judging whether accords with its label . Therefore, in the second part, will be trained by the real data and its label with Adam optimizer to improve the judgement level of. Then the loss of judged by will be employed to enhance the generation ability of with the same optimizer. The loss functions for and are both MSEloss and the weights of the loss of the generative data and the real data are both 0.5.
In addition, ACGAN (Auxiliary Classifier GAN), which makes the generative data more authentic, keeps the same structure of the generator as the one in the CGAN, but it adds the classifier that offers the label of the input data on the output of the discriminator. In the training process, the loss function for the added classifier is CrossEntropy.
Data pre-processing
All single-cell data used for BulkTrajBlend training underwent the same quality control steps: Cells with low sequencing counts (<1000) and a high mitochondrial fraction (>0.2) were excluded in further analysis. The filtered count matrix was normalized by dividing the counts of each cell by total molecule counts detected in that particular cell and logarithmised with Python library scanpy63. All Bulk RNA-seq were normalized using DEseq2 and “numpy.log1p” logarithmised using Python’s Numpy64 package. It is worth noting that both Bulk and single-cell data use raw counts during AE estimation of the cell fraction state space, whereas both Bulk and single-cell data use normalized and logarithmised data during training of β-VAE.
Performance evaluation
To evaluated the generated and interpolation performance of our model, a comprehensive analysis was conducted, encompassing the examination of five critical dimensions:
(1) The count of interpolated cells, we counted the number of cells that were eventually used to interpolate into the raw single-cell profile.
(2) The correlation in marker gene expression between interpolated and authentic cells, we first use scanpy’s “scanpy.tl.rank_genes_groups” function to calculate the marker genes for each type of cell subpopulation in the raw single-cell profile (taking the top 200 marker genes). Then, we use the Pearson coefficient to calculate the percentage of these 200 marker genes in the expression correlation between the generated single-cell profile and the raw single-cell profile.
(3) Marker gene similarity, we first used scanpy’s “scanpy.tl.rank_genes_groups” function to calculate the marker genes for each type of cell subpopulation (taking the first 200 marker genes) in the raw single-cell profile versus the generated single-cell profile, respectively. Then, we treated marker genes as words and all the marker genes of each cell class as sentences, and used cosine similarity to calculate the similarity of marker genes of each cell subpopulation.
(4) Transition probabilities post-interpolation We first wrapped “omicverse.pp.scale” and “omicverse.pp.pca” in omicverse, “omicverse.utils.cal_paga”, and computed the principal component PCA of the single-cell profile. We took the first 50 principal components and used the scanpy’s “scanpy.pp.neighbour” to compute the neighborhood map of the single-cell profile. Immediately after that, we calculated the developmental trajectory of single-cell profile with pseudotime using pyVIA, and we calculated the state transfer confidence for each type of cell subpopulation by taking pseudotime as the priority time with the neighborhood graph as the input of “omicverse.utils.cal_paga”.
(5) The number of noise clusters, we used “scanpy.tl.leiden” in scanpy to perform unsupervised clustering on the generated single-cell profiles, with the resolution set to 1.0, and we identified the categories with less than 25 cells after clustering as noisy clusters and counted the number of noisy clusters as an assessment of the generation quality.
(6) Density assessment of pseudotime, after we obtained the pseudotime of single-cell profiles using pyVIA as the default parameters, specifically setting K to 15 in the neighborhood graph of the KNN and configuring use_rep to X_pca.We assessed the variance of the pseudotime of target interpolated cells as one of the metrics for the assessment of developmental trajectory reconstruction.
Datasets
Dentate Gyrus
Single-cell RNA-seq: Data from Hochgerner et al.65. Dentate gyrus (DG) is part of the hippocampus involved in learning, episodic memory formation and spatial coding. The experiment from the developing DG comprises two time points (P12 and P35) measured using droplet-based scRNA-seq (10x Genomics Chromium). The dominating structure is the granule cell lineage, in which neuroblasts develop into granule cells. Simultaneously, the remaining population forms distinct cell types that are fully differentiated (e.g., Cajal-Retzius cells) or cell types that form a sub-lineage (e.g., GABA cells) (Accession ID GSE95753).
Bulk RNA-seq: Data from Cembrowski et al.66. Dentate gyrus (DG) is measured by RNA sequencing (RNA-seq) to produce a quantitative, whole genome atlas of gene expression for every excitatory neuronal class in the hippocampus; namely, granule cells and mossy cells of the dentate gyrus, and pyramidal cells of areas CA3, CA2, and CA1 (Accession ID GSE74985).
Pancreatic endocrinogenesis
Single-cell RNA-seq: Data from Bastidas-Ponce et al.67. Pancreatic epithelial and Ngn3-Venus fusion (NVF) cells during secondary transition with transcriptome profiles sampled from embryonic day 15.5. Endocrine cells are derived from endocrine progenitors located in the pancreatic epithelium. Endocrine commitment terminates in four major fates: glucagon- producing α-cells, insulin-producing β-cells, somatostatin-producing δ-cells and ghrelin-producing ε-cells (Accession ID GSE132188).
Bulk RNA-seq: Data from Bosch et al.68. RNA-sequencing was performed of pancreatic islets (islets of Langerhans) from mice on PLX5622 or control diet for 5.5 or 8.5 months (Accession ID GSE189434).
Human bone marrow
Single-cell RNA-seq: Data from Setty et al.69. The bone marrow is the primary site of new blood cell production or haematopoiesis. It is composed of hematopoietic cells, marrow adipose tissue, and supportive stromal cells. This dataset served to detect important landmarks of hematopoietic differentiation, to identify key transcription factors that drive lineage fate choice and to closely track when cells lose plasticity (https://data.humancellatlas.org/explore/projects/091cf39b-01bc-42e5-9437-f419a66c8a45).
Bulk RNA-seq: Data from Myers et al (2018). RNA-Seq of CD34+ Bone Marrow Progenitors from Healthy Donors (Accession ID GSE118944).
Maturation of murine liver
Single-cell RNA-seq: Data from Liang et al.70. A total of 52,834 single cell transcriptomes, collected from the newborn to adult livers, were analyzed. We observed dramatic changes in cellular compositions during liver postnatal development. We characterized the process of hepatocytes and sinusoidal endothelial cell zonation establishment at single cell resolution. We selected Pro-B, Large Pre-B, SmallPre-B, B, HPC, GMP, iNP, imNP, mNP, Basophil, Monocyte, cDC1, cDC2, pDC, aDC, Kupffer, Proerythroblast, Erythroblast, erythrocyte (Annotation could be found in metadata of Data from Liang et al.) to performed HPC differentiation analysis (Accession ID GSE171993).
Bulk RNA-seq: Data from Renaud et al.71. We analyze gene expression patterns in the developing mouse liver over 12 distinct time points from late embryonic stage (2 days before birth) to maturity (60 days after birth). Three replicates per time point (Accession ID GSE58827).
Construction of Simulated “omission” single-cell profile
To simulate the cell “omission” in single-cell sequencing, we conducted cell dropout experiments across diverse datasets. In the Pancreas dataset, we employed Leiden clustering and manually excluded specific clusters of Ngn3 high EP, resulting in a reduction of confidence in the transition from Ngn3 high EP to Pre-endocrine to 0. In the Dentategyrus dataset, we applied Leiden clustering and manually removed specific clusters of Granule Immature, leading to a confidence reduction in the transition from Granule Immature to Granule Mature to 0. Furthermore, in the BoneMarrow dataset, we randomly eliminated 80% of the cells from HSC-2, causing a confidence drop in the transition from HSC-2 to Monocyte-2 to 0.
To employ BulkTrajBlend for generating “omission” cells across various datasets, we generated single-cell data from the bulk RNA-seq data using BulkTrajBlend and filtered out noisy cells using the size of the Leiden as a constraint. In configuring the model for different datasets, we set the hyperparameter “cell_target_num” to be 1.5 times, 1 time, and 6 times the number of dropped-out cell types, aligning with Pancreas, Dentategyrus, and BoneMarrow, respectively. Subsequently, BulkTrajBlend calculated the overlapping cell types in the generated single-cell data, and we annotated the overlapping cell communities. Specifically, we selected the single-cell data in which dropped-out cell types were associated with adjacent cell types.
Methods of OmicVerse integration
We unified the downstream analyses of Bulk RNA-seq, single cell RNA-seq in OmicVerse. Since the downstream analyses are independent of the parameter evaluation of BulkTrajBlend and the analysis modules of each part are independent of each other, we have placed the datasets and methods used in each part in Supplementary, an index of which is provided here.
Bulk RNA-seq: All datasets selected, parameter setting, and methods could be found in Supplementary Note 4.
scRNA-seq: All datasets selected, parameter setting, and methods could be found in Supplementary Note 5.
Multi-omics: All datasets selected, parameter setting, and methods could be found in Supplementary Note 6.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Supplementary information
Acknowledgements
This work was supported by the grants from the National Natural Science Foundation of China (32300682 to C.X.), the National Key Research & Developmental Program of China (92249303 to Y.X.), the Fundamental Research Funds for the Central Universities (FRF-TP-22-007A1 to C.X.), the Student Research Training Program (SRTP) of University of Science and Technology Beijing (202010008107 to Z.Z.). We thank Professor Ge Gao of Peking University for his guidance on the OmicVerse open-source copyright in the summer of 2021. We are grateful for the experience of studying epigenomics in the Xie Lab at Tsinghua University, and for Xiaotong Wu’s guidance in enabling OmicVerse's multi-omics analyses to be successfully designed. We thank all the Github users who contributed code and issue to OmicVerse over the years. We would like to thank the following WeChat Official Accounts for promoting OmicVerse: pythonic biologists, biotrainee. Pythonic biologist and Biotrainee’s article inspired some of the charting in the OmicVerse.
Author contributions
Z.Z., Y.M. and L.H. contributed equally to this work. Z.Z. was responsible for designing the OmicVerse application programming interface and designing the whole BulkTrajBlend framework. Y.M. played a key role in designing and implementing the overlap cell community of BulkTrajBlend, while Z.Z. was responsible for implementing and conducting testing of the single-cell RNA-seq pipeline. L.H. conducted simulated single-cell profile tests for BulkTrajBlend. The multi-omics module of pyMOFA was implemented and tested by L.H., Y.W. and Z.Z. P.L. handled the implementation and testing of the meta cells analysis in SEACells. B.T. was responsible for writing the methods for CGAN and ACGAN, as well as reviewing the methods of BulkTrajBlend. H.D., Y.X. and Z.Z. jointly conceived, implemented, and tested the bulk RNA-seq pipeline. C.X. and Y.X. provided the conceptualization of false overlap rate for evaluation of BulkTrajBlend. Y.X., H.D., C.X. and Z.Z. provided supervision and contributed to the conceptualization of the OmicVerse platform. The manuscript was collaboratively written by Z.Z., Y.M., L.H., H.D. and Y.X.
Peer review
Peer review information
Nature Communications thanks Runmin Wei, and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. A peer review file is available.
Data availability
The Dentate Gyrus data used in this study have been deposited in the Gene Expression Omnibus (GEO) database under accession code GSE95753 and GSE74985, Data related to pancreatic endocrinogenesis are accessible via accession codes GSE132188 and GSE189434, the maturation of murine liver data can be found under accession code GSE171993 and GSE58827, Human bone marrow data are available in the Human Cell Atlas (HCA) database at https://data.humancellatlas.org/explore/projects/091cf39b-01bc-42e5-9437-f419a66c8a45 and in the GEO database under accession code GSE118944. The Alzheimer’s Disease snRNA-seq and snATAC-seq used in this study are available from GSE174367. The colorectal cancer scRNA-seq data is available from GSE178318. All processed data in this manuscript are available at https://github.com/Starlitnightly/omicverse-reproducibility.
Code availability
The code to reproduce the experiments of this manuscript is available at https://github.com/Starlitnightly/omicverse-reproducibility. The OmicVerse package can be found on GitHub at https://github.com/Starlitnightly/omicverse Documentation and tutorials can be found at https://omicverse.readthedocs.io.
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
These authors contributed equally: Zehua Zeng, Yuqing Ma, Lei Hu.
Contributor Information
Zehua Zeng, Email: starlitnightly@gmail.com.
Cencan Xing, Email: cencanxing@ustb.edu.cn.
Yuanyan Xiong, Email: xyyan@mail.sysu.edu.cn.
Hongwu Du, Email: hongwudu@ustb.edu.cn.
Supplementary information
The online version contains supplementary material available at 10.1038/s41467-024-50194-3.
References
- 1.Kharchenko PV. The triumphs and limitations of computational methods for scRNA-seq. Nat. Methods. 2021;18:723–732. doi: 10.1038/s41592-021-01171-x. [DOI] [PubMed] [Google Scholar]
- 2.Peng L, et al. Single-cell RNA-seq clustering: datasets, models, and algorithms. RNA Biol. 2020;17:765–783. doi: 10.1080/15476286.2020.1728961. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Xu, X., Hua, X., Mo, H., Hu, S. & Song, J. Single-cell RNA sequencing to identify cellular heterogeneity and targets in cardiovascular diseases: from bench to bedside. Basic Res. Cardiol.118, 7 (2023). [DOI] [PubMed]
- 4.Derakhshan, T., Boyce, J. A. & Dwyer, D. F. Defining mast cell differentiation and heterogeneity through single-ce ll transcriptomics analysis. J. Allergy Clin. Immunol.150, 739–747 10.1016/j.jaci.2022.08.011 (2022). [DOI] [PMC free article] [PubMed]
- 5.Zeng, L. et al. Research progress of single-cell transcriptome sequencing in autoimmune diseases and autoinflammatory disease: a review. J. Autoimmun133, 102919 10.1016/j.jaut.2022.102919 (2022). [DOI] [PubMed]
- 6.Thind AS, et al. Demystifying emerging bulk RNA-Seq applications: the application and utility of bioinformatic methodology. Brief. Bioinform. 2021;22:bbab259. doi: 10.1093/bib/bbab259. [DOI] [PubMed] [Google Scholar]
- 7.Liao J, et al. De novo analysis of bulk RNA-seq data at spatially resolved single-cell resolution. Nat. Commun. 2022;13:6498. doi: 10.1038/s41467-022-34271-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014;15:550. doi: 10.1186/s13059-014-0550-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Subramanian, A. et al. Gene set enrichment analysis: a knowledge-based approach for interpret ing genome-wide expression profiles. Proc Natl. Acad. Sci. USA102, 15545–15550 10.1073/pnas.0506580102 (2005). [DOI] [PMC free article] [PubMed]
- 10.Langfelder P, Horvath S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics. 2008;9:559. doi: 10.1186/1471-2105-9-559. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Hu C, et al. CellMarker 2.0: an updated database of manually curated cell markers in human/mouse and web tools based on scRNA-seq data. Nucleic Acids Res. 2023;51:D870–D876. doi: 10.1093/nar/gkac947. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Efremova M, Vento-Tormo M, Teichmann SA, Vento-Tormo R. CellPhoneDB: inferring cell–cell communication from combined expression of multi-subunit ligand–receptor complexes. Nat. Protoc. 2020;15:1484–1506. doi: 10.1038/s41596-020-0292-x. [DOI] [PubMed] [Google Scholar]
- 13.Stassen SV, Yip GGK, Wong KKY, Ho JWK, Tsia KK. Generalized and scalable trajectory inference in single-cell omics data with VIA. Nat. Commun. 2021;12:5528. doi: 10.1038/s41467-021-25773-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Hsieh, C.-Y. et al. scDrug: from single-cell RNA-seq to drug response prediction. Comput. Struct. Biotechnol. J.21, 150–157 10.1016/j.csbj.2022.11.055 (2022). [DOI] [PMC free article] [PubMed]
- 15.Amezquita RA, et al. Orchestrating single-cell analysis with Bioconductor. Nat. Methods. 2020;17:137–145. doi: 10.1038/s41592-019-0654-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Virshup I, et al. The Scverse project provides a computational ecosystem for single-cell omics data analysis. Nat. Biotechnol. 2023;41:604–606. doi: 10.1038/s41587-023-01733-8. [DOI] [PubMed] [Google Scholar]
- 17.Giorgi, F. M., Ceraolo, C. & Mercatelli, D. The R Language: an engine for bioinformatics and data science. Life (Basel)12, 648 10.3390/life12050648 (2022). [DOI] [PMC free article] [PubMed]
- 18.Brittain J, Cendon M, Nizzi J, Pleis J. Data scientist’s analysis toolbox: comparison of Python, R, and SAS Performance. SMU Data Sci. Rev. 2018;1:7. [Google Scholar]
- 19.Wu H, Kirita Y, Donnelly EL, Humphreys BD. Advantages of single-nucleus over single-cell RNA sequencing of adult kidney: rare cell types and novel cell states revealed in fibrosis. J. Am. Soc. Nephrol. 2019;30:23. doi: 10.1681/ASN.2018090912. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Mereu E, et al. Benchmarking single-cell RNA-sequencing protocols for cell atlas projects. Nat. Biotechnol. 2020;38:747–755. doi: 10.1038/s41587-020-0469-4. [DOI] [PubMed] [Google Scholar]
- 21.Denyer T, Timmermans MCP. Crafting a blueprint for single-cell RNA sequencing. Trends Plant Sci. 2022;27:92–103. doi: 10.1016/j.tplants.2021.08.016. [DOI] [PubMed] [Google Scholar]
- 22.Gao C, Zhang M, Chen L. The comparison of two single-cell sequencing platforms: BD rhapsody and 10x genomics chromium. Curr. Genomics. 2020;21:602–609. doi: 10.2174/1389202921999200625220812. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Chen Y, et al. Deep autoencoder for interpretable tissue-adaptive deconvolution and cell-type-specific gene analysis. Nat. Commun. 2022;13:6735. doi: 10.1038/s41467-022-34550-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.Chen, B., Khodadoust, M. S., Liu, C. L., Newman, A. M. & Alizadeh, A. A. Profiling tumor infiltrating immune cells with CIBERSORT. Cancer Syst. Biol. Methods Protocols, 1711, 243–259 (2018). [DOI] [PMC free article] [PubMed]
- 25.Fan J, et al. MuSiC2: cell-type deconvolution for multi-condition bulk RNA-seq data. Brief. Bioinforma. 2022;23:bbac430. doi: 10.1093/bib/bbac430. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Steen, C. B., Liu, C. L., Alizadeh, A. A. & Newman, A. M. Profiling cell type abundance and expression in bulk tissues with CIBERSORTx. Stem Cell Transcr. Netw. Methods Protoc.2117, 135–157 (2020). [DOI] [PMC free article] [PubMed]
- 27.Jew B, et al. Accurate estimation of cell composition in bulk expression through robust integration of single-cell information. Nat. Commun. 2020;11:1971. doi: 10.1038/s41467-020-15816-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28.Ahlmann-Eltze C, Huber W. Comparison of transformations for single-cell RNA-seq data. Nat. Methods. 2023;20:665–672. doi: 10.1038/s41592-023-01814-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29.Frishberg, A. et al. Cell composition analysis of bulk genomics using single-cell data. Nat. Methods16, 327–332, 10.1038/s41592-019-0355-5 (2019). [DOI] [PMC free article] [PubMed]
- 30.Wang X, Park J, Susztak K, Zhang NR, Li M. Bulk tissue cell type deconvolution with multi-subject single-cell expression reference. Nat. Commun. 2019;10:380. doi: 10.1038/s41467-018-08023-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31.Higgins, I. et al. beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework. ICLR (Poster), 3. (2017).
- 32.Lotfollahi M, Wolf FA, Theis FJ. scGen predicts single-cell perturbation responses. Nat. Methods. 2019;16:715–721. doi: 10.1038/s41592-019-0494-8. [DOI] [PubMed] [Google Scholar]
- 33.Shchur, O. & Günnemann, S. Overlapping community detection with graph neural networks. Deep Learning on Graphs, KDD.10.48550/arXiv.1909.12201 (2019).
- 34.Mirza, M. & Osindero, S. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784 (2014).
- 35.Odena, A., Olah, C. & Shlens, J. in International conference on machine learning. 2642-2651 (PMLR).
- 36.Dimitrov, D. & Gu, Q. BingleSeq: a user-friendly R package for bulk and single-cell RNA-Seq data analysis. PeerJ8, e10469, 10.7717/peerj.10469 (2020). [DOI] [PMC free article] [PubMed]
- 37.Flores, M. et al. Deep learning tackles single-cell analysis—a survey of deep learning for scRNA-seq analysis. Brief. Bioinform.23, bbab531 (2022). [DOI] [PMC free article] [PubMed]
- 38.Behdenna, A. et al. pyComBat, a Python tool for batch effects correction in high-throughput molecular data using empirical Bayes methods. bioRxiv, 2020.2003.2017.995431, 10.1101/2020.03.17.995431 (2023). [DOI] [PMC free article] [PubMed]
- 39.Muzellec, B., Telenczuk, M., Cabeli, V. & Andreux, M. PyDESeq2: a python package for bulk RNA-seq differential expression analysis. bioRxiv, 2022–2012. [DOI] [PMC free article] [PubMed]
- 40.Szklarczyk D, et al. The STRING database in 2021: customizable protein–protein networks, and functional characterization of user-uploaded gene/measurement sets. Nucleic Acids Res. 2021;49:D605–D612. doi: 10.1093/nar/gkaa1074. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 41.Langfelder P, Horvath S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinforma. 2008;9:1–13. doi: 10.1186/1471-2105-9-559. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 42.Fang, Z., Liu, X. & Peltz, G. GSEApy: a comprehensive package for performing gene set enrichment analysis in Python. Bioinformatics39, btac757 (2023). [DOI] [PMC free article] [PubMed]
- 43.Zhang, Y. et al. Single-cell RNA sequencing in cancer research. J. Exp. Clin. Cancer Res.40, 81 (2021). [DOI] [PMC free article] [PubMed]
- 44.Mo, Z. et al. Single-cell transcriptomics reveals the role of Macrophage-Naïve CD4+ T cell interaction in the immunosuppressive microenvironment of primary liver carcinoma. J. Transl. Med.20, 466 (2022). [DOI] [PMC free article] [PubMed]
- 45.Agrawal, A., Ali, A., Boyd, S. & others. Minimum-distortion embedding. Foundations and Trends® in Machine Learning14, 211–378.
- 46.Korsunsky I, et al. Fast, sensitive and accurate integration of single-cell data with Harmony. Nat. Methods. 2019;16:1289. doi: 10.1038/s41592-019-0619-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 47.Hie B, Bryson B, Berger B. Efficient integration of heterogeneous single-cell transcriptomes using Scanorama. Nat. Biotechnol. 2019;37:685–691. doi: 10.1038/s41587-019-0113-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 48.Cao Y, Wang X, Peng G. SCSA: a cell type annotation tool for single-cell RNA-seq data. Front. Genet. 2020;11:490. doi: 10.3389/fgene.2020.00490. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 49.Zhang, X. et al. CellMarker: a manually curated resource of cell markers in human and mouse. Nucleic Acids Res.47, D721–D728 (2019). [DOI] [PMC free article] [PubMed]
- 50.Yuan, H. et al. CancerSEA: a cancer single-cell state atlas. Nucleic Acids Res.47, D900–D908 (2019). [DOI] [PMC free article] [PubMed]
- 51.Van de Sande B, et al. A scalable SCENIC workflow for single-cell gene regulatory network analysis. Nat. Protoc. 2020;15:2247–2276. doi: 10.1038/s41596-020-0336-2. [DOI] [PubMed] [Google Scholar]
- 52.Persad, S. et al. SEACells infers transcriptional and epigenomic cellular states from single-cell genomics data. Nat Biotechnol41, 1746–1757 (2023). [DOI] [PMC free article] [PubMed]
- 53.Che L-H, et al. A single-cell atlas of liver metastases of colorectal cancer reveals reprogramming of the tumor microenvironment in response to preoperative chemotherapy. Cell Discov. 2021;7:80. doi: 10.1038/s41421-021-00312-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 54.AlMusawi, S., Ahmed, M. & Nateri, A. S. Understanding cell-cell communication and signaling in the colorectal cancer microenvironment. Clin. Transl. Med.11, e308 (2021). [DOI] [PMC free article] [PubMed]
- 55.Han, J. M. & Jung, H. J. Cyclophilin A/CD147 interaction: a promising target for anticancer therapy. Int. J. Mol. Sci.23, 9341 10.3390/ijms23169341. [DOI] [PMC free article] [PubMed]
- 56.Scarzello, A. J. et al. LTβR signalling preferentially accelerates oncogenic AKT-initiated liver tumours. Gut65, 1765–1775, 10.1136/gutjnl-2014-308810. [DOI] [PMC free article] [PubMed]
- 57.Clark, S. J. et al. scNMT-seq enables joint profiling of chromatin accessibility DNA methylation and transcription in single cells. Nat. Commun.9, 781 (2018). [DOI] [PMC free article] [PubMed]
- 58.Cao, Z.-J. & Gao, G. Multi-omics single-cell data integration and regulatory inference with graph-linked embedding. Nature Biotechnology40, 1458–1466 (2022). [DOI] [PMC free article] [PubMed]
- 59.Argelaguet R, et al. MOFA+: a statistical framework for comprehensive integration of multi-modal single-cell data. Genome Biol. 2020;21:1–17. doi: 10.1186/s13059-020-02015-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 60.Morabito S, et al. Single-nucleus chromatin accessibility and transcriptomic characterization of Alzheimer’s disease. Nat. Genet. 2021;53:1143. doi: 10.1038/s41588-021-00894-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 61.Song, D. et al. scDesign3 generates realistic in silico data for multimodal single-cell and spatial omics. Nat. Biotechnol.10.1038/s41587-023-01772-1 (2023). [DOI] [PMC free article] [PubMed]
- 62.Dong, W., Moses, C. & Li, K. Efficient k-nearest neighbor graph construction for generic similarity measures. In Proceedings of the 20th international conference on World wide web. 577–586 (2011)
- 63.Wolf FA, Angerer P, Theis FJ. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 2018;19:1–5. doi: 10.1186/s13059-017-1382-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 64.Harris, et al. Array programming with NumPy. Nature. 2020;585:357–362. doi: 10.1038/s41586-020-2649-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 65.Hochgerner H, Zeisel A, Lönnerberg P, Linnarsson S. Conserved properties of dentate gyrus neurogenesis across postnatal development revealed by single-cell RNA sequencing. Nat. Neurosci. 2018;21:290–299. doi: 10.1038/s41593-017-0056-2. [DOI] [PubMed] [Google Scholar]
- 66.Cembrowski MS, Wang L, Sugino K, Shields BC, Spruston N. Hipposeq: a comprehensive RNA-seq database of gene expression in hippocampal principal neurons. eLife. 2016;5:e14997. doi: 10.7554/eLife.14997. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 67.Bastidas-Ponce A, et al. Comprehensive single cell mRNA profiling reveals a detailed roadmap for pancreatic endocrinogenesis. Development. 2019;146:dev173849. doi: 10.1242/dev.173849. [DOI] [PubMed] [Google Scholar]
- 68.Bosch AJT, et al. CSF1R inhibition with PLX5622 affects multiple immune cell compartments and induces tissue-specific metabolic effects in lean mice. Diabetologia. 2023;66:2292–2306. doi: 10.1007/s00125-023-06007-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 69.Setty M, et al. Characterization of cell fate probabilities in single-cell data with Palantir. Nat. Biotechnol. 2019;37:451–460. doi: 10.1038/s41587-019-0068-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 70.Liang Y, et al. Temporal analyses of postnatal liver development and maturation by sin gle-cell transcriptomics. Dev. Cell. 2022;57:398–414.e395. doi: 10.1016/j.devcel.2022.01.004. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 71.Renaud HJ, et al. Ontogeny of hepatic energy metabolism genes in mice as revealed by RNA -sequencing. PloS One. 2014;9:e104560. doi: 10.1371/journal.pone.0104560. [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Data Availability Statement
The Dentate Gyrus data used in this study have been deposited in the Gene Expression Omnibus (GEO) database under accession code GSE95753 and GSE74985, Data related to pancreatic endocrinogenesis are accessible via accession codes GSE132188 and GSE189434, the maturation of murine liver data can be found under accession code GSE171993 and GSE58827, Human bone marrow data are available in the Human Cell Atlas (HCA) database at https://data.humancellatlas.org/explore/projects/091cf39b-01bc-42e5-9437-f419a66c8a45 and in the GEO database under accession code GSE118944. The Alzheimer’s Disease snRNA-seq and snATAC-seq used in this study are available from GSE174367. The colorectal cancer scRNA-seq data is available from GSE178318. All processed data in this manuscript are available at https://github.com/Starlitnightly/omicverse-reproducibility.
The code to reproduce the experiments of this manuscript is available at https://github.com/Starlitnightly/omicverse-reproducibility. The OmicVerse package can be found on GitHub at https://github.com/Starlitnightly/omicverse Documentation and tutorials can be found at https://omicverse.readthedocs.io.