Summary
We present hypergraph variational autoencoder (HyperG-VAE), a Bayesian deep generative model that leverages hypergraph representation to model single-cell RNA sequencing (scRNA-seq) data. The model features a cell encoder with a structural equation model to account for cellular heterogeneity and construct gene regulatory networks (GRNs) alongside a gene encoder using hypergraph self-attention to identify gene modules. The synergistic optimization of encoders via a decoder improves GRN inference, single-cell clustering, and data visualization, as validated by benchmarks. HyperG-VAE effectively uncovers gene regulation patterns and demonstrates robustness in downstream analyses, as shown in B cell development data from bone marrow. Gene set enrichment analysis of overlapping genes in predicted GRNs confirms the gene encoder’s role in refining GRN inference. Offering an efficient solution for scRNA-seq analysis and GRN construction, HyperG-VAE also holds the potential for extending GRN modeling to temporal and multimodal single-cell omics.
Keywords: gene regulatory networks, scRNA-seq, hypergraph representation learning
Highlights
• Hypergraph modeling enhances scRNA-seq representation by reducing sparsity
• Capturing cellular heterogeneity and gene modules uncovers complex GRN relationships
• HyperG-VAE surpasses benchmarks in predicting GRNs and identifying key regulators
• Tested on B cells, HyperG-VAE excels in gene regulation, clustering, and lineage tracing
Motivation
Gene regulatory networks (GRNs) derived from single-cell RNA sequencing (scRNA-seq) data provide insights into the complex interactions between transcription factors (TFs) and target genes. This capability enables a detailed understanding of gene expression regulation and cellular function across diverse populations. However, addressing both cellular heterogeneity and gene modules remains a significant challenge, as current GRN inference methods struggle to bridge this gap. To address this limitation, we introduce hypergraph variational autoencoder (HyperG-VAE), an algorithm based on hypergraph representation learning that captures latent correlations among genes and cells, enhancing GRN inference.
Su et al. present HyperG-VAE, a hypergraph-based model integrating cellular heterogeneity and gene modules for robust GRN inference from scRNA-seq data. By capturing latent gene-cell correlations, it overcomes sparsity, outperforms existing methods, and reveals key gene regulation patterns with robust downstream analysis.
Introduction
Gene regulatory networks (GRNs) within single-cell RNA sequencing (scRNA-seq) datasets present a sophisticated interplay of transcription factors (TFs) and target genes, uniquely capturing the modulation of gene expression and thereby delineating the intricate cellular functions and responses within diverse cell populations.1 GRNs illuminate core biological processes and underpin applications from disease modeling to therapeutic design,2,3,4 empowering researchers to interpret the mechanisms of gene interactions within cells and leverage this understanding for medical and biotechnological innovations.5,6
Numerous methodologies have emerged for inferring GRNs from single-cell transcriptomic data. These algorithms either emphasize co-expression networks in a statistical way (e.g., PPCOR7 and LocaTE8) or aim to decipher causal relationships between TFs and their target genes by analyzing gene interactions among cells (e.g., DeepSEM9 and PIDC10). Despite their achievements, these algorithms still have inherent limitations. Specifically, they mainly focus on cellular heterogeneity and overlook the importance of simultaneously considering cellular heterogeneity and gene module information in the model design. In terms of underlying principles, the methodologies can be divided into deep learning methods and traditional statistical algorithms. Many deep learning-based methodologies (e.g., DeepTFni11 and DeepSEM9) primarily build upon foundational models.12,13 These models frequently overlook the inherent relationships between cells and genes that domain expertise suggests, which often compromises explainability and narrows their application scope. Traditional statistical algorithms, such as Bayesian networks14,15 and ensemble methods,16,17,18 can be computationally expensive, and extending them to broader nonlinear paradigms remains a challenge.
Additionally, the scRNA-seq data are frequently marred by noise and incompleteness, attributable to phenomena such as amplification biases inherent to reverse transcription and PCR amplification processes,19,20 as well as the issue of low quantities of nucleic acids in single cells. To get a more robust GRN, several methodologies21,22 leverage multi-omic datasets, capturing different kinds of cellular information to enrich the model’s comprehensiveness. However, integrating multi-omic datasets presents substantial challenges, particularly regarding harmonizing data from disparate sources and platforms, and could also introduce additional noise.23
To address these problems and construct a reliable GRN, we model scRNA-seq data as a hypergraph and present hypergraph variational autoencoder (HyperG-VAE), a Bayesian deep generative model to process the hypergraph data. Distinct from current approaches, HyperG-VAE simultaneously captures cellular heterogeneity and gene modules (in GRN analysis, gene modules refer to clusters of genes that are regulated together by the same set of TFs) through its cell and gene encoders, respectively, during GRN construction. The two encoders employ variational inference to learn stochastic representations of genes and cells, offering a more flexible and robust approach to managing real-world data complexities. This could be particularly effective in handling noise in scRNA-seq datasets, a capability that has been demonstrated in previous studies.24,25,26 Within a shared embedding space, the dual encoders of our model interact, boosting its cohesiveness. This joint optimization elucidates gene regulatory mechanisms within gene modules across various cell clusters, thereby augmenting the model’s ability to delineate complex gene regulatory interactions and significantly improving its explainability.
Our study evaluates the performance of HyperG-VAE in various scRNA-seq applications. These include (1) GRN inference, (2) cell embedding, (3) gene embedding, and (4) gene regulation hypergraph construction. Through benchmark comparisons, encompassing tasks like GRN inference, data visualization, and single-cell clustering, we establish that HyperG-VAE outperforms existing state-of-the-art methods. Additionally, HyperG-VAE demonstrates its utility in elucidating the regulatory patterns governing B cell development in bone marrow. Our model also excels in learning gene expression modules and cell clusters, which connect the gene encoder and cell encoder individually to boost gene regulatory hypergraph prediction. This integrated functionality of HyperG-VAE improves our comprehension of single-cell transcriptomic data, ultimately providing better insights into the realm of GRN inference.
Results
Framework overview
We introduce HyperG-VAE, a Bayesian deep generative model specifically designed to address the complex challenge of gene regulation network inference using scRNA-seq data, which are represented as a hypergraph (Figure 1; STAR Methods). Our HyperG-VAE takes into account the interplay between gene modules and cellular heterogeneity, allowing for a more accurate representation of cell-specific regulatory mechanisms. This interplay could be incorporated into a hypergraph to capture the nuanced interactions of genes across diverse cellular states.
Figure 1.
Overview of HyperG-VAE
(A) HyperG-VAE takes the expression value matrix derived from scRNA-seq data as input. In the provided table, four cells exhibit expression across fifteen genes, with color gradients indicating varying gene expression levels (white circles mean no expression).
(B) The colored circles with serial numbers denote distinct genes, expressed within specific cells, functioning as interlinked nodes. These nodes are interconnected by a singular hyperedge (small dashed ellipses) symbolizing the cell (triangle). Together, these nodes and hyperedges form a hypergraph structure. Node coloration reflects a composite of gene expression levels of the given gene across cells; for instance, gene 3 manifests a blend of green and blue hues. The largest dashed ellipse is the genome shared by all cells.
(C) The neural network architecture of HyperG-VAE, where two encoders are designed to process the provided input matrix. The cell encoder uses the structural equation model (SEM) to discern cellular heterogeneity and form the GRN, while the gene encoder, employing a hypergraph self-attention mechanism, focuses on gene module analysis. The decoder subsequently reconstructs the input matrix, leveraging the shared latent space of both gene and cell embeddings. The inferred gene regulation hypergraph integrates cellular and gene representations, drawing on relationships derived from the learned GRN.
(D–G) Downstream tasks that can be pursued by HyperG-VAE include GRN construction, clustering both cells and genes, and modeling the interplay between gene modules and cellular heterogeneity. Further details are provided in the legend, located in the top right corner.
In the context of hypergraphs, we construct the hypergraph by representing cells as individual hyperedges, with the genes expressed in each cell serving as the nodes included within those hyperedges (Figures 1A and 1B). Specifically, let denote the scRNA-seq expression matrix, where m is the number of cells and n is the number of genes. The incidence matrix encodes the hypergraph structure: if a gene (node) i is expressed in cell (hyperedge) j (), then . This construction captures the relationship between cells and their expressed genes, enabling the sparse scRNA-seq data to be effectively represented as a hypergraph.
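The construction described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the names `X`, `H`, and `incidence_from_expression`, the gene-by-cell orientation of `H`, and the "count greater than zero" rule for calling a gene expressed are all assumptions made for the example.

```python
import numpy as np

def incidence_from_expression(X):
    """Build a binary hypergraph incidence matrix from an expression matrix.

    X has shape (m cells, n genes). The returned H has shape (n genes, m cells):
    H[i, j] = 1 if gene (node) i is expressed in cell (hyperedge) j.
    Assumption: "expressed" means a strictly positive value.
    """
    X = np.asarray(X)
    return (X.T > 0).astype(np.int8)

# Toy expression matrix: 3 cells (rows) x 4 genes (columns).
X = np.array([[5, 0, 2, 0],
              [0, 1, 3, 0],
              [0, 0, 4, 7]])

H = incidence_from_expression(X)
node_degree = H.sum(axis=1)  # number of cells (hyperedges) containing each gene
edge_degree = H.sum(axis=0)  # number of genes (nodes) inside each cell
```

Here each column of `H` lists the genes grouped by one cell, so a single hyperedge can connect any number of genes at once, which is what distinguishes this representation from a pairwise gene-gene graph.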
HyperG-VAE incorporates two encoders, a cell encoder and a gene encoder, enabling it to learn the hypergraph representation (Figure 1C) with structure . The cell encoder generates cell representations in the form of hypergraph duality, facilitating the embedding of high-order relations via structural equation layers. GRN construction (Figure 1D) is realized in this structural equation layer through a learnable causal interaction matrix. In addition, the cell encoder can adeptly capture the gene regulation process in a cell-specific manner, elucidating a clearer landscape of cellular heterogeneity (Figure 1E). The gene encoder is specifically designed to process observed gene representations, denoted as . Given that genes within a module generally manifest consistent expression profiles across cells, we employ a multi-head self-attention mechanism that is specifically designed for the hypergraph in this work. This not only discerns varying gene expression levels but also assigns appropriate weights to the genes expressed in the same cell during the message-passing phase. Thus, the gene encoder enhances the model’s ability to understand and integrate the intricate interdependencies among genes, thereby aiding in the effective embedding of gene clusters (Figure 1F). Finally, a hypergraph decoder is utilized to reconstruct the original topology of the hypergraph (Figure 1G) using the learned latent embedding of genes and cells. Utilizing the reconstructed hypergraph and the learned inter-gene relationships, we can also infer a gene regulatory hypergraph (Figure 1G). This hypergraph encompasses gene regulatory modules that span across various cell stages.
HyperG-VAE enhances GRN inference by incorporating the above two encoders to mutually augment each other’s embedding quality (Figure 1C) while preserving the high-order gene relations among cells, constrained by the hypergraph variational evidence lower bound (STAR Methods). Specifically, the cell encoder incorporates a structural equation model (SEM) on gene co-expression space to infer the GRNs; the learning of gene modules by the gene encoder aids in the inference of GRNs since the gene module conceivably incorporates TF-target regulation patterns. By integrating the embedding of genes and cells through joint learning, we observe substantial performance gains in downstream tasks (Figures 1D–1G), including the inference of GRNs, cell clustering, gene clustering, and interplay characterization between gene modules and cellular heterogeneity, among others.
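To make the role of the learnable interaction matrix concrete, the following sketch shows a generic linear SEM of the form X = XW + Z. It is an assumption that this matches the layer in HyperG-VAE; the exact formulation is given in the STAR Methods, which are not reproduced here. The matrix `W` is random rather than learned, and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells, n_genes = 6, 4
X = rng.normal(size=(n_cells, n_genes))  # toy expression matrix (cells x genes)

# W[i, j] stands for the effect of gene i on gene j. In a trained model W is
# learned; here it is random, with a zero diagonal (no self-regulation).
W = rng.normal(scale=0.1, size=(n_genes, n_genes))
np.fill_diagonal(W, 0.0)

I = np.eye(n_genes)
Z = X @ (I - W)                    # encoder side: residual after removing regulation
X_rec = Z @ np.linalg.inv(I - W)   # decoder side: regulation added back

# Candidate regulatory edges are ranked by the magnitude of W.
order = np.argsort(np.abs(W), axis=None)[::-1]
top_edges = [tuple(int(v) for v in np.unravel_index(k, W.shape)) for k in order[:3]]
```

In this form the encoder and decoder are exact inverses, so any reconstruction error in a trained model comes from the stochastic latent representation rather than from the SEM layer itself.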
HyperG-VAE achieves accurate prediction of GRNs
We evaluate the GRN inference performance of HyperG-VAE based on the setting of the BEELINE framework.27 Our evaluation encompassed seven scRNA-seq datasets. These include two cell lines from humans and five mouse cell lines (more details can be found in the supplemental information). We evaluate GRN performance using two metrics: EPR, which assesses the enrichment of true positives among the top K predicted edges relative to random predictions, and AUPRC, which measures the area under the precision-recall curve to account for class imbalance. These metrics are applied across four types of ground-truth datasets: STRING,28 non-specific chromatin immunoprecipitation (ChIP)-seq,29,30,31 cell-type-specific ChIP-seq,32,33,34 and loss-/gain-of-function (LOF/GOF) networks.34 As recommended by Pratapa et al.,27 our analysis for each dataset prioritized the most variable TFs and the top N most-varying genes, where N is set to 500 and 1,000. We selected seven state-of-the-art baseline algorithms based on the evaluation of BEELINE to compare with HyperG-VAE: DeepSEM,9 GENIE3,17 PIDC,10 GRNBoost2,18 SCODE,35 PPCOR,7 and SINCERITIES.36
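The two metrics can be illustrated with a short sketch. It assumes that candidate edges are given as a flat list of scores with matching binary ground-truth labels, and it approximates AUPRC by step-wise average precision; the BEELINE implementation may interpolate the curve differently. The function names are hypothetical.

```python
import numpy as np

def early_precision_ratio(scores, truth):
    """Precision among the top-K scored edges divided by the random baseline.

    K is the number of true edges; the random baseline is the edge density.
    """
    scores = np.asarray(scores, dtype=float)
    truth = np.asarray(truth, dtype=bool)
    k = int(truth.sum())
    if k == 0:
        raise ValueError("ground truth contains no edges")
    top_k = np.argsort(-scores, kind="stable")[:k]
    return truth[top_k].mean() / truth.mean()

def auprc_ratio(scores, truth):
    """Step-wise average precision divided by the random baseline (edge density)."""
    scores = np.asarray(scores, dtype=float)
    truth = np.asarray(truth, dtype=bool)
    if not truth.any():
        raise ValueError("ground truth contains no edges")
    hits = truth[np.argsort(-scores, kind="stable")]
    precision_at_rank = np.cumsum(hits) / np.arange(1, hits.size + 1)
    return precision_at_rank[hits].mean() / truth.mean()

# Toy example: 8 candidate edges, 2 of which are true.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
truth = [1, 0, 1, 0, 0, 0, 0, 0]
epr = early_precision_ratio(scores, truth)
aupr = auprc_ratio(scores, truth)
```

A ratio of 1 corresponds to a random predictor under both metrics, so values above 1 indicate enrichment of true edges among the highly ranked predictions.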
Overall, HyperG-VAE demonstrates a discernible enhancement in performance when compared with other baseline methods in terms of both AUPRC and EPR metrics (Figures 2 and S2). For scaled results of datasets composed of all significantly varying TFs and the 500 most-varying genes (as shown in Figure 2), HyperG-VAE surpasses the seven other benchmarked methods in 40 of the 44 () evaluated conditions. Compared with the second-best method (DeepSEM), HyperG-VAE enhances the results by at least in 19 out of the 44 benchmarks. Furthermore, in comparison to other commendable approaches, such as PIDC and GENIE3, our approach registered significant enhancements. For PIDC, 38 out of 44 instances showed improvements of over , with 27 surpassing and 22 going beyond . Similarly, with GENIE3, 33 out of 44 instances marked at least a enhancement, 26 surpassed , and an impressive 20 recorded at least a increase. For results of datasets composed of all significantly varying TFs and the 1000 most-varying genes (Figure S2), HyperG-VAE achieves the best prediction performance on () of the benchmarks. In comparison to the runner-up method, DeepSEM, HyperG-VAE outperforms by a margin of at least in 17 of the 44 evaluated benchmarks. Notably, the average enhancement in EPR stands at , while that in AUPRC is .
Figure 2.
Benchmarks of different GRN inference methods on experimental scRNA-seq datasets by EPR and AUPRC scores
The performance of HyperG-VAE is contrasted against seven alternative algorithms across seven datasets. Each dataset comprises all significantly varying transcription factors (TFs) and the 500 most-varying genes. These evaluations are based on four distinct ground-truth benchmarks: non-specific ChIP-seq, STRING, cell-type-specific ChIP-seq, and LOF/GOF. For each figure pair, the left image depicts the median EPR results, while the right image shows the median AUPRC outcomes. Results inferior to random predictions are excluded from the visualizations for clarity. The color scale in each dataset is normalized between 0 and 1 using a min-max scaling approach. EPR is defined as the odds ratio of true positives among the top K predicted edges, where K represents the number of edges in the ground-truth GRN, compared to random predictions. Similarly, the AUPRC ratio reflects the odds ratio of the AUPRC value between the model and random predictions.
With single-cell sequencing data, robustly inferring GRNs from limited cells is pivotal, especially for capturing rare cellular phenotypes and transient states.9,11 Here, we explore the fluctuations in EPR performance and the robustness of HyperG-VAE when confronted with limited training data (Figure S3A). We constructed mouse embryonic stem cell (mESC) datasets37 composed of all significantly varying TFs and the 500 and 1,000 most-varying genes, respectively, and evaluated the accuracy based on four unique ground-truth benchmarks by randomly subsampling single cells following the BEELINE benchmark.27 Upon adjusting the number of subsampled single cells to 400, 300, 200, 100, and 50, we registered average performance retentions of , , , , and , respectively. Remarkably, when training with cell counts exceeding 100, a robust () retained more than of their performance, and for counts greater than 50, a compelling () maintained above efficacy. When utilizing cell-type-specific ChIP-seq as the benchmark, the performance remains notably stable, with an average performance retention of 93%. Furthermore, when assessed against the other three ground truths and the training cell count exceeds 50, there is only a modest decline in efficacy, averaging performance retention in comparison to the median value derived from all cells. Beyond performance evaluation, we also examined HyperG-VAE’s scalability with expansive datasets (Figure S3B).
HyperG-VAE reveals the gene regulation patterns of B cell development in bone marrow
To evaluate HyperG-VAE’s proficiency in elucidating GRNs and to assess the effectiveness of both cell clustering embedding and gene module embedding components within HyperG-VAE, we deployed HyperG-VAE on scRNA-seq data of B cell development in bone marrow38 (more details of the data can be found in the STAR Methods and Table S3), as illustrated in Figure 3. The progression of B cell development from hematopoietic stem cells follows a sequential yet adaptable developmental pathway governed by interactions among environmental stimuli, signaling cascades, and transcriptional networks.39 Throughout this developmental trajectory, TFs play a pivotal role in regulating the cell cycle, differentiation, and advancement to subsequent developmental stages. These critical checkpoints encompass the initial commitment to lymphocytic progenitors, the specification of pre-B cells, progression through immature stages, entry into the peripheral B cell pool, B cell maturation, and subsequent differentiation into plasma cells.40 Each of these regulatory nodes is controlled by complex transcriptional networks, which, along with sensing and signaling systems, determine the final outcomes.
Figure 3.
GRN prediction by HyperG-VAE across developmental B cell states in bone marrow
(A) t-distributed stochastic neighbor embedding (t-SNE) visualization of cell embedding on the bone marrow B cell dataset; the embedding is learned by the cell encoder of HyperG-VAE. Black lines depict the trajectory from pre-pro-B cells to mature B cells.
(B) Heatmap/dot plot showing TF expression of the regulon on a color scale and cell-type specificity (RSS [regulon specificity score]) of the regulon on a size scale.
(C) The accuracy of GRN prediction by cross-validation with publicly available ChIP-seq datasets. The overlap coefficient quantifies the concordance between sets of target genes for each TF, as derived from GRN prediction and ChIP-seq database, respectively. The x axis represents the difference value of overlap coefficients between HyperG-VAE and SCENIC (default). Pink lines indicate superior performance by HyperG-VAE, while blue lines favor the default SCENIC. The dot plot illustrates the overlap coefficient of the more effective approach for each regulon, depicted on a color gradient. Cell states are arranged in a sequence that reflects the progression of bone marrow B cell development stages.
(D) The GRN visualization for the bone marrow B cell dataset with ten states from pre-pro-B state to plasma state, as delineated by HyperG-VAE; the inner circle shows the co-binding of shared target genes, while the outer circle presents TF-focused target genes.
HyperG-VAE uncovers the cell embedding by dimensionality reduction and distinctly segregates the primary cell types across various stages of bone marrow B cell development (Figure 3A). Significantly, HyperG-VAE also effectively captures the linear progression of B cell development, spanning from early pro-B, late pro-B, large pre-B, small pre-B, and immature B to mature B cells. In our pursuit to unveil the gene regulation patterns in developmental B cells, our HyperG-VAE, in conjunction with SCENIC,41 successfully identifies established master regulators associated with different developmental stages (Figures 3B and S4), including pre-pro-B (Runx2), pro-B (Ebf1 and Lef1), large pre-B (Myc and Hmgb2), small pre-B (Tcf3 and Sox4), immature B (Relb and Egr1), mature B (Nfkb2), and plasma (Cebpb and Prdm1) cells.
Furthermore, we conducted a benchmark assessment to compare the performance of HyperG-VAE against SCENIC using its default settings. Using the ChIP-seq database,33 the accuracy was evaluated based on the degree of overlap coefficient between the ChIP-seq coverage and the predicted target genes from both methods. Our HyperG-VAE, when combined with SCENIC, demonstrates superior performance compared to the standard SCENIC approach, exhibiting higher accuracy in detecting TF-target patterns for the key TFs (as illustrated in Figure 3C). The comprehensive gene regulation network spanning the developmental B cells in the bone marrow is depicted in Figure 3D. We find that the GRNs show TF-target regulation patterns in two ways: TFs co-binding to shared predicted enhancers (the inner circle in Figure 3D) and TF-specific target genes (the outer circle in Figure 3D). We also observe that the cooperativity between TFs is stronger within cell types along the development path, indicating that some TFs are involved in multiple stages of B cell development.
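As a small illustration of the overlap coefficient, the sketch below uses the standard Szymkiewicz-Simpson definition (intersection size divided by the size of the smaller set). That this is the exact formula used in the study is an assumption, and the gene identifiers are placeholders rather than real targets.

```python
def overlap_coefficient(a, b):
    """Szymkiewicz-Simpson overlap: |A & B| / min(|A|, |B|); 0.0 if either set is empty."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

# Placeholder target sets for one TF (hypothetical identifiers).
chipseq_targets = {"g1", "g2", "g3", "g4", "g5"}
predicted_by_method_a = {"g1", "g2", "g3", "g9"}
predicted_by_method_b = {"g1", "g7", "g8", "g9"}

oc_a = overlap_coefficient(predicted_by_method_a, chipseq_targets)
oc_b = overlap_coefficient(predicted_by_method_b, chipseq_targets)
difference = oc_a - oc_b  # positive values favor method A for this TF
```

Because the denominator is the smaller set, a short but accurate target list is not penalized for the much larger number of genes typically covered by ChIP-seq.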
Gene expression module learning enhances HyperG-VAE in GRN inference
Our HyperG-VAE model augments the GRN prediction by integrating gene space learning, as depicted in Figure 4C. HyperG-VAE uncovers the gene expression modules visualized by uniform manifold approximation and projection (UMAP)42 in Figure 4A. By associating these gene modules with the key TFs and corresponding target genes of pathways along B cell development, we annotate the gene modules with specific cell types, indicating that these gene clusters are activated in different stages of developmental B cells (Figures 4A and 4B).
Figure 4.
The interplay between gene embedding and cell clusters
(A) Gene embedding by the gene encoder of HyperG-VAE on developmental B cell data. Gene clusters encoded by numbers are associated with different cell types by colors.
(B) The heatmap illustrates normalized overlap values between gene clusters and TF regulons from different B cell states. Here, genes serve as a bridge to compute the overlap, with lighter colors representing larger overlap scores.
(C) t-SNE visualization of cellular embeddings with highlighted pre-BCRi B cell state, together with the associated regulon, Phf8(+), and related target genes.
(D) Pathway enrichment analysis on gene cluster 5 with associated molecular complex detection (MCODE) network components.
(E) Pathway enrichment analysis of different gene clusters. The shaded pathways show the dominant gene programs for each gene cluster.
We further apply gene set enrichment analysis (GSEA)43 (STAR Methods) to investigate the gene clusters (Figures 4E and S1; Tables S5 and S6). The pathways identified through GSEA validate the accuracy of our gene cluster annotations. For example, large pre-B cells (cluster pre-BCRi [B cell receptor independent] B) are associated with signals initiating diverse processes, which include proliferation and recombination of the light-chain gene44; the GSEA results show the related pathways: lymphocyte proliferation, cell activation, and B cell receptor signaling pathway. Immature B cells exhibit B cell central tolerance, which is governed by mechanisms such as receptor editing and apoptosis.45 The pathways identified in the corresponding gene clusters include antigen processing and presentation of exogenous peptide antigen, DNA damage response, regulation of cell killing, and apoptotic signaling pathways. Plasma cells are terminally differentiated B lymphocytes that secrete immunoglobulins, also known as antibodies.46 Considering the substantial demands placed on these cells for secretory biological processes, the pathways associated with the relevant gene cluster shed light on the cellular response to endoplasmic reticulum stress.
We show that the gene modules are associated with different biological pathways during B cell development in the bone marrow. These gene modules implicitly incorporate the gene regulation patterns, leading to different cell types. On the other hand, distinct cell types of B cell clustering are engaged in various immunological environments,39,47 resulting in different signaling pathways for B cell activation and fate decisions. We exemplify this joint relationship with an example involving B cells at the large pre-B stage, as shown in Figures 4C and 4D. This specific cell state (Figure 4C) is characterized by gene regulation patterns associated with cell proliferation, reflected by the regulon Phf8(+).48 The corresponding gene cluster (Figure 4D) is linked to a molecular complex detection (MCODE) network, which belongs to the lymphocyte proliferation pathway (Figure 4E) and shares target genes with the Phf8(+) regulon.
Therefore, our HyperG-VAE reciprocally integrates these two concepts, cell clustering and gene module detection, with the aim of revealing GRNs. Concretely, the cell embedding process groups together similar cells that share common pathways, while the gene modules aggregate genes exhibiting similar regulation patterns, thereby enhancing the accuracy of GRN computations.
HyperG-VAE constructs the cell-type-specific GRN on B cell development in bone marrow
We have demonstrated that gene modules associated with various biological pathways correspond to distinct cell types within B cell development in bone marrow. Essentially, these distinct gene regulation patterns influence cell fate commitment, leading to the development of diverse cell types with varying gene regulation profiles. Thus, we employ HyperG-VAE to investigate each individual state of developmental B cells and construct a more accurate GRN for B cells at specific developmental stages, as illustrated in Figure 5. B cell development in the bone marrow can be broadly categorized into four states: pro-B, large pre-B, small pre-B, and mature B.40 These four stages are visualized using UMAP, as depicted in Figure 5B. For each of these states, we employed HyperG-VAE to compute GRNs and uncover the predominant regulatory patterns, as illustrated in Figures 5A–5C. HyperG-VAE effectively reveals the key TFs and their associated target genes within each cell state. For example, in the pro-B state, Ebf149 and Pax550 play significant roles, while Myc38 stands out in the large pre-B state; Bach238,51 is crucial in the small pre-B state, and Klf252 and Ctcf53 are notable in the mature state.
Figure 5.
Cell-type-specific GRN analysis of developmental B cells in bone marrow
(A) The Sankey diagram shows significant regulons and corresponding target genes of different states along B cell development in bone marrow, with the normalized enrichment score (NES) encoded by color shade and the area under the curve (AUC) score by dot size.
(B) Gene regulatory hypergraph at the cell clustering level, illustrating the four principal B cell states as four hyperedges. Conserved TFs are highlighted with red dots, and target genes are depicted as diamonds, where size reflects the log fold change (logFC) in gene expression of a given state compared to others. Highly expressed genes are labeled in the figure.
(C) The motif of significant TFs along the principal stages. Additional motif details can be found in Figure S5.
(D) Heatmap displays the expression of the top genes, ranked by logFC, across cells classified into four distinct cell states. The genes are selected by the overlap of top logFC genes and predicted target genes. The genes’ color corresponds to the cell states in which the regulation pattern is predicted.
(E) Volcano plot of differentially expressed genes of different states. The blue inverted triangles denote downregulated genes, and the red triangles denote upregulated genes.
The aforementioned TFs, along with their respective target genes, collectively constitute the regulons that characterize the four major cell states, allowing for the construction of a gene regulatory hypergraph at the cell clustering level (Figures 5A and 5B). For each major state, we overlap the top-predicted target genes by HyperG-VAE (Figure 5B) with the differentially expressed genes (DEGs; Figure 5E) and identify the principal marker genes (Figure 5D). Specifically, Ebf1 and Pax5 are essential in the pro-B state of bone marrow to maintain an early B cell phenotype characterized by the expression of B cell-specific genes such as Vpreb and Igll1 for surrogate light-chain production.49,50 In the large pre-B stage, the enriched regulons encompass the TF Myc38 and other genes related to the cell cycle, such as Mki67, Cenpf, Cenpa, and Hmgb2. Additionally, nucleosome-related genes, such as Hist1h2ae and Hist1h3c, are also enriched in this state due to the high rate of cell proliferation. In the small pre-B stage, both Bach2 and Btg1 restrain cell proliferation.54,55 It is noteworthy that the mature state markers H2-Ab1, H2-Eb1, H2-Aa, and Cd74 are assigned as target genes in the pro-B stage, suggesting that these genes may be actively repressed in the early B cell development stage.
HyperG-VAE addresses cellular heterogeneity and learns the cell representation
Cellular heterogeneity is a hallmark of complex biological systems, manifesting as diverse cell types and states within scRNA-seq datasets.56 We hypothesize that the latent space inferred by the cell encoder of HyperG-VAE captures this biological variability among cells. Leveraging domain expertise, we can map these clusters to known cell types or states, ensuring that the computational predictions align with manual inspection and annotation. To evaluate the performance, we applied HyperG-VAE to three biologically relevant scRNA-seq datasets, including an Alzheimer’s disease (AD) dataset,57 a colorectal cancer dataset,58 and the widely used mouse brain dataset, known as the Zeisel dataset59 (more dataset details can be found in Table S4). To benchmark HyperG-VAE, we also compared its low-dimensional embeddings with those of six other algorithms: autoCell,60 DCA,61 scVI,62 DESC,63 SAUCIE,64 and scVAE.65 We followed the Louvain algorithm66 to cluster all the single cells into an identical number of clusters for each method (STAR Methods). To assess the precision of clustering against established reference labels, we employed four metrics: the adjusted Rand index (ARI), normalized mutual information (NMI), homogeneity (HOM), and completeness (COM). These metrics span a scale from 0, indicating random clustering, to 1, signifying perfect alignment with reference clusters, with superior values indicating enhanced accuracy.
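The four clustering metrics can be computed from a contingency table of reference labels against predicted clusters. The sketch below is a NumPy-only illustration using the standard definitions (NMI normalized by the arithmetic mean of the two entropies); whether the study used exactly these normalization choices is an assumption, and the function names are hypothetical.

```python
import numpy as np

def contingency(labels_true, labels_pred):
    """Count table: rows are reference classes, columns are predicted clusters."""
    _, t = np.unique(np.asarray(labels_true).ravel(), return_inverse=True)
    _, p = np.unique(np.asarray(labels_pred).ravel(), return_inverse=True)
    table = np.zeros((t.max() + 1, p.max() + 1), dtype=np.int64)
    np.add.at(table, (t, p), 1)
    return table

def entropy(counts):
    """Shannon entropy (natural log) of a vector of counts."""
    probs = counts[counts > 0] / counts.sum()
    return float(-(probs * np.log(probs)).sum())

def pairs(x):
    """Number of unordered pairs that can be drawn from x items (elementwise)."""
    return x * (x - 1) / 2.0

def clustering_scores(labels_true, labels_pred):
    c = contingency(labels_true, labels_pred)
    a, b, n = c.sum(axis=1), c.sum(axis=0), c.sum()
    h_true, h_pred = entropy(a), entropy(b)
    mi = h_true + h_pred - entropy(c.ravel())  # mutual information
    hom = 1.0 if h_true == 0 else mi / h_true
    com = 1.0 if h_pred == 0 else mi / h_pred
    nmi = 1.0 if h_true + h_pred == 0 else 2.0 * mi / (h_true + h_pred)
    # Adjusted Rand index: pair-counting agreement corrected for chance.
    sum_c, sum_a, sum_b = pairs(c).sum(), pairs(a).sum(), pairs(b).sum()
    expected = sum_a * sum_b / pairs(n)
    max_index = 0.5 * (sum_a + sum_b)
    ari = 1.0 if max_index == expected else (sum_c - expected) / (max_index - expected)
    return {"ARI": float(ari), "NMI": float(nmi), "HOM": float(hom), "COM": float(com)}

perfect = clustering_scores([0, 0, 1, 1], [1, 1, 0, 0])    # same partition, relabeled
unrelated = clustering_scores([0, 0, 1, 1], [0, 1, 0, 1])  # partition carries no information
```

All four scores are invariant to how cluster labels are numbered, which is why the relabeled partition in `perfect` still scores 1 on every metric.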
Overall, the performance of HyperG-VAE surpasses that of its counterparts, as evidenced in Figure 6A. Specifically, for the Zeisel dataset, the clusters generated using HyperG-VAE align more closely with the existing cell type annotations, registering an NMI of and an ARI of . In comparison, the next best-performing algorithm, autoCell, recorded an NMI of and an ARI of . Furthermore, we evaluated HyperG-VAE’s latent space to determine its ability to capture the biological diversity among individual cells in the Zeisel dataset, as illustrated in Figure 6B. We visualized the data embedding by UMAP. In previous sections, we showed that HyperG-VAE effectively identifies gene expression differences within cells of the same type. The compact UMAP highlights distinct clusters while preserving intra-cluster heterogeneity, demonstrating its ability to capture both inter- and intra-cluster variability. Compared to other algorithms, the distinct separation observed with HyperG-VAE across most clusters indicates effective clustering, suggesting that HyperG-VAE’s cell encoder adeptly distinguishes between various cell states or types. While algorithms such as autoCell, scVI, and scVAE have achieved results that are comparable, the differentiation between their clusters is not as pronounced as with HyperG-VAE. For the remaining algorithms, the substantial overlap among clusters hinders the classifier from producing optimal results. Specifically, compared to other methodologies founded on conventional single-layer VAEs, the enhanced visualization capabilities of HyperG-VAE underscore the potential benefits of incorporating gene modules in cell embedding processes.
Figure 6.
Benchmarks of single-cell clustering and embedding
(A) Cell clustering performance of HyperG-VAE on the single-cell datasets compared with six baseline methods on four metrics: NMI, ARI, COM, and HOM. NMI, normalized mutual information; ARI, adjusted Rand index; COM, completeness; HOM, homogeneity. For all four metrics, higher values are better.
(B) UMAP visualization of latent representations on the Zeisel dataset for different methods. The black circles highlight areas with ambiguous classification boundaries.
Discussion
In this work, we introduce HyperG-VAE, a sophisticated model designed for the construction of GRNs. Uniquely, HyperG-VAE leverages a hypergraph framework, wherein genes expressed within individual cells are represented as nodes connected by distinct hyperedges, capturing the latent gene correlations among single cells. As a key algorithmic innovation of HyperG-VAE, the transformation of scRNA-seq data into a hypergraph offers unique advantages compared to existing GRN inference methods. These advantages include improved modeling of cellular heterogeneity, enhanced analysis of gene modules, increased sensitivity to gene correlations among cells, and improved visualization and interpretation of GRNs. This direct use of a hypergraph, as opposed to traditional pairwise methods like star expansion (SE) and clique expansion (CE),67 captures complex multi-dimensional relationships more effectively, avoiding the increased complexity and information loss associated with SE and CE. By maintaining the hypergraph’s original form, HyperG-VAE preserves the data’s full complexity and integrity, enhancing analytical depth and reducing computational demands.
In addition to modeling scRNA-seq data in a hypergraph, HyperG-VAE effectively integrates gene modules and cellular heterogeneity, demonstrating superior performance compared to existing methods. On the one hand, our study reveals that HyperG-VAE outperforms related existing state-of-the-art algorithms in GRN inference, cell type classification, and visualization tasks, respectively, as evidenced by its enhanced performance across several widely recognized benchmark datasets. On the other hand, we also utilize HyperG-VAE on scRNA-seq data of B cell development in bone marrow38 to evaluate its performance in a biologically relevant context. Firstly, HyperG-VAE achieves accurate prediction of GRNs and successfully identifies key master regulators and target genes across different developmental stages. Meanwhile, we cross-validated our results with publicly available ChIP-seq datasets,33 further demonstrating HyperG-VAE’s robust performance in predicting regulons based on GRN inference. Secondly, subsequent evaluations across various tasks further highlighted the effectiveness of HyperG-VAE’s carefully designed encoder components, with their synergistic interaction significantly bolstering the model’s overall performance. Specifically, the cell encoder within HyperG-VAE predicts the GRNs through a structural equation model while also pinpointing unique cell clusters and tracing the developmental lineage of B cells; the gene encoder uncovers gene modules that implicitly encapsulate patterns of gene regulation, thereby enhancing the accuracy of GRN predictions. To demonstrate this interaction, we highlight the shared genes between gene clusters and the predicted target genes within cell clusters. These shared genes are notably present in pathways identified by GSEA, signifying the connections between gene modules identified by gene encoders and cell clusters delineated by cell encoders.
Our proposed model, HyperG-VAE, holds promise as a foundational framework, adaptable to a multitude of biological contexts in future research endeavors. A promising future direction is extending HyperG-VAE into a heterogeneous hypergraph VAE by incorporating additional omics data, such as single-cell ChIP-seq. Integrating single-cell ChIP-seq data into HyperG-VAE would enhance GRN construction, complementing transcriptomic data and revealing upstream regulatory events. This kind of integration would enable seamless multi-omics data fusion and further advance GRN inference and cellular regulation analysis. Additionally, while the present model does not explicitly use metadata for genes and cells, future enhancements that integrate these metadata into the hypergraph-centric framework could significantly improve the representations of nodes (genes) and hyperedges (cells). The weights assigned to these hyperedges can also be factored into the model’s learning phase, offering a more comprehensive analysis. In the generative phase of HyperG-VAE, gene-cell interactions proceed through a cohesive mechanism, facilitating the development of a robust GRN underscored by the interplay between gene modules and cell clusters. Moreover, advancing to a single-cell-level, fine-grained gene-coexpression hypergraph study could further enhance our understanding of single-cell dataset analysis. Furthermore, subsequent research could explore the dynamic construction of temporal GRNs on chronological single-cell data, drawing upon the foundational principle of simultaneously considering cellular heterogeneity and gene modules, as demonstrated in this work.
Overall, HyperG-VAE provides a competitive solution for GRN construction and related downstream works. By combining cell-specific GRN inference, hypergraph-based gene module identification, and integrated cell-gene latent representations, HyperG-VAE delivers biologically relevant insights that extend beyond traditional GRN methods. It provides researchers with powerful tools to explore cell-specific regulatory mechanisms, identify gene modules, and generate testable predictions, thereby advancing our understanding of complex biological systems.
Limitations of the study
HyperG-VAE leverages the self-attention mechanism, which has propelled models to remarkable performance.68,69,70,71 However, self-attention-based models still have inherent limitations. Specifically, the quadratic complexity of self-attention with respect to sequence length presents challenges: for sequences of length N, it requires $O(N^2)$ computations, making it computationally demanding and memory inefficient, especially for longer sequences. Future efforts to address this limitation will explore adapting sparse factorization of the attention matrix and positive orthogonal random features, as demonstrated in previous studies,72,73 to ease computational demands.
Resource availability
Lead contact
Requests for resources of this article and any additional information should be directed to the lead contact, Wenjie Zhang (wenjie.zhang@unsw.edu.au).
Materials availability
This study did not generate new unique reagents.
Data and code availability
• All datasets used in the work have been summarized in the key resources table.
• All original code has been deposited at https://github.com/guangxinsuuu/HyperG-VAE and is publicly available at https://doi.org/10.5281/zenodo.15028720 as of the date of publication.
• Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request.
Acknowledgments
This work was supported by the ARC Centre of Excellence for the Mathematical Analysis of Cellular Systems (CE230100001), the Australian National Health and Medical Research Council (NHMRC) grant GNT2009554, and Children’s Hospital Foundation philanthropic funding.
Author contributions
G.S., Y.Y., and H.W. designed the experiments. G.S., H.W., W.Z., Y.Y., and P.F.C. performed the experiments and analyzed the data. G.S., Y.Y., Y.Z., D.Y., M.R.W., and W.Z. performed the statistical analysis. D.Y., M.R.W., and W.Z. provided conceptual guidance. W.Z. and Y.Y. supervised the study. G.S., H.W., Y.Y., M.R.W., and W.Z. wrote the manuscript and prepared figures and tables. All authors reviewed and edited the manuscript.
Declaration of interests
The authors declare no competing interests.
STAR★Methods
Key resources table
Method details
Preliminaries
Notation
Given a hypergraph , where denotes the set of nodes, and is the set of hyperedges. Within the hypergraph framework, it is possible for numerous nodes to be interconnected by a solitary hyperedge. Aligning the hypergraph framework with the gene regulation networks (GRNs) paradigm, the expressed genes are mapped as nodes while individual cells stand in as the hyperedges, thus crafting a representation of the cellular architecture as a hypergraph. We aim to approximate the real-world regulatory network by learning a causal interaction matrix through HyperG-VAE. Both and are square matrices, with their elements representing the levels of regulatory interaction between pairs of genes. In the context of hypergraphs, let represent the expression matrix of scRNA-seq dataset, where m represents the number of cells and n indicates the number of genes. and signify the incidence matrix. The matrix is also of size . If node i is linked to hyperedge j (gene i expressed in cell j), then and . In the absence of such a link, both and are set to 0. For the hypergraph , its dual is defined as . Here, and comprises sets where each corresponds to edges in that contain node . As a direct consequence, the feature matrix of the dual, , is the transpose of the feature matrix of .
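The mapping from an expression matrix to a hypergraph can be illustrated with a minimal sketch. It assumes genes are rows and cells are columns, and that "expressed" means a value above zero; the function name `incidence_from_expression` and the `threshold` parameter are hypothetical and not taken from the HyperG-VAE code base.

```python
import numpy as np

def incidence_from_expression(expr, threshold=0.0):
    """Binary gene-by-cell incidence matrix: entry (i, j) is 1 if gene i
    is expressed in cell j (value above `threshold`), else 0."""
    expr = np.asarray(expr, dtype=float)
    return (expr > threshold).astype(int)

# Toy data (made up): 3 genes (nodes) x 4 cells (hyperedges).
expr = np.array([[0.0, 2.0, 0.0, 1.0],
                 [3.0, 0.0, 0.0, 4.0],
                 [0.0, 0.0, 5.0, 0.0]])

H = incidence_from_expression(expr)
H_dual = H.T                   # dual hypergraph: cells become nodes, genes become hyperedges
node_degree = H.sum(axis=1)    # number of cells in which each gene is expressed
edge_degree = H.sum(axis=0)    # number of genes expressed in each cell
```

Transposing the incidence matrix is all that is needed to move between the hypergraph and its dual, which mirrors the statement that the feature matrix of the dual is the transpose of the original.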
Structural Equation Model
Within the dual of scRNA-seq expression matrix , we employ the Structural Equation Model (SEM),74 a statistical approach that integrates factor analysis and multiple regression, to model causal relationships among genes and deduce the intricate dynamics present within gene regulatory networks (GRNs), considering both observed and latent gene interactions. Specifically, our approach is rooted in the Linear SEM:
| (Equation 1) |
Here, is the intrinsic noise component following a Gaussian distribution denoted by . The adjacency matrix indicates the conditional dependencies among genes. This characteristic implies a mechanism to derive from the noise matrix , expressed as:
| (Equation 2) |
This expression elucidates the relationship between and while highlighting the underlying network structure of the GRN as captured by the matrix . In scRNA-seq, the data matrix (where rows represent cells and columns represent genes) captures complex gene dependencies. SEM models these dependencies by learning a matrix that represents conditional gene-gene relationships, while accounting for intrinsic noise through a Gaussian distribution. This enables HyperG-VAE to reconstruct GRNs and ensures biologically meaningful representations that reflect both direct and indirect gene interactions.
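As a numerical illustration of a linear SEM of this kind, the sketch below uses the conventional form X = AᵀX + Z, so that X = (I − Aᵀ)⁻¹Z. The symbols `X`, `A`, and `Z`, the genes-as-rows orientation, and the acyclic toy adjacency are assumptions made for this example and may differ from the exact notation of Equations 1 and 2.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_cells = 4, 6

# Assumed convention: A[i, j] != 0 means gene i influences gene j.
# A strictly upper-triangular A encodes an acyclic graph, which makes
# (I - A^T) unit lower-triangular and therefore invertible.
A = np.triu(rng.normal(scale=0.5, size=(n_genes, n_genes)), k=1)

Z = rng.normal(size=(n_genes, n_cells))   # Gaussian noise, genes x cells
I = np.eye(n_genes)

X = np.linalg.solve(I - A.T, Z)           # generative direction: X = (I - A^T)^{-1} Z
Z_recovered = (I - A.T) @ X               # inverse direction: Z = (I - A^T) X

# Largest deviation from the structural equation X = A^T X + Z.
residual = np.abs(X - (A.T @ X + Z)).max()
```

The two directions correspond to decoding expression from noise and encoding expression back into noise, which is the pairing a SEM-based autoencoder exploits.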
Hypergraph variational evidence lower bound
The input scRNA-seq expression matrix is often noisy and incomplete due to factors like amplification biases during reverse transcription and PCR amplification,19,20,75 which can compromise the efficacy of basic autoencoders. These autoencoders risk overfitting to training data by solely penalizing the reconstruction error, which is influenced by suboptimal expression matrices.76 To alleviate this problem, within HyperG-VAE, the hypergraph’s stochastic distribution is tailored to emphasize the latent spaces of nodes and hyperedges, rather than merely relying on observed inputs. Specifically, the node and hyperedge latent spaces are independently derived using distinct encoders and are subsequently refined according to Equation 3:
| (Equation 3) |
As a crucial loss function of HyperG-VAE, the Evidence Lower Bound (ELBO) is formulated with respect to the observed hypergraph node matrix and the parameters and which need to be estimated. Specifically, the expectation term, , is the likelihood of the model’s reconstruction of the node matrix using the latent representations for nodes and hyperedges . Moreover, the Kullback-Leibler (KL) divergence assesses the deviation of the learned latent distribution, , from a designated prior . The coefficients α and β modulate the magnitude of this regularization.
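One way to write an objective consistent with this description is sketched below. The symbols $X$ (observed node matrix), $Z_V$ and $Z_E$ (node and hyperedge latent variables), and $\phi$, $\theta$ (encoder and decoder parameters) are notational assumptions introduced here, as is the pairing of α with the node term and β with the hyperedge term.

```latex
\mathcal{L}(\phi, \theta; X)
  = \mathbb{E}_{q_\phi(Z_V, Z_E \mid X)}\big[\log p_\theta(X \mid Z_V, Z_E)\big]
  - \alpha \, \mathrm{KL}\big(q_\phi(Z_V \mid X) \,\|\, p(Z_V)\big)
  - \beta \, \mathrm{KL}\big(q_\phi(Z_E \mid X) \,\|\, p(Z_E)\big)
```

Setting α = β = 1 recovers an unweighted variational lower bound under a factorized posterior; larger values strengthen the pull of each latent space toward its prior.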
HyperG-VAE node encoder
For the expression matrix , each row delineates the expression profile of a gene across diverse cells. Concurrently, a particular gene might manifest across numerous cells and associate with other genes via distinct hyperedges .
In the message-passing phase, row weights should account for expression coherence: genes within the same module typically exhibit consistent expression profiles across cells,77,78 warranting higher weights than genes with more variable expressions.
Based on the basic idea of GAT,79 we have devised an attention computation mechanism tailored for hypergraphs, which (implicitly) assigns different weights to different nodes that share a common hyperedge . Multi-head attention was selected for the gene encoder to model the complex and dynamic relationships among genes. This mechanism enables the encoder to learn context-specific expression differences, allowing for the identification of reliable gene modules across diverse cells. By dynamically assigning different weights to gene-gene interactions, multi-head attention captures both global regulatory patterns and local dependencies, enhancing gene feature learning and supporting the construction of robust gene regulatory networks.
A scoring function e: computes a score for two genes that share a common hyperedge , which indicates the importance of the expression profiles of the two genes and , which belong to the same hyperedge :
| (Equation 4) |
where , are trainable parameters, and denotes vector concatenation. These attention scores are normalized across all hyperedges using softmax, and the attention function is defined as:
| (Equation 5) |
We denote the coefficient matrix, whose entries are , if , and 0 otherwise. Then, GAT computes a weighted average of the transformed features of the neighbor nodes followed by a non-linearity σ as the new representation of , using the normalized attention coefficients:
| (Equation 6) |
In layer , the representation of is denoted by . The hyperedge weight matrix is set as the identity matrix, due to the lack of prior knowledge regarding cell relationships. In this paper, we refer to Equations 4, 5, and 6 as the computation of each layer in an L-layer HyperG-VAE node encoder. We also leveraged the multi-head attention mechanism, akin to the strategy used in Vaswani et al.,68 to stabilize the learning process of self-attention.
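The layer computation described above can be sketched in NumPy as a single-head, GAT-style update in which two genes are neighbors whenever they share at least one hyperedge. This is an illustrative approximation rather than the authors' implementation: the function name, the use of `tanh` as the non-linearity, the added self-loops, and the LeakyReLU slope are all assumptions.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def hypergraph_attention_layer(X, H, W, a, slope=0.2):
    """One single-head attention layer over a hypergraph.

    X: (n, d_in) node features; H: (n, m) node-by-hyperedge incidence;
    W: (d_in, d_out) weight matrix; a: (2 * d_out,) attention vector.
    With identity hyperedge weights, H @ H.T counts shared hyperedges."""
    n = X.shape[0]
    Z = X @ W                                    # transformed features, (n, d_out)
    d_out = Z.shape[1]
    # e_ij = LeakyReLU(a^T [z_i || z_j]) splits into a source and a target term.
    src = Z @ a[:d_out]
    dst = Z @ a[d_out:]
    scores = leaky_relu(src[:, None] + dst[None, :], slope)
    # Neighbors share a hyperedge; self-loops (an assumption) keep every row non-empty.
    mask = ((H @ H.T) > 0) | np.eye(n, dtype=bool)
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)  # row-wise softmax
    return np.tanh(weights @ Z), weights

rng = np.random.default_rng(0)
H = np.array([[1, 0], [1, 1], [0, 1], [0, 0]])   # 4 genes, 2 cells (toy)
X = rng.normal(size=(4, 3))
W = rng.normal(size=(3, 2))
a = rng.normal(size=4)
out, att = hypergraph_attention_layer(X, H, W, a)
```

A multi-head variant would run several such layers with independent `W` and `a` and concatenate or average their outputs.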
Through the message-passing layers, the input node features of could be represented as , two individual fully connected layers are then employed to estimate the means and variances of :
| (Equation 7) |
| (Equation 8) |
where , , d is the dimensionality of the final node embedding , which is sampled by the following process:
| (Equation 9) |
where and is scaled element-wise by . The collective set of parameters, encapsulated within , offers the posterior estimates for .
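The sampling step is the standard reparameterization trick: a standard normal draw is scaled element-wise by the standard deviation and shifted by the mean. A minimal sketch follows, assuming the encoder outputs a log-variance rather than a variance (a common parameterization, not confirmed by the text); the function name is hypothetical.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I) and sigma = exp(0.5 * log_var)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

rng = np.random.default_rng(0)
mu = np.array([[1.0, -2.0], [0.5, 3.0]])       # toy means for 2 nodes, d = 2
z = reparameterize(mu, np.zeros_like(mu), rng)  # unit variance
```

Because the randomness enters only through `eps`, gradients can flow through `mu` and `log_var` when the same expression is written in an automatic differentiation framework.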
HyperG-VAE hyperedge encoder
Based on Equation 1 and the nonlinear version of the SEM proposed previously,80 the encoder part of the SEM variational autoencoder can be represented as:
| (Equation 10) |
Here, the functions and , parameterized for potential non-linear transformations, act upon and , respectively.
Based on Equation 10, to encode the high-order semantics and complex relations represented in the form of hyperedges, a hyperedge encoder first conducts a non-linear feature transformation from the observed embedding into a common latent space , which is as follows:
| (Equation 11) |
While the gene expression profile is given by , denotes the initial f-dimensional gene features matrix. Due to the absence of this detailed feature information in our dataset, is simplified as an identity matrix, . stands for multilayer neural network, is the learnable weight matrices, and is bias.
Given the fused hyperedge embedding , two individual fully connected layers are then employed to estimate the means and variances of :
| (Equation 12) |
| (Equation 13) |
where , , is the dimensionality of the , which is sampled by the following process:
| (Equation 14) |
where and is scaled element-wise by . The collective set of parameters, encapsulated within , offers the posterior estimates for .
Generative model
In the decoding phase, the hypergraph is reconstructed utilizing the latent space representations, and , acquired from the node and hyperedge encoders, respectively.
To keep the nonlinear SEM of the hyperedge encoder, we first reconstruct the representation of , and we use the corresponding decoder of Equation 10:
| (Equation 15) |
In this work, we can represent the inner content of as:
| (Equation 16) |
where is the learnable weight matrices, and is bias. Correspondingly, we can get the estimated means and variances based on :
| (Equation 17) |
| (Equation 18) |
where , , d is the dimensionality of the final hyperedge representation , which is sampled by the following process:
| (Equation 19) |
where and is scaled element-wise by .
Finally, the estimated hypergraph based on distributions is represented as:
| (Equation 20) |
Hypergraph variational evidence lower bound
In the process of HyperG-VAE, latent node embeddings and high-order relation embeddings are first generated independently from a parameter-free prior distribution, typically a Gaussian. The observed data points are then generated conditionally, based on these latent embeddings, with each data point being conditioned on its corresponding latent node embedding and high-order relation embeddings , parameterized by . The objective of HyperG-VAE is to optimize these parameters to maximize the log likelihood of the observed data. To derive a lower bound for the log likelihood, known as the Evidence Lower Bound (ELBO), HyperG-VAE leverages Jensen’s inequality as follows:
| (Equation 21) |
where is the variational posterior used to approximate the true posterior , and is the parameter that we need to estimate in the learning phase. The Evidence Lower Bound (ELBO) on the marginal likelihood of , denoted as , is derived by applying the logarithmic product rule to the joint probability distribution, facilitating a tractable lower bound for model optimization:
| (Equation 22) |
In the variational autoencoder framework, specifically within the HyperG-VAE, the Kullback-Leibler (KL) divergence acts as a regularization factor. It aligns the variational distribution with the prior distribution , reinforcing the model’s adherence to initial assumptions. Concurrently, the expected log likelihood of reconstruction, expressed as , dictates the fidelity of data reconstruction from latent embeddings, which are shaped by the learned distribution. The parameter , crucial to this reconstruction, is optimized during the learning phase. This dual mechanism ensures that while the model is incentivized to replicate observed data accurately, it remains regularized by the prior, establishing a balance pivotal to the ELBO’s effectiveness in training variational models like HyperG-VAE.
and are transposed relations. To better tailor the learning process to specific objectives, weighting components within a loss function, as in Beta-VAE,82 offers nuanced control over regularization, fostering more interpretable and generalizable models. This yields the ELBO used in HyperG-VAE:
| (Equation 23) |
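For Gaussian posteriors with diagonal covariance and a standard normal prior, each KL term in a weighted ELBO has a closed form. The sketch below computes that term and combines it with a reconstruction loss; the function names, the log-variance parameterization, and the assignment of `alpha` to the node term and `beta` to the hyperedge term are assumptions made for illustration.

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over all entries."""
    mu = np.asarray(mu, dtype=float)
    log_var = np.asarray(log_var, dtype=float)
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)

def weighted_negative_elbo(recon_nll, kl_node, kl_edge, alpha=1.0, beta=1.0):
    """Loss to minimize: reconstruction negative log likelihood plus weighted KL terms."""
    return recon_nll + alpha * kl_node + beta * kl_edge

# Toy values (made up) showing how the weights trade off the two regularizers.
kl_v = kl_to_standard_normal([1.0, 0.0], [0.0, 0.0])
kl_e = kl_to_standard_normal([0.0], [0.0])
loss = weighted_negative_elbo(2.0, kl_v, kl_e, alpha=0.5, beta=2.0)
```

Minimizing this loss is equivalent to maximizing the weighted lower bound.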
Model setup
HyperG-VAE was devised to infer gene regulatory networks from scRNA-seq data without relying on cell type annotations. Before feeding into the model, the scRNA-seq expression data underwent log-transformation followed by Z-normalization to ensure optimal data representation. For the initialization of the gene interaction matrix, denoted as , the matrix diagonal was set to zeros, while the other entries followed a Gaussian distribution . Here, m represents the number of genes, and is a small value introduced to prevent entrapment in local optima.
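The preprocessing and initialization steps can be sketched as follows. The sketch assumes a cells-by-genes layout, a `log1p` transform, and per-gene standardization; the mean (0) and scale (`scale`) of the Gaussian used to initialize the interaction matrix are placeholder assumptions rather than the authors' values, and both function names are hypothetical.

```python
import numpy as np

def log_zscore(counts, eps=1e-8):
    """log1p-transform, then standardize each gene (column) to zero mean and unit variance."""
    logged = np.log1p(np.asarray(counts, dtype=float))
    mean = logged.mean(axis=0, keepdims=True)
    std = logged.std(axis=0, keepdims=True)
    return (logged - mean) / (std + eps)   # eps guards against constant genes

def init_interaction_matrix(n_genes, scale=1e-3, seed=0):
    """Small Gaussian off-diagonal entries with a zero diagonal (no self-regulation)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(loc=0.0, scale=scale, size=(n_genes, n_genes))
    np.fill_diagonal(W, 0.0)
    return W

counts = np.array([[1, 10], [3, 20], [5, 30]])   # toy matrix: 3 cells x 2 genes
Xn = log_zscore(counts)
W0 = init_interaction_matrix(n_genes=2)
```

Keeping the diagonal at zero prevents a gene from trivially explaining its own expression, while the small random off-diagonal entries break symmetry at the start of training.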
We chose a two-step alternating optimization approach. The RMSprop algorithm83 was selected initially for tuning the weights within the HyperG-VAE layers over specific epochs. Then, in a separate phase, the weight matrix , which plays a critical role in our architecture, was fine-tuned over another set of epochs, employing a differential learning rate strategy. This two-phase approach not only improved the model’s robustness but also ensured fine-grained weight updates in both the matrix and the neural layers. We utilized the kaiming_uniform technique84 to initialize MLPs, defining the initial conditions of our model. The gene (node) encoder, taking the constructed hypergraph as input, employs the Xavier uniform initialization.85 During training, the model’s objective function combined several loss terms: a reconstruction component to maintain data fidelity, two KL divergences (from the node encoder and the hyperedge encoder) to align the latent variables with their prior distributions, and a penalty promoting sparsity in the adjacency matrix. This ensured both accuracy in reconstruction and interpretability in inferred gene interactions.
The framework was implemented in Python using PyTorch,86 complemented by scanpy81 for preliminary data handling. Key hyperparameters were selected by grid search; details are given in Table S1. Further details of the HyperG-VAE architecture can be found in the supplemental information.
GRN inference
Central to our model, HyperG-VAE, the gene regulatory network (GRN) is elucidated through the learned causal interaction matrix . The absolute values within this matrix convey the potential links between genes, reflecting the probability of their interrelations. To enhance the biological interpretability of the predicted GRNs, we incorporate SCENIC,41 a method renowned for its ability to distill biologically meaningful gene regulations. SCENIC’s capability to identify cell-type-specific regulatory interactions complements the precision of HyperG-VAE, providing a deep learning-based approach that bridges the accuracy of causal gene interactions with the biological relevance of transcription factor (TF)-gene relationships.
HyperG-VAE first identifies high-confidence TF-gene pairs by modeling complex gene dependencies and direct regulatory links, precisely capturing causal relationships while accounting for cell heterogeneity and gene-specific regulatory modules. SCENIC is then applied to filter and refine these TF-gene pairs, further enhancing their biological relevance. By using motif enrichment analysis and transcription factor activity scoring, SCENIC identifies the key transcription factors that drive gene expression in specific cell types, ensuring that the resulting GRNs are biologically meaningful and contextually relevant. This cascading combination of HyperG-VAE and SCENIC enables the construction of robust, biologically grounded gene regulatory networks. Together, they offer a comprehensive view of gene regulation, not only uncovering the structure of the GRNs but also providing insights into their functional significance.
Gene set enrichment analysis (GSEA)
For the analysis of gene clusters, we employed the default settings of Metascape.43 Specifically, enrichment analysis for given gene lists encompassed pathway and process assessments using GO Biological Processes, GO Cellular Components, GO Molecular Functions, and DisGeNET ontologies. The entire genome served as the background for enrichment. Terms meeting the following criteria were selected: p value < 0.01, minimum count of 3, and enrichment factor > 1.5 (ratio of observed to expected counts). The p values were calculated from the cumulative hypergeometric distribution, q values were adjusted using the Benjamini-Hochberg procedure, and Kappa scores were used for hierarchical clustering. Clusters, defined by sub-trees with a similarity exceeding 0.3, were identified based on membership similarities. Each cluster is represented by its most statistically significant term.
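The statistics behind this kind of enrichment filter can be written out in plain Python. The sketch below is a generic illustration of the cumulative hypergeometric test, the enrichment factor, and the Benjamini-Hochberg adjustment; it is not Metascape's code, and the function names and toy numbers are assumptions.

```python
from math import comb

def hypergeom_sf(k, N, K, n):
    """P(X >= k) when n genes are drawn from N background genes, K of which carry the term."""
    upper = min(K, n)
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total

def enrichment_factor(k, N, K, n):
    """Observed count divided by the count expected by chance."""
    return k / (n * K / N)

def benjamini_hochberg(pvalues):
    """BH-adjusted q-values, returned in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):            # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues

# Toy example: 3 of 4 listed genes hit a term carried by 3 of 10 background genes.
p = hypergeom_sf(3, N=10, K=3, n=4)
ef = enrichment_factor(3, N=10, K=3, n=4)
selected = p < 0.01 and 3 >= 3 and ef > 1.5   # the three selection criteria
q = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

In this toy case the enrichment factor passes its cutoff but the p value does not, so the term would not be selected.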
Latent representation visualization and clustering
In both HyperG-VAE and the comparative methodologies, if the size of hidden embeddings exceeded 10, we commenced by extracting the foremost 10 principal components (PCs) via principal component analysis. Subsequently, a cell neighborhood graph was computed, setting the “n_neighbors” parameter to 30. Visualization of dataset results was then performed in a two-dimensional space using the default parameters of the UMAP algorithm. For cell clustering, the Louvain algorithm was employed, and the “resolution” parameter was fine-tuned using a binary search to yield a cluster count consistent with cell-type annotations.
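The resolution search can be expressed independently of the clustering library. In the sketch below, `cluster_fn` is a hypothetical callable that runs the clustering at a given resolution and returns the number of clusters; the search bounds and iteration cap are assumptions, and the cluster count is assumed to be non-decreasing in the resolution.

```python
def find_resolution(cluster_fn, target, low=0.0, high=5.0, max_iter=50):
    """Binary-search a resolution such that cluster_fn(resolution) == target.

    Returns the resolution, or None if no value in [low, high] was found
    within max_iter steps."""
    for _ in range(max_iter):
        mid = (low + high) / 2.0
        n_clusters = cluster_fn(mid)
        if n_clusters == target:
            return mid
        if n_clusters < target:
            low = mid      # too few clusters: increase the resolution
        else:
            high = mid     # too many clusters: decrease the resolution
    return None

# Stand-in for a real clustering call, used only to exercise the search.
def mock_cluster_count(resolution):
    return int(resolution * 4) + 1

resolution = find_resolution(mock_cluster_count, target=7)
```

With scanpy, `cluster_fn` could wrap `sc.tl.louvain(adata, resolution=r)` after `sc.pp.neighbors(adata, n_neighbors=30)` and count the unique labels in `adata.obs["louvain"]`; that wiring is a suggestion, not the authors' script.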
Datasets used for GRN evaluation
We evaluated the GRN inference performance of HyperG-VAE following the settings of the BEELINE framework.27 Our evaluation encompassed seven scRNA-seq datasets. These include two human cell lines, human embryonic stem cells (hESC)87 and human mature hepatocytes (hHEP),88 and five mouse cell lines: mouse dendritic cells (mDC),89 mouse embryonic stem cells (mESC),37 mouse hematopoietic stem cells with erythroid-lineage (mHSC-E),90 mouse hematopoietic stem cells with granulocyte-monocyte-lineage (mHSC-GM),90 and mouse hematopoietic stem cells with lymphoid-lineage (mHSC-L).90 Furthermore, we evaluated GRN performance using EPR and AUPRC against four kinds of ground truth: STRING,28 non-specific ChIP-seq,29,30,31 cell-type-specific ChIP-seq,32,33,34 and loss-/gain-of-function (LOF/GOF) ground-truth networks.34 Following the guidelines outlined by Pratapa et al.,27 our dataset-specific analysis emphasized the most variable transcription factors and considered the top N most-varying genes, with N being 500 and 1,000. We adhered to the raw data preprocessing steps detailed in their work and, for evaluation, disregarded any edges that did not originate from TFs. More details can be found in Tables S2 and S3.
scRNA-seq datasets of bone marrow developmental B cells
We assess the overarching capability of HyperG-VAE in modeling gene regulatory networks pivotal to B cell development and transformation based on previously published bone marrow developmental B cells datasets.38 The raw sequencing data in this study were processed using the CellRanger pipeline (version 3.1.0, 10X Genomics), where the “mkfastq” function demultiplexed three Illumina libraries (mRNA transcript expression (RNA), mouse-specific hashtag oligos (HTO), and cell surface marker levels using antibody-derived tags (ADT)) and “count” aligned reads to the mouse genome (mm10) to generate count tables. Analysis was carried out in R using the Seurat package,91 involving filtering of the RNA dataset to include only GEMs expressing more than 300 genes and excluding those with high mitochondrial RNA levels. Normalization was performed using a centered-log ratio method. Doublets were identified in GEMs using both DoubletFinder and HTODemux methods; however, due to discrepancies in classification and challenges with DoubletFinder in identifying similar doublets, subsequent analyses relied solely on HTODemux classifications. GEMs identified as multiplets or negative were removed, leaving a refined dataset of wildtype (WT) singlets, which expressed a median of 1409 genes with 3548 counts. These WT singlets then underwent a transformation process using Seurat’s “SCTransform” function, factoring in the percentage of mitochondrial expression, to prepare a high-quality, normalized dataset for further study.
Datasets used for cellular heterogeneity study
We assessed the efficacy of HyperG-VAE by applying it to three pertinent scRNA-seq datasets: an Alzheimer’s disease (AD) study,57 a colorectal cancer investigation,58 and the renowned mouse brain dataset, often referred to as the Zeisel dataset.59 HyperG-VAE processes raw scRNA-seq gene expression profiles directly. The initial phase of data preprocessing involves rigorous data filtering and quality control. Considering the significant dropout rates characteristic of scRNA-seq expression datasets, only genes with non-zero expression in over 1% of cells and cells with non-zero expression in more than 1% of genes are retained. Subsequently, genes are ranked based on their standard deviation, and the top 2,000 genes in terms of variance are selected for further analysis. More details can be found in Table S4.
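A minimal NumPy sketch of this filtering and gene selection is given below. It assumes a cells-by-genes matrix and computes both the gene filter and the cell filter on the unfiltered data in a single pass, which is one possible reading of the procedure; the function name and return values are hypothetical.

```python
import numpy as np

def filter_and_select(counts, min_frac=0.01, n_top=2000):
    """counts: cells x genes matrix of raw expression values.

    Keeps genes detected in more than `min_frac` of cells and cells that
    express more than `min_frac` of genes, then keeps the `n_top` genes
    with the largest standard deviation."""
    counts = np.asarray(counts, dtype=float)
    expressed = counts > 0
    keep_genes = expressed.mean(axis=0) > min_frac
    keep_cells = expressed.mean(axis=1) > min_frac
    filtered = counts[np.ix_(keep_cells, keep_genes)]
    sd = filtered.std(axis=0)
    top = np.sort(np.argsort(sd)[::-1][:n_top])    # most variable genes, original order
    cell_idx = np.flatnonzero(keep_cells)
    gene_idx = np.flatnonzero(keep_genes)[top]
    return filtered[:, top], cell_idx, gene_idx

# Toy matrix (made up): 4 cells x 4 genes, with one empty cell and two silent genes.
toy = np.array([[1, 0, 5, 0],
                [2, 0, 0, 0],
                [3, 0, 9, 0],
                [0, 0, 0, 0]])
X_sel, cells, genes = filter_and_select(toy, min_frac=0.01, n_top=1)
```

Ranking by standard deviation or by variance gives the same ordering, so the two descriptions in the text are equivalent for selection purposes.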
SCENIC and ChIP-Atlas setting
To further filter reliable gene regulatory networks (GRNs) from single-cell RNA-sequencing data, we integrated HyperG-VAE with SCENIC, focusing on discerning crucial gene co-expression modules. Specifically, only the top 0.5% of gene pairs predicted by HyperG-VAE, based on their co-expression significance, are passed to SCENIC for regulon analysis. Using the genome reference, our model evaluates regulatory regions defined as 500 bp upstream of, and 5-kb and 10-kb windows centered around, each gene’s transcription start site (TSS), collectively referred to as gene-motif rankings. The analysis adopts the SCENIC criteria for GRN derivation: a feature AUC threshold (default: 0.05), a gene rank threshold (default: 5,000), and a normalized enrichment score (NES) threshold (default: 3.0).
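Selecting the highest-scoring gene pairs from a learned interaction matrix can be sketched as follows. The sketch ranks directed pairs by absolute weight, optionally keeps only pairs whose source is a transcription factor, and retains a top fraction; reading `W[i, j]` as the influence of gene i on gene j, applying the fraction after the TF filter, and the function name are all assumptions.

```python
import numpy as np

def top_edges(W, gene_names, tf_names=None, top_fraction=0.005):
    """Return the top fraction of directed gene pairs ranked by |W[i, j]|."""
    W = np.asarray(W, dtype=float)
    n = W.shape[0]
    edges = [(abs(W[i, j]), gene_names[i], gene_names[j])
             for i in range(n) for j in range(n)
             if i != j and (tf_names is None or gene_names[i] in tf_names)]
    edges.sort(key=lambda e: e[0], reverse=True)
    n_keep = max(1, int(round(top_fraction * len(edges))))
    return [(src, dst, score) for score, src, dst in edges[:n_keep]]

# Toy matrix (made up) for three genes.
W = np.array([[0.0, 0.9, -0.1],
              [0.2, 0.0, -0.8],
              [0.05, 0.3, 0.0]])
names = ["g0", "g1", "g2"]
all_pairs = top_edges(W, names, top_fraction=0.5)
tf_pairs = top_edges(W, names, tf_names={"g1"}, top_fraction=0.5)
```

The resulting list of (source, target, score) tuples is the kind of adjacency table that downstream regulon tools typically accept.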
To validate the predicted regulons, we cross-verified our computational results with publicly available ChIP-seq datasets.33 Following the foundational settings of SCENIC, we specifically tailored the study to the M. musculus (mm9) genome. Furthermore, in our evaluation approach, we incorporated multiple transcription start sites (TSS) ranges, including 1k, 5k, and 10k, to ensure a comprehensive understanding of gene expression.
Quantification and statistical analysis
To evaluate the performance of HyperG-VAE, we employed multiple statistical and quantitative analyses. These methods assess the accuracy, robustness, and reliability of the inferred GRNs, clustering assignments, and motif enrichment results.
GRN inference evaluation metrics
To assess the accuracy of the inferred GRNs, we compared the predicted networks to ground-truth GRNs using the following metrics.
(1) EPR (early precision ratio) is defined as the odds ratio of the true positives among the top K predicted edges between the model and the random predictions, where K denotes the number of edges in the ground-truth GRN.
(2) AUPRC ratio is defined as the odds ratio of the area under the precision-recall curve (AUPRC) between the model and the random predictions.
(3) The overlap coefficient is a similarity measure related to the Jaccard similarity; whereas the Jaccard similarity considers both the intersection and the union of two sets, the overlap coefficient considers only the intersection relative to the smaller set. It is used to quantify the overlap between two sets. Given two sets, A and B, the overlap coefficient O is defined as:
$$O(A, B) = \frac{|A \cap B|}{\min(|A|, |B|)}$$
The value of the overlap coefficient lies between 0 and 1: a value of 1 indicates that one set is a subset of the other (which includes the case of identical sets), and 0 indicates that the sets have no elements in common.
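Two of these metrics are simple enough to state in a few lines of Python. The sketch below computes the overlap coefficient as the intersection size divided by the size of the smaller set, and an early precision ratio as the precision among the top K predictions divided by the precision expected from a random predictor. Treating the random baseline as K divided by the number of candidate edges is an assumption, as are the function names.

```python
def overlap_coefficient(a, b):
    """|A ∩ B| / min(|A|, |B|); defined as 0.0 if either set is empty."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

def early_precision_ratio(ranked_edges, true_edges, n_possible_edges):
    """ranked_edges: predicted edges sorted from most to least confident.
    n_possible_edges: number of candidate edges a random predictor draws from."""
    true_edges = set(true_edges)
    k = len(true_edges)
    hits = sum(1 for edge in ranked_edges[:k] if edge in true_edges)
    early_precision = hits / k
    random_precision = k / n_possible_edges
    return early_precision / random_precision

ranked = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")]
truth = {("a", "b"), ("b", "c")}
epr = early_precision_ratio(ranked, truth, n_possible_edges=6)
oc = overlap_coefficient({1, 2, 3}, {2, 3, 4, 5})
```

An EPR above 1 means the top-ranked predictions contain more true edges than random guessing would.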
Clustering performance metrics
To evaluate cell clustering quality, we compared the inferred clusters with known cell populations using four clustering similarity metrics.
(1) NMI (normalized mutual information) quantifies the mutual dependence between two clustering assignments, offering a value between 0 (completely independent assignments) and 1 (identical assignments).
(2) ARI (adjusted Rand index) is a variant of the Rand index that gauges clustering similarity while correcting for agreement expected by chance. A value of 1 indicates perfect agreement, values near 0 indicate agreement no better than random, and negative values indicate agreement worse than random.
(3) HOM (homogeneity) evaluates whether each cluster comprises solely members of a single class. It ranges from 0 (poor homogeneity) to 1 (perfect homogeneity).
(4) COM (completeness) assesses whether all members of a given class are confined to the same cluster, with scores spanning from 0 (low completeness) to 1 (perfect completeness).
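The four scores can be computed from the contingency between reference labels and predicted clusters. The sketch below uses only the Python standard library and is illustrative; it does not reproduce the authors' implementation. It assumes that NMI is normalized by the arithmetic mean of the two label entropies and that at least two cells are provided. The function names are hypothetical.

```python
from collections import Counter
from math import comb, log


def _entropy(labels):
    """Shannon entropy (natural log) of a label assignment."""
    n = len(labels)
    return -sum(c / n * log(c / n) for c in Counter(labels).values())


def _mutual_information(truth, pred):
    """Mutual information between two label assignments."""
    n = len(truth)
    joint = Counter(zip(truth, pred))
    n_truth, n_pred = Counter(truth), Counter(pred)
    return sum(
        c / n * log(c * n / (n_truth[t] * n_pred[p]))
        for (t, p), c in joint.items()
    )


def clustering_scores(truth, pred):
    """Return NMI, ARI, HOM, and COM for reference vs. predicted labels."""
    h_truth, h_pred = _entropy(truth), _entropy(pred)
    mi = _mutual_information(truth, pred)
    # Homogeneity = 1 - H(class | cluster) / H(class) = MI / H(class).
    hom = mi / h_truth if h_truth > 0 else 1.0
    # Completeness = 1 - H(cluster | class) / H(cluster) = MI / H(cluster).
    com = mi / h_pred if h_pred > 0 else 1.0
    # Assumption: arithmetic-mean normalization of the mutual information.
    nmi = 2 * mi / (h_truth + h_pred) if (h_truth + h_pred) > 0 else 1.0
    # Adjusted Rand index from pair counts (requires at least two cells).
    pairs_joint = sum(comb(c, 2) for c in Counter(zip(truth, pred)).values())
    pairs_truth = sum(comb(c, 2) for c in Counter(truth).values())
    pairs_pred = sum(comb(c, 2) for c in Counter(pred).values())
    expected = pairs_truth * pairs_pred / comb(len(truth), 2)
    maximum = (pairs_truth + pairs_pred) / 2
    ari = (pairs_joint - expected) / (maximum - expected) if maximum != expected else 1.0
    return {"NMI": nmi, "ARI": ari, "HOM": hom, "COM": com}


# Same partition with permuted cluster names: all four scores are 1.
perfect = clustering_scores([0, 0, 1, 1], [1, 1, 0, 0])
# Everything merged into one cluster: complete but not homogeneous.
merged = clustering_scores([0, 0, 1, 1], [0, 0, 0, 0])
print(perfect, merged)
```

In practice, scikit-learn offers equivalent functions in `sklearn.metrics` (`normalized_mutual_info_score`, `adjusted_rand_score`, `homogeneity_score`, `completeness_score`); the text does not state which implementation was used.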
Motif enrichment analysis
To validate the biological relevance of the inferred GRNs, we performed motif enrichment analysis using the Normalized Enrichment Score (NES).
(1) The Normalized Enrichment Score (NES) quantifies the enrichment of a given motif at the top of a ranking compared to motifs generated by chance. Mathematically, NES is defined as:
$$\mathrm{NES} = \frac{\mathrm{AUC} - \mu_{\mathrm{AUC}}}{\sigma_{\mathrm{AUC}}}$$
where $\mathrm{AUC}$ represents the area under the curve for the top 0.5% of the ranked motifs for the gene of interest, and the mean $\mu_{\mathrm{AUC}}$ and standard deviation $\sigma_{\mathrm{AUC}}$ are calculated across the AUCs of all motifs in the dataset. A higher NES indicates a more significant enrichment of the motif in the given context.
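The NES is a z-score of one motif's AUC against the AUCs of all motifs. The following minimal Python sketch shows that standardization under two assumptions: the per-motif AUC values have already been computed upstream, and the population standard deviation is used (the text does not specify population versus sample). The function name and the toy AUC values are hypothetical.

```python
from statistics import mean, pstdev


def normalized_enrichment_scores(auc_by_motif):
    """Standardize each motif's AUC against the AUCs of all motifs.

    auc_by_motif: dict mapping motif identifier -> precomputed AUC.
    """
    values = list(auc_by_motif.values())
    mu = mean(values)
    # Assumption: population standard deviation across all motifs.
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("All AUCs are identical; NES is undefined.")
    return {motif: (auc - mu) / sigma for motif, auc in auc_by_motif.items()}


# Toy AUC values for three hypothetical motifs.
nes = normalized_enrichment_scores({"motif_a": 1.0, "motif_b": 2.0, "motif_c": 3.0})
print(nes)
```

Motifs whose AUC lies well above the mean receive a large positive NES, which is the sense in which a higher NES indicates stronger enrichment.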
Published: April 11, 2025
Footnotes
Supplemental information can be found online at https://doi.org/10.1016/j.crmeth.2025.101026.
Contributor Information
Yang Yang, Email: yang.yang1@uq.edu.au.
Wenjie Zhang, Email: wenjie.zhang@unsw.edu.au.
Data Availability Statement
• All datasets used in the work have been summarized in the key resources table.
• All original code has been deposited at https://github.com/guangxinsuuu/HyperG-VAE and is publicly available at https://doi.org/10.5281/zenodo.15028720 as of the date of publication.
• Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request.






