Proceedings of the National Academy of Sciences of the United States of America. 2025 Jan 24;122(4):e2404860121. doi: 10.1073/pnas.2404860121

Diffusive topology preserving manifold distances for single-cell data analysis

Jiangyong Wei, Bin Zhang, Qiu Wang, Tianshou Zhou, Tianhai Tian, Luonan Chen
PMCID: PMC11789025  PMID: 39854240

Significance

Distortions from traditional dimensionality reduction methods obscure relationships in high-dimensional single-cell data, thus impeding biological insights. We introduce DTNE (diffusive topology neighbor embedding), a manifold learning framework that employs personalized diffusion processes to faithfully capture the manifold distances between cells. The resulting diffusion manifold distance matrix enables accurate visualization, trajectory inference, and clustering within a unified framework. Compared to mainstream algorithms on diverse datasets, DTNE consistently achieves superior results. By emphasizing the importance of geodesic preservation, this work demonstrates the potential of diffusion dynamics to extract biologically meaningful patterns that are often obscured in high-dimensional data.

Keywords: manifold distance, data topology, diffusion model, dimension reduction, single-cell analysis

Abstract

Manifold learning techniques have emerged as crucial tools for uncovering latent patterns in high-dimensional single-cell data. However, most existing dimensionality reduction methods primarily rely on 2D visualization, which can distort true data relationships and fail to extract reliable biological information. Here, we present DTNE (diffusive topology neighbor embedding), a dimensionality reduction framework that faithfully approximates manifold distance to enhance cellular relationships and dynamics. DTNE constructs a manifold distance matrix using a modified personalized PageRank algorithm, thereby preserving topological structure while enabling diverse single-cell analyses. This approach facilitates distribution-based cellular relationship analysis, pseudotime inference, and clustering within a unified framework. Extensive benchmarking against mainstream algorithms on diverse datasets demonstrates DTNE’s superior performance in maintaining geodesic distances and revealing significant biological patterns. Our results establish DTNE as a powerful tool for high-dimensional data analysis in uncovering meaningful biological insights.


The advances in single-cell sequencing technologies enable the simultaneous quantification of thousands of gene expression profiles across millions of individual cells. By revealing transcriptional heterogeneity with single-cell resolution, these technologies provide an unprecedented insight into multiscale landscapes of complex cell populations. Integrating measurements from the vast, high-dimensional datasets poses difficulties for discriminating cell types and states (1), reconstructing developmental trajectories (2), deciphering cell communication interactions (3), data integration (4) and other common single-cell analysis workflows (5). Specifically, single-cell data analysis poses a significant challenge due to the high dimensionality of the data, often termed the “curse of dimensionality.” This issue arises from the vast number of genes measured in each cell, which can obscure the underlying relationships between cells and hinder the identification of meaningful biological insights. Fortunately, gene expression levels are often coordinated and constrained due to coregulation, where genes exhibit correlated expression patterns. Consequently, the high-dimensional gene expression data actually lie on a lower-dimensional manifold. To effectively uncover and utilize the underlying structure and extract valuable information, dimensionality reduction techniques serve as indispensable tools for single-cell data analysis.

Accurate analysis of high-dimensional single-cell data relies on precisely capturing the true manifold structures to correctly identify the interrelationships between cells. Just as a two-dimensional world map cannot accurately represent the distance relationships between cities on the Earth’s surface, maintaining true distances and relationships becomes increasingly difficult as the number of dimensions increases. Additionally, different regions within the same high-dimensional dataset may have varying intrinsic dimensions, further complicating the task (6). Traditional dimensionality reduction techniques, such as UMAP (7) and tSNE (8), play a crucial role in visualizing and interpreting single-cell data. However, projecting high-dimensional data into 2D or 3D space inevitably results in information loss and potential distortion of the underlying manifold structure (9, 10). This can lead to misunderstandings of data relationships and patterns, particularly in situations with complex nonlinear interdependencies (11). Moreover, UMAP and tSNE primarily focus on preserving pairwise similarities, neglecting the critical information encoded within intercellular distances (12). Rather than risk either misinterpreting 2D visualizations or overlooking biologically relevant geometric features, we can achieve a more accurate understanding of these datasets by using concepts from high-dimensional data geometry and topology (13). Manifold-based distance metrics offer a powerful way to understand relationships within high-dimensional data, particularly in fields like single-cell analysis. By capturing the “geodesic distances” along the manifold that represents the data, the metrics reflect the true underlying distances between nodes. Unlike traditional Euclidean distances, they account for the intrinsic geometry of the underlying manifold, enabling better extraction of meaningful nonlinear relationships and underlying patterns embedded in high-dimensional biological data.

In this study, we introduce DTNE (diffusive topology neighbor embedding), an algorithm for reconstructing the cell-state manifold from single-cell data. DTNE leverages a modified personalized PageRank to model diffusion dynamics on the cell–cell connectivity graph. Through iterative random walks, it refines proximity scores between cells, capturing their true relationships while smoothing out noise. Unlike standard PageRank, DTNE personalizes the diffusion process for each cell. It intelligently calculates optimal restart probabilities based on intrinsic structure, leading to a more accurate representation of the manifold. The refined random walks produce a diffusion manifold distance matrix, which encodes distances not only through local direct connections but also by considering global diffusion patterns. We establish a framework to leverage the manifold distance matrix for three key tasks: low-dimensional visualization, pseudotime inference, and cluster identification. The results of these tasks, in turn, validate the accuracy of the obtained manifold structure, enhancing confidence in the inferred relationships between cells. Extensive benchmarking on synthetic and real datasets using diverse metrics demonstrates DTNE’s superior performance in uncovering hidden information within single-cell data.

Related Works

Manifold geodesic distances quantify the shortest path length between two points on a curved surface or higher-dimensional manifold. Unlike Euclidean metrics, geodesic distances accurately represent nonlinear data structures, enabling more meaningful similarity assessments in manifold-structured data. Understanding geodesic distances is crucial for accurately analyzing data that lie on or near a manifold, as they provide a more faithful representation of the true relationships between points in non-Euclidean spaces. However, Isomap’s reliance on shortest path distances computed using Dijkstra’s algorithm often leads to an overestimation of the true geodesic distance on the manifold, ultimately resulting in topological instability (14). In contrast, diffusion-based methods leverage random walks on graphs to capture the manifold structure, providing more accurate and meaningful distance measures (15). These approaches begin by creating a weighted adjacency matrix in which each data point is linked to its nearest neighbors with specific weights. Row normalization of this matrix results in a Markov matrix P (the diffusion operator), which is robust in handling sparse or noisy regions of the data (16). The diffusion process of the Markov matrix P acts as a low-pass filter on the manifold, allowing for the extraction of varying levels of geometric information from the data through different powers of P. During this process, data points are characterized as discrete distributions within a common probability space, from which manifold similarity can be derived to calculate manifold distances. The similarity metric quantifies positional relationships on the manifold, with increased distances correlating with decreased similarity.

Several algorithms demonstrate the effectiveness of the diffusion matrix in embedding high-dimensional data into meaningful low-dimensional manifolds while maintaining reliable distance calculations. For instance, diffusion maps obtain pairwise diffusion distances $D_t^2(x,y)=\|P_x^t-P_y^t\|^2$ based on the probability of transitioning between nodes through random walks (17). DPT defines a propagation process using the accumulated transition matrix $M=\sum_{t=1}^{\infty}P^t$ and measures transitions between cells using the diffusion pseudotime distance $D_t^2(x,y)=\|M_x-M_y\|^2$ (18). PHATE defines a t-step potential distance $D_t^2(x,y)=\|U_x^t-U_y^t\|^2$ as a manifold preservation metric, where $U^t=-\log P^t$ (19), and its extended version, Multiscale PHATE, uses diffusion condensation to merge similar cells into clusters (20). HeatGeo preserves the heat-geodesic dissimilarity distance $D_t^2(x,y)=-4t\log(H_t)_{xy}-\sigma\,4t\log(V)_{xy}$ with a heat diffusion kernel $H_t=\sum_{k>0}t^k e^{-t}/k!\,P^k$, where $V$ is a volume regularization term (12). Although these diffusion-based methods provide effective tools for capturing true data relationships and distances, optimizing diffusion distributions and distance metrics remains a significant challenge. More refined and robust dimensionality reduction techniques are crucial for fully unlocking the potential of geodesic preservation and enabling more accurate data analysis.
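To make the distance definitions above concrete, the following minimal Python sketch computes the diffusion-map distance and the PHATE-style potential distance from a small Markov matrix. The toy ring graph, the function names, and the `eps` guard against log(0) are illustrative assumptions, not part of any of the cited implementations.

```python
import numpy as np

def ring_markov_matrix(n):
    """Row-stochastic matrix of a lazy random walk on a ring (toy graph)."""
    K = np.zeros((n, n))
    for i in range(n):
        K[i, i] = 1.0                # self-loop (assumption, keeps the walk lazy)
        K[i, (i - 1) % n] = 1.0
        K[i, (i + 1) % n] = 1.0
    return K / K.sum(axis=1, keepdims=True)

def pairwise_row_distance(A):
    """Euclidean distance between every pair of rows of A."""
    diff = A[:, None, :] - A[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def diffusion_distance(P, t):
    """D_t(x, y) = ||P^t[x] - P^t[y]||, as in the diffusion-map formula."""
    return pairwise_row_distance(np.linalg.matrix_power(P, t))

def potential_distance(P, t, eps=1e-12):
    """D_t(x, y) = ||U^t[x] - U^t[y]|| with U^t = -log(P^t + eps)."""
    U = -np.log(np.linalg.matrix_power(P, t) + eps)  # eps is a hypothetical guard
    return pairwise_row_distance(U)

P = ring_markov_matrix(12)
D = diffusion_distance(P, t=4)
D_pot = potential_distance(P, t=4)
```

On this toy ring, adjacent nodes end up closer than antipodal nodes under the diffusion distance, which is the qualitative behavior the text describes.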

Results

Overview of DTNE.

As illustrated in Fig. 1, our algorithm DTNE captures the manifold structure of data by using a diffusion process simulated with a Markov matrix. DTNE utilizes a modified personalized PageRank algorithm to transform data into similarity-based probability distributions. These distributions are subsequently converted into manifold similarities through kernel methods and further transformed into manifold distances via logarithmic operations. This approach constructs a robust diffusion manifold distance matrix, which facilitates critical analytical tasks by capturing the intrinsic relationships between data points.

Fig. 1.


Overview of DTNE. (A) The core idea of the algorithm is to convert the original straight-line distance in Euclidean space into a distribution-based geodesic distance between cells in manifold space. (B) We transform the original input data into a cell–cell connectivity graph using the k-nearest neighbors (kNN) algorithm and then normalize it to get a Markov matrix. After that, we simulate diffusion dynamics on this graph via a personalized PageRank method to derive a diffusion manifold distance matrix. This matrix captures both local and global relationships between cells based on manifold similarities. (C) We leverage the manifold distance matrix to perform key analysis tasks such as dimensionality reduction, pseudotime ordering, and cluster identification to gain insights into the dataset.

In this section, we extensively evaluate our algorithm on both real and synthetic datasets, benchmarking its performance against various popular methods for manifold structure preservation, local and global distance capturing, temporal trajectory construction, and clustering identification. By evaluating through various metrics, researchers can gain confidence in the validity of downstream analyses and interpretations based on diffusive topology.

Local and Global Structure Preservation on Synthetic Dataset.

Effective dimensionality reduction methods convert high-dimensional data into visually interpretable representations, typically in two or three dimensions, while preserving the inherent structure. To evaluate various dimensionality reduction algorithms, we utilize the Swiss roll and artificial tree datasets from the HeatGeo study (12). The Swiss roll dataset comprises data points residing on a two-dimensional manifold smoothly embedded in three-dimensional space, with known geodesic distances. The artificial tree dataset, in contrast, is uniformly sampled from multiple branches generated by multidimensional Brownian motion, with geodesic distances calculated from the noise-free manifold structure. As illustrated in Fig. 2A, UMAP and t-SNE generate undesirable clusters due to excessive denoising, failing to capture the continuous manifold. Multidimensional scaling (MDS), which aims to preserve Euclidean distances between points, fails to reveal intrinsic data structures. Similarly, diffusion maps struggle to accurately represent the branches in the artificial tree dataset. Diffusion-based dimensionality reduction algorithms, such as PHATE, HeatGeo, and DTNE, typically employ MDS as their default visualization technique. When applied to the Swiss roll dataset, PHATE and HeatGeo tend to introduce more holes in the resulting low-dimensional space compared to DTNE, which reveals the true topological structure of the data more effectively. Fig. 2B examines the impact of DTNE parameters (graph kernel, diffusion steps, and number of nearest neighbors) on performance for the Swiss roll dataset. Compared to the other two parameters, the number of nearest neighbors k has a greater impact on the manifold structure when using Pearson correlation as the evaluation metric. Notably, other local metrics, such as the average Jaccard coefficient, may decrease once k surpasses a certain threshold. Thus, a larger value of k does not necessarily improve visualization quality. This phenomenon suggests a trade-off between global and local information retention.

Fig. 2.


(A) 2D visualization of the Swiss roll dataset, colored by univariate position, and an artificial tree dataset, colored by branches, using various dimensionality reduction methods: UMAP, tSNE, MDS, Diffusion Maps, PHATE, HeatGeo, and DTNE, each demonstrating different capabilities in preserving the intrinsic structure of the data. (B) Evaluation of the impact of parameters on structure preservation for the Swiss roll dataset. The plot demonstrates how kernel type, diffusion step size l, and number of nearest neighbors k affect global structure maintenance, assessed via Pearson correlation (y-axis). Simultaneously, the impact of k on local neighborhood structure is shown using the average Jaccard coefficient (secondary y-axis).

Preserving both local and global distances is essential for dimensionality reduction techniques to effectively capture the essential structure of the data. To evaluate local and global topological structure preservation in two synthetic datasets, we first compute the distance matrices generated by various algorithms. We then evaluate the Pearson and Spearman correlation coefficients between these distance matrices and the geodesic distances. After that, we compare the trustworthiness metric, which assesses the preservation of local structure. As shown in Table 1, the DTNE algorithm yields significantly better results compared to other algorithms in terms of these metrics.
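The global part of this evaluation can be sketched as follows: take the upper triangles of two distance matrices and correlate them. This is a minimal illustration under stated assumptions, not the authors' evaluation code. The helper names are hypothetical, the Spearman rank step breaks ties by position rather than averaging them, and the toy "estimated" distances are simply a monotone distortion of the reference distances.

```python
import numpy as np

def upper_triangle(D):
    """Flatten the strictly upper-triangular part of a square distance matrix."""
    i, j = np.triu_indices(D.shape[0], k=1)
    return D[i, j]

def pearson(a, b):
    return float(np.corrcoef(a, b)[0, 1])

def ordinal_ranks(a):
    """Ordinal ranks; ties are broken by position (a simplification)."""
    order = np.argsort(a, kind="stable")
    r = np.empty(len(a), dtype=float)
    r[order] = np.arange(len(a))
    return r

def spearman(a, b):
    """Spearman correlation as the Pearson correlation of ranks."""
    return pearson(ordinal_ranks(a), ordinal_ranks(b))

def distance_agreement(D_est, D_true):
    a, b = upper_triangle(D_est), upper_triangle(D_true)
    return pearson(a, b), spearman(a, b)

# Toy data: the "estimate" is a monotone but nonlinear distortion of the truth.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
D_true = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
D_est = D_true ** 2
pearson_r, spearman_r = distance_agreement(D_est, D_true)
```

Because the distortion is monotone, the rank-based Spearman score stays at its maximum while the Pearson score drops, which is why reporting both gives complementary views of global structure preservation.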

Table 1.

Comparison of global and local structure preservation metrics for two datasets: the Swiss roll and artificial tree datasets

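| Method | Swiss roll: Pearson (global) | Swiss roll: Spearman (global) | Swiss roll: Trustworthiness (local) | Artificial tree: Pearson (global) | Artificial tree: Spearman (global) | Artificial tree: Trustworthiness (local) |
|---|---|---|---|---|---|---|
| Diffusion Map | 0.493 | 0.458 | 0.947 | 0.727 | 0.681 | 0.990 |
| PHATE | 0.428 | 0.314 | 0.956 | 0.615 | 0.361 | 0.991 |
| Shortest Path | 0.534 | 0.538 | 0.950 | 0.879 | 0.892 | 0.991 |
| HeatGeo | 0.529 | 0.548 | 0.959 | 0.798 | 0.884 | 0.992 |
| DTNE | **0.640** | **0.635** | **0.964** | **0.912** | **0.918** | **0.993** |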

Pearson and Spearman correlations between algorithm-derived and ground truth distance matrices assess global structure preservation, while the trustworthiness metric evaluates local structure preservation. Higher values indicate better performance; bold indicates the best average performance.

Low Dimensional Visualization of Single-Cell Datasets.

Single-cell datasets often capture dynamic cellular developmental processes in which cells evolve through distinct states or lineages. We evaluated different visualization approaches using six established real single-cell RNA-seq datasets [Paul (21), Nestorowa (22), Pancreas (23), Lymphoid (24), Embryoid Body (19), Arabidopsis root atlas (25)] with distinct developmental trajectories. As shown in Fig. 3, the 2D linear projections obtained from PCA contain significant noise. In contrast, tSNE prioritizes local structure while losing connections to more distant neighbors, failing to preserve global relationships. Meanwhile, PHATE has difficulties with datasets containing high levels of noise, which can hinder its ability to distinguish cell types in certain cases. From the rightmost column of Fig. 3, we can see that DTNE captures developmental trajectories well and clearly identifies different cell states. We selected “MDS” as the default visualization option for DTNE due to its focus on preserving pairwise distances between data points, which aligns with our algorithm’s emphasis on preserving intrinsic manifold distances.

Fig. 3.


2D visualizations of various dimensionality reduction methods applied to six single-cell datasets. Each column corresponds to a different visualization method: PCA, UMAP, tSNE, PHATE, and DTNE. The rows represent different datasets, with colors indicating different cell types or time stages.

To quantitatively assess performance, we computed Spearman correlations between shortest path distances in the original high-dimensional space and Euclidean distances in the two-dimensional visualization for these methods (SI Appendix, Fig. S1). These correlations confirmed that MDS visualization of DTNE effectively preserves global structure, including for multimodal data [SI Appendix, Fig. S3 (26)]. Because UMAP may offer superior visualization given the current limitations of MDS (27), we have also added “UMAP” as an alternative visualization option (SI Appendix, Fig. S2). These visualizations demonstrate DTNE’s capability to trace cellular trajectories while maintaining clear distinctions between different cell states.

Pseudotime Trajectory Inference on Single-Cell Datasets.

DTNE utilizes the calculated manifold distance metric to infer pseudotime trajectories in single-cell datasets, and we evaluate manifold structure preservation by examining the inferred pseudotime ordering. Although manifold distances may not always perfectly align with biological pseudotime due to nonuniform gene expression changes over time, they provide a reasonable pseudotime approximation to a certain extent and are widely used in the community. We utilize unified visualization spaces to compare developmental trajectories inferred by different algorithms: UMAP for most datasets and PHATE for Embryoid Body, a selection that prioritizes optimal visualization of distinct manifold structures. As shown in Fig. 4A, DTNE’s pseudotime measures exhibit greater uniformity across developmental stages in the 2D space, indicating a smoother and more consistent developmental progression. This is particularly evident in large datasets (Embryoid Body and root atlas), where other algorithms, including DPT (18), Palantir (28), and Monocle3 (29), show clear deviations.

Fig. 4.


(A) UMAP visualizations of single-cell datasets (except for the embryoid body dataset, which is visualized by PHATE), with cells colored by pseudotime inferred using different methods. Methods shown from left to right are DPT, Palantir, Monocle3, and DTNE. (B) Pseudotime evaluation of the embryoid body dataset includes a heatmap and expression changes of marker genes along pseudotime, with correlations using Kendall’s tau and Spearman. (C) Correlation comparisons (Kendall’s tau and Spearman) for different pseudotime methods across various datasets.

In addition to qualitative assessment, we perform a quantitative comparison with Kendall’s tau and Spearman correlations for different trajectory methods. For the Arabidopsis root atlas dataset, we employed consensus time as the reference order, derived by averaging the pseudotime estimations from two complementary methods, scVelo (30) and CytoTRACE (31). In the Embryoid Body dataset, cell developmental stages serve as the reference order. For the Paul, Nestorowa, and Pancreas datasets, which lack time-related labels, we utilized the normalized shortest path (NSP) distances from each cell to the root cell as gold standard labels. The NSP distance, which quantifies the minimal path between two points on the manifold surface, aligns well with the concept of pseudotime, making it a valuable benchmark for evaluating pseudotime methods. The results are presented in Fig. 4 B and C. DTNE achieves the highest correlation scores for all datasets. Further validation came from comparing the NSP distances with the normalized pseudotimes (SI Appendix, Fig. S4): DTNE’s pseudotimes demonstrate strong consistency with the NSP distances along the line y = x. Moreover, evaluation of pseudotime calculation intervals across cell types (SI Appendix, Fig. S5) demonstrated DTNE’s superior alignment with biological processes. These results suggest that DTNE’s manifold distances accurately capture the intrinsic geometry of the datasets.
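The NSP reference and the rank correlation against it can be illustrated with a small sketch. It assumes a dense, symmetric weight matrix in which 0 means "no edge", a connected graph, normalization by the maximum distance, and Kendall's tau-a (no tie correction); none of these details are specified in the text, and the toy chain and pseudotime values are made up for illustration.

```python
import heapq
import numpy as np

def shortest_path_from_root(W, root):
    """Dijkstra on a dense weight matrix W, where 0 means no edge."""
    n = W.shape[0]
    dist = np.full(n, np.inf)
    dist[root] = 0.0
    heap = [(0.0, root)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v in np.nonzero(W[u])[0]:
            nd = d + W[u, v]
            if nd < dist[v]:
                dist[v] = nd
                heapq.heappush(heap, (float(nd), int(v)))
    return dist

def normalized_shortest_path(W, root):
    """Shortest-path distance to the root, scaled to [0, 1] (assumes connectivity)."""
    d = shortest_path_from_root(W, root)
    return d / d.max()

def kendall_tau_a(a, b):
    """(concordant - discordant) / number of pairs, without tie correction."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    i, j = np.triu_indices(len(a), k=1)
    s = np.sign(a[i] - a[j]) * np.sign(b[i] - b[j])
    return float(s.sum() / len(i))

# Toy chain of six "cells" with unit edge weights; cell 0 is the root.
n = 6
W = np.zeros((n, n))
for k in range(n - 1):
    W[k, k + 1] = W[k + 1, k] = 1.0
nsp = normalized_shortest_path(W, root=0)
pseudotime = np.array([0.0, 0.1, 0.35, 0.5, 0.8, 1.0])  # hypothetical method output
tau = kendall_tau_a(nsp, pseudotime)
```

Here the hypothetical pseudotime orders the cells exactly as the NSP distances do, so the rank correlation reaches its maximum even though the values themselves differ.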

Cluster Identification of Single-Cell Datasets.

Manifold learning combined with clustering provides an unsupervised evaluation approach for assessing how accurately the manifold represents data structure. Higher clustering performance indicates better manifold learning quality. We evaluate the accuracy of manifold structures through two clustering metrics, ARI (adjusted Rand index) and AMI (adjusted mutual information). Widely used clustering methods include the Louvain (32) or Leiden (33) community detection algorithms and hierarchical clustering techniques. We employ the Python library Scanpy (34) for Louvain and Leiden clustering and scikit-learn (35) for hierarchical agglomerative clustering analysis. To make a fair accuracy comparison, we fine-tune the “resolution” parameter in both Leiden and Louvain algorithms to match the number of ground truth clusters. First, synthetic single-cell RNA-seq data with predefined cluster labels are generated using the Splatter package (36), with different noise levels obtained by varying the parameter de_fac_scale. As shown in Fig. 5A, the clustering results of DTNE are on par with Leiden and Louvain, and better than hierarchical clustering and Multiscale PHATE. We then assess the performance of DTNE on five real single-cell datasets [mouse brain (37), tosches turtle (38), tosches lizard (38), lake2018 (39), Ximerakis (40)]. We utilize datasets with ground truth cell types defined by “cell_ontology_class” to ensure fair performance evaluation. The results in Fig. 5B show that DTNE clustering outperforms other clustering methods on the five datasets. Specifically, hierarchical agglomerative clustering on the DTNE-derived manifold distance matrix yields significantly improved results compared to directly clustering the data with hierarchical agglomerative clustering.
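For readers unfamiliar with the ARI, the sketch below computes it directly from the contingency table of two labelings using the standard pair-counting formula. It is a NumPy-only illustration with hypothetical helper names; in practice a library routine (the text mentions scikit-learn for clustering) would normally be used instead.

```python
import numpy as np

def pairs(x):
    """Number of unordered pairs, x choose 2, applied elementwise."""
    x = np.asarray(x, dtype=float)
    return x * (x - 1.0) / 2.0

def adjusted_rand_index(labels_true, labels_pred):
    """ARI from the contingency table of two labelings."""
    _, t = np.unique(labels_true, return_inverse=True)
    _, p = np.unique(labels_pred, return_inverse=True)
    table = np.zeros((t.max() + 1, p.max() + 1))
    np.add.at(table, (t, p), 1)                 # count co-occurrences
    sum_cells = pairs(table).sum()              # pairs placed together in both
    sum_rows = pairs(table.sum(axis=1)).sum()   # pairs together in labels_true
    sum_cols = pairs(table.sum(axis=0)).sum()   # pairs together in labels_pred
    expected = sum_rows * sum_cols / pairs(len(t))
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:                   # degenerate case: trivial labelings
        return 1.0
    return float((sum_cells - expected) / (max_index - expected))

truth = [0, 0, 1, 1]
ari_relabelled = adjusted_rand_index(truth, [1, 1, 0, 0])   # same partition, renamed
ari_crossed = adjusted_rand_index(truth, [0, 1, 0, 1])      # maximally mixed partition
```

The score is invariant to renaming clusters, which is why it is suitable for comparing unsupervised clusterings against ground-truth cell types.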

Fig. 5.


(A) Clustering performance evaluation of different clustering techniques with varying noise levels on Splatter simulation datasets, controlled by the “de_fac_scale” parameter. Techniques compared include leiden, louvain, hierarchical, MS-PHATE, and DTNE, with ARI and AMI as performance metrics. (B) Clustering performance evaluation of different clustering techniques on five real single-cell datasets. (C) UMAP visualization of “cell ontology classes” and different clustering results from various algorithms (leiden, louvain, hierarchical, MS-PHATE, DTNE) applied to Tosches’ turtle dataset.

We use UMAP to project the tosches turtle dataset onto a 2D space and present the different clustering results in Fig. 5C. The clustering results of Leiden and DTNE are superior to the others, consistent with the results in the first bar chart in Fig. 5B. Additionally, the cells indicated by arrow 1 are assigned to the “glutamatergic neuron” cell type in both cell ontology class and DTNE clusterings. Similarly, the cells pointed to by arrow 2 are assigned to the “ependymoglial cell” group in both Leiden and DTNE clusterings. The UMAP visualization shows a clear distance between the cells highlighted by the arrows and the cell types to which they are assigned. The inconsistency between the UMAP representation and clustering structure suggests that 2D visualizations of high-dimensional data can sometimes create apparent patterns or relationships that do not actually exist in the original data space. Therefore, reduced representation should be verified and explained in combination with the original data and other analysis methods to avoid overinterpretation or erroneous inference.

Discussion

Modern manifold learning techniques have become powerful tools for interpreting single-cell omics data. They enable noise reduction, cell type identification, and dynamic behavior inference, while preserving both local and global relationships. However, popular algorithms like t-SNE and UMAP, which use entropy-based loss functions to optimize low-dimensional representations that preserve high-dimensional local neighborhood structures, often distort the global structure and geometric relationships between cells in 2D projections. This distortion can compromise downstream analyses, potentially obscuring true cell identities or differentiation dynamics. The primary objective of manifold learning is to accurately reconstruct the data’s intrinsic manifold structure while preserving geodesic distances. Maintaining this intrinsic structure is essential for preserving cellular relationship integrity and ensuring reliable interpretation of biological processes. Achieving this goal requires exploring advanced methods from high-dimensional topology and geometry, as well as developing a comprehensive framework to deepen our understanding of how preserving manifold geodesics relates to effective dimensionality reduction.

Diffusion-based methods provide a promising approach for approximating manifold geodesic distances through Markov random walks on topological graphs, even for large datasets. Lower powers of the Markov matrix reveal local neighborhood structures, while higher powers expose long-range dependencies and global geometric patterns. However, insufficient data smoothing can introduce noise, while excessive focus on distant nodes may overlook valuable local information. To address this challenge, we introduce DTNE, an algorithm that effectively captures both local and global structures through diffusion processes. DTNE employs a modified personalized PageRank algorithm to convert data points into diffusion distributions within a statistical manifold space. Subsequently, we establish manifold similarities to capture the geometric relationships among data points. These similarities are further processed using logarithmic operations to derive a meaningful manifold distance matrix. The diffusion distance matrix provides a more accurate representation of cellular relationships compared to previous methods. The modified personalized PageRank in DTNE has the following innovations. First, we propose using Kullback–Leibler divergence to adaptively search for optimal damping factors across nodes, ensuring effective smooth learning along the manifold structure. Second, we modify the restart matrix to ensure the accuracy of manifold distance calculation. Unlike the standard Markov matrix constructed using a Gaussian kernel, we initialize the adjacency matrix by setting the elements representing neighboring cells to 1. This improves computational efficiency while reducing noise perturbations. When dealing with very large datasets, we propose merging similar data points using hierarchical clustering, and ultimately represent each data point as an appropriate probability distribution. We then employ MDS for dimensionality reduction and perform both trajectory inference and clustering analyses based on the manifold distance matrix. By computing a variety of metrics on synthetic and real datasets, our algorithm demonstrates superior performance compared to popular algorithms. This is achieved by precisely measuring the manifold distances, thereby accurately reflecting the true relationships between data points.

Moreover, UMAP visualizations do not always align with the clusters identified by classical clustering algorithms. Variants like den-SNE and densMAP enhance t-SNE and UMAP by incorporating density information, making them valuable for datasets with clusters of varying sizes or densities (41). However, proximity in the resulting 2D plots may still fail to represent true similarities in the original high-dimensional space, highlighting these methods’ limitations in maintaining accurate distance relationships. To reveal deeper insights into intrinsic data patterns, manifold learning techniques must explore novel frameworks that extract richer information from complex datasets. DTNE addresses these challenges through a diffusion-based approach that effectively balances local and global information preservation. Through the construction of a diffusion manifold distance matrix, DTNE establishes a unified framework for dimensionality reduction, pseudotime trajectory inference, and single-cell data clustering. The use of diffusive distances, rather than Euclidean measures, enables the capture of multiscale structures and intrinsic global patterns, resulting in superior manifold distance preservation while offering promising future directions for manifold learning. While DTNE uses kNN for local structure, simplifying all edge weights to 1 constrains its full potential. Future developments could explore adaptive neighbor selection approaches (42) and focus on enhancing both scalability and visualization capabilities.

Materials and Methods

Consider a high-dimensional dataset $X\in\mathbb{R}^{n\times d}$, where $n$ is the number of data points and $d$ denotes the dimensionality of each data point, which could correspond to the number of filtered genes or the dimensions retained by linear dimensionality reduction such as PCA. We assume that the dataset $X$ has already been preprocessed.

Initial Markov Matrix.

Manifold learning techniques typically utilize the manifold’s local Euclidean properties, represented by a kNN graph that describes local relationships. The kNN graph is then transformed into a similarity matrix, usually using kernel functions like the Gaussian kernel. For simplicity and interpretability, we employ the box kernel:

$$K(x_i,x_j)=\begin{cases}1, & \text{if } x_j\in N_k(x_i),\\ 0, & \text{else},\end{cases} \tag{1}$$

where $N_k(x_i)$ is the k-nearest-neighbor set of $x_i$. This box kernel creates a binary adjacency matrix $K$, where each entry indicates whether two cells are kNN. After that, we normalize the rows of the matrix by

$$P=D^{-1}K, \tag{2}$$

where $D$ is a diagonal matrix with $D_{ii}=\sum_j K_{ij}$.
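A minimal NumPy sketch of Eqs. 1 and 2 is given below. It uses brute-force pairwise distances for clarity and makes two assumptions that the text does not settle: a cell is not counted among its own neighbors, and the kNN relation is left directed (not symmetrized). The function name and the random toy data are hypothetical.

```python
import numpy as np

def box_kernel_markov(X, k):
    """Binary kNN adjacency K (Eq. 1) and row-normalized Markov matrix P (Eq. 2)."""
    n = X.shape[0]
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(sq, np.inf)            # assumption: exclude self-neighbors
    nbrs = np.argsort(sq, axis=1)[:, :k]    # indices of the k nearest cells
    K = np.zeros((n, n))
    K[np.repeat(np.arange(n), k), nbrs.ravel()] = 1.0
    degree = K.sum(axis=1)                  # D_ii = sum_j K_ij (equals k here)
    P = K / degree[:, None]                 # P = D^{-1} K
    return K, P

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))                # 50 toy "cells" with 5 features
K, P = box_kernel_markov(X, k=5)
```

With the box kernel every row of K has exactly k ones, so each nonzero entry of P is simply 1/k; for large datasets an approximate nearest-neighbor search would replace the brute-force distance computation.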

Transformed Markov Matrix.

Personalized PageRank (PPR) introduces a “restart mechanism” that prioritizes the query node and its neighbors (43), using a “random walker” to navigate the graph network:

$$r^{(t)} = cPr^{(t-1)} + (1-c)s, \tag{3}$$

where $P$ is the diffusion operator (Eq. 2), $s$ is a query point’s starting influence, and $c$ is a damping factor. The walker can either move to connected nodes with probability $c$ or “restart” back to the query point $s$ with probability $1-c$.
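The restart iteration can be sketched on a toy graph as follows. This example assumes that $r$ and $s$ are treated as row vectors (distributions over nodes), so the update reads `r @ P`, the convention that agrees with the matrix form of Eq. 4; the 4-node transition matrix and the value `c = 0.85` are arbitrary choices for illustration.

```python
import numpy as np

def personalized_pagerank(P, s, c=0.85, n_iter=200):
    """Iterate r <- c * r P + (1 - c) * s for a row-stochastic matrix P."""
    r = s.copy()
    for _ in range(n_iter):
        # With probability c take one step along the graph,
        # with probability 1 - c restart at the query distribution s.
        r = c * (r @ P) + (1.0 - c) * s
    return r

# Toy row-stochastic transition matrix on 4 nodes (hypothetical example).
P = np.array([[0.0, 0.5, 0.5, 0.0],
              [0.5, 0.0, 0.5, 0.0],
              [1/3, 1/3, 0.0, 1/3],
              [0.0, 0.0, 1.0, 0.0]])
s = np.array([1.0, 0.0, 0.0, 0.0])   # query node is node 0
c = 0.85
r = personalized_pagerank(P, s, c=c)

# Fixed point of the iteration: r = (1 - c) * s * (I - c P)^{-1}.
r_closed = (1 - c) * s @ np.linalg.inv(np.eye(4) - c * P)
```

The iteration converges geometrically at rate `c`, so the iterate and the closed-form fixed point agree to machine precision after a few hundred steps.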

Studies indicate that the optimal damping factor $c$ may vary across nodes within the graph structure (44). Varying the damping factor balances exploration (wandering among neighbors) with exploitation (focusing on the query). Therefore, we define a vector $C = [c_1, \ldots, c_N]$ containing damping factors for all nodes and construct a diagonal matrix $\mathbf{C} = \mathrm{diag}(C)$ using these values. This allows us to rewrite a modified PPR formula in matrix form:

$$R^{(t)} = \mathbf{C}R^{(t-1)}P + (I - \mathbf{C})P^{l}, \tag{4}$$

Here $P^{l}$ is a restart matrix and $l$ is the power of the diffusion matrix $P$. By default, we set $l = 2$. When $t \to \infty$, $R = (I - \mathbf{C})(I + \mathbf{C}P + \mathbf{C}^2P^2 + \cdots + \mathbf{C}^tP^t + \cdots)P^{l}$. Each row in $R$ represents a query node in the graph. For efficiency, we consider the matrix factorization $P = \Phi\Lambda\Psi^T$, where $\Psi^T\Phi = I$. Then we obtain the matrix power $P^t = \Phi\Lambda^t\Psi^T$, with each element given by $P^{(t)}_{ij} = \sum_{k=1}^{n}\lambda_k^t\phi_k(i)\psi_k(j)$. From this, we can derive:

$$R_{ij} = \sum_{k=1}^{n}\frac{(1-c_i)\lambda_k^{l}}{1-c_i\lambda_k}\,\phi_k(i)\psi_k(j). \tag{5}$$
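The closed form in Eq. 5 can be checked numerically against the truncated series of Eq. 4. The sketch below assumes a symmetric adjacency matrix `A` (an assumption of this example; a kNN graph need not be symmetric), so that $P = D^{-1}A$ is similar to a symmetric matrix and its factorization can be obtained from `numpy.linalg.eigh`. All function names and the toy graph are hypothetical.

```python
import numpy as np

def spectral_decomposition(A):
    """Return (lam, Phi, Psi) with D^{-1} A = Phi diag(lam) Psi^T and Psi^T Phi = I.

    Assumes A is symmetric, so P is similar to S = D^{-1/2} A D^{-1/2}.
    """
    d = A.sum(axis=1)
    S = A / np.sqrt(np.outer(d, d))
    lam, V = np.linalg.eigh(S)
    Phi = V / np.sqrt(d)[:, None]   # right eigenvectors of P
    Psi = V * np.sqrt(d)[:, None]   # left eigenvectors of P
    return lam, Phi, Psi

def transformed_markov_spectral(A, c, l=2):
    """Closed form of Eq. 5: R_ij = sum_k coef_ik * phi_k(i) * psi_k(j)."""
    lam, Phi, Psi = spectral_decomposition(A)
    coef = (1 - c)[:, None] * lam[None, :] ** l / (1 - c[:, None] * lam[None, :])
    return (Phi * coef) @ Psi.T

def transformed_markov_series(A, c, l=2, n_terms=400):
    """Truncated series R = sum_t (I - C) C^t P^(l + t) from Eq. 4."""
    P = A / A.sum(axis=1, keepdims=True)
    Pt = np.linalg.matrix_power(P, l)
    w = 1 - c                        # per-row weight (1 - c_i) * c_i^t
    R = np.zeros_like(P)
    for _ in range(n_terms):
        R += w[:, None] * Pt
        Pt = Pt @ P
        w = w * c
    return R

# Toy undirected graph on 5 nodes with node-specific damping factors.
A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)
c = np.array([0.5, 0.6, 0.7, 0.8, 0.9])
R_spec = transformed_markov_spectral(A, c)
R_series = transformed_markov_series(A, c)
```

Both routes give the same matrix up to truncation error, and each row of `R` sums to one, so every query node receives a probability distribution over the graph.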

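Let $\Sigma = \left[\frac{(1-c_i)\lambda_k^{l}}{1-c_i\lambda_k}\right]_{i\times k}$; the matrix form of the formula is $R = (\Phi \circ \Sigma)\cdot\Psi^T$, where $\circ$ denotes the Hadamard product and $\cdot$ the dot (matrix) product. Differentiating $R_{ij}$ with respect to $c_i$ yields

$$\frac{\partial R_{ij}}{\partial c_i} = \sum_{k=1}^{n}\frac{\lambda_k^{l}(\lambda_k-1)}{(1-c_i\lambda_k)^2}\,\phi_k(i)\psi_k(j). \tag{6}$$

If we define $D_\Sigma = \left[\frac{\lambda_k^{l}(\lambda_k-1)}{(1-c_i\lambda_k)^2}\right]_{i\times k}$, the partial derivative matrix becomes $D_R = \left[\frac{\partial R_{ij}}{\partial c_i}\right]_{N\times N} = (\Phi \circ D_\Sigma)\cdot\Psi^T$.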
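For completeness, the derivative in Eq. 6 follows from the quotient rule applied to the only factor of each summand in Eq. 5 that depends on $c_i$; the short derivation below is a worked step added for illustration.

```latex
\frac{\partial}{\partial c_i}\,\frac{1-c_i}{1-c_i\lambda_k}
  = \frac{-(1-c_i\lambda_k) + (1-c_i)\lambda_k}{(1-c_i\lambda_k)^2}
  = \frac{\lambda_k - 1}{(1-c_i\lambda_k)^2},
\qquad\text{so}\qquad
\frac{\partial}{\partial c_i}\,\frac{(1-c_i)\lambda_k^{l}}{1-c_i\lambda_k}
  = \frac{\lambda_k^{l}(\lambda_k-1)}{(1-c_i\lambda_k)^2}.
```

Since the eigenvalues of a Markov matrix satisfy $|\lambda_k| \le 1$, each real $\lambda_k$ gives $\lambda_k - 1 \le 0$, and the coefficient of the leading eigenvalue $\lambda = 1$ does not depend on $c_i$ at all.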

Inspired by the objective function of t-SNE, we use gradient descent to find the optimal damping factors for all nodes in the graph by minimizing the Kullback-Leibler (KL) divergence

$$L(C) = \mathrm{KL}\left(P^{l+1}\,\|\,R\right). \tag{7}$$
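A minimal sketch of this optimization is given below. It assumes a symmetric adjacency matrix `A` so that the factorization is real, uses the row-wise gradient $\partial L/\partial c_i = -\sum_j P^{l+1}_{ij}\,(\partial R_{ij}/\partial c_i)/R_{ij}$ that follows from the KL divergence and Eq. 6, and clips the damping factors to $[0.05, 0.95]$. The learning rate, the number of steps, the clipping bounds, and all function names are assumptions of this example rather than settings from the paper.

```python
import numpy as np

def spectral_R_and_dR(A, c, l=2):
    """R (Eq. 5) and dR/dc (Eq. 6) for a symmetric adjacency matrix A."""
    d = A.sum(axis=1)
    lam, V = np.linalg.eigh(A / np.sqrt(np.outer(d, d)))
    Phi, Psi = V / np.sqrt(d)[:, None], V * np.sqrt(d)[:, None]
    denom = 1 - c[:, None] * lam[None, :]
    Sigma = (1 - c)[:, None] * lam[None, :] ** l / denom
    DSigma = lam[None, :] ** l * (lam[None, :] - 1) / denom ** 2
    return (Phi * Sigma) @ Psi.T, (Phi * DSigma) @ Psi.T

def kl_loss_and_grad(T, R, DR):
    """KL(T || R) summed over all rows, and its gradient w.r.t. each c_i."""
    mask = T > 0                      # terms with T_ij = 0 contribute nothing
    ratio = np.divide(T, R, out=np.zeros_like(T), where=mask)
    loss = np.sum(T[mask] * np.log(ratio[mask]))
    grad = -(ratio * DR).sum(axis=1)  # row i of R depends only on c_i
    return loss, grad

def fit_damping(A, l=2, c0=0.5, lr=1e-3, n_steps=50):
    """Plain gradient descent on the damping factors (illustrative settings)."""
    P = A / A.sum(axis=1, keepdims=True)
    T = np.linalg.matrix_power(P, l + 1)   # target distribution P^(l+1)
    c = np.full(A.shape[0], c0)
    losses = []
    for _ in range(n_steps):
        R, DR = spectral_R_and_dR(A, c, l)
        loss, grad = kl_loss_and_grad(T, R, DR)
        losses.append(loss)
        c = np.clip(c - lr * grad, 0.05, 0.95)
    return c, losses

A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)
c_opt, losses = fit_damping(A)
```

Because row $i$ of $R$ depends only on $c_i$, the loss separates over nodes and each damping factor can be updated independently, which is why the gradient is a simple row sum.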

Manifold Distance Matrix.

With the transformed Markov matrix R, we define a Bhattacharyya kernel matrix

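$$G = \sqrt{R}\,\sqrt{R}^{T}, \tag{8}$$

whose entries are inner products $G_{ij} = \langle\sqrt{r_i},\sqrt{r_j}\rangle$ (45, 46), where $r_i$ is the $i$th row of $R$ and the square root is taken element-wise. For all $i,j$, a kernel function $k$ satisfies $k(r_i,r_j) = G_{ij}$. Utilizing closure properties of kernels, we can create a new kernel $k_2$ (47):

$$G_{ij} = k(r_i,r_j) = e^{k_2(r_i,r_j)} = e^{\langle\phi(r_i),\phi(r_j)\rangle} = e^{\langle z_i,z_j\rangle}, \tag{9}$$

where $z_i = \phi(r_i)$ is the resulting mapped vector. This equation can also be written as $\langle z_i,z_j\rangle = \log G_{ij}$. Note that the diagonal elements of $G$ are all ones, so $\|z_i\|^2 = \log G_{ii} = 0$. We define a distance metric based on the kernel (48):

$$D^{(2)}(z_i,z_j) = \|z_i - z_j\|^2 = -2\langle z_i,z_j\rangle = -2\log G_{ij}. \tag{10}$$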

The matrix form $D_{\text{manifold}} = \sqrt{-2\log G}$ (taken element-wise) is used to approximate geodesic distances between data points on the manifold $\mathcal{M}$. Observe that shorter diffusion times capture local neighborhood geometry, while longer times approximate global topological structure. By combining information from both local and global scales, we achieve a smooth and reliable estimate of the true geodesic distances. By viewing diffusive topology as a statistical manifold, we unlock a powerful tool for understanding complex high-dimensional data geometry. The core steps are presented in Algorithm 1, while the detailed derivation is provided in SI Appendix.
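The construction of the distance matrix from a transformed Markov matrix can be sketched as follows. The example assumes that the Bhattacharyya kernel uses element-wise square roots of the rows of $R$ (which is what makes the diagonal of $G$ equal to one for row-stochastic $R$) and that the distance is the square root of $-2\log G_{ij}$. The clipping constant `eps`, which keeps distances finite when two rows have disjoint support, is an assumption of this sketch.

```python
import numpy as np

def manifold_distance(R, eps=1e-12):
    """Distance matrix sqrt(-2 log G) from a row-stochastic matrix R."""
    sqrt_R = np.sqrt(R)
    # Bhattacharyya kernel: G_ij = sum_k sqrt(R_ik * R_jk), with G_ii = 1.
    G = sqrt_R @ sqrt_R.T
    # Assumption: clip to [eps, 1] to avoid log(0) and rounding above 1.
    G = np.clip(G, eps, 1.0)
    D = np.sqrt(-2.0 * np.log(G))
    np.fill_diagonal(D, 0.0)
    return D

# Toy example: rows 0 and 1 are similar distributions, row 2 is different.
R = np.array([[0.7, 0.2, 0.1],
              [0.6, 0.3, 0.1],
              [0.1, 0.2, 0.7]])
D = manifold_distance(R)
```

In this toy case the distance between the two similar rows is much smaller than the distance from either of them to the third row, which is the behavior the kernel is meant to capture.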

Downstream Tasks.

Leveraging the derived manifold distance matrix, we can subsequently perform various analytical tasks, including low-dimensional visualization, pseudotime ordering, and cell clustering.

Diffusive topology preserving reduction.

Multidimensional scaling (MDS) provides a stress-minimizing optimization procedure that embeds the distances while preserving the geometric relationships between data points as much as possible (49). We learn the low-dimensional projection $\xi: \mathbb{R}^N \to \mathbb{R}^2$ by minimizing the following loss function:

$$L_{\text{distance}}(y_1,\ldots,y_N) = \left\|D_{\text{embed}} - D_{\text{manifold}}\right\|_F^2,$$

where $D_{\text{embed}}$ is the matrix of pairwise Euclidean distances between data points $y_i = \xi(z_i) \in \mathbb{R}^2$ (or $\mathbb{R}^3$). This approach embeds manifold distances into 2D (or 3D) coordinates $Y$ for low-dimensional representation.
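To make the loss concrete, the sketch below minimizes $\|D_{\text{embed}} - D_{\text{manifold}}\|_F^2$ by plain gradient descent in NumPy. It is not the MDS solver used by the authors; the random initialization, the step size $1/(4N)$, the iteration count, and the function name `embed_by_stress` are assumptions made for this example.

```python
import numpy as np

def embed_by_stress(D_target, dim=2, n_iter=300, seed=0):
    """Gradient descent on sum_ij (||y_i - y_j|| - D_target_ij)^2."""
    n = D_target.shape[0]
    lr = 1.0 / (4 * n)                      # illustrative step size
    rng = np.random.default_rng(seed)
    Y = rng.normal(size=(n, dim))           # random initial coordinates
    history = []
    for _ in range(n_iter):
        diff = Y[:, None, :] - Y[None, :, :]            # y_i - y_j
        D_embed = np.sqrt((diff ** 2).sum(axis=-1))
        history.append(np.sum((D_embed - D_target) ** 2))
        # Adding the identity avoids 0/0 on the diagonal, where the residual is 0.
        coef = (D_embed - D_target) / (D_embed + np.eye(n))
        # d/dy_i of the loss; the factor 4 counts pairs (i, j) and (j, i).
        grad = 4.0 * (coef[:, :, None] * diff).sum(axis=1)
        Y = Y - lr * grad
    return Y, history

# Toy target: distances between points that truly live in 2D.
rng = np.random.default_rng(1)
truth = rng.normal(size=(15, 2))
D_target = np.sqrt(((truth[:, None, :] - truth[None, :, :]) ** 2).sum(axis=-1))
Y, history = embed_by_stress(D_target)
```

The recorded `history` holds the loss before each update and gives a simple way to check that the embedding distances are approaching the target distances.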

To reduce the computational burden associated with massive datasets and improve scalability, landmark-based approaches are commonly employed. Unlike PHATE, which employs k-means clustering for landmark selection, we utilize agglomerative clustering to partition the data into $M$ groups, where $M < N$. The data points in the same group share the same restart probability. First, we define $K_{MN}(i,m) = \sum_{j\in C_m}K(i,j)$ to store the transition weights from the $i$th point to the $m$th group in one step, where $C_m$ is the set of points in the $m$th group. Then we construct two transition matrices:

$$P_{MN} = \frac{K_{MN}}{\sum_i K_{MN}(i,j)},\qquad P_{NM} = \frac{K_{MN}^{T}}{\sum_j K_{MN}(i,j)},$$

to calculate transition probabilities. A new diffusion operator $P_{MM} = P_{MN}P_{NM}$ is constructed to obtain the transition probabilities from landmark to landmark. We then perform the diffusion process as before to obtain the global similarity matrix $R_{MM}$ and the landmark embedding $Y_M$. The final low-dimensional embedding of the dataset is obtained by $Y = P_{NM}Y_M$.
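One possible reading of the landmark construction is sketched below. It assumes that $P_{NM}$ is the $N \times M$ row-stochastic matrix of point-to-group transitions, that $P_{MN}$ is the $M \times N$ row-stochastic matrix of group-to-point transitions, and that both are obtained by row normalization; these shapes are inferred from the products $P_{MN}P_{NM}$ and $P_{NM}Y_M$. The group labels and landmark coordinates are hypothetical inputs.

```python
import numpy as np

def landmark_operators(K, labels, n_groups):
    """Point-to-group, group-to-point, and group-to-group transition matrices."""
    n = K.shape[0]
    # One-hot membership matrix: member[j, m] = 1 if point j is in group m.
    member = np.zeros((n, n_groups))
    member[np.arange(n), labels] = 1.0
    # Weight from point i to group m: sum of K[i, j] over j in group m.
    K_pg = K @ member                                        # shape (N, M)
    P_NM = K_pg / K_pg.sum(axis=1, keepdims=True)            # shape (N, M)
    P_MN = K_pg.T / K_pg.T.sum(axis=1, keepdims=True)        # shape (M, N)
    P_MM = P_MN @ P_NM                                       # shape (M, M)
    return P_NM, P_MN, P_MM

# Toy example: a path graph on 6 points split into 3 groups of 2.
K = np.zeros((6, 6))
for i in range(5):
    K[i, i + 1] = K[i + 1, i] = 1.0
labels = np.array([0, 0, 1, 1, 2, 2])
P_NM, P_MN, P_MM = landmark_operators(K, labels, n_groups=3)

# Hypothetical landmark coordinates; each point becomes a convex combination.
Y_M = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
Y = P_NM @ Y_M
```

Since every row of `P_NM` is a probability vector, each point is placed at a weighted average of the landmark coordinates of the groups it connects to.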

Pseudotime inference.

We construct single-cell developmental trajectories by analyzing cell-to-cell distances along the underlying manifold geometry. While the true manifold is unknown, diffusion patterns in the data help approximate geodesic distances. These approximated distances enable ordering cells into developmental trajectories based on their relative positions in the manifold landscape, effectively converting static snapshot data into a dynamic progression.
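A simple way to turn a manifold distance matrix into an ordering is sketched below. It assumes that a root cell is supplied by the user and that pseudotime is the distance from that root rescaled to $[0, 1]$; the text does not specify these details, so both are assumptions of this example, as is the function name.

```python
import numpy as np

def pseudotime_from_root(D_manifold, root):
    """Pseudotime as the manifold distance from a chosen root cell, scaled to [0, 1]."""
    t = D_manifold[root].astype(float)
    t = t / t.max()
    order = np.argsort(t)        # cells sorted from earliest to latest
    return t, order

# Toy example: four cells on a line at positions 0, 2, 1, 3.
pos = np.array([0.0, 2.0, 1.0, 3.0])
D_manifold = np.abs(pos[:, None] - pos[None, :])
t, order = pseudotime_from_root(D_manifold, root=0)
```

Here the ordering recovers the positions along the line, with the root cell at pseudotime 0 and the farthest cell at pseudotime 1.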

Manifold clustering.

Clustering facilitates the identification of biologically meaningful cell types and states, and manifold clustering approaches can uncover subtler relationships between cells that traditional methods may overlook. We apply hierarchical agglomerative clustering to the manifold distance matrix to capture the manifold topology, enabling more biologically relevant groupings that accurately reflect the underlying data structure.
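Hierarchical agglomerative clustering on a precomputed distance matrix can be sketched with SciPy as follows. The average-linkage criterion and the fixed number of clusters are assumptions of this example; the text does not state which linkage is used.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def cluster_from_distances(D_manifold, n_clusters, method="average"):
    """Agglomerative clustering of a symmetric distance matrix."""
    # SciPy expects the condensed (upper-triangular) form of the matrix.
    condensed = squareform(D_manifold, checks=False)
    Z = linkage(condensed, method=method)
    # Cut the dendrogram so that at most n_clusters flat clusters remain.
    return fcluster(Z, t=n_clusters, criterion="maxclust")

# Toy example: two well-separated groups of three points on a line.
pos = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])
D_manifold = np.abs(pos[:, None] - pos[None, :])
labels = cluster_from_distances(D_manifold, n_clusters=2)
```

Working directly from the distance matrix means the clustering sees the same geometry as the embedding and the pseudotime ordering.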

Supplementary Material

Appendix 01 (PDF)

Acknowledgments

This work is supported by the Key-Area Research and Development Program of Guangdong Province (No. 2021B0909060002), the National Natural Science Foundation of China (Nos. 12101254, T2341007, 12131020, 31930022, 11931019, 62373384, 12371486, 12426310, and T2350003), the National Key Research and Development Program of China (No. 2022YFA1004800), the Strategic Priority Research Program of the Chinese Academy of Sciences (No. XDB38040400), the Shanghai Science and Technology Committee (No. 23JS1401300), the Special Fund for Science and Technology Innovation Strategy of Guangdong Province (No. 2021B0909050004), and JST Moonshot R&D (No. JPMJMS2021).

Author contributions

L.C. designed research; J.W., T.Z., T.T., and L.C. performed research; J.W., B.Z., and Q.W. analyzed data; and J.W., T.Z., T.T., and L.C. wrote the paper.

Competing interests

The authors declare no competing interest.

Footnotes

This article is a PNAS Direct Submission.

Data, Materials, and Software Availability

The Python implementation of our DTNE algorithm and the data processing procedures have been deposited on GitHub (https://github.com/statway/DTNE) (50). Previously published datasets were utilized in this study (19, 21–25, 37–40), with their respective download links provided in Table 2.

Table 2.

Summary of single-cell datasets used in this study

The table lists dataset names, cell counts, and their repository links. Cell counts represent the total number of cells analyzed after quality control and preprocessing.

Supporting Information

References

  • 1. Kiselev V. Y., Andrews T. S., Hemberg M., Challenges in unsupervised clustering of single-cell RNA-seq data. Nat. Rev. Genet. 20, 273–282 (2019).
  • 2. Ding J., Sharon N., Bar-Joseph Z., Temporal modelling using single-cell transcriptomics. Nat. Rev. Genet. 23, 355–368 (2022).
  • 3. Armingol E., Officer A., Harismendy O., Lewis N. E., Deciphering cell-cell interactions and communication from gene expression. Nat. Rev. Genet. 22, 71–88 (2021).
  • 4. Luecken M. D., et al., Benchmarking atlas-level data integration in single-cell genomics. Nat. Methods 19, 41–50 (2022).
  • 5. Lähnemann D., et al., Eleven grand challenges in single-cell data science. Genome Biol. 21, 1–35 (2020).
  • 6. B. C. Brown, A. L. Caterini, B. L. Ross, J. C. Cresswell, G. Loaiza-Ganem, “The union of manifolds hypothesis” in NeurIPS 2022 Workshop on Symmetry and Geometry in Neural Representations (2022).
  • 7. L. McInnes, J. Healy, J. Melville, UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv [Preprint] (2018). https://arxiv.org/abs/1802.03426 (Accessed 9 February 2018).
  • 8. van der Maaten L., Hinton G., Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008).
  • 9. Chari T., Pachter L., The specious art of single-cell genomics. PLoS Comput. Biol. 19, e1011288 (2023).
  • 10. Xia L., Lee C., Li J. J., Statistical method scDEED for detecting dubious 2D single-cell embeddings and optimizing t-SNE and UMAP hyperparameters. Nat. Commun. 15, 1753 (2024).
  • 11. Marx V., Seeing data as t-SNE and UMAP do. Nat. Methods 21, 930–933 (2024).
  • 12. Huguet G., et al., A heat diffusion perspective on geodesic preserving dimensionality reduction. Adv. Neural Inf. Process. Syst. 36, 1–31 (2024).
  • 13. Wang S., Sontag E. D., Lauffenburger D. A., What cannot be seen correctly in 2D visualizations of single-cell ‘omics data’? Cell Syst. 14, 723–731 (2023).
  • 14. Tenenbaum J. B., De Silva V., Langford J. C., A global geometric framework for nonlinear dimensionality reduction. Science 290, 2319–2323 (2000).
  • 15. Cowen L., Ideker T., Raphael B. J., Sharan R., Network propagation: A universal amplifier of genetic associations. Nat. Rev. Genet. 18, 551 (2017).
  • 16. Coifman R. R., Lafon S., Diffusion maps. Appl. Comput. Harmon. Anal. 21, 5–30 (2006).
  • 17. Haghverdi L., Buettner F., Theis F. J., Diffusion maps for high-dimensional single-cell analysis of differentiation data. Bioinformatics 31, 2989–2998 (2015).
  • 18. Haghverdi L., Buettner M., Wolf F. A., Buettner F., Theis F. J., Diffusion pseudotime robustly reconstructs lineage branching. Nat. Methods 13, 845–848 (2016).
  • 19. Moon K. R., et al., Visualizing structure and transitions in high-dimensional biological data. Nat. Biotechnol. 37, 1482–1492 (2019).
  • 20. Kuchroo M., et al., Multiscale PHATE identifies multimodal signatures of COVID-19. Nat. Biotechnol. 40, 681–691 (2022).
  • 21. Paul F., et al., Transcriptional heterogeneity and lineage commitment in myeloid progenitors. Cell 163, 1663–1677 (2015).
  • 22. Nestorowa S., et al., A single-cell resolution map of mouse hematopoietic stem and progenitor cell differentiation. Blood 128, e20–e31 (2016).
  • 23. Bastidas-Ponce A., et al., Comprehensive single cell mRNA profiling reveals a detailed roadmap for pancreatic endocrinogenesis. Development 146, dev173849 (2019).
  • 24. Satpathy A. T., et al., Massively parallel single-cell chromatin landscapes of human immune cell development and intratumoral T cell exhaustion. Nat. Biotechnol. 37, 925–936 (2019).
  • 25. Shahan R., et al., A single-cell Arabidopsis root atlas reveals developmental trajectories in wild-type and cell identity mutants. Dev. Cell 57, 543–560 (2022).
  • 26. Stuart T., et al., Comprehensive integration of single-cell data. Cell 177, 1888–1902 (2019).
  • 27. M. W. Trosset, C. E. Priebe, Continuous multidimensional scaling. arXiv [Preprint] (2024). https://arxiv.org/abs/2402.04436 (Accessed 6 February 2024).
  • 28. Setty M., et al., Characterization of cell fate probabilities in single-cell data with Palantir. Nat. Biotechnol. 37, 451–460 (2019).
  • 29. Cao J., et al., The single-cell transcriptional landscape of mammalian organogenesis. Nature 566, 496–502 (2019).
  • 30. Bergen V., Lange M., Peidli S., Wolf F. A., Theis F. J., Generalizing RNA velocity to transient cell states through dynamical modeling. Nat. Biotechnol. 38, 1408–1414 (2020).
  • 31. Gulati G. S., et al., Single-cell transcriptional diversity is a hallmark of developmental potential. Science 367, 405–411 (2020).
  • 32. P. De Meo, E. Ferrara, G. Fiumara, A. Provetti, “Generalized Louvain method for community detection in large networks” in 2011 11th International Conference on Intelligent Systems Design and Applications (IEEE, 2011), pp. 88–93.
  • 33. Traag V. A., Waltman L., Van Eck N. J., From Louvain to Leiden: Guaranteeing well-connected communities. Sci. Rep. 9, 5233 (2019).
  • 34. Wolf F. A., Angerer P., Theis F. J., SCANPY: Large-scale single-cell gene expression data analysis. Genome Biol. 19, 15 (2018).
  • 35. Pedregosa F., et al., Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
  • 36. Zappia L., Phipson B., Oshlack A., Splatter: Simulation of single-cell RNA sequencing data. Genome Biol. 18, 174 (2017).
  • 37. Schaum N., et al., Single-cell transcriptomics of 20 mouse organs creates a Tabula Muris: The Tabula Muris Consortium. Nature 562, 367 (2018).
  • 38. Tosches M. A., et al., Evolution of pallium, hippocampus, and cortical cell types revealed by single-cell transcriptomics in reptiles. Science 360, 881–888 (2018).
  • 39. Lake B. B., et al., Integrative single-cell analysis of transcriptional and epigenetic states in the human adult brain. Nat. Biotechnol. 36, 70–80 (2018).
  • 40. Ximerakis M., et al., Single-cell transcriptomic profiling of the aging mouse brain. Nat. Neurosci. 22, 1696–1708 (2019).
  • 41. Narayan A., Berger B., Cho H., Assessing single-cell transcriptomic variability through density-preserving data visualization. Nat. Biotechnol. 39, 765–774 (2021).
  • 42. Dyballa L., Zucker S. W., IAN: Iterated adaptive neighborhoods for manifold learning and dimensionality estimation. Neural Comput. 35, 453–524 (2023).
  • 43. Park S., Lee W., Choe B., Lee S. G., A survey on personalized PageRank computation algorithms. IEEE Access 7, 163049–163062 (2019).
  • 44. Jin W., Jung J., Kang U., Supervised and extended restart in random walks for ranking and link prediction in networks. PLoS ONE 14, e0213857 (2019).
  • 45. Kailath T., The divergence and Bhattacharyya distance measures in signal selection. IEEE Trans. Commun. Technol. 15, 52–60 (1967).
  • 46. T. Jebara, R. Kondor, “Bhattacharyya and expected likelihood kernels” in Learning Theory and Kernel Machines: Proceedings of the 16th Annual Conference on Learning Theory and 7th Kernel Workshop, COLT/Kernel 2003, Washington, DC, August 24–27, 2003 (Springer, 2003), pp. 57–71.
  • 47. Shawe-Taylor J., Cristianini N., Kernel Methods for Pattern Analysis (Cambridge University Press, 2004).
  • 48. Wei J., Zhou T., Zhang X., Tian T., DTFLOW: Inference and visualization of single-cell pseudotime trajectory using diffusion propagation. Genomics, Proteomics, Bioinf. 19, 306–318 (2021).
  • 49. Borg I., Groenen P. J., Modern Multidimensional Scaling: Theory and Applications (Springer Science & Business Media, 2005).
  • 50. J. Wei et al., Diffusive topology neighbor embedding. GitHub. https://github.com/statway/DTNE. Deposited 7 March 2024.

Articles from Proceedings of the National Academy of Sciences of the United States of America are provided here courtesy of National Academy of Sciences
