
Analyzing Single Cell RNA Sequencing with Topological Nonnegative Matrix Factorization

Yuta Hozumi 1, Guo-Wei Wei 1,2,3,*

Abstract

Single-cell RNA sequencing (scRNA-seq) is a relatively new technology that has stimulated enormous interest in statistics, data science, and computational biology due to the high dimensionality, complexity, and large scale associated with scRNA-seq data. Nonnegative matrix factorization (NMF) offers a unique approach to dimensionality reduction due to the meta-gene interpretation of its resulting low-dimensional components. However, NMF approaches suffer from the lack of multiscale analysis. This work introduces two persistent Laplacian regularized NMF methods, namely, topological NMF (TNMF) and robust topological NMF (rTNMF). By employing a total of 12 datasets, we demonstrate that the proposed TNMF and rTNMF significantly outperform all other NMF-based methods. We have also utilized TNMF and rTNMF to assist the visualization of scRNA-seq data with the popular Uniform Manifold Approximation and Projection (UMAP) and t-distributed stochastic neighbor embedding (t-SNE).

Keywords: Algebraic topology, Persistent Laplacian, scRNA-seq, dimensionality reduction, machine learning

1. Introduction

Single-cell RNA sequencing (scRNA-seq) is a relatively new technology that has unveiled the heterogeneity within cell populations, providing valuable insights into complex biological interactions and pathways, such as cell-cell interactions, differential gene expression, signal transduction pathways, and more [1]. Unlike traditional microarray analysis, often referred to as bulk sequencing, scRNA-seq offers the transcriptomic profile of individual cells. With current technology, it is possible to sequence more than 20,000 genes and 10,000 samples simultaneously. Standard experimental procedures involve cell isolation, RNA extraction, library preparation, sequencing, and data analysis. Over the years, numerous data analysis pipelines have been proposed, typically encompassing data preprocessing, batch correction, normalization, dimensionality reduction, feature selection, cell type identification, and downstream analyses to uncover relevant biological functions and pathways [2-6]. However, scRNA-seq data, in addition to their high dimensionality, are characterized by nonuniform noise, sparsity due to drop-out events and low read depth, as well as unlabeled data [7]. Consequently, dimensionality reduction and feature selection are essential for successful downstream analysis.

Principal components analysis (PCA), uniform manifold approximation and projection (UMAP), and t-distributed stochastic neighbor embedding (t-SNE) are among the most commonly used dimensionality reduction tools for scRNA-seq data. PCA is often employed as an initial step in analysis pipelines, such as trajectory analysis and data integration [8-11]. In PCA, the first few components, referred to as the principal components, maximize the variance of the projected data; each $i$-th component is orthogonal to the previous $i-1$ components and maximizes the variance of the residual data projected onto it [12,13]. Numerous successful extensions to the original formulation have been proposed [14-17]. However, due to the orthogonality constraint of PCA, the reduced data may contain negative values, making it challenging to interpret.

UMAP and t-SNE are nonlinear dimensionality reduction methods often used for visualization. UMAP constructs a weighted graph based on k-nearest neighbors and minimizes the edge-wise fuzzy set cross-entropy between this graph and the weighted graph of the low-dimensional embedding [18]. t-SNE computes the pairwise similarity between cells by constructing a conditional probability distribution over pairs of cells. Then, a Student t-distribution is used to obtain the probability distribution in the embedded space, and the Kullback-Leibler (KL) divergence between the two probability distributions is minimized to obtain the reduced data [19-22]. However, due to the stochastic nature of these methods and their instability at dimensions greater than 3 [23], they may not be suitable for downstream analysis.

Nonnegative matrix factorization (NMF) is another dimensionality reduction method in which the objective is to decompose the original count matrix into two nonnegative factor matrices [24,25]. The resulting basis matrices are often referred to as meta-genes and represent nonnegative linear combinations of the original genes. Consequently, NMF results are highly interpretable. However, the original formulation employs a least-squares optimization scheme, making the method susceptible to outlier errors [26]. To address this issue, Kong et al. [27] introduced robust NMF (rNMF), or $l_{2,1}$-NMF, which utilizes the $l_{2,1}$-norm and can better handle outliers while maintaining computational efficiency comparable to standard NMF. Manifold regularization has also been employed to incorporate geometric structures into dimensionality reduction, utilizing a graph Laplacian and leading to graph regularized NMF (GNMF) [28]. Semi-supervised methods, such as those incorporating marker genes [29] or similarity and dissimilarity constraints [30], have been proposed to enhance NMF's robustness. Additionally, various other NMF derivatives have been introduced [31-34]. Despite these advancements, manifold regularization remains an essential component to ensure that the lower-dimensional representation of the data can form meaningful clusters. However, a graph Laplacian can only capture a single scale of the data, set by the scaling factor in the heat kernel. Therefore, single-scale graph Laplacians lack multiscale information.

Eckmann [35] introduced simplicial complexes to the graph Laplacian defined on point cloud data, leading to the combinatorial Laplacian. This can be viewed as a discrete counterpart of the de Rham-Hodge Laplacian on manifolds. Both the Hodge Laplacian and the combinatorial Laplacian are topological Laplacians that give rise to topological invariants in their kernel space, specifically the harmonic spectra. However, the nonharmonic spectra contain algebraic connectivity that cannot be revealed by the topological invariants [36].

A significant development in topological Laplacians occurred in 2019 with the introduction of persistent topological Laplacians. Specifically, evolutionary de Rham theory was introduced to obtain persistent Hodge Laplacians on manifolds [37]. Meanwhile, persistent combinatorial Laplacian [38], also known as the persistent spectral graph or persistent Laplacian (PL), was introduced for point cloud data. These methods have spurred numerous theoretical developments [39-43] and code construction [44], as well as remarkable applications in various fields, including protein engineering [45], forecasting emerging SARS-CoV-2 variants BA.4/BA.5 [46], and predicting protein-ligand binding affinity [47]. Recently, PL has been shown to improve PCA performance [14].

This growing interest arises from the fact that persistent topological Laplacians represent a new generation of topological data analysis (TDA) methods that address certain limitations of the popular persistent homology [48, 49]. In persistent homology, the goal is to represent data as a topological space, often as simplicial complexes. Then, ideas from algebraic topology, such as connected components, holes, and voids, are used to extract topological invariants during a multiscale filtration. Persistent homology has facilitated topological deep learning (TDL), an emerging field [50]. However, persistent homology is unable to capture the homotopic shape evolution of data. PLs overcome this limitation by tracking changes in non-harmonic spectra, revealing the homotopic shape evolution. Additionally, the persistence of PL’s harmonic spectra recovers all topological invariants from persistent homology.

In this work, we introduce PL regularized NMF, namely the topological NMF (TNMF) and robust topological NMF (rTNMF). Both TNMF and rTNMF can better capture multiscale geometric information than the standard GNMF and rGNMF. To achieve improved performance, PL is constructed by observing cell-cell interactions at multiple scales through filtration, creating a sequence of simplicial complexes. We can then view the spectra at each complex associated with a filtration to capture both topological and geometric information. Additionally, we introduce k-NN based PL to TNMF and rTNMF, referred to as k-TNMF and k-rTNMF, respectively. The k-NN based PL reduces the number of hyperparameters compared to the standard PL algorithm.

The outline of this work is as follows. First, we provide a brief overview of NMF, rNMF, GNMF, and rGNMF. Next, we present a concise theoretical formulation of PL and derive the multiplicative updating scheme for TNMF and rTNMF. Additionally, we introduce an alternative construction of PL, termed k-NN PL. Following that, we present a benchmark using 12 publicly available datasets. We have observed that PL can improve NMF performance by up to 0.16 in ARI, 0.08 in NMI, 0.04 in purity, and 0.1 in accuracy.

2. Methods

In this section, we provide a brief overview of NMF methods, namely NMF, rNMF, GNMF, and rGNMF. We then introduce the persistent Laplacian and its construction. Finally, we formulate various PL regularized NMF methods. Table 1 shows the parameters and abbreviations used in the formulation of PL regularized NMF.

Table 1:

Abbreviations and notations used in the methods

Abbreviations and notations Description
NMF Nonnegative matrix factorization
rNMF Robust Nonnegative Matrix Factorization
GNMF Graph Regularized Nonnegative Matrix Factorization
rGNMF Robust Graph Regularized Nonnegative Matrix Factorization
TNMF Topological Nonnegative Matrix Factorization
rTNMF Robust Topological Nonnegative Matrix Factorization
k-TNMF k-NN induced Topological Nonnegative Matrix Factorization
k-rTNMF k-NN induced Robust Topological Nonnegative Matrix Factorization
$X \in \mathbb{R}^{M \times N}$ Nonnegative data matrix with M genes and N cells
$W \in \mathbb{R}^{M \times p}$ The basis, or the meta-genes, where p is the rank
$H \in \mathbb{R}^{p \times N}$ Lower-dimensional representation of the data, where p is the rank
$L \in \mathbb{R}^{N \times N}$ Graph Laplacian, $L = D - A$
$A \in \mathbb{R}^{N \times N}$ Adjacency matrix
$D \in \mathbb{R}^{N \times N}$ Degree matrix
$PL \in \mathbb{R}^{N \times N}$ Persistent Laplacian, $PL = PD - PA$
$PA \in \mathbb{R}^{N \times N}$ Adjacency matrix associated with PL
$PD \in \mathbb{R}^{N \times N}$ Degree matrix associated with PL
$\zeta_t$ The weight of the subgraph for the t-th filtration
$\lambda$ Hyperparameter for the regularized NMF

2.1. Prior Work

2.1.0.1. NMF

The original formulation of NMF utilizes the Frobenius norm, which assumes that the noise in the data is sampled from a Gaussian distribution. Let $X \in \mathbb{R}^{M \times N}$ be a nonnegative data matrix, where $M$ is the number of genes and $N$ is the number of cells. The goal of NMF is to find a decomposition $X \approx WH$, where both $W \in \mathbb{R}^{M \times p}$ and $H \in \mathbb{R}^{p \times N}$ are nonnegative, and $p$ is the nonnegative rank. The objective function of NMF is given as follows

$\min_{W,H} \|X - WH\|_F^2, \quad \text{s.t. } W, H \geq 0,$ (1)

where $\|A\|_F^2 = \sum_{i,j} a_{ij}^2$. $W$ is the basis, whose columns are often called the meta-genes in scRNA-seq, and $H$ is the reduced feature matrix. Let $H = [h_1, \ldots, h_N]$, where $h_j$ is the $j$-th column, i.e., the reduced feature for the $j$-th cell. Lee and Seung proposed a multiplicative updating scheme, which preserves nonnegativity [24]. For the $(t+1)$-th iteration,

$w_{ij}^{t+1} = w_{ij}^t \frac{(XH^T)_{ij}}{(WHH^T)_{ij}}$ (2)

$h_{ij}^{t+1} = h_{ij}^t \frac{(W^TX)_{ij}}{(W^TWH)_{ij}}$ (3)

Although the updating scheme is simple and effective in many biological data applications, scRNA-seq data are sparse and contain a large amount of noise. Therefore, a model that is more robust to noise is necessary for feature selection and dimensionality reduction.
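To make the scheme concrete, the following is a minimal NumPy sketch of the multiplicative updates in Eqs. (2)-(3). The random initialization and the small constant eps (a guard against division by zero) are illustrative assumptions; the benchmarks below use NNDSVDA initialization instead.

```python
import numpy as np

def nmf_multiplicative(X, p, n_iter=200, eps=1e-10, seed=0):
    """Sketch of the Lee-Seung multiplicative updates, Eqs. (2)-(3)."""
    rng = np.random.default_rng(seed)
    M, N = X.shape
    W = rng.random((M, p))  # nonnegative random initialization (illustrative)
    H = rng.random((p, N))
    for _ in range(n_iter):
        W *= (X @ H.T) / (W @ H @ H.T + eps)   # Eq. (2): update the basis W
        H *= (W.T @ X) / (W.T @ W @ H + eps)   # Eq. (3): update the reduced features H
    return W, H
```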

2.1.0.2. rNMF

The robust NMF (rNMF) utilizes the $l_{2,1}$ norm, which assumes that the noise in the data is sampled from a Laplace distribution; this may be more suitable for a count-based data matrix such as scRNA-seq data. The minimization problem is given as the following

$\min_{W,H} \|X - WH\|_{2,1}, \quad \text{s.t. } W, H \geq 0,$

where $\|A\|_{2,1} = \sum_j \|a_j\|_2$. Because the $l_{2,1}$-norm sums the $l_2$ distances between each original cell feature and its reduced representation, outliers do not dominate the loss function as much as in the Frobenius norm formulation. As with the original NMF, rNMF has the following multiplicative updating scheme

$w_{ij}^{t+1} = w_{ij}^t \frac{(XQH^T)_{ij}}{(WHQH^T)_{ij}}$ (4)

$h_{ij}^{t+1} = h_{ij}^t \frac{(W^TXQ)_{ij}}{(W^TWHQ)_{ij}},$ (5)

where $Q_{jj} = \frac{1}{\|x_j - Wh_j\|_2}$ is a diagonal matrix.
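As a sketch, the reweighted updates in Eqs. (4)-(5) can be implemented directly; the eps guard and the fixed iteration loop are illustrative assumptions:

```python
import numpy as np

def rnmf_multiplicative(X, W, H, n_iter=200, eps=1e-10):
    """Sketch of the l_{2,1}-norm updates, Eqs. (4)-(5), from nonnegative initial W, H."""
    for _ in range(n_iter):
        # Diagonal reweighting Q_jj = 1 / ||x_j - W h_j||_2
        Q = np.diag(1.0 / (np.linalg.norm(X - W @ H, axis=0) + eps))
        W *= (X @ Q @ H.T) / (W @ H @ Q @ H.T + eps)   # Eq. (4)
        H *= (W.T @ X @ Q) / (W.T @ W @ H @ Q + eps)   # Eq. (5)
    return W, H
```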

2.1.0.3. GNMF and rGNMF

Manifold regularization has been widely utilized in scRNA-seq analysis. Let $G(V, E, W)$ be a graph, where $V = \{x_j\}_{j=1}^N$ is the set of vertices, $E = \{(x_i, x_j) \mid x_i \in \mathcal{N}_k(x_j) \text{ or } x_j \in \mathcal{N}_k(x_i)\}$ is the set of edges, and $W$ is the set of weights associated with the edges. Here, $\mathcal{N}_k(x_j)$ denotes the $k$ nearest neighbors of vertex $x_j$. For the weight between vertices $i$ and $j$, denoted $\omega_{ij}$, we chose a monotonically decaying function with the following two properties

$\omega_{ij} \to 0 \text{ as } \|x_i - x_j\| \to \infty, \qquad \omega_{ij} \to 1 \text{ as } \|x_i - x_j\| \to 0.$ (6)

A common choice for such a function is a radial basis function. For example, one may use the heat kernel

$\omega_{ij} = \exp\left(-\frac{\|x_i - x_j\|^2}{\sigma}\right),$ (7)

where $\sigma$ is the scale of the kernel. We can then represent the weights as an adjacency matrix $A$:

$A_{ij} = \begin{cases} \exp\left(-\frac{\|x_i - x_j\|^2}{\sigma}\right), & x_j \in \mathcal{N}_k(x_i) \\ 0, & \text{otherwise.} \end{cases}$

We can now construct the graph regularization term $R_G$ from the distances $\|h_i - h_j\|^2$ and the adjacency matrix:

$R_G = \frac{1}{2}\sum_{i,j} A_{ij}\|h_i - h_j\|^2 = \sum_i D_{ii} h_i^T h_i - \sum_{i,j} A_{ij} h_i^T h_j = \mathrm{Tr}(HDH^T) - \mathrm{Tr}(HAH^T) = \mathrm{Tr}(HLH^T).$

Here, $L$ and $D$ are the graph Laplacian and the degree matrix, given by $L = D - A$ and $D_{ii} = \sum_j A_{ij}$, respectively, and $\mathrm{Tr}(\cdot)$ denotes the trace of a matrix. Utilizing the regularization parameter $\lambda \geq 0$, we get the objective function of GNMF

$\min_{W,H} \|X - WH\|_F^2 + \lambda\,\mathrm{Tr}(HLH^T), \quad \text{s.t. } W, H \geq 0,$ (8)

and the objective function for rGNMF

$\min_{W,H} \|X - WH\|_{2,1} + \lambda\,\mathrm{Tr}(HLH^T), \quad \text{s.t. } W, H \geq 0.$ (9)
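A minimal sketch of the graph construction behind Eqs. (6)-(9), assuming scikit-learn for the k-NN search; the symmetrization of the k-NN graph mirrors the edge set E defined above:

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def heat_kernel_laplacian(X, k=8, sigma=1.0):
    """Sketch of the graph Laplacian L = D - A used in Eqs. (8)-(9).
    X is (M genes, N cells); the graph is built over the cells (columns)."""
    cells = X.T
    dist = kneighbors_graph(cells, n_neighbors=k, mode='distance')
    dist = dist.maximum(dist.T)              # symmetrize: edge if either end is a k-NN
    A = dist.toarray()
    mask = A > 0
    A[mask] = np.exp(-A[mask] ** 2 / sigma)  # heat kernel weights, Eq. (7)
    D = np.diag(A.sum(axis=1))
    return D - A
```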

2.2. Topological NMF

While graph regularization improves the traditional NMF and rNMF, the choice of $\sigma$ can vastly change the result. Furthermore, graph regularization only captures a single scale and may not be able to capture the multiscale geometric information in the data. Here, we give a brief introduction to persistent homology and the persistent Laplacian (PL) and derive the updating schemes for the topological NMF (TNMF) and the robust topological NMF (rTNMF).

2.2.1. Persistent Laplacians

Persistent homology and PLs have been successfully used for biomolecular data [14, 45, 47-50]. Similar to persistent homology, PLs track the birth and death of topological features, i.e., holes, over different scales. However, unlike persistent homology, PLs can further capture the homotopic shape evolution of data during the filtration. Through the filtration process, these methods offer a multiscale analysis of data.

We begin with the definition of a simplex. Let $\sigma_q = [v_0, \ldots, v_q]$ denote a $q$-simplex, where each $v_i$ is a vertex. $\sigma_0$ is a node, $\sigma_1$ is an edge, $\sigma_2$ is a triangle, $\sigma_3$ is a tetrahedron, and so on. A simplicial complex $K$ is a union of simplices such that

  1. If $\sigma_q \in K$ and $\sigma_p$ is a face of $\sigma_q$, then $\sigma_p \in K$.

  2. The nonempty intersection of any 2 simplices in $K$ is a face of both simplices.

We can think of $K$ as gluing together lower-dimensional simplices in a way that satisfies the above 2 properties.

A $q$-chain is a formal sum of $q$-simplices in $K$ with coefficients in $\mathbb{Z}_2 = \{0, 1\}$. The set of all $q$-chains, with the $q$-simplices of $K$ as a basis, forms a finitely generated free Abelian group $C_q(K)$. We can relate the chain groups via a boundary operator, which is a group homomorphism $\partial_q: C_q(K) \to C_{q-1}(K)$, defined as the following.

$\partial_q \sigma_q = \sum_{i=0}^q (-1)^i \sigma_{q-1}^i,$ (10)

where $\sigma_{q-1}^i = [v_0, \ldots, \hat{v}_i, \ldots, v_q]$ is the $(q-1)$-simplex with vertex $v_i$ removed. The sequence of chain groups connected by the boundary operators defines the chain complex

$\cdots \xrightarrow{\partial_{q+2}} C_{q+1}(K) \xrightarrow{\partial_{q+1}} C_q(K) \xrightarrow{\partial_q} \cdots$ (11)

The chain complex associated with a simplicial complex $K$ defines the $q$-th homology group $H_q = \mathrm{Ker}\,\partial_q / \mathrm{Im}\,\partial_{q+1}$, and the dimension of $H_q$ counts the $q$-dimensional holes; it is the $q$-th Betti number, denoted $\beta_q$. For example, $\beta_0$ is the number of connected components, $\beta_1$ is the number of loops, and $\beta_2$ is the number of cavities.

We can now define the dual chain complex through the adjoint operator of $\partial_q$. The dual space is defined as $C^q(K) \cong C_q(K)$, and the coboundary operator is $\partial_q^*: C_{q-1}(K) \to C_q(K)$. For $\omega^{q-1} \in C^{q-1}(K)$ and $c_q \in C_q(K)$, the coboundary operator is defined by

$(\partial_q^* \omega^{q-1})(c_q) \equiv \omega^{q-1}(\partial_q c_q).$ (12)

Here $\omega^{q-1}$ is a $(q-1)$-cochain, i.e., a homomorphic mapping from a chain to the coefficient group. The homology of the dual chain complex is called the cohomology.

We then define the $q$-combinatorial Laplacian operator $\Delta_q: C_q(K) \to C_q(K)$ as

$\Delta_q \equiv \partial_{q+1}\partial_{q+1}^* + \partial_q^*\partial_q.$ (13)

Let $\mathcal{B}_q$ be the matrix representation of the $q$-boundary operator $\partial_q: C_q(K) \to C_{q-1}(K)$ with respect to the standard basis, and $\mathcal{B}_q^T$ that of the $q$-coboundary operator. The matrix representation of the $q$-th order Laplacian operator, $\mathcal{L}_q$, is then

$\mathcal{L}_q = \mathcal{B}_{q+1}\mathcal{B}_{q+1}^T + \mathcal{B}_q^T\mathcal{B}_q.$ (14)

The multiplicity of the zero eigenvalue of $\mathcal{L}_q$ is the $q$-th Betti number of the simplicial complex. The nonzero eigenvalues (the non-harmonic spectrum) contain additional topological and geometric features.
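For $q = 0$, $\mathcal{L}_0 = \mathcal{B}_1\mathcal{B}_1^T$ is the familiar graph Laplacian, and $\beta_0$ can be read off numerically. A small sketch with a hypothetical complex (a path on three vertices plus an isolated vertex):

```python
import numpy as np

# Boundary matrix B1 (vertices x edges) for edges [v0,v1] and [v1,v2];
# vertex v3 is isolated, so the complex has two connected components.
B1 = np.array([[-1,  0],
               [ 1, -1],
               [ 0,  1],
               [ 0,  0]])
L0 = B1 @ B1.T                                  # Eq. (14) with q = 0
eigvals = np.linalg.eigvalsh(L0)
beta0 = int(np.sum(np.isclose(eigvals, 0.0)))   # multiplicity of the zero eigenvalue
print(beta0)                                    # prints 2, the number of components
```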

As stated before, a single simplicial complex does not provide sufficient information to understand the geometry of the data. To this end, we utilize a sequence of simplicial complexes induced by a filtration

$\{\emptyset\} = K_0 \subseteq K_1 \subseteq \cdots \subseteq K_p = K,$ (15)

where $p$ is the number of filtration steps.

For each $K_t$, $0 \leq t \leq p$, denote by $C_q(K_t)$ the chain group induced by $K_t$, with corresponding boundary operator $\partial_q^t: C_q(K_t) \to C_{q-1}(K_t)$, resulting in

$\partial_q^t \sigma_q = \sum_{i=0}^q (-1)^i \sigma_{q-1}^i,$ (16)

for $\sigma_q \in K_t$. The adjoint operator of $\partial_q^t$ is similarly defined as $(\partial_q^t)^*: C_{q-1}(K_t) \to C_q(K_t)$, which we regard as the mapping $C^{q-1}(K_t) \to C^q(K_t)$ via the isomorphism between cochain and chain groups. Through these 2 operators, we can define the chain complexes induced by $K_t$.

Utilizing the filtration of simplicial complexes, we can define the persistent Laplacian spectra. Let $\mathbb{C}_q^{t,p}$ be the subset of $C_q(K_{t+p})$ whose boundary is in $C_{q-1}(K_t)$, assuming the inclusion mapping $C_{q-1}(K_t) \hookrightarrow C_{q-1}(K_{t+p})$. On this set, we can define the $p$-persistent $q$-boundary operator $\hat{\partial}_q^{t,p}: \mathbb{C}_q^{t,p} \to C_{q-1}(K_t)$ and the corresponding adjoint operator $(\hat{\partial}_q^{t,p})^*: C_{q-1}(K_t) \to \mathbb{C}_q^{t,p}$. Then, the $q$-order $p$-persistent Laplacian operator is computed as

$\Delta_q^{t,p} = \hat{\partial}_{q+1}^{t,p}\left(\hat{\partial}_{q+1}^{t,p}\right)^* + \left(\partial_q^t\right)^*\partial_q^t,$ (17)

and its matrix representation is

$\mathcal{L}_q^{t,p} = \mathcal{B}_{q+1}^{t,p}\left(\mathcal{B}_{q+1}^{t,p}\right)^T + \left(\mathcal{B}_q^t\right)^T\mathcal{B}_q^t.$ (18)

As before, the multiplicity of the zero eigenvalue is the $q$-th order $p$-persistent Betti number $\beta_q^{t,p}$, which counts the $q$-dimensional holes in $K_t$ that persist in $K_{t+p}$. Moreover, the $q$-th order Laplacian is just the particular case of $\mathcal{L}_q^{t,p}$ with $p = 0$, which is a snapshot of the topology at filtration step $t$ [38, 44].

We can utilize the 0-persistent Laplacian to capture the interactions between data points at different filtration values. In particular, we can perform filtration by computing a family of subgraphs induced by a threshold distance $r$, which yields the Vietoris-Rips complex. Alternatively, we can compute a Gaussian kernel induced distance to construct the subgraphs.

2.2.2. TNMF and rTNMF

For scRNA-seq data, we calculate the 0-persistent Laplacian using the Vietoris-Rips (VR) complexes by increasing the filtration distance. We can then take a weighted sum over the 0-persistent Laplacian induced by the changes in the filtration distance. For persistent Laplacian enhanced NMF, we will provide a computationally efficient algorithm to construct the persistent Laplacian matrix.

Let $L$ be a Laplacian matrix induced by some weighted graph, with entries

$L_{ij} = \begin{cases} l_{ij}, & i \neq j \\ -\sum_{j=1}^N l_{ij}, & i = j. \end{cases}$

Let $l_{\max} = \max_{i \neq j} l_{ij}$, $l_{\min} = \min_{i \neq j} l_{ij}$, and $d = l_{\max} - l_{\min}$. The $t$-th persistent Laplacian $L^t$, $t = 1, \ldots, T$, is defined as $L^t = \{l_{ij}^t\}$, where

$l_{ij}^t = \begin{cases} 0, & l_{ij} \geq \frac{t}{T}d + l_{\min} \\ -1, & \text{otherwise,} \end{cases}$ (19)

$l_{ii}^t = -\sum_{j \neq i} l_{ij}^t.$ (20)

Then, we can take the weighted sum over all the persistent Laplacians

$PL = \sum_{t=1}^T \zeta_t L^t.$ (21)
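A minimal sketch of Eqs. (19)-(21), under the sign convention reconstructed above (off-diagonal Laplacian entries are nonpositive); the function name is illustrative:

```python
import numpy as np

def persistent_laplacian(L, zetas):
    """Sketch of the cutoff-based persistent Laplacian, Eqs. (19)-(21)."""
    T = len(zetas)
    off = ~np.eye(L.shape[0], dtype=bool)
    l_min, l_max = L[off].min(), L[off].max()
    d = l_max - l_min
    PL = np.zeros_like(L, dtype=float)
    for t in range(1, T + 1):
        # Eq. (19): keep an edge (entry -1) only below the t-th threshold
        Lt = np.where(L >= (t / T) * d + l_min, 0.0, -1.0)
        np.fill_diagonal(Lt, 0.0)
        np.fill_diagonal(Lt, -Lt.sum(axis=1))   # Eq. (20): degree on the diagonal
        PL += zetas[t - 1] * Lt                 # Eq. (21): weighted sum
    return PL
```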

Unlike the standard Laplacian matrix $L$, $PL$ captures the topological features that persist over different filtrations, thus providing a multiscale view of the data that the standard Laplacian lacks. Here, the weights $\zeta_t$ are hyperparameters that must be chosen. Then, the objective function for PL regularized NMF, which we call topological nonnegative matrix factorization (TNMF), is defined as

$\min_{W,H} \|X - WH\|_F^2 + \lambda\,\mathrm{Tr}(H(PL)H^T), \quad \text{s.t. } W, H \geq 0,$ (22)

and the objective function for the robust topological NMF (rTNMF) is defined as

$\min_{W,H} \|X - WH\|_{2,1} + \lambda\,\mathrm{Tr}(H(PL)H^T), \quad \text{s.t. } W, H \geq 0.$ (23)

2.2.3. Multiplicative Updating scheme

The updating scheme follows the same principle as the standard GNMF and rGNMF.

2.2.3.1. TNMF

For TNMF, the Lagrangian function is defined as

$\mathcal{L} = \|X - WH\|_F^2 + \lambda\,\mathrm{Tr}(H(PL)H^T) + \mathrm{Tr}(\Phi W) + \mathrm{Tr}(\Psi H)$ (24)

$= \mathrm{Tr}(X^TX) - 2\,\mathrm{Tr}(XH^TW^T) + \mathrm{Tr}(WHH^TW^T) + \lambda\,\mathrm{Tr}(H(PL)H^T) + \mathrm{Tr}(\Phi W) + \mathrm{Tr}(\Psi H).$ (25)

Taking the partial derivative with respect to $W$, we get

$\frac{\partial \mathcal{L}}{\partial W} = -2XH^T + 2WHH^T + \Phi.$ (26)

Using the KKT condition $\Phi_{ij} w_{ij} = 0$, we get the following

$-(2XH^T)_{ij} w_{ij} + (2WHH^T)_{ij} w_{ij} = 0.$ (27)

Therefore, the updating scheme is

$w_{ij}^{t+1} \leftarrow w_{ij}^t \frac{(XH^T)_{ij}}{(WHH^T)_{ij}}.$ (28)

For updating $H$, we take the derivative of the Lagrangian function with respect to $H$

$\frac{\partial \mathcal{L}}{\partial H} = -2W^TX + 2W^TWH + 2\lambda H(PL) + \Psi.$ (29)

Using the Karush-Kuhn-Tucker (KKT) condition $\Psi_{ij} h_{ij} = 0$ and substituting, we get

$-2(W^TX + \lambda H(PA))_{ij} h_{ij} + 2(W^TWH + \lambda H(PD))_{ij} h_{ij} = 0,$ (30)

where $PL = PD - PA$ and $PD_{ii} = \sum_{j \neq i} PA_{ij}$. The updating scheme is then given by

$h_{ij}^{t+1} \leftarrow h_{ij}^t \frac{(W^TX + \lambda H(PA))_{ij}}{(W^TWH + \lambda H(PD))_{ij}}.$ (31)
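A sketch of the resulting TNMF iteration, assuming PA and PD have been precomputed from $PL = PD - PA$; eps is an illustrative guard against division by zero:

```python
import numpy as np

def tnmf_updates(X, W, H, PA, PD, lam=1.0, n_iter=200, eps=1e-10):
    """Sketch of the TNMF multiplicative updates, Eqs. (28) and (31)."""
    for _ in range(n_iter):
        W *= (X @ H.T) / (W @ H @ H.T + eps)                                # Eq. (28)
        H *= (W.T @ X + lam * H @ PA) / (W.T @ W @ H + lam * H @ PD + eps)  # Eq. (31)
    return W, H
```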
2.2.3.2. rTNMF

For the updating scheme for rTNMF, we utilize the fact that $\|A\|_{2,1} = \mathrm{Tr}(AQA^T)$, where $Q_{ii} = \frac{1}{\|a_i\|_2}$. The Lagrangian is given by

$\mathcal{L} = \|X - WH\|_{2,1} + \lambda\,\mathrm{Tr}(H(PL)H^T) + \mathrm{Tr}(\Phi W) + \mathrm{Tr}(\Psi H)$ (32)

$= \mathrm{Tr}((X - WH)Q(X - WH)^T) + \lambda\,\mathrm{Tr}(H(PL)H^T) + \mathrm{Tr}(\Phi W) + \mathrm{Tr}(\Psi H)$ (33)

$= \mathrm{Tr}(XQX^T) - 2\,\mathrm{Tr}(WHQX^T) + \mathrm{Tr}(WHQH^TW^T) + \lambda\,\mathrm{Tr}(H(PL)H^T) + \mathrm{Tr}(\Phi W) + \mathrm{Tr}(\Psi H),$ (34)

where $Q_{jj} = \frac{1}{\|x_j - Wh_j\|}$. Taking the partial derivative with respect to $W$, we get

$\frac{\partial \mathcal{L}}{\partial W} = -XQH^T + WHQH^T + \Phi.$ (35)

Using the KKT condition $\Phi_{ij} w_{ij} = 0$, we get

$-(XQH^T)_{ij} w_{ij} + (WHQH^T)_{ij} w_{ij} = 0,$ (36)

which gives the updating scheme

$w_{ij}^{t+1} \leftarrow w_{ij}^t \frac{(XQH^T)_{ij}}{(WHQH^T)_{ij}}.$ (37)

For $H$, we take the partial derivative with respect to $H$:

$\frac{\partial \mathcal{L}}{\partial H} = -W^TXQ + W^TWHQ + 2\lambda H(PL) + \Psi.$ (38)

Then, using the KKT condition $\Psi_{ij} h_{ij} = 0$, we get

$-(W^TXQ + 2\lambda H(PA))_{ij} h_{ij} + (W^TWHQ + 2\lambda H(PD))_{ij} h_{ij} = 0,$ (39)

where $PL = PD - PA$, which gives the updating scheme

$h_{ij}^{t+1} \leftarrow h_{ij}^t \frac{(W^TXQ + 2\lambda H(PA))_{ij}}{(W^TWHQ + 2\lambda H(PD))_{ij}}.$ (40)
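The rTNMF iteration combines the reweighting matrix Q with the PL terms; a sketch under the same assumptions as the TNMF sketch above:

```python
import numpy as np

def rtnmf_updates(X, W, H, PA, PD, lam=1.0, n_iter=200, eps=1e-10):
    """Sketch of the rTNMF multiplicative updates, Eqs. (37) and (40)."""
    for _ in range(n_iter):
        # Q_jj = 1 / ||x_j - W h_j||, recomputed each iteration
        Q = np.diag(1.0 / (np.linalg.norm(X - W @ H, axis=0) + eps))
        W *= (X @ Q @ H.T) / (W @ H @ Q @ H.T + eps)                    # Eq. (37)
        H *= (W.T @ X @ Q + 2 * lam * H @ PA) / \
             (W.T @ W @ H @ Q + 2 * lam * H @ PD + eps)                 # Eq. (40)
    return W, H
```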

2.3. k-NN induced Persistent Laplacian

One major issue with TNMF and rTNMF is that the parameters $\{\zeta_t\}_{t=1}^T$ have to be chosen. For the parameters, we let $\zeta_t \in \{0, 1, \frac{1}{2}, \ldots, \frac{1}{T}\}$, for a total of $T+1$ possible values per weight. Therefore, the number of parameter combinations that needs to be searched grows exponentially as the number of filtrations $T$ increases. We therefore propose an approximation to the original formulation using a k-NN induced PL.

Let $\mathcal{N}_t(x_j)$ be the $t$ nearest neighbors of sample $x_j$. First, we define the $t$-persistent directed adjacency matrix $\tilde{A}^t$ as

$\tilde{A}^t = \{\tilde{a}_{ij}^t\}, \quad \tilde{a}_{ij}^t = \begin{cases} 1, & x_j \in \mathcal{N}_t(x_i) \\ 0, & \text{otherwise.} \end{cases}$ (41)

Then, the k-NN based directed adjacency matrix is the weighted sum of $\{\tilde{A}^t\}$

$\tilde{A} = \sum_{t=1}^T \zeta_t \tilde{A}^t.$ (42)

The undirected persistent adjacency matrix can be obtained via symmetrization

$PA = \tilde{A} + \tilde{A}^T - \tilde{A} \circ \tilde{A}^T,$

where $\circ$ denotes the Hadamard product. Then, the PL can be constructed using the persistent degree and persistent adjacency matrices

$PL = PD - PA, \quad PD_{ii} = \sum_{j \neq i} PA_{ij}.$ (43)
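A sketch of Eqs. (41)-(43), assuming scikit-learn's k-NN search; zetas is a length-T binary vector and the function name is illustrative:

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def knn_persistent_laplacian(cells, zetas):
    """Sketch of the k-NN induced persistent Laplacian, Eqs. (41)-(43).
    cells: (N, features) array; zetas: length-T sequence with entries in {0, 1}."""
    T = len(zetas)
    N = cells.shape[0]
    A_tilde = np.zeros((N, N))
    for t in range(1, T + 1):
        # Eq. (41): a_ij = 1 if x_j is among the t nearest neighbors of x_i
        At = kneighbors_graph(cells, n_neighbors=t, mode='connectivity').toarray()
        A_tilde += zetas[t - 1] * At                      # Eq. (42)
    PA = A_tilde + A_tilde.T - A_tilde * A_tilde.T        # symmetrization (Hadamard product)
    PD = np.diag(PA.sum(axis=1))
    return PD - PA                                        # Eq. (43)
```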

One advantage of utilizing the k-NN induced persistent Laplacian is that the parameter space is much smaller. We can set $\zeta_t \in \{0, 1\}$, where $\zeta_t = 0$ 'turns off' the connectivity of the particular neighbor. In essence, the number of parameter combinations is reduced to $2^T$, a significant decrease from the $(T+1)^T$ of the original formulation. We denote the k-NN induced TNMF as k-TNMF and the k-NN induced rTNMF as k-rTNMF.

2.4. Evaluation metrics

Let $Y = \{Y_1, \ldots, Y_L\}$ and $C = \{C_1, \ldots, C_L\}$ be 2 partitions of the data, where $Y$ is the true label partition and $C$ is the cluster label partition. Let $\{y_i\}_{i=1}^N$ and $\{c_i\}_{i=1}^N$ be the true and predicted labels of the samples.

2.4.0.1. Adjusted Rand Index

The adjusted Rand index (ARI) measures the similarity between two clusterings by observing all pairs of samples that belong to the same cluster and checking whether the other clustering also places those pairs in the same cluster [51]. Let $n_{ij} = |Y_i \cap C_j|$ be the number of samples that belong to true label $i$ and cluster label $j$, and define $a_i = \sum_j n_{ij}$ and $b_j = \sum_i n_{ij}$. Then, the ARI is defined as

$\mathrm{ARI} = \frac{\sum_{ij}\binom{n_{ij}}{2} - \left[\sum_i\binom{a_i}{2}\sum_j\binom{b_j}{2}\right]/\binom{N}{2}}{\frac{1}{2}\left[\sum_i\binom{a_i}{2} + \sum_j\binom{b_j}{2}\right] - \left[\sum_i\binom{a_i}{2}\sum_j\binom{b_j}{2}\right]/\binom{N}{2}}.$ (44)

The ARI takes values between −1 and 1, where 1 is a perfect match between the two clusterings, 0 corresponds to a completely random assignment of labels, and −1 indicates that the two clusterings are completely different.

2.4.0.2. Normalized Mutual Information

The normalized mutual information (NMI) measures the mutual information between two clustering results, normalized according to cluster size [52]. We fix the true labels $Y$ as one of the clustering results and use the predicted labels as the other to calculate NMI. The NMI is calculated as follows

$\mathrm{NMI} = \frac{2 I(Y;C)}{H(Y) + H(C)},$ (45)

where $H(\cdot)$ is the entropy and $I(Y;C)$ is the mutual information between the true labels $Y$ and predicted labels $C$. NMI ranges between 0 and 1, where 1 is a perfect mutual correlation between the two sets of labels and 0 means no mutual information.
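Both scores are available off the shelf; a small sketch with hypothetical label arrays, assuming scikit-learn:

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

y_true = np.array([0, 0, 1, 1, 2, 2])   # hypothetical annotated cell types
y_pred = np.array([1, 1, 0, 0, 2, 2])   # hypothetical cluster assignments
print(adjusted_rand_score(y_true, y_pred))           # ARI, Eq. (44)
print(normalized_mutual_info_score(y_true, y_pred))  # NMI, Eq. (45)
```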

2.4.0.3. Accuracy

Accuracy (ACC) calculates the percentage of correctly predicted class labels. The accuracy is given by

ACC=1Ni=1Nδ(yi,f(ci)), (46)

where δ(a,b) is the indicator function, where if a=b, δ(a,b)=1, and 0 otherwise. f:CY maps the cluster labels to the true labels, where the mapping is the optimal permutation of the cluster labels and true labels obtained from the Hungarian algorithm [53].

2.4.0.4. Purity

For the purity calculation, each predicted label $C_i$ is assigned to the true label $Y_j$ such that $|C_i \cap Y_j|$ is maximized [54]. Averaging over all the predicted labels, we obtain the following

$\mathrm{Purity} = \frac{1}{N}\sum_i \max_j |C_i \cap Y_j|.$ (47)

Note that unlike accuracy, purity does not map the predicted labels to the true labels.
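A sketch of Eqs. (46)-(47), using SciPy's Hungarian solver for the optimal cluster-to-class mapping $f$; the function name is illustrative:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def accuracy_and_purity(y_true, y_pred):
    """Sketch of ACC (Eq. 46, via the Hungarian algorithm) and purity (Eq. 47)."""
    classes, clusters = np.unique(y_true), np.unique(y_pred)
    # Contingency table: n[i, j] = |cluster i ∩ class j|
    n = np.array([[np.sum((y_pred == ci) & (y_true == yj)) for yj in classes]
                  for ci in clusters])
    row, col = linear_sum_assignment(-n)          # maximize total matched samples
    acc = n[row, col].sum() / len(y_true)         # Eq. (46)
    purity = n.max(axis=1).sum() / len(y_true)    # Eq. (47): no label mapping
    return acc, purity
```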

3. Results

3.1. Benchmark Data

We have performed benchmarks on 12 publicly available datasets. The GEO accession number, reference, organism, number of cell types, and number of samples are recorded in Table 2. For each dataset, cell types with fewer than 15 cells were removed. Log-normalization was applied, and the data were scaled to unit length. For GNMF and rGNMF, $k = 8$ neighbors were used. For TNMF and rTNMF, 8 filtration values were used to construct the PL, and for each scale a binary weight $\zeta_t \in \{0, 1\}$ was used. For k-TNMF and k-rTNMF, $k = 8$ was used with $\zeta_t \in \{0, 1\}$. For each test, nonnegative double singular value decomposition with zeros filled with the average of $X$ (NNDSVDA) was used for the initialization. For the rank, we chose the integer part of $\sqrt{N}$, where $N$ is the number of cells. k-means clustering was applied to the reduced features to obtain the clustering results.
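The following sketch summarizes this pipeline; nmf_fit stands in for any of the factorization routines sketched above, and the NNDSVDA initialization is omitted for brevity:

```python
import numpy as np
from sklearn.cluster import KMeans

def benchmark_pipeline(X, n_cell_types, nmf_fit):
    """Sketch of the benchmark pipeline; X is a (genes, cells) count matrix."""
    Xn = np.log1p(X)                                               # log-normalization
    Xn = Xn / (np.linalg.norm(Xn, axis=0, keepdims=True) + 1e-12)  # unit-length cells
    p = int(np.sqrt(X.shape[1]))                                   # rank = integer part of sqrt(N)
    W, H = nmf_fit(Xn, p)
    labels = KMeans(n_clusters=n_cell_types, n_init=10).fit_predict(H.T)
    return labels
```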

Table 2:

GEO accession code, reference, organism, number of cell types, number of samples, and number of genes of each dataset.

GEO Accession Reference Organism Cell types Number of Samples Number of Genes
GSE67835 Darmanis [55] Human 8 420 22084
GSE75748 time Chu [56] Human 6 758 19189
GSE82187 Gokce [57] Mouse 8 705 18840
GSE84133human1 Baron [58] Human 9 1895 20125
GSE84133human2 Baron [58] Human 9 1702 20125
GSE84133human3 Baron [58] Human 9 3579 20125
GSE84133human4 Baron [58] Human 6 1275 20125
GSE84133mouse1 Baron [58] Mouse 6 782 14878
GSE84133mouse2 Baron [58] Mouse 8 1036 14878
GSE57249 Biase [59] Human 3 49 25737
GSE64016 Leng [60] Human 4 460 19084
GSE94820 Villani [61] Human 5 1140 26593

3.2. Benchmarking PL regularized NMF

In order to benchmark persistent Laplacian regularized NMF, we compared our methods to other commonly used NMF methods, namely GNMF, rGNMF, rNMF and NMF. For a fair comparison, we omitted supervised and semi-supervised methods. For k-rTNMF, rTNMF, k-TNMF, TNMF, GNMF and rGNMF, we set $\lambda = 1$ for all tests.

Table 3 shows the ARI values of the NMF methods for the 12 datasets we tested. Bold numbers indicate the highest performance. Figure 1 depicts the average ARI value over the 12 datasets for each method.

Table 3:

ARI of NMF methods across 12 datasets.

data k-rTNMF rTNMF k-TNMF TNMF rGNMF GNMF rNMF NMF
GSE67835 0.9454 0.9236 0.9306 0.8533 0.9391 0.9109 0.7295 0.7314
GSE64016 0.2569 0.1544 0.2237 0.1491 0.1456 0.1605 0.1455 0.1466
GSE75748time 0.6421 0.6581 0.5963 0.6099 0.6104 0.5790 0.5969 0.5996
GSE82187 0.9877 0.9815 0.9676 0.9809 0.7558 0.7577 0.8221 0.8208
GSE84133human1 0.8310 0.8969 0.8301 0.8855 0.8220 0.7907 0.7080 0.6120
GSE84133human2 0.9469 0.9072 0.9433 0.9255 0.9350 0.9255 0.8930 0.8929
GSE84133human3 0.8504 0.9179 0.8625 0.9181 0.8447 0.8361 0.7909 0.8089
GSE84133human4 0.8712 0.9692 0.8712 0.9692 0.8699 0.8681 0.8311 0.8311
GSE84133mouse1 0.8003 0.7894 0.8003 0.7913 0.7945 0.7918 0.6428 0.6348
GSE84133mouse2 0.6953 0.8689 0.7005 0.9331 0.6808 0.6957 0.5436 0.5470
GSE57249 1.0000 0.9638 1.0000 0.9483 1.0000 1.0000 0.9483 0.9483
GSE94820 0.6101 0.5480 0.4916 0.5574 0.5139 0.5189 0.5440 0.5556

Figure 1:

Average ARI of k-rTNMF, rTNMF, k-TNMF, TNMF, rGNMF, GNMF, rNMF and NMF for the 12 datasets.

Overall, PL regularization improves the ARI values across all the datasets. k-rTNMF outperforms other NMF methods by at least 0.09 for GSE64016. All PL regularized NMF methods outperform other NMF methods by at least 0.14 for GSE82187. For GSE84133 human 3, both rTNMF and TNMF outperform other methods by 0.07. TNMF improves on other methods by more than 0.2 for GSE84133 mouse 2. Lastly, k-rTNMF has the highest ARI value for GSE94820. Moreover, on average, rTNMF improves on rGNMF by 0.05, and TNMF improves on GNMF by about 0.06. k-TNMF and k-rTNMF also improve on GNMF and rGNMF by about 0.03.

Table 4 shows the NMI values of the NMF methods for the 12 datasets we tested. Bold numbers indicate the highest performance. Figure 2 shows the average NMI value over the 12 datasets.

Table 4:

NMI of NMF methods across 12 datasets.

data k-rTNMF rTNMF k-TNMF TNMF rGNMF GNMF rNMF NMF
GSE67835 0.9235 0.8999 0.9107 0.8607 0.9104 0.8858 0.7975 0.8017
GSE64016 0.3057 0.2059 0.3136 0.1869 0.2593 0.2562 0.1896 0.1849
GSE75748time 0.7522 0.7750 0.7159 0.7343 0.7235 0.6971 0.7227 0.7244
GSE82187 0.9759 0.9691 0.9298 0.9668 0.8802 0.8754 0.9124 0.9117
GSE84133human1 0.8802 0.8716 0.8785 0.8780 0.8713 0.8310 0.8226 0.7949
GSE84133human2 0.9363 0.8937 0.9313 0.9070 0.9237 0.9145 0.8835 0.8829
GSE84133human3 0.8500 0.8718 0.8577 0.8677 0.8439 0.8357 0.8215 0.8260
GSE84133human4 0.8795 0.9542 0.8795 0.9542 0.8775 0.8753 0.8694 0.8694
GSE84133mouse1 0.8664 0.8498 0.8664 0.8495 0.8596 0.8565 0.7634 0.7593
GSE84133mouse2 0.8218 0.8355 0.8299 0.8713 0.8005 0.8129 0.7258 0.7272
GSE57249 1.0000 0.9505 1.0000 0.9293 1.0000 1.0000 0.9293 0.9293
GSE94820 0.7085 0.6657 0.6157 0.6716 0.6195 0.6258 0.6624 0.6693

Figure 2:

Average NMI values of k-rTNMF, rTNMF, k-TNMF, TNMF, rGNMF, GNMF, rNMF and NMF for the 12 datasets.

Interestingly, k-rTNMF and k-TNMF, on average, have higher NMI values than rTNMF and TNMF, respectively. However, all PL regularized methods outperform rGNMF, GNMF, rNMF and NMF. Most notably, k-rTNMF, rTNMF and TNMF outperform standard NMF methods by 0.06 for GSE82187. Both rTNMF and TNMF outperform rGNMF and GNMF by 0.08 for GSE84133 human 4.

Table 5 shows the purity values of the NMF methods for the 12 datasets we tested. Bold numbers indicate the highest performance. Figure 3 shows the average purity over the 12 datasets.

Table 5:

Purity of NMF methods across 12 datasets.

data k-rTNMF rTNMF k-TNMF TNMF rGNMF GNMF rNMF NMF
GSE67835 0.9643 0.9267 0.9595 0.9024 0.9595 0.9476 0.8726 0.8719
GSE64016 0.6048 0.4913 0.5846 0.5013 0.5339 0.5398 0.5080 0.5050
GSE75748time 0.7736 0.7512 0.7533 0.7454 0.7553 0.7387 0.7467 0.7455
GSE82187 0.9927 0.9895 0.9620 0.9888 0.9620 0.9594 0.9693 0.9692
GSE84133human1 0.9543 0.9357 0.9536 0.9382 0.9490 0.9187 0.9189 0.9099
GSE84133human2 0.9818 0.9614 0.9806 0.9661 0.9777 0.9736 0.9602 0.9600
GSE84133human3 0.9472 0.9485 0.9531 0.9460 0.9452 0.9420 0.9464 0.9466
GSE84133human4 0.9427 0.9882 0.9427 0.9882 0.9427 0.9420 0.9412 0.9412
GSE84133mouse1 0.9565 0.9540 0.9565 0.9540 0.9552 0.9540 0.9309 0.9299
GSE84133mouse2 0.9585 0.9410 0.9604 0.9373 0.9466 0.9507 0.9185 0.9199
GSE57249 1.0000 0.9857 1.0000 0.9796 1.0000 1.0000 0.9796 0.9796
GSE94820 0.7893 0.7462 0.6658 0.7550 0.6421 0.6421 0.7429 0.7531

Figure 3:

Average purity values of k-rTNMF, rTNMF, k-TNMF, TNMF, rGNMF, GNMF, rNMF and NMF for the 12 datasets.

In general, PL regularized methods achieve higher purity values compared to other NMF methods. Purity measures the maximum intersection between true and predicted classes, which is why we do not observe as significant a difference as seen in ARI and NMI. Furthermore, since purity does not account for the size of a class, and given the imbalanced class sizes in scRNA-seq data, it is not surprising that the purity values are similar.

Table 6 shows the ACC of the NMF methods for the 12 datasets we tested. Bold numbers indicate the highest performance. Figure 4 shows the average ACC over the 12 datasets.

Table 6:

ACC of NMF methods across 12 datasets.

data k-rTNMF rTNMF k-TNMF TNMF rGNMF GNMF rNMF NMF
GSE67835 0.9643 0.9243 0.9595 0.9000 0.9595 0.9383 0.8357 0.8364
GSE64016 0.5700 0.4870 0.5502 0.4746 0.4891 0.4537 0.4691 0.4759
GSE75748time 0.7565 0.7438 0.7414 0.6917 0.7355 0.7241 0.6873 0.6875
GSE82187 0.9927 0.9895 0.9599 0.9888 0.8512 0.8514 0.8896 0.8889
GSE84133human1 0.8973 0.9194 0.8974 0.9088 0.8889 0.8364 0.7988 0.7370
GSE84133human2 0.9260 0.9069 0.9242 0.9447 0.9224 0.9177 0.8998 0.8994
GSE84133human3 0.8539 0.9456 0.8597 0.9419 0.8498 0.8228 0.8032 0.8178
GSE84133human4 0.8831 0.9882 0.8831 0.9882 0.8824 0.8816 0.8847 0.8847
GSE84133mouse1 0.8581 0.8542 0.8581 0.8542 0.8555 0.8542 0.7361 0.7311
GSE84133mouse2 0.8232 0.9101 0.8263 0.9305 0.7903 0.8155 0.7239 0.7294
GSE57249 1.0000 0.9857 1.0000 0.9796 1.0000 1.0000 0.9796 0.9796
GSE94820 0.7533 0.7119 0.6482 0.7201 0.6088 0.6107 0.7091 0.7189

Figure 4:

Average ACC of k-rTNMF, rTNMF, k-TNMF, TNMF, rGNMF, GNMF, rNMF and NMF for the 12 datasets.

Once again, we see that PL regularized methods have higher ACC than other NMF methods. rTNMF and TNMF improve on rGNMF and GNMF by 0.05, and k-rTNMF and k-TNMF improve on rGNMF and GNMF by 0.04. We see an improvement in ACC for both k-rTNMF and k-TNMF for GSE64016. All 4 PL regularized methods improve the ACC for GSE82187 by 0.1. rTNMF and TNMF improve the ACC for GSE84133 mouse 2 by at least 0.1 as well.

3.3. Overall performance

Figure 5 shows the average ARI, NMI, purity and ACC of k-rTNMF, rTNMF, k-TNMF, TNMF, rGNMF, GNMF, rNMF, NMF across 12 datasets. All PL regularized NMF methods outperform the traditional rGNMF, GNMF, rNMF and NMF. Both rTNMF and TNMF have higher average ARI and purity than the k-NN based PL counterparts. However, k-rTNMF and k-TNMF have higher average NMI than rTNMF and TNMF, respectively. k-rTNMF has a significantly higher purity than other methods.

Figure 5:

Average ARI, NMI, purity and ACC of k-rTNMF, rTNMF, k-TNMF, TNMF, rGNMF, GNMF, rNMF, NMF across the 12 datasets.

4. Discussion

4.1. Visualization of meta-genes based UMAP and t-SNE

Both UMAP and t-SNE are well-known for their effectiveness in visualization. However, these methods may not perform as competitively in clustering or classification tasks. Therefore, it is beneficial to employ NMF-based methods to enhance the visualization capabilities of UMAP and t-SNE.

In this process, we generate meta-genes and subsequently utilize UMAP or t-SNE to further reduce the data to 2 dimensions for visualization. For a dataset with $N$ cells, the number of meta-genes is the integer part of $\sqrt{N}$. To compare the standard UMAP and t-SNE plots with the TNMF-assisted and rTNMF-assisted UMAP and t-SNE visualizations, we used the default settings of the Python implementation of UMAP and the Scikit-learn implementation of t-SNE. For unassisted UMAP and t-SNE, we first removed low-abundance genes and performed log-transformation before applying UMAP and t-SNE.
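A sketch of the assisted visualization, assuming the umap-learn package and scikit-learn's t-SNE with default settings, as described above; the function name is illustrative:

```python
import umap
from sklearn.manifold import TSNE

def assisted_embeddings(H):
    """Sketch: embed the (p, N) meta-gene representation H in 2-D for visualization."""
    cells = H.T                                    # one row per cell
    emb_umap = umap.UMAP().fit_transform(cells)    # default UMAP settings
    emb_tsne = TSNE(n_components=2).fit_transform(cells)
    return emb_umap, emb_tsne
```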

Figure 6 shows the visualization of PL regularized NMF methods through UMAP. Each row corresponds to GSE67835, GSE75748 time, GSE94820 and GSE84133 mouse 2 data. The columns from left to right are the k-rTNMF assisted UMAP, rTNMF assisted UMAP, k-TNMF assisted UMAP, TNMF assisted UMAP and UMAP visualization. Samples were colored according to their true cell types.

Figure 6:

Visualization of TNMF and rTNMF meta-genes through UMAP. Each row corresponds to GSE67835, GSE75748 time, GSE94820 and GSE84133 mouse 2 data. The columns from left to right are the k-rTNMF assisted UMAP, rTNMF assisted UMAP, k-TNMF assisted UMAP, TNMF assisted UMAP and unassisted UMAP visualization. Samples were colored according to their true cell types.

Figure 7 shows the visualization of PL regularized NMF through t-SNE. Each row corresponds to GSE67835, GSE75748 time, GSE94820 and GSE84133 mouse 2 data. The columns from left to right are the k-rTNMF assisted t-SNE, rTNMF assisted t-SNE, k-TNMF assisted t-SNE, TNMF assisted t-SNE and t-SNE visualization. Samples were colored according to their true cell types.

Figure 7:

Visualization of TNMF and rTNMF meta-genes through t-SNE. Each row corresponds to GSE67835, GSE75748 time, GSE94820 and GSE84133 mouse 2 data. The columns from left to right are the k-rTNMF assisted t-SNE, rTNMF assisted t-SNE, k-TNMF assisted t-SNE, TNMF assisted t-SNE and unassisted t-SNE visualization. Samples were colored according to their true cell types.

We see a considerable improvement in both TNMF assisted and rTNMF assisted UMAP and t-SNE visualization.

4.1.0.1. GSE67835

In the assisted UMAP and t-SNE visualizations of GSE67835, we observe more distinct clusters, including a supercluster of fetal quiescent (Fetal-Q) and fetal replicating (Fetal-R) cells. Darmanis et al. [55] conducted a study that involved obtaining differential gene expression data for human adult brain cells and sequencing fetal brain cells for comparison. It is not surprising that the undeveloped Fetal-Q and Fetal-R cells do not exhibit significant differences and cluster together.

4.1.0.2. GSE75748 time

In GSE75748 time data, Chu et al. [56] sequenced human embryonic stem cells at times 0hr, 12hr, 24hr, 36hr, 72hr, and 96hr under hypoxic conditions to observe differentiation. In unassisted UMAP and t-SNE, although some clustering is visible, there is no clear separation between the clusters. Additionally, two subclusters of 12hr cells are observed.

Notably, in the PL regularized assisted UMAP and t-SNE visualizations, there is a distinct supercluster comprising the 72hr and 96hr cells, while cells from different time points form their own separate clusters. This finding aligns with Chu’s observation that there was no significant difference between the 72hr and 96hr cells, suggesting that differentiation may have already occurred by the 72hr mark.

4.1.0.3. GSE94820

Notice that in both t-SNE and UMAP, although there is a boundary, the cells do not form distinct clusters. This lack of distinct clustering can pose challenges in many clustering and classification methods. On the other hand, all PL regularized NMF methods result in distinct clusters.

Among the PL regularized NMF approaches, the cutoff-based methods rTNMF and TNMF form a single CD1C+ cluster, whereas the k-NN induced methods k-rTNMF and k-TNMF exhibit two subclusters. Villani et al. [61] previously noted the similarity in the expression profiles of CD1C−CD141− (DoubleNeg) cells and monocytes. PL regularized NMF successfully differentiates between these two types.

4.1.0.4. GSE84133 mouse 2

PL-regularized NMF yields significantly more distinct clusters compared to unassisted UMAP and t-SNE. Notably, the beta and gamma cells form distinct clusters in PL regularized NMF. Additionally, when PL regularized NMF is applied to assist UMAP, potential outliers within the beta cell population become visible. Baron et al. [58] previously highlighted heterogeneity within the beta cell population, and we observe potential outliers in all visualizations.

4.2. RS analysis

Although UMAP and t-SNE are excellent tools for visualizing clusters, they may struggle to capture heterogeneity within clusters. Moreover, these methods can be less effective when dealing with a large number of classes. Therefore, it is essential to explore alternative visualization techniques.

In our approach, we visualize each cluster using RS plots [23]. RS plots depict the relationship between the residue score (R score) and similarity score (S score) and have proven useful in various applications for visualizing data with multiple class types [14, 62-65].

Let $\{(x_m, y_m) \mid x_m \in \mathbb{R}^N,\ y_m \in \mathbb{Z}_L,\ 1 \leq m \leq M\}$ be the data, where $x_m$ is the $m$-th sample and $y_m$ is its cell type or cluster label, and $L$ is the number of classes. The classes partition the data: $\mathcal{C}_l = \{x_m \in \mathcal{X} \mid y_m = l\}$ and $\bigcup_{l=0}^{L-1} \mathcal{C}_l = \mathcal{X}$.

The residue (R) score is defined via the inter-class sum of distances. For a given sample $x_m$ with assignment $y_m = l$, the R-score is defined as

$R_m = R(x_m) = \frac{1}{R_{\max}}\sum_{x_j \notin \mathcal{C}_l}\|x_m - x_j\|,$

where $R_{\max} = \max_{x_m \in \mathcal{X}} R_m$. The similarity (S) score is defined as the intra-class average of distances,

$S_m = S(x_m) = \frac{1}{|\mathcal{C}_l|}\sum_{x_j \in \mathcal{C}_l}\left(1 - \frac{\|x_m - x_j\|}{d_{\max}}\right),$

where $d_{\max} = \max_{x_i, x_j \in \mathcal{X}}\|x_i - x_j\|$ and $|\mathcal{C}_l|$ is the number of samples in class $\mathcal{C}_l$. Both $R_m$ and $S_m$ are bounded between 0 and 1, and larger values are better for a given dataset.

The class residue index (CRI) and the class similarity index (CSI) can then be defined as the class-wise averages of the R-scores and S-scores, that is, $\mathrm{CRI}_l = \frac{1}{|\mathcal{C}_l|}\sum_{x_m \in \mathcal{C}_l} R_m$ and $\mathrm{CSI}_l = \frac{1}{|\mathcal{C}_l|}\sum_{x_m \in \mathcal{C}_l} S_m$. Then, the residue index (RI) and the similarity index (SI) are defined as $\mathrm{RI} = \frac{1}{L}\sum_l \mathrm{CRI}_l$ and $\mathrm{SI} = \frac{1}{L}\sum_l \mathrm{CSI}_l$, respectively.

Using the RI and SI, the residue-similarity disparity can be computed as $\mathrm{RSD} = \mathrm{RI} - \mathrm{SI}$, and the residue-similarity index as $\mathrm{RSI} = 1 - |\mathrm{RI} - \mathrm{SI}|$.
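A sketch of the R-score and S-score computations defined above, assuming SciPy for the pairwise distances; the function name is illustrative:

```python
import numpy as np
from scipy.spatial.distance import cdist

def rs_scores(X, y):
    """Sketch of the R- and S-scores; X is (samples, features), y the class labels."""
    D = cdist(X, X)                    # pairwise distances
    d_max = D.max()
    # Inter-class sums of distances, normalized by their maximum (R_max)
    R = np.array([D[m, y != y[m]].sum() for m in range(len(y))])
    R /= R.max()
    # Intra-class average similarity
    S = np.array([(1.0 - D[m, y == y[m]] / d_max).mean() for m in range(len(y))])
    return R, S
```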

Figure 8 shows the RS plots of PL regularized NMF methods for GSE67835 data. The columns from left to right correspond to k-rTNMF, rTNMF, k-TNMF, and TNMF, while the rows correspond to the cell types. The x-axis and y-axis represent the S-score and R-score for each sample, respectively. The samples are colored according to their predicted cell types. Predictions were obtained using k-means clustering, and the Hungarian algorithm was employed to find the optimal mapping from the cluster labels to the true cell types.

Figure 8:

RS plots of GSE67835 data. The columns from left to right correspond to k-rTNMF, rTNMF, k-TNMF, and TNMF. Each row corresponds to a cell type. For each section, the x-axis and y-axis correspond to the S-score and R-score, respectively. k-means was used to obtain cluster labels, and the Hungarian algorithm was used to map the cluster labels to the true labels. Each sample was colored according to its true label.

We can see that TNMF fails to identify OPC cells, whereas k-rTNMF, rTNMF, and k-TNMF are able to identify them. Notably, the S-score is quite low, indicating that the OPC cells did not form a cluster for TNMF. For fetal quiescent and fetal replicating cells, k-rTNMF correctly identifies the two types, and the few misclassified samples are located on the boundaries. rTNMF is able to correctly identify fetal replicating cells but could not distinguish fetal quiescent cells from fetal replicating cells. The S-score is low for neurons in both rTNMF and TNMF, which correlates directly with the number of misclassified cells.

5. Conclusion

Persistent Laplacian regularized NMF is a dimensionality reduction technique that incorporates multiscale geometric interactions between cells. We have shown that PL regularized NMF methods, namely TNMF and rTNMF, improve on traditional graph Laplacian regularized NMF. Unlike standard graph regularization, which only captures the data at a single scale, the multiscale information obtained from PL is beneficial in scRNA-seq data analysis because PL can fine-tune the effect of cell-cell interactions at different scales. Moreover, the computational cost for a given hyperparameter is the same as that of GNMF and rGNMF. However, PL methods do come with a downside. In particular, the weights for each filtration must be determined prior to the reduction. If there are $T$ filtrations, then the hyperparameter space has size $(T+1)^T$. We have also introduced the k-NN induced PL, which shows comparable results to the cutoff-based PL methods and reduces the parameter space to $2^T$. We have noticed that taking the subset of the parameter space with $\sum_{t=1}^{8} \zeta_t = 4$ yielded consistently good results. Although the k-NN induced PL reduces the parameter space, we would like to further explore a possible parameter-free approach in the future. Additionally, in this work, we validated PL regularization in the context of unsupervised learning, and a potential extension is towards semi-supervised and supervised NMF. Lastly, a possible extension to the proposed methods is to incorporate higher-order PLs in the regularization framework, which would reveal higher-order interactions. In addition, we would like to extend these ideas to tensor decompositions, such as Canonical Polyadic Decomposition (CPD) and Tucker decomposition, multimodal omics data, and spatial transcriptomics data.

Acknowledgment

This work was supported in part by NIH grants R01GM126189, R01AI164266, and R35GM148196, National Science Foundation grants DMS2052983, DMS-1761320, and IIS-1900473, NASA grant 80NSSC21M0023, Michigan State University Research Foundation, and Bristol-Myers Squibb 65109.


6. Data availability and code

The data and model used to produce these results can be obtained at https://github.com/hozumiyu/TopologicalNMF-scRNAseq.

References

  • [1]. Lun Aaron TL, McCarthy Davis J, and Marioni John C. A step-by-step workflow for low-level analysis of single-cell rna-seq data with bioconductor. F1000Research, 2016.
  • [2]. Hwang Byungjin, Lee Ji Hyun, and Bang Duhee. Single-cell rna sequencing technologies and bioinformatics pipelines. Experimental & molecular medicine, 50(8):1–14, 2018.
  • [3]. Andrews Tallulah S, Kiselev Vladimir Yu, McCarthy Davis, and Hemberg Martin. Tutorial: guidelines for the computational analysis of single-cell rna sequencing data. Nature protocols, 16(1):1–9, 2021.
  • [4]. Luecken Malte D and Theis Fabian J. Current best practices in single-cell rna-seq analysis: a tutorial. Molecular systems biology, 15(6):e8746, 2019.
  • [5]. Chen Geng, Ning Baitang, and Shi Tieliu. Single-cell rna-seq technologies and related computational data analysis. Frontiers in genetics, page 317, 2019.
  • [6]. Petegrosso Raphael, Li Zhuliu, and Kuang Rui. Machine learning and statistical methods for clustering single-cell rna-sequencing data. Briefings in bioinformatics, 21(4):1209–1223, 2020.
  • [7]. Lähnemann David, Köster Johannes, Szczurek Ewa, McCarthy Davis J, Hicks Stephanie C, Robinson Mark D, Vallejos Catalina A, Campbell Kieran R, Beerenwinkel Niko, Mahfouz Ahmed, et al. Eleven grand challenges in single-cell data science. Genome biology, 21(1):1–35, 2020.
  • [8]. La Manno Gioele, Soldatov Ruslan, Zeisel Amit, Braun Emelie, Hochgerner Hannah, Petukhov Viktor, Lidschreiber Katja, Kastriti Maria E, Lönnerberg Peter, Furlan Alessandro, et al. Rna velocity of single cells. Nature, 560(7719):494–498, 2018.
  • [9]. Bergen Volker, Lange Marius, Peidli Stefan, Wolf F Alexander, and Theis Fabian J. Generalizing rna velocity to transient cell states through dynamical modeling. Nature biotechnology, 38(12):1408–1414, 2020.
  • [10]. Luecken Malte D, Büttner Maren, Chaichoompu Kridsadakorn, Danese Anna, Interlandi Marta, Müller Michaela F, Strobl Daniel C, Zappia Luke, Dugas Martin, Colomé-Tatché Maria, et al. Benchmarking atlas-level data integration in single-cell genomics. Nature methods, 19(1):41–50, 2022.
  • [11]. Stuart Tim, Butler Andrew, Hoffman Paul, Hafemeister Christoph, Papalexi Efthymia, Mauck William M, Hao Yuhan, Stoeckius Marlon, Smibert Peter, and Satija Rahul. Comprehensive integration of single-cell data. Cell, 177(7):1888–1902, 2019.
  • [12]. Dunteman George H. Principal components analysis, volume 69. Sage, 1989.
  • [13]. Jolliffe Ian T and Cadima Jorge. Principal component analysis: a review and recent developments. Philosophical transactions of the royal society A: Mathematical, Physical and Engineering Sciences, 374(2065):20150202, 2016.
  • [14]. Cottrell Sean, Wang Rui, and Wei Guowei. PLPCA: Persistent Laplacian enhanced-PCA for microarray data analysis. Journal of Chemical Information and Modeling, 10.1021/acs.jcim.3c01023, 2023.
  • [15]. Lounici Karim. Sparse principal component analysis with missing observations. In High Dimensional Probability VI: The Banff Volume, pages 327–356. Springer, 2013.
  • [16]. Zou Hui, Hastie Trevor, and Tibshirani Robert. Sparse principal component analysis. Journal of computational and graphical statistics, 15(2):265–286, 2006.
  • [17]. Townes F William, Hicks Stephanie C, Aryee Martin J, and Irizarry Rafael A. Feature selection and dimension reduction for single-cell rna-seq based on a multinomial model. Genome biology, 20:1–16, 2019.
  • [18]. McInnes Leland, Healy John, and Melville James. Umap: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426, 2018.
  • [19]. Hinton Geoffrey E and Roweis Sam. Stochastic neighbor embedding. Advances in neural information processing systems, 15, 2002.
  • [20]. Van der Maaten Laurens and Hinton Geoffrey. Visualizing data using t-sne. Journal of machine learning research, 9(11), 2008.
  • [21]. Kobak Dmitry and Linderman George C. Initialization is critical for preserving global data structure in both t-sne and umap. Nature biotechnology, 39(2):156–157, 2021.
  • [22]. Becht Etienne, McInnes Leland, Healy John, Dutertre Charles-Antoine, Kwok Immanuel WH, Ng Lai Guan, Ginhoux Florent, and Newell Evan W. Dimensionality reduction for visualizing single-cell data using umap. Nature biotechnology, 37(1):38–44, 2019.
  • [23]. Hozumi Yuta, Wang Rui, and Wei Guo-Wei. Ccp: correlated clustering and projection for dimensionality reduction. arXiv preprint arXiv:2206.04189, 2022.
  • [24]. Lee Daniel and Seung H Sebastian. Algorithms for non-negative matrix factorization. Advances in neural information processing systems, 13, 2000.
  • [25]. Wang Yu-Xiong and Zhang Yu-Jin. Nonnegative matrix factorization: A comprehensive review. IEEE Transactions on knowledge and data engineering, 25(6):1336–1353, 2012.
  • [26]. Liu Weixiang, Zheng Nanning, and You Qubo. Nonnegative matrix factorization and its applications in pattern recognition. Chinese Science Bulletin, 51:7–18, 2006.
  • [27]. Kong Deguang, Ding Chris, and Huang Heng. Robust nonnegative matrix factorization using l21-norm. In Proceedings of the 20th ACM international conference on Information and knowledge management, pages 673–682, 2011.
  • [28]. Xiao Qiu, Luo Jiawei, Liang Cheng, Cai Jie, and Ding Pingjian. A graph regularized non-negative matrix factorization method for identifying microrna-disease associations. Bioinformatics, 34(2):239–248, 2018.
  • [29]. Wu Peng, An Mo, Zou Hai-Ren, Zhong Cai-Ying, Wang Wei, and Wu Chang-Peng. A robust semi-supervised nmf model for single cell rna-seq data. PeerJ, 8:e10091, 2020.
  • [30]. Shu Zhenqiu, Long Qinghan, Zhang Luping, Yu Zhengtao, and Wu Xiao-Jun. Robust graph regularized nmf with dissimilarity and similarity constraints for scrna-seq data clustering. Journal of Chemical Information and Modeling, 62(23):6271–6286, 2022.
  • [31]. Lan Wei, Chen Jianwei, Chen Qingfeng, Liu Jin, Wang Jianxin, and Chen Yi-Ping Phoebe. Detecting cell type from single cell rna sequencing based on deep bi-stochastic graph regularized matrix factorization. bioRxiv, pages 2022–05, 2022.
  • [32]. Liu Jin-Xing, Wang Dong, Gao Ying-Lian, Zheng Chun-Hou, Shang Jun-Liang, Liu Feng, and Xu Yong. A joint-l2,1-norm-constraint-based semi-supervised feature extraction for rna-seq data analysis. Neurocomputing, 228:263–269, 2017.
  • [33]. Yu Na, Gao Ying-Lian, Liu Jin-Xing, Wang Juan, and Shang Junliang. Robust hypergraph regularized non-negative matrix factorization for sample clustering and feature selection in multi-view gene expression data. Human genomics, 13(1):1–10, 2019.
  • [34]. Chen Duan, Li Shaoyu, and Wang Xue. Geometric structure guided model and algorithms for complete deconvolution of gene expression data. Foundations of Data Science, 4(3):441–466, 2022.
  • [35]. Eckmann Beno. Harmonische funktionen und randwertaufgaben in einem komplex. Commentarii Mathematici Helvetici, 17(1):240–255, 1944.
  • [36]. Horak Danijela and Jost Jürgen. Spectra of combinatorial laplace operators on simplicial complexes. Advances in Mathematics, 244:303–336, 2013.
  • [37]. Chen Jiahui, Zhao Rundong, Tong Yiying, and Wei Guo-Wei. Evolutionary de rham-hodge method. Discrete and continuous dynamical systems. Series B, 26(7):3785, 2021.
  • [38]. Wang Rui, Nguyen Duc Duy, and Wei Guo-Wei. Persistent spectral graph. International journal for numerical methods in biomedical engineering, 36(9):e3376, 2020.
  • [39]. Mémoli Facundo, Wan Zhengchao, and Wang Yusu. Persistent laplacians: Properties, algorithms and implications. SIAM Journal on Mathematics of Data Science, 4(2):858–884, 2022.
  • [40]. Liu Jian, Li Jingyan, and Wu Jie. The algebraic stability for persistent laplacians. arXiv preprint arXiv:2302.03902, 2023.
  • [41]. Wei Xiaoqi and Wei Guo-Wei. Persistent sheaf laplacians. arXiv preprint arXiv:2112.10906, 2021.
  • [42]. Wang Rui and Wei Guo-Wei. Persistent path laplacian. Foundations of Data Science, 5:26–55, 2023.
  • [43]. Chen Dong, Liu Jian, Wu Jie, and Wei Guo-Wei. Persistent hyperdigraph homology and persistent hyperdigraph laplacians. Foundations of Data Science, doi: 10.3934/fods.2023010, 2023.
  • [44]. Wang Rui, Zhao Rundong, Ribando-Gros Emily, Chen Jiahui, Tong Yiying, and Wei Guo-Wei. Hermes: Persistent spectral graph software. Foundations of data science (Springfield, Mo.), 3(1):67, 2021.
  • [45]. Qiu Yuchi and Wei Guo-Wei. Persistent spectral theory-guided protein engineering. Nature Computational Science, 3(2):149–163, 2023.
  • [46]. Chen Jiahui, Qiu Yuchi, Wang Rui, and Wei Guo-Wei. Persistent laplacian projected omicron ba.4 and ba.5 to become new dominating variants. Computers in Biology and Medicine, 151:106262, 2022.
  • [47]. Meng Zhenyu and Xia Kelin. Persistent spectral–based machine learning (perspect ml) for protein-ligand binding affinity prediction. Science advances, 7(19):eabc5329, 2021.
  • [48]. Zomorodian Afra and Carlsson Gunnar. Computing persistent homology. In Proceedings of the twentieth annual symposium on Computational geometry, pages 347–356, 2004.
  • [49]. Edelsbrunner Herbert, Harer John, et al. Persistent homology-a survey. Contemporary mathematics, 453(26):257–282, 2008.
  • [50]. Cang Zixuan and Wei Guo-Wei. Topologynet: Topology based deep convolutional and multi-task neural networks for biomolecular property predictions. PLoS computational biology, 13(7):e1005690, 2017.
  • [51]. Hubert Lawrence and Arabie Phipps. Comparing partitions. Journal of classification, 2:193–218, 1985.
  • [52]. Vinh Nguyen Xuan, Epps Julien, and Bailey James. Information theoretic measures for clusterings comparison: is a correction for chance necessary? In Proceedings of the 26th annual international conference on machine learning, pages 1073–1080, 2009.
  • [53]. Crouse David F. On implementing 2d rectangular assignment algorithms. IEEE Transactions on Aerospace and Electronic Systems, 52(4):1679–1696, 2016.
  • [54]. Rama Rao KVSN and Josephine B Manjula. Exploring the impact of optimal clusters on cluster purity. In 2018 3rd International Conference on Communication and Electronics Systems (ICCES), pages 754–757. IEEE, 2018.
  • [55]. Darmanis Spyros, Sloan Steven A, Zhang Ye, Enge Martin, Caneda Christine, Shuer Lawrence M, Gephart Melanie G Hayden, Barres Ben A, and Quake Stephen R. A survey of human brain transcriptome diversity at the single cell level. Proceedings of the National Academy of Sciences, 112(23):7285–7290, 2015.
  • [56]. Chu Li-Fang, Leng Ning, Zhang Jue, Hou Zhonggang, Mamott Daniel, Vereide David T, Choi Jeea, Kendziorski Christina, Stewart Ron, and Thomson James A. Single-cell rna-seq reveals novel regulators of human embryonic stem cell differentiation to definitive endoderm. Genome biology, 17:1–20, 2016.
  • [57]. Gokce Ozgun, Stanley Geoffrey M, Treutlein Barbara, Neff Norma F, Camp J Gray, Malenka Robert C, Rothwell Patrick E, Fuccillo Marc V, Südhof Thomas C, and Quake Stephen R. Cellular taxonomy of the mouse striatum as revealed by single-cell rna-seq. Cell reports, 16(4):1126–1137, 2016.
  • [58]. Baron Maayan, Veres Adrian, Wolock Samuel L, Faust Aubrey L, Gaujoux Renaud, Vetere Amedeo, Ryu Jennifer Hyoje, Wagner Bridget K, Shen-Orr Shai S, Klein Allon M, et al. A single-cell transcriptomic map of the human and mouse pancreas reveals inter-and intra-cell population structure. Cell systems, 3(4):346–360, 2016.
  • [59]. Biase Fernando H, Cao Xiaoyi, and Zhong Sheng. Cell fate inclination within 2-cell and 4-cell mouse embryos revealed by single-cell rna sequencing. Genome research, 24(11):1787–1796, 2014.
  • [60]. Leng Ning, Chu Li-Fang, Barry Chris, Li Yuan, Choi Jeea, Li Xiaomao, Jiang Peng, Stewart Ron M, Thomson James A, and Kendziorski Christina. Oscope identifies oscillatory genes in unsynchronized single-cell rna-seq experiments. Nature methods, 12(10):947–950, 2015.
  • [61]. Villani Alexandra-Chloé, Satija Rahul, Reynolds Gary, Sarkizova Siranush, Shekhar Karthik, Fletcher James, Griesbeck Morgane, Butler Andrew, Zheng Shiwei, Lazo Suzan, et al. Single-cell rna-seq reveals new types of human blood dendritic cells, monocytes, and progenitors. Science, 356(6335):eaah4573, 2017.
  • [62]. Hozumi Yuta, Tanemura Kiyoto Aramis, and Wei Guo-Wei. Preprocessing of single cell rna sequencing data using correlated clustering and projection. Journal of Chemical Information and Modeling, 2023.
  • [63]. Feng Hongsong and Wei Guo-Wei. Virtual screening of drugbank database for herg blockers using topological laplacian-assisted ai models. Computers in biology and medicine, 153:106491, 2023.
  • [64]. Zhu Zailiang, Dou Bozheng, Cao Yukang, Jiang Jian, Zhu Yueying, Chen Dong, Feng Hongsong, Liu Jie, Zhang Bengong, Zhou Tianshou, et al. Tidal: Topology-inferred drug addiction learning. Journal of Chemical Information and Modeling, 63(5):1472–1489, 2023.
  • [65]. Shen Li, Feng Hongsong, Qiu Yuchi, and Wei Guo-Wei. Svsbi: sequence-based virtual screening of biomolecular interactions. Communications Biology, 6(1):536, 2023.
