GEOMETRIC STRUCTURE GUIDED MODEL AND ALGORITHMS FOR COMPLETE DECONVOLUTION OF GENE EXPRESSION DATA

DUAN CHEN; SHAOYU LI; XUE WANG

doi:10.3934/fods.2022013

Author manuscript; available in PMC: 2024 Jan 19.

Published in final edited form as: Found Data Sci. 2022 Sep;4(3):441–466. doi: 10.3934/fods.2022013


PMCID: PMC10798655 NIHMSID: NIHMS1908409 PMID: 38250319

Abstract

Complete deconvolution analysis for bulk RNA-seq data is important for distinguishing whether the differences in disease-associated GEPs (gene expression profiles) between tissues of patients and normal controls are due to changes in the cellular composition of tissue samples or due to GEP changes in specific cells. One of the major techniques for complete deconvolution is nonnegative matrix factorization (NMF), which also has a wide range of applications in the machine learning community. However, NMF is well known to be a strongly ill-posed problem, so a direct application of NMF to RNA-seq data suffers severe difficulties in the interpretability of solutions. In this paper, we develop an NMF-based mathematical model and corresponding computational algorithms to improve the solution identifiability of deconvoluting bulk RNA-seq data. In our approach, we combine the biological concept of marker genes with the solvability conditions of NMF theory and develop a geometric structure guided optimization model. In this strategy, the geometric structure of bulk tissue data is first explored by spectral clustering. Then, the identified marker gene information is integrated as solvability constraints, while the overall correlation graph is used as manifold regularization. Both synthetic and biological data are used to validate the proposed model and algorithms, and the results show significantly improved solution interpretability and accuracy.

Keywords: Nonnegative matrix factorization; data analysis; geometric structure; complete deconvolution; bulk RNA-seq data

Mathematics Subject Classification: Primary: 65F22, 65Z05; Secondary: 92B05

1. Introduction

Over the past decades, analysis of transcriptome or gene expression data has been an essential component in understanding the biomolecular processes involved in human development and disease [9, 30, 58, 27]. The complex nature of the bulk tissue samples under investigation remains a major obstacle [16, 56, 17]. A bulk tissue sample may include many cell types, and its heterogeneity complicates the interpretation of gene expression data (such as RNA-seq) [22, 49, 4]: for every gene, its measured gene expression profile (GEP) in a compound sample is actually tissue-averaged, i.e., the sum of its expression over all cells in the sample. On the other hand, the cellular composition of bulk samples varies, and relative cell subset proportions may differ greatly from one sample to another. So the GEP of low-abundance cell types can be masked by that of cell types with higher proportions. Consequently, it is challenging to determine whether an experimental or clinical treatment should target one particular gene or focus on investigating possible sources of varying cell types among samples. For example, Alzheimer’s disease is marked by amyloid-beta plaques and neurofibrillary tangles, along with neuronal loss and gliosis in the affected brain regions. Transcriptome-wide GEPs from brain tissue of patients and of neuropathologically normal controls are different. Such differences are critical for discovering genes and biological pathways that are perturbed in and/or lead to Alzheimer’s disease [3, 40, 42, 18]. Differential expression (DE) analysis is one of the important tools to unveil these differences. It can reveal novel insights into the genes and pathways, and is potentially helpful for identifying therapeutic drug targets.

However, a fundamental knowledge gap remains for DE analysis: whether disease-associated GEP changes in brain tissues are due to changes in the cellular composition of tissue samples or due to GEP changes in specific cells, e.g., central nervous system cells. It could be much more informative to study gene expression in specific cells, or to identify cell-intrinsic differentially expressed genes. But for many complex biological mixtures, exhaustive knowledge of individual cell types in brain tissues and their specific markers is lacking. Although single-cell RNA sequencing data can be used or serve as a reference, such approaches remain costly, cumbersome and limited in sample size [60, 15, 35].

In contrast, computational tools can be used to leverage widely available large-scale bulk tissue RNA-seq data sets [40, 18, 2, 34, 11]. The task of recovering both the cell-type-specific expression and the cell proportions from bulk data, as illustrated in Figure 1, is called complete deconvolution. In this approach, the expression of a gene in a tissue sample is assumed to be the linear combination of its expressions in the constituent cell types, weighted by the cell proportions, i.e.,

g_{i j} = \sum_{l = 1}^{k} p_{l j} c_{i l}, 1 \leq i \leq N, 1 \leq j \leq n,

(1)

where $g_{i j}$ and $c_{i l}$ are the GEPs of gene $i$ in the $j$-th sample and the $l$-th cell type, respectively, while $p_{l j}$ is the proportion of the $l$-th cell type in the $j$-th sample. For the total number of genes $N$, total number of samples $n$, and the number of cell types $k$, we usually have $N ≫ n > k$. In matrix form, Eq. (1) is represented as $G = CP$ with all matrix entries being non-negative. Given data $G$, both variables $C$ and $P$ are to be solved. Note that many deconvolution algorithms have been developed [51, 4, 41, 43, 46, 61, 26, 14, 1, 10, 49] for GEPs in bulk tissues, but their primary focus has been only on estimating the cellular composition with prior knowledge of cell-specific markers. This type of problem, solving for $P$ with known $C$, is called partial deconvolution, and can be performed with remarkable robustness and accuracy. However, in more realistic circumstances when little or no information about the underlying cell types is available, developing reliable complete deconvolution methods is still an open problem and only a handful of models have been established [31, 48, 57].
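As a minimal illustration of Eq. (1), the following Python/NumPy sketch builds a toy bulk matrix from hypothetical values of $C$ and $P$ and checks that the matrix product agrees with the entry-wise sum. The sizes and numbers are invented for illustration and are not taken from any data set.

```python
import numpy as np

# Hypothetical toy sizes: N = 4 genes, k = 2 cell types, n = 3 samples.
C = np.array([[5.0, 0.0],    # gene 0: expressed only in cell type 0
              [0.0, 3.0],    # gene 1: expressed only in cell type 1
              [2.0, 2.0],    # gene 2: expressed equally in both types
              [1.0, 4.0]])   # gene 3: mostly cell type 1
P = np.array([[0.2, 0.5, 0.9],   # proportion of cell type 0 in each sample
              [0.8, 0.5, 0.1]])  # proportion of cell type 1 in each sample

G = C @ P  # matrix form of Eq. (1): G = CP

# Entry-wise form of Eq. (1) for gene i = 3 in sample j = 0.
i, j = 3, 0
g_ij = sum(P[l, j] * C[i, l] for l in range(C.shape[1]))
print(G[i, j], g_ij)
```

Each column of `P` sums to one, so every bulk profile is a convex combination of the cell-type profiles stored in the columns of `C`.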

Figure 1. — Diagram of complete deconvolution of bulk tissue data

Mathematically, complete deconvolution can be formulated as a nonnegative matrix factorization (NMF) problem [13, 45, 37]. NMF has been studied extensively for various types of data in other fields, such as spectral unmixing in analytical chemistry [45], remote sensing [39], image processing [24], and topic mining in machine learning [59]. There is no obstacle at all if one is simply looking for some pair of factors $C$ and $P$. However, NMF is strongly ill-posed and solutions are generally not unique. Such non-uniqueness poses great challenges for solution interpretability: for RNA-seq data, different solutions represent different combinations of cell-type-specific GEPs and cell proportions in tissues. Meaningful interpretation of these biological quantities is critical to the subsequent DE analysis. There are a few guidelines to reduce such ill-posedness. As stated in [25], if the matrices $C$ and $P$ satisfy certain identifiability conditions (see Section 2 for details), the NMF problem can have a unique solution, subject to row/column scaling and permutation ambiguities. Applying these sufficient conditions depends on the specific properties of the data available in the corresponding research field. Successful methods in one field cannot be directly transplanted to another because of different data characteristics. Modeling the right NMF tool for the application at hand is essential [23]. On the biology side, the GEP data $G$ in bulk tissues include the expression of marker genes, or cell-type-specific genes, which are defined by their exclusive expression in only one component (cell type) of a cell mixture. These biological characteristics establish a connection to the mathematical theory of NMF. It is therefore possible to develop robust and accurate complete deconvolution algorithms for bulk tissue GEP data without prior information on GEPs in single cells.

The objective of the current work is to develop a mathematical model and computational algorithms to perform complete deconvolution of bulk RNA-seq data with reduced solution ambiguity and high interpretability. Our approach is based on the above-mentioned inherent characteristics of bulk tissue data and the theoretical foundation of the NMF problem. This goal is achieved by a structure-exploring and structure-inheriting strategy. To explore the structure, we first define a correlation distance among rows of data $G$ (considered as data points in $ℝ^{n}$) and generate the corresponding graph; then the spectral clustering technique is used to classify all points into $k$ (the assumed number of cell types) groups. Finally, marker genes of each cell type are identified by picking the most correlated points in each cluster. In the noiseless case, rows of $C$ can be understood as the coefficients of the data points with respect to the rows of $P$ as basis vectors. Thus, we expect the row space of $C$ to inherit the geometric structure of $G$, and we impose the weak identifiability condition on $C$ (to accommodate noise in real data). Additionally, manifold regularization is applied through the local invariance assumption [5, 8, 28]. Combining these approaches, we establish a structure guided non-convex optimization model to deconvolute bulk tissue RNA-seq data without prior information about marker genes. The proposed model is numerically solved under the framework of the alternating direction method of multipliers (ADMM) [21, 6], in which each variable can be solved one at a time in a two-fold iteration. The effectiveness and accuracy of the model and algorithms are tested on both synthetic and biological data. This work is motivated by various manifold regularization NMF models, such as [7, 47], but it has the following novel features: (1) Traditionally, Euclidean distance is used in the manifold assumption of the data space, and it results in a linear graph regularization term. The application to biological data, however, requires the correlation distance, from which a nonlinear graph regularizer is derived, and this poses great challenges in computation; (2) More importantly, the new model is equipped with a solvability constraint, which significantly improves solution identifiability from realistic noisy data.

The paper is organized as follows: Section 2 briefly reviews NMF and its identifiability conditions, and explains why these conditions are related to the biological problem. Section 3 presents the geometric structure guided complete deconvolution model, including the use of spectral clustering analysis to identify marker genes (finding structures) and the quantitative constraints in the optimization problem (preserving the structure). For the resulting non-convex learning model, an ADMM based algorithm is introduced in Section 4. As validation, Section 5 presents numerical results of the proposed model and algorithms for both synthetic and biological data. The paper ends with a conclusion in Section 6, where potential challenges of the work and possible future research directions are discussed. For convenience, some frequently used acronyms are listed in Table 1. Necessary definitions for NMF are summarized in the Appendix.

Table 1. Some frequently used acronyms

Acronym	Meaning
GEP	Gene expression profile
DE	Differential expression
NMF	Nonnegative matrix factorization
GS-NMF	Geometric structure constrained NMF
ADMM	Alternating direction method of multipliers
NDR	Noise to data ratio


2. NMF and its identifiability conditions

In this section, we briefly review notation and some theoretical foundations of NMF. Additional details and other necessary definitions for the NMF problem are included in the Appendix.

2.1. Notations

Throughout the paper, a bold lower-case letter, such as $x$ , represents a column vector with the appropriate dimension, and $| x |$ represents its $l_{2}$ norm. Vector $1$ is a column vector of some dimension with all entries being one. For a matrix $A$ , $∥ A ∥_{F}$ represents its Frobenius norm, $A_{(i)}$ and $A^{(j)}$ mean its $i$ -th row and $j$ -th column, respectively.

Let $G \in ℝ^{N \times n}$ with entry $g_{i j}$ being the expression of the $i$-th gene in the $j$-th sample; $C \in ℝ^{N \times k}$ with entry $c_{i j}$ being the reference expression of the $i$-th gene in the $j$-th cell type; and $P \in ℝ^{k \times n}$ with entry $p_{i j}$ being the proportion of the $i$-th cell type in the $j$-th sample. The dimensions satisfy $N ≫ \max (n, k)$. The following linear relation is assumed:

G = CP + ϵ,

(2)

where $ϵ$ is noise. The problem of complete deconvolution can be summarized as: given data $G \in ℝ^{N \times n}$ , solve

(C^{*}, P^{*}) = \underset{C \in ℝ_{+}^{N \times k}, P \in ℝ_{+}^{k \times n}}{\arg \min} δ (CP, G)

(3)

where $ℝ_{+}^{N \times k}$ or $ℝ_{+}^{k \times n}$ represent matrices with nonnegative entries and $δ (\cdot, \cdot)$ is a cost function. There are many choices for the cost function, depending on prior knowledge about the probability distribution of the data noise and susceptibility to outliers. For example, when Frobenius norm is used, the minimization process is considered as a maximum likelihood estimator for additive Gaussian noise. Meanwhile, I-divergence [38] cost function is equivalent to the Expectation Maximization (EM) algorithm and maximum likelihood when noises are Poisson processes [12]. If noises follow Laplace distribution, the cost function can be taken as row-wise or column-wise $l_{1}$ norms of the difference matrix [41]. As summarized in [54], other choices include Earth Mover’s distance metric, $α$ -divergence, $β$ -divergence, $γ$ -divergence, $φ$ -divergence, Bregman divergence, and $α$ - $β$ -divergence. The choice of cost function does not affect exploration of the geometric structures of data, so for simplicity, we consider $δ (CP, G) = \frac{1}{2} ∥ G - CP ∥_{F}^{2}$ and Gaussian noise in the current work.

2.2. Ill-posedness of NMF

Solving Eq. (3) for only $C$ (or $P$) with the other variable known (partial deconvolution) is simply a convex regression problem. But it is well-known that solving for both variables simultaneously is non-convex and NP-hard in general, and computational algorithms only converge to local minima or just stationary points [54]. Further, the NMF is ill-posed and the solution is not unique, or not identifiable: if $(C^{*}, P^{*})$ is a local minimum of (3), then for any invertible $Ω \in ℝ^{k \times k}$, $\hat{C} = C^{*} Ω$ and $\hat{P} = Ω^{-1} P^{*}$ are also solutions, as long as they remain non-negative. Non-uniqueness of the solution significantly impacts statistical analysis for decisions in biological applications. Therefore, it is important to restrict the search space of the variables to increase the identifiability of solutions and hence their interpretability. The uniqueness of the NMF solution is defined in the following sense [23]:

Definition 2.1 (Uniqueness of NMF solution).

The solution $(C^{*}, P^{*})$ of NMF (3) is unique, or identifiable, if and only if for any other solution $(\bar{C}, \bar{P})$, there exist a permutation matrix $Π \in \{0, 1\}^{k \times k}$ and a diagonal scaling matrix $S$ with positive diagonal entries such that

\bar{C} = C^{*} Π S \quad \text{and} \quad \bar{P} = S^{-1} Π^{⊤} P^{*}.

(4)
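The non-uniqueness described in Section 2.2 can be made concrete with a small numerical sketch. The matrices below are hypothetical toy values; `Omega` is an invertible matrix that is neither a permutation nor a diagonal scaling, so the second factorization is a genuinely different solution in the sense of Definition 2.1, not an equivalent one.

```python
import numpy as np

# Hypothetical nonnegative factors (N = 3 genes, k = 2 cell types, n = 3 samples).
C = np.array([[1.0, 2.0],
              [3.0, 1.0],
              [2.0, 2.0]])
P = np.array([[0.6, 0.5, 0.7],
              [0.4, 0.5, 0.3]])
G = C @ P

# An invertible mixing matrix that is not of the form (permutation x scaling).
Omega = np.array([[1.0, 0.1],
                  [0.0, 1.0]])
C_hat = C @ Omega
P_hat = np.linalg.inv(Omega) @ P

# Both pairs are nonnegative and reproduce the same data matrix G.
print(np.allclose(C_hat @ P_hat, G), (C_hat >= 0).all(), (P_hat >= 0).all())
```

In this toy case the columns of `P_hat` no longer sum to one, which hints at why additional constraints on the factors can reduce, though not necessarily remove, the ambiguity.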

Further, it is summarized in [54, 23] that this uniqueness can be achieved under certain circumstances [29, 19, 36, 25]:

Theorem 2.2 (Strong identifiability condition).

Assuming $k = \operatorname{rank}(G)$, $ϵ = 0$, if problem (3) admits a solution, for which both $C^{⊤}$ and $P$ are separable matrices, then the solution is unique.

Theorem 2.3 (Weak identifiability condition).

Assuming $k = \operatorname{rank}(G)$, $ϵ = 0$, if both $C^{⊤}$ and $P$ are sufficiently scattered, then problem (3) admits a unique solution.

More details of related definitions, such as the separability and sufficiently scattered properties of matrices, are included in the Appendix. Specifically, Figure 14 illustrates the geometric interpretation of the NMF problem and Figure 15 intuitively explains the strong and weak identifiability conditions of NMF required to achieve interpretable solutions. As explained there, the weak identifiability condition is much more relaxed than the strong one, so it is more suitable for realistic data with noise.

2.3. Relation to the gene expression data

There are several issues when applying the above NMF theories to RNA-seq data: (1) First of all and most importantly, how do these general identifiability conditions relate to the specific biological problem, i.e., bulk tissue RNA-seq data? (2) What measure should be used to define the “sufficient scattering” of matrix columns? Euclidean distance is used to illustrate the idea in Figure 15, but it is not practical for high dimensional data because column (row) normalization is needed for the convex hull description. (3) Theorems 2.2 and 2.3 can be used as variable constraints in the optimization problem, but the actual questions are on how many, and on which, columns (or rows) the constraints should be enforced. How can we obtain this information from the only available data $G$?

The concept of marker genes helps to address these issues. As the name suggests, marker genes of a certain cell type are dominantly expressed in that cell type and rarely expressed in others. Each cell type may have multiple marker genes, but each marker gene belongs to only one cell type. Mathematically, for each cell type $r = 1, 2, \dots, k$, there exists an index set $𝓢_{r}$ such that for any $i \in 𝓢_{r}$, the expression level $c_{i r}$ is the dominant entry (the only nonzero entry in the ideal noiseless case) in the $i$-th row of matrix $C$. This characteristic implies that $C^{⊤}$ is separable in the ideal case and that its columns are sufficiently scattered when noise is present. On the other hand, no structure can be assumed for matrix $P$. Although not all conditions in Theorem 2.2 or 2.3 are satisfied, reasonable solutions can be expected with constraints on variable $C$.

Then the question is how to identify marker genes (their index sets $𝓢_{r}$) from the data $G$. In the noiseless case, for a given $r = 1, 2, \dots, k$ and any $i \in 𝓢_{r}$, one has $C_{(i)} = α_{i} e_{r}^{⊤}$ by the definition of marker genes, where $e_{r}$ is the $r$-th unit basis vector in $ℝ^{k}$. As a consequence, the $i$-th row of $G$ is $G_{(i)} = α_{i} P_{(r)}$, and hence all rows of $G$ whose indices are from the same set $𝓢_{r}$ are scalar multiples of one another. In the practical scenario where noise is present, this linear dependence among vectors becomes strong correlation. Many studies [33, 41, 57, 4, 50] have confirmed this phenomenon: gene expressions across samples, i.e., rows of $G$, display strong correlation if they are from marker genes of the same cell type.
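The noiseless argument above can be checked on a toy example. In the sketch below, all numbers are hypothetical: rows 0 and 1 of `C` play the role of marker genes of cell type 0, row 2 is a marker gene of cell type 1, and row 3 is a non-marker gene. The helper `cosine_similarity` is an illustrative name, not part of any package.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

P = np.array([[0.2, 0.5, 0.9, 0.3],   # proportions of cell type 0
              [0.8, 0.5, 0.1, 0.7]])  # proportions of cell type 1
C = np.array([[4.0, 0.0],   # marker gene of cell type 0
              [7.0, 0.0],   # marker gene of cell type 0
              [0.0, 5.0],   # marker gene of cell type 1
              [2.0, 3.0]])  # non-marker gene
G = C @ P

same_type = cosine_similarity(G[0], G[1])    # markers of the same cell type
marker_vs_P = cosine_similarity(G[0], P[0])  # marker row versus P_(0)
cross_type = cosine_similarity(G[0], G[2])   # markers of different cell types
print(same_type, marker_vs_P, cross_type)
```

Rows of `G` belonging to markers of the same cell type are parallel to each other and to the corresponding row of `P`, whereas markers of different cell types are only as similar as the two proportion profiles are.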

3. Mathematical model

Bulk tissue RNA-seq data $G$ has richer structure than just $G_{(i)} \in \operatorname{cone}(P^{⊤}) \subseteq ℝ_{+}^{n}$: for marker genes of the same cell type, e.g., the $r$-th cell type, their expressions across samples are highly correlated with each other and with $P_{(r)}$. A cone view of this property is displayed in the left-top panel of Figure 2 for $k = 3$. It is easier to investigate this feature further from the convex hull view, in which each row of $G$ can be represented by a dot: as shown in the left-bottom panel of Figure 2, rows of $G$ for the marker genes of the $r$-th cell type tend to form a cluster around $P_{(r)}$ due to the strong correlation. This property suggests that it is possible to first identify marker genes from data $G$ by clustering its rows and to quantitatively explore the geometric structure of its row space. Further, note that $C_{(i)}$ is the coefficient vector of $G_{(i)}$ with respect to the rows of $P$ as basis vectors. The second step is then to transfer this geometric structure of $G$ to $C$, and hence to enforce the weak identifiability condition on variable $C$. These two steps are termed finding and preserving the structure, respectively, and are detailed as follows:

Figure 2. — RNA-seq data structure (left) and geometric constraints (right)

3.1. Finding geometric structures by spectral clustering analysis

In this step we classify $G_{(i)}$, $1 \leq i \leq N$, into $k$ groups to identify possible marker genes for the $k$ cell types. Among many existing clustering techniques, we propose to use spectral clustering [52], a manifold learning algorithm that can explore the intrinsic geometric/topological structure of high dimensional data. It therefore has fundamental advantages and very often outperforms traditional clustering algorithms such as $k$-means or single linkage. To perform spectral clustering, we need the similarity graph $G = (V, E)$, with vertex set $V = \{G_{(i)}\}_{i = 1}^{N} \subset ℝ^{n}$. The non-negative weights $ω_{i j}$ of edges $E = \{e_{i j}\}$ are calculated by a function $ℝ^{n} \times ℝ^{n} \to ℝ_{+}$, quantifying the correlation between two vertices. We propose to evaluate $ω_{i j}$ as

ω_{i j} = \exp \left\{ - \frac{d_{\mathrm{eisen}} (G_{(i)}, G_{(j)})^{2}}{σ} \right\}, \quad 1 \leq i \leq N, \; 1 \leq j \leq N,

(5)

where

d_{\mathrm{eisen}} (G_{(i)}, G_{(j)}) = 1 - \frac{\langle G_{(i)}, G_{(j)} \rangle}{| G_{(i)} | \, | G_{(j)} |}

(6)

is the Eisen cosine correlation distance and $σ > 0$ is a parameter. The matrix $W = (ω_{i j}) \in ℝ^{N \times N}$ is called the adjacency matrix. Its degree matrix is $D = \operatorname{diag}(d_{1}, d_{2}, \dots, d_{N})$, where $d_{i} = \sum_{j = 1}^{N} ω_{i j}$ is the degree of the vertex $G_{(i)}$. With these matrices, different types of graph Laplacians (gL) of $G$ can be defined, such as the unnormalized gL $L = D - W$, the symmetric normalized gL $L_{sym} = I - D^{-\frac{1}{2}} W D^{-\frac{1}{2}}$, or the random walk gL $L_{rw} = I - D^{-1} W$. By examining the first few eigenvectors of the gL, rows of data $G$ are clustered into $k$ groups, and the sets $\{𝓖_{r}\}_{r = 1}^{k}$ record the row indices of $G$ in the corresponding clusters. The choice among different gLs depends on the specific data application [47]. For our problem, we use the normalized gL $L_{sym}$ and the spectral clustering routine in Matlab.
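The clustering step can be sketched in Python/NumPy as follows. This is an illustrative stand-in, not the Matlab routine used in this work: the function names, the default `sigma`, the row normalization of the eigenvector matrix, and the small deterministic $k$-means with farthest-point initialization are all assumptions made to keep the example self-contained.

```python
import numpy as np

def eisen_distance_matrix(X):
    """Pairwise Eisen cosine distance, Eq. (6), between the rows of X."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    return 1.0 - Xn @ Xn.T

def spectral_embedding(X, k, sigma=0.1):
    """Rows of X mapped to the first k eigenvectors of L_sym."""
    d = eisen_distance_matrix(X)
    W = np.exp(-d**2 / sigma)                       # adjacency matrix, Eq. (5)
    deg = W.sum(axis=1)                             # vertex degrees d_i
    s = 1.0 / np.sqrt(deg)
    L_sym = np.eye(len(X)) - (s[:, None] * W) * s[None, :]
    _, vecs = np.linalg.eigh(L_sym)                 # eigenvalues in ascending order
    U = vecs[:, :k]
    # ASSUMPTION: rows are normalized to unit length before k-means.
    return U / np.linalg.norm(U, axis=1, keepdims=True)

def kmeans_farthest_init(U, k, n_iter=50):
    """Plain Lloyd iterations with a deterministic farthest-point start."""
    centers = [U[0]]
    for _ in range(1, k):
        dist = np.min([np.linalg.norm(U - c, axis=1) for c in centers], axis=0)
        centers.append(U[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(n_iter):
        gaps = np.linalg.norm(U[:, None, :] - centers[None, :, :], axis=2)
        labels = np.argmin(gaps, axis=1)
        for r in range(k):
            if np.any(labels == r):
                centers[r] = U[labels == r].mean(axis=0)
    return labels

# Toy data: rows 0, 2, 4 are multiples of a; rows 1, 3, 5 are multiples of b.
a = np.array([1.0, 0.0, 0.1])
b = np.array([0.0, 1.0, 0.1])
G = np.vstack([1.0 * a, 2.0 * b, 3.0 * a, 0.5 * b, 2.0 * a, 4.0 * b])
labels = kmeans_farthest_init(spectral_embedding(G, k=2), k=2)
print(labels)
```

Because the Eisen distance ignores the magnitude of each row, rows that are scalar multiples of one another receive weight $ω_{i j} = 1$ and end up in the same cluster regardless of their expression levels.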

3.2. Geometric structure guided model

With the clustering information, we are able to establish a geometric structure guided model by applying two constraints on the row space of variable $C$: the solvability constraint and manifold regularization. This work is motivated by manifold regularization, which uses the graph Laplacian as a regularizer, but it has new characteristics. The major novelty is to incorporate the identifiability conditions in the regularization of the NMF (primary constraint). Another new feature is to encode the geometric information of the data space (secondary constraint) based on the Eisen cosine correlation distance, instead of the Euclidean distance used in traditional graph regularized NMF.

3.2.1. Solvability constraint

According to Theorem 2.3, rows of $C$ need to be scattered sufficiently, such that the second-order cone in $ℝ_{+}^{k}$ is contained in $\operatorname{cone}(C^{⊤})$. Further, Definition A.2 implies that this requirement is only needed for some rows of $C$, i.e., those rows corresponding to marker genes. In the previous step, all rows of $G$ are clustered into $k$ groups and their row indices are recorded in the sets $\{𝓖_{r}\}_{r = 1}^{k}$. In the current step, correlations are ranked within each group, and a subset $𝓢_{r} \subset 𝓖_{r}$ is determined accordingly to represent the indices of marker genes for each cell type. Since row indices of both $G$ and $C$ represent gene IDs, we need the rows of $C$ indexed by $𝓢_{r}$ to scatter enough to accommodate the second-order cone. This can be done by requiring them to have strong correlations with $e_{r}^{⊤}$, i.e., by defining the penalty function:

𝓕_{1} (C) = \frac{λ_{1}}{2} \sum_{r = 1}^{k} \sum_{i \in 𝓢_{r}} d_{\mathrm{eisen}} (C_{(i)}, e_{r}^{⊤})^{2},

(7)

where $λ_{1}$ is a parameter. This idea is illustrated from the convex hull view in the right panel of Figure 2 for $k = 3$. Red, blue, and green dots represent rows of $C$ that are indexed by $𝓖_{r}$. The darker colored dots, representing selected marker genes in $𝓢_{r}$, are “required” to stay in the circular sectors (orange), such that $\operatorname{conv}(C^{⊤})$ (dashed purple) is large enough to contain the second-order cone (dashed circle). With the Eisen cosine correlation distance, we do not need to work in $\operatorname{conv}(C^{⊤})$ (which requires normalizing coefficients), but can work directly in $\operatorname{cone}(C^{⊤})$.

3.2.2. Manifold constraint

According to the local invariance assumption in manifold regularization [5, 8, 28], if two data points are close in the intrinsic geometry of the data distribution, then the representations of these two points in a new basis should also be close to each other under the same metric. Since $C_{(i)}$ is the representation of the data point $G_{(i)}$ under the basis $P^{⊤}$, this manifold assumption requires matrix $C$ to inherit the geometric structure of matrix $G$, i.e., rows of $C$ belonging to the same cluster should have strong mutual correlations. To achieve this goal, we define another penalty function:

𝓕_{2} (C) = \frac{λ_{2}}{2} \sum_{j = 1}^{N} \sum_{i = 1}^{N} ω_{i j} \, d_{\mathrm{eisen}} (C_{(i)}, C_{(j)})^{2} .

(8)

Recall that entry $ω_{i j} > 0$ in the adjacency matrix $W$ in (5) measures the correlations (larger value represents stronger correlation) between genes $i$ and $j$ in data $G$ .
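The two penalty functions in Eqs. (7) and (8) can be evaluated with a few lines of NumPy. The sketch below is illustrative only: the function names and the argument `marker_sets` (a list whose $r$-th entry holds the row indices in $𝓢_{r}$) are assumptions, and every row of $C$ is assumed to be nonzero so that the Eisen distance is well defined.

```python
import numpy as np

def eisen(u, v):
    """Eisen cosine correlation distance, Eq. (6)."""
    return 1.0 - float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def F1(C, marker_sets, lam1):
    """Solvability penalty, Eq. (7)."""
    k = C.shape[1]
    total = 0.0
    for r, S_r in enumerate(marker_sets):
        e_r = np.eye(k)[r]                      # r-th unit basis vector
        total += sum(eisen(C[i], e_r) ** 2 for i in S_r)
    return 0.5 * lam1 * total

def F2(C, W, lam2):
    """Manifold penalty, Eq. (8), with adjacency matrix W from Eq. (5)."""
    Cn = C / np.linalg.norm(C, axis=1, keepdims=True)
    D = 1.0 - Cn @ Cn.T                         # pairwise Eisen distances of rows
    return 0.5 * lam2 * float(np.sum(W * D**2))

# Hypothetical example: rows 0 and 1 are ideal markers, row 2 is not a marker.
C = np.array([[3.0, 0.0],
              [0.0, 2.0],
              [1.0, 1.0]])
W = np.ones((3, 3))
ideal = F1(C, marker_sets=[[0], [1]], lam1=1.0)     # markers aligned with e_r
swapped = F1(C, marker_sets=[[1], [0]], lam1=1.0)   # markers assigned to wrong type
manifold = F2(C, W, lam2=1.0)
print(ideal, swapped, manifold)
```

The penalty in Eq. (7) vanishes exactly when every selected marker row is a positive multiple of its unit vector $e_{r}^{⊤}$, and it grows as marker rows rotate away from that direction.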

Remark 1.

Equation (8) is a generalization of the traditional graph Laplacian regularization [7, 47]. Actually, if the Eisen cosine correlation distance in (8) is replaced by the Euclidean distance ( $l_{2}$ norm), then

𝓕_{2} (C) = \frac{λ_{2}}{2} \sum_{j = 1}^{N} \sum_{i = 1}^{N} ω_{i j} ∥ C_{(i)} - C_{(j)} ∥^{2} = λ_{2} \left[ \sum_{i = 1}^{N} d_{i} C_{(i)} C_{(i)}^{⊤} - \sum_{i = 1}^{N} \sum_{j = 1}^{N} ω_{i j} C_{(i)} C_{(j)}^{⊤} \right] = λ_{2} Tr (C^{⊤} L C),

with $L = D - W$ being the unnormalized graph Laplacian defined in Section 3.1.
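The Euclidean special case in Remark 1 can be verified numerically. The sketch below uses random hypothetical data with $λ_{2} = 1$ and a symmetric weight matrix, as is the case for the adjacency matrix in Eq. (5), and compares the pairwise sum with the trace form.

```python
import numpy as np

rng = np.random.default_rng(0)
N, k = 6, 3
C = rng.random((N, k))

W = rng.random((N, N))
W = 0.5 * (W + W.T)                 # symmetric weights, as in Eq. (5)
L = np.diag(W.sum(axis=1)) - W      # unnormalized graph Laplacian L = D - W

# Pairwise form: (1/2) * sum_ij w_ij * ||C_(i) - C_(j)||^2
pairwise = 0.5 * sum(W[i, j] * np.sum((C[i] - C[j]) ** 2)
                     for i in range(N) for j in range(N))
# Trace form: Tr(C^T L C)
trace_form = np.trace(C.T @ L @ C)
print(pairwise, trace_form)
```

The identity relies on the symmetry of $W$; the diagonal weights $ω_{i i}$ cancel between the degree term and the cross term and therefore do not affect the result.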

Remark 2.

The current work is more than a generalization of traditional graph Laplacian regularized NMF to the correlation distance metric. Indeed, the manifold assumption only requires $C_{(i)}$ and $C_{(j)}$ to be close if $G_{(i)}$ and $G_{(j)}$ are close in the distance metric, i.e., if genes $i$ and $j$ are classified as belonging to the same cell type. However, it does not require rows of $C$ to be far from each other if they correspond to different cell types. This issue is addressed by Eq. (7), and this is the major novelty of the current work.

Remark 3.

In the extreme case $λ_{1} \to \infty$, we have $C_{(i)} = α_{i} e_{r}^{⊤}$ if $i \in 𝓢_{r}$. This case corresponds to the strong identifiability condition. However, in realistic circumstances where noise is present, this extreme requirement does not provide the optimal results, as shown in the numerical simulations.

Remark 4.

The relation between constraints (7) and (8) can be explained by the right panel of Figure 2 for $k = 3$. All genes are classified into $k$ groups (red, blue, and green), indexed by $𝓖_{r}$, assuming there are $k$ types of cells in those tissue samples. In addition to the constraint that the darker colored dots (indexed by $𝓢_{r}$) stay close to $e_{r}^{⊤}$, dots of the same color need to stay close to each other, corresponding to the geometric structure of data $G$ shown in the left panel.

3.3. Full model

Combining the solvability constraint (7) and the manifold constraint (8), we propose the following geometric structure guided nonnegative matrix factorization (GS-NMF) model. For notational convenience, we define the set $T := \{Z \in ℝ_{+}^{k \times n} : 1^{⊤} Z = 1^{⊤}\}$ and the indicator function $𝟙_{T}$ as $𝟙_{T} (Z) = 0$ if $Z \in T$ and $𝟙_{T} (Z) = \infty$ otherwise. With this notation, solving for $C$ and $P$ becomes the optimization problem:

\min_{C \geq 0, P \geq 0} \frac{1}{2} ∥ G - CP ∥_{F}^{2} + 𝓕 (C) + 𝟙_{T} (P) .

(9)

In the first term, the Frobenius norm is used to measure the error between the deconvoluted solution and the given data. The total regularization function is $𝓕 (C) = 𝓕_{1} (C) + 𝓕_{2} (C)$, with the components defined in (7)-(8). The third term $𝟙_{T} (P)$ enforces the sum-to-one condition on the columns of $P$, or column stochasticity, since the cellular proportions in each tissue sample are supposed to sum to one.

4. Computational algorithms

It is well-known that the objective function in (9) is non-convex in both variables together. There are several types of numerical methods to obtain a local minimum, including the multiplicative update algorithm (MUA) [37], alternating nonnegativity constrained least squares (ANLS) [32], and the alternating direction method of multipliers (ADMM). The first two types of methods are straightforward for basic NMF problems, or for NMF with linear constraints. For the nonlinear constraints (7)-(8), however, it is more convenient to adopt the ADMM framework to develop numerical schemes.

To do so, we first rewrite model (9) as

\min_{C, P} \frac{1}{2} ∥ G - CP ∥_{F}^{2} + 𝓕 (C) + 𝟙_{S} (C) + 𝟙_{T} (P),

(10)

where $S : = ℝ_{+}^{N \times k}$ is the set of all nonnegative matrices of the size $N \times k$ . Then we introduce two auxiliary variables $A$ and $Q$ , and rewrite (10) into an equivalent form

\min_{C, P, A, Q} \frac{1}{2} ∥ G - CP ∥_{F}^{2} + 𝓕 (A) + 𝟙_{S} (C) + 𝟙_{T} (Q) s.t.: C - A = 0, P - Q = 0,

(11)

and the corresponding augmented Lagrangian is [55]

𝓛 = \frac{1}{2} ∥ G - CP ∥_{F}^{2} + 𝓕 (A) + 𝟙_{S} (C) + 𝟙_{T} (Q) + \frac{ρ}{2} ∥ C - A + \tilde{A} ∥_{F}^{2} + \frac{γ}{2} ∥ P - Q + \tilde{Q} ∥_{F}^{2}

(12)

where $\tilde{A}$ and $\tilde{Q}$ are the dual variables associated with $A$ and $Q$, respectively, while $ρ > 0$ and $γ > 0$ are penalty parameters. As a result, ADMM applied to (12) can be written as an iteration (from the $i$-th to the $(i + 1)$-th step) in its scaled form [6]:

\begin{aligned}
C^{i + 1} &:= \underset{C}{\arg \min} \; \frac{1}{2} ∥ G - C P^{i} ∥_{F}^{2} + \frac{ρ}{2} ∥ C - A^{i} + \tilde{A}^{i} ∥_{F}^{2} + 𝟙_{S} (C), \\
P^{i + 1} &:= \underset{P}{\arg \min} \; \frac{1}{2} ∥ G - C^{i} P ∥_{F}^{2} + \frac{γ}{2} ∥ P - Q^{i} + \tilde{Q}^{i} ∥_{F}^{2} + 𝟙_{T} (P), \\
A^{i + 1} &:= \underset{A}{\arg \min} \; 𝓕 (A) + \frac{ρ}{2} ∥ C^{i} - A + \tilde{A}^{i} ∥_{F}^{2}, \\
Q^{i + 1} &:= \underset{Q}{\arg \min} \; \frac{γ}{2} ∥ P^{i} - Q + \tilde{Q}^{i} ∥_{F}^{2}, \\
\tilde{A}^{i + 1} &:= \tilde{A}^{i} + C^{i} - A^{i}, \\
\tilde{Q}^{i + 1} &:= \tilde{Q}^{i} + P^{i} - Q^{i}.
\end{aligned}

(13)

Each variable in system (13) can be solved individually. Specifically, for the $C$ -subproblem, the Karush-Kuhn-Tucker (KKT) condition [44] yields a closed form for $C$ , i.e.,

C^{i + 1} = \left[ G (P^{i})^{⊤} + ρ (A^{i} - \tilde{A}^{i}) \right] \left( P^{i} (P^{i})^{⊤} + ρ I \right)^{-1},

(14)

where $P^{i} (P^{i})^{⊤} + ρ I$ is a small $k \times k$ matrix that can be inverted easily. The non-negativity of $C$ is enforced by a row-wise active set method. For the $P$-subproblem, the KKT condition gives

P^{i + 1} = Π \left\{ \left( (C^{i})^{⊤} C^{i} + γ I \right)^{-1} \left[ (C^{i})^{⊤} G + γ (Q^{i} - \tilde{Q}^{i}) \right] \right\},

(15)

and this is a small-scale problem, in which a $k \times k$ matrix is to be inverted with column-wise probability simplex projection $Π$ [53]. Solution of the $Q$ -subproblem is simply

Q^{i + 1} = \max \{ P^{i} + \tilde{Q}^{i}, 0 \} .

(16)
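The closed-form updates (14)-(16) can be sketched in NumPy as follows. The function names are illustrative assumptions. Two simplifications are made: `closed_form_C` returns only the unconstrained minimizer of Eq. (14), so the row-wise active set step that enforces non-negativity is not reproduced here, and the column-wise simplex projection is written with a standard sorting-based procedure, which may differ in detail from the implementation of [53].

```python
import numpy as np

def closed_form_C(G, P, A, A_dual, rho):
    """Unconstrained minimizer of the C-subproblem, Eq. (14)."""
    k = P.shape[0]
    rhs = G @ P.T + rho * (A - A_dual)
    M = P @ P.T + rho * np.eye(k)          # small k x k symmetric matrix
    # Solve C M = rhs through M C^T = rhs^T instead of forming the inverse.
    return np.linalg.solve(M, rhs.T).T

def project_columns_to_simplex(Y):
    """Euclidean projection of every column of Y onto the probability simplex."""
    k, n = Y.shape
    U = -np.sort(-Y, axis=0)                       # columns sorted in descending order
    css = np.cumsum(U, axis=0) - 1.0
    idx = np.arange(1, k + 1)[:, None]
    cond = U - css / idx > 0
    rho = k - np.argmax(cond[::-1], axis=0)        # last index (1-based) where cond holds
    tau = css[rho - 1, np.arange(n)] / rho         # per-column shift
    return np.maximum(Y - tau, 0.0)

def update_P(G, C, Q, Q_dual, gamma):
    """P-update, Eq. (15): regularized least squares followed by projection."""
    k = C.shape[1]
    M = C.T @ C + gamma * np.eye(k)
    Y = np.linalg.solve(M, C.T @ G + gamma * (Q - Q_dual))
    return project_columns_to_simplex(Y)

def update_Q(P, Q_dual):
    """Q-update, Eq. (16)."""
    return np.maximum(P + Q_dual, 0.0)

# Small check of the projection on hand-computable columns.
Y = np.array([[0.5, 2.0],
              [0.5, -1.0]])
print(project_columns_to_simplex(Y))
```

Solving the small $k \times k$ linear system with `np.linalg.solve` avoids forming the inverse explicitly, which is a common choice for numerical stability and does not change the result of Eqs. (14) and (15).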

The $A$ -subproblem involves the solvability condition (7) and manifold constraints (8), both of which are non-linear problems, so there is no closed form. To solve this subproblem we have to use the gradient descent method and make this step an inner iteration. Denote the total objective function of the $A$ -subproblem as

f (A) = 𝓕 (A) + \frac{ρ}{2} ∥ C - A + \tilde{A} ∥_{F}^{2},

(17)

then its gradient $\nabla f (A)$ can be computed from (7) and (8) accordingly. Note that $\nabla f (A)$ is nonlinear in $A$. For computational efficiency, we use the value of $C$ from the current outer iteration step, so the approximation $\nabla f (A) \approx \nabla f (C)$ is not updated in the inner loop. Algorithm 1 summarizes the entire GS-NMF procedure. Necessary raw data processing and spectral clustering steps are not included. The stopping criteria require $∥ C^{i + 1} - C^{i} ∥_{F} / ∥ C^{i} ∥_{F}$ and $∥ P^{i + 1} - P^{i} ∥_{F} / ∥ P^{i} ∥_{F}$ to be smaller than some tolerance.
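The basic building block of $\nabla f (A)$ is the gradient of a squared Eisen distance with respect to one row. The sketch below derives it from Eq. (6) by the chain rule and compares it against central finite differences. The helper names and the test vectors are hypothetical, and the full gradient of (17) would sum such terms over the index sets in (7) and the pairs in (8).

```python
import numpy as np

def eisen_sq(a, b):
    """Squared Eisen cosine distance d(a, b)^2 from Eq. (6)."""
    cos = float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return (1.0 - cos) ** 2

def grad_eisen_sq(a, b):
    """Gradient of d(a, b)^2 with respect to a."""
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    cos = float(a @ b) / (na * nb)
    grad_cos = b / (na * nb) - cos * a / na**2   # gradient of the cosine term
    return -2.0 * (1.0 - cos) * grad_cos         # chain rule on (1 - cos)^2

def finite_difference(f, a, h=1e-6):
    """Central finite-difference approximation of the gradient of f at a."""
    g = np.zeros_like(a)
    for m in range(a.size):
        step = np.zeros_like(a)
        step[m] = h
        g[m] = (f(a + step) - f(a - step)) / (2.0 * h)
    return g

a = np.array([1.0, 2.0, 0.5])      # a hypothetical row of A
e_0 = np.array([1.0, 0.0, 0.0])    # unit basis vector for cell type 0
analytic = grad_eisen_sq(a, e_0)
numeric = finite_difference(lambda x: eisen_sq(x, e_0), a)
print(analytic, numeric)
```

Since the Eisen distance is invariant to rescaling of its arguments, this gradient is orthogonal to the row itself, so a gradient step changes the direction of a row rather than its overall scale to first order.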

Algorithm 1.

Geometric structure constrained NMF (GS-NMF)

Require: Data $G$ , initial guesses $C_{0}$ , $P_{0}$ , structure identifier $C_{g}$ , graph adjacency matrix $W$ , tolerance $ϵ$ , parameters $λ_{1}$ , $λ_{2}$ , $ρ$ and $γ$ .
Ensure: Matrices $C$ and $P$ .
1:	for $i = 0, 1, \dots$ until the stopping criteria are satisfied do	% outer iteration
2:	Solve $C$ -subproblem in Eq. (13) by (14);
3:	Solve $P$ -subproblem in Eq. (13) by (15);
4:	for $m = 0, 1, \dots$ until the inner stopping criterion is satisfied do	% inner iteration
5:	Solve $A$ -subproblem in Eq. (13) through gradient descent method;
6:	end for
7:	Solve $Q$ -subproblem in Eq. (13) by (16);
8:	Set ${\tilde{A}}^{i + 1} : = {\tilde{A}}^{i} + C^{i} - A^{i}$ ;
9:	Set ${\tilde{Q}}^{i + 1} : = {\tilde{Q}}^{i} + P^{i} - Q^{i}$ .
10:	end for


5. Numerical results

In this section, we test the proposed GS-NMF algorithms on two types of data.

5.1. Simulations on synthetic data

In order to have flexibility in matrix dimensions and noise levels, together with a known ground truth, we first test the algorithms on synthetic data, which are generated as follows. Matrix $C \in ℝ_{+}^{N \times k}$ has a structure, so we first split $N = N_{1} + N_{2} + \dots + N_{k} + N_{k + 1}$ . In order to mimic the marker gene expression in the corresponding cells, we generate $N_{l}$ , $l = 1, 2, \dots, k$ , rows of $C$ such that they are strongly correlated with $e_{l}^{⊤}$ . The remaining $N_{k + 1}$ rows are non-marker genes, so they are generated randomly. Then all rows of $C$ are assembled and a random row permutation is performed. Matrix $P$ is a random $k \times n$ matrix with non-negative entries and columns normalized by their $l_{1}$ norms. Data $G$ is computed by $G = CP + ϵ$ , where $ϵ$ is a noise matrix following a normal distribution.
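The generation strategy above can be sketched as follows. The text does not specify the distributions, so several choices here are assumptions made only for illustration: marker rows are built as a dominant entry in column $l$ plus a small uniform "leak" (the hypothetical parameter `marker_leak`), non-marker rows and $P$ are uniform random, and the noise is i.i.d. Gaussian with standard deviation `noise_std`.

```python
import numpy as np

def make_synthetic(block_sizes, n, noise_std, marker_leak=0.05, seed=0):
    """block_sizes = (N_1, ..., N_k, N_{k+1}); returns G, C, P and the noise matrix."""
    rng = np.random.default_rng(seed)
    k = len(block_sizes) - 1
    blocks = []
    for l in range(k):
        # Marker genes of cell type l: rows close to a scaled unit vector e_l.
        block = marker_leak * rng.random((block_sizes[l], k))
        block[:, l] = 1.0 + rng.random(block_sizes[l])
        blocks.append(block)
    # Non-marker genes: unstructured random rows.
    blocks.append(rng.random((block_sizes[k], k)))
    C = np.vstack(blocks)
    C = C[rng.permutation(C.shape[0])]          # random row permutation
    P = rng.random((k, n))
    P /= P.sum(axis=0, keepdims=True)           # columns normalized in the l1 norm
    noise = noise_std * rng.standard_normal((C.shape[0], n))
    G = C @ P + noise
    return G, C, P, noise
```

For example, `make_synthetic((300, 300, 300, 300), n=30, noise_std=0.01)` mimics the setting $N_{1} = N_{2} = N_{3} = N_{4} = 300$ , $n = 30$ , $k = 3$ used in the experiments, with an arbitrary noise level.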

Figure 3 displays geometric structures of synthetic data under different settings. In this case, we take $n = 30$ , $k = 3$ , and various values of $N$ . We define the noise to data ratio (NDR) as $∥ ϵ ∥_{F} / ∥ C P ∥_{F}$ and consider different (low, medium, and high) noise levels in simulations. Eigenvectors of graph Laplacians of these data are computed, and they are used as coordinates to plot the $N$ points in Figure 3. Note that according to spectral clustering theory, only the second and third eigenvectors are needed (because $k = 3$ and the first one is almost a constant vector).

The first column of Figure 3, i.e. panels (1), (4), and (7), shows the plots for marker genes only ( $N_{1} = N_{2} = N_{3} = 300$ while $N_{4} = 0$ ) with low, medium, and high levels of noise. Three different colors represent the three clustered groups. It can be concluded from (1) that after clustering, all marker genes concentrate around the vertices of the ( $k - 1$ )-simplex since all the corresponding rows of $C$ are strongly correlated with some $e_{l}^{⊤}$ . With increasing noise, data points move away from the vertices. Keeping the same numbers of marker genes and low noise, Figure 3 (2), (5), (8) display data with $N_{4} = 300, 600, 900$ , and we can see that non-marker gene data only fill in the edges of the simplex. In contrast, Figure 3 (3), (6), and (9) show data of all genes ( $N_{1} = N_{2} = N_{3} = N_{4} = 300$ ) with low, medium and high levels of noise, respectively. It can be observed that strong noise fills in the interior of the simplex.

In order to show that the proposed constraints are important, we perform the NMF without constraints by simply setting $λ_{1} = λ_{2} = 0$ . Initial starting points $C_{0}$ and $P_{0}$ are chosen randomly, and Figure 4 shows that each initial condition leads to a different solution $C$ . In these experiments, the stopping tolerance is the same ( $10^{- 5}$ ), and the relative residues $∥ G - CP ∥_{F} / ∥ G ∥_{F}$ are the same and consistent with the NDR. Hence we can claim that approximate stationary points are reached, but none of them is even close to the ground truth. The different solutions are due to the ill-posedness of the original NMF model.

Figure 4. — Complete deconvolution without constraint. (i)-(iii): Comparisons of true and simulated matrices of $C$ with three different initial conditions.

Figure 5 presents computational results of the GS-NMF model on these synthetic data. Two sets of randomly generated matrices $C^{*}$ and $P^{*}$ , with different levels of noise ( $NDR = 0.071, 0.336$ ), are used to obtain data $G$ . Comparisons of the ground truth (blue) with the corresponding computational results (red) are displayed in the left and right panels of Figure 5. The first and second rows are for $C$ and $P$ , respectively. It can be seen that the solutions from the GS-NMF model are remarkably closer to the ground truth than those obtained without constraints.

Quantitative results can be found in Table 5.1, where relative errors (compared with the ground truth) of $C$ and $P$ , and relative residues $∥ G - CP ∥_{F} / ∥ G ∥_{F}$ , are displayed for different NDRs. As indicated by both Figure 5 and the table, errors in matrix $C$ increase more noticeably when more noise is present (larger NDR). In contrast, the computation of matrix $P$ seems less vulnerable to noise. We observe that the relative residues $∥ G - CP ∥_{F} / ∥ G ∥_{F}$ are already comparable to the NDR in all cases, which implies that pursuing even smaller residues in the cost functions is not necessary.
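The three quantities reported in the table can be computed as in the sketch below. The function names are illustrative, and the sketch assumes that the columns of the estimated $C$ (rows of the estimated $P$ ) have already been matched to the ordering of the ground truth, a step the text does not describe.

```python
import numpy as np

def noise_to_data_ratio(C_true, P_true, noise):
    """NDR = ||noise||_F / ||C P||_F."""
    return np.linalg.norm(noise) / np.linalg.norm(C_true @ P_true)

def relative_error(X_est, X_true):
    """Relative Frobenius error of an estimate against the ground truth."""
    return np.linalg.norm(X_est - X_true) / np.linalg.norm(X_true)

def relative_residue(G, C_est, P_est):
    """Relative residue ||G - C P||_F / ||G||_F of a computed factorization."""
    return np.linalg.norm(G - C_est @ P_est) / np.linalg.norm(G)
```

Note that the NDR is normalized by $∥ CP ∥_{F}$ while the residue is normalized by $∥ G ∥_{F}$ , so the two numbers are comparable but not expected to coincide exactly.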

We also compare GS-NMF with Linseed [57], a recently developed deconvolution tool, on two sets of synthetic data. Figure 6 shows comparisons of the computed cellular proportions ( $P$ ). Both methods can solve for $C$ and $P$ simultaneously from data $G$ . For some data, as shown in Fig. 6 (a), the simulation results are similar, while for other data, as in Fig. 6 (b), the proposed GS-NMF has better stability, as Linseed predicts an all-zero proportion for the second cell type.

5.2. Parameter discussion

There are five major parameters in models (5) and (13): the clustering parameter $σ$ , the ADMM penalty parameters $ρ$ and $γ$ , and the geometric constraint parameters $λ_{1}$ and $λ_{2}$ . As with other optimization algorithms, there is hardly any generic way to find “optimal” tuning parameters, and the choice of parameters depends on the data and the specific problem. Here we discuss parameter choices based on the synthetic data and provide some general insights. Parameter $σ$ will be discussed in Section 5.3. When choosing the ADMM parameters $ρ$ and $γ$ , it is well known that errors in $C$ and $P$ decrease for larger parameters, while too large values of $ρ$ and $γ$ introduce matrix singularity in the algorithm. Computational results in Table 5.1 are obtained with $ρ = 1.6 \times 10^{3}$ and $γ = 1.5 \times 10^{4}$ . Choosing the geometric constraint parameters is still an open problem. We follow the idea of other manifold constrained methods [47, 7, 55, 20] and perform a grid search with candidates evenly spaced over the interval. To do so, we simply take $λ_{1} = λ_{2} = λ$ and rescale to $\tilde{λ} = λ / ρ$ as it is defined in Eq. (13). Empirically, a smaller $\tilde{λ}$ means weaker constraints on the geometric structure of $C$ and hence could damage solution identifiability. On the other hand, the extreme case $\tilde{λ} \to \infty$ implies strong identifiability, which is not realistic when noise is present. Figure 7 displays relative errors in $C$ when parameters $ρ$ and $γ$ are fixed as above but $\tilde{λ}$ varies. It can be concluded that for both noise levels ( $NDR = 0.599$ and $0.071$ ), the change of errors against $\tilde{λ}$ is not monotone. For the testing data, $\tilde{λ} \approx 4$ or $5$ seems the best choice for computational accuracy. How to choose a reasonable parameter $\tilde{λ}$ for different data sets could be a future study.

Figure 7. — Relations between prediction errors (in $C$ ) and geometric constraint parameter $\tilde{λ}$ at two noise-to-data ratios (NDR).

5.3. Algorithm results on biological data

We also validate the proposed algorithms on realistic biological data from GSE19830 [50]. This data set was obtained from tissue samples of the brain, liver, and lung of a single rat using expression arrays (Affymetrix). Homogenates of these three types of tissues were mixed together at the cRNA homogenate level with known proportions, and then the gene expression pattern of every mixed sample was measured. The GSE19830 data set mimics the common scenario of heterogeneous biological samples, in which the relative frequency of the component subsets varies from one sample to another, and it has been used in the literature [31, 57] to validate computational algorithms. For this dataset, we know the number of cell types $k = 3$ and the number of tissue samples $n = 33$ . After the necessary data preprocessing to exclude obvious outliers (row norm, column norm, etc.), we take $N = 10,000$ out of $\sim 12,000$ total genes. Note that $k$ in the NMF is also the numerical rank of $G$ , so if the total number of cell types for a dataset is unknown a priori, it is possible to estimate it by investigating the distribution of the singular values of the original data matrix. For the GSE19830 data, we perform the singular value decomposition and display the singular values in Figure 8. Clearly the first three singular values of $G$ are dominant, which indicates three cell types.
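The rank-based estimate of the number of cell types can be sketched as follows. The text only says that the dominant singular values are inspected. Picking the position of the largest ratio between consecutive singular values is a heuristic assumed here for illustration, and the function name and the cap `max_k` are hypothetical.

```python
import numpy as np

def estimate_num_cell_types(G, max_k=10):
    """Guess k as the position of the largest gap in the singular value spectrum."""
    s = np.linalg.svd(G, compute_uv=False)      # singular values in descending order
    s = s[:max_k + 1]
    # Ratio s_j / s_{j+1}; the denominator is guarded against exact zeros.
    gaps = s[:-1] / np.maximum(s[1:], np.finfo(float).tiny)
    return int(np.argmax(gaps)) + 1, s
```

On real data the leading singular value of a non-negative matrix is often much larger than the rest, so the returned spectrum `s` should still be inspected visually, as done in Figure 8, rather than trusting the automatic choice alone.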

Figure 8. — Singular value distribution of GSE19830 dataset

Figure 9 displays the structure of data $G$ . Mutual correlations of the rows of $G$ , i.e. the gene expressions of the 10,000 genes in the 33 samples, are computed and shown as heat maps, before (left) and after (right) clustering/permutation. Clearly, there are three clusters, and gene expressions are strongly correlated within each clustered group. Based on this clustering and the evaluation of correlations, marker genes can be identified computationally.

Figures 10–11 show some clustering details. Since the clustering is based on eigenvectors of the graph Laplacian of data $G$ , we plot the first three eigenvectors of $L$ , using them as the $x$ -, $y$ -, and $z$ -coordinates of the $N$ dots, respectively, in Figure 10 (a). Different colors represent the clustered groups. The first eigenvector is almost a constant vector, so it is convenient to display only the 2D data in the remaining figures, as in Figure 10 (b). All the dots are distributed in a ( $k - 1$ )-simplex, and each cluster is identified with one of its vertices.

Figure 10. — Spectral clustering of GSE19830 data.

Notice that exploring the data structure depends on the parameter $σ$ in Eq. (5). Figure 11 shows the data distribution with $σ = 1, 0.5$ , and $0.2$ . From Eq. (5) we see that the connectivity of any two vertices in the graph increases for larger values of $σ$ . This feature is displayed in Figure 11 (a), (b) and (c): when $σ = 1$ , all data points are distributed almost within a circle and the clustering is not pronounced. On the other hand, when $σ = 0.2$ , the three vertices of the triangle naturally define the three clusters. In our experiments, the graph loses the majority of its connectivity for even smaller values, so $σ = 0.2$ is used for all the simulations. To determine marker genes, we pick a subset $𝓢_{r}$ from each colored group, consisting of the points that are closest to the corresponding vertex. With $σ = 0.2$ , each set $𝓢_{r}$ , $r = 1, 2, 3$ , contains 1,000 entries, and the corresponding data points are shown in Figure 11 (d).
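The role of $σ$ can be illustrated with a small spectral embedding sketch. Eq. (5) is not reproduced in this section, so the graph weights used below, a Gaussian kernel $\exp (- d_{ij}^{2} / σ^{2})$ on Euclidean distances between rows of a data matrix `X`, together with the unnormalized Laplacian, are assumptions and may differ from the construction actually used in the paper.

```python
import numpy as np

def spectral_embedding(X, sigma, n_vectors=3):
    """Leading eigenpairs of the unnormalized Laplacian of a Gaussian similarity graph."""
    sq = (X ** 2).sum(axis=1)
    # Pairwise squared Euclidean distances, clipped at zero against rounding errors.
    D2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    W = np.exp(-D2 / sigma ** 2)                # assumed kernel; Eq. (5) may differ
    np.fill_diagonal(W, 0.0)                    # no self-loops
    L = np.diag(W.sum(axis=1)) - W              # graph Laplacian L = D - W
    eigvals, eigvecs = np.linalg.eigh(L)        # eigenvalues in ascending order
    return eigvals[:n_vectors], eigvecs[:, :n_vectors], W
```

With this kernel every weight grows with $σ$ , which matches the statement that connectivity increases for larger $σ$ . For a connected graph the first eigenvector is constant, and the second and third columns of the returned eigenvectors give 2D coordinates of the kind plotted in Figures 10 (b) and 11.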

With these parameter choices and $\tilde{λ} = 0.6$ , we apply the proposed algorithms to data set GSE19830. Figure 12 shows the comparison between the computed (red) cellular composition (liver, brain, lung, from top to bottom) in bulk tissue samples and the ground truth (blue). In the 33 samples, 11 different cellular compositions were used and each of them was replicated three times. The computational results reproduce this pattern. Additionally, the simulated cellular proportions fit the ground truth fairly well, especially for the third cell type.

5.4. Model discussion and future work for more complicated data

Quantitatively, the correlations between the computational results (red) and the ground truth (blue) in Figure 12 are 0.9916, 0.9916 and 0.9997 for the three cell types, which shows promising algorithm performance on biological data. On the other hand, the predictions systematically underestimate the proportion of liver cells and overestimate that of brain cells in the samples. This indicates that the simulation $\tilde{P}$ and the ground truth $P$ differ merely by a scaling factor, i.e. $\tilde{P} = diag (s_{1}, s_{2}, s_{3}) P$ . This phenomenon is mainly due to the inherent characteristics of the NMF problem and the definition of its solution uniqueness in Eq. (4). Further, even for such a definition of a unique solution, one needs identifiability conditions on both $C$ and $P$ , as demonstrated in Theorems 2.2 and 2.3. In reality, however, not all of these assumptions on both variables are satisfied (in this application we only have them on $C$ ). Given this issue, one can expect other challenges in the NMF framework for more complicated data. (i) One challenge could be significant heterogeneity in cellular proportions in samples, i.e., a certain cell type may outnumber the others across tissue samples. Mathematically, this means large differences between the magnitudes of the rows of $P$ . Figure 13 shows such a scenario: we designed a set of synthetic data where the proportion of one cell type in tissues is significantly smaller than the others. It can be seen in the third panel of Figure 13 (a) that our algorithms clearly overestimate the corresponding proportion. With the additional assumption that gene expressions within cells (matrix $C$ ) are not significantly different from each other, we computed row norms of data $G$ for the corresponding marker genes and used them as rescaling factors during the iteration. This modified simulation provides much more accurate results, as displayed in Figure 13 (b).
However, we do not know how well this assumption will hold in realistic applications. (ii) Another challenge may arise when the proportions of different cell types in tissue samples are highly similar. This situation leads to strong correlations between rows of $P$ and hence destroys the geometric structure of bulk tissue RNA-seq data: two vertices of the ( $k - 1$ )-simplex, as in the left panel of Figure 2, may overlap if two rows of $P$ are strongly correlated. In summary, the novelty of the current work is to improve the solution identifiability of the NMF model by utilizing only data $G$ and the concept of marker genes. To improve solution quality, a future research direction within this framework is to address the above-mentioned challenges by combining the necessary biological information for various complicated data.
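The scaling ambiguity $\tilde{P} = diag (s_{1}, s_{2}, s_{3}) P$ discussed above can be checked numerically. In the sketch below the matrices are random and the scaling factors in `s` are hypothetical values chosen only for illustration. It shows that a row rescaling of $P$ , compensated by a column rescaling of $C$ , reproduces the same data $G$ , and that row-wise correlations with the ground truth stay equal to one even though the proportions themselves are biased.

```python
import numpy as np

def row_correlations(P_est, P_true):
    """Pearson correlation between matching rows of two k x n matrices."""
    A = P_est - P_est.mean(axis=1, keepdims=True)
    B = P_true - P_true.mean(axis=1, keepdims=True)
    return (A * B).sum(axis=1) / (np.linalg.norm(A, axis=1) * np.linalg.norm(B, axis=1))

rng = np.random.default_rng(0)
C = rng.random((100, 3))
P = rng.random((3, 33))
s = np.array([0.7, 1.4, 1.0])                   # hypothetical factors s_1, s_2, s_3

P_tilde = np.diag(s) @ P                        # rescaled rows of P
C_tilde = C @ np.diag(1.0 / s)                  # compensating rescaling of columns of C

same_data = np.allclose(C @ P, C_tilde @ P_tilde)   # both pairs factorize the same G
corr = row_correlations(P_tilde, P)                 # correlation is scale invariant
```

This is consistent with the observation that the correlations in Figure 12 are close to one while individual proportions are systematically under- or overestimated.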

Figure 13. — Comparison of $P$ and $\tilde{P}$ : Algorithm performances on data $G$ generated by $P$ containing a very small row. (a) Original GS-NMF algorithm; (b) Algorithm with rescaling factors.

6. Conclusions

In this paper we develop a robust mathematical model and corresponding computational algorithms for complete data deconvolution. The major technique is nonnegative matrix factorization (NMF), which has a wide range of applications in the machine learning community. However, the NMF is a well-known strongly ill-posed problem, so a direct application of it to RNA-seq data suffers severe difficulties in the interpretability of solutions. To address this issue, we leverage the biological concept of marker genes, combine it with the solvability conditions of the NMF theories, and hence develop a geometric structure guided optimization model. In this approach, the geometric structure of bulk tissue data is first explored by the spectral clustering technique. In this step, a correlation graph among GEPs across tissue samples is established and, more importantly, marker genes for each cell type are identified. Then, the information of marker genes is integrated as solvability constraints, while the overall correlation graph is used as manifold regularization. The resulting non-convex optimization problem, termed the geometric structured nonnegative matrix factorization (GS-NMF) model, is solved numerically under the framework of the alternating direction method of multipliers (ADMM). Finally, synthetic and biological data are used to validate the proposed model and algorithms. With this novel method, solution interpretability is significantly improved and accuracy is satisfactory compared with the ground truth for both types of data. It is worthwhile to note that all simulation results may still differ from the ground truth by a linear scaling factor. Unfortunately, this has nothing to do with marker gene selection, parameter choices, or algorithm accuracy, but is due to the inherent definition of NMF solution uniqueness. In future research, we will incorporate the necessary biological information in realistic applications to reduce this scaling ambiguity as much as possible.

Table 2.

Quantitative results of the GS-NMF with different noise to data ratios (NDR).

NDR	Errors in $C$	Errors in $P$	Relative residue
0.071	0.0901	0.0444	0.0693
0.162	0.1007	0.0457	0.1543
0.336	0.1372	0.0545	0.3024
0.599	0.1667	0.0569	0.4888


A. Appendix

We summarize some theoretical concepts of the NMF model as follows:

A.1. Geometric interpretation.

In order to understand the strong and weak conditions of uniqueness, we review the NMF problem from the perspective of geometric structures. For $A \in ℝ^{m \times n}$ , the notation $\mathrm{cone} (A)$ denotes the convex cone generated by the columns of $A$ , i.e.

\mathrm{cone} (A) = \{ x \in ℝ^{m} ∣ x = A θ \text{ for some } θ \in ℝ^{n}, θ \geq 0 \},

(18)

and the convex hull of $A$ is defined as

\mathrm{conv} (A) = \{ x \in ℝ^{m} ∣ x = A θ \text{ for } θ \in ℝ^{n}, θ \geq 0, \text{ and } 1^{⊤} θ = 1 \} .

(19)

Note that we illustrate the geometric structure with $G^{⊤} = P^{⊤} C^{⊤}$ because gene expressions across tissue samples, i.e., columns of $G^{⊤}$ (rows of $G$ ), are the data features of primary interest. By the non-negativity of $C$ and $P$ , we have

G_{(i)} \in \mathrm{cone} (P^{⊤}) \subseteq ℝ_{+}^{n}, \quad 1 \leq i \leq N,

(20)

and equivalently $\mathrm{cone} (G^{⊤}) \subseteq \mathrm{cone} (P^{⊤}) \subseteq ℝ_{+}^{n}$ . Problem (3) can then be interpreted as a nested cone problem: given two nested cones, $\mathrm{cone} (G^{⊤})$ and $ℝ_{+}^{n}$ , find a cone $\mathrm{cone} (P^{⊤})$ nested between them. This interpretation is displayed in the left panel of Figure 14 with $N = 8$ and $k = n = 3$ . The idea is easier to interpret in the convex hull view shown in the right panel, which has one dimension less than the cone view. This view is obtained by normalizing the columns of $G^{⊤}$ and $P^{⊤}$ to unit $l_{1}$ norm. From the right panel of Figure 14, we can also see that the solution $\mathrm{conv} (P^{⊤})$ is not unique: data $G$ can be contained within two different cones (red solid and green dashed triangles) formed by different choices of matrix $P$ . It also motivates the idea that if the rows of $G$ “spread out” enough in the nonnegative orthant, $\mathrm{cone} (P^{⊤})$ or $\mathrm{conv} (P^{⊤})$ may be unique. This condition leads to the following theory on the identifiability of the NMF.
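The passage from the cone view to the convex hull view can be made concrete with a short sketch. Here a generic non-negative matrix `A` plays the role of $P^{⊤}$ , the vector `theta` plays the role of a row of $C$ , and `x = A @ theta` is a row of the noise-free $G$ . The function name is illustrative. After $l_{1}$ normalization, the conic weights turn into convex weights that sum to one.

```python
import numpy as np

def to_convex_hull_view(A, theta):
    """Map x = A @ theta (cone view) to normalized columns and simplex weights."""
    x = A @ theta
    col_norms = A.sum(axis=0)                   # l1 norms of the non-negative columns
    A_unit = A / col_norms                      # generators with unit l1 norm
    x_unit = x / x.sum()                        # data point with unit l1 norm
    weights = theta * col_norms / x.sum()       # non-negative and summing to one
    return A_unit, x_unit, weights
```

The identity `A_unit @ weights == x_unit` (up to rounding) is what allows the nested cone problem to be drawn one dimension lower, as in the right panel of Figure 14.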

Figure 14. — Cone (left) and convex hull (right) views of NMF as a nested cone problem.

A.2. Strong and weak conditions on identifiability:

There are two types of conditions for unique solution of problem (3) when $ϵ = 0$ . The strong and weak conditions are illustrated in Section 2, in which the definitions for separable and sufficiently scattered matrices are listed as follows:

Definition A.1.

The matrix $A \in ℝ_{+}^{m \times n}$ is separable if $\mathrm{cone} (A) = ℝ_{+}^{m}$ .

Definition A.2.

The matrix $A \in ℝ_{+}^{m \times n}$ is sufficiently scattered if: (i) the second-order cone in $ℝ_{+}^{m}$ is contained in $\mathrm{cone} (A)$ , i.e. $𝓒 = \{ x \in ℝ_{+}^{m} ∣ e^{⊤} x \geq \sqrt{m - 1} ∥ x ∥_{2} \} \subseteq \mathrm{cone} (A)$ ; and (ii) there does not exist any orthogonal matrix $Q$ such that $\mathrm{cone} (A) \subseteq \mathrm{cone} (Q)$ , except for permutation matrices.

Theorem 2.2 is quite restrictive: being separable means that $P$ and $C^{⊤}$ must contain (scaled) extreme rays of the nonnegative orthant in the corresponding space, i.e., for every $r = 1, 2, \dots, k$ there exists a column index $l_{r}$ such that $P^{(l_{r})} = α_{r} e_{r} \in ℝ_{+}^{k}$ , where $α_{r}$ is a scalar. This condition is illustrated in the left panel of Figure 15 for $k = 3$ . Note that this is the convex hull view, so columns of $P$ are represented as blue dots in the unit simplex (red triangle) in $ℝ_{+}^{3}$ . As shown in the figure, some columns of $P$ are required to align exactly with the unit vectors $e_{r}$ , $r = 1, 2, 3$ (overlapping with the red dots). A similar situation holds for matrix $C^{⊤}$ . Such assumptions on both variables are too strong for practical applications, especially when noise is present.

On the other hand, Theorem 2.3 is much more relaxed. The right panel of Figure 15 illustrates this condition: the dashed circle represents the intersection of the second-order cone in $ℝ_{+}^{3}$ and the unit simplex. All blue objects, including dots and pentagons, represent columns of $P$ . In this case, none of the columns is required to coincide with $e_{r}$ , but some of them (pentagons) need to fall outside the circle (second-order cone).
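Two ingredients of these conditions are easy to test numerically. The sketch below, with illustrative function names, checks whether a vector lies in the second-order cone of Definition A.2 (i) and whether a matrix contains a positive multiple of every unit vector among its columns, as required in the discussion of Theorem 2.2. It does not verify the full sufficiently scattered condition: neither the cone inclusion in (i) nor part (ii) of Definition A.2 is checked.

```python
import numpy as np

def in_second_order_cone(x):
    """Test e^T x >= sqrt(m - 1) * ||x||_2 for a vector x in R^m."""
    m = x.size
    return bool(x.sum() >= np.sqrt(m - 1) * np.linalg.norm(x))

def has_scaled_unit_columns(P, tol=1e-12):
    """True if, for every r, some column of P is a positive multiple of e_r (k >= 2)."""
    k = P.shape[0]
    for r in range(k):
        others = np.delete(P, r, axis=0)        # all rows except row r
        pure = (P[r] > tol) & (np.abs(others).max(axis=0) <= tol)
        if not pure.any():
            return False
    return True
```

For $k = 3$ , the center of the simplex lies inside the second-order cone while the vertices $e_{r}$ lie outside it, which matches the picture of a circle inscribed in the triangle in Figure 15.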

Contributor Information

DUAN CHEN, Department of Mathematics and Statistics School of Data Science University of North Carolina at Charlotte, USA.

SHAOYU LI, Department of Mathematics and Statistics University of North Carolina at Charlotte, USA.

XUE WANG, Department of Quantitative Health Sciences Mayo Clinic, Florida, 32224, USA.

REFERENCES

[1] Abbas AR, Wolslegel K, Seshasayee D, Modrusan Z. and Clark HF, Deconvolution of blood microarray data identifies cellular activation patterns in systemic lupus erythematosus, PloS One, 4 (2009), e6098.
[2] Allen M, Carrasquillo MM, Funk C, Heavner BD, Zou F, Younkin CS, Burgess JD, Chai H-S, Crook J, Eddy JA, et al., Human whole genome genotype and transcriptome data for Alzheimer’s and other neurodegenerative diseases, Scientific Data, 3 (2016), 160089.
[3] Allen M, Wang X, Burgess JD, Watzlawik J, Serie DJ, Younkin CS, Nguyen T, Malphrus KG, Lincoln S, Carrasquillo MM, et al., Conserved brain myelination networks are altered in Alzheimer’s and other neurodegenerative diseases, Alzheimer’s & Dementia, 14 (2018), 352–366.
[4] Avila Cobos F, Vandesompele J, Mestdagh P. and De Preter K, Computational deconvolution of transcriptomics data from mixed cell populations, Bioinformatics, 34 (2018), 1969–1979.
[5] Belkin M. and Niyogi P, Laplacian eigenmaps and spectral techniques for embedding and clustering, in Advances in Neural Information Processing Systems, (2002), 585–591.
[6] Boyd S, Parikh N, Chu E, et al., Distributed Optimization and Statistical Learning Via the Alternating Direction Method of Multipliers, Now Publishers Inc, 2011.
[7] Cai D, He X, Han J. and Huang TS, Graph regularized nonnegative matrix factorization for data representation, IEEE Transactions on Pattern Analysis and Machine Intelligence, 33 (2010), 1548–1560.
[8] Cai D, Wang X. and He X, Probabilistic dyadic data analysis with local and global consistency, in Proceedings of the 26th Annual International Conference on Machine Learning, (2009), 105–112.
[9] Cang Z. and Nie Q, Inferring spatial and signaling relationships between cells from single cell transcriptomic data, Nature Communications, 11 (2020), 1–13.
[10] Gaujoux R. and Seoighe C, Semi-supervised nonnegative matrix factorization for gene expression deconvolution: A case study, Infection, Genetics and Evolution, 12 (2012), 913–921.
[11] Chikina M, Zaslavsky E. and Sealfon SC, CellCODE: A robust latent variable approach to differential expression analysis for heterogeneous cell populations, Bioinformatics, 31 (2015), 1584–1591.
[12] Cichocki A, Zdunek R, Phan AH and Amari S.-i., Nonnegative Matrix and Tensor Factorizations: Applications to Exploratory Multi-Way Data Analysis and Blind Source Separation, John Wiley & Sons, 2009.
[13] Craig MD, Minimum-volume transforms for remotely sensed data, IEEE Transactions on Geoscience and Remote Sensing, 32 (1994), 542–552.
[14] Cui A, Quon G, Rosenberg AM, Yeung RS, Morris Q. and Consortium BS, Gene expression deconvolution for uncovering molecular signatures in response to therapy in juvenile idiopathic arthritis, PloS One, 11 (2016), e0156055.
[15] Darmanis S, Sloan SA, Zhang Y, Enge M, Caneda C, Shuer LM, Gephart MGH, Barres BA and Quake SR, A survey of human brain transcriptome diversity at the single cell level, Proceedings of the National Academy of Sciences, 112 (2015), 7285–7290.
[16] Davey HM and Kell DB, Flow cytometry and cell sorting of heterogeneous microbial populations: The importance of single-cell analyses, Microbiological Reviews, 60 (1996), 641–696.
[17] de Ridder D, Van Der Linden CE, Schonewille T, Dik WD, Reinders MJT, Van Dongen J. and Staal F, Purity for clarity: The need for purification of tumor cells in DNA microarray studies, Leukemia, 19 (2005), 618–627.
[18] De Jager PL, Ma Y, McCabe C, Xu J, Vardarajan BN, Felsky D, Klein H-U, White CC, Peters MA, Lodgson B, et al., A multi-omic atlas of the human frontal cortex for aging and Alzheimer’s disease research, Scientific Data, 5 (2018), 180142.
[19] Donoho D. and Stodden V, When does non-negative matrix factorization give a correct decomposition into parts?, in Advances in Neural Information Processing Systems, (2004), 1141–1148.
[20] Drumetz L, Meyer TR, Chanussot J, Bertozzi AL and Jutten C, Hyperspectral image unmixing with endmember bundles and group sparsity inducing mixed norms, IEEE Transactions on Image Processing, 28 (2019), 3435–3450.
[21] Eckstein J. and Yao W, Augmented Lagrangian and alternating direction methods for convex optimization: A tutorial and some illustrative computational results, RUTCOR Research Reports, 32 (2012), 44.
[22] Fridman WH, Pages F, Sautes-Fridman C. and Galon J, The immune contexture in human tumours: Impact on clinical outcome, Nature Reviews Cancer, 12 (2012), 298–306.
[23] Fu X, Huang K, Sidiropoulos ND and Ma W-K, Nonnegative matrix factorization for signal and data analytics: Identifiability, algorithms, and applications, IEEE Signal Process. Mag, 36 (2019), 59–80.
[24] Fu X, Ma W-K, Chan T-H and Bioucas-Dias JM, Self-dictionary sparse regression for hyperspectral unmixing: Greedy pursuit and pure pixel search are related, IEEE Journal of Selected Topics in Signal Processing, 9 (2015), 1128–1141.
[25] Gillis N, Nonnegative Matrix Factorization: Complexity, Algorithms and Applications, Unpublished Doctoral Dissertation, Université catholique de Louvain. Louvain-La-Neuve: CORE, 2011.
[26] Gong T. and Szustakowski JD, DeconRNASeq: A statistical framework for deconvolution of heterogeneous tissue samples based on mRNA-Seq data, Bioinformatics, 29 (2013), 1083–1085.
[27] Harrington H, Drellich E, Gainer-Dewar A, He Q, Heitsch C. and Poznanovic S, Geometric Combinatorics and Computational Molecular Biology: Branching Polytopes for RNA Sequences, 2017.
[28] He X. and Niyogi P, Locality preserving projections, in Advances in Neural Information Processing Systems, (2004), 153–160.
[29] Huang K, Sidiropoulos ND and Swami A, Non-negative matrix factorization revisited: Uniqueness and algorithm for symmetric decomposition, IEEE Transactions on Signal Processing, 62 (2014), 211–224.
[30] Jin S, Zhang L. and Nie Q, scAI: An unsupervised approach for the integrative analysis of parallel single-cell transcriptomic and epigenomic profiles, Genome Biology, 21 (2020), 1–19.
[31] Kang K, Meng Q, Shats I, Umbach DM, Li M, Li Y, Li X. and Li L, CDSeq: A novel complete deconvolution method for dissecting heterogeneous samples using gene expression data, PLoS Computational Biology, 15 (2019), e1007510.
[32] Kim H. and Park H, Nonnegative matrix factorization based on alternating nonnegativity constrained least squares and active set method, SIAM Journal on Matrix Analysis and Applications, 30 (2008), 713–730.
[33] Kuhn A, Kumar A, Beilina A, Dillman A, Cookson MR and Singleton AB, Cell population-specific expression analysis of human cerebellum, BMC Genomics, 13 (2012), 1–15.
[34] Kuhn A, Thu D, Waldvogel HJ, Faull RL and Luthi-Carter R, Population-specific expression analysis (PSEA) reveals molecular changes in diseased brain, Nature Methods, 8 (2011), 945–947.
[35] Lake BB, Chen S, Sos BC, Fan J, Kaeser GE, Yung YC, Duong TE, Gao D, Chun J, Kharchenko PV, et al., Integrative single-cell analysis of transcriptional and epigenetic states in the human adult brain, Nature Biotechnology, 36 (2018), 70–80.
[36] Laurberg H, Christensen MG, Plumbley MD, Hansen LK and Jensen SH, Theorems on positive data: On the uniqueness of NMF, Computational Intelligence and Neuroscience, (2008), Article ID 764206.
[37] Lee DD and Seung HS, Learning the parts of objects by non-negative matrix factorization, Nature, 401 (1999), 788–791.
[38] Lee D. and Seung HS, Algorithms for non-negative matrix factorization, Advances in Neural Information Processing Systems, 13 (2000).
[39] Ma W-K, Bioucas-Dias JM, Chan T-H, Gillis N, Gader P, Plaza AJ, Ambikapathi A. and Chi C-Y, A signal processing perspective on hyperspectral unmixing: Insights from remote sensing, IEEE Signal Processing Magazine, 31 (2013), 67–81.
[40] McKenzie AT, Moyon S, Wang M, Katsyv I, Song W-M, Zhou X, Dammer EB, Duong DM, Aaker J, Zhao Y, et al., Multiscale network modeling of oligodendrocytes reveals molecular components of myelin dysregulation in Alzheimer’s disease, Molecular Neurodegeneration, 12 (2017), Article number: 82.
[41].Mohammadi S, Zuckerman N, Goldsmith A. and Grama A, A critical survey of deconvolution methods for separating cell types in complex tissues, Proceedings of the IEEE, 105 (2016), 340–366. [Google Scholar]
[42].Mostafavi S, Gaiteri C, Sullivan SE, White CC, Tasaki S, Xu J, Taga M, Klein H-U, Patrick E, Komashko V, et al. , A molecular network of the aging human brain provides insights into the pathology and cognitive decline of Alzheimer’s disease, Nature Neuroscience, 21 (2018), 811–819. [DOI] [PMC free article] [PubMed] [Google Scholar]
[43].Newman AM, Liu CL, Green MR, Gentles AJ, Feng W, Xu Y, Hoang CD, Diehn M. and Alizadeh AA, Robust enumeration of cell subsets from tissue expression profiles, Nature Methods, 12 (2015), 453–457. [DOI] [PMC free article] [PubMed] [Google Scholar]
[44].Nocedal J. and Wright SJ, Numerical Optimization, Springer Science & Business Media, 2006.
[45].Paatero P. and Tapper U, Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values, Environmetrics, 5 (1994), 111–126.
[46].Qiao W, Quon G, Csaszar E, Yu M, Morris Q and Zandstra PW, PERT: A method for expression deconvolution of human blood samples from varied microenvironmental and developmental conditions, PLoS Comput. Biol, 8 (2012), e1002838.
[47].Qin J, Lee H, Chi JT, Lou Y, Chanussot J. and Bertozzi AL, Fast blind hyperspectral unmixing based on graph Laplacian, in 2019 10th Workshop on Hyperspectral Imaging and Signal Processing: Evolution in Remote Sensing (WHISPERS), IEEE, (2019), 1–5.
[48].Repsilber D, Kern S, Telaar A, Walzl G, Black GF, Selbig J, Parida SK, Kaufmann SH and Jacobsen M, Biomarker discovery in heterogeneous tissue samples-taking the in-silico deconfounding approach, BMC Bioinformatics, 11 (2010), 1–15.
[49].Shen-Orr SS and Gaujoux R, Computational deconvolution: Extracting cell type-specific information from heterogeneous samples, Current Opinion in Immunology, 25 (2013), 571–578.
[50].Shen-Orr SS, Tibshirani R, Khatri P, Bodian DL, Staedtler F, Perry NM, Hastie T, Sarwal MM, Davis MM and Butte AJ, Cell type–specific gene expression differences in complex tissues, Nature Methods, 7 (2010), 287–289.
[51].Tsoucas D, Dong R, Chen H, Zhu Q, Guo G. and Yuan G-C, Accurate estimation of cell-type composition from gene expression data, Nature Communications, 10 (2019), 1–9.
[52].Von Luxburg U, A tutorial on spectral clustering, Statistics and Computing, 17 (2007), 395–416.
[53].Wang W. and Carreira-Perpiñán MA, Projection onto the probability simplex: An efficient algorithm with a simple proof, and an application, arXiv preprint, arXiv:1309.1541, 2013.
[54].Wang Y-X and Zhang Y-J, Nonnegative matrix factorization: A comprehensive review, IEEE Transactions on Knowledge and Data Engineering, 25 (2012), 1336–1353.
[55].Warren RE and Osher SJ, Hyperspectral unmixing by the alternating direction method of multipliers, Inverse Problems & Imaging, 9 (2015), 917–933.
[56].Whitney AR, Diehn M, Popper SJ, Alizadeh AA, Boldrick JC, Relman DA and Brown PO, Individuality and variation in gene expression patterns in human blood, Proceedings of the National Academy of Sciences, 100 (2003), 1896–1901.
[57].Zaitsev K, Bambouskova M, Swain A. and Artyomov MN, Complete deconvolution of cellular mixtures based on linearity of transcriptional signatures, Nature Communications, 10 (2019), 1–16.
[58].Zhang J, Nie Q. and Zhou T, Revealing dynamic mechanisms of cell fate decisions from single-cell transcriptomic data, Frontiers in Genetics, 10 (2019), 1280.
[59].Zhang S, Wang W, Ford J. and Makedon F, Learning from incomplete ratings using non-negative matrix factorization, in Proceedings of the 2006 SIAM International Conference on Data Mining, SIAM, (2006), 549–553.
[60].Zhang Y, Sloan SA, Clarke LE, Caneda C, Plaza CA, Blumenthal PD, Vogel H, Steinberg GK, Edwards MS, Li G, et al., Purification and characterization of progenitor and mature human astrocytes reveals transcriptional and functional differences with mouse, Neuron, 89 (2016), 37–53.
[61].Zhong Y, Wan Y-W, Pang K, Chow LM and Liu Z, Digital sorting of complex tissues for cell type-specific gene expression profiles, BMC Bioinformatics, 14 (2013), 89.

