PLOS Computational Biology
PLoS Comput Biol. 2021 Jun 30;17(6):e1009086. doi: 10.1371/journal.pcbi.1009086

Mixture-of-Experts Variational Autoencoder for clustering and generating from similarity-based representations on single cell data

Andreas Kopf 1,5, Vincent Fortuin 2,4, Vignesh Ram Somnath 1, Manfred Claassen 3,*
Editor: Qing Nie
PMCID: PMC8277074  PMID: 34191792

Abstract

Clustering high-dimensional data, such as images or biological measurements, is a long-standing problem and has been studied extensively. Recently, Deep Clustering has gained popularity due to its flexibility in fitting the specific peculiarities of complex data. Here we introduce the Mixture-of-Experts Similarity Variational Autoencoder (MoE-Sim-VAE), a novel generative clustering model. The model can learn multi-modal distributions of high-dimensional data and use these to generate realistic data with high efficacy and efficiency. MoE-Sim-VAE is based on a Variational Autoencoder (VAE), where the decoder consists of a Mixture-of-Experts (MoE) architecture. This specific architecture allows for various modes of the data to be automatically learned by means of the experts. Additionally, we encourage the lower dimensional latent representation of our model to follow a Gaussian mixture distribution and to accurately represent the similarities between the data points. We assess the performance of our model on the MNIST benchmark data set and challenging real-world tasks of clustering mouse organs from single-cell RNA-sequencing measurements and defining cell subpopulations from mass cytometry (CyTOF) measurements on hundreds of different datasets. MoE-Sim-VAE exhibits superior clustering performance on all these tasks in comparison to the baselines as well as competitor methods.

Author summary

Clustering single cell measurements into relevant biological phenotypes, such as cell types or tissue types, is an important task in computational biology. We developed a computational approach that allows incorporating prior knowledge about single cell similarity into the training process and ultimately achieves significantly better clustering performance than baseline methods. This single cell similarity can be defined to suit the specific modeling goal, for example clustering by cell type or by tissue type.

In addition, the architecture of the model, which consists of smaller sub-models learning the different modes of the data, allows us to generate new realistic single cell data from a given phenotype mode. Compared to competitor methods, we show significantly better results on clustering and generating handwritten digits of the MNIST data set, on clustering seven different mouse organs from single-cell RNA sequencing measurements, and on clustering cell types in 272 different datasets of Peripheral Blood Mononuclear Cells measured via CyTOF.


This is a PLOS Computational Biology Methods paper.

Introduction

Clustering has been studied extensively [1, 2] in machine learning and has found wide application in identifying grouping structure in high dimensional biological data such as various omics data modalities. Recently, many Deep Clustering approaches have been proposed that modify (Variational) Autoencoder ((V)AE) architectures [2, 3] or vary the regularization of the latent representation [4–7].

The reconstruction error usually drives the definition of the latent representation learned by an AE or VAE. The representation of AE models is unconstrained and typically places data objects close to each other according to an implicit similarity measure that also yields favorable reconstruction error. In contrast, VAE models regularize the latent representation such that the represented inputs follow a certain variational distribution. This construction enables sampling from the latent representation and data generation via the decoder of a VAE. Typically, the variational distribution is assumed to be a standard Gaussian, but Jiang et al. [7], for example, introduced a mixture-of-Gaussians variational distribution for clustering purposes.

A key component of clustering approaches is the choice of similarity metric for the data objects to be grouped [8]. Such similarity metrics are either defined a priori or learned from the data to specifically solve classification tasks via a Siamese network architecture [9]. Dimensionality reduction approaches, such as UMAP [10] or t-SNE [11], allow specifying a similarity metric for the projection and thereby define the data separation in the inferred latent representation.

In this work, we introduce the Mixture-of-Experts Similarity Variational Autoencoder (MoE-Sim-VAE), a new deep architecture that performs similarity-based representation learning, clustering of the data, and generation of data from each specific data mode. Thanks to a combined loss function, all components can be optimized jointly. We empirically assess the scope of the model and present superior clustering performance on the canonical benchmark MNIST. Moreover, in an ablation study, we show the efficiency and precision of MoE-Sim-VAE for data generation purposes in comparison to the most related state-of-the-art method [7]. We achieve superior results on the identification of tissue or cell type groupings via MoE-Sim-VAE on a murine single-cell RNA-sequencing atlas and on mass cytometry measurements of Peripheral Blood Mononuclear Cells.

Materials and methods

MoE-Sim-VAE

Here we introduce the Mixture-of-Experts Similarity Variational Autoencoder (MoE-Sim-VAE, Fig 1). The model is based on the Variational Autoencoder [12]. While the encoder network is shared across all data points, the decoder of the MoE-Sim-VAE consists of K different subnetworks, forming a Mixture-of-Experts architecture [13]. Each subnetwork constitutes a generator for a specific data mode and is learned from the data.

Fig 1. Schematic overview of MoE-Sim-VAE.


Data (in panel A) gets encoded via an encoder network (B) into a latent representation (C), which is trained to be a mixture of standard Gaussians. Via a clustering network (G), which is trained to reconstruct a user-defined similarity matrix (F), the encoded samples get assigned to the data mode-specific decoder subnetworks (which we call experts) in the MoE Decoder (D). The experts reconstruct the original input data and can be used for data generation when sampling from the variational distribution (E).
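To make this architecture concrete, the following PyTorch sketch wires together a shared encoder, a clustering (gating) network, and a Mixture-of-Experts decoder. It is an illustrative simplification, not the authors' reference implementation (available in their GitHub repository); the class name MoESimVAE, layer sizes, and activations are assumptions.

```python
import torch
import torch.nn as nn

class MoESimVAE(nn.Module):
    """Minimal sketch of the MoE-Sim-VAE architecture: a shared encoder,
    a clustering (gating) network, and K expert decoders."""

    def __init__(self, input_dim, latent_dim, n_experts, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * latent_dim),      # outputs mean and log-variance
        )
        self.gating = nn.Sequential(                # clustering network (Fig 1G)
            nn.Linear(latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_experts), nn.Softmax(dim=-1),
        )
        self.experts = nn.ModuleList([              # MoE decoder (Fig 1D)
            nn.Sequential(nn.Linear(latent_dim, hidden), nn.ReLU(),
                          nn.Linear(hidden, input_dim), nn.Sigmoid())
            for _ in range(n_experts)
        ])

    def forward(self, x):
        mu, logvar = self.encoder(x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        p = self.gating(z)                          # cluster probabilities p_ik
        assignments = p.argmax(dim=-1)              # hard gating: one expert per sample
        x_rec = torch.stack([self.experts[int(k)](z[i])
                             for i, k in enumerate(assignments)])
        return x_rec, z, p
```

Hard gating via the argmax of the clustering probabilities mirrors the assignment described in the subsection on similarity clustering and gating below.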

The variational distribution over the latent representation is defined to be a mixture of multivariate Gaussians, as first introduced by Jiang et al. [7]. In our model, we aim for the mixture components in the latent representation to be standard Gaussians

$z \sim \sum_{k=1}^{K} \omega_k \, \mathcal{N}(\mu_k, I)$ (1)

where $\omega_k$ are the mixture coefficients, $\mu_k$ are the means of the mixture components, $I$ is the identity matrix, and $K$ is the number of mixture components. The dimension of the latent representation $z$ needs to be defined to suit the demands of Gaussian mixtures, which have limitations in higher dimensions [14]. Similar to optimizing an Evidence Lower Bound (ELBO), we penalize the latent representation via the reconstruction loss of the data $\mathcal{L}_{\mathrm{reconst}}$ and via the Kullback-Leibler (KL) divergence for multivariate Gaussians [7] on the latent representation

$\mathcal{L}_{KL} = D_{KL}(\mathcal{N}_0 \,\|\, \mathcal{N}_1) = \frac{1}{2} \left\{ \mathrm{tr}\left(\Sigma_1^{-1}\Sigma_0\right) + (\mu_1 - \mu_0)^{T} \Sigma_1^{-1} (\mu_1 - \mu_0) - k + \ln\frac{|\Sigma_1|}{|\Sigma_0|} \right\}$ (2)

where $k$ is a constant, $\mathcal{N}_0 \sim \mathcal{N}(\mu_0, \Sigma_0 = I)$, and $I$ is the identity matrix. Further, $\mathcal{N}_1 \sim \mathcal{N}(\mu_1, \Sigma_1 = \mathrm{diag}(\sigma_j))$, where $\sigma_j$, for $j = 1, \dots, D$ and number of dimensions $D$, is estimated from the samples of the latent representation. Finally, we assume $\mu_0 = \mu_1$, resulting in the following simplified objective

$\mathcal{L}_{KL} = D_{KL}(\mathcal{N}_0 \,\|\, \mathcal{N}_1) = \frac{1}{2} \left\{ \mathrm{tr}\left(\Sigma_1^{-1}\Sigma_0\right) - k + \ln\frac{|\Sigma_1|}{|\Sigma_0|} \right\}$, (3)

which exclusively penalizes the covariance of each cluster. It remains to define the reconstruction loss $\mathcal{L}_{\mathrm{reconst}}$, for which we choose a Binary Cross-Entropy (BCE)

$\mathcal{L}_{\mathrm{reconst}} = \sum_{i}^{N} \sum_{d}^{D} x_{i,d} \log\left(x_{i,d}^{\mathrm{reconst}}\right)$ (4)

between the original data $x$ (scaled between 0 and 1) and the reconstructed data $x^{\mathrm{reconst}}$, where $i$ runs over the batch of size $N$ and $d$ over the $D$ dimensions of the data. We motivate the BCE loss by its better convergence properties with artificial neural networks compared to the mean squared error [15]. Finally, the loss for the VAE part is defined by

$\mathcal{L}_{VAE} = \mathcal{L}_{\mathrm{reconst}} + \pi_1 \mathcal{L}_{KL}$ (5)

with a weighting coefficient $\pi_1$, which can be optimized as a hyperparameter.
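As a concrete illustration of Eqs (3)–(5), the NumPy sketch below computes the simplified KL term from the empirical per-cluster latent covariance and combines it with the cross-entropy reconstruction term. Function names, the default value of pi1, and the sign convention (the loss is written so that it is minimized) are our own assumptions, not the reference implementation.

```python
import numpy as np

def kl_to_standard_gaussian(z_k):
    """Simplified KL of Eq (3) for one cluster: penalizes the empirical
    diagonal covariance of the latent samples z_k against the identity.
    Assumes diag(sigma_j) holds per-dimension variances (our reading)."""
    var = z_k.var(axis=0) + 1e-8             # per-dimension variance, D values
    D = z_k.shape[1]
    return 0.5 * (np.sum(1.0 / var) - D + np.sum(np.log(var)))

def bce_reconstruction(x, x_rec, eps=1e-8):
    """Cross-entropy reconstruction term of Eq (4); x must be scaled to [0, 1]."""
    return -np.sum(x * np.log(x_rec + eps))

def vae_loss(x, x_rec, z, cluster_ids, pi1=0.1):
    """L_VAE = L_reconst + pi1 * L_KL (Eq 5), with the KL summed over clusters."""
    kl = sum(kl_to_standard_gaussian(z[cluster_ids == k])
             for k in np.unique(cluster_ids)
             if np.sum(cluster_ids == k) > 1)
    return bce_reconstruction(x, x_rec) + pi1 * kl
```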

Similarity clustering and gating of latent representation

Training of a data mode-specific generator expert requires samples from the same data mode. This necessitates solving a clustering problem, that is, mapping the data via the latent representation into K clusters, each corresponding to one of the K generator experts. We solve this clustering problem via a clustering network, also referred to as the gating network in MoE models. It takes the latent representation $z_i$ of sample $i$ as input and outputs probabilities $p_{ik} \in [0, 1]$ for assigning sample $i$ to cluster $k$. According to this cluster assignment, each sample $i$ is gated to expert $k = \arg\max_k p_{ik}$. We further define the cluster centers $\mu_k$ for $k \in \{1, \dots, K\}$, similarly to the Expectation Maximization (EM) algorithm for Gaussian mixture models [16], as

$\mu_k = \frac{1}{N_k} \sum_{i=1}^{N} p_{ik} z_i$, (6)

where $N_k$ is the number of data points assigned to cluster $k$ based on the highest probability $p_{ik}$ for each sample $i = 1, \dots, N$. The Gaussian mixture distributed latent representation (enforced via the KL loss in Eq 3) motivates this empirical computation of the cluster means and, similarly to the EM algorithm, allows iterative optimization of the means of the Gaussians. We train the clustering network to reconstruct a data-driven similarity matrix $S$, using the Binary Cross-Entropy

$\mathcal{L}_{\mathrm{Similarity}} = \sum_{i}^{N} \sum_{j}^{N} S_{i,j} \log\left((PP^{T})_{i,j}\right)$ (7)

to minimize the error between $PP^T$ and $S$, with $P := \{p_{ik}\}_{i \in \{1,\dots,N\},\, k \in \{1,\dots,K\}}$, where $N$ is the number of samples (e.g., the batch size). Intuitively, $PP^T$ approximates the similarity matrix $S$ since entries of $PP^T$ are only close to 1 when similar data objects are assigned to the same cluster, mirroring the entries of the adjacency similarity matrix $S$. In our experiments, this similarity matrix is derived in an unsupervised way (e.g., a UMAP projection of the data followed by k-nearest-neighbors or distance thresholding to define the adjacency matrix for the batch), but it can also be used to include weakly supervised information (e.g., knowledge about diseased vs. non-diseased patients). If labels are available, the model could even be used to derive a latent representation with supervision. The similarity feature in MoE-Sim-VAE thus allows the inclusion of prior knowledge about the best similarity measure on the data.
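A minimal sketch of how the batch similarity matrix S and the similarity loss of Eq (7) could be computed is given below, assuming the umap-learn and scikit-learn packages; the neighborhood size and the two-dimensional projection are illustrative choices, not the exact settings used in the experiments, and the loss is negated so that it is minimized.

```python
import numpy as np
import umap                                    # umap-learn package
from sklearn.neighbors import NearestNeighbors

def knn_similarity_matrix(x_batch, n_neighbors=10):
    """Unsupervised similarity: UMAP projection of the batch followed by
    k-nearest neighbours; S[i, j] = 1 if j is among i's neighbours."""
    emb = umap.UMAP(n_components=2).fit_transform(x_batch)
    knn = NearestNeighbors(n_neighbors=n_neighbors).fit(emb)
    S = knn.kneighbors_graph(emb).toarray()    # binary adjacency matrix
    return np.maximum(S, S.T)                  # symmetrize

def similarity_loss(S, P, eps=1e-8):
    """Eq (7): cross-entropy between S and PP^T, where P holds the cluster
    probabilities p_ik produced by the clustering (gating) network."""
    PPt = P @ P.T
    return -np.sum(S * np.log(PPt + eps))
```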

Moreover, we apply the DEPICT loss from Dizaji et al. [4] to improve the robustness of the clustering. For the DEPICT loss, we additionally propagate a noisy probability $\hat{p}_{ik}$ through the clustering network using dropout after each layer. The goal is to predict the same cluster for both the noisy $\hat{p}_{ik}$ and the clean probability $p_{ik}$ (obtained without dropout). Dizaji et al. [4] derived as objective function a standard cross-entropy loss

$\mathcal{L}_{\mathrm{DEPICT}} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} q_{ik} \log \hat{p}_{ik}$ (8)

where $q_{ik}$ is computed via the auxiliary function

$q_{ik} = \frac{p_{ik} / \left(\sum_{i'} p_{i'k}\right)^{1/2}}{\sum_{k'} p_{ik'} / \left(\sum_{i'} p_{i'k'}\right)^{1/2}}$. (9)

We refer to Dizaji et al. [4] for the exact derivation. The DEPICT loss encourages the model to learn features of the latent representation for clustering that are invariant with respect to noise [4]. Viewed from a different perspective, the loss helps to shape a latent representation whose invariant features allow the similarity, and therefore the clustering, to be reconstructed correctly. The complete clustering loss function $\mathcal{L}_{\mathrm{Clustering}}$ is then defined by

$\mathcal{L}_{\mathrm{Clustering}} = \mathcal{L}_{\mathrm{Similarity}} + \pi_2 \mathcal{L}_{\mathrm{DEPICT}}$ (10)

with a weighting coefficient $\pi_2$, which can be optimized as a hyperparameter.
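The auxiliary target of Eq (9), the DEPICT loss of Eq (8), and the combined clustering loss of Eq (10) can be sketched as follows, reusing similarity_loss from the sketch above; again this is a hedged illustration with assumed function names rather than the reference code.

```python
import numpy as np

def depict_target(P):
    """Auxiliary distribution q_ik of Eq (9), computed from the clean
    cluster probabilities P (shape N x K)."""
    col_norm = P / np.sqrt(P.sum(axis=0, keepdims=True))   # p_ik / (sum_i p_ik)^(1/2)
    return col_norm / col_norm.sum(axis=1, keepdims=True)  # normalize over clusters

def depict_loss(P_noisy, P_clean, eps=1e-8):
    """Eq (8): cross-entropy between q (from the clean P) and the noisy
    probabilities obtained by running the clustering network with dropout."""
    q = depict_target(P_clean)
    return -np.mean(np.sum(q * np.log(P_noisy + eps), axis=1))

def clustering_loss(S, P_clean, P_noisy, pi2=1.0):
    """Eq (10): L_Clustering = L_Similarity + pi2 * L_DEPICT."""
    return similarity_loss(S, P_clean) + pi2 * depict_loss(P_noisy, P_clean)
```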

MoE-Sim-VAE loss function

Finally, the MoE-Sim-VAE model loss is defined by

$\mathcal{L}_{\text{MoE-Sim-VAE}} = \underbrace{\mathcal{L}_{\mathrm{reconst}} + \pi_1 \mathcal{L}_{KL}}_{\mathcal{L}_{VAE}} + \underbrace{\mathcal{L}_{\mathrm{Similarity}} + \pi_2 \mathcal{L}_{\mathrm{DEPICT}}}_{\mathcal{L}_{\mathrm{Clustering}}}$ (11)

which consists of the two main loss functions: $\mathcal{L}_{VAE}$, acting as a regularization of the latent representation, and $\mathcal{L}_{\mathrm{Clustering}}$, which helps to learn the mixture components based on an a priori defined data similarity. The model objective function $\mathcal{L}_{\text{MoE-Sim-VAE}}$ can then be optimized end-to-end to train all parts of the model.

Related work

(V)AEs have been extensively used for clustering [1, 4–6, 17–20]. The approaches most related to MoE-Sim-VAE are Jiang et al. [7] and Zhang et al. [3].

Jiang et al. [7] introduced the VaDE model, comprising a mixture of Gaussians as the underlying distribution in the latent representation of a Variational Autoencoder. Optimizing the Evidence Lower Bound (ELBO) of the data log-likelihood can be rewritten as optimizing the reconstruction loss of the data and the KL divergence between the variational posterior and the mixture-of-Gaussians prior. Jiang et al. [7] use two separate networks for reconstruction and for the generation process. Further, to effectively generate images from a specific data mode and to increase image quality, sampled points have to surpass a certain posterior threshold and are otherwise rejected, which increases the computational effort. The MoE decoder of our model, which is used for both reconstruction and generation, does not need such a threshold.

Zhang et al. [3] have introduced a mixture of autoencoders (MIXAE) model. The latent representation of the MIXAE is defined as the concatenation of the latent representation vectors of each single autoencoder in the model. Based on this concatenated latent representation, a Mixture Assignment Network predicts probabilities which are used in the Mixture Aggregation to form the output of the generator network. Each AE model learns the manifold of a specific cluster, similarly to our MoE Decoder. However, MIXAE does not optimize a variational distribution, such that generation of data from a distribution over the latent representation is not possible, in contrast to the MoE-Sim-VAE (Fig 2).

Fig 2. Generation of MNIST digit images.


Data points from the latent representation were sampled from the variational distribution (A) which is learned to be a mixture of standard Gaussians and then clustered and gated (B) to the data-mode-specific experts of the MoE Decoder (C). (D) All samples from the variational distribution were correctly classified and therefore also correctly gated.

Results

In the following we report superior clustering and generation results of MoE-Sim-VAE on real-world problems. First, we evaluate MoE-Sim-VAE on images from MNIST and show why a MoE decoder is beneficial. Second, we present significantly better clustering results on mouse organ single-cell RNA sequencing data. Third, we apply MoE-Sim-VAE to cluster cell types in Peripheral Blood Mononuclear Cells from CyTOF measurements on 272 distinct data sets, again significantly outperforming competitors. (Exact model and optimization details as well as preprocessing steps for all experiments can be found in S1 Text.)

Unsupervised clustering, representation learning and data generation on MNIST

We trained a MoE-Sim-VAE model on images from MNIST. We compared our model against multiple models that were recently reviewed in Aljalbout et al. [1], and specifically against VaDE [7], which shares similar properties with MoE-Sim-VAE. The VaDE model comprises a mixture of Gaussians as the underlying distribution in the latent representation of a Variational Autoencoder (see Section Related work for a more detailed comparison).

We compare the models using the Normalized Mutual Information (NMI) criterion as well as clustering accuracy (ACC) (Table 1). MoE-Sim-VAE outperforms the other methods with respect to NMI and achieves the second-best result with respect to ACC. Note that for comparability we used the number of experts k = 10 in our model to match the number of digits in MNIST. To show that MoE-Sim-VAE is able to learn the correct number of experts, we report a study on synthetic data in the supporting information (S1 Text and S1 Fig).
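For reference, NMI and ACC as reported in Table 1 can be computed as sketched below, with ACC computed as the standard unsupervised clustering accuracy via Hungarian matching of predicted clusters to true classes; this is our own utility sketch and assumption of the metric definition, not the evaluation script of the paper.

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    """Unsupervised clustering accuracy (ACC): best one-to-one mapping of
    predicted clusters to true classes via the Hungarian algorithm."""
    n = max(y_true.max(), y_pred.max()) + 1
    counts = np.zeros((n, n), dtype=int)
    for t, p in zip(y_true, y_pred):
        counts[p, t] += 1
    row, col = linear_sum_assignment(-counts)   # maximize matched counts
    return counts[row, col].sum() / len(y_true)

# nmi = normalized_mutual_info_score(y_true, y_pred)
# acc = clustering_accuracy(y_true, y_pred)
```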

Table 1. Performance comparison of our method MoE-Sim-VAE with several published methods on MNIST.

The table is mainly extracted from [1, 21] and complemented with results of interest. (“-”: metric not reported).

Method NMI ACC
JULE [22] 0.915 -
CCNN [23] 0.876 -
DEC [17] 0.80 0.843
DBC [18] 0.917 0.964
DEPICT [4] 0.916 0.965
DCN [5] 0.81 0.83
Neural Clustering [19] - 0.966
UMMC [20] 0.864 -
VaDE [7] 0.876 0.945
TAGnet [24] 0.651 0.692
IMSAT [25] - 0.984
Aljalbout et al. [1] 0.923 0.961
MIXAE [3] - 0.945
Spectral clustering [26] 0.754 0.717
SpectralNet [26] 0.924 0.971
ClusterGAN [21] 0.89 0.95
Info-GAN [27] 0.86 0.89
GAN with bp [21] 0.90 0.95
MoE-Sim-VAE (proposed) 0.935 0.975

We use a UMAP projection [10] of MNIST as our similarity measure and then apply k-nearest-neighbors to each sample in a batch. In an ablation study, we show the importance of the similarity matrix for creating a clear separation of the different digits in the latent representation. To this end, we computed a test statistic based on the Maximum Mean Discrepancy (MMD) [28, 29], which can be used to test whether two samples are drawn from the same distribution (see Section 1.2 in S1 Text). In this work, we use MMD to test whether samples from different clusters of the latent representation are similar. When sampling twice from the same cluster, we obtain an average MMD test statistic of t_sim = −0.05 with and t = −0.11 without the similarity matrix, whereas the average distance between samples from two different clusters is considerably larger when training with the similarity matrix (t_sim = 221.66) than without (t = 49.29). This clearly suggests better separation between the clusters in the latent representation when a suitable similarity can be defined (S2 Fig).
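The MMD-based separation check can be sketched with a biased RBF-kernel estimator of the squared MMD, as below. The normalization and kernel bandwidth used for the reported test statistics are described in S1 Text and are not reproduced here, so absolute values from this sketch will differ from those reported.

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    """Gaussian RBF kernel matrix between the rows of a and b."""
    d2 = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2 * a @ b.T
    return np.exp(-gamma * d2)

def mmd2(x, y, gamma=1.0):
    """Biased estimator of the squared Maximum Mean Discrepancy between
    samples x and y; close to zero when both come from the same distribution."""
    return (rbf_kernel(x, x, gamma).mean()
            + rbf_kernel(y, y, gamma).mean()
            - 2 * rbf_kernel(x, y, gamma).mean())
```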

In addition to the clustering network, we can make use of the latent representation for image generation purposes. The latent representation is trained as a mixture of standard Gaussians whose means are the cluster centers trained via the clustering network. Samples from the variational distribution can therefore be gated to the cluster-specific expert of the MoE decoder, which then generates new data points for the specific data mode. Results and the schematic are displayed in Fig 2.
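Generating images of a requested class then amounts to sampling from the corresponding mixture component and decoding with the gated expert, as in the sketch below, which builds on the hypothetical MoESimVAE class from the Methods section; mu_k is assumed to be the learned center of cluster k.

```python
import torch

def generate_from_cluster(model, k, mu_k, n_samples=16):
    """Sample z ~ N(mu_k, I) for a requested cluster k and decode with the
    corresponding expert of the MoE decoder (no posterior thresholding)."""
    with torch.no_grad():
        z = mu_k + torch.randn(n_samples, mu_k.shape[-1])
        return model.experts[k](z)
```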

In an ablation study, we compare the two models MoE-Sim-VAE and VaDE [7] on generating MNIST images of a requested digit. The goal is to show that a MoE decoder, as proposed in our model, is beneficial. We focus our comparison on VaDE since this model, like MoE-Sim-VAE, resorts to a mixture-of-Gaussians latent representation but differs in generating images by means of a single decoder network instead of a Mixture-of-Experts decoder network. The rationale for our design choice is to ensure that smaller sub-networks learn to reproduce and generate specific modes of the data, in this case specific MNIST digits.

To show that both models' latent representations separate the different clusters well, we computed the Maximum Mean Discrepancy (MMD) [28], as introduced above. MMD statistics of t_MoE-Sim-VAE = 256.31 and t_VaDE = 355.14 indicate that the clusters are separated when sampling in the latent representations of both models. Therefore, both latent representations separate the clusters of the respective digits well, such that the decoder receives well-defined samples to generate the requested digit. Hence, the main difference in generating specific digits arises from the decoder/generator networks (S3 Fig).

We evaluated the importance of the MoE decoder for (1) accurately and (2) efficiently generating requested digits. Specifically, we sampled 10,000 points from each mixture component in the latent representation, generated images, and used the model's internal clustering to assign a probability to which digits were generated. To generate correct and high-quality images with VaDE, the posterior of the latent representation needs to be evaluated for each sample. This was done for the thresholds ϕ ∈ {0.0, 0.1, 0.2, ⋯, 0.9, 0.999}, with ϕ = 0.999 being the default threshold [7]. For the comparison of cluster separation in the latent representation via MMD above, we used a threshold of only ϕ = 0.8, which already suffices for higher separation based on MMD. Since MoE-Sim-VAE does not threshold the latent representation, we instead ran its generation process once per threshold with the same settings. To generate images from VaDE, we used the Python implementation (https://github.com/slim1017/VaDE) and the model weights made publicly available by Jiang et al. [7].

As a result, MoE-Sim-VAE generates digits more accurately and with fewer resources, especially when comparing the number of iterations required to fulfill the default posterior threshold of 0.999. VaDE needs nearly 2 million iterations to find samples that fulfill this threshold criterion, whereas MoE-Sim-VAE requires only 10,000 for comparable sample accuracy. The mean accuracy over all thresholds for MoE-Sim-VAE is 0.970, whereas VaDE reaches on average only 0.944 (S4, S5 and S6 Figs).

Clustering organ-specific single cell RNA-seq data

Single-cell RNA-sequencing (scRNA-seq) allows measuring the transcriptomes of tens of thousands of single cells. Clustering the resulting data into groups representing biological phenotypes, such as cell type or tissue type, constitutes a major analysis task in scRNA-seq studies. In the following, we present how MoE-Sim-VAE outperforms Gaussian Mixture Models (GMM), k-means, hierarchical clustering, HDBSCAN, fuzzy-c-means (FCM), Louvain and scVI [30–33] for clustering the scRNA-seq data of the Tabula Muris study covering seven different mouse organs [34]. scVI is a well-established deep generative modeling framework designed for single-cell transcriptomic data; it models the count data with a Poisson distribution and supports several downstream analysis tasks, such as clustering. Also in this example, we used MoE-Sim-VAE with a BCE loss instead of a mean squared error loss, motivated by better convergence properties in combination with artificial neural networks [15] and by literature in which BCE was used as reconstruction loss on RNA-seq data for visualization purposes [35].

MoE-Sim-VAE allows incorporating a user-defined similarity and therefore also prior knowledge about the data. From the literature we identified a representative signature gene for each organ and use these genes to encode a prior expectation of the organ assignment for each cell measurement. Namely, we take advantage of the high expression of Lpl in heart, Miox in kidney, Hpx in liver, Tspan1 in large intestine, Prx in lung, Cd79a in spleen and Dntt in thymus [36–39]. Single cells that show above-average expression of the same signature gene are considered similar. For the training of MoE-Sim-VAE, we only considered cells with above-average expression in exactly one of the organ-specific signature genes (this restriction does not apply to the test data). To highlight the influence of the similarity prior in this example, we note that the average accuracy of a correct similarity assignment per organ based on above-average expression is 0.92 (S1 Table).
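A sketch of how such a prior-knowledge similarity matrix could be assembled from the signature genes is given below; the gene list follows the text, while the data layout (a cells-by-genes expression DataFrame) and the handling of cells without a unique signature hit are illustrative assumptions rather than the exact preprocessing described in S1 Text.

```python
import numpy as np
import pandas as pd

SIGNATURE_GENES = ["Lpl", "Miox", "Hpx", "Tspan1", "Prx", "Cd79a", "Dntt"]

def signature_similarity(expr: pd.DataFrame):
    """Prior-knowledge similarity for the Tabula Muris experiment: two cells are
    similar if they show above-average expression of the same signature gene.
    Cells hitting zero or several signatures receive no similarity links here;
    the paper instead excludes such cells from training (simplification)."""
    above = (expr[SIGNATURE_GENES] > expr[SIGNATURE_GENES].mean(axis=0)).to_numpy()
    unique_hit = above.sum(axis=1) == 1            # exactly one signature gene
    assignment = above.argmax(axis=1)              # index of the hit gene / organ
    S = ((assignment[:, None] == assignment[None, :])
         & unique_hit[:, None] & unique_hit[None, :])
    return S.astype(float)
```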

MoE-Sim-VAE outperforms all of the above-mentioned baseline approaches in clustering the single cells with respect to the organ of origin. Our model reaches an F-measure of 0.748 and is therefore close to 0.1 better than the second-best competitor. We performed a hyperparameter screening for the competitor methods (more details in Chapter 4 in S1 Text) and likewise chose their best results on the test dataset based on the F-measure. Table 2 presents the detailed comparison. Fig 3A and 3B show a Principal Component Analysis (PCA) of the original data as well as the latent representation of MoE-Sim-VAE. The organs are better separated in the latent representation inferred by our model, which enables the better clustering results of MoE-Sim-VAE (Fig 3C). In Fig 3D–3I we present the results of the competitor methods on the latent representation of MoE-Sim-VAE and can clearly see that Louvain performs second best but poorly separates cells from organs that are close to each other or overlap in the original PCA representation. Fig 3J visualizes the Leiden clustering results on the latent representation inferred by scVI. This latent space is more finely separated, but the true labels overlap or are split across groups; for example, samples from the heart are split into up to seven different groups. This leads to less precise clustering results for the task of identifying tissue types, but might be beneficial when clustering cell types. It also highlights the importance of being able to incorporate prior knowledge when inferring latent representations for specific clustering tasks, such as grouping tissue types.

Table 2. Results on clustering mouse organs based on RNA-seq.

We compare MoE-Sim-VAE to the competitor methods Gaussian Mixture Models (GMM), k-means, hierarchical clustering, HDBSCAN, fuzzy-c-means (FCM), Louvain clustering and scVI.

Method F-measure NMI
GMM (PCA k = 20) 0.632 0.487
k-means (PCA k = 40) 0.606 0.443
hierarchical (PCA k = 20) 0.643 0.534
HDBSCAN (PCA k = 20, min cluster size = 50) 0.615 0.517
fuzzy-c-means (PCA k = 50, m = 4) 0.549 0.336
Louvain (PCA k = 30, resolution = 0.01) 0.679 0.584
scVI (Leiden clustering, resolution = 0.06) 0.653 0.561
MoE-Sim-VAE (proposed) 0.748 0.519

Fig 3. Results of clustering mouse organs from single-cell RNA-sequencing data.


A) Principal Component Analysis of the original data with true labels. The remaining panels are UMAP representations of the latent representation inferred from MoE-Sim-VAE: B) true labels; C) predicted labels from MoE-Sim-VAE; D) predicted labels from Gaussian Mixture Model; E) predicted labels from k-means; F) predicted labels from hierarchical clustering; G) predicted labels from HDBSCAN; H) predicted labels from fuzzy-c-means; I) predicted labels from Louvain. J) true and predicted labels on the scVI-inferred latent representation using Leiden clustering.

Learning cell type composition in peripheral blood mononuclear cells using CyTOF measurements

In the following, we assess representation learning performance on the real-world problem of cell type definition from single-cell measurements. Cytometry by time-of-flight mass spectrometry (CyTOF) is a state-of-the-art technique allowing measurements of up to 1,000 cells per second and of over 40 different protein markers per cell in parallel [40]. Defining biologically relevant cell subpopulations by clustering this data is a common learning task [41, 42].

Many methods have been developed to tackle the problem introduced above and were compared on four publicly available datasets in Weber and Robinson [42]. The best out of 18 methods were FlowSOM [43], PhenoGraph [44] and X-shift [45]. These are based on k-nearest-neighbors heuristics, either defined from a spanning graph or from estimating the data density. In contrast to these methods, MoE-Sim-VAE can map new cells into the latent representation, assign probabilities for cell types, and infer an interpretable latent representation, allowing intuitive downstream analysis by domain experts.

We applied MoE-Sim-VAE to the same datasets as Weber and Robinson [42] and achieve superior classification results with respect to the F-measure [41] on three out of four datasets. As in Weber and Robinson [42], we trained MoE-Sim-VAE 30 times and report in Table 3 (adapted from Weber and Robinson [42]) the means and standard deviations across all runs (S7 Fig). As the MoE-Sim-VAE similarity measure, we used a UMAP projection with the Canberra distance [46] as metric and, as in the MNIST experiments, computed the k-nearest-neighbors of each sample in the batch. This applies to all CyTOF experiments.
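For the CyTOF experiments, the similarity construction differs from the MNIST setting only in the distance metric passed to UMAP; a sketch with illustrative parameter values (the neighborhood size is an assumption, not the setting used in the paper):

```python
import numpy as np
import umap
from sklearn.neighbors import NearestNeighbors

def cytof_similarity(x_batch, n_neighbors=15):
    """Batch similarity for the CyTOF experiments: UMAP projection using the
    Canberra distance, then k-nearest neighbours define the adjacency matrix."""
    emb = umap.UMAP(n_components=2, metric="canberra").fit_transform(x_batch)
    knn = NearestNeighbors(n_neighbors=n_neighbors).fit(emb)
    S = knn.kneighbors_graph(emb).toarray()
    return np.maximum(S, S.T)
```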

Table 3. Comparison of MoE-Sim-VAE performance to competitor methods in defining cell type composition in CyTOF measurements.

The results in the table are extracted from the review by Weber and Robinson [42], where 18 methods are compared on four different datasets. Our model outperforms the baselines on three out of the four data sets.

Method Levine_32dim Levine_13dim Samusik_01 Samusik_all
ACCENSE 0.494 0.358 0.517 0.502
ClusterX 0.682 0.474 0.571 0.603
DensVM 0.66 0.448 0.239 0.496
FLOCK 0.727 0.379 0.608 0.631
flowClust N/A 0.416 0.612 0.61
flowMeans 0.769 0.518 0.625 0.653
flowMerge N/A 0.247 0.452 0.341
flowPeaks 0.237 0.215 0.058 0.323
FlowSOM 0.78 0.495 0.707 0.702
FlowSOM_pre 0.502 0.422 0.583 0.528
immunoClust 0.413 0.308 0.552 0.523
k-means 0.42 0.435 0.65 0.59
PhenoGraph 0.563 0.468 0.671 0.653
Rclusterpp 0.605 0.465 0.637 0.613
SamSPECTRAL 0.512 0.253 0.263 0.138
SPADE N/A 0.127 0.169 0.13
SWIFT 0.177 0.179 0.202 0.208
X-Shift 0.691 0.47 0.679 0.657
MoE-Sim-VAE (proposed) 0.70 ± 0.04 0.68 ± 0.01 0.76 ± 0.03 0.74 ± 0.02

Further, we trained a MoE-Sim-VAE model with a fixed number of experts k = 15 (thereby slightly overestimating the true number of subpopulations) on 268 datasets from Bodenmiller et al. [47] and achieve superior clustering results for cell subpopulations compared to state-of-the-art methods in this field (PhenoGraph, X-Shift, FlowSOM). Results are summarized in Fig 4 and listed exactly in S2 Table. Furthermore, we visualize in detail the reconstruction of the original data per expert using a Principal Component Analysis in the original data space. This visualization also shows that many experts were silenced during training, since only seven out of the possible 15 experts were selected (S8 Fig).

Fig 4. Results on clustering cell types on CyTOF measurements.


Comparison of MoE-Sim-VAE to the most popular competitor methods for defining cell types in peripheral blood mononuclear cell data from CyTOF measurements. The x-axis lists the different inhibitor treatments, and the y-axis reports the respective F-measure. Each violin plot represents a run on a different inhibitor with multiple wells, and the line connects the means of the performance per inhibitor.

Discussion

Our MoE-Sim-VAE model can infer similarity-based representations, perform clustering tasks, and efficiently as well as accurately generate high-dimensional data. The training of the model is performed by optimizing a joint objective function consisting of data reconstruction, clustering, and KL loss, where the latter regularizes the latent representation. On the benchmark dataset of MNIST, we present superior clustering performance and the efficiency and accuracy of MoE-Sim-VAE in generating high-dimensional data. On the biological real-world tasks of clustering mouse organs and defining cell subpopulations in complex single-cell data, we show superior performances compared to state-of-the-art methods on a vast range of over 270 datasets and therefore demonstrate the MoE-Sim-VAE’s real-world usefulness.

To achieve outstanding clustering performance, the choice of the similarity measure as well as the hyperparameter tuning, for example of the loss coefficients, play a crucial role. As shown in the ablation study for clustering MNIST, setting the similarity clustering loss coefficient to zero has a tremendous effect on the learned latent representation and the clustering performance. In general, we observed that the coefficients of the reconstruction loss and the clustering loss should be chosen close to one, whereas the coefficient of the KL loss should be closer to zero. The selection of the number of experts plays a less crucial role, as shown on clustering synthetic data, on the example of clustering mouse organs based on single-cell RNA-sequencing data, and on clustering cell types from mass cytometry measurements. Even when defining more experts than the number of expected clusters, MoE-Sim-VAE did not use every single expert; the model selected the minimum number of experts required to distribute the different modes of the data with respect to the defined similarity.

Future work might include adding adversarial training to the MoE decoder, which could improve image generation and create even more realistic images. Also, specific applications might benefit from replacing the Gaussian mixture with a different mixture model; biological data, in particular, is not always well described by Gaussian distributions. So far, the MoE-Sim-VAE similarity measure has to be defined by the user. Relaxing this requirement and learning a useful similarity measure automatically for inferring latent representations will be an interesting extension to explore. This could be useful in weakly supervised settings, which occur, for example, in clinical data consisting of healthy and diseased patients, where minor differences between healthy and diseased patients might make a huge difference and could be learned from the data using neural networks.

In summary, we expect the MoE-Sim-VAE model, as well as its future extensions, to be a valuable contribution to the computational biology toolbox for identifying biological group structure in high-dimensional molecular data modalities, in particular single-cell omics data, under consideration of weak prior knowledge.

Supporting information

S1 Text. Supporting information for Mixture-of-Experts Variational Autoencoder for clustering and generating from similarity-based representations on single cell data.

(ZIP)

S1 Fig. Testing MoE-Sim-VAE on data sampled from a Gaussian mixture model with randomly sampled parameters.

We tested specific numbers of synthetic mixture components while iterating over the number of experts. Up to 23 GMM components, MoE-Sim-VAE is precise in learning the real number of clusters, even when allowing the model to have 40 experts.

(EPS)

S2 Fig. Ablation study on the similarity matrix S.

Both panels show the MMD statistic and a UMAP [10] projection of reconstructed MNIST digits computed on the latent representation. A) shows the results for MoE-Sim-VAE trained with the similarity matrix. The different digits separate well, which can also be seen in the heatmap showing the MMD statistics between all digits. In comparison, B) shows results of the MoE-Sim-VAE model ignoring the similarity matrix by setting the loss coefficient to zero. One can observe that the MMD statistic, which can be seen as a measure of how different two distributions are, is much lower compared to the model including the similarity matrix. Furthermore, the UMAP projection confirms less separation between the different digits in the latent representation.

(EPS)

S3 Fig. Comparison of two sample MMD test on the distributions from the different mixture components in the latent representation.

The heatmaps on the left side show the estimation of the MMD, which can be seen as the distance between pairs of distributions. The figures on the right side show the separation of the clusters in the latent representation based on a dimensionality reduction via UMAP [10]. A) shows the results for the clusters of VaDE at a posterior threshold of 0.8, which is the first threshold showing total separation of all clusters. B) shows the separation of the clusters in the latent space learned by MoE-Sim-VAE. For both methods, all distributions belonging to clusters of different digits show a larger distance compared to the diagonal of matching distributions, such that images are generated from a well-separated latent representation in both cases and the main difference therefore comes from the decoders.

(EPS)

S4 Fig. Comparison of the data generation process between MoE-Sim-VAE and VaDE.

A) shows the accuracy with which a specific digit can be generated from the respective cluster in the latent representation, whereas B) compares the number of runs until a sample from the latent representation satisfied the posterior criterion of VaDE. Note that MoE-Sim-VAE does not require any thresholding, so we ran its data generation process multiple times with the same settings to compare with VaDE. In total, 10,000 samples are generated for each digit.

(EPS)

S5 Fig. Confusion map for data generation using MoE-Sim-VAE.

Besides the systematic error of confusing digits 5 and 8, which can also depend on the clustering network, the digit generation of our model is very precise, with a high accuracy of generating the requested digit. In comparison to VaDE [7], our model does not need any threshold on samples from the latent representation, which substantially reduces the computational cost.

(EPS)

S6 Fig. Confusion maps for data generation using VaDE.

A) Posterior threshold 0.0. B) Posterior threshold 0.1. C) Posterior threshold 0.2. D) Posterior threshold 0.3. E) Posterior threshold 0.4. F) Posterior threshold 0.5. G) Posterior threshold 0.6. H) Posterior threshold 0.7. I) Posterior threshold 0.8. J) Posterior threshold 0.9. K) Posterior threshold 0.999 (default for VaDE [7]).

(EPS)

S7 Fig. Reproducibility of MoE-Sim-VAE on the four datasets.

As in Weber and Robinson [42], we show the reproducibility of MoE-Sim-VAE on the four datasets when running MoE-Sim-VAE 30 times. The variance of MoE-Sim-VAE in defining the correct subpopulations is quite small and therefore also an improvement over many of the methods compared in Weber and Robinson [42].

(EPS)

S8 Fig. Reconstruction of data modes per expert.

PCA plots showing the reconstruction (red) of the original data (colored underneath), separated per MoE expert, for inhibitor GDC-0941 and well A09 from the Bodenmiller et al. [47] data. This example reached an F-measure of 0.8606. The experts with IDs 2, 3, …, 9 were not selected by the gating network. The red samples in each plot visualize the reconstructed data. A) Expert ID = 0. B) Expert ID = 1. C) Expert ID = 10. D) Expert ID = 11. E) Expert ID = 12. F) Expert ID = 13. G) Expert ID = 14. H) Visualization of the reconstruction combining the data modes from all selected experts. I) PCA plot of the true labels without any reconstruction overlaid.

(EPS)

S1 Table. Signature gene accuracy.

Accuracy of assigning an organ similarity based on high gene expression of the a priori selected organ-specific signature genes, for the training and test data splits. We computed the balanced accuracy for each single organ vs. the rest, respectively.

(XLS)

S2 Table. Exact results on 268 mass cytometry experiments.

CyTOF measurements of peripheral blood mononuclear cells (PBMCs) were taken, and the goal is to define the different cell types present in the data. The ground truth was defined using the SPADE algorithm [48], which visualizes the high-dimensional data in a way that allows manual gating of the cells. We compare to other fully unsupervised methods such as FlowSOM, X-shift and PhenoGraph and achieve the best F-measure in most cases.

(XLS)

Acknowledgments

AK thanks Florian Buettner for helpful discussions and his inspirational attitude.

Data Availability

All relevant data are within the paper and its Supporting information files. MoE-Sim-VAE is available at the following Github repository: https://github.com/andkopf/MoESimVAE.

Funding Statement

AK is supported by the “SystemsX.ch HDL-X” and “ERASysApp Rootbook” and PHRT 2017-103. VF is supported by a PhD fellowship from the Swiss Data Science Center and by the PHRT grant #2017-110 of the ETH Domain. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Aljalbout E., Golkov V., Siddiqui Y., Strobel M., Cremers D. Clustering with Deep Learning: Taxonomy and New Methods. arXiv, 2018.
  • 2. Min E., Guo X., Liu Q., Zhang G., Cui J., Long J. A Survey of Clustering With Deep Learning: From the Perspective of Network Architecture. IEEE, 2018.
  • 3. Zhang D., Sun Y., Eriksson B., Balzano L. Deep Unsupervised Clustering Using Mixture of Autoencoders. arXiv, 2017.
  • 4. Dizaji K. G., Herandi A., Deng C., Cai W., Huang H. Deep Clustering via Joint Convolutional Autoencoder Embedding and Relative Entropy Minimization. arXiv, 2017.
  • 5. Yang B., Fu X., Sidiropoulos N. D., Hong M. Towards K-means-friendly Spaces: Simultaneous Deep Learning and Clustering. arXiv, 2017.
  • 6. Fortuin V., Hüser M., Locatello F., Strathmann H., Rätsch G. SOM-VAE: Interpretable Discrete Representation Learning on Time Series. Conference paper at ICLR, 2019.
  • 7. Jiang Z., Zheng Y., Tan H., Tang B., Zhou H. Variational Deep Embedding: An Unsupervised and Generative Approach to Clustering. arXiv, 2017.
  • 8. Irani J., Pise N., Phatak M. Clustering Techniques and the Similarity Measures used in Clustering: A Survey. International Journal of Computer Applications, 2016. doi: 10.5120/ijca2016907841
  • 9. Chopra S., Hadsell R., LeCun Y. Learning a similarity metric discriminatively, with application to face verification. IEEE, 2005.
  • 10. McInnes L., Healy J., Melville J. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv, 2018.
  • 11. van der Maaten L., Hinton G. Visualizing Data using t-SNE. Journal of Machine Learning Research, 2008.
  • 12. Kingma D. P., Welling M. Auto-Encoding Variational Bayes. International Conference on Learning Representations (ICLR), 2014.
  • 13. Shazeer N., Mirhoseini A., Maziarz K., Davis A., Le Q., Hinton G. et al. Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. arXiv, 2017.
  • 14. Bishop C. M. Neural Networks for Pattern Recognition. Clarendon Press, 1995.
  • 15. Golik P., Doetsch P., Ney H. Cross-entropy vs. squared error training: a theoretical and experimental comparison. INTERSPEECH, 2013.
  • 16. Bishop C. M. Pattern Recognition and Machine Learning. Springer, 2006.
  • 17. Xie J., Girshick R., Farhadi A. Unsupervised deep embedding for clustering analysis. International Conference on Machine Learning (ICML), 2016.
  • 18. Li F., Qiao H., Zhang B., Xi X. Discriminatively boosted image clustering with fully convolutional autoencoders. arXiv, 2017.
  • 19. Saito S., Tan R. T. Neural clustering: Concatenating layers for better projections. Workshop track at ICLR, 2017.
  • 20. Chen D., Lv J., Yi Z. Unsupervised multi-manifold clustering by learning deep representation. Workshops at the AAAI Conference on Artificial Intelligence, 2017.
  • 21. Mukherjee S., Asnani H., Lin E., Kannan S. ClusterGAN: Latent Space Clustering in Generative Adversarial Networks. The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19), 2019.
  • 22. Yang J., Parikh D., Batra D. Joint unsupervised learning of deep representations and image clusters. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • 23. Hsu C.-C., Lin C.-W. CNN-based joint clustering and representation learning with feature drift compensation for large-scale image data. arXiv, 2017.
  • 24. Wang Z., Chang S., Zhou J., Wang M., Huang T. S. Learning a task-specific deep architecture for clustering. Proceedings of the SIAM International Conference on Data Mining, 2016.
  • 25. Hu W., Miyato T., Tokui S., Matsumoto E., Sugiyama M. Learning discrete representations via information maximizing self augmented training. arXiv, 2017.
  • 26. Shaham U., Stanton K., Li H., Nadler B., Basri R., Kluger Y. SpectralNet: Spectral Clustering using Deep Neural Networks. Conference paper at ICLR, 2018.
  • 27. Chen X., Duan Y., Houthooft R., Schulman J., Sutskever I., Abbeel P. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, 2172–2180, 2016.
  • 28. Gretton A., Borgwardt K., Rasch M. J., Schölkopf B., Smola A. J. A Kernel Method for the Two-Sample Problem. arXiv, 2008.
  • 29. Sutherland D. J., Tung H.-Y., Strathmann H., De S., Ramdas A., Smola A. et al. Generative models and model criticism via optimized maximum mean discrepancy. arXiv, 2019.
  • 30. Feng C., Liu S., Zhang H., Guan R., Li D., Zhou F. et al. Dimension Reduction and Clustering Models for Single-Cell RNA Sequencing Data: A Comparative Study. Int J Mol Sci, 2020. doi: 10.3390/ijms21062181
  • 31. McInnes L., Healy J., Astels S. hdbscan: Hierarchical density based clustering. Journal of Open Source Software, 2(11), 2017.
  • 32. Dias M. L. D. fuzzy-c-means: An implementation of the Fuzzy C-means clustering algorithm. Zenodo, 2019.
  • 33. Lopez R., Regier J., Cole M. B., Jordan M. I., Yosef N. Deep Generative Modeling for Single-cell Transcriptomics. Nat Methods, 2018. doi: 10.1038/s41592-018-0229-2
  • 34. The Tabula Muris Consortium, Overall coordination, Schaum N. et al. Single-cell transcriptomics of 20 mouse organs creates a Tabula Muris. Nature 562, 367–372, 2018. doi: 10.1038/s41586-018-0590-4
  • 35. Wang D., Gu J. VASC: Dimension Reduction and Visualization of Single-cell RNA-seq Data by Deep Variational Autoencoder. Genomics Proteomics Bioinformatics, 2018. doi: 10.1016/j.gpb.2018.08.003
  • 36. Li B., Qing T., Zhu J., Wen Z., Yu Y., Fukumura R. et al. A Comprehensive Mouse Transcriptomic BodyMap across 17 Tissues by RNA-seq. Sci Rep 7, 4200, 2017. doi: 10.1038/s41598-017-04520-z
  • 37. Trent C. M., Yu S., Hu Y., Skoller N., Huggins L. A., Homma S. et al. Lipoprotein lipase activity is required for cardiac lipid droplet production. J Lipid Res, 2014.
  • 38. Yagyu H., Chen G., Yokoyama M., Hirata K., Augustus A., Kako Y. et al. Lipoprotein lipase (LpL) on the surface of cardiomyocytes increases lipid uptake and produces a cardiomyopathy. J Clin Invest, 2003. doi: 10.1172/JCI16751
  • 39. Yue F., Cheng Y., Breschi A., Vierstra J., Wu W., Ryba T. et al. A comparative encyclopedia of DNA elements in the mouse genome. Nature, 2014.
  • 40. Kay A. W., Strauss-Albee D. M., Blish C. A. Application of Mass Cytometry (CyTOF) for Functional and Phenotypic Analysis of Natural Killer Cells. Methods in Molecular Biology, 2013. doi: 10.1007/978-1-4939-3684-7_2
  • 41. Aghaeepour N., Finak G., FlowCAP Consortium, DREAM Consortium, Hoos H., Mosmann T. R. et al. Critical assessment of automated flow cytometry data analysis techniques. Nature Methods, 2013. doi: 10.1038/nmeth.2365
  • 42. Weber L. M., Robinson M. D. Comparison of clustering methods for high-dimensional single-cell flow and mass cytometry data. Cytometry Part A, 2016. doi: 10.1002/cyto.a.23030
  • 43. Van Gassen S., Callebaut B., Van Helden M. J., Lambrecht B. N., Demeester P., Dhaene T. et al. FlowSOM: Using self-organizing maps for visualization and interpretation of cytometry data. Cytometry Part A, 2015. doi: 10.1002/cyto.a.22625
  • 44. Levine J. H., Simonds E. F., Bendall S. C., Davis K. L., Amir E.-a.D., Tadmor M. D. et al. Data-Driven Phenotypic Dissection of AML Reveals Progenitor-like Cells that Correlate with Prognosis. Cell, 2015. doi: 10.1016/j.cell.2015.05.047
  • 45. Samusik N., Good Z., Spitzer M. H., Davis K. L., Nolan G. P. Automated Mapping of Phenotype Space with Single-Cell Data. Nature Methods, 2016. doi: 10.1038/nmeth.3863
  • 46. Lance G. N., Williams W. T. Computer programs for hierarchical polythetic classification ("similarity analysis"). Computer Journal, 1966. doi: 10.1093/comjnl/9.1.60
  • 47. Bodenmiller B., Zunder E. R., Finck R., Chen T. J., Savig E. S., Bruggner R. V. et al. Multiplexed mass cytometry profiling of cellular states perturbed by small-molecule regulators. Nature Biotechnology, 2012. doi: 10.1038/nbt.2317
  • 48. Qiu P., Simonds E. F., Bendall S. C., Gibbs K. D. Jr., Bruggner R. V., Linderman M. D. et al. Extracting a cellular hierarchy from high-dimensional cytometry data with SPADE. Nature Biotechnology, 2011. doi: 10.1038/nbt.1991
PLoS Comput Biol. doi: 10.1371/journal.pcbi.1009086.r001

Decision Letter 0

Qing Nie, Alice Carolyn McHardy

22 Mar 2021

Dear Prof. Dr. Claassen,

Thank you very much for submitting your manuscript "Mixture-of-Experts Variational Autoencoder for Clustering and Generating from Similarity-Based Representations on Single Cell Data" for consideration at PLOS Computational Biology.

As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments.

We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.

Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Qing Nie

Associate Editor

PLOS Computational Biology

Alice McHardy

Deputy Editor

PLOS Computational Biology

***********************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: The manuscript `Mixture-of-Experts Variational Autoencoder for Clustering and Generating from Similarity-Based Representations on Single Cell Data` proposes a generative clustering model, which is based on a variational autoencoder with a Gaussian mixture model for the latent space and a decoder consisting of a mixture of networks, where each mode in the latent space is decoded by an expert network. The gating network to assign samples to modes can be guided by prior knowledge on sample similarity. The authors demonstrate the performance of their model on MNIST data as well as single cell RNA-seq and mass cytometry clustering tasks. Overall, the work is interesting and the experiments conducted in a thorough manner. For an added value to the community, the authors should make their methods and scripts to reproduce the results accessible and provide additional details on their experiments. Other points are listed below.

Major points:

- Please make the code for the method and the scripts to reproduce the experiments accessible.

- What pre-processing was applied to the data presented in the Results? Please specify this and also provide details on the split of train and test data.

- Several methods exist that use non-Gaussian variational distributions in a VAE, especially count-based models in the single cell domain. These should be mentioned on p.2 l.14 and the results of a clustering in their latent space included in Table 2.

- The ablation study nicely highlights the benefits of including a similarity matrix. However, the exact impact of the chosen similarity metric remains unclear. For this, a clustering based solely on the chosen similarity matrix should be included as a reference in all experimental results. Does the influence of the similarity matrix depend on the dimensions of the data or number of clusters? How sensitive is the method to misspecifications in the similarity (e.g. wrongly chosen signature genes in the cell type clustering problem)?

Minor points:

- Could the authors comment on the motivation for the binary cross-entropy as reconstruction loss in Eq. (4). This seems to lead to blurry images in Fig. 2 and it is unclear why this should be a suitable loss for RNA-seq data.

- What would be the guidelines to choose the number of clusters K in applications? This seems to be crucial for the ability to generate samples and according to Fig S1 many more experts than actual clusters in the data might be found.

- Table 1 would be more insightful if the authors could provide a short description for these methods. Also, GANs for clustering seem an important alternative for the task but are not contained in the comparison nor mentioned in the text.

- The ablation study could be included as part of Table 1.

- l117-136 are hard to follow without going back to the original publication of VaDE.

- The authors could remove repetitive parts of the text (e.g. method description of VaDE and MoE-Sim-VAE in Results, eq (2) and (3)) and better separate general concepts and technical details in the introduction.

- How sensitive is the method to the choice of pi_1 and pi_2?

Reviewer #2: The authors report a computational method, Mixture-of-Experts Similarity Variational Autoencoder, at clustering and data generation, with applications on large-scale single-cell data. The proposed mathematical framework is solid and builds upon ideas that are appropriate for the analysis of large-scale single-cell data. The authors have demonstrated the applicability of their method using publicly available data sets, and the figures and tables are simple and clear. I however have a few suggestions to improve the manuscript.

Major comments:

1. The authors do not provide a link to their implementation. This should be a red line for academic computational tools, and I would request the authors to share their implementation via a reproducible GitHub repository in a revised version of the manuscript. In addition, the authors should include a vignette or a tutorial reproducing the results from at least one of the applications presented in the manuscript.

2. In Table 2 the authors extract the F-measure and NMI scores from a public review (Lukas et al. 2016). This is fine as long as the processing pipelines that the authors followed are the same as in the review. Minor changes in the data processing pipelines can lead to significant differences in the results. If this is the case, the authors should clarify. If not, the authors should verify the reported results with their own data processing pipelines.

Minor comments:

1. Fig3 is missing the colour legend

2. Could the authors back the quantitative results shown in Fig4 with a visualisation of the data, with cells coloured by cluster / cell type?

3. Why do the authors use the binary cross entropy as a reconstruction loss instead of the mean-squared error? Single-cell data is usually not scaled between 0 and 1.

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: No: Code to reproduce the results is missing.

Reviewer #2: Yes

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms, etc. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

To enhance the reproducibility of your results, PLOS recommends that you deposit laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions, please see http://journals.plos.org/compbiol/s/submission-guidelines#loc-materials-and-methods

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1009086.r003

Decision Letter 1

Qing Nie, Alice Carolyn McHardy

14 May 2021

Dear Prof. Dr. Claassen,

We are pleased to inform you that your manuscript 'Mixture-of-Experts Variational Autoencoder for Clustering and Generating from Similarity-Based Representations on Single Cell Data' has been provisionally accepted for publication in PLOS Computational Biology.

Reviewer 1 has two minor comments. Please address them.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology. 

Best regards,

Qing Nie

Associate Editor

PLOS Computational Biology

Alice McHardy

Deputy Editor

PLOS Computational Biology

***********************************************************

Please address Reviewer 1's two minor comments in the final version of the submission.

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: The authors have addressed my previous comments in their revised manuscript.

Before publication they should still address the following minor points:

1. Please fix typos and sentence structure in lines 66/67 and 261.

2. Please complete the code repository to make all examples from the paper reproducible and include the code that was used for the comparison to other methods in the benchmarks.

Reviewer #2: The authors have correctly addressed all my comments. I recommend publication of this manuscript.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code (e.g. participant privacy or use of data from a third party), those must be specified.

Reviewer #1: No: see comments to the Authors

Reviewer #2: None

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1009086.r004

Acceptance letter

Qing Nie, Alice Carolyn McHardy

25 Jun 2021

PCOMPBIOL-D-20-02250R1

Mixture-of-Experts Variational Autoencoder for Clustering and Generating from Similarity-Based Representations on Single Cell Data

Dear Dr Claassen,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Katalin Szabo

PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Text. Supporting information for Mixture-of-Experts Variational Autoencoder for clustering and generating from similarity-based representations on single cell data.

    (ZIP)

    S1 Fig. Testing MoE-Sim-VAE on data sampled from a Gaussian mixture model with randomly sampled parameters.

    We tested specific numbers of synthetic mixture components while iterating over the number of experts. Up to 23 GMM components, MoE-Sim-VAE precisely learns the true number of clusters, even when the model is allowed up to 40 experts.

    (EPS)
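
    A minimal sketch of the kind of synthetic setup described in S1 Fig, sampling data from a Gaussian mixture with randomly drawn parameters (all names and parameter ranges here are illustrative, not the exact settings used in the experiment):

```python
import numpy as np

def sample_random_gmm(n_components, n_samples, dim, rng=None):
    """Draw data from a Gaussian mixture with randomly sampled parameters."""
    rng = np.random.default_rng(rng)
    means = rng.uniform(-10.0, 10.0, size=(n_components, dim))   # random component means
    scales = rng.uniform(0.5, 2.0, size=n_components)            # random component scales
    labels = rng.integers(0, n_components, size=n_samples)       # component assignments
    data = means[labels] + scales[labels, None] * rng.standard_normal((n_samples, dim))
    return data, labels

# e.g. 23 true components; a model allowed up to 40 experts would then be
# evaluated on how many of these components it recovers.
X, y = sample_random_gmm(n_components=23, n_samples=5000, dim=10, rng=0)
```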

    S2 Fig. Ablation study on the similarity matrix S.

    Both panels show the MMD statistic and the UMAP [10] projection of reconstructed MNIST digits, computed on the latent representation. A) shows the results for MoE-Sim-VAE trained with the similarity matrix. The different digits separate well, which is also visible in the heatmap of MMD statistics between all digits. In comparison, B) shows results for the MoE-Sim-VAE model ignoring the similarity matrix by setting the corresponding loss coefficient to zero. The MMD statistic, which can be interpreted as a measure of how different two distributions are, is considerably lower than for the model including the similarity matrix. The UMAP projection likewise confirms weaker separation between the different digits in the latent representation.

    (EPS)

    S3 Fig. Comparison of two sample MMD test on the distributions from the different mixture components in the latent representation.

    The heatmaps on the left show the MMD estimates, which can be interpreted as distances between pairs of distributions. The figures on the right show the separation of the clusters in the latent representation based on a dimensionality reduction via UMAP [10]. A) shows the results for the clusters of VaDE at a posterior threshold of 0.8, the first threshold at which all clusters are fully separated. B) shows the separation of the clusters in the latent space learned by MoE-Sim-VAE. For both methods, all distributions belonging to clusters of different digits show a larger distance than the diagonal of matching distributions, so that images are generated from a well-separated latent representation in both cases and the main difference therefore comes from the decoders.

    (EPS)
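
    The MMD statistic referred to in S2 and S3 Fig can be illustrated with the standard (biased) RBF-kernel estimator between two sets of latent codes. A minimal numpy sketch (bandwidth choice and variable names are illustrative):

```python
import numpy as np

def rbf_mmd(X, Y, bandwidth=1.0):
    """Biased MMD^2 estimate with an RBF kernel between samples X (n x d) and Y (m x d)."""
    def rbf(A, B):
        # Pairwise squared distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
        sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
        return np.exp(-sq / (2 * bandwidth**2))
    return rbf(X, X).mean() + rbf(Y, Y).mean() - 2 * rbf(X, Y).mean()

# Latent codes for two digits: a larger MMD indicates better-separated
# distributions, a value near zero indicates matching distributions.
rng = np.random.default_rng(0)
z_digit_a = rng.normal(0.0, 1.0, size=(200, 8))
z_digit_b = rng.normal(3.0, 1.0, size=(200, 8))
print(rbf_mmd(z_digit_a, z_digit_b))
```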

    S4 Fig. Comparison of the data generation process between MoE-Sim-VAE and VaDE.

    A) shows how accurately a specific digit can be generated from the respective cluster in the latent representation, whereas B) compares the number of runs needed until a sample from the latent representation satisfied the posterior criterion of VaDE. Note that MoE-Sim-VAE does not require any thresholding; to compare with VaDE, we therefore ran the data generation process multiple times with the same settings. In total, 10,000 samples were generated for each digit.

    (EPS)
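
    As a conceptual sketch of the generation procedure compared in S4 Fig (pseudocode-level; gmm_means, gmm_scales and decoder_experts are hypothetical placeholders for trained model components): a latent code is sampled from the chosen mixture component and decoded with the expert associated with that component, without any posterior thresholding.

```python
import numpy as np

def generate_from_cluster(k, n_samples, gmm_means, gmm_scales, decoder_experts, rng=None):
    """Sketch: draw latent codes from mixture component k and decode them with
    the expert assigned to that component (no rejection or thresholding)."""
    rng = np.random.default_rng(rng)
    dim = gmm_means.shape[1]
    z = gmm_means[k] + gmm_scales[k] * rng.standard_normal((n_samples, dim))
    return decoder_experts[k](z)   # hypothetical per-cluster decoder function

# VaDE-style generation would instead resample until the posterior p(k | z)
# exceeds a threshold, which is what S4 Fig B) counts as "number of runs".
```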

    S5 Fig. Confusion map for data generation using MoE-Sim-VAE.

    Apart from the systematic error of confusing digits 5 and 8, which can also depend on the clustering network, the digit generation of our model is very precise, producing the requested digit with high accuracy. In comparison to VaDE [7], our model does not need any threshold on samples from the latent representation, which greatly reduces the computational cost.

    (EPS)

    S6 Fig. Confusion maps for data generation using VaDE.

    A) Posterior threshold 0.0. B) Posterior threshold 0.1. C) Posterior threshold 0.2. D) Posterior threshold 0.3. E) Posterior threshold 0.4. F) Posterior threshold 0.5. G) Posterior threshold 0.6. H) Posterior threshold 0.7. I) Posterior threshold 0.8. J) Posterior threshold 0.9. K) Posterior threshold 0.999 (default for VaDE [7]).

    (EPS)

    S7 Fig. Reproducibility of MoE-Sim-VAE on the four datasets.

    Similar to Weber et al. [42], we show the reproducibility of MoE-Sim-VAE on the four datasets when running MoE-Sim-VAE 30 times. The variance of MoE-Sim-VAE in defining the correct subpopulations is small, which is also an improvement over many of the methods compared in Weber et al. [42].

    (EPS)

    S8 Fig. Reconstruction of data modes per expert.

    PCA plot showing the reconstruction (red) of the original data (coloured underneath), separated per MoE expert, on the inhibitor GDC-0941 and well A09 from the Bodenmiller [47] data. This example reached an F-measure of 0.8606. The experts with IDs 2, 3, …, 9 were not selected by the gating network. The red samples in each plot visualize the reconstructed data. A) Expert ID = 0. B) Expert ID = 1. C) Expert ID = 10. D) Expert ID = 11. E) Expert ID = 12. F) Expert ID = 13. G) Expert ID = 14. H) Visualization of the reconstruction combining the data modes from all selected experts. I) PCA plot of the true labels without any reconstruction overlaid.

    (EPS)
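
    A sketch of the kind of per-expert overlay shown in S8 Fig, assuming arrays X of original cells, X_rec of their reconstructions, expert_ids of gating assignments and numeric labels of cell types (all names illustrative):

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

def plot_expert_reconstructions(X, X_rec, expert_ids, labels):
    """Project original data and reconstructions into the same PCA space and
    overlay the reconstructions handled by each selected expert."""
    pca = PCA(n_components=2).fit(X)
    X2, R2 = pca.transform(X), pca.transform(X_rec)
    for k in np.unique(expert_ids):
        mask = expert_ids == k
        plt.figure()
        plt.scatter(X2[:, 0], X2[:, 1], c=labels, s=2, alpha=0.3)  # original data, coloured by label
        plt.scatter(R2[mask, 0], R2[mask, 1], c="red", s=2)        # this expert's reconstructions
        plt.title(f"Expert {k}")
    plt.show()
```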

    S1 Table. Signature gene accuracy.

    Accuracy of assigning an organ similarity based on high gene expression of previously selected organ-specific signature genes, for the split training and test data sets. We computed the balanced accuracy for each organ vs. the rest.

    (XLS)
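
    The one-vs-rest balanced accuracy reported in S1 Table can be computed as in the following sketch (variable names illustrative):

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score

def per_organ_balanced_accuracy(true_organs, assigned_organs):
    """Balanced accuracy of each organ against all remaining organs."""
    true_organs = np.asarray(true_organs)
    assigned_organs = np.asarray(assigned_organs)
    scores = {}
    for organ in np.unique(true_organs):
        y_true = true_organs == organ      # one-vs-rest binarisation of the truth
        y_pred = assigned_organs == organ  # one-vs-rest binarisation of the assignment
        scores[organ] = balanced_accuracy_score(y_true, y_pred)
    return scores
```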

    S2 Table. Exact results on 268 mass cytometry experiments.

    CyTOF measurements were taken from peripheral blood mononuclear cells (PBMCs), and the goal is to define the different cell types present in the data. The ground truth was defined using the SPADE algorithm [48], which visualizes the high-dimensional data in a way that allows manual gating of the cells. We compare to other fully unsupervised methods such as FlowSOM, X-shift and PhenoGraph and achieve the best F-measure in most cases.

    (XLS)
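
    The clustering F-measure reported in S2 Table is commonly computed by matching each true population to its best-scoring predicted cluster; the following sketch shows one common variant of this score (it may differ in detail from the exact definition used in the benchmark):

```python
import numpy as np

def clustering_f_measure(true_labels, pred_labels):
    """Common clustering F-measure: for each true population, take the best F1
    over predicted clusters, then average weighted by population size."""
    true_labels, pred_labels = np.asarray(true_labels), np.asarray(pred_labels)
    n = len(true_labels)
    score = 0.0
    for t in np.unique(true_labels):
        t_mask = true_labels == t
        best_f1 = 0.0
        for p in np.unique(pred_labels):
            p_mask = pred_labels == p
            tp = np.sum(t_mask & p_mask)
            if tp == 0:
                continue
            precision, recall = tp / p_mask.sum(), tp / t_mask.sum()
            best_f1 = max(best_f1, 2 * precision * recall / (precision + recall))
        score += (t_mask.sum() / n) * best_f1
    return score
```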

    Attachment

    Submitted filename: PCB-review-comments.pdf

    Data Availability Statement

    All relevant data are within the paper and its Supporting information files. MoE-Sim-VAE is available at the following Github repository: https://github.com/andkopf/MoESimVAE.

