Generalizable and Scalable Visualization of Single-Cell Data Using Neural Networks

Hyunghoon Cho; Bonnie Berger; Jian Peng

doi:10.1016/j.cels.2018.05.017

Author manuscript; available in PMC: 2019 Apr 17.

Published in final edited form as: Cell Syst. 2018 Jun 20;7(2):185–191.e4. doi: 10.1016/j.cels.2018.05.017

PMCID: PMC6469860 NIHMSID: NIHMS1021959 PMID: 29936184

SUMMARY

Visualization algorithms are fundamental tools for interpreting single-cell data. However, standard methods, such as t-stochastic neighbor embedding (t-SNE), are not scalable to datasets with millions of cells and the resulting visualizations cannot be generalized to analyze new datasets. Here we introduce net-SNE, a generalizable visualization approach that trains a neural network to learn a mapping function from high-dimensional single-cell gene-expression profiles to a low-dimensional visualization. We benchmark net-SNE on 13 different datasets, and show that it achieves visualization quality and clustering accuracy comparable with t-SNE. Additionally we show that the mapping function learned by net-SNE can accurately position entire new subtypes of cells from previously unseen datasets and can also be used to reduce the runtime of visualizing 1.3 million cells by 36-fold (from 1.5 days to an hour). Our work provides a framework for bootstrapping single-cell analysis from existing datasets.

In Brief

Researchers are applying single-cell RNA sequencing to increasingly large numbers of cells in diverse tissues and organisms. We introduce a data visualization tool, named net-SNE, which trains a neural network to embed single cells in 2D or 3D. Unlike previous approaches, our method allows new cells to be mapped onto existing visualizations, facilitating knowledge transfer across different datasets. Our method also vastly reduces the runtime of visualizing large datasets containing millions of cells.

INTRODUCTION

Complex biological systems arise from functionally diverse, heterogeneous populations of cells. Single-cell RNA sequencing (scRNA-seq) (Gawad et al., 2016), which profiles transcriptomes of individual cells rather than bulk samples, has been a key tool in dissecting the intercellular variation in a wide range of domains, including cancer biology (Wang et al., 2014), immunology (Stubbington et al., 2017), and metagenomics (Yoon et al., 2011). scRNA-seq also enables the de novo identification of cell types with distinct expression patterns (Grün et al., 2015; Jaitin et al., 2014).

A standard analysis for scRNA-seq data is to visualize single-cell gene-expression patterns of samples in a low-dimensional (2D or 3D) space via methods such as t-stochastic neighbor embedding (t-SNE) (Maaten and Hinton, 2008) or, in earlier studies, principal component analysis (Jackson, 2005), whereby each cell is represented as a dot and cells with similar expression profiles are located close to each other. Such visualization reveals the salient structure of the data in a form that is easy for researchers to grasp and further manipulate. For instance, researchers can quickly identify distinct subpopulations of cells through visual inspection of the image, or use the image as a common lens through which different aspects of the cells are compared. The latter is typically achieved by overlaying additional data on top of the visualization, such as known labels of the cells or the expression levels of a gene of interest (Zheng et al., 2017). While many of these approaches have initially been explored for visualizing bulk RNA-seq (Palmer et al., 2012; Simmons et al., 2015), methods that take into account the idiosyncrasies of scRNA-seq (e.g., dropout events where nonzero expression levels are missed as zero) have also been proposed (Pierson and Yau, 2015; Wang et al., 2017). Recently, more advanced approaches that visualize the cells while capturing important global structures such as cellular hierarchy or trajectory have been proposed (Anchang et al., 2016; Hutchison et al., 2017; Moon et al., 2017; Qiu et al., 2017), which constitute a valuable complementary approach to general-purpose methods such as t-SNE.

Comprehensively characterizing the landscape of single cells requires a large number of cells to be sequenced. Fortunately, advances in automatic cell isolation and multiplex sequencing have led to an exponential growth in the number of cells sequenced for individual studies (Svensson et al., 2018) (Figure 1A). For example, 10x Genomics recently made publicly available a dataset containing the expression profiles of 1.3 million brain cells from mice (https://support.10xgenomics.com/single-cell-gene-expression/datasets). However, the emergence of such mega-scale datasets poses new computational challenges before they can be widely adopted. Many of the existing computational methods for analyzing scRNA-seq data require prohibitive runtimes or computational resources; in particular, the state-of-the-art implementation of t-SNE (Van Der Maaten, 2014) requires 1.5 days to run on 1.3 million cells based on our estimates.

Figure 1. — (A) The exponential increase in the number of single cells sequenced by individual studies (adapted from Svensson et al., 2018). Note that the y axis scales exponentially.

(B) Retrospective analysis of redundancy in the Brain1m dataset (STAR Methods) with 2,000 initial cells and repeated doubling of the data size. For each batch added, we computed the distribution of the cells’ minimum Euclidean distance to cells already observed based on their gene expression. Each curve corresponds to a particular distance threshold for deeming the new cell redundant. The thresholds are chosen as the deciles of the overall distribution of minimum Euclidean distances.

Here, we introduce neural t-SNE (net-SNE), a scalable and generalizable method for visualizing single cells for scRNA-seq analysis. Taking inspiration from compressive genomics (Loh et al., 2012), we exploit the intuition that when a large number of cells are sequenced, a significant portion of the cells are redundant (i.e., highly similar to other cells). Taking advantage of the expressive power of neural networks (NNs), which has been demonstrated in numerous applications (LeCun et al., 2015), net-SNE trains an NN to learn a high-quality mapping function that takes an expression profile as input and outputs a low-dimensional embedding in 2D or 3D for visualization. Unlike t-SNE, the mapping function learned by net-SNE can be used to map previously unseen cells that were not included in the input data. This capability allows for novel workflows for single-cell studies, whereby newly observed cells are visualized in the context of existing datasets to gain additional insights.

To demonstrate visualization quality as well as scalability, we show that net-SNE learns visualizations that are similar to those of t-SNE on 14 scRNA-seq datasets of various cell types and data sizes up to 1.3 million cells. Next we focused on generalizability, demonstrating that net-SNE newly achieves the ability to map previously unseen cells; in particular, we show that net-SNE not only can identify subtypes of cells that were not included in the initial data but can also be used to bootstrap the visualization from a subset of data to achieve significantly better scalability to mega-scale datasets. Given the inherent redundancy in biological data (Yu et al., 2015), we expect our techniques for neural data visualization to accelerate and enhance other high-dimensional biological data analyses beyond visualization.

RESULTS

Increasing Redundancy in Single-Cell Datasets

We first set out to empirically assess the extent to which additional sequencing of single cells from the same biological source captures unforeseen expression patterns. Starting with 2,000 randomly chosen cells from the 10x Genomics scRNA-seq dataset with 1.3 million mouse neurons (STAR Methods), we repeatedly doubled the data size up to a million cells by sampling the remaining cells (without replacement) and measured how redundant the newly added cells are compared with the ones already observed (Figure 1B). As the scale of data grows, sequencing more cells exhibits a clear diminishing return in terms of capturing cells with unique expression patterns. For example, the majority of cells (53%) in the final half of the data can be considered redundant according to a certain distance threshold, which deems only 10% of the cells redundant in the initial batch. Nevertheless, a considerable fraction of the newly observed cells remains unique even at the scale of a million cells. For instance, 10% of the cells in the final half are as unique as the top 30% of the initial batch, which suggests that the push toward a higher cell count is indeed valuable for gaining access to relatively unexplored regions of the gene-expression landscape, albeit with decreasing effectiveness.
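The redundancy measurement described above can be sketched in a few lines. This is a minimal illustration on toy data, not the pipeline used in the study: the function names `min_dist_to_seen` and `redundant_fraction`, the two-gene toy profiles, and the threshold value are all assumptions made for the example.

```python
import math

def sq_dist(u, v):
    """Squared Euclidean distance between two expression profiles."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def min_dist_to_seen(new_cells, seen_cells):
    """For each new cell, the minimum Euclidean distance to any already-observed cell."""
    return [math.sqrt(min(sq_dist(c, s) for s in seen_cells)) for c in new_cells]

def redundant_fraction(new_cells, seen_cells, threshold):
    """Fraction of new cells lying within `threshold` of an already-observed cell."""
    dists = min_dist_to_seen(new_cells, seen_cells)
    return sum(d <= threshold for d in dists) / len(dists)

# Hypothetical two-gene profiles, for illustration only.
seen = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
batch = [(0.1, 0.0), (1.0, 0.1), (5.0, 5.0), (0.0, 0.9)]

# Three of the four new cells sit next to an observed cell; (5, 5) is novel.
frac = redundant_fraction(batch, seen, threshold=0.5)
```

The brute-force search costs one distance evaluation per (new cell, observed cell) pair, so at the scale of a million cells a nearest-neighbor index would likely be needed in practice.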

These results imply that, as researchers collectively accumulate scRNA-seq data for a particular biological system (e.g., tissue, organism, or microbial population) to the scale of millions of cells, a significant portion of newly sequenced cells will fall into the space already visited by existing data, where useful insights may be available from previous analyses. This observation motivates our development of net-SNE, which allows new data to be mapped onto an existing visualization to accelerate such knowledge transfer across different studies or experiments.

Overview of net-SNE

net-SNE achieves generalizability by training a feedforward neural network (LeCun et al., 2015) to learn a parameterized embedding function that takes a cell’s expression profile as input and outputs the coordinates in a low-dimensional space for visualization (STAR Methods). Given the wide success of t-SNE in single-cell biology (Amir et al., 2013), we aim to emulate the behavior of t-SNE while newly achieving the ability to map new cells, by training our neural network to optimize the same objective function as t-SNE. This objective function intuitively captures how faithfully the local structure among the input vectors (i.e., single-cell gene-expression profiles) is represented in the visualization. Although our parametric approach to t-SNE has been theoretically considered (Van Der Maaten, 2009), its application to real-world, large-scale datasets has been considerably limited due to the difficulties in successfully training a neural network to perform a complex task such as t-SNE. Our work employs new optimization techniques to improve the scalability of neural network training for t-SNE (STAR Methods) and newly demonstrates the effectiveness of this approach for single-cell analysis.

net-SNE Learns High-Quality Visualizations of Single Cells

To evaluate the ability of net-SNE to accurately model the visualization of single-cell datasets, we tested it on 13 existing scRNA-seq datasets of varying sizes with known clusters (STAR Methods). We found that for all of the datasets net-SNE is able to learn an embedding that closely matches the output of t-SNE (Figures 2A and S1).

Figure 2. — (A) Comparison of net-SNE and t-SNE visualizations on four largest benchmark datasets with known clusters. Colors indicate known cell types provided by the original work. net-SNE visualizations of the Klein and Zeisel datasets are reflected over a diagonal axis for comparison. Figures for the remaining datasets are provided in Figure S1.

(B) Quality of each visualization is quantified by the adjusted Rand index between the known labels and the output of k-means clustering based on the embedding. Each dot represents one of the 13 datasets analyzed. Results based on agglomerative clustering instead of k-means are provided in Figure S2, also showing a high concordance between net-SNE and t-SNE.

To systematically evaluate the quality of embeddings produced by net-SNE, we assessed the agreement between the known subtypes and clusters that are computationally identified based on the low-dimensional embeddings. The level of agreement was quantified by the adjusted Rand index (Rand, 1971), following previous work (Kiselev et al., 2017). We obtained the clusters by applying the standard k-means clustering algorithm (Hartigan and Wong, 1979) to the embeddings, where the known number of clusters was provided to the algorithm. As shown in Figure 2B, net-SNE achieves clustering accuracy that is comparable with t-SNE for all 13 datasets, which agrees with the visual concordance of the two methods. An analogous analysis we performed, based on agglomerative clustering instead of k-means, leads to similar concordance between net-SNE and t-SNE (Figure S2).
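For reference, the adjusted Rand index used in this evaluation can be computed directly from the contingency table of the two labelings. The sketch below is a plain standard-library illustration; the function name `adjusted_rand_index` and the toy label vectors are assumptions, and the k-means step that would produce the predicted labels is omitted.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same cells."""
    n = len(labels_true)
    cells = Counter(zip(labels_true, labels_pred))  # contingency table entries
    rows = Counter(labels_true)                     # known cluster sizes
    cols = Counter(labels_pred)                     # predicted cluster sizes
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())
    expected = sum_rows * sum_cols / comb(n, 2)     # agreement expected by chance
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:                       # degenerate: one cluster in both labelings
        return 1.0
    return (sum_cells - expected) / (max_index - expected)

known = [0, 0, 1, 1, 2, 2]
clustered = [1, 1, 0, 0, 2, 2]   # hypothetical k-means output; cluster ids are arbitrary
ari_match = adjusted_rand_index(known, clustered)

# A labeling that splits every known cluster scores below zero.
ari_split = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

Because the index depends only on which cells are grouped together, relabeling the clusters does not change the score, which is why it suits comparisons against arbitrary k-means cluster ids.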

Notably, we obtained all of these results using a relatively simple neural network with only two layers of nonlinearities with 50 units in each layer (Figure S1B). In additional experiments, not only did we observe that the net-SNE results are reasonably stable across a wide range of network architectures, we also found that the size of our network can be reduced to as low as 10 units per layer without significantly sacrificing the quality of visualizations (Figure S4), even for the PBMC68k dataset containing tens of thousands of cells. This finding suggests that the relationship between gene-expression profiles and the clustering pattern of cells may be simple enough to admit a concise characterization for various cell populations.

net-SNE Accurately Maps New Cells

To demonstrate the potential of net-SNE for translational analyses across different datasets, we performed a cross-validation experiment whereby an entire cluster of cells was removed from a dataset and placed onto the visualization after the fact. While the original t-SNE does not support the visualization of new data points, we considered as baseline a naive extension of t-SNE whereby the embedding of a new cell is determined as the average position (in the low-dimensional space) of the cell’s nearest neighbors in the initial data according to expression measurements (t-SNE + k-NN). An alternative extension of t-SNE, whereby the new cells are randomly initialized and optimized while the positions of the initial cells are held fixed, lacks scalability to mega-scale datasets in the same way as the original t-SNE and thus was not considered in our analysis.
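The t-SNE + k-NN baseline amounts to averaging the embedding coordinates of a new cell's nearest neighbors in expression space. A minimal sketch, assuming Euclidean distance and an unweighted mean (the function name `knn_project`, the value of k, and the toy data are illustrative choices, not taken from the study):

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two expression profiles."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def knn_project(new_x, train_x, train_y, k=3):
    """Place a new cell at the mean embedding of its k nearest training cells."""
    # Rank training cells by distance to the new cell in expression space.
    order = sorted(range(len(train_x)), key=lambda i: sq_dist(new_x, train_x[i]))
    nearest = order[:k]
    dims = len(train_y[0])
    return tuple(sum(train_y[i][d] for i in nearest) / len(nearest) for d in range(dims))

# Hypothetical expression profiles and their existing 2D embedding.
train_x = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0)]
train_y = [(0.0, 0.0), (0.0, 2.0), (2.0, 0.0), (9.0, 9.0)]

placed = knn_project((0.1, 0.1), train_x, train_y, k=3)  # close to the first three cells
novel = knn_project((5.0, 5.0), train_x, train_y, k=2)   # far from every training cell
```

In this toy case the cell at (5, 5) resembles none of the training cells, yet it is placed at (1.0, 1.0), inside the region occupied by an existing group. This is the kind of behavior that can hide a held-out cluster.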

Figure 3 shows our cross-validation results from the Klein dataset (Klein et al., 2015), which contains four known clusters, each of which was held out in a separate experiment. Remarkably, in three out of the four cases, the embedding learned by net-SNE accurately positioned the held-out cells as a distinct cluster, despite the fact that the training data did not contain any of the cells from this cluster. In contrast, our nearest-neighbor extension of t-SNE (t-SNE + k-NN) overlaid most of the new cells onto existing clusters and ended up incorrectly outputting an obfuscated map. Although visualizing the entire dataset from scratch tends to result in better quality scores than both of these approaches (Figure S4A), we note that the initial generalization obtained by either approach can be further optimized if desired.

The setting of this cross-validation analysis may arise in practice in cases where a rare subpopulation of cells was omitted from the initial dataset (e.g., due to small data size). Our results suggest that, while the naive nearest neighbor-based projection of newly observed cells (including the rare subtype) will likely render the new subtype invisible, net-SNE is still able to identify the new cluster, given that its gene expression is sufficiently distinct from that of existing cells. In an alternative setting where the new cells are from a subtype that is already represented in the initial dataset, net-SNE is still able to accurately assign the new cells to the correct cluster (Figure S4B).

To further demonstrate the utility of net-SNE’s generalization performance beyond cross-validation, we obtained six scRNA-seq datasets of different purified blood cell subtypes from 10x Genomics (STAR Methods). We projected each dataset onto a pre-trained, net-SNE visualization of the PBMC68k dataset of a whole blood sample, which includes all six subtypes. Despite the differences in sample preparation and the possibility of batch effects (Tung et al., 2017), net-SNE accurately positions the purified cell populations onto existing clusters, immediately providing a useful characterization for a number of clusters in the PBMC68k dataset (Figure 3B). Notably, net-SNE projected most of the purified CD34-positive cells onto a distinct region that contained only a small number of cells in the initial dataset. This observation is consistent with the low basal levels of CD34-positive cells in blood (Kikuchi-Taura et al., 2006) and further illustrates net-SNE’s ability to capture previously unseen subtypes of cells.

net-SNE Accelerates Visualization of Millions of Cells

After validating net-SNE’s ability to map new cells, we then asked whether this ability can be exploited to achieve fast visualization of mega-scale datasets. Drawing from the intuition that datasets of this scale can be accurately represented by a smaller subset of cells due to high redundancy, we first trained net-SNE on a subset of 100,000 cells from the Brain1m dataset containing 1.3 million cells and later applied the learned embedding function to the entire dataset. This fast approach took only around 20 min overall and resulted in a higher-quality map than the output of t-SNE with the default parameter settings, which took 13 hr to finish. Note that we use the Kullback-Leibler (KL) divergence objective score (the quantity minimized by both t-SNE and net-SNE) as the metric of quality, with a lower score indicating higher quality; this is more objective than a visual assessment. If a researcher already has access to a pre-trained mapping based on an existing dataset, the reduction in runtime achieved by net-SNE is likely to be even more drastic (e.g., from days for t-SNE to a few minutes).

It is worth noting that the original dataset provided by 10x Genomics also included a t-SNE embedding, which appeared higher in quality than the t-SNE output we obtained using the default setting (Figure 4E). While the objective score we computed based on the published embedding was superior to net-SNE’s initial generalization, 45 min of further optimization of net-SNE was sufficient to outperform this score (Figure 4C).

Figure 4. — The Brain1m dataset with 1.3 million cells is visualized using a novel bootstrap approach enabled by net-SNE, whereby the embedding learned on a subset of 100 K cells (A) was used to initialize training on the whole dataset. Our initial generalization to the full data instantly obtained (in minutes) a visualization (B) of higher quality, as measured by the KL divergence objective score minimized by both methods (STAR Methods), than that of t-SNE with default parameters achieved after 13 hr (D). While the t-SNE embedding provided in the original dataset by 10x Genomics (E) achieves a better objective, net-SNE outperforms this embedding with less than an hour of further training (C). Top row shows the heatmap for each embedding using a linear color map, where the highest value represented is chosen for each plot to achieve the best clarity. Bottom row shows contour plots of the same data. Lower objective score corresponds to better agreement between the gene-expression landscape and the visualization. *The visualization by 10x Genomics can be closely reproduced by increasing the number of iterations for t-SNE, and based on our experiments t-SNE required 1.5 days to achieve a solution with a comparable score.

After observing that we can achieve a visualization similar to the one provided by 10x Genomics by increasing the number of iterations for t-SNE (thus increasing runtime), we performed an experiment whereby we ran t-SNE until it reached an objective score matching the provided embedding. This resulted in a runtime estimate of 1.5 days for the published visualization, which is substantially longer than that required by net-SNE to achieve a superior quality (i.e., 20 min of pre-training and 45 min of further optimization). Notably, if the visualization by net-SNE was performed based on a well-characterized dataset, the map obtained by net-SNE would have the additional benefit of allowing researchers to immediately transfer insights from the existing dataset.

The visual difference between the net-SNE and t-SNE outputs in this experiment can be partially attributed to the fact that there are likely many locally optimal solutions to the same optimization problem solved by both methods. Although the clusters may appear more clearly separated in the t-SNE output, the fact that our visualization actually achieves a better objective score suggests the possibility of the abrupt boundaries in t-SNE being an artifact that is not warranted by the underlying data. Since net-SNE restricts the space of possible visualizations to those that can be modeled as a continuous function of gene expression as specified by our relatively simple neural network, it has a tendency to obtain a “smoother” visualization, which could potentially be a more accurate representation of the gene-expression landscape. Note that we obtained our results on the Brain1m dataset using a two-layer neural network with 50 hidden units per layer, as in our aforementioned analyses.

DISCUSSION

As we enter the age of mega-scale single-cell analysis, new computational methods that take advantage of the growing redundancy in scRNA-seq data are needed. To this end we have presented net-SNE, a visualization method that uses a neural network to learn a parametric embedding function that emulates t-SNE’s visualization while newly achieving the ability to map previously unseen cells. We have demonstrated that net-SNE not only learns high-quality maps comparable with those of t-SNE, but also gracefully generalizes to unseen cells, even when a whole subpopulation is missing from the initial dataset or when the new data come from a different sequencing experiment. Indeed, net-SNE’s ability to generalize allows researchers to exploit redundancy across different datasets by projecting the cells in one dataset onto another to facilitate transfer of knowledge. In addition, we have shown that using a pre-trained embedding from a subsampled (or an existing) dataset is an effective way of performing fast visualization of mega-scale scRNA-seq datasets. Our approach achieves significantly better scalability than t-SNE, which is on the verge of being impractical for datasets with more than a million cells. Although a number of recent studies introduced new techniques for improving the scalability of data visualization tools (Dzwinel and Wcisło, 2015; Tang et al., 2016), they do not address the lack of generalizability that net-SNE overcomes.

A number of recently proposed techniques for single-cell analysis can be used in conjunction with net-SNE to potentially further enhance its visualization quality; for instance, methods that account for dropout events (Wang et al., 2017) or batch effects (Haghverdi et al., 2017) can be employed to improve the input similarity matrix before applying net-SNE. Notably, Amodio et al. (2017), concurrently with this work, introduced an autoencoder-based approach for jointly performing batch effect correction and visualization, which finds a parametric embedding like net-SNE. Because they optimize a different objective function, however, the behavior of their embedding is fundamentally different from that of t-SNE (and thus net-SNE).

The fact that net-SNE obtains a visualization whereby the coordinates are directly modeled by the neural network as a function of gene expression opens up new directions for further research. In particular, one can investigate the parameters of the NNs trained by net-SNE for insights into what types of expression patterns are being utilized for t-SNE-like visualizations. Furthermore, while the characterization of conspicuous clusters in t-SNE output has typically been done by summarizing the expression of cells that belong to the cluster of interest, net-SNE enables a more direct and potentially more effective approach that analyzes the behavior of the embedding function.

A recently proposed idea of building a reference map of all human cell types based on high-throughput single-cell experiments (called the Human Cell Atlas [Regev et al., 2017]) is closely related to our vision. While embedding all cell types in a space with as few as two or three dimensions is unlikely to be successful given the complexity of the problem, we believe that insights from net-SNE and its future extensions may lead to an effective approach for learning compact vector space representations of all human cells that can be readily plugged into new and existing computational methods to further advance our understanding of biology.

STAR★METHODS

CONTACT FOR REAGENT AND RESOURCE SHARING

Bonnie Berger, bab@mit.edu, Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA.

METHOD DETAILS

Review of t-Stochastic Neighbor Embedding

Let $x_{1}, \dots, x_{n} \in R^{d}$ represent the (normalized) expression profiles of the n cells in an scRNA-seq dataset that we wish to visualize in $R^{s}$, where d is typically on the order of tens of thousands (number of human genes) and s is two or three. More precisely, we want to learn low-dimensional embeddings of the cells $y_{1}, \dots, y_{n} \in R^{s}$ that capture the low-dimensional structure represented by the original input vectors x₁,…,x_n.

A widely-used approach called t-stochastic neighbor embedding (t-SNE) (Maaten and Hinton, 2008) relates the notion of quality of an embedding y₁,…,y_n (inversely) to the Kullback-Leibler (KL) divergence between the two probability distributions P and Q defined over all pairs of cells, which reflect how the cells are laid out in the input and output (embedding) spaces, respectively. The probability assigned to a particular pair (i, j) in each distribution represents how close the two associated vectors are—i.e., (x_i, x_j) for P and (y_i, y_j) for Q. Intuitively, maximizing the agreement between P and Q corresponds to finding a good embedding that faithfully represents the structure in the original data.

Formally, t-SNE solves the following optimization problem

$$\underset{y_{1}, \dots, y_{n}}{\text{minimize}} \quad \mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}},$$

where

$$p_{ij} = \frac{p_{i \mid j} + p_{j \mid i}}{2} \quad \text{with} \quad p_{i \mid j} = \frac{\exp\left(-\|x_{i} - x_{j}\|_{2}^{2} / (2\sigma_{j}^{2})\right)}{\sum_{i' \neq j} \exp\left(-\|x_{i'} - x_{j}\|_{2}^{2} / (2\sigma_{j}^{2})\right)},$$

and

$$q_{ij} = \frac{\tilde{q}_{ij}}{Z} \quad \text{with} \quad \tilde{q}_{ij} = \left(1 + \|y_{i} - y_{j}\|_{2}^{2}\right)^{-1} \quad \text{and} \quad Z = \sum_{i \neq j} \tilde{q}_{ij}.$$

Note that p_ij and q_ij denote the (i, j) elements of matrices P and Q, respectively. In addition, σ_j is a parameter that is tuned for each j so that the conditional distribution p_i∣j (taken over i) attains a predefined value of information-theoretic entropy.
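To make these definitions concrete, the following sketch builds P and Q for a handful of toy points and evaluates the KL objective. It is an illustration under stated assumptions, not the reference implementation: the helper names, the bisection scheme used to tune σ_j, and the toy coordinates are all hypothetical. The symmetrized p_ij is divided by 2n here (as in the original t-SNE formulation) so that P sums to one over all ordered pairs.

```python
import math

def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def conditional_p(x, j, sigma):
    """p_{i|j} over all i != j: Gaussian kernel of bandwidth sigma centred on x[j]."""
    d2 = {i: sq_dist(x[i], x[j]) for i in range(len(x)) if i != j}
    d2_min = min(d2.values())  # shift for numerical stability; cancels on normalizing
    w = {i: math.exp(-(v - d2_min) / (2 * sigma ** 2)) for i, v in d2.items()}
    total = sum(w.values())
    return {i: v / total for i, v in w.items()}

def entropy(p):
    return -sum(v * math.log(v) for v in p.values() if v > 0)

def tune_sigma(x, j, target_entropy, lo=1e-3, hi=1e3, iters=60):
    """Bisection on a log scale; entropy is non-decreasing in sigma."""
    for _ in range(iters):
        mid = math.sqrt(lo * hi)
        if entropy(conditional_p(x, j, mid)) < target_entropy:
            lo = mid
        else:
            hi = mid
    return math.sqrt(lo * hi)

def joint_p(x, target_entropy):
    n = len(x)
    cond = [conditional_p(x, j, tune_sigma(x, j, target_entropy)) for j in range(n)]
    # cond[j][i] holds p_{i|j}; symmetrize and scale so that P sums to one.
    return {(i, j): (cond[j][i] + cond[i][j]) / (2 * n)
            for i in range(n) for j in range(n) if i != j}

def joint_q(y):
    n = len(y)
    qt = {(i, j): 1.0 / (1.0 + sq_dist(y[i], y[j]))
          for i in range(n) for j in range(n) if i != j}
    z = sum(qt.values())
    return {k: v / z for k, v in qt.items()}

def kl_divergence(p, q):
    return sum(v * math.log(v / q[k]) for k, v in p.items() if v > 0)

# Hypothetical inputs (x) and an arbitrary candidate embedding (y).
x = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.5), (5.0, 5.0), (6.0, 5.0)]
y = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (2.0, 2.0), (2.5, 2.0)]
target = math.log(2.5)  # entropy target, in nats
P = joint_p(x, target)
Q = joint_q(y)
kl = kl_divergence(P, Q)
```

A lower `kl` value indicates that pairwise proximities in the embedding agree more closely with those in the input space.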

t-SNE solves the above optimization problem via gradient descent on the embedding vectors y₁,…,y_n with random initialization. As derived in the original paper (Maaten and Hinton, 2008), the gradient of the objective with respect to each y_i is given as

$$\frac{\partial\, \mathrm{KL}(P \,\|\, Q)}{\partial y_{i}} = \sum_{j \neq i} p_{ij} \tilde{q}_{ij} (y_{i} - y_{j}) - \frac{1}{Z} \sum_{j \neq i} \tilde{q}_{ij}^{2} (y_{i} - y_{j}).$$

Computing this gradient for every cell i would require O(n²) computation, which is prohibitive for large n. In the state-of-the-art implementation of t-SNE (Van Der Maaten, 2014), this expression is approximated in two ways for computational efficiency. First, P is approximated with a sparse matrix based on k-nearest neighbors for each cell, which greatly speeds up the computation of the first term since most summands are zero. Second, an efficient data structure (space-partitioning trees; Samet, 1984) is built over y₁,…,y_n so that the second summation can be coarsely approximated by grouping terms corresponding to nearby y_i’s together. Even with these optimizations, applying t-SNE to datasets with millions of cells requires days of computation as shown in our results.
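The exact O(n²) form of the gradient can be written out and checked numerically on a toy problem. The sketch below is illustrative only: the function names and data are assumptions, and neither the sparse approximation of P nor the tree-based approximation is included. With a symmetric P that sums to one, a finite-difference derivative of the KL objective equals four times the two-term expression in the text; this constant factor only rescales the step size.

```python
import math

def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def q_tilde(y, i, j):
    return 1.0 / (1.0 + sq_dist(y[i], y[j]))

def normalizer(y):
    n = len(y)
    return sum(q_tilde(y, i, j) for i in range(n) for j in range(n) if i != j)

def kl_objective(p, y):
    z = normalizer(y)
    return sum(v * math.log(v * z / q_tilde(y, i, j)) for (i, j), v in p.items())

def tsne_gradient(p, y):
    """Attractive minus repulsive term for every y_i, as written in the text."""
    n, s, z = len(y), len(y[0]), normalizer(y)
    grads = []
    for i in range(n):
        g = [0.0] * s
        for j in range(n):
            if j != i:
                qt = q_tilde(y, i, j)
                coeff = p[(i, j)] * qt - qt * qt / z
                for d in range(s):
                    g[d] += coeff * (y[i][d] - y[j][d])
        grads.append(g)
    return grads

def numeric_partial(p, y, i, d, eps=1e-6):
    """Central finite difference of the KL objective with respect to y[i][d]."""
    y_hi = [list(v) for v in y]
    y_lo = [list(v) for v in y]
    y_hi[i][d] += eps
    y_lo[i][d] -= eps
    return (kl_objective(p, y_hi) - kl_objective(p, y_lo)) / (2 * eps)

# Hypothetical symmetric P over four cells, scaled to sum to one.
raw = {(0, 1): 3.0, (0, 2): 1.0, (0, 3): 1.0, (1, 2): 2.0, (1, 3): 1.0, (2, 3): 4.0}
total = 2 * sum(raw.values())
p = {}
for (i, j), w in raw.items():
    p[(i, j)] = p[(j, i)] = w / total

y = [[0.0, 0.0], [1.0, 0.5], [-0.5, 1.0], [2.0, 2.0]]
grads = tsne_gradient(p, y)
```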

Our Method: Neural t-SNE

We introduce neural t-SNE (net-SNE), which models each embedding vector y_i as the output of a parameterized embedding function evaluated at the corresponding input vector x_i. Importantly, our approach is generalizable—i.e., it induces the embedding of any point in the input space, not just the observed data points as in t-SNE. We use standard feedforward neural networks (NNs) (LeCun et al., 2015) to represent the embedding function, drawing from the intuition that NNs have sufficient expressive capacity to find high-quality maps similar to those typically uncovered by t-SNE.

The precise form of the parameterized mapping of net-SNE is as follows. Let ℓ be the number of hidden layers in the NN and u be the number of units in each layer (same for every layer). Furthermore, let $W^{(t)} \in R^{u \times u}$ ($R^{u \times d}$ for t = 1) be the weight matrix and $b^{(t)} \in R^{u}$ be the intercept associated with layers t = 1,…,ℓ. An additional weight matrix $W^{(\ell+1)} \in R^{s \times u}$ is associated with the final output layer. Given a data point x_i, the forward pass through the NN to compute the embedding y_i can be recursively described as

$$h_{i}^{(0)} = x_{i}, \qquad h_{i}^{(t)} = f\left(W^{(t)} h_{i}^{(t-1)} + b^{(t)}\right) \ \text{for } t = 1, \dots, \ell, \qquad y_{i} = W^{(\ell+1)} h_{i}^{(\ell)}.$$

Note that f denotes an element-wise nonlinear activation function (e.g., sigmoid or rectifier). In the following, we compactly represent the above NN-based embedding function as y_i = NN(x_i; ϴ), where ϴ refers to the network parameters W⁽¹⁾,…,W^(ℓ+1) and b⁽¹⁾,…,b^(ℓ). All of our experimental results are obtained uniformly based on a simple architecture with ℓ = 2, u = 50, and rectifier activation, as illustrated in Figure S1B.
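The forward pass can be traced by hand on a tiny network. The sketch below uses rectifier activations and ℓ = 2 hidden layers as in the text, but shrinks the layer width to u = 2 and the input to d = 3 so that the arithmetic is easy to follow; the weights, the input vector, and the function names are made up for illustration.

```python
def affine(W, h, b=None):
    """Compute W h (+ b) for a weight matrix stored as a list of rows."""
    out = [sum(w * a for w, a in zip(row, h)) for row in W]
    if b is not None:
        out = [o + bk for o, bk in zip(out, b)]
    return out

def relu(v):
    return [max(0.0, a) for a in v]

def embed(x, weights, biases):
    """Forward pass: hidden layers with rectifier activation, then a linear output layer."""
    h = list(x)
    for W, b in zip(weights[:-1], biases):
        h = relu(affine(W, h, b))
    return affine(weights[-1], h)  # the output layer has no intercept or nonlinearity

# Hypothetical parameters: d = 3 input genes, u = 2 hidden units, s = 2 output dimensions.
W1 = [[1.0, 0.0, -1.0], [0.0, 1.0, 1.0]]
b1 = [0.0, -1.0]
W2 = [[1.0, -1.0], [2.0, 1.0]]
b2 = [0.5, -4.0]
W3 = [[2.0, 0.0], [1.0, 3.0]]

x = [2.0, 1.0, 1.0]
y = embed(x, [W1, W2, W3], [b1, b2])
# h1 = relu([1, 1]) = [1, 1]; h2 = relu([0.5, -1]) = [0.5, 0]; y = [1.0, 0.5]
```

Because `embed` is an ordinary function of x, it can be evaluated on cells that were never part of training, which is the property that t-SNE's free-floating coordinates lack.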

Given an NN that defines an embedding for every point in the input space, net-SNE optimizes the same KL divergence objective as t-SNE over the observed data points, via gradient descent. To see how the gradients are computed in net-SNE, first note that for a particular network parameter θ ∈ ϴ we have

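$$\frac{\partial\, \mathrm{KL}(P \,\|\, Q)}{\partial \theta} = \sum_{i=1}^{n} \left(\frac{\partial\, \mathrm{KL}(P \,\|\, Q)}{\partial y_{i}}\right)^{T} \frac{\partial\, \mathrm{NN}(x_{i}; \Theta)}{\partial \theta}$$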

by the chain rule and using y_i = NN(x_i; ϴ). Notice that the first term in each product is identical to the t-SNE gradient and can be computed in the same manner. The second term can be computed via the standard backpropagation algorithm (LeCun et al., 2015).

Intuitively, here we keep most of the computation in t-SNE intact, but add a step to each iteration, where, after computing the gradients for y₁,…,y_n as in t-SNE, we propagate them backward through the NN to update the network weights accordingly. Consequently, net-SNE is compatible with any computational optimization for t-SNE. In particular, our implementation of net-SNE incorporates the state-of-the-art version of t-SNE based on the Barnes-Hut approximation (Van Der Maaten, 2014) to the gradients with respect to y₁,…,y_n, which achieves substantially faster runtime than vanilla t-SNE.
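The backward step can be illustrated for a single cell. In the sketch below, `g_y` stands in for the t-SNE gradient with respect to that cell's embedding, and `backward` pushes it through a small rectifier network to obtain that cell's contribution to each weight and intercept gradient; summing such contributions over cells gives the full parameter gradient. The network size, parameter values, and function names are assumptions chosen so the result can be verified by hand.

```python
def forward(x, Ws, bs):
    """Return the activations h^(0), ..., h^(l) and the output embedding y."""
    hs = [list(x)]
    for W, b in zip(Ws[:-1], bs):
        pre = [sum(w * a for w, a in zip(row, hs[-1])) + bk for row, bk in zip(W, b)]
        hs.append([max(0.0, v) for v in pre])  # rectifier activation
    y = [sum(w * a for w, a in zip(row, hs[-1])) for row in Ws[-1]]
    return hs, y

def backward(g_y, hs, Ws):
    """Propagate g_y (gradient w.r.t. y) back to gradients for every W and b."""
    g_Ws = [[[g * a for a in hs[-1]] for g in g_y]]  # linear output layer
    g_bs = []
    g_h = [sum(Ws[-1][k][m] * g_y[k] for k in range(len(g_y)))
           for m in range(len(hs[-1]))]
    for t in range(len(Ws) - 2, -1, -1):
        # Rectifier derivative: the gradient passes only where the unit was active.
        g_pre = [g if a > 0 else 0.0 for g, a in zip(g_h, hs[t + 1])]
        g_Ws.append([[g * a for a in hs[t]] for g in g_pre])
        g_bs.append(g_pre)
        g_h = [sum(Ws[t][k][m] * g_pre[k] for k in range(len(g_pre)))
               for m in range(len(hs[t]))]
    return g_Ws[::-1], g_bs[::-1]

# Hypothetical parameters: d = 3, u = 2, two hidden layers, s = 2.
W1 = [[1.0, 0.0, -1.0], [0.0, 1.0, 1.0]]
b1 = [0.0, -1.0]
W2 = [[1.0, -1.0], [2.0, 1.0]]
b2 = [0.5, -4.0]
W3 = [[2.0, 0.0], [1.0, 3.0]]
x = [2.0, 1.0, 1.0]

hs, y = forward(x, [W1, W2, W3], [b1, b2])
g_y = [1.0, 1.0]  # placeholder for the t-SNE gradient at this cell
g_Ws, g_bs = backward(g_y, hs, [W1, W2, W3])
```

Only the computation of `g_y` is specific to t-SNE; everything in `backward` is standard backpropagation, which is why any speed-up to the t-SNE gradient carries over unchanged.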

Although our method was independently developed, we note that a theoretical approach of training a parametric embedding (e.g., a neural network) for t-SNE via the chain rule has been previously described in an earlier work (Van Der Maaten, 2009). However, given the difficulty in successfully training a neural network to find good solutions to the t-SNE objective on large-scale datasets, practical adoption of this approach has been limited. In our work, we introduce additional techniques described in the following sections to improve the effectiveness of neural network-based visualization, while also demonstrating its utility for single cell analysis on a wide range of benchmark datasets.

Accelerating net-SNE via Stochastic Optimization

To fully exploit the generalizability of net-SNE, we improve upon the above procedure with techniques from stochastic optimization. First, note that the gradient for θ∈ϴ given in the previous section is a summation over all data points, and thus can be approximated with a randomly chosen subset $B \subset \{1, \dots, n\}$ (“mini-batch”) as

\frac{δ KL (P ‖ Q)}{δ θ} \approx \frac{n}{∣ B ∣} \sum_{i \in B} {(\frac{δ KL (P ‖ Q)}{δ y_{i}})}^{T} \frac{δ NN (x_{i}; Θ)}{δ θ} .

Similarly, the gradient with respect to y_i for $i \in B$ can be approximated as

\frac{δ KL (P ‖ Q)}{δ y_{i}} \approx \frac{n - 1}{∣ B ∣ - 1} \sum_{j \in B \land j \neq i} p_{i j} {\tilde{q}}_{i j} (y_{i} - y_{j}) - \frac{n - 1}{∣ B ∣ - 1} \cdot \frac{1}{\hat{Z}} \sum_{j \in B \land j \neq i} {\tilde{q}}_{i j}^{2} (y_{i} - y_{j})

using the approximate normalization factor

\hat{Z} = \frac{n}{∣ B ∣} \cdot \frac{n - 1}{∣ B ∣ - 1} \sum_{i, j \in B \land i \neq j} {\tilde{q}}_{i j} .

With small $∣ B ∣$ and precomputed P (which typically constitutes a small fraction of the runtime of t-SNE), this approach greatly reduces the required computation in each gradient step and improves the rate of convergence to a good visualization, as we demonstrate in our experiments. Our observation is in line with well-known results in the field of optimization (Bousquet and Bottou, 2008) showing superior runtimes of stochastic gradient descent (SGD) methods compared to their exact counterparts that process the entire dataset for every iteration. Notably, although only a small subset of cells is considered for each iteration, each of our gradient updates to ϴ affects all cells in the data (due to its generalizability). In contrast, applying a similar mini-batch SGD procedure to t-SNE results in an ineffective method, as only the positions of the cells in a given mini-batch are updated while the remaining cells are fixed. This reduces to a coordinate descent-like procedure, which we found to be very slow in terms of learning speed, likely due to the tight coupling of the parameters being optimized. Although net-SNE shares the model parameters across all cells and thus is less prone to this issue, we did notice difficulties in optimization with mini-batches that are too small. We found that setting $∣ B ∣$ to around 10% of the dataset is a reasonable compromise that leads to good performance.
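To make the mini-batch estimators concrete, the sketch below evaluates the approximate normalization factor and the approximate gradient with respect to y_i for the cells in a batch. It follows the two expressions above term by term (without any extra constant factor), and it assumes a dense NumPy matrix P for simplicity, whereas a practical implementation would use a sparse P. The function name and argument layout are hypothetical.

```python
import numpy as np

def minibatch_tsne_grad(P, Y, batch):
    """Estimate dKL/dy_i for i in `batch` using only pairs inside the batch."""
    n = P.shape[0]
    m = len(batch)
    Yb = Y[batch]
    Pb = P[np.ix_(batch, batch)]                      # p_ij restricted to the batch
    diff = Yb[:, None, :] - Yb[None, :, :]            # y_i - y_j for i, j in B
    q_tilde = 1.0 / (1.0 + np.sum(diff ** 2, axis=-1))
    np.fill_diagonal(q_tilde, 0.0)                    # exclude j == i

    # Z_hat rescales the within-batch sum to all n(n-1) ordered pairs.
    Z_hat = (n / m) * ((n - 1) / (m - 1)) * q_tilde.sum()

    scale = (n - 1) / (m - 1)                         # rescales a sum over j in B to all j
    attract = scale * np.einsum("ij,ijd->id", Pb * q_tilde, diff)
    repel = (scale / Z_hat) * np.einsum("ij,ijd->id", q_tilde ** 2, diff)
    return attract - repel, Z_hat
```

When the batch contains every cell, both scale factors equal 1 and the estimate coincides with the full-data expression, which is a convenient sanity check.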

In addition, the sampling strategy for $B$ has a considerable effect on the quality of approximation for the gradients. Specifically, given a sparse approximation of P, the first summation in our equation for δKL(P ∥ Q)/δy_i has only a few nonzero summands. If $B$ is sampled uniformly, then only a few indices will contribute to the sum, leading to high variance in the estimate. We address this problem by introducing additional structure into $B$. In particular, we first sample a smaller set of seed cells $S \subset \{1, \dots, n\}$ uniformly at random and then, for each seed cell i, sample a fixed number of its “neighbors” (cells j with p_ij > 0). After these local samples are used to facilitate the approximation of t-SNE gradients for the seed cells, these gradients are backpropagated through the neural network to update the embedding parameters of net-SNE. Because our estimated normalization factor $\hat{Z}$ appears in the denominator of the approximate t-SNE gradient, our gradient estimate based on $\hat{Z}$ is biased. To control the amount of error introduced, we impose a minimum threshold (10%) on the fraction of total samples used to approximate $\hat{Z}$. If the set of seed cells is too small, additional samples are drawn to ensure $\hat{Z}$ is of sufficient quality.
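One possible reading of this sampling scheme is sketched below. The neighbor lists are assumed to be precomputed from the sparse P (indices j with p_ij > 0), and the top-up step that draws extra uniform samples until the batch covers the minimum fraction is our interpretation of the threshold described above, not a confirmed detail of the net-SNE code. The function name and parameters are hypothetical.

```python
import numpy as np

def sample_batch(neighbors, n_seeds, n_per_seed, min_frac=0.10, rng=None):
    """Sample seed cells uniformly, add some of their neighbors, then top up if needed.

    neighbors[i] is a sequence of indices j with p_ij > 0.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = len(neighbors)
    seeds = rng.choice(n, size=n_seeds, replace=False)
    batch = {int(s) for s in seeds}

    for s in seeds:
        nbrs = np.asarray(neighbors[s])
        k = min(n_per_seed, len(nbrs))
        if k > 0:
            batch.update(int(j) for j in rng.choice(nbrs, size=k, replace=False))

    # Assumed top-up rule: keep at least min_frac of all cells for estimating Z_hat.
    target = int(np.ceil(min_frac * n))
    if len(batch) < target:
        rest = np.setdiff1d(np.arange(n), np.fromiter(batch, dtype=int))
        extra = rng.choice(rest, size=target - len(batch), replace=False)
        batch.update(int(j) for j in extra)

    return seeds, np.array(sorted(batch))
```

The returned seed indices identify the cells whose gradients are estimated, while the full batch supplies the pairs used for the attractive sum and for the normalization estimate.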

Training net-SNE with Reference Visualization

Even with our stochastic optimization techniques, training a neural network to optimize the t-SNE objective is a challenging task, especially for large-scale datasets with complex patterns. We thus introduce another technique, where a pre-trained t-SNE embedding is used to provide more direct feedback to the neural network instead of relying on the gradients of the highly complex t-SNE objective. More precisely, mean-squared error between the net-SNE embedding and an existing t-SNE map is used as the loss function to optimize the network in order to obtain a good initial solution, which can be further optimized if needed. Although this does require t-SNE to be performed before applying net-SNE, the initial t-SNE map need not be fully optimized, as further SGD iterations in net-SNE can fine-tune the solution.
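For illustration, the reference-based loss and its gradient with respect to the network outputs might look like the sketch below. Averaging the squared error over cells is an assumed convention (the text does not specify the normalization), and the function name is hypothetical.

```python
import numpy as np

def reference_loss_and_grad(Y, Y_ref):
    """Mean-squared error between embeddings Y and a reference t-SNE map Y_ref (n x d)."""
    diff = Y - Y_ref
    loss = np.mean(np.sum(diff ** 2, axis=1))   # averaged over cells (assumed convention)
    grad_Y = 2.0 * diff / Y.shape[0]            # d(loss)/dY, same shape as Y
    return loss, grad_Y
```

In training, `grad_Y` would take the place of the t-SNE gradient with respect to y₁,…,y_n and be backpropagated through the network in the same way.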

Note that all of our datasets other than PBMC68k and Brain1m are small enough that we were able to train net-SNE in batch mode without the stochastic optimization and the use of a reference visualization. On the other hand, our results on PBMC68k and the 100k subset of Brain1m were obtained by training net-SNE to match a t-SNE reference using mini-batch SGD with a batch size of 10%. We did not further fine-tune net-SNE after training on the reference visualization, as the resulting visualization quality was sufficiently high. After generalizing the 100k cell visualization to the full Brain1m dataset, further training of net-SNE was performed without a t-SNE map (since t-SNE becomes impractical at this scale); instead, we used our stochastic optimization with a batch size of 10%. Although our generalization based on a smaller subset is the primary factor in achieving fast runtime for the Brain1m visualization, our stochastic optimization techniques also lead to significant runtime reductions. For instance, with a batch size of 10%, the average runtime of each iteration of net-SNE is reduced by around a factor of 10 as a result of our techniques.

Benchmark Datasets

For our main experiments, we used 13 published scRNA-seq datasets of varying sizes with known cluster labels for the cells, which allowed us to directly assess the quality of the visualization produced by net-SNE. The datasets, sorted in increasing order of size, are: (Biase et al., 2014) (n = 49, k = 3), (Treutlein et al., 2014) (n = 80, k = 5), (Goolam et al., 2016) (n = 124, k = 5), (Ting et al., 2014) (n = 149, k = 7), (Buettner et al., 2015) (n = 182, k = 3), (Deng et al., 2014) (n = 268, k = 10), (Pollen et al., 2014) (n = 301, k = 11), (Patel et al., 2014) (n = 430, k = 5), (Usoskin et al., 2015) (n = 622, k = 4), (Kolodziejczyk et al., 2015) (n = 704, k = 3), (Klein et al., 2015) (n = 2,717, k = 4), (Zeisel et al., 2015) (n = 3,005, k = 9), and PBMC68k (Zheng et al., 2017) (n = 68,560, k = 10). Here n denotes the number of cells and k denotes the number of clusters defined by the original publications.

We additionally used a mega-scale dataset Brain1m (n = 1,283,543), downloaded from the 10x Genomics website (https://support.10xgenomics.com/single-cell-gene-expression/datasets), to assess the generalizability and scalability of net-SNE. However, this dataset does not have any known labels, and thus we resorted to using the objective score optimized by both net-SNE and t-SNE as the quality metric for comparison.

For demonstrating the generalization of net-SNE across different datasets, we used the datasets of six blood cell subtypes (CD4+, CD8+, CD14+, CD19+, CD34+, and CD56+) experimentally purified via fluorescence activated cell sorting (FACS), which are provided by 10x Genomics and described in Zheng et al. (2017). These datasets were downloaded from the same URL as the Brain1m dataset given above.

Our benchmark datasets measured gene expression via a number of different metrics, including read, fragment, transcript, or unique molecule (UMI) counts that are either normalized or unnormalized for gene length (Table S1). We kept the metric chosen by the original publication for each dataset with the goal of demonstrating the performance of net-SNE in a variety of settings. Unless already performed by the original publication, we applied the standard log(1 + x) transformation to each element x of the cell-gene expression matrix before visualizing the data.
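The transformation itself is a one-liner; a minimal sketch is shown below. The natural logarithm is assumed here because the base is not stated, and the function name is hypothetical.

```python
import numpy as np

def log_transform(counts):
    """Apply log(1 + x) element-wise to a cell-by-gene expression matrix."""
    return np.log1p(np.asarray(counts, dtype=float))   # natural log (assumed base)
```

Using `np.log1p` rather than `np.log(1 + x)` keeps zeros at exactly zero and is numerically safer for very small values.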

DATA AND SOFTWARE AVAILABILITY

A C++ implementation of net-SNE with example data and scripts is available at http://netsne.csail.mit.edu and https://github.com/hhcho/netsne. The accession numbers for all scRNA-seq datasets analyzed in this paper are provided in the Key Resources Table. Preprocessed gene-expression matrices are available from the authors upon request.

KEY RESOURCES TABLE

REAGENT or RESOURCE	SOURCE	IDENTIFIER
Deposited Data
Benchmark scRNA-seq data (Biase)	Biase et al., 2014	GEO: GSE57249
Benchmark scRNA-seq data (Treutlein)	Treutlein et al., 2014	Supplementary Data 3
Benchmark scRNA-seq data (Goolam)	Goolam et al., 2016	ArrayExpress: E-MTAB-3321
Benchmark scRNA-seq data (Ting)	Ting et al., 2014	GEO: GSE51372
Benchmark scRNA-seq data (Buettner)	Buettner et al., 2015	ArrayExpress: E-MTAB-2805
Benchmark scRNA-seq data (Deng)	Deng et al., 2014	GEO: GSE45719
Benchmark scRNA-seq data (Pollen)	Pollen et al., 2014	SRA: SRP041736
Benchmark scRNA-seq data (Patel)	Patel et al., 2014	GEO: GSE57872
Benchmark scRNA-seq data (Usoskin)	Usoskin et al., 2015	GEO: GSE59739
Benchmark scRNA-seq data (Kolodziejczyk)	Kolodziejczyk et al., 2015	ArrayExpress: E-MTAB-2600
Benchmark scRNA-seq data (Klein)	Klein et al., 2015	GEO: GSE65525
Benchmark scRNA-seq data (Zeisel)	Zeisel et al., 2015	GEO: GSE60361
Benchmark scRNA-seq data (PBMC68k)	Zheng et al., 2017	https://support.10xgenomics.com/single-cell-gene-expression/datasets
Benchmark scRNA-seq data (Brain1m)	10x Genomics	https://support.10xgenomics.com/single-cell-gene-expression/datasets
Benchmark scRNA-seq data (FACS-purified blood cells)	Zheng et al., 2017	https://support.10xgenomics.com/single-cell-gene-expression/datasets
Software and Algorithms
Barnes-Hut t-SNE	Van Der Maaten, 2014	https://github.com/lvdmaaten/bhtsne
Scikit-learn (k-means and agglomerative clustering)	Pedregosa et al., 2011	http://scikit-learn.org/stable/; RRID:SCR_002577
Other
Software package for net-SNE	This paper	http://netsne.csail.mit.edu

Supplementary Material

Supplemental Information

NIHMS1021959-supplement-Supplemental_Information.pdf (1.4 MB, PDF)

Highlights.

We train a neural network to visualize single-cell RNA-sequencing datasets
Our method can map new cells onto existing visualizations to allow knowledge transfer
Our method efficiently visualizes millions of cells via a bootstrap procedure

ACKNOWLEDGMENTS

H.C. and B.B. are partially supported by the US NIH grant R01GM081871 (to B.B.). H.C. is also partially supported by the Kwanjeong Educational Foundation. J.P. is supported by the Sloan Research Fellowship and the US National Science Foundation Career Award 1652815. We thank Serafim Batzoglou and Bo Wang for providing the preprocessed data for Pollen and Kolodziejczyk datasets. Editor’s note: An early version of this paper was submitted to and peer reviewed at the 2018 Annual International Conference on Research in Computational Molecular Biology (RECOMB). The manuscript was revised and then independently further reviewed at Cell Systems.

Footnotes

SUPPLEMENTAL INFORMATION

Supplemental Information includes four figures and one table and can be found with this article online at https://doi.org/10.1016/j.cels.2018.05.017.

DECLARATION OF INTERESTS

The authors declare no conflicting interests.

REFERENCES

Amir el-A.D., Davis KL, Tadmor MD, Simonds EF, Levine JH, Bendall SC, Shenfeld DK, Krishnaswamy S, Nolan GP, and Pe’er D (2013). viSNE enables visualization of high dimensional single-cell data and reveals phenotypic heterogeneity of leukemia. Nat. Biotechnol. 31, 545.
Amodio M, Srinivasan K, van Dijk D, Moshen H, Yim K, Muhle R, Moon KR, Kaech S, Sowell R, Montgomery R, et al. (2017). Exploring single-cell data with deep multitasking neural networks. bioRxiv. 10.1101/237065.
Anchang B, Hart TDP, Bendall SC, Qiu P, Bjornson Z, Linderman M, Nolan GP, and Plevritis SK (2016). Visualization and cellular hierarchy inference of single-cell data using SPADE. Nat. Protoc. 11, 1264–1279.
Biase FH, Cao X, and Zhong S (2014). Cell fate inclination within 2-cell and 4-cell mouse embryos revealed by single-cell RNA sequencing. Genome Res. 24, 1787–1796.
Bousquet O, and Bottou L (2008). The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems 21, Koller D, Schuurmans D, Bengio Y, and Bottou L, eds. (NIPS), pp. 161–168.
Buettner F, Natarajan KN, Casale FP, Proserpio V, Scialdone A, Theis FJ, Teichmann SA, Marioni JC, and Stegle O (2015). Computational analysis of cell-to-cell heterogeneity in single-cell RNA-sequencing data reveals hidden subpopulations of cells. Nat. Biotechnol. 33, 155–160.
Deng Q, Ramsköld D, Reinius B, and Sandberg R (2014). Single-cell RNA-seq reveals dynamic, random monoallelic gene expression in mammalian cells. Science 343, 193–196.
Dzwinel W, and Wcisło R (2015). Very fast interactive visualization of large sets of high-dimensional data. Procedia Comput. Sci. 51, 572–581.
Gawad C, Koh W, and Quake SR (2016). Single-cell genome sequencing: current state of the science. Nat. Rev. Genet. 17, 175–188.
Goolam M, Scialdone A, Graham SJ, Macaulay IC, Jedrusik A, Hupalowska A, Voet T, Marioni JC, and Zernicka-Goetz M (2016). Heterogeneity in Oct4 and Sox2 targets biases cell fate in 4-cell mouse embryos. Cell 165, 61–74.
Grün D, Lyubimova A, Kester L, Wiebrands K, Basak O, Sasaki N, Clevers H, and van Oudenaarden A (2015). Single-cell messenger RNA sequencing reveals rare intestinal cell types. Nature 525, 251–255.
Haghverdi L, Lun ATL, Morgan MD, and Marioni JC (2017). Correcting batch effects in single-cell RNA sequencing data by matching mutual nearest neighbours. bioRxiv. 10.1101/165118.
Hartigan JA, and Wong MA (1979). Algorithm AS 136: a k-means clustering algorithm. J. R. Stat. Soc. Ser. C Appl. Stat. 28, 100–108.
Hutchison LAD, Berger B, and Kohane I (2017). C. elegans exhibits coordinated oscillation in gene expression during development. bioRxiv. 10.1101/114074.
Jackson JE (2005). A User’s Guide to Principal Components (John Wiley & Sons).
Jaitin DA, Kenigsberg E, Keren-Shaul H, Elefant N, Paul F, Zaretsky I, Mildner A, Cohen N, Jung S, Tanay A, et al. (2014). Massively parallel single-cell RNA-seq for marker-free decomposition of tissues into cell types. Science 343, 776–779.
Kikuchi-Taura A, Soma T, Matsuyama T, Stern DM, and Taguchi A (2006). A new protocol for quantifying CD34⁺ cells in peripheral blood of patients with cardiovascular disease. Tex. Heart Inst. J. 33, 427.
Kiselev VY, Kirschner K, Schaub MT, Andrews T, Yiu A, Chandra T, Natarajan KN, Reik W, Barahona M, Green AR, et al. (2017). SC3: consensus clustering of single-cell RNA-seq data. Nat. Methods 14, 483–486.
Klein AM, Mazutis L, Akartuna I, Tallapragada N, Veres A, Li V, Peshkin L, Weitz DA, and Kirschner MW (2015). Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells. Cell 161, 1187–1201.
Kolodziejczyk AA, Kim JK, Tsang JC, Ilicic T, Henriksson J, Natarajan KN, Tuck AC, Gao X, Bühler M, Liu P, et al. (2015). Single cell RNA-sequencing of pluripotent states unlocks modular transcriptional variation. Cell Stem Cell 17, 471–485.
LeCun Y, Bengio Y, and Hinton G (2015). Deep learning. Nature 521, 436–444.
Loh P-R, Baym M, and Berger B (2012). Compressive genomics. Nat. Biotechnol. 30, 627–630.
Maaten LVD, and Hinton G (2008). Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605.
Moon KR, van Dijk D, Wang Z, Burkhardt D, Chen W, van den Elzen A, Hirn MJ, Coifman RR, Ivanova NB, Wolf G, and Krishnaswamy S (2017). Visualizing transitions and structure for high dimensional data exploration. bioRxiv. 10.1101/120378.
Palmer NP, Schmid PR, Berger B, and Kohane IS (2012). A gene expression profile of stem cell pluripotentiality and differentiation is conserved across diverse solid and hematopoietic cancers. Genome Biol. 13, R71.
Patel AP, Tirosh I, Trombetta JJ, Shalek AK, Gillespie SM, Wakimoto H, Cahill DP, Nahed BV, Curry WT, Martuza RL, et al. (2014). Single-cell RNA-seq highlights intratumoral heterogeneity in primary glioblastoma. Science 344, 1396–1401.
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, et al. (2011). Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830.
Pierson E, and Yau C (2015). ZIFA: dimensionality reduction for zero-inflated single-cell gene expression analysis. Genome Biol. 16, 241.
Pollen AA, Nowakowski TJ, Shuga J, Wang X, Leyrat AA, Lui JH, Li N, Szpankowski L, Fowler B, Chen P, et al. (2014). Low-coverage single-cell mRNA sequencing reveals cellular heterogeneity and activated signaling pathways in developing cerebral cortex. Nat. Biotechnol. 32, 1053–1058.
Qiu X, Mao Q, Tang Y, Wang L, Chawla R, Pliner HA, and Trapnell C (2017). Reversed graph embedding resolves complex single-cell trajectories. Nat. Methods 14, 979–982.
Rand WM (1971). Objective criteria for the evaluation of clustering methods. J. Am. Stat. Assoc. 66, 846–850.
Regev A, Teichmann S, Lander ES, Amit I, Benoist C, Birney E, Bodenmiller B, Campbell P, Carninci P, Clatworthy M, et al. (2017). The human cell atlas. eLife 6, 10.7554/eLife.27041.
Samet H (1984). The quadtree and related hierarchical data structures. ACM Comput. Surv. 16, 187–260.
Simmons S, Peng J, Bienkowska J, and Berger B (2015). Discovering what dimensionality reduction really tells us about RNA-seq data. J. Comput. Biol. 22, 715–728.
Stubbington MJ, Rozenblatt-Rosen O, Regev A, and Teichmann SA (2017). Single-cell transcriptomics to explore the immune system in health and disease. Science 358, 58–63.
Svensson V, Vento-Tormo R, and Teichmann SA (2018). Exponential scaling of single-cell RNA-seq in the past decade. Nat. Protoc. 13, 599–604.
Tang J, Liu J, Zhang M, and Mei Q (2016). Visualizing large-scale and high-dimensional data. Proceedings of the 25th International Conference on World Wide Web, 287–297. 10.1145/2872427.2883041.
Ting DT, Wittner BS, Ligorio M, Jordan NV, Shah AM, Miyamoto DT, Aceto N, Bersani F, Brannigan BW, Xega K, et al. (2014). Single-cell RNA sequencing identifies extracellular matrix gene expression by pancreatic circulating tumor cells. Cell Rep. 8, 1905–1918.
Treutlein B, Brownfield DG, Wu AR, Neff NF, Mantalas GL, Espinoza FH, Desai TJ, Krasnow MA, and Quake SR (2014). Reconstructing lineage hierarchies of the distal lung epithelium using single-cell RNA-seq. Nature 509, 371–375.
Tung P-Y, Blischak JD, Hsiao CJ, Knowles DA, Burnett JE, Pritchard JK, and Gilad Y (2017). Batch effects and the effective design of single-cell gene expression studies. Sci. Rep. 7, 39921.
Usoskin D, Furlan A, Islam S, Abdo H, Lönnerberg P, Lou D, Hjerling-Leffler J, Haeggström J, Kharchenko O, Kharchenko PV, et al. (2015). Unbiased classification of sensory neuron types by large-scale single-cell RNA sequencing. Nat. Neurosci. 18, 145–153.
Van Der Maaten L (2014). Accelerating t-SNE using tree-based algorithms. J. Mach. Learn. Res. 15, 3221–3245.
Van Der Maaten L (2009). Learning a parametric embedding by preserving local structure. RBM 500, 26.
Wang B, Zhu J, Pierson E, Ramazzotti D, and Batzoglou S (2017). Visualization and analysis of single-cell RNA-seq data by kernel-based similarity learning. Nat. Methods 14, 414–416.
Wang Y, Waters J, Leung ML, Unruh A, Roh W, Shi X, Chen K, Scheet P, Vattathil S, Liang H, et al. (2014). Clonal evolution in breast cancer revealed by single nucleus genome sequencing. Nature 512, 155–160.
Yoon HS, Price DC, Stepanauskas R, Rajah VD, Sieracki ME, Wilson WH, Yang EC, Duffy S, and Bhattacharya D (2011). Single-cell genomics reveals organismal interactions in uncultivated marine protists. Science 332, 714–717.
Yu YW, Daniels NM, Danko DC, and Berger B (2015). Entropy-scaling search of massive biological data. Cell Syst. 1, 130–140.
Zeisel A, Muñoz-Manchado AB, Codeluppi S, Lönnerberg P, La Manno G, Juréus A, Marques S, Munguba H, He L, Betsholtz C, et al. (2015). Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq. Science 347, 1138–1142.
Zheng GX, Terry JM, Belgrader P, Ryvkin P, Bent ZW, Wilson R, Ziraldo SB, Wheeler TD, McDermott GP, Zhu J, et al. (2017). Massively parallel digital transcriptional profiling of single cells. Nat. Commun. 8, 14049.

[R1] Amir el-A.D., Davis KL, Tadmor MD, Simonds EF, Levine JH, Bendall SC, Shenfeld DK, Krishnaswamy S, Nolan GP, and Pe’er D (2013). viSNE enables visualization of high dimensional single-cell data and reveals phenotypic heterogeneity of leukemia. Nat. Biotechnol 31, 545. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R2] Amodio M, Srinivasan K, van Dijk D, Moshen H, Yim K, Muhle R, Moon KR, Kaech S, Sowell R, Montgomery R, et al. (2017). Exploring single-cell data with deep multitasking neural networks. bioRxiv. 10.1101/237065. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R3] Anchang B, Hart TDP, Bendall SC, Qiu P, Bjornson Z, Linderman M, Nolan GP, and Plevritis SK (2016). Visualization and cellular hierarchy inference of single-cell data using SPADE. Nat. Protoc 11, 1264–1279. [DOI] [PubMed] [Google Scholar]

[R4] Biase FH, Cao X, and Zhong S (2014). Cell fate inclination within 2-cell and 4-cell mouse embryos revealed by single-cell RNA sequencing. Genome Res. 24, 1787–1796. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R5] Bousquet O, and Bottou L (2008). The tradeoffs of large scale learning In Advances in Neural Information Processing Systems 21, Koller D, Schuurmans D, Bengio Y, and Bottou L, eds. (NIPS; ), pp. 161–168. [Google Scholar]

[R6] Buettner F, Natarajan KN, Casale FP, Proserpio V, Scialdone A, Theis FJ, Teichmann SA, Marioni JC, and Stegle O (2015). Computational analysis of cell-to-cell heterogeneity in single-cell RNA- sequencing data reveals hidden subpopulations of cells. Nat. Biotechnol 33, 155–160. [DOI] [PubMed] [Google Scholar]

[R7] Deng Q, Ramsköld D, Reinius B, and Sandberg R (2014). Single-cell RNA-seq reveals dynamic, random monoallelic gene expression in mammalian cells. Science 343, 193–196. [DOI] [PubMed] [Google Scholar]

[R8] Dzwinel W, and Wcisło R (2015). Very fast interactive visualization of large sets of high-dimensional data. Procedia Comput. Sci. 51, 572–581. [Google Scholar]

[R9] Gawad C, Koh W, and Quake SR (2016). Single-cell genome sequencing: current state of the science. Nat. Rev. Genet 17, 175–188. [DOI] [PubMed] [Google Scholar]

[R10] Goolam M, Scialdone A, Graham SJ, Macaulay IC, Jedrusik A, Hupalowska A, Voet T, Marioni JC, and Zernicka-Goetz M (2016). Heterogeneity in Oct4 and Sox2 targets biases cell fate in 4-cell mouse embryos. Cell 165, 61–74. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R11] Grün D, Lyubimova A, Kester L, Wiebrands K, Basak O, Sasaki N, Clevers H, and van Oudenaarden A (2015). Single-cell messenger RNA sequencing reveals rare intestinal cell types. Nature 525, 251–255. [DOI] [PubMed] [Google Scholar]

[R12] Haghverdi L, Lun ATL, Morgan MD, and Marioni JC (2017). Correcting batch effects in single-cell RNA sequencing data by matching mutual nearest neighbours. bioRxiv. 10.1101/165118. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R13] Hartigan JA, and Wong MA (1979). Algorithm AS 136: a k-means clustering algorithm. J. R. Stat. Soc. Ser. C Appl. Stat 28, 100–108. [Google Scholar]

[R14] Hutchison LAD, Berger B, and Kohane I (2017). C. elegans exhibits coordinated oscillation in gene expression during development. bioRxiv. 10.1101/114074. [DOI] [Google Scholar]

[R15] Jackson JE (2005). A User’s Guide to Principal Components (John Wiley & Sons; ). [Google Scholar]

[R16] Jaitin DA, Kenigsberg E, Keren-Shaul H, Elefant N, Paul F, Zaretsky I, Mildner A, Cohen N, Jung S, Tanay A, et al. (2014). Massively parallel single-cell RNA-seq for marker-free decomposition of tissues into cell types. Science 343, 776–779. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R17] Kikuchi-Taura A, Soma T, Matsuyama T, Stern DM, and Taguchi A (2006). A new protocol for quantifying CD34⁺ cells in peripheral blood of patients with cardiovascular disease. Tex. Heart Inst. J. 33, 427. [PMC free article] [PubMed] [Google Scholar]

[R18] Kiselev VY, Kirschner K, Schaub MT, Andrews T, Yiu A, Chandra T, Natarajan KN, Reik W, Barahona M, Green AR, et al. (2017). SC3: consensus clustering ofsingle-cell RNA-seq data. Nat. Methods 14,483–486. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R19] Klein AM, Mazutis L, Akartuna I, Tallapragada N, Veres A, Li V, Peshkin L, Weitz DA, and Kirschner MW (2015). Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells. Cell 161,1187–1201. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R20] Kolodziejczyk AA, Kim JK, Tsang JC, Ilicic T, Henriksson J, Natarajan KN, Tuck AC, Gao X, Bühler M, Liu P, et al. (2015). Single cell RNA-sequencing of pluripotent states unlocks modular transcriptional variation. Cell Stem Cell 17, 471–485. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R21] LeCun Y, Bengio Y, and Hinton G (2015). Deep learning. Nature 521, 436–444. [DOI] [PubMed] [Google Scholar]

[R22] Loh P-R, Baym M, and Berger B (2012). Compressive genomics. Nat. Biotechnol 30, 627–630. [DOI] [PubMed] [Google Scholar]

[R23] Maaten LVD, and Hinton G (2008). Visualizing data using t-SNE. J. Mach. Learn. Res 9, 2579–2605. [Google Scholar]

[R24] Moon KR, van Dijk D, Wang Z, Burkhardt D, Chen W, van den Elzen A, Hirn MJ, Coifman RR, Ivanova NB, Wolf G, and Krishnaswamy S (2017). Visualizing transitions and structure for high dimensional data exploration. bioRxiv. 10.1101/120378. [DOI] [Google Scholar]

[R25] Palmer NP, Schmid PR, Berger B, and Kohane IS (2012). A gene expression profile of stem cell pluripotentiality and differentiation is conserved across diverse solid and hematopoietic cancers. Genome Biol. 13, R71. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R26] Patel AP, Tirosh I, Trombetta JJ, Shalek AK, Gillespie SM, Wakimoto H, Cahill DP, Nahed BV, Curry WT, Martuza RL, et al. (2014). Single-cell RNA-seq highlights intratumoral heterogeneity in primary glioblastoma. Science 344, 1396–1401. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R27] Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, et al. (2011). Scikit-learn: machine learning in Python. J. Mach. Learn. Res 12, 2825–2830. [Google Scholar]

[R28] Pierson E, and Yau C (2015). ZIFA: dimensionality reduction for zero-inflated single-cell gene expression analysis. Genome Biol. 16, 241. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R29] Pollen AA, Nowakowski TJ, Shuga J, Wang X, Leyrat AA, Lui JH, Li N, Szpankowski L, Fowler B, Chen P, et al. (2014). Low-coverage single-cell mRNA sequencing reveals cellular heterogeneity and activated signaling pathways in developing cerebral cortex. Nat. Biotechnol. 32, 1053–1058. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R30] Qiu X, Mao Q, Tang Y, Wang L, Chawla R, Pliner HA, and Trapnell C (2017). Reversed graph embedding resolves complex single-cell trajectories. Nat. Methods 14, 979–982. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R31] Rand WM (1971). Objective criteria for the evaluation of clustering methods. J. Am. Stat. Assoc 66, 846–850. [Google Scholar]

[R32] Regev A, Teichmann S, Lander ES, Amit I, Benoist C, Birney E, Bodenmiller B, Campbell P, Carninci P, Clatworthy M, et al. (2017). The human cell atlas. Elife 6, 10.7554/eLife.27041. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R33] Samet H (1984). The quadtree and related hierarchical data structures. ACM Comput. Surv 16, 187–260. [Google Scholar]

[R34] Simmons S, Peng J, Bienkowska J, and Berger B (2015). Discovering what dimensionality reduction really tells us about RNA-seq data. J. Comput. Biol 22, 715–728. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R35] Stubbington MJ, Rozenblatt-Rosen O, Regev A, and Teichmann SA (2017). Single-cell transcriptomics to explore the immune system in health and disease. Science 358, 58–63. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R36] Svensson V, Vento-Tormo R, and Teichmann SA (2018). Exponential scaling of single-cell RNA-seq in the past decade. Nat. Protoc. 13, 599–604.

[R37] Tang J, Liu J, Zhang M, and Mei Q (2016). Visualizing large-scale and high-dimensional data. Proceedings of the 25th International Conference on World Wide Web, 287–297. doi:10.1145/2872427.2883041.

[R38] Ting DT, Wittner BS, Ligorio M, Jordan NV, Shah AM, Miyamoto DT, Aceto N, Bersani F, Brannigan BW, Xega K, et al. (2014). Single-cell RNA sequencing identifies extracellular matrix gene expression by pancreatic circulating tumor cells. Cell Rep. 8, 1905–1918.

[R39] Treutlein B, Brownfield DG, Wu AR, Neff NF, Mantalas GL, Espinoza FH, Desai TJ, Krasnow MA, and Quake SR (2014). Reconstructing lineage hierarchies of the distal lung epithelium using single-cell RNA-seq. Nature 509, 371–375.

[R40] Tung P-Y, Blischak JD, Hsiao CJ, Knowles DA, Burnett JE, Pritchard JK, and Gilad Y (2017). Batch effects and the effective design of single-cell gene expression studies. Sci. Rep. 7, 39921.

[R41] Usoskin D, Furlan A, Islam S, Abdo H, Lönnerberg P, Lou D, Hjerling-Leffler J, Haeggström J, Kharchenko O, Kharchenko PV, et al. (2015). Unbiased classification of sensory neuron types by large-scale single-cell RNA sequencing. Nat. Neurosci. 18, 145–153.

[R42] Van Der Maaten L (2014). Accelerating t-SNE using tree-based algorithms. J. Mach. Learn. Res. 15, 3221–3245.

[R43] Van Der Maaten L (2009). Learning a parametric embedding by preserving local structure. RBM 500, 26.

[R44] Wang B, Zhu J, Pierson E, Ramazzotti D, and Batzoglou S (2017). Visualization and analysis of single-cell RNA-seq data by kernel-based similarity learning. Nat. Methods 14, 414–416.

[R45] Wang Y, Waters J, Leung ML, Unruh A, Roh W, Shi X, Chen K, Scheet P, Vattathil S, Liang H, et al. (2014). Clonal evolution in breast cancer revealed by single nucleus genome sequencing. Nature 512, 155–160.

[R46] Yoon HS, Price DC, Stepanauskas R, Rajah VD, Sieracki ME, Wilson WH, Yang EC, Duffy S, and Bhattacharya D (2011). Single-cell genomics reveals organismal interactions in uncultivated marine protists. Science 332, 714–717.

[R47] Yu YW, Daniels NM, Danko DC, and Berger B (2015). Entropy-scaling search of massive biological data. Cell Syst. 1, 130–140.

[R48] Zeisel A, Muñoz-Manchado AB, Codeluppi S, Lönnerberg P, La Manno G, Juréus A, Marques S, Munguba H, He L, Betsholtz C, et al. (2015). Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq. Science 347, 1138–1142.

[R49] Zheng GX, Terry JM, Belgrader P, Ryvkin P, Bent ZW, Wilson R, Ziraldo SB, Wheeler TD, McDermott GP, Zhu J, et al. (2017). Massively parallel digital transcriptional profiling of single cells. Nat. Commun. 8, 14049.
