Abstract
t-distributed Stochastic Neighbor Embedding (t-SNE), a clustering and visualization method proposed by van der Maaten & Hinton in 2008, has rapidly become a standard tool in a number of natural sciences. Despite its overwhelming success, there is a distinct lack of mathematical foundations and the inner workings of the algorithm are not well understood. The purpose of this paper is to prove that t-SNE is able to recover well-separated clusters; more precisely, we prove that t-SNE in the ‘early exaggeration’ phase, an optimization technique proposed by van der Maaten & Hinton (2008) and van der Maaten (2014), can be rigorously analyzed. As a byproduct, the proof suggests novel ways for setting the exaggeration parameter α and step size h. Numerical examples illustrate the effectiveness of these rules: in particular, the quality of embedding of topological structures (e.g. the swiss roll) improves. We also discuss a connection to spectral clustering methods.
Keywords: t-SNE, dimensionality reduction, spectral clustering, convergence rates, theoretical guarantees
1. Introduction and main result.
The analysis of large, high dimensional datasets is ubiquitous in an increasing number of fields and vital to their progress. Traditional approaches to data analysis and visualization often fail in the high dimensional setting, and it is common to perform dimensionality reduction in order to make data analysis tractable. t-distributed Stochastic Neighbor Embedding (t-SNE), introduced by van der Maaten and Hinton (2008), is an impressively effective non-linear dimensionality reduction technique that has recently found enormous popularity in several fields. It is most commonly used to produce a two-dimensional embedding of high dimensional data with the goal of simplifying the identification of clusters. Despite its tremendous empirical success, the theory underlying t-SNE is unclear. The only¹ theoretical paper at this point is Shaham and Steinerberger (2017), which shows that the structure of the loss functional of SNE (a precursor to t-SNE) implies that global minimizers separate clusters in a quantitative sense.
1.1. A case study.
As an unsupervised learning method, t-SNE is commonly used to visualize high dimensional data and provide crucial intuition in settings where ground truth is unknown. The analysis of single cell RNA sequencing (scRNA-seq) data, where t-SNE has become an integral part of the standard analysis pipeline, provides a relevant example of its usage. Figure 1.1 shows (left) the output of running t-SNE on the 30 largest principal components of the normalized expression matrix of 49300 retinal cells taken from Macosko et al. (2015). The output on the right has cells colored based on which of 12 cell type marker genes was most expressed (with grey signifying that none of the marker genes were expressed). This example is well suited to showcase both the tremendous impact of t-SNE in the medical sciences and the inherent difficulties of interpreting its output when ground truth is unknown: how many clusters are in the original space, and do they correspond one-to-one to clusters in the t-SNE plot? Do the clusters (e.g. the largest cluster that does not express any marker genes) have substructure that is not apparent in this visualization? Different pre-processing steps will yield different embeddings; how stable, then, are the clusters? All these questions are of the utmost importance and underline the need for a better theoretical understanding.
Figure 1.1:
t-SNE output (left) and the same embedding colored by known ground truth (right).
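For concreteness, the following is a minimal sketch of this type of preprocessing-plus-embedding pipeline (PCA to 30 components, then t-SNE, then coloring by the most expressed marker gene). It is an illustration only: the file names, the variable names, and the use of scikit-learn are our assumptions, not the exact pipeline of Macosko et al. (2015).

```python
# Sketch of the pipeline described above (illustrative, not the exact analysis
# of Macosko et al. 2015): PCA to the 30 largest principal components of a
# normalized expression matrix, then a 2D t-SNE embedding, then coloring each
# cell by its most expressed marker gene (grey if none is expressed).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X = np.load("normalized_expression.npy")          # hypothetical cells x genes matrix
X_pca = PCA(n_components=30).fit_transform(X)     # 30 largest principal components
Y = TSNE(n_components=2).fit_transform(X_pca)     # 2D embedding (left panel)

marker = np.load("marker_expression.npy")         # hypothetical cells x 12 marker matrix
label = marker.argmax(axis=1)                     # most expressed marker gene per cell
label[marker.max(axis=1) == 0] = -1               # -1: no marker expressed (grey, right panel)
```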
1.2. Early Exaggeration.
t-SNE (described in greater detail in §3) minimizes the Kullback-Leibler divergence between a Gaussian distribution modeling distances between points in the high-dimensional input space and a Student t-distribution modeling distances between corresponding points in a low-dimensional embedding. Given a d-dimensional input dataset 𝒳 = {x1, …, xn} ⊂ ℝ^d, t-SNE computes an s-dimensional embedding of the points in 𝒳, denoted by 𝒴 = {y1, …, yn} ⊂ ℝ^s, where s ≪ d and most commonly s = 2 or 3. The main idea is to define a set of affinities pij on 𝒳 as well as a set of affinities qij on the embedding 𝒴 and then to minimize the distance between these distributions in the Kullback-Leibler divergence

C := KL(P ∥ Q) = Σi≠j pij log(pij / qij),

which gives rise to a gradient descent method via

yi(t + 1) = yi(t) − h ∂C/∂yi,   where   ∂C/∂yi = 4 Σj≠i (pij − qij) qij Z (yi − yj)

and Z = Σk≠l (1 + ∥yk − yl∥²)⁻¹. One difficulty is that the speed with which the algorithm converges slows down as the number of points n increases: the algorithm requires many more iterations to converge. However, already the original paper of van der Maaten and Hinton (2008) proposes a number of ways in which the convergence can be accelerated.
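To fix ideas, here is a minimal (didactic, O(n²)) sketch of one gradient descent step on this objective; it follows the formulas above and is not the optimized implementation of van der Maaten (2014).

```python
# One plain gradient-descent step on the t-SNE objective, following the
# formulas above (q_ij, Z, and the gradient). Didactic O(n^2) version.
import numpy as np

def tsne_gradient_step(Y, P, h):
    """One step y_i <- y_i - h * dC/dy_i for an (n, s) embedding Y and affinities P."""
    D2 = np.square(Y[:, None, :] - Y[None, :, :]).sum(-1)   # pairwise squared distances
    W = 1.0 / (1.0 + D2)                                     # (1 + |y_i - y_j|^2)^{-1}
    np.fill_diagonal(W, 0.0)
    Z = W.sum()                                              # Z = sum_{k != l} (1 + |y_k - y_l|^2)^{-1}
    Q = W / Z                                                # q_ij
    # dC/dy_i = 4 * sum_j (p_ij - q_ij) * q_ij * Z * (y_i - y_j)
    coeff = 4.0 * (P - Q) * W                                # note q_ij * Z = W_ij
    grad = coeff.sum(1)[:, None] * Y - coeff @ Y
    return Y - h * grad
```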
A less obvious way to improve the optimization, which we call ‘early exaggeration’, is to multiply all of the pij’s by, for example, 4, in the initial stages of the optimization. […] In all the visualizations presented in this paper and in the supporting material, we used exactly the same optimization procedure. We used the early exaggeration method with an exaggeration of 4 for the first 50 iterations (note that early exaggeration is not included in the pseudocode in Algorithm 1). (from: van der Maaten and Hinton (2008))
It is easy to verify empirically that this rescaling of the pij indeed improves the clustering and is effective. It has become completely standard and is hard-coded into the widely used reference implementation available online, as described by van der Maaten (2014):
During the first 250 learning iterations, we multiplied all pij–values by a user-defined constant α > 1. […] this trick enables t-SNE to find a better global structure in the early stages of the optimization by creating very tight clusters of points that can easily move around in the embedding space. In preliminary experiments, we found that this trick becomes increasingly important to obtain good embeddings when the data set size increases [Emphasis GL & SS], as it becomes harder for the optimization to find a good global structure when there are more points in the embedding because there is less space for clusters to move around. In our experiments, we fix α = 12 (by contrast, van der Maaten and Hinton (2008) used α = 4). (from: van der Maaten (2014))
As it turns out, this simple optimization trick can be rigorously analyzed. As a byproduct of our analysis, we see that the convergence of non-accelerated t-SNE slows down as the number of points n increases and that the number of iterations required grows at least linearly in n. The implementation available online counteracts this problem by various methods: (1) the early exaggeration factor α, (2) a large step size (h = 200) in the gradient descent, and (3) optimization techniques such as momentum. We deal only with the basic t-SNE algorithm, the early exaggeration factor α, and the step size h; one of the main points of our paper is that a suitable selection of α and h makes it possible to guarantee fast convergence without additional optimization techniques.
1.3. Summary of Main Results.
We will now state our main results at an informal level; all the statements can be made precise (this is done in §3.2) and will be rigorously proven.
1. Canonical parameters and exponential convergence.
There is a canonical setting of the parameters α, h for which the algorithm applied to clustered data converges provably at an exponential rate without the use of other optimization techniques (such as momentum). This setting is

α ∼ n/10 and h ∼ 1.

These parameters lead to an exponential convergence of all embedded clusters to small balls (whose diameter depends on how well 𝒳 is clustered). Generally, the speed of convergence is exponential with a rate κ that depends on α, h and the within-cluster affinities (see §5 for the precise contraction estimate).
The restriction αh ≤ n is not artificial: easy experiments show that the algorithm generally fails (no longer converges) as soon as αh ≥ 2n.
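In code, the canonical setting amounts to a simple rule of thumb; the helper below is an illustrative sketch (the function name and the safety factor are ours), not a verbatim prescription from the formal statement.

```python
# Illustrative rule of thumb for the early exaggeration phase discussed above
# (our own sketch, not a formal rule from the Theorem).
def early_exaggeration_parameters(n, safety=10):
    alpha = max(n / safety, 1.0)   # exaggeration alpha ~ n/10
    h = 1.0                        # step size h ~ 1
    assert alpha * h <= n          # stay within the regime alpha * h <= n
    return alpha, h

alpha, h = early_exaggeration_parameters(50000)   # e.g. alpha = 5000, h = 1
```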
2. Spectral clustering.
The t-SNE algorithm, in this regime, behaves like a spectral clustering algorithm (defined in §4.1); moreover, this algorithm can be written down explicitly. This allows for (1) the use of theory from spectral clustering to rigorously analyze t-SNE and (2) a fast implementation that can perform the early exaggeration phase in a fraction of the time necessary to run t-SNE (in this regime). It also poses the challenge of trying to understand whether t-SNE behaves qualitatively differently for the standard parameters α ~ 12, h ~ 200 or whether it behaves more or less identically (and thus like a spectral method).
3. Disjoint clusters.
It is not guaranteed that the embedded clusters in 𝒴 are disjoint; but given a random initialization, it is extremely unlikely that two distinct clusters will converge to the same center. Furthermore, if 𝒳 is well-clustered, the diameter of the embedded clusters can be made even smaller by decreasing the step size h and further increasing α, as long as the product satisfies αh ~ n/10. Increasing α will resolve overlapping clusters as long as they have different centers. In particular, the number of disjoint clusters in 𝒴 is a lower bound on the number of clusters in 𝒳 (and, generically, the numbers coincide). Finally, we note that since the appearance of the preprint for this manuscript, Arora et al. (2018) have built upon our analytic framework to show that the clusters will effectively be disjoint, under a slightly different set of assumptions.
4. Independence of initialization.
All these results are independent of the initialization of 𝒴 as long as it is contained in a sufficiently small ball.
An immediate implication of (3) is the following: if we are given some clustered data 𝒳 and see that the t-SNE embedding for large values of α (and small values of h) produces k clusters, then there are exactly k clusters in 𝒳. The results guarantee that all clusters in 𝒳 are eventually mapped to small balls which can be made arbitrarily small. We see that when the parameters are chosen optimally, this result provides a justification for the way t-SNE is commonly used in, say, biomedical research. However, we emphasize that this result only applies to data which is well-clustered, as made precise in §3.2.
1.4. Approximating Spectral Clustering.
The fact that t-SNE approximates a spectral clustering method for α ~ n/10, h ~ 1 raises a fascinating question: does t-SNE, in its early exaggeration phase, perform better with the classical parameter choices α ~ 12, h ~ 200 than it does with α ~ n/10, h ~ 1? If yes, then its inner workings may give rise to improved spectral methods. If no, then one might as well use α ~ n/10, h ~ 1; since this regime is essentially a spectral method, it may then be advantageous (and much faster) to initialize the second phase of t-SNE with the outcome of a more sophisticated spectral method instead of a random initialization. We discuss some experiments in that direction in §4.1 and believe this to be worthy of further investigation. Moreover, we describe a visualization technique in the style of t-SNE for spectral clustering tools (see §4.2).
1.5. Organization.
The organization of this paper is as follows: we first illustrate our main points with numerical examples in Section §2. Section §3 establishes notation and gives a formal statement of our main result, Section §4 derives a connection between t-SNE and spectral clustering, Section §5 discusses a certain type of discrete dynamical system on finite sets of points and establishes a crucial estimate, and Section §6 gives the proof of the main result.
2. Numerical examples.
This section discusses a number of numerical examples to illustrate our main points.
2.1. Lines and Swiss roll.
It is classical that t-SNE does not successfully embed the swiss roll; however, the random initialization causes difficulty even on simpler data: Figure 2.1 shows the t-SNE embedding (using the Matlab implementation of van der Maaten (2014) with default parameters) of points sampled from a simple line.
Figure 2.1:
Classical t-SNE embeddings of a line and the swiss roll.
The randomized initialization causes, after initial contraction in the early exaggeration phase, a topological interlocking that cannot be further resolved. The example is even more striking with the swiss roll, where the random initialization leads to ‘knots’ that cannot be untied by t-SNE. In stark contrast, the parameter selection of Figures 2.2 and 2.3 (α = 20n, h = 0.05 for the line and α = n, h = 1 for the swiss roll) allows for a more effective early exaggeration phase that clearly recovers the line from random initial data and even contracts the swiss roll to a correctly ordered line (which would then expand in the second phase of the algorithm).
The successful embedding of these examples when α and h are chosen optimally is consistent with our claim that, in this regime, the early exaggeration phase of t-SNE acts like a spectral method; many spectral methods also correctly embed these manifolds.
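The toy inputs for this type of experiment are easy to generate; the sketch below is our assumed setup (a line embedded in ℝ^10 and a standard swiss roll via scikit-learn), not the exact data behind Figures 2.1–2.3.

```python
# Assumed toy inputs for experiments of this kind (not the exact data of
# Figures 2.1-2.3): equispaced points on a line embedded in R^10 and a
# classical swiss roll in R^3.
import numpy as np
from sklearn.datasets import make_swiss_roll

n = 2000
line = np.linspace(0.0, 1.0, n)[:, None] * np.ones((1, 10))   # a line in R^10
roll, t = make_swiss_roll(n_samples=n, noise=0.0)             # swiss roll in R^3
# These would then be fed to t-SNE twice: once with default parameters and once
# with the choices reported in Figures 2.2 and 2.3 (alpha = 20n, h = 0.05 for
# the line; alpha = n, h = 1 for the swiss roll).
```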
2.2. Real-life data.
Finally, we show the impact of the parameter selection α ~ n/10, h ~ 1 in a real-life example. Figure 2.4 shows (left) classical out-of-the-box t-SNE on 10000 randomly subsampled handwritten digits (0–5) from the MNIST data set, as well as the outcome of the early exaggeration phase of t-SNE with parameters α ~ n/10, h ~ 1 (middle), and the final outcome after the second phase of t-SNE has been initialized with the embedding shown in the middle (right). We see that early exaggeration does essentially all the clustering already and the second phase merely rearranges the clusters. This behavior is typical: the clustering and global organization occur in the early exaggeration phase – when the points can move most easily – and during the rest of the iterations the clusters often expand to fill a larger area, revealing more of the within-cluster variability that is often of interest. As evident in this figure, the second phase often improves the separation between clusters, because when α = 1 the repulsive forces are no longer negligible and the clusters hence repel one another.
Figure 2.4:
A real-life example: classical t-SNE (left), t-SNE with our proposed parameter selection both after the early exaggeration phase (middle) and final output (right).
We believe this example again hints at one of the fundamental questions arising from the work in this paper: is the initial clustering done by standard t-SNE comparable to the initial clustering obtained with the new parameter selection? If so, then the fact that the new parameter selection emulates a spectral clustering method (see §4) certainly suggests the option of initializing with other clustering methods as opposed to a random initialization. Moreover, it would hint at the danger of treating a spectral clustering method and t-SNE as two independent verifications of a clustering.
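As a practical illustration, the proposed parameters can be tried in an off-the-shelf implementation. The sketch below assumes, as in scikit-learn's TSNE, that the `early_exaggeration` argument plays the role of α and `learning_rate` the role of h; it only illustrates the parameter change and does not reproduce the exact two-phase pipeline behind Figure 2.4.

```python
# Sketch of trying the proposed parameters in scikit-learn's TSNE (assuming its
# early_exaggeration and learning_rate arguments play the roles of alpha and h).
# X is the (subsampled) MNIST matrix, assumed to be given.
from sklearn.manifold import TSNE

n = X.shape[0]
Y_default  = TSNE(early_exaggeration=12,     learning_rate=200).fit_transform(X)  # classical choice
Y_proposed = TSNE(early_exaggeration=n / 10, learning_rate=1.0).fit_transform(X)  # alpha ~ n/10, h ~ 1
```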
3. t-SNE: Notation and the Main result.
This section starts with a complete discussion of t-distributed Stochastic Neighbor Embedding, partly for the convenience of the reader and partly to establish terminology and notation, and then describes the main result.
3.1. t-SNE.
We denote the d-dimensional input dataset by 𝒳 = {x1, …, xn} ⊂ ℝ^d; t-SNE computes an s-dimensional embedding of the points in 𝒳, denoted by 𝒴 = {y1, …, yn} ⊂ ℝ^s, where s ≪ d and most commonly s = 2 or 3. The joint probability pij measuring the similarity between xi and xj is computed as

pj|i = exp(−∥xi − xj∥² / (2σi²)) / Σk≠i exp(−∥xi − xk∥² / (2σi²)),   pij = (pj|i + pi|j) / (2n).

The bandwidth of the Gaussian kernel, σi, is often chosen such that the perplexity of Pi matches a user-defined value, where Pi is the conditional distribution across all data points given xi. We will never deal with these issues: we will assume that the pij are given and that they correspond to a well-clustered set (in a precise sense defined below). In particular, we will not assume that they have been obtained using a Gaussian kernel. The similarity between points yi and yj in the low-dimensional embedding is defined as

qij = (1 + ∥yi − yj∥²)⁻¹ / Σk≠l (1 + ∥yk − yl∥²)⁻¹.

t-SNE finds the points {y1,…,yn} which minimize the Kullback-Leibler divergence between the joint distribution P of points in the input space and the joint distribution Q of points in the embedding space:

C = KL(P ∥ Q) = Σi≠j pij log(pij / qij).
The points are initialized randomly, and the cost function is minimized using gradient descent. The gradient is derived in Appendix A of van der Maaten and Hinton (2008):

∂C/∂yi = 4 Σj≠i (pij − qij) qij Z (yi − yj),

where Z is a global normalization constant,

Z = Σk≠l (1 + ∥yk − yl∥²)⁻¹.

As in van der Maaten (2014), we split the gradient into two parts:

∂C/∂yi = 4 ( Σj≠i pij qij Z (yi − yj) − Σj≠i qij² Z (yi − yj) ).

Since we are interested in the minimization of the functional, we naturally step in the direction of the negative gradient

−∂C/∂yi = 4 ( Σj≠i pij qij Z (yj − yi) − Σj≠i qij² Z (yj − yi) ).
Each of these two terms has a very natural interpretation: the first term is usually called the attractive term, while the second term describes a repulsive force (see, for example, van der Maaten and Hinton (2008) for use of this terminology). The reason is simple: the first term moves yi towards a weighted average of the other yj, and the weights are large if the underlying points are close in the original space and small if they are far away. The second term has the opposite sign and thus exerts the opposite effect; however, the degree of repulsion depends solely on the closeness of points in the embedding space. Put differently, the first term attracts points that are meant to be together based on distance in the original space, and the second term pushes points apart when they get too close in the embedding space, regardless of whether they are meant to be close to each other or not. One would hope that the attractive forces win out over the repulsion for points that are meant to be close to each other and lose out for points that are meant to be far apart, and this is one of the common interpretations of the underlying mechanism. Early exaggeration introduces the coefficient α > 1 and corresponds to the gradient descent method built on the direction

α Σj≠i pij qij Z (yj − yi) − Σj≠i qij² Z (yj − yi),

and a small step size h > 0 leads to the expression

yi(t + 1) = yi(t) + h ( α Σj≠i pij qij Z (yj(t) − yi(t)) − Σj≠i qij² Z (yj(t) − yi(t)) ),

where the constant factor 4 has been absorbed into the step size h.
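For concreteness, the early exaggeration iteration above can be written as a few lines of (didactic, O(n²)) code; as above, the constant factor 4 from the gradient is absorbed into h, and this sketch is not the tree-based implementation of van der Maaten (2014).

```python
# One early exaggeration step: the attractive term is multiplied by alpha and
# a single gradient step of size h is taken. Didactic O(n^2) version.
import numpy as np

def early_exaggeration_step(Y, P, alpha, h):
    """y_i(t+1) = y_i(t) + h*(alpha*sum_j p_ij q_ij Z (y_j - y_i) - sum_j q_ij^2 Z (y_j - y_i))."""
    D2 = np.square(Y[:, None, :] - Y[None, :, :]).sum(-1)
    W = 1.0 / (1.0 + D2)                 # q_ij * Z = (1 + |y_i - y_j|^2)^{-1}
    np.fill_diagonal(W, 0.0)
    Z = W.sum()
    Q = W / Z                            # q_ij
    coeff = alpha * P * W - Q * W        # alpha * p_ij * q_ij * Z  -  q_ij^2 * Z
    update = coeff @ Y - coeff.sum(1)[:, None] * Y   # sum_j coeff_ij * (y_j - y_i)
    return Y + h * update
```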
3.2. Main result.
This section gives our main result. We emphasize that the method of proof is rather flexible and that it is not difficult to obtain variations on the result under slightly different assumptions. Our result is formally stated for a set of points {x1,…,xn} and a set of mutual affinities pij. We will not assume that the pij are obtained using the standard t-SNE normalizations but work at a full level of generality using a set of three assumptions. We note, and explain below, that for standard t-SNE the second assumption holds until the number of points exceeds, roughly, n ~ 20000 and that the third assumption holds by design. The first assumption encapsulates our notion of clustered data.
1. 𝒳 is clustered.
We proceed by giving a very versatile definition of what it means to be a cluster; it is also applicable to things that clearly are not clusters, but in those cases the error bound in the Theorem will not convey any information. Formally, we assume that there exists a k ∈ ℕ (the number of clusters) and a map π: {1,…,n} → {1,2,…,k} assigning each point to one of the k clusters such that the following property holds: if π(i) = π(j), then

pij ≥ 1 / (10 n |π⁻¹(π(i))|).

The constant 10 is fairly arbitrary in the sense that any absolute constant will do (but will slow down the speed of convergence as it gets larger); we put 10 to avoid an overly complicated exposition. Observe that |π⁻¹(π(i))| is merely the size of the cluster in which i and j lie. We will furthermore abbreviate, for fixed 1 ≤ i ≤ n, summation over the cluster containing i (that is, over all j ≠ i with π(j) = π(i)) as Σj∼i.
The condition on pij above implies that

Σj∼i pij ≥ (|π⁻¹(π(i))| − 1) / (10 n |π⁻¹(π(i))|).

This assumption ensures that elements are at least somewhat connected within their cluster; that is, we lower bound the affinity of each point to the other points in its cluster. We emphasize that it is a rather weak assumption, since we do not simultaneously demand that the sum over all other elements is small. Indeed, one of the leading error terms in the Theorem is governed by the quantity

Σj: π(j)≠π(i) pij,

the total affinity of a point to points outside its cluster, so if the data is strongly clustered and the pij have been obtained in one of the usual ways, then the optimal choice of π will be exactly the one that minimizes this quantity uniformly over all i and will then correspond to the original clustering; however, any other map π: {1,…,n} → {1,2,…,k} is equally admissible but is likely to result in a large (trivial, uninformative) error bound. In particular, our assumptions on what it means to be clustered are so weak that a given data set does not necessarily have a unique decomposition into different clusters, and this is mirrored in the main result, which does not imply cluster separation (we refer to Arora et al. (2018) for such a statement under different assumptions).
2. Parameter choice.
Our second assumption ensures that the effective step size in the gradient descent, determined by the step size h and the exaggeration parameter α through the product αh, is within a reasonable regime. An assumption along these lines is clearly necessary for general gradient descent techniques to avoid overoscillation (i.e. missing a local minimum and moving too far in the other direction). More precisely, we assume that α and h are chosen such that, for some 1 ≤ i ≤ n, the total attractive coefficient acting on yi within its cluster, αh Σj∼i pij qij Z, is bounded away from 0 by an absolute constant and bounded above by 1.
The main result will be applicable to a single cluster (i.e. it is possible to guarantee that a single cluster converges even if the rest does not) and it can be applied to exactly those clusters satisfying this inequality. It is easy to see, both in the proof and in numerical experiments, that the upper bound is a necessary condition for the early exaggeration phase of t-SNE to work (more precisely, the upper bound 1 is necessary, but we need a little bit of leeway in another part of the argument). We observe that this condition implies that α ~ n/10 and h ~ 1 is admissible; however, other parameter choices (e.g. α ~ 10n, h ~ 1/100) are equally valid. If α and h are kept fixed while the number of points gets larger, the lower bound is eventually violated: our main result can easily be extended to cover that case, but the factor κ with which exponential convergence occurs approaches 1 and convergence, while technically exponential, becomes slow. In particular, an analysis of how this condition acts in the proof motivates a concrete parameter selection rule.
Guideline.
The best convergence rate for the cluster containing yi is attained when the total attractive coefficient αh Σj∼i pij qij Z is as close to 1 as the assumption allows; choosing α and h so that this holds simultaneously for every cluster (i.e. matching the worst-connected cluster) is the best selection to ensure that all clusters converge.
We also observe that in the setting where the parameters are not chosen optimally (e.g. when the convergence rate κ is close to 1), momentum can be useful to accelerate convergence. However, for t-SNE there are currently no rigorous results in that direction.
3. Localized initialization.
The initialization satisfies yi(0) ∈ [−0.01, 0.01]² for all 1 ≤ i ≤ n. This assumption is not crucial and could easily be modified at the cost of changing some other constants. The proof suggests that initializing at smaller scales might be beneficial on the level of constants.
We can now state our main result.
Theorem 3.1 (Main Result).
Under Assumptions (1)–(3), the diameter of the embedded cluster {yj: 1 ≤ j ≤ n ∧ π(j) = π(i)} decays exponentially (at a universal rate) until its diameter satisfies, for some universal c > 0,

diam {yj: π(j) = π(i)} ≤ c h ( 1/n + α Σj: π(j)≠π(i) pij ).
Remarks.
The Theorem can be applied to a single cluster; in particular, some clusters may contract to tiny balls while others do not contract at all.
Since αh ~ n, we see that the bound is only nontrivial if, for some small constant c2 > 0,

Σj: π(j)≠π(i) pij ≤ c2/n.

Otherwise, it merely tells us that the elements of the cluster are contained in a ball of radius ~ 1 (as are all the other points): this is not an artifact of the proof, but clearly necessary. If the affinity to other clusters is large, the data is not well-clustered. Generally, for well-clustered data, we would expect that sum to be very close to 0, which would yield a leading error term of ch/n. The constant c seems to be roughly on the scale c ~ 10 for well-clustered data and slightly larger for data with worse clustering properties (in particular, for the classical t-SNE parameter selection, it would slowly increase with the number of points n). We believe this estimate to be too conservative and consider the true value to be of a smaller order of magnitude; this question will be pursued in future work.
The proof of the main result is actually rather versatile and should easily adapt to a variety of other settings that might be of interest. This versatility is partly due to the connection of the argument to rather fundamental ideas in partial differential equations; indeed, the argument may be interpreted as a maximum principle for a discrete parabolic operator acting on vector-valued (i.e. points in space) data. This interpretation is what led us to establish a connection to spectral clustering, which we now discuss.
4. A Connection to Spectral Clustering.
The purpose of this section is to note a connection to spectral clustering (Laplacian eigenmaps, see Belkin and Niyogi (2003)); the fact that the main terms coincide has previously been noted by Carreira-Perpinan (2010).
4.1. Approximating spectral clustering.
We give a quantitative description: it is possible to take the limit α → ∞, h → 0 (scaled so that α · h = const), and, in that limit, one obtains a simple spectral clustering method. We re-introduce notation and assume again that 𝒳 = {x1,…,xn} ⊂ ℝ^d is given. We assume the pij are some collection of affinities scaled in such a way that, for xi, xj in the same cluster π(i) = π(j),

pij ≥ 1 / (10 |π⁻¹(π(i))|).

We observe that this scaling is slightly different from the one above: it is obtained by absorbing the αh ~ n term into the affinities. At the same time, h → 0 implies that the repulsion term containing the qij does not exert any force. This implies that, in the limit, the remaining term in the gradient descent method is given by

yi(t + 1) = yi(t) + Σj≠i pij qij Z (yj(t) − yi(t)).

This, however, can be interpreted as a Markov chain with suitably chosen transition probabilities. It may be unusual, at first, to see this equation, since the yi(t) are vectors in ℝ^s; however, all the equations separate the different coordinates, which allows for a reduction to the familiar scalar form. All the canonical results from spectral clustering apply: the asymptotic behavior is given by the largest non-trivial eigenvalue(s), which are either 1 (in the case of perfectly separated clusters) or very close to 1, and the convergence speed depends on the spectral gap. Note, in particular, that for generic data the largest eigenvector is constant; however, since the algorithm is only run for a small number of steps, we essentially recover the dominant eigenvectors, and the second phase of t-SNE then re-expands the clusters.
4.2. Visualizing spectral clustering.
The connection also allows us to go in the other direction and discuss a particular visualization technique for spectral methods that shows the arising clusters as points in ℝ² (or higher dimensions, which is not essential here). The transition matrix A of the Markov chain is given by

Aij = pij qij Z for i ≠ j   and   Aii = 1 − Σj≠i pij qij Z,

which, for points contained in a small ball (so that qij Z ≈ 1), is essentially determined by the affinities alone.
The large-time behavior of y(t) = A^t y(0) is essentially determined by the spectrum of A close to 1. Moreover, in the case of perfect clustering with pij = 0 whenever xi and xj are in different clusters, there are exactly k eigenvalues equal to 1 and the iteration converges to the projection of the initialization onto the corresponding eigenspace. Let us now assume that the goal is visualization in ℝ². We let y1(0), …, yn(0) be a set of points that we assume are i.i.d. random variables drawn from, say, the uniform distribution on [−0.01, 0.01]². We propose to visualize the point set after k iterations as follows: collect these n initial vectors in an n × 2 matrix y and interpret the n rows of A^k y as coordinates in ℝ². This creates t-SNE-style visualizations for spectral methods (see Fig. 4.1 and Fig. 4.2, lower rows).
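A minimal sketch of this visualization is given below. The precise normalization of the transition matrix is our assumption (off-diagonal entries pij, diagonal entries chosen so that rows sum to 1), which requires the affinities to be scaled so that row sums of P stay at most 1, as in the limiting regime above.

```python
# Sketch of the visualization: rows of A^k y for a small random initialization y.
import numpy as np

def spectral_visualization(P, k, seed=0):
    rng = np.random.default_rng(seed)
    n = P.shape[0]
    A = P.astype(float).copy()
    np.fill_diagonal(A, 0.0)
    A = A + np.diag(1.0 - A.sum(axis=1))       # row-stochastic transition matrix (assumes row sums of P <= 1)
    y = rng.uniform(-0.01, 0.01, size=(n, 2))  # i.i.d. uniform initialization in [-0.01, 0.01]^2
    for _ in range(k):                         # y <- A y, i.e. A^k y after k steps
        y = A @ y
    return y                                   # rows are the 2D coordinates to plot
```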
Figure 4.1:
Early exaggeration via t-SNE with α = n/10, h = 1 (top) and visualization of iterations of the spectral method (bottom) on same initialization.
Figure 4.2:
Early exaggeration via t-SNE with α ~ n/3, h = 1 (top, parameter selection via guideline) and iterations of the spectral method (bottom).
Examples.
An example of this method is shown in Figure 4.1. The example comprises 40000 points sampled from four very narrow Gaussians and is highly clustered. We used a perplexity of 30 to create the pij and used α = n/10, h = 1 in the implementation of t-SNE. The second row in Figure 4.1 shows the projection onto the 50 largest eigenvectors of A. The computation of t-SNE took roughly 7 minutes vs. 1 minute for the spectral decomposition – note, however, that once the spectral decomposition has been computed, subsequent iterations can be computed in essentially constant time (one only has to raise the eigenvalues to some power). Another example, run on 4 digits from MNIST, is given in Figure 4.2; again, both methods coincide. This demonstrates that our derivation of the approximating spectral method was accurate. At the same time, it suggests repeating the fundamental question.
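The speed-up mentioned in passing can be sketched as follows: after a one-time (truncated) eigendecomposition, the k-th iterate is obtained by raising the eigenvalues to the k-th power. The sketch assumes a symmetric transition matrix and keeps only the leading eigenpairs (as with the 50 eigenvectors used for Figure 4.1), so it is an approximation.

```python
# Approximate A^k y0 via a truncated eigendecomposition (A assumed symmetric).
import numpy as np

def iterate_via_spectrum(A, y0, k, top=50):
    vals, vecs = np.linalg.eigh(A)                 # one-time symmetric eigendecomposition
    idx = np.argsort(np.abs(vals))[::-1][:top]     # keep the 'top' leading eigenpairs
    vals, vecs = vals[idx], vecs[:, idx]
    coeff = vecs.T @ y0                            # expand y0 in the (truncated) eigenbasis
    return vecs @ (vals[:, None] ** k * coeff)     # approximation of A^k y0
```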
Open problem.
Is the clustering behavior of the early exaggeration phase of t-SNE with α = 12, h = 200 (and possibly optimization techniques such as momentum) essentially qualitatively equivalent to the behavior of t-SNE with our parameter choice α = n/10,h = 1?
If this were indeed the case, then the early exaggeration phase of t-SNE would be simply a spectral clustering method in disguise; if not, then it would be very valuable to understand under which circumstances its performance is superior to spectral clustering and whether its underlying mechanisms could be used to boost spectral methods. We reiterate that we believe this to be a very interesting problem.
5. Ingredients for the Proof: Discretized Dynamical Systems.
This section introduces a type of discrete dynamical system on finite sets of points in ℝ^s and describes its asymptotic behavior. This is a self-contained result; it could potentially be interpreted as an analysis of a spectral method that is robust to small error terms, but the analysis is simple enough to be kept entirely elementary. Our original guiding picture was that of the maximum principle in the theory of parabolic partial differential equations.
5.1. A discrete dynamical system.
Let z1(0), z2(0), …, zn(0) ∈ ℝ^s be given. We use them as initial values for a time-discrete dynamical system that is defined via

zi(t + 1) = zi(t) + Σj≠i ai,j,t (zj(t) − zi(t)) + εi(t).

At this stage, if the points are in general position and n ≥ s, basic linear algebra implies that this system can undergo an essentially arbitrary evolution as long as one is free to choose the coefficients ai,j,t.
We will henceforth assume that these quantities satisfy the following three conditions.
- There is a uniform lower bound on the coefficients: ai,j,t ≥ δ for all t > 0 and all i ≠ j.
- There is a uniform upper bound on the coefficients: Σj≠i ai,j,t ≤ 1 for all t > 0 and all 1 ≤ i ≤ n.
- There is a uniform upper bound on the error term: ∥εi(t)∥ ≤ ε for all t > 0 and all 1 ≤ i ≤ n.
A typical example of such a dynamical system is shown in Figure 5.1: we start with twelve points on the unit circle and then iterate the system for some random choices of ai,j,t and random εi(t). The points at first move towards each other until they are close and the error term starts being on the same scale as the forces of attraction. The points then move around randomly (all the while staying close to each other). We will make this intuitive picture precise below. The main result of this section is that all the points in this dynamical system are eventually contained in a ball whose size only depends on n, δ and ε. We start by showing that the convex hull of the points is stable. We use B(0, ε) to denote the ball of radius ε centered at the origin, A + B = {a + b: a ∈ A ∧ b ∈ B}, and conv A for the convex hull of A.
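The following short simulation illustrates this behavior (with our own random choices of the coefficients and the error, so it is only in the spirit of Figure 5.1): the diameter first contracts exponentially and then fluctuates at scale ~ ε/(nδ).

```python
# Simulation of the dynamical system above: twelve points on the unit circle,
# coefficients a_{i,j,t} >= delta with row sums at most 1, componentwise errors
# of size at most eps.
import numpy as np

rng = np.random.default_rng(0)
n, s, delta, eps, steps = 12, 2, 0.02, 1e-3, 60
theta = 2 * np.pi * np.arange(n) / n
z = np.stack([np.cos(theta), np.sin(theta)], axis=1)   # points on the unit circle

for t in range(steps):
    a = rng.uniform(delta, 2 * delta, size=(n, n))     # a_{i,j,t} in [delta, 2*delta]
    np.fill_diagonal(a, 0.0)                           # row sums <= 2*(n-1)*delta < 1 here
    noise = rng.uniform(-eps, eps, size=(n, s))        # error term eps_i(t)
    z = z + a @ z - a.sum(axis=1)[:, None] * z + noise # z_i += sum_j a_ij (z_j - z_i) + eps_i
    if t % 10 == 0:
        diam = np.max(np.linalg.norm(z[:, None] - z[None, :], axis=-1))
        print(t, diam)   # exponential contraction, then a floor of order eps/(n*delta)
```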
Lemma 5.1 (Stability of the convex hull).
With the assumptions above, we have

conv{z1(t + 1), …, zn(t + 1)} ⊆ conv{z1(t), …, zn(t)} + B(0, ε).
Proof.
This argument is simple. We note that

zi(t + 1) − εi(t) = (1 − Σj≠i ai,j,t) zi(t) + Σj≠i ai,j,t zj(t).

By assumption, the coefficients on the right-hand side are nonnegative and sum to 1, and this implies zi(t + 1) − εi(t) ∈ conv{z1(t),z2(t),…,zn(t)}. Since ∥εi(t)∥ ≤ ε, the claim follows. ∎
Lemma 5.2 (Contraction inequality).
With the notation above, if the diameter is large in the sense that

diam{z1(t), …, zn(t)} ≥ 10 ε / (nδ),

then

diam{z1(t + 1), …, zn(t + 1)} ≤ (1 − nδ/20) · diam{z1(t), …, zn(t)}.

One particularly important consequence is the following: the diameter shrinks, at an exponential rate (1 − nδ/20)^t, to a size of ~ ε/(nδ). Of course, this convergence is particularly fast whenever nδ ~ 1. It is easy to see, for example by taking n = 2 points, that this is the optimal scale for the result to hold.
Proof.
The method of proof is as follows: we project the set of points onto an arbitrary line (say, the x-axis, by taking only the first coordinate of each point) and show that the one-dimensional projections contract exponentially quickly; this then implies the desired statement. Let πx denote such a projection. We abbreviate the diameter of the projection as

d := diam{πxz1(t), πxz2(t), …, πxzn(t)}.

We may assume w.l.o.g. that {πxz1(t), πxz2(t), …, πxzn(t)} ⊂ [0, d]. We then subdivide the interval into the two regions [0, d/2) and [d/2, d] and denote the number of points in each interval by i1, i2. Clearly, i1 + i2 = n and therefore either i1 ≥ n/2 or i2 ≥ n/2. We assume w.l.o.g. that the first case holds. Projections are linear, thus

πxzi(t + 1) = πxzi(t) + Σj≠i ai,j,t (πxzj(t) − πxzi(t)) + πxεi(t).

We abbreviate ai := Σj≠i ai,j,t ≤ 1 and write

πxzi(t + 1) − πxεi(t) = (1 − ai) πxzi(t) + Σj≠i ai,j,t πxzj(t),

which exhibits the left-hand side as a convex combination of the projected points. Moreover, using the lower bound ai,j,t ≥ δ, the points lying in [0, d/2) carry a total weight of at least i1 δ ≥ nδ/2 in this convex combination. Then, however,

πxzi(t + 1) − πxεi(t) ≤ (1 − nδ/2) d + (nδ/2)(d/2) = d (1 − nδ/4),

which shows that πxzi(t + 1) ∈ [−ε, d(1 − nδ/4) + ε]. This implies

diam{πxz1(t + 1), …, πxzn(t + 1)} ≤ d (1 − nδ/4) + 2ε.

If the diameter is indeed disproportionately large,

d ≥ 10 ε / (nδ),

then 2ε ≤ d nδ/5 and this can be rearranged as

diam{πxz1(t + 1), …, πxzn(t + 1)} ≤ d (1 − nδ/4 + nδ/5) = d (1 − nδ/20),

and therefore the projected diameter contracts by the factor (1 − nδ/20). Since this is true in every projection, it also holds for the diameter of the original set. ∎
Remark.
The argument could be slightly improved: in its current form it allows for the worst case in which the error εi satisfies ∥εi(t)∥ℓ∞ = ε in every projection, while we only assume ∥εi(t)∥ ≤ ε. This, together with the usual other optimization schemes, should yield an improved estimate on the constant. The condition on δ could also be weakened (at the cost of losing constants). In particular, it would be sufficient in Assumption (1) in our main result to require, for every 1 ≤ i ≤ n, the lower bound on pij only for a (1/2 + ε)-fraction of the indices j that are in the same cluster π(i) = π(j). By adapting the proof, the constant (1/2 + ε) could be reduced further; however, this is inevitably going to decrease the provable bounds on the exponential decay rate (which is not an artifact of the method: the speed of convergence of the algorithm genuinely slows down in this setting).
6. Proof of the Main Result.
The rough outline of the argument is as follows: we initialize all points inside [−0.01, 0.01]². We rewrite the gradient descent method acting on one particular embedded cluster as a dynamical system of the type studied above with an error term. The error term contains qij, which depend on distances between points from different clusters. This is difficult to control, especially if the points are far apart. Our strategy will now be as follows: we show that the qij are all under control as long as everything is contained in [−0.02, 0.02]². We use stability of the convex hull to guarantee that all of the embedded points are within [−0.02, 0.02]² for at least ℓ iterations and show that this time-scale is enough to guarantee contraction of the cluster.
Proof.
We start by showing that the qij are comparable as long as the point set is contained in a small region of space. Let now {y1,y2,…,yn} ⊂ [−0.02, 0.02]² and recall the definitions

qij = (1 + ∥yi − yj∥²)⁻¹ / Z   and   Z = Σk≠l (1 + ∥yk − yl∥²)⁻¹.

Then, however, it is easy to see that 0 ≤ ∥yi − yj∥ ≤ 0.06 implies

1 ≥ (1 + ∥yi − yj∥²)⁻¹ = Z qij ≥ (1 + 0.0036)⁻¹ ≥ 0.996.

We will now restrict ourselves to a small embedded cluster {yi: π(i) fixed} and rewrite the gradient descent method as

yi(t + 1) = yi(t) + αh Σj∼i pij qij Z (yj(t) − yi(t)) + αh Σj: π(j)≠π(i) pij qij Z (yj(t) − yi(t)) − h Σj≠i qij² Z (yj(t) − yi(t)),
where the first sum yields the main contribution and the other two sums are treated as a small error. Applying our results for dynamical systems of this type requires us to verify the conditions. We start by showing the conditions on the coefficients to be valid. Clearly, for j in the same cluster as i,

ai,j,t := αh pij qij Z ≥ 0.99 · αh / (10 n |π⁻¹(π(i))|) =: δ,

which is clearly admissible whenever αh ~ n. As for the upper bound, it is easy to see that

Σj∼i ai,j,t = αh Σj∼i pij qij Z ≤ 1

by the assumption on the parameter choice.
It remains to study the size of the error term, for which we use the triangle inequality: since ∥yj − yi∥ ≤ 0.06 and qij Z ≤ 1,

∥αh Σj: π(j)≠π(i) pij qij Z (yj − yi)∥ ≤ 0.06 αh Σj: π(j)≠π(i) pij,

and, similarly for the second term, using Σj≠i qij ≤ 1.1/n,

∥h Σj≠i qij² Z (yj − yi)∥ ≤ 0.06 h Σj≠i qij (qij Z) ≤ 0.07 h / n.

This tells us that the norm of the error term is bounded by

ε ≤ 0.06 αh Σj: π(j)≠π(i) pij + 0.07 h / n.
It remains to check whether the time-scales fit. The number of iterations ℓ for which the assumption is reasonable is at least ℓ ≥ 0.01/ε. At the same time, the contraction inequality implies that in that time the cluster shrinks to size

diam ≤ max( (1 − nδ/20)^ℓ · 0.06, 10 ε/(nδ) ) ≤ 0.06 exp(−nδℓ/20) + 10 ε/(nδ),

where the last inequality follows from the elementary inequality 1 − x ≤ exp(−x). Since nδ is of order 1 by the parameter assumption and ℓ ≥ 0.01/ε, the first term is negligible and the cluster is contained in a ball of radius ~ ε/(nδ), which implies the statement of the Theorem. ∎
Remarks.
The proof is relatively flexible in several different spots. By demanding that the initialization is contained in a sufficiently small ball, one can force the quantity Zqij to be arbitrarily close to 1. We also emphasize that we did not optimize over constants and that additional fine-tuning in various spots would yield better constants (at the cost of a more involved argument, which is why we decided against it). The use of the triangle inequality in bounding the error terms is another part of the proof that deserves attention: if the clusters are spread out, then we would expect the repulsive forces to act from all directions and lead to additional cancellation (which, if it were indeed the case that the qij do not play a significant role in the clustering that occurs in the early exaggeration phase, would be an additional reason for the strong similarity to the outcome of the spectral method). It could be of interest to study mean-field-type approximations to gain a better understanding of this phenomenon.
Figure 2.2:
Early exaggeration phase of t-SNE on a line with α = 20n, h = 0.05.
Figure 2.3:
Early exaggeration phase of t-SNE on a swiss roll with α = n, h = 1.
Figure 5.1:
A typical evolution of a dynamical system of this type.
7. Acknowledgements.
The authors thank Laurens van der Maaten for valuable feedback on this work. GCL was supported by NIH grant #1R01HG008383–01A1 (PI: Yuval Kluger) and U.S. NIH MSTP Training Grant T32GM007205. SS was partially supported by #INO1500038 (Institute of New Economic Thinking).
Footnotes
¹ After the appearance of the preprint for this manuscript, Arora et al. (2018) used our analytic framework to obtain results of a similar flavor. We mention how this work relates to our result in Section §1.3; Arora et al. (2018) can be consulted for more details.
References.
- Arora, Sanjeev, Wei Hu, and Pravesh K. Kothari. An analysis of the t-SNE algorithm for data visualization. Preprint, arXiv:1803.01768, 2018.
- Belkin, M., and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373–1396, 2003.
- Carreira-Perpinan, Miguel A. The elastic embedding algorithm for dimensionality reduction. In Proceedings of the 27th International Conference on Machine Learning (ICML 2010), pp. 167–174, 2010.
- van der Maaten, Laurens. Accelerating t-SNE using tree-based algorithms. Journal of Machine Learning Research, 15(1):3221–3245, 2014.
- van der Maaten, Laurens, and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.
- Macosko, Evan Z., Anindita Basu, Rahul Satija, James Nemesh, Karthik Shekhar, Melissa Goldman, Itay Tirosh, Allison R. Bialas, Nolan Kamitaki, Emily M. Martersteck, et al. Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell, 161(5):1202–1214, 2015.
- Shaham, Uri, and Stefan Steinerberger. Stochastic Neighbor Embedding separates well-separated clusters. Preprint, arXiv:1702.02670, 2017.