Visualizing probabilistic models and data with Intensive Principal Component Analysis

Katherine N Quinn; Colin B Clement; Francesco De Bernardis; Michael D Niemack; James P Sethna

doi:10.1073/pnas.1817218116

Proc Natl Acad Sci U S A. 2019 Jun 24;116(28):13762–13767.

PMCID: PMC6628833 PMID: 31235593

Significance

We introduce Intensive Principal Component Analysis (InPCA), a widely applicable manifold-learning method to visualize general probabilistic models and data. Using replicas to tune dimensionality in high-dimensional data, we use the zero-replica limit to discover a distance metric, which preserves distinguishability in high dimensions, and an embedding with superior visualization performance. We apply InPCA to the model of cosmology which predicts the angular power spectrum of the cosmic microwave background, allowing visualization of the space of model predictions (i.e., different universes).

Keywords: manifold learning, information theory, probabilistic models, probabilistic data, visualization

Abstract

Unsupervised learning makes manifest the underlying structure of data without curated training and specific problem definitions. However, the inference of relationships between data points is frustrated by the “curse of dimensionality” in high dimensions. Inspired by replica theory from statistical mechanics, we consider replicas of the system to tune the dimensionality and take the limit as the number of replicas goes to zero. The result is intensive embedding, which not only is isometric (preserving local distances) but also allows global structure to be more transparently visualized. We develop the Intensive Principal Component Analysis (InPCA) and demonstrate clear improvements in visualizations of the Ising model of magnetic spins, a neural network, and the dark energy cold dark matter ( $Λ CDM$ ) model as applied to the cosmic microwave background.

Visualizing high-dimensional data is a cornerstone of machine learning, modeling, big data, and data mining. These fields require learning faithful and interpretable low-dimensional representations of high-dimensional data and, almost as critically, producing visualizations which allow interpretation and evaluation of what was learned (1–4). Unsupervised learning, which infers features from data without manually curated data or specific problem definitions (5), is especially important for high-dimensional, big data applications in which specific models are unknown or impractical. For high dimensions, the relative distances between features become small and most points are orthogonal to one another (6). A trade-off between preserving local and global structure must often be made when inferring a low-dimensional representation. Classic manifold learning techniques include linear methods such as principal component analysis (PCA) (7) and multidimensional scaling (MDS) (8), which preserve global structure but at the cost of obscuring local features. Existing nonlinear manifold learning techniques, such as t-distributed stochastic neighbor embedding (t-SNE) (9) and diffusion maps (10), preserve the local structure while maintaining only some qualitative global patterns such as large clusters. Uniform manifold approximation and projection (UMAP) (11) better preserves topological structures in data, a global property.

In this article, we develop a nonlinear manifold learning technique which achieves a compromise between preserving local and global structure. We accomplish this by developing an isometric embedding for general probabilistic models based on the replica trick (12). Taking the number of replicas to zero, we reveal an intensive property—an information density characterizing the distinguishability of distributions—ameliorating the canonical orthogonality problem and “curse of dimensionality.” We then describe a simple, deterministic algorithm that can be used for any such model, which we call Intensive Principal Component Analysis (InPCA). Our method quantitatively captures global structure while preserving local distances. We first apply InPCA to the canonical Ising model of magnetism, which inspired the zero-replica limit. Next, we show how InPCA can capture and summarize the learning trajectory of a neural network. Finally, we visualize the dark energy cold dark matter ( $Λ CDM$ ) model as applied to the cosmic microwave background (CMB), using InPCA, t-SNE, and diffusion maps.

Model Manifolds of Probability Distributions

Any measurement obtained from an experiment with uncertainty can generally be understood as a probability distribution. For example, when some data $x$ are observed with normally distributed noise $ξ$ of variance $σ^{2}$ , under experimental conditions $θ_{j}$ , a model is expressed as

x = f (θ_{j}) + ξ \quad \text{where} \quad L (ξ) \sim N (0, σ^{2}),

[1]

and $f (θ_{j})$ is a prediction given the experimental conditions. This relationship is equivalent to saying that the probability of measuring data $x$ given some conditions $θ$ is

L (x ∣ θ) \sim N (f (θ), σ^{2}) .

[2]

More complicated noise profiles with asymmetry or correlations can be accommodated with this picture. Measurements without an underlying model can also be seen as distributions, where a measurement $x_{i}$ with uncertainty $σ$ can induce a probability $L (x ∣ x_{i}, σ)$ of observing new data $x$ .

We define a probabilistic model $L (x ∣ θ)$ , the likelihood of observing data $x$ given parameters $θ$ . The model manifold is defined as the set of all possible predictions, $\{L (x ∣ θ_{i})\}$ , which is a surface parameterized by the model parameters $\{θ_{i}\}$ . The parameter directions related to the longest distances along the model manifold have been shown to predict emergent behavior (how microscopic parameters lead to macroscopic behavior) (13). We will see that InPCA orders its principal components by the length of the model manifold along their direction, highlighting global structure. The boundaries of the model manifold represent simplified models which retain predictive power (14), and the constraint of data lying near the model manifold has been used to optimize experimental design (15). In this article, we study the Ising model, which defines probabilities of spin configurations given interaction strengths; a neural network, which predicts the probability of an image representing a single handwritten digit given weights and biases; and $Λ CDM$ , which predicts the distribution of CMB radiation given fundamental constants of nature.

Hypersphere Embedding

We promised an embedding which both is isometric and preserves global structures. We satisfy the first promise by considering the hypersphere embedding

\{z_{x} (θ_{i})\} = \{2 \sqrt{L (x ∣ θ_{i})}\},

[3]

where the normalization constraint of $L (x ∣ θ)$ forces $z_{x}$ to lie on the positive orthant of a sphere. A natural measure of distance on the hypersphere is the Euclidean distance, in this case also known as the Hellinger divergence (16)

\begin{matrix} d^{2} (θ_{1}, θ_{2}) & = & {∥z (θ_{1}) - z (θ_{2})∥}^{2} \\ & = & 8 (1 - \sqrt{L (x ∣ θ_{1})} \cdot \sqrt{L (x ∣ θ_{2})}), \end{matrix}

[4]

where $\cdot$ represents the inner product over $x$ . Now we can see that the hypersphere embedding is isometric: The Euclidean metric of this embedding is equal to the Fisher information metric $I$ of the model manifold (17),

d^{2} (z_{i}, z_{i} + d z_{i}) = \sum_{i} d z_{i} d z_{i} = \sum_{k l} I_{k l} d θ_{k} d θ_{l} .

[5]

The Fisher information metric (FIM) is the natural metric of the model manifold (18), so the hypersphere embedding preserves the local structure of the manifold defined by $L (x ∣ θ)$ .
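The hypersphere embedding and its distance can be checked numerically. The following is a minimal sketch, assuming discrete distributions stored as NumPy arrays; the two example distributions and the helper names `hypersphere_embedding` and `hellinger_sq` are hypothetical. Because $∥z∥^{2} = 4$ for any normalized likelihood, expanding the squared norm gives $8 (1 - ⟨θ_{1}; θ_{2}⟩)$, where the bracket is the inner product of the square-root likelihoods.

```python
import numpy as np

def hypersphere_embedding(prob):
    """Map a normalized probability vector to z = 2 * sqrt(L), as in Eq. 3."""
    prob = np.asarray(prob, dtype=float)
    return 2.0 * np.sqrt(prob)

def hellinger_sq(p1, p2):
    """Squared Euclidean distance between two embedded distributions."""
    z1, z2 = hypersphere_embedding(p1), hypersphere_embedding(p2)
    return float(np.sum((z1 - z2) ** 2))

# Two hypothetical distributions over three outcomes (illustration only).
p1 = np.array([0.7, 0.2, 0.1])
p2 = np.array([0.1, 0.3, 0.6])

overlap = float(np.sum(np.sqrt(p1 * p2)))  # inner product of sqrt-likelihoods
d2 = hellinger_sq(p1, p2)

print("squared radius:", np.sum(hypersphere_embedding(p1) ** 2))
print("d^2:", d2, " 8 * (1 - overlap):", 8.0 * (1.0 - overlap))
```

Since the overlap lies between 0 and 1, the squared distance is bounded by 8, which is the saturation value reached when two distributions become orthogonal.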

As the dimension of the data increases, almost all features become orthogonal to each other, and most measures of distance lose their ability to discriminate between the smallest and largest distances (19). For the hypersphere embedding, we see that as the dimension of $x$ increases, the inner product in the Hellinger distance of Eq. 4 becomes smaller as the probability is distributed over more dimensions. In the limit of large dimension, all nonidentical pairs of points become orthogonal and equidistant around the hypersphere (a constant distance $\sqrt{8}$ apart), frustrating effective dimensional reductions and visualization.

To illustrate this problem with the hypersphere embedding, consider the Ising model, which predicts the likelihood of observing a particular configuration of binary random variables (spins) on a lattice. The probability of a spin configuration is determined by the Boltzmann distribution and is a function of a local pairwise coupling and a global applied field. The dimension is determined by the number of spin configurations, $2^{N}$ , where $N$ is the number of spins. Holding temperature fixed at one, we vary $h$ and $J$ : external magnetic field ( $h \in (- 1.3, 1.3)$ ) and nearest-neighbor coupling ( $J \in (- 0.4, 0.6)$ ), using a Monte Carlo method weighted by the Jeffreys prior to sample 12,000 distinct points. From the resulting set of parameters, we compute $X_{i j} = z_{i} (θ_{j})$ using the Boltzmann distribution and visualize the model manifold in the $N$ -sphere embedding of Eq. 3 by projecting the predictions onto the first three principal components of $X$ . Fig. 1A shows this projection of the model manifold of a $2 \times 2$ Ising model which is embedded in $2^{4}$ dimensions. Fig. 1B shows a larger, $4 \times 4$ Ising model, of dimension $2^{16}$ . As the dimension is increased from $2^{4}$ to $2^{16}$ , we see the points starting to wrap around the hypersphere, becoming increasingly equidistant and less distinguishable.
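As a rough illustration of how one point of this manifold could be computed, the sketch below enumerates all $2^{4}$ configurations of a $2 \times 2$ lattice and evaluates the Boltzmann distribution at temperature one. The energy convention $E = - J \sum s_{a} s_{b} - h \sum s_{a}$ and the open (free) boundaries are assumptions made here for illustration; the text does not specify them, and the function name `ising_distribution` is hypothetical.

```python
import itertools
import numpy as np

def ising_distribution(J, h, n=2):
    """Boltzmann probabilities of all 2**(n*n) spin configurations at T = 1.

    Assumes open boundaries and E = -J * sum_<ab> s_a s_b - h * sum_a s_a.
    """
    configs = np.array(list(itertools.product([-1, 1], repeat=n * n)))
    grids = configs.reshape(-1, n, n)
    # Sum of nearest-neighbor products along rows and along columns.
    bonds = ((grids[:, :, :-1] * grids[:, :, 1:]).sum(axis=(1, 2))
             + (grids[:, :-1, :] * grids[:, 1:, :]).sum(axis=(1, 2)))
    energy = -J * bonds - h * grids.sum(axis=(1, 2))
    weights = np.exp(-(energy - energy.min()))  # shift for numerical stability
    return weights / weights.sum()

probs = ising_distribution(J=0.3, h=0.5)
z = 2.0 * np.sqrt(probs)  # one point of the hypersphere embedding, Eq. 3

print("dimension:", probs.size, " squared norm of z:", np.sum(z ** 2))
```

Repeating this for each sampled pair $(h, J)$ and stacking the resulting vectors would give the matrix $X$ whose principal components are plotted in Fig. 1.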

Fig. 1. — (*A–C*) Hypersphere embedding, illustrating an embedding of the 2D Ising model. Points were generated through a Monte Carlo sampling and visualized by projecting the probability distributions onto the first three principal components (28). The points are colored by magnetic field strength. As the system size increases from $2 \times 2$ to $4 \times 4$ , the orthogonality problem is demonstrated by an increase in “wrapping” around the hypersphere. This effect can also be produced by instead considering four replicas of the original system, motivating the replica trick which takes the embedding dimension or number of replicas to zero.

A natural way to increase the dimensionality of a probabilistic model is by drawing multiple samples from the distribution. If $D$ is the dimension of $x$ , then $N$ identical draws from the distribution will have dimension $D^{N}$ . The more samples drawn, the easier it is to distinguish between distributions, mimicking the curse of dimensionality for large systems. We see this demonstrated for our Ising model in Fig. 1C, where we drew four replica samples from the same model. Note that compared with the original $2 \times 2$ model, the model manifold of the four-replica $2 \times 2$ model “wraps” more around the hypersphere, just like the larger, $4 \times 4$ Ising model. High-dimensional systems have “too much information,” in the same way that large numbers of samples have too much information. In the next section, we consider the contraposition of the insight that a large number of replicas leads to the curse of dimensionality and discover an embedding which not only is isometric but also ameliorates the high-dimensional wrapping around the $N$ -sphere.
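The effect of replicas can be reproduced with a few lines of NumPy. This is a minimal sketch under the assumption of independent draws, so that the joint distribution of $N$ replicas is an $N$-fold Kronecker product; the example distributions and the helper names `overlap` and `replicate` are hypothetical. The inner product of the replicated square-root likelihoods is the single-system inner product raised to the power $N$, so the hypersphere distance climbs toward its saturation value as replicas are added.

```python
import numpy as np

def overlap(p1, p2):
    """Inner product of the square-root likelihoods."""
    return float(np.sum(np.sqrt(p1 * p2)))

def replicate(p, n_replicas):
    """Joint distribution of n independent draws; its dimension is D**n."""
    joint = np.array([1.0])
    for _ in range(n_replicas):
        joint = np.kron(joint, p)
    return joint

# Two hypothetical distributions over D = 3 outcomes.
p1 = np.array([0.7, 0.2, 0.1])
p2 = np.array([0.5, 0.3, 0.2])

for n in (1, 2, 4, 8):
    q1, q2 = replicate(p1, n), replicate(p2, n)
    d2 = 8.0 * (1.0 - overlap(q1, q2))  # squared hypersphere distance
    print(f"replicas={n}  dimension={q1.size}  d^2={d2:.4f}")
```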

Replica Theory and the Intensive Embedding

We saw in Fig. 1 that increasing the dimension of the data led to a saturation of the distance function Eq. 4. This problem is referred to as the loss of relative contrast or the concentration of distances (19), and to overcome it requires a non-Euclidean distance function, discussed below. In the previous section we saw the same saturation of distance could be achieved by adding replicas, increasing the embedding dimension. Fig. 2A shows this process taken to an extreme: the model manifold of the $2 \times 2$ Ising model with the number of replicas taken to infinity. All of the points cluster together, obscuring the fact that the underlying manifold is 2D. To cure the abundance of information which makes all points on the hypersphere equidistant, we seek an intensive distance, such as the distance per number of replicas observed. Next, because the limit of many replicas artificially leads to the same symptoms of the curse of dimensionality, we consider the limit of zero replicas, a procedure which is often used in the study of spin glasses and disordered systems (20). Fig. 2B shows the result of this analysis, the intensive embedding, where the distance concentration has been cured, and the inherent 2D structure of the Ising model has been recovered.

Fig. 2. — Replicated Ising model illustrating the derivation of our intensive embedding. All points are colored by magnetic field strength. (A) Large dimensions are characterized by large system sizes; here we mimic a $128 \times 128$ Ising model which is of dimension $2^{128^{2}}$ . The orthogonality problem becomes manifest as all points are effectively orthogonal, producing a useless visualization with all points clustered in the cusp. (B) Using replica theory, we tune the dimensionality of the system and consider the limit as the number of replicas goes to zero. In this way, we derive our intensive embedding. Note that the z axis reflects a negative-squared distance, a property which allows violations of the triangle inequality and is discussed in the text.

To find the intensive embedding, we must first find the distance between replicated models. The likelihood for $N$ replicas of a system is given by their product

L^{(N)} (\{x_{1}, \dots, x_{N}\} ∣ θ) = L (x_{1} ∣ θ) \cdots L (x_{N} ∣ θ),

[6]

where the set $\{x_{1}, \dots, x_{N}\}$ represents the observed data in the replicated systems. Writing the inner product, or cosine of the angle, between two distributions as

⟨θ_{1}; θ_{2}⟩ = \sqrt{L (x ∣ θ_{1})} \cdot \sqrt{L (x ∣ θ_{2})},

[7]

and using Eq. 4, the distance per replica $d_{N}^{2}$ between two points on the model manifold is

d_{N}^{2} (θ_{1}, θ_{2}) = \frac{d^{2} (θ_{1}, θ_{2})}{N} = - 8 \frac{{⟨θ_{1}; θ_{2}⟩}^{N} - 1}{N} .

[8]

We are now poised to define the intensive distance by taking the number of replicas to zero:

d_{I}^{2} (θ_{1}, θ_{2}) = \lim_{N \to 0} d_{N}^{2} (θ_{1}, θ_{2}) = - 8 \log ⟨θ_{1}; θ_{2}⟩ .

[9]

The last equality is achieved using the standard trick in replica theory, $(x^{N} - 1) / N \to \log x$ as $N \to 0$ , which is a basic trick used to solve challenging problems in statistical physics (20). The trick is most evident using the identity $x^{N} = \exp (N \log x) \approx 1 + N \log x$ . One can check that the intensive distance is isometric,

d_{I}^{2} (θ, θ + δ θ) = δ θ^{α} δ θ^{β} g_{α β} = δ θ^{α} δ θ^{β} I_{α β},

[10]

where again $I$ is the Fisher information metric in Eq. 5, so that we can be confident the intensive embedding distance preserves local structures.
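Both the zero-replica limit and the isometry can be verified numerically for a simple case. The sketch below uses a biased coin, for which the Fisher information is $1 / (p (1 - p))$ in terms of the heads probability $p$; the choice of model, the specific values, and the function names are assumptions made here for illustration.

```python
import numpy as np

def overlap(p, q):
    """Inner product <p; q> for two coins with heads probabilities p and q."""
    return np.sqrt(p * q) + np.sqrt((1 - p) * (1 - q))

def d2_per_replica(p, q, n):
    """Distance per replica of Eq. 8, valid for noninteger n as well."""
    return -8.0 * (overlap(p, q) ** n - 1.0) / n

def d2_intensive(p, q):
    """Zero-replica limit of Eq. 9."""
    return -8.0 * np.log(overlap(p, q))

p, q = 0.3, 0.6
for n in (1.0, 0.1, 0.001):
    print(f"N={n}: d_N^2={d2_per_replica(p, q, n):.6f}  d_I^2={d2_intensive(p, q):.6f}")

# Isometry check: for a nearby coin, d_I^2 should approach I(p) * delta**2.
delta = 1e-3
fisher = 1.0 / (p * (1 - p))
ratio = d2_intensive(p, p + delta) / (fisher * delta ** 2)
print("ratio of d_I^2 to Fisher quadratic form:", ratio)
```

The ratio tends to one as `delta` shrinks, which is the statement of Eq. 10 for this one-parameter model.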

Importantly, the intensive distance does not satisfy the triangle inequality (and is thus non-Euclidean): The distance between points on the hypersphere can go to infinity, rather than lie constrained to the finite radius of the hypersphere embedding. Because of this, the intensive embedding can overcome the loss of relative contrast (19) discussed at the beginning of this section. Distances in the intensive embedding maintain distinguishability in high dimensions, as illustrated in Fig. 2B, wherein the 2D nature of the Ising model has been recovered. We hypothesize that this process, which cures the curse of dimensionality for models with too many samples, will also cure it for models with intrinsically high dimensionality. The intensive distance obtained here is proportional to the Bhattacharyya distance (21). Considering the zero-replica limit of the Hellinger divergence, we discovered a way to derive the Bhattacharyya distance. The importance of this is discussed further in the following section.

Connection to Least Squares.

Consider the concrete and canonical paradigm of models $f_{i} (θ)$ with data points $x_{i}$ and additive white Gaussian noise, usually called a nonlinear least-squares model. The likelihood $L (x ∣ θ)$ is defined by

- \log L (x ∣ θ) = \sum_{i} \frac{{(f_{i} (θ) - x_{i})}^{2}}{2 σ_{i}^{2}} + \log Z (θ),

[11]

where $Z$ sets the normalization. A straightforward evaluation of the intensive distance given by Eq. 9 finds for the case of nonlinear least squares that

d_{I}^{2} (θ_{1}, θ_{2}) = \sum_{i} \frac{{(f_{i} (θ_{1}) - f_{i} (θ_{2}))}^{2}}{σ_{i}^{2}},

[12]

so that the intensive distance is simply the variance-scaled Euclidean distance between model predictions.
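Eq. 12 can be checked against a direct numerical evaluation of Eq. 9. The sketch below treats a single data point with Gaussian noise and approximates the overlap integral of the square-root likelihoods by a Riemann sum; the prediction values, noise level, grid settings, and function names are hypothetical. For several independent data points the overlap factorizes, so the logarithm turns the product into the sum over $i$ that appears in Eq. 12.

```python
import numpy as np

def gaussian_pdf(x, mean, sigma):
    """Normal density with the given mean and standard deviation."""
    return np.exp(-0.5 * ((x - mean) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def intensive_distance_sq(f1, f2, sigma, half_width=12.0, n_grid=200001):
    """-8 log of the overlap integral of sqrt(L1 * L2), via a Riemann sum."""
    center = 0.5 * (f1 + f2)
    x = np.linspace(center - half_width * sigma, center + half_width * sigma, n_grid)
    dx = x[1] - x[0]
    integrand = np.sqrt(gaussian_pdf(x, f1, sigma) * gaussian_pdf(x, f2, sigma))
    return -8.0 * np.log(np.sum(integrand) * dx)

f1, f2, sigma = 1.0, 2.5, 0.7  # hypothetical predictions and noise level
numeric = intensive_distance_sq(f1, f2, sigma)
closed_form = (f1 - f2) ** 2 / sigma ** 2

print("numerical:", numeric, " closed form:", closed_form)
```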

Intensive Principal Component Analysis

Classical PCA takes a set of data examples and infers features which are linearly uncorrelated (7). The features to be analyzed with PCA are compared via their Euclidean distance. Can we generalize this comparison to use our intensive embedding distance? Given a matrix of data examples $X \in R^{m \times p}$ (with features along the rows), PCA first requires the mean-shifted matrix $M_{i j} = X_{i j} - {\bar{X}}_{i} = P X$ , where $P_{i j} = δ_{i j} - 1 / p$ is the mean-shift projection matrix and $p$ is the number of sampled points. The covariance and its eigenvalue decomposition are then

\mathrm{cov} (X, X) = \frac{1}{p} M^{T} M = X^{T} P P X = V Σ V^{T},

[13]

where the orthogonal columns of the matrix $V$ are the natural basis onto which the rows of $M$ are projected,

M V = (U D V^{T}) V = U D = U \sqrt{Σ},

[14]

where the columns of $U \sqrt{Σ}$ are called the principal components of the data $X$ .

The principal components can also be obtained from the cross-covariance matrix, $M M^{T}$ , since

M M^{T} = P X X^{T} P = (U D V^{T}) {(U D V^{T})}^{T} = U Σ U^{T} .

[15]

The eigenbasis $U$ of the cross-covariance is the natural basis for the components of the data, and the eigenbasis $V$ of the covariance is the natural basis of the data points. For us this flexibility is invaluable, as the cross-covariance is more natural for expressing the distances between distributions of different parameters.
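The equivalence of the two routes to the principal components can be seen in a short numerical experiment. This is a sketch under stated assumptions: the rows of `X` are taken to be the $p$ sampled points (so that the product $P X$ with the $p \times p$ matrix $P$ is defined), the data are synthetic, and overall $1 / p$ normalizations are dropped because they only rescale the eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)
p, m = 50, 5  # p sampled points, m features (assumed layout: rows are points)
X = rng.normal(size=(p, m)) @ np.diag([3.0, 2.0, 1.0, 0.5, 0.1])

P = np.eye(p) - np.ones((p, p)) / p  # mean-shift projection, P_ij = delta_ij - 1/p
M = P @ X

# Route 1: eigenvectors V of M^T M, then project the rows of M onto them.
evals_v, V = np.linalg.eigh(M.T @ M)
order_v = np.argsort(evals_v)[::-1]
pcs_cov = M @ V[:, order_v]

# Route 2: eigenvectors U of M M^T, scaled by the square roots of the eigenvalues.
evals_u, U = np.linalg.eigh(M @ M.T)
order_u = np.argsort(evals_u)[::-1][:m]
pcs_gram = U[:, order_u] * np.sqrt(np.clip(evals_u[order_u], 0.0, None))

# Each component is defined only up to an overall sign.
print(np.allclose(np.abs(pcs_cov), np.abs(pcs_gram), atol=1e-6))
```

The second route needs only the $p \times p$ matrix of inner products between sampled points, which is why it carries over to settings where only pairwise overlaps between distributions are available.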

Writing our data matrix as $X_{i j} = z_{i} (θ_{j})$ using Eq. 3 for replicated systems, the cross-covariance is

\begin{matrix} {(M M^{T})}_{i j}^{(N)} & = & {(P X X^{T} P)}_{i j} \\ & = & (z (θ_{i}) - \bar{z}) \cdot (z (θ_{j}) - \bar{z}) \\ & = & 4 {⟨θ_{i}; θ_{j}⟩}^{N} + \frac{4}{p^{2}} \sum_{k, k' = 1}^{p} {⟨θ_{k}; θ_{k'}⟩}^{N} \\ & & - \frac{4}{p} \sum_{k = 1}^{p} ({⟨θ_{i}; θ_{k}⟩}^{N} + {⟨θ_{j}; θ_{k}⟩}^{N}), \end{matrix}

[16]

where $\bar{z}$ is the average over all sampled parameters, and we used the definition of $z$ in Eq. 3 applied to the replicated likelihood of Eq. 6. As with the intensive embedding, we can take the limit as the number of replicas goes to zero to find

W_{i j} = \lim_{N \to 0} \frac{1}{N} {(M M^{T})}_{i j}^{(N)} .

[17]

Explicitly, the intensive cross-covariance matrix is

W_{i j} = 4 \log ⟨θ_{i}; θ_{j}⟩ + \frac{4}{p^{2}} \sum_{k, k' = 1}^{p} \log ⟨θ_{k'}; θ_{k}⟩ - \frac{4}{p} \sum_{k = 1}^{p} (\log ⟨θ_{i}; θ_{k}⟩ + \log ⟨θ_{j}; θ_{k}⟩)

[18]

= {(P L P)}_{i j},

[19]

where $L_{i, j} = 4 \log ⟨θ_{i}; θ_{j}⟩$ and $P$ is the same projection matrix as defined above. In taking the limit of zero replicas, the structure of the cross-covariance has transformed

P X X^{T} P \underset{N \to 0}{\to} P L P,

[20]

and thus the symmetric Wishart structure is lost. It is therefore possible to obtain negative eigenvalues in this decomposition, which give rise to imaginary components in the projections. Note the similarity between the form of this cross-covariance and the double-centered distance matrix used in PCA and multidimensional scaling (MDS). This arises because both InPCA and PCA/MDS rely on mean shifting the input data before finding an eigenbasis. Thus, we view InPCA as a natural generalization of PCA to probability distributions and MDS to non-Euclidean embeddings.

In summary, InPCA is achieved by the following procedure: (i) Compute the cross-covariance matrix from a set of probability samples: Compute $W_{i j}$ as derived in Eq. 18. (ii) Compute the eigenvalue decomposition $W = U Σ U^{T}$ . (iii) Compute the coordinate projections, $T = U \sqrt{Σ}$ . (iv) Plot the projections using the columns of $T$ .
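The four steps above can be sketched in NumPy as follows. This is an illustrative implementation under stated assumptions, not the authors' code: the input is assumed to be a $p \times D$ array of discrete, normalized distributions with strictly positive pairwise overlaps; components are ordered by the magnitude of their eigenvalues, which is a choice made here; and the function name `inpca` and the Dirichlet test data are hypothetical. Negative eigenvalues produce purely imaginary columns, so the coordinates are returned as a complex array.

```python
import numpy as np

def inpca(probs, n_components=3):
    """Intensive PCA for a (p, D) array whose rows are normalized distributions.

    Returns (T, eigenvalues). T is complex: columns belonging to negative
    eigenvalues are purely imaginary.
    """
    probs = np.asarray(probs, dtype=float)
    p = probs.shape[0]
    roots = np.sqrt(probs)
    overlap = np.clip(roots @ roots.T, None, 1.0)  # <theta_i; theta_j>, Eq. 7
    L = 4.0 * np.log(overlap)                      # assumes overlaps are > 0
    P = np.eye(p) - np.ones((p, p)) / p            # mean-shift projection
    W = P @ L @ P                                  # step (i), Eqs. 18 and 19
    W = 0.5 * (W + W.T)                            # remove round-off asymmetry
    evals, U = np.linalg.eigh(W)                   # step (ii)
    order = np.argsort(np.abs(evals))[::-1][:n_components]
    evals, U = evals[order], U[:, order]
    T = U * np.sqrt(evals.astype(complex))         # step (iii)
    return T, evals

rng = np.random.default_rng(1)
probs = rng.dirichlet(np.ones(6), size=20)  # 20 hypothetical distributions
T, evals = inpca(probs)
print("eigenvalues of the leading components:", evals)
```

For step (iv), one would plot the real part of each column with a positive eigenvalue and the imaginary part of each column with a negative eigenvalue, keeping track of which axes are imaginary.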

Neural Network MNIST Digit Classifier.

To demonstrate the utility of InPCA, we use it to visualize the training of a two-layer convolutional neural network (CNN), constructed using TensorFlow (22), trained on the MNIST dataset of handwritten digits (23). A set of 55,000 images was used to train the network, which was then used to predict the likelihood that each of an additional set of 10,000 images is classified as a specific digit between 0 and 9. We use softmax (24) to calculate the probabilities from the category estimates supplied by the network. The CNN defines the likelihood $L (x ∣ θ)$ that some input image $θ$ contains the image of a particular handwritten digit $x$ . The InPCA projections of the CNN output in Fig. 3 visualize the clustering learned by the CNN as a function of the number of learning epochs. The initialized network’s model manifold shows no knowledge of the digits (colored dots), but as training commences, the network separates digits into separate regions of its manifold (Movie S1). InPCA can be used as a fast, interpretable, and deterministic method for qualitatively evaluating what a neural network has learned.

Fig. 3. — Stages of training a CNN. Each point in the 3D projections represents one of 10,000 test images supplied to the CNN (29). At the first epoch, the neural network is untrained and so is unable to reliably classify images, with about a 90% error rate, an effect reflected in the cloud of points. As training progresses and error rate decreases, the cloud begins to cluster as shown by InPCA at the 20th epoch. Finally, when completely trained, the clustered regions are manifest at the 2000th epoch with 10 clusters representing the 10 digits.

Properties of the Intensive Embedding and InPCA

The space characterized by our intensive embedding has two weird properties: First, it is formally 1D, yet there are multiple orthogonal directions upon which it can be projected; and second, it is Minkowski-like, in that it has negative squared distances, violating the triangle inequality. We posit that, fundamentally, this second property is what allows InPCA to cure the orthogonality problem.

We begin with a discussion of the 1D nature of the embedding space. The embedding dimension is given by $D^{N}$ , where $D$ is the original dimension of data $x$ and $N$ is the number of replicas. In the case of noninteger replicas the space becomes “fractional” in dimension and in the limit of zero replicas ultimately goes to one. However, it is still possible to obtain projections themselves along the dominant components of this space, by leveraging the cross-covariance instead of the covariance, summarized in step ii of our algorithm. Visualizations produced by InPCA are cross-sections of a space of the dimension equal to the number of sampled points of the model manifold $p$ , instead of the dimension $D$ or $D^{N}$ .

In the limit of zero replicas in Eq. 18, the positive-definite, Wishart structure of the cross-covariance matrix is lost. It is therefore possible to have negative squared distances. The non-Euclidean nature of the embedding (flat, but Minkowski-like) does not suffer from the concentration of distances which plagues Euclidean measures in high dimensions, thus allowing the model manifold to be “unwound” from the $N$ -sphere and for InPCA to produce useful, low-dimensional representations.

Finally, the eigenvalues of InPCA correspond to the cross-sectional widths of the model manifold. We see this quite explicitly with the following example of a biased coin (specifically, in Fig. 4B) where the eigenvalues extracted from InPCA map directly to the manifold widths measured along the direction of the corresponding InPCA eigenvector. Therefore, we see that InPCA produces a hierarchy of directions, ordered by the global widths of the model manifold. Note that, as with classical PCA, this correspondence depends on how faithfully the model manifold was originally sampled; that is, InPCA can tell you about the structure of the manifold only from observed points.

Fig. 4. — InPCA visualization of biased coins (30). (A) The first two InPCA components correspond to the coin bias and variance, yet the first one is real and the second one is imaginary (the aspect ratio between axes is one). The contour lines represent constant distances from a fair coin and are hyperbolas: Points can be a finite distance from a fair coin yet an infinite distance from each other. (B) The ordered eigenvalues correspond to the manifold lengths, illustrating the hierarchical nature of the components extracted from InPCA.

Biased Coins.

To illustrate the properties of InPCA, we use it to visualize a simple probabilistic model, that of a simple biased coin. A biased coin has one parameter, the odds ratio of heads to tails, and so forms a 1D manifold. Fig. 4A shows the first two InPCA components for the manifold of biased coins, for 2,000 sampled points with probabilities uniformly spread between 0 and 1 (excluding the endpoints, since they are orthogonal and thus are infinitely far apart). The two extracted InPCA components correspond to the bias and variance of the coin, respectively. The hierarchy of components extracted from InPCA therefore corresponds to known features of the model (i.e., they are meaningful).

The importance of the negative-squared distances is illustrated in Fig. 4. The contour lines represent constant distances from a fair coin and are hyperbolas: Points can be a finite distance from a fair coin yet an infinite distance from each other. As two oppositely biased coins become increasingly biased, their distance from each other can go to infinity (because an outcome of a coin which always lands on heads will never be the same as an outcome of a coin which always lands on tails) yet all points remain a finite distance from a fair coin. Note that all points are in the left and right portions of Fig. 4A, representing net positive distances (the intensive pairwise distances are all positive).
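A small numerical experiment makes the non-Euclidean character of the coin manifold concrete. The sketch below assumes a coarse grid of 99 coins with heads probabilities from 0.01 to 0.99 (far fewer than the 2,000 points used for Fig. 4) and builds the intensive cross-covariance from the pairwise intensive distances; the grid and variable names are hypothetical. It checks that a strongly biased pair is farther apart than the detour through the fair coin, which violates the triangle inequality, and that the eigenvalue spectrum contains both signs.

```python
import numpy as np

# Coarse grid of heads probabilities; endpoints 0 and 1 are excluded.
bias = np.linspace(0.01, 0.99, 99)
probs = np.column_stack([bias, 1.0 - bias])

roots = np.sqrt(probs)
overlap = np.clip(roots @ roots.T, None, 1.0)
d2 = -8.0 * np.log(overlap)  # squared intensive distances, Eq. 9

p = len(bias)
P = np.eye(p) - np.ones((p, p)) / p
W = -0.5 * P @ d2 @ P        # equals P L P with L = 4 log<theta_i; theta_j>
evals = np.linalg.eigvalsh(0.5 * (W + W.T))

# Coins at 0.01, 0.50 (fair), and 0.99.
i_lo, i_mid, i_hi = 0, p // 2, p - 1
d = np.sqrt(d2)
print("direct distance:", d[i_lo, i_hi])
print("via the fair coin:", d[i_lo, i_mid] + d[i_mid, i_hi])
print("largest and smallest eigenvalues:", evals.max(), evals.min())
```

A negative eigenvalue is what produces an imaginary coordinate axis in the projection, in line with the real and imaginary components described for Fig. 4A.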

Comparing with t-SNE and Diffusion Maps.

We compare our manifold learning technique to two standard methods, t-SNE and diffusion maps, by applying each one to the six-parameter $Λ CDM$ cosmological model predictions of the CMB. The $Λ CDM$ model predicts $L (x ∣ θ)$ , where $x$ represents fluctuations in the CMB, and $θ$ are the different cosmological parameters (i.e., it predicts the angular power spectrum of temperature and polarization anisotropies in sky maps of the CMB). Observations of the CMB from telescopes on satellites, balloons, and the ground provide thousands of independent measurements from large angular scales to a few arcminutes, which are used to fit model parameters. Here we consider only CMB observations from the 2015 Planck data release (25). The $Λ CDM$ model we consider has six parameters: the Hubble constant ( $H_{0}$ ), which we sampled in a range of 20–100 $km \cdot s^{- 1} \cdot {Mpc}^{- 1}$ ; the physical baryon density ( $Ω_{b} h^{2}$ ) and the physical cold dark matter density ( $Ω_{c} h^{2}$ ), both sampled from 0.0009 to 0.8; the primordial fluctuation amplitude ( $A_{s}$ ), sampled from $10^{- 11}$ to $10^{- 8}$ ; the scalar spectral index ( $η$ ), sampled from 0 to 0.98; and the optical depth at reionization ( $τ$ ), sampled from 0.001 to 0.9.

To determine the likelihood functions, we use the Code for Anisotropies in the Microwave Background software package to generate power spectra (26). We perform a Monte Carlo sampling of 50,000 points around the best-fit parameters provided by the 2015 Planck data release (25), with sample weights based on the intensive distance to the best fit.

In Fig. 5 we show the first two components of the manifold embedding for InPCA, t-SNE, and diffusion maps. To apply t-SNE and the diffusion map to probabilistic data we must provide a distance. We therefore use our intensive distance, from Eq. 9, for consistency and ease of comparison. In all three cases, the first component from each method is directly related to the primordial fluctuation amplitude $A_{s}$ , which reflects the amplitude of density fluctuations in the early universe and is the dominant feature in real data (25). The second InPCA component predicts the Hubble constant, whereas the diffusion map predicts the scalar spectral index (a reflection of the size variance of primordial density fluctuations). In all cases, the projected components were plotted against the corresponding parameters to determine correlations, such as how one can see that $A_{s}$ corresponds with the first component in all three cases. Detailed plots and correlation coefficients for all three methods are provided in SI Appendix.

Such stark differences between manifold learning methods are surprising, as all techniques aim to extract important features in the data distribution, i.e., important geometric features in the manifolds. Given the ranges of sampled parameters, one would expect the variation in the Hubble constant to relate in some way to one of the dominant components, as it does for InPCA. Figures illustrating the effect of different parameters are provided in SI Appendix, following results from ref. 27.

There are two important differences between InPCA and other methods. First, InPCA has no tunable parameters and yields a geometric object defined entirely by the model distribution. For example, t-SNE embeddings rely on parameters such as the perplexity, a learning rate, and a random seed (yielding nondeterministic results), and the diffusion maps rely on a diffusion parameter and choice of diffusion operator, all of which must be manually optimized to obtain good results. Second, t-SNE and diffusion maps embed manifolds in Euclidean spaces in a way which aims to preserve local features. However, InPCA seeks to preserve both global and local features, by embedding manifolds in a non-Euclidean space.

Summary

In this article, we introduce an unsupervised manifold learning technique, InPCA, which captures low-dimensional features of general, probabilistic models with wide-ranging applicability. We consider replicas of a probabilistic system to tune its dimensionality and consider the limit of zero replicas, deriving an intensive embedding that ameliorates the canonical orthogonality problem. Our intensive embedding provides a natural, meaningful way to characterize a symmetric distance between probabilistic data and produces a simple, deterministic algorithm to visualize the resulting manifold.

Acknowledgments

We thank Mark Transtrum for guidance on algorithms and for useful conversations. We thank Pankaj Mehta for pointing out the connection to MDS. K.N.Q. was supported by a fellowship from the Natural Sciences and Engineering Research Council of Canada, and J.P.S. and K.N.Q. were supported by the National Science Foundation (NSF) through NSF Grants DMR-1312160 and DMR-1719490. M.D.N. was supported by NSF Grant AST-1454881.

Footnotes

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

Data deposition: All code used to generate the figures is available through public repositories on GitHub. Fig. 1 and Fig. 2 can be generated from code on https://github.com/katnquinn/Ising_ModelManifold, Fig. 3 can be generated from code on https://github.com/katnquinn/IntensiveEmbedding, Fig. 4 from code found on https://github.com/katnquinn/1Spin, and Fig. 5 from code on https://github.com/katnquinn/CMB_ModelManifold and from Code for Anisotropies in the Microwave Background software found at https://camb.info/.

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1817218116/-/DCSupplemental.

References

1. De Oliveira M. F., Levkowitz H., From visual data exploration to visual data mining: A survey. IEEE Trans. Visualization Comput. Graphics 9, 378–394 (2003).
2. Liu S., Maljovec D., Wang B., Bremer P. T., Pascucci V., Visualizing high-dimensional data: Advances in the past decade. IEEE Trans. Visualization Comput. Graphics 23, 1249–1268 (2017).
3. Lee J. A., Verleysen M., Nonlinear Dimensionality Reduction (Springer, New York, NY, 2007).
4. Zimek A., Schubert E., Kriegel H. P., A survey on unsupervised outlier detection in high-dimensional numerical data. Stat. Anal. Data Mining ASA Data Sci. J. 5, 363–387 (2012).
5. Murphy K. P., Machine Learning: A Probabilistic Perspective (The MIT Press, 2012).
6. Kriegel H. P., Kröger P., Zimek A., Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM Trans. Knowl. Discov. Data 3, 1–58 (2009).
7. Hotelling H., Analysis of a complex of statistical variables into principal components. J. Educ. Psychol. 24, 417–441 (1933).
8. Torgerson W. S., Multidimensional scaling: I. Theory and method. Psychometrika 17, 401–419 (1952).
9. van der Maaten L., Hinton G., Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008).
10. Coifman R. R., et al., Geometric diffusions as a tool for harmonic analysis and structure definition of data: Diffusion maps. Proc. Natl. Acad. Sci. U.S.A. 102, 7426–7431 (2005).
11. McInnes L., Healy J., Melville J., UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv:1802.03426 (6 December 2018).
12. Mézard M., Parisi G., Virasoro M., Spin Glass Theory and Beyond (World Scientific, 1986).
13. Machta B. B., Chachra R., Transtrum M. K., Sethna J. P., Parameter space compression underlies emergent theories and predictive models. Science 342, 604–607 (2013).
14. Transtrum M. K., Qiu P., Model reduction by manifold boundaries. Phys. Rev. Lett. 113, 098701 (2014).
15. Transtrum M. K., et al., Perspective: Sloppiness and emergent theories in physics, biology, and beyond. J. Chem. Phys. 143, 010901 (2015).
16. Hellinger E., Neue Begründung der Theorie quadratischer Formen von unendlichvielen Veränderlichen. J. Reine Angew. Math. 136, 210–271 (1909).
17. Gromov M., In a search for a structure, part 1: On entropy. Entropy 17, 1273–1277 (2013).
18. Amari S., Nagaoka H., Translations of Mathematical Monographs: Methods of Information Geometry (Oxford University Press, 2000), vol. 191.
19. Beyer K., Goldstein J., Ramakrishnan R., Shaft U., “When is “nearest neighbor” meaningful?” in Database Theory— ICDT’99, C. Beeri, P. Buneman, Eds. (Springer Berlin Heidelberg, Berlin, Heidelberg, Germany, 1999), pp. 217–235.
20. Parisi G., Infinite number of order parameters for spin-glasses. Phys. Rev. Lett. 43, 1754–1756 (1979).
21. Bhattacharyya A., On a measure of divergence between two multinomial populations. Sankhyā Indian J. Stat. (1933-1960) 7, 401–406 (1946).
22. Abadi M., et al., TensorFlow: Large-scale machine learning on heterogeneous systems. https://www.tensorflow.org/. Accessed 1 December 2017.
23. LeCun Y., Cortes C., Burges C. J., “MNIST database”. http://yann.lecun.com/exdb/mnist/. Accessed 1 December 2017.
24. Bishop C. M., Pattern Recognition and Machine Learning (Springer, New York, NY, 2006).
25. Planck Collaboration, Planck 2015 results - I. Overview of products and scientific results. A&A 594, A1 (2016).
26. Lewis A., Challinor A., Lasenby A., Efficient computation of cosmic microwave background anisotropies in closed Friedmann-Robertson-Walker models. Astrophys. J. 538, 473–476 (2000).
27. Hu W., CMB tutorials. http://background.uchicago.edu/. Accessed 1 August 2018.
28. Quinn K., Ising Model Manifold. GitHub. https://github.com/katnquinn/Ising_ModelManifold. Deposited 23 July 2018.
29. Quinn K., Intensive Embedding. GitHub. https://github.com/katnquinn/IntensiveEmbedding. Deposited 11 March 2019.
30. Quinn K., 1 Spin. GitHub. https://github.com/katnquinn/1Spin. Deposited 13 March 2019.


