Abstract
To solve key biomedical problems, experimentalists now routinely measure millions or billions of features (dimensions) per sample, with the hope that data science techniques will be able to build accurate data-driven inferences. Because sample sizes are typically orders of magnitude smaller than the dimensionality of these data, valid inferences require finding a low-dimensional representation that preserves the discriminating information (e.g., whether the individual suffers from a particular disease). There is a lack of interpretable supervised dimensionality reduction methods that scale to millions of dimensions while providing strong statistical guarantees. We introduce an approach that extends principal components analysis by incorporating class-conditional moment estimates into the low-dimensional projection. The simplest version, Linear Optimal Low-rank projection (LOL), incorporates the class-conditional means. We prove, and substantiate with both synthetic and real data benchmarks, that LOL and its generalizations lead to improved data representations for subsequent classification, while maintaining computational efficiency and scalability. Using multiple brain imaging datasets consisting of more than 150 million features, and several genomics datasets with more than 500,000 features, LOL outperforms other scalable linear dimensionality reduction techniques in terms of accuracy, while requiring only a few minutes on a standard desktop computer.
Subject terms: Data processing, Statistical methods
Biomedical measurements usually generate high-dimensional data in which individual samples are classified into one of several categories. Vogelstein et al. propose a supervised dimensionality reduction method that estimates the low-dimensional data projection for classification and prediction in big datasets.
Introduction
Supervised learning—the art and science of estimating statistical relationships using labeled training data—has enabled a wide variety of basic and applied findings, ranging from discovering biomarkers in omics data1 to recognizing objects from images2. A special case of supervised learning is classification, where a classifier predicts the “class” of a novel observation (for example, by predicting sex from an MRI scan). One of the most foundational and important approaches to classification is Fisher’s Linear Discriminant Analysis (LDA)3. LDA has a number of highly desirable properties for a classifier. First, it is based on simple geometric reasoning: when the data are Gaussian, all the information is in the means and variances, so the optimal classifier uses both the means and the variances. Second, LDA can be applied to multiclass problems. Third, theorems guarantee that when the sample size n is large and the dimensionality p is relatively small, LDA converges to the optimal classifier under the Gaussian assumption. Finally, algorithms for implementing it are highly efficient.
Modern scientific datasets, however, present challenges for classification that were not addressed in Fisher’s era. Specifically, the dimensionality of datasets is quickly ballooning. Current raw data can consist of hundreds of millions of features or dimensions; for example, an entire genome or connectome. Yet, the sample sizes have not experienced a concomitant increase. This “large p, small n” problem is a non-starter for many classical statistical approaches because they were designed with a “small p, large n” situation in mind. Running LDA when p ≥ n is like trying to fit a line to a point: there are infinitely many equally good fits (all lines that pass through the point), and no way to know which of them is “best”. Therefore, without further constraints these algorithms will overfit, meaning they will choose a classifier based on noise in the data, rather than discarding the noise in favor of the desired signal. We also desire methods that can adapt to the complexity of the data, are robust to outliers, and are computationally efficient. Several complementary strategies have been pursued to address these p ≥ n problems.
First, and perhaps the most widely used method, is Principal Components Analysis (PCA)4. According to PubMed, PCA has been referenced over 40,000 times, and nearly 4000 times in 2018 alone. This is in contrast to other methods that receive much more attention in the media, such as deep learning, random forests, and sparse learning, which received ~2000, ~1200, and ~500 hits, respectively. This suggests that PCA remains the most popular workhorse for high-dimensional problems. PCA “pre-processes” the data by reducing its dimensionality to those dimensions whose variance is largest in the dataset. While highly successful, PCA is a wholly unsupervised dimensionality reduction technique, meaning that PCA does not use the class labels while learning the low-dimensional representation, resulting in suboptimal performance for subsequent classification. Nonlinear manifold learning techniques generalize PCA5, but also typically do not incorporate class label information; moreover, they scale poorly. Deep learning provides the most recent version of nonlinear manifold learning, for example, using (supervised) autoencoders, but these methods remain poorly understood, have many parameters to tune, and typically do not provide interpretable results6. Further, deep learning tends to suffer in the wide data problem, where the number of samples is far less than the dimensionality.
The second set of strategies regularize or penalize a supervised method, such as regularized LDA7 or canonical correlation analysis (CCA)8. Such approaches can drastically overfit in the p > n setting, tend to lack theoretical support in these contexts, and have multiple “knobs” to tune, which is computationally taxing. Partial least squares (PLS) is another popular method in this set that often achieves impressive empirical performance, though it lacks strong theoretical guarantees and a scalable implementation9,10. Sparse methods are the third common strategy to mitigate this “curse of dimensionality”11–13. Unfortunately, exact solutions are computationally intractable, and approximate solutions have theoretical guarantees only under very restrictive assumptions, and are quite fragile to those assumptions14. Thus, there is a gap: no existing approach can classify multi-class wide data with millions of features while obtaining strong theoretical guarantees, favorable and interpretable empirical performance, and a flexible, robust, and scalable implementation.
To address these issues, we developed a technique for incorporating class-conditional moment estimates, XOX, the simplest example of which is LOL. The key intuition behind LOL is that we can jointly use the means and variances from each class (like LDA and CCA), but without requiring more dimensions than samples (like PCA), or restrictive sparsity assumptions. Using random matrix theory, we are able to prove that when the data are sampled from a Gaussian, LOL finds a better low-dimensional representation than PCA, LDA, CCA, and other linear methods. Under relatively relaxed assumptions, this is true regardless of the dimensionality of the features, the number of samples, or the number of dimensions onto which we project. We then demonstrate the superiority of techniques derived using the XOX approach—including (i) LOL, (ii) QOQ, a variant of XOX that allows greater flexibility of the class-conditional covariances, and (iii) RLOL, a robust variant of LOL—over other methods numerically in a variety of simulated settings, including several that do not follow the theoretical assumptions. Finally, we show that on several 500-gigabyte neuroimaging datasets, and several multi-gigabyte genomics datasets, LOL achieves superior accuracy at lower dimensions while requiring only a few minutes of time on a single workstation.
Results
Flexibility and accuracy of XOX framework
We empirically investigate the flexibility and accuracy of XOX using simulations that extend beyond the theoretical claims. For three different scenarios, we sample 100 training samples, each with 100 features; therefore, Fisher’s LDA cannot solve the problem (because there are infinitely many ways to overfit). We consider a number of different methods, including PCA, reduced-rank LDA (rrLDA; see Methods), PLS, random projections (RP), and CCA, to project the data onto a low-dimensional space. After projecting the data, we train either LDA (for the first two scenarios) or quadratic discriminant analysis (QDA, for the third scenario), which generalizes LDA by allowing each class to have its own covariance matrix15. For each scenario, we evaluate the misclassification rate on held-out data.
Figure 1 shows a two-dimensional scatterplot (left) and misclassification rate versus dimensionality (right) for each simulation. Hereafter, LOL refers to the version of LOL with a robust estimate of location (the class medians, which coincide with the class means when the class distribution is symmetric) and a truncated singular value decomposition to estimate the second moment. A robust location estimate tends to make little difference when robustness is not needed, and empirically improves performance in simulations and real-data examples when it is. Alternative strategies would have been to use robust estimates of the first or second moment directly16–18. We do not use a robust estimate of the second moment, as typical robust estimates of the second moment available in standard numerical packages require p < n, which is unsuitable for wide data. The top C − 1 embedding dimensions for LOL correspond to the performance after projection onto the class-conditional means, and rrLDA corresponds to the performance after projection onto the class-conditional covariance matrix. Figure 1a shows a three-class generalization of the Trunk example from Fig. 5b. LOL can trivially be extended to more than two classes (see Supplementary Note 2 for details), unlike ROAD, which only operates in a two-class setting. Figure 1b shows a two-class example with many outliers, as is typical in modern biomedical datasets. Both LOL and PLS perform well and efficiently identify informative embedding dimensions despite the outliers. Figure 1c shows an example that should be adversarial for LOL in comparison to PCA or rrLDA. This is because the difference of means is utterly uninformative, so LOL utilizes additional dimensions that are noise compared to PCA. Further, the class-conditional covariances are orthogonal, whereas LOL assumes the class-conditional covariance is the same across both classes. While LOL cannot possibly do as well as PCA in this situation, its performance is only slightly worse. Further, another XOX variant, quadratic optimal QDA (QOQ), uses the same difference of means as LOL, computes the eigenvectors separately for each class, concatenates them (sorting them according to their singular values), and then classifies with QDA instead of LDA. QOQ identifies a slightly more efficient projection for classification than PCA. This is because, while the first few dimensions (those spanned by the difference of the means) are uninformative, the successive dimensions (those spanned by the class-conditional covariances) are far more informative. For all three scenarios, either LOL—or its extended variant QOQ—achieves a misclassification rate comparable to or lower than other methods, for all dimensions. These three results demonstrate how straightforward generalizations of LOL under the XOX framework, which incorporate alternate or robust moment estimates, can dramatically improve performance over other projection methods. This is in marked contrast to other approaches, for which such flexibility is either not available or otherwise problematic.
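To make the QOQ construction described above concrete, the following is a minimal NumPy sketch of our own (an illustration only, not the authors’ released implementation): it concatenates the difference of the class means with the per-class top singular vectors, sorted by their singular values, and then classifies with QDA. The data generation and the choice of d are arbitrary placeholders.

```python
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

def qoq_embed(X, y, d):
    """Sketch of a QOQ-style embedding for two classes labeled 0 and 1."""
    mu0, mu1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    delta = (mu1 - mu0)[:, None]                 # difference of class means (p x 1)
    bases, weights = [], []
    for c in (0, 1):                             # per-class top singular vectors
        Xc = X[y == c] - X[y == c].mean(axis=0)
        _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
        bases.append(Vt.T)
        weights.append(s)
    V = np.hstack(bases)
    order = np.argsort(-np.hstack(weights))      # sort all columns by singular value
    return np.hstack([delta, V[:, order[: d - 1]]])   # p x d projection matrix

# toy usage: embed, then classify the embedded data with QDA
rng = np.random.default_rng(0)
n, p, d = 200, 1000, 10
y = rng.integers(0, 2, n)
X = rng.normal(size=(n, p)) + y[:, None] / np.sqrt(p)
A = qoq_embed(X, y, d)
clf = QuadraticDiscriminantAnalysis().fit(X @ A, y)
print("training accuracy:", clf.score(X @ A, y))
```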
XOX is computationally efficient and scalable
When the dimensionality is large (e.g., millions or billions), the main bottleneck is sometimes merely the ability to run anything on the data, rather than its predictive accuracy. We evaluate the computational efficiency and scalability of LOL in the simplest setting: two classes of spherically symmetric Gaussians (see Supplementary Note 3 for details) with dimensionality varying from 2 million to 128 million, and 1000 samples per class. Because LOL admits a closed form solution, it can leverage highly optimized linear algebra routines rather than the costly iterative programming techniques currently required for sparse or dictionary learning type problems19. To demonstrate these computational capabilities, we built FlashLOL, an efficient scalable LOL implementation with R bindings, to complement the R package used for the above figures.
Four properties of LOL enable its scalable implementation. First, LOL is linear in both sample size and dimensionality (Fig. 2a, solid red line). Second, LOL is easily parallelizable using recent developments in “semi-external memory”20–22 (Fig. 2a, the dashed red line demonstrates that LOL is also linear in the number of cores). Also note that LOL does not incur any meaningful additional computational cost over PCA (orange dashed line). Third, LOL can use randomized approximate algorithms for eigendecompositions to further accelerate its performance23,24 (Fig. 2a, orange lines). FlashLFL, short for Flash Low-rank Fast Linear embedding, achieves an order-of-magnitude improvement in speed when using very sparse RP instead of the eigenvectors. Fourth, hyper-parameter selection for LOL is nested, meaning that once the d-dimensional projection has been estimated, every lower-dimensional projection is automatically available. This is in contrast to tuning the weight of a penalty term, which leads to a new optimization problem for each parameter value. Thus, the computational complexity of LOL is 𝒪(npdc/T), where n is sample size, p is the dimension of the data, d is the dimension of the projection, T is the number of threads, and c is the sparsity of the projection.
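The nested structure of the hyper-parameter can be illustrated with a short sketch (our own illustration against a generic embedding matrix, not the FlashLOL API): once a d-dimensional projection matrix A has been estimated, each d′ ≤ d is evaluated simply by taking the first d′ columns of A, with no re-fitting.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def sweep_dimensions(A, X_train, y_train, X_test, y_test):
    """Held-out error for every lower-dimensional projection nested inside a single fit of A."""
    errors = {}
    for d_prime in range(1, A.shape[1] + 1):
        A_sub = A[:, :d_prime]          # nested: the first d' columns, no new optimization
        clf = LinearDiscriminantAnalysis().fit(X_train @ A_sub, y_train)
        errors[d_prime] = 1.0 - clf.score(X_test @ A_sub, y_test)
    return errors

# toy usage with a stand-in orthonormal projection (e.g., as returned by LOL or PCA)
rng = np.random.default_rng(1)
n, p, d = 200, 500, 20
y = rng.integers(0, 2, n)
X = rng.normal(size=(n, p)) + y[:, None]
A = np.linalg.qr(rng.normal(size=(p, d)))[0]
errs = sweep_dimensions(A, X[:150], y[:150], X[150:], y[150:])
print("best d':", min(errs, key=errs.get))
```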
Finally, note that this simulation setting is ideal for PCA and rrLDA, because the first principal component includes the mean difference vector. Nonetheless, both LOL and LFL achieve near-optimal accuracy, whereas rrLDA is at chance, and PCA requires 500 dimensions to even approach the accuracy that LOL achieves with only one dimension. While PCA would also benefit, efficiency-wise, from a randomized approach, we emphasize that LFL maintains the high performance of LOL relative to PCA despite the randomization, with the benefit of greater computational efficiency compared to LOL.
Real data benchmarks and applications
Real data often break the theoretical assumptions in more varied ways than the above simulations, and can provide a complementary perspective on the performance properties of different algorithms. We describe two sets of problems, one from brain imaging and the other from genomics. In both cases we consider a classification problem. To classify participants, researchers typically employ substantive preprocessing pipelines25 to reduce the dimensionality of the data. Unfortunately, as debates persist about the validity of preprocessing approaches, there is no de facto “standard” for the optimal strategies to preprocess the data. Traditional approaches typically include a deep processing chain, with many steps of parametric modeling and downsampling26–28. We therefore investigate the possibility of classifying the nearly raw, high-dimensional data directly.
The Consortium for Reliability and Reproducibility (CoRR)29 has generated anatomical and diffusion magnetic resonance imaging scans from n > 800 participants from five processing sites, each featuring participant-specific annotations for the sex of each individual. At the native resolution, each brain volume is over 150 million dimensions, and each dataset consists of between 42 (60 GB of data) and >400 samples (600 GB of data).
We also consider a large genomics dataset30 consisting of 340 individuals: 144 patients with nonmetastatic cancer and 196 healthy controls, of whom 198 are male and 142 are female. Samples are aligned to >750,000 amplicons distributed throughout the genome to investigate the presence of aneuploidy (abnormal chromosomal counts) in samples from cancer patients (see Supplementary Note 5 for details). The raw amplicon counts are then used with no further preprocessing. We have two tasks of interest: classification on the basis of either sex or cancer status.
For each of the problems described above, we first compute an embedding matrix to project the training data using LOL, PCA, rrLDA, and RP, and then train LDA to classify the resulting low-dimensional representations. The held-out set is then projected and classified using the embedding matrix and trained classifier, respectively, and the average cross-validated error is computed over all folds of the data. For each problem, the optimal dimensionality for each strategy is selected as the number of embedding dimensions with the lowest average cross-validated error. We compute Cohen’s kappa κ to compare performance across methods because it normalizes the performance of the classification strategy between zero (the classifier is equivalent to random chance) and one (the classifier performs perfectly). Finally, we measure the effect size of each embedding strategy as the difference κ(PCA) − κ(embed). See Supplementary Table 1 for details of the datasets employed.
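As a concrete illustration of this benchmark procedure, the sketch below (our own; PCA is shown as a stand-in for any of the embedding strategies, and dataset loading is omitted) computes the average cross-validated Cohen’s kappa for an embed-then-LDA pipeline; the effect size in the text is then the difference of two such kappa values.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import StratifiedKFold

def cv_kappa(X, y, n_components=10, n_folds=5, seed=0):
    """Average cross-validated Cohen's kappa for an embed-then-LDA pipeline."""
    kappas = []
    for train, test in StratifiedKFold(n_folds, shuffle=True, random_state=seed).split(X, y):
        embed = PCA(n_components=n_components).fit(X[train])    # fit the embedding on the training fold only
        clf = LinearDiscriminantAnalysis().fit(embed.transform(X[train]), y[train])
        y_hat = clf.predict(embed.transform(X[test]))
        kappas.append(cohen_kappa_score(y[test], y_hat))         # 0 = chance, 1 = perfect
    return float(np.mean(kappas))

# effect size relative to PCA, as defined in the text:
# kappa(PCA) - kappa(embed), with the embedding of interest substituted for PCA above
```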
Our FlashLOL implementations are the only algorithms that could successfully run on these data with a single core on a standard desktop computer. In Fig. 3a, LOL is the only technique to outperform PCA on all problems. Figure 3b shows the relative ranks of the average cross-validated misclassification rates for the LDA classifier on each dataset after projection with the specified embedding technique. For all problems, LOL is the technique with the lowest average cross-validated misclassification rate. Further, LOL performs significantly better than all other techniques (Wilcoxon signed-rank statistic, all p values = 0.008). The average misclassification rate achieved at the optimal number of embedding dimensions via LOL is between 5% and 15% across all datasets, which matches the performance we and others have obtained using the extensively processed and downsampled data typically required for similar datasets31,32. LOL therefore enables researchers to side-step hotly debated preprocessing issues by hardly preprocessing at all, and instead simply applying LOL to the data in its native dimensionality.
Discussion
We have introduced a very simple methodology to improve performance on supervised learning problems with wide data (that is, big data where dimensionality is at least as large as sample size) by using class-conditional moments to estimate a low rank projection under a generalized framework, XOX. In particular, LOL uses both the difference of the means and the class-centered covariance matrices, which enables it to outperform PCA, as well as existing supervised linear classification schemes, in a wide variety of scenarios without incurring any meaningful additional computational cost. Straightforward generalizations enable robust and nonlinear variants by using robust estimators and/or class specific covariance estimators. Our open source implementation optimally scales to terabyte datasets. Moreover, the intuition can be extended for both hypothesis testing and regression (see Supplementary Note 6 for additional numerical examples in these settings).
Two commonly applied approaches in these settings are PLS and CCA. CCA is equivalent to rrLDA whenever p < n, which is not of interest here. When p ≥ n, CCA and rrLDA are not equivalent; however, in such settings, CCA exhibits the “maximal data piling problem”33 (see Supplementary Note 2.6 for details). Specifically, all the points in each class are projected onto the exact same point. This results in severe overfitting of the data, yielding poor empirical performance in essentially all settings we considered here (the first dimension of CCA is typically worse even than the difference of the means). While PLS does not exhibit these problems, it lacks strong theoretical guarantees and simple geometric intuition. In contrast to XOX, neither CCA nor PLS enables straightforward generalizations, such as when there are outliers or the discriminant boundary is quadratic (see Fig. 1). Further, across all simulations, XOX outperforms both of these approaches, sometimes quite dramatically (for example, XOX outperforms CCA on all of the simulations considered). Finally, no scalable or parallelized implementations are readily available for these methods (see Fig. 2). One could use stochastic gradient descent with penalties to solve these other optimization problems, but one would still need to tune the penalty parameter, which would be quite computationally costly. Neither PLS nor CCA could be successfully run on the massive neuroimaging dataset or the amplicon-level genomics dataset using readily available tools.
Many previous investigations have addressed similar challenges. The celebrated Fisherfaces paper was the first to compose Fisher’s LDA with PCA (equivalent to PCA in this manuscript)34. The authors showed via a sequence of numerical experiments the utility of projecting the data using PCA prior to classifying with LDA. We extend this work by adding a supervised component to the initial projection. Moreover, we provide the geometric intuition for why and when incorporating supervision is advantageous, with numerous examples demonstrating its superiority, and theoretical guarantees formalizing when LOL outperforms PCA. The “sufficient dimensionality reduction” literature has similar insights, but a different construction that typically requires the dimensionality to be smaller than the sample size35–39 (although see40 for some promising work). More recently, communication-inspired classification approaches have yielded theoretical bounds on linear and affine classification performance41; they do not, however, explicitly compare different projections, and the bounds we provide are more general and tighter. Moreover, none of the above strategies have implementations that scale to millions or billions of features. Recent big data packages are designed for millions or billions of samples42,43. In biomedical sciences, however, it is far more common to have tens or hundreds of samples, and millions or billions of features (e.g., genomics or connectomics).
Most manifold learning methods, while exhibiting both strong theoretical44–46 and empirical performance, are typically fully unsupervised. Thus, in classification problems, they discover a low-dimensional representation of the data while ignoring the labels. This approach can be highly problematic when the discriminant dimensions and the directions of maximal variance in the learned manifold are not aligned (see Fig. 4 for some examples). Moreover, nonlinear manifold learning techniques tend to learn a mapping from the original samples to a low-dimensional space, but do not learn a projection, meaning that new samples cannot easily be mapped onto the low-dimensional space, a requirement for supervised learning. Deep learning methods6 can easily be supervised, but they tend to require huge sample sizes, lack theoretical guarantees, or are opaque “black-boxes” that are insufficient for many biomedical applications. The result is a dearth of “out-of-the-box”, scalable, supervised dimensionality reduction techniques designed for wide datasets with strong theoretical guarantees on classification performance. Random forests circumvent many of these problems, but implementations that operate on millions of dimensions do not exist47, and they often produce embeddings that perform no better than PCA on wide datasets (Fig. 3).
Other approaches formulate an optimization problem, such as projection pursuit48 and empirical risk minimization49. These methods are limited because they are prone to falling into local minima, require costly iterative algorithms, and lack theoretical guarantees on classification accuracy49. Feature selection strategies, such as higher criticism thresholding50, effectively filter the dimensions, possibly prior to performing PCA on the remaining features51. These approaches could be combined with LOL in ultrahigh-dimensional problems. Similarly, another recently proposed supervised PCA variant builds on the elegant Hilbert–Schmidt independence criterion52 to learn an embedding53. Our theory demonstrates that, under the Gaussian model, composing this linear projection with the difference of the means will improve subsequent performance under general settings, implying that this will be a fertile avenue to pursue. A natural extension to this work would therefore be to estimate a Gaussian mixture model per class, rather than simply a Gaussian per class, and project onto the subspace spanned by the collection of all Gaussians.
In conclusion, the key XOX idea, appending class-conditional moment estimates to convert unsupervised manifold learning to supervised manifold learning, has many potential applications and extensions. We have presented the first few, including LOL, QOQ, and RLOL, which demonstrated the flexibility of XOX under both theoretical and benchmark settings. Incorporating additional nonlinearities via higher order moments, kernel methods54, ensemble methods55 such as random forests56, and multiscale methods46 are all of immediate interest.
Methods
Supervised manifold learning
A general strategy for supervised manifold learning is schematized in Fig. 4, and outlined here. Step (A): Obtain or select n training samples of high-dimensional data. For concreteness, we use one of the most popular benchmark datasets, the MNIST dataset57. This dataset consists of images of hand-written digits 0 through 9. Each image is represented by a 28 × 28 matrix, which means that the observed dimensionality of the data is p = 28² = 784. Because we are motivated by the n ≪ p scenario, we subsample the data to select n = 300 examples of the numbers 3, 7, and 8 (100 of each). Step (B): Learn a “projection” that maps the high-dimensional data to a low-dimensional representation. One can do so in a way that ignores which images correspond to which digit (the “class labels”), as PCA and most manifold learning techniques do, or try to use the labels, as LDA and sparse methods do. LOL is a supervised linear manifold learning technique that uses the class labels to learn projections that are linear combinations of the original data samples. Step (C): Use the learned projection to map high-dimensional data into the learned lower-dimensional space. This step requires having learned a projection that can be applied to new (test) data samples for which we do not know the true class labels. Nonlinear manifold learning methods typically cannot be applied in this way (though see58). LOL, however, can project new samples in such a way as to separate the data into classes. Step (D): Using the low-dimensional representation of the data, learn a classifier. A good classifier correctly identifies as many points as possible with the correct label. For these data, when LDA is used on the low-dimensional data learned by LOL, the data points are mostly linearly separable, yielding a highly accurate classifier.
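Steps (A)–(D) can be traced end to end in the short sketch below (our own illustration; it uses scikit-learn’s small 8 × 8 digits dataset as a stand-in for MNIST and an unsupervised PCA embedding as a stand-in for LOL, since the point here is the structure of the pipeline rather than the specific projection).

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

# (A) obtain n training samples: digits 3, 7, and 8
X, y = load_digits(return_X_y=True)
keep = np.isin(y, [3, 7, 8])
X_train, X_test, y_train, y_test = train_test_split(
    X[keep], y[keep], train_size=300, random_state=0)

# (B) learn a projection from the training data (PCA ignores y; LOL would use it)
proj = PCA(n_components=5).fit(X_train)

# (C) apply the learned projection to new samples whose labels are withheld
Z_train, Z_test = proj.transform(X_train), proj.transform(X_test)

# (D) learn a classifier on the low-dimensional representation and evaluate it
clf = LinearDiscriminantAnalysis().fit(Z_train, y_train)
print("held-out accuracy:", clf.score(Z_test, y_test))
```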
The geometric intuition of LOL
To build intuition for situations when LOL performs well, and when it does not, we consider the simplest high-dimensional classification setting. We observe n samples (xᵢ, yᵢ), where each xᵢ is a p-dimensional feature vector and yᵢ ∈ {0, 1} is the binary class label. We assume that both classes are distributed according to a multivariate Gaussian distribution, the two classes share the same identity covariance matrix (all features are uncorrelated with unit variance), and data from either class are equally likely, so that the only difference between the classes is their means. In this scenario, the optimal low-dimensional projection is analytically available: it is the product of the inverse covariance matrix and the difference of the means, commonly referred to as Fisher’s Linear Discriminant Analysis (LDA)59 (see Supplementary Note 1.2 for the derivation). When the distribution of the data is unavailable, as in all real data problems, machine learning methods can be used to estimate the parameters. Unfortunately, when n < p, the estimated covariance matrix will not be invertible (because the underlying mathematical problem is underspecified), so some other approach is required. As mentioned above, PCA is commonly used to learn a low-dimensional representation. PCA uses the pooled sample mean and the pooled sample covariance matrix. The PCA projection is composed of the top d eigenvectors of the pooled sample covariance matrix, after subtracting the pooled mean (thereby completely ignoring the class labels).
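In symbols, writing μ₀ and μ₁ for the class means and Σ for the shared covariance, the optimal projection just described takes the standard form below (equal class priors assumed; the formal derivation is in Supplementary Note 1.2):

```latex
\delta = \Sigma^{-1}(\mu_1 - \mu_0), \qquad
\hat{y}(x) = \mathbf{1}\!\left[\,\delta^\top\!\Bigl(x - \tfrac{\mu_0 + \mu_1}{2}\Bigr) > 0\right],
```

so that with the identity covariance assumed here, the optimal direction is simply the difference of the means.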
In contrast to PCA, LOL uses the class-conditional means and the class-centered covariance. This approach is motivated by Fisher’s LDA, which uses the same two terms, and should therefore improve performance over PCA. More specifically, for a two-class problem, LOL is constructed as follows (a minimal code sketch follows the list):
1. Compute the sample mean of each class.
2. Estimate the difference between the class means.
3. Compute the class-centered covariance matrix, that is, the covariance matrix computed after subtracting its class mean from each point.
4. Compute the eigenvectors of this class-centered covariance matrix.
5. Concatenate the difference of the means with the top d − 1 eigenvectors of the class-centered covariance matrix.
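A minimal NumPy sketch of these five steps for the two-class case is given below (our own illustration; the authors’ released R package and FlashLOL implementation are the reference implementations). The robust variant described in Results would replace the class means in steps 1–2 with class medians.

```python
import numpy as np

def lol_embed(X, y, d):
    """Sketch of a two-class LOL projection: the difference of the class means
    concatenated with the top d-1 eigenvectors of the class-centered covariance."""
    mu0 = X[y == 0].mean(axis=0)            # step 1: sample mean of each class
    mu1 = X[y == 1].mean(axis=0)
    delta = mu1 - mu0                       # step 2: difference of the class means
    Xc = X.astype(float)                    # step 3: subtract each point's own class mean
    Xc[y == 0] -= mu0
    Xc[y == 1] -= mu1
    # step 4: top eigenvectors of the class-centered covariance, via a truncated SVD of Xc
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    # step 5: concatenate the mean difference with the top d-1 eigenvectors
    return np.column_stack([delta, Vt[: d - 1].T])   # p x d; embed new data as X_new @ A
```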
Note that the sample class-centered covariance matrix estimates the population covariance, whereas the sample pooled covariance matrix is distorted by the difference of the class means. Further, as discussed in Methods, projecting onto the eigenvectors of the class-centered covariance matrix is equivalent to “Reduced Rank LDA”60 (rrLDA hereafter), which is simply LDA after truncating the covariance matrix. For the theoretical background on LDA and rrLDA, a formal definition of LOL, and a detailed description of the simulation settings that follow, see Supplementary Notes 1, 2, and 3, respectively. Figure 5 shows three different examples of 100 data points sampled from a 1000-dimensional Gaussian to geometrically illustrate the intuition that motivated LOL. In each case, all dimensions are uncorrelated with one another, and all classes are equally likely with the same covariance; the only difference between the classes is their means.
Figure 5a shows “stacked cigars”, in which the difference between the means and the direction of maximum variance are large and aligned with one another. This is an idealized setting for PCA, because PCA finds the direction of maximal variance, which happens to correspond to the direction of maximal separation of the classes. rrLDA performs well here too, for the same reason that PCA does. Because all dimensions are uncorrelated, and one dimension contains most of the information discriminating between the two classes, this is also an ideal scenario for sparse methods. Indeed, ROAD, a sparse classifier designed for precisely this scenario, does an excellent job finding the most useful dimensions12. LOL, using both the difference of means and the directions of maximal variance, also does well. To calibrate all of these methods, we also show the performance of the optimal classifier.
Figure 5b shows an example that is worse for PCA. In particular, the variance is getting larger for subsequent dimensions, while the magnitude of the difference between the means is decreasing with dimension. Because PCA operates on the pooled sample covariance matrix, the dimensions with the maximum difference are included in the estimate, and therefore, PCA finds some of them, while also finding some of the dimensions of maximum variance. The result is that PCA performs fairly well in this setting. rrLDA, however, by virtue of subtracting out the difference of the means, is now completely at chance performance. ROAD is not hampered by this problem; it is also able to find the directions of maximal discrimination, rather than those of maximal variance. Again, LOL, by using both the means and the covariance, does extremely well.
Figure 5c is exactly the same as Fig. 5b, except the data have been randomly rotated in all 1000 dimensions. This means that none of the original features have much information, but rather, linear combinations of them do. This is evidenced by observing the scatter plot, which shows that the first two dimensions fail to disambiguate the two classes. PCA performs even worse in this scenario than in the previous one. rrLDA is rotationally invariant (see Supplementary Note 2.4 for details), so still performs at chance levels. Because there is no small number of features that separate the data well, ROAD fails. LOL performs as well here as it does in the other examples.
When is LOL better than PCA and other supervised linear methods?
We desire theoretical confirmation of the above numerical results. To do so, we investigate when LOL is “better” than other linear dimensionality reduction techniques. In the context of supervised dimensionality reduction or manifold learning, the goal is to obtain a low-dimensional representation that maximally separates the two classes, making subsequent classification easier. Chernoff information quantifies the dissimilarity between two distributions. Therefore, we can compute the Chernoff information between the distributions of the two classes after embedding to evaluate the quality of a given embedding strategy. As it turns out, Chernoff information is the exponential convergence rate of the Bayes error61, and therefore yields the tightest possible theoretical bound. The use of Chernoff information to theoretically evaluate the performance of an embedding strategy is novel, to our knowledge, and leads to the following main result:
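For reference, when the two embedded class-conditional distributions are Gaussian, say N(μ₀, Σ₀) and N(μ₁, Σ₁), the Chernoff information has the standard closed form below (we quote the textbook expression here; the precise statements used in the proofs appear in Methods and the Supplement):

```latex
C = \max_{t \in (0,1)} \left[
      \frac{t(1-t)}{2}\,(\mu_1-\mu_0)^\top \Sigma_t^{-1} (\mu_1-\mu_0)
      + \frac{1}{2}\log\frac{\lvert\Sigma_t\rvert}{\lvert\Sigma_0\rvert^{1-t}\,\lvert\Sigma_1\rvert^{t}}
    \right],
\qquad \Sigma_t = (1-t)\,\Sigma_0 + t\,\Sigma_1 .
```

In the equal-covariance case Σ₀ = Σ₁ = Σ, the maximum is attained at t = 1/2 and reduces to (μ₁ − μ₀)ᵀΣ⁻¹(μ₁ − μ₀)/8, so an embedding that preserves more of this quantity yields a faster-decaying Bayes error.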
Main theoretical result
LOL is always better than or equal to rrLDA under the Gaussian model when p ≥ n, and better than or equal to PCA (and many other linear projection methods) with additional (relatively weak) conditions. This is true for all possible observed dimensionalities of the data, and the number of dimensions into which we project, for sufficiently large sample sizes. Moreover, under relatively weak assumptions, these conditions almost certainly hold as the number of dimensions increases.
Formal statements of the theorems and proofs required to substantiate the above result are provided in Methods. The condition for LOL to be better than PCA is essentially that the dth eigenvector of the pooled sample covariance matrix has less information about classification than the difference of the means vector. The implication of the above theorem is that it is better to incorporate the mean difference vector into the projection matrix, rather than ignoring it, under basically the same assumptions that motivate PCA. The degree of improvement is a function of the dimensionality of the feature set p, the number of samples n, the projection dimension d, and the parameters, but the existence of an improvement—or at least no worse performance—is independent of those factors.
Acknowledgements
The authors are grateful for the support by the XDATA program of the Defense Advanced Research Projects Agency (DARPA) administered through Air Force Research Laboratory contract FA8750-12-2-0303; DARPA GRAPHS contract N66001-14-1-4028; and DARPA SIMPLEX program through SPAWAR contract N66001-15-C-4041 and DARPA Lifelong Learning Machines program through contract FA8650-18-2-7834.
Author contributions
M.T. and M.M. contributed theoretical results, D.Z. and R.B. devised the semi-external memory implementation, C.D. procured relevant genomics datasets, J.T.V. and E.W.B. wrote the paper, E.W.B. developed the experiments and R package, J.T.V. supervised.
Data availability
Data used within this manuscript are available from https://neurodata.io/lol/ and https://neurodata.io//mri.
Code availability
MATLAB, R, and Python code for the experiments performed in this manuscript and a docker container for FlashLOL are available from https://neurodata.io/lol/, and an R package is available on the Comprehensive R Archive Network (CRAN)62.
Competing interests
The authors declare no competing interests.
Footnotes
Peer review information Nature Communications thanks Andrew Patterson and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
These authors contributed equally: Joshua T. Vogelstein and Eric W. Bridgeford.
Supplementary information
The online version contains supplementary material available at 10.1038/s41467-021-23102-2.
References
- 1. Vogelstein JT, et al. Discovery of brainwide neural-behavioral maps via multiscale unsupervised structure learning. Science. 2014;344:386–392. doi: 10.1126/science.1250298.
- 2. Krizhevsky, A., Sutskever, I. & Hinton, G. E. ImageNet classification with deep convolutional neural networks. In Proc. Advances in Neural Information Processing Systems (eds. Pereira, F., Burges, C. J. C., Bottou, L. & Weinberger, K. Q.) 1097–1105 (Curran Associates, Inc., 2012).
- 3. Fisher RA. Theory of statistical estimation. Math. Proc. Cambridge Philos. Soc. 1925;22:700–725. doi: 10.1017/S0305004100009580.
- 4. Jolliffe, I. T. in Principal Component Analysis, Springer Series in Statistics Ch. 1 (Springer, 1986).
- 5. Lee, J. A. & Verleysen, M. Nonlinear Dimensionality Reduction (Springer, 2007).
- 6. Goodfellow, I., Bengio, Y., Courville, A. & Bengio, Y. Deep Learning (MIT Press, 2016).
- 7. Witten DM, Tibshirani R. Covariance-regularized regression and classification for high-dimensional problems. J. R. Stat. Soc. Series B Stat. Methodol. 2009;71:615–636. doi: 10.1111/j.1467-9868.2009.00699.x.
- 8. Shin H, Eubank RL. Unit canonical correlations and high-dimensional discriminant analysis. J. Stat. Comput. Simulation. 2011;81:167–178. doi: 10.1080/00949650903222343.
- 9. ter Braak, C. J. F. & de Jong, S. The objective function of partial least squares regression. J. Chemom. 12, 41–54 (1998).
- 10. Brereton RG, Lloyd GR. Partial least squares discriminant analysis: taking the magic away. J. Chemom. 2014;28:213–225. doi: 10.1002/cem.2609.
- 11. Tibshirani R. Regression shrinkage and selection via the Lasso. J. R. Stat. Soc. Series B. 1996;58:267–288.
- 12. Fan J, Feng Y, Tong X. A road to classification in high dimensional space: the regularized optimal affine discriminant. J. R. Stat. Soc. Series B Stat. Methodol. 2012;74:745–771. doi: 10.1111/j.1467-9868.2012.01029.x.
- 13. Hastie, T., Tibshirani, R. & Wainwright, M. Statistical Learning with Sparsity: The Lasso and Generalizations (Chapman and Hall/CRC, 2015).
- 14. Su W, et al. False discoveries occur early on the Lasso path. Ann. Stat. 2017;45:2133–2150.
- 15. Hastie, T., Tibshirani, R. & Friedman, J. H. The Elements of Statistical Learning: Data Mining, Inference, and Prediction (Publishing House of Electronics Industry, 2004).
- 16. Fan, J., Wang, W. & Zhu, Z. A shrinkage principle for heavy-tailed data: high-dimensional robust low-rank matrix recovery. Preprint at arXiv:1603.08315 (2016).
- 17. Ke Y, Minsker S, Ren Z, Sun Q, Zhou W-X. User-friendly covariance estimation for heavy-tailed distributions. Statist. Sci. 2019;34:454–471. doi: 10.1214/19-STS711.
- 18. Minsker, S. & Wei, X. Estimation of the covariance structure of heavy-tailed distributions. Preprint at https://arxiv.org/abs/1708.00502v3 (2017).
- 19. Mairal, J., Ponce, J., Sapiro, G., Zisserman, A. & Bach, F. R. Supervised dictionary learning. In Proc. Advances in Neural Information Processing Systems (eds. Koller, D., Schuurmans, D., Bengio, Y. & Bottou, L.) 1033–1040 (Curran Associates, Inc., 2009).
- 20. Zheng, D. et al. FlashGraph: processing billion-node graphs on an array of commodity SSDs. In Proc. 13th USENIX Conference on File and Storage Technologies (FAST 15) 45–58 (USENIX Association, 2015).
- 21. Zheng, D., Mhembere, D., Vogelstein, J. T., Priebe, C. E. & Burns, R. FlashMatrix: parallel, scalable data analysis with generalized matrix operations using commodity SSDs. Preprint at arXiv:1604.06414 (2016).
- 22. Zheng, D., Burns, R., Vogelstein, J., Priebe, C. E. & Szalay, A. S. An SSD-based eigensolver for spectral analysis on billion-node graphs. Preprint at arXiv:1602.01421 (2016).
- 23. Candès EJ, Tao T. Near-optimal signal recovery from random projections: universal encoding strategies? IEEE Trans. Inf. Theory. 2006;52:5406–5425. doi: 10.1109/TIT.2006.885507.
- 24. Li, P., Hastie, T. J. & Church, K. W. Very sparse random projections. In KDD '06: Proc. 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 287–296 (Association for Computing Machinery, 2006).
- 25. Bridgeford, E. W. et al. Eliminating accidental deviations to minimize generalization error and maximize reliability: applications in connectomics and genomics. Preprint at bioRxiv 10.1101/802629 (2020).
- 26. Gray WR, et al. Magnetic resonance connectome automated pipeline. IEEE Pulse. 2011;3:42–48. doi: 10.1109/MPUL.2011.2181023.
- 27. Roncal, W. G. et al. MIGRAINE: MRI graph reliability analysis and inference for connectomics. In Proc. 2013 IEEE Global Conference on Signal and Information Processing 313–316 (IEEE, 2013).
- 28. Kiar, G. et al. Science in the cloud (sic): a use case in MRI connectomics. GigaScience 10.1093/gigascience/gix013 (2017).
- 29. Zuo X-N, et al. An open science resource for establishing reliability and reproducibility in functional connectomics. Sci. Data. 2014;1:140049. doi: 10.1038/sdata.2014.49.
- 30. Douville C, et al. Assessing aneuploidy with repetitive element sequencing. Proc. Natl Acad. Sci. USA. 2020;117:4858–4863. doi: 10.1073/pnas.1910041117.
- 31. Vogelstein JT, Roncal WG, Vogelstein RJ, Priebe CE. Graph classification using signal-subgraphs: applications in statistical connectomics. IEEE Trans. Pattern Anal. Mach. Intell. 2013;35:1539–1551. doi: 10.1109/TPAMI.2012.235.
- 32. Duarte-Carvajalino JM, Jahanshad N. Hierarchical topological network analysis of anatomical human brain connectivity and differences related to sex and kinship. Neuroimage. 2011;59:3784–3804. doi: 10.1016/j.neuroimage.2011.10.096.
- 33. Ahn J, Marron JS. The maximum data piling direction for discrimination. Biometrika. 2010;97:254–259. doi: 10.1093/biomet/asp084.
- 34. Belhumeur PN, Hespanha JP, Kriegman DJ. Eigenfaces vs. Fisherfaces: recognition using class specific linear projection. IEEE Trans. Pattern Anal. Mach. Intell. 1997;19:711–720. doi: 10.1109/34.598228.
- 35. Li K-C. Sliced inverse regression for dimension reduction. J. Am. Stat. Assoc. 1991;86:316–327. doi: 10.1080/01621459.1991.10475035.
- 36. Tishby, N., Pereira, F. C. & Bialek, W. The information bottleneck method. In Proc. 37th Annual Allerton Conference on Communication, Control, and Computing 368–377 (1999).
- 37. Globerson A, Tishby N. Sufficient dimensionality reduction. J. Mach. Learn. Res. 2003;3:1307–1331.
- 38. Cook RD, Ni L. Sufficient dimension reduction via inverse regression. J. Am. Stat. Assoc. 2005;100:410–428. doi: 10.1198/016214504000001501.
- 39. Fukumizu K, Bach FR, Jordan MI. Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. J. Mach. Learn. Res. 2004;5:73–99.
- 40. Cook RD, Forzani L, Rothman AJ. Prediction in abundant high-dimensional linear regression. Electron. J. Stat. 2013;7:3059–3088. doi: 10.1214/13-EJS872.
- 41. Nokleby M, Rodrigues M, Calderbank R. Discrimination on the Grassmann manifold: fundamental limits of subspace classifiers. IEEE Trans. Inf. Theory. 2015;61:2133–2147. doi: 10.1109/TIT.2015.2407368.
- 42. Agarwal A, Chapelle O, Dudík M, Langford J. A reliable effective terascale linear learning system. J. Mach. Learn. Res. 2014;15:1111–1133.
- 43. Abadi, M. et al. TensorFlow: large-scale machine learning on heterogeneous distributed systems. Preprint at arXiv:1603.04467 (2016).
- 44. Eckart C, Young G. The approximation of one matrix by another of lower rank. Psychometrika. 1936;1:211–218. doi: 10.1007/BF02288367.
- 45. de Silva, V. & Tenenbaum, J. B. Global versus local methods in nonlinear dimensionality reduction. In Proc. 15th International Conference on Neural Information Processing Systems (eds. Becker, S., Thrun, S. & Obermayer, K.) 721–728 (MIT Press, 2003).
- 46. Allard WK, Chen G, Maggioni M. Multi-scale geometric methods for data sets II: geometric multi-resolution analysis. Appl. Comput. Harmon. Anal. 2012;32:435–462. doi: 10.1016/j.acha.2011.08.001.
- 47. Tomita, T., Maggioni, M. & Vogelstein, J. ROFLMAO: robust oblique forests with linear MAtrix operations. In Proc. 2017 SIAM International Conference on Data Mining (eds. Chawla, N. & Wang, W.) 498–506 (Society for Industrial and Applied Mathematics, 2017).
- 48. Huber PJ. Projection pursuit. Ann. Stat. 1985;13:435–475.
- 49. Belkin M, Niyogi P, Sindhwani V. Manifold regularization: a geometric framework for learning from labeled and unlabeled examples. J. Mach. Learn. Res. 2006;7:2399–2434.
- 50. Donoho DL, Jin J. Higher criticism thresholding: optimal feature selection when useful features are rare and weak. Proc. Natl Acad. Sci. USA. 2008;105:14790–14795. doi: 10.1073/pnas.0807471105.
- 51. Bair E, Hastie T, Paul D, Tibshirani R. Prediction by supervised principal components. J. Am. Stat. Assoc. 2006;101:119–137. doi: 10.1198/016214505000000628.
- 52. Gretton A, Herbrich R, Smola A, Bousquet O, Scholkopf B. Kernel methods for measuring independence. J. Mach. Learn. Res. 2005;6:2075–2129.
- 53. Barshan E, Ghodsi A, Azimifar Z, Jahromi MZ. Supervised principal component analysis: visualization, classification and regression on subspaces and submanifolds. Pattern Recognit. 2011;44:1357–1371. doi: 10.1016/j.patcog.2010.12.015.
- 54. Mika, S., Ratsch, G., Weston, J., Scholkopf, B. & Mullers, K. R. Fisher discriminant analysis with kernels. In Neural Networks for Signal Processing IX: Proc. 1999 IEEE Signal Processing Society Workshop (Cat. No. 98TH8468) (eds. Hu, Y.-H., Larsen, J., Wilson, E. & Douglas, S.) 41–48 (IEEE, 1999).
- 55. Cannings, T. I. & Samworth, R. J. Random-projection ensemble classification. Preprint at arXiv:1504.04595 (2015).
- 56. Breiman L. Random forests. Mach. Learn. 2001;45:5–32. doi: 10.1023/A:1010933404324.
- 57. LeCun, Y., Cortes, C. & Burges, C. MNIST Handwritten Digit Database http://yann.lecun.com/exdb/mnist/ (2015).
- 58. Bengio, Y. et al. Out-of-sample extensions for LLE, Isomap, MDS, eigenmaps, and spectral clustering. In Advances in Neural Information Processing Systems (eds. Thrun, S., Saul, L. K. & Schölkopf, P. B.) 177–184 (MIT Press, 2004).
- 59. Bickel PJ, Levina E. Some theory for Fisher’s linear discriminant function, ‘naive Bayes’, and some alternatives when there are many more variables than observations. Bernoulli. 2004;10:989–1010. doi: 10.3150/bj/1106314847.
- 60. Hastie T, Tibshirani R. Discriminant analysis by Gaussian mixtures. J. R. Stat. Soc. Series B Stat. Methodol. 1996;58:155–176.
- 61. Chernoff H. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. Ann. Math. Stat. 1952;23:493–507. doi: 10.1214/aoms/1177729330.
- 62. Bridgeford, E. W., Tang, M., Yim, J. & Vogelstein, J. T. Linear optimal low-rank projection. Zenodo 10.5281/zenodo.1246979 (2018).