Abstract
Exponential-family singular value decomposition (eSVD) is a new approach for embedding multivariate data into a lower-dimensional space. It provides an elegant dimension reduction framework that flexibly handles one-parameter exponential family distributions and comes with proven consistency. This approach adds a valuable new tool to the toolbox of data analysts. Here we discuss a number of open problems and challenges that remain to be addressed in order to unleash the full potential of eSVD and similar approaches.
Keywords: Multivariate Analysis, Dimension Reduction, Big Data, Genomics
We would like to congratulate Lin, Lei and Roeder on developing an elegant framework, exponential-family singular value decomposition (eSVD) (Lin et al., 2021), for embedding multivariate data into a lower-dimensional space. Low-dimensional embedding is an important dimension reduction tool with many applications in high-dimensional data analyses. Examples include data compression, visualization, denoising, feature extraction, improving computational efficiency, and mitigating the curse of dimensionality for downstream data analysis tasks. While many low-dimensional embedding methods have been developed in the past, the work by Lin et al. (2021) introduces an appealing new solution for at least two reasons.
First, eSVD generalizes the widely used singular value decomposition (SVD) approach by providing additional model flexibility. While the SVD-based embedding can be interpreted as modeling data using a low-rank matrix plus constant-variance Gaussian noise, eSVD assumes that data follow one-parameter exponential family distributions and models their natural parameters using a low-rank matrix factorization. By relaxing the data distribution assumption, one gains flexibility that is important for real data applications. For example, different genomic technologies generate data with different characteristics. While gene expression data from traditional microarrays have continuous values that may be modeled using a normal distribution, data generated from the most recent single-cell sequencing technologies such as single-cell RNA-sequencing (scRNA-seq) are discrete counts and may be better modeled using count-based distributions. The additional flexibility provided by eSVD makes it easier for data analysts to handle different data types using a common framework and allows them to quickly adapt their analysis pipelines to new data types generated from new technologies.
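To make the contrast with the classical SVD concrete, the eSVD model can be written schematically as follows (the notation below is ours and may differ from that used by Lin et al., 2021): each entry A_ij of the n × p data matrix follows a one-parameter exponential family whose natural parameter θ_ij comes from a rank-k factorization,

\[
A_{ij} \sim p(a \mid \theta_{ij}) = h(a)\,\exp\{a\,\theta_{ij} - \psi(\theta_{ij})\},
\qquad
\Theta = (\theta_{ij}) = X Y^{\top},
\quad X \in \mathbb{R}^{n \times k},\ Y \in \mathbb{R}^{p \times k},\ k \ll \min(n, p),
\]

whereas the SVD-based embedding corresponds to the special case A_ij = θ_ij + ε_ij with ε_ij ~ N(0, σ²), i.e., constant-variance Gaussian noise around a low-rank mean.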
Second, eSVD comes with proven consistency and identifiability conditions. These theoretical results are useful for understanding the conditions under which the low-dimensional embedding can accurately reflect the underlying data structure. Similar results are not yet available for many other non-linear embedding methods such as T-distributed Stochastic Neighbor Embedding (t-SNE) (van der Maaten and Hinton, 2008) and Uniform Manifold Approximation and Projection (UMAP) (McInnes et al., 2018).
By providing a flexible low-dimensional embedding framework amenable to theoretical analysis, eSVD adds a very valuable dimension reduction tool to the toolbox of data analysts. Clearly, the work by Lin et al. (2021) only represents a start. To unleash the full potential of eSVD, there are still a number of open problems and challenges that need to be addressed in the future.
Impact of data preprocessing and feature selection
Lin et al. (2021) developed eSVD under the assumption that the data matrix is given. Although it is not their focus, the data preprocessing that creates this matrix can also have an impact on the embedding results. For example, a common preprocessing step in the analysis of scRNA-seq data is to filter out genes that show low variability across cells in order to reduce noise and improve computational efficiency. Only high-variability genes are retained to construct the data matrix used for embedding cells. Depending on how many genes and which genes are retained, there can be substantial variation in the embedding performance. To demonstrate, we applied eSVD to a scRNA-seq dataset consisting of five different cell types originally generated by Zheng et al. (2017) (Fig. 1). This dataset was created by mixing cells obtained via cell sorting, so the true cell type label of each cell is known and can be used to evaluate the analysis results. After normalizing genes’ read counts by each cell’s library size and selecting high-variability genes using the FindVariableFeatures() function in Seurat (Stuart et al., 2019), we applied eSVD to embed cells into a low-dimensional space (Fig. 1a). eSVD was run with a curved Gaussian distribution (t = 2) and latent space dimension k = 5. Using the five-dimensional eSVD embedding, unsupervised k-means clustering was run to group cells into five clusters. The clustering results were then compared with cells’ true cell type labels (Fig. 1b), and the adjusted Rand index (ARI) (Hubert and Arabie, 1985) was computed (Fig. 1c). A higher ARI indicates that an embedding better retains the true cell type structure. Using different numbers of highly variable genes resulted in variable eSVD embedding performance (Fig. 1). To put this variability in context, we also applied SVD (with k = 5 latent dimensions) and UMAP (using the umap R package with k = 2 latent dimensions), followed by the same k-means clustering procedure (5 clusters), to the same data. Consistent with Lin et al. (2021), SVD and UMAP were run on normalized and log2-transformed read counts, whereas eSVD was run on normalized data without log2 transformation. We observed that the variability in eSVD’s performance due to gene filtering can be even larger than the variability across different dimension reduction methods (Fig. 1c). This example demonstrates the potential impact of data preprocessing and shows the importance of optimizing feature selection and construction of the input data matrix in conjunction with eSVD in order to obtain an optimal embedding. It also suggests a natural extension of eSVD in which the optimization of feature selection and data matrix construction is incorporated into the embedding framework. More generally, the analysis of a complex dataset often consists of multiple steps, of which low-dimensional embedding may be only one. The optimal analysis may require methods that jointly optimize the embedding with both the upstream data preprocessing and the downstream analysis tasks.
Fig. 1. Impact of gene filtering on low-dimensional cell embedding. A scRNA-seq dataset with 500 hematopoietic stem cells (HSCs), 500 B cells, 500 monocytes, 500 natural killer cells and 1000 T cells was analyzed. (a) The eSVD embedding (the first two dimensions) based on 500 and 3000 highly variable genes selected by Seurat. Cells (dots) are color-coded using true cell type labels. (b) Using eSVD embeddings, cells are grouped into five clusters using k-means clustering. The number of cells from each cell type is shown for each cluster. In a perfect clustering, each cluster should contain only one cell type. Using 500 or 3000 genes for eSVD results in different clustering performance. (c) The adjusted Rand index of k-means clustering based on eSVD, SVD and UMAP with 500, 1000, 2000, and 3000 highly variable genes selected by Seurat. The data and R code for this analysis can be found in Supplementary Materials (SuppPubdata1.zip, SuppPubcode1.zip).
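For readers who wish to experiment with this kind of sensitivity analysis, a minimal sketch of the pipeline behind Fig. 1 is given below. This is not the code used to produce the figure (that code is in the Supplementary Materials); the names evaluate_embedding(), run_esvd(), counts and cell_type are ours, and run_esvd() in particular is a hypothetical placeholder for whatever interface one's eSVD implementation exposes.

```r
## Minimal sketch (assumptions noted above): `counts` is a genes-by-cells count matrix and
## `cell_type` holds the true cell type labels; run_esvd() is a hypothetical placeholder.
library(Seurat)   # normalization and highly variable gene selection
library(umap)     # UMAP embedding
library(mclust)   # adjustedRandIndex()

evaluate_embedding <- function(counts, cell_type, n_genes = 2000, k = 5) {
  obj <- CreateSeuratObject(counts = counts)
  obj <- NormalizeData(obj)                              # library-size normalization + log transform
  obj <- FindVariableFeatures(obj, nfeatures = n_genes)  # select highly variable genes
  genes <- VariableFeatures(obj)

  ## log-normalized matrix restricted to the selected genes, with cells in rows
  logmat <- t(as.matrix(GetAssayData(obj, slot = "data")[genes, ]))

  ## SVD embedding: first k left singular vectors of the centered log-normalized data
  sv <- svd(scale(logmat, center = TRUE, scale = FALSE), nu = k, nv = 0)
  emb_svd <- sv$u %*% diag(sv$d[1:k])

  ## UMAP embedding (2 dimensions by default)
  emb_umap <- umap(logmat)$layout

  ## eSVD embedding: placeholder; per Lin et al. (2021), eSVD would be run on normalized
  ## (not log-transformed) data with a curved Gaussian model and the same k.
  # emb_esvd <- run_esvd(counts[genes, ], k = k, family = "curved_gaussian")

  ## cluster each embedding into 5 groups and compare with the true labels via ARI
  ari <- function(emb) adjustedRandIndex(kmeans(emb, centers = 5, nstart = 25)$cluster, cell_type)
  c(svd = ari(emb_svd), umap = ari(emb_umap))
}

## e.g., sapply(c(500, 1000, 2000, 3000), function(g) evaluate_embedding(counts, cell_type, g))
```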
Scalability and computational efficiency
Thanks to rapidly evolving technologies, data sizes are also growing fast. Lin et al. (2021) demonstrated eSVD using data with n = 10^2–10^3 cells and p = 10^2–10^3 genes. However, today’s cell atlas projects increasingly generate much larger datasets, such as scRNA-seq data consisting of n = 10^5–10^6 cells and p = 10^3–10^4 genes (Regev et al., 2017; Su et al., 2020) or single-cell chromatin accessibility data consisting of n = 10^4–10^5 cells and p = 10^4–10^6 genomic regulatory elements (Cusanovich et al., 2018). Based on our own limited experience with eSVD, its current implementation is incapable of handling these atlas-level data. In order to be widely adopted in these big data settings, a method has to be scalable to increasing sample size n and dimension p in terms of both hardware requirements (e.g., memory usage) and computational speed. For this reason, establishing the computational complexity of eSVD with respect to both n and p, and improving its scalability and computational efficiency for big data applications, are clearly important topics that warrant further investigation.
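As a rough back-of-the-envelope illustration of the hardware side of this challenge (our own arithmetic, not a statement about the eSVD implementation), simply storing matrices of this size densely is already demanding, which is one reason sparsity-aware storage and updates matter at atlas scale:

```r
## Approximate memory footprint of a dense double-precision data matrix (8 bytes per entry).
dense_gb <- function(n, p) n * p * 8 / 1e9
dense_gb(1e6, 1e4)  # ~80 GB: atlas-scale scRNA-seq (cells x genes)
dense_gb(1e5, 1e6)  # ~800 GB: single-cell chromatin accessibility (cells x regulatory elements)
```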
Model flexibility, selection and tuning
Model flexibility is a key strength of eSVD, but flexibility by itself is no insurance against poor performance. Model selection and tuning are critical for users to leverage this flexibility and achieve optimal performance. While Lin et al. (2021) presented a matrix-completion diagnostic and bootstrap procedure for model selection and tuning, one limitation of this procedure is its computational burden, which makes it difficult to use on large datasets. This can constrain users’ ability to benefit from the model flexibility.
In eSVD, users need to jointly optimize multiple parameters (e.g., the distribution family, nuisance parameters, and the dimensionality of the latent space). The high computational cost often means that users can only afford a low-resolution grid search in the joint parameter space. Some users will even skip systematic model selection and tuning altogether. In practice, this means that different users may choose different models and parameters, leading to substantial variability in the method’s results and performance when analyzing the same data. This raises the question of how one may further improve the model selection and tuning methods to allow users to take full advantage of the model flexibility. This may be achieved either by developing model evaluation statistics that are cheaper to compute, enabling a fine-grained parameter search, or by developing smarter strategies to navigate the joint parameter space (e.g., coupling a low-resolution search at an initial stage with a high-resolution search later, as sketched below).
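To illustrate the coarse-to-fine idea, the sketch below shows one possible two-stage grid search. The wrapper fit_and_score() is hypothetical: it stands for fitting eSVD at a given latent dimension and nuisance-parameter value and returning whatever model evaluation statistic one chooses (for example, a matrix-completion-style diagnostic).

```r
## Illustrative two-stage (coarse-to-fine) search over the latent dimension k and one nuisance
## parameter; fit_and_score(data, k, nuisance) is a hypothetical user-supplied wrapper that
## returns a model-selection score (smaller is better).
coarse_to_fine <- function(data, fit_and_score,
                           k_coarse = c(2, 5, 10, 20), nuis_coarse = c(0.5, 1, 2, 4)) {
  ## stage 1: low-resolution search over a coarse grid
  grid <- expand.grid(k = k_coarse, nuisance = nuis_coarse)
  grid$score <- mapply(function(k, nu) fit_and_score(data, k = k, nuisance = nu),
                       grid$k, grid$nuisance)
  best <- grid[which.min(grid$score), ]

  ## stage 2: high-resolution search around the coarse winner
  k_fine    <- unique(pmax(1, best$k + (-2:2)))
  nuis_fine <- best$nuisance * c(0.5, 0.75, 1, 1.5, 2)
  grid2 <- expand.grid(k = k_fine, nuisance = nuis_fine)
  grid2$score <- mapply(function(k, nu) fit_and_score(data, k = k, nuisance = nu),
                        grid2$k, grid2$nuisance)
  grid2[which.min(grid2$score), ]
}
```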
Gaps between theory and application
The consistency and identifiability conditions established by Lin et al. (2021) represent an important step towards understanding the theoretical properties of eSVD. However, a few gaps remain in terms of applying the theory to guide practice. First, the theoretical results are based on a number of assumptions, and checking the validity of these assumptions is currently not easy. Additional work is needed to develop simple ways to check these assumptions in real data. Second, for a given dataset, it would be useful to know whether its n and p are large enough for the estimated embedding to be close to the underlying truth. Answering this question requires knowledge of not only the convergence rate but also the various constants used to establish the convergence results (e.g., C, μ and S in Proposition D.1). Methods to determine these unknown constants in real data applications still need to be developed. Filling these remaining gaps is critical in order to seamlessly connect theory to application.
To summarize, the work by Lin et al. (2021) has not only introduced a useful new dimension reduction framework but also created many new research problems. We hope that this discussion provides some perspectives complementary to Lin et al. (2021) and inspires future theoretical, methodological and applied research on eSVD and other dimension reduction methods. We conclude by commenting that, with a large number of dimension reduction methods available and many more to come, it is also important to understand the relative strengths and weaknesses of each method. In real applications, users choose methods based on many considerations such as accuracy, consistency, robustness, scalability, speed, and ease of use. It is unlikely that a single method can provide a one-size-fits-all solution in all applications. Currently, a systematic benchmark study that comprehensively evaluates and compares dimension reduction methods, including eSVD, is still lacking. It will be extremely valuable to conduct such a study in the future in order to establish guidelines for users to choose the most suitable methods for their applications.
Supplementary Material
Acknowledgments
This work was supported by the US National Institutes of Health grants R01HG009518 and R01HG010889.
References
- Cusanovich DA, Hill AJ, Aghamirzaie D, Daza RM, Pliner HA, Berletch JB, Filippova GN, Huang X, Christiansen L, DeWitt WS, Lee C, Regalado SG, Read DF, Steemers FJ, Disteche CM, Trapnell C, and Shendure J (2018). “A single-cell atlas of in vivo mammalian chromatin accessibility,” Cell 174(5), 1309–1324.e18.
- Hubert L, and Arabie P (1985). “Comparing partitions,” J. Classif. 2, 193–218.
- Lin KZ, Lei J, and Roeder K (2021). “Exponential-family embedding with application to cell developmental trajectories for single-cell RNA-seq data,” J. Am. Stat. Assoc. xx(x), xx–xx.
- McInnes L, Healy J, and Melville J (2018). “UMAP: Uniform manifold approximation and projection for dimension reduction,” arXiv preprint arXiv:1802.03426.
- Regev A, Teichmann SA, Lander ES, Amit I, Benoist C, Birney E, Bodenmiller B, Campbell P, Carninci P, Clatworthy M, Clevers H, Deplancke B, Dunham I, Eberwine J, Eils R, Enard W, Farmer A, Fugger L, Göttgens B, Hacohen N, Haniffa M, Hemberg M, Kim S, Klenerman P, Kriegstein A, Lein E, Linnarsson S, Lundberg E, Lundeberg J, Majumder P, Marioni JC, Merad M, Mhlanga M, Nawijn M, Netea M, Nolan G, Pe’er D, Phillipakis A, Ponting CP, Quake S, Reik W, Rozenblatt-Rosen O, Sanes J, Satija R, Schumacher TN, Shalek A, Shapiro E, Sharma P, Shin JW, Stegle O, Stratton M, Stubbington MJT, Theis FJ, Uhlen M, van Oudenaarden A, Wagner A, Watt F, Weissman J, Wold B, Xavier R, Yosef N, and Human Cell Atlas Meeting Participants (2017). “The human cell atlas,” Elife 6, e27041.
- Stuart T, Butler A, Hoffman P, Hafemeister C, Papalexi E, Mauck WM 3rd, Hao Y, Stoeckius M, Smibert P, and Satija R (2019). “Comprehensive integration of single-cell data,” Cell 177(7), 1888–1902.e21.
- Su Y, Chen D, Yuan D, Lausted C, Choi J, Dai CL, Voillet V, Duvvuri VR, Scherler K, Troisch P, Baloni P, Qin G, Smith B, Kornilov SA, Rostomily C, Xu A, Li J, Dong S, Rothchild A, Zhou J, Murray K, Edmark R, Hong S, Heath JE, Earls J, Zhang R, Xie J, Li S, Roper R, Jones L, Zhou Y, Rowen L, Liu R, Mackay S, O’Mahony DS, Dale CR, Wallick JA, Algren HA, Zager MA, the ISB-Swedish COVID19 Biobanking Unit, Wei W, Price ND, Huang S, Subramanian N, Wang K, Magis AT, Hadlock JJ, Hood L, Aderem A, Bluestone JA, Lanier LL, Greenberg PD, Gottardo R, Davis MM, Goldman JD, and Heath JR (2020). “Multi-omics resolves a sharp disease-state shift between mild and moderate COVID-19,” Cell 183(6), 1479–1495.e20.
- van der Maaten L, and Hinton G (2008). “Visualizing data using t-SNE,” J. Mach. Learn. Res. 9, 2579–2605.
- Zheng GXY, Terry JM, Belgrader P, Ryvkin P, Bent ZW, Wilson R, Ziraldo SB, Wheeler TD, McDermott GP, Zhu J, Gregory MT, Shuga J, Montesclaros L, Underwood JG, Masquelier DA, Nishimura SY, Schnall-Levin M, Wyatt PW, Hindson CM, Bharadwaj R, Wong A, Ness KD, Beppu LW, Deeg HJ, McFarland C, Loeb KR, Valente WJ, Ericson NG, Stevens EA, Radich JP, Mikkelsen TS, Hindson BJ, and Bielas JH (2017). “Massively parallel digital transcriptional profiling of single cells,” Nat. Commun. 8, 14049.