bioRxiv preprint [Version 1], posted 2024 March 27. doi: 10.1101/2024.03.23.586420

Accelerated dimensionality reduction of single-cell RNA sequencing data with fastglmpca

Eric Weine 1,2,*, Peter Carbonetto 3, Matthew Stephens 3,4,*
PMCID: PMC10996495  PMID: 38585920

Summary:

Motivated by theoretical and practical issues that arise when applying Principal Components Analysis (PCA) to count data, Townes et al. introduced “Poisson GLM-PCA”, a variation of PCA adapted to count data, as a tool for dimensionality reduction of single-cell RNA sequencing (scRNA-seq) data. However, fitting GLM-PCA is computationally challenging. Here we study this problem, and show that a simple algorithm, which we call “Alternating Poisson Regression” (APR), produces better-quality fits, in less time, than existing algorithms. APR is also memory-efficient and lends itself to parallel implementation on multi-core processors, both of which are helpful for handling large scRNA-seq data sets. We illustrate the benefits of this approach in two published scRNA-seq data sets. The new algorithms are implemented in an R package, fastglmpca.


Almost every analysis of single-cell RNA sequencing (scRNA-seq) data involves some kind of dimensionality reduction to help summarize and denoise the data (Amezquita et al., 2020; Linderman, 2021; Stuart et al., 2019; Sun et al., 2019; Tsuyuzaki et al., 2020). Principal Components Analysis (PCA) is a widely used dimensionality reduction technique, but it has been criticized as being poorly suited to the sparse count nature of scRNA-seq data. Motivated by this, Townes et al. (2019) suggested instead using a version of PCA, called “GLM-PCA”, that is specifically tailored to count data. However, GLM-PCA is computationally challenging to fit. In this paper we provide faster algorithms, implemented in the software fastglmpca, to fit this model.

The GLM-PCA model combines PCA with ideas from generalized linear models (McCullagh, 1989), and dates back at least to Choulakian (1996); see also Collins et al. (2001) and Chen et al. (2013). We consider here the Poisson version of this model, which was the primary focus of Townes et al. (2019). The Poisson variant models the n × m data matrix Y as

\[
y_{ij} \sim \mathrm{Pois}(\lambda_{ij}), \qquad
\log \lambda_{ij} = h_{ij}, \qquad
H = UV^T, \tag{1}
\]

where $\mathrm{Pois}(\lambda)$ denotes the Poisson distribution with mean $\lambda$; $y_{ij}$ and $h_{ij}$ denote entries of the matrices Y and H, respectively; $U \in \mathbb{R}^{n \times K}$ and $V \in \mathbb{R}^{m \times K}$ are the matrices of unknowns to be estimated from the data; and K > 0 is an integer specifying the dimension of the reduced representation, typically a number much smaller than n or m. In this form, the model is symmetric in the rows and columns of Y, but by convention we assume that rows i are genes and columns j are cells (e.g., Nicol and Miller, 2023; Stuart et al., 2019; Townes et al., 2019). See the Supplementary Text for elaborations of this model with options to specify row (gene) and column (cell) covariates.
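For reference, the objective that all the fitting algorithms discussed below seek to maximize is the log-likelihood of model (1), which, up to an additive constant that does not depend on U and V, is

\[
\ell(U, V) = \sum_{i=1}^{n} \sum_{j=1}^{m}
\bigl( y_{ij}\, u_i^T v_j - \exp(u_i^T v_j) \bigr),
\]

where $u_i$ and $v_j$ denote the $i$th row of U and the $j$th row of V, so that $h_{ij} = u_i^T v_j$.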

In (1), we do not impose the typical PCA constraints enforcing orthogonality on U and V because such constraints can be easily applied after fitting; that is, once U, V have been estimated, a “PCA-like” decomposition for H can be obtained from a singular value decomposition of the estimated $UV^T$. See the Supplementary Text for details.
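As an illustration, such a decomposition can be computed without ever forming the full n × m matrix $UV^T$, since $UV^T$ has rank at most K. The following is a minimal R sketch of this idea (our illustration; the exact procedure used in fastglmpca is described in the Supplementary Text):

```r
# Recover a PCA-like orthogonal decomposition from fitted factors
# U (n x K) and V (m x K) without forming the n x m matrix U %*% t(V).
orthogonalize <- function (U, V) {
  qu <- qr(U)                 # U = Qu %*% Ru
  qv <- qr(V)                 # V = Qv %*% Rv
  # U V^T = Qu (Ru Rv^T) Qv^T, and Ru Rv^T is only K x K, so its SVD is cheap.
  s <- svd(qr.R(qu) %*% t(qr.R(qv)))
  list(U = qr.Q(qu) %*% s$u,  # orthonormal columns (left "PCs")
       d = s$d,               # analogues of the singular values
       V = qr.Q(qv) %*% s$v)  # orthonormal columns (right "PCs")
}
```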

Whereas standard PCA involves straightforward application of a (truncated) SVD algorithm, fitting the GLM-PCA model is much less straightforward; computing a maximum-likelihood estimate (MLE) of U, V in (1) is a high-dimensional, nonconvex optimization problem. The glmpca R package (Townes et al., 2019) uses stochastic gradients (Asi and Duchi, 2019; Bottou et al., 2018), progressively improving the parameter estimates in the direction of a noisy estimate of the gradient computed from a random subset (a “mini-batch”) of the cells. (The R package NewWave fits a related model via stochastic gradients; see Agostinis et al. 2022.) Since the performance of stochastic gradient methods can depend strongly on the choice of learning rate, the glmpca authors used the adaptive AvaGrad method (Savarese et al., 2021). However, even with AvaGrad, the method can be very sensitive to the choice of learning rate, and may be unstable if the learning rate is too large. (glmpca implements other approaches, but we and others (Nicol and Miller, 2023) have found that the AvaGrad approach generally performs best.)
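To make the general strategy concrete, here is a minimal R sketch of a single mini-batch gradient ascent step on the Poisson GLM-PCA log-likelihood. This illustrates only the basic idea; it is not the AvaGrad update implemented in glmpca, and the function and argument names are ours:

```r
# One stochastic gradient ascent step on the Poisson GLM-PCA log-likelihood,
# using a random mini-batch of cells (columns of Y).
sgd_step <- function (Y, U, V, batch_size = 100, stepsize = 1e-4) {
  m <- ncol(Y)
  j <- sample(m, batch_size)                   # random mini-batch of cells
  R <- Y[, j, drop = FALSE] - exp(U %*% t(V[j, , drop = FALSE]))
  gU <- (R %*% V[j, , drop = FALSE]) * (m / batch_size)  # rescaled gradient wrt U
  gV <- t(R) %*% U                             # gradient wrt the rows V[j, ]
  U <- U + stepsize * gU
  V[j, ] <- V[j, , drop = FALSE] + stepsize * gV
  list(U = U, V = V)
}
```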


The scGBM R package (Nicol and Miller, 2023) takes a different approach, iteratively solving an approximation to the log-likelihood that has the form of a more tractable “weighted SVD” problem (Srebro and Jaakkola, 2003). This approach, called IRSVD (“iteratively reweighted SVD”), can be very memory-intensive—for example, it involves forming a matrix of the same size as Y that is not sparse—which limits its application to larger scRNA-seq data sets (see also Lee and Han 2022 for related discussion of these issues).

Here we describe another approach to fitting GLM-PCA models that is based on a simple observation: when V is fixed, computing the MLE for U reduces to the much simpler and very well studied problem of independently fitting n generalized linear models (GLMs) with a Poisson error distribution and log-link function (McCullagh, 1989). Similarly, when U is fixed, computing the MLE of V reduces to independently fitting m Poisson GLMs. This suggests a block-coordinate optimization approach (Wright, 2015) that alternates between optimizing U with fixed V, and optimizing V with fixed U (Algorithm 1). This approach is analogous to the “alternating least squares” algorithm for truncated SVD (Hastie et al., 2015), and a similar alternating approach has proven very effective for nonnegative matrix factorization (Carbonetto et al., 2021; Hien and Gillis, 2021; Kim et al., 2014). We call our approach “Alternating Poisson Regression” (APR) to draw attention to its two key aspects: (i) the alternating optimization of U and V, and (ii) the reduction to simpler Poisson GLM optimization problems. We have implemented the APR algorithm in the R package fastglmpca.
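In outline, the APR algorithm is simple enough to sketch in a few lines of R. For readability, this sketch solves each Poisson GLM with R's IRLS-based glm.fit(); the actual fastglmpca implementation instead uses the coordinate descent updates described below, implemented in C++, and runs the independent GLM fits in parallel:

```r
# A minimal sketch of Alternating Poisson Regression (APR) for the model
# y_ij ~ Pois(exp(u_i' v_j)), with Y an n x m matrix of counts.
apr <- function (Y, K, numiter = 20) {
  n <- nrow(Y)
  m <- ncol(Y)
  U <- matrix(0.1 * rnorm(n * K), n, K)
  V <- matrix(0.1 * rnorm(m * K), m, K)
  for (iter in 1:numiter) {
    # With V fixed, the MLE of U separates into n independent Poisson
    # GLMs, one per row (gene) of Y, with "design matrix" V.
    for (i in 1:n)
      U[i, ] <- glm.fit(V, Y[i, ], family = poisson())$coefficients
    # With U fixed, the MLE of V separates into m independent Poisson
    # GLMs, one per column (cell) of Y, with "design matrix" U.
    for (j in 1:m)
      V[j, ] <- glm.fit(U, Y[, j], family = poisson())$coefficients
  }
  list(U = U, V = V)
}
```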

The APR approach has several benefits. First, it has strong convergence guarantees; the block-coordinatewise updates monotonically improve the log-likelihood, and under mild conditions converge to a (local) maximum of the likelihood (Wright, 2015). In addition, by splitting the large optimization problem into smaller pieces (the Poisson GLMs), the computations are memory-efficient and are trivially parallelized to take advantage of multi-core processors.

Since APR reduces the problem of fitting a Poisson GLM-PCA model to the problem of fitting many (much smaller) Poisson GLMs, the speed of the APR algorithm depends critically on how efficiently one can fit the individual Poisson GLMs. The “classic” algorithm for fitting GLMs is iteratively reweighted least squares (IRLS) (Green, 1984; McCullagh, 1989). However, the complexity of IRLS grows quickly with K (each IRLS iteration solves a K × K weighted least-squares problem), so we would prefer an approach that remains fast for large K. Therefore, we instead take a cyclic coordinate descent (CCD) approach to fitting each Poisson GLM, which involves very simple (and therefore very fast) 1-d Newton updates (Nocedal and Wright, 2006) of the GLM parameters. Although the convergence behaviour of CCD is theoretically much worse than that of IRLS, the CCD updates can converge quickly in practice, especially when we orthogonalize U and V at each iteration. These and other details of the implementation that improve speed and reliability are discussed in the Supplementary Text.
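For concreteness, here is a minimal R sketch of CCD with 1-d Newton updates for a single Poisson GLM, which maximizes $\sum_j (y_j\, x_j^T b - \exp(x_j^T b))$ by cycling over the coordinates of b. This is our illustration only; the refinements that improve speed and reliability, discussed in the Supplementary Text, are omitted:

```r
# Cyclic coordinate descent for one Poisson GLM with log link:
# maximize sum(y * (X %*% b) - exp(X %*% b)) over b.
ccd_poisson <- function (X, y, b, numiter = 10) {
  eta <- drop(X %*% b)              # linear predictor, log of the Poisson mean
  for (iter in 1:numiter) {
    for (k in 1:ncol(X)) {
      mu <- exp(eta)                # current Poisson means
      g <- sum(X[, k] * (y - mu))   # first derivative with respect to b[k]
      h <- sum(X[, k]^2 * mu)       # minus the second derivative
      step <- g / h                 # 1-d Newton step
      b[k] <- b[k] + step
      eta <- eta + step * X[, k]    # update the linear predictor in place
    }
  }
  b
}
```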

To illustrate the benefits of the APR approach for GLM-PCA, we analyzed two scRNA-seq data sets: 7,193 cells from the tracheal epithelium in wild-type mice (Montoro et al., 2018) and 68,579 cells from peripheral blood mononuclear cells (“68k PBMC”) (Zheng et al., 2017). We compared APR, implemented in the R package fastglmpca, to two existing software implementations in R: the Iterative Reweighted SVD (IRSVD) algorithm implemented in the R package scGBM (Miller and Carter, 2020; Nicol and Miller, 2023); and the adaptive stochastic gradient algorithm (“AvaGrad”) implemented in the R package glmpca (Savarese et al., 2021; Townes, 2019; Townes et al., 2019).

Since all the methods attempt to optimize the same objective function (the log-likelihood), we use this objective function to compare the quality of the fits. The quality of the fit and the running time depend very strongly on the criterion used to stop the model fitting. Since this criterion is somewhat arbitrary, we performed these comparisons by visualizing the evolution of the log-likelihood against running time. The results for different settings of K, ranging from 2 to 25, are summarized in Supplementary Figures 1 and 2, and for K = 10 in Fig. 1A. (Full details of these comparisons are given in the Supplementary Text.) The typical result is that, while all the algorithms continued to (slowly) improve the fits even after running for many hours, the APR algorithm improved the fit at a much greater rate than the other approaches. (In two exceptions, AvaGrad appeared to settle into better local solutions than APR; Supplementary Fig. 1.) Examining the log-likelihood for each cell reveals that fastglmpca consistently improves the fit across almost all cells, rather than just a small subset of cells (Supplementary Figures 3, 4). When multi-processor computing resources are available, APR can easily leverage them to dramatically speed up model fitting. For example, to achieve the same log-likelihood as running AvaGrad for 10 hours, the parallel APR updates running on a 28-core processor needed only 10 minutes on the 68k PBMC data set and only 1 minute on the epithelial airway data set (Fig. 1A).

Fig. 1:

Comparison of GLM-PCA model fitting algorithms on two scRNA-seq data sets: mouse epithelial airway data (7,193 cells) and 68k PBMC data (68,579 cells). (A) Improvement in K = 10 GLM-PCA fits over time. Log-likelihoods are shown relative to the best log-likelihood recovered among the methods compared. The y-axis has a log scale, and log-likelihood differences less than 1 are shown as 1. (B) The first 2 columns of V (the leading two dimensions of the cell embedding) after fitting the model for about 10 hours.

To assess whether the different log-likelihoods achieved by different methods corresponded to qualitatively different solutions, we examined the estimated latent factors (the “PCs”) returned by each method. We found that the different methods often generated quite different PCs; consider the different K = 10 GLM-PCA representations of the epithelial airway and 68k PBMC transcriptome profiles produced by the three different model fitting algorithms (Fig. 1B, C, Supplementary Figures 5, 6). Although there is no ground truth here, the results nonetheless show that fastglmpca not only yields better log-likelihoods, but also yields solutions that are qualitatively different from the existing methods.

The computational effort involved in running all the algorithms grows at most linearly in n, m and K (Supplementary Fig. 7). However, fastglmpca can cope with much larger scRNA-seq data sets because of the care taken to avoid computations that “fill in” the sparse data matrix Y. In summary, the key differentiating factors are (1) the speed at which fastglmpca finds good GLM-PCA fits, particularly when the updates can be run in parallel, and (2) numerical computations that limit memory usage when the data matrix Y is large and sparse. We also observed, anecdotally, the potential to further accelerate model fitting using DAAREM (Henderson and Varadhan, 2019; Tang et al., 2022) (Fig. 1A).

The Poisson GLM-PCA model (1) can be seen as combining a Poisson measurement model with a low-rank (log) expression model (in the terminology of Sarkar and Stephens 2021). There is good theoretical and empirical support for the Poisson measurement model, but the expression model would likely be improved by allowing for deviations from an exact low-rank structure (Sarkar and Stephens, 2021). This idea motivates alternative models, such as the negative binomial variation of GLM-PCA that is implemented in glmpca (see also the NewWave package; Agostinis et al., 2022). Future work could consider extending the algorithms introduced here to the negative binomial case.

In summary, we contribute a new R package, fastglmpca, which implements fast algorithms for dimensionality reduction of count data based on the Poisson GLM-PCA model. The package is available on CRAN for all major computing platforms. It features a well-documented, user-friendly model fitting interface that aligns closely with glmpca and scGBM, and a vignette giving an example analysis of scRNA-seq data. The interface splits the GLM-PCA analysis into two phases: an initialization phase, where modeling choices are made, including the rank, K, and row- and column-covariates (function “init_glmpca_pois”); and a model fitting phase, where the optimization may be monitored and fine-tuned (function “fit_glmpca_pois”). The core model fitting routines were implemented efficiently in C++ using the Armadillo linear algebra library (Sanderson and Curtin, 2016, 2019) and Intel Threading Building Blocks (TBB) (Allaire et al., 2023; Contreras and Martonosi, 2008).
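A short usage sketch of this two-phase interface follows. The function names are as given above; the exact argument names beyond Y and K (e.g., “fit0”) are our assumptions, so consult the package documentation for the precise signatures:

```r
library(fastglmpca)
set.seed(1)
Y <- matrix(rpois(200 * 50, 1), 200, 50)  # toy counts: 200 genes x 50 cells
fit0 <- init_glmpca_pois(Y, K = 3)        # phase 1: rank and covariate choices
fit <- fit_glmpca_pois(Y, fit0 = fit0)    # phase 2: run (and monitor) fitting
```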

Supplementary Material

Supplement 1
media-1.pdf (12.1MB, pdf)

Acknowledgments

We thank the staff at the Research Computing Center for providing the high-performance computing resources used to run the numerical experiments. We also thank Abhishek Sarkar, Jason Willwerscheid, Dongyue Xie and other members of the Stephens lab for helpful suggestions and feedback.

Funding

This work was supported by the NHGRI at the National Institutes of Health under award number R01HG002585.

Footnotes

Conflicts of interest

All authors declare no conflicts of interest.

Supplementary information:

Supplementary data are available at bioRxiv online.

Availability and implementation:

The fastglmpca R package is released on CRAN for Windows, macOS and Linux, and the source code is available at github.com/stephenslab/fastglmpca under the open source GPL-3 license. Scripts to reproduce the results in this paper are also available in the GitHub repository.

References

  1. Agostinis F., Romualdi C., Sales G., and Risso D. NewWave: a scalable R/Bioconductor package for the dimensionality reduction and batch effect removal of single-cell RNA-seq data. Bioinformatics, 38(9):2648–2650, 2022.
  2. Allaire J., Francois R., Ushey K., Vandenbrouck G., Geelnard M., and Intel. RcppParallel: parallel programming tools for ‘Rcpp’, 2023. URL https://CRAN.R-project.org/package=RcppParallel. R package version 5.1.7.
  3. Amezquita R. A., Lun A. T. L., Becht E., Carey V. J., Carpp L. N., Geistlinger L., et al. Orchestrating single-cell analysis with Bioconductor. Nature Methods, 17(2):137–145, 2020.
  4. Asi H. and Duchi J. C. The importance of better models in stochastic optimization. Proceedings of the National Academy of Sciences, 116(46):22924–22930, 2019.
  5. Bottou L., Curtis F. E., and Nocedal J. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311, 2018.
  6. Carbonetto P., Sarkar A., Wang Z., and Stephens M. Non-negative matrix factorization algorithms greatly improve topic model fits. arXiv, 2105.13440, 2021.
  7. Chen M., Li W., Zhang W., and Wang X. Dimensionality reduction with generalized linear models. In Proceedings of the 23rd International Joint Conference on Artificial Intelligence, pages 1267–1272, 2013.
  8. Choulakian V. Generalized bilinear models. Psychometrika, 61(2):271–283, 1996.
  9. Collins M., Dasgupta S., and Schapire R. E. A generalization of principal components analysis to the exponential family. Advances in Neural Information Processing Systems, 14, 2001.
  10. Contreras G. and Martonosi M. Characterizing and improving the performance of Intel Threading Building Blocks. In IEEE International Symposium on Workload Characterization, pages 57–66, 2008.
  11. Green P. J. Iteratively reweighted least squares for maximum likelihood estimation, and some robust and resistant alternatives. Journal of the Royal Statistical Society, Series B, 46(2):149–192, 1984.
  12. Hastie T., Mazumder R., Lee J. D., and Zadeh R. Matrix completion and low-rank SVD via fast alternating least squares. Journal of Machine Learning Research, 16(104):3367–3402, 2015.
  13. Henderson N. C. and Varadhan R. Damped Anderson acceleration with restarts and monotonicity control for accelerating EM and EM-like algorithms. Journal of Computational and Graphical Statistics, 28(4):834–846, 2019.
  14. Hien L. T. K. and Gillis N. Algorithms for nonnegative matrix factorization with the Kullback–Leibler divergence. Journal of Scientific Computing, 87(3):93, 2021.
  15. Kim J., He Y., and Park H. Algorithms for nonnegative matrix and tensor factorizations: a unified view based on block coordinate descent framework. Journal of Global Optimization, 58(2):285–319, 2014.
  16. Lee H. and Han B. FastRNA: an efficient solution for PCA of single-cell RNA-sequencing data based on a batch-accounting count model. American Journal of Human Genetics, 109(11):1974–1985, 2022.
  17. Linderman G. C. Dimensionality reduction of single-cell RNA-seq data. In Picardi E., editor, RNA Bioinformatics, pages 331–342. Springer, New York, NY, 2021.
  18. McCullagh P. and Nelder J. A. Generalized linear models. Chapman and Hall, New York, NY, 2nd edition, 1989.
  19. Miller J. W. and Carter S. L. Inference in generalized bilinear models. arXiv, 2010.04896, 2020.
  20. Montoro D. T., Haber A. L., Biton M., Vinarsky V., Lin B., Birket S. E., et al. A revised airway epithelial hierarchy includes CFTR-expressing ionocytes. Nature, 560(7718):319–324, 2018.
  21. Nicol P. B. and Miller J. W. Model-based dimensionality reduction for single-cell RNA-seq using generalized bilinear models. bioRxiv, 2023. doi: 10.1101/2023.04.21.537881.
  22. Nocedal J. and Wright S. J. Numerical optimization. Springer, New York, NY, 2nd edition, 2006.
  23. Sanderson C. and Curtin R. Armadillo: a template-based C++ library for linear algebra. Journal of Open Source Software, 1(2):26, 2016.
  24. Sanderson C. and Curtin R. Practical sparse matrices in C++ with hybrid storage and template-based expression optimisation. Mathematical and Computational Applications, 24(3):70, 2019.
  25. Sarkar A. and Stephens M. Separating measurement and expression models clarifies confusion in single-cell RNA sequencing analysis. Nature Genetics, 53(6):770–777, 2021.
  26. Savarese P., McAllester D., Babu S., and Maire M. Domain-independent dominance of adaptive methods. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16286–16295, 2021.
  27. Srebro N. and Jaakkola T. Weighted low-rank approximations. In Proceedings of the 20th International Conference on Machine Learning, pages 720–727, 2003.
  28. Stuart T., Butler A., Hoffman P., Hafemeister C., Papalexi E., Mauck W. M., Hao Y., Stoeckius M., Smibert P., and Satija R. Comprehensive integration of single-cell data. Cell, 177(7):1888–1902, 2019.
  29. Sun S., Zhu J., Ma Y., and Zhou X. Accuracy, robustness and scalability of dimensionality reduction methods for single-cell RNA-seq analysis. Genome Biology, 20:269, 2019.
  30. Tang B., Henderson N. C., and Varadhan R. Accelerating fixed-point algorithms in statistics and data science: a state-of-art review. Journal of Data Science, 21(1):1–26, 2022.
  31. Townes F. W. Generalized principal component analysis. arXiv, 1907.02647, 2019.
  32. Townes F. W., Hicks S. C., Aryee M. J., and Irizarry R. A. Feature selection and dimension reduction for single-cell RNA-Seq based on a multinomial model. Genome Biology, 20:295, 2019.
  33. Tsuyuzaki K., Sato H., Sato K., and Nikaido I. Benchmarking principal component analysis for large-scale single-cell RNA-sequencing. Genome Biology, 21:9, 2020.
  34. Wright S. J. Coordinate descent algorithms. Mathematical Programming, 151:3–34, 2015.
  35. Zheng G. X., Terry J. M., Belgrader P., Ryvkin P., Bent Z. W., Wilson R., et al. Massively parallel digital transcriptional profiling of single cells. Nature Communications, 8:14049, 2017.


