Biostatistics. 2026 May 4;27(1):kxag010. doi: 10.1093/biostatistics/kxag010

Stochastic gradient descent estimation of generalized matrix factorization models with application to single-cell RNA sequencing data

Cristian Castiglione 1, Alexandre Segers 2,3, Lieven Clement 4, Davide Risso 5

Summary

Single-cell RNA sequencing allows the quantification of gene expression at the individual cell level, enabling the study of cellular heterogeneity and gene expression dynamics. Dimensionality reduction is a common preprocessing step critical for the visualization, clustering, and phenotypic characterization of samples. This step, often performed using principal component analysis or closely related methods, is challenging because of the size and complexity of the data. In this work, we present a generalized matrix factorization model assuming a general exponential dispersion family distribution and we show that many of the proposed approaches in the single-cell dimensionality reduction literature can be seen as special cases of this model. Furthermore, we propose a scalable adaptive stochastic gradient descent algorithm that allows us to estimate the model efficiently, enabling the analysis of millions of cells. We benchmark the proposed algorithm through extensive numerical experiments against state-of-the-art methods and showcase its use in real-world biological applications. The proposed method systematically outperforms existing methods of both generalized and non-negative matrix factorization, demonstrating faster execution times and parsimonious memory usage, while maintaining, or even enhancing, matrix reconstruction fidelity and accuracy in biological signal extraction. On real data, we show that our method scales seamlessly to millions of cells, enabling dimensionality reduction in large single-cell datasets. Finally, all the methods discussed here are implemented in an efficient open-source R package, sgdGMF, available on CRAN.

Keywords: dimension reduction, generalized linear models, matrix factorization, RNA-seq, single-cell, stochastic optimization

1. Introduction

1.1. The ever-increasing single-cell RNA sequencing data sets

Single-cell RNA sequencing (scRNA-seq) technologies have revolutionized the comprehension of biological processes by offering a quantitative measure of transcript abundance at the individual cell level. Single-cell resolution is critical for the study of cellular heterogeneity (Kim and Cho 2023; Wu et al. 2023), temporal dynamics (Kouno et al. 2013; Jean-Baptiste et al. 2019), and cell-type differentiation (Denyer et al. 2019). However, the size and complexity of the data have dramatically increased compared to bulk sample-level assays, challenging statistical methods and software implementations to deal with thousands of genes profiled in millions of cells (Angerer et al. 2017; Kharchenko 2021).

From a statistical perspective, scRNA-seq yields high-dimensional count data, in which observations (ie cells) lie in a gene space whose dimension equals the number of profiled genes. Working in such high-dimensional spaces poses a wealth of statistical and computational challenges. Lähnemann et al. (2020) identified eleven “grand challenges” in single-cell data science; here we focus on three of them: (i) the handling of data sparsity, (ii) the integration of datasets across samples and experiments, and (iii) the definition of flexible statistical frameworks for the discovery of complex gene expression patterns. These challenges are exacerbated by the growing size and complexity of single-cell datasets. Indeed, while early studies focused on few cells from one or few samples, modern studies involve complex experimental designs, collecting thousands of cells from several individuals across different experimental conditions (eg Stephenson et al. 2021; Perez et al. 2022).

Regarding sparsity, single-cell sequencing technologies yield count data with low mean and large variance, leading to extremely skewed distributions with a large fraction of zeros (Hicks et al. 2018). To address zero inflation and overdispersion problems, Risso et al. (2018) introduced a zero-inflated negative binomial framework for gene-expression matrix factorization, named ZINB-WaVE. Despite its superior performance compared to other methods, ZINB-WaVE’s computational complexity renders it obsolete for the ever-increasing data volumes (Sun et al. 2019; Cao et al. 2021). Moreover, in modern UMI-based scRNA-seq datasets, the need for modeling zero inflation has decreased (Townes et al. 2019; Svensson 2020; Ahlmann-Eltze and Huber 2023; Nguyen et al. 2023). In this regard, simpler negative binomial models, such as those developed by Townes et al. (2019) and Agostinis et al. (2022), achieve similar performances with a much smaller computational burden. Nevertheless, even these methods struggle to scale with the current massive data volumes.

Another main challenge is the integration of data across different samples, experiments, or labs. These effects, often globally referred to as “batches,” represent a source of unwanted variation that needs to be accounted for in the analysis. In unsupervised settings, for instance cell clustering or trajectory analysis, batch effects are usually corrected for with ad hoc methods (eg Haghverdi et al. 2018; Korsunsky et al. 2019). While many such methods exist, the batch integration of single-cell data is still an open problem (Lähnemann et al. 2020; Luecken et al. 2025).

Factor analysis models are a promising, unified framework that makes it possible to account for the count nature of the data, adjust for potential sources of unwanted variation, and provide a parsimonious, low-dimensional representation that can be used as input for downstream analyses. To this end, state-of-the-art methods leverage a low-rank representation of the data to extract useful information while correcting for known confounders such as batch effects or other technical or biological covariates (Risso et al. 2018; Townes et al. 2019; Agostinis et al. 2022).

However, because of the lack of scalability of state-of-the-art count-based factor models, biological researchers often resort to log-transformation of the scRNA-seq count tables and apply conventional principal component analysis (PCA; Ahlmann-Eltze and Huber 2023). While this is a much faster alternative, it ignores the complex mean-variance relation and the discrete nature of the data and may introduce synthetic biases due to imperfect data transformations (Townes et al. 2019). Furthermore, simple PCA of log-transformed counts fails to account for batch effects. Therefore, there remains a need for fast and memory-efficient matrix factorization tools that operate on the original count scale and can include covariates, such as batch effects.

1.2. A matrix factorization perspective

The above-mentioned factor analysis models are examples of matrix factorization, a statistical tool of fundamental importance in many theoretical and applied fields. In general, matrix factorization methods aim at decomposing the target data matrix into the product of two lower-rank matrices explaining the principal modes of variation in the observations. Such low-rank matrices are typically interpreted in terms of factors and loadings. Factors represent stochastic latent variables, ie random effects, lying in a small-dimensional space and determining the individual characteristics of each observation in the sample. Loadings are non-stochastic coefficients mapping the latent factors into the observed data space, or some one-to-one transformation thereof. Specifically, in single-cell RNA-seq applications, factors and loadings can be interpreted as “meta genes” and “gene weights,” respectively (Brunet et al. 2004; Stein-O’Brien et al. 2018). In this context, matrix factorization is usually employed to project the cells into the space of the first few factors, to then cluster them into discrete groups (Kiselev et al. 2019) or infer other low-rank signals, such as pseudotime orderings (Street et al. 2018).

Principal component analysis (PCA; Jolliffe 1986), ie singular value decomposition (SVD), plays a central role in the literature, being the first and most used factorization method proposed in the field. It constitutes the basis for several generalizations such as probabilistic PCA (Tipping and Bishop 1999), factor analysis, and, more generally, generalized linear latent variable models (GLLVM, Bartholomew et al. 2011). Probabilistic formulations equip PCA with a data-generating mechanism that opens the door to alternative inferential procedures, such as likelihood-based and Bayesian approaches. Moreover, they provide a natural way to simulate new synthetic signals through the generating mechanism induced by the likelihood specification. Assuming a Gaussian law for the data, PCA can be formulated as the solution to a likelihood maximization problem under appropriate identifiability constraints; see, eg Tipping and Bishop (1999). This perspective unveils why PCA may be suboptimal for non-Gaussian data, such as positive scores, counts, or binary observations.

Over the years, many extensions have been proposed to address the limitations of the Gaussian PCA. Some relevant examples are non-negative matrix factorization (Lee and Seung 1999; Wang and Zhang 2013), Binary PCA (Schein et al. 2003), Poisson PCA (Durif et al. 2019; Smallman et al. 2020; Kenney et al. 2021; Virta and Artemiou 2023), exponential family PCA (Collins et al. 2001; Mohamed et al. 2008; Li and Tao 2010; Gopalan et al. 2015; Wang et al. 2020), generalized linear latent variable models (Niku et al. 2017), generalized factor model (Liu et al. 2023; Nie et al. 2024), covariate-augmented overdispersed Poisson factor model (Liu and Zhong 2024), generalized PCA (Townes et al. 2019), generalized matrix factorization (Kidziński et al. 2022), and deviance matrix factorization (Wang and Carvalho 2023).

Non-negative matrix factorization (NMF; Lee and Seung 1999; Wang and Zhang 2013) decomposes positive score matrices by minimizing either the squared error loss or the Kullback–Leibler loss under non-negativity constraints for the factor and loading matrices. The exploitation of non-negative patterns proved successful in many applied fields, such as computer vision and recommendation systems, as well as omics feature extraction (Brunet et al. 2004; Stein-O’Brien et al. 2018).

Similarly, methods based on exponential family models extend PCA by assuming a more general loss function and linking the data to the latent matrix decomposition using a smooth bijective transformation. For instance, Collins et al. (2001) considered the exponential family Bregman divergence, while Townes et al. (2019), Kidziński et al. (2022) and Wang and Carvalho (2023) considered the exponential family (negative) log-likelihood.

Exponential family generalizations of PCA, which we refer to as generalized matrix factorization (GMF) models, are typically estimated using alternated Fisher scoring algorithms implemented via iterative re-weighted least squares, or some modification thereof; see, eg Collins et al. (2001), Kidziński et al. (2022), and Wang and Carvalho (2023). Such a procedure directly extends the classical Fisher scoring algorithm for generalized linear models (McCullagh and Nelder 1989), being a stable and easy-to-code algorithmic approach. On the other hand, in high-dimensional settings, this iterative procedure suffers from two major flaws: (i) it requires multiple scans of the entire dataset during each iteration; (ii) it requires the costly numerical solution of several linear systems. It is worth noting that, for matrix factorization problems, the number of parameters to update, ie the number of linear systems to solve, is proportional to the dimension of the data matrix. To address points (i) and (ii) in the context of scRNA-seq, Townes et al. (2019) employed an alternated Fisher scoring method which only requires the computation of element-wise derivatives and matrix multiplications, avoiding expensive matrix inversions. In the same vein, Kidziński et al. (2022) proposed a quasi-Newton algorithm that only requires cheap element-wise algebraic calculations. To the best of our knowledge, these are the most efficient methods proposed in the literature for the estimation of matrix factorization models under exponential family likelihood. However, in both cases, a complete pass through the data is still necessary to fully update the parameter estimates, which might be infeasible in massive data scenarios.

1.3. Our contribution

In this work, we propose a scalable stochastic optimization algorithm to tackle the complex optimization problem underlying the estimation of high-dimensional GMF models. Specifically, the proposed algorithm relies on the stochastic gradient descent (SGD) framework (Robbins and Monro 1951; Bottou 2010) with adaptive learning rate schedules (Duchi et al. 2011; Zeiler 2012; Kingma and Ba 2014; Reddi et al. 2019). In doing so, we reduce the computational complexity of the problem through a convenient combination of minibatch subsampling, partial parameter updates, and exponential gradient averaging. Additionally, we propose two efficient initialization methods, which promote convergence to a meaningful solution and reduce the risk of converging to highly sub-optimal stationary points.

Alongside the methodology, we provide an efficient R/C++ implementation of the proposed method in the new open-source R package sgdGMF, freely available on CRAN (see https://CRAN.R-project.org/package=sgdGMF). Compared to alternative implementations in R, sgdGMF offers one of the most complete and flexible estimation frameworks for generalized matrix factorization modeling, allowing for all standard exponential family distributions, quasi-likelihood models, row- and column-specific regression effects, as well as model-based missing value imputation. Moreover, it provides several algorithms for parameter estimation, including the proposed stochastic gradient descent approach. To enhance scalability, parallel computing is employed as much as possible at every stage of the analysis, including initialization, estimation, model selection, and post-processing. Table 1 compares all these features with those of alternative packages freely available in R.

Table 1.

List of the major matrix factorization models available in the literature for exponential family data.ᵃ

Package    Model   Core
RSpectra   PCA     F/C++
NMF        NMF     C++
NNLM       NMF     C++
cmfrec     NMF     C++
gllvm      GMF     R
GFM        GMF     C++
COAP       GMF     C++
glmpca     GMF     R
zinbwave   GMF     R
newwave    GMF     R
gmf        GMF     R
dmf        GMF     R
sgdGMF     GMF     C++

[Only the Package, Model, and Core columns of the original table could be recovered here; the per-package ✓/✗ marks for the Families columns (N, G, B, P, NB, Q, ZI), the Effects columns ($x_i^\top \beta_j$, $\alpha_i^\top z_j$, $u_i^\top v_j$), and the Missing, Parallel, and Stochastic columns are described in the footnote below.]
ᵃ For each model, we report the corresponding R package (first column) and we describe its characteristics. Each feature is marked with ✓ if it is completely implemented in the package and with ✗ otherwise. The column Model indicates the broad model family implemented in the package, where GMF and NMF stand for generalized matrix factorization and non-negative matrix factorization, respectively. The column Families lists the most common distributions belonging to the exponential family, along with some generalizations: Normal (N), Gamma (G), Binomial (B), Poisson (P), Negative Binomial (NB), quasi-likelihood (Q), and zero-inflated (ZI) models. The column Effects refers to the regression effects that can be included in the linear predictor and uses the notation introduced in Section 2, Equation (2.3). The column Implementation describes some technical features of the numerical implementation: the language used for the core computations (Core), whether automatic missing value estimation is allowed (Missing), whether parallel computing is supported (Parallel), and whether minibatch subsampling and stochastic optimization methods are available (Stochastic).

We showcase the sgdGMF implementation on both simulated and real data, demonstrating the scalability of the proposed method on gene expression matrices of different dimensions. In all the numerical experiments we consider, the proposed stochastic gradient approach outperforms the alternative state-of-the-art methods in terms of execution time while having superior signal reconstruction quality, measured as out-of-sample residual deviance, logarithmic root mean squared error, and cluster separation in the latent space.

The paper is organized as follows. In Section 2, we formally define the class of generalized matrix factorization models, formulate the associated estimation problem, and discuss the connections with other models in the literature. In Section 3, we introduce the proposed stochastic optimization method, building upon quasi-Newton and stochastic gradient descent algorithms. In Section 4, we briefly discuss additional computational aspects, such as parameter initialization and model selection. In Section 5, we empirically compare the proposed algorithm with several state-of-the-art methods in the literature through an extensive simulation study. In Section 6, we present two case studies on real datasets of medium and high dimension, respectively, demonstrating the effectiveness of our approach in extracting coherent biological signals while maintaining a high level of computational efficiency. Section 7 is devoted to a concluding discussion and future research directions.

2. Model specification

Let us define $Y$ as the $n \times m$ data matrix containing the response variables of interest, with $(i,j)$th entry $y_{ij}$, $i$th row $y_i$ and $j$th column $y_j$. Conventionally, $y_{ij}$ is here considered as the measurement of the $i$th observational unit on the $j$th variable, which in our genomic application are the $i$th cell and the $j$th feature/gene. Hereafter, all the vectors are column vectors and all the transposed vectors, denoted by $(\cdot)^\top$, are row vectors. To account for non-Gaussian observations, such as odds, counts, or continuous positive scores, we consider for the response variable $y_{ij}$ an exponential dispersion family (EF) distribution with natural parameter $\theta_{ij}$ and dispersion $\phi_{ij}$, denoted by

$$y_{ij} \sim \mathrm{EF}(\theta_{ij}, \phi_{ij}). \qquad (2.1)$$

Moreover, we denote the mean and variance of $y_{ij}$ by $\mu_{ij} = E(y_{ij})$ and $\mathrm{Var}(y_{ij}) = \phi_{ij}\,\nu(\mu_{ij})$, respectively. Here and elsewhere, $\nu(\cdot)$ is the family-specific variance function, which controls the heteroscedastic relationship between the mean and variance of $y_{ij}$, while $\phi_{ij}$ is a dispersion function specified as $\phi_{ij} = \phi / w_{ij}$, with $w_{ij}$ being a user-specified weight and $\phi$ being a scalar dispersion parameter. Under such a model specification, the probability density function of $y_{ij}$ can be written as

$$f(y_{ij}; \theta_{ij}, \phi_{ij}) = \exp\!\left\{ \frac{y_{ij}\,\theta_{ij} - b(\theta_{ij})}{\phi_{ij}} + c(y_{ij}, \phi_{ij}) \right\}, \qquad (2.2)$$

where $b(\cdot)$ and $c(\cdot,\cdot)$ are family-specific functions. Specifically, $b(\cdot)$ is a convex twice differentiable cumulant function and $c(\cdot,\cdot)$ is the so-called log-partition function. By the fundamental properties of the dispersion exponential family, the mean and variance of $y_{ij}$ satisfy the identities $\mu_{ij} = b'(\theta_{ij})$ and $\mathrm{Var}(y_{ij}) = \phi_{ij}\, b''(\theta_{ij})$, which imply $\nu(\mu_{ij}) = b''\{(b')^{-1}(\mu_{ij})\}$, where $b'(\cdot)$ and $b''(\cdot)$ denote the first and second derivatives of $b(\cdot)$, respectively. Therefore, $b'(\cdot)$ is a bijective map and $\theta_{ij} = (b')^{-1}(\mu_{ij})$ is uniquely determined by $\mu_{ij}$. We recall that the deviance function of model (2.2) is defined as $D(y, \mu) = 2\,\phi \log\{L(y; y) / L(\mu; y)\}$, where $L(\mu; y)$ denotes the likelihood function relative to observation $y$ expressed as a function of the mean $\mu$ and possibly depending on the dispersion $\phi$.

Some relevant distributions belonging to the exponential family are, among others, the Gaussian, inverse Gaussian, gamma, Poisson, binomial, and negative binomial laws (see Table S1 for family-specific variance functions and the associated canonical link and deviance functions of these examples). Of particular interest for omics applications are the Poisson and negative binomial distributions, as the gene expressions are typically obtained from technologies that yield count data (eg RNA sequencing or in-situ hybridization). However, formulating the model in the general form of the exponential family allows for a more general treatment and for the application to other omics technologies, such as mass spectrometry and microarray assays, which yield continuous readouts.
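To make the family-specific quantities concrete, the following R snippet computes the unit deviances of the Poisson and negative binomial families, the two most relevant cases for scRNA-seq counts; this is a minimal sketch, with theta denoting the negative binomial size parameter.

```r
# Unit deviance of the Poisson family:
# d(y, mu) = 2 * {y * log(y / mu) - (y - mu)}, with y * log(y / mu) = 0 at y = 0.
poisson_deviance <- function(y, mu) {
  ylogy <- ifelse(y > 0, y * log(y / mu), 0)
  2 * (ylogy - (y - mu))
}

# Unit deviance of the negative binomial family with known size theta:
# d(y, mu) = 2 * {y * log(y / mu) - (y + theta) * log((y + theta) / (mu + theta))}.
negbin_deviance <- function(y, mu, theta) {
  ylogy <- ifelse(y > 0, y * log(y / mu), 0)
  2 * (ylogy - (y + theta) * log((y + theta) / (mu + theta)))
}
```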

We complete the model specification by introducing an appropriate parametrization for the conditional mean $\mu_{ij} = E(y_{ij})$. In particular, we consider the generalized multivariate regression model

$$g(\mu_{ij}) = \eta_{ij} = x_i^\top \beta_j + \alpha_i^\top z_j + u_i^\top v_j, \qquad (2.3)$$

where $g(\cdot)$ is a continuously differentiable bijective link function and $\eta_{ij}$ is a linear predictor. The latter is represented as an additive decomposition of three terms: a column-specific regression effect, $x_i^\top \beta_j$, a row-specific regression effect, $\alpha_i^\top z_j$, and a residual matrix factorization, $u_i^\top v_j$. Specifically, $x_i \in \mathbb{R}^p$ and $z_j \in \mathbb{R}^q$ denote observed covariate vectors, $\beta_j \in \mathbb{R}^p$ and $\alpha_i \in \mathbb{R}^q$ are unknown regression coefficient vectors, while $u_i \in \mathbb{R}^d$ and $v_j \in \mathbb{R}^d$ encode latent traits explaining the residual modes of variation in the data that are not captured by the regression effects. Finally, we introduce the vector of unknown parameters in the model as $\pi = (\beta^\top, \alpha^\top, u^\top, v^\top)^\top$, where the lower-case letters represent the flat vectorization of the corresponding matrix forms; for instance, $u = \mathrm{vec}(U)$ with $U = (u_1, \dots, u_n)^\top$.
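In matrix form, the linear predictor in (2.3) is assembled from three products; the following R sketch makes the dimensions explicit (all inputs are simulated placeholders, and the log link of the Poisson family is used for illustration).

```r
n <- 100; m <- 50; p <- 2; q <- 1; d <- 5
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))  # n x p cell-level covariates
B <- matrix(rnorm(m * p), m, p)                      # m x p column-specific coefficients
Z <- matrix(1, m, q)                                 # m x q gene-level covariates
A <- matrix(rnorm(n * q), n, q)                      # n x q row-specific coefficients
U <- matrix(rnorm(n * d), n, d)                      # n x d latent scores
V <- matrix(rnorm(m * d), m, d)                      # m x d loadings

eta <- X %*% t(B) + A %*% t(Z) + U %*% t(V)          # n x m linear predictor
mu  <- exp(eta)                                      # inverse log link, mean matrix
```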

Specifically, in the scRNA-seq context, the regression term $x_i^\top \beta_j$ is often used to control for cell-specific technical confounders, such as batch effects and individual characteristics in multi-subject studies, with $x_i$ containing indicator variables of the group assignment and group-specific attributes. Similarly, the second regression term, $\alpha_i^\top z_j$, can account for gene-specific information contained in the covariate vector $z_j$, such as GC-content or known functional interactions between genes.

Notice that, even in the absence of external confounders, such regression effects allow for the inclusion of cell- and gene-specific intercepts, say $\alpha_{i0}$ and $\beta_{j0}$, which play the role of centering factors on the linear predictor scale. In scRNA-seq, the gene-specific intercept $\beta_{j0}$ represents a baseline expression level for gene $j$, capturing systematic differences in average expression across genes. Further, the cell-specific intercept $\alpha_{i0}$ is of particular importance, as it acts as a scale factor that allows the model to be applied to raw counts rather than library size normalized data (Risso et al. 2018), which is critical since normalization can lead to distortions (Townes et al. 2019).

An alternative approach, common in the scRNA-seq literature, is the use of a fixed offset to account for library size (Robinson et al. 2010; Love et al. 2014). However, fixing $\alpha_{i0}$ equal to the log library size implicitly assumes that the observed scale factor is a known and exact proxy for the true cell-specific effect. In scRNA-seq this assumption can be violated because the total UMI count reflects not only sequencing depth but also capture efficiency, cell size, RNA content, and composition effects, and it can be correlated with latent cell states (Vallejos et al. 2017). Our approach estimates $\alpha_{i0}$ under explicit identifiability constraints, thus treating the cell-specific global effect as an unknown scaling factor and propagating its uncertainty into downstream inference. This reduces model misspecification when the offset is imperfect and prevents the latent factors $u_i$ from inadvertently capturing depth-related variation forced by a fixed offset.

The latent traits $u_i$ can be interpreted as meta-gene variables representing the fundamental biological characteristics of the $i$th cell in a latent low-dimensional space that accounts for the covariate effects, with an interpretation similar to that of standard PCA. For example, the meta-gene variables can be used for data visualization, cell clustering, and lineage reconstruction. It is worth noting that, unlike formulation (2.3), most factorization models proposed in the literature and implemented in standard computing environments do not support the inclusion of row effects $\alpha_i$, column effects $\beta_j$, or both, resulting in latent representations that still capture the variability of unwanted confounders, such as batch effects, or that do not include a normalization factor. This limitation applies to gmf (Kidziński et al. 2022), dmf (Wang and Carvalho 2023), GFM (Liu et al. 2023; Nie et al. 2024), COAP (Liu and Zhong 2024), NMF (Gaujoux and Seoighe 2010), and NNLM (Lin and Boutros 2020), as summarized in Table 1.

Following the naming convention introduced by Kidziński et al. (2022), we refer to the model specification presented so far in (2.1) to (2.3) as generalized matrix factorization (GMF). Alternative nomenclatures, such as exponential family principal component analysis (EPCA; Collins et al. 2001; Mohamed et al. 2008; Li and Tao 2010), generalized low-rank models (GLRM; Udell et al. 2016), generalized linear latent variable model (GLLVM; Hui et al. 2017; Niku et al. 2017), generalized principal component analysis (glmPCA; Townes et al. 2019), deviance matrix factorization (DMF; Wang and Carvalho 2023), or generalized factor model (GFM; Liu et al. 2023; Liu and Zhong 2024; Nie et al. 2024), can also be found in the literature. Straightforward extensions of the GMF specification include pseudo-likelihood models, with the density $f(y_{ij}; \eta_{ij})$ replaced by the negative exponential of a loss function $\Psi(y_{ij}, \eta_{ij})$. Also, vector generalized estimating equations for overdispersed data are a particular case of this more general setup.

2.1. Parameter identifiability

The multivariate generalized linear model in (2.3) is non-identifiable: it allows the regression and latent terms to overlap in column space, and it is invariant with respect to rotation, scaling, and sign-flip transformations of U and V. Thus, to enforce the uniqueness of the matrix decomposition, we need to impose additional identifiability constraints. First, we require the orthogonality of the parameters with respect to the covariate column space:

(A) X and Z are full-column rank matrices; moreover, $X^\top A = 0_{p \times q}$, $X^\top U = 0_{p \times d}$, and $Z^\top V = 0_{q \times d}$.

The first conditions ensure that $X^\top X$ and $Z^\top Z$ are non-singular, which is a standard requirement for regression models. Conditions $X^\top A = 0_{p \times q}$ and $X^\top U = 0_{p \times d}$ prevent A and U from spanning the same column space as X, and similarly $Z^\top V = 0_{q \times d}$ prevents V from spanning the same column space as Z (see, eg Liu and Zhong 2024).

Then, we must ensure the identifiability of U and V with respect to rotation, scaling, and sign-flip. To this end, some of the most common choices in the literature involve the following equivalent parameterizations:

  • (B1)

    U has orthogonal columns, $U^\top U = D_d^2$, V has orthonormal columns, $V^\top V = I_d$, and the first non-zero element of each column of V is positive;

  • (B2)

    U has orthonormal columns, $U^\top U = I_d$, V has orthogonal columns, $V^\top V = D_d^2$, and the first non-zero element of each column of U is positive;

  • (B3)

    U has standardized columns, say $n^{-1} U^\top U = I_d$, and V is lower triangular with positive diagonal entries.

Here, $D_d$ denotes the diagonal matrix collecting all the non-zero singular values of $U V^\top$ in decreasing order, $I_d$ denotes the $d \times d$ identity matrix, $0_d$ is the $d$-dimensional vector of zeros, and $1_n$ is the $n$-dimensional vector of ones. In Appendix S1 in the Supplementary Material, we prove that constraints (A), together with one of (B1), (B2), or (B3), are sufficient to ensure the identifiability of the model parameters. Notice that any unconstrained estimate of B, A, U and V can be easily projected into the constrained space induced by the identifiability restrictions via post-processing; see Appendix S1 in the Supplementary Material, along with, eg Kidziński et al. (2022), Wang and Carvalho (2023), and Liu and Zhong (2024). The choice of the parametrization typically depends on the specific application and the desired interpretation of the score and loading matrices. (B1) is the standard parametrization in principal component analysis, (B2) is the usual parametrization in spectral analysis of digital signals, and (B3) is the most common parametrization in the factor model literature. In our numerical experiments (Sections 5 and 6), we use parametrization (B1), which is conventional in the RNA-seq literature; see, eg Risso et al. (2018), Townes et al. (2019), and Ahlmann-Eltze and Huber (2023).
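As an illustration of the post-processing step, the following R sketch maps an unconstrained pair (U, V) to parametrization (B1) through a single SVD of the product $U V^\top$; the orthogonalization against the covariates required by (A) is omitted, and for large matrices one would work with QR factors of U and V rather than the full product.

```r
# Project an unconstrained (U, V) onto parametrization (B1): V gains
# orthonormal columns, U orthogonal columns with decreasing norms,
# and the product U V' is left unchanged.
project_b1 <- function(U, V, tol = 1e-12) {
  s <- svd(U %*% t(V))                        # U V' = L diag(sigma) R'
  r <- sum(s$d > tol)                         # numerical rank of the product
  Unew <- s$u[, 1:r, drop = FALSE] %*% diag(s$d[1:r], r)
  Vnew <- s$v[, 1:r, drop = FALSE]
  # sign-flip so the first non-zero entry of each column of V is positive
  sgn <- apply(Vnew, 2, function(v) sign(v[which(abs(v) > tol)[1]]))
  list(U = sweep(Unew, 2, sgn, "*"), V = sweep(Vnew, 2, sgn, "*"))
}
```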

2.2. Penalized likelihood estimation

In the statistical literature, the variables $u_i$ are called latent factors and, typically, are assumed to follow independent standard $d$-variate Gaussian distributions. This representation provides a complete specification of the probabilistic mechanism that generated the samples $y_i$, which are conditionally independent given the $u_i$. The marginal log-likelihood function induced by such a latent variable representation is given by

$$\ell(B, A, V) = \sum_{i=1}^{n} \log \int_{\mathbb{R}^d} \bigg\{ \prod_{j=1}^{m} f(y_{ij} \mid u_i) \bigg\}\, p(u_i)\, \mathrm{d}u_i, \qquad (2.4)$$

where $f(y_{ij} \mid u_i)$ is the conditional distribution of $y_{ij}$ given $u_i$, while $p(u_i)$ is the marginal probability density function of $u_i$.

The unknown non-stochastic parameters can be estimated via maximum likelihood by optimizing (2.4). To this end, many numerical approaches have been proposed in the literature, such as Laplace approximation (Huber et al. 2004; Bianconcini and Cagnone 2012), adaptive quadrature (Cagnone and Monari 2013), expectation-maximization (Sammel et al. 1997; Cappé and Moulines 2009), and variational approximation (Hui et al. 2017). In practice, all these strategies perform very well in terms of accuracy but are extremely computationally expensive and do not scale well to high-dimensional problems.

An alternative approach is to treat the latent factors as if they were non-stochastic parameters and estimate them together with the other unknown coefficients. Kidziński et al. (2022) motivated this approach as a form of penalized quasi-likelihood (PQL, Breslow and Clayton 1993), which is a standard tool in the estimation of generalized linear mixed models (GLMM, Lee et al. 2017). Formally, the PQL estimate of $\pi$, say $\hat\pi$, is the solution of

$$\hat\pi = \operatorname*{arg\,min}_{\pi \in \Pi} \; \ell_\lambda(\pi), \qquad (2.5)$$

with $\Pi$ being the parameter space of $\pi$, which incorporates the identifiability constraints, and $\ell_\lambda(\pi)$ denoting the penalized negative log-likelihood function. The latter is given by

$$\ell_\lambda(\pi) = - \sum_{i=1}^{n} \sum_{j=1}^{m} \log f(y_{ij}; \eta_{ij}, \phi) + \frac{\lambda}{2} \left( \|U\|_F^2 + \|V\|_F^2 \right), \qquad (2.6)$$

where $\lambda \geq 0$ is a regularization parameter and $\|\cdot\|_F$ denotes the Frobenius norm. Frobenius penalization is often introduced for numerical stability issues, but it is also intrinsically connected to the rank determination problem. Indeed, for any $\lambda > 0$, the Frobenius penalty implicitly shrinks the singular values of the matrix factorization toward zero, encouraging compact low-rank representations of the signal. This spectral penalization effect and its connection with the nuclear norm have been extensively discussed in, eg Witten et al. (2009), Mazumder et al. (2010), and Kidziński et al. (2022).

In a complete data scenario, the number of observed data entries in the response matrix is $nm$ and the total number of unknown parameters to be estimated is $mp + nq + (n + m)d$. In partially observed data cases, the sum over $i$ and $j$ in (2.6) can easily be replaced with a sum over $(i,j) \in \Omega$, where $\Omega$ is the set collecting the index-positions of all the observed entries of the response matrix.
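For the Poisson family with log link and no covariates, the penalized objective (2.6) takes only a few lines of R; the sketch below uses the deviance-based form, which is equivalent to the negative log-likelihood up to an additive constant, and reuses poisson_deviance from the snippet in Section 2.

```r
# Penalized objective for a Poisson GMF with log link and no covariates:
# half the total deviance plus the Frobenius penalty on U and V.
penalized_objective <- function(Y, U, V, lambda = 1e-4) {
  mu <- exp(U %*% t(V))
  0.5 * sum(poisson_deviance(Y, mu)) + 0.5 * lambda * (sum(U^2) + sum(V^2))
}
```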

The optimization problem (2.5) is not jointly convex in U and V. However, the objective function (2.6) is bi-convex, namely it is conditionally convex in U given V, and vice versa. This characteristic naturally encourages the development of iterative methods, which cycle over the alternated updates of U and V until convergence to a local stationary point. Similar strategies are commonly used in matrix completion and recommendation systems for the estimation of high-dimensional matrix factorization models; see, eg Zou et al. (2006), Koren et al. (2009), and Mazumder et al. (2010).

2.3. Related models

The GMF specified in (2.1) to (2.3) has strict connections and similarities to several models in the literature. It extends vector generalized linear models (VGLM, Yee 2015), and hence also univariate generalized linear models (McCullagh and Nelder 1989), by introducing a second regression effect, $\alpha_i^\top z_j$, and a latent matrix factorization, $u_i^\top v_j$, in the linear predictor, to account for additional modes of variation and residual dependence structures.

Also, GMF directly generalizes principal component analysis (PCA), which, by definition, is the solution of the minimization problem

$$(\hat U, \hat V) = \operatorname*{arg\,min}_{U \in \mathbb{R}^{n \times d},\, V \in \mathbb{R}^{m \times d}} \; \| Y - U V^\top \|_F^2.$$

Then, in the GMF notation, PCA can be obtained by assuming a Gaussian distribution for the data matrix, an identity link function for the mean, and no regression effects in the linear predictor, namely $B = 0$ and $A = 0$. Generalizations of PCA, such as Binary PCA (Schein et al. 2003; Lee et al. 2010; Landgraf 2015; Song et al. 2019) and Poisson PCA (Kenney et al. 2021; Virta and Artemiou 2023), are also included in the GMF framework.

Close connections can also be drawn with non-negative matrix factorization (NMF; Wang and Zhang 2013), which, in its more common formulations, searches for the best low-rank approximation $U V^\top$ of the data matrix Y by minimizing either the squared error loss or the Poisson deviance under non-negativity constraints for U and V. Formally, the NMF solution is defined as

$$(\hat U, \hat V) = \operatorname*{arg\,min}_{U \geq 0,\, V \geq 0} \; D(Y; U V^\top),$$

with $D(\cdot;\cdot)$ being either the squared error loss, ie the Gaussian deviance, or the Kullback–Leibler loss, ie the Poisson deviance. This representation clarifies that NMF can be written as a particular instance of GMF for non-negative data, where a Gaussian/Poisson likelihood is used together with an identity link and non-negativity constraints. Extensions of basic NMF have also been proposed to introduce external information through the inclusion of covariates; see, eg collective matrix factorization and content-aware recommendation systems (Singh and Gordon 2008; Cortes 2018).

3. Estimation algorithm

For the sake of exposition, throughout this section, we assume without loss of generality that $B = 0$ and $A = 0$. Moreover, we consider $\phi$ as a known fixed parameter, noting that, if unknown, it can be estimated iteration-by-iteration using either a method of moments or maximum likelihood. More details about the general case $B \neq 0$ and $A \neq 0$ and the estimation of $\phi$ are provided in Appendix S2 in the Supplementary Material. Finally, we introduce the matrices $\partial D = \{\partial d_{ij}\}$ and $\partial^2 D = \{\partial^2 d_{ij}\}$ for the derivatives of the deviance with respect to the linear predictor, where

$$\partial d_{ij} = \frac{\partial}{\partial \eta_{ij}}\, D(y_{ij}, \mu_{ij}) = -\,2\, \frac{y_{ij} - \mu_{ij}}{\phi\, \nu(\mu_{ij})\, g'(\mu_{ij})}$$

and $\partial^2 d_{ij} = 2 / \{\phi\, \nu(\mu_{ij})\, g'(\mu_{ij})^2\}$. Accordingly, we define $\eta_{ij} = u_i^\top v_j$, $\mu_{ij} = g^{-1}(\eta_{ij})$ and $D_{ij} = D(y_{ij}, \mu_{ij})$. The expected second-order derivative, ie the Fisher weight $\partial^2 d_{ij}$, just corresponds to the observed second-order differential under the canonical link $g = (b')^{-1}$, and it is positive for any $y_{ij}$ and $\mu_{ij}$.
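For the Poisson family with canonical log link (and $\phi = 1$), these derivatives have a particularly simple closed form, as the following R sketch shows.

```r
# Element-wise deviance derivatives for the Poisson family with log link:
# dD/deta = -2 * (y - mu) and the Fisher weight E[d2D/deta2] = 2 * mu.
deviance_derivatives <- function(Y, eta) {
  mu <- exp(eta)
  list(d1 = -2 * (Y - mu),  # matrix of first-order differentials
       d2 = 2 * mu)         # matrix of (positive) Fisher weights
}
```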

The penalized estimate $\hat\pi$ in (2.5), say the vectorized concatenation of $\hat u = \mathrm{vec}(\hat U)$ and $\hat v = \mathrm{vec}(\hat V)$, must satisfy the first-order matrix conditions

$$\frac{\partial \ell_\lambda}{\partial U} = \tfrac{1}{2}\, \partial D\, V + \lambda\, U = 0_{n \times d}, \qquad \frac{\partial \ell_\lambda}{\partial V} = \tfrac{1}{2}\, \partial D^\top U + \lambda\, V = 0_{m \times d}, \qquad (3.7)$$

where the differentiation is performed element-wise. Using the same formulation, we may also express the second-order derivatives of (2.6) as

$$\frac{\partial^2 \ell_\lambda}{\partial U\, \partial U} = \tfrac{1}{2}\, \partial^2 D\, (V \odot V) + \lambda\, 1_{n \times d}, \qquad \frac{\partial^2 \ell_\lambda}{\partial V\, \partial V} = \tfrac{1}{2}\, \partial^2 D^\top (U \odot U) + \lambda\, 1_{m \times d}, \qquad (3.8)$$

where $\odot$ is the Hadamard product and $1_{n \times d}$ is a matrix of appropriate dimensions filled by ones.

Conditionally on V, the left matrix equation in (3.7) can be decomposed into $n$ multivariate equations row-by-row, $\tfrac{1}{2} \sum_{j=1}^{m} \partial d_{ij}\, v_j + \lambda\, u_i = 0_d$ ($i = 1, \dots, n$), that can be solved independently in parallel. The existence and uniqueness of the solution of each row-equation are guaranteed under mild regularity conditions on the exponential family and the link function. In the same way, the right matrix equation in (3.7) can be split into $m$ independent vector equations, $\tfrac{1}{2} \sum_{i=1}^{n} \partial d_{ij}\, u_i + \lambda\, v_j = 0_d$ ($j = 1, \dots, m$), to be solved in parallel. See, eg Kidziński et al. (2022) and Wang and Carvalho (2023) for a detailed discussion and derivation of (3.7) and (3.8). Notice that, in the presence of covariate effects, the derivatives in (3.7) must be replaced by the corresponding derivatives with respect to $(B, U)$ and $(A, V)$, as detailed in Appendix S2 in the Supplementary Material.

3.1. Fisher scoring and quasi-Newton algorithms

The first, and most popular, algorithm introduced in the literature for finding the solution of (3.7) is the alternated iterative re-weighted least squares (AIRWLS, Collins et al. 2001; Li and Tao 2010; Risso et al. 2018; Kidziński et al. 2022; Liu et al. 2023; Wang and Carvalho 2023) method. It cycles between the conditional updates of U and V by solving the equations in (3.7) in a row-wise manner, using standard Fisher scoring for GLMs (McCullagh and Nelder 1989). The resulting routine is statistically motivated, easy to implement and allows for efficient parallel computing.

However, in massive data settings, it becomes infeasible as the dimension of the problem increases with the sample size or the latent space rank. One iteration of the algorithm, ie a complete update of U and V, requires $O(nmd^2 + (n+m)d^3)$ floating point operations, where the leading term proportional to $d^3$ comes from a matrix inversion that must be computed at least $n + m$ times per iteration. This is particularly limiting in real-data applications, since $d$ is unknown a priori and must be selected in a data-driven way, which might require fitting the model several times for increasing values of $d$, a cost that scales cubically in $d$.

To overcome this issue, Kidziński et al. (2022) proposed a quasi-Newton algorithm which employs an approximate inversion using only the diagonal elements of the Fisher information matrix. With this simplification, only elementary matrix operations are performed, reducing the computational complexity to $O(nmd)$. In formulas, the quasi-Newton algorithm of Kidziński et al. (2022) updates the parameter estimates at iteration $t$ as

$$\pi_{t+1} \leftarrow \pi_t - \rho_t\, \Delta_t, \qquad \Delta_t = \frac{\nabla \ell_\lambda(\pi_t)}{\nabla^2 \ell_\lambda(\pi_t)}, \qquad (3.9)$$

where $\rho_t$ is a sequence of learning rate parameters, $\Delta_t$ is the search direction, while $\nabla \ell_\lambda(\pi_t)$ and $\nabla^2 \ell_\lambda(\pi_t)$ denote respectively the first two derivatives of $\ell_\lambda$ with respect to $\pi$ evaluated at $\pi_t$. Throughout, $\leftarrow$ stands for the assignment operator and the division is performed element-wise. Exploiting the block structure of $\pi$, the joint update (3.9) can be written in the coordinate-wise form

$$U_{t+1} \leftarrow U_t - \rho_t\, \frac{\nabla_U \ell_\lambda(\pi_t)}{\nabla^2_U \ell_\lambda(\pi_t)}, \qquad V_{t+1} \leftarrow V_t - \rho_t\, \frac{\nabla_V \ell_\lambda(\pi_t)}{\nabla^2_V \ell_\lambda(\pi_t)}, \qquad (3.10)$$

where $\nabla_U \ell_\lambda$, $\nabla^2_U \ell_\lambda$, $\nabla_V \ell_\lambda$ and $\nabla^2_V \ell_\lambda$ can be obtained as in (3.7) and (3.8) (Kidziński et al. 2022). Overall, each quasi-Newton iteration requires $O(nmd)$ floating point operations and $O(nm)$ memory allocations. To the best of our knowledge, this is the most efficient algorithm in the literature for the estimation of GMF models.
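A full quasi-Newton iteration (3.10) can thus be coded with element-wise operations and a few matrix products, as in this R sketch for the Poisson case; deviance_derivatives is the helper defined above, and rate and lambda are illustrative tuning constants.

```r
# One global quasi-Newton step (3.10) for a Poisson GMF with log link.
quasi_newton_step <- function(Y, U, V, lambda = 1e-4, rate = 0.9) {
  dd <- deviance_derivatives(Y, U %*% t(V))
  gU <- 0.5 * dd$d1 %*% V + lambda * U        # gradient with respect to U
  hU <- 0.5 * dd$d2 %*% (V * V) + lambda      # diagonal Hessian with respect to U
  gV <- 0.5 * t(dd$d1) %*% U + lambda * V     # gradient with respect to V
  hV <- 0.5 * t(dd$d2) %*% (U * U) + lambda   # diagonal Hessian with respect to V
  list(U = U - rate * gU / hU,                # element-wise scaled updates
       V = V - rate * gV / hV)
}
```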

In what follows, we build upon the quasi-Newton algorithm of Kidziński et al. (2022) to derive an efficient stochastic optimization method that, for our purposes, should further improve the scalability of GMF modeling in high-dimensional settings.

3.2. Stochastic gradient descent

Stochastic gradient descent (SGD, Bottou 2010) provides an easy and effective strategy to handle complex optimization problems in massive data applications. Similarly to deterministic gradient-based methods, SGD is an iterative optimization procedure which updates the parameter vector $\pi_t$ until convergence following the approximate steepest descent direction, say $\hat\nabla \ell_\lambda(\pi_t)$. Here, the hat-notation, $\hat\nabla$, stands for an unbiased stochastic estimate of $\nabla$. Under mild regularity conditions on the optimization problem and the learning rate sequence, specifically $\sum_{t=1}^{\infty} \rho_t = \infty$ and $\sum_{t=1}^{\infty} \rho_t^2 < \infty$, SGD is guaranteed to converge to a stationary point of the objective function (Robbins and Monro 1951). A standard choice for the learning rate sequence is $\rho_t = \delta\,(1 + \kappa t)^{-\epsilon}$ for $\kappa > 0$ and $\epsilon \in (1/2, 1]$, where $\delta > 0$ is the initial stepsize, $\kappa$ is a decay rate parameter, and $\epsilon$ determines the asymptotic tail behavior of the sequence for $t \to \infty$.

3.2.1. Improving naïve stochastic gradient

In the past two decades, an active area of research has built upon naïve SGD to improve its convergence speed, robustify the search path against erratic perturbations in the gradient estimate, and introduce locally adaptive learning rate schedules using approximate second-order information. Some examples are Nesterov acceleration (Nesterov 1983), SGD-QN (Bordes et al. 2009), AdaGrad (Duchi et al. 2011), AdaDelta (Zeiler 2012), RMSProp (Hinton et al. 2012), Adam (Kingma and Ba 2014), and AMSGrad (Reddi et al. 2019). Inspired by this line of literature, we propose the following adaptive stochastic gradient descent (aSGD) updating rule

$$\pi_{t+1} \leftarrow \pi_t - \rho_t\, c_t\, \frac{\hat G_t}{\hat H_t}, \qquad (3.11)$$

where $\hat G_t$ and $\hat H_t$ are smoothed estimates of the first and second derivatives of (2.6), while $c_t$ is a scalar bias-correction factor. Similar to Adam (Kingma and Ba 2014) and AMSGrad (Reddi et al. 2019), we update the smoothed gradients, $\hat G_t$ and $\hat H_t$, using an exponential moving average of the previous gradient values and the current stochastic estimates $\tilde G_t$ and $\tilde H_t$, namely

$$\hat G_t \leftarrow \gamma_1\, \hat G_{t-1} + (1 - \gamma_1)\, \tilde G_t, \qquad \hat H_t \leftarrow \gamma_2\, \hat H_{t-1} + (1 - \gamma_2)\, \tilde H_t, \qquad (3.12)$$

with $\gamma_1, \gamma_2 \in (0, 1)$ being user-specified smoothing coefficients, while $c_t$ is introduced to filter out the bias induced by the exponential moving average in (3.12). Typical values for the smoothing coefficients are $\gamma_1 = 0.9$ and $\gamma_2 = 0.999$, respectively. Such a smoothing technique has two main advantages: it speeds up the convergence using the inertia accumulated by previous gradients and, at the same time, it stabilizes the optimization, reducing the noise around the gradient estimate. We refer the reader to Kingma and Ba (2014) and Reddi et al. (2019) for a deeper discussion on the benefits of bias-corrected exponential gradient averaging in high-dimensional stochastic optimization problems.
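In code, the smoothing step (3.12) is a stateful one-liner per moment; the sketch below also returns an Adam-style bias-correction factor, which is an assumed form, since the exact correction used by the method is detailed in the paper's supplement.

```r
# One exponential-averaging step for the gradient and Hessian estimates.
# G, H: running averages; Gt, Ht: current minibatch estimates; t: iteration.
smooth_gradients <- function(G, H, Gt, Ht, t, g1 = 0.9, g2 = 0.999) {
  G <- g1 * G + (1 - g1) * Gt
  H <- g2 * H + (1 - g2) * Ht
  ct <- (1 - g2^t) / (1 - g1^t)   # assumed Adam-style bias correction
  list(G = G, H = H, ct = ct)
}
```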

Differently from the original implementation of Adam (Kingma and Ba 2014), which only involves first-order information, we scale the smoothed gradient estimate using a smoothed diagonal Hessian approximation. This strategy allows us to directly extend the diagonal quasi-Newton algorithm of Kidziński et al. (2022) in a stochastic vein without increasing the computational overload or compromising the stability of the optimization. Indeed, the derivatives of the deviance function of a GLM are always well-defined, bounded away from zero, available in closed form, and easy to compute. This is not the case for deep neural networks and other machine learning models for which Adam and AMSGrad were originally developed.

Analogously to the quasi-Newton update (3.10) for GMF models, also (3.11) can be written in the block-wise form

$$U_{t+1} \leftarrow U_t - \rho_t\, c_t\, \frac{\hat G^{U}_t}{\hat H^{U}_t}, \qquad V_{t+1} \leftarrow V_t - \rho_t\, c_t\, \frac{\hat G^{V}_t}{\hat H^{V}_t}. \qquad (3.13)$$

In this formulation, updates (3.10) and (3.13) have a computational complexity proportional to the dimension of the matrices U and V, namely $O\{(n + m) d\}$. Since the dimension of the parameter space cannot be further compressed without losing prediction power, the alternative way to speed up the computations is to efficiently approximate the gradients, $\nabla_U \ell_\lambda$ and $\nabla_V \ell_\lambda$.

Exploiting the additive structure of the penalized log-likelihood in (2.6), a natural unbiased estimate of its derivatives can be obtained via sub-sampling. In particular, we can estimate the log-likelihood gradient by summing up only a small fraction of the data contributions, which we refer to as a minibatch, instead of the whole dataset. This strategy is highly scalable and can be calibrated on the available computational resources.

In what follows, we discuss a new stochastic optimization method for GMF models based on formula (3.13), which uses local parameter updates along with minibatch estimates of the current gradients to reduce the computational complexity of the resulting algorithm. To this end, we introduce the following notation: $\mathcal{I}_t \subseteq \{1, \dots, n\}$ is a subset of row-indices of dimension $n_t$, $\mathcal{J}_t \subseteq \{1, \dots, m\}$ is a subset of column-indices of dimension $m_t$, and $\mathcal{B}_t = \mathcal{I}_t \times \mathcal{J}_t$ is the Cartesian product between $\mathcal{I}_t$ and $\mathcal{J}_t$. Finally, $Y_{\mathcal{I}_t,:}$, $Y_{:,\mathcal{J}_t}$ and $Y_{\mathcal{I}_t \mathcal{J}_t}$ denote the corresponding sub-matrices, also called row-, column- and block-minibatch subsamples of the original data matrix, respectively.

3.2.2. Block-wise adaptive stochastic gradient descent

Both the exact and stochastic quasi-Newton methods identified by equations (3.10) and (3.13) entirely update U and V at each iteration. Despite being an effective and parallelizable strategy, in many applicative contexts, when it comes to factorizing huge matrices, an entire update of the parameters could be extremely expensive in terms of memory allocation and execution time. This is a well-understood issue in the literature on recommendation systems, where standard matrix completion problems may involve matrices with millions of rows and columns; see, eg Koren et al. (2009), Mairal et al. (2010), Recht and Ré (2013), Mensch et al. (2018). Moreover, batch optimization strategies do not generalize well to stream data contexts, where the data arrive sequentially and the parameters must be updated on-the-fly as new sets of observations come in.

A classic solution is to perform iterative element-wise SGD steps using only one entry of the data matrix, $y_{ij}$, at each iteration, thus updating the low-rank decomposition matrices row-by-row through the paired equations $u_i \leftarrow u_i - \rho_t\, \hat g^{u}_{ij}$ and $v_j \leftarrow v_j - \rho_t\, \hat g^{v}_{ij}$, where $\hat g^{u}_{ij}$ and $\hat g^{v}_{ij}$ are stochastic gradient estimates based on the single entry $y_{ij}$. In the matrix factorization literature, this strategy is also known as the online SGD algorithm, since it permits updating the parameter estimates dynamically when a new observation is gathered.

To stabilize the optimization and speed up the convergence, we generalize the online SGD approach in two directions: we consider local stochastic quasi-Newton updates in place of naïve gradient steps, and we use block-wise minibatches of the original data matrix, $Y_{\mathcal{I}_t \mathcal{J}_t}$, instead of singletons, $y_{ij}$. In formulas, the algorithm we propose cycles over the following adaptive gradient steps:

$$u_i \leftarrow u_i - \rho_t\, c_t\, \frac{\hat g^{u}_{i,t}}{\hat h^{u}_{i,t}} \quad (i \in \mathcal{I}_t), \qquad v_j \leftarrow v_j - \rho_t\, c_t\, \frac{\hat g^{v}_{j,t}}{\hat h^{v}_{j,t}} \quad (j \in \mathcal{J}_t). \qquad (3.14)$$

The smoothed gradients are then estimated by exponential average as in (3.12) and the minibatch stochastic gradients are obtained as

$$\tilde g^{u}_{i,t} = \frac{m}{2\, m_t} \sum_{j \in \mathcal{J}_t} \partial d_{ij}\, v_j + \lambda\, u_i, \qquad \tilde h^{u}_{i,t} = \frac{m}{2\, m_t} \sum_{j \in \mathcal{J}_t} \partial^2 d_{ij}\, (v_j \odot v_j) + \lambda\, 1_d,$$

$$\tilde g^{v}_{j,t} = \frac{n}{2\, n_t} \sum_{i \in \mathcal{I}_t} \partial d_{ij}\, u_i + \lambda\, v_j, \qquad \tilde h^{v}_{j,t} = \frac{n}{2\, n_t} \sum_{i \in \mathcal{I}_t} \partial^2 d_{ij}\, (u_i \odot u_i) + \lambda\, 1_d. \qquad (3.15)$$

Here, $n_t$ and $m_t$ denote the number of rows and columns of each minibatch and $\tilde g^{u}_{i,t}$ is an unbiased stochastic estimate of $\nabla_{u_i} \ell_\lambda$ for $i \in \mathcal{I}_t$. Similarly, it is easy to show that the other minibatch averages in (3.15) are unbiased estimates of the corresponding batch quantities. Figure 1 provides a graphical representation of the updates, while Algorithm 1 provides a pseudo-code description of the proposed procedure. Overall, each iteration of Algorithm 1 requires $O(n_t m_t d)$ floating point operations and $O(n_t m_t)$ memory allocations.

Figure 1.

Schematic example of the generalized matrix factorization model and of how the gradient and parameter updates work.

Graphical representation of the stochastic gradient updates employed at the $t$th iteration of Algorithm 1. Left: generalized matrix factorization model (2.3). Middle: updates of the penalized log-likelihood gradients (3.15). Right: adaptive gradient step (3.14). The colored and empty cells highlight the sub-sampled data used and not used at the $t$th update, respectively. To save space, the calculations of the second-order differentials and the gradient smoothing are not displayed here.

Algorithm 1.

Pseudo-code description of the block-wise adaptive SGD algorithm described in Section 3.2.2. On the right, we report the computational complexity of each step.

Initialize $U$, $V$, $\hat G_0$, $\hat H_0$, and set $t \leftarrow 0$;

Sample a random partition $\{\mathcal{I}_1, \dots, \mathcal{I}_K\}$ of $\{1, \dots, n\}$ such that $|\mathcal{I}_k| = n_t$;

Sample a random partition $\{\mathcal{J}_1, \dots, \mathcal{J}_L\}$ of $\{1, \dots, m\}$ such that $|\mathcal{J}_l| = m_t$;

while convergence is not reached do for $t = 1, 2, \dots$ do

  1. Sample the minibatch $(\mathcal{I}_t, \mathcal{J}_t)$ and set $\eta_{ij} = u_i^\top v_j$ for $(i,j) \in \mathcal{B}_t$;  $O(n_t m_t d)$

  2. Compute the subsampled likelihood derivatives

     $\partial d_{ij}$ and $\partial^2 d_{ij}$ for $(i,j) \in \mathcal{B}_t$;  $O(n_t m_t)$

     $\tilde g^{v}_{j,t}$ and $\tilde h^{v}_{j,t}$ for $j \in \mathcal{J}_t$ as in (3.15);  $O(n_t m_t d)$

     $\tilde g^{u}_{i,t}$ and $\tilde h^{u}_{i,t}$ for $i \in \mathcal{I}_t$ as in (3.15);  $O(n_t m_t d)$

  3. Compute the smoothed gradients and update V

     $\hat g^{v}_{j,t} \leftarrow \gamma_1\, \hat g^{v}_{j,t-1} + (1 - \gamma_1)\, \tilde g^{v}_{j,t}$;  $O(m_t d)$

     $\hat h^{v}_{j,t} \leftarrow \gamma_2\, \hat h^{v}_{j,t-1} + (1 - \gamma_2)\, \tilde h^{v}_{j,t}$;  $O(m_t d)$

     $v_j \leftarrow v_j - \rho_t\, c_t\, \hat g^{v}_{j,t} / \hat h^{v}_{j,t}$ for $j \in \mathcal{J}_t$;  $O(m_t d)$

  4. Compute the smoothed gradients and update U

     $\hat g^{u}_{i,t} \leftarrow \gamma_1\, \hat g^{u}_{i,t-1} + (1 - \gamma_1)\, \tilde g^{u}_{i,t}$;  $O(n_t d)$

     $\hat h^{u}_{i,t} \leftarrow \gamma_2\, \hat h^{u}_{i,t-1} + (1 - \gamma_2)\, \tilde h^{u}_{i,t}$;  $O(n_t d)$

     $u_i \leftarrow u_i - \rho_t\, c_t\, \hat g^{u}_{i,t} / \hat h^{u}_{i,t}$ for $i \in \mathcal{I}_t$;  $O(n_t d)$

end for

end while

5. Orthogonalize $\hat U$ and $\hat V$;

If the dispersion parameter $\phi$ is unknown and has to be estimated from the data, we can also adopt a smoothed stochastic estimator obtained through exponential averaging. More details are provided in Appendix S2 in the Supplementary Material.

Whenever the input data matrix is only partially complete, it is necessary to properly handle the missing values during the estimation process. To this end, we rely on the general framework proposed by Cai et al. (2010) and Mazumder et al. (2010), and later used by Kidziński et al. (2022). This prescribes imputing the missing data entries during the optimization by updating them at each iteration using the most recent prediction of those values. Algorithm 1 can thus be adapted by replacing the static (incomplete) minibatch matrix $Y_{\mathcal{I}_t \mathcal{J}_t}$ with its completed version $\tilde Y_{\mathcal{I}_t \mathcal{J}_t}$ obtained at iteration $t$ after imputation. At the beginning of each iteration, we then introduce the imputation step $\tilde y_{ij} = \hat\mu_{ij}$ if $(i,j) \in \Omega^{\mathrm{c}}$, where $\Omega^{\mathrm{c}}$ is the complement of $\Omega$.
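In code, the imputation step amounts to overwriting the unobserved entries of the current block with their fitted means before computing the derivatives; a minimal sketch, matching the Poisson example above, is given below.

```r
# Complete a minibatch block by imputing NA entries with current predictions.
impute_block <- function(Y, U, V, I, J) {
  Yb <- Y[I, J, drop = FALSE]
  mu <- exp(U[I, , drop = FALSE] %*% t(V[J, , drop = FALSE]))
  Yb[is.na(Yb)] <- mu[is.na(Yb)]   # replace missing entries with fitted means
  Yb
}
```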

We refer to (3.14) as local, or partial, updates since they just modify the rows of U and V corresponding to the minibatch block of indices $\mathcal{B}_t = \mathcal{I}_t \times \mathcal{J}_t$. On the contrary, we refer to updates (3.10) and (3.13) as global updating rules. The choice between alternated least squares (Kidziński et al. 2022; Wang and Carvalho 2023), exact quasi-Newton (Kidziński et al. 2022) and the proposed adaptive stochastic gradient descent (Algorithm 1, Fig. 1) methods is up to the researcher and, in principle, depends on the dimension of the problem under study and on the available computational resources. In general, algorithms using a higher amount of information are more stable and accurate; however, they tend to scale poorly in high-dimensional settings and often get stuck in suboptimal stationary points. On the other hand, cheaper stochastic methods using less information scale well in big-data problems at the cost of a lower level of precision.

4. Additional computational aspects

4.1. Parameter initialization

As is common in non-convex optimization, the performance of Algorithm 1 heavily depends on its initialization. Random starting values can slow convergence and increase the risk of getting stuck in poor local minima or unstable saddle points. To improve both performance and accuracy, we adopt a structured initialization strategy that leverages the conditional GLM formulation of model (2.3), and is inspired by the initialization approaches employed in Risso et al. (2018), Townes et al. (2019), and Kidziński et al. (2022).

We first estimate the column-specific regression parameters, the $\beta_j$, by fitting $m$ separate GLMs. Then, conditionally on the estimated offsets $x_i^\top \hat\beta_j$, we estimate the row-specific parameters, the $\alpha_i$, using the same strategy and fitting $n$ separate GLMs. The resulting estimates, $\hat B$ and $\hat A$, initialize the regression effects. To initialize the latent scores, the $u_i$, we adopt the null-residual method of Townes et al. (2019), extracting the first $d$ left eigenvectors of a residual matrix based on either deviance or Pearson residuals: $r^{\mathrm{d}}_{ij} = \mathrm{sign}(y_{ij} - \hat\mu_{ij})\, \sqrt{D(y_{ij}, \hat\mu_{ij})}$ and $r^{\mathrm{p}}_{ij} = (y_{ij} - \hat\mu_{ij}) / \sqrt{\nu(\hat\mu_{ij})}$, where the mean $\mu_{ij}$ is approximated as $\hat\mu_{ij} = g^{-1}(x_i^\top \hat\beta_j + \hat\alpha_i^\top z_j)$. Finally, we estimate the loadings, the $v_j$, by fitting $m$ column-specific GLMs with the latent scores as fixed design matrix and an offset term $x_i^\top \hat\beta_j + \hat\alpha_i^\top z_j$ in the predictor. Using standard solvers for GLMs and singular value decomposition, the overall computational cost is

$$O(n m p^2 + n m q^2 + n m d).$$

All GLM-fitting steps are highly parallelizable, with per-task complexity at most $O(k c^2)$, where $k = \max(n, m)$ and $c = \max(p, q, d)$. In high-dimensional settings, one may approximate the GLM fits via ordinary least squares on a link-transformed response, reducing the complexity to

$$O(n m p + n m q + n m d),$$

since the least-squares projection matrices can be computed once and reused across all the responses.

Although inspired by the null-residual strategy of Townes et al. (2019), our initialization differs in four key aspects: (i) it serves solely for initialization, not final estimation; (ii) it generalizes to any exponential family, not just count data; (iii) it accounts for covariate-dependent sampling; (iv) it produces loadings satisfying approximate estimating equations, unlike the original purely spectral null-residual approach.
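For intercept-only designs ($x_i = z_j = 1$) and a Poisson model, the whole initialization reduces to closed-form GLM fits followed by a truncated SVD of the deviance residuals; the following self-contained R sketch reuses poisson_deviance from Section 2 and illustrates the strategy, without claiming to match the package's exact implementation.

```r
# Intercept-only initialization for a Poisson GMF (a sketch of Section 4.1).
init_gmf_poisson <- function(Y, d = 5, eps = 1e-8) {
  beta0  <- log(colMeans(Y) + eps)                   # gene intercepts: column-wise MLE
  alpha0 <- log(rowSums(Y) + eps) - log(sum(exp(beta0)))  # cell effects given offsets
  mu <- exp(outer(alpha0, beta0, "+"))               # fitted means, regression-only model
  R  <- sign(Y - mu) * sqrt(poisson_deviance(Y, mu)) # deviance residual matrix
  s  <- svd(R, nu = d, nv = d)                       # truncated spectral decomposition
  list(alpha0 = alpha0, beta0 = beta0,
       U = s$u %*% diag(s$d[1:d], d),                # initial latent scores
       V = s$v)                                      # initial loadings
}
```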

4.2. Model selection

In the GMF formulation detailed in Section 2, the model complexity is mainly controlled by the rank of the matrix factorization, $d$. The optimal selection of the factorization rank needs a careful balance between model complexity and goodness of fit to avoid both under- and over-fitting issues. In the matrix factorization and factor model literature, a popular class of rank selection methods is the so-called spectral thresholding, or elbow, approach. This consists of analyzing the singular values, in descending order, of a sufficiently high-rank matrix decomposition to detect a significant decrease in the rate of change of the singular values, which suggests the optimal number of factors to retain. Thanks to its simplicity, computational efficiency, and the significant amount of empirical and theoretical support, spectral thresholding methods gained wide popularity for rank selection problems. See, eg the work of Onatski (2010), Fan et al. (2022), Wang and Carvalho (2023), Liu et al. (2023), Nie et al. (2024), and Liu and Zhong (2024). This class of selection criteria favors a compact low-rank representation of the signal and is often used when it is of interest to identify the principal modes of variation in the data for interpretation purposes.

Another popular rank determination approach proposed in the literature leverages out-of-sample error minimization (Mazumder et al. 2010; Kidziński et al. 2022). An estimate of the out-of-sample error for matrix factorization problems can be obtained either from information-based metrics, such as the Akaike (AIC) or Bayesian (BIC) information criteria, or from prediction error measures. The latter may be computed by holding out some entries of the data matrix during estimation, treating them as missing values, and then evaluating the reconstruction error on the held-out set. The same strategy can be applied repeatedly within a cross-validation procedure to obtain a more reliable estimate of the generalization error. The selected latent dimension is the minimizer of such an error measure over a fairly large grid of prespecified matrix ranks.
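A minimal sketch of this hold-out strategy is given below; fit_gmf() is a hypothetical stand-in for any factorization routine that treats NA entries as missing and returns the fitted mean matrix, and the 30% hold-out fraction is illustrative.

```r
# Sketch of rank selection by hold-out masking and out-of-sample deviance.
select_rank <- function(Y, ranks, holdout = 0.3) {
  test <- matrix(runif(length(Y)) < holdout, nrow(Y), ncol(Y))
  Ytrain <- Y; Ytrain[test] <- NA          # mask the hold-out entries
  dev <- sapply(ranks, function(d) {
    mu <- fit_gmf(Ytrain, rank = d)        # hypothetical fitting routine
    y <- Y[test]; m <- mu[test]            # evaluate on hold-out entries only
    2 * sum(ifelse(y > 0, y * log(y / m), 0) - (y - m))  # Poisson deviance
  })
  ranks[which.min(dev)]
}
```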

Rank selection based on error minimization is intuitive, general, and robust with respect to the estimation method. However, its application to high-dimensional data is hindered by its extremely high computational cost, due to the need for multiple estimations of possibly over-parametrized models. Moreover, many state-of-the-art methods proposed in the literature and implemented in standard software packages do not handle missing values directly, preventing a robust evaluation of the out-of-sample error without prior imputation of the missing values.

In the high-dimensional setting, spectral thresholding methods, akin to those explored by Wang and Carvalho (2023), Nie et al. (2024), and Liu and Zhong (2024), have gained increasing popularity thanks to their ease of implementation, computational efficiency, and connection with the standard scree-plot analysis of PCA. Yet they are seldom used in practice in the omics literature, where practitioners often rely on software defaults, such as 10 or 50 factors.

Thanks to its ability to handle missing values, the intrinsic scalability of the proposed stochastic optimization method, and a convenient warm-start initialization strategy (Friedman et al. 2007, 2010), our approach enables, for the first time in the omics literature, the systematic selection of the number of latent factors using well-grounded criteria based on cross-validated out-of-sample error and spectral thresholding.

5. Simulation studies

In this section, we assess the relative performance of the proposed estimation algorithms against state-of-the-art methods through several simulation experiments. In particular, we evaluate the considered approaches in terms of execution time, memory consumption, out-of-sample prediction error, and quality of biological signal extraction.

In real data scenarios, the functional form of the data-generating mechanism is typically unknown to the researcher, and the assumption of correct model specification is rarely met. To mimic this realistic situation, in our experiments, we used different models for data simulation and signal extraction. In this way, all the estimation methods we consider are misspecified by construction and, a priori, none of them has an advantage over the others.

5.1. Data generating process

To simulate the data, we use the R package splatter (Zappia et al. 2017), which is freely available on Bioconductor (Huber et al. 2015). The package splatter allows us to simulate gene-expression matrices incorporating several user-specified features, such as the dimension of the matrix, the number of cell-types, the proportion of each cell-type in the sample, the form of the cell-type clusters, the expression level, the number of batches, the strength of the batch effects, and many others.

In our experiments, we considered the following simulation setting: each dataset contains cells from five well-separated types, evenly distributed in the sample. This is the signal that we aim to reconstruct with the latent factors. The data are further divided into three batches with different expression levels; this effect can be modeled via row covariates in our approach. No lineage or branching effects are considered. Finally, the simulation includes cell-specific library sizes, which mimic the differences in the total number of counts per cell observed in real data; we model such effects with a column intercept in our framework.

To evaluate the performance of the proposed method under different regimes, we consider two simulation settings. In simulation setting A, we compare several matrix factorization models and algorithms under a fixed latent space rank, Inline graphic, letting the dimensions of the response matrix increase. Specifically, we set the number of cells, Inline graphic, to be 10 times the number of genes, Inline graphic, and we set Inline graphic. In simulation setting B, we compare the same set of factorization methods by fixing the dimensions of the response matrix to Inline graphic and Inline graphic, and letting the latent space rank grow, ie Inline graphic. For each combination of latent rank, Inline graphic, number of cells, Inline graphic, and number of genes, Inline graphic, under the two scenarios, we generated 100 expression matrices. Additional details are provided in Appendix S3 in the Supplementary Material, and the code to generate the data and run the simulations is publicly available on GitHub at https://github.com/alexandresegers/sgdGMF_Paper.
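As an illustration of this design, the sketch below simulates five balanced groups across three batches with splatter; the specific parameter values are illustrative and not the exact settings of Appendix S3.

```r
# Sketch of the simulation design: five balanced cell types, three batches.
library(splatter)
params <- newSplatParams(nGenes = 1000,
                         batchCells = c(1000, 1000, 1000),  # three batches
                         batch.facLoc = 0.1, batch.facScale = 0.1)
sim <- splatSimulate(params, method = "groups",
                     group.prob = rep(0.2, 5),  # five balanced cell types
                     verbose = FALSE)
Y     <- t(counts(sim))        # cells in rows, genes in columns
batch <- colData(sim)$Batch    # row covariate for the GMF model
truth <- colData(sim)$Group    # ground-truth cell types, used for evaluation
```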

5.2. Competing methods and performance measures

For the estimation, we consider several matrix factorization methods based on different model specifications from the statistical, machine learning, and bioinformatics literature. In the following, we list all the methods considered for the extraction of the latent signal:

  • CMF: collective matrix factorization with non-negativity constraints and batch indicator as side information matrix (cmfrec package; Cortes 2023);

  • NMF: non-negative matrix factorization based on the Poisson deviance, without side information matrix or automatic missing-value estimation (NMF package; Gaujoux and Seoighe 2010);

  • NMF+: non-negative matrix factorization based on the Poisson deviance without side information matrix, where the missing values are automatically estimated together with the latent variables (NNLM package; Lin and Boutros 2020);

  • AvaGrad: Poisson GMF model estimated using the AvaGrad algorithm (glmpca package; Townes et al. 2019);

  • Fisher: Poisson GMF model estimated via alternated diagonal Fisher scoring (glmpca package; Townes et al. 2019);

  • NBWaVE: negative binomial GMF estimated via alternated Fisher scoring (NewWave package; Agostinis et al. 2022);

  • GFM-AM: Poisson generalized factor model estimated via alternated maximization (GFM package; Liu et al. 2023);

  • GFM-VEM: Poisson generalized factor model estimated via variational expectation maximization (GFM package; Nie et al. 2024);

  • COAP: Covariate-augmented overdispersed Poisson factor model via variational expectation maximization (COAP package; Liu and Zhong 2024);

  • AIRWLS: Poisson GMF model estimated via the alternated iterative re-weighted least squares algorithm of Kidziński et al. (2022) and Wang and Carvalho (2023) (sgdGMF package);

  • Newton: Poisson GMF model estimated via the exact quasi-Newton algorithm of Kidziński et al. (2022) (sgdGMF package);

  • aSGD: Poisson GMF model estimated via the proposed adaptive stochastic gradient descent with block-wise subsampling described in Algorithm 1 (sgdGMF package; this work).

The most relevant features of these methods and their implementations are summarized in Table 1. To emulate conventional usage of these packages, we adhered closely to the standard option setups recommended in the documentation provided by their respective authors. All the algorithms implemented in the sgdGMF package are initialized with the strategy described in Section 4.1. Additionally, for all methods allowing parallel computing (CMF, NMF+, NBWaVE, AIRWLS, and Newton), we run the estimation using 4 cores, a configuration commonly supported by modern PCs. Finally, we also compare AIRWLS, Newton, and aSGD under a negative binomial likelihood specification. For more details, we refer to Appendix S3 in the Supplementary Material.

To account for technical confounders, we use the batch group as a covariate in all the models supporting regression effects, namely CMF, AvaGrad, Fisher, NBWaVE, COAP, AIRWLS, Newton, and aSGD. Additionally, we included row- and column-specific intercepts to account for cell- and gene-specific effects (see Section 2 for a discussion of the roles of the intercepts). Thus, in our model formulation, we obtain the linear predictor Inline graphic, where Inline graphic is a vector of dummy variables identifying the batch to which each cell Inline graphic belongs.
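In R, assuming the batch labels from the simulation sketch above and the row/column convention used here (cells in rows, genes in columns), the corresponding design matrices can be encoded as follows; this is a sketch of one plausible encoding, not code from the sgdGMF package.

```r
X <- model.matrix(~ batch)     # cell-level intercept plus batch dummies
Z <- matrix(1, ncol(Y), 1)     # gene-level intercept column
```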

To assess the relative performance of the models under consideration, we estimate them on a designated training set and subsequently evaluate their goodness-of-fit on a validation set. In each simulation scenario, we first generate a complete data matrix and then construct the training set by introducing a predetermined percentage of missing values, typically set at Inline graphic. The positions of the hold-out entries are sampled uniformly at random over the matrix indices, and the corresponding test set comprises all observations withheld during the training phase. We evaluate the models in terms of elapsed execution time (in seconds), peak random access memory consumption (in megabytes), out-of-sample reconstruction error, and cell-type separation in the estimated latent space. Let Inline graphic and Inline graphic be the index sets corresponding to the training and validation entries of the response matrix, and let Inline graphic be the empirical average of the training matrix. The out-of-sample reconstruction error is then computed using the relative logarithmic root mean squared error and the relative Poisson deviance, defined as

$$\mathrm{RMSE}(\mathcal{V}) = \left\{\frac{\sum_{(i,j)\in\mathcal{V}}\big[\log(1+y_{ij})-\log(1+\hat{\mu}_{ij})\big]^{2}}{\sum_{(i,j)\in\mathcal{V}}\big[\log(1+y_{ij})-\log(1+\bar{y})\big]^{2}}\right\}^{1/2}, \qquad \mathrm{Dev}(\mathcal{V}) = \frac{\sum_{(i,j)\in\mathcal{V}} D(y_{ij},\hat{\mu}_{ij})}{\sum_{(i,j)\in\mathcal{V}} D(y_{ij},\bar{y})},$$

where $\mathcal{V}$ denotes the validation index set, $\hat{\mu}_{ij}$ the fitted mean of entry $(i,j)$, and $D(y,\mu) = 2\{y\log(y/\mu)-(y-\mu)\}$ the unit Poisson deviance.
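Under these definitions, both metrics can be computed with a few lines of R. The helper below is our own sketch: it takes the held-out observations y, their fitted means mu, and the training mean ybar, all illustrative names.

```r
# Sketch of the two out-of-sample metrics relative to a constant-mean null.
rel_metrics <- function(y, mu, ybar) {
  rmse <- sqrt(sum((log1p(y) - log1p(mu))^2) /
               sum((log1p(y) - log1p(ybar))^2))       # relative log RMSE
  pdev <- function(y, m)                              # unit Poisson deviance
    2 * (ifelse(y > 0, y * log(y / m), 0) - (y - m))
  c(log_rmse = rmse, poisson_dev = sum(pdev(y, mu)) / sum(pdev(y, ybar)))
}
# eg rel_metrics(Y[test], mu_hat[test], mean(Y[!test]))
```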

To assess the degree of cell-type cluster separation in the estimated latent space, we consider two validation scores: the average silhouette width (Rousseeuw 1987), computed on a two-dimensional tSNE embedding (Van der Maaten and Hinton 2008), and the neighborhood purity (Manning et al. 2008), evaluated on the original latent space. These are computed using the functions silhouette() and neighborPurity() from the R packages cluster (Maechler et al. 2022) and bluster (Lun 2023), respectively. The average silhouette is a global measure of cluster cohesion ranging from -1 to 1, with 1 indicating perfect separation between clusters. The neighborhood purity is a local measure of cluster cohesion ranging from 0 to 1, with 1 indicating perfectly pure neighborhoods. Being a global measure based on Euclidean distances, the silhouette favors clusters with spherical shapes and cannot detect well-separated clusters featuring non-spherical or non-convex boundaries. In contrast, the neighborhood purity captures localized behavior and does not depend on the cluster shapes.
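A sketch of how the two scores can be obtained in R, assuming an estimated score matrix U, the ground-truth labels truth from the simulation example above, and the Rtsne package for the two-dimensional embedding:

```r
# Sketch of the two cluster-separation scores on the estimated latent space.
library(cluster); library(bluster); library(Rtsne)
emb <- Rtsne(U, dims = 2, check_duplicates = FALSE)$Y      # 2-d tSNE of the scores
sil <- mean(silhouette(as.integer(truth), dist(emb))[, "sil_width"])
pur <- mean(neighborPurity(U, clusters = truth)$purity)    # on the full latent space
```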

5.3. Simulation results

Figure 2 presents an overview of the results obtained from the data simulated with Inline graphic, Inline graphic, and Inline graphic. Our proposed aSGD method is the fastest approach together with GFM-VEM and COAP (Fig. 2A): it is parsimonious in terms of memory usage, while performing on par with the best-performing methods in terms of out-of-sample error, deviance, silhouette width, and neighborhood purity. An exemplar simulation run (Fig. 2B) shows that all the GMF methods using covariate information succeed in reconstructing the original cell-type clustering, filtering out the batch effect via the regression term in the linear predictor. The only exception is AvaGrad, which fails to separate the different groups. On the other hand, GFM-AM and GFM-VEM are unable to disentangle the cell types from the batch effects, primarily because the GFM package cannot account for covariate effects. Similarly, the NMF methods over-cluster the data, as they do not allow for batch effect removal via regression. This is confirmed by the average silhouette width and the mean neighborhood purity (bottom rows of Fig. 2A), which show a large difference in performance, discriminating methods based on their ability to account for covariate (ie batch) effects.

Figure 2.

Boxplots representing, for each method, the distribution of time, memory, error, deviance, silhouette width, and purity across simulations, and tSNE projections of one replication for each method.

Summary information for the simulation experiment described in Section 5.1 with Inline graphic, Inline graphic, and Inline graphic. Left (panel A): summary statistics reporting the execution time (in seconds), the peak memory consumption (in megabytes), the out-of-sample relative logarithmic root mean squared error (multiplied by Inline graphic), the out-of-sample relative Poisson deviance (multiplied by Inline graphic), the silhouette score of the true cell-type clusters calculated on a 2-dimensional tSNE projection, and the mean cluster purity of the true cell-type clusters calculated on the 5-dimensional estimated latent space. Right (panel B): 2-dimensional tSNE projections of the estimated latent factors for one specific replication of the experiment.

These results are confirmed across both settings A and B (Figs 3 and 4): in terms of computational efficiency, the proposed aSGD implementation consistently outperforms its competitors in setting A and ranks as the top performer in setting B, showing lower elapsed execution times and better scalability with respect to both the sample size and the dimension of the latent space (top row of Fig. 3). Additionally, aSGD manages random access memory parsimoniously, in line with AIRWLS, Newton, and GFM-VEM, and is surpassed only by CMF and NMF+ (bottom row of Fig. 3). Interestingly, CMF, NMF+, GFM-VEM, AIRWLS, Newton, and aSGD maintain an almost constant memory footprint in simulation setting B. We attribute this behavior to their efficient memory allocation strategies, in which memory usage is mainly dictated by the storage of input data and sufficient statistics, with negligible additional cost during optimization as long as the latent dimension remains much smaller than the matrix dimensions. Among the alternative methods, COAP and Newton emerge as the fastest, followed by AIRWLS. On the other hand, the GFM-AM, NBWaVE, Fisher, and AvaGrad optimizers display inferior scalability compared to aSGD, Newton, and AIRWLS (Fig. 3). Contrary to the claims in the glmpca package documentation, our findings indicate that the AvaGrad optimizer often falls short of the Fisher optimizer in both speed and reliability; the latter typically achieves convergence within a reasonable time and, on average, runs faster than AvaGrad.

Figure 3.

Line plots showing the execution time and peak memory consumption of each method as a function of matrix dimension and latent space rank.

Summary statistics of the simulation experiments described in Section 5.1. The columns correspond to simulation settings A) (left) and B) (right). The rows correspond to the elapsed execution time in seconds (top) and the peak memory consumption in megabytes (bottom).

Figure 4.

Line plots showing the out-of-sample error, deviance, silhouette, and purity for each method as a function of matrix dimension and latent space rank.

Summary statistics of the simulation experiments described in Section 5.1. The columns correspond to simulation settings A) (left) and B) (right). The rows correspond to four goodness-of-fit measures. From top to bottom: the out-of-sample relative logarithmic root mean squared error, the out-of-sample relative residual deviance, the silhouette evaluated on a 2-dimensional tSNE projection of the latent space, and the neighborhood purity of the true cell-type evaluated on the original latent space.

Regarding non-negative matrix factorization, implementations such as NMF and CMF demonstrate poor computational scalability across both settings. NMF+ fares better, reaching efficiency levels close to AIRWLS in setting B and excellent memory management in both settings (Fig. 3).

In terms of goodness-of-fit measures (top two rows of Fig. 4), the lowest out-of-sample logarithmic error and deviance are systematically achieved by NMF+, COAP, AIRWLS, Newton, and aSGD, with aSGD consistently emerging as the most accurate method in the deviance metric. This pattern is partially reflected in the silhouette width and neighborhood purity scores (bottom two rows of Fig. 4), for which, in both scenarios A and B, NBWaVE, COAP, AIRWLS, Newton, and aSGD always outperform the other methods in terms of cell-type separation in the latent space. Among these, aSGD exhibits slightly worse performance than the other top-performing methods, especially in small-sample settings; however, it converges toward them as the sample size increases. This behavior is not unexpected, as stochastic optimization algorithms typically require a large number of observations to mitigate their intrinsic randomness, achieving stability as the data dimension grows.

Finally, in Appendix S4 in the Supplementary Material (Figs S1 and S2), we compare the three methods implemented in the sgdGMF package (ie AIRWLS, Newton, and aSGD) under both Poisson and negative binomial likelihood specifications, using the same simulation settings outlined above. Overall, aSGD exhibits comparable performance under both specifications in terms of execution time and memory usage. In contrast, for AIRWLS and Newton, changing the likelihood induces some differences: in setting A, the negative binomial model converges faster than the Poisson model, resulting in reduced computational time, while no substantial differences are observed in setting B (Fig. S1). Regarding out-of-sample performance, Poisson models consistently achieve a lower logarithmic root mean squared error, whereas negative binomial models uniformly yield a lower deviance. Moreover, negative binomial models attain higher silhouette and neighborhood purity scores, suggesting improved separation of cell lines in the latent space within the sgdGMF framework (Fig. S2). Overall, neither likelihood specification clearly dominates the other.

6. Real data applications

In this section, we demonstrate the effectiveness of our method on two real datasets. The first, referred to as the Arigoni dataset, is a 10X Genomics scRNA-seq experiment on lung cancer cell lines with unique driver mutations (Arigoni et al. 2024). As suggested by the authors, the heterogeneity among cell lines can be used as ground truth to benchmark computational methods. We use these data to illustrate that our method can discover real biological signal.

Further, we apply our method to a large scRNA-seq dataset of more than 1.3 million cells from the mouse brain, generated by 10X Genomics (Lun and Morgan 2023). As this dataset lacks ground-truth cell labels, it primarily showcases the scalability of our approach on large datasets.

Throughout this section, we consider parametrization (B1) to obtain an orthonormal loading matrix V and a scaled orthogonal score matrix U (see Section 2.1). This choice is conventional in the RNA-seq literature (see, eg Risso et al. 2018; Townes et al. 2019; Ahlmann-Eltze and Huber 2023) and is consistent with the standard parametrization of principal component analysis.
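One standard way to move an arbitrary fitted factorization to such a parametrization is to re-rotate the latent term through its SVD, as in the generic R sketch below, where d is the latent rank; whether this matches parametrization (B1) in every detail is an assumption on our part.

```r
# Sketch: re-rotate a fitted factorization U V' to PCA-like form.
s <- svd(U %*% t(V), nu = d, nv = d)
V <- s$v                                      # orthonormal loadings
U <- s$u %*% diag(s$d[seq_len(d)], nrow = d)  # scaled orthogonal scores
```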

6.1. Arigoni data

The original Arigoni dataset (Arigoni et al. 2024) consists of Inline graphic cells from 8 different lung cancer cell lines with unique driver mutations (EGFR, ALK, MET, ERBB2, KRAS, BRAF, ROS1). The ground-truth knowledge of the driver mutations can be used to evaluate cell clustering, both by visual inspection and by means of clustering algorithms. Importantly, the CCL-185-IG cell line is derived from the A549 cell line; therefore, only subtle differences are expected between these two cell lines. Quality control is performed using the perCellQCFilters function of the R package scuttle (McCarthy et al. 2017), filtering out cells with a library size below 1306, a percentage of mitochondrial reads above 6.05%, or fewer than 732 expressed features. Additionally, peripheral blood mononuclear cells are removed due to their distinct expression profile. The final filtered dataset includes Inline graphic cells from 7 different cell lines. Unless otherwise mentioned, all the analyses are based on the 500 most variable genes, a common choice in standard scRNA-seq workflows. The selection of the number of highly variable genes did not prove critical, as the results are robust across a large spectrum of values (Figs S3 and S4).

To select the optimal rank of the latent component of the model, different model selection criteria are assessed in a 5-fold cross-validation (see Section 4.2 for details). AIC and BIC are assessed on the 5 training data partitions, while the out-of-sample deviance is computed on the test data. Additionally, using all the data, we assess the scree plot of the eigenvalues of the deviance residuals obtained after an OLS fit on the log-transformed data.

Both the AIC and the out-of-sample deviance criteria suggest a matrix rank of 15, while the BIC and the scree plot suggest 9 (Fig. 5A). The mean cell-line neighborhood purity scores (Fig. 5B) show that, for the majority of cell lines, a rank of 9 is sufficient to completely separate the groups, with only the A549 and CCL-185-IG cell lines exhibiting a score lower than 0.9. At rank 15, all cell lines achieve a high mean purity, and increasing the rank further does not improve this index, while introducing the risk of overfitting, as observed in the increasing out-of-sample deviance in the cross-validation (Fig. 5A). These remarks are confirmed by visual inspection of the tSNE plots colored by the ground-truth labels (Fig. 5C), which show that a matrix rank of 9 yields a good separation, except for A549 and CCL-185-IG, which exhibit a slight degree of mixing (Fig. 5C, confusion matrix). Although there are no outstanding visual differences between ranks 15 and 30, performing Leiden clustering with a resolution tuned to obtain 7 clusters shows that rank 30 produces one small cluster with very few cells rather than separating the A549 and CCL-185-IG groups, whereas rank 15 separates the two groups as expected. This suggests that 15 is a reasonable number of latent factors to include in the model. Importantly, the tested model selection approaches were not informed by the cell-type labels, which were used only to evaluate the methods' performance. This shows that unsupervised approaches to model selection are able to estimate a number of factors sufficient to extract meaningful biological signal from real data.

Figure 5.

Graphs showing that different measures of model selection indicate either 9 or 15 as the latent matrix rank. tSNE plots showing the projection of latent matrices of rank 9, 15, and 30, and confusion matrix heatmaps of cluster/cell-type agreement.

Assessment of model selection metrics. A) Application of diverse model selection criteria, including the scree plot, AIC, BIC, and cross-validation based on out-of-sample deviances. B) Mean cell-line purity as a function of the matrix rank. C) tSNE plots colored by the ground truth and by the clusters obtained with Leiden clustering, alongside a confusion matrix representing cell-line distributions across clusters. In the confusion matrices, each entry reports the percentage of cells belonging to that combination; the entries sum to 100 and the color intensity is proportional to the percentages.

Using 15 factors, as suggested by the model selection criteria, aSGD was compared with NBWaVE, Fisher, and AvaGrad, three of the most popular methods in single-cell analysis, as well as with COAP, which, together with aSGD, was among the fastest and most reliable methods in the simulations (Fig. S5). We tested aSGD with both a negative binomial and a Poisson likelihood, and the results were virtually identical (Fig. S5). All methods achieve similar results, as shown by both visual inspection of the tSNE plots and the mean cell-line neighborhood purities. However, aSGD is orders of magnitude faster, which makes it feasible to select the optimal matrix rank, shown above to be important for clustering the data. Further, aSGD achieves a lower out-of-sample deviance. In terms of memory, aSGD has a peak RAM usage similar to COAP's, outperforming the other three methods.

6.2. TENxBrainData

To demonstrate the scalability of our method to large datasets, we apply aSGD to the TENxBrainData (Lun and Morgan 2023), which consists of scRNA-seq UMI counts generated by 10X Genomics for approximately 1.3 million cells, obtained from the cortex, hippocampus, and ventricular zone of two mouse brains. Although no ground truth is available for the different cell types in this dataset, marker genes that discriminate between subtypes of mouse brain cells are available, and the list used by Hicks et al. (2021) is used to qualitatively evaluate the extraction of biological signal in different cell clusters. Quality control and filtering are performed using the R package scater (McCarthy et al. 2017), excluding cells with an exceptional number of mitochondrial reads (more than 3 median absolute deviations away from the median) and genes with no expression in over 99% of the cells. This procedure yields Inline graphic cells, and we retain the 500 most variable genes, as done for the Arigoni dataset.
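A sketch of these loading and filtering steps, assuming the Bioconductor TENxBrainData and scuttle interfaces; the mitochondrial gene prefix and the Symbol annotation column are assumptions on our part, and the thresholds follow the text.

```r
# Sketch of loading and QC filtering for the TENxBrainData experiment.
library(TENxBrainData); library(scuttle)
sce  <- TENxBrainData()                          # ~1.3M cells, HDF5-backed
mito <- grep("^mt-", rowData(sce)$Symbol)        # assumed mouse mito prefix
qc   <- perCellQCMetrics(sce, subsets = list(Mito = mito))
keep <- !isOutlier(qc$subsets_Mito_percent, nmads = 3, type = "higher")
sce  <- sce[, keep]
sce  <- sce[rowMeans(counts(sce) > 0) > 0.01, ]  # genes expressed in >1% of cells
```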

Considering the eigenvalue gap method (Fig. S6), a very fast procedure for model selection, we selected a model with 10 latent factors. To study the scalability of aSGD and its competitors on this large dataset, we considered subsamples of Inline graphic, Inline graphic, and Inline graphic cells (Fig. 6A). This analysis showed that aSGD is orders of magnitude faster than competing methods, with Fisher and NBWaVE taking 4 and 8 h, respectively, to analyze Inline graphic cells. While AvaGrad achieves better computational speed than Fisher and NBWaVE, it remains extremely slow compared to aSGD. Moreover, AvaGrad failed to converge in one of five runs with Inline graphic cells and in two of five runs with Inline graphic cells. These methods would therefore incur extreme computational times on the full dataset. Moreover, the memory usage of aSGD remains lower than that of competing methods. Note also that COAP returned errors when using more than Inline graphic cells on a high-performance computer (100 GB RAM, 2.2 GHz CPU), rendering it unable, on our system, to handle large datasets. Therefore, aSGD is the only method that can reasonably be run on the full dataset, owing to its superior computational efficiency.

Figure 6.

Line plots of time and memory consumption as a function of the number of cells and heatmap of marker gene expression across clusters.

Results on large-scale data. A) Computational time and memory usage for different methods on increasingly larger subsets of the dataset. Each subset includes 500 high-variable genes and a growing number of cells. B) Heatmap of the average gene expression of the 18 clusters obtained by Leiden clustering computed on the latent score matrix, for 29 marker genes. Each marker gene is colored based on the cell type it is expressed in.

Running aSGD on the full dataset returned results in 77 min. Subsequent Leiden clustering of this matrix factorization revealed sensible biological signal extraction, as its clusters align with established marker genes (Fig. 6B). For example, cluster 8 is characterized by cells expressing pyramidal neuron marker genes, such as Crym (Loo et al. 2019), while cluster 4 contains cells expressing interneuron markers, eg Sst and Lhx6 (Tasic et al. 2018).
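For reference, the downstream clustering and marker summaries can be obtained with bluster's graph-based interface; the sketch below assumes an estimated score matrix U and the filtered count matrix Y (cells in rows), both illustrative names.

```r
# Sketch of Leiden clustering on the latent scores and per-cluster averages.
library(bluster)
clusters <- clusterRows(U, NNGraphParam(cluster.fun = "leiden"))
# average expression of each gene within each cluster, for the heatmap
avg <- sapply(split(seq_len(nrow(Y)), clusters),
              function(idx) colMeans(Y[idx, , drop = FALSE]))
```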

7. Discussion

In the present work, we propose a flexible and scalable tool to perform generalized matrix factorization in massive data problems, with a particular focus on scRNA-seq applications. We propose an innovative adaptive stochastic gradient descent algorithm, whose performance is enhanced via a memory-efficient block-wise subsampling method and a convenient initialization strategy. An R/C++ implementation in the open-source package sgdGMF is freely available on CRAN. Overall, the proposed method proved competitive with state-of-the-art approaches available in R, showing higher prediction accuracy, good biological signal-extraction ability, and a significant speed-up in execution time in simulated and real data examples. Unlike most methods currently employed for scRNA-seq signal extraction, our approach natively deals with missing values, iteratively imputing them with the model's current best prediction. This feature proved important for out-of-sample error evaluation, model selection, and matrix completion.

Accounting for batch effects is critical in the analysis of large single-cell datasets with complex experimental designs. Our model deals with such effects by including batch information within the set of cell-level covariates. Within our GLM framework, covariate adjustment at the level of the linear predictor induces both mean and variance corrections, owing to the heteroscedastic nature of exponential family models, particularly under Poisson and negative binomial likelihoods. Hence, the model can account for both mean and variance shifts. In both our simulation studies and real-data analyses, we did not observe the need for additional corrections to the estimated latent factors, suggesting that the proposed approach is sufficiently flexible to mitigate batch effects in a variety of practical scenarios. A potentially interesting direction for future research, however, is to introduce a more explicit functional dependence of the dispersion parameter on batch indicators.

An appealing feature of the proposed method is its flexibility, which enables several extensions and generalizations. The proposed framework naturally extends to heterogeneous likelihood specifications across rows or columns of the response matrix. This would make it possible to jointly factorize discrete, count, and continuous data that share the same latent factorization structure but have different conditional distributions. From a biological viewpoint, this extension would permit flexible modeling of multi-omic data (Argelaguet et al. 2018, 2020) under the unified framework provided by sgdGMF.

Another interesting extension of the proposed algorithm is generalized tensor factorization for non-Gaussian data arrays. In this setting, the computational complexity of estimating highly parametrized models can grow rapidly, making cheap and modular estimation algorithms, such as the proposed adaptive stochastic gradient descent, increasingly important.

From an algorithmic viewpoint, another fascinating possibility is to consider non-uniform sampling schemes for the mini-batch selection. For instance, if the sample is divided into known subpopulations, it could be convenient to exploit the clustered nature of the data by forming the mini-batch partition via stratification. This can improve the representativeness of each chunk, reduce the variance of the gradient estimator, and prevent the optimization from converging to suboptimal solutions dominated by the signal of a specific subpopulation.
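As a minimal sketch of this idea, a stratified round-robin partition can be built in base R; stratified_chunks() is our own illustrative helper, not part of sgdGMF.

```r
# Sketch: stratified mini-batch partition. Indices are shuffled within each
# stratum and then dealt out round-robin, so every chunk contains all
# subpopulations in roughly their sample proportions.
stratified_chunks <- function(groups, n_chunks) {
  idx <- unlist(lapply(split(seq_along(groups), groups),
                       function(g) g[sample.int(length(g))]))
  split(idx, rep_len(seq_len(n_chunks), length(idx)))
}
# eg chunks <- stratified_chunks(cell_type_labels, n_chunks = 20)
```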


Acknowledgments

The authors thank the anonymous reviewers and the associate editor for their valuable suggestions.

Contributor Information

Cristian Castiglione, Institute for Data Science and Analytics, Bocconi University, Via Röntgen 1, Milan 20136, Italy.

Alexandre Segers, Department of Mathematics, Computer Science and Statistics, Ghent University, Krijgslaan 299-S9, Ghent 9000, Belgium; Department of Biomolecular Medicine, Ghent University, Corneel Heymanslaan 10, Ghent 9000, Belgium.

Lieven Clement, Department of Mathematics, Computer Science and Statistics, Ghent University, Krijgslaan 299-S9, Ghent 9000, Belgium.

Davide Risso, Department of Statistical Sciences, University of Padova, Via Cesare Battisti 241, Padova 35121, Italy.

Supplementary material

Supplementary material is available at Biostatistics Journal online.

Funding

This work was supported by EU funding within the MUR PNRR “National Center for HPC, big data and quantum computing” [Project no. CN00000013 CN1]. The views and opinions expressed are only those of the authors and do not necessarily reflect those of the European Union or the European Commission. Neither the European Union nor the European Commission can be held responsible for them. DR was also supported by the National Cancer Institute of the National Institutes of Health [U24CA289073] and by project EOSS6-0000000644 from the Chan Zuckerberg Initiative. This work was supported by grants from Ghent University Special Research Fund [BOF20/GOA/023 to A.S., L.C.], Research Foundation Flanders [FWO G062219N to A.S., L.C.] and [FWO G071326N to L.C.].

Conflicts of interest

None declared.

Data Availability

sgdGMF is freely available as an open-source R package on CRAN at https://CRAN.R-project.org/package=sgdGMF. The scripts used to run all analyses are available on GitHub at https://github.com/alexandresegers/sgdGMF_Paper. The Arigoni dataset is available at https://doi.org/10.6084/m9.figshare.23939481.v1. The TENxBrainData dataset is available as part of the TENxBrainData Bioconductor package at https://bioconductor.org/packages/TENxBrainData.

References

  1. Agostinis F, Romualdi C, Sales G, Risso D.  2022. NewWave: a scalable R/Bioconductor package for the dimensionality reduction and batch effect removal of single-cell RNA-seq data. Bioinformatics. 38:2648–2650. [DOI] [PMC free article] [PubMed] [Google Scholar]
  2. Ahlmann-Eltze C, Huber W.  2023. Comparison of transformations for single-cell RNA-seq data. Nat Methods. 20:665–672. [DOI] [PMC free article] [PubMed] [Google Scholar]
  3. Angerer P  et al.  2017. Single cells make big data: new challenges and opportunities in transcriptomics. Curr Opin Syst Biol. 4:85–91. [Google Scholar]
  4. Argelaguet R  et al.  2020. MOFA+: a statistical framework for comprehensive integration of multi-modal single-cell data. Genome Biol. 21:111–117. [DOI] [PMC free article] [PubMed] [Google Scholar]
  5. Argelaguet R  et al.  2018. Multi-Omics factor analysis–a framework for unsupervised integration of multi-omics data sets. Mol Syst Biol. 14:e8124. [DOI] [PMC free article] [PubMed] [Google Scholar]
  6. Arigoni M  et al.  2024. A single cell RNAseq benchmark experiment embedding “controlled” cancer heterogeneity. Sci Data. 11:159. [DOI] [PMC free article] [PubMed] [Google Scholar]
  7. Bartholomew D, Knott M, Moustaki I.  2011. Latent variable models and factor analysis. A unified approach, 3rd ed. Wiley series in probability and statistics. John Wiley & Sons, Ltd. [Google Scholar]
  8. Bianconcini S, Cagnone S.  2012. Estimation of generalized linear latent variable models via fully exponential Laplace approximation. J Multivar Anal. 112:183–193. [Google Scholar]
  9. Bordes A, Bottou L, Gallinari P.  2009. SGD-QN: careful quasi-Newton stochastic gradient descent. J Mach Learn Res. 10:1737–1754. [Google Scholar]
  10. Bottou L.  2010. Large-scale machine learning with stochastic gradient descent. In: Proceedings of COMPSTAT’2010: 19th International Conference on Computational Statistics. Paris France, August 22–27, 2010 Keynote, Invited and Contributed Papers. Springer. p. 177–186.
  11. Breslow NE, Clayton DG.  1993. Approximate inference in generalized linear mixed models. J Am Stat Assoc. 88:9–25. [Google Scholar]
  12. Brunet J-P, Tamayo P, Golub TR, Mesirov JP.  2004. Metagenes and molecular pattern discovery using matrix factorization. Proc Natl Acad Sci USA. 101:4164–4169. [DOI] [PMC free article] [PubMed] [Google Scholar]
  13. Cagnone S, Monari P.  2013. Latent variable models for ordinal data by using the adaptive quadrature approximation. Comput Stat. 28:597–619. [Google Scholar]
  14. Cai J-F, Candès EJ, Shen Z.  2010. A singular value thresholding algorithm for matrix completion. SIAM J Optim. 20:1956–1982. [Google Scholar]
  15. Cao Y, Yang P, Yang JYH.  2021. A benchmark study of simulation methods for single-cell RNA sequencing data. Nat Commun. 12:6911. [DOI] [PMC free article] [PubMed] [Google Scholar]
  16. Cappé O, Moulines E.  2009. On-line expectation–maximization algorithm for latent data models. J R Stat Soc Ser B Stat Methodol. 71:593–613. [Google Scholar]
  17. Collins M, Dasgupta S, Schapire RE.  2001. A generalization of principal components analysis to the exponential family. Adv Neural Inf Process Syst. 14. [Google Scholar]
  18. Cortes D.  2018. Cold-start recommendations in collective matrix factorization [preprint]. arXiv, arXiv:1809.00366.
  19. Cortes D.  2023. cmfrec: collective matrix factorization for recommender systems. R package version 3.5.1-1. DOI: 10.32614/CRAN.package.cmfrec [DOI]
  20. Denyer T  et al.  2019. Spatiotemporal developmental trajectories in the arabidopsis root revealed using high-throughput single-cell RNA sequencing. Dev Cell. 48:840–852.e5. [DOI] [PubMed] [Google Scholar]
  21. Duchi J, Hazan E, Singer Y.  2011. Adaptive subgradient methods for online learning and stochastic optimization. J Mach Learn Res. 12:2121–2159. [Google Scholar]
  22. Durif G, Modolo L, Mold JE, Lambert-Lacroix S, Picard F.  2019. Probabilistic count matrix factorization for single cell expression data analysis. Bioinformatics. 35:4011–4019. [DOI] [PubMed] [Google Scholar]
  23. Fan J, Guo J, Zheng S.  2022. Estimating number of factors by adjusted eigenvalues thresholding. J Am Stat Assoc. 117:852–861. [Google Scholar]
  24. Friedman J, Hastie T, Höfling H, Tibshirani R.  2007. Pathwise coordinate optimization. Ann Appl Stat. 1:302–332. [Google Scholar]
  25. Friedman J, Hastie T, Tibshirani R.  2010. Regularization paths for generalized linear models via coordinate descent. J Stat Softw. 33:1–22. [PMC free article] [PubMed] [Google Scholar]
  26. Gaujoux R, Seoighe C.  2010. A flexible R package for nonnegative matrix factorization. BMC Bioinformatics. 11:367. [DOI] [PMC free article] [PubMed] [Google Scholar]
  27. Gopalan P, Hofman JM, Blei DM.  2015. Scalable recommendation with hierarchical Poisson factorization. In: UAI'15 Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence. p. 326–335.
  28. Haghverdi L, Lun AT, Morgan MD, Marioni JC.  2018. Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nat Biotechnol. 36:421–427. [DOI] [PMC free article] [PubMed] [Google Scholar]
  29. Hicks SC, Liu R, Ni Y, Purdom E, Risso D.  2021. mbkmeans: fast clustering for single cell data using mini-batch k-means. PLoS Comput Biol. 17:e1008625. [DOI] [PMC free article] [PubMed] [Google Scholar]
  30. Hicks SC, Townes FW, Teng M, Irizarry RA.  2018. Missing data and technical variability in single-cell RNA-sequencing experiments. Biostatistics. 19:562–578. [DOI] [PMC free article] [PubMed] [Google Scholar]
  31. Hinton G, Srivastava N, Swersky K.  2012. Lecture 6a: overview of mini-batch gradient descent. https://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf. Accessed: 2024-02-15.
  32. Huber P, Ronchetti E, Victoria-Feser M-P.  2004. Estimation of generalized linear latent variable models. J R Stat Soc Ser B Stat Methodol. 66:893–908. [Google Scholar]
  33. Huber W  et al.  2015. Orchestrating high-throughput genomic analysis with Bioconductor. Nat Methods. 12:115–121. [DOI] [PMC free article] [PubMed] [Google Scholar]
  34. Hui FKC, Warton DI, Ormerod JT, Haapaniemi V, Taskinen S.  2017. Variational approximations for generalized linear latent variable models. J Comput Graph Stat. 26:35–43. [Google Scholar]
  35. Jean-Baptiste K  et al.  2019. Dynamics of gene expression in single root cells of Arabidopsis thaliana. Plant Cell. 31:993–1011. [DOI] [PMC free article] [PubMed] [Google Scholar]
  36. Jolliffe IT.  1986. Principal component analysis: with 26 illustrations. Springer. [Google Scholar]
  37. Kenney T, Gu H, Huang T.  2021. Poisson PCA: Poisson measurement error corrected PCA, with application to microbiome data. Biometrics. 77:1369–1384. [DOI] [PubMed] [Google Scholar]
  38. Kharchenko PV.  2021. The triumphs and limitations of computational methods for scRNA-seq. Nat Methods. 18:723–732. [DOI] [PubMed] [Google Scholar]
  39. Kidziński Ł, Hui FKC,  et al.  2022. Generalized matrix factorization: efficient algorithms for fitting generalized linear latent variable models to large data arrays. J Mach Learn Res. 23:1–29. [PMC free article] [PubMed] [Google Scholar]
  40. Kim SH, Cho SY.  2023. Single-cell transcriptomics to understand the cellular heterogeneity in toxicology. Mol Cell Toxicol. 19:223–228. [Google Scholar]
  41. Kingma DP, Ba J.  2014. Adam: a method for stochastic optimization [preprint]. arXiv, arXiv:1412.6980.
  42. Kiselev VY, Andrews TS, Hemberg M.  2019. Challenges in unsupervised clustering of single-cell RNA-seq data. Nat Rev Genet. 20:273–282. [DOI] [PubMed] [Google Scholar]
  43. Koren Y, Bell R, Volinsky C.  2009. Matrix factorization techniques for recommender systems. Computer. 42:30–37. [Google Scholar]
  44. Korsunsky I  et al.  2019. Fast, sensitive and accurate integration of single-cell data with harmony. Nat Methods. 16:1289–1296. [DOI] [PMC free article] [PubMed] [Google Scholar]
  45. Kouno T  et al.  2013. Temporal dynamics and transcriptional control using single-cell gene expression analysis. Genome Biol. 14:R118. [DOI] [PMC free article] [PubMed] [Google Scholar]
  46. Lähnemann D  et al.  2020. Eleven grand challenges in single-cell data science. Genome Biol. 21:31. [DOI] [PMC free article] [PubMed] [Google Scholar]
  47. Landgraf AJ.  2015. Generalized principal component analysis: dimensionality reduction through the projection of natural parameters [Ph.D. Thesis]. The Ohio State University.
  48. Lee DD, Seung HS.  1999. Learning the parts of objects by non-negative matrix factorization. Nature. 401:788–791. [DOI] [PubMed] [Google Scholar]
  49. Lee S, Huang JZ, Hu J.  2010. Sparse logistic principal components analysis for binary data. Ann Appl Stat. 4:1579–1601. [DOI] [PMC free article] [PubMed] [Google Scholar]
  50. Lee Y, Nelder JA, Pawitan Y.  2017. Generalized linear models with random effects, Unified analysis via H-likelihood, 2nd ed., Vol. 153, Monographs on Statistics and Applied Probability. CRC Press. [Google Scholar]
  51. Li J, Tao D.  2010. Simple exponential family PCA. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. JMLR Workshop and Conference Proceedings. p. 453–460.
  52. Lin X, Boutros PC.  2020. NNLM: fast and versatile non-negative matrix factorization. R package version 0.4.4. https://github.com/linxihui/NNLM
  53. Liu W, Lin H, Zheng S, Liu J.  2023. Generalized factor model for ultra-high dimensional correlated variables with mixed types. J Am Stat Assoc. 118:1385–1401. [Google Scholar]
  54. Liu W, Zhong Q.  2024. High-dimensional covariate-augmented overdispersed Poisson factor model. Biometrics. 80:ujae031. [DOI] [PubMed] [Google Scholar]
  55. Loo L  et al.  2019. Single-cell transcriptomic analysis of mouse neocortical development. Nat Commun. 10:134. [DOI] [PMC free article] [PubMed] [Google Scholar]
  56. Love MI, Huber W, Anders S.  2014. Moderated estimation of fold change and dispersion for RNA-seq data with deseq2. Genome Biol. 15:550. [DOI] [PMC free article] [PubMed] [Google Scholar]
  57. Luecken MD  et al.  2025. Defining and benchmarking open problems in single-cell analysis. Nat Biotechnol. 43:1035–1040. [DOI] [PubMed] [Google Scholar]
  58. Lun A.  2023. bluster: clustering algorithms for bioconductor. R package version 1.10.0. DOI: 10.18129/B9.bioc.bluster [DOI]
  59. Lun A, Morgan M.  2023. TENxBrainData: data from the 10x 1.3 million brain cell study. R package version 1.22.0. DOI: 10.18129/B9.bioc.TENxBrainData [DOI]
  60. Maechler M, Rousseeuw P, Struyf A, Hubert M, Hornik K.  2022. cluster: cluster analysis basics and extensions. R package version 2.1.4. DOI: 10.32614/CRAN.package.cluster. [DOI]
  61. Mairal J, Bach F, Ponce J, Sapiro G.  2010. Online learning for matrix factorization and sparse coding. J Mach Learn Res. 11:19–60. [Google Scholar]
  62. Manning CD, Raghavan P, Schütze H.  2008. Introduction to information retrieval, Vol. 39. Cambridge University Press. [Google Scholar]
  63. Mazumder R, Hastie T, Tibshirani R.  2010. Spectral regularization algorithms for learning large incomplete matrices. J Mach Learn Res. 11:2287–2322. [PMC free article] [PubMed] [Google Scholar]
  64. McCarthy DJ, Campbell KR, Lun ATL, Wills QF.  2017. Scater: pre-processing, quality control, normalization and visualization of single-cell RNA-seq data in R. Bioinformatics. 33:1179–1186. [DOI] [PMC free article] [PubMed] [Google Scholar]
  65. McCullagh P, Nelder JA.  1989. Generalized linear models, 2nd ed., Monographs on Statistics and Applied Probability. Chapman & Hall. [Google Scholar]
  66. Mensch A, Mairal J, Thirion B, Varoquaux G.  2018. Stochastic subsampling for factorizing huge matrices. IEEE Trans Signal Process. 66:113–128. [Google Scholar]
  67. Mohamed A, Heller K, Ghahramani Z.  2008. Bayesian exponential family PCA. Adv Neural Inf Process Syst. 21. [Google Scholar]
  68. Nesterov Y.  1983. A method for solving a convex programming problem with convergence rate Inline graphic. Soviet Math Doklady. 27:372–376. [Google Scholar]
  69. Nguyen TKH, Van den Berge K, Chiogna M, Risso D.  2023. Structure learning for zero-inflated counts with an application to single-cell RNA sequencing data. Ann Appl Stat. 17:2555–2573. [Google Scholar]
  70. Nie J, Qin Z, Liu W.  2024. High-dimensional overdispersed generalized factor model with application to single-cell sequencing data analysis. Stat Med. 43:4836–4849. [DOI] [PubMed] [Google Scholar]
  71. Niku J, Warton DI, Hui FKC, Taskinen S.  2017. Generalized linear latent variable models for multivariate count and biomass data in ecology. JABES. 22:498–522. [Google Scholar]
  72. Onatski A.  2010. Determining the number of factors from empirical distribution of eigenvalues. Rev Econ Stat. 92:1004–1016. [Google Scholar]
  73. Perez RK  et al.  2022. Single-cell RNA-seq reveals cell type–specific molecular and genetic associations to lupus. Science. 376:eabf1970. [DOI] [PMC free article] [PubMed] [Google Scholar]
  74. Recht B, Ré C.  2013. Parallel stochastic gradient algorithms for large-scale matrix completion. Math Prog Comp. 5:201–226. [Google Scholar]
  75. Reddi SJ, Kale S, Kumar S.  2019. On the convergence of Adam and beyond [preprint]. arXiv, arXiv:1904.09237.
  76. Risso D, Perraudeau F, Gribkova S, Dudoit S, Vert J-P.  2018. A general and flexible method for signal extraction from single-cell RNA-seq data. Nat Commun. 9:284. [DOI] [PMC free article] [PubMed] [Google Scholar]
  77. Robbins H, Monro S.  1951. A stochastic approximation method. Ann Math Statist. 22:400–407. [Google Scholar]
  78. Robinson MD, McCarthy DJ, Smyth GK.  2010. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 26:139–140. [DOI] [PMC free article] [PubMed] [Google Scholar]
  79. Rousseeuw PJ.  1987. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math. 20:53–65. [Google Scholar]
  80. Sammel MD, Ryan LM, Legler JM.  1997. Latent variable models for mixed discrete and continuous outcomes. J R Stat Soc Ser B Stat Methodol. 59:667–678. [Google Scholar]
  81. Schein AI, Saul LK, Ungar LH.  2003. A generalized linear model for principal component analysis of binary data. In: International Workshop on Artificial Intelligence and Statistics. PMLR. p. 240–247. [Google Scholar]
  82. Singh AP, Gordon GJ.  2008. Relational learning via collective matrix factorization. In: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. p. 650–658.
  83. Smallman L, Underwood W, Artemiou A.  2020. Simple Poisson PCA: an algorithm for (sparse) feature extraction with simultaneous dimension determination. Comput Stat. 35:559–577. [Google Scholar]
  84. Song Y  et al.  2019. Principal component analysis of binary genomics data. Brief Bioinform. 20:317–329. [DOI] [PubMed] [Google Scholar]
  85. Stein-O’Brien GL  et al.  2018. Enter the matrix: factorization uncovers knowledge from omics. Trends Genet. 34:790–805. [DOI] [PMC free article] [PubMed] [Google Scholar]
  86. Stephenson E  et al.  2021. Single-cell multi-omics analysis of the immune response in COVID-19. Nat Med. 27:904–916. [DOI] [PMC free article] [PubMed] [Google Scholar]
  87. Street K  et al.  2018. Slingshot: cell lineage and pseudotime inference for single-cell transcriptomics. BMC Genomics. 19:477. [DOI] [PMC free article] [PubMed] [Google Scholar]
  88. Sun S, Zhu J, Ma Y, Zhou X.  2019. Accuracy, robustness and scalability of dimensionality reduction methods for single-cell RNA-seq analysis. Genome Biol. 20:269. [DOI] [PMC free article] [PubMed] [Google Scholar]
  89. Svensson V.  2020. Droplet scRNA-seq is not zero-inflated. Nat Biotechnol. 38:147–150. [DOI] [PubMed] [Google Scholar]
  90. Tasic B  et al.  2018. Shared and distinct transcriptomic cell types across neocortical areas. Nature. 563:72–78. [DOI] [PMC free article] [PubMed] [Google Scholar]
  91. Tipping ME, Bishop CM.  1999. Probabilistic principal component analysis. J R Stat Soc Ser B Stat Methodol. 61:611–622. [Google Scholar]
  92. Townes FW, Hicks SC, Aryee MJ, Irizarry RA.  2019. Feature selection and dimension reduction for single-cell RNA-Seq based on a multinomial model. Genome Biol. 20:295. [DOI] [PMC free article] [PubMed] [Google Scholar]
  93. Udell M, Horn C, Zadeh R, Boyd S,  et al.  2016. Generalized low-rank models. Found Trends® Mach Learn. 9:1–118. [Google Scholar]
  94. Vallejos CA, Risso D, Scialdone A, Dudoit S, Marioni JC.  2017. Normalizing single-cell RNA sequencing data: challenges and opportunities. Nat Methods. 14:565–571. [DOI] [PMC free article] [PubMed] [Google Scholar]
  95. Van der Maaten L, Hinton G.  2008. Visualizing data using t-SNE. J Mach Learn Res. 9:2579–2605. [Google Scholar]
  96. Virta J, Artemiou A.  2023. Poisson PCA for matrix count data. Pattern Recognit. 138:109401. [Google Scholar]
  97. Wang L, Carvalho L.  2023. Deviance matrix factorization. Electron J Statist. 17:3762–3810. [Google Scholar]
  98. Wang Y, Bi X, Qu A.  2020. A logistic factorization model for recommender systems with multinomial responses. J Comput Graph Stat. 29:396–404. [Google Scholar]
  99. Wang Y-X, Zhang Y-J.  2013. Nonnegative matrix factorization: a comprehensive review. IEEE Trans Knowl Data Eng. 25:1336–1353. [Google Scholar]
  100. Witten DM, Tibshirani R, Hastie T.  2009. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics. 10:515–534. [DOI] [PMC free article] [PubMed] [Google Scholar]
  101. Wu Y-F  et al.  2023. Single-Cell transcriptomics reveals cellular heterogeneity and complex cell–cell communication networks in the mouse cornea. Invest Ophthalmol Vis Sci. 64:5. [DOI] [PMC free article] [PubMed] [Google Scholar]
  102. Yee TW.  2015. Vector generalized linear and additive models. With an implementation in R, springer series in statistics. Springer, New York. [Google Scholar]
  103. Zappia L, Phipson B, Oshlack A.  2017. Splatter: simulation of single-cell RNA sequencing data. Genome Biol. 18:174. [DOI] [PMC free article] [PubMed] [Google Scholar]
  104. Zeiler MD.  2012. Adadelta: an adaptive learning rate method [preprint]. arXiv, arXiv:1212.5701.
  105. Zou H, Hastie T, Tibshirani R.  2006. Sparse principal component analysis. J Comput Graph Stat. 15:265–286. [Google Scholar]
