SuSiE PCA: A scalable Bayesian variable selection technique for principal component analysis

Dong Yuan; Nicholas Mancuso

doi:10.1016/j.isci.2023.108181

iScience. 2023 Oct 13;26(11):108181.

PMCID: PMC10638022 PMID: 37953948

Summary

Latent factor models, like principal component analysis (PCA), provide a statistical framework to infer low-rank representations in various biological contexts. However, feature selection is challenging when this low-rank structure manifests from a sparse subspace. We introduce SuSiE PCA, a scalable sparse latent factor approach that evaluates uncertainty in contributing variables through posterior inclusion probabilities. We validate our model in extensive simulations and demonstrate that SuSiE PCA outperforms other approaches in signal detection and model robustness. We apply SuSiE PCA to multi-tissue expression quantitative trait loci (eQTLs) data from GTEx v8 and identify tissue-specific factors and their contributing eGenes. We further investigate its performance on large-scale perturbation data and find that SuSiE PCA identifies modules with a higher enrichment of ribosome-related genes than sparse PCA (false discovery rate [FDR] $= 9.2 \times 10^{- 82}$ vs. $1.4 \times 10^{- 33}$), while being $\sim$18x faster. Overall, SuSiE PCA provides an efficient tool to identify relevant features in high-dimensional biological data.

Subject areas: Biocomputational method, Classification of bioinformatical subject, data processing in systems biology, Algorithms

Highlights

• Efficient PCA feature selection via posterior inclusion probabilities
• Learned model prior enhances inferential robustness
• Seamless CPU/GPU/TPU implementation enables efficient inference

Introduction

Principal component analysis (PCA) is a popular dimension reduction technique¹ that has been widely applied for exploratory data analysis in many fields. One notable functionality of PCA is to synthesize crucial information across features into a small number of principal components (PCs). For example, PCA is commonly used to infer population structure from large-scale genetic data.²,³ The top PCs explain differences in genetic variation arising from different geographic origins and ancestry of individuals, due to historical migration, admixture, etc.⁴ Moreover, PCA provides a means to rank the variables contributing to each latent component, as Tipping and Bishop (1999) proposed the probabilistic reformulation of principal component analysis (PPCA).⁵ Specifically, each PC is independent of the other PCs and has its own weights that represent the “importance” of the original features, suggesting that different latent components arise from different combinations of variables, or distinct aspects of information in the data.

However, one disadvantage of conventional PCA is that PCs provide limited interpretability, as each results from a linear combination of all variables in the data.⁶ To improve the interpretability of PCs, while providing an identifiable solution in high-dimensional data, a common approach is to impose sparsity on the PCA loadings. Broadly speaking, there are two types of approaches to achieving sparsity in the loading matrix. The first comprises regularization methods such as sparse PCA,⁶ which rewrites PCA as a regression-based optimization problem and then adds an $L_{1}$ penalty to the objective function to achieve sparse loadings. The second is the Bayesian treatment of PPCA, which imposes a sparsity-inducing prior on the factor loading matrix.⁷,⁸,⁹,¹⁰,¹¹,¹² Despite the variety of methods that induce sparse solutions for PCA, few provide a statistically rigorous way to select the variables relevant to each factor in a post hoc manner. Although several sparse models are capable of shrinking the loadings of uninformative variables to zero, for variables with non-zero weights they provide neither a reasonable threshold nor a formal statistical test to inform feature prioritization for validation or follow-up.

Here, we propose SuSiE PCA, a highly scalable Bayesian framework for sparse PCA that quantifies the uncertainty of contributing features for each latent component. Specifically, SuSiE PCA leverages the recent “sum of single effects” (SuSiE) approach¹³ to model a loading matrix such that each latent factor contains at most L contributing features. Latent factors and sparse loading weights are learned through an efficient variational algorithm. In addition to providing a sparse loading matrix, SuSiE PCA computes posterior inclusion probabilities (PIPs) for each feature, which enables defining level-$ρ$ credible sets for feature selection. We demonstrate through extensive simulations that SuSiE PCA outperforms sparse PCA⁶ and empirical Bayes matrix factorization (EBMF)¹² in identifying relevant features contributing to structured data while being robust to data-generating assumptions. Next, we apply SuSiE PCA to multi-tissue expression quantitative trait loci (eQTLs) data from the genotype-tissue expression (GTEx) v8¹²,¹⁴ study to identify tissue-specific components of regulatory genetic features and contributing eGenes (genes that have an associated eQTL). We also apply SuSiE PCA to high-dimensional perturb-seq data (CRISPR-based screens with single-cell RNA-sequencing readouts)¹⁵ and identify gene sets more enriched in the ribosome and coronavirus disease pathways when compared with sparse PCA (false discovery rate (FDR) $= 9.2 \times 10^{- 82}$, 63 genes involved vs. $1.4 \times 10^{- 33}$, 35 genes involved) while requiring 17.8 times less computing time. Overall, we find that SuSiE PCA provides an efficient approach to compute interpretable latent factors from high-dimensional biological data. We provide an open-source Python implementation that can run seamlessly on central processing unit (CPU), graphics processing unit (GPU), or tensor processing unit (TPU), available at http://www.github.com/mancusolab/susiepca.

Results

PIPs from SuSiE PCA outperform existing approaches for PCA feature selection

To evaluate the performance of SuSiE PCA, we performed extensive simulations (see details in STAR Methods). Briefly, we performed 100 simulations by varying model parameters one at a time and performed inference using SuSiE PCA with the true number of latent variables (K) and effects (L) known. First, we evaluated the ability of inferred PIPs to discriminate between relevant and non-relevant features for latent factors. Specifically, we compared the sensitivity and specificity of inferred PIPs to normalized posterior mean weights from SuSiE PCA (see Figure 1). When selecting variables based on $PIPs > 0.90$ , SuSiE PCA identifies 88.9% of true positive (non-zero) signals, demonstrating largely calibrated posterior inference. We observed nearly all true negative signals exhibited $PIPs < 0.05$ . As a comparison, the normalized posterior weights performed well on excluding the true negative signals but failed to capture true positive signals as rapidly as PIP thresholds. Overall, the simulation demonstrates that PIPs provide an intuitive and more efficient indicator for feature selection than normalized posterior weights in SuSiE PCA. In addition, we also examined the sensitivity and specificity using weights estimated from sparse PCA and EBMF (see Figure S1), which have similar trends to the curves in Figure 1B and can only capture a small proportion of the true positive signals as the cutoff threshold increases.

PIPs exhibit a higher efficiency in selecting the true signals than the posterior weights in SuSiE PCA

The proportion of correctly classified signals using PIPs as cutoff (A) or posterior weights as cutoff (B). The green dots represent sensitivity, i.e., $\Pr (PIPs \geq cutoff | \text{true positive signal})$, and the red dots represent specificity, i.e., $\Pr (PIPs < cutoff | \text{true negative signal})$. For consistency and to ensure comparability between PIPs and weights, the weights are normalized to range from 0 to 1.
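The sensitivity and specificity plotted in Figure 1 can be computed for any per-feature score with a few lines of code. The following is a minimal sketch in Python with NumPy, not the code used to produce the figure; the function name `sensitivity_specificity` and the toy values are illustrative assumptions.

```python
import numpy as np

def sensitivity_specificity(scores, is_signal, cutoff):
    """Return (sensitivity, specificity) when selecting features with score >= cutoff.

    scores    : per-feature scores in [0, 1] (PIPs or normalized weights)
    is_signal : boolean mask, True where the simulated loading is non-zero
    """
    scores = np.asarray(scores, dtype=float)
    is_signal = np.asarray(is_signal, dtype=bool)
    selected = scores >= cutoff
    sensitivity = selected[is_signal].mean()       # Pr(score >= cutoff | true signal)
    specificity = (~selected[~is_signal]).mean()   # Pr(score < cutoff | no signal)
    return float(sensitivity), float(specificity)

# Hypothetical toy example: three true signals followed by three null features.
toy_pips = [0.99, 0.95, 0.40, 0.02, 0.01, 0.00]
toy_truth = [True, True, True, False, False, False]
sens, spec = sensitivity_specificity(toy_pips, toy_truth, cutoff=0.9)
```

Sweeping `cutoff` over a grid between 0 and 1 and plotting both values against it would give curves of the same kind as those in panels (A) and (B).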

SuSiE PCA is robust to model mis-specification

Next, we examined the estimation accuracy of the loading matrix as a function of sample size (N), feature dimension (P), latent dimension (K), and the number of single effects (or sparsity level) (L), via the Procrustes errors¹⁶ (the Frobenius norm after Procrustes transformation,¹⁷ see STAR Methods) (Figures 2A–2D). We found that SuSiE PCA has the smallest Procrustes errors across all simulation settings compared to sparse PCA and EBMF. We also noticed that the Bayesian methods, SuSiE PCA and EBMF, maintain a low error even with a small sample size or high feature dimension. Moreover, we found that SuSiE PCA has the lowest relative root mean squared error (RRMSE) across all simulations compared with the other methods (Figure S2), and that EBMF and SuSiE PCA have a lower Procrustes error for the factor matrix $Z$ than sparse PCA (Figure S3). In summary, SuSiE PCA exhibits the highest estimation accuracy, which is consistent with its superior performance in variable selection.
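Because latent factors are only identified up to rotation, sign, and ordering, an estimated loading matrix has to be aligned with the truth before the two are compared. The sketch below shows one common way to do this, assuming an orthogonal Procrustes alignment without scaling; the exact definition used in the simulations is the one given in the STAR Methods, and the function name `procrustes_error` is hypothetical.

```python
import numpy as np

def procrustes_error(W_true, W_hat):
    """Frobenius norm of W_true - R @ W_hat after the best orthogonal alignment R.

    Both inputs have shape (K, P). R solves the orthogonal Procrustes problem
    min_R ||W_true - R @ W_hat||_F subject to R^T R = I.
    """
    # The closed-form solution is R = U V^T, where U S V^T is the SVD of W_true W_hat^T.
    U, _, Vt = np.linalg.svd(W_true @ W_hat.T)
    R = U @ Vt
    return float(np.linalg.norm(W_true - R @ W_hat))

# Toy check: a permuted and sign-flipped copy of W_true has (numerically) zero error.
rng = np.random.default_rng(0)
W_true = rng.normal(size=(3, 20))
Q = np.array([[0.0, -1.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 0.0, -1.0]])   # orthogonal: swaps two factors and flips signs
aligned_error = procrustes_error(W_true, Q @ W_true)
```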

SuSiE PCA outperforms sparse PCA and EBMF in estimation accuracy and model robustness

SuSiE PCA yields a smaller Procrustes error in the weight matrix than sparse PCA and EBMF (A–D) and is robust to over-specified K and L (E and F). For each scenario in (A–D), we vary one parameter at a time to generate the simulation data while fixing the other three, and then input the true parameters ($N, P, K, L$) into the models. Finally, we compute the Procrustes errors and plot them as a function of $N, P, K, L$. For (E and F), we use the same simulation setting as in Figure 1 to generate data but vary the specified L in SuSiE PCA (E) and K in all three models (F). Reference lines show the error from the models with correctly specified parameters (i.e., $L = 40, K = 4$).

We next investigated model robustness under model mis-specification. Similar to other latent factor models, SuSiE PCA can be mis-specified, as it requires the user to input the latent dimension K and the number of single effects L. To study this setting, we generated simulation datasets with $K = 4, L = 40$ and then fit SuSiE PCA, sparse PCA, and EBMF under two mis-specified situations: varying L while fixing K, or varying K while fixing L. The estimation accuracy of the three models is then compared using the Procrustes error (see Figures 2E and 2F). We observed that as K and L in the model approach the true values (i.e., $K = 4$ or $L = 40$), the Procrustes error of SuSiE PCA decreases rapidly to a low level and remains there even when $K > 4$ or $L > 40$. However, the error for sparse PCA has a V shape and reaches its minimum at the true K. The explanation is that when latent factors are over-specified, SuSiE PCA and EBMF do not extract any information from the data for the redundant factors, owing to their probabilistic model structure; sparse PCA, on the other hand, imposes no probabilistic assumption on the weights and cannot handle them in the same way. Instead, the value of the redundant latent factor in sparse PCA is close to 0, which ensures the latent component does not contribute.

Finally, to compare the generative capacity, we computed and compared the log likelihood of held-out data between sparse PCA and SuSiE PCA. We observed that SuSiE PCA outperforms sparse PCA and obtains higher log likelihoods for simulations (Figure S5). In addition to the overall superior model performance, SuSiE PCA remains faster on both CPU and GPU than sparse PCA and EBMF due to the efficient variational algorithm we implement (see STAR Methods) with the JAX library developed by Google.¹⁸

Dissecting cross-tissue eQTLs in GTEx

To illustrate the utility of SuSiE PCA to make inferences in biological data, we analyzed multi-tissue eQTL Z score results computed from GTEx v8¹²,¹⁴ (see STAR Methods). Specifically, we sought to identify latent factors corresponding to tissue-specific and tissue-shared eQTLs similar to ref. 12. Overall, we found that 27 latent factors explained 53.1% of the variance in the data (see Figure S6). Although we set $L = 18$ across all factors, we found that the number of tissues with $PIP > 0.9$ is frequently lower than 18 in different factors (see Figure S9), which is due to the inferred $τ_{0 k l}$ acting to “shut off” uninformative features. Indeed, we observed 30 out of 486 $τ_{0 k l}$ with estimates greater than $e^{10}$ (see Figure S7), which effectively shrinks the effect size of the corresponding single effects toward 0 and drives the number of non-zero single effects in some factors below the specified L. We found this behavior also reflected in the estimated level-0.9 credible sets, where 456 out of 486 contained a single tissue, and the remaining 30 credible sets contained at least two tissues.

To understand what each factor represents, we examined inferred PIPs (Figure S9) and posterior mean weights of each tissue across 27 factors (Figure S8). Here we present the results from factors $z_{1}$ and $z_{3}$ through the posterior weights (Figure 3; see Figure S8 for the remainder). We observed that the latent factor $z_{1}$, with the second largest percentage of variance explained (PVE), demonstrates high absolute weights on most tissues except for the brain tissues, while the latent factor $z_{3}$ has large weights almost exclusively on brain tissues. Moreover, we observed that brain tissues tend to appear as a group and have similar effects, implying that the eQTLs in brain tissues are different from those in other tissues and that those strong signals are specifically captured by the factor $z_{3}$. For the rest of the factors, we noticed that factors with large PVE such as $z_{2}, z_{4}, z_{5}$ tended to have large weights on multiple tissues; for example, factor $z_{2}$ has large weights on esophagus and thyroid, suggesting the eQTL signals are mostly shared across those tissues, while the factors with small PVE usually have large weights exclusively on one or a few tissues, for example, the liver-specific component $z_{12}$, the lung-specific component $z_{15}$, etc. The only exception is that the factor $z_{0}$, with the largest PVE, has an exclusively large weight only on the testis, implying that $z_{0}$ captures the testis-specific eQTL signals. This is consistent with the investigation of the latent factor values of $z_{0}$: the gene with the largest factor value in $z_{0}$ is D-dopachrome tautomerase (DDT) (Figure S10), which is shown to be associated with testis cancer.¹⁹ To make a comparison with an existing method, we expanded our investigation by applying sparse PCA to the GTEx Z score dataset and observed comparable tissue weights and factor scores across components in both SuSiE PCA and sparse PCA (Figure S11).
However, a notable distinction arises where certain tissues exhibit tiny weights and could be neglected in sparse PCA; in contrast, SuSiE PCA can capture the signals in those tissues through the PIP. For example, in the original analysis, both models identify adrenal gland as the most relevant tissue in factor 10, while the remaining tissues have a much smaller relative weight and could effectively be ignored. Despite this, SuSiE PCA assigns a PIP of 1 to the lowly weighted tissues, suggesting that important signals would be missed if weights alone were used to provide insight. Overall, we find that SuSiE PCA is able to identify tissue-specific components from multi-tissue eQTL data in an intuitive, interpretable manner.

Factors $z_{1}$ and $z_{3}$ capture different types of tissues (non-brain tissues vs. brain tissues)

The posterior weights refer to the inferred $W$ matrix from the SuSiE PCA. The clustering pattern in different factors is found as there are only a limited number of tissues with non-zero weights in each factor since we set L = 18 while the feature dimension is 44.

Identifying regulatory modules from perturb-seq data

To identify gene regulatory modules from genome-wide perturbation data, we ran SuSiE PCA on perturb-seq in cell lines¹⁵ (see STAR Methods) with $K = 10$ and $L = 300$ . Briefly, we inputted the normalized expression data ( $2057 \times 8563$ ) to SuSiE PCA to identify gene regulatory modules (i.e., $Z$ ) and downstream-regulated networks (i.e., $W$ ). To ensure our results were robust to K and L, we explored a grid of possible combinations and found that K = 10 and L = 300 retain the most important information while keeping the relevant gene set much smaller (see Figure S12 for a detailed explanation).

Overall, we found the total PVE was 10.71% across all components (Figure S13), with each component exhibiting 299 downstream genes with PIP $> 0.9$ on average. Focusing on the leading component, we found that perturbations with the top 10 largest absolute factor scores are primarily related to the Ribosomal Protein Small (RPS) subunit and Ribosomal Protein Large (RPL) subunit gene families (Figure 4A). To provide a broader characterization of the module function, we extracted downstream genes with PIP greater than 0.9 (298 genes) as input into ShinyGO²⁰ to perform a gene set enrichment analysis (Figure 4B). We observed that the most enriched pathway was related to ribosome function (FDR = $9.2 \times 10^{- 82}$, 63 genes involved), followed by coronavirus disease (FDR = $2.5 \times 10^{- 62}$, 62 genes involved). Inspecting the loadings at these downstream genes, we found nearly all weights were positive, suggesting that the knockout of RPS and RPL genes downregulates the expression level of those downstream genes. We found multiple translation elongation and initiation factor genes (EEF1G, EEF1A1, EEF1B2, EIF4B, EIF3L) among the leading downstream genes, which are known to be involved in ribosome function. Additionally, recent studies have suggested that the decreased expression of elongation factor genes is associated with less severe conditions among COVID-19 patients.²¹,²² We repeated the pathway analysis for each latent factor using the corresponding loadings at genes with PIP greater than 0.9 (see Figures S14–S22).

Dominant factor scores in top component link to RPL and RPS family with subsequent gene enrichment in Ribosome and Coronavirus disease

The perturbations with top factor scores in the first component mostly belong to the RPL and RPS families (A), and the downstream genes in the same component are enriched for the ribosome and coronavirus disease pathways (B). Each point in (A) represents the latent factor value of each perturbation. The top 9 points as well as the control group are labeled in the plot and colored red and blue, respectively. In the gene set enrichment analysis, we input the downstream genes with $PIP > 0.9$ and show the top enriched pathways with log(FDR) and the number of genes included in the corresponding pathways.

To compare with sparse PCA, we performed the same pathway analysis on factor loadings and assessed enrichments. For the sparse PCA fit with the largest PVE (alpha = 1), we observed that the components identified by sparse PCA were less enriched for biological pathways than those from SuSiE PCA (80 unique enriched pathways in sparse PCA versus 88 pathways in SuSiE PCA), and the top enriched pathways, ribosome and coronavirus disease, were less significant and contained fewer selected genes (FDR $= 1.4 \times 10^{- 33}$, 35 genes; FDR $= 2.9 \times 10^{- 18}$, 29 genes). We noticed that, when alpha equals 17, sparse PCA achieves approximately the same total PVE (10.91%) as our model (10.71%) but with a lower sparsity level (Figure S23). We then extracted the top 300 genes with non-zero weights from sparse PCA with alpha = 17, performed the gene set enrichment analysis, and found that the significance level is similar to that in SuSiE PCA (Figure S24). However, this is a post hoc analysis, which suggests that SuSiE PCA is more suitable for sparse data analysis while maintaining the power to perform feature selection in a more statistically principled manner.

Overall, we find distinct biological functions identified by each component, with groupings consistent with those reported in previous works.²³,²⁴,²⁵

Discussion

In this paper, we propose SuSiE PCA, an efficient Bayesian variable selection approach to PCA for structured biological data. The sparsity of the loading matrix is achieved by restricting the number of features associated with each factor to be at most L. Through simulations and real-data application, we find that SuSiE PCA outperforms existing approaches to sparse latent structure learning in identifying contributing features, while maintaining a more efficient run time.

There are several advantages of SuSiE PCA compared to other sparse factor models. First, SuSiE PCA generates PIPs for each feature, which quantify the uncertainty of the selected features; these cannot be provided by other sparse models, such as sparse PCA with regularization⁶ or the Bayesian treatment of PPCA. Assessing the selected variables based on probability is also more reasonable and convenient than using weights. Second, PIPs are capable of selecting more signals with high confidence. In simulations, we demonstrated that using weights for variable selection from SuSiE PCA, sparse PCA, and EBMF can deliver a high specificity (low FDR) but with low sensitivity as the cutoff value increases, while using PIPs as the selection tool can maintain a high sensitivity for any positive cutoff value between 0 and 1. Third, SuSiE PCA provides a more precise estimate of the loadings and higher prediction accuracy, even in the mis-specified case, as we impose a probabilistic distribution over the loadings that enables a much more accurate inference of the posterior distribution. Finally, the inference procedure of SuSiE PCA works on the dimensions K and L, which are typically set to be much smaller than the feature dimension P; therefore, it is scalable to high-dimensional data and is less computationally demanding. We implement SuSiE PCA with the JAX library developed by Google¹⁸ to enable fast inference on CPU, GPU, or TPU. The comparison of run time among SuSiE PCA, sparse PCA, and EBMF is listed in Table 1.

Table 1.

Comparison of mean and standard deviation of running time (seconds) between models

Model^a	Simulation^b, mean (SD)	GTEx Z score	Perturb-seq
SuSiE PCA	3.14 (0.49)	1.20	68.11
Sparse PCA	51.96 (33.50)	41.22	1213.21
EBMF	39.83 (5.80)	498.60	243.03

^a All run time data in the table are based on analyses performed on the same CPU for consistency. The CPU we used is the Apple M2 chip with 16 GB memory.

^b Run time for simulation is recorded based on the simulation setting in Figure 1, i.e., $N = 1000, P = 6000, K = 4, L = 40$; the average run time and corresponding standard deviation are computed over 100 simulations. A more detailed run time comparison for the simulations is presented in Figure S4.
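The relative speed-ups implied by Table 1 follow from simple ratios of the reported run times. As a worked check, the short Python snippet below recomputes them from the values in the table; the dictionary layout and the helper name `speedup` are illustrative choices only.

```python
# Run times in seconds, copied from Table 1 (the simulation column uses the mean).
run_time = {
    "SuSiE PCA": {"Simulation": 3.14, "GTEx Z score": 1.20, "Perturb-seq": 68.11},
    "Sparse PCA": {"Simulation": 51.96, "GTEx Z score": 41.22, "Perturb-seq": 1213.21},
    "EBMF": {"Simulation": 39.83, "GTEx Z score": 498.60, "Perturb-seq": 243.03},
}

def speedup(model, dataset):
    """Run time of `model` divided by the run time of SuSiE PCA on `dataset`."""
    return run_time[model][dataset] / run_time["SuSiE PCA"][dataset]

for model in ("Sparse PCA", "EBMF"):
    for dataset in run_time[model]:
        print(f"{model:10s} {dataset:13s} {speedup(model, dataset):6.1f}x")
```

For the perturb-seq analysis, the sparse PCA ratio is 1213.21 / 68.11, which rounds to 17.8 and matches the speed-up quoted in the Introduction.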

In SuSiE PCA, two parameters, the number of components K and the number of single effects L, need to be specified by the user before fitting the model. The selection of K follows a similar strategy to conventional PCA and is often informed by researchers’ domain expertise. The merit of SuSiE PCA is that when too many latent components are specified, the variance explained by the excess components is minimal, with a near-zero count of single effects exhibiting $PIP > 0.9$. This effectively allows for an initial choice of a relatively large K, followed by inspection of the PVE and PIPs in each component to decide the most suitable K.

The choice of L determines the sparsity in SuSiE PCA. Although SuSiE PCA only allows one common L to be specified across all factors, the number of non-zero effects captured in each factor can vary and is learned from the data. This is because we treat the inverse variance $τ_{0 k l}$ of the $l^{t h}$ single effect in factor $z_{k}$ as a random variable. As Algorithm 1 shows, the maximum likelihood estimate (MLE) of $τ_{0 k l}$ is computed at step 3, before inference of the other parameters. When the L specified in the model is, for a certain factor k, greater than the true number of signals associated with that factor, the MLE of $τ_{0 k l}$ will be extremely large for the excess single effects, which shrinks the corresponding $w_{k l}$ and PIP to 0 or close to 0 and therefore removes the redundant single effects from the model. For example, in the simulation and GTEx Z score data analysis, we have shown that when the user-specified L is larger than the data-generating L, the automatic relevance determination (ARD)-like prior over loadings shrinks the excess effects toward 0, so that they add little predictive power and contribute little to the overall mean squared error (MSE) relative to the true loading matrix. Although it may seem that L could simply be set to the total number of variables (with unneeded effects “shut off” as necessary), we emphasize that this comes with an added computational cost, albeit a low one due to the scalability of our approach. Therefore, we allow users to specify their own choice of L. Without prior knowledge of the data, one can specify a relatively large L during the initial model fitting and then examine the estimates of $τ_{0 k l}$ to explore how many single effects are reasonable for the dataset.
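The shrinkage effect of a large prior precision can be illustrated numerically. The sketch below assumes the usual conjugate normal form for the posterior mean of a single effect, with variance $1 / (τ E [Z_{k}^{⊺} Z_{k}] + τ_{0 k l})$; this form is an assumption made for illustration and stands in for the update function $F_{w_{k l}}$ defined in the STAR Methods. The function name and the numeric inputs are hypothetical.

```python
import math

def single_effect_posterior_mean(rtz, ztz, tau, tau0):
    """Posterior mean of one effect under a N(0, 1/tau0) prior (assumed conjugate form).

    rtz  : expected residual-by-factor cross product for one feature
    ztz  : expected squared norm of the factor
    tau  : residual precision
    tau0 : prior precision of the single effect
    """
    posterior_variance = 1.0 / (tau * ztz + tau0)
    return tau * posterior_variance * rtz

# Hypothetical sufficient statistics for one feature.
rtz, ztz, tau = 50.0, 100.0, 1.0
informative = single_effect_posterior_mean(rtz, ztz, tau, tau0=1.0)
shut_off = single_effect_posterior_mean(rtz, ztz, tau, tau0=math.exp(10))
```

With a prior precision of $e^{10}$, the same data yield a posterior mean more than two orders of magnitude smaller than with a prior precision of 1, which is the sense in which an excess single effect is removed from the model.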

Algorithm 1. Algorithm for SuSiE PCA.

Require: Data $X_{N \times P}$

Require: Number of Factors K; Number of single effects in each factor L

Require: Initialize variational parameters $(μ_{Z}, Σ_{Z}; μ_{w_{k l}}, σ_{w_{k l}}; α_{k l})$ ; hyperparameters $τ, τ_{0 k l}$ , for $l = 1, \dots, L; k = 1, \dots, K$

Require: update equations on different variables $F_{Z}; F_{w_{k l}}; F_{α_{k l}}; F_{τ_{0}}; F_{τ}$ :

Require: function to compute $ELBO, F_{ELBO}$

Ensure: ELBO increase

1: repeat

2: $W \leftarrow \sum_{l = 1}^{L} μ_{w} \circ α$ $▷$ Define $μ_{w}, α$ as $(L, K, P)$ arrays by arranging $μ_{w_{k l}}, α_{k l}$

3: $τ_{0} \leftarrow F_{τ_{0}} (μ_{w}, σ_{w}, α)$

4: for k in $1, \dots, K$ do

5: $E {[R_{k l}^{⊺} Z_{k}]}^{(1)} = X^{⊺} μ_{z_{k}} - \sum_{k^{'} \neq k} E [w_{k^{'}}] E [Z_{k^{'}}^{⊺} Z_{k}]$ $▷$ compute the first two terms in Eq

6: for l in $1, \dots, L$ do

7: $E [w_{k l^{'}}] = w_{k} - μ_{w_{k l}} \circ α_{k l}$ $▷$ remove the $l^{t h}$ effect from $w_{k}$

8: $E [R_{k l}^{⊺} Z_{k}] = E {[R_{k l}^{⊺} Z_{k}]}^{(1)} - E [w_{k l^{'}}] E [Z_{k}^{⊺} Z_{k}]$ $▷$ complete the calculation of $E [R_{k l}^{⊺} Z_{k}]$

9: $(μ_{w_{k l}}, σ_{w_{k l}}) \leftarrow F_{w_{k l}} (E [R_{k l}^{⊺} Z_{k}], E [Z_{k}^{⊺} Z_{k}], τ_{0 k l}, τ)$

10: $α_{k l} \leftarrow F_{α_{k l}} (E [R_{k l}^{⊺} Z_{k}], μ_{w_{k l}}, σ_{w_{k l}})$

11: $w_{k} = E [w_{k l^{'}}] + μ_{w_{k l}} \circ α_{k l}$ $▷$ Update the $w_{k}$

12: end for

13: end for

14: $(μ_{Z}, Σ_{Z}) \leftarrow F_{Z} (X, τ, E [W])$

15: $τ = F_{τ} (X, τ, E [W], E [Z])$

16: $E L B O \leftarrow F_{ELBO}$

17: until ELBO convergence criterion satisfied
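The loop structure of Algorithm 1 can be sketched in a few dozen lines of NumPy. This is a simplified illustration under several assumptions, not the released implementation: the precisions `tau` and `tau0` are held fixed rather than updated (steps 3 and 15 are skipped), the loop runs for a fixed number of iterations instead of monitoring the ELBO, and the update formulas are the standard conjugate forms for a normal prior with a categorical indicator, which stand in for $F_{w_{k l}}$, $F_{α_{k l}}$, and $F_{Z}$. The function name `fit_sketch` is hypothetical.

```python
import numpy as np

def fit_sketch(X, K, L, tau=1.0, tau0=1.0, n_iter=50, seed=0):
    """Coordinate-ascent sketch of Algorithm 1 with tau and tau0 held fixed."""
    N, P = X.shape
    rng = np.random.default_rng(seed)
    mu_z = rng.normal(size=(N, K))        # E[Z]
    sigma_z = np.eye(K)                   # covariance shared by the rows of Z
    mu_w = np.zeros((L, K, P))            # conditional means of the single effects
    var_w = np.ones((L, K))               # conditional variances of the single effects
    alpha = np.full((L, K, P), 1.0 / P)   # selection probabilities
    log_pi = -np.log(P)                   # uniform prior over features
    for _ in range(n_iter):
        W = (mu_w * alpha).sum(axis=0)            # step 2: E[W], shape (K, P)
        ZtZ = N * sigma_z + mu_z.T @ mu_z         # E[Z^T Z], shape (K, K)
        for k in range(K):
            others = np.arange(K) != k
            # step 5: X^T mu_zk - sum_{k' != k} E[w_k'] E[z_k'^T z_k]
            first = X.T @ mu_z[:, k] - W[others].T @ ZtZ[others, k]
            for l in range(L):
                w_rest = W[k] - mu_w[l, k] * alpha[l, k]       # step 7: drop effect l
                rtz = first - w_rest * ZtZ[k, k]               # step 8: residual excludes effect l
                var_w[l, k] = 1.0 / (tau * ZtZ[k, k] + tau0)   # step 9 (assumed form)
                mu_w[l, k] = tau * var_w[l, k] * rtz
                logits = log_pi + 0.5 * mu_w[l, k] ** 2 / var_w[l, k]  # step 10 (assumed form)
                logits -= logits.max()                         # stabilize the softmax
                alpha[l, k] = np.exp(logits) / np.exp(logits).sum()
                W[k] = w_rest + mu_w[l, k] * alpha[l, k]       # step 11: add effect l back
        # step 14: update Q(Z) using E[W W^T] = E[W] E[W]^T + per-factor variance terms
        second = alpha * (mu_w ** 2 + var_w[:, :, None])
        extra = (second - (alpha * mu_w) ** 2).sum(axis=(0, 2))
        WWt = W @ W.T + np.diag(extra)
        sigma_z = np.linalg.inv(np.eye(K) + tau * WWt)
        mu_z = tau * X @ W.T @ sigma_z
    pip = 1.0 - np.prod(1.0 - alpha, axis=0)      # posterior inclusion probabilities
    return W, mu_z, pip
```

Note that every quantity updated inside the two inner loops has size P or smaller, and the only matrix inversion is K by K, which is consistent with the scalability argument made in the Discussion.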

Overall, SuSiE PCA provides a flexible approach to high-dimensional biological data with a low-rank structure and allows for feature selection in sparse PCA.

Limitations of the study

One limitation of SuSiE PCA is that under the mean-field approximation, all the posteriors, i.e., $Q (W)$ and $Q (Z)$, are factorized to facilitate inference. Under this factorization, estimation of the mean terms (i.e., $E [W]$ and $E [Z]$) is approximately unbiased.²⁶ However, the assumed independence across Q functions produces overconfident covariance estimates for the variables (W, Z, etc.).

STAR★Methods

Key resources table

REAGENT or RESOURCE	SOURCE	IDENTIFIER
Deposited data

The Genotype-Tissue Expression Z score data	Wei Wang and Matthew Stephens, Empirical Bayes Matrix Factorization, 2021¹	https://github.com/ysfoo/sparsefactor
Genome-scale Perturb-seq experiment data	Joseph Replogle and Jonathan Weissman, Mapping information-rich genotype-phenotype landscapes with genome-scale Perturb-seq, 2022²	https://plus.figshare.com/articles/dataset/_Mapping_information-rich_genotype-phenotype_landscapes_with_genome-scale_Perturb-seq_Replogle_et_al_2022_processed_Perturb-seq_datasets/20029387

Software and algorithms

Scikit-learn library: sparse principal component analysis	Python library scikit-learn	https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.SparsePCA.html; RRID:SCR_002577
R Package: Factors and Loadings by Adaptive SHrinkage in R (flashr)	Wei Wang and Matthew Stephens, Empirical Bayes Matrix Factorization¹	https://stephenslab.github.io/flashr/index.html
Variational algorithm in SuSiE PCA	This paper	http://www.github.com/mancusolab/susiepca
Python 3.9	Python Software Foundation	https://www.python.org/
R 4.0.0	R Software	https://www.r-project.org
ShinyGO v0.77	Ge SX, Jung D & Yao R³	http://bioinformatics.sdstate.edu/go/; RRID:SCR_019213

Resource availability

Lead contact

Further information and requests for resources should be directed to and will be fulfilled by the lead contact, Dong Yuan (dongyuan@usc.edu).

Material availability

This study did not generate new unique materials or reagents.

Experimental model and subject details

This study did not include experiments with a specific model or subject.

Method details

Overview of SuSiE PCA

In this section, we give a detailed description of SuSiE PCA. Let $X_{N \times P}$ be the observed data matrix, $Z_{N \times K}$ be the matrix of K-dimensional latent vectors, and $W_{K \times P}$ be the loading matrix. We denote the normal distribution with mean μ and variance $σ^{2}$ as $N (μ, σ^{2})$, the multinomial distribution with n trials and probabilities $π$ as $Multi (n, π)$, and the matrix normal distribution with dimension $N \times K$, mean $M$, row-covariance $R$, and column-covariance $C$ as $M N_{N, K} (M, R, C)$. We denote the basis vector whose $k^{t h}$ coordinate is 1 and 0 elsewhere as $e_{k}$. The sampling distribution of $X$ under the SuSiE PCA model is given by

$$X \mid Z, W, σ^{2} \sim M N_{N, P} (Z W, I_{N}, σ^{2} I_{P})$$

(Equation 1)

$$Z \sim M N_{N, K} (0, I_{N}, I_{K})$$

(Equation 2)

$$W = \sum_{k = 1}^{K} e_{k} \mathbf{w}_{k}^{⊺}$$

(Equation 3)

$$\mathbf{w}_{k} = \sum_{l = 1}^{L} \mathbf{w}_{k l}$$

(Equation 4)

$$\mathbf{w}_{k l} = w_{k l} γ_{k l}$$

(Equation 5)

$$w_{k l} \mid σ_{0 k l}^{2} \sim N (0, σ_{0 k l}^{2})$$

(Equation 6)

$$γ_{k l} \mid π \sim Multi (1, π),$$

(Equation 7)

where $\mathbf{w}_{k}$ corresponds to the $k^{t h}$ row of $W$ and contains at most L non-zero elements, determined by the sum of L single-effect vectors $\mathbf{w}_{k l}$. Each single-effect vector is described by a single scalar random effect $w_{k l}$ and an indicator vector $γ_{k l}$, which assigns the effect to a feature with prior probabilities $π = \frac{1}{P} \mathbf{1}$.
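To make the generative model concrete, the following minimal sketch draws one dataset from Equations 1–7 using NumPy. It assumes a single residual variance and a single prior effect variance shared by all single effects; the function name `simulate_susie_pca` and its default values are hypothetical and are not taken from the simulation code described in this paper.

```python
import numpy as np

def simulate_susie_pca(N, P, K, L, sigma2=1.0, sigma0_2=1.0, seed=0):
    """Draw (X, Z, W) from the SuSiE PCA sampling model.

    sigma2   : residual variance (Equation 1)
    sigma0_2 : prior variance of every single effect (Equation 6), assumed shared
    """
    rng = np.random.default_rng(seed)
    Z = rng.normal(size=(N, K))                      # Equation 2: standard normal factors
    W = np.zeros((K, P))
    for k in range(K):
        for _ in range(L):
            i = rng.integers(P)                      # Equation 7: gamma_kl ~ Multi(1, 1/P)
            effect = rng.normal(scale=np.sqrt(sigma0_2))   # Equation 6
            W[k, i] += effect                        # Equations 3-5: sum of single effects
    noise = rng.normal(scale=np.sqrt(sigma2), size=(N, P))
    X = Z @ W + noise                                # Equation 1
    return X, Z, W

X_sim, Z_sim, W_sim = simulate_susie_pca(N=50, P=30, K=3, L=4, seed=1)
```

Because two single effects in the same factor may select the same feature, each row of the simulated loading matrix has at most L non-zero entries rather than exactly L.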

Posterior inclusion probability

One of the distinguishing features that the SuSiE model¹³ provides is a posterior inclusion probability (PIP). The PIP reflects the posterior probability that a given variable has a non-zero effect given the observed data. Here we extend the PIP definition to include latent factors. Specifically, given variational parameters $α_{k l}$ we can define the PIP that the $i^{t h}$ variable has a non-zero effect in the $k^{t h}$ latent component as,

$$\mathrm{PIP}_{k i} := \Pr (w_{k i} \neq 0 \mid X) = 1 - \prod_{l = 1}^{L} (1 - α_{k l i})$$

(Equation 8)

Similarly, a level- $ρ$ credible set (CS) refers to a subset of variables that cumulatively explain at least ρ of the posterior density. Here, we define factor-specific level- $ρ$ CSs, which can be computed across each $α_{k l}$ independently, resulting in $K \times L$ total level-ρ credible sets. This lets us reflect on the uncertainty in identified variables to explain a single-effect for each latent factor.
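Both quantities are simple functions of the variational parameters $α_{k l}$. The sketch below computes PIPs as in Equation 8 and builds one level-$ρ$ credible set by adding variables in decreasing order of $α_{k l i}$ until the cumulative probability reaches ρ. The greedy construction is one common way to form such a set and is an assumption here, as are the function names and the toy values.

```python
import numpy as np

def pip_from_alpha(alpha):
    """Equation 8: alpha has shape (L, K, P); the result has shape (K, P)."""
    return 1.0 - np.prod(1.0 - alpha, axis=0)

def credible_set(alpha_kl, rho=0.9):
    """Indices of the smallest top-ranked set whose alpha values sum to at least rho."""
    order = np.argsort(alpha_kl)[::-1]            # features by decreasing probability
    cumulative = np.cumsum(alpha_kl[order])
    size = int(np.searchsorted(cumulative, rho)) + 1
    return order[:size]

# Toy example with L = 2 single effects, K = 1 factor, and P = 3 features.
alpha = np.array([[[0.05, 0.70, 0.25]],
                  [[0.90, 0.05, 0.05]]])
pips = pip_from_alpha(alpha)
cs_first = credible_set(alpha[0, 0], rho=0.9)     # needs two features to reach 0.9
cs_second = credible_set(alpha[1, 0], rho=0.9)    # a single feature suffices
```

In this toy example the first single effect is spread over two features, so its credible set contains both, while the second is concentrated on one feature and yields a singleton set.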

Variational inference in SuSiE PCA

We seek to perform inference of the model variables $Z, w_{k l}$ and $γ_{k l}$ conditional on the observed data $X$ . However, the marginal likelihood is intractable to compute, and therefore we cannot evaluate the posterior exactly. While sampling-based approaches such as Markov Chain Monte Carlo (MCMC) methods provide a numerical approximation of the exact posterior distribution,²⁷ they often lack computational efficiency in high-dimensional settings. As an alternative, we leverage recent advancements in variational inference, which provides an analytical approximation to the posterior distribution²⁸ and remains computationally efficient.

Briefly, to approximate the conditional distribution of latent variables $Z$ given the observed samples $X$ , variational methods first impose a family of densities over the latent variables, $Q (Z)$ , usually predefined as known distributions parameterized with a set of variational parameters. Then the goal is to infer those variational parameters such that the variational distribution $Q (Z)$ is as similar as possible to the true posterior distribution $P (Z | X)$ . A quantity commonly used to measure dissimilarity between distributions is the Kullback-Leibler (KL) divergence $D_{K L} (Q ∥ P)$ .²⁹ However, since the KL divergence contains the unknown true posterior distribution $P (Z | X)$ , it cannot be directly computed. Instead, we can show that the log-likelihood of the data, $\log P (X)$ , can be decomposed as:

\log P (X) = D_{K L} (Q ∥ P) + L (Q)

(Equation 9)

where $L (Q) = E_{Q} [\log P (Z, X) - \log Q (Z)]$ is known as the Evidence Lower Bound (ELBO). Since $\log P (X)$ is a constant with respect to the variational parameters, minimizing the KL divergence is equivalent to maximizing the ELBO. Because the ELBO does not contain the unknown posterior distribution, it is tractable to compute and to maximize with respect to the variational parameters.
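
The decomposition in Equation 9 can be checked numerically on a toy model. The sketch below uses a latent variable with three discrete states and made-up joint probabilities (all numbers are hypothetical), and confirms that the KL divergence and the ELBO sum to the log evidence for an arbitrary choice of $Q$.

```python
import numpy as np

# Hypothetical joint probabilities P(Z = z, X = x_obs) for a three-state latent Z.
joint = np.array([0.10, 0.25, 0.05])
evidence = joint.sum()                  # P(X = x_obs)
posterior = joint / evidence            # P(Z | X = x_obs)

# An arbitrary variational distribution Q(Z).
q = np.array([0.2, 0.5, 0.3])

kl = np.sum(q * (np.log(q) - np.log(posterior)))   # D_KL(Q || P)
elbo = np.sum(q * (np.log(joint) - np.log(q)))     # L(Q) = E_Q[log P(Z, X) - log Q(Z)]

# When Q equals the true posterior the KL term vanishes and the bound is tight.
elbo_at_posterior = np.sum(posterior * (np.log(joint) - np.log(posterior)))
```

Since the KL divergence is non-negative, `elbo` can never exceed `np.log(evidence)`, which is why maximizing the ELBO tightens the approximation.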

Mean-field approximation

The mean-field approximation³⁰ is a common approach to finding a variational distribution that maximizes the ELBO. Its basic assumption is that the variational distribution factorizes into independent components. Using the calculus of variations, one can then show that the distribution $Q_{j}^{*} (z_{j})$ minimizing the KL divergence for each factor $Z_{j}$ can be expressed as:

\ln Q_{j}^{*} (z_{j} | X) = E_{i \neq j} [\ln P (Z, X)] + \text{constant}

(Equation 10)

Applying the mean-field approximation to SuSiE PCA, the approximate posterior is given by,

Q (Z, W) = Q (Z) Q (W)

(Equation 11)

Q (W) = \prod_{k = 1}^{K} \prod_{l = 1}^{L} Q (w_{k l} | γ_{k l}) Q (γ_{k l})

(Equation 12)

Equation 11 factorizes the variational densities of the latent variables $Z$ and the loading matrix $W$ into independent parts. We further assume that the variational distribution of loadings $w_{k l}$ from each factor across L single effects are independent as well, leading to Equation 12. For ease of notation we first define $τ = \frac{1}{σ^{2}}, τ_{0 k l} = \frac{1}{σ_{0 k l}^{2}}$ . Based on the factorization, the complete-data log-likelihood of data and parameters of SuSiE PCA is given by:

\begin{array}{l} l_{c} (σ^{2}, σ_{0}^{2}, π | X, Z, W) = \log \Pr (X | Z, W, σ^{2}) + \log \Pr (Z) + \log \Pr (W | σ_{0}^{2}, π) = \log M N_{n, p} (X | Z W, I_{n}, I_{p} σ^{2}) + \log M N_{n, k} (Z | 0, I_{n}, I_{k}) + \\ \sum_{l = 1}^{L} \sum_{k = 1}^{K} [logMulti (γ_{k l} | 1, π) + \log N (w_{k l} | 0, σ_{0}^{2})] \end{array}

Helpful definitions

Before proceeding to the full derivation of the variational distributions of $Z$ , $w_{k l}$ , and $γ_{k l}$ , we first give some helpful definitions, including expansions of the first and second moments of $W$ and $Z$ .

The second moment of $Z$ is:

\begin{array}{l} E [Z^{⊺} Z] = tr (I_{n}) Σ_{Z} + E {[Z]}^{⊺} E [Z] = n Σ_{Z} + E {[Z]}^{⊺} E [Z] \\ E [Z_{k}^{⊺} Z_{k}] = tr (V [Z_{k}]) + E {[Z_{k}]}^{⊺} E [Z_{k}] = tr (I_{n} {(Σ_{Z})}_{k k}) + E {[Z_{k}]}^{⊺} E [Z_{k}] = n {(Σ_{Z})}_{k k} + E {[Z_{k}]}^{⊺} E [Z_{k}] \end{array}

The first and second moments of $w_{k}$ are listed as follows:

\begin{array}{l} E [w_{k l} | γ_{k l}] = \text{$p$-vector of posterior conditional means} \\ V [w_{k l} | γ_{k l}] = \text{$p$-vector of posterior conditional variances} \\ E [w_{k}] = E [\sum_{l} w_{k l}] = \sum_{l} E [w_{k l}] \\ E [w_{k l}] = E [w_{k l} | γ_{k l}] \circ E [γ_{k l}] \\ V [w_{k}] = V [\sum_{l} w_{k l}] = \sum_{l} V [w_{k l}] \\ V [w_{k l}] = E [w_{k l} w_{k l}^{⊺}] - E [w_{k l}] E {[w_{k l}]}^{⊺} = E [w_{k l}^{2} γ_{k l} γ_{k l}^{⊺}] - E [w_{k l}] E {[w_{k l}]}^{⊺} = diag (E [w_{k l} \circ w_{k l} | γ_{k l}] \circ E [γ_{k l}]) - E [w_{k l}] E {[w_{k l}]}^{⊺} \\ diag (V [w_{k l}]) = E [w_{k l} \circ w_{k l} | γ_{k l}] \circ E [γ_{k l}] - {(E [w_{k l} | γ_{k l}] \circ E [γ_{k l}])}^{2} \\ E [w_{k}^{⊺} w_{k}] = tr (V [w_{k}]) + E {[w_{k}]}^{⊺} E [w_{k}] \\ E [w_{k l}^{2}] = [E^{2} [w_{k l} | γ_{k l}] + V [w_{k l} | γ_{k l}]] \circ E [γ_{k l}] \end{array}

The first and second moments of $W$ are listed as follows:

\begin{array}{c} E [W] = E [\sum_{k} e_{k} w_{k}^{⊺}] = \sum_{k} e_{k} E {[w_{k}]}^{⊺} \\ E [W W^{⊺}] = E [(\sum_{k} e_{k} w_{k}^{⊺}) {(\sum_{k^{'}} e_{k^{'}} w_{k^{'}}^{⊺})}^{⊺}] = E [\sum_{k} \sum_{k^{'}} e_{k} w_{k}^{⊺} w_{k^{'}} e_{k^{'}}^{⊺}] = \sum_{k} \sum_{k^{'}} e_{k} e_{k^{'}}^{⊺} E [w_{k}^{⊺} w_{k^{'}}] = \sum_{k} \sum_{k^{'}} e_{k} e_{k^{'}}^{⊺} E {[w_{k}]}^{⊺} E [w_{k^{'}}] + \sum_{k} e_{k} e_{k}^{⊺} (E [w_{k}^{⊺} w_{k}] - E {[w_{k}]}^{⊺} E [w_{k}]) = E [W] E {[W]}^{⊺} + \sum_{k} e_{k} e_{k}^{⊺} (E [w_{k}^{⊺} w_{k}] - E {[w_{k}]}^{⊺} E [w_{k}]) = E [W] E {[W]}^{⊺} + \sum_{k} e_{k} e_{k}^{⊺} tr (V [w_{k}]) = E [W] E {[W]}^{⊺} + diag (tr (V [w_{1}]), \dots, tr (V [w_{k}])) \end{array}

Other terms in likelihood function:

\begin{array}{l} \log M N_{n, p} (X | Z W, I_{n}, I_{p} σ^{2}) = - \frac{1}{2 σ^{2}} t r [{(X - Z W)}^{⊺} (X - Z W)] - \frac{n p}{2} \log (2 π σ^{2}) \\ \log M N_{n, k} (Z | 0, I_{n}, I_{k}) = - \frac{1}{2} t r [Z^{⊺} Z] - \frac{n k}{2} \log (2 π) \\ logMulti (γ_{k l} | 1, π) = \sum_{i = 1}^{p} γ_{k l i} \log (π_{i}) \\ \log N (w_{k l} | 0, σ_{0}^{2}) = - \frac{1}{2 σ_{0}^{2}} w_{k l}^{2} - \frac{1}{2} \log (2 π σ_{0}^{2}) \\ tr [E_{\neg Z} [{(X - Z W)}^{⊺} (X - Z W)]] = tr [E_{\neg Z} (X^{⊺} X - X^{⊺} Z W - W^{⊺} Z^{⊺} X + W^{⊺} Z^{⊺} Z W)] = tr (X^{⊺} X) - 2 t r (E [W] X^{⊺} Z) + tr (Z^{⊺} Z E [W W^{⊺}]) = tr (X^{⊺} X) - 2 t r (E [W] X^{⊺} Z) + \sum_{i = 1}^{p} tr (Z σ_{W_{i}} Z^{⊺}) + E [W^{⊺}] Z^{⊺} Z E [W] \\ tr [E_{\neg W} [{(X - Z W)}^{⊺} (X - Z W)]] = tr [E_{\neg W} (X^{⊺} X - X^{⊺} Z W - W^{⊺} Z^{⊺} X + W^{⊺} Z^{⊺} Z W)] = tr (X^{⊺} X) - 2 t r (X^{⊺} E [Z] W) + tr (E [Z^{T} Z] W W^{⊺}) \end{array}

\begin{array}{l} E [tr ({(X - Z W)}^{⊺} (X - Z W))] = E [tr (X^{⊺} X - X^{⊺} Z W - W^{⊺} Z^{⊺} X + W^{⊺} Z^{⊺} Z W)] = tr (X^{⊺} X - X^{⊺} E [Z] E [W] - E [W^{⊺}] E [Z^{⊺}] X + E [W^{⊺} Z^{⊺} Z W]) = tr (X^{⊺} X) - 2 t r (X^{⊺} E [Z] E [W]) + tr (E [Z^{⊺} Z] E [W W^{⊺}]) \\ {\bar{R}}_{k l} : = X - E [Z] (\sum_{k^{'} \neq k} e_{k^{'}} E {[w_{k^{'}}]}^{⊺} + \sum_{l^{'} \neq l} e_{k} E {[w_{k l^{'}}]}^{⊺}) = X - \sum_{k^{'} \neq k} E [Z_{k^{'}}] E {[w_{k^{'}}]}^{⊺} - \sum_{l^{'} \neq l} E [Z_{k}] E {[w_{k l^{'}}]}^{⊺} \end{array}

Derivation of model parameters

In this section, we present the detailed derivation of the optimal variational distributions of variables $Z, w_{k l}$ , and $γ_{k l}$ . First, we derived the $\log Q (Z) :$

\begin{array}{l} \log Q (Z) = E_{\neg Z} [l_{c} (σ^{2}, σ_{0}^{2}, π | X, Z, W)] = E_{\neg Z} [\log M N_{n, p} (X | Z W, I_{n}, I_{p} σ^{2})] + \log M N_{n, k} (Z | 0, I_{n}, I_{k}) = - \frac{τ}{2} [- tr (X^{⊺} Z E [W]) - tr (E [W^{⊺}] Z^{⊺} X) + tr (Z^{⊺} Z E (W W^{⊺}))] - \frac{1}{2} t r (Z^{⊺} Z) + O (1) = - \frac{1}{2} [tr (τ Z^{⊺} Z E (W W^{⊺})) + tr (Z^{⊺} Z) - tr (τ X^{⊺} Z E [W]) - tr (τ E [W^{⊺}] Z^{⊺} X)] + O (1) = - \frac{1}{2} [tr (Z^{⊺} Z (E (W W^{⊺}) τ + I_{k})) - tr (τ X^{⊺} Z E [W]) - tr (τ E [W^{⊺}] Z^{⊺} X)] + O (1) = - \frac{1}{2} [tr (Z \underset{Σ_{Z}^{- 1}}{\underset{︸}{(E (W W^{⊺}) τ + I_{k})}} Z^{⊺}) - tr (Z E [W] X^{⊺} τ) - tr (τ X E [W^{⊺}] Z^{⊺})] + O (1) = - \frac{1}{2} [tr (Z Σ_{Z}^{- 1} Z^{⊺}) - tr (Z Σ_{Z}^{- 1} \underset{μ_{Z}^{⊺}}{\underset{︸}{Σ_{Z} E [W] X^{⊺} τ}}) - tr (\underset{μ_{Z}}{\underset{︸}{τ X E [W^{⊺}] Σ_{Z}}} Σ_{Z}^{- 1} Z^{⊺})] + O (1) = - \frac{1}{2} [tr (Σ_{Z}^{- 1} Z^{⊺} Z) - tr (Σ_{Z}^{- 1} μ_{Z}^{⊺} Z) - tr (Σ_{Z}^{- 1} Z^{T} μ_{Z}) + tr (Σ_{Z}^{- 1} μ_{Z}^{T} μ_{Z}) - tr (Σ_{Z}^{- 1} μ_{Z}^{T} μ_{Z})] + O (1) = - \frac{1}{2} [tr (Σ_{Z}^{- 1} (Z^{⊺} Z - Z^{T} μ_{Z} - μ_{Z}^{T} Z + μ_{Z}^{T} μ_{Z}))] + O (1) = - \frac{1}{2} tr (Σ_{Z}^{- 1} {(Z - μ_{Z})}^{⊺} (Z - μ_{Z})) + O (1) \Rightarrow \\ Q (Z) = M N_{n, k} (Z | μ_{Z}, I_{n}, Σ_{Z}) \end{array}

Second, we derive the $\log Q (w_{k l} | γ_{k l i} = 1)$ :

\begin{array}{l} \log Q (w_{k l} | γ_{k l i} = 1) = - \frac{τ}{2} E_{\neg w_{k l}} [tr ({(R_{k l} - Z_{k} w_{k l}^{⊺})}^{⊺} (R_{k l} - Z_{k} w_{k l}^{⊺}))] - \frac{τ_{0}}{2} E_{\neg w_{k l}} [\sum_{l = 1}^{L} \sum_{k = 1}^{K} w_{k l}^{2}] + O (1) \\ = - \frac{τ}{2} E_{\neg w_{k l}} [- 2 tr (R_{k l}^{⊺} Z_{k} w_{k l}^{⊺}) + tr (Z_{k}^{⊺} Z_{k} w_{k l}^{⊺} w_{k l})] - \frac{τ_{0}}{2} w_{k l}^{2} + O (1) \\ = - \frac{τ}{2} E_{\neg w_{k l}} [- 2 tr ({(X - \sum_{k^{'} \neq k} Z_{k^{'}} w_{k^{'}}^{⊺} - \sum_{l^{'} \neq l} Z_{k} w_{k l^{'}}^{⊺})}^{⊺} Z_{k} w_{k l}^{⊺}) + tr (Z_{k}^{⊺} Z_{k} w_{k l}^{2})] - \frac{τ_{0}}{2} w_{k l}^{2} + O (1) \\ = - \frac{τ}{2} E_{\neg w_{k l}} [- 2 tr ((X^{⊺} Z_{k} - \sum_{k^{'} \neq k} w_{k^{'}} Z_{k^{'}}^{⊺} Z_{k} - \sum_{l^{'} \neq l} w_{k l^{'}} Z_{k}^{⊺} Z_{k}) w_{k l}^{⊺}) + tr (Z_{k}^{⊺} Z_{k} w_{k l}^{2})] - \frac{τ_{0}}{2} w_{k l}^{2} + O (1) \\ = - \frac{τ}{2} E_{\neg w_{k l}} [- 2 (X_{i}^{⊺} Z_{k} - \sum_{k^{'} \neq k} w_{k^{'}, i} Z_{k^{'}}^{⊺} Z_{k} - Z_{k}^{⊺} Z_{k} \sum_{l^{'} \neq l} w_{k l^{'}, i}) w_{k l} + Z_{k}^{⊺} Z_{k} w_{k l}^{2}] - \frac{τ_{0}}{2} w_{k l}^{2} + O (1) \\ = - \frac{1}{2} [- 2 τ (X_{i}^{⊺} E [Z_{k}] - \sum_{k^{'} \neq k} E [w_{k^{'}, i}] E [Z_{k^{'}}^{⊺} Z_{k}] - E [Z_{k}^{⊺} Z_{k}] \sum_{l^{'} \neq l} E [w_{k l^{'}, i}]) w_{k l} + τ E [Z_{k}^{⊺} Z_{k}] w_{k l}^{2}] - \frac{τ_{0}}{2} w_{k l}^{2} + O (1) \\ = - \frac{1}{2} [- 2 τ (X_{i}^{⊺} E [Z_{k}] - \sum_{k^{'} \neq k} E [w_{k^{'}, i}] E [Z_{k^{'}}^{⊺} Z_{k}] - E [Z_{k}^{⊺} Z_{k}] \sum_{l^{'} \neq l} E [w_{k l^{'}, i}]) w_{k l} + τ E [Z_{k}^{⊺} Z_{k}] w_{k l}^{2} + τ_{0} w_{k l}^{2}] + O (1) \\ = - \frac{1}{2} [- 2 τ (X_{i}^{⊺} E [Z_{k}] - \sum_{k^{'} \neq k} E [w_{k^{'}, i}] E [Z_{k^{'}}^{⊺} Z_{k}] - E [Z_{k}^{⊺} Z_{k}] \sum_{l^{'} \neq l} E [w_{k l^{'}, i}]) w_{k l} + \underbrace{(τ E [Z_{k}^{⊺} Z_{k}] + τ_{0})}_{1 / σ_{w_{k l}}^{2}} w_{k l}^{2}] + O (1) \\ = - \frac{1}{2 σ_{w_{k l}}^{2}} [w_{k l}^{2} - 2 \underbrace{τ σ_{w_{k l}}^{2} (X_{i}^{⊺} E [Z_{k}] - \sum_{k^{'} \neq k} E [w_{k^{'}, i}] E [Z_{k^{'}}^{⊺} Z_{k}] - E [Z_{k}^{⊺} Z_{k}] \sum_{l^{'} \neq l} E [w_{k l^{'}, i}])}_{μ_{w_{k l}}} w_{k l}] + O (1) \Rightarrow \\ Q (w_{k l} | γ_{k l i} = 1) = N (μ_{w_{k l}}, σ_{w_{k l}}^{2}) \end{array}

Notice that we can update $w_{k l}$ for all features at once:

μ_{w_{k l}} = τ σ_{w_{k l}}^{2} E [R_{k l}^{⊺} Z_{k}], \quad Σ_{w_{k l}} = σ_{w_{k l}}^{2} I_{p}

Finally we derive the $\log Q (γ_{k l})$ : Note that $τ R_{k l}^{⊺} E [Z_{k}] = E [w_{k l} | γ_{k l}] / σ_{w_{k l}}^{2} = μ_{w_{k l}} / σ_{w_{k l}}^{2} .$

\begin{array}{l} \log Q (γ_{k l i} = 1) = E_{\neg γ_{k l}} [l_{c} (σ^{2}, σ_{0}^{2}, π | X, Z, W)] + \log Multi (γ_{k l} | π) + O (1) \\ = - \frac{τ}{2} E_{\neg γ_{k l}} tr ({(R_{k l} - Z_{k} w_{k l}^{⊺})}^{⊺} (R_{k l} - Z_{k} w_{k l}^{⊺})) + \log Multi (γ_{k l} | π) + O (1) \\ = - \frac{τ}{2} [- 2 tr (R_{k l}^{⊺} E [Z_{k}] E [w_{k l} | γ_{k l i} = 1] γ_{k l}^{⊺}) + tr (E [Z_{k}^{⊺} Z_{k}] E [w_{k l}^{2} | γ_{k l i} = 1])] + \log π_{i} + O (1) \\ = - \frac{τ}{2} [- 2 R_{k l i}^{⊺} E [Z_{k}] E [w_{k l} | γ_{k l i} = 1] + E [Z_{k}^{⊺} Z_{k}] E [w_{k l}^{2} | γ_{k l i} = 1]] + \log π_{i} + O (1) \\ = τ R_{k l i}^{⊺} E [Z_{k}] E [w_{k l} | γ_{k l i} = 1] - \frac{τ}{2} E [Z_{k}^{⊺} Z_{k}] E [w_{k l}^{2} | γ_{k l i} = 1] + \log π_{i} + O (1) \\ = \frac{1}{σ_{w_{k l}}^{2}} E {[w_{k l} | γ_{k l i} = 1]}^{2} - \frac{τ}{2} E [Z_{k}^{⊺} Z_{k}] E [w_{k l}^{2} | γ_{k l i} = 1] + \log π_{i} - \frac{τ_{0}}{2} E [w_{k l}^{2} | γ_{k l i} = 1] + O (1) \\ = \frac{1}{σ_{w_{k l}}^{2}} E {[w_{k l} | γ_{k l i} = 1]}^{2} - \frac{1}{2} E [w_{k l}^{2} | γ_{k l i} = 1] (τ E [Z_{k}^{⊺} Z_{k}] + τ_{0}) + \log π_{i} + O (1) \\ = \frac{1}{σ_{w_{k l}}^{2}} E {[w_{k l} | γ_{k l i} = 1]}^{2} - \frac{1}{2 σ_{w_{k l}}^{2}} E [w_{k l}^{2} | γ_{k l i} = 1] + \log π_{i} + O (1) \\ = - \frac{1}{2 σ_{w_{k l}}^{2}} [- 2 E {[w_{k l} | γ_{k l i} = 1]}^{2} + E [w_{k l}^{2} | γ_{k l i} = 1]] + \log π_{i} + O (1) \\ = - \frac{1}{2 σ_{w_{k l}}^{2}} [- 2 E {[w_{k l} | γ_{k l i} = 1]}^{2} + σ_{w_{k l}}^{2} + E {[w_{k l} | γ_{k l i} = 1]}^{2}] + \log π_{i} + O (1) \\ = \frac{1}{2 σ_{w_{k l}}^{2}} E {[w_{k l} | γ_{k l i} = 1]}^{2} + \log π_{i} + O (1) \Rightarrow \\ \log {\tilde{α}}_{k l i} = \log π_{i} - \log N (0 | μ_{w_{k l}}, σ_{w_{k l}}^{2}) \\ Q (γ_{k l}) = Multi (1, α_{k l} = softmax (\log {\tilde{α}}_{k l})) \end{array}

In summary, the optimal variational distributions of the model parameters are:

Q (Z) : = M N_{n, k} (Z | μ_{Z}, I_{n}, Σ_{Z})

(Equation 13)

Q (w_{k l} | γ_{k l}) : = N (μ_{w_{k l}}, σ_{w_{k l}}^{2})

(Equation 14)

Q (γ_{k l}) : = Multi (1, α_{k l}) .

(Equation 15)

The corresponding update rules for variational parameters from $Q (\cdot)$ can be expressed as,

μ_{Z} = τ X E [W^{⊺}] Σ_{Z}

(Equation 16)

Σ_{Z} = {(E [W W^{⊺}] τ + I_{k})}^{- 1}

(Equation 17)

μ_{w_{k l}} = τ σ_{w_{k l}}^{2} E [R_{k l}^{⊺} Z_{k}]

(Equation 18)

Σ_{w_{k l}} = σ_{w_{k l}}^{2} I_{p}

(Equation 19)

σ_{w_{k l}}^{2} = {(τ E [Z_{k}^{⊺} Z_{k}] + τ_{0 k l})}^{- 1}

(Equation 20)

α_{k l} = softmax (\log π - \log N (0 | μ_{w_{k l}}, σ_{w_{k l}}^{2})) .

(Equation 21)
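
The update rules in Equations 16 to 21 can be arranged into a coordinate-ascent sweep. The NumPy sketch below is one possible arrangement under stated assumptions, not the authors' JAX implementation: the function names, the array layout `(K, L, P)`, and the random initialization are illustrative choices. The mean update follows the element-wise form in the derivation of $\log Q (w_{k l} | γ_{k l i} = 1)$, using $E [Z^{⊺} Z] = N Σ_{Z} + μ_{Z}^{⊺} μ_{Z}$, and the $α_{k l}$ update drops the term $\frac{1}{2} \log (2 π σ_{w_{k l}}^{2})$ because it is constant across features and cancels in the softmax.

```python
import numpy as np

def expected_w(mu_w, alpha):
    """E[w_kl] = E[w_kl | gamma_kl] * E[gamma_kl], element-wise; shape (K, L, P)."""
    return mu_w * alpha

def second_moment_W(mu_w, var_w, alpha):
    """E[W W^T] = E[W] E[W]^T + diag(tr V[w_1], ..., tr V[w_K])."""
    e_w = expected_w(mu_w, alpha)
    e_W = e_w.sum(axis=1)                                # E[W], shape (K, P)
    e_w2 = (mu_w ** 2 + var_w[:, :, None]) * alpha       # E[w_kl^2], element-wise
    tr_var = (e_w2 - e_w ** 2).sum(axis=(1, 2))          # tr V[w_k] for each factor
    return e_W @ e_W.T + np.diag(tr_var), e_W

def update_z(X, mu_w, var_w, alpha, tau):
    """Equations 16-17."""
    e_WW, e_W = second_moment_W(mu_w, var_w, alpha)
    sigma_z = np.linalg.inv(tau * e_WW + np.eye(e_W.shape[0]))
    mu_z = tau * X @ e_W.T @ sigma_z
    return mu_z, sigma_z

def update_w(X, mu_z, sigma_z, mu_w, var_w, alpha, tau, tau0, log_pi):
    """Equations 18-21, one (k, l) pair at a time; arrays are modified in place."""
    n = X.shape[0]
    e_ZZ = n * sigma_z + mu_z.T @ mu_z                   # E[Z^T Z], shape (K, K)
    K, L, _ = mu_w.shape
    for k in range(K):
        for l in range(L):
            e_w = expected_w(mu_w, alpha)                # refresh with the latest updates
            e_W = e_w.sum(axis=1)
            var_w[k, l] = 1.0 / (tau * e_ZZ[k, k] + tau0[k, l])       # Equation 20
            # Cross-product with the residual that excludes effect (k, l) itself.
            b = X.T @ mu_z[:, k] - e_W.T @ e_ZZ[:, k] + e_ZZ[k, k] * e_w[k, l]
            mu_w[k, l] = tau * var_w[k, l] * b                        # Equation 18
            log_alpha = log_pi + 0.5 * mu_w[k, l] ** 2 / var_w[k, l]  # Equation 21, up to a constant
            log_alpha -= log_alpha.max()                 # stabilize the softmax
            alpha[k, l] = np.exp(log_alpha) / np.exp(log_alpha).sum()
    return mu_w, var_w, alpha

# Small demonstration on random data (all sizes and values are hypothetical).
rng = np.random.default_rng(0)
N, P, K, L = 60, 30, 2, 3
X = rng.standard_normal((N, P))
mu_w = rng.standard_normal((K, L, P))
var_w = np.ones((K, L))
alpha = np.full((K, L, P), 1.0 / P)
tau, tau0 = 1.0, np.ones((K, L))
log_pi = np.log(np.full(P, 1.0 / P))
for _ in range(5):
    mu_z, sigma_z = update_z(X, mu_w, var_w, alpha, tau)
    mu_w, var_w, alpha = update_w(X, mu_z, sigma_z, mu_w, var_w, alpha, tau, tau0, log_pi)
```

A non-zero initialization of `mu_w` matters here: if $E [W]$ starts at zero, Equation 16 returns $μ_{Z} = 0$ and the sweep cannot move away from that stationary point.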

Derivation of evidence lower bound (ELBO)

The ELBO provides a natural criterion for evaluating model fit during training, and also provides a means to perform hyperparameter optimization for the model precision parameters τ and $τ_{0}$ (or, equivalently, the corresponding variances). Given the above definitions for Q, we derive the ELBO for SuSiE PCA as

\begin{array}{l} ELBO (W, Z) = E_{Q} [logPr (X, Z, W) - \log Q (Z, W)] = E_{Q} [logPr (X | Z, W)] + E_{Q} [logPr (Z, W) - \log Q (Z, W)] = E_{Q} [logPr (X | Z, W)] + E_{Q (Z)} [logPr (Z) - \log Q (Z)] + \\ \sum_{l = 1}^{L} [E_{Q (w_{l} | Γ_{l})} [logPr (w_{l} | Γ_{l}) - \log Q (w_{l} | Γ_{l})] + E_{Q (Γ_{l})} [logPr (Γ_{l}) - \log Q (Γ_{l})]] = E_{Q} [logPr (X | Z, W)] + E_{Q (Z)} [logPr (Z) - \log Q (Z)] + E_{Q (W, Γ)} [logPr (W, Γ) - \log Q (W, Γ)] \end{array}

Based on the above derivation, the ELBO can be decomposed into three parts. The first term is the expected log-likelihood of the data under $Q$ :

E_{Q} [logPr (X | Z, W, Γ)] = E_{Q} [- \frac{1}{2 σ^{2}} t r [{(X - Z W)}^{⊺} (X - Z W)] - \frac{n p}{2} \log (2 π σ^{2})] = - \frac{1}{2 σ^{2}} [tr (X^{⊺} X) - 2 t r (X^{⊺} E [Z] E [W]) + tr (E [Z^{⊺} Z] E [W W^{⊺}])] - \frac{n p}{2} \log (2 π σ^{2})

The second term is the negative KL divergence of $Z$ .

\begin{array}{l} E_{Q (Z)} [logPr (Z) - \log Q (Z)] = E [- \frac{1}{2} t r (Z^{⊺} Z) - \frac{n k}{2} \log (2 π) + \frac{1}{2} t r (Σ_{Z}^{- 1} {(Z - μ_{Z})}^{⊺} (Z - μ_{Z})) \\ + \frac{n k}{2} \log (2 π) + \frac{n}{2} \log (| Σ_{Z} |)] \\ = - \frac{1}{2} tr (E [Z^{⊺} Z]) + \frac{1}{2} t r [Σ_{Z}^{- 1} (E [Z^{⊺} Z] - μ_{Z}^{⊺} μ_{Z})] + \frac{N}{2} \log (| Σ_{Z} |) \\ = - \frac{1}{2} tr (E [Z^{⊺} Z]) + \frac{1}{2} t r [Σ_{Z}^{- 1} (N Σ_{Z} + μ_{Z}^{⊺} μ_{Z} - μ_{Z}^{⊺} μ_{Z})] + \frac{N}{2} \log (| Σ_{Z} |) \\ = - \frac{1}{2} tr (E [Z^{⊺} Z]) + \frac{N K}{2} + \frac{N}{2} \log (| Σ_{Z} |) \end{array}

The last term, the joint negative KL divergence of $W$ and $Γ$ , can be further decomposed as follows:

\begin{array}{l} E_{Q (W, Γ)} [logPr (W, Γ) - \log Q (W, Γ)] = E_{Q (W, Γ)} [logPr (W | Γ) \Pr (Γ) - \log Q (W | Γ) Q (Γ)] = E_{Q (W, Γ)} [logPr (W | Γ) - \log Q (W | Γ)] + E_{Q (W, Γ)} [logPr (Γ) - \log Q (Γ)] = \sum_{k = 1}^{K} \sum_{l = 1}^{L} E_{Q (w_{k l}, γ_{k l})} [logPr (w_{k l} | γ_{k l}) - \log Q (w_{k l} | γ_{k l})] + \\ \sum_{k = 1}^{K} \sum_{l = 1}^{L} E_{Q (γ_{k l})} [logPr (γ_{k l}) - \log Q (γ_{k l})] = \sum_{k = 1}^{K} \sum_{l = 1}^{L} \sum_{i = 1}^{P} α_{k l i} E_{Q (w_{k l} | γ_{k l})} [\log \Pr (w_{k l} | γ_{k l i} = 1) - \log Q (w_{k l} | γ_{k l i} = 1)] + \\ \sum_{k = 1}^{K} \sum_{l = 1}^{L} \sum_{i = 1}^{P} E_{γ_{k l}} [logPr (γ_{k l i} = 1) - \log Q (γ_{k l i} = 1)] \end{array}

The first expectation term in the last line of the equation above, $E_{Q (w_{k l} | γ_{k l})}$ , can be expanded as follows:

\begin{array}{l} E_{Q (w_{k l} | γ_{k l})} [\log \frac{\Pr (w_{k l} | γ_{k l})}{Q (w_{k l} | γ_{k l})}] = \sum_{i = 1}^{P} E_{Q (w_{k l} | γ_{k l i} = 1)} [\log \frac{\Pr (w_{k l} | γ_{k l i} = 1)}{Q (w_{k l} | γ_{k l i} = 1)}] \\ = E \sum_{i = 1}^{P} [- \frac{τ_{0}}{2} {(w_{k l i})}^{2} + \frac{1}{2 σ_{w_{k l}}^{2}} {(w_{k l i} - μ_{w_{k l i}})}^{2}] - \frac{P}{2} \log (2 π / τ_{0}) + \frac{P}{2} \log (2 π σ_{w_{k l}}^{2}) \\ = \sum_{i = 1}^{P} [(- \frac{τ_{0}}{2} + \frac{1}{2 σ_{w_{k l}}^{2}}) E [{(w_{k l i})}^{2}] - \frac{1}{2 σ_{w_{k l}}^{2}} μ_{w_{k l i}}^{2}] - \frac{P}{2} \log (2 π / τ_{0}) + \frac{P}{2} \log (2 π σ_{w_{k l}}^{2}) \\ = \sum_{i = 1}^{P} [(- \frac{τ_{0}}{2} + \frac{1}{2 σ_{w_{k l}}^{2}}) [μ_{w_{k l i}}^{2} + σ_{w_{k l}}^{2}] - \frac{1}{2 σ_{w_{k l}}^{2}} μ_{w_{k l i}}^{2}] + \frac{P}{2} \log (σ_{w_{k l}}^{2} τ_{0}) \\ = \sum_{i = 1}^{P} [- \frac{τ_{0}}{2} μ_{w_{k l i}}^{2} - \frac{τ_{0}}{2} σ_{w_{k l}}^{2} + \frac{1}{2}] + \frac{P}{2} \log (σ_{w_{k l}}^{2} τ_{0}) \end{array}

And the second expectation term $E_{γ_{k l}}$ can be decomposed as

E_{Q (Γ)} [logPr (Γ) - \log Q (Γ)] = \sum_{k = 1}^{K} \sum_{l = 1}^{L} \sum_{i = 1}^{P} E_{Q (γ_{k l i} = 1)} [(γ_{k l i} \log π_{i} - γ_{k l i} \log α_{k l i})] = \sum_{k = 1}^{K} \sum_{l = 1}^{L} \sum_{i = 1}^{P} [E (γ_{k l i}) \log (π_{i}) - E (γ_{k l i}) \log (α_{k l i})] = \sum_{k = 1}^{K} \sum_{l = 1}^{L} \sum_{i = 1}^{P} [α_{k l i} (\log (π_{i}) - \log α_{k l i})]

With the explicit form of ELBO, we obtained the maximum likelihood estimates of model precision parameters $τ, τ_{0 k l}$ by setting the derivative of ELBO with respect to each variance parameter to be 0, which results in closed-form update equations given by,

{\hat{τ}}_{0 k l} = \frac{\sum_{i = 1}^{P} α_{k l i}}{\sum_{i = 1}^{P} α_{k l i} (μ_{w_{k l i}}^{2} + σ_{w_{k l i}}^{2})}

(Equation 22)

\hat{τ} = \frac{N P}{\sum_{i, j} X_{i j}^{2} - 2 t r (E [W] X^{⊺} μ_{Z}) + tr (E [Z^{⊺} Z] E [W W^{⊺}])} .

(Equation 23)
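
The closed-form precision updates can be written in a few lines. The sketch below is illustrative: the function names and the array layout `(K, L, P)` are assumptions. For τ, the denominator is taken to be the full expected residual sum of squares from the first ELBO term, $tr (X^{⊺} X) - 2 t r (X^{⊺} E [Z] E [W]) + tr (E [Z^{⊺} Z] E [W W^{⊺}])$, which is what setting the derivative of that term with respect to τ to zero yields.

```python
import numpy as np

def update_tau0(mu_w, var_w, alpha):
    """Equation 22 for every (k, l); mu_w and alpha have shape (K, L, P), var_w has shape (K, L)."""
    weighted_second_moment = (alpha * (mu_w ** 2 + var_w[:, :, None])).sum(axis=2)
    return alpha.sum(axis=2) / weighted_second_moment

def update_tau(X, mu_z, sigma_z, e_W, e_WW):
    """Residual precision: N * P divided by the expected residual sum of squares."""
    n, p = X.shape
    e_ZZ = n * sigma_z + mu_z.T @ mu_z                   # E[Z^T Z]
    ess = (np.sum(X ** 2)
           - 2.0 * np.trace(e_W @ X.T @ mu_z)
           + np.trace(e_ZZ @ e_WW))
    return n * p / ess

# Hypothetical check: with no posterior uncertainty the expected residual sum of
# squares reduces to the ordinary one, ||X - mu_z E[W]||_F^2.
rng = np.random.default_rng(1)
mu_z = rng.standard_normal((5, 2))
e_W = rng.standard_normal((2, 4))
X = mu_z @ e_W + rng.standard_normal((5, 4))
tau_hat = update_tau(X, mu_z, np.zeros((2, 2)), e_W, e_W @ e_W.T)
tau0_hat = update_tau0(np.full((1, 1, 3), 2.0), np.zeros((1, 1)),
                       np.array([[[1.0, 0.0, 0.0]]]))
```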

Simulations

To investigate the performance of SuSiE PCA in variable selection and model fitting, we simulated various data sets that are controlled by 4 parameters: the sample size N, number of features P, number of latent factors K, and number of single effects L in each of the factors. For simplicity, we assume L is the same across different factors. The simulated data $X$ is generated according to Equation 1, where $N = 1000, P = 6000$ , and $z_{k}$ and $w_{k}$ , for $k = 1, \dots, 4$ , are simulated such that each factor contains only 40 non-zero effects (0.67% of features), given by,

z_{k} \sim N (0, I_{N})

(Equation 24)

w_{1, i} \sim N (0,1) i = 1, \dots, 40

(Equation 25)

w_{2, i} \sim N (0,1) i = 41, \dots, 80

(Equation 26)

w_{3, i} \sim N (0, 2^{2}) i = 81, \dots, 120

(Equation 27)

w_{4, i} \sim N (0,1) i = 121, \dots, 160

(Equation 28)

with the remaining effects set to zero. Because the scale of the estimated loadings may differ across methods, we normalized each loading matrix with respect to the Frobenius norm, i.e., $tr (A^{⊺} A) = tr (B^{⊺} B) = 1$ for the two loading matrices $A$ and $B$ being compared.
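
The simulation design in Equations 24 to 28 can be sketched as follows. The function names and the residual standard deviation `noise_sd` are assumptions for this example, since the text here does not state the noise variance used; the block structure of the non-zero loadings and their standard deviations follow the equations above.

```python
import numpy as np

def simulate(n=1000, p=6000, k=4, n_nonzero=40, sds=(1.0, 1.0, 2.0, 1.0),
             noise_sd=1.0, seed=0):
    """Simulate X = Z W + noise with block-sparse loadings (Equations 24-28)."""
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((n, k))                      # Equation 24
    W = np.zeros((k, p))
    for kk in range(k):
        start = kk * n_nonzero                           # factor kk owns features start..start+39
        W[kk, start:start + n_nonzero] = rng.normal(0.0, sds[kk], size=n_nonzero)
    X = Z @ W + rng.normal(0.0, noise_sd, size=(n, p))   # Equation 1; noise_sd is assumed
    return X, Z, W

def frobenius_normalize(A):
    """Rescale A so that tr(A^T A) = 1."""
    return A / np.linalg.norm(A)                         # Frobenius norm for 2-D input

# Smaller sizes than in the text, to keep the example quick to run.
X_sim, Z_sim, W_sim = simulate(n=50, p=200)
W_unit = frobenius_normalize(W_sim)
```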

To evaluate the accuracy of SuSiE PCA, we compared inferred posterior expectations with the true latent variables. However, due to the rotational invariance property of latent factor models, evaluating loading or latent factor accuracy can be challenging. To account for possible rotation, we leverage the Procrustes transformation,¹⁷ which finds an orthogonal rotation matrix $P$ to transform the estimated loading matrix to the true loading matrix space. Specifically, given an estimated loading matrix $\hat{W} : = E_{Q} [W]$ under the approximate posterior distribution Q and the true effect matrix $W$ , the “Procrustes norm” can be obtained as follows:

{‖ W - \hat{W} ‖}_{P}^{2} : = \min_{{P | P^{- 1} = P^{⊺}}} {‖ \hat{W} P - W ‖}_{F}^{2}

(Equation 29)

Here we perform the Procrustes analysis via the Procrustes package,¹⁶ in which $P$ is obtained by performing a singular value decomposition on the matrix ${\hat{W}}^{⊺} W$ (padding $\hat{W}$ with zeros ensures that the above operation proceeds correctly).
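
For intuition, the minimization in Equation 29 has a standard closed-form solution: if ${\hat{W}}^{⊺} W = U S V^{⊺}$ is a singular value decomposition, the optimal orthogonal matrix is $U V^{⊺}$. The NumPy sketch below implements that textbook solution directly and does not call the Procrustes package. The function name is hypothetical, and the matrices are assumed to be stored with features in rows and factors in columns so that the rotation is $K \times K$.

```python
import numpy as np

def procrustes_error(W_hat, W):
    """Return min over orthogonal P of ||W_hat P - W||_F^2, together with the minimizer."""
    U, _, Vt = np.linalg.svd(W_hat.T @ W)
    P = U @ Vt                                           # optimal orthogonal transformation
    return np.sum((W_hat @ P - W) ** 2), P

# Check on a loading matrix that differs from the truth only by a rotation.
rng = np.random.default_rng(0)
W_true = rng.standard_normal((8, 3))
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))         # random orthogonal matrix
W_rotated = W_true @ Q.T
err, P_opt = procrustes_error(W_rotated, W_true)
```

In this check the error is zero up to floating-point precision, which illustrates why the Procrustes norm is insensitive to the rotational ambiguity of the factors.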

In addition, we employ the relative root mean squared error (RRMSE) to evaluate the reconstructed data loss as,

RRMSE (\hat{X}, X) = \sqrt{\frac{\sum_{i, j} {({\hat{x}}_{i j} - x_{i j})}^{2}}{\sum_{i, j} x_{i j}^{2}}}

(Equation 30)
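
Equation 30 translates directly into code. In the sketch below the function name `rrmse` is an assumption, and the reconstruction $\hat{X}$ would typically be formed from posterior means as $E [Z] E [W]$.

```python
import numpy as np

def rrmse(X_hat, X):
    """Relative root mean squared error of a reconstruction (Equation 30)."""
    return np.sqrt(np.sum((X_hat - X) ** 2) / np.sum(X ** 2))

X_demo = np.array([[1.0, 2.0], [3.0, 4.0]])
perfect = rrmse(X_demo, X_demo)                  # exact reconstruction
trivial = rrmse(np.zeros_like(X_demo), X_demo)   # all-zero reconstruction
```

An all-zero reconstruction gives an RRMSE of exactly 1, so values below 1 indicate that the model explains some of the variation in $X$.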

Lastly, to assess generative modeling proficiency, we computed the log-likelihood under held-out data. Specifically, we first trained the model on simulated training data. Next, we computed latent space representations for the testing data under each of the trained models. Lastly, we computed log-likelihoods under normality assumptions given the latent representations and learned loadings and parameters.

For model comparison, we also evaluate the performance of sparse PCA⁶ and Empirical Bayes Matrix Factorization (EBMF), a recently described variational approach,¹² on the same simulated data sets with the same K, and compare their performance with SuSiE PCA using the criteria described above.

GTEx Z score dataset

To illustrate the application of SuSiE PCA in genetic research, we downloaded the Genotype-Tissue Expression (GTEx)¹⁴ summary statistics data, composed of Z scores computed by testing the association between genetic variants and gene expression levels across 44 different human tissues.¹² The GTEx project collected genotype data and gene expression data from 49 non-disease tissues across $n = 838$ individuals, providing an ideal resource to study the relationship between genetic variants and gene expression levels.¹⁴ The genetic variants that are statistically associated with gene expression levels are referred to as expression quantitative trait loci (eQTLs). To identify eQTLs, the GTEx project tested the association between each nearby genetic variant of a given gene and its expression level using linear regression to yield a Z score. The summary data we explored reflect the most significant eQTL (equivalently, the SNP-gene pair with the largest absolute Z score) at each of 16069 genes (rows) in 44 tissues (columns) curated from GTEx v8,¹²^,¹⁴ as those 16069 genes show indication of being expressed in 44 of all 49 human tissues. To identify tissue-specific components of regulatory genetic features and contributing genes, we applied SuSiE PCA to this Z score matrix with a latent dimension of 27 and 18 single effects per factor. The prior information on the number of latent dimensions comes from Wang et al. (2021),¹² who contributed the Z score dataset and ran the EBMF model with 27 factors. To determine an appropriate L for the data, we ran SuSiE PCA with L ranging from 10 to 25 and selected the model at which the increase in the total percentage of variance explained (PVE) fell below 5%. PVE measures the amount of signal in the data captured by a latent component. The PVE of the factor $z_{k}$ is calculated as:

{PVE}_{k} = \frac{s_{k}}{\sum_{k} s_{k} + N P / τ}

(Equation 31)

where $s_{k} = \sum_{i = 1}^{N} \sum_{j = 1}^{P} {(E [z_{i k}] E [w_{k j}])}^{2}$ .
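
A short sketch of Equation 31 follows. The function name `pve` and the argument layout are assumptions; the computation uses the identity $s_{k} = (\sum_{i} E {[z_{i k}]}^{2}) (\sum_{j} E {[w_{k j}]}^{2})$, which follows from the definition of $s_{k}$ because the double sum of squared products factorizes.

```python
import numpy as np

def pve(mu_z, e_W, tau):
    """Percentage of variance explained per factor (Equation 31).

    mu_z: (N, K) posterior mean of Z; e_W: (K, P) posterior mean of W; tau: residual precision.
    """
    n, p = mu_z.shape[0], e_W.shape[1]
    s = (mu_z ** 2).sum(axis=0) * (e_W ** 2).sum(axis=1)   # s_k for each factor
    return s / (s.sum() + n * p / tau)

# Tiny hypothetical example: N = 2, P = 3, K = 1, tau = 1.
pve_demo = pve(np.ones((2, 1)), np.ones((1, 3)), tau=1.0)
```

Here $s_{1} = 6$ and $N P / τ = 6$, so the single factor explains half of the total variance.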

Perturb-seq dataset

We next investigated genome-scale Perturb-seq data¹⁵ to discover the co-regulated gene sets affected by common types of perturbations. The Perturb-seq data originated from Perturb-seq experiments performed by Replogle et al.¹⁵ Perturb-seq is a cutting-edge technique combining CRISPR-based perturbations with single-cell RNA-sequencing readouts, enabling the investigation of co-regulated gene sets affected by various perturbations. The researchers employed three cell lines: K562 cells, hTERT-immortalized RPE1 cells, and HEK293T cells. CRISPRi technology was used to generate cell lines expressing dCas9-BFP-KRAB (KOX1-derived) for the perturbation experiments. Since we focus our analyses on the expression data from the K562 cell line, we give a brief description of the experiments performed on the K562 cell lines. Namely, the authors targeted genes expressed in K562 cells, transcription factors, Cancer Dependency Map common essential genes, and included non-targeting control sgRNAs accounting for 5% of the total library. The gene sets were defined based on a combination of bulk RNA-seq data from ENCODE and 10x Genomics 3′ single-cell RNA-seq data. Libraries were constructed with dual-sgRNA pairs targeting each gene, expressed from tandem U6 expression cassettes in a single lentiviral vector, and ranked based on empirical data and computational predictions. Subsequently, the authors conducted Perturb-seq experiments on the K562 cells, with 2056 distinct knocked-out genes and one non-targeting control group over an average of 150 different single cells, and then measured the expression levels of the downstream 8563 genes from each cell.

The final dataset contains 310385 rows, each representing one perturbation in a specific cell, and the expression levels of 8563 downstream genes as the columns. As an exploratory analysis, we omitted the single-cell level information and aggregated the expression levels of downstream genes with the same perturbation over all the cells, which resulted in a “pseudo-bulk” data matrix with 2057 rows and 8563 columns. We then applied SuSiE PCA and sparse PCA to investigate the regulatory modules from the common perturbations. To exclude batch effects and other non-genetic covariates, we regressed out the germ-line group and the mitochondrial percent from the original expression data before aggregating the expression levels of downstream genes with the same perturbation. Finally, the aggregated data were centered and standardized before being input into SuSiE PCA.
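
The aggregation and standardization steps can be sketched as follows. This is a simplified illustration with hypothetical function names: it averages cells that share a perturbation label and then centers and scales each gene, and it omits the covariate regression step described above.

```python
import numpy as np

def pseudo_bulk(expr, labels):
    """Average the rows of expr (cells x genes) that share a perturbation label."""
    labels = np.asarray(labels).ravel()
    uniq, inverse = np.unique(labels, return_inverse=True)
    inverse = inverse.ravel()
    sums = np.zeros((uniq.size, expr.shape[1]))
    np.add.at(sums, inverse, expr)                       # accumulate cells per perturbation
    counts = np.bincount(inverse, minlength=uniq.size)
    return uniq, sums / counts[:, None]

def standardize(M):
    """Center each column and scale it to unit standard deviation."""
    M = M - M.mean(axis=0)
    sd = M.std(axis=0)
    sd[sd == 0] = 1.0                                    # leave constant columns at zero
    return M / sd

# Toy example: three cells, two genes, two perturbations (values are made up).
expr = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
perts, bulk = pseudo_bulk(expr, ["a", "b", "a"])
scaled = standardize(np.array([[1.0, 2.0], [3.0, 6.0]]))
```

Applied to the data described above, `pseudo_bulk` would map the 310385 cell-level rows to one row per perturbation.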

As a comparison, we also ran sparse PCA with the same K on both datasets. Choosing an appropriate sparsity parameter alpha in sparse PCA is, however, less straightforward than tuning L in SuSiE PCA, as we cannot directly pull all of the non-zero genes even with a fairly large alpha (higher sparsity). To make a reasonable comparison, we ran sparse PCA with a set of alpha values from 1 to 20 and chose two models to compare: first, the model giving the highest PVE, and second, the model having a similar level of PVE to SuSiE PCA.

Acknowledgments

This work was funded by the National Institutes of Health (NIH) under awards R01HG012133, P01CA196569.

Author contributions

D.Y. and N.M. developed the method. D.Y. performed analysis. D.Y. and N.M. edited and approved the manuscript.

Declaration of interests

The authors declare no competing interests.

Published: October 13, 2023

Footnotes

Supplemental information can be found online at https://doi.org/10.1016/j.isci.2023.108181.

Contributor Information

Dong Yuan, Email: dongyuan@usc.edu.

Nicholas Mancuso, Email: nmancuso@usc.edu.

Supplemental information

Document S1. Figures S1–S24


Data and code availability

•
This paper analyzes existing, publicly available data, i.e., the GTEx z-score dataset¹⁴ and the perturb-seq data.¹⁵ These accession numbers for the datasets are listed in the key resources table.
•
All original codes related to SuSiE PCA have been deposited and are publicly available on GitHub (https://github.com/mancusolab/susiepca).
•
Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request.

References

1. Hotelling H. Analysis of a complex of statistical variables into principal components. J. Educ. Psychol. 1933;24:417–441. doi: 10.1037/h0071325.
2. Patterson N., Price A.L., Reich D. Population Structure and Eigenanalysis. PLoS Genet. 2006;2 doi: 10.1371/journal.pgen.0020190.
3. Agrawal A., Chiu A.M., Le M., Halperin E., Sankararaman S. Scalable probabilistic PCA for large-scale genetic variation data. PLoS Genet. 2020;16 doi: 10.1371/journal.pgen.1008773.
4. McVean G. A Genealogical Interpretation of Principal Components Analysis. PLoS Genet. 2009;5 doi: 10.1371/journal.pgen.1000686.
5. Jolliffe I.T. Springer-Verlag; 1986. Principal Component Analysis.
6. Zou H., Hastie T., Tibshirani R. Sparse Principal Component Analysis. J. Comput. Graph Stat. 2006;15:265–286. doi: 10.1198/106186006X113430.
7. Bishop C. Advances in Neural Information Processing Systems. MIT Press; 1998. Bayesian PCA.
8. Guan Y., Dy J. Proceedings of the Twelth International Conference on Artificial Intelligence and Statistics. PMLR; 2009. Sparse Probabilistic Principal Component Analysis; pp. 185–192.
9. Ning B. Spike and slab Bayesian sparse principal component analysis. Preprint at arXiv. 2021. doi: 10.48550/arXiv.2102.00305.
10. Armagan A., Clyde M., Dunson D. Advances in Neural Information Processing Systems. Curran Associates, Inc.; 2011. Generalized Beta Mixtures of Gaussians.
11. Zhao S., Gao C., Mukherjee S., Engelhardt B.E. Bayesian group factor analysis with structured sparsity. J. Mach. Learn. Res. 2016;17:1–47.
12. Wang W., Ge L., Zhang L., Liu L., Zhang X., Ma X. Empirical bayes matrix factorization. Hum. Fertil. 2021;22:1–11.
13. Wang G., Sarkar A., Carbonetto P., Stephens M. A simple new approach to variable selection in regression, with application to genetic fine mapping. J. R. Stat. Soc. Series B Stat. Methodol. 2020;82:1273–1300. doi: 10.1111/rssb.12388.
14. GTEx Consortium. Ardlie K.G., Deluca D.S., Segrè A.V., Sullivan T.J., Young T.R., Gelfand E.T., Trowbridge C.A., Maller J.B., Tukiainen T., et al. The Genotype-Tissue Expression (GTEx) pilot analysis: Multitissue gene regulation in humans. Science. 2015;348:648–660. doi: 10.1126/science.1262110.
15. Replogle J.M., Saunders R.A., Pogson A.N., Hussmann J.A., Lenail A., Guna A., Mascibroda L., Wagner E.J., Adelman K., Lithwick-Yanai G., et al. Mapping information-rich genotype-phenotype landscapes with genome-scale Perturb-seq. Cell. 2022;185:2559–2575.e28. doi: 10.1016/j.cell.2022.05.013.
16. Meng F., Richer M., Tehrani A., La J., Kim T.D., Ayers P.W., Heidar-Zadeh F. Procrustes: A python library to find transformations that maximize the similarity between matrices. Comput. Phys. Commun. 2022;276 doi: 10.1016/j.cpc.2022.108334.
17. Borg I., Groenen P. 2005. Modern Multidimensional Scaling: Theory and Applications. (Springer Series in Statistics).
18. Bradbury J., Frostig R., Hawkins P., Johnson M.J., Leary C., Maclaurin D., Necula G., Paszke A., VanderPlas J., Wanderman-Milne S., et al. 2018. JAX: Composable Transformations of Python+NumPy Programs.
19. Cohn B.A., Cirillo P.M., Christianson R.E. Prenatal DDT Exposure and Testicular Cancer: A Nested Case-Control Study. Arch. Environ. Occup. Health. 2010;65:127–134. doi: 10.1080/19338241003730887.
20. Ge S.X., Jung D., Yao R. ShinyGO: a graphical gene-set enrichment tool for animals and plants. Bioinformatics. 2020;36:2628–2629. doi: 10.1093/bioinformatics/btz931.
21. Amrute J.M., Perry A.M., Anand G., Cruchaga C., Hock K.G., Farnsworth C.W., Randolph G.J., Lavine K.J., Steed A.L. Cell specific peripheral immune responses predict survival in critical COVID-19 patients. Nat. Commun. 2022;13:882. doi: 10.1038/s41467-022-28505-3.
22. Garg M., Li X., Moreno P., Papatheodorou I., Shu Y., Brazma A., Miao Z. Meta-analysis of COVID-19 single-cell studies confirms eight key immune responses. Sci. Rep. 2021;11 doi: 10.1038/s41598-021-00121-z.
23. Signorile A., Sgaramella G., Bellomo F., De Rasmo D. Prohibitins: A Critical Role in Mitochondrial Functions and Implication in Diseases. Cells. 2019;8:71. doi: 10.3390/cells8010071.
24. Artal-Sanz M., Tsang W.Y., Willems E.M., Grivell L.A., Lemire B.D., van der Spek H., Nijtmans L.G.J. The mitochondrial prohibitin complex is essential for embryonic viability and germline function in Caenorhabditis elegans. J. Biol. Chem. 2003;278:32091–32099. doi: 10.1074/jbc.M304877200.
25. Artal-Sanz M., Tavernarakis N. Prohibitin couples diapause signalling to mitochondrial metabolism during ageing in C. elegans. Nature. 2009;461:793–797. doi: 10.1038/nature08466.
26. Opper M., Saad D. MIT Press; 2001. Advanced Mean Field Methods: Theory and Practice.
27. Andrieu C., de Freitas N., Doucet A., Jordan M.I. An Introduction to MCMC for Machine Learning. Mach. Learn. 2003;50:5–43. doi: 10.1023/A:1020281327116.
28. Jordan M.I., Ghahramani Z., Jaakkola T.S., Saul L.K. An Introduction to Variational Methods for Graphical Models. Mach. Learn. 1999;37:183–233. doi: 10.1023/A:1007665907178.
29. Kullback S., Leibler R.A. On Information and Sufficiency. Ann. Math. Stat. 1951;22:79–86.
30. Tanaka T. Advances in Neural Information Processing Systems. MIT Press; 1998. A Theory of Mean Field Approximation.

Associated Data


Supplementary Materials

Document S1. Figures S1–S24

mmc1.pdf (12.4 MB, pdf)

Data Availability Statement

•
This paper analyzes existing, publicly available data: the GTEx z-score dataset¹⁴ and the perturb-seq data.¹⁵ Accession numbers for both datasets are listed in the key resources table.
•
All original code for SuSiE PCA has been deposited on GitHub and is publicly available (https://github.com/mancusolab/susiepca).
•
Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request.

