iScience. 2023 Oct 13;26(11):108181. doi: 10.1016/j.isci.2023.108181

SuSiE PCA: A scalable Bayesian variable selection technique for principal component analysis

Dong Yuan 1,3,, Nicholas Mancuso 1,2,∗∗
PMCID: PMC10638022  PMID: 37953948

Summary

Latent factor models, like principal component analysis (PCA), provide a statistical framework to infer low-rank representation in various biological contexts. However, feature selection is challenging when this low-rank structure manifests from a sparse subspace. We introduce SuSiE PCA, a scalable sparse latent factor approach that evaluates uncertainty in contributing variables through posterior inclusion probabilities. We validate our model in extensive simulations and demonstrate that SuSiE PCA outperforms other approaches in signal detection and model robustness. We apply SuSiE PCA to multi-tissue expression quantitative trait loci (eQTLs) data from GTEx v8 and identify tissue-specific factors and their contributing eGenes. We further investigate its performance on large-scale perturbation data and find that SuSiE PCA identifies modules with a higher enrichment of ribosome-related genes than sparse PCA (false discovery rate [FDR] = 9.2×10⁻⁸² vs. 1.4×10⁻³³), while being 18x faster. Overall, SuSiE PCA provides an efficient tool to identify relevant features in high-dimensional biological data.

Subject areas: Biocomputational method, Classification of bioinformatical subject, Data processing in systems biology, Algorithms


Highlights

  • Efficient PCA feature selection via posterior inclusion probabilities

  • Learned model prior enhances inferential robustness

  • Seamless CPU/GPU/TPU implementation enables efficient inference



Introduction

Principal component analysis (PCA) is a popular dimension reduction technique1 that has been widely applied for exploratory data analysis in many fields. One notable functionality of PCA is to synthesize crucial information across features into a small number of principal components (PCs). For example, PCA is commonly used to infer population structure from large-scale genetic data.2,3 The top PCs explain differences in genetic variation arising from the different geographic origins and ancestries of individuals, due to historical migration, admixture, etc.4 Moreover, PCA provides a means to rank the relevant variables contributing to each latent component, as in the probabilistic reformulation of principal component analysis (PPCA) proposed by Tipping and Bishop.5 Specifically, each PC is independent of the other PCs and has its own weights representing the "importance" of the original features, suggesting that different latent components arise from different combinations of variables, or distinct aspects of information in the data.

However, one disadvantage of conventional PCA is that PCs provide limited interpretability, as each results from a linear combination of all variables in the data.6 To improve the interpretability of PCs, while providing an identifiable solution in high-dimensional data, a common approach is to impose sparsity on the PCA loadings. Broadly speaking, there are two types of approaches to achieving sparsity in the loading matrix. The first is regularization methods such as sparse PCA,6 which rewrite PCA as a regression-based optimization problem and include an L1 penalty in the objective function to achieve sparse loadings. The second is Bayesian treatments of PPCA, which impose a sparsity-inducing prior on the factor loading matrix.7,8,9,10,11,12 Despite the variety of methods that induce sparse solutions for PCA, few provide a statistically rigorous way to select the variables relevant to each factor in a post hoc manner. Although several sparse models are capable of shrinking the loadings of uninformative variables to zero, for those variables with non-zero weights, neither a reasonable threshold nor a formal statistical test is provided to inform feature prioritization for validation or follow-up.

Here, we propose SuSiE PCA, a highly scalable Bayesian framework for sparse PCA that quantifies the uncertainty of contributing features for each latent component. Specifically, SuSiE PCA leverages the recent "sum of single effects" (SuSiE) approach13 to model a loading matrix such that each latent factor contains at most L contributing features. Latent factors and sparse loading weights are learned through an efficient variational algorithm. In addition to providing a sparse loading matrix, SuSiE PCA computes posterior inclusion probabilities (PIPs) for each feature, which enables defining ρ-level credible sets for feature selection. We demonstrate through extensive simulations that SuSiE PCA outperforms sparse PCA6 and empirical Bayes matrix factorization (EBMF)12 in identifying relevant features contributing to structured data while being robust to data-generating assumptions. Next, we apply SuSiE PCA to multi-tissue expression quantitative trait loci (eQTLs) data from the genotype-tissue expression (GTEx) v8 study12,14 to identify tissue-specific components of regulatory genetic features and contributing eGenes (genes that have an associated eQTL). We also apply SuSiE PCA to high-dimensional perturb-seq data (CRISPR-based screens with single-cell RNA-sequencing readouts)15 and identify gene sets more enriched in the ribosome and coronavirus disease pathways when compared with sparse PCA (false discovery rate [FDR] = 9.2×10⁻⁸², 63 genes involved vs. 1.4×10⁻³³, 35 genes involved) while requiring 17.8 times less computing time. Overall, we find that SuSiE PCA provides an efficient approach to compute interpretable latent factors from high-dimensional biological data. We provide an open-source Python implementation that runs seamlessly on central processing units (CPUs), graphics processing units (GPUs), or tensor processing units (TPUs), available at http://www.github.com/mancusolab/susiepca.
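As a quick orientation, the following sketch shows how an analysis might be launched with the released package. The call signature (sp.infer.susie_pca with z_dim and l_dim arguments) and the result fields are assumptions based on the repository documentation, not a guaranteed API, and may differ from the installed version.

    import numpy as np
    import susiepca as sp

    X = np.random.normal(size=(1000, 6000))   # stand-in for a real data matrix
    # K=4 latent factors, at most L=40 contributing features per factor
    results = sp.infer.susie_pca(X, z_dim=4, l_dim=40)   # assumed signature
    # select features per factor by posterior inclusion probability
    selected = results.pip > 0.9                          # assumed attribute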

Results

PIPs from SuSiE PCA outperform existing approaches for PCA feature selection

To evaluate the performance of SuSiE PCA, we performed extensive simulations (see details in STAR Methods). Briefly, we performed 100 simulations, varying model parameters one at a time, and performed inference using SuSiE PCA with the true number of latent variables (K) and effects (L) known. First, we evaluated the ability of inferred PIPs to discriminate between relevant and non-relevant features for latent factors. Specifically, we compared the sensitivity and specificity of inferred PIPs to those of normalized posterior mean weights from SuSiE PCA (see Figure 1). When selecting variables based on PIP > 0.90, SuSiE PCA identifies 88.9% of true positive (non-zero) signals, demonstrating largely calibrated posterior inference. We observed that nearly all true negative signals exhibited PIP < 0.05. As a comparison, the normalized posterior weights performed well at excluding the true negative signals but failed to capture true positive signals as readily as PIP thresholds. Overall, the simulation demonstrates that PIPs provide a more intuitive and efficient indicator for feature selection than normalized posterior weights in SuSiE PCA. In addition, we also examined the sensitivity and specificity using weights estimated from sparse PCA and EBMF (see Figure S1), which show trends similar to the curves in Figure 1B and can capture only a small proportion of the true positive signals as the cutoff threshold increases.
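For reference, sensitivity and specificity curves of this kind can be computed from simulated ground truth with a few lines of Python; the array names below (pip, true_nonzero) are illustrative rather than taken from our codebase.

    import numpy as np

    def sensitivity_specificity(pip, true_nonzero, cutoffs):
        """Sensitivity and specificity of PIP-based selection at each cutoff.
        pip: PIPs for all features; true_nonzero: boolean ground-truth mask."""
        results = []
        for c in cutoffs:
            selected = pip >= c
            sens = (selected & true_nonzero).sum() / true_nonzero.sum()
            spec = (~selected & ~true_nonzero).sum() / (~true_nonzero).sum()
            results.append((c, sens, spec))
        return results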

Figure 1.

PIPs exhibit a higher efficiency in selecting the true signals than the posterior weights in SuSiE PCA

The proportion of correctly classified signals using PIPs (A) or posterior weights (B) as the cutoff. The green dots represent sensitivity, i.e., Pr(PIP ≥ cutoff | true positive signal), and the red dots represent specificity, i.e., Pr(PIP < cutoff | true negative signal). For consistency and to ensure comparability between PIPs and weights, the weights are normalized to range from 0 to 1.

SuSiE PCA is robust to model mis-specification

Next, we examined the estimation accuracy of the loading matrix as a function of sample size (N), feature dimension (P), latent dimension (K), and the number of single effects (or sparsity level) (L), via Procrustes errors16 (the Frobenius norm after Procrustes transformation;17 see STAR Methods) (Figures 2A–2D). We found that SuSiE PCA has the smallest Procrustes errors across all simulation settings compared to sparse PCA and EBMF. We also noticed that the Bayesian methods, SuSiE PCA and EBMF, maintain a low error even with a small sample size or high feature dimension. Moreover, we found that SuSiE PCA has the lowest relative root mean squared error (RRMSE) across all simulations compared with the other methods (Figure S2), and that EBMF and SuSiE PCA have a lower Procrustes error for the factor Z than sparse PCA (Figure S3). In summary, SuSiE PCA exhibits the highest estimation accuracy, which is consistent with its superior performance in variable selection.

Figure 2.

SuSiE PCA outperforms sparse PCA and EBMF in estimation accuracy and model robustness

SuSiE PCA yields the smallest Procrustes error in the weight matrix among the three methods (A–D) and is robust to over-specified K and L (E and F). For each scenario in (A–D) we vary one of the parameters at a time to generate the simulation data while fixing the other three, and then input the true parameters (N, P, K, L) into the models. Finally, we compute the Procrustes error and plot it as a function of N, P, K, and L. For (E and F), we use the same simulation setting as in Figure 1 to generate data but vary the specified L in SuSiE PCA (E) and K in all three models (F). Reference lines refer to the error from the models with correctly specified parameters (i.e., L = 40, K = 4).

We next investigated model robustness under model mis-specification. Like other latent factor models, SuSiE PCA can be mis-specified, as it requires manually specifying the latent dimension K and the number of single effects L. To examine this setting, we generated simulation datasets with K = 4 and L = 40 and then fit SuSiE PCA, sparse PCA, and EBMF under two mis-specified scenarios: varying L while fixing K, or varying K while fixing L. We then compared estimation accuracy across the three models using the Procrustes error (see Figures 2E and 2F). We observed that as K and L in the model approach the true values (i.e., K = 4 or L = 40), the Procrustes error in SuSiE PCA decreases rapidly to a low level and remains there even when K > 4 or L > 40. The error for sparse PCA, however, has a V shape and reaches its minimum at the true K. The explanation is that when over-specified latent factors are included in the model, SuSiE PCA and EBMF shrink their loadings so that the redundant factors extract no information from the data, a consequence of their probabilistic model structure; sparse PCA, on the other hand, imposes no probabilistic assumption on the weights and instead drives the values of the redundant latent factors close to 0, which ensures those components do not contribute.

Finally, to compare the generative capacity, we computed and compared the log likelihood of held-out data between sparse PCA and SuSiE PCA. We observed that SuSiE PCA outperforms sparse PCA and obtains higher log likelihoods for simulations (Figure S5). In addition to the overall superior model performance, SuSiE PCA remains faster on both CPU and GPU than sparse PCA and EBMF due to the efficient variational algorithm we implement (see STAR Methods) with the JAX library developed by Google.18

Dissecting cross-tissue eQTLs in GTEx

To illustrate the utility of SuSiE PCA for inference in biological data, we analyzed multi-tissue eQTL Z score results computed from GTEx v812,14 (see STAR Methods). Specifically, we sought to identify latent factors corresponding to tissue-specific and tissue-shared eQTLs, similar to ref. 12. Overall, we found that 27 latent factors explained 53.1% of the variance in the data (see Figure S6). Although we set L = 18 across all factors, we found that the number of tissues with PIP > 0.9 is frequently lower than 18 in different factors (see Figure S9), which is due to the inferred τ_0kl acting to "shut off" uninformative features. Indeed, we observed 30 out of 486 τ_0kl with estimates greater than e^10 (see Figure S7), which effectively shrink the effect size of the corresponding single effect toward 0, driving the number of non-zero single effects in some factors below the specified L. We found this behavior also reflected in the estimated level-0.9 credible sets, where 456 out of 486 contained a single tissue and the remaining 30 contained at least two tissues.

To understand what each factor represents, we examined the inferred PIPs (Figure S9) and the posterior mean weights of each tissue across the 27 factors (Figure S8). Here we present the results from factors z1 and z3 through their posterior weights (Figure 3; see Figure S8 for the remainder). We observed that the latent factor z1, with the second largest percent of variance explained (PVE), demonstrates high absolute weights on most tissues except the brain tissues, while the latent factor z3 has large weights almost exclusively on brain tissues. Moreover, we observed that the brain tissues tend to appear as a group with similar effects, implying that eQTLs in brain tissues differ from those in other tissues and that those strong signals are specifically captured by factor z3. For the remaining factors, we noticed that factors with large PVE such as z2, z4, and z5 tended to have large weights on multiple tissues; for example, factor z2 has large weights on esophagus and thyroid, suggesting the eQTL signals are mostly shared across those tissues. In contrast, the factors with small PVE usually have large weights exclusively on one or a few tissues, for example, the liver-specific component z12 and the lung-specific component z15. The only exception is that the factor z0, with the largest PVE, has an exclusively large weight only on the testis, implying that z0 captures testis-specific eQTL signals. This is consistent with an inspection of the latent factor values of z0: the gene with the largest factor value in z0 is D-dopachrome tautomerase (DDT) (Figure S10), which has been reported to be associated with testis cancer.19 To compare with an existing method, we also applied sparse PCA to the GTEx Z score dataset and observed comparable tissue weights and factor scores across components in both SuSiE PCA and sparse PCA (Figure S11). However, a notable distinction arises where certain tissues exhibit tiny weights and can potentially be neglected in sparse PCA; in contrast, SuSiE PCA successfully captures the signals in those tissues through the PIP. For example, both models identify the adrenal gland as the most relevant tissue in factor 10, while the remaining tissues have much smaller relative weights and could effectively be ignored. Despite this, SuSiE PCA assigns a PIP of 1 to these lowly weighted tissues, suggesting that important signals would be missed if weights alone were used to provide insight. Overall, we find that SuSiE PCA is able to identify tissue-specific components from multi-tissue eQTL data in an intuitive, interpretable manner.

Figure 3.

Factors z1 and z3 capture different types of tissues (non-brain vs. brain tissues)

The posterior weights refer to the inferred W matrix from SuSiE PCA. A clustering pattern appears across factors because only a limited number of tissues have non-zero weights in each factor: we set L = 18 while the feature dimension is 44.

Identifying regulatory modules from perturb-seq data

To identify gene regulatory modules from genome-wide perturbation data, we ran SuSiE PCA on perturb-seq data from cell lines15 (see STAR Methods) with K = 10 and L = 300. Briefly, we input the normalized expression data (2057×8563) to SuSiE PCA to identify gene regulatory modules (i.e., Z) and downstream-regulated networks (i.e., W). To ensure our results were robust to K and L, we explored a grid of possible combinations and found that K = 10 and L = 300 retain the most important information while keeping the relevant gene set much smaller (see Figure S12 for a detailed explanation).

Overall, we found the total PVE was 10.71% across all components (Figure S13), with each component exhibiting 299 downstream genes with PIP > 0.9 on average. Focusing on the leading component, we found that the perturbations with the top 10 largest absolute factor scores are primarily related to the ribosomal protein small (RPS) subunit and ribosomal protein large (RPL) subunit gene families (Figure 4A). To provide a broader characterization of the module function, we extracted the downstream genes with PIP greater than 0.9 (298 genes) as input to ShinyGO20 for gene set enrichment analysis (Figure 4B). We observed that the most enriched pathway was related to ribosome function (FDR = 9.2×10⁻⁸², 63 genes involved), followed by coronavirus disease (FDR = 2.5×10⁻⁶², 62 genes involved). Inspecting the loadings of these downstream genes, we found nearly all weights were positive, suggesting that knockout of RPS and RPL genes downregulates the expression of those downstream genes. We found multiple elongation factor genes (EEF1G, EEF1A1, EEF1B2, EIF4B, EIF3L) among the leading downstream genes, which are known to be involved in ribosome function. Additionally, recent studies have suggested that decreased expression of elongation factor genes is associated with less severe conditions among COVID-19 patients.21,22 We repeated the pathway analysis for each latent factor using the corresponding loadings of genes with PIP greater than 0.9 (see Figures S14–S22).
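The gene lists passed to ShinyGO can be extracted directly from the inferred PIP matrix. A minimal sketch, assuming pip is a (K, P) array and gene_names a length-P list (both names illustrative):

    import numpy as np

    def genes_for_factor(pip, gene_names, k, threshold=0.9):
        """Downstream genes with PIP above threshold in factor k,
        formatted as input for enrichment analysis."""
        idx = np.where(pip[k] > threshold)[0]
        return [gene_names[i] for i in idx]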

Figure 4.

Dominant factor scores in the top component link to the RPL and RPS families, with subsequent gene enrichment in ribosome and coronavirus disease pathways

The perturbations with top factor scores in the first component mostly belong to the RPL and RPS families (A), and the downstream genes in the same component are enriched for ribosome and coronavirus disease pathways (B). Each point in (A) represents the latent factor value of one perturbation. The top 9 points as well as the control group are labeled in the plot and colored red and blue, respectively. For the gene set enrichment analysis, we input the downstream genes with PIP > 0.9 and show the top enriched pathways with −log(FDR) and the number of genes included in the corresponding pathways.

To compare with sparse PCA, we performed the same pathway analysis on factor loadings and assessed enrichments. From the sparse PCA model with the largest PVE (alpha = 1), we observed the components identified by sparse PCA to be less enriched with biological pathways than those from SuSiE PCA (80 unique enriched pathways in sparse PCA versus 88 pathways in SuSiE PCA), and the top enriched pathways such as ribosome and coronavirus disease are less significant and contain fewer selected genes (FDR = 1.4×10⁻³³, 35 genes; FDR = 2.9×10⁻¹⁸, 29 genes). We noticed that, when alpha equals 17, sparse PCA achieves a total PVE (10.91%) similar to that of our model (10.71%) but with a lower sparsity level (Figure S23). We then extracted the top 300 genes with non-zero weights from sparse PCA with alpha = 17, performed the gene set enrichment analysis, and found significance levels similar to those from SuSiE PCA (Figure S24). However, this required post hoc tuning, suggesting that SuSiE PCA is better suited to sparse data analysis while maintaining the power to perform feature selection in a statistically principled manner.

Overall, we find distinct biological functions identified by each component, with groupings consistent with those reported in previous works.23,24,25

Discussion

In this paper, we propose SuSiE PCA, an efficient Bayesian variable selection approach to PCA for structured biological data. The sparsity of the loading matrix is achieved by restricting the number of features associated with each factor to be at most L. Through simulations and real-data application, we find that SuSiE PCA outperforms existing approaches to sparse latent structure learning in identifying contributing features, while maintaining a more efficient run time.

There are several advantages of SuSiE PCA compared to other sparse factor models. First, SuSiE PCA generates PIPs for each feature that quantify the uncertainty of the selected features, which cannot be provided by other sparse models such as regularization-based sparse PCA6 or Bayesian treatments of PPCA, and assessing the selected variables based on a probability is more principled and convenient than using weights. Second, PIPs are capable of selecting more signals with high confidence. In simulations, we demonstrated that using weights for variable selection from SuSiE PCA, sparse PCA, and EBMF can deliver high specificity (low FDR) but low sensitivity as the cutoff value increases, while using PIPs as the selection tool maintains high sensitivity for any positive cutoff value between 0 and 1. Third, SuSiE PCA provides a more precise estimate of the loadings and higher prediction accuracy, even in the mis-specified case, as we impose a probabilistic distribution over the loadings that enables much more accurate inference of the posterior distribution. Finally, the inference procedure of SuSiE PCA scales with K and L, which are typically set to be much smaller than the feature dimension P; therefore, it is scalable to high-dimensional data and imposes lower computational demands. We implement SuSiE PCA with the JAX library developed by Google18 to enable fast convergence on CPU, GPU, or TPU. The comparison of run time among SuSiE PCA, sparse PCA, and EBMF is listed in Table 1.

Table 1.

Comparison of mean and standard deviation of running time (seconds) between models

Model(a)       Simulation(b)    GTEx Z score    Perturb-seq
SuSiE PCA      3.14 (0.49)      1.20            68.11
Sparse PCA     51.96 (33.50)    41.22           1213.21
EBMF           39.83 (5.80)     498.60          243.03

(a) All run time data in the table are based on analyses performed on the same CPU for consistency: an Apple M2 chip with 16 GB memory.

(b) Run time for simulation is recorded under the simulation setting in Figure 1, i.e., N = 1000, P = 6000, K = 4, L = 40; the average run time and corresponding standard deviation are computed over 100 simulations. A more detailed run time comparison in simulation is presented in Figure S4.

In SuSiE PCA, two parameters, the number of components K and the number of single effects L, need to be prespecified by the user before fitting the model. The selection of K follows a similar strategy as in conventional PCA, often informed by the researcher's domain expertise. The merit of SuSiE PCA is that when excess latent components are specified, the variance explained by those components will be extremely small, with a near-zero count of single effects exhibiting PIP > 0.9. This effectively allows one to start with a relatively large K and subsequently inspect the PVE and PIPs in each component to decide the most suitable K.

The choice of L determines the sparsity in SuSiE PCA. Although SuSiE PCA only allows one common L to be specified across all factors, the number of non-zero effects captured in each factor can vary and is learned from the data. This is because we treat the inverse variance τ_0kl of the lth single effect in factor z_k as a random variable. As Algorithm 1 demonstrates, the maximum likelihood estimate (MLE) of τ_0kl at step 3 is derived before inference of the other parameters. When the L specified in the model is greater than the true number of signals associated with a certain factor k, the MLE of τ_0kl will be extremely large for the excess single effects, which then shrinks the corresponding w_kl and PIP to 0 or close to 0 and thereby removes the redundant single effects from the model. For example, in the simulation and GTEx Z score data analyses, we have shown that when the user-specified L is larger than the data-generating L, the automatic relevance determination (ARD)-like prior over loadings shrinks the excess effects toward 0, adding little additional error in prediction or in the overall mean squared error (MSE) relative to the true loading matrix. Although it may seem that L could simply be set to the total number of variables (and thus be "shut off" when necessary), we emphasize that this still comes with an added computational cost, albeit a low one given the scalability of our approach. Therefore, we allow users to specify their own choice of L. From this point of view, without prior knowledge of the data, one can specify a relatively large L during the initial model fitting and then examine the estimates of τ_0kl to explore how many single effects are reasonable for the dataset.

Algorithm 1. Algorithm for SuSiE PCA.

Require: Data X (N×P)
Require: Number of factors K; number of single effects in each factor L
Require: Initialized variational parameters (μ_Z, Σ_Z; μ_wkl, σ²_wkl; α_kl); hyperparameters τ, τ_0kl, for l = 1, …, L; k = 1, …, K
Require: Update equations for the different variables: F_Z; F_wkl; F_αkl; F_τ0; F_τ
Require: Function to compute the ELBO: F_ELBO
Ensure: ELBO increases

1: repeat
2:   W ← Σ_{l=1}^{L} μ_w ∘ α, where μ_w and α are (L, K, P) arrays arranging μ_wkl and α_kl
3:   τ_0 ← F_τ0(μ_w, σ_w, α)
4:   for k in 1, …, K do
5:     E[R̄_kl^⊤ Z_k]^(1) = X^⊤ μ_zk − Σ_{k′≠k} E[w_k′] E[Z_k′^⊤ Z_k]  ▷ compute the first two terms of E[R̄_kl^⊤ Z_k]
6:     for l in 1, …, L do
7:       E[w_k,−l] = w_k − μ_wkl ∘ α_kl  ▷ remove the lth effect from w_k
8:       E[R̄_kl^⊤ Z_k] = E[R̄_kl^⊤ Z_k]^(1) − E[w_k,−l] E[Z_k^⊤ Z_k]  ▷ complete the calculation of E[R̄_kl^⊤ Z_k]
9:       (μ_wkl, σ²_wkl) ← F_wkl(E[R̄_kl^⊤ Z_k], E[Z_k^⊤ Z_k], τ_0kl, τ)
10:      α_kl ← F_αkl(E[R̄_kl^⊤ Z_k], μ_wkl, σ²_wkl)
11:      w_k = E[w_k,−l] + μ_wkl ∘ α_kl  ▷ update w_k
12:    end for
13:  end for
14:  (μ_Z, Σ_Z) ← F_Z(X, τ, E[W])
15:  τ ← F_τ(X, τ, E[W], E[Z])
16:  ELBO ← F_ELBO
17: until the ELBO convergence criterion is satisfied
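To make the update loop concrete, below is a self-contained numpy sketch of the coordinate updates in Algorithm 1 (Equations 16, 17, 18, 20, 21, 22, and 23). It is an illustration written for this description, not the package's JAX implementation; the variable names are ours, and it omits the numerical safeguards and ELBO monitoring of the real code.

    import numpy as np
    from scipy.special import softmax

    def susie_pca_numpy(X, K, L, n_iter=50, seed=0):
        """Illustrative numpy version of the Algorithm 1 updates."""
        N, P = X.shape
        rng = np.random.default_rng(seed)
        mu_z = rng.normal(size=(N, K))          # E[Z]
        Sigma_z = np.eye(K)                      # column covariance of Q(Z)
        mu_w = np.zeros((L, K, P))               # posterior conditional means
        s2_w = np.ones((L, K))                   # posterior conditional variances
        alpha = np.full((L, K, P), 1.0 / P)      # Q(gamma_kl) probabilities
        tau = 1.0                                # residual precision 1/sigma^2
        log_pi = -np.log(P)                      # uniform prior log pi_i
        for _ in range(n_iter):
            W = np.einsum("lkp,lkp->kp", mu_w, alpha)            # E[W]
            # MLE of tau_0kl (Eq. 22); sum_i alpha_kli = 1 in the numerator
            tau0 = 1.0 / np.einsum("lkp->lk",
                                   alpha * (mu_w**2 + s2_w[..., None]))
            EZZ = N * Sigma_z + mu_z.T @ mu_z                    # E[Z^T Z]
            for k in range(K):
                # first two terms of E[Rbar_kl^T Z_k] (Algorithm 1, step 5)
                base = X.T @ mu_z[:, k] - W.T @ EZZ[:, k] + W[k] * EZZ[k, k]
                for l in range(L):
                    w_minus = W[k] - mu_w[l, k] * alpha[l, k]    # drop lth effect
                    ERZ = base - w_minus * EZZ[k, k]             # E[Rbar_kl^T Z_k]
                    s2 = 1.0 / (tau * EZZ[k, k] + tau0[l, k])    # Eq. 20
                    mu = tau * s2 * ERZ                          # Eq. 18
                    # Eq. 21: alpha = softmax(log pi - log N(0 | mu, s2))
                    a = softmax(log_pi + 0.5 * np.log(2 * np.pi * s2)
                                + mu**2 / (2 * s2))
                    mu_w[l, k], s2_w[l, k], alpha[l, k] = mu, s2, a
                    W[k] = w_minus + mu * a                      # refresh E[w_k]
            # E[W W^T] = E[W] E[W]^T + diag(tr V[w_k])
            Ew2 = np.einsum("lkp->k", alpha * (mu_w**2 + s2_w[..., None]))
            trV = Ew2 - np.einsum("lkp->k", (mu_w * alpha) ** 2)
            EWW = W @ W.T + np.diag(trV)
            Sigma_z = np.linalg.inv(tau * EWW + np.eye(K))       # Eq. 17
            mu_z = tau * X @ W.T @ Sigma_z                       # Eq. 16
            EZZ = N * Sigma_z + mu_z.T @ mu_z
            # MLE of tau (Eq. 23)
            tau = N * P / (np.sum(X**2) - 2 * np.trace(W @ X.T @ mu_z)
                           + np.trace(EZZ @ EWW))
        pip = 1.0 - np.prod(1.0 - alpha, axis=0)                 # Eq. 8
        return mu_z, W, pip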

Overall, SuSiE PCA provides a flexible approach to high-dimensional biological data with a low-rank structure and allows for feature selection in sparse PCA.

Limitations of the study

One limitation of SuSiE PCA is that, under the mean-field approximation, all the posteriors, i.e., Q(W) and Q(Z), are factorized to facilitate inference. Under this factorization, estimation of the mean terms (i.e., E[W] and E[Z]) is approximately unbiased.26 However, it produces overconfident covariance structures within variables (W, Z, etc.) due to the assumed independence across Q functions.

STAR★Methods

Key resources table

REAGENT or RESOURCE | SOURCE | IDENTIFIER

Deposited data

The Genotype-Tissue Expression Z score data | Wei Wang and Matthew Stephens, Empirical Bayes Matrix Factorization, 2021 | https://github.com/ysfoo/sparsefactor
Genome-scale Perturb-seq experiment data | Joseph Replogle and Jonathan Weissman, Mapping information-rich genotype-phenotype landscapes with genome-scale Perturb-seq, 2022 | https://plus.figshare.com/articles/dataset/_Mapping_information-rich_genotype-phenotype_landscapes_with_genome-scale_Perturb-seq_Replogle_et_al_2022_processed_Perturb-seq_datasets/20029387

Software and algorithms

Scikit-learn library: sparse principal component analysis | Python library scikit-learn | https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.SparsePCA.html; RRID:SCR_002577
R package: Factors and Loadings by Adaptive SHrinkage in R (flashr) | Wei Wang and Matthew Stephens, Empirical Bayes Matrix Factorization | https://stephenslab.github.io/flashr/index.html
Variational algorithm in SuSiE PCA | This paper | http://www.github.com/mancusolab/susiepca
Python 3.9 | Python Software Foundation | https://www.python.org/
R 4.0.0 | R Project | https://www.r-project.org
ShinyGO v0.77 | Ge SX, Jung D, and Yao R | http://bioinformatics.sdstate.edu/go/; RRID:SCR_019213

Resource availability

Lead contact

Further information and requests for resources should be directed to and will be fulfilled by the lead contact, Dong Yuan (dongyuan@usc.edu).

Material availability

This study did not generate new unique materials or reagents.

Experimental model and subject details

This study did not include experiments with a specific model or subject.

Method details

Overview of SuSiE PCA

In this section, we give a detailed description of SuSiE PCA. Let X be the observed N×P data matrix, Z the N×K matrix of latent vectors, and W the K×P loading matrix. We denote the normal distribution with mean μ and variance σ² as N(μ, σ²), the multinomial distribution with n choices and probabilities π as Multi(n, π), and the matrix normal distribution with dimension N×K, mean M, row covariance R, and column covariance C as MN_{N,K}(M, R, C). We denote by e_k the basis vector whose kth coordinate is 1 and 0 elsewhere. The sampling distribution of X under the SuSiE PCA model is given by,

$X \mid Z, W, \sigma^2 \sim \mathcal{MN}_{N,P}(ZW, I_N, \sigma^2 I_P)$ (Equation 1)
$Z \sim \mathcal{MN}_{N,K}(0, I_N, I_K)$ (Equation 2)
$W = \sum_{k=1}^{K} e_k w_k^\top$ (Equation 3)
$w_k = \sum_{l=1}^{L} \mathbf{w}_{kl}$ (Equation 4)
$\mathbf{w}_{kl} = w_{kl}\,\gamma_{kl}$ (Equation 5)
$w_{kl} \mid \sigma_{0kl}^2 \sim N(0, \sigma_{0kl}^2)$ (Equation 6)
$\gamma_{kl} \mid \pi \sim \operatorname{Multi}(1, \pi)$ (Equation 7)

where $w_k$ corresponds to the kth row of W and contains at most L non-zero elements, determined by the sum of L single-effect vectors $\mathbf{w}_{kl}$. Each single-effect vector is the product of a scalar random effect $w_{kl}$ and an indicator vector $\gamma_{kl}$, which assigns the effect to a single feature with prior probabilities $\pi = \frac{1}{P}\mathbf{1}$.

Posterior inclusion probability

One of the distinguishing features that the SuSiE model13 provides is the posterior inclusion probability (PIP). The PIP reflects the posterior probability that a given variable has a non-zero effect given the observed data. Here we extend the PIP definition to include latent factors. Specifically, given the variational parameters $\alpha_{kl}$, we define the PIP that the ith variable has a non-zero effect in the kth latent component as,

$\text{PIP}_{ki} := \Pr(w_{ki} \neq 0 \mid X) = 1 - \prod_{l=1}^{L}(1 - \alpha_{kli})$ (Equation 8)

Similarly, a level-ρ credible set (CS) refers to a subset of variables that cumulatively explains at least ρ of the posterior density. Here, we define factor-specific level-ρ CSs, which can be computed for each $\alpha_{kl}$ independently, resulting in K×L total level-ρ credible sets. This lets us reflect the uncertainty in the variables identified as explaining each single effect within each latent factor.
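Both quantities are simple functions of the variational parameters $\alpha_{kl}$. A minimal sketch, assuming alpha is arranged as an (L, K, P) array (a layout we adopt here purely for illustration):

    import numpy as np

    def pip_from_alpha(alpha):
        """Equation 8: PIP_ki = 1 - prod_l (1 - alpha_kli)."""
        return 1.0 - np.prod(1.0 - alpha, axis=0)

    def credible_set(alpha_kl, rho=0.9):
        """Smallest set of variables whose alpha_kl values sum to >= rho."""
        order = np.argsort(alpha_kl)[::-1]
        csum = np.cumsum(alpha_kl[order])
        return order[: np.searchsorted(csum, rho) + 1]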

Variational inference in SuSiE PCA

We seek to perform inference on the model variables Z, $\mathbf{w}_{kl}$, and $\gamma_{kl}$ conditional on the observed data X; however, the marginal likelihood is intractable to compute, and therefore we cannot evaluate the posterior exactly. While sampling-based approaches such as Markov chain Monte Carlo (MCMC) methods provide a numerical approximation of the exact posterior distribution,27 they often lack computational efficiency in high-dimensional settings. As an alternative, we leverage recent advances in variational inference, which provides an analytical approximation to the posterior distribution28 and remains computationally efficient.

Briefly, to approximate the conditional distribution of the latent variables Z given the observed samples X, variational methods first impose a family of densities over the latent variables, Q(Z), usually predefined as known distributions parameterized by a set of variational parameters. The goal is then to infer those variational parameters such that the variational distribution Q(Z) is as similar as possible to the true posterior distribution P(Z|X). A quantity commonly used to measure dissimilarity between distributions is the Kullback-Leibler divergence $D_{KL}(Q \,\|\, P)$.29 However, because the KL divergence contains the unknown true posterior distribution P(Z|X), it cannot be computed directly. Instead, we can show that the log-likelihood of the data, log P(X), can be decomposed as:

$\log P(X) = D_{KL}(Q \,\|\, P) + \mathcal{L}(Q)$ (Equation 9)

where $\mathcal{L}(Q) = E_Q[\log P(Z, X) - \log Q(Z)]$ is known as the evidence lower bound (ELBO). Since log P(X) is a constant with respect to the variational parameters, minimizing the KL divergence is equivalent to maximizing the ELBO. Because the ELBO does not contain the unknown posterior distribution, it is tractable to compute and maximize over the variational parameters.

Mean-field approximation

The mean field approximation30 is a common approach to finding the optimal solution that maximizes the ELBO. The basic assumption is that the variational distribution factorizes into independent components. Using the calculus of variations, one can then show that the distribution $Q_j(z_j)$ minimizing the KL divergence for each factor $z_j$ can be expressed as:

$\ln Q_j(z_j \mid X) = E_{i \neq j}[\ln P(Z, X)] + \text{constant}$ (Equation 10)

Applying the mean-field approximation to SuSiE PCA, the approximate posterior is given by,

$Q(Z, W) = Q(Z)\,Q(W)$ (Equation 11)
$Q(W) = \prod_{k=1}^{K} \prod_{l=1}^{L} Q(\mathbf{w}_{kl} \mid \gamma_{kl})\,Q(\gamma_{kl})$ (Equation 12)

Equation 11 factorizes the variational densities of the latent variables Z and the loading matrix W into independent parts. We further assume that the variational distributions of the loadings $\mathbf{w}_{kl}$ across the L single effects within each factor are independent as well, leading to Equation 12. For ease of notation we first define $\tau = 1/\sigma^2$ and $\tau_{0kl} = 1/\sigma_{0kl}^2$. Based on this factorization, the complete-data log-likelihood of SuSiE PCA is given by:

\begin{aligned}
\ell_c(\sigma^2, \sigma_0^2, \pi \mid X, Z, W) &= \log \Pr(X \mid Z, W, \sigma^2) + \log \Pr(Z) + \log \Pr(W \mid \sigma_0^2, \pi) \\
&= \log \mathcal{MN}_{n,p}(X \mid ZW, I_n, \sigma^2 I_p) + \log \mathcal{MN}_{n,k}(Z \mid 0, I_n, I_k) \\
&\quad + \sum_{l=1}^{L}\sum_{k=1}^{K}\left[\log \operatorname{Multi}(\gamma_{kl} \mid 1, \pi) + \log N(w_{kl} \mid 0, \sigma_{0kl}^2)\right]
\end{aligned}

Helpful definitions

Before proceeding to the full derivation of the variational distributions of the parameters Z, $\mathbf{w}_{kl}$, and $\gamma_{kl}$, we first give some helpful definitions, including expansions of the first and second moments of W and Z.

The second moment of Z is:

\begin{aligned}
E[Z^\top Z] &= \operatorname{tr}(I_n)\,\Sigma_Z + E[Z]^\top E[Z] = n\Sigma_Z + E[Z]^\top E[Z] \\
E[Z_k^\top Z_k] &= \operatorname{tr}(V[Z_k]) + E[Z_k]^\top E[Z_k] = n(\Sigma_Z)_{kk} + E[Z_k]^\top E[Z_k]
\end{aligned}

The first and second moments of wk are listed as follows:

\begin{aligned}
E[\mathbf{w}_{kl} \mid \gamma_{kl}] &= p\text{-vector of posterior conditional means} \\
V[\mathbf{w}_{kl} \mid \gamma_{kl}] &= p\text{-vector of posterior conditional variances} \\
E[w_k] &= E\Big[\textstyle\sum_l \mathbf{w}_{kl}\Big] = \textstyle\sum_l E[\mathbf{w}_{kl}], \qquad E[\mathbf{w}_{kl}] = E[\mathbf{w}_{kl} \mid \gamma_{kl}] \circ E[\gamma_{kl}] \\
V[w_k] &= V\Big[\textstyle\sum_l \mathbf{w}_{kl}\Big] = \textstyle\sum_l V[\mathbf{w}_{kl}] \\
V[\mathbf{w}_{kl}] &= E[\mathbf{w}_{kl}\mathbf{w}_{kl}^\top] - E[\mathbf{w}_{kl}]E[\mathbf{w}_{kl}]^\top = \operatorname{diag}\!\big(E[w_{kl}^2 \mid \gamma_{kl}] \circ E[\gamma_{kl}]\big) - E[\mathbf{w}_{kl}]E[\mathbf{w}_{kl}]^\top \\
\operatorname{diag}(V[\mathbf{w}_{kl}]) &= E[w_{kl}^2 \mid \gamma_{kl}] \circ E[\gamma_{kl}] - \big(E[\mathbf{w}_{kl} \mid \gamma_{kl}] \circ E[\gamma_{kl}]\big)^2 \\
E[w_k^\top w_k] &= \operatorname{tr}(V[w_k]) + E[w_k]^\top E[w_k] \\
E[w_{kl}^2] &= \big(E^2[\mathbf{w}_{kl} \mid \gamma_{kl}] + V[\mathbf{w}_{kl} \mid \gamma_{kl}]\big)^\top E[\gamma_{kl}]
\end{aligned}

The first and second moments of W are listed as follows:

\begin{aligned}
E[W] &= E\Big[\textstyle\sum_k e_k w_k^\top\Big] = \textstyle\sum_k e_k E[w_k]^\top \\
E[WW^\top] &= E\Big[\textstyle\sum_k \sum_{k'} e_k w_k^\top w_{k'} e_{k'}^\top\Big] = \textstyle\sum_k \sum_{k'} e_k e_{k'}^\top E[w_k^\top w_{k'}] \\
&= E[W]E[W]^\top + \textstyle\sum_k e_k e_k^\top \big(E[w_k^\top w_k] - E[w_k]^\top E[w_k]\big) \\
&= E[W]E[W]^\top + \operatorname{diag}\!\big(\operatorname{tr}(V[w_1]), \ldots, \operatorname{tr}(V[w_K])\big)
\end{aligned}

Other terms in the likelihood function:

\begin{aligned}
\log \mathcal{MN}_{n,p}(X \mid ZW, I_n, \sigma^2 I_p) &= -\tfrac{1}{2\sigma^2}\operatorname{tr}\!\left[(X - ZW)^\top (X - ZW)\right] - \tfrac{np}{2}\log(2\pi\sigma^2) \\
\log \mathcal{MN}_{n,k}(Z \mid 0, I_n, I_k) &= -\tfrac{1}{2}\operatorname{tr}\!\left[Z^\top Z\right] - \tfrac{nk}{2}\log(2\pi) \\
\log \operatorname{Multi}(\gamma_{kl} \mid 1, \pi) &= \sum_{i=1}^{p} \gamma_{kli} \log(\pi_i) \\
\log N(w_{kl} \mid 0, \sigma_0^2) &= -\tfrac{1}{2\sigma_0^2} w_{kl}^2 - \tfrac{1}{2}\log(2\pi\sigma_0^2) \\
\operatorname{tr}\!\left[E_{\neg Z}\big[(X - ZW)^\top (X - ZW)\big]\right] &= \operatorname{tr}(X^\top X) - 2\operatorname{tr}\!\big(E[W] X^\top Z\big) + \operatorname{tr}\!\big(Z^\top Z\, E[WW^\top]\big) \\
\operatorname{tr}\!\left[E_{\neg W}\big[(X - ZW)^\top (X - ZW)\big]\right] &= \operatorname{tr}(X^\top X) - 2\operatorname{tr}\!\big(X^\top E[Z] W\big) + \operatorname{tr}\!\big(E[Z^\top Z]\, W W^\top\big) \\
E\!\left[\operatorname{tr}\big((X - ZW)^\top (X - ZW)\big)\right] &= \operatorname{tr}(X^\top X) - 2\operatorname{tr}\!\big(X^\top E[Z] E[W]\big) + \operatorname{tr}\!\big(E[Z^\top Z]\, E[WW^\top]\big) \\
\bar{R}_{kl} &:= X - E[Z]\Big(\sum_{k' \neq k} e_{k'} E[w_{k'}]^\top + \sum_{l' \neq l} e_k E[\mathbf{w}_{kl'}]^\top\Big) \\
&= X - \sum_{k' \neq k} E[Z_{k'}] E[w_{k'}]^\top - \sum_{l' \neq l} E[Z_k] E[\mathbf{w}_{kl'}]^\top
\end{aligned}

Derivation of model parameters

In this section, we present the detailed derivation of the optimal variational distributions of the variables Z, $\mathbf{w}_{kl}$, and $\gamma_{kl}$. First, we derive log Q(Z):

\begin{aligned}
\log Q(Z) &= E_{\neg Z}\!\left[\ell_c(\sigma^2, \sigma_0^2, \pi \mid X, Z, W)\right] \\
&= E_{\neg Z}\!\left[\log \mathcal{MN}_{n,p}(X \mid ZW, I_n, \sigma^2 I_p)\right] + \log \mathcal{MN}_{n,k}(Z \mid 0, I_n, I_k) + O(1) \\
&= -\tfrac{1}{2}\Big[\operatorname{tr}\!\big(Z\big(\tau E[WW^\top] + I_k\big)Z^\top\big) - \operatorname{tr}\!\big(\tau X E[W]^\top Z^\top\big) - \operatorname{tr}\!\big(\tau Z E[W] X^\top\big)\Big] + O(1) \\
&= -\tfrac{1}{2}\operatorname{tr}\!\Big(\Sigma_Z^{-1}\big(Z - \mu_Z\big)^\top \big(Z - \mu_Z\big)\Big) + O(1), \quad \text{with } \Sigma_Z = \big(\tau E[WW^\top] + I_k\big)^{-1} \text{ and } \mu_Z = \tau X E[W]^\top \Sigma_Z \\
\Rightarrow Q(Z) &= \mathcal{MN}_{n,k}(Z \mid \mu_Z, I_n, \Sigma_Z)
\end{aligned}

Second, we derive $\log Q(w_{kl} \mid \gamma_{kli} = 1)$:

\begin{aligned}
\log Q(w_{kl} \mid \gamma_{kli} = 1) &= -\tfrac{\tau}{2} E_{\neg w_{kl}}\!\left[\operatorname{tr}\!\big((\bar{R}_{kl} - Z_k \mathbf{w}_{kl}^\top)^\top (\bar{R}_{kl} - Z_k \mathbf{w}_{kl}^\top)\big)\right] - \tfrac{\tau_{0kl}}{2} w_{kl}^2 + O(1) \\
&= -\tfrac{1}{2}\Big[\big(\tau E[Z_k^\top Z_k] + \tau_{0kl}\big) w_{kl}^2 - 2\tau\, E[\bar{R}_{kl}^\top Z_k]_i\, w_{kl}\Big] + O(1) \\
&= -\tfrac{1}{2\sigma_{w_{kl}}^2}\big(w_{kl} - \mu_{w_{kli}}\big)^2 + O(1) = \log N\!\big(\mu_{w_{kli}}, \sigma_{w_{kl}}^2\big), \\
&\quad \text{where } \sigma_{w_{kl}}^2 = \big(\tau E[Z_k^\top Z_k] + \tau_{0kl}\big)^{-1} \text{ and } \mu_{w_{kli}} = \tau\,\sigma_{w_{kl}}^2\, E[\bar{R}_{kl}^\top Z_k]_i
\end{aligned}

Notice that we can update $\mathbf{w}_{kl}$ for all features at once:

$\mu_{w_{kl}} = \tau\,\sigma_{w_{kl}}^2\, E[\bar{R}_{kl}^\top Z_k], \qquad \Sigma_{w_{kl}} = \sigma_{w_{kl}}^2 I_p$

Finally, we derive $\log Q(\gamma_{kl})$. Note that $\tau \bar{R}_{kl}^\top E[Z_k] = E[w_{kl} \mid \gamma_{kl}]/\sigma_{w_{kl}}^2 = \mu_{w_{kl}}/\sigma_{w_{kl}}^2$.

\begin{aligned}
\log Q(\gamma_{kli} = 1) &= E_{\neg \gamma_{kl}}\!\left[\ell_c(\sigma^2, \sigma_0^2, \pi \mid X, Z, W)\right] + \log \operatorname{Multi}(\gamma_{kl} \mid \pi) + O(1) \\
&= \tau\, \bar{R}_{kli}^\top E[Z_k]\, E[w_{kl} \mid \gamma_{kli} = 1] - \tfrac{\tau}{2} E[Z_k^\top Z_k]\, E[w_{kl}^2 \mid \gamma_{kli} = 1] - \tfrac{\tau_{0kl}}{2} E[w_{kl}^2 \mid \gamma_{kli} = 1] + \log \pi_i + O(1) \\
&= \tfrac{1}{\sigma_{w_{kl}}^2} E[w_{kl} \mid \gamma_{kli} = 1]^2 - \tfrac{1}{2\sigma_{w_{kl}}^2} E[w_{kl}^2 \mid \gamma_{kli} = 1] + \log \pi_i + O(1) \\
&= \tfrac{1}{2\sigma_{w_{kl}}^2} E[w_{kl} \mid \gamma_{kli} = 1]^2 + \log \pi_i + O(1) \\
\log \tilde{\alpha}_{kli} &= \log \pi_i - \log N\!\big(0 \mid \mu_{w_{kli}}, \sigma_{w_{kl}}^2\big) \\
Q(\gamma_{kl}) &= \operatorname{Multi}\!\big(1, \alpha_{kl} = \operatorname{softmax}(\log \tilde{\alpha}_{kl})\big)
\end{aligned}

In summary, the optimal variational distributions of the model parameters are:

$Q(Z) := \mathcal{MN}_{n,k}(Z \mid \mu_Z, I_n, \Sigma_Z)$ (Equation 13)
$Q(\mathbf{w}_{kl} \mid \gamma_{kl}) := N(\mu_{w_{kl}}, \sigma_{w_{kl}}^2)$ (Equation 14)
$Q(\gamma_{kl}) := \operatorname{Multi}(1, \alpha_{kl})$ (Equation 15)

The corresponding update rules for variational parameters from Q(·) can be expressed as,

$\mu_Z = \tau X E[W]^\top \Sigma_Z$ (Equation 16)
$\Sigma_Z = \big(\tau E[WW^\top] + I_k\big)^{-1}$ (Equation 17)
$\mu_{w_{kl}} = \tau\,\sigma_{w_{kl}}^2\, E[\bar{R}_{kl}^\top Z_k]$ (Equation 18)
$\Sigma_{w_{kl}} = \sigma_{w_{kl}}^2 I_p$ (Equation 19)
$\sigma_{w_{kl}}^2 = \big(\tau E[Z_k^\top Z_k] + \tau_{0kl}\big)^{-1}$ (Equation 20)
$\alpha_{kl} = \operatorname{softmax}\!\big(\log \pi - \log N(0 \mid \mu_{w_{kl}}, \sigma_{w_{kl}}^2)\big)$ (Equation 21)

Derivation of evidence lower bound (ELBO)

The ELBO provides a natural criterion for evaluating model performance during training, and also provides a means to perform hyperparameter optimization for the model precision parameters τ and τ_0 (or, equivalently, the variances). Given the above definitions for Q, we derive the ELBO for SuSiE PCA as

\begin{aligned}
\text{ELBO}(W, Z) &= E_Q\!\left[\log \Pr(X, Z, W) - \log Q(Z, W)\right] \\
&= E_Q\!\left[\log \Pr(X \mid Z, W)\right] + E_Q\!\left[\log \Pr(Z, W) - \log Q(Z, W)\right] \\
&= E_Q\!\left[\log \Pr(X \mid Z, W)\right] + E_{Q(Z)}\!\left[\log \Pr(Z) - \log Q(Z)\right] + E_{Q(W, \Gamma)}\!\left[\log \Pr(W, \Gamma) - \log Q(W, \Gamma)\right]
\end{aligned}

Based on the above derivation, the ELBO decomposes into three parts. The first term is the expectation of the data log-likelihood with respect to all model parameters:

\begin{aligned}
E_Q\!\left[\log \Pr(X \mid Z, W, \Gamma)\right] &= E_Q\!\left[-\tfrac{1}{2\sigma^2}\operatorname{tr}\!\big[(X - ZW)^\top (X - ZW)\big] - \tfrac{np}{2}\log(2\pi\sigma^2)\right] \\
&= -\tfrac{1}{2\sigma^2}\Big[\operatorname{tr}(X^\top X) - 2\operatorname{tr}\!\big(X^\top E[Z] E[W]\big) + \operatorname{tr}\!\big(E[Z^\top Z]\, E[WW^\top]\big)\Big] - \tfrac{np}{2}\log(2\pi\sigma^2)
\end{aligned}

The second term is the negative KL divergence of Z.

\begin{aligned}
E_{Q(Z)}\!\left[\log \Pr(Z) - \log Q(Z)\right] &= -\tfrac{1}{2}\operatorname{tr}\!\big(E[Z^\top Z]\big) + \tfrac{1}{2}\operatorname{tr}\!\Big[\Sigma_Z^{-1}\big(E[Z^\top Z] - \mu_Z^\top \mu_Z\big)\Big] + \tfrac{N}{2}\log|\Sigma_Z| \\
&= -\tfrac{1}{2}\operatorname{tr}\!\big(E[Z^\top Z]\big) + \tfrac{1}{2}\operatorname{tr}\!\Big[\Sigma_Z^{-1}\big(N\Sigma_Z + \mu_Z^\top \mu_Z - \mu_Z^\top \mu_Z\big)\Big] + \tfrac{N}{2}\log|\Sigma_Z| \\
&= -\tfrac{1}{2}\operatorname{tr}\!\big(E[Z^\top Z]\big) + \tfrac{NK}{2} + \tfrac{N}{2}\log|\Sigma_Z|
\end{aligned}

The last term, the joint negative KL divergence of W and Γ, can be further decomposed as follows:

\begin{aligned}
E_{Q(W,\Gamma)}\!\left[\log \Pr(W, \Gamma) - \log Q(W, \Gamma)\right] &= E_{Q(W,\Gamma)}\!\left[\log \Pr(W \mid \Gamma) - \log Q(W \mid \Gamma)\right] + E_{Q(W,\Gamma)}\!\left[\log \Pr(\Gamma) - \log Q(\Gamma)\right] \\
&= \sum_{k=1}^{K}\sum_{l=1}^{L} E_{Q(\mathbf{w}_{kl}, \gamma_{kl})}\!\left[\log \Pr(\mathbf{w}_{kl} \mid \gamma_{kl}) - \log Q(\mathbf{w}_{kl} \mid \gamma_{kl})\right] + \sum_{k=1}^{K}\sum_{l=1}^{L} E_{Q(\gamma_{kl})}\!\left[\log \Pr(\gamma_{kl}) - \log Q(\gamma_{kl})\right] \\
&= \sum_{k=1}^{K}\sum_{l=1}^{L}\sum_{i=1}^{P} \alpha_{kli}\, E_{Q(w_{kl} \mid \gamma_{kli}=1)}\!\left[\log \Pr(w_{kl} \mid \gamma_{kli}=1) - \log Q(w_{kl} \mid \gamma_{kli}=1)\right] \\
&\quad + \sum_{k=1}^{K}\sum_{l=1}^{L}\sum_{i=1}^{P} E_{\gamma_{kl}}\!\left[\log \Pr(\gamma_{kli}=1) - \log Q(\gamma_{kli}=1)\right]
\end{aligned}

The first expectation term in the last line, $E_{Q(w_{kl} \mid \gamma_{kl})}$, can be expanded as follows:

\begin{aligned}
E_{Q(w_{kl} \mid \gamma_{kli}=1)}\!\left[\log \Pr(w_{kl} \mid \gamma_{kli}=1) - \log Q(w_{kl} \mid \gamma_{kli}=1)\right] &= E\!\left[-\tfrac{\tau_{0kl}}{2} w_{kl}^2 + \tfrac{1}{2\sigma_{w_{kl}}^2}\big(w_{kl} - \mu_{w_{kli}}\big)^2\right] - \tfrac{1}{2}\log\!\big(2\pi/\tau_{0kl}\big) + \tfrac{1}{2}\log\!\big(2\pi \sigma_{w_{kl}}^2\big) \\
&= -\tfrac{\tau_{0kl}}{2}\big(\mu_{w_{kli}}^2 + \sigma_{w_{kl}}^2\big) + \tfrac{1}{2} + \tfrac{1}{2}\log\!\big(\sigma_{w_{kl}}^2 \tau_{0kl}\big)
\end{aligned}

And the second expectation term, $E_{Q(\Gamma)}$, can be written as

\begin{aligned}
E_{Q(\Gamma)}\!\left[\log \Pr(\Gamma) - \log Q(\Gamma)\right] &= \sum_{k=1}^{K}\sum_{l=1}^{L}\sum_{i=1}^{P} E_{Q(\gamma_{kli})}\!\left[\gamma_{kli}\log \pi_i - \gamma_{kli}\log \alpha_{kli}\right] \\
&= \sum_{k=1}^{K}\sum_{l=1}^{L}\sum_{i=1}^{P} \alpha_{kli}\big(\log \pi_i - \log \alpha_{kli}\big)
\end{aligned}

With the explicit form of the ELBO, we obtain the maximum likelihood estimates of the model precision parameters τ and τ_0kl by setting the derivative of the ELBO with respect to each to 0, which yields the closed-form updates,

$\hat{\tau}_{0kl} = \dfrac{\sum_{i=1}^{P} \alpha_{kli}}{\sum_{i=1}^{P} \alpha_{kli}\big(\mu_{w_{kli}}^2 + \sigma_{w_{kli}}^2\big)}$ (Equation 22)
$\hat{\tau} = \dfrac{NP}{\sum_{i,j} X_{ij}^2 - 2\operatorname{tr}\!\big(E[W] X^\top \mu_Z\big) + \operatorname{tr}\!\big(E[Z^\top Z]\, E[WW^\top]\big)}$ (Equation 23)

Simulations

To investigate the performance of SuSiE PCA in variable selection and model fitting, we simulated various datasets controlled by 4 parameters: the sample size N, the number of features P, the number of latent factors K, and the number of single effects L in each factor. For simplicity, we assume L is the same across factors. The simulated data X are generated according to Equation 1, where N = 1000 and P = 6000, and z_k and w_k for k = 1, …, 4 are simulated such that each factor contains only 40 non-zero effects (0.67%), given by,

$z_k \sim N(0, I_N)$ (Equation 24)
$w_{1,i} \sim N(0, 1), \quad i = 1, \ldots, 40$ (Equation 25)
$w_{2,i} \sim N(0, 1), \quad i = 41, \ldots, 80$ (Equation 26)
$w_{3,i} \sim N(0, 2^2), \quad i = 81, \ldots, 120$ (Equation 27)
$w_{4,i} \sim N(0, 1), \quad i = 121, \ldots, 160$ (Equation 28)

with the remaining effects set to zero. Because the scale of the estimated loadings may differ across methods, we normalized each loading matrix with respect to the Frobenius norm, i.e., $\operatorname{tr}(W^\top W) = \operatorname{tr}(\hat{W}^\top \hat{W}) = 1$.

To evaluate the accuracy of SuSiE PCA, we compared inferred posterior expectations with the true latent variables. However, due to the rotational invariance of latent factor models, evaluating loading or latent factor accuracy can be challenging. To account for possible rotation, we leverage the Procrustes transformation,17 which finds an orthogonal rotation matrix P transforming the estimated loading matrix into the space of the true loading matrix. Specifically, given an estimated loading matrix $\hat{W} := E_Q[W]$ under the approximate posterior distribution Q and the true effect matrix W, the "Procrustes norm" is obtained as follows:

$\|W - \hat{W}\|_P^2 := \min_{\{P \,:\, P^{-1} = P^\top\}} \|\hat{W} P - W\|_F^2$ (Equation 29)

Here we perform the Procrustes analysis via the Procrustes package,16 in which P is obtained by performing a singular value decomposition of the matrix $\hat{W}^\top W$ (padding $\hat{W}$ with zeros ensures the operation proceeds correctly).
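The same error can also be computed with scipy's orthogonal Procrustes solver, rotating across the K factors; this is a sketch under that substitution, not the Procrustes package call used in the paper.

    import numpy as np
    from scipy.linalg import orthogonal_procrustes

    def procrustes_error(W_hat, W):
        """Equation 29: squared Frobenius error after the optimal
        orthogonal rotation across the K factors."""
        W_hat = W_hat / np.linalg.norm(W_hat)     # unit Frobenius norm
        W = W / np.linalg.norm(W)
        R, _ = orthogonal_procrustes(W_hat.T, W.T)   # (K, K) rotation
        return np.linalg.norm(W_hat.T @ R - W.T) ** 2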

In addition, we employ the relative root mean squared error (RRMSE) to evaluate the reconstruction loss, defined as

$\text{RRMSE}(\hat{X}, X) = \sqrt{\dfrac{\sum_{i,j}\big(\hat{x}_{ij} - x_{ij}\big)^2}{\sum_{i,j} x_{ij}^2}}$ (Equation 30)
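Equivalently, in Python (a one-liner given the reconstruction X̂, e.g., E[Z]E[W]):

    import numpy as np

    def rrmse(X_hat, X):
        """Equation 30: relative root mean squared reconstruction error."""
        return np.sqrt(np.sum((X_hat - X) ** 2) / np.sum(X ** 2))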

Lastly, to assess generative modeling proficiency, we computed the log-likelihood under held-out data. Specifically, we first trained the model on simulated training data. Next, we computed latent space representations for the testing data under each of the trained models. Lastly, we computed log-likelihoods under normality assumptions given the latent representations and learned loadings and parameters.

For model comparison, we also evaluated the performance of sparse PCA6 and empirical Bayes matrix factorization (EBMF, a recently described variational approach)12 on the same simulated datasets with the same K, and compared model performance with SuSiE PCA via the criteria described above.

GTEx Z score dataset

To illustrate the application of SuSiE PCA in genetic research, we downloaded the Genotype-Tissue Expression (GTEx)14 summary statistics data, composed of Z scores computed from testing the association between genetic variants and gene expression levels across 44 different human tissues.12 The GTEx project collected genotype and gene expression data from 49 non-disease tissues across n = 838 individuals, providing an ideal resource for studying the relationship between genetic variants and gene expression levels.14 Genetic variants that are statistically associated with gene expression levels are referred to as expression quantitative trait loci (eQTLs). To identify eQTLs, the GTEx project tested the association between each nearby genetic variant of a given gene and its expression levels using linear regression, yielding a Z score. The summary data we explored reflect the most significant eQTL (equivalently, the largest absolute Z score among each SNP and gene pair) at each of 16069 genes (rows) from 44 tissues (columns) curated from GTEx v8,12,14 as those 16069 genes show evidence of expression in 44 of the 49 human tissues. To identify tissue-specific components of regulatory genetic features and contributing genes, we applied SuSiE PCA to this Z score matrix with a latent dimension of 27 and 18 single effects per factor. The prior information on the number of latent dimensions comes from Wang et al. (2021),12 who contributed the Z score dataset and ran the EBMF model with 27 factors. To determine the appropriate L for the data, we ran SuSiE PCA with L ranging from 10 to 25 and selected the model at which the increase in the total percent of variance explained (PVE) was less than 5%. PVE measures the amount of signal in the data captured by each latent component; the PVE of factor z_k is calculated as:

$\text{PVE}_k = \dfrac{s_k}{\sum_{k'} s_{k'} + NP/\tau}$ (Equation 31)

where $s_k = \sum_{i=1}^{N}\sum_{j=1}^{P}\big(E[z_{ik}]\,E[w_{kj}]\big)^2$.
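A direct transcription of Equation 31 in Python, where mu_z, W, and tau denote the posterior mean of Z, the posterior mean loadings, and the residual precision (names are ours):

    import numpy as np

    def pve(mu_z, W, tau):
        """Equation 31: per-factor percent of variance explained."""
        N, K = mu_z.shape
        P = W.shape[1]
        s = np.array([np.sum(np.outer(mu_z[:, k], W[k]) ** 2)
                      for k in range(K)])
        return s / (s.sum() + N * P / tau)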

Perturb-seq dataset

We next investigated genome-scale perturb-seq data15 to discover co-regulated gene sets affected by common types of perturbations. Perturb-seq is a cutting-edge technique combining CRISPR-based perturbations with single-cell RNA-sequencing readouts, enabling the investigation of co-regulated gene sets affected by various perturbations. The data originated from perturb-seq experiments performed by Replogle et al.,15 who employed three cell lines: K562 cells, hTERT-immortalized RPE1 cells, and HEK293T cells. CRISPRi technology was used to generate cell lines expressing dCas9-BFP-KRAB (KOX1-derived) for the perturbation experiments. Since we focus our analyses on the expression data from the K562 cell line, we give a brief description of the experiments performed on those cells. Namely, the authors targeted genes expressed in K562 cells, transcription factors, and Cancer Dependency Map common essential genes, and included non-targeting control sgRNAs accounting for 5% of the total library. The gene sets were defined based on a combination of bulk RNA-seq data from ENCODE and 10x Genomics 3′ single-cell RNA-seq data. Libraries were constructed with dual-sgRNA pairs targeting each gene, expressed from tandem U6 expression cassettes in a single lentiviral vector, and ranked based on empirical data and computational predictions. The authors then conducted perturb-seq experiments on the K562 cells, with 2056 distinct knocked-out genes and one non-targeting control group, each measured in an average of 150 different single cells, and quantified the expression levels of 8563 downstream genes in each cell.

The final dataset contains 310385 rows, each representing one perturbation in a specific cell, with the expression levels of 8563 downstream genes as the columns. As an exploratory analysis, we omitted the single-cell-level information and aggregated the expression levels of downstream genes sharing the same perturbation over all cells, which resulted in a "pseudo-bulk" data matrix with 2057 rows and 8563 columns. We then performed SuSiE PCA and sparse PCA to investigate the regulatory modules driven by the common perturbations. To exclude batch effects and other non-genetic covariates, we regressed out the gem group and the mitochondrial percentage from the original expression data before aggregating the expression levels of downstream genes with the same perturbation. Finally, the aggregated data were centered and standardized before input into SuSiE PCA.
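A hedged sketch of this preprocessing pipeline follows; the argument names and covariate layout are illustrative rather than the dataset's actual field names.

    import numpy as np
    import pandas as pd

    def pseudo_bulk(expr, perturbation, covariates):
        """Residualize covariates, average cells by perturbation,
        then center and standardize each gene.
        expr: (cells x genes) DataFrame; perturbation: length-cells labels;
        covariates: (cells x q) numeric array of nuisance covariates."""
        # residualize each gene on the covariates via least squares
        C = np.column_stack([np.ones(len(expr)), covariates])
        beta, *_ = np.linalg.lstsq(C, expr.to_numpy(), rcond=None)
        resid = expr.to_numpy() - C @ beta
        # aggregate cells sharing the same perturbation label
        bulk = (pd.DataFrame(resid, index=perturbation, columns=expr.columns)
                  .groupby(level=0).mean())
        # center and standardize each gene (column)
        return (bulk - bulk.mean()) / bulk.std()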

As a comparison, we also ran sparse PCA with the same K on both datasets. Choosing an appropriate sparsity parameter alpha in sparse PCA is less straightforward than tuning L in SuSiE PCA, because we cannot directly control the number of non-zero genes even with a fairly large alpha (higher sparsity). To make a reasonable comparison, we ran sparse PCA with alpha ranging from 1 to 20 and chose two models to compare: first, the model giving the highest PVE, and second, the model having a level of PVE similar to that of SuSiE PCA.

Acknowledgments

This work was funded by the National Institutes of Health (NIH) under awards R01HG012133, P01CA196569.

Author contributions

D.Y. and N.M. developed the method. D.Y. performed analysis. D.Y. and N.M. edited and approved the manuscript.

Declaration of interests

The authors declare no competing interests.

Published: October 13, 2023

Footnotes

Supplemental information can be found online at https://doi.org/10.1016/j.isci.2023.108181.

Contributor Information

Dong Yuan, Email: dongyuan@usc.edu.

Nicholas Mancuso, Email: nmancuso@usc.edu.

Supplemental information

Document S1. Figures S1–S24
mmc1.pdf (12.4MB, pdf)

Data and code availability

  • This paper analyzes existing, publicly available data, i.e., the GTEx z-score dataset14 and the perturb-seq data.15 These accession numbers for the datasets are listed in the key resources table.

  • All original codes related to SuSiE PCA have been deposited and are publicly available on GitHub (https://github.com/mancusolab/susiepca).

  • Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request.

References

  • 1.Hotelling H. Analysis of a complex of statistical variables into principal components. J. Educ. Psychol. 1933;24:417–441. doi: 10.1037/h0071325. [DOI] [Google Scholar]
  • 2.Patterson N., Price A.L., Reich D. Population Structure and Eigenanalysis. PLoS Genet. 2006;2 doi: 10.1371/journal.pgen.0020190. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.Agrawal A., Chiu A.M., Le M., Halperin E., Sankararaman S. Scalable probabilistic PCA for large-scale genetic variation data. PLoS Genet. 2020;16 doi: 10.1371/journal.pgen.1008773. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4.McVean G. A Genealogical Interpretation of Principal Components Analysis. PLoS Genet. 2009;5 doi: 10.1371/journal.pgen.1000686. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Jolliffe I.T. Springer-Verlag; 1986. Principal Component Analysis. [Google Scholar]
  • 6.Zou H., Hastie T., Tibshirani R. Sparse Principal Component Analysis. J. Comput. Graph Stat. 2006;15:265–286. doi: 10.1198/106186006X113430. [DOI] [Google Scholar]
  • 7.Bishop C. Advances in Neural Information Processing Systems. MIT Press; 1998. Bayesian PCA. [Google Scholar]
  • 8.Guan Y., Dy J. Proceedings of the Twelth International Conference on Artificial Intelligence and Statistics. PMLR; 2009. Sparse Probabilistic Principal Component Analysis; pp. 185–192. [Google Scholar]
  • 9.Ning B. Spike and slab Bayesian sparse principal component analysis. arXiv. 2021 doi: 10.48550/arXiv.2102.00305. Preprint at. [DOI] [Google Scholar]
  • 10.Armagan A., Clyde M., Dunson D. Advances in Neural Information Processing Systems. Curran Associates, Inc.; 2011. Generalized Beta Mixtures of Gaussians. [PMC free article] [PubMed] [Google Scholar]
  • 11.Zhao S., Gao C., Mukherjee S., Engelhardt B.E. Bayesian group factor analysis with structured sparsity. J. Mach. Learn. Res. 2016;17:1–47. [Google Scholar]
  • 12.Wang W., Stephens M. Empirical Bayes matrix factorization. J. Mach. Learn. Res. 2021;22:1–121. [Google Scholar]
  • 13.Wang G., Sarkar A., Carbonetto P., Stephens M. A simple new approach to variable selection in regression, with application to genetic fine mapping. J. R. Stat. Soc. Series B Stat. Methodol. 2020;82:1273–1300. doi: 10.1111/rssb.12388. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.GTEx Consortium. Ardlie K.G., Deluca D.S., Segrè A.V., Sullivan T.J., Young T.R., Gelfand E.T., Trowbridge C.A., Maller J.B., Tukiainen T., et al. The Genotype-Tissue Expression (GTEx) pilot analysis: Multitissue gene regulation in humans. Science. 2015;348:648–660. doi: 10.1126/science.1262110. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.Replogle J.M., Saunders R.A., Pogson A.N., Hussmann J.A., Lenail A., Guna A., Mascibroda L., Wagner E.J., Adelman K., Lithwick-Yanai G., et al. Mapping information-rich genotype-phenotype landscapes with genome-scale Perturb-seq. Cell. 2022;185:2559–2575.e28. doi: 10.1016/j.cell.2022.05.013. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16.Meng F., Richer M., Tehrani A., La J., Kim T.D., Ayers P.W., Heidar-Zadeh F. Procrustes: A python library to find transformations that maximize the similarity between matrices. Comput. Phys. Commun. 2022;276 doi: 10.1016/j.cpc.2022.108334. [DOI] [Google Scholar]
  • 17.Borg I., Groenen P. Springer; 2005. Modern Multidimensional Scaling: Theory and Applications (Springer Series in Statistics). [DOI] [Google Scholar]
  • 18.Bradbury J., Frostig R., Hawkins P., Johnson M.J., Leary C., Maclaurin D., Necula G., Paszke A., VanderPlas J., Wanderman-Milne S., et al. 2018. JAX: Composable Transformations of Python+NumPy Programs. [Google Scholar]
  • 19.Cohn B.A., Cirillo P.M., Christianson R.E. Prenatal DDT Exposure and Testicular Cancer: A Nested Case-Control Study. Arch. Environ. Occup. Health. 2010;65:127–134. doi: 10.1080/19338241003730887. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20.Ge S.X., Jung D., Yao R. ShinyGO: a graphical gene-set enrichment tool for animals and plants. Bioinformatics. 2020;36:2628–2629. doi: 10.1093/bioinformatics/btz931. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21.Amrute J.M., Perry A.M., Anand G., Cruchaga C., Hock K.G., Farnsworth C.W., Randolph G.J., Lavine K.J., Steed A.L. Cell specific peripheral immune responses predict survival in critical COVID-19 patients. Nat. Commun. 2022;13:882. doi: 10.1038/s41467-022-28505-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22.Garg M., Li X., Moreno P., Papatheodorou I., Shu Y., Brazma A., Miao Z. Meta-analysis of COVID-19 single-cell studies confirms eight key immune responses. Sci. Rep. 2021;11 doi: 10.1038/s41598-021-00121-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.Signorile A., Sgaramella G., Bellomo F., De Rasmo D. Prohibitins: A Critical Role in Mitochondrial Functions and Implication in Diseases. Cells. 2019;8:71. doi: 10.3390/cells8010071. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24.Artal-Sanz M., Tsang W.Y., Willems E.M., Grivell L.A., Lemire B.D., van der Spek H., Nijtmans L.G.J. The mitochondrial prohibitin complex is essential for embryonic viability and germline function in Caenorhabditis elegans. J. Biol. Chem. 2003;278:32091–32099. doi: 10.1074/jbc.M304877200. [DOI] [PubMed] [Google Scholar]
  • 25.Artal-Sanz M., Tavernarakis N. Prohibitin couples diapause signalling to mitochondrial metabolism during ageing in C. elegans. Nature. 2009;461:793–797. doi: 10.1038/nature08466. [DOI] [PubMed] [Google Scholar]
  • 26.Opper M., Saad D. MIT Press; 2001. Advanced Mean Field Methods: Theory and Practice. [Google Scholar]
  • 27.Andrieu C., de Freitas N., Doucet A., Jordan M.I. An Introduction to MCMC for Machine Learning. Mach. Learn. 2003;50:5–43. doi: 10.1023/A:1020281327116. [DOI] [Google Scholar]
  • 28.Jordan M.I., Ghahramani Z., Jaakkola T.S., Saul L.K. An Introduction to Variational Methods for Graphical Models. Mach. Learn. 1999;37:183–233. doi: 10.1023/A:1007665907178. [DOI] [Google Scholar]
  • 29.Kullback S., Leibler R.A. On Information and Sufficiency. Ann. Math. Stat. 1951;22:79–86. [Google Scholar]
  • 30.Tanaka T. Advances in Neural Information Processing Systems. MIT Press; 1998. A Theory of Mean Field Approximation. [Google Scholar]


