Author manuscript; available in PMC: 2023 Jan 12.
Published in final edited form as: Res Comput Mol Biol. 2022 Apr 29;13278:86–106. doi: 10.1007/978-3-031-04749-7_6

A Fast, Provably Accurate Approximation Algorithm for Sparse Principal Component Analysis Reveals Human Genetic Variation Across the World

Agniva Chowdhury 1,*, Aritra Bose 2,*, Samson Zhou 3,*, David P Woodruff 3, Petros Drineas 4
PMCID: PMC9836035  NIHMSID: NIHMS1804098  PMID: 36649383

Abstract

Principal component analysis (PCA) is a widely used dimensionality reduction technique in machine learning and multivariate statistics. To improve the interpretability of PCA, various approaches to obtain sparse principal direction loadings have been proposed, which are termed Sparse Principal Component Analysis (SPCA). In this paper, we present ThreSPCA, a provably accurate algorithm based on thresholding the Singular Value Decomposition for the SPCA problem, without imposing any restrictive assumptions on the input covariance matrix. Our thresholding algorithm is conceptually simple; much faster than current state-of-the-art; and performs well in practice. When applied to genotype data from the 1000 Genomes Project, ThreSPCA is faster than previous benchmarks, at least as accurate, and leads to a set of interpretable biomarkers, revealing genetic diversity across the world.

Keywords: Sparse PCA, Population Stratification, Principal Component Analysis, Population Structure

1. Introduction

Principal Component Analysis (PCA) and the related Singular Value Decomposition (SVD) are fundamental data analysis and dimensionality reduction tools that are used across a wide range of areas including machine learning, multivariate statistics, and many others. These tools return a set of orthogonal vectors of decreasing importance that are often interpreted as fundamental latent factors that underlie the observed data. Even though the vectors returned by PCA and SVD have strong optimality properties, they are notoriously difficult to interpret in terms of the underlying processes generating the data [18], since they are linear combinations of all available data points or all available features. The concept of Sparse Principal Components Analysis (SPCA) was introduced in the seminal work of [11], where sparsity constraints were enforced on the singular vectors in order to improve interpretability; see for example, document analysis applications in [11, 18, 22].

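Formally, given a positive semidefinite (PSD) matrix $\mathbf{A} \in \mathbb{R}^{n \times n}$, SPCA can be defined as the constrained maximization problem:

$$Z = \max_{\mathbf{x} \in \mathbb{R}^n,\ \|\mathbf{x}\|_2 \le 1} \mathbf{x}^\top \mathbf{A} \mathbf{x}, \quad \text{subject to } \|\mathbf{x}\|_0 \le k. \tag{1}$$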

In the above formulation, A is a covariance matrix representing, for example, all pairwise feature or object similarities for an underlying data matrix. Therefore, SPCA can be applied to either the object or feature space of the data matrix, while the parameter k controls the sparsity of the resulting vector and is part of the input. Let x* denote a vector that achieves the optimal value Z in the above formulation. Intuitively, the optimization problem of eqn. (1) seeks a sparse, unit norm vector x* that maximizes the data variance. It is well-known that solving the above optimization problem is NP-hard [20] and that its hardness is due to the sparsity constraint. Indeed, if the sparsity constraint were removed, then the resulting optimization problem can be easily solved by computing the top left or right singular vector of A and its maximal value Z is equal to the top singular value of A.
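To make the formulation of eqn. (1) concrete, the following minimal sketch solves it exactly on a tiny matrix by enumerating every support of size k. It is an illustration only, not a method from this paper: the function name `spca_brute_force` and the toy data are assumptions, and the enumeration is exponential in k, which is precisely why the problem is hard at scale.

```python
import itertools
import numpy as np

def spca_brute_force(A, k):
    """Exact SPCA value by enumerating all supports of size k (tiny n only)."""
    n = A.shape[0]
    best_val, best_x = -np.inf, None
    for support in itertools.combinations(range(n), k):
        idx = list(support)
        # On a fixed support, the problem reduces to a top-eigenvector problem
        # for the corresponding principal submatrix of A.
        vals, vecs = np.linalg.eigh(A[np.ix_(idx, idx)])
        if vals[-1] > best_val:
            best_val = vals[-1]
            best_x = np.zeros(n)
            best_x[idx] = vecs[:, -1]
    return best_val, best_x

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 8))       # hypothetical toy data
A = X.T @ X                            # PSD covariance-like matrix
Z, x_star = spca_brute_force(A, k=3)
top_eig = np.linalg.eigvalsh(A)[-1]    # value of the unconstrained problem
```

Here `Z` never exceeds `top_eig`, and the two coincide when k = n, matching the remark that the sparsity constraint is the only source of hardness.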

In this work, we explore the potential of SPCA in the analysis of genetics data leveraging a provably accurate thresholding algorithm for SPCA. In genetics, PCA is a tool of paramount importance and is ubiquitously used to estimate population structure and extract ancestry information [23]. It is well-known that genome-wide association studies (GWAS) that attempt to identify genetic markers that are associated with complex traits in a typical case/control setting can be grossly confounded by the underlying population structure, due to the presence of subgroups in the population that belong to different ancestries in both the case and control groups [24]. To account for such population stratification and to minimize the underlying spurious associations, researchers typically use the top few principal components as covariates in the underlying model. However, the principal components are linear combinations of all available genetic markers and, therefore, are not interpretable. SPCA is an obvious remedy towards that end, since one can use it to identify Single Nucleotide Polymorphisms (SNPs) or genetic markers carrying information about the underlying genetic ancestry. See also [12,13,16] for prior work motivating and using SPCA in the context of human genetics data analysis.

1.1. Our Contributions

Thresholding is a simple algorithmic concept, where each coordinate of, say, a vector is retained if its value is sufficiently high; otherwise, it is set to zero. Thresholding naturally preserves entries that have large magnitude while creating sparsity by eliminating small entries. Therefore, it seems like a logical strategy for SPCA: after computing a dense vector that approximately solves a PCA problem, perhaps with additional constraints, thresholding can be used to sparsify it.
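As a small illustration of the generic thresholding idea described above (and not of ThreSPCA itself), the sketch below zeroes every coordinate whose magnitude falls below a cutoff and rescales the result to unit norm. The function name `hard_threshold`, the cutoff `tau`, and the example vector are all hypothetical.

```python
import numpy as np

def hard_threshold(v, tau):
    """Zero every coordinate with |v_i| < tau, then rescale to unit norm."""
    out = np.where(np.abs(v) >= tau, v, 0.0)
    norm = np.linalg.norm(out)
    if norm == 0.0:
        raise ValueError("threshold removed every coordinate")
    return out / norm

v = np.array([0.70, -0.05, 0.10, -0.69, 0.02])  # a dense, nearly unit-norm vector
s = hard_threshold(v, tau=0.5)                   # only the two large entries survive
```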

We present a simple, provably accurate, thresholding algorithm (ThreSPCA, Section 2.1) for SPCA that leverages the fact that the top singular vector is an optimal solution for the SPCA problem without the sparsity constraint. Our algorithm actually uses a thresholding scheme that leverages the top few singular vectors of the underlying covariance matrix; it is simple and intuitive, yet offers tradeoffs in running time vs. accuracy, the first of its kind. Our algorithm returns a vector that is provably sparse and, when applied to the input covariance matrix A, provably captures the optimal solution Z up to a small additive error. Indeed, our output vector has a sparsity that depends on k (the target sparsity of the original SPCA problem of eqn. (1)) and ε (an accuracy parameter between zero and one). Our analysis provides unconditional guarantees for the accuracy of the solution of the proposed thresholding scheme. To the best of our knowledge, no such analyses have appeared in prior work (see Section 1.2 for details). We emphasize that our approach only requires an approximate SVD and, as a result, ThreSPCA runs very quickly. In practice, ThreSPCA is much faster than current state-of-the-art and at least as accurate in the analysis of human genetics datasets. An additional contribution of our work is that, unlike prior work, our algorithm has a clear trade-off between quality of approximation and output sparsity. Indeed, by increasing the density of the final SPCA vector, one can improve the amount of variance that is captured by our SPCA output. See Theorem 1 for details on this sparsity vs. accuracy trade-off for ThreSPCA.

Importantly, we evaluate ThreSPCA on the genotype dataset from 1000 Genomes (1KG) Project [10] and on simulated genotype data in order to practically assess its performance. ThreSPCA identifies functionally relevant, interpretable SNPs from the 1KG data and, from an accuracy perspective, it performs comparably to current state-of-the-art SPCA algorithms while being much faster than its competitors.

1.2. Prior work

SPCA was formally introduced by [11]; however, previously studied PCA approaches based on rotating [14] or thresholding [7] the top singular vector of the input matrix seemed to work well, at least in practice, given sparsity constraints. Following [11], there has been an abundance of interest in SPCA, with extensions based on LASSO (ScoTLASS) on an $\ell_1$ relaxation of the problem [15] or a non-convex regression-type approximation, penalized similar to LASSO [28].

Prior work that offers provable guarantees, typically given some assumptions about the input matrix, includes [22], which analyzed a specific set of vectors in a low-dimensional eigenspace of the input matrix and presented relative error guarantees for the optimal objective, given the assumption that the input covariance matrix has a decaying spectrum. The time complexity of the algorithm of [22] is $O(n^{d+1} \log n)$ (due to computing an exact SVD), where d is the low rank parameter that affects the accuracy of the output. Even for d = 1, the theoretical time complexity is $O(n^2 \log n)$, and it is not clear how to make use of an approximate SVD algorithm to improve this running time without affecting its theoretical bound. Furthermore, for a high precision output, one generally needs d to be larger than one, in which case the practical running time also increases drastically. [1] gave a polynomial-time algorithm that solves sparse PCA exactly for input matrices of constant rank. [8] showed that sparse PCA can be approximated in polynomial time within a factor of $n^{-1/3}$ and also highlighted an additive PTAS of [2] based on the idea of finding multiple disjoint components and solving bipartite maximum weight matching problems. This PTAS needs time $n^{\mathrm{poly}(1/\varepsilon)}$, whereas ThreSPCA has running time that depends on the sparsity of the input data.

SPCA has been applied in the context of human genetics before, in the form of sparse factor analysis (SFA) [12] and with a penalty term in LASSO (L-PCA) or Adaptive LASSO (AL-PCA) [16]. However, there are a number of aspects that our work improves compared to prior studies. First, unlike ThreSPCA, the SFA method used some prior assumptions on the genotype matrix and none of these previous studies come with a theoretical guarantee showing a clear sparsity vs. accuracy trade-off.

Second, prior work has to tune the penalty parameter in [16] several times in order to achieve a specific sparsity value in practice, which increases the running time of the method. Third, the convergence of the SPCA algorithm proposed by [16] depends on an initial PC score, which typically relies on the top right singular vector of the data and necessitates the computation of an exact SVD, which is expensive. It is not clear whether replacing the exact SVD with a fast approximate SVD would affect the results of [16].

2. Materials and Methods

2.1. The ThreSPCA algorithm

Notation.

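We use bold letters to denote matrices and vectors. For a matrix $\mathbf{A} \in \mathbb{R}^{n \times n}$, we denote its (i, j)-th entry by $\mathbf{A}_{i,j}$; its i-th row by $\mathbf{A}_{i*}$ and its j-th column by $\mathbf{A}_{*j}$; its 2-norm by $\|\mathbf{A}\|_2 = \max_{\mathbf{x} \in \mathbb{R}^n,\ \|\mathbf{x}\|_2 = 1} \|\mathbf{A}\mathbf{x}\|_2$; and its (squared) Frobenius norm by $\|\mathbf{A}\|_F^2 = \sum_{i,j} \mathbf{A}_{i,j}^2$. We use the notation $\mathbf{A} \succeq 0$ to denote that the matrix $\mathbf{A}$ is symmetric positive semidefinite (PSD) and $\mathrm{Tr}(\mathbf{A}) = \sum_i \mathbf{A}_{i,i}$ to denote its trace, which is also equal to the sum of its singular values. Given a PSD matrix $\mathbf{A} \in \mathbb{R}^{n \times n}$, its Singular Value Decomposition is given by $\mathbf{A} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{U}^\top$, where $\mathbf{U}$ is the matrix of left/right singular vectors and $\boldsymbol{\Sigma}$ is the diagonal matrix of singular values.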

Our approach: SPCA via SVD Thresholding.

To achieve nearly input sparsity runtime, our thresholding algorithm is based upon using the top $\ell$ right (or left) singular vectors of the PSD matrix $\mathbf{A}$. Given $\mathbf{A}$ and an accuracy parameter ε, our approach first computes $\boldsymbol{\Sigma}_\ell \in \mathbb{R}^{\ell \times \ell}$ (the diagonal matrix of the top $\ell$ singular values of $\mathbf{A}$) and $\mathbf{U}_\ell \in \mathbb{R}^{n \times \ell}$ (the matrix of the top $\ell$ left singular vectors of $\mathbf{A}$), for $\ell = 1/\varepsilon$. Then, it deterministically selects a subset of $O(k/\varepsilon^3)$ rows of $\mathbf{U}_\ell$ using a simple thresholding scheme based on their squared row norms (recall that k is the sparsity parameter of the SPCA problem). In the last step, it returns the top right singular vector of the matrix consisting of the columns of $\boldsymbol{\Sigma}_\ell^{1/2}\mathbf{U}_\ell^\top$ that correspond to the row indices of $\mathbf{U}_\ell$ chosen in the thresholding step. Notice that this right singular vector is an $O(k/\varepsilon^3)$-dimensional vector, which is finally expanded to a vector in $\mathbb{R}^n$ by appropriate padding with zeros. This sparse vector is our approximate solution to the SPCA problem of eqn. (1).

This simple algorithm is somewhat reminiscent of prior thresholding approaches for SPCA. However, to the best of our knowledge, no provable a priori bounds were known for such algorithms without strong assumptions on the input matrix. This might be due to the fact that prior approaches focused on thresholding only the top right singular vector of $\mathbf{A}$, whereas our approach thresholds the top $\ell = 1/\varepsilon$ right singular vectors of $\mathbf{A}$. This slight relaxation allows us to present provable bounds for the proposed algorithm.

In more detail, let the SVD of $\mathbf{A}$ be $\mathbf{A} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{U}^\top$. Let $\boldsymbol{\Sigma}_\ell \in \mathbb{R}^{\ell \times \ell}$ be the diagonal matrix of the top $\ell$ singular values and let $\mathbf{U}_\ell \in \mathbb{R}^{n \times \ell}$ be the matrix of the top $\ell$ right (or left) singular vectors. Let $R = \{i_1, \ldots, i_{|R|}\}$ be the set of indices of rows of $\mathbf{U}_\ell$ that have squared norm at least $\varepsilon^2/k$ and let $\bar{R}$ be its complement. Here $|R|$ denotes the cardinality of the set $R$ and $R \cup \bar{R} = \{1, \ldots, n\}$. Let $\mathbf{R} \in \mathbb{R}^{n \times |R|}$ be a sampling matrix that selects the rows of $\mathbf{U}_\ell$ whose indices are in the set $R$. Given this notation, we are now ready to state Algorithm 1. Notice that $\mathbf{R}\mathbf{y}$ satisfies $\|\mathbf{R}\mathbf{y}\|_2 = \|\mathbf{y}\|_2 = 1$ (since $\mathbf{R}$ has orthonormal columns) and $\|\mathbf{R}\mathbf{y}\|_0 = |R|$. Since $R$ is the set of rows of $\mathbf{U}_\ell$ with squared norm at least $\varepsilon^2/k$ and $\|\mathbf{U}_\ell\|_F^2 = \ell = 1/\varepsilon$, it follows that $|R| \le k/\varepsilon^3$. Thus, the vector returned by Algorithm 1 has $k/\varepsilon^3$ sparsity and unit norm. (See the Appendix for more details.)
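The counting argument behind the bound on |R| can be spelled out in one line. Every row indexed by R contributes at least $\varepsilon^2/k$ to the squared Frobenius norm of $\mathbf{U}_\ell$, which equals $\ell$ because $\mathbf{U}_\ell$ has $\ell$ orthonormal columns:

$$|R| \cdot \frac{\varepsilon^2}{k} \;\le\; \sum_{i \in R} \|(\mathbf{U}_\ell)_{i*}\|_2^2 \;\le\; \|\mathbf{U}_\ell\|_F^2 = \ell = \frac{1}{\varepsilon} \quad\Longrightarrow\quad |R| \le \frac{k}{\varepsilon^3}.$$

As an illustrative instance (values chosen here, not taken from the experiments), $\varepsilon = 1/2$ gives $\ell = 2$, a row-norm threshold of $1/(4k)$, and at most $8k$ selected rows.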

Algorithm 1.

ThreSPCA: fast thresholding SPCA via SVD

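Input: $\mathbf{A} \in \mathbb{R}^{n \times n}$, sparsity k, error parameter ε > 0.
Output: $\mathbf{z} \in \mathbb{R}^n$ such that $\|\mathbf{z}\|_2 = 1$ and $\|\mathbf{z}\|_0 \le k/\varepsilon^3$.
  1: $\ell \leftarrow 1/\varepsilon$;
  2: Compute $\mathbf{U}_\ell \in \mathbb{R}^{n \times \ell}$ (top $\ell$ left singular vectors of $\mathbf{A}$) and $\boldsymbol{\Sigma}_\ell \in \mathbb{R}^{\ell \times \ell}$ (the top $\ell$ singular values of $\mathbf{A}$);
  3: Let $R = \{i_1, \ldots, i_{|R|}\}$ be the set of rows of $\mathbf{U}_\ell$ with squared norm at least $\varepsilon^2/k$ and let $\mathbf{R} \in \mathbb{R}^{n \times |R|}$ be the associated sampling matrix (see text for details);
  4: $\mathbf{y} \in \mathbb{R}^{|R|} \leftarrow \arg\max_{\|\mathbf{x}\|_2 = 1} \|\boldsymbol{\Sigma}_\ell^{1/2}\mathbf{U}_\ell^\top \mathbf{R}\mathbf{x}\|_2^2$;
  5: return $\mathbf{z} = \mathbf{R}\mathbf{y} \in \mathbb{R}^n$;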
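The steps of Algorithm 1 can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions rather than the authors' implementation: it uses an exact eigendecomposition (valid because the input is PSD), rounds 1/ε to the nearest integer to obtain ℓ, and the name `threspca` and the toy input are hypothetical.

```python
import numpy as np

def threspca(A, k, eps):
    """Sketch of Algorithm 1 for a PSD matrix A, using an exact eigendecomposition."""
    n = A.shape[0]
    # Step 1: ell = 1/eps (rounded and clipped to [1, n]; an assumption of this sketch).
    ell = min(n, max(1, int(round(1.0 / eps))))
    # Step 2: top-ell singular values/vectors (eigenpairs, since A is PSD).
    vals, vecs = np.linalg.eigh(A)
    order = np.argsort(vals)[::-1][:ell]
    sigma = np.clip(vals[order], 0.0, None)
    U = vecs[:, order]                              # n x ell
    # Step 3: keep rows of U_ell whose squared norm is at least eps^2 / k.
    row_norms = np.sum(U**2, axis=1)
    R = np.flatnonzero(row_norms >= eps**2 / k)
    if R.size == 0:
        raise ValueError("no row passes the threshold for this (k, eps)")
    # Step 4: top right singular vector of Sigma^{1/2} U^T restricted to columns in R.
    M = np.sqrt(sigma)[:, None] * U[R, :].T         # ell x |R|
    _, _, Vt = np.linalg.svd(M, full_matrices=False)
    y = Vt[0]
    # Step 5: pad with zeros to obtain a vector in R^n.
    z = np.zeros(n)
    z[R] = y
    return z

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 20))                   # hypothetical toy data
A = X.T @ X
z = threspca(A, k=5, eps=0.5)
```

The returned vector has unit norm and is supported only on the rows that pass the threshold, so its number of non-zeros is at most k/ε³.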

Theorem 1. Let k be the sparsity parameter and ε ∈ (0, 1] be the accuracy parameter. Then, the vector $\mathbf{z} \in \mathbb{R}^n$ (the output of Algorithm 1) has sparsity $k/\varepsilon^3$, unit norm, and satisfies

$$\mathbf{z}^\top \mathbf{A} \mathbf{z} \ge Z - 3\varepsilon\,\mathrm{Tr}(\mathbf{A}).$$

The optimality gap of Theorem 1 depends on $\mathrm{Tr}(\mathbf{A})$, which is the sum of the eigenvalues of $\mathbf{A}$ and can also be viewed as the total variance of the data. Therefore, if we divide both sides of the bound in Theorem 1 by $\mathrm{Tr}(\mathbf{A})$, the resulting bound is given by $(\mathrm{prop}^* - \widetilde{\mathrm{prop}}) \le 3\varepsilon$, where for a given k, $\widetilde{\mathrm{prop}}$ is the proportion of the total variance explained by the output of ThreSPCA and $\mathrm{prop}^*$ is the proportion of the total variance explained by the optimal Sparse PC. Now, trivially, we have $(\mathrm{prop}^* - \widetilde{\mathrm{prop}}) \ge 0$, since $\mathrm{prop}^*$ is the maximum variance explained by a Sparse PC for a given sparsity value. Thus, combining these two yields $0 \le (\mathrm{prop}^* - \widetilde{\mathrm{prop}}) \le 3\varepsilon$, which can be interpreted as the quality-of-approximation in terms of the proportion of total variance explained by ThreSPCA.

The proof of Theorem 1, together with an intermediate result (Lemma 1) that leads to the final bound, is deferred to Appendix 1.A. The running time of Algorithm 1 is dominated by the computation of the top $\ell$ singular vectors and singular values of the matrix $\mathbf{A}$. One could always use the SVD of the full matrix $\mathbf{A}$ ($O(n^3)$ time) to compute the top $\ell$ singular vectors and singular values of $\mathbf{A}$. In practice, any iterative method, such as subspace iteration using a random initial subspace or the Krylov subspace of the matrix, can be used towards this end. We address the inevitable approximation error incurred by such approximate SVD methods below.

Using approximate SVD algorithms.

Although the guarantees of Theorem 1 for Algorithm 1 assume an exact SVD computation, which could take time $O(n^3)$, we can further improve the running time by using an approximate SVD algorithm such as the randomized block Krylov method of [21], which runs in nearly input sparsity running time. Our analysis uses the relationships $\|\boldsymbol{\Sigma}_{\ell,\perp}^{1/2}\|_2^2 \le \mathrm{Tr}(\mathbf{A})/\ell$ and $\sigma_1(\boldsymbol{\Sigma}_\ell) \le \mathrm{Tr}(\mathbf{A})$. The randomized block Krylov method of [21] recovers these guarantees up to a multiplicative (1 + ε) factor, in $O\!\left(\frac{\log n}{\varepsilon^{1/2}} \cdot \mathrm{nnz}(\mathbf{A})\right)$ time. Here $\mathrm{nnz}(\mathbf{A})$ denotes the number of non-zero entries of the matrix $\mathbf{A}$, which is $O(n^2)$ for dense matrices.
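To give a feel for how an approximate SVD can replace the exact one, the sketch below uses plain randomized subspace iteration followed by a Rayleigh-Ritz step. This is a simpler stand-in chosen for illustration, not the randomized block Krylov method of [21], and it does not carry that method's guarantees; the function name, the iteration count, and the oversampling parameter are assumptions.

```python
import numpy as np

def approx_top_eigs(A, ell, n_iter=8, oversample=5, seed=0):
    """Randomized subspace iteration for the top-ell eigenpairs of a PSD matrix."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    b = min(n, ell + oversample)                 # block size with a little slack
    Q, _ = np.linalg.qr(rng.standard_normal((n, b)))
    for _ in range(n_iter):
        # Each pass costs one product with A, i.e. time proportional to nnz(A) * b.
        Q, _ = np.linalg.qr(A @ Q)
    # Rayleigh-Ritz: solve the small b x b eigenproblem in the captured subspace.
    T = Q.T @ (A @ Q)
    vals, vecs = np.linalg.eigh(T)
    order = np.argsort(vals)[::-1][:ell]
    return vals[order], Q @ vecs[:, order]

# Toy PSD matrix with a known spectrum: two dominant directions, then a flat tail.
rng = np.random.default_rng(1)
V, _ = np.linalg.qr(rng.standard_normal((30, 30)))
spectrum = np.concatenate(([100.0, 50.0], np.ones(28)))
A = (V * spectrum) @ V.T
vals, U_ell = approx_top_eigs(A, ell=2)
```

The matrix is touched only through products `A @ Q`, which is what makes this family of methods attractive for sparse or out-of-core inputs.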

Extracting additional sparse PCs.

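To get multiple sparse PCs using Algorithm 1, we remove the top principal component from the data and run ThreSPCA on the residual dataset. In other words, let $\mathbf{X} \in \mathbb{R}^{m \times n}$ be the mean-centered data matrix corresponding to $\mathbf{A}$, i.e., $\mathbf{A} = \mathbf{X}^\top\mathbf{X}$. Let $\mathbf{v} \in \mathbb{R}^n$ be the top right singular vector of $\mathbf{X}$; then, in order to get the second sparse PC, we run ThreSPCA on the covariance matrix $\mathbf{A}_1 = \mathbf{X}_1^\top\mathbf{X}_1$, where $\mathbf{X}_1 = \mathbf{X} - \mathbf{X}\mathbf{v}\mathbf{v}^\top$.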
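The deflation step described above is short enough to write out. The sketch below is a minimal illustration on hypothetical random data; the function name `deflate` is an assumption.

```python
import numpy as np

def deflate(X):
    """Remove the top principal direction: X1 = X - X v v^T."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    v = Vt[0]                                  # top right singular vector of X
    return X - np.outer(X @ v, v)

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 10))              # hypothetical data, m = 40, n = 10
X = X - X.mean(axis=0)                         # mean-center each column
s = np.linalg.svd(X, compute_uv=False)         # singular values before deflation
X1 = deflate(X)
s1 = np.linalg.svd(X1, compute_uv=False)       # singular values after deflation
A1 = X1.T @ X1                                 # input for the second sparse PC
```

After deflation the largest singular value of `X1` equals the second singular value of `X`, so running the thresholding algorithm on `A1` targets the next direction of variance.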

2.2. Data

1000 Genomes Data.

In order to evaluate the speed and accuracy of ThreSPCA as well as to interpret its output, we first analyzed data from the 1000 Genomes Project (1KG) [10], which contained genotype data from 2,503 individuals with 39,517,397 SNPs sampled from 26 different populations across all continents. After performing Quality Control (QC) by removing variants with minor allele frequency below 5% and, subsequently, pruning correlated variants for Linkage Disequilibrium (LD) using a window size of 50 kb and $r^2 > 0.2$, we retained 360,498 variants.

Simulated Data.

We generated simulated data emulating real-world populations to evaluate whether ThreSPCA can correctly identify markers which contribute to the genetic differences between and within the populations. Based on previous work [4], we simulated two datasets varying m = {5000, 10000} SNPs genotyped across n = {500, 1000} individuals based on the Pritchard-Stephens-Donnelly (PSD) model [25] with the mixing parameter between populations, α = 0.01. The allele frequencies were simulated based on real-world data from three divergent populations, namely CEU (Utah residents with Northern and Western European ancestry), ASW (African ancestry in Southwestern US), and MXL (Mexican ancestry in California) from the HapMap Phase 3 data [17]. We selected a threshold t and varied it across the range t = {100, 250, 500}, representing the number of SNPs which contribute to population structure between the populations (true positives); the remaining m − t genotypes were simulated such that they had minimal genetic differences between populations (false positives). We simulated 200 data sets (100 each for values of m and n) and applied ThreSPCA, L-PCA and AL-PCA for comparative analyses to evaluate their efficacy.

2.3. Experiments

We performed QC on the 1KG data, including LD pruning, using PLINK2.0 [9]. PCA was performed using TeraPCA [5]. Annotation of ThreSPCA derived variants was performed with the Ensembl Variant Effect Predictor (VEP) [19]. We performed Gene Ontology (GO) pathway analyses using clusterProfiler [27] in R. We ran ThreSPCA with the threshold parameter $\ell$ fixed to one.

3. Results

3.1. ThreSPCA reveals genetic diversity across the world

We applied ThreSPCA with a sparsity threshold of k = 500 on the 1KG data after quality control and pruning for correlated SNPs. We obtained sets of informative markers of cardinality k from each of the PCs. We restricted our analysis to the top three PCs, resulting in a total of 1,500 SNPs, which explained approximately 83% of the variance. Thus, we performed PCA on a reduced 1KG data set with 2,503 individuals and 1,500 SNPs. We observed that both the PCA plot and the allele frequency bar plot, grouped by populations across the world, are almost identical to those from the full data. The squared Pearson correlation coefficients ($r^2$) between the top three PCs from the original 1KG data and those from the ThreSPCA informed variants are very high, equal to 0.98, 0.97 and 0.94 for PCs 1, 2 and 3 respectively. Thus, the PCA plot of the informative markers clearly preserves the clusters of each subgroup (Figure 1) and reveals fine-scale population structure among the groups.

Fig. 1:

Population structure of world populations from: A. pruned 1KG data with 360,498 SNPs, and B. 1KG data with 1,500 ThreSPCA derived variants corresponding to the top three PCs, captured by (i) PCA plot and (ii) mean allele frequency bar plots colored by continental populations arranged in order from Africa (AFR), Americas (AMR), East Asia (EAS), Europe (EUR) and South Asia (SAS).

Examining each of the three PCs closely shows that the mean allele frequency distribution (Appendix Figure 2) from PC1 is skewed towards the African populations and the mixed ancestry populations of ASW (Africans in Southwestern US) and ACB (African Caribbeans from Barbados). SNPs obtained from PC2 were almost equally distributed across the continental populations, with slightly higher frequencies in East Asians. PC3 shows a skewness towards South Asian populations. To make an informed choice of the sparsity threshold k, we computed the PC scores from the top two PCs by projecting the sparse vectors obtained from ThreSPCA on the original pruned 1KG data for a range of values of k = {500, 1000, 5000}.

We computed $r^2$ between the PC scores obtained from each PC for each value of k and the original PC obtained from the pruned 1KG data. We observe high correlation values for the top two PCs, cumulatively reaching their peak when the sparsity parameter k is set to approximately 500 (Appendix Figure 4 (left)).
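The comparison described above amounts to projecting the data on a dense and on a sparse loading vector and correlating the two score vectors. The sketch below illustrates that computation on hypothetical toy data; the sparse vector is obtained here by simply keeping the five largest loadings, which is a stand-in for illustration and not the ThreSPCA output.

```python
import numpy as np

def squared_pearson(a, b):
    """Squared Pearson correlation coefficient between two score vectors."""
    return np.corrcoef(a, b)[0, 1] ** 2

rng = np.random.default_rng(0)
# Toy data: one strong latent factor shared by the first 5 of 50 features.
X = rng.standard_normal((200, 50))
X[:, :5] += 3.0 * rng.standard_normal((200, 1))
X -= X.mean(axis=0)

_, _, Vt = np.linalg.svd(X, full_matrices=False)
v = Vt[0]                                   # dense PC1 loading vector
keep = np.argsort(np.abs(v))[-5:]           # 5 largest loadings (illustrative choice)
z = np.zeros_like(v)
z[keep] = v[keep]
z /= np.linalg.norm(z)                      # sparse, unit-norm loading vector

r2 = squared_pearson(X @ v, X @ z)          # agreement between dense and sparse scores
```

Because the squared correlation is invariant to the sign of either loading vector, no sign alignment between the dense and sparse PCs is needed.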

3.2. Interpretability of ThreSPCA informed variants

Annotating the selected variants.

To understand whether the variants derived from ThreSPCA for each PC are functionally relevant and biologically interpretable, we annotated them using VEP [19]. We also explored whether these variants mapped to a trait or disease in the GWAS catalog [6]. Most of the variants were intronic, with some intergenic variants and a small number of Transcription Factor binding sites, upstream and downstream gene variants, etc. Interestingly, among the coding consequences, 58 variants were missense and likely disease causing, and further statistics revealed seven deleterious variants and nine probably or possibly damaging variants (Figure 2). We also performed GO pathway analyses on ThreSPCA informed variants and found significantly enriched pathways common to humans across the world, such as pathways related to synapses and potassium, cation and ion channels, and the transporter complex, among others (Appendix Figure 3a). We found the calcium signaling pathway from KEGG (Kyoto Encyclopedia of Genes and Genomes) to be significantly enriched (Appendix Figure 3b).

Fig. 2:

Pie charts showing the percentage of variants from A. (i) most severe consequences and (ii) coding consequences obtained from VEP. B. Deleterious and probably damaging from (i) SIFT and (ii) PolyPhen.

Mapping the selected variants to traits.

Mapping these variants in GWAS catalog, we found that variants from PC1 mapped to skin pigment measurement (Appendix Table 2), justifying our observation from the PCA plot and mean allele frequency distribution. This is concordant with our observation that ThreSPCA observed variants from PC1 were skewed towards populations of African ancestry (Appendix Figure 2), who are darker skinned than the rest of the world. PC2 and PC3 on the other hand mapped to various traits which are commonly found to be varying in populations across the world such as body height, BMI, hip and waist circumference, circadian rhythm, gut microbiome, smoking status, cardiovascular diseases, calcium channel blocker use (concordant with calcium signaling pathway found in GO analyses), blood measurements, among others.

3.3. Comparing ThreSPCA to state-of-the-art

Simulation studies.

We designed a simulation study to evaluate the correctness of ThreSPCA and compare it with the state-of-the-art SPCA methods in genetics, namely, L-PCA and AL-PCA from [16]. The population structure of the simulation shows three distinct clusters for each population with signs of admixture between them (Appendix Figure 1). Applying ThreSPCA on the simulated dataset with 10,000 markers and 1,000 individuals, we observed that ThreSPCA identified similar numbers (mean) of true positives, i.e., markers contributing to the genetic diversity between and within the populations when compared to its counterparts L-PCA and AL-PCA, while identifying a significantly smaller number of false positives, i.e., noisy markers which have no difference in allele frequencies between populations (Figure 3b).

Fig. 3:

Box and whisker plots comparing ThreSPCA, L-PCA and AL-PCA for true and false positives obtained from the simulated dataset of m = 10,000 and n = 1,000 and varying values of t, i.e., the number of SNPs which contribute to population structure.

Real Data.

We applied both ThreSPCA and AL-PCA to the 1KG data with k = 500 and compared the PC1 scores vs. PC2 scores generated from the outputs of these methods. The resulting plots for ThreSPCA and AL-PCA are almost identical to the corresponding standard PC plot, clearly preserving the clusters of each subgroup. We observed a near-linear relationship between the two SPCA algorithms for both PCs, with $r^2$ = 0.9808 and 0.9426 for PC1 and PC2, respectively, and with varying k. This validates that ThreSPCA and AL-PCA are qualitatively very similar to each other in inferring genetic structure.

Running Time.

ThreSPCA clearly outperforms AL-PCA in running time. In particular, for any given k, ThreSPCA takes less than two minutes on the 1KG data, whereas AL-PCA takes about 15 minutes for a given penalty parameter λ > 0, since it needs a full SVD. Moreover, as already mentioned in Section 1.2, λ is a hyperparameter which needs to be tuned with many cross-nested runs of the data in order to achieve a desired sparsity value. In our case, for the sparsity parameter set to 500, it took at least six runs for each PC. Therefore, the resulting speed-up achieved by ThreSPCA is more than 45x for the real data set and around 80x for simulated data.
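As a rough check on the real-data figure, one way to read the numbers above (an interpretation for illustration, assuming each AL-PCA tuning run takes the quoted 15 minutes and exactly six runs are needed) is

$$\frac{6 \times 15\ \text{min}}{2\ \text{min}} = \frac{90}{2} = 45,$$

so any ThreSPCA run that finishes in under two minutes corresponds to a speed-up above 45x.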

Finally, we also compare the output of our algorithm against other state-of-the-art SPCA approaches, including the coordinate-wise optimization algorithm of [3] (cwpca) and the spannogram-based algorithm of [22] (spca-lowrank). To measure the accuracy of the various SPCA algorithms, we first looked at the term $\mathbf{z}^\top\mathbf{A}\mathbf{z}$ (for varying k), which is the total variability explained (VE) by the sparse output $\mathbf{z}$. In terms of VE, ThreSPCA matches the other state-of-the-art SPCA solvers for all the sparsity values observed, and these values are much larger than that of AL-PCA (Appendix Figure 4 (right)). In addition, ThreSPCA is not only among the most accurate algorithms but also the fastest (Table 1): it takes roughly two minutes (117s to 127s in Table 1) to run for each k, while the other solvers, including AL-PCA, take at least 2,200s for each k (see details in Appendix Section 1.B.3).

Table 1:

Running time comparisons between ThreSPCA and other state-of-the-art sparse PCA solvers. All times are in seconds except CWPCA, which is in hours.

| k | ThreSPCA | AL-PCA | CWPCA | SPCA-Lowrank |
|---|---|---|---|---|
| 150 | 117.3016 | 2287.224 | > 5 hrs | 3253.057 |
| 800 | 126.8674 | 2473.908 | | 3152.857 |
| 1000 | 120.9341 | 2442.435 | | 3121.234 |
| 1500 | 119.6183 | 2715.581 | | 3408.294 |
| 6000 | 123.2763 | 2440.104 | | 3319.691 |
| 12000 | 126.3872 | 2451.353 | | 3071.864 |

4. Discussion

We present ThreSPCA, a simple and intuitive approximation algorithm for SPCA, based on a deterministic thresholding scheme, without imposing any restrictive assumption on the input covariance matrix. ThreSPCA comes with a provable accuracy guarantee and provides a clear sparsity vs. accuracy trade-off. In practice, it is much faster than the other state-of-the-art SPCA methods and indeed, can be implemented in nearly input sparsity time.

Applying ThreSPCA on the 1KG data, we observed that the set of derived SNPs accurately approximates the genetic diversity across world populations. For each PC, the derived set of k SNPs (we used k = 500 throughout the analyses) captured genetic structure within different continental populations. Across the top three PCs, which together explain most of the variance in the 1KG data, ThreSPCA selected 1,500 meaningful, ancestry-information-preserving SNPs, which lead to a similar inference of population structure across the world as the original 1KG data with 360,498 SNPs. Annotating ThreSPCA derived variants further showed that they are interpretable and mostly missense in nature, thus likely disease causing. To interpret this, we mapped these variants to various traits in the GWAS catalog and found that these variants mapped to different common traits such as body height, BMI, etc., which vary within and between populations across the world, sometimes leading to spurious associations due to population structure [26]. These variants also mapped to various diseases which vary across populations, such as cardiovascular diseases. Although the scale of the data used in this analysis is small when compared to large-scale genomic data, we observe that ThreSPCA is designed to handle biobank-scale datasets since it only needs to run a randomized SVD/PCA analysis, which can be implemented efficiently in out-of-core settings [5]. ThreSPCA can also be used in GWAS as a population stratification correction step by identifying informative markers which highlight the ancestry stratification of cases/controls with fine-grained details that are often overlooked by a standard PCA.

In summary, ThreSPCA provides a fast and provably accurate approximate method for computing SPCA. It provides a method to find interpretable markers in population genetics, which can immensely help understand population stratification, a major cause of spurious associations in GWAS. Also, it highlights the genetic sub-structure among different populations and the ThreSPCA derived variants are likely disease causing, often mapped to potential diseases and traits.

Acknowledgements.

PD and AC were partially supported by National Science Foundation (NSF) 10001390, NSF III-10001674, NSF III-10001225, and an IBM Faculty Award to PD. AB was supported by IBM. DPW and SZ would like to thank partial support from NSF grant No. CCF- 181584, Office of Naval Research (ONR) grant N00014-18-1-2562, National Institute of Health (NIH) grant 5401 HG 10798-2, and a Simons Investigator Award.

Appendix 1.A. SPCA via thresholding: Discussions and Proofs

The intuition behind Theorem 1 is that we can decompose the value of the optimal solution into the value contributed by the coordinates in R, the value contributed by the coordinates outside of R, and a cross term. The first term we can upper bound by the output of the algorithm, which maximizes with respect to the coordinates in R. For the latter two terms, we can upper bound the contribution due to the upper bound on the squared row norms of indices outside of R and due to the largest singular value of $\mathbf{A}$ being at most the trace of $\mathbf{A}$.

We highlight that, as an intermediate step in the proof of Theorem 1, we need to prove the following Lemma 1, which is very much at the heart of our proof of Theorem 1 and, unlike prior work, allows us to provide provably accurate bounds for the thresholding Algorithm 1. At a high level, the proof of Lemma 1 first decomposes a basis for the columns spanned by $\mathbf{U}$ into those spanned by the top $\ell$ singular vectors and the remaining $n - \ell$ singular vectors. We then lower bound the contribution of the top $\ell$ singular vectors by upper bounding the contribution of the remaining $n - \ell$ singular vectors after noting that the largest remaining singular value is at most a $1/\ell$-fraction of the trace. We give the detailed proof of Lemma 1 below, where we use the notation of Section 2.1. For notational convenience, let $\sigma_1, \ldots, \sigma_n$ be the diagonal entries of the matrix $\boldsymbol{\Sigma} \in \mathbb{R}^{n \times n}$, i.e., the singular values of $\mathbf{A}$.

Lemma 1. Let ARn×n be a PSD matrix and ΣRn×n (respectively, ΣR×) be the diagonal matrix of all (respectively, top ) singular values and let URn×n (respectively, URn×) be the matrix of all (respectively, top ) singular vectors. Then, for all unit vectors xRn,

Σ1/2Ux22Σ1/2Ux22εTr(A).

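Proof. Let $U_{\ell,\perp} \in \mathbb{R}^{n \times (n-\ell)}$ be a matrix whose columns form a basis for the subspace perpendicular to the subspace spanned by the columns of $U_\ell$. Similarly, let $\Sigma_{\ell,\perp} \in \mathbb{R}^{(n-\ell) \times (n-\ell)}$ be the diagonal matrix of the bottom $n - \ell$ singular values of $A$. Notice that $U = [U_\ell \ \ U_{\ell,\perp}]$ and $\Sigma = \begin{bmatrix} \Sigma_\ell & 0 \\ 0 & \Sigma_{\ell,\perp} \end{bmatrix}$; thus,

$$U \Sigma^{1/2} U^\top = U_\ell \Sigma_\ell^{1/2} U_\ell^\top + U_{\ell,\perp} \Sigma_{\ell,\perp}^{1/2} U_{\ell,\perp}^\top.$$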

By the Pythagorean theorem,

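$$\|U \Sigma^{1/2} U^\top x\|_2^2 = \|U_\ell \Sigma_\ell^{1/2} U_\ell^\top x\|_2^2 + \|U_{\ell,\perp} \Sigma_{\ell,\perp}^{1/2} U_{\ell,\perp}^\top x\|_2^2.$$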

Using invariance properties of the vector two-norm and sub-multiplicativity, we get

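$$\|\Sigma_\ell^{1/2} U_\ell^\top x\|_2^2 \ge \|\Sigma^{1/2} U^\top x\|_2^2 - \|\Sigma_{\ell,\perp}^{1/2}\|_2^2 \, \|U_{\ell,\perp}^\top x\|_2^2.$$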

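We conclude the proof by noting that $\|\Sigma^{1/2} U^\top x\|_2^2 = x^\top U \Sigma U^\top x = x^\top A x$ and

$$\|\Sigma_{\ell,\perp}^{1/2}\|_2^2 = \sigma_{\ell+1} \le \frac{1}{\ell} \sum_{i=1}^{n} \sigma_i = \frac{\mathrm{Tr}(A)}{\ell}.$$

The inequality above follows since $\sigma_1 \ge \sigma_2 \ge \ldots \ge \sigma_\ell \ge \sigma_{\ell+1} \ge \ldots \ge \sigma_n$. We conclude the proof by setting $\ell = 1/\varepsilon$.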
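Lemma 1 can be checked numerically on a small example. The following is a minimal sketch, not part of the paper's code: the helper name `lemma1_gap`, the random test matrix, and the choice of 30 dimensions with $\ell = 5$ are all illustrative assumptions. It evaluates the left-hand side minus the right-hand side of the lemma (with $\varepsilon = 1/\ell$) for many random unit vectors.

```python
import numpy as np

def lemma1_gap(A, x, ell):
    """Return ||S_l^{1/2} U_l^T x||^2 - (||S^{1/2} U^T x||^2 - Tr(A)/ell).

    Lemma 1 (with eps = 1/ell) states that this quantity is non-negative
    for every unit vector x when A is PSD.
    """
    evals, evecs = np.linalg.eigh(A)          # for PSD A, eigenpairs = singular pairs
    order = np.argsort(evals)[::-1]           # sort in decreasing order
    sig = np.clip(evals[order], 0.0, None)    # guard against tiny negative round-off
    U = evecs[:, order]
    full = np.sum(sig * (U.T @ x) ** 2)                  # equals x^T A x
    top = np.sum(sig[:ell] * (U[:, :ell].T @ x) ** 2)    # top-ell contribution
    return top - (full - np.trace(A) / ell)

rng = np.random.default_rng(0)
G = rng.standard_normal((30, 30))
A = G @ G.T                                   # random PSD matrix (illustrative)

gaps = []
for _ in range(200):
    x = rng.standard_normal(30)
    x /= np.linalg.norm(x)                    # unit vector, as the lemma requires
    gaps.append(lemma1_gap(A, x, ell=5))
min_gap = min(gaps)
print("smallest observed gap:", min_gap)
```

A non-negative `min_gap` is consistent with the lemma; it is of course only a sanity check on random inputs, not a substitute for the proof.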

Theorem 1. Let $k$ be the sparsity parameter and $\varepsilon \in (0, 1]$ be the accuracy parameter. Then, the vector $z \in \mathbb{R}^n$ (the output of Algorithm 1) has sparsity $k/\varepsilon^3$, unit norm, and satisfies

$$z^\top A z \ge Z^* - 3\varepsilon\,\mathrm{Tr}(A).$$

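Proof. Let $R = \{i_1, \ldots, i_{|R|}\}$ be the set of indices of rows of $U_\ell$ (columns of $U_\ell^\top$) that have squared norm at least $\varepsilon^2/k$ and let $\bar{R}$ be its complement. Here $|R|$ denotes the cardinality of the set $R$ and $R \cup \bar{R} = \{1, \ldots, n\}$. Let $\mathbf{R} \in \mathbb{R}^{n \times |R|}$ be the sampling matrix that selects the columns of $U_\ell^\top$ whose indices are in the set $R$ and let $\mathbf{R}_\perp \in \mathbb{R}^{n \times (n - |R|)}$ be the sampling matrix that selects the columns of $U_\ell^\top$ whose indices are in the set $\bar{R}$. Thus, each column of $\mathbf{R}$ (respectively $\mathbf{R}_\perp$) has a single non-zero entry, equal to one, corresponding to one of the $|R|$ (respectively $|\bar{R}|$) selected columns. Formally, $\mathbf{R}_{i_t, t} = 1$ for all $t = 1, \ldots, |R|$, while all other entries of $\mathbf{R}$ are set to zero; $\mathbf{R}_\perp$ can be defined analogously. The following properties are easy to prove: $\mathbf{R}\mathbf{R}^\top + \mathbf{R}_\perp \mathbf{R}_\perp^\top = I_n$; $\mathbf{R}^\top \mathbf{R} = I$; $\mathbf{R}_\perp^\top \mathbf{R}_\perp = I$; $\mathbf{R}_\perp^\top \mathbf{R} = 0$. Recall that $x^*$ is the optimal solution to the SPCA problem from eqn. (1). We proceed as follows:

$$
\begin{aligned}
\|\Sigma_\ell^{1/2} U_\ell^\top x^*\|_2^2 &= \|\Sigma_\ell^{1/2} U_\ell^\top (\mathbf{R}\mathbf{R}^\top + \mathbf{R}_\perp \mathbf{R}_\perp^\top) x^*\|_2^2 \\
&\le \|\Sigma_\ell^{1/2} U_\ell^\top \mathbf{R}\mathbf{R}^\top x^*\|_2^2 + \|\Sigma_\ell^{1/2} U_\ell^\top \mathbf{R}_\perp \mathbf{R}_\perp^\top x^*\|_2^2 + 2 \|\Sigma_\ell^{1/2} U_\ell^\top \mathbf{R}\mathbf{R}^\top x^*\|_2 \, \|\Sigma_\ell^{1/2} U_\ell^\top \mathbf{R}_\perp \mathbf{R}_\perp^\top x^*\|_2 \\
&\le \|\Sigma_\ell^{1/2} U_\ell^\top \mathbf{R}\mathbf{R}^\top x^*\|_2^2 + \sigma_1 \|U_\ell^\top \mathbf{R}_\perp \mathbf{R}_\perp^\top x^*\|_2^2 + 2 \sigma_1 \|U_\ell^\top \mathbf{R}\mathbf{R}^\top x^*\|_2 \, \|U_\ell^\top \mathbf{R}_\perp \mathbf{R}_\perp^\top x^*\|_2.
\end{aligned} \tag{2}
$$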

The above inequalities follow from the Pythagorean theorem and sub-multiplicativity. We now bound the second term in the right-hand side of the above inequality.

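$$
\begin{aligned}
\|U_\ell^\top \mathbf{R}_\perp \mathbf{R}_\perp^\top x^*\|_2 &= \Big\| \sum_{i=1}^{n} (U_\ell^\top \mathbf{R}_\perp)_{*i} \, (\mathbf{R}_\perp^\top x^*)_i \Big\|_2 \le \sum_{i=1}^{n} \|(U_\ell^\top \mathbf{R}_\perp)_{*i}\|_2 \, |(\mathbf{R}_\perp^\top x^*)_i| \\
&\le \sqrt{\frac{\varepsilon^2}{k}} \sum_{i=1}^{n} |(\mathbf{R}_\perp^\top x^*)_i| \le \sqrt{\frac{\varepsilon^2}{k}} \, \|\mathbf{R}_\perp^\top x^*\|_1 \le \frac{\varepsilon}{\sqrt{k}} \sqrt{k} = \varepsilon.
\end{aligned} \tag{3}
$$

In the above derivations we use standard properties of norms and the fact that the columns of $U_\ell^\top$ that have indices in the set $\bar{R}$ have squared norm at most $\varepsilon^2/k$. The last inequality follows from $\|\mathbf{R}_\perp^\top x^*\|_1 \le \|x^*\|_1 \le \sqrt{k}$, since $x^*$ has at most $k$ non-zero entries and Euclidean norm at most one.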
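The sampling-matrix identities and the bound of eqn. (3) can be illustrated with a short numerical experiment. This is a sketch under stated assumptions: a random orthonormal matrix stands in for the matrix of top singular vectors, and the sizes (50 rows, 4 columns, sparsity 3) and the variable names `R_perp`, `keep`, `drop` are chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, ell, k = 50, 4, 3
eps = 1.0 / ell

# Random orthonormal n x ell matrix standing in for the top singular vectors.
U_ell, _ = np.linalg.qr(rng.standard_normal((n, ell)))

# Split row indices by the thresholding rule: squared row norm vs eps^2 / k.
row_sq = np.sum(U_ell ** 2, axis=1)
keep = np.flatnonzero(row_sq >= eps ** 2 / k)   # the index set R
drop = np.flatnonzero(row_sq < eps ** 2 / k)    # its complement

# Sampling matrices are column subsets of the identity.
I = np.eye(n)
R, R_perp = I[:, keep], I[:, drop]

identities_ok = (
    np.allclose(R @ R.T + R_perp @ R_perp.T, I)
    and np.allclose(R.T @ R, np.eye(len(keep)))
    and np.allclose(R_perp.T @ R_perp, np.eye(len(drop)))
    and np.allclose(R_perp.T @ R, 0.0)
)

# Largest value of ||U_ell^T R_perp R_perp^T x||_2 over random k-sparse unit x.
worst = 0.0
for _ in range(500):
    x = np.zeros(n)
    support = rng.choice(n, size=k, replace=False)
    x[support] = rng.standard_normal(k)
    x /= np.linalg.norm(x)
    worst = max(worst, np.linalg.norm(U_ell.T @ R_perp @ (R_perp.T @ x)))

print("identities hold:", identities_ok)
print("worst observed norm:", worst, "bound eps:", eps)
```

The bound holds for every $k$-sparse vector of norm at most one, not just the optimal $x^*$, which is why random sparse vectors suffice for the check.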

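Recall that the vector $y$ of Algorithm 1 maximizes $\|\Sigma_\ell^{1/2} U_\ell^\top \mathbf{R} x\|_2$ over all vectors $x$ of appropriate dimensions (including $\mathbf{R}^\top x^*$) and thus

$$\|\Sigma_\ell^{1/2} U_\ell^\top \mathbf{R} y\|_2 \ge \|\Sigma_\ell^{1/2} U_\ell^\top \mathbf{R}\mathbf{R}^\top x^*\|_2. \tag{4}$$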

Combining eqns. (2), (3), and (4), we get that for sufficiently small ε,

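$$\|\Sigma_\ell^{1/2} U_\ell^\top x^*\|_2^2 \le \|\Sigma_\ell^{1/2} U_\ell^\top z\|_2^2 + 2\varepsilon\,\mathrm{Tr}(A). \tag{5}$$

In the above we used $z = \mathbf{R} y$ (as in Algorithm 1) and $\sigma_1 \le \mathrm{Tr}(A)$. Notice that

$$U_\ell \Sigma_\ell^{1/2} U_\ell^\top z + U_{\ell,\perp} \Sigma_{\ell,\perp}^{1/2} U_{\ell,\perp}^\top z = U \Sigma^{1/2} U^\top z,$$

and using the Pythagorean theorem we get

$$\|U_\ell \Sigma_\ell^{1/2} U_\ell^\top z\|_2^2 + \|U_{\ell,\perp} \Sigma_{\ell,\perp}^{1/2} U_{\ell,\perp}^\top z\|_2^2 = \|U \Sigma^{1/2} U^\top z\|_2^2.$$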

Using the unitary invariance of the two norm and dropping a non-negative term, we get the bound

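$$\|\Sigma_\ell^{1/2} U_\ell^\top z\|_2^2 \le \|\Sigma^{1/2} U^\top z\|_2^2. \tag{6}$$

Combining eqns. (5) and (6), we conclude

$$\|\Sigma_\ell^{1/2} U_\ell^\top x^*\|_2^2 \le \|\Sigma^{1/2} U^\top z\|_2^2 + 2\varepsilon\,\mathrm{Tr}(A). \tag{7}$$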

We now apply Lemma 1 to the optimal vector x* to get

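$$\|\Sigma^{1/2} U^\top x^*\|_2^2 - \varepsilon\,\mathrm{Tr}(A) \le \|\Sigma_\ell^{1/2} U_\ell^\top x^*\|_2^2.$$

Combining with eqn. (7) we get

$$z^\top A z \ge Z^* - 3\varepsilon\,\mathrm{Tr}(A).$$

In the above we used $\|\Sigma^{1/2} U^\top z\|_2^2 = z^\top A z$ and $\|\Sigma^{1/2} U^\top x^*\|_2^2 = (x^*)^\top A x^* = Z^*$. The result then follows from rescaling $\varepsilon$.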
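The steps used in the proof (take the top $\ell = 1/\varepsilon$ singular vectors, keep the rows whose squared norm is at least $\varepsilon^2/k$, maximize over the kept coordinates, and embed the result back into $\mathbb{R}^n$) can be sketched in a few lines of NumPy. This is a hedged illustration, not the authors' released implementation: the function name `threspca_sketch`, the rounding of $1/\varepsilon$ to an integer, the use of a dense eigendecomposition, and the fallback for an empty index set are all assumptions made here.

```python
import numpy as np

def threspca_sketch(A, k, eps):
    """Illustrative thresholding sketch for a PSD matrix A (assumed, not official)."""
    n = A.shape[0]
    ell = min(n, int(np.ceil(1.0 / eps)))       # assumption: round 1/eps up

    # Top-ell eigenpairs; for PSD A these coincide with the singular pairs.
    evals, evecs = np.linalg.eigh(A)
    order = np.argsort(evals)[::-1][:ell]
    sig = np.clip(evals[order], 0.0, None)
    U_ell = evecs[:, order]                      # n x ell

    # Index set R: rows of U_ell with squared norm at least eps^2 / k.
    row_sq = np.sum(U_ell ** 2, axis=1)
    R = np.flatnonzero(row_sq >= eps ** 2 / k)
    if R.size == 0:                              # assumption: fall back to heaviest row
        R = np.array([int(np.argmax(row_sq))])

    # y maximizes ||Sigma_ell^{1/2} U_ell^T R y||_2 over unit vectors:
    # it is the top right singular vector of the ell x |R| matrix M.
    M = np.sqrt(sig)[:, None] * U_ell[R, :].T
    _, _, Vt = np.linalg.svd(M, full_matrices=False)
    y = Vt[0]

    # z = R y: place y on the selected coordinates, zeros elsewhere.
    z = np.zeros(n)
    z[R] = y
    return z

rng = np.random.default_rng(2)
G = rng.standard_normal((40, 40))
A = G @ G.T                                      # random PSD test matrix
k, eps = 2, 0.5
z = threspca_sketch(A, k, eps)
value = z @ A @ z
print("non-zeros:", np.count_nonzero(z), "of at most", k / eps ** 3)
print("z^T A z:", value, " Tr(A):", np.trace(A))
```

Because the columns of $U_\ell$ are orthonormal, their squared row norms sum to $\ell$, so at most $\ell k/\varepsilon^2 = k/\varepsilon^3$ rows can pass the threshold; this is the sparsity stated in Theorem 1. For large inputs a truncated SVD would be the natural replacement for the dense eigendecomposition used here.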

Appendix 1.B. Additional Experiments

Fig. 1:

PCA plot of the simulated data with three distinct populations simulated from the PSD model with α = 0.01, n = 1,000, m = 10,000, and t = 100.

1.B.1. Simulated Studies

The genotype matrix $X \in \mathbb{R}^{m \times n}$ consisting of the simulated allele frequencies was generated using the algorithms of [25]. More specifically, we set $F = TS$, where $T \in \mathbb{R}^{m \times d}$ and $S \in \mathbb{R}^{d \times n}$, and $d \le n$ is the number of population groups. $S$ is the indicator matrix that encapsulates the structure of the $n$ individuals contained in $d$ populations. On the other hand, $T$ characterizes how the structure is manifested in the allele frequencies of each SNP. Finally, projecting $S$ onto the column space of $T$, we obtain the allele frequency matrix $F$. We sample $X$ as a special case of $F$ for the Pritchard-Stephens-Donnelly (PSD) model. We simulate $S$ using i.i.d. draws from the Dirichlet distribution with varying values of α, which denotes the parameter influencing the relatedness between the individuals and is directly proportional to the admixture of populations. Appendix Figure 1 shows the population structure observed in this simulated data.

As it is difficult to establish notions of statistical significance for how well ThreSPCA captures the ancestry informative markers in the original data, we simulated data sets with varying numbers of individuals (n) and SNPs (m) and allowed t true SNPs that contribute to genetic ancestry. For the random markers that do not contribute to the genetic differentiation, we sampled the Fst distances between the individuals from a uniform distribution in the range [0, 0.005], which indicates minimal differentiation between populations. Thus, the “true” markers contributing to genetic difference are the t SNPs, and we treat the remaining m − t SNPs as noise.
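The simulation setup described above can be sketched as follows. This is only an illustration under several assumptions that are not stated in the text: population allele frequencies in T are drawn from a Balding-Nichols-style beta distribution, the t true markers use a hypothetical Fst of 0.1, ancestral frequencies are uniform on [0.05, 0.95], and genotypes are drawn as Binomial(2, F). The function name `simulate_psd` and all default values are hypothetical.

```python
import numpy as np

def simulate_psd(n=200, m=1000, d=3, t=100, alpha=0.1, fst_true=0.1, seed=0):
    """Toy PSD-style simulation; distributional choices are assumptions."""
    rng = np.random.default_rng(seed)

    # Ancestral allele frequency per SNP (assumed uniform).
    p = rng.uniform(0.05, 0.95, size=m)

    # Noise markers: Fst drawn uniformly from (0, 0.005]; lower bound avoids 1/0.
    fst = rng.uniform(1e-4, 0.005, size=m)
    true_idx = rng.choice(m, size=t, replace=False)
    fst[true_idx] = fst_true                 # hypothetical Fst for true markers

    # T (m x d): per-population allele frequencies (Balding-Nichols-style, assumed).
    a = p * (1 - fst) / fst
    b = (1 - p) * (1 - fst) / fst
    T = rng.beta(a[:, None], b[:, None], size=(m, d))

    # S (d x n): admixture proportions, i.i.d. Dirichlet(alpha) per individual.
    S = rng.dirichlet(alpha * np.ones(d), size=n).T

    # F = T S (m x n): individual-specific allele frequencies.
    F = np.clip(T @ S, 0.0, 1.0)

    # Genotypes in {0, 1, 2} (assumed binomial sampling).
    X = rng.binomial(2, F)
    return X, F, S, np.sort(true_idx)

X, F, S, true_idx = simulate_psd()
print(X.shape, F.shape, S.shape, len(true_idx))
```

Smaller values of `alpha` concentrate each column of S on a single population, which yields more clearly separated clusters in a PCA plot.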

1.B.2. Experiments on 1KG data

Population structure captured by PCA plots.

We filtered the original 1KG data for the ThreSPCA-derived k = 500 SNPs for each of the first three PCs; in the PCA plots we observe the population structure and the allele frequency distribution captured by each of the PCs. We clearly see that the SNPs from the PC1 loadings are most frequent in the African populations or mixed populations of African ancestry (Appendix Figure 2). The PC2 SNPs are most frequent in East Asians, although they are commonly found in other populations as well, and the PC3 SNPs are most frequent in South Asian populations (Appendix Figure 2). Thus, the SNP loadings from the top three PCs accurately capture the population structure across the world; merging them together, we not only capture the entire population structure in the PCA plot but also discover fine-grained substructure of populations (Figure 1).

Fig. 2:

Mean allele frequencies obtained from the first three PCs from ThreSPCA with k = 500.

Tuning input sparsity k.

We tried a range of values of k, varying it from 50 to 1500, and observed the $r^2$ between the PCs derived from the original 1KG data and the 1500 SNPs derived from ThreSPCA. We observed that for the top two PCs the $r^2$ is high, from 0.96 to 0.99, with the peak for both PCs reached around k = 500. PC1 continues to increase by two decimal points before saturating at k = 1000. Thus, we selected k = 500 for all the experiments, as both PCs reached their respective peaks.
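The squared correlation used to tune k can be computed along the following lines. This is a minimal sketch on toy data, with individuals as rows and SNPs as columns. The helper names (`pc_scores`, `pc_r2`, `top_k_by_loading`) are hypothetical, and SNPs are chosen here by the magnitude of their dense PCA loadings purely as a stand-in for the ThreSPCA selection.

```python
import numpy as np

def pc_scores(X, n_pcs):
    """PC scores of the column-centered matrix X (individuals x SNPs)."""
    Xc = X - X.mean(axis=0, keepdims=True)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :n_pcs] * s[:n_pcs]

def pc_r2(X, snp_idx, n_pcs=2):
    """Squared correlation between full-data PCs and PCs from a SNP subset."""
    full = pc_scores(X, n_pcs)
    sub = pc_scores(X[:, snp_idx], n_pcs)
    return np.array([np.corrcoef(full[:, j], sub[:, j])[0, 1] ** 2
                     for j in range(n_pcs)])

def top_k_by_loading(X, k, n_pcs=2):
    """Stand-in selection (NOT ThreSPCA): k SNPs with the largest dense loadings."""
    Xc = X - X.mean(axis=0, keepdims=True)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    weight = np.sum(Vt[:n_pcs] ** 2, axis=0)
    return np.argsort(weight)[::-1][:k]

# Toy genotypes: 3 groups of 30 individuals; only the first 100 SNPs differ.
rng = np.random.default_rng(3)
labels = np.repeat(np.arange(3), 30)
centers = rng.uniform(0.1, 0.9, size=(3, 400))
centers[:, 100:] = centers[0, 100:]
X = rng.binomial(2, centers[labels]).astype(float)

r2_all = pc_r2(X, np.arange(X.shape[1]))          # sanity check: all SNPs
r2_by_k = {k: pc_r2(X, top_k_by_loading(X, k)) for k in (25, 50, 100)}
for k, r2 in r2_by_k.items():
    print(k, np.round(r2, 3))
```

Squaring the correlation removes the sign ambiguity of principal components, so no sign alignment between the two sets of scores is needed.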

Fig. 3:

GO pathway analyses of the ThreSPCA informed variants, colored by p-values.

Fig. 4:

Left: Line plot of the $r^2$ between the PC scores of each PC obtained from ThreSPCA and the original PC from 1KG data, for varying values of the sparsity k. Right: Variance explained by ThreSPCA, AL-PCA, and other state-of-the-art SPCA solvers for varying k.

1.B.3. Comparing ThreSPCA with the state-of-the-art.

Simulated data.

We observed that increasing t, the number of true positives (markers that contribute to genetic structure), led to an increase in the number of true positives recovered by ThreSPCA.

Real Data.

For k = 500, on 1KG data we found high correlation between ThreSPCA and AL-PCA for PC1 and PC2, with $r^2$ = 0.97 and 0.94, respectively. We also observed similar trends for k = 1000 and k = 1500 (squared correlations larger than 0.9 for both PC1 and PC2).

To compare the output of ThreSPCA against other state-of-the-art SPCA approaches, we used the greedy coordinate-wise (GCW) method of cwpca, and we set the low-rank parameter d of spca-lowrank to one. We performed these evaluations on an Intel Xeon Gold 6126 processor running at 2.6 GHz with 96 GB of RAM and a 64-bit CentOS Linux 7 OS.

Table 2:

Traits and genes mapped in GWAS catalog from ThreSPCA informed variants.

| PCs | SNP | CHR | POS | MAPPED GENE | MAPPED TRAITS |
|---|---|---|---|---|---|
| PC1 | rs35399673 | 5 | 104529307 | RAB9BP1 | skin pigmentation measurement |
| PC2 | rs11556924 | 7 | 129663496 | ZC3HC1 | coronary artery disease, diastolic and systolic blood pressure, myocardial infarction, platelet count, parental longevity, testosterone measurement, Agents acting on the renin-angiotensin system use, Calcium channel blocker use, hematocrit, hemoglobin count, myeloid white cell count, body height, leukocyte count, cardiovascular disease, age at menarche |
| PC2 | rs12525051 | 6 | 151913710 | CCDC170 | heel bone mineral density |
| PC2 | rs1938679 | 11 | 69272096 | MYEOV - LINC02747 | body height |
| PC2 | rs196052 | 6 | 22057200 | CASC15 | Corneal astigmatism |
| PC2 | rs2069235 | 22 | 39747780 | SYNGR1 | primary biliary cirrhosis, rheumatoid arthritis |
| PC2 | rs4714599 | 6 | 42285815 | TRERF1 | eosinophil percentage of granulocytes, neutrophil percentage of granulocytes, eosinophil percentage of leukocytes |
| PC2 | rs5747035 | 22 | 17718606 | ADA2 | word list delayed recall measurement, memory performance |
| PC2 | rs7714191 | 5 | 131341541 | ACSL6-AS1, ACSL6 | cortical surface area measurement |
| PC2 | rs7901883 | 10 | 103186838 | BTRC | smoking behavior, smoking status measurement |
| PC2 | rs7976816 | 12 | 124315343 | DNAH10 | BMI-adjusted waist circumference, waist circumference |
| PC2 | rs8002164 | 13 | 58248732 | PCDH17 | upper aerodigestive tract neoplasm |
| PC2 | rs847888 | 12 | 112151742 | ACAD10 | diastolic blood pressure |
| PC2 | rs907183 | 8 | 8729761 | MFHAS1, MFHAS1 | Calcium channel blocker use measurement |
| PC3 | rs10164546 | 2 | 106141004 | FHL2 | pursuit maintenance gain measurement |
| PC3 | rs1020410 | 2 | 176784138 | EXTL2P1 - LNPK | physical activity |
| PC3 | rs10896109 | 11 | 66080023 | TMEM151A - CD248 | circadian rhythm |
| PC3 | rs1264423 | 6 | 30571471 | PPP1R10 | mean corpuscular volume |
| PC3 | rs12679528 | 8 | 15566164 | TUSC3 | body mass index |
| PC3 | rs16942383 | 15 | 89405052 | ACAN | BMI-adjusted hip circumference |
| PC3 | rs2988114 | 13 | 80870878 | SPRY2 | gut microbiome measurement |
| PC3 | rs34672598 | 20 | 7884260 | HAO1 | QT interval |
| PC3 | rs3828919 | 6 | 31466057 | MICB | platelet count |
| PC3 | rs41492548 | 9 | 130607359 | ENG | monocyte count |
| PC3 | rs4679760 | 3 | 155855418 | KCNAB1 | birth weight, parental genotype effect measurement |
| PC3 | rs744680 | 10 | 131741695 | EBF3 | visual perception measurement |
| PC3 | rs76496105 | 2 | 110447667 | BMS1P19 - SRSF3P6 | platelet count, platelet crit |

Footnotes

5. Recall that the p-th power of the $\ell_p$ norm of a vector $x \in \mathbb{R}^n$ is defined as $\|x\|_p^p = \sum_{i=1}^{n} |x_i|^p$ for $0 < p < \infty$. For $p = 0$, $\|x\|_0$ is a semi-norm denoting the number of non-zero entries of $x$.

6. Each column of $\mathbf{R}$ has a single non-zero entry (set to one), corresponding to one of the $|R|$ selected columns. Formally, $\mathbf{R}_{i_t, t} = 1$ for $t = 1, \ldots, |R|$; all other entries of $\mathbf{R}$ are set to zero.

7. Results from L-PCA are qualitatively very similar to AL-PCA and we only report results for the latter.

Code Availability. A Python implementation of ThreSPCA can be found at: https://github.com/aritra90/ThreSPCA.

References

  1. Asteris M, Papailiopoulos D, Karystinos GN: Sparse Principal Component of a Rank-deficient Matrix. In: 2011 IEEE International Symposium on Information Theory Proceedings. pp. 673–677 (2011)
  2. Asteris M, Papailiopoulos D, Kyrillidis A, Dimakis AG: Sparse PCA via Bipartite Matchings. In: Advances in Neural Information Processing Systems. pp. 766–774 (2015)
  3. Beck A, Vaisbourd Y: The Sparse Principal Component Analysis Problem: Optimality Conditions and Algorithms. Journal of Optimization Theory and Applications 170(1), 119–143 (2016)
  4. Bose A, Burch MC, Chowdhury A, Paschou P, Drineas P: Clustrat: a structure informed clustering strategy for population stratification. bioRxiv (2020)
  5. Bose A, Kalantzis V, Kontopoulou EM, Elkady M, Paschou P, Drineas P: TeraPCA: a fast and scalable software package to study genetic variation in tera-scale genotypes. Bioinformatics 35(19) (2019)
  6. Buniello A, et al.: The NHGRI-EBI GWAS catalog of published genome-wide association studies, targeted arrays and summary statistics 2019. Nucleic Acids Research 47(D1), D1005–D1012 (2019)
  7. Cadima J, Jolliffe IT: Loading and Correlations in the Interpretation of Principal Components. Journal of Applied Statistics 22(2), 203–214 (1995)
  8. Chan SO, Papailiopoulos D, Rubinstein A: On the Approximability of Sparse PCA. In: Proceedings of the 29th Conference on Learning Theory. pp. 623–646 (2016)
  9. Chang CC, Chow CC, Tellier LC, Vattikuti S, Purcell SM, Lee JJ: Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience 4(1), s13742–015 (2015)
  10. 1000 Genomes Project Consortium, et al.: A global reference for human genetic variation. Nature 526(7571), 68 (2015)
  11. d’Aspremont A, Ghaoui LE, Jordan MI, Lanckriet GRG: A Direct Formulation for Sparse PCA using Semidefinite Programming. SIAM Review 49(3), 434–448 (2007)
  12. Engelhardt BE, Stephens M: Analysis of population structure: a unifying framework and novel methods based on sparse factor analysis. PLoS Genetics 6(9), e1001117 (2010)
  13. Hsu YL, Huang PY, Chen DT: Sparse principal component analysis in cancer research. Translational Cancer Research 3(3), 182 (2014)
  14. Jolliffe IT: Rotation of principal components: Choice of Normalization Constraints. Journal of Applied Statistics 22(1), 29–35 (1995)
  15. Jolliffe IT, Trendafilov NT, Uddin M: A Modified Principal Component Technique Based on the LASSO. Journal of Computational and Graphical Statistics 12(3), 531–547 (2003)
  16. Lee S, et al.: Sparse Principal Component Analysis for Identifying Ancestry-informative Markers in Genome-wide Association Studies. Genetic Epidemiology 36(4), 293–302 (2012)
  17. Li JZ, et al.: Worldwide Human Relationships Inferred from Genome-Wide Patterns of Variation. Science 319(5866), 1100–1104 (2008)
  18. Mahoney MW, Drineas P: CUR Matrix Decompositions for Improved Data Analysis. Proceedings of the National Academy of Sciences 106(3), 697–702 (2009)
  19. McLaren W, Gil L, Hunt SE, Riat HS, Ritchie GR, Thormann A, Flicek P, Cunningham F: The Ensembl Variant Effect Predictor. Genome Biology 17(1), 1–14 (2016)
  20. Moghaddam B, Weiss Y, Avidan S: Generalized Spectral Bounds for Sparse LDA. In: Proceedings of the 23rd International Conference on Machine Learning. pp. 641–648 (2006)
  21. Musco C, Musco C: Randomized block Krylov methods for stronger and faster approximate singular value decomposition. In: Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems. pp. 1396–1404 (2015)
  22. Papailiopoulos D, Dimakis A, Korokythakis S: Sparse PCA through Low-rank Approximations. In: Proceedings of the 30th International Conference on Machine Learning. pp. 747–755 (2013)
  23. Patterson N, Price AL, Reich D: Population structure and eigenanalysis. PLoS Genetics 2(12) (2006)
  24. Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D: Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics 38(8), 904–909 (2006)
  25. Pritchard JK, Stephens M, Donnelly P: Inference of population structure using multilocus genotype data. Genetics 155(2), 945–959 (2000)
  26. Sohail M, et al.: Polygenic adaptation on height is overestimated due to uncorrected stratification in genome-wide association studies. eLife 8, e39702 (2019)
  27. Yu G, Wang LG, Han Y, He QY: clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS: A Journal of Integrative Biology 16(5), 284–287 (2012)
  28. Zou H, Hastie T: Regularization and Variable Selection via the Elastic Net. Journal of the Royal Statistical Society: Series B 67(2), 301–320 (2005)
