Author manuscript; available in PMC: 2014 Aug 18.
Published in final edited form as: Biometrika. 2014 Feb 12;101(2):484–490. doi: 10.1093/biomet/ast064

Convergence of Sample Eigenvalues, Eigenvectors, and Principal Component Scores for Ultra-High Dimensional Data

Seunggeun Lee 1, Fei Zou 2, Fred A Wright 3
PMCID: PMC4135472  NIHMSID: NIHMS594795  PMID: 25143634

Summary

The development of high-throughput biomedical technologies has led to increased interest in the analysis of high-dimensional data where the number of features is much larger than the sample size. In this paper, we investigate principal component analysis under the ultra-high dimensional regime, where both the number of features and the sample size increase as the ratio of the two quantities also increases. We bridge the existing results from the finite and the high-dimension low sample size regimes, embedding the two regimes in a more general framework. We also numerically demonstrate the universal application of the results from the finite regime.

Some key words: High-Dimension Low Sample Size Data, Principal Component Analysis, Random Matrix

1. Introduction

With the development of modern high-throughput technologies, it is common to encounter data with many more features, p, than the number of samples, n. In modern genomics applications, for instance, the number of features often ranges from tens of thousands to millions, while the corresponding sample sizes typically range from hundreds to thousands. For those high-dimensional data, principal component analysis is popular for data exploration and dimension reduction. Since principal component analysis is based on the eigenvalues and eigenvectors of the sample covariance matrix, its performance largely depends on the behavior of the sample eigenvalues and eigenvectors.

In their seminal paper on random matrices, Marčenko & Pastur (1967) derived the asymptotic distribution of the sample eigenvalues under the finite γ regime, where $p \to \infty$, $n \to \infty$ and $p/n \to \gamma < \infty$. Specifically, they showed that the sample eigenvalues follow the Marčenko–Pastur law when all the population eigenvalues are identical. For data where the true signal is embedded in a low dimensional space, Johnstone (2001) introduced the spiked eigenvalue model, where a small number of population eigenvalues are substantially larger than the rest. Under this model, asymptotic results on the sample eigenvalues and eigenvectors have been derived (Baik & Silverstein, 2006; Paul, 2007; Nadler, 2008; Lee et al., 2010) for the finite γ asymptotic regime.

These results are useful for evaluating the performance of principal component analysis (Lee et al., 2010). However, one may be concerned about the applicability of the theoretical results from the finite γ regime to ultra-high dimensional data, such as next generation sequencing data, where millions of genetic variants are collected from tens or a few hundred samples. Addressing this question is urgent, as the availability of such ultra-high dimensional genomic datasets is expected to increase as the cost of high-throughput technologies decreases. In this paper, we derive asymptotic results that provide theoretical justification for applying the results from the finite γ regime to ultra-high dimensional data. In addition, we compare our results to those from the high-dimension low sample size regime (Hall et al., 2005; Ahn et al., 2007; Jung & Marron, 2009; Jung et al., 2012).

The finite γ and the high-dimension low sample size regimes are based on two seemingly disparate assumptions. In the high-dimension low sample size regime, n is treated as fixed and the population eigenvalues increase with rate $p^{\alpha}$. In the finite γ regime, the population eigenvalues are assumed to be fixed but n grows with p at a constant rate. Our new results on the ultra-high dimensional regime bridge the asymptotic results from the two extreme regimes and improve our understanding of principal component analysis on high-dimensional data.

2. Method

2.1. General Setting

Throughout this paper, we assume that n is a function of p, and denote it by $n_p$ whenever needed. We further define $\gamma_p = p/n_p$. Let $\Sigma_p = E_p \Lambda_p E_p^T$ be a $p \times p$ nonnegative definite matrix with an ordered eigenvalue matrix $\Lambda_p = \mathrm{diag}(\lambda_{p1}, \ldots, \lambda_{pp})$ and an orthogonal eigenvector matrix $E_p = (e_{p1}, \ldots, e_{pp})$. Both eigenvalues and eigenvectors are fixed sequences which depend on p. Define the $p \times n$ data matrix $X_p = E_p \Lambda_p^{1/2} Z_p$, where $Z_p$ is a $p \times n$ random matrix whose elements $z_{ij}$ are independent and identically distributed with $E(z_{ij}) = 0$, $E(z_{ij}^2) = \sigma^2$ and $E(z_{ij}^4) < \infty$. The sample covariance matrix $S_p$ equals

$$S_p = n^{-1} X_p X_p^T = n^{-1} E_p \Lambda_p^{1/2} Z_p Z_p^T \Lambda_p^{1/2} E_p^T,$$

and the corresponding population covariance matrix of $X_p$ is $\sigma^2 \Sigma_p$. The $\sigma^2 \lambda_{pv}$ are the underlying population eigenvalues. The spectral decomposition of the sample covariance matrix is $S_p = U_p D_p U_p^T$, where $D_p = \mathrm{diag}(d_{p1}, \ldots, d_{pp})$ is the diagonal matrix of the ordered sample eigenvalues, and $U_p = (u_{p1}, \ldots, u_{pp})$ is the corresponding $p \times p$ sample eigenvector matrix. The vth sample principal component score vector, $\hat{p}_v = (\hat{p}_{v1}, \ldots, \hat{p}_{vn})^T$, equals $X^T u_v$. For a new sample with variable $x_{new}$, its vth predicted principal component score is $\hat{q}_v = x_{new}^T u_v$.

Before introducing the main results, we define additional notation for the remainder of the paper. Suppose $a_p$ and $b_p$ are two sequences. We write $a_p \asymp b_p$ if $a_p = O(b_p)$ and $b_p = O(a_p)$, and $a_p \ll b_p$ if $a_p/b_p = o(1)$. For simplicity, we henceforth suppress the subscript p unless we wish to emphasize a quantity's dependence on p, except for the population eigenvector matrix, which is always denoted by $E_p$.
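The setting above can be sketched numerically. The following Python snippet is only an illustration under assumed values: the sizes p = 200 and n = 20, the single spiked eigenvalue of 50, the Gaussian entries and all variable names are choices made here, not taken from the paper. It builds $X = E\Lambda^{1/2}Z$, forms the sample covariance matrix, and also recovers the nonzero sample eigenvalues and eigenvectors from the smaller $n \times n$ matrix $X^T X/n$, which is the practical route when p is much larger than n.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, sigma = 200, 20, 1.0            # illustrative sizes (assumption)

# Population spectrum: one spiked eigenvalue, the rest equal to one (assumption).
lam = np.ones(p)
lam[0] = 50.0
E, _ = np.linalg.qr(rng.standard_normal((p, p)))   # some orthogonal matrix E_p

Z = sigma * rng.standard_normal((p, n))            # i.i.d. entries, mean 0, variance sigma^2
X = E @ (np.sqrt(lam)[:, None] * Z)                # X = E Lambda^{1/2} Z

# Direct route: p x p sample covariance matrix and its ordered eigenvalues.
S = X @ X.T / n
d_full = np.linalg.eigvalsh(S)[::-1]

# Dual route: X^T X / n shares the nonzero eigenvalues of S.
d_dual, W = np.linalg.eigh(X.T @ X / n)
d_dual, W = d_dual[::-1], W[:, ::-1]
U = X @ W / np.sqrt(n * d_dual)                    # u_v = X w_v / (n d_v)^{1/2}

# First sample principal component score vector, X^T u_1.
scores = X.T @ U[:, 0]
```

With these definitions the score vector equals $(n d_1)^{1/2} w_1$, so the scores can be read off the dual eigenvectors without ever forming a $p \times p$ matrix.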

2.2. Main Results

In the sequel, we assume $\gamma_p \to \infty$ and $n \to \infty$ as $p \to \infty$. We further assume the spiked eigenvalue model (Johnstone, 2001) in which the first m population eigenvalues are substantially larger than the remaining non-spiked eigenvalues. In the random matrix context, it is typically assumed that all non-spiked population eigenvalues equal unity (Johnstone, 2001; Baik & Silverstein, 2006). This strong condition is unlikely to be satisfied in many situations. We define two weaker sphericity conditions. Let $\phi(k) = (p-m)^{-1} \sum_{v=m+1}^{p} (\lambda_v - \bar{\lambda})^k$ be the kth central moment of the non-spiked population eigenvalues, where $\bar{\lambda} = (p-m)^{-1} \sum_{v=m+1}^{p} \lambda_v$.

Condition 1. The non-spiked population eigenvalues satisfy $\phi(2) = o(n^{-2} p)$.

Condition 2. The non-spiked population eigenvalues satisfy $\phi(2) = o(n^{-3/2} p)$, $\phi(4) = O(1)$, and $\phi(4) = o(n^{-4} p^3)$.

Condition 1 is closely related to the sphericity measure in John (1971, 1972) and the $\varepsilon_m$ condition of Jung & Marron (2009). Detailed explanations of both conditions can be found in the Supplementary Material. The following theorem summarizes the convergence results of the sample eigenvalues and eigenvectors.
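To make the central moments in these conditions concrete, here is a small Python sketch. The function name `central_moment` and the example spectra are assumptions made for illustration. It computes the kth central moment of a list of non-spiked eigenvalues; identical eigenvalues, the classical assumption, give a moment of exactly zero, so the conditions can be read as bounds on how far the non-spiked spectrum may depart from that case.

```python
def central_moment(nonspiked, k):
    """k-th central moment of the non-spiked population eigenvalues.

    `nonspiked` holds lambda_{m+1}, ..., lambda_p, so its length is p - m.
    """
    count = len(nonspiked)
    mean = sum(nonspiked) / count
    return sum((lam - mean) ** k for lam in nonspiked) / count


# All non-spiked eigenvalues equal to one: both moments vanish.
flat = [1.0] * 1000
# A mildly non-spherical spectrum (hypothetical values).
wavy = [0.9, 1.1] * 500

phi2_flat = central_moment(flat, 2)
phi2_wavy = central_moment(wavy, 2)
phi4_wavy = central_moment(wavy, 4)
```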

Theorem 1

Let $c_v = \lambda_v/\gamma_p$ $(v \le m)$. Suppose that $c_m < \cdots < c_1$, and $c_m \asymp \cdots \asymp c_1$. Let the remaining population eigenvalues satisfy Condition 1 or Condition 2.

  1. When $c_v$ is bounded away from zero, for $v \le m$, $d_v \lambda_v^{-1} - \sigma^2 c_v^{-1}(c_v + 1) \to 0$ in probability, and $|\langle e_v, u_v \rangle| - \{c_v (c_v + 1)^{-1}\}^{1/2} \to 0$ in probability, where $\langle \cdot, \cdot \rangle$ is the inner product between two vectors. For $v > m$, $d_v \gamma_p^{-1} \to \sigma^2$ in probability.

  2. When $c_v = o(1)$, for all v, $d_v \gamma_p^{-1} \to \sigma^2$ in probability, and $|\langle e_v, u_v \rangle| \to 0$ in probability.

The proof can be found in the Supplementary Material. Theorem 1 includes convergence results for both the spiked and non-spiked sample eigenvalues. These results clearly indicate that the asymptotic behavior of sample eigenvalues and eigenvectors depends on $c_v$, which can be viewed as a signal to noise ratio, where $\lambda_v$ represents the signal strength and $\gamma_p$ serves as a surrogate of the noise level.

When $\lambda_v$ grows at the same rate as, or at a higher rate than, $\gamma_p$, the spiked eigenvalues are separable from the bulk. When $\lambda_v$ grows at a slower rate than $\gamma_p$, i.e., $c_v = o(1)$, the spiked eigenvalues cannot be separated from the non-spiked eigenvalues. Theorem 1 also shows that $d_v$ is in general an inconsistent estimator of $\sigma^2 \lambda_v$, since the ratio $d_v \lambda_v^{-1}$ tends to $\sigma^2 c_v^{-1}(c_v + 1)$ rather than $\sigma^2$ when $c_v$ stays bounded. The sample eigenvectors show a similar pattern. Examples of the asymptotic behavior of the sample eigenvalues and eigenvectors under several conditions are described in the Supplementary Material. To mimic the high-dimension low sample size regime, let $\lambda_v$ be a function of $\gamma_p^{\alpha}$ such that the limit of $\lambda_v/\gamma_p^{\alpha}$ exists and is finite. Now we have the following corollary.
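The limits in part 1 of Theorem 1 are simple functions of the signal to noise ratio, and a short Python helper makes them easy to evaluate. The function name and the dictionary keys are assumptions introduced here for illustration. The entry `d_over_gamma` is just the eigenvalue limit multiplied by $c_v$, which is the rescaling used later in the numerical study.

```python
import math


def theorem1_limits(c, sigma2=1.0):
    """Limits from Theorem 1, part 1, for a spike with c = lambda / gamma > 0."""
    if c <= 0:
        raise ValueError("c must be positive")
    return {
        # limit of d_v / lambda_v
        "d_over_lambda": sigma2 * (c + 1.0) / c,
        # the same limit expressed for d_v / gamma_p
        "d_over_gamma": sigma2 * (c + 1.0),
        # limit of |<e_v, u_v>|
        "abs_inner_product": math.sqrt(c / (c + 1.0)),
    }


moderate = theorem1_limits(1.0)   # signal as strong as the noise level
strong = theorem1_limits(100.0)   # signal far above the noise level
```

For c = 1 the sample eigenvalue overshoots the population eigenvalue by a factor of two and the inner product is about 0.71; as c grows, both limits approach the consistent values $\sigma^2$ and one.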

Corollary 1

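Let $\lambda_v/\gamma_p^{\alpha} \to c_v$ $(v \le m)$, $c_m < \cdots < c_1$, and the remaining population eigenvalues satisfy Condition 1 or Condition 2. Then, for $v \le m$, $d_v/\max(\gamma_p, \gamma_p^{\alpha})$ converges in probability to $\sigma^2 c_v$, $\sigma^2 c_v + \sigma^2$, and $\sigma^2$ when $\alpha > 1$, $\alpha = 1$, and $\alpha < 1$, respectively. With the same assumption, $|\langle e_v, u_v \rangle|$ converges in probability to unity, $\{c_v (c_v + 1)^{-1}\}^{1/2}$, and zero when $\alpha > 1$, $\alpha = 1$, and $\alpha < 1$, respectively.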
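The three cases of Corollary 1 can be tabulated with a few lines of Python. This is a sketch only: the function name is an assumption, and $c_v$ is taken to be the limit of $\lambda_v/\gamma_p^{\alpha}$ as in the statement of the corollary.

```python
import math


def corollary1_limits(c, alpha, sigma2=1.0):
    """Return (limit of d_v / max(gamma, gamma**alpha), limit of |<e_v, u_v>|)."""
    if alpha > 1:
        # Spike dominates the noise: consistent eigenvector.
        return sigma2 * c, 1.0
    if alpha == 1:
        # Spike and noise of the same order: biased eigenvalue, partial alignment.
        return sigma2 * c + sigma2, math.sqrt(c / (c + 1.0))
    # Noise dominates: eigenvalue absorbed by the bulk, eigenvector uninformative.
    return sigma2, 0.0


limits_by_alpha = {alpha: corollary1_limits(2.0, alpha) for alpha in (1.5, 1.0, 0.5)}
```

The case $\alpha = 1$ reproduces part 1 of Theorem 1, while $\alpha > 1$ and $\alpha < 1$ correspond to the consistent and the fully inconsistent extremes.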

The proof can be found in the Supplementary Material. The corollary allows us to compare our results to those from the high-dimension low sample size regime. See Section 2.3 for details. After principal component analysis, the sample principal component scores are often used to summarize data. Predicted principal component scores may also be calculated on new samples for a variety of reasons (Jolliffe, 2002). The next theorem presents the asymptotic results on the principal component scores under the ultra-high dimensional regime.

Theorem 2

Suppose the assumptions in Theorem 1 hold, and $c_v$ $(v \le m)$ is bounded away from zero. Let $p_v = X^T e_v$ be the vth population principal component score vector derived from the corresponding vth population eigenvector, and $\mathrm{corr}(\cdot, \cdot)$ be the correlation function. Then, for $v \le m$, $\mathrm{corr}(p_v, \hat{p}_v) \to 1$ in probability, and $\{E(\hat{q}_v^2)/E(\hat{p}_{vj}^2)\}^{1/2} - c_v (c_v + 1)^{-1} \to 0$ in probability, for all $j = 1, \ldots, n$.

The proof is given in the Supplementary Material. One striking feature in Theorem 2 is that the correlation between $p_v$ and $\hat{p}_v$ can converge to unity even when the corresponding sample eigenvector is not consistent. Combining Theorems 1 and 2, we conclude that $\hat{p}_v$ can accurately estimate $p_v$ whenever its corresponding sample eigenvalue is separable from the bulk. This interesting result may partially explain the success of principal component analysis for high dimensional datasets, such as genome-wide association data (Price et al., 2006; Patterson et al., 2006). This theorem also illustrates the shrinkage phenomenon of the predicted principal component scores, previously reported by Lee et al. (2010). To apply the asymptotic results in Theorems 1 and 2 to data, we need to estimate $\sigma^2$. Lee et al. (2010) proposed an algorithm to rescale the data so that $\sigma^2$ of the rescaled data equals unity. The same approach can be applied to ultra-high dimensional data.
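A small simulation can illustrate both statements of Theorem 2. The Python sketch below rests on assumptions made for illustration only: Gaussian data, a single spike with c = 1, a diagonal population covariance so that the first population eigenvector is the first coordinate axis, and the sizes n = 100 and $\gamma_p$ = 100. The function name and its arguments are hypothetical.

```python
import numpy as np


def theorem2_demo(n=100, gamma=100, c=1.0, n_new=400, seed=1):
    """One simulated data set with a single spike lambda_1 = c * gamma, sigma^2 = 1."""
    rng = np.random.default_rng(seed)
    p = n * gamma
    sd = np.ones(p)
    sd[0] = np.sqrt(c * gamma)                     # e_1 = (1, 0, ..., 0)

    X = sd[:, None] * rng.standard_normal((p, n))
    d, W = np.linalg.eigh(X.T @ X / n)             # dual n x n eigenproblem
    d1, w1 = d[-1], W[:, -1]                       # eigh returns ascending order
    u1 = X @ w1 / np.sqrt(n * d1)                  # first sample eigenvector

    pop_scores = X[0]                              # X^T e_1
    sample_scores = X.T @ u1                       # X^T u_1
    corr = abs(np.corrcoef(pop_scores, sample_scores)[0, 1])

    # Predicted scores on independent new samples drawn from the same model.
    X_new = sd[:, None] * rng.standard_normal((p, n_new))
    predicted = X_new.T @ u1
    shrinkage = np.sqrt(np.mean(predicted**2) / np.mean(sample_scores**2))

    return {"inner_product": abs(u1[0]), "correlation": corr, "shrinkage": shrinkage}


result = theorem2_demo()
```

Under these assumptions the inner product should sit well below one, near $\{c/(c+1)\}^{1/2} \approx 0.71$, while the score correlation is close to one and the empirical shrinkage factor is near $c/(c+1) = 0.5$. The absolute value of the correlation is used because the sign of a sample eigenvector is arbitrary.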

2.3. Comparisons to existing asymptotic results

In the finite γ framework, it is typically assumed that the spiked population eigenvalues are finite. Under the spiked population model, the following results have been established (Baik & Silverstein, 2006; Paul, 2007; Lee et al., 2010). For sample eigenvalues,

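$$
\begin{aligned}
d_v - \sigma^2 \lambda_v \{1 + \gamma (\lambda_v - 1)^{-1}\} &= o_p(1), & \lambda_v &> 1 + \gamma^{1/2}, \\
d_v - \sigma^2 (1 + \gamma^{1/2})^2 &= o_p(1), & \lambda_v &\le 1 + \gamma^{1/2},
\end{aligned} \tag{1}
$$

and for predicted principal component scores,

$$\{E(\hat{q}_v^2)/E(\hat{p}_{vj}^2)\}^{1/2} - (\lambda_v - 1)(\lambda_v + \gamma - 1)^{-1} = o_p(1). \tag{2}$$

Equation (1) shows that if $\lambda_v > 1 + \gamma^{1/2}$, its corresponding sample eigenvalue is separable from the bulk. Interestingly, the result in (1) for $\lambda_v > 1 + \gamma^{1/2}$ is equivalent to the asymptotic result in Theorem 1 when $\lambda_v$ is relatively large. To see this, note that by replacing $\lambda_v$ in (1) with $c_v \gamma$, we obtain

$$d_v \lambda_v^{-1} \to \sigma^2 + (\sigma^2 \gamma)(\lambda_v - 1)^{-1} \approx \sigma^2 c_v^{-1}(c_v + 1),$$

which accords with Theorem 1. For the predicted principal component scores, the same holds true. By (2),

$$\{E(\hat{q}_v^2)/E(\hat{p}_{vj}^2)\}^{1/2} \to (\lambda_v - 1)(\lambda_v + \gamma - 1)^{-1} \approx c_v (c_v + 1)^{-1},$$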

which is consistent with Theorem 2. Similar conclusions can be drawn for the sample eigenvectors and principal component scores and are omitted here. In summary, for ultra-high dimensional data, both the finite γ and ultra-high dimensional asymptotic results can be used to investigate the behavior of sample eigenvalues, eigenvectors and principal component scores, and both produce similar conclusions.
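The agreement between the two regimes can be checked numerically by evaluating the finite γ formulas (1) and (2) next to the ultra-high dimensional limits. The Python sketch below assumes $\sigma^2 = 1$ by default and uses function names invented for this illustration. Eigenvalues are rescaled by γ, as in the numerical study.

```python
import math


def finite_gamma_rescaled(c, gamma, sigma2=1.0):
    """Finite-gamma limit of d_v / gamma from equation (1), with lambda_v = c * gamma."""
    lam = c * gamma
    if lam > 1 + math.sqrt(gamma):
        return sigma2 * lam * (1 + gamma / (lam - 1)) / gamma
    return sigma2 * (1 + math.sqrt(gamma)) ** 2 / gamma


def ultra_high_rescaled(c, sigma2=1.0, c_vanishes=False):
    """Theorem 1 limit of d_v / gamma_p: sigma^2 (c + 1), or sigma^2 when c = o(1)."""
    return sigma2 if c_vanishes else sigma2 * (c + 1.0)


def finite_gamma_shrinkage(c, gamma):
    """Shrinkage factor from equation (2), with lambda_v = c * gamma."""
    lam = c * gamma
    return (lam - 1) / (lam + gamma - 1)


def ultra_high_shrinkage(c):
    """Shrinkage factor from Theorem 2."""
    return c / (c + 1.0)


gamma = 500
for label, c in [("moderate", 1.0), ("very large", math.sqrt(gamma))]:
    print(label,
          round(finite_gamma_rescaled(c, gamma), 2),
          round(ultra_high_rescaled(c), 2),
          round(finite_gamma_shrinkage(c, gamma), 2),
          round(ultra_high_shrinkage(c), 2))
```

The two sets of values agree to two decimals whenever the spike is separable from the bulk. They differ only for vanishing spikes, where the finite γ formula returns the bulk edge $(1 + \gamma^{1/2})^2/\gamma$, slightly above one, and the ultra-high dimensional limit returns exactly $\sigma^2$.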

Under the high-dimension low sample size regime, the spiked population eigenvalues are assumed to grow at rate $p^{\alpha}$ $(\alpha > 0)$. Jung et al. (2012) proved the following results with an additional Gaussian assumption on X. For the first sample eigenvalue,

$$
\frac{d_1}{\max(p, p^{\alpha})} \to
\begin{cases}
\sigma^2 \hat{c}_1 \chi_n^2/n, & \alpha > 1, \\
\sigma^2 \hat{c}_1 \chi_n^2/n + \sigma^2/n, & \alpha = 1, \\
\sigma^2/n, & \alpha < 1,
\end{cases} \tag{3}
$$

in distribution, where $\hat{c}_1 = \lim_{p \to \infty} \lambda_1/p^{\alpha}$, and $\chi_n^2$ denotes the chi-square distribution with n degrees of freedom. For the first sample eigenvector,

$$
|\langle e_1, u_1 \rangle| \to
\begin{cases}
1, & \alpha > 1, \\
\{\hat{c}_1 \chi_n^2/(\hat{c}_1 \chi_n^2 + 1)\}^{1/2}, & \alpha = 1, \\
0, & \alpha < 1,
\end{cases} \tag{4}
$$

in distribution. Results with more relaxed assumptions can be found in Jung et al. (2012). In the high-dimension low sample size regime, the asymptotic behavior of sample eigenvalues and eigenvectors depends on the growth rate of $\lambda_v$ relative to p. In Corollary 1, $\lambda_v$ is expressed as a function of $\gamma_p$, instead of p directly. However, it should be noted that $\gamma_p^{\alpha}$ is equivalent to $p^{\alpha}$ in the high-dimension low sample size regime where n is treated as fixed. Equation (3) shows that when $\alpha > 1$, the scaled sample eigenvalue converges in distribution to the random variable $\sigma^2 \hat{c}_1 \chi_n^2/n$ as $p \to \infty$. Combining this with the fact that $\chi_n^2/n \to 1$ in probability as $n \to \infty$, we arrive at the same conclusion as in Corollary 1. When $\alpha = 1$, $d_1 - \sigma^2 \lambda_1 \approx \sigma^2 p/n$, and thus the sample eigenvalue is biased with a bias $\sigma^2 p/n$. Equation (4) indicates that the first sample eigenvector is consistent when $\alpha > 1$ and is asymptotically perpendicular to the first population eigenvector when $\alpha < 1$. When $\alpha = 1$, the sample eigenvector is neither consistent nor asymptotically perpendicular to the first population eigenvector. In conclusion, our asymptotic results clearly parallel the asymptotic results for the high-dimension low sample size regime, and embed them within a larger framework.
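The link between the random limit in (4) and the deterministic limit in Corollary 1 can be illustrated by simulation. The Python sketch below rests on one reading that is an assumption of this illustration, not a statement from the paper: for $\alpha = 1$, writing $\lambda_1 = c\,\gamma_p = (c/n)\,p$ gives $\hat{c}_1 = c/n$, so the random quantity in (4) becomes $\{c\chi_n^2/n\,/\,(c\chi_n^2/n + 1)\}^{1/2}$. Chi-square draws are generated as gamma variables with shape n/2 and scale 2, and the function names are hypothetical.

```python
import math
import random


def hdlss_inner_product_draw(c, n, rng):
    """One draw of the alpha = 1 limit in (4), taking c_hat = c / n (assumption)."""
    chi2 = rng.gammavariate(n / 2.0, 2.0)   # chi-square with n degrees of freedom
    scaled = (c / n) * chi2
    return math.sqrt(scaled / (scaled + 1.0))


def summarize(c, n, draws=2000, seed=0):
    """Median and range of the random limit over repeated draws."""
    rng = random.Random(seed)
    values = sorted(hdlss_inner_product_draw(c, n, rng) for _ in range(draws))
    return values[draws // 2], values[-1] - values[0]


small_n = summarize(1.0, 10)       # genuinely random limit when n is small
large_n = summarize(1.0, 10000)    # concentrates as n grows
target = math.sqrt(1.0 / 2.0)      # {c / (c + 1)}^{1/2} from Corollary 1 with c = 1
```

With n = 10 the limit is visibly random, whereas with n = 10000 the draws cluster tightly around $\{c/(c+1)\}^{1/2}$, which is how the fixed-n result merges into the ultra-high dimensional one.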

3. Numerical Study

We conducted simulations to illustrate our theoretical results. A $p \times n$ data matrix X was generated with columns drawn from $N(0, \Lambda)$, where $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_p)$. We set $\lambda_1 = c_1 \gamma_p$, $\lambda_2 = c_2 \gamma_p$ and $\lambda_3 = \cdots = \lambda_p = 1$. The first and second population eigenvectors were $e_1 = (1, 0, \ldots, 0)$ and $e_2 = (0, 1, 0, \ldots, 0)$, respectively. Four different sets of $c_v$ values were selected to represent different scenarios: no spiked eigenvalues, $c_1 = c_2 = \gamma_p^{-1}$; very small spiked eigenvalues, $c_1 = \gamma_p^{-1/2}$, $c_2 = 0.7\gamma_p^{-1/2}$; moderate spiked eigenvalues, $c_1 = 1$, $c_2 = 0.7$; and very large spiked eigenvalues, $c_1 = \gamma_p^{1/2}$, $c_2 = 0.7\gamma_p^{1/2}$. The first two scenarios correspond to the case that $c_v = o(1)$, and the last two to the case that $c_v$ is bounded away from zero. Two different $\gamma_p$ values, 500 and 2000, were considered, and the sample size was fixed at 100. For each of the simulation setups, we generated 500 datasets and computed the sample eigenvalues and the inner products between the sample and population eigenvectors.

Table 1 reports the medians and inter-quartile ranges of the estimates. The theoretical asymptotic values of the sample eigenvalues and the inner products from the finite γ and the ultra-high dimensional regimes are also presented. The sample eigenvalues were rescaled by $\gamma_p$. For data with no spiked or with very small spiked eigenvalues, the first and second sample eigenvalues are slightly upward-biased from unity. However, they match well with the theoretical ones from the finite γ regime. For data with moderate or large spiked eigenvalues, the theoretical estimates from the finite γ regime and the ultra-high dimensional regime are identical, and are well matched with the sample eigenvalues. For the inner products of the sample eigenvectors, the empirical estimates match well with the ones from the finite γ and ultra-high dimensional regimes, and the two sets of the theoretical results are identical.
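The simulation design can be sketched in Python at a reduced scale. This is not the authors' code: the function name, the smaller $\gamma_p$ = 100, the 20 replicates and the seed are assumptions chosen so that the example runs quickly. Because the population covariance is diagonal, the inner products with $e_1$ and $e_2$ are just the first and second coordinates of the sample eigenvectors, and only those coordinates are computed.

```python
import numpy as np


def simulate(gamma, n, c1, c2, reps=20, seed=0):
    """Medians of (d_1/gamma, d_2/gamma, |<e_1,u_1>|, |<e_2,u_2>|) over replicates."""
    rng = np.random.default_rng(seed)
    p = gamma * n
    sd = np.ones(p)                                  # lambda_3 = ... = lambda_p = 1
    sd[0], sd[1] = np.sqrt(c1 * gamma), np.sqrt(c2 * gamma)

    rows = []
    for _ in range(reps):
        X = sd[:, None] * rng.standard_normal((p, n))   # columns ~ N(0, Lambda)
        d, W = np.linalg.eigh(X.T @ X / n)               # dual n x n eigenproblem
        d, W = d[::-1], W[:, ::-1]                       # descending order
        # First two coordinates of u_1 and u_2, using u_v = X w_v / (n d_v)^{1/2}.
        U2 = X[:2] @ W[:, :2] / np.sqrt(n * d[:2])
        rows.append([d[0] / gamma, d[1] / gamma, abs(U2[0, 0]), abs(U2[1, 1])])
    return np.median(np.array(rows), axis=0)


# Moderate-spike scenario, c1 = 1 and c2 = 0.7, at reduced scale.
medians = simulate(gamma=100, n=100, c1=1.0, c2=0.7)
```

Under these assumptions the medians should fall near the moderate-spike limits of Theorem 1, that is, about 2.0 and 1.7 for the rescaled eigenvalues and about 0.71 and 0.64 for the inner products, up to finite-sample fluctuation.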

Table 1.

Rescaled sample eigenvalues and eigenvectors based on 500 simulations. Ultra-High and Finite γ present theoretical asymptotic values from the ultra-high dimensional and the finite γ regimes, respectively. Observed presents the median of the estimates, with the interquartile range in parentheses.

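| Quantity | Principal component | $\gamma_p$ | Type | No Spike | Very Small Spike | Moderate Spike | Very Large Spike |
|---|---|---|---|---|---|---|---|
| Eigenvalues | 1 | 500 | Ultra-High | 1.00 | 1.00 | 2.00 | 23.36 |
| Eigenvalues | 1 | 500 | Finite γ | 1.09 | 1.09 | 2.00 | 23.36 |
| Eigenvalues | 1 | 500 | Observed | 1.09 (0.004) | 1.09 (0.004) | 2.02 (0.183) | 23.6 (3.868) |
| Eigenvalues | 1 | 2000 | Ultra-High | 1.00 | 1.00 | 2.00 | 45.72 |
| Eigenvalues | 1 | 2000 | Finite γ | 1.05 | 1.05 | 2.00 | 45.72 |
| Eigenvalues | 1 | 2000 | Observed | 1.04 (0.002) | 1.05 (0.002) | 2.02 (0.207) | 46.75 (7.605) |
| Eigenvalues | 2 | 500 | Ultra-High | 1.00 | 1.00 | 1.70 | 16.65 |
| Eigenvalues | 2 | 500 | Finite γ | 1.09 | 1.09 | 1.70 | 16.65 |
| Eigenvalues | 2 | 500 | Observed | 1.08 (0.003) | 1.09 (0.003) | 1.67 (0.125) | 15.98 (2.680) |
| Eigenvalues | 2 | 2000 | Ultra-High | 1.00 | 1.00 | 1.70 | 32.31 |
| Eigenvalues | 2 | 2000 | Finite γ | 1.05 | 1.05 | 1.70 | 32.31 |
| Eigenvalues | 2 | 2000 | Observed | 1.04 (0.001) | 1.04 (0.002) | 1.68 (0.131) | 30.96 (5.370) |
| Eigenvectors | 1 | 500 | Ultra-High | 0.00 | 0.00 | 0.71 | 0.98 |
| Eigenvectors | 1 | 500 | Finite γ | 0.00 | 0.00 | 0.71 | 0.98 |
| Eigenvectors | 1 | 500 | Observed | 0.00 (0.004) | 0.07 (0.069) | 0.69 (0.056) | 0.96 (0.061) |
| Eigenvectors | 1 | 2000 | Ultra-High | 0.00 | 0.00 | 0.71 | 0.99 |
| Eigenvectors | 1 | 2000 | Finite γ | 0.00 | 0.00 | 0.71 | 0.99 |
| Eigenvectors | 1 | 2000 | Observed | 0.00 (0.002) | 0.05 (0.048) | 0.69 (0.056) | 0.97 (0.054) |
| Eigenvectors | 2 | 500 | Ultra-High | 0.00 | 0.00 | 0.64 | 0.97 |
| Eigenvectors | 2 | 500 | Finite γ | 0.00 | 0.00 | 0.64 | 0.97 |
| Eigenvectors | 2 | 500 | Observed | 0.00 (0.003) | 0.02 (0.030) | 0.607 (0.055) | 0.95 (0.059) |
| Eigenvectors | 2 | 2000 | Ultra-High | 0.00 | 0.00 | 0.64 | 0.98 |
| Eigenvectors | 2 | 2000 | Finite γ | 0.00 | 0.00 | 0.64 | 0.98 |
| Eigenvectors | 2 | 2000 | Observed | 0.00 (0.002) | 0.02 (0.022) | 0.61 (0.052) | 0.96 (0.053) |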

Table 2 summarizes the results for the sample and predicted principal component scores. For the sample principal component scores, the medians and inter-quartile ranges of their Pearson correlations with the population principal component scores were calculated. The theoretical results from both the finite γ and ultra-high dimensional regimes are identical and both match well with the empirical estimates. For the predicted principal component scores, we followed exactly the same simulation procedure as described above to generate a new dataset for each of the simulated datasets. We computed the predicted principal component scores on each new dataset. The empirical shrinkage factor was calculated as the ratio of the means of the squared predicted and sample principal component scores. Again, similar conclusions hold. The theoretical results from both the finite γ and ultra-high dimensional regimes are effectively identical. The empirical estimates approximate the theoretical results very well.

Table 2.

Sample and predicted principal component scores based on 500 simulations. Ultra-High and Finite γ present theoretical asymptotic values from the ultra-high dimensional and the finite γ regimes, respectively. Observed presents the median of the estimates, with the interquartile range in parentheses.

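| Quantity | $\gamma_p$ | Type | Principal Component 1, Moderate Spike | Principal Component 1, Very Large Spike | Principal Component 2, Moderate Spike | Principal Component 2, Very Large Spike |
|---|---|---|---|---|---|---|
| Principal Component Scores | 500 | Ultra-High | 1.00 | 1.00 | 1.00 | 1.00 |
| Principal Component Scores | 500 | Finite γ | 1.00 | 1.00 | 1.00 | 1.00 |
| Principal Component Scores | 500 | Observed | 0.99 (0.042) | 0.99 (0.043) | 0.97 (0.090) | 0.97 (0.087) |
| Principal Component Scores | 2000 | Ultra-High | 1.00 | 1.00 | 1.00 | 1.00 |
| Principal Component Scores | 2000 | Finite γ | 1.00 | 1.00 | 1.00 | 1.00 |
| Principal Component Scores | 2000 | Observed | 0.99 (0.039) | 0.99 (0.038) | 0.97 (0.078) | 0.97 (0.079) |
| Shrinkage Factors | 500 | Ultra-High | 0.50 | 0.96 | 0.41 | 0.94 |
| Shrinkage Factors | 500 | Finite γ | 0.50 | 0.96 | 0.41 | 0.94 |
| Shrinkage Factors | 500 | Observed | 0.49 (0.045) | 0.93 (0.120) | 0.42 (0.046) | 0.97 (0.122) |
| Shrinkage Factors | 2000 | Ultra-High | 0.50 | 0.98 | 0.41 | 0.97 |
| Shrinkage Factors | 2000 | Finite γ | 0.50 | 0.98 | 0.41 | 0.97 |
| Shrinkage Factors | 2000 | Observed | 0.49 (0.045) | 0.95 (0.127) | 0.42 (0.041) | 1.00 (0.118) |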

Acknowledgments

This research was supported by the National Institutes of Health, U.S.A. We are grateful to the editor, associate editor and a referee for their help in improving our presentation.

Footnotes

Supplementary material available at Biometrika online includes detailed description of sphericity conditions, proofs and two examples.

Contributor Information

Seunggeun Lee, Email: leeshawn@umich.edu, Department of Biostatistics, University of Michigan, 1415 Washington Heights, Ann Arbor, Michigan 48109, U.S.A.

Fei Zou, Email: fzou@bios.unc.edu, Department of Biostatistics, University of North Carolina, 135 Dauer Drive, Chapel Hill, North Carolina 27599, U.S.A.

Fred A. Wright, Email: fred_wright@ncsu.edu, Bioinformatics Research Center, North Carolina State University, 1 Lampe Drive, Raleigh, North Carolina 27607, U.S.A.

References

  1. Ahn J, Marron JS, Muller KM, Chi YY. The high-dimension, low-sample-size geometric representation holds under mild conditions. Biometrika. 2007;94:760–766.
  2. Baik J, Silverstein JW. Eigenvalues of large sample covariance matrices of spiked population models. J Multivariate Anal. 2006;97:1382–1408.
  3. Hall P, Marron JS, Neeman A. Geometric representation of high dimension, low sample size data. J R Statist Soc B. 2005;67:427–444.
  4. John S. Some optimal multivariate tests. Biometrika. 1971;58:123–127.
  5. John S. The distribution of a statistic used for testing sphericity of normal distributions. Biometrika. 1972;59:169–173.
  6. Johnstone IM. On the distribution of the largest eigenvalue in principal components analysis. Ann Statist. 2001;29:295–327.
  7. Jolliffe I. Principal Component Analysis. Springer; New York: 2002.
  8. Jung S, Marron JS. PCA consistency in high dimension, low sample size context. Ann Statist. 2009;37:4104–4130.
  9. Jung S, Sen A, Marron JS. Boundary behavior in high dimension, low sample size asymptotics of PCA. J Multivariate Anal. 2012;109:190–203.
  10. Lee S, Zou F, Wright F. Convergence and prediction of principal component scores in high-dimensional settings. Ann Statist. 2010;38:3605–3629. doi: 10.1214/10-AOS821.
  11. Marčenko V, Pastur L. Distribution of eigenvalues for some sets of random matrices. Sbornik: Mathematics. 1967;1:457–483.
  12. Nadler B. Finite sample approximation results for principal component analysis: a matrix perturbation approach. Ann Statist. 2008;36:2791–2817.
  13. Patterson N, Price A, Reich D. Population structure and eigenanalysis. PLoS Genet. 2006;2:e190. doi: 10.1371/journal.pgen.0020190.
  14. Paul D. Asymptotics of sample eigenstructure for a large dimensional spiked covariance model. Statist Sinica. 2007;17:1617–1642.
  15. Price A, Patterson N, Plenge R, Weinblatt M, Shadick N, Reich D. Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet. 2006;38:904–909. doi: 10.1038/ng1847.
