Convergence of Sample Eigenvalues, Eigenvectors, and Principal Component Scores for Ultra-High Dimensional Data

Seunggeun Lee; Fei Zou; Fred A Wright

doi:10.1093/biomet/ast064

Author manuscript; available in PMC: 2014 Aug 18.

Published in final edited form as: Biometrika. 2014 Feb 12;101(2):484–490. doi: 10.1093/biomet/ast064

PMCID: PMC4135472 NIHMSID: NIHMS594795 PMID: 25143634

Summary

The development of high-throughput biomedical technologies has led to increased interest in the analysis of high-dimensional data where the number of features is much larger than the sample size. In this paper, we investigate principal component analysis under the ultra-high dimensional regime, where both the number of features and the sample size increase, with the ratio of the two quantities also increasing. We bridge the existing results from the finite and the high-dimension low sample size regimes, embedding the two regimes in a more general framework. We also numerically demonstrate the universal application of the results from the finite regime.

Some key words: High-Dimension Low Sample Size Data, Principal Component Analysis, Random Matrix

1. Introduction

With the development of modern high-throughput technologies, it is common to encounter data with many more features, p, than the number of samples, n. In modern genomics applications, for instance, the number of features often ranges from tens of thousands to millions, while the corresponding sample sizes typically range from hundreds to thousands. For those high-dimensional data, principal component analysis is popular for data exploration and dimension reduction. Since principal component analysis is based on the eigenvalues and eigenvectors of the sample covariance matrix, its performance largely depends on the behavior of the sample eigenvalues and eigenvectors.

In their seminal paper on random matrices, Marčenko & Pastur (1967) derived the asymptotic distribution of the sample eigenvalues under the finite γ regime, where p → ∞, n → ∞ and p/n → γ < ∞. Specifically, they showed that the sample eigenvalues follow the Marčenko–Pastur law when all the population eigenvalues are identical. For data where the true signal is embedded in a low dimensional space, Johnstone (2001) introduced the spiked eigenvalue model, where a small number of population eigenvalues are substantially larger than the rest. Under this model, asymptotic results on the sample eigenvalues and eigenvectors have been derived (Baik & Silverstein, 2006; Paul, 2007; Nadler, 2008; Lee et al., 2010) for the finite γ asymptotic regime.

These results are useful for evaluating the performance of principal component analysis (Lee et al., 2010). However, one may be concerned about the applicability of the theoretical results from the finite γ regime to ultra-high dimensional data, such as next generation sequencing data, where millions of genetic variants are collected from tens or a few hundred samples. Addressing this question is urgent, as the availability of such ultra-high dimensional genomic datasets is expected to increase as the cost of high-throughput technologies decreases. In this paper, we derive asymptotic results that provide theoretical justification for applying the results from the finite γ regime to ultra-high dimensional data. In addition, we compare our results to those from the high-dimension low sample size regime (Hall et al., 2005; Ahn et al., 2007; Jung & Marron, 2009; Jung et al., 2012).

The finite γ and the high-dimension low sample size regimes are based on two seemingly disparate assumptions. In the high-dimension low sample size regime, n is treated as fixed and the population eigenvalues increase with rate p^α. In the finite γ regime, the population eigenvalues are assumed to be fixed but n grows with p at a constant rate. Our new results on the ultra-high dimensional regime bridge the asymptotic results from the two extreme regimes and improve our understanding of principal component analysis on high-dimensional data.

2. Method

2.1. General Setting

Throughout this paper, we assume that n is a function of p, and denote it by n_p whenever needed. We further define γ_p = p/n_p. Let $Σ_{p} = E_{p} Λ_{p} E_{p}^{T}$ be a p × p nonnegative definite matrix with an ordered eigenvalue matrix Λ_p = diag(λ_p₁,…, λ_pp) and an orthogonal eigenvector matrix E_p = (e_p₁,…, e_pp). Both eigenvalues and eigenvectors are fixed sequences which depend on p. Define the p × n data matrix $X_{p} = E_{p} Λ_{p}^{1 / 2} Z_{p}$, where Z_p is a p × n random matrix whose elements z_ij are independent and identically distributed with $E (z_{i j}) = 0$, $E (z_{i j}^{2}) = σ^{2}$ and $E (z_{i j}^{4}) < \infty$. The sample covariance matrix S_p equals

S_{p} = n^{- 1} X_{p} X_{p}^{T} = n^{- 1} E_{p} Λ_{p}^{1 / 2} Z_{p} Z_{p}^{T} Λ_{p}^{1 / 2} E_{p}^{T},

and the corresponding population covariance matrix of X_p is σ²Σ_p. The σ²λ_pv are the underlying population eigenvalues. The spectral decomposition of the sample covariance matrix is $S_{p} = U_{p} D_{p} U_{p}^{T}$, where D_p = diag(d_p₁, …, d_pp) is the diagonal matrix of the ordered sample eigenvalues, and U_p = (u_p₁, …, u_pp) is the corresponding p × p sample eigenvector matrix. The vth sample principal component score vector, p̂_v = (p̂_v₁, …, p̂_vn)^T, equals X^Tu_v. For a new sample with observed vector x_new, its vth predicted principal component score is ${\hat{q}}_{v} = x_{new}^{T} u_{v}$. Before introducing the main results, we define additional notation for the remainder of the paper. Suppose a_p and b_p are two sequences. We write a_p ≍ b_p if a_p = O(b_p) and b_p = O(a_p), and a_p ≪ b_p if a_p/b_p = o(1). For simplicity, we henceforth suppress the subscript p unless we wish to emphasize a quantity's dependence on p, except for the population eigenvector matrix, which is always denoted by E_p.
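As a minimal sketch of the quantities defined above, the following NumPy code computes the leading sample eigenvalues d_v, eigenvectors u_v and score vectors X^Tu_v. It works through the n × n matrix n⁻¹X^TX, which shares its nonzero eigenvalues with S and is much cheaper to decompose when p is far larger than n. The helper names `sample_pca` and `predicted_scores`, the toy dimensions, and the use of this shortcut are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def sample_pca(X, k):
    """Leading k sample eigenvalues d_v, eigenvectors u_v and scores X^T u_v.

    X is the p x n data matrix (features in rows), taken to have mean zero
    as in the model above. Hypothetical helper; assumes d_v > 0 for v <= k.
    """
    p, n = X.shape
    gram = X.T @ X / n                       # n x n, same nonzero eigenvalues as S
    evals, W = np.linalg.eigh(gram)          # eigh returns eigenvalues in ascending order
    idx = np.argsort(evals)[::-1][:k]        # indices of the k largest eigenvalues
    d = evals[idx]
    U = X @ W[:, idx] / np.sqrt(n * d)       # u_v = X w_v / (n d_v)^{1/2}, unit norm
    scores = X.T @ U                         # column v is the sample score vector
    return d, U, scores

def predicted_scores(x_new, U):
    """Predicted scores x_new^T u_v for one new sample (a length-p vector)."""
    return x_new @ U

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 20))           # toy sizes: p = 300, n = 20
d, U, scores = sample_pca(X, k=3)
q_hat = predicted_scores(rng.standard_normal(300), U)
```

A useful check on any such implementation is that the squared entries of the vth score vector sum to n d_v, which follows directly from the spectral decomposition of S.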

2.2. Main Results

In the sequel, we assume γ_p → ∞ and n → ∞ as p → ∞. We further assume the spiked eigenvalue model (Johnstone, 2001) in which the first m population eigenvalues are substantially larger than the remaining non-spiked eigenvalues. In the random matrix context, it is typically assumed that all non-spiked population eigenvalues equal unity (Johnstone, 2001; Baik & Silverstein, 2006). This strong condition is unlikely to be satisfied in many situations. We define two weaker sphericity conditions. Let $ϕ (k) = {(p - m)}^{- 1} \sum_{v = m + 1}^{p} {(λ_{v} - \bar{λ})}^{k}$ be the kth central moment of the non-spiked population eigenvalues, where $\bar{λ} = {(p - m)}^{- 1} \sum_{v = m + 1}^{p} λ_{v}$.

Condition 1. The non-spiked population eigenvalues satisfy ϕ(2) = o(n⁻²p).

Condition 2. The non-spiked population eigenvalues satisfy ϕ(2) = o(n^{−3/2}p), ϕ(4) = O(1), and ϕ(4) = o(n⁻⁴p³).
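To make the two conditions concrete, the sketch below computes the central moments ϕ(k) of the non-spiked eigenvalues and the ratios that the conditions require to vanish. The conditions are asymptotic statements, so evaluating these ratios at a single (n, p) is only a rough diagnostic; the function names `phi` and `condition_ratios` are hypothetical.

```python
import numpy as np

def phi(lam, m, k):
    """k-th central moment of the non-spiked eigenvalues lam[m:] (lam sorted descending)."""
    tail = np.asarray(lam, dtype=float)[m:]
    return np.mean((tail - tail.mean()) ** k)

def condition_ratios(lam, m, n):
    """Ratios that Conditions 1 and 2 require to tend to zero.

    The boundedness requirement phi(4) = O(1) is not covered here.
    """
    p = len(lam)
    return {
        "cond1_phi2": phi(lam, m, 2) / (p / n**2),      # phi(2) / (n^-2 p)
        "cond2_phi2": phi(lam, m, 2) / (p / n**1.5),    # phi(2) / (n^-3/2 p)
        "cond2_phi4": phi(lam, m, 4) / (p**3 / n**4),   # phi(4) / (n^-4 p^3)
    }

# Toy example: one spike, three non-spiked eigenvalues with mean 2.
ratios = condition_ratios([10.0, 1.0, 2.0, 3.0], m=1, n=2)
```

When all non-spiked eigenvalues are equal, as in the usual unit-eigenvalue assumption, every ϕ(k) is zero and both conditions hold trivially.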

Condition 1 is closely related to the sphericity measure in John (1971, 1972) and the ∊_m condition of Jung & Marron (2009). Detailed explanations of both conditions can be found in the Supplementary Material. The following theorem summarizes the convergence results of the sample eigenvalues and eigenvectors.

Theorem 1

Let c_v = λ_v/γ_p (v ≤ m). Suppose that c_m < ⋯ < c₁, and c_m ≍ ⋯ ≍ c₁. Let the remaining population eigenvalues satisfy Condition 1 or Condition 2.

i) When c_v is bounded away from zero, for v ≤ m, $d_{v} λ_{v}^{- 1} - σ^{2} c_{v}^{- 1} (c_{v} + 1) \to 0$ in probability, and |〈e_v, u_v〉| − {c_v(c_v + 1)⁻¹}^{1/2} → 0 in probability, where 〈·, ·〉 is the inner product between two vectors. For v > m, $d_{v} γ_{p}^{- 1} \to σ^{2}$ in probability.
ii) When c_v = o(1), for all v, $d_{v} γ_{p}^{- 1} \to σ^{2}$ in probability, and |〈e_v, u_v〉| → 0 in probability.

The proof can be found in the Supplementary Material. Theorem 1 includes convergence results for both the spiked and non-spiked sample eigenvalues. These results clearly indicate that the asymptotic behavior of sample eigenvalues and eigenvectors depends on c_v, which can be viewed as a signal to noise ratio, where λ_v represents the signal strength and γ_p serves as a surrogate of the noise level.
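The limits in Theorem 1 for a spike with c_v bounded away from zero are simple functions of c_v, and can be tabulated as in the sketch below. The function name `theorem1_limits` and the example values of c_v are illustrative assumptions.

```python
import numpy as np

def theorem1_limits(c, sigma2=1.0):
    """Asymptotic d_v / lambda_v and |<e_v, u_v>| for spikes with ratio c_v = c > 0."""
    c = np.asarray(c, dtype=float)
    eig_ratio = sigma2 * (c + 1.0) / c       # limit of d_v / lambda_v
    inner = np.sqrt(c / (c + 1.0))           # limit of |<e_v, u_v>|
    return eig_ratio, inner

# Example signal-to-noise ratios: weak, moderate, and strong spikes.
c = np.array([0.7, 1.0, 22.36])
ratio, inner = theorem1_limits(c)
rescaled = c * ratio                         # d_v / gamma_p, which equals sigma^2 (c_v + 1)
```

As c_v grows, the eigenvalue ratio approaches σ² and the inner product approaches one, so both the bias in d_v and the angle between e_v and u_v shrink with increasing signal strength.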

When λ_v grows at the same rate as, or at a higher rate than, γ_p, the spiked eigenvalues are separable from the bulk. When λ_v grows at a slower rate than γ_p, i.e., c_v = o(1), the spiked eigenvalues cannot be separated from the non-spiked eigenvalues. Theorem 1 also shows that d_v is an inconsistent estimator of the population eigenvalue σ²λ_v when c_v remains bounded, since d_v/λ_v is then asymptotically σ²(1 + c_v⁻¹) rather than σ². The sample eigenvectors show a similar pattern. Examples of the asymptotic behavior of the sample eigenvalues and eigenvectors under several conditions are described in the Supplementary Material. To mimic the high-dimension low sample size regime, let λ_v be a function of $γ_{p}^{α}$ such that a limit of $λ_{v} / γ_{p}^{α}$ exists and is finite. Now we have the following corollary.

Corollary 1

Let $λ_{v} / γ_{p}^{α} \to {\tilde{c}}_{v}$ (v ≤ m), ${\tilde{c}}_{m} < \dots < {\tilde{c}}_{1}$, and the remaining population eigenvalues satisfy Condition 1 or Condition 2. Then, for v ≤ m, $d_{v} / \max (γ_{p}, γ_{p}^{α})$ converges in probability to σ²c̃_v, σ²c̃_v + σ², and σ² when α > 1, α = 1, and α < 1, respectively. With the same assumption, |〈e_v, u_v〉| converges in probability to unity, {c̃_v(c̃_v + 1)⁻¹}^{1/2}, and zero when α > 1, α = 1, and α < 1, respectively.
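A heuristic way to see where the three cases come from is to substitute $c_{v} = λ_{v} / γ_{p} \approx {\tilde{c}}_{v} γ_{p}^{α - 1}$ into Theorem 1. The following worked steps are an informal sketch for the eigenvalue statement under that substitution, not the proof given in the Supplementary Material.

```latex
\begin{align*}
d_v &\approx \sigma^2 \lambda_v \, c_v^{-1}(c_v + 1) = \sigma^2 (\lambda_v + \gamma_p), \\
\frac{d_v}{\gamma_p^{\alpha}} &\approx \sigma^2 \left( \tilde{c}_v + \gamma_p^{1-\alpha} \right) \to \sigma^2 \tilde{c}_v
  && (\alpha > 1), \\
\frac{d_v}{\gamma_p} &\approx \sigma^2 \left( \tilde{c}_v + 1 \right)
  && (\alpha = 1), \\
\frac{d_v}{\gamma_p} &\to \sigma^2 \quad \text{since } c_v \approx \tilde{c}_v \gamma_p^{\alpha - 1} \to 0
  && (\alpha < 1).
\end{align*}
```

The same substitution in $\{c_{v} (c_{v} + 1)^{- 1}\}^{1 / 2}$ gives the eigenvector limits: the quantity tends to one when c_v → ∞, equals {c̃_v(c̃_v + 1)⁻¹}^{1/2} when α = 1, and tends to zero when c_v → 0.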

The proof can be found in the Supplementary Material. The corollary allows us to compare our results to those from the high-dimension low sample size regime. See Section 2.3 for details. After principal component analysis, the sample principal component scores are often used to summarize data. Predicted principal component scores may also be calculated on new samples for a variety of reasons (Jolliffe, 2002). The next theorem presents the asymptotic results on the principal component scores under the ultra-high dimensional regime.

Theorem 2

Suppose the assumptions in Theorem 1 hold, and c_v (v ≤ m) is bounded away from zero. Let p_v = X^Te_v be the vth population principal component score derived from the corresponding vth population eigenvector, and corr(·, ·) be the correlation function. Then, for v ≤ m, corr(p_v, p̂_v) → 1 in probability, and $\{E ({\hat{q}}_{v}^{2}) / E ({\hat{p}}_{v j}^{2})\}^{1 / 2} - c_{v} {(c_{v} + 1)}^{- 1} \to 0$ in probability, for all j = 1, …, n.

The proof is given in the Supplementary Material. One striking feature in Theorem 2 is that the correlation between p_v and p̂_v can converge to unity even when the corresponding sample eigenvector is not consistent. Combining Theorems 1 and 2, we conclude that p̂_v can accurately estimate p_v whenever its corresponding sample eigenvalue is separable from the bulk. This interesting result may partially explain the success of principal component analysis for high dimensional datasets, such as genome-wide association data (Price et al., 2006; Patterson et al., 2006). This theorem also illustrates the shrinkage phenomenon of the predicted principal component scores, previously reported by Lee et al. (2010). To apply the asymptotic results in Theorems 1 and 2 to data, we need to estimate σ². Lee et al. (2010) proposed an algorithm to rescale the data so that σ² of the rescaled data equals unity. The same approach can be applied to ultra-high dimensional data.

2.3. Comparisons to existing asymptotic results

In the finite γ framework, it is typically assumed that the spiked population eigenvalues are finite. Under the spiked population model, the following results have been established (Baik & Silverstein, 2006; Paul, 2007; Lee et al., 2010). For sample eigenvalues,

\begin{array}{ll} d_{v} - σ^{2} λ_{v} \{1 + γ {(λ_{v} - 1)}^{- 1}\} = o_{p} (1), & λ_{v} > 1 + γ^{1 / 2}, \\ d_{v} - σ^{2} {(1 + γ^{1 / 2})}^{2} = o_{p} (1), & λ_{v} \leq 1 + γ^{1 / 2}, \end{array}

(1)

and for predicted principal component scores,

\{E ({\hat{q}}_{v}^{2}) / E ({\hat{p}}_{v j}^{2})\}^{1 / 2} - (λ_{v} - 1) {(λ_{v} + γ - 1)}^{- 1} = o_{p} (1) .

(2)

Equation (1) shows that if λ_v > 1 + γ^{1/2}, its corresponding sample eigenvalue is separable from the bulk. Interestingly, the result in (1) for λ_v > 1 + γ^{1/2} is equivalent to the asymptotic result in Theorem 1 when λ_v is relatively large. To see this, note that by replacing λ_v in (1) with c_vγ, we obtain

d_{v} λ_{v}^{- 1} \approx σ^{2} + (σ^{2} γ) {(λ_{v} - 1)}^{- 1} \approx σ^{2} c_{v}^{- 1} (c_{v} + 1),

which accords with Theorem 1. For the predicted principal component scores, the same holds true. By (2),

\{E ({\hat{q}}_{v}^{2}) / E ({\hat{p}}_{v j}^{2})\}^{1 / 2} \approx (λ_{v} - 1) {(λ_{v} + γ - 1)}^{- 1} \approx c_{v} {(c_{v} + 1)}^{- 1},

which is consistent with Theorem 2. Similar conclusions can be drawn for the sample eigenvectors and principal component scores and are omitted here. In summary, for ultra-high dimensional data, both the finite γ and ultra-high dimensional asymptotic results can be used to investigate the behavior of sample eigenvalues, eigenvectors and principal component scores, and both produce similar conclusions.
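The agreement between the two sets of formulas can be checked numerically. The sketch below evaluates the finite γ expressions (1) and (2) alongside the Theorem 1 approximation d_v ≈ σ²(λ_v + γ). The function names and the choice γ = 500 are illustrative assumptions.

```python
import numpy as np

def finite_gamma_eigenvalue(lam, gamma, sigma2=1.0):
    """Asymptotic value of d_v under the finite gamma regime, equation (1)."""
    lam = np.asarray(lam, dtype=float)
    edge = 1.0 + np.sqrt(gamma)                      # separation threshold
    spiked = lam > edge
    safe_lam = np.where(spiked, lam, edge + 1.0)     # placeholder avoids dividing by zero
    above = sigma2 * safe_lam * (1.0 + gamma / (safe_lam - 1.0))
    return np.where(spiked, above, sigma2 * edge**2)

def finite_gamma_shrinkage(lam, gamma):
    """Shrinkage factor from equation (2), for lam above the threshold."""
    return (lam - 1.0) / (lam + gamma - 1.0)

def ultra_high_eigenvalue(lam, gamma, sigma2=1.0):
    """Theorem 1 approximation d_v ~ sigma^2 (lam + gamma), for c_v bounded away from zero."""
    return sigma2 * (np.asarray(lam, dtype=float) + gamma)

gamma = 500.0
lam = np.array([1.0, gamma, np.sqrt(gamma) * gamma])     # no spike, c = 1, c = gamma^{1/2}
finite_rescaled = finite_gamma_eigenvalue(lam, gamma) / gamma
ultra_rescaled = ultra_high_eigenvalue(lam[1:], gamma) / gamma
shrink_c1 = finite_gamma_shrinkage(gamma, gamma)         # compare with c / (c + 1) = 0.5
```

For the two separated spikes the rescaled values from the two regimes agree to two decimal places, while for λ_v = 1 the finite γ formula gives (1 + γ^{1/2})²/γ, which exceeds the ultra-high dimensional limit of one by a term of order γ^{−1/2}.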

Under the high-dimension low sample size regime, the spiked population eigenvalues are assumed to grow at rate p^α (α > 0). Jung et al. (2012) proved the following results with an additional Gaussian assumption on X. For the first sample eigenvalue,

\frac{d_{1}}{\max (p, p^{α})} \to \begin{cases} σ^{2} {\hat{c}}_{1} χ_{n}^{2} / n, & α > 1, \\ σ^{2} {\hat{c}}_{1} χ_{n}^{2} / n + σ^{2} / n, & α = 1, \\ σ^{2} / n, & α < 1, \end{cases}

(3)

in distribution, where ĉ₁ = lim_{p→∞} λ₁/p^α, and $χ_{n}^{2}$ denotes the chi-square distribution with n degrees of freedom. For the first sample eigenvector,

| \langle e_{1}, u_{1} \rangle | \to \begin{cases} 1, & α > 1, \\ \{{\hat{c}}_{1} χ_{n}^{2} / ({\hat{c}}_{1} χ_{n}^{2} + 1)\}^{1 / 2}, & α = 1, \\ 0, & α < 1, \end{cases}

(4)

in distribution. Results with more relaxed assumptions can be found in Jung et al. (2012). In the high-dimension low sample size regime, the asymptotic behavior of sample eigenvalues and eigenvectors depends on the relative growth rate of λ_v over p. In Corollary 1, λ_v is expressed as a function of γ_p, instead of p directly. However, it should be noted that $γ_{p}^{α}$ is equivalent to p^α in the high-dimension low sample size regime where n is treated as fixed. Equation (3) shows that when α > 1, the scaled sample eigenvalue converges in distribution to the random variable $σ^{2} {\hat{c}}_{1} χ_{n}^{2} / n$ as p → ∞. Combining this with the fact that $χ_{n}^{2} / n \to 1$ in probability as n → ∞, we arrive at the same conclusion as in Corollary 1. When α = 1, d₁ − σ²λ₁ ≍ σ²p/n, and thus the sample eigenvalue is biased with a bias σ²p/n. Equation (4) indicates that the first sample eigenvector is consistent when α > 1 and is asymptotically perpendicular to the first population eigenvector when α < 1. When α = 1, the sample eigenvector is neither consistent nor asymptotically perpendicular to the first population eigenvector. In conclusion, our asymptotic results clearly parallel the asymptotic results for the high-dimension low sample size regime, and embed them within a larger framework.
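The step from the random limit in (3) to the deterministic limit in Corollary 1 rests on the concentration of $χ_{n}^{2} / n$ around one. A short justification via Chebyshev's inequality is sketched below; it treats the limits p → ∞ and n → ∞ sequentially, which is an informal simplification rather than the joint limit used in the theorems.

```latex
E\left( \frac{\chi_n^2}{n} \right) = 1, \qquad
\operatorname{var}\left( \frac{\chi_n^2}{n} \right) = \frac{2}{n}, \qquad
\operatorname{pr}\left( \left| \frac{\chi_n^2}{n} - 1 \right| > \epsilon \right)
  \le \frac{2}{n \epsilon^2} \to 0 \quad (n \to \infty).
```

Hence, for α > 1, the limit $σ^{2} {\hat{c}}_{1} χ_{n}^{2} / n$ in (3) approaches σ²ĉ₁ as n grows, which has the same form as the limit σ²c̃_v in Corollary 1.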

3. Numerical Study

We conducted simulations to illustrate our theoretical results. A p × n data matrix X was generated from N(0, Λ) with Λ = diag(λ₁, …, λ_p). We set λ₁ = c₁γ_p, λ₂ = c₂γ_p and λ₃ = … = λ_p = 1. The first and second population eigenvectors were e₁ = (1, 0, …, 0) and e₂ = (0, 1, 0, …, 0), respectively. Four different sets of c_vs were selected to represent different scenarios: no spiked eigenvalues, $c_{1} = c_{2} = γ_{p}^{- 1}$ ; very small spiked eigenvalues, $c_{1} = γ_{p}^{- 1 / 2}, c_{2} = 0.7 γ_{p}^{- 1 / 2}$ ; moderate spiked eigenvalues, c₁ = 1, c₂ = 0.7; and very large spiked eigenvalues, $c_{1} = γ_{p}^{1 / 2}, c_{2} = 0.7 γ_{p}^{1 / 2}$ . The first two scenarios correspond to the case that c_v = o(1), and the last two to the case that c_v is bounded away from zero. Two different γ_p values, 500 and 2000, were considered, and the sample size was fixed at 100. For each of the simulation setups, we generated 500 datasets and computed the sample eigenvalues and the inner products between the sample and population eigenvectors.
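The simulation design can be sketched in a few lines of NumPy. The version below is a scaled-down illustration only: it uses γ_p = 50 and 20 replicates rather than the values reported here, considers only the moderate-spike scenario, and the function name `simulate_once` and the random seed are assumptions.

```python
import numpy as np

def simulate_once(n, gamma, c, rng):
    """One Gaussian dataset with spikes lambda_v = c_v * gamma and unit non-spiked eigenvalues.

    Returns the spiked sample eigenvalues rescaled by gamma and |<e_v, u_v>|,
    where e_v is the v-th coordinate vector.
    """
    c = np.asarray(c, dtype=float)
    m = c.size
    p = int(round(gamma * n))
    lam = np.ones(p)
    lam[:m] = c * gamma
    X = np.sqrt(lam)[:, None] * rng.standard_normal((p, n))   # columns are N(0, Lambda)
    evals, W = np.linalg.eigh(X.T @ X / n)     # n x n problem, ascending eigenvalues
    idx = np.argsort(evals)[::-1][:m]
    d = evals[idx]
    # u_v = X w_v / (n d_v)^{1/2}; only its first m coordinates are needed here.
    U_top = X[:m] @ W[:, idx] / np.sqrt(n * d)
    inner = np.abs(np.diag(U_top))             # |<e_v, u_v>| is the v-th coordinate of u_v
    return d / gamma, inner

rng = np.random.default_rng(2014)
reps = [simulate_once(n=100, gamma=50, c=[1.0, 0.7], rng=rng) for _ in range(20)]
d_med = np.median([r[0] for r in reps], axis=0)
inner_med = np.median([r[1] for r in reps], axis=0)
```

With c₁ = 1, Theorem 1 suggests a rescaled first eigenvalue near 2 and an inner product near 0.71, so the medians from this sketch should fall in that neighborhood, up to finite-sample variation.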

Table 1 reports the medians and inter-quartile ranges of the estimates. The theoretical asymptotic values of the sample eigenvalues and the inner products from the finite γ and the ultra-high dimensional regimes are also presented. The sample eigenvalues were rescaled by γ_p. For data with no spiked or with very small spiked eigenvalues, the first and second sample eigenvalues are slightly upward-biased from unity. However, they match well with the theoretical ones from the finite γ regime. For data with moderate or large spiked eigenvalues, the theoretical estimates from the finite γ regime and the ultra-high dimensional regime are identical, and are well matched with the sample eigenvalues. For the inner products of the sample eigenvectors, the empirical estimates match well with the ones from the finite γ and ultra-high dimensional regimes, and the two sets of the theoretical results are identical.

Table 1.

Rescaled sample eigenvalues and eigenvectors based on 500 simulations. Ultra-High and Finite γ present theoretical asymptotic values from the ultra-high dimensional and the finite γ regimes, respectively. Observed presents the median of the estimates, with the interquartile range in parentheses.

Quantity	Principal component	γ_p	Type	No Spike	Very Small Spike	Moderate Spike	Very Large Spike
Eigenvalues	1	500	Ultra-High	1.00	1.00	2.00	23.36
			Finite γ	1.09	1.09	2.00	23.36
			Observed	1.09(0.004)	1.09(0.004)	2.02(0.183)	23.6(3.868)
		2000	Ultra-High	1.00	1.00	2.00	45.72
			Finite γ	1.05	1.05	2.00	45.72
			Observed	1.04(0.002)	1.05(0.002)	2.02(0.207)	46.75(7.605)
	2	500	Ultra-High	1.00	1.00	1.70	16.65
			Finite γ	1.09	1.09	1.70	16.65
			Observed	1.08(0.003)	1.09(0.003)	1.67(0.125)	15.98(2.680)
		2000	Ultra-High	1.00	1.00	1.70	32.31
			Finite γ	1.05	1.05	1.70	32.31
			Observed	1.04(0.001)	1.04(0.002)	1.68(0.131)	30.96(5.370)
Eigenvectors	1	500	Ultra-High	0.00	0.00	0.71	0.98
			Finite γ	0.00	0.00	0.71	0.98
			Observed	0.00(0.004)	0.07(0.069)	0.69(0.056)	0.96(0.061)
		2000	Ultra-High	0.00	0.00	0.71	0.99
			Finite γ	0.00	0.00	0.71	0.99
			Observed	0.00(0.002)	0.05(0.048)	0.69(0.056)	0.97(0.054)
	2	500	Ultra-High	0.00	0.00	0.64	0.97
			Finite γ	0.00	0.00	0.64	0.97
			Observed	0.00(0.003)	0.02(0.030)	0.607(0.055)	0.95(0.059)
		2000	Ultra-High	0.00	0.00	0.64	0.98
			Finite γ	0.00	0.00	0.64	0.98
			Observed	0.00(0.002)	0.02(0.022)	0.61(0.052)	0.96(0.053)

Table 2 summarizes the results for the sample and predicted principal component scores. For the sample principal component scores, the median and inter-quartile ranges of their Pearson correlations with the population principal component scores were calculated. The theoretical results from both the finite γ and ultra-high dimensional regimes are identical and both match well with the empirical estimates. For the predicted principal component scores, we followed exactly the same simulation procedure as described above to generate a new dataset for each of the simulated dataset. We computed the predicted principal component scores on each new dataset. The empirical shrinkage factor was calculated as the ratio of the means of the squared predicted and sample principal component scores. Again, similar conclusions hold. The theoretical results from both the finite γ and ultra-high dimensional regimes are effectively identical. The empirical estimates approximate the theoretical results very well.
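The score comparison can be sketched as follows for a single spike. This is an illustration under stated assumptions: the dimensions are smaller than those used in the study, the shrinkage factor is reported on the square-root scale of Theorem 2, and the name `score_summary` and the seed are hypothetical.

```python
import numpy as np

def score_summary(n, gamma, c, rng):
    """Score correlation and empirical shrinkage for one spike lambda_1 = c * gamma."""
    p = int(round(gamma * n))
    lam = np.ones(p)
    lam[0] = c * gamma
    scale = np.sqrt(lam)[:, None]
    X = scale * rng.standard_normal((p, n))        # original dataset
    X_new = scale * rng.standard_normal((p, n))    # independent dataset from the same model
    evals, W = np.linalg.eigh(X.T @ X / n)         # ascending eigenvalues
    d1, w1 = evals[-1], W[:, -1]
    u1 = X @ w1 / np.sqrt(n * d1)                  # first sample eigenvector, unit norm
    p_pop = X[0]                                   # X^T e_1 with e_1 = (1, 0, ..., 0)
    p_hat = X.T @ u1                               # sample scores
    q_hat = X_new.T @ u1                           # predicted scores on the new data
    corr = abs(np.corrcoef(p_pop, p_hat)[0, 1])    # sign of u1 is arbitrary
    shrink = np.sqrt(np.mean(q_hat**2) / np.mean(p_hat**2))
    return corr, shrink

rng = np.random.default_rng(7)
corr, shrink = score_summary(n=200, gamma=20, c=1.0, rng=rng)
```

For c₁ = 1, Theorem 2 suggests a correlation close to one and a shrinkage factor close to c₁/(c₁ + 1) = 0.5; the absolute value of the correlation is taken because the sign of a sample eigenvector is not identified.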

Table 2.

Sample and predicted principal component scores based on 500 simulations. Ultra-High and Finite γ present theoretical asymptotic values from the ultra-high dimensional and the finite γ regimes, respectively. Observed presents the median of the estimates, with the interquartile range in parentheses.

Quantity	γ_p	Type	PC 1: Moderate Spike	PC 1: Very Large Spike	PC 2: Moderate Spike	PC 2: Very Large Spike
Principal Component Scores	500	Ultra-High	1.00	1.00	1.00	1.00
		Finite γ	1.00	1.00	1.00	1.00
		Observed	0.99(0.042)	0.99(0.043)	0.97(0.090)	0.97(0.087)
	2000	Ultra-High	1.00	1.00	1.00	1.00
		Finite γ	1.00	1.00	1.00	1.00
		Observed	0.99(0.039)	0.99(0.038)	0.97(0.078)	0.97(0.079)
Shrinkage Factors	500	Ultra-High	0.50	0.96	0.41	0.94
		Finite γ	0.50	0.96	0.41	0.94
		Observed	0.49(0.045)	0.93(0.120)	0.42(0.046)	0.97(0.122)
	2000	Ultra-High	0.50	0.98	0.41	0.97
		Finite γ	0.50	0.98	0.41	0.97
		Observed	0.49(0.045)	0.95(0.127)	0.42(0.041)	1.00(0.118)

Supplementary Material

NIHMS594795-supplement-Supplementary_Materials.pdf (181.6 KB, pdf)

Acknowledgments

This research was supported by the National Institutes of Health, U.S.A. We are grateful to the editor, associate editor and a referee for their help in improving our presentation.

Footnotes

Supplementary material available at Biometrika online includes detailed description of sphericity conditions, proofs and two examples.

Contributor Information

Seunggeun Lee, Email: leeshawn@umich.edu, Department of Biostatistics, University of Michigan, 1415 Washington Heights, Ann Arbor, Michigan 48109, U.S.A.

Fei Zou, Email: fzou@bios.unc.edu, Department of Biostatistics, University of North Carolina, 135 Dauer Drive, Chapel Hill, North Carolina 27599, U.S.A.

Fred A. Wright, Email: fred_wright@ncsu.edu, Bioinformatics Research Center, North Carolina State University, 1 Lampe Drive, Raleigh, North Carolina 27607, U.S.A.

References

Ahn J, Marron JS, Muller KM, Chi YY. The high-dimension, low-sample-size geometric representation holds under mild conditions. Biometrika. 2007;94:760–766.
Baik J, Silverstein JW. Eigenvalues of large sample covariance matrices of spiked population models. J Multivariate Anal. 2006;97:1382–1408.
Hall P, Marron JS, Neeman A. Geometric representation of high dimension, low sample size data. J R Statist Soc B. 2005;67:427–444.
John S. Some optimal multivariate tests. Biometrika. 1971;58:123–127.
John S. The distribution of a statistic used for testing sphericity of normal distributions. Biometrika. 1972;59:169–173.
Johnstone IM. On the distribution of the largest eigenvalue in principal components analysis. Ann Statist. 2001;29:295–327.
Jolliffe I. Principal Component Analysis. Springer; New York: 2002.
Jung S, Marron JS. PCA consistency in high dimension, low sample size context. Ann Statist. 2009;37:4104–4130.
Jung S, Sen A, Marron JS. Boundary behavior in high dimension, low sample size asymptotics of PCA. J Multivariate Anal. 2012;109:190–203.
Lee S, Zou F, Wright F. Convergence and prediction of principal component scores in high-dimensional settings. Ann Statist. 2010;38:3605–3629. doi: 10.1214/10-AOS821.
Marčenko V, Pastur L. Distribution of eigenvalues for some sets of random matrices. Sbornik: Mathematics. 1967;1:457–483.
Nadler B. Finite sample approximation results for principal component analysis: a matrix perturbation approach. Ann Statist. 2008;36:2791–2817.
Patterson N, Price A, Reich D. Population structure and eigenanalysis. PLoS Genet. 2006;2:e190. doi: 10.1371/journal.pgen.0020190.
Paul D. Asymptotics of sample eigenstructure for a large dimensional spiked covariance model. Statist Sinica. 2007;17:1617–1642.
Price A, Patterson N, Plenge R, Weinblatt M, Shadick N, Reich D. Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet. 2006;38:904–909. doi: 10.1038/ng1847.
