Author manuscript; available in PMC: 2020 Sep 1.
Published in final edited form as: J Multivar Anal. 2019 Feb 19;173:145–164. doi: 10.1016/j.jmva.2019.02.007

Asymptotic properties of principal component analysis and shrinkage-bias adjustment under the generalized spiked population model

Rounak Dey a, Seunggeun Lee a,*
PMCID: PMC7441582  NIHMSID: NIHMS1523560  PMID: 32831421

Abstract

With the development of high-throughput technologies, principal component analysis (PCA) in the high-dimensional regime is of great interest. Most of the existing theoretical and methodological results for high-dimensional PCA are based on the spiked population model in which all the population eigenvalues are equal except for a few large ones. Due to the presence of local correlation among features, however, this assumption may not be satisfied in many real-world datasets. To address this issue, we investigate the asymptotic behavior of PCA under the generalized spiked population model. Based on our theoretical results, we propose a series of methods for the consistent estimation of population eigenvalues, angles between the sample and population eigenvectors, correlation coefficients between the sample and population principal component (PC) scores, and the shrinkage bias adjustment for the predicted PC scores. Using numerical experiments and real data examples from the genetics literature, we show that our methods can greatly reduce bias and improve prediction accuracy.

Keywords: Consistent estimation, High-dimensional data, PC scores, Random matrix

1. Introduction

Principal component analysis (PCA) is a very popular tool for analyzing high-dimensional biomedical data, where the number p of features is often substantially larger than the number n of observations. PCA is widely used to adjust for population stratification in genome-wide association studies [21] and to identify overall expression patterns in transcriptome analysis [23]. However, the asymptotic properties of PCA in high-dimensional data are profoundly different from the properties in low-dimensional (p finite, n → ∞) settings. In high-dimensional settings, the sample eigenvalues and eigenvectors are not consistent estimators of the population eigenvalues and eigenvectors [12, 20], and the predicted principal component (PC) scores based on the sample eigenvectors can be systematically biased toward zero [14].

There has been extensive effort to investigate the asymptotic behavior of PCA in high-dimensional settings. To provide a statistical framework for PCA in these settings, Johnstone [11] introduced the spiked population model, which assumes that all the eigenvalues are equal except for finitely many large ones called the spikes. A spiked population covariance matrix is a finite-rank perturbation of a scalar multiple of the identity matrix. A typical example of a spiked population with two spikes is shown in Figure 1a. This two-spike eigenvalue structure arises if the population consists of three sub-populations and the features are largely independent with equal variances. Under this model, the convergence of sample eigenvalues, eigenvectors and PC scores has been extensively studied [4, 11, 14, 20].

Figure 1:

Eigenvalue structures for SP and GSP models.

In many biomedical data, however, the assumption of the equality of non-spiked eigenvalues can be violated due to the presence of local correlation among features. In genome-wide association studies, for example, the genetic variants are locally correlated due to linkage disequilibrium. In gene-expression data, since genes in the same pathway are often expressed together, their expression measurements are often correlated. These local correlations can cause substantial differences in non-spiked eigenvalues. To illustrate this phenomenon, we obtained eigenvalues with an autoregressive within-group correlation structure rather than the independent structure of the previous example. Figure 1b shows that the equality assumption is clearly violated. Thus, if methods developed under the equality assumption are applied to these types of data, we will obtain biased results.

The generalized spiked population model [3] has been proposed to address this problem. The condition that the non-spikes have to be equal is removed in this generalization. In this model, the set of population eigenvalues consists of finitely many large eigenvalues called the generalized spikes, which are well separated from infinitely many small eigenvalues. Although the generalized spiked population model has a great potential to provide more accurate inference in high-dimensional biomedical data, only limited literature is available on the asymptotic properties of PCA under this model and their application to real data. Bai and Yao [3] and Ding [8] provided results regarding convergence of eigenvalues and eigenvectors. However, their work remained largely theoretical. Moreover, to the best of our knowledge, no method has been developed for estimating the correlations between the sample and population PC scores, and adjusting biases in the predicted PC scores under the generalized spiked population model.

In this paper, we systematically investigate the asymptotic behavior of PCA under the generalized spiked population model, and develop methods to estimate the population eigenvalues and adjust for the bias in the predicted PC scores. We first propose two different approaches to consistently estimate the population eigenvalues, the angles between the sample and population eigenvectors, and the correlation coefficients between the sample and population PC scores. We compare these two methods and show the asymptotic equivalence of the estimators across them. Finally, we propose a method to reduce the bias in the predicted PC scores based on the estimated population eigenvalues.

The paper is organized as follows. We begin in Section 2 by defining the generalized spiked population model and presenting existing theoretical results. We develop our methods to consistently estimate the population spikes in Section 3. In Section 4, we construct consistent estimators of the angles between the sample and population eigenvectors, and of the correlation coefficients between the sample and population PC scores. We also propose the bias-reduction technique for the predicted PC scores. Section 5 presents the algorithm [9] to estimate the population limiting spectral distribution and the non-spiked eigenvalues. In Section 6, we present results from simulation studies and an example from the HapMap project to demonstrate the improved performance of our method over the existing one. Finally, we conclude the paper with a discussion.

2. Generalized spiked population model

In order to formally define the generalized spiked population model, we require the concept of a spectral distribution. In the random matrix literature, it is natural to associate a probability measure to the set of eigenvalues as the dimension $p$ goes to infinity. More explicitly, if a Hermitian matrix $\Sigma_p$ has eigenvalues $\lambda_1, \ldots, \lambda_p$, we can define the empirical spectral distribution (ESD) of $\Sigma_p$ to be $H_p$ based on the probability measure

$$dH_p(x) = \frac{1}{p}\sum_{i=1}^{p} \delta_{\lambda_i}(x),$$

where $\delta_{\lambda_i}(x)$ is unity when $x = \lambda_i$, and zero otherwise. Now, for a sequence $(\Sigma_p)$ of covariance matrices, if the corresponding sequence $(H_p)$ of ESDs converges weakly to a non-random probability distribution $H$ as $p \to \infty$, then we define $H$ as the limiting spectral distribution (LSD) of the sequence $(\Sigma_p)$.

The generalized spiked population model [3] is defined as follows. Suppose that $H_p$ is the ESD corresponding to the population covariance matrix $\Sigma_p$ and that it converges weakly to a non-random probability distribution $H$. Let $\Gamma_H$ be the support of $H$ and $d(x, A) = \inf_{y \in A} |x - y|$ be the distance from a point $x$ to a set $A$. Then the set of eigenvalues of $\Sigma_p$ comprises two subsets of eigenvalues, $\alpha_1 \ge \cdots \ge \alpha_m$ and $\beta_{p,1} \ge \cdots \ge \beta_{p,p-m}$, as follows:

  1. there exists $\delta > 0$ such that $d(\alpha_i, \Gamma_H) > \delta$ for all $i \in \{1, \ldots, m\}$; $\alpha_1, \ldots, \alpha_m$ are called the generalized spikes.

  2. $\max_{1 \le i \le p-m} d(\beta_{p,i}, \Gamma_H) = \epsilon_p \to 0$; $\beta_{p,1}, \ldots, \beta_{p,p-m}$ are called the non-spikes.

It is obvious from the definition that the generalized spikes are measure zero points of the population LSD. For Johnstone’s spiked population model [11], the population LSD is H = δ{1} indicating ΓH = {1}. From the definition above, all eigenvalues larger than 1 are spikes. Hence, Johnstone’s spiked population model is a special case of the generalized spiked population model.

Suppose that the population covariance matrix $\Sigma_p$ has eigenvalues $\lambda_1 \ge \cdots \ge \lambda_p$, and the sample covariance matrix $S_p = X^\top X/n$ has eigenvalues $d_1 \ge \cdots \ge d_p$, where $X$ is an $n \times p$ data matrix. Let $\Sigma_p = E\Lambda_p E^\top$ and $S_p = U D_p U^\top$ be the spectral decompositions of $\Sigma_p$ and $S_p$, respectively. Further, we will assume the following throughout the paper.

  (A) $p \to \infty$, $n \to \infty$, $p/n \to \gamma < \infty$.

  (B) $X = Y\Lambda_p^{1/2}E^\top$, where $Y$ is an $n \times p$ random matrix with iid elements such that $E(Y_{ij}) = 0$, $E(|Y_{ij}|^2) = 1$, $E(|Y_{ij}|^4) < \infty$.

  (C) The population eigenvalues follow the generalized spiked population (GSP) model with $m$ generalized spikes. Let $H_p$ be the population ESD, $H$ be the population LSD, and $\Gamma_H$ be the support of $H$. Moreover, the sequence $(\|\Sigma_p\|)$ of spectral norms is bounded.

  (D) The generalized spikes are larger than $\sup \Gamma_H$.

Even though we develop our estimation methods under the asymptotic regime $p/n \to \gamma < \infty$, we discuss the applicability of our methods when $p$ is much larger than $n$ in Section 4.6.

In this paper, we will derive and discuss asymptotic properties of different functions of eigenvalues and angles between the sample and population eigenvectors, both of which are rotation-invariant, i.e., if we rotate $X$ by some $p \times p$ orthogonal matrix $P$, the eigenvalues (both sample and population) and angles between the sample and population eigenvectors will remain the same. The new population covariance matrix $\tilde\Sigma_p = P^\top \Sigma_p P$ will have the same eigenvalues as $\Sigma_p$, and the new sample covariance matrix $\tilde S_p = P^\top S_p P$ will also have the same eigenvalues as $S_p$. The angles between the eigenvectors of $\tilde S_p$ and $\tilde\Sigma_p$ will be the same as the angles between the eigenvectors of $S_p$ and $\Sigma_p$, both given by the elements of the matrix $E^\top U$. Therefore, without loss of generality, by using $P = E$, we can assume the population covariance matrix to be diagonal.

From the Marčenko-Pastur theorem [16], the sample ESD Fp converges weakly to a non-random probability distribution F with support ΓF. For α ∉ ΓH, α ≠ 0 and x > 0, we define the following two functions:

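$$\psi(\alpha) = \alpha + \gamma\alpha \int \frac{\lambda\, dH(\lambda)}{\alpha - \lambda}, \qquad f_F(x) = x \Big/ \left\{1 + \gamma \int \frac{\tau\, dF(\tau)}{x - \tau}\right\}. \tag{1}$$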

The following result by Bai and Yao [3] provides the almost sure limits of the sample eigenvalues corresponding to the population generalized spikes.

Result 1. Suppose assumptions (A)-(D) hold. Let λk be a generalized spike of multiplicity 1 and the corresponding sample eigenvalue is dk. Moreover, let ψ′ denote the first derivative of the function ψ. Then,

  1. If $\psi'(\lambda_k) > 0$, then the sample eigenvalue $d_k$ converges almost surely to $\psi(\lambda_k)$, i.e., $|d_k - \psi(\lambda_k)| \xrightarrow{a.s.} 0$.

  2. If $\psi'(\lambda_k) \le 0$, then let $(u_k, v_k) \subset (\sup \Gamma_H, \infty)$ be the maximal interval on which $\psi' > 0$. The sample eigenvalue $d_k$ converges almost surely to $\psi(w)$, where $w$ is the boundary point of $[u_k, v_k]$ that is nearest to $\lambda_k$.

Given that $\psi'(\alpha)$ is a strictly increasing function for $\alpha > \sup \Gamma_H$, if a generalized spike $\lambda_k$ is large enough that $\psi'(\lambda_k) > 0$, then according to Result 1 the corresponding sample eigenvalue will converge almost surely to $\psi(\lambda_k)$. However, if the generalized spike lies close enough to the set of non-spikes that $\psi'(\lambda_k) \le 0$, then the convergence of the corresponding sample eigenvalue is given by the second part of the result. We will call a generalized spike $\lambda_k$ a “distant spike” if $\psi'(\lambda_k) > 0$; otherwise we will call it a “close spike”.
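As an illustration of the distant/close distinction, the following minimal Python sketch evaluates $\psi$ and $\psi'$ when the population LSD $H$ is replaced by a discrete distribution. The function names and the representation of $H$ as lists of support points and weights are assumptions made for this example only; the derivative formula follows by differentiating (1) under the integral sign.

```python
def psi(alpha, support, weights, gamma):
    """psi(alpha) = alpha + gamma * alpha * sum_j w_j * t_j / (alpha - t_j)."""
    integral = sum(w * t / (alpha - t) for t, w in zip(support, weights))
    return alpha + gamma * alpha * integral


def psi_prime(alpha, support, weights, gamma):
    """psi'(alpha) = 1 - gamma * sum_j w_j * (t_j / (alpha - t_j))**2."""
    return 1.0 - gamma * sum(w * (t / (alpha - t)) ** 2
                             for t, w in zip(support, weights))


def spike_type(alpha, support, weights, gamma):
    """Return 'distant' if psi'(alpha) > 0, otherwise 'close'."""
    if psi_prime(alpha, support, weights, gamma) > 0:
        return "distant"
    return "close"


# Hypothetical inputs: H is a point mass at 1 (Johnstone's model), gamma = 1.
support, weights, gamma = [1.0], [1.0], 1.0
print(spike_type(3.0, support, weights, gamma))   # spike well above the bulk
print(spike_type(1.5, support, weights, gamma))   # spike near the bulk
```

With these inputs, $\psi(3) = 4.5$ and $\psi'(3) = 0.75$, so a spike at 3 is distant, whereas $\psi'(1.5) = -3$, so a spike at 1.5 is close.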

3. Consistent estimation of the generalized spikes

The following theorem provides two different consistent estimators of the distant spikes.

Theorem 1. Let λk be a distant spike of multiplicity 1 and the corresponding sample eigenvalue is dk. If the assumptions (A)-(D) hold, then

$$|\psi^{-1}(d_k) - \lambda_k| \xrightarrow{p} 0,$$

where $\psi^{-1}$ is the left inverse of $\psi$. Also, $|f_F(d_k) - \lambda_k| \xrightarrow{p} 0$.

This theorem shows that for any distant spike $\lambda_k$ we have two consistent estimators, $\psi^{-1}(d_k)$ and $f_F(d_k)$. Notice that the function $f_F$ depends only on the sample LSD, which can be approximated by the sample ESD. Thus, $f_F(d_k)$ can be approximated directly using the sample eigenvalues. More explicitly, $f_F(d_k)$ can be closely approximated as

$$f_F(d_k) \approx d_k \Big/ \left\{1 + \frac{\gamma}{p-m}\sum_{i=m+1}^{p} \frac{d_i}{d_k - d_i}\right\}.$$

In contrast, the ψ function depends on the population LSD which is unknown. We can estimate the ψ function using the algorithm described in Section 5 and then find the inverse function ψ−1 using a Newton-Raphson type algorithm.
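The sample-eigenvalue approximation of $f_F(d_k)$ can be sketched in a few lines of Python. This is a minimal sketch, assuming that `sample_eigs` holds all $p$ sample eigenvalues in decreasing order (including zeros when $p > n$) and that `gamma` is the plug-in value $p/n$; the function name is hypothetical.

```python
def f_F(d_k, sample_eigs, m, gamma):
    """Approximate f_F(d_k) = d_k / {1 + gamma/(p-m) * sum_{i>m} d_i/(d_k - d_i)}.

    sample_eigs: d_1 >= ... >= d_p (assumed sorted in decreasing order)
    m:           number of generalized spikes, excluded from the sum
    gamma:       plug-in value of p/n
    """
    bulk = sample_eigs[m:]                       # d_{m+1}, ..., d_p
    mean_term = sum(d / (d_k - d) for d in bulk) / len(bulk)
    return d_k / (1.0 + gamma * mean_term)
```

Zero eigenvalues contribute nothing to the sum, but they still count in the divisor $p - m$.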

4. Consistent estimators of the asymptotic shrinkage in predicting the PC scores

In this section, we investigate the convergence of sample eigenvectors, PC scores, and shrinkage factors in predicting the PC scores. Let ei and Ei be the ith sample and population eigenvectors, respectively. In addition to assumptions (A)-(D), we further assume that the distant spikes are of multiplicity 1. This assumption is to restrict the dimension of the corresponding eigenspaces to 1, as otherwise the angle between sample and population eigenvectors, or shrinkage in predicted PC scores cannot be well defined.

4.1. Angle between sample and population eigenvectors

We first present the following theorem on the convergence of the quadratic forms of the sample eigenvectors.

Theorem 2. Let $\lambda_k$ be a distant spike of multiplicity 1, and suppose that the assumptions (A)-(D) hold. Consider the quadratic form $\hat\eta_k = s_1^\top e_k e_k^\top s_2$, where $s_1$ and $s_2$ are non-random vectors with uniformly bounded norm for all $p$. Then

$$|\hat\eta_k - \eta_k| \xrightarrow{a.s.} 0,$$

where $\eta_k = \lambda_k \psi'(\lambda_k)\, s_1^\top E_k E_k^\top s_2 / \psi(\lambda_k)$.

Mestre [19] showed similar asymptotic properties of the quadratic forms under the assumption that the number of spikes increases with the dimension. Theorem 2 shows the convergence of the angle between sample and population eigenvectors. Suppose $s_1 = s_2 = E_k$. Then

$$\hat\eta_k = E_k^\top e_k e_k^\top E_k = \langle e_k, E_k\rangle^2, \qquad \eta_k = \lambda_k \psi'(\lambda_k)/\psi(\lambda_k).$$

Combining them, we can show

$$\left|\langle e_k, E_k\rangle^2 - \lambda_k \psi'(\lambda_k)/\psi(\lambda_k)\right| \xrightarrow{a.s.} 0.$$

Therefore, $\{\lambda_k \psi'(\lambda_k)/\psi(\lambda_k)\}^{1/2}$ is a consistent estimator of the cosine of the angle, i.e., the absolute value of the inner product, between the $k$th sample and population eigenvectors. In order to obtain this estimator, we first need to estimate the $\psi$ function using the algorithm described in Section 5.

The following result by Ding [8] provides another consistent estimator for the angle between the kth sample and population eigenvectors. The proof of the asymptotic equivalence of these two estimators is given in the Appendix.

Result 2. Let λk be a distant spike of multiplicity 1, and dk be the corresponding sample eigenvalue. Assume that (A)-(D) hold. Define,

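$$g_F(x) = \left\{1 + \gamma f_F(x) \int \frac{\tau}{(x - \tau)^2}\, dF(\tau)\right\}^{-1}.$$

Then $|\langle e_k, E_k\rangle^2 - g_F(d_k)| \xrightarrow{p} 0$.

Hence $g_F(d_k)^{1/2}$ also works as a consistent estimator of $|\langle e_k, E_k\rangle|$. Since the function $g_F$ depends only on the sample LSD, it can be approximated directly using sample eigenvalues. More explicitly, if there are $m$ spikes in the population, the function $g_F$ can be closely approximated as

$$g_F(d_k) \approx \left\{1 + \frac{\gamma f_F(d_k)}{p-m}\sum_{i=m+1}^{p} \frac{d_i}{(d_k - d_i)^2}\right\}^{-1}.$$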

The above equation can be used to estimate the angle between the sample and population eigenvectors.
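A minimal Python sketch of this angle estimator is given below. It assumes the same conventions as the approximation formulas above: `sample_eigs` is sorted in decreasing order and has length $p$, the top `m` values are treated as spikes, and `gamma` is the plug-in value $p/n$. The function names are hypothetical.

```python
import math


def f_F(d_k, sample_eigs, m, gamma):
    """d_k / {1 + gamma/(p-m) * sum_{i>m} d_i / (d_k - d_i)}."""
    bulk = sample_eigs[m:]
    return d_k / (1.0 + gamma * sum(d / (d_k - d) for d in bulk) / len(bulk))


def g_F(d_k, sample_eigs, m, gamma):
    """{1 + gamma * f_F(d_k)/(p-m) * sum_{i>m} d_i / (d_k - d_i)^2}^(-1)."""
    bulk = sample_eigs[m:]
    mean_term = sum(d / (d_k - d) ** 2 for d in bulk) / len(bulk)
    return 1.0 / (1.0 + gamma * f_F(d_k, sample_eigs, m, gamma) * mean_term)


def cos_angle(d_k, sample_eigs, m, gamma):
    """Estimated |<e_k, E_k>|, the cosine of the angle between eigenvectors."""
    return math.sqrt(g_F(d_k, sample_eigs, m, gamma))
```

The estimated cosine always lies in (0, 1] for $d_k$ above the bulk, since every term in the sum is non-negative.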

4.2. Correlation between sample and population PC scores

The sample and population PC scores are the projections of the data on the sample and population eigenvectors, respectively. The correlation between them can be perceived as a measure of accuracy of the PCA. The squared correlation can also be interpreted as the proportion of variance in the population PC scores that can be explained by corresponding sample PC scores. The following theorem provides the consistent estimators of the correlation between the sample and population PC scores corresponding to a distant spike.

Theorem 3. Suppose $\lambda_k$ is a distant spike of multiplicity 1, $d_k$ is the corresponding sample eigenvalue, and the assumptions (A)-(D) hold. Let the normalized $k$th population PC score be $P_k = XE_k/(n\lambda_k)^{1/2}$ and the normalized $k$th sample PC score be $p_k = Xe_k/(nd_k)^{1/2}$. Then

$$\left|\langle P_k, p_k\rangle^2 - \psi'(\lambda_k)\right| \xrightarrow{p} 0$$

and

$$\left|\langle P_k, p_k\rangle^2 - d_k g_F(d_k)/f_F(d_k)\right| \xrightarrow{p} 0,$$

where the function $g_F$ is as defined in Result 2.

Since $P_k$ and $p_k$ are normalized random vectors, the absolute value of the inner product $\langle P_k, p_k\rangle$ is identical to the absolute value of their correlation coefficient. Since correlation is scale invariant, this is also the correlation between the $k$th sample and population PC scores. Therefore we can consider both $\psi'(\lambda_k)^{1/2}$ and $\{d_k g_F(d_k)/f_F(d_k)\}^{1/2}$ to be consistent estimators of the correlation between the $k$th sample and population PC scores.
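The agreement of the two limits in Theorem 3 can be seen by a heuristic substitution, taking as given the limits $d_k \to \psi(\lambda_k)$ (Result 1) and $f_F(d_k) \to \lambda_k$ (Theorem 1), together with the asymptotic equivalence of $g_F(d_k)$ and $\lambda_k\psi'(\lambda_k)/\psi(\lambda_k)$ from Section 4.1. This is only a consistency check and not a substitute for the proof:

$$\frac{d_k\, g_F(d_k)}{f_F(d_k)} \approx \frac{\psi(\lambda_k)}{\lambda_k} \cdot \frac{\lambda_k \psi'(\lambda_k)}{\psi(\lambda_k)} = \psi'(\lambda_k).$$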

4.3. Asymptotic shrinkage factor

Suppose $\lambda_k$ is a distant spike. Let the $k$th sample PC score for the $j$th observation $x_j$ be $p_{kj} = x_j^\top e_k$, and the $k$th predicted PC score for a new observation $x_{new}$ be $q_k = x_{new}^\top e_k$. Then the quantity $\rho_k = \lim_{p \to \infty}\{E(q_k^2)/E(p_{kj}^2)\}^{1/2}$ describes the asymptotic shrinkage in the $k$th predicted PC score for a new observation. As both $p_{kj}$ and $q_k$ are centered, i.e., $E(p_{kj}) = E(q_k) = 0$, $\rho_k$ represents the limiting ratio of the standard deviations of the predicted PC scores and the sample PC scores. Therefore, if we can estimate $\rho_k$, then the shrinkage bias in the $k$th predicted PC scores can be easily adjusted by rescaling the predicted scores by the factor $\rho_k^{-1}$. The following theorem provides a consistent estimator of the asymptotic shrinkage factor $\rho_k$.

Theorem 4. Suppose $\lambda_k$ is a distant spike of multiplicity 1, $d_k$ is the corresponding sample eigenvalue, and the assumptions (A)-(D) hold. Let $p_{kj}$ and $q_k$ be as defined above. Then

$$\left|\{E(q_k^2)/E(p_{kj}^2)\}^{1/2} - \lambda_k/d_k\right| \xrightarrow{p} 0.$$

This is a surprising result in which the asymptotic shrinkage factor is expressed as a simple ratio of the population and sample eigenvalues. Recall that we already constructed consistent estimators for the population eigenvalues in the previous sections. Using these results, the asymptotic shrinkage factor $\rho_k$ can be consistently estimated by $\hat\lambda_k/d_k$, where $\hat\lambda_k$ is any consistent estimator of $\lambda_k$.
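The rescaling step can be sketched as follows. This is a minimal illustration, assuming the predicted scores for the $k$th PC are held in a plain list and that `lambda_hat_k` comes from either of the consistent estimators above; the function name is hypothetical.

```python
def adjust_predicted_scores(pred_scores, d_k, lambda_hat_k):
    """Rescale predicted PC scores by 1 / rho_hat, with rho_hat = lambda_hat_k / d_k."""
    rho_hat = lambda_hat_k / d_k        # estimated asymptotic shrinkage factor
    return [q / rho_hat for q in pred_scores]
```

For instance, with $d_k = 4.5$ and $\hat\lambda_k = 3$, the estimated shrinkage factor is $2/3$, so every predicted score is multiplied by $1.5$.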

4.4. Comparison of the two different estimators

For each of the quantities discussed above, we proposed two asymptotically equivalent estimators. In terms of practical applications, they have their own advantages and disadvantages. One of them can be approximated directly based only on the sample eigenvalues, while the other requires estimating the LSD of the population eigenvalues to obtain the ψ function. For ease of discourse, we will call the former the “d-estimator” and the latter the “λ-estimator”.

If the number of spikes is known, estimating the d-estimator is computationally more efficient than estimating the λ-estimator as it does not involve estimating the population LSD. However, by estimating the population LSD the λ-estimation procedure can verify whether an estimated eigenvalue is actually a distant spike by checking if ψ’ > 0. Thus it can be used to estimate the number of distant spikes when it is unknown; see Section 5. In contrast, the d-estimation procedure provides no information on the population LSD and thus cannot distinguish among distant spikes, close spikes and non-spikes.

To summarize, when the number of spikes is known, or when we only want to estimate a few of the largest eigenvalues that are known to be distant spikes, the d-estimation procedure has the advantage of faster computation, while the λ-estimation procedure is more useful when the number of spikes is unknown or the distribution of the non-spikes is of interest.

4.5. Comparison of the Generalized Spiked Population (GSP) model and the Spiked Population (SP) model

As mentioned before, the SP model [11] is a special case of the GSP model. It is easy to verify that when the population eigenvalues follow the SP model, our consistent estimators for the spiked eigenvalues, the angles between the eigenvectors, the correlation coefficients between the PC scores and the shrinkage factors conform to the consistent estimators derived by [4, 14, 20]. For an SP model where all the non-spikes are equal to 1, the LSD H is a degenerate distribution at 1, and

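$$\psi(\alpha) = \alpha\{1 + \gamma/(\alpha - 1)\}, \qquad \psi'(\alpha) = 1 - \gamma/(\alpha - 1)^2.$$

Now, $\psi'(\alpha) > 0$ if and only if $\alpha > 1 + \gamma^{1/2}$. If $\alpha > 1 + \gamma^{1/2}$ and $d$ is the corresponding sample eigenvalue, then the consistent estimator of $\alpha$ is given by $\psi^{-1}(d)$, and

$$\frac{\alpha\psi'(\alpha)}{\psi(\alpha)} = \frac{1 - \gamma/(\alpha - 1)^2}{1 + \gamma/(\alpha - 1)}, \qquad \frac{\alpha}{\psi(\alpha)} = \frac{\alpha - 1}{\alpha + \gamma - 1},$$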

which show that all our results match with the results from [14].
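For the SP model the inverse $\psi^{-1}$ is available in closed form, because $d = \alpha\{1 + \gamma/(\alpha - 1)\}$ is a quadratic equation in $\alpha$. The sketch below is a minimal illustration under the assumption that all non-spikes equal 1; the function names are hypothetical, and the larger root is taken because a distant spike satisfies $\alpha > 1 + \gamma^{1/2}$.

```python
import math


def psi_sp(alpha, gamma):
    """psi(alpha) = alpha * {1 + gamma / (alpha - 1)} under the SP model."""
    return alpha * (1.0 + gamma / (alpha - 1.0))


def psi_sp_inverse(d, gamma):
    """Larger root of alpha^2 - (d + 1 - gamma) * alpha + d = 0."""
    b = d + 1.0 - gamma
    disc = b * b - 4.0 * d              # positive iff d > (1 + sqrt(gamma))^2
    if disc <= 0:
        raise ValueError("d is not above the limit of a distant spike")
    return 0.5 * (b + math.sqrt(disc))


def shrinkage_sp(alpha, gamma):
    """alpha / psi(alpha) = (alpha - 1) / (alpha + gamma - 1)."""
    return (alpha - 1.0) / (alpha + gamma - 1.0)
```

As a check, with $\gamma = 1$ a spike at $\alpha = 3$ gives $\psi(3) = 4.5$, and inverting $d = 4.5$ returns 3 with shrinkage factor $2/3$.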

It is of interest to investigate how closely methods developed under the SP model can approximate the consistent estimators for the distant spikes when the population eigenvalues actually follow a GSP model. Suppose the population eigenvalues $\lambda_1 \ge \cdots \ge \lambda_p$ follow the GSP model with $m$ distant spikes. The sample eigenvalues are $d_1 \ge \cdots \ge d_p$. Let $\lambda_k$ be a distant spike with multiplicity 1, and let $d_k$ be the corresponding sample eigenvalue. Then according to Result 1, $d_k \to \psi(\lambda_k)$ almost surely. From the definition of $\psi$,

$$\psi(\lambda_k) = \lambda_k\left\{1 + \gamma \int \frac{\lambda}{\lambda_k - \lambda}\, dH(\lambda)\right\} = \lambda_k + \gamma \int \frac{\lambda}{1 - \lambda/\lambda_k}\, dH(\lambda).$$

If H is almost degenerate, i.e., the non-spikes are nearly identical, then

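$$\psi(\lambda_k) \approx \lambda_k + \frac{\gamma\bar\lambda}{1 - \bar\lambda/\lambda_k}, \tag{2}$$

where $\bar\lambda = \int \lambda\, dH(\lambda)$ is the mean of the population LSD, which can be closely approximated by the mean of the non-spikes. By contrast, if the spike $\lambda_k$ is very large compared to all the non-spikes such that $\lambda/\lambda_k \approx 0$ for any $\lambda \in \Gamma_H$, then

$$\psi(\lambda_k) \approx \lambda_k + \gamma\bar\lambda.$$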

Now, suppose that instead of using the GSP assumption, we use the SP assumption to estimate the distant spikes. We assume that under the SP model, the population covariance matrix is scaled by a factor $\zeta$ and the population eigenvalues are $\beta_1 \ge \cdots \ge \beta_m > \zeta = \cdots = \zeta$. If $\beta_k$ is the population eigenvalue corresponding to $d_k$, then $d_k \to \psi(\beta_k)$ almost surely, where

$$\psi(\beta_k) = \beta_k\left(1 + \frac{\gamma\zeta}{\beta_k - \zeta}\right) = \beta_k + \frac{\gamma\zeta}{1 - \zeta/\beta_k}.$$

Here $\zeta$ is estimated as the mean of the non-spikes, as they are all assumed to be equal to $\zeta$. Notice that this expression is approximately equal to the expression in (2) with $\beta_k = \lambda_k$ and $\zeta = \bar\lambda$. Therefore, the asymptotic limits of $d_k$ under the GSP and the SP models are approximately equal when the non-spikes are nearly identical. In contrast, when the spike $\beta_k$ is very large compared to all the non-spikes such that $\zeta/\beta_k \approx 0$, then $\psi(\beta_k) \approx \beta_k + \gamma\zeta$. In this case also, the asymptotic limits of $d_k$ under the GSP and the SP models are approximately equal with $\beta_k = \lambda_k$ and $\zeta = \bar\lambda$. Therefore, if a generalized spike is very far away from the support of the population LSD, then the estimate of the spike based on an SP model will closely approximate the estimate based on a GSP model. However, the SP model will provide potentially biased estimates if the non-spikes are not similar and the ratio between the largest non-spike and the spike of interest is substantially larger than zero.
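The size of this bias can be explored numerically. The sketch below is a hypothetical illustration rather than one of the paper's simulations: it takes a two-point LSD as the true GSP model, computes the limit $\psi(\lambda_k)$ of the sample eigenvalue, and then inverts the SP formula with $\zeta$ set to the mean of the LSD. The function names and the chosen numbers are assumptions.

```python
import math


def psi_gsp(lam, support, weights, gamma):
    """Limit of the sample eigenvalue under a discrete population LSD."""
    return lam + gamma * lam * sum(w * t / (lam - t)
                                   for t, w in zip(support, weights))


def sp_estimate(d, zeta, gamma):
    """Invert psi(beta) = beta + gamma*zeta*beta/(beta - zeta); take the larger root."""
    b = d + zeta * (1.0 - gamma)
    disc = b * b - 4.0 * d * zeta
    if disc < 0:
        raise ValueError("no real root: d is too close to the bulk")
    return 0.5 * (b + math.sqrt(disc))


# Hypothetical GSP model: non-spikes split evenly between 0.5 and 1.5.
support, weights, gamma = [0.5, 1.5], [0.5, 0.5], 1.0
zeta = sum(w * t for t, w in zip(support, weights))    # mean of the LSD

for lam in (4.0, 50.0):
    d_limit = psi_gsp(lam, support, weights, gamma)
    bias = sp_estimate(d_limit, zeta, gamma) - lam
    print(f"spike {lam}: SP-based estimate is off by {bias:.4f}")
```

In this example the SP-based estimate is noticeably biased for the spike at 4 and nearly unbiased for the spike at 50, in line with the argument above.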

4.6. Comparison with ultra high-dimensional regime-based results when p/n is large

Our methods are developed under the high-dimensional regime $p/n \to \gamma < \infty$, and the theory does not guarantee their validity in the ultra-high-dimensional (UHD) regime where $p/n \to \infty$. However, in real-world applications we often only have data with large $p$ and large $n$, and the relative rate of their asymptotic divergence is unknown. Therefore, we do not know whether the true asymptotic regime is high-dimensional ($p/n \to \gamma < \infty$) or ultra-high-dimensional ($p/n \to \infty$).

Suppose that the true asymptotic regime is high-dimensional with γ finite but large compared to n, and the eigenvalues follow the GSP model. In such situations, we can either correctly assume the high-dimensional regime and apply the results discussed in this paper, or we can falsely assume the ultra high-dimensional asymptotic regime and employ the theoretical results derived under this regime [15]. In this section, we will investigate whether it is prudent to assume the UHD regime in such situations. In other words, we will try to answer how large γ can be considered to be diverging to infinity for practical applications.

We first show that for large enough $\gamma$, the theoretical results based on the UHD regime become nearly identical to the results under the correctly assumed GSP model (under the high-dimensional regime). The UHD-based results presented in [15] require weaker conditions for the non-spiked eigenvalues than those for the spiked population model. Instead of assuming that the non-spiked eigenvalues are the same, Lee et al. [15] assume certain conditions on the moments of the non-spiked eigenvalues. Since the population LSD has a finite support and all of its central moments are finite, the condition on their moments, i.e., Condition 2 in [15], is satisfied with an additional assumption that $n^3/p^2 = o(1)$. Without loss of generality, we assume that the mean of the non-spikes is unity. Then, under the UHD regime,

$$d/\lambda \xrightarrow{p} \gamma/\lambda + 1 \ \text{ when } \lambda/\gamma \text{ is bounded away from zero}, \qquad d/\gamma \xrightarrow{p} 1 \ \text{ when } \lambda = o(\gamma), \tag{3}$$

where $\lambda$ and $d$ are a spiked population eigenvalue and its corresponding sample eigenvalue, respectively. Here $\lambda = o(\gamma)$ means $\lambda/\gamma \to 0$. Lee et al. [15] also showed the convergence of sample eigenvectors and PC scores.

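Alternatively, under the GSP model, $d \to \psi(\lambda)$ when $\lambda$ is a distant spike. From Theorem 1, a distant spike $\lambda$ must satisfy

$$1 - \gamma \int \frac{x^2}{(\lambda - x)^2}\, dH(x) > 0,$$

where $H$ is the population LSD. Since $f_\lambda(x) = x^2(\lambda - x)^{-2}$ is a continuous function for $\lambda > \sup \Gamma_H$ and $x \in \Gamma_H$, where $\Gamma_H$ is the support of $H$, there exists $x^* \in (\inf \Gamma_H, \sup \Gamma_H)$ such that $\int x^2(\lambda - x)^{-2}\, dH(x) = x^{*2}(\lambda - x^*)^{-2}$. Then

$$1 - \gamma x^{*2}/(\lambda - x^*)^2 > 0,$$

which implies $\lambda > x^* + x^*\gamma^{1/2}$. Thus, for any $\lambda > x^* + x^*\gamma^{1/2}$, $d/\lambda - \psi(\lambda)/\lambda$ converges to zero.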

Now, under the true asymptotic regime (high-dimensional), $\lambda$ and $\gamma$ are both finite and non-zero, and thus $\lambda = O(\gamma)$. However, under the falsely assumed UHD regime, one can further assume either that $\lambda/\gamma$ is bounded away from zero or that $\lambda = o(\gamma)$, depending on whether $\lambda$ is large or small compared to $\gamma$. If one assumes the former, then the difference between the limits of $d/\lambda$ under the two models is

$$\psi(\lambda)/\lambda - \gamma/\lambda - 1 = \frac{\gamma}{\lambda}\left\{\int \frac{x}{1 - x/\lambda}\, dH(x) - 1\right\}. \tag{4}$$

Since $\gamma/\lambda = O(1)$ and $\int x(1 - x/\lambda)^{-1}\, dH(x) - 1 = O(1/\lambda)$ as the mean of the non-spikes is unity, (4) becomes almost identical to zero when $\lambda$ is sufficiently large.

Now, suppose one assumes $\lambda = o(\gamma)$. Let $\lambda \approx a + b\gamma^k$ for some finite $a$, $b$ and $k \in [1/2, 1)$. Then, the difference between our result and the UHD result is

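$$\left|\psi(\lambda)/\gamma - 1\right| = \left|\lambda/\gamma + \lambda\int \frac{x}{\lambda - x}\, dH(x) - 1\right| = O(\gamma^{k-1}). \tag{5}$$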

Thus, in this case also (5) becomes almost identical to zero when γ is sufficiently large. We can also show the similar results for eigenvectors and PC scores.

Although both GSP and UHD eventually provide nearly identical results when $\gamma$ is sufficiently large, the GSP model can provide substantially better estimates. The difference can be large when $\lambda$ is small compared to $\gamma$, i.e., $k < 1$, since the difference in (5) is of the order $O(\gamma^{k-1})$. The difference will be at least as large as $O(1/\gamma)$ in such cases. In simulation studies, we show this numerically.

One important thing to note here is that the results under the ultra-high-dimensional regime do not provide any consistent estimators for the spiked eigenvalues, although there can still be methods to estimate shrinkage. When the spike $\lambda = o(\gamma)$, the limiting behavior of the corresponding sample eigenvalue does not depend on $\lambda$, and when $\lambda/\gamma$ is bounded away from zero, $d/\lambda \xrightarrow{p} \gamma/\lambda + 1$ does not imply that $d - \gamma$ is a consistent estimator of $\lambda$, because $\lambda$ itself is divergent.

In the simulation studies and real data applications, we further evaluated the performance of $d - \gamma$ as an estimator. Lee et al. [15] also provided an estimator of the asymptotic shrinkage, which we have evaluated in the simulation studies. A recent paper by Cai et al. [7] provided limiting laws for spiked eigenvalues and corresponding eigenvectors when the spikes are divergent, under more general assumptions than the ultra-high-dimensional assumptions of [15]. In terms of eigenvalue estimation, this model will still have the problem of the spikes not being estimable. However, it can be an interesting future research direction to find estimators of the shrinkage under this model.

5. Estimation of the population LSD

The λ-estimators rely on ψ, which is a function of the unknown population LSD H. To use the λ-estimators, we therefore need to estimate H. Using the Stieltjes transformation and the Marčenko-Pastur theorem, El Karoui [9] developed a general algorithm to estimate the population LSD from the sample ESD, Fp. We propose to use El Karoui’s method to estimate the population LSD H and then use it to estimate ψ.

5.1. El Karoui’s algorithm

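Suppose $v_{F_p}$ is the Stieltjes transformation of the set of eigenvalues of the sample covariance matrix, given by

$$v_{F_p}(z) = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{d_i - z}$$

for any $z \in \mathbb{C}^+$, $\mathbb{C}^+ = \{x \in \mathbb{C} : \mathrm{Im}(x) > 0\}$. According to the Marčenko-Pastur theorem [16], when assumptions (A)-(D) hold, $v_{F_p}$ converges point-wise almost surely to a non-random limit $v_F$, which uniquely satisfies the following equation:

$$v_F(z) = -\left\{z - \gamma \int \frac{\lambda}{1 + \lambda v_F(z)}\, dH(\lambda)\right\}^{-1}.$$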

El Karoui’s method first calculates vFp for a grid of values z1,…, zJ, and then finds

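$$\hat H = \arg\min_{H} L\left[\left\{\frac{1}{v_{F_p}(z_j)} + z_j - \frac{p}{n}\int \frac{\lambda}{1 + \lambda v_{F_p}(z_j)}\, dH(\lambda)\right\}_{j=1}^{J}\right],$$

where $L$ is any pre-defined convex loss function. In order to approximate the integral inside the loss function, the algorithm discretizes $H$ as

$$dH(\lambda) \approx \sum_{k=1}^{K} w_k \delta_{t_k}(\lambda),$$

where $\delta_{t_k}(\lambda) = 1$ if $\lambda = t_k$ and 0 otherwise, $w_1 + \cdots + w_K = 1$ with $w_k > 0$ for all $k \in \{1, \ldots, K\}$, and $t_1, \ldots, t_K$ is a grid of points on the support of $H$. This amounts to approximating $H$ by a discrete distribution with support $t_1, \ldots, t_K$. Then the integral is approximated by

$$\int \frac{\lambda}{1 + \lambda v_F(z)}\, dH(\lambda) \approx \sum_{k=1}^{K} \frac{w_k t_k}{1 + t_k v_{F_p}(z_j)},$$

and the minimization problem transforms into

$$\hat H = \arg\min_{H} L\left[\left\{\frac{1}{v_{F_p}(z_j)} + z_j - \frac{p}{n}\sum_{k=1}^{K} \frac{w_k t_k}{1 + t_k v_{F_p}(z_j)}\right\}_{j=1}^{J}\right]. \tag{6}$$

El Karoui [9] has shown the weak convergence of $\hat H$ to $H$, i.e., $\hat H \Rightarrow H$.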

Some examples of the convex loss function L are

  1. $L_\infty(e_1, \ldots, e_J) = \max_j \max\{|\mathrm{Re}(e_j)|, |\mathrm{Im}(e_j)|\}$;

  2. $L_1(e_1, \ldots, e_J) = |e_1| + \cdots + |e_J|$;

  3. $L_2(e_1, \ldots, e_J) = |e_1|^2 + \cdots + |e_J|^2$.

For the convex loss functions described above, the estimation of $H$ in (6) reduces to a convex optimization problem [6]. El Karoui also provided a translation of this problem into a linear programming (LP) problem when the $L_\infty$ loss function is used. Further details can be found in [9].
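To make the LP translation concrete, the sketch below sets up the discretized problem (6) with the $L_\infty$ loss and solves it with `scipy.optimize.linprog`. It is a minimal sketch under stated assumptions and not El Karoui's or the authors' implementation (the paper reports using the lpSolve R package): the function names are hypothetical, the grid `t` and evaluation points `z` are chosen by the user, and the weights are constrained to be non-negative rather than strictly positive.

```python
import numpy as np
from scipy.optimize import linprog


def stieltjes(sample_eigs, z):
    """v_{F_p}(z_j) = mean_i 1 / (d_i - z_j) for each evaluation point z_j."""
    d = np.asarray(sample_eigs, dtype=float)
    return np.array([np.mean(1.0 / (d - zj)) for zj in z])


def estimate_lsd_weights(z, v, grid, gamma):
    """Minimise u subject to |Re e_j| <= u and |Im e_j| <= u, where
    e_j = 1/v_j + z_j - gamma * sum_k w_k t_k / (1 + t_k v_j),
    with sum_k w_k = 1 and w_k >= 0. Returns (weights, attained u)."""
    z = np.asarray(z, dtype=complex)
    v = np.asarray(v, dtype=complex)
    t = np.asarray(grid, dtype=float)
    J, K = len(z), len(t)

    b = 1.0 / v + z                                          # constant part of e_j
    A = gamma * t[None, :] / (1.0 + t[None, :] * v[:, None])  # e = b - A @ w

    # Split into real and imaginary parts so the LP is over real numbers.
    A_r = np.vstack([A.real, A.imag])                        # shape (2J, K)
    b_r = np.concatenate([b.real, b.imag])                   # shape (2J,)
    ones = np.ones((2 * J, 1))

    # Unknowns are (w_1, ..., w_K, u).  Encode  b_r - A_r w <= u  and
    # -(b_r - A_r w) <= u  as A_ub @ x <= b_ub.
    A_ub = np.vstack([np.hstack([-A_r, -ones]), np.hstack([A_r, -ones])])
    b_ub = np.concatenate([-b_r, b_r])
    cost = np.zeros(K + 1)
    cost[-1] = 1.0                                           # minimise u only
    A_eq = np.hstack([np.ones((1, K)), np.zeros((1, 1))])    # weights sum to one

    res = linprog(cost, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * (K + 1), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[:K], res.x[-1]
```

In practice `v` would be obtained as `stieltjes(sample_eigs, z)` from the sample eigenvalues, and `gamma` would be set to $p/n$. The number of LP variables is $K + 1$, so the size of the problem is governed by the grid and not by $p$.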

5.2. Implementing El Karoui’s algorithm when the number of spikes is known

Since the generalized spikes fall outside the support of the population LSD, El Karoui’s algorithm cannot be directly applied to estimate the spikes. Furthermore, Bai and Silverstein [2] showed that the probability of a sample eigenvalue falling outside the support of the sample LSD will go to zero as p increases, which implies that the sample eigenvalues corresponding to the population generalized spikes will be measure zero points in the sample LSD. Since the spikes behave like measure zero points (or outliers) when we are concerned about estimating the population LSD, we can exclude the sample eigenvalues corresponding to the population generalized spikes while calculating vFp and that will lead to a more robust estimation of H. Therefore, we will apply El Karoui’s algorithm in the following way:

  1. Suppose the population covariance matrix possesses $m$ generalized spikes. We exclude the top $m$ sample eigenvalues while calculating $v_{F_p}$, viz.
    $$v_{F_p}(z) = \frac{1}{n-m}\sum_{i=m+1}^{n} \frac{1}{d_i - z}.$$
  2. Apply El Karoui’s algorithm to obtain $\hat H$. Additionally, if it is reasonable to assume that the true population LSD is a continuous or piecewise continuous distribution function, a suitable kernel smoothing algorithm can be applied to $\hat H$ to obtain a smoother approximation of $H$.

  3. The quantiles of $\hat H$ can be considered as the estimators of the non-spikes.

  4. Suppose $\hat\lambda_{m+1}, \ldots, \hat\lambda_p$ are the estimated non-spikes. Then the $\psi$ function is estimated by
    $$\hat\psi(\alpha) = \alpha + \frac{\gamma\alpha}{p-m}\sum_{i=m+1}^{p} \frac{\hat\lambda_i}{\alpha - \hat\lambda_i}.$$

Due to the weak convergence $\hat H \Rightarrow H$, $\hat\psi$ will also converge to $\psi$ point-wise. Thus, all the estimates provided in Sections 3 and 4 will still be consistent if we replace $\psi$ with $\hat\psi$.
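Given estimated non-spikes, the λ-estimator $\hat\psi^{-1}(d_k)$ of a distant spike can be computed by a Newton-Raphson iteration, as suggested in Section 3. The sketch below is a minimal illustration with hypothetical function names; it assumes that `nonspikes` holds the estimated non-spikes, that `gamma` is the plug-in value $p/n$, and that $d_k$ exceeds the limit attainable by a distant spike so that a root exists. Starting from $d_k$ is a choice made here because $\hat\psi(\alpha) > \alpha$ above the non-spikes, so the root lies to the left of $d_k$.

```python
def psi_hat(alpha, nonspikes, gamma):
    """alpha + gamma * alpha / (p - m) * sum_i lam_i / (alpha - lam_i)."""
    mean_term = sum(l / (alpha - l) for l in nonspikes) / len(nonspikes)
    return alpha + gamma * alpha * mean_term


def psi_hat_prime(alpha, nonspikes, gamma):
    """1 - gamma / (p - m) * sum_i (lam_i / (alpha - lam_i))**2."""
    mean_term = sum((l / (alpha - l)) ** 2 for l in nonspikes) / len(nonspikes)
    return 1.0 - gamma * mean_term


def psi_hat_inverse(d_k, nonspikes, gamma, tol=1e-10, max_iter=100):
    """Solve psi_hat(alpha) = d_k for alpha by Newton-Raphson, starting at d_k."""
    alpha = d_k
    for _ in range(max_iter):
        step = ((psi_hat(alpha, nonspikes, gamma) - d_k)
                / psi_hat_prime(alpha, nonspikes, gamma))
        alpha -= step
        if abs(step) < tol:
            return alpha
    raise RuntimeError("Newton-Raphson did not converge")
```

For example, with all non-spikes equal to 1 and $\gamma = 1$, inverting $d_k = 4.5$ recovers a spike of 3, matching the SP formulas of Section 4.5.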

A computationally challenging part in El Karoui’s algorithm is solving the LP problem, for which there are many fast algorithms available that can solve it in polynomial (in K, the grid size of w1,…, wK) computation time. An important property of El Karoui’s algorithm is that the complexity of the LP step does not depend on p, which is especially useful for estimating the LSDs of large covariance matrices. In our implementation, we used the lpSolve R package [5] to solve the LP problem.

5.3. Estimating the number of spikes

Our application of El Karoui’s algorithm to the GSP model depends on the number of spikes m, which is usually unknown. If we have some knowledge of the underlying structure of the data, we can use it to estimate m roughly. Suppose we know that the data come from a mixture of K subpopulations, and within each subpopulation the observations are iid N(μk,Σ), where μk represents the mean for the kth subpopulation, and Σ is the common within-group population covariance matrix. Then, as the spikes represent the between-group differences, the number of spikes should equal the rank of the between-group covariance matrix, which is K − 1. However, in real data, it is often hard to accurately assess the number of such homogeneous subpopulations. In those cases, we can use the following algorithm to estimate m.

  1. Start with a reasonable finite upper bound mmax of the number of spikes. The upper bound can be selected based on prior information on the subpopulations, or by examining the sample eigenvalues. Set m = mmax.

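  2. Use El Karoui’s algorithm to estimate the population LSD and the non-spikes. Suppose the estimated non-spikes are $\hat{\lambda}_{m+1} \ge \cdots \ge \hat{\lambda}_p$, and the ψ function is estimated by
    $$\hat{\psi}(\alpha) = \alpha + \frac{\gamma\alpha}{p-m} \sum_{i=m+1}^{p} \frac{\hat{\lambda}_i}{\alpha - \hat{\lambda}_i}.$$
  3. Find $S_\psi > \hat{\lambda}_{m+1}$ using the Newton-Raphson algorithm such that
    $$\hat{\psi}'(S_\psi) = 1 - \frac{\gamma}{p-m} \sum_{i=m+1}^{p} \left(\frac{\hat{\lambda}_i}{S_\psi - \hat{\lambda}_i}\right)^2 = 0.$$
  4. Since any distant spike must be larger than $S_\psi$, and $\hat{\psi}$, $\hat{\psi}'$ are both continuous and strictly increasing functions on $(S_\psi, \infty)$, the equation $\hat{\psi}(\lambda) - d_k = 0$ has a root in $(S_\psi, \infty)$ if and only if $\hat{\psi}(S_\psi) - d_k < 0$. Therefore, find the smallest index $i^* \in \{1, \ldots, m\}$ such that $d_{i^*} \le \hat{\psi}(S_\psi)$. If all $d_1, \ldots, d_m$ are larger than $\hat{\psi}(S_\psi)$, then stop and select m as the number of distant spikes. Otherwise, set $m = i^* - 1$ and repeat Steps 2–4.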

Note that the close spikes occur so close to the support of the population LSD that they cannot be distinguished separately from the non-spikes when the number of spikes is unknown.
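The iteration above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the callable `estimate_nonspikes(d, m)` is a hypothetical placeholder for the El Karoui step, and the root of the derivative is located by bisection rather than the Newton-Raphson iteration mentioned in Step 3, purely to keep the example short and robust.

```python
def psi_hat(alpha, nonspikes, gamma):
    avg = sum(lam / (alpha - lam) for lam in nonspikes) / len(nonspikes)
    return alpha + gamma * alpha * avg


def psi_hat_prime(alpha, nonspikes, gamma):
    avg = sum((lam / (alpha - lam)) ** 2 for lam in nonspikes) / len(nonspikes)
    return 1.0 - gamma * avg


def find_s_psi(nonspikes, gamma, iters=200):
    """Root of psi_hat' to the right of the largest non-spike (bisection)."""
    lo = max(nonspikes)
    width = max(lo, 1.0)
    hi = lo + width
    while psi_hat_prime(hi, nonspikes, gamma) <= 0.0:  # expand until sign flips
        width *= 2.0
        hi = lo + width
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mid <= lo or mid >= hi:  # floating-point resolution reached
            break
        if psi_hat_prime(mid, nonspikes, gamma) > 0.0:
            hi = mid
        else:
            lo = mid
    return hi


def estimate_num_distant_spikes(d, gamma, m_max, estimate_nonspikes):
    """d: sample eigenvalues in decreasing order; returns the selected m."""
    m = m_max
    while m > 0:
        nonspikes = estimate_nonspikes(d, m)       # Step 2 (placeholder)
        s_psi = find_s_psi(nonspikes, gamma)       # Step 3
        threshold = psi_hat(s_psi, nonspikes, gamma)
        # Step 4: smallest 1-based index whose eigenvalue is not above threshold
        i_star = next((i for i in range(1, m + 1) if d[i - 1] <= threshold), None)
        if i_star is None:
            return m
        m = i_star - 1
    return 0


# Toy run (hypothetical): all non-spikes equal to 1 and gamma = 10, so the
# threshold psi_hat(S_psi) equals (1 + sqrt(10))^2, roughly 17.32.
d = [60.0, 30.0, 15.0, 12.0]
m_hat = estimate_num_distant_spikes(d, 10.0, 4, lambda d, m: [1.0] * 50)
```

In the toy run the third eigenvalue falls below the threshold, so the procedure drops to m = 2 and stops there.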

The selection of mmax is subjective. It can be selected based on the prior knowledge on the number of subpopulations or by investigating the sample eigenvalues. In real data applications, we are usually interested in only a few large eigenvalues. In such situations, mmax can also be selected to be slightly larger than the number of eigenvalues we are interested in. As seen from our simulation studies, this spike selection algorithm can overestimate the number of spikes if the upper bound mmax is too large or underestimate the number of spikes if there are close spikes present; see the Online Supplement. However, as long as mmax is small compared to n and p, the estimation of the true population distant spikes will still remain consistent.

6. Simulation studies and real data example

6.1. Simulation studies: Compare GSP and SP-based methods

In this section we will present simulation studies of four different scenarios to compare the performances of the proposed GSP-based methods and the existing SP-based method proposed by Lee et al. [14]. For each study, we simulated a training dataset with n = 500 individuals and p = 5000 features. The data were generated from three subpopulations with sample sizes 100, 150 and 250. For each subpopulation we first selected a mean vector μi by drawing its elements randomly with replacement from {−0.3,0,0.3}. Then samples in the ith subpopulation were drawn from Np(μi,V) where V is the AR(1) covariance matrix with variance σ2 and autocorrelation ρ. The (σ2,ρ) pairs used for the four studies were (4,0.8), (1,0.7), (7.5,0.8) and (4,0). The population eigenvalue plots for all the studies are shown in Figure 2.
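The data-generating design described above can be sketched in a few lines. The function name `simulate_study` and its defaults are assumptions for illustration; the AR(1) noise is drawn through the stationary recursion, which yields covariance σ²ρ^|i−j| between features i and j.

```python
import math
import random


def simulate_study(sizes=(100, 150, 250), p=5000, sigma2=4.0, rho=0.8, seed=0):
    """Return (data, labels): rows are individuals, columns are features."""
    rng = random.Random(seed)
    sd = math.sqrt(sigma2)
    innov_sd = sd * math.sqrt(1.0 - rho ** 2)  # keeps the marginal variance at sigma2
    data, labels = [], []
    for k, n_k in enumerate(sizes):
        # Subpopulation mean: entries drawn with replacement from {-0.3, 0, 0.3}
        mu = [rng.choice((-0.3, 0.0, 0.3)) for _ in range(p)]
        for _ in range(n_k):
            noise = rng.gauss(0.0, sd)
            row = [mu[0] + noise]
            for j in range(1, p):
                noise = rho * noise + rng.gauss(0.0, innov_sd)
                row.append(mu[j] + noise)
            data.append(row)
            labels.append(k)
    return data, labels


# Small toy call; the full design would use the default sizes and p = 5000.
data, labels = simulate_study(sizes=(3, 4), p=50, seed=1)
```

Setting `rho=0` reduces the noise to independent features with common variance, which corresponds to the fourth study in which the SP assumption holds.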

Figure 2:

Eigenvalue structures in simulation studies comparing GSP-based and SP-based methods.

We also generated test datasets for each study with the same settings as the training datasets. Then we applied our GSP-based methods and the existing SP-based method to estimate the population spikes, the angles between the sample and population eigenvectors, the correlations between the sample and population PC scores and the asymptotic shrinkage factors. For all of the studies, we used the upper bound mmax = 5 to estimate the number of distant spikes using the algorithm described in Section 5.3. We simulated each study 200 times to calculate the empirical biases and standard errors of the estimates. The results are presented in Table 1.

Table 1:

Simulation results for GSP-based and SP-based methods for estimating the population eigenvalues, cosines of the angles between sample and population eigenvectors, correlations between sample and population PC scores, and the asymptotic shrinkage factors. Each cell shows the empirical bias (%) with the coefficient of variation (%) in parentheses.

| Study | Settings | Method | Eigenvalue 1 | Eigenvalue 2 | Angle 1 | Angle 2 | Correlation 1 | Correlation 2 | Shrinkage 1 | Shrinkage 2 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | n = 500, p = 5000, σ² = 4, ρ = 0.8 | SP | 5.27 (2.37) | 18.27 (3.11) | 6.52 (0.32) | 34.07 (0.60) | 3.83 (0.03) | 23.33 (0.08) | 5.32 (0.60) | 17.88 (1.06) |
| 1 | | λ-GSP | 0.43 (2.67) | 0.95 (5.27) | 0.53 (0.77) | 3.28 (6.26) | 0.33 (0.31) | 2.79 (4.69) | 0.47 (0.92) | 0.58 (3.31) |
| 1 | | d-GSP | 0.47 (2.67) | 0.69 (5.45) | 0.47 (0.77) | 2.48 (6.70) | 0.24 (0.31) | 2.11 (5.07) | 0.51 (0.92) | 0.31 (3.51) |
| 2 | n = 500, p = 5000, σ² = 1, ρ = 0.7 | SP | 0.10 (0.90) | 0.46 (1.27) | 0.16 (0.04) | 0.44 (0.08) | 0.08 (0.001) | 0.24 (0.003) | 0.18 (0.08) | 0.39 (0.16) |
| 2 | | λ-GSP | −0.04 (0.90) | 0.04 (1.28) | 0.01 (0.04) | 0.004 (0.10) | 0.01 (0.03) | 0.01 (0.01) | 0.03 (0.08) | −0.03 (0.18) |
| 2 | | d-GSP | −0.004 (0.90) | 0.10 (1.28) | 0.03 (0.04) | 0.03 (0.10) | 0.004 (0.03) | 0.01 (0.01) | 0.07 (0.08) | 0.03 (0.18) |
| 3 | n = 500, p = 5000, σ² = 7.5, ρ = 0.8 | SP | 25.68 (2.54) | | 64.06 (0.52) | | 46.50 (0.07) | | 26.41 (0.90) | |
| 3 | | λ-GSP | 2.92 (5.7) | | 12.62 (11.90) | | 10.95 (10.13) | | 3.47 (4.20) | |
| 3 | | d-GSP | 2.45 (5.74) | | 12.25 (10.52) | | 10.87 (8.58) | | 3.00 (4.24) | |
| 4 | n = 500, p = 5000, σ² = 4, ρ = 0 | SP | 0.05 (1.58) | −0.26 (2.35) | 0.06 (0.23) | −0.06 (0.53) | 0.03 (0.02) | 0.05 (0.08) | 0.07 (0.43) | −0.22 (0.90) |
| 4 | | λ-GSP | 0.03 (1.58) | −0.35 (2.35) | 0.02 (0.24) | −0.18 (0.54) | 0.01 (0.02) | −0.02 (0.09) | 0.04 (0.43) | −0.31 (0.91) |
| 4 | | d-GSP | 0.16 (1.58) | −0.12 (2.35) | 0.10 (0.23) | −0.03 (0.53) | 0.01 (0.02) | 0.02 (0.09) | 0.18 (0.42) | −0.08 (0.90) |


It is clear from Table 1 that for Studies 1, 2 and 3, our methods reduced the bias in all the estimates while having standard errors similar to those of the existing method. The positive empirical biases in all the SP estimates suggest that the SP method tends to overestimate all the quantities. In Study 4, since the underlying population satisfied the SP assumption, all methods provided very similar and almost unbiased estimates (< 1%). The results also verify that the λ-estimates and d-estimates are asymptotically equivalent: their performances are nearly identical in all the simulation studies.

In Study 1, the ratios of the largest non-spike to the two spikes are 0.29 and 0.48, which are substantially larger than zero. Thus according to the discussion in Section 4, the SP model does not closely approximate the GSP model. The results support this assertion as the SP model-based estimates are highly biased whereas the estimates based on our methods have very little empirical bias. In Study 2, the largest non-spike is very small compared to the smallest spike (ratio 0.08). Thus the estimates based on the SP model closely approximate the estimates based on the GSP model, and we find very little empirical bias (< 1%) in all of the SP model-based estimates. In Study 3, even though there were two spikes present, only the largest population eigenvalue was a distant spike. So we presented only the estimates corresponding to the largest population eigenvalue. Since the ratio of the largest non-spike to the largest spike is substantially larger than zero (0.53) in this study, we observe very high empirical bias in the SP model-based estimates. However, our methods provided negligible empirical biases even in the presence of a close spike. We also presented the estimated number of distant spikes in each of the simulation studies in the Online Supplement. Note that in some cases, our algorithm over-estimates the number of distant spikes. However, as the over-estimation is only finite, the estimates of the distant spikes still remain consistent.

6.2. Simulation studies: Compare GSP and UHD-based methods

In Section 4.6, we compared the asymptotic results under the UHD regime with the results based on the high-dimensional GSP model when p is much larger than n but p/n remains finite. We theoretically established that the results from the two regimes become almost identical when p/n = γ is sufficiently large. However, given a large but finite γ in the data, the difference can be substantial when the spike is small compared to γ.

In this section, we will assess that result by numerically comparing the GSP and UHD-based estimates for different values of γ. We considered five different scenarios where the largest population eigenvalue $\lambda \in \{\gamma,\ 0.6\gamma,\ 60 + 0.1\gamma,\ 6\sqrt{\gamma},\ 4\gamma^{2/3}\}$. For the first three scenarios, under the UHD regime, λ/γ can be assumed to be bounded away from zero, and for the last two, λ/γ → 0 as γ → ∞. To compare the performances as γ increases, we selected six different values for γ, viz. γ ∈ {100, 200, 500, 1000, 2500, 5000}. For each combination of γ and λ, we simulated 200 datasets, each with n = 200 samples from a population with only one spike λ, and the non-spikes generated from the AR(1) covariance structure with (σ², ρ) = (1, 0.9).

First, we compare the convergence results of the largest sample eigenvalue d from Theorem 1 and (3). For this purpose, we assume the population eigenvalues and the rate of increment of λ are known, and we compare the relative errors $\epsilon_{GSP} = \{d - \psi(\lambda)\}/d$ and $\epsilon_{UHD} = (d - \lambda - \gamma)/d$ or $(d - \gamma)/d$, depending on whether λ/γ is assumed to be bounded away from zero or not. Figure 3 shows that for all combinations of (γ, λ), the GSP-based convergence result (Theorem 1) has negligible relative errors. In contrast, the UHD-based convergence result (3) has substantial relative errors even for γ as large as 5000 in Scenarios 3, 4 and 5. For Scenarios 1 and 2, since λ increases at a faster rate with γ than in other scenarios, the relative errors based on the two results converge much faster. However, for relatively smaller values of γ (100, 200, 500), the differences are substantial even though γ is large compared to n = 200. This suggests that we need γ to be large in an absolute sense, and not only in a relative sense compared to n, in order to assume γ → ∞ and apply UHD-based results.
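The two relative errors can be computed with a short helper. This is a sketch under assumptions: the function names are invented for the example, the non-spikes are passed in as a plain list, and the toy input uses equal non-spikes instead of the AR(1) spectrum used in the simulations.

```python
def psi(lam, nonspikes, gamma):
    """psi(lam) = lam + gamma * lam * mean(t / (lam - t)) over the non-spikes."""
    avg = sum(t / (lam - t) for t in nonspikes) / len(nonspikes)
    return lam + gamma * lam * avg


def relative_errors(d, lam, nonspikes, gamma, spike_order_gamma):
    """Return (eps_GSP, eps_UHD) for a largest sample eigenvalue d.

    spike_order_gamma: True when lambda/gamma is treated as bounded away
    from zero, in which case the UHD approximation is lambda + gamma;
    otherwise the UHD approximation is gamma alone.
    """
    eps_gsp = (d - psi(lam, nonspikes, gamma)) / d
    uhd_limit = lam + gamma if spike_order_gamma else gamma
    eps_uhd = (d - uhd_limit) / d
    return eps_gsp, eps_uhd


# Toy check (hypothetical): unit non-spikes, gamma = 100, lambda = gamma,
# and d set exactly to psi(lambda) so that eps_GSP is zero by construction.
nonspikes = [1.0] * 10
d = psi(100.0, nonspikes, 100.0)
eps_gsp, eps_uhd = relative_errors(d, 100.0, nonspikes, 100.0, True)
```

Even in this idealized case the UHD approximation leaves a relative gap of about 0.5%, because ψ(λ) exceeds λ + γ by the term γλ/(λ − 1) − γ.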

Figure 3:

Relative errors (%) in the convergence results of the largest sample eigenvalue derived under GSP and UHD regimes. The population eigenvalues and the rate of increment of the largest population eigenvalue are assumed to be known.

Next, we compare the estimates of the spike λ using GSP-based and UHD-based methods assuming the population eigenvalues and the rate of increment of the population spike λ to be unknown. Among the GSP-based methods, we only used the d-GSP method for this purpose due to the computational burden associated with applying the λ-GSP method to such a large number of simulated datasets. One thing to note here is that the UHD results do not provide any consistent estimators for λ, as it is assumed to be divergent when λ is of the same order as γ, and the asymptotic properties of the sample eigenvalues do not depend on λ when λ/γ → 0. Thus, in order to compare these methods, we estimate λ by $\hat{\lambda} = d - \gamma$ when considering the UHD regime.

From Figure 4 we can see that our proposed d-GSP method provides almost negligible biases for all combinations of (γ, λ), whereas the UHD-based estimates have substantial biases even for γ as large as 5000 in Scenarios 3, 4 and 5. For Scenarios 1 and 2, both methods provide almost unbiased estimates when γ ≥ 1000 and γ ≥ 2000, respectively. Further, we compare the estimated shrinkage factors based on these two methods in the Online Supplement. They also show very similar patterns as the estimated spikes.

Figure 4:

Empirical biases (%) in estimating the largest population eigenvalue for GSP-based and UHD-based methods. The population eigenvalues and the rate of increment of the largest population eigenvalue are assumed to be unknown.

6.3. Application on Hapmap III data

For this demonstration, we used genetic data from the Hapmap Phase III project (http://hapmap.ncbi.nlm.nih.gov/). Our sample consisted of unrelated individuals sampled from two different populations: a) Utah residents with Northern and Western European ancestry (CEU) and b) Toscani in Italia (TSI). We only included genomic markers that are on chromosomes 1–22, have less than 5% missing values, and have minor allele frequency greater than 0.05. We also excluded two samples (both from CEU) with outlier PC scores (more than six standard deviations away from the mean PC score corresponding to at least one distant spike). We then mean-centered and variance-standardized the data for each marker. The final sample consisted of 198 individuals (110 from CEU and 88 from TSI). The total number of markers selected across chromosomes 1–22 was 1,389,511.
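The marker-level filtering and standardization can be sketched as below. The genotype coding (allele counts 0/1/2 with `None` for missing), the function name `qc_and_standardize`, and the choice to set missing entries to zero after centering are all assumptions for illustration; the text does not state how missing values were handled.

```python
import math


def qc_and_standardize(genotypes, max_missing=0.05, min_maf=0.05):
    """genotypes: list of markers, each a list of 0/1/2 counts or None.

    Returns a list of (marker_index, standardized_values) for kept markers.
    """
    kept = []
    for idx, marker in enumerate(genotypes):
        observed = [g for g in marker if g is not None]
        if len(observed) < 2:
            continue
        missing_rate = 1.0 - len(observed) / len(marker)
        if missing_rate >= max_missing:      # keep only < 5% missing
            continue
        freq = sum(observed) / (2.0 * len(observed))
        if min(freq, 1.0 - freq) <= min_maf:  # keep only MAF > 0.05
            continue
        mean = sum(observed) / len(observed)
        var = sum((g - mean) ** 2 for g in observed) / (len(observed) - 1)
        if var == 0.0:
            continue
        sd = math.sqrt(var)
        # Assumption: missing entries become 0 after centering (mean imputation).
        kept.append((idx, [0.0 if g is None else (g - mean) / sd for g in marker]))
    return kept


# Toy input (hypothetical): one usable marker, one monomorphic, one with 10% missing.
markers = [
    [0, 1, 2, 1, 0, 2, 1, 1, 0, 2],
    [0] * 10,
    [0, 1, 2, 1, 0, 2, 1, 1, 0, None],
]
kept = qc_and_standardize(markers)
```

Only the first toy marker survives both filters; its standardized values have mean zero and unit sample variance.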

To evaluate the performance of the proposed methods with different p, we performed PCA on each chromosome separately. The number of markers varied from 19,331 (chromosome 21) to 116,582 (chromosome 2). The distribution of the number of markers across different chromosomes is presented in the Online Supplement. We first estimated the number of distant spikes using the algorithm described in Section 5.3. We found no distant spike in chromosome 22 and only one distant spike in chromosome 2. Then we applied our GSP-based methods, the existing SP-based method [14] and the UHD-based method [15] to estimate the asymptotic shrinkage factors corresponding to the distant spikes. Figures 5a and 5b compare the estimated asymptotic shrinkage factors for the first two PCs across different chromosomes. The plots show that for all the chromosomes, the λ-GSP and d-GSP methods provided almost equal estimates while the SP and UHD estimates are larger than both the GSP estimates. This suggests that the SP method would over-estimate the shrinkage factors when the population eigenvalues deviate from the assumption that the non-spiked eigenvalues are the same. Moreover, the UHD method over-estimated the shrinkage factors even for p/n nearly as large as 600 (chromosome 2).

Figure 5:

Estimated shrinkage factors for (a) PC1 and (b) PC2 across chromosomes 1–21 based on three different methods. (c) Comparison of the mean squared errors (MSE) of the unadjusted and adjusted PC scores based on the d-GSP and SP methods with the adjusted PC scores based on the λ-GSP method. The ratios of the MSEs are presented for chromosome 1–21 using the threshold ϵ = 1. The y-axis is presented in a logarithmic scale.

To investigate whether the proposed shrinkage-bias adjustment can improve the prediction accuracy, we performed a leave-one-out cross-validation. In each iteration we removed one individual (test sample) and performed PCA on the remaining individuals (training samples) to predict the PC score of the test sample. For each predicted PC score, we adjusted the shrinkage-bias using the GSP-based, SP-based and UHD-based shrinkage factor estimates.

One important issue with this cross-validation is that the exclusion of one individual can substantially change the PC coordinates, so that the PC score plots from the training sample-based and complete sample-based PCA can be substantially different. To circumvent this problem, in each iteration we first rescaled the PC scores based on their corresponding sample eigenvalues to make the PCs comparable. In addition, we obtained the mean squared difference of the training sample PC1–2 scores with and without the exclusion of the test sample (for chromosome 2, only PC1 is used), and excluded the test sample from the prediction error estimation if the mean squared difference was above a threshold ϵ. We used four different values, 0.5, 1, 5 and 10, for the threshold parameter ϵ, and for each value of ϵ we calculated the mean squared errors (MSE) of the unadjusted and adjusted predicted PC scores of the test samples. The sample sizes of the test samples that were finally included in the prediction error estimation for different values of ϵ are shown in the Online Supplement.
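The thresholding and error computation can be illustrated with the sketch below. Several pieces are assumptions made for the example: the record layout, the use of a "reference" score as the target of the prediction, and the form of the adjustment, which is taken here to be division of each predicted score by its estimated shrinkage factor.

```python
def filtered_mse(records, eps, shrinkage=None):
    """Mean squared error over test samples that pass the stability filter.

    records: iterable of (msd, predicted, reference), where msd is the mean
        squared difference of the training scores with and without the test
        sample, and predicted/reference are tuples of PC scores.
    eps: threshold; samples with msd above eps are excluded.
    shrinkage: optional tuple of per-PC shrinkage factors (hypothetical
        adjustment: predicted score divided by the factor).
    """
    total, count = 0.0, 0
    for msd, predicted, reference in records:
        if msd > eps:
            continue
        for k, (pred, ref) in enumerate(zip(predicted, reference)):
            factor = 1.0 if shrinkage is None else shrinkage[k]
            total += (pred / factor - ref) ** 2
            count += 1
    return total / count if count else float("nan")


# Toy records (hypothetical): the second sample is unstable and is dropped at eps = 1.
records = [
    (0.2, (1.0, 2.0), (2.0, 4.0)),
    (3.0, (9.0, 9.0), (0.0, 0.0)),
]
mse_unadjusted = filtered_mse(records, eps=1.0)
mse_adjusted = filtered_mse(records, eps=1.0, shrinkage=(0.5, 0.5))
```

Raising `eps` admits more test samples at the cost of including iterations in which the PC coordinates shifted noticeably.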

Figure 5c shows the estimated MSEs for ϵ = 1. It is clear that both the λ-GSP and d-GSP methods have much smaller MSEs than the SP method. The UHD-based method had MSEs almost identical to those of the SP-based method for all the chromosomes, hence we omitted the UHD-based results in this plot. As expected, the unadjusted predicted PC scores have substantially larger MSE than all the proposed adjustments. The plots are very similar for the other values of ϵ, and they can be found in the Online Supplement.

Figure 6 illustrates the shrinkage-bias adjustment for the PC1 and PC2 scores of an individual based on the markers on chromosome 7. The plot clearly shows that the bias-adjusted PC score based on the SP model is still biased towards zero, whereas the bias-adjusted PC score based on the GSP model is very close to the original sample PC score. We only showed the d-GSP adjusted score in the plot as the d-GSP and λ-GSP adjusted scores were almost equal.

Figure 6:

PC1 vs PC2 plot of the Hapmap III CEU and TSI samples based on chromosome 7. The predicted PC scores for the illustrative individual, and its bias-adjusted PC scores are also presented. Since the d-GSP and the λ-GSP adjusted scores are nearly the same, the λ-GSP adjusted scores are not presented.

7. Conclusions and discussion

In this paper, we investigated the asymptotic properties of PCA under the Generalized Spiked Population model and derived estimators of the population eigenvalues, the angles between the sample and population eigenvectors, and the correlation coefficients between the sample and population PC scores. We also proposed methods to adjust the shrinkage bias in the predicted PC scores. Further, theoretically and using simulation studies, we compared our results with the results developed under the ultra-high dimensional regime [15], and showed that our methods provide more accurate results under the high-dimensional regime even when p is much larger than n. Since the proposed methods do not require the equality of the non-spiked eigenvalues, they can be widely used in high-dimensional biomedical data analysis. We implemented all our algorithms in the R package hdpca.

We note that Mestre [18, 19] proposed an asymptotic setting similar to the generalized spiked population model but with a different assumption on the number of spikes in which the number of spikes increases with the dimension. Under this assumption, he provided asymptotic properties of sample eigenvalues and eigenvectors. However, in many biomedical data, the number of spikes is usually finite as the spikes represent the difference between finitely many underlying subpopulations. Therefore we believe that the generalized spiked population model is more appropriate in such cases.

In some special cases, even though the features exhibit strong local correlation, one can use the spiked population model based methods after some suitable data manipulation. In genome-wide association studies, SNP pruning [1] can be used to remove locally correlated SNPs to satisfy the spiked population model. For example, Lee et al. [14] reported good performance of the spiked population model-based methods with the SNP-pruned Hapmap III dataset. This approach, however, can lead to a considerable loss of information; the SNP-pruning in Hapmap III data removed nearly 90% of the SNPs. Since the proposed approach does not require this additional step, it can use most of the information present in the data.

Acknowledgments

We thank the Editor-in-Chief, Christian Genest, and referees for their helpful comments and suggestions, which led to substantial improvements of the manuscript. The work was supported by NIH Grants R00HL113164 and R01HG008773.

Appendix

Proof of Theorem 1. The first part of the proof follows directly from Result 1 along with the fact that on the domain of the distant spikes, the ψ function is strictly increasing, and hence is left invertible. Since ψ″(α) > 0 for any α > sup ΓH, ψ′(α) is a strictly increasing function for α > sup ΓH. Let Sψ > sup ΓH be a solution of ψ′(α) = 0. Then for any α > sup ΓH, ψ′(α) > 0 if and only if α > Sψ. Therefore the interval (Sψ, ∞) is the domain of the distant spikes, and ψ is a strictly increasing function on this interval. The second part follows from Lemma B. □

Proof of Theorem 2. The proof closely follows the proof of Theorem 2 in [18]. However, contrary to [18], we do not assume that the population LSD contains the generalized spikes. Thus, some of the derivation steps and results are substantially different from [18]. We start the derivation by first noting that the quadratic forms $\hat{\eta}_k$ can be expressed as contour integrals of a special class of Stieltjes transforms of the sample covariance matrix. Let us define, for all $z \in \mathbb{C}^+$,

$$\hat{m}_p(z) = s_1^\top (S_p - zI_p)^{-1} s_2 = \sum_{j=1}^{p} \frac{s_1^\top e_j e_j^\top s_2}{d_j - z},$$

where $s_1$ and $s_2$ are non-random vectors with uniformly bounded norms. Girko [10] and Mestre [17] showed that under the assumption that the population LSD contains the generalized spikes, one has, for all $z \in \mathbb{C}^+$,

$$|\hat{m}_p(z) - m_p(z)| \xrightarrow{a.s.} 0, \tag{A.1}$$

where

$$m_p(z) = s_1^\top \{w(z)\Sigma_p - zI_p\}^{-1} s_2 = \sum_{j=1}^{p} \frac{s_1^\top E_j E_j^\top s_2}{w(z)\lambda_j - z}.$$

The function w(z) is defined as $w(z) = 1 - \gamma - \gamma z b_F(z)$, where $b_F(z) = \int (\tau - z)^{-1}\, dF(\tau)$ is the Stieltjes transform of the sample LSD. It is easy to check, by the same arguments provided in [17], that the result still holds when the generalized spikes are considered lying outside the support of the population LSD.

The functions $\hat{m}_p$, $m_p$ and $b_F$ can be extended to $\mathbb{C}^- = \{z \in \mathbb{C} : \operatorname{Im}(z) < 0\}$ by defining $\hat{m}_p(z) = \hat{m}_p^*(z^*)$, $m_p(z) = m_p^*(z^*)$ and $b_F(z) = b_F^*(z^*)$ for $z \in \mathbb{C}^-$, where $z^*$ is the complex conjugate of z.

With this definition, $|\hat{m}_p(z) - m_p(z)| \xrightarrow{a.s.} 0$ even when $z \in \mathbb{C}^-$. Now $\hat{\eta}_k$ can be expressed as an integral of $\hat{m}_p$, viz.

$$\hat{\eta}_k = \frac{1}{2\pi i} \oint_{\partial\hat{\mathcal{R}}_y^-(k)} \hat{m}_p(z)\, dz,$$

where $i = \sqrt{-1}$, $y > 0$ and $\partial\hat{\mathcal{R}}_y^-(k)$ is the negatively (clockwise) oriented boundary of the rectangle

$$\hat{\mathcal{R}}_y(k) = \{z \in \mathbb{C} : \hat{a}_1 \le \operatorname{Re}(z) \le \hat{a}_2,\ |\operatorname{Im}(z)| \le y\}.$$

The values of $\hat{a}_1$ and $\hat{a}_2$ can be arbitrarily chosen provided that $\hat{\mathcal{R}}_y(k)$ contains only the sample eigenvalue $d_k$ and no other sample eigenvalue. Then the following lemma gives the almost sure limit of $\hat{\eta}_k$.

Lemma A.

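$$\left| \frac{1}{2\pi i} \oint_{\partial\hat{\mathcal{R}}_y^-(k)} \hat{m}_p(z)\, dz - \frac{1}{2\pi i} \oint_{\partial\mathcal{R}_y^-(k)} m_p(z)\, dz \right| \xrightarrow{a.s.} 0,$$

where $y > 0$ and $\partial\mathcal{R}_y^-(k)$ is the negatively (clockwise) oriented boundary of the rectangle

$$\mathcal{R}_y(k) = \{z \in \mathbb{C} : a_1 \le \operatorname{Re}(z) \le a_2,\ |\operatorname{Im}(z)| \le y\}.$$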

The constants a1 and a2 can be arbitrarily chosen so that ψ(λk) ∈ [a1, a2] and [a1, a2] ⊂ ψ(Sψ, ∞), where Sψ > sup ΓH, ψ′(Sψ) = 0. Here, ψ(Sψ, ∞) denotes the image of the interval (Sψ, ∞) under ψ.

Lemma A implies

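$$\left| \hat{\eta}_k - \sum_{j=1}^{p} \left\{ \frac{1}{2\pi i} \oint_{\partial\mathcal{R}_y^-(k)} \frac{dz}{w(z)\lambda_j - z} \right\} s_1^\top E_j E_j^\top s_2 \right| \xrightarrow{a.s.} 0. \tag{A.2}$$

Now we need to evaluate the integral in (A.2) in order to get the almost sure limit of the random variable $\hat{\eta}_k$. First, we extend the ψ function to $\mathcal{R}_y(k)$ as follows. For all $z \in \mathcal{R}_y(k)$,

$$\psi(z) = z\left\{1 + \gamma \int \frac{\lambda}{z - \lambda}\, dH(\lambda)\right\}.$$

According to [16], for all $z \in \mathbb{C}^+$, $b_F(z) = b$ is the unique solution to the equation

$$b = \int \frac{1}{\lambda(1 - \gamma - \gamma z b) - z}\, dH(\lambda) \tag{A.3}$$

in the set $\{b \in \mathbb{C} : \gamma b - (1 - \gamma)/z \in \mathbb{C}^+\}$. It is easy to see that $b_F$ also satisfies (A.3) when $z \in \mathbb{C}^-$. Now we formally define the $f_F$ function introduced in (1). For all $z \in \mathbb{C} \setminus \mathbb{R}$, set

$$f_F(z) = \frac{z}{w(z)} = \frac{z}{1 - \gamma - \gamma z b_F(z)}. \tag{A.4}$$

Then $b_F$ can be expressed in terms of $f_F$ as

$$b_F(z) = \frac{(1 - \gamma) f_F(z) - z}{\gamma z f_F(z)}.$$

By replacing b with $\{(1 - \gamma)f - z\}/(\gamma z f)$ in (A.3), we get

$$f\left\{1 + \gamma \int \frac{\lambda}{f - \lambda}\, dH(\lambda)\right\} = z. \tag{A.5}$$

It is easy to see that $b_F$ is a solution to (A.3) if and only if $f_F$ is a solution to (A.5). Therefore, for all $z \in \mathbb{C}^+$ (similarly for $z \in \mathbb{C}^-$), $f_F(z) = f$ is the unique solution to (A.5) on $\mathbb{C}^+$ (respectively, $\mathbb{C}^-$). This implies $\psi\{f_F(z)\} = z$ for all $z \in \mathcal{R}_y(k) \setminus [a_1, a_2]$.

Now we focus on the case when $z \in \mathbb{R} \setminus \{0\}$. According to [22], we can extend $b_F$ to $\mathbb{R} \setminus \{0\}$ by defining $b_F(z) = \lim_{y \to 0^+} b_F(z + iy)$ for any $z \in \mathbb{R} \setminus \{0\}$. The definition of $f_F$ can also be extended in a similar fashion. In Lemma B we have shown that $f_F$ is the inverse function of ψ on $(S_\psi, \infty)$, and there exists $M_f > \sup \Gamma_F$ for which $\psi(S_\psi, \infty) = (M_f, \infty)$. Thus, $[a_1, a_2] \subset \psi(S_\psi, \infty)$ implies $\psi\{f_F(z)\} = z$ for all $z \in \mathcal{R}_y(k)$. Furthermore, the function ψ is continuous and differentiable on $\mathcal{R}_y(k)$, and the derivative is given by

$$\psi'(z) = 1 - \gamma \int \left(\frac{\lambda}{z - \lambda}\right)^2 dH(\lambda).$$

Then the integral in (A.2) can be expressed in terms of ψ and $f_F$ as follows:

$$\frac{1}{2\pi i} \oint_{\partial\mathcal{R}_y^-(k)} \frac{dz}{w(z)\lambda_j - z} = \frac{1}{2\pi i} \oint_{\partial\mathcal{R}_y^-(k)} \frac{dz}{\frac{z}{f_F(z)}\lambda_j - z} = \frac{1}{2\pi i} \oint_{\partial\mathcal{R}_y^-(k)} \frac{1}{\lambda_j - f_F(z)} \times \frac{f_F(z)}{\psi\{f_F(z)\}}\, dz. \tag{A.6}$$

The integrand in the final expression is holomorphic on $\mathcal{R}_y(k)$ when $j \ne k$ and possesses a simple pole at $\psi(\lambda_k)$ when $j = k$. Therefore, when $j \ne k$ the integral in (A.6) is zero. When $j = k$, applying the residue theorem to the final integral, we get

$$\frac{1}{2\pi i} \oint_{\partial\mathcal{R}_y^-(k)} \frac{dz}{w(z)\lambda_k - z} = \lim_{z \to \psi(\lambda_k)} \frac{\psi(\lambda_k) - z}{\lambda_k - f_F(z)} \times \frac{f_F(z)}{\psi\{f_F(z)\}} = \lim_{z \to \psi(\lambda_k)} \frac{\psi(\lambda_k) - \psi\{f_F(z)\}}{\lambda_k - f_F(z)} \times \frac{f_F(z)}{\psi\{f_F(z)\}} = \frac{\lambda_k \psi'(\lambda_k)}{\psi(\lambda_k)}.$$

This implies

$$\frac{1}{2\pi i} \oint_{\partial\mathcal{R}_y^-(k)} \frac{dz}{w(z)\lambda_j - z} = \begin{cases} \lambda_k \psi'(\lambda_k)/\psi(\lambda_k) & \text{if } j = k, \\ 0 & \text{if } j \ne k, \end{cases}$$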

and the proof of Theorem 2 is complete. □
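As a sanity check on the limit $\lambda_k\psi'(\lambda_k)/\psi(\lambda_k)$, it may help to specialize to an assumed simple case that is not part of the proof: the classical SP model in which all non-spikes equal 1, so that H is a point mass at 1. Under that assumption the quantities used above reduce to closed forms, and the limit takes the familiar spiked-model shape.

```latex
\begin{align*}
\psi(\alpha) &= \alpha\left(1 + \frac{\gamma}{\alpha - 1}\right), &
\psi'(\alpha) &= 1 - \frac{\gamma}{(\alpha - 1)^2}, \\
S_\psi &= 1 + \sqrt{\gamma}, &
\psi(S_\psi) &= (1 + \sqrt{\gamma})^2, \\
\frac{\lambda_k \psi'(\lambda_k)}{\psi(\lambda_k)}
  &= \frac{1 - \gamma/(\lambda_k - 1)^2}{1 + \gamma/(\lambda_k - 1)}, &
\lambda_k &> 1 + \sqrt{\gamma}.
\end{align*}
```

Here $S_\psi$ solves $\psi'(\alpha) = 0$ to the right of the support of H, and the condition $\lambda_k > 1 + \sqrt{\gamma}$ is the distant-spike requirement in this special case.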

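Proof of Lemma A. First, we show that $\hat{a}_1$, $\hat{a}_2$, $a_1$, $a_2$ can be chosen satisfying $\hat{a}_1 \to a_1$ and $\hat{a}_2 \to a_2$. This is possible due to the fact that $d_k \xrightarrow{a.s.} \psi(\lambda_k)$ and $\psi(\lambda_k) \in \psi(S_\psi, \infty) = (M_f, \infty)$, where $M_f > \sup \Gamma_F$. Therefore, we can choose a neighborhood $[a_1, a_2]$ around $\psi(\lambda_k)$ so that $[a_1, a_2] \subset (M_f, \infty)$. Moreover, as $M_f$ is bounded away from the support of the sample LSD F and $d_k \xrightarrow{a.s.} \psi(\lambda_k)$, we can select a neighborhood $[\hat{a}_1, \hat{a}_2]$ around $d_k$ which does not contain any other eigenvalue and for which $\hat{a}_1 \to a_1$, $\hat{a}_2 \to a_2$. Then

$$\left| \frac{1}{2\pi i} \oint_{\partial\hat{\mathcal{R}}_y^-(k)} \hat{m}_p(z)\, dz - \frac{1}{2\pi i} \oint_{\partial\mathcal{R}_y^-(k)} m_p(z)\, dz \right| \le \frac{1}{2\pi} \left\{ \sup_{z \in \partial\hat{\mathcal{R}}_y^-(k) \cup \partial\mathcal{R}_y^-(k)} |\hat{m}_p(z)| \right\} \left( |\hat{a}_1 - a_1| + |\hat{a}_2 - a_2| \right) + \frac{1}{2\pi} \oint_{\partial\mathcal{R}_y^-(k)} |\hat{m}_p(z) - m_p(z)| \times |dz|. \tag{A.7}$$

From the Cauchy-Schwarz inequality, we can obtain the following upper bound for $|\hat{m}_p|$:

$$|\hat{m}_p(z)| \le \|s_1\| \times \|s_2\| / d(z, \Gamma_{F_p}).$$

Here, $d(z, \Gamma_{F_p}) = \inf_{y \in \Gamma_{F_p}} |z - y|$. Since $F_p \to F$ point-wise and $[a_1, a_2]$ is bounded away from $\Gamma_F$, $d(z, \Gamma_{F_p})$ is bounded away from zero with probability 1 for large enough p and n. Therefore $|\hat{m}_p(z)|$ is finite for $z \in \partial\mathcal{R}_y^-(k)$ with probability 1 for large enough p and n. Moreover, since $[\hat{a}_1, \hat{a}_2] \to [a_1, a_2]$, the interval $[\hat{a}_1, \hat{a}_2]$ will eventually be bounded away from $\Gamma_F$. Thus, eventually the upper bound for $|\hat{m}_p(z)|$ will also be finite for $z \in \partial\hat{\mathcal{R}}_y^-(k)$. Therefore, the first term on the right-hand side of (A.7) will go to zero as $\hat{a}_1 \to a_1$, $\hat{a}_2 \to a_2$.

Now, as $\hat{m}_p(z)$ and $m_p(z)$ are holomorphic functions on the compact set $\partial\mathcal{R}_y^-(k)$, we have

$$\sup_{z \in \partial\mathcal{R}_y^-(k)} |\hat{m}_p(z) - m_p(z)| < \infty.$$

Also from (A.1), $|\hat{m}_p(z) - m_p(z)| \xrightarrow{a.s.} 0$ point-wise for all $z \in \mathbb{C} \setminus \mathbb{R}$. Therefore, by Lebesgue’s Dominated Convergence Theorem, the second term on the right-hand side of (A.7) also converges to zero almost surely. This completes the proof of Lemma A. □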

We can show the asymptotic equivalence of the limits derived in Theorem 2 and Result 2 as a direct application of the following lemma.

Lemma B. Suppose assumptions (A)-(D) hold. If λk is a distant spike with multiplicity 1, and dk is the corresponding sample eigenvalue, then

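$$f_F(d_k) \xrightarrow{p} \lambda_k, \qquad \frac{d_k\, g_F(d_k)}{f_F(d_k)} \xrightarrow{p} \psi'(\lambda_k).$$

Proof. We have already established in the proof of Theorem 2 that for all $z \in \mathbb{C}^+$ (similarly for $z \in \mathbb{C}^-$), $f_F(z) = f$ is the unique solution to (A.5) on $\mathbb{C}^+$ (respectively, $\mathbb{C}^-$). When z is restricted to $\mathbb{C} \setminus \mathbb{R}$, using (A.4) and the fact that $b_F(z) = \int (\tau - z)^{-1}\, dF(\tau)$, we can write

$$f_F(z) = \frac{z}{1 + \gamma \int \frac{\tau}{z - \tau}\, dF(\tau)}.$$

Now suppose $z = x \in \mathbb{R} \setminus \{0\}$. Then both Eqs. (A.3) and (A.5) will have multiple roots, both real and complex valued depending on x and H. If we look at (A.5) closely, we can see that for real-valued x, it can be represented as $\psi\{f(x)\} = x$, where the ψ function is as defined in (1). As we have seen in the proof of Theorem 1, ψ is strictly increasing in the interval $(S_\psi, \infty)$, where $S_\psi > \sup \Gamma_H$ and $\psi'(S_\psi) = 0$. Therefore, any real-valued solution f of $\psi\{f(x)\} = x$ in $(S_\psi, \infty)$ has to be the inverse of ψ, which is unique due to the strict monotonicity of ψ on $(S_\psi, \infty)$. Now suppose $\Gamma_F$ is the support of the sample LSD F. We will show that there exists $M_f > \sup \Gamma_F$ such that for any $x > M_f$, the function $f_F$ is real-valued and it is a solution to (A.5) in the interval $(S_\psi, \infty)$. Thus it is also the unique such solution and the inverse of the ψ function in $(S_\psi, \infty)$.

Let $x \in \mathbb{R}$, $x > \sup \Gamma_F$ and $z = x + iy \in \mathbb{C}^+$. Now, as $z \in \mathbb{C}^+$, $f_F(z)$ is the unique solution to (A.5) in $\mathbb{C}^+$. Therefore, if we express $f_F(z)$ as $u(z) + iv(z)$, then $v(z) > 0$. Also, the imaginary part of (A.5) can be written as

$$v(z)\left[1 - \gamma \int \frac{\lambda^2}{\{u(z) - \lambda\}^2 + v(z)^2}\, dH(\lambda)\right] = y.$$

Both v(z) and y being positive implies that

$$1 - \gamma \int \frac{\lambda^2}{\{u(z) - \lambda\}^2 + v(z)^2}\, dH(\lambda) > 0. \tag{A.8}$$

Due to the continuity of $f_F$ on the set $\{z \in \mathbb{C}^+ : z = x + iy,\ x > \sup \Gamma_F\}$,

$$f_F(x) = \lim_{y \to 0^+} \frac{x + iy}{1 + \gamma \int \tau (x + iy - \tau)^{-1}\, dF(\tau)} = \frac{x}{1 + \gamma \int \tau (x - \tau)^{-1}\, dF(\tau)},$$

which is real-valued. Thus $u(z) \to f_F(x)$ and $v(z) \to 0$ as $y \downarrow 0$. Therefore as $y \downarrow 0$, the inequality (A.8) becomes

$$1 - \gamma \int \frac{\lambda^2}{\{f_F(x) - \lambda\}^2}\, dH(\lambda) > 0,$$

which implies $\psi'\{f_F(x)\} > 0$.

We can see that $f_F(x)$ attains zero at $\sup \Gamma_F$ and it is strictly and unboundedly increasing for $x > \sup \Gamma_F$. This ensures the existence of a threshold $M_f > \sup \Gamma_F$ such that the function $f_F$ maps the interval $(M_f, \infty)$ to $(S_\psi, \infty)$. Therefore, $f_F$ and ψ are both strictly increasing, continuous and bijective mappings between the intervals $(M_f, \infty)$ and $(S_\psi, \infty)$. Since $f_F(z)$ is the unique solution to (A.5) in $\mathbb{C}^+$ when $z \in \mathbb{C}^+$, $f_F$ is also a solution to (A.5) in $(S_\psi, \infty)$ when $x > M_f$ due to the continuity of the left-hand side of (A.5) on the set $\{f \in \mathbb{C}^+ : f = u + iv,\ u > S_\psi\}$, which further implies that $f_F$ is the inverse function of ψ on $(S_\psi, \infty)$.

The first part of Lemma B is proved as a corollary to Result 1 as $\psi^{-1} = f_F$ on the domain of distant spikes, i.e., $(S_\psi, \infty)$. For the second part, we first need to derive the expression of $f_F'$, and then derive the expression of ψ′ in terms of $f_F$ and F, viz.

$$f_F'(x) = f_F(x)\{1 + \gamma f_F(x) v_F(x)\}/x; \qquad v_F(x) = \int \frac{\tau}{(x - \tau)^2}\, dF(\tau).$$

For a distant spike $\lambda_k$, using the expression of $f_F'$, we get

$$\frac{\lambda_k \psi'(\lambda_k)}{\psi(\lambda_k)} = \frac{\lambda_k}{\psi(\lambda_k)\, f_F'\{\psi(\lambda_k)\}} = \frac{1}{1 + \gamma f_F\{\psi(\lambda_k)\} \int \frac{\tau}{\{\psi(\lambda_k) - \tau\}^2}\, dF(\tau)} = g_F\{\psi(\lambda_k)\}.$$

As $\psi(\lambda_k) > M_f$, $g_F$ is continuous at $\psi(\lambda_k)$. Since $d_k \xrightarrow{p} \psi(\lambda_k)$,

$$g_F(d_k) \xrightarrow{p} g_F\{\psi(\lambda_k)\} = \lambda_k \psi'(\lambda_k)/\psi(\lambda_k), \qquad d_k\, g_F(d_k)/f_F(d_k) \xrightarrow{p} \psi'(\lambda_k).$$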

This concludes the proof of Lemma B. □
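Lemma B suggests plug-in estimates that depend only on the sample spectrum. The sketch below evaluates $f_F$, $g_F$ and $d_k g_F(d_k)/f_F(d_k)$ when F is replaced by a discrete distribution with equally weighted atoms `tau`. How F should be discretized in practice (for instance, whether spike eigenvalues are dropped and how zero eigenvalues are counted when p > n) is not specified here, so the list `tau` and all function names are assumptions for illustration.

```python
def _mean(values):
    values = list(values)
    return sum(values) / len(values)


def f_F(x, tau, gamma):
    """f_F(x) = x / (1 + gamma * mean(t / (x - t))) over the atoms of F."""
    return x / (1.0 + gamma * _mean(t / (x - t) for t in tau))


def g_F(x, tau, gamma):
    """g_F(x) = 1 / (1 + gamma * f_F(x) * mean(t / (x - t)^2))."""
    v = _mean(t / (x - t) ** 2 for t in tau)
    return 1.0 / (1.0 + gamma * f_F(x, tau, gamma) * v)


def d_based_estimates(d_k, tau, gamma):
    """Return (estimate of lambda_k, estimate of psi'(lambda_k))."""
    f = f_F(d_k, tau, gamma)
    g = g_F(d_k, tau, gamma)
    return f, d_k * g / f


# Toy input (hypothetical): two unit atoms, gamma = 0.5, d_k = 3.
lam_hat, psi_prime_hat = d_based_estimates(3.0, [1.0, 1.0], 0.5)
```

Because $\psi'(\lambda) = 1/f_F'\{\psi(\lambda)\}$ on the domain of distant spikes, the second output should agree with the reciprocal of a numerical derivative of `f_F` at `d_k`, which gives a convenient self-check.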

Proof of Theorem 3. We have

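$$\langle P_k, p_k \rangle^2 = \frac{1}{n^2 \lambda_k d_k} \langle X^\top E_k, X^\top e_k \rangle^2 = \frac{1}{\lambda_k d_k} \left(E_k^\top X X^\top e_k / n\right)^2 = \frac{1}{\lambda_k d_k} \left\{ E_k^\top \left( \sum_{i=1}^{p} d_i e_i e_i^\top \right) e_k \right\}^2 = \frac{d_k}{\lambda_k} \langle e_k, E_k \rangle^2.$$

Using the limits derived in Theorem 2 and Result 1, we deduce that

$$\left| \frac{d_k}{\lambda_k} \langle e_k, E_k \rangle^2 - \psi'(\lambda_k) \right| \xrightarrow{p} 0.$$

Using Lemma B, we conclude that

$$\left| \frac{d_k}{\lambda_k} \langle e_k, E_k \rangle^2 - \frac{d_k\, g_F(d_k)}{f_F(d_k)} \right| \xrightarrow{p} 0.$$

Proof of Theorem 4. We show that the denominator $E(p_{kj}^2)$ converges to $\psi(\lambda_k)$ and the numerator $E(q_k^2)$ converges to $\lambda_k^2/\psi(\lambda_k)$. The proof will be complete using the fact that $d_k \xrightarrow{p} \psi(\lambda_k)$. For the denominator, we have

$$E(p_{kj}^2) = \frac{1}{n} E\left( \sum_{i=1}^{n} p_{ki}^2 \right) = \frac{1}{n} E\left\{ \sum_{i=1}^{n} (x_i^\top e_k)^2 \right\} = E(e_k^\top X X^\top e_k / n) = E\left\{ e_k^\top \left( \sum_{i=1}^{p} d_i e_i e_i^\top \right) e_k \right\} = E(d_k) \to \psi(\lambda_k).$$

For the numerator, we have

$$E(q_k^2) = E\{(x_{new}^\top e_k)^2\} = E[E\{(x_{new}^\top e_k)^2 \mid e_k\}] = E\{\operatorname{var}(x_{new}^\top e_k \mid e_k)\} = E(e_k^\top \Sigma_p e_k).$$

Now, using the notations in the proof of Theorem 2 and Lemma B, we have $b_F(z) = \int (\tau - z)^{-1}\, dF(\tau)$ as the Stieltjes transform of the sample LSD and the function $f_F$ defined as $f_F(z) = z\{1 - \gamma - \gamma z b_F(z)\}^{-1}$. Therefore,

$$b_F(z) = \frac{(1 - \gamma) f_F(z) - z}{\gamma z f_F(z)}.$$

The functions $b_F$ and $f_F$ can be extended to the real axis by defining the extensions as shown in the proof of Lemma B. Thus, for the sample eigenvalue $d_k$ corresponding to the distant spike $\lambda_k$ we have

$$b_F(d_k) = \frac{(1 - \gamma) f_F(d_k) - d_k}{\gamma d_k f_F(d_k)}.$$

According to Theorem 4 in [13], the limit of $e_k^\top \Sigma_p e_k$ is given by $d_k\{1 - \gamma - \gamma d_k b_F(d_k)\}^{-2}$. Replacing the expression of $b_F(d_k)$ in this limit, we get

$$\left| e_k^\top \Sigma_p e_k - f_F^2(d_k)/d_k \right| \xrightarrow{p} 0.$$

Using Result 1 and Lemma B, we have $f_F^2(d_k)/d_k \xrightarrow{p} \lambda_k^2/\psi(\lambda_k)$. Therefore, the limit of the numerator is given by

$$E(q_k^2) = E(e_k^\top \Sigma_p e_k) \to \lambda_k^2/\psi(\lambda_k).$$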

This completes the proof of Theorem 4. □

Footnotes

Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Online Supplement

Additional tables and figures relevant to this paper are provided in the Online Supplement.

References

  • [1] Anderson CA, Pettersson FH, Clarke GM, Cardon LR, Morris AP, Zondervan KT, Data quality control in genetic case-control association studies, Nature Protocols 5 (2010) 1564–1573.
  • [2] Bai ZD, Silverstein JW, No eigenvalues outside the support of the limiting spectral distribution of large-dimensional sample covariance matrices, Ann. Probab. 26 (1998) 316–345.
  • [3] Bai ZD, Yao J, On sample eigenvalues in a generalized spiked population model, J. Multivariate Anal. 106 (2012) 167–177.
  • [4] Baik J, Silverstein JW, Eigenvalues of large sample covariance matrices of spiked population models, J. Multivariate Anal. 97 (2006) 1382–1408.
  • [5] Berkelaar M, et al., lpSolve: Interface to ‘Lp_solve’ v. 5.5 to Solve Linear/Integer Programs, 2015. R package version 5.6.13.
  • [6] Boyd S, Vandenberghe L, Convex Optimization, Cambridge University Press, Cambridge, 2004.
  • [7] Cai TT, Han X, Pan G, Limiting laws for divergent spiked eigenvalues and largest non-spiked eigenvalue of sample covariance matrices, ArXiv e-prints (2017).
  • [8] Ding X, Convergence of sample eigenvectors of spiked population model, Commun. Statist. Theory Methods 44 (2015) 3825–3840.
  • [9] El Karoui N, Spectrum estimation for large dimensional covariance matrices using random matrix theory, Ann. Statist. 36 (2008) 2757–2790.
  • [10] Girko VL, Strong law for the eigenvalues and eigenvectors of empirical covariance matrices, Random Oper. Stoch. Equ. 4 (1996) 176–204.
  • [11] Johnstone IM, On the distribution of the largest eigenvalue in principal components analysis, Ann. Statist. 29 (2001) 295–327.
  • [12] Johnstone IM, Lu AY, On consistency and sparsity for principal components analysis in high dimensions, J. Amer. Statist. Assoc. 104 (2009) 682–693.
  • [13] Ledoit O, Peche S, Eigenvectors of some large sample covariance matrix ensembles, Probab. Theory Related Fields 151 (2010) 233–264.
  • [14] Lee S, Zou F, Wright FA, Convergence and prediction of principal component scores in high-dimensional settings, Ann. Statist. 38 (2010) 3605–3629.
  • [15] Lee S, Zou F, Wright FA, Convergence of sample eigenvalues, eigenvectors, and principal component scores for ultra-high dimensional data, Biometrika 101 (2014) 484–490.
  • [16] Marcenko VA, Pastur LA, Distribution of eigenvalues for some sets of random matrices, Mathematics of the USSR-Sbornik 1 (1967) 457.
  • [17] Mestre X, On the asymptotic behavior of quadratic forms of the resolvent of certain covariance-type matrices, Technical Report, CTTC/RC/2006-001, Centre Tecnologic de Telecomunicacions de Catalunya, Barcelona, Spain, 2006.
  • [18] Mestre X, Improved estimation of eigenvalues and eigenvectors of covariance matrices using their sample estimates, IEEE Trans. Inform. Theory 54 (2008) 5113–5129.
  • [19] Mestre X, On the asymptotic behavior of the sample estimates of eigenvalues and eigenvectors of covariance matrices, IEEE Trans. Signal Process. 56 (2008) 5353–5368.
  • [20] Paul D, Asymptotics of sample eigenstructure for a large dimensional spiked covariance model, Statist. Sinica 17 (2007) 1617–1642.
  • [21] Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D, Principal components analysis corrects for stratification in genome-wide association studies, Nature Genetics 38 (2006) 904–909.
  • [22] Silverstein J, Choi S, Analysis of the limiting spectral distribution of large dimensional random matrices, J. Multivariate Anal. 54 (1995) 295–309.
  • [23] Storey JD, Xiao W, Leek JT, Tompkins RG, Davis RW, Significance analysis of time course microarray experiments, Proc. Nat. Acad. Sci. USA 102 (2005) 12837–12842.
