Asymptotic properties of principal component analysis and shrinkage-bias adjustment under the generalized spiked population model

Rounak Dey; Seunggeun Lee

doi:10.1016/j.jmva.2019.02.007

Author manuscript; available in PMC: 2020 Sep 1.

Published in final edited form as: J Multivar Anal. 2019 Feb 19;173:145–164. doi: 10.1016/j.jmva.2019.02.007

PMCID: PMC7441582 NIHMSID: NIHMS1523560 PMID: 32831421

Abstract

With the development of high-throughput technologies, principal component analysis (PCA) in the high-dimensional regime is of great interest. Most of the existing theoretical and methodological results for high-dimensional PCA are based on the spiked population model in which all the population eigenvalues are equal except for a few large ones. Due to the presence of local correlation among features, however, this assumption may not be satisfied in many real-world datasets. To address this issue, we investigate the asymptotic behavior of PCA under the generalized spiked population model. Based on our theoretical results, we propose a series of methods for the consistent estimation of population eigenvalues, angles between the sample and population eigenvectors, correlation coefficients between the sample and population principal component (PC) scores, and the shrinkage bias adjustment for the predicted PC scores. Using numerical experiments and real data examples from the genetics literature, we show that our methods can greatly reduce bias and improve prediction accuracy.

Keywords: Consistent estimation, High-dimensional data, PC scores, Random matrix

1. Introduction

Principal component analysis (PCA) is a very popular tool for analyzing high-dimensional biomedical data, where the number p of features is often substantially larger than the number n of observations. PCA is widely used to adjust for population stratification in genome-wide association studies [21] and to identify overall expression patterns in transcriptome analysis [23]. However, the asymptotic properties of PCA in high-dimensional data are profoundly different from the properties in low-dimensional (p finite, n → ∞) settings. In high-dimensional settings, the sample eigenvalues and eigenvectors are not consistent estimators of the population eigenvalues and eigenvectors [12, 20], and the predicted principal component (PC) scores based on the sample eigenvectors can be systematically biased toward zero [14].

There has been extensive effort to investigate the asymptotic behavior of PCA in high-dimensional settings. To provide a statistical framework for PCA in these settings, Johnstone [11] introduced a spiked population model, which assumes that all the eigenvalues are equal except for finitely many large ones called the spikes. A spiked population covariance matrix is basically a finite-rank perturbation of a scalar multiple of the identity matrix. A typical example of a spiked population with two spikes is shown in Figure 1a. This two-spike eigenvalue structure arises if the population consists of three sub-populations, and the features are largely independent with equal variances. Under this model, the convergence of sample eigenvalues, eigenvectors and PC scores has been extensively studied [4, 11, 14, 20].

Figure 1: Eigenvalue structures for SP and GSP models.

In many biomedical data, however, the assumption of the equality of non-spiked eigenvalues can be violated due to the presence of local correlation among features. In genome-wide association studies, for example, the genetic variants are locally correlated due to linkage disequilibrium. In gene-expression data, since genes in the same pathway are often expressed together, their expression measurements are often correlated. These local correlations can cause substantial differences in non-spiked eigenvalues. To illustrate this phenomenon, we obtained eigenvalues with an autoregressive within-group correlation structure rather than the independent structure of the previous example. Figure 1b shows that the equality assumption is clearly violated. Thus, if methods developed under the equality assumption are applied to these types of data, we will obtain biased results.

The generalized spiked population model [3] has been proposed to address this problem. The condition that the non-spikes have to be equal is removed in this generalization. In this model, the set of population eigenvalues consists of finitely many large eigenvalues called the generalized spikes, which are well separated from infinitely many small eigenvalues. Although the generalized spiked population model has a great potential to provide more accurate inference in high-dimensional biomedical data, only limited literature is available on the asymptotic properties of PCA under this model and their application to real data. Bai and Yao [3] and Ding [8] provided results regarding convergence of eigenvalues and eigenvectors. However, their work remained largely theoretical. Moreover, to the best of our knowledge, no method has been developed for estimating the correlations between the sample and population PC scores, and adjusting biases in the predicted PC scores under the generalized spiked population model.

In this paper, we systematically investigate the asymptotic behavior of PCA under the generalized spiked population model, and develop methods to estimate the population eigenvalues and adjust for the bias in the predicted PC scores. We first propose two different approaches to consistently estimate the population eigenvalues, the angles between the sample and population eigenvectors, and the correlation coefficients between the sample and population PC scores. We compare these two methods and show the asymptotic equivalence of the estimators across them. Finally, we propose a method to reduce the bias in the predicted PC scores based on the estimated population eigenvalues.

The paper is organized as follows. We begin in Section 2 by providing the definition of the generalized spiked population model and presenting existing theoretical results. We develop our methods to consistently estimate the population spikes in Section 3. In Section 4, we construct consistent estimators of the angles between the sample and population eigenvectors, and the correlation coefficients between the sample and population PC scores. We also propose the bias-reduction technique for the predicted PC scores. Section 5 presents the algorithm [9] to estimate the population limiting spectral distribution and the non-spiked eigenvalues. In Section 6, we present results from simulation studies and an example from the HapMap project to demonstrate the improved performance of our method over the existing one. Finally, we conclude the paper with a discussion.

2. Generalized spiked population model

In order to formally define the generalized spiked population model, we require the concept of spectral distribution. In the random matrix literature, it is natural to associate a probability measure to the set of eigenvalues as the dimension p goes to infinity. More explicitly, if a Hermitian matrix Σ_p has eigenvalues λ₁,…, λ_p, we can define the empirical spectral distribution (ESD) of Σ_p to be H_p based on the probability measure

d H_{p} (x) = \frac{1}{p} \sum_{i = 1}^{p} δ_{λ_{i}} (x),

where δ_λi(x) is unity when x = λ_i, and otherwise zero. Now, for a sequence (Σ_p) of covariance matrices, if the corresponding sequence (H_p) of ESDs converges weakly to a non-random probability distribution H as p → ∞, then we define H as the limiting spectral distribution (LSD) of the sequence (Σ_p).

The generalized spiked population model [3] is defined as follows. Suppose that H_p is the ESD corresponding to the population covariance matrix Σ_p and that it converges weakly to a non-random probability distribution H. Let Γ_H be the support of H and d(x, A) = inf_{y∈A} |x − y| be the distance from a point x to a set A. Then the set of eigenvalues of Σ_p comprises two subsets of eigenvalues, α₁ ≥ ⋯ ≥ α_m and β_p,1 ≥ ⋯ ≥ β_p,p−m, as follows:

there exists δ > 0 such that d(α_i, Γ_H) > δ for all i ∈ {1,…, m}; α₁,…, α_m are called the generalized spikes.
max_{1≤i≤p−m} d(β_p,i, Γ_H) = ϵ_p → 0; β_p,1,…, β_p,p−m are called the non-spikes.

It is obvious from the definition that the generalized spikes are measure zero points of the population LSD. For Johnstone’s spiked population model [11], the population LSD is H = δ_{1} indicating Γ_H = {1}. From the definition above, all eigenvalues larger than 1 are spikes. Hence, Johnstone’s spiked population model is a special case of the generalized spiked population model.

Suppose that the population covariance matrix Σ_p has eigenvalues λ₁ ≥ ⋯ ≥ λ_p, and the sample covariance matrix S_p = X^⊺X/n has eigenvalues d₁ ≥ ⋯ ≥ d_p, where X is an n × p data matrix. Let Σ_p = EΛ_pE^⊺ and S_p = UD_pU^⊺ be the spectral decompositions of Σ_p and S_p, respectively. Further, we will assume the following throughout the paper.

(A) p → ∞, n → ∞, p/n → γ < ∞.
(B) $X = Y Λ_{p}^{1 / 2} E^{⊤}$, where Y is an n × p random matrix with iid elements such that E(Y_ij) = 0, E(|Y_ij|²) = 1, E(|Y_ij|⁴) < ∞.
(C) The population eigenvalues follow the generalized spiked population (GSP) model with m generalized spikes. Let H_p be the population ESD, H be the population LSD, and Γ_H be the support of H. Moreover, the sequence (∥Σ_p∥) of spectral norms is bounded.
(D) The generalized spikes are larger than sup Γ_H.

Even though we will develop our estimation methods based on the asymptotic regime where p/n → γ < ∞, we will discuss the applicability of our methods when p is much larger than n in Section 4.6.

In this paper, we will derive and discuss asymptotic properties of different functions of eigenvalues and angles between the sample and population eigenvectors, both of which are rotation-invariant, i.e., if we rotate X by some p×p orthogonal matrix P, the eigenvalues (both sample and population) and angles between the sample and population eigenvectors will remain the same. The new population covariance matrix ${\tilde{Σ}}_{p} = P^{⊤} Σ_{p} P$ will have the same eigenvalues as Σ_p, and the new sample covariance matrix ${\tilde{S}}_{p} = P^{⊤} S_{p} P$ will also have the same eigenvalues as S_p. The angles between the eigenvectors of ${\tilde{S}}_{p}$ and ${\tilde{Σ}}_{p}$ will be the same as the angles between the eigenvectors of S_p and Σ_p, both given by the elements of the matrix E^⊺U. Therefore, without loss of generality, by using P = E, we can assume the population covariance matrix to be diagonal.

From the Marčenko-Pastur theorem [16], the sample ESD F_p converges weakly to a non-random probability distribution F with support Γ_F. For α ∉ Γ_H, α ≠ 0 and x > 0, we define the following two functions:

ψ (α) = α + γ α \int \frac{λ d H (λ)}{α - λ}, f_{F} (x) = x / {1 + γ \int \frac{τ d F (τ)}{x - τ}} .

(1)

The following result by Bai and Yao [3] provides the almost sure limits of the sample eigenvalues corresponding to the population generalized spikes.

Result 1. Suppose assumptions (A)-(D) hold. Let λ_k be a generalized spike of multiplicity 1 and let d_k be the corresponding sample eigenvalue. Moreover, let ψ′ denote the first derivative of the function ψ. Then,

If ψ′(λ_k) > 0, then the sample eigenvalue d_k converges almost surely to ψ(λ_k), i.e., $| d_{k} - ψ (λ_{k}) | \overset{a . s .}{\to} 0$.
If ψ′(λ_k) ≤ 0, then let (u_k, v_k) ⊂ (sup Γ_H, ∞) be the maximal interval on which ψ′ > 0. The sample eigenvalue d_k converges almost surely to ψ(w), where w is the boundary point of [u_k, v_k] that is nearest to λ_k.

Given that ψ′(α) is a strictly increasing function for α > sup Γ_H, if a generalized spike λ_k is large enough that ψ′(λ_k) > 0, then according to Result 1 the corresponding sample eigenvalue will converge almost surely to ψ(λ_k). However, if the generalized spike lies close enough to the set of non-spikes that ψ′(λ_k) ≤ 0, then the convergence of the corresponding sample eigenvalue is given by the second part of the result. We will call a generalized spike λ_k a “distant spike” if ψ′(λ_k) > 0; otherwise we will call it a “close spike”.
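As a rough illustration of the distant/close distinction, the sketch below evaluates ψ and ψ′ when H is approximated by a discrete distribution with atoms t and weights w. The function names and the discrete representation of H are assumptions made for this example; they are not part of the results quoted above.

```python
import numpy as np

def psi(alpha, t, w, gamma):
    """psi(alpha) = alpha + gamma * alpha * sum_k w_k * t_k / (alpha - t_k)."""
    t, w = np.asarray(t, float), np.asarray(w, float)
    return alpha + gamma * alpha * np.sum(w * t / (alpha - t))

def psi_prime(alpha, t, w, gamma):
    """Derivative of psi: 1 - gamma * sum_k w_k * t_k**2 / (alpha - t_k)**2."""
    t, w = np.asarray(t, float), np.asarray(w, float)
    return 1.0 - gamma * np.sum(w * t**2 / (alpha - t) ** 2)

def is_distant_spike(alpha, t, w, gamma):
    """True when alpha lies above the support of H and psi'(alpha) > 0."""
    above_support = alpha > np.max(t)
    return bool(above_support and psi_prime(alpha, t, w, gamma) > 0)
```

For a degenerate H at 1 (t = [1], w = [1]) and γ = 0.25, this check separates spikes above and below 1 + γ^1/2 = 1.5.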

3. Consistent estimation of the generalized spikes

The following theorem provides two different consistent estimators of the distant spikes.

Theorem 1. Let λ_k be a distant spike of multiplicity 1 and let d_k be the corresponding sample eigenvalue. If the assumptions (A)-(D) hold, then

| ψ^{- 1} (d_{k}) - λ_{k} | \overset{p}{\to} 0,

where ψ⁻¹ is the left inverse of ψ. Also, $| f_{F} (d_{k}) - λ_{k} | \overset{p}{\to} 0$ .

This theorem shows that for any distant spike λ_k we have two consistent estimators ψ⁻¹(d_k) and f_F(d_k). Notice that the function f_F depends only on the sample LSD which can be approximated by the sample ESD. Thus, f_F(d_k) can be approximated directly using the sample eigenvalues. More explicitly, f_F(d_k) can be closely approximated as

f_{F} (d_{k}) \approx d_{k} / {1 + \frac{γ}{p - m} \sum_{i = m + 1}^{p} \frac{d_{i}}{d_{k} - d_{i}}} .

In contrast, the ψ function depends on the population LSD which is unknown. We can estimate the ψ function using the algorithm described in Section 5 and then find the inverse function ψ⁻¹ using a Newton-Raphson type algorithm.
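The sample-eigenvalue approximation of f_F(d_k) can be sketched as follows. This is a minimal illustration that assumes the number of spikes m is known, that the eigenvalues are sorted in decreasing order, and that γ is replaced by p/n; the function name f_F_hat and the 0-based index k are conventions of this example only.

```python
import numpy as np

def f_F_hat(d, k, m, gamma):
    """Approximate f_F(d_k) from the sample eigenvalues.

    d     : sample eigenvalues in decreasing order (length p)
    k     : 0-based index of the spike of interest (k < m)
    m     : number of spikes, assumed known
    gamma : p / n
    """
    d = np.asarray(d, float)
    bulk = d[m:]                            # non-spiked sample eigenvalues
    s = np.mean(bulk / (d[k] - bulk))       # (1/(p-m)) * sum_i d_i / (d_k - d_i)
    return d[k] / (1.0 + gamma * s)
```

With p > n the trailing sample eigenvalues are zero and contribute nothing to the sum, so they can stay in the array.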

4. Consistent estimators of the asymptotic shrinkage in predicting the PC scores

In this section, we investigate the convergence of sample eigenvectors, PC scores, and shrinkage factors in predicting the PC scores. Let e_i and E_i be the ith sample and population eigenvectors, respectively. In addition to assumptions (A)-(D), we further assume that the distant spikes are of multiplicity 1. This assumption is to restrict the dimension of the corresponding eigenspaces to 1, as otherwise the angle between sample and population eigenvectors, or shrinkage in predicted PC scores cannot be well defined.

4.1. Angle between sample and population eigenvectors

We first present the following theorem on the convergence of the quadratic forms of the sample eigenvectors.

Theorem 2. Let λ_k be a distant spike of multiplicity 1, and the assumptions (A)-(D) hold. Consider the quadratic form ${\hat{η}}_{k} = s_{1}^{⊤} e_{k} e_{k}^{⊤} s_{2}$ , where s₁ and s₂ are non-random vectors with uniformly bounded norm for all p. Then

| {\hat{η}}_{k} - η_{k} | \overset{a . s .}{\to} 0,

where $η_{k} = λ_{k} ψ^{'} (λ_{k}) s_{1}^{⊤} E_{k} E_{k}^{⊤} s_{2} / ψ (λ_{k})$ .

Mestre [19] showed similar asymptotic properties of the quadratic forms under the assumption that the number of spikes increases with the dimension. Theorem 2 shows the convergence of the angle between sample and population eigenvectors. Suppose s₁ = s₂ = E_k. Then

{\hat{η}}_{k} = E_{k}^{⊤} e_{k} e_{k}^{⊤} E_{k} = {〈 e_{k}, E_{k} 〉}^{2}, η_{k} = λ_{k} ψ^{'} (λ_{k}) / ψ (λ_{k}) .

Combining them, we can show

| {〈 e_{k}, E_{k} 〉}^{2} - λ_{k} ψ^{'} (λ_{k}) / ψ (λ_{k}) | \overset{a . s .}{\to} 0.

Therefore, {λ_kψ’(λ_k)/ψ(λ_k)}^1/2 is a consistent estimator of the cosine of the angle, i.e., the absolute value of the inner product, between the kth sample and population eigenvectors. In order to obtain this estimator, we first need to estimate the ψ function using the algorithm described in Section 5.

The following result by Ding [8] provides another consistent estimator for the angle between the kth sample and population eigenvectors. The proof of the asymptotic equivalence of these two estimators is given in the Appendix.

Result 2. Let λ_k be a distant spike of multiplicity 1, and d_k be the corresponding sample eigenvalue. Assume that (A)-(D) hold. Define

g_{F} (x) = {1 + γ f_{F} (x) \int \frac{τ}{{(x - τ)}^{2}} d F (τ)}^{- 1} .

Then $| {〈 e_{k}, E_{k} 〉}^{2} - g_{F} (d_{k}) | \overset{p}{\to} 0$ .

Hence g_F (d_k)^1/2 also works as a consistent estimator of |〈e_k, E_k〉|. Since the function g_F depends only on sample LSD, it can be approximated directly using sample eigenvalues. More explicitly, if there are m spikes in the population, the function g_F can be closely approximated as

g_{F} (d_{k}) \approx {1 + \frac{γ f_{F} (d_{k})}{p - m} \sum_{i = m + 1}^{p} \frac{d_{i}}{{(d_{k} - d_{i})}^{2}}}^{- 1} .

The above equation can be used to estimate the angle between the sample and population eigenvectors.
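A minimal sketch of this approximation is given below. It assumes m is known and γ is taken as p/n; the helper name cosine_angle_hat is hypothetical, and the approximation of f_F(d_k) from Section 3 is recomputed inline so that the block is self-contained.

```python
import numpy as np

def cosine_angle_hat(d, k, m, gamma):
    """Estimate |<e_k, E_k>| as g_F(d_k)**0.5 using only sample eigenvalues.

    d is sorted in decreasing order, k is a 0-based spike index (k < m).
    Returns the pair (g_F estimate, cosine estimate).
    """
    d = np.asarray(d, float)
    bulk = d[m:]                                   # d_{m+1}, ..., d_p
    gap = d[k] - bulk
    f_hat = d[k] / (1.0 + gamma * np.mean(bulk / gap))           # f_F(d_k)
    g_hat = 1.0 / (1.0 + gamma * f_hat * np.mean(bulk / gap**2)) # g_F(d_k)
    return g_hat, np.sqrt(g_hat)
```

The returned cosine lies in (0, 1] whenever d_k exceeds all non-spiked sample eigenvalues, since both sums are then non-negative.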

4.2. Correlation between sample and population PC scores

The sample and population PC scores are the projections of the data on the sample and population eigenvectors, respectively. The correlation between them can be perceived as a measure of accuracy of the PCA. The squared correlation can also be interpreted as the proportion of variance in the population PC scores that can be explained by corresponding sample PC scores. The following theorem provides the consistent estimators of the correlation between the sample and population PC scores corresponding to a distant spike.

Theorem 3. Suppose λ_k is a distant spike of multiplicity 1, d_k is the corresponding sample eigenvalue, and the assumptions (A)-(D) hold. Let the normalized kth population PC score be P_k = XE_k/(nλ_k)^1/2 and the normalized kth sample PC score be p_k = Xe_k/(nd_k)^1/2. Then

| {〈 P_{k}, p_{k} 〉}^{2} - ψ^{'} (λ_{k}) | \overset{p}{\to} 0

and

| {〈 P_{k}, p_{k} 〉}^{2} - d_{k} g_{F} (d_{k}) / f_{F} (d_{k}) | \overset{p}{\to} 0,

where the function g_F is as defined in Result 2.

Since P_k and p_k are normalized random vectors, the absolute value of the inner product 〈P_k, p_k〉 is identical to the absolute value of their correlation coefficient. Since correlation is scale invariant, this is also the correlation between kth sample and population PC scores. Therefore we can consider both ψ’(λ_k)^1/2 and {d_kg_F(d_k)/f_F(d_k)}^1/2 to be consistent estimators of the correlation between the kth sample and population PC scores.
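The agreement between the two estimators in Theorem 3 can be seen informally by substituting the limits already established: d_k → ψ(λ_k) (Result 1), f_F(d_k) → λ_k (Theorem 1), and g_F(d_k) → λ_kψ′(λ_k)/ψ(λ_k) (Theorem 2 with Result 2). The display below is a heuristic check of this substitution, not a replacement for the proof.

```latex
\[
\frac{d_k\, g_F(d_k)}{f_F(d_k)}
\;\approx\;
\frac{\psi(\lambda_k)}{\lambda_k}\cdot
\frac{\lambda_k\,\psi'(\lambda_k)}{\psi(\lambda_k)}
\;=\;
\psi'(\lambda_k).
\]
```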

4.3. Asymptotic shrinkage factor

Suppose λ_k is a distant spike. Let the kth sample PC score for the jth observation x_j be $p_{k j} = x_{j}^{⊤} e_{k}$, and the kth predicted PC score for a new observation x_new be $q_{k} = x_{n e w}^{⊤} e_{k}$. Then the quantity $ρ_{k} = {lim}_{p \to \infty} {E (q_{k}^{2}) / E (p_{k j}^{2})}^{1 / 2}$ describes the asymptotic shrinkage in the kth predicted PC score for a new observation. As both p_kj and q_k are centered, i.e., E(p_kj) = E(q_k) = 0, ρ_k represents the limiting ratio of the standard deviations of the predicted PC scores and the sample PC scores. Therefore, if we can estimate ρ_k, then the shrinkage bias in the kth predicted PC scores can be easily adjusted by rescaling the predicted scores by the factor $ρ_{k}^{- 1}$. The following theorem provides the consistent estimator of the asymptotic shrinkage factor ρ_k.

Theorem 4. Suppose λ_k is a distant spike of multiplicity 1, d_k is the corresponding sample eigenvalue, and the assumptions (A)-(D) hold. Let p_kj and q_k be as defined above. Then

| \sqrt{E (q_{k}^{2}) / E (p_{k j}^{2})} - λ_{k} / d_{k} | \overset{p}{\to} 0.

This is a surprising result in which the asymptotic shrinkage factor is expressed as a simple ratio of the population and sample eigenvalues. Recall that we already constructed the consistent estimators for population eigenvalues in the previous sections. Using these results, the asymptotic shrinkage factor ρ_k can be consistently estimated by ${\hat{λ}}_{k} / d_{k}$ , where ${\hat{λ}}_{k}$ is any consistent estimator of λ_k.
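The adjustment itself is a one-line rescaling. The sketch below assumes a consistent estimate of λ_k is already available from either estimator of Section 3; the function names are illustrative only.

```python
import numpy as np

def shrinkage_factor(lam_hat, d_k):
    """Estimated asymptotic shrinkage factor rho_k = lam_hat / d_k."""
    return lam_hat / d_k

def adjust_predicted_scores(q, rho_hat):
    """Rescale predicted PC scores by 1 / rho_hat to offset the shrinkage."""
    return np.asarray(q, float) / rho_hat
```

Because the estimated spike is typically smaller than d_k, the factor is below 1 and the adjusted scores are stretched away from zero.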

4.4. Comparison of the two different estimators

For each of the quantities discussed above, we proposed two asymptotically equivalent estimators. In terms of practical applications, they have their own advantages and disadvantages. One of them can be approximated directly based only on the sample eigenvalues, while the other one requires estimating the LSD of the population eigenvalues to obtain the ψ function. For ease of discourse, we will call the former the “d-estimator” and the latter the “λ-estimator”.

If the number of spikes is known, computing the d-estimator is more efficient than computing the λ-estimator, as it does not involve estimating the population LSD. However, by estimating the population LSD, the λ-estimation procedure can verify whether an estimated eigenvalue is actually a distant spike by checking if ψ’ > 0. Thus it can be used to estimate the number of distant spikes when it is unknown; see Section 5. In contrast, the d-estimation procedure provides no information on the population LSD and thus cannot distinguish among distant spikes, close spikes and non-spikes.

To summarize, when the number of spikes is known, or when we only want to estimate a few of the largest eigenvalues that are known to be distant spikes, the d-estimation procedure has the advantage of faster computation. The λ-estimation procedure is more useful when the number of spikes is unknown or the distribution of the non-spikes is of interest.

4.5. Comparison of the Generalized Spiked Population (GSP) model and the Spiked Population (SP) model

As mentioned before, the SP model [11] is a special case of the GSP model. It is easy to verify that when the population eigenvalues follow the SP model, our consistent estimators for the spiked eigenvalues, the angles between the eigenvectors, the correlation coefficients between the PC scores and the shrinkage factors conform to the consistent estimators derived in [4, 14, 20]. For an SP model where all the non-spikes are equal to 1, the LSD H is a degenerate distribution at 1, and

ψ (α) = α {1 + γ / (α - 1)}, ψ^{'} (α) = 1 - γ / {(α - 1)}^{2} .

Now, ψ’(α) > 0 if and only if α > 1 + γ^1/2. If α > 1 + γ^1/2 and d is the corresponding sample eigenvalue, then the consistent estimator of α is given by ψ⁻¹ (d), and

\frac{α ψ^{'} (α)}{ψ (α)} = \frac{1 - γ / {(α - 1)}^{2}}{1 + γ / (α - 1)}, \frac{α}{ψ (α)} = \frac{α - 1}{α + γ - 1},

which show that all our results match with the results from [14].
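In this special case ψ⁻¹ has a closed form, which gives a useful sanity check for numerical inversion. The short derivation below is our own worked step under the SP assumption with unit non-spikes: setting ψ(α) = d and multiplying through by (α − 1) yields a quadratic, and the larger root is the one exceeding 1 + γ^1/2 (the other root equals d/α = 1 + γ/(α − 1), which is below 1 + γ^1/2).

```latex
\begin{align*}
\alpha\Big\{1 + \frac{\gamma}{\alpha - 1}\Big\} = d
\;&\Longleftrightarrow\;
\alpha^{2} - (d + 1 - \gamma)\,\alpha + d = 0, \\
\psi^{-1}(d) \;&=\; \frac{(d + 1 - \gamma) + \sqrt{(d + 1 - \gamma)^{2} - 4d}}{2},
\qquad d > (1 + \sqrt{\gamma})^{2}.
\end{align*}
```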

It is of interest to investigate how closely methods developed under the SP model can approximate the consistent estimators for the distant spikes when the population eigenvalues actually follow a GSP model. Suppose the population eigenvalues λ₁ ≥ ⋯ ≥ λ_p follow the GSP model with m distant spikes. The sample eigenvalues are d₁ ≥ ⋯ ≥ d_p. Let λ_k be a distant spike with multiplicity 1, and let d_k be the corresponding sample eigenvalue. Then according to Result 1, d_k → ψ(λ_k) almost surely. From the definition of ψ,

ψ (λ_{k}) = λ_{k} {1 + γ \int \frac{λ}{λ_{k} - λ} d H (λ)} = λ_{k} + γ \int \frac{λ}{1 - λ / λ_{k}} d H (λ) .

If H is almost degenerate, i.e., the non-spikes are nearly identical, then

ψ (λ_{k}) \approx λ_{k} + \frac{γ \bar{λ}}{1 - \bar{λ} / λ_{k}},

(2)

where $\bar{λ} = \int λ d H (λ)$ is the mean of the population LSD which can be closely approximated by the mean of the nonspikes. By contrast, if the spike λ_k is very large compared to all the non-spikes such that λ/λ_k ≈ 0 for any λ ∈ Γ_H, then

ψ (λ_{k}) \approx λ_{k} + γ \bar{λ} .

Now, suppose that instead of using the GSP assumption, we use the SP assumption to estimate the distant spikes. We assume that under the SP model, the population covariance matrix is scaled by a factor ζ and the population eigenvalues are β₁ ≥⋯≥ β_m > ζ = ⋯ = ζ. If β_k is the population eigenvalue corresponding to d_k, then d_k → ψ(β_k) almost surely, where

ψ (β_{k}) = β_{k} (1 + γ \frac{ζ}{β_{k} - ζ}) = β_{k} + \frac{γ ζ}{1 - ζ / β_{k}} .

Here ζ is estimated as the mean of the non-spikes as they are all assumed to be equal to ζ. Notice that this expression is approximately equal to the expression in (2) with β_k = λ_k and $ζ = \bar{λ}$. Therefore, the asymptotic limits of d_k under the GSP and SP models are approximately equal when the non-spikes are nearly identical. Similarly, when the spike β_k is very large compared to all the non-spikes such that ζ/β_k ≈ 0, then ψ(β_k) ≈ β_k + γζ. In this case also, the asymptotic limits of d_k under the GSP and SP models are approximately equal with β_k = λ_k and $ζ = \bar{λ}$. Therefore, if a generalized spike is very far away from the support of the population LSD, then the estimate of the spike based on an SP model will closely approximate the estimate based on a GSP model. However, the SP model will provide potentially biased estimates if the non-spikes are not similar and the ratio between the largest non-spike and the spike of interest is substantially larger than zero.
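The size of this bias can be explored numerically. The sketch below is an illustration under assumed inputs: H is taken to be a two-point distribution, the limiting sample eigenvalue is computed from the GSP form of ψ, and the SP-based estimate is obtained by solving β + γζβ/(β − ζ) = d for its larger root. The function names and the example distributions are hypothetical.

```python
import numpy as np

def psi_gsp(lam, t, w, gamma):
    """GSP limit of the sample eigenvalue for a discrete H (atoms t, weights w)."""
    t, w = np.asarray(t, float), np.asarray(w, float)
    return lam + gamma * lam * np.sum(w * t / (lam - t))

def sp_spike_estimate(d, zeta, gamma):
    """Spike estimate under the SP model with all non-spikes equal to zeta.

    Solves beta**2 - (d + zeta - gamma*zeta)*beta + d*zeta = 0 for the larger root.
    """
    b = d + zeta - gamma * zeta
    disc = b * b - 4.0 * d * zeta
    if disc < 0:
        raise ValueError("d is too small to correspond to a distant spike under SP")
    return 0.5 * (b + np.sqrt(disc))

# Spread-out non-spikes (mean 1): the SP-based estimate overshoots the spike at 4.
d_limit = psi_gsp(4.0, [0.5, 1.5], [0.5, 0.5], gamma=1.0)
biased = sp_spike_estimate(d_limit, zeta=1.0, gamma=1.0)

# Nearly identical non-spikes: the SP-based estimate is close to 4.
d_limit_tight = psi_gsp(4.0, [0.99, 1.01], [0.5, 0.5], gamma=1.0)
nearly_exact = sp_spike_estimate(d_limit_tight, zeta=1.0, gamma=1.0)
```

In this toy setting the first estimate is roughly 4.17 against a true spike of 4, while the second differs from 4 only in the fourth decimal place, in line with the discussion above.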

4.6. Comparison with ultra high-dimensional regime-based results when p/n is large

Our methods are developed under the high-dimensional regime p/n → γ < ∞, and there is no theoretical guarantee that they apply in the ultra-high dimensional (UHD) regime where p/n → ∞. However, in real-world applications we often only have data with large p and large n, and the relative rate of their asymptotic divergence is unknown. Therefore, we do not know whether the true asymptotic regime is high-dimensional (p/n → γ < ∞) or ultra high-dimensional (p/n → ∞).

Suppose that the true asymptotic regime is high-dimensional with γ finite but large compared to n, and the eigenvalues follow the GSP model. In such situations, we can either correctly assume the high-dimensional regime and apply the results discussed in this paper, or we can falsely assume the ultra high-dimensional asymptotic regime and employ the theoretical results derived under this regime [15]. In this section, we will investigate whether it is prudent to assume the UHD regime in such situations. In other words, we will try to answer how large γ can be considered to be diverging to infinity for practical applications.

We first show that for large enough γ, the theoretical results based on the UHD regime become nearly identical to the results under the correctly assumed GSP model (under the high-dimensional regime). The UHD-based results presented in [15] require weaker conditions for the non-spiked eigenvalues than those for the spiked population model. Instead of assuming that the non-spiked eigenvalues are the same, Lee et al. [15] assume certain conditions on the moments of the non-spiked eigenvalues. Since the population LSD has a finite support and all of its central moments are finite, the condition on their moments, i.e., Condition 2 in [15], is satisfied with an additional assumption that n³/p² = o(1). Without loss of generality, we assume that the mean of the non-spikes is unity. Then, under the UHD regime,

d / λ \overset{p}{\to} γ / λ + 1 when λ \geq O (γ), d / γ \overset{p}{\to} 1 when λ = o (γ),

(3)

where λ and d are a spiked population eigenvalue and its corresponding sample eigenvalue, respectively. Here λ ≥ O(γ) means λ/γ is bounded away from zero, and λ = o(γ) means λ/γ → 0. Lee et al. [15] also showed the convergence of sample eigenvectors and PC scores.

Alternatively, under the GSP model, d → ψ(λ) when λ is a distant spike. From Theorem 1, a distant spike λ must satisfy

1 - γ \int \frac{x^{2}}{{(λ - x)}^{2}} d H (x) > 0,

where H is the population LSD. Since f_λ(x) = x²(λ − x)⁻² is a continuous function for λ > sup Γ_H and x ∈ Γ_H, where Γ_H is the support of H, there exists x* ∈ (inf Γ_H, sup Γ_H) such that ∫x²(λ − x)⁻² dH(x) = x^*2 (λ – x*)⁻². Then

1 - γ x^{* 2} / {(λ - x^{*})}^{2} > 0,

which implies $λ > x^{*} + x^{*} \sqrt{γ}$. Thus, for any $λ > x^{*} + x^{*} \sqrt{γ}$, the difference $d / λ - ψ (λ) / λ$ converges to zero.
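The step from the inequality to the bound on λ is short; it is spelled out here for completeness, using only λ > sup Γ_H ≥ x* and x* > 0 (so that λ − x* is positive and the square root preserves the inequality).

```latex
\begin{align*}
1 - \frac{\gamma\, x^{*2}}{(\lambda - x^{*})^{2}} > 0
\;&\Longleftrightarrow\;
(\lambda - x^{*})^{2} > \gamma\, x^{*2} \\
\;&\Longleftrightarrow\;
\lambda - x^{*} > x^{*}\sqrt{\gamma}
\;\Longleftrightarrow\;
\lambda > x^{*}\,(1 + \sqrt{\gamma}).
\end{align*}
```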

Now, under the true asymptotic regime (high-dimensional) λ and γ are both finite and non-zero, and thus λ = O(γ). However, under the falsely assumed UHD regime, one can further assume λ ≥ O(γ) or λ = o(γ) depending on whether λ is large or small compared to γ. If one assumes λ ≥ O(γ), then the difference between the convergence of d/λ from the two models is

ψ (λ) / λ - γ / λ - 1 = \frac{γ}{λ} {\int \frac{x}{1 - x / λ} d H (x) - 1} .

(4)

Since γ/λ = O(1) and ∫ x (1 – x/λ)⁻¹dH(x) − 1 = O(1/λ) as the mean of the non-spikes is unity, (4) becomes almost identical to zero when λ is sufficiently large.

Now, suppose one assumes λ = o(γ). Let λ ≃ a + bγ^k for some finite a, b and k ∈ [1/2, 1). Then, the difference between our result and the UHD result is

| ψ (λ) / γ - 1 | = | λ / γ + λ \int \frac{x}{λ - x} d H (x) - 1 | = O (γ^{k - 1}) .

(5)

Thus, in this case also (5) becomes almost identical to zero when γ is sufficiently large. We can also show the similar results for eigenvectors and PC scores.

Although both GSP and UHD eventually provide nearly identical results when γ is sufficiently large, the GSP model can provide substantially better estimates. The difference can be large when λ is small compared to γ, i.e., k < 1, since the difference in (5) is of the order O(γ^{k−1}). The difference will be at least as large as $O (1 / \sqrt{γ})$ in such cases. In simulation studies, we show this numerically.

One important thing to note here is that the results under the ultra high-dimensional regime do not provide any consistent estimators for the spiked eigenvalues, although there can still be methods to estimate shrinkage. When the spike λ = o(γ), the limiting behavior of the corresponding sample eigenvalue does not depend on λ, and when λ ≥ O(γ), $d / λ \overset{p}{\to} γ / λ + 1$ does not imply that d − γ is a consistent estimator of λ, because λ itself is divergent.

In the simulation studies and real data applications, we further evaluated the performance of d − γ as an estimator. Lee et al. [15] also provided an estimator of the asymptotic shrinkage which we have evaluated in the simulation studies. A recent paper by Cai et al. [7] provided limiting laws for spiked eigenvalues and corresponding eigenvectors when the spikes are divergent, under more general assumptions than the ultra high-dimensional assumptions of [15]. In terms of eigenvalue estimation, this model will still have the problem of the spikes not being estimable. However, it can be an interesting future research direction to find estimators of the shrinkage under this model.

5. Estimation of the population LSD

The λ-estimators rely on ψ, which is a function of the unknown population LSD H. To use the λ-estimators, it is thus necessary to estimate H. Using the Stieltjes transformation and the Marčenko-Pastur theorem, El Karoui [9] developed a general algorithm to estimate the population LSD from the sample ESD, F_p. We propose to use El Karoui’s method to estimate the population LSD H and then use it to estimate ψ.
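Once a discrete estimate of H is available (atoms t with weights w), ψ can be evaluated directly and inverted numerically. The sketch below uses a plain Newton-Raphson iteration started at the sample eigenvalue d; the starting point, stopping rule and function name are assumptions of this example. The start is reasonable because ψ(α) > α and ψ is convex above the support of H, so the iterates decrease towards the root from the right.

```python
import numpy as np

def psi_inverse(d, t, w, gamma, tol=1e-12, max_iter=200):
    """Solve psi(alpha) = d for alpha above max(t), with H = sum_k w_k * delta_{t_k}."""
    t, w = np.asarray(t, float), np.asarray(w, float)
    alpha = float(d)                     # start to the right of the root
    for _ in range(max_iter):
        value = alpha + gamma * alpha * np.sum(w * t / (alpha - t))   # psi(alpha)
        slope = 1.0 - gamma * np.sum(w * t**2 / (alpha - t) ** 2)     # psi'(alpha)
        if slope <= 0:
            raise ValueError("psi' <= 0: d does not correspond to a distant spike")
        step = (value - d) / slope
        alpha -= step
        if abs(step) < tol * max(1.0, abs(alpha)):
            return alpha
    raise RuntimeError("Newton iteration did not converge")
```

The sign check on the slope doubles as the distant-spike test: if ψ′ is not positive along the way, the input should not be treated as a distant spike.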

5.1. El Karoui’s algorithm

Suppose $v_{F_{p}}$ is the Stieltjes transformation of the set of eigenvalues in the sample covariance matrix in which

v_{F_{p}} (z) = \frac{1}{n} \sum_{i = 1}^{n} \frac{1}{d_{i} - z}

for any $z \in ℂ^{+}$, where $ℂ^{+} = \{x \in ℂ : \mathrm{Im} (x) > 0\}$. According to the Marčenko-Pastur theorem [16], when assumptions (A)-(D) hold, $v_{F_{p}}$ converges point-wise almost surely to a non-random limit v_F, which uniquely satisfies the following equation:

v_{F} (z) = - {z - γ \int \frac{λ}{1 + λ v_{F} (z)} d H (λ)}^{- 1} .
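For a known discrete H, this equation can be solved numerically, which is handy for checking an implementation against synthetic inputs. The sketch below uses a plain fixed-point iteration; that choice is an assumption made for simplicity (it tends to behave well when Im(z) is not too small, but no convergence guarantee is claimed here), and the function name is hypothetical.

```python
import numpy as np

def solve_mp_equation(z, t, w, gamma, max_iter=2000, tol=1e-13):
    """Iterate v <- -1 / (z - gamma * sum_k w_k t_k / (1 + t_k v)) for Im(z) > 0."""
    t, w = np.asarray(t, float), np.asarray(w, float)
    z = complex(z)
    v = -1.0 / z                          # lies in the upper half-plane when z does
    for _ in range(max_iter):
        v_new = -1.0 / (z - gamma * np.sum(w * t / (1.0 + t * v)))
        if abs(v_new - v) < tol:
            return v_new
        v = v_new
    return v
```

Each update maps the upper half-plane into itself, so the iterates keep a positive imaginary part as required of a Stieltjes transform.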

El Karoui’s method first calculates $v_{F_{p}}$ for a grid of values z₁,…, z_J, and then finds

\hat{H} = \arg\min_{H} L [{\frac{1}{v_{F_{p}} (z_{j})} + z_{j} - \frac{p}{n} \int \frac{λ}{1 + λ v_{F_{p}} (z_{j})} d H (λ)}_{j = 1}^{J}],

where L is any pre-defined convex loss function. In order to approximate the integral inside the loss function, the algorithm discretizes H as

d H (λ) ≃ \sum_{k = 1}^{K} w_{k} δ_{t_{k}} (λ),

where $δ_{t_{k}} (λ) = 1$ if λ = t_k and 0 otherwise, w₁ + ⋯ + w_K = 1 with w_k > 0 for all k ∈ {1,…, K}, and t₁,…, t_K is a grid of points on the support of H. This is basically approximating H by a discrete distribution with support t₁,…, t_K. Then the integral is approximated by

\int \frac{λ}{1 + λ v_{F_{p}} (z_{j})} d H (λ) ≃ \sum_{k = 1}^{K} w_{k} \frac{t_{k}}{1 + t_{k} v_{F_{p}} (z_{j})},

and the minimization problem transforms into

\hat{H} = \arg\min_{H} L [{\frac{1}{v_{F_{p}} (z_{j})} + z_{j} - \frac{p}{n} \sum_{k = 1}^{K} w_{k} \frac{t_{k}}{1 + t_{k} v_{F_{p}} (z_{j})}}_{j = 1}^{J}] .

(6)

El Karoui [9] has shown the weak convergence of $\hat{H}$ to H, i.e., $\hat{H} \to H$ .
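The residuals entering the loss in (6) are simple to evaluate for a candidate weight vector. The following is a minimal sketch, not the implementation of [9]: the function names are hypothetical, and for illustration the values $v_{F_{p}} (z_{j})$ are replaced by the exact limit for a population LSD that is a point mass at 1 (an assumption of this example), so that the correct weights give residuals of zero.

```python
import cmath


def discretized_residuals(v_vals, z_vals, gamma, t_grid, weights):
    """e_j = 1/v_j + z_j - gamma * sum_k w_k t_k / (1 + t_k v_j), with gamma = p/n."""
    res = []
    for v, z in zip(v_vals, z_vals):
        integral = sum(w * t / (1.0 + t * v) for w, t in zip(weights, t_grid))
        res.append(1.0 / v + z - gamma * integral)
    return res


def v_point_mass(z, gamma):
    """Limit v_F(z) when H is a point mass at 1 (illustrative assumption)."""
    b = z + 1.0 - gamma
    disc = cmath.sqrt(b * b - 4.0 * z)
    roots = ((-b + disc) / (2.0 * z), (-b - disc) / (2.0 * z))
    return max(roots, key=lambda v: v.imag)


gamma = 10.0
z_grid = [complex(x, 1.0) for x in (8.0, 12.0, 16.0)]
v_grid = [v_point_mass(z, gamma) for z in z_grid]
t_grid = [0.5, 1.0, 2.0]

# All mass on t = 1 reproduces the assumed H; all mass on t = 0.5 does not.
e_true = discretized_residuals(v_grid, z_grid, gamma, t_grid, [0.0, 1.0, 0.0])
e_wrong = discretized_residuals(v_grid, z_grid, gamma, t_grid, [1.0, 0.0, 0.0])
```

With real data, `v_grid` would instead hold the empirical values computed from the sample eigenvalues, and the weights would be chosen by minimizing the loss.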

Some examples of the convex loss function L are

L_∞(e₁,…, e_J) = max_j max{|Re(e_j)|, |Im(e_j)|};
L₁(e₁,…, e_J) = |e₁| + ⋯ + |e_J|;
L₂(e₁,…, e_J) = |e₁|² + ⋯ + |e_J|².
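The three example losses translate directly into code. This is a small sketch operating on a list of complex residuals; the function names are hypothetical.

```python
def loss_inf(e):
    """Largest absolute real or imaginary part among the residuals."""
    return max(max(abs(x.real), abs(x.imag)) for x in e)


def loss_1(e):
    """Sum of the moduli of the residuals."""
    return sum(abs(x) for x in e)


def loss_2(e):
    """Sum of the squared moduli of the residuals."""
    return sum(abs(x) ** 2 for x in e)


example = [complex(3.0, 4.0), complex(0.0, -1.0)]
```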

For the convex loss functions described above, the estimation of H in (6) reduces to a convex optimization problem [6]. El Karoui also provided a translation of this problem into a linear programming (LP) problem when the L_∞ loss function is used. Further details can be found in [9].
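To make the LP connection concrete, the L_∞ version of (6) can be written with one auxiliary variable u bounding every real and imaginary part. The display below is the standard epigraph reformulation of a minimax problem, given as a sketch and not necessarily the exact program stated in [9]; the positivity constraint is relaxed to w_k ≥ 0 here, since an LP requires closed constraints.

```latex
\begin{aligned}
\min_{u,\,w_1,\dots,w_K}\quad & u \\
\text{subject to}\quad
& -u \le \operatorname{Re}\{e_j(w)\} \le u, \qquad j = 1,\dots,J, \\
& -u \le \operatorname{Im}\{e_j(w)\} \le u, \qquad j = 1,\dots,J, \\
& \sum_{k=1}^{K} w_k = 1, \qquad w_k \ge 0, \quad k = 1,\dots,K, \\
\text{where}\quad
& e_j(w) = \frac{1}{v_{F_p}(z_j)} + z_j - \frac{p}{n}\sum_{k=1}^{K} w_k\,\frac{t_k}{1 + t_k v_{F_p}(z_j)} .
\end{aligned}
```

Each e_j(w) is affine in the weights, so all constraints are linear in (u, w₁,…, w_K).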

5.2. Implementing El Karoui’s algorithm when the number of spikes is known

Since the generalized spikes fall outside the support of the population LSD, El Karoui’s algorithm cannot be directly applied to estimate the spikes. Furthermore, Bai and Silverstein [2] showed that the probability of a sample eigenvalue falling outside the support of the sample LSD will go to zero as p increases, which implies that the sample eigenvalues corresponding to the population generalized spikes will be measure zero points in the sample LSD. Since the spikes behave like measure zero points (or outliers) when we are concerned about estimating the population LSD, we can exclude the sample eigenvalues corresponding to the population generalized spikes while calculating $v_{F_{p}}$ and that will lead to a more robust estimation of H. Therefore, we will apply El Karoui’s algorithm in the following way:

(I) Suppose the population covariance matrix possesses m generalized spikes. Exclude the top m sample eigenvalues when calculating $v_{F_{p}}$ , viz.
$v_{F_{p}} (z) = \frac{1}{n - m} \sum_{i = m + 1}^{n} \frac{1}{d_{i} - z} .$
(II) Apply El Karoui’s algorithm to obtain $\hat{H}$ . Additionally, if it is reasonable to assume that the true population LSD is a continuous or piecewise continuous distribution function, a suitable kernel smoothing algorithm can be applied to $\hat{H}$ to obtain a smoother approximation of H.
(III) Take the quantiles of $\hat{H}$ as the estimators of the non-spikes.
(IV) Suppose ${\hat{λ}}_{m + 1}, \dots, {\hat{λ}}_{p}$ are the estimated non-spikes. Then the ψ function is estimated by
$\hat{ψ} (α) = α + \frac{γ α}{p - m} \sum_{i = m + 1}^{p} \frac{{\hat{λ}}_{i}}{α - {\hat{λ}}_{i}} .$
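The first and last of these steps can be sketched in a few lines, assuming the estimated non-spikes are already available from El Karoui’s algorithm. The function names are hypothetical, and the input to `trimmed_stieltjes` is taken to be the list of sample eigenvalues d₁,…, d_n.

```python
def trimmed_stieltjes(sample_eigs, m, z):
    """Empirical transform with the m largest sample eigenvalues left out."""
    kept = sorted(sample_eigs, reverse=True)[m:]
    return sum(1.0 / (d - z) for d in kept) / len(kept)


def psi_hat(alpha, nonspikes, gamma):
    """alpha + gamma * alpha / (p - m) * sum_i lam_i / (alpha - lam_i).

    `nonspikes` holds the p - m estimated non-spiked eigenvalues.
    """
    s = sum(lam / (alpha - lam) for lam in nonspikes)
    return alpha + gamma * alpha * s / len(nonspikes)


# Illustrative check: if every non-spike equals 1, psi(alpha) = alpha + gamma * alpha / (alpha - 1).
value = psi_hat(5.0, [1.0] * 100, gamma=10.0)
```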

Due to the weak convergence $\hat{H} \to H$ , $\hat{ψ}$ will also converge to ψ point-wise. Thus, all the estimates provided in Sections 3 and 4 will still be consistent if we replace ψ with $\hat{ψ}$ .

The computationally demanding part of El Karoui’s algorithm is solving the LP problem, for which many fast algorithms are available that run in time polynomial in K, the number of weights w₁,…, w_K. An important property of El Karoui’s algorithm is that the complexity of the LP step does not depend on p, which is especially useful for estimating the LSDs of large covariance matrices. In our implementation, we used the lpSolve R package [5] to solve the LP problem.

5.3. Estimating the number of spikes

Our application of El Karoui’s algorithm to the GSP model depends on the number of spikes m, which is usually unknown. If we have some knowledge of the underlying structure of the data, we can use it to estimate m roughly. Suppose we know that the data are coming from a mixture of K subpopulations, and within each subpopulation the observations are iid $N (μ_{k}, Σ)$ , where μ_k represents the mean for the kth subpopulation, and Σ is the common within-group population covariance matrix. Then, as the spikes represent the between-group differences, the number of spikes should be the same as the rank of the between-group covariance matrix which is K − 1. However in real data, it is often hard to accurately assess the number of such homogeneous subpopulations. In those cases, we can use the following algorithm to estimate m.

(I) Start with a reasonable finite upper bound m_max of the number of spikes. The upper bound can be selected based on prior information on the subpopulations, or by examining the sample eigenvalues. Set m = m_max.
(II) Use El Karoui’s algorithm to estimate the population LSD and the non-spikes. Suppose the estimated non-spikes are ${\hat{λ}}_{m + 1} \geq \dots \geq {\hat{λ}}_{p}$ , and the ψ function is estimated by
$\hat{ψ} (α) = α + \frac{γ α}{p - m} \sum_{i = m + 1}^{p} \frac{{\hat{λ}}_{i}}{α - {\hat{λ}}_{i}} .$
(III) Find $S_{ψ} > {\hat{λ}}_{m + 1}$ using the Newton-Raphson algorithm such that
${\hat{ψ}}^{'} (S_{ψ}) = 1 - \frac{γ}{p - m} \sum_{i = m + 1}^{p} {(\frac{{\hat{λ}}_{i}}{S_{ψ} - {\hat{λ}}_{i}})}^{2} = 0.$
(IV) Since any distant spike must be larger than S_ψ, and $\hat{ψ}$ , ${\hat{ψ}}^{'}$ are both continuous and strictly increasing functions on (S_ψ, ∞), the equation $\hat{ψ} (λ) - d_{k} = 0$ has a root in (S_ψ, ∞) if and only if $\hat{ψ} (S_{ψ}) - d_{k} < 0$ . Therefore, find the smallest index i* ∈ {1,…, m} such that $d_{i^{*}} \leq \hat{ψ} (S_{ψ})$ . If all d₁,…, d_m are larger than $\hat{ψ} (S_{ψ})$ , then stop and select m as the number of distant spikes. Otherwise, set m = i* − 1 and repeat steps (II)-(IV).
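The iteration above can be sketched as follows. This is an illustration under stated assumptions, not the hdpca implementation: `estimate_nonspikes` is a hypothetical callback standing in for the El Karoui step, all estimated non-spikes are assumed positive, and the Newton-Raphson starting point is a choice made here so that the derivative is negative at the start.

```python
import math


def psi_hat(alpha, lam, gamma):
    return alpha + gamma * alpha * sum(l / (alpha - l) for l in lam) / len(lam)


def psi_hat_prime(s, lam, gamma):
    return 1.0 - gamma * sum((l / (s - l)) ** 2 for l in lam) / len(lam)


def find_s_psi(lam, gamma, tol=1e-12, max_iter=500):
    """Newton-Raphson for psi_hat'(s) = 0 on s > max(lam)."""
    # Start just above the largest non-spike, where psi_hat' is negative.
    s = max(lam) * (1.0 + 0.5 * math.sqrt(gamma / len(lam)))
    for _ in range(max_iter):
        g = psi_hat_prime(s, lam, gamma)
        dg = 2.0 * gamma * sum(l * l / (s - l) ** 3 for l in lam) / len(lam)
        step = g / dg
        s -= step
        if abs(step) <= tol * abs(s):
            break
    return s


def estimate_num_spikes(d, m_max, gamma, estimate_nonspikes):
    """d: sample eigenvalues in decreasing order.

    estimate_nonspikes(m) is a hypothetical callback returning the estimated
    non-spikes when the top m sample eigenvalues are treated as spikes.
    """
    m = m_max
    while m > 0:
        lam = estimate_nonspikes(m)
        threshold = psi_hat(find_s_psi(lam, gamma), lam, gamma)
        below = [i for i in range(m) if d[i] <= threshold]  # 0-based indices
        if not below:
            return m
        m = below[0]  # equals i* - 1 in the 1-based notation of the text
    return 0
```

For instance, if the callback always returns unit non-spikes and γ = 10, the threshold is (1 + √10)² ≈ 17.32, so sample eigenvalues 50 and 30 are retained as distant spikes while 12.5 is not.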

Note that the close spikes occur so close to the support of the population LSD that they cannot be distinguished separately from the non-spikes when the number of spikes is unknown.

The selection of m_max is subjective. It can be selected based on the prior knowledge on the number of subpopulations or by investigating the sample eigenvalues. In real data applications, we are usually interested in only a few large eigenvalues. In such situations, m_max can also be selected to be slightly larger than the number of eigenvalues we are interested in. As seen from our simulation studies, this spike selection algorithm can overestimate the number of spikes if the upper bound m_max is too large or underestimate the number of spikes if there are close spikes present; see the Online Supplement. However, as long as m_max is small compared to n and p, the estimation of the true population distant spikes will still remain consistent.

6. Simulation studies and real data example

6.1. Simulation studies: Compare GSP and SP-based methods

In this section we will present simulation studies of four different scenarios to compare the performances of the proposed GSP-based methods and the existing SP-based method proposed by Lee et al. [14]. For each study, we simulated a training dataset with n = 500 individuals and p = 5000 features. The data were generated from three subpopulations with sample sizes 100, 150 and 250. For each subpopulation we first selected a mean vector μ_i by drawing its elements randomly with replacement from {−0.3,0,0.3}. Then samples in the ith subpopulation were drawn from $N_{p} (μ_{i}, V)$ where V is the AR(1) covariance matrix with variance σ² and autocorrelation ρ. The (σ²,ρ) pairs used for the four studies were (4,0.8), (1,0.7), (7.5,0.8) and (4,0). The population eigenvalue plots for all the studies are shown in Figure 2.
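A minimal sketch of this data-generating scheme is given below. It is an assumption of the example that the AR(1) noise is drawn through the stationary recursion e_j = ρ e_{j−1} + innovation, which yields covariance σ²ρ^{|i−j|}; the text does not state how the sampling was carried out. The function name and the seed argument are hypothetical.

```python
import math
import random


def simulate_training_data(sizes, p, sigma2, rho, seed=0):
    """Draw rows from N_p(mu_g, V) for each subpopulation g, with V the AR(1) covariance."""
    rng = random.Random(seed)
    innov_sd = math.sqrt(sigma2 * (1.0 - rho * rho))
    rows, labels = [], []
    for g, n_g in enumerate(sizes):
        # Subpopulation mean: elements drawn with replacement from {-0.3, 0, 0.3}.
        mu = [rng.choice((-0.3, 0.0, 0.3)) for _ in range(p)]
        for _ in range(n_g):
            e = [rng.gauss(0.0, math.sqrt(sigma2))]
            for _ in range(1, p):
                e.append(rho * e[-1] + rng.gauss(0.0, innov_sd))
            rows.append([m + x for m, x in zip(mu, e)])
            labels.append(g)
    return rows, labels


# Small demonstration; the studies in the text use sizes (100, 150, 250) and p = 5000.
rows, labels = simulate_training_data((3, 4, 5), p=20, sigma2=4.0, rho=0.8, seed=1)
```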

Figure 2: — Eigenvalue structures in simulation studies comparing GSP-based and SP-based methods.

We also generated test datasets for each study with the same settings as the training datasets. Then we applied our GSP-based methods and the existing SP-based method to estimate the population spikes, the angles between the sample and population eigenvectors, the correlations between the sample and population PC scores and the asymptotic shrinkage factors. For all of the studies, we used the upper bound m_max = 5 to estimate the number of distant spikes using the algorithm described in Section 5.3. We simulated each study 200 times to calculate the empirical biases and standard errors of the estimates. The results are presented in Table 1.

Table 1:

Simulation results for GSP-based and SP-based methods for estimating the population eigenvalues, cosine of the angles between sample and population eigenvectors, correlations between sample and population PC scores, and the asymptotic shrinkage factors. Each cell has empirical bias (%) with coefficient of variation (%) in parentheses.

Settings		Method	Eigenvalue		Angle		Correlation		Shrinkage
No.			1	2	1	2	1	2	1	2
1	n = 500 p = 5000 σ² = 4 ρ = 0.8	SP	5.27 (2.37)	18.27 (3.11)	6.52 (0.32)	34.07 (0.60)	3.83 (0.03)	23.33 (0.08)	5.32 (0.60)	17.88 (1.06)
		λ-GSP	0.43 (2.67)	0.95 (5.27)	0.53 (0.77)	3.28 (6.26)	0.33 (0.31)	2.79 (4.69)	0.47 (0.92)	0.58 (3.31)
		d-GSP	0.47 (2.67)	0.69 (5.45)	0.47 (0.77)	2.48 (6.70)	0.24 (0.31)	2.11 (5.07)	0.51 (0.92)	0.31 (3.51)
2	n = 500 p = 5000 σ² = 1 ρ = 0.7	SP	0.10 (0.90)	0.46 (1.27)	0.16 (0.04)	0.44 (0.08)	0.08 (0.001)	0.24 (0.003)	0.18 (0.08)	0.39 (0.16)
		λ-GSP	−0.04 (0.90)	0.04 (1.28)	0.01 (0.04)	0.004 (0.10)	0.01 (0.03)	0.01 (0.01)	0.03 (0.08)	−0.03 (0.18)
		d-GSP	−0.004 (0.90)	0.10 (1.28)	0.03 (0.04)	0.03 (0.10)	0.004 (0.03)	0.01 (0.01)	0.07 (0.08)	0.03 (0.18)
3	n = 500 p = 5000 σ² = 7.5 ρ = 0.8	SP	25.68 (2.54)	—	64.06 (0.52)	—	46.50 (0.07)	—	26.41 (0.90)	—
		λ-GSP	2.92 (5.7)	—	12.62 (11.90)	—	10.95 (10.13)	—	3.47 (4.20)	—
		d-GSP	2.45 (5.74)	—	12.25 (10.52)	—	10.87 (8.58)	—	3.00 (4.24)	—
4	n = 500 p = 5000 σ² = 4 ρ = 0	SP	0.05 (1.58)	−0.26 (2.35)	0.06 (0.23)	−0.06 (0.53)	0.03 (0.02)	0.05 (0.08)	0.07 (0.43)	−0.22 (0.90)
		λ-GSP	0.03 (1.58)	−0.35 (2.35)	0.02 (0.24)	−0.18 (0.54)	0.01 (0.02)	−0.02 (0.09)	0.04 (0.43)	−0.31 (0.91)
		d-GSP	0.16 (1.58)	−0.12 (2.35)	0.10 (0.23)	−0.03 (0.53)	0.01 (0.02)	0.02 (0.09)	0.18 (0.42)	−0.08 (0.90)


It is clear from Table 1 that for Studies 1, 2 and 3, our methods reduced the bias in all the estimates while having standard errors similar to those of the existing method. The positive empirical biases in all the SP estimates suggest that the SP method tends to overestimate all the quantities. In Study 4, since the underlying population satisfied the SP assumption, all methods provided very similar and almost unbiased estimates (< 1%). The results are also consistent with the asymptotic equivalence of the λ-estimates and d-estimates, whose performances are nearly identical in all the simulation studies.

In Study 1, the ratios of the largest non-spike to the two spikes are 0.29 and 0.48, which are substantially larger than zero. Thus according to the discussion in Section 4, the SP model does not closely approximate the GSP model. The results support this assertion as the SP model-based estimates are highly biased whereas the estimates based on our methods have very little empirical bias. In Study 2, the largest non-spike is very small compared to the smallest spike (ratio 0.08). Thus the estimates based on the SP model closely approximate the estimates based on the GSP model, and we find very little empirical bias (< 1%) in all of the SP model-based estimates. In Study 3, even though there were two spikes present, only the largest population eigenvalue was a distant spike. So we presented only the estimates corresponding to the largest population eigenvalue. Since the ratio of the largest non-spike to the largest spike is substantially larger than zero (0.53) in this study, we observe very high empirical bias in the SP model-based estimates. However, our methods provided much smaller empirical biases even in the presence of a close spike. We also presented the estimated number of distant spikes in each of the simulation studies in the Online Supplement. Note that in some cases, our algorithm over-estimates the number of distant spikes. However, as the over-estimation is only finite, the estimates of the distant spikes still remain consistent.

6.2. Simulation studies: Compare GSP and UHD-based methods

In Section 4.6, we compared the asymptotic results under the UHD regime with the results based on the high-dimensional GSP model when p is much larger than n but p/n remains finite. We theoretically established that the results from the two regimes become almost identical when p/n = γ is sufficiently large. However, for large but finite γ in the data, the difference can be substantial when the spike is small compared to γ.

In this section, we will assess that result by numerically comparing the GSP and UHD-based estimates for different values of γ. We considered five different scenarios where the largest population eigenvalue $λ \in \{γ, 0.6 γ, 60 + 0.1 γ, 6 \sqrt{γ}, 4 γ^{2 / 3}\}$ . For the first three scenarios, under the UHD regime, λ/γ can be assumed to be bounded away from zero, and for the last two, λ/γ → 0 as γ → ∞. To compare the performances as γ increases, we selected six different values for γ, viz. γ ∈ {100, 200, 500, 1000, 2500, 5000}. For each combination of γ and λ, we simulated 200 datasets, each with n = 200 samples from a population with only one spike λ, and the non-spikes generated from the AR(1) covariance structure with (σ², ρ) = (1, 0.9).

First, we compare the convergence results of the largest sample eigenvalue d from Theorem 1 and (3). For this purpose, we assume the population eigenvalues and the rate of increment of λ are known, and we compare the relative errors ϵ_GSP = {d − ψ(λ)}/d and ϵ_UHD = (d − λ − γ)/d or (d − γ)/d depending on whether λ/γ is assumed to be bounded away from zero or not. Figure 3 shows that for all combinations of (γ, λ), the GSP-based convergence result (Theorem 1) has very negligible relative errors. In contrast, the UHD-based convergence result (3) has substantially large relative errors even for γ as large as 5000 in Scenarios 3, 4 and 5. For Scenarios 1 and 2, since λ increases at a faster rate with γ than in other scenarios, the relative errors based on the two results converge much faster. However, for relatively smaller values of γ (100, 200, 500), the differences are substantial even though γ is large compared to n = 200. This suggests that we need γ to be large in an absolute sense, and not only in a relative sense compared to n in order to assume γ → ∞ and apply UHD-based results.
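The two relative errors are straightforward to compute once d, λ and γ are given. The sketch below uses hypothetical function names; the numerical example assumes all non-spikes equal 1, so that ψ(λ) = λ + γλ/(λ − 1), which is simpler than the AR(1) structure used in the simulations and is meant only to show the size of the gap between ψ(λ) and λ + γ.

```python
def rel_err_gsp(d, psi_lambda):
    """(d - psi(lambda)) / d."""
    return (d - psi_lambda) / d


def rel_err_uhd(d, lam, gamma, ratio_bounded_away_from_zero):
    """(d - lambda - gamma) / d, or (d - gamma) / d when lambda / gamma -> 0."""
    if ratio_bounded_away_from_zero:
        return (d - lam - gamma) / d
    return (d - gamma) / d


lam, gamma = 100.0, 100.0
psi_lam = lam + gamma * lam / (lam - 1.0)  # unit non-spikes (illustrative assumption)
d = psi_lam  # pretend the sample eigenvalue sits exactly at its GSP limit
eps_gsp = rel_err_gsp(d, psi_lam)
eps_uhd = rel_err_uhd(d, lam, gamma, True)
```

In this idealized case the GSP error is zero by construction, while the UHD approximation λ + γ misses ψ(λ) by the term γ/(λ − 1).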

Figure 3: — Relative errors (%) in the convergence results of the largest sample eigenvalue derived under GSP and UHD regimes. The population eigenvalues and the rate of increment of the largest population eigenvalue are assumed to be known.

Next, we compare the estimates of the spike λ using GSP-based and UHD-based methods assuming the population eigenvalues and the rate of increment of the population spike λ to be unknown. Among the GSP-based methods, we only used the d-GSP method for this purpose due to the computational burden associated with applying the λ-GSP method on such a large number of simulated datasets. One thing to note here is that the UHD results do not provide any consistent estimators for λ, as it is assumed to be divergent when λ ≥ O(γ), and the asymptotic properties of the sample eigenvalues do not depend on λ when λ/γ → 0. Thus, in order to compare these methods, we estimate λ by $\hat{λ} = d − γ$ when considering the UHD regime.

From Figure 4 we can see that our proposed d-GSP method provides almost negligible biases for all combinations of (γ, λ), whereas the UHD-based estimates have substantial biases even for γ as large as 5000 in Scenarios 3, 4 and 5. For Scenarios 1 and 2, both methods provide almost unbiased estimates when γ ≥ 1000 and γ ≥ 2000, respectively. Further, we compare the estimated shrinkage factors based on these two methods in the Online Supplement. They also show very similar patterns as the estimated spikes.

6.3. Application on Hapmap III data

For this demonstration, we used genetic data from the Hapmap Phase III project (http://hapmap.ncbi.nlm.nih.gov/). Our sample consisted of unrelated individuals sampled from two different populations: a) Utah residents with Northern and Western European ancestry (CEU) and b) Toscans in Italy (TSI). We only included genomic markers that are on chromosomes 1–22, have less than 5% missing values, and have minor allele frequency greater than 0.05. We also excluded two samples (both from CEU) with outlier PC scores (more than six standard deviations away from the mean PC score corresponding to at least one distant spike). We then mean-centered and variance-standardized the data for each marker. The final sample consisted of 198 individuals (110 from CEU and 88 from TSI). The total number of markers selected across chromosomes 1–22 was 1,389,511.

To evaluate the performance of the proposed methods with different p, we performed PCA on each chromosome separately. The number of markers varied from 19,331 (chromosome 21) to 116,582 (chromosome 2). The distribution of the number of markers across chromosomes is presented in the Online Supplement. We first estimated the number of distant spikes using the algorithm described in Section 5.3. We found no distant spike in chromosome 22 and only one distant spike in chromosome 2. Then we applied our GSP-based methods, the existing SP-based method [14] and the UHD-based method [15] to estimate the asymptotic shrinkage factors corresponding to the distant spikes. Figures 5a and 5b compare the estimated asymptotic shrinkage factors for the first two PCs across different chromosomes. The plots show that for all the chromosomes, the λ-GSP and d-GSP methods provided almost equal estimates while the SP and UHD estimates are larger than both the GSP estimates. This suggests that the SP method would over-estimate the shrinkage factors when the population eigenvalues deviate from the assumption that the non-spiked eigenvalues are the same. Moreover, the UHD method over-estimated the shrinkage factors even for p/n nearly as large as 600 (chromosome 2).

Figure 5: — Estimated shrinkage factors for (a) PC1 and (b) PC2 across chromosomes 1–21 based on three different methods. (c) Comparison of the mean squared errors (MSE) of the unadjusted and adjusted PC scores based on the d-GSP and SP methods with the adjusted PC scores based on the λ-GSP method. The ratios of the MSEs are presented for chromosome 1–21 using the threshold ϵ = 1. The y-axis is presented in a logarithmic scale.

To investigate whether the proposed shrinkage-bias adjustment can improve the prediction accuracy, we performed a leave-one-out cross-validation. In each iteration we removed one individual (test sample) and performed PCA on the remaining individuals (training samples) to predict the PC score of the test sample. For each predicted PC score, we adjusted the shrinkage-bias using the GSP-based, SP-based and UHD-based shrinkage factor estimates.

One important issue with this cross-validation is that the exclusion of one individual can substantially change the PC coordinates, so that the PC score plots from the training sample-based and complete sample-based PCA can be substantially different. To circumvent this problem, in each iteration we first rescaled the PC scores based on their corresponding sample eigenvalues to make the PCs comparable. In addition, we obtained the mean squared difference of the training sample PC1-2 scores with and without the exclusion of the test sample (for chromosome 2, only PC1 is used), and excluded the test sample from the prediction error estimation if the mean squared difference was above a threshold ϵ. We used four different values 0.5, 1, 5 and 10 for the threshold parameter ϵ, and for each value of ϵ we calculated the mean squared errors (MSE) of the unadjusted and adjusted predicted PC scores of the test samples. The sample sizes of the test samples that were finally included in the prediction error estimation for different values of ϵ are shown in the Online Supplement.

Figure 5 shows the estimated MSEs for ϵ = 1. It is clear that both the λ-GSP and d-GSP methods have much smaller MSEs than the SP method. The UHD-based method had almost identical MSEs as the SP-based method for all the chromosomes, hence we omitted the UHD-based results in this plot. As expected, the unadjusted predicted PC scores have substantially larger MSE than all the proposed adjustments. The plots are very similar for the other values of ϵ, and they can be found in the Online Supplement.

Figure 6 illustrates the shrinkage-bias adjustment for the PC1 and PC2 scores of an individual based on the markers on chromosome 7. The plot clearly shows that the bias-adjusted PC score based on the SP model is still biased towards zero, whereas the bias-adjusted PC score based on the GSP model is very close to the original sample PC score. We only showed the d-GSP adjusted score in the plot as the d-GSP and λ-GSP adjusted scores were almost equal.

7. Conclusions and discussion

In this paper, we investigated the asymptotic properties of PCA under the Generalized Spiked Population model and derived estimators of the population eigenvalues, the angles between the sample and population eigenvectors, and the correlation coefficients between the sample and population PC scores. We also proposed methods to adjust the shrinkage bias in the predicted PC scores. Further, theoretically and using simulation studies, we compared our results with the results developed under the ultra-high dimensional regime [15], and showed that our methods provide more accurate results under the high-dimensional regime even when p is greatly larger than n. Since the proposed methods do not require the equality of the non-spiked eigenvalues, they can be widely used in high-dimensional biomedical data analysis. We implemented all our algorithms in the R package hdpca.

We note that Mestre [18, 19] proposed an asymptotic setting similar to the generalized spiked population model but with a different assumption on the number of spikes in which the number of spikes increases with the dimension. Under this assumption, he provided asymptotic properties of sample eigenvalues and eigenvectors. However, in many biomedical data, the number of spikes is usually finite as the spikes represent the difference between finitely many underlying subpopulations. Therefore we believe that the generalized spiked population model is more appropriate in such cases.

In some special cases, even though the features exhibit strong local correlation, one can use the spiked population model based methods after some suitable data manipulation. In genome-wide association studies, SNP pruning [1] can be used to remove locally correlated SNPs to satisfy the spiked population model. For example, Lee et al. [14] reported good performance of the spiked population model-based methods with the SNP-pruned Hapmap III dataset. This approach, however, can lead to a considerable loss of information; the SNP-pruning in Hapmap III data removed nearly 90% of the SNPs. Since the proposed approach does not require this additional step, it can use most of the information present in the data.

Supplementary Material

NIHMS1523560-supplement-1.pdf (380KB, pdf)

Acknowledgments

We thank the Editor-in-Chief, Christian Genest, and referees for their helpful comments and suggestions, which led to substantial improvements of the manuscript. The work was supported by NIH Grants R00HL113164 and R01HG008773.

Appendix

Proof of Theorem 1. The first part of the proof follows directly from Result 1 along with the fact that on the domain of the distant spikes, the ψ function is strictly increasing, and hence is left invertible. Since ψ″(α) > 0 for any α > sup Γ_H, ψ′(α) is a strictly increasing function for α > sup Γ_H. Let S_ψ > sup Γ_H be a solution of ψ′(α) = 0. Then for any α > sup Γ_H, ψ′(α) > 0 if and only if α > S_ψ. Therefore the interval (S_ψ, ∞) is the domain of the distant spikes, and ψ is a strictly increasing function on this interval. The second part follows from Lemma B. □

Proof of Theorem 2. The proof closely follows the proof of Theorem 2 in [18]. However, contrary to [18], we do not assume that the population LSD contains the generalized spikes. Thus, some of the derivation steps and results are substantially different from [18]. We start the derivation by first noting that the quadratic forms ${\hat{η}}_{k}$ can be expressed as contour integrals of a special class of Stieltjes transforms of the sample covariance matrix. Let us define, for all $z \in ℂ^{+}$ ,

{\hat{m}}_{p} (z) = s_{1}^{⊤} {(S_{p} - z I_{p})}^{- 1} s_{2} = \sum_{j = 1}^{p} s_{1}^{⊤} e_{j} e_{j}^{⊤} s_{2} / (d_{j} - z),

where s₁ and s₂ are non-random vectors with uniformly bounded norms. Girko [10] and Mestre [17] showed that under the assumption that the population LSD contains the generalized spikes, one has, for all $z \in ℂ^{+}$ ,

| {\hat{m}}_{p} (z) - m_{p} (z) | \overset{a . s .}{\to} 0,

(A.1)

where

m_{p} (z) = s_{1}^{⊤} \{w (z) Σ_{p} - z I_{p}\}^{- 1} s_{2} = \sum_{j = 1}^{p} \frac{s_{1}^{⊤} E_{j} E_{j}^{⊤} s_{2}}{w (z) λ_{j} - z} .

The function w(z) is defined as w(z) = 1 − γ − γzb_F(z), where b_F(z) = ∫ (τ − z)⁻¹dF(τ) is the Stieltjes transform of the sample LSD. It is easy to check, by the same arguments provided in [17], that the result still holds when the generalized spikes are considered lying outside the support of the population LSD.

The functions ${\hat{m}}_{p}$ , m_p and b_F can be extended to $ℂ^{-} = \{z \in ℂ : \operatorname{Im} (z) < 0\}$ by defining ${\hat{m}}_{p} (z) = {\hat{m}}_{p}^{*} (z^{*})$ , $m_{p} (z) = m_{p}^{*} (z^{*})$ and $b_{F} (z) = b_{F}^{*} (z^{*})$ for $z \in ℂ^{-}$ , where z* is the complex conjugate of z.

With this definition, $| {\hat{m}}_{p} (z) - m_{p} (z) | \overset{a . s .}{\to} 0$ even when $z \in ℂ^{-}$ . Now ${\hat{η}}_{k}$ can be expressed as an integral of ${\hat{m}}_{p}$ , viz.

{\hat{η}}_{k} = \frac{1}{2 π i} \oint_{\partial {\hat{ℝ}}_{y}^{-} (k)} {\hat{m}}_{p} (z) d z,

where $i = \sqrt{- 1}, y > 0$ and $\partial {\hat{ℝ}}_{y}^{-} (k)$ is the negatively (clockwise) oriented boundary of the rectangle

{\hat{ℝ}}_{y} (k) = \{z \in ℂ : {\hat{a}}_{1} \leq \operatorname{Re} (z) \leq {\hat{a}}_{2}, | \operatorname{Im} (z) | \leq y\} .

The values of ${\hat{a}}_{1}$ and ${\hat{a}}_{2}$ can be arbitrarily chosen provided that ${\hat{ℝ}}_{y} (k)$ contains only the sample eigenvalue d_k and no other sample eigenvalue. Then the following lemma gives the almost sure limit of ${\hat{η}}_{k}$ .

Lemma A.

| \frac{1}{2 π i} \oint_{\partial {\hat{ℝ}}_{y}^{-} (k)} {\hat{m}}_{p} (z) d z - \frac{1}{2 π i} \oint_{\partial ℝ_{y}^{-} (k)} m_{p} (z) d z | \overset{a . s .}{\to} 0,

where y > 0 and $\partial ℝ_{y}^{-} (k)$ is the negatively (clockwise) oriented boundary of the rectangle

ℝ_{y} (k) = \{z \in ℂ : a_{1} \leq \operatorname{Re} (z) \leq a_{2}, | \operatorname{Im} (z) | \leq y\} .

The constants a₁ and a₂ can be arbitrarily chosen so that ψ(λ_k) ∈ [a₁, a₂] and [a₁, a₂] ⊂ ψ(S_ψ, ∞), where S_ψ > sup Γ_H, ψ′(S_ψ) = 0. Here, ψ(S_ψ, ∞) denotes the image of the interval (S_ψ, ∞) under ψ.

Lemma A implies

| {\hat{η}}_{k} - \sum_{j = 1}^{p} \left\{\frac{1}{2 π i} \oint_{\partial ℝ_{y}^{-} (k)} \frac{d z}{w (z) λ_{j} - z}\right\} s_{1}^{⊤} E_{j} E_{j}^{⊤} s_{2} | \overset{a . s .}{\to} 0.

(A.2)

Now we need to evaluate the integral in (A.2) in order to get the almost sure limit of the random variable ${\hat{η}}_{k}$ . First, we extend the ψ function to $ℝ_{y} (k)$ as follows. For all $z \in ℝ_{y} (k)$ ,

ψ (z) = z \left\{1 + γ \int \frac{λ}{z - λ} d H (λ)\right\} .

According to [16], for all $z \in ℂ^{+}$ , b_F(z) = b is the unique solution to the equation

b = \int \frac{1}{λ (1 - γ - γ z b) - z} d H (λ)

(A.3)

in the set $\{b \in ℂ : γ b - (1 - γ) / z \in ℂ^{+}\}$ . It is easy to see that b_F also satisfies (A.3) when $z \in ℂ^{-}$ . Now we formally define the f_F function introduced in (1). For all $z \in ℂ \setminus ℝ$ , set

f_{F} (z) = \frac{z}{w (z)} = \frac{z}{1 - γ - γ z b_{F} (z)} .

(A.4)

Then b_F can be expressed in terms of f_F as

b_{F} (z) = \frac{(1 - γ) f_{F} (z) - z}{γ z f_{F} (z)} .

By replacing b with {(1 − γ)f − z}/(γzf) in (A.3), we get

f \left\{1 + γ \int \frac{λ}{f - λ} d H (λ)\right\} = z .

(A.5)
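The substitution leading to (A.5) can be verified in three short steps. The display below is a sketch of the algebra using only (A.3), (A.4) and the fact that H is a probability distribution.

```latex
\begin{aligned}
1-\gamma-\gamma z b
  &= 1-\gamma-\frac{(1-\gamma)f - z}{f} = \frac{z}{f}
  \quad\Longrightarrow\quad
  \lambda(1-\gamma-\gamma z b) - z = \frac{z(\lambda - f)}{f}, \\
\frac{(1-\gamma)f - z}{\gamma z f}
  &= \int \frac{f}{z(\lambda - f)}\,dH(\lambda)
  \quad\Longrightarrow\quad
  z = (1-\gamma) f + \gamma f^{2}\int \frac{dH(\lambda)}{f-\lambda}, \\
\gamma f^{2}\int \frac{dH(\lambda)}{f-\lambda}
  &= \gamma f\int \frac{(f-\lambda)+\lambda}{f-\lambda}\,dH(\lambda)
   = \gamma f + \gamma f\int \frac{\lambda}{f-\lambda}\,dH(\lambda).
\end{aligned}
```

Substituting the third line into the second gives z = f + γf∫λ/(f − λ) dH(λ), which is the relation in (A.5).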

It is easy to see that b_F is a solution to (A.3) if and only if f_F is a solution to (A.5). Therefore, for all $z \in ℂ^{+}$ (similarly for $z \in ℂ^{-}$ ), f_F(z) = f is the unique solution to (A.5) on $ℂ^{+}$ (respectively, $ℂ^{-}$ ). This implies ψ{f_F(z)} = z for all $z \in ℝ_{y} (k) \setminus [a_{1}, a_{2}]$ .

Now we focus on the case when $z \in ℝ \setminus \{0\}$ . According to [22], we can extend b_F to $ℝ \setminus \{0\}$ by defining $b_{F} (z) = \lim_{y \to 0^{+}} b_{F} (z + i y)$ for any $z \in ℝ \setminus \{0\}$ . The definition of f_F can also be extended in a similar fashion. In Lemma B we have shown that f_F is the inverse function of ψ on (S_ψ, ∞), and there exists M_f > sup Γ_F for which ψ(S_ψ, ∞) = (M_f, ∞). Thus, [a₁, a₂] ⊂ ψ(S_ψ, ∞) implies ψ{f_F(z)} = z for all $z \in ℝ_{y} (k)$ . Furthermore, the function ψ is continuous and differentiable on $ℝ_{y} (k)$ , and the derivative is given by

ψ^{'} (z) = 1 - γ \int {(\frac{λ}{z - λ})}^{2} d H (λ) .

Then the integral in (A.2) can be expressed in terms of ψ and f_F as follows:

\frac{1}{2 π i} \oint_{\partial ℝ_{y}^{-} (k)} \frac{d z}{w (z) λ_{j} - z} = \frac{1}{2 π i} \oint_{\partial ℝ_{y}^{-} (k)} \frac{d z}{\frac{z}{f_{F} (z)} λ_{j} - z} = \frac{1}{2 π i} \oint_{\partial ℝ_{y}^{-} (k)} \frac{1}{λ_{j} - f_{F} (z)} \times \frac{f_{F} (z)}{ψ \{f_{F} (z)\}} d z .

(A.6)

The integrand in the final expression is holomorphic on $ℝ_{y} (k)$ when j ≠ k and possesses a simple pole at ψ(λ_k) when j = k. Therefore, when j ≠ k the integral in (A.6) is zero. When j = k, applying the residue theorem to the final integral, we get

\frac{1}{2 π i} \oint_{\partial ℝ_{y}^{-} (k)} \frac{d z}{w (z) λ_{k} - z} = \lim_{z \to ψ (λ_{k})} \frac{ψ (λ_{k}) - z}{λ_{k} - f_{F} (z)} \times \frac{f_{F} (z)}{ψ \{f_{F} (z)\}} = \lim_{z \to ψ (λ_{k})} \frac{ψ (λ_{k}) - ψ \{f_{F} (z)\}}{λ_{k} - f_{F} (z)} \times \frac{f_{F} (z)}{ψ \{f_{F} (z)\}} = \frac{λ_{k} ψ^{'} (λ_{k})}{ψ (λ_{k})} .

This implies

\frac{1}{2 π i} \oint_{\partial ℝ_{y}^{-} (k)} \frac{d z}{w (z) λ_{j} - z} = \begin{cases} λ_{k} ψ^{'} (λ_{k}) / ψ (λ_{k}) & \text{if } j = k, \\ 0 & \text{if } j \neq k, \end{cases}

and the proof of Theorem 2 is complete. □

Proof of Lemma A. First, we show that ${\hat{a}}_{1}$ , ${\hat{a}}_{2}$ , a₁, a₂ can be chosen satisfying ${\hat{a}}_{1} \to a_{1}$ and ${\hat{a}}_{2} \to a_{2}$ . This is possible due to the fact that $d_{k} \overset{a . s .}{\to} ψ (λ_{k})$ and ψ(λ_k) ∈ ψ(S_ψ, ∞) = (M_f, ∞), where M_f > sup Γ_F. Therefore, we can choose a neighborhood [a₁, a₂] around ψ(λ_k) so that [a₁, a₂] ⊂ (M_f, ∞). Moreover, as M_f is bounded away from the support of the sample LSD F and $d_{k} \overset{a . s .}{\to} ψ (λ_{k})$ , we can select a neighborhood [ ${\hat{a}}_{1}$ , ${\hat{a}}_{2}$ ] around d_k which does not contain any other eigenvalue and for which ${\hat{a}}_{1} \to a_{1}$ , ${\hat{a}}_{2} \to a_{2}$ . Then

| \frac{1}{2 π i} \oint_{\partial {\hat{ℝ}}_{y}^{-} (k)} {\hat{m}}_{p} (z) d z - \frac{1}{2 π i} \oint_{\partial ℝ_{y}^{-} (k)} m_{p} (z) d z | \leq \frac{1}{2 π} \left\{\sup_{z \in \partial {\hat{ℝ}}_{y}^{-} (k) \cup \partial ℝ_{y}^{-} (k)} | {\hat{m}}_{p} (z) |\right\} (| {\hat{a}}_{1} - a_{1} | + | {\hat{a}}_{2} - a_{2} |) + \frac{1}{2 π} \oint_{\partial ℝ_{y}^{-} (k)} | {\hat{m}}_{p} (z) - m_{p} (z) | \times | d z | .

(A.7)

From the Cauchy-Schwarz inequality, we can obtain the following upper bound for ${\hat{m}}_{p}$ :

| {\hat{m}}_{p} (z) | \leq ‖ s_{1} ‖ \times ‖ s_{2} ‖ / d (z, Γ_{F_{p}}) .

Here, $d (z, Γ_{F_{p}}) = {inf}_{y \in Γ_{F_{p}}} | z - y |$ . Since F_p → F point-wise and [a₁, a₂] is bounded away from Γ_F, $d (z, Γ_{F_{p}})$ is bounded away from zero with probability 1 for large enough p and n. Therefore $| {\hat{m}}_{p} (z) |$ is finite for $z \in ℝ_{y} (k)$ with probability 1 for large enough p and n. Moreover, since $[{\hat{a}}_{1}, {\hat{a}}_{2}] \to [a_{1}, a_{2}]$ , the interval $[{\hat{a}}_{1}, {\hat{a}}_{2}]$ will eventually be bounded away from Γ_F. Thus, eventually the upper bound for $| {\hat{m}}_{p} (z) |$ will also be finite for $z \in {\hat{ℝ}}_{y} (k)$ . Therefore, the first term on the right-hand side of (A.7) will go to zero as ${\hat{a}}_{1} \to a_{1}$ , ${\hat{a}}_{2} \to a_{2}$ .

Now, as ${\hat{m}}_{p} (z)$ and m_p(z) are holomorphic functions on the compact set $\partial ℝ_{y}^{-} (k)$ , we have

sup_{z \in \partial ℝ_{y}^{-} (k)} | {\hat{m}}_{p} (z) - m_{p} (z) | < \infty .

Also from (A.1), $| {\hat{m}}_{p} (z) - m_{p} (z) | \overset{a . s .}{\to} 0$ point-wise for all $z \in ℂ \setminus ℝ$ . Therefore, by Lebesgue’s Dominated Convergence Theorem, the second term on the right-hand side of (A.7) also converges to zero almost surely. This completes the proof of Lemma A. □

We can show the asymptotic equivalence of the limits derived in Theorem 2 and Result 2 as a direct application of the following lemma.

Lemma B. Suppose assumptions (A)-(D) hold. If λ_k is a distant spike with multiplicity 1, and d_k is the corresponding sample eigenvalue, then

f_{F} (d_{k}) \overset{p}{\to} λ_{k}, \frac{d_{k} g_{F} (d_{k})}{f_{F} (d_{k})} \overset{p}{\to} ψ^{'} (λ_{k}) .

Proof. We have already established in the proof of Theorem 2 that for all $z \in ℂ^{+}$ (similarly for $z \in ℂ^{-}$ ), f_F (z) = f is the unique solution to (A.5) on $ℂ^{+}$ (respectively, $ℂ^{-}$ ). When z is restricted to $ℂ \setminus ℝ$ , using (A.4) and the fact that b_F (z) = ∫ (τ − z)⁻¹dF(τ), we can write

f_{F} (z) = z {\{1 + γ \int \frac{τ}{z - τ} d F (τ)\}}^{- 1} .
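The integral form of f_F can be obtained from its Stieltjes-transform form in a few steps. The following is a sketch, assuming that F is a probability distribution (so that ∫ dF(τ) = 1) and using the definition f_F(z) = z{1 − γ − γzb_F(z)}⁻¹ quoted in the proof of Theorem 4:

```latex
\begin{aligned}
1 - \gamma - \gamma z\, b_F(z)
  &= 1 - \gamma \int \Bigl\{ 1 + \frac{z}{\tau - z} \Bigr\}\, dF(\tau)
   = 1 - \gamma \int \frac{\tau}{\tau - z}\, dF(\tau)
   = 1 + \gamma \int \frac{\tau}{z - \tau}\, dF(\tau), \\
f_F(z)
  &= \frac{z}{1 - \gamma - \gamma z\, b_F(z)}
   = \frac{z}{1 + \gamma \int \tau\, (z - \tau)^{-1}\, dF(\tau)} .
\end{aligned}
```

The last expression agrees with the real-axis limit of f_F used later in this proof.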

Now suppose $z = x \in ℝ \setminus \{0\}$ . Then both Eqs. (A.3) and (A.5) have multiple roots, real and complex valued, depending on x and H. For real-valued x, (A.5) can be written as ψ{f (x)} = x, where the function ψ is as defined in (1). As we have seen in the proof of Theorem 1, ψ is strictly increasing on the interval (S_ψ, ∞), where S_ψ > sup Γ_H and ψ’(S_ψ) = 0. Therefore, any real-valued solution f of ψ{f (x)} = x in (S_ψ, ∞) has to be the inverse of ψ, which is unique due to the strict monotonicity of ψ on (S_ψ, ∞). Now suppose Γ_F is the support of the sample LSD F. We will show that there exists M_f > sup Γ_F such that for any x > M_f, the function f_F is real-valued and is a solution to (A.5) in the interval (S_ψ, ∞). Thus it is also the unique such solution and the inverse of the ψ function on (S_ψ, ∞).

Let $x \in ℝ, x >$ sup Γ_F and $z = x + i y \in ℂ^{+}$ . Now, as $z \in ℂ^{+}$ , f_F(z) is the unique solution to (A.5) in $ℂ^{+}$ . Therefore, if we express f_F(z) as u(z) + iv(z), then v(z) > 0. Also, the imaginary part of (A.5) can be written as

v (z) [1 - γ \int \frac{λ^{2}}{{u (z) - λ}^{2} + v {(z)}^{2}} d H (λ)] = y .

Both v(z) and y being positive implies that

1 - γ \int \frac{λ^{2}}{{u (z) - λ}^{2} + v {(z)}^{2}} d H (λ) > 0.

(A.8)

Due to the continuity of f_F on the set $\{z \in ℂ^{+} : z = x + i y, x > \sup Γ_{F}\}$ ,

f_{F} (x) = lim_{y \to 0^{+}} \frac{x + i y}{1 + γ \int τ {(x + i y - τ)}^{- 1} d F (τ)} = \frac{x}{1 + γ \int τ {(x - τ)}^{- 1} d F (τ)},

which is real-valued. Thus u(z) → f_F(x) and v(z) → 0 as y ↓ 0. Therefore as y ↓ 0, the inequality (A.8) becomes

1 - γ \int \frac{λ^{2}}{{f_{F} (x) - λ}^{2}} d H (λ) > 0,

which implies ψ’{f_F(x)} > 0.
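The last implication can be read off from the derivative of ψ. Equation (1) is not reproduced in this appendix, so the sketch below assumes the form ψ(α) = α + γα ∫ λ(α − λ)⁻¹ dH(λ) used for the generalized spiked model in [3]; under that assumption, differentiating under the integral sign gives:

```latex
\begin{aligned}
\psi'(\alpha)
  &= 1 + \gamma \int \frac{\lambda}{\alpha - \lambda}\, dH(\lambda)
       - \gamma \alpha \int \frac{\lambda}{(\alpha - \lambda)^{2}}\, dH(\lambda) \\
  &= 1 + \gamma \int \frac{\lambda \{(\alpha - \lambda) - \alpha\}}{(\alpha - \lambda)^{2}}\, dH(\lambda)
   = 1 - \gamma \int \frac{\lambda^{2}}{(\alpha - \lambda)^{2}}\, dH(\lambda) .
\end{aligned}
```

Evaluating at α = f_F(x) reproduces the left-hand side of the preceding inequality, which is why its positivity is equivalent to ψ’{f_F(x)} > 0.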

We can see that f_F(x) attains zero at sup Γ_F and it is strictly and unboundedly increasing for x > sup Γ_F. This ensures the existence of a threshold M_f > sup Γ_F such that the function f_F maps the interval (M_f, ∞) to (S_ψ, ∞). Therefore, f_F and ψ are both strictly increasing, continuous and bijective mappings between the intervals (M_f, ∞) and (S_ψ, ∞). Since f_F(z) is the unique solution to (A.5) in $ℂ^{+}$ when $z \in ℂ^{+}$ , f_F is also a solution to (A.5) in (S_ψ, ∞) when x > M_f due to the continuity of the left-hand side of (A.5) on the set $\{f \in ℂ^{+} : f = u + i v, u > S_{ψ}\}$ , which further implies that f_F is the inverse function of ψ on (S_ψ, ∞).

The first part of Lemma B is proved as a corollary to Result 1 as ψ⁻¹ = f_F on the domain of distant spikes, i.e., (S_ψ, ∞). For the second part, we first need to derive the expression of $f_{F}^{'}$ , and then derive the expression of ψ’ in terms of f_F and F, viz.

f_{F}^{'} (x) = f_{F} (x) {1 + γ f_{F} (x) v_{F} (x)} / x; v_{F} (x) = \int \frac{τ}{{(x - τ)}^{2}} d F (τ) .

For a distant spike λ_k, using the expression of $f_{F}^{'}$ , we get

\frac{λ_{k} ψ^{'} (λ_{k})}{ψ (λ_{k})} = \frac{λ_{k}}{ψ (λ_{k}) f_{F}^{'} {ψ (λ_{k})}} = \frac{1}{1 + γ f_{F} {ψ (λ_{k})} \int τ {ψ (λ_{k}) - τ}^{- 2} d F (τ)} = g_{F} {ψ (λ_{k})} .

As ψ(λ_k) > M_f, g_F is continuous at ψ(λ_k). Since $d_{k} \overset{p}{\to} ψ (λ_{k})$ ,

g_{F} (d_{k}) \overset{p}{\to} g_{F} {ψ (λ_{k})} = λ_{k} ψ^{'} (λ_{k}) / ψ (λ_{k}), d_{k} g_{F} (d_{k}) / f_{F} (d_{k}) \overset{p}{\to} ψ^{'} (λ_{k}) .

This concludes the proof of Lemma B. □
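Lemma B can be illustrated numerically. The sketch below is not the authors' implementation: it assumes a single-spike population covariance diag(λ_k, 1, …, 1), replaces F by the empirical distribution of the remaining sample eigenvalues (excluding d_k itself, a practical choice made here to avoid the singularity at τ = d_k), and reads the form of g_F off the identity displayed in the proof above. The function names, sample sizes and seed are illustrative assumptions.

```python
import numpy as np

def f_F(x, bulk, gamma, p):
    """Plug-in f_F(x) = x / {1 + gamma * int tau/(x - tau) dF(tau)}."""
    integral = np.sum(bulk / (x - bulk)) / p
    return x / (1.0 + gamma * integral)

def g_F(x, bulk, gamma, p):
    """Plug-in g_F(x) = 1 / {1 + gamma * f_F(x) * int tau/(x - tau)^2 dF(tau)}."""
    v = np.sum(bulk / (x - bulk) ** 2) / p
    return 1.0 / (1.0 + gamma * f_F(x, bulk, gamma, p) * v)

def simulate_sample_eigenvalues(n, p, spike, seed):
    """Nonzero eigenvalues of X^T X / n, in decreasing order."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))
    X[:, 0] *= np.sqrt(spike)  # population covariance diag(spike, 1, ..., 1)
    # X X^T / n (n x n) shares its nonzero eigenvalues with X^T X / n (p x p).
    return np.linalg.eigvalsh(X @ X.T / n)[::-1]

n, p, spike = 500, 1000, 10.0
gamma = p / n
d = simulate_sample_eigenvalues(n, p, spike, seed=0)
# The p - n zero eigenvalues contribute nothing to either sum, so only the
# nonzero non-spike eigenvalues are kept; the sums are still divided by p.
d_k, bulk = d[0], d[1:]

lam_hat = f_F(d_k, bulk, gamma, p)                         # estimates lambda_k
psi_prime_hat = d_k * g_F(d_k, bulk, gamma, p) / lam_hat   # estimates psi'(lambda_k)

# For this covariance, psi(a) = a * {1 + gamma / (a - 1)}, so
# psi'(a) = 1 - gamma / (a - 1)^2.
psi_prime_true = 1.0 - gamma / (spike - 1.0) ** 2
print(lam_hat, psi_prime_hat, psi_prime_true)
```

With these settings both plug-in quantities should land near their targets, with deviations that shrink as n and p grow at a fixed ratio.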

Proof of Theorem 3. We have

{〈 P_{k}, p_{k} 〉}^{2} = \frac{1}{n^{2} λ_{k} d_{k}} {〈 X E_{k}, X e_{k} 〉}^{2} = \frac{1}{λ_{k} d_{k}} {(E_{k}^{⊤} X^{⊤} X e_{k} / n)}^{2} = \frac{1}{λ_{k} d_{k}} {E_{k}^{⊤} (\sum_{i = 1}^{p} d_{i} e_{i} e_{i}^{⊤}) e_{k}}^{2} = \frac{d_{k}}{λ_{k}} {〈 e_{k}, E_{k} 〉}^{2} .

Using the limits derived in Theorem 2 and Result 1, we deduce that

| \frac{d_{k}}{λ_{k}} {〈 e_{k}, E_{k} 〉}^{2} - ψ^{'} (λ_{k}) | \overset{p}{\to} 0.

Using Lemma B, we conclude that

| \frac{d_{k}}{λ_{k}} {〈 e_{k}, E_{k} 〉}^{2} - \frac{d_{k} g_{F} (d_{k})}{f_{F} (d_{k})} | \overset{p}{\to} 0.
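The first display in this proof is an exact finite-sample identity, so it can be checked directly. The sketch below assumes the normalizations P_k = XE_k/(nλ_k)^{1/2} and p_k = Xe_k/(nd_k)^{1/2}, which is what the first equality of that display suggests, and a single-spike covariance diag(λ_k, 1, …, 1); the sizes and seed are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, lam_k = 50, 120, 8.0

X = rng.standard_normal((n, p))
X[:, 0] *= np.sqrt(lam_k)      # population covariance diag(lam_k, 1, ..., 1)
E_k = np.zeros(p)
E_k[0] = 1.0                   # population eigenvector belonging to lam_k

S = X.T @ X / n                # p x p sample covariance matrix
evals, evecs = np.linalg.eigh(S)          # eigenvalues in ascending order
d_k, e_k = evals[-1], evecs[:, -1]        # largest sample eigenpair

P_k = X @ E_k / np.sqrt(n * lam_k)        # population PC scores (assumed scaling)
p_k = X @ e_k / np.sqrt(n * d_k)          # sample PC scores, unit norm

lhs = np.dot(P_k, p_k) ** 2
rhs = d_k / lam_k * np.dot(e_k, E_k) ** 2
print(lhs, rhs)
```

The two numbers agree up to floating-point error for any n and p, since the identity only uses S e_k = d_k e_k; the asymptotics enter only through the limits of d_k and of the inner product between e_k and E_k.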

Proof of Theorem 4. We show that the denominator $E (p_{k j}^{2})$ converges to ψ(λ_k) and the numerator $E (q_{k}^{2})$ converges to $λ_{k}^{2} / ψ (λ_{k})$ . The proof will be complete using the fact that $d_{k} \overset{p}{\to} ψ (λ_{k})$ . For the denominator, we have

E (p_{k j}^{2}) = \frac{1}{n} E (\sum_{i = 1}^{n} p_{k i}^{2}) = \frac{1}{n} E {\sum_{i = 1}^{n} {(x_{i}^{⊤} e_{k})}^{2}} = E (e_{k}^{⊤} X^{⊤} X e_{k} / n) = E {e_{k}^{⊤} (\sum_{i = 1}^{p} d_{i} e_{i} e_{i}^{⊤}) e_{k}} = E (d_{k}) \to ψ (λ_{k}) .

For the numerator, we have

E (q_{k}^{2}) = E \{{(x_{\mathrm{new}}^{⊤} e_{k})}^{2}\} = E [E \{{(x_{\mathrm{new}}^{⊤} e_{k})}^{2} | e_{k}\}] = E \{\mathrm{var} (x_{\mathrm{new}}^{⊤} e_{k} | e_{k})\} = E (e_{k}^{⊤} Σ_{p} e_{k}) .

Now, using the notations in the proof of Theorem 2 and Lemma B, we have b_F(z) = ∫ (τ − z)⁻¹dF(τ) as the Stieltjes transform of the sample LSD and the function f_F defined as f_F(z) = z{1 − γ − γzb_F(z)}⁻¹. Therefore,

b_{F} (z) = \frac{(1 - γ) f_{F} (z) - z}{γ z f_{F} (z)} .

The functions b_F and f_F can be extended to the real axis by defining the extensions as shown in the proof of Lemma B. Thus, for the sample eigenvalue d_k corresponding to the distant spike λ_k we have

b_{F} (d_{k}) = \frac{(1 - γ) f_{F} (d_{k}) - d_{k}}{γ d_{k} f_{F} (d_{k})} .

According to Theorem 4 in [13], the limit of $e_{k}^{⊤} Σ_{p} e_{k}$ is given by d_k{1 − γ − γd_kb_F(d_k)}⁻². Substituting the expression for b_F (d_k) into this limit, we get

| e_{k}^{⊤} Σ_{p} e_{k} - f_{F}^{2} (d_{k}) / d_{k} | \overset{p}{\to} 0.
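The substitution behind this display can be spelled out as follows; the sketch uses only the definition f_F(z) = z{1 − γ − γzb_F(z)}⁻¹ evaluated at z = d_k:

```latex
\begin{aligned}
1 - \gamma - \gamma d_k\, b_F(d_k) &= \frac{d_k}{f_F(d_k)}, \\
d_k \bigl\{ 1 - \gamma - \gamma d_k\, b_F(d_k) \bigr\}^{-2}
  &= d_k \cdot \frac{f_F^{2}(d_k)}{d_k^{2}}
   = \frac{f_F^{2}(d_k)}{d_k} .
\end{aligned}
```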

Using Result 1 and Lemma B, we have $f_{F}^{2} (d_{k}) / d_{k} \overset{p}{\to} λ_{k}^{2} / ψ (λ_{k})$ . Therefore, the limit of the numerator is given by

E (q_{k}^{2}) = E (e_{k}^{⊤} Σ_{p} e_{k}) \to λ_{k}^{2} / ψ (λ_{k}) .

This completes the proof of Theorem 4. □
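The limit of the numerator can also be examined by simulation. The following is a rough sketch rather than the authors' procedure: it assumes Σ_p = diag(λ_k, 1, …, 1), for which ψ(λ_k) = λ_k{1 + γ/(λ_k − 1)}, and uses the empirical distribution of the non-spike sample eigenvalues in place of F. All sizes, the seed and the variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, lam_k = 500, 1000, 10.0
gamma = p / n

X = rng.standard_normal((n, p))
X[:, 0] *= np.sqrt(lam_k)      # Sigma_p = diag(lam_k, 1, ..., 1)

# Eigen-decompose the n x n matrix X X^T / n, then map its top eigenvector
# back to the unit-norm top eigenvector of X^T X / n.
evals, U = np.linalg.eigh(X @ X.T / n)
d = evals[::-1]
d_k = d[0]
e_k = X.T @ U[:, -1] / np.sqrt(n * d_k)

# e_k^T Sigma_p e_k for this diagonal Sigma_p, using ||e_k|| = 1.
quad_form = 1.0 + (lam_k - 1.0) * e_k[0] ** 2

# Plug-in f_F(d_k)^2 / d_k; zero eigenvalues add nothing to the sum.
bulk = d[1:]
f_dk = d_k / (1.0 + gamma * np.sum(bulk / (d_k - bulk)) / p)
plug_in = f_dk ** 2 / d_k

# Theoretical limit lambda_k^2 / psi(lambda_k) under the assumed Sigma_p.
psi_lam = lam_k * (1.0 + gamma / (lam_k - 1.0))
limit = lam_k ** 2 / psi_lam
print(quad_form, plug_in, limit)
```

In this setting the three printed quantities should be of similar size, which is the content of the two convergence statements used for the numerator.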

Footnotes

Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Online Supplement

Additional tables and figures relevant to this paper are provided in the Online Supplement.

References

[1] Anderson CA, Pettersson FH, Clarke GM, Cardon LR, Morris AP, Zondervan KT, Data quality control in genetic case-control association studies, Nature Protocols 5 (2010) 1564–1573.
[2] Bai ZD, Silverstein JW, No eigenvalues outside the support of the limiting spectral distribution of large-dimensional sample covariance matrices, Ann. Probab 26 (1998) 316–345.
[3] Bai ZD, Yao J, On sample eigenvalues in a generalized spiked population model, J. Multivariate Anal 106 (2012) 167–177.
[4] Baik J, Silverstein JW, Eigenvalues of large sample covariance matrices of spiked population models, J. Multivariate Anal 97 (2006) 1382–1408.
[5] Berkelaar M, et al., lpSolve: Interface to ‘Lp_solve’ v. 5.5 to Solve Linear/Integer Programs, 2015. R package version 5.6.13.
[6] Boyd S, Vandenberghe L, Convex Optimization, Cambridge University Press, Cambridge, 2004.
[7] Cai TT, Han X, Pan G, Limiting laws for divergent spiked eigenvalues and largest non-spiked eigenvalue of sample covariance matrices, ArXiv e-prints (2017).
[8] Ding X, Convergence of sample eigenvectors of spiked population model, Commun. Statist. Theory Methods 44 (2015) 3825–3840.
[9] El Karoui N, Spectrum estimation for large dimensional covariance matrices using random matrix theory, Ann. Statist 36 (2008) 2757–2790.
[10] Girko VL, Strong law for the eigenvalues and eigenvectors of empirical covariance matrices, Random Oper. Stoch. Equ 4 (1996) 176–204.
[11] Johnstone IM, On the distribution of the largest eigenvalue in principal components analysis, Ann. Statist 29 (2001) 295–327.
[12] Johnstone IM, Lu AY, On consistency and sparsity for principal components analysis in high dimensions, J. Amer. Statist. Assoc 104 (2009) 682–693.
[13] Ledoit O, Peche S, Eigenvectors of some large sample covariance matrix ensembles, Probab. Theory Related Fields 151 (2010) 233–264.
[14] Lee S, Zou F, Wright FA, Convergence and prediction of principal component scores in high-dimensional settings, Ann. Statist 38 (2010) 3605–3629.
[15] Lee S, Zou F, Wright FA, Convergence of sample eigenvalues, eigenvectors, and principal component scores for ultra-high dimensional data, Biometrika 101 (2014) 484–490.
[16] Marcenko VA, Pastur LA, Distribution of eigenvalues for some sets of random matrices, Mathematics of the USSR-Sbornik 1 (1967) 457.
[17] Mestre X, On the asymptotic behavior of quadratic forms of the resolvent of certain covariance-type matrices, Technical Report, CTTC/RC/2006-001, Centre Tecnologic de Telecomunicacions de Catalunya, Barcelona, Spain, 2006.
[18] Mestre X, Improved estimation of eigenvalues and eigenvectors of covariance matrices using their sample estimates, IEEE Trans. Inform. Theory 54 (2008) 5113–5129.
[19] Mestre X, On the asymptotic behavior of the sample estimates of eigenvalues and eigenvectors of covariance matrices, IEEE Trans. Signal Process. 56 (2008) 5353–5368.
[20] Paul D, Asymptotics of sample eigenstructure for a large dimensional spiked covariance model, Statist. Sinica 17 (2007) 1617–1642.
[21] Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D, Principal components analysis corrects for stratification in genome-wide association studies, Nature Genetics 38 (2006) 904–909.
[22] Silverstein J, Choi S, Analysis of the limiting spectral distribution of large dimensional random matrices, Journal of Multivariate Analysis 54 (1995) 295–309.
[23] Storey JD, Xiao W, Leek JT, Tompkins RG, Davis RW, Significance analysis of time course microarray experiments, Proc. Nat. Acad. Sci. USA 102 (2005) 12837–12842.
