Skip to main content
NIHPA Author Manuscripts logoLink to NIHPA Author Manuscripts
. Author manuscript; available in PMC: 2014 Feb 1.
Published in final edited form as: J Multivar Anal. 2012 Jul 16;114:127–160. doi: 10.1016/j.jmva.2012.07.004

Sparse principal component analysis by choice of norm

Xin Qi 1,, Ruiyan Luo 2, Hongyu Zhao 3
PMCID: PMC3601508  NIHMSID: NIHMS395542  PMID: 23524453

Abstract

Recent years have seen the developments of several methods for sparse principal component analysis due to its importance in the analysis of high dimensional data. Despite the demonstration of their usefulness in practical applications, they are limited in terms of lack of orthogonality in the loadings (coefficients) of different principal components, the existence of correlation in the principal components, the expensive computation needed, and the lack of theoretical results such as consistency in high-dimensional situations. In this paper, we propose a new sparse principal component analysis method by introducing a new norm to replace the usual norm in traditional eigenvalue problems, and propose an efficient iterative algorithm to solve the optimization problems. With this method, we can efficiently obtain uncorrelated principal components or orthogonal loadings, and achieve the goal of explaining a high percentage of variations with sparse linear combinations. Due to the strict convexity of the new norm, we can prove the convergence of the iterative method and provide the detailed characterization of the limits. We also prove that the obtained principal component is consistent for a single component model in high dimensional situations. As illustration, we apply this method to real gene expression data with competitive results.

Keywords: sparse principal component analysis, high-dimensional data, uncorrelated or orthogonal principal components, iterative algorithm, consistency in high-dimensional

1. Introduction

Principal components analysis (PCA), as a popular feature extraction and dimension reduction tool, seeks the linear combinations of the original variables (principal components, PCs) such that the derived PCs capture maximal variance and guarantee minimal information loss. However, the classic PCA has major practical and theoretical drawbacks when it is applied to high-dimensional data. The classic PCA produces inconsistent estimates in high-dimensional situations (see Paul [11], Nadler [10] and Johnstone and Lu [7]). The loadings of PCs are typically non-zero. This often makes it difficult to interpret the PCs and identify important variables.

To address the drawbacks of classic PCA, various modified PCA methods have been proposed to form PCs where each PC is the linear combination of a small subset of the variables that can still explain high percentage of variance. SCoTLASS proposed in Jolliffe, Trendafilov and Uddin [8], a quite natural extension to classic PCA, maximizes the explained variances with l1 constraints on the loadings, as well as orthogonal conditions on subsequent loadings. The nonconvexity of its objective function and feasible region leads to difficulties in computation, especially for higher order principal components. Trendafilov and Jolliffe [14] proposed a globally convergent algorithm to solve the optimization problem in the SCoTLASS based on the projected gradient approach. D’Aspremont, Bach and Ghaoui [4] suggested a semidefinite programming problem as a relaxation to the l0-penalty for sparse covariance matrix. Amini and Wainwright [1] studied the asymptotic properties of the leading eigenvector of the covariance estimator obtained by D’Aspremont et al. [4]. The SPCA developed by Zou, Hastie and Tibshirani [16] formulates PCA as a regression-type optimization problem (SPCA), and then to obtain sparse loadings by imposing the lasso or elastic net penalty on the regression coefficients. Compared to SCotLASS, this method has more complicated objective function, but it is computationally efficient. The sparse PCs obtained from SPCA are neither orthogonal nor uncorrelated. Applications of SPCA to simulated and real examples indicated that different sparse PCs may be highly correlated, which makes it difficult in differentiating the variances explained by different PCs. To ease this difficulty, Zou and his colleagues proposed a new approach to computing the total variance explained by sparse PCs. However, without orthogonality or uncorrelation constraints, there are infinitely many sparse linear combinations which can achieve the same level of explained variances as the higher order PCs obtained by SPCA. Furthermore, the obtained PCs by SPCA appear to be sensitive to the choice of the number of PCs and all the sparsity parameters. Shen and Huang [13] proposed a sequential method (sPCA-rSVD) to find the sparse PCs via regularized singular value decomposition which extracts the PCs through solving low rank matrix approximation problems. Compared to SPCA, this method does not have the sensitivity problem as SPCA to the number of PCs and its higher order PCs are not affected by the sparsity parameters of the lower order PCs. But the nonorthogonality and correlation in PCs remain a problem. Based on their penalized matrix decomposition method, Witten, Tibshirani and Hastie [15] proposed an iterative method for obtaining sparse PCs, which yields an efficient algorithm to obtain the first PC of SCoTLASS. However, it is hard to study the convergence of this iterative method and to extend it to get orthogonal or uncorrelated higher order PCs. Johnstone and Lu [7] proposed a thresholding method and proved its consistency for a single component model in high dimensional situations. As pointed out by Cadima and Jolliffe [3], the thresholding approach can be potentially misleading in various respects.

In this paper, we propose a new criterion-based sequential sparse PCA method by replacing the l2-norm in traditional eigenvalue problems with a new norm, which is a convex combination of l1 and l2 norms. Hence, the optimization problems in our methods are natural extensions of those in classic PCA and have relatively simple forms. We propose an efficient iterative algorithms to solve these optimization problems. Due to the strict convexity of the new norm, we can prove the convergence of this iterative algorithm and provide the detailed characterization of the limits. With this method, we can efficiently obtain un-correlated PCs or orthogonal loadings, and achieve the goal of explaining high percentage of variations with sparse loadings. Our method is almost as fast as SPCA and sPCA-rSVD for high-dimensional data such as gene expression data and single-nucleotide polymorphism (SNP) data. This new framework can be easily extended to other statistical techniques and methodologies involve solving eigenvalue problems (such as partial least square regression), generalized eigenvalue problems (such as linear discriminant analysis) and singular value decomposition problems (such as canonical correlation analysis), where the different components are required to be orthogonal or uncorrelated. They are our future research directions. We also prove that the obtained PC is consistent in the single component model proposed by Johnstone and Lu [7] when the ratio of the number of variables to the sample size goes to a nonnegative constant.

The paper is organized as follows. In Section 2, we describe our method, propose algorithms and provide the theoretical characterizations. We show the consistency result in Section 3, and conduct simulation studies and apply our method to the well-studied pitprop data set and a gene expression dataset in Section 4. The proofs of all theorems are given in Section 5. The proofs of some technical lemmas are given in the Appendix.

2. Sparse PCA by choice of norm

Let X be a n × p data matrix where n and p are the number of the observations and the number of the variables, respectively. Without loss of generality, assume the column means (sample means) of X are all 0. We will use Σ to denote the p × p sample covariance matrix 1nXTX throughout this paper except in Section 3 where the sample covariance matrix is denoted by Σ̂(n) because we want to study the asymptotic property of our method as n → ∞. XT is the transpose of the matrix X. For any v = (v1, · · ·, vp)T ∈ ℝp and u = (u1, · · ·, up)T ∈ ℝp, let ||v||2=(i=1pvi2)12 be the l2-norm of v, 〈u, v〉 = uTv be the inner product between u and v, and ||v||1=i=1pvi be the l1-norm of v. If u and v are orthogonal under l2-norm, that is, 〈u, v〉 = uTv = 0, we write uv.

For any λ ∈ [0, 1], we define a “mixed norm” ||·||λ in ℝp space by

||v||λ=[(1-λ)||v||22+λ||v||12]12,vRp.

If λ = 0, this norm reduces to the l2 norm, and if λ = 1, it is the l1 norm. Our definitions for the first and the higher order sparse PCs are given as follows.

Definition 1

The coefficient vector v1 ∈ ℝp of the first sparse principal component is the solution to the following optimization problem,

maxvRp,||v||2=1vTv(1-λ1)||v||22+λ1||v||12=maxvRp,||v||2=1vTv||v||λ12, (2.1)

where 0 ≤ λ1 ≤ 1 is the tuning parameter for the first sparse principal component.

λ1 controls the sparsity of v1. A larger λ1 leads to sparser coefficients, but reduces the proportion of variations explained by the first PC. In practice, we need to choose an appropriate value of the tuning parameter to make a balance between sparsity and high percentage of explained variance. Because the objective function in (2.1) is homogeneous, that is, the objective function has the same values at v and av, where a is any nonzero real number, the optimization problem (2.1) is equivalent to the following problem,

maxuRpuTu,subjectto||u||λ11. (2.2)

The solutions to (2.1) and (2.2) differ only by a multiplicative constant. The solution to (2.2) always has the mixed norm equal to one, because Σ, the covariance matrix, is nonnegative definite, and hence uT Σu is always nonnegative for any u. We will propose an efficient algorithm to solve (2.2). However, it is more convenient to consider the form (2.1) in studying the asymptotic consistency of our sparse PCs.

We now compare our method with SCoTLASS which solves the following problem,

maxuRpuTu,subjectto||u||2=1,||u||1t. (2.3)

When tp, the l1 constraint in (2.3) is not active. In this case, (2.3) is just the usual eigenvalue problem and equivalent to (2.2) with λ1 = 0. When t < 1, the feasible region of (2.3) is empty. When 1t<p, the difference of the feasible regions between (2.2) and (2.3) is shown in Figure 1 for the two dimensional case. The left plot in Figure 1 shows the pattern of the feasible region for (2.3) with 1<t=1.3<2. The feasible region of (2.3) is the collection of the solid green lines and is nonconvex. The right plot in Figure 1 shows the feasible regions in (2.2) for different λ1. The regions are strictly convex for 0 ≤ λ < 1, which plays a key role in the development of our method.

Figure 1.

Figure 1

Left: the feasible region ||u||2 = 1, ||u||1t of SCoTLASS (where t = 1.3) is the collection of the solid green lines. Right: the feasible regions ||u||λ ≤ 1 of our method for λ = 0, 0.1, 0.5 and 1 are the regions bounded by the lines with different colors.

The classic PCs possess two properties simultaneously: coefficient vectors of different PCs are orthogonal and different PCs are uncorrelated. However, for sparse PCA, at most one of the two properties can be possessed by principal components. Therefore, we consider two alternative definitions for higher order PCs. In the following definitions, vk, k ≥ 1, denotes the coefficient vector of the k-th sparse PC.

Definition 2

Suppose that we have obtained vj, 1 ≤ jk − 1, then vk solves the following optimization problem,

max||v||2=1,vvj,j=1,,k-1vTv||v||λk2, (2.4)

where λk is the tuning parameter for vk.

Definition 3

Suppose that we have obtained vj, 1 ≤ jk − 1, then vk solves the following optimization problem,

max||v||2=1,vvj,j=1,,k-1vTv||v||λk2, (2.5)

Under Definition 2, vk, k ≥ 1, are orthogonal to each other and under Definition 3, different PCs are uncorrelated to each other. Our algorithm can be applied to both definitions. Different tuning parameters can be used for different PCs in our method. We first propose an algorithm to solve the problem (2.2) for the first PC.

Before we introduce our algorithm, in order to facilitate comparison among various sparse PCA methods, we summarize the criteria of the methods we can find as follows. It is definitely not a complete list.

  • SCoTLASS (Jolliffe et al. [8] and Trendafilov and Jolliffe [14]): the coefficient vector for the k-th sparse principal component is the solution to
    maxuRpuTu,subjectto||u||2=1,||u||1t,uuj,1jk-1,

    where uj is the coefficient vector for the j-th sparse principal component and t is tuning parameter.

  • Semidefinite-programming relaxation (D’Aspremont et al. [4]): find the symmetric and nonnegative-definite matrix X which
    maximizeTr(X)subjecttoTr(X)=1,rank(X)=1i,j=1pXijk,

    where Tr(·) denotes the trace of a matrix and k is the tuning parameter. Then the estimate of the coefficient vector for the first principal component is the leading eigenvector of X.

  • SPCA (Zou et al. [16]): solve the following regression-type problem,
    minA,B{i=1n||xi-ABTxi||22+λj=1k||βj||22+j=1kλ1,j||βj||1},subjecttoATA=I

    where x1, · · ·, xn are the sample, A is a p × k matrix, B = (β1, β2, · · ·, βk) is a p × k matrix and I is the identity matrix. Let (Â, ) be the solution to the above optimization problem. Then the columns of , are the estimates of the coefficient vectors for the first k principal components.

  • sPCA-rSVD (Shen and Huang [13]): solve the following problem,
    minu,v,||u||2=1{i=1nj=1p(Xij-uivj)2+Pλ(v)},

    where X is the data matrix, u = (u1, · · ·, un)T and v = (v1, · · ·, vp)T, and Pλ(v) is the penalty term. Let (û, ) be the solution to the above optimization problem. The estimator of the coefficient vector for the first sparse principal component is .

  • In addition to the above criterion-based methods, there are thresholding methods such as the method proposed in Johnstone and Lu [7].

2.1. First sparse principal component

We propose the following iterative algorithm to solve (2.2).

Algorithm 2.1 (For solving (2.2))

  1. Choose an initial vector u(0) ∈ ℝp such that Σu(0)0.

  2. Iteratively compute u(1), u(2), · · ·, until convergence as follows: for any i ≥ 1, if we have obtained u(i−1), define w(i−1) = Σu(i−1), then u(i) is the solution to
    maxuRp(w(i-1))Tu,subjectto||u||λ11. (2.6)

We first observe that if λ = 0, Algorithm 2.1 is the power iteration method for finding the largest eigenvalue of a symmetric matrix (see Section 5.3 in Quarteroni, Sacco and Saleri [12]). The power method is efficient to find the first eigenvector and has been applied to many real-life problems, such as calculating the page rank of documents in the search engine of Google. The convergence of the power method has been well studied. Algorithm 2.1 can be regarded as a generalized power method. The convergence of Algorithm 2.1 will be established at the end of this subsection. A second observation is that the key step of Algorithm 2.1 is the iteration step (2.6). Hence, in order to make Algorithm 2.1 fast, we will propose an efficient algorithm to solve the iteration step (2.6). The optimization problem (2.6) can be written in a general form: given a nonzero vector a ≠ 0, solve

maxuRpaTu,subjectto||u||λ1, (2.7)

where 0 ≤ λ < 1. The following theorem establishes the uniqueness of the solution to (2.7).

Theorem 2.1

For any 0 ≤ λ < 1, the space (ℝp, ||·||λ) is a strictly convex Banach space, that is, its unit ball {u: ||u||λ ≤ 1} is a strictly convex set. Moreover, the solution to (2.7) is unique and is a continuous function of a.

Remark

When λ = 1, the norm is just the l1-norm, hence the feasible region is convex but not strongly convex. Therefore, in this case, the solution to (2.7) may not be unique.

Theorem 2.1 is important for studying the convergence and limits of Algorithm 2.1. A similar algorithm to Algorithm 2.1 has been proposed in Witten et al. [15] to solve the optimization problem (2.3) in SCoTLASS for the first PC. However, the feasible region in (2.3) is convex but not strictly convex, hence the solution to each iteration step may not be unique and thus it is not a function of a. Therefore, it is not easy to study the convergence and limits of the iterative algorithm and they did not propose the similar iterative algorithm for higher order PCs of SCoTLASS.

Note that (2.7) is a convex optimization problem. We first derive an explicit solution to a special case of (2.7) based on the Karush-Kuhn-Tucker (KKT) conditions in Theorem 2.2. Then the solutions to general cases of (2.7) can be easily derived. We introduce some notations. Let A[λ] be the p × p matrix with diagonal elements equal to 1 and off-diagonal elements equal to λ. That is,

A[λ]=(1λλλ1λλλ1).

Suppose that I is a subset of {1, 2, · · ·, p} and the size of I is k. Let Ic be the complement of I. For any p × p matrix B, we use BI to denote the k × k submatrix formed by the rows and columns which original indices are in I. For any u ∈ ℝp, we use uI to denote the k-vector formed by the coordinates which original indices are in I.

Lemma 1

Let 0 ≤ λ < 1 and a = (a1, · · ·, ap)T0 with a1a2 · · · ≥ ap ≥ 0. Define the partial sums Si=j=1iaj, i = 1, · · ·, p. Then there exists a unique m ∈ {1, 2 · · ·, p} satisfying

am+1λSm(1-λ)+mλ<am, (2.8)

where we set ap+1 = 0. Moreover, we have

λSi(1-λ)+iλ<ai,imandλSi(1-λ)+iλai,i>m. (2.9)

Remark

For a special case where a1 = a2 = · · · = ap > 0, it can be derived from the lemma that m = p.

Theorem 2.2

Let 0 ≤ λ < 1. Suppose that in (2.7), a = (a1, · · ·, ap)T0 with a1a2 · · · ≥ ap ≥ 0. Define the partial sums Si=j=1iaj, i = 1, · · ·, p. Then the unique solution u=(u1,,up)T to (2.7) satisfies

ui={12ν[ai-λSm(1-λ)+mλ]ifim0ifi>m, (2.10)

where m, the number of nonzero coordinates of u*, is the unique integer satisfying (2.8) and ν is a positive scale constant chosen to make ||u*||λ = 1. Furthermore, let I = {1, 2, · · ·, m} be the index set of nonzero coordinates of u*, we have AI[λ]uI=aI/2ν, where ν̃ is a positive constant.

The solution to (2.7) for general cases can be easily obtained from Lemma 1 and Theorem 2.2.

Corollary 2.1

Let 0 ≤ λ < 1 and k1, k2, · · ·, kp be a permutation of 1, 2, · · ·, p such that |ak1| ≥ |ak2| ≥ · · · ≥ |akp| is the sorted sequence of (a1, · · ·, ap) by their absolute values. Define the partial sums Si=j=1iakj, i = 1, · · ·, p, of the sorted sequence. Then there exists a unique m ∈ {1, 2 · · ·, p} satisfying

akm+1λSm(1-λ)+mλ<akm, (2.11)

where we set |akp+1| = 0, and the unique solution u=(u1,,up)T of (2.7) satisfies

ui={[ai-sgn(ai)λSm(1-λ)+mλ]/2νifiI0ifiIc, (2.12)

where sgn is the sign function, I = {k1, · · ·, km} and ν is a positive constant chosen to make ||u*||λ = 1. Furthermore, let Inline graphic be the p × p diagonal matrix with the i-th diagonal element equal to the sign of ai for all 1 ≤ ip, we have

EIAI[λ]EIuI=aI/2ν, (2.13)

where ν̃ is a positive constant.

Now we propose the following algorithm to solve the optimization problem (2.7) in general cases based on Corollary 2.1.

Algorithm 2.2. (For solving (2.7))

  1. Sort the coordinates ai’s of a by their absolute values: |ak1|≥ |ak2|≥ · · · ≥ |akp|. Let bi = |aki|, 1 ≤ ip.

  2. Compute Si=j=1ibj and find the unique m ∈ {1, 2 · · ·, p} satisfying
    bm+1λSm(1-λ)+mλ<bm,

    where bp+1 = 0.

  3. Compute
    yi={bi-λSm(1-λ)+mλifim0ifi>m,
  4. Let û = (û1, · · ·, ûp) with ûki = sgn(aki)yi for 1 ≤ ip. Then the solution is u* = û/||û||λ.

The following corollary is useful in our proof of consistency results.

Corollary 2.2

Suppose that a is the same as in Corollary 2.1 and u=(u1,,up)T is the solution to (2.7) and satisfies (2.12). Let δ > 0 satisfy δ/2ν < 1, where ν is the constant in (2.12). Then the solution to the following problem,

maxuRp(a-δu)Tu,subjectto||u||λ1, (2.14)

is u* multiplied by a positive constant, where λ=λλ+(1-δ/2ν)(1-λ).

In the following, we study the convergence and limits of Algorithm 2.1. Let fλ(a) denote the unique solution to (2.7). By Theorem 2.1, for a fixed λ, fλ is a continuous function of a. Since λ is fixed in the following theorems, for simplicity, we omit the subscript λ of fλ hereafter. As a function from ℝp\{0} to the unit sphere {u: ||u||λ = 1} in (ℝp, ||·||λ), f has the following property,

f(ca)=sgn(c)f(a),0cR.

Noting that u(k+1) = f (Σu(k)) in Algorithm 2.1, we define a subset Inline graphic of {u: ||u||λ = 1},

L={u:||u||λ=1,u=f(u)}. (2.15)

Then Inline graphic is the collection of all fixed points of the function fΣ. If the initial vector u(0) in Algorithm 2.1 is a point in Inline graphic, we have u(0) = u(1) = u(2) = · · ·. In this case, Algorithm 2.1 converges to u(0). Hence, any point in Inline graphic is a possible limit of Algorithm 2.1. We will prove that the iterative procedure in Algorithm 2.1 with any initial vector that satisfies Σu(0)0 will converge to one point in Inline graphic. The limit may depend on the initial vector. The following theorem provides a characterization of the points in Inline graphic.

Theorem 2.3

Assume that 0 ≤ λ < 1. Let u* be a point in {u: ||u||λ = 1}. Suppose that the number of nonzero coordinates of u* is m and Σu* = (c1, c2, · · ·, cp)T. Let |ck1| ≥ |ck2| ≥ · · · ≥ |ckp| be the sorted sequence of ci’s by their absolute values. Define ai = |cki|, 1 ≤ ip. Then u*Inline graphic if and only if

  1. (a1, a2, · · ·, ap) satisfy (2.8) in Lemma 1.

  2. The index set I of nonzero coordinates of u* is equal to {k1, k2, · · ·, km} and uI is an eigenvector with nonzero eigenvalue of the following generalized eigenvalue problem
    Ix=μEIAI[λ]EIx,xRm, (2.16)

    where Inline graphic is the diagonal matrix with the i-th diagonal element equal to the sign of ui for all 1 ≤ ip.

Now we study the convergence of the iteration in Algorithm 2.1 and its limits.

Theorem 2.4

Assume that 0 ≤ λ < 1. Suppose that for any subset J of {1, 2, · · ·, p}, and any p × p diagonal matrix Inline graphic with diagonal elements in {1, −1, 0} and the diagonal elements of Inline graphic nonzero, all the positive eigenvalues of the following generalized eigenvalue problem have multiplicities 1,

Jx=μEJAJ[λ]EJx. (2.17)

Assume that for any u*Inline graphic,

maxiIciλjIcj(1-λ)+mλ, (2.18)

where (c1, c2, · · ·, cp)T = Σu*, I is the index set of nonzero coordinates of u* and m is the size of I. Then the iteration in Algorithm 2.1 converges to a point in Inline graphic (the limit depends on the initial vector). Furthermore, only those u*Inline graphic with uI being the leading eigenvector of (2.16) are stable limits.

Remark

We say that a limit point is stable if there exists a neighborhood of this point such that once the iteration falls in the neighborhood, it will converge to the limit point.

From the proof of Theorem 2.4, for any unstable limit in Inline graphic, only an iteration sequence with all but finite points belonging to an affine subspace through that point with dimension strictly less than p can converge to it. Therefore, in practice, due to the rounding errors, the iteration sequence can only converge to stable limits. The solution to (2.2) is one of stable limits in Inline graphic. Although the stable limits may not be unique, due to the stringent conditions for stable limits in Theorem 2.3, there are only a few stable limits. Note that if u* is a stable limit, so is −u*. We do not differentiate the two limits. Our simulation studies and applications to real data suggest that when λ is small or the differences between the adjacent eigenvalues of covariance matrix are large, there is only one stable limit. For the cases with large λ and the close adjacent eigenvalues of covariance matrix, there can be two or three stable limits. For high dimensional data, a small λ is typically enough to achieve sparsity of the coefficients and high percentage of explained variance. In the case that there is more than one limit, if the initial vector is randomly chosen, our simulations suggest that with larger probability, the iteration converges to the solution to (2.2) than any other limit. These observations are also true for higher order sparse PCs.

2.2. Higher order sparse principal components

Now we propose a similar iterative algorithm for solving the optimization problem (2.4) or (2.5) for the higher order PCs. We will consider the following more general problem of which (2.4) and (2.5) are special cases:

max||v||2=1,vMvTv||v||λ2, (2.19)

where M is a linear subspace of ℝp. M is the subspace spanned by {v1, · · ·, vk−1} for Definition 2, and M is spanned by {Σv1, · · ·, Σvk−1} for Definition 3. We propose the following algorithm to solve (2.19).

Algorithm 2.3. (For solving (2.19))

  1. Choose an initial vector u(0) ∈ ℝp such that Σu(0)M.

  2. Iteratively compute u(1), u(2), · · ·, until convergence as follows: for any i ≥ 1, let w(i−1) = Σu(i−1), then u(i) solves the following problem
    maxuRp(w(i-1))Tu,subjectto||u||λk1,uM.

The optimization problem in the second step of Algorithm 2.3 is a special case of the following problem: given aM,

maxuRpaTu,subjectto||u||λ1,uM. (2.20)

For a general M, we cannot give an explicit solution to (2.20) as in Theorem 2.2. We will develop a method to solve (2.20) by relating the optimization problem (2.20) to the optimization problem (2.7) which has an explicit solution and can be solved quickly. Recall that f (a) is the solution to the optimization problem (2.7) and is a continuous function of a.

Theorem 2.5

Assume that aM. Let a be the orthogonal projection of a onto the orthogonal complement of M. Let q be the dimension of M and {ψ1, · · ·, ψq} be an orthonormal basis of M. Then there exists t=(t1,,tq)Rq such that f(a+t1ψ1++tqψq)M. Moreover, f(a+t1ψ1++tqψq) is the unique solution to (2.20).

By Theorem 2.5, solving (2.20) is equivalent to finding t* ∈ ℝq such that f(a+t1ψ1++tqψq)M. For any t = (t1, · · ·, tq)T ∈ ℝq, we define H(t) to be the squared norm of the orthogonal projection of f (a + t1ψ1 + · · ·+ tq ψq) onto M. Let Ψ = (ψ1, …, ψq), where ψj = (ψ1j, · · ·, ψpj)T, 1 ≤ jq, are basis vectors, and x(t) = f (a + t1ψ1 + · · · + tq ψq) = f (a + Ψt). Then

H(t)=i=1q(ψiTf(a+t1ψ1++tqψq))2=x(t)TΨΨTx(t), (2.21)

and H(t) is a continuous function of t. It follows from Theorem 2.5 that H(t*) = 0 and t* is also the minimum point of H(t). We will show that H(t) is a piecewise smooth function. Let

c(t)=(c1(t),,cp(t))T=a+Ψt,t=(t1,,tq)TRq. (2.22)

ci(t)’s are linear functions of t. Let |ck1(t)| ≥ |ck2(t)| ≥ · · · ≥ |ckq(t)| be the sorted sequence of {c1(t), · · ·, cp(t)} by their absolute values. Note that ki, 1 ≤ ip, depends on t. Define the following subsets of ℝq,

Jm={tRq:ckm+1(t)=λ1imcki(t)(1-λ)+mλ},1mp-1, (2.23)

each of which is a union of faces of a finite number of polyhedrons.

Theorem 2.6

H(t) is smooth at any point tm=1p-1Jm. Suppose that

x(t)=(x1(t),,xp(t))T

is the solution to the following optimization problem

maxuRp(a+Ψt)Tu,subjectto||u||λ1. (2.24)

Then the first and the second partial derivatives of H(t) are

H(t)=G(t)-H(t)E(t)ν,2H(t)=4H(t)E(t)E(t)T-2[E(t)G(t)T+G(t)E(t)T]+Π(t)-H(t)F(t)2ν2,

where

E(t)=(1-λ)K(t)Tx(t)+λ||x(t)||1K(t)TN(t)1,G(t)=K(t)TΨΨTx(t),F(t)=(1-λ)K(t)TK(t)+λK(t)TN(t)T11TN(t)K(t),Π(t)=K(t)TΨΨTK(t),K(t)=N(t)(I-λ(1-λ)+mλ11T)N(t)Ψ,

I is the p-dimensional identity matrix, 1 is the p vector with all coordinates equal to 1, and N(t) is the p-dimensional diagonal matrix with the i-th diagonal element equal to sgn(xi(t)), 1 ≤ ip. 2ν is a scale constant which is the same as that in (2.12) with a and u* replaced with a + Ψt and x(t), respectively.

By Theorem 2.6, the first and the second partial derivatives of H(t) only involve c(t) and x(t) which can be obtained quickly from Algorithm 2.2. Hence, based on Theorems 2.5 and 2.6, we propose the following algorithm to solve (2.20) by using the Newton method. It is worth to point out that algorithms based on other numerical methods can also be proposed to get the zero point and minimum point t* of H(t), which is equivalent to solving (2.20). Some of them do not use the second derivative of H(t).

Algorithm 2.4. (For solving (2.20))

  1. Compute the projection matrix P = ΨΨT onto M.

  2. Compute the orthogonal projection a = (IP)a of a onto the rthogonal complement space M of M, where I is the identity matrix.

  3. Set the initial value of t equal to zero.

  4. Compute the solution x(t) to (2.24) using Algorithm 2.2.

  5. Compute the first and the second partial derivatives, ∇H(t) and ∇2H(t), of H(t) using Theorem 2.6.

  6. Compute Δt = −(∇2H(t))−1H(t).

  7. Update the value of t by setting t = t + Δt. Then go back to Step 4 and repeat the procedure until convergence.

In some cases, the second derivatives ∇2H(t) can be nearly singular. We can use the following formula

Δt=-(2H(t)+εI)-1H(t),

to calculate Δt in Step 6 of Algorithm 2.4, where ε is the sum of two times of the absolute value of the smallest eigenvalue of ∇2H(t) and a small number (for example, 0.001).

Now we will give the convergence results for Algorithm 2.3 which are very similar to those for the first PC. For any subspace M, define f M(a) to be the solution to (2.20). Then f M is a continuous function from ℝp\M to M∩{u: ||u||λ = 1}, where M is the orthogonal complement of M. Let

LM={u:u=fM(u)}. (2.25)

The following theorem involves some technical conditions which will be given in its proof.

Theorem 2.7

Assume that 0 ≤ λ < 1. Suppose that all u*Inline graphic satisfy (5.47). Then the iteration in Algorithm 2.3 converges to one point in Inline graphic (the limit depends on the initial vector). Furthermore, only u*Inline graphic with uI being a leading eigenvector of problem (5.48) are stable limits, where I is the index set of nonzero coordinates of u*.

2.3. Choice of the tuning parameters

The selection of tuning parameters is crucial for the performance of any penalized procedure. The main purpose of sparse PCA is twofold. On the one hand, we want to find principal components which can explain as much as possible variations in data sets. On the other hand, we want to make the coefficients of the principal components as sparse as possible. However, as the tuning parameters increases, sparsity increases, but variation explained decreases. We have to achieve a suitable tradeoff between these two goals. Hence we design the following criteria to choose appropriate tuning parameters. The criteria are similar to the AIC and BIC criteria for linear models. Since we have one tuning parameter for each component, we choose the tuning parameters sequentially from low-order components to high-order components.

For the first principal component, we choose λ1 which maximize

C1(λ)=(1-a)q1(λ)+ak1(λ)p, (2.26)

where p is the dimension of the coefficients, q1(λ) is the ratio of the explained variances by the first PCs when the tuning parameter is λ to the explained variances when it is zero (the classic PCA), k1(λ) is the number of zeros in the coefficients of the first PC when the tuning parameter is λ. a ∈ (0, 1) is the weight which is chosen based on the purpose of the researchers, that is, the relative importance of sparsity and the explained variations. We choose a larger a if we need sparser principal components.

Suppose that we have chosen the tuning parameters for the first m − 1 components and obtained the first m − 1 principal components v1, · · ·, vm−1. Then the tuning parameter λm of the m-th principal component is chosen by maximizing

Cm(λ)=(1-a)qm(λ,v1,,vm-1)+akm(λ)p, (2.27)

where a is the same as in (2.26) (that is, a is the same for all components), qm(λ, v1, · · ·, vm−1) is the ratio of the explained variances by the m-th PCs when the tuning parameter is λ to the explained variances when it is zero (the m-th PC should be orthogonal to or uncorrelated with the first m − 1 components), km(λ) is the number of zeros in the coefficients of the m-th PCs when the tuning parameter is λ.

For the total number K of the principal components, we can choose K such that the total variations explained by the first K components is greater than a given large number, or the K + 1 component explain variance less than a given small number.

3. Consistency in high dimensions

In this section, we consider the single component model discussed in John-stone and Lu [7]. We will prove that the PC obtained by our method is consistent under this model when the ratio of the number of variables to the sample size goes to a nonnegative constant.

For each n ≥ 1, we assume that n i.i.d. random samples, xi(n)Rpn, 1 ≤ in, are drawn from the following model,

xi(n)=wi(n)ρ(n)+σzi(n),i=1,,n, (3.1)

where ρ(n)=(ρ1(n),,ρpn(n))TRpn, the single component, is a nonrandom vector. { wi(n), 1 ≤ in, 1 ≤ n} is a set of i.i.d. standard normal variables and zi(n)=(zi1(n),,zipn(n))TRpn are the noise vectors with { zij(n): 1 ≤ jpn, 1 ≤ in} a set of i.i.d. standard normal variables. In this paper, for simplicity, we assume that the variance of noise, σ2, does not depend on n. However, the main result can be easily extended to the cases where σ2 depends on n and is bounded. In Johnstone and Lu [7], ||ρ(n)||2 is assumed to converge to some positive constant. Without loss of generality, we will assume that ||ρ (n)||2 = 1 for simplicity. Our proof can be revised to give the consistency result for the case where the sequence, {||ρ(n)||2: 1 ≤ n}, is bounded.

Our estimate ρ̂(n) of the single component ρ(n) is the solution to the following optimization problem,

maxvRp,||v||2=1vT^(n)v(1-λ(n))||v||22+λ(n)||v||12, (3.2)

where ^(n)=1ni=1nxi(n)(xi(n))T and 0 ≤ λ(n) < 1 is the tuning parameter. We will show that ||ρ̂(n)ρ(n)||2 → 0 as n → ∞ under some conditions. An essential condition for consistency of the method in Johnstone and Lu [7] is the uniform “weak lq decay” condition which is equivalent to the concentration of energy in a few coordinates. Suppose that

ρ(n)(1)ρ(n)(2)ρ(n)(pn)

are the ordered absolute values of the coordinates of ρ(n). Then the uniform “weak lq decay” condition is that for some 0 < q < 2 and C > 0,

ρ(n)(ν)Cν-1/q,ν=1,,pn,n=1,2,, (3.3)

where C and q do not depend on n. In addition to this, we need another technical condition. For each n, we define the partial sum Si(n)=ν=1iρ(n)(ν), 1 ≤ ipn. For notational reason, for all i > pn, we define Si(n)=Spn(n). Then the technical condition is

limsupmsupn1S2m(n)-Sm(n)Sm(n)<1. (3.4)

We conjecture that our method is consistent even without this condition. This condition implies that the sequences |ρ(n)|(1) ≥ |ρ(n)|(2) ≥ · · · have a uniform asymptotic decreasing rate. Note that if the condition (3.3) with 0 < q < 1 is satisfied, then (3.4) is automatically true with the limit equal to 0. However, if 1 ≤ q < 2, the condition (3.4) can not be derived from (3.3). Here is an example. Suppose that 1 ≤ q < 2, for each n, let pn = 2n, ρ(n)(1)=1-(2n-1)(2n)-2/q and |ρ(n)|(2) = · · · = |ρ(n)|(pn) = (2n)−1/q, then the condition (3.3) is satisfied with C = 1. However, a simple calculation shows that the limit in (3.4) is 1 and hence the condition (3.4) is not true in this case. If |ρ(n)|(ν) decrease with order ν−1/q uniformly in the sense that

limmsupn:pnm|ρ(n)(m)m-1/q-cn|=0,

where cn, n ≥ 1, are positive numbers with infn cn > 0, the condition (3.4) holds. Now we give the main consistency result in high dimension for our method.

Theorem 3.1

Suppose that ||ρ (n)||2 = 1 for all n ≥ 1 and satisfies the conditions (3.3) and (3.4). Assume that pn is a nonrandom sequence such that pn/nc as n → ∞, where c is a real number. Suppose that c satisfies the following condition,

4σc+4σ2c+6σ2c<1, (3.5)

and there exists 0 < α < 1/3 such that

liminfnλ(n)nα>0andlimnλ(n)=0. (3.6)

Then we have limn→∞||ρ̂(n)ρ (n)||2 = 0. Moreover, the number of nonzero coordinates of ρ̂(n) is Op(1/λ(n)).

4. Examples

4.1. Pitprops data

The pitprops data, with 180 observations and 13 measured variables, was first introduced by Jeffers [6]. It is a classic example showing the difficulty of interpreting PCs. To illustrate the performance of their sparse PCA methods, several authors have studied the pitprops data, such as Jolliffe et al. [8] (SCot-LASS), Zou et al. [16] (SPCA), and Shen and Huang [13] (sPCA-rSVD).

We first compare our method with SPCA and sPCA-rSVD. We apply our method to this data set and take (0.1, 0.1, 0.13, 0.12, 0.2, 0.2) as the values of λ for the first six PCs. The obtained sparse PC loadings are shown in Table 1. We use Definition 3 for higher order PCs. Then the six sparse PCs are uncorrelated and the explained variances by these PCs can be clearly distiguished. The percentages of the explained variances are listed in Table 1. They are close to the percentages achieved by the classical PCA: 32.4, 18.3, 14.4, 8.5, 7.0, 6.3, respectively. We can obtain sparser PCs by increasing the value of λ, but the percentages of explained variances will decrease.

Table 1.

Loadings of the first six PCs by our method on the pitprops data

Variables PC1 PC2 PC3 PC4 PC5 PC6
topdiam −0.471 0 0.197 0 0 0
length −0.484 0 0.222 0 −0.045 0
moist 0 −0.684 0 0.060 0.261 0
testsg 0 −0.659 −0.072 0.063 0.189 −0.121
ovensg 0 0 −0.745 0 0 −0.455
ringtop −0.134 0 −0.400 0 −0.137 0.345
ringbut −0.383 0 −0.110 0 −0.139 0.299
bowmax −0.254 0.137 0 −0.092 0 −0.679
bowdist −0.383 0 0 0 −0.080 0
whorls −0.410 0.163 0 0.035 0 0
clear 0 0 0 −0.978 −0.040 −0.091
knots 0 −0.229 0 0 −0.921 −0.318
diaknot 0 0 0.424 0.163 0 0

Variance(%) 0.301 0.156 0.132 0.078 0.065 0.046
Cum. var.(%) 0.301 0.457 0.589 0.666 0.731 0.778

The sparse PCs produced by SPCA and sPCA-rSVD are neither uncorrelated nor orthogonal. We calculate the correlation matrices of the first six sparse PCs obtained in Zou et al. [16] and Shen and Huang [13], respectively. The first matrix in the following is the correlation matrix for SPCA and the second one is for sPCA-rSVD:

(1-0.17-0.33-0.00-0.200.08-0.1710.13-0.14-0.220.08-0.330.1310.100.14-0.40-0.00-0.140.1010.03-0.01-0.20-0.220.140.031-0.180.080.08-0.40-0.01-0.181),(10.20-0.46-0.33-0.20-0.040.201-0.110.270.130.05-0.46-0.1110.260.16-0.10-0.330.270.2610.200.07-0.200.130.160.201-0.05-0.040.05-0.100.07-0.051).

We can see that some PCs in both methods are strongly correlated and there is no clear pattern in these correlation matrices. For example, in the second matrix, the third PC is strongly correlated to the first, but weakly correlated with the second and the sixth PCs.

We also compare our method with SCotLASS. In Table 2 (provided by one of the anonymous referees), the loadings of principal components produced by SCotLASS for the pitprops data are listed. The loading are orthogonal to each other and all have l2 norms equal to one. However, the principal components are correlated. In order to make a fair comparison between our method and SCotLASS, we also produce principal components with orthonormal loadings. The loadings by our method with the tuning parameters equal to 0.1, 0.12, 0.12, 0.3, 0.3, 0.3 are listed in Table 3. One can see that two sets of principal components explain almost the same variances, however, the loadings by our method are sparser than those by SCotLASS.

Table 2.

Loadings of the first six PCs by SCotLASS on the pitprops data

Variables PC1 PC2 PC3 PC4 PC5 PC6
topdiam 0.471 0 −0.081 0.001 0 0
length 0.484 0 −0.101 0.030 0 0
moist 0 −0.705 −0.013 0 0 0
testsg 0 −0.709 0 0 −0.001 0
ovensg 0 0 0.590 0 0 −0.762
ringtop 0.135 −0.026 0.358 0.009 0 0
ringbut 0.383 0 0.101 0 0 0
bowmax 0.254 0 0 −0.062 0 0
bowdist 0.383 0 0 0 0.051 0
whorls 0.410 0.009 0 0 −0.049 0
clear 0 0 0 0 0.997 0
knots 0 0 0 0.998 0 0
diaknot 0 0 −0.704 0 0 −0.638

Variance(%) 0.301 0.146 0.129 0.08 0.08 0.060
Cum. var.(%) 0.301 0.447 0.576 0.656 0.736 0.796

Table 3.

Orthonormal loadings of the first six PCs by our method on the pitprops data

Variables PC1 PC2 PC3 PC4 PC5 PC6
topdiam 0.471 0 −0.16 0 0 0
length 0.484 0 −0.175 0 0 0
moist 0 −0.705 −0.013 0 0 0
testsg 0 −0.708 0 0 0 0
ovensg 0 0 0.542 0 0 −0.744
ringtop 0.134 −0.02 0.470 0 0 0
ringbut 0.383 0 0.253 0 0 0
bowmax 0.254 0 0 0 0 0
bowdist 0.383 0 0 0 0 0
whorls 0.410 0.007 0 0 0 0
clear 0 0 0 0 1 0
knots 0 0 0 1 0 0
diaknot 0 0 −0.604 0 0 −0.669

Variance(%) 0.301 0.146 0.146 0.077 0.077 0.061
Cum. var.(%) 0.301 0.447 0.593 0.671 0.748 0.809

Finally, we use the criteria (2.26) and (2.27) to choose the tuning parameters for the first six principal components, respectively. Suppose that sparsity and the variance explained are equally important. We choose a = 0.5 in (2.26) and (2.27). Then the tuning parameters chosen by these criteria are 0.10, 0.35, 0.05, 0.15, 0.30, 0.40. The corresponding loadings of the uncorrelated principal components and the proportions of variances explained are listed in Table 4. There are 44 zeros in the loadings and 78.2% variance is explained by the first six principal components.

Table 4.

loadings of the first six PCs corresponding to the tuning parameters chosen based on criteria (2.26) and (2.27)

Variables PC1 PC2 PC3 PC4 PC5 PC6
topdiam 0.471 0 0.246 0 0 0
length 0.484 0 0.268 0 0 −0.266
moist 0 −0.717 0 0.089 0 0.330
testsg 0 −0.618 −0.106 0.088 0 0
ovensg 0 0 −0.576 0 −0.555 −0.396
ringtop 0.134 0 −0.450 0 0 0
ringbut 0.383 0 −0.262 0 0 0
bowmax 0.254 0 0 −0.064 0 0
bowdist 0.383 0 0.093 0 0 0
whorls 0.410 0.323 0 0 0.017 0
clear 0 0 0 −0.977 −0.126 −0.093
knots 0 0 0 0 0.476 −0.782
diaknot 0 0 0.494 0.157 −0.670 −0.208

Variance(%) 0.301 0.140 0.145 0.076 0.061 0.058
Cum. var.(%) 0.301 0.441 0.587 0.662 0.723 0.782

4.2. NCI cell line data

We apply the three methods to the NCI60 dataset which has n = 60 samples and p = 7129 genes (http://www-genome.wi.mit.edu/mpr/NCI60/). We list the percentages of explained variances and the number of nonzero loadings for different tuning parameters in our method in Table 5. We display, in Figure 2, the percentage of explained variances and the sparsity of the first PCs obtained from the three methods for different tuning parameters. The three curves are almost the same. Actually, the variables selected by the three methods are almost identical. For example, when the numbers of nonzero loadings are all 62 for the three methods, the variables selected by sPCA-rSVD and our method are exactly the same and differ from those selected from SPCA only by 4 variables.

Table 5.

Percentages of variations and the number of nonzero coefficients for different tuning parameters in NCI cell line data

λ 0 0.001 0.005 0.01
1st PC 15.9%(7129) 15.1%(411) 13.3%(155) 12%(101)
2nd PC 11.6% (7129) 10.2%(565) 8.4%(155) 7.8% (70)
3rd PC 7.9% (7129) 7.7% (579) 6.7%(216) 5.7%(149)
4th PC 5.6%(7129) 5.2% (437) 4.6%(142) 4.2% (86)
5th PC 4.9%(7129) 4.5%(522) 3.9% (164) 3.4%(110)
6th PC 4.2% (7129) 4.0%(403) 3.6%(123) 3.4%(79)

Figure 2.

Figure 2

Plot of percentages of explained variances versus the sparsity of the first PCs of the three methods for the NCI60 data.

For higher order PCs, the three methods have quite different results. For SPCA, the first PC is strongly affected by the tuning parameter for the second PC, which is shown in Figure 3. It is hard to tune the parameters simultaneously for SPCA. The sPCA-rSVD and our method are sequential methods, that is, the tuning of parameters for higher order PCs does not affect the lower order PCs. To compare these two methods, we choose the tuning parameters for the first six PCs such that the numbers of nonzero loadings are the same for the two methods and equal to 101, 70, 149, 86, 110, 79, respectively. Our method provides uncorrelated PCs. But there exist strong correlations between some PCs obtained from sPCA-rSVD as shown in the following correlation matrix. Furthermore, in this case, the two methods selected different sets of variables. The numbers of common variables selected by the two methods are 101, 69, 137, 73, 33, 41 for the first six PCs, respectively.

Figure 3.

Figure 3

Influence of the parameter for the 2nd PC on the first PC in SPCA.

(1-0.0370.059-0.37-0.34-0.016-0.03710.210.016-0.094-0.220.0590.211-0.051-0.15-0.084-0.370.016-0.05110.100.027-0.34-0.094-0.150.1010.013-0.016-0.22-0.0840.0270.0131)

5. Proofs

Proof of Theorem 2.1

We first show that ||·||λ is a norm. It is easy to see that for any η ∈ ℝ and u ∈ ℝp, we have that ||ηu||λ = |η|||u||λ, ||u||λ ≥ 0 and ||u||λ = 0 if and only if u = 0. Hence, we only need to show the triangle inequality. For any u, v ∈ ℝp,

||u+v||λ2=(1-λ)||u+v||22+λ||u+v||12=(1-λ)(i=1p(ui+vi)2)+λ(i=1pui+vi)2(1-λ)(i=1pui2+i=1pvi2+2i=1puivi)+λ(i=1pui+i=1pvi)2=(1-λ)||u||22+(1-λ)||v||22+2(1-λ)i=1puivi+λ||u||12+λ||v||12+2λ||u||1||v||1=||u||λ2+||v||λ2+2(1-λ)i=1puivi+2λ||u||1||v||1||u||λ2+||v||λ2+2((1-λ)i=1pui2+λ||u||12)((1-λ)i=1pvi2+λ||v||12)=||u||λ2+||v||λ2+2||u||λ||v||λ=(||u||λ+||v||λ)2, (5.1)

where the equality in the inequality in the third line of (5.1) holds if and only if for each 1 ≤ ip, the coordinates ui and vi have the same signs if both of them are nonzero. The inequality in the second line from the last is due to Cauchy-Schwartz inequality and the equality holds if and only if v is a nonnegative scalar multiple of u or vice versa. Therefore, we have the triangle inequality,

||u+v||λ||u||λ+||v||λ, (5.2)

with equality if and only if v is a nonnegative scalar multiple of u or vice versa. The completeness of (ℝp, ||·||λ) follows from the fact that ||u||2 ≤ ||u||λ ≤ ||u||1 for any u ∈ ℝp. Hence, (ℝp, ||·||λ) is a Banach space.

Now we show the unit ball of (ℝp, ||·||λ) is a strictly convex set, that is, for any 0 < η < 1 and uv with ||u||λ = ||v||λ = 1, ||ηu + (1 − η)v||λ < 1. Note that ||ηu + (1 − η)v||λη||u||λ + (1 − η) ||v||λ = 1 and the equality holds if and only if v is a nonnegative scalar multiple of u by the arguments in the last paragraph. Because ||u||λ = ||v||λ = 1 and uv, ||ηu + (1 − η)v||λ = 1 if and only if u = v. Thus, we have ||ηu + (1 − η)v||λ < 1.

(2.7) has at least one solution because its objective function is continuous and its feasible region is a compact set. Now suppose that we have two different solutions u1u2 to (2.7). Then aTu1 = aTu2 = max||u||λ≤ 1 aT u. It is obvious that ||u1||λ = ||u2||λ = 1. Let u¯=u1+u22. We have ||ū||λ < 1 by the strict convexness. Hence,

aT(u¯||u¯||λ)=(aTu1+aTu22||u¯||λ)=1||u¯||λmax||u||λ1aTu>max||u||λ1aTu.

We have obtained a contradiction. Therefore, the sulotion to (2.7) is unique.

Let f (a) denote the unique solution to (2.7). Then f is a function from ℝp\{0} to {u: ||u||λ = 1}. We show that f is a continuous function. Assume contrary, that is, there exist a point a ∈ ℝp\{0} such that f is discontinuous at a. By the compactness of {u: ||u||λ = 1}, we can find a sequence {an, n ≥ 1} in ℝp\{0} such that ana and f (an) → ûf (a). By the definition of f, we have anTf(an)anTf(a). Let n → ∞, then we have aT ûaT f (a) = max||u||λ≤ 1 aT u. Therefore, both û and f (a) are solutions to (2.7), which contradicts to the uniqueness of the solution to (2.7). Thus f is a continuous function.

Proof of Lemma 1

Define

m=inf{1ip:ai+1λSi(1-λ)+iλ}. (5.3)

Since we have set ap+1 = 0, p belongs to the subset in the right hand side of (5.3). Hence, this subset is nonempty. By the definition (5.3), we have (define S0 = 0 if m = 1)

am>λSm-1(1-λ)+(m-1)λ,andhence,λSm-1<((1-λ)+(m-1)λ)am.

Therefore,

λSm(1-λ)+mλ=λSm-1+λam(1-λ)+mλ<((1-λ)+(m-1)λ)am+λam(1-λ)+mλ=am, (5.4)

and (2.8) follows from (5.3) and (5.4). By similar arguments as in (5.4), we can show that

λSm+1(1-λ)+(m+1)λam+1,λSm+2(1-λ)+(m+2)λam+2,. (5.5)

By the definition (5.3), for any i < m, we have

λSi(1-λ)+iλ<ai+1ai. (5.6)

(5.4), (5.5) and (5.6) lead to (2.9). The uniqueness of m follows from (2.9) and the definition (5.3).

Proof of Theorem 2.2

Due to the conditions on a in this theorem, we have ui0, for all 1 ≤ ip. Hence, u* is also the unique solution to the following convex optimization problem,

minimize-aTu,subjectto(1-λ)i=1pui2+λ(i=1pui)21,ui0,1ip. (5.7)

The condition (1-λ)i=1pui2+λ(i=1pui)21 can be written as uT A[λ]u − 1 ≤ 0. Because (5.7) is a convex optimization problem, u* is the solution to (5.7) if and only if it satisfies the following Karush-Kuhn-Tucker (KKT) conditions (see Section 5.5.3 of Boyd and Vandenberghe [2]),

{ui[-aTu+ν(uTA[λ]u-1)-i=1pμiui]u=u=0,1ip,ν(uTA[λ]u-1)=0,ν0,μiui=0,μi0,1ip, (5.8)

where μ = (μ1, · · ·, μp)T and ν̃ are unknown multipliers. From the equalities in the first line of (5.8), we obtain a set of linear equations

A[λ]u=(a+μ)/2ν. (5.9)

The solution to this set of linear equations is

ui=[(α+β)(ai+μi)-β(Sp+Δ)]/2ν,1ip, (5.10)

where Δ=i=1pμi and

α=1+(p-2)λ1+(p-2)λ-(p-1)λ2,β=λ1+(p-2)λ-(p-1)λ2. (5.11)

Because μiui=0 for all 1 ≤ ip, if ui>0, then μi = 0 and if μi > 0, then ui=0. Therefore, by (5.10), we have either ui=0 or ui=[(α+β)ai-β(Sp+Δ)]/2ν>0. Thus, we have u1u2up0. Define

m=max{i{1,,p}:ui>0}. (5.12)

We will show that m satisfies (2.8). Because ui>0 and μi = 0 for all im, and ui=0 for all i > m, by (5.10),

0=i=m+1pui=i=m+1p[(α+β)(ai+μi)-β(Sp+Δ)]/2ν=[(α+β)(Sp-Sm+Δ)-(p-m)β(Sp+Δ)]/2ν,

from which we obtain

Δ=(p-m)βSp-(α+β)(Sp-Sm)α-(p-m-1)β. (5.13)

It follows from (5.10), um>0 and μm = 0 that [(α + β)amβ(Sp + Δ)]/2ν̃ > 0, and hence by (5.13), λSm(1-λ)+mλ<am. Similarly, from (5.10), um+1=0 andμm+1 ≥ 0, we can obtain am+1λSm(1-λ)+mλ. Hence, m satisfies (2.8). The expressions of the solutions in (2.10) can be obtained from (5.10) with ν = ν̃/(α + β). Since I = {1, 2, · · ·, m}, ui>0, μi = 0 for all iI and ui=0 for all iIc. By (5.9), we have AI[λ]uI=aI/2ν.

Proof of Corollary 2.2

Without loss of generality, we assume that a1a2 ≥ · · · ≥ ap ≥ 0. Let ã = (ã1, · · ·, ãp) = aδu* and i = ã1 + · · · + ãi. Then u* satisfies (2.10) and

ai={(1-δ2ν)ai+δ2νλSm(1-λ)+mλifimaiifi>m.

Let u=(u1,,up) be the solution to (2.14). A simple calculation shows that

Sm=[1-δ2ν1-λ(1-λ)+mλ]Sm,apam+2am+1λSm(1-λ)+mλ=λSm(1-λ)+mλ<ama1.

Hence, by Theorem 2.2, ui=0=ui for i > m and

ui=12ν[ai-λSm(1-λ)+mλ]=2ν2ν(1-δ2ν)ui,

for im, where ν̃ is a scale constant.

Proof of Theorem 2.3

Because u*Inline graphic if and only if u* is the solution to the optimization problem (2.7) with a replaced by Σu*, this theorem follows from Corollary 2.1 and (2.16) can be derived from (2.13) by noticing that ui and ci have the same sign for all iI.

Proof of Theorem 2.4

Recall that u(0), u(1), u(2), · · · is the iteration sequence in Algorithm 2.1.

Lemma 2

For any u with ||u||λ = 1. Let u′ = f(Σu). Then we have uTΣuuT Σu′, where equality holds if and only if u = u′. Hence, u(k)T Σu(k), k ≥ 1, is an increasing sequence.

Proof of Lemma 2

Because Σ is a nonnegative definite symmetric matrix, we can find a matrix B such that Σ = BT B. By the definition of f and Cauchy-Schwarz inequality, we have

uTu=(u)Tu(u)Tu=uTu=uTBTBu||Bu||2||Bu||2=uTBTBuuTBTBu=uTuuTu. (5.14)

Thus uT ΣuuT Σu′, where equality holds if and only if u = u′ because equality in the first inequality of (5.14) holds if and only if u = u′.

We first assume that uI is the leading eigenvector of (2.16) where I is the index set of nonzero coordinates of u*. We will find a neighborhood Inline graphic of u* such that if u(k)Inline graphic ∩ {u: ||u||λ = 1} for some k ≥ 1, then the iteration will converge to u*. Without loss of generality, we assume that

I={1,2,,m}andallthecoordinatesofuarenonnegative. (5.15)

We decompose the matrix Σ as follows

=[IΛΛTIc], (5.16)

where Ic = {1, 2, · · ·, p}\I, ΣI and ΣIc are m × m and (pm) × (pm) matrices, respectively, and Λ is a m × (pm) matrix. It follows from (2.16) and (5.15), uI is the leading eigenvector of the following generalized eigenvalue problem,

Ix=μAI[λ]x,xRm. (5.17)

Recall that A[λ] is the p × p matrix with diagonal elements equal to 1 and off-diagonal elements equal to λ.

Lemma 3

Suppose that g, g′ ∈ ℝm are two eigenvectors of (5.17) corresponding to different eigenvalues. Then we have gTAI[λ]g=0.

Proof of Lemma 3

Suppose that the eigenvalues corresponding to g and g′ are μ and μ′, respectively. Because

gTIg=gT(Ig)=μgTAI[λ]g (5.18)

and

gTIg=(Ig)Tg=μ(AI[λ]g)Tg=μgTAI[λ]g, (5.19)

by subtracting (5.18) from (5.19), we obtain

0=μgTAI[λ]g-μgTAI[λ]g=(μ-μ)gTAI[λ]g.

Since μμ′, we have gTAI[λ]g=0.

Suppose that μ1μ2 ≥ · · · ≥ μm ≥ 0 are eigenvalues of (5.17) and the corresponding eigenvectors g1, g2, · · ·, gm ∈ ℝm satisfying

giTAI[λ]gi=1,1imandgiTAI[λ]gj=0,1ijm. (5.20)

Note that if μiμj, then by Lemma 3, the second equality in (5.20) is true. For μi = μj, we can apply the Gram-Schmidt orthogonalization procedure to choose the eigenvectors satisfying (5.20). Since uI is the leading eigenvector of (5.17), by the condition (2) in Theorem 2.3, we have μ1 ≠ 0. By the assumption of this theorem, the multiplicity of μ1 is one, hence uI is equal to g1 multiplied by a constant. By (5.15), without loss of generality, we can assume that all coordinates of g1 are positive. For 1 ≤ im, define hi to be the p-vector with the first m coordinates equal to gi and the last pm coordinates equal to zeros, that is,

hi=(giT,0)T,1im. (5.21)

Define hm+j, 1 ≤ jpm, to be the p-vectors with the m + j-th coordinate equal to 1 and other coordinates equal to 0. Then {h1, · · ·, hp} form a basis of ℝp and it follows from the following lemma that

uI=g1,u=h1. (5.22)

Lemma 4

There exists δ1 > 0 such that for any u = ηh1 + ε2h2 + · · · + εmhm with |1 − η| < δ1 and ε22++εm2<δ1, we have

||u||λ2=η2+ε22++εm2. (5.23)

Hence, in this case, ||u||λ = 1 if and only if η2=1-i=2p-mεi2.

Proof of Lemma 4

By the definition (5.21), the last pm coordinates of u are zeros. The first m coordinates of h1 are equal to coordinates of g1 which are positive. Therefore, if η is close enough to 1 and ε2, · · ·, εpm are small enough, the first m coordinates of u are positive. Thus, by (5.20),

||u||λ2=(1-λ)i=1mui2+λ(i=1mui)2=uIAI[λ]uIT=η2g1TAI[λ]g1+ε22g2TAI[λ]g2++εm2gmTAI[λ]gm=η2+ε22++εm2. (5.24)

Recall that Σu* = c = (c1, · · ·, cp)T. We sort the absolute values of ci’s into an decreasing sequence |ck1| ≥ |ck2| ≥ · · · ≥ |ckp| and define ai = |cki|, 1 ≤ ip. Since u*Inline graphic, that is, u* = f (Σu*) = f (c), by Corollary 2.1 and (5.15), I = {k1, k2, · · ·, km} = {1, 2, · · ·, m}, that is, {k1, k2, · · ·, km} is a permutation of {1, 2, · · ·, m}, and ui, 1 ≤ im, has the same sign as ci, that is

ci>0,1im. (5.25)

Let Si=j=1iai, 1 ≤ ip. By the condition (2.18),

am+1=maxiIciλiIci(1-λ)+mλ=λSm(1-λ)+mλ. (5.26)

It follows from (5.26) and the condition (2.11) in Corollary 2.1 that

am+1<λSm(1-λ)+mλ<am. (5.27)

Lemma 5

There exists a positive number δ2δ1 (δ1 is defined in Lemma 4) such that if there is a vector in the iteration sequence of Algorithm 2.1 belonging to the following subset Inline graphic of {u: ||u||λ = 1},

A={u:u=ηh1+ε2h2++εmhm,η2=1-i=2mεi2,i=2mεi2<δ22}, (5.28)

then the iteration sequence converges to u* = h1.

Proof of Lemma 5

Without loss of generality, we assume that u(0)Inline graphic. Suppose that

u(0)=ηh1+ε2h2++εmhm,η2=1-i=2mεi2.

Define (c1,,cp)T=u(0)=ηh1+ε2h2++εmhm. Then as i=2mεi20, by (5.22),

(c1,,cp)Th1=u=(c1,,cp)T. (5.29)

Hence, by (5.25), when i=2mεi2 is small enough, we have

c1>0,,cm>0. (5.30)

Let ck1ck2ckp be the sorted sequence. Define ai=cki, 1 ≤ ip and Si=j=1iai, 1 ≤ ip. Then, we have ajaj and SjSj, 1 ≤ jp, as i=2mεi20. Because by (5.27), maxm+1≤ i p|ci| = maxiI |ci| = am+1 < am = miniI |ci| = min1≤im |ci|, when i=2mεi2 is samll enough, we have maxm+1ipci<min1imci. Hence, { a1,,am} is a permutation of { c1,,cm} and { am+1,,ap} is a permutation of { cm+1,,cp}. In another word,

{k1,,km}isapermutationof{1,,m}. (5.31)

and am+1=minm+1ipci<max1imci=am. Therefore, we can choose δ2δ1 small enough such that (5.30), (5.31) and the following inequality (due to (5.27)) hold,

am+1<λSm(1-λ)+mλ<am. (5.32)

Now by (5.31), (5.32), and Corollary 2.1, the coordinates of u(1) with indices in Ic={m+1,,p}={km+1,,kp} are zeros. It follows from (2.13) that

AI[λ]uI(1)=1νIuI(0)=1ν(ηIg1+ε2Ig2++εmIgm)=1ν(ηu1AI[λ]g1+ε2μ2AI[λ]g2++εmμmAI[λ]gm), (5.33)

where ν a positive scale constant chosen to make ||u(1)||λ = 1. Since AI[λ] is invertible, we have uI(1)=1ν(ημ1g1+ε2μ2g2++εmμmgm). Hence, u(1)=1ν(ημ1h1+ε2μ2h2++εmμmhm). By (5.30) and (5.33), ||u(1)||λ2=(uI(1))TAI(λ)uI(1)=1ν2(η2μ12+ε22μ22++εm2μm2)=1. Hence, ν2=η2μ12+ε22μ22++εm2μm2. we have

u(1)=η(1)h1+ε2(1)h2++εm(1)hm,

where η(1)=ημ1ν and εi(1)=εiμiν, 2 ≤ im. By the assumptions of this theorem, the nonnegative eigenvalues have multiplicities 1, so μ1 > μ2 ≥ · · · ≥ μp. Then we have

i=2m(εi(1))2=ε22μ22++εm2μm2η2μ12+ε22μ22++εm2μm2ε22μ12++εm2μ12η2μ12+ε22μ12++εm2μ12=i=2mεi2<δ2.

Hence, u(1)Inline graphic. Similarly, all the iteration sequence

u(k)=η(k)h1+ε2(k)h2++εm(k)hmA,k1,

where (η(k))2=1-i=2m(εi(k))2 and

$(\varepsilon_i^{(k)})^2 = \frac{\varepsilon_i^2\mu_i^{2k}}{\eta^2\mu_1^{2k} + \varepsilon_2^2\mu_2^{2k} + \cdots + \varepsilon_m^2\mu_m^{2k}} = \frac{\varepsilon_i^2(\mu_i/\mu_1)^{2k}}{\eta^2 + \varepsilon_2^2(\mu_2/\mu_1)^{2k} + \cdots + \varepsilon_m^2(\mu_m/\mu_1)^{2k}} \to 0 \quad\text{as } k \to \infty.$

Therefore u(k)h1 as k → ∞.
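The rate in the last display can be made concrete with a small numerical sketch; the eigenvalues $\mu_i$ and the starting coefficients below are hypothetical and only serve to show the geometric decay driven by $(\mu_i/\mu_1)^{2k}$.

```python
import numpy as np

# Hypothetical generalized eigenvalues (mu_1 strictly largest) and
# starting coefficients of u^(0) in the basis h_1, ..., h_m.
mu = np.array([2.0, 1.2, 0.7])           # mu_1 > mu_2 >= mu_3
eps0 = np.array([0.05, 0.03])            # epsilon_2, epsilon_3
eta0_sq = 1.0 - np.sum(eps0 ** 2)

for k in range(1, 6):
    denom = eta0_sq * mu[0] ** (2 * k) + np.sum(eps0 ** 2 * mu[1:] ** (2 * k))
    eps_k_sq = eps0 ** 2 * mu[1:] ** (2 * k) / denom   # formula for (eps_i^(k))^2 above
    print(k, eps_k_sq)                   # shrinks geometrically like (mu_i/mu_1)^(2k)
```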

Now define a neighborhood of u* = h1 as follows

$\mathcal{N} = \Big\{u: u = (1+\omega_1)h_1 + \omega_2 h_2 + \cdots + \omega_m h_m + \cdots + \omega_p h_p,\ \omega_1^2 + \cdots + \omega_p^2 < \delta_3^2\Big\}.$ (5.34)

By a similar argument as that in the proof of Lemma 5, we can choose $\delta_3$ small enough such that if $u^{(k)} \in \mathcal{N} \cap \{u: \|u\|_\lambda = 1\}$ for some $k \ge 1$, then $u^{(k+1)} \in \mathcal{A}$, and the iteration converges to $u^*$ by Lemma 5.

Now we assume that $u^*_I$ is not the leading eigenvector of the problem (2.16). Without loss of generality, we assume that $u^*_I$ is the eigenvector corresponding to the second largest eigenvalue; the other cases have similar results and proofs. We will use the same notation as above and still assume that the first $m$ coordinates of $u^*$ are positive and the last $p-m$ coordinates are zeros. Hence, in this case, we have $u^* = h_2$ and $u^*_I = g_2$. By the assumptions of this theorem, both $\mu_1$ and $\mu_2$ have multiplicity 1. Therefore, we have $\mu_1 > \mu_2 > \mu_3 \ge \cdots$. We define the following two subsets of $\{u: \|u\|_\lambda = 1\}$ around $u^* = h_2$,

$\mathcal{D} = \Big\{u: u = \varepsilon_1 h_1 + \eta h_2 + \varepsilon_3 h_3 + \cdots + \varepsilon_m h_m,\ \eta^2 = 1 - \varepsilon_1^2 - \sum_{i=3}^m\varepsilon_i^2,\ \varepsilon_1^2 < \frac{2\mu_2}{\mu_1-\mu_2}\delta_4^2 \text{ and } \varepsilon_3^2 + \cdots + \varepsilon_m^2 < \delta_4^2\Big\},$ and

$\mathcal{D}' = \Big\{u: u = \varepsilon_1 h_1 + \eta h_2 + \varepsilon_3 h_3 + \cdots + \varepsilon_m h_m,\ \eta^2 = 1 - \varepsilon_1^2 - \sum_{i=3}^m\varepsilon_i^2,\ \varepsilon_1^2 < \frac{4\mu_1^2}{\mu_2(\mu_1-\mu_2)}\delta_4^2 \text{ and } \varepsilon_3^2 + \cdots + \varepsilon_m^2 < \delta_4^2\Big\}.$

The only difference between $\mathcal{D}$ and $\mathcal{D}'$ is the range of $\varepsilon_1$. Note that $\mathcal{D} \subset \mathcal{D}'$.

Lemma 6

For $\delta_4 \le \frac{1}{2}$ small enough and any $u \in \mathcal{D}$, we have

$f(\Sigma u) = \frac{\varepsilon_1\mu_1}{\nu}h_1 + \frac{\eta\mu_2}{\nu}h_2 + \cdots + \frac{\varepsilon_m\mu_m}{\nu}h_m,$ (5.35)

where $\nu^2 = \varepsilon_1^2\mu_1^2 + \eta^2\mu_2^2 + \cdots + \varepsilon_m^2\mu_m^2$. Furthermore, $f(\Sigma u) \in \mathcal{D}'$.

Recall that $f(\Sigma u)$ is the solution to the optimization problem (2.7) with $a$ replaced by $\Sigma u$.

Proof of Lemma 6

Equation (5.35) can be derived by the same arguments as in the case where $u^*_I$ is the leading eigenvector. Now we show that $f(\Sigma u) \in \mathcal{D}'$. Since

$\Big(\frac{\varepsilon_3\mu_3}{\nu}\Big)^2 + \cdots + \Big(\frac{\varepsilon_m\mu_m}{\nu}\Big)^2 = \frac{\varepsilon_3^2\mu_3^2 + \cdots + \varepsilon_m^2\mu_m^2}{\varepsilon_1^2\mu_1^2 + \eta^2\mu_2^2 + \cdots + \varepsilon_m^2\mu_m^2} \le \frac{(\varepsilon_3^2 + \cdots + \varepsilon_m^2)\mu_2^2}{\varepsilon_1^2\mu_1^2 + \eta^2\mu_2^2 + (\varepsilon_3^2 + \cdots + \varepsilon_m^2)\mu_2^2} \le \frac{(\varepsilon_3^2 + \cdots + \varepsilon_m^2)\mu_2^2}{(\varepsilon_1^2 + \eta^2 + \varepsilon_3^2 + \cdots + \varepsilon_m^2)\mu_2^2} = \varepsilon_3^2 + \cdots + \varepsilon_m^2 < \delta_4^2,$

and

$\Big(\frac{\varepsilon_1\mu_1}{\nu}\Big)^2 = \frac{\varepsilon_1^2\mu_1^2}{\varepsilon_1^2\mu_1^2 + \eta^2\mu_2^2 + \cdots + \varepsilon_m^2\mu_m^2} \le \frac{\varepsilon_1^2\mu_1^2}{\varepsilon_1^2\mu_1^2 + \eta^2\mu_2^2} = \frac{\varepsilon_1^2\mu_1^2}{\varepsilon_1^2\mu_1^2 + (1 - \varepsilon_1^2 - \sum_{i=3}^m\varepsilon_i^2)\mu_2^2} \le \frac{\varepsilon_1^2\mu_1^2}{\varepsilon_1^2(\mu_1^2-\mu_2^2) + (1-\delta_4)\mu_2^2} \le \frac{\varepsilon_1^2\mu_1^2}{(1-\delta_4)\mu_2^2} \le \frac{2\varepsilon_1^2\mu_1^2}{\mu_2^2} < \frac{4\mu_1^2}{\mu_2(\mu_1-\mu_2)}\delta_4^2,$

where the last inequality is due to the definition of $\mathcal{D}$. Thus $f(\Sigma u) \in \mathcal{D}'$.

Now without loss of generality, we assume that $u^{(0)} \in \mathcal{D}$. Let $u^{(0)} = \varepsilon_1 h_1 + \eta h_2 + \varepsilon_3 h_3 + \cdots + \varepsilon_m h_m$. If $\varepsilon_1 = 0$, by (5.35) in Lemma 6, the coefficient of $h_1$ in $u^{(1)} = f(\Sigma u^{(0)})$ is also zero. This is true for all vectors in the iteration sequence, and we can use the same arguments as in the case where $u^*_I$ is the leading eigenvector to show that the iteration sequence converges to $h_2 = u^*$. However, if $\varepsilon_1 \ne 0$, we have a different result.

Lemma 7

For any $u \in \mathcal{D}' \setminus \mathcal{D}$, we have $u^T\Sigma u > (u^*)^T\Sigma(u^*)$.

Proof of Lemma 7

Let $u = \varepsilon_1 h_1 + \eta h_2 + \varepsilon_3 h_3 + \cdots + \varepsilon_m h_m$. Since $u \in \mathcal{D}' \setminus \mathcal{D}$, we have

$\varepsilon_3^2 + \cdots + \varepsilon_m^2 < \delta_4^2, \qquad\text{and}\qquad \frac{2\mu_2}{\mu_1-\mu_2}\delta_4^2 \le \varepsilon_1^2 < \frac{4\mu_1^2}{\mu_2(\mu_1-\mu_2)}\delta_4^2.$ (5.36)

Because the last pm coordinates of u are zeros, we have

$u^T\Sigma u = (\varepsilon_1 h_1 + \eta h_2 + \cdots + \varepsilon_m h_m)^T\Sigma(\varepsilon_1 h_1 + \eta h_2 + \cdots + \varepsilon_m h_m) = (\varepsilon_1 g_1 + \eta g_2 + \cdots + \varepsilon_m g_m)^T(\varepsilon_1\Sigma_I g_1 + \eta\Sigma_I g_2 + \cdots + \varepsilon_m\Sigma_I g_m)$
$= (\varepsilon_1 g_1 + \eta g_2 + \cdots + \varepsilon_m g_m)^T(\varepsilon_1\mu_1 A_I[\lambda]g_1 + \eta\mu_2 A_I[\lambda]g_2 + \cdots + \varepsilon_m\mu_m A_I[\lambda]g_m) = \varepsilon_1^2\mu_1 + \eta^2\mu_2 + \varepsilon_3^2\mu_3 + \cdots + \varepsilon_m^2\mu_m$
$\ge \varepsilon_1^2\mu_1 + \eta^2\mu_2 = \varepsilon_1^2\mu_1 + \Big(1 - \varepsilon_1^2 - \sum_{i=3}^m\varepsilon_i^2\Big)\mu_2 = \varepsilon_1^2(\mu_1-\mu_2) + \Big(1 - \sum_{i=3}^m\varepsilon_i^2\Big)\mu_2 > \varepsilon_1^2(\mu_1-\mu_2) + (1-\delta_4^2)\mu_2$
$\ge \frac{2\mu_2}{\mu_1-\mu_2}\delta_4^2(\mu_1-\mu_2) + (1-\delta_4^2)\mu_2 = (1+\delta_4^2)\mu_2 > \mu_2 = (h_2)^T\Sigma(h_2) = (u^*)^T\Sigma(u^*).$

Now if $\varepsilon_1 \ne 0$, by (5.35), the coefficient of $h_1$ in $u^{(k)}$ is $\varepsilon_1\mu_1^k\big/\sqrt{\varepsilon_1^2\mu_1^{2k} + \eta^2\mu_2^{2k} + \cdots + \varepsilon_m^2\mu_m^{2k}}$, whose absolute value is strictly increasing in $k$. Hence, when $k$ is large enough, $u^{(k)}$ will leave the subset $\mathcal{D}$. Suppose that $k_0$ is the index with $u^{(k_0)} \in \mathcal{D}$ and $u^{(k_0+1)} \notin \mathcal{D}$. Then by Lemmas 6 and 7, we have $u^{(k_0+1)} \in \mathcal{D}' \setminus \mathcal{D}$ and $(u^{(k_0+1)})^T\Sigma u^{(k_0+1)} > (u^*)^T\Sigma(u^*)$. Now by Lemma 2, $\{(u^{(k)})^T\Sigma u^{(k)}: k \ge 1\}$ is an increasing sequence; hence $\{u^{(k)}: k \ge 1\}$ cannot converge to $u^*$.

Now define a neighborhood of u* = h2 as follows

$\mathcal{M} = \Big\{u: u = \omega_1 h_1 + (1+\omega_2)h_2 + \cdots + \omega_m h_m + \cdots + \omega_p h_p,\ \omega_1^2 + \cdots + \omega_p^2 < \delta_5^2\Big\}.$ (5.37)

By the same arguments as in the case where $u^*_I$ is the leading eigenvector, we can show that when $\delta_5$ is small enough, if $u^{(k)} \in \mathcal{M} \cap \{u: \|u\|_\lambda = 1\}$, then $u^{(k+1)} \in \mathcal{D}'$. In this case, let $u^{(k)} = \omega_1 h_1 + (1+\omega_2)h_2 + \cdots + \omega_m h_m + \cdots + \omega_p h_p$. We will compute the coefficient of $h_1$ in $u^{(k+1)}$. By the decomposition (5.16) of $\Sigma$ and the following decompositions,

$u^{(k)} = \begin{bmatrix} u_I^{(k)} \\ u_{I^c}^{(k)} \end{bmatrix} \qquad\text{and}\qquad u^{(k+1)} = \begin{bmatrix} u_I^{(k+1)} \\ 0 \end{bmatrix},$

where $u_{I^c}^{(k)} = (\omega_{m+1}, \cdots, \omega_p)^T$, we have

$\Sigma u^{(k)} = \begin{bmatrix} \Sigma_I u_I^{(k)} + \Lambda u_{I^c}^{(k)} \\ \Lambda^T u_I^{(k)} + \Sigma_{I^c} u_{I^c}^{(k)} \end{bmatrix}.$

It follows from (2.13) that

$\nu A_I[\lambda]\, u_I^{(k+1)} = \Lambda^T u_{I^c}^{(k)} + \Sigma_I u_I^{(k)},$

where $\nu$ is a positive scale constant. Then the coefficient of $h_1$ in $u^{(k+1)}$ is equal to

$g_1^T A_I[\lambda]\, u_I^{(k+1)} = \frac{1}{\nu}\big(g_1^T\Lambda^T u_{I^c}^{(k)} + g_1^T\Sigma_I u_I^{(k)}\big) = \frac{1}{\nu}\big(g_1^T\Lambda^T(\omega_{m+1}, \cdots, \omega_p)^T + \omega_1\mu_1\big).$

Now define the following affine subspace passing through u* = h2 in ℝp with dimension strictly less than p,

$\mathcal{S} = \Big\{u: u = \omega_1 h_1 + (1+\omega_2)h_2 + \cdots + \omega_m h_m + \cdots + \omega_p h_p,\ g_1^T\Lambda^T(\omega_{m+1}, \cdots, \omega_p)^T + \omega_1\mu_1 = 0\Big\}.$ (5.38)

Then if $u^{(k)} \in \mathcal{S} \cap \mathcal{M} \cap \{u: \|u\|_\lambda = 1\}$ for some $k \ge 1$, all of $u^{(k+1)}, u^{(k+2)}, \cdots$ belong to $\mathcal{D} \cap \{u: \|u\|_\lambda = 1\}$ and their coefficients of $h_1$ are all zero; hence, the iteration sequence converges to $u^*$. If $u^{(k)} \in (\mathcal{M} \setminus \mathcal{S}) \cap \{u: \|u\|_\lambda = 1\}$ for some $k \ge 1$, then $u^{(k+1)}, u^{(k+2)}, \cdots$ belong to $\mathcal{D}' \cap \{u: \|u\|_\lambda = 1\}$ but their coefficients of $h_1$ are all nonzero, and the iteration sequence cannot converge to $u^*$. Therefore, in this case, $u^*$ is not a stable limit.

Finally, we prove that the iteration sequence converges. Because $\{u: \|u\|_\lambda = 1\}$ is a compact set, we can find a convergent subsequence $\{u^{(k_j)}: k_1 < k_2 < \cdots\}$ of the iteration sequence. Suppose that $u^{(k_j)} \to u^*$ as $j \to \infty$. We first show that $u^*$ belongs to the set defined in (2.15). Because $u^{(k_j+1)} = f(\Sigma u^{(k_j)})$ and $f$ is a continuous function, $u^{(k_j+1)} \to f(\Sigma u^*)$. Let $\bar u' = f(\Sigma u^*)$. Then we have $(u^{(k_j)})^T\Sigma u^{(k_j)} \to (u^*)^T\Sigma u^*$ and $(u^{(k_j+1)})^T\Sigma u^{(k_j+1)} \to (\bar u')^T\Sigma\bar u'$. By Lemma 2, $\{(u^{(k)})^T\Sigma u^{(k)}: k \ge 1\}$ is an increasing sequence. Hence, $(u^*)^T\Sigma u^* = (\bar u')^T\Sigma\bar u'$ and, by Lemma 2, $u^* = \bar u'$; that is, $u^*$ belongs to the set defined in (2.15). Now suppose $u^*_I$ is a leading eigenvector of (2.16). Let $\mathcal{N}$ be the neighborhood of $u^*$ defined in (5.34). Then when $k_j$ is large enough, $u^{(k_j)} \in \mathcal{N}$, and hence the iteration sequence converges to $u^*$. If $u^*_I$ is not a leading eigenvector, let $\mathcal{M}$ and $\mathcal{S}$ be the subsets defined in (5.37) and (5.38). Since $u^{(k_j)} \to u^*$, we must have that when $k_j$ is large enough, $u^{(k_j)} \in \mathcal{S} \cap \mathcal{M}$. Hence, the iteration sequence converges to $u^*$.

Proof of Theorem 2.5

Let $\Psi = (\psi_1, \ldots, \psi_q)$ be a $p \times q$ matrix, where the basis vectors $\psi_j = (\psi_{1j}, \cdots, \psi_{pj})^T \in \mathbb{R}^p$, $1 \le j \le q$. Then $t_1\psi_1 + \cdots + t_q\psi_q = \Psi t$, $\forall t = (t_1, \cdots, t_q)^T \in \mathbb{R}^q$. We first show that if for some $t^* \in \mathbb{R}^q$, $f(a + \Psi t^*) \perp M$, then $f(a + \Psi t^*)$ is the solution to (2.20). Indeed, for any $u$ with $u \perp M$ and $\|u\|_\lambda = 1$, we have

$a^T u = (a + \Psi t^*)^T u \le (a + \Psi t^*)^T f(a + \Psi t^*) = a^T f(a + \Psi t^*),$

where the inequality follows from the definition of $f$. The uniqueness of the solution to (2.20) is due to the strict convexity of the norm $\|\cdot\|_\lambda$. For any $t = (t_1, \cdots, t_q)^T \in \mathbb{R}^q$, let the orthogonal projection of $f(a + \Psi t)$ onto $M$ have the basis expansion $\Psi s = s_1\psi_1 + \cdots + s_q\psi_q$, where $s = (s_1, \cdots, s_q)^T \in \mathbb{R}^q$ is a function of $t$. Hence, we define $s = \Gamma(t)$. Then $\Gamma$ is a continuous function from $\mathbb{R}^q$ to $\mathbb{R}^q$ because $f$ is a continuous function. We only need to show that there exists $t^*$ such that $\Gamma(t^*) = 0$. We first prove a technical lemma.

Lemma 8

If $\|a\|_2 < \frac{\|t\|_2^2}{\|\Psi t\|_\lambda}$, we have $t^T\Gamma(t) > 0$. Moreover, if $\|t\|_2 \to \infty$, then $\frac{\|t\|_2^2}{\|\Psi t\|_\lambda} \to \infty$.

Proof of Lemma 8

We will proceed by contradiction. Assume that $t^T\Gamma(t) \le 0$. Let $f(a + \Psi t) = \Psi\Gamma(t) + u'$, where $u' \perp M$. Hence, $\|u'\|_2^2 \le \|f(a+\Psi t)\|_2^2 \le \|f(a+\Psi t)\|_\lambda^2 = 1$. Note that $\{\psi_1, \cdots, \psi_q\}$ is an orthonormal basis. Then we have

$(a + \Psi t)^T f(a + \Psi t) = a^T u' + t^T\Gamma(t) \le a^T u' \le \|a\|_2\|u'\|_2 \le \|a\|_2 < \frac{\|t\|_2^2}{\|\Psi t\|_\lambda} = \frac{(\Psi t)^T\Psi t}{\|\Psi t\|_\lambda} = (a + \Psi t)^T\frac{\Psi t}{\|\Psi t\|_\lambda}.$

It is a contradiction because, by the definition of f, (a + Ψt)T f (a + Ψt) is the largest among all (a + Ψt)T u with ||u||λ = 1. The last statement in this lemma follows from

$\frac{\|t\|_2^2}{\|\Psi t\|_\lambda} \ge \frac{\|t\|_2^2}{\sum_{i=1}^q|t_i|\,\|\psi_i\|_\lambda} \ge \frac{\|t\|_2^2}{\sqrt{\sum_{i=1}^q t_i^2}\sqrt{\sum_{i=1}^q\|\psi_i\|_\lambda^2}} = \frac{\|t\|_2}{\sqrt{\sum_{i=1}^q\|\psi_i\|_\lambda^2}}.$

Now if $q = 1$, $t$ is a real number. By Lemma 8, we can choose $t > 0$ so large that $t\Gamma(t) > 0$ and $(-t)\Gamma(-t) > 0$. Hence, we have $\Gamma(-t) < 0 < \Gamma(t)$. Because $\Gamma$ is a continuous function, the intermediate value theorem implies that $\Gamma(t^*) = 0$ for some $-t \le t^* \le t$.
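The $q = 1$ case is just the bisection idea behind the intermediate value theorem. The following sketch finds a zero of a generic continuous scalar function with the required sign change; the particular function `gamma` used here is hypothetical and only stands in for the projection map $\Gamma$ defined above.

```python
import numpy as np

def bisect_root(gamma, lo, hi, tol=1e-10):
    """Find t* with gamma(t*) = 0 on [lo, hi], assuming gamma(lo) < 0 < gamma(hi)
    and gamma continuous -- the intermediate value argument used above for q = 1."""
    assert gamma(lo) < 0 < gamma(hi)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if gamma(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# A hypothetical continuous Gamma, negative for large negative t and
# positive for large positive t (as Lemma 8 guarantees for the real Gamma).
gamma = lambda t: t - 0.7 * np.cos(t)
print(bisect_root(gamma, -10.0, 10.0))
```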

If $q > 1$, we need some basic results from topology. We first give a definition (see Section 51 in Munkres [9]). Suppose that $\rho_0$ and $\rho_1$ are continuous maps from a space $X$ into $X$; we say that $\rho_0$ is homotopic to $\rho_1$ if there is a continuous map $F: X \times [0,1] \to X$ such that $F(x, 0) = \rho_0(x)$ and $F(x, 1) = \rho_1(x)$, $\forall x \in X$. If $\rho_1$ maps all points in $X$ into a fixed point in $X$, we say that $\rho_0$ is null homotopic. It is well known that the identity map from the unit sphere $S^{q-1}$ of $\mathbb{R}^q$ (with $l_2$-norm) onto $S^{q-1}$ is not null homotopic (see Exercise 4(a) after Section 55 in Munkres [9]). We will proceed by contradiction. Assume that $\Gamma(t) \ne 0$ for all $t \in \mathbb{R}^q$. By Lemma 8, we can choose a large $M > 0$ such that if $\|\tilde t\|_2 = M$, we have $\tilde t^T\Gamma(\tilde t) > 0$. Hence, for any $t \in S^{q-1}$, $t^T\Gamma(Mt) > 0$. Now we define a continuous map $F: S^{q-1} \times [0,1] \to S^{q-1}$ as follows,

$F(t,\theta) = \begin{cases} \dfrac{(1-2\theta)t + 2\theta\,\Gamma(Mt)}{\|(1-2\theta)t + 2\theta\,\Gamma(Mt)\|_2} & \text{if } 0 \le \theta \le \tfrac{1}{2},\\[1ex] \dfrac{\Gamma(2(1-\theta)Mt)}{\|\Gamma(2(1-\theta)Mt)\|_2} & \text{if } \tfrac{1}{2} \le \theta \le 1, \end{cases}$

where $(1-2\theta)t + 2\theta\Gamma(Mt) \ne 0$ for all $0 \le \theta \le \tfrac{1}{2}$ (if it were zero, $\Gamma(Mt)$ would be a nonpositive multiple of $t$, contradicting $t^T\Gamma(Mt) > 0$). It is easy to see that $F(t, 0)$ is the identity map from $S^{q-1}$ to $S^{q-1}$ and $F(t, 1)$ maps every $t$ to $\Gamma(0)/\|\Gamma(0)\|_2$, which contradicts the fact that the identity map from $S^{q-1}$ onto $S^{q-1}$ is not null homotopic.

Proof of Theorem 2.6

Recall that $\Psi = (\psi_1, \ldots, \psi_q)$, where the basis vectors $\psi_j = (\psi_{1j}, \cdots, \psi_{pj})^T \in \mathbb{R}^p$, $1 \le j \le q$. Suppose that $\bar t \notin \bigcup_{m=1}^{q-1} J_m$. We will calculate the first and second order partial derivatives of $H$ at the point $\bar t$. Let $|c_{k_1}(\bar t)| \ge \cdots \ge |c_{k_p}(\bar t)|$ be the sorted sequence and $S_i = \sum_{j=1}^i |c_{k_j}(\bar t)|$, $1 \le i \le p$, where the $c_i(\cdot)$'s are defined in (2.22). Let $m$ be the number of nonzero coordinates of $f(a + \Psi\bar t)$. Then by (2.11) in Corollary 2.1,

$|c_{k_{m+1}}(\bar t)| \le \frac{\lambda S_m}{(1-\lambda)+m\lambda} < |c_{k_m}(\bar t)|,$

and the set $I$ of indices of nonzero coordinates of $f(a + \Psi\bar t)$ is equal to $\{k_1, \cdots, k_m\}$. By the definition of the sets $J_m$, we have the following strict inequalities,

$|c_{k_{m+1}}(\bar t)| < \frac{\lambda S_m}{(1-\lambda)+m\lambda} < |c_{k_m}(\bar t)|.$ (5.39)

Let $\Delta t = (\Delta t_1, \Delta t_2, \cdots, \Delta t_q)$ and let $|c_{k'_1}(\bar t + \Delta t)| \ge \cdots \ge |c_{k'_p}(\bar t + \Delta t)|$ be the sorted sequence. Define $S'_i = \sum_{j=1}^i|c_{k'_j}(\bar t + \Delta t)|$, $1 \le i \le p$. By similar arguments as in the proof of Lemma 5, it follows from (5.39) that when $\|\Delta t\|_2$ is small enough, we have

$|c_{k'_{m+1}}(\bar t + \Delta t)| < \frac{\lambda S'_m}{(1-\lambda)+m\lambda} < |c_{k'_m}(\bar t + \Delta t)|,$ (5.40)

and $\{k'_1, \cdots, k'_m\} = \{k_1, \cdots, k_m\} = I$. Recall that we have defined $x(t) = (x_1(t), \cdots, x_p(t))^T = f(a + \Psi t)$. It follows from (5.40) and Corollary 2.1 that

$x_i(\bar t + \Delta t) = \begin{cases} \frac{1}{2\nu'}\Big[c_i(\bar t + \Delta t) - \mathrm{sgn}(c_i(\bar t + \Delta t))\frac{\lambda S'_m}{(1-\lambda)+m\lambda}\Big], & i \in I,\\ 0, & i \in I^c, \end{cases} \qquad x_i(\bar t) = \begin{cases} \frac{1}{2\nu}\Big[c_i(\bar t) - \mathrm{sgn}(c_i(\bar t))\frac{\lambda S_m}{(1-\lambda)+m\lambda}\Big], & i \in I,\\ 0, & i \in I^c, \end{cases}$ (5.41)

where $\nu$ and $\nu'$ are the scale constants. Since $c_i(\bar t + \Delta t) = c_i(\bar t) + (\Psi\Delta t)_i$, when $\|\Delta t\|_2$ is small enough we have $\mathrm{sgn}(c_i(\bar t + \Delta t)) = \mathrm{sgn}(c_i(\bar t)) \ne 0$ for $i \in I$. In this case,

$|c_i(\bar t + \Delta t)| = \mathrm{sgn}(c_i(\bar t))\,c_i(\bar t + \Delta t) = \mathrm{sgn}(c_i(\bar t))\big(c_i(\bar t) + (\Psi\Delta t)_i\big) = |c_i(\bar t)| + \mathrm{sgn}(c_i(\bar t))(\Psi\Delta t)_i.$

Hence, for iI, we have

$x_i(\bar t + \Delta t) = \frac{1}{2\nu'}\Big[c_i(\bar t) + (\Psi\Delta t)_i - \mathrm{sgn}(c_i(\bar t))\frac{\lambda\sum_{j\in I}\big(|c_j(\bar t)| + \mathrm{sgn}(c_j(\bar t))(\Psi\Delta t)_j\big)}{(1-\lambda)+m\lambda}\Big]$
$= \frac{1}{2\nu'}\Big[2\nu x_i(\bar t) + (\Psi\Delta t)_i - \mathrm{sgn}(c_i(\bar t))\frac{\lambda\sum_{j\in I}\mathrm{sgn}(c_j(\bar t))(\Psi\Delta t)_j}{(1-\lambda)+m\lambda}\Big]$
$= \frac{1}{2\nu'}\Big[2\nu x_i(\bar t) + \big(\mathrm{sgn}(x_i(\bar t))\big)^2(\Psi\Delta t)_i - \mathrm{sgn}(x_i(\bar t))\frac{\lambda\sum_{j\in I}\mathrm{sgn}(x_j(\bar t))(\Psi\Delta t)_j}{(1-\lambda)+m\lambda}\Big]$
$= \frac{1}{2\nu'}\Big[2\nu x_i(\bar t) + \big(\mathrm{sgn}(x_i(\bar t))\big)^2(\Psi\Delta t)_i - \mathrm{sgn}(x_i(\bar t))\frac{\lambda\sum_{j=1}^p\mathrm{sgn}(x_j(\bar t))(\Psi\Delta t)_j}{(1-\lambda)+m\lambda}\Big],$ (5.42)

where the last two equalities follow from the facts that $x_i(\bar t)$ and $c_i(\bar t)$ have the same sign for $i \in I$ and that $\mathrm{sgn}(x_i(\bar t)) = 0$ for all $i \in I^c$. Note that the last line in (5.42) is equal to $x_i(\bar t + \Delta t)$ even if $i \in I^c$, because both are equal to zero. Thus, we have

$x(\bar t + \Delta t) = \frac{1}{2\nu'}\big[2\nu\, x(\bar t) + K(\bar t)\Delta t\big],$ (5.43)

where the p × q matrix

$K(\bar t) = N(\bar t)\Big(I - \frac{\lambda}{(1-\lambda)+m\lambda}\,\mathbf{1}\mathbf{1}^T\Big)N(\bar t)\,\Psi,$ (5.44)

$I$ is the $p$-dimensional identity matrix, $\mathbf{1}$ is the $p$-vector with all coordinates equal to 1, and $N(\bar t)$ is the $p$-dimensional diagonal matrix with the $i$-th diagonal element equal to $\mathrm{sgn}(x_i(\bar t))$, $1 \le i \le p$. Now by the definition of $H$, we have $H(\bar t) = x(\bar t)^T\Psi\Psi^T x(\bar t)$ and

$H(\bar t + \Delta t) = x(\bar t + \Delta t)^T\Psi\Psi^T x(\bar t + \Delta t) = \frac{1}{(2\nu')^2}\Big((2\nu)^2 H(\bar t) + 4\nu\, G(\bar t)^T\Delta t + \Delta t^T\Pi(\bar t)\Delta t\Big),$ (5.45)

where the $q$-vector $G(\bar t) = K(\bar t)^T\Psi\Psi^T x(\bar t)$ and the $q \times q$ matrix $\Pi(\bar t) = K(\bar t)^T\Psi\Psi^T K(\bar t)$. Now we compute $(2\nu')^2$. From (5.43), we have

$(2\nu')^2 = \|2\nu x(\bar t) + K(\bar t)\Delta t\|_\lambda^2 = (1-\lambda)\|2\nu x(\bar t) + K(\bar t)\Delta t\|_2^2 + \lambda\Big(\sum_{i=1}^p\big|2\nu x_i(\bar t) + (K(\bar t)\Delta t)_i\big|\Big)^2$
$= (1-\lambda)\big[(2\nu)^2\|x(\bar t)\|_2^2 + 4\nu x(\bar t)^T K(\bar t)\Delta t + \Delta t^T K(\bar t)^T K(\bar t)\Delta t\big] + \lambda\Big(2\nu\sum_{i=1}^p|x_i(\bar t)| + \sum_{i=1}^p\mathrm{sgn}(x_i(\bar t))(K(\bar t)\Delta t)_i\Big)^2$
$= (1-\lambda)\big[(2\nu)^2\|x(\bar t)\|_2^2 + 4\nu x(\bar t)^T K(\bar t)\Delta t + \Delta t^T K(\bar t)^T K(\bar t)\Delta t\big] + \lambda\big[(2\nu)^2\|x(\bar t)\|_1^2 + 4\nu\|x(\bar t)\|_1\mathbf{1}^T N(\bar t)K(\bar t)\Delta t + \Delta t^T K(\bar t)^T N(\bar t)^T\mathbf{1}\mathbf{1}^T N(\bar t)K(\bar t)\Delta t\big]$
$= (2\nu)^2\|x(\bar t)\|_\lambda^2 + 4\nu E(\bar t)^T\Delta t + \Delta t^T F(\bar t)\Delta t,$

where the q-vector E and the q × q matrix F are

$E(\bar t) = (1-\lambda)K(\bar t)^T x(\bar t) + \lambda\|x(\bar t)\|_1 K(\bar t)^T N(\bar t)\mathbf{1}, \qquad F(\bar t) = (1-\lambda)K(\bar t)^T K(\bar t) + \lambda K(\bar t)^T N(\bar t)^T\mathbf{1}\mathbf{1}^T N(\bar t)K(\bar t).$

Since $\|x(\bar t)\|_\lambda = 1$,

$\Big(\frac{2\nu}{2\nu'}\Big)^2 = \Big(1 + \frac{2}{2\nu}(\Delta t)^T E(\bar t) + \frac{1}{(2\nu)^2}(\Delta t)^T F(\bar t)(\Delta t)\Big)^{-1} = 1 - \frac{2}{2\nu}(\Delta t)^T E(\bar t) - \frac{1}{(2\nu)^2}(\Delta t)^T F(\bar t)(\Delta t) + \frac{4}{(2\nu)^2}(\Delta t)^T E(\bar t)E(\bar t)^T(\Delta t) + o(\|\Delta t\|_2^2).$

Now by (5.45),

$H(\bar t + \Delta t) = \frac{(2\nu)^2}{(2\nu')^2}\Big[H(\bar t) + \frac{1}{\nu}(\Delta t)^T G(\bar t) + \frac{1}{(2\nu)^2}(\Delta t)^T\Pi(\bar t)\Delta t\Big]$
$= H(\bar t) + (\Delta t)^T\frac{G(\bar t) - H(\bar t)E(\bar t)}{\nu} + (\Delta t)^T\frac{4H(\bar t)E(\bar t)E(\bar t)^T - 2\big[E(\bar t)G(\bar t)^T + G(\bar t)E(\bar t)^T\big] + \Pi(\bar t) - H(\bar t)F(\bar t)}{(2\nu)^2}\Delta t + o(\|\Delta t\|_2^2).$

Hence we have

$\nabla H(\bar t) = \frac{G(\bar t) - H(\bar t)E(\bar t)}{\nu}, \qquad \nabla^2 H(\bar t) = \frac{4H(\bar t)E(\bar t)E(\bar t)^T - 2\big[E(\bar t)G(\bar t)^T + G(\bar t)E(\bar t)^T\big] + \Pi(\bar t) - H(\bar t)F(\bar t)}{2\nu^2}.$
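The following sketch assembles the quantities entering the gradient formula $\nabla H(\bar t) = (G(\bar t) - H(\bar t)E(\bar t))/\nu$ from the definitions of $K(\bar t)$, $N(\bar t)$, $E(\bar t)$ and $G(\bar t)$ above. The inputs $x(\bar t)$, $\Psi$, $\lambda$ and $\nu$ are assumed to be available (computing $x(\bar t) = f(a + \Psi\bar t)$ itself requires the solution map $f$ of Section 2, which is not reproduced here), so this is only a schematic reading of the formulas.

```python
import numpy as np

def grad_H(x_bar, Psi, lam, nu):
    """Gradient of H at t_bar, assuming x_bar = x(t_bar), Psi (p x q), lam and nu are given."""
    p = len(x_bar)
    m = np.count_nonzero(x_bar)                   # support size of x(t_bar)
    N = np.diag(np.sign(x_bar))                   # N(t_bar), diagonal of signs
    ones = np.ones((p, 1))
    # K(t_bar) = N (I - lambda/((1-lambda)+m*lambda) 1 1^T) N Psi, as in (5.44)
    K = N @ (np.eye(p) - lam / ((1 - lam) + m * lam) * (ones @ ones.T)) @ N @ Psi
    G = K.T @ Psi @ (Psi.T @ x_bar)               # G(t_bar) = K^T Psi Psi^T x
    E = (1 - lam) * K.T @ x_bar + lam * np.sum(np.abs(x_bar)) * (K.T @ N @ ones).ravel()
    H = x_bar @ Psi @ (Psi.T @ x_bar)             # H(t_bar) = x^T Psi Psi^T x
    return (G - H * E) / nu                       # gradient formula above
```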
Proof of Theorem 2.7

Recall that $\Psi = (\psi_1, \ldots, \psi_q)$ is the matrix of basis vectors of $M$, where $q$ is the dimension of $M$. For any $1 \le j \le p$, let $e_j$ be the vector with $j$-th coordinate equal to 1 and other coordinates equal to zero; $\{e_1, \cdots, e_p\}$ is a basis of $\mathbb{R}^p$. For any subset $I$ of $\{1, \cdots, p\}$, we choose a subset $J = J(I)$ of $I^c$ such that $\{e_i: i \in J\} \cup \{\psi_j: 1 \le j \le q\}$ is a basis for the subspace spanned by $\{e_i: i \in I^c\} \cup \{\psi_j: 1 \le j \le q\}$. For any $u^*$ in the corresponding set of solutions, let $I$ be the index set of nonzero coordinates of $u^*$. Since $u^*$ is the solution to

$\max_{u\in\mathbb{R}^p}(\Sigma u^*)^T u, \qquad\text{subject to } \|u\|_\lambda \le 1,\ u \perp M,$

it is also the solution to

$\min_{u\in\mathbb{R}^p}-(\Sigma u^*)^T u, \qquad\text{subject to } \|u\|_\lambda \le 1;\ \mathrm{sgn}(u^*_i)u_i \ge 0,\ i \in I;\ u_i = 0,\ i \in I^c;\ \text{and } \psi_j^T u = 0,\ 1 \le j \le q,$

which is equivalent to the problem

$\min_{u\in\mathbb{R}^p}-(\Sigma u^*)^T u, \qquad\text{subject to } \|u\|_\lambda \le 1;\ \mathrm{sgn}(u^*_i)u_i \ge 0,\ i \in I;\ u_i = 0,\ i \in J;\ \text{and } \psi_j^T u = 0,\ 1 \le j \le q.$

By the Karush-Kuhn-Tucker condition, we have

$2\tilde\nu\, E A[\lambda] E u^* = \Sigma u^* + \sum_{i\in J}\mu_i e_i + \sum_{j=1}^q\beta_j\psi_j,$ (5.46)

where $\tilde\nu > 0$, the $\mu_i$ and $\beta_j$ are multipliers, and $E$ is the $p$-dimensional diagonal matrix with $i$-th diagonal element equal to $\mathrm{sgn}(u^*_i)$ if $i \in I$ and 1 if $i \in I^c$. We will assume that for all such $u^*$,

μi0foralliJ,andβj0forall1jq. (5.47)

From (5.46), we have

$2\tilde\nu\, E_I A_I[\lambda] E_I u^*_I = \Sigma_I u^*_I + \sum_{j=1}^q\beta_j(\psi_j)_I,$

and thus

$P_I^{\perp M} E_I A_I[\lambda] E_I P_I^{\perp M} u^*_I = \frac{1}{2\tilde\nu}P_I^{\perp M}\Sigma_I P_I^{\perp M} u^*_I,$ (5.48)

where $P_I^{\perp M}$ is the orthogonal projection matrix onto the orthogonal complement of the space spanned by $(\psi_j)_I$, $1 \le j \le q$. Hence, $u^*_I$ is an eigenvector of the generalized eigenvalue problem (5.48) corresponding to a nonzero eigenvalue. Now we can use the same arguments as in the proof of Theorem 2.4 to prove the convergence of Algorithm 2.3. Moreover, we can show that only those $u^*$ with $u^*_I$ a leading eigenvector of problem (5.48) are stable limits.

Proof of Theorem 3.1

Since $0 < \alpha < \frac{1}{3}$, we can find a positive number $\beta$ such that

α<β<2β<1-α. (5.49)

Without loss of generality, we assume that for each n,

$\rho_1^{(n)} \ge \rho_2^{(n)} \ge \cdots \ge \rho_{p_n}^{(n)} \ge 0.$ (5.50)

Then the uniform “weak lq decay” condition (3.3) becomes

$\rho_\nu^{(n)} \le C\nu^{-1/q}, \qquad \nu = 1, \cdots, p_n,\ n = 1, 2, \cdots,$ (5.51)

and the partial sums are $S_i^{(n)} = \sum_{j=1}^i\rho_j^{(n)}$, $1 \le i \le p_n$. Let $w^{(n)} = (w_1^{(n)}, \cdots, w_n^{(n)})^T$, and let $Z^{(n)} = [z_1^{(n)}, \cdots, z_n^{(n)}]$ be the $p_n \times n$ noise matrix. Then the $p_n \times p_n$ sample covariance matrix is

$\hat\Sigma^{(n)} = \frac{\|w^{(n)}\|_2^2}{n}\rho^{(n)}(\rho^{(n)})^T + \sigma^2 I + \rho^{(n)}(y^{(n)})^T + y^{(n)}(\rho^{(n)})^T + \sigma^2\Big(\frac{\Pi^{(n)}}{n} - I\Big),$ (5.52)

where $I$ is the $p_n$-dimensional identity matrix, $y^{(n)} = \frac{\sigma}{n}Z^{(n)}w^{(n)}$ is a $p_n$-dimensional random vector, and $\Pi^{(n)} = Z^{(n)}(Z^{(n)})^T$ is a $p_n \times p_n$ random matrix that has a Wishart distribution with $n$ degrees of freedom. By the condition (3.4), we can choose a positive integer $m_0$ (independent of $n$) and a real number $0 < \tau < 1$ such that for any $m \ge m_0$,

$\sup_{n\ge 1}\frac{S_{2m}^{(n)} - S_m^{(n)}}{S_m^{(n)}} \le \tau.$ (5.53)

Define

$\hat\mu^{(n)} = \max_{v\in\mathbb{R}^{p_n},\,\|v\|_2=1}\frac{v^T\hat\Sigma^{(n)}v}{(1-\lambda(n))\|v\|_2^2 + \lambda(n)\|v\|_1^2} = \frac{(\hat\rho^{(n)})^T\hat\Sigma^{(n)}\hat\rho^{(n)}}{\|\hat\rho^{(n)}\|_{\lambda(n)}^2}.$ (5.54)

We first provide several technical lemmas whose proofs are given in the Appendix. Recall that for any two vectors $u$ and $v$, we use $\langle u, v\rangle = u^T v$ to denote their inner product. By the condition (3.5), we have $(2\sigma\sqrt{c})^2 < 1 - (4\sigma\sqrt{c} + 4\sigma^2\sqrt{c} + 2\sigma^2 c)$ and $1 - (2\sigma\sqrt{c} + 2\sigma^2\sqrt{c} + \sigma^2 c) > 0$. Thus, we can find $\kappa_1 > 0$ and $\kappa_2 > 0$ small enough such that

$(\kappa_1 + 2\sigma\sqrt{c})^2 < 1 - \big[4\sigma\sqrt{c} + 4\sigma^2\sqrt{c} + 2\sigma^2 c\big],$ (5.55)
$2\kappa_2\sigma^2 < 1 - \big(2\sigma\sqrt{c} + 2\sigma^2\sqrt{c} + \sigma^2 c\big).$ (5.56)

Lemma 9

Suppose that κ1 and κ2 satisfy (5.55) and (5.56), respectively, we have

$\sum_{n=1}^\infty P\big(|\langle\hat\rho^{(n)},\rho^{(n)}\rangle + \langle\hat\rho^{(n)},y^{(n)}\rangle| \le \kappa_1\big) < \infty,$ (5.57)
$\sum_{n=1}^\infty P\big(\hat\mu^{(n)} \le (1+\kappa_2)\sigma^2\big) < \infty, \qquad \sum_{n=1}^\infty P\Big(\|\hat\rho^{(n)}\|_1 > \frac{\sqrt{3+\sigma^2}}{\sigma\sqrt{\lambda(n)}}\Big) < \infty.$ (5.58)

Lemma 10

If $\hat\mu^{(n)} > (1 + \kappa_2)\sigma^2$ and $\lambda(n) \le \kappa_2/(1+\kappa_2)$, then $\hat\rho^{(n)}$ is, up to multiplication by a constant, the solution to the following optimization problem,

$\max_{u\in\mathbb{R}^{p_n}}\big((\hat\Sigma^{(n)} - \sigma^2 I)\hat\rho^{(n)}\big)^T u, \qquad\text{subject to } \|u\|_{\tilde\lambda(n)} \le 1,$

where $I$ is the identity matrix and $\tilde\lambda(n) = \lambda(n)/\big(1 - \sigma^2/\hat\mu^{(n)}\big)$.

Let

$\xi^{(n)} = (\xi_1^{(n)}, \cdots, \xi_{p_n}^{(n)})^T = (\hat\Sigma^{(n)} - \sigma^2 I)\hat\rho^{(n)} - \big(\langle\hat\rho^{(n)},\rho^{(n)}\rangle + \langle\hat\rho^{(n)},y^{(n)}\rangle\big)\rho^{(n)}.$

Then we have the following lemma.

Lemma 11

If (5.58) holds for some positive κ2, we have

$\sum_{n=1}^\infty P\Big(\max_{1\le i\le p_n}|\xi_i^{(n)}| > n^{-\beta}\Big) < \infty.$ (5.59)

We choose and fix κ1 > 0 and κ2 > 0 such that (5.55) and (5.56) hold. Define ζ(n) = ξ(n)/(〈ρ̂(n), ρ(n)〉 + 〈ρ̂(n), y(n)〉) and

$\hat a^{(n)} = \rho^{(n)} + \xi^{(n)}/\big(\langle\hat\rho^{(n)},\rho^{(n)}\rangle + \langle\hat\rho^{(n)},y^{(n)}\rangle\big) = \rho^{(n)} + \zeta^{(n)},$

then

$(\hat\Sigma^{(n)} - \sigma^2 I)\hat\rho^{(n)} = \big(\langle\hat\rho^{(n)},\rho^{(n)}\rangle + \langle\hat\rho^{(n)},y^{(n)}\rangle\big)\hat a^{(n)}.$

It follows from (5.57) and (5.59) that

$\sum_{n=1}^\infty P\Big(\max_{1\le i\le p_n}|\zeta_i^{(n)}| > \frac{n^{-\beta}}{\kappa_1}\Big) < \infty,$ (5.60)

where ζ(n)=(ζ1(n),,ζpn(n))T. Define a sequence of measurable subsets of the probability space

$\Omega^{(n)} = \big\{|\langle\hat\rho^{(n)},\rho^{(n)}\rangle + \langle\hat\rho^{(n)},y^{(n)}\rangle| > \kappa_1\big\} \cap \big\{\hat\mu^{(n)} > (1+\kappa_2)\sigma^2\big\} \cap \Big\{\max_{1\le i\le p_n}|\zeta_i^{(n)}| \le \frac{n^{-\beta}}{\kappa_1}\Big\}.$ (5.61)

It follows from Lemma 9 and (5.60) that

$\sum_{n=1}^\infty P\big((\Omega^{(n)})^c\big) < \infty,$ (5.62)

where (Ω(n))c is the complement of Ω(n). Hence by Borel-Cantelli Lemma, it holds almost surely that for all but finitely many n, the event Ω(n) happens. Therefore, in the following, we will assume that the inequalities in the definition (5.61) of Ω(n) hold for all n large enough. By Lemma 10, ρ̂(n) is the solution to the following optimization problem multiplied by a constant,

$\max_{u\in\mathbb{R}^{p_n}}(\hat a^{(n)})^T u, \qquad\text{subject to } \|u\|_{\tilde\lambda(n)} \le 1.$

Let $|\hat a^{(n)}|_{(1)} \ge |\hat a^{(n)}|_{(2)} \ge \cdots \ge |\hat a^{(n)}|_{(p_n)}$ be the coordinates $(\hat a_1^{(n)}, \cdots, \hat a_{p_n}^{(n)})$ of $\hat a^{(n)}$ sorted by their absolute values. Define the partial sums $\hat S_i^{(n)} = \sum_{j=1}^i|\hat a^{(n)}|_{(j)}$, $1 \le i \le p_n$. Since $\max_{1\le i\le p_n}|\zeta_i^{(n)}| \le n^{-\beta}/\kappa_1$, we have $|\hat a_i^{(n)}| = |\rho_i^{(n)} + \zeta_i^{(n)}| \le \rho_i^{(n)} + n^{-\beta}/\kappa_1$, from which and (5.50) we can obtain that

$\rho_i^{(n)} - \frac{n^{-\beta}}{\kappa_1} \le |\hat a^{(n)}|_{(i)} \le \rho_i^{(n)} + \frac{n^{-\beta}}{\kappa_1}, \qquad 1 \le i \le p_n.$ (5.63)

Let $\hat m(n)$ be the number of nonzero coordinates of $\hat\rho^{(n)}$. We will give an upper bound for $\hat m(n)$. Define

$m(n) = 4\Big(\Big\lfloor\frac{1}{\tilde\lambda(n)}\max\Big\{1, \frac{4\tau}{1-\tau}\Big\}\Big\rfloor + 1\Big),$ (5.64)

where τ is the constant defined in (5.53), λ̃ (n) is defined in Lemma 10. For any real number x, ⌊x⌋ denotes the largest integer number smaller than x.

Lemma 12

With probability 1, $\hat m(n) \le m(n)$ when $n$ is large enough.

By Corollary 2.1, we have

$\hat\rho^{(n)} = \hat b^{(n)}/\|\hat b^{(n)}\|_2,$ (5.65)

where $\hat b^{(n)} = (\hat b_1^{(n)}, \cdots, \hat b_{p_n}^{(n)})^T$ and

$\hat b_i^{(n)} = \mathrm{sgn}(\hat a_i^{(n)})\Big(|\hat a_i^{(n)}| - \frac{\lambda(n)\hat S_{\hat m(n)}^{(n)}}{(1-\lambda(n)) + \hat m(n)\lambda(n)}\Big)_+.$ (5.66)

Here $(x)_+$ denotes the function that equals zero if $x < 0$ and equals $x$ if $x \ge 0$. We will show that $\|\rho^{(n)} - \hat b^{(n)}\|_2^2 \to 0$ as $n \to \infty$. Let

$I_0 = \{1 \le i \le p_n: \hat b_i^{(n)} = 0\},\quad I_1 = \Big\{1 \le i \le p_n: \hat b_i^{(n)} \ne 0,\ \rho_i^{(n)} \le \frac{n^{-\beta}}{\kappa_1} + \frac{\lambda(n)\hat S_{\hat m(n)}^{(n)}}{(1-\lambda(n)) + \hat m(n)\lambda(n)}\Big\},\quad I_2 = \Big\{1 \le i \le p_n: \hat b_i^{(n)} \ne 0,\ \rho_i^{(n)} > \frac{n^{-\beta}}{\kappa_1} + \frac{\lambda(n)\hat S_{\hat m(n)}^{(n)}}{(1-\lambda(n)) + \hat m(n)\lambda(n)}\Big\}.$ (5.67)

Then $I_0 \cup I_1 \cup I_2 = \{1, 2, \cdots, p_n\}$ and the number of elements in $I_1 \cup I_2$ is $\hat m(n)$. It can be seen that for any $i \in I_0$, we have

$\rho_i^{(n)} \le \frac{n^{-\beta}}{\kappa_1} + |\hat a_i^{(n)}| \le \frac{n^{-\beta}}{\kappa_1} + \frac{\lambda(n)\hat S_{\hat m(n)}^{(n)}}{(1-\lambda(n)) + \hat m(n)\lambda(n)},$ (5.68)

for any iI1,

$|\rho_i^{(n)} - \hat b_i^{(n)}| \le \rho_i^{(n)} + |\hat b_i^{(n)}| \le \rho_i^{(n)} + |\hat a_i^{(n)}| + \frac{\lambda(n)\hat S_{\hat m(n)}^{(n)}}{(1-\lambda(n)) + \hat m(n)\lambda(n)} \le 2\rho_i^{(n)} + \frac{n^{-\beta}}{\kappa_1} + \frac{\lambda(n)\hat S_{\hat m(n)}^{(n)}}{(1-\lambda(n)) + \hat m(n)\lambda(n)} \le \frac{3n^{-\beta}}{\kappa_1} + \frac{3\lambda(n)\hat S_{\hat m(n)}^{(n)}}{(1-\lambda(n)) + \hat m(n)\lambda(n)},$ (5.69)

and for any iI2,

$|\rho_i^{(n)} - \hat b_i^{(n)}| \le |\rho_i^{(n)} - |\hat a_i^{(n)}|| + \frac{\lambda(n)\hat S_{\hat m(n)}^{(n)}}{(1-\lambda(n)) + \hat m(n)\lambda(n)} \le \frac{n^{-\beta}}{\kappa_1} + \frac{\lambda(n)\hat S_{\hat m(n)}^{(n)}}{(1-\lambda(n)) + \hat m(n)\lambda(n)}.$ (5.70)

First we calculate

$\Big(\frac{\lambda(n)\hat S_{\hat m(n)}^{(n)}}{(1-\lambda(n)) + \hat m(n)\lambda(n)}\Big)^2\hat m(n) \le \Big(\frac{\lambda(n)S_{\hat m(n)}^{(n)} + \lambda(n)\hat m(n)n^{-\beta}/\kappa_1}{(1-\lambda(n)) + \hat m(n)\lambda(n)}\Big)^2\hat m(n) \le \Big(\frac{\lambda(n)S_{\hat m(n)}^{(n)}}{1-\lambda(n)} + \frac{n^{-\beta}}{\kappa_1}\Big)^2\hat m(n) \le \frac{2\big(\lambda(n)S_{\hat m(n)}^{(n)}\big)^2\hat m(n)}{(1-\lambda(n))^2} + \frac{2n^{-2\beta}\hat m(n)}{\kappa_1^2}.$

By the condition (5.51), we have $S_{m(n)}^{(n)} \le \sum_{\nu=1}^{m(n)}C\nu^{-1/q}$, so that

$S_{m(n)}^{(n)} = \begin{cases} O\big((m(n))^{1-1/q}\big) & \text{if } 1 < q < 2,\\ O(\log(m(n))) & \text{if } q = 1,\\ O(1) & \text{if } 0 < q < 1. \end{cases}$

Hence, by Lemma 12, if 1 < q < 2,

$\big(\lambda(n)S_{\hat m(n)}^{(n)}\big)^2\hat m(n) \le \big(\lambda(n)S_{m(n)}^{(n)}\big)^2 m(n) = (\lambda(n))^2\,O\big((m(n))^{3-2/q}\big) = (\lambda(n))^2\,O\big((\lambda(n))^{-3+2/q}\big) = O\big((\lambda(n))^{-1+2/q}\big) \to 0.$

Similarly, $\big(\lambda(n)S_{\hat m(n)}^{(n)}\big)^2\hat m(n) \to 0$ in the other two cases. Since $n^{-2\beta}\hat m(n) \le n^{-2\beta}m(n) = O(n^{-2\beta}/\tilde\lambda(n)) = O(n^{-2\beta+\alpha}) \to 0$ by (5.49),

$\Big(\frac{\lambda(n)\hat S_{\hat m(n)}^{(n)}}{(1-\lambda(n)) + \hat m(n)\lambda(n)}\Big)^2\hat m(n) \to 0,$ (5.71)

as n → ∞. By (5.68)–(5.70),

$\|\rho^{(n)} - \hat b^{(n)}\|_2^2 = \sum_{i\in I_0}(\rho_i^{(n)})^2 + \sum_{i\in I_1}(\rho_i^{(n)} - \hat b_i^{(n)})^2 + \sum_{i\in I_2}(\rho_i^{(n)} - \hat b_i^{(n)})^2$
$\le 8\sum_{i=1}^{p_n}\Big[\min\Big(\rho_i^{(n)},\ \frac{n^{-\beta}}{\kappa_1} + \frac{\lambda(n)\hat S_{\hat m(n)}^{(n)}}{(1-\lambda(n)) + \hat m(n)\lambda(n)}\Big)\Big]^2$ (5.72)
$\quad + 9\sum_{i\in I_1}\Big[\frac{n^{-\beta}}{\kappa_1} + \frac{\lambda(n)\hat S_{\hat m(n)}^{(n)}}{(1-\lambda(n)) + \hat m(n)\lambda(n)}\Big]^2.$ (5.73)

It follows from (5.71) that

$\sum_{i\in I_1}\Big[\frac{n^{-\beta}}{\kappa_1} + \frac{\lambda(n)\hat S_{\hat m(n)}^{(n)}}{(1-\lambda(n)) + \hat m(n)\lambda(n)}\Big]^2 \le \Big[\frac{n^{-\beta}}{\kappa_1} + \frac{\lambda(n)\hat S_{\hat m(n)}^{(n)}}{(1-\lambda(n)) + \hat m(n)\lambda(n)}\Big]^2\hat m(n) \to 0.$

Because each term in the sum (5.72) converges to zero as $n \to \infty$ and the sequence is dominated by $\{Ci^{-1/q}: i \ge 1\}$, which is square summable, the sum (5.72) converges to zero by the Dominated Convergence Theorem. Hence, we have proved that $\|\rho^{(n)} - \hat b^{(n)}\|_2^2 \to 0$. Since $\|\rho^{(n)}\|_2 = 1$, we have $\|\hat b^{(n)}\|_2^2 \to 1$. Therefore,

$\|\rho^{(n)} - \hat\rho^{(n)}\|_2 = \Big\|\rho^{(n)} - \frac{\hat b^{(n)}}{\|\hat b^{(n)}\|_2}\Big\|_2 \le \Big\|\hat b^{(n)} - \frac{\hat b^{(n)}}{\|\hat b^{(n)}\|_2}\Big\|_2 + \|\rho^{(n)} - \hat b^{(n)}\|_2,$

and the right-hand side goes to zero as $n \to \infty$, which completes the proof.

Supplementary Material

1

Acknowledgments

Supported in part by NSF grant DMS 0714817 and NIH grants P30 DA18343 and R01 GM59507.

The authors want to thank the Editors and the reviewers whose comments have greatly improved the scope and presentation of the paper.

6. Appendix

We will use the following large deviation inequalities for chi-square distribution (see (A.2) and (A.3) in Johnstone and Lu [7]).

$P\Big(\Big|\frac{\chi^2_{(n)}}{n} - 1\Big| > \varepsilon\Big) \le 2\exp\{-3n\varepsilon^2/16\}, \qquad 0 \le \varepsilon < \frac{1}{2}.$ (6.1)
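A quick Monte Carlo sanity check of (6.1), for one arbitrary choice of $n$ and $\varepsilon$, can be done as follows; the empirical tail probability should stay below the stated bound.

```python
import numpy as np

rng = np.random.default_rng(0)
n, eps, reps = 200, 0.25, 200_000
chi2 = rng.chisquare(df=n, size=reps)
emp = np.mean(np.abs(chi2 / n - 1) > eps)      # empirical tail probability
bound = 2 * np.exp(-3 * n * eps ** 2 / 16)     # right-hand side of (6.1)
print(emp, bound)                              # emp should not exceed bound
```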
Proof of Lemma 9

By (5.55), we can pick 0<ε<12 small enough such that

(1-ε)(1-ε2)-σ2ε-ε(2-ε2)1-ε2[3+2(1+ε)σ+σ2]-2-ε21-ε[2σc+2σ2c+σ2c]>(κ1+(1+ε)2σ(c+ε)+ε)2 (6.2)

and

1-ε2-3ε-2(1+ε)εσ-εσ2-(1+ε)(2σc+2σ2c+σ2c)>(1+ε)[(κ2+ε)σ2+(1-ε)ε2], (6.3)
1+σ2(1+ε)+3ε+2(1+ε)σε+(1+ε)(2σc+2σ2c+σ2c)<3+σ2, (6.4)

because the two sides of (6.2) and (6.3) converge to the two sides of (5.55) and (5.56), the left side of (6.4) goes to 1+[2σc+2σ2c+σ2c] which is smaller than 3 by (5.56), as ε → 0. We choose an integer N > 0 (independent of n) such that ν=N+1C2ν-2/q<ε2. We define ρε(n)=(ρ1(n),,ρN(n),0,,0), then by the condition (5.51),

1||ρε(n)||22=ρε(n),ρ(n)1-ε2foralln. (6.5)

Since ρ̂(n) is the solution to (3.2), we have

(ρε(n))T^(n)ρε(n)||ρε(n)||λ(n)2(ρ^(n))T^(n)ρ^(n)||ρ^(n)||λ(n)2(ρ^n)T^(n)ρ^(n)||ρ^(n)||22=(ρ^(n))T^(n)ρ^(n). (6.6)

By a simple calculation, it follows from (5.52) that

(ρ^(n))T^(n)ρ^(n)=||w(n)||22nρ^(n),ρ^(n)2+σ2||ρ^(n)||22+2ρ^(n),y(n)ρ^(n),ρ(n)+σ2(ρ^(n))T(Π(n)n-I)ρ^(n)=ρ^(n),ρ^(n)2+σ2+(||w(n)||22n-1)ρ^(n),ρ^(n)2+2ρ^(n),y(n)ρ^(n),ρ(n)+σ2(ρ^(n))T(Π(n)n-I)ρ^(n)ρ^(n),ρ(n)2+σ2+|||w(n)||22n-1|+2||y(n)||2+σ2γ^(n), (6.7)

where γ̂(n) is the largest eigenvalue of 1nΠ(n)-I. Similarly, by (6.5),

(ρε(n))T^(n)ρε(n)ρε(n),ρ(n)2+σ2||ρε(n)||22-|||w(n)||22n-1|-2||y(n)||2-σ2γ^(n)(1-ε2)2+(1-ε2)σ2-|||w(n)||22n-1|-2||y(n)||2-σ2γ^(n). (6.8)

Now it follows from (6.6) − (6.8) that

ρ^(n),ρ(n)2(1-ε2)(1+σ2-ε2)||ρε(2)||λ(n)2-σ2-(1+1||ρε(n)||λ(n)2)[|||w(n)||22n-1|+2||y(n)||2+σ2γ^(n)]. (6.9)

Since ||ρε(n)||λ(n)2||ρε(n)||221-ε2 and ||ρε(n)||λ(n)2-||ρε(n)||220 as n → 0, we have 1+ε||ρε(n)||λ(n)21-ε2when n is large enough. Therefore, it follows from (6.9) that when n is large enough,

ρ^(n),ρ(n)2(1-ε2)(1+σ2-ε2)1+ε-σ2-(1+11-ε2)[|||w(n)||22n-1|+2||y(n)||2+σ2γ^(n)]=(1-ε)(1-ε2)-σ2ε-2-ε21-ε2[|||w(n)||22n-1|+2||y(n)||2+σ2γ^(n)]. (6.10)

Since pnnc, we have pnnc+ε when n is large enough. If |||w(n)||22n-1|ε,||y(n)||2(1+ε)σpnn+ε and γ^(n)(1+ε)(1+c)2-1, then by (6.10) and (6.2), when n is large enough, we have

ρ^(n),ρ(n)2(1-ε)(1-ε2)-σ2ε-2-ε21-ε2[ε+2(1+ε)σpnn+2ε+(1+ε)σ2(1+c)2-σ2](1-ε)(1-ε2)-σ2ε-2-ε21-ε2[3ε+2(1+ε)σε+2(1+ε)σc+(1+ε)σ2(1+c)2-σ2](1-ε)(1-ε2)-σ2ε-ε(2-ε2)1-ε2[3+2(1+ε)σ+σ2]-2-ε21-ε[2σc+2σ2c+σ2c]>(κ1+(1+ε)2σ(c+ε)+ε)2, (6.11)

moreover, by the definition (5.54) of μ̂ (n), (6.8) and (6.3),

μ^(n)(ρε(n))T^(n)ρε(n)||ρε(n)||λ(n)211+ε[(1-ε2)(1+σ2-ε2)-3ε-2(1+ε)σ(c+ε)-(1+ε)σ2(1+c)2+σ2]=(1-ε)(σ2-ε2)+11+ε[1-ε2-3ε-2(1+ε)εσ-εσ2-(1+ε)(2σc+2σ2c+σ2c)]>(1-ε)(σ2-ε2)+(κ2+ε)σ2+(1-ε)ε2>(1+κ2)σ2, (6.12)

and by (6.6), (6.7), (6.12) and (6.4),

λ(n)||ρ^(n)||12||ρ^(n)||λ(n)2(ρ^(n))T^(n)ρ^(n)(ρε(n))T^(n)ρε(n)||ρε(n)||λ(n)21(1+κ2)σ2(ρ^(n))T^(n)ρ^(n)1σ2(ρ^(n),ρ(n)2+σ2+|||w(n)||22n-1|+2||y(n)||2+σ2γ^(n))1σ2[1+σ2(1+ε)+3ε+2(1+ε)σε+(1+ε)(2σc+2σ2c+σ2c)]3+σ2σ2. (6.13)

Hence, when n is large enough, by (6.11) − (6.13), we have

P(ρ^(n),ρ(n)2(κ1+(1+ε)2σ(c+ε)+ε)2)P(|||w(n)||22n-1|>ε)+P(||y(n)||2>(1+ε)σpnn+ε)+P(γ^(n)>(1+ε)(1+c)2-1), (6.14)
P(μ^(n)σ2(1+κ2))P(|||w(n)||22n-1|>ε)+P(||y(n)||2>(1+ε)σpnn+ε)+P(γ^(n)>(1+ε)(1+c)2-1), (6.15)

and

P(||ρ^(n)||1>3+σ2σλ(n))P(|||w(n)||22n-1|>ε)+P(||y(n)||2>(1+ε)σpnn+ε)+P(γ^(n)>(1+ε)(1+c)2-1). (6.16)

$\|w^{(n)}\|_2^2$ has a $\chi^2_{(n)}$-distribution and $\|y^{(n)}\|_2^2$ has the distribution $\sigma^2 n^{-2}\chi^2_{(n)}\chi^2_{(p_n)}$, where $\chi^2_{(n)}$ and $\chi^2_{(p_n)}$ are independent (see (24) in Johnstone and Lu [7]). It follows from (6.1) that

$\sum_{n=1}^\infty P\Big(\Big|\frac{\|w^{(n)}\|_2^2}{n} - 1\Big| > \varepsilon\Big) < \infty.$ (6.17)

Now we consider the term for ||y(n) ||2. If c > 0, then pn → ∞ and thus

n=1P(||y(n)||2>(1+ε)σpnn+ε)n=1P(||y(n)||2>(1+ε)σpnn)=n=1P(χ(n)2nχ(pn)2pn>(1+ε)2)n=1[P(χ(n)2n>(1+ε))+P(χ(pn)2pn>(1+ε))]< (6.18)

If c = 0, let qn=ε2n(1+ε)2σ2, the largest integer less than or equal to ε2n(1+ε)2σ2.

n=1P(||y(n)||2>(1+ε)σpnn+ε)n=1P(||y(n)||2>ε)=n=1P(χ(n)2nχ(pn)2p>ε2σ2)n=1[P(χ(n)2n>(1+ε))+P(χ(pn)2>ε2n(1+ε)σ2)]n=1[P(χ(n)2n>(1+ε))+P(χ(pn)2+χ(qn-pn)2qn>(1+ε))]<, (6.19)

where χ(qn-pn)2 is independent of χ(pn)2. By (2) in Geman [5], because γ̂(n) + 1 is the largest eigenvalue of Π(n)/n, we have

$\sum_{n=1}^\infty P\big(\hat\gamma^{(n)} > (1+\varepsilon)(1+\sqrt{c})^2 - 1\big) < \infty.$ (6.20)

Hence, by (6.14)–(6.20),

n=1P(ρ^(n),ρ(n)2(κ1+(1+ε)2σ(c+ε)+ε)2)<,n=1P(μ^(n)(1+κ2)σ2)<,n=1P(||ρ^(n)||2>3+σ2σλn))<.P(|ρ^(n),ρ(n)+ρ^(n),y(n)|κ1)P(|ρ^(n),ρ(n)|κ1+||y(n)||2)P(|ρ^(n),ρ(n)|κ1+1+ε)2σpnn+ε)+P(||y(n)||2>(1+ε)2σpnn+ε)P(|ρ^(n),ρ(n)|κ1+1+ε)2σ(c+ε)+ε)+P(||y(n)||2>(1+ε)2σpnn+ε)<.
Proof of Lemma 10

First note that ρ̂(n)/||ρ̂(n)||λ(n) is the solution to the following optimization problem,

maxuRP(^(n)ρ^(n))Tu,subjectto||u||λ(n)1.

We will verify the conditions in Corollary 2.2; then this lemma follows from Corollary 2.2. Let $\hat a^{(n)} = \hat\Sigma^{(n)}\hat\rho^{(n)}$. Without loss of generality, in this proof we assume that $\hat\rho_1^{(n)} \ge \hat\rho_2^{(n)} \ge \cdots \ge \hat\rho_m^{(n)} > 0 = \hat\rho_{m+1}^{(n)} = \cdots = \hat\rho_{p_n}^{(n)}$. Let $I = \{1, 2, \cdots, m\}$. Then by Theorem 2.3, we have $\hat a_I^{(n)} = \hat\Sigma_I^{(n)}\hat\rho_I^{(n)} = \hat\mu^{(n)}A_I[\lambda(n)]\hat\rho_I^{(n)}$, from which we can obtain that

$\hat\rho_i^{(n)} = \begin{cases} \Big[\hat a_i^{(n)} - \frac{\lambda(n)S_m}{(1-\lambda(n)) + m\lambda(n)}\Big]\Big/\big(\hat\mu^{(n)}(1-\lambda(n))\big) & \text{if } i \in I,\\ 0 & \text{if } i \in I^c, \end{cases}$

where $\hat a^{(n)} = (\hat a_1^{(n)}, \cdots, \hat a_{p_n}^{(n)})^T$ and $S_m = \hat a_1^{(n)} + \cdots + \hat a_m^{(n)}$. Under the conditions of this lemma, $\sigma^2/\big(\hat\mu^{(n)}(1-\lambda(n))\big) < 1$; that is, the conditions of Corollary 2.2 are satisfied.

Proof of Lemma 11

By (5.52),

$\xi^{(n)} = (\hat\Sigma^{(n)} - \sigma^2 I)\hat\rho^{(n)} - \big(\langle\hat\rho^{(n)},\rho^{(n)}\rangle + \langle\hat\rho^{(n)},y^{(n)}\rangle\big)\rho^{(n)} = \Big(\frac{\|w^{(n)}\|_2^2}{n} - 1\Big)\langle\hat\rho^{(n)},\rho^{(n)}\rangle\rho^{(n)} + \langle\hat\rho^{(n)},\rho^{(n)}\rangle y^{(n)} + \sigma^2\Big(\frac{\Pi^{(n)}}{n} - I\Big)\hat\rho^{(n)}.$

Hence,

$\max_{1\le i\le p_n}|\xi_i^{(n)}| \le \Big|\frac{\|w^{(n)}\|_2^2}{n} - 1\Big| + \max_{1\le i\le p_n}|y_i| + \max_{1\le i\le p_n}\Big|\Big[\sigma^2\Big(\frac{\Pi^{(n)}}{n} - I\Big)\hat\rho^{(n)}\Big]_i\Big|,$

where $y_i$ and $\big[\sigma^2(\Pi^{(n)}/n - I)\hat\rho^{(n)}\big]_i$ are the $i$-th coordinates of $y^{(n)}$ and $\sigma^2(\Pi^{(n)}/n - I)\hat\rho^{(n)}$, respectively. First, we calculate

P(max1ipn|[σ2(Π(n)n-I)ρ^(n)]i|>n-β3)=P(max1ipn|σ2nji:1pn(k=1nzki(n)zkj(n))ρ^j(n)+σ2(k=1n(zki(n))2n-1)ρ^j(n)|>n-β3)P(max1ipn|σ2nji:1pn(k=1nzki(n)zkj(n))ρ^j(n)|>n-β6)+P(max1ipn|σ2(k=1n(zki(n))2n-1)ρ^j(n)|>n-β6)P(max1ipnmax1jipn|σ2n(k=1nzki(n)zkj(n))|||ρ^(n)||1>n-β6)+i=1pnP(|σ2(k=1n(zki(n))2n-1)|>n-β6)P(max1ipnmax1jipn|σ2n(k=1nzki(n)zkj(n))|||ρ^(n)||1>n-β6,||ρ^(n)||13+σ2σλ(n))+P(||ρ^(n)||1>3+σ2σλ(n))+i=1pnP(|σ2(k=1n(zki(n))2n-1)|>n-β6)i=1pnjiP(|k=1nzki(n)zkj(n)n|>n-βλ(n)6σ3+σ2)+P(||ρ^n||1>3+σ2σλ(n))+pnP(|χ(n)2n-1|>n-β6σ2) (6.21)

Since

$\frac{2\sum_{k=1}^n z_{ki}^{(n)}z_{kj}^{(n)}}{n} = \Big[\frac{\sum_{k=1}^n\big((z_{ki}^{(n)} + z_{kj}^{(n)})/\sqrt{2}\big)^2}{n} - 1\Big] - \Big[\frac{\sum_{k=1}^n\big((z_{ki}^{(n)} - z_{kj}^{(n)})/\sqrt{2}\big)^2}{n} - 1\Big],$

where $\sum_{k=1}^n\big((z_{ki}^{(n)} + z_{kj}^{(n)})/\sqrt{2}\big)^2$ and $\sum_{k=1}^n\big((z_{ki}^{(n)} - z_{kj}^{(n)})/\sqrt{2}\big)^2$ are two independent $\chi^2_{(n)}$ variables, the probability in (6.21) is less than or equal to

2pn2P(|χ(n)2n-1|n-βλ(n)6σ3+σ2)+P(||ρ^(n)||1>3+σ2σλ(n))+pnP(|χ(n)2n-1|>n-β6σ2).
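The chi-square decomposition of the cross term used above is easy to verify numerically; the following sketch checks the algebraic identity for one simulated pair of coordinate vectors.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
zi, zj = rng.standard_normal(n), rng.standard_normal(n)
lhs = 2 * np.sum(zi * zj) / n
rhs = (np.sum(((zi + zj) / np.sqrt(2)) ** 2) / n - 1) \
    - (np.sum(((zi - zj) / np.sqrt(2)) ** 2) / n - 1)
print(np.isclose(lhs, rhs))   # the two chi-square deviations reproduce the cross term
```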

Since n(n-βλ(n))2=O(n1-2β-α) by (5.49), it follows from (6.1) and (5.58) that

n=1P(max1ipn|[σ2(Π(n)n-I)ρ^(n)]i|>n-β3)<. (6.22)

Second, because

yi=σnk=1nzki(n)wk(n)=σ2[k=1n((zki(n)+wk(n))/2)2n-1]-σ2[k=1n((zki(n)-wk(n))/2)2n-1],

where $\sum_{k=1}^n\big((z_{ki}^{(n)} + w_k^{(n)})/\sqrt{2}\big)^2$ and $\sum_{k=1}^n\big((z_{ki}^{(n)} - w_k^{(n)})/\sqrt{2}\big)^2$ are two independent $\chi^2_{(n)}$ variables, by the same argument, we have

n=1P(max1ipnyi>n-β3)<andn=1P(|||w(n)||22n-1|>n-β3)<. (6.23)

Then the lemma follows.

Proof of Lemma 12

Recall the partial sums

Si(n)=j=1iρj(n),S^i(n)=j=1ia^(n)(j),1ipn.

By Theorem 2.2 and Lemma 1, we only need to show that

$|\hat a^{(n)}|_{(m(n))} \le \frac{\lambda(n)\hat S_{m(n)}^{(n)}}{(1-\lambda(n)) + m(n)\lambda(n)}.$ (6.24)

Let k=m(n)4. Since m(n) → ∞ as n → ∞, when n is large, we have k > m0, where m0 is defined in (5.53). Hence, the inequality (5.53) is true for k. From the definition (5.64) of m(n), we can obtain that

$4k\lambda(n) \ge 1 \qquad\text{and}\qquad k\lambda(n)\frac{1-\tau}{\tau} \ge 1.$ (6.25)

Then

λ(n)Sm(n)(n)(1-λ(n))+m(n)λ(n)-ρm(n)=λ(n)S4k(n)(1-λ(n))+4kλ(n)-ρ4k(n)=λ(n)(S4k(n)-4kρ4k(n))-(1-λ(n))ρ4k(n)(1-λ(n))+4kλ(n)λ(n)(S2k(n)-2kρ4k(n))-(1-λ(n))ρ4k(n)(1-λ(n))+4kλ(n)λ(n)(1τ(S4k(n)-S2k(n))-2kρ4k(n))-(1-λ(n))ρ4k(n)(1-λ(n))+4kλ(n)λ(n)(1τ2kρ4k(n)-2kρ4k(n))-(1-λ(n))ρ4k(n)(1-λ(n))+4kλ(n)=2kλ(n)1-ττ-(1-λ(n))(1-λ(n))+4kλ(n)ρ4k(n)2kλ(n)1-ττ-11+4kλ(n)ρ4k(n)2kλ(n)1-ττ-kλ(n)1-ττ4kλ(n)+4kλ(n)ρ4k(n)=1-τ8τρm(n)(n), (6.26)

where the inequality in the third line is due to (5.53) and the last inequality is due to (6.25). It follows from (5.63) that

$S_{m(n)}^{(n)} - m(n)\frac{n^{-\beta}}{\kappa_1} \le \hat S_{m(n)}^{(n)} \le S_{m(n)}^{(n)} + m(n)\frac{n^{-\beta}}{\kappa_1}.$ (6.27)

We will consider the following two cases separately. If 1-τ8τρm(n)(n)2n-βκ1, by (6.26) and (6.27),

a^(n)(m(n))ρm(n)(n)+n-βκ1λ(n)Sm(n)(n)(1-λ(n))+m(n)λ(n)-1-τ8τρm(n)(n)+n-βκ1λ(n)S^m(n)(n)+λ(n)m(n)n-βκ1(1-λ(n))+m(n)λ(n)-1-τ8τρm(n)(n)+n-βκ1λ(n)S^m(n)(n)(1-λ(n))+m(n)λ(n)-1-τ8τρm(n)(n)+2n-βκ1λ(n)S^m(n)(n)(1-λ(n))+m(n)λ(n).

If 1-τ8τρm(n)(n)<2n-βκ1, let us compute

$S_{m(n)}^{(n)} = \sum_{j=1}^{m(n)}\rho_j^{(n)} \ge \sum_{j=1}^{m(n)}(\rho_j^{(n)})^2 = 1 - \sum_{\nu=m(n)+1}^{\infty}(\rho_\nu^{(n)})^2 \ge 1 - \sum_{\nu=m(n)+1}^{\infty}C^2\nu^{-2/q} \to 1,$ (6.28)

where the last inequality follows from the condition (5.51). Hence,

λ(n)S^m(n)(n)(1-λ(n))+m(n)λ(n)-a^(n)(m(n))λ(n)Sm(n)(n)-λ(n)m(n)n-βκ1(1-λ(n))+m(n)λ(n)-a^(n)(m(n))λ(n)Sm(n)(n)1+m(n)λ(n)-n-βκ1-a^(n)(m(n))λ(n)Sm(n)(n)1+m(n)λ(n)-ρm(n)(n)-2n-βκ1λ(n)Sm(n)(n)2m(n)λ(n)-8τ1-τ2n-βκ1-2n-βκ1=Sm(n)(n)m(n)-21+7τ1-τn-βκ1,

where the inequality in the last line is due to (6.25) and $\frac{1-\tau}{8\tau}\rho_{m(n)}^{(n)} < 2n^{-\beta}/\kappa_1$. By (6.28) and the definition (5.64) of $m(n)$, $\hat S_{m(n)}^{(n)}/m(n) \sim \frac{1}{m(n)}O(1) \sim \tilde\lambda(n) \sim O(n^{-\alpha})$, which is larger than $2\frac{1+7\tau}{1-\tau}\frac{n^{-\beta}}{\kappa_1}$ when $n$ is large, by (5.49). Therefore, we have proved (6.24).

Footnotes

Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Contributor Information

Xin Qi, Email: xqi3@gsu.edu, Department of Mathematics and Statistics, Georgia State University, 30 Pryor Street, Atlanta, GA 30303-3083.

Ruiyan Luo, Email: rluo@gsu.edu, Department of Mathematics and Statistics, Georgia State University, 30 Pryor Street, Atlanta, GA 30303-3083.

Hongyu Zhao, Email: hongyu.zhao@yale.edu, Department of Epidemiology and Public Health, Yale University, New Haven, CT 06520-8034.

References

  • 1.Amini A, Wainwright M. High-dimensional analysis of semidefinite relaxations for sparse principal components. Annals of Statistics. 2009;37:2877–2921. [Google Scholar]
  • 2.Boyd S, Vandenberghe L. Convex Optimization. chap 5. Cambridge University Press; 2004. [Google Scholar]
  • 3.Cadima J, Jolliffe I. Loadings and correlations in the interpretation of principal components. Journal of Applied Statistics. 1995;22:203–214. [Google Scholar]
  • 4.D’Aspremont A, Bach F, Ghaoui LE. A direct formulation of sparse PCA using semidefinite programming. SIAM Review. 2007;49:434–448. [Google Scholar]
  • 5.Geman S. A limit theorem for the norm of random matrices. The Annals of Probability. 1980;8:252–261. [Google Scholar]
  • 6.Jeffers J. Two case studies in the application of principal component. Applied Statistics. 1967;16:225–236. [Google Scholar]
  • 7.Johnstone IM, Lu AY. On consistency and sparsity for principal components analysis in high dimensions. Journal of the American Statistical Association. 2009;104(486):682–693. doi: 10.1198/jasa.2009.0121. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.Jolliffe IT, Trendafilov NT, Uddin M. A modified principal component technique based on the LASSO. Journal of Computational and Graphical Statistics. 2003;12:531–547. [Google Scholar]
  • 9.Munkres J. Topology. 2. Prentice Hall; 2000. [Google Scholar]
  • 10.Nadler B. Finite sample approximation results for principal component analysis: A matrix perturbation approach. The Annals of Statistics. 2008;36:2791–2817. [Google Scholar]
  • 11.Paul D. Asymptotics of sample eigenstructure for a large dimensional spiked covariance model. Statistica Sinica. 2007;17:1617–1642. [Google Scholar]
  • 12.Quarteroni A, Sacco R, Saleri F. Numerical Mathematics (Graduate Texts in Mathematics) 2. Springer; 2006. [Google Scholar]
  • 13.Shen H, Huang J. Sparse principal component analysis via regularized low rank matrix approximation. Journal of Multivariate Analysis. 2008;99 [Google Scholar]
  • 14.Trendafilov NT, Jolliffe IT. Projected gradient approach to the numerical solution of the scotlass. Computational Statistics and Data Analysis. 2006;50:242–253. [Google Scholar]
  • 15.Witten DM, Tibshirani R, Hastie T. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics. 2009;10:515–534. doi: 10.1093/biostatistics/kxp008. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16.Zou H, Hastie T, Tibshirani R. Sparse principal component analysis. Journal of Computational and Graphical Statistics. 2006;15:265–286. [Google Scholar]
