Abstract
We propose a semiparametric method for conducting scale-invariant sparse principal component analysis (PCA) on high dimensional non-Gaussian data. Compared with sparse PCA, our method makes weaker modeling assumptions and is more robust to possible data contamination. Theoretically, the proposed method achieves a parametric rate of convergence in estimating the parameters of interest under a flexible semiparametric distribution family; computationally, it exploits a rank-based procedure and is as efficient as sparse PCA; empirically, it outperforms most competing methods on both synthetic and real-world datasets.
Keywords: High dimensional statistics, Principal component analysis, Elliptical distribution, Robust statistics
1 Introduction
Principal component analysis (PCA) is a powerful tool for dimension reduction and feature selection. Let x1, …, xn be n observations of a d-dimensional random vector X with covariance matrix Σ. PCA aims at estimating the leading eigenvectors u1, …, um of Σ.
When the dimension d is small compared with the sample size n, u1, …, um can be consistently estimated by the leading eigenvectors of the sample covariance matrix (Anderson, 1958). However, when d increases at the same order as, or even faster than, n, this approach can lead to poor estimates. In particular, Johnstone and Lu (2009) showed that the angle between the leading eigenvector of the sample covariance matrix and u1 may not converge to 0 if d/n → c for some constant c > 0. To handle this challenge, one popular approach is to impose a sparsity constraint on the leading eigenvectors. For example, when estimating the leading eigenvector u1 := (u11, …, u1d)T, we may assume that s := card({j : u1j ≠ 0}) < n. Under this assumption, different variants of sparse PCA have been developed; more details can be found in d’Aspremont et al. (2004), Zou et al. (2006), Shen and Huang (2008), Witten et al. (2009), Journée et al. (2010), and Zhang and El Ghaoui (2011). The theoretical properties of sparse PCA in feature selection and parameter estimation have been investigated by Amini and Wainwright (2009), Ma (2013), Paul and Johnstone (2012), Vu and Lei (2012), and Berthet and Rigollet (2012).
There are several drawbacks to classical PCA and sparse PCA: (i) They are not scale-invariant, i.e., changing the measurement scales of the variables changes the estimates (Chatfield and Collins, 1980); (ii) They are not robust to possible data contamination or outliers (Puri and Sen, 1971); (iii) The theory of sparse PCA relies heavily on the Gaussian or sub-Gaussian assumption, which may not be realistic for many real-world applications.
In low dimensional settings, remedies for drawbacks (ii) and (iii) include generalizing the Gaussian distribution to the elliptical distribution (Fang et al., 1990) and considering robust estimators (Huber and Ronchetti, 2009). One research line develops PCA estimators for elliptical data (Möttönen and Oja, 1995; Choi and Marden, 1998; Marden, 1999; Visuri et al., 2000; Croux et al., 2002; Jackson and Chen, 2004). The theoretical properties of these elliptical distribution based PCA estimators have been established under the classical asymptotic framework (i.e., the dimension d is fixed) by Hallin et al. (2010), Oja (2010), and Croux and Dehon (2010). Along another research line, multiple robust PCA estimators have been proposed to address the outlier and heavy tail issues by replacing the sample covariance matrix with a robust scatter matrix. Such robust scatter matrix estimators include the M-estimator (Maronna, 1976), the S-estimator (Davies, 1987), the median absolute deviation (MAD) proposed by Hampel (1974), and the Sn and Qn estimators (Rousseeuw and Croux, 1993). These robust scatter matrix estimators have been exploited to conduct robust (sparse) principal component analysis (Gnanadesikan and Kettenring, 1972; Maronna and Zamar, 2002; Hubert et al., 2002; Croux and Ruiz-Gazen, 2005; Croux et al., 2013). The performance of PCA based on these robust estimators in low dimensions was further analyzed by Croux and Haesbroeck (2000).
In this article we propose a new method for conducting sparse principal component analysis on non-Gaussian data. Our method can be viewed as a scale-invariant version of sparse PCA but is applicable to a wide range of distributions belonging to the meta-elliptical family (Fang et al., 2002). The meta-elliptical family extends the elliptical family. In particular, a continuous random vector X := (X1, …, Xd)T follows a meta-elliptical distribution if there exists a set of univariate strictly increasing functions f1, …, fd such that f(X) := (f1(X1), …, fd(Xd))T follows an elliptical distribution with location parameter 0 and scale parameter Σ0, whose diagonal values are all 1. We call Σ0 the latent generalized correlation matrix. By treating f1, …, fd as nuisance parameters, our method estimates the leading eigenvector θ1 of Σ0 by exploiting a rank-based estimating procedure, and it can be viewed as a scale-invariant PCA conducted on f(X). Theoretically, we show that when s is fixed, the method achieves a parametric rate of convergence in estimating the leading eigenvector. Computationally, it is as efficient as sparse PCA. Empirically, we show that the proposed method outperforms the classical sparse PCA and two robust alternatives on both synthetic and real-world datasets.
The rest of this paper is organized as follows. In the next section, we review the elliptical distribution family and introduce the meta-elliptical distribution. In Section 3, we present the statistical model, introduce the rank-based estimators, and provide a computational algorithm for parameter estimation. In Section 4, we provide theoretical analysis. In Section 5, we provide empirical studies on both synthetic and real-world datasets. More discussion and comparisons with related methods are given in the last section.
2 Elliptical and Meta-elliptical Distributions
In this section, we briefly review the elliptical distribution and introduce the meta-elliptical distribution family. We start by introducing the notation. Let M ∈ ℝd×d be a d × d matrix and v ∈ ℝd be a d-dimensional vector. We denote by vI the subvector of v whose entries are indexed by a set I, and by MI,J the submatrix of M whose rows are indexed by I and whose columns are indexed by J. Let MI* and M*J be the submatrix of M with rows in I, and the submatrix of M with columns in J. Let supp(v) := {j : vj ≠ 0}. For 0 < q < ∞, we define the ℓ0, ℓq, and ℓ∞ vector norms as ||v||0 := card(supp(v)), ||v||q := (Σj |vj|q)1/q, and ||v||∞ := maxj |vj|. We define the matrix ℓmax norm as the elementwise maximum value: ||M||max := maxjk |Mjk|. Let Λj(M) be the j-th largest eigenvalue of M; in particular, we denote by Λmin(M) := Λd(M) and Λmax(M) := Λ1(M) the smallest and largest eigenvalues of M. Let ||M||2 be the spectral norm of M. We define 𝕊d−1 := {v ∈ ℝd : ||v||2 = 1} to be the d-dimensional unit sphere. For any two vectors a, b ∈ ℝd and any two square matrices A, B ∈ ℝd×d, we denote the inner products of a and b, and of A and B, by 〈a, b〉 := aTb and 〈A, B〉 := Tr(ATB). For any matrix M, we denote by diag(M) the diagonal matrix with the same diagonal entries as M. For any univariate function f, we denote f(M) = [f(Mjk)] to be the d × d matrix with f applied to each entry of M. Let Id be the identity matrix in ℝd×d. For two random vectors X and Y, we write X =d Y if they are identically distributed.
2.1 Elliptical Distribution
We briefly overview the elliptical distribution. In the sequel, we say a random vector X = (X1, …, Xd)T is continuous if its marginal distributions are all continuous, and we say X possesses a density if it is absolutely continuous with respect to the Lebesgue measure.
Definition 2.1 (Elliptical distribution). A random vector Z = (Z1, …, Zd)T follows an elliptical distribution if and only if Z has the stochastic representation Z =d μ + ξAU. Here μ ∈ ℝd, A ∈ ℝd×q, q := rank(A), ξ ≥ 0 is a random variable independent of U, and U is uniformly distributed on the unit sphere 𝕊q−1 in ℝq. Letting Σ := AAT, we denote Z ~ ECd(μ, Σ, ξ). We call Σ the scatter matrix.
In Definition 2.1, there can be multiple A’s corresponding to the same Σ, i.e., there can exist A1 ≠ A2 such that A1A1T = A2A2T = Σ. To make the representation unique, we always parameterize an elliptical distribution by the scatter matrix Σ instead of A.
The model family in Definition 2.1 is not identifiable: Σ is unique only up to a constant scaling. In particular, for any constant c > 0, defining ξ* := ξ/c and A* := cA leaves the distribution of Z unchanged while the scatter matrix becomes A*(A*)T = c2Σ. To make the model identifiable, we require the condition that max1≤i≤d Σii = 1. We define Σ0 := diag(Σ)−1/2 · Σ · diag(Σ)−1/2 to be the generalized correlation matrix.
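To make the stochastic representation in Definition 2.1 concrete, the following minimal sketch (ours, for illustration only) draws samples of Z = μ + ξAU; the particular scatter matrix and the choice ξ =d χd (which recovers the Gaussian) are illustrative assumptions.

```python
import numpy as np

def sample_elliptical(n, mu, A, xi_sampler, rng):
    """Draw n samples of Z = mu + xi * A @ U (Definition 2.1), where U is
    uniform on the unit sphere in R^q and xi >= 0 is independent of U."""
    d, q = A.shape
    # Normalizing standard Gaussian vectors yields uniform points on S^{q-1}.
    G = rng.standard_normal((n, q))
    U = G / np.linalg.norm(G, axis=1, keepdims=True)
    xi = xi_sampler(n, rng)                    # generating variates xi_i >= 0
    return mu + xi[:, None] * (U @ A.T)

rng = np.random.default_rng(0)
Sigma = np.array([[1.0, 0.5, 0.2],
                  [0.5, 1.0, 0.3],
                  [0.2, 0.3, 1.0]])           # an illustrative scatter matrix
A = np.linalg.cholesky(Sigma)                 # any A with A @ A.T = Sigma works
d = Sigma.shape[0]
chi_d = lambda n, rng: np.sqrt(rng.chisquare(d, n))  # xi = chi_d gives N(0, Sigma)
Z = sample_elliptical(100000, np.zeros(d), A, chi_d, rng)
print(np.cov(Z.T))                            # should be close to Sigma
```

With a different generating variable, e.g. ξ = χd/(χκ/√κ), the same routine produces multivariate t samples, illustrating how ξ controls the tail behavior while Σ is unchanged.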
2.2 Meta-elliptical Distribution
Real-world data are usually non-Gaussian and asymmetric. To illustrate the non-Gaussianity and asymmetry issues, we consider the stock log-return data in the S&P 500 index, collected from Yahoo! Finance (finance.yahoo.com) from January 1, 2003 to January 1, 2008, including 452 stocks and 1,257 data points. Table 1 illustrates the non-Gaussianity of the stock log-return data. Here we conduct the three marginal normality tests listed in Table 1 at the significance level of 0.05. It is clear that at most 24 out of 452 stocks pass any of the three normality tests. Even with the Bonferroni correction, over half of the stocks still fail to pass any normality test. Figure 1 plots the histograms of three typical stocks, “eBay Inc.”, “Macy’s Inc.”, and “Wells Fargo”, in the sectors of information technology, consumer discretionary, and financials, respectively. We see that the log-return values are skewed to the left.
Table 1. Number of stocks (out of 452) for which normality is rejected by each test.

| Significance level | Kolmogorov-Smirnov | Shapiro-Wilk | Lilliefors |
|---|---|---|---|
| 0.05 | 428 | 449 | 449 |
| 0.05/452 | 269 | 448 | 426 |
Though the elliptical distribution family has been widely used to model heavy-tailed data (Oja, 2010), it assumes that the distribution contours exhibit an ellipsoidal structure. To relax this assumption, Fang et al. (2002) introduced the concept of the meta-elliptical distribution under a copula framework. In this section we introduce the meta-elliptical distribution using a different approach, which extends the family defined in Fang et al. (2002).
First, we define two sets of symmetric matrices with unit diagonals:

ℛd := {Σ ∈ ℝd×d : ΣT = Σ, diag(Σ) = Id, Σ ⪰ 0} and ℛd+ := {Σ ∈ ℛd : Σ ≻ 0}.
The meta-elliptical distribution family is defined as follows:
Definition 2.2 (Meta-elliptical distribution). A continuous random vector X = (X1, …, Xd)T follows a meta-elliptical distribution, denoted by X ~ MEd(Σ0, ξ; f1, …, fd), if there exist univariate strictly increasing functions f1, …, fd such that

(f1(X1), …, fd(Xd))T ~ ECd(0, Σ0, ξ) with Σ0 ∈ ℛd.   (2.1)
Here, Σ0 is called the latent generalized correlation matrix. When ξ =d χd, so that the base distribution ECd(0, Σ0, ξ) is the Gaussian Nd(0, Σ0), we say that X follows a nonparanormal distribution, denoted by X ~ NPNd(Σ0; f1, …, fd).
The meta-elliptical family is a strict extension of the nonparanormal family defined in Liu et al. (2012). Both assume that, after unspecified marginal transformations, the data follow a certain base distribution. However, the nonparanormal exploits a Gaussian base distribution while the meta-elliptical exploits an elliptical base distribution.
On the other hand, we would like to point out that Definition 2.2 extends the family originally defined in Fang et al. (2002) in three aspects: (i) The generating variable ξ does not have to be absolutely continuous; (ii) The parameter Σ0 is strictly enlarged from ℛd+ to ℛd; (iii) X does not necessarily possess a density. Moreover, even when confined to the set of distributions possessing densities, where the two definitions coincide, we define the meta-elliptical in a fundamentally different way, by characterizing the transformation functions instead of the density functions. By exploiting this new definition, we find that several results provided in the later sections are easier to understand.
The meta-elliptical family is rich and contains many useful distributions, including multivariate Gaussian, rank-deficient Gaussian, multivariate t, logistic, Kotz, symmetric Pearson type-II and type-VII, the nonparanormal, and various other asymmetric distributions such as multivariate asymmetric t distribution (Fang et al., 2002). To illustrate the modeling flexibility of the meta-elliptical family, Figure 2 visualizes the density functions of two meta-elliptical distributions.
3 Methodology
We propose a new scale-invariant sparse PCA method based on the meta-elliptical distribution family. More specifically, under a meta-elliptical model X ~ MEd(Σ0, ξ; f1, …, fd), the proposed method aims at estimating the leading eigenvector of Σ0. Since the diagonal entries of Σ0 are all 1, the proposed method is scale-invariant. From Definition 2.2, the proposed method is equivalent to conducting scale-invariant sparse PCA on the transformed data (f1(X1), …, fd(Xd))T which follow an elliptical distribution.
3.1 Statistical Model
The statistical model of our proposed method is defined as follows:
Definition 3.1. We consider the following model, denoted by ℳd(s), which is defined to be the set of distributions:

ℳd(s) := {X ~ MEd(Σ0, ξ; f1, …, fd) : ||θ1||0 ≤ s, Λ1(Σ0) > Λ2(Σ0)},   (3.1)

where θ1 is the leading eigenvector of the latent generalized correlation matrix Σ0.
This model allows asymmetric and heavy-tailed distributions with nontrivial tail dependence. It can be used as a powerful tool for modeling real-world data.
3.2 Method
We now present the proposed method, which exploits the model (3.1). One of the key components of the proposed rank-based method is the Kendall’s tau correlation matrix estimator, which is explained in the next section.
3.2.1 Kendall’s tau based Correlation Matrix Estimator
The Kendall’s tau statistic was introduced by Kendall (1948) for estimating pairwise correlation and has been used for principal component analysis in low dimensions (Croux et al., 2002; Gibbons and Chakraborti, 2003). More specifically, let X := (X1, …, Xd)T be a d-dimensional random vector and let X̃ := (X̃1, …, X̃d)T be an independent copy of X. The Kendall’s tau correlation coefficient between Xj and Xk is defined as

τ(Xj, Xk) := P((Xj − X̃j)(Xk − X̃k) > 0) − P((Xj − X̃j)(Xk − X̃k) < 0).
The next theorem shows that, for the meta-elliptical distribution family, we have a one-to-one map between Σ0jk and τ(Xj, Xk).
Theorem 3.2. Given X ~ MEd(Σ0, ξ; f1, …, fd) meta-elliptically distributed, we have

Σ0jk = sin((π/2) · τ(Xj, Xk)).   (3.2)
Proof. It is obvious that the Kendall’s tau statistic is invariant under strictly increasing transformations to the marginal variables. Moreover, Lindskog et al. (2003) show that the Kendall’s tau statistic is invariant to different generating variables ξ’s. Combining these two results and Equation (6.6) of Kruskal (1958), we obtain the desired result.
Let x1, …, xn with xi := (xi1, …, xid)T be n data points of X. The sample version of the Kendall’s tau statistic is defined as:

τ̂jk := (2/(n(n − 1))) Σ1≤i<i′≤n sign(xij − xi′j) sign(xik − xi′k).

It is easy to see that τ̂jk is an unbiased estimator of τ(Xj, Xk). Using τ̂jk, we define the Kendall’s tau correlation matrix as follows:
Definition 3.3 (Kendall’s tau correlation matrix). We define the Kendall’s tau correlation matrix R̂ to be the d × d matrix with entries

R̂jk := sin((π/2) · τ̂jk).   (3.3)
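To make the estimator concrete, here is a minimal sketch (ours, not the authors’ code) that computes R̂ exactly as in Definition 3.3: the U-statistic τ̂jk followed by the sine transform of Theorem 3.2. For large n, an O(n log n) routine such as scipy.stats.kendalltau would replace the quadratic pairwise comparison below.

```python
import numpy as np

def kendall_tau_matrix(X):
    """Kendall's tau correlation matrix estimator (Definition 3.3):
    R_hat[j, k] = sin(pi/2 * tau_hat_jk). X has shape (n, d)."""
    n, d = X.shape
    R = np.eye(d)
    # sign(x_ij - x_i'j) for every pair (i, i'); O(n^2 d) memory, fine for a sketch
    S = np.sign(X[:, None, :] - X[None, :, :])
    for j in range(d):
        for k in range(j + 1, d):
            # Second-order U-statistic; the i = i' terms vanish since sign(0) = 0,
            # so summing over all ordered pairs equals 2 * sum over i < i'.
            tau = (S[:, :, j] * S[:, :, k]).sum() / (n * (n - 1))
            R[j, k] = R[k, j] = np.sin(np.pi / 2.0 * tau)
    return R
```

Because R̂ depends on the data only through ranks, it is invariant to the unknown marginal transformations f1, …, fd, which is exactly why it can estimate the latent Σ0.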
3.2.2 Rank-based Estimators
Given the model ℳd(s), Theorem 3.2 provides a natural way to estimate θ1. In particular, we solve the following optimization problem:

θ̂1 ∈ argmax{vT R̂ v : v ∈ 𝕊d−1, ||v||0 ≤ k},   (3.4)

where k ≥ s is a sufficiently large tuning parameter and R̂ is the Kendall’s tau correlation matrix defined in Equation (3.3). Equation (3.4) is a combinatorial optimization problem and is hard to compute. The corresponding global optimum is denoted by θ̂1.
Because the estimator θ̂1 is very hard to compute, we consider an alternative way to estimate θ1 using the truncated power algorithm proposed by Yuan and Zhang (2013). This algorithm yields an estimator θ̃1. Here k is a hypothesized value for s (the number of nonzero elements of θ1) and can be treated as a tuning parameter.
More specifically, we apply the classical power method, but within each iteration t we project the intermediate vector v(t) onto the intersection of the d-dimensional unit sphere and the ℓ0 ball of radius k > 0. In detail, we sort the absolute values of the elements of v(t) from the highest to the lowest, keep the k entries with the highest absolute values, truncate all the others to zero, and then normalize the truncated vector so that it lies in 𝕊d−1. To present the detailed algorithm, we first introduce some additional notation. For any vector v ∈ ℝd and an index set J ⊆ {1, …, d}, we define the truncation function TRC(·, ·) to be

TRC(v, J) := (v1 · I(1 ∈ J), …, vd · I(d ∈ J))T,   (3.5)
where I(·) is the indicator function. The truncated power algorithm is presented in Algorithm 1.
The formulation of the truncated power algorithm is nonconvex, and the performance of the estimator relies on the selection of the initial vector v(0). In practice, we use the estimate obtained from the SPCA algorithm (Zou et al., 2006) as the initial vector. We set the termination criterion to be ||v(t) − v(t−1)||2 ≤ 10−4.
In Section 4, we show that, by appropriately setting the initial vector v(0), the algorithm converges and the corresponding estimator is a consistent estimator of θ1. In practice, we find that this algorithm always converges on all the synthetic and real-world data.
Algorithm 1. The truncated power method for estimating θ1: iterate a power step on R̂, the truncation TRC(·, ·) of Equation (3.5), and renormalization until the termination criterion is met (Yuan and Zhang, 2013).
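Since the pseudocode of Algorithm 1 is not reproduced here, the following is a minimal sketch of the iteration just described; the max_iter guard is our addition.

```python
import numpy as np

def truncated_power(R, k, v0, tol=1e-4, max_iter=1000):
    """Truncated power method (following Yuan and Zhang, 2013).
    R: (d, d) input matrix, e.g. the Kendall's tau correlation matrix R_hat.
    k: cardinality constraint; v0: initial vector (e.g. from SPCA)."""
    v = v0 / np.linalg.norm(v0)
    for _ in range(max_iter):
        w = R @ v                              # power step
        J = np.argsort(np.abs(w))[-k:]         # indices of the k largest |entries|
        w_trunc = np.zeros_like(w)
        w_trunc[J] = w[J]                      # TRC(w, J), Equation (3.5)
        w_trunc /= np.linalg.norm(w_trunc)     # project back onto the unit sphere
        if np.linalg.norm(w_trunc - v) <= tol: # termination criterion
            return w_trunc
        v = w_trunc
    return v
```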
3.3 Estimating the Top m Leading Eigenvectors
We exploit the iterative deflation method to estimate the top m leading eigenvectors θ1, …, θm of Σ0. This method was proposed by Mackey (2009) and its empirical performance was further evaluated in Yuan and Zhang (2013). In detail, for any positive semidefinite matrix Γ ∈ ℝd×d, its deflation with respect to a vector v ∈ 𝕊d−1 is defined as:

D(Γ, v) := (Id − vvT) Γ (Id − vvT).
In this way, D(Γ, v) is positive semidefinite, left and right orthogonal to v, and symmetric. To estimate θ1, …, θm, we exploit the following approach: (i) The estimate of θ1 (which can be either θ̂1 or θ̃1) is calculated using Equation (3.4) or the truncated power method with input Γ̂(1) := R̂ (the population counterpart is Γ(1) := Σ0); (ii) Given Γ̂(j) and the estimate θ̂j, we estimate θj+1 by plugging Γ̂(j+1) := D(Γ̂(j), θ̂j) into Equation (3.4) or the truncated power method.
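A minimal sketch of this deflation scheme, reusing the truncated_power routine sketched in Section 3.2.2; the init callback that supplies an initial vector for each component (e.g., from SPCA) is a hypothetical placeholder.

```python
import numpy as np

def deflate(Gamma, v):
    """Deflation of Mackey (2009): D(Gamma, v) = (I - v v^T) Gamma (I - v v^T),
    which is symmetric, positive semidefinite, and orthogonal to v."""
    P = np.eye(len(v)) - np.outer(v, v)
    return P @ Gamma @ P

def top_m_eigenvectors(R, m, k, init):
    """Estimate the top m sparse leading eigenvectors by iterative deflation,
    starting from Gamma_hat^(1) := R_hat."""
    Gamma = R.copy()
    thetas = []
    for j in range(m):
        theta = truncated_power(Gamma, k, init(j, Gamma))  # estimate theta_{j+1}
        thetas.append(theta)
        Gamma = deflate(Gamma, theta)  # Gamma_hat^(j+1) := D(Gamma_hat^(j), theta)
    return np.column_stack(thetas)
```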
4 Theoretical Properties
In this section we provide the theoretical properties of the estimators θ̂1 and θ̃1. In the analysis, we adopt the double asymptotic framework in which the dimension d increases with the sample size n. This framework more realistically reflects the challenges of many high dimensional applications (Bühlmann and van de Geer, 2011).
4.1 Latent Generalized Correlation Matrix Estimation
In this section we focus on estimating the latent generalized correlation matrix Σ0. In the next theorem we prove the rate of convergence of R̂jk uniformly over all indices j, k. This is an important result, which indicates that the Gaussian parametric rate in estimating the correlation matrix obtained by Bickel and Levina (2008) can be extended to the meta-elliptical distribution family using the Kendall’s tau statistic.
Theorem 4.1. Let x1, …, xn be n observations of X ~ MEd(Σ0, ξ; f1, …, fd) and let R̂ be defined as in Equation (3.3). We have, with probability at least 1 − d−5/2,

maxjk |R̂jk − Σ0jk| ≤ 3π √(log d / n).   (4.1)
Proof. The result follows from Theorem 4.2 in Liu et al. (2012), but with a slightly different probability bound. A detailed proof is provided in Appendix A.2 for completeness.
4.2 Leading Eigenvector Estimation
We analyze the estimation errors of the global optimum θ̂1 and of the estimator θ̃1 obtained from the truncated power algorithm. We say that the model ℳd(s) holds if the data are drawn from a probability distribution in ℳd(s). The next theorem provides an upper bound on the angle between θ̂1 and θ1.
Theorem 4.2. Let θ̂1 be the global optimum to Equation (3.4), let the model ℳd(s) hold, and let k ≥ s. For any two vectors u, v ∈ 𝕊d−1, let sin∠(u, v) := √(1 − (uTv)2) denote the sine of the angle between them. Then we have, with probability at least 1 − d−5/2,

sin∠(θ̂1, θ1) ≤ (3√2 π (s + k) / (λ1 − λ2)) · √(log d / n),   (4.2)

where λj := Λj(Σ0) for j = 1, 2.
Proof. The key idea of the proof is to exploit the results in Theorem 4.1 in bounding the estimation error. Detailed proofs are presented in Appendix A.3.
Remark 4.3. When s, λ1, λ2 do not scale with (n, d) and k ≥ s is a fixed constant, the rate of convergence in parameter estimation is OP(√(log d / n)), which is the minimax optimal rate shown in Vu and Lei (2012) under a certain model class.
In the next corollary, we provide a feature selection result for the proposed method. Given that the selected tuning parameter k is large enough, we show that the support set of θ1 can be consistently recovered at a fast rate by imposing a constraint on the minimum absolute value of the signal part of θ1.
Corollary 4.4 (Feature selection). Let θ̂1 be the global optimum to Equation (3.4), let the model ℳd(s) hold, and let k ≥ s. Let T := supp(θ1) and T̂ := supp(θ̂1). If we further have minj∈T |θ1j| > (6π(s + k) / (λ1 − λ2)) · √(log d / n), then T ⊆ T̂ with probability at least 1 − d−5/2.
Proof. The key of the proof is to construct a contradiction given Theorem 4.2 and the condition on the minimum absolute value of θ1 over its support. A detailed proof is given in Appendix A.4.
In the next theorem, we provide a result on the convergence rate of the estimator θ̃1 obtained by exploiting the truncated power algorithm. This theorem, coming from Yuan and Zhang (2013), indicates that under suitable conditions θ̃1 converges to θ1 at the rate of (s + k)√(log d / n).
Theorem 4.5. If the model ℳd(s) holds, the conditions in Theorem 1 in Yuan and Zhang (2013) hold, and k ≥ s, we have, with probability at least 1 − d−5/2,

sin∠(θ̃1, θ1) ≤ C(s + k) √(log d / n)

for some generic constant C not scaling with (n, d, s).
The result in Theorem 4.5 is a direct consequence of Theorem 1 in Yuan and Zhang (2013), and therefore the proof is omitted. Here we note that, similarly to Corollary 4.4, it can be shown that under certain conditions supp(θ1) ⊆ supp(θ̃1) with high probability.
4.3 Principal Component Estimation
In this section, we consider estimating the latent principal components of the meta-elliptically distributed data. To estimate the latent principal components instead of the eigenvectors of the latent generalized correlation matrix, one needs to obtain good estimates of the unknown transformation functions f1, …, fd.
Let X ~ MEd(Σ0, ξ; f1, …, fd) follow a meta-elliptical distribution and x1, …, xn be n observations of X with xi := (xi1, …, xid)T. Let Z := (f1(X1), …, fd(Xd))T be the transformed random vector. By definition, Z ~ ECd(0, Σ0, ξ) is elliptically distributed. Let Qg be the marginal distribution function of Z (from Proposition A.2, we know all the entries of Z share the same marginal distribution function). If Qg is known, we can estimate f1, …, fd as follows. For j = 1, …, d, let the rescaled empirical distribution function F̂j be defined as

F̂j(t) := (1/(n + 1)) Σi I(xij ≤ t).
We define

f̂j(·) := Qg−1(F̂j(·))   (4.3)
to be an estimator of fj. When Qg(·) = Φ(·), where Φ(·) is the distribution function of the standard Gaussian, we have the following theorem, showing that f̂j(·) converges to fj(·) uniformly over an expanding interval with high probability.
Theorem 4.6 (Han et al. (2013)). Suppose that X ~ NPNd(Σ0; f1, …, fd) and, for j = 1, …, d, let gj := fj−1 be the inverse function of fj. For any 0 < γ < 1, we define the interval

Ij := [gj(−√(2γ log n)), gj(√(2γ log n))];

then supt∈Ij |f̂j(t) − fj(t)| = OP(√((log n · log log n) / n1−γ)).
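For concreteness, here is a minimal sketch of the estimator (4.3) in the nonparanormal case Qg = Φ; the 1/(n + 1) rescaling of the empirical distribution function, which keeps Qg−1 finite, is the choice assumed in this sketch.

```python
import numpy as np
from scipy.stats import norm

def fit_marginal_transform(x, Qg_inv=norm.ppf):
    """Estimate f_j as in Equation (4.3): Qg^{-1} composed with the rescaled
    empirical CDF F_hat_j(t) = #{i : x_ij <= t} / (n + 1)."""
    x_sorted = np.sort(x)
    n = len(x)
    def f_hat(t):
        F = np.searchsorted(x_sorted, t, side="right") / (n + 1.0)
        return Qg_inv(F)
    return f_hat

# Usage sketch: if x_ij = z^3 with z standard Gaussian, then f_j(t) = t^{1/3},
# so f_hat(1.0) should be close to 1 and f_hat(0.0) close to 0.
rng = np.random.default_rng(1)
x = rng.standard_normal(2000) ** 3
f_hat = fit_marginal_transform(x)
print(f_hat(np.array([0.0, 1.0])))
```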
Using Theorem 4.6, we have the following theorem, which shows that, under appropriate conditions, we can recover the first principal component of any data point x.
Theorem 4.7. For any observation x := (x1, …, xd)T of X ~ NPNd(Σ0; f1, …, fd), under the conditions of Theorem 4.2 and letting b be any positive constant such that (s + k)n−b/2 = o(1), the estimated first latent principal component θ̂1T f̂(x) converges in probability to the population counterpart θ1T f(x), where f̂ := (f̂1, …, f̂d)T is defined as in Equation (4.3).
Proof. Theorem 4.7 is proved by combining the results in Theorems 4.2 and 4.6. More details are presented in Appendix A.5.
5 Experiments
In this section we evaluate the empirical performance of the proposed method on both synthetic and real-world datasets and compare its performance with the classical sparse PCA and two additional robust sparse PCA procedures. We use the truncated power method proposed by Yuan and Zhang (2013) for parameter estimation. The following four methods are considered:
Pearson: the classical high dimensional scale-invariant PCA using the Pearson’s sample correlation matrix as the input;
Sn: The sparse PCA using the robust Sn correlation matrix estimator (Rousseeuw and Croux, 1993; Maronna and Zamar, 2002) as the input;
Qn: The sparse PCA using the robust Qn correlation matrix estimator (Rousseeuw and Croux, 1993; Maronna and Zamar, 2002) as the input;
Kendall: The proposed method using the Kendall’s tau correlation matrix as the input.
Here the robust Qn and Sn correlation matrix estimates are calculated using the R package robustbase (Rousseeuw et al., 2009). We also tried the sparse robust PCA procedure proposed in Croux, Filzmoser, and Fritz (2013), implemented in the R package pcaPP. However, we found that the grid algorithm, which is used in their paper to estimate sparse eigenvectors, has convergence problems when the dimension is high, making the obtained estimator perform very poorly. Therefore, we did not include this procedure in the comparison.
5.1 Numerical Simulations
In the simulation study we sample n data points from a given meta-elliptical distribution. Here we set d = 100. We first construct Σ0 using a similar idea as in Yuan and Zhang (2013): a covariance matrix Σ is synthesized through the eigenvalue decomposition, where the first two eigenvalues are given and the corresponding eigenvectors u1 and u2 are pre-specified to be sparse. The latent generalized correlation matrix is then Σ0 := diag(Σ)−1/2 · Σ · diag(Σ)−1/2. We consider six different schemes to generate the data matrix X ∈ ℝn×d:
Scheme 1: Let x1, …, xn be n observations of X ~ Nd(0, Σ0).
Scheme 2: Let x1, …, xn be n observations of X ~ Nd(0, Σ0), but with 5% of the entries in each xi randomly selected and replaced by −5 or 5.
Scheme 3: Let x1, …, xn be n observations of X ~ NPNd(Σ0; f1, …, f1) with f1(x) = x³.
Scheme 4: Let x1, …, xn be n observations of X ~ MEd(Σ0, ξ1; f0, …, f0) with f0(x) = x and ξ1 =d ξ1*/ξ2*, where ξ1* ~ χd and ξ2* =d χκ/√κ are independent. In this setting, X follows a multivariate t distribution with κ degrees of freedom (Fang et al., 1990). Here we set κ = 3.
Scheme 5: Let x1, …, xn be n observations of X ~ MEd(Σ0, ξ2; f0, …, f0) with ξ2 ~ F(d, 1), i.e., ξ2 follows an F-distribution with degrees of freedom d and 1.
Scheme 6: Let x1, …, xn be n observations of X ~ MEd(Σ0, ξ3; f0, …, f0), where ξ3 follows an exponential distribution with rate parameter 1.
Here Schemes 1 to 3 represent three different versions of the Gaussian data: (i) The perfect Gaussian data; (ii) The Gaussian data contaminated by outliers; (iii) The Gaussian data contaminated by marginal transformations. Schemes 4-6 represent three different elliptical distributions, which are all heavy-tailed and belong to the meta-elliptical family.
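For concreteness, the following sketch generates such data; the eigenvalues and the supports of u1 and u2 are illustrative assumptions (the paper's exact choices follow Yuan and Zhang (2013)), and Scheme 4 is produced through the stochastic representation of Definition 2.1.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 100
# Illustrative eigen-structure: two sparse leading eigenvectors, plus the
# identity so that Sigma is positive definite (assumed values, not the paper's).
u1 = np.zeros(d); u1[:10] = 1 / np.sqrt(10)
u2 = np.zeros(d); u2[10:20] = 1 / np.sqrt(10)
Sigma = 5 * np.outer(u1, u1) + 2 * np.outer(u2, u2) + np.eye(d)
D = np.diag(1 / np.sqrt(np.diag(Sigma)))
Sigma0 = D @ Sigma @ D                # latent generalized correlation matrix

# Scheme 4: multivariate t with kappa = 3 degrees of freedom via X = xi * A @ U.
n, kappa = 100, 3
A = np.linalg.cholesky(Sigma0)
G = rng.standard_normal((n, d))
U = G / np.linalg.norm(G, axis=1, keepdims=True)
xi = np.sqrt(rng.chisquare(d, n)) / np.sqrt(rng.chisquare(kappa, n) / kappa)
X = xi[:, None] * (U @ A.T)
```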
For n = 50, 100, 200, we repeatedly generate the data matrix X according to Schemes 1 to 6, with 1,000 replications each. To show the feature selection performance in estimating the support set of the leading eigenvector θ1, Figure 3 plots the false positive rates against the true positive rates for the four different estimators under the different schemes.
To illustrate the parameter estimation performance, we conduct a quantitative comparison of the estimation accuracy of the four competing methods. For all methods, we fix the tuning parameter (i.e., the cardinality of the estimate’s support set) to be 10. Table 2 shows the averaged distances between the estimated leading eigenvector and θ1, with standard deviations presented in the parentheses. Here the distance between two vectors u, v ∈ 𝕊d−1 is defined as sin∠(u, v) = √(1 − (uTv)2), as in Theorem 4.2.
Table 2. Averaged distances between the estimated leading eigenvector and θ1 over 1,000 replications (standard deviations in parentheses).

| Scheme | n | Pearson | Sn | Qn | Kendall |
|---|---|---|---|---|---|
| Scheme 1 | 50 | 0.422(0.555) | 0.607(0.473) | 0.555(0.259) | 0.473(0.266) |
| | 100 | 0.121(0.158) | 0.188(0.140) | 0.158(0.110) | 0.140(0.201) |
| | 200 | 0.068(0.071) | 0.072(0.072) | 0.071(0.018) | 0.072(0.024) |
| Scheme 2 | 50 | 0.911(0.878) | 0.882(0.631) | 0.878(0.105) | 0.631(0.131) |
| | 100 | 0.806(0.715) | 0.737(0.264) | 0.715(0.169) | 0.264(0.213) |
| | 200 | 0.484(0.354) | 0.381(0.093) | 0.354(0.222) | 0.093(0.246) |
| Scheme 3 | 50 | 0.822(0.907) | 0.921(0.473) | 0.907(0.154) | 0.473(0.101) |
| | 100 | 0.562(0.700) | 0.737(0.140) | 0.700(0.214) | 0.140(0.202) |
| | 200 | 0.228(0.356) | 0.410(0.072) | 0.356(0.156) | 0.072(0.255) |
| Scheme 4 | 50 | 0.947(0.679) | 0.704(0.678) | 0.679(0.095) | 0.668(0.227) |
| | 100 | 0.910(0.247) | 0.269(0.248) | 0.247(0.157) | 0.238(0.239) |
| | 200 | 0.873(0.079) | 0.084(0.084) | 0.079(0.232) | 0.074(0.063) |
| Scheme 5 | 50 | 0.977(0.911) | 0.910(0.854) | 0.911(0.028) | 0.854(0.102) |
| | 100 | 0.976(0.718) | 0.722(0.532) | 0.718(0.028) | 0.532(0.214) |
| | 200 | 0.978(0.297) | 0.305(0.147) | 0.297(0.029) | 0.147(0.244) |
| Scheme 6 | 50 | 0.959(0.848) | 0.862(0.771) | 0.848(0.060) | 0.771(0.143) |
| | 100 | 0.931(0.548) | 0.569(0.373) | 0.548(0.108) | 0.373(0.250) |
| | 200 | 0.840(0.156) | 0.165(0.103) | 0.156(0.223) | 0.103(0.170) |
Both Figure 3 and Table 2 show that when the data are non-Gaussian but follow a meta-elliptical distribution, Kendall consistently outperforms Pearson in terms of feature selection and parameter estimation. Moreover, when the data are indeed Gaussian distributed, there is no obvious difference between Kendall and Pearson, indicating that our proposed rank-based method is a good alternative to the classical scale-invariant sparse PCA under the meta-elliptical model.
We then compare Kendall with Sn and Qn. In Scheme 1, for the Gaussian data, Kendall slightly outperforms Sn and Qn. For the data with outliers, Sn and Qn perform better than the classical sparse PCA estimates, but are not as robust as Kendall. For the different elliptical distributions explored in Schemes 4 to 6, Kendall has the best overall performance compared to Sn and Qn. The results for the non-elliptically distributed data explored in Scheme 3 show a significant difference between our proposed method and the other two robust sparse PCA approaches. In this case we are interested not in the correlation matrix of the meta-elliptically distributed data but in the latent generalized correlation matrix, which Sn and Qn fail to recover.
5.2 Equity Data Analysis
In this section we investigate the performance of the four competing methods on the equity data explored in Section 2.2. The data come from Yahoo! Finance (finance.yahoo.com). We collect the daily closing prices for J = 452 stocks that were consistently in the S&P 500 index from January 1, 2003 to January 1, 2008. This gives us altogether T = 1,257 data points, each corresponding to the vector of closing prices on a trading day. Let St,j denote the closing price of stock j on day t. We are interested in the log-return data X = [Xtj] with Xtj := log(St,j/St−1,j).
We evaluate the ability to represent the trend of the whole stock market using only a small number of stocks. To this end, we run the four competing methods on the log-return data X and obtain the top four leading eigenvectors. Here the iterative deflation method discussed in Section 3.3 is exploited, with the same tuning parameter k in each deflation step. Let Ak be the support set of the estimated leading eigenvectors obtained by one of the four methods. We define and as
where I(·) is the indicator function. In this way, we can calculate the proportion of successful matches of the market trend using the stocks in Ak as:
We visualize the result by plotting (card(Ak), ρAk) in Figure 4, which shows that Kendall summarizes the trend of the whole stock market better than the other three methods.
Moreover, we examine the stocks selected by the four competing methods. The 452 stocks are categorized into 10 Global Industry Classification Standard (GICS) sectors, including Consumer Discretionary (70 stocks), Consumer Staples (35 stocks), Energy (37 stocks), Financials (74 stocks), Health Care (46 stocks), Industrials (59 stocks), Information Technology (64 stocks), Telecommunications Services (6 stocks), Materials (29 stocks), and Utilities (32 stocks). Table 3 provides a more detailed description of these ten categories, with their numbers of stocks and abbreviations.
Table 3.

| Name | Number | Abbreviation |
|---|---|---|
| Consumer Discretionary | 70 | CD |
| Consumer Staples | 35 | CS |
| Energy | 37 | E |
| Financials | 74 | F |
| Health Care | 46 | HC |
| Industrials | 59 | I |
| Information Technology | 64 | IT |
| Telecommunications Services | 6 | TS |
| Materials | 29 | M |
| Utilities | 32 | U |
We estimate the top four leading eigenvectors using the four competing methods with the same k = 30 in each deflation step. The categories of the obtained nonzero features are presented in Table 4. We see that, in general, Kendall has the best ability to group stocks of the same category together. Therefore, Kendall provides a more interpretable result.
Table 4.

| Method | PC1 | PC2 | PC3 | PC4 |
|---|---|---|---|---|
| Pearson | 29F, 1I | 6CD, 5F, 8I, 1IT, 10M | 8F, 2E, 3M, 17U | 8CD, 1F, 1I, 20IT |
| Sn | 29F, 1I | 2CD, 2F, 12I, 14M | 3I, 27IT | 3F, 27U |
| Qn | 29F, 1I | 2CD, 2F, 12I, 1IT, 13M | 2I, 28IT | 3F, 27U |
| Kendall | 30F | 15I, 15M | 10CD, 10F, 10I | 3I, 27IT |
6 Discussion
We propose a new scale-invariant sparse principal component analysis method for high dimensional meta-elliptical data. Our estimator is semiparametric but achieves a fast rate of convergence in parameter estimation, and it is robust to both modeling assumptions and data contamination. Therefore, the new estimator can be a good alternative to the classical sparse PCA method.
Although the rank-based Kendall’s tau statistic has been exploited for principal component analysis in low dimensions (see, for example, Croux et al. (2002)), our work is fundamentally different from the existing literature. The main differences can be elaborated in three aspects: (i) We generalize the Kendall’s tau statistic to high dimensions, while the current literature focuses only on low dimensional settings; (ii) Our theoretical analysis is fundamentally different from the previous low dimensional analyses, which exploit classical semiparametric theory under which the dimension d is usually fixed; (iii) Most existing methods and theories are built upon the Gaussian or elliptical model, while we consider the meta-elliptical model.
There is another line of research exploiting robust (sparse) PCA (see, for example, Maronna and Zamar (2002) and Croux et al. (2013)). The empirical comparisons conducted in this paper indicate that, within the meta-elliptical family, the proposed rank-based method can be more efficient in parameter estimation and feature selection than these robust procedures. Moreover, our proposed method achieves a nearly parametric rate of convergence in parameter estimation, while, to the best of our knowledge, the performance of these robust sparse PCA procedures in high dimensions is mostly unknown.
Vu and Lei (2012) and Ma (2013) considered sparse principal component analysis and studied the rates of convergence under various modeling and sparsity assumptions. Our method is different from theirs in two aspects: (i) Their analyses rely heavily on the Gaussian or sub-Gaussian assumption, which no longer holds under the meta-elliptical model; (ii) They exploit the Pearson sample covariance or correlation matrix as the algorithm input, while we advocate the use of the Kendall’s tau correlation matrix in the meta-elliptical model.
Liu et al. (2012) and Xue and Zou (2012) proposed a procedure called the nonparanormal SKEPTIC, which exploits the nonparanormal family for graph estimation. The nonparanormal SKEPTIC also adopts rank-based methods in high dimensions. Our method is different from theirs in three aspects: (i) We advocate the use of the meta-elliptical family, of which the nonparanormal is a subset; (ii) We advocate the use of the Kendall’s tau statistic, which is adaptive over the whole meta-elliptical family, instead of the Spearman’s rho statistic; (iii) Their focus is on graph estimation; in contrast, this paper focuses on principal component analysis. In a preliminary version of this work (Han and Liu, 2012), we mainly focused on estimating the first leading eigenvector of the latent generalized correlation matrix by directly solving Equation (3.4), which is practically intractable. In contrast, here we exploit a computationally feasible procedure (the truncated power method) for scale-invariant sparse PCA and provide a theoretical guarantee of convergence for this algorithm. Moreover, our method estimates the latent principal components, which are crucial in practical applications, and we provide the theoretical analysis of convergence for the corresponding estimators.
For the principal component estimation algorithm in Section 4.3, when Qg is unknown, we could estimate f1, …, fd using the following method:

1. Test whether the original data are elliptically distributed using some existing techniques (Li et al., 1997; Huffer and Park, 2007; Sakhanenko, 2008). If yes, we set f̂j(x) := (x − μ̂j)/σ̂j, where μ̂j and σ̂j are the marginal sample mean and standard deviation of the j-th entry.
2. If not, we construct a candidate set Π of marginal distribution functions.
3. For each Qg ∈ Π, we calculate f̂1, …, f̂d using Equation (4.3).
4. We transform the data using f̂1, …, f̂d.
5. We test whether the transformed data are elliptically distributed using the techniques exploited in step 1.

We iterate steps 3-5 until we cannot reject the null hypothesis in step 5 for some Qg. This is a heuristic method whose theoretical justification is left for future investigation. Other future directions include analyzing the robustness of the method to more noisy and dependent data.
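As a schematic illustration only, the sketch below wires these steps together; here is_elliptical stands in for one of the cited ellipticity tests (Li et al., 1997; Huffer and Park, 2007; Sakhanenko, 2008), the candidate set Pi is a hypothetical placeholder, and fit_marginal_transform is the sketch from Section 4.3.

```python
import numpy as np

def estimate_transforms(X, Pi, is_elliptical):
    """Heuristic of Section 6. `is_elliptical(X) -> bool` is a stand-in for an
    elliptical-symmetry test and `Pi` is a candidate list of marginal
    distributions (e.g. scipy.stats.norm, scipy.stats.t(5)); both are
    hypothetical placeholders, not part of the paper."""
    n, d = X.shape
    if is_elliptical(X):                                      # step 1
        mu, sd = X.mean(axis=0), X.std(axis=0)
        return [lambda x, m=m, s=s: (x - m) / s for m, s in zip(mu, sd)]
    for Qg in Pi:                                             # steps 2-5
        f_hats = [fit_marginal_transform(X[:, j], Qg.ppf) for j in range(d)]
        Z = np.column_stack([f(X[:, j]) for j, f in enumerate(f_hats)])
        if is_elliptical(Z):   # stop at the first Qg that is not rejected
            return f_hats
    return None                # no candidate passed the test
```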
Acknowledgement
We thank the associate editor and two anonymous reviewers for their very helpful and constructive comments and suggestions. Fang Han’s research is supported by a Google fellowship in statistics. Han Liu’s research is supported by NSF Grants III-1116730 and III-1332109, an NIH sub-award, and an FDA sub-award from Johns Hopkins University.
A Appendix
A.1 Properties of the Elliptical Distribution
The next proposition provides two alternative ways to characterize an elliptical distribution; the proofs can be found in Fang et al. (1990).
Proposition A.1 (Fang et al. (1990)). A random vector Z = (Z1, …, Zd)T satisfies Z ~ ECd(μ, Σ, ξ) if and only if Z has the characteristic function exp(itTμ)φ(tTΣt) for t ∈ ℝd, where φ is a properly-defined characteristic function. We denote Z ~ ECd(μ, Σ, φ) in this setting. If ξ is absolutely continuous and Σ is non-singular, then the density of Z exists and is of the form:

pZ(z) = |Σ|−1/2 g((z − μ)TΣ−1(z − μ)),

where g : [0, ∞) → [0, ∞). We denote Z ~ ECd(μ, Σ, g). Here each of ξ, φ, and g uniquely determines the others.
The next proposition provides three important properties of the elliptical distribution.
Proposition A.2. If a random vector Z is elliptically distributed, we have:
For any v ∈ ℝq and any matrix B ∈ ℝd×q, v + BTZ is elliptically distributed. In particular, if Z ~ ECd(μ, Σ, ξ), then v + BTZ ~ ECq(v + BTμ, BTΣB, ξ).
Let Z ~ ECd(μ, Σ, ξ) and Σ0 be the generalized correlation matrix of Z. If rank(Σ) = q and E(ξ2) < ∞, then E(Z) = μ, Cov(Z) = (E(ξ2)/q) · Σ, and Cor(Z) = Σ0.
If Z ~ ECd(0,Σ,φ) with diag(Σ) = Id, then the marginal distributions of Z are the same.
Proof. The proofs of the first two assertions can be found in Fang et al. (1990). To prove the third assertion, we use Proposition A.1 to obtain the characteristic function of Zj = ejTZ for 1 ≤ j ≤ d, namely E exp(itZj) = φ(t(ejTΣej)t) = φ(t2) for any t ∈ ℝ, where ej is the j-th canonical basis vector in ℝd. Since diag(Σ) = Id, this characteristic function does not depend on j, and the result follows from the one-to-one map between characteristic functions and distributions.
A.2 Proof of Theorem 4.1
Proof. Realizing that τ̂jk is a second-order U-statistic whose kernel sign(xij − xi′j) sign(xik − xi′k) is bounded in [−1, 1], using Equation (5.7) in Hoeffding (1963), we have

P(|τ̂jk − τ(Xj, Xk)| ≥ t) ≤ 2 exp(−⌊n/2⌋ t2/2).

Therefore, since |sin((π/2)a) − sin((π/2)b)| ≤ (π/2)|a − b|, we have

P(|R̂jk − Σ0jk| ≥ (π/2)t) ≤ 2 exp(−⌊n/2⌋ t2/2).

Taking the union bound over the d(d − 1)/2 pairs (j, k), setting t = 6√(log d / n), and using ⌊n/2⌋ ≥ n/3 for n ≥ 2, we have

P(maxjk |R̂jk − Σ0jk| ≥ 3π √(log d / n)) ≤ d2 exp(−6 log d) ≤ d−5/2.

This completes the proof.
A.3 Proof of Theorem 4.2
Proof. Under the model ℳd(s), we define λj := Λj(Σ0) and θj to be the corresponding eigenvector for j = 1, …, d. We then define E := R̂ − Σ0 and let ε := sin∠(θ̂1, θ1). For all v ∈ 𝕊d−1, we have

vTΣ0v ≤ λ1(vTθ1)2 + λ2(1 − (vTθ1)2)   (A.1)

and

θ1TΣ0θ1 = λ1.

Moreover, by definition,

sin2∠(v, θ1) = 1 − (vTθ1)2.   (A.2)

Combining Equation (A.1) with Equation (A.2), we have

λ1 − vTΣ0v ≥ (λ1 − λ2) sin2∠(v, θ1).

Therefore, letting θ̂1 be the global optimum to Equation (3.4), we have

(λ1 − λ2)ε2 ≤ θ1TΣ0θ1 − θ̂1TΣ0θ̂1 ≤ θ̂1TEθ̂1 − θ1TEθ1 = 〈E, θ̂1θ̂1T − θ1θ1T〉.   (A.3)

The last inequality holds because θ1 is feasible in the optimization constraint in (3.4), implying that

θ̂1TR̂θ̂1 ≥ θ1TR̂θ1.

Therefore, using Equation (A.3),

(λ1 − λ2)ε2 ≤ 〈E, θ̂1θ̂1T − θ1θ1T〉 ≤ ||E||max ||θ̂1θ̂1T − θ1θ1T||1,   (A.4)

where the last inequality is by the Hölder inequality and ||M||1 := Σjk |Mjk| denotes the elementwise ℓ1 norm. Letting Θ := supp(θ1) ∪ supp(θ̂1), we have

card(Θ) ≤ s + k,

implying that

||θ̂1θ̂1T − θ1θ1T||1 ≤ (s + k) ||θ̂1θ̂1T − θ1θ1T||F = √2 (s + k) ε,   (A.5)

where ||·||F is the Frobenius norm. Therefore, combining Equation (A.4) with Equation (A.5), we get

(λ1 − λ2)ε2 ≤ √2 (s + k) ||E||max ε,

which is equivalent to saying that

ε ≤ √2 (s + k) ||E||max / (λ1 − λ2).

Using Theorem 4.1, we have, with probability at least 1 − d−5/2,

sin∠(θ̂1, θ1) ≤ (3√2 π (s + k) / (λ1 − λ2)) · √(log d / n).

This completes the proof.
A.4 Proof of Corollary 4.4
Proof. Without loss of generality, we may assume that θ̂1Tθ1 ≥ 0, because otherwise we can simply conduct appropriate sign changes in the proof. We first note that card(T) = s ≤ k. Suppose, for the sake of contradiction, that T ⊄ T̂; then there exists some j ∈ T such that θ̂1j = 0. This implies that

||θ̂1 − θ1||2 ≥ |θ1j| ≥ minj∈T |θ1j|.

We then have

2 sin2∠(θ̂1, θ1) = 2(1 − (θ̂1Tθ1)2) ≥ 2(1 − θ̂1Tθ1) = ||θ̂1 − θ1||22,

implying that

sin∠(θ̂1, θ1) ≥ minj∈T |θ1j| / √2 > (3√2 π (s + k) / (λ1 − λ2)) · √(log d / n).   (A.6)

This contradicts Theorem 4.2 on an event with probability at least 1 − d−5/2, and therefore on this event

T ⊆ T̂.   (A.7)

This completes the proof.
A.5 Proof of Theorem 4.7
Proof. Without loss of generality, we assume that θ̂1Tθ1 ≥ 0. For some 0 < b < 1, we define the event Mn as
Thus, conditioning on Mn, using Theorem 4.6, we have
(A.8) |
Since fj(xj) ~ N(0, 1), using Mill’s inequality, we have
(A.9) |
Finally, using Mill’s inequality again, we have
(A.10) |
Combining Equations (A.8), (A.9), and (A.10), we have the desired result.
References
- Amini A, Wainwright M. High-dimensional analysis of semidefinite relaxations for sparse principal components. The Annals of Statistics. 2009;37(5B):2877–2921.
- Anderson TW. An Introduction to Multivariate Statistical Analysis. Wiley; 1958.
- Berthet Q, Rigollet P. Optimal detection of sparse principal components in high dimension. Forthcoming in The Annals of Statistics. 2012.
- Bickel P, Levina E. Regularized estimation of large covariance matrices. The Annals of Statistics. 2008;36(1):199–227.
- Bühlmann P, van de Geer S. Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer; 2011.
- Chatfield C, Collins A. Introduction to Multivariate Analysis. Vol. 166. Chapman & Hall; 1980.
- Choi K, Marden J. A multivariate version of Kendall’s τ. Journal of Nonparametric Statistics. 1998;9(3):261–293.
- Croux C, Dehon C. Influence functions of the Spearman and Kendall correlation measures. Statistical Methods & Applications. 2010;19(4):497–515.
- Croux C, Filzmoser P, Fritz H. Robust sparse principal component analysis. Technometrics. 2013;55(2):202–214.
- Croux C, Haesbroeck G. Principal component analysis based on robust estimators of the covariance or correlation matrix: influence functions and efficiencies. Biometrika. 2000;87(3):603–618.
- Croux C, Ollila E, Oja H. Sign and rank covariance matrices: statistical properties and application to principal components analysis. Statistics in Industry and Technology. 2002:257–269.
- Croux C, Ruiz-Gazen A. High breakdown estimators for principal components: the projection-pursuit approach revisited. Journal of Multivariate Analysis. 2005;95(1):206–226.
- d’Aspremont A, El Ghaoui L, Jordan M, Lanckriet G. A direct formulation for sparse PCA using semidefinite programming. SIAM Review. 2004;49(3):434–448.
- Davies P. Asymptotic behaviour of S-estimates of multivariate location parameters and dispersion matrices. The Annals of Statistics. 1987;15(3):1269–1292.
- Fang H, Fang K, Kotz S. The meta-elliptical distributions with given marginals. Journal of Multivariate Analysis. 2002;82(1):1–16.
- Fang K, Kotz S, Ng K. Symmetric Multivariate and Related Distributions. Chapman & Hall; 1990.
- Gibbons JD, Chakraborti S. Nonparametric Statistical Inference. Vol. 168. CRC Press; 2003.
- Gnanadesikan R, Kettenring JR. Robust estimates, residuals, and outlier detection with multiresponse data. Biometrics. 1972;28(1):81–124.
- Hallin M, Paindaveine D, Verdebout T. Optimal rank-based testing for principal components. The Annals of Statistics. 2010;38(6):3245–3299.
- Hampel FR. The influence curve and its role in robust estimation. Journal of the American Statistical Association. 1974;69(346):383–393.
- Han F, Liu H. Transelliptical component analysis. Advances in Neural Information Processing Systems. 2012;25:368–376.
- Han F, Zhao T, Liu H. CODA: High dimensional copula discriminant analysis. Journal of Machine Learning Research. 2013;14:629–671.
- Hoeffding W. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association. 1963;58(301):13–30.
- Huber PJ, Ronchetti E. Robust Statistics. 2nd edition. Wiley; 2009.
- Hubert M, Rousseeuw PJ, Verboven S. A fast method for robust principal components with applications to chemometrics. Chemometrics and Intelligent Laboratory Systems. 2002;60(1):101–111.
- Huffer FW, Park C. A test for elliptical symmetry. Journal of Multivariate Analysis. 2007;98(2):256–281.
- Jackson D, Chen Y. Robust principal component analysis and outlier detection with ecological data. Environmetrics. 2004;15(2):129–139.
- Johnstone I, Lu A. On consistency and sparsity for principal components analysis in high dimensions. Journal of the American Statistical Association. 2009;104(486):682–693. doi: 10.1198/jasa.2009.0121.
- Journée M, Nesterov Y, Richtárik P, Sepulchre R. Generalized power method for sparse principal component analysis. Journal of Machine Learning Research. 2010;11:517–553.
- Kendall MG. Rank Correlation Methods. Griffin; 1948.
- Kruskal W. Ordinal measures of association. Journal of the American Statistical Association. 1958;53(284):814–861.
- Li R, Fang K, Zhu L. Some Q-Q probability plots to test spherical and elliptical symmetry. Journal of Computational and Graphical Statistics. 1997;6(4):435–450.
- Lindskog F, McNeil A, Schmock U. Kendall’s tau for elliptical distributions. Springer; 2003.
- Liu H, Han F, Yuan M, Lafferty J, Wasserman L. High dimensional semiparametric Gaussian copula graphical models. The Annals of Statistics. 2012;40(4):2293–2326.
- Ma Z. Sparse principal component analysis and iterative thresholding. Forthcoming in The Annals of Statistics. 2013.
- Mackey L. Deflation methods for sparse PCA. Advances in Neural Information Processing Systems. 2009;21:1017–1024.
- Marden J. Some robust estimates of principal components. Statistics & Probability Letters. 1999;43(4):349–359.
- Maronna RA. Robust M-estimators of multivariate location and scatter. The Annals of Statistics. 1976;4(1):51–67.
- Maronna RA, Zamar RH. Robust estimates of location and dispersion for high-dimensional datasets. Technometrics. 2002;44(4):307–317.
- Möttönen J, Oja H. Multivariate spatial sign and rank methods. Journal of Nonparametric Statistics. 1995;5(2):201–213.
- Oja H. Multivariate Nonparametric Methods with R: An Approach Based on Spatial Signs and Ranks. Vol. 199. Springer; 2010.
- Paul D, Johnstone I. Augmented sparse principal component analysis for high dimensional data. arXiv preprint arXiv:1202.1242. 2012.
- Puri ML, Sen PK. Nonparametric Methods in Multivariate Analysis. Wiley; 1971.
- Rousseeuw P, Croux C, Todorov V, Ruckstuhl A, Salibian-Barrera M, Verbeke T, Maechler M. robustbase: basic robust statistics. R package. 2009. URL http://CRAN.R-project.org/package=robustbase.
- Rousseeuw PJ, Croux C. Alternatives to the median absolute deviation. Journal of the American Statistical Association. 1993;88(424):1273–1283.
- Sakhanenko L. Testing for ellipsoidal symmetry: A comparison study. Computational Statistics & Data Analysis. 2008;53(2):565–581.
- Shen H, Huang J. Sparse principal component analysis via regularized low rank matrix approximation. Journal of Multivariate Analysis. 2008;99(6):1015–1034.
- Visuri S, Koivunen V, Oja H. Sign and rank covariance matrices. Journal of Statistical Planning and Inference. 2000;91(2):557–575.
- Vu V, Lei J. Minimax rates of estimation for sparse PCA in high dimensions. International Conference on Artificial Intelligence and Statistics (AISTATS). 2012;15:1278–1286.
- Witten D, Tibshirani R, Hastie T. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics. 2009;10(3):515–534. doi: 10.1093/biostatistics/kxp008.
- Xue L, Zou H. Regularized rank-based estimation of high-dimensional nonparanormal graphical models. The Annals of Statistics. 2012;40(5):2541–2571.
- Yuan X, Zhang T. Truncated power method for sparse eigenvalue problems. Journal of Machine Learning Research. 2013;14:899–925.
- Zhang Y, El Ghaoui L. Large-scale sparse principal component analysis with application to text data. Advances in Neural Information Processing Systems. 2011;24.
- Zou H, Hastie T, Tibshirani R. Sparse principal component analysis. Journal of Computational and Graphical Statistics. 2006;15(2):265–286.