A High-Dimensional Nonparametric Multivariate Test for Mean Vector

Lan Wang; Bo Peng; Runze Li

doi:10.1080/01621459.2014.988215

Author manuscript; available in PMC: 2017 Jan 15.

Published in final edited form as: J Am Stat Assoc. 2016 Jan 15;110(512):1658–1669. doi: 10.1080/01621459.2014.988215

Lan Wang ¹, Bo Peng ², Runze Li ³

PMCID: PMC4734767 NIHMSID: NIHMS651394 PMID: 26848205

Abstract

This work is concerned with testing the population mean vector of nonnormal high-dimensional multivariate data. Several tests for high-dimensional mean vector, based on modifying the classical Hotelling T² test, have been proposed in the literature. Despite their usefulness, they tend to have unsatisfactory power performance for heavy-tailed multivariate data, which frequently arise in genomics and quantitative finance. This paper proposes a novel high-dimensional nonparametric test for the population mean vector for a general class of multivariate distributions. With the aid of new tools in modern probability theory, we proved that the limiting null distribution of the proposed test is normal under mild conditions when p is substantially larger than n. We further study the local power of the proposed test and compare its relative efficiency with a modified Hotelling T² test for high-dimensional data. An interesting finding is that the newly proposed test can have even more substantial power gain with large p than the traditional nonparametric multivariate test does with finite fixed p. We study the finite sample performance of the proposed test via Monte Carlo simulations. We further illustrate its application by an empirical analysis of a genomics data set.

Keywords: Asymptotic relative efficiency, High dimensional multivariate data, Hotelling T² test, Nonparametric multivariate test

1 Introduction

Let X₁, …, X_n be independent and identically distributed (iid) p-dimensional random vectors from the model X_i = μ + ε_i where ε_i is the random error to be specified later. In this paper, we consider a novel nonparametric procedure for testing the hypothesis

H_{0} : μ = 0 versus H_{1} : μ \neq 0,

(1)

when p is potentially much larger than n. Here and throughout this paper, p stands for the number of variables (or features) of the data, and n for the sample size.

The above testing problem is motivated by recent advances in genomics. There is growing evidence that most biological processes involve the regulation of multiple genes, and that analyses focusing on individual genes often suffer from low power to detect important genetic variation and from poor reproducibility (Vo et al., 2007). As a result, increasing attention has been focused on the analysis of gene sets/pathways, which are groups of genes sharing common biological functions, chromosomal locations or regulations. In some important applications, the problem of evaluating whether a group of genes is differentially expressed can be formulated as the hypothesis in (1), where X_i represents a vector of summary statistics computed on each of the p genes, such as the log-intensity ratios of the red over green channels, or the log ratios of the gene expression levels between control and treatment chips (or before and after drug treatment). For example, the data set we analyzed in Section 3.2 contains microarray measurements from diabetic patients before and after insulin treatment (Wu et al., 2007, 2011).

Testing the hypothesis in (1) becomes very challenging for high-dimensional data. The traditional Hotelling’s T² test is not well defined, as the inverse of the sample covariance matrix may not exist when p is larger than n. It has been observed in Bai and Saranadasa (1996) that the power of Hotelling’s T² test can be adversely affected even when p < n, if the sample covariance matrix is nearly singular; see also Pan and Zhou (2011). Recently, there has been great interest in extending Hotelling’s test to the p > n setting; see Bai and Saranadasa (1996, p/n → c ∈ (0,1)), Srivastava and Du (2008, n = O(p^δ) for some 1/2 < δ ≤ 1), Srivastava (2009, n = O(p^δ) for some 0 < δ ≤ 1), Lee et al. (2012, p/n → c > 0), Srivastava et al. (2013, n = O(p^δ), δ > 1/2), and Chen and Qin (2010, Tr(Σ⁴) = o(Tr² (Σ²))). Thulin (2014) proposed a more computing-intensive extension by combining Hotelling’s tests from a large number of lower-dimensional random subspaces. A shared drawback of the aforementioned tests is that they tend to have unsatisfactory power performance when the multivariate distribution is heavy-tailed, and they are very sensitive to outlying observations.

In many microarray experiments, most genes are expressed at very low levels, and few genes are expressed at high levels. The distribution of intensities tends to be nonnormal even after log transformation, regardless of the normalization methods (e.g., Purdom and Holmes, 2005). For the data example in Section 3.2, it is observed that the marginal distributions of the microarray expressions are nonnormal and have heavy tails based on values of their marginal kurtoses. Furthermore, in microarray experiments, outliers frequently arise due to array chip artifacts, such as uneven spray of reagents within arrays, and other reasons. This motivates us to develop a nonparametric test for the high-dimensional population mean vector or location parameter without the multivariate normality assumption.

We propose a new test for hypothesis (1) based on spatial signs of the observations, and further study its asymptotic theory. Compared with the extensions of Hotelling’s T² test (Chen and Qin, 2010), the theory for the nonparametric test with p > n is considerably more challenging. To derive the asymptotic theory, we employ new probability tools on the concentration properties of certain quadratic forms, which may be of independent interest and have potential applications in developing the theory for other related high-dimensional nonparametric procedures. The proposed nonparametric test has several appealing properties. First, it is directly applicable to the setting with p > n, and it is computationally simple. Second, the new test is shown to lose little efficiency when the underlying data are multivariate normal and to have potentially significant efficiency gain for heavy-tailed multivariate distributions. This is verified by deriving its asymptotic relative efficiency. In our Monte Carlo simulations, significant efficiency gain is achieved at small or moderate sample sizes.

Nonparametric statistical procedures have been explored little in the high-dimensional setting. An open question is whether their power advantage continues to hold (and if so, to what extent) in high dimensions. This work takes a substantial step towards understanding the merits of nonparametric procedures when p > n by providing both theoretical justification and numerical evidence. Our theoretical analysis reveals a striking phenomenon: the efficiency gain of the new nonparametric test in the high-dimensional setting can be more substantial than the well-known efficiency gain of traditional nonparametric tests in the “classical” framework where p is fixed and n goes to infinity. For example, consider the p-dimensional multivariate t-distribution with 3 degrees of freedom, which is heavy-tailed. For this distribution, it is well known that the asymptotic relative efficiency of the spatial sign test versus Hotelling’s T² test is 1.9 for p = 1, 2.02 for p = 3, and 2.09 for p = 10. This implies an increasing trend as the dimension p increases. The theory established in this paper suggests that when p > n, the asymptotic relative efficiency of the proposed new nonparametric test versus Chen and Qin’s extension of Hotelling’s T² test is about 2.54. This result provides strong support for the usefulness of nonparametric tests in high-dimensional problems.

It is worth noting that we do not impose structural constraints, such as sparsity, on the alternative hypothesis. Hence, the test allows for a dense alternative, where many components of the vector contribute to the signal. In fact, this is one of the main motivations for gene set analysis. For many complex diseases, such as depression and diabetes, evidence from the medical literature suggests that many of the genes from a biological pathway contribute small signals which are hard to detect individually. Cook et al. (2012) discussed other applications of a similar nature, where sparsity may not be the reality. In the simulations, we demonstrate that the test based on marginal p-values with Bonferroni or FDR correction may have low power to detect the global signal. On the other hand, in some other applications involving high-dimensional testing, there may be reasons to believe the alternative is sparse, in which case the existing tests can be further tuned to increase the power performance; see the recent work by Hall and Jin (2010), Zhong, Chen and Xu (2013) and Cai, Liu and Xia (2014), among others. It is noted that these tests use sample means as basic building blocks and hence are expected to suffer from power loss for heavy-tailed multivariate data. The new test we propose has the potential to be extended to the sparse alternative setting with the promise of improved power performance.

We introduce the high-dimensional nonparametric test in Section 2.1. We derive its limiting null distribution under a set of weak conditions in Section 2.2, and investigate its power performance under local alternatives and study the asymptotic relative efficiency in Section 2.3. Some important extensions are discussed in Section 2.4. We conduct Monte Carlo simulations and analyze the gene sets from a genomics study in Section 3. Section 4 concludes the paper and discusses relevant issues. Technical proofs are given in the Appendix. The Supplemental Material includes additional technical and numerical results.

2 A high-dimensional nonparametric test

We first focus on the case where the random vector X_i follows an elliptical distribution. Extensions beyond the elliptical distribution family are discussed in Section 2.4.

The class of elliptical distributions encompasses many useful non-Gaussian multivariate distributions such as the multivariate t distribution, multivariate logistic distribution, Kotz-type multivariate distribution, Pearson II type multivariate distribution and many others. The family of elliptical distributions is well studied in the statistical literature (e.g., Fang, Kotz and Ng, 1990). Recently, this family has become important for modeling finance data (McNeil, Frey and Embrechts, 2005) due to its potential to accommodate tail dependence (the phenomenon of simultaneous extremes), which is important in quantitative finance but is not allowed by the multivariate normal distribution (Schmidt, 2002).

An elliptically distributed random vector X_i has the following convenient stochastic representation:

X_{i} = μ + ε_{i}, and ε_{i} = Γ R_{i} U_{i},

(2)

where Γ is a p × p matrix, U_i is a random vector uniformly distributed on the unit sphere in ℝ^p, and R_i is a nonnegative random variable independent of U_i. The distribution of X_i depends on Γ only through ΓΓ^T (Fang, Kotz and Ng, 1989). Thus, we denote Ω = ΓΓ^T for easy future reference. An important special case of (2) is the multivariate normal distribution with mean μ and covariance matrix Σ, for which $R_{i}^{2}$ has a chi-square distribution with p degrees of freedom and Ω = Σ. In general, X_i’s covariance matrix Σ is related to Ω by $Σ = p^{- 1} E (R_{i}^{2}) Ω$ .
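As an illustration of representation (2), the following sketch draws elliptical vectors using only the Python standard library. The function names and the two radius generators are choices made here for illustration and are not part of the paper. The multivariate t radius relies on the standard fact that R² can be written as a χ²_p variable divided by an independent χ²_ν/ν variable.

```python
import math
import random


def sample_unit_sphere(p, rng):
    """Draw U uniformly on the unit sphere in R^p by normalizing a Gaussian vector."""
    g = [rng.gauss(0.0, 1.0) for _ in range(p)]
    norm = math.sqrt(sum(v * v for v in g))
    return [v / norm for v in g]


def sample_elliptical(mu, gamma, draw_radius, rng):
    """Return one draw X = mu + Gamma * R * U, following representation (2)."""
    p = len(mu)
    u = sample_unit_sphere(p, rng)
    r = draw_radius(rng)  # nonnegative radius, drawn independently of U
    return [mu[a] + r * sum(gamma[a][b] * u[b] for b in range(p)) for a in range(p)]


def normal_radius(p):
    """Multivariate normal case: R^2 follows a chi-square law with p degrees of freedom."""
    # chi-square(k) is Gamma(shape=k/2, scale=2)
    return lambda rng: math.sqrt(rng.gammavariate(p / 2.0, 2.0))


def t_radius(p, nu):
    """Multivariate t case (assumed construction): R^2 = chi2_p / (chi2_nu / nu)."""
    def draw(rng):
        num = rng.gammavariate(p / 2.0, 2.0)
        den = rng.gammavariate(nu / 2.0, 2.0) / nu
        return math.sqrt(num / den)
    return draw
```

Passing `normal_radius(p)` with Γ equal to a square root of Σ gives N_p(μ, Σ) draws, while `t_radius(p, 3)` gives the heavy-tailed case used later in the power comparisons.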

2.1 The test statistic

Our test statistic T_n is based on the spatial sign function of the observed data. The spatial sign function of X_i is defined as $Z_{i} = \frac{X_{i}}{‖ X_{i} ‖}$ if X_i ≠ 0; and Z_i = 0 if X_i = 0, where ‖X_i‖ denotes the L₂ norm of X_i. The spatial sign vector is simply the unit vector in the direction of X_i. In the univariate case, it reduces to the familiar sign function.

We propose the following new nonparametric test statistic:

T_{n} = \sum_{i = 1}^{n} \sum_{j = 1}^{i - 1} Z_{i}^{T} Z_{j},

(3)

which is indeed a U-statistic. Under H₀, E (Z_i) = 0, which implies E(T_n) = 0. The above test statistic has an intuitive connection with the work of Bai and Saranadasa (1996) and Chen and Qin (2010), particularly the latter. To see this, we note that the test statistic of Bai and Saranadasa (1996) for testing (1) is based on ${‖ \bar{X} ‖}^{2}$ , while the one of Chen and Qin (2010) is based on $\sum_{i = 1}^{n} \sum_{j = 1, j \neq i}^{n} X_{i}^{T} X_{j}$ . By removing the diagonal elements in the statistic of Bai and Saranadasa (1996), Chen and Qin (2010) were able to considerably relax the restrictive condition on p and n. In this spirit, we also drop the diagonal elements in defining T_n. Our test statistic can hence be regarded as a nonparametric extension of Chen and Qin (2010).
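The statistic in (3) is cheap to evaluate because the double sum over pairs equals half of ‖ΣZ_i‖² minus Σ‖Z_i‖². The sketch below (function names are illustrative, not from the paper) computes T_n this way and also by the naive double loop, so the two can be compared.

```python
import math


def spatial_sign(x):
    """Unit vector in the direction of x; the zero vector maps to zero."""
    norm = math.sqrt(sum(v * v for v in x))
    if norm == 0.0:
        return [0.0] * len(x)
    return [v / norm for v in x]


def t_statistic(data):
    """T_n = sum_{i<j} Z_i^T Z_j, computed as (||sum Z_i||^2 - sum ||Z_i||^2) / 2."""
    signs = [spatial_sign(x) for x in data]
    p = len(data[0])
    total = [sum(z[a] for z in signs) for a in range(p)]
    sq_norm_of_sum = sum(v * v for v in total)
    sum_of_sq_norms = sum(sum(v * v for v in z) for z in signs)
    return 0.5 * (sq_norm_of_sum - sum_of_sq_norms)


def t_statistic_naive(data):
    """Same quantity by the explicit double sum in (3); O(n^2 p) work."""
    signs = [spatial_sign(x) for x in data]
    return sum(
        sum(a * b for a, b in zip(signs[i], signs[j]))
        for i in range(len(signs))
        for j in range(i)
    )
```

The first version needs only one pass over the data, which matters when p is in the thousands.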

From another perspective, the new test generalizes the multivariate spatial sign test (e.g., Brown, 1983; Chaudhuri, 1992; Möttönen and Oja, 1995) to the high-dimensional setting. In the classical setting of p < n, Möttönen, Oja and Tienari (1997) derived the asymptotic relative efficiency (ARE) of the spatial sign test versus Hotelling’s T² test and established its theoretical advantage for heavy-tailed distributions. For example, when the underlying distribution is a 10-dimensional t distribution with ν degrees of freedom, the ARE of the spatial sign test versus Hotelling’s T² test is 2.42 when ν = 3, and is 0.95 when ν = ∞ (multivariate normality). However, like Hotelling’s T² test, the multivariate spatial sign test is not defined when p > n. It is an open question whether we can modify it in such a way that its efficiency advantage is preserved in the high-dimensional setting. This paper provides an affirmative answer.

Remark 1

It is interesting to compare with Bai and Saranadasa (1996) and Chen and Qin (2010), both of which adopt a factor model structure and a type of pseudo-independence assumption. It is noted that their model assumption excludes some commonly used multivariate distributions such as the multivariate t distribution. However, we can show that Chen and Qin’s test remains valid for the multivariate t distribution (see the Supplementary Material), but it could suffer from substantial power loss. In Section 2.4, we also extend the new test to some important models in Chen and Qin’s class.

2.2 The limiting null distribution

Despite the simple form of T_n, deriving its asymptotic distribution when p > n is by no means straightforward. As for any other high-dimensional inference, the most challenging issue lies in characterizing the underlying conditions for the asymptotic theory. In Bai and Saranadasa (1996) and Chen and Qin (2010), the key condition is stated through the behavior of the population covariance matrix Σ = Cov(X_i). In Bai and Saranadasa (1996), it is assumed that $λ_{\max} (Σ) = o \{\sqrt{{Tr}^{2} (Σ^{2})}\}$ , where λ_max(·) denotes the largest eigenvalue of a matrix and Tr(·) denotes the trace. In Chen and Qin (2010), it is assumed that Tr(Σ⁴) = o{Tr² (Σ²)}, which is satisfied under quite relaxed conditions on the eigenvalues of Σ. For the nonparametric test T_n, it is desirable to characterize the underlying conditions in a similar fashion. However, this is challenging as the building blocks of T_n are the transformations Z_i’s, which are not directly related to Σ.

In deriving the asymptotic properties of T_n, moment conditions directly related to Z_i’s naturally arise. Lemma 2.1 below plays an important role in this paper. It establishes some of the key properties of the moments of Z_i’s under a set of relaxed conditions on Σ. More specifically, we impose the following two conditions:

Tr (Σ^{4}) = o \{{Tr}^{2} (Σ^{2})\} .

(C1)

\frac{{Tr}^{4} (Σ)}{{Tr}^{2} (Σ^{2})} \exp \left\{- \frac{{Tr}^{2} (Σ)}{128 p λ_{\max}^{2} (Σ)}\right\} = o (1) .

(C2)

Lemma 2.1

Suppose that conditions (C1) and (C2) hold. Let $B = E (\frac{ε_{i} ε_{i}^{T}}{{‖ ε_{i} ‖}^{2}})$ . Then under H₀,

E \{{(Z_{1}^{T} Z_{2})}^{4}\} = O (1) E^{2} \{{(Z_{1}^{T} Z_{2})}^{2}\},

(4)

E \{{(Z_{1}^{T} B Z_{1})}^{2}\} = O (1) E^{2} (Z_{1}^{T} B Z_{1}),

(5)

E \{{(Z_{1}^{T} B Z_{2})}^{2}\} = o (1) E^{2} (Z_{1}^{T} B Z_{1}) .

(6)

The above result is established by using a recent probability tool developed by El Karoui (2009) on the concentration inequality for the quadratic form of a random vector that has a uniform distribution on the unit sphere of ℝ^p.

Some intuition on T_n’s asymptotic behavior under H₀ can be gained by observing its first two moments. First, it is evident that E(T_n) = 0. To calculate its variance, we write $T_{n} = \sum_{i = 2}^{n} Y_{i}$ , where $Y_{i} = \sum_{j = 1}^{i - 1} Z_{i}^{T} Z_{j}$ . It follows from direct calculation that

\begin{matrix} E (Y_{i}^{2}) = \sum_{j = 1}^{i - 1} \sum_{k = 1}^{i - 1} E (Z_{i}^{T} Z_{j} Z_{i}^{T} Z_{k}) = \sum_{j = 1}^{i - 1} E ({(Z_{i}^{T} Z_{j})}^{2}) \\ = (i - 1) Tr (E (Z_{1} Z_{1}^{T}) E (Z_{2} Z_{2}^{T})) = (i - 1) Tr (B^{2}), \end{matrix}

where B is defined in Lemma 2.1. Hence, $Var (T_{n}) = \frac{n (n - 1)}{2} Tr (B^{2})$ . Although T_n has a U-statistic structure, the classical central limit theorem for U-statistics does not apply because the dimension p may depend on the sample size n. By applying Lemma 2.1 and exploiting the martingale structure of T_n, we can establish the asymptotic normality of $\frac{T_{n}}{\sqrt{Var (T_{n})}}$ . The limiting null distribution of T_n is given in the following theorem.

Theorem 2.2

Assume conditions (C1) and (C2) hold. Then under H₀, as n, p → ∞, $\frac{T_{n}}{\sqrt{\frac{n (n - 1)}{2} Tr (B^{2})}} \to N (0, 1)$ in distribution.

Remark 2

Condition (C1) holds trivially if all eigenvalues of Σ are bounded away from 0 and ∞. It is noted that the bounded eigenvalues assumption is commonly adopted in the literature of estimating high-dimensional covariance matrices (e.g., Bickel and Levina, 2008). It has also been shown that (C1) holds under some general conditions if some of the eigenvalues are unbounded (Chen and Qin, 2010).

Remark 3

Condition (C2) is new but quite relaxed. In particular, it is generally weaker than those conditions in the literature which explicitly impose a relationship between n and p such as p = o(n²). Condition (C2) holds if all eigenvalues of Σ are bounded away from 0 and ∞. It also permits the eigenvalues to be unbounded, as the exponential term is expected to converge to zero quickly if $\frac{Tr (Σ)}{\sqrt{p} λ_{\max} (Σ)}$ diverges to ∞. To see this, let λ₁ ≤ λ₂ ≤ ⋯ ≤ λ_p be the ordered eigenvalues of Σ. Assume that as p → ∞, k₁ eigenvalues converge to 0; k₂ eigenvalues diverge to ∞; and p − k₁ − k₂ eigenvalues remain bounded with lower bound c₁ > 0 and upper bound c₂ < ∞. Then

\begin{matrix} \frac{Tr (Σ)}{\sqrt{p} λ_{\max} (Σ)} \geq \frac{k_{1} λ_{1} + c_{1} (p - k_{1} - k_{2}) + k_{2} λ_{p - k_{2} + 1}}{\sqrt{p} λ_{p}}, \\ \frac{{Tr}^{2} (Σ)}{Tr (Σ^{2})} \leq \frac{k_{2}^{2} λ_{p}^{2} + {(p - k_{2})}^{2} c_{2}^{2} + 2 k_{2} (p - k_{2}) c_{2} λ_{p}}{k_{1} λ_{1}^{2} + (p - k_{1}) c_{1}^{2}} . \end{matrix}

Assume $λ_{1} = p^{- b_{1}}$ and $λ_{p} = p^{b_{2}}$ for b₁ > 0, b₂ > 0. If both k₁ and k₂ are bounded, then it is easy to see that condition (C2) is satisfied if $b_{2} < \frac{1}{2}$ . It is noted that (C2) can still hold under some extra conditions on the rates of λ₁ and λ_p even if both k₁ and k₂ diverge to infinity at appropriate rates.

Remark 4

To apply T_n in practice, we need an estimator of Tr(B²). Following Chen and Qin (2010), we may estimate Tr(B²) using the cross-validation approach as follows:

\hat{Tr (B^{2})} = \{n (n - 1)\}^{- 1} Tr \left\{\sum_{1 \leq j \neq k \leq n} (Z_{j} - {\bar{Z}}_{(j, k)}) Z_{j}^{T} (Z_{k} - {\bar{Z}}_{(j, k)}) Z_{k}^{T}\right\},

(7)

where ${\bar{Z}}_{(j, k)}$ is the sample mean after excluding Z_j and Z_k. It is noteworthy that the estimator in Chen and Qin can be computationally intensive for large p as each term inside the U-statistic involves multiplying high-dimensional matrices. In contrast, the computational burden of the estimator in (7) can be substantially reduced by observing that ‖Z_j‖² = 1. Let ${\bar{Z}}^{*} = {(n - 2)}^{- 1} \sum_{m = 1}^{n} Z_{m}$ . In the Appendix, it is derived that

\begin{array}{l} \hat{Tr (B^{2})} = - \frac{n}{{(n - 2)}^{2}} + \frac{(n - 1)}{n {(n - 2)}^{2}} Tr {{(\sum_{j = 1}^{n} Z_{j} Z_{j}^{T})}^{2}} + \frac{1 - 2 n}{n (n - 1)} {\bar{Z}}^{* T} (\sum_{j = 1}^{n} Z_{j} Z_{j}^{T}) {\bar{Z}}^{*} \\ + \frac{2}{n} {‖ {\bar{Z}}^{*} ‖}^{2} + \frac{{(n - 2)}^{2}}{n (n - 1)} {‖ {\bar{Z}}^{*} ‖}^{4} . \end{array}

(8)
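To make the procedure concrete, here is a minimal sketch of the full test. It evaluates the leave-two-out estimator (7) directly rather than through the closed form (8): each trace term factors as [Z_j^T(Z_k − Z̄_(j,k))][Z_k^T(Z_j − Z̄_(j,k))], so only the n × n Gram matrix of the signs is needed and no p × p matrix is formed. The function names, the one-sided rejection rule (read off from the power formula in Section 2.3), and the error raised for a non-positive trace estimate are assumptions of this sketch.

```python
import math
from statistics import NormalDist


def spatial_signs(data):
    """Map each observation to its unit-length direction (zero vectors stay zero)."""
    signs = []
    for x in data:
        norm = math.sqrt(sum(v * v for v in x))
        signs.append([v / norm for v in x] if norm > 0 else [0.0] * len(x))
    return signs


def trace_b2_estimate(signs):
    """Estimator (7) of Tr(B^2), evaluated through the Gram matrix; needs n >= 3."""
    n, p = len(signs), len(signs[0])
    total = [sum(z[a] for z in signs) for a in range(p)]
    gram = [[sum(u * v for u, v in zip(zj, zk)) for zk in signs] for zj in signs]
    s = [sum(u * v for u, v in zip(z, total)) for z in signs]  # Z_j^T sum_m Z_m
    acc = 0.0
    for j in range(n):
        for k in range(n):
            if j == k:
                continue
            # Z_j^T (Z_k - Zbar_(j,k)), with Zbar_(j,k) = (total - Z_j - Z_k)/(n - 2)
            a_jk = gram[j][k] - (s[j] - gram[j][j] - gram[j][k]) / (n - 2)
            # Z_k^T (Z_j - Zbar_(j,k))
            a_kj = gram[j][k] - (s[k] - gram[k][k] - gram[j][k]) / (n - 2)
            acc += a_jk * a_kj
    return acc / (n * (n - 1))


def sign_test(data):
    """Return (standardized statistic, upper-tail p-value) from the N(0, 1) limit."""
    signs = spatial_signs(data)
    n, p = len(signs), len(signs[0])
    total = [sum(z[a] for z in signs) for a in range(p)]
    t_n = 0.5 * (sum(v * v for v in total) - sum(sum(v * v for v in z) for z in signs))
    tr_hat = trace_b2_estimate(signs)
    if tr_hat <= 0:
        raise ValueError("trace estimate is not positive")
    stat = t_n / math.sqrt(0.5 * n * (n - 1) * tr_hat)
    return stat, 1.0 - NormalDist().cdf(stat)
```

With this layout the cost is O(n²p) for the Gram matrix plus O(n²) for the double loop, which stays manageable for n = 50 and p in the thousands.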

In Figure 1(a), we plot the empirical distribution of $\frac{T_{n}}{\sqrt{\frac{n (n - 1)}{2} \hat{Tr (B^{2})}}}$ and compare it with the N(0, 1) density curve for n = 50, p = 1000, where the data are generated from the N_p(0, Σ) distribution with the (i, j)th entry of Σ equal to 0.8^|i−j|. The two curves are very close to each other, which suggests that the standard normal distribution provides a satisfactory approximation of the null distribution.

Figure 1. Comparing the empirical distribution of the new test with the theoretical distribution (n = 50, p = 1000)

2.3 Local power analysis

We now turn our attention to the power analysis of T_n under contiguous sequences of alternative hypotheses. This analysis enables us to further investigate the asymptotic relative efficiency of T_n with respect to Chen and Qin’s test (referred to as the CQ test in the sequel). Some interesting findings are revealed, which suggest a promising efficiency gain of the new test for heavy-tailed multivariate distributions in the high-dimensional setting.

For the local power analysis, we impose the following additional conditions.

\exp \left(- \frac{{Tr}^{2} (Σ)}{256 p λ_{\max}^{2} (Σ)}\right) = o \left(\min \left(\frac{λ_{\max} (Σ)}{Tr (Σ)}, \frac{λ_{\min} (Σ)}{λ_{\max} (Σ)}\right)\right) .

(C3)

λ_{\max} (Σ) = o (Tr (Σ)) .

(C4)

{‖ μ ‖}^{2} E ({‖ ε ‖}^{- 2}) = o \left(\min \left(n^{- 1} \frac{Tr (Σ^{2})}{λ_{\max} (Σ) Tr (Σ)}, n^{- 1 / 2} \frac{{Tr}^{1 / 2} (Σ^{2})}{Tr (Σ)}\right)\right) .

(C5)

For some 0 < δ < 1, {‖ μ ‖}^{2 δ} E ({‖ ε ‖}^{- 2 - 2 δ}) = o (E^{2} ({‖ ε ‖}^{- 1})) .

(C6)

Remark 5

Conditions (C3) and (C4) are concerned with the properties of the population covariance matrix Σ. These two conditions are relatively weak. In particular, they are satisfied when the eigenvalues of Σ are bounded away from 0 and ∞. Conditions (C5) and (C6) can be viewed as high-dimensional local-alternative statements for p > n. To gain some insight into the local alternative, consider the case where the eigenvalues of Σ are bounded away from 0 and ∞; then the right-hand side of (C5) is o(n^−1/2 p^−1/2). For the p-dimensional spherical t-distribution with ν degrees of freedom, $\frac{p}{{‖ ε ‖}^{2}} \sim F (ν, p)$ . It is easy to show that $E ({‖ ε ‖}^{- 2}) = \frac{1}{p - 2}$ . A slightly more involved calculation based on the properties of the F-distribution reveals that E(‖ε‖^−1) = O(p^−1/2) and E(‖ε‖^(−2−2δ)) = O(p^(−1−δ)). Then the conditions in (C5) and (C6) amount to ‖μ‖² = o(n^−1/2 p^1/2) and ‖μ‖^2δ = o(p^δ) for some 0 < δ < 1. If δ = 1/2, then the conditions further reduce to ‖μ‖ = o(n^−1/4 p^1/4). If we consider local alternatives such that all components of μ are equal to κ, then we have κ = o(n^−1/4 p^−1/4), which, when p > n, is of smaller order than n^−1/2, the usual local alternative rate for Hotelling’s test with fixed dimension. The faster rate of the local alternative can be viewed as a blessing of high dimensionality, where more information can be gained to distinguish subtle deviations from the null hypothesis.

Theorem 2.3

Assume conditions (C1)–(C6) hold. Let $A = E \left\{\frac{1}{‖ ε_{i} ‖} \left(I_{p} - \frac{ε_{i} ε_{i}^{T}}{{‖ ε_{i} ‖}^{2}}\right)\right\}$ . Then as n, p → ∞, $\frac{T_{n} - \frac{n (n - 1)}{2} μ^{T} A^{2} μ (1 + o (1))}{\sqrt{\frac{n (n - 1)}{2} Tr (B^{2})}} \to N (0, 1)$ in distribution.

Theorem 2.3 implies that under the local alternatives, the proposed level α test has the local power $β_{n} = Φ (- z_{α} + \sqrt{\frac{n (n - 1)}{2}} \frac{μ^{T} A^{2} μ (1 + o (1))}{\sqrt{Tr (B^{2})}})$ , where Φ(⋅) and z_α denote the cumulative distribution function and the upper α quantile of the N (0, 1) distribution, respectively. Let $η = \frac{μ^{T} A^{2} μ}{\sqrt{Tr (B^{2})}}$ . Figure 1(b) plots the empirical power of the proposed test as a function of η (for n = 50, p = 1000) and compares it with the theoretical power given by the above formula. The data are generated from the multivariate t distribution with mean vector μ, covariance matrix Σ and 3 degrees of freedom, where the (i, j)th entry of Σ is 0.8^|i−j| and μ has all elements equal to κ, with κ chosen according to the value of η. The plot suggests that the theoretical formula of the local power provides a reasonable approximation to the empirical power.
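The local power formula is easy to evaluate numerically. The sketch below drops the (1 + o(1)) factor, so it gives the leading-order approximation only; the function name and default level are illustrative choices.

```python
import math
from statistics import NormalDist


def local_power(eta, n, alpha=0.05):
    """Leading-order local power: Phi(-z_alpha + sqrt(n(n-1)/2) * eta)."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1.0 - alpha)  # upper alpha quantile of N(0, 1)
    return std_normal.cdf(-z_alpha + math.sqrt(n * (n - 1) / 2.0) * eta)
```

At η = 0 the formula returns the nominal level α, and for fixed η the power grows with n through the factor √(n(n − 1)/2).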

On the other hand, the CQ test has the local power $β_{n}^{CQ} = Φ (- z_{α} + \frac{n {‖ μ ‖}^{2}}{\sqrt{2 Tr (Σ^{2})}})$ . The asymptotic relative efficiency (ARE) of T_n versus the CQ test is

{ARE}_{T_{n}, CQ} = \frac{μ^{T} A^{2} μ}{{‖ μ ‖}^{2}} \sqrt{\frac{Tr (Σ^{2})}{Tr (B^{2})}} (1 + o (1)) .

(9)

To appreciate the implication of the above result, we consider the asymptotic relative efficiency when the data arise from a spherical p-dimensional t distribution with ν degrees of freedom (ν > 2). In this case, $A = E [{‖ ε ‖}^{- 1}] \frac{p - 1}{p} I_{p}$ where I_p denotes the p × p identity matrix, Tr(B²) = p⁻¹ and Tr(Σ²) = p⁻¹E²[‖ε‖²]. Hence, ARE_{T_n, CQ} = p⁻²(p − 1)²E²[‖ε‖⁻¹] E[‖ε‖²]. For the t distribution, we have $E [{‖ ε ‖}^{2}] = \frac{p ν}{ν - 2}$ and $E [{‖ ε ‖}^{- 1}] = \sqrt{\frac{2}{p ν}} \frac{Γ ((ν + 1) / 2)}{Γ (ν / 2)}$ , where $Γ (t) = \int_{0}^{\infty} u^{t - 1} e^{- u} d u$ denotes the gamma function. For large p, the asymptotic relative efficiency thus is approximately ${ARE}_{T_{n}, CQ} \approx \frac{2}{ν - 2} {(\frac{Γ ((ν + 1) / 2)}{Γ (ν / 2)})}^{2}$ . For ν = 3, this value is about 2.54; for ν = 4, it is about 1.76; for ν = 5, it is about 1.51; for ν = 6, it is about 1.38; for ν = ∞ (corresponding to the multivariate normal distribution), by noting that $Γ ((ν + 1) / 2) \approx Γ (ν / 2) \sqrt{\frac{ν}{2}}$ as ν → ∞, we have that the ARE has limit one. Theoretically, the efficiency loss of the new test under multivariate normality is small, but the efficiency gain can be substantial for heavy-tailed distributions. Recall that for ν = 3, the ARE of the classical spatial sign test versus Hotelling’s T² is 2.02 for p = 3 and 2.09 for p = 10 in the fixed-dimensional case. This suggests that the nonparametric test may have more substantial power gain in the high-dimensional case.
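The large-p approximation of the ARE for the spherical t distribution can be checked numerically. The following sketch evaluates 2/(ν − 2) · {Γ((ν + 1)/2)/Γ(ν/2)}² using log-gamma values for numerical stability; the function name is an illustrative choice.

```python
import math


def are_t_large_p(nu):
    """Large-p ARE of T_n versus the CQ test for a spherical t with nu > 2 d.f."""
    if nu <= 2:
        raise ValueError("nu must exceed 2 so that the covariance matrix exists")
    # Work on the log scale so that large nu does not overflow the gamma function.
    log_ratio = math.lgamma((nu + 1) / 2.0) - math.lgamma(nu / 2.0)
    return 2.0 / (nu - 2.0) * math.exp(2.0 * log_ratio)


for nu in (3, 4, 5, 6, 1e6):
    print(nu, round(are_t_large_p(nu), 3))
```

For ν = 3 the expression equals 8/π, and for very large ν it approaches one, in line with the limit stated for the multivariate normal case.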

2.4 Extensions to beyond the elliptical distribution family

In this paper, we focus on the family of elliptical distributions because of its popularity and flexibility for modeling non-normal multivariate data. Our results have the potential to extend to some useful multivariate distributions beyond the family of elliptical distributions.

One such class consists of the distributions generated from the symmetric independent component models (e.g., Ilmonen and Paindaveine, 2011). That is,

X_{i} = μ + Γ Z_{i},

(10)

where Γ is a full rank p × p positive definite matrix; Z_i = (Z_i1, …, Z_ip)′ has independent components Z_ij and Z_ij is symmetric about zero. The independent components model assumes that the observed random vector can be written as linear combinations of independent random variables. This model has received broad attention in signal processing and machine learning (Hyvärinen, Karhunen and Oja, 2001). For example, independent component analysis with exponential power marginal density (p(x) ∝ exp (−|x|^q) for some q > 0) is popular for analyzing image and sound signals. It is noted that this class of models encompasses many of the practically useful distributions from Bai and Saranadasa (1996) and Chen and Qin (2010).

We assume that Z_ij are standardized such that Var (Z_ij) = 1. Thus Var (X_i) = ΓΓ^T. We also assume that Z_ij has a sub-exponential distribution with exponent α, that is, there exist constants a > 0, b > 0 such that for all t > 0, P (|Z_ij| ≥ t^α) ≤ a exp (−bt). If α = 1/2, then Z_ij is sub-gaussian. The class of sub-exponential distributions includes many practically used heavy-tailed distributions. The fact that our proposed test is still valid for this class is summarized in the following theorem, whose proof is given in the Supplementary Material.

Theorem 2.4

Assume (C1) and (C2) hold for model (10), $E (Z_{i j}^{4}) \leq c$ for some positive constant c for all i, j, and Tr(Σ²) = o(Tr²(Σ)). Then $\frac{T_{n}}{\sqrt{\frac{n (n - 1)}{2} Tr (B^{2})}} \to N (0, 1)$ in distribution under H₀, as n, p → ∞, where B has the same expression as in Lemma 2.1.

Another interesting extension of the elliptical distributions involves generating random variables from (2) but allowing R_i to be negative and to depend on U_i. This yields the so-called family of generalized elliptical distributions. The asymptotic results of this paper also hold for this class by observing that $\frac{X_{i}}{‖ X_{i} ‖} = \frac{U_{i}}{‖ U_{i} ‖}$ under H₀. This class of models has recently caught the attention of researchers in finance; see Branco and Dey (2001), Frahm (2004), among others. A representative example of this class is the collection of multi-tail elliptical distributions, where R_i is a positive random variable whose tail parameter depends on ΓU_i (e.g., Kring et al, 2009; Rachev et al, 2011). The multi-tail elliptical distributions are particularly useful for modeling asset returns in finance.

Not surprisingly, generalizations to other multivariate distributions are possible although a case-by-case consideration may be needed. Particularly, the requirement that the U_i in (2) is uniformly distributed on the L₂ sphere can be relaxed. For example, one possible extension is to allow U_i to be from the class of distributions discussed in Gupta and Song (1997) and Szablowski (1998). Concentration inequalities similar to that given in Lemma A.2, which plays an important role in the proof, can be obtained for random vectors that satisfy certain concentration of measure properties (El Karoui, 2009).

3 Numerical studies

3.1 Monte Carlo simulations

We compare the performance of the new test with four alternatives: the test of Chen and Qin (CQ test, 2010), the test of Srivastava, Katayama and Kano (SKK test, 2013), the test based on multiple comparison with Bonferroni correction (BF test), and the test based on multiple comparison with FDR control (FDR test, Benjamini and Hochberg, 1995). The SKK test is constructed using the inverse of the diagonalized version of the sample covariance matrix and is computationally attractive as it involves a simple estimator for the asymptotic covariance. The BF test controls the familywise error rate at 0.05 and the FDR test controls the false discovery rate at 0.05. Both the BF test and the FDR test are computed using the p-values from the t tests for the marginal hypotheses and reject H₀ if at least one marginal test is significant. The performance of the five tests is evaluated on 1000 simulation runs. We consider n = 20, 50 and p = 200, 1000 and 2000. To save space, we report the results for p = 1000 and 2000 here. The results for p = 50, 100 and 200 are reported in the Supplemental Material.
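The rejection rule shared by the BF and FDR tests ("reject H₀ if at least one marginal test is significant") can be sketched as follows. The functions take the marginal p-values as input, so the t tests themselves are not shown; the function names are illustrative, and the FDR rule is the usual Benjamini-Hochberg step-up criterion.

```python
def bonferroni_global_reject(pvalues, level=0.05):
    """Reject the global null if any marginal p-value is at most level / m."""
    m = len(pvalues)
    return min(pvalues) <= level / m


def bh_global_reject(pvalues, level=0.05):
    """Reject the global null if the Benjamini-Hochberg step-up rule rejects anything.

    BH rejects at least one hypothesis exactly when some ordered p-value p_(k)
    satisfies p_(k) <= level * k / m.
    """
    m = len(pvalues)
    ordered = sorted(pvalues)
    return any(pv <= level * rank / m for rank, pv in enumerate(ordered, start=1))
```

For example, with marginal p-values 0.02, 0.03 and 0.04 the Bonferroni rule does not reject (0.02 > 0.05/3) while the FDR rule does (0.03 ≤ 0.05 · 2/3), which illustrates why the FDR test is never less powerful than the BF test as a global test.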

Example 1

In this example, random data were generated from N_p(μ, Σ). We consider three different choices for μ and three different choices for Σ = (σ_ij).

The three choices for μ are: (1) the null hypothesis μ₀ = (0, …, 0)^T; (2) the alternative μ₁ = (0.25, 0.25, …, 0.25)^T; and (3) the alternative μ₂ = (μ₂₁, …, μ₂_p)^T with $μ_{21} = \dots = μ_{2 \frac{p}{3}} = 0$ , $μ_{2 (\frac{p}{3} + 1)} = \dots = μ_{2 (\frac{2 p}{3})} = 0.25$ and $μ_{2 (\frac{2 p}{3} + 1)} = \dots = μ_{2 p} = - 0.25$ . The three choices for Σ are: (1) σ_ii = 1 and σ_ij = 0.2 (i ≠ j); (2) σ_ij = 0.8^|i−j|; and (3) Σ = DRD, where D = diag(d₁, …, d_p) with d_i = 2 + (p − i + 1)/p, R = (r_ij) with r_ii = 1 and r_ij = (−1)^(i+j) (0.2)^|i−j| for i ≠ j. In the tables, we denote these three choices for Σ by Σ₁, Σ₂ and Σ₃, respectively. It is noted that Σ₃ was considered in Srivastava, Katayama and Kano (2013).
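The simulation design above can be written down as a short sketch. Two points are assumptions of this illustration rather than statements from the paper: when p is not a multiple of 3, the block boundaries p/3 and 2p/3 are rounded down, and the off-diagonal entries of R follow the expression exactly as printed in the text. Function names are illustrative.

```python
def mean_vectors(p, shift=0.25):
    """Return (mu0, mu1, mu2) for Example 1; block boundaries are rounded down."""
    first, second = p // 3, (2 * p) // 3  # assumed rounding of p/3 and 2p/3
    mu0 = [0.0] * p
    mu1 = [shift] * p
    mu2 = [0.0] * first + [shift] * (second - first) + [-shift] * (p - second)
    return mu0, mu1, mu2


def sigma1(p, rho=0.2):
    """Compound symmetry: unit variances, common correlation rho."""
    return [[1.0 if i == j else rho for j in range(p)] for i in range(p)]


def sigma2(p, rho=0.8):
    """Autoregressive-type decay: entry (i, j) equals rho^|i - j|."""
    return [[rho ** abs(i - j) for j in range(p)] for i in range(p)]


def sigma3(p, rho=0.2):
    """Sigma = D R D with d_i = 2 + (p - i + 1)/p and alternating-sign correlations."""
    d = [2.0 + (p - i + 1) / p for i in range(1, p + 1)]  # i is 1-based as in the text

    def r(i, j):
        return 1.0 if i == j else (-1) ** (i + j) * rho ** abs(i - j)

    return [[d[i - 1] * r(i, j) * d[j - 1] for j in range(1, p + 1)]
            for i in range(1, p + 1)]
```

For p = 1000 or 2000 these dense lists are large but still workable; a numerical library would be the natural choice for an actual simulation study.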

Table 1 summarizes the simulation results for different choices of Σ, μ, n and p. We observe that the five tests have empirical levels reasonably close to the nominal level 0.05, especially when n = 50. For the alternative μ₁, the performance of the new test is very close to that of the CQ test and the SKK test, which are significantly better than the BF test and the FDR test. The latter two tests have especially low power when n = 20. For the alternative μ₂, we first note that the BF test and the FDR test perform fine for Σ₁ when n = 50 but have significantly lower power in all other settings. We also observe that the new test, the CQ test and the SKK test perform similarly for Σ₃; the new test has somewhat better performance for Σ₁; and the SKK test has somewhat better performance for Σ₂ for p = 1000.

Table 1.

Example 1: multivariate normal distribution

Σ	μ	n	p	New	CQ	SKK	BF	FDR
Σ₁	μ₀	20	1000	0.066	0.069	0.061	0.046	0.046
		20	2000	0.073	0.070	0.046	0.052	0.053
		50	1000	0.059	0.060	0.043	0.035	0.038
		50	2000	0.058	0.061	0.029	0.043	0.047

	μ₁	20	1000	0.723	0.723	0.692	0.405	0.471
		20	2000	0.720	0.729	0.638	0.385	0.447
		50	1000	0.975	0.976	0.962	0.842	0.890
		50	2000	0.970	0.976	0.945	0.850	0.901

	μ₂	20	1000	0.951	0.826	0.650	0.382	0.443
		20	2000	0.954	0.821	0.567	0.404	0.464
		50	1000	1.000	1.000	1.000	0.964	0.997
		50	2000	1.000	1.000	1.000	0.973	0.998

Σ₂	μ₀	20	1000	0.052	0.051	0.047	0.038	0.041
		20	2000	0.058	0.059	0.023	0.047	0.047
		50	1000	0.048	0.050	0.051	0.043	0.048
		50	2000	0.060	0.060	0.052	0.054	0.061

	μ₁	20	1000	0.795	0.797	0.815	0.122	0.138
		20	2000	0.969	0.968	0.930	0.134	0.145
		50	1000	0.999	0.999	0.999	0.357	0.430
		50	2000	1.000	1.000	1.000	0.416	0.479

	μ₂	20	1000	0.540	0.549	0.579	0.092	0.102
		20	2000	0.790	0.788	0.695	0.102	0.112
		50	1000	0.992	0.991	0.994	0.265	0.289
		50	2000	1.000	1.000	1.000	0.343	0.381

Σ₃	μ₀	20	1000	0.055	0.055	0.067	0.052	0.052
		20	2000	0.059	0.061	0.048	0.042	0.044
		50	1000	0.052	0.052	0.061	0.042	0.044
		50	2000	0.045	0.045	0.052	0.048	0.048

	μ₁	20	1000	0.490	0.438	0.514	0.127	0.132
		20	2000	0.646	0.594	0.556	0.113	0.119
		50	1000	1.000	0.998	1.000	0.296	0.331
		50	2000	1.000	1.000	1.000	0.331	0.369

	μ₂	20	1000	0.242	0.225	0.335	0.100	0.104
		20	2000	0.342	0.310	0.318	0.084	0.092
		50	1000	0.932	0.862	0.982	0.239	0.260
		50	2000	0.991	0.987	1.000	0.269	0.296

Example 2

We simulate X_i from a p-variate t distribution with mean vector μ, covariance matrix Σ and 3 degrees of freedom. The choices of μ and Σ are set to be the same as those in Example 1. The distribution is heavy-tailed in this example.

We summarize the simulation results in Table 2. Both the new test and the CQ test have empirical levels close to 0.05 under the null hypothesis μ₀, while the other three tests tend to be conservative. In this example, the BF test and the FDR test perform unsatisfactorily under the alternatives μ₁ and μ₂. The new test has the best power in all settings, and its power is often substantially higher (sometimes more than twofold) than that of the second-best test (the CQ test in this example). For example, the new test (and the CQ test) has power 0.83 (and 0.36) for the setting with μ = μ₂, Σ = Σ₁, n = 20 and p = 2000; 0.98 (and 0.47) for the setting with μ = μ₁, Σ = Σ₃, n = 50 and p = 1000; and 0.88 (and 0.51) for the setting with μ = μ₁, Σ = Σ₂, n = 20 and p = 2000.

Table 2.

Example 2: multivariate t-distribution

Σ	μ	n	p	New	CQ	SKK	BF	FDR
Σ₁	μ₀	20	1000	0.083	0.088	0.012	0.011	0.013
		20	2000	0.064	0.072	0.007	0.010	0.010
		50	1000	0.053	0.063	0.015	0.011	0.012
		50	2000	0.069	0.076	0.008	0.010	0.010

	μ₁	20	1000	0.633	0.472	0.222	0.153	0.183
		20	2000	0.631	0.468	0.171	0.117	0.138
		50	1000	0.941	0.721	0.493	0.424	0.491
		50	2000	0.921	0.736	0.448	0.438	0.492

	μ₂	20	1000	0.815	0.371	0.076	0.129	0.150
		20	2000	0.830	0.363	0.040	0.107	0.122
		50	1000	1.000	0.803	0.333	0.427	0.485
		50	2000	1.000	0.825	0.288	0.493	0.571

Σ₂	μ₀	20	1000	0.052	0.053	0.000	0.011	0.012
		20	2000	0.058	0.060	0.000	0.013	0.013
		50	1000	0.052	0.053	0.000	0.015	0.015
		50	2000	0.059	0.060	0.000	0.020	0.021

	μ₁	20	1000	0.682	0.349	0.013	0.029	0.033
		20	2000	0.883	0.512	0.002	0.038	0.040
		50	1000	0.996	0.780	0.120	0.094	0.102
		50	2000	1.000	0.933	0.091	0.118	0.128

	μ₂	20	1000	0.441	0.228	0.001	0.027	0.029
		20	2000	0.654	0.339	0.000	0.027	0.028
		50	1000	0.942	0.570	0.037	0.067	0.071
		50	2000	0.998	0.785	0.018	0.077	0.086

Σ₃	μ₀	20	1000	0.054	0.058	0.001	0.023	0.023
		20	2000	0.061	0.057	0.000	0.011	0.011
		50	1000	0.056	0.057	0.002	0.015	0.015
		50	2000	0.048	0.042	0.001	0.015	0.015

	μ₁	20	1000	0.355	0.174	0.001	0.033	0.034
		20	2000	0.488	0.224	0.003	0.022	0.022
		50	1000	0.979	0.465	0.021	0.080	0.084
		50	2000	0.998	0.624	0.015	0.095	0.103

	μ₂	20	1000	0.198	0.113	0.001	0.034	0.034
		20	2000	0.251	0.141	0.000	0.018	0.019
		50	1000	0.766	0.249	0.010	0.059	0.064
		50	2000	0.922	0.349	0.005	0.070	0.073

Example 3

We simulate X_i from a scale mixture of two multivariate normal distributions, 0.9 N_p(μ, Σ) + 0.1 N_p(μ, 9Σ), where we consider the same choices of μ and Σ as in Example 1. The distribution in this example also has heavy tails.
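As a quick check of why this mixture is heavy-tailed, the following worked computation gives the covariance and the marginal kurtosis implied by the mixture weights. It is a sketch under one piece of notation introduced here for illustration: σ² denotes the variance of a single coordinate under Σ, and X_ij denotes the j-th coordinate of X_i.

```latex
\begin{aligned}
\mathrm{Cov}(X_i) &= 0.9\,\Sigma + 0.1\,(9\,\Sigma) = 1.8\,\Sigma, \\
E\{(X_{ij} - \mu_j)^4\} &= 0.9\,(3\sigma^4) + 0.1\,(3 \cdot 81\,\sigma^4) = 27\,\sigma^4, \\
\text{kurtosis} &= \frac{27\,\sigma^4}{(1.8\,\sigma^2)^2} = \frac{25}{3} \approx 8.33 .
\end{aligned}
```

Each coordinate therefore has kurtosis well above the normal value of 3, and the covariance matrix of X_i is 1.8Σ rather than Σ.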

We summarize the simulation results in Table 3. As in Example 2, the new test significantly outperforms the four competing approaches. For example, the new test (and the CQ test) has power 0.88 (and 0.49) for the setting with μ = μ₂, Σ = Σ₁, n = 20 and p = 2000; 0.80 (and 0.42) for the setting with μ = μ₂, Σ = Σ₃, n = 50 and p = 1000; and 0.91 (and 0.70) for the setting with μ = μ₁, Σ = Σ₂, n = 20 and p = 2000.

Table 3.

Example 3: mixture of multivariate normal distributions

Σ	μ	n	p	New	CQ	SKK	BF	FDR
Σ₁	μ₀	20	1000	0.063	0.070	0.007	0.014	0.015
		20	2000	0.045	0.049	0.007	0.015	0.018
		50	1000	0.063	0.066	0.014	0.020	0.021
		50	2000	0.042	0.040	0.007	0.015	0.016

	μ₁	20	1000	0.649	0.548	0.277	0.209	0.259
		20	2000	0.627	0.542	0.229	0.193	0.231
		50	1000	0.941	0.859	0.730	0.600	0.682
		50	2000	0.964	0.867	0.701	0.619	0.687

	μ₂	20	1000	0.870	0.449	0.109	0.175	0.201
		20	2000	0.882	0.492	0.089	0.224	0.243
		50	1000	1.000	0.966	0.700	0.663	0.759
		50	2000	1.000	0.968	0.577	0.698	0.800

Σ₂	μ₀	20	1000	0.046	0.063	0.006	0.019	0.019
		20	2000	0.050	0.047	0.004	0.017	0.018
		50	1000	0.039	0.049	0.000	0.021	0.023
		50	2000	0.041	0.039	0.001	0.020	0.021

	μ₁	20	1000	0.678	0.485	0.093	0.059	0.067
		20	2000	0.914	0.700	0.126	0.069	0.070
		50	1000	0.998	0.943	0.230	0.157	0.177
		50	2000	1.000	0.995	0.167	0.190	0.211

	μ₂	20	1000	0.437	0.285	0.059	0.050	0.053
		20	2000	0.687	0.478	0.088	0.047	0.050
		50	1000	0.953	0.759	0.045	0.111	0.118
		50	2000	0.998	0.940	0.040	0.150	0.169

Σ₃	μ₀	20	1000	0.054	0.053	0.008	0.020	0.020
		20	2000	0.039	0.045	0.002	0.022	0.022
		50	1000	0.050	0.050	0.000	0.026	0.026
		50	2000	0.035	0.034	0.000	0.021	0.022

	μ₁	20	1000	0.342	0.207	0.045	0.058	0.061
		20	2000	0.493	0.307	0.072	0.046	0.049
		50	1000	0.995	0.730	0.050	0.124	0.129
		50	2000	1.000	0.879	0.035	0.111	0.119

	μ₂	20	1000	0.178	0.130	0.030	0.046	0.050
		20	2000	0.249	0.173	0.043	0.047	0.048
		50	1000	0.797	0.421	0.012	0.100	0.110
		50	2000	0.947	0.565	0.011	0.092	0.093

3.2 An application

Type 2 diabetes is one of the most common chronic diseases. Insulin resistance in skeletal muscle, which is the major site of glucose disposal, is a prominent feature of Type 2 diabetes. To study insulin's ability to regulate gene expression, microarray analysis was performed using the Affymetrix Hu95A chip on human skeletal muscle biopsies from 15 diabetic patients both before and after insulin treatment (Wu et al., 2007). The alterations in gene expression may provide insights into new therapeutic targets for the treatment of this common disease. Hence, we are interested in testing the hypothesis in (1), where μ represents the average change in gene expression level due to the treatment.

The underlying genetics of Type 2 diabetes are recognized to be very complex. It is believed that Type 2 diabetes results from interactions between many genetic factors and the environment. The data were normalized by quantile normalization. When multiple probes are associated with the same gene, their expression values are consolidated by taking the average. In our analysis, we considered 2519 curated gene sets. The gene sets we used are from the C2 collection of the GSEA online pathway databases (http://www.broadinstitute.org/gsea/msigdb/collection_details.jsp#C2). The largest gene set contains 1607 genes, which makes the hypothesis testing problem a high-dimensional one.

We applied both the new test and the CQ test, using the Bonferroni correction to control the family-wise error rate at the 0.05 level. For the CQ method, 520 gene sets (20.64% of all candidates) are identified as significant; for the new method, 954 gene sets (37.87% of all candidates) are selected as significant. We observe that the significant gene sets selected by the new test include those identified by the CQ test with only one exception (HASLINGER_B_CLL_WITH_CHROMOSOME_12_TRISOMY).
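To give a sense of how stringent this correction is, the sketch below computes the per-gene-set significance level and the corresponding critical value for a standardized statistic. The one-sided rejection region based on the normal limit is an assumption made for illustration; the variable names are hypothetical.

```python
from statistics import NormalDist

n_gene_sets = 2519   # number of gene sets tested (from the text)
fwer = 0.05          # target family-wise error rate (from the text)

# Bonferroni: each gene set is tested at level fwer / n_gene_sets.
per_set_alpha = fwer / n_gene_sets

# Assumption: one-sided test, reject when the standardized statistic
# exceeds the upper per_set_alpha quantile of N(0, 1).
critical_value = NormalDist().inv_cdf(1.0 - per_set_alpha)
```

Under these assumptions the per-set level is about 2 × 10⁻⁵ and the critical value is a little above 4, so the standardized statistics reported in Table 4 (all above 15) lie far beyond the rejection threshold.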

Table 4 displays the top 10 significant gene sets identified by the two tests and the corresponding values of the test statistics. The “NA” values in the table correspond to gene sets in the top 10 list of one test but not the other. We observe that these two lists share 7 common gene sets. Among these seven gene sets, ZWANG_CLASS_2_TRANSIENTLY_INDUCED_BY_EGF, NAGASHIMA_EGF_SIGNALING_UP, AMIT_EGF_RESPONSE_60_HELA, AMIT_SERUM_RESPONSE_40_MCF10A and AMIT_SERUM_RESPONSE_60_MCF10A are known to be biologically related to insulin effect on human cells. We also observe that for 9 out of the top 10 gene sets, the new test has a smaller p-value than the CQ test does. The gene set SEMENZA_HIF1_TARGETS is only on the top 10 list of the new test and was also found to be biologically related to insulin effect on human cells. Most of those significant gene sets are induced by epidermal growth factor (EGF) or insulin-like growth factor (IGF).

Table 4.

The top 10 significant gene sets selected by the new test and the CQ test

Gene set	New test	CQ test
ZWANG_CLASS_2_TRANSIENTLY_INDUCED_BY_EGF	24.34	20.11
NAGASHIMA_EGF_SIGNALING_UP	22.44	17.01
SHIPP_DLBCL_CURED_VS_FATAL_DN	22.34	18.24
WILLERT_WNT_SIGNALING	19.66	NA
UZONYI_RESPONSE_TO_LEUKOTRIENE_AND_THROMBIN	19.63	18.46
PID_HIF2PATHWAY	19.46	15.65
PHONG_TNF_TARGETS_UP	19.21	18.90
AMIT_EGF_RESPONSE_60_HELA	18.64	16.38
MCCLUNG_CREB1_TARGETS_DN	18.43	NA
SEMENZA_HIF1_TARGETS	18.34	NA
AMIT_SERUM_RESPONSE_40_MCF10A	NA	15.98
AMIT_SERUM_RESPONSE_60_MCF10A	NA	15.43
PLASARI_TGFB1_TARGETS_1HR_UP	NA	15.00

It is interesting to point out that exploratory analysis of the gene expression data suggests the multivariate normality assumption is questionable. In fact, we investigated each of the top 10 gene sets identified by the new test and found that the multivariate normal distribution is plausible for none of them. For example, Figure 2 displays the histogram of the marginal kurtoses of the differences in gene expression levels (before/after the treatment) for all genes in the MCCLUNG_CREB1_TARGETS_DN gene set, which was selected among the top 10 gene sets by the new method but not by the CQ method. Figure 2 clearly shows that some gene expression levels have heavy tails, as their kurtoses are much larger than 3, the kurtosis of a normal distribution.
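The marginal kurtosis used in this diagnostic can be computed per gene from the before/after differences. The sketch below uses the plain moment estimator (fourth central moment divided by the squared second central moment); whether the authors used this estimator or a bias-corrected variant is not stated, so this choice is an assumption, and the input values are hypothetical.

```python
def sample_kurtosis(values):
    """Moment estimator of kurtosis (equals 3 in the limit for normal data)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n   # second central moment
    m4 = sum((v - mean) ** 4 for v in values) / n   # fourth central moment
    return m4 / (m2 ** 2)


# Hypothetical differences for one gene across 15 patients: one large change
# among otherwise unchanged values produces a heavy-tailed sample.
differences = [0.0] * 14 + [1.0]
heavy_tail_kurtosis = sample_kurtosis(differences)
```

With only 15 patients per gene the estimate is noisy, which is one reason to look at the histogram over all genes in a set rather than at individual values.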

Figure 2. The histogram of marginal kurtoses for all genes in the MCCLUNG_CREB1_TARGETS_DN gene set.

4 Conclusion and discussions

The paper proposes a new spatial sign based nonparametric test for a hypothesis about the location parameter of a high-dimensional random vector. The goal is to improve the power performance when the underlying distribution of the data deviates from multivariate normality. We investigate the asymptotic properties of the new test and compare it with alternative tests based on extending Hotelling’s T² test. A remarkable finding is that the power improvement in the large p setting can be more substantial than that in the classical fixed p setting. The proposed test can be used as a basic building block to develop nonparametric tests in other important settings, such as testing for sparse alternatives or testing a hypothesis on coefficients in high-dimensional factorial designs (Zhong and Chen, 2011). A spatial sign based test was proposed for sphericity when p = O(n²) in Zou et al. (2014), and spatial sign tests were proposed for testing uniformity on the unit sphere and other related null hypotheses when p/n → c for some positive constant c in Paindaveine and Verdebout (2013). The techniques related to sign tests have the potential to be used to develop the high-dimensional theory for other classical nonparametric multivariate testing procedures, such as those based on spatial signed ranks (e.g., Möttönen and Oja, 1995) and ranks (e.g., Hallin and Paindaveine, 2006).

For reasons discussed in the introduction, detecting the significance of a gene set is often of independent interest. In particular, finding significant gene sets/pathways can improve our understanding of the biological processes associated with a specific disease. The proposed method can also be incorporated into a multi-step procedure, in combination with various gene-level testing procedures and multiple-testing correction methods, to further identify a short list of top genes for biologists. This kind of multi-step procedure is expected to have better power to identify important individual genes, as the gene set acts as a dimension reduction from potentially thousands of genes.

Supplementary Material

NIHMS651394-supplement-Supplementary_Material.pdf (301KB, pdf)

Acknowledgments

Wang and Peng’s research is supported by an NSF grant DMS1308960. Li’s research is supported by NIDA, NIH grants P50 DA10075 and P50 DA036107. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIDA or the NIH. We thank the Editor, the AE and three referees for their constructive comments, which helped us significantly improve the paper. We also thank Professor Tiefeng Jiang for helpful discussions.

Appendix: Technical proofs

Appendix 1: Some useful lemmas

We present below several useful technical lemmas, the proofs of which can be found in the online supplementary material.

Lemma A.1

Let U = (U₁, …, U_p)^T be a random vector uniformly distributed on the unit sphere in ℝ^p. Then

(1) E(U) = 0, Var(U) = p⁻¹ I_p, $E(U_{j}^{4}) = \frac{3}{p(p + 2)}$, and $E(U_{j}^{2} U_{k}^{2}) = \frac{1}{p(p + 2)}$ for j ≠ k.
(2) Let M be a deterministic real-valued matrix. Assume that ‖M‖₂ ≤ k, where ‖M‖₂ denotes the spectral norm of M. Then, for all t > 0, $P(|U^{T} M U - p^{-1} Tr(M)| > t) \leq 2 \exp(- \frac{(p - 1)(t - c_{p})^{2}}{8 k^{2}})$, where $c_{p} = \sqrt{\frac{8 π k^{2}}{p - 1}}$.
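The moment identities in Lemma A.1 are easy to check by simulation. The sketch below draws points uniformly on the unit sphere by normalizing standard Gaussian vectors (a standard construction, used here as an assumption about how one would simulate U) and compares Monte Carlo averages with the stated values; the function names, the dimension p = 5 and the number of draws are illustrative choices.

```python
import math
import random


def uniform_on_sphere(p, rng):
    """One draw from the uniform distribution on the unit sphere in R^p."""
    g = [rng.gauss(0.0, 1.0) for _ in range(p)]
    length = math.sqrt(sum(x * x for x in g))
    return [x / length for x in g]


def sphere_moments(p, n_draws, seed=0):
    """Monte Carlo estimates of E(U_1^2), E(U_1^4) and E(U_1^2 U_2^2)."""
    rng = random.Random(seed)
    m2 = m4 = m22 = 0.0
    for _ in range(n_draws):
        u = uniform_on_sphere(p, rng)
        m2 += u[0] ** 2
        m4 += u[0] ** 4
        m22 += u[0] ** 2 * u[1] ** 2
    return m2 / n_draws, m4 / n_draws, m22 / n_draws


p = 5
est_m2, est_m4, est_m22 = sphere_moments(p, n_draws=20000)
exact_m2 = 1 / p                  # diagonal entry of Var(U) = I_p / p
exact_m4 = 3 / (p * (p + 2))      # E(U_j^4)
exact_m22 = 1 / (p * (p + 2))     # E(U_j^2 U_k^2), j != k
```

The estimates should agree with the exact values up to Monte Carlo error, which shrinks at the usual rate as the number of draws grows.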

Lemma A.2

(A concentration inequality) Assume W = ΓU, where U is uniformly distributed on the unit sphere in ℝ^p. Let Ω = ΓΓ^T and consider the event $A = \{\frac{Tr(Ω)}{2 p} \leq ‖W‖^{2} \leq \frac{3 Tr(Ω)}{2 p}\}$. Then $P(A) \geq 1 - c_{1} \exp(- \frac{Tr^{2}(Σ)}{128 p λ_{\max}^{2}(Σ)})$ for all p > 1, where c₁ = 2 exp(π/2) is a finite constant.

Lemma A.3

For any p-dimensional vectors X and μ, we have (1) $‖\frac{X - μ}{‖X - μ‖} - \frac{X}{‖X‖}‖ \leq 2 \frac{‖μ‖}{‖X‖}$; and (2) $‖\frac{X - μ}{‖X - μ‖} - \frac{X}{‖X‖} - \frac{1}{‖X‖} (I_{p} - \frac{X X^{T}}{‖X‖^{2}}) μ‖ \leq c_{2} \frac{‖μ‖^{1 + δ}}{‖X‖^{1 + δ}}$, for all 0 < δ < 1, where c₂ is a constant that does not depend on X or μ.
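Part (1) of Lemma A.3 says that shifting a vector by μ moves its spatial sign by at most 2‖μ‖/‖X‖. A small numerical check of this bound is sketched below; the helper names, the dimension and the distributions used to draw X and μ are arbitrary illustrative choices.

```python
import math
import random


def norm(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def spatial_sign(v):
    """Spatial sign v / ||v|| (v must be nonzero)."""
    length = norm(v)
    return [x / length for x in v]


def lemma_a3_part1(x, mu):
    """Return (left-hand side, right-hand side) of inequality (1)."""
    shifted = spatial_sign([a - b for a, b in zip(x, mu)])
    base = spatial_sign(x)
    lhs = norm([a - b for a, b in zip(shifted, base)])
    rhs = 2.0 * norm(mu) / norm(x)
    return lhs, rhs


rng = random.Random(1)
checks = []
for _ in range(1000):
    x = [rng.gauss(0.0, 1.0) for _ in range(10)]
    mu = [rng.gauss(0.0, 0.3) for _ in range(10)]
    checks.append(lemma_a3_part1(x, mu))

# The bound should hold in every draw (small slack for rounding error).
all_hold = all(lhs <= rhs + 1e-12 for lhs, rhs in checks)
```

The bound is informative when ‖μ‖ is small relative to ‖X‖, which is the regime of the local alternatives considered in Theorem 2.3.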

Lemma A.4

Let B be the matrix defined in Lemma 2.1. If condition (C3) holds, then $λ_{\max}(B) \leq \frac{2 λ_{\max}(Σ)}{Tr(Σ)} (1 + o(1))$.

Lemma A.5

Let A be the matrix defined in Theorem 2.3 and $D = E\{\frac{1}{‖ε_{1}‖^{2}} (I_{p} - \frac{ε_{1} ε_{1}^{T}}{‖ε_{1}‖^{2}})\}$. Then λ_max(A) ≤ E(‖ε₁‖⁻¹) and λ_max(D) ≤ E(‖ε₁‖⁻²). Furthermore, if conditions (C3) and (C4) hold, then $λ_{\max}(A) \geq \frac{1}{\sqrt{3}} E(‖ε_{1}‖^{-1}) (1 - o(1))$.

Appendix 2: Proof of main theorems

We use c or C to denote generic positive constants, which may vary from line to line.

Proof of Theorem 2.2

Let $S_{n}^{2} = Var(T_{n}) = \frac{n(n - 1)}{2} Tr(B^{2}) = \frac{n(n - 1)}{2} E\{(Z_{1}^{T} Z_{2})^{2}\}$. Let $Y_{i} = \sum_{j = 1}^{i - 1} Z_{i}^{T} Z_{j}$ and $V_{n}^{2} = \sum_{i = 2}^{n} E(Y_{i}^{2} | Z_{1}, \dots, Z_{i - 1})$. To apply the martingale central limit theorem (Hall and Heyde, 1980), it is sufficient to check two conditions:

S_{n}^{-4} \sum_{i = 2}^{n} E(Y_{i}^{4}) \to 0 \text{ as } n, p \to \infty,

(A.1)

S_{n}^{-2} V_{n}^{2} \to 1 \text{ in probability as } n, p \to \infty.

(A.2)

To check (A.1), note that under H₀,

\begin{array}{l} E(Y_{i}^{4}) = E\{(\sum_{j = 1}^{i - 1} Z_{i}^{T} Z_{j})^{4}\} = \sum_{j = 1}^{i - 1} E\{(Z_{i}^{T} Z_{j})^{4}\} + 3 \sum_{1 \leq j \neq k \leq i - 1} E\{(Z_{i}^{T} Z_{j})^{2} (Z_{i}^{T} Z_{k})^{2}\} \\ = (i - 1) E\{(Z_{1}^{T} Z_{2})^{4}\} + 3 (i - 1)(i - 2) E\{(Z_{1}^{T} Z_{2})^{2} (Z_{1}^{T} Z_{3})^{2}\}. \end{array}

Hence, $\sum_{i = 2}^{n} E(Y_{i}^{4}) \leq c [n^{2} E\{(Z_{1}^{T} Z_{2})^{4}\} + n^{3} E\{(Z_{1}^{T} Z_{2})^{2} (Z_{1}^{T} Z_{3})^{2}\}] \leq c n^{3} E\{(Z_{1}^{T} Z_{2})^{4}\}$ by Hölder’s inequality. By Lemma 2.1, we have $E\{(Z_{1}^{T} Z_{2})^{4}\} = o(n E^{2}\{(Z_{1}^{T} Z_{2})^{2}\})$. Since $S_{n}^{4} = \{\frac{n(n - 1)}{2}\}^{2} E^{2}\{(Z_{1}^{T} Z_{2})^{2}\}$ is of order $n^{4} E^{2}\{(Z_{1}^{T} Z_{2})^{2}\}$, (A.1) holds.

To prove (A.2), it is sufficient to verify that $\frac{E\{(V_{n}^{2} - S_{n}^{2})^{2}\}}{S_{n}^{4}} \to 0$ as n, p → ∞. We write $V_{n}^{2} = \sum_{i = 2}^{n} V_{n i}$, where $V_{n i} = E(Y_{i}^{2} | Z_{1}, \dots, Z_{i - 1})$. We have

\begin{array}{l} V_{n i} = \sum_{j = 1}^{i - 1} \sum_{k = 1}^{i - 1} E(Z_{i}^{T} Z_{j} Z_{i}^{T} Z_{k} | Z_{1}, \dots, Z_{i - 1}) = \sum_{j = 1}^{i - 1} \sum_{k = 1}^{i - 1} Tr(Z_{j} Z_{k}^{T} B) = \sum_{j = 1}^{i - 1} \sum_{k = 1}^{i - 1} Z_{j}^{T} B Z_{k} \\ = 2 \sum_{1 \leq j < k \leq i - 1} Z_{j}^{T} B Z_{k} + \sum_{j = 1}^{i - 1} Z_{j}^{T} B Z_{j}. \end{array}

If j₁ ≤ k₁ and j₂ ≤ k₂, then

\begin{array}{l} E(Z_{j_{1}}^{T} B Z_{k_{1}} Z_{j_{2}}^{T} B Z_{k_{2}}) = E\{(Z_{1}^{T} B Z_{1})^{2}\} I\{j_{1} = k_{1} = j_{2} = k_{2}\} + E^{2}(Z_{1}^{T} B Z_{1}) I\{j_{1} = k_{1} \neq j_{2} = k_{2}\} \\ + E\{(Z_{1}^{T} B Z_{2})^{2}\} I\{j_{1} = j_{2}, k_{1} = k_{2}, j_{1} < k_{1}\}. \end{array}

Therefore, for i₁ < i₂,

\begin{array}{l} E(V_{n i_{1}} V_{n i_{2}}) \\ = 4 \sum_{1 \leq j < k \leq i_{1} - 1} E\{(Z_{1}^{T} B Z_{2})^{2}\} + \sum_{j = 1}^{i_{1} - 1} \sum_{k = 1}^{i_{2} - 1} E^{2}(Z_{1}^{T} B Z_{1}) + \sum_{j = 1}^{i_{1} - 1} [E\{(Z_{1}^{T} B Z_{1})^{2}\} - E^{2}(Z_{1}^{T} B Z_{1})] \\ = 2 (i_{1} - 1)(i_{1} - 2) E\{(Z_{1}^{T} B Z_{2})^{2}\} + (i_{1} - 1)(i_{2} - 1) E^{2}(Z_{1}^{T} B Z_{1}) + (i_{1} - 1) Var(Z_{1}^{T} B Z_{1}). \end{array}

Consequently,

\begin{array}{l} E(V_{n}^{4}) = E\{(\sum_{i = 2}^{n} V_{n i})^{2}\} = 2 \sum_{2 \leq i < j \leq n} E(V_{n i} V_{n j}) + \sum_{i = 2}^{n} E(V_{n i}^{2}) \\ = 2 \sum_{i = 2}^{n} (i - 1)(i - 2)(2 n - 2 i + 1) E\{(Z_{1}^{T} B Z_{2})^{2}\} + \sum_{i = 2}^{n} (i - 1)(2 n - 2 i + 1) Var(Z_{1}^{T} B Z_{1}) \\ + \{n (n - 1) E(Z_{1}^{T} B Z_{1}) / 2\}^{2}. \end{array}

Note that $E(Z_{1}^{T} B Z_{1}) = Tr(B^{2})$ and $S_{n}^{2} = \frac{n(n - 1)}{2} Tr(B^{2})$. Hence,

\begin{array}{l} E\{(V_{n}^{2} - S_{n}^{2})^{2}\} = E(V_{n}^{4}) - S_{n}^{4} \\ = 2 \sum_{i = 2}^{n} (i - 1)(i - 2)(2 n - 2 i + 1) E\{(Z_{1}^{T} B Z_{2})^{2}\} + \sum_{i = 2}^{n} (i - 1)(2 n - 2 i + 1) Var(Z_{1}^{T} B Z_{1}) \\ \leq c [n^{4} E\{(Z_{1}^{T} B Z_{2})^{2}\} + n^{3} E\{(Z_{1}^{T} B Z_{1})^{2}\}]. \end{array}

Hence, a sufficient condition for $S_{n}^{-4} E\{(V_{n}^{2} - S_{n}^{2})^{2}\} \to 0$ is $\frac{n^{4} E\{(Z_{1}^{T} B Z_{2})^{2}\} + n^{3} E\{(Z_{1}^{T} B Z_{1})^{2}\}}{n^{4} E^{2}(Z_{1}^{T} B Z_{1})} \to 0$. This condition holds by Lemma 2.1. This finishes the proof of Theorem 2.2. □
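To make the objects in this proof concrete, the sketch below computes the statistic $T_{n} = \sum_{j < i} Z_{i}^{T} Z_{j}$ from the spatial signs Z_i = X_i/‖X_i‖, together with a standardized version. The variance S_n² = {n(n − 1)/2} E{(Z₁ᵀZ₂)²} is replaced by a simple plug-in, the sum of squared pairwise inner products; this plug-in is an assumption made for illustration and is not necessarily the estimator of Tr(B²) used in the paper. The function names are hypothetical.

```python
import math


def spatial_sign(x):
    """Spatial sign x / ||x|| of a nonzero observation."""
    length = math.sqrt(sum(v * v for v in x))
    return [v / length for v in x]


def sign_statistic(sample):
    """Return (T_n, standardized T_n) for a list of p-dimensional observations.

    T_n sums Z_i^T Z_j over pairs j < i. The standardization divides by the
    square root of the sum of squared inner products, a plug-in for
    n(n-1)/2 * E{(Z_1^T Z_2)^2} (illustrative assumption).
    """
    z = [spatial_sign(x) for x in sample]
    t_n = 0.0
    sum_sq = 0.0
    for i in range(len(z)):
        for j in range(i):
            inner = sum(a * b for a, b in zip(z[i], z[j]))
            t_n += inner
            sum_sq += inner * inner
    return t_n, t_n / math.sqrt(sum_sq)


# Tiny illustrative sample: three observations pointing in the same direction.
t_n, standardized = sign_statistic([[1.0, 0.0], [2.0, 0.0], [5.0, 0.0]])
```

Because only the directions X_i/‖X_i‖ enter the statistic, rescaling any observation leaves T_n unchanged, which is the source of the robustness to heavy tails discussed in the simulations.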

Proof of Theorem 2.3

Under the local alternatives,

\begin{array}{l} T_{n} = \underset{j < i}{\sum_{i = 1}^{n} \sum_{j = 1}^{n}} \{\frac{ε_{i}}{‖ε_{i}‖} + (\frac{ε_{i} + μ}{‖ε_{i} + μ‖} - \frac{ε_{i}}{‖ε_{i}‖})\}^{T} \{\frac{ε_{j}}{‖ε_{j}‖} + (\frac{ε_{j} + μ}{‖ε_{j} + μ‖} - \frac{ε_{j}}{‖ε_{j}‖})\} \\ = T_{n 1} + T_{n 2} + T_{n 3}, \end{array}

where $T_{n 1} = \underset{j < i}{\sum_{i = 1}^{n} \sum_{j = 1}^{n}} \frac{ε_{i}^{T} ε_{j}}{‖ε_{i}‖ ‖ε_{j}‖}$, $T_{n 2} = \underset{j \neq i}{\sum_{i = 1}^{n} \sum_{j = 1}^{n}} (\frac{ε_{i} + μ}{‖ε_{i} + μ‖} - \frac{ε_{i}}{‖ε_{i}‖})^{T} \frac{ε_{j}}{‖ε_{j}‖}$, and $T_{n 3} = \underset{j < i}{\sum_{i = 1}^{n} \sum_{j = 1}^{n}} (\frac{ε_{i} + μ}{‖ε_{i} + μ‖} - \frac{ε_{i}}{‖ε_{i}‖})^{T} (\frac{ε_{j} + μ}{‖ε_{j} + μ‖} - \frac{ε_{j}}{‖ε_{j}‖})$. By Theorem 2.2, $T_{n 1} / \sqrt{\frac{n(n - 1)}{2} Tr(B^{2})} \to N(0, 1)$ in distribution.

To analyze $T_{n 2}$, we write $T_{n 2} = T_{n 21} + T_{n 22}$, where $T_{n 21} = \sum_{i < j} (\frac{ε_{i} + μ}{‖ε_{i} + μ‖} - \frac{ε_{i}}{‖ε_{i}‖})^{T} \frac{ε_{j}}{‖ε_{j}‖}$ and $T_{n 22} = \sum_{j < i} (\frac{ε_{i} + μ}{‖ε_{i} + μ‖} - \frac{ε_{i}}{‖ε_{i}‖})^{T} \frac{ε_{j}}{‖ε_{j}‖}$. Note that $E(T_{n 21}) = 0$, and

\begin{array}{l} E(T_{n 21}^{2}) = \sum_{i_{1} < j_{1}} \sum_{i_{2} < j_{2}} E\{(\frac{ε_{i_{1}} + μ}{‖ε_{i_{1}} + μ‖} - \frac{ε_{i_{1}}}{‖ε_{i_{1}}‖})^{T} \frac{ε_{j_{1}}}{‖ε_{j_{1}}‖} \frac{ε_{j_{2}}^{T}}{‖ε_{j_{2}}‖} (\frac{ε_{i_{2}} + μ}{‖ε_{i_{2}} + μ‖} - \frac{ε_{i_{2}}}{‖ε_{i_{2}}‖})\} \\ = \sum_{i_{1} < j} \sum_{i_{2} < j} E\{(\frac{ε_{i_{1}} + μ}{‖ε_{i_{1}} + μ‖} - \frac{ε_{i_{1}}}{‖ε_{i_{1}}‖})^{T} B (\frac{ε_{i_{2}} + μ}{‖ε_{i_{2}} + μ‖} - \frac{ε_{i_{2}}}{‖ε_{i_{2}}‖})\} \\ \leq λ_{\max}(B) \sum_{i_{1} < j} \sum_{i_{2} < j} E\{\frac{4 ‖μ‖^{2}}{‖ε_{i_{1}}‖ ‖ε_{i_{2}}‖}\} \\ \leq 8 \frac{λ_{\max}(Σ)}{Tr(Σ)} (1 + o(1)) ‖μ‖^{2} \{\sum_{i < j} E(‖ε_{i}‖^{-2}) + \sum_{i_{1} < j} \sum_{i_{2} < j} E^{2}(‖ε_{i_{1}}‖^{-1})\} \\ \leq O(n^{3} ‖μ‖^{2}) \frac{λ_{\max}(Σ)}{Tr(Σ)} E(‖ε‖^{-2}), \end{array}

where the first inequality uses Lemma A.3, and the second inequality uses Lemma A.4. In the derivation in Lemma 2.1, we derived that $Tr(B^{2}) \geq \frac{4}{9} \frac{Tr(Σ^{2})}{Tr^{2}(Σ)} (1 - o(1))$. Hence, it follows by condition (C5) that

\frac{E(T_{n 21}^{2})}{\frac{n(n - 1)}{2} Tr(B^{2})} \leq O(n ‖μ‖^{2}) \frac{λ_{\max}(Σ) Tr(Σ)}{Tr(Σ^{2})} E(‖ε‖^{-2}) = o(1).

This implies $T_{n 21} / \sqrt{\frac{n(n - 1)}{2} Tr(B^{2})} = o_{p}(1)$. Similarly, $T_{n 22} / \sqrt{\frac{n(n - 1)}{2} Tr(B^{2})} = o_{p}(1)$.

Finally, we analyze $T_{n 3}$. Denote

\begin{array}{l} T_{n 31} = \frac{n(n - 1)}{2} E(\frac{ε_{1} + μ}{‖ε_{1} + μ‖})^{T} E(\frac{ε_{2} + μ}{‖ε_{2} + μ‖}), \\ T_{n 32} = \sum_{j \neq i} E(\frac{ε_{i} + μ}{‖ε_{i} + μ‖})^{T} \{\frac{ε_{j} + μ}{‖ε_{j} + μ‖} - \frac{ε_{j}}{‖ε_{j}‖} - E(\frac{ε_{j} + μ}{‖ε_{j} + μ‖})\}, \\ T_{n 33} = \sum_{j < i} \{\frac{ε_{i} + μ}{‖ε_{i} + μ‖} - \frac{ε_{i}}{‖ε_{i}‖} - E(\frac{ε_{i} + μ}{‖ε_{i} + μ‖})\}^{T} \{\frac{ε_{j} + μ}{‖ε_{j} + μ‖} - \frac{ε_{j}}{‖ε_{j}‖} - E(\frac{ε_{j} + μ}{‖ε_{j} + μ‖})\}. \end{array}

Then it follows that

\begin{array}{l} T_{n 3} = \underset{j < i}{\sum_{i = 1}^{n} \sum_{j = 1}^{n}} [E(\frac{ε_{i} + μ}{‖ε_{i} + μ‖}) + \{\frac{ε_{i} + μ}{‖ε_{i} + μ‖} - \frac{ε_{i}}{‖ε_{i}‖} - E(\frac{ε_{i} + μ}{‖ε_{i} + μ‖})\}]^{T} \\ \times [E(\frac{ε_{j} + μ}{‖ε_{j} + μ‖}) + \{\frac{ε_{j} + μ}{‖ε_{j} + μ‖} - \frac{ε_{j}}{‖ε_{j}‖} - E(\frac{ε_{j} + μ}{‖ε_{j} + μ‖})\}] \\ = T_{n 31} + T_{n 32} + T_{n 33}. \end{array}

To analyze $T_{n 31}$, by Lemma A.3 (2), we can write $E(\frac{ε_{1} + μ}{‖ε_{1} + μ‖}) = - A μ + E(Q_{1})$, where $Q_{1} = \frac{ε_{1} + μ}{‖ε_{1} + μ‖} - \frac{ε_{1}}{‖ε_{1}‖} + \frac{1}{‖ε_{1}‖} (I_{p} - \frac{ε_{1} ε_{1}^{T}}{‖ε_{1}‖^{2}}) μ$ satisfies $E(‖Q_{1}‖^{2}) \leq c_{3} ‖μ‖^{2 + 2 δ} E(‖ε_{1}‖^{-2 - 2 δ})$ for all 0 < δ < 1, where c₃ is a constant that does not depend on ε₁ or μ. Hence,

E(\frac{ε_{1} + μ}{‖ε_{1} + μ‖})^{T} E(\frac{ε_{2} + μ}{‖ε_{2} + μ‖}) = μ^{T} A^{2} μ - μ^{T} A E(Q_{1}) - μ^{T} A E(Q_{2}) + E(Q_{1})^{T} E(Q_{2}).

Note that by Lemma A.5 and condition (C6), the last three terms on the right-hand side of the above expression are bounded by

\begin{array}{l} 2 c_{3}^{1 / 2} {‖ μ ‖}^{1 + δ} ‖ μ^{T} A ‖ \cdot E^{1 / 2} ({‖ ε_{1} ‖}^{- 2 - 2 δ}) + c_{3} {‖ μ ‖}^{2 + 2 δ} E ({‖ ε_{1} ‖}^{- 2 - 2 δ}) \\ \leq c {‖ μ ‖}^{2} o (E^{2} ({‖ ε_{1} ‖}^{- 1})) = o (μ^{T} A^{2} μ) . \end{array}

Therefore, $T_{n 31} = \frac{n (n - 1)}{2} μ^{T} A^{2} μ (1 + o (1))$ .

To evaluate $T_{n 32}$, we observe that $E(T_{n 32}) = 0$ and that

\begin{array}{l} E(T_{n 32}^{2}) = O(n^{3}) E(\frac{ε_{2} + μ}{‖ε_{2} + μ‖})^{T} E\{(\frac{ε_{1} + μ}{‖ε_{1} + μ‖} - \frac{ε_{1}}{‖ε_{1}‖} - E(\frac{ε_{1} + μ}{‖ε_{1} + μ‖})) \\ \times (\frac{ε_{1} + μ}{‖ε_{1} + μ‖} - \frac{ε_{1}}{‖ε_{1}‖} - E(\frac{ε_{1} + μ}{‖ε_{1} + μ‖}))^{T}\} E(\frac{ε_{3} + μ}{‖ε_{3} + μ‖}). \end{array}

Note that by Lemma A.3 (2),

\frac{ε_{1} + μ}{‖ε_{1} + μ‖} - \frac{ε_{1}}{‖ε_{1}‖} - E(\frac{ε_{1} + μ}{‖ε_{1} + μ‖}) = - \{\frac{1}{‖ε_{1}‖} (I_{p} - \frac{ε_{1} ε_{1}^{T}}{‖ε_{1}‖^{2}}) - A\} μ + (Q_{1} - E(Q_{1})).

Applying the above decomposition, we obtain

\begin{array}{l} λ_{\max} [E\{(\frac{ε_{1} + μ}{‖ε_{1} + μ‖} - \frac{ε_{1}}{‖ε_{1}‖} - E(\frac{ε_{1} + μ}{‖ε_{1} + μ‖})) (\frac{ε_{1} + μ}{‖ε_{1} + μ‖} - \frac{ε_{1}}{‖ε_{1}‖} - E(\frac{ε_{1} + μ}{‖ε_{1} + μ‖}))^{T}\}] \\ \leq 2 ‖μ‖^{2} λ_{\max}(D) + C ‖μ‖^{2 + 2 δ} E(‖ε_{1}‖^{-2 - 2 δ}) \\ \leq 2 ‖μ‖^{2} E(‖ε_{1}‖^{-2}) + C ‖μ‖^{2 + 2 δ} E(‖ε_{1}‖^{-2 - 2 δ}), \end{array}

by Lemma A.5, where $D = E\{\frac{1}{‖ε_{1}‖^{2}} (I_{p} - \frac{ε_{1} ε_{1}^{T}}{‖ε_{1}‖^{2}})\}$. Therefore, by Lemma A.5, conditions (C5) and (C6), and observing that $Tr(B^{2}) \geq \frac{4 Tr(Σ^{2})}{9 Tr^{2}(Σ)} (1 - o(1))$, we have

\begin{array}{l} E(T_{n 32}^{2}) \leq O(n^{3}) \{2 ‖μ‖^{2} E(‖ε_{1}‖^{-2}) + C ‖μ‖^{2 + 2 δ} E(‖ε_{1}‖^{-2 - 2 δ})\} ‖μ‖^{2} E^{2}(‖ε_{1}‖^{-1}) \\ \leq O(n^{3}) ‖μ‖^{4} E^{2}(‖ε_{1}‖^{-2}) = o(n^{2} Tr(B^{2})). \end{array}

Therefore, $T_{n 32} / \sqrt{\frac{n (n - 1)}{2} Tr (B^{2})} = o_{p} (1)$ .

To evaluate $T_{n 33}$, we observe that $E(T_{n 33}) = 0$ and that

\begin{array}{l} E(T_{n 33}^{2}) \\ = O(n^{2}) E[\{\frac{ε_{1} + μ}{‖ε_{1} + μ‖} - \frac{ε_{1}}{‖ε_{1}‖} - E(\frac{ε_{1} + μ}{‖ε_{1} + μ‖})\}^{T} \{\frac{ε_{2} + μ}{‖ε_{2} + μ‖} - \frac{ε_{2}}{‖ε_{2}‖} - E(\frac{ε_{2} + μ}{‖ε_{2} + μ‖})\} \\ \times \{\frac{ε_{2} + μ}{‖ε_{2} + μ‖} - \frac{ε_{2}}{‖ε_{2}‖} - E(\frac{ε_{2} + μ}{‖ε_{2} + μ‖})\}^{T} \{\frac{ε_{1} + μ}{‖ε_{1} + μ‖} - \frac{ε_{1}}{‖ε_{1}‖} - E(\frac{ε_{1} + μ}{‖ε_{1} + μ‖})\}] \\ \leq O(n^{2}) \{2 ‖μ‖^{2} λ_{\max}(D) + C ‖μ‖^{2 + 2 δ} E(‖ε_{1}‖^{-2 - 2 δ})\} \\ \times E[‖\frac{ε_{1} + μ}{‖ε_{1} + μ‖} - \frac{ε_{1}}{‖ε_{1}‖} - E(\frac{ε_{1} + μ}{‖ε_{1} + μ‖})‖^{2}] \\ \leq O(n^{2}) \{2 ‖μ‖^{2} λ_{\max}(D) + C ‖μ‖^{2 + 2 δ} E(‖ε_{1}‖^{-2 - 2 δ})\}^{2} \\ \leq O(n^{2}) ‖μ‖^{4} E^{2}(‖ε_{1}‖^{-2}) = o(n^{2} Tr(B^{2})) \end{array}

by Lemma A.5 and conditions (C5) and (C6). Therefore, $T_{n 33} / \sqrt{\frac{n(n - 1)}{2} Tr(B^{2})} = o_{p}(1)$. Summarizing the above, $T_{n 3} / \sqrt{\frac{n(n - 1)}{2} Tr(B^{2})} = \frac{\frac{n(n - 1)}{2} μ^{T} A^{2} μ (1 + o(1))}{\sqrt{\frac{n(n - 1)}{2} Tr(B^{2})}} + o_{p}(1)$. This finishes the proof. □

Contributor Information

Lan Wang, Email: wangx346@umn.edu, Associate Professor, School of Statistics, University of Minnesota, Minneapolis, MN 55455.

Bo Peng, Graduate student, School of Statistics, University of Minnesota, Minneapolis, MN 55455.

Runze Li, Email: rzli@psu.edu, Distinguished Professor, Department of Statistics and the Methodology Center, the Pennsylvania State University, University Park, PA 16802-2111.

References

1.Bai Z, Sarandasa H. Effect of High Dimension: By an Example of a Two Sample Problem. Statistica Sinica. 1996;6:311–329. [Google Scholar]
2.Benjamini Yoav, Hochberg Yosef. Controlling the False Discovery Rate: a Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society, Series B. 1995;57:289?00. [Google Scholar]
3.Bickel PJ, Levina E. Covariance Regularization by Thresholding. Annals of Statistics. 2008;36:2577–2604. [Google Scholar]
4.Branco MD, Dey DK. A General Class of Multivariate Skew-elliptical Distributions. Journal of Multivariate Analysis. 2001;79:99–113. [Google Scholar]
5.Brown BM. Statistical Uses of the Spatial Median. Journal of the Royal Statistical Society, Series B. 1983;45:25–30. [Google Scholar]
6.Cai T, Liu W, Xia Y. Two-sample Test of High Dimensional Means under Dependence. Journal of the Royal Statistical Society, Series B. 2014;76:349–372. [Google Scholar]
7.Chaudhuri P. Multivariate Location Estimation Using Extension of R-estimates through U-statistics Type Approach. Annals of Statistics. 1992;20:897–916. [Google Scholar]
8.Chen SX, Qin YL. A Two-sample Test for High-dimensional Data with Application to Gene-Set Testing. Annals of Statistics. 2010;38:808–835. [Google Scholar]
9.Cook RD, Forzani L, Rothman AJ. Estimating Sufficient Reductions of the Predictors in Abundant High-dimensional Regressions. Annals of Statistics. 2012;40(353):84. [Google Scholar]
10.El Karoui N. Concentration of Measure and Spectra of Random Matrices: with Applications to Correlation Matrices, Elliptical Distributions and Beyond. The Annals of Applied Probability. 2009;19:2362–2405. [Google Scholar]
11.Fang KT, Kotz S, Ng KW. Symmetric Multivariate and Related Distributions. Chapman and Hall; London: 1990. [Google Scholar]
12.Frahm G. Ph.D. thesis. University of Cologne; Germany: 2004. Generalized Elliptical Distributions: Theory and Applications. [Google Scholar]
13.Gupta AK, Song D. Lp-norm Spherical Distributions. Journal of Statistical Planning and Inference. 1997;100:241–260. [Google Scholar]
14.Hallin M, Paindaveine D. Semiparametrically Efficient Rank-based Inference for Shape I. Optimal Rank-based Tests for Sphericity. The Annals of Statistics. 2006;34:2707–2756. [Google Scholar]
15.Hall P, Heyde C. Martingale Limit Theory and Applications. Academic Press; New York: 1980. [Google Scholar]
16.Hall P, Jin J. Innovated Higher Criticism for Detecting Sparse Signals in Correlated Noise. The Annals of Statistics. 2010;38:1686–1732. [Google Scholar]
17.Hyvärinen A, Karhunen J, Oja E. Independent Component Analysis. John Wiley & Sons; New York: 2001. [Google Scholar]
18.Ilmonen P, Paindaveine D. Semiparametrically Efficient Inference based on Signed Ranks in Symmetric Independent Component Models. Annals of Statistics. 2011;39:2448–2476. [Google Scholar]
19.Kring S, Rachev ST, Hchsttter M, Fabozzi FJ, Bianchi ML. Multitail Generalized Elliptical Distributions for Asset Returns. The Econometrics Journal. 2009;12(272):91. [Google Scholar]
20.Lee SH, Limb J, Li E, Vannuccid M, Petkova E. Order test for high-dimensional two-sample means. Journal of Statistical Planning and Inference using random subspaces. 2012;142:2719–2725. [Google Scholar]
21.Ledoux M. The Concentration of Measure Phenomenon. American Mathematical Society; Providence, Rhode Island: 2001. [Google Scholar]
22.Mcneil AJ, Frey R, Embrechts P. Quantitative Risk Management: Concepts, Techniques and Tools. Princeton University Press; Princeton, NJ: 2005. [Google Scholar]
23.Möttönen J, Oja H. Multivariate Spatial Sign and Rank Methods. Journal of Nonparametric Statistics. 1995;5:201–213. [Google Scholar]
24.Möttönen J, Oja H, Tienari J. On the Efficiency of Multivariate Spatial Sign and Rank Tests. Annals of Statistics. 1997;25:542–552. [Google Scholar]
25.Oja H. Multivariate Nonparametric Methods with R. Springer; 2010. [Google Scholar]
26.Paindaveine D, Verdebout T. Universal Asymptotics for High-dimensional Sign Tests. Université libre de Bruxelles; 2013. (technical report). [Google Scholar]
27.Pan GM, Zhou W. Central Limit Theorem for Hotelling’s T² Statistic under Large Dimension. Annals of Applied Probability. 2011;21:1860–1910. [Google Scholar]
28.Purdom E, Holmes SP. Error Distribution for Gene Expression Data. Statistical Applications in Genetics and Molecular Biology. 2005;4 doi: 10.2202/1544-6115.1070. Article 16. [DOI] [PubMed] [Google Scholar]
29.Rachev ST, Kim YS, Bianchi ML, Fabozzi FJ. Financial Models with Lévy Processes and Volatility Clustering. John Wiley & Sons; Hoboken, NJ, USA: 2011. Multi-Tail t-Distribution. [Google Scholar]
30.Schmidt R. Tail Dependence for Elliptically Contoured Distributions. Mathematical Methods of Operations Research. 2002;55:301–327. [Google Scholar]
31.Srivastava MS. A Test for the Mean Vector with Fewer Observations than the Dimension under Non-normality. Journal of Multivariate Analysis. 2009;100:518–532. [Google Scholar]
32.Srivastava MS, Du M. A Test for the Mean Vector with Fewer Observations than the Dimension. Journal of Multivariate Analysis. 2008;99:386–402. [Google Scholar]
33.Srivastava MS, Katayama S, Kano Y. A two sample test in high dimensional data. Journal of Multivariate Analysis. 2013;114:349–358. [Google Scholar]
34.Szabowski PJ. Uniform Distributions on Spheres in Finite-dimensional Lα and Their Generalization. Journal of Multivariate Analysis. 1998;64:103–117. [Google Scholar]
35.Thulin M. A high-dimensional two-sample test for the mean using random subspaces. Computational Statistics & Data Analysis. 2014;74:26–38. [Google Scholar]
36.Vo T, Phan J, Huynh K, Wang M. Engineering in Medicine and Biology Society, 2007. EMBS 2007. 29th Annual International Conference of the IEEE. 2007. Reproducibility of Differential Gene Detection across Multiple Microarray Studies; p. 4231–4234. [DOI] [PubMed] [Google Scholar]
37.Wu X, Wang J, Cui X, Maianu L, et al. The Effect of Insulin on Expression of Genes and Biochemical Pathways in Human Skeletal Muscle. Endocrine. 2007;31:5–17. doi: 10.1007/s12020-007-0007-x. [DOI] [PubMed] [Google Scholar]
38.Zhong PS, Chen SX. Tests for High Dimensional Regression Coefficients with Factorial Designs. Journal of the American Statistical Association. 2011;106:260–274. [Google Scholar]
39.Zhong PS, Chen SX, Xu MY. Tests Alternative to Higher Criticism for High Dimensional Means under Sparsity and Column-wise Dependence. The Annals of Statistics. 2013;41:2703–3110. [Google Scholar]
40.Zou CL, Peng LH, Wang ZJ. Multivariate Sign-based High-dimensional Tests for Sphericity. Biometrika. 2014;101:229–236. [Google Scholar]

Associated Data

Supplementary Materials

NIHMS651394-supplement-Supplementary_Material.pdf (301KB, pdf)
