Published in final edited form as: J Am Stat Assoc. 2022 Dec 12;119(545):744–756. doi: 10.1080/01621459.2022.2142592

Projection Test for Mean Vector in High Dimensions

Wanjun Liu 1, Xiufan Yu 2, Wei Zhong 3, Runze Li 4

Abstract

This paper studies the projection test for high-dimensional mean vectors via the optimal projection. The idea of the projection test is to project high-dimensional data onto a space of low dimension so that traditional methods can be applied. We first propose a new estimator of the optimal projection direction, obtained by solving a constrained and regularized quadratic programming problem. Two tests are then constructed using the estimated optimal projection direction. The first is based on a data-splitting procedure, which yields an exact t-test under the normality assumption. To mitigate the power loss due to data splitting, we further propose an online framework, which iteratively updates the estimate of the projection direction as new observations arrive. We show that the test statistic of this online-style projection test converges in distribution to the standard normal. Various simulation studies as well as a real data example show that the proposed online-style projection test controls the type I error rate well and is more powerful than other existing tests.

Keywords: Data splitting, one-sample mean problem, online-style estimation, power enhancement, regularization methods

1. Introduction

Testing whether a population mean equals a known vector, or whether the means of two populations are equal, is a fundamental problem in high-dimensional statistics with wide applications. Such tests are commonly encountered in genome-wide association studies. For instance, Chen and Qin (2010) performed a test to identify sets of genes responsible for certain types of tumors in genetics research. Xu et al. (2016) applied various tests to a bipolar disorder dataset collected by the Wellcome Trust Case Control Consortium (2007), in which one would like to test whether there is any association between a disease and a large number of genetic variants. In these applications, the dimension $p$ of the collected data is often much larger than the sample size $n$. Suppose that $x_1, \ldots, x_n$ is a random sample from a $p$-dimensional population with mean $\mu$ and positive definite covariance matrix $\Sigma$. Of interest is to test the following hypothesis

$$H_0: \mu = \mu_0 \quad \text{versus} \quad H_1: \mu \neq \mu_0, \tag{1}$$

for some known vector $\mu_0$. This is typically referred to as the one-sample hypothesis testing problem in multivariate analysis and has been extensively studied when $p$ is fixed. When $p \ge n$, the well-known Hotelling's $T^2$ test (Hotelling 1931) is not directly applicable, as the sample covariance matrix is not invertible. Even in the case $n > p$, the Hotelling's $T^2$ test may suffer from low power against the alternative if $p/n \to c \in (0,1)$ for some $c$ that is close to 1 (Bai and Saranadasa 1996).

More recently, various tests have been proposed for high-dimensional mean vectors; they can be roughly classified into three types (Huang et al. 2022). The first type is known as quadratic-form tests, which can be viewed as modified Hotelling's $T^2$ tests. Quadratic-form tests substitute an invertible diagonal matrix for the singular sample covariance matrix, leading to a quadratic form of the sample mean. For example, Bai and Saranadasa (1996) replaced the sample covariance matrix by the identity matrix $I$ and showed that the resulting statistic has an asymptotic normal null distribution if $p/n \to c > 0$. Quadratic-form tests are known to be more powerful against dense alternatives. However, they tend to ignore the dependence among the $p$ dimensions and lose testing power when correlations among the variables are high. The second type is known as extreme-type tests. An extreme-type test usually makes a decision based on the maximum value of a sequence of candidate test statistics. For example, to test a high-dimensional mean vector, one can construct a marginal test for each dimension and then use the maximum of the $p$ marginal test statistics as the final test statistic (Cai et al. 2014, Chang et al. 2017). Another example is the sum-of-powers test, which takes the form of a sum of $p$ marginal statistics with the power index leading to the most extreme p-value (Xu et al. 2016). Though extreme-type tests are typically more powerful against sparse alternatives (Cai et al. 2014), they generally require stringent conditions to derive the limiting null distribution and are likely to suffer from size distortions due to slow convergence rates. The third type is the projection test. The idea of the projection test is to project the high-dimensional observations onto a space of low dimension, after which traditional methods such as Hotelling's $T^2$ can be directly applied. A primary task of the projection test is to find a good projection matrix (or vector) such that the resulting test achieves high power. A random projection test was proposed in Lopes et al. (2011), where the entries of the projection matrix are randomly drawn from the standard normal distribution. Instead of random projection, data-driven approaches have also been studied to construct the projection matrix. Lauter (1996) proposed a test that projects data onto a 1-dimensional space by some weight vector which depends on the data only through $X^\top X$. Recently, Huang (2015) studied the optimal projection such that the resulting projection test has the highest power among all possible projection matrices. It has been shown that one should project the data onto a 1-dimensional space to achieve the highest power and that the optimal projection is of the form $\Sigma^{-1}\mu$. Li and Li (2021) further extended the idea of the optimal projection to hypothesis testing in high-dimensional linear models.

Optimal-direction-based projection tests have not been systematically studied yet, and a few open questions remain to be addressed. The first question is how to obtain a good estimate of the optimal projection with statistical guarantees. A naive ridge-type estimator was proposed in Huang (2015), but little is known about its theoretical properties. The power analysis of that test relies on the assumption that the ridge-type estimator is consistent, which is not necessarily true in the high-dimensional setting. The second question is how to construct a powerful test based on the estimated optimal projection. Notice that the optimal projection $\Sigma^{-1}\mu$ is typically unknown and needs to be estimated from observed data. Huang (2015) proposed a data-splitting strategy which utilizes half of the observations to estimate the optimal projection and uses the remaining half to test the hypothesis. The major advantage of the data-splitting strategy is that the estimated projection direction is independent of the remaining data, so an exact test can be easily constructed. However, there is substantial power loss due to data splitting, as only half of the data is used to perform the test. This paper aims to fill these gaps by constructing an online-style projection test via the optimal projection. First, we propose a new estimator of the optimal projection direction, obtained by solving a constrained and regularized quadratic programming problem. Under the assumption that the optimal projection is sparse, we show that any stationary point of the quadratic programming problem is a consistent estimator. In other words, we do not need to find the global solution to the possibly nonconvex optimization problem, as any stationary point enjoys the desired statistical properties. Second, we propose an online framework to boost testing power, which originates from online learning algorithms for streaming data. The main idea is to recursively update the estimate when new observations arrive. In particular, we first obtain an initial estimator of the optimal projection from a subset of the data. We then recursively project each incoming observation onto a 1-dimensional space and update the estimate to include the newly arrived observation. We repeat this procedure until the arrival of the last observation and then perform a test based on the projected sample.

The rest of this paper is organized as follows. In Section 2, we propose a new estimator of the optimal projection via constrained and regularized quadratic programming and derive non-asymptotic $L_1$ and $L_2$ error bounds. Such error bounds hold for all stationary points. In Section 3, we propose the online-style projection test and study its asymptotic limiting distributions. In Section 4, numerical studies are conducted to examine the finite sample performance of different tests, with application to a real data example. We conclude this paper with some discussion in Section 5. All technical proofs and additional simulation results are given in the supplemental material.

2. Optimal Direction Based Projection Tests

2.1. Preliminary

Without loss of generality, we assume $\mu_0 = 0$ and the one-sample problem (1) becomes

$$H_0: \mu = 0 \quad \text{versus} \quad H_1: \mu \neq 0. \tag{2}$$

Given a random sample $x_1, \ldots, x_n$ with mean $\mu$ and covariance matrix $\Sigma$, let $\bar x$ and $\hat\Sigma$ be the sample mean and the sample covariance matrix, respectively,

$$\bar x = \frac{1}{n}\sum_{i=1}^n x_i, \qquad \hat\Sigma = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar x)(x_i - \bar x)^\top. \tag{3}$$

When $n > p$, the Hotelling's $T^2$ statistic for problem (2) is $T^2 = n\bar x^\top \hat\Sigma^{-1}\bar x$. If $x_1, \ldots, x_n$ are independent and identically distributed (i.i.d.) normal random vectors, then $(n-p)/\{(n-1)p\}\,T^2 \sim F_{p, n-p}$ under $H_0$, where $F_{p,n-p}$ denotes the $F$ distribution with degrees of freedom $p$ and $n-p$. The Hotelling's $T^2$ test requires the sample covariance matrix $\hat\Sigma$ to be invertible and cannot be directly applied when $n \le p$. Beyond the singularity of $\hat\Sigma$, it has been observed that the power of the Hotelling's $T^2$ test can be adversely affected even when $p < n$ if $\hat\Sigma$ is nearly singular (Bai and Saranadasa 1996).

Projection tests are proposed for high-dimensional data. The basic idea of projection tests is to reduce the dimension by projecting the high-dimensional vector $x_i$ onto a space of lower dimension, after which traditional methods such as Hotelling's $T^2$ can be carried out. Let $P$ be a $p \times q$ projection matrix with $q \ll n$. We project the $p$-dimensional vector $x_i$ to a $q$-dimensional space via $y_i = P^\top x_i$, $i = 1, \ldots, n$. Conditional on the projection matrix $P$, the projected sample $y_1, \ldots, y_n \in \mathbb{R}^q$ is independent and identically distributed with mean $P^\top\mu$ and covariance matrix $P^\top\Sigma P$. Then the original problem (2) can be reformulated as

$$H_0': P^\top\mu = 0 \quad \text{versus} \quad H_1': P^\top\mu \neq 0. \tag{4}$$

Note that the hypothesis in (4) is not necessarily equivalent to that in (2). If $H_0'$ is rejected, then $H_0$ is rejected as well, but not vice versa. Given the projection matrix $P$, we construct the Hotelling's $T^2$ statistic, denoted by $T_P^2$, based on the projected sample $P^\top x_1, \ldots, P^\top x_n$,

$$T_P^2 = n\,\bar x^\top P\,(P^\top\hat\Sigma P)^{-1} P^\top \bar x.$$

A key question is how to choose the dimension $q$ and the projection matrix $P$. Huang (2015) proved that under the normality assumption, the optimal projection direction is proportional to $\Sigma^{-1}\mu$ in the sense that the power of $T_P^2$ is maximized. Without the normality assumption, the direction $\Sigma^{-1}\mu$ is asymptotically optimal provided that the sample mean is asymptotically normal. Throughout this paper, it is assumed that $x_1, \ldots, x_n$ are i.i.d. sub-Gaussian, and hence $\Sigma^{-1}\mu$ is asymptotically the optimal direction.
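To make the projected statistic concrete, here is a minimal numpy sketch (our own illustration, not the authors' software) of $T_P^2$ for a given $p \times q$ projection matrix $P$:

```python
# Minimal sketch of the projected Hotelling T^2 statistic; the function name
# `projected_hotelling_t2` is ours, for illustration only.
import numpy as np

def projected_hotelling_t2(X, P):
    """T_P^2 = n (P'xbar)' (P'Sigma_hat P)^{-1} (P'xbar) for an n x p sample X."""
    n = X.shape[0]
    Y = X @ P                                   # projected sample: y_i = P'x_i
    ybar = Y.mean(axis=0)
    S = np.atleast_2d(np.cov(Y, rowvar=False))  # q x q projected sample covariance
    return n * ybar @ np.linalg.solve(S, ybar)
```

Conditional on $P$ and under normality, $(n-q)/\{(n-1)q\}\,T_P^2$ follows the $F_{q,n-q}$ distribution, so the usual low-dimensional $F$ calibration applies regardless of $p$.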

Define $\theta = \Sigma^{-1}\mu$. With the choice $P = \theta$, the hypothesis becomes

$$H_0: \theta^\top\mu = 0 \quad \text{versus} \quad H_1: \theta^\top\mu \neq 0, \tag{5}$$

which is equivalent to the original hypothesis (2), since $\theta^\top\mu = \mu^\top\Sigma^{-1}\mu = 0$ implies $\mu = 0$. This optimal projection direction depends on unknown parameters and needs to be estimated from data. In order to control the type I error, Huang (2015) proposed an exact t-test via data splitting. The entire dataset is randomly partitioned into two disjoint sets: $\mathcal{D}_1 = \{x_1, \ldots, x_{n_1}\}$ and $\mathcal{D}_2 = \{x_{n_1+1}, \ldots, x_n\}$. The first dataset $\mathcal{D}_1$ is used to estimate $\theta$ and the second dataset $\mathcal{D}_2$ is used to construct the test statistic. To estimate $\theta$, a ridge-type estimator $\hat\theta = (\hat\Sigma_1 + \lambda D_1)^{-1}\bar x_1$ is used in Huang (2015), where $\bar x_1$ and $\hat\Sigma_1$ are the sample mean and the sample covariance estimated from $\mathcal{D}_1$ and $D_1 = \mathrm{diag}(\hat\Sigma_1)$. Define $y_i = \hat\theta^\top x_i$, $i = n_1+1, \ldots, n$; the test statistic is then $T_{\hat\theta} = \sqrt{n_2}\,\bar y/s_y$, where $n_2 = n - n_1$ and $\bar y$ and $s_y$ are the sample mean and standard deviation based on $y_{n_1+1}, \ldots, y_n$. The advantage of data splitting is that $T_{\hat\theta}$ yields an exact test under the normality assumption. The power analysis of $T_{\hat\theta}$ relies on the consistency of the ridge-type estimator $\hat\theta$, which no longer holds in the high-dimensional regime. In addition, this data-splitting approach suffers from power loss since only a subset of the data is used to perform the test. In this paper, we propose a different approach to estimating the optimal projection direction such that the resulting estimator is consistent. Furthermore, we propose an online framework to perform the test so that more observations are utilized to boost the power.

2.2. Estimation of optimal projection direction

We first introduce some notation. For a vector $v = (v_j)_{j=1}^p \in \mathbb{R}^p$, let $\|v\|_k$ denote its $L_k$ norm, $k = 1, 2$. Its $L_0$ norm $\|v\|_0$ is the number of nonzero entries of $v$, and its $L_\infty$ norm is $\|v\|_\infty = \max_j |v_j|$. For a matrix $M = (m_{ij})_{i,j=1}^p \in \mathbb{R}^{p\times p}$, its entrywise $L_\infty$ norm is $\|M\|_{\max} = \max_{i,j}|m_{ij}|$. For a set $\mathcal{D}$, $|\mathcal{D}|$ denotes its cardinality. We use $\lfloor a\rfloor$ to denote the largest integer smaller than or equal to $a$, and $X = (x_1, \ldots, x_n)^\top$ to denote the data matrix.

In practice, the optimal projection direction $\Sigma^{-1}\mu$ is typically unknown and needs to be learned from the sample. A simple plug-in estimator for $\Sigma^{-1}\mu$ replaces $\Sigma$ and $\mu$ by their sample counterparts $\hat\Sigma$ and $\bar x$. The first challenge is that when $p > n$, the sample covariance matrix is no longer invertible. This challenge can be overcome by imposing additional sparsity structure on $\Sigma$ or $\Sigma^{-1}$ and applying thresholding or penalized likelihood (Fan et al. 2016). The second challenge is that even if we obtain a consistent estimator $\hat\Sigma^{-1}$, the consistency of $\hat\Sigma^{-1}$ does not imply the consistency of $\hat\Sigma^{-1}\bar x$ (Chen et al. 2016). These challenges motivate us to seek a direct estimate of the vector $\Sigma^{-1}\mu$. Let $\theta = \Sigma^{-1}\mu$ denote the optimal projection direction. Notice that any vector proportional to $\theta$ is also optimal, as the projection test statistic $T_\theta^2$ is scale invariant. Based on this observation, we consider the following constrained quadratic optimization problem

$$\min_\beta\ \frac{1}{2}\beta^\top\Sigma\beta \quad \text{subject to } \mu^\top\beta = 1. \tag{6}$$

The solution to (6) is $\beta^* = \Sigma^{-1}\mu/(\mu^\top\Sigma^{-1}\mu)$, which can be viewed as a scaled version of the optimal projection direction. Note that the formulation (6) rules out the null hypothesis $\mu = 0$, as the solution is not well defined for $\mu = 0$. We want to point out that the optimal projection direction, which maximizes the power of the test $T_P^2$, is derived when the alternative hypothesis is true. Under the null hypothesis, any choice of projection works in the sense that the type I error rate can be well controlled; see more discussion in Section 3. Hence excluding $\mu = 0$ does not cause any inconvenience.
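The closed form of the solution follows from the Lagrangian stationarity condition $\Sigma\beta = \lambda\mu$, and can be checked numerically. The following toy check, with arbitrary simulated values, is our own illustration and not part of the paper's procedure:

```python
# Quick numerical check that beta* = Sigma^{-1} mu / (mu' Sigma^{-1} mu)
# solves (6): Sigma beta* is proportional to mu (KKT) and mu' beta* = 1.
import numpy as np

rng = np.random.default_rng(0)
p = 5
A = rng.standard_normal((p, p))
Sigma = A @ A.T + p * np.eye(p)        # an arbitrary positive definite covariance
mu = rng.standard_normal(p)

theta = np.linalg.solve(Sigma, mu)     # Sigma^{-1} mu
beta_star = theta / (mu @ theta)       # closed-form solution of (6)

assert np.allclose(Sigma @ beta_star, mu / (mu @ theta))  # Sigma beta* ∝ mu
assert np.isclose(mu @ beta_star, 1.0)                    # feasibility
```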

In high-dimensional regimes, it is well known that consistent estimators cannot be achieved unless additional structural assumptions are imposed. A common assumption is that the optimal projection direction $\beta^* = \Sigma^{-1}\mu/(\mu^\top\Sigma^{-1}\mu)$ is sparse, i.e., a large proportion of the elements of $\beta^*$ are 0. To encourage a sparse estimate of $\beta^*$, we consider the following constrained and regularized quadratic programming problem,

$$\min_\beta\ \frac{1}{2}\beta^\top\hat\Sigma\beta + P_\lambda(\beta) \quad \text{subject to } \bar x^\top\beta = 1, \tag{7}$$

where $P_\lambda(\beta) = \sum_{j=1}^p P_\lambda(|\beta_j|)$ is a penalty function satisfying the conditions in Supplement S.2.1. Such conditions on $P_\lambda(t)$ are relatively mild and are satisfied by a wide variety of popular penalties (Fan et al. 2020).

Before we proceed to solve the optimization problem (7), we provide a different perspective on estimating the optimal projection direction without knowing its explicit form $\Sigma^{-1}\mu$. Suppose we want to project the data onto a 1-dimensional space. Let $\beta \in \mathbb{R}^p$ be the projection vector and $y_i = \beta^\top x_i$, $i = 1, \ldots, n$, be the projected sample. Then $\sqrt{n}\,\bar y/s_y$ is the t-statistic based on the projected sample, where $\bar y$ and $s_y^2$ are the sample mean and sample variance of the $y_i$'s,

$$\frac{\sqrt{n}\,\bar y}{s_y} = \frac{\sqrt{n}\,\beta^\top\bar x}{\sqrt{\frac{1}{n-1}\sum_{i=1}^n \beta^\top(x_i - \bar x)(x_i - \bar x)^\top\beta}} = \frac{\sqrt{n}\,\beta^\top\bar x}{\sqrt{\beta^\top\hat\Sigma\beta}}.$$

The goal is to find $\beta$ that maximizes the absolute value of the t-statistic $\sqrt{n}\,\bar y/s_y$ (i.e., maximizes power). Since $\sqrt{n}\,\bar y/s_y$ is an odd function of $\beta$, maximizing $|\sqrt{n}\,\bar y/s_y|$ is the same as solving $\max_\beta\ \beta^\top\bar x/\sqrt{\beta^\top\hat\Sigma\beta}$, which is equivalent to

$$\min_\beta\ \frac{1}{2}\beta^\top\hat\Sigma\beta \quad \text{subject to } \bar x^\top\beta = 1. \tag{8}$$

Therefore, formulation (7) can be viewed as the regularized version of (8): it intends to maximize the test statistic, with the penalty term $P_\lambda$ added to avoid overfitting, from a machine learning perspective. Another perspective that provides insight into the optimal projection direction is Fisher's discriminant analysis. Consider a two-sample problem in which the interest is to test whether the two population means are the same, that is, $H_0: \mu_1 = \mu_2$. We can treat the hypothesis test as a classification problem under Fisher's discriminant analysis framework, which seeks the linear combination of features that maximizes the ratio of the variance between the two populations to the variance within the populations (Fisher 1936). That is, we find $\beta$ to maximize the ratio

$$\frac{\mathrm{Var}_{\text{between}}}{\mathrm{Var}_{\text{within}}} = \frac{\{\beta^\top(\mu_2 - \mu_1)\}^2}{\beta^\top\Sigma\beta}.$$

This ratio measures the signal-to-noise ratio for the class labeling. It can be shown that the maximum ratio occurs when $\beta \propto \Sigma^{-1}(\mu_1 - \mu_2)$, which is exactly the optimal projection direction for the two-sample problem (Huang 2015).
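The claim is a Cauchy-Schwarz argument in the $\Sigma$-inner product: with $d = \mu_2 - \mu_1$, $\{\beta^\top d\}^2 \le (\beta^\top\Sigma\beta)(d^\top\Sigma^{-1}d)$ with equality at $\beta \propto \Sigma^{-1}d$. A quick numerical check on toy values (our illustration, not from the paper):

```python
# Verify that the Fisher ratio (b'd)^2 / (b'Sigma b) is maximized at
# b ∝ Sigma^{-1} d, with maximum value d' Sigma^{-1} d.
import numpy as np

rng = np.random.default_rng(1)
p = 4
A = rng.standard_normal((p, p))
Sigma = A @ A.T + p * np.eye(p)         # toy positive definite covariance
d = rng.standard_normal(p)              # stands in for mu_2 - mu_1

ratio = lambda b: (b @ d) ** 2 / (b @ Sigma @ b)
b_opt = np.linalg.solve(Sigma, d)       # the claimed maximizer
assert np.isclose(ratio(b_opt), d @ np.linalg.solve(Sigma, d))
for _ in range(100):                    # random directions never beat b_opt
    assert ratio(rng.standard_normal(p)) <= ratio(b_opt) + 1e-9
```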

2.3. Non-asymptotic error bounds

In this section, we establish non-asymptotic error bounds for all stationary points. The nonconvexity of $P_\lambda$ brings extra difficulty in solving the constrained and regularized quadratic program (7). Direct computation of the global solution is quite challenging in high-dimensional settings. For instance, Liu et al. (2016) developed a mixed integer linear programming method to find a provably global solution, which is computationally expensive. Instead of searching for the global solution, many algorithms have been developed to efficiently find local solutions with satisfactory theoretical guarantees (Zou and Li 2008, Wang et al. 2013, Loh and Wainwright 2015). We will show that any stationary point of (7) enjoys satisfactory statistical properties, and we introduce an ADMM algorithm with local linear approximation to obtain a stationary point. We say $\hat\beta$ is a stationary point of the optimization problem (7) if $\hat\beta$ satisfies the following first order condition,

$$\left\langle \hat\Sigma\hat\beta + \nabla P_\lambda(\hat\beta),\ \beta - \hat\beta\right\rangle \ge 0 \quad \text{for all feasible } \beta \text{ satisfying } \bar x^\top\beta = 1, \tag{9}$$

where $\nabla P_\lambda(\cdot)$ denotes a subgradient of $P_\lambda$ and $\langle\cdot,\cdot\rangle$ denotes the inner product of two vectors. The first order condition (9) is a necessary condition for $\hat\beta$ to be a local minimum. In other words, the set of $\hat\beta$ satisfying condition (9) includes all local minimizers as well as the global minimizer.

We further assume the sample covariance matrix Σ^ satisfies the following restricted strong convexity (RSC) condition,

$$\Delta^\top\hat\Sigma\Delta \ge \nu\|\Delta\|_2^2 - \tau\,\frac{\log p}{n}\,\|\Delta\|_1^2 \quad \text{for all } \Delta \in \mathbb{R}^p \text{ with } \|\Delta\|_1 \le 1, \tag{10}$$

where $\nu > 0$ is a strictly positive constant and $\tau \ge 0$ is a nonnegative constant. Though we only impose the RSC condition on the set $\{\Delta: \|\Delta\|_1 \le 1\}$, it is easy to verify that the RSC condition actually holds for all $\Delta \in \mathbb{R}^p$ due to the quadratic form; see Lemma S.3 in Supplement S.1. In a classical setting where $n > p$, the sample covariance matrix $\hat\Sigma$ is positive definite, and hence one can set $\tau = 0$ and $\nu = \lambda_{\min}(\hat\Sigma)$, the minimal eigenvalue of $\hat\Sigma$. In the high-dimensional regime where $p > n$, the sample covariance matrix $\hat\Sigma$ is typically positive semi-definite, and hence $\Delta^\top\hat\Sigma\Delta \ge 0$ for any $\Delta \in \mathbb{R}^p$. Then the RSC condition (10) holds trivially for $\{\Delta: \|\Delta\|_1^2/\|\Delta\|_2^2 \ge a\}$, where $a = \nu n/(\tau\log p)$. As a result, we only require the RSC condition to hold on the set $\{\Delta: \|\Delta\|_1^2/\|\Delta\|_2^2 < a,\ \|\Delta\|_1 \le 1\}$. Such an RSC-type condition is widely adopted to develop non-asymptotic error bounds in high-dimensional M-estimation and is satisfied with high probability if $x_1, \ldots, x_n$ are i.i.d. sub-Gaussian random vectors. See, for example, Loh and Wainwright (2015).

Recall that $\beta^* = \Sigma^{-1}\mu/(\mu^\top\Sigma^{-1}\mu)$ is the scaled optimal projection direction. It is challenging to directly derive error bounds between $\hat\beta$ and $\beta^*$, as $\beta^*$ is not necessarily feasible; i.e., $\beta^*$ may not satisfy the linear equality constraint $\bar x^\top\beta = 1$. To this end, we introduce the scaled feasible optimal projection direction $\tilde\beta = \Sigma^{-1}\mu/(\bar x^\top\Sigma^{-1}\mu)$, which is the intersection of the set of optimal projection directions $\{a\theta: a \in \mathbb{R}\}$ and the feasible set $\{\beta: \bar x^\top\beta = 1\}$. Note that $\tilde\beta$ has the same direction as the optimal projection $\Sigma^{-1}\mu$ but also satisfies the linear constraint $\bar x^\top\beta = 1$. Figure 1 illustrates the relationship between the scaled optimal projection direction $\beta^*$, the scaled feasible optimal projection direction $\tilde\beta$, and a stationary point $\hat\beta$. We first derive error bounds between $\hat\beta$ and $\tilde\beta$; error bounds between $\hat\beta$ and $\beta^*$ then follow from the triangle inequality

$$\|\hat\beta - \beta^*\|_k \le \|\hat\beta - \tilde\beta\|_k + \|\tilde\beta - \beta^*\|_k, \quad k = 1, 2. \tag{11}$$

Figure 1: Illustration of the scaled optimal projection direction $\beta^*$, the scaled feasible optimal projection direction $\tilde\beta$, and a stationary point $\hat\beta$. $\tilde\beta$ is the intersection of the optimal projection direction and the feasible set. Both $\tilde\beta$ and $\hat\beta$ lie in the feasible set, while $\tilde\beta$ and $\beta^*$ are optimal projection directions.

We impose the following conditions,

  • (C1) $x_1, \ldots, x_n$ are independent and identically distributed sub-Gaussian vectors.

  • (C2) The sample covariance matrix $\hat\Sigma$ satisfies the RSC condition (10) with $3\mu \le 4\nu$.

  • (C3) There exist constants $C_1, C_2 > 0$ such that $\mu^\top\Sigma^{-1}\mu \ge C_1$ and $\|\theta\|_1 \le C_2$.

The following theorem states the L1 and L2 error bounds for β^-β~ and β^-β.

Theorem 1. Suppose conditions (C1)–(C3) hold. Let $\hat\beta$ be any stationary point of the program (7) with $\lambda = M\sqrt{\log p/n}$ for some large constant $M$. If the sample size $n$ satisfies

$$n \ge \left\{\frac{MC_2}{(1-\delta)C_1}\right\}^2 \log p \tag{12}$$

for some $0 < \delta < 1$, then with probability at least $1 - cp^{-1}$ for some constant $c$, we have

  1. $\|\hat\beta - \tilde\beta\|_1 = O(s\sqrt{\log p/n})$ and $\|\hat\beta - \tilde\beta\|_2 = O(\sqrt{s\log p/n})$;

  2. $\|\hat\beta - \beta^*\|_1 = O(s\sqrt{\log p/n})$ and $\|\hat\beta - \beta^*\|_2 = O(\sqrt{s\log p/n})$;

where $s = \|\beta^*\|_0$ is the number of nonzero elements in $\beta^*$.

Remark. The error bounds in Theorem 1 hold for all stationary points satisfying condition (9). In other words, any local solution is guaranteed to have the desired statistical accuracy. Condition (12) describes the relationship between the sample size $n$ and the dimension $p$ and is satisfied if $\log p/n = o(1)$. If $s\sqrt{\log p/n} \to 0$ as $n, p \to \infty$, then $\hat\beta$ is a consistent estimator of $\beta^*$. In other words, to obtain a consistent estimator, we require the optimal projection direction to be sparse. Chernozhukov et al. (2017) established a central limit theorem for high-dimensional data with Gaussian and bootstrap approximations, which can be applied to the one-sample hypothesis testing problem and requires $\log p = o(n^{1/7})$. In contrast, we impose a weaker condition on the dimension, $\log p = o(n/s^2)$, for the proposed projection test. Conditions (C1)–(C2) are commonly adopted in high-dimensional statistics. Condition (C3) is posited to ensure that the estimation error of the covariance matrix is not amplified by $\theta$ and passed along to the estimation of $\beta^*$. Note that $\|\hat\Sigma\theta - \mu\|_\infty = \|\hat\Sigma\theta - \Sigma\theta\|_\infty \le \|\hat\Sigma - \Sigma\|_{\max}\|\theta\|_1$. A diverging $\|\theta\|_1$ would amplify the estimation error of $\hat\Sigma$.

Geometry of $\hat\beta$.

Notice that the projection test is scale invariant with respect to $\hat\beta$; i.e., $\hat\beta$ and $a\hat\beta$ have exactly the same testing performance for any $a > 0$. To eliminate the scale effect, we also measure the closeness between $\hat\beta$ and $\beta^*$ using the cosine similarity. Formally, the cosine similarity between two vectors $u$ and $v$ is defined as

$$\cos(u, v) = \frac{u^\top v}{\|u\|_2\|v\|_2}.$$

If $u$ and $v$ have the same direction ($u = av$ for some $a > 0$), then the cosine similarity equals 1. In general, the closer the cosine similarity is to 1, the closer the directions of the two vectors are. We show that the cosine similarity between $\beta^*$ and $\hat\beta$ converges to 1.

Corollary 1. Suppose conditions (C1)–(C3) hold. Let $\hat\beta$ be a stationary point of the program (7) with $\lambda = M\sqrt{\log p/n}$ for some large constant $M$. Then we have

$$\cos(\hat\beta, \tilde\beta) = \cos(\hat\beta, \beta^*) \to 1.$$

2.4. ADMM with local linear approximation

The error bounds in Theorem 1 indicate that any stationary point has satisfactory statistical accuracy. Instead of finding the global solution to the possibly nonconvex optimization problem (7), we only need an algorithm that finds a stationary point satisfying the first order condition (9). We propose to solve the constrained and regularized quadratic program using the local linear approximation (LLA) technique. The LLA algorithm was first proposed in Zou and Li (2008) to deal with nonconvex penalties in generalized linear models. The idea of LLA is to approximate the nonconvex penalty by its first order expansion, which is a convex function. It turns out that we only need to solve a convex problem iteratively until a convergence criterion is met. A similar idea was proposed in Wang et al. (2013), which approximates the nonconvex penalty by a tight upper bound. Here we extend the LLA algorithm to solve the constrained and regularized quadratic program (7).

Suppose $\beta_j^{(0)}$ is close to $\beta_j$. Ignoring constants that do not involve $\beta_j$, the penalty function can be approximated by

$$P_\lambda(|\beta_j|) \approx P_\lambda(|\beta_j^{(0)}|) + P_\lambda'(|\beta_j^{(0)}|)\,(|\beta_j| - |\beta_j^{(0)}|).$$

Given the $k$-th iterate $\hat\beta^{(k)} = (\hat\beta_1^{(k)}, \ldots, \hat\beta_p^{(k)})^\top$, the optimization problem (7) is updated as follows,

$$\hat\beta^{(k+1)} = \operatorname*{arg\,min}_{\bar x^\top\beta = 1}\ \frac{1}{2}\beta^\top\hat\Sigma\beta + \sum_{j=1}^p \omega_j^{(k)}|\beta_j|, \tag{13}$$

where $\omega_j^{(k)} = P_\lambda'(|\hat\beta_j^{(k)}|)$. We iteratively solve (13) until $\hat\beta^{(k)}$ converges. The LLA algorithm is an instance of majorization-minimization (MM) algorithms and is guaranteed to converge to a stationary point (Zou and Li 2008). With the local linear approximation, the constrained and regularized quadratic form has a unique solution, and this unique solution is a stationary point. Hence the theoretical results developed in Section 2.3 hold for the estimator given by the LLA algorithm. The theoretical properties of LLA estimators in high dimensions are systematically studied in Fan et al. (2014), where the LLA algorithm finds the oracle estimator with high probability after one iteration. To solve the convex optimization problem (13) after local linear approximation, we adopt the alternating direction method of multipliers (ADMM), which naturally handles the linear equality constraint $\bar x^\top\beta = 1$ (Boyd et al. 2011, Fang et al. 2015). The penalization parameter $\lambda$ is chosen via a high-dimensional BIC (Wang et al. 2013). The details of the ADMM algorithm and the choice of $\lambda$ are summarized in Supplements S.4 and S.5, respectively.
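A hedged sketch of the LLA outer loop with an ADMM inner solver is given below. This is a plain re-derivation of the updates for (13) — a quadratic $\beta$-step with the linear constraint enforced through its KKT multiplier and a weighted soft-thresholding $z$-step — not the authors' implementation; the fixed iteration counts and all function names are illustrative.

```python
# Hedged sketch: LLA (with SCAD weights) + ADMM for the constrained,
# weighted-L1 problem (13). Illustrative only, not the paper's code.
import numpy as np

def scad_deriv(t, lam, a=3.7):
    """SCAD derivative P'_lambda(|t|) (Fan and Li 2001)."""
    t = np.abs(t)
    return lam * (t <= lam) + np.maximum(a * lam - t, 0) / (a - 1) * (t > lam)

def admm_weighted_l1(Sigma_hat, xbar, w, rho=1.0, n_iter=500):
    """min (1/2) b'Sigma_hat b + sum_j w_j |b_j|  subject to  xbar'b = 1."""
    p = len(xbar)
    K = np.linalg.inv(Sigma_hat + rho * np.eye(p))
    Kx = K @ xbar
    beta = z = u = np.zeros(p)
    for _ in range(n_iter):
        v = K @ (rho * (z - u))
        nu = (xbar @ v - 1.0) / (xbar @ Kx)   # multiplier enforcing xbar'beta = 1
        beta = v - nu * Kx
        z = np.sign(beta + u) * np.maximum(np.abs(beta + u) - w / rho, 0)
        u = u + beta - z                      # dual update
    return z

def lla_direction(X, lam, n_lla=5):
    """Estimate the projection direction from an n x p sample X."""
    xbar = X.mean(axis=0)
    Sigma_hat = np.cov(X, rowvar=False)
    beta = admm_weighted_l1(Sigma_hat, xbar, np.full(X.shape[1], lam))
    for _ in range(n_lla):                    # LLA: re-weight and re-solve
        beta = admm_weighted_l1(Sigma_hat, xbar, scad_deriv(beta, lam))
    return beta
```

The returned $z$ iterate is exactly sparse; in practice one would replace the fixed iteration count by a stopping rule on the primal residual $\beta - z$.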

3. Data Splitting and Power Enhancement

Ideally, one would like to use the full sample to estimate the optimal projection direction so that the estimated projection is as accurate as possible, then project the full sample onto a 1-dimensional space and perform a test based on the projected sample. However, the limiting distribution of such a test statistic is challenging to derive, as the estimated projection direction and the data to be projected are dependent. In this section, we introduce a new projection test via an online framework. Before we present the full methodology of the online-style projection test, we first describe a data-splitting procedure.

3.1. A data-splitting procedure

The data-splitting technique has been widely used in high-dimensional statistical inference: one uses a subset of the data to learn the underlying model and the remaining data to conduct statistical inference. Given a dataset $\mathcal{D} = \{x_1, \ldots, x_n\}$, we partition the full dataset into two disjoint subsets $\mathcal{D}_1 = \{x_1, \ldots, x_{n_1}\}$ and $\mathcal{D}_2 = \{x_{n_1+1}, \ldots, x_n\}$. We use the first subset $\mathcal{D}_1$ to estimate the optimal projection direction $\beta^*$ and the second subset $\mathcal{D}_2$ to perform the projection test. More specifically, let $\bar x_1$ and $\hat\Sigma_1$ be the sample mean and the sample covariance matrix estimated from $\mathcal{D}_1$. The projection direction can be estimated by solving the constrained and penalized quadratic program introduced in Section 2.2,

$$\hat\beta = \operatorname*{arg\,min}_{\bar x_1^\top\beta = 1}\ \frac{1}{2}\beta^\top\hat\Sigma_1\beta + P_\lambda(\beta). \tag{14}$$

We then project the data in $\mathcal{D}_2$ onto a 1-dimensional space by taking the inner product of $\hat\beta$ and $x_i$, i.e., $y_i = x_i^\top\hat\beta$ for $i = n_1+1, \ldots, n$. Note that the estimated projection direction $\hat\beta$ is independent of $\mathcal{D}_2$, which is the key benefit of the data-splitting procedure. Conditional on $\hat\beta$, $y_{n_1+1}, \ldots, y_n$ are independent and identically distributed with mean $\hat\beta^\top\mu$ and variance $\hat\beta^\top\Sigma\hat\beta$. We have now reduced the dimensionality from $p$ to 1, and we can simply apply the one-sample t-test to the projected data $y_{n_1+1}, \ldots, y_n$. Let

$$\bar y = \frac{1}{n_2}\sum_{i=n_1+1}^n y_i, \qquad s_y^2 = \frac{1}{n_2-1}\sum_{i=n_1+1}^n (y_i - \bar y)^2,$$

where $n_2 = n - n_1$. The test statistic $T_y = \sqrt{n_2}\,\bar y/s_y$ asymptotically follows $N(0,1)$ under $H_0$, and we reject the null hypothesis whenever $|T_y| > z_{\alpha/2}$, where $z_{\alpha/2}$ is the upper $\alpha/2$ quantile of the standard normal distribution. If the $x_i$'s are normally distributed, then $T_y$ follows an exact t distribution with $n_2 - 1$ degrees of freedom. We refer to this test as the data-splitting projection test (DSPT) and summarize it in Algorithm 1. Thanks to the data-splitting procedure, the estimated $\hat\beta$ is independent of the remaining subset $\mathcal{D}_2$. As a result, we are able to achieve an exact test and hence control the type I error rate. However, the performance of the data-splitting procedure may not be satisfactory, as the data in $\mathcal{D}_1$ are not utilized when conducting the t-test, leading to a loss of power.
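A minimal sketch of the DSPT follows; `estimate_direction` is a placeholder for any estimator of the optimal direction, e.g. a solver for (14) such as the LLA/ADMM sketch above or a ridge-type plug-in.

```python
# Hedged sketch of the data-splitting projection test; illustrative only.
import numpy as np
from scipy import stats

def dspt(X, estimate_direction, alpha=0.05):
    n = X.shape[0]
    n1 = n // 2
    beta_hat = estimate_direction(X[:n1])    # D1: learn the projection direction
    y = X[n1:] @ beta_hat                    # D2: project onto one dimension
    res = stats.ttest_1samp(y, popmean=0.0)  # exact t-test under normality
    return res.statistic, res.pvalue, res.pvalue < alpha
```

Next, we derive the asymptotic power function of the DSPT, defined as $\beta_{ds}^\alpha(\mu) = \Pr(|T_y| > z_{\alpha/2})$.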


Theorem 2. Suppose conditions (C1)–(C3) hold. Let $\hat\beta$ be a stationary point of the program (14) with $\lambda = M\sqrt{\log p/n_1}$ for some large constant $M$. If $s\sqrt{\log p/n_1} = o(1)$, then we have

$$\beta_{ds}^\alpha(\mu) - \Phi\left(-z_{\alpha/2} + \sqrt{n_2\,\zeta}\right) \to 0 \quad \text{as } n_1, n_2 \to \infty,$$

where $\zeta = \mu^\top\Sigma^{-1}\mu$ and $\Phi(\cdot)$ is the cdf of the standard normal distribution.

The quantity $\zeta = \mu^\top\Sigma^{-1}\mu$ can be interpreted as the signal strength under the alternative hypothesis. As long as $n_2\zeta \to \infty$, the DSPT has asymptotic power approaching 1. If the full sample of size $n$ were used to perform the test, one would expect the power function to be $\Phi(-z_{\alpha/2} + \sqrt{n\zeta})$. Theorem 2 thus clearly indicates the power loss of the DSPT due to the data-splitting procedure: only the observations in $\mathcal{D}_2$ are used to test the hypothesis. Though the DSPT is not optimal in terms of testing power, it is still more powerful than the quadratic-form tests, which ignore the dependence among variables, as long as $n_2/n > 0.5$ (Huang 2015). In practice, it is recommended to set $n_1 = \lfloor\tau n\rfloor$ with $\tau \in (0.4, 0.6)$ based on numerical studies. Huang (2015) proposed a ridge-type estimator $(\hat\Sigma + \lambda I)^{-1}\bar x$ of the projection direction, which is not guaranteed to be consistent in high dimensions, and its theoretical properties require $n$ and $p$ to be of the same order; its power analysis relies on the consistency of the ridge-type estimator. We instead propose to estimate the optimal projection via a constrained and regularized quadratic program. The resulting estimator is consistent provided that $s\sqrt{\log p/n} = o(1)$, and we derive the asymptotic power function $\beta_{ds}^\alpha(\mu)$ under this condition.

3.2. An online-style data splitting

The data-splitting procedure introduced in the previous section achieves an exact t-test but suffers from power loss. In this subsection, we propose an online framework for the projection test that enhances the testing power while retaining control of the type I error rate. Imagine that observations arrive one by one in chronological order, and we repeat a projection-estimation procedure whenever a new observation arrives. More specifically, we maintain an estimate of the optimal projection based on the current observations; when a new observation arrives, we first project it and then update the estimate to include the new observation. Suppose we have observed $\mathcal{D}_t = \{x_1, \ldots, x_t\}$ at time $t$; we use these $t$ observations to estimate the optimal projection direction via the penalized and constrained quadratic program introduced in Section 2.2,

$$\hat\beta_t = \operatorname*{arg\,min}_{\bar x_t^\top\beta = 1}\ \frac{1}{2}\beta^\top\hat\Sigma_t\beta + P_\lambda(\beta), \tag{15}$$

where $\bar x_t$ and $\hat\Sigma_t$ are the sample mean and sample covariance estimated from $\mathcal{D}_t$. When a new observation $x_{t+1}$ arrives, we take the following Projection-Estimation steps:

  • Projection: project the new observation using the current estimate $\hat\beta_t$, i.e., $y_{t+1} = x_{t+1}^\top\hat\beta_t$.

  • Estimation: update the estimate of the projection direction by incorporating the information of the new observation $x_{t+1}$. That is,
    $$\hat\beta_{t+1} = \operatorname*{arg\,min}_{\bar x_{t+1}^\top\beta = 1}\ \frac{1}{2}\beta^\top\hat\Sigma_{t+1}\beta + P_\lambda(\beta),$$
    where $\bar x_{t+1}$ and $\hat\Sigma_{t+1}$ are the sample mean and sample covariance estimated from $\mathcal{D}_{t+1} = \{x_1, \ldots, x_{t+1}\}$.

Given an integer $k_n \le n$, we obtain an initial estimate $\hat\beta_{k_n}$ based on the first $k_n$ observations $\mathcal{D}_{k_n}$, and then repeat the Projection-Estimation steps until the last observation arrives. We require $k_n = o(n)$ and $k_n \ge 2$, since at least two observations are needed to estimate the covariance matrix. As a result, we obtain a sequence of projected observations $y_{k_n+1}, \ldots, y_n$ of size $n - k_n$, based on which we carry out the one-sample t-test. Let
$$T_y = \sqrt{n - k_n}\,\frac{\bar y}{s_y}, \tag{16}$$
where $\bar y$ and $s_y^2$ are the sample mean and the sample variance of $y_{k_n+1}, \ldots, y_n$. Later in this section we show that the test statistic $T_y$ converges to $N(0,1)$ under $H_0$. As a result, we can use $z_{\alpha/2}$ as the critical value and reject $H_0$ whenever $|T_y| > z_{\alpha/2}$. We refer to this test as the Online-style Projection Test One-by-one (OPT-O), as we update $\hat\beta_t$ whenever a new observation arrives. The details of the OPT-O are summarized in Algorithm 2.

Under the one-by-one framework of the OPT-O, one needs to solve a constrained and penalized quadratic program, which can be nonconvex, whenever a new observation arrives. As a result, the OPT-O test can be computationally expensive, especially when $n$ is large. To reduce the computational cost, we also propose a mini-batch version of the OPT-O test: we update the estimated projection direction only when a batch of observations of size $b$ has arrived. Suppose we have an estimate $\hat\beta_t$ at time $t$ based on $\mathcal{D}_t = \{x_1, \ldots, x_t\}$. When the next $b$ observations $x_{t+1}, \ldots, x_{t+b}$ arrive, we first project the $b$ observations onto a 1-dimensional space via $\hat\beta_t$, i.e., $y_{t+1} = x_{t+1}^\top\hat\beta_t, \ldots, y_{t+b} = x_{t+b}^\top\hat\beta_t$. We then update the estimate of the projection direction by including the additional $b$ observations,

$$\hat\beta_{t+b} = \operatorname*{arg\,min}_{\bar x_{t+b}^\top\beta = 1}\ \frac{1}{2}\beta^\top\hat\Sigma_{t+b}\beta + P_\lambda(\beta), \tag{17}$$

where $\bar x_{t+b}$ and $\hat\Sigma_{t+b}$ are the sample mean and sample covariance matrix computed from the first $t+b$ observations $\mathcal{D}_{t+b} = \{x_1, \ldots, x_{t+b}\}$. We repeat this procedure until the last batch arrives; note that the size of the last batch can be smaller than $b$. Similar to the OPT-O, we reject $H_0$ if $|T_y| > z_{\alpha/2}$, where the test statistic $T_y$ is defined in (16). The details of the mini-batch version are summarized in Algorithm 3, and the corresponding test is referred to as the Online-style Projection Test mini-Batch (OPT-B). Note that the data-splitting projection test can be regarded as a special case of the OPT-B with a single batch. In practice, we recommend the OPT-O test (i.e., $b = 1$) to maximize the power; if computational cost is a concern, one should use a small value of $b$ to retain high power, provided the cost is affordable. Section S.8 studies how the batch size $b$ affects the power of projection tests. Another way to reduce the computational time is warm starting: for both the OPT-O and the OPT-B, one can use the solution $\hat\beta_t$ from the previous step as the initial value in the next iteration when computing $\hat\beta_{t+1}$ or $\hat\beta_{t+b}$. Such a warm start helps shorten the time needed to find the solution. A sketch of the mini-batch procedure follows.
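The following hedged sketch implements the OPT-B loop ($b = 1$ recovers the OPT-O); `estimate_direction` is the same placeholder as in the DSPT sketch, and the warm start is omitted for brevity.

```python
# Hedged sketch of the mini-batch online projection test (OPT-B); illustrative only.
import numpy as np
from scipy.stats import norm

def opt_b(X, estimate_direction, b=10, alpha=0.05, tau=0.6):
    n = X.shape[0]
    kn = max(2, int(n ** tau))                    # initial sample, kn = o(n)
    beta_hat = estimate_direction(X[:kn])
    y = []
    t = kn
    while t < n:
        batch = X[t:t + b]                        # the last batch may be smaller
        y.extend(batch @ beta_hat)                # project the new batch first ...
        t += len(batch)
        if t < n:
            beta_hat = estimate_direction(X[:t])  # ... then update the direction
    y = np.asarray(y)
    T = np.sqrt(len(y)) * y.mean() / y.std(ddof=1)
    return T, abs(T) > norm.ppf(1 - alpha / 2)    # reject if |T_y| > z_{alpha/2}
```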


Algorithms 2 and 3 introduce a general online framework for constructing projection tests. Instead of using the constrained and penalized quadratic program to estimate the projection direction, any appropriate estimator can be applied; for instance, one can use the ridge-type estimator proposed in Huang (2015).

3.3. Asymptotic distribution of online-style projection test

In this subsection, we establish the limiting distributions of the proposed online-style projection test. Under the online framework, we obtain a sequence of projected observations $y_{k_n+1}, \ldots, y_n$, one by one or via mini-batches, and we need to make an inference regarding $H_0$ based on this projected sample. One challenge is that $y_{k_n+1}, \ldots, y_n$ are no longer independent of each other. A key observation is that we can construct a sequence of martingale differences from the projected sample. Let $z_t$ be the centered version of $y_t$, i.e., $z_t = (x_t - \mu)^\top\hat\beta_{t-1}$ for $t = k_n+1, \ldots, n$, so that $z_t$ has mean 0. The following lemma states that the sequence of $z_t$'s is a martingale difference sequence; the proof can be found in Section S.3.


Lemma 1. The sequence $z_{k_n+1}, \ldots, z_n$ is a martingale difference sequence with respect to the σ-fields generated by $x_1, x_2, \ldots$, under both the one-by-one and the mini-batch frameworks.

Lemma 1 also implies that $y_{k_n+1}, \ldots, y_n$ is a martingale difference sequence under $H_0$. With this observation, we are ready to establish the central limit theorem for the test statistic. To this end, let $V_{t+1}^2$ be the conditional variance of $z_{t+1}$ given $x_1, \ldots, x_t$,

$$V_{t+1}^2 = E\left(z_{t+1}^2 \mid x_1, \ldots, x_t\right) = \hat\beta_t^\top\Sigma\hat\beta_t,$$

and let $\bar z$ and $s_z^2$ be the sample mean and sample variance of $z_{k_n+1}, \ldots, z_n$,

$$\bar z = \frac{1}{n-k_n}\sum_{t=k_n+1}^n z_t \quad\text{and}\quad s_z^2 = \frac{1}{n-k_n-1}\sum_{t=k_n+1}^n (z_t - \bar z)^2.$$

The following theorem establishes the asymptotic normality of $T_z$.

Theorem 3. (Normality) Assume $k_n = o(n)$ and that there exists $a_{n,p} > 0$, depending on $n$ and $p$, such that $(n - k_n - 1)^{-1}a_{n,p}\sum_{t=k_n+1}^n V_t^2 \overset{p}{\to} V^2$, where $V^2$ is an almost surely finite random variable satisfying $\Pr(V^2 > 0) = 1$. Then we have

$$T_z = \sqrt{n-k_n}\,\bar z/s_z \overset{d}{\to} N(0,1).$$

In particular, under $H_0$, $T_y = \sqrt{n-k_n}\,\bar y/s_y \overset{d}{\to} N(0,1)$.

We also empirically verify the asymptotic normality of the proposed online-style tests under both $H_0$ and $H_1$ in Section S.6.

Remark. To establish the central limit theorem for the proposed online-style test statistic, we require the sample average of the conditional variances $V_t^2$ to be nondegenerate. Note that for each $t$, the estimated projection direction satisfies the linear constraint $\bar x_t^\top\hat\beta_t = 1$, and hence $\|\hat\beta_t\|_1 \ge 1/\|\bar x_t\|_\infty$. Since $\|\bar x_t\|_\infty \lesssim \sqrt{\log p/t}$ with high probability, we have $\|\hat\beta_t\|_1 \gtrsim \sqrt{t/\log p}$, which diverges in the high-dimensional setting, and so may the conditional variance $V_t^2$. With a proper choice of $a_{n,p}$, we want the sample average of the conditional variances (up to the factor $a_{n,p}$) to converge to some nonzero random variable. In general, one can set $a_{n,p}$ to be of the order of $\{(n-k_n)^{-1}\sum_{t=k_n+1}^n V_t^2\}^{-1}$. In practice, we do not need to derive the explicit form of $a_{n,p}$, as the test statistic is scale invariant and $a_{n,p}$ cancels out. Under the alternative hypothesis, we can show that the sample average of the conditional variances $V_t^2$ converges to $\zeta^{-1}$, where $\zeta = \mu^\top\Sigma^{-1}\mu$, and hence one can set $a_{n,p} = O(\zeta)$. Thanks to the linear constraint $\bar x_t^\top\beta = 1$, the resulting estimator $\hat\beta_t$ is bounded away from 0 and the conditional variance $V_t^2$ is nondegenerate. Without the linear constraint, we cannot guarantee that $\hat\beta_t$ is bounded away from 0, and hence the central limit theorem may not hold.

Here is an example illustrating the convergence of $V_t^2$ under conditions imposed on the first and second moments of $\hat\beta_t$. Assume there exist $\underline\beta$, $\underline\Sigma$, and a sequence $r_n$ such that $E(a^\top\hat\beta_t) - a^\top\underline\beta \to 0$ and $r_n^2\,\mathrm{Var}(a^\top\hat\beta_t) - a^\top\underline\Sigma a \to 0$ for all $\|a\|_2 = 1$. Therefore, $E(\hat\beta_t) - \underline\beta \to 0$ and $r_n^2\,\mathrm{Var}(\hat\beta_t) - \underline\Sigma \to 0$. Note that

$$\frac{1}{n-k_n}\sum_{t=k_n+1}^n V_t^2 = \frac{1}{n-k_n}\sum_{t=k_n+1}^n \hat\beta_{t-1}^\top\Sigma\hat\beta_{t-1} = \mathrm{trace}\left(\Sigma\cdot\frac{1}{n-k_n}\sum_{t=k_n+1}^n \hat\beta_{t-1}\hat\beta_{t-1}^\top\right).$$

Further assume that $\frac{1}{n-k_n}\sum_{t=k_n+1}^n \hat\beta_{t-1}\hat\beta_{t-1}^\top$ converges to its expectation $E(\hat\beta_t\hat\beta_t^\top)$, where

$$E(\hat\beta_t\hat\beta_t^\top) - \left\{r_n^{-2}\underline\Sigma + \underline\beta\,\underline\beta^\top\right\} = \mathrm{Var}(\hat\beta_t) + E(\hat\beta_t)E(\hat\beta_t)^\top - \left\{r_n^{-2}\underline\Sigma + \underline\beta\,\underline\beta^\top\right\} \to 0.$$

As a result, the sample average of the conditional variances can be approximated by

$$\frac{1}{n-k_n}\sum_{t=k_n+1}^n V_t^2 \approx r_n^{-2}\,\mathrm{trace}(\Sigma\underline\Sigma) + \underline\beta^\top\Sigma\underline\beta.$$

Under $H_0$, we have $\underline\beta = 0$, and hence one can take $a_{n,p} = r_n^2$. Under $H_1$, we have $r_n = O(1)$, and hence one can take $a_{n,p} = 1$.

Theorem 3 establishes the asymptotic normality of the test statistic $T_y$ under both the null and alternative hypotheses, assuming the sample average of the conditional variances (up to some factor $a_{n,p}$) converges. This theorem justifies the use of $z_{\alpha/2}$ as the critical value: we reject the null hypothesis if and only if $|T_y| \ge z_{\alpha/2}$. A bootstrap version of the projection test, in which the critical value is chosen based on bootstrap samples, is discussed in Supplement S.7. Under conditions (C1)–(C3), we can show $V_t^2 - (\mu^\top\Sigma^{-1}\mu)^{-1} \to 0$ as $t \to \infty$, based on which we can derive the asymptotic power function $\beta_{opto}^\alpha(\mu) = \Pr(|T_y| > z_{\alpha/2})$ of the OPT-O test. The following theorem establishes the asymptotic power function of the OPT-O test.

Theorem 4. Suppose conditions (C1)–(C3) hold. Let $\hat\beta_t$ be a stationary point of the program (15) with $\lambda = M\sqrt{\log p/t}$. Assume that $s\sqrt{\log p/n} = o(1)$ and $k_n = o(n)$; then we have

  1. (Normality under the alternative) $T_z = \sqrt{n-k_n}\,\bar z/s_z \overset{d}{\to} N(0,1)$.

  2. (Power function) $\beta_{opto}^\alpha(\mu) - \Phi\left(-z_{\alpha/2} + \sqrt{(n-k_n)\,\zeta}\right) \to 0$ as $n \to \infty$.

Theorem 4 establishes the normality of the proposed OPT-O test statistic under the alternative, from which we derive the asymptotic power function. The asymptotic power function $\beta_{opto}^\alpha(\mu)$ in Theorem 4 holds for any choice of $k_n = o(n)$. In practice, one may set $k_n$ to be a fixed constant or $k_n = \lfloor n^\tau\rfloor$ for some $\tau \in (0,1)$; see Section 3.4 for a numerical study on how to select $k_n$. For the mini-batch test OPT-B, it is easy to verify that $z_{k_n+1}, z_{k_n+2}, \ldots$ is also a martingale difference sequence, and thus the asymptotic normality and power function in Theorem 4 also hold for the OPT-B.

Remark. The power function $\beta_{opto}^\alpha(\mu)$ indicates that $\mu^\top\Sigma^{-1}\mu$ can be regarded as the signal in the alternative: the larger $\mu^\top\Sigma^{-1}\mu$ is, the easier it is to reject the null. If $n\,\mu^\top\Sigma^{-1}\mu \to \infty$, then the asymptotic power of rejecting the null tends to 1. The sum-of-squares-type tests, which ignore the dependence among the variables, have asymptotic power

$$\beta_{SS}^\alpha(\mu) = \Phi\left(-z_\alpha + \frac{n\|\mu\|_2^2}{\sqrt{2\,\mathrm{tr}(\Sigma^2)}}\right),$$

under the assumption that $\mu^\top\Sigma^{-1}\mu = o\{n^{-1}\mathrm{tr}(\Sigma^2)\}$. Since $k_n = o(n)$, we have $\beta_{opto}^\alpha(\mu) > \beta_{ds}^\alpha(\mu)$. The OPT-O test improves on the DSPT in two ways: (1) the OPT-O test keeps including new observations in the estimation of $\beta^*$, so the estimate becomes more and more accurate as more observations arrive; (2) fewer observations are discarded by the OPT-O test than by the DSPT when performing the test. For the OPT-O, only the first $k_n = o(n)$ observations are discarded, while $\tau n = O(n)$ observations are discarded by the DSPT. Let us further examine the technical conditions for the asymptotic power. For the DSPT, we require $s\sqrt{\log p/n_1} = o(1)$ to control the error bounds of the estimated projection direction. In order to have high power, the sample size $n_2 = n - n_1$ used for the test cannot be too small; in particular, if $n_2/n > 0.5$, the DSPT is more powerful than quadratic-form tests such as the CQ test. As a result, it is assumed that $n_1$ and $n_2$ are of the same order as $n$ (i.e., $n_1 = O(n)$, $n_2 = O(n)$) and $s\sqrt{\log p/n} = o(1)$ (hence $s\sqrt{\log p/n_1} = o(1)$). The same dimension condition $s\sqrt{\log p/n} = o(1)$ is needed for the OPT-O, but the OPT-O allows the number of observations excluded from the test to be small, i.e., $k_n = o(n)$. As a result, more observations can be used to perform the test, yielding higher asymptotic power than the DSPT. To summarize, under the same condition $s\sqrt{\log p/n} = o(1)$, the OPT-O is more powerful than the DSPT. We can also express the asymptotic power functions under the local alternative $\mu = \delta/\sqrt{n}$ for some fixed vector $\delta$. Given that $n_1 = \tau n$ and $k_n = o(n)$, the local asymptotic power functions of the OPT-O and the DSPT are

$$\beta_{opto}^\alpha(\delta) = \Phi\left\{-z_{\alpha/2} + \sqrt{(1-o(1))\,\eta}\right\},$$
$$\beta_{ds}^\alpha(\delta) = \Phi\left\{-z_{\alpha/2} + \sqrt{(1-\tau)\,\eta}\right\},$$

where $\eta = \delta^\top\Sigma^{-1}\delta$ and $0 < \tau < 1$. Clearly, for local alternatives, we still have $\beta_{opto}^\alpha(\delta) > \beta_{ds}^\alpha(\delta)$. To evaluate the efficiency of the proposed OPT-O and DSPT, we introduce the oracle projection test, which utilizes the full sample with a known optimal projection direction. The asymptotic power of the oracle projection test satisfies

$$\beta_{oracle}^\alpha(\mu) - \Phi\left(-z_{\alpha/2} + \sqrt{n\zeta}\right) \to 0 \quad \text{as } n \to \infty.$$

Here we present an asymptotic power comparison of the proposed OPT-O, the DSPT, and the oracle projection test with $n = 40, 160$, $n_1 = n/2$, $k_n = \lfloor n^{0.6}\rfloor$, and varying signal strength $\zeta = \mu^\top\Sigma^{-1}\mu$. Figure 2 indicates that there is a certain power loss due to data splitting compared with the oracle test, and that the online-style test improves the power of the DSPT. Overall, we have $\beta_{oracle}^\alpha(\mu) \ge \beta_{opto}^\alpha(\mu) \ge \beta_{ds}^\alpha(\mu)$.

Figure 2: Asymptotic power functions of the OPT-O, the DSPT, and the oracle projection test. The oracle power function is based on the full sample with a known optimal direction.

3.4. Choice of kn

The proposed online-style projection test involves choosing the parameter $k_n$: the first $k_n$ observations are used to obtain an initial estimate of the optimal projection direction. According to Theorem 4, any choice of $k_n = o(n)$ results in testing power of the order $\Phi(-z_{\alpha/2} + \sqrt{(n-k_n)\zeta})$. Clearly, $k_n$ should not be too large, since the first $k_n$ observations are not utilized when performing the online-style projection test; a large $k_n$ may lead to a significant loss in power. To maximize $\Phi(-z_{\alpha/2} + \sqrt{(n-k_n)\zeta})$ with respect to $k_n$, one might be tempted to set $k_n = 2$ (since at least two observations are needed to estimate the covariance matrix). However, a small $k_n$ results in an inaccurate initial estimate, which in turn affects the power. In what follows, we conduct numerical studies to investigate how $k_n$ affects the testing power. Assume $k_n = \lfloor n^\tau\rfloor$ (so $k_n = o(n)$) with $\tau \in \{0.2, 0.25, \ldots, 0.95\}$. We set $(n, p, c) = (100, 1600, 0.25)$ and $(40, 1600, 0.5)$ for both the autocorrelation and compound symmetry covariance structures with $\rho \in \{0.25, 0.50, 0.75, 0.95\}$. Figure 3 depicts the testing power against the choice of $\tau$, where the upper two panels are for the autocorrelation structure and the lower two panels are for the compound symmetry structure. Figure 3 shows that the power of the test is almost flat when $\tau$ is small, then increases gradually, and drops quickly as $\tau$ further increases. In other words, the power of the online-style projection test is not very sensitive to the choice of $k_n$ and is very similar whenever $\tau$ is relatively small ($\tau \le 0.8$). When $k_n$ is large, the online-style projection test is performed on a dataset of relatively small sample size, and lower power is expected. According to Figure 3, we suggest choosing $\tau \in [0.4, 0.8]$ in practice, and we use $\tau = 0.6$ in the rest of this paper.

Figure 3: Power of the online-style projection test against the choice of $\tau$. We set $(n, p, c) = (100, 1600, 0.25)$ and $(40, 1600, 0.5)$ for both the autocorrelation and compound symmetry covariance structures with $\rho \in \{0.25, 0.50, 0.75, 0.95\}$. The upper two panels show the power for the autocorrelation structure and the lower two panels the power for the compound symmetry structure.

4. Numerical Results

In this section, we conduct numerical experiments to examine the finite sample performance of different tests for the one-sample mean vector problem in high dimensions, including projection tests as well as quadratic-form tests. In particular, we consider the following optimal-projection-based tests:

  • OPT-O: Online-style Projection Test One-by-one version according to Algorithm 2.

  • OPT-B: Online-style Projection Test mini-Batch version according to Algorithm 3.

  • OPT-R: Online-style Projection Test One-by-one version where the projection direction is estimated by the ridge-type estimator in Huang (2015).

  • DSPT: Data-splitting Projection Test according to Algorithm 1.

  • RDSPT: Data-splitting Projection Test where the projection direction is estimated by the ridge-type estimator in Huang (2015).

We use the SCAD penalty (Fan and Li 2001) to estimate the optimal projection direction. The batch size for the OPT-B test is set to be 10. We also include two other projection tests, which are the Lauter test (Lauter 1996) and the random projection test (RPT) proposed in Lopes et al. (2011). The quadratic-form tests include the BS test (Bai and Saranadasa 1996), the CQ test (Chen and Qin 2010) and the SD test (Srivastava and Du 2008).

4.1. Simulation studies

We generate random samples of size $n$ from the multivariate normal distribution $N(c\mu, \Sigma)$, the multivariate t-distribution with 6 degrees of freedom, and the multivariate $\chi^2$-distribution with 1 degree of freedom, where $\mu = (\mathbf{1}_s^\top, \mathbf{0}_{p-s}^\top)^\top$ and $s = 10$. We set $c = 0, 0.5$, and 1 to examine the type I error rate and compare the power of different tests, where $c = 0$ corresponds to the null hypothesis and $c = 0.5$ or 1 corresponds to the alternative hypothesis. For $\rho \in (0,1)$, we consider the following two covariance structures: (1) compound symmetry with $\Sigma_1 = (1-\rho)I + \rho\mathbf{1}\mathbf{1}^\top$ and (2) autocorrelation with $\Sigma_2 = (\rho^{|i-j|})_{i,j}$, where $\mathbf{1}$ is the vector with all entries equal to 1 and $I$ is the identity matrix. We consider $\rho = 0.25, 0.5, 0.75$, and 0.95 to study the impact of correlation on testing power. We set the sample size $n = 40, 160$ and the dimension $p = 400, 1600$. For the online-style projection tests (i.e., OPT-O, OPT-B, and OPT-R), we set $k_n = \lfloor n^\tau\rfloor$ with $\tau = 0.6$. For the data-splitting projection tests (i.e., DSPT and RDSPT), we set $n_1 = \lfloor n\tau\rfloor$ with $\tau = 0.5$. For numerical stability, we replace the sample covariance matrix $\hat\Sigma$ by $\hat\Sigma_\phi = \hat\Sigma + \phi I$ with a small positive number $\phi = \sqrt{\log p/n}$; all the theoretical results still hold when $\phi \lesssim \sqrt{\log p/n}$. Such a perturbation does not noticeably affect the computational accuracy of the estimator but leads to faster convergence according to our numerical studies. We set the type I error rate $\alpha = 0.05$. All simulation results are based on 10,000 replications. The top two panels of Table 1 report the type I error rate and power of all tests with $n = 40$ and $p = 1600$ under the compound symmetry and autocorrelation structures, respectively. To save space, all other simulation results can be found in Tables S.2–S.6 of Supplement S.9. A sketch of the data-generating design is given below.
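This sketch reconstructs the stated design (our code, normal case only, with assumed defaults), generating $N(c\mu, \Sigma)$ samples under either covariance structure:

```python
# Hedged sketch of the simulation design; function and argument names are ours.
import numpy as np

def make_setup(n=40, p=1600, rho=0.5, c=0.5, s=10, structure="cs", seed=0):
    rng = np.random.default_rng(seed)
    if structure == "cs":                 # Sigma_1 = (1 - rho) I + rho 11'
        Sigma = (1 - rho) * np.eye(p) + rho * np.ones((p, p))
    else:                                 # Sigma_2 = (rho^{|i-j|})_{ij}
        idx = np.arange(p)
        Sigma = rho ** np.abs(idx[:, None] - idx[None, :])
    mu = np.zeros(p)
    mu[:s] = 1.0                          # mu = (1_s', 0_{p-s}')'
    L = np.linalg.cholesky(Sigma)
    X = c * mu + rng.standard_normal((n, p)) @ L.T   # rows are N(c mu, Sigma)
    return X, mu, Sigma
```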

Table 1:

Size and power comparison for n=40 and p=1600 (values are in percentage).

c = 0 c = 0.5 c = 1
ρ 0.25 0.5 0.75 0.95 0.25 0.5 0.75 0.95 0.25 0.5 0.75 0.95
$N(c\mu, \Sigma)$ with $\Sigma = \Sigma_1$
OPT-O 5.70 5.83 5.07 5.11 90.35 99.70 100.0 100.0 100.0 100.0 100.0 100.0
OPT-B 6.11 5.74 5.68 5.41 77.18 95.28 99.80 99.97 100.0 100.0 100.0 100.0
OPT-R 5.20 5.03 5.27 5.05 15.82 28.90 82.01 100.0 97.97 99.79 99.99 100.0
DSPT 5.22 4.99 5.21 5.08 50.43 79.97 98.51 99.98 99.92 99.94 99.99 100.0
RDSPT 5.01 4.71 5.06 4.94 14.62 23.71 54.68 98.14 71.49 81.98 95.74 100.0
BS 6.08 6.20 6.22 6.22 7.28 6.72 6.58 6.50 11.18 8.46 7.70 7.34
CQ 6.02 6.14 6.14 6.12 7.30 6.62 6.54 6.46 11.18 8.46 7.58 7.32
SD 2.29 0.56 0.11 0.01 2.50 0.58 0.12 0.01 4.06 0.72 0.14 0.01
Lauter 5.16 5.18 5.14 5.16 5.04 5.06 5.08 5.06 5.12 5.08 5.02 5.02
RPT 5.12 5.24 5.04 4.90 6.88 8.00 11.78 51.76 14.56 20.94 42.22 98.38
$N(c\mu, \Sigma)$ with $\Sigma = \Sigma_2$
OPT-O 5.87 5.99 5.76 5.88 74.33 59.22 40.02 25.12 100.0 100.0 99.98 99.84
OPT-B 5.48 5.90 6.06 5.98 63.21 49.59 32.92 20.84 100.0 100.0 99.87 98.33
OPT-R 6.14 5.45 5.96 5.47 32.44 24.99 15.73 8.37 99.89 98.48 84.55 40.02
DSPT 5.25 5.19 5.09 5.12 38.03 30.96 22.88 16.49 100.0 99.94 98.81 91.04
RDSPT 4.61 4.95 5.30 4.92 17.85 14.57 9.55 6.10 94.90 84.59 58.09 22.43
BS 5.18 4.96 5.24 4.46 38.02 29.42 17.74 8.02 100.0 99.84 91.44 33.82
CQ 5.16 5.06 5.24 4.44 38.08 29.46 17.72 8.06 100.0 99.84 91.48 33.92
SD 11.73 8.22 4.08 1.65 64.17 45.15 20.19 3.45 100.0 99.82 91.48 20.90
Lauter 4.90 4.66 5.20 5.10 8.70 6.42 5.94 5.14 14.64 9.48 6.98 5.36
RPT 4.86 4.98 5.16 4.90 6.34 6.30 6.60 6.86 11.88 12.16 11.52 13.36
$t_6(c\mu, \Sigma)$ with $\Sigma = \Sigma_1$
OPT-O 5.81 5.39 5.18 4.89 75.58 95.94 99.90 100.0 100.0 100.0 100.0 100.0
OPT-B 6.21 5.32 5.11 4.79 61.58 87.35 98.92 100.0 100.0 99.98 99.99 100.0
OPT-R 4.93 4.79 4.75 5.74 12.71 22.99 67.67 99.95 89.86 96.65 99.74 100.0
DSPT 5.28 5.09 4.83 4.62 40.44 69.29 95.49 99.97 99.82 99.83 99.99 99.99
RDSPT 4.88 5.06 5.02 4.74 11.40 18.51 46.12 96.50 61.99 74.62 92.62 99.98
BS 4.92 5.75 5.93 5.98 5.45 6.03 6.09 6.12 7.48 7.13 6.82 6.65
CQ 6.16 6.19 6.18 6.13 6.75 6.44 6.31 6.28 9.42 7.57 7.12 6.89
SD 1.19 0.43 0.05 0.02 1.34 0.43 0.05 0.02 1.89 0.53 0.05 0.02
Lauter 4.95 4.97 4.93 4.96 4.98 4.95 4.97 4.96 4.97 4.93 4.92 4.91
RPT 4.43 4.61 4.26 4.15 5.71 7.08 9.87 44.45 12.53 18.11 35.10 96.15
$t_6(c\mu, \Sigma)$ with $\Sigma = \Sigma_2$
OPT-O 5.93 5.86 5.91 6.03 58.34 45.07 30.92 17.82 99.99 99.96 99.36 96.20
OPT-B 5.46 5.82 6.02 6.24 48.12 36.98 26.30 16.15 99.99 99.90 98.59 91.16
OPT-R 5.48 5.96 6.20 6.19 23.81 18.51 12.64 7.43 97.26 91.16 70.57 30.22
DSPT 4.89 4.61 4.85 4.77 28.90 24.90 17.90 12.35 99.89 99.19 94.29 78.77
RDSPT 5.24 4.58 5.08 5.37 13.82 11.03 8.69 5.40 83.00 69.65 44.60 17.05
BS 0.00 0.00 0.01 0.78 0.00 0.01 0.06 1.36 3.05 3.60 5.24 6.85
CQ 5.14 5.17 5.09 5.25 21.73 17.47 11.39 7.08 97.03 90.64 66.82 20.99
SD 0.00 0.00 0.00 0.06 0.00 0.00 0.01 0.18 1.31 1.82 2.21 1.66
Lauter 5.07 5.12 4.76 5.09 7.72 6.65 5.08 5.27 12.20 9.06 6.38 5.44
RPT 3.89 4.63 4.00 4.34 5.47 5.81 5.14 5.78 10.12 10.39 10.17 11.02

We first examine the type I error rate. Among all these tests, the DSPT, the RDSPT, the Lauter test, and the RPT are exact tests (under the normality assumption), and their sizes are exactly $\alpha = 0.05$. The online-style projection tests have an asymptotic normal distribution under $H_0$ and control the type I error rate very well. Similar to the online-style projection tests, the BS test and the CQ test also control the size reasonably well. The SD test is very sensitive to the correlation level, and its size often deviates from the nominal 0.05.

Next we compare the power of these tests. Table 1 suggests that the power of these tests depends strongly on the covariance structure, the correlation $\rho$, and the signal strength $c$. In summary, the proposed OPT-O test is the most powerful test in all settings. The one-by-one online-style projection test (OPT-O) is slightly more powerful than the corresponding mini-batch version (OPT-B) and greatly improves on the corresponding data-splitting projection test (DSPT). This is not surprising, since the OPT-O test keeps updating the estimated projection direction whenever a new observation arrives and generally has a more accurate estimate than its mini-batch version. The mini-batch projection test slightly sacrifices accuracy but reduces the computational cost. The DSPT is less powerful, as it throws away much more information compared with the online-style projection tests. The test based on the constrained and regularized quadratic program (DSPT) is more powerful than the one based on the ridge-type estimator (RDSPT), as the true optimal projection direction is (approximately) sparse. When the covariance structure is compound symmetry, the power of the optimal-projection-based tests improves as $\rho$ increases, since larger $\rho$ leads to a stronger signal $\mu^\top\Sigma_1^{-1}\mu$ in the alternative. As the value of $c$ jumps from 0.5 to 1, the power of all tests increases dramatically. As the dimension $p$ goes from 400 to 1600, there is a downward trend in the performance of these tests. However, even in the most challenging setting $(n, p, c) = (40, 1600, 0.5)$, the proposed OPT-O still performs very satisfactorily, with power greater than 90%. The quadratic-form tests tend to become less powerful as $\rho$ increases. This is because quadratic-form tests ignore the correlation among the variables, so their overall performance is not satisfactory when the correlation is strong. When the covariance structure is autocorrelation, unlike the compound symmetry setting, the power of all tests drops as $\rho$ increases, since larger $\rho$ leads to a weaker signal $\mu^\top\Sigma_2^{-1}\mu$ in the alternative. We notice that some of the quadratic-form tests have better power than the data-splitting projection tests when $\rho$ is small. This is because $\Sigma_2^{-1}$ is a 3-sparse matrix, which is almost an identity matrix, so ignoring the dependence among variables does not noticeably diminish the power. In fact, the power of the quadratic-form tests decreases dramatically as $\rho$ increases, and they become less powerful than the DSPT when $\rho = 0.95$, not to mention the OPT-O test. As shown in the bottom two panels of Table 1 and Tables S.4, S.5, and S.6, the overall pattern of size and power for the multivariate $t_6$-distribution and $\chi^2_1$-distribution is very similar to that under the normality assumption, indicating that the proposed projection test is not very sensitive to the sub-Gaussianity assumption and performs well for asymmetric data. For non-sub-Gaussian data, we observe that the online-style projection test still controls the type I error rate well and retains high power; the proposed OPT-O test remains the most powerful among those tests that successfully control the type I error rate. For the other tests, the BS test and the CQ test show slight size distortion under $\chi^2_1$, while the SD test and the Lauter test completely fail to control the type I error under $\chi^2_1$.

4.2. Real data example

In this section, we apply the proposed online-style projection test as well as other tests to a real dataset of high resolution micro-computed tomography. The dataset consists of skull bone density measurements for 58 mice from three different genotypes ("T0A0", "T0A1", "T1A1") in a genetic mutation study. For each mouse, bone density is measured at density levels 130–249 for 16 different areas of the skull. See Percival et al. (2014) for a detailed description of the protocols. In this empirical study, we would like to know whether there is a difference in the bone density patterns of two different areas of the mouse skull. To emphasize the high-dimensional nature of this dataset, we only use a subset of the data. We select the mice of genotype "T0A1", for which 29 observations are available, i.e., the sample size is $n = 29$. The two skull areas "Mandible" and "Nasal" are selected, and we use all density levels 130–249 in our analysis; hence the dimension is $p = 120$. Since the two bones come from the same mouse, we first take the difference of the bone densities of the two selected areas at each density level for each subject. We then normalize the bone density so that $\frac{1}{29}\sum_{i=1}^{29}X_{ij}^2 = 1$ for all $1 \le j \le 120$.
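A short sketch of the stated preprocessing (the array names are hypothetical and the data loading is omitted):

```python
# Hedged sketch of the bone density preprocessing; illustrative only.
import numpy as np

def preprocess(mandible, nasal):
    """mandible, nasal: 29 x 120 arrays (density levels 130-249 for n = 29 mice)."""
    X = mandible - nasal                    # paired difference within each mouse
    scale = np.sqrt((X ** 2).mean(axis=0))  # enforce (1/29) sum_i X_ij^2 = 1
    return X / scale
```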

We first apply the OPT-O, the DSPT, and the other existing tests to perform the one-sample test and compute the p-values, reported in the first column of Table 2. All p-values are very close to 0 ($\ll 0.05$), implying that the bone densities of the two areas are significantly different. In order to compare the power of the different tests, we also compute their p-values as we weaken the signal strength in the alternative. To be more specific, let $\bar x$ be the sample mean and $r_i = x_i - \bar x$ be the residual for the $i$th subject. We then construct a new observation $z_i = \delta\bar x + r_i$ for the $i$th subject, where $\delta \in (0,1)$. By construction, a smaller $\delta$ leads to a weaker signal and makes the test more challenging. Table 2 reports the p-values of the different tests for $\delta = 1.0, 0.8, 0.6, 0.4, 0.3, 0.2$. As expected, the p-values of all tests increase as $\delta$ decreases. When $\delta = 0.8$ or 0.6, all tests perform well and reject the null hypothesis at level 0.05. When $\delta = 0.4$, the Lauter test starts to fail to reject the null hypothesis. When $\delta$ is further decreased to 0.3, the three projection-based tests OPT-O, DSPT, and RDSPT are able to reject the null hypothesis, while all other tests except the RPT fail to do so. When $\delta = 0.2$, only the OPT-O and the DSPT reject the null hypothesis, which suggests that the proposed projection tests can still perform very well even when the signal is weak. Among the tests that fail to reject the null at $\delta = 0.2$, the RDSPT is the most powerful, as it has the smallest p-value.

Table 2:

P-values of different tests for the bone density dataset.

δ        1.0           0.8           0.6           0.4           0.3           0.2
OPT-O    0             0             4.6 × 10^−13  4.0 × 10^−10  1.6 × 10^−4   0.0325
DSPT     7.9 × 10^−10  6.7 × 10^−9   3.3 × 10^−7   9.9 × 10^−6   3.4 × 10^−4   0.0418
RDSPT    2.4 × 10^−8   5.1 × 10^−7   1.9 × 10^−5   0.0014        0.0140        0.1462
BS       0             0             0             1.2 × 10^−4   0.0763        0.7684
CQ       0             0             0             1.6 × 10^−4   0.0810        0.7717
SD       0             0             2.0 × 10^−9   1.6 × 10^−2   0.2494        0.7995
Lauter   1.1 × 10^−10  3.1 × 10^−8   1.6 × 10^−3   0.1265        0.2625        0.4574
RPT      3.8 × 10^−9   4.2 × 10^−8   4.5 × 10^−6   8.5 × 10^−4   9.6 × 10^−3   0.2031

We plot the heatmap of absolute values of pairwise sample correlations across all bone density levels in Figure 4. The heatmap clearly shows that some bone density levels are highly correlated. This explains why the proposed projection tests are more powerful than the quadratic-form tests (BS, CQ and SD), as these tests do not take the dependence among variables into account.

Figure 4:

Heatmap of absolute values of pairwise sample correlations of bone densities. The yellow cells indicate pairs of density levels that are highly correlated; the green curve in the right panel traces the histogram of the absolute pairwise sample correlations.

5. Discussion

This paper studies the projection test for mean vectors in high dimensions. Existing projection tests either fail to utilize the optimal projection or lack theoretical justification for the estimated projection direction. To maximize the power of projection tests, a critical task is to obtain a good estimate of the optimal projection with statistical guarantees. We propose a constrained and regularized quadratic programming to obtain a consistent estimator under a sparsity assumption. We further propose an online-style framework for projection tests to enhance the testing power. This is a general testing framework in the sense that any proper estimator (e.g., a ridge-type estimator) of the optimal projection can be applied, and it can easily be extended to the two-sample problem. Under this framework, the first $k_n$ observations are used to obtain an initial estimator and are not utilized to perform the test. In future work, it would be interesting to investigate how the discarded $k_n$ observations can be reused to further enhance the power; a challenge is whether the type I error can still be controlled with guarantees when the first $k_n$ observations are reused. Another interesting question is how to construct a projection test when the sparsity assumption does not hold. When the optimal projection direction is not sparse, the ridge-type estimator in Huang (2015) seems to be a good candidate, and its theoretical properties are worth investigating in future work.
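To make the framework concrete, the sketch below illustrates the one-by-one online-style test using a hypothetical ridge-type plug-in for the projection direction; the function online_projection_test, the ridge parameter lam, and the simple studentization are illustrative choices under stated assumptions, not the paper's exact procedure.

```python
import numpy as np
from scipy.stats import norm

def online_projection_test(X, mu0, k, lam=0.5):
    """Schematic one-by-one online-style projection test.

    Assumed simplification: the projection direction is re-estimated by a
    ridge-type plug-in w = (S + lam*I)^{-1}(xbar - mu0) from the observations
    seen so far; the constrained and regularized quadratic programming
    estimator could be substituted here.
    """
    n, p = X.shape
    scores = []
    for i in range(k, n):
        past = X[:i] - mu0                      # only the first i observations
        S = np.cov(past, rowvar=False)          # their sample covariance
        w = np.linalg.solve(S + lam * np.eye(p), past.mean(axis=0))
        scores.append(w @ (X[i] - mu0))         # project the new observation
    scores = np.asarray(scores)
    T = np.sqrt(len(scores)) * scores.mean() / scores.std(ddof=1)
    return T, norm.sf(T)    # w targets Sigma^{-1} mu, so large T favors H1

# Illustrative usage with synthetic data:
# rng = np.random.default_rng(1)
# X = rng.normal(size=(200, 50))
# T, pval = online_projection_test(X, np.zeros(50), k=60)
```

The key structural point is that each observation is projected with a direction estimated only from earlier observations, so under $H_0$ the projected scores are conditionally centered, which is what drives the asymptotic standard normal null distribution.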


Acknowledgments

The authors would like to thank the AE and reviewers for their constructive comments, which led to a significant improvement of this work.

Funding

Zhong's research was supported by National Natural Science Foundation of China (NNSFC) grants 11922117, 12231011 and 7198801, and a National Statistical Science Research grant 2022LD0. Li's research was supported by National Science Foundation (NSF) grant DMS-1820702 and NIH grants R01AI136664 and R01AI170249. The content is solely the responsibility of the authors and does not necessarily represent the official views of NNSFC, NSF or NIH.

Footnotes

Supplementary Materials

Supplemental materials contain some useful lemmas, technical proofs and additional numerical results.

Disclosure Statement

The authors report there are no competing interests to declare.

References

  1. Bai Z and Saranadasa H (1996), 'Effect of high dimension: By an example of a two sample problem', Statistica Sinica 6, 311–329.
  2. Boyd S, Parikh N, Chu E, Peleato B and Eckstein J (2011), 'Distributed optimization and statistical learning via the alternating direction method of multipliers', Foundations and Trends in Machine Learning 3(1), 1–122.
  3. Cai T, Liu W and Xia Y (2014), 'Two-sample test of high dimensional means under dependence', Journal of the Royal Statistical Society: Series B (Statistical Methodology) 76(2), 349–372.
  4. Chang J, Zheng C, Zhou W-X and Zhou W (2017), 'Simulation-based hypothesis testing of high dimensional means under covariance heterogeneity', Biometrics 73(4), 1300–1310.
  5. Chen SX and Qin Y-L (2010), 'A two-sample test for high-dimensional data with applications to gene-set testing', The Annals of Statistics 38(2), 808–835.
  6. Chen X, Xu M and Wu WB (2016), 'Regularized estimation of linear functionals of precision matrices for high-dimensional time series', IEEE Transactions on Signal Processing 64(24), 6459–6470.
  7. Fan J and Li R (2001), 'Variable selection via nonconcave penalized likelihood and its oracle properties', Journal of the American Statistical Association 96(456), 1348–1360.
  8. Fan J, Li R, Zhang C and Zou H (2020), Statistical Foundations of Data Science, Chapman and Hall/CRC, Boca Raton, FL.
  9. Fan J, Liao Y and Liu H (2016), 'An overview of the estimation of large covariance and precision matrices', The Econometrics Journal 19(1), C1–C32.
  10. Fan J, Xue L and Zou H (2014), 'Strong oracle optimality of folded concave penalized estimation', The Annals of Statistics 42(3), 819–849.
  11. Fang EX, He B, Liu H and Yuan X (2015), 'Generalized alternating direction method of multipliers: new theoretical insights and applications', Mathematical Programming Computation 7(2), 149–187.
  12. Hotelling H (1931), 'The generalization of Student's ratio', The Annals of Mathematical Statistics 2(3), 360–378.
  13. Huang Y (2015), 'Projection test for high-dimensional mean vectors with optimal direction', PhD thesis, Department of Statistics, The Pennsylvania State University, University Park.
  14. Huang Y, Li C, Li R and Yang S (2022), 'An overview of tests on high-dimensional means', Journal of Multivariate Analysis 188, 104813.
  15. Lauter J (1996), 'Exact t and F tests for analyzing studies with multiple endpoints', Biometrics 52(3), 964–970.
  16. Li C and Li R (2021), 'Linear hypothesis testing in linear models with high dimensional responses', Journal of the American Statistical Association, pp. 1–37.
  17. Liu H, Yao T and Li R (2016), 'Global solutions to folded concave penalized nonconvex learning', The Annals of Statistics 44(2), 629–659.
  18. Loh P-L and Wainwright MJ (2015), 'Regularized M-estimators with nonconvexity: statistical and algorithmic theory for local optima', Journal of Machine Learning Research 16(1), 559–616.
  19. Lopes M, Jacob L and Wainwright MJ (2011), A more powerful two-sample test in high dimensions using random projection, in 'Advances in Neural Information Processing Systems', Vol. 24, pp. 1206–1214.
  20. Percival CJ, Huang Y, Jabs EW, Li R and Richtsmeier JT (2014), 'Embryonic craniofacial bone volume and bone mineral density in Fgfr2+/P253R and nonmutant mice', Developmental Dynamics 243(4), 541–551.
  21. Srivastava MS and Du M (2008), 'A test for the mean vector with fewer observations than the dimension', Journal of Multivariate Analysis 99(3), 386–402.
  22. Wang L, Kim Y and Li R (2013), 'Calibrating non-convex penalized regression in ultra-high dimension', The Annals of Statistics 41(5), 2505–2536.
  23. Wellcome Trust Case Control Consortium (2007), 'Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls', Nature 447(7145), 661–678.
  24. Xu G, Lin L, Wei P and Pan W (2016), 'An adaptive two-sample test for high-dimensional means', Biometrika 103(3), 609–624.
  25. Zou H and Li R (2008), 'One-step sparse estimates in nonconcave penalized likelihood models', The Annals of Statistics 36(4), 1509–1533.
