Power Enhancement in High Dimensional Cross-Sectional Tests

Jianqing Fan; Yuan Liao; Jiawei Yao

doi:10.3982/ECTA12749

. Author manuscript; available in PMC: 2016 Jan 15.

Published in final edited form as: Econometrica. 2015 Jul 1;83(4):1497–1541. doi: 10.3982/ECTA12749

Power Enhancement in High Dimensional Cross-Sectional Tests

Jianqing Fan ^*,^†, Yuan Liao ^‡, Jiawei Yao ^*

PMCID: PMC4714420 NIHMSID: NIHMS749711 PMID: 26778846

Abstract

We propose a novel technique to boost the power of testing a high-dimensional vector H : θ = 0 against sparse alternatives where the null hypothesis is violated only by a couple of components. Existing tests based on quadratic forms such as the Wald statistic often suffer from low powers due to the accumulation of errors in estimating high-dimensional parameters. More powerful tests for sparse alternatives such as thresholding and extreme-value tests, on the other hand, require either stringent conditions or bootstrap to derive the null distribution and often suffer from size distortions due to the slow convergence. Based on a screening technique, we introduce a “power enhancement component”, which is zero under the null hypothesis with high probability, but diverges quickly under sparse alternatives. The proposed test statistic combines the power enhancement component with an asymptotically pivotal statistic, and strengthens the power under sparse alternatives. The null distribution does not require stringent regularity conditions, and is completely determined by that of the pivotal statistic. As specific applications, the proposed methods are applied to testing the factor pricing models and validating the cross-sectional independence in panel data models.

Keywords: sparse alternatives, thresholding, large covariance matrix estimation, Wald-test, screening, cross-sectional independence, factor pricing model

1 Introduction

High-dimensional cross-sectional models have received growing attentions in both theoretical and applied econometrics. These models typically involve a structural parameter, whose dimension can be either comparable or much larger than the sample size. This paper addresses testing a high-dimensional structural parameter:

H_{0} : θ = 0,

where N = dim(θ) is allowed to grow faster than the sample size T. We are particularly interested in boosting the power in sparse alternatives under which θ is approximately a sparse vector. This type of alternative is of particular interest, as the null hypothesis typically represents some economic theory and violations are expected to be only by some exceptional individuals.

A showcase example is the factor pricing model in financial economics. Let y_it be the excess return of the i-th asset at time t, and f_t = (f₁_t,…, f_Kt)′ be the K-dimensional observable factors. Then, the excess return has the following decomposition:

y_{i t} = θ_{i} + b_{i}^{'} f_{t} + u_{i t}, i = 1, \dots, N, t = 1, \dots, T,

where b_i = (b_i₁,…, b_iK)′ is a vector of factor loadings and u_it represents the idiosyncratic error. The key implication from the multi-factor pricing theory is that the intercept θ_i should be zero, known as the “mean-variance efficiency” pricing, for any asset i. An important question is then if such a pricing theory can be validated by empirical data, namely we wish to test the null hypothesis H₀ : θ = 0, where θ = (θ₁, …, θ_N)′ is the vector of intercepts for all N financial assets. As the factor pricing model is derived from theories of financial economics (Ross, 1976), one would expect that inefficient pricing by the market should only occur to a small fractions of exceptional assets. Indeed, our empirical study of the constituents in the S&P 500 index indicates that there are only a couple of significant nonzero-alpha stocks, corresponding to a small portion of mis-priced stocks instead of systematic mis-pricing of the whole market. Therefore, it is important to construct tests that have high power when θ is sparse.

Most of the conventional tests for H₀ : θ = 0 are based on a quadratic form:

W = {\hat{θ}}^{'} V \hat{θ} .

Here $\hat{θ}$ is an element-wise consistent estimator of θ, and V is a high-dimensional positive definite weight matrix, often taken to be the inverse of the asymptotic covariance matrix of $\hat{θ}$ (e.g., the Wald test). After a proper standardization, the standardized W is asymptotically pivotal under the null hypothesis. In high-dimensional testing problems, however, various difficulties arise when using a quadratic statistic. First, when N > T, estimating V is challenging, as the sample analogue of the covariance matrix is singular. More fundamentally, tests based on W have low powers under sparse alternatives. The reason is that the quadratic statistic accumulates high-dimensional estimation errors under H₀, which results in large critical values that can dominate the signals in the sparse alternatives. A formal proof of this statement will be given in Section 3.3.

To overcome the aforementioned difficulties, this paper introduces a novel technique for high-dimensional cross-sectional testing problems, called the “power enhancement”. Let J₁ be a test statistic that has a correct asymptotic size (e.g., Wald statistic), which may suffer from low powers under sparse alternatives. Let us augment the test by adding a power enhancement component J₀ ≥ 0, which satisfies the following three properties:

Power Enhancement Properties

Non-negativity: J₀ ≥ 0 almost surely.
No-size-distortion: Under H₀, P (J₀ = 0|H₀) → 1.
Power-enhancement: J₀ diverges in probability under some specific regions of alternatives H_a.

Our constructed power enhancement test takes the form

J = J_{0} + J_{1} .

The non-negativity property of J₀ ensures that J is at least as powerful as J₁. Property (b) guarantees that the asymptotic null distribution of J is determined by that of J₁, and the size distortion due to adding J₀ is negligible, and property (c) guarantees significant power improvement under the designated alternatives. The power enhancement principle is thus summarized as follows: Given a standard test statistic with a correct asymptotic size, its power is substantially enhanced with little size distortion; this is achieved by adding a component J₀ that is asymptotically zero under the null, but diverges and dominates J₁ under some specific regions of alternatives.

An example of such a J₀ is a screening statistic:

J_{0} = \sqrt{N} \sum_{j \in \hat{S}} {\hat{θ}}_{j}^{2} {\hat{v}}_{j}^{- 1} = \sqrt{N} \sum_{j = 1}^{N} {\hat{θ}}_{j}^{2} {\hat{v}}_{j}^{- 1} 1 {| \hat{θ_{j}} | > {\hat{v}}_{j}^{1 / 2} δ_{N, T}},

where $\hat{S} = {j \leq N : | \hat{θ_{j}} | > {\hat{v}}_{j}^{1 / 2} δ_{N, T}}$ , and $\hat{v_{j}}$ denotes a data-dependent normalizing factor, taken as the estimated asymptotic variance of $\hat{θ_{j}}$ . The critical value δ_N,T, depending on (N, T), is a high-criticism threshold, chosen to be slightly larger than the noise level ${max}_{j} \leq_{N} | {\hat{θ}}_{j} - θ_{j} | / {\hat{v}}_{j}^{1 / 2}$ so that under H₀, J₀ = 0 with probability approaching one. In addition, we take J₁ as a pivotal statistic, e.g., standardized Wald statistic or other quadratic forms such as the sum of the squared marginal t-statistics (Bai and Saranadasa, 1996; Chen and Qin, 2010; Pesaran and Yamagata, 2012). The screening set $\hat{S}$ also captures indices where the null hypothesis is violated.

One of the major differences of our test from most of the thresholding tests (Fan, 1996; Hansen, 2005) is that, it enhances the power substantially by adding a screening statistic, which does not introduce extra difficulty in deriving the asymptotic null distribution. Since J₀ = 0 under H₀, it relies on the pivotal statistic J₁ to determine its null distribution. In contrast, the existing tests such as thresholding, extreme value, and higher criticism tests (e.g., Hall and Jin (2010)) often require stringent conditions to derive their asymptotic null distributions, making them restrictive in econometric applications, due to slow rates of convergence. Moreover, the asymptotic null distributions are inaccurate at finite sample. As pointed out by Hansen (2003), these statistics are non-pivotal even asymptotically, and require bootstrap methods to simulate the null distributions.

As a specific application, in addition to testing the aforementioned factor pricing model, this paper also studies the tests for cross-sectional independence in mixed effect panel data models:

y_{i t} = α + x_{i t}^{'} β + μ_{i} + u_{i t}, i \leq n, t \leq T .

Let ρ_ij denote the correlation between u_it and u_jt, assumed to be time invariant. The “cross-sectional independence” test is concerned about the following null hypothesis:

H_{0} : ρ_{i j} = 0, for all i \neq j,

that is, under the null hypothesis, the n × n covariance matrix Σ_u of {u_it}_i≤n is diagonal. In empirical applications, weak cross-sectional correlations are often present, which results in a sparse covariance Σ_u with just a few nonzero off-diagonal elements. Namely, the vector θ = (ρ₁₂, ρ₁₃,…, ρ_n₋₁_,n) is sparse and should be incorporated to improve power of the test. The dimensionality N = n(n − 1)/2 can be much larger than the number of observations. Therefore, the power enhancement in sparse alternatives is very important to the testing problem. By choosing δ_N,T to dominate ${max}_{i, j \leq n} {| {\hat{ρ}}_{i j} | / {\hat{v}}_{i j}^{1 / 2} : ρ_{i j} = 0}$ as detailed in Section 5, under the sparse alternative, the set $\hat{S}$ “screens out” most of the estimation noises, and contains only a few indices of the nonzero off-diagonal entries. Therefore, $\hat{S}$ not only reveals the sparse structure of Σ_u, but also determines the nonzero off-diagonal entries with an over-whelming probability.

There has been a large literature on high-dimensional cross-sectional tests. For instance, the literature on testing the factor pricing model is found in Gibbons et al. (1989), MacKinlay and Richardson (1991), Beaulieu et al. (2007) and Pesaran and Yamagata (2012), all in quadratic forms. Gagliardini et al. (2011) studied estimation of the risk premia in a CAPM and its associated testing problem, which is closely related to our work. While we also study a large panel of stock returns as a specific example and double asymptotics (as N, T → ∞), the problems and approaches being considered are very different. This paper addresses a general problem of enhancing powers under high-dimensional sparse alternatives.

For the mixed effect panel data model, most of the testing statistics are based on the sum of squared residual correlations, which also accumulates many off-diagonal estimation errors in estimating the covariance matrix of (u₁_t, …, u_nt). See, for example, Breusch and Pagan (1980), Pesaran et al. (2008), and Baltagi et al. (2012). Our problem is also related to the test with a restricted parameter space, previously considered by Andrews (1998), who improves the power by directing towards the “relevant” alternatives; see also Hansen (2003) for a related idea. Chernozhukov et al. (2013) proposed a high-dimensional inequality test, and employed an extreme value statistic, whose critical value is determined through applying the moderate deviation theory on an upper bound of the rejection probability. In contrast, the asymptotic distribution of our proposed power enhancement statistic is determined through the pivotal statistic J₁, and the power is improved via the contributions of sparse alternatives that survive the screening process.

The remainder of the paper is organized as follows. Section 2 sets up the preliminaries and highlights the major differences from existing tests. Section 3 presents the main result of power enhancement test. As applications to specific cases, Section 4 and Section 5 respectively study the factor pricing model and test of cross-sectional independence. Section 6 presents simulation results are empirical evidence of sparse alternatives based on the real data. Section 7 provides further discussions. Proofs are given in the supplementary material.

Throughout the paper, for a symmetric matrix A, let λ_min(A) and λ_max(A) represent its minimum and maximum eigenvalues. Let ‖A‖₂ and ‖A‖₁ denote its operator norm and l₁-norm respectively, defined by ${‖ A ‖}_{2} = λ_{max}^{1 / 2} (A^{'} A)$ and max_i Σ_j |A_ij|. For a vector θ, define $‖ θ ‖ = {(\sum_{j} θ_{j}^{2})}^{1 / 2}$ and ‖θ‖_max = max_j |θ_j|. For two deterministic sequences a_T and b_T, we write a_T ≪ b_T (or equivalently b_T ≫ a_T) if a_T = o(b_T). Also, a_T ⩆ b_T if there are constants C₁, C₂ > 0 so that C₁b_T ≤ a_T ≤ C₂b_T for all large T. Finally, we denote |S|₀ as the number of elements in a set S.

2 Power Enhancement in high dimensions

This section introduces power enhancement techniques and provides heuristics to justify the techniques. The differences from related methods in the literature are also highlighted.

2.1 Power enhancement

Consider a testing problem:

H_{0} : θ = 0, H_{a} : θ \in Θ_{a},

where Θ_a ⊂ ℝ^N\{0} is an alternative set in ℝ^N. A typical example is Θ_a = {θ: θ ≠ 0}. Suppose we observe a stationary process $D = {D_{t}}_{t = 1}^{T}$ of size T. Let J₁(D) be a certain test statistic, which will also be written as J₁. Often J₁ is constructed such that under H₀, it has a non-degenerate limiting distribution F : As T, N → ∞,

J_{1} | H_{0} \to^{d} F .

(2.1)

For the significance level q ∈ (0, 1), let F_q be the critical value for J₁. Then the critical region is taken as {D: J₁ > F_q} and satisfies

lim_{T, N \to ∞} P (J_{1} > F_{q} | H_{0}) = q .

This ensures that J₁ has a correct asymptotic size. In addition, it is often the case that J₁ has high power against H₀ on a subset Θ(J₁) ⊂ Θ_a, namely,

lim_{T, N \to ∞} inf_{θ \in Θ (J_{1})} P (J_{1} > F_{q} | θ) \to 1.

Typically, Θ(J₁) consists of those θ′s, whose l₂-norm is relatively large, as J₁ is normally an omnibus test (e.g. Wald test).

In a data-rich environment, econometric models often involve high-dimensional parameters in which dim(θ) = N can grow fast with the sample size T. We are particularly interested in sparse alternatives Θ_s ⊂ Θ_a under which H₀ is violated only on a couple of exceptional components of θ. Specifically, Θ_s ∈ ℝ^N is a subset of Θ_a, and when θ ∈ Θ_s, the number of non-vanishing components is much less than N. As a result, its l₂-norm is relatively small. Therefore, under sparse alternative Θ_s, the omnibus test J₁ typically has a lower power, due to the accumulation of high-dimensional estimation errors. Detailed explanations are given in Section 3.3 below.

We introduce a power enhancement principle for high-dimensional sparse testing, by bringing in a data-dependent component J₀ that satisfies the Power Enhancement Properties defined in Section 1. The component J₀ does not serve as a test statistic on its own, but is added to a classical statistic J₁ that is often pivotal (e.g., Wald-statistic), so the proposed test statistic is defined by

J = J_{0} + J_{1} .

Our introduced “power enhancement principle” is explained as follows.

Under mild conditions, P (J₀ = 0|H₀) → 0 by construction. Hence when (2.1) is satisfied, we have
$\underset{T, N \to ∞}{lim \sup} P (J > F_{q} | H_{0}) = q .$
Therefore, adding J₀ to J₁ does not affect the size of the standard test statistic asymptotically. Both J and J₁ have the same limiting distribution under H₀.
The critical region of J is defined by
${D : J > F_{q}} .$
As J₀ ≥ 0, P (J > F_q|θ) ≥ P (J₁ > F_q|θ) for all θ ∈ Θ_a. Hence the power of J is at least as large as that of J₁.
When θ ∈ Θ_s is a sparse high-dimensional vector under the alternative, the “classical” test J₁ may have low power as ‖θ‖ is typically relatively small. On the other hand, for θ ∈ Θ_s, J₀ stochastically dominates J₁. As a result, P (J > F_q|θ) > P (J₁ > F_q|θ) strictly holds, so the power of J₁ over the set Θ_s is enhanced after adding J₀. Often J₀ diverges fast under sparse alternatives Θ_s, which ensures P (J > F_q|θ) → 1 for θ ∈ Θ_s. In contrast, the classical test only has P (J₁ > F_q|θ) < c < 1 for some c ∈ (0, 1) and θ ∈ Θ_s, and when ‖θ‖ is sufficiently small, P (J₁ > F_q|θ) is approximately q.

It is important to note that the power is enhanced without sacrificing the size asymptotically. In fact the power enhancement principle can be asymptotically fulfilled under a weaker condition J₀|H₀ →^p 0. However, we construct J₀ so that P (J₀ = 0|H₀) → 1 to ensure a good finite sample size.

2.2 Construction of power enhancement component

We construct a specific power enhancement component J₀ that satisfies (a)–(c) of the power enhancement properties, and identify the sparse alternatives in Θ_s. Such a component can be constructed via screening as follows. Suppose we have a consistent estimator $\hat{θ}$ such that ${max}_{j \leq N} | {\hat{θ}}_{j} - θ_{j} | = o_{P} (1)$ . For some slowly growing sequence δ_N,T → ∞ (as T, N → ∞), define a screening set:

\hat{S} = {j : | {\hat{θ}}_{j} | > {\hat{v}}_{j}^{1 / 2} δ_{N, T}, j = 1, \dots, N},

(2.2)

where ${\hat{v}}_{j} > 0$ is a data-dependent normalizing constant, often taken as the estimated asymptotic variance of ${\hat{θ}}_{j}$ . The sequence δ_N,T, called “high criticism”, is chosen to dominate the maximum-noise-level, satisfying: (recall that Θ_a denotes the alternative set)

inf_{θ \in Θ_{a} \cup {0}} P (max_{j \leq N} | {\hat{θ}}_{j} - θ_{j} | / {\hat{v}}_{j}^{1 / 2} < δ_{N, T} | θ) \to 1

(2.3)

for θ under both null and alternate hypotheses. The screening statistic J₀ is then defined as

J_{0} = \sqrt{N} \sum_{j \in \hat{S}} {\hat{θ}}_{j}^{2} {\hat{v}}_{j}^{- 1} = \sqrt{N} \sum_{j = 1}^{N} {\hat{θ}}_{j}^{2} {\hat{v}}_{j}^{- 1} 1 {| {\hat{θ}}_{j} | > {\hat{v}}_{j}^{1 / 2} δ_{N, T}} .

By (2.2) and (2.3), under H₀ : θ = 0,

P (J_{0} = 0 | H_{0}) = P (\hat{S} = \emptyset | H_{0}) = P (max_{j \leq N} | {\hat{θ}}_{j} | / {\hat{v}}_{j}^{1 / 2} \leq δ_{N, T} | H_{0}) \to 1.

Therefore J₀ satisfies the non-negativeness and no-size-distortion properties.

Let {v_j}_j≤N be the population counterpart of ${{\hat{v}}_{j}}_{j \leq N}$ . To satisfy the power-enhancement property, define

S (θ) = {j : | θ_{j} | > 3 v_{j}^{1 / 2} δ_{N, T}, j = 1, \dots, N},

(2.4)

and in particular S(0) = ∅. We shall show in Theorem 3.1 below that $P (S (θ) \subset \hat{S} | θ) \to 1$ , for all θ ∈ Θ_a ∪ {0}. Thus all the significant signals are contained in $\hat{S}$ with a high probability. If S (θ) ≠ Ø, then by the definition of $\hat{S}$ and δ_N,T → ∞, we have

P (J_{0} > \sqrt{N} | S (θ) \neq \emptyset) \geq P (\sqrt{N} \sum_{j \in \hat{S}} δ_{N, T}^{2} > \sqrt{N} | S (θ) \neq \emptyset) \to 1.

Thus, the power of J₁ is enhanced on the subset

Θ_{s} \equiv {θ \in ℝ^{N} : S (θ) \neq \emptyset} = {θ \in ℝ^{N} : max_{j \leq N} \frac{| θ_{j} |}{v_{j}^{1 / 2}} > 3 δ_{N, T}} .

Furthermore, $\hat{S}$ not only reveals the sparse structure of θ under the alternative, but also determines the nonzero entries with an over-whelming probability.

The introduced J₀ can be combined with any other test statistic with an accurate asymptotic size. Suppose J₁ is a “classical” test statistic. Our power enhancement test is simply

J = J_{0} + J_{1} .

For instance, suppose we can consistently estimate the asymptotic inverse covariance matrix of $\hat{θ}$ , denoted by $\hat{var} {\hat{(θ)}}^{- 1}$ , then J₁ can be chosen as the standardized Wald-statistic:

J_{1} = \frac{{\hat{θ}}^{'} \hat{var} {\hat{(θ)}}^{- 1} \hat{θ} - N}{\sqrt{2 N}} .

As a result, the asymptotic distribution of J is $N$ (0, 1) under the null hypothesis.

In sparse alternatives where ‖θ‖ may not grow fast enough with N but θ ∈ Θ_s, the combined test J₀ + J₁ can be very powerful. In contrast, we will formally show in Theorem 3.4 below that the conventional Wald test J₁ can have very low power on its own. On the other hand, when the alternative is “dense” in the sense that ‖θ‖ grows fast with N, the conventional test J₁ itself is consistent. In this case, J is still as powerful as J₁. Therefore, if we denote Θ(J₁) ⊂ ℝ^N/{0} as the set of alternative θ’s against which the classical J₁ test has power converging to one, then the combined J = J₀ + J₁ test has power converging to one against θ on

Θ_{s} \cup Θ (J_{1}) .

We shall show in Section 3 that the power is enhanced uniformly over θ ∈ Θ_s ∪ Θ(J₁). In addition, the set $\hat{S}$ indicates which components may violate the null hypothesis.

2.3 Comparisons with thresholding and extreme-value tests

One of the fundamental differences between our power enhancement component J₀ and existing tests with good power under sparse alternatives is that, existing test statistics have a non-degenerate distribution under the null, and often require either bootstrap or strong conditions to derive the null distribution. Such convergences are typically slow and the serious size distortion appears in finite sample. In contrast, our screening statistic J₀ uses “high criticism” sequence δ_N,T to make P (J₀ = 0|H₀) → 1, hence does not serve as a test statistic on its own. The asymptotic null distribution is determined by that of J₁, which usually not hard to derive As we shall see in sections below, the required regularity condition is relatively mild, which makes the power enhancement test widely applicable to many econometric problems.

In the high-dimensional testing literature, there are mainly two types of statistics with good power under sparse alternatives: extreme value test and thresholding test respectively. The test based on extreme values studies the maximum deviation from the null hypothesis across the components of $\hat{θ} = (\hat{θ_{1}}, \dots, \hat{θ_{N}})$ , and forms the statistic based on ${max}_{j \leq N} | \frac{{\hat{θ}}_{j}}{w_{j}} |^{δ}$ for some δ > 0 and a weight w_j (e.g., Cai et al. (2013), Chernozhukov et al. (2013)). Such a test statistic typically converges slowly to its asymptotic counterpart. An alternative test is based on thresholding: for some δ > 0 and pre-determined threshold level t_N,

R = \sqrt{T} \sum_{j = 1}^{N} | \frac{{\hat{θ}}_{j}}{w_{j}} |^{δ} 1 {| {\hat{θ}}_{j} | > t_{N} w_{j}}

(2.5)

For example, when t_T is taken slightly less than ${max}_{j \leq N} | {\hat{θ}}_{j} | / w_{j}$ , R becomes the extreme statistic. When t_T is small (e.g. 0), R becomes a traditional test, which is not powerful in detecting sparse alternatives, though it can have good size properties. The accumulation of estimation errors is prevented due to the threshold $1 {| {\hat{θ}}_{j} | > t_{N} w_{j}}$ for sufficiently large t_N (see, e.g., Fan (1996) and Zhong et al. (2013)). In a low-dimensional setting, Hansen (2005) suggested using a threshold to enhance the power in a similar way.

Although (2.5) looks similar to J₀, the ideas behind are very different. Both the extreme value test and the thresholding test require regularity conditions that may be restrictive in econometric applications. For instance, it can be difficult to employ the central limit theorem directly on (2.5), as it requires the covariance between $\hat{θ_{j}}$ and ${\hat{θ}}_{j + k}$ decay fast enough as k → ∞ (Zhong et al., 2013). In cross-sectional testing problems, this essentially requires an explicit ordering among the cross-sectional units which is, however, often unavailable in panel data applications. In addition, as (2.5) involves effectively limited terms of summations due to thresholding, the asymptotic theory does not provide adequate approximations, resulting in size-distortion in applications. We demonstrate the size-distortion in the simulation study.

3 Asymptotic properties

3.1 Main results

This section presents the regularity conditions and formally establishes the claimed power enhancement properties. Below we use P (·|θ) to denote the probability measure defined from the sampling distribution with parameter θ. Let Θ ⊂ ℝ^N be the parameter space of θ that covers the union of {0} and the alternative set Θ_a. When we write inf_θ∈_Θ P (·|θ), the infimum is taken in the space that covers the union of both null and alternative space.

We begin with a high-level assumption. In specific applications, they can be verified with primitive conditions.

Assumption 3.1

As T, N → ∞, the sequence δ_N,T → ∞, and the estimators ${\hat{θ_{j}}, \hat{v_{j}}}_{j \leq N}$ are such that

${inf}_{θ \in Θ} P ({max}_{j \leq N} | \hat{θ_{j}} - θ_{j} | / {\hat{v}}_{j}^{1 / 2} < δ_{N, T} | θ) \to 1$ ;
${inf}_{θ \in Θ} P (4 / 9 < \hat{v_{j}} / v_{j} < 9 / 4, \forall j = 1, \dots, N | θ) \to 1$ .

The normalizing constant v_j is often taken as the asymptotic variance of $\hat{θ_{j}}$ , with $\hat{v_{j}}$ being its consistent estimator. The constants 4/9 and 9/4 in condition (ii) are not optimally chosen, as this condition only requires ${\hat{v_{j}}}_{j \leq N}$ be not-too-bad estimators of their population counterparts.

In many high-dimensional problems with strictly stationary data that satisfy strong mixing conditions, following from the large-deviation theory, typically, ${max}_{j \leq N} | \hat{θ_{j}} - θ_{j} | / {\hat{v}}_{j}^{1 / 2} = O_{P} (\sqrt{\log N})$ . Therefore, we shall fix

δ_{N, T} = \log (\log T) \sqrt{\log N},

(3.1)

which is a high criticism that slightly dominates the standardized noise level (it may be useful to recall that the maximum of N i.i.d. Gaussian noises with a bounded variance behaves as $\sqrt{\log N}$ asymptotically). We shall provide primitive conditions for this choice of δ_N,T in the subsequent sections, so that Assumption 3.1 holds.

Recall that $\hat{S}$ and S(θ) are defined by (2.2) and (2.4) respectively. In particular, $S (θ) = {j : | θ_{j} | > 3 v_{j}^{1 / 2} δ_{N, T,} j = 1, \dots, N}$ , so under H₀ : θ = 0, S(θ) = ∅. The following theorem characterizes the asymptotic behavior of $J_{0} = \sqrt{N} \sum_{j \in \hat{S}} \hat{θ_{j}^{2}} {\hat{v}}_{j}^{- 1}$ under both the null and alternative hypotheses.

Theorem 3.1

Let Assumption 3.1 hold. As T, N →∞, we have under H₀ : θ = 0, $P (\hat{S} = \emptyset | H_{0}) \to 1$ . Hence

P (J_{0} = 0 | H_{0}) \to 1 and inf_{{θ \in Θ : S (θ) \neq \emptyset}} P (J_{0} > \sqrt{N} | θ) \to 1.

In addition,

inf_{θ \in Θ} P (S (θ) \subset \hat{S} | θ) \to 1.

Besides the asymptotic behavior of J₀, Theorem 3.1 also establishes a “sure screening” property of $\hat{S}$ , which means it selects all the significant components whose indices are in S(θ). This result is achieved uniformly in θ under both the null and alternative hypotheses.

Remark 3.1

Under additional mild assumptions, it can be further shown that $P (\hat{S} = S (θ) | θ) \to 1$ uniformly in θ. Hence the selection is consistent. While the selection consistency is not a requirement of the power enhancement principle, we refer to our earlier manuscript Fan et al. (2014) for technical details.

We are now ready to formally show the power enhancement argument. The enhancement is achieved uniformly on the following set:

Θ_{s} = {θ \in Θ : max_{j \leq N} \frac{| θ_{j} |}{v_{j}^{1 / 2}} > 3 δ_{N, T}} .

(3.2)

In particular, if $\hat{θ_{j}}$ is $\sqrt{T}$ -consistent, and $v_{j}^{1 / 2}$ is the asymptotic standard deviation of $\hat{θ_{j}},$ then $σ_{j} = \sqrt{T v_{j}}$ is bounded away from both zero and infinity. Using (3.1), we have

Θ_{s} = {θ \in Θ : max_{j \leq N} | θ_{j} | / σ_{j} > 3 \log (\log T) \sqrt{\frac{\log N}{T}}} .

This is a relatively weak condition on the strength of the maximal signal in order to be detected by J₀.

A test is said to have a high power uniformly on a set Θ* ⊂ ℝ^N \ {0} if

inf_{θ \in Θ *} P (reject H_{0} by the test| θ) \to 1.

For a given distribution function F, let F_q denote its qth quantile.

Theorem 3.2

Let Assumptions 3.1 hold. Suppose there is a test J₁ such that

it has an asymptotic non-degenerate null distribution F, and the critical region takes the form {D : J₁ > F_q} for the significance level q ∈ (0, 1),
it has a high power uniformly on some set Θ(J₁) ⊂ Θ,
there is c > 0 so that ${inf}_{θ \in Θ_{s}} P (c \sqrt{N} + J_{1} > F_{q} | θ) \to 1$ , as T, N → ∞.

Then the power enhancement test J = J₀ + J₁ has the asymptotic null distribution F, and has a high power uniformly on the set Θ_s ∪ Θ(J₁): as T, N → ∞

inf_{θ \in Θ_{s} \cup Θ (J_{1})} P (J > F_{q} | θ) \to 1.

The three required conditions for J₁ are easy to understand: Conditions (i) and (ii) respectively require the size and power conditions for J₁. Condition (iii) requires J₁ be dominated by J₀ under Θ_s. This condition is not restrictive since J₁ is typically standardized (e.g., Donald et al. (2003)).

Theorem 3.2 also shows that J₁ and J have the critical regions {D : J₁ > F_q} and {D : J > F_q} respectively, but the high power region is enhanced from Θ(J₁) to Θ_s ∪ Θ(J₁). In high-dimensional testing problems, Θ_s ∪ Θ(J₁) can be much larger than Θ(J₁). As a result, the power of J₁ can be substantially enhanced after J₀ is added.

3.2 Power enhancement for quadratic tests

As an example of J₁, we consider the widely used quadratic test statistic, which is asymptotically pivotal:

J_{Q} = \frac{T {\hat{θ}}^{'} V \hat{θ} - N (1 + μ_{N, T})}{ξ_{N, T} \sqrt{N}},

where μ_N,T and ξ_N,T are deterministic sequences that satisfy μ_N,T → 0 and ξ_N,T → ξ ∈ (0, ∞). The weight matrix V is positive definite, whose eigenvalues are bounded away from both zero and infinity. Here TV is often taken to be the inverse of the asymptotic covariance matrix of $\hat{θ}$ . Other popular choices are $V = diag(σ_{1}^{- 2},⋯, σ_{N}^{- 2})$ with $σ_{j} = \sqrt{T v_{j}}$ (Bai and Saranadasa, 1996; Chen and Qin, 2010; Pesaran and Yamagata, 2012) and V = I_N, the N × N identity matrix. We set J₁ = J_Q, whose power enhancement version is J = J₀ + J_Q. For the moment, we shall assume V to be known, and just focus on the power enhancement properties. We will deal with unknown V for testing the factor pricing problem in the next section.

Assumption 3.2

There is a non-degenerate distribution F so that under H₀, J_Q →^d F.
The critical value F_q = O(1) and the critical region of J_Q is {D : J_Q > F_q},
V is positive definite, and there exist two positive constants C₁ and C₂ such that C₁ ≤ λ_min(V) ≤ λ_max(V) ≤ C₂.
C₃ ≤ Tv_j ≤ C₄, j = 1, …, N for positive constants C₃ and C₄.

Analyzing the power properties of J_Q and applying Theorem 3.2, we obtain the following theorem. Recall that δ_N,T and Θ_s are defined by (3.1) and (3.2).

Theorem 3.3

Under Assumptions 3.1–3.2, the power enhancement test J = J₀ +J_Q satisfies: as T, N → ∞,

under the null hypothesis H₀ : θ = 0, J →^d F,
there is C > 0 so that J has high power uniformly on the set
$Θ_{s} \cup {θ \in Θ : {‖ θ ‖}^{2} > C δ_{N, T}^{2} N / T} \equiv Θ_{s} \cup Θ (J_{Q});$
that is, ${inf}_{θ \in Θ_{s} \cup Θ (J_{Q})} P (J > F_{q} | θ) \to 1$ for any q ∈ (0,1).

3.3 Low power of quadratic statistics under sparse alternatives

When J_Q is used on its own, it can suffer from a low power under sparse alternatives if N grows much faster than the sample size, even though it has been commonly used in the econometric literature. Mainly, $T {\hat{θ}}^{'} V \hat{θ}$ aggregates high-dimensional estimation errors under H₀, which become large with a non-negligible probability and potentially override the sparse signals under the alternative. The following result gives this intuition a more precise description.

To simplify our discussion, we shall focus on the Wald-test with TV being the inverse of the asymptotic covariance matrix of $\hat{θ}$ , assumed to exist. Specifically, we assume the standardized $T {\hat{θ}}^{'} V \hat{θ}$ to be asymptotically normal under H₀:

\frac{T {\hat{θ}}^{'} V \hat{θ} - N}{\sqrt{2 N}} | H_{0} \to^{d} N (0, 1) .

(3.3)

This is one of the most commonly seen cases in various testing problems. The diagonal entries of $\frac{1}{T} V^{- 1}$ are given by {v_j}_j≤N.

Theorem 3.4

Suppose that (3.3) holds with ‖V‖₁ < C and ‖V⁻¹‖₁ < C for some C > 0. Under Assumptions 3.1–3.2, $T = o (\sqrt{N})$ and log N = o(T^1−γ) for some 0 < γ < 1, the quadratic test J_Q has low power at the sparse alternative Θ_c for any c > 0, given by

Θ_{c} = {θ \in Θ : \sum_{j = 1}^{N} 1 {θ_{j} \neq 0} = o (\sqrt{N} / T), {‖ θ ‖}_{max} < c} .

In other words, ∀c > 0, ∀θ ∈ Θ_c, for any significance level q,

lim_{T, N \to ∞} P (J_{Q} > z_{q} | θ) = q,

where z_q is the qth upper quantile of standard normal distribution.

In the above theorem, the alternative is a sparse vector. However, using the quadratic test itself, the asymptotic power of the test is as low as q. This is because the signals in the sparse alternative are dominated by the aggregated high-dimensional estimation errors: $T \sum_{j : θ_{j} = 0} {\hat{θ}}_{j}^{2}$ . In contrast, the non-vanishing components of θ (fixed constants) are actually detectable by using J₀. The power enhancement test J₀ + J_Q takes this into account, and has a substantially improved power.

4 Application: Testing Factor Pricing Models

4.1 The model

The multi-factor pricing model, motivated by the Arbitrage Pricing Theory (APT) by Ross (1976), is one of the most fundamental results in finance. It postulates how financial returns are related to market risks, and has many important practical applications. Let y_it be the excess return of the i-th asset at time t and f_t = (f₁_t, …, f_Kt)′ be the observable factors. Then, the excess return has the following decomposition:

y_{i t} = θ_{i} + b_{i}^{'} f_{t} + u_{i t}, i = 1, \dots, N, t = 1, \dots, T,

where b_i = (b_i₁, …, b_iK)′ is a vector of factor loadings and u_it represents the idiosyncratic error. To make the notation consistent, we stick to use θ to represent the commonly used “alpha” in the finance literature.

While the APT does not necessarily deliver an observable factor model, the specification of an observable factor structure is of considerable interest and is often the case in empirical analyses. The key implication from the multi-factor pricing theory for tradable factors is that the intercept θ_i should be zero for any asset i. Such an exact feature of factor pricing can also be motivated from Connor (1984), who presented a competitive equilibrium version of the APT. An important question is then testing the null hypothesis

H_{0} : θ = 0,

(4.1)

namely, whether the factor pricing model is consistent with empirical data, where θ = (θ₁, …, θ_N)′ is the vector of intercepts for all N financial assets. One typically picks five-year monthly data, because the factor pricing model is technically a one-period model whose factor loadings can be time-varying; see Gagliardini et al. (2011) on how to model the time-varying effects using firm characteristics and market variables. As the theory of the factor pricing model applies to all tradable assets, rather than a handful selected portfolios, the number of assets N should be much larger than T. This ameliorates the selection biases in the construction of testing portfolios. On the other hand, our empirical study on the S&P500 index provides empirical evidence of sparse alternatives: there are only a few significantly nonzero components of θ, corresponding to a small portion of mis-priced stocks, instead of systematic mispricing of the whole market.

Most existing tests to the problem (4.1) are based on the quadratic statistic $W = T {\hat{θ}}^{'} V \hat{θ}$ , where $\hat{θ}$ is the OLS estimator for θ, and V is some positive definite matrix. Prominent examples are given by Gibbons et al. (1989), MacKinlay and Richardson (1991) and Beaulieu et al. (2007). When N is possibly much larger than T, Pesaran and Yamagata (2012) showed that, under regularity conditions (Assumption 4.1 below),

J_{1} = \frac{a_{f, T} T {\hat{θ}}^{'} \sum_{u}^{- 1} \hat{θ} - N}{\sqrt{2 N}} \to^{d} N (0, 1) .

where a_f,T > 0, given in the next subsection, is a constant that depends only on factors’ empirical moments, and Σ_u is the N×N covariance matrix of u_t = (u₁_t, …, u_Nt)′, assumed to be time-invariant.

Recently, Gagliardini et al. (2011) proposed a novel approach to modeling and estimating time-varying risk premiums using two-pass least-squares method under asset pricing restrictions. Their problems and approaches differ substantially from ours, though both papers study similar problems in finance. As a part of their model validation, they develop test statistics against the asset pricing restrictions and weak risk factors. Their test statistics are based on a weighted sum of squared residuals of the cross-sectional regression, which, like all classical test statistics, have power only when there are many violations of the asset pricing restrictions. They do not consider the issue of enhancing the power under sparse alternatives, nor do they involve a Wald statistic that depends on a high-dimensional covariance matrix. In fact, their testing power can be enhanced by using our techniques.

4.2 Power enhancement component

We propose a new statistic that depends on (i) the power enhancement component J₀, and (ii) a feasible Wald component based on a consistent covariance estimator for $\sum_{u}^{- 1}$ , which controls the size under the null even when N/T→∞.

Define $\bar{f} = \frac{1}{T} \sum_{t = 1}^{T} f_{t}$ and $w = {(\frac{1}{T} \sum_{t = 1}^{T} f_{t} f_{t}^{'})}^{- 1} \bar{f}$ . Also define

a_{f, T} = 1 - {\bar{f}}^{'} w, and a_{f} = 1 - E f_{t}^{'} {(E f_{t} f_{t}^{'})}^{- 1} E f_{t} .

The OLS estimator of θ can be expressed as

\hat{θ} = {({\hat{θ}}_{1}, \dots, {\hat{θ}}_{N})}^{'}, {\hat{θ}}_{j} = \frac{1}{T a_{f, T}} \sum_{t = 1}^{T} y_{j t} (1 - f_{t}^{'} w) .

(4.2)

When cov(f_t) is positive definite, under mild regularity conditions (Assumption 4.1 below), a_f,T consistently estimates a_f, and a_f > 0. In addition, without serial correlations, the conditional variance of ${\hat{θ}}_{j}$ (given {f_t}) converges in probability to

v_{j} = var (u_{j t}) / (T a_{f}),

which can be estimated by ${\hat{v}}_{j}$ based on the residuals of OLS estimator:

{\hat{v}}_{j} = \frac{1}{T} \sum_{t = 1}^{T} {\hat{u}}_{j t}^{2} / (T a_{f, T}), where {\hat{u}}_{j t} = y_{j t} - {\hat{θ}}_{j} - {\hat{b}}_{j}^{'} f_{t} .

We show in Proposition 4.1 below that ${max}_{j \leq N} | {\hat{θ}}_{j} - θ_{j} | / {\hat{v}}_{j}^{1 / 2} = O_{P} (\sqrt{\log N})$ . Therefore, $δ_{N, T} = \log (\log T) \sqrt{\log N}$ slightly dominates the maximum estimation noise. The screening set and the power enhancement component are defined as

\hat{S} = {j : | \hat{θ_{j}} | > {\hat{v}}_{j}^{1 / 2} δ_{N, T,} j = 1, \dots, N},

and

J_{0} = \sqrt{N} \sum_{j \in \hat{S}} {\hat{θ}}_{j}^{2} {\hat{v}}_{j}^{- 1} .

4.3 Feasible Wald test in high dimensions

Assuming no serial correlations among ${u_{t}}_{t = 1}^{T}$ and conditional homoskedasticity (Assumption 4.1 below), given the observed factors, the conditional covariance of $\hat{θ}$ is Σ_u/(Ta_f,T). If the covariance matrix Σ_u of u_t were known, the standardized Wald test statistic is

\frac{T a_{f, T} {\hat{θ}}^{'} \sum_{u}^{- 1} \hat{θ} - N}{\sqrt{2 N}} .

(4.3)

Under H₀ : θ = 0, it converges in distribution to $N (0, 1)$ .

Note that factor models are often only justified as being “approximate” (Chamberlain and Rothschild (1983)), where Σ_u is a non-diagonal covariance matrix of cross-sectionally correlated idiosyncratic errors (u₁_t, …, u_Nt). When N/T → ∞, it is difficult to consistently estimate $\sum_{u}^{- 1}$ , as there are O(N²) off-diagonal parameters. Without parametrizing the off-diagonal elements, we assume Σ_u = cov(u_t) a sparse matrix. This assumption is natural for large covariance estimations for factor models, and was previously considered by Fan et al. (2011). Since the common factors dictate preliminarily the co-movement across the whole panel, a particular asset’s idiosyncratic shock is usually correlated significantly only with a few of other assets. For example, some shocks only exert influences on a particular industry, but are not pervasive for the whole economy (Connor and Korajczyk, 1993). Recently, Gagliardini et al. (2011) also obtained a feasible test by using a similar thresholding technique as to be introduced below to estimate the asymptotic covariance matrix. They showed that the sparsity approach for estimating covariance matrices covers the block diagonal case, which is expected to be present in the factor modeling of stocks (as elaborated in Gagliardini et al. (2011), a typical example of the sparsity of Σ_u is due to the presence of remaining industry sector effects).

Following the approach of Bickel and Levina (2008), we can consistently estimate $\sum_{u}^{- 1}$ via thresholding: let $s_{i j} = \frac{1}{T} \sum_{t = 1}^{T} {\hat{u}}_{i t} {\hat{u}}_{j t}$ . Define the covariance estimator as

{({\sum^{^}}_{u})}_{i j} = {\begin{cases} s_{i i}, & if i = j, \\ h_{i j} (s_{i j}), & if i \neq j, \end{cases}

where h_ij(·) is a generalized thresholding function (Antoniadis and Fan, 2001; Rothman et al., 2009), with threshold value $τ_{i j} = C {(s_{i i} s_{j j} \frac{\log N}{T})}^{1 / 2}$ for some constant C > 0, designed to keep only the sample correlation whose magnitude exceeds $C {(\frac{\log N}{T})}^{1 / 2}$ . The hard-thresholding function, for example, is h_ij(x) = x1{|x| > τ_ij}, and many other thresholding functions such as soft-thresholding and SCAD (Fan and Li, 2001) are specific examples. In general, h_ij(·) should satisfy:

h_ij(z) = 0 if |z| < τ_ij;
|h_ij(z) − z| ≤ τ_ij;
there are constants a > 0 and b > 1 such that $| h_{i j} (z) - z | \leq a τ_{i j}^{2}$ if |z| > bτ_ij.

The thresholded covariance matrix estimator sets most of the off-diagonal estimation noises in ${\frac{1}{T} \sum_{t = 1}^{T} {\hat{u}}_{i t} {\hat{u}}_{j t}}_{i, j \leq N}$ to zero. As studied in Fan et al. (2013), the constant C in the threshold can be chosen in a data-driven manner so that ${\sum^{^}}_{u}$ is strictly positive definite in finite sample even when N > T.

With ${\sum^{^}}_{u}^{- 1}$ , we are ready to define the feasible standardized Wald statistic:

J_{wald} = \frac{T a_{f, T} {\hat{θ}}^{'} {\sum^{^}}_{u}^{- 1} \hat{θ} - N}{\sqrt{2 N}},

whose power can be enhanced under sparse alternatives by:

J = J_{0} + J_{wald} .

Remark 4.1

The thresholding approach described here can be modified to take advantages of the block structure of Σ_u. The covariance matrix can first be divided into blocks according to the industries, and then estimated block-by-block. The estimation procedure and theoretical analysis will be similar to the block-thresholding of Cai and Yuan (2012).

4.4 Does the thresholded covariance estimator affect the size?

A natural but technical question to address is that when Σ_u indeed admits a sparse structure, is the thresholded estimator ${\sum^{^}}_{u}^{- 1}$ accurate enough so that the feasible J_wald is still asymptotically normal? The answer is affirmative if N(log N)⁴ = o(T²), and still we can allow N/T → ∞. However, such a simple question is far more technically involved than anticipated, as we now explain.

When Σ_u is a sparse matrix, under regularity conditions (Assumption 4.2 below), Fan et al. (2011) showed that

{‖ \sum_{u}^{- 1} - {\sum^{^}}_{u}^{- 1} ‖}_{2} = O_{P} (\sqrt{\frac{\log N}{T}}) .

(4.4)

This convergence rate is minimax optimal for the sparse covariance estimation, by the lower bound derived by Cai et al. (2010). On the other hand, when replacing $\sum_{u}^{- 1}$ in (4.3) by ${\sum^{^}}_{u}^{- 1}$ , one needs to show that the effect of such a replacement is asymptotically negligible, namely, under H₀,

T {\hat{θ}}^{'} (\sum_{u}^{- 1} - {\sum^{^}}_{u}^{- 1}) \hat{θ} / \sqrt{N} = o_{P} (1) .

(4.5)

However, when θ = 0, it can be shown that ${‖ \hat{θ} ‖}^{2} = O_{P} (N / T)$ . Using this and (4.4), by the Cauchy-Schwartz inequality, we have

| T {\hat{θ}}^{'} (\sum_{u}^{- 1} - {\sum^{^}}_{u}^{- 1}) \hat{θ} | / \sqrt{N} = O_{P} (\sqrt{\frac{N \log N}{T}}) .

Thus, it requires N log N = o(T) to converge, which is basically a low-dimensional scenario.

The above simple derivation uses, however, a Cauchy-Schwartz bound, which is too crude for a large N. In fact, ${\hat{θ}}^{'} (\sum_{u}^{- 1} - {\sum^{^}}_{u}^{- 1}) \hat{θ}$ is a weighted estimation error of $\sum_{u}^{- 1} - {\sum^{^}}_{u}^{- 1}$ , where the weights $\hat{θ}$ “average down” the accumulated estimation errors in estimating elements of $\sum_{u}^{- 1}$ , and result in an improved rate of convergence. The formalization of this argument requires further regularity conditions and novel technical arguments. These are formally presented in the following subsection.

4.5 Regularity conditions

We are now ready to present the regularity conditions. These conditions are imposed for three technical purposes: (i) Achieving the uniform convergence for $\hat{θ} - θ$ as required in Assumption 3.1, (ii) defining the sparsity of Σ_u so that ${\sum^{^}}_{u}^{- 1}$ is consistent, and (iii) showing (4.5), so that the errors from estimating $\sum_{u}^{- 1}$ do not affect the size of the test.

Let $ℱ_{- ∞}^{0}$ and $ℱ_{T}^{∞}$ denote the σ-algebras generated by {f_t: −∞ ≤ t ≤ 0} and {f_t: T ≤ t ≤ ∞} respectively. In addition, define the α-mixing coefficient

α (T) = sup_{A \in ℱ_{- ∞}^{0}, B \in ℱ_{T}^{∞}} | P (A) P (B) - P (A B) | .

Assumption 4.1

{u_t}_t_≤_T is i.i.d. $N (0, \sum_{u})$ , where both ‖Σ_u‖₁ and ${‖ \sum_{u}^{- 1} ‖}_{1}$ are bounded;
{f_t}_t_≤_T is strictly stationary, independent of {u_t}_t_≤_T, and there are r₁, b₁ > 0 so that
$max_{i \leq K} P (| f_{i t} | > s) \leq exp (- {(s / b_{1})}^{r_{1}}) .$
There exists r₂ > 0 such that $r_{1}^{- 1} + r_{2}^{- 1} > 0.5$ and C > 0, for all T ∈ ℤ⁺,
$α (T) \leq exp (- C T^{r_{2}}) .$
cov(f_t) is positive definite, and max_i_≤_N ‖b_i‖ < c₁ for some c₁ > 0.

Some remarks are in order for the conditions in Assumption 4.1.

Remark 4.2

The above assumption, perhaps somewhat restrictive, substantially facilitates our technical analysis. Here u_t is required to be serially uncorrelated across t. Under this condition, the conditional covariance of $\hat{θ}$ , given the factors, has a simple expression Σ_u/(Ta_f,T). On the other hand, if serial correlations are present in u_t, there would be additional autocovariance terms in the covariance matrix, which need to be further estimated via regularizations. Moreover, given that Σ_u is a sparse matrix, the Gaussianity ensures that most of the idiosyncratic errors are cross-sectionally independent so that $cov (u_{i t}^{2}, u_{j t}^{l}) = 0$ , l = 1, 2, for most of the pairs in {(i, j): i ≠ j}.

Note that we do allow the factors {f_t}_t≤T to be weakly correlated across t, but satisfy the strong mixing condition Assumption 4.1 (iii).

Remark 4.3

The conditional homoskedasticity $E (u_{t} u_{t}^{'} | f_{t}) = E (u_{t} u_{t}^{'})$ is assumed, granted by condition (ii). We admit that handling conditional heteroskedasticity, while important in empirical applications, is very technically challenging in our context. Allowing the high-dimensional covariance matrix $E (u_{t} u_{t}^{'} | f_{t})$ to be time-varying is possible with suitable continuum of sparse conditions on the time domain. In that case, one can require the sparsity condition to hold uniformly across t and continuously apply thresholding. However, unlike in the traditional case, technically, estimating the family of large inverse covariances ${E {(u_{t} u_{t}^{'} | f_{t})}^{- 1} : t = 1, 2, \dots}$ uniformly over t is highly challenging. As we shall see in the proof of Proposition 4.2, even in the homoskedastic case, proving the effect of estimating $\sum_{u}^{- 1}$ to be first-order negligible when N/T → ∞ requires delicate technical analysis.

To characterize the sparsity of Σ_u in our context, define

m_{N} = max_{i \leq N} \sum_{j = 1}^{N} 1 {{(\sum_{u})}_{i j} \neq 0}, D_{N} = \sum_{i \neq j} 1 {{(\sum_{u})}_{i j} \neq 0} .

Here m_N represents the maximum number of nonzeros in each row, and D_N represents the total number of nonzero off-diagonal entries. Formally, we assume:

Assumption 4.2

Suppose N^1/2(log N)^γ = o(T) for some γ > 2, and

${min}_{{(\sum_{u})}_{i j} \neq 0} | {(\sum_{u})}_{i j} | ≫ \sqrt{(\log N) / T}$ ;
at least one of the following cases holds:
1. D_N = O(N^1/2), and $m_{N}^{2} = O (\frac{T}{N^{1 / 2} {(\log N)}^{γ}})$
2. D_N = O(N), and $m_{N}^{2} = O (1)$ .

As regulated in Assumption 4.2, we consider two kinds of sparse matrices, and develop our results for both cases. In the first case (Assumption 4.2 (ii)(a)), Σ_u is required to have no more than O(N¹^/²) off-diagonal nonzero entries, but allows a diverging m_N. One typical example of this case is that there are only a small portion (e.g., finitely many) of firms whose individual shocks (u_it) are correlated with many other firms’. In the second case (Assumption 4.2(ii)(b)), m_N should be bounded, but Σ_u can have O(N) off-diagonal nonzero entries. This allows block-diagonal matrices with finite size of blocks or banded matrices with finite number of bands. This case typically arises when firms’ individual shocks are correlated only within industries but not across industries.

Moreover, we require N¹^/²(log N)^γ = o(T), which is the price to pay for estimating a large error covariance matrix. But still we allow N/T → ∞. It is also required that the minimal signal for the nonzero components be larger than the noise level (Assumption 4.2 (i)), so that nonzero components are not thresholded off when estimating Σ_u.

4.6 Asymptotic properties

The following result verifies the uniform convergence required in Assumption 3.1 over the entire parameter space that contains both the null and alternative hypotheses. Recall that the OLS estimator and its asymptotic standard error are defined in (4.2).

Proposition 4.1

Suppose the distribution of (f_t, u_t) is independent of θ. Under Assumption 4.1, for $δ_{N, T} = \log (\log T) \sqrt{\log N}$ , as T, N → ∞,

\begin{array}{l} inf_{θ \in Θ} P (max_{j \leq N} | {\hat{θ}}_{j} - θ_{j} | / {\hat{v}}_{j}^{1 / 2} < δ_{N, T} | θ) \to 1. \\ inf_{θ \in Θ} P (4 / 9 < {\hat{v}}_{j} / v_{j} < 9 / 4, \forall_{j} = 1, \dots, N | θ) \to 1. \end{array}

Proposition 4.2

Under Assumptions 4.1, 4.2, and H₀,

J_{wald} = \frac{T a_{f, T} {\hat{θ}}^{'} {\sum^{^}}_{u}^{- 1} \hat{θ} - N}{\sqrt{2 N}} \to^{d} N (0, 1) .

As shown, the effect of replacing $\sum_{u}^{- 1}$ by its thresholded estimator is asymptotically negligible and the size of the standard Wald statistic can be well controlled.

We are now ready to apply Theorem 3.3 to obtain the asymptotic properties of J = J₀ + J_wald as follows. For $δ_{N, T} = \log (\log T) \sqrt{\log N}$ , let

\begin{array}{l} Θ_{s} = {θ \in Θ : max_{j \leq N} \frac{T^{1 / 2} | θ_{j} |}{{var}^{1 / 2} (u_{j t})} > 3 a_{f}^{- 1 / 2} δ_{N, T}}, \\ Θ (J_{wald}) = {θ \in Θ : {‖ θ ‖}^{2} > C δ_{N, T}^{2} N / T} . \end{array}

Theorem 4.1

Suppose the assumptions of Propositions 4.1 and 4.2 hold.

Under the null hypothesis H₀: θ = 0, as T, N → ∞,
$P (J_{0} = 0 | H_{0}) \to 0, J_{wald} \to^{d} N (0, 1),$
and hence
$J = J_{0} + J_{wald} \to^{d} N (0, 1) .$
There is C > 0 so that for any q ∈ (0, 1), as T, N → ∞,
$inf_{θ \in Θ} P (J_{0} > \sqrt{N} | θ) \to 1, inf_{θ \in Θ (J_{wald})} P (J_{wald} > z_{q} | θ) \to 1,$
and hence
$inf_{θ \in Θ_{s} \cup Θ (J_{wald})} P (J > z_{q} | θ) \to 1,$
where z_q denotes the 1 − q quantile of the standard normal distribution.

We see that the power is substantially enhanced after J₀ is added, as the region where the test has power is enlarged from Θ(J_wald) to Θ_s ∪ Θ(J_wald).

5 Application: Testing Cross-Sectional Independence

5.1 The model

Consider a mixed effect panel data model

y_{i t} = α + x_{i t}^{'} β + μ_{i} + u_{i t}, i \leq n, t \leq T,

where the idiosyncratic error u_it is assumed to be Gaussian. The regressor x_it could be correlated with the individual random effect μ_i, but is uncorrelated with u_it. Let ρ_ij denote the correlation between u_it and u_jt, assumed to be time invariant. The goal is to test the following hypothesis:

H_{0} : ρ_{i j} = 0, for all i \neq j,

that is, whether the cross-sectional dependence is present. It is commonly known that the cross-sectional dependence leads to efficiency loss for OLS, and sometimes it may even cause inconsistency (Andrews, 2005). Thus testing H₀ is an important problem in applied panel data models. If we let N = n(n − 1)/2, and θ = (ρ₁₂, …, ρ₁_n, ρ₂₃, …, ρ₂_n, …, ρ_n−₁_,n)′ be an N × 1 vector stacking all the mutual correlations, then the problem is equivalent to testing about a high-dimensional vector H₀: θ = 0. Note that often the cross-sectional dependences are weakly present. Hence the alternative hypothesis of interest is often a sparse vector θ, corresponding to a sparse covariance matrix Σ_u of u_it.

Most of the existing tests are based on the quadratic statistic $W = \sum_{i < j} T {\hat{ρ}}_{i j}^{2} = T {\hat{θ}}^{'} \hat{θ}$ , where ${\hat{ρ}}_{i j}$ is the sample correlation between u_it and u_jt, estimated by the within-OLS (Baltagi, 2008), and $\hat{θ} = ({\hat{ρ}}_{12}, \dots, {\hat{ρ}}_{n - 1, n})$ . Pesaran et al. (2008) and Baltagi et al. (2012) studied the rescaled W, and showed that after a proper standardization, the rescaled W is asymptotically normal when both n, T → ∞. However, the quadratic test suffers from a low power if Σ_u is a sparse matrix. In particular, as is shown in Theorem 3.4, when n/T → ∞, the quadratic test cannot detect the sparse alternatives with Σ_i<j1{ρ_ij ≠ 0} = o(n/T), which can be restrictive. Such a sparse structure is present, for instance, when Σ_u is a block-diagonal sparse matrix with finite block sizes.

5.2 Power enhancement test

Following the conventional notation of panel data models, let ${\hat{y}}_{i t} = y_{i t} - \frac{1}{T} \sum_{t = 1}^{T} y_{i t}$ , ${\hat{x}}_{i t} = x_{i t} - \frac{1}{T} \sum_{t = 1}^{T} x_{i t}$ , and ${\tilde{u}}_{i t} = u_{i t} - \frac{1}{T} \sum_{t = 1}^{T} u_{i t}$ . Then ${\tilde{y}}_{i t} = {\tilde{x}}^{'}_{i t} β + {\tilde{u}}_{i t}$ . The within-OLS estimator $\hat{β}$ is obtained by regressing ${\tilde{y}}_{i t}$ on ${\tilde{x}}_{i t}$ for all i and t, which leads to the estimated residual ${\hat{u}}_{i t} = {\tilde{y}}_{i t} - {\tilde{x}}^{'}_{i t} \hat{β}$ . Then ρ_ij is estimated by

{\hat{ρ}}_{i j} = \frac{{\hat{σ}}_{i j}}{{\hat{σ}}_{i i}^{1 / 2} {\hat{σ}}_{j j}^{1 / 2}}, {\hat{σ}}_{i j} = \frac{1}{T} \sum_{t = 1}^{T} {\hat{u}}_{i t} {\hat{u}}_{j t} .

For the within-OLS, the asymptotic variance of ${\hat{ρ}}_{i j}$ is given by $v_{i j} = {(1 - ρ_{i j}^{2})}^{2} / T$ , and is estimated by ${\hat{v}}_{i j} = {(1 - {\hat{ρ}}_{i j}^{2})}^{2} / T$ . Therefore the screening statistic for the power enhancement test is defined as

J_{0} = \sqrt{N} \sum_{(i, j) \in \hat{S}} {\hat{ρ}}_{i j}^{2} {\hat{v}}_{i j}^{- 1}, \hat{S} = {(i, j) : | {\hat{ρ}}_{i j} | / {\hat{v}}_{i j}^{1 / 2} > δ_{N, T}, i < j \leq n} .

(5.1)

where $δ_{N, T} = \log (\log T) \sqrt{\log N}$ as before. The set $\hat{S}$ screens out most of the estimation errors.

To control the size, we employ Baltagi et al. (2012)’s bias-corrected quadratic statistic:

J_{1} = \sqrt{\frac{1}{n (n - 1)}} \sum_{i < j} (T {\hat{ρ}}_{i j}^{2} - 1) - \frac{n}{2 (T - 1)} .

(5.2)

Under regularity conditions (Assumptions 5.1, 5.2 below), $J_{1} \to^{d} N (0, 1)$ under H₀. Then the power enhancement test can be constructed as J = J₀ + J₁. The power is substantially enhanced to cover the region

Θ_{s} = {θ : max_{i < j} \frac{\sqrt{T} | ρ_{i j} |}{1 - ρ_{i j}^{2}} > 3 \log (\log T) \sqrt{\log N}},

(5.3)

in addition to the region detectable by J₁ itself. As a byproduct, it also identifies pairs (i, j) for ρ_ij ≠ 0 through $\hat{S}$ . Empirically, this set helps us understand better the underlying pattern of cross-sectional correlations and subsequently the cause of the correlation.

5.3 Asymptotic properties

In order for the power to be uniformly enhanced, the parameter space of θ = (ρ₁₂, …, ρ₁_n, ρ₂₃, …, ρ₂_n, …, ρ_n−₁_,n)′ is required to be: θ is element-wise bounded away from ±1: there is ρ_max ∈ (0, 1),

Θ = {θ \in ℝ^{N} : {‖ θ ‖}_{max} \leq ρ_{max}} .

The following regularity conditions are imposed. They hold uniformly in θ ∈ Θ.

Assumption 5.1

There are C₁, C₂ > 0, so that

$\sum_{i \neq j \leq n} | E {\tilde{x}}^{'}_{i t} {\tilde{x}}_{i t} E (u_{i t} u_{j t}) | < C_{1} n$ ,
${max}_{j \leq n} E (u_{j t}^{4}) < C_{1}, {min}_{j \leq n} E (u_{j t}^{2}) < C_{2}$ ,

Condition (i) is needed for the within-OLS to be $\sqrt{n T}$ -consistent (see, e.g., Baltagi (2008)). It is usually satisfied by weak cross-sectional correlations (sparse alternatives) among the error terms, or weak dependence among the regressors. We require the second moment of u_jt be bounded away from zero uniformly in j ≤ n and θ ∈ Θ, so that the cross-sectional correlations can be estimated stably.

The following conditions are assumed in Baltagi et al. (2012), which are needed for the asymptotic normality of J₁ under H₀.

Assumption 5.2

${u_{t}}_{t \leq T}$ are i.i.d. N(0, Σ_u), $E (u_{t} | {f_{t}}_{t \leq T}, θ) = 0$ almost surely.
With probability approaching one, all the eigenvalues of $\frac{1}{T} \sum_{t = 1}^{T} {\tilde{x}}_{j t} {\tilde{x}}^{'}_{j t}$ are bounded away from both zero and infinity uniformly for j ≤ n.

Proposition 5.1

Under Assumptions 5.1 and 5.2, for $δ_{N, T} = \log (\log T) \sqrt{\log N}$ , and N = n(n − 1)/2, as T, N → ∞,

\begin{array}{l} inf_{θ \in Θ} P (max_{i j} | {\hat{ρ}}_{i j} - ρ_{i j} | / {\hat{v}}_{i j}^{1 / 2} < δ_{N, T} | θ) \to 1 \\ inf_{θ \in Θ} P (4 / 9 < {\hat{v}}_{i j} / v_{i j} < 9 / 4, \forall_{i} \neq j | θ) \to 1. \end{array}

Define

Θ (J_{1}) = {θ \in Θ : \sum_{i < j} ρ_{i j}^{2} \geq C n^{2} \log n / T} .

For J₁ defined in (5.2), let

J = J_{0} + J_{1} .

(5.4)

The main result is presented as follows.

Theorem 5.1

Suppose Assumptions 5.1, 5.2 hold. As T, N → ∞,

under the null hypothesis H₀: θ = 0,
$P (J_{0} = 0 | H_{0}) \to 0, J_{wald} \to^{d} N (0, 1),$
and hence
$J = J_{0} + J_{1} \to^{d} N (0, 1);$
there is C > 0 in the definition of Θ(J₁) so that for any q ∈ (0, 1),
$inf_{θ \in Θ_{s}} P (J_{0} > \sqrt{N} | θ) \to 1, inf_{θ \in Θ (J_{1})} P (J_{1} > z_{q} | θ) \to 1,$
and hence
$inf_{θ \in Θ_{s} \cup Θ (J_{1})} P (J_{1} > z_{q} | θ) \to 1.$

Note that the high power region is enhanced from Θ(J₁) to Θ_s ∪ Θ(J₁) uniformly over sparse alternatives. In particular, the required signal strength of Θ_s in (5.3) is mild: the maximum cross-sectional correlation is only required to exceed a magnitude of $\log (\log T) \sqrt{(\log N) / T}$ .

6 Numerical studies

In this section, Monte Carlo simulations are employed to examine the finite sample performance of the power enhancement tests. We also present empirical evidence of sparse alternatives in the factor pricing model using real data.

6.1 Testing factor pricing models

To mimic the real data application, we consider the Fama and French (1992) three-factor model:

y_{i t} = θ_{i} + b_{i}^{'} f_{t} + u_{i t} .

We simulate ${b_{i}}_{t = 1}^{N}$ , ${f_{t}}_{t = 1}^{T}$ and ${u_{t}}_{t = 1}^{T}$ independently from $N_{3} (μ_{B}, \sum_{B})$ , $N_{3} (μ_{f}, \sum_{f})$ , and $N_{N} (0, \sum_{u})$ respectively. The parameters are set to be the same as those in the simulations of Fan et al. (2013), which are calibrated using daily returns of S&P 500’s top 100 constituents, for the period from July 1^st, 2008 to June 29^th 2012. These parameters are listed in the following table.

Set Σ_u = diag{A₁, …, A_N/₄} to be a block-diagonal correlation matrix. Each diagonal block A_j is a 4 × 4 positive definite matrix, whose correlation matrix has equi-off-diagonal entry ρ_j, generated from Uniform[0, 0.5].

We evaluate the powers of our tests under two specific alternatives (we set N > T):

\begin{array}{l} spares alternative H_{a}^{1} : θ_{i} = {\begin{matrix} 0.3, & i \leq \frac{N}{T} \\ 0, & i > \frac{N}{T} \end{matrix} \\ weak theta H_{a}^{2} : θ_{i} = {\begin{matrix} \sqrt{\frac{\log N}{T}}, & i \leq N^{0.4} \\ 0, & i > N^{0.4} \end{matrix} . \end{array}

Under $H_{a}^{1}$ , there are only a few nonzero θ’s with a relatively large magnitude. Under $H_{a}^{2}$ , there are many non-vanishing θ’s, but their magnitudes are all relatively small. In our simulation setup, $\sqrt{\log N / T}$ varies from 0.05 to 0.10. We therefore expect that under $H_{a}^{1}$ , $P (\hat{S} = \emptyset)$ is close to zero, as most of the first N/T estimated θ’s should survive from the screening step. These survived $\hat{θ}$ ’s contribute importantly to the rejection of the null hypothesis. In contrast, $P (\hat{S} = \emptyset)$ should be much larger under $H_{a}^{2}$ because the non-vanishing θ’s are too weak to be detected.

Four testing methods are conducted and compared: the standardized Wald test J_wald, the thresholding test J_thr as in Fan (1996), and their power enhancement versions J₀ + J_wald and J₀ + J_thr. In particular, the thresholding test J_thr is defined as, $σ_{N}^{2} = \sqrt{2 / π} a^{- 1} t_{N}^{3} (1 + 3 t_{N}^{- 2})$ and $μ_{N} = \sqrt{2 / π} a^{- 1} t_{N} (1 + t_{N}^{- 2})$ ,

J_{thr} = σ_{N}^{- 1} (\sum_{j = 1}^{N} {\hat{θ}}_{j}^{2} {\hat{v}}_{j}^{- 1} 1 {| {\hat{θ}}_{j} | {\hat{v}}_{j}^{- 1 / 2} > t_{N}} - μ_{N}),

where $t_{N} = \sqrt{2 \log (N a)}$ , a = (log N)⁻². Here the threshold value t_N is chosen smaller than our δ_N,T, and it results in a non-degenerate null distribution of J_thr. When Σ_u is diagonal, its asymptotic null distribution is $N (0, 1)$ , but when Σ_u is non-diagonal, it can suffer from substantial size distortions (see Fan (1996) for detailed discussions). For each test, we calculate the relative frequency of rejection under H₀, $H_{a}^{1}$ and $H_{a}^{2}$ based on 2000 replications, with significance level q = 0.05. We also calculate the relative frequency of $\hat{S}$ being empty, which approximates $P (\hat{S} = \emptyset)$ . We use the soft-thresholding to estimate the error covariance matrix.

Table II presents the empirical size and power of each testing method. Numerical findings are summarized as follows.

Table II.

Size and Power (%) of tests for simulated Fama-French three-factor model

J_wald

J_thr

J₀ + J_wald

J₀ + J_thr

P (\hat{S} = \emptyset)

H₀

300

500

5.0%

7.2%

5.2%

7.3%

99.9%

800

5.4%

7.7%

5.6%

7.9%

99.8%

1000

4.7%

7.8%

4.9%

8.0%

99.7%

1200

4.7%

6.6%

4.8%

6.7%

99.8%

500

5.2%

6.4%

5.3%

6.5%

99.9%

800

5.0%

5.8%

5.1%

5.9%

99.9%

1000

5.4%

6.6%

5.5%

6.6%

100.0%

1200

4.5%

7.4%

4.6%

7.5%

99.8%

H_{a}^{1}

300

500

48.3%

72.2%

93.7%

94.5%

8.0%

800

58.4%

87.9%

97.8%

98.8%

3.2%

1000

53.6%

87.0%

96.4%

98.1%

6.3%

1200

66.4%

94.3%

97.9%

98.8%

3.4%

500

37.9%

54.4%

96.3%

96.7%

4.0%

800

68.3%

91.6%

99.9%

0.1%

1000

63.3%

89.8%

99.8%

0.2%

1200

55.1%

88.0%

99.7%

99.6%

0.6%

H_{a}^{2}

300

500

68.6%

83.2%

72.7%

84.9%

80.0%

800

69.0%

86.4%

72.1%

87.5%

80.9%

1000

74.5%

91.7%

77.7%

92.1%

78.9%

1200

74.9%

93.6%

78.5%

94.3%

79.6%

500

70.6%

81.9%

72.8%

82.8%

89.0%

800

71.4%

86.6%

73.4%

87.1%

88.3%

1000

72.2%

89.7%

73.5%

90.0%

89.6%

1200

75.9%

92.3%

77.5%

92.5%

87.8%

Open in a new tab

Note: This table reports the frequencies (in percentage) of rejection and $\hat{S} = \emptyset$ based on 2000 replications. These tests are conducted at 5% significance level.

The sizes of both J_wald and J₀ + J_wald are close to the significance level. In contrast, the thresholding tests (J_thr and J_thr + J₀) have significant size distortions. Furthermore, adding J₀ results in just 0.1–0.2% increase of the size.
Under H₀, $P (\hat{S} = \emptyset)$ is close to one, indicating that the power enhancement component J₀ screens off most of the estimation errors. Under $H_{a}^{1}$ , $P (\hat{S} = \emptyset)$ is less than 10% because the screening procedure manages to capture the big thetas. Under $H_{a}^{2}$ , as the non-vanishing thetas are very weak, $\hat{S}$ has a large chance of being empty.
Under $H_{a}^{1}$ , the power of the thresholding test is much higher than that of the Wald test, as the Wald test accumulates too many estimation errors. Besides, the power is significantly enhanced after J₀ is added.
Finally, under $H_{a}^{2}$ , the power enhancement is not substantial as the nonzero thetas are very weak, and the thresholding test has higher power than J₀+J_wald does. The power of the thresholding test can be further enhanced after J₀ is added with little increase of false rejections. Note that in this case $\hat{S}$ still has more than 10% chance of being nonempty. Whenever it is non-empty, adding J₀ potentially enhances the power of the test.

6.2 Testing cross-sectional independence

We use the following data generating process in our experiments,

\begin{array}{l} y_{i t} = α + β x_{i t} + μ_{i} + u_{i t}, i \leq n, t \leq T, \\ x_{i t} = ξ x_{i, t - 1} + μ_{i} + ε_{i t} . \end{array}

(6.1)

Note that we model {x_i}’s as AR(1) processes, so that x_it is possibly correlated with μ_i, but not with u_it, as was the case in Im et al. (1999). For each i, initialize x_it = 0.5 at t = 1. We specify the parameters as follows: μ_i is drawn from $N (0, 0.25)$ for i = 1, …, n. The parameters α and β are set −1 and 2 respectively. In regression (6.1), ξ = 0.7 and $ε_{i t} ~ N (0, 1)$ .

We generate ${u_{t}}_{t = 1}^{T}$ from $N_{n} (0, \sum_{u})$ . Under the null hypothesis, Σ_u is set to be a diagonal matrix $\sum_{u, 0} = diag {σ_{1}^{2}, \dots, σ_{n}^{2}}$ . Following Baltagi et al. (2012), consider the heteroscedastic errors

σ_{i}^{2} = σ^{2} {(1 + κ {\bar{x}}_{i})}^{2}

(6.2)

with κ = 0.5, where ${\bar{x}}_{i}$ is the average of x_it across t. Here σ² is scaled to fix the average of $σ_{i}^{2}$ ’s at one.

For alternative specifications, we use a spatial model for the errors u_it. Baltagi et al. (2012) considered a tri-diagonal error covariance matrix in this case. We extend it by allowing for higher order spatial autocorrelations, but require that not all the errors be spatially correlated with their immediate neighbors. Specifically, we start with Σ_u,₁ = diag{Σ₁, …, Σ_n/₄} as a block-diagonal matrix with 4 × 4 blocks located along the main diagonal. Each Σ_i is assumed to be I₄ initially. We then randomly choose ⌊n^0.3⌋ blocks among them and make them non-diagonal by setting Σ_i(m, n) = ρ^|m−n|(m, n ≤ 4), with ρ = 0.2. To allow for error cross-sectional heteroscedasticity, we set $\sum_{u} = \sum_{u, 0}^{1 / 2} \sum_{u, 1} \sum_{u, 0}^{1 / 2}$ , where $\sum_{u, 0} = diag {σ_{1}^{2}, \dots, σ_{n}^{2}}$ as specified in (6.2).

The Monte Carlo experiments are conducted for different pairs of (n, T) with significance level q = 0.05 based on 2000 replications. The empirical size, power and the frequency of $\hat{S} = \emptyset$ as in (5.1) are recorded.

Table III gives the size and power of the bias-corrected quadratic test J₁ in (5.2) and those of the power enhanced test J₀ + J₁. The sizes of both tests are close to 5%. In particular, the power enhancement test has little distortion of the original size.

Table III.

Size and power (%) of tests for cross-sectional independence

T	n = 200 $J_{1} / PE / P (\hat{S} = \emptyset)$	n = 400 $J_{1} / PE / P (\hat{S} = \emptyset)$	n = 600 $J_{1} / PE / P (\hat{S} = \emptyset)$	n = 800 $J_{1} / PE / P (\hat{S} = \emptyset)$
	H₀
100	4.7/5.5 /99.1	4.9/5.3 /99.6	5.5/5.7 /99.7	4.9/5.2 /99.7
200	5.3/5.3 /100.0	5.5/5.9 /99.6	4.7/5.1 /99.4	4.9/5.1 /99.8
300	5.2/5.2 /100.0	5.2/5.2 /100.0	4.6/4.6 /100.0	4.9/4.9 /100.0
500	4.7/4.7 /100.0	5.5/5.5 /100.0	5.0/5.0 /100.0	5.1/5.1 /100.0

	H_a
100	26.4/95.5 /5.0	19.8/98.0 /2.3	13.5/98.2 /2.0	12.2/99.2 /0.9
200	54.6/98.8 /1.6	40.3/99.6 /0.5	24.8/99.6 /0.4	21/99.7 /0.3
300	78.9/99.2 /1.1	65.3/100.0 /0.1	41.7/99.9 /0.2	37.2/100.0 /0.1
500	93.5/99.8 /0.2	89.0/100.0 /0.0	69.1/100.0 /0.0	61.8/100.0 /0.0

Open in a new tab

Note: This table reports the frequencies of rejection by J₁ in (5.2) and PE = J₀ + J₁ in (5.4) under the null and alternative hypotheses, based on 2000 replications. The frequency of $\hat{S}$ being empty is also recorded. These tests are conducted at 5% significance level.

The bottom panel shows the power of the two tests under the alternative specification. The power enhancement test demonstrates almost full power under all combinations of (n, T). In contrast, the quadratic test J₁ only gains power when T gets large. As n increases, the proportion of nonzero off-diagonal elements in Σ_u gradually decreases. It becomes harder for J₁ to effectively detect those deviations from the null hypothesis. This explains the low power exhibited by the quadratic test when facing a high sparsity level.

6.3 Empirical evidence of sparse alternatives

We present empirical evidence of sparse alternatives based on a real data example. Consider Carhart (1997)’s four-factor model on the S&P 500 index. We collect monthly excess returns on all the S&P 500 constituents from the CRSP database for the period January 1980 to December 2012, and construct the screening set $\hat{S}$ on a rolling window basis: for each month, we evaluate $\hat{S}$ using the preceding 60 months’ returns (T = 60). The panel at each month consists of stocks without missing observations in the past five years, which yields a balanced panel with the cross-sectional dimension larger than the time-series dimension (N > T). In this manner we not only capture the up-to-date information in the market, but also mitigate the impact of time-varying factor loadings and sampling biases. In particular, for testing months τ = 1984.12, …, 2012.12, we run the regressions

r_{i t}^{τ} - r_{f t}^{τ} = θ_{i}^{τ} + β_{i, MKT}^{τ} ({MKT}_{t}^{τ} - r_{f t}^{τ}) + β_{i, SMB}^{τ} {SMB}_{t}^{τ} + β_{i, HML}^{τ} {HML}_{t}^{τ} + β_{i, MOM}^{τ} {MOM}_{t}^{τ} + u_{i t}^{τ},

(6.3)

for i = 1, …, N_τ and t = τ − 59, …, τ, where r_it represents the return for stock i at month t, r_ft the risk free rate, and MKT, SMB, HML and MOM constitute market, size, value and momentum factors. The time series of factors are downloaded from Kenneth French’s website. To make the notation consistent, we use $θ_{i}^{τ}$ to represent the “alpha” of stock i.

Table IV summarizes descriptive statistics for different components and estimates in the model. On average, 618 stocks (which is more than 500 because we are recording stocks that have ever become the constituents of the index) enter the panel of the regression during each five-year estimation window. Of those, merely an average of 5.2 stocks are selected by the screening set $\hat{S}$ which directly implies the presence of sparse alternatives. The threshold $δ_{N, T} = \sqrt{(\log N)} \log (\log T)$ varies as the panel size N changes at the end of each month, and is about 3.5 on average. The selected stocks have much larger alphas (θ) than other stocks do. Therefore empirically we find that there are only a few significant nonzero “alpha” components, corresponding to a small portion of mis-priced stocks instead of systematic mis-pricing of the whole market.

Table IV.

Summary of descriptive statistics and testing results

Variables

Mean

Std dev.

Median

Min

Max

N_τ

617.70

26.31

621

574

665

{| \hat{θ} |}_{0}

5.20

3.50

{\bar{| \hat{θ} |}}_{i}^{τ} (%)

0.9767

0.1519

0.9308

0.7835

1.3816

{\bar{| \hat{θ} |}}_{i \in \hat{S}}^{τ} (%)

4.5569

1.4305

4.1549

1.7839

10.8393

Open in a new tab

The power enhancement procedure is particularly suited for the empirical setting where sparse alternatives are present. Note that finding only a few stocks with nonzero alphas is probably explained by the focus on a balanced panel of highly traded stocks with large capitalizations (cf. constituents of the S&P500). On the other hand, as in Gagliardini et al. (2011), the empirical finding would probably be different if we consider a much larger universe of stocks with possibly many more mis-pricing.

7 Discussions

We consider testing a high-dimensional vector H: θ = 0 against sparse alternatives where the null hypothesis is violated only by a few components. We introduce a “power enhancement component” J₀ based on a screening technique, which is zero under the null, but diverges quickly under sparse alternatives. We suggest constructing J₀ as described in the paper since the screening set $\hat{S}$ can reveal the sparse structure of θ, and a negative outcome of the test suggests a specific set of alternatives.

In the factor pricing model, the issue of missing a small number of factors is also important to consider. One one hand, when the unspecified factors are not “pervasive”, only assets that are influenced by the missing factors are affected, which may lead to a sparse alternative. On the other hand, the unspecified pervasive factors may substantially affect the sparse structure of either θ or Σ_u, or both. In this case, we can extend the current model to allow for unobservable factors, which can be statistically inferred using principal components (PC) method as in Stock and Watson (2002) and Fan et al. (2013). Since the PC method is robust to over-specifying factors, the screening set $\hat{S}$ should be stable if the “working number of factors” $\hat{K}$ is no smaller than the true number of factors K. As a result, one can estimate θ and construct $\hat{S}$ using either a consistent estimator of K or a slightly overestimated $\hat{K}$ . Once $\hat{S}$ is reasonably robust to the choice of $\hat{K}$ , it indicates that no pervasive factors are omitted. We can then proceed to use the proposed J₀ to conduct the test.

In addition, this paper considers unconditional population moments of asset returns and focuses on the factor pricing model. Unconditional moments of financial returns, under a broad class of data generation processes, are time invariant and can thus be estimated from time series data. Though theoretical models often imply a conditional linear model with respect to investors’ information sets, it is much more convenient to deduce testable implication that does not depend on this conditioning information (see Hansen and Richard (1987)). On the other hand, the use of conditioning information is also appealing and has been addressed by several authors (see,e.g., Gagliardini et al. (2011) and Ang and Kristensen (2012)). It will be an interesting direction to accommodate such conditioning information in deriving power enhancement tests.

Supplementary Material

NIHMS749711-supplement-2.pdf^{(252.7KB, pdf)}

Table I.

Means and covariances used to generate b_i and f_t

μ_B		Σ_B		μ_f		Σ_f
0.9833	0.0921	−0.0178	0.0436	0.0260	3.2351	0.1783	0.7783
−0.1233	−0.0178	0.0862	−0.0211	0.0211	0.1783	0.5069	0.0102
0.0839	0.0436	−0.0211	0.7624	−0.0043	0.7783	0.0102	0.6586

Open in a new tab

Acknowledgments

*We thank co-editor Lars Hansen and anonymous referees for many insightful comments and suggestions, which have greatly improved the paper. The authors are also grateful to the comments from Per Mykland and seminar and conference participants at UChicago, Princeton, Georgetown, George Washington, 2014 Econometric Society North America Summer Meeting, UCL workshop on High-dimensional Econometrics Models, The 2014 Annual meeting of Royal Economics Society, The 2014 Asian Meeting of the Econometric Society, 2014 International Conference on Financial Engineering and Risk Management, and 2014 Midwest Econometric Group meeting. The research was partially supported by National Science Foundation grants DMS-1206464 and DMS-1406266, and National Institute of Health grants R01GM100474-01 and R01-GM072611.

A Proofs for Section 3

Throughout the proofs, let C be a generic constant, which may differ at different places.

A.1 Proof of Theorem 3.1

Proof

Define events

A_{1} = {max_{j \leq N} | {\hat{θ}}_{j} - θ_{j} | / {\hat{v}}_{j}^{1 / 2} < δ_{N, T}}, A_{2} = {\frac{4}{9} \leq {\hat{v}}_{j} / v_{j} \leq \frac{9}{4}, \forall_{j} = 1, \dots, N} .

For any j ∈ S (θ), by the definition of S(θ), $| θ_{j} | > 3 δ_{N, T} v_{j}^{1 / 2}$ . Under A₁ ∩ A₂,

\frac{| {\hat{θ}}_{j} |}{{\hat{v}}_{j}^{1 / 2}} \geq \frac{| θ_{j} | - | {\hat{θ}}_{j} - θ_{j} |}{{\hat{v}}_{j}^{1 / 2}} \geq \frac{2 | θ_{j} |}{3 v_{j}^{1 / 2}} - δ_{N, T} > δ_{N, T} .

This implies that $j \in \hat{S}$ , hence $S (θ) \subset \hat{S}$ . In fact, we have proved this statement on the event A₁ ∩ A₂ uniformly for θ ∈ Θ:

inf_{θ \in Θ} P (S (θ) \subset \hat{S} | θ) \to 1.

Moreover, it is readily seen that, under H₀: θ = 0, by Assumption 3.1,

P (J_{0} = 0 | H_{0}) = P (\hat{S} = \emptyset | H_{0}) = P (max_{j \leq N} {| {\hat{θ}}_{j} | / {\hat{v}}_{j}^{1 / 2}} < δ_{N, T} | H_{0}) \to 1.

In addition, by ${inf}_{θ \in Θ} P (S (θ) \subset \hat{S} | θ) \to 1$ ,

\begin{matrix} sup_{θ \in Θ} P (J_{0} \leq \sqrt{N} | S (θ) \neq \emptyset) \leq sup_{θ \in Θ} P (J_{0} \leq \sqrt{N}, \hat{S} \neq \emptyset | S (θ) \neq \emptyset) + sup_{θ \in Θ} P (\hat{S} = \emptyset | S (θ) \neq \emptyset) \\ \leq sup_{θ \in Θ} P (\sqrt{N} \sum_{j \in \hat{S}} δ_{N, T}^{2} \leq \sqrt{N}, \hat{S} \neq \emptyset | S (θ) \neq \emptyset) + o (1) \\ \leq sup_{θ \in Θ} P (\sqrt{N} δ_{N, T}^{2} \leq \sqrt{N} | S (θ) \neq \emptyset) + o (1) \to 0. \end{matrix}

Note that the last convergence holds uniformly in θ ∈ Θ because δ_N,T → ∞. Therefore, ${inf}_{θ \in Θ} P (J_{0} > \sqrt{N} | S (θ) \neq \emptyset) \to 1$ . This completes the proof.

A.2 Proof of Theorem 3.2

Proof

It follows immediately from P (J₀ = 0|H₀) → 1 that J →^d F, and hence the critical region {D: J > F_q} has size q. Moreover, by the power condition of J₁ and J₀ ≥ 0,

inf_{θ \in Θ (J_{1})} P (J > F_{q} | θ) \geq inf_{θ \in Θ (J_{1})} P (J_{1} > F_{q} | θ) \to 1.

This together with the fact

inf_{θ \in Θ_{s} \cup Θ (J_{1})} P (J > F_{q} | θ) \geq min {inf_{θ \in Θ_{s}} P (J > F_{q} | θ), inf_{θ \in Θ (J_{1})} P (J > F_{q} | θ)},

establish the theorem, if we show ${inf}_{θ \in Θ_{s}} P (J > F_{q} | θ) \to 1$ .

By the definition of Ŝ and J₀, we have ${J_{0} < \sqrt{N} δ_{N, T}^{2}} = {\hat{S} = \emptyset}$ . Since inf_θ_∈Θ P(S(θ) ⊂Ŝ|θ) → 1 and Θ_s= {θ ∈ Θ: S(θ) ≠Ø}, we have

\begin{array}{l} sup_{Θ_{s}} P (J_{0} < \sqrt{N} δ_{N, T}^{2} | θ) = sup_{Θ_{s}} P (\hat{S} = \emptyset | θ) \\ \leq sup_{{θ \in Θ : S (θ) \neq \emptyset}} P (\hat{S} = \emptyset, S (θ) \subset \hat{S} | θ) + o (1), \end{array}

which converges to zero, since the first term is zero. This implies ${inf}_{Θ_{s}} P (J_{0} \geq \sqrt{N} δ_{N, T}^{2} | θ) \to 1$ . Then by condition (ii), as δ_N,T → ∞,

inf_{θ \in Θ_{s}} P (J > F_{q} | θ) \geq inf_{θ \in Θ_{s}} P (\sqrt{N} δ_{N, T}^{2} + J_{1} > F_{q} | θ) \geq inf_{θ \in Θ_{s}} P (c \sqrt{N} + J_{1} > F_{q} | θ) \to 1.

This completes the proof.

A.3 Proof of Theorem 3.3

Proof

It suffices to verify conditions (i)–(iii) in Theorem 3.2 for J₁ = J_Q. Condition (i) follows from Assumption 3.2. Condition (iii) is fulfilled for c > 2/ξ, since

inf_{θ \in Θ_{s}} P (c \sqrt{N} + J_{Q} > F_{q} | θ) \geq inf_{θ \in Θ_{s}} P (c \sqrt{N} - \frac{N (1 + μ_{N, T})}{ξ_{N, T} \sqrt{N}} > F_{q} | θ) \to 1,

by using F_q = O(1), ξ_N,T → ξ, and μ_N,T → 0. We now verify condition (ii) for the Θ(J_Q) defined in the theorem. Let D = diag(v₁, …, v_N). Then ‖D‖₂ < C₃/T by Assumption 3.2(iv).

On the event $A = {{‖ {(\hat{θ} - θ)}^{'} D^{- 1 / 2} ‖}^{2} < δ_{N, T}^{2} N / 4}$ , we have

\begin{array}{l} | {(\hat{θ} - θ)}^{'} V θ | \leq ‖ {(\hat{θ} - θ)}^{'} D^{- 1 / 2} ‖ ‖ D^{1 / 2} V θ ‖ \\ \leq δ_{N, T} \sqrt{N} {‖ D ‖}_{2}^{1 / 2} {‖ V ‖}_{2}^{1 / 2} {(θ^{'} V θ)}^{1 / 2} / 2 \\ \leq δ_{N, T} \sqrt{N} {(C_{3} / T)}^{1 / 2} {‖ V ‖}_{2}^{1 / 2} {(θ^{'} V θ)}^{1 / 2} / 2. \end{array}

For ${‖ θ ‖}^{2} > C δ_{N, T}^{2} N / T$ with C = 4C₃‖V‖₂/λ_min(V), we can bound further that

| {(\hat{θ} - θ)}^{'} V θ | \leq θ^{'} V θ / 4.

Hence, ${\hat{θ}}^{'} V \hat{θ} \geq θ^{'} V θ - 2 ({\hat{θ} - θ)}^{'} V θ \geq θ^{'} V θ / 2$ . Therefore,

\begin{array}{l} sup_{θ \in Θ (J_{Q})} P (J_{Q} \leq F_{q} | θ) \leq sup_{Θ (J_{Q})} P (\frac{T θ^{'} V θ / 2 - 2 N}{ξ \sqrt{N}} \leq F_{q} | θ) + sup_{Θ (J_{Q})} P (A^{c} | θ) \\ \leq sup_{Θ (J_{Q})} P (T λ_{min} (V) {‖ θ ‖}^{2} < 2 F_{q} ξ \sqrt{N} + 4 N | θ) + o (1) \\ \leq sup_{Θ (J_{Q})} P (λ_{min} (V) C δ_{N, T}^{2} N < 5 N | θ) + o (1), \end{array}

which converges to zero since $δ_{N, T}^{2} \to ∞$ . This implies ${inf}_{Θ (J_{Q})} P (J_{Q} > F_{q} | θ) \to 1$ and finishes the proof.

A.4 Proof of Theorem 3.4

Proof

Through this proof, C is a generic constant, which can vary from one line to another. Without loss of generality, under the alternative, write

θ^{'} = (θ_{1}^{'}, θ_{2}^{'}) = (0^{'}, θ_{2}^{'}), \hat{θ^{'}} = ({\hat{θ}}_{1}^{'}, {\hat{θ}}_{2}^{'}),

where dim(θ₁) = N − r_N and dim(θ₂) = r_N. Corresponding to $({θ^{'}}_{1}, {θ^{'}}_{2})$ , we partition V⁻¹ and V into:

V^{- 1} = (\begin{matrix} M_{1} & β^{'} \\ β & M_{2} \end{matrix}) and V = (\begin{matrix} M_{1}^{- 1} + A & G^{'} \\ G & C \end{matrix}),

where M₁ and A are (N − r_N) × (N − r_N); β and G are r_N × (N − r_N); M₂ and C are r_N × r_N.

By the matrix inversion formula,

A = M_{1}^{- 1} β^{'} {(M_{2} - β M_{1}^{- 1} β^{'})}^{- 1} β M_{1}^{- 1} .

Let $Δ = T \hat{θ^{'}} V \hat{θ} - T {\hat{θ}}_{1}^{'} M_{1}^{- 1} {\hat{θ}}_{1} .$ Note that

Δ = T {\hat{θ}}_{1}^{'} A \hat{θ_{1}} + 2 T {\hat{θ}}_{2}^{'} G \hat{θ_{1}} + T {\hat{θ}}_{2}^{'} C \hat{θ_{2}} .

We first look at $T {\hat{θ}}_{1}^{'} A \hat{θ_{1}} .$ Let $λ_{N, T} = T λ_{max} ({(M_{2} - β M_{1}^{- 1} β^{'})}^{- 1})$ and $D_{1} = diag(\frac{1}{T} M_{1})$ . Note that the diagonal entries of $\frac{1}{T} V^{- 1}$ are given by $diag (\frac{1}{T} V^{- 1}) = {v_{j}}_{j \leq N}$ . Therefore D₁ is a diagonal matrix with entries ${v_{j}}_{j \leq N - r_{N}}$ , and ${max}_{j} v_{j} = O (T^{- 1})$ .

Since β is r_N × (N − r_N), using the expression of A, we have

\begin{array}{l} T \hat{{θ^{'}}_{1}} A \hat{θ_{1}} \leq λ_{N, T} {‖ β M_{1}^{- 1} \hat{θ_{1}} ‖}^{2} \\ \leq λ_{N, T^{r} N} {‖ M_{1}^{- 1} (\hat{θ_{1}} - θ_{1}) ‖}_{max}^{2} {(max_{i \leq r_{N}} \sum_{j \leq N - r} | β_{i j} |)}^{2} \\ \leq λ_{N, T^{r} N} {‖ M_{1}^{- 1} D_{1}^{1 / 2} ‖}_{1}^{2} {‖ D_{1}^{- 1 / 2} (\hat{θ_{1}} - θ_{1}) ‖}_{max}^{2} {‖ V^{- 1} ‖}_{1}^{2}, \end{array}

where we used θ₁ = 0 in the second inequality and the fact that ${max}_{i} \leq r_{N} \sum_{j \leq N - r} | β_{i j} | \leq {‖ V^{- 1} ‖}_{1}$ . Note that ${‖ V ‖}_{1} = O (1) = {‖ V^{- 1} ‖}_{1}$ . Hence,

{‖ M_{1}^{- 1} D_{1}^{1 / 2} ‖}_{1}^{2} = O (T^{- 1}), and λ_{N, T} = O (T) .

Thus, there is C > 0, with probability approaching one,

T {\hat{θ}}_{1}^{'} A \hat{θ_{1}} \leq C r_{N} {‖ D_{1}^{- 1 / 2} (\hat{θ_{1}} - θ_{1}) ‖}_{max}^{2} \leq C r_{N} δ_{N, T}^{2} .

Note that the uniform convergence in Assumption 3.1 and boundness of ${‖ θ ‖}_{max}$ imply that $P ({‖ \hat{θ} ‖}_{max} \leq C) \to 1$ for a sufficient large constant C. For G = (g_ij), note that ${max}_{i \leq r} \sum_{j = 1}^{N - r} | g_{i j} | \leq {‖ V ‖}_{1}$ Hence, by using θ₁ = 0 again, with probability tending to one,

\begin{array}{l} | T {\hat{θ}}_{2}^{'} G \hat{θ_{1}} | = T | {\hat{θ}}_{2}^{'} G D_{1}^{1 / 2} D_{1}^{- 1 / 2} (\hat{θ_{1}} - θ_{1}) | \\ \leq T {‖ \hat{θ_{2}} ‖}_{max} {‖ D_{1}^{- 1 / 2} (\hat{θ_{1}} - θ_{1}) ‖}_{max} \sum_{i = 1}^{r_{N}} \sum_{j = 1}^{N - r} | g_{i j} | \sqrt{v_{j}} \\ \leq C r_{N} δ_{N, T} \sqrt{T} . \end{array}

Moreover, $T {\hat{θ}}_{2}^{'} C \hat{θ_{2}} \leq T {‖ \hat{θ_{2}} ‖}^{2} {‖ C ‖}_{2} = O_{P} (r_{N} T)$ . Combining all the results above, it yields that for any θ ∈ Θ_b,

Δ = O_{P} (r_{N} δ_{N, T}^{2} + r_{N} T) .

We denote $var (\hat{θ})$ , $var (\hat{θ_{1}})$ , $var (\hat{θ_{2}})$ to be the asymptotic covariance matrix of $\hat{θ}, \hat{θ_{1}}$ and $\hat{θ_{2}}$ . Then $\frac{1}{T} V^{- 1} = var (\hat{θ})$ and $\frac{1}{T} M_{1} = var (\hat{θ_{1}})$ . It then follows from (3.3) that

Z \equiv \frac{T {\hat{θ}}_{1}^{'} M_{1}^{- 1} \hat{θ_{1}} - (N - r_{N})}{\sqrt{2 (N - r_{N})}} \to^{d} N (0, 1) .

For any 0 < ε < F_q, define the event $A = {| Δ - r_{N} | < \sqrt{2 N} \in}$ . Hence, suppressing the dependence of θ,

\begin{array}{l} P (J_{Q} > F_{q}) = P (\frac{T {\hat{θ}}_{1}^{'} M_{1}^{- 1} \hat{θ_{1}} + Δ - N}{\sqrt{2 N}} > F_{q}) \\ = P (Z \sqrt{\frac{N - r_{N}}{N}} + \frac{Δ - r_{N}}{\sqrt{2 N}} > F_{q}) \\ \leq P (Z \sqrt{\frac{N - r_{N}}{N}} + \in > F_{q}) + P (A^{c}), \end{array}

which is further bounded by 1−Φ(F_q−ε)+P(A^c)+o(1). Since 1−Φ(F_q) = q, for small enough, ε, 1−Φ(F_q−ε) = q+O(ε). By letting ɛ → 0 slower than $O (T r_{N} / \sqrt{N})$ , we have P (A^c) = o(1), and lim sup_{N→∞,T→∞} P (J_Q > F_q) ≤ q. On the other hand, P (J_Q > F_q) ≥ P (J₁ > F_q), which converges to q. This proves the result.

B Proofs for Section 4

Lemma B.1

When cov(f_t) is positive definite, $E f_{t}^{'} {(E f_{t} f_{t}^{'})}^{- 1} E f_{t} < 1$ .

Proof

If Ef_t= 0, then $E f_{t}^{'} {(E f_{t} f_{t}^{'})}^{- 1} E f_{t} < 1$ . If Ef_t ≠ 0, because cov(f_t) is positive definite, let $c = {(E f_{t} f_{t}^{'})}^{- 1} E f_{t}$ , then $c^{'} (E f_{t} f_{t}^{'} - E f_{t} E f_{t}^{'}) c > 0$ . Hence $c^{'} E f_{t} E f_{t}^{'} c < c^{'} E f_{t} f_{t}^{'} c$ implies $E f_{t}^{'} {(E f_{t} f_{t}^{'})}^{- 1} E f_{t} > {(E f_{t}^{'} {(E f_{t} f_{t}^{'})}^{- 1} E f_{t})}^{2}$ . This implies $E f_{t}^{'} {(E f_{t} f_{t}^{'})}^{- 1} E f_{t} < 1$ .

B.1 Proof of Proposition 4.1

Recall that $v_{j} = var (u_{j t}) / (T - T E f_{t}^{'} {(E f_{t} f_{t}^{'})}^{- 1} E f_{t})$ , and $\hat{v_{j}} = \frac{1}{T} \sum_{t = 1}^{T} {\hat{u}}_{j t}^{2} / (T a_{f, T})$ . Write $σ_{i j} = {(\sum_{u})}_{i j}, \hat{σ_{i j}} = \frac{1}{T} \sum_{t = 1}^{T} \hat{u_{i t}} \hat{u_{j t}}, σ_{j}^{2} = T v_{j}$ , and $\hat{σ_{j}^{2}} = T \hat{v_{j}}$ .

Simple calculations yield

\hat{θ_{i}} = θ_{i} + a_{f, T}^{- 1} \frac{1}{T} \sum_{t = 1}^{T} u_{i t} (1 - f_{t}^{'} w) .

We first prove the second statement. Note that there is σ_min > 0 (independent of θ) so that min_j σ_j > σ_min. By Lemma ??, there is C > 0, ${inf}_{Θ} P ({max}_{j \leq N} | \hat{σ_{j}} - σ_{j} |) < C \sqrt{\frac{\log N}{T} |} θ) \to 1$ . On the event ${{max}_{j \leq N} | {\hat{σ}}_{j} - σ_{j} | < C \sqrt{\frac{\log N}{T}}}$ ,

max_{j \leq N} | \frac{{\hat{v}}_{j}^{1 / 2}}{v_{j}^{1 / 2}} - 1 | \leq max_{j \leq N} \frac{| {\hat{σ}}_{j} - σ_{j} |}{σ_{j}} \leq \frac{C \sqrt{\log N}}{σ_{min} \sqrt{T}} .

This proves the second statement. We can now use this to prove the first statement.

Note that v_j is independent of θ, so there is C₁ (independent of θ) so that ${max}_{j \leq N} v_{j}^{- 1 / 2} < C_{1} \sqrt{T} .$ On the event ${{max}_{j \leq N} v_{j}^{1 / 2} / {\hat{v}}_{j}^{1 / 2} < 2} \cap {{max}_{j \leq N} | {\hat{θ}}_{j} - θ_{j} | < C \sqrt{\frac{\log N}{T}}}$ ,

max_{j \leq N} \frac{| \hat{θ_{j}} - θ_{j} |}{{\hat{v}}_{j}^{1 / 2}} \leq C \sqrt{\frac{\log N}{T}} 2 max_{j} v_{j}^{- 1 / 2} \leq 2 C C_{1} \sqrt{\log N} < δ_{N, T} .

The constants C, C₁ appeared are independent of θ, and Lemma ?? holds uniformly in θ. Hence the desired result also holds uniformly in θ.

B.2 Proof of Proposition 4.2

By Theorem 1 of Pesaran and Yamagata (2012) (Theorem 1),

(T a_{f, T} \hat{θ^{'}} \sum_{u}^{- 1} \hat{θ} - N) / \sqrt{2 N} \to^{d} N (0, 1) .

Therefore, we only need to show

\frac{T \hat{θ^{'}} (\sum_{u}^{- 1} - {\sum^{^}}_{u}^{- 1}) \hat{θ}}{\sqrt{2 N}} = o_{P} (1) .

The left hand side is equal to

\frac{T \hat{θ^{'}} \sum_{u}^{- 1} ({\sum^{^}}_{u} - \sum_{u}) \sum_{u}^{- 1} \hat{θ^{'}}}{\sqrt{N}} + \frac{T \hat{θ^{'}} ({\sum^{^}}_{u}^{- 1} - \sum_{u}^{- 1}) ({\sum^{^}}_{u} - \sum_{u}) \sum_{u}^{- 1} \hat{θ^{'}}}{\sqrt{N}} \equiv a + b .

It was shown by Fan et al. (2011) that ${‖ {\sum^{^}}_{u} - \sum_{u} ‖}_{2} = O_{P} (m_{N} \sqrt{\frac{\log N}{T}}) = {‖ {\sum^{^}}_{u}^{- 1} - \sum_{u}^{- 1} ‖}_{2}$ In addition, under H₀, ${‖ \hat{θ} ‖}^{2} = O_{P} (N \log N / T) .$ . Hence $b = O_{p} (\frac{m_{N}^{2} \sqrt{N} {(\log N)}^{2}}{T}) = o_{P} (1)$ .

The challenging part is to prove a = o_P (1) when N > T. As is described in the main text, simple inequalities like Cauchy-Schwarz accumulate estimation errors, and hence do not work. Define $e_{t} = \sum_{u}^{- 1} u_{t} = {(e_{1 t}, \dots, e_{N t})}^{'}$ , which is an N-dimensional vector with mean zero and covariance $\sum_{u}^{- 1}$ , whose entries are stochastically bounded. Let $\bar{w} = {(E f_{t} f_{t}^{'})}^{- 1} E f_{t}$ . A key step of proving this proposition is to establish the following two convergences:

\frac{1}{T} E {| \frac{1}{\sqrt{N T}} \sum_{i = 1}^{N} \sum_{t = 1}^{T} (u_{i t}^{2} - E u_{i t}^{2}) {(\frac{1}{\sqrt{T}} \sum_{s = 1}^{T} e_{i s} (1 - {f^{'}}_{s} \bar{w}))}^{2} |}^{2} = o (1),

(B.1)

\frac{1}{T} E {| \frac{1}{\sqrt{N T}} \sum_{i \neq j, (i, j) \in S_{U}} \sum_{t = 1}^{T} (u_{i t} u_{j t} - E u_{i t} u_{j t}) [\frac{1}{\sqrt{T}} \sum_{s = 1}^{T} e_{i s} (1 - {f^{'}}_{s} \bar{w})] [\frac{1}{\sqrt{T}} \sum_{k = 1}^{T} e_{j k} (1 - {f^{'}}_{s} \bar{w})] |}^{2} = o (1),

(B.2)

where

S_{U} = {(i, j) : {(\sum_{u})}_{i j} \neq 0} .

The sparsity condition assumes that most of the off-diagonal entries of Σ_u are outside of S_U. The above two convergences are weighted cross-sectional and serial double sums, where the weights satisfy $\frac{1}{\sqrt{T}} \sum_{t = 1}^{T} e_{i t} (1 - f_{t}^{'} \bar{w}) = O_{P} (1)$ for each i. The proofs of (B.1) and (B.2) are given in the supplementary material in Appendix D.

We consider the hard-thresholding covariance estimator. The proof for the generalized sparsity case as in Rothman et al. (2009) is very similar. Let $s_{i j} = \frac{1}{T} \sum_{t = 1}^{T} {\hat{u}}_{i t} {\hat{u}}_{j t}$ and σ_ij=(Σ_u)_ij. Under hard-thresholding,

{\hat{σ}}_{i j} = {({\sum^{^}}_{u})}_{i j} = {\begin{cases} s_{i i}, & if i = j, \\ s_{i j}, & if i \neq j, | s_{i j} | > C {(s_{i i} s_{j j} \frac{\log N}{T})}^{1 / 2} \\ 0, & if i \neq j, | s_{i j} | \leq C {(s_{i i} s_{j j} \frac{\log N}{T})}^{1 / 2} \end{cases}

Write ${({\hat{θ}}^{'} \sum_{u}^{- 1})}_{i}$ to denote the ith element of ${\hat{θ}}^{'} \sum_{u}^{- 1}$ , and $S_{U}^{c} = {(i, j) : {(\sum_{u})}_{i j} = 0}$ . For $σ_{i j} \equiv {(\sum_{u})}_{i j}$ , and ${\hat{σ}}_{i j} = {({\sum^{^}}_{u})}_{i j}$ we have

\begin{array}{l} a = \frac{T}{\sqrt{N}} \sum_{i = 1}^{N} {({\hat{θ}}^{'} \sum_{u}^{- 1})}_{i}^{2} ({\hat{σ}}_{i i} - σ_{i i}) + \frac{T}{\sqrt{N}} \sum_{i \neq j, (i, j) \in S_{U}} {({\hat{θ}}^{'} \sum_{u}^{- 1})}_{i} ({\hat{σ}}_{i j} - σ_{i j}) \\ + \frac{T}{\sqrt{N}} \sum_{(i, j) \in S_{U}^{c}} {({\hat{θ}}^{'} \sum_{u}^{- 1})}_{i} {({\hat{θ}}^{'} \sum_{u}^{- 1})}_{j} ({\hat{σ}}_{i j} - σ_{i j}) \\ = a_{1} + a_{2} + a_{3} \end{array}

We first examine a₃. Note that

a_{3} = \frac{T}{\sqrt{N}} \sum_{(i, j) \in S_{U}^{c}} {({\hat{θ}}^{'} \sum_{u}^{- 1})}_{i} {({\hat{θ}}^{'} \sum_{u}^{- 1})}_{j} {\hat{σ}}_{i j} .

Obviously,

P (a_{3} > T^{- 1}) \leq P (max_{(i, j) \in S_{U}^{c}} | {\hat{σ}}_{i j} | \neq 0) \leq P (max_{(i, j) \in S_{U}^{c}} | s_{i j} | > C {(s_{i i} s_{j j} \frac{\log N}{T})}^{1 / 2}) .

Because s_ii is uniformly (across i) bounded away from zero with probability approaching one, and ${max}_{(i, j) \in S_{U}^{c}} | s_{i j} | = O_{P} (\sqrt{\frac{\log N}{T}})$ . Hence for any ε > 0, when C in the threshold is large enough, P (a₃ > T ⁻¹) < ε, this implies a₃ = o_P (1).

The proof is finished once we establish a_i = o_P (1) for i = 1, 2, which are given in Lemmas ?? and ?? respectively in the supplementary material.

Proof of Theorem 4.1

Part (i) follows from Proposition 4.2 and that P (J₀ = 0|H₀) → 1. Part (ii) follows immediately from Theorem 3.3.

C Proofs for Section 5

C.1 Proof of Proposition 5.1

Lemma C.1

Under Assumption 5.1, uniformly in θ ∈ Θ, $P (\sqrt{n T} ‖ \hat{β} - β ‖ < \sqrt{\log n}) \to 1$ .

Proof

Note that

\sqrt{n T} ‖ \hat{β} - β ‖ = ‖ {(\frac{1}{n T} \sum_{i = 1}^{n} \sum_{t = 1}^{T} {\tilde{x}}_{i t} {\tilde{x}}_{i t}^{'})}^{- 1} (\frac{1}{\sqrt{n T}} \sum_{i = 1}^{n} \sum_{t = 1}^{T} {\tilde{x}}_{i t} {\tilde{u}}_{i t}) ‖ .

Uniformly for θ ∈ Θ, due to serial independence, and $\frac{1}{n T} \sum_{i = 1}^{n} \sum_{t = 1}^{T} E {\tilde{x}}_{i t}^{'} {\tilde{x}}_{i t} E {\tilde{u}}_{i t} {\tilde{u}}_{i t} \leq C_{1}$ ,

\begin{array}{l} E {‖ \frac{1}{\sqrt{n T}} \sum_{i = 1}^{n} \sum_{t = 1}^{T} {\tilde{x}}_{i t} {\tilde{u}}_{i t} ‖}^{2} = \frac{1}{n T} \sum_{i = 1}^{n} \sum_{t = 1}^{T} \sum_{j = 1}^{n} \sum_{s = 1}^{T} E {\tilde{x}}_{i t}^{'} {\tilde{x}}_{j s} {\tilde{u}}_{i t} {\tilde{u}}_{j s} \\ = \frac{1}{n T} \sum_{i = 1}^{n} \sum_{t = 1}^{T} E {\tilde{x}}_{i t}^{'} {\tilde{x}}_{i t} E {\tilde{u}}_{i t} {\tilde{u}}_{i t} + \frac{1}{n T} \sum_{i \neq j} \sum_{t = 1}^{T} E {\tilde{x}}_{i t}^{'} {\tilde{x}}_{j t} E {\tilde{u}}_{i t} {\tilde{u}}_{j t} \\ \leq C_{1} + \frac{1}{n} \sum_{i \neq j} | E {\tilde{x}}_{i t}^{'} {\tilde{x}}_{j t} | | E {\tilde{u}}_{i t} {\tilde{u}}_{j t} | \leq C . \end{array}

Hence the result follows from the Chebyshev inequality and that $λ_{min} (\frac{1}{n T} \sum_{i = 1}^{n} \sum_{t = 1}^{T} {\tilde{x}}_{i t} {\tilde{x}}_{i t}^{'})$ is bounded away from zero with probability approaching one, uniformly in θ.

Lemma C.2

Suppose ${max}_{j \leq n} {‖ \frac{1}{T} \sum_{t} {\tilde{x}}_{j t} {\tilde{x}}_{i t}^{'} ‖}_{2} < C^{'}$ with probability approaching one and $E (u_{j t}^{4}) < C^{'}$ . There is C > 0, so that uniformly in θ ∈ Θ,

$P ({max}_{j \leq n} | \frac{1}{T} \sum_{t = 1}^{T} u_{j t} | < C \sqrt{\log n / T}) \to 1$
$P ({max}_{i, j \leq n} | \frac{1}{T} \sum_{t = 1}^{T} u_{i t} u_{j t} - E u_{i t} u_{j t} | < C \sqrt{\log n / T}) \to 1$
$P ({max}_{j \leq n} \frac{1}{T} \sum_{t = 1}^{T} {(u_{j t} - {\hat{u}}_{j t})}^{2} < C \log n / T) \to 1$
$P ({max}_{i, j \leq n} | \frac{1}{T} \sum_{t = 1}^{T} {\hat{u}}_{i t} {\hat{u}}_{j t} - E u_{i t} u_{j t} | < C \sqrt{\log n / T}) \to 1$

Proof

By the Bernstein inequality, for $C = {(8 {max}_{j \leq n} E (u_{j t}^{2}))}^{1 / 2}$ , we have
$\begin{array}{l} P (max_{j \leq n} | \frac{1}{T} \sum_{t = 1}^{T} u_{j t} | \geq C \sqrt{\frac{\log n}{T}}) \leq n max_{j \leq n} P (| \frac{1}{T} \sum_{t = 1}^{T} u_{j t} | \geq C \sqrt{\frac{\log n}{T}}) \\ \leq exp (\log n - \frac{C^{2} \log n}{4 {max}_{j \leq n} E (u_{j t}^{2})}) = \frac{1}{n} . \end{array}$
Hence (i) is proved as $P ({max}_{j \leq n} | \frac{1}{T} \sum_{t = 1}^{T} u_{j t} | < C \sqrt{\log n / T}) \geq 1 - \frac{1}{n}$ .
For $C = {(12 {max}_{j \leq n} E (u_{j t}^{4}))}^{1 / 2}$ , we have, uniformly in θ ∈ Θ,
$\begin{array}{l} P (max_{i, j \leq n} | \frac{1}{T} \sum_{t = 1}^{T} u_{i t} u_{j t} - E u_{i t} u_{j t} | \geq C \sqrt{\frac{\log n}{T}}) \\ \leq n^{2} max_{i, j \leq n} P (| \frac{1}{T} \sum_{t = 1}^{T} u_{i t} u_{j t} - E u_{i t} u_{j t} | \geq C \sqrt{\frac{\log n}{T}}) \\ \leq exp (2 \log n - \frac{C^{2} \log n}{4 {max}_{j \leq n} E (u_{j t}^{4})}) = \frac{1}{n} . \end{array}$
Note that ${\hat{u}}_{j t} - u_{j t} = - \frac{1}{T} \sum_{t = 1}^{T} u_{j t} - {\tilde{x}}^{'}_{j t} (\hat{β} - β)$ , and ${max}_{j \leq n} {‖ \frac{1}{T} \sum_{t} {\tilde{x}}_{j t} {\tilde{x}}^{'}_{j t} ‖}_{2} < C$ with probability approaching one. The result then follows from part (i) and Lemma C.1.
Observe that
$\begin{array}{l} | \frac{1}{T} \sum_{t = 1}^{T} {\hat{u}}_{i t} {\hat{u}}_{j t} - E u_{i t} u_{j t} | \leq | \frac{1}{T} \sum_{t = 1}^{T} u_{i t} u_{j t} - E u_{i t} u_{j t} | + | \frac{1}{T} \sum_{t = 1}^{T} u_{i t} u_{j t} - {\hat{u}}_{i t} {\hat{u}}_{j t} | \\ \leq | \frac{1}{T} \sum_{t = 1}^{T} u_{i t} u_{j t} - E u_{i t} u_{j t} | + \frac{1}{T} {\sum_{t = 1}^{T} {({\hat{u}}_{j t} - u_{j t})}^{2} + {(\frac{2}{T} \sum_{t} u_{j t}^{2})}^{1 / 2} (\frac{2}{T} \sum_{t} {({\hat{u}}_{j t} - u_{j t})}^{2})}^{1 / 2} \end{array}$

The first two terms and ${(\frac{2}{T} \sum_{t} {({\hat{u}}_{j t} - u_{j t})}^{2})}^{1 / 2}$ in the third term are bounded by results in (ii) and (iii). Therefore, it suffices to show that there is a constant M > 0 so that

P (max_{j \leq n} \frac{1}{T} \sum_{t} u_{j t}^{2} < M) \to 1.

Note that ${max}_{j \leq n} \frac{1}{T} \sum_{t} u_{j t}^{2} \leq {max}_{j \leq n} | \frac{1}{T} \sum_{t} u_{j t}^{2} - E u_{j t}^{2} | + {max}_{j \leq n} E u_{j t}^{2}$ . In addition, by (ii), there is C > 0 so that

P (max_{j \leq n} | \frac{1}{T} \sum_{t = 1}^{T} u_{j t}^{2} - E u_{j t}^{2} | < C \sqrt{\log n / T}) \to 1.

Hence we can pick up M so that $M - {max}_{j \leq n} E (u_{j t}^{2}) > C \sqrt{\log n / T}$ , and

\begin{array}{l} P (max_{j \leq n} \frac{1}{T} \sum_{t} u_{j t}^{2} \geq M) \leq P (max_{j \leq n} | \frac{1}{T} \sum_{t} u_{j t}^{2} - E u_{j t}^{2} | \geq M - max_{j \leq n} E u_{j t}^{2}) \\ \leq P (max_{j \leq n} | \frac{1}{T} \sum_{t} u_{j t}^{2} - E u_{j t}^{2} | \geq C \sqrt{\frac{\log n}{T}}) \to 0. \end{array}

This proves the desired result.

Lemma C.3

Under Assumption 5.1, there is C > 0, uniformly in θ ∈ Θ, $P ({max}_{i j} | {\hat{ρ}}_{i j} - ρ_{i j} | < C \sqrt{\log n / T}) \to 1.$

Proof

By the definition ${\hat{ρ}}_{i j} = {(\frac{1}{T} \sum_{t = 1}^{T} {\hat{u}}_{i t}^{2})}^{- 1 / 2} {(\frac{1}{T} \sum_{t = 1}^{T} {\hat{u}}_{j t}^{2})}^{- 1 / 2} \frac{1}{T} \sum_{t = 1}^{T} {\hat{u}}_{i t} {\hat{u}}_{j t} .$ By the triangular inequality,

\begin{array}{l} | {\hat{ρ}}_{i j} - ρ_{i j} | \leq \underset{X_{1}}{\underset{︸}{\frac{| \frac{1}{T} \sum_{t} {\hat{u}}_{i t} {\hat{u}}_{j t} - u_{i t} u_{j t} |}{{(\frac{1}{T} \sum_{t = 1}^{T} {\hat{u}}_{i t}^{2})}^{1 / 2} {(\frac{1}{T} \sum_{t = 1}^{T} {\hat{u}}_{j t}^{2})}^{1 / 2}}}} \\ + \underset{X_{2}}{\underset{︸}{| \frac{1}{T} \sum_{t} u_{i t} u_{j t} | | {(\frac{1}{T} \sum_{t = 1}^{T} {\hat{u}}_{i t}^{2} \frac{1}{T} \sum_{t = 1}^{T} {\hat{u}}_{j t}^{2})}^{- 1 / 2} - {(\frac{1}{T} \sum_{t = 1}^{T} u_{i t}^{2} \frac{1}{T} \sum_{t = 1}^{T} u_{i t}^{2})}^{- 1 / 2} |}} \end{array}

By part (iv) of Lemma C.2, $P ({max}_{i, j \leq n} | \frac{1}{T} \sum_{t = 1}^{T} {\hat{u}}_{i t} {\hat{u}}_{j t} - E u_{i t} u_{j t} | < C \sqrt{\log n / T}) \to 1$ . Hence for sufficiently large M > 0 such that ${min}_{j} E (u_{j t}^{2}) - C / M > C \sqrt{\log n / T}$ ,

\begin{array}{l} P (max_{i j} | X_{1} | > M \sqrt{\frac{\log n}{T}}) \leq P (min_{j} \frac{1}{T} \sum_{t} {\hat{u}}_{j t}^{2} < C / M) + o (1) \\ \leq P (max_{j} | \frac{1}{T} \sum_{t} {\hat{u}}_{j t}^{2} - E u_{j t}^{2} | > min_{j} E u_{j t}^{2} - C / M) + o (1) = o (1) . \end{array}

By a similar argument, there is M′ > 0 so that $P ({max}_{i j} | X_{2} | > M^{'} \sqrt{\frac{\log n}{T}}) = o (1) .$ The result then follows as, uniformly in θ ∈ Θ,

\begin{array}{l} P (max_{i j} | {\hat{ρ}}_{i j} - ρ_{i j} | \geq 2 (M + M^{'}) \sqrt{\log n / T}) \\ \leq P (max_{i j} | X_{1} | \geq (M + M^{'}) \sqrt{\log n / T}) + P (max_{i j} | X_{2} | \geq (M + M^{'}) \sqrt{\log n / T}) = o (1) . \end{array}

Proof of Proposition 5.1

As $1 - ρ_{i j}^{2} > 1 - c$ uniformly for (i, j) and θ, the second convergence follows from Lemma C.3. Also, with probability approaching one,

\frac{| {\hat{ρ}}_{i j} - ρ_{i j} |}{{\hat{v}}_{i j}^{1 / 2}} \leq \frac{3 \sqrt{T}}{2 (1 - c)} C \sqrt{\frac{\log n}{T}} < δ_{N, T} .

C.2 Proof of Theorem 5.1

Lemma C.4

J₁ has power uniformly on $Θ (J_{1}) = {\sum_{i < j} ρ_{i j}^{2} \geq C n^{2} \log n / T}$ for some C. Proof. By Lemma C.3, there is C > 0, ${inf}_{θ \in Θ} P ({max}_{i j} | {\hat{ρ}}_{i j} - ρ_{i j} | < C \sqrt{\log n / T} | θ) \to 1$ . Let

A = {{\sum_{i < j} ({\hat{ρ}}_{i j} - ρ_{i j})}^{2} < C^{2} n^{2} (\log n / T)} .

Then ${inf}_{Θ} P (A | θ) \to 1$ . On the event A, we have, uniformly in θ = {ρ_i_j},

\sum_{i < j} ({\hat{ρ}}_{i j} - ρ_{i j}) ρ_{i j} \leq {({\sum_{i < j} ({\hat{ρ}}_{i j} - ρ_{i j})}^{2})}^{1 / 2} {(\sum_{i < j} ρ_{i j}^{2})}^{1 / 2} \leq \frac{C n \sqrt{\log n}}{\sqrt{T}} {(\sum_{i < j} ρ_{i j}^{2})}^{1 / 2} .

Therefore, when $\sum_{i < j} ρ_{i j}^{2} \geq 16 C^{2} n^{2} \log n / T$ ,

\sum_{i < j} {\hat{ρ}}_{i j}^{2} = {\sum_{i < j} ({\hat{ρ}}_{i j} - ρ_{i j})}^{2} + ρ_{i j}^{2} + 2 ({\hat{ρ}}_{i j} - ρ_{i j}) ρ_{i j} \geq \sum_{i < j} ρ_{i j}^{2} - \frac{2 C n \sqrt{\log n}}{\sqrt{T}} {(\sum_{i < j} ρ_{i j}^{2})}^{1 / 2} \geq \frac{1}{2} \sum_{i < j} ρ_{i j}^{2} .

This entails that when $\sum_{i < j} ρ_{i j}^{2} \geq 16 C n^{2} \log n / T$ , we have

\begin{array}{l} sup_{Θ (J_{1})} P (J_{1} < F_{q} | θ) \leq sup_{Θ (J_{1})} P (\sum_{i < j} {\hat{ρ}}_{i j}^{2} < \frac{n (n - 1)}{2 T} + (F_{q} + \frac{n}{2 (T - 1)}) \frac{\sqrt{n (n - 1)}}{T} | θ) \\ \leq sup_{Θ (J_{1})} P (\frac{1}{2} \sum_{i < j} ρ_{i j}^{2} < \frac{n (n - 1)}{2 T} + (F_{q} + \frac{n}{2 (T - 1)}) \frac{\sqrt{n (n - 1)}}{T} | θ) + sup_{Θ (J_{1})} P (A^{c} | θ) \to 0. \end{array}

Proof of Theorem 5.1

It suffices to verify conditions (i)–(iii) of Theorem 3.2. Condition (i) follows from Theorem 1 of Baltagi et al. (2012). As for condition (ii), note that $J_{1} \geq - \frac{\sqrt{n (n - 1)}}{2} - \frac{n}{2 (T - 1)}$ almost surely. Hence as n, T → ∞,

inf_{θ \in Θ_{s}} P (c \sqrt{N} + J_{1} > z_{q} | θ) \geq inf_{θ \in Θ_{s}} P (c \sqrt{N} - \frac{\sqrt{n (n - 1)}}{2} - \frac{n}{2 (T - 1)} > z_{q} | θ) = 1.

Finally, condition (iii) follows from Lemma C.4.

Footnotes

JEL code: C12, C33, C58

Contributor Information

Yuan Liao, Email: yuanliao@umd.edu.

Jiawei Yao, Email: jiaweiy@princeton.edu.

References

Andrews D. Hypothesis testing with a restricted parameter space. Journal of Econometrics. 1998;84:155–199. [Google Scholar]
Andrews D. Cross-sectional regression with common shocks. Econometrica. 2005;73:1551–1585. [Google Scholar]
Ang A, Kristensen D. Testing conditional factor models. Journal of Financial Economics. 2012;106:132–156. [Google Scholar]
Antoniadis A, Fan J. Regularized wavelet approximations. Journal of the American Statistical Association. 2001;96:939–967. [Google Scholar]
Bai ZD, Saranadasa H. Effect of high dimension: by an example of a two sample problem. Statistica Sinica. 1996;6:311–329. [Google Scholar]
Baltagi B. Econometric Analysis of Panel Data. fourth. Wiley; 2008. edition ed. [Google Scholar]
Baltagi B, Feng Q, Kao C. A lagrange multiplier test for cross-sectional dependence in a fix effects panel data model. Journal of Econometrics. 2012;170:164–177. [Google Scholar]
Beaulieu M, Dufour J, Khalaf L. Multivariate tests of mean-variance efficiency with possibly non-gaussian errors: an exact simulation based approach. Journal of Business and Economic Statistics. 2007;25:398–410. [Google Scholar]
Bickel P, Levina E. Covariance regularization by thresholding. Annals of Statistics. 2008;36:2577–2604. [Google Scholar]
Breusch T, Pagan A. The lagrange multiplier test and its application to model specification in econometrics. Review of Economic Studies. 1980;47:239–254. [Google Scholar]
Cai T, Liu W, Xia Y. Two-sample covariance matrix testing and support recovery in high-dimensional and sparse settings. Journal of the American Statistical Association. 2013;108:265–277. [Google Scholar]
Cai T, Zhang C, Zhou H. Optimal rates of convergence for covariance matrix estimation. Annals of Statistics. 2010;38:2118–2144. [Google Scholar]
Cai TT, Yuan M. Adaptive covariance matrix estimation through block thresholding. The Annals of Statistics. 2012;40:2014–2042. [Google Scholar]
Carhart MM. On persistence in mutual fund performance. The Journal of finance. 1997;52:57–82. [Google Scholar]
Chamberlain G, Rothschild M. Arbitrage, factor structure and mean-variance analyssi in large asset markets. Econometrica. 1983;51:1305–1324. [Google Scholar]
Chen SX, Qin YL. A two-sample test for high-dimensional data with applications to gene-set testing. The Annals of Statistics. 2010;38:808–835. [Google Scholar]
Chernozhukov V, Chetverikov D, Kato K. Tech rep. MIT; 2013. Testing many moment inequalities. [Google Scholar]
Connor G. A unified beta pricing theory. Journal of Economic Theory. 1984;34:13–31. [Google Scholar]
Connor G, Korajczyk R. A test for the number of factors in an approximate factor model. Journal of Finance. 1993;48:1263–1291. [Google Scholar]
Donald SG, Imbens GW, Newey WK. Empirical likelihood estimation and consistent tests with conditional moment restrictions. Journal of Econometrics. 2003;117:55–93. [Google Scholar]
Fama E, French K. The cross-section of expected stock returns. Journal of Finance. 1992;47:427–465. [Google Scholar]
Fan J. Test of significance based on wavelet thresholding and neyman’s truncation. Journal of the American Statistical Association. 1996;91:674–688. [Google Scholar]
Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association. 2001;96:1348–1360. [Google Scholar]
Fan J, Liao Y, Mincheva M. High dimensional covariance matrix estimation in approximate factor models. Annals of Statistics. 2011;39:3320–3356. doi: 10.1214/11-AOS944. [DOI] [PMC free article] [PubMed] [Google Scholar]
Fan J, Liao Y, Mincheva M. Large covariance estimation by thresholding principal orthogonal complements (with discussion) Journal of the Royal Statistical Society, Series B. 2013;75:603–680. doi: 10.1111/rssb.12016. [DOI] [PMC free article] [PubMed] [Google Scholar]
Fan J, Liao Y, Yao J. Tech rep. Princeton University; 2014. Power enhancement in high dimensional cross-sectional tests. [DOI] [PMC free article] [PubMed] [Google Scholar]
Gagliardini P, Ossola E, Scaillet O. Tech rep. Swiss Finance Institute; 2011. Time-varying risk premium in large cross-sectional equidity datasets. [Google Scholar]
Gibbons M, Ross S, Shanken J. A test of the efficiency of a given portfolio. Econometrica. 1989;57:1121–1152. [Google Scholar]
Hall P, Jin J. Innovated higher criticism for detecting sparse signals in correlated noise. The Annals of Statistics. 2010;38:1686–1732. [Google Scholar]
Hansen LP, Richard SF. The role of conditioning information in deducing testable restrictions implied by dynamic asset pricing models. Econometrica. 1987;55:587–613. [Google Scholar]
Hansen P. Tech rep. CREATES; 2003. Asymptotic tests of composite hypotheses. [Google Scholar]
Hansen P. A test for superior predictive ability. Journal of Business and Economic Statistics. 2005;23:365–380. [Google Scholar]
Im K, Ahn S, Schmidt P, Wooldridge J. Efficient estimation of panel data models with strictly exogenous explanatory variables. Journal of Econometrics. 1999;93:177–201. [Google Scholar]
MacKinlay A, Richardson M. Using generalized method of moments to test mean-variance efficiency. Journal of Finance. 1991;46:511–527. [Google Scholar]
Pesaran H, Ullah A, Yamagata T. A bias-adjusted lm test of error cross section independence. Econometrics Journal. 2008;11:105–127. [Google Scholar]
Pesaran H, Yamagata T. Tech rep. University of South California; 2012. Testing capm with a large number of assets. [Google Scholar]
Ross S. The arbitrage theory of capital asset pricing. Journal of Economic Theory. 1976;13:341–360. [Google Scholar]
Rothman A, Levina E, Zhu J. Generalized thresholding of large covariance matrices. Journal of the American Statistical Association. 2009;104:177–186. [Google Scholar]
Stock J, Watson M. Forecasting using principal components from a large number of predictors. Journal of the American Statistical Association. 2002;97:1167–1179. [Google Scholar]
Zhong P, Chen S, Xu M. Tests alternative to higher criticism for high-dimensional means under sparsity and column-wise dependence. Annals of Statistics. 2013;41:2820–2851. [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

NIHMS749711-supplement-2.pdf^{(252.7KB, pdf)}

[R1] Andrews D. Hypothesis testing with a restricted parameter space. Journal of Econometrics. 1998;84:155–199. [Google Scholar]

[R2] Andrews D. Cross-sectional regression with common shocks. Econometrica. 2005;73:1551–1585. [Google Scholar]

[R3] Ang A, Kristensen D. Testing conditional factor models. Journal of Financial Economics. 2012;106:132–156. [Google Scholar]

[R4] Antoniadis A, Fan J. Regularized wavelet approximations. Journal of the American Statistical Association. 2001;96:939–967. [Google Scholar]

[R5] Bai ZD, Saranadasa H. Effect of high dimension: by an example of a two sample problem. Statistica Sinica. 1996;6:311–329. [Google Scholar]

[R6] Baltagi B. Econometric Analysis of Panel Data. fourth. Wiley; 2008. edition ed. [Google Scholar]

[R7] Baltagi B, Feng Q, Kao C. A lagrange multiplier test for cross-sectional dependence in a fix effects panel data model. Journal of Econometrics. 2012;170:164–177. [Google Scholar]

[R8] Beaulieu M, Dufour J, Khalaf L. Multivariate tests of mean-variance efficiency with possibly non-gaussian errors: an exact simulation based approach. Journal of Business and Economic Statistics. 2007;25:398–410. [Google Scholar]

[R9] Bickel P, Levina E. Covariance regularization by thresholding. Annals of Statistics. 2008;36:2577–2604. [Google Scholar]

[R10] Breusch T, Pagan A. The lagrange multiplier test and its application to model specification in econometrics. Review of Economic Studies. 1980;47:239–254. [Google Scholar]

[R11] Cai T, Liu W, Xia Y. Two-sample covariance matrix testing and support recovery in high-dimensional and sparse settings. Journal of the American Statistical Association. 2013;108:265–277. [Google Scholar]

[R12] Cai T, Zhang C, Zhou H. Optimal rates of convergence for covariance matrix estimation. Annals of Statistics. 2010;38:2118–2144. [Google Scholar]

[R13] Cai TT, Yuan M. Adaptive covariance matrix estimation through block thresholding. The Annals of Statistics. 2012;40:2014–2042. [Google Scholar]

[R14] Carhart MM. On persistence in mutual fund performance. The Journal of finance. 1997;52:57–82. [Google Scholar]

[R15] Chamberlain G, Rothschild M. Arbitrage, factor structure and mean-variance analyssi in large asset markets. Econometrica. 1983;51:1305–1324. [Google Scholar]

[R16] Chen SX, Qin YL. A two-sample test for high-dimensional data with applications to gene-set testing. The Annals of Statistics. 2010;38:808–835. [Google Scholar]

[R17] Chernozhukov V, Chetverikov D, Kato K. Tech rep. MIT; 2013. Testing many moment inequalities. [Google Scholar]

[R18] Connor G. A unified beta pricing theory. Journal of Economic Theory. 1984;34:13–31. [Google Scholar]

[R19] Connor G, Korajczyk R. A test for the number of factors in an approximate factor model. Journal of Finance. 1993;48:1263–1291. [Google Scholar]

[R20] Donald SG, Imbens GW, Newey WK. Empirical likelihood estimation and consistent tests with conditional moment restrictions. Journal of Econometrics. 2003;117:55–93. [Google Scholar]

[R21] Fama E, French K. The cross-section of expected stock returns. Journal of Finance. 1992;47:427–465. [Google Scholar]

[R22] Fan J. Test of significance based on wavelet thresholding and neyman’s truncation. Journal of the American Statistical Association. 1996;91:674–688. [Google Scholar]

[R23] Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association. 2001;96:1348–1360. [Google Scholar]

[R24] Fan J, Liao Y, Mincheva M. High dimensional covariance matrix estimation in approximate factor models. Annals of Statistics. 2011;39:3320–3356. doi: 10.1214/11-AOS944. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R25] Fan J, Liao Y, Mincheva M. Large covariance estimation by thresholding principal orthogonal complements (with discussion) Journal of the Royal Statistical Society, Series B. 2013;75:603–680. doi: 10.1111/rssb.12016. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R26] Fan J, Liao Y, Yao J. Tech rep. Princeton University; 2014. Power enhancement in high dimensional cross-sectional tests. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R27] Gagliardini P, Ossola E, Scaillet O. Tech rep. Swiss Finance Institute; 2011. Time-varying risk premium in large cross-sectional equidity datasets. [Google Scholar]

[R28] Gibbons M, Ross S, Shanken J. A test of the efficiency of a given portfolio. Econometrica. 1989;57:1121–1152. [Google Scholar]

[R29] Hall P, Jin J. Innovated higher criticism for detecting sparse signals in correlated noise. The Annals of Statistics. 2010;38:1686–1732. [Google Scholar]

[R30] Hansen LP, Richard SF. The role of conditioning information in deducing testable restrictions implied by dynamic asset pricing models. Econometrica. 1987;55:587–613. [Google Scholar]

[R31] Hansen P. Tech rep. CREATES; 2003. Asymptotic tests of composite hypotheses. [Google Scholar]

[R32] Hansen P. A test for superior predictive ability. Journal of Business and Economic Statistics. 2005;23:365–380. [Google Scholar]

[R33] Im K, Ahn S, Schmidt P, Wooldridge J. Efficient estimation of panel data models with strictly exogenous explanatory variables. Journal of Econometrics. 1999;93:177–201. [Google Scholar]

[R34] MacKinlay A, Richardson M. Using generalized method of moments to test mean-variance efficiency. Journal of Finance. 1991;46:511–527. [Google Scholar]

[R35] Pesaran H, Ullah A, Yamagata T. A bias-adjusted lm test of error cross section independence. Econometrics Journal. 2008;11:105–127. [Google Scholar]

[R36] Pesaran H, Yamagata T. Tech rep. University of South California; 2012. Testing capm with a large number of assets. [Google Scholar]

[R37] Ross S. The arbitrage theory of capital asset pricing. Journal of Economic Theory. 1976;13:341–360. [Google Scholar]

[R38] Rothman A, Levina E, Zhu J. Generalized thresholding of large covariance matrices. Journal of the American Statistical Association. 2009;104:177–186. [Google Scholar]

[R39] Stock J, Watson M. Forecasting using principal components from a large number of predictors. Journal of the American Statistical Association. 2002;97:1167–1179. [Google Scholar]

[R40] Zhong P, Chen S, Xu M. Tests alternative to higher criticism for high-dimensional means under sparsity and column-wise dependence. Annals of Statistics. 2013;41:2820–2851. [Google Scholar]

PERMALINK

Power Enhancement in High Dimensional Cross-Sectional Tests

Jianqing Fan

Yuan Liao

Jiawei Yao

Abstract

1 Introduction

Power Enhancement Properties

2 Power Enhancement in high dimensions

2.1 Power enhancement

2.2 Construction of power enhancement component

2.3 Comparisons with thresholding and extreme-value tests

3 Asymptotic properties

3.1 Main results

Assumption 3.1

Theorem 3.1

Remark 3.1

Theorem 3.2

3.2 Power enhancement for quadratic tests

Assumption 3.2

Theorem 3.3

3.3 Low power of quadratic statistics under sparse alternatives

Theorem 3.4

4 Application: Testing Factor Pricing Models

4.1 The model

4.2 Power enhancement component

4.3 Feasible Wald test in high dimensions

Remark 4.1

4.4 Does the thresholded covariance estimator affect the size?

4.5 Regularity conditions

Assumption 4.1

Remark 4.2

Remark 4.3

Assumption 4.2

4.6 Asymptotic properties

Proposition 4.1

Proposition 4.2

Theorem 4.1

5 Application: Testing Cross-Sectional Independence

5.1 The model

5.2 Power enhancement test

5.3 Asymptotic properties

Assumption 5.1

Assumption 5.2

Proposition 5.1

Theorem 5.1

6 Numerical studies

6.1 Testing factor pricing models

Table II.

6.2 Testing cross-sectional independence

Table III.

6.3 Empirical evidence of sparse alternatives

Table IV.

7 Discussions

Supplementary Material

Table I.

Acknowledgments

A Proofs for Section 3

A.1 Proof of Theorem 3.1

Proof

A.2 Proof of Theorem 3.2

Proof

A.3 Proof of Theorem 3.3

Proof

A.4 Proof of Theorem 3.4

Proof

B Proofs for Section 4

Lemma B.1

Proof

B.1 Proof of Proposition 4.1

B.2 Proof of Proposition 4.2

Proof of Theorem 4.1

C Proofs for Section 5

C.1 Proof of Proposition 5.1

Lemma C.1

Proof

Lemma C.2

Proof

Lemma C.3

Proof