Abstract
We propose a novel technique to boost the power of testing a high-dimensional vector H0 : θ = 0 against sparse alternatives where the null hypothesis is violated by only a few components. Existing tests based on quadratic forms, such as the Wald statistic, often suffer from low power due to the accumulation of errors in estimating high-dimensional parameters. More powerful tests for sparse alternatives, such as thresholding and extreme-value tests, on the other hand, require either stringent conditions or bootstrap to derive the null distribution, and often suffer from size distortions due to slow convergence. Based on a screening technique, we introduce a “power enhancement component”, which is zero under the null hypothesis with high probability, but diverges quickly under sparse alternatives. The proposed test statistic combines the power enhancement component with an asymptotically pivotal statistic, and strengthens the power under sparse alternatives. The null distribution does not require stringent regularity conditions, and is completely determined by that of the pivotal statistic. As specific applications, the proposed methods are applied to testing the factor pricing models and validating the cross-sectional independence in panel data models.
Keywords: sparse alternatives, thresholding, large covariance matrix estimation, Wald-test, screening, cross-sectional independence, factor pricing model
1 Introduction
High-dimensional cross-sectional models have received growing attention in both theoretical and applied econometrics. These models typically involve a structural parameter, whose dimension can be either comparable to or much larger than the sample size. This paper addresses testing a high-dimensional structural parameter:
H0 : θ = 0,
where N = dim(θ) is allowed to grow faster than the sample size T. We are particularly interested in boosting the power under sparse alternatives, in which θ is approximately a sparse vector. This type of alternative is of particular interest, as the null hypothesis typically represents some economic theory and violations are expected to come from only a few exceptional individuals.
A showcase example is the factor pricing model in financial economics. Let yit be the excess return of the i-th asset at time t, and ft = (f1t,…, fKt)′ be the K-dimensional observable factors. Then, the excess return has the following decomposition:
yit = θi + bi′ft + uit,   i = 1, …, N,  t = 1, …, T,
where bi = (bi1,…, biK)′ is a vector of factor loadings and uit represents the idiosyncratic error. The key implication of the multi-factor pricing theory is that the intercept θi should be zero, known as the “mean-variance efficiency” pricing, for any asset i. An important question is then whether such a pricing theory can be validated by empirical data, namely, we wish to test the null hypothesis H0 : θ = 0, where θ = (θ1, …, θN)′ is the vector of intercepts for all N financial assets. As the factor pricing model is derived from theories of financial economics (Ross, 1976), one would expect that inefficient pricing by the market should occur only for a small fraction of exceptional assets. Indeed, our empirical study of the constituents in the S&P 500 index indicates that there are only a few significant nonzero-alpha stocks, corresponding to a small portion of mis-priced stocks rather than systematic mis-pricing of the whole market. Therefore, it is important to construct tests that have high power when θ is sparse.
Most of the conventional tests for H0 : θ = 0 are based on a quadratic form:
W = θ̂′Vθ̂.
Here θ̂ is an element-wise consistent estimator of θ, and V is a high-dimensional positive definite weight matrix, often taken to be the inverse of the asymptotic covariance matrix of θ̂ (e.g., the Wald test). After a proper standardization, the standardized W is asymptotically pivotal under the null hypothesis. In high-dimensional testing problems, however, various difficulties arise when using a quadratic statistic. First, when N > T, estimating V is challenging, as the sample analogue of the covariance matrix is singular. More fundamentally, tests based on W have low power under sparse alternatives. The reason is that the quadratic statistic accumulates high-dimensional estimation errors under H0, which results in large critical values that can dominate the signals in the sparse alternatives. A formal proof of this statement is given in Section 3.3.
To overcome the aforementioned difficulties, this paper introduces a novel technique for high-dimensional cross-sectional testing problems, called the “power enhancement”. Let J1 be a test statistic that has a correct asymptotic size (e.g., the Wald statistic), but may suffer from low power under sparse alternatives. Let us augment the test by adding a power enhancement component J0 ≥ 0, which satisfies the following three properties:
Power Enhancement Properties
(a) Non-negativity: J0 ≥ 0 almost surely.
(b) No-size-distortion: Under H0, P(J0 = 0 | H0) → 1.
(c) Power-enhancement: J0 diverges in probability under some specific regions of alternatives Ha.
Our constructed power enhancement test takes the form
J = J0 + J1.
The non-negativity property (a) of J0 ensures that J is at least as powerful as J1. Property (b) guarantees that the asymptotic null distribution of J is determined by that of J1, so the size distortion due to adding J0 is negligible. Property (c) guarantees significant power improvement under the designated alternatives. The power enhancement principle is thus summarized as follows: given a standard test statistic with a correct asymptotic size, its power is substantially enhanced with little size distortion; this is achieved by adding a component J0 that is asymptotically zero under the null, but diverges and dominates J1 under some specific regions of alternatives.
An example of such a J0 is a screening statistic:
J0 = √N Σj∈Ŝ θ̂j²/v̂j,   Ŝ = {j ≤ N : |θ̂j|/√v̂j > δN,T},
where v̂j denotes a data-dependent normalizing factor, taken as the estimated asymptotic variance of θ̂j. The threshold δN,T, depending on (N, T), is a high-criticism value, chosen to be slightly larger than the maximum noise level so that under H0, J0 = 0 with probability approaching one. In addition, we take J1 as a pivotal statistic, e.g., the standardized Wald statistic or other quadratic forms such as the sum of the squared marginal t-statistics (Bai and Saranadasa, 1996; Chen and Qin, 2010; Pesaran and Yamagata, 2012). The screening set Ŝ also captures the indices where the null hypothesis is violated.
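To fix ideas, the following minimal sketch (Python/NumPy; the inputs θ̂, v̂ and δN,T are hypothetical placeholders) computes the screening set and the screening statistic in the form described above:

```python
import numpy as np

def screening_statistic(theta_hat, v_hat, delta):
    """Screening set S_hat and statistic J0 = sqrt(N) * sum_{j in S_hat} theta_hat_j^2 / v_hat_j.

    theta_hat : length-N array of component-wise estimates of theta
    v_hat     : length-N array of their estimated asymptotic variances
    delta     : the high-criticism threshold delta_{N,T}
    """
    theta_hat = np.asarray(theta_hat, dtype=float)
    v_hat = np.asarray(v_hat, dtype=float)
    N = theta_hat.size
    # keep only the components whose standardized magnitude exceeds the threshold
    S_hat = np.flatnonzero(np.abs(theta_hat) / np.sqrt(v_hat) > delta)
    J0 = np.sqrt(N) * np.sum(theta_hat[S_hat] ** 2 / v_hat[S_hat])  # zero when S_hat is empty
    return J0, S_hat
```

Under the null, Ŝ is empty with probability approaching one and J0 = 0, so the null distribution of J = J0 + J1 is inherited from J1.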
One of the major differences between our test and most thresholding tests (Fan, 1996; Hansen, 2005) is that ours enhances the power substantially by adding a screening statistic, which does not introduce extra difficulty in deriving the asymptotic null distribution. Since J0 = 0 under H0, the test relies on the pivotal statistic J1 to determine its null distribution. In contrast, existing tests such as thresholding, extreme value, and higher criticism tests (e.g., Hall and Jin (2010)) often require stringent conditions to derive their asymptotic null distributions, making them restrictive in econometric applications, due to slow rates of convergence. Moreover, the asymptotic null distributions are inaccurate in finite samples. As pointed out by Hansen (2003), these statistics are non-pivotal even asymptotically, and require bootstrap methods to simulate the null distributions.
As a specific application, in addition to testing the aforementioned factor pricing model, this paper also studies tests for cross-sectional independence in mixed effect panel data models:
yit = α + xit′β + μi + uit,   i ≤ n,  t ≤ T.
Let ρij denote the correlation between uit and ujt, assumed to be time invariant. The “cross-sectional independence” test concerns the following null hypothesis:
H0 : ρij = 0  for all i ≠ j,
that is, under the null hypothesis, the n × n covariance matrix Σu of {uit}i≤n is diagonal. In empirical applications, weak cross-sectional correlations are often present, which results in a sparse covariance Σu with just a few nonzero off-diagonal elements. Namely, the vector θ = (ρ12, ρ13,…, ρn−1,n) is sparse, and this sparsity should be incorporated to improve the power of the test. The dimensionality N = n(n − 1)/2 can be much larger than the number of observations. Therefore, power enhancement under sparse alternatives is very important for this testing problem. By choosing δN,T to dominate the maximum estimation noise as detailed in Section 5, under the sparse alternative the set Ŝ “screens out” most of the estimation noise, and contains only a few indices of the nonzero off-diagonal entries. Therefore, Ŝ not only reveals the sparse structure of Σu, but also identifies the nonzero off-diagonal entries with overwhelming probability.
There has been a large literature on high-dimensional cross-sectional tests. For instance, the literature on testing the factor pricing model is found in Gibbons et al. (1989), MacKinlay and Richardson (1991), Beaulieu et al. (2007) and Pesaran and Yamagata (2012), all in quadratic forms. Gagliardini et al. (2011) studied estimation of the risk premia in a CAPM and its associated testing problem, which is closely related to our work. While we also study a large panel of stock returns as a specific example and double asymptotics (as N, T → ∞), the problems and approaches being considered are very different. This paper addresses a general problem of enhancing powers under high-dimensional sparse alternatives.
For the mixed effect panel data model, most of the testing statistics are based on the sum of squared residual correlations, which also accumulates many off-diagonal estimation errors in estimating the covariance matrix of (u1t, …, unt). See, for example, Breusch and Pagan (1980), Pesaran et al. (2008), and Baltagi et al. (2012). Our problem is also related to the test with a restricted parameter space, previously considered by Andrews (1998), who improves the power by directing towards the “relevant” alternatives; see also Hansen (2003) for a related idea. Chernozhukov et al. (2013) proposed a high-dimensional inequality test, and employed an extreme value statistic, whose critical value is determined through applying the moderate deviation theory on an upper bound of the rejection probability. In contrast, the asymptotic distribution of our proposed power enhancement statistic is determined through the pivotal statistic J1, and the power is improved via the contributions of sparse alternatives that survive the screening process.
The remainder of the paper is organized as follows. Section 2 sets up the preliminaries and highlights the major differences from existing tests. Section 3 presents the main results on the power enhancement test. As applications to specific cases, Section 4 and Section 5 respectively study the factor pricing model and the test of cross-sectional independence. Section 6 presents simulation results and empirical evidence of sparse alternatives based on real data. Section 7 provides further discussions. Proofs are given in the supplementary material.
Throughout the paper, for a symmetric matrix A, let λmin(A) and λmax(A) represent its minimum and maximum eigenvalues. Let ‖A‖2 and ‖A‖1 denote its operator norm and l1-norm respectively, defined by ‖A‖2 = λmax1/2(A′A) and ‖A‖1 = maxi Σj |Aij|. For a vector θ, define ‖θ‖ = (Σj θj²)1/2 and ‖θ‖max = maxj |θj|. For two deterministic sequences aT and bT, we write aT ≪ bT (or equivalently bT ≫ aT) if aT = o(bT). Also, aT ≍ bT if there are constants C1, C2 > 0 so that C1bT ≤ aT ≤ C2bT for all large T. Finally, |S|0 denotes the number of elements in a set S.
2 Power Enhancement in high dimensions
This section introduces power enhancement techniques and provides heuristics to justify the techniques. The differences from related methods in the literature are also highlighted.
2.1 Power enhancement
Consider a testing problem:
H0 : θ = 0  versus  Ha : θ ∈ Θa,
where Θa ⊂ ℝN\{0} is an alternative set in ℝN. A typical example is Θa = {θ : θ ≠ 0}. Suppose we observe stationary data D of sample size T. Let J1(D) be a certain test statistic, which will also be written as J1. Often J1 is constructed such that under H0, it has a non-degenerate limiting distribution F: as T, N → ∞,
J1 →d F.   (2.1)
For the significance level q ∈ (0, 1), let Fq be the critical value for J1. Then the critical region is taken as {D : J1 > Fq} and satisfies
limT,N→∞ P(J1 > Fq | H0) = q.
This ensures that J1 has a correct asymptotic size. In addition, it is often the case that J1 has high power against H0 on a subset Θ(J1) ⊂ Θa, namely,
infθ∈Θ(J1) P(J1 > Fq | θ) → 1,  as T, N → ∞.
Typically, Θ(J1) consists of those θ’s whose l2-norm is relatively large, as J1 is normally an omnibus test (e.g., the Wald test).
In a data-rich environment, econometric models often involve high-dimensional parameters in which dim(θ) = N can grow fast with the sample size T. We are particularly interested in sparse alternatives Θs ⊂ Θa under which H0 is violated in only a few exceptional components of θ. Specifically, Θs ⊂ ℝN is a subset of Θa, and when θ ∈ Θs, the number of non-vanishing components is much smaller than N. As a result, its l2-norm is relatively small. Therefore, under the sparse alternative Θs, the omnibus test J1 typically has low power, due to the accumulation of high-dimensional estimation errors. Detailed explanations are given in Section 3.3 below.
We introduce a power enhancement principle for high-dimensional sparse testing, by bringing in a data-dependent component J0 that satisfies the Power Enhancement Properties defined in Section 1. The component J0 does not serve as a test statistic on its own, but is added to a classical statistic J1 that is often pivotal (e.g., the Wald statistic), so the proposed test statistic is defined by
J = J0 + J1.
Our introduced “power enhancement principle” is explained as follows.
- Under mild conditions, P(J0 = 0 | H0) → 1 by construction. Hence when (2.1) is satisfied, we have
J = J0 + J1 →d F under H0.
Therefore, adding J0 to J1 does not affect the size of the standard test statistic asymptotically: both J and J1 have the same limiting distribution under H0.
- The critical region of J is defined by
{D : J > Fq}.
As J0 ≥ 0, P (J > Fq|θ) ≥ P (J1 > Fq|θ) for all θ ∈ Θa. Hence the power of J is at least as large as that of J1. When θ ∈ Θs is a sparse high-dimensional vector under the alternative, the “classical” test J1 may have low power as ‖θ‖ is typically relatively small. On the other hand, for θ ∈ Θs, J0 stochastically dominates J1. As a result, P (J > Fq|θ) > P (J1 > Fq|θ) strictly holds, so the power of J1 over the set Θs is enhanced after adding J0. Often J0 diverges fast under sparse alternatives Θs, which ensures P (J > Fq|θ) → 1 for θ ∈ Θs. In contrast, the classical test only has P (J1 > Fq|θ) < c < 1 for some c ∈ (0, 1) and θ ∈ Θs, and when ‖θ‖ is sufficiently small, P (J1 > Fq|θ) is approximately q.
It is important to note that the power is enhanced without sacrificing the size asymptotically. In fact, the power enhancement principle can be asymptotically fulfilled under the weaker condition J0|H0 →p 0. However, we construct J0 so that P(J0 = 0 | H0) → 1 in order to ensure good finite-sample size properties.
2.2 Construction of power enhancement component
We construct a specific power enhancement component J0 that satisfies properties (a)–(c) of the power enhancement properties, and identify the sparse alternatives in Θs. Such a component can be constructed via screening as follows. Suppose we have an estimator θ̂ = (θ̂1, …, θ̂N)′ that is element-wise consistent for θ. For some slowly growing sequence δN,T → ∞ (as T, N → ∞), define a screening set:
Ŝ = {j : |θ̂j|/√v̂j > δN,T,  j = 1, …, N},   (2.2)
where v̂j is a data-dependent normalizing constant, often taken as the estimated asymptotic variance of θ̂j. The sequence δN,T, called the “high criticism”, is chosen to dominate the maximum noise level, satisfying (recall that Θa denotes the alternative set)
maxj≤N |θ̂j − θj|/√v̂j = oP(δN,T)   (2.3)
for θ under both the null and alternative hypotheses. The screening statistic J0 is then defined as
J0 = √N Σj∈Ŝ θ̂j²/v̂j.
By (2.2) and (2.3), under H0 : θ = 0,
P(Ŝ = ∅ | H0) → 1, and hence P(J0 = 0 | H0) → 1.
Therefore J0 satisfies the non-negativity and no-size-distortion properties.
Let {vj}j≤N be the population counterparts of {v̂j}j≤N. To satisfy the power-enhancement property, define
S(θ) = {j : |θj|/√vj > 3δN,T,  j = 1, …, N},   (2.4)
and in particular S(0) = ∅. We shall show in Theorem 3.1 below that P(S(θ) ⊂ Ŝ | θ) → 1 uniformly for all θ ∈ Θa ∪ {0}. Thus all the significant signals are contained in Ŝ with high probability. If S(θ) ≠ ∅, then by the definition of Ŝ and δN,T → ∞, we have, with probability approaching one,
J0 ≥ √N δN,T² → ∞.
Thus, the power of J1 is enhanced on the subset
Θs = {θ ∈ Θa : S(θ) ≠ ∅}.
Furthermore, Ŝ not only reveals the sparse structure of θ under the alternative, but also identifies the nonzero entries with overwhelming probability.
The introduced J0 can be combined with any other test statistic that has an accurate asymptotic size. Suppose J1 is a “classical” test statistic. Our power enhancement test is simply
J = J0 + J1.
For instance, suppose we can consistently estimate the inverse of the asymptotic covariance matrix of θ̂, denoted by V̂; then J1 can be chosen as the standardized Wald statistic:
J1 = (θ̂′V̂θ̂ − N)/√(2N).
As a result, the asymptotic distribution of J is N(0, 1) under the null hypothesis.
In sparse alternatives where ‖θ‖ may not grow fast enough with N but θ ∈ Θs, the combined test J0 + J1 can be very powerful. In contrast, we will formally show in Theorem 3.4 below that the conventional Wald test J1 can have very low power on its own. On the other hand, when the alternative is “dense” in the sense that ‖θ‖ grows fast with N, the conventional test J1 itself is consistent. In this case, J is still as powerful as J1. Therefore, if we denote by Θ(J1) ⊂ ℝN\{0} the set of alternative θ’s against which the classical J1 test has power converging to one, then the combined test J = J0 + J1 has power converging to one against θ on
Θs ∪ Θ(J1).
We shall show in Section 3 that the power is enhanced uniformly over θ ∈ Θs ∪ Θ(J1). In addition, the set Ŝ indicates which components may violate the null hypothesis.
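The sketch below combines the screening component with a standardized Wald statistic (Python/NumPy; the chi-squared-based normalization (θ̂′V̂θ̂ − N)/√(2N) is one common choice of pivotal J1, assumed here for illustration):

```python
import numpy as np

def power_enhanced_wald(theta_hat, cov_theta_hat, delta):
    """Power enhancement test J = J0 + J1, with J1 a standardized Wald statistic.

    theta_hat     : length-N estimate of theta
    cov_theta_hat : N x N estimated asymptotic covariance matrix of theta_hat
    delta         : the high-criticism threshold delta_{N,T}
    """
    N = theta_hat.size
    v_hat = np.diag(cov_theta_hat)                       # marginal variances v_hat_j
    S_hat = np.flatnonzero(np.abs(theta_hat) / np.sqrt(v_hat) > delta)
    J0 = np.sqrt(N) * np.sum(theta_hat[S_hat] ** 2 / v_hat[S_hat])
    # theta_hat' Cov^{-1} theta_hat is approximately chi-squared with N degrees of
    # freedom under H0, so (wald - N)/sqrt(2N) is asymptotically standard normal
    wald = theta_hat @ np.linalg.solve(cov_theta_hat, theta_hat)
    J1 = (wald - N) / np.sqrt(2 * N)
    return J0 + J1, J1, S_hat
```

A level-q test then rejects when J = J0 + J1 exceeds the (1 − q) quantile of the limiting distribution of J1 (here, the standard normal).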
2.3 Comparisons with thresholding and extreme-value tests
One of the fundamental differences between our power enhancement component J0 and existing tests with good power under sparse alternatives is that existing test statistics have a non-degenerate distribution under the null, and often require either bootstrap or strong conditions to derive the null distribution. Such convergence is typically slow, and serious size distortions appear in finite samples. In contrast, our screening statistic J0 uses the “high criticism” sequence δN,T to make P(J0 = 0 | H0) → 1, and hence does not serve as a test statistic on its own. The asymptotic null distribution is determined by that of J1, which is usually not hard to derive. As we shall see in the sections below, the required regularity conditions are relatively mild, which makes the power enhancement test widely applicable to many econometric problems.
In the high-dimensional testing literature, there are mainly two types of statistics with good power under sparse alternatives: the extreme value test and the thresholding test. The test based on extreme values studies the maximum deviation from the null hypothesis across the components of θ̂, and forms a statistic from the largest weighted components wj|θ̂j|δ for some δ > 0 and weights wj (e.g., Cai et al. (2013), Chernozhukov et al. (2013)). Such a test statistic typically converges slowly to its asymptotic counterpart. An alternative test is based on thresholding: for some δ > 0 and a pre-determined threshold level tN,
R = Σj≤N |θ̂j/√v̂j|δ 1{|θ̂j/√v̂j| > tN}.   (2.5)
For example, when tN is taken slightly less than maxj≤N |θ̂j/√v̂j|, R becomes the extreme statistic. When tN is small (e.g., 0), R becomes a traditional test, which is not powerful in detecting sparse alternatives, though it can have good size properties. For sufficiently large tN, the accumulation of estimation errors is prevented by the thresholding (see, e.g., Fan (1996) and Zhong et al. (2013)). In a low-dimensional setting, Hansen (2005) suggested using a threshold to enhance the power in a similar way.
Although (2.5) looks similar to J0, the ideas behind them are very different. Both the extreme value test and the thresholding test require regularity conditions that may be restrictive in econometric applications. For instance, it can be difficult to employ the central limit theorem directly on (2.5), as it requires the covariance between θ̂j and θ̂j+k to decay fast enough as k → ∞ (Zhong et al., 2013). In cross-sectional testing problems, this essentially requires an explicit ordering among the cross-sectional units, which is, however, often unavailable in panel data applications. In addition, as (2.5) effectively involves only a limited number of terms in the summation due to thresholding, the asymptotic theory does not provide adequate approximations, resulting in size distortions in applications. We demonstrate this size distortion in the simulation study.
3 Asymptotic properties
3.1 Main results
This section presents the regularity conditions and formally establishes the claimed power enhancement properties. Below we use P(·|θ) to denote the probability measure defined by the sampling distribution with parameter θ. Let Θ ⊂ ℝN be the parameter space of θ, which covers the union of {0} and the alternative set Θa. When we write infθ∈Θ P(·|θ), the infimum is taken over the space Θ, covering both the null and the alternative.
We begin with a high-level assumption. In specific applications, it can be verified with primitive conditions.
Assumption 3.1
As T, N → ∞, the sequence δN,T → ∞, and the estimators (θ̂, v̂) are such that
(i) infθ∈Θ P(maxj≤N |θ̂j − θj|/√vj < δN,T/2 | θ) → 1;
(ii) infθ∈Θ P(4/9 ≤ v̂j/vj ≤ 9/4 for all j ≤ N | θ) → 1.
The normalizing constant vj is often taken as the asymptotic variance of θ̂j, with v̂j being its consistent estimator. The constants 4/9 and 9/4 in condition (ii) are not optimally chosen, as this condition only requires v̂j to be a not-too-bad estimator of its population counterpart.
In many high-dimensional problems with strictly stationary data satisfying strong mixing conditions, it follows from the large-deviation theory that, typically, maxj≤N |θ̂j − θj|/√vj = OP(√(log N)). Therefore, we shall fix
δN,T = log(log T) √(log N),   (3.1)
which is a high criticism that slightly dominates the standardized noise level (it may be useful to recall that the maximum of N i.i.d. Gaussian noises with bounded variances behaves as √(2 log N) asymptotically). We shall provide primitive conditions for this choice of δN,T in the subsequent sections, so that Assumption 3.1 holds.
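A quick numerical check of this comparison, for hypothetical values of (N, T):

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 1000, 300
delta_NT = np.log(np.log(T)) * np.sqrt(np.log(N))    # the high-criticism choice in (3.1)

# the maximum of N i.i.d. standard Gaussian noises is about sqrt(2*log(N)) ~ 3.7,
# which is slightly dominated by delta_NT ~ 4.6 for this (N, T)
max_noise = np.abs(rng.standard_normal(N)).max()
print(np.sqrt(2 * np.log(N)), delta_NT, max_noise)
```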
Recall that Ŝ and S(θ) are defined by (2.2) and (2.4) respectively; in particular, under H0 : θ = 0, S(θ) = S(0) = ∅. The following theorem characterizes the asymptotic behavior of J0 and Ŝ under both the null and alternative hypotheses.
Theorem 3.1
Let Assumption 3.1 hold. As T, N → ∞, we have, under H0 : θ = 0, P(Ŝ = ∅ | H0) → 1. Hence
P(J0 = 0 | H0) → 1.
In addition,
infθ∈Θa∪{0} P(S(θ) ⊂ Ŝ | θ) → 1.
Besides the asymptotic behavior of J0, Theorem 3.1 also establishes a “sure screening” property of Ŝ, which means that Ŝ selects all the significant components, whose indices are in S(θ). This result holds uniformly in θ under both the null and alternative hypotheses.
Remark 3.1
Under additional mild assumptions, it can be further shown that P(Ŝ = S(θ) | θ) → 1 uniformly in θ. Hence the selection is consistent. While selection consistency is not a requirement of the power enhancement principle, we refer to our earlier manuscript Fan et al. (2014) for technical details.
We are now ready to formally show the power enhancement argument. The enhancement is achieved uniformly on the following set:
Θs = {θ ∈ Θ : maxj≤N |θj|/√vj > 3δN,T}.   (3.2)
In particular, if θ̂j is √T-consistent and √vj is its asymptotic standard deviation, then Tvj is bounded away from both zero and infinity. Using (3.1), we have
Θs ⊃ {θ ∈ Θ : maxj≤N |θj| > C log(log T) √(log N / T)}  for some constant C > 0.
This is a relatively weak condition on the strength of the maximal signal in order to be detected by J0.
A test is said to have high power uniformly on a set Θ* ⊂ ℝN \ {0} if
infθ∈Θ* P(reject H0 | θ) → 1,  as T, N → ∞.
For a given distribution function F, let Fq denote its (1 − q)th quantile, i.e., the critical value at significance level q.
Theorem 3.2
Let Assumption 3.1 hold. Suppose there is a test J1 such that
(i) it has an asymptotic non-degenerate null distribution F, and its critical region takes the form {D : J1 > Fq} for the significance level q ∈ (0, 1);
(ii) it has high power uniformly on some set Θ(J1) ⊂ Θ;
(iii) there is c > 0 so that infθ∈Θs P(J1 > −c√N δN,T² | θ) → 1, as T, N → ∞.
Then the power enhancement test J = J0 + J1 has the asymptotic null distribution F, and has high power uniformly on the set Θs ∪ Θ(J1): as T, N → ∞,
infθ∈Θs∪Θ(J1) P(J > Fq | θ) → 1.
The three required conditions for J1 are easy to understand: conditions (i) and (ii) are, respectively, the size and power conditions for J1. Condition (iii) requires that J1 be dominated by J0 under Θs. This condition is not restrictive since J1 is typically standardized (e.g., Donald et al. (2003)).
Theorem 3.2 also shows that J1 and J have the critical regions {D : J1 > Fq} and {D : J > Fq} respectively, but the high power region is enhanced from Θ(J1) to Θs ∪ Θ(J1). In high-dimensional testing problems, Θs ∪ Θ(J1) can be much larger than Θ(J1). As a result, the power of J1 can be substantially enhanced after J0 is added.
3.2 Power enhancement for quadratic tests
As an example of J1, we consider the widely used quadratic test statistic, which is asymptotically pivotal:
JQ = (θ̂′Vθ̂ − μN,T)/ξN,T,
where μN,T and ξN,T are deterministic sequences that satisfy μN,T → 0 and ξN,T → ξ ∈ (0, ∞). The weight matrix V is positive definite, with eigenvalues bounded away from both zero and infinity. Here TV is often taken to be the inverse of the asymptotic covariance matrix of θ̂. Other popular choices are diagonal weight matrices, which yield the sum of squared marginal t-statistics (Bai and Saranadasa, 1996; Chen and Qin, 2010; Pesaran and Yamagata, 2012), and V = IN, the N × N identity matrix. We set J1 = JQ, whose power enhancement version is J = J0 + JQ. For the moment, we shall assume V to be known, and focus on the power enhancement properties. We will deal with unknown V when testing the factor pricing model in the next section.
Assumption 3.2
There is a non-degenerate distribution F so that under H0, JQ →d F.
The critical value Fq = O(1) and the critical region of JQ is {D : JQ > Fq},
V is positive definite, and there exist two positive constants C1 and C2 such that C1 ≤ λmin(V) ≤ λmax(V) ≤ C2.
C3 ≤ Tvj ≤ C4, j = 1, …, N for positive constants C3 and C4.
Analyzing the power properties of JQ and applying Theorem 3.2, we obtain the following theorem. Recall that δN,T and Θs are defined by (3.1) and (3.2).
Theorem 3.3
Under Assumptions 3.1–3.2, the power enhancement test J = J0 +JQ satisfies: as T, N → ∞,
under the null hypothesis H0 : θ = 0, J →d F,
- there is C > 0 so that J has high power uniformly on the set
that is, for any q ∈ (0,1).
3.3 Low power of quadratic statistics under sparse alternatives
When JQ is used on its own, it can suffer from low power under sparse alternatives if N grows much faster than the sample size, even though it has been commonly used in the econometric literature. The main reason is that JQ aggregates high-dimensional estimation errors under H0, which become large with a non-negligible probability and can override the sparse signals under the alternative. The following result makes this intuition precise.
To simplify our discussion, we shall focus on the Wald test with TV being the inverse of the asymptotic covariance matrix of θ̂, assumed to exist. Specifically, we assume the standardized estimator to be asymptotically normal under H0:
√T V1/2 θ̂ →d N(0, IN).   (3.3)
This is one of the most commonly seen cases in various testing problems. The diagonal entries of V−1/T are given by {vj}j≤N.
Theorem 3.4
Suppose that (3.3) holds with ‖V‖1 < C and ‖V−1‖1 < C for some C > 0. Under Assumptions 3.1–3.2, and log N = o(T1−γ) for some 0 < γ < 1, the quadratic test JQ has low power at the sparse alternative Θc for any c > 0, given by
Θc = {θ ∈ Θ : Σj≤N 1{θj ≠ 0} ≤ c√N/T,  ‖θ‖max = O(1)}.
In other words, for every c > 0, every θ ∈ Θc and any significance level q,
limsupT,N→∞ P(JQ > zq | θ) ≤ q,
where zq is the qth upper quantile of the standard normal distribution.
In the above theorem, the alternative is a sparse vector. However, using the quadratic test alone, the asymptotic power of the test is as low as q. This is because the signals in the sparse alternative are dominated by the aggregated high-dimensional estimation errors, whose fluctuations are of order √N/T. In contrast, the non-vanishing components of θ (fixed constants) are detectable by J0. The power enhancement test J0 + JQ takes this into account, and has substantially improved power.
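To illustrate this phenomenon numerically (an idealized Gaussian design for illustration only, not the paper’s simulation study), the sketch below draws θ̂ ~ N(θ, IN/T) with two nonzero components and compares the standardized quadratic test with its power-enhanced version:

```python
import numpy as np

rng = np.random.default_rng(1)
N, T, reps, z_q = 5000, 300, 500, 1.645            # z_q: upper 5% normal quantile
delta_NT = np.log(np.log(T)) * np.sqrt(np.log(N))

theta = np.zeros(N)
theta[:2] = 0.4                                    # sparse alternative: two nonzero components

rej_quadratic, rej_enhanced = 0, 0
for _ in range(reps):
    theta_hat = theta + rng.standard_normal(N) / np.sqrt(T)   # idealized estimator, v_j = 1/T
    v_hat = np.full(N, 1.0 / T)
    JQ = (T * np.sum(theta_hat ** 2) - N) / np.sqrt(2 * N)    # standardized quadratic statistic
    S_hat = np.abs(theta_hat) / np.sqrt(v_hat) > delta_NT
    J0 = np.sqrt(N) * np.sum(theta_hat[S_hat] ** 2 / v_hat[S_hat])
    rej_quadratic += JQ > z_q
    rej_enhanced += (J0 + JQ) > z_q

print("power of J_Q alone:", rej_quadratic / reps)
print("power of J_0 + J_Q:", rej_enhanced / reps)
```

The two nonzero components lie well above the screening threshold, so J0 diverges and the enhanced test rejects essentially always, while the quadratic statistic alone has much lower power because the signal is diluted by the √(2N) standardization.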
4 Application: Testing Factor Pricing Models
4.1 The model
The multi-factor pricing model, motivated by the Arbitrage Pricing Theory (APT) of Ross (1976), is one of the most fundamental results in finance. It postulates how financial returns are related to market risks, and has many important practical applications. Let yit be the excess return of the i-th asset at time t and ft = (f1t, …, fKt)′ be the observable factors. Then, the excess return has the following decomposition:
yit = θi + bi′ft + uit,   i = 1, …, N,  t = 1, …, T,
where bi = (bi1, …, biK)′ is a vector of factor loadings and uit represents the idiosyncratic error. To keep the notation consistent, we use θ to represent the commonly used “alpha” in the finance literature.
While the APT does not necessarily deliver an observable factor model, the specification of an observable factor structure is of considerable interest and is often the case in empirical analyses. The key implication from the multi-factor pricing theory for tradable factors is that the intercept θi should be zero for any asset i. Such an exact feature of factor pricing can also be motivated from Connor (1984), who presented a competitive equilibrium version of the APT. An important question is then testing the null hypothesis
H0 : θ = 0,   (4.1)
namely, whether the factor pricing model is consistent with empirical data, where θ = (θ1, …, θN)′ is the vector of intercepts for all N financial assets. One typically picks five-year monthly data, because the factor pricing model is technically a one-period model whose factor loadings can be time-varying; see Gagliardini et al. (2011) on how to model the time-varying effects using firm characteristics and market variables. As the theory of the factor pricing model applies to all tradable assets, rather than a handful of selected portfolios, the number of assets N should be much larger than T. This also ameliorates the selection biases in the construction of testing portfolios. On the other hand, our empirical study on the S&P 500 index provides empirical evidence of sparse alternatives: there are only a few significantly nonzero components of θ, corresponding to a small portion of mis-priced stocks rather than systematic mispricing of the whole market.
Most existing tests of the problem (4.1) are based on the quadratic statistic W = θ̂′Vθ̂, where θ̂ is the OLS estimator of θ and V is some positive definite matrix. Prominent examples are given by Gibbons et al. (1989), MacKinlay and Richardson (1991) and Beaulieu et al. (2007). When N is possibly much larger than T, Pesaran and Yamagata (2012) showed that, under regularity conditions (Assumption 4.1 below),
(T af,T θ̂′Σu−1θ̂ − N)/√(2N) →d N(0, 1) under H0,
where af,T > 0, given in the next subsection, is a constant that depends only on the factors’ empirical moments, and Σu is the N × N covariance matrix of ut = (u1t, …, uNt)′, assumed to be time-invariant.
Recently, Gagliardini et al. (2011) proposed a novel approach to modeling and estimating time-varying risk premiums using two-pass least-squares method under asset pricing restrictions. Their problems and approaches differ substantially from ours, though both papers study similar problems in finance. As a part of their model validation, they develop test statistics against the asset pricing restrictions and weak risk factors. Their test statistics are based on a weighted sum of squared residuals of the cross-sectional regression, which, like all classical test statistics, have power only when there are many violations of the asset pricing restrictions. They do not consider the issue of enhancing the power under sparse alternatives, nor do they involve a Wald statistic that depends on a high-dimensional covariance matrix. In fact, their testing power can be enhanced by using our techniques.
4.2 Power enhancement component
We propose a new statistic that depends on (i) the power enhancement component J0, and (ii) a feasible Wald component based on a consistent estimator of the error covariance matrix Σu, which controls the size under the null even when N/T → ∞.
Define and . Also define
The OLS estimator of θ can be expressed as
(4.2) |
When cov(ft) is positive definite, under mild regularity conditions (Assumption 4.1 below), af,T consistently estimates af, and af > 0. In addition, without serial correlations, the conditional variance of √T(θ̂j − θj) (given {ft}) converges in probability to Tvj = σu,jj/af, which can be estimated by Tv̂j = σ̂u,jj/af,T based on the residuals of the OLS regression, where σ̂u,jj = T−1 Σt≤T ûjt².
We show in Proposition 4.1 below that maxj≤N |θ̂j − θj|/√v̂j = oP(δN,T). Therefore, δN,T slightly dominates the maximum estimation noise. The screening set and the power enhancement component are defined as
Ŝ = {j ≤ N : |θ̂j|/√v̂j > δN,T}  and  J0 = √N Σj∈Ŝ θ̂j²/v̂j.
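For concreteness, the following sketch (Python/NumPy; inputs hypothetical) computes the OLS alphas asset by asset and applies the screening step; the usual OLS standard error of the intercept is used as √v̂j here, which is only a stand-in for the exact normalization involving af,T described above:

```python
import numpy as np

def alpha_screening(Y, F, delta):
    """Screen for nonzero intercepts ("alphas") in time-series factor regressions.

    Y : (T x N) matrix of excess returns; F : (T x K) matrix of observed factors.
    Each column of Y is regressed on a constant and the factors; the intercept is
    standardized by its OLS standard error, playing the role of sqrt(v_hat_j).
    """
    T, N = Y.shape
    X = np.column_stack([np.ones(T), F])                  # add the intercept
    XtX_inv = np.linalg.inv(X.T @ X)
    B = XtX_inv @ X.T @ Y                                 # (K+1) x N coefficient matrix
    resid = Y - X @ B
    sigma2 = (resid ** 2).sum(axis=0) / (T - X.shape[1])  # residual variances
    theta_hat = B[0]                                      # estimated alphas
    v_hat = sigma2 * XtX_inv[0, 0]                        # variance of each intercept
    S_hat = np.flatnonzero(np.abs(theta_hat) / np.sqrt(v_hat) > delta)
    J0 = np.sqrt(N) * np.sum(theta_hat[S_hat] ** 2 / v_hat[S_hat])
    return theta_hat, v_hat, S_hat, J0
```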
4.3 Feasible Wald test in high dimensions
Assuming no serial correlation among {ut}t≤T and conditional homoskedasticity (Assumption 4.1 below), given the observed factors, the conditional covariance of θ̂ is Σu/(Taf,T). If the covariance matrix Σu of ut were known, the standardized Wald test statistic would be
Jwald = (T af,T θ̂′Σu−1θ̂ − N)/√(2N).   (4.3)
Under H0 : θ = 0, it converges in distribution to N(0, 1).
Note that factor models are often only justified as being “approximate” (Chamberlain and Rothschild (1983)), where Σu is a non-diagonal covariance matrix of cross-sectionally correlated idiosyncratic errors (u1t, …, uNt). When N/T → ∞, it is difficult to consistently estimate Σu−1, as there are O(N2) off-diagonal parameters. Without parametrizing the off-diagonal elements, we assume Σu = cov(ut) to be a sparse matrix. This assumption is natural for large covariance estimation in factor models, and was previously considered by Fan et al. (2011). Since the common factors already capture the co-movement across the whole panel, a particular asset’s idiosyncratic shock is usually correlated significantly with only a few other assets. For example, some shocks only exert influences on a particular industry, but are not pervasive for the whole economy (Connor and Korajczyk, 1993). Recently, Gagliardini et al. (2011) also obtained a feasible test by using a thresholding technique, similar to the one introduced below, to estimate the asymptotic covariance matrix. They showed that the sparsity approach for estimating covariance matrices covers the block-diagonal case, which is expected to be present in the factor modeling of stocks (as elaborated in Gagliardini et al. (2011), a typical example of the sparsity of Σu is the presence of remaining industry sector effects).
Following the approach of Bickel and Levina (2008), we can consistently estimate Σu via thresholding. Let sij denote the (i, j)th entry of the sample covariance matrix of the OLS residuals {ût}t≤T. Define the covariance estimator Σ̂u as
(Σ̂u)ij = sii if i = j, and (Σ̂u)ij = hij(sij) if i ≠ j,
where hij(·) is a generalized thresholding function (Antoniadis and Fan, 2001; Rothman et al., 2009), with threshold value τij = C√(sii sjj) ωT, ωT = √(log N / T), for some constant C > 0, designed to keep only the sample correlations whose magnitude exceeds CωT. The hard-thresholding function, for example, is hij(x) = x·1{|x| > τij}, and many other thresholding functions such as soft-thresholding and SCAD (Fan and Li, 2001) are specific examples. In general, hij(·) should satisfy:
hij(z) = 0 if |z| < τij;
|hij(z) − z| ≤ τij;
there are constants a > 0 and b > 1 such that |hij(z) − z| ≤ a τij² if |z| > b τij.
The thresholded covariance matrix estimator Σ̂u sets most of the off-diagonal estimation noise in (sij) to zero. As studied in Fan et al. (2013), the constant C in the threshold can be chosen in a data-driven manner so that Σ̂u is strictly positive definite in finite samples even when N > T.
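A minimal sketch of the adaptive thresholding step (Python/NumPy; soft-thresholding is used, and the constant C is a tuning input rather than the paper’s data-driven choice):

```python
import numpy as np

def soft_threshold_cov(U, C=0.5):
    """Adaptive soft-thresholding estimator of a sparse error covariance matrix.

    U : (T x N) matrix of estimated idiosyncratic errors u_hat_it.
    C : thresholding constant; in practice it should be tuned so that the
        resulting estimator is positive definite.
    """
    T, N = U.shape
    S = np.cov(U, rowvar=False, bias=True)                # sample covariance of residuals
    d = np.sqrt(np.outer(np.diag(S), np.diag(S)))         # sqrt(s_ii * s_jj)
    tau = C * np.sqrt(np.log(N) / T) * d                  # entry-wise threshold tau_ij
    # soft-threshold every entry, then restore the (unthresholded) diagonal
    shrunk = np.sign(S) * np.maximum(np.abs(S) - tau, 0.0)
    return shrunk - np.diag(np.diag(shrunk)) + np.diag(np.diag(S))
```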
With Σ̂u, we are ready to define the feasible standardized Wald statistic:
Jwald = (T af,T θ̂′Σ̂u−1θ̂ − N)/√(2N),
whose power can be enhanced under sparse alternatives by
J = J0 + Jwald.
Remark 4.1
The thresholding approach described here can be modified to take advantage of the block structure of Σu. The covariance matrix can first be divided into blocks according to industries, and then estimated block-by-block. The estimation procedure and theoretical analysis would be similar to the block-thresholding of Cai and Yuan (2012).
4.4 Does the thresholded covariance estimator affect the size?
A natural but technical question to address is this: when Σu indeed admits a sparse structure, is the thresholded estimator accurate enough that the feasible Jwald is still asymptotically normal? The answer is affirmative if N(log N)4 = o(T2), and we can still allow N/T → ∞. However, this seemingly simple question is far more technically involved than anticipated, as we now explain.
When Σu is a sparse matrix, under regularity conditions (Assumption 4.2 below), Fan et al. (2011) showed that
‖Σ̂u − Σu‖2 = OP(mN √(log N / T)) = ‖Σ̂u−1 − Σu−1‖2,   (4.4)
where mN denotes the maximum number of nonzero entries per row of Σu (formally defined in Section 4.5).
This convergence rate is minimax optimal for sparse covariance estimation, by the lower bound derived in Cai et al. (2010). On the other hand, when replacing Σu−1 in (4.3) by Σ̂u−1, one needs to show that the effect of such a replacement is asymptotically negligible, namely, under H0,
(T af,T/√(2N)) θ̂′(Σ̂u−1 − Σu−1)θ̂ = oP(1).   (4.5)
However, when θ = 0, it can be shown that ‖θ̂‖² = OP(N/T). Using this and (4.4), by the Cauchy–Schwarz inequality, we have
(T af,T/√(2N)) |θ̂′(Σ̂u−1 − Σu−1)θ̂| ≤ (T af,T/√(2N)) ‖θ̂‖² ‖Σ̂u−1 − Σu−1‖2 = OP(mN √(N log N / T)).
Thus it requires N log N = o(T) for this bound to vanish, which is essentially a low-dimensional scenario.
The above simple derivation uses, however, a Cauchy–Schwarz bound, which is too crude for large N. In fact, θ̂′(Σ̂u−1 − Σu−1)θ̂ is a weighted estimation error of Σ̂u−1, where the weights “average down” the accumulated errors in estimating the elements of Σu−1, resulting in an improved rate of convergence. The formalization of this argument requires further regularity conditions and novel technical arguments. These are formally presented in the following subsection.
4.5 Regularity conditions
We are now ready to present the regularity conditions. These conditions are imposed for three technical purposes: (i) achieving the uniform convergence for θ̂ and v̂ as required in Assumption 3.1, (ii) defining the sparsity of Σu so that Σ̂u is consistent, and (iii) showing (4.5), so that the errors from estimating Σu−1 do not affect the size of the test.
Let ℱ−∞0 and ℱT∞ denote the σ-algebras generated by {ft : −∞ ≤ t ≤ 0} and {ft : T ≤ t ≤ ∞} respectively. In addition, define the α-mixing coefficient
α(T) = sup{|P(A)P(B) − P(A ∩ B)| : A ∈ ℱ−∞0, B ∈ ℱT∞}.
Assumption 4.1
{ut}t≤T is i.i.d. N(0, Σu), where both ‖Σu‖1 and ‖Σu−1‖1 are bounded;
- {ft}t≤T is strictly stationary, independent of {ut}t≤T, and there are r1, b1 > 0 so that P(‖ft‖ > s) ≤ exp(−(s/b1)r1) for any s > 0;
- There exist r2 > 0 and C > 0 such that, for all T ∈ ℤ+, α(T) ≤ exp(−CTr2);
cov(ft) is positive definite, and maxi≤N ‖bi‖ < c1 for some c1 > 0.
Some remarks are in order for the conditions in Assumption 4.1.
Remark 4.2
The above assumption, though perhaps somewhat restrictive, substantially facilitates our technical analysis. Here ut is required to be serially uncorrelated across t. Under this condition, the conditional covariance of θ̂, given the factors, has the simple expression Σu/(Taf,T). On the other hand, if serial correlations were present in ut, there would be additional autocovariance terms in the covariance matrix, which would need to be estimated via further regularization. Moreover, given that Σu is a sparse matrix, the Gaussianity ensures that most of the idiosyncratic errors are cross-sectionally independent, so that E(uitl ujt) = E(uitl)E(ujt), l = 1, 2, for most of the pairs in {(i, j) : i ≠ j}.
Note that we do allow the factors {ft}t≤T to be weakly dependent across t, provided that they satisfy the strong mixing condition in Assumption 4.1(iii).
Remark 4.3
Conditional homoskedasticity is assumed, as granted by condition (ii). We admit that handling conditional heteroskedasticity, while important in empirical applications, is technically very challenging in our context. Allowing the high-dimensional covariance matrix to be time-varying is possible with a suitable continuum of sparsity conditions on the time domain. In that case, one can require the sparsity condition to hold uniformly across t and continuously apply thresholding. However, unlike in the traditional case, estimating the family of large inverse covariance matrices uniformly over t is technically highly challenging. As we shall see in the proof of Proposition 4.2, even in the homoskedastic case, proving the effect of estimating Σu−1 to be first-order negligible when N/T → ∞ requires delicate technical analysis.
To characterize the sparsity of Σu in our context, define
mN = maxi≤N Σj≤N 1{σu,ij ≠ 0},   DN = Σi≠j 1{σu,ij ≠ 0}.
Here mN represents the maximum number of nonzeros in each row, and DN represents the total number of nonzero off-diagonal entries. Formally, we assume:
Assumption 4.2
Suppose N1/2(log N)γ = o(T) for some γ > 2, and
(i) min{|σu,ij| : σu,ij ≠ 0, i ≠ j} ≫ √(log N / T);
(ii) at least one of the following cases holds:
(a) DN = O(N1/2), and mN may grow with N;
(b) DN = O(N), and mN = O(1).
As specified in Assumption 4.2, we consider two kinds of sparse matrices, and develop our results for both cases. In the first case (Assumption 4.2(ii)(a)), Σu is required to have no more than O(N1/2) nonzero off-diagonal entries, but mN is allowed to diverge. A typical example of this case is that only a small portion (e.g., finitely many) of firms have individual shocks (uit) that are correlated with many other firms’. In the second case (Assumption 4.2(ii)(b)), mN must be bounded, but Σu can have O(N) nonzero off-diagonal entries. This allows block-diagonal matrices with blocks of finite size, or banded matrices with a finite number of bands. This case typically arises when firms’ individual shocks are correlated only within industries but not across industries.
Moreover, we require N1/2(log N)γ = o(T), which is the price to pay for estimating a large error covariance matrix; still, we allow N/T → ∞. It is also required that the minimal signal of the nonzero components be larger than the noise level (Assumption 4.2(i)), so that the nonzero components are not thresholded off when estimating Σu.
4.6 Asymptotic properties
The following result verifies the uniform convergence required in Assumption 3.1 over the entire parameter space, which contains both the null and alternative hypotheses. Recall that the OLS estimator θ̂ is defined in (4.2) and its estimated asymptotic variance v̂j in Section 4.2.
Proposition 4.1
Suppose the distribution of (ft, ut) is independent of θ. Under Assumption 4.1, for δN,T = log(log T)√(log N), Assumption 3.1 holds as T, N → ∞.
Proposition 4.2
Under Assumptions 4.1 and 4.2, and under H0, the feasible statistic satisfies Jwald − (T af,T θ̂′Σu−1θ̂ − N)/√(2N) →p 0, and hence Jwald →d N(0, 1).
As shown, the effect of replacing Σu−1 by its thresholded estimator is asymptotically negligible, and the size of the standardized Wald statistic can be well controlled.
We are now ready to apply Theorem 3.3 to obtain the asymptotic properties of J = J0 + Jwald. For δN,T = log(log T)√(log N), let Θs be as defined in (3.2).
Theorem 4.1
Suppose the assumptions of Propositions 4.1 and 4.2 hold.
- Under the null hypothesis H0 : θ = 0, as T, N → ∞, J = J0 + Jwald →d N(0, 1), and hence
limT,N→∞ P(J > zq | H0) = q.
- There is C > 0 in the definition of Θ(Jwald) so that for any q ∈ (0, 1), as T, N → ∞, the test has high power uniformly on Θs ∪ Θ(Jwald), and hence
infθ∈Θs∪Θ(Jwald) P(J > zq | θ) → 1,
where zq denotes the 1 − q quantile of the standard normal distribution.
We see that the power is substantially enhanced after J0 is added, as the region where the test has power is enlarged from Θ(Jwald) to Θs ∪ Θ(Jwald).
5 Application: Testing Cross-Sectional Independence
5.1 The model
Consider a mixed effect panel data model
yit = α + xit′β + μi + uit,   i = 1, …, n,  t = 1, …, T,
where the idiosyncratic error uit is assumed to be Gaussian. The regressor xit may be correlated with the individual random effect μi, but is uncorrelated with uit. Let ρij denote the correlation between uit and ujt, assumed to be time invariant. The goal is to test the following hypothesis:
H0 : ρij = 0  for all i ≠ j,
that is, whether cross-sectional dependence is present. It is well known that cross-sectional dependence leads to efficiency loss for OLS, and sometimes it may even cause inconsistency (Andrews, 2005). Thus testing H0 is an important problem in applied panel data models. If we let N = n(n − 1)/2 and θ = (ρ12, …, ρ1n, ρ23, …, ρ2n, …, ρn−1,n)′ be the N × 1 vector stacking all the mutual correlations, then the problem is equivalent to testing the high-dimensional hypothesis H0 : θ = 0. Note that cross-sectional dependence is often only weakly present, so the alternative hypothesis of interest is often a sparse vector θ, corresponding to a sparse covariance matrix Σu of uit.
Most of the existing tests are based on the quadratic statistic W = Σi<j ρ̂ij², where ρ̂ij is the sample correlation between uit and ujt, estimated from the within-OLS residuals (Baltagi, 2008). Pesaran et al. (2008) and Baltagi et al. (2012) studied rescaled versions of W, and showed that after a proper standardization, the rescaled W is asymptotically normal when both n, T → ∞. However, the quadratic test suffers from low power if Σu is a sparse matrix. In particular, as shown in Theorem 3.4, when n/T → ∞, the quadratic test cannot detect sparse alternatives with Σi<j 1{ρij ≠ 0} = o(n/T), which can be restrictive. Such a sparse structure is present, for instance, when Σu is a block-diagonal sparse matrix with finite block sizes.
5.2 Power enhancement test
Following the conventional notation of panel data models, let ỹit = yit − ȳi, x̃it = xit − x̄i and ũit = uit − ūi, where ȳi, x̄i and ūi denote the individual averages over t. Then ỹit = x̃it′β + ũit. The within-OLS estimator β̂ is obtained by regressing ỹit on x̃it for all i and t, which leads to the estimated residuals ûit = ỹit − x̃it′β̂. Then ρij is estimated by
ρ̂ij = Σt≤T ûit ûjt / [(Σt≤T ûit²)1/2 (Σt≤T ûjt²)1/2].
For the within-OLS, the asymptotic variance of ρ̂ij is (1 − ρij²)²/T, and it is estimated by v̂ij = (1 − ρ̂ij²)²/T. Therefore, the screening statistic for the power enhancement test is defined as
J0 = √N Σ(i,j)∈Ŝ ρ̂ij²/v̂ij,   Ŝ = {(i, j) : i < j ≤ n, |ρ̂ij|/√v̂ij > δN,T},   (5.1)
where δN,T = log(log T)√(log N) as before. The set Ŝ screens out most of the estimation errors.
To control the size, we employ Baltagi et al. (2012)’s bias-corrected quadratic statistic:
J1 = √(1/(n(n − 1))) Σi<j [(T − k)ρ̂ij² − 1] − n/(2(T − 1)),   (5.2)
where k = dim(xit). Under regularity conditions (Assumptions 5.1 and 5.2 below), J1 →d N(0, 1) under H0. Then the power enhancement test can be constructed as J = J0 + J1. The power is substantially enhanced to cover the region
Θs = {θ ∈ Θ : maxi<j |ρij| > C log(log T)√(log n / T)}  for some constant C > 0,   (5.3)
in addition to the region detectable by J1 itself. As a by-product, the test also identifies the pairs (i, j) with ρij ≠ 0 through Ŝ. Empirically, this set helps us better understand the underlying pattern of the cross-sectional correlations and, subsequently, their cause.
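A sketch of the screening component for this problem (Python/NumPy; for simplicity the residual correlations are standardized by √(1/T), their null asymptotic standard deviation, instead of the estimated variance v̂ij used above):

```python
import numpy as np

def cross_sectional_screening(U_hat, delta):
    """Screening set and statistic J0 built from pairwise residual correlations.

    U_hat : (T x n) matrix of within-OLS residuals u_hat_it.
    """
    T, n = U_hat.shape
    R = np.corrcoef(U_hat, rowvar=False)                  # n x n residual correlation matrix
    iu = np.triu_indices(n, k=1)
    rho_hat = R[iu]                                       # stacked rho_hat_ij, i < j
    N = rho_hat.size                                      # N = n(n-1)/2
    v_hat = np.full(N, 1.0 / T)                           # null asymptotic variance of a correlation
    keep = np.flatnonzero(np.abs(rho_hat) / np.sqrt(v_hat) > delta)
    J0 = np.sqrt(N) * np.sum(rho_hat[keep] ** 2 / v_hat[keep])
    pairs = list(zip(iu[0][keep].tolist(), iu[1][keep].tolist()))   # selected (i, j) pairs
    return J0, pairs
```

The returned pairs are exactly the entries of Ŝ, which indicate where the cross-sectional correlation appears to be nonzero.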
5.3 Asymptotic properties
In order for the power to be uniformly enhanced, the parameter space Θ of θ = (ρ12, …, ρ1n, ρ23, …, ρ2n, …, ρn−1,n)′ is required to be element-wise bounded away from ±1: there is ρmax ∈ (0, 1) so that
Θ = {θ : maxi<j |ρij| ≤ ρmax}.
The following regularity conditions are imposed. They hold uniformly in θ ∈ Θ.
Assumption 5.1
There are C1, C2 > 0, so that
,
,
Condition (i) is needed for the within-OLS estimator to be √(nT)-consistent (see, e.g., Baltagi (2008)). It is usually satisfied by weak cross-sectional correlations (sparse alternatives) among the error terms, or by weak dependence among the regressors. We require the second moment of ujt to be bounded away from zero uniformly in j ≤ n and θ ∈ Θ, so that the cross-sectional correlations can be estimated stably.
The following conditions are assumed in Baltagi et al. (2012), which are needed for the asymptotic normality of J1 under H0.
Assumption 5.2
(u1t, …, unt)′, t ≤ T, are i.i.d. N(0, Σu), and maxi≤n,t≤T ‖xit‖ ≤ C almost surely.
With probability approaching one, all the eigenvalues of T−1 Σt≤T x̃jt x̃jt′ are bounded away from both zero and infinity, uniformly for j ≤ n.
Proposition 5.1
Under Assumptions 5.1 and 5.2, for δN,T = log(log T)√(log N) and N = n(n − 1)/2, Assumption 3.1 holds as T, N → ∞.
Define Θ(J1) as the region of alternatives over which the quadratic statistic J1 has uniformly high power; its definition involves a constant C > 0 (see Theorem 5.1 below). For J1 defined in (5.2), let
PE = J0 + J1.   (5.4)
The main result is presented as follows.
Theorem 5.1
Suppose Assumptions 5.1, 5.2 hold. As T, N → ∞,
- under the null hypothesis H0 : θ = 0, PE →d N(0, 1), and hence
limT,N→∞ P(PE > zq | H0) = q;
- there is C > 0 in the definition of Θ(J1) so that for any q ∈ (0, 1), the test has high power uniformly on Θs ∪ Θ(J1), and hence
infθ∈Θs∪Θ(J1) P(PE > zq | θ) → 1.
Note that the high power region is enhanced from Θ(J1) to Θs ∪ Θ(J1), uniformly over sparse alternatives. In particular, the required signal strength in Θs of (5.3) is mild: the maximum cross-sectional correlation is only required to exceed a magnitude of order log(log T)√(log n / T).
6 Numerical studies
In this section, Monte Carlo simulations are employed to examine the finite sample performance of the power enhancement tests. We also present empirical evidence of sparse alternatives in the factor pricing model using real data.
6.1 Testing factor pricing models
To mimic the real data application, we consider the Fama and French (1992) three-factor model:
yit = θi + bi′ft + uit,   i = 1, …, N,  t = 1, …, T.
We simulate {bi}i≤N, {ft}t≤T and {ut}t≤T independently from N3(μB, ΣB), N3(μf, Σf), and N(0, Σu) respectively. The parameters are set to be the same as those in the simulations of Fan et al. (2013), which are calibrated using the daily returns of the S&P 500’s top 100 constituents over the period from July 1st, 2008 to June 29th, 2012. These parameters are listed in Table I.
Set Σu = diag{A1, …, AN/4} to be a block-diagonal correlation matrix. Each diagonal block Aj is a 4 × 4 positive definite matrix, whose correlation matrix has equi-off-diagonal entry ρj, generated from Uniform[0, 0.5].
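A sketch of this covariance design (Python/NumPy; unit variances are used within each block for simplicity, and N is taken to be a multiple of 4):

```python
import numpy as np

def block_diag_sigma_u(N, rng):
    """Block-diagonal Sigma_u: 4 x 4 blocks with equi-correlation rho_j ~ Uniform[0, 0.5]."""
    Sigma = np.zeros((N, N))
    for start in range(0, N, 4):
        rho = rng.uniform(0.0, 0.5)                 # block-specific equi-correlation
        block = np.full((4, 4), rho)
        np.fill_diagonal(block, 1.0)
        Sigma[start:start + 4, start:start + 4] = block
    return Sigma

rng = np.random.default_rng(0)
Sigma_u = block_diag_sigma_u(500, rng)              # e.g., N = 500
```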
We evaluate the powers of our tests under two specific alternatives, a sparse alternative Ha1 and a weak, dense alternative Ha2 (we set N > T).
Under Ha1, there are only a few nonzero θ’s with a relatively large magnitude. Under Ha2, there are many non-vanishing θ’s, but their magnitudes are all relatively small; in our simulation setup, they vary from 0.05 to 0.10. We therefore expect that under Ha1, P(Ŝ = ∅) is close to zero, as most of the first N/T estimated θ’s should survive the screening step. These surviving θ̂j’s contribute importantly to the rejection of the null hypothesis. In contrast, P(Ŝ = ∅) should be much larger under Ha2, because the non-vanishing θ’s are too weak to be detected.
Four testing methods are conducted and compared: the standardized Wald test Jwald, the thresholding test Jthr as in Fan (1996), and their power enhancement versions J0 + Jwald and J0 + Jthr. In particular, the thresholding test Jthr is defined as a standardized version of
Σj≤N (θ̂j²/v̂j) 1{|θ̂j|/√v̂j > tN},
where tN = √(2 log(Na)), a = (log N)−2. Here the threshold value tN is chosen smaller than our δN,T, which results in a non-degenerate null distribution of Jthr. When Σu is diagonal, its asymptotic null distribution is N(0, 1), but when Σu is non-diagonal, it can suffer from substantial size distortions (see Fan (1996) for detailed discussions). For each test, we calculate the relative frequency of rejection under H0, Ha1 and Ha2, based on 2000 replications, with significance level q = 0.05. We also calculate the relative frequency of Ŝ being empty, which approximates P(Ŝ = ∅). We use soft-thresholding to estimate the error covariance matrix.
Table II presents the empirical size and power of each testing method. Numerical findings are summarized as follows.
Table II.
| T | N | Jwald | Jthr | J0 + Jwald | J0 + Jthr | P(Ŝ = ∅) |
|---|---|---|---|---|---|---|
| H0 | | | | | | |
| 300 | 500 | 5.0% | 7.2% | 5.2% | 7.3% | 99.9% |
| 300 | 800 | 5.4% | 7.7% | 5.6% | 7.9% | 99.8% |
| 300 | 1000 | 4.7% | 7.8% | 4.9% | 8.0% | 99.7% |
| 300 | 1200 | 4.7% | 6.6% | 4.8% | 6.7% | 99.8% |
| 500 | 500 | 5.2% | 6.4% | 5.3% | 6.5% | 99.9% |
| 500 | 800 | 5.0% | 5.8% | 5.1% | 5.9% | 99.9% |
| 500 | 1000 | 5.4% | 6.6% | 5.5% | 6.6% | 100.0% |
| 500 | 1200 | 4.5% | 7.4% | 4.6% | 7.5% | 99.8% |
| Ha1 (sparse alternative) | | | | | | |
| 300 | 500 | 48.3% | 72.2% | 93.7% | 94.5% | 8.0% |
| 300 | 800 | 58.4% | 87.9% | 97.8% | 98.8% | 3.2% |
| 300 | 1000 | 53.6% | 87.0% | 96.4% | 98.1% | 6.3% |
| 300 | 1200 | 66.4% | 94.3% | 97.9% | 98.8% | 3.4% |
| 500 | 500 | 37.9% | 54.4% | 96.3% | 96.7% | 4.0% |
| 500 | 800 | 68.3% | 91.6% | 99.9% | 99.9% | 0.1% |
| 500 | 1000 | 63.3% | 89.8% | 99.8% | 99.8% | 0.2% |
| 500 | 1200 | 55.1% | 88.0% | 99.7% | 99.6% | 0.6% |
| Ha2 (weak alternative) | | | | | | |
| 300 | 500 | 68.6% | 83.2% | 72.7% | 84.9% | 80.0% |
| 300 | 800 | 69.0% | 86.4% | 72.1% | 87.5% | 80.9% |
| 300 | 1000 | 74.5% | 91.7% | 77.7% | 92.1% | 78.9% |
| 300 | 1200 | 74.9% | 93.6% | 78.5% | 94.3% | 79.6% |
| 500 | 500 | 70.6% | 81.9% | 72.8% | 82.8% | 89.0% |
| 500 | 800 | 71.4% | 86.6% | 73.4% | 87.1% | 88.3% |
| 500 | 1000 | 72.2% | 89.7% | 73.5% | 90.0% | 89.6% |
| 500 | 1200 | 75.9% | 92.3% | 77.5% | 92.5% | 87.8% |

Note: This table reports the frequencies (in percentage) of rejection and of Ŝ = ∅, based on 2000 replications. These tests are conducted at the 5% significance level.
The sizes of both Jwald and J0 + Jwald are close to the significance level. In contrast, the thresholding tests (Jthr and J0 + Jthr) exhibit significant size distortions. Furthermore, adding J0 results in an increase of only 0.1–0.2% in size.
Under H0, P(Ŝ = ∅) is close to one, indicating that the power enhancement component J0 screens off most of the estimation errors. Under Ha1, P(Ŝ = ∅) is less than 10%, because the screening procedure manages to capture the large θ’s. Under Ha2, as the non-vanishing θ’s are very weak, Ŝ has a large chance of being empty.
Under Ha1, the power of the thresholding test is much higher than that of the Wald test, as the Wald test accumulates too many estimation errors. Moreover, the power is significantly enhanced after J0 is added.
Finally, under Ha2, the power enhancement is not substantial because the nonzero θ’s are very weak, and the thresholding test has higher power than J0 + Jwald does. The power of the thresholding test can still be further enhanced by adding J0, with little increase in false rejections. Note that in this case Ŝ still has a more than 10% chance of being nonempty; whenever it is nonempty, adding J0 potentially enhances the power of the test.
6.2 Testing cross-sectional independence
We use the following data generating process in our experiments:
yit = α + β xit + μi + uit,   xit = ξ xi,t−1 + ηit.   (6.1)
Note that we model the {xit}’s as AR(1) processes, so that xit is possibly correlated with μi, but not with uit, as was the case in Im et al. (1999). For each i, we initialize xit = 0.5 at t = 1. We specify the parameters as follows: μi is drawn from a normal distribution, independently across i = 1, …, n; the parameters α and β are set to −1 and 2 respectively; and in regression (6.1), ξ = 0.7 and the innovations ηit are i.i.d. normal.
We generate (u1t, …, unt)′ from N(0, Σu). Under the null hypothesis, Σu is set to be a diagonal matrix diag{σ1², …, σn²}. Following Baltagi et al. (2012), we consider the heteroscedastic errors
σi² = σ²(1 + κ x̄i)²,   (6.2)
with κ = 0.5, where x̄i is the average of xit across t. Here σ² is scaled to fix the average of the σi²’s at one.
For the alternative specifications, we use a spatial model for the errors uit. Baltagi et al. (2012) considered a tri-diagonal error covariance matrix in this case. We extend it by allowing for higher-order spatial autocorrelations, but require that not all the errors be spatially correlated with their immediate neighbors. Specifically, we start with Σu,1 = diag{Σ1, …, Σn/4}, a block-diagonal matrix with 4 × 4 blocks located along the main diagonal. Each Σi is initially set to I4. We then randomly choose ⌊n0.3⌋ blocks among them and make them non-diagonal by setting Σi(m, l) = ρ|m−l| (m, l ≤ 4), with ρ = 0.2. To allow for cross-sectional heteroscedasticity in the errors, we set Σu = D1/2 Σu,1 D1/2, where D = diag{σ1², …, σn²} with σi² as specified in (6.2).
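A sketch of this construction of Σu under the alternative (Python/NumPy; n is assumed to be a multiple of 4, and sigma2 is the vector of heteroscedastic variances from (6.2)):

```python
import numpy as np

def spatial_sigma_u(n, sigma2, rho=0.2, rng=None):
    """Build Sigma_u = D^{1/2} Sigma_{u,1} D^{1/2} with a few non-diagonal 4 x 4 blocks."""
    rng = np.random.default_rng() if rng is None else rng
    Sigma1 = np.eye(n)
    idx = np.arange(4)
    block = rho ** np.abs(idx[:, None] - idx[None, :])          # entries rho^{|m - l|}
    chosen = rng.choice(n // 4, size=int(np.floor(n ** 0.3)), replace=False)
    for b in chosen:                                            # make floor(n^0.3) blocks non-diagonal
        Sigma1[4 * b:4 * b + 4, 4 * b:4 * b + 4] = block
    sd = np.sqrt(np.asarray(sigma2, dtype=float))
    return Sigma1 * np.outer(sd, sd)                            # rescale by D^{1/2} on both sides
```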
The Monte Carlo experiments are conducted for different pairs of (n, T) with significance level q = 0.05, based on 2000 replications. The empirical size, the power, and the frequency of Ŝ = ∅ with Ŝ as in (5.1) are recorded.
Table III gives the size and power of the bias-corrected quadratic test J1 in (5.2) and those of the power enhanced test J0 + J1. The sizes of both tests are close to 5%. In particular, the power enhancement test has little distortion of the original size.
Table III.
| T | n = 200 | n = 400 | n = 600 | n = 800 |
|---|---|---|---|---|
| H0 | | | | |
| 100 | 4.7 / 5.5 / 99.1 | 4.9 / 5.3 / 99.6 | 5.5 / 5.7 / 99.7 | 4.9 / 5.2 / 99.7 |
| 200 | 5.3 / 5.3 / 100.0 | 5.5 / 5.9 / 99.6 | 4.7 / 5.1 / 99.4 | 4.9 / 5.1 / 99.8 |
| 300 | 5.2 / 5.2 / 100.0 | 5.2 / 5.2 / 100.0 | 4.6 / 4.6 / 100.0 | 4.9 / 4.9 / 100.0 |
| 500 | 4.7 / 4.7 / 100.0 | 5.5 / 5.5 / 100.0 | 5.0 / 5.0 / 100.0 | 5.1 / 5.1 / 100.0 |
| Ha | | | | |
| 100 | 26.4 / 95.5 / 5.0 | 19.8 / 98.0 / 2.3 | 13.5 / 98.2 / 2.0 | 12.2 / 99.2 / 0.9 |
| 200 | 54.6 / 98.8 / 1.6 | 40.3 / 99.6 / 0.5 | 24.8 / 99.6 / 0.4 | 21.0 / 99.7 / 0.3 |
| 300 | 78.9 / 99.2 / 1.1 | 65.3 / 100.0 / 0.1 | 41.7 / 99.9 / 0.2 | 37.2 / 100.0 / 0.1 |
| 500 | 93.5 / 99.8 / 0.2 | 89.0 / 100.0 / 0.0 | 69.1 / 100.0 / 0.0 | 61.8 / 100.0 / 0.0 |

Note: Each cell reports, in percentage, the rejection frequency of J1 in (5.2), the rejection frequency of PE = J0 + J1 in (5.4), and the frequency of Ŝ being empty, based on 2000 replications under the null and alternative hypotheses. These tests are conducted at the 5% significance level.
The bottom panel shows the power of the two tests under the alternative specification. The power enhancement test demonstrates almost full power under all combinations of (n, T). In contrast, the quadratic test J1 only gains power when T gets large. As n increases, the proportion of nonzero off-diagonal elements in Σu gradually decreases. It becomes harder for J1 to effectively detect those deviations from the null hypothesis. This explains the low power exhibited by the quadratic test when facing a high sparsity level.
6.3 Empirical evidence of sparse alternatives
We present empirical evidence of sparse alternatives based on a real data example. Consider Carhart (1997)’s four-factor model applied to the S&P 500 index. We collect monthly excess returns on all the S&P 500 constituents from the CRSP database for the period January 1980 to December 2012, and construct the screening set Ŝ on a rolling-window basis: for each month, we evaluate Ŝ using the preceding 60 months’ returns (T = 60). The panel at each month consists of stocks without missing observations in the past five years, which yields a balanced panel with a cross-sectional dimension larger than the time-series dimension (N > T). In this manner we not only capture the up-to-date information in the market, but also mitigate the impact of time-varying factor loadings and sampling biases. In particular, for testing months τ = 1984.12, …, 2012.12, we run the regressions
rit − rft = θi + βi,MKT MKTt + βi,SMB SMBt + βi,HML HMLt + βi,MOM MOMt + uit   (6.3)
for i = 1, …, Nτ and t = τ − 59, …, τ, where rit represents the return of stock i at month t, rft the risk-free rate, and MKT, SMB, HML and MOM constitute the market, size, value and momentum factors. The time series of factors are downloaded from Kenneth French’s website. To keep the notation consistent, we use θi to represent the “alpha” of stock i.
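A sketch of the rolling-window screening (Python/pandas; the DataFrames `excess` of monthly excess returns and `factors` with columns MKT, SMB, HML, MOM are hypothetical inputs, and the helper alpha_screening is the one sketched in Section 4.2):

```python
import numpy as np
import pandas as pd

def rolling_screening(excess: pd.DataFrame, factors: pd.DataFrame, window: int = 60):
    """For each month, screen for nonzero alphas using the preceding `window` months."""
    selected = {}
    for end in range(window, len(excess) + 1):
        Y = excess.iloc[end - window:end].dropna(axis=1)   # balanced panel: drop stocks with gaps
        F = factors.iloc[end - window:end].to_numpy()
        delta = np.log(np.log(window)) * np.sqrt(np.log(Y.shape[1]))
        # alpha_screening is the per-asset OLS + screening helper sketched in Section 4.2
        _, _, S_hat, _ = alpha_screening(Y.to_numpy(), F, delta)
        selected[excess.index[end - 1]] = list(Y.columns[S_hat])
    return selected
```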
Table IV summarizes descriptive statistics for different components and estimates in the model. On average, 618 stocks (more than 500 because we record stocks that have ever been constituents of the index) enter the panel of the regression during each five-year estimation window. Of those, merely an average of 5.2 stocks are selected by the screening set Ŝ, which directly indicates the presence of sparse alternatives. The threshold δN,T varies as the panel size Nτ changes at the end of each month, and is about 3.5 on average. The selected stocks have much larger alphas (θ) than the other stocks do. Therefore, empirically we find that there are only a few significant nonzero “alpha” components, corresponding to a small portion of mis-priced stocks rather than systematic mis-pricing of the whole market.
Table IV.
| Variables | Mean | Std dev. | Median | Min | Max |
|---|---|---|---|---|---|
| Nτ | 617.70 | 26.31 | 621 | 574 | 665 |
| Size of Ŝ | 5.20 | 3.50 | 5 | 0 | 20 |
| | 0.9767 | 0.1519 | 0.9308 | 0.7835 | 1.3816 |
| | 4.5569 | 1.4305 | 4.1549 | 1.7839 | 10.8393 |
The power enhancement procedure is particularly suited to this empirical setting, where sparse alternatives are present. Finding only a few stocks with nonzero alphas is probably explained by our focus on a balanced panel of highly traded stocks with large capitalizations (the constituents of the S&P 500). On the other hand, as in Gagliardini et al. (2011), the empirical finding would likely differ if we considered a much larger universe of stocks, with possibly many more mis-priced assets.
7 Discussions
We consider testing a high-dimensional vector H0 : θ = 0 against sparse alternatives in which the null hypothesis is violated by only a few components. We introduce a “power enhancement component” J0 based on a screening technique, which is zero under the null but diverges quickly under sparse alternatives. We suggest constructing J0 as described in the paper because the screening set reveals the sparse structure of θ, so a rejection of the null points to a specific set of alternatives.
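To illustrate how the enhancement component is combined with a pivotal statistic, the following Python sketch assumes that the estimates θ̂, their asymptotic variances, and a standardized statistic J1 are already in hand; the threshold form and the standard-normal critical value are assumptions for this illustration, not a restatement of the paper's exact definitions.

```python
import numpy as np
from scipy.stats import norm

def power_enhancement_test(theta_hat, v_hat, J1, N, T, q=0.05):
    """Combine a screening-based enhancement component J0 with a pivotal statistic J1."""
    delta = np.log(np.log(T)) * np.sqrt(np.log(N))           # assumed screening threshold
    S_hat = np.abs(theta_hat) / np.sqrt(v_hat) > delta       # screening set: large studentized estimates
    J0 = np.sqrt(N) * np.sum(theta_hat[S_hat] ** 2 / v_hat[S_hat])
    J = J0 + J1                                              # power-enhanced statistic
    critical_value = norm.ppf(1 - q)                         # J1 assumed asymptotically N(0, 1)
    return {'J0': J0, 'J': J, 'reject': J > critical_value,
            'selected': np.where(S_hat)[0]}
```

Under the null the screening set is empty with high probability, so J0 = 0 and the size of the test is determined by J1 alone; under a sparse alternative, J0 diverges and drives the rejection.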
In the factor pricing model, the issue of omitting a small number of factors is also important to consider. On the one hand, when the unspecified factors are not “pervasive”, only the assets influenced by the missing factors are affected, which may lead to a sparse alternative. On the other hand, unspecified pervasive factors may substantially affect the sparse structure of θ or Σu, or both. In this case, we can extend the current model to allow for unobservable factors, which can be statistically inferred using the principal components (PC) method as in Stock and Watson (2002) and Fan et al. (2013). Since the PC method is robust to over-specifying the number of factors, the screening set should be stable as long as the “working number of factors” is no smaller than the true number of factors K. As a result, one can estimate θ and construct the screening set using either a consistent estimator of K or a slightly overestimated working number of factors. If the screening set is reasonably robust to this choice, it indicates that no pervasive factors are omitted, and we can then proceed to use the proposed J0 to conduct the test.
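A sketch of such a robustness check is given below, assuming excess returns Y and observed factors F_obs in matrix form; the PCA-on-residuals construction, the variable names, and the crude standard errors are illustrative assumptions rather than the paper's procedure.

```python
import numpy as np

def pc_augmented_screening(Y, F_obs, K_bar_list, delta):
    """Screening sets when the observed factors are augmented with K_bar residual PCs.

    Y:     T x N matrix of excess returns
    F_obs: T x K matrix of observed factors
    """
    T, N = Y.shape
    X = np.column_stack([np.ones(T), F_obs])
    U = Y - X @ np.linalg.lstsq(X, Y, rcond=None)[0]          # residuals from observed factors
    _, _, Vt = np.linalg.svd(U, full_matrices=False)          # principal components of residuals
    sets = {}
    for K_bar in K_bar_list:
        F_pc = U @ Vt[:K_bar].T                               # K_bar estimated (omitted) factors
        X_aug = np.column_stack([X, F_pc])
        B = np.linalg.lstsq(X_aug, Y, rcond=None)[0]
        theta_hat = B[0]                                      # intercepts under the augmented model
        resid = Y - X_aug @ B
        se = resid.std(axis=0, ddof=X_aug.shape[1]) / np.sqrt(T)   # crude standard errors
        sets[K_bar] = set(np.where(np.abs(theta_hat / se) > delta)[0])
    return sets   # stability of the sets across K_bar suggests no omitted pervasive factors
```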
In addition, this paper considers unconditional population moments of asset returns and focuses on the factor pricing model. Under a broad class of data generating processes, unconditional moments of financial returns are time invariant and can thus be estimated from time series data. Though theoretical models often imply a conditional linear model with respect to investors’ information sets, it is much more convenient to deduce testable implications that do not depend on this conditioning information (see Hansen and Richard (1987)). On the other hand, the use of conditioning information is also appealing and has been addressed by several authors (see, e.g., Gagliardini et al. (2011) and Ang and Kristensen (2012)). It will be an interesting direction to accommodate such conditioning information in deriving power enhancement tests.
Supplementary Material
Table I.
| μB | ΣB | | | μf | Σf | | |
|---|---|---|---|---|---|---|---|
| 0.9833 | 0.0921 | −0.0178 | 0.0436 | 0.0260 | 3.2351 | 0.1783 | 0.7783 |
| −0.1233 | −0.0178 | 0.0862 | −0.0211 | 0.0211 | 0.1783 | 0.5069 | 0.0102 |
| 0.0839 | 0.0436 | −0.0211 | 0.7624 | −0.0043 | 0.7783 | 0.0102 | 0.6586 |

(μB and μf are 3 × 1 vectors; ΣB and Σf are 3 × 3 matrices, with one row of each matrix per line.)
Acknowledgments
We thank co-editor Lars Hansen and anonymous referees for many insightful comments and suggestions, which have greatly improved the paper. The authors are also grateful for comments from Per Mykland and from seminar and conference participants at UChicago, Princeton, Georgetown, George Washington, the 2014 Econometric Society North America Summer Meeting, the UCL workshop on High-dimensional Econometrics Models, the 2014 Annual Meeting of the Royal Economic Society, the 2014 Asian Meeting of the Econometric Society, the 2014 International Conference on Financial Engineering and Risk Management, and the 2014 Midwest Econometrics Group meeting. The research was partially supported by National Science Foundation grants DMS-1206464 and DMS-1406266, and National Institutes of Health grants R01GM100474-01 and R01-GM072611.
A Proofs for Section 3
Throughout the proofs, let C be a generic constant, which may differ at different places.
A.1 Proof of Theorem 3.1
Proof
Define events
For any j ∈ S (θ), by the definition of S(θ), . Under A1 ∩ A2,
This implies that , hence . In fact, we have proved this statement on the event A1 ∩ A2 uniformly for θ ∈ Θ:
Moreover, it is readily seen that, under H0: θ = 0, by Assumption 3.1,
In addition, by ,
Note that the last convergence holds uniformly in θ ∈ Θ because δN,T → ∞. Therefore, . This completes the proof.
A.2 Proof of Theorem 3.2
Proof
It follows immediately from P (J0 = 0|H0) → 1 that J →d F, and hence the critical region {D: J > Fq} has size q. Moreover, by the power condition of J1 and J0 ≥ 0,
This together with the fact
establishes the theorem, provided we show .
By the definition of Ŝ and J0, we have . Since infθ∈Θ P(S(θ) ⊂ Ŝ|θ) → 1 and Θs = {θ ∈ Θ : S(θ) ≠ Ø}, we have
which converges to zero, since the first term is zero. This implies . Then by condition (ii), as δN,T → ∞,
This completes the proof.
A.3 Proof of Theorem 3.3
Proof
It suffices to verify conditions (i)–(iii) in Theorem 3.2 for J1 = JQ. Condition (i) follows from Assumption 3.2. Condition (iii) is fulfilled for c > 2/ξ, since
by using Fq = O(1), ξN,T → ξ, and μN,T → 0. We now verify condition (ii) for the Θ(JQ) defined in the theorem. Let D = diag(v1, …, vN). Then ‖D‖2 < C3/T by Assumption 3.2(iv).
On the event , we have
For with C = 4C3‖V‖2/λmin(V), we can bound further that
Hence, . Therefore,
which converges to zero since . This implies and finishes the proof.
A.4 Proof of Theorem 3.4
Proof
Throughout this proof, C is a generic constant, which can vary from line to line. Without loss of generality, under the alternative, write
where dim(θ1) = N − rN and dim(θ2) = rN. Corresponding to , we partition V−1 and V into:
where M1 and A are (N − rN) × (N − rN); β and G are rN × (N − rN); M2 and C are rN × rN.
By the matrix inversion formula,
Let Note that
We first look at Let and . Note that the diagonal entries of are given by . Therefore D1 is a diagonal matrix with entries , and .
Since β is rN × (N − rN), using the expression of A, we have
where we used θ1 = 0 in the second inequality and the fact that . Note that . Hence,
Thus, there is C > 0, with probability approaching one,
Note that the uniform convergence in Assumption 3.1 and the boundedness of imply that for a sufficiently large constant C. For G = (gij), note that Hence, using θ1 = 0 again, with probability tending to one,
Moreover, . Combining all the results above, it yields that for any θ ∈ Θb,
We denote , , to be the asymptotic covariance matrix of and . Then and . It then follows from (3.3) that
For any 0 < ε < Fq, define the event . Hence, suppressing the dependence on θ,
which is further bounded by 1 − Φ(Fq − ε) + P(Ac) + o(1). Since 1 − Φ(Fq) = q, for small enough ε, 1 − Φ(Fq − ε) = q + O(ε). By letting ε → 0 more slowly than , we have P(Ac) = o(1), and lim supN→∞,T→∞ P(JQ > Fq) ≤ q. On the other hand, P(JQ > Fq) ≥ P(J1 > Fq), which converges to q. This proves the result.
B Proofs for Section 4
Lemma B.1
When cov(ft) is positive definite, .
Proof
If Eft= 0, then . If Eft ≠ 0, because cov(ft) is positive definite, let , then . Hence implies . This implies .
B.1 Proof of Proposition 4.1
Recall that , and . Write , and .
Simple calculations yield
We first prove the second statement. Note that there is σmin > 0 (independent of θ) so that minj σj > σmin. By Lemma ??, there is C > 0, . On the event ,
This proves the second statement. We can now use this to prove the first statement.
Note that vj is independent of θ, so there is C1 (independent of θ) so that On the event ,
The constants C and C1 above are independent of θ, and Lemma ?? holds uniformly in θ. Hence the desired result also holds uniformly in θ.
B.2 Proof of Proposition 4.2
By Theorem 1 of Pesaran and Yamagata (2012),
Therefore, we only need to show
The left hand side is equal to
It was shown by Fan et al. (2011) that In addition, under H0, . Hence .
The challenging part is to prove a = oP (1) when N > T. As is described in the main text, simple inequalities like Cauchy-Schwarz accumulate estimation errors, and hence do not work. Define , which is an N-dimensional vector with mean zero and covariance , whose entries are stochastically bounded. Let . A key step of proving this proposition is to establish the following two convergences:
(B.1) |
(B.2) |
where
The sparsity condition assumes that most of the off-diagonal entries of Σu are outside of SU. The above two convergences are weighted cross-sectional and serial double sums, where the weights satisfy for each i. The proofs of (B.1) and (B.2) are given in the supplementary material in Appendix D.
We consider the hard-thresholding covariance estimator. The proof for the generalized sparsity case as in Rothman et al. (2009) is very similar. Let and σij=(Σu)ij. Under hard-thresholding,
Write to denote the ith element of , and . For , and we have
We first examine a3. Note that
Obviously,
Because sii is uniformly (across i) bounded away from zero with probability approaching one, and . Hence for any ε > 0, when C in the threshold is large enough, P(a3 > T−1) < ε, which implies a3 = oP(1).
The proof is finished once we establish ai = oP (1) for i = 1, 2, which are given in Lemmas ?? and ?? respectively in the supplementary material.
Proof of Theorem 4.1
Part (i) follows from Proposition 4.2 and that P (J0 = 0|H0) → 1. Part (ii) follows immediately from Theorem 3.3.
C Proofs for Section 5
C.1 Proof of Proposition 5.1
Lemma C.1
Under Assumption 5.1, uniformly in θ ∈ Θ, .
Proof
Note that
Uniformly for θ ∈ Θ, due to serial independence, and ,
Hence the result follows from the Chebyshev inequality and that is bounded away from zero with probability approaching one, uniformly in θ.
Lemma C.2
Suppose with probability approaching one and . There is C > 0, so that uniformly in θ ∈ Θ,
Proof
- By the Bernstein inequality, for , we have
Hence (i) is proved as . - For , we have, uniformly in θ ∈ Θ,
Note that , and with probability approaching one. The result then follows from part (i) and Lemma C.1.
- Observe that
The first two terms and in the third term are bounded by results in (ii) and (iii). Therefore, it suffices to show that there is a constant M > 0 so that
Note that . In addition, by (ii), there is C > 0 so that
Hence we can pick M so that , and
This proves the desired result.
Lemma C.3
Under Assumption 5.1, there is C > 0, uniformly in θ ∈ Θ,
Proof
By the definition, . By the triangle inequality,
By part (iv) of Lemma C.2, . Hence for sufficiently large M > 0 such that ,
By a similar argument, there is M′ > 0 so that The result then follows as, uniformly in θ ∈ Θ,
Proof of Proposition 5.1
As uniformly for (i, j) and θ, the second convergence follows from Lemma C.3. Also, with probability approaching one,
C.2 Proof of Theorem 5.1
Lemma C.4
J1 has power uniformly on for some C.
Proof
By Lemma C.3, there is C > 0, . Let
Then . On the event A, we have, uniformly in θ = {ρij},
Therefore, when ,
This entails that when , we have
Proof of Theorem 5.1
It suffices to verify conditions (i)–(iii) of Theorem 3.2. Condition (i) follows from Theorem 1 of Baltagi et al. (2012). As for condition (ii), note that almost surely. Hence as n, T → ∞,
Finally, condition (iii) follows from Lemma C.4.
Footnotes
JEL code: C12, C33, C58
References
- Andrews D. Hypothesis testing with a restricted parameter space. Journal of Econometrics. 1998;84:155–199.
- Andrews D. Cross-sectional regression with common shocks. Econometrica. 2005;73:1551–1585.
- Ang A, Kristensen D. Testing conditional factor models. Journal of Financial Economics. 2012;106:132–156.
- Antoniadis A, Fan J. Regularized wavelet approximations. Journal of the American Statistical Association. 2001;96:939–967.
- Bai ZD, Saranadasa H. Effect of high dimension: by an example of a two sample problem. Statistica Sinica. 1996;6:311–329.
- Baltagi B. Econometric Analysis of Panel Data. 4th edition. Wiley; 2008.
- Baltagi B, Feng Q, Kao C. A Lagrange multiplier test for cross-sectional dependence in a fixed effects panel data model. Journal of Econometrics. 2012;170:164–177.
- Beaulieu M, Dufour J, Khalaf L. Multivariate tests of mean-variance efficiency with possibly non-Gaussian errors: an exact simulation-based approach. Journal of Business and Economic Statistics. 2007;25:398–410.
- Bickel P, Levina E. Covariance regularization by thresholding. Annals of Statistics. 2008;36:2577–2604.
- Breusch T, Pagan A. The Lagrange multiplier test and its application to model specification in econometrics. Review of Economic Studies. 1980;47:239–254.
- Cai T, Liu W, Xia Y. Two-sample covariance matrix testing and support recovery in high-dimensional and sparse settings. Journal of the American Statistical Association. 2013;108:265–277.
- Cai T, Zhang C, Zhou H. Optimal rates of convergence for covariance matrix estimation. Annals of Statistics. 2010;38:2118–2144.
- Cai TT, Yuan M. Adaptive covariance matrix estimation through block thresholding. The Annals of Statistics. 2012;40:2014–2042.
- Carhart MM. On persistence in mutual fund performance. The Journal of Finance. 1997;52:57–82.
- Chamberlain G, Rothschild M. Arbitrage, factor structure and mean-variance analysis in large asset markets. Econometrica. 1983;51:1305–1324.
- Chen SX, Qin YL. A two-sample test for high-dimensional data with applications to gene-set testing. The Annals of Statistics. 2010;38:808–835.
- Chernozhukov V, Chetverikov D, Kato K. Testing many moment inequalities. Technical report, MIT; 2013.
- Connor G. A unified beta pricing theory. Journal of Economic Theory. 1984;34:13–31.
- Connor G, Korajczyk R. A test for the number of factors in an approximate factor model. Journal of Finance. 1993;48:1263–1291.
- Donald SG, Imbens GW, Newey WK. Empirical likelihood estimation and consistent tests with conditional moment restrictions. Journal of Econometrics. 2003;117:55–93.
- Fama E, French K. The cross-section of expected stock returns. Journal of Finance. 1992;47:427–465.
- Fan J. Test of significance based on wavelet thresholding and Neyman's truncation. Journal of the American Statistical Association. 1996;91:674–688.
- Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association. 2001;96:1348–1360.
- Fan J, Liao Y, Mincheva M. High dimensional covariance matrix estimation in approximate factor models. Annals of Statistics. 2011;39:3320–3356. doi: 10.1214/11-AOS944.
- Fan J, Liao Y, Mincheva M. Large covariance estimation by thresholding principal orthogonal complements (with discussion). Journal of the Royal Statistical Society, Series B. 2013;75:603–680. doi: 10.1111/rssb.12016.
- Fan J, Liao Y, Yao J. Power enhancement in high dimensional cross-sectional tests. Technical report, Princeton University; 2014.
- Gagliardini P, Ossola E, Scaillet O. Time-varying risk premium in large cross-sectional equity datasets. Technical report, Swiss Finance Institute; 2011.
- Gibbons M, Ross S, Shanken J. A test of the efficiency of a given portfolio. Econometrica. 1989;57:1121–1152.
- Hall P, Jin J. Innovated higher criticism for detecting sparse signals in correlated noise. The Annals of Statistics. 2010;38:1686–1732.
- Hansen LP, Richard SF. The role of conditioning information in deducing testable restrictions implied by dynamic asset pricing models. Econometrica. 1987;55:587–613.
- Hansen P. Asymptotic tests of composite hypotheses. Technical report, CREATES; 2003.
- Hansen P. A test for superior predictive ability. Journal of Business and Economic Statistics. 2005;23:365–380.
- Im K, Ahn S, Schmidt P, Wooldridge J. Efficient estimation of panel data models with strictly exogenous explanatory variables. Journal of Econometrics. 1999;93:177–201.
- MacKinlay A, Richardson M. Using generalized method of moments to test mean-variance efficiency. Journal of Finance. 1991;46:511–527.
- Pesaran H, Ullah A, Yamagata T. A bias-adjusted LM test of error cross-section independence. Econometrics Journal. 2008;11:105–127.
- Pesaran H, Yamagata T. Testing CAPM with a large number of assets. Technical report, University of Southern California; 2012.
- Ross S. The arbitrage theory of capital asset pricing. Journal of Economic Theory. 1976;13:341–360.
- Rothman A, Levina E, Zhu J. Generalized thresholding of large covariance matrices. Journal of the American Statistical Association. 2009;104:177–186.
- Stock J, Watson M. Forecasting using principal components from a large number of predictors. Journal of the American Statistical Association. 2002;97:1167–1179.
- Zhong P, Chen S, Xu M. Tests alternative to higher criticism for high-dimensional means under sparsity and column-wise dependence. Annals of Statistics. 2013;41:2820–2851.