Abstract
Motivated by the advent of high dimensional highly correlated data, this work studies the limit behavior of the empirical cumulative distribution function (ecdf) of standard normal random variables under arbitrary correlation. First, we provide a necessary and sufficient condition for convergence of the ecdf to the standard normal distribution. Next, under general correlation, we show that the ecdf limit is a random, possibly infinite, mixture of normal distribution functions that depends on a number of latent variables and can serve as an asymptotic approximation to the ecdf in high dimensions. We provide conditions under which the dimension of the ecdf limit, defined as the smallest number of effective latent variables, is finite. Estimates of the latent variables are provided and their consistency proved. We demonstrate these methods in a real high-dimensional data example from brain imaging where it is shown that, while the study exhibits apparently strongly significant results, they can be entirely explained by correlation, as captured by the asymptotic approximation developed here.
Keywords: empirical null, dependent random variables, high dimensional data, factor analysis, asymptotic approximation, strong correlation
1 Introduction
The empirical cumulative distribution function (ecdf) and its large sample properties have a long and rich history in probability and statistics. However, most of this vast literature assumes that the variables used to construct the ecdf are independent (e.g., Wasserman, 2006, Chapter 2, and references therein). Under independence, the ecdf is a consistent estimator of the true cumulative distribution function (cdf). The consistency property continues to hold under various forms of weak dependence (e.g., Dedecker and Merlevède (2007); Wu (2006)).
Motivated by modern problems in high dimensional data, where a large number of correlated variables are measured, it is of interest to study the asymptotic behavior of the ecdf when the variables involved are arbitrarily correlated. In particular, the present paper is motivated by large-scale multiple testing problems, where each of p tests produces a z-score Zi, i = 1, …, p. It has been pointed out by Bradley Efron that in large-scale multiple testing problems, the observed distribution of the z-scores often does not match the theoretical null distribution N(0, 1) (Efron, 2004, 2007a, b, 2008). Efron (2007a) conjectured that, even when the theoretical model is correct, the observed distribution of the test statistics can look different from the theoretical null distribution simply because of correlation between them. This interesting observation suggests that the ecdf may not always be consistent and calls for a detailed study of the ecdf of a large number of dependent variables.
Assuming that the variables Z1, Z2, …, Zp are marginally standard normal simplifies the problem and allows obtaining results for their ecdf under arbitrary dependence because all the dependence is expressed through correlation. In this situation, Efron (2007a) proposed the so-called empirical null as an approximation to the observed distribution of the z-scores, parametrized as a normal distribution with mean and variance other than 0 and 1. To further understand the effect of correlation, Efron (2010) derived the covariance function of the ecdf and applied it to estimating the variance of functions of the ecdf relevant in large-scale multiple testing, such as the local and tail false discovery rate. Schwartzman (2010) proposed to approximate the ecdf by a Gram-Charlier expansion and used its coefficients to establish some constraints on the extent of the departure of the ecdf from the marginal N(0, 1). However, these approaches have not succeeded in fully characterizing the behavior of the ecdf under correlation.
In this article we describe the asymptotic behavior of the ecdf of a large number of correlated standard normal variables. First, we show that in general, the ecdf need not converge to Φ, the standard normal cdf. A necessary and sufficient condition for convergence, which we call weak correlation, is that the average of the absolute pairwise correlations between the z-scores (or the average of the squares of the pairwise correlations) tends to zero with increasing dimension p. However, we show that in a wide range of strong correlation situations, the ecdf converges instead to a random distribution function. This random function can be written as a (possibly infinite) normal mixture parametrized by latent independent standard normal variables. It can be thought of as an analytic asymptotic approximation to the ecdf, where the latent variables can be consistently estimated, under some regularity conditions, from the observed data sequence. Further, it can be thought of as a dimension reduction of the p-dimensional ecdf in the sense that its inherent dimension is the number of the latent variables. We give a lower bound for this inherent dimension and show that, under certain regularity conditions, it can be achieved by a particular parametrization obtained via an eigendecomposition of the correlation matrix. This parametrization is based on the factor analysis model of Fan et al. (2012), who use it to calculate the false discovery proportion in large scale multiple testing under arbitrary correlation. Here we consider a more general framework in which decompositions other than eigendecompositions are also investigated.
As an illustration of the behavior of the ecdf as a random function as described in this paper, Figure 1 presents the histograms of two realizations of normal random variables under two correlation structures:
Figure 1.
Histograms of various instances of 1000 standard normal variables with a one-block correlation structure (a) and a two-independent-blocks correlation structure (b). The red line is the standard normal density and the blue dashed line is the asymptotic approximation we use.
One block: a sequence of exchangeable random variables with correlation ρ = 0.9, i.e., for any i ≠ j, cor(Zi, Zj) = ρ.
Two independent blocks: consists of two intercalated independent sequences of exchangeable random variables, i.e., for any i ≠ j such that |i − j| is even, cor(Zi, Zj) = ρ (and 0 otherwise).
It can be seen that for both structures the empirical distribution differs from the standard normal density (red line). In case (a) the histogram is shifted and narrower than the standard normal density, while in case (b) it looks like a mixture of two normal distributions. Notice that the histograms change between the two realizations, suggesting that the distribution may not be converging to a deterministic limit. We will show that in these cases the empirical distribution indeed does not converge to a deterministic limit, but can be asymptotically approximated by a normal mixture estimated from the data, represented here by the dashed blue line. Not surprisingly, in case (a) the asymptotic approximation is of dimension 1, containing one latent standard normal variable, and in case (b) it is of dimension 2, containing two independent latent standard normal variables. The fitted density in case (a) can be interpreted as Efron's empirical null model of a shifted and scaled normal, but clearly that model cannot capture the behavior of the distribution in case (b).
When considering the entire range of possible correlation structures, we show that there are essentially three regimes. The first is what we call weak correlation; while the random variables need not be independent, the ecdf converges to Φ. The second regime is what we call finite dimensional correlation and is the main interest of this paper. Under this type of correlation, the ecdf can be approximated by a random function that depends on a finite number of latent independent standard normal random variables. Furthermore, under some regularity conditions, a representation using the smallest number of latent variables can be achieved and estimated consistently. The examples of Figure 1 belong to this regime. Finally, in the third regime, the limiting random function depends on an infinite number of independent standard normal random variables, like the ecdf itself.
As an illustration of the usefulness of the results presented in this article, we present an analysis of brain imaging data obtained from a study of cortical thickness of adults who had a diagnosis of attention deficit/hyperactivity disorder (ADHD) as children (Proal et al., 2011). In this study, it had been noticed before that, when searching for brain locations whose cortical thickness is related to clinical diagnosis, the histogram of the z-scores did not follow the theoretical standard normal distribution (Reiss et al., 2012). Here we do a slightly different analysis (where the correlation structure is known) and show that, while the study exhibits apparently strongly significant results, they can be entirely explained by correlation, as captured by the asymptotic approximation mentioned above.
The rest of the paper is organized as follows. After a brief treatment of weak correlation in Section 2, the main results of the paper for general correlation are given in Section 3. Section 4 discusses how to consistently estimate the latent variables. In Section 4 we also briefly discuss the case where the correlation matrix is unknown. Several concrete examples, including those of Figure 1, are presented in detail in Section 5. A data example is analyzed in Section 6. Section 7 considers some possible extensions of this work. All the proofs are given in a supplementary material document.
2 Weak correlation
2.1 Definition
To define weak correlation some notation is needed. For a given p × p matrix $R_p = (r_{ij})$ define the following average norms:
$$\|R_p\|_1^{(p)} := \frac{1}{p^2}\sum_{i,j=1}^{p}|r_{ij}| \quad\text{and}\quad \|R_p\|_2^{(p)} := \Big(\frac{1}{p^2}\sum_{i,j=1}^{p} r_{ij}^2\Big)^{1/2} = \Big(\frac{1}{p^2}\sum_{i=1}^{p}\lambda_i^2\Big)^{1/2},$$
where the λi's are the eigenvalues of Rp. The latter is simply a scaled version of the Frobenius norm, and in both cases we use the superscript (p) to denote that the norm itself, not just the matrix, changes with p. If $\{R_p\}_{p=1}^{\infty}$ is a sequence of correlation matrices then, as p → ∞,
$$\|R_p\|_1^{(p)} \to 0 \quad\Longleftrightarrow\quad \|R_p\|_2^{(p)} \to 0.$$
This is because, by Jensen's inequality, $\|R_p\|_1^{(p)} \le \|R_p\|_2^{(p)}$, while on the other hand, |rij| ≤ 1 and (rij)² ≤ |rij|, and so $\big(\|R_p\|_2^{(p)}\big)^2 \le \|R_p\|_1^{(p)}$. A similar argument holds if Rp is a covariance matrix and the diagonal entries of Rp, i.e., the variances, are bounded.
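As a concrete illustration of these two norms, the following minimal sketch (in Python with NumPy; the function names are ours, not the paper's) computes both quantities on an exchangeable correlation matrix, for which neither norm vanishes as p grows, and checks the two inequalities above.

```python
import numpy as np

def avg_abs_norm(R):
    """||R||_1^{(p)} = p^{-2} * sum_{i,j} |r_ij| (average absolute entry)."""
    return np.abs(R).sum() / R.shape[0] ** 2

def avg_sq_norm(R):
    """||R||_2^{(p)} = (p^{-2} * sum_{i,j} r_ij^2)^{1/2} (scaled Frobenius norm)."""
    return np.sqrt((R ** 2).sum()) / R.shape[0]

# Exchangeable correlation (Equation (13) below): neither norm vanishes as
# p grows, so by Definition 1 such a sequence is strongly correlated.
for p in (100, 1000):
    R = 0.3 * np.ones((p, p)) + 0.7 * np.eye(p)
    n1, n2 = avg_abs_norm(R), avg_sq_norm(R)
    assert n2 ** 2 <= n1 <= n2 + 1e-12      # the two inequalities in the text
    print(p, round(n1, 4), round(n2, 4))
```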
Definition 1
Let $\{\xi_i\}_{i=1}^{\infty}$ be a sequence of standard normal variables with joint normal distribution and denote the correlation matrix of (ξ1, …, ξp) by Rp. If $\|R_p\|_1^{(p)} \to 0$, or equivalently, $\|R_p\|_2^{(p)} \to 0$, then $\{\xi_i\}_{i=1}^{\infty}$ is called weakly correlated. Otherwise it is called strongly correlated.
2.2 Convergence of the ecdf
Let $\{Z_i\}_{i=1}^{\infty}$ be a sequence of standard normal variables with joint normal distribution. The ecdf is
$$\hat F_p(z) = \frac{1}{p}\sum_{i=1}^{p} I(Z_i \le z), \qquad (1)$$
where I(·) denotes the indicator function. Correlation does not affect the expectation E[F̂p(z)] = Φ(z) of the ecdf, but it affects its covariance. The covariance function is given in Proposition 1 of Schwartzman (2010).
The following theorem establishes that a necessary and sufficient condition for consistency of the ecdf in $L^2$ is that $\{Z_i\}_{i=1}^{\infty}$ is weakly correlated.
Theorem 1
Let Z1, Z2, …, Zp, … be N(0, 1) variables with ecdf (1) and let Rp denote the correlation matrix of (Z1, …, Zp).
- (Sufficiency) If $\{Z_i\}_{i=1}^{\infty}$ is weakly correlated, then F̂p(z) converges to Φ(z) in $L^2$ uniformly:
$$\sup_z \mathbb{E}\big[\hat F_p(z) - \Phi(z)\big]^2 \le C\,\|R_p\|_1^{(p)} \longrightarrow 0,$$
where C is a universal constant.
- (Necessity) If $\{Z_i\}_{i=1}^{\infty}$ is not weakly correlated, i.e., $\limsup_p \|R_p\|_1^{(p)} > 0$, then F̂p(z) does not converge to Φ(z) in $L^2$ for any z ≠ 0, i.e.:
$$\limsup_p \mathbb{E}\big[\hat F_p(z) - \Phi(z)\big]^2 > 0 \quad\text{for every } z \ne 0.$$
Notice that according to Theorem 1 the convergence is either uniform or none at all; if the correlation is strong, i.e., not weak, then for any z ≠ 0 there is no convergence to Φ(z). Under a general correlation structure the ecdf may converge to a random function. In Section 3 below we aim at identifying and estimating this function.
2.3 Examples
It is easy to check that every Gaussian autoregressive moving average (ARMA) process is weakly correlated. So is every m-dependent Gaussian sequence with banded correlation matrix so that rij = 0 for |i − j| > m and fixed finite m. This includes correlation in fixed finite blocks. In all these cases the ecdf converges to the standard normal distribution.
More generally, all Gaussian stationary ergodic processes are weakly correlated. This is because ergodicity requires that the autocorrelation function ρ(ℓ) = r_{i,i+ℓ} satisfies |ρ(ℓ)| → 0 as ℓ → ∞, and therefore
$$\|R_p\|_1^{(p)} = \frac{1}{p^2}\sum_{\ell=-(p-1)}^{p-1}(p - |\ell|)\,|\rho(\ell)| \le \frac{1}{p}\sum_{\ell=-(p-1)}^{p-1}|\rho(\ell)| \longrightarrow 0$$
by Cesàro convergence.
It is not hard to check that even long-range correlation, defined by $\sum_{\ell=1}^{\infty}|\rho(\ell)| = \infty$, also implies weak correlation except in the extreme case where $\sum_{\ell=1}^{p-1}|\rho(\ell)|$ is of order p (the largest possible).
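The following sketch (function name and autocorrelation choices are ours; positive semi-definiteness of the implied matrices is not checked here, as only the norm is being illustrated) evaluates $\|R_p\|_1^{(p)}$ for a stationary sequence directly from the autocorrelation function, showing the decay both for a summable, AR(1)-type correlation and for a non-summable, long-range one.

```python
import numpy as np

def avg_abs_norm_stationary(p, rho):
    """||R_p||_1^{(p)} for a stationary sequence with autocorrelation rho(l):
    p^{-2} * [ p + 2 * sum_{l=1}^{p-1} (p - l) * |rho(l)| ]."""
    lags = np.arange(1, p)
    return (p + 2.0 * np.sum((p - lags) * np.abs(rho(lags)))) / p ** 2

# AR(1)-type decay rho(l) = 0.8^l is summable; the long-range rho(l) = l^{-0.4}
# is not, yet both averages still tend to zero (weak correlation).
for p in (10 ** 2, 10 ** 3, 10 ** 4, 10 ** 5):
    print(p,
          round(avg_abs_norm_stationary(p, lambda l: 0.8 ** l), 5),
          round(avg_abs_norm_stationary(p, lambda l: l ** -0.4), 5))
```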
3 General correlation
3.1 An asymptotic approximation
To describe the asymptotic behavior of the ecdf for a general correlation structure, the main idea is to decompose the correlation into a strong correlation component and a weak correlation component. Then, the asymptotic behavior of the ecdf as a random function will be captured by the strong correlation component, while the weak component will converge as in Theorem 1.
Specifically, suppose that for every p we have the decomposition
$$R_p = A_p + B_p, \qquad (2)$$
where Ap and Bp are symmetric positive semi-definite matrices. Write Ap as
$$A_p = L_p L_p^T, \qquad (3)$$
where Lp is of dimension p × k(p); let $\ell_i^T$ be the i'th row of Lp. If the matrix Ap is the zero matrix then we define k(p) := 1 and Lp := (0, …, 0)T.
For a matrix B, we define the matrix Cor(B) by
$$\{\mathrm{Cor}(B)\}_{ij} := \frac{B_{ij}}{\sqrt{B_{ii}B_{jj}}},$$
with the convention that the entry is 0 when $B_{ii}B_{jj} = 0$. Notice that if B is a positive semidefinite matrix, then |{Cor(B)}ij| ≤ 1. Obviously, if B is a correlation matrix, then Cor(B) = B. The following theorem states the main result.
Theorem 2
Under the previous setting and notation,
- (i) For every p there exists a (non-unique) random vector $W_{k(p)} \sim N(0, I_{k(p)})$ such that
$$\sup_z \mathbb{E}\big[\hat F_p(z) - \bar F_p(z)\big]^2 \le C\,\|\mathrm{Cor}(B_p)\|_1^{(p)}, \qquad (4)$$
where
$$\bar F_p(z) = \frac{1}{p}\sum_{i=1}^{p}\Phi\Big(\frac{z - \ell_i^T W_{k(p)}}{\sigma_i}\Big), \qquad (5)$$
with $\sigma_i^2 := \{B_p\}_{ii} = 1 - \ell_i^T\ell_i$, and C is a universal constant. If $\sigma_i = 0$ we define $\Phi\big((z - \ell_i^T W_{k(p)})/\sigma_i\big) := I\big(\ell_i^T W_{k(p)} \le z\big)$.
- (ii) Therefore, if $\|\mathrm{Cor}(B_p)\|_1^{(p)} \to 0$ then $\sup_z \mathbb{E}\big[\hat F_p(z) - \bar F_p(z)\big]^2 \to 0$.
The key idea of the proof is the following: due to decomposition (2) and equality (3), there exists a random vector $W_{k(p)} \sim N(0, I_{k(p)})$ such that
$$(Z_1, \dots, Z_p)^T = L_p W_{k(p)} + (\xi_1, \dots, \xi_p)^T, \qquad (6)$$
where (ξ1, …, ξp) ~ N(0, Bp) is independent of $W_{k(p)}$. Therefore, the conditional random vector $(Z_1, \dots, Z_p) \mid W_{k(p)} = w_{k(p)}$ is normal with conditional mean $\mu = L_p w_{k(p)}$ and conditional covariance matrix Bp. The $L^2$ distance between F̂p(z) and $\bar F_p(z)$ can then be essentially bounded uniformly by $\|\mathrm{Cor}(B_p)\|_1^{(p)}$.
It is important to note that the random vector $W_{k(p)}$ is not unique. Because $W_{k(p)}$ has a spherically symmetric distribution, Equation (4) will still hold if $W_{k(p)}$ is replaced by $QW_{k(p)}$ and $L_p$ by $L_pQ^T$, where Q is any k(p) × k(p) orthogonal matrix.
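A minimal simulation of Theorem 2, assuming the two-independent-blocks structure of Figure 1(b) (all variable names and parameter values are ours): we sample Z from representation (6) and compare the ecdf with the random approximation (5) and with Φ.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
p, rho = 2000, 0.9

# Two intercalated independent exchangeable blocks: k(p) = 2 latent variables.
# Rows of L are (sqrt(rho), 0) for block 1 and (0, sqrt(rho)) for block 2.
L = np.zeros((p, 2))
L[::2, 0] = np.sqrt(rho)                   # even indices -> block 1
L[1::2, 1] = np.sqrt(rho)                  # odd indices  -> block 2
W = rng.standard_normal(2)                 # latent vector, W ~ N(0, I_2)
xi = np.sqrt(1 - rho) * rng.standard_normal(p)  # residual with Cov = B_p
Z = L @ W + xi                             # Equation (6)

z = np.linspace(-4, 4, 201)
F_hat = (Z[:, None] <= z).mean(axis=0)     # the ecdf, Equation (1)
sigma = np.sqrt(1 - np.sum(L ** 2, axis=1))  # sigma_i^2 = {B_p}_ii
F_bar = norm.cdf((z - (L @ W)[:, None]) / sigma[:, None]).mean(axis=0)  # (5)

print(np.max(np.abs(F_hat - F_bar)))       # small: F_bar tracks the random ecdf
print(np.max(np.abs(F_hat - norm.cdf(z)))) # large: the ecdf is far from Phi
```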
Definition 2
The sequence of random functions $\bar F_p(z)$ is said to converge uniformly in $L^2$ to the (random) function $\bar F(z)$ if $\sup_z \mathbb{E}\big[\bar F_p(z) - \bar F(z)\big]^2 \to 0$.
Corollary 1
Suppose F̂p(z) satisfies the conditions of Theorem 2(ii). If there exists a (random) function $\bar F(z)$ such that $\bar F_p(z)$ converges to $\bar F(z)$ uniformly in $L^2$, then F̂p(z) also converges to $\bar F(z)$ uniformly in $L^2$.
Theorem 2 holds for all decompositions of the form (2). We call every corresponding F̄p(z) an asymptotic approximation of F̂p(z). However, some decompositions may need a smaller number of latent variables k(p) than others. In the next section we characterize the best decomposition in the sense of giving the asymptotic approximation with the smallest number of latent variables. If F̄p(z) converges to some F̄(z), then we call the latter the asymptotic representation of F̄p(z).
3.2 Dimension reduction
Theorem 2 approximates F̂p(z) by F̄p(z), which is the projection of F̂p(z) onto the space generated by $W_{k(p)}$. Thus, as explained below, F̂p(z), which has dimension p, is approximated by F̄p(z) with dimension rank(Ap). Hence, Theorem 2 can be regarded as a dimension reduction of the empirical distribution function as stated in the following proposition.
Proposition 1
Let $\mathcal{F}$ be the collection of all one-dimensional distribution functions, where convergence is defined in the sense of weak convergence of the corresponding random variables.
- (i) F̂p(z) has dimension p: define the mapping $\mathcal{Z}(Z_1, \dots, Z_p) = \hat F_p(z)$ from $\mathbb{R}^p$ to $\mathcal{F}$. Then $\mathcal{Z}(\mathbb{R}^p) \subseteq \mathcal{F}$ is homeomorphic to $\mathbb{R}^p$.
- (ii) F̄p(z) has dimension rank(Ap): define the mapping $\mathcal{W}(w_{k(p)}) = \bar F_p(z)$ from $\mathbb{R}^{k(p)}$ to $\mathcal{F}$. Then $\mathcal{W}(\mathbb{R}^{k(p)}) \subseteq \mathcal{F}$ is homeomorphic to $\mathbb{R}^{\mathrm{rank}(A_p)}$.
In order to achieve dimension reduction, we are interested in knowing whether there exist decompositions of the form (2) such that the approximation of Theorem 2 holds and the dimension of the approximation, rank(Ap), is finite. Consider the collection $\mathcal{D}$ of decompositions of Rp that satisfy the conditions of Theorem 2 Part (ii): $D = \{(A_p, B_p)\}_{p=1}^{\infty} \in \mathcal{D}$ if Ap, Bp satisfy (2) and $\|\mathrm{Cor}(B_p)\|_1^{(p)} \to 0$. Clearly, $\mathcal{D}$ is a large collection and some notion of optimality is required in order to decide which D ∈ $\mathcal{D}$ to choose. Given Proposition 1(ii), we are interested in decompositions where rank(Ap) is the smallest. Theorem 3 below states a lower bound on the limiting rank of Ap and presents a decomposition that achieves it under certain conditions.
To state the result, we need to define the eigendecompositions of Rp as a special case of the decompositions in $\mathcal{D}$. Suppose that the eigenvalues of Rp are $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p \ge 0$ and the corresponding eigenvectors are $u_1, \dots, u_p$ (notice that everything depends on p). The eigendecomposition of Rp is $R_p = \sum_{i=1}^{p}\lambda_i u_i u_i^T$. For k < p, define
$$A_{k,p} := \sum_{i=1}^{k}\lambda_i u_i u_i^T, \qquad B_{k,p} := \sum_{i=k+1}^{p}\lambda_i u_i u_i^T. \qquad (7)$$
Then, the covariance matrix Rp can be expressed as Rp = Ak,p + Bk,p and Ak,p = Lk,p(Lk,p)T where
$$L_{k,p} := \big(\sqrt{\lambda_1}\,u_1, \dots, \sqrt{\lambda_k}\,u_k\big). \qquad (8)$$
A critical quantity is the number of "big" eigenvalues, i.e., with size of order p. Define
$$K_i := I\Big(\liminf_p \frac{\lambda_i}{p} > 0\Big), \qquad K := \sum_{i=1}^{\infty} K_i,$$
and
$$\bar K_i := I\Big(\limsup_p \frac{\lambda_i}{p} > 0\Big), \qquad \bar K := \sum_{i=1}^{\infty} \bar K_i.$$
By definition we have that K ≤ K̄. Notice that K could be ∞ and that if K < ∞ then Ki = 1 for i ≤ K and Ki = 0 otherwise; the same for K̄. Since K and K̄ are sums of indicators, they are each either an integer or infinity. The following theorem states that K and K̄ give lower bounds for the limiting rank of Ap and presents a decomposition of the correlation matrix that achieves them under certain conditions.
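At finite p, the limits that define K and K̄ can only be approximated. The sketch below (the 0.05 cutoff is an arbitrary illustrative threshold, not part of the theory) counts eigenvalues whose ratio λi/p is non-negligible, here for the two-independent-blocks structure of Section 5.2 with ρB = 0.

```python
import numpy as np

def count_big_eigenvalues(R, threshold=0.05):
    """Finite-p proxy for K and K-bar: the number of eigenvalues with
    lambda_i / p above a (user-chosen) threshold."""
    lam = np.sort(np.linalg.eigvalsh(R))[::-1]
    return int(np.sum(lam / R.shape[0] > threshold))

# Two independent exchangeable blocks: two eigenvalues grow like (p/2) * rho,
# so the count stabilizes at 2 as p increases.
for p in (100, 400, 1600):
    R = np.kron(np.eye(2), 0.5 * np.ones((p // 2, p // 2)))
    np.fill_diagonal(R, 1.0)
    print(p, count_big_eigenvalues(R))
```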
Theorem 3
Under the previous setting and notation:
- (i) If $D = \{(A_p, B_p)\}_{p=1}^{\infty} \in \mathcal{D}$, then lim infp rank(Ap) ≥ K and lim supp rank(Ap) ≥ K̄ (both K and K̄ could be ∞).
- (ii) For K̄ < ∞ define the decomposition $D_{\bar K} := \{(A_{\bar K,p}, B_{\bar K,p})\}_{p=1}^{\infty}$ according to (7) with k = K̄. If the nonzero diagonal terms of $B_{\bar K,p}$ are bounded from below, i.e.,
$$\liminf_p \ \min\big\{\{B_{\bar K,p}\}_{ii} : \{B_{\bar K,p}\}_{ii} > 0\big\} > 0, \qquad (9)$$
then $D_{\bar K} \in \mathcal{D}$ and obviously rank($A_{\bar K,p}$) = K̄ for all p > K̄.
The regularity condition (9) guarantees that the norm of the residual covariance in BK̄,p goes to zero because the correlation in it goes to zero, not because the variance in it goes to zero.
To better appreciate the result given by Theorem 3, it is useful to define the following concepts.
Definition 3
If there exists $D = \{(A_p, B_p)\}_{p=1}^{\infty} \in \mathcal{D}$ such that lim supp rank(Ap) < ∞, we say that $\{Z_i\}_{i=1}^{\infty}$ has finite dimensional correlation.
If further K = K̄ < ∞ then we say that $\{Z_i\}_{i=1}^{\infty}$ has asymptotic dimension K.
With these definitions in mind, Theorem 3 implies that if $\{Z_i\}_{i=1}^{\infty}$ has finite dimensional correlation with asymptotic dimension K and the regularity condition (9) holds, then the decomposition $D_K = \{(A_{K,p}, B_{K,p})\}_{p=1}^{\infty}$ is "optimal" in the sense that limp rank($A_{K,p}$) = K, so that it achieves the lowest dimension among all asymptotic approximations F̄p(z) of F̂p(z). We shall see in Section 5 that the regularity condition (9) holds for most typical correlation structures.
It is now possible to see that the number K̄ determines three different convergence regimes. If K̄ = ∞ then $\{Z_i\}_{i=1}^{\infty}$ has no finite dimensional correlation. If K̄ < ∞ and (9) holds then $\{Z_i\}_{i=1}^{\infty}$ has finite dimensional correlation. When K̄ = 0, (9) holds trivially since {B0,p}ii = {Rp}ii = 1, and therefore a necessary and sufficient condition for weak correlation, as in Theorem 1, is K̄ = 0. We summarize the results in the following corollary and in Figure 2.
Figure 2.
Different cases of asymptotic approximation to the ecdf. The corresponding examples from Section 5 below appear in parentheses.
Corollary 2
- (i) If K̄ = ∞ then $\{Z_i\}_{i=1}^{\infty}$ has no finite dimensional correlation.
- (ii) If K̄ < ∞ and (9) holds then $\{Z_i\}_{i=1}^{\infty}$ has finite dimensional correlation.
- (iii) If K = K̄ < ∞ and (9) holds then $\{Z_i\}_{i=1}^{\infty}$ has asymptotic dimension K and the decomposition $D_K$ achieves it in the sense that limp rank($A_{K,p}$) = K.
- (iv) K̄ = 0 if and only if $\{Z_i\}_{i=1}^{\infty}$ is weakly correlated.
The regularity condition (9) does not hold when the diagonal of $B_{\bar K,p}$ contains elements that are arbitrarily small. In this case, captured by the label "unknown" in Figure 2, $\|B_{\bar K,p}\|_1^{(p)}$ may be small but not necessarily $\|\mathrm{Cor}(B_{\bar K,p})\|_1^{(p)}$. Since our bound (4) is based on $\|\mathrm{Cor}(B_p)\|_1^{(p)}$ rather than on $\|B_p\|_1^{(p)}$, we cannot establish the asymptotic dimension in this case. However, we could not find an example of Rp where (9) is not satisfied. A hand-waving argument that (9) typically holds is:
$$\{B_{\bar K,p}\}_{ii} = 1 - \sum_{j=1}^{\bar K}\lambda_j\{u_j\}_i^2 \approx 1 - \frac{1}{p}\sum_{j=1}^{\bar K}\lambda_j,$$
where ≈ is typically true because $u_j$ is a normalized vector and therefore $\{u_j\}_i^2 \approx 1/p$.
4 Estimating the asymptotic representation from the data
4.1 Estimating the latent variables
We now discuss how to estimate the underlying latent variables $W_K$ when the sequence $\{Z_i\}_{i=1}^{\infty}$ has asymptotic dimension K = K̄. For ease of presentation we write everything in matrix/vector form; thus, we write $Z_p := (Z_1, \dots, Z_p)^T$ and $\xi_p := (\xi_1, \dots, \xi_p)^T$. In this notation, (6) becomes the linear regression equation $Z_p = L_p W_K + \xi_p$, and the least squares estimate of $W_K$ is $\hat W_K = (L_p^T L_p)^{-1} L_p^T Z_p$.
When the eigendecomposition (7) is used, the columns of $L_{K,p}$ are orthogonal and $L_{K,p}^T L_{K,p}$ is a diagonal matrix whose i-th diagonal element is λi. Thus, in this case,
$$\{\hat W_K\}_i = \frac{u_i^T Z_p}{\sqrt{\lambda_i}}, \qquad i = 1, \dots, K. \qquad (10)$$
Fan et al. (2012) and Fan and Han (2014) consider a related framework where some of the variables Zi may have a non-zero mean, and therefore use an estimate that minimizes an $L^1$ distance under sparsity assumptions.
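To make the estimator concrete, here is a minimal sketch of (10) based on the top eigenpairs of a known correlation matrix (the function name `estimate_W` and the example parameters are ours). Because W is identified only up to rotation/sign of the eigenvectors (Section 3.1), the comparison is on absolute values.

```python
import numpy as np

def estimate_W(Z, R, K):
    """Least-squares estimate (10): {W-hat}_i = u_i' Z / sqrt(lambda_i),
    with (lambda_i, u_i) the K largest eigenpairs of the correlation matrix R."""
    lam, U = np.linalg.eigh(R)
    order = np.argsort(lam)[::-1]           # largest eigenvalues first
    lam, U = lam[order[:K]], U[:, order[:K]]
    return U.T @ Z / np.sqrt(lam)

# Sanity check on the exchangeable model of Section 5.1 (rho = 0.8):
rng = np.random.default_rng(1)
p, rho = 2000, 0.8
R = rho * np.ones((p, p)) + (1 - rho) * np.eye(p)
W1 = rng.standard_normal()
Z = np.sqrt(rho) * W1 + np.sqrt(1 - rho) * rng.standard_normal(p)
# W is identified only up to the sign of the eigenvector, so compare magnitudes.
print(abs(W1), np.abs(estimate_W(Z, R, K=1)))
```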
Since $\{Z_i\}_{i=1}^{\infty}$ has strong correlation, the residual noise ξp does not converge to 0 as p goes to infinity. However, the following proposition states that $\hat W_K$ is still consistent in an $L^2$ sense.
Proposition 2
Suppose that $\{Z_i\}_{i=1}^{\infty}$ has asymptotic dimension K. Then
$$\mathbb{E}\big\|\hat W_K - W_K\big\|^2 \longrightarrow 0 \quad\text{as } p \to \infty.$$
4.2 Estimating the asymptotic representation
In real data problems $W_K$ and, thus, F̄p(·) are unknown. Proposition 2 suggests that if we plug $\hat W_K$ into F̄p(·) instead of $W_K$, then the $L^2$ distance between F̂p(·) and the plugged-in F̄p(·) converges to zero. Indeed, this can be proved under some additional regularity conditions.
Theorem 4
- (i) Define the plug-in ecdf estimate
$$\hat{\bar F}_p(z) := \frac{1}{p}\sum_{i=1}^{p}\Phi\Big(\frac{z - \ell_i^T \hat W_K}{\sigma_i}\Big), \qquad (11)$$
with $\sigma_i^2 = \{B_{K,p}\}_{ii}$, under the convention $\Phi\big((z - \ell_i^T \hat W_K)/0\big) := I\big(\ell_i^T \hat W_K \le z\big)$. We have that the $L^2$ distance between F̂p(z) and $\hat{\bar F}_p(z)$ is controlled by the residual correlation $\|\mathrm{Cor}(B_{K,p})\|_1^{(p)}$, by the estimation error of $\hat W_K$, and by the fraction of indexes outside Jp, where Jp is the set of indexes for which $\{B_{K,p}\}_{ii} > 0$ and |Jp| is the cardinality of Jp.
- (ii) Therefore, if $\{Z_i\}_{i=1}^{\infty}$ has asymptotic dimension K, (9) holds and also
$$\frac{|J_p|}{p} \longrightarrow 1, \qquad (12)$$
then $\sup_z \mathbb{E}\big[\hat F_p(z) - \hat{\bar F}_p(z)\big]^2 \to 0$.
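A minimal sketch of the plug-in approximation, combining (8), (10), and (11) (function and variable names are ours; the demo assumes two independent exchangeable blocks as in Figure 1(b), and sigma is clipped away from zero merely to avoid numerical division by zero, in keeping with condition (9)):

```python
import numpy as np
from scipy.stats import norm

def plug_in_cdf(z, Z, R, K):
    """Plug-in approximation in the spirit of (11): the mixture (5) with W
    replaced by the regression estimate (10)."""
    lam, U = np.linalg.eigh(R)
    order = np.argsort(lam)[::-1]
    lam, U = lam[order[:K]], U[:, order[:K]]
    L = U * np.sqrt(lam)                        # L_{K,p} of Equation (8)
    W_hat = U.T @ Z / np.sqrt(lam)              # Equation (10)
    mu = L @ W_hat                              # ell_i' W-hat for every i
    sigma = np.sqrt(np.clip(1.0 - (L ** 2).sum(axis=1), 1e-12, None))
    z = np.atleast_1d(np.asarray(z, dtype=float))
    return norm.cdf((z[None, :] - mu[:, None]) / sigma[:, None]).mean(axis=0)

# Demo on two independent exchangeable blocks:
rng = np.random.default_rng(3)
p, rho = 1000, 0.9
R = np.kron(np.eye(2), rho * np.ones((p // 2, p // 2)))
np.fill_diagonal(R, 1.0)
Z = np.linalg.cholesky(R) @ rng.standard_normal(p)
zs = np.linspace(-4, 4, 5)
print(np.round((Z[:, None] <= zs).mean(axis=0) - plug_in_cdf(zs, Z, R, 2), 3))
```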
4.3 Approximated eigendecomposition
In this section we study the case in which the largest K eigenvalues and eigenvectors of the correlation matrix are not known exactly and an approximation is used. We show that if the distance between the approximated eigenvalues and eigenvectors and the true ones goes to zero as p goes to infinity, then the result of the previous section still holds.
Let $\hat\lambda_1 \ge \dots \ge \hat\lambda_K$ be an approximation to the K biggest eigenvalues, and let $\hat u_1, \dots, \hat u_K$ be norm-1 vectors that are an approximation to the corresponding eigenvectors. Define the p × K matrix $\hat L_{K,p} := \big(\sqrt{\hat\lambda_1}\,\hat u_1, \dots, \sqrt{\hat\lambda_K}\,\hat u_K\big)$, and let $\hat\ell_i^T$ be the i-th row of the matrix. Define, analogously to (10),
$$\{\tilde W_K\}_i := \frac{\hat u_i^T Z_p}{\sqrt{\hat\lambda_i}}, \qquad i = 1, \dots, K,$$
and let $\hat\sigma_i^2 := 1 - \hat\ell_i^T\hat\ell_i$ and $\hat\mu_i := \hat\ell_i^T\tilde W_K$, where {x}i is the i-th element of the vector x. Finally, define $\tilde F_p(z) := \frac{1}{p}\sum_{i=1}^{p}\Phi\big((z - \hat\mu_i)/\hat\sigma_i\big)$. The following proposition bounds the $L^2$ distance between F̃p(z) and F̂p(z).
Proposition 3
Suppose that $\{B_{K,p}\}_{ii} \ge \varepsilon_B$, i = 1, …, p, for some εB > 0.
- (i) The $L^2$ distance between F̂p(z) and F̃p(z) exceeds the corresponding bound of Theorem 4 by at most a term driven by the approximation errors of the first K eigenvalues and eigenvectors, multiplied by C(εB), a constant that depends on εB.
- (ii) Therefore, if $\hat\lambda_i/\lambda_i \to 1$ and $\|\hat u_i - u_i\| \to 0$, i = 1, …, K, and the conditions of Theorem 4(ii) hold, then $\sup_z \mathbb{E}\big[\hat F_p(z) - \tilde F_p(z)\big]^2 \to 0$.
Proposition 3 and its proof imply that if there exist consistent estimates for the first K eigenvalues and eigenvectors, then the result of Theorem 4(ii) still holds; see also Fan and Han (2014). A systematic study of the case in which the correlation matrix is unknown is left to future research.
5 Examples
We now consider some examples of correlation structures. In Examples 1–3, K = K̄ is finite, conditions (9) and (12) are met, and therefore the decomposition $D_K$ is optimal and $\hat W_K$ is consistent; however, in Example 3, F̄p(z) does not have a limit. In Example 4, K = 1 but K̄ > 1, and in Example 5, K = ∞.
5.1 Exchangeable correlation: asymptotic dimension 1
Suppose that $\{Z_i\}_{i=1}^{\infty}$ is a sequence of exchangeable random variables with correlation ρ ≥ 0, i.e., for any i ≠ j, cor(Zi, Zj) = ρ, and
$$R_p = (1-\rho)\,I_p + \rho\,\mathbf{1}_p\mathbf{1}_p^T. \qquad (13)$$
The eigenvalues of Rp are $\lambda_1 = p\rho + 1 - \rho$ (with corresponding eigenvector equal to $\mathbf{1}_p/\sqrt{p}$) and $\lambda_2 = \dots = \lambda_p = 1 - \rho$; hence, K = K̄ = 1. According to (7), we have that $A_{1,p} = [\rho + (1-\rho)/p]\,\mathbf{1}_p\mathbf{1}_p^T$ and $B_{1,p} = (1-\rho)\,[I_p - (1/p)\mathbf{1}_p\mathbf{1}_p^T]$, where Ip is the p × p identity matrix and $\mathbf{1}_p\mathbf{1}_p^T$ is a p × p matrix with 1 in every entry. It is easy to check that for every i, limp{B1,p}ii = 1 − ρ > 0, and (9) holds. Also, |Jp| = p and (12) holds. Thus, in this case $\{Z_i\}_{i=1}^{\infty}$ has finite dimensional correlation with asymptotic dimension K = 1 and the decomposition $D_1 = \{(A_{1,p}, B_{1,p})\}$ is optimal.
To write the asymptotic representation F̄(z) in this case, it is easier to work with the asymptotically equivalent decomposition $A_p = \rho\,\mathbf{1}_p\mathbf{1}_p^T$, $B_p = (1-\rho)\,I_p$, which is in $\mathcal{D}$ with rank(Ap) = K = 1. For this decomposition $\ell_i = \sqrt{\rho}$ and $\sigma_i^2 = 1 - \rho$ do not depend on i or p, and
$$\bar F_p(z) = \bar F(z) = \Phi\Big(\frac{z - \sqrt{\rho}\,W_1}{\sqrt{1-\rho}}\Big). \qquad (14)$$
That is, F̂p(z), which has dimension p, is approximated by F̄p(z) with dimension 1. Moreover, F̄p(z) = F̄(z) above does not depend on p and is therefore the asymptotic representation: F̂p(z) converges to F̄(z) in the sense of Definition 2.
Given the sequence $Z_1, \dots, Z_p$, the regression estimate (10) of W1 is $\hat W_1 = \bar Z_p/\sqrt{\rho}$, where $\bar Z_p = p^{-1}\sum_{i=1}^{p} Z_i$. An illustration of the estimated asymptotic representation (11) is shown in Figure 1(a).
Notice that the distribution specified by (14) is a shifted and scaled normal distribution, corresponding to Efron's empirical null model (Efron, 2004, 2007a,b, 2008). Fitting an empirical null in this case would amount to estimating W1 and ρ. This example shows that Efron's empirical null model is justified under exchangeable correlation. However, as can be seen from the above theory and the following examples, it does not capture the effect of many other correlation structures.
When ρ ≤ 0 in (13), positive semi-definiteness of the correlation matrix requires ρ ≥ −1/(p − 1). Therefore, in the high-dimensional setting, too much negative correlation is not possible. Consequently, the histograms of the Z’s are typically narrower than N(0, 1) (see Figure 1).
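Before moving on, a short numerical illustration of this example (a sketch; variable names and parameter values are ours): the estimate $\hat W_1 = \bar Z_p/\sqrt{\rho}$ plugs into (14) and tracks the randomly shifted, narrowed ecdf.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
p, rho = 1000, 0.9
W1 = rng.standard_normal()                     # the single latent variable
Z = np.sqrt(rho) * W1 + np.sqrt(1 - rho) * rng.standard_normal(p)

W1_hat = Z.mean() / np.sqrt(rho)               # regression estimate of W_1
z = np.linspace(-3, 3, 7)
F_bar = norm.cdf((z - np.sqrt(rho) * W1_hat) / np.sqrt(1 - rho))   # (14)
F_hat = (Z[:, None] <= z).mean(axis=0)         # the ecdf (1)
print(np.round(F_hat - F_bar, 3))              # pointwise discrepancies are small
```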
5.2 Two exchangeable correlation blocks: asymptotic dimension 2
Suppose that Rp consists of two blocks of the form (13), the first one of size n1(p) × n1(p) with correlation ρ1 and the second of size n2(p) × n2(p) with correlation ρ2, where n1(p) + n2(p) = p and limp→∞ n1(p)/p = π. We assume that there is a constant correlation ρB between the two blocks (which can be zero or negative). Thus,
$$\mathrm{cor}(Z_i, Z_j) = \begin{cases} \rho_1, & i \ne j,\ i, j \in I_1(p), \\ \rho_2, & i \ne j,\ i, j \in I_2(p), \\ \rho_B, & \text{otherwise}, \end{cases}$$
where I1(p) and I2(p) denote the sets of indexes that belong to Blocks 1 and 2. If $0 \le \rho_B \le \min(\rho_1, \rho_2)$, then a sequence with the above covariance can be generated by the model
$$Z_i = \begin{cases} \sqrt{\rho_B}\,W_1 + \sqrt{\rho_1 - \rho_B}\,W_2 + \sqrt{1-\rho_1}\,\varepsilon_i, & i \in I_1(p), \\ \sqrt{\rho_B}\,W_1 + \sqrt{\rho_2 - \rho_B}\,W_3 + \sqrt{1-\rho_2}\,\varepsilon_i, & i \in I_2(p), \end{cases}$$
where $\varepsilon_1, \varepsilon_2, \dots$ and W1, W2, W3 are i.i.d. N(0, 1). Comparing with (6), we can write
$$\ell_i^T = \begin{cases} \big(\sqrt{\rho_B},\ \sqrt{\rho_1-\rho_B},\ 0\big), & i \in I_1(p), \\ \big(\sqrt{\rho_B},\ 0,\ \sqrt{\rho_2-\rho_B}\big), & i \in I_2(p). \end{cases}$$
If ρB = 0, then the asymptotic approximation F̄(z) depends only on W2, W3 and has dimension K = K̄ = 2. In this case Rp has two uncorrelated blocks of the type presented in Section 5.1; the generalization of the formulas there is straightforward (see also Figure 1(b)).
For ρB ≠ 0, however, the above F̄(z) depends on W1, W2, W3 and, thus, has dimension three; it is suboptimal. To obtain the optimal asymptotic representation in more generality, we work with the eigendecomposition. The matrix Rp has two eigenvalues
$$\lambda_{1,2} = \frac{d_1 + d_2 \pm \sqrt{(d_1 - d_2)^2 + 4\,n_1 n_2\,\rho_B^2}}{2},$$
where $d_j = n_j(p)\rho_j + 1 - \rho_j$, j = 1, 2, and the other p − 2 eigenvalues are either 1 − ρ1 or 1 − ρ2. For large p, positive semi-definiteness requires that $\rho_B^2 \le \rho_1\rho_2$. On the boundary, when $\rho_B^2 = \rho_1\rho_2$, $\lambda_2$ converges to a positive constant and K = K̄ = 1.
We now concentrate on the case $\rho_B^2 < \rho_1\rho_2$. Then $\lambda_1, \lambda_2$ are of order p and K = K̄ = 2. The corresponding eigenvectors are constant within each block,
$$u_j \propto \big(a_j\,\mathbf{1}_{n_1}^T,\ b_j\,\mathbf{1}_{n_2}^T\big)^T, \qquad j = 1, 2,$$
where $(a_j, b_j)$ is the eigenvector associated with $\lambda_j$ of the reduced 2 × 2 matrix
$$\begin{pmatrix} d_1 & n_2\rho_B \\ n_1\rho_B & d_2 \end{pmatrix}.$$
We have that $\ell_i = \big(\sqrt{\lambda_1}\{u_1\}_i,\ \sqrt{\lambda_2}\{u_2\}_i\big)^T$ and $\sigma_i^2 = 1 - \ell_i^T\ell_i$. Moreover, $\lim_p \sigma_i^2 = 1 - \rho_1$ for i ∈ I1(p) and $\lim_p \sigma_i^2 = 1 - \rho_2$ for i ∈ I2(p). Thus, (9) holds with K = K̄ = 2. Also, |Jp| = p and (12) holds. The corresponding asymptotic representation can be obtained by writing F̄p(z) in terms of the $\ell_i$ and $\sigma_i$ above and taking the limit as p → ∞.
The regression estimate of $(W_1, W_2)$ is given by (10): $\{\hat W_2\}_j = u_j^T Z_p / \sqrt{\lambda_j}$, j = 1, 2.
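As a numerical check of the eigenvalue formula above (a sketch under illustrative parameter values of our choosing), the two leading eigenvalues of the block matrix agree with $\lambda_{1,2}$ computed from d1, d2:

```python
import numpy as np

p = 1000
n1, n2 = 400, 600
rho1, rho2, rhoB = 0.6, 0.5, 0.3               # rhoB^2 < rho1 * rho2
R = np.empty((p, p))
R[:n1, :n1] = rho1
R[n1:, n1:] = rho2
R[:n1, n1:] = R[n1:, :n1] = rhoB
np.fill_diagonal(R, 1.0)

d1, d2 = n1 * rho1 + 1 - rho1, n2 * rho2 + 1 - rho2
disc = np.sqrt((d1 - d2) ** 2 + 4 * n1 * n2 * rhoB ** 2)
lam_formula = np.array([(d1 + d2 + disc) / 2, (d1 + d2 - disc) / 2])
lam_numeric = np.sort(np.linalg.eigvalsh(R))[::-1][:2]
print(lam_numeric, lam_formula)                # agree; both eigenvalues are O(p)
```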
5.3 Finite dimensional correlation with no asymptotic representation
In this example, correlation is finite dimensional with K = 2 and the $L^2$ distance between F̂p and F̄p goes to zero, but F̄p itself has no limit. Let 0 < ρ < 1, let $\{a_i\}_{i=1}^{\infty}$ be any sequence of 0's and 1's, and let W1, W2, $\varepsilon_1, \varepsilon_2, \dots$ be independent standard normal random variables. Define
$$Z_i := \sqrt{\rho}\,\big[a_i W_1 + (1 - a_i) W_2\big] + \sqrt{1-\rho}\,\varepsilon_i;$$
then, Zi ~ N(0, 1) and $\mathrm{cor}(Z_i, Z_j) = \rho\,[a_i a_j + (1-a_i)(1-a_j)]$ for i ≠ j. We can write Rp = Ap + Bp, where Ap = Lp(Lp)T,
$$\ell_i^T = \big(\sqrt{\rho}\,a_i,\ \sqrt{\rho}\,(1-a_i)\big), \qquad B_p = (1-\rho)\,I_p.$$
Then,
$$\bar F_p(z) = \pi_p\,\Phi\Big(\frac{z - \sqrt{\rho}\,W_1}{\sqrt{1-\rho}}\Big) + (1-\pi_p)\,\Phi\Big(\frac{z - \sqrt{\rho}\,W_2}{\sqrt{1-\rho}}\Big),$$
where $\pi_p = p^{-1}\sum_{i=1}^{p} a_i$, and it is clear that for certain choices of $\{a_i\}_{i=1}^{\infty}$ (for which πp does not converge), F̄p(z) has no limit for any z.
5.4 Independent exchangeable correlation blocks with no asymptotic dimension
In this example there are two independent blocks similar to the example in Section 5.2 (with ρB = 0) and with the same notation, but here the relative block sizes n1(p), n2(p) change infinitely often. Consequently, there are infinitely many p's with two "big" eigenvalues and infinitely many p's with only one "big" eigenvalue. Therefore, in this example, K = 1 while K̄ = 2. This means that correlation is finite dimensional and (9) holds, but there is no asymptotic representation and no well defined asymptotic dimension.
Consider the subsequence $m_1 = 2$, $m_k = m_{k-1}^2$ for k ≥ 2; i.e., $m_k = 2^{(2^{k-1})}$. Suppose that Zi belongs to the first block if i ∈ {mk−1 + 1, …, mk} for even k, and otherwise it belongs to the second block. That is, at p = mk for even k,
$$n_2(p) \le m_{k-1} = \sqrt{p},$$
and for odd k,
$$n_1(p) \le m_{k-1} = \sqrt{p}.$$
When p = mk for even k, $n_2(p)/p \le 1/\sqrt{p} \to 0$, and therefore the eigenvalue associated with the second block satisfies λ/p → 0 along this subsequence; for odd k the same holds with the roles of the two blocks reversed. The largest eigenvalue satisfies
$$\liminf_p \frac{\lambda_1}{p} = \liminf_p \frac{\rho\,(\max\{n_1(p), n_2(p)\} - 1) + 1}{p} \ge \frac{\rho}{2} > 0,$$
and therefore K1 = 1. For the second eigenvalue,
$$\frac{\lambda_2}{p} = \frac{\rho\,(\min\{n_1(p), n_2(p)\} - 1) + 1}{p};$$
since lim infp of both $n_1(p)/p$ and $n_2(p)/p$ is 0, then $\liminf_p \lambda_2/p = 0$ and K2 = 0. Thus, K = 1.
We next show that K̄ = 2. When p = mk one block is larger than the other block and when p = mk+1 the other block is larger. Therefore, for each k there exists pk between mk and mk+1 such that the two blocks are equal, i.e., n1(pk) = n2(pk) = pk/2. Thus,
$$\lambda_1(p_k) = \lambda_2(p_k) = \rho\Big(\frac{p_k}{2} - 1\Big) + 1.$$
Therefore, $\limsup_p \lambda_2/p \ge \rho/2 > 0$. Since the other eigenvalues are constant (1 − ρ), we have that K̄ = 2.
Moreover, in this case $\|\mathrm{Cor}(B_{1,p})\|_1^{(p)}$ does not converge to 0, and therefore the decomposition {(A1,p, B1,p)} is not in $\mathcal{D}$. This is because $B_{1,p_k}$ retains the second eigenvalue $\lambda_2(p_k)$, which is of order pk, while its diagonal entries are at most 1, so that
$$\|\mathrm{Cor}(B_{1,p_k})\|_2^{(p_k)} \ge \|B_{1,p_k}\|_2^{(p_k)} \ge \frac{\lambda_2(p_k)}{p_k} \longrightarrow \frac{\rho}{2} > 0.$$
This example can be easily generalized so that K̄ = 3, 4, … or K̄ = ∞. Given M < ∞, to obtain K̄ = M, for even k one can divide the observations $Z_{m_k+1}, \dots, Z_{m_{k+1}}$ into M − 1 independent blocks with correlation ρ within blocks. To obtain K̄ = ∞, for even k one can divide the observations $Z_{m_k+1}, \dots, Z_{m_{k+1}}$ into k blocks.
5.5 No finite dimensional approximation
In this example, K = K̄ = ∞, and therefore, according to Theorem 3, for every decomposition in $\mathcal{D}$, limp rank(Ap) = ∞. Suppose that at p = 2^n, Rp consists of n − 1 independent blocks of sizes 2^{n−1}, 2^{n−2}, …, 2 and an additional block of size 2, each block of the form (13), all with the same correlation parameter ρ. At p = 2^n, there are n − 1 "big" eigenvalues, ρ(p/2^i − 1) + 1 for i = 1, …, n − 1, one eigenvalue 1 + ρ (from the last block of size 2), and the rest are equal to 1 − ρ. For any fixed i ∈ ℕ, for large enough p the eigenvalue λi is equal to ρ(p/2^i − 1) + 1, and therefore for large enough p,
$$\frac{\lambda_i}{p} \ge \frac{\rho}{2^{i+1}} > 0.$$
Therefore, for any fixed i ∈ ℕ, Ki = 1 and K = K̄ = ∞.
6 Data example
As a practical application, we use the methods developed above to analyze a high-dimensional data set obtained from brain imaging. The data belongs to a study of cortical thickness of adults who had a diagnosis of attention deficit/hyperactivity disorder (ADHD) as children (Proal et al., 2011). The data set consists of cortical thickness measurements for about 80000 cortical voxels, obtained from magnetic resonance imaging (MRI) scans, as well as demographic and behavioral measurements, for each of n = 139 individuals. In this study, it had been noticed by Reiss et al. (2012) that z-scores corresponding to the voxelwise relationship between cortical thickness and ADHD diagnosis did not follow the theoretical standard normal distribution. Instead, the distribution of z-scores exhibited a substantial shift away from zero, indicating a possible widespread cortical thinning over the brain for individuals with ADHD. It is unclear, however, whether those results could have been caused by correlation between voxels rather than by a real relationship with clinical diagnosis.
In order to apply the methods developed in this paper, here we perform a slightly different analysis where the correlation structure can be taken as known. Specifically, we follow the approach of Owen (2005) of performing a regression of the observed trait for the subjects on the high-dimensional predictors, one dimension at a time. Owen (2005) and also Fan et al. (2012) used this approach in the context of genomic data. In our case, the trait is a global assessment of behavior, while the predictors are the cortical thickness measurements. For ease of computation, in the following analysis we use a random sample of p = 1000 voxels, which is enough to show the effect we want to highlight.
6.1 Regression analysis
Let Yj denote the global assessment of the j-th subject and let $X_j = (X_j^{(1)}, \dots, X_j^{(p)})$ be the cortical thickness in the p voxels of the j-th subject. For each voxel i ∈ {1, …, p}, consider the simple regression model
$$Y_j = \alpha^{(i)} + \beta^{(i)} X_j^{(i)} + \varepsilon_j, \qquad j = 1, \dots, n,$$
where the ε's are i.i.d. with mean 0 and variance σ². Let $\tilde Y_j$ and $\tilde X_j^{(i)}$ be the centered variables and define $s_{i\ell} := \sum_{j=1}^{n} \tilde X_j^{(i)}\tilde X_j^{(\ell)}$. The least squares estimate of β(i) is $\hat\beta^{(i)} = \sum_{j=1}^{n}\tilde X_j^{(i)}\tilde Y_j / s_{ii}$. According to the model, we can write $\hat\beta^{(i)} = \beta^{(i)} + \sum_{j=1}^{n}\tilde X_j^{(i)}\varepsilon_j / s_{ii}$, and so Cov(β̂(i), β̂(ℓ)) = σ²siℓ/[siisℓℓ].
Consider the p hypotheses $H_0^{(i)}: \beta^{(i)} = 0$ versus $H_1^{(i)}: \beta^{(i)} \ne 0$, i = 1, …, p. The z-score for the i-th test is $Z_i = \hat\beta^{(i)}\sqrt{s_{ii}}/\hat\sigma$. Thus, under the global null hypothesis, i.e. that all $H_0^{(i)}$'s are true, we have that Z1, …, Zp are each approximately N(0, 1) with correlation matrix given by the pairwise correlations $r_{i\ell} = s_{i\ell}/\sqrt{s_{ii}s_{\ell\ell}}$. Notice that the pairwise correlations between the z-scores are precisely the pairwise correlations between the cortical thickness measurements at each voxel. Because the regression is conditional on the voxelwise measurements, we take the pairwise correlations as fixed. To compute the z-scores, we use the variance estimate $\hat\sigma^2$ and ignore its negligible contribution to their variability.
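To make the construction of the z-scores concrete, the following sketch (the function name `regression_z_scores` and the simulated dimensions are ours, not from the study) computes the voxelwise z-scores and their correlation matrix from the centered predictors, mirroring the formulas above.

```python
import numpy as np

def regression_z_scores(X, y):
    """Voxelwise simple-regression z-scores: Z_i = beta-hat(i) * sqrt(s_ii) /
    sigma-hat(i), with s_il the inner products of the centered predictors.
    The z-scores inherit the predictor correlations r_il = s_il/sqrt(s_ii s_ll)."""
    Xc = X - X.mean(axis=0)                 # centered predictors, n x p
    yc = y - y.mean()
    s = Xc.T @ Xc                           # matrix of s_il values, p x p
    s_ii = np.diag(s)
    beta = Xc.T @ yc / s_ii                 # least-squares slopes beta-hat(i)
    resid = yc[:, None] - Xc * beta         # residuals of each simple regression
    sigma2 = (resid ** 2).sum(axis=0) / (len(y) - 2)
    Z = beta * np.sqrt(s_ii) / np.sqrt(sigma2)
    R = s / np.sqrt(np.outer(s_ii, s_ii))   # correlation matrix of the z-scores
    return Z, R

# Null simulation in the spirit of Section 6.2: correlated predictors, y
# independent of X, yet the z-scores share the predictors' correlation.
rng = np.random.default_rng(4)
n, p = 139, 200
X = rng.standard_normal((n, 1)) + 0.7 * rng.standard_normal((n, p))
y = rng.standard_normal(n)
Z, R = regression_z_scores(X, y)
print(Z.std(), np.abs(R).mean())
```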
6.2 The distribution of the z-scores
Consider the decomposition Rp = Ak,p + Bk,p, where Ak,p, Bk,p are of the form (7). The asymptotic dimension, if it exists, is unknown. In the next subsection we discuss the choice of k; for now we work with k = 2. After k is set, $\hat W_k$ is estimated via (10). Define $\sigma_i^2 := 1 - \ell_i^T\ell_i$, where ℓi is the i-th row of Lk,p defined in (8). The empirical distribution is approximated by $\hat{\bar F}_p$ given by (11), and the density is approximated by
$$\bar f(z) = \frac{1}{p}\sum_{i=1}^{p}\frac{1}{\sigma_i}\,\varphi\Big(\frac{z - \ell_i^T \hat W_k}{\sigma_i}\Big),$$
where φ is the standard normal density. Figure 3(a) plots the histogram of the Z's, f̄ and φ. The approximation f̄ captures the shape of the empirical distribution remarkably well, even though it is based on only two latent variables.
Figure 3.
Histogram of Z1, Z2, …, Z1000 of the real data (a) and the simulation results (b). The red line illustrates φ, the standard normal density, and the blue dashed line is the approximation we use, f̄.
Recall that the approximation is done under the complete null hypothesis, i.e. that there is no effect. The validity of the complete null hypothesis is supported by a false discovery rate (FDR) analysis, which shows no significant voxels after applying the Benjamini-Hochberg procedure (Benjamini and Hochberg, 1995) at an FDR level of 0.2. Nevertheless, the z-scores exhibit a strong shift toward positive values, indicating a possible positive correlation between cortical thickness and behavior. Our approximation indicates that the reason that many Z's are large may be the correlation between them and not a true effect. To illustrate this point we simulated Y1, …, Yn ~ N(0, 1) i.i.d. and independent of the X's and repeated the same procedure as before. One such simulated instance is given in Figure 3(b). Without correlation, we would expect the histogram of the Z's to follow the theoretical null density (red). However, the correlation creates the impression of a strong positive effect, not unlike the one seen in panel (a).
6.3 The number of latent variables
We now discuss the choice of k, the number of latent variables. Recall that the asymptotic dimension K, if it exists, is the optimal choice as discussed in Section 3.2 and is equal to the number of eigenvalues of order p. A scree plot of the 50 largest eigenvalues of Rp is shown in Figure 4(a). Cattell's graphical test indicates an elbow at k = 3, and after 10–20 eigenvalues the values are almost constant.
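A small helper for this kind of inspection (a sketch; the helper name and demo structure are ours) returns the leading eigenvalues together with the ratios λi/p that define a "big" eigenvalue:

```python
import numpy as np

def scree(R, m=50):
    """The m largest eigenvalues of R together with lambda_i / p, the ratio
    whose limiting behavior defines a 'big' eigenvalue (Section 3.2)."""
    lam = np.sort(np.linalg.eigvalsh(R))[::-1][:m]
    return lam, lam / R.shape[0]

# Demo: for two independent exchangeable blocks, exactly two eigenvalues
# stand out in the scree plot, suggesting k = 2.
p = 500
R = np.kron(np.eye(2), 0.5 * np.ones((p // 2, p // 2)))
np.fill_diagonal(R, 1.0)
lam, ratios = scree(R, m=5)
print(np.round(lam, 1), np.round(ratios, 3))
```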
Figure 4.
(a) The 50 largest eigenvalues of Rp ordered according to their size. (b) Histogram of Z1, Z2, …, Z1000 and the approximation f̄ for k = 2 (blue dashed), 10 (red dotted), and 100 (green solid).
Theorem 2 states that under the decomposition Rp = Ak,p + Bk,p, the $L^2$ distance between F̂p and F̄p is bounded by $C\,\|\mathrm{Cor}(B_{k,p})\|_1^{(p)}$. As k increases the bound decreases, and it vanishes when k = p. However, as k increases the dimension of the representation also increases, since the dimension is rank(Ak,p) (Proposition 1). Furthermore, for large k the empirical distribution and the approximation F̄ are very close, and the problem of overfitting arises. To illustrate this point, we plot the approximation for different choices of k in Figure 4(b). It can be seen that when k = 100 the approximation is much closer to the histogram, but such a high dimension is unnecessary to capture the global behavior of the histogram. The differences between k = 2 and k = 10 are rather small, suggesting that k = 2 suffices.
7 Summary and extensions
In this work we have studied the limit of the ecdf of marginally standard normal variables when strong correlation is present. As predicted by Efron (2007a), we have shown that the limit is indeed not standard normal. Specifically, we have shown that under a regime that we call finite dimensional correlation (and some regularity conditions), the limit is a finite mixture of scaled normals with random means, which reduces to Efron's empirical null model when the correlation structure is exchangeable. Moreover, we have shown that if the correlation is not finite dimensional then the limit can still be approximated by a mixture of normals, but an infinite number of components may be required.
The main technique for achieving these results has been a decomposition of the correlation matrix R into two matrices A and B, where A captures the strong correlation and B captures the weakly correlated residual noise. The form of the limiting distribution of the ecdf is determined by A, while the residual noise represented by B goes to zero asymptotically. The key to achieving the asymptotic representation of the ecdf with the smallest dimension is to choose B so that it contains the largest amount of variance while remaining weakly correlated.
For future work, we consider the following extensions. First, we assumed that all random variables have variance 1. If the variances are not 1, but are bounded above and below, then the random variables could be standardized and all our results follow. If they are not bounded, then still for each finite p one can standardize and obtain inequality (4) of Theorem 2.
In Theorem 2 we proved that the $L^2$ distance between F̂p and F̄p converges to zero if the residual correlation not captured by F̄p is weak. Could the same be said of the distance between the moments of F̂p and F̄p? Let $\hat m_{n,p} := \int z^n d\hat F_p(z)$ and $\bar m_{n,p} := \int z^n d\bar F_p(z)$ be the n-th moments of F̂p and F̄p. For the first moment and an appropriate decomposition Rp = Ap + Bp,
$$\hat m_{1,p} = \frac{1}{p}\sum_{i=1}^{p} Z_i, \qquad \bar m_{1,p} = \frac{1}{p}\sum_{i=1}^{p}\ell_i^T W_{k(p)}.$$
By the definition of ξp and equation (6), we have that $\hat m_{1,p} - \bar m_{1,p} = \frac{1}{p}\sum_{i=1}^{p}\xi_i$. Therefore,
$$\mathbb{E}\big(\hat m_{1,p} - \bar m_{1,p}\big)^2 = \frac{1}{p^2}\sum_{i,j=1}^{p}\{B_p\}_{ij} \le \|B_p\|_1^{(p)},$$
and the distance converges to zero if and only if $\frac{1}{p^2}\sum_{i,j=1}^{p}\{B_p\}_{ij} \to 0$. This condition is weaker than that of Theorem 2, and thus convergence of the ecdf implies convergence of the first moment. Proving convergence of higher moments is more difficult and is left for future work.
About the choice of number of components k in practice, the problem is not unlike that of determining the number of components in factor analysis or the effective dimension in principal components analysis. The connection with these techniques is worth exploring.
Finally, about the normality assumption: it is used in the proof of Theorem 2 by applying Mehler's expansion to the joint density. Similar expansions are available for other distributions such as chi-square and gamma (e.g., Koudou (1998); Schwartzman and Lin (2011)). Thus it may be possible to extend our results to those distributions as well.
Acknowledgments
The authors are grateful to Philip Reiss from the Department of Child and Adolescent Psychiatry, New York University School of Medicine, for providing the brain imaging data. This work was partially supported by NIH grant R01-CA157528.
Contributor Information
David Azriel, Email: davidazr@ie.technion.ac.il, Lecturer at the Faculty of Industrial Engineering and Management, Technion - Israel Institute of Technology, Haifa 32000, Israel, and Postdoctoral Research Associate in the Department of Statistics of the Wharton School of the University of Pennsylvania, 3730 Walnut Street, Philadelphia, PA 19104.
Armin Schwartzman, Email: armin.schwartzman@ncsu.edu, Associate Professor at the Department of Statistics, North Carolina State University, Raleigh, NC 27695.
References
- Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B. 1995;57:289–300.
- Dedecker J, Merlevède F. The empirical distribution function for dependent variables: asymptotic and nonasymptotic results in L^p. ESAIM: Probability and Statistics. 2007;11:102–114.
- Efron B. Large-scale simultaneous hypothesis testing: the choice of a null hypothesis. Journal of the American Statistical Association. 2004;99:96–104.
- Efron B. Correlation and large-scale simultaneous hypothesis testing. Journal of the American Statistical Association. 2007a;102:93–103.
- Efron B. Size, power and false discovery rates. The Annals of Statistics. 2007b;35:1351–1377.
- Efron B. Simultaneous inference: when should hypothesis testing problems be combined? The Annals of Applied Statistics. 2008;2:197–223.
- Efron B. Correlated z-values and the accuracy of large-scale statistical estimates. Journal of the American Statistical Association. 2010;105:1042–1055. doi:10.1198/jasa.2010.tm09129.
- Fan J, Han X. Estimation of false discovery proportion with unknown dependence. 2014. Submitted; http://arxiv.org/abs/1305.7007. doi:10.1111/rssb.12204.
- Fan J, Han X, Gu W. Estimating false discovery proportion under arbitrary covariance dependence. Journal of the American Statistical Association. 2012;107:1019–1035. doi:10.1080/01621459.2012.720478.
- Koudou AE. Lancaster bivariate probability distributions with Poisson, negative binomial and gamma margins. Test. 1998;7:95–110.
- Owen AB. Variance of the number of false discoveries. Journal of the Royal Statistical Society: Series B. 2005;67:411–426.
- Proal E, Reiss PT, Klein RG, Mannuzza S, Gotimer K, Ramos-Olazagasti MA, Lerch JP, He Y, Zijdenbos A, Kelly C, Milham MP, Castellanos FX. Brain gray matter deficits at 33-year follow-up in adults with attention-deficit/hyperactivity disorder established in childhood. Archives of General Psychiatry. 2011;68:1122–1134. doi:10.1001/archgenpsychiatry.2011.117.
- Reiss PT, Schwartzman A, Lu F, Huang L, Proal E. Paradoxical results of adaptive false discovery rate procedures in neuroimaging studies. NeuroImage. 2012;63:1833–1840. doi:10.1016/j.neuroimage.2012.07.040.
- Schwartzman A. Comment on "Correlated z-values and the accuracy of large-scale statistical estimates" by Bradley Efron. Journal of the American Statistical Association. 2010;105(491):1059–1063. doi:10.1198/jasa.2010.tm10237.
- Schwartzman A, Lin X. The effect of correlation in false discovery rate estimation. Biometrika. 2011;98:199–214. doi:10.1093/biomet/asq075.
- Wasserman L. All of Nonparametric Statistics: A Concise Course in Nonparametric Statistical Inference. New York: Springer; 2006.
- Wu WB. Oscillations of empirical distribution functions under dependence. IMS Lecture Notes–Monograph Series, High Dimensional Probability. 2006;51:53–61.