Comment on “Estimating False Discovery Proportion Under Arbitrary Covariance Dependence” by Fan et al

Armin Schwartzman

doi:10.1080/01621459.2012.712876

Author manuscript; available in PMC: 2014 Jun 25.

Published in final edited form as: J Am Stat Assoc. 2012 Oct 8;107(499):1039–1041. doi: 10.1080/01621459.2012.712876

PMCID: PMC4070739 NIHMSID: NIHMS405000 PMID: 24976660

Abstract

Accurate estimation of the false discovery proportion (FDP) and false discovery rate (FDR) is a problem that has been eagerly waiting for a solution. Fan, Han and Gu have found one. They have achieved this by both clarifying the concept of the problem and providing a feasible algorithmic solution. In this comment, I discuss some of the central concepts involving FDP and its estimation via conditioning in contrast with the estimators by Efron (2007) and Friguet et al. (2009).

1 FDP vs. FDR

The concept of FDR estimation, originally introduced by Storey (2002), was first seen as a tool for FDR control (Storey et al., 2004; Genovese and Wasserman, 2004), but then gained interest on its own mainly thanks to Efron (2007). In Fan et al.’s notation, ignoring the estimation of the number of null hypotheses p₀ (which is normally assumed to be close to the known number of tests p), the original FDR estimator by Storey (2002) is essentially E[V (t)]/R(t) = p₀t/R(t). Because it has the random variable R(t) in the denominator, this estimator is biased and highly variable as an estimator of FDR(t) = E[V (t)/R(t)], particularly in the presence of correlation (Schwartzman and Lin, 2011).
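As a minimal sketch of the unconditional estimator p₀t/R(t) described above, the following assumes two-sided z-tests (as in the definition of R(t) given later in this comment) and, by default, replaces p₀ with p. The function name and the convention of returning 0 when nothing is rejected are illustrative choices, not part of Storey's or Fan et al.'s presentation.

```python
from statistics import NormalDist


def storey_fdr_estimate(z_values, t, p0=None):
    """Unconditional estimate p0 * t / R(t) for two-sided z-tests at p-value threshold t."""
    cutoff = -NormalDist().inv_cdf(t / 2)  # |z_{t/2}|, the two-sided rejection cutoff
    num_rejected = sum(1 for z in z_values if abs(z) >= cutoff)  # R(t)
    if p0 is None:
        p0 = len(z_values)  # assumption: p0 is close to the number of tests p
    if num_rejected == 0:
        return 0.0  # convention chosen here for R(t) = 0
    return p0 * t / num_rejected
```

Because R(t) is random and sits in the denominator, repeated calls on correlated data sets can give widely varying values, which is the variability discussed above.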

As discussed by Fan et al., two recent estimators have attempted to incorporate the effect of correlation explicitly. Efron’s estimator is essentially E[V (t) |Â]/R(t), where Â is a random variable that estimates the amount of dispersion between the test statistics in a given experiment (Efron, 2007). Friguet et al.’s estimator is essentially E[V (t) | Ŵ]/R(t), where Ŵ is an estimate of the realized random factors in a factor analysis model in a given experiment (Friguet et al., 2009). Both estimators are presented as conditional FDR estimators that correct the original FDR estimator E[V (t)]/R(t) by a multiplicative factor. However, in their simulations, both sets of authors observe that in high correlation situations, their estimators actually correlate with the realized FDP.

Fan et al.’s first important conceptual realization is that the conditional estimators above may be better thought of as estimators of FDP(t) = V (t)/R(t) rather than FDR(t) = E[V (t)/R(t)], particularly in high correlation situations. This understanding is crucial because it allows Fan et al. to adapt the idea of the conditional FDR estimator and produce an estimator that is consistent for FDP under some set of asymptotic conditions. In contrast, Fan et al. suggest that estimating FDR may be more difficult, as they propose an asymptotic approximation to it but leave some of the details for future work.

2 The effect of conditioning

The essence of Fan et al.’s estimator is in its proper use of conditioning. Without the formalities of the proofs given in the Appendix, the expressions in Theorem 1 and Proposition 2 can be explained briefly as follows. Recall Fan et al.’s principal factor approximation (PFA) model (Eq. (10))

Z_{i} = μ_{i} + η_{i} + K_{i}, i = 1, \dots, p

with $η_{i} = b_{i}^{T} W \sim N (0, σ_{i}^{2})$ and $K_{i} \sim N (0, 1 - σ_{i}^{2})$, where the K_i’s are independent of the factors W. Now write

R (t) = \sum_{i = 1}^{p} [1 (Z_{i} \geq - z_{t / 2}) + 1 (Z_{i} \leq z_{t / 2})],

(1)

where z_t/2 = Φ⁻¹(t/2) (note that z_t/2 < 0 for t < 0.5). Then

E [R (t) | W] = \sum_{i = 1}^{p} [P (Z_{i} \geq - z_{t / 2} | W) + P (Z_{i} \leq z_{t / 2} | W)]
= \sum_{i = 1}^{p} [P (- K_{i} \leq z_{t / 2} + η_{i} + μ_{i} | W) + P (K_{i} \leq z_{t / 2} - η_{i} - μ_{i} | W)]
= \sum_{i = 1}^{p} {Φ [a_{i} (z_{t / 2} + η_{i} + μ_{i})] + Φ [a_{i} (z_{t / 2} - η_{i} - μ_{i})]},

(2)

where $a_{i} = {[var (K_{i})]}^{- 1 / 2} = {(1 - σ_{i}^{2})}^{- 1 / 2}$ . Similarly, write

V (t) = \sum_{i \in {true null}} [1 (Z_{i} \geq - z_{t / 2}) + 1 (Z_{i} \leq z_{t / 2})] .

(3)

Then

E [V (t) | W] = \sum_{i \in {true null}} Φ [a_{i} (z_{t / 2} + η_{i})] + Φ [a_{i} (z_{t / 2} - η_{i})] .
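The two conditional expectations above can be evaluated directly once the realized η_i, the variances σ_i² and the means μ_i are given. The sketch below assumes these quantities are known, which is not the case in practice, and all function and argument names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def conditional_rejection_prob(t, eta_i, sigma2_i, mu_i=0.0):
    """P(Z_i >= -z_{t/2} | W) + P(Z_i <= z_{t/2} | W) for Z_i = mu_i + eta_i + K_i."""
    z = _STD_NORMAL.inv_cdf(t / 2)       # z_{t/2}, negative for t < 1
    a_i = 1.0 / sqrt(1.0 - sigma2_i)     # a_i = [var(K_i)]^{-1/2}; needs sigma2_i < 1
    shift = eta_i + mu_i
    return _STD_NORMAL.cdf(a_i * (z + shift)) + _STD_NORMAL.cdf(a_i * (z - shift))


def expected_rejections_given_w(t, eta, sigma2, mu):
    """E[R(t) | W]: sum of the conditional rejection probabilities over all p tests."""
    return sum(conditional_rejection_prob(t, e, s, m)
               for e, s, m in zip(eta, sigma2, mu))


def expected_false_rejections_given_w(t, eta, sigma2, is_null):
    """E[V(t) | W]: same sum restricted to true nulls, where mu_i = 0."""
    return sum(conditional_rejection_prob(t, e, s)
               for e, s, null in zip(eta, sigma2, is_null) if null)
```

With η_i = 0 and σ_i² = 0 each term reduces to 2Φ(z_{t/2}) = t, the rejection probability of a single independent null test.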

In this notation, Proposition 2 establishes the asymptotic equivalence

\frac{R (t)}{p} ≐ \frac{E [R (t) | W]}{p}, \frac{V (t)}{p_{0}} ≐ \frac{E [V (t) | W]}{p_{0}},

which in Theorem 1 becomes

FDP (t) = \frac{V (t)}{R (t)} ≐ \frac{E [V (t) | W]}{E [R (t) | W]}

because p₀/p → 1. The key idea of Fan et al.’s estimator is to combine these approximations to establish

FDP (t) ≐ \frac{E [V (t) | W]}{R (t)} .

(4)

In essence, this is the same concept of conditioning as in Friguet et al.’s estimator. However, it is the details that make all the difference. Less importantly, Friguet et al. use a conservative estimate of the true null set to estimate the numerator E[V (t) |W]. Instead, Fan et al. use the asymptotic approximation

FDP (t) ≐ {FDP}_{A} (t) = \frac{E_{0} [R (t) | W]}{R (t)}

where

E [V (t) | W] ≐ E_{0} [R (t) | W] = \sum_{i = 1}^{p} Φ [a_{i} (z_{t / 2} + η_{i})] + Φ [a_{i} (z_{t / 2} - η_{i})] .

More importantly, the conditioning factors W have a different interpretation. Friguet et al. assume that the factor model holds exactly, and estimate the unknown factors by the EM algorithm. For Fan et al., the factor model is only an approximation to the arbitrary covariance matrix of the data. Therefore, the strength of Fan et al.’s estimator is the ability to estimate the approximating factors, which they achieve via L₁-regression.
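As a sanity check on the approximation E₀[R(t) | W], consider the special case of no dependence, an illustrative assumption rather than a case worked out in the text: σ_i = 0 for all i, so that η_i = 0 and a_i = 1. Since Φ(z_{t/2}) = t/2,

```latex
E_{0} [R (t) | W] = \sum_{i = 1}^{p} \{ Φ (z_{t / 2}) + Φ (z_{t / 2}) \} = \sum_{i = 1}^{p} 2 \cdot \frac{t}{2} = p t,
\qquad
{FDP}_{A} (t) = \frac{p t}{R (t)} .
```

In this case FDP_A(t) coincides with the unconditional estimator p₀t/R(t) of Section 1 with p₀ replaced by p, so the conditioning changes the estimate only when some σ_i > 0.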

Fan et al. also compare their estimator with that of Efron (2007). It is worth exploring the relation between the two a bit further. Schwartzman (2010) shows that Efron’s conditional FDR estimator can be obtained from a Gram-Charlier A series expansion approximation to the empirical cdf of the true null random variables Z_i for large p₀:

\frac{1}{p_{0}} \sum_{i \in {true null}} 1 (Z_{i} \leq x) ≐ Φ (x) + \sum_{j = 1}^{\infty} \frac{A_{j}}{\sqrt{j!}} {(- 1)}^{j} ϕ^{(j - 1)} (x),

where A₁,A₂, … are uncorrelated random variables with mean 0 and variance $var (A_{j}) = α_{j} = \sum_{i \neq i'} ρ_{ii'}^{j} / [p (p - 1)]$ , ρ_ii′ = cor(Z_i,Z_i′), and ϕ^(j)(x) denotes the j-th derivative of the standard normal density function ϕ(x). Indeed, using this expansion in the definition of V (t) (3) gives

E [V (t) | A] ≐ p_{0} [t + 2 \sum_{k = 1}^{\infty} \frac{A_{2 k}}{\sqrt{(2 k)!}} ϕ^{(2 k - 1)} (z_{t / 2})]

where A = (A₁,A₂, …), yielding

\frac{E [V (t) | A]}{R (t)} ≐ \frac{p_{0} t}{R (t)} [1 + \frac{2}{t} \sum_{k = 1}^{\infty} \frac{A_{2 k}}{\sqrt{(2 k)!}} ϕ^{(2 k - 1)} (z_{t / 2})] .

(5)

Efron’s estimator is obtained by taking only the k = 1 term in the above expansion and defining A = A₂.
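To make the k = 1 truncation concrete, the following worked step uses only the identity ϕ^{(1)}(x) = −xϕ(x) and the fact that z_{t/2} < 0. The final form is a rearrangement of (5) under these two facts, not a quotation of Efron's own notation.

```latex
\frac{E [V (t) | A]}{R (t)} ≐ \frac{p_{0} t}{R (t)} [1 + \frac{2}{t} \frac{A}{\sqrt{2}} ϕ^{(1)} (z_{t / 2})]
= \frac{p_{0} t}{R (t)} [1 + \frac{\sqrt{2} A}{t} |z_{t / 2}| ϕ (z_{t / 2})] .
```

A positive A therefore inflates the unconditional estimator p₀t/R(t) and a negative A deflates it, which is the multiplicative correction described in Section 1.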

In comparison with (4), it becomes clear how Efron and Fan et al. use conditioning to capture the dependence between the variables Z_i. Efron’s approach is to first reduce the set of Z_i’s to their empirical cdf, which is equivalent to their order statistics, and then use the derivatives of the normal density (related to the Hermite polynomials) as a set of basis functions, conditioning on their random coefficients. In contrast, Fan et al. use the eigenvectors of the covariance matrix as a set of basis functions for establishing the PFA to the vector of variables (Z₁, …, Z_p) and condition on their random coefficients. Because the two sets of basis functions are different, the two estimators can be made to coincide (Eq. (27) of Fan et al.) only when the amount of dependence in the data, measured here by the σ_i, is small.

Fan et al.’s approach has several advantages with respect to Efron’s. First, the set of order statistics is not a sufficient statistic for a set of correlated random variables. Thus, by approximating the data directly via PFA, Fan et al. gain estimation efficiency, as observed in Fan et al.’s Figure 2. Second, while Efron has suggested an estimator for A, more terms in the expansion would be needed to make (5) accurate, and the positivity constraints on the coefficients make estimation difficult (Jondeau and Rockinger, 2001). In contrast, Fan et al. have been able to provide a procedure based on L₁-regression to consistently estimate the conditioning factors in the PFA.

Finally, the Gram-Charlier expansion, and thus Efron’s estimator, is intimately linked to the normal distribution. Its generalization is limited to a small set of non-normal distributions, such as χ² (Schwartzman, 2010). Derivation (2) suggests that Fan et al.’s approach could possibly be generalized to non-normal distributions as well. Normality is crucial in going from row 2 to row 3 of (2) because it guarantees that the residual variables K_i are independent of the factors W. The expression in row 2 could potentially be used as the numerator in the FDP estimator (4) in non-normal situations, provided that the dependence between the residual variables and the random factors can be specified.

3 FDP vs. FDR again

As mentioned at the beginning of this comment, Fan et al. are able to estimate FDP, but leave the precise estimation of FDR for future work. It is interesting and surprising that FDP may be easier to estimate than FDR, since FDP is a random variable, while FDR is a parameter of the distribution. However, this appears to be true in high correlation situations. To understand this, Figure 1 below shows the effect of correlation on the distribution of FDP in the Equal Correlation scenario, where the covariance matrix of the data has diagonal entries equal to 1 and off-diagonal entries equal to ρ. Here I use the same parameters as in Figure 2 of Fan et al.: p = 1000, p₁ = 50, n = 100, t = 0.005 and β_i = 1 for i ∈{false null}, based on 1000 simulations. These distributions could also be obtained from the expression for FDP given in Example 1 of Fan et al.

Figure 1. Boxplots of simulated FDP values as a function of the equal correlation parameter ρ. The dashed line is the mean FDP, i.e. the FDR.

In Figure 1, as correlation increases, the distribution of FDP becomes more skewed, slowly splitting into two components. The Equal Correlation model is equivalent to a one-factor model; as correlation increases, the common factor W ~ N(0, 1) gets a larger weight and is easier to estimate. In the extreme case of perfect correlation, all the observed Z_i’s are either equal to W or equal to W + μ. Then the FDP is either zero if |W| < |z_t/2| or p₀/p = 0.95 if |W| > |z_t/2|. In this extreme situation the FDP can be estimated perfectly from the data. However, there is no information to estimate FDR.
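The behavior described above can be reproduced with a small simulation. This is a sketch, not the code used for Figure 1: it assumes the one-factor representation Z_i = μ_i + √ρ W + √(1 − ρ) ε_i with independent standard normal W and ε_i, takes μ = √n β = 10 for the non-null cases, and sets FDP to 0 when nothing is rejected. The function name, the default number of simulations and the seed are arbitrary.

```python
import random
from math import sqrt
from statistics import NormalDist, mean


def simulate_fdp(rho, p=1000, p1=50, mu=10.0, t=0.005, n_sims=200, seed=0):
    """Simulate FDP(t) = V(t)/R(t) under equal correlation rho (0 <= rho <= 1)."""
    rng = random.Random(seed)
    cutoff = -NormalDist().inv_cdf(t / 2)          # |z_{t/2}|
    w_weight, k_weight = sqrt(rho), sqrt(1.0 - rho)
    fdps = []
    for _ in range(n_sims):
        w = rng.gauss(0.0, 1.0)                    # common factor W
        v = r = 0
        for i in range(p):
            z = w_weight * w + k_weight * rng.gauss(0.0, 1.0)
            is_null = i >= p1                      # first p1 tests are non-null
            if not is_null:
                z += mu
            if abs(z) >= cutoff:
                r += 1
                if is_null:
                    v += 1
        fdps.append(v / r if r > 0 else 0.0)       # FDP taken as 0 when R(t) = 0
    return fdps


if __name__ == "__main__":
    for rho in (0.0, 0.5, 1.0):
        fdps = simulate_fdp(rho, n_sims=100)
        print(rho, round(mean(fdps), 3), round(max(fdps), 3))
```

At ρ = 1 every simulated FDP is either 0 or 0.95 (barring the negligible event that W + μ falls below the cutoff), matching the two-point distribution described above.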

Both Friguet et al. and Fan et al. take advantage of the presence of strong common factors in high correlation situations to estimate FDP. However, Friguet et al. note that when correlation is low, the common factors are harder to estimate and thus their conditional estimator does not correlate with FDP. In contrast, judging from Fan et al.’s results for the independent Cauchy scenario in their Figure 2, their estimator is adaptive to the data and appears to be able to estimate the FDP even under these more challenging circumstances.

In the context of Figure 2 of Fan et al., it is easy to explain why the unconditional FDR estimator p₀t/R(t), shown in green, is almost perfectly negatively correlated with FDP, a phenomenon also observed by Efron (2007) and Friguet et al. (2009). Defining S(t) = R(t) − V (t) as in Table 1 of Fan et al., we may write FDP(t) = 1 − S(t)/R(t). Solving for R(t) and replacing in the unconditional estimator gives that

\frac{p_{0} t}{R (t)} = \frac{p_{0} t}{S (t)} [1 - FDP (t)] \approx \frac{p_{0} t}{p_{1}} [1 - FDP (t)],

which is a linearly decreasing function of FDP. The approximation above reflects the fact that, in the simulation scenario of Figure 2, $μ = \sqrt{n} β_{1} = 10$ is strong enough that the non-null cases are essentially always detected, so that S(t) ≈ p₁. The intercept and negative slope p₀t/p₁ = 0.095 correspond precisely to the green graph observed there.
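The arithmetic behind the slope, and the exactness of the identity when S(t) = p₁, can be checked numerically. The sketch below uses the simulation parameters quoted in Section 3; the function names are illustrative only.

```python
p, p1, t = 1000, 50, 0.005
p0 = p - p1                      # number of true nulls
slope = p0 * t / p1              # intercept and (negative) slope of the green graph


def unconditional_estimate(v, s=p1):
    """p0 * t / R(t) with R(t) = V(t) + S(t)."""
    return p0 * t / (v + s)


def linear_in_fdp(v, s=p1):
    """(p0 * t / S(t)) * [1 - FDP(t)] with FDP(t) = V(t) / R(t)."""
    fdp = v / (v + s)
    return (p0 * t / s) * (1.0 - fdp)


if __name__ == "__main__":
    print(slope)
    for v in (0, 5, 100, 950):
        print(v, unconditional_estimate(v), linear_in_fdp(v))
```

The two functions agree for every value of V(t), so the only approximation in the displayed relation is S(t) ≈ p₁.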

To close, I want to congratulate Fan et al. for an important contribution to the field of large-scale multiple testing. I look forward to seeing their method implemented in practice in the search for scientific discoveries.

Acknowledgments

This work was partially supported by NIH grant PO1-CA134294.

References

Efron B. Correlation and Large-Scale Simultaneous Hypothesis Testing. J Amer Statist Assoc. 2007;102:93–103.
Friguet C, Kloareg M, Causeur D. A Factor Model Approach to Multiple Testing Under Dependence. J Amer Statist Assoc. 2009;104:1406–1415.
Genovese CR, Wasserman L. A stochastic process approach to false discovery control. Ann Statist. 2004;32:1035–1061.
Jondeau E, Rockinger M. Gram-Charlier densities. Journal of Economic Dynamics & Control. 2001;25:1457–1483.
Schwartzman A. Comment on “Correlated z-values and the accuracy of large-scale statistical estimates” by Bradley Efron. J Amer Statist Assoc. 2010;105:1059–1063. doi: 10.1198/jasa.2010.tm10237.
Schwartzman A, Lin X. The effect of correlation in false discovery rate estimation. Biometrika. 2011;98:199–214. doi: 10.1093/biomet/asq075.
Storey JD. A direct approach to false discovery rates. J R Statist Soc B. 2002;64:479–498.
Storey JD, Taylor JE, Siegmund D. Strong control, conservative point estimation and simultaneous conservative consistency of false discovery rates: a unified approach. J R Statist Soc B. 2004;66:187–205.
