Abstract
Accurate estimation of the false discovery proportion (FDP) and false discovery rate (FDR) is a problem that has been eagerly waiting for a solution. Fan, Han and Gu have found one. They have achieved this by both clarifying the concept of the problem and providing a feasible algorithmic solution. In this comment, I discuss some of the central concepts involving FDP and its estimation via conditioning in contrast with the estimators by Efron (2007) and Friguet et al. (2009).
1 FDP vs. FDR
The concept of FDR estimation, originally introduced by Storey (2002), was first seen as a tool for FDR control (Storey et al., 2004; Genovese and Wasserman, 2004), but then gained interest on its own mainly thanks to Efron (2007). In Fan et al.’s notation, ignoring the estimation of the number of null hypotheses p0 (which is normally assumed to be close to the known number of tests p), the original FDR estimator by Storey (2002) is essentially E[V (t)]/R(t) = p0t/R(t). Because it has the random variable R(t) in the denominator, this estimator is biased and highly variable as an estimator of FDR(t) = E[V (t)/R(t)], particularly in the presence of correlation (Schwartzman and Lin, 2011).
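As a minimal sketch of the unconditional estimator p0t/R(t) described above, the following Python function counts two-sided rejections at level t and forms the ratio. The function name `storey_fdr_estimate`, the default p0 = p, and the convention of returning 0 when nothing is rejected are assumptions made for illustration, not part of the original proposal.

```python
from statistics import NormalDist


def storey_fdr_estimate(z_values, t, p0=None):
    """Unconditional estimate p0 * t / R(t) for two-sided z-tests."""
    p = len(z_values)
    if p0 is None:
        p0 = p  # assumption: p0 is close to the known number of tests p
    threshold = abs(NormalDist().inv_cdf(t / 2))  # |z_{t/2}|
    # R(t): number of test statistics exceeding the threshold in absolute value
    rejections = sum(1 for z in z_values if abs(z) > threshold)
    if rejections == 0:
        return 0.0  # convention assumed here when nothing is rejected
    return p0 * t / rejections
```

Because R(t) is random and sits in the denominator, repeated calls on correlated data would be expected to show the variability discussed above.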
As discussed by Fan et al., two recent estimators have attempted to incorporate the effect of correlation explicitly. Efron’s estimator is essentially E[V(t) | Â]/R(t), where Â is a random variable that estimates the amount of dispersion between the test statistics in a given experiment (Efron, 2007). Friguet et al.’s estimator is essentially E[V(t) | Ŵ]/R(t), where Ŵ is an estimate of the realized random factors in a factor analysis model in a given experiment (Friguet et al., 2009). Both estimators are presented as conditional FDR estimators that correct the original FDR estimator E[V(t)]/R(t) by a multiplicative factor. However, in their simulations, both sets of authors observe that in high correlation situations, their estimators actually correlate with the realized FDP.
Fan et al.’s first important conceptual realization is that the conditional estimators above may be better thought of as estimators of FDP(t) = V (t)/R(t) rather than FDR(t) = E[V (t)/R(t)], particularly in high correlation situations. This understanding is crucial because it allows Fan et al. to adapt the idea of the conditional FDR estimator and produce an estimator that is consistent for FDP under some set of asymptotic conditions. In contrast, Fan et al. suggest that estimating FDR may be more difficult, as they propose an asymptotic approximation to it but leave some of the details for future work.
2 The effect of conditioning
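The essence of Fan et al.’s estimator is in its proper use of conditioning. Without the formalities of the proofs given in the Appendix, the expressions in Theorem 1 and Proposition 2 can be explained briefly as follows. Recall Fan et al.’s principal factor approximation (PFA) model (Eq. (10))
$$Z_i = \mu_i + b_i^T W + K_i, \qquad i = 1, \ldots, p,$$
with $W \sim N(0, I_k)$ and $K_i \sim N(0, 1 - \|b_i\|^2)$, where the $K_i$’s are independent of the factors $W$. Now write
$$R(t) = \sum_{i=1}^{p} 1\{|Z_i| > |z_{t/2}|\}, \tag{1}$$
where $z_{t/2} = \Phi^{-1}(t/2)$ (note that $z_{t/2} < 0$ for $t < 0.5$). Then
$$\begin{aligned}
E[R(t) \mid W] &= \sum_{i=1}^{p} P\big(|Z_i| > |z_{t/2}| \,\big|\, W\big) \\
&= \sum_{i=1}^{p} \Big[ P\big(K_i < z_{t/2} - \mu_i - b_i^T W \,\big|\, W\big) + P\big(K_i > -z_{t/2} - \mu_i - b_i^T W \,\big|\, W\big) \Big] \\
&= \sum_{i=1}^{p} \Big[ \Phi\big(a_i (z_{t/2} - \mu_i - b_i^T W)\big) + \Phi\big(a_i (z_{t/2} + \mu_i + b_i^T W)\big) \Big],
\end{aligned} \tag{2}$$
where $a_i = (1 - \|b_i\|^2)^{-1/2}$. Similarly, write
$$V(t) = \sum_{i \in \{\text{true null}\}} 1\{|Z_i| > |z_{t/2}|\}. \tag{3}$$
Then
$$E[V(t) \mid W] = \sum_{i \in \{\text{true null}\}} \Big[ \Phi\big(a_i (z_{t/2} - b_i^T W)\big) + \Phi\big(a_i (z_{t/2} + b_i^T W)\big) \Big].$$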
In this notation, Proposition 2 establishes the asymptotic equivalence
which in Theorem 1 becomes
because p0/p → 1. The key idea of Fan et al.’s estimator is to combine these approximations to establish
$$\mathrm{FDP}(t) = \frac{V(t)}{R(t)} \approx \frac{E[V(t) \mid W]}{R(t)}. \tag{4}$$
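In essence, this is the same concept of conditioning as in Friguet et al.’s estimator. However, it is the details that make all the difference. Less importantly, Friguet et al. use a conservative estimate of the true null set to estimate the numerator E[V(t) | W]. Instead, Fan et al. use the asymptotic approximation
$$E[V(t) \mid W] \approx \sum_{i=1}^{p} \Big[ \Phi\big(a_i (z_{t/2} + \eta_i)\big) + \Phi\big(a_i (z_{t/2} - \eta_i)\big) \Big],$$
where $\eta_i = b_i^T W$ and $a_i = (1 - \|b_i\|^2)^{-1/2}$.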
More importantly, the conditioning factors W have a different interpretation. Friguet et al. assume that the factor model holds exactly, and estimate the unknown factors by the EM algorithm. For Fan et al., the factor model is only an approximation to the arbitrary covariance matrix of the data. Therefore, the strength of Fan et al.’s estimator is the ability to estimate the approximating factors, which they achieve via L1-regression.
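To make the conditioning idea concrete, here is a rough sketch of a conditional FDP estimate of the form E[V(t) | W]/R(t), with the numerator summed over all p tests as in Fan et al.’s approximation. It assumes a factor model Z_i = μ_i + b_i'W + K_i with unit-variance Z_i, so that K_i has variance 1 − ‖b_i‖² < 1, and it treats the factors as already known or estimated. The function name `conditional_fdp_estimate` and its arguments are hypothetical, and the L1-regression step used to estimate W is not shown.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def conditional_fdp_estimate(z_values, loadings, factors, t):
    """Sketch of E[V(t) | W] / R(t), summing the numerator over all tests.

    loadings: one loading vector b_i per test (assumed ||b_i||^2 < 1).
    factors:  the (assumed known or previously estimated) factor vector W.
    """
    z_half = _STD_NORMAL.inv_cdf(t / 2)  # z_{t/2}, negative for t < 0.5
    numerator = 0.0
    for b_i in loadings:
        eta_i = sum(b * w for b, w in zip(b_i, factors))  # b_i' W
        a_i = 1.0 / math.sqrt(1.0 - sum(b * b for b in b_i))
        # Conditional probability that a true null statistic is rejected
        numerator += _STD_NORMAL.cdf(a_i * (z_half + eta_i))
        numerator += _STD_NORMAL.cdf(a_i * (z_half - eta_i))
    rejections = sum(1 for z in z_values if abs(z) > abs(z_half))
    if rejections == 0:
        return 0.0  # convention assumed here when nothing is rejected
    return numerator / rejections
```

With all loadings equal to zero, the numerator reduces to p·t, so the sketch falls back to the unconditional estimator; a large realized factor inflates the numerator, which is how the estimate tracks the realized FDP.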
Fan et al. also compare their estimator with that of Efron (2007). It is worth exploring the relation between the two a bit further. Schwartzman (2010) shows that Efron’s conditional FDR estimator can be obtained from a Gram-Charlier A series expansion approximation to the empirical cdf of the true null random variables Zi for large p0:
where A1,A2, … are uncorrelated random variables with mean 0 and variance , ρii′ = cor(Zi,Zi′), and ϕ(j)(x) denotes the j-th derivative of the standard normal density function ϕ(x). Indeed, using this expansion in the definition of V (t) (3) gives
where A = (A1,A2, …), yielding
| (5) |
Efron’s estimator is obtained taking only the k = 1 term in the above expansion and defining A = A2.
In comparison with (4), it becomes clear how Efron and Fan et al. use conditioning to capture the dependence between the variables Zi. Efron’s approach is to first reduce the set of Zi’s to their empirical cdf, which is equivalent to their order statistics, and then use the derivatives of the normal density (related to the Hermite polynomials) as a set of basis functions, conditioning on their random coefficients. In contrast, Fan et al. use the eigenvectors of the covariance matrix as a set of basis functions for establishing the PFA to the vector of variables (Z1, …, Zp) and condition on their random coefficients. Because the two sets of basis functions are different, the two estimators can be made to coincide (Eq. (27) of Fan et al.) only when the amount of dependence in the data, measured here by the σi, is small.
Fan et al.’s approach has several advantages with respect to Efron’s. First, the set of order statistics is not a sufficient statistic for a set of correlated random variables. Thus, by approximating the data directly via PFA, Fan et al. gain estimation efficiency, as observed in Fan et al.’s Figure 2. Second, while Efron has suggested an estimator for A, more terms in the expansion would be needed to make (5) accurate, and the positivity constraints on the coefficients make estimation difficult (Jondeau and Rockinger, 2001). In contrast, Fan et al. have been able to provide a procedure based on L1-regression to consistently estimate the conditioning factors in the PFA.
Finally, the Gram-Charlier expansion, and thus Efron’s estimator, is intimately linked to the normal distribution. Its generalization is limited to a small set of non-normal distributions, such as χ2 (Schwartzman, 2010). Derivation (2) suggests that Fan et al.’s approach could also possibly be generalized to non-normal distributions. The normality is crucial in that, to go from row 2 to row 3 of (2), normality guarantees that the residual variables Ki are independent of the factors W. The expression in row 2 could potentially be used as the numerator in the FDP estimator (4) in non-normal situations provided that the dependence between the residual variables and the random factors can be specified.
3 FDP vs. FDR again
As mentioned at the beginning of this comment, Fan et al. are able to estimate FDP, but leave the precise estimation of FDR for future work. It is interesting and surprising that FDP may be easier to estimate than FDR, since FDP is a random variable, while FDR is a parameter of the distribution. However, this appears to be true in high correlation situations. To understand this, Figure 1 below shows the effect of correlation on the distribution of FDP in the Equal Correlation scenario, where the covariance matrix of the data has diagonal entries equal to 1 and off-diagonal entries equal to ρ. Here I use the same parameters as in Figure 2 of Fan et al.: p = 1000, p1 = 50, n = 100, t = 0.005 and βi = 1 for i ∈{false null}, based on 1000 simulations. These distributions could also be obtained from the expression for FDP given in Example 1 of Fan et al.
Figure 1.
Boxplots of simulated FDP values as a function of the equal correlation parameter ρ. The dashed line is the mean FDP, i.e. the FDR.
In Figure 1, as correlation increases, the distribution of FDP becomes more skewed, slowly splitting into two components. The Equal Correlation model is equivalent to a one-factor model; as correlation increases, the common factor W ~ N(0, 1) gets a larger weight and is easier to estimate. In the extreme case of perfect correlation, all the observed Zi’s are either equal to W or equal to W + µ. Then the FDP is either zero if |W| < |zt/2| or p0/p = 0.95 if |W| > |zt/2|. In this extreme situation the FDP can be estimated perfectly from the data. However, there is no information to estimate FDR.
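The behavior described above can be explored with a small simulation of the one-factor representation of the Equal Correlation model, Z_i = μ_i + √ρ W + √(1 − ρ) ε_i. This is only a sketch: the non-null mean `mu = 10.0` is a hypothetical value chosen so that non-null cases are almost always detected, and the function name `simulate_fdp` and the number of replications are likewise assumptions rather than the settings behind Figure 1.

```python
import random
from statistics import NormalDist, mean


def simulate_fdp(rho, p=1000, p1=50, mu=10.0, t=0.005, rng=None):
    """Draw one realized FDP under equal correlation rho (one-factor form)."""
    rng = rng or random.Random()
    threshold = abs(NormalDist().inv_cdf(t / 2))  # |z_{t/2}|
    w = rng.gauss(0.0, 1.0)  # common factor W ~ N(0, 1)
    load, resid = rho ** 0.5, (1.0 - rho) ** 0.5
    false_rejections = total_rejections = 0
    for i in range(p):
        signal = mu if i < p1 else 0.0  # first p1 tests are non-null
        z = signal + load * w + resid * rng.gauss(0.0, 1.0)
        if abs(z) > threshold:
            total_rejections += 1
            if i >= p1:
                false_rejections += 1
    return false_rejections / total_rejections if total_rejections else 0.0


rng = random.Random(0)
for rho in (0.0, 0.4, 0.8):
    fdps = [simulate_fdp(rho, rng=rng) for _ in range(200)]
    # Mean FDP approximates the FDR; the maximum hints at the skewness
    print(rho, round(mean(fdps), 3), round(max(fdps), 3))
```

Setting `rho=1.0` reproduces the extreme case in the text: every draw returns either 0 or p0/p = 0.95, depending on whether the common factor exceeds the threshold.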
Both Friguet et al. and Fan et al. take advantage of the presence of strong common factors in high correlation situations to estimate FDP. However, Friguet et al. note that when correlation is low, the common factors are harder to estimate and thus their conditional estimator does not correlate with FDP. In contrast, judging from Fan et al.’s results for the independent Cauchy scenario in their Figure 2, their estimator is adaptive to the data and appears to be able to estimate the FDP even under these more challenging circumstances.
In the context of Figure 2 of Fan et al., it is easy to explain why the unconditional FDR estimator p0t/R(t), shown in green, is almost perfectly negatively correlated with FDP, a phenomenon also observed by Efron (2007) and Friguet et al. (2009). Defining S(t) = R(t) − V(t) as in Table 1 of Fan et al., we may write FDP(t) = 1 − S(t)/R(t). Solving for R(t) and replacing in the unconditional estimator gives that
$$\frac{p_0 t}{R(t)} = \frac{p_0 t}{S(t)} \, [1 - \mathrm{FDP}(t)] \approx \frac{p_0 t}{p_1} \, [1 - \mathrm{FDP}(t)],$$
which is a linearly decreasing function of FDP. The approximation above reflects the fact that, in the simulation scenario of Figure 2, the signal is strong enough that the non-null cases are essentially always detected, so that S(t) ≈ p1. The intercept and negative slope p0t/p1 = 0.095 correspond precisely to the green graph observed there.
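The algebra behind this linear relationship is easy to check numerically. The sketch below, with hypothetical helper names `unconditional_estimate` and `linear_form`, confirms that p0t/R(t) equals (p0t/S(t))(1 − FDP(t)) for given counts, and that the slope takes the quoted value when S(t) = p1 under the parameters p0 = 950, p1 = 50 and t = 0.005.

```python
def unconditional_estimate(v, s, p0, t):
    """p0 * t / R(t), with R(t) = V(t) + S(t)."""
    return p0 * t / (v + s)


def linear_form(v, s, p0, t):
    """The same quantity written as (p0 * t / S(t)) * (1 - FDP(t))."""
    fdp = v / (v + s)
    return (p0 * t / s) * (1.0 - fdp)


p0, p1, t = 950, 50, 0.005
slope = p0 * t / p1  # intercept and (negative) slope when S(t) = p1
for v in (0, 10, 30, 200):
    # Both expressions agree for any number of false rejections v
    print(v, unconditional_estimate(v, p1, p0, t), linear_form(v, p1, p0, t))
```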
To close, I want to congratulate Fan et al. for an important contribution to the field of large-scale multiple testing. I look forward to seeing their method implemented in practice in the search for scientific discoveries.
Acknowledgments
This work was partially supported by NIH grant PO1-CA134294.
References
- Efron B. Correlation and Large-Scale Simultaneous Hypothesis Testing. J Amer Statist Assoc. 2007;102:93–103.
- Friguet C, Kloareg M, Causeur D. A Factor Model Approach to Multiple Testing Under Dependence. J Amer Statist Assoc. 2009;104:1406–1415.
- Genovese CR, Wasserman L. A stochastic process approach to false discovery control. Ann Statist. 2004;32:1035–1061.
- Jondeau E, Rockinger M. Gram-Charlier densities. Journal of Economic Dynamics & Control. 2001;25:1457–1483.
- Schwartzman A. Comment on “Correlated z-values and the accuracy of large-scale statistical estimates” by Bradley Efron. J Amer Statist Assoc. 2010;105:1059–1063. doi: 10.1198/jasa.2010.tm10237.
- Schwartzman A, Lin X. The effect of correlation in false discovery rate estimation. Biometrika. 2011;98:199–214. doi: 10.1093/biomet/asq075.
- Storey JD. A direct approach to false discovery rates. J R Statist Soc B. 2002;64:479–498.
- Storey JD, Taylor JE, Siegmund D. Strong control, conservative point estimation and simultaneous conservative consistency of false discovery rates: a unified approach. J R Statist Soc B. 2004;66:187–205.

