Biostatistics (Oxford, England). 2014 Jun 23;16(1):189–204. doi: 10.1093/biostatistics/kxu029

Bias and variance reduction in estimating the proportion of true-null hypotheses

Yebin Cheng 1, Dexiang Gao 2, Tiejun Tong 3,*
PMCID: PMC4263223  PMID: 24963010

Abstract

When testing a large number of hypotheses, estimating the proportion of true nulls, denoted by $\pi_0$, becomes increasingly important. This quantity has many applications in practice. For instance, a reliable estimate of $\pi_0$ can eliminate the conservative bias of the Benjamini–Hochberg procedure in controlling the false discovery rate. Most methods in the literature for estimating $\pi_0$ are known to be conservative. Recently, some attempts have been made to reduce this estimation bias. Nevertheless, the resulting estimators either overcorrect the bias or suffer from an unacceptably large estimation variance. In this paper, we propose a new method for estimating $\pi_0$ that aims to reduce the bias and variance of the estimation simultaneously. To achieve this, we first utilize the probability density functions of false-null $p$-values and then propose a novel algorithm to estimate $\pi_0$. The statistical behavior of the proposed estimator is also investigated. Finally, we carry out extensive simulation studies and several real data analyses to evaluate the performance of the proposed estimator. Both simulated and real data demonstrate that the proposed method may improve substantially on existing methods.

Keywords: Effect size, False-null p-value, Microarray data, Multiple testing, Probability density function, Upper tail probability

1. Introduction

When testing a large number of hypotheses, estimating the proportion of true nulls, denoted by $\pi_0$, becomes increasingly important. Studies that use high-throughput techniques and microarray experiments to identify genes expressed differentially across groups often involve testing hundreds or thousands of hypotheses simultaneously. In addition to identifying differentially expressed genes, we may also want to know the proportion of genes that are truly differentially expressed, i.e., the value of $1-\pi_0$. This quantity has many applications in practice. For instance, a reliable estimate of $\pi_0$ can eliminate the conservative bias of the Benjamini–Hochberg procedure (Benjamini and Hochberg, 1995) in controlling the false discovery rate, and therefore increase the average power (Storey, 2002; Nguyen, 2004). A good estimate of $\pi_0$ can also sharpen the Bonferroni-type family-wise error controlling procedures to improve the power and reduce the false negative rate (Hochberg and Benjamini, 1990; Finner and Gontscharuk, 2009). Besides these broad applications, $\pi_0$ is also a quantity of interest in its own right (Langaas and others, 2005).

The estimation of $\pi_0$ was pioneered in Schweder and Spjøtvoll (1982), where a graphical method was applied to evaluate a large number of tests on a plot of cumulative $p$-values using the observed significance probabilities. They claimed that the points on the graph corresponding to the true-null hypotheses should fall on a straight line and that this line can then be used to estimate $\pi_0$. Their method was further studied in Storey (2002) and Storey and others (2004). Since then, a rich body of literature on the estimation of $\pi_0$ has developed. For instance, Langaas and others (2005) proposed a new method for estimating $\pi_0$ based on a nonparametric maximum likelihood estimation of the $p$-value density, subject to the restriction that the density is decreasing or convex decreasing. In general, their estimator based on a convex decreasing density outperforms other estimators with respect to the mean squared error (MSE). Other significant works in estimating $\pi_0$ include: the smoothing spline method in Storey and Tibshirani (2003), the moment-based methods in Dalmasso and others (2005) and Lai (2007), the histogram methods in Nettleton and others (2006) and Tong and others (2013), the nonparametric method in Wu and others (2006), the average estimate method in Jiang and Doerge (2008), and the sliding linear model method in Wang and others (2011), among many others.

Assume that the test statistics are independent of each other. A straightforward model for the $p$-values is a two-component mixture model,

$$f(p) = \pi_0 + (1-\pi_0)\,h(p), \qquad 0 \le p \le 1, \tag{1.1}$$

where the $p$-values under the null hypotheses follow the uniform distribution on $[0,1]$, and the $p$-values under the false-null hypotheses follow the distribution with density $h(p)$. Due to the unidentifiability problem (Genovese and Wasserman, 2002; 2004), most of the existing methods mentioned above target an identifiable upper bound of $\pi_0$, that is $\pi_0 + (1-\pi_0)\inf_p h(p)$. As a consequence, those estimators always overestimate $\pi_0$ and we refer to them as conservative estimators. To obtain identifiability in model (1.1), we need to make some assumptions on the density $h(p)$. For instance, if $\inf_p h(p) = 0$ or if $h(p)$ has a parametric form, the model will be identifiable and so we can estimate $\pi_0$ directly rather than the upper bound. Recently, some attempts have been made to estimate $\pi_0$ itself, with a main focus on reducing the estimation bias (Pawitan and others, 2005; McLachlan and others, 2006; Ruppert and others, 2007; Qu and others, 2012). In particular, by assuming that absolute values of the noncentrality parameters (NCPs) from the false-null hypotheses follow a distribution with a smooth density, Ruppert and others (2007) developed a new methodology that combines a parametric model for the $p$-values given the NCPs and a nonparametric spline model for the NCPs. The quantity $\pi_0$ and the coefficients in the spline model were then estimated by penalized least squares. In simulation studies, the authors demonstrated that their proposed estimator has the ability to reduce the bias in estimating $\pi_0$. More recently, their method was improved by Qu and others (2012), where the authors applied some new nonparametric and semiparametric methods to the estimation of the NCP distribution. We refer to these estimators as bias-reduced estimators.

Though the existing bias-reduced estimators have significant merit in reducing the estimation bias, we note that the variability of these estimators is usually considerably enlarged. As reported in Tables 1 and 2 in Qu and others (2012), the interquartile ranges of their estimators are often more than twice as large as those of the competitors. In addition, we observe that their estimators only perform well when the NCPs are concentrated around 0, i.e., when a majority of false nulls have very weak signals. In the situation of microarray data analysis, their estimators work only if a large proportion of false-null genes are weakly differentially expressed. Otherwise, their estimators tend to overcorrect the bias (see Section 5 for more detail). In this paper, we propose a new method for estimating $\pi_0$ that aims to reduce the bias and variance of the estimation simultaneously. To achieve this, we first utilize the probability density functions of false-null $p$-values and then propose a novel algorithm to estimate $\pi_0$. Simulation studies will show that the proposed method may improve substantially on existing methods.

The rest of the paper is organized as follows. In Section 2, we introduce a bias-corrected method for estimating $\pi_0$ that aims to reduce the bias and variance simultaneously. In Section 3, we derive the probability density functions of false-null $p$-values for testing two-sided hypotheses with unknown variances. In Section 4, we propose an algorithm for estimating $\pi_0$ and investigate the behavior of the proposed estimator. We then evaluate the performance of the proposed estimator via extensive simulation studies in Section 5 and several microarray data sets in Section 6. Finally, we conclude the paper in Section 7 and provide the technical proofs in Appendices of supplementary material available at Biostatistics online.

2. Main results

Let $p_1, \ldots, p_m$ be the $p$-values corresponding to each of $m$ total hypothesis tests. Let $\mathcal{M}_0$ (size $m_0$) denote the set of true-null hypotheses and $\mathcal{M}_1$ (size $m_1$) denote the set of false-null hypotheses. Then $m = m_0 + m_1$ and $\pi_0 = m_0/m$. To avoid confusion, we define the “true-null $p$-values” as $p$-values from hypothesis tests in which the null was correct, and the “false-null $p$-values” as $p$-values from hypothesis tests in which the null was false. For a given $\lambda \in (0,1)$, define $W(\lambda)$ to be the total number of $p$-values on $(\lambda, 1]$, $W_0(\lambda)$ to be the total number of true-null $p$-values on $(\lambda, 1]$, and $W_1(\lambda)$ to be the total number of false-null $p$-values on $(\lambda, 1]$. By definition, we have $W(\lambda) = W_0(\lambda) + W_1(\lambda)$. In addition, we have $E[W_0(\lambda)] = m\pi_0(1-\lambda)$ since the true-null $p$-values are uniformly distributed in $[0,1]$. This suggests we estimate $\pi_0$ by

$$\frac{W(\lambda) - W_1(\lambda)}{m(1-\lambda)}. \tag{2.1}$$

However, (2.1) is not a valid estimator as $W_1(\lambda)$ is unobservable in practice.

Note that the false-null $p$-values are more likely to be small. Thus for a reasonably large $\lambda$, the majority of $p$-values on $(\lambda, 1]$ should correspond to true-null $p$-values and so $W(\lambda) \approx W_0(\lambda)$. Based on this, Storey (2002) proposed to estimate $\pi_0$ by

$$\hat\pi_0^S(\lambda) = \frac{W(\lambda)}{m(1-\lambda)}, \tag{2.2}$$

where $\lambda$ is the tuning parameter. We refer to $\hat\pi_0^S(\lambda)$ as the Storey estimator. For any $\lambda \in (0,1)$, it is easy to verify that

$$E[\hat\pi_0^S(\lambda)] = \pi_0 + \frac{E[W_1(\lambda)]}{m(1-\lambda)} \ge \pi_0.$$

This shows that $\hat\pi_0^S(\lambda)$ overestimates $\pi_0$ on average, and therefore is a conservative estimator of $\pi_0$. The conservativeness of $\hat\pi_0^S(\lambda)$ can be rather significant when the sample size and/or the effect sizes of false-null hypotheses are small.
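As a concrete illustration of the Storey-type estimator described above, the following minimal Python sketch counts the $p$-values above a threshold and rescales the count. The function name `storey_pi0`, the toy $p$-values, and the truncation at 1 are illustrative assumptions, not part of the original paper.

```python
def storey_pi0(pvalues, lam):
    """Storey-type estimate: (# p-values > lam) / (m * (1 - lam)), capped at 1.

    `storey_pi0` is a hypothetical helper name used only for this sketch.
    """
    if not 0.0 <= lam < 1.0:
        raise ValueError("lam must lie in [0, 1)")
    m = len(pvalues)
    # W(lam): number of p-values falling in the interval (lam, 1]
    w = sum(1 for p in pvalues if p > lam)
    return min(1.0, w / (m * (1.0 - lam)))


# Toy data (made up for illustration): a few small p-values plus a
# roughly uniform remainder.
toy_pvalues = [0.001, 0.004, 0.02, 0.03, 0.2, 0.35, 0.5, 0.65, 0.8, 0.95]
storey_estimate = storey_pi0(toy_pvalues, lam=0.5)
```

Any false-null $p$-value that lands above the threshold is counted as if it were a true null, which is the source of the conservative bias discussed in the text.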

2.1. New methodology

We now propose a bias-corrected method for estimating $\pi_0$. For each false-null $p$-value with effect size $\delta_i$, let $f_{\delta_i}(p)$ be the probability density function and $Q_{\delta_i}(\lambda) = \int_\lambda^1 f_{\delta_i}(p)\,dp$ be the upper tail probability on $(\lambda, 1]$. By the definition of $W_1(\lambda)$, we have

$$E[W_1(\lambda)] = \sum_{i \in \mathcal{M}_1} Q_{\delta_i}(\lambda) = m(1-\pi_0)\,\bar Q(\lambda), \tag{2.3}$$

where $\bar Q(\lambda) = m_1^{-1}\sum_{i \in \mathcal{M}_1} Q_{\delta_i}(\lambda)$ is the average upper tail probability for all false-null $p$-values. By (2.3) and the fact that $E[W_0(\lambda)] = m\pi_0(1-\lambda)$, we have $E[W(\lambda)] = m\pi_0(1-\lambda) + m(1-\pi_0)\bar Q(\lambda)$. This leads to

$$\pi_0 = \frac{E[W(\lambda)]/m - \bar Q(\lambda)}{1 - \lambda - \bar Q(\lambda)}. \tag{2.4}$$

Let $\hat Q(\lambda)$ be an estimate of $\bar Q(\lambda)$. By (2.4), we propose a new estimator of $\pi_0$ as

$$\hat\pi_0(\lambda) = \frac{W(\lambda)/m - \hat Q(\lambda)}{1 - \lambda - \hat Q(\lambda)}. \tag{2.5}$$

Note that $\hat\pi_0(\lambda)$ is not guaranteed to be within $[0,1]$ in practice. As in Storey (2002), we truncate $\hat\pi_0(\lambda)$ to 1 if $\hat\pi_0(\lambda) > 1$, and round $\hat\pi_0(\lambda)$ to 0 if $\hat\pi_0(\lambda) < 0$. This leads to the estimator $\min\{1, \max\{0, \hat\pi_0(\lambda)\}\}$.

The term $\hat Q(\lambda)$ serves as a regularization parameter of the proposed estimator. When $\hat Q(\lambda) = 0$, $\hat\pi_0(\lambda)$ reduces to $\hat\pi_0^S(\lambda)$. When $0 < \hat Q(\lambda) < 1-\lambda$, in Appendix C of supplementary material available at Biostatistics online we show that

$$\min\{1, \max\{0, \hat\pi_0(\lambda)\}\} \le \min\{1, \hat\pi_0^S(\lambda)\}$$

for any $\lambda \in (0,1)$. That is, the proposed estimator is never more conservative than Storey’s estimator for any $\lambda \in (0,1)$. More discussion on $\hat Q(\lambda)$ is given in Sections 3 and 4.

Finally, in addition to the bias elimination, we apply the average estimate method in Jiang and Doerge (2008) to further reduce the estimation variance. Specifically, let $\Lambda = \{\lambda_1, \ldots, \lambda_k\}$ where $0 < \lambda_1 < \cdots < \lambda_k < 1$ and $k$ is an integer value. We then compute $\min\{1, \max\{0, \hat\pi_0(\lambda_j)\}\}$ for each $\lambda_j \in \Lambda$ and take their average as the final estimate,

$$\hat\pi_0 = \frac{1}{|\Lambda|}\sum_{\lambda_j \in \Lambda} \min\{1, \max\{0, \hat\pi_0(\lambda_j)\}\}, \tag{2.6}$$

where $|\Lambda|$ is the number of $\lambda_j$ contained in the set $\Lambda$. We note that the average estimate method is very robust when the independence assumption is violated.
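The bias-corrected estimator and its averaged version can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the estimates of the average upper tail probability are passed in as plain numbers (`q_hat`), and the function names and toy data are hypothetical rather than taken from the paper.

```python
def corrected_pi0(pvalues, lam, q_hat):
    """Bias-corrected estimate (W/m - q_hat) / (1 - lam - q_hat), truncated to [0, 1].

    q_hat stands for an externally supplied estimate of the average upper
    tail probability of the false-null p-values at threshold lam.
    """
    m = len(pvalues)
    w = sum(1 for p in pvalues if p > lam)  # number of p-values in (lam, 1]
    denom = 1.0 - lam - q_hat
    if denom <= 0.0:
        raise ValueError("q_hat must be smaller than 1 - lam")
    raw = (w / m - q_hat) / denom
    return min(1.0, max(0.0, raw))


def averaged_pi0(pvalues, lams, q_hats):
    """Average the truncated estimates over a grid of thresholds."""
    estimates = [corrected_pi0(pvalues, lam, q) for lam, q in zip(lams, q_hats)]
    return sum(estimates) / len(estimates)


# Toy inputs, made up for illustration only.
toy_pvalues = [0.001, 0.004, 0.02, 0.03, 0.2, 0.35, 0.5, 0.65, 0.8, 0.95]
single = corrected_pi0(toy_pvalues, lam=0.5, q_hat=0.1)
averaged = averaged_pi0(toy_pvalues, lams=[0.2, 0.5], q_hats=[0.2, 0.1])
```

Setting `q_hat` to zero recovers the Storey-type estimate, which makes the role of the correction term easy to check numerically.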

2.2. Choice of the set $\Lambda$

Needless to say, the set $\Lambda$ may play an important role for the proposed estimator. In what follows we investigate the choice of an appropriate set $\Lambda$ in practice. Recall that for the estimator $\hat\pi_0^S(\lambda)$ in (2.2), there is a severe bias-variance trade-off in the tuning parameter $\lambda$. Specifically, (i) when $\lambda \to 0$, the variance of $\hat\pi_0^S(\lambda)$ is smaller but the bias increases; and (ii) when $\lambda \to 1$, the bias of $\hat\pi_0^S(\lambda)$ is smaller but the variance increases. In practice, the optimal $\lambda$ is suggested to be the one that minimizes the MSE and is implemented by a bootstrap procedure in Storey and others (2004).

We note that, unlike the Storey estimator $\hat\pi_0^S(\lambda)$, the proposed estimator $\hat\pi_0(\lambda)$ in (2.5) has little bias and does not suffer a severe bias-variance trade-off in the choice of $\lambda$. Thus, to choose an appropriate $\lambda$ value, we can aim to minimize the variance of the estimator only. Simulations (not shown) indicate that the variance of $\hat\pi_0(\lambda)$ is usually larger when $\lambda$ is near 0 or 1 than when it is near the middle of the range. In addition, from a theoretical point of view, we found that $1 - \lambda - \hat Q(\lambda) \to 0$ as $\lambda \to 0$ or $\lambda \to 1$ for the proposed $\hat Q(\lambda)$ in Section 4.1. Recall that $1 - \lambda - \hat Q(\lambda)$ is the denominator of (2.5). This implies that $\hat\pi_0(\lambda)$ may not be stable and may have a large variation when $\lambda$ is near 0 or 1. That is, to make $\hat\pi_0(\lambda)$ a good estimate, the $\lambda$ value should not be too small or too large. In Appendix D of supplementary material available at Biostatistics online, a simulation study investigates how sensitive the method is to the choice of the set $\Lambda$. Based on the simulation results, we apply one fixed set $\Lambda$ throughout the paper.

3. Probability density function of false-null $p$-values

Given the set $\Lambda$, to implement the estimator (2.6) we need to have an appropriate estimate for the unknown quantity $\bar Q(\lambda)$. To achieve this, we need to have the probability density functions $f_{\delta_i}(p)$ for each false-null $p$-value with effect size $\delta_i$, where $i \in \mathcal{M}_1$. For ease of notation, in this section we omit the subscript $i$ in effect sizes unless otherwise specified. Our aim is then to determine the probability density function $f_\delta(p)$ of a false-null $p$-value with effect size $\delta$.

For the one-sample comparison, let $X_1, \ldots, X_n$ be a random sample of size $n$ from a normal distribution with mean $\mu$ and variance $\sigma^2$. Let $\bar X$ be the sample mean, $S^2$ be the sample variance, and $\delta = \mu/\sigma$ be the effect size. For testing the one-sided hypothesis

$$H_0: \mu = 0 \quad \text{versus} \quad H_1: \mu > 0, \tag{3.1}$$

Hung and others (1997) assumed a known $\sigma^2$ and considered the test statistic $Z = \sqrt{n}\,\bar X/\sigma$. Under $H_0$, the test statistic $Z$ follows a standard normal distribution. This yields a $p$-value of $p = 1 - \Phi(z)$, where $z$ is the realization of $Z$ and $\Phi$ is the cumulative distribution function of the standard normal distribution. Under $H_1$, $Z$ is normally distributed with mean $\sqrt{n}\,\delta$ and variance 1. Then, by Jacobian transformation, for given $\delta$ and $n$ the probability density function of $p$ is

$$f_\delta(p) = \frac{\phi(z_p - \sqrt{n}\,\delta)}{\phi(z_p)}, \qquad 0 < p < 1, \tag{3.2}$$

where $\phi$ is the standard normal density and $z_p$ is the upper $p$th percentile of the standard normal distribution, that is, $\Phi(z_p) = 1 - p$. Further, we have

$$Q_\delta(\lambda) = \int_\lambda^1 f_\delta(p)\,dp = \Phi(z_\lambda - \sqrt{n}\,\delta). \tag{3.3}$$
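The known-variance, one-sided case is simple enough to check numerically with the Python standard library. The sketch below is an illustration, not the authors' code: it assumes the test statistic is $\sqrt{n}\,\bar X/\sigma$ with $p$-value $1-\Phi(z)$ as described above, and the function names are hypothetical.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def pvalue_density(p, delta, n):
    """Density of a one-sided p-value when Z ~ N(sqrt(n) * delta, 1)."""
    z = STD_NORMAL.inv_cdf(1.0 - p)  # upper p-th percentile of N(0, 1)
    return STD_NORMAL.pdf(z - math.sqrt(n) * delta) / STD_NORMAL.pdf(z)


def upper_tail(lam, delta, n):
    """P(p-value > lam) for effect size delta and sample size n."""
    z = STD_NORMAL.inv_cdf(1.0 - lam)
    return STD_NORMAL.cdf(z - math.sqrt(n) * delta)


def integrate_density(lam, delta, n, steps=2000):
    """Midpoint-rule integral of the density over (lam, 1), as a sanity check."""
    h = (1.0 - lam) / steps
    return h * sum(pvalue_density(lam + (k + 0.5) * h, delta, n) for k in range(steps))
```

With `delta = 0` the density is identically 1 and the upper tail probability equals `1 - lam`, matching the uniform distribution of true-null $p$-values; for positive `delta` the upper tail probability shrinks as `delta` or `n` grows.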

Needless to say, the assumption of known variances and also the restriction to one-sided tests in Hung and others (1997) limited its application in testing the differential expression of genes. The small sample size in such studies can be another concern. Hence, to accommodate the needs of microarray studies, we extend their method to the two-sided testing problems with unknown variances.

3.1. Two-sided tests with unknown variances

We first consider the one-sample, two-sided comparison. For testing the hypothesis

$$H_0: \mu = 0 \quad \text{versus} \quad H_1: \mu \ne 0, \tag{3.4}$$

we consider the test statistic $T = \sqrt{n}\,\bar X/S$, where $\bar X$ and $S^2$ are the sample mean and sample variance, respectively. Let $\delta = \mu/\sigma$ be the effect size. Under $H_0$, the test statistic $T$ follows a Student’s $t$ distribution with $n-1$ degrees of freedom.

The $p$-value for testing (3.4) is given as $p = 2\{1 - G_\nu(|t|)\}$, where $t$ is the realization of $T$, $\nu = n-1$ and $G_\nu$ is the cumulative distribution function of Student’s $t$ distribution with $\nu$ degrees of freedom. Under $H_1$, it is easy to verify that $T$ follows a non-central $t$ distribution with $\nu$ degrees of freedom and NCP $\theta = \sqrt{n}\,\delta$. Let $t_\nu(\alpha)$ be the upper $\alpha$th percentile of Student’s $t$ distribution with $\nu$ degrees of freedom, that is, $G_\nu(t_\nu(\alpha)) = 1 - \alpha$. In Appendix B of supplementary material available at Biostatistics online, for any given $\delta$ and $n$, we show that the probability density function of $p$ is

$$f_\delta(p) = \frac{g_{\nu,\theta}(t_\nu(p/2)) + g_{\nu,\theta}(-t_\nu(p/2))}{2\,g_\nu(t_\nu(p/2))}, \qquad 0 < p < 1, \tag{3.5}$$

where $g_\nu$ is the probability density function of Student’s $t$ distribution with $\nu$ degrees of freedom, and $g_{\nu,\theta}$ is the probability density function of the non-central $t$ distribution with $\nu$ degrees of freedom and NCP $\theta$. When $\delta = 0$, both $g_{\nu,\theta}(t_\nu(p/2))$ and $g_{\nu,\theta}(-t_\nu(p/2))$ reduce to $g_\nu(t_\nu(p/2))$ so that $p$ follows a uniform distribution in $[0,1]$. When $\delta \ne 0$, we have

$$Q_\delta(\lambda) = G_{\nu,\theta}(t_\nu(\lambda/2)) - G_{\nu,\theta}(-t_\nu(\lambda/2)), \tag{3.6}$$

where $G_{\nu,\theta}$ is the cumulative distribution function of the non-central $t$ distribution with $\nu$ degrees of freedom and NCP $\theta$.
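For readers who want the intermediate step, the upper tail probability of a two-sided $p$-value follows directly from its definition. The notation below is chosen for this illustration: $T$ is the test statistic, $G_\nu$ and $G_{\nu,\theta}$ are the central and non-central $t$ distribution functions with $\nu$ degrees of freedom, $\theta$ is the NCP, and $t_\nu(\alpha)$ is the upper $\alpha$ percentile with $G_\nu(t_\nu(\alpha)) = 1-\alpha$.

$$
\begin{aligned}
P(p > \lambda) &= P\big(2\{1 - G_\nu(|T|)\} > \lambda\big) \\
&= P\big(G_\nu(|T|) < 1 - \lambda/2\big) \\
&= P\big(|T| < t_\nu(\lambda/2)\big) \\
&= G_{\nu,\theta}\big(t_\nu(\lambda/2)\big) - G_{\nu,\theta}\big(-t_\nu(\lambda/2)\big).
\end{aligned}
$$

As a check, setting $\theta = 0$ gives $G_\nu(t_\nu(\lambda/2)) - G_\nu(-t_\nu(\lambda/2)) = (1 - \lambda/2) - \lambda/2 = 1 - \lambda$, the upper tail probability of a uniform $p$-value.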

Now we consider the two-sample, two-sided comparison. Let $X_1, \ldots, X_{n_1}$ be a random sample of size $n_1$ from the normal distribution with mean $\mu_1$ and variance $\sigma^2$, and $Y_1, \ldots, Y_{n_2}$ be a random sample of size $n_2$ from the normal distribution with mean $\mu_2$ and variance $\sigma^2$. Let also $\bar X$ and $\bar Y$ be the sample means for the two samples, respectively. For testing the hypothesis

$$H_0: \mu_1 = \mu_2 \quad \text{versus} \quad H_1: \mu_1 \ne \mu_2, \tag{3.7}$$

we consider the test statistic $T = (\bar X - \bar Y)/(S_{\rm pool}\sqrt{1/n_1 + 1/n_2})$, where $S_{\rm pool}^2 = \{(n_1-1)S_1^2 + (n_2-1)S_2^2\}/(n_1+n_2-2)$ is the pooled sample variance with $S_1^2 = \sum_{i=1}^{n_1}(X_i - \bar X)^2/(n_1-1)$ and $S_2^2 = \sum_{i=1}^{n_2}(Y_i - \bar Y)^2/(n_2-1)$. Under $H_0$, $T$ follows a Student’s $t$ distribution with $n_1+n_2-2$ degrees of freedom. Under $H_1$, $T$ follows a non-central $t$ distribution with $n_1+n_2-2$ degrees of freedom and NCP $\delta\sqrt{n_1 n_2/(n_1+n_2)}$. Thus to make formulas (3.5) and (3.6) applicable to the two-sample comparison, we only need to redefine $\delta$, $\nu$ and $\theta$ as follows: $\delta = (\mu_1 - \mu_2)/\sigma$, $\nu = n_1+n_2-2$ and $\theta = \delta\sqrt{n_1 n_2/(n_1+n_2)}$. Finally, if a common variance in (3.7) is not assumed, we may apply Welch’s $t$-test statistic, which follows an approximate $t$ distribution.

4. The proposed algorithm for estimating $\pi_0$

For the one-sample comparison, an intuitive estimator of $\delta$ is given as $\bar X/S$, where $\bar X$ is the sample mean and $S$ is the sample standard deviation. However, $\bar X/S$ is suboptimal as it is biased. Alternatively, because $\bar X$ and $S$ are independent of each other, we have

$$E(\bar X/S) = E(\bar X)\,E(1/S) = \frac{\mu\sqrt{n-1}}{\sigma}\,E(\sqrt{Y}) = \frac{\delta}{c_{n-1}}, \tag{4.1}$$

where $Y = \sigma^2/\{(n-1)S^2\}$ follows an inverse-$\chi^2$ distribution with $n-1$ degrees of freedom, $c_\nu = \sqrt{2/\nu}\;\Gamma(\nu/2)/\Gamma((\nu-1)/2)$, and $\Gamma(\cdot)$ is the gamma function. By (4.1), an unbiased estimator of $\delta$ is given as

$$\hat\delta = c_{n-1}\,\bar X/S. \tag{4.2}$$

Similarly, for the two-sample comparison, an unbiased estimator of $\delta$ is

$$\hat\delta = c_{n_1+n_2-2}\,(\bar X - \bar Y)/S_{\rm pool}, \tag{4.3}$$

where $S_{\rm pool}^2$ is the pooled sample variance.
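The gamma-function correction behind the unbiased effect size estimator can be computed with the standard library. The sketch below assumes the correction constant $\sqrt{2/\nu}\,\Gamma(\nu/2)/\Gamma((\nu-1)/2)$ with $\nu$ the degrees of freedom of the variance estimate, which follows from the mean of the inverse square root of a $\chi^2_\nu$ variable; the function names are hypothetical.

```python
import math


def unbiasing_constant(nu):
    """sqrt(2/nu) * Gamma(nu/2) / Gamma((nu-1)/2), computed on the log scale.

    nu is the degrees of freedom of the variance estimate; nu > 1 is required
    so that the expectation of 1/S is finite.
    """
    if nu <= 1:
        raise ValueError("need more than one degree of freedom")
    log_c = 0.5 * math.log(2.0 / nu) + math.lgamma(nu / 2.0) - math.lgamma((nu - 1) / 2.0)
    return math.exp(log_c)


def effect_size_one_sample(x):
    """Bias-corrected estimate of mean / sd from one normal sample (n >= 3)."""
    n = len(x)
    mean = sum(x) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in x) / (n - 1))
    return unbiasing_constant(n - 1) * mean / sd
```

The constant is below 1 and approaches 1 as the degrees of freedom grow, so the correction matters mainly for the small sample sizes typical of microarray studies. Working with `math.lgamma` avoids overflow of the gamma function for large `nu`.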

4.1. Algorithm for estimating $\pi_0$

For the sake of brevity, we present in this section the estimation procedure for the one-sample, two-sided comparison only. Note that the procedure is generally applicable when estimating $\pi_0$ in other settings. The proposed algorithm for estimating $\pi_0$ is as follows.

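  1. For each $i = 1, \ldots, m$, we estimate $\delta_i$ by the unbiased estimator $\hat\delta_i$ in (4.2).

  2. For each $i = 1, \ldots, m$ and $\lambda_j \in \Lambda$, we estimate the upper tail probability $Q_{\delta_i}(\lambda_j)$ by
    $$\hat Q_i(\lambda_j) = G_{\nu,\hat\theta_i}(t_\nu(\lambda_j/2)) - G_{\nu,\hat\theta_i}(-t_\nu(\lambda_j/2)), \qquad \hat\theta_i = \sqrt{n}\,\hat\delta_i, \tag{4.4}$$
    where $G_{\nu,\theta}$ is the cumulative distribution function of the non-central $t$ distribution with $\nu = n-1$ degrees of freedom and NCP $\theta$, and $t_\nu(\alpha)$ is the upper $\alpha$th percentile of Student’s $t$ distribution. We then order the values of $\hat Q_i(\lambda_j)$ for each $\lambda_j$ such that
    $$\hat Q_{(1)}(\lambda_j) \le \hat Q_{(2)}(\lambda_j) \le \cdots \le \hat Q_{(m)}(\lambda_j).$$
  3. Let $\hat m_1 = [m(1 - \hat\pi_0^I)]$, where $\hat\pi_0^I$ is an initial estimate of $\pi_0$ and $[x]$ is the integral part of $x$. Then for each $\lambda_j \in \Lambda$, we estimate the average upper tail probability $\bar Q(\lambda_j)$ by
    $$\hat Q(\lambda_j) = \frac{1}{\hat m_1}\sum_{i=1}^{\hat m_1} \hat Q_{(i)}(\lambda_j). \tag{4.5}$$
  4. Given the estimates $\hat Q(\lambda_j)$ for all $\lambda_j \in \Lambda$, we estimate $\pi_0$ by
    $$\hat\pi_0 = \frac{1}{|\Lambda|}\sum_{\lambda_j \in \Lambda} \min\left\{1, \max\left\{0, \frac{W(\lambda_j)/m - \hat Q(\lambda_j)}{1 - \lambda_j - \hat Q(\lambda_j)}\right\}\right\}. \tag{4.6}$$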
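The last three steps of the algorithm above can be sketched as follows. This is a hedged illustration, not the authors' implementation: the per-test upper tail probabilities are taken as given in `q_hat` (one row per test, one column per threshold), since computing them needs a non-central $t$ distribution function that the Python standard library does not provide. Averaging the smallest ordered values is an assumption inferred from the surrounding text, and all names are hypothetical.

```python
import math


def estimate_pi0(pvalues, lams, q_hat, pi0_init):
    """Average of truncated bias-corrected estimates over the thresholds in lams.

    q_hat[i][j] is an estimated upper tail probability for test i at lams[j];
    pi0_init is an initial estimate of the proportion of true nulls.
    """
    m = len(pvalues)
    m1_hat = int(math.floor(m * (1.0 - pi0_init)))  # estimated number of false nulls
    estimates = []
    for j, lam in enumerate(lams):
        # Step 2 (ordering): sort the per-test tail probabilities for this threshold.
        ordered = sorted(row[j] for row in q_hat)
        # Step 3: average the m1_hat smallest values (assumed convention);
        # with no estimated false nulls the correction term is zero.
        q_bar = sum(ordered[:m1_hat]) / m1_hat if m1_hat > 0 else 0.0
        denom = 1.0 - lam - q_bar
        if denom <= 0.0:
            raise ValueError("average tail probability must be below 1 - lam")
        # Step 4: bias-corrected estimate, truncated to [0, 1].
        w = sum(1 for p in pvalues if p > lam)
        estimates.append(min(1.0, max(0.0, (w / m - q_bar) / denom)))
    return sum(estimates) / len(estimates)


# Toy inputs, made up for illustration only.
toy_pvalues = [0.001, 0.004, 0.02, 0.03, 0.2, 0.35, 0.5, 0.65, 0.8, 0.95]
toy_q_hat = [[0.05], [0.1], [0.1], [0.15], [0.5], [0.5], [0.5], [0.5], [0.5], [0.5]]
toy_estimate = estimate_pi0(toy_pvalues, [0.5], toy_q_hat, pi0_init=0.6)
```

With an initial estimate of 1 the correction term vanishes and the function returns the averaged Storey-type estimate, which shows directly how the initial estimate controls the amount of bias correction.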

We note that the initial estimate of $\pi_0$ may play an important role for the proposed estimation procedure. When the initial estimate $\hat\pi_0^I$ is too large, $\hat m_1$ tends to be small and so is $\hat Q(\lambda_j)$. As a consequence, the bias correction of the proposed estimator relative to the uncorrected estimator may not be observable. On the other hand, when the initial estimate $\hat\pi_0^I$ is too small, it may result in an overcorrected estimate. In Appendix E of supplementary material available at Biostatistics online, a simulation study investigates how sensitive the method is to the choice of the initial estimator of $\pi_0$. Based on the simulation results, we adopt the bootstrap estimator in Storey and others (2004) as the initial estimate of $\pi_0$ in the proposed algorithm.

4.2. Behavior of the proposed estimator

The following result shows that the proposed estimator is always less conservative than the estimator of Jiang and Doerge (2008).

Theorem 1.

For any given set $\Lambda$, the proposed $\hat\pi_0$ in (4.6) is a less conservative estimator of $\pi_0$ than the average estimate $\hat\pi_0^A$ in Jiang and Doerge (2008), where

$$\hat\pi_0^A = \frac{1}{|\Lambda|}\sum_{\lambda_j \in \Lambda} \min\{1, \hat\pi_0^S(\lambda_j)\}. \tag{4.7}$$

The proof of Theorem 1 is given in Appendix C of supplementary material available at Biostatistics online. In addition, under certain conditions we can show that Inline graphic is asymptotically larger than Inline graphic so that the bias of Inline graphic is not over corrected. Specifically, we assume that (i) the initial estimate Inline graphic; and (ii) Inline graphic is a random sample from a certain distribution with a finite second moment. By (i), we have Inline graphic and so

4.2.

By (ii) and by the strong law of large numbers, we have Inline graphic as Inline graphic. Alternatively if the sample size Inline graphic, by (4.2) we have Inline graphic and Inline graphic. Then for any fixed Inline graphic, Inline graphic as Inline graphic. This shows that the proposed estimator protects from over bias correction and so is an asymptotically conservative estimator. In this sense, the proposed estimator improved the bias-reduced estimators in Ruppert and others (2007) and Qu and others (2012). Finally, we hope to clarify that the assumptions made above are very strong and may not hold in practice. Further research is warranted to investigate the statistical properties of the proposed estimator.

5. Simulation studies

In this section, we conduct simulation studies to assess the performance of the proposed estimator under various simulation settings. The five estimators we adopt for comparison are (i) the bootstrap estimator Inline graphic in Storey and others (2004), (ii) the average estimate estimator Inline graphic in Jiang and Doerge (2008), (iii) the convex estimator Inline graphic in Langaas and others (2005), (iv) the parametric estimator Inline graphic in Qu and others (2012) and (v) the proposed estimator Inline graphic.

5.1. Simulation setup

Consider a microarray experiment with Inline graphic genes and Inline graphic arrays. In this study, we set Inline graphic and consider Inline graphic and Inline graphic. The Inline graphic-dimensional arrays are generated from a multivariate normal distribution with mean vector Inline graphic and covariance matrix Inline graphic. To mimic a realistic scenario, we assume that the covariance matrix is a block diagonal matrix such that

5.1.

where Inline graphic and Inline graphic follows an auto-regressive structure. Let Inline graphic throughout the simulation studies. We consider four different values of Inline graphic, ranging from 0, 0.4 to 0.8, to represent different levels of dependence. Note that Inline graphic corresponds to a diagonal matrix and so is the situation where all the genes are independent of each other. Finally, we simulate Inline graphic independent and identically distributed (i.i.d) from the distribution Inline graphic to account for the heterogeneity of variance in genes.

The next step is to split the $m$ genes into Inline graphic constant genes corresponding to the true-null hypotheses, and Inline graphic differentially expressed genes corresponding to the false-null hypotheses. To achieve this, we first randomly sample a set of Inline graphic numbers, denoted by Inline graphic, from the integer set Inline graphic. Let Inline graphic be the complement set so that Inline graphic. We then assign Inline graphic for each Inline graphic, and simulate Inline graphic i.i.d. from the uniform distribution on the interval Inline graphic for each Inline graphic. In other words, we specify the mean vector as Inline graphic. For a complete comparison, we consider 9 values of $\pi_0$, ranging from Inline graphic, Inline graphic to Inline graphic, to represent different proportions of true-null hypotheses.

For each combination of Inline graphic and Inline graphic, we first generate Inline graphic and Inline graphic using the algorithm specified above. We then simulate the Inline graphic arrays Inline graphic, Inline graphic, independently from the multivariate normal distribution with the generated mean Inline graphic and covariance matrix Inline graphic. To test the hypotheses Inline graphic versus Inline graphic, we let Inline graphic, where Inline graphic and Inline graphic are the sample mean and sample standard deviation of gene Inline graphic, respectively. We then compute the Inline graphic-values as Inline graphic, with Inline graphic the realization of Inline graphic, and estimate the estimators Inline graphic, Inline graphic, Inline graphic and Inline graphic using the computed Inline graphic-values.

5.2. Simulation results

Following the above procedure, we simulate Inline graphic sets of independent data for each combination setting of Inline graphic. For each method, we compute the MSE as

5.2.

where Inline graphic is the estimated Inline graphic for the Inline graphicth simulated data set and Inline graphic is the sample average. We report the MSEs of the five estimators as functions of the true Inline graphic in Figure 1 for Inline graphic and Inline graphic and Inline graphic, Inline graphic and Inline graphic, respectively. It is evident that the proposed Inline graphic provides a smaller MSE than the other four estimators in most settings. Specifically, we note that (i) for small and moderate Inline graphic values, the proposed Inline graphic is always the best estimator and (ii) for large Inline graphic values, the proposed Inline graphic is in a league with Inline graphic and Inline graphic that provide the best performance. We note that the comparison results among Inline graphic, Inline graphic, and Inline graphic remain similar to those reported in Langaas and others (2005) and Jiang and Doerge (2008). In addition, the estimator Inline graphic is always suboptimal throughout the simulations.

Figure 1.

Plots of MSEs as functions of $\pi_0$ for various Inline graphic and Inline graphic values, where “1” represents the bootstrap estimator, “2” represents the average estimate estimator, “3” represents the convex estimator, “4” represents the parametric estimator, and “5” represents the proposed new estimator.

To visualize how the proposed method improves the existing methods, we plot the density estimates of the distributions of the estimators in Figure 2 for Inline graphic and in Figure 3 for Inline graphic. To save space, we only present the results for Inline graphic, Inline graphic and Inline graphic and Inline graphic and Inline graphic; the comparison patterns for other combination settings remain similar. From the densities, we note that (i) for small Inline graphic values such as Inline graphic, the proposed Inline graphic provides to be an unbiased estimator or a slightly underestimated estimator, whereas Inline graphic underestimates Inline graphic and the other three overestimate Inline graphic; (ii) for moderate Inline graphic values such as Inline graphic, the proposed Inline graphic proves to be an unbiased estimator or slightly overestimates Inline graphic, whereas the other four estimators keep the pattern as that for Inline graphic; and (iii) for large Inline graphic values such as Inline graphic, all five estimators tend to have a small bias, whereas Inline graphic and Inline graphic perform worst due to the large variability in the estimation. In addition, Inline graphic and Inline graphic perform very similarly for Inline graphic no matter what values of Inline graphic and Inline graphic are used. Finally, it is noteworthy that we have also conducted simulation studies for larger Inline graphic values and the comparison results remain similar. For more details, please refer to Appendix F of supplementary material available at Biostatistics online.

Figure 2.

Density estimates of Inline graphic for Inline graphic, where the short dashed line represents the bootstrap estimator Inline graphic, the dash-dotted line represents the average estimate estimator Inline graphic, the dotted line represents the convex estimator Inline graphic, the long dashed line represents the parametric estimator Inline graphic, and the solid line represents the proposed new estimator Inline graphic.

Figure 3.

Density estimates of Inline graphic for Inline graphic, where the short dashed line represents the bootstrap estimator Inline graphic, the dash-dotted line represents the average estimate estimator Inline graphic, the dotted line represents the convex estimator Inline graphic, the long dashed line represents the parametric estimator Inline graphic, and the solid line represents the proposed new estimator Inline graphic.

6. Applications to Microarray data

In this section, we apply the proposed estimator to several microarray data sets for estimating $\pi_0$. The first data set is from the experiment described by Kuo and others (2003). The objective of the experiment was to identify the targets of the Arf gene on the Arf-Mdm2-p53 tumor suppressor pathway. In this study, the cDNA microarrays were printed from a murine clone library available at St. Jude Children’s Research Hospital. Samples from reference and Arf-induced cell lines were taken at 0, 2, 4 and 8 h. At each time point, three independent replicates of cDNA microarray were generated. There were 5776 probe spots on each array, of which only the 2936 spots that passed a quality control of image analysis were used for differential expression analysis. The p-values used in our study were generated by Pounds and Cheng (2004), who computed them by permutation tests (see Figure 4A for the histogram of the p-values).

The second data set is the Estrogen data described in the “Estrogen 2x2 Factorial Design” vignette by Scholtens and others (2004). The objective of the study was to investigate the effect of estrogen on the genes in ER+ breast cancer cells over time. The p-values for testing the null hypothesis of no differential expression in the presence and absence of estrogen were used in our study (see Figure 4B for the histogram of the p-values).

The third data set is the cancer cell line experiment described by Cui and others (2005). The data set is from a cDNA microarray experiment whose objective is to identify differentially expressed genes in two human colon cancer cell lines, CACO2 and HCT116, and three human ovarian cancer cell lines, ES2, MDAH2774 and OV1063. In total, there were Inline graphic genes tested on each array. The p-values for testing differential expression among these cell lines were generated by fitting an analysis of variance model to each gene to account for the multiple sources of variation, including array, dye and sample effects (see Figure 4C for the histogram of the p-values).
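The p-values of the first data set above were obtained by permutation tests. As a rough illustration only, and not the procedure used by Pounds and Cheng (2004), a two-sided permutation p-value for a difference in group means could be sketched as follows. The function name `permutation_p_value`, the mean-difference statistic, the number of permutations, and the fixed seed are all assumptions made for this sketch.

```python
import random


def permutation_p_value(group_a, group_b, n_perm=2000, seed=0):
    """Two-sided permutation p-value for a difference in group means.

    Hypothetical helper for illustration; the statistic and the number of
    random permutations are assumptions, not taken from the paper.
    """
    rng = random.Random(seed)
    n_a = len(group_a)
    pooled = list(group_a) + list(group_b)

    def abs_mean_diff(values):
        # The first n_a entries play the role of group A, the rest group B.
        a, b = values[:n_a], values[n_a:]
        return abs(sum(a) / len(a) - sum(b) / len(b))

    observed = abs_mean_diff(pooled)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random relabelling of the pooled observations
        if abs_mean_diff(pooled) >= observed:
            exceed += 1
    # The add-one correction keeps the p-value strictly positive.
    return (exceed + 1) / (n_perm + 1)
```

Applied gene by gene, a routine of this kind yields one p-value per gene, which is the input that the estimators of the proportion of true nulls work with.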

Figure 4.

Histograms of p-values for the three data sets, where (A), (B), and (C) correspond to the p-values for the first, second, and third data set, respectively.

Table 1 reports the estimated values of $\pi_0$ for the three data sets using the bootstrap estimator, the average estimate estimator, the convex estimator and the proposed estimator. Note that the parametric estimator in Qu and others (2012) is not reported because the two-sided t-statistics are not available for these data sets. Among the four estimators, we observe that Inline graphic is smaller than the other three estimators in most cases, especially for Inline graphic. This is consistent with the conclusion in Theorem 1. For the first data set, the proposed estimator gives the smallest estimate (0.431), followed by the bootstrap (0.447) and convex (0.463) estimators, whereas the average estimate estimator (0.658) is far above them. For the second data set, there is a high degree of agreement among the estimators (0.875 to 0.884) except for the bootstrap estimator, which is much larger (0.944). For the third data set, the proposed estimator (0.498) is similar to the bootstrap (0.486) and convex (0.501) estimators and is less conservative than the average estimate estimator (0.583).

Table 1.

Estimates of $\pi_0$ for the three data sets from the bootstrap, average estimate, convex, and proposed estimators.

Estimator           Data Set 1   Data Set 2   Data Set 3
Bootstrap           0.447        0.944        0.486
Average estimate    0.658        0.884        0.583
Convex              0.463        0.875        0.501
Proposed            0.431        0.877        0.498
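To make the kind of quantity reported in Table 1 concrete, the sketch below computes a simple tail-proportion estimate of the proportion of true nulls in the spirit of Storey (2002): the fraction of p-values above a cutoff lambda, rescaled by 1 - lambda. This is neither the bootstrap estimator nor the proposed estimator of this paper. The function name `storey_pi0` and the default cutoff of 0.5 are assumptions chosen for illustration.

```python
def storey_pi0(p_values, lam=0.5):
    """Tail-proportion estimate of the proportion of true nulls.

    Computes #{p_i > lam} / (m * (1 - lam)), truncated at 1. The cutoff
    `lam` is a tuning parameter; 0.5 is an arbitrary illustrative default.
    """
    m = len(p_values)
    if m == 0:
        raise ValueError("p_values must be non-empty")
    if not 0.0 <= lam < 1.0:
        raise ValueError("lam must lie in [0, 1)")
    # True-null p-values are roughly uniform, so about (1 - lam) of them
    # should fall above lam; false-null p-values rarely do.
    tail_count = sum(1 for p in p_values if p > lam)
    return min(1.0, tail_count / (m * (1.0 - lam)))


# Toy input: 60 evenly spread "null" p-values plus 40 small "non-null" ones.
toy_nulls = [(i + 0.5) / 60 for i in range(60)]
toy_alternatives = [0.001 * (i + 1) for i in range(40)]
toy_estimate = storey_pi0(toy_nulls + toy_alternatives, lam=0.5)
```

On this toy input, 30 of the 100 p-values exceed 0.5, so the estimate is 30 / (100 x 0.5) = 0.6, matching the 60% of values that were placed as nulls.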

7. Conclusion

The proportion of true-null hypotheses, $\pi_0$, is an important quantity in multiple testing and has attracted considerable attention in the recent literature. Most existing methods for estimating $\pi_0$ are known to be either too conservative or to suffer from an unacceptably large estimation variance. In this paper, we propose a new method for estimating $\pi_0$ that reduces the bias and variance of the estimation simultaneously. To achieve this, we first utilize the probability density functions of false-null p-values and then propose a novel algorithm to estimate $\pi_0$. The statistical behavior of the proposed estimator is also investigated. Through extensive simulation studies and real data analyses, we demonstrated that the proposed estimator may substantially decrease the bias and variance compared with most existing competitors, and may therefore improve on existing methods. Finally, we note that the paper has focused on the estimation of $\pi_0$ only. Some related questions, such as the behavior of the false discovery rate when the proposed estimator is used, may warrant further study.
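As a hedged illustration of the related question mentioned above, one common way in which an estimate of the proportion of true nulls enters false discovery rate control is the adaptive Benjamini–Hochberg procedure, which runs the usual step-up rule at level alpha divided by the estimate. The sketch below assumes that the estimate is supplied by the user; the function name `adaptive_bh` and its arguments are hypothetical, and nothing here is claimed about the behavior of the proposed estimator in this role.

```python
def adaptive_bh(p_values, pi0_hat, alpha=0.05):
    """Indices rejected by the BH step-up rule run at level alpha / pi0_hat.

    `pi0_hat` is any estimate of the proportion of true nulls in (0, 1];
    with pi0_hat = 1 this reduces to the ordinary BH procedure.
    """
    m = len(p_values)
    if not 0.0 < pi0_hat <= 1.0:
        raise ValueError("pi0_hat must lie in (0, 1]")
    level = min(1.0, alpha / pi0_hat)
    # Sort hypothesis indices by increasing p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    n_reject = 0
    for rank, idx in enumerate(order, start=1):
        # Step-up rule: keep the largest rank whose p-value is below its line.
        if p_values[idx] <= rank * level / m:
            n_reject = rank
    return sorted(order[:n_reject])


toy_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.065, 0.074, 0.205, 0.212, 0.216]
plain_rejections = adaptive_bh(toy_p, pi0_hat=1.0)
adaptive_rejections = adaptive_bh(toy_p, pi0_hat=0.5)
```

On this toy input, the ordinary procedure rejects the two smallest p-values, whereas the adaptive version with an estimate of 0.5 rejects the five smallest, which shows how a less conservative estimate can translate into more rejections.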

Supplementary material

Supplementary Material is available at http://biostatistics.oxfordjournals.org.

Acknowledgements

Yebin Cheng’s research was supported in part by National Natural Science Foundation of China (grant No. 11271241) and Shanghai Leading Academic Discipline Project No. 863. Dexiang Gao’s research was supported in part by NIH grant R01 CA 157850-02 and 51P30 CA46934. Tiejun Tong’s research was supported in part by Hong Kong Research grant HKBU202711 and Hong Kong Baptist University FRG grants FRG2/11-12/110 and FRG1/13-14/018. The authors thank the editor, the associate editor, a referee and Bryan McNair for their constructive comments that led to a substantial improvement of the paper. Conflict of Interest: None declared.

References

  1. Benjamini Y., Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B. 1995;57:289–300.
  2. Cui X., Hwang J. T. G., Qiu J., Blades N. J., Churchill G. A. Improved statistical tests for differential gene expression by shrinking variance components estimates. Biostatistics. 2005;6:59–75. doi: 10.1093/biostatistics/kxh018.
  3. Dalmasso C., Broet P., Moreau T. A simple procedure for estimating the false discovery rate. Bioinformatics. 2005;21:660–668. doi: 10.1093/bioinformatics/bti063.
  4. Finner H., Gontscharuk V. Controlling the familywise error rate with plug-in estimator for the proportion of true null hypotheses. Journal of the Royal Statistical Society, Series B. 2009;71:1031–1048.
  5. Genovese C., Wasserman L. Operating characteristics and extensions of the false discovery rate procedure. Journal of the Royal Statistical Society, Series B. 2002;64:499–517.
  6. Genovese C., Wasserman L. A stochastic process approach to false discovery control. Annals of Statistics. 2004;32:1035–1061.
  7. Hochberg Y., Benjamini Y. More powerful procedures for multiple significance testing. Statistics in Medicine. 1990;9:811–818. doi: 10.1002/sim.4780090710.
  8. Hung H. M., O’Neill R. T., Bauer P., Köhne K. The behavior of the p-value when the alternative hypothesis is true. Biometrics. 1997;53:11–22.
  9. Jiang H., Doerge R. W. Estimating the proportion of true null hypotheses for multiple comparisons. Cancer Informatics. 2008;6:25–32.
  10. Kuo M., Duncavage E. J., Mathew R., den Besten W., Pei D., Naeve D., Yamamoto T., Cheng C., Sherr C. J., Roussel M. F. Arf induces p53-dependent and -independent antiproliferative genes. Cancer Research. 2003;63:1046–1053.
  11. Lai Y. A moment-based method for estimating the proportion of true null hypotheses and its application to microarray gene expression data. Biostatistics. 2007;8:744–755. doi: 10.1093/biostatistics/kxm002.
  12. Langaas M., Lindqvist B. H., Ferkingstad E. Estimating the proportion of true null hypotheses, with application to DNA microarray data. Journal of the Royal Statistical Society, Series B. 2005;67:555–572.
  13. McLachlan G. J., Bean R. W., Jones J. B. T. A simple implementation of a normal mixture approach to differential gene expression in multiclass microarrays. Bioinformatics. 2006;22:1608–1615. doi: 10.1093/bioinformatics/btl148.
  14. Nettleton D., Hwang J. T. G., Caldo R. A., Wise R. P. Estimating the number of true null hypotheses from a histogram of p-values. Journal of Agricultural, Biological, and Environmental Statistics. 2006;11:337–356.
  15. Nguyen D. V. On estimating the proportion of true null hypotheses for false discovery rate controlling procedures in exploratory DNA microarray studies. Computational Statistics & Data Analysis. 2004;47:611–637.
  16. Pawitan Y., Murthy K. R. K., Michiels S., Ploner A. Bias in the estimation of false discovery rate in microarray studies. Bioinformatics. 2005;21:3865–3872. doi: 10.1093/bioinformatics/bti626.
  17. Pounds S., Cheng C. Improving false discovery rate estimation. Bioinformatics. 2004;20:1737–1745. doi: 10.1093/bioinformatics/bth160.
  18. Qu L., Nettleton D., Dekkers J. C. Improved estimation of the noncentrality parameter distribution from a large number of t-statistics, with applications to false discovery rate estimation in microarray data analysis. Biometrics. 2012;68:1178–1187. doi: 10.1111/j.1541-0420.2012.01764.x.
  19. Ruppert D., Nettleton D., Hwang J. T. G. Exploring the information in p-values for the analysis and planning of multiple-test experiments. Biometrics. 2007;63:483–495. doi: 10.1111/j.1541-0420.2006.00704.x.
  20. Scholtens D., Miron A., Merchant F. M., Miller A., Miron P. L., Iglehart J. D., Gentleman R. Analyzing factorial designed microarray experiments. Journal of Multivariate Analysis. 2004;90:19–43.
  21. Schweder T., Spjøtvoll E. Plots of p-values to evaluate many tests simultaneously. Biometrika. 1982;69:493–502.
  22. Storey J. D. A direct approach to false discovery rates. Journal of the Royal Statistical Society, Series B. 2002;64:479–498.
  23. Storey J. D., Taylor J. E., Siegmund D. Strong control, conservative point estimation, and simultaneous conservative consistency of false discovery rate: a unified approach. Journal of the Royal Statistical Society, Series B. 2004;66:187–205.
  24. Storey J. D., Tibshirani R. SAM thresholding and false discovery rates for detecting differential gene expression in DNA microarrays. In: Parmigiani G., Garrett E. S., Irizarry R. A., Zeger S. L., editors. The Analysis of Gene Expression Data: Methods and Software. New York: Springer; 2003.
  25. Tong T., Feng Z., Hilton J. S., Zhao H. Estimating the proportion of true null hypotheses using the pattern of observed p-values. Journal of Applied Statistics. 2013;40:1949–1964. doi: 10.1080/02664763.2013.800035.
  26. Wang H., Tuominen K. L., Tsai C. SLIM: a sliding linear model for estimating the proportion of true null hypotheses in datasets with dependence structures. Bioinformatics. 2011;27:225–231. doi: 10.1093/bioinformatics/btq650.
  27. Wu B., Guan Z., Zhao H. Parametric and nonparametric FDR estimation revisited. Biometrics. 2006;62:735–744. doi: 10.1111/j.1541-0420.2006.00531.x.

Articles from Biostatistics (Oxford, England) are provided here courtesy of Oxford University Press
