Biostatistics (Oxford, England). 2013 Aug 8;15(1):60–73. doi: 10.1093/biostatistics/kxt026

Prior robust empirical Bayes inference for large-scale data by conditioning on rank with application to microarray data

J G Liao 1,*, Timothy Mcmurry 2, Arthur Berg 3
PMCID: PMC3862209  PMID: 23934072

Abstract

Empirical Bayes methods have been extensively used for microarray data analysis by modeling the large number of unknown parameters as random effects. Empirical Bayes allows borrowing information across genes and can automatically adjust for multiple testing and selection bias. However, the standard empirical Bayes model can perform poorly if the assumed working prior deviates from the true prior. This paper proposes a new rank-conditioned inference in which the shrinkage and confidence intervals are based on the distribution of the error conditioned on rank of the data. Our approach is in contrast to a Bayesian posterior, which conditions on the data themselves. The new method is almost as efficient as standard Bayesian methods when the working prior is close to the true prior, and it is much more robust when the working prior is not close. In addition, it allows a more accurate (but also more complex) non-parametric estimate of the prior to be easily incorporated, resulting in improved inference. The new method’s prior robustness is demonstrated via simulation experiments. Application to a breast cancer gene expression microarray dataset is presented. Our R package rank.Shrinkage provides a ready-to-use implementation of the proposed methodology.

Keywords: Bayesian shrinkage, Confidence intervals, Ranking bias, Robust multiple estimation

1. Introduction

Large-scale technologies, which measure many similar entities in parallel, have emerged as an important tool in biomedical research. For example, the expression level of thousands of genes is compared between cancer and normal tissues in microarray experiments. In genome-wide association studies, the log odds ratio of the association of disease status (disease vs. control) and single-nucleotide polymorphism (SNP) frequency is estimated for thousands or millions of SNPs in a single case–control study. There are two prominent features in large-scale data. First, different parameters (e.g. difference in expression levels between cancer and normal tissues for different genes) are often studied with the same set of subjects and using the same design. Second, a large majority of the underlying parameters are 0. Because of this, the unknown parameters can be profitably modeled as random effects in an empirical Bayes framework. A popular model is to treat the large number of parameters as draws from a spike-and-slab prior distribution that is a mixture of a large mass at 0 and a non-zero component. The parameters in the spike-and-slab prior can be estimated from the many parallel measurements, resulting in an empirical Bayes analysis that borrows information across different genes or SNPs. The empirical Bayes framework automatically adjusts for the large number of hypothesis tests or effect estimates. The application of empirical Bayes to large-scale testing naturally leads to the false discovery rate (FDR) and the local FDR (Benjamini and Yekutieli, 2005; Efron and others, 2001; Newton and others, 2004; Efron, 2008). The application to parallel estimation (e.g. of the log fold changes in expression level) in the microarray context includes Newton and others (2001), Kendziorski and others (2003), and Smyth (2004). 
There is substantial literature in this area and the reader is referred to Efron (2010) for a summary of the state of the art in the empirical Bayes approach to large-scale inference and for complete references.

This paper focuses on estimating the effect sizes, the log fold change in gene expression level in microarray data, for example. We show that a popular empirical Bayes random effects model can lead to poor performance if the form of the prior is mis-specified; this has important practical implications because in real applications we are never certain of the correct distributional form, especially in the tails of the distribution, which often produce the most interesting observations. Motivated by this sensitivity to the form of the random effects model, we propose a new rank-conditioned inference in which shrinkage and confidence intervals are based on the distribution of the measurement error conditioned on the rank of the data instead of on the data themselves as in a traditional Bayesian posterior. The primary advantage of the rank-conditioned method is that it is almost as efficient as standard Bayesian methods when the working prior is close to the true prior, and it is much more robust when the working prior is not close. In particular, the proposed method provides efficient and valid inference even when the working random effects model substantially deviates from the true model in location. The proposed method can, therefore, substantially improve empirical Bayes inference for microarray studies as well as other large-scale data such as for genome-wide association studies and flow cytometry.

To put the rank-conditioned method in the context of the broader literature, we note the following. First, we condition on the rank of the observed data in order to obtain more robust estimation of effect size. This is different from Noma and others (2010), whose aim is to better rank the effect sizes. Second, the idea of combining a Bayes formulation with rank-based likelihood has previously been proposed and studied in other contexts; for example, Dunson and Taylor (2005) use the idea for estimating quantiles, Gu and Ghosal (2009) for estimating receiver operating characteristic curves, and Hoff (2007) for estimating semi-parametric copulas. Large-scale data is an area where this idea can be more profitably used because the rank of the observed data and the observed data themselves are closely correlated. We are, therefore, able to take advantage of the robust property of the rank method with little loss of efficiency compared with standard empirical Bayes. Third, improved robustness can also be achieved through a more diffuse prior for the random effects. For example, Do and others (2005) and Kim and others (2009) combine a Dirichlet process with a spike-and-slab prior in a non-parametric Bayes model for random effects. A more diffuse prior, however, necessarily weakens the effectiveness of shrinkage and information borrowing, as seen in the simulation study in Section 4.2.

The paper is organized as follows. Section 2 describes a model for large-scale microarray data analysis. Section 3 presents our proposed rank-conditioned inference. Section 4 consists of simulation studies assessing the performance of rank-conditioned inference. Section 5 applies the proposed method to a breast cancer microarray dataset. Section 6 concludes with a discussion.

2. Empirical Bayes model for large-scale data

For concreteness, throughout the remainder of the paper, we focus on estimating the standardized effect size in case–control microarray experiments; application of our method in other large-scale data, such as genome-wide association studies, is similar.

We begin by describing an empirical Bayes model for the log fold change in gene expression. Let Inline graphic and Inline graphic be the log expression level of the Inline graphicth gene for the Inline graphicth subject in the cancer and normal group, respectively. The total number of genes is Inline graphic so that Inline graphic. We start with the model

2.

where Inline graphic is the variance of the Inline graphicth gene expression common for the cancer and normal groups, and Inline graphic and Inline graphic are the respective sample sizes. The quantity Inline graphic is the average (log) fold change (Guo and others, 2006; Choe and others, 2005). Let Inline graphic be the mean of Inline graphic over Inline graphic, and similarly let Inline graphic be the mean of Inline graphic. It then follows that

2.

where

2.

is the standardized log fold change and Inline graphic. Note that Inline graphic and Inline graphic typically do not vary much from gene to gene in a microarray experiment so that variance Inline graphic should be relatively constant across Inline graphic.
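As a minimal sketch of the quantities just described, the following computes a standardized mean difference and its error standard deviation for a single gene. The function name and the symbols `u` and `s` are illustrative assumptions, since the original notation is not reproduced in this text; the unknown gene-level standard deviation is replaced by a pooled estimate, and the error variance is taken as 1/n1 + 1/n2, which follows from the two-sample normal model with a common variance.

```python
import math
from statistics import mean

def standardized_difference(cancer, normal):
    """Return (u, s) for one gene.

    u: difference in group means divided by the pooled standard deviation.
    s: error standard deviation of the standardized difference,
       sqrt(1/n1 + 1/n2), when the gene-level variance is treated as known.
    """
    n1, n2 = len(cancer), len(normal)
    m1, m2 = mean(cancer), mean(normal)
    # Pooled variance estimate across the two groups.
    ss1 = sum((x - m1) ** 2 for x in cancer)
    ss2 = sum((y - m2) ** 2 for y in normal)
    pooled_sd = math.sqrt((ss1 + ss2) / (n1 + n2 - 2))
    u = (m1 - m2) / pooled_sd
    s = math.sqrt(1.0 / n1 + 1.0 / n2)
    return u, s
```

Because `s` depends only on the two sample sizes, it is the same for every gene with no missing values.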

The first stage of our empirical Bayes model is

2. (2.1)

where Inline graphic independently for Inline graphic. In application, the Inline graphic in the definition of Inline graphic will be replaced by its pooled estimate using Inline graphic and Inline graphic and Inline graphic will then follow a Inline graphic-distribution. For simplicity, we shall use the normal error model (2.1), since the Inline graphic degrees of freedom, Inline graphic, is large for the breast cancer data in Section 5. For a smaller Inline graphic, a modified version of (2.1) based on a non-central Inline graphic-distribution can be used instead. For a genome-wide association study, Inline graphic can be the estimated log odds ratio from a logistic regression for the association between disease status and the Inline graphicth SNP, Inline graphic the unknown true log odds ratio, and Inline graphic the standard error of the estimate Inline graphic. Next, we will model Inline graphic as independent random draws from a prior Inline graphic. Given prior Inline graphic, the Bayesian inference for Inline graphic is based on the posterior distribution of Inline graphic given Inline graphic with density

2. (2.2)

where Inline graphic is a Inline graphic density. The posterior mean Inline graphic is a standard Bayes estimator of Inline graphic and the Inline graphic and Inline graphic quantiles of the posterior distribution provide the Inline graphic confidence limits Inline graphic and Inline graphic.

The prior Inline graphic, however, is unknown. Empirical Bayes analysis uses a working prior Inline graphic in place of Inline graphic with the parameters in Inline graphic estimated from data Inline graphic usually via maximum likelihood (Morris, 1983). Our parametric working prior Inline graphic is a three-component mixture

2. (2.3)

where Inline graphic is the delta function (point mass) at 0, and Inline graphic and Inline graphic, respectively, model the up- and down-regulated genes. This working prior is the same as that in Noma and others (2010) with one important difference: we use (2.3) to model the distribution of the standardized differences Inline graphic instead of the raw differences Inline graphic. We show in Section 5 that modeling the standardized differences Inline graphic as draws from a common prior leads to a much better fit to a breast cancer microarray dataset.

An important practical advantage of working prior (2.3) is that the posterior distribution Inline graphic is also a mixture of the same form as (2.3) (see Noma and others, 2010; Muralidharan, 2010, for the analytical formula), which makes programming much easier and computing time manageable for large-scale problems. Spike-and-slab priors such as (2.3) have been used in variable selection and shrinkage estimation (see, e.g. Ishwaran and Rao, 2005) and have played a prominent role in multiple testing (Efron and others, 2001).
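The closed-form posterior can be sketched as follows, assuming the two non-zero components of the working prior are normal with weights `eta`, means `mu`, and variances `omega`, and that the error is normal with standard deviation `s`. These names, and the reading of `omega` as a variance, are assumptions made for illustration; this is standard normal-mixture conjugacy rather than code from the cited references.

```python
import math

def normal_pdf(x, mean, var):
    """Density of N(mean, var) at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def posterior_mixture(u, s, eta, mu, omega):
    """Posterior of theta given u under a point mass at 0 plus two normal slabs.

    Returns (p0, comps): p0 is the posterior mass at 0 and comps is a list of
    (weight, mean, variance) for the two normal posterior components.
    """
    s2 = s * s
    # Marginal likelihood contribution of each prior component.
    w0 = (1.0 - sum(eta)) * normal_pdf(u, 0.0, s2)
    raw = [e * normal_pdf(u, m, w + s2) for e, m, w in zip(eta, mu, omega)]
    total = w0 + sum(raw)
    comps = []
    for r, m, w in zip(raw, mu, omega):
        post_mean = (m * s2 + u * w) / (w + s2)   # precision-weighted mean
        post_var = w * s2 / (w + s2)
        comps.append((r / total, post_mean, post_var))
    return w0 / total, comps

def posterior_mean(u, s, eta, mu, omega):
    """Bayes shrinkage estimate: the mean of the posterior mixture."""
    _, comps = posterior_mixture(u, s, eta, mu, omega)
    return sum(w * m for w, m, _ in comps)
```

Since the posterior keeps the spike-and-two-slab form, its mean and quantiles need only a handful of arithmetic operations per gene.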

3. Rank-conditioned inference

3.1. Rank-conditioned shrinkage

For our basic model (2.1), we have

3.1.

The Bayesian estimate Inline graphic can also be written as

3.1.

which reflects the fact that the conditional mean of Inline graphic, given the observed Inline graphic, is no longer 0.

For the dataset Inline graphic, let Inline graphic be the rank of Inline graphic among Inline graphic. Our rank-conditioned inference for Inline graphic is based on the conditional distribution

3.1. (3.1)

where Inline graphic is the realized value of rank Inline graphic. The rank-conditioned shrinkage estimator is defined as

3.1. (3.2)

where Inline graphic is the conditional mean of the error Inline graphic, given that Inline graphic has rank Inline graphic among Inline graphic. Given prior Inline graphic, a draw from (3.1), Inline graphic, which is error Inline graphic conditional on Inline graphic, can be generated using the following three steps:

Step 1: Generate Inline graphic from density Inline graphic independently for Inline graphic. Let Inline graphic, where Inline graphic.

Step 2: Let Inline graphic be the rank of Inline graphic among Inline graphic.

Step 3: Repeat Steps 1–2 until Inline graphic. Then output Inline graphic.
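The three steps above amount to a rejection sampler, which might be sketched as below. The function and argument names are hypothetical, ranks are taken to run from 1 (smallest) to the number of observations, and `draw_theta` stands in for a sampler from whatever prior is being used.

```python
import random

def draw_rank_conditioned_error(i, r_i, s, draw_theta, rng):
    """Return one draw of the error for observation i, given that its
    simulated value has rank r_i (1 = smallest) among all observations.

    s: list of error standard deviations, one per observation.
    draw_theta: function taking rng and returning one draw from the prior.
    """
    n = len(s)
    while True:
        # Step 1: simulate effects and errors for all observations.
        eps = [rng.gauss(0.0, s[j]) for j in range(n)]
        u = [draw_theta(rng) + eps[j] for j in range(n)]
        # Step 2: rank of the i-th simulated value.
        rank = 1 + sum(1 for v in u if v < u[i])
        # Step 3: accept only when the simulated rank matches.
        if rank == r_i:
            return eps[i]
```

Each accepted draw costs on the order of n attempts, so this direct form is practical only for small n.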

Theorem 3.1 —

Let Inline graphic and Inline graphic independently for Inline graphic. Let Inline graphic be defined as in model (2.1). Then Inline graphic is unbiased in the sense that

graphic file with name M111.gif

for any given Inline graphic and Inline graphic, when the expectation is evaluated with respect to the joint distribution of Inline graphic and Inline graphic.

Proof. —

Theorem 3.1 follows directly from definition (3.2) and (2.1):

graphic file with name M116.gif

Theorem 3.1 says that Inline graphic corrects the ranking bias, a concept discussed in Jeffries (2009).

In addition to point estimate Inline graphic, the proposed method provides a natural confidence interval for Inline graphic. Let Inline graphic and Inline graphic satisfy

graphic file with name M122.gif (3.3)

It follows that

graphic file with name M123.gif

We have, therefore, shown that the interval

graphic file with name M124.gif (3.4)

contains the realized Inline graphic with Inline graphic probability, given Inline graphic when Inline graphic and Inline graphic are drawn as in Theorem 3.1.
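Given Monte Carlo draws of the rank-conditioned error, the point estimate and interval could be computed along the following lines. This is a sketch under one reading of (3.2) to (3.4): the estimate subtracts the conditional mean of the error from the observation, and the interval subtracts the upper and lower tail quantiles of the error. The function names and the simple empirical quantile rule are assumptions.

```python
import math

def empirical_quantile(sorted_draws, p):
    """Smallest order statistic whose empirical CDF is at least p."""
    n = len(sorted_draws)
    k = min(n - 1, max(0, math.ceil(p * n) - 1))
    return sorted_draws[k]

def rank_conditioned_summary(u_i, error_draws, alpha=0.10):
    """Return (estimate, (lower, upper)) from draws of the conditional error."""
    draws = sorted(error_draws)
    bias = sum(draws) / len(draws)            # estimated ranking bias
    q_lo = empirical_quantile(draws, alpha / 2.0)
    q_hi = empirical_quantile(draws, 1.0 - alpha / 2.0)
    # theta = u - error, so large errors map to the lower limit.
    return u_i - bias, (u_i - q_hi, u_i - q_lo)
```

For a top-ranked observation the conditional error is mostly positive, so both the estimate and the interval shift toward smaller values.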

Conditioning on Inline graphic in the rank-conditioned shrinkage estimator (3.2) and confidence limits (3.4) is in fact closely related to conditioning on Inline graphic itself in standard Bayes, as a larger Inline graphic generally corresponds to a higher rank. More specifically, let Inline graphic be the empirical distribution of Inline graphic. Suppose that Inline graphic, Inline graphic, can be modeled as draws from some distribution Inline graphic. It then follows from the Glivenko–Cantelli theorem that Inline graphic converges uniformly to Inline graphic, the distribution of Inline graphic with Inline graphic, Inline graphic, and Inline graphic. In such a case, conditioning on Inline graphic is almost the same as conditioning on Inline graphic so long as Inline graphic is not close to 0 or 1 (the difference can be more substantial for Inline graphic close to 0 or 1). The proposed rank-conditioned inference, however, can be much more robust than standard empirical Bayes against mis-specification of Inline graphic. For this, we have the following result.

Theorem 3.2 —

Let Inline graphic and Inline graphic independently for Inline graphic. Let Inline graphic be defined as in model (2.1). In the case where the Inline graphic are equal, conditional distribution (3.1) and consequently rank-conditioned estimator (3.2) and confidence limits (3.4) remain the same (and valid) when the true prior density Inline graphic is replaced by density Inline graphic for any given constant Inline graphic.

Proof. —

The proof is straightforward. Let Inline graphic and Inline graphic+Inline graphic as in the three steps above. When the Inline graphic are equal, the rank of Inline graphic is not changed when Inline graphic are all translated by a constant Inline graphic, so the distribution of Inline graphic does not change. Theorem 3.2 then follows.
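The invariance used in the proof can be checked numerically. In this small illustration (all values are arbitrary and the helper `ranks` is hypothetical), shifting every simulated effect by the same constant leaves the ranks unchanged, so the error attached to any given rank is the same draw in both cases.

```python
def ranks(values):
    """Rank of each entry, with 1 for the smallest (no ties assumed)."""
    order = sorted(range(len(values)), key=lambda j: values[j])
    out = [0] * len(values)
    for position, j in enumerate(order, start=1):
        out[j] = position
    return out

theta = [0.0, 1.5, -2.0, 0.3]     # simulated effects
eps = [0.4, -0.9, 0.1, 0.2]       # simulated errors (equal error SD assumed)
c = 7.25                          # arbitrary location shift of the prior

u = [t + e for t, e in zip(theta, eps)]
u_shifted = [t + c + e for t, e in zip(theta, eps)]
```

Here `ranks(u)` and `ranks(u_shifted)` agree, which is the whole content of the argument when the error standard deviations are equal.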

For unequal Inline graphic, Theorem 3.2 remains approximately valid so long as the variation in Inline graphic is small. Section 4 demonstrates that the rank-based shrinkage is in general more robust, not just against location shift. This is a unique feature of rank-conditioned shrinkage: the ranking bias Inline graphic is negative for lower ranked Inline graphic and positive for higher ranked Inline graphic even when evaluated under a badly specified prior. In the three steps for generating Inline graphic at the beginning of this section, the prior Inline graphic determines which Inline graphic will be output as Inline graphic. As such, the effect of a grossly mis-specified Inline graphic on Inline graphic remains limited. A grossly mis-specified Inline graphic can, however, have a much larger distorting impact on the Bayes shrinkage estimator Inline graphic.

Finally, a confidence interval such as (3.4) that adjusts for ranking Inline graphic can be crucial for valid inference; Benjamini and Yekutieli (2005) show that the unadjusted marginal confidence interval of Inline graphic can have coverage probability that differs substantially from the nominal coverage for top-ranked parameters selected based on the same data. They propose the false coverage rate controlled confidence interval as a solution to this problem. As shown in Efron (2010, pp. 230–233) however, this interval can differ markedly from the corresponding Bayes interval and can be frighteningly wide. Westfall (2005) suggests constructing empirical Bayes confidence intervals centered at the shrunken estimators; the same idea is used and further developed in Qiu and Gene Hwang (2007) and in Ghosh (2009). Our interval is similar, but is instead based on rank-conditioned shrinkage. It is generally very close to the corresponding Bayes interval when the working prior is close to the true prior.

3.2. Non-parametric update of the parametric prior

In Section 2, a parametric working prior Inline graphic is used in empirical Bayes to capture the primary structure of Inline graphic. For the rank-conditioned method, we propose a non-parametric update of density Inline graphic to density Inline graphic by formula

3.2.

where the posterior density Inline graphic is given by (2.2) with Inline graphic replaced by Inline graphic and with Inline graphic replaced by a generic Inline graphic. The parameters in Inline graphic will take the values of their maximum likelihood estimates. Inline graphic can be interpreted as the average of the posterior densities for Inline graphic given Inline graphic with prior Inline graphic. Vardi and others (1985) use a similar update to improve the estimated image density in positron emission tomography. They show that it is one expectation-maximization iteration and therefore always increases the (marginal) likelihood of Inline graphic. See also Eggermont and LaRiccia (1997). The use of Inline graphic in place of Inline graphic does not significantly increase the computational burden for rank-conditioned inference. The density Inline graphic could potentially be further updated but the analytical complexity and computational cost will increase drastically.
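Read as an average of per-observation posteriors, the update might be sketched as follows. The sketch assumes a working prior with a point mass at 0 plus two normal components (weights `eta`, means `mu`, variances `omega`); those names and the returned representation are illustrative assumptions. The updated prior is then itself a point mass at 0 plus a mixture of 2N normal components.

```python
import math

def normal_pdf(x, mean, var):
    """Density of N(mean, var) at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def posterior_mixture(u, s, eta, mu, omega):
    """Posterior (mass at 0, normal components) for one observation."""
    s2 = s * s
    w0 = (1.0 - sum(eta)) * normal_pdf(u, 0.0, s2)
    raw = [e * normal_pdf(u, m, w + s2) for e, m, w in zip(eta, mu, omega)]
    total = w0 + sum(raw)
    comps = [(r / total, (m * s2 + u * w) / (w + s2), w * s2 / (w + s2))
             for r, m, w in zip(raw, mu, omega)]
    return w0 / total, comps

def updated_prior(u_values, s_values, eta, mu, omega):
    """Average the posteriors over all observations.

    Returns (p0, comps): the averaged mass at 0 and a list of
    (weight, mean, variance) normal components, two per observation.
    """
    n = len(u_values)
    p0 = 0.0
    comps = []
    for u, s in zip(u_values, s_values):
        q0, post = posterior_mixture(u, s, eta, mu, omega)
        p0 += q0 / n
        comps.extend((w / n, m, v) for w, m, v in post)
    return p0, comps
```

Sampling from the updated prior requires only picking a component by weight, which is why the extra cost for rank-conditioned inference stays small.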

3.3. Algorithm and implementation

The three-step algorithm for drawing Inline graphic in Section 3.1 is greatly simplified for the special case of Inline graphic for all Inline graphic because the distribution of Inline graphic depends only on the rank Inline graphic and not on Inline graphic. Under this condition, Steps 2 and 3 become

Let Inline graphic be the order statistics of Inline graphic. Let Inline graphic be the Inline graphic that corresponds to Inline graphic. Output Inline graphic for Inline graphic.

In this way, one round of Steps 1–3 generates a complete and independent set of Inline graphic. The Inline graphic in (3.2) is now simply Inline graphic and the Inline graphic and Inline graphic in (3.3) are defined by Inline graphic irrespective of the value of Inline graphic.
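With a common error standard deviation, one simulated dataset yields an error draw for every rank at once. A minimal sketch, with hypothetical function names and `draw_theta` standing in for a sampler from the prior in use:

```python
import random

def one_round(n, s, draw_theta, rng):
    """Simulate n observations and return the errors ordered by the
    rank of the simulated value: result[k] belongs to rank k + 1."""
    pairs = []
    for _ in range(n):
        eps = rng.gauss(0.0, s)
        pairs.append((draw_theta(rng) + eps, eps))
    pairs.sort(key=lambda p: p[0])
    return [eps for _, eps in pairs]

def rank_bias(n, s, draw_theta, rounds, rng):
    """Monte Carlo estimate of the mean error at each rank."""
    totals = [0.0] * n
    for _ in range(rounds):
        for k, eps in enumerate(one_round(n, s, draw_theta, rng)):
            totals[k] += eps
    return [t / rounds for t in totals]
```

Keeping the per-rank draws rather than only their totals would also give the tail quantiles needed for the confidence limits.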

For most large-scale problems, the values of the error standard deviation Inline graphic may not be constant but they are not far apart (say within a factor of 3 or 4) because of the inherent common design structure. For the microarray example in Section 2, Inline graphic for standardized difference Inline graphic depends only on the sample sizes Inline graphic and Inline graphic. Therefore, Inline graphic does not differ too much unless the number of missing data points varies dramatically between genes. Similarly, in a genome-wide association study, each SNP is compared between the same set of cases and controls. For such a dataset, we can partition the Inline graphic observations, Inline graphic, into several sub-groups so that Inline graphic for observations within each sub-group varies within a factor of 1.5, for example. The simplified algorithm above can then be applied to each sub-group separately as an approximation. A Markov chain Monte Carlo type of algorithm is under investigation to efficiently sample from the rank-conditioned distribution (3.1) without requiring Inline graphic to be constant.
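One simple way to form such sub-groups is a greedy pass over the sorted error standard deviations, sketched below. The function name and the greedy rule are assumptions for illustration; the text does not prescribe a particular partitioning scheme.

```python
def partition_by_s(s_values, max_ratio=1.5):
    """Group observation indices so that, within each group, the largest
    error SD is at most max_ratio times the smallest (all SDs positive)."""
    order = sorted(range(len(s_values)), key=lambda j: s_values[j])
    groups = []
    for j in order:
        # groups[-1][0] indexes the smallest SD in the current group.
        if groups and s_values[j] <= max_ratio * s_values[groups[-1][0]]:
            groups[-1].append(j)
        else:
            groups.append([j])
    return groups
```

For example, `partition_by_s([0.10, 0.14, 0.16, 0.30, 0.31, 0.12])` yields three groups, and the simplified algorithm would then be run within each group using a representative standard deviation.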

4. Assessing performance of rank-conditioned inference

4.1. Simulation study Inline graphic

This example is adapted from Efron (2010, pp. 230–233). Let Inline graphic and Inline graphic for all Inline graphic for model (2.1). The true Inline graphic for random effects Inline graphic is (2.3) with Inline graphic, Inline graphic, Inline graphic, Inline graphic. These parameter values are chosen to have a moderate Bayes shrinkage effect. Monte Carlo simulation is conducted to compare the Bayes shrinkage estimates Inline graphic and rank-conditioned shrinkage estimates Inline graphic under five different specifications of working prior Inline graphic. These working priors Inline graphic have the same parametric form of (2.3) but with possibly different values of Inline graphic as given in various rows of Table 1. Parameters not listed are the same as in true prior. For example, Inline graphic for all the five working priors. Our simulation study is conducted as follows:

Table 1.

Mean square error of the three methods under different model mis-specification

Working prior | Inline graphic with Inline graphic | Inline graphic with Inline graphic | Inline graphic with Inline graphic
Same as true prior | 0.675 | 0.677 | 0.677
Inline graphic, Inline graphic | 0.763 | 0.710 | 0.681
Inline graphic, Inline graphic | 0.762 | 0.685 | 0.683
Inline graphic, Inline graphic | 0.702 | 0.686 | 0.683
Inline graphic, Inline graphic | 0.699 | 0.681 | 0.679

Step 1: Generate Inline graphic, Inline graphic, from prior Inline graphic. Let Inline graphic as in model (2.1).

Step 2: Let Inline graphic be the order statistics of Inline graphic. Let Inline graphic be the Inline graphic corresponding to Inline graphic. The Inline graphic can, therefore, refer to different Inline graphic for different realizations of Inline graphic.

Step 3: Compute empirical Bayes estimate Inline graphic under working model Inline graphic. Compute the rank-conditioned estimate Inline graphic under working model Inline graphic and its non-parametric update Inline graphic, respectively.

Step 4: Let Inline graphic. Calculate the mean square loss

4.1.

for estimator Inline graphic and estimator Inline graphic for both Inline graphic and Inline graphic. We only include the 500 lowest and 500 highest Inline graphic in Inline graphic because these Inline graphic are most interesting in large-scale data analysis.

Steps 1–4 are replicated 1000 times and the mean square error Inline graphic for the Bayes method and mean square error Inline graphic for rank-conditioned inference are estimated by averaging the squared error loss over these replications. The estimated Inline graphic and Inline graphic values are given in Table 1. In row 1, the working prior is the same as the true prior and the standard Bayes estimator is therefore optimal. The rank-conditioned inference under both Inline graphic and Inline graphic shows little loss of efficiency with almost the same mean square error. In rows 2 and 3, Inline graphic and Inline graphic in the working prior are shifted away from Inline graphic and Inline graphic in the true prior. The Inline graphic increases noticeably with this model mis-specification. The rank-conditioned inference, especially under Inline graphic, proves robust with a much smaller change in Inline graphic from row 1. In rows 4 and 5, Inline graphic and Inline graphic in the working prior are inflated or shrunk from Inline graphic and Inline graphic in the true prior. Again, rank-conditioned inference is more robust.

4.2. Simulation study Inline graphic

For simulation 2, we continue to let Inline graphic and Inline graphic for all Inline graphic in model (2.1). The true prior Inline graphic of Inline graphic now has the form

4.2. (4.1)

where Inline graphic is the Gamma distribution with shape Inline graphic and scale Inline graphic. We take Inline graphic so that Inline graphic always has mean −5 and variance 1 and Inline graphic always has mean 5 and variance 1 for any Inline graphic. For the simulation study, we generate Inline graphic and Inline graphic, Inline graphic, as specified by (2.1) and (4.1). Let Inline graphic be the Inline graphicth order statistic of Inline graphic and let Inline graphic be the corresponding Inline graphic. For every dataset Inline graphic, we compute a point estimate and 90% confidence limits for Inline graphic, Inline graphic, using three different methods. Method 1 is the Bayes estimate Inline graphic and confidence limits Inline graphic and Inline graphic in Section 2 using Inline graphic in (2.3) as the working prior. In accordance with empirical Bayes, parameters Inline graphic in Inline graphic are substituted by their maximum likelihood estimates using data Inline graphic under models (2.1) and (2.3). We shall call this the parametric Bayes method. Method 2 is also based on the Bayes posterior but with a more diffuse prior using a Dirichlet process mixture. Let Inline graphic be the Dirichlet process with base distribution Inline graphic and scaling parameter 1 and let Inline graphic be a (random) distribution drawn from this Dirichlet process. Following Do and others (2005) and Dunson (2010), we take the more diffuse prior, Inline graphic, as

4.2.

where Inline graphic, Inline graphic is generated as

4.2.

To be consistent with Method 1, Inline graphic and Inline graphic in Inline graphic and Inline graphic in Inline graphic are substituted by their maximum likelihood estimates under models (2.1) and (2.3) as in Method 1. For the inverse gamma distribution, we choose shape parameter Inline graphic and scale parameter Inline graphic, where Inline graphic is maximum likelihood estimate of variance Inline graphic in (2.3), so that the mean of the inverse gamma equals Inline graphic. Note that the prior Inline graphic is considerably more diffuse and less informative than Inline graphic due to the extra variation in Inline graphic. We call Method 2 non-parametric DP Bayes. Method 3 is the proposed rank-conditioned inference under Inline graphic, the non-parametric update of Inline graphic. Again, the parameters in Inline graphic are substituted by their maximum likelihood estimates. Let Inline graphic be the point estimate and Inline graphic be the confidence limits of Inline graphic from one of the three methods. Let 1(Inline graphic) be the indicator function. The mean square error and actual coverage rate for parameter Inline graphic, Inline graphic, are estimated by averaging

4.2.

over 1000 replications of Inline graphic.
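The two performance measures can be sketched for a single rank as follows, assuming each replication supplies the true parameter at that rank together with the point estimate and the two confidence limits. The record layout and function name are hypothetical.

```python
def summarize_replications(records):
    """Estimate mean square error and actual coverage rate for one rank.

    records: list of (theta, estimate, lower, upper), one per replication.
    """
    n = len(records)
    # Average squared error of the point estimate.
    mse = sum((est - theta) ** 2 for theta, est, _, _ in records) / n
    # Fraction of replications whose interval contains the true value.
    coverage = sum(1 for theta, _, lo, hi in records if lo <= theta <= hi) / n
    return mse, coverage
```

The root MSE plotted against rank is then the square root of the first returned value, computed separately for each rank and method.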

The simulation study is conducted for Inline graphic, Inline graphic and Inline graphic with prior Inline graphic given by (4.1); the results are given in Figures 1–3, respectively. In each figure, the left panel shows the estimated root MSE for the lowest 100 ranked genes, Inline graphic for Inline graphic, and the right panel shows the estimated actual coverage rate of nominal 90% confidence intervals for Inline graphic. Figure 1 shows the case where Inline graphic; when Inline graphic is large, the normal distribution in the working prior Inline graphic approximates Inline graphic in true prior Inline graphic extremely well. As expected, the parametric Bayes performs the best among the three methods with the smallest mean square errors and close (to nominal) actual coverage rates for all Inline graphic, Inline graphic. The non-parametric Bayes with Dirichlet process prior performs poorly as the mean square errors are large and the actual coverage rates are far off. This is not surprising because an overly diffuse prior Inline graphic does not bring the needed shrinkage. Figure 3 shows the case for Inline graphic, in which the working prior deviates substantially from the true prior. The parametric Bayes performs poorly with huge MSE and far off actual coverage rates, while the non-parametric Bayes is much superior in both MSE and coverage rates. Figure 2 is for Inline graphic, an intermediate case between Inline graphic and Inline graphic. Neither of the two methods works well, especially for Inline graphic. Our proposed rank-based inference, however, performs well in all three cases. In particular, it is only slightly worse than the parametric Bayes for Inline graphic when the prior is correctly specified. When the prior is mis-specified as in Inline graphic and Inline graphic, its mean square errors are the smallest among the three methods and its actual coverage rates are not too far off from the nominal 90%. The rank-conditioned inference therefore achieves robustness against a mis-specified prior with minimal loss of efficiency under a correctly specified prior. While not shown in the graphs, the superior performance of the rank-conditioned inference is similarly observed for the highest ranked Inline graphic such as Inline graphic. The difference between the methods is small for middle-ranked Inline graphic as their inferences are primarily determined by the large mass at 0, which is present in both the true prior Inline graphic and working prior Inline graphic. Finally, our implementation of Method 2 is based on the R package DPpackage (Jara and others, 2011). Two additional simulation studies are included in supplementary material available at Biostatistics online.
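Why the shape parameter controls the degree of mis-specification can be illustrated with a gamma variate rescaled to variance 1 and shifted to mean 5. This construction is an assumption made for illustration, since the exact parameterization in (4.1) is not reproduced here, and the shape values 1 and 100 are arbitrary. The skewness of a gamma distribution is 2 divided by the square root of its shape, so a large shape gives a nearly normal component and a small shape gives a strongly skewed one.

```python
import math
import random

def shifted_gamma(shape, target_mean, rng):
    """Gamma(shape, scale=1) variate rescaled to variance 1 and
    shifted so that its mean equals target_mean."""
    g = rng.gammavariate(shape, 1.0)   # mean = shape, variance = shape
    return target_mean + (g - shape) / math.sqrt(shape)

def sample_skewness(xs):
    """Third standardized sample moment."""
    n = len(xs)
    m = sum(xs) / n
    m2 = sum((x - m) ** 2 for x in xs) / n
    m3 = sum((x - m) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

rng = random.Random(2)
skewed = [shifted_gamma(1.0, 5.0, rng) for _ in range(20000)]
near_normal = [shifted_gamma(100.0, 5.0, rng) for _ in range(20000)]
```

Both samples share the same mean and variance, so a normal working component matches their first two moments in either case; only the tail shape differs.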

Figure 1.

Simulation study of Section 4.2 with the correct prior in (4.1) and parameter Inline graphic. The left panel is the root mean square error for parameter estimate Inline graphic, Inline graphic, and the right panel is the actual coverage rate of confidence intervals for Inline graphic at the 90% nominal level. Parametric empirical Bayes and rank-conditioned inference perform similarly in this case with smaller mean square error and correct actual coverage rates. The non-parametric Bayes model with Dirichlet prior performs considerably worse compared with the other two methods in both mean square error and actual coverage rate.

Figure 3.

Simulation study of Section 4.2 with the correct prior in (4.1) and parameter Inline graphic. The rank-conditioned inference is the best in terms of MSE and non-parametric Bayes with Dirichlet prior is a close second. The parametric Bayes is a distant third, with much larger MSE for Inline graphic. For the actual coverage rates, the non-parametric Bayes method with Dirichlet prior is the best and the rank-conditioned inference is a close second. The parametric Bayes again is a distant third due to its much lower actual coverage rates.

Figure 2.

Simulation study of Section 4.2 with the correct prior in (4.1) and parameter Inline graphic. The rank-conditioned method has the smallest root mean square errors and actual coverage rates close to the nominal level. The parametric Bayes and non-parametric Bayes with Dirichlet prior perform badly due to the large mean square error, especially for Inline graphic.

5. Application to breast cancer microarray dataset

We now apply our proposed method to the breast cancer data in Wang and others (2005). This was a large Affymetrix-based gene expression profiling study of Inline graphic genes on 286 untreated patients with lymph node-negative primary breast cancer. The data are available at http://www.ncbi.nlm.nih.gov/geo/ as dataset GSE2034. We will compare gene expression level between patients who developed distant metastasis (74 subjects) and patients who were relapse-free at 5 years (135 subjects) among the 209 estrogen receptor positive patients. These data were also analyzed in Noma and others (2010). We use the gene expression model described in Section 2 with Inline graphic being the standardized sample mean difference in log gene expression level and Inline graphic being the true standardized mean difference for the Inline graphicth gene. We have Inline graphic and Inline graphic for all Inline graphic as there are no missing values for any gene. It then follows that Inline graphic in (2.1) is Inline graphic for all Inline graphic.

The maximum likelihood estimates of the parameters in the working prior Inline graphic obtained by assuming models (2.1) and (2.3) for Inline graphic are

η1 μ1 ω1 η2 μ2 ω2
0.0856 0.258 0.0426 0.315 −0.159 0.0470

which suggests that about 40% of the Inline graphic among the Inline graphic genes are non-zero (η1 + η2 = 0.4006). In order to check the fit of the parametric prior Inline graphic, we simulated new data from the fitted Inline graphic and compared its distribution to that of the original data through the following algorithm. Let Inline graphic, where Inline graphic and Inline graphic. The percentiles of Inline graphic and Inline graphic are given in the table below, which shows an excellent fit of model Inline graphic to data Inline graphic. For comparison, we also fit the model used in Noma and others (2010), which is formulated in terms of the unstandardized log fold change Inline graphic. Using the notation of this paper, their model is

5.

where Inline graphic has form (2.3). The discrepancy between the percentiles of the original data Inline graphic and the simulated new data Inline graphic is much larger here. Modeling the standardized log fold change therefore provides a much better fit to this dataset.

5.
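The fit check described above (simulate new data from the fitted working prior, then compare percentiles with the observed data) can be sketched in Python as follows. This is a minimal illustration, not the authors' code or the rank.Shrinkage implementation. It assumes that form (2.3) is a point mass at 0 mixed with two normal components, that ω1 and ω2 are standard deviations, and that the error variance of the standardized mean difference is 1/74 + 1/135. The function names and the choice of percentiles are hypothetical.

```python
import numpy as np

# Maximum likelihood estimates reported in the text.
ETA1, MU1, OMEGA1 = 0.0856, 0.258, 0.0426
ETA2, MU2, OMEGA2 = 0.315, -0.159, 0.0470

# Group sizes from the text: 74 with distant metastasis, 135 relapse-free.
N1, N2 = 74, 135
# ASSUMPTION: error s.d. of a standardized mean difference.
SIGMA = np.sqrt(1.0 / N1 + 1.0 / N2)


def nonzero_fraction():
    """Fitted share of non-zero effects, eta1 + eta2 (about 40%)."""
    return ETA1 + ETA2


def simulate_from_working_prior(n_genes, rng):
    """Draw theta from the assumed spike-and-slab mixture, then add noise.

    ASSUMPTION: omega1 and omega2 are treated as standard deviations.
    """
    probs = [1.0 - ETA1 - ETA2, ETA1, ETA2]
    component = rng.choice(3, size=n_genes, p=probs)
    theta = np.zeros(n_genes)  # component 0 is the point mass at 0
    first = component == 1
    second = component == 2
    theta[first] = rng.normal(MU1, OMEGA1, size=first.sum())
    theta[second] = rng.normal(MU2, OMEGA2, size=second.sum())
    x_new = theta + SIGMA * rng.normal(size=n_genes)
    return theta, x_new


def percentile_table(x_obs, x_new, probs=(1, 5, 25, 50, 75, 95, 99)):
    """Rows of (percentile, observed value, simulated value)."""
    return np.column_stack(
        [probs, np.percentile(x_obs, probs), np.percentile(x_new, probs)]
    )
```

In use, `x_obs` would hold the observed standardized mean differences and `x_new` one simulated replicate of the same length; close agreement between the second and third columns of the table indicates a good fit of the working prior.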

Coming back to the model in Section 2 and using the maximum likelihood estimates for Inline graphic, Inline graphic, Inline graphic, Inline graphic, Inline graphic and Inline graphic obtained above, the standard empirical Bayes estimates Inline graphic and the corresponding 90% confidence intervals for all Inline graphic are then computed under Inline graphic. Rank-conditioned estimates Inline graphic and 90% intervals are also calculated under both Inline graphic and the non-parametric update Inline graphic. Results for Inline graphic that correspond to the five lowest ranked Inline graphic Inline graphic and to the five highest ranked Inline graphic Inline graphic are given in Figure 4 for the three methods. We make three observations. First, the three methods have a huge shrinkage effect on the raw estimate Inline graphic for these top genes. For example, Inline graphic but Inline graphic and Inline graphic (under both Inline graphic and Inline graphic). Second, results from empirical Bayes and rank-conditioned inference under Inline graphic and Inline graphic are very similar, although the rank-conditioned confidence intervals are slightly wider. The same is true for other Inline graphic not shown in Figure 4. This is not surprising, given the excellent fit of working prior Inline graphic to Inline graphic as discussed above. The agreement of the three methods and the robustness properties of the rank-conditioned inference should give us more confidence in the result. Third, an oddity of the rank-conditioned inference is that Inline graphic can be slightly larger than Inline graphic even though Inline graphic by definition. This happens when the difference in rank-conditioned bias for Inline graphic and Inline graphic as random variables exceeds the observed difference between Inline graphic and Inline graphic. The same can happen to estimates of other ranks. This is generally a small peculiarity that is appropriately accounted for by the wide confidence intervals.
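The ordering oddity noted in the third observation can be illustrated with a small Monte Carlo sketch. It is one plausible reading of rank-conditioned shrinkage, not the rank.Shrinkage implementation: the bias at each rank is approximated by averaging, over data simulated from a working prior, the error of the observation that lands at that rank, and the estimate is the observation minus that bias. The prior below reuses the reported estimates under the assumptions that ω1 and ω2 are standard deviations and that the error standard deviation is sqrt(1/74 + 1/135). All function names are hypothetical.

```python
import numpy as np

SIGMA = np.sqrt(1.0 / 74 + 1.0 / 135)  # ASSUMPTION: error s.d.


def draw_theta(n_genes, rng):
    """ASSUMED spike-and-slab working prior using the reported estimates."""
    component = rng.choice(3, size=n_genes, p=[0.5994, 0.0856, 0.315])
    theta = np.zeros(n_genes)
    first = component == 1
    second = component == 2
    theta[first] = rng.normal(0.258, 0.0426, size=first.sum())
    theta[second] = rng.normal(-0.159, 0.0470, size=second.sum())
    return theta


def rank_conditioned_bias(draw, sigma, n_genes, n_rep, rng):
    """Monte Carlo mean of (x - theta) for the observation at each rank."""
    total = np.zeros(n_genes)
    for _ in range(n_rep):
        theta = draw(n_genes, rng)
        x = theta + sigma * rng.normal(size=n_genes)
        order = np.argsort(x)  # order[r] = index of the r-th smallest x
        total += x[order] - theta[order]
    return total / n_rep


def rank_conditioned_estimate(x_obs, bias):
    """Subtract the bias of its rank from each observation."""
    x_obs = np.asarray(x_obs, dtype=float)
    order = np.argsort(x_obs)
    est = np.empty(len(x_obs), dtype=float)
    est[order] = x_obs[order] - bias
    return est


def reversed_ranks(x_sorted, bias):
    """Ranks r where the estimate at rank r + 1 falls below the one at r.

    This occurs exactly when bias[r + 1] - bias[r] exceeds
    x_sorted[r + 1] - x_sorted[r], the condition described in the text.
    """
    est = np.asarray(x_sorted, dtype=float) - np.asarray(bias, dtype=float)
    return np.flatnonzero(np.diff(est) < 0)
```

For instance, if the two largest observations are 1.00 and 1.01 while their rank-conditioned biases are 0.50 and 0.60, the estimates become 0.50 and 0.41, so the second-highest rank receives the larger estimate.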

Figure 4.

Parametric empirical Bayes and rank-conditional inference of Inline graphic for the five lowest ranked and the five highest ranked Inline graphic for the breast cancer dataset in Section 5.

6. Discussion

We have proposed a rank-conditioned inference that can substantially improve the prior robustness of empirical Bayes inference with little loss of efficiency. More research is needed, however, to further develop and establish the proposed method. First, in the simulations presented in Section 4, the actual coverage rates for the rank-conditioned intervals, in spite of being a substantial improvement over standard empirical Bayes, are still below the nominal 90% rate for Inline graphic and Inline graphic. We expect that it is possible to further improve the actual coverage rate by drawing on similar research in the empirical Bayes literature, such as in Morris (1983), Laird and Louis (1987), He (1992), Qiu and Gene Hwang (2007), and Gene Hwang and others (2009). Second, model (2.1) assumes that errors Inline graphic are independent, which can be unrealistic in many applications. We are currently working to relax this requirement to accommodate a more general correlation structure. Preliminary results show that the method in this paper continues to work well if the correlation of Inline graphic is mild. Details will be reported in a future manuscript. We hope this paper will stimulate more research in robust Bayes inference for large-scale data to meet the pressing analytical need in genomics and genetics.

7. Software

Our R package rank.Shrinkage provides a ready-to-use implementation of the proposed methodology. The R code for the simulation studies is available at https://sites.google.com/site/jiangangliao/.

Supplementary material

Supplementary Material is available at http://biostatistics.oxfordjournals.org.

Funding

J.G.L.'s and T.M.'s work was supported by a Penn State Cancer Institute internal grant. Berg's work was supported by Grant Number UL1 TR000127 from the National Center for Advancing Translational Sciences (NCATS). The contents of this paper are solely the responsibility of the authors and do not necessarily represent the official views of the NIH.

Acknowledgements

We are grateful for the extremely thorough review and suggestions by an anonymous reviewer and an associate editor, which led to a greatly improved manuscript. Conflict of Interest: None declared.

References

  1. Benjamini Y., Yekutieli D. False discovery rate-adjusted multiple confidence intervals for selected parameters. Journal of the American Statistical Association. 2005;100(469):71–81.
  2. Choe S. E., Boutros M., Michelson A. M., Church G. M., Halfon M. S. Preferred analysis methods for Affymetrix GeneChips revealed by a wholly defined control dataset. Genome Biology. 2005;6(2):R16. doi: 10.1186/gb-2005-6-2-r16.
  3. Do K. A., Müller P., Tang F. A Bayesian mixture model for differential gene expression. Journal of the Royal Statistical Society. 2005;54(3):627–644.
  4. Dunson D. B. Nonparametric Bayes applications to biostatistics. Bayesian Nonparametrics. 2010;28:223.
  5. Dunson D. B., Taylor J. A. Approximate Bayesian inference for quantiles. Nonparametric Statistics. 2005;17(3):385–400.
  6. Efron B. Microarrays, empirical Bayes and the two-groups model. Statistical Science. 2008;23(1):1–22.
  7. Efron B. Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction. Volume 1. Cambridge: Cambridge University Press; 2010.
  8. Efron B., Tibshirani R., Storey J. D., Tusher V. Empirical Bayes analysis of a microarray experiment. Journal of the American Statistical Association. 2001;96(456):1151–1160.
  9. Eggermont P. P. B., LaRiccia V. N. Nonlinearly smoothed EM density estimation with automated smoothing parameter selection for nonparametric deconvolution problems. Journal of the American Statistical Association. 1997;92(440):1451–1458.
  10. Gene Hwang J. T., Qiu J., Zhao Z. Empirical Bayes confidence intervals shrinking both means and variances. Journal of the Royal Statistical Society. 2009;71(1):265–285.
  11. Ghosh D. Empirical Bayes methods for estimation and confidence intervals in high-dimensional problems. Statistica Sinica. 2009;19(1):125.
  12. Gu J., Ghosal S. Bayesian ROC curve estimation under binormality using a rank likelihood. Journal of Statistical Planning and Inference. 2009;139(6):2076–2083.
  13. Guo L., Lobenhofer E. K., Wang C., Shippy R., Harris S. C., Zhang L., Mei N., Chen T., Herman D., Goodsaid F. M. Rat toxicogenomic study reveals analytical consistency across microarray platforms. Nature Biotechnology. 2006;24(9):1162–1169. doi: 10.1038/nbt1238.
  14. He K. Parametric empirical Bayes confidence intervals based on James-Stein estimator. Statistics and Decisions. 1992;10(1–2):121–132.
  15. Hoff P. D. Extending the rank likelihood for semiparametric copula estimation. The Annals of Applied Statistics. 2007;1(1):265–283.
  16. Ishwaran H., Rao J. S. Spike and slab variable selection: frequentist and Bayesian strategies. The Annals of Statistics. 2005;33(2):730–773.
  17. Jara A., Hanson T. E., Quintana F. A., Müller P., Rosner G. L. DPpackage: Bayesian non-and semi-parametric modelling in R. Journal of Statistical Software. 2011;40(5):1.
  18. Jeffries N. O. Ranking bias in association studies. Human Heredity. 2009;67(4):267–275. doi: 10.1159/000194979.
  19. Kendziorski C. M., Newton M. A., Lan H., Gould M. N. On parametric empirical Bayes methods for comparing multiple groups using replicated gene expression profiles. Statistics in Medicine. 2003;22(24):3899–3914. doi: 10.1002/sim.1548.
  20. Kim S., Dahl D. B., Vannucci M. Spiked Dirichlet process prior for Bayesian multiple hypothesis testing in random effects models. Bayesian Analysis. 2009;4(4):707–732. doi: 10.1214/09-BA426.
  21. Laird N. M., Louis T. A. Empirical Bayes confidence intervals based on bootstrap samples. Journal of the American Statistical Association. 1987;82(399):739–750.
  22. Morris C. N. Parametric empirical Bayes inference: theory and applications. Journal of the American Statistical Association. 1983;78(381):47–55.
  23. Muralidharan O. An empirical Bayes mixture method for effect size and false discovery rate estimation. The Annals of Applied Statistics. 2010;4(1):422–438.
  24. Newton M. A., Kendziorski C. M., Richmond C. S., Blattner F. R., Tsui K. W. On differential variability of expression ratios: improving statistical inference about gene expression changes from microarray data. Journal of Computational Biology. 2001;8(1):37–52. doi: 10.1089/106652701300099074.
  25. Newton M. A., Noueiry A., Sarkar D., Ahlquist P. Detecting differential gene expression with a semiparametric hierarchical mixture method. Biostatistics. 2004;5(2):155–176. doi: 10.1093/biostatistics/5.2.155.
  26. Noma H., Matsui S., Omori T., Sato T. Bayesian ranking and selection methods using hierarchical mixture models in microarray studies. Biostatistics. 2010;11(2):281–289. doi: 10.1093/biostatistics/kxp047.
  27. Qiu J., Gene Hwang J. T. Sharp simultaneous confidence intervals for the means of selected populations with application to microarray data analysis. Biometrics. 2007;63(3):767–776. doi: 10.1111/j.1541-0420.2007.00770.x.
  28. Smyth G. K. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Statistical Applications in Genetics and Molecular Biology. 2004;3(1):3. doi: 10.2202/1544-6115.1027.
  29. Vardi Y., Shepp L. A., Kaufman L. A statistical model for positron emission tomography. Journal of the American Statistical Association. 1985;80(389):8–20.
  30. Wang Y., Klijn J. G. M., Zhang Y., Sieuwerts A. M., Look M. P., Yang F., Talantov D., Timmermans M., Meijer-van Gelder M. E., Yu J. and others. Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer. The Lancet. 2005;365(9460):671–679. doi: 10.1016/S0140-6736(05)17947-1.
  31. Westfall P. H. Comment. Journal of the American Statistical Association. 2005;100(469):85–89. doi: 10.1198/016214504000001817.
