Published in final edited form as: J Am Stat Assoc. 2009 Sep 1;104(487):1015–1028. doi: 10.1198/jasa.2009.tm08523

Empirical Bayes Estimates for Large-Scale Prediction Problems

Bradley Efron

Abstract

Classical prediction methods such as Fisher’s linear discriminant function were designed for small-scale problems, where the number of predictors N is much smaller than the number of observations n. Modern scientific devices often reverse this situation. A microarray analysis, for example, might include n = 100 subjects measured on N = 10,000 genes, each of which is a potential predictor. This paper proposes an empirical Bayes approach to large-scale prediction, where the optimum Bayes prediction rule is estimated employing the data from all the predictors. Microarray examples are used to illustrate the method. The results show a close connection with the shrunken centroids algorithm of Tibshirani et al. (2002), a frequentist regularization approach to large-scale prediction, and also with false discovery rate theory.

Keywords: microarray prediction, empirical Bayes, shrunken centroids, effect size estimation, correlated predictors, local fdr

1 Introduction

An important class of prediction problems begins with the observation of n independent vectors,

\[ (x_j, y_j), \qquad j = 1, 2, \ldots, n. \tag{1.1} \]

Here x_j is an N-vector of predictors, while y_j is a real-valued response, taken to be dichotomous in most of what follows. For example, x_j might include age, height, weight, gender, etc. for person j, while y_j indicates whether or not that person later developed cancer. Given a newly observed N-vector X, we would like to predict its corresponding Y value. Our task is to use the “training data” (1.1) to construct an effective prediction rule.

Classic prediction methods, such as Fisher’s linear discriminant function, were fashioned for problems where N is much smaller than n, that is, where the number of predictors is less than the number of training cases. Current high-throughput scientific technology tends to produce just the opposite situation, with N >> n; modern equipment may permit thousands of measurements on a single individual, but recruiting new subjects remains as difficult as ever.

Microarrays offer the prototypical example. Here xj is a vector of genetic expression measurements on subject j, one for each of N genes, where N is typically several thousand. In the prostate cancer data (Singh et al., 2002) we will use for motivation, there are N = 6033 genes measured on each of n = 102 men, n1 = 50 healthy controls and n2 = 52 prostate cancer patients. Given a new microarray measuring the same 6033 genes, we would like to predict whether or not that man has prostate cancer.

Let ti be the two-sample t-statistic comparing sick versus healthy subjects for gene i,

\[ t_i = c_0\,\frac{\bar{x}_{i2} - \bar{x}_{i1}}{\hat{\sigma}_i} \qquad \left(c_0 = \sqrt{\frac{n_1 n_2}{n}}\right), \tag{1.2} \]

where x̄_{i1} and x̄_{i2} are the mean expression levels on gene i for the healthy and sick subjects, and σ̂_i is the usual pooled estimate of standard deviation. For easier discussion later, we transform the t_i's to a normal scale,

\[ z_i = \Phi^{-1}\big(F_{n-2}(t_i)\big), \tag{1.3} \]

with Φ and F_{n−2} the standard normal and t_{n−2} cumulative distribution functions (cdf), so that under the classical null hypothesis, z_i has a standard normal distribution,

\[ H_0 : z_i \sim N(0, 1). \tag{1.4} \]

Figure 1 shows the histogram of all 6033 z-values. The theoretical N(0, 1) null distribution fits the center of the histogram reasonably well, which makes sense since, presumably, most of the N genes have nothing to do with prostate cancer. However the histogram’s heavy tails suggest some “non-null” genes that express themselves differently in sick and healthy subjects, and those are the ones that should be useful for prediction. Just how to fashion a prediction rule from them is the subject of this paper. (Note: it is not necessary that the zi’s be obtained from t-tests. Each of the N z-value calculations might involve a separate linear regression model, incorporating covariates such as age and weight.)
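As a concrete illustration of (1.2)–(1.3), here is a minimal R sketch (not code from the paper) that computes the two-sample t-statistics and their normal-scale transforms from an expression matrix; the matrix x and the group sizes below are simulated placeholders, not the prostate data.

```r
## Sketch of (1.2)-(1.3): two-sample t-statistics and their z-value transforms.
## Columns 1:n1 of the N x n matrix 'x' are healthy subjects, the rest are sick.
set.seed(1)
N <- 6033; n1 <- 50; n2 <- 52; n <- n1 + n2
x <- matrix(rnorm(N * n), N, n)                 # placeholder data, all genes null

c0    <- sqrt(n1 * n2 / n)
xbar1 <- rowMeans(x[, 1:n1])
xbar2 <- rowMeans(x[, (n1 + 1):n])
SS1   <- rowSums((x[, 1:n1] - xbar1)^2)         # within-group sums of squares
SS2   <- rowSums((x[, (n1 + 1):n] - xbar2)^2)
sigma.hat <- sqrt((SS1 + SS2) / (n - 2))        # pooled estimate, as in (2.8)
mu.hat    <- (xbar1 + xbar2) / 2                # (2.8)

t.stat <- c0 * (xbar2 - xbar1) / sigma.hat      # (1.2)
z <- qnorm(pt(t.stat, df = n - 2))              # (1.3): z_i = Phi^{-1}(F_{n-2}(t_i))
hist(z, breaks = 90)                            # should look N(0,1) for null genes
```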

Figure 1. 6033 z-values from the prostate cancer study (Singh et al., 2002). A standard N(0, 1) density fits the histogram center, while the heavy tails indicate the presence of non-null genes that may be useful for prediction.

Large-scale prediction problems suffer from a surfeit of possible predictors, 6033 of them in this case, most of which are useless. Even the genuinely non-null cases appear to us in exaggerated form. Selection bias, the fact that we can only identify interesting possible predictors at the extremes of the N cases, means that an observed value of, say, z_i = 4 probably corresponds to a true effect considerably nearer the null hypothesis.

This paper uses empirical Bayes methods both to select useful predictors and to undo selection bias in the evaluation of their predictive power. It was suggested by the “shrunken centroids” method of Tibshirani et al. (2002), described in Section 2.

A simple model is introduced in Section 2, which, if we knew the parameter values, would lead to an optimum prediction rule. Section 3 discusses Bayes estimation of the optimum rule, using a model of Brown (1971) and Stein (1981) to assist the calculations (and showing a connection with the theory of local false discovery rates). An empirical Bayes algorithm for approximating the Bayes solution is developed in Section 4. Section 5 modifies the empirical Bayes algorithm to allow for correlation among the predictors. A different problem is considered in Section 6: the estimation of effect sizes for those cases found to be non-null, where our empirical Bayes approach provides an alternative to the False Coverage Rate theory of Benjamini and Yekutieli (2005). Most of the paper concerns dichotomous responses yj, but the results are extended to general response variables, for example survival times, in Section 7. Section 8 concludes the paper with Remarks that expand on some of the technical points and ideas (identified as Remark A, B, etc. throughout the text).

A healthy literature on large-scale prediction has grown up around innovative computer-intensive techniques such as support vector machines, lasso and ridge regression regularization methods, the singular value decomposition and sparse data representation. Chapter 18 of Hastie et al. (2008) provides a nice overview. A main goal here, besides presenting some new methodology, is to trace the inferential connections between Bayesian theory, regularization methods like shrunken centroids, false discovery rates, and large-scale prediction.

2 A Simple Model

Motivation for our empirical Bayes prediction rules comes from a simple idealized probability model for a vector of predictors X = (X1, X2, …, XN). We assume that the individual predictors X_i are independently normal, with (location, scale) parameters (μi, σi), and with possibly different expectations in the two subject classes,

\[ \frac{X_i - \mu_i}{\sigma_i} \;\overset{\text{ind}}{\sim}\; N\!\left(\pm\frac{\delta_i}{2c_0},\, 1\right) \qquad \left\{\begin{array}{ll} - & \text{healthy class} \\ + & \text{sick class} \end{array}\right\}, \tag{2.1} \]

with c0 = (n1n2/n)1/2 as in (1.2). (Here the classes have been labeled ‘healthy’ and ‘sick’ in deference to the prostate example. Section 7 discusses non-dichotomous response variables.) Null cases have δi = 0, indicating no difference between the two classes; non-null cases, particularly those with large values of |δi|, are promising ingredients for effective prediction.

Let

\[ W_i \equiv \frac{X_i - \mu_i}{\sigma_i}, \qquad i = 1, 2, \ldots, N, \tag{2.2} \]

be the standardized versions of the X_i in (2.1). The optimal prediction rule is based on the weighted sum

\[ S = \sum_{i=1}^{N} \delta_i W_i \;\sim\; N\!\left(\pm\frac{\|\delta\|^2}{2c_0},\, \|\delta\|^2\right), \tag{2.3} \]

‖δ‖² = Σ_1^N δ_i², with “±” indicating the two classes as in (2.1). We predict

\[ \text{healthy if } S < 0, \qquad \text{sick if } S > 0. \tag{2.4} \]

Prediction error rates of the first and second kinds, confusing healthy with sick or vice versa, both equal

\[ \alpha = \Phi\!\left(-\frac{\|\delta\|}{2c_0}\right). \tag{2.5} \]

Effective prediction requires a large δ vector. In what follows, prediction error will be called simply “α”. See Remark B of Section 8.
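As a quick numerical check of (2.3)–(2.5), the following R sketch simulates standardized predictors from the two classes with made-up δ values (an assumption for illustration, not quantities estimated from data) and compares the rule's empirical error rate with Φ(−‖δ‖/2c₀).

```r
## Checking the ideal rule (2.3)-(2.4) against its error rate (2.5).
set.seed(2)
n1 <- 50; n2 <- 52; c0 <- sqrt(n1 * n2 / (n1 + n2))
delta <- c(rep(3, 25), rep(-3, 25))                    # 50 hypothetical non-null effects
alpha.theory <- pnorm(-sqrt(sum(delta^2)) / (2 * c0))  # (2.5)

one.trial <- function(sick) {
  sgn <- if (sick) +1 else -1
  W <- rnorm(length(delta), mean = sgn * delta / (2 * c0), sd = 1)  # model (2.1)
  S <- sum(delta * W)                                  # (2.3)
  (S > 0) != sick                                      # TRUE = misclassified by (2.4)
}
err <- mean(replicate(20000, one.trial(sick = sample(c(TRUE, FALSE), 1))))
round(c(alpha.theory = alpha.theory, alpha.simulated = err), 4)
```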

Rule (2.3), (2.4) is Fisher’s linear discriminant function applied to situation (2.1) (Hastie et al., 2008), assuming equal prior probabilities for the two classes. Remark B of Section 8 discusses the case of unequal probabilities. Section 5 considers a more realistic version of (2.1) that allows for correlations among the predictors X_i.

In practice we need to estimate the parameters

\[ (\mu_i, \sigma_i, \delta_i), \qquad i = 1, 2, \ldots, N, \tag{2.6} \]

entering into S = ΣδiWi. This is where the training data

\[ \mathbf{x} = (x_{ij}), \quad i = 1, 2, \ldots, N \ \text{ and } \ j = 1, 2, \ldots, n, \qquad \text{and} \qquad \mathbf{y} = (y_1, y_2, \ldots, y_n), \tag{2.7} \]

with yj equal +1 or -1 depending on the dichotomous classification of subject j, comes in. Ebay, the algorithm used for the numerical calculations here, employs standard estimates for (μi, σi):

\[ \hat{\mu}_i = \frac{\bar{x}_{i1} + \bar{x}_{i2}}{2}, \qquad \hat{\sigma}_i = \left(\frac{SS_{i1} + SS_{i2}}{n - 2}\right)^{1/2}, \tag{2.8} \]

x̄_{i1} and SS_{i1} the mean and within-group sum of squares for gene i measurements in the healthy subjects, and likewise x̄_{i2} and SS_{i2} for the sick subjects.

If σi were known, then

\[ z_i = c_0\,\frac{\bar{x}_{i2} - \bar{x}_{i1}}{\sigma_i} \;\sim\; N(\delta_i, 1) \tag{2.9} \]

would provide an obvious estimate of δ_i, say δ̂_i = z_i. With σ_i unknown, we convert the t-statistic t_i to the normal scale as in (1.2), (1.3). Remark F considers this transformation more carefully, but for now we will ignore it, and use the approximation z_i ∼ N(δ_i, 1) for our actual z-values (1.3).

Selection bias makes the δ̂_i = z_i values overinflated estimates of the true δ_i's. Suppose that for the prostate data we decided to employ the genes having the 51 largest values of |δ̂_i| for prediction. The vector of those 51 δ̂_i's has ‖δ̂‖ = 27.3, suggesting α = .003 in (2.5) (using c₀ = (50 · 52/102)^{1/2} = 5.05). The empirical Bayes calculations of Section 3 show that a more realistic estimate for the actual 51-vector's length is 19.8, giving α = .025.

The shrunken centroids algorithm of Tibshirani et al. (2002) counteracts selection bias by shrinking the estimates δ̂_i = z_i toward zero according to a soft thresholding rule,

\[ \hat{\delta}_i = \operatorname{sign}(z_i)\,\big(|z_i| - \lambda\big)_+. \tag{2.10} \]

In words, each value z_i is shrunk toward zero by amount λ, under the restriction that shrinking never goes past zero. A range of possible shrinkage parameters λ is tried, and for each one a prediction rule like (2.3) is formed, using

\[ \hat{S}_\lambda = \sum \hat{\delta}_i \hat{W}_i \qquad \left[\hat{W}_i = \frac{X_i - \hat{\mu}_i}{\hat{\sigma}_i}\right] \tag{2.11} \]

for prediction as in (2.4). Cross-validation is then employed to estimate αλ, the true error rate. (This description takes some liberties with the details of the shrunken centroids procedure.)

Notice that only cases having |zi| > λ enter into the prediction statistic Ŝλ. This is a favorable property: prediction is easier to implement and understand when the number of predictors is small.
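The soft thresholding step (2.10) amounts to a single line of R. The sketch below is a simplified stand-in for pamr, not its actual code; z is the vector of z-values from the earlier sketch and λ is a hypothetical threshold.

```r
## Soft thresholding (2.10) and the genes entering the statistic (2.11).
soft.threshold <- function(z, lambda) sign(z) * pmax(abs(z) - lambda, 0)

lambda    <- 2.16                            # one candidate shrinkage value
delta.hat <- soft.threshold(z, lambda)       # shrunken estimates
keep      <- which(delta.hat != 0)           # only cases with |z_i| > lambda survive
length(keep)                                 # number of genes that would enter S.hat

## For a new standardized vector W.hat = (X - mu.hat)/sigma.hat:
## S.hat <- sum(delta.hat[keep] * W.hat[keep])    # predict "sick" if S.hat > 0
```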

Table 1 shows a shrunken centroids analysis for the prostate data, carried out using pamr, a CRAN package in the R language. Cross-validation suggests λ = 2.16 as the best shrinkage parameter (so for instance z_i = 4 yields δ̂_i = 1.84 in (2.10)), with estimated error rate α̂_CV = .09; 377 of the 6033 genes are involved in Ŝ_λ. Unlike the theoretical result (2.5), adding too many predictors eventually decreases prediction accuracy in actual practice.

Table 1.

Shrunken centroids prediction for the prostate data, (2.10), (2.11), using the R package pamr (CRAN). The shrinkage parameter λ = 2.16 yields the smallest cross-validated error estimate, α̂_CV = .09. Prediction statistic Ŝ_λ involves 377 of the 6033 genes.

shrinkage value λ    # nonzero genes    CV error rate
0.00 6033 0.34
0.54 3763 0.33
1.08 1931 0.23
1.62 866 0.12
2.16 377 0.09
2.70 172 0.10
3.24 80 0.16
3.78 35 0.30
4.32 4 0.41
4.86 1 0.48
5.29 0 0.52

Looking at Table 1, it seems we should use λ = 2.16 in our prediction rule. There is, however, a subtle danger lurking here: because cross-validation is involved in the choice of “best” λ, the estimated rate .09 may be downwardly biased. It would take a second level of cross-validation to correct this bias.

A small simulation study was run with N = 1000, n₁ = n₂ = 10, and all x_{ij} independently N(0, 1). In this case δ_i = 0 for every i in (2.1), so α = .50 at (2.5); but the minimum cross-validated error rates observed in 100 repetitions of this set-up had median .30 with standard deviation ±.16.

This is an extreme example. Usually the downward bias is less severe, particularly when good prediction is possible. Nevertheless we will try to avoid such biases in what follows by using rules where the cross-validation calculations are not involved in the choice of tuning parameters.

3 Bayesian Prediction

Suppose we had a Bayesian prior distribution for the parameters in model (2.1) that enabled us to calculate posterior expectations for the δi’s, say

\[ \tilde{\delta}_i = E\{\delta_i \mid \mathbf{z}\}. \tag{3.1} \]

Bayes estimates are immune to selection bias: even if z_i were selected because it was the largest of the N z-values (z_i = 5.29 for gene i = 610 in the prostate data), δ̃_i would still be the correct Bayes estimate for δ_i. We could, for example, use the 50 largest values of δ̃_i to form S̃ = Σδ̃_iŴ_i, as in (2.3) or (2.11), while maintaining at least some confidence in the error rate estimate α̃ = Φ(−‖δ̃‖/2c₀). See Senn (2008) and Dawid (1994) for discussions of the “paradox” of Bayesian immunity to selection effects, including its dangers.

Brown (1971) and Stein (1981) developed a Bayesian model that is especially convenient for calculating δ̃_i in (3.1). For any (δ, z) pair we suppose that δ has a prior density g(δ),

\[ \delta \sim g(\cdot) \qquad \text{and} \qquad z \mid \delta \sim N(\delta, 1), \tag{3.2} \]

so that z has marginal density

\[ f(z) = \int_{-\infty}^{\infty} \varphi(z - \delta)\, g(\delta)\, d\delta \qquad \left[\varphi(z) = \frac{e^{-z^2/2}}{\sqrt{2\pi}}\right]. \tag{3.3} \]

Theorem 1. Under model (3.2), the posterior density of δ given z is

\[ g(\delta \mid z) = e^{\delta z - \psi(z)}\left[e^{-\delta^2/2} g(\delta)\right], \qquad \text{with} \quad \psi(z) = \log\!\left(\frac{f(z)}{\varphi(z)}\right). \tag{3.4} \]

Proof. According to Bayes theorem,

\[ g(\delta \mid z) = \frac{\varphi(z - \delta)\, g(\delta)}{f(z)}, \tag{3.5} \]

which reduces immediately to (3.4).

Form (3.4) represents an exponential family having sufficient statistic δ, natural (canonical) parameter z, and cumulant generating function (cgf) ψ(z). Therefore the conditional cumulants of δ given z can be obtained by differentiating ψ with respect to z:

Corollary 1.

\[ E\{\delta \mid z\} = \psi'(z) \qquad \text{and} \qquad \operatorname{Var}\{\delta \mid z\} = \psi''(z). \tag{3.6} \]

(Brown and Stein used multivariate versions of (3.6), differently derived, in their exploration of high-dimensional estimation theory.)

The advantage of Corollary 1 is that ψ′(z), and with it the conditional cumulants of δ given z, are obtained directly from the marginal density f(z), without requiring explicit calculation of the prior g(δ), finessing the usual difficulties of deconvolution.

The algorithm Ebay described in Section 4 approximates E{δ|z} and Var{δ|z} by substituting a smoothed estimate ψ̂(z) into (3.6). Figure 2 displays the Ebay output Ê{δ|z} for the prostate data, comparing it to the shrunken centroids curve (2.10) for λ = 2.16, the preferred choice in Table 1. Ê is better matched to the choice λ = 1.42 in (2.10), suggesting that less shrinking is better here.

Figure 2. Heavy curve is Ê{δ|z} for the prostate data, Ebay algorithm, Section 4, compared with the best shrunken centroids curve (2.10), λ = 2.16. Also shown is SD̂ = Vâr{δ|z}^{1/2}. At z = 4, Ê = 2.49, shrunken centroid = 1.84, SD̂ = .98. Remark I explains the slight positive slope of Ê{δ|z} for z in (−2, 1.5).

Suppose we add to the Brown—Stein model (3.2) the assumption that the prior distribution of δ has a discrete atom of probability at δ = 0 (see Remark C, Section 8),

\[ p_0 = \operatorname{Prob}\{\delta = 0\}. \tag{3.7} \]

Then Bayes theorem yields

\[ \operatorname{fdr}(z) = \operatorname{Prob}\{\delta = 0 \mid z\} = \frac{p_0\, \varphi(z)}{f(z)}, \tag{3.8} \]

fdr(z) being the “local false discovery rate,” Efron (2008). Comparing this with (3.4), (3.6) gives

Corollary 2. Under model (3.2), (3.7),

\[ E\{\delta \mid z\} = -\frac{d}{dz}\log\big(\operatorname{fdr}(z)\big) \qquad \text{and} \qquad \operatorname{Var}\{\delta \mid z\} = -\frac{d^2}{dz^2}\log\big(\operatorname{fdr}(z)\big). \tag{3.9} \]

It seemingly makes sense that only genes with low false discovery rates should be utilized in prediction rules. The corollary shows that this is roughly true, but in a rather surprising manner: large values of δ̃_i = E{δ_i | z_i} depend on the rate of change of log(fdr(z_i)), not on fdr(z_i) itself. Small values of fdr(z_i) usually correspond to large values of δ̃_i, but this doesn't have to be the case. Usually log(fdr(z_i)) is nearly constant around z = 0, where fdr(z) is near 1. This forces both Ê and SD̂ to be small, as seen in Figure 2 (see Remark I).

4 Empirical Bayes Prediction

The Ebay algorithm that produced Figure 2 employs empirical Bayes methods to construct effective prediction rules. That is, it uses z, the vector of all N z-values, to estimate the Bayes prediction rule (2.3), (2.4). Here is a schematic description of Ebay's operation (a rough R sketch of Steps 1–5 follows the list):

  1. A target error rate α0 is selected (default α0 = .025).

  2. An estimate f̂(z) for the marginal density f(z), (3.3), is obtained using Poisson regression applied to z; see Remark D.

  3. The estimated cumulant generating function ψ̂(z) = log(f̂(z)/φ(z)), (3.4), is numerically differentiated to give
    \[ \hat{\delta}_i = \hat{\psi}'(z_i) = \hat{E}\{\delta_i \mid z_i\} \tag{4.1} \]
    as in (3.6).
  4. Letting δ̂_I be the vector of the I largest δ̂_i's (in absolute value), I is chosen to be the smallest integer such that the nominal error rate Φ(−‖δ̂_I‖/2c₀), (2.5), is less than α₀; that is, I is the minimum choice yielding
    \[ \frac{\|\hat{\delta}_I\|}{2c_0} \;\ge\; \Phi^{-1}(1 - \alpha_0). \tag{4.2} \]
  5. The empirical Bayes prediction rule is based on the sign of
    \[ \hat{S} = \sum_{I} \hat{\delta}_i \left(\frac{X_i - \hat{\mu}_i}{\hat{\sigma}_i}\right), \tag{4.3} \]
    with (μ̂_i, σ̂_i) as in (2.8).
  6. Repeated 10-fold cross-validation is used to furnish an unbiased estimate of the rule’s prediction error; see Remark G.
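The following R sketch walks through Steps 1–5 for illustration only; it is not the Ebay program itself, and the names z, x, n1, n2, mu.hat, and sigma.hat are assumed to carry over from the earlier sketches.

```r
## A rough rendering of Steps 1-5 of the Ebay recipe (see also Remark D).
library(splines)
alpha0 <- 0.025
c0 <- sqrt(n1 * n2 / (n1 + n2))

## Step 2: Poisson regression estimate of the marginal density f(z)
brk   <- seq(min(z) - .1, max(z) + .1, length = 91)   # 90 bins
h     <- hist(z, breaks = brk, plot = FALSE)
fit   <- glm(h$counts ~ ns(h$mids, df = 7), family = poisson)$fitted.values
f.hat <- fit / (sum(fit) * diff(brk)[1])              # rescale counts to a density
mids  <- h$mids

## Step 3: psi.hat = log(f.hat/phi), numerically differentiated over the bin grid
psi.hat   <- log(f.hat / dnorm(mids))
dpsi      <- diff(psi.hat) / diff(mids)
delta.fun <- approxfun((mids[-1] + mids[-length(mids)]) / 2, dpsi, rule = 2)
delta.hat <- delta.fun(z)                             # (4.1): E.hat{delta_i | z_i}

## Step 4: add predictors in order of |delta.hat| until (4.2) is satisfied
ord   <- order(abs(delta.hat), decreasing = TRUE)
I     <- which(sqrt(cumsum(delta.hat[ord]^2)) / (2 * c0) >= qnorm(1 - alpha0))[1]
genes <- ord[1:I]                                     # cases entering the rule

## Step 5, for a new observation vector X:
## S.hat <- sum(delta.hat[genes] * (X[genes] - mu.hat[genes]) / sigma.hat[genes])
```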

Table 2 shows a portion of Ebay's output for the prostate data. Its prediction rule employs the genes with the 51 largest values of |δ̂_i|, at which point (4.2) is first satisfied (compared with 377 genes for the apparently best shrunken centroids rule in Table 1). An unbiased error estimate, based on 20 randomized 10-fold cross-validation runs, was .092, the same as the minimum error seen in Table 1; see Remark G.

Table 2.

Ebay prediction rule for the prostate data; the rule uses the genes with the 51 largest |δ̂_i| values, α̂ = Φ(−‖δ̂‖/2c₀) = .025. Cross-validation error rate .092 ± .004. Column α̂_cor is explained in Section 5.

Step Index z-value δ^ α^ α^cor
1 610 5.29 4.30 0.335 0.335
2 1720 4.83 3.78 0.285 0.281
3 364 -4.42 -3.70 0.250 0.250
4 3940 -4.33 -3.64 0.222 0.222
5 4546 -4.29 -3.58 0.199 0.215
6 4331 -4.14 -3.40 0.182 0.189
7 332 4.47 3.34 0.167 0.181
8 914 4.40 3.24 0.154 0.166
9 1068 4.25 3.06 0.144 0.148
10 4088 -3.88 -3.05 0.135 0.149
45 4154 -3.38 -2.26 0.029 0.050
46 2 3.57 2.25 0.028 0.050
47 2370 3.56 2.24 0.028 0.049
48 3282 3.56 2.23 0.027 0.048
49 3505 -3.33 -2.18 0.026 0.046
50 905 3.51 2.18 0.025 0.047
51 4040 -3.33 -2.17 0.025 0.048

There are, potentially, many reasons why the nominal error rate .025 might be over-optimistic: (μ̂_i, σ̂_i) in (2.8) does not equal (μ_i, σ_i); the X_i are not normally distributed; the X_i are not independent (see Section 5); the empirical Bayes estimates δ̂_i differ from the actual Bayes estimates (3.1).

This last point can cause particular trouble at the extremes of the z scale, just where δ^(z) is largest but there are fewest zi’s for the estimation of δ^. Figure 3 concerns the following artificial situation, using notation similar to that for the prostate data and model (2.1):

\[ \begin{aligned} & N = 5000, \quad n_1 = n_2 = 20, \quad \delta_i \overset{\text{ind}}{\sim} N(1.5, 1) \ \text{ for } i = 1, 2, \ldots, 250, \quad \delta_i = 0 \ \text{ for } i = 251, 252, \ldots, 5000, \\ & x_{ij} \overset{\text{ind}}{\sim} N\!\left(\pm\frac{\delta_i}{2c_0},\, 1\right) \ \text{ for all } i \text{ and } j, \qquad c_0 = \sqrt{\frac{20 \cdot 20}{40}}. \end{aligned} \tag{4.4} \]

This results in

\[ z_i \overset{\text{ind}}{\sim} N(\delta_i, 1) \tag{4.5} \]

at (2.9), with δi ~ N (1.5, 1) for the first 250 genes, and 0 otherwise.

Figure 3. True curve E{δ|z} (heavy), compared to Ê{δ|z} = ψ̂′(z) for 50 simulations from model (4.4). The estimates Ê are reasonably accurate for z < 4, but fall apart for larger z values.

Figure 3 compares Ê{δ|z} from the Ebay algorithm with the true curve E{δ|z}. The estimates are reasonably accurate up to z = 4, but degenerate beyond that. Remark E of Section 8 derives a delta-method formula for the standard error of Ê{δ|z} that predicts this behavior. Modifying (4.4), (4.5) so that the zi’s were correlated, with root mean square correlation coefficient 0.1, increased the variability of Ê{δ|z} by roughly 50%.

An option in Ebay allows for truncation of the δ̂ estimation procedure at some number “ktrunc” of observations in from the extremes. With ktrunc = 5, for instance, δ̂_i for the five largest z_i values is set equal to the largest δ̂_i among the remaining N − 5 cases, and similarly at the negative end of the z scale.

Figure 4 shows the actual misclassification error probabilities α for 200 simulations from model (4.4), each time using the Ebay prediction rule with nominal error rate α0 = .025. As the truncation parameter ktrunc increases from 0 to 15, the actual prediction errors α decrease toward the target value .025. Table 3 displays the means and standard deviations for the data in Figure 4.

Figure 4. Actual prediction errors α of the Ebay rule with nominal α₀ = .025; 200 simulations from model (4.4). As the truncation parameter increases from 0 (rightmost histogram) to 15 (leftmost), the actual errors decrease toward the nominal α₀.

Table 3.

Means and standard deviations for actual prediction errors α in simulation experiment for Figure 4.

ktrunc: 15 10 5 0
Mean: .032 .038 .051 .066
SD: .019 .018 .022 .025

Truncation had a less dramatic effect on the prostate data: for ktrunc = 0, 5, 10, 15, the cross-validated error estimates were .092, .085, .070, .077. Lowering the target rate from α0 = .025 to .01 gave corresponding error estimates .070, .062, .061, .058. Correlation among the predictors is part of the problem here; see Section 5.

Our original error estimate α^=.092 is “honest”, i.e., nearly unbiased for the Ebay rule produced with (α0, ktrunc) = (.025, 0). So are the α^ estimates for the other (α0, ktrunc) combinations. Choosing the combination with the smallest α^, however, again raises the possibility of over-optimism, as discussed at the end of Section 2.

More elaborate “honest” selection criteria, beyond the current capabilities of Ebay, might involve minimizing a linear combination of nominal error rate and number of predictors, say

\[ \Phi\!\left(-\frac{\|\hat{\delta}_I\|}{2c_0}\right) + C \cdot I \tag{4.6} \]

over all choices of I; accounting for correlation as in Section 5; adjusting for non-normality; using theoretical or data-based techniques to choose the truncation parameter; etc.

Some “snooping” into the cross-validation estimates seems inevitable in applications. Nevertheless, I believe that holding snooping to a minimum is good practice for honest prediction assessment, and that empirical Bayes methods, perhaps further refined, can be sufficiently accurate to allow for a nearly-honest practical methodology.

5 Correlation Corrections

The assumption of case-wise independence in model (2.1) is likely to be untrue, perhaps spectacularly untrue, in many applications. Suppose that the vector W of standardized predictors Wi = (Xiμi)/σi, (2.2), actually has covariance matrix Σ. Then both error probabilities in (2.5) become

\[ \alpha = \Phi(-\Delta_0\, \eta), \qquad \text{where} \quad \Delta_0 = \frac{\|\delta\|}{2c_0} \quad \text{and} \quad \eta = \left(\frac{\delta^t \delta}{\delta^t \Sigma\, \delta}\right)^{1/2}. \tag{5.1} \]

Here Δ0 is the independence value, while η is a correction factor, usually less than 1, that increases the error rate α. See Remark K of Section 8.

If we can estimate Σ we can estimate correction factor η,

\[ \hat{\eta} = \left(\frac{\hat{\delta}^t \hat{\delta}}{\hat{\delta}^t \hat{\Sigma}\, \hat{\delta}}\right)^{1/2}. \tag{5.2} \]

According to (2.1), cov(W) = Σ has diagonal elements 1 in both classes, so the off-diagonal elements ρ_{ii′} are correlations. Notice that we need to estimate these for only the I cases selected by the Ebay algorithm at (4.2), not for all N cases. For the prostate data we need to estimate a 51 × 51 correlation matrix Σ, from the 51 × 102 data submatrix x_I of the full 6033 × 102 matrix x whose rows are indexed by the first column of Table 2.
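A minimal sketch of the correction (5.1)–(5.2): the sample correlation matrix of the selected genes' within-class residuals is plugged into η̂. The object names (x, genes, delta.hat, c0) are carried over from the earlier sketches and are assumptions, not Ebay's internal code.

```r
## Correlation-corrected error rate, (5.1)-(5.2), a rough sketch.
xI      <- x[genes, ]                               # I x n submatrix of training data
healthy <- 1:n1; sick <- (n1 + 1):(n1 + n2)
resid   <- cbind(xI[, healthy] - rowMeans(xI[, healthy]),   # subtract the two class
                 xI[, sick]    - rowMeans(xI[, sick]))      # means separately
Sigma.hat <- cor(t(resid))                          # I x I sample correlation matrix

d         <- delta.hat[genes]
eta.hat   <- sqrt(sum(d^2) / drop(t(d) %*% Sigma.hat %*% d))  # (5.2)
alpha.cor <- pnorm(-eta.hat * sqrt(sum(d^2)) / (2 * c0))      # (5.1)
```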

The last column of Table 2 in Section 4 shows α̂_cor, obtained from (5.1), (5.2), with Σ̂ the usual sample correlation matrix. Correlation degrades the nominal error probability from .025 to .048 (closer to the cross-validation estimate .092). Much of the degradation is due to three large correlations,

\[ r_{34,19} = .97, \qquad r_{36,15} = .65, \qquad r_{42,28} = .92, \tag{5.3} \]

the subscripts referring to the steps in Table 2.

Table 4 concerns a microarray study having more severe correlation problems, the Michigan lung cancer study discussed in Subramanian et al. (2005). There are N = 5217 genes, n = 86 subjects, n1 = 62 “good outcomes” and n2 = 24 “poor outcomes”. Here the Ebay algorithm stopped after 200 steps, without α^ reaching the target value α0 = .025. The correlation-corrected errors α^cor are much more pessimistic, actually increasing after the first 6 steps, eventually to α^cor=.360. A cross-validation error rate of .37 confirmed the pessimism. Restricting Ebay to use at most I = 10 predictors reduced the cross-validated error rate to .29, as suggested by Table 4 (an example of the kind of “snooping” disparaged at the end of Section 4, unless the decision to use the I = 10 Ebay prediction rule was made before the cross-validation calculations).

Table 4.

Ebay output for the Michigan lung cancer study. The correlation error estimates α̂_cor are much more pessimistic, as confirmed by cross-validation.

Step Index z-value δ^ α^ α^cor
1 3144 4.62 3.683 0.3290 0.329
2 2446 4.17 3.104 0.2813 0.307
3 4873 4.17 3.104 0.2455 0.256
4 1234 3.90 2.686 0.2234 0.225
5 621 3.77 2.458 0.2072 0.213
6 676 3.70 2.323 0.1942 0.228
7 2155 3.69 2.313 0.1824 0.230
8 3103 3.60 2.140 0.1731 0.236
9 1715 3.58 2.103 0.1647 0.240
10 452 3.54 2.028 0.1574 0.243
193 3055 2.47 0.499 0.0519 0.359
194 1655 -2.21 0.497 0.0518 0.359
195 2455 2.47 0.496 0.0517 0.359
196 3916 2.47 0.496 0.0516 0.359
197 4764 2.47 0.495 0.0515 0.359
198 1022 -2.20 -0.492 0.0514 0.359
199 1787 -2.19 -0.490 0.0513 0.360
200 901 -2.18 -0.486 0.0512 0.360

Sample correlation matrices tend toward overdispersion when n is small compared to the number of variates. Ebay includes an option for empirical Bayes shrinkage of the elements of Σ^; see Remark H.

6 Effect Size Estimation

Current developments in large-scale simultaneous inference have focused on hypothesis testing, where the goal is to identify a small number of non-null cases among a large number of potential candidates. See Dudoit et al. (2003) for a nice review. Benjamini and Yekutieli (2005) address a more ambitious goal: to assess the effect sizes for the non-null cases, that is, to estimate how far away they lie from the null hypothesis. The empirical Bayes theory of Section 4 provides an alternative approach to effect size estimation.

We begin with assumptions (3.2), (3.7), that

\[ z_i \sim N(\delta_i, 1), \qquad i = 1, 2, \ldots, N, \tag{6.1} \]

and that proportion p0 of the effects δi equal 0,

\[ p_0 = \operatorname{Prob}\{\delta_i = 0\}, \tag{6.2} \]

these being the uninteresting null cases. The local false discovery rate fdr(z) = p₀φ(z)/f(z), (3.8), is the Bayes posterior probability Prob{δ_i = 0 | z_i}. If fdr^(z_i), an estimate of fdr(z_i), is suitably small, then case i can be reported as “probably non-null”, and we would like to put some sort of confidence limits on the effect size δ_i. The prior g(δ) in (3.2) is now of the mixed form

\[ g(\delta) = p_0 I_0(\delta) + (1 - p_0)\, g_1(\delta), \tag{6.3} \]

where I0(δ) is a delta-function at 0, and g1(δ) indicates the density of the non-null cases (see Remark C). Then the mixture density f(z), (3.3), becomes

\[ f(z) = p_0 \varphi(z) + (1 - p_0) f_1(z), \qquad \text{where} \quad f_1(z) = \int_{-\infty}^{\infty} \varphi(z - \delta)\, g_1(\delta)\, d\delta. \tag{6.4} \]

Theorem 2. Under model (6.1), (6.2), the posterior density of effect size δ given z and given that δ ≠ 0 is

\[ g_1(\delta \mid z) = e^{\delta z - \psi_1(z)}\left[e^{-\delta^2/2} g_1(\delta)\right], \qquad \text{where} \quad \psi_1(z) = \log\!\left\{\frac{1 - \operatorname{fdr}(z)}{\operatorname{fdr}(z)} \Big/ \frac{1 - p_0}{p_0}\right\}. \tag{6.5} \]

Proof. Bayes rule says that g1(δ|z) = φ(zδ)g1(δ)/f1(z), yielding

\[ g_1(\delta \mid z) = e^{\delta z - \log\{f_1(z)/\varphi(z)\}}\left[e^{-\delta^2/2} g_1(\delta)\right]. \tag{6.6} \]

An equivalent form of (3.8) is

\[ 1 - \operatorname{fdr}(z) = \operatorname{Prob}\{\delta \neq 0 \mid z\} = \frac{(1 - p_0)\, f_1(z)}{f(z)}, \tag{6.7} \]

from which we obtain, using (6.4),

\[ \frac{f_1(z)}{\varphi(z)} = \frac{p_0}{1 - p_0} \cdot \frac{1 - \operatorname{fdr}(z)}{\operatorname{fdr}(z)}. \tag{6.8} \]

Combining (6.8) and (6.6) verifies Theorem 2.

As in (3.6), the conditional moments of a non-null δ (one for which δ ≠ 0) given z are obtained by differentiating ψ1(z),

E1{δz}=ψ1andVar1{δz}=ψ1(z), (6.9)

where the subscript “1” indicates conditioning on δ ≠ 0. Some calculation gives E1 and Var1 in terms of E{δ|z} and Var{δ|z} in (3.6):

Corollary 3. Under model (6.1), (6.2),

\[ E_1\{\delta \mid z\} = \frac{E\{\delta \mid z\}}{1 - \operatorname{fdr}(z)} \qquad \text{and} \qquad \operatorname{Var}_1\{\delta \mid z\} = \frac{1}{1 - \operatorname{fdr}(z)}\left[\operatorname{Var}\{\delta \mid z\} - \frac{\operatorname{fdr}(z)}{1 - \operatorname{fdr}(z)}\, E\{\delta \mid z\}^2\right]. \tag{6.10} \]

Note. Since δ = 0 with probability fdr(z), we have

\[ E\{\delta^j \mid z\} = \big[1 - \operatorname{fdr}(z)\big]\, E_1\{\delta^j \mid z\}. \tag{6.11} \]

Using (6.11) with j = 1 and 2 leads to a quick verification of (6.10).

Our prediction algorithm in Section 4 requires only the estimation of E{δ|z}. Effect size estimation is more difficult, requiring Var{δ|z} and fdr(z) as well. The plug-in estimate of Var₁{δ|z} in (6.10) may be particularly unstable, in which case we can conservatively replace it with the estimate of Var{δ|z}, as shown next.

Rearranging (6.10) yields

\[ \frac{\operatorname{Var}_1}{\operatorname{Var}} = \frac{1 - \operatorname{fdr}(z)\, Q(z)}{1 - \operatorname{fdr}(z)}, \qquad \text{where} \quad Q(z) = \frac{E^2}{(1 - \operatorname{fdr}(z))\operatorname{Var}} \tag{6.12} \]

(with Var = Var{δ|z}, E = E{δ|z}, etc.), so that

\[ \operatorname{Var}_1 \le \operatorname{Var} \qquad \text{if} \qquad \operatorname{Var} \le \frac{E^2}{1 - \operatorname{fdr}(z)}. \tag{6.13} \]

Since Var is usually near 1, this last condition is satisfied whenever δ̂ = Ê{δ|z} gets large enough to be interesting; in the case of the prostate data, for z ≥ 2.

Figure 5 demonstrates effect size estimation for the prostate data. The Poisson GLM estimate of f(z) described in Remark D provides the estimates fdr^(z), Ê{δ|z}, and Vâr{δ|z}, as in Figure 2. The curved band in Figure 5 follows

\[ \hat{E}\{\delta \mid z\}\big/\big[1 - \widehat{\operatorname{fdr}}(z)\big] \;\pm\; \widehat{\operatorname{Var}}\{\delta \mid z\}^{1/2} \tag{6.14} \]

as in Corollary 3, showing approximate 68% intervals for δ given z and given δ ≠ 0, made more conservative by replacing Vâr₁ with Vâr. At z = 4, for example, we estimate that either δ = 0 with probability fdr^(4) = .048, or, if δ ≠ 0, it lies in the interval [1.58, 3.64] with estimated posterior probability exceeding .68. (Remember that δ, as defined in (2.1), is the number of standard deviations separating the two class means, multiplied by c₀.)
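Continuing the earlier density-estimation sketch, the band (6.14) can be computed from f̂(z) by crude finite differences on the bin grid; p0 below is a plausible placeholder (see Remark D), not an estimate produced here.

```r
## Approximate 68% non-null interval (6.14) at a chosen z0, a rough sketch.
## Uses f.hat, mids, psi.hat from the earlier Ebay sketch.
p0   <- 0.93                                   # placeholder null proportion (Remark D)
binw <- diff(mids)[1]
E.delta <- c(NA, diff(psi.hat) / binw)                            # ~ E{delta | z}
V.delta <- c(NA, diff(psi.hat, differences = 2) / binw^2, NA)     # ~ Var{delta | z}
fdr.hat <- pmin(p0 * dnorm(mids) / f.hat, 1)                      # (3.8)

z0 <- 4
k  <- which.min(abs(mids - z0))                # nearest bin midpoint
center <- E.delta[k] / (1 - fdr.hat[k])        # E1{delta | z}, (6.10)
band   <- center + c(-1, 1) * sqrt(max(V.delta[k], 0))   # conservative Var, per (6.14)
c(fdr = fdr.hat[k], lower = band[1], upper = band[2])
```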

Figure 5. Effect size estimation for the prostate data. The band is an approximate 68% interval for δ given z and given δ ≠ 0; also shown is fdr^(z), the estimated probability that δ = 0 given z. At z = 4, fdr^(z) = .048, with interval [1.58, 3.64] if δ is non-null.

Benjamini and Yekutieli’s (2005) False Coverage Rate algorithm provides conservative frequentist confidence bounds on the cases declared non-null by an FDR testing procedure, assuming independence of the z_i's. There is, however, a heavy price to pay: the bounds tend to be very wide. For z = 4 in the prostate example, their 68% interval is [1.36, 6.64] (using their Definition 1, with q = .32). Part of the problem, as discussed in Section 7 of Efron (2008), is that the Benjamini/Yekutieli procedure does not split off an atom of probability at δ = 0, though splitting seems natural in the hypothesis testing framework of (3.7) or (6.3).

The approximate 68% non-null limits (6.14) were calculated for 25 replications of simulation model (4.4). They appear in Figure 6, along with the true Bayesian posterior limits (z + 1.5)/2 ± (1/2)^{1/2}. Using Vâr instead of Vâr₁ in (6.14) makes the intervals too wide, but their overall performance is acceptable as rough estimates of effect size.

Figure 6. Approximate effect size limits (6.14) for 25 replications of simulation model (4.4). Heavy straight lines are the actual 68% Bayes posterior limits for non-null cases.

7 Other Response Variables

The development so far has concerned dichotomous response variables: healthy versus sick in the prostate example. This section extends the empirical Bayes prediction methodology to general univariate responses.

Let Y be a univariate response of interest, for example a survival time, that we wish to predict from X = (X₁, X₂, …, X_N)ᵗ as in Section 2. For convenience we assume that Y has been standardized to have mean 0 and variance 1, denoted

\[ Y \sim (0, 1), \tag{7.1} \]

though this will play no role in the actual methodology.

We suppose that Y influences the standardized variable W_i = (X_i − μ_i)/σ_i, (2.2), through linear regression,

\[ W_i = \beta_i Y + \epsilon_i, \qquad i = 1, 2, \ldots, N, \tag{7.2} \]

Var(ϵ_i) = 1, where the vector of errors ϵ is uncorrelated with Y. In the dichotomous situation of (2.1), Y = −1 or 1 and β_i = δ_i/2c₀. Effective prediction of Y depends upon discovering those X_i's with large values of |β_i|. See Remark J of Section 8.

The joint distribution of Y and W has mean vector and covariance matrix

\[ \begin{pmatrix} Y \\ \mathbf{W} \end{pmatrix} \sim \left[\begin{pmatrix} 0 \\ \mathbf{0} \end{pmatrix},\ \begin{pmatrix} 1 & \beta^t \\ \beta & \beta\beta^t + \Sigma \end{pmatrix}\right], \tag{7.3} \]

Σ indicating the covariance matrix of ϵ. The best linear predictor of Y from W is

\[ Y^* = \beta^t(\beta\beta^t + \Sigma)^{-1}\mathbf{W} = \frac{1}{1 + \Delta^2}\, \beta^t \Sigma^{-1} \mathbf{W}, \tag{7.4} \]

where Δ2 is the squared Mahalanobis distance

\[ \Delta^2 = \beta^t \Sigma^{-1} \beta. \tag{7.5} \]

If Σ is the identity, as assumed in (2.1), then Y* = constant · βᵗW, similarly to (2.3).

Combining (7.4) with (7.2) produces a simple expression for the conditional mean and variance of Y* given Y,

\[ Y^* \mid Y \;\sim\; \left(\frac{\Delta^2}{1 + \Delta^2}\, Y,\ \frac{\Delta^2}{(1 + \Delta^2)^2}\right), \tag{7.6} \]

from which (7.1) gives

\[ \operatorname{cor}(Y, Y^*) = \frac{\Delta}{\sqrt{1 + \Delta^2}}. \tag{7.7} \]

Effective prediction of Y from W requires a large value of Δ = (βᵗΣ⁻¹β)^{1/2}. (In the context of Section 2, where Σ = I and β = δ/2c₀, we have Δ = ‖δ‖/2c₀, so the error probability α equals Φ(−Δ) at (2.5).)
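A small numerical check of (7.4)–(7.7), using made-up β and Σ (assumptions for illustration): the simulated correlation between Y and the best linear predictor should match Δ/√(1+Δ²).

```r
## Best linear predictor (7.4) and its correlation with Y, (7.7): a toy check.
set.seed(3)
p      <- 20
beta   <- rnorm(p, 0, .3)                  # made-up regression coefficients (7.2)
Sigma  <- diag(p)                          # independent errors, as in (2.1)
Delta2 <- drop(t(beta) %*% solve(Sigma) %*% beta)       # (7.5)

nsim <- 50000
Y <- rnorm(nsim)                           # standardized response, (7.1)
W <- Y %o% beta + matrix(rnorm(nsim * p), nsim, p)      # (7.2) with unit errors
Ystar <- drop(W %*% solve(Sigma, beta)) / (1 + Delta2)  # (7.4)

c(theory = sqrt(Delta2 / (1 + Delta2)), simulated = cor(Y, Ystar))   # (7.7)
```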

To bring empirical Bayes methods to bear on the estimation of Y we need to estimate posterior expectations for the regression coefficients βi from the training data (1.1): the N × n matrix x and the n-vector of responses y = (y1, y2, …, yn)t. Let xit indicate the ith row of x. Applying model (7.2) independently to each column of x gives a linear model for the rows,

\[ x_i = \mu_i 1_n + \sigma_i(\beta_i \mathbf{y} + \epsilon_i), \tag{7.8} \]

where 1_n is a vector of n 1's, and the components of ϵ_i = (ϵ_{i1}, ϵ_{i2}, …, ϵ_{in})ᵗ are independent and identically distributed, with mean 0 and variance 1. Ordinary least squares applied to (7.8) provides familiar estimates of μ_i, σ_i and β_i. In the dichotomous setting of (2.1), μ̂_i and σ̂_i are as given in (2.8) while 2c₀β̂_i equals δ̂_i = z_i in (2.9).

If we assume that the errors ϵi are normally distributed, then the t-statistic “ti” for testing βi = 0 in (7.8) has a non-central t distribution with n — 2 degrees of freedom and non-centrality parameter proportional to βi,

\[ t_i \sim t_{n-2}(\delta_i), \qquad \left[\delta_i \equiv 2c_0 \beta_i \ \text{ with } \ c_0^2 = \frac{\sum_1^n (y_j - \bar{y})^2}{4}\right], \tag{7.9} \]

In usual practice, (7.9) remains a reasonable approximation as long as the ϵij distribution does not have heavy tails. With dichotomous yi, c0 = (n1n2/n)1/2 as before.

We can transform ti to a z-value via

\[ z_i = \Phi^{-1}\big(F_{n-2}(t_i)\big), \tag{7.10} \]

with Φ and F_{n−2} the standard normal and central t_{n−2} cdf's. If n is large then (7.9) gives

\[ z_i \overset{\cdot}{\sim} N(\delta_i, 1) \tag{7.11} \]

as in (2.9). Remark F improves upon approximation (7.11), but we will take it as given here.

We can now proceed as in Section 4:

  1. z = (z₁, z₂, …, z_N)ᵗ provides f̂(z), an estimate of the marginal density of the z-values (Remark D), and ψ̂(z) = log(f̂(z)/φ(z)).

  2. We then calculate
    \[ \hat{\delta}_i = \hat{\psi}'(z_i) = \hat{E}\{\delta_i \mid z_i\} \quad \text{for } i = 1, 2, \ldots, N. \tag{7.12} \]
  3. δ̂_I, the vector of the I largest δ̂_i's in absolute value, gives
    \[ \hat{\Delta}_I = \frac{\|\hat{\delta}_I\|}{2c_0} \qquad \text{and} \qquad \widehat{\operatorname{cor}}_I = \frac{\hat{\Delta}_I}{\sqrt{1 + \hat{\Delta}_I^2}}. \tag{7.13} \]
  4. We continue increasing I until either ĉor_I reaches some target value or I reaches a preselected upper bound, and use
    \[ \hat{Y} = \frac{1}{1 + \hat{\Delta}_I^2} \sum_{i=1}^{I} \frac{\hat{\delta}_i}{2c_0} \left(\frac{X_i - \hat{\mu}_i}{\hat{\sigma}_i}\right), \tag{7.14} \]
    from (7.4), to predict Y from X.

Steps (3) and (4) assume uncorrelated X_i's, i.e., Σ the identity matrix, but correlation can be incorporated as in Section 5.

These steps were carried out for an ongoing lung cancer microarray study involving n = 100 patients each measured on N = 16,000 genes. All patients received the same new drug. The response variable “Y ” was a categorical assessment of improvement, adjusted for two covariates, running from −2 (worst) to +2 (best).

Figure 7 shows Ê{δ|z}, calculated by Steps (1) and (2) above. It seems clear that any power of the microarray expression measurements to predict Y must come from those genes having zi less than −2. Table 5 shows this to be true. Predictive power is modest here, with theoretical correlation only .48 after I = 50 steps, asymptoting to .57 at I = 16,000.

Figure 7. Lung cancer microarray study, N = 16,000 genes, n = 100 patients, ordered categorical response variable Y. Heavy curve is Ê{δ|z}, (7.12). Dashes indicate those z-values exceeding 3 in absolute value.

Table 5.

Right column shows ĉor_I, (7.13), for the lung cancer data; I = 1 to 50. Final value ĉor_{16000} = .57.

Step Index z-value δ^ β^ Δ^ cor^I
1 12404 -4.27 -2.44 -0.20 0.04 0.04
2 6342 -3.98 -2.10 -0.17 0.07 0.07
3 2516 -3.92 -2.02 -0.16 0.10 0.10
4 488 -3.89 -1.99 -0.16 0.12 0.12
5 8471 -3.84 -1.93 -0.16 0.15 0.15
6 25 -3.84 -1.92 -0.16 0.17 0.17
7 2872 -3.82 -1.90 -0.15 0.20 0.19
8 300 -3.78 -1.85 -0.15 0.22 0.21
9 545 -3.78 -1.85 -0.15 0.24 0.23
10 12448 -3.78 -1.84 -0.15 0.26 0.26
45 10905 -2.83 -0.60 -0.05 0.54 0.47
46 390 -2.83 -0.59 -0.05 0.54 0.47
47 1498 -2.82 -0.59 -0.05 0.54 0.48
48 10317 -2.81 -0.57 -0.05 0.54 0.48
49 7894 -2.79 -0.56 -0.05 0.55 0.48
50 13263 -2.79 -0.55 -0.04 0.55 0.48

8 Remarks

The following remarks expand on some of the questions and technical points raised earlier.

A. Centroids Interpretation

Prediction rule (4.3), which depends on the sign of Ŝ = Σδ̂_iŴ_i, Ŵ_i = (X_i − μ̂_i)/σ̂_i, can be stated in more conventional centroid terminology: letting

\[ D_1 = \left\|\hat{W} + \frac{\hat{\delta}}{2c_0}\right\| \qquad \text{and} \qquad D_2 = \left\|\hat{W} - \frac{\hat{\delta}}{2c_0}\right\|, \tag{8.1} \]

we predict “healthy” if D₁ < D₂ and “sick” if D₂ < D₁; so −δ̂/2c₀ and δ̂/2c₀ are the standardized centroids. An alternative statement refers to the hyperplane L̂ passing through the origin of N-space orthogonal to the line segment connecting −δ̂/2c₀ with δ̂/2c₀: we predict healthy or sick depending on which side of L̂ the point Ŵ falls.

B. Unequal Prior Probabilities

Prediction rule (4.3) tacitly assumes that our dichotomous response variable has equal prior probabilities on the two categories, irrespective of the observed frequencies n1 and n2 in the training set. Suppose that the prior probabilities are actually π1 and π2. Starting with model (2.1), calculations involving Fisher’s linear discriminant function imply the following change from Remark A: the prediction boundary L^ is translated to intersect the orthogonal line segment at directed distance

\[ \frac{c_0}{\|\hat{\delta}\|}\, \log\!\left(\frac{\pi_1}{\pi_2}\right) \tag{8.2} \]

from the origin. (The definition of Ŵ is still (X − μ̂)/σ̂, with (μ̂_i, σ̂_i) as given in (2.8).)

C. The Prior Density g(δ)

In the Brown—Stein model (3.2), the prior density g(δ) can be extended to a general probability distribution G(δ) incorporating discrete atoms of probability as in (3.7). Theorem 1’s statement is almost unchanged,

\[ dG(\delta \mid z) = e^{\delta z - \psi(z)}\, e^{-\delta^2/2}\, dG(\delta). \tag{8.3} \]

The factor e−δ2/2 guarantees that the exponential family has natural parameter space including all values of z, justifying Corollary 2 for all z. The same considerations apply to Theorem 2.

D. Estimating f(z)

Ebay estimates f(z), the mixture density (3.3), by means of a Poisson generalized linear model (glm) applied to binned counts of the N z-values. In Figure 1, for example, there are K = 90 bins, each of width 0.1, ranging from −4.5 to 4.5. The counts

\[ c_k = \#\{z_i \text{ in bin } k\}, \qquad k = 1, 2, \ldots, K, \tag{8.4} \]

are the heights of the histogram bars. Let b indicate the K-vector of bin midpoints. Then the estimate f̂ = (f̂₁, f̂₂, …, f̂_K)ᵗ of f(z) at the points in b is obtained by Poisson regression of the counts on a natural spline function of the midpoints:

f̂ = glm(c ~ ns(b, df), poisson)$fit (8.5)

in R notation; the default degrees of freedom df equals 7 in Ebay; f̂ is the discretized MLE of f(z) in the 7-parameter exponential family defined by the natural spline basis.

Estimate (8.5) is the same one employed by locfdr, the local false discovery rate algorithm described in Efron (2008). Applied to the prostate data, locfdr estimated p̂₀ = .93 for the proportion of null genes (3.7), assuming that the theoretical null φ(z), (1.4), is correct.

E. Accuracy Formula for Ê{δ|z}

A closed-form delta-method expression for the variance of δ̂_i = Ê{δ_i | z_i} can be derived if we are willing to assume that the z_i's are independent of each other. Let M be the K × m structure matrix ns(b, df) in (8.5), K = 90 and m = 8; diag(c) the K × K diagonal matrix with diagonal entries the bin counts c_k; and G = Mᵗ diag(c) M. Section 5 of Efron (2007) employs the relationship

\[ d\hat{l} = M G^{-1} M^t\, dc \tag{8.6} \]

for the derivative matrix of the K-vector l̂ = log(f̂) with respect to a continuized version of c.

Let D be the (K − 2) × K matrix whose kth row is

\[ (0, 0, \ldots, 0, -1, 0, 1, 0, 0, \ldots)/d_0, \tag{8.7} \]

with −1 in the kth place: Dl̂ = l̂′, the numerical derivative of l̂. This gives

\[ d\hat{l}' = D M G^{-1} M^t\, dc. \tag{8.8} \]

The Poisson estimate Cov(c) = diag(c) for the covariance matrix of c then yields Cov(l̂′) = DMG⁻¹MᵗDᵗ. But since

\[ \psi'(z) = \frac{d}{dz}\log\!\left\{\frac{f(z)}{\varphi(z)}\right\} = z + l'(z), \tag{8.9} \]

we have δ̂(k) ≡ ψ̂′(z = b_k) = b_k + l̂′_k in (4.1), implying that

\[ \operatorname{Cov}(\hat{\delta}) = \operatorname{Cov}(\hat{l}') = D M G^{-1} M^t D^t. \tag{8.10} \]

Table 6 shows estimates of standard error for δ^ (square roots of the diagonal elements in (8.10)) calculated for the prostate data. As in Figure 3, we can see an explosive increase in variability as |z| increases to 4.
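A rough R rendering of (8.6)–(8.10), using the binned counts and midpoints from the earlier Ebay sketch; the forward-difference matrix below is a simplified stand-in for the (K − 2) × K central-difference matrix D of (8.7).

```r
## Delta-method covariance (8.10) for delta.hat(z), a sketch.
library(splines)
M <- cbind(1, ns(mids, df = 7))               # K x m structure matrix (m = 8)
G <- t(M) %*% (h$counts * M)                  # G = M' diag(c) M
K <- length(mids); d0 <- diff(mids)[1]

D <- (cbind(0, diag(K - 1)) - cbind(diag(K - 1), 0)) / d0   # finite differences

Cov.delta <- D %*% M %*% solve(G, t(M)) %*% t(D)            # (8.10)
sd.delta  <- sqrt(diag(Cov.delta))            # standard errors of delta.hat over bins
plot(mids[-1], sd.delta, type = "l", xlab = "z", ylab = "sd of delta.hat(z)")
```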

Table 6.

Delta-method standard errors for δ̂(z) = Ê{δ|z}, formula (8.10), for the prostate data.

z: -4 -3 -2 -1 0 1 2 3 4
sd: .41 .12 .09 .06 .04 .05 .09 .10 .33

F. Transforming t-values to z-values

The ith row of x comprises n independent observations

\[ x_{ij} \overset{\text{ind}}{\sim} N\!\left(\mu_i \pm \frac{\sigma_i \delta_i}{2c_0},\ \sigma_i^2\right) \qquad \text{for } j = 1, 2, \ldots, n, \tag{8.11} \]

in the notation of Section 2, with n1 “−” values and n2 “+” values. The corresponding two-sample t-statistic ti follows a non-central t distribution with n — 2 degrees of freedom and noncentrality parameter δi,

\[ t_i = c_0\, \frac{\bar{x}_{i2} - \bar{x}_{i1}}{\hat{\sigma}_i} \;\sim\; t_{n-2}(\delta_i). \tag{8.12} \]

Our previous discussion treated ti as zi ~ N (δi, 1), but Ebay actually employs transformations that improve the accuracy of Corollary 3.

Let

\[ z_i = \Phi^{-1}\big(F_{n-2}(t_i)\big), \tag{8.13} \]

as in (7.10), so if δi = 0 then zi ~ N (0, 1). If δi ≠ 0, zi is still surprisingly close to normal,

\[ z_i \overset{\cdot}{\sim} N\big(\zeta_i,\ \sigma^2(\zeta_i)\big), \qquad \left[\zeta_i = \Phi^{-1}\big(F_{n-2}(\delta_i)\big)\right], \tag{8.14} \]

with σ(ζi) < 1. For example, with δi = 4 and n = 102, zi from (8.13) has (mean, standard deviation, skewness, kurtosis) equal (3.845, .931, -.046, .010). A plot of (8.14) superimposed on (8.12) barely differentiates the two curves.
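The near-normality claim in (8.14) is easy to check by simulation; this R sketch (an illustration, not Ebay code) draws noncentral t values with δ = 4 and n = 102 and transforms them as in (8.13).

```r
## Numerical check of (8.13)-(8.14): z = Phi^{-1}(F_{n-2}(t)) for t ~ t_{n-2}(delta).
set.seed(4)
n <- 102; delta <- 4
t.draws <- rt(10^5, df = n - 2, ncp = delta)
## use upper tails to avoid rounding problems for large t
z.draws <- qnorm(pt(t.draws, df = n - 2, lower.tail = FALSE), lower.tail = FALSE)
c(mean = mean(z.draws), sd = sd(z.draws))     # close to (3.845, .931) quoted in the text
zeta <- qnorm(pt(delta, df = n - 2))          # zeta_i of (8.14)
```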

The computation of δ^i, (4.1) in the Ebay algorithm, is actually carried out using (8.14):

  • The vector t = (t1, t2, …, tN)t is converted component-wise to z = (z1, z2, …, zN), as in (8.13).

  • An estimate f̂(z) is constructed from z as in Remark D.

  • A modified version of Corollary 2, described below, provides empirical Bayes estimates ζ^i.

  • Finally, transformation (8.14) is inverted to give
    \[ \hat{\delta}_i = F_{n-2}^{-1}\big(\Phi(\hat{\zeta}_i)\big), \tag{8.15} \]
    after which Ebay proceeds as in Steps 4–6 in Section 4.

Suppose the Brown—Stein model (3.2) is modified to have z|δ ~ N (δ, σ2). Then it is easy to show that

\[ E\{\delta \mid z\} = z + \sigma^2 l'(z) \qquad \text{and} \qquad \operatorname{Var}\{\delta \mid z\} = \sigma^2 + \sigma^4 l''(z), \tag{8.16} \]

where l(z) is the log of the marginal density f(z). The empirical Bayes estimate ζ^i mentioned above is given by

\[ \hat{\zeta}_i = z_i + \hat{\sigma}_i^2\, \hat{l}'(z_i), \tag{8.17} \]

l̂(z) = log(f̂(z)) and σ̂_i² = σ²(z_i), where the variance function σ²(·) in (8.14) is calculated numerically. None of this gave answers much different than using (4.1) directly, but the transformation effect becomes more important when n is smaller.

G. Cross-Validation Procedure

Both Ebay and the shrunken centroids procedure default to 10-fold cross-validation replicated R times. Each replication randomly splits the n cases into 10 folds, with correctly proportional numbers of “healthy” and “sick” in each fold. As usual, the prediction rule is refit 10 times with the cases of each fold withheld from the training set in turn, the cross-validated rate α̂_CV being the overall proportion of errors on the withheld cases averaged over all R replications. The R replications also provide a standard error for α̂_CV.

It is useful to remember that α̂_CV is not an estimate of error for the specific prediction rule selected by Ebay or pamr (unlike the actual prediction errors in Figure 4, which were computed from knowledge of the simulation structure (4.4)). Rather, it is the expected error rate for rules selected according to the same recipe, as emphasized in Efron (1983). In this sense it differs from the ideal Bayesian estimate α̃ = Φ(−‖δ̃‖/2c₀) following (3.1), or its empirical Bayes version α̂ = Φ(−‖δ̂‖/2c₀), both of which apply directly to the prediction rule at hand.

H. Empirical Bayes Estimation of Σ

The histogram of off-diagonal elements r_{ii′} of a sample correlation matrix will usually be more dispersed than the corresponding histogram of true correlations ρ_{ii′}, because sampling error adds a component of variance to the r_{ii′} values. Ebay includes an empirical Bayes shrinkage option to account for overdispersion in the estimation of η, (5.2).

Let

\[ \nu_{ii'} = \frac{(n - 4)^{1/2}}{2}\, \log\!\left(\frac{1 + \rho_{ii'}}{1 - \rho_{ii'}}\right) \qquad \text{and} \qquad \upsilon_{ii'} = \frac{(n - 4)^{1/2}}{2}\, \log\!\left(\frac{1 + r_{ii'}}{1 - r_{ii'}}\right) \tag{8.18} \]

denote Fisher’s transform of ρii′ and rii′ (where the usual constant n — 3 has been reduced to n — 4 since two separate means are subtracted off, for the healthy and sick subjects separately). A standard normal theory approximation (Johnson and Kotz, 1970, Chapt. 32, Sect. 4), says that

\[ \upsilon_{ii'} \overset{\cdot}{\sim} N(\nu_{ii'}, 1), \tag{8.19} \]

implying that the histogram of the υii′ values will have variance about one unit greater than that for the true νii′ ’s.

Suppose the ensemble of true ν_{ii′} values has (mean, variance) say (M, A), and that υ_{ii′} ∼ (ν_{ii′}, 1) as in (8.19), so that the υ_{ii′} ensemble is approximately (M, A + 1). Then

\[ \tilde{\nu}_{ii'} = M(1 - C) + C\, \upsilon_{ii'}, \qquad \left[C = \frac{A}{A + 1}\right] \tag{8.20} \]

is the linear function of υ_{ii′} having (mean, variance) approximately (M, A).

Ebay first obtains robust estimates of M and A + 1 from the set of values {υ_{ii′}}, and then substitutes M̂ and Ĉ = Â/(Â + 1) into (8.20) to give estimates ν̃_{ii′}. In order to protect genuine outliers like those in (5.3), Efron and Morris’ (1972) limited translation rule is enforced: ν̃_{ii′} is not allowed to shrink further than one unit away from υ_{ii′}. Finally, ν̃_{ii′} gives ρ̃_{ii′} by inverting transformation (8.18). (Σ̃ may no longer be a correlation matrix, but that is not required for use in (5.2).)
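A rough sketch of the shrinkage recipe (8.18)–(8.20): Fisher-transform the off-diagonal sample correlations, shrink toward their robust center, cap the shrinkage at one unit (limited translation), and transform back. The median/mad choices here are simplified stand-ins for Ebay's robust estimates.

```r
## Empirical Bayes shrinkage of a sample correlation matrix, (8.18)-(8.20).
shrink.cor <- function(R, n) {
  r <- R[upper.tri(R)]
  u <- sqrt(n - 4) * 0.5 * log((1 + r) / (1 - r))    # Fisher transforms, (8.18)
  M.hat  <- median(u)                                # robust estimate of M
  Aplus1 <- mad(u)^2                                 # robust estimate of A + 1
  C.hat  <- max(Aplus1 - 1, 0) / Aplus1              # C = A/(A + 1), (8.20)
  v <- M.hat * (1 - C.hat) + C.hat * u               # shrunken transforms
  v <- pmax(pmin(v, u + 1), u - 1)                   # limited translation: <= 1 unit
  r.tilde <- tanh(v / sqrt(n - 4))                   # invert (8.18)
  R.tilde <- R
  R.tilde[upper.tri(R.tilde)] <- r.tilde
  R.tilde[lower.tri(R.tilde)] <- t(R.tilde)[lower.tri(R.tilde)]
  R.tilde
}
## e.g. Sigma.tilde <- shrink.cor(Sigma.hat, n) for the matrix from the Section 5 sketch
```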

A small simulation experiment was run, comparing Σ~ with the usual (unshrunk) estimate Σ^. It began with model (4.4), modified to instill correlation among the 5000 entries in any one column of x; the root mean square of true pair-wise correlations was set equal to 0.10, about triple that for the prostate study and half that for the Michigan lung cancer study of Table 4. Each of 200 replications yielded δ^I as in (4.2), the I × I sample correlation matrix Σ^, and its empirical Bayes counterpart Σ~.

The corresponding estimates (5.2),

\[ \hat{\eta} = \left(\frac{\hat{\delta}_I^t \hat{\delta}_I}{\hat{\delta}_I^t \hat{\Sigma}\, \hat{\delta}_I}\right)^{1/2} \qquad \text{and} \qquad \tilde{\eta} = \left(\frac{\hat{\delta}_I^t \hat{\delta}_I}{\hat{\delta}_I^t \tilde{\Sigma}\, \hat{\delta}_I}\right)^{1/2} \tag{8.21} \]

are compared with

\[ \eta_{\text{true}} = \left(\frac{\hat{\delta}_I^t \hat{\delta}_I}{\hat{\delta}_I^t \Sigma\, \hat{\delta}_I}\right)^{1/2} \tag{8.22} \]

in Table 7; η̃ is seen to offer only minor improvement over η̂. Robust estimates of standard deviation for η̃ − ηtrue compared to η̂ − ηtrue were a little more decisive: .074 compared to .085. Root mean square errors for estimating all of the elements of Σ strongly favored Σ̃ over Σ̂: rms~ = 6.30 versus rms^ = 9.83.

Table 7.

Estimates η̂ and η̃, (8.21), compared with the true correlation correction factor ηtrue, (8.22); 200 replications of correlated model (4.4). The rms values are root mean square errors for estimating the elements of Σ.

ηtrue η^ η~ rms^ rms~
mean: .597 .588 .598 9.83 6.30
stdev: .138 .171 .164 9.13 5.74

The α^cor values in Table 2 and Table 4 were based on Σ^, Ebay’s default option. Using Σ~ gave smaller estimates of the correlation effect in both cases. The choice is not crucial here since the current version of Ebay does not involve α^cor in constructing the prediction rule, but either or both methods convey useful information on the effects of correlation among the predictors.

Regularized estimation of correlation matrices is a major subject in its own right (see Warton, 2008), and other methods might further improve on Σ̂. However, Σ̂ performs relatively well in our context for two reasons: the dimension “I” of Σ tends not to be too large, and more importantly, we need only estimate the function η, (5.1), not all of Σ. If we are willing to consider δ̂ fixed in (5.2), then δ̂ᵗΣδ̂ is a linear function of Σ's elements, estimated almost unbiasedly by δ̂ᵗΣ̂δ̂. The estimation of Σ would be more crucial if we were attempting to implement the general linear discriminant function rather than the simplified version (2.3), (2.4).

I. Overdispersed z-Values

The z-value histogram for the prostate data in Figure 1 is a little bit wider than N(0, 1) near z = 0: a fit to the center of the histogram gave z ≈ N(0, 1.06²) (using the locfdr algorithm, Efron (2008)). This discrepancy is reflected in Figure 2 by the slight upward slope of Ê{δ|z} for z between −2 and 1.5. Theorem 1 and Corollary 2 in Section 3 depend on the assumption z ∼ N(δ, 1). If actually z ∼ N(δ, σ²), with σ² > 1, then the formula for E{δ|z} must be modified as in (8.16). We can compensate for overdispersion by using the standardized values z̃_i = z_i/1.06 rather than z_i in the Ebay algorithm. Doing so flattened Ê{δ|z} to zero between −2 and 1.5 in Figure 2, and shrank it slightly toward zero for larger |z|.

Figure 8 concerns a leukemia microarray study from Golub et al. (1999) where overdispersion is more severe. Here there are N = 7129 genes measured on n = 72 subjects in two subtypes, n₁ = 45 and n₂ = 27. Two-sample t-tests gave z-values z_i as in (1.2), (1.3). The histogram of z_i's corresponding to Figure 1 has z ≈ N(.09, 1.68²) near its center.

Figure 8. Solid curve is Ê{δ|z} for the leukemia data, Golub et al. (1999). Broken curve is Ê{δ|z} based on the standardized values z̃_i = (z_i − .09)/1.68. Top row of dashes indicates the 40 most extreme z_i values; lower row the 40 most extreme z̃_i values.

Now the curve Ê{δ|z} based on the standardized values z̃_i = (z_i − .09)/1.68 is much less optimistic than that based on the original z_i's, especially taking account of the decreased size of the z̃_i's. Prediction looks extremely easy with the z_i's; many genes have δ̂_i values, (4.1), exceeding 6. However δ̂_i tops out below 4 for the z̃_i's. Ebay required only I = 10 genes to reach target error α₀ = .01 using the z_i's, (4.2), compared with I = 34 for the z̃_i's.

Which prediction rule is better? The answer depends on the reason for the z_i's overdispersion. If in fact z_i ∼ N(δ_i, 1) and the appearance of overdispersion is due to most of the δ_i's lying far from zero, then the I = 10 rule should perform well. However, overdispersion may indicate ephemeral effects, for example due to unobserved covariates in an observational study, that won't help with future predictions, in which case the z̃_i analysis is more realistic.

J. Model (7.1), (7.2)

The predictor variable W_i appears as the response in (7.2), which may seem less natural there than in (2.1). This allows us to express each row x_i of the predictor matrix x as a separate linear regression on y, (7.8), facilitating the empirical Bayes estimation of the parameters β_i in (7.9)–(7.14). Notice that the correlation structure (7.3) implied by (7.1), (7.2) leads directly to (7.4), where now W assumes its proper role as a predictor vector.

K. Naive Bayes Prediction

Rule (2.3)–(2.4) is “naive Bayes” when applied in the correlated framework of Section 5. That is, it ignores the possible decrease in prediction error available from use of the full form of Fisher’s linear discriminant function. Such gains are likely to be more hypothetical than genuine. The theory and simulations in Bickel and Levina (2004) and Dudoit et al. (2002) show our naive Bayes prediction rules outperforming more sophisticated predictors in large-scale situations.

References

  • 1. Benjamini Y, Yekutieli D. False discovery rate-adjusted multiple confidence intervals for selected parameters. J. Amer. Statist. Assoc. 2005;100:71–93.
  • 2. Bickel PJ, Levina E. Some theory of Fisher's linear discriminant function, 'naive Bayes', and some alternatives when there are many more variables than observations. Bernoulli. 2004;10:989–1010.
  • 3. Brown LD. Admissible estimators, recurrent diffusions, and insoluble boundary value problems. Ann. Math. Statist. 1971;42:855–903.
  • 4. Dawid AP. Selection paradoxes of Bayesian inference. In: Multivariate Analysis and Its Applications (Hong Kong, 1992). IMS Lecture Notes Monogr. Ser., vol. 24. Inst. Math. Statist.; Hayward, CA: 1994. pp. 211–220.
  • 5. Dudoit S, Fridlyand J, Speed TP. Comparison of discrimination methods for the classification of tumors using gene expression data. J. Amer. Statist. Assoc. 2002;97:77–87.
  • 6. Dudoit S, Shaffer JP, Boldrick JC. Multiple hypothesis testing in microarray experiments. Statist. Sci. 2003;18:71–103.
  • 7. Efron B. Estimating the error rate of a prediction rule: improvement on cross-validation. J. Amer. Statist. Assoc. 1983;78:316–331.
  • 8. Efron B. Size, power and false discovery rates. Ann. Statist. 2007;35:1351–1377.
  • 9. Efron B. Microarrays, empirical Bayes, and the two-groups model. Statist. Sci. 2008;23:1–47.
  • 10. Efron B, Morris C. Limiting the risk of Bayes and empirical Bayes estimators. II. The empirical Bayes case. J. Amer. Statist. Assoc. 1972;67:130–139.
  • 11. Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science. 1999;286:531–537. doi: 10.1126/science.286.5439.531.
  • 12. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning. 2nd ed. Springer Series in Statistics. Springer-Verlag; New York: 2008.
  • 13. Johnson NL, Kotz S. Distributions in Statistics: Continuous Univariate Distributions, Vol. 2. Houghton Mifflin Co.; Boston, Mass.: 1970.
  • 14. Senn S. A note concerning a selection "paradox" of Dawid's. Amer. Statist. 2008;62:206–210.
  • 15. Singh D, Febbo PG, Ross K, Jackson DG, Manola J, Ladd C, Tamayo P, Renshaw AA, D'Amico AV, Richie JP, Lander ES, Loda M, Kantoff PW, Golub TR, Sellers WR. Gene expression correlates of clinical prostate cancer behavior. Cancer Cell. 2002;1:203–209. doi: 10.1016/s1535-6108(02)00030-2.
  • 16. Stein CM. Estimation of the mean of a multivariate normal distribution. Ann. Statist. 1981;9:1135–1151.
  • 17. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. 2005;102:15545–15550. doi: 10.1073/pnas.0506580102.
  • 18. Tibshirani R, Hastie T, Narasimhan B, Chu G. Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc. Natl. Acad. Sci. USA. 2002;99:6567–6572. doi: 10.1073/pnas.082099299.
  • 19. Warton DI. Penalized normal likelihood and ridge regularization of correlation and covariance matrices. J. Amer. Statist. Assoc. 2008;103:340–349.
