Published in final edited form as: J Am Stat Assoc. 2009 Sep 1;104(487):1015–1028. doi: 10.1198/jasa.2009.tm08523

Empirical Bayes Estimates for Large-Scale Prediction Problems

Bradley Efron

Abstract

Classical prediction methods such as Fisher’s linear discriminant function were designed for small-scale problems, where the number of predictors N is much smaller than the number of observations n. Modern scientific devices often reverse this situation. A microarray analysis, for example, might include n = 100 subjects measured on N = 10,000 genes, each of which is a potential predictor. This paper proposes an empirical Bayes approach to large-scale prediction, where the optimum Bayes prediction rule is estimated employing the data from all the predictors. Microarray examples are used to illustrate the method. The results show a close connection with the shrunken centroids algorithm of Tibshirani et al. (2002), a frequentist regularization approach to large-scale prediction, and also with false discovery rate theory.

Keywords: microarray prediction, empirical Bayes, shrunken centroids, effect size estimation, correlated predictors, local fdr

1 Introduction

An important class of prediction problems begins with the observation of n independent vectors,

\[ (x_j, y_j), \qquad j = 1, 2, \ldots, n. \tag{1.1} \]

Here x_j is an N-vector of predictors, while y_j is a real-valued response, taken to be dichotomous in most of what follows. For example, x_j might include age, height, weight, gender, etc. for person j, while y_j indicates whether or not that person later developed cancer. Given a newly observed N-vector X, we would like to predict its corresponding Y value. Our task is to use the “training data” (1.1) to construct an effective prediction rule.

Classic prediction methods, such as Fisher’s linear discriminant function, were fashioned for problems where N is much smaller than n, that is, where the number of predictors is less than the number of training cases. Current high-throughput scientific technology tends to produce just the opposite situation, with N >> n; modern equipment may permit thousands of measurements on a single individual, but recruiting new subjects remains as difficult as ever.

Microarrays offer the prototypical example. Here xj is a vector of genetic expression measurements on subject j, one for each of N genes, where N is typically several thousand. In the prostate cancer data (Singh et al., 2002) we will use for motivation, there are N = 6033 genes measured on each of n = 102 men, n1 = 50 healthy controls and n2 = 52 prostate cancer patients. Given a new microarray measuring the same 6033 genes, we would like to predict whether or not that man has prostate cancer.

Let ti be the two-sample t-statistic comparing sick versus healthy subjects for gene i,

\[ t_i = c_0\,\frac{\bar{x}_{i2} - \bar{x}_{i1}}{\hat{\sigma}_i} \qquad \left(c_0 = \sqrt{\frac{n_1 n_2}{n}}\right), \tag{1.2} \]

where x̄_{i1} and x̄_{i2} are the mean expression levels on gene i for the healthy and sick subjects, and σ̂_i is the usual pooled estimate of standard deviation. For easier discussion later, we transform the t_i's to a normal scale,

\[ z_i = \Phi^{-1}\big(F_{n-2}(t_i)\big), \tag{1.3} \]

with Φ and F_{n−2} the standard normal and t_{n−2} cumulative distribution functions (cdf), so that under the classical null hypothesis, z_i has a standard normal distribution,

\[ H_0 : z_i \sim N(0, 1). \tag{1.4} \]

Figure 1 shows the histogram of all 6033 z-values. The theoretical N(0, 1) null distribution fits the center of the histogram reasonably well, which makes sense since, presumably, most of the N genes have nothing to do with prostate cancer. However the histogram’s heavy tails suggest some “non-null” genes that express themselves differently in sick and healthy subjects, and those are the ones that should be useful for prediction. Just how to fashion a prediction rule from them is the subject of this paper. (Note: it is not necessary that the zi’s be obtained from t-tests. Each of the N z-value calculations might involve a separate linear regression model, incorporating covariates such as age and weight.)
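As a concrete illustration of (1.2)–(1.3), here is a minimal R sketch (not code from the paper) that computes the two-sample t-statistics and their normal-scale transforms from an expression matrix; the matrix x and the group sizes below are simulated placeholders, not the prostate data.

```r
## Sketch of (1.2)-(1.3): two-sample t-statistics and their z-value transforms.
## Columns 1:n1 of the N x n matrix 'x' are healthy subjects, the rest are sick.
set.seed(1)
N <- 6033; n1 <- 50; n2 <- 52; n <- n1 + n2
x <- matrix(rnorm(N * n), N, n)                 # placeholder data, all genes null

c0    <- sqrt(n1 * n2 / n)
xbar1 <- rowMeans(x[, 1:n1])
xbar2 <- rowMeans(x[, (n1 + 1):n])
SS1   <- rowSums((x[, 1:n1] - xbar1)^2)         # within-group sums of squares
SS2   <- rowSums((x[, (n1 + 1):n] - xbar2)^2)
sigma.hat <- sqrt((SS1 + SS2) / (n - 2))        # pooled estimate, as in (2.8)
mu.hat    <- (xbar1 + xbar2) / 2                # (2.8)

t.stat <- c0 * (xbar2 - xbar1) / sigma.hat      # (1.2)
z <- qnorm(pt(t.stat, df = n - 2))              # (1.3): z_i = Phi^{-1}(F_{n-2}(t_i))
hist(z, breaks = 90)                            # should look N(0,1) for null genes
```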

Figure 1. 6033 z-values from the prostate cancer study (Singh et al., 2002). A standard N(0, 1) density fits the histogram center, while the heavy tails indicate the presence of non-null genes that may be useful for prediction.

Large-scale prediction problems suffer from a surfeit of possible predictors, 6033 of them in this case, most of which are useless. Even the genuinely non-null cases appear to us in exaggerated form. Selection bias, the fact that we can only identify interesting possible predictors at the extremes of the N cases, means that an observed value of, say, z_i = 4 probably corresponds to a true effect considerably nearer the null hypothesis.

This paper uses empirical Bayes methods both to select useful predictors and to undo selection bias in the evaluation of their predictive power. It was suggested by the “shrunken centroids” method of Tibshirani et al. (2002), described in Section 2.

A simple model is introduced in Section 2, which, if we knew the parameter values, would lead to an optimum prediction rule. Section 3 discusses Bayes estimation of the optimum rule, using a model of Brown (1971) and Stein (1981) to assist the calculations (and showing a connection with the theory of local false discovery rates). An empirical Bayes algorithm for approximating the Bayes solution is developed in Section 4. Section 5 modifies the empirical Bayes algorithm to allow for correlation among the predictors. A different problem is considered in Section 6: the estimation of effect sizes for those cases found to be non-null, where our empirical Bayes approach provides an alternative to the False Coverage Rate theory of Benjamini and Yekutieli (2005). Most of the paper concerns dichotomous responses yj, but the results are extended to general response variables, for example survival times, in Section 7. Section 8 concludes the paper with Remarks that expand on some of the technical points and ideas (identified as Remark A, B, etc. throughout the text).

A healthy literature on large-scale prediction has grown up around innovative computer-intensive techniques such as support vector machines, lasso and ridge regression regularization methods, the singular value decomposition and sparse data representation. Chapter 18 of Hastie et al. (2008) provides a nice overview. A main goal here, besides presenting some new methodology, is to trace the inferential connections between Bayesian theory, regularization methods like shrunken centroids, false discovery rates, and large-scale prediction.

2 A Simple Model

Motivation for our empirical Bayes prediction rules comes from a simple idealized probability model for a vector of predictors X = (X1, X2, …, XN). We assume that the individual predictors X_i are independently normal, with (location, scale) parameters (μi, σi), and with possibly different expectations in the two subject classes,

\[ \frac{X_i - \mu_i}{\sigma_i} \;\overset{\text{ind}}{\sim}\; N\!\left(\pm\frac{\delta_i}{2c_0},\, 1\right) \qquad \left\{\begin{array}{ll} - & \text{healthy class} \\ + & \text{sick class} \end{array}\right\}, \tag{2.1} \]

with c0 = (n1n2/n)1/2 as in (1.2). (Here the classes have been labeled ‘healthy’ and ‘sick’ in deference to the prostate example. Section 7 discusses non-dichotomous response variables.) Null cases have δi = 0, indicating no difference between the two classes; non-null cases, particularly those with large values of |δi|, are promising ingredients for effective prediction.

Let

\[ W_i \equiv \frac{X_i - \mu_i}{\sigma_i}, \qquad i = 1, 2, \ldots, N, \tag{2.2} \]

be the standardized versions of the X_i in (2.1). The optimal prediction rule is based on the weighted sum

\[ S = \sum_{i=1}^{N} \delta_i W_i \;\sim\; N\!\left(\pm\frac{\|\delta\|^2}{2c_0},\, \|\delta\|^2\right), \tag{2.3} \]

‖δ‖² = Σ_1^N δ_i², with “±” indicating the two classes as in (2.1). We predict

\[ \text{healthy if } S < 0, \qquad \text{sick if } S > 0. \tag{2.4} \]

Prediction error rates of the first and second kinds, confusing healthy with sick or vice versa, both equal

\[ \alpha = \Phi\!\left(-\frac{\|\delta\|}{2c_0}\right). \tag{2.5} \]

Effective prediction requires a large δ vector. In what follows, prediction error will be called simply “α”. See Remark B of Section 8.
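As a quick numerical check of (2.3)–(2.5), the following R sketch simulates standardized predictors from the two classes with made-up δ values (an assumption for illustration, not quantities estimated from data) and compares the rule's empirical error rate with Φ(−‖δ‖/2c₀).

```r
## Checking the ideal rule (2.3)-(2.4) against its error rate (2.5).
set.seed(2)
n1 <- 50; n2 <- 52; c0 <- sqrt(n1 * n2 / (n1 + n2))
delta <- c(rep(3, 25), rep(-3, 25))                    # 50 hypothetical non-null effects
alpha.theory <- pnorm(-sqrt(sum(delta^2)) / (2 * c0))  # (2.5)

one.trial <- function(sick) {
  sgn <- if (sick) +1 else -1
  W <- rnorm(length(delta), mean = sgn * delta / (2 * c0), sd = 1)  # model (2.1)
  S <- sum(delta * W)                                  # (2.3)
  (S > 0) != sick                                      # TRUE = misclassified by (2.4)
}
err <- mean(replicate(20000, one.trial(sick = sample(c(TRUE, FALSE), 1))))
round(c(alpha.theory = alpha.theory, alpha.simulated = err), 4)
```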

Rule (2.3), (2.4) is Fisher’s linear discriminant function applied to situation (2.1) (Hastie et al., 2008), assuming equal prior probabilities for the two classes. Remark B of Section 8 discusses the case of unequal probabilities. Section 5 considers a more realistic version of (2.1) that allows for correlations among the predictors X_i.

In practice we need to estimate the parameters

\[ (\mu_i, \sigma_i, \delta_i), \qquad i = 1, 2, \ldots, N, \tag{2.6} \]

entering into S = ΣδiWi. This is where the training data

\[ \mathbf{x} = (x_{ij}), \quad i = 1, 2, \ldots, N \ \text{ and } \ j = 1, 2, \ldots, n, \qquad \text{and} \qquad \mathbf{y} = (y_1, y_2, \ldots, y_n), \tag{2.7} \]

with yj equal +1 or -1 depending on the dichotomous classification of subject j, comes in. Ebay, the algorithm used for the numerical calculations here, employs standard estimates for (μi, σi):

\[ \hat{\mu}_i = \frac{\bar{x}_{i1} + \bar{x}_{i2}}{2}, \qquad \hat{\sigma}_i = \left(\frac{SS_{i1} + SS_{i2}}{n - 2}\right)^{1/2}, \tag{2.8} \]

x̄_{i1} and SS_{i1} the mean and within-group sum of squares for gene i measurements in the healthy subjects, and likewise x̄_{i2} and SS_{i2} for the sick subjects.

If σi were known, then

\[ z_i = c_0\,\frac{\bar{x}_{i2} - \bar{x}_{i1}}{\sigma_i} \;\sim\; N(\delta_i, 1) \tag{2.9} \]

would provide an obvious estimate of δ_i, say δ̂_i = z_i. With σ_i unknown, we convert the t-statistic t_i to the normal scale as in (1.2), (1.3). Remark F considers this transformation more carefully, but for now we will ignore it, and use the approximation z_i ∼ N(δ_i, 1) for our actual z-values (1.3).

Selection bias makes the δ̂_i = z_i values overinflated estimates of the true δ_i's. Suppose that for the prostate data we decided to employ the genes having the 51 largest values of |δ̂_i| for prediction. The vector of those 51 δ̂_i's has ‖δ̂‖ = 27.3, suggesting α = .003 in (2.5) (using c₀ = (50 · 52/102)^{1/2} = 5.05). The empirical Bayes calculations of Section 3 show that a more realistic estimate for the actual 51-vector's length is 19.8, giving α = .025.

The shrunken centroids algorithm of Tibshirani et al. (2002) counteracts selection bias by shrinking the estimates δ̂_i = z_i toward zero according to a soft thresholding rule,

\[ \hat{\delta}_i = \operatorname{sign}(z_i)\,\big(|z_i| - \lambda\big)_+. \tag{2.10} \]

In words, each value z_i is shrunk toward zero by amount λ, under the restriction that shrinking never goes past zero. A range of possible shrinkage parameters λ is tried, and for each one a prediction rule like (2.3) is formed, using

\[ \hat{S}_\lambda = \sum \hat{\delta}_i \hat{W}_i \qquad \left[\hat{W}_i = \frac{X_i - \hat{\mu}_i}{\hat{\sigma}_i}\right] \tag{2.11} \]

for prediction as in (2.4). Cross-validation is then employed to estimate αλ, the true error rate. (This description takes some liberties with the details of the shrunken centroids procedure.)

Notice that only cases having |zi| > λ enter into the prediction statistic Ŝλ. This is a favorable property: prediction is easier to implement and understand when the number of predictors is small.
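The soft thresholding step (2.10) amounts to a single line of R. The sketch below is a simplified stand-in for pamr, not its actual code; z is the vector of z-values from the earlier sketch and λ is a hypothetical threshold.

```r
## Soft thresholding (2.10) and the genes entering the statistic (2.11).
soft.threshold <- function(z, lambda) sign(z) * pmax(abs(z) - lambda, 0)

lambda    <- 2.16                            # one candidate shrinkage value
delta.hat <- soft.threshold(z, lambda)       # shrunken estimates
keep      <- which(delta.hat != 0)           # only cases with |z_i| > lambda survive
length(keep)                                 # number of genes that would enter S.hat

## For a new standardized vector W.hat = (X - mu.hat)/sigma.hat:
## S.hat <- sum(delta.hat[keep] * W.hat[keep])    # predict "sick" if S.hat > 0
```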

Table 1 shows a shrunken centroids analysis for the prostate data, carried out using pamr, a CRAN package in the R language. Cross-validation suggests λ = 2.16 as the best shrinkage parameter (so for instance z_i = 4 yields δ̂_i = 1.84 in (2.10)), with estimated error rate α̂_CV = .09; 377 of the 6033 genes are involved in Ŝ_λ. Unlike the theoretical result (2.5), adding too many predictors eventually decreases prediction accuracy in actual practice.

Table 1.

Shrunken centroids prediction for the prostate data, (2.10), (2.11), using the R package pamr (CRAN). The shrinkage parameter λ = 2.16 yields the smallest cross-validated error estimate, α̂_CV = .09. Prediction statistic Ŝ_λ involves 377 of the 6033 genes.

shrinkage value λ    # nonzero genes    CV error rate
0.00 6033 0.34
0.54 3763 0.33
1.08 1931 0.23
1.62 866 0.12
2.16 377 0.09
2.70 172 0.10
3.24 80 0.16
3.78 35 0.30
4.32 4 0.41
4.86 1 0.48
5.29 0 0.52

Looking at Table 1, it seems we should use λ = 2.16 in our prediction rule. There is, however, a subtle danger lurking here: because cross-validation is involved in the choice of “best” λ, the estimated rate .09 may be downwardly biased. It would take a second level of cross-validation to correct this bias.

A small simulation study was run with N = 1000, n₁ = n₂ = 10, and all x_{ij} independently N(0, 1). In this case δ_i = 0 for every i in (2.1), so α = .50 at (2.5); but the minimum cross-validated error rates observed in 100 repetitions of this set-up had median .30 with standard deviation ±.16.

This is an extreme example. Usually the downward bias is less severe, particularly when good prediction is possible. Nevertheless we will try to avoid such biases in what follows by using rules where the cross-validation calculations are not involved in the choice of tuning parameters.

3 Bayesian Prediction

Suppose we had a Bayesian prior distribution for the parameters in model (2.1) that enabled us to calculate posterior expectations for the δi’s, say

\[ \tilde{\delta}_i = E\{\delta_i \mid \mathbf{z}\}. \tag{3.1} \]

Bayes estimates are immune to selection bias: even if z_i were selected because it was the largest of the N z-values (z_i = 5.29 for gene i = 610 in the prostate data), δ̃_i would still be the correct Bayes estimate for δ_i. We could, for example, use the 50 largest values of δ̃_i to form S̃ = Σδ̃_iŴ_i, as in (2.3) or (2.11), while maintaining at least some confidence in the error rate estimate α̃ = Φ(−‖δ̃‖/2c₀). See Senn (2008) and Dawid (1994) for discussions of the “paradox” of Bayesian immunity to selection effects, including its dangers.

Brown (1971) and Stein (1981) developed a Bayesian model that is especially convenient for calculating δ̃_i in (3.1). For any (δ, z) pair we suppose that δ has a prior density g(δ),

\[ \delta \sim g(\cdot) \qquad \text{and} \qquad z \mid \delta \sim N(\delta, 1), \tag{3.2} \]

so that z has marginal density

\[ f(z) = \int_{-\infty}^{\infty} \varphi(z - \delta)\, g(\delta)\, d\delta \qquad \left[\varphi(z) = \frac{e^{-z^2/2}}{\sqrt{2\pi}}\right]. \tag{3.3} \]

Theorem 1. Under model (3.2), the posterior density of δ given z is

\[ g(\delta \mid z) = e^{\delta z - \psi(z)}\left[e^{-\delta^2/2} g(\delta)\right], \qquad \text{with} \quad \psi(z) = \log\!\left(\frac{f(z)}{\varphi(z)}\right). \tag{3.4} \]

Proof. According to Bayes theorem,

\[ g(\delta \mid z) = \frac{\varphi(z - \delta)\, g(\delta)}{f(z)}, \tag{3.5} \]

which reduces immediately to (3.4).

Form (3.4) represents an exponential family having sufficient statistic δ, natural (canonical) parameter z, and cumulant generating function (cgf) ψ(z). Therefore the conditional cumulants of δ given z can be obtained by differentiating ψ with respect to z:

Corollary 1.

\[ E\{\delta \mid z\} = \psi'(z) \qquad \text{and} \qquad \operatorname{Var}\{\delta \mid z\} = \psi''(z). \tag{3.6} \]

(Brown and Stein used multivariate versions of (3.6), differently derived, in their exploration of high-dimensional estimation theory.)

The advantage of Corollary 1 is that ψ′(z), and with it the conditional cumulants of δ given z, are obtained directly from the marginal density f(z), without requiring explicit calculation of the prior g(δ), finessing the usual difficulties of deconvolution.

The algorithm Ebay described in Section 4 approximates E{δ|z} and Var{δ|z} by substituting a smoothed estimate ψ̂(z) into (3.6). Figure 2 displays the Ebay output Ê{δ|z} for the prostate data, comparing it to the shrunken centroids curve (2.10) for λ = 2.16, the preferred choice in Table 1. Ê is better matched to the choice λ = 1.42 in (2.10), suggesting that less shrinking is better here.

Figure 2. Heavy curve is Ê{δ|z} for the prostate data, Ebay algorithm, Section 4, compared with the best shrunken centroids curve (2.10), λ = 2.16. Also shown is SD̂ = Vâr{δ|z}^{1/2}. At z = 4, Ê = 2.49, shrunken centroid = 1.84, SD̂ = .98. Remark I explains the slight positive slope of Ê{δ|z} for z in (−2, 1.5).

Suppose we add to the Brown—Stein model (3.2) the assumption that the prior distribution of δ has a discrete atom of probability at δ = 0 (see Remark C, Section 8),

\[ p_0 = \operatorname{Prob}\{\delta = 0\}. \tag{3.7} \]

Then Bayes theorem yields

\[ \operatorname{fdr}(z) = \operatorname{Prob}\{\delta = 0 \mid z\} = \frac{p_0\, \varphi(z)}{f(z)}, \tag{3.8} \]

fdr(z) being the “local false discovery rate,” Efron (2008). Comparing this with (3.4), (3.6) gives

Corollary 2. Under model (3.2), (3.7),

\[ E\{\delta \mid z\} = -\frac{d}{dz}\log\big(\operatorname{fdr}(z)\big) \qquad \text{and} \qquad \operatorname{Var}\{\delta \mid z\} = -\frac{d^2}{dz^2}\log\big(\operatorname{fdr}(z)\big). \tag{3.9} \]

It seemingly makes sense that only genes with low false discovery rates should be utilized in prediction rules. The corollary shows that this is roughly true, but in a rather surprising manner: large values of δ̃_i = E{δ_i | z_i} depend on the rate of change of log(fdr(z_i)), not on fdr(z_i) itself. Small values of fdr(z_i) usually correspond to large values of δ̃_i, but this doesn't have to be the case. Usually log(fdr(z_i)) is nearly constant around z = 0, where fdr(z) is near 1. This forces both Ê and SD̂ to be small, as seen in Figure 2 (see Remark I).

4 Empirical Bayes Prediction

The Ebay algorithm that produced Figure 2 employs empirical Bayes methods to construct effective prediction rules. That is, it uses z, the vector of all N z-values, to estimate the Bayes prediction rule (2.3), (2.4). Here is a schematic description of Ebay's operation (a rough R sketch of Steps 1–5 follows the list):

  1. A target error rate α0 is selected (default α0 = .025).

  2. An estimate f̂(z) for the marginal density f(z), (3.3), is obtained using Poisson regression applied to z; see Remark D.

  3. The estimated cumulant generating function ψ̂(z) = log(f̂(z)/φ(z)), (3.4), is numerically differentiated to give
    \[ \hat{\delta}_i = \hat{\psi}'(z_i) = \hat{E}\{\delta_i \mid z_i\} \tag{4.1} \]
    as in (3.6).
  4. Letting δ̂_I be the vector of the I largest δ̂_i's (in absolute value), I is chosen to be the smallest integer such that the nominal error rate Φ(−‖δ̂_I‖/2c₀), (2.5), is less than α₀; that is, I is the minimum choice yielding
    \[ \frac{\|\hat{\delta}_I\|}{2c_0} \;\ge\; \Phi^{-1}(1 - \alpha_0). \tag{4.2} \]
  5. The empirical Bayes prediction rule is based on the sign of
    \[ \hat{S} = \sum_{I} \hat{\delta}_i \left(\frac{X_i - \hat{\mu}_i}{\hat{\sigma}_i}\right), \tag{4.3} \]
    with (μ̂_i, σ̂_i) as in (2.8).
  6. Repeated 10-fold cross-validation is used to furnish an unbiased estimate of the rule’s prediction error; see Remark G.
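The following R sketch walks through Steps 1–5 for illustration only; it is not the Ebay program itself, and the names z, x, n1, n2, mu.hat, and sigma.hat are assumed to carry over from the earlier sketches.

```r
## A rough rendering of Steps 1-5 of the Ebay recipe (see also Remark D).
library(splines)
alpha0 <- 0.025
c0 <- sqrt(n1 * n2 / (n1 + n2))

## Step 2: Poisson regression estimate of the marginal density f(z)
brk   <- seq(min(z) - .1, max(z) + .1, length = 91)   # 90 bins
h     <- hist(z, breaks = brk, plot = FALSE)
fit   <- glm(h$counts ~ ns(h$mids, df = 7), family = poisson)$fitted.values
f.hat <- fit / (sum(fit) * diff(brk)[1])              # rescale counts to a density
mids  <- h$mids

## Step 3: psi.hat = log(f.hat/phi), numerically differentiated over the bin grid
psi.hat   <- log(f.hat / dnorm(mids))
dpsi      <- diff(psi.hat) / diff(mids)
delta.fun <- approxfun((mids[-1] + mids[-length(mids)]) / 2, dpsi, rule = 2)
delta.hat <- delta.fun(z)                             # (4.1): E.hat{delta_i | z_i}

## Step 4: add predictors in order of |delta.hat| until (4.2) is satisfied
ord   <- order(abs(delta.hat), decreasing = TRUE)
I     <- which(sqrt(cumsum(delta.hat[ord]^2)) / (2 * c0) >= qnorm(1 - alpha0))[1]
genes <- ord[1:I]                                     # cases entering the rule

## Step 5, for a new observation vector X:
## S.hat <- sum(delta.hat[genes] * (X[genes] - mu.hat[genes]) / sigma.hat[genes])
```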

Table 2 shows a portion of Ebay's output for the prostate data. Its prediction rule employs the genes with the 51 largest values of |δ̂_i|, at which point (4.2) is first satisfied (compared with 377 genes for the apparently best shrunken centroids rule in Table 1). An unbiased error estimate, based on 20 randomized 10-fold cross-validation runs, was .092, the same as the minimum error seen in Table 1; see Remark G.

Table 2.

Ebay prediction rule for the prostate data; the rule uses the genes with the 51 largest |δ̂_i| values, α̂ = Φ(−‖δ̂‖/2c₀) = .025. Cross-validation error rate .092 ± .004. Column α̂_cor is explained in Section 5.

Step Index z-value δ^ α^ α^cor
1 610 5.29 4.30 0.335 0.335
2 1720 4.83 3.78 0.285 0.281
3 364 -4.42 -3.70 0.250 0.250
4 3940 -4.33 -3.64 0.222 0.222
5 4546 -4.29 -3.58 0.199 0.215
6 4331 -4.14 -3.40 0.182 0.189
7 332 4.47 3.34 0.167 0.181
8 914 4.40 3.24 0.154 0.166
9 1068 4.25 3.06 0.144 0.148
10 4088 -3.88 -3.05 0.135 0.149
45 4154 -3.38 -2.26 0.029 0.050
46 2 3.57 2.25 0.028 0.050
47 2370 3.56 2.24 0.028 0.049
48 3282 3.56 2.23 0.027 0.048
49 3505 -3.33 -2.18 0.026 0.046
50 905 3.51 2.18 0.025 0.047
51 4040 -3.33 -2.17 0.025 0.048

There are, potentially, many reasons why the nominal error rate .025 might be over-optimistic: (μ̂_i, σ̂_i) in (2.8) does not equal (μ_i, σ_i); the X_i are not normally distributed; the X_i are not independent (see Section 5); the empirical Bayes estimates δ̂_i differ from the actual Bayes estimates (3.1).

This last point can cause particular trouble at the extremes of the z scale, just where δ^(z) is largest but there are fewest zi’s for the estimation of δ^. Figure 3 concerns the following artificial situation, using notation similar to that for the prostate data and model (2.1):

\[ \begin{aligned} & N = 5000, \quad n_1 = n_2 = 20, \quad \delta_i \overset{\text{ind}}{\sim} N(1.5, 1) \ \text{ for } i = 1, 2, \ldots, 250, \quad \delta_i = 0 \ \text{ for } i = 251, 252, \ldots, 5000, \\ & x_{ij} \overset{\text{ind}}{\sim} N\!\left(\pm\frac{\delta_i}{2c_0},\, 1\right) \ \text{ for all } i \text{ and } j, \qquad c_0 = \sqrt{\frac{20 \cdot 20}{40}}. \end{aligned} \tag{4.4} \]

This results in

\[ z_i \overset{\text{ind}}{\sim} N(\delta_i, 1) \tag{4.5} \]

at (2.9), with δi ~ N (1.5, 1) for the first 250 genes, and 0 otherwise.

Figure 3. True curve E{δ|z} (heavy), compared to Ê{δ|z} = ψ̂′(z) for 50 simulations from model (4.4). The estimates Ê are reasonably accurate for z < 4, but fall apart for larger z values.

Figure 3 compares Ê{δ|z} from the Ebay algorithm with the true curve E{δ|z}. The estimates are reasonably accurate up to z = 4, but degenerate beyond that. Remark E of Section 8 derives a delta-method formula for the standard error of Ê{δ|z} that predicts this behavior. Modifying (4.4), (4.5) so that the zi’s were correlated, with root mean square correlation coefficient 0.1, increased the variability of Ê{δ|z} by roughly 50%.

An option in Ebay allows for truncation of the δ̂ estimation procedure at some number “ktrunc” of observations in from the extremes. With ktrunc = 5, for instance, δ̂_i for the five largest z_i values is set equal to the largest δ̂_i among the remaining N − 5 cases, and similarly at the negative end of the z scale.

Figure 4 shows the actual misclassification error probabilities α for 200 simulations from model (4.4), each time using the Ebay prediction rule with nominal error rate α0 = .025. As the truncation parameter ktrunc increases from 0 to 15, the actual prediction errors α decrease toward the target value .025. Table 3 displays the means and standard deviations for the data in Figure 4.

Figure 4. Actual prediction errors α of the Ebay rule with nominal α₀ = .025; 200 simulations from model (4.4). As the truncation parameter increases from 0 (rightmost histogram) to 15 (leftmost), the actual errors decrease toward the nominal α₀.

Table 3.

Means and standard deviations for actual prediction errors α in simulation experiment for Figure 4.

ktrunc: 15 10 5 0
Mean: .032 .038 .051 .066
SD: .019 .018 .022 .025

Truncation had a less dramatic effect on the prostate data: for ktrunc = 0, 5, 10, 15, the cross-validated error estimates were .092, .085, .070, .077. Lowering the target rate from α0 = .025 to .01 gave corresponding error estimates .070, .062, .061, .058. Correlation among the predictors is part of the problem here; see Section 5.

Our original error estimate α^=.092 is “honest”, i.e., nearly unbiased for the Ebay rule produced with (α0, ktrunc) = (.025, 0). So are the α^ estimates for the other (α0, ktrunc) combinations. Choosing the combination with the smallest α^, however, again raises the possibility of over-optimism, as discussed at the end of Section 2.

More elaborate “honest” selection criteria, beyond the current capabilities of Ebay, might involve minimizing a linear combination of nominal error rate and number of predictors, say

\[ \Phi\!\left(-\frac{\|\hat{\delta}_I\|}{2c_0}\right) + C \cdot I \tag{4.6} \]

over all choices of I; accounting for correlation as in Section 5; adjusting for non-normality; using theoretical or data-based techniques to choose the truncation parameter; etc.

Some “snooping” into the cross-validation estimates seems inevitable in applications. Nevertheless, I believe that holding snooping to a minimum is good practice for honest prediction assessment, and that empirical Bayes methods, perhaps further refined, can be sufficiently accurate to allow for a nearly-honest practical methodology.

5 Correlation Corrections

The assumption of case-wise independence in model (2.1) is likely to be untrue, perhaps spectacularly untrue, in many applications. Suppose that the vector W of standardized predictors Wi = (Xiμi)/σi, (2.2), actually has covariance matrix Σ. Then both error probabilities in (2.5) become

\[ \alpha = \Phi(-\Delta_0\, \eta), \qquad \text{where} \quad \Delta_0 = \frac{\|\delta\|}{2c_0} \quad \text{and} \quad \eta = \left(\frac{\delta^t \delta}{\delta^t \Sigma\, \delta}\right)^{1/2}. \tag{5.1} \]

Here Δ0 is the independence value, while η is a correction factor, usually less than 1, that increases the error rate α. See Remark K of Section 8.

If we can estimate Σ we can estimate correction factor η,

\[ \hat{\eta} = \left(\frac{\hat{\delta}^t \hat{\delta}}{\hat{\delta}^t \hat{\Sigma}\, \hat{\delta}}\right)^{1/2}. \tag{5.2} \]

According to (2.1), cov(W) = Σ has diagonal elements 1 in both classes, so the off-diagonal elements ρ_{ii′} are correlations. Notice that we need to estimate these for only the I cases selected by the Ebay algorithm at (4.2), not for all N cases. For the prostate data we need to estimate a 51 × 51 correlation matrix Σ, from the 51 × 102 data submatrix x_I of the full 6033 × 102 matrix x whose rows are indexed by the first column of Table 2.
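A minimal sketch of the correction (5.1)–(5.2): the sample correlation matrix of the selected genes' within-class residuals is plugged into η̂. The object names (x, genes, delta.hat, c0) are carried over from the earlier sketches and are assumptions, not Ebay's internal code.

```r
## Correlation-corrected error rate, (5.1)-(5.2), a rough sketch.
xI      <- x[genes, ]                               # I x n submatrix of training data
healthy <- 1:n1; sick <- (n1 + 1):(n1 + n2)
resid   <- cbind(xI[, healthy] - rowMeans(xI[, healthy]),   # subtract the two class
                 xI[, sick]    - rowMeans(xI[, sick]))      # means separately
Sigma.hat <- cor(t(resid))                          # I x I sample correlation matrix

d         <- delta.hat[genes]
eta.hat   <- sqrt(sum(d^2) / drop(t(d) %*% Sigma.hat %*% d))  # (5.2)
alpha.cor <- pnorm(-eta.hat * sqrt(sum(d^2)) / (2 * c0))      # (5.1)
```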

The last column of Table 2 in Section 4 shows α̂_cor, obtained from (5.1), (5.2), with Σ̂ the usual sample correlation matrix. Correlation degrades the nominal error probability from .025 to .048 (closer to the cross-validation estimate .092). Much of the degradation is due to three large correlations,

\[ r_{34,19} = .97, \qquad r_{36,15} = .65, \qquad r_{42,28} = .92, \tag{5.3} \]

the subscripts referring to the steps in Table 2.

Table 4 concerns a microarray study having more severe correlation problems, the Michigan lung cancer study discussed in Subramanian et al. (2005). There are N = 5217 genes, n = 86 subjects, n1 = 62 “good outcomes” and n2 = 24 “poor outcomes”. Here the Ebay algorithm stopped after 200 steps, without α^ reaching the target value α0 = .025. The correlation-corrected errors α^cor are much more pessimistic, actually increasing after the first 6 steps, eventually to α^cor=.360. A cross-validation error rate of .37 confirmed the pessimism. Restricting Ebay to use at most I = 10 predictors reduced the cross-validated error rate to .29, as suggested by Table 4 (an example of the kind of “snooping” disparaged at the end of Section 4, unless the decision to use the I = 10 Ebay prediction rule was made before the cross-validation calculations).

Table 4.

Ebay output for the Michigan lung cancer study. The correlation error estimates α̂_cor are much more pessimistic, as confirmed by cross-validation.

Step Index z-value δ^ α^ α^cor
1 3144 4.62 3.683 0.3290 0.329
2 2446 4.17 3.104 0.2813 0.307
3 4873 4.17 3.104 0.2455 0.256
4 1234 3.90 2.686 0.2234 0.225
5 621 3.77 2.458 0.2072 0.213
6 676 3.70 2.323 0.1942 0.228
7 2155 3.69 2.313 0.1824 0.230
8 3103 3.60 2.140 0.1731 0.236
9 1715 3.58 2.103 0.1647 0.240
10 452 3.54 2.028 0.1574 0.243
193 3055 2.47 0.499 0.0519 0.359
194 1655 -2.21 0.497 0.0518 0.359
195 2455 2.47 0.496 0.0517 0.359
196 3916 2.47 0.496 0.0516 0.359
197 4764 2.47 0.495 0.0515 0.359
198 1022 -2.20 -0.492 0.0514 0.359
199 1787 -2.19 -0.490 0.0513 0.360
200 901 -2.18 -0.486 0.0512 0.360

Sample correlation matrices tend toward overdispersion when n is small compared to the number of variates. Ebay includes an option for empirical Bayes shrinkage of the elements of Σ^; see Remark H.

6 Effect Size Estimation

Current developments in large-scale simultaneous inference have focused on hypothesis testing, where the goal is to identify a small number of non-null cases among a large number of potential candidates. See Dudoit et al. (2003) for a nice review. Benjamini and Yekutieli (2005) address a more ambitious goal: to assess the effect sizes for the non-null cases, that is, to estimate how far away they lie from the null hypothesis. The empirical Bayes theory of Section 4 provides an alternative approach to effect size estimation.

We begin with assumptions (3.2), (3.7), that

\[ z_i \sim N(\delta_i, 1), \qquad i = 1, 2, \ldots, N, \tag{6.1} \]

and that proportion p0 of the effects δi equal 0,

\[ p_0 = \operatorname{Prob}\{\delta_i = 0\}, \tag{6.2} \]

these being the uninteresting null cases. The local false discovery rate fdr(z) = p₀φ(z)/f(z), (3.8), is the Bayes posterior probability Prob{δ_i = 0 | z_i}. If fdr^(z_i), an estimate of fdr(z_i), is suitably small, then case i can be reported as “probably non-null”, and we would like to put some sort of confidence limits on the effect size δ_i. The prior g(δ) in (3.2) is now of the mixed form

\[ g(\delta) = p_0 I_0(\delta) + (1 - p_0)\, g_1(\delta), \tag{6.3} \]

where I0(δ) is a delta-function at 0, and g1(δ) indicates the density of the non-null cases (see Remark C). Then the mixture density f(z), (3.3), becomes

\[ f(z) = p_0 \varphi(z) + (1 - p_0) f_1(z), \qquad \text{where} \quad f_1(z) = \int_{-\infty}^{\infty} \varphi(z - \delta)\, g_1(\delta)\, d\delta. \tag{6.4} \]

Theorem 2. Under model (6.1), (6.2), the posterior density of effect size δ given z and given that δ ≠ 0 is

\[ g_1(\delta \mid z) = e^{\delta z - \psi_1(z)}\left[e^{-\delta^2/2} g_1(\delta)\right], \qquad \text{where} \quad \psi_1(z) = \log\!\left\{\frac{1 - \operatorname{fdr}(z)}{\operatorname{fdr}(z)} \Big/ \frac{1 - p_0}{p_0}\right\}. \tag{6.5} \]

Proof. Bayes rule says that g1(δ|z) = φ(zδ)g1(δ)/f1(z), yielding

\[ g_1(\delta \mid z) = e^{\delta z - \log\{f_1(z)/\varphi(z)\}}\left[e^{-\delta^2/2} g_1(\delta)\right]. \tag{6.6} \]

An equivalent form of (3.8) is

\[ 1 - \operatorname{fdr}(z) = \operatorname{Prob}\{\delta \neq 0 \mid z\} = \frac{(1 - p_0)\, f_1(z)}{f(z)}, \tag{6.7} \]

from which we obtain, using (6.4),

\[ \frac{f_1(z)}{\varphi(z)} = \frac{p_0}{1 - p_0} \cdot \frac{1 - \operatorname{fdr}(z)}{\operatorname{fdr}(z)}. \tag{6.8} \]

Combining (6.8) and (6.6) verifies Theorem 2.

As in (3.6), the conditional moments of a non-null δ (one for which δ ≠ 0) given z are obtained by differentiating ψ1(z),

E1{δz}=ψ1andVar1{δz}=ψ1(z), (6.9)

where the subscript “1” indicates conditioning on δ ≠ 0. Some calculation gives E1 and Var1 in terms of E{δ|z} and Var{δ|z} in (3.6):

Corollary 3. Under model (6.1), (6.2),

\[ E_1\{\delta \mid z\} = \frac{E\{\delta \mid z\}}{1 - \operatorname{fdr}(z)} \qquad \text{and} \qquad \operatorname{Var}_1\{\delta \mid z\} = \frac{1}{1 - \operatorname{fdr}(z)}\left[\operatorname{Var}\{\delta \mid z\} - \frac{\operatorname{fdr}(z)}{1 - \operatorname{fdr}(z)}\, E\{\delta \mid z\}^2\right]. \tag{6.10} \]

Note. Since δ = 0 with probability fdr(z), we have

\[ E\{\delta^j \mid z\} = \big[1 - \operatorname{fdr}(z)\big]\, E_1\{\delta^j \mid z\}. \tag{6.11} \]

Using (6.11) with j = 1 and 2 leads to a quick verification of (6.10).

Our prediction algorithm in Section 4 requires only the estimation of E{δ|z}. Effect size estimation is more difficult, requiring Var{δ|z} and fdr(z) as well. The plug-in estimate of Var₁{δ|z} in (6.10) may be particularly unstable, in which case we can conservatively replace it with the estimate of Var{δ|z}, as shown next.

Rearranging (6.10) yields

\[ \frac{\operatorname{Var}_1}{\operatorname{Var}} = \frac{1 - \operatorname{fdr}(z)\, Q(z)}{1 - \operatorname{fdr}(z)}, \qquad \text{where} \quad Q(z) = \frac{E^2}{(1 - \operatorname{fdr}(z))\operatorname{Var}} \tag{6.12} \]

(with Var = Var{δ|z}, E = E{δ|z}, etc.), so that

\[ \operatorname{Var}_1 \le \operatorname{Var} \qquad \text{if} \qquad \operatorname{Var} \le \frac{E^2}{1 - \operatorname{fdr}(z)}. \tag{6.13} \]

Since Var is usually near 1, this last condition is satisfied whenever δ̂ = Ê{δ|z} gets large enough to be interesting; in the case of the prostate data, for z ≥ 2.

Figure 5 demonstrates effect size estimation for the prostate data. The Poisson GLM estimate of f(z) described in Remark D provides the estimates fdr^(z), Ê{δ|z}, and Vâr{δ|z}, as in Figure 2. The curved band in Figure 5 follows

\[ \hat{E}\{\delta \mid z\}\big/\big[1 - \widehat{\operatorname{fdr}}(z)\big] \;\pm\; \widehat{\operatorname{Var}}\{\delta \mid z\}^{1/2} \tag{6.14} \]

as in Corollary 3, showing approximate 68% intervals for δ given z and given δ ≠ 0, made more conservative by replacing Vâr₁ with Vâr. At z = 4, for example, we estimate that either δ = 0 with probability fdr^(4) = .048, or, if δ ≠ 0, it lies in the interval [1.58, 3.64] with estimated posterior probability exceeding .68. (Remember that δ, as defined in (2.1), is the number of standard deviations separating the two class means, multiplied by c₀.)
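Continuing the earlier density-estimation sketch, the band (6.14) can be computed from f̂(z) by crude finite differences on the bin grid; p0 below is a plausible placeholder (see Remark D), not an estimate produced here.

```r
## Approximate 68% non-null interval (6.14) at a chosen z0, a rough sketch.
## Uses f.hat, mids, psi.hat from the earlier Ebay sketch.
p0   <- 0.93                                   # placeholder null proportion (Remark D)
binw <- diff(mids)[1]
E.delta <- c(NA, diff(psi.hat) / binw)                            # ~ E{delta | z}
V.delta <- c(NA, diff(psi.hat, differences = 2) / binw^2, NA)     # ~ Var{delta | z}
fdr.hat <- pmin(p0 * dnorm(mids) / f.hat, 1)                      # (3.8)

z0 <- 4
k  <- which.min(abs(mids - z0))                # nearest bin midpoint
center <- E.delta[k] / (1 - fdr.hat[k])        # E1{delta | z}, (6.10)
band   <- center + c(-1, 1) * sqrt(max(V.delta[k], 0))   # conservative Var, per (6.14)
c(fdr = fdr.hat[k], lower = band[1], upper = band[2])
```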

Figure 5. Effect size estimation for the prostate data. The band is an approximate 68% interval for δ given z and given δ ≠ 0; also shown is fdr^(z), the estimated probability that δ = 0 given z. At z = 4, fdr^(z) = .048, with interval [1.58, 3.64] if δ is non-null.

Benjamini and Yekutieli’s (2005) False Coverage Rate algorithm provides conservative frequentist confidence bounds on the cases declared non-null by an FDR testing procedure, assuming independence of the z_i's. There is, however, a heavy price to pay: the bounds tend to be very wide. For z = 4 in the prostate example, their 68% interval is [1.36, 6.64] (using their Definition 1, with q = .32). Part of the problem, as discussed in Section 7 of Efron (2008), is that the Benjamini/Yekutieli procedure does not split off an atom of probability at δ = 0, though splitting seems natural in the hypothesis testing framework of (3.7) or (6.3).

The approximate 68% non-null limits (6.14) were calculated for 25 replications of simulation model (4.4). They appear in Figure 6, along with the true Bayesian posterior limits (z + 1.5)/2 ± (1/2)^{1/2}. Using Vâr instead of Vâr₁ in (6.14) makes the intervals too wide, but their overall performance is acceptable as rough estimates of effect size.

Figure 6. Approximate effect size limits (6.14) for 25 replications of simulation model (4.4). Heavy straight lines are the actual 68% Bayes posterior limits for non-null cases.

7 Other Response Variables

The development so far has concerned dichotomous response variables: healthy versus sick in the prostate example. This section extends the empirical Bayes prediction methodology to general univariate responses.

Let Y be a univariate response of interest, for example a survival time, that we wish to predict from X = (X₁, X₂, …, X_N)ᵗ as in Section 2. For convenience we assume that Y has been standardized to have mean 0 and variance 1, denoted

\[ Y \sim (0, 1), \tag{7.1} \]

though this will play no role in the actual methodology.

We suppose that Y influences the standardized variable W_i = (X_i − μ_i)/σ_i, (2.2), through linear regression,

\[ W_i = \beta_i Y + \epsilon_i, \qquad i = 1, 2, \ldots, N, \tag{7.2} \]

Var(ϵ_i) = 1, where the vector of errors ϵ is uncorrelated with Y. In the dichotomous situation of (2.1), Y = −1 or 1 and β_i = δ_i/2c₀. Effective prediction of Y depends upon discovering those X_i's with large values of |β_i|. See Remark J of Section 8.

The joint distribution of Y and W has mean vector and covariance matrix

\[ \begin{pmatrix} Y \\ \mathbf{W} \end{pmatrix} \sim \left[\begin{pmatrix} 0 \\ \mathbf{0} \end{pmatrix},\ \begin{pmatrix} 1 & \beta^t \\ \beta & \beta\beta^t + \Sigma \end{pmatrix}\right], \tag{7.3} \]

Σ indicating the covariance matrix of ϵ. The best linear predictor of Y from W is

\[ Y^* = \beta^t(\beta\beta^t + \Sigma)^{-1}\mathbf{W} = \frac{1}{1 + \Delta^2}\, \beta^t \Sigma^{-1} \mathbf{W}, \tag{7.4} \]

where Δ2 is the squared Mahalanobis distance

\[ \Delta^2 = \beta^t \Sigma^{-1} \beta. \tag{7.5} \]

If Σ is the identity, as assumed in (2.1), then Y* = constant · βᵗW, similarly to (2.3).

Combining (7.4) with (7.2) produces a simple expression for the conditional mean and variance of Y* given Y,

\[ Y^* \mid Y \;\sim\; \left(\frac{\Delta^2}{1 + \Delta^2}\, Y,\ \frac{\Delta^2}{(1 + \Delta^2)^2}\right), \tag{7.6} \]

from which (7.1) gives

\[ \operatorname{cor}(Y, Y^*) = \frac{\Delta}{\sqrt{1 + \Delta^2}}. \tag{7.7} \]

Effective prediction of Y from W requires a large value of Δ = (βᵗΣ⁻¹β)^{1/2}. (In the context of Section 2, where Σ = I and β = δ/2c₀, we have Δ = ‖δ‖/2c₀, so the error probability α equals Φ(−Δ) at (2.5).)
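A small numerical check of (7.4)–(7.7), using made-up β and Σ (assumptions for illustration): the simulated correlation between Y and the best linear predictor should match Δ/√(1+Δ²).

```r
## Best linear predictor (7.4) and its correlation with Y, (7.7): a toy check.
set.seed(3)
p      <- 20
beta   <- rnorm(p, 0, .3)                  # made-up regression coefficients (7.2)
Sigma  <- diag(p)                          # independent errors, as in (2.1)
Delta2 <- drop(t(beta) %*% solve(Sigma) %*% beta)       # (7.5)

nsim <- 50000
Y <- rnorm(nsim)                           # standardized response, (7.1)
W <- Y %o% beta + matrix(rnorm(nsim * p), nsim, p)      # (7.2) with unit errors
Ystar <- drop(W %*% solve(Sigma, beta)) / (1 + Delta2)  # (7.4)

c(theory = sqrt(Delta2 / (1 + Delta2)), simulated = cor(Y, Ystar))   # (7.7)
```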

To bring empirical Bayes methods to bear on the estimation of Y we need to estimate posterior expectations for the regression coefficients βi from the training data (1.1): the N × n matrix x and the n-vector of responses y = (y1, y2, …, yn)t. Let xit indicate the ith row of x. Applying model (7.2) independently to each column of x gives a linear model for the rows,

\[ x_i = \mu_i 1_n + \sigma_i(\beta_i \mathbf{y} + \epsilon_i), \tag{7.8} \]

where 1_n is a vector of n 1's, and the components of ϵ_i = (ϵ_{i1}, ϵ_{i2}, …, ϵ_{in})ᵗ are independent and identically distributed, with mean 0 and variance 1. Ordinary least squares applied to (7.8) provides familiar estimates of μ_i, σ_i and β_i. In the dichotomous setting of (2.1), μ̂_i and σ̂_i are as given in (2.8) while 2c₀β̂_i equals δ̂_i = z_i in (2.9).

If we assume that the errors ϵi are normally distributed, then the t-statistic “ti” for testing βi = 0 in (7.8) has a non-central t distribution with n — 2 degrees of freedom and non-centrality parameter proportional to βi,

\[ t_i \sim t_{n-2}(\delta_i), \qquad \left[\delta_i \equiv 2c_0 \beta_i \ \text{ with } \ c_0^2 = \frac{\sum_1^n (y_j - \bar{y})^2}{4}\right], \tag{7.9} \]

In usual practice, (7.9) remains a reasonable approximation as long as the ϵij distribution does not have heavy tails. With dichotomous yi, c0 = (n1n2/n)1/2 as before.

We can transform ti to a z-value via

\[ z_i = \Phi^{-1}\big(F_{n-2}(t_i)\big), \tag{7.10} \]

with Φ and F_{n−2} the standard normal and central t_{n−2} cdf's. If n is large then (7.9) gives

\[ z_i \overset{\cdot}{\sim} N(\delta_i, 1) \tag{7.11} \]

as in (2.9). Remark F improves upon approximation (7.11), but we will take it as given here.

We can now proceed as in Section 4:

  1. z = (z₁, z₂, …, z_N)ᵗ provides f̂(z), an estimate of the marginal density of the z-values (Remark D), and ψ̂(z) = log(f̂(z)/φ(z)).

  2. We then calculate
    \[ \hat{\delta}_i = \hat{\psi}'(z_i) = \hat{E}\{\delta_i \mid z_i\} \quad \text{for } i = 1, 2, \ldots, N. \tag{7.12} \]
  3. δ̂_I, the vector of the I largest δ̂_i's in absolute value, gives
    \[ \hat{\Delta}_I = \frac{\|\hat{\delta}_I\|}{2c_0} \qquad \text{and} \qquad \widehat{\operatorname{cor}}_I = \frac{\hat{\Delta}_I}{\sqrt{1 + \hat{\Delta}_I^2}}. \tag{7.13} \]
  4. We continue increasing I until either ĉor_I reaches some target value or I reaches a preselected upper bound, and use
    \[ \hat{Y} = \frac{1}{1 + \hat{\Delta}_I^2} \sum_{i=1}^{I} \frac{\hat{\delta}_i}{2c_0} \left(\frac{X_i - \hat{\mu}_i}{\hat{\sigma}_i}\right), \tag{7.14} \]
    from (7.4), to predict Y from X.

Steps (3) and (4) assume uncorrelated X_i's, i.e., Σ the identity matrix, but correlation can be incorporated as in Section 5.

These steps were carried out for an ongoing lung cancer microarray study involving n = 100 patients each measured on N = 16,000 genes. All patients received the same new drug. The response variable “Y ” was a categorical assessment of improvement, adjusted for two covariates, running from −2 (worst) to +2 (best).

Figure 7 shows Ê{δ|z}, calculated by Steps (1) and (2) above. It seems clear that any power of the microarray expression measurements to predict Y must come from those genes having zi less than −2. Table 5 shows this to be true. Predictive power is modest here, with theoretical correlation only .48 after I = 50 steps, asymptoting to .57 at I = 16,000.

Figure 7. Lung cancer microarray study, N = 16,000 genes, n = 100 patients, ordered categorical response variable Y. Heavy curve is Ê{δ|z}, (7.12). Dashes indicate those z-values exceeding 3 in absolute value.

Table 5.

Right column shows ĉor_I, (7.13), for the lung cancer data; I = 1 to 50. Final value ĉor_{16000} = .57.

Step Index z-value δ^ β^ Δ^ cor^I
1 12404 -4.27 -2.44 -0.20 0.04 0.04
2 6342 -3.98 -2.10 -0.17 0.07 0.07
3 2516 -3.92 -2.02 -0.16 0.10 0.10
4 488 -3.89 -1.99 -0.16 0.12 0.12
5 8471 -3.84 -1.93 -0.16 0.15 0.15
6 25 -3.84 -1.92 -0.16 0.17 0.17
7 2872 -3.82 -1.90 -0.15 0.20 0.19
8 300 -3.78 -1.85 -0.15 0.22 0.21
9 545 -3.78 -1.85 -0.15 0.24 0.23
10 12448 -3.78 -1.84 -0.15 0.26 0.26
45 10905 -2.83 -0.60 -0.05 0.54 0.47
46 390 -2.83 -0.59 -0.05 0.54 0.47
47 1498 -2.82 -0.59 -0.05 0.54 0.48
48 10317 -2.81 -0.57 -0.05 0.54 0.48
49 7894 -2.79 -0.56 -0.05 0.55 0.48
50 13263 -2.79 -0.55 -0.04 0.55 0.48

8 Remarks

The following remarks expand on some of the questions and technical points raised earlier.

A. Centroids Interpretation

Prediction rule (4.3), which depends on the sign of Ŝ = Σδ̂_iŴ_i, Ŵ_i = (X_i − μ̂_i)/σ̂_i, can be stated in more conventional centroid terminology: letting

\[ D_1 = \left\|\hat{W} + \frac{\hat{\delta}}{2c_0}\right\| \qquad \text{and} \qquad D_2 = \left\|\hat{W} - \frac{\hat{\delta}}{2c_0}\right\|, \tag{8.1} \]

we predict “healthy” if D₁ < D₂ and “sick” if D₂ < D₁; so −δ̂/2c₀ and δ̂/2c₀ are the standardized centroids. An alternative statement refers to the hyperplane L̂ passing through the origin of N-space orthogonal to the line segment connecting −δ̂/2c₀ with δ̂/2c₀: we predict healthy or sick depending on which side of L̂ the point Ŵ falls.

B. Unequal Prior Probabilities

Prediction rule (4.3) tacitly assumes that our dichotomous response variable has equal prior probabilities on the two categories, irrespective of the observed frequencies n1 and n2 in the training set. Suppose that the prior probabilities are actually π1 and π2. Starting with model (2.1), calculations involving Fisher’s linear discriminant function imply the following change from Remark A: the prediction boundary L^ is translated to intersect the orthogonal line segment at directed distance

\[ \frac{c_0}{\|\hat{\delta}\|}\, \log\!\left(\frac{\pi_1}{\pi_2}\right) \tag{8.2} \]

from the origin. (The definition of Ŵ is still (X − μ̂)/σ̂, with (μ̂_i, σ̂_i) as given in (2.8).)

C. The Prior Density g(δ)

In the Brown—Stein model (3.2), the prior density g(δ) can be extended to a general probability distribution G(δ) incorporating discrete atoms of probability as in (3.7). Theorem 1’s statement is almost unchanged,

\[ dG(\delta \mid z) = e^{\delta z - \psi(z)}\, e^{-\delta^2/2}\, dG(\delta). \tag{8.3} \]

The factor e−δ2/2 guarantees that the exponential family has natural parameter space including all values of z, justifying Corollary 2 for all z. The same considerations apply to Theorem 2.

D. Estimating f(z)

Ebay estimates f(z), the mixture density (3.3), by means of a Poisson generalized linear model (glm) applied to binned counts of the N z-values. In Figure 1, for example, there are K = 90 bins, each of width 0.1, ranging from −4.5 to 4.5. The counts

\[ c_k = \#\{z_i \text{ in bin } k\}, \qquad k = 1, 2, \ldots, K, \tag{8.4} \]

are the heights of the histogram bars. Let b indicate the K-vector of bin midpoints. Then the estimate f̂ = (f̂₁, f̂₂, …, f̂_K)ᵗ of f(z) at the points in b is obtained by Poisson regression of the counts on a natural spline function of the midpoints:

f̂ = glm(c ~ ns(b, df), poisson)$fit (8.5)

in R notation; the default degrees of freedom df equals 7 in Ebay; f̂ is the discretized MLE of f(z) in the 7-parameter exponential family defined by the natural spline basis.

Estimate (8.5) is the same one employed by locfdr, the local false discovery rate algorithm described in Efron (2008). Applied to the prostate data, locfdr estimated p̂₀ = .93 for the proportion of null genes (3.7), assuming that the theoretical null φ(z), (1.4), is correct.

E. Accuracy Formula for Ê{δ|z}

A closed-form delta-method expression for the variance of δ̂_i = Ê{δ_i | z_i} can be derived if we are willing to assume that the z_i's are independent of each other. Let M be the K × m structure matrix ns(b, df) in (8.5), K = 90 and m = 8; diag(c) the K × K diagonal matrix with diagonal entries the bin counts c_k; and G = Mᵗ diag(c) M. Section 5 of Efron (2007) employs the relationship

\[ d\hat{l} = M G^{-1} M^t\, dc \tag{8.6} \]

for the derivative matrix of the K-vector l̂ = log(f̂) with respect to a continuized version of c.

Let D be the (K − 2) × K matrix whose kth row is

\[ (0, 0, \ldots, 0, -1, 0, 1, 0, 0, \ldots)/d_0, \tag{8.7} \]

with −1 in the kth place: Dl̂ = l̂′, the numerical derivative of l̂. This gives

\[ d\hat{l}' = D M G^{-1} M^t\, dc. \tag{8.8} \]

The Poisson estimate Cov(c) = diag(c) for the covariance matrix of c then yields Cov(l̂′) = DMG⁻¹MᵗDᵗ. But since

\[ \psi'(z) = \frac{d}{dz}\log\!\left\{\frac{f(z)}{\varphi(z)}\right\} = z + l'(z), \tag{8.9} \]

we have δ̂(k) ≡ ψ̂′(z = b_k) = b_k + l̂′_k in (4.1), implying that

\[ \operatorname{Cov}(\hat{\delta}) = \operatorname{Cov}(\hat{l}') = D M G^{-1} M^t D^t. \tag{8.10} \]

Table 6 shows estimates of standard error for δ^ (square roots of the diagonal elements in (8.10)) calculated for the prostate data. As in Figure 3, we can see an explosive increase in variability as |z| increases to 4.
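A rough R rendering of (8.6)–(8.10), using the binned counts and midpoints from the earlier Ebay sketch; the forward-difference matrix below is a simplified stand-in for the (K − 2) × K central-difference matrix D of (8.7).

```r
## Delta-method covariance (8.10) for delta.hat(z), a sketch.
library(splines)
M <- cbind(1, ns(mids, df = 7))               # K x m structure matrix (m = 8)
G <- t(M) %*% (h$counts * M)                  # G = M' diag(c) M
K <- length(mids); d0 <- diff(mids)[1]

D <- (cbind(0, diag(K - 1)) - cbind(diag(K - 1), 0)) / d0   # finite differences

Cov.delta <- D %*% M %*% solve(G, t(M)) %*% t(D)            # (8.10)
sd.delta  <- sqrt(diag(Cov.delta))            # standard errors of delta.hat over bins
plot(mids[-1], sd.delta, type = "l", xlab = "z", ylab = "sd of delta.hat(z)")
```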

Table 6.

Delta-method standard errors for δ̂(z) = Ê{δ|z}, formula (8.10), for the prostate data.

z: -4 -3 -2 -1 0 1 2 3 4
sd: .41 .12 .09 .06 .04 .05 .09 .10 .33

F. Transforming t-values to z-values

The ith row of x comprises n independent observations

\[ x_{ij} \overset{\text{ind}}{\sim} N\!\left(\mu_i \pm \frac{\sigma_i \delta_i}{2c_0},\ \sigma_i^2\right) \qquad \text{for } j = 1, 2, \ldots, n, \tag{8.11} \]

in the notation of Section 2, with n1 “−” values and n2 “+” values. The corresponding two-sample t-statistic ti follows a non-central t distribution with n — 2 degrees of freedom and noncentrality parameter δi,

\[ t_i = c_0\, \frac{\bar{x}_{i2} - \bar{x}_{i1}}{\hat{\sigma}_i} \;\sim\; t_{n-2}(\delta_i). \tag{8.12} \]

Our previous discussion treated ti as zi ~ N (δi, 1), but Ebay actually employs transformations that improve the accuracy of Corollary 3.

Let

\[ z_i = \Phi^{-1}\big(F_{n-2}(t_i)\big), \tag{8.13} \]

as in (7.10), so if δi = 0 then zi ~ N (0, 1). If δi ≠ 0, zi is still surprisingly close to normal,

\[ z_i \overset{\cdot}{\sim} N\big(\zeta_i,\ \sigma^2(\zeta_i)\big), \qquad \left[\zeta_i = \Phi^{-1}\big(F_{n-2}(\delta_i)\big)\right], \tag{8.14} \]

with σ(ζi) < 1. For example, with δi = 4 and n = 102, zi from (8.13) has (mean, standard deviation, skewness, kurtosis) equal (3.845, .931, -.046, .010). A plot of (8.14) superimposed on (8.12) barely differentiates the two curves.
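The near-normality claim in (8.14) is easy to check by simulation; this R sketch (an illustration, not Ebay code) draws noncentral t values with δ = 4 and n = 102 and transforms them as in (8.13).

```r
## Numerical check of (8.13)-(8.14): z = Phi^{-1}(F_{n-2}(t)) for t ~ t_{n-2}(delta).
set.seed(4)
n <- 102; delta <- 4
t.draws <- rt(10^5, df = n - 2, ncp = delta)
## use upper tails to avoid rounding problems for large t
z.draws <- qnorm(pt(t.draws, df = n - 2, lower.tail = FALSE), lower.tail = FALSE)
c(mean = mean(z.draws), sd = sd(z.draws))     # close to (3.845, .931) quoted in the text
zeta <- qnorm(pt(delta, df = n - 2))          # zeta_i of (8.14)
```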

The computation of δ^i, (4.1) in the Ebay algorithm, is actually carried out using (8.14):

  • The vector t = (t1, t2, …, tN)t is converted component-wise to z = (z1, z2, …, zN), as in (8.13).

  • An estimate f̂(z) is constructed from z as in Remark D.

  • A modified version of Corollary 2, described below, provides empirical Bayes estimates ζ^i.

  • Finally, transformation (8.14) is inverted to give
    \[ \hat{\delta}_i = F_{n-2}^{-1}\big(\Phi(\hat{\zeta}_i)\big), \tag{8.15} \]
    after which Ebay proceeds as in Steps 4–6 in Section 4.

Suppose the Brown—Stein model (3.2) is modified to have z|δ ~ N (δ, σ2). Then it is easy to show that

\[ E\{\delta \mid z\} = z + \sigma^2 l'(z) \qquad \text{and} \qquad \operatorname{Var}\{\delta \mid z\} = \sigma^2 + \sigma^4 l''(z), \tag{8.16} \]

where l(z) is the log of the marginal density f(z). The empirical Bayes estimate ζ^i mentioned above is given by

\[ \hat{\zeta}_i = z_i + \hat{\sigma}_i^2\, \hat{l}'(z_i), \tag{8.17} \]

l̂(z) = log(f̂(z)) and σ̂_i² = σ²(z_i), where the variance function σ²(·) in (8.14) is calculated numerically. None of this gave answers much different than using (4.1) directly, but the transformation effect becomes more important when n is smaller.

G. Cross-Validation Procedure

Both Ebay and the shrunken centroids procedure default to 10-fold cross-validation replicated R times. Each replication randomly splits the n cases into 10 folds, with correctly proportional numbers of “healthy” and “sick” in each fold. As usual, the prediction rule is refit 10 times with the cases of each fold withheld from the training set in turn, the cross-validated rate α̂_CV being the overall proportion of errors on the withheld cases averaged over all R replications. The R replications also provide a standard error for α̂_CV.

It is useful to remember that α̂_CV is not an estimate of error for the specific prediction rule selected by Ebay or pamr (unlike the actual prediction errors in Figure 4, which were computed from knowledge of the simulation structure (4.4)). Rather, it is the expected error rate for rules selected according to the same recipe, as emphasized in Efron (1983). In this sense it differs from the ideal Bayesian estimate α̃ = Φ(−‖δ̃‖/2c₀) following (3.1), or its empirical Bayes version α̂ = Φ(−‖δ̂‖/2c₀), both of which apply directly to the prediction rule at hand.

H. Empirical Bayes Estimation of Σ

The histogram of off-diagonal elements r_{ii′} of a sample correlation matrix will usually be more dispersed than the corresponding histogram of true correlations ρ_{ii′}, because sampling error adds a component of variance to the r_{ii′} values. Ebay includes an empirical Bayes shrinkage option to account for overdispersion in the estimation of η, (5.2).

Let

\[ \nu_{ii'} = \frac{(n - 4)^{1/2}}{2}\, \log\!\left(\frac{1 + \rho_{ii'}}{1 - \rho_{ii'}}\right) \qquad \text{and} \qquad \upsilon_{ii'} = \frac{(n - 4)^{1/2}}{2}\, \log\!\left(\frac{1 + r_{ii'}}{1 - r_{ii'}}\right) \tag{8.18} \]

denote Fisher’s transform of ρii′ and rii′ (where the usual constant n — 3 has been reduced to n — 4 since two separate means are subtracted off, for the healthy and sick subjects separately). A standard normal theory approximation (Johnson and Kotz, 1970, Chapt. 32, Sect. 4), says that

\[ \upsilon_{ii'} \overset{\cdot}{\sim} N(\nu_{ii'}, 1), \tag{8.19} \]

implying that the histogram of the υii′ values will have variance about one unit greater than that for the true νii′ ’s.

Suppose the ensemble of true ν_{ii′} values has (mean, variance) say (M, A), and that υ_{ii′} ∼ (ν_{ii′}, 1) as in (8.19), so that the υ_{ii′} ensemble is approximately (M, A + 1). Then

\[ \tilde{\nu}_{ii'} = M(1 - C) + C\, \upsilon_{ii'}, \qquad \left[C = \frac{A}{A + 1}\right] \tag{8.20} \]

is the linear function of υ_{ii′} having (mean, variance) approximately (M, A).

Ebay first obtains robust estimates of M and A + 1 from the set of values {υ_{ii′}}, and then substitutes M̂ and Ĉ = Â/(Â + 1) into (8.20) to give estimates ν̃_{ii′}. In order to protect genuine outliers like those in (5.3), Efron and Morris’ (1972) limited translation rule is enforced: ν̃_{ii′} is not allowed to shrink further than one unit away from υ_{ii′}. Finally, ν̃_{ii′} gives ρ̃_{ii′} by inverting transformation (8.18). (Σ̃ may no longer be a correlation matrix, but that is not required for use in (5.2).)
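A rough sketch of the shrinkage recipe (8.18)–(8.20): Fisher-transform the off-diagonal sample correlations, shrink toward their robust center, cap the shrinkage at one unit (limited translation), and transform back. The median/mad choices here are simplified stand-ins for Ebay's robust estimates.

```r
## Empirical Bayes shrinkage of a sample correlation matrix, (8.18)-(8.20).
shrink.cor <- function(R, n) {
  r <- R[upper.tri(R)]
  u <- sqrt(n - 4) * 0.5 * log((1 + r) / (1 - r))    # Fisher transforms, (8.18)
  M.hat  <- median(u)                                # robust estimate of M
  Aplus1 <- mad(u)^2                                 # robust estimate of A + 1
  C.hat  <- max(Aplus1 - 1, 0) / Aplus1              # C = A/(A + 1), (8.20)
  v <- M.hat * (1 - C.hat) + C.hat * u               # shrunken transforms
  v <- pmax(pmin(v, u + 1), u - 1)                   # limited translation: <= 1 unit
  r.tilde <- tanh(v / sqrt(n - 4))                   # invert (8.18)
  R.tilde <- R
  R.tilde[upper.tri(R.tilde)] <- r.tilde
  R.tilde[lower.tri(R.tilde)] <- t(R.tilde)[lower.tri(R.tilde)]
  R.tilde
}
## e.g. Sigma.tilde <- shrink.cor(Sigma.hat, n) for the matrix from the Section 5 sketch
```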

A small simulation experiment was run, comparing Σ~ with the usual (unshrunk) estimate Σ^. It began with model (4.4), modified to instill correlation among the 5000 entries in any one column of x; the root mean square of true pair-wise correlations was set equal to 0.10, about triple that for the prostate study and half that for the Michigan lung cancer study of Table 4. Each of 200 replications yielded δ^I as in (4.2), the I × I sample correlation matrix Σ^, and its empirical Bayes counterpart Σ~.

The corresponding estimates (5.2),

\[ \hat{\eta} = \left(\frac{\hat{\delta}_I^t \hat{\delta}_I}{\hat{\delta}_I^t \hat{\Sigma}\, \hat{\delta}_I}\right)^{1/2} \qquad \text{and} \qquad \tilde{\eta} = \left(\frac{\hat{\delta}_I^t \hat{\delta}_I}{\hat{\delta}_I^t \tilde{\Sigma}\, \hat{\delta}_I}\right)^{1/2} \tag{8.21} \]

are compared with

\[ \eta_{\text{true}} = \left(\frac{\hat{\delta}_I^t \hat{\delta}_I}{\hat{\delta}_I^t \Sigma\, \hat{\delta}_I}\right)^{1/2} \tag{8.22} \]

in Table 7; η̃ is seen to offer only minor improvement over η̂. Robust estimates of standard deviation for η̃ − ηtrue compared to η̂ − ηtrue were a little more decisive: .074 compared to .085. Root mean square errors for estimating all of the elements of Σ strongly favored Σ̃ over Σ̂: rms~ = 6.30 versus rms^ = 9.83.

Table 7.

Estimates η̂ and η̃, (8.21), compared with the true correlation correction factor ηtrue, (8.22); 200 replications of correlated model (4.4). The rms values are root mean square errors for estimating the elements of Σ.

ηtrue η^ η~ rms^ rms~
mean: .597 .588 .598 9.83 6.30
stdev: .138 .171 .164 9.13 5.74

The α^cor values in Table 2 and Table 4 were based on Σ^, Ebay’s default option. Using Σ~ gave smaller estimates of the correlation effect in both cases. The choice is not crucial here since the current version of Ebay does not involve α^cor in constructing the prediction rule, but either or both methods convey useful information on the effects of correlation among the predictors.

Regularized estimation of correlation matrices is a major subject in its own right (see Warton, 2008), and other methods might further improve on Σ̂. However, Σ̂ performs relatively well in our context for two reasons: the dimension “I” of Σ tends not to be too large, and more importantly, we need only estimate the function η, (5.1), not all of Σ. If we are willing to consider δ̂ fixed in (5.2), then δ̂ᵗΣδ̂ is a linear function of Σ's elements, estimated almost unbiasedly by δ̂ᵗΣ̂δ̂. The estimation of Σ would be more crucial if we were attempting to implement the general linear discriminant function rather than the simplified version (2.3), (2.4).

I. Overdispersed z-Values

The z-value histogram for the prostate data in Figure 1 is a little bit wider than N(0, 1) near z = 0: a fit to the center of the histogram gave z ≈ N(0, 1.06²) (using the locfdr algorithm, Efron (2008)). This discrepancy is reflected in Figure 2 by the slight upward slope of Ê{δ|z} for z between −2 and 1.5. Theorem 1 and Corollary 2 in Section 3 depend on the assumption z ∼ N(δ, 1). If actually z ∼ N(δ, σ²), with σ² > 1, then the formula for E{δ|z} must be modified as in (8.16). We can compensate for overdispersion by using the standardized values z̃_i = z_i/1.06 rather than z_i in the Ebay algorithm. Doing so flattened Ê{δ|z} to zero between −2 and 1.5 in Figure 2, and shrank it slightly toward zero for larger |z|.

Figure 8 concerns a leukemia microarray study from Golub et al. (1999) where overdispersion is more severe. Here there are N = 7129 genes measured on n = 72 subjects in two subtypes, n₁ = 45 and n₂ = 27. Two-sample t-tests gave z-values z_i as in (1.2), (1.3). The histogram of z_i's corresponding to Figure 1 has z ≈ N(.09, 1.68²) near its center.

Figure 8. Solid curve is Ê{δ|z} for the leukemia data, Golub et al. (1999). Broken curve is Ê{δ|z} based on the standardized values z̃_i = (z_i − .09)/1.68. Top row of dashes indicates the 40 most extreme z_i values; lower row the 40 most extreme z̃_i values.

Now the curve Ê{δ|z} based on the standardized values z̃_i = (z_i − .09)/1.68 is much less optimistic than that based on the original z_i's, especially taking account of the decreased size of the z̃_i's. Prediction looks extremely easy with the z_i's; many genes have δ̂_i values, (4.1), exceeding 6. However δ̂_i tops out below 4 for the z̃_i's. Ebay required only I = 10 genes to reach target error α₀ = .01 using the z_i's, (4.2), compared with I = 34 for the z̃_i's.

Which prediction rule is better? The answer depends on the reason for the z_i's overdispersion. If in fact z_i ∼ N(δ_i, 1) and the appearance of overdispersion is due to most of the δ_i's lying far from zero, then the I = 10 rule should perform well. However, overdispersion may indicate ephemeral effects, for example due to unobserved covariates in an observational study, that won't help with future predictions, in which case the z̃_i analysis is more realistic.

J. Model (7.1), (7.2)

The predictor variable W_i appears as the response in (7.2), which may seem less natural there than in (2.1). This allows us to express each row x_i of the predictor matrix x as a separate linear regression on y, (7.8), facilitating the empirical Bayes estimation of the parameters β_i in (7.9)–(7.14). Notice that the correlation structure (7.3) implied by (7.1), (7.2) leads directly to (7.4), where now W assumes its proper role as a predictor vector.

K. Naive Bayes Prediction

Rule (2.3)–(2.4) is “naive Bayes” when applied in the correlated framework of Section 5. That is, it ignores the possible decrease in prediction error available from use of the full form of Fisher’s linear discriminant function. Such gains are likely to be more hypothetical than genuine. The theory and simulations in Bickel and Levina (2004) and Dudoit et al. (2002) show our naive Bayes prediction rules outperforming more sophisticated predictors in large-scale situations.

References

  • 1. Benjamini Y, Yekutieli D. False discovery rate-adjusted multiple confidence intervals for selected parameters. J. Amer. Statist. Assoc. 2005;100:71–93.
  • 2. Bickel PJ, Levina E. Some theory of Fisher's linear discriminant function, 'naive Bayes', and some alternatives when there are many more variables than observations. Bernoulli. 2004;10:989–1010.
  • 3. Brown LD. Admissible estimators, recurrent diffusions, and insoluble boundary value problems. Ann. Math. Statist. 1971;42:855–903.
  • 4. Dawid AP. Selection paradoxes of Bayesian inference. In: Multivariate Analysis and Its Applications (Hong Kong, 1992). IMS Lecture Notes Monogr. Ser., vol. 24. Inst. Math. Statist.; Hayward, CA: 1994. pp. 211–220.
  • 5. Dudoit S, Fridlyand J, Speed TP. Comparison of discrimination methods for the classification of tumors using gene expression data. J. Amer. Statist. Assoc. 2002;97:77–87.
  • 6. Dudoit S, Shaffer JP, Boldrick JC. Multiple hypothesis testing in microarray experiments. Statist. Sci. 2003;18:71–103.
  • 7. Efron B. Estimating the error rate of a prediction rule: improvement on cross-validation. J. Amer. Statist. Assoc. 1983;78:316–331.
  • 8. Efron B. Size, power and false discovery rates. Ann. Statist. 2007;35:1351–1377.
  • 9. Efron B. Microarrays, empirical Bayes, and the two-groups model. Statist. Sci. 2008;23:1–47.
  • 10. Efron B, Morris C. Limiting the risk of Bayes and empirical Bayes estimators. II. The empirical Bayes case. J. Amer. Statist. Assoc. 1972;67:130–139.
  • 11. Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science. 1999;286:531–537. doi: 10.1126/science.286.5439.531.
  • 12. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning. 2nd ed. Springer Series in Statistics. Springer-Verlag; New York: 2008.
  • 13. Johnson NL, Kotz S. Distributions in Statistics: Continuous Univariate Distributions, Vol. 2. Houghton Mifflin Co.; Boston, Mass.: 1970.
  • 14. Senn S. A note concerning a selection "paradox" of Dawid's. Amer. Statist. 2008;62:206–210.
  • 15. Singh D, Febbo PG, Ross K, Jackson DG, Manola J, Ladd C, Tamayo P, Renshaw AA, D'Amico AV, Richie JP, Lander ES, Loda M, Kantoff PW, Golub TR, Sellers WR. Gene expression correlates of clinical prostate cancer behavior. Cancer Cell. 2002;1:203–209. doi: 10.1016/s1535-6108(02)00030-2.
  • 16. Stein CM. Estimation of the mean of a multivariate normal distribution. Ann. Statist. 1981;9:1135–1151.
  • 17. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. 2005;102:15545–15550. doi: 10.1073/pnas.0506580102.
  • 18. Tibshirani R, Hastie T, Narasimhan B, Chu G. Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc. Natl. Acad. Sci. USA. 2002;99:6567–6572. doi: 10.1073/pnas.082099299.
  • 19. Warton DI. Penalized normal likelihood and ridge regularization of correlation and covariance matrices. J. Amer. Statist. Assoc. 2008;103:340–349.
