Author manuscript; available in PMC: 2010 Dec 1.
Published in final edited form as: J Am Stat Assoc. 2010 Sep 1;105(491):1042–1055. doi: 10.1198/jasa.2010.tm09129

Correlated z-values and the accuracy of large-scale statistical estimates

Bradley Efron *
PMCID: PMC2967047  NIHMSID: NIHMS164092  PMID: 21052523

Abstract

We consider large-scale studies in which there are hundreds or thousands of correlated cases to investigate, each represented by its own normal variate, typically a z-value. A familiar example is provided by a microarray experiment comparing healthy with sick subjects' expression levels for thousands of genes. This paper concerns the accuracy of summary statistics for the collection of normal variates, such as their empirical cdf or a false discovery rate statistic. It might seem that we must estimate an N × N correlation matrix, N being the number of cases, but our main result shows that this is not necessary: good accuracy approximations can be based on the root mean square correlation over all N · (N − 1)/2 pairs, a quantity often easily estimated. A second result shows that z-values closely follow normal distributions even under non-null conditions, supporting application of the main theorem. Practical application of the theory is illustrated for a large leukemia microarray study.

Keywords: rms correlation, non-null z-values, correlation penalty, Mehler's identity, empirical process, acceleration

1 Introduction

Modern scientific studies routinely produce data on thousands of related situations. A familiar example is a microarray experiment in which thousands of genes are being investigated for possible disease involvement. Each gene might produce a z-value, say zi, for the ith gene, by definition a test statistic theoretically having a standard normal distribution

H_0 : z_i \sim \mathcal{N}(0, 1) \qquad (1.1)

under the null hypothesis H0 of no disease involvement. A great deal of the current literature was developed under the assumption of independence among the zi's. This can be grossly unrealistic in practice, as discussed in Owen (2005) and Efron (2007a), among others. This paper concerns the accuracy of summary statistics of the zi's, for example, their empirical cdf (cumulative distribution function), under conditions of substantial correlation.

Figure 1 concerns a leukemia microarray study by Golub et al. (1999) that we will use for motivation and illustration. Two forms of leukemia are being examined for possible genetic differences: acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML). In the version of the data discussed here there are n1 = 47 ALL patients and n2 = 25 AML patients, with expression levels on the same N = 7128 genes measured on each patient.

Figure 1.


Histogram of z-values for N = 7128 genes, leukemia study, Golub et al. (1999). Dashed curve f̂(x), smooth fit to histogram; solid curve, "empirical null", a normal density fit from the central 50% of the histogram, is much wider than the theoretical 𝒩(0, 1) null distribution. Small red bars plotted negatively are discussed in Section 4.

A two-sample t-statistic ti comparing AML with ALL expression levels was computed for each gene and converted to a z-value,

z_i = \Phi^{-1}\left( F_{70}(t_i) \right), \qquad i = 1, 2, \ldots, N, \qquad (1.2)

where Φ and F70 are the cumulative distribution functions for a standard normal and a Student-t distribution with 70 degrees of freedom. Figure 1 shows a histogram of the zi's, which turns out to be much wider than (1.1) suggests: its central spread is estimated to be σ̂0 = 1.68 rather than 1, as discussed in Section 3.

Here is an example of the results to be derived in Sections 2 through 4. Let F̂(x) be the right-sided cdf ("survival curve") of the z-values,

\hat{F}(x) = \#\{z_i > x\} / N. \qquad (1.3)

Then a good approximation for the variance of F̂(x) is

\mathrm{Var}\{\hat{F}(x)\} \doteq \left\{ \frac{\hat{F}(x)\left(1 - \hat{F}(x)\right)}{N} \right\} + \left\{ \hat{\sigma}_0^2\, \hat{\alpha}\, \frac{\hat{f}^{(1)}(x)}{\sqrt{2}} \right\}^2. \qquad (1.4)

The first term in (1.4) is the usual binomial variance, while the second term is a correlation penalty accounting for dependence between the zi's. The quantities occurring in the correlation penalty are

  • σ̂0, the estimate of central spread (1.68 above);

  • α̂, an estimate of the root-mean-square of the correlations between the N(N − 1)/2 pairs of zi's (equaling about .11 for the leukemia data, as calculated from the simple formula in Section 3);

  • f̂(1)(x), the first derivative of a smooth fit to the z-value histogram (estimated by a Poisson spline regression in Figure 1).

The row marked ŝd in Table 1 is the square root of formula (1.4) applied to the leukemia data. F̂(4) = .025 is seen to have ŝd = .0040, more than double ŝd₀ = .0018, the binomial standard deviation obtained by ignoring the second term in (1.4). The permutation standard deviation, obtained from repeated permutations of the 72 patients, is only .0001 at x = 4. Permutation methods, which preserve within-microarray correlations, have been advocated for large-scale hypothesis testing (see Westfall and Young, 1993; Dudoit, Shaffer and Boldrick, 2003, Sect. 2.6), but they are inappropriate for the accuracy considerations of this paper.
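For readers who want to try formula (1.4) on their own data, here is a minimal sketch in Python (not from the paper; the function name sd_Fhat is ours, and a Gaussian kernel fit stands in for the Poisson spline estimate of Figure 1):

```python
import numpy as np
from scipy.stats import gaussian_kde

def sd_Fhat(z, x, sigma0_hat, alpha_hat):
    """Square root of formula (1.4): binomial variance plus correlation penalty.
    A Gaussian kernel estimate stands in for the paper's Poisson spline fit."""
    z = np.asarray(z)
    N = len(z)
    Fhat = np.mean(z[:, None] > x, axis=0)        # right-sided cdf (1.3)
    f1 = np.gradient(gaussian_kde(z)(x), x)       # f-hat^(1), first derivative
    var = Fhat * (1 - Fhat) / N + (sigma0_hat**2 * alpha_hat * f1 / np.sqrt(2))**2
    return np.sqrt(var)

# with the leukemia z-values and (sigma0, alpha) = (1.68, .11), this would
# target the sd-hat row of Table 1 at x = 1, ..., 5
```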

Table 1.

Estimates of standard deviation for the right-sided cdf F̂(x) (1.3); ŝd, square root of formula (1.4); ŝd₀, square root of the first term in (1.4); ŝd_perm, permutation standard deviation. Accuracy of the false discovery rate estimate Fdr̂(x) is discussed in Section 4.

x:        1     2     3      4      5
ŝd:       .017  .022  .0101  .0040  .0019
ŝd₀:      .005  .004  .0027  .0018  .0012
ŝd_perm:  .021  .001  .0014  .0001  .0000
F̂(x):     .29   .13   .057   .025   .010
Fdr̂(x):   .94   .92   .71    .38    .15

Formula (1.4), and more ambitious versions that include covariances across different values of x, are derived in Section 2 and Section 3: an exact expression is derived first, followed by a series of simplifying approximations and techniques for their estimation. The basic results are extended to provide more general accuracy estimates in Section 4: comparing, for example, the variability of local versus tail-area false discovery rates.

All of our results depend on the assumption that the zi's are normal with possibly different means and variances,

z_i \sim \mathcal{N}(\mu_i, \sigma_i^2), \qquad i = 1, 2, \ldots, N. \qquad (1.5)

There is no requirement that they be “z-values” in the hypothesis-testing sense of (1.1), and in fact this paper is more concerned with estimation than testing. However, z-values are ubiquitous in large-scale applications, and not only in the two-sample setting of the leukemia study. Section 5 concerns the non-null distribution of z-values. A theorem is derived justifying (1.5) as a good approximation, allowing results like (1.4) to be applied to the leukemia study z-values. Section 6 and Section 7 close with remarks and a brief summary.

The statistics microarray literature has shown considerable interest in the effects of large-scale correlation, some good references being Dudoit, van der Laan and Pollard (2004), Owen (2005), Qiu, Klebanov and Yakovlev (2005b), Qiu, Brooks, Klebanov and Yakovlev (2005a) and Desai, Deller and McCormick (2009). Efron (2007a) used a z-value setting to examine the effects of correlation on false discovery rate analysis; that paper's Section 2 theorem is a null hypothesis version of the general result developed here. A useful extension along different lines appears in Schwartzman and Lin (2009).

Clarke and Hall's (2009) asymptotic calculations support the use of the independence standard deviation ŝd₀ in Table 1, even in the face of correlation. The situations they consider are low-correlation by the standards here, with the root-mean-square value α̂ of (1.4) approaching zero (from their assumption (3.2)). Since α̂ is often easy to estimate, formulas such as (1.4) provide a quantitative check on the use of ŝd₀.

2 The distribution of correlated normal variates

Given N correlated normal variates z1, z2, … , zN, with possibly different means and standard deviations, let F̂(x) denote their right-sided empirical cdf1

\hat{F}(x) = \#\{z_i \ge x\} / N, \qquad -\infty < x < \infty. \qquad (2.1)

This section presents tractable formulas for the mean and covariance of the process {F̂(x), −∞ < x < ∞}, and a simpler approximation that we will see is nicely suited for applications.

Rather than work directly with cdfs, it will be easier, and in a sense more basic, to first derive results for a discretized version of the empirical density of the zi values. We partition the range 𝒵 into K bins 𝒵k,

\mathcal{Z} = \bigcup_{k=1}^{K} \mathcal{Z}_k, \qquad (2.2)

each bin being of width Δ. Let xk indicate the midpoint of 𝒵k, and yk the number of zi's in 𝒵k,

y_k = \#\{z_i \in \mathcal{Z}_k\}, \qquad k = 1, 2, \ldots, K. \qquad (2.3)

We will derive expressions for the mean and covariance of the vector y = (y1, y2, … , yK)′. In effect, y is the order statistic of z = (z1, z2, … , zN)′, becoming exactly that as the bin width Δ → 0. (In which case the yk values go to 1 or 0, with the non-zero bin xk values indicating the locations of the ordered zi's, assuming no ties.) Familiar statistical applications, of the type described in Section 4, depend on z only through y.

Suppose that the zi's are divided into a finite number of classes, with members of the cth class 𝒞c having mean μc and standard deviation σc,

z_i \sim \mathcal{N}(\mu_c, \sigma_c^2) \qquad \text{for } z_i \in \mathcal{C}_c. \qquad (2.4)

Let Nc be the number of members of 𝒞c, with pc the proportion

N_c = \#\{\mathcal{C}_c\} \qquad \text{and} \qquad p_c = N_c / N, \qquad (2.5)

so Σc Nc = N and Σc pc = 1. The use of model (2.4) for z-values is supported by the results of Section 5.

If x is the K-vector of bin midpoints, let xkc = (xk − μc)/σc and

x_c = (x - \mu_c)/\sigma_c = (\ldots, x_{kc}, \ldots)'. \qquad (2.6)

Likewise, for any real-valued function h(x) we define hc to be the K-vector of function values

h_c = (\ldots, h(x_{kc}), \ldots)', \qquad (2.7)

also denoted by h(xc) in what follows.

It is easy to calculate the expectation of the count vector y under the multi-class model (2.4)-(2.5). Let πkc equal the probability that zi from class 𝒞c falls into the kth bin,

\pi_{kc} = \mathrm{Prob}_c\{z_i \in \mathcal{Z}_k\} \doteq \Delta\, \varphi(x_{kc}) / \sigma_c. \qquad (2.8)

Here φ(x) = exp(−x²/2)/√(2π), the standard normal density. The approximation πkc ≐ Δφ(xkc)/σc from (2.4) becomes arbitrarily accurate for Δ sufficiently small, and we will take it as exact in what follows. Then

E\{y\} = N \sum_c p_c \pi_c = N \Delta \sum_c p_c\, \varphi(x_c) / \sigma_c = N \Delta \sum_c p_c\, \varphi_c / \sigma_c. \qquad (2.9)

The K × K covariance matrix of the count vector y depends on the N × N correlation matrix of z, but in a reasonably simple way discussed next. Two important definitions are needed to state the first result: there are M = N(N − 1)/2 correlations ρii′ between pairs (zi, zi′) of members of z, and we denote by "g(ρ)" the distribution putting weight 1/M on each ρii′. Also, for φρ(u, v) the bivariate normal density having zero means, unit standard deviations, and correlation ρ, we define

\lambda_\rho(u, v) = \frac{\varphi_\rho(u, v)}{\varphi(u)\,\varphi(v)} - 1 = (1 - \rho^2)^{-1/2} \exp\left\{ \frac{2\rho u v - \rho^2(u^2 + v^2)}{2(1 - \rho^2)} \right\} - 1 \qquad (2.10)

and

\lambda(u, v) = \int_{-1}^{1} \lambda_\rho(u, v)\, g(\rho)\, d\rho \qquad (2.11)

(the integral notation being shorthand for summing over M discrete points).

Lemma 1

Under the multi-class model (2.4)-(2.5), the covariance of the count vector y (2.3) has two components,

\mathrm{cov}(y) = \mathrm{cov}_0 + \mathrm{cov}_1, \qquad (2.12)

where

\mathrm{cov}_0 = N \sum_c p_c \left\{ \mathrm{diag}(\pi_c) - \pi_c \pi_c' \right\} \qquad (2.13)

and

\mathrm{cov}_1 = N^2 \sum_c \sum_d p_c\, p_d\, \mathrm{diag}(\pi_c)\, \lambda_{cd}\, \mathrm{diag}(\pi_d) - N \sum_c p_c\, \mathrm{diag}(\pi_c)\, \lambda_{cc}\, \mathrm{diag}(\pi_c). \qquad (2.14)

Here diag(πc) is the K × K diagonal matrix having diagonal elements πkc, similarly diag(πd), while λcd is the K × K matrix with klth element λ(xkc, xld); the summations are over all classes.

Note. Equation (2.14) assumes that the correlation distribution g(ρ) is the same across all classes 𝒞c. The proof of Lemma 1, which is similar to that for the simpler situation of Efron (2007a), appears in Remark C of Section 6.

The cov0 term in (2.12)-(2.13) is the sum of the multinomial covariance matrices that would apply if the zi's were mutually independent with fixed numbers drawn from each class; cov1 is a penalty for correlation, almost always increasing cov(y). The N2 factor in (2.14) makes the correlation penalty more severe as N increases, assuming g(ρ) stays the same.

Expression (2.14) for the correlation penalty can be considerably simplified. Mehler's identity for λρ(u, v) (2.10) is

\lambda_\rho(u, v) = \sum_{j \ge 1} \frac{\rho^j}{j!}\, h_j(u)\, h_j(v), \qquad (2.15)

where hj is the jth Hermite polynomial. (See Lancaster, 1958 for an enlightening discussion of (2.15), also known as the “tetrachoric series”, and its connections to the singular value decomposition, canonical correlation, Pearson's coefficient of contingency, and correspondence analysis.) Denoting the jth moment of the correlation distribution g(ρ) by αj,

\alpha_j = \int_{-1}^{1} \rho^j g(\rho)\, d\rho, \qquad (2.16)

(2.11) becomes

\lambda(u, v) = \sum_{j \ge 1} \frac{\alpha_j}{j!}\, h_j(u)\, h_j(v), \qquad (2.17)

so λcd in (2.14) can be written in outer product notation as

\lambda_{cd} = \sum_{j \ge 1} \frac{\alpha_j}{j!}\, h_j(x_c)\, h_j(x_d)'. \qquad (2.18)

Making use of (2.8), taken as exact,

\mathrm{diag}(\pi_c)\, h_j(x_c) = \Delta\, \mathrm{diag}(\varphi(x_c))\, h_j(x_c) / \sigma_c = (-1)^j\, \Delta\, \varphi_c^{(j)} / \sigma_c, \qquad (2.19)

where φc(j) indicates the jth derivative of φ(u) evaluated at each component of xc (using φ(j)(u) = (−1)^j φ(u)hj(u)).
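Mehler's identity (2.15) can be checked numerically; the sketch below is our own verification (not from the paper), comparing the closed form (2.10) against a truncated version of the series, with numpy's probabilists' Hermite polynomials playing the role of hj:

```python
import numpy as np
from numpy.polynomial.hermite_e import hermeval
from math import factorial

def lam_exact(u, v, rho):
    # lambda_rho(u, v) of (2.10), in closed form
    return (1 - rho**2) ** -0.5 * np.exp(
        (2 * rho * u * v - rho**2 * (u**2 + v**2)) / (2 * (1 - rho**2))) - 1

def lam_mehler(u, v, rho, J=20):
    # truncated tetrachoric series (2.15); hermeval(x, c) evaluates sum c_j He_j(x)
    total = 0.0
    for j in range(1, J + 1):
        cj = np.zeros(j + 1); cj[j] = 1.0           # coefficient vector for He_j
        total += rho**j / factorial(j) * hermeval(u, cj) * hermeval(v, cj)
    return total

print(lam_exact(1.0, 0.5, 0.3), lam_mehler(1.0, 0.5, 0.3))  # the two agree closely
```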

Rearranging (2.14) then gives a simplified formula.

Lemma 2.

Defining

\bar{\phi}^{(j)} \equiv \sum_c p_c\, \varphi_c^{(j)} / \sigma_c, \qquad (2.20)

(2.14) for the correlation penalty becomes

\mathrm{cov}_1 = N^2 \Delta^2 \left\{ \sum_{j \ge 1} \frac{\alpha_j}{j!}\, \bar{\phi}^{(j)} \bar{\phi}^{(j)\prime} - \frac{1}{N} \sum_{j \ge 1} \frac{\alpha_j}{j!} \left( \sum_c p_c\, \frac{\varphi_c^{(j)} \varphi_c^{(j)\prime}}{\sigma_c^2} \right) \right\}. \qquad (2.21)

A convenient approximation to cov1 is based on three reductions of (2.21):

  • The second term in (2.21) is negligible for large N.

  • Common standardization methods for large-scale data sets often make α1, the expectation of g(ρ), exactly or nearly zero, as illustrated in Section 3 for the leukemia data; see Section 3 of Efron (2009).

  • This leaves α2 of (2.16) as the leading term in (2.21). With ρ confined to [−1, 1], the higher-order moments αj = Eg{ρ^j} often decrease quickly to zero.

The root mean square (rms) correlation

\alpha = \alpha_2^{1/2} = \left[ \int_{-1}^{1} \rho^2 g(\rho)\, d\rho \right]^{1/2}, \qquad (2.22)

featured in Efron (2007a) (where it is called the total correlation), is a single-number summary of the zi's entire correlation structure. Carrying out the three reductions above produces a greatly simplified form of (2.21),

\text{rms approximation:} \qquad \mathrm{cov}_1 \doteq (N \Delta \alpha)^2\, \bar{\phi}^{(2)} \bar{\phi}^{(2)\prime} / 2, \qquad (2.23)

with ϕ̄(2) in (2.20) depending on the second derivative of the normal density, φ(2)(u) = φ(u) · (u² − 1).

Figure 2 compares the exact formulas (2.12)-(2.14) for cov(y) with the simplified formula based on the rms approximation (2.23), for a numerical example having N = 6000, α = .10, and two classes (2.4)-(2.5), initially with

(\mu_0, \sigma_0) = (0, 1),\ p_0 = .95 \qquad \text{and} \qquad (\mu_1, \sigma_1) = (2.5, 1),\ p_1 = .05, \qquad (2.24)

but then recentered as in the leukemia example; see Remark D of Section 6 for more detail. The plotted curves show the standard deviations sd{yk} = cov_kk(y)^{1/2} from (2.12), the corresponding rms approximation (2.23), and also sd0{yk} = (cov0,kk)^{1/2} from (2.13). We can see there is a substantial correlation penalty over most of the range of z, and also that the rms approximation is quite satisfactory here.
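The computation behind this comparison is easy to reproduce. The sketch below is our own re-creation, with made-up variable names, using the values of (2.24) before recentering (so the curves would be shifted slightly relative to Figure 2); it evaluates the exact covariance (2.12)-(2.14) and the rms approximation (2.23) over a grid of bins:

```python
import numpy as np

N, Delta = 6000, 0.2
xk = np.arange(-7.0, 7.0 + 1e-9, Delta)                    # bin midpoints x_k
classes = [(0.95, 0.0, 1.0), (0.05, 2.5, 1.0)]             # (p_c, mu_c, sigma_c)
rhos, wts = np.array([0.20, -0.05]), np.array([0.2, 0.8])  # g(rho); alpha = .10

phi = lambda u: np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)

def lam(u, v):
    """lambda(x_kc, x_ld) of (2.10)-(2.11), averaged over g(rho), as a K x K matrix."""
    out = np.zeros((len(u), len(v)))
    for rho, w in zip(rhos, wts):
        num = 2 * rho * np.outer(u, v) - rho**2 * np.add.outer(u**2, v**2)
        out += w * ((1 - rho**2) ** -0.5 * np.exp(num / (2 * (1 - rho**2))) - 1)
    return out

cov = np.zeros((len(xk), len(xk)))
for pc, mc, sc in classes:
    xc = (xk - mc) / sc
    pic = Delta * phi(xc) / sc                              # pi_kc of (2.8)
    cov += N * pc * (np.diag(pic) - np.outer(pic, pic))     # cov0, (2.13)
    cov -= N * pc * np.diag(pic) @ lam(xc, xc) @ np.diag(pic)   # 2nd term of (2.14)
    for pd, md, sd in classes:                              # 1st term of (2.14)
        xd = (xk - md) / sd
        pid = Delta * phi(xd) / sd
        cov += N**2 * pc * pd * np.diag(pic) @ lam(xc, xd) @ np.diag(pid)

# rms approximation (2.23): only the alpha_2 = alpha^2 term retained
alpha = np.sqrt(np.sum(wts * rhos**2))
phibar2 = sum(p * phi((xk - m) / s) * (((xk - m) / s)**2 - 1) / s
              for p, m, s in classes)                       # phi-bar^(2) of (2.20)
cov1_rms = (N * Delta * alpha)**2 * np.outer(phibar2, phibar2) / 2
```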

Figure 2.


Comparison of exact formula for standard deviation of yk from (2.12) (heavy curve) with rms approximation from (2.23) (dotted curve); N = 6000, α = .10 in (2.22), two classes as in (2.24). Dashed curve is standard deviation from (2.13), ignoring the correlation penalty. Hash marks indicate bin midpoints xk.

Returning to right-sided cdfs (2.1), let B be the K × K matrix

B_{kk'} = \begin{cases} 1 & \text{if } k' \ge k \\ 0 & \text{if } k' < k \end{cases} \qquad (2.25)

so

\hat{F} = \frac{1}{N} B y \qquad (2.26)

is a K-vector with kth component the proportion of zi's in bins indexed ≥ k,

\hat{F}_k = \# \left\{ z_i \ge x_k - \frac{\Delta}{2} \right\} \Big/ N, \qquad k = 1, 2, \ldots, K. \qquad (2.27)

(B would be transposed if we were dealing with left-sided cdfs.) The expectation of F̂ is both obvious and easy to obtain from (2.9),

E\{\hat{F}_k\} = \sum_c p_c \left[ \sum_{k' \ge k} \Delta\, \varphi\!\left( \frac{x_{k'} - \mu_c}{\sigma_c} \right) \Big/ \sigma_c \right] \doteq \sum_c p_c \int_{x_{kc}}^{\infty} \varphi(u)\, du = \sum_c p_c\, \Phi^+(x_{kc}), \qquad (2.28)

where Φ+(u) = 1 − Φ(u). Now that we are working with tail areas rather than densities we can let Δ → 0, making (2.28) exact.

F̂ has covariance matrix B cov(y)B′/N². The same kind of calculation as in (2.28), applied to Lemma 1, gives the following theorem.

Theorem 1

Under the multiclass model (2.4)-(2.5),

\mathrm{Cov}(\hat{F}) = \mathrm{Cov}_0 + \mathrm{Cov}_1, \qquad (2.29)

where Cov0 has klth entry

\frac{1}{N} \sum_c p_c \left\{ \Phi^+\!\left( \max(x_{kc}, x_{lc}) \right) - \Phi^+(x_{kc})\, \Phi^+(x_{lc}) \right\} \qquad (2.30)

and

\mathrm{Cov}_1 = \sum_{j \ge 1} \frac{\alpha_j}{j!}\, \bar{\varphi}^{(j-1)} \bar{\varphi}^{(j-1)\prime} - \frac{1}{N} \sum_{j \ge 1} \frac{\alpha_j}{j!} \left\{ \sum_c p_c\, \varphi_c^{(j-1)} \varphi_c^{(j-1)\prime} \right\}. \qquad (2.31)

Here pc is from (2.5), xkc and xlc from (2.6), αj is as in (2.16) and

\bar{\varphi}^{(j-1)} = \sum_c p_c\, \varphi_c^{(j-1)} = \sum_c p_c\, \varphi^{(j-1)}(x_c). \qquad (2.32)

(Notice the distinction between φ̄ and ϕ̄ (2.20), and between Cov and cov etc., Lemma 1.)

The three-step reduction leading to (2.23) can also be applied to Cov1: for α as in (2.22),

\text{Rms approximation:} \qquad \mathrm{Cov}_1 \doteq \alpha^2\, \bar{\varphi}^{(1)} \bar{\varphi}^{(1)\prime} / 2, \qquad (2.33)

with φ̄(1) depending on the first derivative of the normal density, φ(1)(u) = −φ(u)u. Section 3 shows that (2.33) is especially convenient for applications.

Figure 3 is the version of Figure 2 applying to F̂: the heavy curve tracks sd{F̂k} from (2.29), the dotted curve is from the Rms approximation (2.33), and the dashed curve shows the standard deviations from Cov0 (2.30), ignoring the correlation penalty. Once again the simple approximation formula performs well, particularly for extreme values of z, which are likely to be the ones of interest in applications. The correlation penalty is more severe here than in Figure 2, especially in the tails.

Figure 3.


Comparison of exact formula for sd{F̂k} from Theorem 1 (heavy curve) with the Rms approximation (2.33) (dotted curve); same example as in Figure 2. Dashed curve shows standard deviation estimates ignoring the correlation penalty.

The Cov0 formula (2.30) is essentially the covariance function for a Brownian bridge. Results related to Theorem 1 can be found in the empirical process literature; see equation (2.2) of Csörgő and Mielniczuk (1996) for example, which applies to the “one-class” case when all the zi's are 𝒩(0, 1). Desai et al. (2009) extend the covariance calculations in Efron (2007a) to include skewness corrections.

3 Estimation of the correlation parameters

Application of Section 2's theory requires us to estimate several parameters: the rms correlation α (2.22), and the class components (pc, μc, σc) in (2.4)-(2.5) (though we will see that the latter task can be avoided under some assumptions). This section illustrates the estimation process in terms of the leukemia study of Section 1. X, the data matrix for the study, has N = 7128 rows, one for each gene, and n = 72 columns, one for each patient; the n1 = 47 ALL patients precede the n2 = 25 AML patients. Entry xij of X is the expression level for gene i on patient j. The columns of X were individually standardized to have mean 0 and variance 1; see Remark E.

The ith row of X gives ti, the two-sample t-statistic comparing expression levels on gene i for AML versus ALL patients. These are converted to z-values zi = Φ−1(F70(ti)) (1.2), whose histogram appears in Figure 1. As noted before, the histogram is much wider near its center than a theoretical 𝒩(0, 1) null distribution: analysis using the locfdr program described in Efron (2007b, 2008) estimated that proportion p0 = .93 of the genes were "null" (i.e., identically distributed for ALL and AML), and that z-values for the null genes followed a 𝒩(.09, 1.68²) distribution.

We wish to estimate the rms correlation α (2.22). Let X0 indicate an N × n0 subset of X pertaining to a single population of subjects, for example the 47 ALL patients. There are N · (N − 1)/2 sample correlations ρ̂ii′ between rows i and i′ of X0. Computing all of these, or a sufficiently large random sample, yields the empirical mean and variance (m, v) of the ρ̂ distribution,

\hat{\rho} \sim (m, v), \qquad (3.1)

with (m, v) = (.002, .190²) for the ALL patients. As discussed in Section 3 of Efron (2009), standardizing the columns of X0 to have mean 0 forces m ≐ 0, and we will assume m = 0 in what follows. (This is equivalent to taking α1 = 0, as we did following (2.21).)

The obvious choice α̂ = v^{1/2} tends to greatly overestimate α: each ρ̂ii′ is nearly unbiased for its true correlation ρii′, a normal-theory approximation for its mean and variance being

\hat{\rho}_{ii'} \sim \left( \rho_{ii'},\ \frac{(1 - \rho_{ii'}^2)^2}{n - 3} \right) \qquad (3.2)

(Johnson and Kotz, 1970), but the considerable variances in (3.2) can greatly broaden the empirical distribution of the ρ̂'s. Two corrected estimates of α are developed in Efron (2009). The simpler correction formula is

\hat{\alpha}^2 = \frac{n_0}{n_0 - 1} \left( v - \frac{1}{n_0 - 1} \right), \qquad (3.3)

based on an identity between the row and column correlations of X0. The second approach uses an empirical Bayes analysis of the variance term in (3.3) to justify a more elaborate formula,

\tilde{\alpha}^2 = \tilde{v} - \frac{3}{n - 5}\, \tilde{v}^2 \qquad \left[ \tilde{v} = \frac{(n - 3)\, v - 1}{n - 5} \right]. \qquad (3.4)

The first three columns of Table 2 compare α̂ with α̃ for X0 based on the ALL patients, the AML patients, and both. The final column reports mean ± standard deviation for α̂ and α̃ in 100 simulations of model (2.24): N = 6000, n1 = n2 = 40 patients in each class, true α = .10; see Remark D. The two estimates are effectively linear functions of each other for typical values of v; α̂, the simpler choice, is preferred by the author.
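The estimate α̂ of (3.3) needs only the empirical variance v of a sample of row correlations. Below is a sketch of the computation (our own implementation, with a hypothetical function name; X0 is assumed column-standardized as in the text, and a random subsample of the N(N − 1)/2 pairs keeps the cost manageable):

```python
import numpy as np

def alpha_hat(X0, npairs=20000, seed=0):
    """rms correlation estimate (3.3) from an N x n0 single-population matrix X0."""
    N, n0 = X0.shape
    rng = np.random.default_rng(seed)
    i, j = rng.integers(0, N, npairs), rng.integers(0, N, npairs)
    keep = i != j
    Xs = X0 - X0.mean(1, keepdims=True)               # center each row
    Xs /= np.sqrt((Xs**2).sum(1, keepdims=True))      # unit-norm each row
    rho = (Xs[i[keep]] * Xs[j[keep]]).sum(1)          # sampled row correlations
    v = rho.var()                                     # empirical variance, m = 0
    return np.sqrt(max(n0 / (n0 - 1) * (v - 1 / (n0 - 1)), 0.0))
```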

Table 2.

Estimates α̂ and α̃, (3.3) and (3.4), for rms correlation α (2.22) of leukemia data; also 100 simulations of model (2.24), N = 6000, n1 = n2 = 40, true α = .10, showing mean ± standard deviation.

ALL AML Both Simulation
α̂:  .121  .109  .114  .1054 ± .0074
α̃:  .118  .092  .113  .1045 ± .0075

It seems that we need to estimate the class components (pc, μc, σc) in (2.4)-(2.5) in order to apply the theory of Section 2, but under certain assumptions this can be finessed, as discussed next.

The marginal density f(z) under model (2.4)-(2.5) is

f(z) = \sum_c p_c\, \varphi\!\left( \frac{z - \mu_c}{\sigma_c} \right) \frac{1}{\sigma_c}; \qquad (3.5)

so, letting f = f(x) (the density evaluated at the K-vector of bin midpoints), we have Δ · f = Σc pc πc as in (2.8). Formula (2.13) can be expressed as

\mathrm{cov}_0 = N \left\{ \mathrm{diag}(\Delta f) - \sum_c p_c\, \pi_c \pi_c' \right\}. \qquad (3.6)

Here we are assuming, as in (2.5), that the class sample sizes Nc are fixed. A more realistic assumption might be that the numbers N1, N2, … ,NC are a multinomial sample of size N, sampled with probabilities p1, p2, … ,pC, in which case (3.6) becomes the usual multinomial covariance matrix

\mathrm{cov}_0 = N \left\{ \mathrm{diag}(\Delta f) - \Delta^2 f f' \right\}. \qquad (3.7)

A smooth curve fit to the histogram heights2 as in Figure 1 then yields an estimate of cov0 by substitution into (3.7), without requiring knowledge of the class structure (2.4)-(2.5). In the same way, we can estimate Cov0 for F̂ in (2.30) by the standard multinomial formula

\left( \widehat{\mathrm{Cov}}_0 \right)_{kl} = \frac{1}{N} \left\{ \hat{F}_{\max(k, l)} - \hat{F}_k \hat{F}_l \right\}. \qquad (3.8)

Under some circumstances, a similar tactic can be applied to estimate the correlation penalties cov1 and Cov1, (2.23) and (2.33). The first and second derivatives f(1)(z) and f(2)(z) of (3.5) are

f^{(1)}(z) = \sum_c p_c\, \varphi^{(1)}\!\left( \frac{z - \mu_c}{\sigma_c} \right) \frac{1}{\sigma_c^2} \qquad \text{and} \qquad f^{(2)}(z) = \sum_c p_c\, \varphi^{(2)}\!\left( \frac{z - \mu_c}{\sigma_c} \right) \frac{1}{\sigma_c^3}. \qquad (3.9)

Suppose we make the homogeneity assumption that all σc values are the same, say σc = σ0. Comparison with definitions (2.32) and (2.20) then gives

\bar{\varphi}^{(1)} = \sigma_0^2\, f^{(1)} \qquad \text{and} \qquad \bar{\phi}^{(2)} = \sigma_0^2\, f^{(2)}, \qquad (3.10)

with f(j) the vector (…, f(j)(xk), …)′. This leads to the convenient covariance penalty formulas,

\mathrm{Cov}_1 \doteq \frac{(\sigma_0^2 \alpha)^2}{2}\, f^{(1)} f^{(1)\prime} \qquad \text{and} \qquad \mathrm{cov}_1 \doteq \frac{(N \Delta \sigma_0^2 \alpha)^2}{2}\, f^{(2)} f^{(2)\prime}, \qquad (3.11)

from (2.33) and (2.23).

A smooth estimate f̂(z) of f(z) can be differentiated to give estimated values of Cov1 and cov1, for example,

\mathrm{sd}_1\{\hat{F}_k\} = \left( \widehat{\mathrm{Cov}}_1 \right)_{kk}^{1/2} = \hat{\sigma}_0^2\, \hat{\alpha}\, \left| \hat{f}^{(1)}(x_k) \right| \Big/ \sqrt{2}, \qquad (3.12)

for the correlation penalty standard deviation of F̂(xk) (2.27). (This provides the second term in formula (1.4).) The heavy curve in Figure 4 shows (3.12) for the leukemia data, using σ̂0 = 1.68, α̂ = .114, and f̂(z) from Figure 1.
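Combining the multinomial piece (3.8) with the penalty (3.11) gives a full covariance estimate for F̂ across a grid of x values. The sketch below is our own assembly (hypothetical function name) under the homogeneity assumption, with a kernel density estimate standing in for the paper's Poisson spline fit:

```python
import numpy as np
from scipy.stats import gaussian_kde

def cov_Fhat(z, x, sigma0_hat, alpha_hat):
    """Estimated Cov(F-hat) = (3.8) + (3.11) on grid x, homogeneity assumed."""
    z, x = np.asarray(z), np.asarray(x)
    N, K = len(z), len(x)
    Fhat = np.mean(z[:, None] >= x, axis=0)             # right-sided cdf (2.1)
    idx = np.maximum.outer(np.arange(K), np.arange(K))  # index max(k, l)
    Cov0 = (Fhat[idx] - np.outer(Fhat, Fhat)) / N       # multinomial piece (3.8)
    f1 = np.gradient(gaussian_kde(z)(x), x)             # f-hat^(1) by differencing
    g = sigma0_hat**2 * alpha_hat * f1 / np.sqrt(2)
    return Cov0 + np.outer(g, g)                        # + correlation penalty (3.11)
```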

Figure 4.


Leukemia data; two estimates of the correlation penalty standard deviation sd1{F̂k} for F̂k (2.27). Solid curve, formula (3.12); dashed curve, Rms approximation (2.33) using class estimates from Table 3. Dotted curve is the independence estimate from (3.8), indicating that the correlation penalty is substantial.

Suppose we are unwilling to make the homogeneity assumption. A straightforward approach to estimating Cov1 or cov1 requires assessments of the parameters (pc, μc, σc) in (2.4)-(2.5). These can be based on the “non-null counts” (Efron, 2007b), the small bars plotted negatively in Figure 1; see Remark B. The figure suggests three classes, left, center and right, with parameter values as estimated in Table 3.

Table 3.

Three-class model (2.4)–(2.5) for leukemia data. Parameter estimates based on non-null counts, Remark B.

left center right
pc: .054 .930 .016
μc: −4.2 .09 5.4
σc: 1.16 1.68 1.05

The dashed curve in Figure 4 shows sd1{F̂k} estimated directly from (2.32)-(2.33) using the values in Table 3. It is similar to the homogeneity estimate (3.12) except in the extreme tails.

Formula (1.4) for the standard deviation of F̂(x) was tested in a simulation experiment. The specifications were the same as in Figure 3, with N = 6000, α = .10, and two classes of z-values (2.24). One hundred X matrices were generated as in the simulation for Table 2, each yielding a vector of 6000 correlated z-values, followed by σ̂0, α̂ and f̂(1)(x) for use in (1.4); see Remark D for further details. Finally, ŝd, the square root of (1.4), was calculated along with ŝd0, the square root of just the first term.

The solid curve in Figure 5 shows the average of the ŝd values for x between −4 and 4.5, with solid bars indicating standard deviations of the 100 ŝd's. There is a good match of the average with the exact sd curve from Figure 3. The error bars indicate moderate variability across the replications. The average for ŝd0, dashed curve, agrees with the corresponding curve in Figure 3 and shows that correlation cannot be ignored in this situation.

Figure 5.


Simulation experiment for formula (1.4). Solid curve, average of ŝd, square root of (1.4), over 100 replications, with bars indicating the standard deviation of ŝd at x = −4, −3, …, 4; dotted curve, exact sd from Figure 3; dashed curve, average of ŝd0, the standard error estimate for F̂(x) ignoring correlation.

4 Applications

Correlation usually degrades statistical accuracy, an important question for the data analyst being the severity of its effects on the estimates and tests at hand. The purpose of Section 2 and Section 3 was to develop practical methods for honestly assessing the accuracy of inferences made in the presence of large-scale correlation. This section presents a few examples of the methodology in action.

We have already seen one example: in Table 1 the accuracy of F̂(x), the right-sided empirical cdf for the leukemia data, computed from the usual binomial formula that assumes independence among the z-values,

\widehat{\mathrm{sd}}_0 = \left\{ \hat{F}(x)\left(1 - \hat{F}(x)\right) / N \right\}^{1/2}, \qquad (4.1)

was compared with ŝd from formula (1.4), in which the correlation penalty term was included: ŝd more than doubled ŝd0 over most of the range.

Suppose we assume, as in Efron (2008), that each of the N cases (the N genes in the leukemia study) is either null or non-null with prior probability p0 or p1 = 1 − p0, and with the corresponding z-values having density either f0(z) or f1(z),

p_0 = \Pr\{\text{null}\}, \quad f_0(z) = \text{density if null}; \qquad p_1 = \Pr\{\text{non-null}\}, \quad f_1(z) = \text{density if non-null}. \qquad (4.2)

Let F0 and F1 be the right-sided cdfs of f0 and f1, and F the mixture cdf

F(x) = p_0 F_0(x) + p_1 F_1(x). \qquad (4.3)

The probability of a case being null given that z exceeds x is

\mathrm{Fdr}(x) \equiv \Pr\{\text{null} \mid z \ge x\} = p_0 F_0(x) \big/ F(x), \qquad (4.4)

according to Bayes' theorem, "Fdr" standing for false discovery rate.

If p0 and F0 are known then Fdr has the obvious estimate

\widehat{\mathrm{Fdr}}(x) = p_0 F_0(x) \big/ \hat{F}(x), \qquad (4.5)

with F̂ as in (2.1). Benjamini and Hochberg's celebrated 1995 algorithm uses Fdr̂(x) for simultaneous hypothesis testing, but it can also be thought of as an empirical Bayes estimate of the Bayesian probability Fdr(x). The bottom row of Table 1 shows Fdr̂(x) for the leukemia data, taking p0 = .93 and F0 ~ 𝒩(.09, 1.68²) as in Figure 1. (Later we will do a more ambitious calculation taking into account the estimation of p0 and F0.)

The coefficient of variation for Fdr̂(x) approximately equals that for F̂(x) (when p0F0(x) is known in (4.5)). At x = 5 we have Fdr̂(5) = .15, with coefficient of variation about .19. An Fdr̂ of .15 might be considered small enough to trigger significance in the Benjamini–Hochberg algorithm, but in any case it seems clear that the probability of being null is quite low for the 71 genes having zi above 5. Even taking account of correlation effects, we have a rough upper confidence limit of .21 (i.e., .15 · (1 + 2 · .19)) for Fdr(5).

Next we consider accuracy estimates for a general class of statistics Q(y), where Q is a q-dimensional function of the count vector y (2.3). As in Section 5 of Efron (2007b), we assume that a small change dy in the count vector (considered as varying continuously) produces change dQ in Q according to

dQ = \hat{D}\, dy \qquad \left[ \hat{D}_{jk} = \partial Q_j / \partial y_k \right]. \qquad (4.6)

If cov̂(y) is a covariance estimate for y, obtained perhaps as in (2.12), (3.8), or (3.11), then the usual delta-method estimate for cov(Q) is

\widehat{\mathrm{cov}}(Q) = \hat{D}\, \widehat{\mathrm{cov}}(y)\, \hat{D}'. \qquad (4.7)

In a theoretical context, where cov(y) is known, we might instead use

\mathrm{cov}(Q) \doteq D\, \mathrm{cov}(y)\, D', \qquad (4.8)

now with the derivative matrix D evaluated at the expectation of y.

Model (4.2) yields the local false discovery rate

\mathrm{fdr}(x) \equiv \Pr\{\text{null} \mid z = x\} = p_0 f_0(x) \big/ f(x), \qquad (4.9)

f(x) being the mixture density

f(x) = p_0 f_0(x) + p_1 f_1(x); \qquad (4.10)

fdr(x) is inferentially more appropriate than Fdr(x) from a Bayesian point of view, but it is not as immediately available since it involves estimating the density f(x). However, because z-value densities are mixtures of near-normals as shown in Section 5, it is usually straightforward to carry out the estimation.

Locfdr, the algorithm discussed in Efron (2007a, 2008), estimates f(x) by means of Poisson regression of the counts yk as a spline function of the xk, the bin midpoints in (2.2)-(2.3). The structure matrix M for the Poisson regression is K × d, where K is the number of bins and d is the degrees of freedom (e.g., the number of free parameters of the spline fit; see Remark A for details). Let f̂ be the vector of fitted values f̂(xk), and ℓ̂ the vector with components ℓ̂k = log(f̂(xk)). Then, as discussed in Efron (2007b, Sect. 5), (4.6) takes the form

d\hat{\ell} = \hat{D}\, dy \qquad \text{with} \qquad \hat{D} = M \left( M'\, \mathrm{diag}(N \Delta \hat{f})\, M \right)^{-1} M', \qquad (4.11)

and we can use (4.7) or (4.8) to approximate cov(ℓ̂).
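In matrix terms (4.11) is one line of linear algebra. A sketch (ours, not the locfdr source; M would be the spline basis evaluated at the bin midpoints, supplied by whatever spline package one prefers):

```python
import numpy as np

def lhat_deriv(M, fhat, N, Delta):
    """D-hat of (4.11): derivative of the fitted log-density l-hat w.r.t. counts y.
    M is the K x d structure matrix of the Poisson spline regression."""
    W = np.diag(N * Delta * fhat)                   # Poisson weights at the fit
    return M @ np.linalg.solve(M.T @ W @ M, M.T)    # M (M' W M)^{-1} M'

# cov(l-hat) is then approximated by D @ cov_y @ D.T, as in (4.7)-(4.8)
```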

For any function v(x) define the vector

v = (v_1, v_2, \ldots, v_k, \ldots, v_K)' = (\ldots, v(x_k), \ldots)', \qquad (4.12)

as with f̂ and ℓ̂ above. If

\widehat{\mathrm{lfdr}}(x) \equiv \log\left( \widehat{\mathrm{fdr}}(x) \right) = \log\left( p_0 f_0(x) \right) - \log\left( \hat{f}(x) \right), \qquad (4.13)

then

\widehat{\mathrm{lfdr}} = \log(p_0) + \log(f_0) - \hat{\ell}, \qquad (4.14)

implying, if p0 and f0 are known, that

\mathrm{cov}\left( \widehat{\mathrm{lfdr}} \right) = \mathrm{cov}(\hat{\ell}) \doteq D\, \mathrm{cov}(y)\, D', \qquad (4.15)

with D = M(M′ diag(NΔf)M)−1M′ as in (4.8).

The solid curves in Figure 6 plot standard deviations for log(fdr̂(x)), obtained as square roots of the diagonal elements of cov(lfdr̂) (4.15), for model (2.24) with N = 6000 and rms correlation α equal to 0, .1, or .2; see Remark D. The horizontal axes are plotted in terms of the upper percentiles of F(x), the right end of each plot corresponding to the far right tail of the z-value distribution. For α = 0, sd{log fdr̂(x)} increases from .03 to .08 as we move from the fifth to the first percentile of F. The coefficient of variation (CV) of fdr̂(x) nearly equals sd{log fdr̂(x)}, so fdr̂(x) is quite accurately estimated for α = 0, but substantially less so for α = .2. Reducing N to 1500 doubles the standard deviation estimates for α = 0, but has less effect in the correlated situations: for α = .1, for example, the increase is only 20% at percentile .025. Simulations confirmed the correctness of these results.

Figure 6.


Solid curves show the standard deviation of log(fdr̂(x)) as a function of x at the upper percentiles of the z-value distribution for model (2.24), N = 6000 and α = 0, .1, .2. Dotted curves (green), the same for log(Fdr̂(x)) (4.5), the nonparametric Fdr estimator. Dashed curves (red), for the parametric version (4.17) of the Fdr estimator.

Intuitively it seems that fdr should be harder to estimate than Fdr, but that is not what Figure 6 shows. Let L̂k = log(F̂(xk)), with corresponding vector L̂. Then D̂ in (4.6) has

\hat{D}_{jk} = B_{jk} \big/ \left( N \hat{F}_j \right), \qquad (4.16)

with B as in (2.25), giving an estimate of cov(L̂) from (4.7) or (4.8). The same argument as in (4.13)-(4.15) shows that this also estimates cov(log Fdr̂), the log of vector (4.5), assuming p0F0(x) is known. The dotted curves in Figure 6 show the resulting standard deviations for log(Fdr̂(x)). If anything, Figure 6 suggests that fdr̂ is less variable than Fdr̂, particularly at the smaller percentiles.
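The corresponding delta-method computation for log Fdr̂ is equally short; a sketch (ours, hypothetical function name), treating p0F0 as known so that only the denominator F̂ contributes:

```python
import numpy as np

def sd_logFdr(Fhat, cov_y, N):
    """Delta-method sd of log(Fdr-hat(x_k)) via (4.16), with p0*F0 known."""
    K = len(Fhat)
    B = np.triu(np.ones((K, K)))            # B_jk = 1 for k >= j, as in (2.25)
    D = B / (N * Fhat[:, None])             # (4.16): row j divided by N * F-hat_j
    return np.sqrt(np.diag(D @ cov_y @ D.T))
```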

Here we are comparing the nonparametric estimator Fdr̂(x) (4.5) with the parametric estimator fdr̂(x). The Poisson spline estimate f̂(x) that gave fdr̂(x) can be summed to give parametric estimates of F(x) and Fdr(x), say Fdr̃(x). Straightforward calculations show that the derivative matrix for Fdr̃(x) is

\hat{D} = \hat{C}\, \hat{D}_f \qquad \text{where} \qquad \hat{C}_{jk} = B_{jk}\, \hat{f}_k \big/ \hat{F}_j, \qquad (4.17)

with B from (2.25) and D̂f equaling D̂ in (4.11). Standard deviations for log(Fdr̃), shown by the dashed curves in Figure 6, indicate about the same accuracy for Fdr̃(x) as for Fdr̂(x).

All of these calculations assumed that p0 and f0(z) (or F0(z)) in (4.2) were known. This is unrealistic in situations like the leukemia study, where there is clear evidence that a textbook 𝒩(0, 1) theoretical null distribution is too narrow. Estimating an "empirical null" distribution, such as 𝒩(.09, 1.68²) in Figure 1, is both necessary and feasible (see Efron, 2008) but can greatly increase variability, as discussed next.

Formula (4.14) becomes

\widehat{\mathrm{lfdr}} = \log(\hat{p}_0) + \log(\hat{f}_0) - \hat{\ell} \qquad (4.18)

when p0 and f0 are themselves estimated. The corresponding derivative matrix D̂ = d lfdr̂/dy in (4.6) appears as equation (5.8) in Efron (2007b), that formula applying to the central matching method for estimating p0f0(z). The second row of Table 4 shows sd{log fdr̂(x)} obtained from D̂ cov(y)D̂′ for the same situation as in the middle panel of Figure 6. Comparison with the theoretical null standard deviations (from the solid curve in the middle panel) shows that estimating the null distribution greatly increases variability.

Table 4.

Comparison of sd{log(fdr̂(x))} using the empirical null versus the theoretical null for the situation in the middle panel of Figure 6. The empirical null standard deviations are much larger, as seen also in Efron (2007b).

percentile: 0.05 0.04 0.03 0.02 0.01
sd empirical null: 0.18 0.26 0.36 0.54 0.83
sd theoretical null: 0.13 0.13 0.12 0.11 0.10
x: 1.98 2.16 2.40 2.74 3.25
fdr(x): 0.69 0.58 0.44 0.25 0.09
Fdr(x): 0.34 0.27 0.19 0.10 0.04

Here are some points to note:

  • Accuracy is worse for log(Fdr̂) than for log(fdr̂) in the top line of Table 4.

  • Accuracy is somewhat better when p0f0(z) is estimated by the MLE option in locfdr (Lemma 2 of Efron, 2007b).

  • The big empirical null standard deviations in Table 4 are at least partially misleading: some of the variability in lfdr̂(x) is "signal" rather than "noise", tracking conditional changes in the appropriate value of fdr(x). See Figure 2 of Efron (2007a) and the discussion in that paper.

Remark F of Section 6 describes a parametric bootstrap resampling scheme that avoids the Taylor series computations of (4.7), but which has not yet been carefully investigated.

5 The non-null distribution of z-values

The results of the previous sections depend on the variates zi having normal distributions (1.5). By definition, a z-value is a statistic having a 𝒩(0, 1) distribution under a null hypothesis H0 of interest (1.1): but will it still be normal under non-null conditions? This section shows that under repeated sampling the non-null distribution of z will typically have mean O(1), standard deviation 1 + O(n^{−1/2}), and non-normality Op(n^{−1}) (as measured by the magnitude of skewness and kurtosis). In other words, normality degrades more slowly than unit standard deviation as we move away from the null hypothesis.

Figure 7 illustrates the phenomenon for the case of non-central t distributions,

z = \Phi^{-1}\left( F_\nu(t) \right), \qquad t \sim t_\nu(\delta), \qquad (5.1)

the notation indicating a non-central t variable with ν degrees of freedom and non-centrality parameter δ (not δ²), as described in Chapter 31 of Johnson and Kotz (1970). Here, as in (1.2), Fν is the cdf of a central tν distribution. The standard deviation of z decreases as |δ| increases; for δ = 5, ν = 20, z has (mean, sd) equal to (4.01, 0.71). The useful and perhaps surprising observation is that normality holds up quite well even far from the null case δ = 0. We tacitly used this fact to justify application of our theoretical results to the leukemia study.

Figure 7.


Density of the z-value statistic (5.1) when t has a noncentral t distribution with ν = 20 degrees of freedom, for non-centrality parameter δ = 0, 1, 2, 3, 4, 5. The densities are seen to be nearly normal; dashed curves are exact normal densities matched in mean and standard deviation. For δ = 5, z has (mean, sd, skew, kurt) = (4.01, .71, −.06, .08). Negative values of δ give mirror image results. Remark G of Section 6 describes the density function calculations.

To begin the theoretical development, suppose that y1, y2, … , yn are independent and identically distributed (iid) observations sampled from Fθ, a member of a one-parameter family ℱ of distributions,

\mathcal{F} = \{ F_\theta,\ \theta \in \Theta \}, \qquad (5.2)

having its moment parameters {mean, standard deviation, skewness, kurtosis}, denoted

\{ \mu_\theta,\ \sigma_\theta,\ \gamma_\theta,\ \delta_\theta \}, \qquad (5.3)

defined differentiably in θ. The results that follow are heuristic in the sense that they only demonstrate second-order Cornish–Fisher expansion properties, with no attempt to provide strict error bounds.

Under the null hypothesis H0 : θ = 0, which we can write as

H_0 : y \sim \{ \mu_0,\ \sigma_0,\ \gamma_0,\ \delta_0 \}, \qquad (5.4)

the standardized variate

Y_0 = \sqrt{n} \left( \frac{\bar{y} - \mu_0}{\sigma_0} \right) \qquad \left[ \bar{y} = \sum_{i=1}^{n} y_i \Big/ n \right], \qquad (5.5)

satisfies

H_0 : Y_0 \sim \left\{ 0,\ 1,\ \gamma_0 / \sqrt{n},\ \delta_0 / n \right\}. \qquad (5.6)

Normality can be improved to second order by means of a Cornish–Fisher transformation,

Z_0 = Y_0 - \frac{\gamma_0}{6 \sqrt{n}} \left( Y_0^2 - 1 \right), \qquad (5.7)

which reduces the skewness in (5.6) from O(n^{−1/2}) to O(n^{−1}),

H_0 : Z_0 \sim \{ 0,\ 1,\ 0,\ 0 \} + O(n^{-1}). \qquad (5.8)

See Chapter 1 of Johnson and Kotz (1970) or, for much greater detail, Section 2.2 of Hall (1992). We can interpret (5.8) as saying that Z0 is a second-order z-value,

H_0 : Z_0 \sim \mathcal{N}(0, 1) + O_p(n^{-1}), \qquad (5.9)

e.g., a test statistic giving standard normal p-values accurate to O(n^{−1}).
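The skewness reduction in (5.7)-(5.9) is easy to see by simulation. The sketch below is our own illustration (not from the paper), using exponential observations, for which μ0 = σ0 = 1 and γ0 = 2:

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(1)
n, reps = 20, 200_000
ybar = rng.exponential(1.0, (reps, n)).mean(1)      # iid exponential, gamma0 = 2
Y0 = np.sqrt(n) * (ybar - 1.0)                      # standardized variate (5.5)
Z0 = Y0 - 2.0 / (6 * np.sqrt(n)) * (Y0**2 - 1)      # Cornish-Fisher (5.7)
print(skew(Y0), skew(Z0))   # roughly gamma0/sqrt(n) = .45 versus near 0
```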

Suppose now that H0 is false, and instead H1 is true, with y1, y2, … , yn iid according to

H_1 : y \sim \{ \mu_1,\ \sigma_1,\ \gamma_1,\ \delta_1 \}, \qquad (5.10)

rather than (5.4). Setting

Y_1 = \sqrt{n} \left( \frac{\bar{y} - \mu_1}{\sigma_1} \right) \qquad \text{and} \qquad Z_1 = Y_1 - \frac{\gamma_1}{6 \sqrt{n}} \left( Y_1^2 - 1 \right) \qquad (5.11)

makes Z1 second-order normal under H1,

H_1 : Z_1 \sim \mathcal{N}(0, 1) + O_p(n^{-1}). \qquad (5.12)

We wish to calculate the distribution of Z0 (5.7) under H1. Define

c = \sigma_1 / \sigma_0, \qquad d = \sqrt{n}\, (\mu_1 - \mu_0) / \sigma_0, \qquad \text{and} \qquad g_0 = \gamma_0 / (6 \sqrt{n}). \qquad (5.13)

Some simple algebra yields the following relationship between Z0 and Z1.

Lemma 3

Under definitions (5.7), (5.11) and (5.13),

Z_0 = M + S Z_1 + g_0 \left\{ \left( \frac{\gamma_1}{\gamma_0} S - c^2 \right) \left( Y_1^2 - 1 \right) + \left( 1 - c^2 \right) \right\}, \qquad (5.14)

where

M = d (1 - d g_0) \qquad \text{and} \qquad S = c (1 - 2 d g_0). \qquad (5.15)

The asymptotic relationships claimed at the start of this section are easily derived from Lemma 3. We consider a sequence of alternatives θn approaching the null hypothesis value θ0 at rate n^{−1/2},

\theta_n - \theta_0 = O(n^{-1/2}). \qquad (5.16)

The parameter d = √n(μθnμ0)/σ0 defined in (5.13) is then of order O(1), as is

M = d (1 - d g_0) = d \left( 1 - d\, \gamma_0 / (6 \sqrt{n}) \right), \qquad (5.17)

while standard Taylor series calculations give

c = 1 + \frac{\dot{\sigma}_0}{\dot{\mu}_0} \frac{d}{\sqrt{n}} + O(n^{-1}) \qquad \text{and} \qquad S = 1 + \left( \frac{\dot{\sigma}_0}{\dot{\mu}_0} - \frac{\gamma_0}{3} \right) \frac{d}{\sqrt{n}} + O(n^{-1}), \qquad (5.18)

the dot indicating differentiation with respect to θ.

Theorem 2

Under model (5.2), (5.16), and the assumptions of Lemma 3,

Z_0 \sim \mathcal{N}(M, S^2) + O_p(n^{-1}), \qquad (5.19)

with M and S as given in (5.17)-(5.18). Moreover,

\left. \frac{dS}{dM} \right|_{\theta_0} = \frac{1}{\sqrt{n}} \left( \left. \frac{d\sigma}{d\mu} \right|_{\theta_0} - \frac{\gamma_0}{3} \right) + O(n^{-1}). \qquad (5.20)

Proof. The proof of Theorem 2 uses Lemma 3, with θn playing the role of H1 in (5.14). Both 1 − c² and (γ1/γ0)S − c² are of order O(n^{−1/2}); the former from (5.18) and the latter using γ1/γ0 = 1 + (γ̇0/γ0)(θn − θ0) + O(n^{−1}). Since Y1² − 1 is Op(1), this makes the bracketed term in (5.14) Op(n^{−1/2}); multiplying by g0 = γ0/(6√n) reduces it to Op(n^{−1}), and (5.19) follows from (5.12). Differentiating M and S in (5.17)-(5.18) with respect to d verifies (5.20). ■

Theorem 2 supports our claim that, under non-null alternatives, the null hypothesis normality of Z0 degrades more slowly than its unit standard deviation, the comparison being Op(n^{−1}) versus O(n^{−1/2}).

One-parameter exponential families are an important special case of (5.2). With θ the natural parameter of ℱ and y its sufficient statistic, i.e., with densities proportional to exp{θy}g0(y), (5.20) reduces to

\left. \frac{dS}{dM} \right|_{\theta_0} = \frac{\gamma_0}{6 \sqrt{n}} + O(n^{-1}). \qquad (5.21)

The parameter γ0/(6√n) is called the acceleration in Efron (1987), interpreted as "the rate of change of standard deviation with respect to expectation on the normalized scale," which agrees with its role in (5.21).

As an example, suppose y1, y2, …, yn are iid θΓ1, with Γn indicating a standard gamma distribution with n degrees of freedom, so that each yi has density f(y) = (1/θ)exp(−y/θ) for y ≥ 0. (Equivalently, ȳ ~ θΓn/n.) This is an exponential family having skewness γ0 = 2 for any choice of θ0. An exact z-value for testing H0 : θ = θ0 is

Z_0 = \Phi^{-1}\left( G_n\left( n \bar{y} / \theta_0 \right) \right), \qquad (5.22)

where Gn is the cdf of Γn. Table 5 shows the mean, standard deviation, skewness and kurtosis of Z0 for n = 10, θ0 = 1, evaluated for several choices of the alternative θ1. The standard deviation of Z0 increases steadily with θ1; here γ0/(6√n) = .1054, matching to better than three decimal places the observed numerical derivative dS/dM. Skewness and kurtosis are both very small; in the equivalent of Figure 7, there is no visible discrepancy at all between the density curves for Z0 and their matching normal equivalents.
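Entries of this kind are easy to reproduce by simulation from (5.22); a sketch (our own check, not the paper's code), for a few values of θ1:

```python
import numpy as np
from scipy.stats import gamma, norm, skew, kurtosis

rng = np.random.default_rng(2)
n, theta0 = 10, 1.0
for theta1 in [0.5, 1.0, 2.0]:
    ybar = theta1 * rng.gamma(n, size=200_000) / n    # ybar ~ theta1 * Gamma_n / n
    Z0 = norm.ppf(gamma.cdf(n * ybar / theta0, n))    # exact z-value (5.22)
    print(theta1, Z0.mean().round(2), Z0.std().round(2),
          skew(Z0).round(2), kurtosis(Z0).round(2))   # compare with Table 5
```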

Table 5.

Gamma example, n = 10, θ0 = 1, indicating the distribution of z-value (5.22) for various non-null choices of θ. Standard deviation increases with θ1 in accordance with (5.21), while maintaining near-perfect normality for Z0.

θ1: 0.4 0.5 0.67 1 1.5 2.0 2.5
mean −2.49 −1.94 −1.19 0 1.36 2.45 3.38
stdev 0.76 0.81 0.88 1 1.15 1.27 1.38
skew −0.05 −0.04 −0.02 0 0.02 0.04 0.04
kurt 0.01 0.01 0.00 0 0.00 −0.01 −0.04

So far we have considered z-values obtained from an average ȳ of iid observations, but the results of Theorem 2 hold in greater generality. Section 5 of Efron (1987) considers one-parameter families where θ̂, an estimator of θ , has MLE-like asymptotic properties in terms of its bias, standard deviation, skewness and kurtosis,

\hat{\theta} \sim \left\{ \theta + \frac{\beta_\theta}{n},\ \frac{\sigma_\theta}{\sqrt{n}},\ \frac{\gamma_\theta}{\sqrt{n}},\ \frac{\delta_\theta}{n} \right\}. \qquad (5.23)

Letting θ̂ play the role of ȳ and μθ = θ + βθ/n in definitions (5.5)-(5.12), Lemma 3 and Theorem 2 remain true, assuming only the validity of the Cornish–Fisher transformations (5.9)-(5.12). Ignoring the bias βθ, i.e., taking Y0 = √n(θ̂ − θ0)/σ0 at (5.5), adds an O(n^{−1/2}) term to M in (5.17).

Moving beyond one-parameter families, suppose ℱ is a p-parameter exponential family, having densities proportional to exp{η1x1 + η2′x2}g0(x1, x2), where η1 and x1 are real-valued while η2 and x2 are (p − 1)-dimensional vectors, but where we are only interested in η1, not the nuisance vector η2. The conditional distribution of x1 given x2 is then a one-parameter exponential family with natural parameter η1, which puts us back in the context of Theorem 2. Remark H of Section 6 suggests a further extension where the parameter of interest "θ" can be a general real-valued function of η, not just a coordinate such as η1.

The non-central t family does not meet the conditions of Lemma 3 or Theorem 2: (5.1) is symmetric in δ around zero, causing γ0 in (5.14) to equal zero and likewise the derivative in (5.20). Nevertheless, as Figure 7 shows, it does exhibit impressive non-null normality. Table 6 displays the moment parameters of z = Φ−1(Fν(t)) (1.2), for t ~ tν(δ), ν = 20 and δ = 0, 1, 2, 3, 4, 5. The non-null normality isn't quite as good as in the gamma example of Table 5, but is still quite satisfactory for its application in Section 4.

Table 6.

Non-central t example, t ~ tν(δ) for ν = 20, δ = 0, 1, 2, 3, 4, 5; moment parameters of z = Φ−1(Fν(t)) (1.2) indicate near-normality even for δ far from 0. (Moments calculated using (6.10).)

δ: 0 1 2 3 4 5
mean 0 0.98 1.89 2.71 3.41 4.01
sd 1 0.98 0.92 0.85 0.77 0.71
skew 0 −0.07 −0.11 −0.11 −0.10 −0.07
kurt 0 0.02 0.06 0.08 0.09 0.07

Microarray studies can be more elaborate than two-sample comparisons. Suppose that in addition to the N × n expression matrix X we have measured a primary response variable yj and covariates wj1, wj2, … , wjp on each of the n subjects. Given the observed expression levels xi1, xi2, … , xin for gene i, we could calculate ti, the usual t-value for yj as a function of xij, in a linear model that includes the p covariates. Then

z_i = \Phi^{-1}\left( F_{n-p-1}(t_i) \right) \qquad (5.24)

is a z-value (1.1) under the usual Gaussian assumption, showing behavior like that in Table 6 for non-null genes.

6 Remarks

Some remarks, proofs, and details relating to the previous sections are presented here.

A. Poisson regression

The curve f̂(z) in Figure 1 is a Poisson regression fit to the counts yk, as a natural spline function of the bin centers xk. Here the xk ranged from −7.8 to 7.8 in steps of Δ = .2, while the spline had five degrees of freedom, so M in (4.11) was 79 × 6 (including the intercept column). See Section 5 of Efron (2007b).

B. Table 3

Section 3 of Efron (2008) defines the non-null counts yk(1) = (1 − fdr̂(xk)) · yk. Since, under model (4.2), 1 − fdr̂(xk) estimates the proportion of non-null z-values in bin k, yk(1) estimates the number of non-nulls. The yk(1) values are plotted below the x axis in Figure 1, determining the "left" and "right" distribution parameters in Table 3. "Center" was determined by the empirical null fit from locfdr, using the MLE method described in Section 4 of Efron (2007b). This method tends to underestimate the non-null counts near z = 0, and also the σc values for the left and right classes, but increasing them to 1.68 had little effect on the dashed curve in Figure 4.

C. Proof of Lemma 1

Let Ik(i) denote the indicator function of the event zi ∈ 𝒵k (2.2), so that the number of zi's from class 𝒞c in 𝒵k is

y_{kc} = \sum_{\mathbf{c}} I_k(i), \qquad (6.1)

the boldface subscript indicating summation over the members of 𝒞c. We first compute E{ykc yld} for bins k and l, k ≠ l, and classes c and d,

E\{ y_{kc}\, y_{ld} \} = E\left\{ \sum_{\mathbf{c}} \sum_{\mathbf{d}} I_k(i)\, I_l(j) \right\} = \Delta^2 \sum_{\mathbf{c}} \sum_{\mathbf{d}} \varphi_{\rho_{ij}}(x_{kc}, x_{ld}) \left( 1 - \chi_{ij} \right) \Big/ (\sigma_c \sigma_d), \qquad (6.2)

following notation (2.5)-(2.10), with χij the indicator function of event i = j (which can only occur if c = d). This reduces to

E\{ y_{kc}\, y_{ld} \} = N^2 \Delta^2\, p_c \left( p_d - \frac{\chi_{cd}}{N} \right) \int_{-1}^{1} \varphi_\rho(x_{kc}, x_{ld})\, g(\rho)\, d\rho \Big/ (\sigma_c \sigma_d), \qquad (6.3)

under the assumption that the same correlation distribution g(ρ) applies across all class combinations. Since yk = Σc ykc (2.3), we obtain

E\{ y_k\, y_l \} = N^2 \Delta^2 \sum_c \sum_d p_c \left( p_d - \frac{\chi_{cd}}{N} \right) \int_{-1}^{1} \varphi_\rho(x_{kc}, x_{ld})\, g(\rho)\, d\rho \Big/ (\sigma_c \sigma_d), \qquad (6.4)

the non-bold subscripts indicating summation over classes.

Subtracting

E\{ y_k \}\, E\{ y_l \} = N^2 \Delta^2 \sum_c \sum_d p_c\, p_d\, \varphi(x_{kc})\, \varphi(x_{ld}) \big/ (\sigma_c \sigma_d) \qquad (6.5)

from (6.4) results, after some rearrangement, in

\mathrm{cov}(y_k, y_l) = N^2 \Delta^2 \sum_c \sum_d \frac{\varphi(x_{kc})\, \varphi(x_{ld})}{\sigma_c \sigma_d}\, p_c \left( p_d - \frac{\chi_{cd}}{N} \right) \int_{-1}^{1} \left( \frac{\varphi_\rho(x_{kc}, x_{ld})}{\varphi(x_{kc})\, \varphi(x_{ld})} - 1 \right) g(\rho)\, d\rho \; - \; N \Delta^2 \sum_c p_c\, \frac{\varphi(x_{kc})\, \varphi(x_{lc})}{\sigma_c^2}. \qquad (6.6)

Using πkc = Δ · φ(xkc)/σc as in (2.8), expression (6.6) is seen to equal the klth element of cov(y) in Lemma 1, when k ≠ l.

The case k = l proceeds in the same way, the only difference being that NΔ pc χcd φ(xkc)/σc must be added to formula (6.3). This adds NΔ Σc pc φ(xkc)/σc to (6.6), again in agreement with cov(y) in Lemma 1.

The assumption that g(ρ) is the same across all classes can be weakened for the rms approximations (2.23) and (2.33), where we only need the second moments α2 to be the same. In fact, the class structure can disappear entirely for rms formulas, as seen in (3.12).

D. Model (2.24)

Specifications (2.24) were recentered to give overall expectation 0 in (2.4), (2.5):

(p_0, \mu_0, \sigma_0) = (.95,\ -.125,\ 1) \qquad \text{and} \qquad (p_1, \mu_1, \sigma_1) = (.05,\ 2.38,\ 1), \qquad (6.7)

these being the parameter values used in Figures 2, 3, 5 and 6. Recentering overall expectations to zero is common in practice, a consequence of the data matrix X having its column-wise means subtracted off.

The 6000 × 80 data matrices X used in the simulations for Table 2 and Figure 5 had entries xij ~ 𝒩(δij, 1) independent across columns j: δij = 0 for j ≤ 40, while for columns j > 40,

\delta_{ij} = .224\, \mu_1 \quad \text{for } i = 1, 2, \ldots, 300, \qquad \text{and} \qquad \delta_{ij} = .224\, \mu_0 \quad \text{for } i > 300; \qquad (6.8)

z-values based on the difference of means between the last and first 40 “patients” then satisfy (2.4), (6.7). The correlation distribution g(ρ) was supported on two points, 20% ρ = .20 and 80% ρ = −.05, giving α = .10.

E. Leukemia data standardization

The original entries xij of the leukemia data matrix were genetic expression levels obtained using Affymetrix oligonucleotide microarrays. For the analyses here, each column of X was replaced by its normal score values x̃ij = Φ−1((rij − .5)/7128), where rij was the rank of xij in its column. Transformations such as this reduce the disturbing effects of sensitivity differences between microarrays; see Bolstad, Irizarry, Astrand and Speed (2003).

F. A parametric bootstrap method

Section 3 of Efron (2007a) discusses a hierarchical Poisson simulation scheme that can be adapted to the more general context of this paper. Following the notation in (3.11), we first simulate a vector u,

u = N \Delta \left( \hat{f} + \frac{\hat{\sigma}_0^2}{\sqrt{2}}\, A\, \hat{f}^{(2)} \right) \qquad \text{with} \qquad A \sim \mathcal{N}(0, \hat{\alpha}^2), \qquad (6.9)

and then take y ~ Poisson(u), that is, yk ~ind Poisson(uk). The simulated y vectors can then be used to assess the variability of any function Q(y), obviating the need for the derivative matrix D̂. This amounts to a parametric bootstrap approach to accuracy estimation. It produced similar answers to (4.11) when applied to the leukemia data but seemed prone to biases in other applications. (Nonparametric bootstrapping, resampling columns of the data matrix X, can produce erratic results for the kind of large-scale accuracy problems considered in Section 4.)
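A sketch of the scheme (our own rendering of (6.9); the clip guarding against a negative intensity is our addition, not part of the paper):

```python
import numpy as np

def poisson_boot(fhat, f2hat, N, Delta, sigma0_hat, alpha_hat, B=500, seed=0):
    """Parametric bootstrap (6.9): B resampled count vectors y* ~ Poisson(u(A)).
    Inputs are grid evaluations of f-hat and its second derivative f-hat^(2)."""
    rng = np.random.default_rng(seed)
    ys = []
    for _ in range(B):
        A = rng.normal(0.0, alpha_hat)                       # dispersion variate
        u = N * Delta * (fhat + sigma0_hat**2 / np.sqrt(2) * A * f2hat)
        ys.append(rng.poisson(np.clip(u, 0.0, None)))        # guard: u must be >= 0
    return np.array(ys)   # feed each row to Q(y) to assess its variability
```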

G. z-value densities

Suppose test statistic t has possible densities {fθ(t), θ ∈ Θ}, with corresponding cdfs Fθ(t), and we wish to test H0 : θ = θ0. The z-value statistic z = Φ−1{Fθ0(t)} then has densities

g_\theta(z) = \varphi(z)\, \frac{f_\theta(t)}{f_{\theta_0}(t)}. \qquad (6.10)

The density curves in Figure 7 were obtained from (6.10), with fθ(t) the noncentral tν(θ) density, ν = 20 and θ = 0, 1, 2, 3, 4, 5.
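Formula (6.10) is straightforward to evaluate with scipy's noncentral t; the sketch below (ours, hypothetical function name) reproduces density curves like those in Figure 7:

```python
import numpy as np
from scipy.stats import nct, t, norm

def z_density(z, nu=20, delta=5.0):
    """g_theta(z) of (6.10) for the noncentral-t family, z = Phi^{-1}(F_nu(t))."""
    tval = t.ppf(norm.cdf(z), nu)                 # invert z back to the t scale
    return norm.pdf(z) * nct.pdf(tval, nu, delta) / t.pdf(tval, nu)

# e.g. z_density(np.linspace(1, 7, 100)) traces the delta = 5 curve of Figure 7
```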

H. Extensions of Theorem 2

In some circumstances, Theorem 2 can be extended to multi-parameter families ℱ = {Fη} where we wish to test θ = θ0 for θ a real-valued function of η. This is straightforward to verify in the context of Efron (1985), which includes for example Fieller's problem, and is conjectured to be true in general exponential families.

7 Summary

The paper considers studies where a large number N of cases are under investigation, N perhaps in the hundreds or thousands, each represented by its own z-value zi, and where there is the possibility of substantial correlation among the zi's. Our main result is a simple approximation formula for the accuracy of summary statistics such as the empirical cdf of the z-values or an estimated false discovery rate. The argument proceeds in five steps:

  • Exact formulas for the accuracy of correlated z-value cdfs are derived under normal distribution assumptions (Section 2).

  • Simple approximations to the exact formulas are developed in terms of the root mean square correlation of all N · (N − 1)/2 cases (Section 2, (2.23) and (2.33)).

  • Practical estimates for the approximation formulas are derived and demonstrated through simulations and application to a microarray study (Section 3 and Section 4).

  • Delta-method arguments are used to extend the cdf results to more general summary statistics (Section 3 and Section 4).

  • Under reasonable assumptions, it is shown that z scores tend to have nearly normal distributions, even in non-null situations (Section 5), justifying application of the theory to studies in which the individual variates are z-values.

Our main conclusion is that by dealing with normal variates, a practical assessment of large-scale correlation effects on statistical estimates is possible.

Acknowledgments

This work was supported in part by NIH grant 8R01 EB002784 and NSF grant DMS0505673.

Footnotes

1

It is convenient for the applications of Section 4 to deal with right-sided cdfs or survival curves instead of the usual left-sided ones in (1.2), and we will use this definition in what follows.

2

The estimate used here is a Poisson spline regression as described following (4.10).

References

  1. Bolstad B, Irizarry R, Astrand M, Speed T. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics. 2003;19:185–193. doi:10.1093/bioinformatics/19.2.185.
  2. Clarke S, Hall P. Robustness of multiple testing procedures against dependence. Ann. Statist. 2009;37:332–358.
  3. Csörgő S, Mielniczuk J. The empirical process of a short-range dependent stationary sequence under Gaussian subordination. Probab. Theory Related Fields. 1996;104:15–25.
  4. Desai K, Deller J, McCormick J. The distribution of number of false discoveries for highly correlated null hypotheses. Ann. Appl. Statist. 2009. Submitted, under review.
  5. Dudoit S, van der Laan MJ, Pollard KS. Multiple testing. I. Single-step procedures for control of general type I error rates. Stat. Appl. Genet. Mol. Biol. 2004;3, Art. 13 (electronic). doi:10.2202/1544-6115.1040.
  6. Dudoit S, Shaffer JP, Boldrick JC. Multiple hypothesis testing in microarray experiments. Statist. Sci. 2003;18:71–103.
  7. Efron B. Bootstrap confidence intervals for a class of parametric problems. Biometrika. 1985;72:45–58.
  8. Efron B. Better bootstrap confidence intervals. J. Amer. Statist. Assoc. 1987;82:171–200, with comments and a rejoinder by the author.
  9. Efron B. Correlation and large-scale simultaneous significance testing. J. Amer. Statist. Assoc. 2007a;102:93–103.
  10. Efron B. Size, power and false discovery rates. Ann. Statist. 2007b;35:1351–1377.
  11. Efron B. Microarrays, empirical Bayes and the two-groups model. Statist. Sci. 2008;23:1–22.
  12. Efron B. Are a set of microarrays independent of each other? Ann. Appl. Statist. 2009. To appear. doi:10.1214/09-AOAS236.
  13. Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science. 1999;286:531–537. doi:10.1126/science.286.5439.531.
  14. Hall P. The Bootstrap and Edgeworth Expansion. Springer Series in Statistics. Springer-Verlag; New York: 1992.
  15. Johnson NL, Kotz S. Distributions in Statistics: Continuous Univariate Distributions. Houghton Mifflin Co.; Boston, Mass.: 1970.
  16. Lancaster HO. The structure of bivariate distributions. Ann. Math. Statist. 1958;29:719–736.
  17. Owen AB. Variance of the number of false discoveries. J. R. Stat. Soc. Ser. B Stat. Methodol. 2005;67:411–426.
  18. Qiu X, Brooks A, Klebanov L, Yakovlev A. The effects of normalization on the correlation structure of microarray data. BMC Bioinformatics. 2005a;6:120. doi:10.1186/1471-2105-6-120.
  19. Qiu X, Klebanov L, Yakovlev A. Correlation between gene expression levels and limitations of the empirical Bayes methodology for finding differentially expressed genes. Stat. Appl. Genet. Mol. Biol. 2005b;4, Art. 34 (electronic). doi:10.2202/1544-6115.1157.
  20. Schwartzman A, Lin X. The effect of correlation in false discovery rate estimation. Biostatistics Working Paper Series No. 106. Harvard University; 2009.
  21. Westfall P, Young S. Resampling-Based Multiple Testing: Examples and Methods for p-Value Adjustment. Wiley-Interscience; New York, NY: 1993.
