Ranking Bias in Association Studies

Neal O Jeffries

doi:10.1159/000194979

2009 Jan 27;67(4):267–275. doi: 10.1159/000194979


PMCID: PMC2880722 PMID: 19172085

Abstract

Background

It is widely appreciated that genomewide association studies often yield overestimates of the association of a marker with disease when attention focuses upon the marker showing the strongest relationship. For example, in a case-control setting the largest (in absolute value) estimated odds ratio has been found to typically overstate the association as measured in a second, independent set of data. The most common reason given for this observation is that the choice of the most extreme test statistic is often conditional upon first observing a significant p value associated with the marker. A second, less appreciated reason is described here. Under common circumstances it is the multiple testing of many markers and subsequent focus upon those with most extreme test statistics (i.e. highly ranked results) that leads to bias in the estimated effect sizes.

Conclusions

This bias, termed ranking bias, is separate from that arising from conditioning on a significant p value and may often be a more important factor in generating bias. An analytic description of this bias, simulations demonstrating its extent, and identification of some factors leading to its exacerbation are presented.

Key Words: Estimation bias, Multiple comparisons

Introduction

As high dimensional assay technology has become more available for genomic investigations there has been a corresponding increase in methodology to limit false positive findings [1, 2]. Until recently, less attention has been given to the effect sizes of those comparisons that are determined to reach statistical significance. In many instances this may be appropriate as effect sizes may not be of great importance per se; however in other circumstances it may be important to accurately assess the magnitude of the associated effects.

As an example, there has been increased recognition that some form of replication is desirable to support a claim of association found as a single result among a large panel of genetic markers [3]. However, using estimates of effect size derived from an initial study of as many as a million markers can lead to an underpowered evaluation as the initial estimate is likely biased toward more extreme results than are true.

A common rationale given for this bias is that it results from conditioning on the initial estimate meeting p value criteria for declaring significance – often referred to as selection or truncation bias. The term ‘significance bias’ will be used here to stress the conditioning upon a significant p value. Garner [4] describes this in the context of genomewide association scans; Allison et al. [5] describe it in the context of quantitative trait loci of paired siblings. This conditioning on a significant result has the effect of truncating the distribution of the marker's sampling distribution and creating an associated bias. For example, given a normally distributed variable X with mean μ and standard deviation ν, one can show the expected value of X conditional upon X > c (where c may be chosen to generate a significant p value) is given by

E [X | X > c] = μ + \frac{ν^{2} \cdot φ (c; μ, ν^{2})}{Prob [X > c]}

(1.1)

where φ(·) denotes the probability density function of a normally distributed variable with mean μ and standard deviation ν. The second term on the right hand side of (1.1) captures the bias from conditioning. This description of bias does not recognize the effect multiple testing may exert on creating bias beyond its role in requiring a p value threshold less than 0.05. In this paper multiple testing and focusing on extreme results (i.e. highly ranked results) are shown to generate overestimation in situations where significance bias may have little effect. The terms ‘significance bias’ and ‘ranking bias’ are not in common use – the phrases ‘truncation bias’ and ‘selection bias’ are used more often but are insufficiently specific and fail to draw the distinction made in this paper.
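As a minimal sketch of equation (1.1), the following Python fragment evaluates the conditional mean analytically and compares it with a crude Monte Carlo estimate. The function names, the seed, and the example values μ = 0, ν = 1, c = 1 are illustrative assumptions and are not taken from the paper.

```python
import random
from statistics import NormalDist


def truncated_mean(mu, nu, c):
    """E[X | X > c] for X ~ Normal(mu, nu^2), following equation (1.1)."""
    dist = NormalDist(mu, nu)
    density_at_c = dist.pdf(c)        # phi(c; mu, nu^2)
    tail_prob = 1.0 - dist.cdf(c)     # Prob[X > c]
    return mu + nu ** 2 * density_at_c / tail_prob


def simulated_truncated_mean(mu, nu, c, draws=200_000, seed=1):
    """Monte Carlo check: average only those draws that exceed c."""
    rng = random.Random(seed)
    kept = [x for x in (rng.gauss(mu, nu) for _ in range(draws)) if x > c]
    return sum(kept) / len(kept)


# Illustrative (assumed) values: standard normal truncated at c = 1.
analytic = truncated_mean(0.0, 1.0, 1.0)
empirical = simulated_truncated_mean(0.0, 1.0, 1.0)
print(analytic, empirical)
```

The second term returned by `truncated_mean` is the conditioning bias; it shrinks as c moves far below μ and grows as c approaches or exceeds μ.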

Though a detailed description is given in the next section the idea of ranking bias is briefly described here. Given a number of test statistics, each one is composed of a trend term (related to true underlying effect size) and some random variation. When looking at the test statistic with the largest positive observed effect size it is likely (under many circumstances) that the random variation component is positive as well – because the large, positive observed effect is the sum of a trend and the random component. This random component captures the difference between the true and observed effect size and positive values correspond to a type of overestimation bias – called ranking bias here.

Allison et al. [5] briefly describe ranking bias but focus the work on significance bias. Göring et al. [6] discuss both significance bias and ranking bias. They make an analogy between the ranking bias discussed here with that corresponding to bias in model selection procedures.

Though they discuss ranking bias their modeling and analytical results are derived solely from models of significance bias. Zöllner and Pritchard [7] use the term ‘Winner's Curse’ to describe significance bias in a single marker and present an analytical approach toward correcting this bias. Sun and Bull [8] explicitly describe significance bias and more implicitly describe ranking bias in the context of a mathematical model with a number of non-informative markers and 1 informative marker. In the present work we present more general models with a number of informative markers and the relationship between bias (both significance and ranking) and the pattern of the markers' effect sizes is explicitly explored.

Description of Bias

The modeling proposed here is relatively simple both because it is easy to see how to apply this situation to others and because it allows for a transparent partitioning of a statistic into trend and random error terms that facilitates understanding ranking bias. Let N denote the common number of cases and controls drawn in a population-based study, G denote the number of biallelic markers under consideration, G₁ the number of markers with different allele frequencies between cases and controls, and G₀ the number with no true difference between cases and controls, i.e. G = G₁ + G₀. For simplicity's sake a recessive pattern is assumed to influence the likelihood of disease for the informative markers. For each of the G markers the log odds ratio will be computed from an associated 2 × 2 table (table 1).

Table 1.

2 × 2 table for each biallelic marker

	Case	Control	Total
2 minor alleles	a	b	a + b
0 or 1 minor alleles	c	d	c + d
	N	N	2N


For the j-th marker let θ_j denote the log odds ratio and ${\hat{θ}}_{j}$ its estimate, log(ad/bc). Then

Z_{j} = \frac{{\hat{θ}}_{j} - θ_{j}}{{\hat{s e}}_{j}} \sim N (0, 1)

(2.1)

where

{\hat{s e}}_{j} = \sqrt{1 / a + 1 / c + 1 / b + 1 / d}

and ∼ N(0,1) indicates an approximate Gaussian distribution with mean 0 and variance 1. Given the observed log odds and associated test statistic

T_{j} = \frac{{\hat{θ}}_{j}}{{\hat{s e}}_{j}}

(2.2)

one can use the relation in (2.1) to obtain p values for the hypothesis that θ_j = 0 as well as to generate confidence intervals

{\hat{θ}}_{j} \pm Φ_{1 - α / 2}^{- 1} \cdot {\hat{s e}}_{j}

where 1 – α is the confidence level associated with the interval and $Φ_{1 - α / 2}^{- 1}$ is the associated percentile of a N(0,1) distribution.
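The quantities above can be computed directly from the four cell counts of table 1. The sketch below is illustrative only: the function name, the cell counts, and the two-sided p value convention are assumptions, and no correction is made for empty cells.

```python
import math
from statistics import NormalDist


def log_odds_ratio_test(a, b, c, d, alpha=0.05):
    """Wald analysis of a 2 x 2 table laid out as in table 1.

    a, b: cases and controls with 2 minor alleles
    c, d: cases and controls with 0 or 1 minor alleles
    All four counts are assumed to be positive.
    """
    theta_hat = math.log((a * d) / (b * c))          # estimated log odds ratio
    se_hat = math.sqrt(1 / a + 1 / c + 1 / b + 1 / d)
    t_stat = theta_hat / se_hat                      # test statistic (2.2)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)     # N(0,1) percentile
    ci = (theta_hat - z_crit * se_hat, theta_hat + z_crit * se_hat)
    # Two-sided p value for theta = 0 (an assumed convention).
    p_value = 2 * (1 - NormalDist().cdf(abs(t_stat)))
    return theta_hat, se_hat, t_stat, ci, p_value


# Hypothetical counts with N = 500 cases and 500 controls.
theta_hat, se_hat, t_stat, ci, p_value = log_odds_ratio_test(a=144, b=84, c=356, d=416)
print(theta_hat, se_hat, t_stat, ci, p_value)
```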

To investigate how bias arises it is useful to rewrite the relationship between (2.1) and (2.2). First, write

\hat{s d} = \sqrt{\frac{1}{{\hat{p}}_{2 / Case}} + \frac{1}{1 - {\hat{p}}_{2 / Case}} + \frac{1}{{\hat{p}}_{2 / Control}} + \frac{1}{1 - {\hat{p}}_{2 / Control}}}

(2.3)

where

{\hat{p}}_{2 / Case} = a / N and {\hat{p}}_{2 / Control} = b / N

(2.4)

denote the empirically observed probability of two minor alleles in the case and control populations, respectively. The term $\hat{s d}$ is an estimate of σ, which is defined by equation (2.3) after replacing the estimated probabilities of having both minor alleles by the true unknown proportions of having both minor alleles. Then we can decompose the test statistic T_j as

T_{j} = \frac{\sqrt{N} {\hat{θ}}_{j}}{{\hat{s d}}_{j}} = \frac{\sqrt{N} ({\hat{θ}}_{j} - θ_{j})}{{\hat{s d}}_{j}} + \frac{\sqrt{N} θ_{j}}{{\hat{s d}}_{j}}

(2.5)

= Z_{j} + \frac{\sqrt{N} θ_{j}}{{\hat{s d}}_{j}} where Z_{j} = \frac{\sqrt{N} ({\hat{θ}}_{j} - θ_{j})}{{\hat{s d}}_{j}} .

(2.6)

The decomposition shows the test statistic T_j is composed of a trend term,

\frac{\sqrt{N} θ_{j}}{{\hat{s d}}_{j}}

and a mean 0 random fluctuation term, $Z_{j}$ . Note that $Z_{j}$ and T_j differ in that $Z_{j}$ has an approximately normal distribution centered about 0 while for T_j this is only true when θ_j = 0. The $Z_{j}$ term captures the difference between estimated log odds and true log odds and its distribution is of primary interest in assessing overestimation bias.
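The decomposition can be checked numerically. The sketch below uses hypothetical cell counts and an assumed true odds ratio of 2; it confirms that the standard error of (2.1) equals the estimate from (2.3) divided by √N and that the test statistic splits exactly into the random term plus the trend term.

```python
import math


def sd_hat(a, b, n):
    """Equation (2.3): a and b are the counts with 2 minor alleles
    among n cases and n controls, respectively."""
    p_case, p_control = a / n, b / n
    return math.sqrt(1 / p_case + 1 / (1 - p_case)
                     + 1 / p_control + 1 / (1 - p_control))


def decompose(a, b, n, theta_true):
    """Return (T, Z, trend) so that T = Z + trend, as in (2.5)-(2.6)."""
    c, d = n - a, n - b
    theta_hat = math.log((a * d) / (b * c))
    sd = sd_hat(a, b, n)
    t_stat = math.sqrt(n) * theta_hat / sd
    z = math.sqrt(n) * (theta_hat - theta_true) / sd
    trend = math.sqrt(n) * theta_true / sd
    return t_stat, z, trend


# Hypothetical data; the true log odds ratio is assumed known here.
n, a, b = 500, 144, 84
t_stat, z, trend = decompose(a, b, n, theta_true=math.log(2.0))
se = math.sqrt(1 / a + 1 / (n - a) + 1 / b + 1 / (n - b))
print(t_stat, z, trend, se, sd_hat(a, b, n) / math.sqrt(n))
```

In real data θ_j is unknown, so Z_j and the trend term cannot be separated; the split is only observable in simulations where the true value is fixed by design.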

For each j, $Z_{j}$ has approximately an unconditional N(0,1) distribution. A critical question concerns the distribution of $Z_{j}$ when j corresponds to the most extreme observed T_j statistic. We will focus upon the case of the largest T_j – the case for the smallest, i.e. most negative, statistic is similar and generalizations involving it and other extreme statistics are discussed below.

Let r₁,r₂, …, r_G order the T test statistics where

T_{j} = \frac{\sqrt{N} {\hat{θ}}_{j}}{{\hat{s d}}_{j}} and T_{r_{1}} \leq T_{r_{2}} \leq \dots \leq T_{r_{G}} .

Typically, the distribution of

Z_{r_{G}} = \frac{\sqrt{N} ({\hat{θ}}_{r_{G}} - θ_{r_{G}})}{{\hat{s d}}_{r_{G}}}

(2.7)

is not that of a normal distribution with mean 0 – this will be demonstrated with simulations below. The reason for the bias is that if a number of markers have similar sized trend components

\frac{\sqrt{N} θ_{j}}{{\hat{s d}}_{j}}

then the random components $Z_{j}$ are likely to play a larger role in determining which marker has the maximal observed value, $T_{r_{G}}$ . Under these conditions of multiple markers with similar trend values the contribution of $Z_{j}$ for the marker with the largest $T_{j}$ value is likely to be positive. From the definition of $Z_{r_{G}}$ in (2.7) we see a positive value corresponds to overestimation of the underlying log odds ratio. The expected value of the difference between ${\hat{θ}}_{r_{G}}$ and $θ_{r_{G}}$ is what is meant by ranking bias, i.e.

Ranking Bias = E [{\hat{θ}}_{r_{G}} - θ_{r_{G}}] .

As discussed further below, it is important to note that r_G, the index of the marker with greatest observed effect size, is random and can change in repeated sampling from the relevant populations.

Simulation Studies Demonstrating Bias

To illustrate the ranking bias a simple set of simulations is presented. In this case-control study there are G = 500 markers and there are N = 500 cases and 500 controls. In these simulation studies the markers are generated as independent variables. Populations of cases and controls were generated with 480 of the markers having no differences in allele frequencies. For the remaining 20 markers the odds ratios for diseased relative to healthy were 1.05, 1.10, …, 1.95, and 2.0. The probability of having 2 minor alleles in the control group was randomly chosen for each marker to lie between 0.05 and 0.25 (i.e. the minor allele frequencies varied between 0.224 and 0.5).

Given a simulated sample of cases and controls the following steps were performed:

For each of G = 500 markers find

T_{j} = \frac{\sqrt{N} {\hat{θ}}_{j}}{{\hat{s d}}_{j}}, j ∊ {1, \dots, G} .

Because the true log odds ratios (θ_j) are known, one can calculate a realization of

Z_{r_{G}} = \frac{\sqrt{N} ({\hat{θ}}_{r_{G}} - θ_{r_{G}})}{{\hat{s d}}_{r_{G}}} .

After making many such samples (1000 simulations) the resulting distribution of $Z_{r_{G}}$ is shown along with the standard normal distribution in figure 1. Of note are the facts that the $Z_{r_{G}}$ distribution is not centered about 0 (indicating ${\hat{θ}}_{r_{G}}$ typically overestimates $θ_{r_{G}}$ ) and that the distribution appears bimodal. Also, the distribution is relatively narrow compared to the standard normal density – this is perhaps surprising given asymptotic efficiency properties of normal approximations. However, the standard normal distribution is not relevant here as it neglects to take into account that r_G, the index of the maximal T statistic, is random and changes in different samples drawn from the underlying populations. For example, in the 1000 simulations approximately 34% of the time r_G corresponded to that marker with the odds ratio of 2.0, 32% of the time r_G was associated with the 1.95 odds ratio, and 8% of the time with that associated with 1.90, and so forth. Because different markers are chosen as the simulated datasets change a mixture distribution arises as suggested by the non-unimodal character of the distribution.
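A simplified version of this experiment can be sketched in a few lines of Python. This is not the simulation used in the paper: it replaces the sampling of cases and controls by the normal approximation T_j = Z_j + trend_j, substitutes σ_j for the estimated standard deviation, and fixes the control-group probability of 2 minor alleles at an assumed common value of 0.15 rather than drawing it between 0.05 and 0.25. All function and parameter names are illustrative.

```python
import math
import random


def sigma(p_control, odds_ratio):
    """Population analogue of (2.3) for a given control probability and OR."""
    odds_case = odds_ratio * p_control / (1 - p_control)
    p_case = odds_case / (1 + odds_case)
    return math.sqrt(1 / p_case + 1 / (1 - p_case)
                     + 1 / p_control + 1 / (1 - p_control))


def simulate_ranking_bias(n=500, n_markers=500, n_sims=1000,
                          p_control=0.15, seed=0):
    """Average Z and log odds bias of the top-ranked marker.

    Assumes independent markers and exact N(0,1) fluctuation terms.
    """
    rng = random.Random(seed)
    # 480 null markers plus 20 informative ORs 1.05, 1.10, ..., 2.0.
    odds_ratios = [1.0] * (n_markers - 20) + [1.05 + 0.05 * i for i in range(20)]
    thetas = [math.log(o) for o in odds_ratios]
    sigmas = [sigma(p_control, o) for o in odds_ratios]
    trends = [math.sqrt(n) * t / s for t, s in zip(thetas, sigmas)]

    z_sum = 0.0
    bias_sum = 0.0
    for _ in range(n_sims):
        zs = [rng.gauss(0.0, 1.0) for _ in range(n_markers)]
        # Index of the marker with the largest test statistic in this sample.
        top = max(range(n_markers), key=lambda j: zs[j] + trends[j])
        z_sum += zs[top]
        # Convert Z back to the log odds ratio scale.
        bias_sum += zs[top] * sigmas[top] / math.sqrt(n)
    return z_sum / n_sims, bias_sum / n_sims


mean_z, mean_bias = simulate_ranking_bias()
print(mean_z, mean_bias)
```

Under these assumptions the average fluctuation term of the top-ranked marker is clearly positive even though each Z_j individually has mean 0, which is the pattern described for figure 1.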

Fig. 1. Distribution of $Z_{r_{G}} = \sqrt{N} ({\hat{θ}}_{r_{G}} - θ_{r_{G}}) / {\hat{s d}}_{r_{G}}$ and a standard normal distribution; maximum odds ratio of 2.0.

The observed and true log-odds values associated with the largest test statistic (i.e. ${\hat{θ}}_{r_{G}}$ and $θ_{r_{G}}$ ) for each of the 1000 simulations are shown in figure 2. The discrete vertical bands correspond to the 12 distinct values of $θ_{r_{G}}$ that were obtained: {0, log(1.45), log(1.50), …, log(2.0) = 0.69}. The degree to which the points lie above the diagonal 45 degree line indicates the amount of bias in each simulation (fig. 2).

Fig. 2. Comparison of $θ_{r_{G}}$ (X-axis) and ${\hat{θ}}_{r_{G}}$ (Y-axis) in 1000 simulations.

Ranking Bias vs. Significance Bias

Thus far there has been no reliance upon a p value threshold as a means for creating the observed bias. Truncation bias related to conditioning on a significant p value (i.e. significance bias) is a common rationale given for overestimation bias in marker studies and is an independent contributing factor. The bias demonstrated thus far is not conditioned on the largest T statistic first exceeding some threshold, e.g. one that corresponds to a p value = 0.025/G. Hence it cannot be significance bias. To clarify the difference significance bias is now further discussed.

Recent work involving significance bias focuses upon the effect in a single marker [4, 7]. In our setting we illustrate the idea using the marker with largest true effect size. Let r⁰_G designate this marker, i.e.

\frac{θ_{r_{G}^{0}}}{σ_{r_{G}^{0}}} > \frac{θ_{j}}{σ_{j}} for all j \neq r_{G}^{0}, j ∊ {1, \dots, G} .

Because r⁰_G is fixed (though unknown) it is the case that for large sample size N

Z_{r_{G}^{0}} = \frac{\sqrt{N} ({\hat{θ}}_{r_{G}^{0}} - θ_{r_{G}^{0}})}{{\hat{s d}}_{r_{G}^{0}}} \sim Normal (0, 1)

(4.1)

and hence $E [{\hat{θ}}_{r_{G}^{0}}] = θ_{r_{G}^{0}} .$ However, a bias is introduced if one conditions on $T_{r_{G}^{0}} = \sqrt{N} {\hat{θ}}_{r_{G}^{0}} / {\hat{s d}}_{r_{G}^{0}}$ exceeding a threshold value so that the associated p value is sufficiently small. For instance, if $T_{r_{G}^{0}} > 3.90$ then the associated p value will be less than 0.025/500 and in our simulations the marker would be judged as showing significant association using a Bonferroni criterion. However, using standard derivations concerning conditional expectations of normally distributed random variables one can show that

E [Z_{r_{G}^{0}} | T_{r_{G}^{0}} > 3.90] \approx E [Z_{r_{G}^{0}} | Z_{r_{G}^{0}} > 3.90 - \sqrt{N} θ_{r_{G}^{0}} / σ_{r_{G}^{0}}]

(4.2)

= \frac{φ (3.90 - \sqrt{N} θ_{r_{G}^{0}} / σ_{r_{G}^{0}})}{1 - Φ (3.90 - \sqrt{N} θ_{r_{G}^{0}} / σ_{r_{G}^{0}})}

(4.3)

where φ and Φ denote the density and cumulative distribution function of a standard normal distribution. Consequently it follows from (4.1) that

E [{\hat{θ}}_{r_{G}^{0}} | T_{r_{G}^{0}} > 3.90] = θ_{r_{G}^{0}} + significance bias

(4.4)

\approx θ_{r_{G}^{0}} + \frac{σ_{r_{G}^{0}}}{\sqrt{N}} \cdot (\frac{φ (3.90 - \sqrt{N} θ_{r_{G}^{0}} / σ_{r_{G}^{0}})}{1 - Φ (3.90 - \sqrt{N} θ_{r_{G}^{0}} / σ_{r_{G}^{0}})}) .

(4.5)

The approximation arises because ${\hat{s d}}_{r_{G}^{0}}$ has been replaced by $σ_{r_{G}^{0}}$ and the asymptotic normal distribution was used. From the result in (4.5) one can show how changes in sample size, odds ratio, allele frequency (affecting the σ term), and p value threshold change the significance bias. Such an assessment was performed by Garner [4].

However, in most circumstances r⁰_G is not known, i.e. in a study with a large number of markers and little a priori knowledge of which markers have the strongest effects only r_G is observed and the preceding development based upon r⁰_G is not really applicable in practice. What is of practical interest is $E [{\hat{θ}}_{r_{G}} | T_{r_{G}} > 3.90] .$ However, analytical evaluation of this expression is more complicated than assessing bias for r⁰_G because r_G is random. Consequently, simulations may be used to explore the following types of overestimation bias: (1) significance bias related to ${\hat{θ}}_{r_{G}^{0}} : E [{\hat{θ}}_{r_{G}^{0}} | T_{r_{G}^{0}} > 3.90] - θ_{r_{G}^{0}} ;$ (2) ranking bias related to ${\hat{θ}}_{r_{G}} : E [{\hat{θ}}_{r_{G}} - θ_{r_{G}}] ;$ and (3) the combination of ranking bias and significance bias related to ${\hat{θ}}_{r_{G}} : E [{\hat{θ}}_{r_{G}} - θ_{r_{G}} | T_{r_{G}} > 3.90] .$

Sets of simulations were performed to examine the relative size of biases across varying conditions. Each design has 500 markers with a smaller number of informative markers (G₁) ranging between 20 and 1; the results are shown in table 2. Scenario 1 corresponds to the simulations generating figure 1, the maximum true odds ratio is 2 (log odds ratio of 0.69) and the maximum effect size is given by $θ_{r_{G}^{0}} / σ_{r_{G}^{0}} = 0.69 / 3.465$ where 3.465 is the associated σ value derived from the underlying allele probabilities (the probability of 2 minor alleles is 0.1685 in the control group) and odds ratio. An estimate of σ is given by $\hat{s d}$ in equation (2.3). From equation (4.5) an analytical estimate of the significance bias applied to this particular marker is 0.072 as shown in the first column of table 2 (all biases in this section are reported on a log odds ratio scale). In the 1000 simulations the observed bias for this marker arising from conditioning on its T statistic exceeding 3.90 was 0.075 – in good agreement with the analytic result. The simulations' estimate of ranking bias related to ${\hat{θ}}_{r_{G}}$ is about 0.191 in this case – approximately 2.5 times larger than the significance bias of ${\hat{θ}}_{r_{G}^{0}}$ . As an aside it is worth noting that other markers with smaller true effect sizes will have larger significance bias, but smaller likelihood of reaching the threshold. Finally we can get an estimate of the combined effects of ranking bias and significance bias for ${\hat{θ}}_{r_{G}}$ by examining just those simulations in which $T_{r_{G}} > 3.90$. In these 1000 simulations $T_{r_{G}} > 3.90$ in 99% of the occasions so the additional significance bias induced by conditioning on this common event is essentially negligible as it raises the bias only from 0.191 to 0.192 (table 2).
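The analytic significance bias quoted for Scenario 1 can be reproduced approximately from equation (4.5). In the sketch below the function names are illustrative; the inputs (odds ratio 2, control probability 0.1685, N = 500, threshold 3.90) are the values stated in the text, and the small difference from the tabulated 0.072 is presumably due to rounding of the inputs.

```python
import math
from statistics import NormalDist


def sigma(p_control, odds_ratio):
    """Population analogue of (2.3) for a given control probability and OR."""
    odds_case = odds_ratio * p_control / (1 - p_control)
    p_case = odds_case / (1 + odds_case)
    return math.sqrt(1 / p_case + 1 / (1 - p_case)
                     + 1 / p_control + 1 / (1 - p_control))


def significance_bias(theta, sigma_value, n, threshold=3.90):
    """Bias term of (4.5): (sigma / sqrt(N)) * phi(c) / (1 - Phi(c)),
    with c = threshold - sqrt(N) * theta / sigma."""
    std = NormalDist()
    c = threshold - math.sqrt(n) * theta / sigma_value
    return sigma_value / math.sqrt(n) * std.pdf(c) / (1 - std.cdf(c))


sigma_top = sigma(p_control=0.1685, odds_ratio=2.0)
bias_n500 = significance_bias(math.log(2.0), sigma_top, n=500)
bias_n1000 = significance_bias(math.log(2.0), sigma_top, n=1000)
print(sigma_top, bias_n500, bias_n1000)
```

Doubling the sample size to N = 1000 with the same marker drives the analytic value close to the 0.002 listed for Scenario 6, because the threshold is then almost always exceeded.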

Table 2.

Ranking and significance bias on a log odds ratio scale

Scenario	${\hat{θ}}_{r_{G}^{0}}$ significance bias: $E [{\hat{θ}}_{r_{G}^{0}} | T_{r_{G}^{0}} > 3.90] - θ_{r_{G}^{0}}$	${\hat{θ}}_{r_{G}}$ ranking bias: $E [{\hat{θ}}_{r_{G}} - θ_{r_{G}}]$	${\hat{θ}}_{r_{G}}$ combined significance and ranking bias: $E [{\hat{θ}}_{r_{G}} - θ_{r_{G}} | T_{r_{G}} > 3.90]$
Scenario 1: n = 500, G₁ = 20 informative ORs {1.05, …, 2.0}	0.072	0.191	0.192
Scenario 2: n = 500, G₁ = 40 {1.025, …, 2.0}	0.072	0.230	0.230
Scenario 3: n = 500, G₁ = 5 {1.20, …, 2.0}	0.072	0.113	0.129
Scenario 4: n = 500, G₁ = 1 {2.0}	0.072	0.069	0.085
Scenario 5: n = 500, G₁ = 1 {4.0}	0.000	0.000	0.000
Scenario 6: n = 1000, G₁ = 20 {1.05, …, 2.0}	0.002	0.112	0.112
Scenario 7: n = 500, G₁ = 10 {1.04, …, 1.40}	0.330	0.448	0.471
Scenario 8: n = 500, G₁ = 10 {1.30, …, 1.30}	0.395	0.476	0.543


Next we investigate how results change when the number of informative markers changes from the design in Scenario 1. Scenario 2 has a similar structure but has 40 informative markers with odds ratios taking values 1.025, 1.05, …, 1.975, and 2.0. Because the sample size, standard deviations, and maximum odds ratio of 2 stay the same the significance bias associated with ${\hat{θ}}_{r_{G}^{0}}$ remains the same at 0.072. However, the changes in informative effect sizes induce changes in the ranking bias and combined ranking and significance bias associated with ${\hat{θ}}_{r_{G}}$ . These figures have increased to 0.230 for both bias measures. Conversely, when the number of informative markers is reduced to only 5 as in Scenario 3, with odds ratios taking values 1.20, …, 1.80, 2.0, then the two bias measures associated with ${\hat{θ}}_{r_{G}}$ are reduced to 0.113 and 0.129 though the significance bias associated with ${\hat{θ}}_{r_{G}^{0}}$ remains unchanged. These three scenarios show how bias in ${\hat{θ}}_{r_{G}}$ depends upon the distribution of all markers, not just the conditions associated with the single marker having the largest true effect size.

In Scenario 4 the more extreme case of a single informative marker with an odds ratio of 2 is considered. Here the ranking bias of ${\hat{θ}}_{r_{G}} = 0.069$ is considerably less because there is a relatively low probability that any other marker will be associated with r_G – in 92.7% of the simulations the marker with an odds ratio of 2 generated the largest T statistic. Therefore, the significance bias of ${\hat{θ}}_{r_{G}^{0}}$ and ${\hat{θ}}_{r_{G}}$ are quite comparable as shown in the first and third column. In Scenario 5 the effect size of the single informative marker is increased so that this marker produces the smallest p value in all the simulations so there is essentially no ranking bias. Further, the sample size is sufficiently large so that the associated T statistic will exceed the 3.90 cutoff with an extremely high probability so there is no detected significance bias. Scenario 6 changes Scenario 1 by increasing the sample size. As in Scenario 5 this essentially eliminates the significance bias associated with the marker having an odds ratio of 2. However, there is still considerable bias associated with ${\hat{θ}}_{r_{G}}$ indicating ranking bias is a more important factor than significance bias in this case.

In Scenario 7 more modest effect sizes are considered. Here there are 10 informative ORs between 1.04 and 1.40. In this case 10,000 simulations were considered as there was low power (7.1%) for any of the 10 markers with informative ORs to exceed the Bonferroni threshold. Because the power is low, one might expect higher significance bias and this is in fact the case. The marker with OR = 1.4 had significance bias of 0.330 – indicating the observed log-odds ratio was typically twice as high as the true effect (log(1.4) = 0.336) when the criterion was met. However, in this case the ranking bias was extensive as well with values of 0.448 (unconditional) and 0.471 (conditional upon a significant statistic). Ranking bias remains high because those conditions that create large significance bias (e.g. a large Bonferroni threshold relative to true effect size) also tend to create large ranking bias (many different markers could potentially be highest ranked, with large bias for any one marker being highest ranked). In general it is difficult to find conditions in which significance bias will appreciably exceed the rank bias – either unconditional or conditional upon the highest ranked test statistic exceeding a threshold. Indeed, as it is usually the case that ${\hat{θ}}_{r_{G}} \geq {\hat{θ}}_{r_{G}^{0}}$ and $θ_{r_{G}} \leq θ_{r_{G}^{0}}$ one would not expect significance bias, $E [{\hat{θ}}_{r_{G}^{0}} - θ_{r_{G}^{0}} | T_{r_{G}^{0}} > 3.90],$ to exceed the conditional ranking bias, $E [{\hat{θ}}_{r_{G}} - θ_{r_{G}} | T_{r_{G}} > 3.90] .$

Finally, Scenario 8 shows the case when all the informative markers have the same effect size. As described in the Discussion and Appendix sections one should expect high ranking bias and this is in fact the case.

Among the conclusions to be drawn from these comparisons of different types of bias are (1) significance bias associated with a single, fixed marker may not be particularly relevant; (2) ranking bias is distinct from significance bias; (3) ranking bias is likely the greater of the two sources when many variables are considered. Finally, as shown by comparing the first 4 scenarios in table 2 that have many common design elements, the effect sizes of all markers play a role in determining ranking bias – efforts to understand bias associated with highly ranked markers must take this into account.

Discussion

Appendix A shows that ranking bias (scaled by the marker's standard deviation and sample size) for independent markers is given approximately by

E [\frac{\sqrt{N} ({\hat{θ}}_{r_{G}} - θ_{r_{G}})}{{\hat{s d}}_{r_{G}}}] = E [Z_{r_{G}}] \approx \frac{1}{\sqrt{2 π}} \sum_{j = 1}^{G} E [e^{- M_{- j}^{2} / 2}]

(5.1)

where

M_{- j} = max_{k \neq j} T_{k} - \sqrt{N} \frac{θ_{j}}{σ_{j}} = max_{k \neq j} (Z_{k} + \sqrt{N} (\frac{θ_{k}}{σ_{k}} - \frac{θ_{j}}{σ_{j}}))

(5.2)

for j in 1, …, G. In words, M_–j represents the largest of all T_k (besides T_j) minus the trend term corresponding to the j-th marker. As discussed further in the appendix, some understanding of the bias can be deduced from relation (5.1). First, and perhaps most importantly, it is clear the bias depends in principle on the effect sizes, θ_j/σ_j, for all G of the markers. In practice, perhaps a few or even just one marker is relevant, but in principle one needs to take into account the true effect sizes of all.

As the sample size increases then each of the M_–j increases toward infinity in absolute value (with probability 1 under common conditions) so that the bias tends to zero as we expect, assuming there are some informative markers and one is more informative than the others. On the other hand, bias is considerably worse when the effect sizes, θ_j/σ_j, are the same for all j = 1, …, G as would be the case if there were no informative markers. If the markers are independent then in this case

E [Z_{r_{G}}] \approx \frac{G}{\sqrt{2 π}} E [e^{- \frac{1}{2} {(max {X_{1}, \dots, X_{G - 1}})}^{2}}]

where X_1, …, X_{G–1} denote independent standard normal variables.
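When all effect sizes are equal and the fluctuation terms are treated as independent standard normal variables, the top-ranked Z is simply the maximum of G such variables, so the expression above can be checked by simulation. The sketch below compares a direct Monte Carlo estimate of the expected maximum with the right-hand side of the formula; the choice G = 50, the number of replicates, and the seed are arbitrary assumptions made to keep the run short.

```python
import math
import random


def compare_estimates(g=50, n_sims=20000, seed=0):
    """Return (direct, formula) estimates of E[max of g N(0,1) variables].

    direct : sample mean of the maximum of g draws
    formula: g / sqrt(2 pi) * mean(exp(-0.5 * m^2)), where m is the
             maximum of g - 1 draws, as in the displayed expression
    """
    rng = random.Random(seed)
    direct_total = 0.0
    formula_total = 0.0
    for _ in range(n_sims):
        xs = [rng.gauss(0.0, 1.0) for _ in range(g)]
        direct_total += max(xs)
        m = max(xs[:-1])                 # maximum of the other g - 1 variables
        formula_total += math.exp(-0.5 * m * m)
    direct = direct_total / n_sims
    formula = g / math.sqrt(2 * math.pi) * formula_total / n_sims
    return direct, formula


direct, formula = compare_estimates()
print(direct, formula)
```

The two estimates should agree up to Monte Carlo error, and both grow slowly with G, which is why a panel of purely non-informative markers still shows substantial ranking bias.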

An important related question that underlies equation (5.1) is: given repeated samples of size N from the underlying populations, how many different markers could reasonably be selected as r_G? If one marker has an effect size that is much larger than all the rest then in repeated samples we would expect r_G to almost always correspond to that marker and no appreciable bias should result – empirically this is seen in Scenario 5 of table 2. Analytically this is demonstrated in the appendix when γ → ∞. If, on the other hand, there are a number of markers that have non-trivial probabilities of producing the largest test statistic in a given sample then in these cases the corresponding $Z_{j}$ terms play a larger role in determining which marker is chosen and we would expect a correspondingly greater degree of bias given the relationship between bias and $Z_{j}$ . One way of thinking about these factors is to see that the distribution of $Z_{r_{G}}$ is a mixture of the conditional distributions of $Z_{j}$ given r_G = j with the weights of the mixture given by the probability that r_G = j. These conditional distributions are typically not centered about 0 – in fact they are subject to a kind of truncation bias similar in form to that arising from conditioning on a significant p value, but instead arising from conditioning on r_G = j. When there are a number of markers with non-negligible Prob[r_G = j] this mixture leads to non-negligible bias for $Z_{r_{G}}$ . Appendix A provides technical details.

The results presented thus far focus only upon the marker with the largest test statistic. A natural question is how much bias may be present if attention is paid to all markers that exceed a significance threshold. This corresponds to the common strategy of following up on all markers that meet a stringent significance threshold. For each simulation under Scenario 1 in table 2 the number of markers with T scores > 3.90 was recorded and the difference between the observed $\hat{θ}$ and the true underlying θ was recorded. Table 3 provides information about all highly ranked markers that exceed the threshold. In 99% of the simulations at least one marker met the threshold requirement. The bias associated with the highest ranking marker exceeding the threshold is 0.192 when measured on a log odds scale – this is essentially the same information as reported in the first row of table 2. Again it is worth noting that this is not reporting the results for the marker with true odds ratio of 2.0, but rather the average bias for the marker with the highest observed T statistic – in 34% of the simulations this did correspond to the marker with odds ratio of 2.0, in 32% of occasions the marker with odds ratio of 1.95, etc. The other rows of table 3 provide new information: in 92.5% of the simulations the second highest ranked marker met the Bonferroni criterion and the bias for these instances was estimated as 0.143 log-odds units. Table 3 gives information for the 3rd, 4th, 5th, 6th, and 7th most highly ranked markers (no more than 7 markers ever met the criterion) and we see the bias conditional upon exceeding the threshold remains considerable. This table indicates that the ranking bias is not just a problem for the most highly ranked marker – it can be present and substantial for all those markers meeting significance criteria.

Table 3.

Bias of markers exceeding threshold under scenario 1

Marker rank	$Prob [T_{r_{j}} > 3.90]$	$E [{\hat{θ}}_{r_{j}} - θ_{r_{j}} | T_{r_{j}} > 3.90]$
1st highest ranked: ${\hat{θ}}_{r_{j}} = {\hat{θ}}_{r_{500}}$	0.990	0.192
2nd highest ranked: ${\hat{θ}}_{r_{j}} = {\hat{θ}}_{r_{499}}$	0.925	0.143
3rd highest ranked: ${\hat{θ}}_{r_{j}} = {\hat{θ}}_{r_{498}}$	0.741	0.124
4th highest ranked: ${\hat{θ}}_{r_{j}} = {\hat{θ}}_{r_{497}}$	0.438	0.128
5th highest ranked: ${\hat{θ}}_{r_{j}} = {\hat{θ}}_{r_{496}}$	0.193	0.132
6th highest ranked: ${\hat{θ}}_{r_{j}} = {\hat{θ}}_{r_{495}}$	0.054	0.131
7th highest ranked: ${\hat{θ}}_{r_{j}} = {\hat{θ}}_{r_{494}}$	0.007	0.129


Conclusions

In this report the focus has been on measuring bias associated with the largest effect size. Ranking bias is potentially present in any study with multiple tests where attention is drawn to outcomes associated with the most extreme observed effect sizes. Here we have seen that focus upon highly ranked results, rather than conditioning on significant p values, can be responsible for much of the overestimation bias. Some authors have put forth methods to correct for both ranking and significance bias. The bootstrap/cross-validation approach put forth by Sun and Bull [8] can accommodate both types of bias. A similar bootstrap approach by Jeffries [9, 10] has been employed to correct for ranking bias in microarray and diagnostic modeling contexts. Analytic approaches have been put forth by Zöllner and Pritchard [7], Ghosh et al. [11], and Zhong and Prentice [12]; however, these appear to address significance bias but not ranking bias. The more extensive dependence of ranking bias on potentially all effect sizes (as opposed to significance bias, which depends only upon one marker's effect size) complicates analytical solutions. It is likely that analytic approaches to overestimation bias in genomic studies that ignore the need to consider all markers' effect sizes are missing an important aspect of the problem, i.e. the distribution of the most extreme test statistic is a mixture distribution depending, in principle, upon all markers' parameters.

The modeling in this paper was very simple: a single stage study (i.e. no two-stage or higher stage designs) with a recessive genetic model and Wald test approach to evaluating significance. The simple approach allowed for a simple analytic description of the problem – more complicated designs are likely prone to the same types of biases and work remains for examining the role of ranking bias in these circumstances.

Appendix A

Calculation of the Bias

Here an effort is made to sketch the degree of bias that may be expected and link this magnitude to some factors such as sample size, distribution of true effect sizes, and the number of markers. Simplifying assumptions will be employed as necessary. Recall that the j-th test statistic T_j may be decomposed into a trend and random fluctuation component as

T_{j} = Z_{j} + \sqrt{N} \frac{θ_{j}}{{\hat{s d}}_{j}}

where $Z_{j} = \sqrt{N} ({\hat{θ}}_{j} - θ_{j}) / {\hat{s d}}_{j}$ and $Z_{j}$ is approximated by a standard normal distribution. Then we may calculate the expected value of $Z_{r_{G}}$ (and hence obtain a measure of the bias ${\hat{θ}}_{r_{G}} - θ_{r_{G}}$) as

E [Z_{r_{G}}] = \sum_{j = 1}^{G} E [Z_{r_{G}} | r_{G} = j] P [r_{G} = j]

(A.1)

and

E [Z_{r_{G}} | r_{G} = j] P [r_{G} = j] = (\int z_{j} f_{Z_{j} | r_{G} = j} (z_{j}) d z_{j}) P [r_{G} = j]

(A.2)

= \frac{\int z_{j} f_{Z_{j}} (z_{j}, r_{G} = j) d z_{j}}{P [r_{G} = j]} P [r_{G} = j]

(A.3)

= \int z_{j} f_{Z_{j}} (z_{j}, r_{G} = j) d z_{j}

(A.4)

where $f_{Z_{j} | r_{G} = j}$ is the conditional density of $Z_{j}$ given $r_{G} = j$ and $f_{Z_{j}} (z_{j}, r_{G} = j)$ describes the joint distribution of $Z_{j}$ and the event $r_{G} = j$.

Now r_{G} = j if and only if T_{j} > \max_{k \neq j} T_{k}

(A.5)

if and only if Z_{j} > \max_{k \neq j} (Z_{k} + \sqrt{N} (\frac{θ_{k}}{{\hat{s d}}_{k}} - \frac{θ_{j}}{{\hat{s d}}_{j}})) .

(A.6)

Here we have used the decomposition in (2.5) to show how the bias depends on the random fluctuation terms $Z_{j}$ and $Z_{k}$. To simplify, we approximate the ${\hat{s d}}_{j}$ and ${\hat{s d}}_{k}$ terms by $σ_{j}$ and $σ_{k}$ (the associated true population values). Then we obtain

E [Z_{r_{G}}] = \sum_{j = 1}^{G} \int z_{j} f_{Z_{j}} (z_{j}, r_{G} = j) d z_{j}

(A.7)

\approx \sum_{j = 1}^{G} E [Z_{j} \cdot I_{[Z_{j} > M_{- j}]}]

(A.8)

where

M_{- j} = \max_{k \neq j} (Z_{k} + \sqrt{N} (\frac{θ_{k}}{σ_{k}} - \frac{θ_{j}}{σ_{j}}))

(A.9)

and

I_{[A]} is the indicator function: I_{[A]} = 1 if event A is true, 0 otherwise .

(A.10)

To proceed further we condition the expectation on $M_{-j}$ and recall that $Z_{j}$ is marginally distributed as a standard Gaussian random variable; thus, for independent markers,

E [Z_{r_{G}}] \approx \sum_{j = 1}^{G} E [Z_{j} \cdot I_{[Z_{j} > M_{- j}]}]

(A.11)

= \sum_{j = 1}^{G} E [E [Z_{j} \cdot I_{[Z_{j} > M_{- j}]} | M_{- j}]]

(A.12)

= E [\sum_{j = 1}^{G} \int_{M_{- j}}^{\infty} z f_{Z} (z) d z]

(A.13)

Here $f_{Z}$ denotes the standard Gaussian density, and the outer expectation in (A.13) is necessary because $M_{-j}$ contains the random elements $Z_{k}$. The integral may be evaluated explicitly, giving

E [Z_{r_{G}}] \approx \frac{1}{\sqrt{2 π}} \sum_{j = 1}^{G} E [e^{\frac{- M_{- j}^{2}}{2}}] .

(A.14)
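The step from (A.13) to (A.14) can be checked directly. The worked integral below writes $f_{Z}$ for the standard Gaussian density and treats $M_{-j}$ as fixed inside the inner integral, which is an assumption consistent with the conditioning in (A.12):

```latex
\int_{M_{-j}}^{\infty} z f_{Z}(z) \, dz
  = \frac{1}{\sqrt{2\pi}} \int_{M_{-j}}^{\infty} z \, e^{-z^{2}/2} \, dz
  = \frac{1}{\sqrt{2\pi}} \left[ -e^{-z^{2}/2} \right]_{M_{-j}}^{\infty}
  = \frac{1}{\sqrt{2\pi}} \, e^{-M_{-j}^{2}/2}
```

Summing over j and taking the outer expectation over $M_{-j}$ then gives the form shown in (A.14).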

From (A.14) one sees that the bias is inversely related to the absolute values of the $M_{-j}$ terms: each summand shrinks toward zero as $|M_{-j}|$ grows. Some consequences of this derivation are as follows.
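As a rough numerical check of (A.14), the following sketch compares a direct Monte Carlo estimate of $E[Z_{r_{G}}]$ with the sum on the right-hand side of (A.14). It assumes independent markers, exactly standard normal $Z_{j}$, and known standard deviations; the function name `top_rank_fluctuation` and the input `ncp` (standing in for $\sqrt{N} θ_{j} / σ_{j}$) are illustrative choices, not part of the original analysis.

```python
import math
import random

def top_rank_fluctuation(ncp, n_rep=20000, seed=0):
    """Return (direct, approx): Monte Carlo E[Z_{r_G}] and the (A.14) sum.

    ncp[j] stands in for sqrt(N) * theta_j / sigma_j, treated as known.
    Requires at least two markers.
    """
    rng = random.Random(seed)
    G = len(ncp)
    direct = 0.0
    approx = 0.0
    for _ in range(n_rep):
        z = [rng.gauss(0.0, 1.0) for _ in range(G)]   # fluctuation terms Z_j
        t = [zj + cj for zj, cj in zip(z, ncp)]       # statistics T_j = Z_j + ncp_j
        order = sorted(range(G), key=t.__getitem__)
        top, runner_up = order[-1], order[-2]
        direct += z[top]                              # Z of the top-ranked marker
        for j in range(G):
            # M_{-j} = max_{k != j} (Z_k + ncp_k - ncp_j) = max_{k != j} T_k - ncp_j
            max_other = t[runner_up] if j == top else t[top]
            m = max_other - ncp[j]
            approx += math.exp(-0.5 * m * m)
    return direct / n_rep, approx / (n_rep * math.sqrt(2.0 * math.pi))

# Hypothetical configuration: 19 null markers and one marker with ncp = 2.
direct, approx = top_rank_fluctuation([0.0] * 19 + [2.0], n_rep=5000)
print(direct, approx)
```

Under these assumptions the two quantities should agree up to simulation error, since $Z_{j}$ is independent of $M_{-j}$.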

Consider the effect of increasing the sample size holding all else constant. We focus upon the case in which there exists a single true maximum effect size and designate that marker's index by $r_{G}^{0}$, i.e.

\frac{θ_{r_{G}^{0}}}{σ_{r_{G}^{0}}} > \frac{θ_{j}}{σ_{j}} for all j \neq r_{G}^{0} .

It is worthwhile to examine $M_{-j}$ separately for the cases $j = r_{G}^{0}$ and $j \neq r_{G}^{0}$, again assuming there are no ties for the most positive effect size. Then

\lim_{N \to \infty} M_{- j} = \lim_{N \to \infty} \max_{k \neq j} (Z_{k} + \sqrt{N} (\frac{θ_{k}}{σ_{k}} - \frac{θ_{j}}{σ_{j}}))

(A.15)

= - \infty with probability 1 if j = r_{G}^{0}

(A.16)

= \infty with probability 1 if j \neq r_{G}^{0} .

(A.17)

In either case $M_{-j}^{2} \to \infty$, so from (A.14) one sees that $\lim_{N \to \infty} E [Z_{r_{G}}] = 0$.
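The vanishing of the bias with growing sample size can be illustrated by simulation. The sketch below is a minimal illustration under the same simplifying assumptions (independent markers, standard normal $Z_{j}$, known $σ_{j}$); the 50-marker configuration and the standardized effect of 0.1 are hypothetical values chosen only for illustration.

```python
import math
import random

def mean_top_fluctuation(effects, N, n_rep=5000, seed=0):
    """Monte Carlo estimate of E[Z_{r_G}] when T_j = Z_j + sqrt(N) * effects[j].

    effects[j] stands in for the standardized effect theta_j / sigma_j.
    """
    rng = random.Random(seed)
    root_n = math.sqrt(N)
    total = 0.0
    for _ in range(n_rep):
        best_t, best_z = -math.inf, 0.0
        for e in effects:
            z = rng.gauss(0.0, 1.0)
            t = z + root_n * e
            if t > best_t:              # track the top-ranked statistic and its Z
                best_t, best_z = t, z
        total += best_z
    return total / n_rep

# Hypothetical pattern: 49 null markers and one marker with theta/sigma = 0.1.
effects = [0.0] * 49 + [0.1]
for N in (100, 1000, 10000):
    print(N, round(mean_top_fluctuation(effects, N), 3))
```

With a unique largest effect, the estimate should fall toward zero as N grows, because the truly largest marker is then almost always the top-ranked one.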

The case for expanding the differences among effect sizes is similar, at least for the simplified example below. For a given pattern of effect sizes among the G variables (again with no ties for the largest effect size), consider a new pattern of effect sizes given by multiplying each original effect by a constant γ > 0. Then if $r_{G}^{0}$ designates the marker with the most positive true effect size

\lim_{γ \to \infty} M_{- j} = \lim_{γ \to \infty} \max_{k \neq j} (Z_{k} + γ \sqrt{N} (\frac{θ_{k}}{σ_{k}} - \frac{θ_{j}}{σ_{j}}))

(A.18)

= - \infty if j = r_{G}^{0}

(A.19)

= \infty if j \neq r_{G}^{0} .

(A.20)

Consequently the same conclusion of no bias follows. If one reverses the limiting action of γ so that γ → 0 from above then

\lim_{γ \downarrow 0} M_{- j} = \max_{k \neq j} Z_{k}

(A.21)

where the $Z_{k}$ are standard normal statistics and the bias is then positive. This limit corresponds to the case in which no variables show differential expression.
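In the limit (A.21), and under the independence and normality approximations used above, $E[Z_{r_{G}}]$ reduces to the expected maximum of G independent standard normal variables. The sketch below evaluates that expectation numerically from the density of the maximum, $G \, φ(z) \, Φ(z)^{G-1}$; the function name and the integration grid are illustrative assumptions.

```python
import math

def expected_max_std_normal(G, lo=-10.0, hi=10.0, steps=20000):
    """E[max of G iid N(0,1)] via the trapezoid rule.

    Integrates z * G * phi(z) * Phi(z)**(G - 1) over [lo, hi].
    """
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        z = lo + i * h
        phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)   # normal density
        Phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))          # normal CDF
        weight = 0.5 if i in (0, steps) else 1.0                  # trapezoid end points
        total += weight * z * G * phi * Phi ** (G - 1)
    return total * h

for G in (10, 100, 500):
    print(G, round(expected_max_std_normal(G), 3))
```

The value grows slowly with G, which is one way to see why the bias is positive when no marker has a true effect.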

The case for increasing G, the number of variables, is less clear cut as it depends upon the combination of effect sizes. Empirically it seems that adding variables with effect sizes at or near the size of the largest preexisting effect sizes exaggerates the bias effects for ${\hat{θ}}_{r_{G}}$. In terms of tracking the change in the $M_{-j}$ terms as above there is more ambiguity, as some $M_{-j}^{2}$ terms will likely increase, others will decrease, and some new terms will be introduced.
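The empirical observation about adding variables can be explored with a small simulation. This is only a sketch under the simplified model (independent markers, standard normal $Z_{j}$, known $σ_{j}$); the configurations `base`, `near_top`, and `near_null` and the value 3.0 for $\sqrt{N} θ_{j} / σ_{j}$ are hypothetical.

```python
import random

def mean_top_fluctuation(ncp, n_rep=5000, seed=0):
    """Monte Carlo estimate of E[Z_{r_G}] for statistics T_j = Z_j + ncp[j]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_rep):
        best_t, best_z = float("-inf"), 0.0
        for c in ncp:
            z = rng.gauss(0.0, 1.0)
            if z + c > best_t:          # track the top-ranked statistic and its Z
                best_t, best_z = z + c, z
        total += best_z
    return total / n_rep

base = [0.0] * 49 + [3.0]       # hypothetical: 49 null markers, one with ncp 3
near_top = base + [3.0] * 5     # add five markers tied with the largest effect
near_null = base + [0.0] * 5    # add five null markers instead

for name, ncp in (("base", base), ("near_top", near_top), ("near_null", near_null)):
    print(name, round(mean_top_fluctuation(ncp), 3))
```

In this configuration the added markers near the top compete for the highest rank, so the estimate for `near_top` should exceed that for `base`; other effect-size patterns may behave differently, as noted above.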

References

1. Benjamini Y, Hochberg Y. Controlling the false discovery rate – a practical and powerful approach to multiple testing. J R Stat Soc Ser B. 1995;57:289–300.
2. Storey JD, Tibshirani R. Statistical significance for genomewide studies. Proc Natl Acad Sci USA. 2003;100:9440–9445. doi: 10.1073/pnas.1530509100.
3. Editorial. Freely associating. Nat Genet. 1999;22:1–2. doi: 10.1038/8702.
4. Garner C. Upward bias in odds ratio estimates from genome-wide association studies. Genet Epidemiol. 2007;31:288–295. doi: 10.1002/gepi.20209.
5. Allison D, Fernandez JR, Heo M, Zhu S, Etzel C, Beasley TM, Amos CI. Bias in estimates of quantitative-trait locus effect in genome scans: demonstration of the phenomenon and a method-of-moments procedure for reducing bias. Am J Hum Genet. 2002;70:575–585. doi: 10.1086/339273.
6. Göring H, Terwilliger JD, Blangero J. Large upward bias in estimation of locus-specific effects from genomewide scans. Am J Hum Genet. 2001;69:1357–1363. doi: 10.1086/324471.
7. Zöllner S, Pritchard JK. Overcoming the winner's curse: estimating penetrance parameters from case-control data. Am J Hum Genet. 2007;80:605–615. doi: 10.1086/512821.
8. Sun L, Bull SB. Reduction of selection bias in genomewide studies by resampling. Genet Epidemiol. 2005;28:352–367. doi: 10.1002/gepi.20068.
9. Jeffries N. Multiple comparisons distortions of parameter estimates. Biostatistics. 2007;8:500–504. doi: 10.1093/biostatistics/kxl025.
10. Jeffries N. Performance of a genetic algorithm for mass spectrometry proteomics. BMC Bioinformatics. 2004;5:180. doi: 10.1186/1471-2105-5-180.
11. Ghosh A, Zou F, Wright FA. Estimating odds ratios in genome scans: an approximate conditional likelihood approach. Am J Hum Genet. 2008;82:1064–1074. doi: 10.1016/j.ajhg.2008.03.002.
12. Zhong H, Prentice RL. Bias-reduced estimators and confidence intervals for odds ratios in genome-wide association studies. Biostatistics. 2008, in press.
