Skip to main content
Biostatistics (Oxford, England) logoLink to Biostatistics (Oxford, England)
. 2016 Mar 18;17(4):634–649. doi: 10.1093/biostatistics/kxw012

Accounting for selection and correlation in the analysis of two-stage genome-wide association studies

David S Robertson 1,*, A Toby Prevost 2, Jack Bowden 3
PMCID: PMC5031943  PMID: 26993061

Abstract

The problem of selection bias has long been recognized in the analysis of two-stage trials, where promising candidates are selected in stage 1 for confirmatory analysis in stage 2. To efficiently correct for bias, uniformly minimum variance conditionally unbiased estimators (UMVCUEs) have been proposed for a wide variety of trial settings, but where the population parameter estimates are assumed to be independent. We relax this assumption and derive the UMVCUE in the multivariate normal setting with an arbitrary known covariance structure. One area of application is the estimation of odds ratios (ORs) when combining a genome-wide scan with a replication study. Our framework explicitly accounts for correlated single nucleotide polymorphisms, as might occur due to linkage disequilibrium. We illustrate our approach on the measurement of the association between 11 genetic variants and the risk of Crohn's disease, as reported in Parkes and others (2007. Sequence variants in the autophagy gene IRGM and multiple other replicating loci contribute to Crohn's disease susceptibility. Nat. Gen. 39(7), 830–832.), and show that the estimated ORs can vary substantially if both selection and correlation are taken into account.

Keywords: Correlated outcomes, Genome-wide scan, Selection bias, Two-stage sample, Uniformly minimum variance conditionally unbiased estimator

1. Introduction

The two-stage “learn and confirm” strategy is widely implemented across biomedical research. For example, in the sphere of randomized clinical trials, the vast majority of adaptive trial designs follow a two-stage strategy, where design adaptations are made at the first interim analysis, with the trial then proceeding under a fixed design until its completion. See Bauer and others (2016), Bretz and others (2009), and Posch and others (2005) for a thorough review of the literature.

A key application has been in two-stage adaptive seamless trials incorporating mid-trial treatment selection (Bowden and Glimm, 2014; Kimani and others, 2013; Sampson and Sill, 2005). Typically, multiple experimental treatments are simultaneously compared with a control in stage 1, and the most promising treatment is selected for confirmatory analysis in stage 2.

In biomedical science more generally, there is an increasing focus on the identification of genetic markers predictive of phenotypic variation, in order to better understand disease etiology and to target future drug development. The design of choice is the genome-wide association study (GWAS), in which upwards of one million single nucleotide polymorphisms (SNPs) can be tested simultaneously for their association with a given phenotype. Those SNPs achieving “genome-wide” significance (typically defined as a Inline graphic-value of Inline graphic) are then validated in a replication study (Ioannidis and others, 2009).

A whole variety of formal two-stage design procedures have been proposed in this setting; see, for example, Satagopan and Elston (2003), Satagopan and others (2004, 2002), Thomas and others (2004), with an overview provided in Thomas and others (2009). In practice, the two-stage GWAS approach has been successfully applied to many disease areas, including sclerosis (Chiò and others, 2009), epilepsy (Guo and others, 2012), and breast cancer (Sapkota and others, 2012).

However, ranking and selecting candidates in a two-stage trial can induce bias in the final estimates of the parameter of interest. Indeed, as a general rule, the more aggressive the selection, the worse is the bias (Bauer and others, 2010). Specifically, a candidate has to perform “well” in stage 1 in order to proceed to stage 2, which often leads to overly optimistic estimates, or the so-called “winner's curse”. When there is heavy selection from many candidates of roughly equivalent parameters, then the selected candidate is typically based on chance variability rather than true superiority (Sill and Sampson, 2007). All this calls into question the generalizability of the trial's findings to future patients.

The need to correct for this selection bias in the analysis of two-stage trials has long been recognized. An efficient and unbiased estimator can be obtained through the technique of Rao–Blackwellization. This involves taking the unbiased stage 2 data and conditioning on a complete, sufficient statistic for the parameter in question. By the Lehmann–Scheffé theorem, this gives the uniformly minimum variance conditionally unbiased estimator (UMVCUE).

This two-stage framework was introduced by Cohen and Sackrowitz (1989), who derived the UMVCUE for the selected mean, where the stage 1 populations are independent and normally distributed. The method lends itself naturally to estimation in two-stage adaptive seamless phase II/III clinical trials (Bowden and Glimm, 2008; Kimani and others, 2013), with equivalent work carried out in biomarker research (Pepe and others, 2009; Koopmeiners and others, 2012; Robertson and others, 2015). In the GWAS setting, Bowden and Dudbridge (2009) extended the Cohen and Sackrowitz method to derive the UMVCUE for the SNP-disease log odds ratios (ORs), where either a one-sided or two-sided Inline graphic-value test is used for ranking the SNPs.

A crucial assumption in all of the above analyses is that the stage 1 population parameter estimates are independent random variables. However, this may not be a reasonable assumption to make. For example, in the GWAS setting, SNPs in the same chromosomal region may be in linkage disequilibrium (LD).

The aim of this paper is to present general theory to enable the calculation of UMVCUEs for each setting that explicitly accounts for correlation. We achieve this by deriving the UMVCUE in the multivariate normal setting with an arbitrary known covariance structure. We illustrate the utility of the new theory using an application in the GWAS setting.

Since the focus of this paper is on point estimation, our primary criterion for comparing the performance of different estimators is the magnitude of the bias. Of course, bias is not the only criterion with which to judge an estimator, especially considering the bias–variance trade-off. However, it is important to quantify the bias of an estimator, not least due to regulatory concerns. As a secondary criterion, we also compare the mean squared error (MSE) of the estimators, which takes into account the variance. In practice, the relative importance attached to the criterion of bias and MSE will vary from setting to setting.

In Section 2, we describe the general model framework and derive the form of the multivariate normal UMVCUE. We then present a short simulation study in Section 3 for the bivariate normal case. We extend our framework to estimating ORs in genome-wide scans in Section 4, and apply the resulting UMVCUE to data reported in Parkes and others (2007). We conclude with a discussion in Section 5.

2. General framework for the UMVCUE

2.1. Model description

Consider the following two-stage trial design. Suppose that we have Inline graphic correlated continuous stage 1 parameter estimates, defined by Inline graphic, for the candidate treatments, biomarkers or genetic variants Inline graphic. Let Inline graphic follow a multivariate normal distribution, defined by Inline graphic, where Inline graphic is the vector of (unknown) means and Inline graphic is the covariance matrix. We assume that Inline graphic is known and is positive definite. For notational convenience, we let Inline graphic for Inline graphic.

We denote the ordered stage 1 estimates as Inline graphic, where the ordering is simply by effect size: Inline graphic. We consider other ordering mechanisms in due course. Let Inline graphic be the stage 2 estimate of the Inline graphicth ranked candidate, and so Inline graphic. At the end of stage 2, the aim is to efficiently estimate Inline graphic for some Inline graphic.

The MLE for Inline graphic is a weighted average of the first- and second-stage data:

graphic file with name M22.gif

This estimator may be biased, because it does not take into account the first-stage selection procedures or the correlation. An unbiased estimator can easily be found by just using the stage 2 data Inline graphic. However, this estimator suffers from lower precision since we are neglecting the stage 1 data. Hence we look for an unbiased estimator that uses data from both stages.

2.2. Calculating the UMVCUE

Let Inline graphic be the event Inline graphic, which implies that Inline graphic for Inline graphic. Without loss of generality, we condition on Inline graphic for the remainder of the derivation of the UMVCUE.

To start with, consider estimating the mean of the highest ranked population, Inline graphic. For notational convenience, we let Inline graphic and Inline graphic. In Section 1 of the supplementary materials (available at Biostatistics online), we prove the following theorem for the complete and sufficient statistic for Inline graphic.

Theorem 2.1. —

The statistic Inline graphic is sufficient and complete for Inline graphic, where Inline graphic for Inline graphic.

Then, by the Lehmann–Scheffé theorem, Inline graphic is the UMVCUE for Inline graphic under Inline graphic. Hence we have the following result, as proved in Section 2 of the supplementary materials (available at Biostatistics online).

Theorem 2.2. —

The UMVCUE for Inline graphic given Inline graphic is

graphic file with name M42.gif (2.1)

where

graphic file with name M43.gif

and we define Inline graphic and Inline graphic.

Remark. —

In the independent normal setting, Cohen and Sackrowitz (1989) note that all dependency on observations Inline graphic vanished in the expression for the UMVCUE—which they interpreted as a negative result when Inline graphic. However, we see in Equation (2.1) that accounting for correlation means that all of the observations can contribute to the sets Inline graphic and Inline graphic and be implicitly used in the estimator. In fact, any of the observations Inline graphic can be used explicitly, depending on which feature in Inline graphic and Inline graphic.

Remark 2. —

It is often the case that the stage 1 estimates are asymptotically normal rather than precisely normal. For example, log ORs have an asymptotic normal distribution. In these settings, our theorems hold asymptotically.

Suppose that, instead of only selecting the single top-ranking treatment or biomarker for continuation to stage 2, we take forward the top Inline graphic from a larger group of Inline graphic. For example, in the GWAS setting, typically Inline graphic and Inline graphic. We now require the UMVCUE for the Inline graphicth best candidate out of Inline graphic and the following corollary covers this case.

Corollary 2.3. —

For a given value of Inline graphic, the statistic Inline graphic is sufficient and complete for Inline graphic, where

graphic file with name M62.gif (2.2)

for Inline graphic.

Also, the UMVCUE for Inline graphic given Inline graphic is

graphic file with name M66.gif (2.3)

where

graphic file with name M67.gif

and we define Inline graphic and Inline graphic.

Remark. —

As a simple check of our methodological development's consistency with previous results, if we set the off-diagonal terms of the covariance matrix Inline graphic equal to zero, then we recover the independent normal setting. The UMVCUE was derived in this special case by Bowden and Glimm (2008). We compare our results in Section 3 of the supplementary materials (available at Biostatistics online) and show that they are the same.

In Section 4, we show the form of the UMVCUE when ranking by Inline graphic-value, and subsequently show how to construct bootstrapped confidence intervals (CIs) in this setting.

3. Simulation study: bivariate normal case

3.1. Bivariate normal UMVCUE

We now consider the UMVCUE derived in the bivariate normal setting, where Inline graphic. Suppose we have two correlated continuous stage 1 outcome measures, defined by Inline graphic, for the candidates Inline graphic and Inline graphic. We let Inline graphic follow a bivariate normal distribution, with

graphic file with name M77.gif

where Inline graphic is the correlation coefficient between Inline graphic and Inline graphic, and we assume Inline graphic.

From Equations (2.1), the statistic Inline graphic is sufficient for Inline graphic, where

graphic file with name M84.gif

The UMVCUE depends on whether Inline graphic. Since Inline graphic if follows that, from Equation (2.1), the UMVCUE for Inline graphic given Inline graphic is

graphic file with name M89.gif (3.1)

where

graphic file with name M90.gif

Remark. —

Note that if Inline graphic, then the UMVCUE is in fact equal to the MLE Inline graphic.

3.2. Simulation results

In this subsection, we compare various estimators for the largest mean Inline graphic in the bivariate normal setting:

  • Inline graphic observed stage 2 data for the highest ranked population.

  • Inline graphic MLE for Inline graphic, which ignores selection and correlation.

  • Inline graphic UMVCUE derived by Bowden and Glimm (2008), which accounts for selection but ignores correlation. Note that this is equivalent to Inline graphic when we set Inline graphic.

  • Inline graphic UMVCUE given in Equation (3.1), which accounts for selection and correlation.

To evaluate the performance of a generic estimator for Inline graphic, say Inline graphic, we calculate the bias and MSE as defined in Posch and others (2005), Sill and Sampson (2007), and Bowden and Glimm (2008):

graphic file with name M103.gif

In our simulation study, we set Inline graphic and Inline graphic and then consider four simple scenarios (i)–(iv) with varying values of the true means Inline graphic and stage 2 variance Inline graphic:

graphic file with name M108.gif

Note that, for scenarios (i) and (ii), Inline graphic. For scenarios (iii) and (iv), the exact MSE of Inline graphic is as follows:

graphic file with name M111.gif

Figure 1 shows the mean relative bias (defined as Inline graphic) of the MLE Inline graphic and the UMVCUE Inline graphic (which ignores correlation) as a function of Inline graphic for the four scenarios. Since the mean bias of the stage 2 data Inline graphic and the UMVCUE Inline graphic are virtually identical to zero for all values of Inline graphic (by definition), we do not show the simulation results here.

Fig. 1.

Fig. 1.

Mean relative bias of the MLE (Inline graphic, blue), and the UMVCUE (Inline graphic, black) that ignores the correlation. Plots show the mean of 100 000 simulations for each value of Inline graphic. The horizontal line shows where the bias equals zero.

The plots show that the bias of both estimators is at a maximum for Inline graphic close to Inline graphic. The MLE is positively biased for Inline graphic, and in scenarios (ii) is in fact substantially biased for all values of Inline graphic. We see that Inline graphic is positively biased for Inline graphic, unbiased when Inline graphic (as would be expected), and negatively biased for Inline graphic. The bias of both estimators is worse when Inline graphic compared with Inline graphic.

Figure 2 shows the MSE of the estimators Inline graphic for the four scenarios. As would be expected, the MLE always has the lowest MSE, while Inline graphic always has the highest. We also see that there is a substantial reduction in the MSE when using the UMVCUEs Inline graphic or Inline graphic.

Fig. 2.

Fig. 2.

MSE of the MLE (blue), Inline graphic (green), Inline graphic (black), and Inline graphic (red). Plots show the mean of 100 000 simulations for each value of Inline graphic, except for Inline graphic, which is the theoretical value of the MSE.

In scenarios (i)–(iii), the MSE of Inline graphic decreases as Inline graphic approaches 0.5, and then increases again. This is to be expected given Equation (3.1), which shows that Inline graphic reduces down to the MLE when Inline graphic. In scenarios (iii) and (iv), we also see that, for Inline graphic close to 1, the MSE of Inline graphic and the MLE are very close.

In all of the scenarios, the MSE of Inline graphic is greater than that for Inline graphic when Inline graphic, equal when Inline graphic (as would be expected), and less than that for Inline graphic when Inline graphic. However, these differences in the MSE are much less than the differences between the MSE of Inline graphic and either Inline graphic or Inline graphic. In conclusion, there is little or no advantage in terms of MSE when using the estimator Inline graphic as opposed to our UMVCUE Inline graphic, which properly accounts for correlation.

4. Application to the Parkes and others' GWAS data

We now apply our methodology to data from the genome-wide association scan for Crohn's disease published in 2007 by the Wellcome Trust Case Control Consortium (2007) (WTCCC). There were 1748 cases for the disease, and 2938 controls in this cohort, which identified 12 SNPs associated with disease status at genome-wide significance.

A replication study was then reported by Parkes and others (2007) in a follow-up cohort of 1182 cases and 2024 controls, in which 12 SNPs were successfully replicated in this study. Following Bowden and Dudbridge (2009), we exclude one (rs6887695) from our illustration (as its significance in the WTCCC scan was severely reduced after data cleaning).

Table 1 shows the estimated ORs for stages 1 and 2 (i.e. the WTCCC data and the replication data), as well as the overall MLE. This is then followed by two previously described estimators that combine the data from the two stages in a way that attempts to correct for bias. The first estimator (denoted by ZP) is the corrected MLE of Zhong and Prentice (2008). The second (denoted by Inline graphic) is the estimator considered in Bowden and Dudbridge (2009), who calculated the UMVCUE assuming that all of the log ORs for the SNPs are uncorrelated.

Table 1.

The ORs (Inline graphic CIs) of 11 SNPs estimated from allele frequencies reported by Parkes and others (2007) in an initial scan (stage 1) and replication study (stage 2). SNPs are ranked in order of stage 1 statistical significance, with those on the same chromosomal region highlighted in bold

Chr SNP Stage 1 OR Stage 2 OR MLE ZP Inline graphic
5p13 rs17234657 1.55 (1.38, 1.74) 1.16 (1.00, 1.35) 1.39 1.39 1.16
5p13 rs9292777 1.38 (1.26, 1.51) 1.34 (1.20, 1.50) 1.37 1.37 1.39
10q24 rs10883365 1.27 (1.17, 1.38) 1.18 (1.05, 1.32) 1.24 1.24 1.16
18p11 rs2542151 1.35 (1.21, 1.50) 1.15 (1.00, 1.32) 1.27 1.25 1.15
5q33 rs13361189 1.51 (1.30, 1.76) 1.38 (1.15, 1.66) 1.46 1.45 1.40
3p21 rs9858542 1.26 (1.15, 1.38) 1.17 (1.04, 1.31) 1.22 1.21 1.17
5q33 rs4958847 1.35 (1.20, 1.53) 1.36 (1.17, 1.59) 1.36 1.35 1.35
5q23 rs10077785 1.29 (1.16, 1.43) 1.19 (1.05, 1.36) 1.25 1.22 1.19
1q24 rs12035082 1.22 (1.12, 1.33) 1.14 (1.02, 1.27) 1.19 1.17 1.15
21q22 rs2836754 1.22 (1.12, 1.33) 1.15 (1.03, 1.28) 1.19 1.16 1.16
1q31 rs10801047 1.38 (1.18, 1.61) 1.47 (1.22, 1.76) 1.42 1.39 1.44

As the bold entries in Table 1 indicate, two pairs of SNPs are on the same chromosomal region. The first- and second-ranked SNPs (rs17234657 and rs9292777) are both on chromosomal region 5p13, while the fifth- and seventh-ranked SNPs (rs13361189 and rs4958847) are both on chromosomal region 5q33.

It can be safely assumed that SNPs from distinct genomic regions are in linkage equilibrium. However, in an effort to assess the potential correlation between SNPs in the same genomic region, we used the data from 498 Europeans in the 1000 genomes project (1000 Genomes Project Consortium, 2012). The empirical correlation between the SNPs (encoded as 0,1,2 to indicate the number of risk increasing alleles) on chromosomal regions 5p13 and 5q33 were Inline graphic and Inline graphic respectively.

Although the populations considered by Parkes and others (2007) and the 1000 genomes project are different, these correlations do suggest LD exists between these SNPs. Hence a natural question to ask is how the OR estimates are affected by correlation. To do so, we first need to extend our framework to account for ranking by Inline graphic-value.

4.1. Calculating the UMVCUE

Consider a two-stage design with a genome-wide association scan (stage 1) followed by a replication study (stage 2). Only those SNPs that meet selection criteria (see below) in stage 1 continue on to stage 2. A common approach in the genome-wide setting is to rank the SNPs according to the statistical significance of the effects, with a Inline graphic-value threshold Inline graphic. We assume that Inline graphic of the SNPs pass Inline graphic and are then ranked.

For a one-sided test, we condition (WLOG) on the event

graphic file with name M168.gif

The UMVCUE in this case is given in Section 4 the supplementary materials (available at Biostatistics online).

For a two-sided test, we instead condition (WLOG) on the event

graphic file with name M169.gif

Then the UMVCUE for Inline graphic is as follows, with the proof found in Section 5 of the supplementary materials (available at Biostatistics online):

Theorem 4.1. —

The UMVCUE for Inline graphic given Inline graphic is

graphic file with name M173.gif (4.1)

where

graphic file with name M174.gif

and we define Inline graphic and Inline graphic.

Remark. —

The introduction of the variable Inline graphic above is for notational convenience. In particular, we re-express the intersection of sets Inline graphic as the disjoint union of Inline graphic closed intervals on the real line Inline graphic, where Inline graphic. Note that possibly Inline graphic and/or Inline graphic.

For either test, if all the off-diagonal terms of the covariance matrix are equal to zero, then we recover the independent normal setting as in Bowden and Dudbridge (2009). In Section 6 of the supplementary materials (available at Biostatistics online), we compare our estimators and show they are the same.

4.2. Results

Rather than assuming we know Inline graphic, we take each pair of SNPs that are on the same chromosomal region and calculate the UMVCUEs as the correlation between the log ORs varies between Inline graphic. More explicitly, given a pair of SNPs on the same chromosomal region which have stage 1 ranks Inline graphic and Inline graphic respectively, then we assume that the log ORs Inline graphic and Inline graphic follow a bivariate normal distribution, with correlation coefficient Inline graphic

graphic file with name M191.gif

Starting with the SNPs on chromosomal region 5p13, the upper half of Figure 3 shows the UMVCUEs for the ORs as a function of the correlation coefficient Inline graphic. The first plot for SNP rs17234657 shows that the UMVCUE can vary substantially if there is positive correlation between the SNPs. As Inline graphic increases from 0 to 0.88, the UMVCUE increases from 1.16 to 1.32, with a particularly rapid increase for Inline graphic greater than about 0.5. For Inline graphic we actually see a (small) decrease in the UMVCUE. In contrast, for negative values of Inline graphic, the UMVCUE is hardly affected.

Fig. 3.

Fig. 3.

The UMVCUEs for the ORs of the SNPs on chromosomal region 5p13 (rs17234657 and rs9292777) and 5q33 (rs13361189 and rs4958847) as a function of the correlation coefficient.

The second plot for SNP rs9292777 shows that the UMVCUE decreases as Inline graphic increases, but in absolute terms correlation has little effect. The main reason for this is because the stage 1 and stage 2 estimates for the OR were very similar (1.38 and 1.34, respectively), so we would not expect the UMVCUE to be dramatically different regardless of any correlation.

Performing the analysis for the SNPs on chromosomal region 5q33 gives the results displayed in the lower half of Figure 3. Looking at the first plot for SNP rs13361189, in absolute terms correlation has little effect on the UMVCUE. However, note the piecewise nature of the UMVCUE: it increases for Inline graphic, decreases for Inline graphic, remains constant for Inline graphic and then (rapidly) decreases again for Inline graphic. Also note that this time, there is a large discrepancy between the stage 1 and stage 2 log ORs (1.51 and 1.38, respectively). The fact that the UMVCUE hardly varies indicates that the substantial correction to the stage 1 estimate is in fact robust to correlation.

As for the second plot for SNP rs4958847, again the correlation hardly changes the value of the UMVCUE (in fact, the UMVCUE remains constant until Inline graphic). However, this should be expected since the stage 1 and stage 2 estimates for the OR were very similar (1.35 and 1.36, respectively).

Constructing CIs. After constructing a point estimate for the OR, it is natural to construct CIs as well. We use a parametric bootstrap procedure inspired by Bowden and Glimm (2008) and Pepe and others (2009), where we assume that the true log OR mean Inline graphic (the UMVCUE calculated from the observed data).

Starting with the SNPs on chromosomal region 5q33, these had the fifth and seventh stage 1 ranks. Given the observed stage 1 log ORs Inline graphic and the assumed correlation coefficient Inline graphic between Inline graphic and Inline graphic, we use the following procedure:

  1. Simulate “bootstrapped” stage 1 data Inline graphic by drawing from a bivariate normal distribution
    graphic file with name M209.gif
    conditional on Inline graphic and Inline graphic.
  2. Simulate bootstrapped stage 2 data Inline graphic from independent normal distributions Inline graphic and Inline graphic.

  3. Calculate the bootstrapped UMVCUEs Inline graphic using Inline graphic and the original observed values Inline graphic.

These steps are then repeated for a large value of Inline graphic, with the Inline graphic and Inline graphic empirical quantiles then used as the Inline graphic CIs for Inline graphic and Inline graphic. Bootstrapped CIs for the MLE, as well as the Zhong & Prentice estimator (ZP) are also then straightforwardly available.

Further details of how the variables are generated in step 1 can be found in Section 7 of the supplementary materials (available at Biostatistics online), as well as an extension to the general multivariate case.

As for the SNPs on chromosomal region 5p13, these had the first and second stage 1 ranks. The same bootstrap procedure described above can be used, except that in step 1, we condition on Inline graphic.

Figure 4 shows the bootstrapped 95% CIs for SNP rs17234657 on chromosomal region 5p13. When the correlation Inline graphic, the CI for the UMVCUE is virtually identical to that achieved by just using the stage 2 data. However, we see that, for Inline graphic the CIs become appreciably different, since the stage 2 OR ignores the correlation.

Fig. 4.

Fig. 4.

The UMVCUE for the OR of the highest ranked SNP (rs17234657) as a function of the correlation coefficient with the second-ranked SNP (rs9292777). Also plotted are the 95% bootstrapped CIs for the UMVCUE, using 10 000 bootstrapped replicates per data point. The 95% CIs for the stage 2 OR (in red) and the MLE (in black) are displayed for comparison purposes.

Table 2 compares the bootstrapped 95% CIs for the various estimators for the two highest ranked SNPs (rs17234657 and rs9292777) when the correlation Inline graphic. For the SNP rs9292777, there is a 12% reduction in the CI width (from 0.304 to 0.268) when using the UMVCUE compared to just using the stage 2 data. For both SNPs, the ZP estimator and the MLE give very similar CIs.

Table 2.

Bootstrapped Inline graphic CIs for the ORs of the two highest ranked SNPs when the correlation Inline graphic. There were Inline graphic bootstrapped replicates used

rs17234657 rs9292777
Stage 2 (1.002, 1.348) (1.243, 1.547)
UMVCUE (1.003, 1.348) (1.299, 1.566)
ZP (1.289, 1.529) (1.308, 1.487)
MLE (1.290, 1.526) (1.307, 1.485)

5. Discussion

In this paper, we present a general framework for accounting for selection and correlation in two-stage trials, with an emphasis on the GWAS setting. We achieve this by deriving the UMVCUE in the multivariate normal setting with an arbitrary known covariance structure. Our framework relaxes the assumption present in the literature that the stage 1 population parameter estimates are independent.

As the bivariate normal simulation study demonstrated, it is important to correctly account for correlation. Indeed, if the correlation coefficient is not close to zero, then the UMVCUE that ignores correlation can be substantially biased. Due to the bias–variance trade-off, the MSE of the UMVCUE is greater than that of the MLE, but is substantially less than the MSE of using the (unbiased) stage 2 data alone.

The GWAS example showed how our estimation strategy can be applied in practice. We also described how to construct CIs using a parametric bootstrap procedure. Our results demonstrate how correcting for correlation is necessary in the GWAS setting. Indeed, for the Parkes and others (2007) data, the OR estimate for the highest ranking SNP varied substantially for high positive values of the correlation coefficient.

In the GWAS setting, our framework assumes that the LD structure of the SNPs is known. In practice, we would envisage using large external datasets such as the International HapMap Project (The International HapMap Consortium, 2003) to give an estimate of SNP LD. Alternatively, the correlation could be estimated based on the stage 1 data itself if individual participant data were available, but further research is needed to give guidance as to whether doing so would induce bias into our UMVCUE and/or appreciably increase the MSE.

Even if it is not possible to have reliable estimates of the correlation between the (log) ORs, a sensitivity analysis (like the one we carried out) is still useful to probe the robustness of the reported ORs to any LD between the SNPs.

Due to the structure of the Parkes and others (2007) data, our results in Section 4.2 focused on the bivariate setting. Of course, in other GWAS studies the data may mean that it is necessary to consider multiple (i.e. Inline graphic) SNPs in a local region that are in LD. Our UMVCUE, although still tractable, would then have a much more complex dependence on Inline graphic correlation coefficients.

We note that some progress was made to account for correlated outcomes by Sill and Sampson (2007), who looked at two-stage trials with mid-trial treatment selection based on a surrogate endpoint. They derived the UMVCUE for the final endpoint when the surrogate and final endpoint follow a bivariate normal distribution. However, our framework is a more general one, and allows application to other trial settings. Indeed, as current work, we have applied our estimator to two-stage adaptive seamless phase II/III clinical trials, allowing for unequal stage 1 (and stage 2) variances as well as generalized selection rules.

One of the limitations of our work is the construction of CIs. In the bivariate case, we can simply use a parametric bootstrap procedure. As for the general multivariate case, in the supplementary materials (available at Biostatistics online) we show theoretically how to extend our method. However, it is not clear what the coverage of such CIs would be in practice. An alternative approach could be based on the work of Sampson and Sill (2005), who developed exact CIs in the independent normal setting. It may be possible to extend their analysis to correlated multivariate normal outcomes.

In the framework for our UMVCUE, we treat the stage 2 data as coming from independent univariate normal distributions, i.e. we ignore the potential stage 2 correlation structure. As current work, we are deriving efficient unbiased estimators that make use of, and fully account for, all the available correlated stage 2 outcomes. However, preliminary results appear to show that no UMVCUE exists for this case, despite the obvious appeal of making use of more data, when available.

Finally, another extension would be trials with more than two stages. Bowden and Glimm (2014) derived Rao–Blackwellized estimators for multistage drop-the-loser trials assuming independent normal outcomes, and this could be a starting point for looking at multistage trials with multiply correlated outcomes.

Supplementary material

Supplementary Material is available at http://biostatistics.oxfordjournals.org.

Funding

This work was supported by a Medical Research Council Methodology Research Fellowship (grant number MR/N501906/1 to J.B.). Funding to pay the Open Access publication charges for this article was provided by the MRC BSU, grant codes MC_UP_1302/2 and MC_UP_1302/3.

Supplementary Material

Supplementary Data

Acknowledgments

The authors would like to thank the anonymous referees for their helpful comments, and Paul Newcombe for his help in accessing the 1000 genomes project data. Conflict of Interest: None declared.

References

  1. 1000 Genomes Project Consortium. (2012). An integrated map of genetic variation from 1092 human genomes. Nature 4917422, 56–65. [DOI] [PMC free article] [PubMed] [Google Scholar]
  2. Bauer P., Bretz F., Dragalin V., König F., Wassmer G. (2016). Twenty-five years of confirmatory adaptive designs: opportunities and pitfalls. Statistics in Medicine 353, 325–347. [DOI] [PMC free article] [PubMed] [Google Scholar]
  3. Bauer P., Koenig F., Brannath W., Posch M. (2010). Selection and bias—two hostile brothers. Statistics in Medicine 291, 1–13. [DOI] [PubMed] [Google Scholar]
  4. Bowden J., Dudbridge F. (2009). Unbiased estimation of odds ratios: combining genome-wide association scans with replication studies. Genetic Epidemiology 335, 406–418. [DOI] [PMC free article] [PubMed] [Google Scholar]
  5. Bowden J., Glimm E. (2008). Unbiased estimation of selected treatment means in two-stage trials. Biometrical Journal 504, 515–527. [DOI] [PubMed] [Google Scholar]
  6. Bowden J., Glimm E. (2014). Conditionally unbiased and near unbiased estimation of the selected treatment mean for multistage drop-the-losers trials. Biometrical Journal 562, 332–349. [DOI] [PMC free article] [PubMed] [Google Scholar]
  7. Bretz F., Koenig F., Brannath W., Glimm E., Posch M. (2009). Adaptive designs for confirmatory clinical trials. Statistics in Medicine 28, 1181–1217. [DOI] [PubMed] [Google Scholar]
  8. Chiò A., Schymick J. C., Restagno G., Scholz S. W., Lombardo F., Lai S. L., Mora G., Fung H. C., Britton A., Arepalli S.. and others (2009). A two-stage genome-wide association study of sporadic amyotrophic lateral sclerosis. Human Molecular Genetics 188, 1524–1532. [DOI] [PMC free article] [PubMed] [Google Scholar]
  9. Cohen A., Sackrowitz H. B. (1989). Two stage conditionally unbiased estimators of the selected mean. Statistics & Probability Letters 83, 273–278. [Google Scholar]
  10. Guo Y., Baum L. W., Sham P. C., Wong V., Ng P. W., Lui C. H. T., Sin N. C., Tsoi T. H., Tang C. S. M, Kwan J. S. H.. and others (2012). Two-stage genome-wide association study identifies variants in CAMSAP1L1 as susceptibility loci for epilepsy in Chinese. Human Molecular Genetics 215, 1184–1189. [DOI] [PubMed] [Google Scholar]
  11. Ioannidis J. P. A., Thomas G., Daly M. J. (2009). Validating, augmenting and refining genome-wide association signals. Nature Reviews Genetics 105, 318–329. [DOI] [PMC free article] [PubMed] [Google Scholar]
  12. Kimani P. K., Todd S., Stallard N. (2013). Conditionally unbiased estimation in phase II/III clinical trials with early stopping for futility. Statistics in Medicine 3217, 2893–2910. [DOI] [PMC free article] [PubMed] [Google Scholar]
  13. Koopmeiners J. S., Feng Z., Pepe M. S. (2012). Conditional estimation after a two-stage diagnostic biomarker study that allows early termination for futility. Statistics in Medicine 315, 420–435. [DOI] [PMC free article] [PubMed] [Google Scholar]
  14. Parkes M., Barrett J. C., Prescott N. J., Tremelling M., Anderson C. A., Fisher S. A., Roberts R. G., Nimmo E. R., Cummings F. R., Soars D.. and others (2007). Sequence variants in the autophagy gene IRGM and multiple other replicating loci contribute to Crohn's disease susceptibility. Nature Genetics 397, 830–832. [DOI] [PMC free article] [PubMed] [Google Scholar]
  15. Pepe M. S., Feng Z., Longton G., Koopmeiners J. (2009). Conditional estimation of sensitivity and specificity from a phase 2 biomarker study allowing early termination for futility. Statistics in Medicine 28, 762–779. [DOI] [PMC free article] [PubMed] [Google Scholar]
  16. Posch M., Koenig F., Branson M., Brannath W., Dunger-Baldauf C., Bauer P. (2005). Testing and estimation in flexible group sequential designs with adaptive treatment selection. Statistics in Medicine 2424, 3697–3714. [DOI] [PubMed] [Google Scholar]
  17. Robertson D. S., Prevost A. T., Bowden J. (2015). Correcting for bias in the selection and validation of informative diagnostic tests. Statistics in Medicine 34, 1417–1437. [DOI] [PMC free article] [PubMed] [Google Scholar]
  18. Sampson A. R., Sill M. W. (2005). Drop-the-losers design: Normal case. Biometrical Journal 473, 257–268. [DOI] [PubMed] [Google Scholar]
  19. Sapkota Y., Robson P., Lai R., Cass C. E., Mackey J. R., Damaraju S. (2012). A two-stage association study identifies methyl-CpG-binding domain protein 2 gene polymorphisms as candidates for breast cancer susceptibility. European Journal of Human Genetics 206, 682–689. [DOI] [PMC free article] [PubMed] [Google Scholar]
  20. Satagopan J. M., Elston R. C. (2003). Optimal two-stage genotyping in population-based association studies. Genetic Epidemiology 252, 149–157. [DOI] [PMC free article] [PubMed] [Google Scholar]
  21. Satagopan J. M., Venkatraman E. S., Begg C. B. (2004). Two-stage designs for gene-disease association studies with sample size constraints. Biometrics 603, 589–597. [DOI] [PMC free article] [PubMed] [Google Scholar]
  22. Satagopan J. M., Verbel D. A., Venkatraman E. S., Offit K. E., Begg C. B. (2002). Two-stage designs for gene-disease association studies. Biometrics 581, 163–170. [DOI] [PMC free article] [PubMed] [Google Scholar]
  23. Sill M. W., Sampson A. R. (2007). Extension of a two-stage conditionally unbiased estimator of the selected population to the bivariate normal case. Communications in Statistics – Theory and Methods 364, 801–813. [Google Scholar]
  24. The International HapMap Consortium. (2003). The international hapmap project. Nature 426, 789–796. [DOI] [PubMed] [Google Scholar]
  25. Thomas D. C., Casey G., Conti D. V., Haile R. W., Lewinger J. P., Stram D. O. (2009). Methodological issues in multistage genome-wide association studies. Statistical Science 244, 414. [DOI] [PMC free article] [PubMed] [Google Scholar]
  26. Thomas D., Xie R., Gebregziabher M. (2004). Two-stage sampling designs for gene association studies. Genetic Epidemiology 274, 401–414. [DOI] [PubMed] [Google Scholar]
  27. Wellcome Trust Case Control Consortium. (2007). Genome-wide association study of 14 000 cases of seven common diseases and 3000 shared controls. Nature 447, 661–678. [DOI] [PMC free article] [PubMed] [Google Scholar]
  28. Zhong H., Prentice R. L. (2008). Bias-reduced estimators and confidence intervals for odds ratios in genome-wide association studies. Biostatistics 94, 621–634. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supplementary Data

Articles from Biostatistics (Oxford, England) are provided here courtesy of Oxford University Press

RESOURCES