Author manuscript; available in PMC: 2013 Aug 15.
Published in final edited form as: Stat Sin. 2012;22:1041–1074.

ON MODEL SELECTION STRATEGIES TO IDENTIFY GENES UNDERLYING BINARY TRAITS USING GENOME-WIDE ASSOCIATION DATA

Zheyang Wu 1, Hongyu Zhao 2
PMCID: PMC3744348  NIHMSID: NIHMS451862  PMID: 23956610

Abstract

For more fruitful discoveries of genetic variants associated with diseases in genome-wide association studies, it is important to know whether joint analysis of multiple markers is more powerful than the commonly used single-marker analysis, especially in the presence of gene-gene interactions. This article provides a statistical framework to rigorously address this question through analytical power calculations for common model search strategies to detect binary trait loci: marginal search, exhaustive search, forward search, and two-stage screening search. Our approach incorporates linkage disequilibrium, random genotypes, and correlations among score test statistics of logistic regressions. We derive analytical results under two power definitions: the power of finding all the associated markers and the power of finding at least one associated marker. We also consider two types of error controls: the discovery number control and the Bonferroni type I error rate control. After demonstrating the accuracy of our analytical results by simulations, we apply them to consider a broad genetic model space to investigate the relative performances of different model search strategies. Our analytical study provides rapid computation as well as insights into the statistical mechanism of capturing genetic signals under different genetic models including gene-gene interactions. Even though we focus on genetic association analysis, our results on the power of model selection procedures are clearly very general and applicable to other studies.

Keywords and phrases: model selection, statistical power, random predictor, genome-wide association studies, gene-gene interaction

1. Introduction

Marker-by-marker analyses in genome-wide association studies (GWAS) have unraveled many genetic variants associated with a variety of complex traits. However, current progress is still limited in two respects. First, for some diseases, such as asthma and coronary heart disease, fewer novel loci have been found than for other diseases [7]. Second, the discovered genes account for only a small proportion of genetic risk in most diseases [5]. As a result, developing more sophisticated methods to better identify genetic variants associated with diseases has become a main focus of GWAS data analysis after the initial scan by single-marker analysis.

Because of the genetic complexity of common diseases, a joint consideration of multiple markers is intuitively more informative when multiple genes and their interactions are involved in disease etiology. However, joint methods often lead to a sharp increase in the computational burden and in the stringency of statistical significance control, which can weaken their statistical power because of the large number of candidate models considered. An optimal marker selection strategy should achieve a delicate balance between computational efficiency, satisfactory statistical power, and low error rates. To identify the optimal marker selection strategy for a given GWAS, researchers need techniques to quickly evaluate possible strategies, marginal versus joint, for a variety of genetic models of interest.

There are three fundamental marker search strategies. The marginal search chooses the best fitted single-marker models separately. The exhaustive search selects the best fitted multiple-marker models from all possible combinations of predictors. The forward search looks for the preferred models conditional on the best fitted marker(s). Other marker search strategies are mostly extensions of these three methods. For example, in a marginal-exhaustive two-stage search strategy, one can first screen the marker candidates through marginal search, and then choose the best multiple-marker models within the previously selected marker set. In the literature, the power of these methods has been evaluated either by simulations or by real data analyses [6][3][10][1]. Because real data are too specific and cannot be used in experimental design, and simulations are time-consuming and offer less insight into the statistical mechanism of how genetic signals are captured, it is highly desirable to have analytical results.

The merits of analytical power calculations have been discussed, and methods for quantitative traits have been developed, in the literature [12]. Because GWAS mostly focus on disease outcomes, which are binary, there is a need to derive results for binary traits. Statistically, genetic models for quantitative traits and binary traits are quite different. The genetic model for a quantitative trait is mostly based on a linear regression model with a random error component, whereas the genetic model for a binary trait specifies the disease risk for each possible genotype. Searching for quantitative trait loci is commonly performed by fitting linear regression models, and the F-statistic is usually used to measure the model goodness-of-fit. On the other hand, searching for binary trait loci is commonly performed by fitting logistic regression models, and the log-likelihood ratio test (LRT) or a score test statistic is generally applied for model comparisons. The distribution of the F-statistic is quite different from that of the LRT or a score statistic. So there is a need to rigorously explore whether the marker search methods behave differently for binary traits than for quantitative traits.

It is also of interest to study how error control criteria affect the relative performance of different search strategies. One contribution of this article is to compare and contrast two types of controls that have been widely applied in practice: discovery number control and type I error rate control. With the former, one collects a pre-specified number of models after ordering all candidate models. With the latter, one selects models whenever their test statistics exceed the critical value based on a genome-wide type I error rate. The analytic power calculations help us to demonstrate and explain the power contrasts for each type of control. Furthermore, we take into account linkage disequilibrium (LD) between the observed markers and the causative but unobserved loci. Besides the three basic search methods (the marginal search, the exhaustive search, and the forward search), this article also considers a marginal-exhaustive two-stage search strategy, which is practically appealing because of its computational efficiency.

We have implemented our method in an R package, markerSearchPower, which provides researchers with a convenient tool to find a proper sample size in experimental design, to decide on suitable strategies in data analysis, and to increase the chance of true findings. Our statistical technique can also be used to address model selection problems for binary responses with general random predictor settings in other application areas.

The rest of this article is organized as follows. Section 2 sets up the genetic model to be studied and defines the marker search strategies and statistical power. In Section 3, we introduce the score test statistics, and derive the asymptotic distributions of the score tests and the power calculation formulas for four model selection strategies. In Section 4, the accuracy of our analytical results is demonstrated by simulations, and power comparisons among search strategies are illustrated in a large space of genetic parameters. In Section 5, we discuss the advantages and limitations of our analytical approaches, compare the results for quantitative traits and binary traits, and summarize the distinction between the performance under discovery number control and that under Bonferroni control.

2. Genetic Model and Marker Search

2.1. Model Setup

We assume that the odds of a binary trait (hereafter described as “disease”) are specified by two loci, which may or may not be directly genotyped. A genotype data set contains n independent individuals indexed by i = 1, …, n, and L candidate markers indexed by j, k = 1, …, L. A marker genotype is defined as the number of copies of its disease allele A. The random genotype value of the jth marker in the ith individual, denoted by $G_{ji}$, is

$$
G_{ji}=\begin{cases}
2 & \text{Genotype}=A_jA_j,\ \text{with probability } p_j^2,\\
1 & \text{Genotype}=A_ja_j,\ \text{with probability } 2p_j(1-p_j),\\
0 & \text{Genotype}=a_ja_j,\ \text{with probability } (1-p_j)^2,
\end{cases}
$$

where $p_j$ is the disease minor allele frequency (MAF). Considering biallelic markers, the most general way to specify the underlying genetic model is through a 3-by-3 table of disease odds. Without loss of generality, we assume the first two markers, indexed by j = 1 or 2, are the associated markers which determine the odds of disease

$$
O(g_1,g_2)=\frac{p(D\mid g_1,g_2)}{p(\bar D\mid g_1,g_2)},
$$

where $(g_1,g_2)$ is a specific combination of genotypes, and $D$ and $\bar D$ denote disease and non-disease, respectively. For example, the following are three common genetic models measuring different types of gene-gene interactions studied in the genetics literature [6]:

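$$
\text{Model 1:}\quad
\begin{array}{c|ccc}
 & A_2A_2 & A_2a_2 & a_2a_2\\ \hline
A_1A_1 & \alpha(1+\theta_1)^2(1+\theta_2)^2 & \alpha(1+\theta_1)^2(1+\theta_2) & \alpha(1+\theta_1)^2\\
A_1a_1 & \alpha(1+\theta_1)(1+\theta_2)^2 & \alpha(1+\theta_1)(1+\theta_2) & \alpha(1+\theta_1)\\
a_1a_1 & \alpha(1+\theta_2)^2 & \alpha(1+\theta_2) & \alpha
\end{array}
\tag{2.1}
$$

$$
\text{Model 2:}\quad
\begin{array}{c|ccc}
 & A_2A_2 & A_2a_2 & a_2a_2\\ \hline
A_1A_1 & \alpha(1+\theta)^4 & \alpha(1+\theta)^2 & \alpha\\
A_1a_1 & \alpha(1+\theta)^2 & \alpha(1+\theta) & \alpha\\
a_1a_1 & \alpha & \alpha & \alpha
\end{array}
\tag{2.2}
$$

$$
\text{Model 3:}\quad
\begin{array}{c|ccc}
 & A_2A_2 & A_2a_2 & a_2a_2\\ \hline
A_1A_1 & \alpha(1+\theta) & \alpha(1+\theta) & \alpha\\
A_1a_1 & \alpha(1+\theta) & \alpha(1+\theta) & \alpha\\
a_1a_1 & \alpha & \alpha & \alpha
\end{array}
\tag{2.3}
$$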

In the above models, α and θ’s are parameters for baseline and genotypic effects, respectively.
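As an illustration only, the three odds tables can be generated programmatically. The sketch below is in Python; the function name `odds_table` and the indexing convention `table[g1][g2]` (g = number of disease alleles, so the row for A1A1 is `table[2]`) are our own assumptions and are not part of the markerSearchPower package.

```python
def odds_table(model, alpha, theta1, theta2=None):
    """Return a 3x3 list O[g1][g2] of disease odds for Models (2.1)-(2.3).

    g1, g2 in {0, 1, 2} count disease alleles at loci 1 and 2.
    theta2 is only used by Model 1; it defaults to theta1.
    """
    if theta2 is None:
        theta2 = theta1
    table = [[0.0] * 3 for _ in range(3)]
    for g1 in range(3):
        for g2 in range(3):
            if model == 1:    # (2.1): each allele multiplies the odds
                mult = (1 + theta1) ** g1 * (1 + theta2) ** g2
            elif model == 2:  # (2.2): exponent is the product g1 * g2
                mult = (1 + theta1) ** (g1 * g2)
            elif model == 3:  # (2.3): raised odds once both loci carry an allele
                mult = (1 + theta1) if (g1 >= 1 and g2 >= 1) else 1.0
            else:
                raise ValueError("model must be 1, 2, or 3")
            table[g1][g2] = alpha * mult
    return table


if __name__ == "__main__":
    for row in reversed(odds_table(2, alpha=0.1, theta1=0.5)):
        print(row)  # printed from A1A1 down to a1a1, as in (2.2)
```

Writing the multipliers as functions of (g1, g2) makes the structural difference between the models explicit: Model 1 has no interaction on the odds scale beyond multiplicativity, whereas Models 2 and 3 have no effect unless both loci carry a disease allele.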

Based on the odds of disease, the genotypic disease risks are $p(D\mid g_1,g_2)=\dfrac{O(g_1,g_2)}{1+O(g_1,g_2)}$, and the joint distribution of genotypes in diseased individuals is

$$
p(g_1,g_2\mid D)=\frac{p(D\mid g_1,g_2)\,p(g_1,g_2)}{\sum_{g_1,g_2}p(D\mid g_1,g_2)\,p(g_1,g_2)}.
$$

Similarly, we can get the joint distribution of genotypes in controls, $p(g_1,g_2\mid\bar D)$. For any marker pair involving one non-associated marker j, it is clear that $p(g_1,g_j\mid D)=p(g_1\mid D)\,p(g_j)$ with $p(g_1\mid D)=\sum_{g_2}p(g_1,g_2\mid D)$. For any non-associated marker pair, we have $p(g_j,g_k\mid D)=p(g_j)\,p(g_k)$, k > j ≥ 3.
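The calculation above can be sketched numerically as follows. This is a minimal Python illustration, not the authors' implementation: the function names are hypothetical, and we assume that the two associated loci are independent in the general population, so that p(g1, g2) is the product of the single-locus genotype probabilities given in Section 2.1.

```python
def hwe(p):
    """Genotype probabilities for 0, 1, 2 disease alleles with allele frequency p."""
    return [(1 - p) ** 2, 2 * p * (1 - p), p ** 2]


def case_control_dists(odds, p1, p2):
    """Return (p(g1,g2|D), p(g1,g2|not D), p(D)) from a 3x3 odds table.

    Assumption (ours): p(g1, g2) = p(g1) * p(g2) in the population.
    """
    f1, f2 = hwe(p1), hwe(p2)
    cells = [(a, b) for a in range(3) for b in range(3)]
    joint = {(a, b): f1[a] * f2[b] for a, b in cells}
    # Genotypic risk p(D|g1,g2) = O / (1 + O).
    risk = {(a, b): odds[a][b] / (1 + odds[a][b]) for a, b in cells}
    prevalence = sum(risk[c] * joint[c] for c in cells)
    cases = [[risk[a, b] * joint[a, b] / prevalence for b in range(3)]
             for a in range(3)]
    controls = [[(1 - risk[a, b]) * joint[a, b] / (1 - prevalence)
                 for b in range(3)] for a in range(3)]
    return cases, controls, prevalence


if __name__ == "__main__":
    # Model (2.2) with alpha = 0.1 and theta = 0.5, written out by hand.
    odds = [[0.1 * 1.5 ** (a * b) for b in range(3)] for a in range(3)]
    cases, controls, prev = case_control_dists(odds, p1=0.3, p2=0.2)
    print(prev, cases[2][2], controls[2][2])
```

When the odds table is constant, the case and control distributions coincide with the population distribution, which is a convenient sanity check.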

The odds of disease can be defined through a logistic model

$$
\log\bigl(O(g_1,g_2)\bigr)=b_0+b_1g_1+b_2g_2+b_3g_1g_2. \tag{2.4}
$$

With various values of b1, b2 and b3, equation (2.4) defines a flexible genetic interaction model. For example, the genetic model in (2.2) can be rewritten as a logistic model (2.4) with b0 = log (α), b1 = b2 = 0, b3 = log (1 + θ).
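The correspondence between the odds tables and the logistic form (2.4) can be checked numerically. The following Python sketch (names and parameter values are ours, chosen only for illustration) verifies the stated mapping for Model (2.2) and, analogously, that Model (2.1) corresponds to b1 = log(1 + θ1), b2 = log(1 + θ2), b3 = 0.

```python
import math


def logistic_odds(b0, b1, b2, b3, g1, g2):
    """Odds implied by the logistic model (2.4)."""
    return math.exp(b0 + b1 * g1 + b2 * g2 + b3 * g1 * g2)


alpha, theta, theta1, theta2 = 0.05, 0.3, 0.2, 0.4  # illustrative values
genotypes = [(g1, g2) for g1 in range(3) for g2 in range(3)]

# Model (2.2): O = alpha * (1 + theta)^(g1 * g2).
gap_model2 = max(
    abs(logistic_odds(math.log(alpha), 0.0, 0.0, math.log(1 + theta), g1, g2)
        - alpha * (1 + theta) ** (g1 * g2))
    for g1, g2 in genotypes
)

# Model (2.1): O = alpha * (1 + theta1)^g1 * (1 + theta2)^g2.
gap_model1 = max(
    abs(logistic_odds(math.log(alpha), math.log(1 + theta1),
                      math.log(1 + theta2), 0.0, g1, g2)
        - alpha * (1 + theta1) ** g1 * (1 + theta2) ** g2)
    for g1, g2 in genotypes
)

if __name__ == "__main__":
    print(gap_model2, gap_model1)  # both differences are at rounding-error level
```

Model (2.3) is a threshold model and is not linear in (g1, g2, g1 g2) on the log-odds scale, so it is not exactly of the form (2.4) with additively coded genotypes.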

The genetic model can also be defined through a disease prevalence p (D) together with a 3-by-3 table of genetic relative risks

$$
GRR(g_1,g_2)=p(D\mid g_1,g_2)/p(D\mid 0,0).
$$

In this setup, it is easy to get the corresponding odds of disease through

$$
p(D\mid g_1,g_2)=GRR(g_1,g_2)\,p(D\mid 0,0)=\frac{GRR(g_1,g_2)\,p(D)}{\sum_{g_1,g_2}GRR(g_1,g_2)\,p(g_1,g_2)}.
$$
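A minimal Python sketch of this conversion is given below. The function names are hypothetical, and, as an assumption of ours, the population genotype distribution p(g1, g2) is taken to be the product of the single-locus probabilities from Section 2.1.

```python
def hwe(p):
    """Genotype probabilities for 0, 1, 2 disease alleles with allele frequency p."""
    return [(1 - p) ** 2, 2 * p * (1 - p), p ** 2]


def grr_to_risk_and_odds(grr, prevalence, p1, p2):
    """Convert a 3x3 GRR table plus prevalence p(D) into risks and odds.

    grr[g1][g2] is the risk relative to genotype (0, 0), so grr[0][0] == 1.
    """
    f1, f2 = hwe(p1), hwe(p2)
    denom = sum(grr[a][b] * f1[a] * f2[b] for a in range(3) for b in range(3))
    baseline = prevalence / denom  # p(D | 0, 0)
    risk = [[grr[a][b] * baseline for b in range(3)] for a in range(3)]
    if max(max(row) for row in risk) >= 1:
        raise ValueError("GRR table and prevalence imply a risk >= 1")
    odds = [[r / (1 - r) for r in row] for row in risk]
    return risk, odds


if __name__ == "__main__":
    grr = [[1.0, 1.0, 1.0], [1.0, 1.5, 1.5], [1.0, 1.5, 2.0]]  # illustrative
    risk, odds = grr_to_risk_and_odds(grr, prevalence=0.05, p1=0.3, p2=0.2)
    print(risk[0][0], odds[2][2])
```

By construction, the genotype-weighted average of the returned risks reproduces the specified prevalence.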

It is not unusual that the causative markers are not observed but LD exists between the causative and the genotyped markers. In this situation, the genotyped markers 1 and 2 are non-causative but associated with causative markers indexed by, say, −1 and −2. Let marker 1 be in LD with marker −1 and marker 2 be in LD with marker −2. The odds of disease at markers 1 and 2 are

$$
O=\frac{p(D\mid g_1,g_2)}{p(\bar D\mid g_1,g_2)}
=\frac{\sum_{g_{-1},g_{-2}}p(D\mid g_{-1},g_{-2})\,p(g_1\mid g_{-1})\,p(g_2\mid g_{-2})\,p(g_{-1},g_{-2})}
{\sum_{g_{-1},g_{-2}}p(\bar D\mid g_{-1},g_{-2})\,p(g_1\mid g_{-1})\,p(g_2\mid g_{-2})\,p(g_{-1},g_{-2})}. \tag{2.5}
$$

The method of calculating p (g1|g−1) and p (g2|g−2) is illustrated in the supplement, which follows the LD models defined in the literature [6].

2.2. Model Fitting and Selection Procedures

We define the above genetic models of disease susceptibility by specifying the disease odds from a prospective, data-generating point of view. In retrospective case-control studies, the data consist of a random sample of $n_1$ cases and $n_0$ controls. For a given GWAS data set, marker search requires fitting the following one-marker or two-marker logistic regression models:

$$
\operatorname{logit}\bigl(\hat P_j(Y_i=1\mid g_{ji})\bigr)=\hat\beta_{0j}+\hat\beta_{1j}g_{ji}, \tag{2.6}
$$

$$
\operatorname{logit}\bigl(\hat P_{jk}(Y_i=1\mid g_{ji},g_{ki})\bigr)=\hat\beta_{0jk}+\hat\beta_{1jk}g_{ji}+\hat\beta_{2jk}g_{ki}+\hat\beta_{3jk}g_{ji}g_{ki}, \tag{2.7}
$$

where logit(p) = log (p/(1 − p)), gji and gki are the observed genotypes of markers j and k in individual i, and Yi = 1 or 0 indicates disease or non-disease status. The marginal search method looks for the best fitted models (2.6) over all single markers. The exhaustive search method seeks the best fitted models (2.7) over all marker-pairs. The forward search method first selects the best fitted model (2.6), then picks the best two-marker models (2.7) while retaining the previously chosen marker. Extended from these approaches, a marginal-exhaustive two-stage search method applies marginal search as the first stage and the exhaustive search in the chosen set of markers as the second stage. After any search procedure, the markers contained in the selected models are treated as the putative disease-associated markers.

The significance of statistical associations, or equivalently the goodness of model fit, is assessed by either the log-likelihood ratio test (LRT) or the score test. As shown in the simulations below, the two tests perform similarly for the purpose of selecting markers when the sample size is moderately large. Because the LRT has no closed form, we use the score test statistic to calculate power.

As a model can contain none, one, or two associated markers, it is necessary to consider both stringent and relaxed criteria to decide whether any chosen model is what we are looking for. Accordingly, the following two definitions of power from the genetics literature [6][10] are considered in any selection procedure:

  (A) Power is defined as the probability of identifying the true genetic association model (in marginal search, it is defined as the probability of detecting both associated markers).

  (B) Power is defined as the probability of detecting at least one associated marker.

We study two criteria for significance level control. The first is a discovery number control, where one selects the top R most significant models. The power under this control is a generalization of detection probability (DP) [4] to the context of model selection. The second is a type I error rate control at a genome-wide significance level α, which applies the Bonferroni correction according to the number of models to be compared. Specifically, in marginal search, the corrected level is $\alpha/L$ for comparing a total of L one-marker models, and the null distribution is $\chi^2_1$. In exhaustive search, the corrected level is $\alpha/\binom{L}{2}$ for comparing a total of $\binom{L}{2}$ pairwise marker models, and the null distribution is $\chi^2_3$. In forward search, the corrected level of the first step is $\alpha/L$ with the null distribution $\chi^2_1$; the corrected level of the second step is $\alpha/(L-1)$ with the null distribution $\chi^2_2$. In the following we construct the score test statistics for logistic regression, derive the null and alternative asymptotic distributions of the relevant test statistics in the procedure of marker search, and present the formulae for power calculation under the significance control criteria.
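For concreteness, the Bonferroni cutoffs described above can be computed with the Python standard library alone, since the chi-square upper-tail probabilities with 1, 2, and 3 degrees of freedom have closed forms. The sketch below is illustrative; the function names and the bisection bracket are our own choices. It works with upper-tail probabilities to avoid rounding 1 − α/L to 1 when L is large.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable (closed form, df = 1, 2, 3)."""
    if x <= 0:
        return 1.0
    if df == 1:
        return math.erfc(math.sqrt(x / 2))
    if df == 2:
        return math.exp(-x / 2)
    if df == 3:
        return (math.erfc(math.sqrt(x / 2))
                + math.sqrt(2 * x / math.pi) * math.exp(-x / 2))
    raise ValueError("only df = 1, 2, 3 are supported")


def chi2_isf(q, df, upper=400.0):
    """Solve chi2_sf(x, df) = q for x by bisection (chi2_sf is decreasing in x)."""
    lo, hi = 0.0, upper
    for _ in range(200):
        mid = (lo + hi) / 2
        if chi2_sf(mid, df) > q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def bonferroni_cutoffs(L, alpha=0.05):
    """Cutoffs on the score-statistic scale for each search strategy."""
    return {
        "marginal": chi2_isf(alpha / L, 1),
        "exhaustive": chi2_isf(alpha / math.comb(L, 2), 3),
        "forward_step1": chi2_isf(alpha / L, 1),
        "forward_step2": chi2_isf(alpha / (L - 1), 2),
    }


if __name__ == "__main__":
    print(bonferroni_cutoffs(1000))
```

The exhaustive cutoff is the largest of the three because it combines more degrees of freedom with a far larger number of candidate models.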

3. Model Selection Power

3.1. Score Test Statistic

We adapt a score test [13] to the context of genome-wide association case-control studies. Let $T_i^1=(T_{1i}^1,\ldots,T_{mi}^1)'$, $i=1,\ldots,n_1$, be a random sample of m covariates of the logistic regression in cases, and let $T_i^0=(T_{1i}^0,\ldots,T_{mi}^0)'$, $i=1,\ldots,n_0$, be a random sample of m covariates in controls. $T_i^1$ and $T_i^0$ have distributions $p(t_1,\ldots,t_m\mid D)$ and $p(t_1,\ldots,t_m\mid\bar D)$, respectively. Let $T_{m\times n}=(T_1,\ldots,T_n)=(T_1^1,\ldots,T_{n_1}^1,T_1^0,\ldots,T_{n_0}^0)$ represent the combined variables with total sample size $n=n_1+n_0$. To test the null hypothesis that there is no association between the outcome and the covariates, the score statistic is

$$
S=nU'\Gamma^{-1}U,
$$

where $U_{m\times1}=\dfrac{n_0n_1}{n^2}\bigl(\bar T^1-\bar T^0\bigr)$ and $\Gamma_{m\times m}=\dfrac{n_0n_1}{n^2}\Bigl(\dfrac1n\sum_{i=1}^nT_iT_i'-\bar T\bar T'\Bigr)$, with $\bar T^1=\sum_{i=1}^{n_1}T_i^1/n_1$, $\bar T^0=\sum_{i=1}^{n_0}T_i^0/n_0$, and $\bar T=\sum_{i=1}^nT_i/n$ being the vectors of sample averages.

In GWAS, let $Z_j=(Z_{j1},\ldots,Z_{jn_1})$ and $X_j=(X_{j1},\ldots,X_{jn_0})$ be samples of genotype values for the jth marker in cases and controls, respectively. That is, the vector of random genotypes of marker j is $G_j=(G_{j1},\ldots,G_{jn})=(Z_{j1},\ldots,Z_{jn_1},X_{j1},\ldots,X_{jn_0})$. In general, the elements of $T_i^1$ and $T_i^0$ are functions of random genotypes corresponding to the form of the logistic regression model. In particular, for a single-marker model (2.6) of marker j, j = 1, …, L, there is m = 1 covariate such that $T_i^1=(Z_{ji})$, $T_i^0=(X_{ji})$, and $T=G_j$. So the score test is

$$
S_j=\frac n2\left(\frac{2r(1-r)(\bar Z_j-\bar X_j)^2}{r\overline{Z_j^2}+(1-r)\overline{X_j^2}-\bigl(r\bar Z_j+(1-r)\bar X_j\bigr)^2}\right), \tag{3.1}
$$

where $r=n_1/n$, $\bar Z_j=\sum_{i=1}^{n_1}Z_{ji}/n_1$, $\bar X_j=\sum_{i=1}^{n_0}X_{ji}/n_0$, $\overline{Z_j^2}=\sum_{i=1}^{n_1}Z_{ji}^2/n_1$ and $\overline{X_j^2}=\sum_{i=1}^{n_0}X_{ji}^2/n_0$. For a two-marker model (2.7) of markers j and k, there are m = 3 covariates such that $T_i^1=(Z_{ji},Z_{ki},Z_{ji}Z_{ki})'$, $T_i^0=(X_{ji},X_{ki},X_{ji}X_{ki})'$, and $T=(G_j,G_k,G_j*G_k)'$, where * denotes the element-wise product of two vectors. Thus the score test becomes

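$$
S_{jk}=nU_{jk}'(\Gamma_{jk})^{-1}U_{jk}, \tag{3.2}
$$

where

$$
U_{jk}=\frac{n_1n_0}{n^2}\bigl(\bar Z_j-\bar X_j,\ \bar Z_k-\bar X_k,\ \overline{Z_jZ_k}-\overline{X_jX_k}\bigr)',
$$

$$
\Gamma_{jk}=\frac{n_1n_0}{n^2}\begin{pmatrix}\gamma_{11}&\gamma_{12}&\gamma_{13}\\ \gamma_{21}&\gamma_{22}&\gamma_{23}\\ \gamma_{31}&\gamma_{32}&\gamma_{33}\end{pmatrix}, \tag{3.3}
$$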

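with $\Gamma_{jk}$ being symmetric and

$$
\begin{aligned}
\gamma_{11}&=\frac{n_1\overline{Z_j^2}+n_0\overline{X_j^2}}{n}-\left(\frac{n_1\bar Z_j+n_0\bar X_j}{n}\right)^2,\\
\gamma_{12}&=\frac{n_1\overline{Z_jZ_k}+n_0\overline{X_jX_k}}{n}-\left(\frac{n_1\bar Z_j+n_0\bar X_j}{n}\right)\left(\frac{n_1\bar Z_k+n_0\bar X_k}{n}\right),\\
\gamma_{13}&=\frac{n_1\overline{Z_j^2Z_k}+n_0\overline{X_j^2X_k}}{n}-\left(\frac{n_1\overline{Z_jZ_k}+n_0\overline{X_jX_k}}{n}\right)\left(\frac{n_1\bar Z_j+n_0\bar X_j}{n}\right),\\
\gamma_{22}&=\frac{n_1\overline{Z_k^2}+n_0\overline{X_k^2}}{n}-\left(\frac{n_1\bar Z_k+n_0\bar X_k}{n}\right)^2,\\
\gamma_{23}&=\frac{n_1\overline{Z_jZ_k^2}+n_0\overline{X_jX_k^2}}{n}-\left(\frac{n_1\overline{Z_jZ_k}+n_0\overline{X_jX_k}}{n}\right)\left(\frac{n_1\bar Z_k+n_0\bar X_k}{n}\right),\\
\gamma_{33}&=\frac{n_1\overline{Z_j^2Z_k^2}+n_0\overline{X_j^2X_k^2}}{n}-\left(\frac{n_1\overline{Z_jZ_k}+n_0\overline{X_jX_k}}{n}\right)^2.
\end{aligned}
$$

Note that $\overline{Z_jZ_k}=\sum_{i=1}^{n_1}Z_{ji}Z_{ki}/n_1$ and $\overline{X_jX_k}=\sum_{i=1}^{n_0}X_{ji}X_{ki}/n_0$; the other averages of cross-product terms are defined analogously.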
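As a numerical check of the single-marker case, the following Python sketch computes the score statistic in two ways: from the closed form (3.1), and from the general quadratic form S = nU′Γ⁻¹U with m = 1 and the weight n0n1/n² used in the general definition of U and Γ. The function names and the toy genotype vectors are ours and serve only as an illustration.

```python
def mean(values):
    return sum(values) / len(values)


def score_single_marker(cases, controls):
    """Score statistic (3.1) from genotype lists coded 0/1/2."""
    n1, n0 = len(cases), len(controls)
    n = n1 + n0
    r = n1 / n
    z_bar, x_bar = mean(cases), mean(controls)
    z2_bar = mean([z * z for z in cases])
    x2_bar = mean([x * x for x in controls])
    numerator = 2 * r * (1 - r) * (z_bar - x_bar) ** 2
    denominator = r * z2_bar + (1 - r) * x2_bar - (r * z_bar + (1 - r) * x_bar) ** 2
    return n / 2 * numerator / denominator


def score_general_m1(cases, controls):
    """S = n * U * Gamma^{-1} * U for a single covariate (m = 1)."""
    n1, n0 = len(cases), len(controls)
    n = n1 + n0
    weight = n0 * n1 / n ** 2
    pooled = list(cases) + list(controls)
    u = weight * (mean(cases) - mean(controls))
    gamma = weight * (mean([t * t for t in pooled]) - mean(pooled) ** 2)
    return n * u * u / gamma


if __name__ == "__main__":
    cases = [2, 1, 1, 0, 2, 1, 2, 1]      # toy data, not from the paper
    controls = [0, 1, 0, 0, 1, 0, 1, 0]
    print(score_single_marker(cases, controls), score_general_m1(cases, controls))
```

The two functions agree because the pooled mean and pooled mean square are the r-weighted averages of the case and control quantities.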

3.2. Asymptotic Distributions

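In GWAS, the marker genotypes of individuals are not controlled but randomly observed, so it is crucial to treat the genotype predictors as random variables. We apply the following generalization of the Delta method to derive the null and the alternative distributions of the score test statistics, which are functions of random predictors. Let $W_i=(W_{1i},\ldots,W_{mi})'$, i = 1, …, n, be n independent and identically distributed random vectors of dimension m. The corresponding mean vector is $\theta=(\theta_1,\ldots,\theta_m)'$ with $\theta_s=E(W_{si})$, and the covariance matrix is $\Sigma=\mathrm{Cov}(W_i)$ with $(\Sigma)_{st}=\mathrm{Cov}(W_{si},W_{ti})$, s, t = 1, …, m. Let $\bar W=(\bar W_1,\ldots,\bar W_m)'$ be the vector of the sample means, i.e. $\bar W_s=\frac1n\sum_{i=1}^nW_{si}$. Consider a real-valued function $h(\bar W)$ of $\bar W$. If $\nabla h(\theta)\equiv\left(\frac{\partial h(\theta)}{\partial\theta_1},\ldots,\frac{\partial h(\theta)}{\partial\theta_m}\right)'\neq0$, then

$$
\sqrt n\,\bigl[h(\bar W)-h(\theta)\bigr]\xrightarrow{L}N(0,\tau^2), \tag{3.4}
$$

where $\tau^2=[\nabla h(\theta)]'\,\Sigma\,[\nabla h(\theta)]$ and $\xrightarrow{L}$ denotes convergence in law. If $\nabla h(\theta)=0$, then

$$
n\,\bigl[h(\bar W)-h(\theta)\bigr]\xrightarrow{L}c\,\chi^2_d, \tag{3.5}
$$

where, with $A\equiv D^2h(\theta)\,\Sigma$ and $D^2h(\theta)=\frac{\partial^2}{\partial\theta^2}h(\theta)$ being the Hessian matrix of $h(\theta)$,

  1. $c=\frac12$, $d=\operatorname{rank}(A)$, if A is idempotent,

  2. $c\approx\dfrac{\operatorname{trace}(A^2)}{2\operatorname{trace}(A)}$, $d\approx\dfrac{\operatorname{trace}(A)^2}{\operatorname{trace}(A^2)}$, if A is not idempotent.

Furthermore, if $\nabla h_1(\theta)\neq0$ and $\nabla h_2(\theta)\neq0$,

$$
\mathrm{Cov}\bigl(\sqrt n\,h_1(\bar W),\sqrt n\,h_2(\bar W)\bigr)\xrightarrow{P}[\nabla h_1(\theta)]'\,\Sigma\,[\nabla h_2(\theta)]. \tag{3.6}
$$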
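The constants c and d in (3.5) depend on A only through trace(A) and trace(A²). The Python sketch below (the function name and example matrices are ours) computes the trace-based constants and illustrates that, for an idempotent matrix, they reduce to c = 1/2 and d = rank(A), because trace(A²) = trace(A) = rank(A) in that case.

```python
def chi2_approx_constants(A):
    """Return (c, d) for the approximation c * chi2_d in (3.5).

    A is a square matrix given as a list of rows; c and d are obtained
    from trace(A) and trace(A^2) as in the non-idempotent case.
    """
    m = len(A)
    trace_a = sum(A[i][i] for i in range(m))
    # trace(A @ A) = sum_i sum_k A[i][k] * A[k][i]
    trace_a2 = sum(A[i][k] * A[k][i] for i in range(m) for k in range(m))
    c = trace_a2 / (2 * trace_a)
    d = trace_a ** 2 / trace_a2
    return c, d


if __name__ == "__main__":
    identity3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    print(chi2_approx_constants(identity3))        # idempotent case: (0.5, 3.0)
    print(chi2_approx_constants([[2.0, 0.0], [0.0, 1.0]]))  # non-idempotent case
```

The trace-based constants match the mean trace(A)/2 and the variance trace(A²)/2 of the limiting quadratic form, which is why d is generally not an integer.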

Clearly, the score tests in (3.1) and (3.2) are functions of the genotypic sample means. The distribution of $\bar W$ can be derived from the genotypic distributions in cases or in controls determined by the genetic models. For the score tests involved in each marker search method, the following sections specify the distribution of $\bar W$ for the given markers involved in the model fitting. When the causative loci are not genotyped but markers 1 and 2 are in LD with them, by (2.5) we obtain the conditional joint genotypic distribution $p(g_{A1},g_{A2}\mid D)$ in cases and $p(g_{A1},g_{A2}\mid\bar D)$ in controls. The distributions of the relevant score tests then follow from the mean vector and covariance matrix involving the associated markers 1 and 2:

$$
\theta_{A1A2}=E(W_{A1A2}),\qquad\Sigma_{A1A2}=\mathrm{Var}(W_{A1A2}).
$$

3.3. Marginal Search

3.3.1. Asymptotic Distribution of Test Statistic

The relevant tests and the corresponding distributions for marginal search are as follows. For the jth single marker, j = 1, …, L, let $T_j\equiv\sqrt{S_j}$, where $S_j$ is the score test in (3.1). We can rewrite $T_j=\sqrt{n/2}\,h(\bar W_j)$ with $\bar W_j=(\bar Z_j,\bar X_j,\overline{Z_j^2},\overline{X_j^2})'$ being a vector of sample averages over cases and controls. Let $W_j=(Z_j,X_j,Z_j^2,X_j^2)'$ represent the random genotypic vector of any observation. For an associated marker j = 1 (similarly for j = 2), $Z_1$ has the distribution $p(g_1\mid D)=\sum_{g_2}p(g_1,g_2\mid D)$, and $X_1$ has the distribution $p(g_1\mid\bar D)$. For the non-associated markers j = 3, …, L, $Z_j$ and $X_j$ have the same distribution $p(g_j)$. The mean vectors and covariance matrices are obtained as $\theta_j=E(W_j)$ and $\Sigma_j=\mathrm{Cov}(W_j)$, respectively. When $n_1=n_0$, by (3.4),

$$
T_j-\sqrt{\frac n2}\,h(\theta_j)\xrightarrow{L}N(0,\tau_j^2),
$$

where $\tau_j^2=[\nabla h(\theta_j)]'\,\Sigma_j\,[\nabla h(\theta_j)]$.

To calculate the alternative distributions, note that $T_1$ and $T_2$ are correlated because the odds of disease are a function of both markers 1 and 2. The joint distribution of $(T_1,T_2)'$ is asymptotically multivariate normal,

$$
(T_1,T_2)'-\mu_{T_1,T_2}\xrightarrow{L}MVN(0,\tau_{T_1,T_2}), \tag{3.7}
$$

and is a part of the joint distribution in (3.8). For the non-associated markers j = 3, …, L, the null distribution for marginal search is $T_j\xrightarrow{L}N(0,1)$ by (3.4), or equivalently, $S_j\xrightarrow{L}\chi^2_1$ by (3.5). Further, by (3.6), the correlations between $T_j$, j = 3, …, L, and $T_1$ (or $T_2$) are asymptotically zero.

3.3.2. Power under Discovery Number Control

For the discovery number control, the power of detecting alternative model(s) in the top R most significantly fitted models is equal to the probability that an alternative model is better fitted than the Rth (or in marginal search under power definition (A), the (R − 1)th) best fitted null models. Specifically, when the number of discoveries is controlled by R, the power of marginal search under definition (A) for detecting both associated markers 1 and 2 is

$$
P\bigl(S_1\wedge S_2\ge S_{(r)}\bigr)=\int P\bigl(S_{(r)}\le t_1^2\wedge t_2^2\bigr)\,dG(t_1,t_2),
$$

where $S_1\wedge S_2=\min\{S_1,S_2\}$, $r=L-2-R+1$, $S_{(r)}$ is the rth smallest (or the Rth largest) order statistic in the set $\{S_j,\ j=3,\ldots,L\}$, and $G(t_1,t_2)$ is the cumulative distribution function (CDF) of $(T_1,T_2)'$ in (3.7). Let $G_1(\cdot)$ be the CDF of $\chi^2_1$. Then

$$
P\bigl(S_{(r)}\le x\bigr)=G_1^r(x)\sum_{l=0}^{L-2-r}\binom{r+l-1}{l}\bigl(1-G_1(x)\bigr)^l.
$$
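The CDF of the order statistic above is written in a negative-binomial form. As a check, the Python sketch below (function names ours) compares it with the more familiar binomial-tail expression, the probability that at least r of the L − 2 independent null statistics fall below x. The two agree because both describe the event that the rth "success" occurs within L − 2 trials.

```python
import math


def order_stat_cdf_negbin(g, n_null, r):
    """G^r * sum_{l=0}^{n_null - r} C(r+l-1, l) * (1-G)^l, with g = G1(x)."""
    return g ** r * sum(
        math.comb(r + l - 1, l) * (1 - g) ** l for l in range(n_null - r + 1)
    )


def order_stat_cdf_binom(g, n_null, r):
    """P(at least r of n_null iid draws are <= x), with g = G1(x)."""
    return sum(
        math.comb(n_null, k) * g ** k * (1 - g) ** (n_null - k)
        for k in range(r, n_null + 1)
    )


if __name__ == "__main__":
    L, R = 12, 3                 # small illustrative values
    n_null = L - 2               # number of non-associated markers
    r = n_null - R + 1           # r = L - 2 - R + 1
    for g in (0.5, 0.9, 0.99):
        print(g, order_stat_cdf_negbin(g, n_null, r), order_stat_cdf_binom(g, n_null, r))
```

The negative-binomial form has only R terms, which is convenient when R is much smaller than L; for genome-scale L the terms should be accumulated on the log scale to avoid overflow, which this sketch does not attempt.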

To get the power of marginal search under definition (B), that either associated marker 1 or marker 2 is selected, we calculate the probability that either $S_1$ or $S_2$ is larger than the cutoff point: $P(S_1\vee S_2\ge S_{(r)})$, where $S_1\vee S_2=\max\{S_1,S_2\}$.

3.3.3. Power under Bonferroni Control

Since the null distribution of a score statistic used in marginal search is $\chi^2_1$, the cutoff under the Bonferroni-corrected type I error rate control is $c=G_1^{-1}(1-\alpha/L)$, where $G_1^{-1}$ is the quantile function of $\chi^2_1$ and α is the genome-wide significance level. Under power definition (A) or (B), the probability of finding both or either associated marker is $P(S_1\wedge S_2\ge c)$ or $P(S_1\vee S_2\ge c)$, respectively.
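Given a mean vector and covariance matrix for (T1, T2)′, these two probabilities can be approximated by Monte Carlo sampling from the bivariate normal limit. The Python sketch below is an illustration under stated assumptions: the function name, the mean vector, and the covariance matrix are hypothetical inputs chosen by us, not values derived from any genetic model in this article.

```python
import math
import random
from statistics import NormalDist


def marginal_bonferroni_power(mu, tau, L, alpha=0.05, draws=20000, seed=1):
    """Monte Carlo estimate of (power A, power B) for marginal search.

    mu  : means of (T1, T2); tau : 2x2 covariance matrix of (T1, T2).
    Both are treated as given (hypothetical) inputs.
    """
    # chi2_1 quantile at 1 - alpha/L equals the squared normal quantile at 1 - alpha/(2L).
    c = NormalDist().inv_cdf(1 - alpha / (2 * L)) ** 2
    # Cholesky factor of the 2x2 covariance matrix.
    a = math.sqrt(tau[0][0])
    b = tau[0][1] / a
    d = math.sqrt(tau[1][1] - b * b)
    rng = random.Random(seed)
    both = either = 0
    for _ in range(draws):
        e1, e2 = rng.gauss(0, 1), rng.gauss(0, 1)
        s1 = (mu[0] + a * e1) ** 2
        s2 = (mu[1] + b * e1 + d * e2) ** 2
        both += min(s1, s2) >= c      # definition (A): both markers pass
        either += max(s1, s2) >= c    # definition (B): at least one passes
    return both / draws, either / draws


if __name__ == "__main__":
    print(marginal_bonferroni_power((6.0, 5.0), [[1.0, 0.2], [0.2, 1.0]], L=1000))
```

By construction the estimate under definition (A) can never exceed the estimate under definition (B), since the minimum of the two statistics passing the cutoff implies that the maximum does.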

3.4. Exhaustive Search

3.4.1. Asymptotic Distribution of Test Statistic

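The relevant test distributions for exhaustive search are based on the score tests in (3.1) and (3.2). For the statistics involving associated markers 1 and 2, let $T_{12}\equiv\sqrt{S_{12}}=\sqrt{n/2}\,h_{12}(\bar W_{12})$ and $T_i\equiv\sqrt{S_i}=\sqrt{n/2}\,h_i(\bar W_{12})$, i = 1, 2, where

$$
\begin{aligned}
\bar W_{12}=\bigl(&\bar Z_1,\bar X_1,\bar Z_2,\bar X_2,\overline{Z_1Z_2},\overline{X_1X_2},\overline{Z_1^2},\overline{X_1^2},\overline{Z_2^2},\overline{X_2^2},\\
&\overline{Z_1^2Z_2},\overline{X_1^2X_2},\overline{Z_1Z_2^2},\overline{X_1X_2^2},\overline{Z_1^2Z_2^2},\overline{X_1^2X_2^2}\bigr)'
\end{aligned}
$$

is a vector of sample averages over cases and controls. Let $W_{12}$ be the corresponding genotypic vector of one random observation; we have the mean vector $\theta_{12}=E(W_{12})$ and the variance matrix $\Sigma_{12}=\mathrm{Var}(W_{12})$. Based on the asymptotic distribution results in (3.4) and (3.6), when $n_1=n_0$,

$$
(T_{12},T_1,T_2)'-\mu_{T_{12},T_1,T_2}\xrightarrow{L}MVN(0,\tau_{T_{12},T_1,T_2}), \tag{3.8}
$$

where

$$
\mu_{T_{12},T_1,T_2}=\sqrt{\frac n2}\,\bigl(h_{12}(\theta_{12}),h_1(\theta_{12}),h_2(\theta_{12})\bigr)',\qquad
\tau_{T_{12},T_1,T_2}=D'\,\Sigma_{12}\,D,
$$

with $D=\bigl(\nabla h_{12}(\theta_{12}),\nabla h_1(\theta_{12}),\nabla h_2(\theta_{12})\bigr)$.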

Because $S_{12}$ has 3 degrees of freedom, the convergence in (3.8) is slower than that in (3.7). In the following we describe an approximation for the mean of $T_{12}$ for the case in which the sample size is small (e.g. $n_1<1000$) and the genetic effect is weak (e.g. θ < 0.2 in model (2.2)). Note that if we consider an observed (and thus fixed) design matrix $t=(g_1,g_2,g_1*g_2)$ of a logistic regression of the form (2.7), where $g_j=(z_{j1},\ldots,z_{jn_1},x_{j1},\ldots,x_{jn_0})'$, j = 1 or 2, it has been shown that [13]

$$
S_{12}\sim\chi^2_{3,\delta_n(t)},
$$

where

$$
\delta_n(t)=n\,b'\,\Gamma_{12}(t)\,b,
$$

with $\Gamma_{12}(t)$ given in the form (3.3) and $b=(b_1,b_2,b_3)'$ being the vector of coefficients in (2.4). Now we define $\delta_n=\delta_n(E(T))$ in our set-up for the random genotype $T=(G_1,G_2,G_1*G_2)$ with $G_j=(Z_{j1},\ldots,Z_{jn_1},X_{j1},\ldots,X_{jn_0})'$, j = 1 or 2. We can use a weighted noncentral chi-square with one degree of freedom to approximate $S_{12}$, i.e. $S_{12}\approx a_n\chi^2_{1,\lambda_n}$. Solving the equations that match the mean and the variance [9], i.e. $a_n(1+\lambda_n)=3+\delta_n$ and $a_n^2(2+4\lambda_n)=6+4\delta_n$, we get $a_n=\dfrac{3+\delta_n}{1+\lambda_n}$, $\lambda_n=2t_n-1+\sqrt{4t_n^2-2t_n}$, and $t_n=\dfrac{(3+\delta_n)^2}{6+4\delta_n}$. So we can apply the following approximation for weak genetic effects:

$$
E(T_{12})\approx\sqrt{a_n\lambda_n}.
$$
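The moment-matching solution can be verified numerically. The short Python sketch below (function name ours) computes the weight and noncentrality of the one-degree-of-freedom approximation from a given noncentrality δn of the three-degree-of-freedom statistic, and checks that both matching equations hold.

```python
import math


def match_one_df_noncentral(delta):
    """Return (a, lam) such that a * chi2_{1, lam} matches the mean and
    variance of a noncentral chi-square with 3 df and noncentrality delta.
    """
    t = (3 + delta) ** 2 / (6 + 4 * delta)
    lam = 2 * t - 1 + math.sqrt(4 * t * t - 2 * t)
    a = (3 + delta) / (1 + lam)
    return a, lam


if __name__ == "__main__":
    for delta in (0.0, 2.0, 10.0):
        a, lam = match_one_df_noncentral(delta)
        mean_gap = a * (1 + lam) - (3 + delta)
        var_gap = a * a * (2 + 4 * lam) - (6 + 4 * delta)
        print(delta, a, lam, mean_gap, var_gap)  # both gaps are at rounding level
```

The quantity under the square root is nonnegative for every δn ≥ 0, since tn ≥ 3/2 there, so the solution is always real.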

By (3.5), the score test $S_{jk}$ for model (2.7) of two non-associated markers j and k, 3 ≤ j < k ≤ L, has the asymptotic distribution

$$
S_{jk}\xrightarrow{L}\chi^2_3. \tag{3.9}
$$

Let $S_{k|j}$ denote the score test for the extra terms in model (2.7) over model (2.6). The following is a useful decomposition:

$$
S_{jk}=S_j+S_{k|j}. \tag{3.10}
$$

So the correlation between the score tests $S_{jk_1}$ and $S_{jk_2}$, which share the same marker j, is captured by $S_j$, while $S_{k_1|j}$ and $S_{k_2|j}$ can be treated as independent. Furthermore, by (3.5), for k ≥ 3,

$$
S_{k|j}\xrightarrow{L}\chi^2_2. \tag{3.11}
$$

3.4.2. Power under Discovery Number Control

With the test distributions derived above, we can calculate the probability of identifying the whole associated genetic model through exhaustive search under power definition (A). Define the sets $A_1\equiv\{S_{ij},\ i=1,2,\ j=3,\ldots,L\}$ and $A_2\equiv\{S_{jk},\ 3\le j<k\le L\}$. Let $S_{A,[R]}$ denote the Rth largest score test statistic in a set A. When controlling the false discovery number by R, the probability of detecting the associated marker pair is

$$
P\bigl(S_{12}\ge S_{A_1\cup A_2,[R]}\bigr)=\int P\bigl(t_{12}^2\ge S_{A_1\cup A_2,[R]}\bigr)\,dG(t_{12},t_1,t_2),
$$

where $G(t_{12},t_1,t_2)$ is the CDF of (3.8), $A_1=\{t_i^2+S_{j|i},\ i=1,2,\ j=3,\ldots,L\}$ follows from the decomposition (3.10), and

$$
P\bigl(t_{12}^2\ge S_{A_1\cup A_2,[R]}\bigr)=\sum_{r=0}^{R-1}\ \sum_{\{r_1,r_2,r_3\}\in S_r}P_1P_2P_3,
$$

where

$$
\begin{aligned}
S_r&=\Bigl\{\{r_1,r_2,r_3\}:\ \textstyle\sum_i r_i=r,\ 0\le r_1,r_2\le(L-2),\ 0\le r_3\le N\Bigr\},\\
P_1&=\binom{L-2}{r_1}\bigl[1-G_1(t_{12}^2-t_1^2)\bigr]^{r_1}G_1(t_{12}^2-t_1^2)^{L-2-r_1},\\
P_2&=\binom{L-2}{r_2}\bigl[1-G_1(t_{12}^2-t_2^2)\bigr]^{r_2}G_1(t_{12}^2-t_2^2)^{L-2-r_2},\\
P_3&=\binom{N}{r_3}\bigl[1-G_3(t_{12}^2)\bigr]^{r_3}G_3(t_{12}^2)^{N-r_3},
\end{aligned}
$$

$N=\binom{L-2}{2}$ is the number of elements in the set $S_2$ defined below, $G_1(\cdot)$ is the CDF of distribution (3.11), and $G_3(\cdot)$ is the CDF of distribution (3.9). By the same argument as given in the literature [12], the test statistics within the sets $A^*\equiv\{S_{j|1},S_{j|2},\ j=3,\ldots,L\}$ and $S_2\equiv\{S_{jk},\ 2<j<k\le L\}$ (which coincides with $A_2$) can be treated as asymptotically independent as L → ∞.

To simplify the heavy computation needed for the above formula, we can use the following approximations of the integrands. Simulations (results not presented) showed that these approximations are quite accurate in the context of Monte Carlo integration. Let $m=2(L-2)+N$ be the total number of elements in $A_1\cup A_2$, and let Q denote the quantile function of the mixed distribution of these elements. For a given $(t_{12},t_1,t_2)$, $P\bigl(t_{12}^2\ge S_{A_1\cup A_2,[R]}\bigr)$ can be approximately replaced with

$$
I\Bigl\{t_{12}^2>Q\Bigl(\frac{m-R+0.5}{m}\Bigr)\Bigr\}\approx I\bigl\{t_{12}^2>R\text{th largest value in }A_3\bigr\},
$$

where $I\{E\}$ denotes the indicator function of event E, and the set

$$
A_3\equiv\bigl\{Q_1(r)+t_1^2,\ Q_1(r)+t_2^2,\ Q_3(r),\ r=1,\ldots,R\bigr\},
$$

with $Q_1(r)=G_1^{-1}\Bigl(1-\dfrac{r-0.5}{L-2}\Bigr)$ and $Q_3(r)=G_3^{-1}\Bigl(\dfrac{N-r+0.5}{N}\Bigr)$.

According to power definition (B), the probability for exhaustive search to detect either associated marker is

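$$
P\bigl(\max(\{S_{12}\}\cup A_1)>S_{A_2,[R]}\bigr)=1-\int P_{t_{12},t_1,t_2}\,dG(t_{12},t_1,t_2),
$$

where

$$
\begin{aligned}
P_{t_{12},t_1,t_2}&=P\bigl(\max(\{t_{12}^2\}\cup A_1)\le S_{A_2,[R]}\bigr)\\
&=\int P\bigl(\max(\{t_{12}^2\}\cup A_1)\le x\bigr)\,g_3^{(N-R+1)}(x)\,dx\\
&=\int_{t_{12}^2}^{\infty}\bigl[G_1(x-t_1^2)\,G_1(x-t_2^2)\bigr]^{L-2}g_3^{(N-R+1)}(x)\,dx,
\end{aligned}
$$

$g_3^{(N-R+1)}(\cdot)$ is the PDF of the (N − R + 1)th order statistic, with the following density function:

$$
g_3^{(N-R+1)}(x)=\frac{N!}{(N-R)!\,(R-1)!}\,G_3(x)^{N-R}\,[1-G_3(x)]^{R-1}\,g_3(x),
$$

where $G_3(\cdot)$ and $g_3(\cdot)$ are the CDF and PDF of the distribution in (3.9), respectively.

If R is neither too small nor too large, i.e., $R/N\to c$, 0 < c < 1, as N → ∞, we can use quantiles to replace the order statistics in order to simplify the calculation ([2] Chapter 4.6), i.e. $S_{A_2,[R]}\approx G_3^{-1}\Bigl(\dfrac{N-R+0.5}{N}\Bigr)\equiv Q$. So for given $(t_{12},t_1,t_2)$, we can approximate the integrand $P_{t_{12},t_1,t_2}$ with

$$
I\{t_{12}^2\le Q\}\,\bigl[G_1(Q-t_1^2)\,G_1(Q-t_2^2)\bigr]^{L-2}.
$$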

3.4.3. Power under Bonferroni Control

Traditional type I error control for exhaustive search does not consider the models with one associated and one non-associated marker, so the null distribution is that of non-associated two-marker models [6][10]. Let $G_3^{-1}$ be the quantile function of $\chi^2_3$ in (3.9). The cutoff is $c=G_3^{-1}\bigl(1-\alpha/\binom{L}{2}\bigr)$. The probability of finding the whole associated genetic model under power definition (A) is

$$
P(S_{12}\ge c)=\int P(t_{12}^2\ge c)\,dG(t_{12}).
$$

By an argument similar to that for the power under discovery number control, the probability of finding either associated marker under power definition (B) is

$$
P\bigl(\max(\{S_{12}\}\cup A_1)\ge c\bigr)=1-\int P_{t_{12},t_1,t_2}(c)\,dG(t_{12},t_1,t_2),
$$

where

$$
P_{t_{12},t_1,t_2}(c)=P\bigl(\max(\{t_{12}^2\}\cup A_1)\le c\bigr)=I\{t_{12}^2\le c\}\,\bigl[G_1(c-t_1^2)\,G_1(c-t_2^2)\bigr]^{L-2},
$$

with $G_1(\cdot)$ being the CDF of distribution (3.11).
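The integrand for power definition (B) has a simple closed form, because the distribution in (3.11) is chi-square with two degrees of freedom, whose CDF is 1 − exp(−x/2). The Python sketch below (function names ours) evaluates, for a fixed value of (t12, t1, t2), the probability that no model containing an associated marker reaches the cutoff c.

```python
import math


def chi2_2_cdf(x):
    """CDF of the chi-square distribution with 2 degrees of freedom."""
    return 1.0 - math.exp(-x / 2) if x > 0 else 0.0


def miss_probability(t12, t1, t2, c, L):
    """I{t12^2 <= c} * [G1(c - t1^2) * G1(c - t2^2)]^(L - 2).

    Probability, for fixed (t12, t1, t2), that the pair model (1, 2) and all
    2 * (L - 2) models pairing marker 1 or 2 with a null marker stay below c.
    """
    if t12 * t12 > c:
        return 0.0
    g = chi2_2_cdf(c - t1 * t1) * chi2_2_cdf(c - t2 * t2)
    return g ** (L - 2)


if __name__ == "__main__":
    c, L = 35.0, 1000   # illustrative cutoff and number of markers
    for t in (0.0, 3.0, 5.0, 6.0):
        print(t, miss_probability(t, t, t, c, L))
```

Averaging `1 - miss_probability(...)` over draws of (T12, T1, T2)′ from (3.8) would give a Monte Carlo estimate of the power; the draws themselves require the mean vector and covariance matrix of (3.8), which are not computed here.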

3.5. Forward Search

3.5.1. Asymptotic Distribution of Test Statistic

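For forward search, we first derive the distributions of the test statistics, which will be used to calculate the power of this search procedure. For the score tests involving the associated markers 1 and 2, let $T_{i|j}\equiv\sqrt{S_{i|j}}$, where $S_{i|j}$ follows (3.10), i = 1, 2, j = 3, …, L. We can rewrite $T_{i|j}=\sqrt{n/2}\,h_{i|j}(\bar W_{12j})$ and $T_i=\sqrt{n/2}\,h_i(\bar W_{12j})$, where

$$
\begin{aligned}
\bar W_{12j}=\bigl(&\bar Z_1,\bar X_1,\bar Z_2,\bar X_2,\overline{Z_1Z_2},\overline{X_1X_2},\overline{Z_1^2},\overline{X_1^2},\overline{Z_2^2},\overline{X_2^2},\\
&\overline{Z_1^2Z_2},\overline{X_1^2X_2},\overline{Z_1Z_2^2},\overline{X_1X_2^2},\overline{Z_1^2Z_2^2},\overline{X_1^2X_2^2},\\
&\bar Z_j,\bar X_j,\overline{Z_1Z_j},\overline{X_1X_j},\overline{Z_j^2},\overline{X_j^2},\overline{Z_1^2Z_j},\overline{X_1^2X_j},\overline{Z_1Z_j^2},\overline{X_1X_j^2},\overline{Z_1^2Z_j^2},\overline{X_1^2X_j^2},\\
&\overline{Z_2Z_j},\overline{X_2X_j},\overline{Z_2^2Z_j},\overline{X_2^2X_j},\overline{Z_2Z_j^2},\overline{X_2X_j^2},\overline{Z_2^2Z_j^2},\overline{X_2^2X_j^2}\bigr)'
\end{aligned}
$$

is a vector of sample averages over cases and controls. Let $W_{12j}$ be the corresponding random genotypic vector of any observation. The mean vector is $\theta_{12j}=E(W_{12j})$ and the variance matrix is $\Sigma_{12j}=\mathrm{Var}(W_{12j})$. Following (3.4) and (3.6), we have the asymptotic joint distribution

$$
(T_1,T_2,T_{1|j},T_{2|j})'-\mu_{T_1,T_2,T_{1|j},T_{2|j}}\xrightarrow{L}MVN(0,\tau_{T_1,T_2,T_{1|j},T_{2|j}}), \tag{3.12}
$$

where

$$
\mu_{T_1,T_2,T_{1|j},T_{2|j}}=\sqrt{n/2}\,\bigl(h_1(\theta_{12j}),h_2(\theta_{12j}),h_{1|j}(\theta_{12j}),h_{2|j}(\theta_{12j})\bigr)',\qquad
\tau_{T_1,T_2,T_{1|j},T_{2|j}}=D'\,\Sigma_{12j}\,D,
$$

with $D=\bigl(\nabla h_1(\theta_{12j}),\nabla h_2(\theta_{12j}),\nabla h_{1|j}(\theta_{12j}),\nabla h_{2|j}(\theta_{12j})\bigr)$. Through calculation [11], we have $\nabla h_{i|j}(\theta_{12j})=\nabla h_i(\theta_{12j})$, i = 1, 2. By (3.4) and (3.6) it is clear that

$$
\mathrm{Var}(T_i)=\bigl(\nabla h_i(\theta_{12j})\bigr)'\Sigma_{12j}\nabla h_i(\theta_{12j})=\bigl(\nabla h_i(\theta_{12j})\bigr)'\Sigma_{12j}\nabla h_{i|j}(\theta_{12j})=\mathrm{Cov}(T_i,T_{i|j}).
$$

So $T_i$ and $T_{i|j}$ have a correlation coefficient converging to 1. This explains why forward selection has power similar to that of marginal search for detecting either associated marker: if a genetic effect cannot stand out in a marginal scan, it is unlikely to show a strong signal in the following step either. Furthermore, $T_j$ and $T_{i|j}$ are asymptotically independent, i.e.

$$
\mathrm{Cov}(T_j,T_{i|j})\to0,\qquad i=1,2,\ j=3,\ldots,L.
$$

When comparing a model involving two incorrect markers j and k (3 ≤ j < k ≤ L) in (2.7) with a model for marker j in (2.6), by (3.5), the corresponding score test statistic $S_{k|j}$ has the asymptotic chi-square distribution

$$
S_{k|j}\xrightarrow{L}\chi^2_2. \tag{3.13}
$$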

3.5.2. Power under Discovery Number Control

In the forward search procedure, we first apply marginal search to find the most significant marker among the models in (2.6). Based on the selected marker, we then fit the models in (2.7) in the second step to find the markers that have a strong joint association. When controlling for R total discoveries, under power definition (A) for finding the whole associated model, we need to calculate the probability that the forward search chooses marker 1 or 2 in the first step, and then picks the genetic model in the second step. Define $i^*\equiv\arg\max_{i=1,2}\{S_i\}$ and $A_{i^*}\equiv\{S_{i^*3},\ldots,S_{i^*L}\}$. As L → ∞, the power can be written as

$$
\begin{aligned}
P\bigl(S_{i^*}\ge S_{(L-2)}\cap S_{12}>S_{A_{i^*},[R]}\bigr)
&=\int P\bigl(t_{i^*}^2>S_{(L-2)}\cap t_{12}^2>S_{A_{i^*},[R]}\bigr)\,dG(t_{12},t_1,t_2)\\
&\to\int P\bigl(t_{i^*}^2>S_{(L-2)}\bigr)\,P\bigl(t_{12}^2\ge S_{A_{i^*},[R]}\bigr)\,dG(t_{12},t_1,t_2),
\end{aligned}
$$

where $G(t_{12},t_1,t_2)$ is the CDF of $(T_{12},T_1,T_2)'$ given in (3.8), $S_{(L-2)}=\max_{j\ge3}\{S_j\}$, $A_{i^*}=\{t_{i^*}^2+S_{j|i^*},\ j=3,\ldots,L\}$ by the decomposition (3.10), and

$$
P\bigl(t_{i^*}^2>S_{(L-2)}\bigr)=\bigl(G_1(t_1^2\vee t_2^2)\bigr)^{L-2},
$$

$$
P\bigl(t_{12}^2\ge S_{A_{i^*},[R]}\bigr)=G_2(u)^r\sum_{l=0}^{L-2-r}\binom{r+l-1}{l}\bigl[1-G_2(u)\bigr]^l,
$$

where $u=t_{12}^2-t_{i^*}^2$, $r=L-2-R+1$, $G_1(\cdot)$ is the CDF of $\chi^2_1$ for the distribution of $S_j$, and $G_2(\cdot)$ is the CDF of $\chi^2_2$ for the distribution of $S_{j|i^*}$. Because $i^*$ is fixed for an observed value $(t_1,t_2)'$ of the random vector $(T_1,T_2)'$, it is easy to implement the power calculation with Monte Carlo integration.

Note that $S_{(L-2)}$ and $S_{A_{i^*},[R]}$ are asymptotically independent. This is because $\mathrm{corr}(S_j,S_{j|i^*})<1$ for each j ≥ 3. So as L → ∞,

$$
P\bigl(j^*\ne k^*:\ S_{j^*}=S_{(L-2)},\ S_{k^*|i^*}=S_{A_{i^*},[R]}\bigr)\to1.
$$

But when $j^*\ne k^*$, $S_{j^*}$ and $S_{k^*|i^*}$ are always independent.

When R and L are large, we can simplify the formula for $P\bigl(t_{12}^2\ge S_{A_{i^*},[R]}\bigr)$ by approximating the Rth largest variable in the set $\{S_{j|i^*},\ j=3,\ldots,L\}$ with $G_2^{-1}\Bigl(1-\dfrac{R-0.5}{L-2}\Bigr)$, where $G_2^{-1}$ is the quantile function of $S_{j|i^*}$. So we can approximately replace $P\bigl(t_{12}^2\ge S_{A_{i^*},[R]}\bigr)$ with $I\Bigl\{u>G_2^{-1}\Bigl(1-\dfrac{R-0.5}{L-2}\Bigr)\Bigr\}$ to calculate the integral.
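Because the chi-square quantile function with two degrees of freedom has the closed form −2 log(1 − p), the quantile cutoff above is simply −2 log((R − 0.5)/(L − 2)). The Python sketch below (function names and the values of L and R are ours, for illustration) compares the exact negative-binomial probability with the indicator approximation around that cutoff.

```python
import math


def exact_second_step_prob(u, L, R):
    """P(u >= R-th largest of L-2 iid chi2_2 variables), negative-binomial form."""
    g = 1.0 - math.exp(-u / 2) if u > 0 else 0.0   # G2(u)
    r = L - 2 - R + 1
    return g ** r * sum(
        math.comb(r + l - 1, l) * (1 - g) ** l for l in range(L - 2 - r + 1)
    )


def quantile_cutoff(L, R):
    """G2^{-1}(1 - (R - 0.5)/(L - 2)) for the chi-square with 2 df."""
    return -2.0 * math.log((R - 0.5) / (L - 2))


if __name__ == "__main__":
    L, R = 502, 10
    cut = quantile_cutoff(L, R)
    for shift in (-4.0, -1.0, 0.0, 1.0, 6.0):
        u = cut + shift
        print(round(u, 3), exact_second_step_prob(u, L, R), int(u > cut))
```

The exact probability rises from near 0 to near 1 as u crosses the cutoff, which is what the indicator replaces with a step; the transition becomes sharper, relative to the scale of u, as R and L grow.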

Under power definition (B) for finding either associated marker, the power of the forward search is the sum of $P_A$, the probability of detecting marker 1 or 2 in the first step, and $P_B$, the probability that step 1 fails but step 2 picks up at least one associated marker. When controlling for R total discovered models, it is straightforward that

$$
P_A = P\left(S_{i^*} > S_{(L-2)}\right) = \int \left(G_1(t_1^2 \vee t_2^2)\right)^{L-2} dG(t_1, t_2),
$$

where $G(t_1, t_2)$ is the CDF of the joint distribution of $(T_1, T_2)'$ given in (3.7). Define $j^* \equiv \arg\max_{k \ge 3}\{S_k\}$ and $A_{j^*} \equiv \{S_{k|j^*}, k \ge 3, k \ne j^*\}$. The second probability is

$$
P_B = P\left((S_1 \vee S_2) < S_{j^*} \cap (S_{1|j^*} \vee S_{2|j^*}) \ge S_{A_{j^*},[R]}\right).
$$

For any $k \ge 3$, $S_{i|k}$ and $S_k$ are independent, so $S_{i|j^*}$ and $S_{j^*}$ are independent. By the results in (3.12) and (3.13), the distribution of $S_{i|j^*}$ does not depend on $j^*$. So $S_{i|j^*}$ has the same distribution as $S_{i|j}$, $j = 3, \ldots, L$. Then

$$
P_B = \int P_{t_1 t_2} P_{t_{1|j} t_{2|j}} \, dG(t_1, t_2, t_{1|j}, t_{2|j}),
$$

where $G(t_1, t_2, t_{1|j}, t_{2|j})$ is the CDF of $(T_1, T_2, T_{1|j}, T_{2|j})'$ given in (3.12), and

$$
\begin{aligned}
P_{t_1 t_2} &= P\left(S_{(L-2)} > (t_1^2 \vee t_2^2)\right) = 1 - \left(G_1(t_1^2 \vee t_2^2)\right)^{L-2}, \\
P_{t_{1|j} t_{2|j}} &= P\left((t_{1|j}^2 \vee t_{2|j}^2) \ge S_{A_{j^*},[R]}\right) = G_2(t_{1|j}^2 \vee t_{2|j}^2)^r \sum_{l=0}^{L-3-r} \binom{r+l-1}{l} \left[1 - G_2(t_{1|j}^2 \vee t_{2|j}^2)\right]^l,
\end{aligned}
$$

with $r = L - 3 - R + 1$, where $G_2(\cdot)$ is the CDF of $S_{k|j}$, $k \ge 4$, given in (3.13). We can approximate $S_{A_{j^*},[R]}$ through the quantile function $G_2^{-1}\left(1 - \frac{R - 0.5}{L - 3}\right)$ to simplify the calculation of the integral.
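The first-step probability $P_A$ is an expectation over $(T_1, T_2)$, so it can be estimated by Monte Carlo integration. The sketch below is illustrative only: the joint distribution in (3.7) is not reproduced here, so the code assumes, purely for demonstration, a bivariate normal with hypothetical means `mu1`, `mu2`, unit variances, and correlation `rho`.

```python
import math
import random

def chi2_1_cdf(x):
    """CDF of a chi-square variable with 1 degree of freedom."""
    return math.erf(math.sqrt(x / 2.0)) if x > 0 else 0.0

def power_first_step(mu1, mu2, rho, L, n_draws=20000, seed=7):
    """Monte Carlo estimate of P_A = E[ G1(max(T1^2, T2^2))^(L-2) ].

    Assumption for illustration only: (T1, T2) is bivariate normal with
    means (mu1, mu2), unit variances and correlation rho. The paper's own
    joint distribution is the one given in its equation (3.7).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        t1 = mu1 + z1
        t2 = mu2 + rho * z1 + math.sqrt(1.0 - rho * rho) * z2
        total += chi2_1_cdf(max(t1 * t1, t2 * t2)) ** (L - 2)
    return total / n_draws

print(power_first_step(0.0, 0.0, 0.0, L=300))  # no signal
print(power_first_step(5.0, 5.0, 0.0, L=300))  # strong signal at both markers
```

With no signal and independent markers, the estimate should sit near 2/L, because each of the L markers is then equally likely to rank first. This gives a quick sanity check on the integrand.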

3.5.3. Power under Bonferroni Control

When we utilize the Bonferroni control in forward search, the first step selects the most significant single marker only if its test statistic is larger than the cutoff $c_1 = G_1^{-1}(1 - \alpha/L)$, where $G_1^{-1}$ is the quantile function of $\chi_1^2$. In the second step, the null distribution of the added terms is always $\chi_2^2$ no matter which marker is first selected. Let the cutoff be $c_2 = G_2^{-1}(1 - \alpha/(L-1))$, where $G_2^{-1}$ is the quantile function of $\chi_2^2$. For finding the true genetic model under power definition (A), the analytical power calculation is given by

$$
P\left(S_{i^*} > S_{(L-2)} \cap S_{i^*} > c_1 \cap S_{12} - S_{i^*} > c_2\right) = \int \left(G_1(t_1^2 \vee t_2^2)\right)^{L-2} I\left\{(t_1^2 \vee t_2^2) > c_1\right\} I\left\{t_{12}^2 - (t_1^2 \vee t_2^2) > c_2\right\} dG(t_{12}, t_1, t_2).
$$
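The two Bonferroni cutoffs and the indicator terms in the integrand are simple to compute. The following Python sketch uses only the standard library and hypothetical function names. It relies on two standard facts: a $\chi_1^2$ variable is the square of a standard normal, and a $\chi_2^2$ variable is exponential with mean 2.

```python
import math
from statistics import NormalDist

def chi2_1_upper_quantile(tail):
    """x such that P(chi2_1 > x) = tail, using chi2_1 = Z^2 for standard normal Z."""
    z = NormalDist().inv_cdf(tail / 2.0)  # lower-tail point, negative
    return z * z

def chi2_2_upper_quantile(tail):
    """x such that P(chi2_2 > x) = tail; chi2_2 is exponential with mean 2."""
    return -2.0 * math.log(tail)

def forward_search_cutoffs(alpha, L):
    """Bonferroni cutoffs c1 (step 1, L tests) and c2 (step 2, L - 1 tests)."""
    c1 = chi2_1_upper_quantile(alpha / L)
    c2 = chi2_2_upper_quantile(alpha / (L - 1))
    return c1, c2

def meets_definition_a(t1, t2, t12_sq, c1, c2):
    """Both indicator conditions inside the integral for power definition (A)."""
    best = max(t1 * t1, t2 * t2)
    return best > c1 and (t12_sq - best) > c2

c1, c2 = forward_search_cutoffs(alpha=0.05, L=300_000)
print(c1, c2)
```

Working with the upper-tail probability directly, rather than with 1 − α/L, avoids losing precision when α/L is very small.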

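As in the calculation under discovery number control, the power of forward search for finding either associated marker is $P_A + P_B$, where

$$
P_A = P\left(S_{i^*} > S_{(L-2)} \cap S_{i^*} > c_1\right) = \int \left(G_1(t_1^2 \vee t_2^2)\right)^{L-2} I\left\{(t_1^2 \vee t_2^2) > c_1\right\} dG(t_1, t_2),
$$

and

$$
P_B = P\left(S_{j^*} > S_{i^*} \cap S_{j^*} > c_1 \cap (S_{1|j^*} \vee S_{2|j^*}) \ge c_2\right) = \int P_{t_1 t_2 t_{1|j} t_{2|j}} \, dG(t_1, t_2, t_{1|j}, t_{2|j}),
$$

where

$$
P_{t_1 t_2 t_{1|j} t_{2|j}} = P\left(S_{j^*} > (t_1^2 \vee t_2^2) \cap (t_{1|j}^2 \vee t_{2|j}^2) \ge c_2 \cap S_{j^*} > c_1\right).
$$

As shown above, $T_i$ and $T_{i|j}$ have a large correlation coefficient converging to 1. Furthermore, $S_{j^*} \approx G_1^{-1}\left(1 - \frac{1 - 0.5}{L - 2}\right) < c_2$, so

$$
P\left(S_{j^*} > (t_1^2 \vee t_2^2) \cap (t_{1|j}^2 \vee t_{2|j}^2) \ge c_2\right) \to 0,
$$

thus $P_B \cong 0$.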
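The inequality behind $P_B \cong 0$ can be checked numerically. The sketch below takes L = 300,000 and α = 0.05 as example inputs and uses hypothetical function names. It compares the typical size of the largest null single-marker statistic with the second-step cutoff $c_2$.

```python
import math
from statistics import NormalDist

def typical_max_null_chi2_1(n_null):
    """Approximate largest of n_null chi2_1 draws: G1^{-1}(1 - 0.5/n_null)."""
    z = NormalDist().inv_cdf(0.5 / n_null / 2.0)  # two-sided tail split in half
    return z * z

def second_step_cutoff(alpha, L):
    """c2 = G2^{-1}(1 - alpha/(L - 1)) for the chi2_2 distribution."""
    return -2.0 * math.log(alpha / (L - 1))

L, alpha = 300_000, 0.05
largest_null = typical_max_null_chi2_1(L - 2)
c2 = second_step_cutoff(alpha, L)
print(largest_null, c2, largest_null < c2)
```

If step 1 fails, the associated markers' statistics lie below a value that is itself below $c_2$, and because $T_{i|j}$ tracks $T_i$ closely, the second step rarely rescues them.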

3.6. Marginal-Exhaustive Two-Stage Search

We also study the power of a marginal-exhaustive two-stage method [6]. In the first stage, for screening, it selects a set of single markers $I_1 \subset \{1, 2, \ldots, L\}$ with a liberal type I error-rate cutoff $c_1 = G_1^{-1}(1 - \alpha_1)$, where $G_1^{-1}$ is the quantile function of $\chi_1^2$. Then in the second stage, it applies the exhaustive search to the selected marker set. In the context of score test statistics, we adopt the approach by Marchini et al. [6] to define the statistic in the second stage. Specifically, the statistic is $S'_{lm} = S_{lm} - 2c_1$, where $S_{lm}$ is the score test statistic for the marker pair $(l, m) \subset I_1$. We would select markers l and m if $S'_{lm} \ge c_2 \equiv G_3^{-1}\left(1 - \alpha / \binom{\alpha_1 L}{2}\right)$, where $G_3^{-1}$ is the quantile function of $\chi_3^2$. To find the associated genetic model, the first stage has to find both associated markers marginally. So under definition (A), the power of the two-stage search is

$$
P\left(S_1 \wedge S_2 \ge c_1 \cap S'_{12} > c_2\right) = \int I\left((t_1^2 \wedge t_2^2) > c_1 \cap (t_{12}^2 - 2c_1) > c_2\right) dG(t_{12}, t_1, t_2).
$$
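The two cutoffs of the two-stage search can be computed with the standard library alone. The sketch below is an illustration under stated assumptions, not the authors' code. It assumes the second-stage Bonferroni divisor is the number of pairs among the $\alpha_1 L$ markers expected to pass the first stage under the null, and it uses a hypothetical screening level `alpha1 = 0.1`. The $\chi_3^2$ tail probability has the closed form $\mathrm{erfc}(\sqrt{x/2}) + \sqrt{2x/\pi}\, e^{-x/2}$, which is inverted here by bisection.

```python
import math
from statistics import NormalDist

def chi2_3_sf(x):
    """Upper-tail probability of a chi-square variable with 3 degrees of freedom."""
    return (math.erfc(math.sqrt(x / 2.0))
            + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0))

def chi2_3_upper_quantile(tail, hi=500.0):
    """Solve chi2_3_sf(x) = tail by bisection (the tail function is decreasing)."""
    lo = 0.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if chi2_3_sf(mid) > tail:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def two_stage_cutoffs(alpha, alpha1, L):
    """Stage-one cutoff c1, stage-two cutoff c2, and the assumed number of pairs.

    Assumption: the stage-two Bonferroni divisor is the number of pairs among
    the alpha1 * L markers expected to pass stage one under the null.
    """
    z = NormalDist().inv_cdf(alpha1 / 2.0)
    c1 = z * z
    m = alpha1 * L
    n_pairs = m * (m - 1.0) / 2.0
    c2 = chi2_3_upper_quantile(alpha / n_pairs)
    return c1, c2, n_pairs

def two_stage_detects(t1, t2, t12_sq, c1, c2):
    """Indicator from the power integral: both markers pass, then the pair passes."""
    return min(t1 * t1, t2 * t2) > c1 and (t12_sq - 2.0 * c1) > c2

c1, c2, n_pairs = two_stage_cutoffs(alpha=0.05, alpha1=0.1, L=300_000)
print(c1, c2, n_pairs, 300_000 * 299_999 // 2)
```

The last printed number is the pair count of a full exhaustive search over L markers. Comparing it with `n_pairs` shows how much the screening stage shrinks the second-stage search.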

4. Results

4.1. Comparison between Analytical and Simulation Results

In order to demonstrate the accuracy of our analytical power calculation, we compared the power values from calculations with those from simulations. For the feasibility of simulation, we considered L = 300 candidate markers with MAF $p_j$ = 0.3, j = 1, …, L, $n_1$ = 1000 cases, $n_0$ = 1000 controls, baseline effect α = 0.007 and genotypic effect θ = 0.3 for the genetic model in (2.2). These setups lead to a disease prevalence close to 0.01. Table 1 shows the power of marker search procedures under discovery number control. Table 2 shows the power under the Bonferroni corrected type I error rate control, with setups similar to those of Table 1 except that the sample size is $n_1 = n_0$ = 5000 and the genotypic effect is θ = 0.2 in (2.2). These parameters were chosen to give power values that are neither too small nor too large, but lie in a range of practical interest. We simulated 1000 data sets and ran the search procedures on each; the empirical power is the proportion of successful detections. In simulations, we used both the score test and the log-likelihood ratio test (LRT) for model comparisons. More comparisons between the simulated and the calculated power values, under various parameter setups, can be found in the supplement. The consistent closeness between the analytical and simulation results demonstrates that our power calculation methods perform well, and that the score test and the LRT have similar performance for model selection.
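When reading such comparisons, it helps to keep the Monte Carlo error of an empirical power in mind. The sketch below, with hypothetical function names, computes the binomial standard error of a proportion estimated from 1000 simulated data sets. The 2-standard-error band is a conventional rule of thumb and is not a criterion stated in the paper.

```python
import math

def monte_carlo_se(power, n_sims):
    """Binomial standard error of an empirical power from n_sims replicates."""
    return math.sqrt(power * (1.0 - power) / n_sims)

def within_noise(simulated, calculated, n_sims=1000, z=2.0):
    """True if the two values differ by no more than z standard errors."""
    return abs(simulated - calculated) <= z * monte_carlo_se(simulated, n_sims)

# Example: a simulated power of 0.51 against a calculated power of 0.50.
print(monte_carlo_se(0.51, 1000), within_noise(0.51, 0.50))
```

With 1000 replicates the standard error peaks at about 0.016 when the power is 0.5, so differences of one or two hundredths are of the same order as simulation noise.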

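Table 1.

Under the control of the discovery number R, the comparisons of the power values between simulations (based on score test and log-likelihood ratio test) and analytical calculations (based on score test). Power definitions (A) and (B) are considered. α = 0.007, θ = 0.3.

| Power definition | Strategy | Source | R = 1* | R = 5 | R = 10 | R = 15 | R = 20 | R = 30 |
|---|---|---|---|---|---|---|---|---|
| (A) | Marginal search | Score Simu. | 0.14 | 0.38 | 0.51 | 0.58 | 0.64 | 0.71 |
| (A) | Marginal search | LRT Simu. | 0.14 | 0.38 | 0.51 | 0.58 | 0.64 | 0.71 |
| (A) | Marginal search | Score Calcu. | 0.15 | 0.38 | 0.50 | 0.58 | 0.65 | 0.72 |
| (A) | Exhaustive search | Score Simu. | 0.32 | 0.50 | 0.57 | 0.61 | 0.65 | 0.70 |
| (A) | Exhaustive search | LRT Simu. | 0.33 | 0.52 | 0.61 | 0.64 | 0.69 | 0.72 |
| (A) | Exhaustive search | Score Calcu. | 0.32 | 0.50 | 0.59 | 0.63 | 0.65 | 0.69 |
| (A) | Forward search | Score Simu. | 0.28 | 0.42 | 0.48 | 0.50 | 0.52 | 0.55 |
| (A) | Forward search | LRT Simu. | 0.29 | 0.43 | 0.48 | 0.50 | 0.53 | 0.55 |
| (A) | Forward search | Score Calcu. | 0.28 | 0.44 | 0.48 | 0.51 | 0.52 | 0.54 |
| (B) | Marginal search | Score Simu. | 0.55 | 0.83 | 0.90 | 0.92 | 0.95 | 0.96 |
| (B) | Marginal search | LRT Simu. | 0.55 | 0.83 | 0.90 | 0.92 | 0.95 | 0.96 |
| (B) | Marginal search | Score Calcu. | 0.55 | 0.83 | 0.91 | 0.94 | 0.95 | 0.98 |
| (B) | Exhaustive search | Score Simu. | 0.57 | 0.80 | 0.87 | 0.91 | 0.93 | 0.95 |
| (B) | Exhaustive search | LRT Simu. | 0.59 | 0.81 | 0.87 | 0.81 | 0.93 | 0.95 |
| (B) | Exhaustive search | Score Calcu. | 0.62 | 0.77 | 0.84 | 0.88 | 0.90 | 0.93 |
| (B) | Forward search | Score Simu. | 0.64 | 0.76 | 0.83 | 0.87 | 0.91 | 0.94 |
| (B) | Forward search | LRT Simu. | 0.64 | 0.76 | 0.83 | 0.87 | 0.91 | 0.94 |
| (B) | Forward search | Score Calcu. | 0.64 | 0.74 | 0.82 | 0.88 | 0.90 | 0.94 |

(*R = 2 in marginal search under power definition (A)).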

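Table 2.

Under the Bonferroni corrected type I error with family-wise significance level α, the comparisons of the power values between simulations (based on score test and log-likelihood ratio test) and analytical calculations (based on score test). Power definitions (A) and (B) are considered. $n_1 = n_0$ = 5000, baseline effect α = 0.007, θ = 0.2.

| Power definition | Strategy | Source | α = 0.01 | α = 0.05 | α = 0.10 | α = 0.15 |
|---|---|---|---|---|---|---|
| (A) | Marginal search | Score Simu. | 0.20 | 0.37 | 0.46 | 0.51 |
| (A) | Marginal search | LRT Simu. | 0.20 | 0.37 | 0.46 | 0.51 |
| (A) | Marginal search | Score Calcu. | 0.19 | 0.34 | 0.42 | 0.48 |
| (A) | Exhaustive search | Score Simu. | 0.86 | 0.92 | 0.94 | 0.95 |
| (A) | Exhaustive search | LRT Simu. | 0.87 | 0.92 | 0.94 | 0.94 |
| (A) | Exhaustive search | Score Calcu. | 0.86 | 0.92 | 0.93 | 0.95 |
| (A) | Forward search | Score Simu. | 0.55 | 0.73 | 0.79 | 0.82 |
| (A) | Forward search | LRT Simu. | 0.56 | 0.73 | 0.80 | 0.83 |
| (A) | Forward search | Score Calcu. | 0.51 | 0.70 | 0.76 | 0.80 |
| (B) | Marginal search | Score Simu. | 0.68 | 0.83 | 0.88 | 0.91 |
| (B) | Marginal search | LRT Simu. | 0.68 | 0.83 | 0.88 | 0.91 |
| (B) | Marginal search | Score Calcu. | 0.66 | 0.82 | 0.87 | 0.89 |
| (B) | Exhaustive search | Score Simu. | 0.87 | 0.93 | 0.94 | 0.96 |
| (B) | Exhaustive search | LRT Simu. | 0.87 | 0.92 | 0.94 | 0.95 |
| (B) | Exhaustive search | Score Calcu. | 0.87 | 0.93 | 0.95 | 0.96 |
| (B) | Forward search | Score Simu. | 0.69 | 0.82 | 0.88 | 0.89 |
| (B) | Forward search | LRT Simu. | 0.69 | 0.82 | 0.88 | 0.89 |
| (B) | Forward search | Score Calcu. | 0.67 | 0.82 | 0.86 | 0.89 |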

4.2. Power Comparisons of Marker Search Methods

We applied the analytical power calculations to compare different marker search methods in a hypothetical GWAS that contains $n_1$ = 1000 cases, $n_0$ = 1000 controls, and L = 300,000 candidate markers with MAF $p_j$ = 0.3, j = 1, …, L. Assume the true genetic model is a logistic model of form (2.4) with the baseline effect a = log(0.007). Let the main effects $b_1 = b_2$ and the interaction effect $b_3$ each vary from −1 to 1 in steps of 0.1. Figures 1 and 2 show the 3-D plots of statistical power over a set of main and interaction effects under the discovery number control R = 20 and the Bonferroni corrected type I error rate α = 0.05, respectively. As demonstrated by the “trenches”, marginal search and forward search will unavoidably fail when disease susceptibility is controlled by interactions that show no marginally detectable signal. Exhaustive search, on the other hand, can avoid this problem and detect the full signal in two dimensions as long as the effect size is large enough.
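The "trenches" arise where main and interaction effects offset each other in the marginal signal. The sketch below illustrates this under assumptions that are not taken from the paper's equations: additive 0/1/2 genotype coding, two independent markers in Hardy-Weinberg equilibrium, and disease risk expit(a + b1·g1 + b2·g2 + b3·g1·g2). It computes the marginal log odds ratio at marker 1 and scans the $b_3$ grid for the value that leaves the weakest marginal signal.

```python
import math

def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))

def genotype_probs(maf):
    """Hardy-Weinberg probabilities for 0, 1, 2 copies of the minor allele."""
    q = 1.0 - maf
    return [q * q, 2.0 * maf * q, maf * maf]

def marginal_log_or(a, b1, b2, b3, maf=0.3):
    """Marginal log odds ratio at marker 1 (one copy vs. none).

    Assumptions for illustration: additive 0/1/2 genotype coding, two
    independent markers in Hardy-Weinberg equilibrium, and disease risk
    expit(a + b1*g1 + b2*g2 + b3*g1*g2).
    """
    probs = genotype_probs(maf)

    def risk_given_g1(g1):
        return sum(p * expit(a + b1 * g1 + b2 * g2 + b3 * g1 * g2)
                   for g2, p in enumerate(probs))

    def logit(p):
        return math.log(p / (1.0 - p))

    return logit(risk_given_g1(1)) - logit(risk_given_g1(0))

a = math.log(0.007)
grid = [round(-1.0 + 0.1 * k, 1) for k in range(21)]
# Interaction effect on the grid with the weakest marginal signal when b1 = b2 = 0.3.
weakest = min(grid, key=lambda b3: abs(marginal_log_or(a, 0.3, 0.3, b3)))
print(weakest, marginal_log_or(a, 0.3, 0.3, weakest))
```

Under these assumptions a positive main effect can be offset by a negative interaction, which leaves little for a single-marker test to detect even though the joint model is far from null.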

Figure 1.

3-D plots of statistical power under the discovery number control over genetic effect space. Three model selection methods: marginal search in the left column, exhaustive search in the middle column and forward search in the right column. Two definitions of power: (A) detecting the joint association (or both associated markers in marginal search) in row 1, and (B) detecting either associated marker in row 2. The genetic models follow a logistic model with the main effect $b_1 = b_2$ and the epistatic effect $b_3$ both varying from −1 to 1. The MAF $p_j$ = 0.3, j = 1, …, L. The total discovery number R is set to be 10.

Figure 2.

3-D plots of statistical power under the Bonferroni control over genetic effect space. Three model selection methods: marginal search in the left column, exhaustive search in the middle column and forward search in the right column. Two definitions of power: (A) detecting the joint association (or both associated markers in marginal search) in row 1, and (B) detecting either associated marker in row 2. The genetic models follow a logistic model with the main effect $b_1 = b_2$ and the epistatic effect $b_3$ both varying from −1 to 1. The MAF $p_j$ = 0.3, j = 1, …, L. The genome-wide significance level α is set to be 0.05.

In order to contrast one marker search method with another, we subtracted the power values of one method from those of another method. The differences of the power values are plotted in heat maps. The red areas in Figures 3 and 4 represent negative values, indicating that the first method in a compared pair has lower power; the green areas represent the opposite.

Figure 3.

Power differences with varying main and interaction effects with the discovery number R = 10. Row 1 illustrates power definition (A), row 2 illustrates power definition (B). Left column: marginal search vs. exhaustive search; middle column: marginal search vs. forward search; right column: forward search vs. exhaustive search. Green areas indicate positive values of difference, red areas indicate negative values. The main effect $b_1 = b_2$ and the epistatic effect $b_3$ both vary from −1 to 1; the allele frequency is $p_j$ = 0.3, j = 1, …, L.

Figure 4.

Power differences with varying main and interaction effects with the Bonferroni type I error rate α = 0.05. Power definition (A) is applied in row 1 and (B) is used in row 2. Left column: marginal search vs. exhaustive search; middle column: marginal search vs. forward search; right column: forward search vs. exhaustive search. Green areas indicate positive values of difference, red areas indicate negative values. The main effect $b_1 = b_2$ and the epistatic effect $b_3$ both vary from −1 to 1; the allele frequency is $p_j$ = 0.3, j = 1, …, L.

Figure 3 illustrates that the marginal search is better than the exhaustive search in the green areas, where $b_3$ is small, and $b_1$ and $b_2$ are moderate. Relative to marginal search, forward search benefits from modeling interactions in finding certain true epistatic models, as shown in the red areas in the upper middle panel. However, to find either associated marker, forward search is uniformly less powerful than marginal search through the whole genetic model space, as shown by the lower middle panel, which is entirely white or green. Compared with the forward search, the exhaustive search almost always performs similarly or better in finding the whole genetic model. With regard to the influence of the power types, power definition (B) favors marginal selection more than power definition (A), as evidenced by the enlarged green areas in the second row.

Type I error control with the Bonferroni correction leads to notably different patterns of power comparisons. By examining the left and right columns of Figure 3 together with those of Figure 4, we can see exhaustive search increases its advantage so as to uniformly beat both marginal search and forward search in finding the true genetic model over the whole genetic model space. The Bonferroni control also raises the performance of forward search over marginal search in finding either associated marker, which is demonstrated by comparing the middle column of Figure 3 with that of Figure 4. For both control criteria, when we relax the control level of R or α, there is a trend that marginal search becomes relatively more powerful than exhaustive and forward searches. This is shown in the supplement which contains more maps of comparisons under different genetic parameter set-ups.

4.3. Power Comparisons When Marginal Effects Are Fixed

As illustrated above, the interaction effect is crucial for the statistical power of marker selection. However, because there is usually a lack of knowledge of interaction effects from real studies, it is meaningful to compare the search methods when the marginal association, possibly revealed from an interaction effect, is fixed. Assume the three genetic models in (2.1) – (2.3) have the same marginal association, which is represented by the heterozygote odds ratio λ at each causative marker. When the values of λ and the population disease prevalence p(D) are fixed, we can calculate α and θ (letting $\theta_1 = \theta_2$ in model (2.1)). Furthermore, it is interesting to study the influence of LD when the true disease-causing loci are not observed but are linked with genotyped markers. We adopted the squared correlation coefficient $r^2$ to measure LD [8] while deriving the analytical power calculation to find those linked markers. The assumptions for the fixed marginal effect and the constraints of LD follow Marchini et al. [6]. Technical details are given in the supplement.

Assume λ = 1.5, p(D) = 0.01, $n_1 = n_0$ = 2000, the MAF $p_j$ = 0.05, 0.1, 0.2, and 0.5, the LD strength $r^2$ = 0.5, 0.7, and 1, and the LD constraint $p(A_i \mid A_{-i}) = 1$ and $p(A_i \mid a_{-i}) = q$, i = 1, 2, where $A_{-i}$ is the disease-causing allele at the unobserved locus indexed by −i, and $A_i$ is the disease allele at the genotyped locus of marker i, which is in LD with the causative locus −i. In general, power definition, genetic model, allele frequency, and sample size all influence the relative performance of search strategies. Under the discovery number control R = 5, Figure 5 shows the power comparisons for finding the joint association. Here, the marginal search is the best for detecting model (2.1). For detecting models (2.2) and (2.3), the exhaustive search is the best while the marginal selection becomes the worst. The forward search is similar to marginal search (for detecting model (2.1)) or better than it (for detecting models (2.2) and (2.3)). This is not surprising because model (2.1) is additive in the log scale of odds, whereas models (2.2) and (2.3) are interactive, which suits the strategies that exploit interaction effects. Figure 6 shows the power comparisons in finding either associated marker. Under this power definition, the marginal search interestingly outperforms the forward search for the interaction models (2.2) and (2.3). Furthermore, as the sample size increases, the exhaustive search gains power faster than the marginal search and the forward search for the interaction models (2.2) and (2.3) (see the supplement for the figures with a smaller sample size $n_1 = n_0$ = 1000). This indicates that the exhaustive search has a more stringent requirement for sample size, but it provides greater potential to detect small interactive effects when we do have enough observations.

Figure 5.

Power of finding the joint association with the discovery number R = 5. Green lines, marginal search; blue dashed lines, exhaustive search; red dot-dashed lines, forward search. The marginal odds ratio at both loci is 1.5, disease prevalence is 0.01, case and control numbers are both 2000. Columns of panels show genetic Models 1, 2, 3, respectively; rows show LD strength $r^2$ = 0.5, 0.7 and 1. The minor allele frequencies are 0.05, 0.1, 0.2 and 0.5 on the x-axis of each panel.

Figure 6.

Power of finding either associated marker with the discovery number R = 5. Green lines, marginal search; blue dashed lines, exhaustive search; red dot-dashed lines, forward search. The marginal odds ratio at both loci is 1.5, disease prevalence is 0.01, case and control numbers are both 2000. Columns of panels show genetic Models 1, 2, 3, respectively; rows show LD strength $r^2$ = 0.5, 0.7 and 1. The minor allele frequencies are 0.05, 0.1, 0.2 and 0.5 on the x-axis of each panel.

Under the Bonferroni corrected type I error rate control, Figure 7 (or 8) shows the power comparisons for finding the joint association model (or either associated marker). The genetic parameter set-up is the same as that in Figures 5 and 6. To find the joint association model, Figure 7 shows that the exhaustive search is uniformly the best for all models, over all allele frequencies. Compared with the forward search, the marginal search is the more favored method for model (2.1), but not for models (2.2) and (2.3). As shown in Figure 8, marginal search and forward search are very similar for finding either associated marker.

Figure 7.

Power of finding the joint association with the Bonferroni type I error rate α = 0.05. Green lines, marginal search; blue dashed lines, exhaustive search; red dot-dashed lines, forward search. The marginal odds ratio at both loci is 1.5, disease prevalence is 0.01, case and control numbers are both 2000. Columns of panels show genetic Models 1, 2, 3, respectively; rows show LD strength $r^2$ = 0.5, 0.7 and 1. The minor allele frequencies are 0.05, 0.1, 0.2 and 0.5 on the x-axis of each panel.

Figure 8.

Power of finding either associated marker with the Bonferroni type I error rate α = 0.05. Green lines, marginal search; blue dashed lines, exhaustive search; red dot-dashed lines, forward search. The marginal odds ratio at both loci is 1.5, disease prevalence is 0.01, case and control numbers are both 2000. Columns of panels show genetic Models 1, 2, 3, respectively; rows show LD strength $r^2$ = 0.5, 0.7 and 1. The minor allele frequencies are 0.05, 0.1, 0.2 and 0.5 on the x-axis of each panel.

We calculated the statistical power of a marginal-exhaustive two-stage method for finding the joint association. The first stage screens single markers at a liberal type I error control level $\alpha_1$. Then the second stage carries out the exhaustive search within the selected set of markers, at a Bonferroni corrected type I error level $\alpha / \binom{\alpha_1 L}{2}$. This method is appealing for its potential to reduce the computational burden of the exhaustive search while retaining high power. Figure 9 shows the comparisons among the power of the marginal search in finding either or both associated markers, the power of the exhaustive search in finding the joint association, and the power of the two-stage method in finding the joint association. The genetic parameters are the same as those for Figures 5–8. The two-stage method performs similarly to, or even slightly better than, the exhaustive search. However, this result is valid only under a moderate marginal association that guarantees a high probability of picking the associated markers in the screening stage. As shown in Figure 2, it is possible, at least in theory, that the marginal association from particular interactions may totally vanish. In this case, the two-stage method will certainly not be able to surpass the exhaustive search. Note that we needed only minutes of computational time to analytically calculate the power values for Figure 9, which reproduces comparison patterns similar to those shown by heavy simulations (Figure 2 in Marchini et al. [6]).

Figure 9.

Power of four marker search methods controlled by the Bonferroni type I error rate α = 0.05. Red dashed lines, marginal search for either association; green lines, marginal search for joint association; dark blue long-dashed lines, exhaustive search for joint association; light blue dot-dashed lines, two-stage search for joint association. The marginal odds ratio at both loci is 1.5, disease prevalence is 0.01, case and control numbers are both 2000. Columns of panels show genetic Models 1, 2, 3, respectively; rows show LD strength $r^2$ = 0.5, 0.7 and 1. The minor allele frequencies are 0.05, 0.1, 0.2 and 0.5 on the x-axis of each panel.

5. Discussion

In this article the analytical power calculation derived for marker search strategies offers valuable tools for GWAS data analysis. In practice, power study itself is important for designing experiments and choosing proper analysis techniques based on prior knowledge. However, the underlying genetic model, on which any power study has to be based, is often not known. Hence, it is important to evaluate power efficiently so that researchers could study a wide range of possibilities. Our analytical power calculation can significantly reduce computational burden of simulations, which researchers had to rely on before. Our R package will enable researchers to investigate the statistical power from their proposed sample sizes and compare the performance of different multi-marker analysis strategies to analyze GWAS data. It can be used to explore a large genetic model space with flexible assumptions of joint genetic effects with or without interactions.

It is generally believed that complex diseases are jointly influenced by multiple markers with potential interactions. Therefore, it is more intuitive to view the underlying genetic model as one multivariate model, instead of a group of single-marker models. Our assumptions for the underlying genetic models are more in line with this view by considering joint effects and interactions instead of the oversimplified single-marker models in the literature [4]. The genetic model studied here can be extended to more complex models with higher order interactions. Moreover, for the multivariate joint marker models, it is important to understand how the joint genetic signal is picked up by various statistical model fitting and comparison procedures. Our study of the distributions of, and the correlation structures among, the test statistics provides insights into these questions.

Comparing the results obtained here for the binary trait with those for the quantitative trait [12], the patterns of the relative performances of the marginal search, exhaustive search, and forward search have both common and distinct features for the two types of traits. Among the common features, the strength of the interaction effect is a key factor in the performance. Both marginal and forward searches fail when certain interactions show no marginal signal. In general, a strong interaction effect favors the exhaustive and the forward searches, which are carried out in a higher dimensional model space, in finding the joint genetic association.

For the distinction, the exhaustive search is relatively more powerful for binary traits than for quantitative traits, especially in finding the joint association. The forward search is relatively less powerful for binary traits than for quantitative traits, especially in finding at least one associated marker. The underlying reason revealed by the analytical study is the correlation between the test statistics of single-marker model fittings in (2.6) and the test statistics of the extra terms in (2.7) over (2.6). The correlation is stronger among score tests for binary traits than that among F test for quantitative traits. Thus compared with quantitative traits, in the exhaustive search for binary traits, finding one associated marker can better increase the chance to find the other associated marker. Nevertheless, in the forward search, if the selected marker in the first step is non-associated, it is much harder in the second step to discover the associated markers for binary traits than for quantitative traits.

Another difference is the symmetry of the influence of the genetic effects on the power. If the total genetic effect can be represented by $b_1 g_1 + b_2 g_2 + b_3 g_1 g_2$, the search power is the same for quantitative traits when the main effects $b_1 = b_2 = 0$ and the interaction effect $b_3 = \pm c$. This is because the underlying quantitative trait genetic model is linear [12], and the regression models have the same goodness-of-fit for signals with opposite directions but the same magnitude. However, this is no longer true for binary traits because total genetic effects with the same magnitude but opposite directions lead to different disease risks, as shown in (2.4). This is why the heat maps in Figures 3 and 4 are not symmetric.
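The asymmetry can be seen in a small numerical example. The sketch below is an illustration under assumptions not taken from the paper's equations: a logit link, additive 0/1/2 genotype coding, two independent markers in Hardy-Weinberg equilibrium with MAF 0.3, and no main effects. It compares how much population disease risk is added by $b_3 = +1$ with how much is removed by $b_3 = -1$.

```python
import math

def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))

def prevalence(a, b3, maf=0.3):
    """Population disease risk for a pure-interaction logistic model.

    Assumptions for illustration: b1 = b2 = 0, additive 0/1/2 genotype coding,
    and two independent markers in Hardy-Weinberg equilibrium.
    """
    q = 1.0 - maf
    probs = [q * q, 2.0 * maf * q, maf * maf]
    return sum(p1 * p2 * expit(a + b3 * g1 * g2)
               for g1, p1 in enumerate(probs)
               for g2, p2 in enumerate(probs))

a = math.log(0.007)
base = prevalence(a, 0.0)
up = prevalence(a, 1.0) - base      # risk added by b3 = +1
down = base - prevalence(a, -1.0)   # risk removed by b3 = -1
print(up, down)
```

Because the logistic curve is convex at low baseline risk, a positive interaction shifts risk by more than a negative interaction of the same size, so the two signals are not mirror images of each other.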

Statistical significance control is crucial to the statistical power of detecting genetic signals. We studied two different types of controls: the total discovery number R-control, which is related to false discovery number or false discovery proportion control, and the type I error rate α-control with the Bonferroni adjustment. Interestingly, these two types of controls give different marker search strategies distinct relative merits. First, the α-control is more stringent than the R-control, and the R-control leads to higher statistical power. So even if we do not have many “significant” results in the sense of the Bonferroni control, we can still include the top ranked genetic variations in the validation stage, as long as resources permit. Second, because GWAS are mostly used to screen for candidate genes, type I error rate control is not the primary aim. R-control is more commonly adopted in phased designs by researchers who want to control the number of markers for the follow-up validation study. Third, the Bonferroni corrected α-control gives the exhaustive search more power than the marginal search, especially for finding the joint associations. So we expect more joint associations to be found when applying the 2-dimensional exhaustive scan in real GWAS data analysis. Lastly, for finding either associated marker, R-control makes forward selection less favored than marginal search, whereas the α-control makes the two methods very similar.

The widely applied Bonferroni control procedure provides an intuitively simple rule for model selection. Nevertheless, it might not provide accurate type I error rate control for the whole model selection process. For example, in the exhaustive search, if the alternative is defined as the joint association [6][3][10][1], the models with partially associated markers should contribute to the null distribution. However, the Bonferroni correction procedure only applies $\chi_3^2$ as the null distribution, which ignores those “wrong” models. As for the forward selection, applying type I error rate control in each step does not necessarily lead to overall family-wise type I error rate control.

Through the analytical study, one can see that the non-associated markers have little influence on the power of marker search when the numbers of cases and controls are large. This is because the non-associated markers correspond to the null distributions. As shown in the Methods section, these null distributions are asymptotically chi-square with degrees of freedom 1, 2, or 3, no matter what the genetic parameters of the non-associated markers are. It does not affect the results much whether we assume homogeneous or random MAFs for these non-associated markers. The factors of the associated markers and the LD are more important. The R package markerSearchPower allows flexible assumptions for these genetic parameters.

The association signals from GWAS do not imply causality because they only report statistical association, as evidenced by the discrepancy of the genotype distributions between cases and controls. This study focuses on the ability of logistic model fitting to find the difference of distributions between cases and controls under the assumption of a homogeneous population.

Acknowledgments

We are grateful to Yale University Biomedical High Performance Computing Center and WPI Computing and Communications Center for computation support.

This work is supported in part by NIH grant GM 59507 and NSF grant DMS 0714817.

Contributor Information

Zheyang Wu, Email: zheyangwu@wpi.edu, Department of Mathematical Sciences, Worcester Polytechnic Institute, 100 Institute Road, Worcester, MA, 01609, USA, URL: http://users.wpi.edu/zheyangwu/.

Hongyu Zhao, Email: hongyu.zhao@yale.edu, Department of Epidemiology and Public Health, Yale University School of Medicine, 300 George Street, Suite 503, New Haven, CT 06520, USA, URL: http://bioinformatics.med.yale.edu.

References

  • 1. Brem RB, Storey JD, Whittle J, Kruglyak L. Genetic interactions between polymorphisms that affect gene expression in yeast. Nature. 2005;436:701. doi: 10.1038/nature03865.
  • 2. David HA. Order Statistics. New York: 1981.
  • 3. Evans DM, Marchini J, Morris AP, Cardon LR. Two-stage two-locus models in genome-wide association. PLoS Genet. 2006;2:e157. doi: 10.1371/journal.pgen.0020157.
  • 4. Gail MH, Pfeiffer RM, Wheeler W, Pee D. Probability of detecting disease-associated single nucleotide polymorphisms in case-control genome-wide association studies. Biostatistics. 2008;9:201. doi: 10.1093/biostatistics/kxm032.
  • 5. Kraft P, Hunter DJ. Genetic Risk Prediction – Are We There Yet? New England Journal of Medicine. 2009;360:1701. doi: 10.1056/NEJMp0810107.
  • 6. Marchini J, Donnelly P, Cardon LR. Genome-wide strategies for detecting multiple loci that influence complex diseases. Nature Genetics. 2005;37:413–417. doi: 10.1038/ng1537.
  • 7. McCarthy MI, Abecasis GR, Cardon LR, Goldstein DB, Little J, Ioannidis JPA, Hirschhorn JN. Genome-wide association studies for complex traits: consensus, uncertainty and challenges. Nature Reviews Genetics. 2008;9:356–369. doi: 10.1038/nrg2344.
  • 8. Pritchard JK, Przeworski M. Linkage disequilibrium in humans: models and data. The American Journal of Human Genetics. 2001;69:1–14. doi: 10.1086/321275.
  • 9. Scheffe H. The analysis of variance. Wiley-Interscience; 1999.
  • 10. Storey JD, Akey JM, Kruglyak L. Multiple locus linkage analysis of genomewide expression in yeast. PLoS Biol. 2005;3:e267. doi: 10.1371/journal.pbio.0030267.
  • 11. Wolfram S. The Mathematica Book. Cambridge University Press; 1999.
  • 12. Wu Z, Zhao H. Statistical Power of Model Selection Strategies for Genome-Wide Association Studies. PLoS Genet. 2009;5:e1000582. doi: 10.1371/journal.pgen.1000582.
  • 13. Zhang B. A score test under logistic regression models based on case-control data. Statistica Neerlandica. 2006;60:477–496.
