A likelihood ratio test for genomewide association under genetic heterogeneity

Meng Qian; Yongzhao Shao

doi:10.1111/ahg.12005

Author manuscript; available in PMC: 2014 Mar 1.

Published in final edited form as: Ann Hum Genet. 2013 Jan 31;77(2):174–182. doi: 10.1111/ahg.12005


PMCID: PMC3910100 NIHMSID: NIHMS423893 PMID: 23362943

Summary

Most existing association tests for genome-wide association studies (GWAS) fail to account for genetic heterogeneity. Zhou and Pan proposed a binomial mixture model based association test to account for possible genetic heterogeneity in case-control studies. The idea is elegant; however, the proposed test requires an EM-type iterative algorithm to identify the penalized maximum likelihood estimates and a permutation method to assess p-values. The computational burden induced by the EM algorithm and the permutations becomes prohibitive for direct application to GWAS. This paper develops a likelihood ratio test (LRT) for GWAS under genetic heterogeneity based on a more general alternative mixture model. In particular, a closed-form formula for the likelihood ratio test statistic is derived, which avoids EM-type iterative numerical evaluation. Moreover, an explicit asymptotic null distribution is obtained, which avoids the use of permutation to obtain p-values. Thus, the proposed LRT is easy to implement for GWAS. Furthermore, numerical studies demonstrate that the LRT has power advantages over the commonly used Armitage trend test and other existing association tests under genetic heterogeneity. A breast cancer GWAS data set is used to illustrate the newly proposed LRT.

Keywords: association test, binomial mixture model, complex disease, genetic heterogeneity, genomewide association study

Introduction

Common and complex diseases (or traits) are often genetically heterogeneous in etiologies (Lander & Schork, 1994; Zhou & Pan, 2009). Some well-known complex diseases with genetic heterogeneity include asthma, breast cancer (Hall et al., 1990; Wooster et al., 1994; Turnbull et al., 2010), and diabetes (Hattersley, 1998; Sladek et al. 2010). As in Zhou & Pan (2009), this paper considers the situation when a complex disease (or trait) is caused by mutations in multiple unlinked loci, commonly referred to as locus heterogeneity (Ott, 1999; Abreu et al., 2002; Fu et al., 2006). As a consequence of genetic heterogeneity, the population of individuals with disease may be decomposed into various latent sub-populations, each with disease caused by mutations at different loci (or their combinations). Most of the existing association tests for population-based case-control studies, e.g. GWAS, are based on comparing the mean genotype scores (e.g. the Armitage trend test) between the case and control groups, which are not efficient in the presence of genetic heterogeneity. Zhou & Pan (2009) showed that it can be beneficial to use methods that account for genetic heterogeneity for testing association in a case-control study.

Similar to admixture mapping in linkage analysis (Smith, 1963; Abreu et al., 2002; Fu et al., 2006), Zhou & Pan (2009) proposed a binomial mixture model to account for genetic heterogeneity and developed a modified likelihood ratio test (MLRT) for a single locus (Fu et al., 2006). They also consider two methods to combine single-locus-based MLRTs across multiple loci in linkage disequilibrium to boost power when causal SNPs are not genotyped (Zhou & Pan, 2009). They illustrated, with a wide spectrum of numerical examples, that the proposed MLRT tests are more powerful than some commonly used association tests under genetic heterogeneity. Following Zhou and Pan, we define the genetic score X as the number of the minor alleles at a single locus for a subject. Zhou and Pan (2009) assumed the genetic score X_H in a healthy control subject follows a binomial distribution, that is

P (X_{H} = g) = B_{2} (g, θ_{b}), g = 0, 1, 2, 0 < θ_{b} < 1,

(1)

where $B_{2} (g, θ_{b}) = \binom{2}{g} θ_{b}^{g} {(1 - θ_{b})}^{2 - g}$ and θ_b represents the minor allele frequency (MAF) at that specific locus in control subjects. On the other hand, under genetic heterogeneity, the genetic score for a diseased subject, X_D, follows a simple two-component binomial mixture distribution,

P (X_{D} = g) = α_{1} B_{2} (g, θ_{b}) + α_{2} B_{2} (g, θ), g = 0, 1, 2; 0 \leq α_{1}, α_{2} \leq 1, α_{1} + α_{2} = 1,

(2)

where θ represents the probability of having the minor allele on one chromosome for a subgroup of cases with disease associated with the minor allele. They adopt a two-step procedure for parameter estimation. First, a maximum likelihood estimate (MLE) of θ_b is obtained based only on the control sample. Then, fixing the estimated θ_b at its MLE derived from the control-group data, maximum penalized likelihood estimates of other parameters in the mixture model are obtained using an EM-type algorithm (Li et al., 2009). Subsequently, the penalized MLEs from the EM-step are plugged into a likelihood ratio to form a test statistic to detect the association between the marker genotypes and the disease status. Finally, they proposed a permutation procedure to obtain the p-value of the association test.

Zhou and Pan’s idea is applicable to an association study of a limited number of candidate markers; however, there are several challenges in applying their proposed method to genome-wide association studies (GWAS). First, the computation of their proposed MLRT for the vast number of SNPs in a typical GWAS would be very intensive. Because the penalized MLEs are obtained by an EM algorithm that maximizes the penalized mixture likelihood, the method inherits the known complexities and caveats of EM and other iterative methods for identifying MLEs and penalized MLEs in mixture models, including the challenge of selecting multiple starting points for parameter estimation. Moreover, the p-value of the MLRT is obtained by permutation, which is also difficult to apply directly to detect SNP-disease association in a GWAS with a vast number of SNPs, where the significance level is usually set to be less than 10⁻⁶. In addition, it is widely believed that complex diseases and traits are caused by the interplay of a large number of genetic loci and environmental risk factors. The two-component binomial mixture model in equation (2) may be too simple to capture the heterogeneity of many complex diseases. A more general form of the binomial mixture model can be written as follows

P_{η} (X_{D} = g) = \sum_{j = 1}^{J} α_{j} B_{2} (g, θ_{j}), J \geq 2, \sum_{j = 1}^{J} α_{j} = 1, α_{j} \geq 0,

(3)

where η = (η_j)_{j ≤ J}, η_j = (θ_j, α_j)^T, j = 1, …, J, and θ_i = θ_j if and only if i = j. In particular, for many complex diseases with genetic heterogeneity, J is likely to be quite large. Since the number of sub-populations J is hard to know under genetic heterogeneity, it is desirable to have a test that is applicable without knowledge of the exact value of J, while allowing J ≥ 2.
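As a minimal illustration of the mixture model in equation (3), the following Python sketch evaluates P_η(X_D = g) for an arbitrary number of components. The function names `binom2_pmf` and `mixture_pmf` and the three-component parameter values are hypothetical choices made for this example; they do not come from the paper.

```python
from math import comb


def binom2_pmf(g, theta):
    """B_2(g, theta): probability of g minor alleles (g = 0, 1, 2)."""
    return comb(2, g) * theta**g * (1.0 - theta) ** (2 - g)


def mixture_pmf(g, thetas, alphas):
    """P_eta(X_D = g) for the J-component binomial mixture in equation (3)."""
    if len(thetas) != len(alphas):
        raise ValueError("thetas and alphas must have the same length")
    if any(a < 0 for a in alphas) or abs(sum(alphas) - 1.0) > 1e-12:
        raise ValueError("alphas must be non-negative and sum to 1")
    return sum(a * binom2_pmf(g, t) for t, a in zip(thetas, alphas))


# Hypothetical three-component example (J = 3); the values are illustrative only.
thetas = [0.10, 0.25, 0.60]
alphas = [0.50, 0.35, 0.15]
probs = [mixture_pmf(g, thetas, alphas) for g in (0, 1, 2)]
print(probs, sum(probs))
```

With a single component (J = 1) the function reduces to the binomial distribution of equation (1).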

In this paper, we develop a likelihood ratio test (LRT) for genome-wide association studies (GWAS) based on the more flexible binomial mixture model in (3). Because complex diseases and traits are believed to arise from a large number of genetic loci and environmental risk factors, we assume that the genetic score in the case group, X_D, follows the general binomial mixture distribution in (3), which allows for a large and unknown J. The proposed LRT overcomes the above-mentioned challenges of using Zhou and Pan’s method to test association for the vast number of SNPs in a typical GWAS. In particular, we derive a closed-form formula for the likelihood ratio test statistic, even though the maximum likelihood estimates (MLEs) of the parameters in the binomial mixture model are non-regular with loss of identifiability (Liu & Shao, 2003). We further derive a simple closed-form asymptotic null distribution of the LRT, which avoids intensive numerical calculations such as EM-based iterations to identify MLEs and permutations to evaluate p-values. Additionally, the LRT can be implemented without knowing the number of components J in the mixture model (3). We conducted extensive simulation studies to show that the LRT has power advantages over the Armitage trend test (ATT) and some other association tests under genetic heterogeneity. We applied our test to a real dataset from a breast cancer GWAS to illustrate that it can achieve a much smaller p-value than some commonly used tests when there is evidence of genetic heterogeneity. Thus, the proposed LRT might be used to scan SNPs in GWAS to make novel discoveries by taking genetic heterogeneity into account.

Method

Notation and set-up

We focus on detecting marker-disease association at a single locus with two alleles A and a, such as a SNP in a case-control genome-wide association study (GWAS). Suppose m₊ controls and n₊ cases are sampled from the population. For each SNP, the genotype frequencies in a case-control study can be summarized in a 2 × 3 table, as in Table 1.

Let the genetic scores X_H and X_D denote the number of minor alleles, say a, at a single locus for a healthy control and a diseased case, respectively. It is clear that ΣX_H = 2m₂ + m₁ and ΣX_D = 2n₂ + n₁. Similar to Zhou and Pan’s set-up, we assume that under the null hypothesis H₀, both X_H and X_D have the same binomial distribution B₂(g, θ_b) as described in equation (1). As in Zhou & Pan (2009), X_H is also assumed to have a binomial distribution under the alternative hypothesis H₁. Under the alternative hypothesis of genetic heterogeneity, we assume X_D has a mixture distribution as described in equation (3). This last assumption merits further comment. On the one hand, J > 2 in equation (3) is possible under H₁ both in practice and in theory, so it is conceptually desirable to allow J > 2. On the other hand, for likelihood inference, J > 2 is not needed to achieve the maximum of the likelihood, because the model is already saturated with J = 2: a genotype distribution on {0, 1, 2} has only two free probabilities. In other words, for a given dataset, the LRT gives the same result whether model (3) is posed with J = 2 or with any J ≥ 2. In fact, as will be seen in the next section, the proposed likelihood ratio test has a “non-parametric” nature: it has a closed-form formula and a simple null distribution, so it remains valid for testing against any alternative model, including the common models and those under heterogeneity. In this paper we establish that the test is a likelihood ratio test under the specified set-up, motivated by the elegant work of Zhou & Pan (2009) and by the well-known optimality of the likelihood ratio test in terms of statistical power and efficiency.

Mixture binomial and maximum likelihood

Assuming the set-up in the previous subsection, under H₀, using the notation in Table 1 and denoting the true value of θ_b as P₀, the maximum likelihood estimate (MLE) of P₀ for the overall combined case-control data in Table 1 is

{p̂}_{0} = [Σ X_{D} + Σ X_{H}] / (2 n_{+} + 2 m_{+}) = [n_{2} + m_{2} + (n_{1} + m_{1}) / 2] / (n_{+} + m_{+}) .

(4)

Thus, the binomial likelihood function for the overall combined case-control data evaluated at p̂₀, L₀, is given by

L_{0} = \prod_{g = 0}^{2} B_{2} {(g, {p̂}_{0})}^{m_{g} + n_{g}},

(5)

where $B_{2} (g, {p̂}_{0}) = \binom{2}{g} {p̂}_{0}^{g} {(1 - {p̂}_{0})}^{2 - g}$ and p̂₀ is defined in (4). Following Zhou & Pan (2009), in the control group, the genetic score X_H is assumed to follow a binomial distribution under the alternative hypothesis, say

P (X_{H} = g) = B_{2} (g, P_{H}), g = 0, 1, 2 .

(6)

Using the notation in Table 1, the maximum likelihood estimate of P_H within the healthy control group only is given by

{p̂}_{H} = Σ X_{H} / (2 m_{+}) = (m_{2} + m_{1} / 2) / m_{+} .

(7)

The binomial likelihood function of the healthy controls data evaluated at p̂_H, L_H, is given by

L_{H} = \prod_{g = 0}^{2} B_{2} {(g, {p̂}_{H})}^{m_{g}}

(8)

Similarly, in the case group, if the genetic score X_D has the distribution B₂(g; P_D), the maximum likelihood estimate p̂_D of P_D within the diseased case group only would be

{p̂}_{D} = Σ X_{D} / (2 n_{+}) = (n_{2} + n_{1} / 2) / n_{+} .

(9)

However, as in Zhou & Pan (2009), we assume that under genetic heterogeneity, the cases can be divided into multiple latent sub-populations. Thus, under the alternative hypothesis of genetic heterogeneity, we assume X_D has a mixture distribution as described in equation (3). It can be shown that (see Appendix 1), using the above notation, the maximum of the mixture likelihood for X_D has an explicit formula:

L_{D} = \sup_{η} \prod_{g = 0}^{2} P_{η} {(X_{D} = g)}^{n_{g}} = \begin{cases} \prod_{g = 0}^{2} {(n_{g} / n_{+})}^{n_{g}} & \text{if } 4 n_{0} n_{2} > n_{1}^{2}; \\ \prod_{g = 0}^{2} B_{2} {(g; {p̂}_{D})}^{n_{g}} & \text{if } 4 n_{0} n_{2} \leq n_{1}^{2} . \end{cases}

(10)

The derivation of the above equation can be found in Appendix 1. It is also clear from the derivation in Appendix 1 that the mixture likelihood, as a function of the parameter vector η = (θ_j, α_j)_{j ≤ J} in the mixture model (3), can have many local maxima due to the lack of identifiability of the parameters (Liu & Shao, 2003). Nevertheless, the supremum L_D of the mixture likelihood for X_D has a single unique value for each dataset and can be obtained from the explicit formula in equation (10). In the typical case-control study design, L_D is independent of L_H.
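The closed form in equation (10) can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' code: the helper names `xlogy`, `log_binom2_lik` and `log_LD` are hypothetical, and the computation is done on the log scale to avoid numerical underflow for realistic sample sizes.

```python
from math import log


def xlogy(x, y):
    """Return x * log(y), with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * log(y)


def log_binom2_lik(counts, p):
    """Log of prod_g B_2(g, p)^(c_g) for genotype counts (c0, c1, c2)."""
    c0, c1, c2 = counts
    # B_2(0, p) = (1 - p)^2, B_2(1, p) = 2 p (1 - p), B_2(2, p) = p^2
    return xlogy(2 * c0 + c1, 1.0 - p) + xlogy(c1 + 2 * c2, p) + c1 * log(2.0)


def log_LD(n0, n1, n2):
    """log L_D from equation (10) for case genotype counts (n0, n1, n2)."""
    n_total = n0 + n1 + n2
    if 4 * n0 * n2 > n1 * n1:
        # First branch: the mixture reproduces the observed proportions n_g / n_+.
        return sum(xlogy(c, c / n_total) for c in (n0, n1, n2))
    # Second branch: the supremum is attained by a single binomial at p_hat_D.
    p_hat_d = (n2 + n1 / 2.0) / n_total
    return log_binom2_lik((n0, n1, n2), p_hat_d)


# Hypothetical counts, one for each branch of equation (10).
print(log_LD(50, 20, 30))   # 4 * 50 * 30 > 20^2, first branch
print(log_LD(25, 50, 25))   # 4 * 25 * 25 = 50^2, second branch
```

No iteration or starting values are involved, which is the practical point of the closed form.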

Table 1.

The genotype frequencies for case-control data of a SNP.

	AA	Aa	aa	Total
case	n₀	n₁	n₂	n₊
control	m₀	m₁	m₂	m₊


The likelihood ratio test

Using the maximum of the likelihood L₀, L_H and L_D in equations (5), (8) and (10), respectively, we can write down the explicit formula of the log likelihood ratio test statistic as follows

2 λ_{N} = 2 [log (L_{D} L_{H}) - log L_{0}] .

(11)

No iterative numerical maximization of the mixture likelihood function is needed for the evaluation of the LRT statistic in (11). Thus the LRT statistic is easy to compute even for GWAS. It is known that the LRT statistics for testing homogeneity in mixture models often have complicated asymptotic distributions that typically lack closed-form representations. However, the above statistic 2λ_N can be shown to have an explicit form of asymptotic distribution under the null hypothesis. More specifically, under H₀, as n₊ → ∞ and m₊ → ∞, we have

2 λ_{N} \to \frac{1}{2} χ_{1}^{2} + \frac{1}{2} χ_{2}^{2},

(12)

where $χ_{d}^{2}$ denotes a chi squared distribution with d degrees of freedom, d=1, 2. Although the above asymptotic null distribution can be derived from general results such as those in Chernoff & Lander (1995), Chiano & Yates (1995), or Liu & Shao (2003), an elementary and detailed direct derivation of the above asymptotic null distribution is given in Appendix 2 for readers who are interested in a direct derivation based on first principles.

It is worth pointing out that the extensive numerical simulations discussed in the next section indicate that the simple asymptotic null distribution in (12) approximates the exact finite-sample null distribution very well; the asymptotic formula is only slightly conservative. Therefore, the p-value of the likelihood ratio test can be read off directly from the closed-form asymptotic null distribution. For example, given any observed data as in Table 1, one can first evaluate 2λ_N in (11) and then obtain the p-value using the following simple command in the widely used R platform, where x denotes the computed value of 2λ_N:

(pchisq(x, 1, lower.tail = FALSE) + pchisq(x, 2, lower.tail = FALSE)) / 2
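For readers who prefer a self-contained script, the whole computation in equations (4) to (12) can be sketched in Python using only the standard library. This is a sketch under stated assumptions, not the authors' implementation: all function names and the example counts are hypothetical, and the chi-squared tail probabilities use the standard closed forms P(χ²₁ > x) = erfc(√(x/2)) and P(χ²₂ > x) = exp(−x/2) in place of R's pchisq.

```python
from math import erfc, exp, log, sqrt


def xlogy(x, y):
    """x * log(y), with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * log(y)


def log_binom2_lik(counts, p):
    """log prod_g B_2(g, p)^(c_g) for genotype counts (c0, c1, c2)."""
    c0, c1, c2 = counts
    return xlogy(2 * c0 + c1, 1.0 - p) + xlogy(c1 + 2 * c2, p) + c1 * log(2.0)


def maf(counts):
    """Allele-frequency MLE (c2 + c1/2) / c_+, as in equations (4), (7), (9)."""
    c0, c1, c2 = counts
    return (c2 + c1 / 2.0) / (c0 + c1 + c2)


def lrt_statistic(case, control):
    """2 * lambda_N from equation (11) for case and control genotype counts."""
    n0, n1, n2 = case
    pooled = tuple(n + m for n, m in zip(case, control))
    log_l0 = log_binom2_lik(pooled, maf(pooled))       # equation (5)
    log_lh = log_binom2_lik(control, maf(control))     # equation (8)
    if 4 * n0 * n2 > n1 * n1:                          # equation (10), first branch
        n_total = n0 + n1 + n2
        log_ld = sum(xlogy(c, c / n_total) for c in case)
    else:                                              # equation (10), second branch
        log_ld = log_binom2_lik(case, maf(case))
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, 2.0 * (log_ld + log_lh - log_l0))


def lrt_pvalue(stat):
    """Tail probability of the mixture 0.5 * chi2_1 + 0.5 * chi2_2 in (12)."""
    sf_chi2_1 = erfc(sqrt(stat / 2.0))   # P(chi2_1 > stat)
    sf_chi2_2 = exp(-stat / 2.0)         # P(chi2_2 > stat)
    return 0.5 * (sf_chi2_1 + sf_chi2_2)


# Hypothetical genotype counts in the order (AA, Aa, aa).
case = (520, 340, 140)
control = (640, 320, 40)
stat = lrt_statistic(case, control)
print(stat, lrt_pvalue(stat))
```

Because each SNP needs only a handful of logarithms, scanning a genome-wide panel amounts to one pass over the genotype count tables.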

Last but not least, it is well known that the likelihood ratio test generally has better power than other ad hoc tests. Thus it should not be a surprise to see that the LRT can be more powerful than other commonly used tests which ignore the genetic heterogeneity that exists for many common complex diseases such as breast cancer. Finally, to implement the LRT, there is no need to identify the exact number of mixture components J in (3), which is desirable because J is hard to determine in practice.

Numerical Results

Type I Errors

The LRT has an explicit asymptotic distribution under H₀. Consequently, it is convenient to evaluate the p-value and type I errors. We conducted comprehensive simulations to compare the empirical type I error of the LRT to the nominal significance level ranging from 10⁻² to 10⁻⁸. In the Monte Carlo simulations, the genotype data for both the control group and the case group were generated from the same binomial distribution B₂(g; θ_b), where θ_b takes some fixed value P₀, which represents the minor allele frequency (MAF). A number of simulation set-ups, which varied over a range of minor allele frequency and sample size were selected. The control and case sample sizes are set to be equal. The nominal significance levels were taken to be 10⁻², 10⁻³, 10⁻⁴, 10⁻⁵, 10⁻⁶, 10⁻⁷ and 10⁻⁸, respectively. For each set-up, 10¹¹ samples were generated. We found that the empirical type I error is slightly smaller than the nominal level, but they are extremely close to each other. Thus using the asymptotic null distribution for the LRT is valid. For illustration, an example with θ_b = 0.4 and sample size n₊ = m₊ =1000 is shown in Table 2.

Table 2.

Empirical type I error and nominal significance level at θ_b = 0.4 and n₊ = m₊ = 1000.

Nominal level	.01	.001	10⁻⁴	10⁻⁵	10⁻⁶	10⁻⁷	10⁻⁸
Empirical level	.0098	.00099	9.7 × 10⁻⁵	9.8 × 10⁻⁶	9.7 × 10⁻⁷	9.5 × 10⁻⁸	9.8 × 10⁻⁹
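The type I error check described above can be reproduced at a much smaller scale with the following Python sketch. It is an illustration only: the function names, the seed, the sample size (300 per group), the number of replicates (2000) and the nominal level (0.05) are hypothetical choices that keep the run time short, whereas the paper used far more replicates and far smaller levels.

```python
import random
from math import erfc, exp, log, sqrt


def xlogy(x, y):
    """x * log(y), with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * log(y)


def log_binom2_lik(counts, p):
    c0, c1, c2 = counts
    return xlogy(2 * c0 + c1, 1.0 - p) + xlogy(c1 + 2 * c2, p) + c1 * log(2.0)


def maf(counts):
    c0, c1, c2 = counts
    return (c2 + c1 / 2.0) / (c0 + c1 + c2)


def lrt_pvalue_from_counts(case, control):
    """LRT p-value from equations (10) to (12) for two genotype count triples."""
    n0, n1, n2 = case
    pooled = tuple(n + m for n, m in zip(case, control))
    if 4 * n0 * n2 > n1 * n1:
        log_ld = sum(xlogy(c, c / sum(case)) for c in case)
    else:
        log_ld = log_binom2_lik(case, maf(case))
    log_lh = log_binom2_lik(control, maf(control))
    log_l0 = log_binom2_lik(pooled, maf(pooled))
    stat = max(0.0, 2.0 * (log_ld + log_lh - log_l0))
    return 0.5 * (erfc(sqrt(stat / 2.0)) + exp(-stat / 2.0))


def sample_counts(rng, size, theta):
    """Draw `size` genotypes from B_2(., theta) and return counts (c0, c1, c2)."""
    counts = [0, 0, 0]
    for _ in range(size):
        g = (rng.random() < theta) + (rng.random() < theta)
        counts[g] += 1
    return tuple(counts)


def empirical_type1(theta_b, size, level, reps, seed=0):
    """Fraction of null replicates whose LRT p-value is at most `level`."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        case = sample_counts(rng, size, theta_b)
        control = sample_counts(rng, size, theta_b)
        if lrt_pvalue_from_counts(case, control) <= level:
            rejections += 1
    return rejections / reps


type1_rate = empirical_type1(theta_b=0.4, size=300, level=0.05, reps=2000)
print(type1_rate)
```

Estimating error rates near 10⁻⁸ in this way would require a compiled implementation and many orders of magnitude more replicates.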


Power Comparison

The significance level of the association test is usually set very small for genome-wide association studies (GWAS). For example, the genome-wide significance level of 5×10⁻⁸ is being increasingly used for arrays that contain one million SNPs. The most commonly used association tests for GWAS include Armitage’s trend test (ATT) and the $χ_{2}^{2}$ test, both applicable for testing association in a 2 × 3 table between the case-control status and the three genotypes, as illustrated in Table 1. Accordingly, we designed simulation studies to evaluate and compare the powers of the LRT, the ATT and the $χ_{2}^{2}$ test when the significance level is set to be 5×10⁻⁸. Note that, the MLRT of Zhou & Pan (2009) was not included for comparison due to its severe computational challenge when the significance level is very small. In the first set of Monte Carlo simulations, the control sample was generated from a binomial distribution B₂(g, θ_b); the case sample was generated from a two-component mixture binomial distribution as described in (3) with J = 2:

P_{η} (X_{D} = k) = \sum_{j = 1}^{2} α_{j} B_{2} (k, θ_{j}), 0 \leq α_{1}, α_{2} \leq 1, α_{1} + α_{2} = 1 .

For each of the eight simulation set-ups, 20000 replicate data sets of n₊ = m₊ = N cases and controls were simulated; the empirical power of each test is shown in Table 3. The simulation results indicate that the LRT has a power advantage over the Armitage trend test (ATT) and the $χ_{2}^{2}$ test under genetic heterogeneity.

Table 3.

Empirical power* when X_D has a mixture distribution with J = 2.

Set-up	1	2	3	4	5	6	7	8
θ_b†	0.1	0.2	0.25	0.3
θ₁	0.12	0.08	0.18	0.23	0.20	0.30	0.32	0.28
θ₂†	0.50	0.60	0.70
α₁†	0.90	0.85	0.80	0.9	0.8	0.9	0.8
α₂†	0.10	0.15	0.20	0.1	0.2	0.1	0.2
N†	1000	1500	1000	2000	1500
Power, LRT	0.717	0.840	0.987	0.833	0.997	0.829	0.673	0.918
Power, ATT	0.439	0.059	0.576	0.717	0.087	0.756	0.490	0.362
Power, $χ_{2}^{2}$	0.399	0.150	0.810	0.687	0.718	0.709	0.484	0.604

* Significance level is set at 5 × 10⁻⁸.
† Merged cells, listed in source order; each value spans one or more adjacent set-ups, and the spans are not recoverable from this copy.
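The kind of power comparison summarized above can be sketched on a small scale in Python. Everything specific in this sketch is an assumption made for illustration: the function names, the seed, the set-up (θ_b = 0.2, θ₁ = 0.1, θ₂ = 0.6, α₁ = 0.8, N = 1000), the relaxed level 0.001 and the 200 replicates are hypothetical and do not correspond to any set-up in the paper. The set-up is chosen so that cases and controls have the same mean genetic score, a situation in which a trend test is expected to have little power. The ATT is computed with the standard Cochran-Armitage formula with scores (0, 1, 2), which is not spelled out in this paper.

```python
import random
from math import erfc, exp, log, sqrt


def xlogy(x, y):
    return 0.0 if x == 0 else x * log(y)


def log_binom2_lik(counts, p):
    c0, c1, c2 = counts
    return xlogy(2 * c0 + c1, 1.0 - p) + xlogy(c1 + 2 * c2, p) + c1 * log(2.0)


def maf(counts):
    c0, c1, c2 = counts
    return (c2 + c1 / 2.0) / (c0 + c1 + c2)


def lrt_pvalue_from_counts(case, control):
    """LRT p-value from equations (10) to (12)."""
    n0, n1, n2 = case
    pooled = tuple(n + m for n, m in zip(case, control))
    if 4 * n0 * n2 > n1 * n1:
        log_ld = sum(xlogy(c, c / sum(case)) for c in case)
    else:
        log_ld = log_binom2_lik(case, maf(case))
    stat = 2.0 * (log_ld + log_binom2_lik(control, maf(control))
                  - log_binom2_lik(pooled, maf(pooled)))
    stat = max(0.0, stat)
    return 0.5 * (erfc(sqrt(stat / 2.0)) + exp(-stat / 2.0))


def att_statistic(case, control):
    """Cochran-Armitage trend statistic with genotype scores (0, 1, 2)."""
    r_total, s_total = sum(case), sum(control)
    n_all = r_total + s_total
    col = [a + b for a, b in zip(case, control)]
    score_case = case[1] + 2 * case[2]
    score_all = col[1] + 2 * col[2]
    score_sq_all = col[1] + 4 * col[2]
    denom = r_total * s_total * (n_all * score_sq_all - score_all**2)
    if denom == 0:
        return 0.0
    return n_all * (n_all * score_case - r_total * score_all) ** 2 / denom


def sample_counts(rng, size, thetas, alphas):
    """Genotype counts for `size` subjects drawn from a binomial mixture."""
    counts = [0, 0, 0]
    for _ in range(size):
        theta = rng.choices(thetas, weights=alphas)[0]
        counts[(rng.random() < theta) + (rng.random() < theta)] += 1
    return tuple(counts)


def empirical_power(reps=200, size=1000, level=0.001, seed=0):
    rng = random.Random(seed)
    lrt_hits = att_hits = 0
    for _ in range(reps):
        control = sample_counts(rng, size, [0.2], [1.0])
        case = sample_counts(rng, size, [0.1, 0.6], [0.8, 0.2])
        lrt_hits += lrt_pvalue_from_counts(case, control) <= level
        att_hits += erfc(sqrt(att_statistic(case, control) / 2.0)) <= level
    return lrt_hits / reps, att_hits / reps


lrt_power, att_power = empirical_power()
print(lrt_power, att_power)
```

In this hypothetical set-up the LRT draws its power from the excess of homozygotes among cases, which the first branch of equation (10) captures and the trend test ignores.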

Similar power advantages of the LRT over other tests are also observed when the alternative mixture model has three components (J = 3) as demonstrated in Table 4 where θ₃ for the cases is set as equal to θ_b for the control group.

Table 4.

Empirical power* when X_D has a mixture distribution with J = 3.

Set-up	1	2	3	4	5	6	7	8
θ_b = θ₃†	0.1	0.2	0.3
θ₁†	0.13	0.15	0.1	0.25	0.22	0.2	0.33
θ₂†	0.5	0.4	0.6	0.7
α₁†	0.35	0.4	0.3	0.2	0.3	0.35	0.4
α₂†	0.15	0.1	0.2	0.1	0.15	0.2	0.15
α₃†	0.5	0.6	0.5	0.4	0.45
N†	800	1200	800	1000	2000	1500
Power, LRT	0.883	0.903	0.843	0.829	0.846	0.937	0.864	0.852
Power, ATT	0.554	0.707	0.713	0.125	0.612	0.696	0.011	0.631
Power, $χ_{2}^{2}$	0.528	0.683	0.641	0.32	0.64	0.757	0.274	0.665

* Significance level is set at 5 × 10⁻⁸.
† Merged cells, listed in source order; each value spans one or more adjacent set-ups, and the spans are not recoverable from this copy.

Note that the Armitage trend test (ATT), also called Cochran-Armitage trend test (CATT) by many researchers, has good power only when the disease risk of the genotypes AA, Aa, aa is monotone increasing or decreasing under the alternative hypothesis (Armitage, 1955; Freidlin et al., 2002). Thus, ATT can have very low power when there is a violation of a linear trend in the disease risk across the ordered genotypes AA, Aa and aa, as in the case of both set-ups #3 and #5 in Table 3.

It is clear from the power simulations across the multiple simulation set-ups that the LRT can be much more powerful than the commonly used Armitage trend test (ATT) and the $χ_{2}^{2}$ test in GWAS in the presence of genetic heterogeneity.

A Breast Cancer GWAS

Breast cancer is the most common cancer among women. Many genes on different chromosomes that underlie breast cancer have been identified including many well-known studies conducted two decades ago (Hall et al., 1990; Wooster et al., 1994). Many more genetic variants underlying breast cancer are still being discovered nowadays, thus there is little doubt about the existence of genetic heterogeneity in the case of breast cancer. For illustration, we applied the newly proposed likelihood ratio test to a breast cancer GWAS dataset. In particular, Turnbull et al. (2010) conducted a genome-wide association study to identify breast cancer susceptibility alleles. They studied 582886 SNPs in 3659 breast cancer cases and 4897 controls in the first stage, and evaluated promising SNPs that were identified in Stage 1 in a second stage with 12576 cases and 12223 controls. In the paper they reported five new susceptibility SNPs with summary genotype data of the five SNPs made publicly available. A literature search indicates that four of the five SNPs (rs1011970, rs10995190, rs704010 and rs614367) have been independently confirmed by other studies since the publication of their GWAS results in 2010 (Lambrechts et al., 2012; Peng et al., 2011). We evaluated the p-values of the LRT, Armitage trend test (ATT) and $χ_{2}^{2}$ test for these four SNPs for comparison. The results are summarized in Table 5.

Table 5.

Comparison of P-values for the four SNPs reported in Turnbull et al. (2010).

SNP	Stage	LRT P-values	ATT P-values	χ² test P-values
rs10995190	Stage 1	7 × 10⁻⁹	5 × 10⁻⁹	3 × 10⁻⁸
	Stage 2	2 × 10⁻⁸	10⁻⁸	2 × 10⁻⁸
	Fisher P-value	5 × 10⁻¹⁵	2 × 10⁻¹⁵	2 × 10⁻¹⁴
rs614367	Stage 1	6 × 10⁻¹⁰	2 × 10⁻¹⁰	3 × 10⁻¹⁰
	Stage 2	6 × 10⁻¹⁵	10⁻⁸	6 × 10⁻⁸
	Fisher P-value	2 × 10⁻²²	10⁻¹⁶	7 × 10⁻¹⁶
rs704010	Stage 1	3 × 10⁻⁸	3 × 10⁻⁶	7 × 10⁻⁷
	Stage 2	7 × 10⁻⁴	3 × 10⁻⁴	4 × 10⁻⁴
	Fisher P-value	5 × 10⁻¹⁰	2 × 10⁻⁸	6 × 10⁻⁹
rs1011970	Stage 1	3 × 10⁻⁵	9 × 10⁻⁶	5 × 10⁻⁵
	Stage 2	7 × 10⁻⁵	2 × 10⁻⁴	4 × 10⁻⁴
	Fisher P-value	4 × 10⁻⁸	4 × 10⁻⁸	4 × 10⁻⁷


Note that for SNP rs10995190 and SNP rs614367, the p-values of the newly proposed LRT and the ATT are smaller than the genome-wide significance level 5 × 10⁻⁸ in each of the two stages. The performance of the LRT is comparable to or better than that of the other two tests. In particular, the LRT has an extremely small p-value of 6 × 10⁻¹⁵ for the Stage 2 data of SNP rs614367, showing statistical significance at even lower levels. It is thus not surprising that these SNPs have been independently replicated by other GWAS. For SNP rs704010 and SNP rs1011970, a simple combined p-value across the two stages, e.g. Fisher’s meta p-value, reaches the genome-wide significance level 5 × 10⁻⁸ for the LRT and the ATT; for the $χ_{2}^{2}$ test this holds for rs704010 but not for rs1011970 (4 × 10⁻⁷). The newly proposed LRT is also very competitive for these two SNPs. For example, for the Stage 1 data of SNP rs704010, only the p-value of the LRT is smaller than the genome-wide significance level 5 × 10⁻⁸. As an indication of the overall strength of the test, Fisher’s meta p-value of the LRT from the combined Stages 1 and 2 is the smallest of the three tests for rs614367 and rs704010, equals that of the ATT for rs1011970, and is within a small factor of that of the ATT for rs10995190. This example indicates the potential value of the proposed LRT for detecting association in GWAS of complex diseases, where genetic heterogeneity is always a possibility.
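The combined p-values in Table 5 are labelled as Fisher p-values. Assuming the standard form of Fisher's method for two independent p-values, the combination has a closed form, sketched below in Python; the function name `fisher_combined_two` is hypothetical. Under the null, −2(ln p₁ + ln p₂) follows a χ²₄ distribution, whose tail probability at x is exp(−x/2)(1 + x/2), so the combined p-value equals P(1 − ln P) with P = p₁p₂.

```python
from math import log


def fisher_combined_two(p1, p2):
    """Fisher's combined p-value for two independent p-values.

    Uses the closed-form chi-squared (4 df) tail: with P = p1 * p2,
    the combined p-value is P * (1 - ln P).
    """
    prod = p1 * p2
    return prod * (1.0 - log(prod))


# Rounded Stage 1 and Stage 2 LRT p-values for rs10995190 from Table 5.
print(fisher_combined_two(7e-9, 2e-8))
```

Applied to the rounded stage-wise entries, this reproduces the order of magnitude of the Fisher rows in Table 5; exact agreement is not expected because the inputs are rounded.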

Discussion

In the analysis of GWAS data, potential latent genetic heterogeneity has been largely ignored by researchers. Zhou & Pan (2009) first proposed mixture models to account for genetic heterogeneity. However, for the analysis of a vast number of SNPs in GWAS, the MLRT of Zhou and Pan has major computational challenges. In this paper, using a more general binomial mixture model, we have derived a likelihood ratio test for case-control association studies that improves the MLRT by Zhou and Pan on computational efficiency and multiple other aspects. In particular, the likelihood ratio test statistic has a simple closed-form formula, which could avoid intensive computation, such as the EM algorithm for penalized maximum likelihood estimates. Additionally, we have derived an explicit asymptotic null distribution for the proposed LRT, which is convenient to obtain p-values even at a small significance level. Moreover, to perform the LRT, there is no need to decide the exact number of mixture components, which is convenient in practice. Therefore, the new LRT has computational advantages over the MLRT proposed by Zhou and Pan and is suitable for scanning SNPs in GWAS data.

As demonstrated by our numerical studies, in the presence of genetic heterogeneity, the LRT can be much more powerful than either Armitage’s trend test or the $χ_{2}^{2}$ test, both of which are among the most widely used tests in GWAS. Given that most complex diseases are widely believed to be polygenic and to have environmental components, genetic heterogeneity is a hallmark of complex diseases. As illustrated using the GWAS data for breast cancer, the newly proposed LRT can be easily applied to any GWAS data, so researchers can use this simple algorithm to scan their SNPs as a cost-effective way to potentially make novel and important discoveries from data already collected in the large number of existing GWAS. Given that there are already about 1000 published GWAS, and many more genome-wide studies are being planned and conducted, the new LRT has the potential to become a useful test for scanning the SNPs in these GWAS, perhaps as a secondary analysis to account for genetic heterogeneity. Thus the new user-friendly LRT can potentially be used to increase the impact of existing and future genome-wide association studies.

Acknowledgements

This research is partially supported by the NIH Cancer Center Supporting Grant to NYU (2P30 CA16087), and the NIEHS Center Grant to NYU (5P30 ES00260), as well as a Stony Wold-Herbert Foundation grant to YS. The authors would like to thank the reviewers for insightful suggestions that led to improvement of the paper.

Appendix 1

Derivation of the test statistic of the LRT

To prove our proposed association test is indeed a LRT under the given set-up, we just need to establish equation (10), that is, when X_D follows the mixture distribution in (3),

L_{D} = \sup_{η} \prod_{g = 0}^{2} P_{η} {(X_{D} = g)}^{n_{g}} = \begin{cases} \prod_{g = 0}^{2} {(n_{g} / n_{+})}^{n_{g}} & \text{if } 4 n_{0} n_{2} > n_{1}^{2}; \\ \prod_{g = 0}^{2} B_{2} {(g; {p̂}_{D})}^{n_{g}} & \text{if } 4 n_{0} n_{2} \leq n_{1}^{2}; \end{cases}

where p̂_D is defined as in equation (9). First we want to show that when $4 n_{0} n_{2} > n_{1}^{2}$ , the maximum likelihood estimates η̂ of η in (3) satisfy

L_{D} = \prod_{g = 0}^{2} P_{η̂} {(X_{D} = g)}^{n_{g}} = \prod_{g = 0}^{2} {(n_{g} / n_{+})}^{n_{g}} .

(13)

A simple application of Jensen’s inequality yields that, for any η,

log \prod_{g = 0}^{2} P_{η} {(X_{D} = g)}^{n_{g}} \leq log \prod_{g = 0}^{2} {(n_{g} / n_{+})}^{n_{g}} .

(14)

The right-hand side of the above inequality is an upper bound which may not be achievable in general. However, when $4 n_{0} n_{2} > n_{1}^{2}$ , we can show that equality in (14) is achievable. In fact, when $4 n_{0} n_{2} > n_{1}^{2}$ , there are infinitely many values of the MLE η̂ that make (14) an equality. It is straightforward and elementary to verify that one set of solutions for the MLE is given as follows:

{θ̂}_{1} \in (\frac{2 n_{2}}{2 n_{2} + n_{1}}, 1), {θ̂}_{2} = \frac{(n_{1} + 2 n_{2}) {θ̂}_{1} - 2 n_{2}}{2 n_{+} {θ̂}_{1} - n_{1} - 2 n_{2}}, {α̂}_{1} = \frac{4 n_{0} n_{2} - n_{1}^{2}}{2 n_{+} (2 n_{+} {θ̂}_{1}^{2} - 2 (n_{1} + 2 n_{2}) {θ̂}_{1} + 2 n_{2})}, {α̂}_{2} = 1 - {α̂}_{1}, {α̂}_{i} = 0, {θ̂}_{i} = 1 / 2, for i \neq 1, 2 .
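The solution family above can be checked numerically. The following Python sketch, with hypothetical function names and hypothetical counts, computes θ̂₂ and α̂₁ for a chosen θ̂₁ and confirms that the resulting two-component mixture reproduces the observed proportions n_g / n₊, which is what makes the bound in (14) attainable.

```python
from math import comb


def binom2_pmf(g, theta):
    """B_2(g, theta) for g = 0, 1, 2."""
    return comb(2, g) * theta**g * (1.0 - theta) ** (2 - g)


def mixture_mle_solution(n0, n1, n2, theta1):
    """One member of the MLE family when 4 * n0 * n2 > n1^2.

    Returns (theta1, theta2, alpha1, alpha2) following the formulas above.
    """
    n_total = n0 + n1 + n2
    if not 4 * n0 * n2 > n1 * n1:
        raise ValueError("requires 4 * n0 * n2 > n1^2")
    lower = 2 * n2 / (2 * n2 + n1)
    if not lower < theta1 < 1:
        raise ValueError("theta1 must lie in (2*n2 / (2*n2 + n1), 1)")
    theta2 = ((n1 + 2 * n2) * theta1 - 2 * n2) / (2 * n_total * theta1 - n1 - 2 * n2)
    alpha1 = (4 * n0 * n2 - n1 * n1) / (
        2 * n_total * (2 * n_total * theta1**2 - 2 * (n1 + 2 * n2) * theta1 + 2 * n2)
    )
    return theta1, theta2, alpha1, 1.0 - alpha1


# Hypothetical case counts with 4 * 50 * 30 > 20^2.
n0, n1, n2 = 50, 20, 30
for theta1 in (0.8, 0.9):
    t1, t2, a1, a2 = mixture_mle_solution(n0, n1, n2, theta1)
    fitted = [a1 * binom2_pmf(g, t1) + a2 * binom2_pmf(g, t2) for g in (0, 1, 2)]
    print(theta1, t2, a1, fitted)
```

Different admissible values of θ̂₁ give different parameter vectors but the same fitted genotype probabilities, which illustrates the loss of identifiability.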

As indicated above, θ̂₁ can take any values in an interval, thus there are infinitely many sets of solutions for the MLE. Thus equation (13) is proved. Next we show that when $4 n_{0} n_{2} \leq n_{1}^{2}$ ,

L_{D} = sup_{η} \prod_{g = 0}^{2} P_{η} {(X_{D} = g)}^{n_{g}} = \prod_{g = 0}^{2} B_{2} {(g; {p̂}_{D})}^{n_{g}} .

(15)

First, we show that, for any fixed η,

\prod_{g = 0}^{2} P_{η} {(X_{D} = g)}^{n_{g}} \leq \prod_{g = 0}^{2} B_{2} {(g; {p̂}_{D})}^{n_{g}} .

Using the inequality log x ≤ x − 1, we get

\sum_{g = 0}^{2} n_{g} log \frac{P_{η} (X_{D} = g)}{B_{2} (g, {p̂}_{D})} \leq \sum_{g = 0}^{2} n_{g} (\frac{P_{η} (X_{D} = g)}{B_{2} (g, {p̂}_{D})} - 1)

It is straightforward to verify that

\sum_{g = 0}^{2} n_{g} (\frac{P_{η} (X_{D} = g)}{B_{2} (g, {p̂}_{D})} - 1) = \frac{n_{+} (4 n_{0} n_{2} - n_{1}^{2})}{{(2 n_{0} + n_{1})}^{2} {(2 n_{2} + n_{1})}^{2}} \sum_{j = 1}^{J} α_{j} {(2 n_{2} + n_{1} - 2 n_{+} θ_{j})}^{2} .
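One way to check this identity, offered as a sketch of the algebra rather than as the authors' own derivation, is to treat a single component with parameter θ first. The symbol f(θ) is introduced here only for this check. Using p̂_D = (2n₂ + n₁)/(2n₊) and 1 − p̂_D = (2n₀ + n₁)/(2n₊), the single-component sum is a quadratic in θ that vanishes together with its derivative at θ = p̂_D:

```latex
\begin{aligned}
f(\theta) &= \sum_{g=0}^{2} n_g \frac{B_2(g,\theta)}{B_2(g,\hat p_D)} - n_+
  = \frac{n_0 (1-\theta)^2}{(1-\hat p_D)^2}
  + \frac{n_1 \theta (1-\theta)}{\hat p_D (1-\hat p_D)}
  + \frac{n_2 \theta^2}{\hat p_D^2} - n_+ , \\
f(\hat p_D) &= 0 , \qquad
f'(\hat p_D) = \frac{(n_1 + 2 n_2) - 2 n_+ \hat p_D}{\hat p_D (1-\hat p_D)} = 0 , \\
f(\theta) &= \frac{4 n_0 n_2 - n_1^2}{4 n_+ \hat p_D^2 (1-\hat p_D)^2} (\theta - \hat p_D)^2
  = \frac{n_+ (4 n_0 n_2 - n_1^2)}{(2 n_0 + n_1)^2 (2 n_2 + n_1)^2}
    (2 n_2 + n_1 - 2 n_+ \theta)^2 .
\end{aligned}
```

Since the α_j sum to one, the left-hand side of the identity equals the weighted sum of f(θ_j) with weights α_j over the J components, which gives the stated right-hand side. Its sign is therefore the sign of 4n₀n₂ − n₁².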

Therefore, when $4 n_{0} n_{2} \leq n_{1}^{2}$ , and for any η

log \prod_{g = 0}^{2} {[P_{η} (X_{D} = g)]}^{n_{g}} \leq log \prod_{g = 0}^{2} B_{2} {(g; {p̂}_{D})}^{n_{g}} .

Finally, it is obvious that

L_{D} = sup_{η} \prod_{g = 0}^{2} P_{η} {(X_{D} = g)}^{n_{g}} \geq \prod_{g = 0}^{2} B_{2} {(g; {p̂}_{D})}^{n_{g}} .

This finishes the proof of (15), and thus also of (10).

Appendix 2

The asymptotic null distribution of the LRT

Under H₀, both X_D and X_H have the same binomial distribution B₂(g; θ_b). We denote the true null value of θ_b as P₀. We assume 0 < P₀ < 1 so that P₀(1 − P₀) = 0 never appears in a denominator. First, we may consider testing H₀ : B₂(g; P₀) against H₁ : B₂(g; θ_b), θ_b ∈ (0, 1), using only the healthy controls. This is a classic problem, and the likelihood ratio test statistic is well known to have an asymptotic $χ_{1}^{2}$ distribution. It is well known that, under H₀ : B₂(g; P₀), the LRT statistic can be written as

2 \sum_{g = 0}^{2} m_{g} log \frac{B_{2} (g; {p̂}_{H})}{B_{2} (g; P_{0})} = 2 m_{+} \frac{{({p̂}_{H} - P_{0})}^{2}}{P_{0} (1 - P_{0})} + o_{p} (1),

(16)

where p̂_H = (m₂ + m₁/2)/m₊ is the MLE of θ_b using only the healthy controls. Similarly, we may consider testing H₀ : B₂(g; P₀) against H₁ : B₂(g; θ_b), θ_b ∈ (0, 1), using only the diseased cases. Then, under H₀ : B₂(g; P₀), the LRT statistic has an asymptotic $χ_{1}^{2}$ distribution and can be written as

2 \sum_{g = 0}^{2} n_{g} log \frac{B_{2} (g; {p̂}_{D})}{B_{2} (g; P_{0})} = 2 n_{+} \frac{{({p̂}_{D} - P_{0})}^{2}}{P_{0} (1 - P_{0})} + o_{p} (1),

(17)

where p̂_D = (n₂ + n₁/2)/n₊ is the MLE of θ_b using only the diseased cases. Finally, consider testing H₀ : B₂(g; P₀) against H₁ : B₂(g; θ_b), θ_b ∈ (0, 1), using the combined sample of diseased cases and healthy controls. The MLE of θ_b from the combined sample is p̂₀ as defined in (4), and the LRT statistic can be written as

2 \sum_{g = 0}^{2} [m_{g} log \frac{B_{2} (g; {p̂}_{0})}{B_{2} (g; P_{0})} + n_{g} log \frac{B_{2} (g; {p̂}_{0})}{B_{2} (g; P_{0})}] = 2 (m_{+} + n_{+}) \frac{{({p̂}_{0} - P_{0})}^{2}}{P_{0} (1 - P_{0})} + o_{p} (1) .

(18)

From the above three equations, together with equations (5), (8) and (10), we have, when $4 n_{0} n_{2} \leq n_{1}^{2}$ ,

2 log \frac{L_{D} L_{H}}{L_{0}} = 2 m_{+} \frac{{({p̂}_{H} - P_{0})}^{2}}{P_{0} (1 - P_{0})} + 2 n_{+} \frac{{({p̂}_{D} - P_{0})}^{2}}{P_{0} (1 - P_{0})} - 2 (m_{+} + n_{+}) \frac{{({p̂}_{0} - P_{0})}^{2}}{P_{0} (1 - P_{0})} + o_{p} (1) .

(19)

Denote ρ = n₊/(m₊ + n₊). Then it is straightforward to verify that

{p̂}_{0} - P_{0} = ρ ({p̂}_{D} - P_{0}) + (1 - ρ) ({p̂}_{H} - P_{0})

and

2 log \frac{L_{D} L_{H}}{L_{0}} = (1 - ρ) \frac{2 n_{+} {({p̂}_{D} - P_{0})}^{2}}{P_{0} (1 - P_{0})} + ρ \frac{2 m_{+} {({p̂}_{H} - P_{0})}^{2}}{P_{0} (1 - P_{0})} - 2 \sqrt{ρ (1 - ρ)} \frac{\sqrt{2 n_{+}} ({p̂}_{D} - P_{0})}{\sqrt{P_{0} (1 - P_{0})}} \frac{\sqrt{2 m_{+}} ({p̂}_{H} - P_{0})}{\sqrt{P_{0} (1 - P_{0})}} + o_{p} (1) .
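The algebra behind this step is short enough to write out. The following worked derivation is a sketch added for the reader; the shorthand a, b and N is introduced here and is not part of the original notation.

```latex
% Shorthand (introduced here): a = \hat p_D - P_0,\; b = \hat p_H - P_0,\; N = m_+ + n_+,
% so that n_+ = \rho N,\; m_+ = (1-\rho) N,\; \hat p_0 - P_0 = \rho a + (1-\rho) b.
\begin{aligned}
m_+ b^2 + n_+ a^2 - N\bigl(\rho a + (1-\rho) b\bigr)^2
  &= N\Bigl[(1-\rho) b^2 + \rho a^2 - \rho^2 a^2 - (1-\rho)^2 b^2 - 2\rho(1-\rho)\,ab\Bigr] \\
  &= N\rho(1-\rho)\bigl(a^2 + b^2 - 2ab\bigr)
   = N\rho(1-\rho)\,(a-b)^2 .
\end{aligned}
% Since (1-\rho)\, n_+ = \rho\, m_+ = \sqrt{\rho(1-\rho)}\,\sqrt{n_+ m_+} = N\rho(1-\rho),
% multiplying by 2 / \{P_0 (1 - P_0)\} gives the three terms displayed above.
```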

Denote

Z_{H} = \frac{\sqrt{2 m_{+}} ({p̂}_{H} - P_{0})}{\sqrt{P_{0} (1 - P_{0})}}

and

Z_{D} = \frac{\sqrt{2 n_{+}} ({p̂}_{D} - P_{0})}{\sqrt{P_{0} (1 - P_{0})}} .

Then

2 log \frac{L_{D} L_{H}}{L_{0}} = {(\sqrt{1 - ρ} Z_{D} - \sqrt{ρ} Z_{H})}^{2} + o_{p} (1) .

(20)

Note that, under H₀, Z_H and Z_D are each asymptotically N(0, 1), and they are independent because they are computed from the controls and the cases separately. Thus, asymptotically,

\sqrt{1 - ρ} Z_{D} - \sqrt{ρ} Z_{H} ~ N (0, 1) .

Therefore, when $4 n_{0} n_{2} \leq n_{1}^{2}$ , we have, asymptotically, $2 λ_{N} = 2 (log L_{D} + log L_{H} - log L_{0}) ~ χ_{1}^{2}$ .

On the other hand, under H₀, when $4 n_{0} n_{2} > n_{1}^{2}$ , we can first consider testing goodness-of-fit of H₀ : B₂(g; P₀) using only the diseased cases. The likelihood ratio test statistic has a $χ_{2}^{2}$ asymptotic distribution and can be written as

2 \sum_{g = 0}^{2} n_{g} log \frac{n_{g} / n_{+}}{B_{2} (g; P_{0})} = 2 \sum_{g = 0}^{2} n_{g} log \frac{n_{g} / n_{+}}{B_{2} (g; {p̂}_{D})} + 2 \sum_{g = 0}^{2} n_{g} log \frac{B_{2} (g; {p̂}_{D})}{B_{2} (g; P_{0})} = 2 n_{+} \frac{{(n_{g} / n_{+} - {p̂}_{D})}^{2}}{{p̂}_{D} (1 - {p̂}_{D})} + 2 n_{+} \frac{{({p̂}_{D} - P_{0})}^{2}}{P_{0} (1 - P_{0})} + o_{p} (1) .

(21)

The first term on the right-hand side of the last equality is equivalent to Pearson's classic chi-square statistic (comparing observed to expected cell frequencies) for testing Hardy-Weinberg equilibrium, which is well known to have an asymptotic $χ_{1}^{2}$ distribution (Emigh 1980). Using the above equations, when $4 n_{0} n_{2} > n_{1}^{2}$ , we have

2 log \frac{L_{D} L_{H}}{L_{0}} = 2 m_{+} \frac{{({p̂}_{H} - P_{0})}^{2}}{P_{0} (1 - P_{0})} + 2 n_{+} \frac{{({p̂}_{D} - P_{0})}^{2}}{P_{0} (1 - P_{0})} - 2 (m_{+} + n_{+}) \frac{{({p̂}_{0} - P_{0})}^{2}}{P_{0} (1 - P_{0})} + 2 n_{+} \frac{{(n_{g} / n_{+} - {p̂}_{D})}^{2}}{{p̂}_{D} (1 - {p̂}_{D})} + o_{p} (1) .

(22)

By equations (19) and (20), the above equation becomes

2 log \frac{L_{D} L_{H}}{L_{0}} = {(\sqrt{1 - ρ} Z_{D} - \sqrt{ρ} Z_{H})}^{2} + 2 n_{+} \frac{{(n_{g} / n_{+} - {p̂}_{D})}^{2}}{{p̂}_{D} (1 - {p̂}_{D})} + o_{p} (1) .

(23)

Note that the two terms in the right-hand side of (21) are well known to be asymptotically independent which, in turn, implies asymptotic independence of the two terms at the right-hand side of (23). Therefore, when $4 n_{0} n_{2} > n_{1}^{2}$ , we have

2 λ_{N} = 2 log \frac{L_{D} L_{H}}{L_{0}} = 2 (log L_{D} + log L_{H} - log L_{0}) ~ χ_{2}^{2} .
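The two cases can be combined into a single closed-form computation of 2λ_N. The sketch below is an illustration only. It assumes, reading from (19) and (21)-(22), that L_H and L₀ are binomial likelihoods evaluated at p̂_H and p̂₀, and that L_D is the binomial likelihood at p̂_D when 4n₀n₂ ≤ n₁² and the unrestricted multinomial likelihood (cell probabilities n_g/n₊) otherwise. All function names are hypothetical.

```python
from math import comb, log

def binom2(g, theta):
    """B_2(g; theta) for genotype g in {0, 1, 2}."""
    return comb(2, g) * theta**g * (1 - theta)**(2 - g)

def loglik(counts, probs):
    """sum_g n_g * log(probs[g]); empty cells contribute 0."""
    return sum(n * log(p) for n, p in zip(counts, probs) if n > 0)

def allele_freq(counts):
    """(n2 + n1/2) / n_plus for genotype counts (n0, n1, n2)."""
    n0, n1, n2 = counts
    return (2 * n2 + n1) / (2 * (n0 + n1 + n2))

def lrt_statistic(cases, controls):
    """Return (2 * lambda_N, excess), where excess is True when 4*n0*n2 > n1^2 in cases."""
    n0, n1, n2 = cases
    n_plus = n0 + n1 + n2
    pooled = [c + h for c, h in zip(cases, controls)]

    def hw_probs(p):
        return [binom2(g, p) for g in range(3)]

    excess = 4 * n0 * n2 > n1**2
    if excess:
        # Assumed form of L_D in this case: saturated multinomial fit to the cases.
        log_ld = loglik(cases, [n / n_plus for n in cases])
    else:
        log_ld = loglik(cases, hw_probs(allele_freq(cases)))
    log_lh = loglik(controls, hw_probs(allele_freq(controls)))
    log_l0 = loglik(pooled, hw_probs(allele_freq(pooled)))
    return 2 * (log_ld + log_lh - log_l0), excess

# Hypothetical genotype counts (n0, n1, n2) for cases and (m0, m1, m2) for controls.
stat, excess = lrt_statistic((40, 30, 30), (30, 50, 20))
print(stat, excess)
```

No iteration is involved: each likelihood is evaluated once at an explicit estimate, which is what makes the statistic cheap to compute marker by marker.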

Finally, it suffices to show that $P (4 n_{0} n_{2} > n_{1}^{2} | H_{0}) \to 1 / 2$ as n₊ → ∞. Note that, under H₀, (n₀, n₁, n₂) follows a multinomial distribution with parameters (n₊; π₀, π₁, π₂), where π_g = P(X_D = g), for g = 0, 1, 2. Let U^T be the random vector $(\frac{n_{0}}{n_{+}}, \frac{n_{1}}{n_{+}}, \frac{n_{2}}{n_{+}})$ . Then we have (Bickel & Doksum, 2000)

E {(U)}^{T} = Π = (π_{0}, π_{1}, π_{2}), Var (U) = Σ / n_{+},

where

Σ = (\begin{matrix} π_{0} (1 - π_{0}) & - π_{0} π_{1} & - π_{0} π_{2} \\ - π_{0} π_{1} & π_{1} (1 - π_{1}) & - π_{1} π_{2} \\ - π_{0} π_{2} & - π_{1} π_{2} & π_{2} (1 - π_{2}) \end{matrix}) .

(24)

Let G(U) denote $\frac{4 n_{0} n_{2} - n_{1}^{2}}{{n_{+}}^{2}}$ and let G(Π) denote $4 π_{0} π_{2} - π_{1}^{2}$ . Under H₀,

π_{0} = {(1 - P_{0})}^{2}, π_{1} = 2 P_{0} (1 - P_{0}), π_{2} = P_{0}^{2},

so that

G (Π) = 4 π_{0} π_{2} - π_{1}^{2} = 0 .

By the central limit theorem and the multivariate delta method, G(U) has an asymptotic normal distribution with mean 0. That is

\sqrt{n_{+}} (G (U) - G (Π)) \to N (0, G' (Π) Σ G' {(Π)}^{T}) .

(25)
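The limiting variance can be made explicit, which also confirms that the limit is non-degenerate. The following worked computation is a sketch added for the reader; the vector a denotes the gradient of G evaluated at Π and is notation introduced here.

```latex
% Gradient of G(u) = 4 u_0 u_2 - u_1^2 at \Pi, written as a = (a_0, a_1, a_2):
a = \bigl(4\pi_2,\; -2\pi_1,\; 4\pi_0\bigr).
% For the multinomial covariance \Sigma in (24):
a\,\Sigma\,a^{T} = \sum_{g=0}^{2} \pi_g a_g^{2} - \Bigl(\sum_{g=0}^{2} \pi_g a_g\Bigr)^{2}.
% Under H_0, \pi_1^2 = 4\pi_0\pi_2, hence
\sum_{g} \pi_g a_g = 8\pi_0\pi_2 - 2\pi_1^{2} = 0,
\qquad
\sum_{g} \pi_g a_g^{2} = 16\pi_0\pi_2(\pi_0 + \pi_2) + 4\pi_1^{3}
  = 4\pi_1^{2}(\pi_0 + \pi_1 + \pi_2) = 4\pi_1^{2}.
% Therefore the limiting variance is
a\,\Sigma\,a^{T} = 4\pi_1^{2} = 16\,P_0^{2}(1 - P_0)^{2} > 0 \quad \text{for } 0 < P_0 < 1 .
```

Since the limit is a centred normal distribution with strictly positive variance, it assigns probability 1/2 to each side of zero.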

Thus, under H₀, as n₊ → ∞,

P (4 n_{0} n_{2} - n_{1}^{2} < 0) = P (G (U) < 0) \to 1 / 2 .
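This limit can also be examined by simulation. The sketch below is illustrative only: it draws genotype counts under Hardy-Weinberg proportions and estimates the probability that 4n₀n₂ > n₁². The function name, the seed and the chosen values of P₀, n₊ and the number of replicates are all hypothetical.

```python
import random

def prob_excess(p0, n_plus, reps, seed=0):
    """Monte Carlo estimate of P(4*n0*n2 > n1^2) when genotypes follow B_2(g; p0)."""
    rng = random.Random(seed)
    weights = [(1 - p0) ** 2, 2 * p0 * (1 - p0), p0**2]  # pi_0, pi_1, pi_2 under H0
    hits = 0
    for _ in range(reps):
        genotypes = rng.choices((0, 1, 2), weights=weights, k=n_plus)
        n0, n1, n2 = (genotypes.count(g) for g in (0, 1, 2))
        if 4 * n0 * n2 > n1**2:
            hits += 1
    return hits / reps

# Hypothetical setting: allele frequency 0.3, 500 cases, 2000 replicates.
print(prob_excess(0.3, 500, 2000))
```

For finite n₊ the estimate need not equal 1/2 exactly, since the result is asymptotic and ties 4n₀n₂ = n₁² have positive probability for integer counts.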

This finishes the proof of the following convergence in distribution, under H₀,

2 λ_{N} \to \frac{1}{2} χ_{1}^{2} + \frac{1}{2} χ_{2}^{2} .

(26)
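Given the limiting distribution in (26), an asymptotic p-value follows from the two chi-square survival functions, both of which have elementary forms. The following is a minimal sketch using only the Python standard library; the function name is hypothetical.

```python
from math import erfc, exp, sqrt

def mixture_pvalue(stat):
    """P(T > stat) when T follows the mixture 0.5 * chi2_1 + 0.5 * chi2_2."""
    if stat <= 0:
        return 1.0
    sf_chi2_1 = erfc(sqrt(stat / 2))  # chi-square survival function, 1 degree of freedom
    sf_chi2_2 = exp(-stat / 2)        # chi-square survival function, 2 degrees of freedom
    return 0.5 * sf_chi2_1 + 0.5 * sf_chi2_2

# The same observed value gives a larger p-value than under chi2_1 alone.
print(mixture_pvalue(3.841), erfc(sqrt(3.841 / 2)))
```

Because the p-value is available in closed form, no permutation is needed to calibrate the test.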

Footnotes

Neither author has any conflict of interest to declare.

References

Armitage P. Tests for linear trends in proportions and frequencies. Biometrics. 1955;11:375–386.
Abreu PC, Hodge SE, Greenberg DA. Quantification of type I error probabilities for heterogeneity lod scores. Genet Epidemiol. 2002;22:156–169. doi: 10.1002/gepi.0155.
Bickel PJ, Doksum KA. Mathematical Statistics: Basic Ideas and Selected Topics. Vol I. New Jersey: Prentice Hall; 2000.
Chernoff H, Lander E. Asymptotic distribution of the likelihood ratio test that a mixture of two binomials is a single binomial. J Statist Plann Inference. 1995;43:19–40.
Chiano MN, Yates JRW. Linkage detection under heterogeneity and the mixture problem. Ann Hum Genet. 1995;59:83–95. doi: 10.1111/j.1469-1809.1995.tb01607.x.
Emigh TH. A comparison of tests for Hardy-Weinberg equilibrium. Biometrics. 1980;36:627–642.
Freidlin B, Zheng G, Li Z, Gastwirth JL. Trend tests for case-control studies of genetic markers: power, sample size and robustness. Hum Hered. 2002;53:146–152. doi: 10.1159/000064976.
Fu Y, Chen J, Kalbfleisch JD. Testing for Homogeneity in Genetic Linkage Analysis. Stat Sinica. 2006;16:805–823.
Hall JM, Lee MK, Newman B, Morrow JE, Anderson LA, Huey B, King MC. Linkage of early-onset familial breast cancer to chromosome 17q21. Science. 1990;250:1684–1689. doi: 10.1126/science.2270482.
Hattersley AT. Maturity-onset diabetes of the young: clinical heterogeneity explained by genetic heterogeneity. Diabet Med. 1998;15(1):15–24. doi: 10.1002/(SICI)1096-9136(199801)15:1<15::AID-DIA562>3.0.CO;2-M.
Lander ES, Schork NJ. Genetic dissection of complex traits. Science. 1994;265:2037–2048. doi: 10.1126/science.8091226.
Lambrechts D, Truong T, Justenhoven C, Humphreys MK, Wang J, Hopper JL, Dite GS, Apicella C, Southey MC, Schmidt MK, et al. 11q13 is a susceptibility locus for hormone receptor positive breast cancer. Hum Mutat. 2012;33(7):1123–1132. doi: 10.1002/humu.22089.
Li P, Chen JH, Marriott P. Non-finite Fisher information and homogeneity: an EM approach. Biometrika. 2009;96:411–426.
Liu X, Shao Y. Asymptotics for likelihood ratio tests under loss of identifiability. Ann Stat. 2003;31:807–832.
Ott J. Analysis of Human Genetic Linkage. Third Edition. Baltimore: The Johns Hopkins University Press; 1999.
Peng S, Lü B, Ruan W, Zhu Y, Sheng H, Lai M. Genetic polymorphisms and breast cancer risk: evidence from meta-analyses, pooled analyses, and genome-wide association studies. Breast Cancer Res Treat. 2011;127(2):309–324. doi: 10.1007/s10549-011-1459-5.
Sladek R, Rocheleau G, Rung J, Dina C, Shen L, Serre D, Boutin P, Vincent D, Belisle A, Hadjadj S, Balkau B, Heude B, Charpentier G, Hudson TJ, Montpetit A, Pshezhetsky AV, Prentki M, Posner BI, Balding DJ, Meyre D, Polychronakos C, Froguel P. A genome-wide association study identifies novel risk loci for type 2 diabetes. Nature. 2010;445:881–885. doi: 10.1038/nature05616.
Smith CA. Testing for heterogeneity of recombination fraction values in Human Genetics. Ann Hum Genet. 1963;27:175–182. doi: 10.1111/j.1469-1809.1963.tb00210.x.
Turnbull C, Ahmed S, Morrison J, Pernet D, Renwick A, Maranian M, Seal S, Ghoussaini M, Hines S, Healey CS, Hughes D, Warren-Perry M, Tapper W, Eccles D, Evans DG, Hooning M, Schutte M, van den Ouweland A, Houlston R, Ross G, Langford C, Pharoah PD, Stratton MR, Dunning AM, Rahman N, Easton DF; Breast Cancer Susceptibility Collaboration (UK). Genome-wide association study identifies five new breast cancer susceptibility loci. Nat Genet. 2010;42:504–507. doi: 10.1038/ng.586.
Wooster R, Neuhausen SL, Mangion J, Quirk Y, Ford D, Collins N, Nguyen K, Seal S, Tran T, Averill D, Fields P, Marshall G, Narod S, Lenoir GM, Lynch H, Feunteun J, Devilee P, Cornelisse CJ, Menko FH, Daly PA, Ormiston W, McManus R, Pye C, Lewis CM, Cannon-Albright LA, Peto J, Ponder BAJ, Skolnick MH, Easton DF, Goldgar DE, Stratton MR. Localization of a breast cancer susceptibility gene, BRCA2, to chromosome 13q12-13. Science. 1994;265:2088–2090. doi: 10.1126/science.8091231.
Zhou H, Pan W. Binomial Mixture Model-based Association Tests under Genetic Heterogeneity. Ann Hum Genet. 2009;73:614–630. doi: 10.1111/j.1469-1809.2009.00542.x.

