Skip to main content
Wiley Open Access Collection logoLink to Wiley Open Access Collection
. 2021 May 4;78(2):487–498. doi: 10.1111/biom.13456

A generalized robust allele‐based genetic association test

Lin Zhang 1, Lei Sun 1,2,
PMCID: PMC9544499  PMID: 33729547

Abstract

The allele‐based association test, comparing allele frequency difference between case and control groups, is locally most powerful. However, application of the classical allelic test is limited in practice, because the method is sensitive to the Hardy–Weinberg equilibrium (HWE) assumption, not applicable to continuous traits, and not easy to account for covariate effect or sample correlation. To develop a generalized robust allelic test, we propose a new allele‐based regression model with individual allele as the response variable. We show that the score test statistic derived from this robust and unifying regression framework contains a correction factor that explicitly adjusts for potential departure from HWE and encompasses the classical allelic test as a special case. When the trait of interest is continuous, the corresponding allelic test evaluates a weighted difference between individual‐level allele frequency estimate and sample estimate where the weight is proportional to an individual's trait value, and the test remains valid under Y‐dependent sampling. Finally, the proposed allele‐based method can analyze multiple (continuous or binary) phenotypes simultaneously and multiallelic genetic markers, while accounting for covariate effect, sample correlation, and population heterogeneity. To support our analytical findings, we provide empirical evidence from both simulation and application studies.

Keywords: allele‐based association analysis, correlation, dependent sample, Hardy–Weinberg equilibrium, multiple phenotypes, multiple populations, robustness

1. INTRODUCTION

An association study aims to identify genetic markers that influence a heritable trait or phenotype of interest, while accounting for environmental effect. To formulate the problem more precisely, let single nucleotide polymorphisms (SNPs) be the genetic markers available for analysis. For each biallelic SNP, let a and A be the two possible alleles, and as in convention, let A denote the minor allele with population frequency p0.5. The SNP genotype consists of a paired (but unordered) alleles, aa, Aa, or AA. For a case‐control association study of a binary trait (Table 1), intuitively one can compare the estimates of p between the case and control groups. Indeed, the resulting allelic test is locally most powerful (Sasieni, 1997). However, the validity of the test hinges on the assumption of Hardy–Weinberg equilibrium (HWE), which assumes that genotype frequencies depend only on allele frequencies. That is, paa=(1p)2, pAa=2p(1p), and pAA=p2.

TABLE 1.

Notations for genotype and allele counts for a case‐control study

Genotype counts Allele counts
aa
Aa
AA
Total a A Total
Case r 0 r 1 r 2 r
2r0+r1
r1+2r2
2r
Control s 0 s 1 s 2 s
2s0+s1
s1+2s2
2s
naa
nAa
nAA
na
nA
Total n 0 n 1 n 2 n
2n0+n1
n1+2n2
2n
The HLA‐DQ3 example from Sasieni (1997)
Case 40 45 28 113 125 101 226
Control 273 100 43 416 646 186 832
Total 313 145 71 529 771 287 1058

The HLA‐DQ3 example is from Sasieni (1997), studying women with cervical intraepithelial neoplasia 3. The sample estimates of allele frequency of allele A in the case, control, and combined groups are denoted, respectively, as p^r=(2r2+r1)/2r=rA/2r,p^s=(2s2+s1)/2s=sA/2s, and p^=(2n2+n1)/2n=nA/2n.

To evaluate the HWE assumption in an independent sample, one typically applies the Pearson goodness‐of‐fit χ2 test, and under the H 0 of HWE,

THWE,Pearson=(n0n(1p^)2)2n(1p^)2+(n1n2p^(1p^))2n2p^(1p^)+(n2np^2)2np^2H0χ12. (1)

Using the HLA‐DQ3 data in Table 1 as an illustration, application of THWE,Pearson yields a test statistic of 49.7623 and a p‐value of 1.74×1012, suggesting departure from HWE.

Prior to conducting a case‐control genome‐wide association study (GWAS), testing HWE in the control sample and excluding SNPs in Hardy–Weinberg disequilibrium (HWD) is a routine part of data quality control; severe HWD is an indication of genotyping error. However, the need for a robust allelic association test remains. First, the p‐value threshold used for HWE‐based screening in practice is subjective and depends on sample size, often ranging between 10−8 and 10−12. For example, Bycroft et al. (2018) recommended 10−12 for each batch of ∼4000 samples. With this stringent criterion, it is possible that a SNP that passes the quality control step is truly in HWD at the population level, but not attributed to genotyping error. The corresponding allelic test is then biased, and we provide an application example in Section 5.2. Second, developing a robust and generalized allelic test is analytically important, so that this locally most powerful test (Sasieni, 1997) can be applied to, for example, a continuous trait or a sample of related individuals.

To robustify the classical allelic test against HWD, early work focused on improving variance estimation (Schaid and Jacobsen, 1999), but application is still limited to case‐control studies and in simple settings (e.g., independent samples). Thus, most genetic association studies rely on the Y‐on‐G regression approach, where G is coded additively as 0, 1, and 2 for aa, Aa, and AA. The resulting genotype‐based additive test is known to be robust to HWD, but the exact HWD‐correction mechanism is not clear. Further, data collection in practice typically first samples individuals based on Y, which can be random or Y‐dependent sampling (Derkach et al., 2015), and then collects G for the sampled individuals. Thus, it can be argued that G‐on‐Y is a more fitting statistical framework.

This “reverse” regression approach can also readily analyze multiple phenotypes simultaneously, which was the motivation for MultiPhen (O'Reilly et al., 2012). To deal with the three genotype groups, O'Reilly et al. (2012) used an ordinal logistic regression and stated that the proposed likelihood ratio test does not assume HWE. However, the statistical insight is lacking and analyzing pedigree data remains a challenge. Zhang et al. (2014) provided more background on using “reverse” regression to simultaneously analyze multiple traits and also proposed and compared a series of genotype‐based generalized estimating equation (GEE) association tests.

This work generalizes the locally most powerful allelic association test to more complex settings by developing a novel allele‐based “reverse” regression model. Section 2 revisits the classical allelic association test and provides insight on the need for a more flexible formulation of the test. Section 3 develops the new allele‐based “reverse” regression model by first appropriately partitioning the two alleles of a genotype and then specifying the individual allele as the response variable. In addition to the association parameter, the proposed regression includes a parameter that models the dependency between the two alleles of a genotype, explicitly accounting for potential departure from HWE. Section 4 considers more complex settings including related individuals from pedigree data, genetic markers with more than two alleles, and multiple phenotypes or populations. Given the theoretical results presented, simulation experiments in Section 5 are relatively brief with additional empirical evidence from two applications. Section 6 concludes with remarks and discussion.

2. THE CLASSICAL ALLELIC TEST REVISITED

For a given SNP and a binary phenotype of interest, let pr and ps denote the allele frequencies, respectively, for the case and control populations (Table 1). The classical allele‐based association test, testing H0:pr=ps, is based on

Tallelic=(p^rp^s)212r+12sp^(1p^)HWE,H0χ12. (2)

The validity of T allelic, however, requires the HWE assumption, because only under HWE, var^(p^rp^s)=(12r+12s)p^(1p^). Using the HLA‐DQ3 data in Table 1 as an example, because the HWE test is significant (Section 1), a direct application of the classical allelic association test in this case is not appropriate.

Indeed, Sasieni (1997) has pointed out that T allelic is valid and locally most powerful if and only if the HWE assumption holds and the genetic effect is additive. In the presence of HWD, T allelic is known to be anticonservative. But, we emphasize that this is true only if there is an excess of homozygotes AA, that is δ=pAAp2>0, where δ is a measure of HWD (Weir, 1996). If δ<0, T allelic is conservative as shown in Figure 1.

FIGURE 1.

FIGURE 1

The theoretical type 1 error rate of the classical allelic test, Tallelic. The nominal level is α=0.05, departure from HWE is measured by δ=pAAp2, where p is the frequency of the minor allele A and p2δp(1p). When p=0.5, 0.25δ0.25

To robustify T allelic against HWD, Schaid and Jacobsen (1999) used a multinomial distribution for the genotype counts to improve variance estimate. The resulting test is

Tallelic,Schaid=(p^rp^s)212r+12s(p^(1p^)+δ^)H0χ12, (3)

where δ^=(p^AAp^2) is the sample estimate of the HWD measure. Section 3 provides analytical insight about how THWE,Pearson in (1) is related to δ. Here, the effect of δ0 on the accuracy of T allelic is clear. In the HLA‐DQ3 example, δ^=0.061 and is highly significant, so the classical allelic test will be too optimistic, {Tallelic=44.8470}>{Tallelic,Schaid=34.3207}.

Tallelic,Schaid is effective in making T allelic, the classical allele‐based association test, robust to HWD, but its direct application is limited to the simplest setting of case‐control studies using independent observations without covariates. It is also not clear how to adapt this comparing‐two‐proportions analytical framework to, for example, analyzing a continuous trait. Thus, a new formulation of the allele‐based association test is needed.

3. A GENERALIZED ROBUST ALLELE‐BASED ASSOCIATION TEST

3.1. Decoupling the two alleles in a genotype

Consider a biallelic SNP with genotype G{aa,Aa,AA}, and for the moment assume that there are n independent observations, Gi,i=1,,n. We partition each Gi as follows:

(Gi1,Gi2)=(0,0)ifthegenotypeisaa(0,1)ifthegenotypeisAaandci=0(1,0)ifthegenotypeisAaandci=1(1,1)ifthegenotypeisAA, (4)

where ciiidBernoulli(1/2) if Gi=Aa. The partition of a genotype is straightforward for homozygotes aa or AA, but not so for a heterozygote Aa. Previous work attempted to split the nAa observations to (0,1) and (1,0) equally (Schaid et al., 2012), which reduces the variation inherent in a randomly selected allele (Web Appendix A). The use of a fair coin in our proposed approach ensures that iGi1Binomial(n,pAA+pAa/2) and similarly for iGi2, which is critical to a valid HWE test as well as association inference.

3.2. Reformulating the test of HWE as an allele‐based regression

A critical component of developing a robust allelic association test is the modeling of the HWE assumption. HWE assumes that the two alleles in a genotype are independent of each other, so a natural approach is to use a logistic regression, logit(E(Gi1))=log(pi1pi)=α+βGi2. Indeed, we show in Web Appendix B that the corresponding score test of testing H0:β=0 approximates THWE,Pearson; regressing Gi2 on Gi1 leads to the same conclusion. However, the differential treatment of Gi1 and Gi2 is not ideal. Further, the sample size of this model is n whereas there are 2n alleles from n genotypes. Thus, we consider an alternative regression model that “doubles” the sample size.

The proposed robust allele‐based (RA) regression model for HWE testing is

Gi1Gi2=α11+εi1εi2,whereεi1εi2iidN(0,σ21ρρ1). (5)

This RA model has two important features. First, instead of using the location parameter β to represent the dependence between two alleles, we reparameterize it as the correlation parameter ρ in the covariance matrix to capture HWD. This model reformulation is crucial to methodology development in Section 3.3 where the regression coefficient is reserved for association testing. Second, instead of a logistic regression, we use a Gaussian model even though the outcome is binary. This is because the primary goal here is hypothesis testing not parameter estimation or interpretation. For testing the location parameters in regression, Chen (1983) has shown that, under some regularity conditions, the score test statistics for regression models from the exponential family have identical form. In our setting, the Gaussian model enables explicit modeling of HWD through the correlation parameter ρ, and the resulting score test encompasses THWE,Pearson as a special case which we show below.

Based on the Gaussian model of (5), the score test statistic of testing H0:ρ=0 is

THWE,RA=(p^AAp^2)21np^2(1p^)2=δ^21np^2(1p^)2=nρ^2H0χ12, (6)

where δ^=p^AAp^2 and ρ^=δ^/(p^(1p^)), a scaled estimate of HWD, δ.

We first note that the newly developed HWE test statistic is, attractively, proportional to δ^=p^AAp^2. Interestingly, after some algebraic manipulations, we show in Web Appendix C that THWE,RA in (6) is identical to THWE,Pearson in (1). This equivalence, however, is under the simplest scenario of an independent sample. For more complex data, several authors have proposed different HWE testing strategies, for example, Troendle and Yu (1994) developed a method that tests HWE across strata, while Bourgain et al. (2004) proposed a quasi‐likelihood method that tests HWE in related individuals. Section 4 shows how the proposed regression (5) can be extended to derive a generalized HWE test. For the moment, we still consider an independent sample but turn our attention to association analysis.

3.3. The generalized robust allele‐based association test via regression

Let Gi1 and Gi2 be the two allele‐based random variables, as constructed in (4) for a biallelic SNP in an independent sample of size n, i=1,,n. We now also consider Y, a (categorical or continuous) phenotype of interest, and Z, one single covariate for notation simplicity but without loss of generality. The proposed RA regression for association analysis is

Gi1Gi2=(α+βYi+γZi)11+εi1εi2,whereεi1εi2iidN(0,σ21ρρ1). (7)

The corresponding score test, testing H0:β=0, is

TRA=i=1nj=12(gijp^γ^(ziz¯))yi221{i(yiy¯)(ziz¯)}2i(yiy¯)2i(ziz¯)2i(yiy¯)2(1+ρ^)σ^2H0χ12, (8)

where

p^=i(gi1+gi2)/2n,y¯=iyi/n,z¯=izi/n,γ^=i(gi1+gi22p^)(ziz¯)/2i(ziz¯)2,σ^2=ij(gijp^γ^(ziz¯))2/2n,

and

ρ^=i(gi1p^γ^(ziz¯))(gi2p^γ^(ziz¯))/nσ^2.

We provide the general form of T RA with multiple Ys and multiple Zs in Web Appendix D.

The proposed T RA unifies previous methods. For example, if Y is binary and there are no covariates, T RA in (8) is simplified to

TRA,binary,γ=0=(p^rp^s)212r+12s(p^(1p^)+δ^),

which is Tallelic,Schaid in (3). If we further assume HWE, T RA is reduced to

TRA,binary,γ=0,ρ=0=(p^rp^s)212r+12sp^(1p^),

which is the classical allelic test, T allelic in (2).

The proposed RA test also generalizes the allelic association test. To provide analytical insight, consider a continuous trait without covariate effect. In that case,

TRA,γ=0={i((gi1+gi2)/2p^)yi}212i(yiy¯)2(p^(1p^)+δ^).

Thus, for a continuous trait, the generalized allelic association test evaluates a weighted difference between the individual‐level allele frequency estimate, (gi1+gi2)/2, and the whole sample estimate, p^, where the weight is an individual's trait value, yi. In hindsight, this result may not be surprising. However, the advantages of developing the proposed RA regression model become evident when we analyze more complex data such as data with population heterogeneity and related individuals, which we investigate in the next section.

4. COMPLEX DATA

4.1. Multiple populations, alleles, or phenotypes

The proposed RA regression model of (7) can naturally adjust for population effects by including population indicators, or top principal components, as part of the covariates. Here we emphasize that the potential population effects could include both difference in allele frequency and difference in HWD between populations. The RA framework models allele frequency heterogeneity through γ and σ2, while accounting for HWD through ρ.

Without loss of generality, it is instructive to consider the simple case of a case‐control study with two populations but without additional covariates. Denote Zi=0 for population I and Zi=1 for population II. The corresponding RA regression model is

Gi1Gi2=(α+βYi+γZi)11+εi1εi2,whereεi1εi2N(0,σi21ρiρi1), (9)

ρi=ρI and σi2=(σI)2 if Zi=0; ρi=ρII and σi2=(σII)2 if Zi=1. Using superscripts I and II for all the other notations introduced so far, the generalized RA test of H0:β=0 while accounting for population heterogeneity has the following expression:

TRA,binary,2pop=(w1(p^rIp^sI)+w2(p^rIIp^sII))2w1+w2,

where w1=2rIsInI/(p^I(1p^I)+δ^I), and δ^I=p^AAI(p^I)2 captures the population‐specific HWD in population I; the analytical expressions are the same for w 2 and δ^II.

If evaluating HWE across multiple populations is the primary objective, we can test H0:ρI=ρII=0 and show that THWE,RA,2pop=THWE,RA,popI+THWE,RA,popIIH0χ22, where the expressions for THWE,RA,popI and THWE,RA,popII are given in (6). We note again the unifying feature of the proposed RA framework. For example, the test of Troendle and Yu (1994) developed specifically for testing HWE across strata has identical form as THWE,RA,2pop.

The RA model of (7) can also be extended to derive a generalized allelic association test for multiallelic markers with K>2 alleles, with adjustments for covariate effect and HWD. In that case, we need to introduce two indicator vectors, gi1 and gi2, each of length K1, to partition the genotype of individual i; we leave the technical details to Web Appendix E and Web Table 1. Given gi1 and gi2, the implementation of the RA regression model of (7) is straightforward. The RA model can also be used to derive the regression‐based HWE test of a multiallelic marker as shown in Web Appendix F.

To analyze multiple J phenotypes simultaneously, we can simply include multiple Yj1 vectors in the RA model of (7), each representing one phenotype, and then test H0:βj=0,j{1,2,,J}. The corresponding score test statistic is χJ2 distributed under the null of no association. Here we reiterate that the proposed “reverse” regression is allele based, conceptually distinct from the genotype‐based MultiPhen (O'Reilly et al., 2012) which uses an ordinal logistic regression for the three genotype groups.

4.2. Related individuals

We now consider a sample of n correlated individuals with known or accurately estimated pedigree structure (Dimitromanolakis et al., 2019). For notation simplicity but without loss of generality, we present results for analyzing a biallelic SNP and one phenotype in a homogeneous population. Let g be a 2n×1 vector of allele indicators for the n genotypes available, where g=(g1,g2,,gn) and gi=(Gi1,Gi2) for i{1,,n}, following the allele‐partition method in (4), and let y=(y1,y2,,yn), yi=(Yi,Yi), z=(z1,z2,,zn), and zi=(Zi,Zi). The generalized RA regression model for a dependent sample is,

g=α1+βy+γz+ε,whereεN(0,σ2Σ), (10)

1 is a 2n×1 vector of 1s, and Σ is a 2n×2n matrix that captures the genetic correlation between individuals as well as departure from HWE in founders. Founders are individuals who have no ancestors or siblings included in the sample, and their offspring genotypes are in HWE assuming random mating (Web Appendix G).

The specification of Σ is nontrivial, where for any two individuals i and j, Σ2(i1)+l,2(j1)+l not only measures the genetic correlation between individual is lth allele and individual js lth allele, landl{1,2}, but also accounts for potential HWD. We note that if i=j and l=l, Σ2(i1)+l,2(j1)+l=1. If i=j and ll, Σ2(i1)+l,2(j1)+l=0 for a nonfounder and =ρ for a founder, where ρ models HWD. Finally, if ij, Σ2(i1)+l,2(j1)+l=ϕi,j(1+ρ), where ϕi,j is the kinship coefficient between the two individuals (Web Appendix H).

As an illustration, consider a sample of f independent sib‐pairs. With a slight abuse of notations, let {Gj11,Gj12,Gj21,Gj22} denote the four alleles of the jth sib‐pair for j{1,,f}, where {Gj11,Gj12} are for sibling 1 and {Gj21,Gj22} are for sibling 2. In this case, Σ is a block diagonal matrix with

Σj=10ϕ(1+ρ)ϕ(1+ρ)01ϕ(1+ρ)ϕ(1+ρ)ϕ(1+ρ)ϕ(1+ρ)10ϕ(1+ρ)ϕ(1+ρ)01,

where ϕ=0.25 is the kinship coefficient for a sib‐pair.

For further illustration, assume all sib‐pairs are concordant in phenotype (i.e., r pairs of cases and s pairs of controls) with no covariates, the score statistic of testing H0:β=0 is

TRA,sib-pair,binary-concordant,γ=0=(g¯rg¯s)214r+14s(1+2ϕ(1+ρ^))g¯(1g¯), (11)

where

g¯r=j=1fk=12l=12yjkgjkl/4r,g¯s=j=1fk=12l=12(1yjk)gjkl/4s,g¯=j=1fk=12l=12gjkl/4f,

and

ρ^=j=1fl=12l=12(gj1lg¯)(gj2lg¯)/(4fϕg¯(1g¯))1.

It is compelling that the expression of (11) is similar to that of the classical allelic test in (2). However, the denominator in (11) explicitly adjusts for the inherent genetic correlation between (not within unless inbreeding) the sibling alleles through ϕ, as well as any potential HWD through ρ^.

5. EMPIRICAL EVIDENCE

5.1. Simulation studies

To numerically demonstrate the robustness of T RA to HWD as compared with T allelic, we simulated a case‐control study with an independent sample of 1000 cases and 1000 controls. The minor allele frequency (MAF) was p=0.2 or 0.5 for the minor allele A. The amount of HWD as measured by δ=pAAp2 ranged from the minimum of p2 to the maximum of p(1p). Then pAA=p2+δ and pAa=2p(1p)2δ, and (naa,nAa,nAA)Multinomial{n,(1pAapAA,pAa,pAA)}. For power evaluation at α=0.05, we assumed an additive model with disease penetrances f0=P(Y=1|G=aa)=0.08, f1=P(Y=1|G=Aa)=0.10, and f2=P(Y=1|G=AA)=0.12.

The empirical type 1 error results in Figure 2(A) and (B) confirm the theoretical results in Figure 1: the proposed T RA is accurate across the whole range of HWD values, while T allelic is not robust against HWD. Further, the empirical power results in Figure 2(C) and (D) highlight the fact that the classical allelic test could have reduced power when the number of homozygotes AA is fewer than what is expected under the HWE assumption (i.e., when δ<0), which is not well acknowledged in the existing literature.

FIGURE 2.

FIGURE 2

Empirical type 1 error rate and power of the classical allelic association test and the proposed RA test. The nominal level is α=0.05, evaluated based on 104 simulation replicates. Note that when δ>0, the classical allelic test has the inflated type 1 error rate as shown in (A) and (B), so the corresponding power in (C) and (D) is not meaningful and shown in a lighter shade. Also note that the HWD measure δ is bounded by the MAF p, p2δp(1p). Results of type 1 error control when analyzing rare variants with p=0.01 or 0.05 are shown in Web Figure 2, and when using the GWAS significance level of α=5×108 (evaluated based on 1010 simulation replicates) are shown in Web Table 2. This figure appears in color in the electronic version of this article, and any mention of color refers to that version

We also assessed the accuracy of T RA and T allelic when analyzing SNPs with low MAF (p=0.01 and 0.05), and results in Web Figure 2 are similar to that in Figure 2. In addition, we evaluated the performance of the tests using the GWAS significance level of α=5×108, with 1010 simulation replicates for each parameter setting considered (Web Table 2); results for MAF >0.1 are not shown, because the performance of T RA is better for larger MAF as expected. Overall, the proposed RA test is accurate or slightly conservative for the whole range of HWD. In contrast, the classical allelic test has grossly inflated type 1 error rate up to 1.35×105 for α=5×108 (Web Table 2).

5.2. Application 1: Revisit the study of Wittke‐Thompson et al.

Studying HWD not due to the genotyping error, Wittke‐Thompson et al. (2005) identified 60 biallelic SNPs from 41 case‐control association studies. Focusing on association analysis of these 60 SNPs, we compared T allelic with the proposed T RA (Figure 3). As expected based on the simulation results in Figure 2, for SNPs with δ^>0, T allelic can appear to be more powerful, due to the inflated type 1 error rate, than the proposed T RA. For example, for the most significant SNP, p‐valueallelic=1.82×1010 and p‐valueRA=6.60×109. However, for this SNP δ^=0.052>0 with p‐valueHWE=3.09×104; the HWE test is significant but not suggesting genotyping error. Thus, T allelic is too optimistic. In contrast, for the third most significant SNP, δ^=0.031<0 and p‐valueHWE = 0.040. In that case, T allelic is conservative while the proposed T RA is not only robust but also more powerful, where p‐valueallelic=5.84×106 and p‐valueRA=8.86×107.

FIGURE 3.

FIGURE 3

Results of application 1. Allele‐based association tests of the 60 SNPs identified in Wittke‐Thompson et al. (2005), contrasting the proposed RA method, T RA, with the classical allelic test, T allelic. Unfilled triangles are for SNPs with δ^>0 (T allelic having inflated type 1 error), and filled triangles are for SNPs with δ^<0 (T allelic having deflated type 1 error); see Figure 1 for theoretical results and Figure 2 for simulation results regarding type 1 error control of the two methods

5.3. Application 2: A cystic fibrosis gene modifier study

Here we applied T RA to simultaneously analyze two phenotypes using a sample with related individuals from the Canadian cystic fibrosis (CF) gene modifier study. The two phenotypes of interest are lung function, a quantitative trait (Taylor et al., 2011), and meconium ileus (MI), a binary trait (Gong et al., 2019). The sample of 2540 CF subjects includes 2420 singletons and 60 independent sib‐pairs. For completeness, we first analyzed each phenotype individually using the proposed allele‐based RA framework, and we compared the results with the traditional genotype‐based method via the standard (generalized) linear mixed models (LMM or GLMM). We then analyzed both phenotypes simultaneously using T RA.

Figure 4(A) and (B) show that results of genotype‐ and allele‐based methods are largely consistent; see Section 6 for a discussion. For the most significant SNP associated with MI (Figure 4(B)), the p‐value of T RA is 2.62×106, slightly smaller than 7.80×106 of the GLMM method. In addition, the proposed T RA method can simultaneously analyze both phenotypes and identify suggestive SNPs that have p‐values several orders of magnitude smaller than those from studying one phenotype at a time (Figure 4(C) and (D)). However, these results are not genome‐wide significant and establishing true association requires follow‐up studies.

FIGURE 4.

FIGURE 4

Results of application 2. Chromosome 5‐wide. Genetic association studies of lung function and meconium ileus of 34,378 bi‐allelic markers on chromosome 5, using a sample of 2540 individuals with cystic fibrosis of which 2420 are singletons and 120 are from 60 sib‐pairs. LMM and GLMM are genotype‐based association analyses based on, respectively, the standard linear mixed model for a continuous trait (i.e., lung) and generalized LMM for a binary trait (i.e., MI), and RA is the proposed allele‐based association method that can also simultaneously analyze multiple traits using a sample of related individuals. Genome‐wide results are shown in Web Figure 1. This figure appears in color in the electronic version of this article, and any mention of color refers to that version

Table 2 summarizes the association results for previously reported and replicated SNPs associated with CF lung function (Corvol et al., 2015) and MI association (Sun et al., 2012). For all the SNPs in Table 2, the proposed RA test yields slightly larger −log10(p‐values) than LMM or GLMM, suggesting that the allele‐based method has the potential to be more powerful than the traditional genotype‐based approach. The RA simultaneous analysis of the two phenotypes, however, did not lead to more significant results; this is not surprising because these SNPs were selected based on the two previous single‐phenotype analyses.

TABLE 2.

Results of application 2: Previously reported SNPs

Top CF lung function associated SNPs from Corvol et al. (2015)
Lung function only Lung function and MI simultaneously
Chr SNP
log10pLMM
log10pRA
log10pRA,lung,andMI
3 rs2246901 3.21 3.25 2.63
5 rs3749615 3.25 3.27 3.57
6 rs2395185 6.65 6.77 6.08
11 rs10466455 5.84 5.86 4.84
Top CF meconium ileus associated SNPs from Sun et al. (2012)
MI only Lung function and MI simultaneously
Chr SNP
log10pGLMM
log10pRA
log10pRA,lung,andMI
1 rs4077468 5.34 5.47 4.87
1 rs7512462 4.54 4.82 4.56
1 rs7419153 3.68 4.07 3.35
1 rs12047830 3.10 3.20 2.63

The p‐values in this table differ from those in Sun et al. (2012) and Corvol et al. (2015), because the analyses here only included the Canadian samples and individuals with both phenotypes measured. For other details, see legend to Figure 4.

6. DISCUSSION

To generalize the concept of comparing allele frequencies between case and control groups to more complex data settings, here we developed a novel RA‐based regression framework that regresses individual alleles on phenotypes. Motivated by the earlier work of Chen (1983) for testing regression coefficients in independent samples, the proposed method utilizes the Gaussian model of (10) that (i) leads to a valid allelic association test through testing the regression coefficient β, (ii) adjusts for the covariate effect through additional regression coefficient γ, (iii) explicitly models potential departure from HWE through ρ in the covariance matrix Σ, (iv) accounts for sample correlation through kinship coefficient ϕ in Σ, and (v) analyzes either a binary or a continuous phenotype, or multiple phenotypes simultaneously, where the phenotype data can be subjected to Y‐dependent sampling.

The generalized allelic association test also unifies existing allelic tests. Under the HWE assumption and for a case‐control study using an independent sample without covariates, the score test of H0:β=0 based on the simplified RA model (7) is identical to the classical allelic test in (2), TRA,binary,γ=0,ρ=0=Tallelic. In the presence of HWD, TRA,binary,γ=0=Tallelic,Schaid.

A crucial stage of this work is creating the two allele‐based random variables, Gi1 and Gi2, and leveraging the “power” of regression in new settings. The idea of reformulating an existing test statistic as a regression to facilitate method extension is not new. In their reader reaction to the generalized nonparametric Kurskal–Wallis test of Acar and Sun (2013) for handling group uncertainty, Wu and Guan (2015) presented “ a rank linear regression model and derived the proposed generalized Kruskal‐Wallis (GKW) statistic as a score test statistic.” More recently, Soave and Sun (2017) showed that by first reformulating the original Levene's test, testing for variance heterogeneity between k groups in an independent sample without group uncertainty, as a two‐stage regression, the extension to more complex data is more straightforward.

The concept of “reverse” regression has also been explored before, focusing on regressing genotype on phenotype, notably by O'Reilly et al. (2012) for simultaneously analyses of multiple phenotypes. The corresponding MultiPhen method uses an ordinal logistic regression for the three genotype groups and then applies a likelihood ratio test in independent samples. Zhang et al. (2014) then examined GEE‐based association tests for related individuals, but fundamentally these early tests are genotype based.

Another stream of the genotype‐based “reverse” or retrospective approach started with the quasi‐likelihood method of Thornton and McPeek (2007) for case‐control association testing with related individuals. The method first defines Xi=Gi/2{0,1/2,1}, then links the mean of Xi with Yi via a logit transformation and uses the kinship coefficient matrix as the covariance matrix of Xi, and finally obtains a quasi‐likelihood score test. Subsequently, Feng (2014) and Feng et al. (2011) extended the method of Thornton and McPeek (2007) to a quasi‐likelihood regression model that can incorporate multiple phenotypes. Although E(Xi)=E(Gi/2) can be interpreted as allele frequency, the quasi‐likelihood score test is fundamentally a genotype‐based association method. Further, the use of the kinship matrix alone as the covariance matrix requires the assumption of HWE.

Most existing family‐based association studies rely on the Y‐on‐G prospective regression framework via LMM or GLMM (Eu‐Ahsunthornwattana et al., 2014). For the application study in Section 5.3, we applied both the proposed RA method, and LMM (for the continuous CF lung function) and GLMM (for the binary meconium ileus status). Although there are differences in the (single‐phenotype) analyses (Figure 4(A) and (B)), results are remarkably consistent. Interestingly, in the simplest case of an independent sample with no covariates, the corresponding RA test statistic has identical form as that derived from the genotype‐based prospective regression model, as well as that from the nonparametric trend test (Web Appendix I). This inference similarity indirectly confirms the validity of the proposed approach (“reverse,” allele‐based and Gaussian model applied to a binary variable), but does not take away the contribution of this work. In particular, unlike LMM or GLMM, the proposed RA method can analyze multiple phenotypes simultaneously as shown in Figure 4(C) and (D).

One of the challenges related to the proposed framework is the interpretation of the parameter estimate for β even though its corresponding hypothesis testing is valid. Thus, we emphasize that the method developed here is tailored for rapid association testing, not effect size estimation. We direct the readers to Krutchkoff (1967) and Halperin (1970) for parameter estimation in the context of “reverse” regression. The proposed RA framework provides a statistically efficient and computationally fast way for genome‐wide association scans. For example, the analysis of Application 2 took 23.16 h using one CPU core of an Intel Xeon Processor (Skylake, IBRS) at 2.40 GHz; the computation time is reduced to less than 2 h when analyzing only the unrelated individuals in the CF sample.

Another difficulty present in any “reverse” regression approach is the modeling and interpretation of gene–gene or gene–environment interactions. It is also not clear how to perform an allelic association test for an X‐chromosomal variant; see Chen et al. (2018) for genotype‐based association methods. Finally, although the proposed RA method has good type 1 error control when analyzing rare variants (Web Figure 2 and Web Table 2), power of the RA method in its current implementation, analyzing one SNP at a time, is likely to be low for a rare variant. Extending the RA methodology to simultaneously analyze multiple rare variants (Derkach et al., 2014) is of future interest but beyond the scope of this work.

The proposed framework, however, is flexible and promising in other directions. For example, the inclusion of parameter ρ in the RA model (7) is advantageous for both method comparison and further development. In the absence of Y and Z and sample correlation, we have shown that the score test of H0:ρ=0 is identical to the classical Pearson's χ2 test of HWE in (1), THWE,RA=THWE,Pearson. For more complex data, instead of developing individual remedies addressing specific challenges, the proposed method provides a principled approach for extensions. For example, we have shown in Section 4.1 that by introducing a population indicator we can derive a HWE test across populations that, in the simple setting of an independent sample, has identical form with that of Troendle and Yu (1994).

In terms of association testing, the value of explicitly modeling HWD via ρ in the regression is twofold. First, if there is a strong prior evidence for HWE, we can restrict ρ to be zero and establish a locally most powerful score test. Second, for the special case of a case‐control study, Song and Elston (2006) and Wang and Shete (2010) have argued that departure from HWE in the case group provides additional evidence for association. However, the existing methods are ad hoc. For example, the method of Song and Elston (2006) first conducts the genotype‐based association test and Pearson χ2 test of HWE separately, then aggregates the two (dependent) tests by a weighted sum, and finally evaluates the statistical significance via simulations. The proposed RA regression framework conceptually offers a new approach that could directly incorporate group‐specific ρ into association inference, which we will explore as future work.

Supporting information

Web Appendices, Tables, and Figures referenced in Sections 3, 4 and 6 are available with this paper at the Biometrics website on Wiley Online Library. The method has been implemented, available at https://github.com/lzhangdc/Robust‐Allele‐based‐Regression‐Framework.

ACKNOWLEDGMENTS

The authors thank Dr. Lisa Strug and her lab for providing the cystic fibrosis application data. This research is funded by the Natural Sciences and Engineering Research Council of Canada (NSERC, RGPIN‐04934 and RGPAS‐522594) and the Canadian Institutes of Health Research (CIHR, MOP‐310732) to LS. LZ is a trainee of the CANSSI‐ONTARIO STAGE training program at the University of Toronto.

Zhang L, Sun L. A generalized robust allele‐based genetic association test. Biometrics. 2022;78:487–498. 10.1111/biom.13456

DATA AVAILABILITY STATEMENT

The dataset of Application 1 is publicly available from Table 1 of Wittke‐Thompson et al. (2005) available at https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1196455/. The dataset of Application 2 is available by application to the Cystic Fibrosis Canada National data registry for researchers who meet the criteria for access to confidential clinical data for the purpose of CF research.

REFERENCES

  1. Acar, E.F. and Sun, L. (2013) A generalized Kruskal–Wallis test incorporating group uncertainty with application to genetic association studies. Biometrics, 69, 427–435. [DOI] [PubMed] [Google Scholar]
  2. Bourgain, C. , Abney, M. , Schneider, D. , Ober, C. and McPeek, M.S. (2004) Testing for Hardy–Weinberg equilibrium in samples with related individuals. Genetics, 168, 2349–2361. [DOI] [PMC free article] [PubMed] [Google Scholar]
  3. Bourgain, C. , Hoffjan, S. , Nicolae, R. , Newman, D. , Steiner, L. , Walker, K. , Reynolds, R. , Ober, C. and McPeek, M.S. (2003) Novel case‐control test in a founder population identifies P‐selectin as an atopy‐susceptibility locus. The American Journal of Human Genetics, 73, 612–626. [DOI] [PMC free article] [PubMed] [Google Scholar]
  4. Bycroft, C. , Freeman, C. , Petkova, D. , Band, G. , Elliott, L.T. , Sharp, K. , Motyer, A. , Vukcevic, D. , Delaneau, O. , O'Connell, J. et al. (2018) The UK Biobank resource with deep phenotyping and genomic data. Nature, 562, 203–209. [DOI] [PMC free article] [PubMed] [Google Scholar]
  5. Chen, B. , Craiu, R.V. , Strug, L.J. and Sun, L. (2018) The X factor: a robust and powerful approach to X‐chromosome‐inclusive whole‐genome association studies. arXiv preprint arXiv:1811.00964. [DOI] [PMC free article] [PubMed]
  6. Chen, C.‐F. (1983) Score tests for regression models. Journal of the American Statistical Association, 78, 158–161. [Google Scholar]
  7. Corvol, H. , Blackman, S.M. , Boëlle, P.‐Y. , Gallins, P.J. , Pace, R.G. , Stonebraker, J.R. , Accurso, F.J. , Clement, A. , Collaco, J.M. , Dang, H. et al. (2015) Genome‐wide association meta‐analysis identifies five modifier loci of lung disease severity in cystic fibrosis. Nature Communications, 6, 8382. [DOI] [PMC free article] [PubMed] [Google Scholar]
  8. Derkach, A. , Lawless, J.F. and Sun, L. (2015) Score tests for association under response‐dependent sampling designs for expensive covariates. Biometrika, 102, 988–994. [Google Scholar]
  9. Derkach, A. , Lawless, J.F. , Sun, L. et al. (2014) Pooled association tests for rare genetic variants: a review and some new results. Statistical Science, 29, 302–321. [Google Scholar]
  10. Dimitromanolakis, A. , Paterson, A.D. and Sun, L. (2019) Fast and accurate shared segment detection and relatedness estimation in un‐phased genetic data via truffle. The American Journal of Human Genetics, 105, 78–88. [DOI] [PMC free article] [PubMed] [Google Scholar]
  11. Eu‐Ahsunthornwattana, J. , Miller, E.N. , Fakiola, M. , Wellcome Trust Case Control Consortium 2, Jeronimo , S.M., Blackwell, J.M. and Cordell, H.J. (2014) Comparison of methods to account for relatedness in genome‐wide association studies with family‐based data. PLoS Genetics, 10, e1004445. [DOI] [PMC free article] [PubMed] [Google Scholar]
  12. Feng, Z. (2014) A generalized quasi‐likelihood scoring approach for simultaneously testing the genetic association of multiple traits. Journal of the Royal Statistical Society: Series C (Applied Statistics), 63, 483–498. [Google Scholar]
  13. Feng, Z. , Wong, W.W. , Gao, X. and Schenkel, F. (2011) Generalized genetic association study with samples of related individuals. The Annals of Applied Statistics, 5, 2109–2130. [Google Scholar]
  14. Gong, J. , Wang, F. , Xiao, B. , Panjwani, N. , Lin, F. , Keenan, K. , Avolio, J. , Esmaeili, M. , Zhang, L. , He, G. et al. (2019) Genetic association and transcriptome integration identify contributing genes and tissues at cystic fibrosis modifier loci. PLoS Genetics, 15, e1008007. [DOI] [PMC free article] [PubMed] [Google Scholar]
  15. Halperin, M. (1970) On inverse estimation in linear regression. Technometrics, 12, 727–736. [Google Scholar]
  16. Krutchkoff, R. (1967) Classical and inverse regression methods of calibration. Technometrics, 9, 425–439. [Google Scholar]
  17. O'Reilly, P.F. , Hoggart, C.J. , Pomyen, Y. , Calboli, F.C. , Elliott, P. , Jarvelin, M.‐R. and Coin, L.J. (2012) Multiphen: joint model of multiple phenotypes can increase discovery in GWAS. PloS One, 7, e34861. [DOI] [PMC free article] [PubMed] [Google Scholar]
  18. Sasieni, P.D. (1997) From genotypes to genes: doubling the sample size. Biometrics, 53, 1253–1261. [PubMed] [Google Scholar]
  19. Schaid, D.J. and Jacobsen, S.J. (1999) Biased tests of association: comparisons of allele frequencies when departing from Hardy–Weinberg proportions. American Journal of Epidemiology, 149, 706–711. [DOI] [PubMed] [Google Scholar]
  20. Schaid, D.J. , Sinnwell, J.P. and Jenkins, G.D. (2012) Regression modeling of allele frequencies and testing Hardy Weinberg equilibrium. Human Heredity, 74, 71–82. [DOI] [PMC free article] [PubMed] [Google Scholar]
  21. Soave, D. and Sun, L. (2017) A generalized Levene's scale test for variance heterogeneity in the presence of sample correlation and group uncertainty. Biometrics, 73, 960–971. [DOI] [PubMed] [Google Scholar]
  22. Song, K. and Elston, R.C. (2006) A powerful method of combining measures of association and Hardy–Weinberg disequilibrium for fine‐mapping in case‐control studies. Statistics in Medicine, 25, 105–126. [DOI] [PubMed] [Google Scholar]
  23. Sun, L. , Rommens, J.M. , Corvol, H. , Li, W. , Li, X. , Chiang, T.A. , Lin, F. , Dorfman, R. , Busson, P.‐F. , Parekh, R.V. et al. (2012) Multiple apical plasma membrane constituents are associated with susceptibility to meconium ileus in individuals with cystic fibrosis. Nature Genetics, 44, 562. [DOI] [PMC free article] [PubMed] [Google Scholar]
  24. Taylor, C. , Commander, C.W. , Collaco, J.M. , Strug, L.J. , Li, W. , Wright, F.A. , Webel, A.D. , Pace, R.G. , Stonebraker, J.R. , Naughton, K. et al. (2011) A novel lung disease phenotype adjusted for mortality attrition for cystic fibrosis genetic modifier studies. Pediatric Pulmonology, 46, 857–869. [DOI] [PMC free article] [PubMed] [Google Scholar]
  25. Thornton, T. and McPeek, M.S. (2007) Case‐control association testing with related individuals: a more powerful quasi‐likelihood score test. The American Journal of Human Genetics, 81, 321–337. [DOI] [PMC free article] [PubMed] [Google Scholar]
  26. Troendle, J. and Yu, K. (1994) A note on testing the Hardy–Weinberg law across strata. Annals of Human Genetics, 58, 397–402. [DOI] [PubMed] [Google Scholar]
  27. Wang, J. and Shete, S. (2010) Using both cases and controls for testing Hardy–Weinberg proportions in a genetic association study. Human Heredity, 69, 212–218. [DOI] [PMC free article] [PubMed] [Google Scholar]
  28. Weir, B. (1996) Genetic Analysis II. Sunderland, MA: Sinauer. [Google Scholar]
  29. Wittke‐Thompson, J.K. , Pluzhnikov, A. and Cox, N.J. (2005) Rational inferences about departures from Hardy–Weinberg equilibrium. The American Journal of Human Genetics, 76, 967–986. [DOI] [PMC free article] [PubMed] [Google Scholar]
  30. Wu, B. and Guan, W. (2015) Reader reaction on the generalized Kruskal–Wallis test for genetic association studies incorporating group uncertainty. Biometrics, 71, 556–557. [DOI] [PMC free article] [PubMed] [Google Scholar]
  31. Zhang, Y. , Xu, Z. , Shen, X. , Pan, W. , Initiative, A.D.N. et al. (2014) Testing for association with multiple traits in generalized estimation equations, with application to neuroimaging data. NeuroImage, 96, 309–325. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Web Appendices, Tables, and Figures referenced in Sections 3, 4 and 6 are available with this paper at the Biometrics website on Wiley Online Library. The method has been implemented, available at https://github.com/lzhangdc/Robust‐Allele‐based‐Regression‐Framework.

Data Availability Statement

The dataset of Application 1 is publicly available from Table 1 of Wittke‐Thompson et al. (2005) available at https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1196455/. The dataset of Application 2 is available by application to the Cystic Fibrosis Canada National data registry for researchers who meet the criteria for access to confidential clinical data for the purpose of CF research.


Articles from Biometrics are provided here courtesy of Wiley

RESOURCES