Linear Mixed-Effect Models Through the Lens of Hardy–Weinberg Disequilibrium

Lin Zhang; Lei Sun

doi:10.3389/fgene.2022.856872

2022 Apr 12;13:856872.

PMCID: PMC9039721 PMID: 35495131

Abstract

For genetic association studies with related individuals, the linear mixed-effect model is the most commonly used method. In this report, we show that contrary to the popular belief, this standard method can be sensitive to departure from Hardy–Weinberg equilibrium (i.e., Hardy–Weinberg disequilibrium) at the causal SNPs in two ways. First, when the trait heritability is treated as a nuisance parameter, although the association test has correct type I error control, the resulting heritability estimate can be biased, often upward, in the presence of Hardy–Weinberg disequilibrium. Second, if the true heritability is used in the linear mixed-effect model, then the corresponding association test can be biased in the presence of Hardy–Weinberg disequilibrium. We provide some analytical insights along with supporting empirical results from simulation and application studies.

Keywords: genome-wide association study, dependent sample, robust association analysis, heritability estimate, Hardy–Weinberg equilibrium

1 Introduction

Genetic association tests are often derived from a regression model, regressing the phenotypic data of a complex trait (Y) on the genotypic data of a single-nucleotide polymorphism (SNP; G), as well as on the covariate data of important environmental factors (Z). When individuals in a sample are genetically related with each other, the linear mixed-effect model (LMM) is the most commonly used method for genome-wide association studies (GWAS) (Eu-Ahsunthornwattana et al., 2014). The variance–covariance matrix of the regression model is partitioned into a weighted sum of the genetic correlation matrix and the correlation matrix due to shared environmental effects. The genetic correlation matrix is typically represented by the kinship coefficient matrix, which is either inferred from the (correctly) known pedigree structure or estimated based on the available genome-wide genetic data (Yang et al., 2011; Dimitromanolakis et al., 2019). The weight for the genetic correlation matrix is referred to as the heritability of the trait (Visscher et al., 2006; Visscher et al., 2008); Falconer (1985) gave a theoretical modeling of the variance partition, which sets the foundation for heritability.

It is commonly assumed that these regression-based association tests are robust to departure from Hardy–Weinberg equilibrium (HWE) (Sasieni, 1997). HWE states that the two alleles in a genotype are independent draws from the same Bernoulli distribution, or, equivalently, genotype frequencies depend solely on the allele frequencies (Hardy, 1908; Weinberg, 1908). For a biallelic SNP with two possible alleles A and a, let p and 1 − p be the population allele frequencies, respectively. Under HWE, p_aa = (1 − p)², p_Aa = 2p(1 − p), and p_AA = p², where p_aa, p_Aa, and p_AA are the population genotype frequencies of genotypes aa, Aa, and AA, respectively. To quantify the departure from HWE or the amount of Hardy–Weinberg disequilibrium (HWD),

δ = p_{A A} - p^{2}

(1)

is a widely used measure (Weir, 1996), and δ = 0 indicates that HWE holds. We note that a) HWE is also known as Hardy–Weinberg proportion and b) δ can also be written as p(1 − p)F, where F is the inbreeding coefficient (Powell et al., 2010). Equivalently, instead of quantifying the genotype frequencies as p_aa = (1 − p)² + δ, p_Aa = 2p(1 − p) − 2δ, and p_AA = p² + δ based on δ (Weir, 1996), we can define them based on F as p_aa = (1 − p)² + p(1 − p)F, p_Aa = 2p(1 − p)(1 − F), and p_AA = p² + p(1 − p)F (Powell et al., 2010). Because the classical Pearson χ² HWE test is based on comparing the observed genotype counts with those expected under HWE (Zhang and Sun, 2021), we chose δ for this work to be consistent with the GWAS literature.
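As a small illustration of the two parameterizations above, the following sketch computes the genotype frequencies from (p, δ) and checks that the F-based form gives the same values. The function names and the numerical values p = 0.2 and δ = 0.06 are chosen here for illustration only.

```python
def genotype_freqs(p, delta):
    """Return (p_aa, p_Aa, p_AA) given the frequency p of allele A and the HWD measure delta."""
    p_aa = (1 - p) ** 2 + delta
    p_Aa = 2 * p * (1 - p) - 2 * delta
    p_AA = p ** 2 + delta
    return p_aa, p_Aa, p_AA


def inbreeding_coefficient(p, delta):
    """Return F such that delta = p(1 - p)F."""
    return delta / (p * (1 - p))


# Illustrative values (assumed, not taken from the application data).
p, delta = 0.2, 0.06
freqs = genotype_freqs(p, delta)
F = inbreeding_coefficient(p, delta)

# The same three frequencies written in terms of F instead of delta.
freqs_F = (
    (1 - p) ** 2 + p * (1 - p) * F,
    2 * p * (1 - p) * (1 - F),
    p ** 2 + p * (1 - p) * F,
)
```

Both tuples sum to one, and they agree term by term because δ and p(1 − p)F are the same quantity.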

A truly associated or causal SNP can be out of HWE (Wittke-Thompson et al., 2005; Ryckman and Williams, 2008; Turner et al., 2011), an often overlooked but important consideration when studying a method’s robustness to HWD. Note that the HWD attributed to true association is typically not as extreme as the HWD caused by genotyping errors (Zhang and Sun, 2020). Thus, true HWD can remain in a “cleaned” dataset after applying the standard HWD-based quality control screening using a stringent p-value threshold [e.g., 10^–12 for an application of the UK Biobank data by Bycroft et al. (2018)]. With a sample of independent individuals, both theoretical and empirical results support that genotype-based association tests are robust to HWD (Sasieni, 1997; Schaid and Jacobsen, 1999; Zhang, 2021). However, the case of dependent samples has received little attention.

In this report, we first provide some analytical insights on why the standard LMM can be sensitive to HWD in pedigree data in contrast to when analyzing a sample of unrelated individuals. We then demonstrate with a simple sib-pair design that 1) when the heritability is estimated from the data as in practice, although the empirical type I error rate of the LMM is well controlled, the estimated heritability is biased, often upward biased; 2) when the true heritability is known and used, the empirical type I error rate of the LMM is then inflated when δ > 0, and deflated if δ < 0. The result of 2) is novel, but it is mostly of an academic interest as the true heritability of a trait is often unknown in practice. On the other hand, the result of 1) has important practical implications because if the estimate of a trait heritability is larger than the true value, then it helps explain some of the “missing heritability” (Manolio et al., 2009); the insightful work of Chen (2014) “discuss[es] the circumstances in which the HE [Haseman‐Elston] regression and the mixed linear model are equivalent.”

2 Methods

2.1 Traditional Y ∼ G Model With Independent Samples, T_indep, Is Robust to HWD

Let Y be a (continuous) trait of interest, and G = 0, 1, and 2, respectively, for the genotypes aa, Aa, and AA of a SNP. Additionally, for notation simplicity but without loss of generality, we assume that there is only one additional covariate, denoted by Z. With a sample of n unrelated individuals, the traditional genotype-based association analysis assumes that

y = α^{*} 1 + β^{*} g + γ^{*} z + ϵ^{*}, ϵ^{*} \sim N (0, σ^{* 2} I),

(2)

where y = (y₁, y₂, …, y_n)^T is an n × 1 vector of the phenotypic values, 1 is an n × 1 vector of 1’s, g = (g₁, g₂, …, g_n)^T is an n × 1 vector of the genotypes of the SNP, z = (z₁, z₂, …, z_n)^T is an n × 1 vector of the covariate values, ϵ* is the error term with variance σ*², and I is the identity matrix.

Score-based tests are often used for genetic association analyses (Derkach et al., 2015). In this case, the score statistic of testing H ₀: β* = 0 can be easily derived as

T_{indep} = n \cdot \frac{{\{{(g - \bar{g} 1)}^{T} (y - \bar{y} 1) - \frac{{(g - \bar{g} 1)}^{T} (z - \bar{z} 1) {(y - \bar{y} 1)}^{T} (z - \bar{z} 1)}{{(z - \bar{z} 1)}^{T} (z - \bar{z} 1)}\}}^{2}}{[{(g - \bar{g} 1)}^{T} (g - \bar{g} 1) - \frac{{\{{(g - \bar{g} 1)}^{T} (z - \bar{z} 1)\}}^{2}}{{(z - \bar{z} 1)}^{T} (z - \bar{z} 1)}] [{(y - \bar{y} 1)}^{T} (y - \bar{y} 1) - \frac{{\{{(y - \bar{y} 1)}^{T} (z - \bar{z} 1)\}}^{2}}{{(z - \bar{z} 1)}^{T} (z - \bar{z} 1)}]} .

(3)
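The statistic in Eq. 3 can be computed directly from centered vectors. The sketch below is one way to do so with NumPy; the function names and the simulated inputs are assumptions made for illustration. It also checks the algebraic identity that Eq. 3 equals n times the squared partial correlation between g and y after adjusting both for an intercept and z.

```python
import numpy as np


def t_indep(y, g, z):
    """Score statistic of Eq. 3 for H0: beta* = 0 with one covariate z."""
    n = len(y)
    yc, gc, zc = y - y.mean(), g - g.mean(), z - z.mean()
    szz = zc @ zc
    num = (gc @ yc - (gc @ zc) * (yc @ zc) / szz) ** 2
    den = (gc @ gc - (gc @ zc) ** 2 / szz) * (yc @ yc - (yc @ zc) ** 2 / szz)
    return n * num / den


def n_times_partial_r2(y, g, z):
    """n times the squared partial correlation of g and y given (1, z)."""
    X = np.column_stack([np.ones_like(z), z])
    ry = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    rg = g - X @ np.linalg.lstsq(X, g, rcond=None)[0]
    r = (ry @ rg) / np.sqrt((ry @ ry) * (rg @ rg))
    return len(y) * r ** 2


# Simulated inputs (illustrative only): SNP with allele frequency 0.2, no true effect.
rng = np.random.default_rng(0)
n = 200
g = rng.binomial(2, 0.2, size=n).astype(float)
z = rng.normal(size=n)
y = 0.5 * z + rng.normal(size=n)

stat = t_indep(y, g, z)
stat_check = n_times_partial_r2(y, g, z)
```

Under H₀ the statistic is compared with a χ² distribution with one degree of freedom.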

To observe T _indep’s connection with Hardy–Weinberg disequilibrium, it is instructive to employ some algebraic tricks and show that

\frac{1}{n} {(g - \bar{g} 1)}^{T} (g - \bar{g} 1) = \hat{var} (G) = 2 (\hat{p} (1 - \hat{p}) + ({\hat{p}}_{A A} - {\hat{p}}^{2})) = 2 (\hat{p} (1 - \hat{p}) + \hat{δ}) .

Because $\hat{δ} = {\hat{p}}_{A A} - {\hat{p}}^{2}$ measures the amount of HWD present in the data (Weir, 1996), T _indep inherently adjusts for departure from HWE through $\hat{var} (G) = 2 (\hat{p} (1 - \hat{p}) + \hat{δ})$ . As a result, the traditional genotype-based association test is robust to HWD in independent samples.
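The variance identity can be verified numerically. In the sketch below, the genotype counts are hypothetical values chosen for illustration; the code compares the directly computed sample variance of G with 2(p̂(1 − p̂) + δ̂), and also computes the HWE-based value 2p̂(1 − p̂) for contrast.

```python
import numpy as np

# Hypothetical genotype counts for aa, Aa, AA (illustrative values only).
n_aa, n_Aa, n_AA = 70, 20, 10
g = np.repeat([0.0, 1.0, 2.0], [n_aa, n_Aa, n_AA])
n = g.size

p_hat = g.mean() / 2                 # estimated frequency of allele A
p_AA_hat = np.mean(g == 2)           # estimated frequency of genotype AA
delta_hat = p_AA_hat - p_hat ** 2    # estimated HWD measure

var_direct = np.sum((g - g.mean()) ** 2) / n          # (1/n) (g - g_bar 1)^T (g - g_bar 1)
var_formula = 2 * (p_hat * (1 - p_hat) + delta_hat)   # includes the HWD term
var_hwe = 2 * p_hat * (1 - p_hat)                     # what HWE alone would imply
```

With these counts the direct variance and the formula agree, while the HWE-only value is smaller because δ̂ is positive.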

When Y is binary, the classic logistic regression is commonly used. However, Chen (1983) showed that under some regularity conditions, the score test statistics have an identical form for the exponential family in independent samples, which was recently validated by Zhang and Sun (2021) for genetic association studies. Additionally, Derkach et al. (2015) showed that for Y-dependent sampling, “the score statistics are identical for conditional and full likelihood approaches, and are of the same form as those for ordinary random sampling.” Thus, in terms of association testing (not genetic effect estimation), we can conclude that genotype-based association studies of binary traits in independent samples are also robust to HWD.

2.2 Linear Mixed-Effect Model With Dependent Samples, T _LMM, Can Be Sensitive to HWD

Although a pedigree-based study design is rare for genome-wide association studies, individuals can be (cryptically) related with each other even in population-based GWAS (Sun et al., 2017). Omitting related individuals simplifies the association analysis but reduces the sample size and thus power. Instead, Σ_Φ, the kinship coefficient matrix, can be estimated using the available genome-wide data to capture the sample relatedness between the n individuals (Visscher et al., 2006; Yang et al., 2011). The association analysis using the full sample can be conducted using the linear mixed-effect model.

y = α^{*} 1 + β^{*} g + γ^{*} z + ϵ^{*}, where ϵ^{*} \sim N (0, σ_{y}^{2} Σ_{y}) and Σ_{y} = h^{2} Σ_{Φ} + (1 - h^{2}) I .

(4)

Compared with the linear model used for independent samples, var(ϵ*) = σ*² I in Eq. 2 is replaced by $σ_{y}^{2} Σ_{y}$ to reflect the sample dependence. The matrix Σ_y is a weighted average of two components, where Σ_Φ reflects the sample relatedness; naturally, the model is reduced to the linear model of Eq. 2 for independent samples when Σ_Φ = I. The weight h ² is interpreted as the heritability of the trait (Visscher et al., 2008), $h^{2} σ_{y}^{2}$ as the phenotypic variation due to (additive) genetic variation, and $(1 - h^{2}) σ_{y}^{2} = σ_{e}^{2}$ as the phenotypic variation due to environmental variation. The matrix Σ_Φ is the kinship matrix, where Σ_Φ(i, j) = 2ϕ _i,j and ϕ _i,j is the kinship coefficient between the ith and jth samples.
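For the sib-pair design used later, Σ_y has a simple block-diagonal form. The sketch below builds it under two assumptions stated here rather than in the text: full siblings have kinship coefficient ϕ = 1/4, and individuals are not inbred, so each diagonal entry of Σ_Φ equals 1. The function name is hypothetical.

```python
import numpy as np


def sib_pair_sigma_y(n_pairs, h2):
    """Sigma_y = h2 * Sigma_Phi + (1 - h2) * I for independent sibling pairs."""
    phi = 0.25  # assumed kinship coefficient between full siblings
    block = np.array([[1.0, 2 * phi],
                      [2 * phi, 1.0]])
    # Pairs are unrelated to each other, so Sigma_Phi is block diagonal.
    sigma_phi = np.kron(np.eye(n_pairs), block)
    return h2 * sigma_phi + (1 - h2) * np.eye(2 * n_pairs)


sigma_y = sib_pair_sigma_y(n_pairs=2, h2=0.5)
```

With h² = 0.5, the within-pair off-diagonal entry is h² · 2ϕ = 0.25, and entries between members of different pairs are zero.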

By convention, h ² is defined as

h^{2} = \frac{\sum β_{k}^{2} var (G_{k})}{var (Y)} = \frac{\sum β_{k}^{2} 2 p_{k} (1 - p_{k})}{\sum β_{k}^{2} 2 p_{k} (1 - p_{k}) + σ_{e}^{2}},

where there could be multiple causal SNPs, k = 1,…, S. In reality, h ² is estimated by the correlation between phenotypes of related individuals. Consider the simple case of sibling pairs, and let Y ₁ and Y ₂ be the phenotypes for sib 1 and sib 2, respectively. Allowing for HWD and adjusting for the kinship coefficient ϕ, the estimated h ² is

{\hat{h}}^{2} = \frac{\hat{corr} (Y_{1}, Y_{2})}{2 ϕ},

where corr(Y ₁, Y ₂) depends on the correlation between G _1k and G _2k between the siblings; see Zhang and Sun (2021) for the derivation of corr(G _1k, G _2k) accounting for kinship coefficient and HWD. Thus,

\frac{E ({\hat{h}}^{2})}{h^{2}} = \frac{\sum β_{k}^{2} 2 (p_{k} (1 - p_{k}) + δ_{k})}{\sum β_{k}^{2} 2 p_{k} (1 - p_{k})},

and the bias of the h ² estimate is

E ({\hat{h}}^{2}) - h^{2} = h^{2} \cdot \frac{\sum β_{k}^{2} δ_{k}}{\sum β_{k}^{2} p_{k} (1 - p_{k})} .

(5)

Under the simple case of one causal SNP, the bias is simplified to h ² ⋅ δ/(p(1 − p)).
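Eq. 5 is straightforward to evaluate. The following sketch uses a hypothetical helper name and assumed parameter values; the single-SNP call uses h² = 0.5, p = 0.2, and δ = 0.1, the setting examined in Section 3.2.

```python
def h2_bias(h2, betas, ps, deltas):
    """Expected bias of the h2 estimate from Eq. 5 for S causal SNPs."""
    num = sum(b ** 2 * d for b, d in zip(betas, deltas))
    den = sum(b ** 2 * p * (1 - p) for b, p in zip(betas, ps))
    return h2 * num / den


# One causal SNP: reduces to h2 * delta / (p (1 - p)).
bias_single = h2_bias(0.5, betas=[1.0], ps=[0.2], deltas=[0.1])

# Two causal SNPs with equal effects and opposite HWD: the contributions cancel.
bias_cancel = h2_bias(0.5, betas=[1.0, 1.0], ps=[0.2, 0.2], deltas=[0.05, -0.05])
```

The single-SNP value is 0.5 · 0.1/0.16 = 0.3125. The second call shows that exact cancellation requires the β_k²-weighted δ_k values to sum to zero.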

Given the analytical insights provided so far, we then briefly examine the empirical properties of T _LMM through both application and simulation studies.

3 Results

3.1 Cystic Fibrosis Sib-Pair Data Application: T _LMM Has Correct Type I Error but h ² Appears to Be Overestimated

We extracted 65 sibling pairs from a cystic fibrosis (CF) gene modifier study (Wright et al., 2011; Sun et al., 2012). The phenotype Y of interest is the lung function measurements of the 130 related individuals with CF. In total, 570,539 SNPs genotyped using the Illumina 610-Quad Beadchip remained after applying the standard quality control, including minor allele frequency (MAF) greater than 2%. To stabilize the variance estimation, we additionally required SNPs to have MAF greater than 5%. We then applied T_LMM to the remaining 505,172 SNPs. In the application, we treated h² as unknown and estimated it based on the linear mixed-effect model of Eq. 4, as is conventional.

When h ² was estimated from the data, our association testing based on T _LMM had good type I error control (results not shown), consistent with the empirical observations in the GWAS literature. However, the estimated h ², obtained using the 65-pair sibling data, is ${\hat{h}}^{2} = 0.82$ . This value is substantially greater than 0.5, the commonly believed “true” heritability of lung function in CF obtained from the classic monozygous (MZ) vs. dizygous (DZ) twin-based estimation method (Vanscoy et al., 2007).

To verify if the large heritability estimate from the LMM method in our application was due to chance, we conducted a proof-of-principle simulation study. We assumed that only one causal SNP, G _causal with MAF of 0.2, affects Y with h ² = 0.5. Genotype and phenotype values for 65 sibling pairs were then simulated under the assumption of HWE (i.e., without HWD). Among the 100,000 independently simulated replicates, only 4.24% of the heritability estimates were greater than ${\hat{h}}^{2} = 0.82$ . This suggests that ${\hat{h}}^{2} = 0.82$ , the value that was observed in the CF data application, was unlikely if the true heritability was 0.5 and without HWD at the causal SNP.
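A simulation of this kind can be sketched as follows. This is not the authors' code, and it makes several assumptions: genotypes are generated by Mendelian transmission from parents whose alleles are independent draws (HWE), the trait variance is set to 1, and h² is estimated with the moment estimator corr(Y₁, Y₂)/(2ϕ) from Section 2.2 rather than by fitting the LMM of Eq. 4. The tail fraction it produces therefore need not match the 4.24% reported above.

```python
import numpy as np


def simulate_h2_estimates(n_pairs=65, maf=0.2, h2=0.5, n_rep=2000, seed=1):
    """Moment-based h2 estimates from simulated sib pairs under HWE (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # Per-allele effect chosen so that beta^2 * 2p(1 - p) = h2 when var(Y) = 1.
    beta = np.sqrt(h2 / (2 * maf * (1 - maf)))
    shape = (n_rep, n_pairs)
    # Two alleles per parent, drawn independently (HWE and random mating).
    father = rng.binomial(1, maf, size=shape + (2,))
    mother = rng.binomial(1, maf, size=shape + (2,))
    sib_traits = []
    for _ in range(2):
        # Each sib inherits one randomly chosen allele from each parent.
        pick_f = rng.integers(0, 2, size=shape)
        pick_m = rng.integers(0, 2, size=shape)
        g = (np.take_along_axis(father, pick_f[..., None], axis=-1)[..., 0]
             + np.take_along_axis(mother, pick_m[..., None], axis=-1)[..., 0])
        e = rng.normal(scale=np.sqrt(1 - h2), size=shape)
        sib_traits.append(beta * g + e)
    y1, y2 = sib_traits
    # Pearson correlation across the pairs within each replicate.
    y1c = y1 - y1.mean(axis=1, keepdims=True)
    y2c = y2 - y2.mean(axis=1, keepdims=True)
    corr = (y1c * y2c).sum(axis=1) / np.sqrt((y1c ** 2).sum(axis=1) * (y2c ** 2).sum(axis=1))
    # h2_hat = corr / (2 * phi) with phi = 1/4 for full siblings.
    return 2 * corr


estimates = simulate_h2_estimates()
frac_above = np.mean(estimates > 0.82)  # share of replicates exceeding the observed value
```

Under these assumptions the estimates are centered close to the true value of 0.5, and `frac_above` gives the proportion of replicates in which chance alone produces an estimate above 0.82.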

To verify if HWD at the causal SNP can lead to a biased heritability estimate, we then conducted additional simulation studies, following the same sib-pair design as mentioned previously. Our goal is to demonstrate that 1) when h² is treated as a nuisance parameter, its estimate based on the model of Eq. 4 can be biased in the presence of HWD; and 2) assuming the true h² is known, the empirical type I error rate of the LMM of Eq. 4 inflates when δ > 0 and deflates when δ < 0.

3.2 Simulated Sib-Pair Data in the Presence of HWD: h ² Estimate Is Biased

Consider a continuous trait Y with h ² = 0.5 and influenced by one causal SNP, G _causal, with minor allele frequency of 0.2 and with HWD factor, δ _causal, ranging from −0.04 to 0.16. A non-associated SNP, G _tested, also has an MAF of 0.2 but with its own δ _tested, which may not be the same as δ _causal in a specific simulation study. The sample size was 65 sibling pairs, chosen to match with the sample size of the cystic fibrosis application study in Section 3.1.
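The range of δ_causal used here coincides with the values of δ for which all three genotype frequencies remain non-negative at an allele frequency of 0.2. The sketch below, with a hypothetical function name, derives these bounds from the genotype-frequency expressions in Section 1.

```python
def delta_range(p):
    """Admissible range of delta for allele frequency p.

    p_aa >= 0 and p_AA >= 0 require delta >= -min(p^2, (1 - p)^2);
    p_Aa >= 0 requires delta <= p(1 - p).
    """
    lower = -min(p ** 2, (1 - p) ** 2)
    upper = p * (1 - p)
    return lower, upper


lower, upper = delta_range(0.2)
```

For p = 0.2 the bounds are −0.04 and 0.16, up to floating-point rounding.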

Most practical implementations of the linear mixed-effect model (Eq. 4) treat h² as a nuisance parameter, and no type I error issue has been reported. Indeed, when h² was estimated in our simulation study conducted in Section 3.3, the test size of T_LMM was correct at the nominal level (black squares in Figure 3 shown in Section 3.3) even if δ_tested ≠ 0 (i.e., out of HWE) and across the range of δ_causal values (from −0.04 to 0.16).

Figure 3. Empirical type I error rate of T_LMM based on the linear mixed-effect model (Eq. 4) against δ_causal. **(A)** When G_tested of tested SNPs is in HWD with δ_tested = 0.06. **(B)** When G_tested of tested SNPs is in HWE with δ_tested = 0. The true heritability of the phenotype is h² = 0.5, the minor allele frequencies p_causal = p_tested = 0.2, and 10,000 independent replicates of phenotypes and genotypes for 65 sibling pairs were simulated for each δ_causal value. The blue circles are for T_LMM using the true heritability h² = 0.5, and the black squares are for T_LMM while estimating h² (results of ${\hat{h}}^{2}$ are shown in Figure 1).

However, in this situation, when h ² is treated as unknown, the impact of HWD is on the estimation of h ². Specifically, Figure 1 shows that ${\hat{h}}^{2}$ is downward biased when δ _causal < 0, and upward biased if δ _causal > 0. The bias can be substantial. For example, when δ _causal = 0.10, the estimated heritability ${\hat{h}}^{2}$ is centered at 0.80 as compared to the true value of 0.5, with a bias of 0.30. Indeed, based on our theoretical insight in Section 2.2, the expected bias is h ² ⋅ δ/(p(1 − p)) = 0.5 ⋅ 0.1/(0.2(1–0.2)) = 0.31.

Figure 1. Box plots of ${\hat{h}}^{2}$, estimated from the linear mixed-effect model (Eq. 4) against δ_causal. The true heritability of the phenotype is h² = 0.5. The minor allele frequencies p_causal = p_tested = 0.2 and 10,000 independent replicates of phenotypes and genotypes for 65 sibling pairs were simulated for each δ_causal value. The empirical type I error rates are shown in Figure 3 as black squares.

In Figure 1, it is notable that ${\hat{h}}^{2}$ can be greater than one. Since h ² is the proportion of variance in Y explained by additive genetic variation, 0 ≤ h ² ≤ 1 by definition. However, if δ _causal ≠ 0, ${\hat{h}}^{2}$ based on the LMM, without additional truncation, is a biased estimate of h ² with a bias of h ² ⋅ δ/(p(1 − p)) for this sib-pair design as shown in Section 2.2; the bias is 0 (i.e., no bias) under HWE when δ = 0.

Additionally, although a larger sample that consists of 5,000 sibling pairs shrinks the variance of the h ² estimate as expected, it does not shrink the bias, as shown in Figure 2. However, we also note that, in practice, it is unlikely to have so many sibling pairs.

Figure 2. Box plots of ${\hat{h}}^{2}$, estimated from the linear mixed-effect model (Eq. 4) against δ_causal using 5,000 sibling pairs. The other parameter values are the same as in Figure 1, where the true heritability of the phenotype is h² = 0.5, the minor allele frequencies p_causal = p_tested = 0.2, and 10,000 independent replicates were simulated for each δ_causal value.

3.3 Simulated Sib-Pair Data in the Presence of HWD: When Using the True h ² Value T _LMM Has Incorrect Test Size

Here, we conducted the association analysis between Y and the non-associated SNP, G _tested, using the LMM model of Eq. 4 but assuming h ² = 0.5 is known.

Figure 3A plots the empirical type I error rates (blue circles) of T _LMM using the true h ² = 0.5, for a nominal level of 0.05, estimated from independently simulated 10,000 replicates for each δ _causal value. (An empirical type I error greater than 0.05 + 3 ⋅ 0.002 = 0.056 can be considered inflated as the standard error of the empirical type I error rate can be estimated as $\sqrt{0.05 \cdot 0.95 / 10000} = 0.002$ .) In Figure 3A, the trend of type I error inflation is clear as δ _causal increases.
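The inflation threshold in the parenthetical remark follows from the binomial standard error of an empirical rejection rate. The sketch below reproduces the arithmetic; the function name and the default multiplier of 3 standard errors are taken as assumptions matching the text.

```python
import math


def type1_error_threshold(alpha, n_rep, k=3):
    """Standard error of an empirical rejection rate and the alpha + k * SE threshold."""
    se = math.sqrt(alpha * (1 - alpha) / n_rep)
    return se, alpha + k * se


se, threshold = type1_error_threshold(alpha=0.05, n_rep=10_000)
```

The unrounded standard error is about 0.00218, giving a threshold of about 0.0565; the value 0.056 quoted in the text results from rounding the standard error to 0.002 before multiplying.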

In Figure 3A, we set δ _tested = 0.06, but we note that the main cause of the type I error issue is δ _causal ≠ 0 when using the LMM of Eq. 4 with h ² = 0.5 plugged in. Indeed, Figure 3B shows that even if G _tested is in HWE (i.e., δ _tested = 0), the problem remains, albeit less severe, as long as δ _causal ≠ 0.

4 Discussion

We used a sib-pair design to demonstrate that the linear mixed-effect model can be problematic in the presence of Hardy–Weinberg disequilibrium at the causal SNP(s). To demonstrate that the LMM-based heritability estimate can be biased, as a proof-of-principle, our simulation study assumed that the phenotype Y has only one causal SNP, which is unrealistic for complex traits. However, the analytical insight shown in Eq. 5 (i.e., bias expected to be $h^{2} \cdot \sum β_{k}^{2} δ_{k} / \sum β_{k}^{2} p_{k} (1 - p_{k})$ ) suggests that the issue discussed here remains relevant in the case of multiple causal SNPs as $\sum β_{k}^{2} δ_{k}$ is unlikely to be zero, even while allowing the signs of δ _k to differ.

Assuming the true heritability h² is known, we also demonstrated the potential type I error issue of the LMM in the presence of HWD using data that consist of related individuals only. In practice, this issue diminishes if the sample includes a large number of independent individuals or the magnitude of HWD at the causal SNP is small. Additionally, in practice, h² is treated as unknown, in which case the type I error rate of the LMM is well controlled; indeed, no increased false positives of the LMM due to HWD have been reported in the literature to the best of our knowledge. However, the estimate of h² can be biased, and upwardly so if δ_causal > 0, as demonstrated in the simulation study in Section 3.2 and seen in the cystic fibrosis application study in Section 3.1. This new observation offers a possible complementary explanation of the “missing heritability” discussed extensively in Maher (2008).

In practice, SNPs out of HWE are typically not analyzed due to concerns for low genotyping quality (Wellcome Trust Case Control Consortium, 2007; Bycroft et al., 2018; Marees et al., 2018). However, the observation made here remains relevant as the heritability estimates in LMM-based models are biased when the causal SNPs are in HWD (which is unknown in practice) but not the tested SNPs. This is also supported by Figure 3B. When there was HWD at the causal SNP (e.g., δ _causal = 0.10 on the X-axis), there was a type I error issue even if there was no HWD at the tested SNP (i.e., δ _tested = 0). Conversely, Figure 3A shows that if there was no HWD at the causal SNP (i.e., δ _causal = 0 on the X-axis), then the test is accurate even if there was HWD at the tested SNP (δ _tested = 0.06).

Additionally, the HWE-based screening practice itself can be called into question because a truly associated SNP is often in HWD (Wittke-Thompson et al., 2005; Ryckman and Williams, 2008; Turner et al., 2011). The potential of leveraging the HWD expected at a causal SNP to increase the power of association testing has been explored by several groups (Song and Elston, 2006; Wang and Shete, 2008; Zhang and Sun, 2020).

We have not examined the implication of HWD combined with linkage disequilibrium (LD) (Weir, 2008) on the LMM, which is an important future research question. Additionally, recent work has shown that dominant genetic effect could complicate the LD measure and interpretation (Palmer et al., 2021), which in turn could affect our examination of the effect of HWD on the LMM.

Although the linear mixed-effect model is a popular and powerful method for GWAS, conceptually, the use of kinship coefficient matrix (i.e., Σ_Φ), derived from G, as part of the variance–covariance matrix (i.e., Σ_y) of the LMM can be problematic because the response variable Y is the phenotype of interest. An alternative approach is to reverse the roles of Y and G in the regression model. Indeed, O’Reilly et al. (2012) proposed MultiPhen, a method that treats the genotype G of an SNP as the response variable and phenotype values Y of multiple traits as predictors, and uses an ordinal logistic regression applicable to independent samples. More recently, Zhang (2021) (Chapter 2) proposed a generalized reverse (or retrospective) regression model that can be applied to dependent samples, which takes the form of

g = α 1 + β y + γ z + ϵ, ϵ \sim N (0, σ^{2} Σ_{g}), σ^{2} Σ_{g} = σ^{2} Σ_{Φ} + Σ_{δ},

(6)

where Σ_Φ is the kinship coefficient matrix as defined earlier and Σ_δ is a function of δ that explicitly models the amount of HWD; the use of a linear model for the discrete genotype data G is motivated by the work of Chen (1983).

Interestingly, if the variance–covariance matrices in Eqs. 4 and 6 were the same, the resulting score test statistics would also be the same. However, conceptually, the model of Eq. 6 correctly uses the kinship coefficient matrix to model the response variable G, in contrast to the LMM of Eq. 4. Specifically, at a tested SNP, as the reverse regression is conditional on Y, the variance–covariance matrix only concerns G_tested, that is, Σ_g. The modeling and estimation of Σ_g can account for potential HWD through Σ_δ, in addition to the genetic correlation captured by the kinship coefficient matrix Σ_Φ, resulting in a more robust association test for related individuals. Indeed, when the method was applied to the same simulated sib-pair data in Section 3.3, it had correct type I error control [results shown in Figure 2.2 of Chapter 2 of Zhang (2021)]. However, how to model gene–environment interaction through the reverse regression framework remains an open question.

Acknowledgments

We thank Dr. Lisa J. Strug and acknowledge her laboratory for providing the cystic fibrosis genotype data. LZ was a trainee and funding recipient of the CANSSI Ontario STAGE (Strategic Training for Advanced Genetic Epidemiology) program at the University of Toronto.

Data Availability Statement

The data analyzed in this study are subject to the following licenses/restrictions: The CF application data are available by application to the Cystic Fibrosis Canada National data registry for researchers who meet the criteria for access to confidential clinical data for the purpose of CF research. Requests to access these datasets should be directed to cfregistry@cysticfibrosis.ca.

Ethics Statement

The studies involving human participants were reviewed and approved by the Canadian Gene Modifier Study (CGMS), the Research Ethics Board of the Hospital for Sick Children (# 0020020214 from 2012–2019 and #1000065760 from 2019-present), and all participating subsites. Written informed consent to participate in this study was provided by the participants’ legal guardian/next of kin.

Author Contributions

LZ and LS proposed the method and wrote the manuscript. LZ performed the analysis. LS obtained the application data and funding.

Funding

This research was funded by the Natural Sciences and Engineering Research Council of Canada (NSERC; RGPIN-04934 and RGPAS-522594).

Conflict of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s Note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors, and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

Bycroft C., Freeman C., Petkova D., Band G., Elliott L. T., Sharp K., et al. (2018). The uk Biobank Resource with Deep Phenotyping and Genomic Data. Nature 562, 203–209. 10.1038/s41586-018-0579-z [DOI] [PMC free article] [PubMed] [Google Scholar]
Chen C.-F. (1983). Score Tests for Regression Models. J. Am. Stat. Assoc. 78, 158–161. 10.1080/01621459.1983.10477945 [DOI] [Google Scholar]
Chen G.-B. (2014). Estimating Heritability of Complex Traits from Genome-wide Association Studies Using IBS-Based Hasemanâ€"Elston Regression. Front. Genet. 5, 107. 10.3389/fgene.2014.00107 [DOI] [PMC free article] [PubMed] [Google Scholar]
Derkach A., Lawless J. F., Sun L. (2015). Score Tests for Association under Response-dependent Sampling Designs for Expensive Covariates. Biometrika 102, 988–994. 10.1093/biomet/asv038 [DOI] [Google Scholar]
Dimitromanolakis A., Paterson A. D., Sun L. (2019). Fast and Accurate Shared Segment Detection and Relatedness Estimation in Un-phased Genetic Data via Truffle. Am. J. Hum. Genet. 105, 78–88. 10.1016/j.ajhg.2019.05.007 [DOI] [PMC free article] [PubMed] [Google Scholar]
Eu-Ahsunthornwattana J., Miller E. N., Fakiola M., Jeronimo S. M. B., Blackwell J. M., Cordell H. J., et al. (2014). Comparison of Methods to Account for Relatedness in Genome-wide Association Studies with Family-Based Data. Plos Genet. 10, e1004445. 10.1371/journal.pgen.1004445 [DOI] [PMC free article] [PubMed] [Google Scholar]
Falconer D. S. (1985). A Note on Fisher's 'average Effect' and 'average Excess'. Genet. Res. 46, 337–347. 10.1017/s0016672300022825 [DOI] [PubMed] [Google Scholar]
Hardy G. H. (1908). Mendelian Proportions in a Mixed Population. Science 28, 49–50. 10.1126/science.28.706.49 [DOI] [PubMed] [Google Scholar]
Maher B. (2008). Personal Genomes: The Case of the Missing Heritability. Nature 456, 18–21. 10.1038/456018a [DOI] [PubMed] [Google Scholar]
Manolio T. A., Collins F. S., Cox N. J., Goldstein D. B., Hindorff L. A., Hunter D. J., et al. (2009). Finding the Missing Heritability of Complex Diseases. Nature 461, 747–753. 10.1038/nature08494 [DOI] [PMC free article] [PubMed] [Google Scholar]
Marees A. T., de Kluiver H., Stringer S., Vorspan F., Curis E., Marie-Claire C., et al. (2018). A Tutorial on Conducting Genome-wide Association Studies: Quality Control and Statistical Analysis. Int. J. Methods Psychiatr. Res. 27, e1608. 10.1002/mpr.1608 [DOI] [PMC free article] [PubMed] [Google Scholar]
O'Reilly P. F., Hoggart C. J., Pomyen Y., Calboli F. C., Elliott P., Jarvelin M. R., et al. (2012). Multiphen: Joint Model of Multiple Phenotypes Can Increase Discovery in Gwas. PLoS One 7, e34861. 10.1371/journal.pone.0034861 [DOI] [PMC free article] [PubMed] [Google Scholar]
Palmer D. S., Zhou W., Abbott L., Baya N., Churchhouse C., Seed C., et al. (2021). Analysis of Genetic Dominance in the uk Biobank. bioRxiv [DOI] [PMC free article] [PubMed] [Google Scholar]
Powell J. E., Visscher P. M., Goddard M. E. (2010). Reconciling the Analysis of Ibd and Ibs in Complex Trait Studies. Nat. Rev. Genet. 11, 800–805. 10.1038/nrg2865 [DOI] [PubMed] [Google Scholar]
Ryckman K., Williams S. M. (2008). Calculation and Use of the hardy-weinberg Model in Association Studies. Curr. Protoc. Hum. Genet. Chapter 1, Unit–18. 10.1002/0471142905.hg0118s57 [DOI] [PubMed] [Google Scholar]
Sasieni P. D. (1997). From Genotypes to Genes: Doubling the Sample Size. Biometrics 53, 1253–1261. 10.2307/2533494 [DOI] [PubMed] [Google Scholar]
Schaid D. J., Jacobsen S. J. (1999). Blased Tests of Association: Comparisons of Allele Frequencies when Departing from Hardy-Weinberg Proportions. Am. J. Epidemiol. 149, 706–711. 10.1093/oxfordjournals.aje.a009878 [DOI] [PubMed] [Google Scholar]
Song K., Elston R. C. (2006). A Powerful Method of Combining Measures of Association and Hardy-Weinberg Disequilibrium for fine-mapping in Case-Control Studies. Statist. Med. 25, 105–126. 10.1002/sim.2350 [DOI] [PubMed] [Google Scholar]
Sun L., Dimitromanolakis A., Chen W.-M. (2017). “Identifying Cryptic Relationships,” in Statistical Human Genetics (Springer; ), 45–60. 10.1007/978-1-4939-7274-6_4 [DOI] [PubMed] [Google Scholar]
Sun L., Rommens J. M., Corvol H., Li W., Li X., Chiang T. A., et al. (2012). Multiple Apical Plasma Membrane Constituents Are Associated with Susceptibility to Meconium Ileus in Individuals with Cystic Fibrosis. Nat. Genet. 44, 562–569. 10.1038/ng.2221 [DOI] [PMC free article] [PubMed] [Google Scholar]
Turner S., Armstrong L. L., Bradford Y., Carlson C. S., Crawford D. C., Crenshaw A. T., et al. (2011). Quality Control Procedures for Genome-wide Association Studies. Curr. Protoc. Hum. Genet. Chapter 1, Unit1–19. 10.1002/0471142905.hg0119s68 [DOI] [PMC free article] [PubMed] [Google Scholar]
Vanscoy L. L., Blackman S. M., Collaco J. M., Bowers A., Lai T., Naughton K., et al. (2007). Heritability of Lung Disease Severity in Cystic Fibrosis. Am. J. Respir. Crit. Care Med. 175, 1036–1043. 10.1164/rccm.200608-1164oc
Visscher P. M., Hill W. G., Wray N. R. (2008). Heritability in the Genomics Era - Concepts and Misconceptions. Nat. Rev. Genet. 9, 255–266. 10.1038/nrg2322
Visscher P. M., Medland S. E., Ferreira M. A. R., Morley K. I., Zhu G., Cornes B. K., et al. (2006). Assumption-free Estimation of Heritability from Genome-wide Identity-By-Descent Sharing between Full Siblings. PLoS Genet. 2, e41. 10.1371/journal.pgen.0020041
Wang J., Shete S. (2008). A Test for Genetic Association that Incorporates Information about Deviation from Hardy-Weinberg Proportions in Cases. Am. J. Hum. Genet. 83, 53–63. 10.1016/j.ajhg.2008.06.010
Weinberg W. (1908). On the Demonstration of Heredity in Man. Reprinted in Papers on Human Genetics (1963).
Weir B. (1996). Genetic Data Analysis II: Methods for Discrete Population Genetic Data. Sunderland, Massachusetts: Sinauer Series (Sinauer).
Weir B. S. (2008). Linkage Disequilibrium and Association Mapping. Annu. Rev. Genom. Hum. Genet. 9, 129–142. 10.1146/annurev.genom.9.081307.164347
Wellcome Trust Case Control Consortium (2007). Genome-wide Association Study of 14,000 Cases of Seven Common Diseases and 3,000 Shared Controls. Nature 447, 661–678. 10.1038/nature05911
Wittke-Thompson J. K., Pluzhnikov A., Cox N. J. (2005). Rational Inferences about Departures from Hardy-Weinberg Equilibrium. Am. J. Hum. Genet. 76, 967–986. 10.1086/430507
Wright F. A., Strug L. J., Doshi V. K., Commander C. W., Blackman S. M., Sun L., et al. (2011). Genome-wide Association and Linkage Identify Modifier Loci of Lung Disease Severity in Cystic Fibrosis at 11p13 and 20q13.2. Nat. Genet. 43, 539–546. 10.1038/ng.838
Yang J., Lee S. H., Goddard M. E., Visscher P. M. (2011). GCTA: A Tool for Genome-wide Complex Trait Analysis. Am. J. Hum. Genet. 88, 76–82. 10.1016/j.ajhg.2010.11.011
Zhang L. (2021). A General Study of Genetic Association Tests and the Test of Hardy–Weinberg Equilibrium. Ph.D. thesis. University of Toronto.
Zhang L., Sun L. (2021). A Generalized Robust Allele-Based Genetic Association Test. Biometrics.
Zhang L., Sun L. (2020). Leveraging Hardy-Weinberg Disequilibrium for Association Testing in Case-Control Studies. bioRxiv.
