Letter
Genetics. 2015 Nov 19;202(1):363–366. doi: 10.1534/genetics.115.177212

Misspecification in Mixed-Model-Based Association Analysis

Willem Kruijer
PMCID: PMC4701099  PMID: 26584900

Abstract

Additive genetic variance in natural populations is commonly estimated using mixed models, in which the covariance of the genetic effects is modeled by a genetic similarity matrix derived from a dense set of markers. An important but usually implicit assumption is that the presence of any nonadditive genetic effect increases only the residual variance and does not affect estimates of additive genetic variance. Here we show that this is true only for panels of unrelated individuals. In the case that there is genetic relatedness, the combination of population structure and epistatic interactions can lead to inflated estimates of additive genetic variance.

Keywords: misspecification, epistasis, nonadditive genetic variance, missing heritability


Mixed models with random genetic effects have become an important tool for studying the genetic architecture of complex traits. The covariance of the genetic effects is assumed to be proportional to a genetic similarity matrix (GSM) based on a dense set of markers, which is equivalent to assuming additive effects for each standardized marker score. Under several additional assumptions, such as constant linkage disequilibrium, this gives unbiased estimates of additive genetic variance and narrow-sense heritability (Yang et al. 2010; Speed et al. 2012; Lee and Chow 2014; Speed and Balding 2015). The sampling variance of such heritability estimators has been studied in Visscher and Goddard (2014) and Kruijer et al. (2015). These results are, however, derived under the assumption that the model is correct, i.e., contains the true distribution of the data. Here we consider situations where this is not the case and argue that potential sources of bias may be identified by computing the parameter value $\tilde{\theta}$ that minimizes the Kullback–Leibler (KL) divergence $\mathrm{KL}(Q, P_\theta) = \int \log(Q/P_\theta)\,dQ$ with respect to the true distribution $Q$. For $n$-dimensional Gaussian distributions $P = N(0, \Sigma_1)$ and $Q = N(0, \Sigma_0)$, the KL divergence equals

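$$\mathrm{KL}(Q, P) = \frac{1}{2}\left(\operatorname{tr}\left(\Sigma_1^{-1}\Sigma_0\right) + \log\left(|\Sigma_1| / |\Sigma_0|\right) - n\right).$$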
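The closed form above is straightforward to evaluate numerically. The following is a minimal sketch in Python, assuming NumPy is available; the function name `gaussian_kl` and the two toy covariance matrices are illustrative and do not come from the paper.

```python
import numpy as np


def gaussian_kl(sigma0, sigma1):
    """KL divergence of Q = N(0, sigma0) from P = N(0, sigma1)."""
    n = sigma0.shape[0]
    # slogdet avoids overflow/underflow of the raw determinant.
    sign0, logdet0 = np.linalg.slogdet(sigma0)
    sign1, logdet1 = np.linalg.slogdet(sigma1)
    if sign0 <= 0 or sign1 <= 0:
        raise ValueError("covariance matrices must be positive definite")
    # tr(sigma1^{-1} sigma0) via a linear solve instead of an explicit inverse.
    trace_term = np.trace(np.linalg.solve(sigma1, sigma0))
    return 0.5 * (trace_term + logdet1 - logdet0 - n)


identity = np.eye(2)
kl_same = gaussian_kl(identity, identity)         # Q and P coincide
kl_wider = gaussian_kl(identity, 2.0 * identity)  # P has twice the variance of Q
```

The divergence is zero when the two covariance matrices coincide and positive otherwise; it is not symmetric in $Q$ and $P$, so the order of the arguments matters.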

It is a well-known fact from statistics that in the case of misspecification, i.e., when $Q$ is not contained in the model $\{P_\theta : \theta \in \Theta\}$, the maximum-likelihood (ML) estimator converges to $\tilde{\theta}$ (Huber 1967; White 1982). Here we investigate misspecification in a mixed-model context, the covariance of the data being misspecified due to infinitesimal interactions or other nonadditive effects. We consider three different scenarios (A–C) and in each of them three different values of additive and nonadditive genetic variance. The total phenotypic variance is assumed to be known and equal to 1.

In scenario A, the phenotype $Y = (Y_1, \ldots, Y_n)$ of $n$ individuals is modeled using the multivariate normal distribution

$$P_{\sigma_A^2, \sigma_E^2} = N\left(0, \sigma_A^2 K + \sigma_E^2 I_n\right), \tag{1}$$

where $K$ is a marker-based GSM, $I_n$ is the identity matrix, $\sigma_A^2 \in [0, 1]$ is the additive genetic variance, and $\sigma_E^2 = 1 - \sigma_A^2$ is the residual variance. We assume, however, that $Q$, the actual distribution of $Y$, is the zero-mean normal distribution with covariance $0.4K + 0.2(K \circ K) + 0.4I_n$, $K \circ K$ being the Hadamard (entry-wise) product. The matrix $K \circ K$ is the covariance due to small epistatic interactions between all standardized marker scores (Supporting Information, File S1; see also Jiang and Reif 2015). Hence, the narrow- and broad-sense heritabilities are equal to 0.4 and 0.6, respectively. In addition to this genetic architecture, we also consider the cases where the covariance matrix of $Y$ is $0.2K + 0.1(K \circ K) + 0.7I_n$ (i.e., $h^2 = 0.2$ and $H^2 = 0.3$) and $0.6K + 0.3(K \circ K) + 0.1I_n$ (i.e., $h^2 = 0.6$ and $H^2 = 0.9$).

For all these genetic architectures, $K \circ K$ does not equal the identity matrix $I_n$, and $Q$ is therefore not contained in model (1). Hence, the ML estimator will not recover $Q$, but will instead converge to the point $(\tilde{\sigma}_A^2, \tilde{\sigma}_E^2)$ minimizing the KL divergence $\mathrm{KL}(Q, P_{\sigma_A^2, \sigma_E^2})$. For genetic similarity matrices derived from published data in maize, rice, and Arabidopsis, $\tilde{\sigma}_A^2$ ranges between 0.47 and 0.53, given a true value of 0.4 (Table 1). Similar bias occurs when $\sigma_A^2 = 0.2$ and $\sigma_A^2 = 0.6$. Hence, the presence of epistatic interactions leads to inflated estimates of additive genetic variance. For a panel of simulated unrelated individuals, $\tilde{\sigma}_A^2$ equals the true value of $\sigma_A^2$, which is due to the much smaller off-diagonal elements of $K$, making $K \circ K$ almost indistinguishable from $I_n$.
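The grid minimization for scenario A can be sketched as follows. This is an illustration under stated assumptions, not the code used for Table 1: the GSM is computed as $K = XX^\top/p$ from column-standardized marker scores (a common convention, assumed here), the panel is simulated from independent markers with allele frequencies drawn uniformly from [0.1, 0.9], and the names `simulate_gsm`, `gaussian_kl`, and `minimize_kl_model1` are hypothetical.

```python
import numpy as np


def simulate_gsm(n, p, rng):
    """GSM K = X X^T / p from standardized scores of p independent markers."""
    freqs = rng.uniform(0.1, 0.9, size=p)
    geno = rng.binomial(2, freqs, size=(n, p)).astype(float)
    geno = geno[:, geno.std(axis=0) > 0]  # drop monomorphic markers
    x = (geno - geno.mean(axis=0)) / geno.std(axis=0)
    return x @ x.T / x.shape[1]


def gaussian_kl(sigma0, sigma1, tol=1e-10):
    """KL of N(0, sigma0) from N(0, sigma1); infinite if sigma1 is singular."""
    if np.linalg.eigvalsh(sigma1)[0] <= tol:
        return np.inf
    n = sigma0.shape[0]
    _, logdet0 = np.linalg.slogdet(sigma0)
    _, logdet1 = np.linalg.slogdet(sigma1)
    trace_term = np.trace(np.linalg.solve(sigma1, sigma0))
    return 0.5 * (trace_term + logdet1 - logdet0 - n)


def minimize_kl_model1(sigma0, k):
    """Grid search over sigma_A^2 in model (1), with sigma_E^2 = 1 - sigma_A^2."""
    eye = np.eye(k.shape[0])
    grid = np.arange(101) / 100.0  # 0, 0.01, ..., 1
    kl = [gaussian_kl(sigma0, a * k + (1.0 - a) * eye) for a in grid]
    return grid[int(np.argmin(kl))]


rng = np.random.default_rng(1)
k = simulate_gsm(n=150, p=2000, rng=rng)
eye = np.eye(k.shape[0])

# Q inside model (1): the KL minimizer is the true value of sigma_A^2.
a_correct = minimize_kl_model1(0.4 * k + 0.6 * eye, k)

# Scenario A: Q contains the epistatic term K∘K (entry-wise product k * k).
a_scenario_a = minimize_kl_model1(0.4 * k + 0.2 * (k * k) + 0.4 * eye, k)
```

In this toy panel the markers are independent and the individuals unrelated, so the off-diagonal entries of `k` are small and `a_scenario_a` should stay close to 0.4. Passing a GSM computed from a structured panel instead is the setting in which the inflation reported in Table 1 would be expected to appear.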

Table 1.

Values of the additive genetic variance ($\tilde{\sigma}_A^2$) minimizing the Kullback–Leibler divergence KL(Q, P) with respect to the true distribution (Q) of scenarios A–C, with P contained in models (1)–(3)

| Population/source | Species | Size (n) | Scenario A | Scenario B | Scenario C |
|---|---|---|---|---|---|
| $\sigma_A^2 = 0.2$, $H^2 = 0.3$ | | | | | |
| Swedish regmap | Arabidopsis thaliana | 298 | 0.26 (0.111) | 0.27 (0.052) | 0.26 (0.076) |
| Hapmap | A. thaliana | 350 | 0.23 (0.164) | 0.29 (0.048) | 0.23 (0.116) |
| Van Heerwaarden et al. (2012) | Zea mays | 400 | 0.25 (0.096) | 0.27 (0.045) | 0.25 (0.066) |
| Zhao et al. (2011) | Oryza sativa | 413 | 0.26 (0.075) | 0.25 (0.044) | 0.26 (0.060) |
| Unrelated individuals | Simulated | 3000 | 0.20 (0.067) | | |
| $\sigma_A^2 = 0.4$, $H^2 = 0.6$ | | | | | |
| Swedish regmap | A. thaliana | 298 | 0.53 (0.101) | 0.58 (0.034) | 0.53 (0.075) |
| Hapmap | A. thaliana | 350 | 0.47 (0.176) | 0.60 (0.030) | 0.48 (0.136) |
| Van Heerwaarden et al. (2012) | Z. mays | 400 | 0.50 (0.092) | 0.58 (0.029) | 0.50 (0.069) |
| Zhao et al. (2011) | O. sativa | 413 | 0.51 (0.098) | 0.52 (0.032) | 0.50 (0.102) |
| Unrelated individuals | Simulated | 3000 | 0.40 (0.066) | | |
| $\sigma_A^2 = 0.6$, $H^2 = 0.9$ | | | | | |
| Swedish regmap | A. thaliana | 298 | 0.78 (0.086) | 0.89 (0.011) | 0.77 (0.083) |
| Hapmap | A. thaliana | 350 | 0.71 (0.160) | 0.90 (0.008) | 0.71 (0.149) |
| Van Heerwaarden et al. (2012) | Z. mays | 400 | 0.75 (0.079) | 0.88 (0.019) | 0.66 (0.163) |
| Zhao et al. (2011) | O. sativa | 413 | 0.73 (0.156) | 0.77 (0.162) | 0.57 (0.494) |
| Unrelated individuals | Simulated | 3000 | 0.60 (0.062) | | |

Minimization was performed by evaluating the KL divergence on the grid 0, 0.01, …, 1 for all variance components, under the constraint that they sum to one. Standard errors (in parentheses) were calculated as the square root of the asymptotic variance (White 1982, theorem 3.2). Five populations were considered: the Arabidopsis Hapmap and Swedish regmap (Horton et al. 2012; Kruijer et al. 2015), the rice population from Zhao et al. (2011), the maize population of van Heerwaarden et al. (2012), and a simulated population (File S1). In scenarios B and C there are r = 2 replicates of each genotype.

In scenario B, a plant trait is phenotyped on $r$ genetically identical replicates. Following Kruijer et al. (2015), the observations $Y = (Y_{11}, \ldots, Y_{nr})$ are modeled by the normal distribution

$$P_{\sigma_A^2, \sigma_E^2} = N\left(0, \sigma_A^2 ZKZ^\top + \sigma_E^2 I_{nr}\right), \tag{2}$$

$Z$ being an incidence matrix assigning plants to genotypes. The true distribution $Q$ is multivariate normal with covariance $0.4ZKZ^\top + 0.2ZZ^\top + 0.4I_{nr}$; i.e., there are nonadditive (not necessarily epistatic) effects with independent $N(0, 0.2)$ distributions. Such effects could be due to, for example, genotype–environment interaction. As in scenario A, we also consider a genetic architecture with $h^2 = 0.2$ and $H^2 = 0.3$ (i.e., covariance $0.2ZKZ^\top + 0.1ZZ^\top + 0.7I_{nr}$) and a genetic architecture with $h^2 = 0.6$ and $H^2 = 0.9$. In contrast to model (1) (where $Z = I_n$ and $r = 1$), $ZZ^\top$ is different from $I_{nr}$, and $Q$ is not contained in model (2). Again, the value $\tilde{\sigma}_A^2$ minimizing KL divergence is substantially larger than the true value (Table 1), and additive genetic variance will tend to be overestimated. Intuitively, this is because the block structure $ZZ^\top$ is better captured by $ZKZ^\top$ than by the diagonal residual.
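The intuition about the block structure can be made concrete in the simplest possible setting. The sketch below assumes $K = I_n$ (unrelated genotypes, a deliberately idealized choice that is not one of the panels in Table 1) and $r = 2$ replicates stored consecutively; the names `replicate_incidence`, `gaussian_kl`, and `minimize_kl_model2` are hypothetical.

```python
import numpy as np


def gaussian_kl(sigma0, sigma1, tol=1e-10):
    """KL of N(0, sigma0) from N(0, sigma1); infinite if sigma1 is singular."""
    if np.linalg.eigvalsh(sigma1)[0] <= tol:
        return np.inf
    n = sigma0.shape[0]
    _, logdet0 = np.linalg.slogdet(sigma0)
    _, logdet1 = np.linalg.slogdet(sigma1)
    trace_term = np.trace(np.linalg.solve(sigma1, sigma0))
    return 0.5 * (trace_term + logdet1 - logdet0 - n)


def replicate_incidence(n, r):
    """Incidence matrix Z of shape (n*r, n): r consecutive plants per genotype."""
    return np.kron(np.eye(n), np.ones((r, 1)))


def minimize_kl_model2(sigma0, zkz):
    """Grid search over sigma_A^2 in model (2), with sigma_E^2 = 1 - sigma_A^2."""
    eye = np.eye(zkz.shape[0])
    grid = np.arange(101) / 100.0  # 0, 0.01, ..., 1
    kl = [gaussian_kl(sigma0, a * zkz + (1.0 - a) * eye) for a in grid]
    return grid[int(np.argmin(kl))]


n, r = 60, 2
z = replicate_incidence(n, r)
zz = z @ z.T          # block-diagonal: ones within a genotype, zeros elsewhere
eye = np.eye(n * r)

k = np.eye(n)         # idealized GSM for unrelated genotypes (assumption)
zkz = z @ k @ z.T

# Scenario B truth: additive 0.4, independent nonadditive 0.2, residual 0.4.
sigma0 = 0.4 * zkz + 0.2 * zz + 0.4 * eye
a_tilde = minimize_kl_model2(sigma0, zkz)
```

With $K = I_n$ the true covariance is $0.6ZZ^\top + 0.4I_{nr}$, which lies inside model (2) at $\sigma_A^2 = 0.6$, so in this idealized case the entire nonadditive variance of 0.2 is attributed to the additive component. With a GSM from a real panel the minimizer depends on $K$ as well, as the scenario B column of Table 1 shows.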

Scenario C is a combination of scenarios A and B. To avoid the misspecification occurring in scenario B, the model

$$P_{\sigma_A^2, \sigma_G^2, \sigma_E^2} = N\left(0, \sigma_A^2 ZKZ^\top + \sigma_G^2 ZZ^\top + \sigma_E^2 I_{nr}\right) \tag{3}$$

is considered, extending (2) with independent nonadditive effects. This model has been used in the analysis of field trials (Oakey et al. 2006, 2007), as well as genomic prediction (Gianola and van Kaam 2008; Howard et al. 2014; Jarquin et al. 2014). If in fact the nonadditive effects have covariance $K \circ K$ (as in scenario A) and $\sigma_A^2 = 0.4$, the data have covariance $0.4ZKZ^\top + 0.2Z(K \circ K)Z^\top + 0.4I_{nr}$. As in scenarios A and B, the $\tilde{\sigma}_A^2$ minimizing KL divergence is larger than the true value (Table 1), except for the rice population of Zhao et al. (2011) with $H^2 = 0.9$. In the latter case, $\tilde{\sigma}_E^2$ was on average 0.14, while its bias was at most 0.01 for all other populations and heritability levels.
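A grid search for model (3) has to range over three variance components that sum to one. The sketch below is illustrative only: it uses a coarser step of 0.05 than the 0.01 grid described for Table 1 to keep the run time short, a small simulated panel with $K = XX^\top/p$ from standardized marker scores (an assumed convention), and hypothetical names `simulate_gsm`, `gaussian_kl`, and `minimize_kl_model3`.

```python
import numpy as np
from itertools import product


def simulate_gsm(n, p, rng):
    """GSM K = X X^T / p from standardized scores of p independent markers."""
    freqs = rng.uniform(0.1, 0.9, size=p)
    geno = rng.binomial(2, freqs, size=(n, p)).astype(float)
    geno = geno[:, geno.std(axis=0) > 0]  # drop monomorphic markers
    x = (geno - geno.mean(axis=0)) / geno.std(axis=0)
    return x @ x.T / x.shape[1]


def gaussian_kl(sigma0, sigma1, tol=1e-10):
    """KL of N(0, sigma0) from N(0, sigma1); infinite if sigma1 is singular."""
    if np.linalg.eigvalsh(sigma1)[0] <= tol:
        return np.inf
    n = sigma0.shape[0]
    _, logdet0 = np.linalg.slogdet(sigma0)
    _, logdet1 = np.linalg.slogdet(sigma1)
    trace_term = np.trace(np.linalg.solve(sigma1, sigma0))
    return 0.5 * (trace_term + logdet1 - logdet0 - n)


def minimize_kl_model3(sigma0, zkz, zz, steps=20):
    """Grid search over (sigma_A^2, sigma_G^2, sigma_E^2) summing to one."""
    eye = np.eye(zkz.shape[0])
    best, best_kl = None, np.inf
    for i, j in product(range(steps + 1), repeat=2):
        if i + j > steps:
            continue  # the three components must sum to one
        a, g, e = i / steps, j / steps, (steps - i - j) / steps
        kl = gaussian_kl(sigma0, a * zkz + g * zz + e * eye)
        if kl < best_kl:
            best, best_kl = (a, g, e), kl
    return best


rng = np.random.default_rng(7)
n, r = 40, 2
k = simulate_gsm(n, 500, rng)
z = np.kron(np.eye(n), np.ones((r, 1)))  # r consecutive plants per genotype
zkz, zz, eye = z @ k @ z.T, z @ z.T, np.eye(n * r)

# Scenario B truth: inside model (3), so the true components are recovered.
fit_correct = minimize_kl_model3(0.4 * zkz + 0.2 * zz + 0.4 * eye, zkz, zz)

# Scenario C truth: nonadditive effects with covariance K∘K, outside model (3).
sigma0_c = 0.4 * zkz + 0.2 * (z @ (k * k) @ z.T) + 0.4 * eye
fit_scenario_c = minimize_kl_model3(sigma0_c, zkz, zz)
```

The first call shows why model (3) removes the misspecification of scenario B: the independent nonadditive term $ZZ^\top$ now has its own variance component. The second call gives the KL minimizer when the nonadditive effects are epistatic; how far its first entry departs from 0.4 depends on the structure of the GSM that is supplied.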

In addition to the minimization of KL divergence we analyzed ML estimates for simulated traits, in which case the phenotypic variance is unknown (File S2). For most populations and heritability levels, the bias of additive genetic variance estimates ($\tilde{\sigma}_A^2$) is similar to what was found by minimizing KL divergence in models (1)–(3). Differences are largest for the population of Zhao et al. (2011), where the total phenotypic variance is consistently overestimated.

The bias we identified here by statistical arguments and simulations has important implications, in particular for immortal populations, for which genetically identical replicates are available (e.g., Arabidopsis thaliana, agronomic crops, bacteria, and fungi). Typically there is strong population structure and often only several hundreds of different genotypes are phenotyped. One can analyze such data at the individual level [model (2)] or at the level of genotypic means [model (1), with $\sigma_E^2$ divided by the number of replicates]. Kruijer et al. (2015) showed that in the latter type of analysis, standard errors of heritability estimates can be huge, and recommended model (2) for both heritability estimation and genomic prediction. Here we have shown that in the presence of nonadditive effects, this model is likely to overestimate additive genetic variance. If, however, the nonadditive effects are due to epistatic interactions, analysis at the genotypic means level [model (1)] will, apart from the large sampling variance, also give inflated estimates of additive genetic variance. This is a rather realistic scenario, since epistasis may be an important part of the genetic architecture (Mackay 2014), and several other types of nonadditive effects can be ruled out or minimized for immortal populations: e.g., genotype–environment interactions are unlikely in homogeneous controlled environments with adequate randomization, and dominance effects are absent when using inbred lines. Inflated heritability estimates may also affect the performance of G-BLUP, although the loss in accuracy is considerably smaller than in the case where heritability is underestimated (Kruijer et al. 2015).
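The division of the residual variance by the number of replicates in the genotypic-means analysis follows directly from model (2). A short derivation, assuming a balanced design with exactly $r$ replicates per genotype (so that $Z^\top Z = r I_n$) and writing $\bar{Y}$ for the vector of genotypic means (notation introduced here for illustration):

$$
\bar{Y} = \frac{1}{r} Z^\top Y, \qquad
\operatorname{Cov}(\bar{Y}) = \frac{1}{r^2} Z^\top \left(\sigma_A^2 ZKZ^\top + \sigma_E^2 I_{nr}\right) Z
= \sigma_A^2 K + \frac{\sigma_E^2}{r} I_n .
$$

Under these assumptions the means follow model (1) with the additive component unchanged and the residual variance reduced by a factor $r$.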

Interestingly, the inflation of additive genetic variance is not due to any nonlinearity or absence of main effects (as in, e.g., Culverhouse et al. 2002; Song et al. 2010; Zuk et al. 2012), but rather to the population structure present in the epistatic GSM, which to some extent resembles the structure of the GSM for the additive effects. At the same time, it is this structure that makes the epistatic GSM distinguishable from the diagonal error. This suggests that epistatic interactions are easier to model in structured populations; i.e., sampling variance of epistatic variance components may not be as large as in unstructured human populations (Yang et al. 2011). Expressions for the asymptotic variance in a model with both additive and epistatic effects (File S3) indicate that this is indeed the case. More generally, the inflation of heritability estimates due to misspecification illustrates the difficulty of modeling and estimating genetic effects. As recently pointed out by Speed and Balding (2015) this is already challenging for the additive genetic effects, in the sense that depending on the genetic architecture, different GSMs may be appropriate. Indeed, the potential bias resulting from an inappropriate GSM could be assessed by evaluating KL divergence with respect to the true model, as is the case for alternatives for the epistatic GSM considered here.

Acknowledgments

I thank two anonymous reviewers for their constructive comments that helped to improve the manuscript. Martin Boer and Fred van Eeuwijk are acknowledged for useful discussions. The research leading to these results has been conducted as part of the project DROught-tolerant yielding PlantS (DROPS), which received funding from the European Community’s Seventh Framework Programme (FP7/2007-2013) under grant agreement 244374. This research was also funded by the Learning from Nature project of the Dutch Technology Foundation, which is part of the Netherlands Organisation for Scientific Research.

Footnotes

Communicating editor: A. H. Paterson

Supporting information is available online at www.genetics.org/lookup/suppl/doi:10.1534/genetics.115.177212/-/DC1.

Literature Cited

  1. Culverhouse R., Suarez B. K., Lin J., Reich T., 2002. A perspective on epistasis: limits of models displaying no main effect. Am. J. Hum. Genet. 70: 461–471.
  2. Gianola D., van Kaam J. B. C. H. M., 2008. Reproducing kernel Hilbert spaces regression methods for genomic assisted prediction of quantitative traits. Genetics 178: 2289–2303.
  3. Horton M. W., Hancock A. M., Huang Y. S., Toomajian C., Atwell S., et al., 2012. Genomewide patterns of genetic variation in worldwide Arabidopsis thaliana accessions from the RegMap panel. Nat. Genet. 44: 212–216.
  4. Howard R., Carriquiry A. L., Beavis W. D., 2014. Parametric and nonparametric statistical methods for genomic selection of traits with additive and epistatic genetic architectures. G3 4: 1027–1046.
  5. Huber P. J., 1967. The behavior of maximum likelihood estimates under nonstandard conditions, pp. 221–233 in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1. University of California Press, Berkeley, CA.
  6. Jarquin D., Crossa J., Lacaze X., Du Cheyron P., Daucourt J., et al., 2014. A reaction norm model for genomic selection using high-dimensional genomic and environmental data. Theor. Appl. Genet. 127: 595–607.
  7. Jiang Y., Reif J. C., 2015. Modelling epistasis in genomic selection. Genetics 201: 759–768.
  8. Kruijer W., Boer M. P., Malosetti M., Flood P. J., Engel B., et al., 2015. Marker-based estimation of heritability in immortal populations. Genetics 199: 379–398.
  9. Lee J. J., Chow C. C., 2014. Conditions for the validity of SNP-based heritability estimation. Hum. Genet. 133: 1011–1022.
  10. Mackay T. F., 2014. Epistasis and quantitative traits: using model organisms to study gene-gene interactions. Nat. Rev. Genet. 15: 22–33.
  11. Oakey H., Verbyla A., Pitchford W., Cullis B., Kuchel H., 2006. Joint modeling of additive and non-additive genetic line effects in single field trials. Theor. Appl. Genet. 113: 809–819.
  12. Oakey H., Verbyla A., Cullis B., Wei X., Pitchford W., 2007. Joint modeling of additive and non-additive (genetic line) effects in multi-environment trials. Theor. Appl. Genet. 114: 1319–1332.
  13. Song Y. S., Wang F., Slatkin M., 2010. General epistatic models of the risk of complex diseases. Genetics 186: 1467–1473.
  14. Speed D., Balding D. J., 2015. Relatedness in the post-genomic era: Is it still useful? Nat. Rev. Genet. 16: 33–44.
  15. Speed D., Hemani G., Johnson M. R., Balding D. J., 2012. Improved heritability estimation from genome-wide SNPs. Am. J. Hum. Genet. 91: 1011–1021.
  16. van Heerwaarden J., Hufford M. B., Ross-Ibarra J., 2012. Historical genomics of North American maize. Proc. Natl. Acad. Sci. USA 109: 12420–12425.
  17. Visscher P. M., Goddard M. E., 2014. A general unified framework to assess the sampling variance of heritability estimates using pedigree or marker-based relationships. Genetics 199: 223–232.
  18. White H., 1982. Maximum likelihood estimation of misspecified models. Econometrica 50: 1–25.
  19. Yang J., Benyamin B., McEvoy B. P., Gordon S., Henders A. K., et al., 2010. Common SNPs explain a large proportion of the heritability for human height. Nat. Genet. 42: 565–569.
  20. Yang J., Lee S. H., Goddard M. E., Visscher P. M., 2011. GCTA: a tool for genome-wide complex trait analysis. Am. J. Hum. Genet. 88: 76–82.
  21. Zhao K., Tung C.-W. W., Eizenga G. C., Wright M. H., Ali M. L., et al., 2011. Genomewide association mapping reveals a rich genetic architecture of complex traits in Oryza sativa. Nat. Commun. 2: 467.
  22. Zuk O., Hechter E., Sunyaev S. R., Lander E. S., 2012. The mystery of missing heritability: genetic interactions create phantom heritability. Proc. Natl. Acad. Sci. USA 109: 1193–1198.

Articles from Genetics are provided here courtesy of Oxford University Press
