Abstract
Prediction of phenotypes from genotypes is an important objective to fulfill the promises of genomics, precision medicine and agriculture. Although it’s now possible to account for the majority of genetic variation through model fitting, prediction of phenotypes remains a challenge, especially across populations that have diverged in the past. In this study, we designed simulation experiments to specifically investigate the role of genetic interactions in failure of polygenic prediction. We found that non-additive genetic interactions can significantly reduce the accuracy of polygenic prediction. Our study demonstrated the importance of considering genetic interactions in genetic prediction.
Keywords: Genomic Prediction, GenPred, Shared Data Resources, genetic interactions, polygenic prediction, cross-population prediction
Significant progress has been made in our understanding of the genetic architecture of complex quantitative traits in recent years, due largely to large-scale genome-wide association studies (GWAS) (Visscher et al. 2017). For example, human adult height is a classical quantitative trait with a narrow sense heritability (h2) of approximately 0.8 based on twin studies (Silventoinen et al. 2003). However, early GWAS identified common variants explaining only a total of 2–4% of phenotypic variance (Gudbjartsson et al. 2008; Lettre et al. 2008; Weedon et al. 2008) with sample sizes on the order of 20,000. In 2010, a landmark study increased this proportion to about 45% by fitting ∼300,000 SNP markers regardless of their significance in the model for ∼4,000 individuals with the covariance among individuals determined by genome-wide SNP similarity (Yang et al. 2010). Applying the same idea, the most recent study using whole genome sequences of ∼20,000 individuals in the TOPMed project captured almost all of the heritability (Wainschtein et al. 2019). These studies suggested that complex traits are highly polygenic, with many loci of individually small effects.
However, our ability to predict complex quantitative traits from genotype data remains limited. A perfect genetic model with precise effects and model specification should be able to predict unobserved phenotypes with an accuracy (measured by r2) equal to the heritability. In practice, this is rarely the case. For example, a large GWAS on human adult height with almost 200,000 individuals identified over 180 loci, which could only predict phenotypes with an accuracy of ∼10% (Lango Allen et al. 2010). This prediction accuracy was measured based on “leave-one-out” out-of-sample prediction (International Schizophrenia Consortium et al. 2009), i.e., the effects of the genetic loci were estimated in one subset of the sample and polygenic scores (genetic effects summed over all significant loci) were computed to predict phenotypes in another subset. The partition between the subsets conveniently followed sample origin from different European countries (Lango Allen et al. 2010). In animal and plant breeding, genomic prediction is widely used, where effects of genetic markers across the whole genome, regardless of their statistical significance, are summed to compute genetic prediction (Meuwissen et al. 2001; VanRaden 2008; de los Campos et al. 2013).
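The statement that a correctly specified model with exact effects attains an r2 equal to the heritability follows from a short variance argument. The sketch below is illustrative only: the symbols y (phenotype), g (additive genetic value) and e (residual) are introduced here and are not notation from the original text, and the residual is assumed to be uncorrelated with g.

```latex
\begin{aligned}
y &= g + e, \qquad \operatorname{Cov}(g, e) = 0,\\
r^2(g, y) &= \frac{\operatorname{Cov}(g, y)^2}{\operatorname{Var}(g)\,\operatorname{Var}(y)}
           = \frac{\operatorname{Var}(g)^2}{\operatorname{Var}(g)\,\operatorname{Var}(y)}
           = \frac{\operatorname{Var}(g)}{\operatorname{Var}(y)} = h^2 .
\end{aligned}
```

Any error in the estimated effects or in the model specification replaces g with an imperfect predictor, so the realized r2 falls below this bound.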
Recently, there has been renewed interest in the application of polygenic scores (International Schizophrenia Consortium et al. 2009) with the advent of large public data sets such as the UK Biobank (Khera et al. 2018). In particular, many studies have observed poor prediction by polygenic scores across different ancestry groups (Martin et al. 2019) or even within an ancestry group but with variable characteristics (Mostafavi et al. 2019). In fact, earlier studies with smaller sample sizes observed similar patterns, but these were interpreted as missing heritability (Lango Allen et al. 2010; Makowsky et al. 2011). In animal breeding, similar observations have also been made. Although genomic prediction works exceedingly well within a breed of cattle, cross-breed prediction generally fails (Hayes et al. 2009). The explanation is straightforward: genetic effects can be context dependent and heterogeneous between groups. Variable linkage disequilibrium (LD) patterns, environments, and other factors can all contribute to the variable genetic effects, manifesting as variable accuracy of polygenic prediction.
Genetic interactions are pervasive and are an important type of context-dependent effect (Mackay 2014; Mackay and Moore 2014). The presence of genetic interactions does not have a strong effect on the proportion of phenotypic variance attributable to the additive effects of all markers (Hill et al. 2008; Huang and Mackay 2016); therefore, the magnitude of additive variance explained by all markers offers no indication of the genetic architecture. However, genetic interactions may influence genomic prediction accuracy. Models explicitly taking into account the complexity can improve prediction (Ober et al. 2015; Jiang and Reif 2015; Martini et al. 2017; Morgante et al. 2018). Moreover, non-parametric models that do not rely on the additivity of the model can outperform parametric additive models when the genetic architecture is non-additive (Momen et al. 2018). These results clearly suggest that simplifying the genetic architecture to the additive infinitesimal model when the true model is not additive, although convenient and without comparable alternatives, can be risky. In this study, we specifically investigate the influence of genetic interactions on polygenic prediction of phenotypes, with an emphasis on prediction across diverged populations.
Materials and Methods
Population simulation
We used the coalescent simulator MaCS (Chen et al. 2008) to simulate genome sequences of 75,000 individuals, with 25,000 in each of the populations, according to the demographic history in Figure 1A. We simulated 1,000 independently inherited chromosomes of 100,000 base pairs in size and set the mutation rate to 1.25 × 10^−8 per bp and the recombination rate to 1.25 × 10^−8 per bp (Figure 1B). Effective population size was set to 20,000. The MaCS command for one chromosome was “macs 150000 100000 -s "$random_seed" -i 1 -h 1000 -t 0.001 -r 0.001 -I 3 50000 50000 50000 0 -ej 0.0125 3 2 -ej 0.025 2 1”. This simulation was performed once, but the partition of samples was repeated 20 times, and the results were summarized as box plots in the figures. We defined three sets of variants. 1) causal variants: one variant was sampled from each of the 1,000 chromosomes to constitute the causal variants. 2) tag variants: all variants excluding the causal variants. 3) all variants: all variants including the causal variants.
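The numeric arguments in the MaCS command can be related to the per-generation quantities in the text with a little arithmetic. The sketch below assumes the usual ms-style convention in which the -t and -r values are per-site rates scaled by 4Ne and the -ej times are in units of 4Ne generations; the function and variable names are hypothetical and only serve the illustration.

```python
# Relate the per-generation quantities in the text to the scaled values on the
# MaCS command line, ASSUMING the ms-style 4*Ne scaling (an assumption here).

NE = 20_000        # effective population size stated in the text
MU = 1.25e-8       # mutation rate per bp per generation
RECOMB = 1.25e-8   # recombination rate per bp per generation


def scaled_rate(rate_per_bp, ne=NE):
    """Per-site rate multiplied by 4*Ne (value passed to -t or -r)."""
    return 4 * ne * rate_per_bp


def generations(scaled_time, ne=NE):
    """Convert a time in units of 4*Ne generations into generations."""
    return scaled_time * 4 * ne


theta = scaled_rate(MU)            # compare with -t 0.001
rho = scaled_rate(RECOMB)          # compare with -r 0.001
first_join = generations(0.0125)   # first -ej event
second_join = generations(0.025)   # second -ej event
diploids = 150_000 // 2            # 150000 sampled haplotypes as diploid individuals

print(theta, rho, first_join, second_join, diploids)
```

Under this assumption the two population joins fall 1,000 and 2,000 generations in the past, matching the divergence times described for Figure 1A, and 150,000 haplotypes correspond to the 75,000 diploid individuals.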
Simulation of quantitative phenotypes
We simulated quantitative phenotypes according to the genetic architecture depicted in Figure 1C. For each of the three possible genotypes for a biallelic locus with alleles A and a, we used the additive coding aa = -1, Aa = 0, and AA = 1 and the dominance coding aa = 0, Aa = 1, AA = 0 to code genotypes (Table 1). The simulation of phenotypes consisted of two steps. In the first step, the genotype coding for an individual, or the product of genotype codings in the case of between-loci interactions, was multiplied by a genetic effect randomly drawn from the standard normal distribution and summed over all loci or all pairs of loci to obtain the genetic values. In the second step, an environmental effect was added by drawing from a normal distribution with a variance computed such that the broad sense heritability H2 = 0.8. These steps are summarized in Table 1 with illustrative examples for the nine possible genotypes across two loci. It is straightforward to extend this to all loci. We performed this simulation in each of the 20 random partitions of populations and independently sampled causal variants and genetic effects.
Table 1. Summary of genotype-phenotype relationships in simulations.
QTL A genotype | AA | AA | AA | Aa | Aa | Aa | aa | aa | aa |
---|---|---|---|---|---|---|---|---|---|
QTL B genotype | BB | Bb | bb | BB | Bb | bb | BB | Bb | bb |
x1a | 1 | 1 | 1 | 0 | 0 | 0 | −1 | −1 | −1 |
x1d | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 |
x2a | 1 | 0 | −1 | 1 | 0 | −1 | 1 | 0 | −1 |
x2d | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
Additive (β1ax1a + β2ax2a + …) | β1a + β2a | β1a | β1a − β2a | β2a | 0 | −β2a | −β1a + β2a | −β1a | −β1a −β2a |
Dominance (β1dx1a + β1dx1d + β2dx2a + β2dx2d + …) | β1d + β2d | β1d + β2d | β1d − β2d | β1d + β2d | β1d + β2d | β1d − β2d | −β1d + β2d | −β1d + β2d | −β1d − β2d |
Overdominance (β1dx1d + β2dx2d + …) | 0 | β2d | 0 | β1d | β1d + β2d | β1d | 0 | β2d | 0 |
A x A (βaax1ax2a + …) | βaa | 0 | −βaa | 0 | 0 | 0 | −βaa | 0 | βaa |
A x D (βadx1ax2d + …) | 0 | βad | 0 | 0 | 0 | 0 | 0 | −βad | 0 |
D x D (βddx1dx2d + …) | 0 | 0 | 0 | 0 | βdd | 0 | 0 | 0 | 0 |
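The two-step phenotype simulation and the codings in Table 1 can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: genotypes are assumed to be stored as counts of the A allele (0, 1, 2), interaction terms are formed for every unordered pair of loci with the first locus taking the first coding (an assumption for the A x D case), and the function names `design_matrix` and `simulate_phenotype` are hypothetical.

```python
import numpy as np


def design_matrix(g, model):
    """Build predictors for one architecture from allele counts g (n x m)."""
    xa = g.astype(float) - 1.0        # additive coding: aa=-1, Aa=0, AA=1
    xd = (g == 1).astype(float)       # dominance coding: heterozygote=1
    if model == "additive":
        return xa
    if model == "dominance":
        return xa + xd                # one effect per locus on (xa + xd)
    if model == "overdominance":
        return xd
    left, right = {"AxA": (xa, xa), "AxD": (xa, xd), "DxD": (xd, xd)}[model]
    j, k = np.triu_indices(g.shape[1], k=1)   # all pairs with j < k
    return left[:, j] * right[:, k]


def simulate_phenotype(g, model, h2_broad=0.8, seed=0):
    """Return (phenotype, genetic value) for the chosen architecture."""
    rng = np.random.default_rng(seed)
    x = design_matrix(g, model)
    beta = rng.standard_normal(x.shape[1])    # effects ~ N(0, 1)
    gv = x @ beta                             # step 1: genetic values
    # Step 2: environmental variance chosen so that Var(g)/Var(y) = H2.
    var_e = gv.var() * (1.0 - h2_broad) / h2_broad
    y = gv + rng.normal(0.0, np.sqrt(var_e), size=gv.shape)
    return y, gv


# The nine two-locus genotypes of Table 1, as counts of A and B alleles.
table1 = np.array([[2, 2], [2, 1], [2, 0],
                   [1, 2], [1, 1], [1, 0],
                   [0, 2], [0, 1], [0, 0]])
print(design_matrix(table1, "AxA")[:, 0])
```

Forming all pairs is quadratic in the number of loci, so for 1,000 QTL the pairwise design has 499,500 columns; a real run would likely need to build it in blocks.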
Fitting GREML
We fitted the GREML model using GCTA (Yang et al. 2011) with 20,000 individuals from each of the P1, P2, and P3 populations and P1 + P2 and P1 + P3. The GREML partitioned phenotypic variance into a genomic (σ2g) and an environmental component (σ2e) by fitting a mixed model using REML with covariance matrix determined by a relationship matrix calculated based on standardized genotypes (Yang et al. 2010). Genomic heritability was computed as h2g = σ2g/(σ2g + σ2e).
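For readers unfamiliar with the relationship matrix, the sketch below shows one common way to build it from standardized genotypes and how the heritability ratio is formed from the two variance components. It is a simplified illustration in NumPy, not GCTA itself: allele frequencies are taken from the sample, monomorphic variants are dropped, and GCTA's own computation may differ in details such as the diagonal elements. The function names are hypothetical.

```python
import numpy as np


def grm(g):
    """Relationship matrix from allele counts g (n individuals x m variants)."""
    p = g.mean(axis=0) / 2.0                  # sample allele frequencies
    keep = (p > 0) & (p < 1)                  # drop monomorphic variants
    pk = p[keep]
    z = (g[:, keep] - 2.0 * pk) / np.sqrt(2.0 * pk * (1.0 - pk))
    return z @ z.T / keep.sum()               # average over variants


def genomic_heritability(var_g, var_e):
    """Ratio of the genomic variance to the total variance."""
    return var_g / (var_g + var_e)


rng = np.random.default_rng(1)
toy_genotypes = rng.integers(0, 3, size=(50, 200))   # toy data, not from the study
a = grm(toy_genotypes)
print(a.shape, genomic_heritability(0.8, 0.2))
```

The REML step that estimates the two variance components from this matrix and the phenotypes is not shown; it is the part GCTA performs.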
Polygenic score prediction
The BLUP estimates of SNP effects were obtained using GCTA and provided to PLINK2 (https://www.cog-genomics.org/plink/2.0/credits) to compute a polygenic score in 5,000 new individuals either from the same population as the fitted model or from a different population. Prediction accuracy of the polygenic score was computed as the squared correlation (r2) between the predicted polygenic scores and the simulated phenotypes. In the case of prediction using causal variants with the correct dominance by dominance model (Figure 5), we constructed pseudo-variants using the relevant genotype coding (for D x D, double heterozygotes were coded as one genotype class and all other genotypes as the other class) and ran GREML and polygenic score prediction the same way as for an additive model.
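The scoring and accuracy steps amount to a weighted sum of allele counts followed by a squared correlation, and the D x D pseudo-variant is an indicator for double heterozygotes. The NumPy sketch below illustrates these three operations under the assumption that genotypes are allele counts (0, 1, 2); it is not the PLINK2 or GCTA implementation, and the function names are hypothetical.

```python
import numpy as np


def polygenic_score(g, effects):
    """Sum of allele counts weighted by estimated SNP effects."""
    return g @ effects


def prediction_r2(score, phenotype):
    """Squared Pearson correlation between scores and phenotypes."""
    r = np.corrcoef(score, phenotype)[0, 1]
    return r ** 2


def dxd_pseudo_variant(g1, g2):
    """1 for double heterozygotes at a pair of loci, 0 for all other genotypes."""
    return ((g1 == 1) & (g2 == 1)).astype(float)


# Toy example with made-up numbers (not results from the study).
g = np.array([[0, 1], [1, 1], [2, 0], [2, 2]], dtype=float)
effects = np.array([0.5, -1.0])
score = polygenic_score(g, effects)
print(score, prediction_r2(score, 3.0 * score + 2.0))
```

Because r2 is invariant to shifting or positively rescaling the score, it does not matter whether a tool reports the score as a sum or as an average over alleles.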
Data availability
All procedures to simulate the data are described in the manuscript and code can be found at https://github.com/qgg-lab/epistasis-prediction. We provide the simulated genotype data for all 75,000 individuals (25,000 per population) in PLINK binary format (Purcell et al. 2007) on figshare (https://figshare.com/projects/Influence_of_genetic_interactions_on_polygenic_prediction/70427). There are a large number of random partitions for the replicates and the associated phenotypes; these are not directly provided, but they are easy to regenerate from the description of methods and the computer code. Supplemental material available at figshare: https://doi.org/10.25387/g3.10031807.
Results
Experimental design
Because it’s not yet possible to unambiguously know the true genetic architecture of a quantitative trait, all experiments in this study were performed using simulated data instead of real data. This allows us to ask specific, simple questions while eliminating influence from other factors. We simulated a sample of 75,000 diploid individuals from three ancestry groups, where populations P1 and P2 diverged 1,000 generations ago and their ancestors diverged from population P3 an additional 1,000 generations ago (Figure 1A). This specification is qualitatively similar to the global human population history, where the ancestral population that went out of Africa was further split into multiple populations.
We considered three possible variant sets (Figure 1B): 1) causal: all and only causal variants; 2) tag: all variants except causal variants; and 3) all: all variants including causal variants. These represent three simplified scenarios: 1) a best-case scenario where causal variants have been identified, 2) a realistic scenario where causal variants are tagged by genotyped variants, and 3) a scenario achievable in the near future with whole genome sequences. We did not consider variants that were rare (MAF < 0.01) in all three populations as they led to gross overestimation of genomic heritability approaching one, similar to findings in a simulation study using real genotypes (Evans et al. 2018). The three variant sets were used to compute genomic heritability and perform polygenic prediction. There were a total of approximately 680,000 variants in the ‘all’ variants case. When performing polygenic prediction, we did not select variants based on association tests. This choice was based on the consideration that selection of markers would introduce another variable into the experiment and complicate the design and interpretation. Instead, we drew on the distinction between causal and all variants to represent the extreme scenarios where a perfect selection or no selection was performed.
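The rare-variant filter described above removes a variant only if it is rare in every population, which is equivalent to keeping it if its MAF reaches the threshold in at least one population. A minimal NumPy sketch, assuming genotypes are allele counts and that a population label is available per individual (the names `maf` and `keep_variants` are hypothetical):

```python
import numpy as np


def maf(g):
    """Minor allele frequency per variant from allele counts (n x m)."""
    p = g.mean(axis=0) / 2.0
    return np.minimum(p, 1.0 - p)


def keep_variants(g, pop, threshold=0.01):
    """True for variants with MAF >= threshold in at least one population."""
    per_pop = np.vstack([maf(g[pop == k]) for k in np.unique(pop)])
    return (per_pop >= threshold).any(axis=0)


# Toy data: variant 0 is monomorphic everywhere, variant 1 segregates in P1 only.
pop = np.array([1, 1, 2, 2, 3, 3])
g = np.array([[0, 1], [0, 1], [0, 0], [0, 0], [0, 0], [0, 0]])
mask = keep_variants(g, pop)
print(mask)
```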
We simulated a quantitative trait controlled by 1,000 independently inherited QTL (Figure 1B) of broad sense heritability H2 = 0.8 but different types of genetic architecture. When the genetic architecture is strictly additive, the narrow sense heritability h2 = H2 = 0.8, whereas in other cases h2 < 0.8. Six simple models of genetic architecture were simulated, including additive, dominance, overdominance, and pairwise additive by additive (A x A), additive by dominance (A x D), and dominance by dominance (D x D) (Figure 1C). No higher order interaction was simulated and effects across loci or across pairs were additive.
Genomic heritability misses little heritability
We first recapitulated a result that has been consistently shown (Hill et al. 2008; Huang and Mackay 2016). We fitted a linear mixed model in each of the three populations or combined samples using GREML implemented in GCTA (Yang et al. 2011) with 20,000 individuals. We found that hg2 was uniformly high when the genetic architecture was additive, dominance, or additive by additive, accounting for nearly all heritability (Figure 2, Figure S1 online). Whether or not the variant sets included causal variants appeared to have little effect on hg2; variant sets excluding causal variants performed as well as causal variants only, and there was a slight tendency toward upward bias (Figure 2). Similar results were obtained regardless of whether the samples were from a homogeneous population or a mixture of samples from two diverged populations (Figure S1). When the genetic architecture was entirely overdominance, additive by dominance, or dominance by dominance, hg2 was lower, but still consistently explained > 50% of the heritability (Figure 2, Figure S1). Taken together, these results suggest that as long as a large number of genome-wide markers were fitted, little heritability was missed, regardless of the genetic architecture. In other words, the magnitude of genomic heritability offers no discrimination of the underlying genetic architecture (Huang and Mackay 2016).
Accuracy of polygenic prediction with an additive genetic architecture
We then asked a simple question. If genome-wide variants are able to capture the majority of heritability, are they able to predict phenotypes accurately? This question directly addresses the distinction between the two definitions of missing heritability. If there is no missing heritability based on mixed model fitting, is there missing heritability in polygenic prediction? Many illuminating results could be obtained by comparing different scenarios of simulations (Figure S2).
We first considered the simplest and best scenario, in which the genetic architecture was fully additive, and all and only causal variants were known. In this case, the statistical model took the form of the true model and only model parameters needed to be estimated. We trained the model in one population (n = 20,000, training data) and computed polygenic scores of new individuals (n = 5,000, test data) either in the same population or a different population (Figure 3A). To test the performance of cross-population prediction, we considered three possible relationships between the training and test populations, representing a gradient of divergence between training and test data (Figure 3A).
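The training and test sets can be drawn by shuffling the 25,000 individuals of a population and taking disjoint blocks of 20,000 and 5,000, repeated for each of the 20 replicates. The sketch below is one plausible way to do this and is an assumption about the procedure, not the authors' script; the function name and the seed are hypothetical.

```python
import numpy as np


def random_partitions(n_total=25_000, n_train=20_000, n_test=5_000,
                      n_rep=20, seed=0):
    """Return n_rep (train_idx, test_idx) pairs of disjoint index arrays."""
    rng = np.random.default_rng(seed)
    partitions = []
    for _ in range(n_rep):
        order = rng.permutation(n_total)          # shuffle individuals
        train = order[:n_train]
        test = order[n_train:n_train + n_test]
        partitions.append((train, test))
    return partitions


parts = random_partitions()
print(len(parts), parts[0][0].size, parts[0][1].size)
```

For cross-population prediction the training indices would come from one population and the test indices from another, so the same helper can be called once per population.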
As expected, the accuracy of polygenic prediction was very high in this best case scenario, approaching the true heritability (‘causal’ in Figure 3B). There was a small decline in accuracy when cross-population prediction was performed and the degree of population divergence negatively affected prediction accuracy. However, when non-causal variants were included to make predictions, accuracy plummeted from ∼0.8 to ∼0.4 (Figure 3B) even when training and test samples were from the same population. This was likely due to the inclusion of independent predictors whose number vastly exceeded that of the causal variants. As populations became more divergent, prediction accuracy dropped further, and the drop was much more pronounced when tag or all variants were used. These results (in the cases of tag or all variant sets) largely agreed with the large body of empirical work showing that the accuracy of polygenic prediction was substantially lower than genomic heritability and that cross-population prediction was poor (Lango Allen et al. 2010; Makowsky et al. 2011; Martin et al. 2019).
One important lesson can be learned from this simple experiment. The facts that simply adding non-causal variants to the model drastically reduced prediction accuracy, and that the rate of decay in the accuracy of cross-population prediction was much greater in the presence of non-causal variants, indicate that the agreement between the model and the true genetic architecture matters. This is in sharp contrast to genomic heritability estimation, where including more variants generally improves model fit (compare Yang et al. 2010 with Wainschtein et al. 2019).
Accuracy of polygenic prediction in the presence of genetic interactions
We then tested the influence of genetic interactions on the accuracy of polygenic prediction, which fits an additive model. In a favorable condition when all causal variants were known (but not their effects or interactions) and prediction was performed within the same homogeneous population, polygenic prediction accuracy was highly dependent on the genetic architecture (P1 -> P1 in Figure 4A). The accuracy ranged from 0.78, nearly the theoretical maximum, in the case of an additive genetic architecture to less than 0.20 in the case of a dominance by dominance genetic architecture (P1 -> P1 in Figure 4A). In general, prediction accuracy was higher for genetic architectures with higher hg2, such as additive, dominance, and additive by additive. In contrast, under overdominance, additive by dominance, and dominance by dominance genetic architectures, polygenic prediction performed substantially worse (P1 -> P1 in Figure 4A). When all variants were used, including non-causal ones, the prediction accuracies decreased dramatically, from 0.78 to 0.37 in the most favorable within-population additive case (additive case in P1 -> P1 in Figure 4A and 4B). Furthermore, the dependency on genetic architecture appeared to be stronger when non-causal variants were included (P1 -> P1 in Figure 4B).
We then asked how genetic interactions influence the rate of decay in prediction accuracies when the training and test populations diverge. We set the accuracy of within-population prediction as the baseline and compared cross-population prediction accuracies to this baseline. When all variants were used for polygenic prediction, the accuracy of cross-population prediction dropped to about 40–60% of the accuracy of within-population prediction, depending on genetic architecture (Figure 4B). Additive, additive by additive, and dominance genetic architecture, those with the highest hg2 and r2, retained the most prediction accuracy while overdominance, additive by dominance, and dominance by dominance lost the most (Figure 4B). The more diverged the populations were, the more predictive ability of polygenic scores was lost (Figure 4B).
There are many reasons why polygenic prediction may fail when the test population diverges from the training population. In our simple simulation setting, genetic effects were the same across populations and were not sensitive to any non-genetic factors. The difference in the linkage disequilibrium structure between populations may in part explain the drop when all variants were used (Figure 4B). Importantly, simulations allowed us to directly use causal variants for prediction, thus eliminating the influence of LD (Figure 4A). Remarkably, while the accuracy of cross-population prediction was lower for all genetic architectures, the rate of decay was much greater when the genetic architecture was overdominance, additive by dominance, or dominance by dominance (Figure 4A, compare slopes of the different lines). These results clearly suggest that genetic interactions can not only cause cross-population polygenic prediction to fail, but also do so in a more severe manner compared to an additive genetic architecture.
Discussion
We demonstrate in this study through simulations that genetic interactions can influence the accuracy of polygenic prediction. In particular, cross-population polygenic prediction performed worse than intra-population prediction in all cases. For traits controlled by genetic interactions, the cross-population decay in prediction accuracy was far greater (Figure 4). The results make intuitive sense. For a statistical model to predict new data accurately, two conditions must be met. First, the model specification must be correct or at least sufficiently accurate to capture variation in the data. Second, parameters in the model must be precise. When genetic interactions are present, the additive polygenic model clearly is not accurate.
Previous studies have mostly focused on improving parameter estimation, through increasing sample size and methodological improvement. For example, increasing sample size substantially increased accuracy of polygenic prediction of height within individuals of European ancestry (Lello et al. 2018). Inclusion of samples of different backgrounds in the training data also helped (Martin et al. 2019) (Figure S2).
However, the complexity of the genetic architecture of a quantitative trait makes it nearly impossible to specify a model prior to modeling. As a consequence, the polygenic infinitesimal model or variants of it (Gianola et al. 2009) have been used as the default model. The infinitesimal model has been instrumental and allowed for many theoretical insights as well as applications to be developed. In particular, prediction of breeding values in animal and plant breeding relying on the infinitesimal model has been very successful (García-Ruiz et al. 2016). However, its limitations are also apparent. Cross-population and cross-breed polygenic prediction was low in accuracy (Hayes et al. 2009; Lango Allen et al. 2010; Martin et al. 2019). Although many factors may contribute to this limitation, our simulation results clearly indicated that unaccounted-for genetic interactions were a major contributor. Indeed, if the correct genetic model could be specified, cross-population prediction could achieve very high accuracy (Figure 5). There have been attempts to explicitly model non-additive genetic effects in the context of polygenic prediction; some moderate improvement was observed (Martini et al. 2017; Varona et al. 2018). However, these studies modeled non-additive effects using genome-wide markers, which added a large number of independent predictors as noise to the model and may have negatively impacted the performance.
We did not analyze existing public data sets with real genotypes and phenotypes, some of which contained subjects from multiple ancestries. Previous work with real data has consistently shown that cross-population polygenic prediction generally fails (Martin et al. 2019), which agreed with results obtained by simulations in this study. However, it is difficult to disentangle the different factors that may contribute to effect heterogeneity and the failure of prediction in real data sets. Using simulations, we can focus on specific questions and our results clearly indicated a contribution of genetic interactions to the failure of cross-population polygenic prediction. While the additive infinitesimal model is the most sensible model when no other information is available, our study suggests that the development in the field should be expanded to include efforts to more explicitly model genetic interactions. Although it is challenging, recent advances in modeling (Boyle et al. 2017; Liu et al. 2019) and genomic assays informing regulatory networks (Gerstein et al. 2012) may finally offer new ways to develop biologically sensible models.
Acknowledgments
This research is supported by Michigan State University, MSU AgBioResearch, and National Institute of Food and Agriculture.
Footnotes
Communicating editor: D. J. de Koning
Literature Cited
- Boyle E. A., Li Y. I., and Pritchard J. K., 2017. An Expanded View of Complex Traits: From Polygenic to Omnigenic. Cell 169: 1177–1186. https://doi.org/10.1016/j.cell.2017.05.038
- Chen G. K., Marjoram P., and Wall J. D., 2008. Fast and flexible simulation of DNA sequence data. Genome Res. 19: 136–142. https://doi.org/10.1101/gr.083634.108
- de los Campos G., Hickey J. M., Pong-Wong R., Daetwyler H. D., and Calus M. P. L., 2013. Whole-Genome Regression and Prediction Methods Applied to Plant and Animal Breeding. Genetics 193: 327–345. https://doi.org/10.1534/genetics.112.143313
- Evans L. M., Tahmasbi R., Vrieze S. I., Abecasis G. R., Das S. et al., 2018. Comparison of methods that use whole genome data to estimate the heritability and genetic architecture of complex traits. Nat. Genet. 50: 737–745. https://doi.org/10.1038/s41588-018-0108-x
- García-Ruiz A., Cole J. B., VanRaden P. M., Wiggans G. R., Ruiz-López F. J. et al., 2016. Changes in genetic selection differentials and generation intervals in US Holstein dairy cattle as a result of genomic selection. Proc. Natl. Acad. Sci. USA 113: E3995–E4004. https://doi.org/10.1073/pnas.1519061113
- Gerstein M. B., Kundaje A., Hariharan M., Landt S. G., Yan K.-K. et al., 2012. Architecture of the human regulatory network derived from ENCODE data. Nature 489: 91–100. https://doi.org/10.1038/nature11245
- Gianola D., de los Campos G., Hill W. G., Manfredi E., and Fernando R., 2009. Additive Genetic Variability and the Bayesian Alphabet. Genetics 183: 347–363. https://doi.org/10.1534/genetics.109.103952
- Gudbjartsson D. F., Walters G. B., Thorleifsson G., Stefansson H., Halldorsson B. V. et al., 2008. Many sequence variants affecting diversity of adult human height. Nat. Genet. 40: 609–615. https://doi.org/10.1038/ng.122
- Hayes B. J., Bowman P. J., Chamberlain A. C., Verbyla K., and Goddard M. E., 2009. Accuracy of genomic breeding values in multi-breed dairy cattle populations. Genet. Sel. Evol. 41: 51. https://doi.org/10.1186/1297-9686-41-51
- Hill W. G., Goddard M. E., and Visscher P. M., 2008. Data and theory point to mainly additive genetic variance for complex traits. PLoS Genet. 4: e1000008. https://doi.org/10.1371/journal.pgen.1000008
- Huang W., and Mackay T. F. C., 2016. The Genetic Architecture of Quantitative Traits Cannot Be Inferred from Variance Component Analysis. PLoS Genet. 12: e1006421. https://doi.org/10.1371/journal.pgen.1006421
- International Schizophrenia Consortium, Purcell S. M., Wray N. R., Stone J. L., Visscher P. M. et al., 2009. Common polygenic variation contributes to risk of schizophrenia and bipolar disorder. Nature 460: 748–752. https://doi.org/10.1038/nature08185
- Jiang Y., and Reif J. C., 2015. Modeling Epistasis in Genomic Selection. Genetics 201: 759–768. https://doi.org/10.1534/genetics.115.177907
- Khera A. V., Chaffin M., Aragam K. G., Haas M. E., Roselli C. et al., 2018. Genome-wide polygenic scores for common diseases identify individuals with risk equivalent to monogenic mutations. Nat. Genet. 50: 1219–1224. https://doi.org/10.1038/s41588-018-0183-z
- Lango Allen H., Estrada K., Lettre G., Berndt S. I., Weedon M. N. et al., 2010. Hundreds of variants clustered in genomic loci and biological pathways affect human height. Nature 467: 832–838. https://doi.org/10.1038/nature09410
- Lello L., Avery S. G., Tellier L., Vazquez A. I., de los Campos G. et al., 2018. Accurate Genomic Prediction of Human Height. Genetics 210: 477–497. https://doi.org/10.1534/genetics.118.301267
- Lettre G., Jackson A. U., Gieger C., Schumacher F. R., Berndt S. I. et al., 2008. Identification of ten loci associated with height highlights new biological pathways in human growth. Nat. Genet. 40: 584–591. https://doi.org/10.1038/ng.125
- Liu X., Li Y. I., and Pritchard J. K., 2019. Trans Effects on Gene Expression Can Drive Omnigenic Inheritance. Cell 177: 1022–1034.e6. https://doi.org/10.1016/j.cell.2019.04.014
- Mackay T. F. C., 2014. Epistasis and quantitative traits: using model organisms to study gene-gene interactions. Nat. Rev. Genet. 15: 22–33. https://doi.org/10.1038/nrg3627
- Mackay T. F., and Moore J. H., 2014. Why epistasis is important for tackling complex human disease genetics. Genome Med. 6: 124. https://doi.org/10.1186/gm561
- Makowsky R., Pajewski N. M., Klimentidis Y. C., Vazquez A. I., Duarte C. W. et al., 2011. Beyond Missing Heritability: Prediction of Complex Traits. PLoS Genet. 7: e1002051. https://doi.org/10.1371/journal.pgen.1002051
- Martin A. R., Kanai M., Kamatani Y., Okada Y., Neale B. M. et al., 2019. Clinical use of current polygenic risk scores may exacerbate health disparities. Nat. Genet. 51: 584–591. https://doi.org/10.1038/s41588-019-0379-x
- Martini J. W. R., Gao N., Cardoso D. F., Wimmer V., Erbe M. et al., 2017. Genomic prediction with epistasis models: on the marker-coding-dependent performance of the extended GBLUP and properties of the categorical epistasis model (CE). BMC Bioinformatics 18: 3. https://doi.org/10.1186/s12859-016-1439-1
- Meuwissen T. H., Hayes B. J., and Goddard M. E., 2001. Prediction of total genetic value using genome-wide dense marker maps. Genetics 157: 1819–1829.
- Momen M., Mehrgardi A. A., Sheikhi A., Kranis A., Tusell L. et al., 2018. Predictive ability of genome-assisted statistical models under various forms of gene action. Sci. Rep. 8: 12309. https://doi.org/10.1038/s41598-018-30089-2
- Morgante F., Huang W., Maltecca C., and Mackay T. F. C., 2018. Effect of genetic architecture on the prediction accuracy of quantitative traits in samples of unrelated individuals. Heredity 120: 500–514. https://doi.org/10.1038/s41437-017-0043-0
- Mostafavi H., Harpak A., Conley D., Pritchard J. K., and Przeworski M., 2019. Variable prediction accuracy of polygenic scores within an ancestry group. bioRxiv. https://doi.org/10.1101/629949
- Ober U., Huang W., Magwire M., Schlather M., Simianer H. et al., 2015. Accounting for genetic architecture improves sequence based genomic prediction for a Drosophila fitness trait. PLoS One 10: e0126880. https://doi.org/10.1371/journal.pone.0126880
- Purcell S., Neale B., Todd-Brown K., Thomas L., Ferreira M. A. R. et al., 2007. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 81: 559–575. https://doi.org/10.1086/519795
- Silventoinen K., Sammalisto S., Perola M., Boomsma D. I., Cornes B. K. et al., 2003. Heritability of Adult Body Height: A Comparative Study of Twin Cohorts in Eight Countries. Twin Res. 6: 399–408. https://doi.org/10.1375/136905203770326402
- VanRaden P. M., 2008. Efficient Methods to Compute Genomic Predictions. J. Dairy Sci. 91: 4414–4423. https://doi.org/10.3168/jds.2007-0980
- Varona L., Legarra A., Toro M. A., and Vitezica Z. G., 2018. Non-additive Effects in Genomic Selection. Front. Genet. 9: 78. https://doi.org/10.3389/fgene.2018.00078
- Visscher P. M., Wray N. R., Zhang Q., Sklar P., McCarthy M. I. et al., 2017. 10 Years of GWAS Discovery: Biology, Function, and Translation. Am. J. Hum. Genet. 101: 5–22. https://doi.org/10.1016/j.ajhg.2017.06.005
- Wainschtein P., Jain D. P., Yengo L., Zheng Z., TOPMed Anthropometry Working Group, et al., 2019. Recovery of trait heritability from whole genome sequence data. bioRxiv. https://doi.org/10.1101/588020
- Weedon M. N., Lango H., Lindgren C. M., Wallace C., Evans D. M. et al., 2008. Genome-wide association analysis identifies 20 loci that influence adult height. Nat. Genet. 40: 575–583. https://doi.org/10.1038/ng.121
- Yang J., Benyamin B., McEvoy B. P., Gordon S., Henders A. K. et al., 2010. Common SNPs explain a large proportion of the heritability for human height. Nat. Genet. 42: 565–569. https://doi.org/10.1038/ng.608
- Yang J., Lee S. H., Goddard M. E., and Visscher P. M., 2011. GCTA: A tool for genome-wide complex trait analysis. Am. J. Hum. Genet. 88: 76–82. https://doi.org/10.1016/j.ajhg.2010.11.011