Abstract
Genome-wide association studies of complex traits frequently find that SNP-based estimates of heritability are considerably smaller than estimates from classic family-based studies. This ‘missing’ heritability may be partly explained by genetic variants interacting with other genes or environments that are difficult to specify, observe, and detect. To circumvent these challenges, we propose a new method to detect genetic interactions that leverages pleiotropy from multiple related traits without requiring the interacting variable to be specified or observed. Our approach, Latent Interaction Testing (LIT), uses the observation that correlated traits with shared latent genetic interactions have trait variance and covariance patterns that differ by genotype. LIT examines the relationship between trait variance/covariance patterns and genotype using a flexible kernel-based framework that is computationally scalable for biobank-sized datasets with a large number of traits. We first use simulated data to demonstrate that LIT substantially increases power to detect latent genetic interactions compared to a trait-by-trait univariate method. We then apply LIT to four obesity-related traits in the UK Biobank and detect genetic variants with interactive effects near known obesity-related genes. Overall, we show that LIT, implemented in the R package lit, uses shared information across traits to improve detection of latent genetic interactions compared to standard approaches.
1. Introduction
There are many large genome-wide association studies (GWAS) available to help facilitate gene discovery and improve our molecular understanding of complex traits and diseases. In particular, recent biobank-sized GWAS collect massive sample sizes and a broad range of phenotypic data to enhance complex trait mapping and augment knowledge of gene functionality. There are two important patterns that have emerged among the multitude of GWAS analyses that have been performed to date. First, the genetic variation of complex traits often involves many thousands of loci [1]. Second, for many traits studied, family-based estimates of heritability (i.e., the proportion of trait variance explained by genetic factors) tend to be substantially greater than corresponding heritability estimates from GWAS single nucleotide polymorphisms (SNP) data [2]. For example, the heritability estimates for body mass index (BMI) from GWAS-based studies are 22–30% [3–6] compared to 40–70% [7,8] in family-based studies. While this ‘missing’ heritability may be due to small sample sizes, structural variants, and/or rare variants [6, 9,–11], these sources may not fully explain the difference in some traits [12]. Another possible explanation is that family-based estimates of broad-sense heritability are capturing within-family sharing of genetic variants with interactive effects (e.g., gene-by-gene and gene-by-environment interactions) which are omitted from GWAS estimates derived from nearly unrelated individuals who only generally share the additive effects of alleles (i.e., narrow-sense heritability) [12–14]. Given the evidence of such interactions [15–18], discovering genetic variants with interactive effects may explain missing heritability and broaden our understanding of the genetic architecture of complex traits.
There are many statistical challenges to discovering genetic variants with interactive effects in GWAS [19]. In particular, studies are typically underpowered to detect interactions due to small effect sizes, a large multiple testing burden, and unknown interactive variables (e.g., other variants or environmental factors). Consequently, it can be difficult to design a study to identify and accurately observe interacting variables. Furthermore, even when interacting variables are known, the mismea-surement of such variables can lead to power loss [20]. One strategy to circumvent some of these issues is to use variance-based testing procedures which do not require the interactive variable(s) to be observed [17, 21,–24]. Intuitively, such procedures model and detect any unequal residual trait variation among genotype categories at a specific SNP (i.e., heteroskedasticity), which provides evidence of other latent (unobserved) factors interacting with the SNP to influence the trait. Previous work has found that variance-based testing procedures can help identify latent genetic interactions on complex traits, including inflammatory markers [22] and obesity-related traits [17, 23, 24].
When there are multiple related traits measured in a study, researchers often apply variance-based procedures on a trait-by-trait (or univariate) basis to detect latent genetic interactions. However, a univariate strategy ignores any biological pleiotropy among traits despite theoretical [25] and empirical [26, 27] support for this phenomenon. Furthermore, in the presence of pleiotropy, many studies have demonstrated that joint statistical modeling of related traits (i.e., a multivariate procedure) outperforms univariate procedures for gene mapping [28,29]. This observation, coupled with the potential existence of pleiotropic interactive effects, suggests that analyzing multiple traits simultaneously in a statistical procedure will increase power to detect latent genetic interactions. To this end, we propose a novel statistical framework, called Latent Interaction Testing (LIT), that leverages multiple related traits to increase the power for detecting latent genetic interactions. LIT is motivated by the observation that latent genetic interactions induce not only a differential variance pattern (i.e., heteroskedasticity) as previously reported, but also a differential covariance pattern between traits. We can harness the differential covariance patterns to increase the power to detect latent genetic interactions compared to variance-based strategies. Similar to variance-based strategies, LIT does not require the interactive partner(s) to be observed or specified.
The manuscript is outlined as follows. We first introduce the LIT framework for detecting latent genetic interactions and then evaluate the performance using simulated biobank-sized datasets. We also compare LIT to univariate testing procedures and observe that LIT provides significant power gains to detect interactive effects in GWAS. Finally, we demonstrate LIT using four obesity-related traits in the UK Biobank with over 6 million single nucleotide polymorphisms from 330,868 genotyped individuals. Our analysis identifies multiple loci demonstrating significant interactive effects near known obesity-related genes. We then conclude with a discussion of our results.
2. Results
2.1. Overview
We provide a brief overview of the LIT framework and leave more technical details to the Methods section. As illustrated in Figure S1, a SNP with a latent interaction induces a genotype effect on the trait variances and on the covariance between traits. We can assess this interaction effect by relating an individual’s genotype to individual-specific trait variances and covariances (Figure S2). We estimate individual-specific trait variances using squared residuals (SQ) for each trait after adjusting for additive (and possibly dominance) effects and likewise estimate individual-specific covariances by multiplying the residuals of different pairs of traits together to form cross products (CP; see ref. [30]). Using a kernel-based distance covariance (KDC) statistic (Figure 1) [31–33], we then assess evidence of a latent genetic interaction by testing whether the elements of a matrix comprised of pairwise similarity of SQ/CP terms in the sample is independent of the elements of a second matrix comprised of pairwise genotype similarity. To measure the similarity between variables, we apply a user-defined kernel function such as a linear kernel (analogous to scaled covariance) or a projection kernel [34, 35]. We show later that the optimal kernel choice depends on the complexity of the interaction signal. Researchers have previously applied variations of the KDC statistic, which yields a -value testing the global null of no association between the elements of two matrices, in genetic analyses for studies of both common [34, 36, 37] and rare [35, 38, 39] variation.
The traditional KDC statistic utilizes the corresponding eigenvectors (directions of maximal variation) and eigenvalues (weights emphasizing eigenvectors) derived from the SQ/CP similarity matrix for inference. In the process, the traditional KDC statistic emphasizes signals explaining the most variation in this matrix. While we show this emphasis is suitable under certain pleiotropy settings, there are other settings where the interaction signal is not captured by the top eigenvectors of the similarity matrix and so the test may not be optimal [29, 40]. Therefore, we also consider weighting eigenvectors equally in our test statistic to increase power to detect interaction signals captured by the lower eigenvectors of the similarity matrix. We refer to the implementation that weights eigenvectors by corresponding eigenvalues (i.e., the traditional KDC framework) as weighted LIT (wLIT) and refer to the implementation that weights eigenvectors equally as unweighted LIT (uLIT). Since the pleiotropic genetic architecture of a trait is unknown a priori, we maximize the performance of LIT by aggregating the -values from wLIT and uLIT using the Cauchy combination test (CCT) [41], which has proven valuable in a variety of genetic settings [42]. We refer to the CCT of the wLIT and uLIT -values as aggregate LIT (aLIT). For simplicity, we primarily focus on implementing wLIT, uLIT, and aLIT on a SNP-by-SNP basis to test for interactive effects but discuss extensions to handling multiple SNPs simultaneously within the Methods section (see Section 4.2.2). Finally, to improve computational efficiency for biobank-sized datasets, we apply a linear kernel in wLIT and so uLIT is equivalent to using a projection kernel.
2.2. Power and type I error rate control
We simulated related traits for 300,000 observations (reflecting sample sizes for biobank datasets) under the polygenic trait model with additive genetic, environmental, and GxE interaction components. The baseline correlation between traits was either 0.25, 0.50, or 0.75 which represents different correlation strengths from shared genetic and environmental effects. We then simulated a genetic risk factor, an environmental factor, and a GxE interaction that explains 0.2%, a randomly drawn value from 0.5% to 2.0%, and a randomly drawn value from 0.1% to 0.15% of the trait variation, respectively. To assess the performance of LIT under different sparsity settings, we varied the proportion of traits with a shared GxE interaction as . We considered three types of pleiotropy in our study, namely, the GxE interaction effect size is positive across traits (positive pleiotropy), a mixture of positive and negative across traits (positive and negative pleiotropy), and a variation of these two settings where the direction is opposite of the interacting environment.
We found that the LIT implementations provide type I error rate control at significance level 10−3, including when the trait distribution is skewed or heavy trailed (Figure S3 S4). We then compared the power across various configurations of number of traits, baseline correlation, proportion of traits with shared interaction effects, and direction of the interaction effect (Figure 2A–B). In comparing wLIT with uLIT, neither method is optimal across all settings as expected. As mentioned in Section 2.1, wLIT emphasizes the high-variance (i.e., large eigenvalues) eigenvectors of the SQ/CP kernel matrix while uLIT weights them equally. Under a simulation model where the signal would reside on the top eigenvector, we expect wLIT to outperform uLIT. Conversely, we expect ULIT to outperform wLIT when the signal resides on the lower-variance eigenvectors of the SQ/CP kernel matrix.
To illustrate how the interaction signal can reside on different eigenvectors of the SQ/CP kernel matrix, we performed an association test between the eigenvectors and genotype under the positive pleiotropy setting with 10 traits (Figure S5). We find that the power to detect the latent interaction signal at each eigenvector depends on the proportion of traits with shared interaction effects (sparsity level), baseline trait correlation, and the proportion of variation explained by the genotype (denoted as ). More specifically, for small baseline correlations, the high-variance eigenvector generally captures the signal for most sparsity settings (which explains why wLIT outperforms uLIT in these situations). As the baseline correlation increases, the power of the high-variance eigenvector can decrease rapidly (even if the proportion of traits with shared interaction effects is high) due to the reduction in (see top right panels of Figure S5. On the other hand, an increase in baseline correlation coupled with a decrease in the proportion of traits with shared interaction effects (i.e., increase in sparsity), results in the interaction signal being separated out from the high-variance eigenvector and becoming detectable in the low-variance eigenvectors.
Given the above insights, we can delineate the performance between uLIT and wLIT by assessing which eigenvectors capture the interaction signal. In the positive pleiotropy setting with 10 traits and a baseline correlation of 0.5 (bottom center panel of Figure 2A), as the proportion of traits with shared interaction effects increases from 0 to 0.6, the power of uLIT increases whereas the power of wLIT is constantly negligible. When the proportion of traits with shared interaction effects increases from 0.6 to 1, the power of uLIT decreases whereas the power of wLIT increases and overtakes uLIT when this proportion exceeds 0.8. These power trends are due to the low-variance eigenvectors capturing the interaction signal when the proportion of traits with shared interaction effects is small (which favors uLIT) to the high-variance eigenvectors when this is high (which favors wLIT; Figure S5). Furthermore, when the baseline correlation increases from 0.5 to 0.75 (bottom right panel of Figure 2A), uLIT follows a similar power curve while the power of wLIT now remains negligible across all sparsity settings. In this case, the is low in the high-variance eigenvectors when the baseline correlation is high (Figure S5 and so wLIT has little power in these situations. In general, we find that wLIT tends to outperform uLIT when the baseline correlation is modest (i.e., 0.25) and the proportion of traits with shared interaction effects is high, otherwise uLIT is the optimal method.
In the setting where there is a mixture of positive and negative pleiotropy, uLIT outperforms wLIT across all settings (Figure 2B). Intuitively, in our simulations, the high-variance eigenvector is the weighted sum of the squared residuals and cross products where the weights have the same sign. When the effect sizes are in different directions (positive and negative pleiotropy), the high-variance eigenvector may dampen the interaction signal, and thus it will also be captured by the low-variance eigenvectors. Since uLIT weights the eigenvectors equally, we observe a large increase in power compared to wLIT. We also considered a variation of the above two pleiotropy scenarios where the effect size for the GxE interaction is opposite of the interacting environment. While we find similar results, the overall power is reduced for all methods (Figure S6).
In summary, even though the top eigenvectors explain the largest amount of variation, it does not imply that they are the ones most correlated to genotype. The interaction signal may be captured by the high-variance eigenvectors or the low-variance eigenvectors depending on the number of traits, baseline correlation, at each eigenvector, proportion of traits with shared interaction effects, and type of pleiotropy. Since the particular eigenvectors that are most powerful can vary widely and are unknown a priori, we applied aLIT to the -values from the above LIT implementations to maximize the number of discoveries. We find that aLIT controls the type I error rate (Figure S3,S4) while making more discoveries than each individual implementation (Figure 2,56). More specifically, aLIT has similar power to wLIT when the signal is captured by the high-variance eigenvectors and similar power to uLIT when the signal is captured by the low-variance eigenvectors. Therefore, we implement aLIT in subsequent analyses.
2.3. aLIT increases power compared to marginal testing procedures
Using the same simulation configuration as in Section 2.2, we considered two competing procedures for identifying latent genetic interactions using multiple traits. The first procedure performs an association test between the squared residuals and a SNP (Marginal (SQ)), while the second procedure additionally includes the cross product terms (Marginal (SQ/CP)). More specifically, Marginal (SQ) tests the squared residuals for all traits and selects the minimum -value from these different tests. Marginal (SQ/CP) adds tests for the cross products and selects the minimum -value from the individual tests. Because we are testing the global null hypothesis of no latent genetic interaction, the marginal testing procedures require a Bonferroni correction for the total number of tests, i.e., where is the significance threshold and is chosen to be the number of principal components that explains 95% of the variation. We then threshold the minimum -value by to determine statistical significance. Across all power simulations (Figure 3, S7), we observed that Marginal (SQ/CP) was more powerful than Marginal (SQ), suggesting that the inclusion of cross products improves performance to detect latent interactions. Given these findings, we compare the performance of aLIT to Marginal (SQ/CP) for the remainder of this work.
In the positive pleiotropy setting with a low baseline correlation, aLIT increases the power to detect GxE interactions when there are a higher proportion of traits with shared interaction effects compared to Marginal (SQ/CP) (Figure 3A). Furthermore, the difference is more pronounced as the number of traits increases. For example, when the baseline correlation is 0.25, and the proportion of traits with shared interaction effects is 0.8, the empirical power of aLIT is 53% and 94.8% for five and ten traits, respectively. On the other hand, the empirical power of Marginal (SQ/CP) is 26% and 48.6%, respectively. While aLIT provides substantial increases in power when the proportion of traits with shared interaction effects is high, Marginal (SQ/CP) can outperform aLIT when the proportion of traits with shared interaction effects is low in the positive pleiotropy setting (see lower left panel of Figure 3A). Intuitively, when there is little correlation between traits due to shared interactions, selecting the minimum -value across traits slightly outperforms combining information between the traits.
The difference in power between aLIT and Marginal (SQ/CP) is also evident across baseline correlations. Interestingly, the improvement in power of aLIT compared to Marginal (SQ/CP) reduces when the baseline correlation increases from 0.25 to 0.50. This observation agrees with our simulation results. from Figure S5 which suggest that the power to detect an interaction signal at any particular eigenvector decreases as the baseline correlation increases from 0.25 to 0.50. Alternatively, when the baseline correlation increases to 0.75, aLIT provides drastic increases in power for most sparsity settings. For example, when there are 10 traits where the proportion of traits with shared interaction effects is 0.5, aLIT’s power increases from 30.8% to 93.8% at a baseline correlation of 0.50 and 0.75 while Marginal (SQ/CP) decreases from 11.6% to 4.6%, respectively. However, in this same example, Marginal (SQ/CP) slightly outperforms aLIT when all traits have an interaction.
Overall, while our results suggest that aLIT outperforms Marginal (SQ/CP) for most baseline correlations and sparsity settings under positive pleiotropy, there are some rare cases where Marginal (SQ/CP) has similar (or improved) performance. Meanwhile, under the simulation setting where there is a mixture of positive and negative pleiotropy (Figure 3B) or the direction of GxE effect sizes are opposite of the interactive environment (Figure S7), the increase in power from aLIT over Marginal (SQ/CP) is substantial across all settings. Note that aLIT outperformed Marginal (SQ) across all power simulations.
2.4. aLIT applied to the UK Biobank data
We applied the LIT framework to detect shared latent genetic interactions in four obesity-related traits from the UK Biobank, namely, waist circumference (WC), hip circumference (HC), body mass index (BMI), and body fat percentage (BFP). After preprocessing, there were 329,146 unrelated individuals that have measurements for all traits and 6,186,503 SNPs (Section 4.4). The correlation between traits ranged from 0.75 (BMI and BFP) to 0.87 (BMI and WC). The total computational time of LIT was approximately 3.3 days using 12 cores.
In each implementation of LIT, after filtering for LD and significant SNPs, the genomic inflation factor of wLIT and ULIT was 1.14 (Figure S8a). While the test statistics are inflated, it is difficult to distinguish the factors driving inflation, e.g., unmodeled population structure or biological signal under polygenic inheritance [43]. We found that the genomic inflation factor increases as a function of minor allele frequency which is expected under polygenic inheritance (Figure S9). To be conservative, we adjusted the significance results of each approach by the corresponding genomic inflation factor (Figure S8b). These adjusted -values were then combined in aLIT to detect latent genetic interactions (Figure 4).
Using the aLIT -values, we discovered 2,252 SNPs with significant interactive effects in 11 distinct regions. Table 1 shows the most significantly associated (lead) SNP in each region. As a comparison, we also applied Marginal (SQ/CP) and detected 2,099 SNPs. Of those found by Marginal (SQ/CP), aLIT’s results overlapped with ≈98% of the detected SNPs and had substantially smaller -values at most loci (Figure S10). Although Marginal SQ/CP detects a few regions that are not found by aLIT, the aLIT -values are comparable in magnitude at these regions. On the other hand, there are three distinct regions found by aLIT (rs2821230, rs11030066, and rs157845) where the -values are substantially smaller than the -values from Marginal (SQ/CP) (Table 1). Thus, in agreement with our simulation results, depending on the type of pleiotropy at a particular locus, there are regions where the significance results of Marginal (SQ/CP) and aLIT are comparable and other regions where aLIT is substantially more powerful than Marginal (SQ/CP).
Table 1:
Chr. | Gene | Lead SNP | MAF | -value (aLIT) | -value (SQ/CP) | -value (LD) | -value (Dom.) |
---|---|---|---|---|---|---|---|
16 | FTO | rs11642015 | 0.402 | 1.08 × 10−46 | 2.73 × 10−40 | 1.23 × 10−46 | 1.28 × 10−46 |
2 | COBLL1 | rs5835988 | 0.406 | 5.30 × 10−14 | 1.08 × 10−9 | 1.29 × 10−13 | 5.32 × 10−14 |
1 | LYPLAL1 | rs2820444 | 0.299 | 1.17 × 10−13 | 2.64 × 10−13 | 1.27 × 10−13 | 1.19 × 10−13 |
6 | PRSS16 | rs13212921 | 0.136 | 1.78 × 10−13 | 3.40 × 10−12 | 1.90 × 10−12 | 4.85 × 10−9 |
18 | MC4R | rs35614134 | 0.234 | 3.95 × 10−12 | 9.34 × 10−12 | 1.39 × 10−11 | 3.97 × 10−12 |
1 | ATP2B4 | rs2821230 | 0.474 | 9.45 × 10−11 | 1.19 × 10−5 | 1.02 × 10−10 | 9.38 × 10−11 |
7 | KLF14 | rs972284 | 0.389 | 1.43 × 10−10 | 2.43 × 10−9 | 1.25 × 10−10 | 1.43 × 10−10 |
11 | LIN7C | rs11030066 | 0.143 | 6.38 × 10−10 | 5.03 × 10−6 | 6.97 × 10−10 | 6.62 × 10−10 |
6 | ILRUN | rs9469860 | 0.146 | 1.01 × 10−9 | 4.47 × 10−9 | 7.72 × 10−9 | 3.62 × 10−7 |
12 | FAIM2 | rs7132908 | 0.384 | 8.85 × 10−9 | 4.22 × 10−9 | 8.49 × 10−9 | 8.96 × 10−9 |
5 | MAP3K1 | rs157845 | 0.253 | 1.03 × 10−8 | 2.39 × 10−4 | 1.30 × 10−8 | 1.03 × 10−8 |
A concern in this analysis is whether statistical significance is due to a nearby SNP with a large additive effect and/or trait scaling issues. To address the latter, first note that the LIT framework not only detects latent interactions, but also departures from linearity due to dominance or misspecification of the trait scale. Therefore, to be conservative and help distinguish interactive effects, we removed the non-linear genetic signal within a locus by fitting a two degree of freedom genotypic model (Section 4.5). After removing the dominance/scaling effects, the lead SNP rs9469860 on chromosome 6 was above the genome-wide significance threshold (Table 1). In general, while the significance of a few loci were impacted (nearly all on chromosome 6 and a few on chromosome 18; Figure S11), most of the lead SNPs remained significant. Finally, to account for nearby SNPs with large additive effects, we applied the LIT implementations to the lead SNPs while regressing out nearby significant SNPs in LD and found all of the lead SNPs remain significant (Table 1).
After evaluating the significant SNPs found by aLIT, we focused on ten lead SNPs that remained significant after accounting for LD and dominance/scaling issues. Using the Variants to Genes (V2G) measure on Open Targets Platform [44], we assigned the lead SNPs to the highest ranked genes (Table 1). A few of these genes are found in other GWAS of obesity-related traits. In particular, the FTO gene is a known obesity-related gene that is associated with type 2 diabetes (see, e.g., [45–47]). While V2G score assigned rs5935988 to COBLL1 (198,262 bp), it was also close to GRB14 (23,935 bp) which has been associated with body fat distribution and may be involved in regulating insulin signaling [48–50]. Other genes that were assigned to lead SNPs are involved in regulating satiety and energy homeostasis (MC4R; [51]), adiposity (LYPLAL1; [52, 53]), and metabolic diseases such as type 2 diabetes (KLF14; [54, 55]).
We then searched for evidence of known interacting variables using the significant SNPs identified by aLIT. Previous work has found sex-specific effects of variants in KLF14, GRB14, and LYPLAL1 [50, 55, 56]. Therefore, we tested whether sex was an interactive variable in at least one of the traits using a multivariate regression model and found evidence of a genotype-by-sex interaction in rs 972284 (; KLF14), rs5835988 (; COBLL1/GRB14), and rs2820444(; LYPLAL1).
3. Discussion
It is challenging to identify, observe, accurately measure, and then detect genetic interactions in a GWAS study. While there are methods to infer interactions that do not require specifying the interactive partner(s) [17, 22–24], these approaches only consider a single trait. To increase statistical power, our proposed kernel-based framework, Latent Interaction Testing (LIT), leverages the shared genetic interaction signal from multiple related traits (i.e., pleiotropy) while maintaining the flexibility of single trait approaches. In our simulation study, we found that the optimal implementation between wLIT and uLIT depends on the genetic architecture. We also found that combining the -values from both approaches in aLIT maximized the number of discoveries while controlling the type I error rate. Furthermore, aLIT increased the power to detect latent genetic interactions compared to marginal testing procedures, and the difference was drastic for certain genetic architectures. We then applied the LIT framework to four obesity-related traits in the UK Biobank and found many loci with potential interactive effects. While we emphasized the linear and projection kernels in our study, aLIT can incorporate multiple kernel choices (e.g., Gaussian) which may increase the power to detect complex interaction signals. However, including additional kernel functions will also increase the computational complexity which may be time prohibitive for biobank-sized datasets.
There are some caveats when interpreting the significance results of LIT or, more generally, any approach that does not require observing the interactive variable(s). Type I error rate control is impacted by loci with large additive effects and trait scaling issues. To address the former issue, we performed inference using all SNPs in LD with the lead SNPs. While it is possible that the true causal SNP is not tagged, it is unlikely in this work since there is dense coverage with the imputed genotypes. We also assessed model misspecification due to an incorrect trait scaling by fitting a genotypic model to flexibly capture non-linear genetic signals. Interestingly, we found that our significance results were primarily impacted at loci located on chromosome 6 (outside the MHC). While this strategy can help identify the extent of genome-wide inflation due to dominance/scaling, it cannot determine whether the latent interactive effects are an artifact of the scale, even though our simulations suggest that detections by LIT are robust to deviations from normality. When presented with traits that follow a non-Normal distribution, an inverse-Normal transformation is typically applied so a trait “appears” as a standard Normal distribution. However, for detecting latent interactions, we recommend against such practice as it does not correct for the mean-variance relationship and can lead to invalid inference [24]. In general, model misspecification from an incorrect scaling is problematic for any population genetic analysis and may require other approaches such as goodness-of-fit testing to help identify an appropriate variance-stabilizing transformation. Finally, similar to other variance-based testing procedures, LIT cannot distinguish whether a discovery is due to a gene-by-gene interaction, gene-by-environment interaction, parent-of-origin effects, or another complex non-additive relationship involving the tested SNP.
There are also several important considerations when applying LIT to genetic data. Importantly, in this work, we assume that individuals are unrelated and traits follow a multivariate Normal distribution. While LIT assumes the data follows a multivariate Normal distribution, our simulation study suggests that it is robust to violations of this assumption. In general, the computational time of LIT increases as the number of traits and sample size increases (Figure S12). Therefore, in order to analyze biobank-sized datasets, LIT uses multiple cores to distribute SNPs (e.g., on the same chromosome) for interaction testing to be computationally more efficient. Because calculating the residual cross products for a large number of traits is computationally intensive ( increase in computational time per SNP), LIT provides a user option to only use the squared residuals. However, as demonstrated in simulation and in the UK Biobank dataset, employing this option is nearly certain to lose power. Finally, while a discovery in LIT suggests evidence of a non-additive effect, LIT does not identify the trait, or subset of traits, driving that result. To do so, investigators might consider running Marginal (SQ/CP) at the “lead” SNP to rank/identify individual traits with non-additive effects (squared residuals) and pairs of traits with shared non-additive effects (cross products; although, see Methods).
For many complex traits, there is strong discrepancy between GWAS-based estimates of heritability (which explicitly assume additive effects of genetic variation) and family-based estimates (which may incorporate non-additive effects and higher-order interactions). With recent biobank-sized datasets, we can begin to identify loci with non-additive genetic variation that contribute to this missing heritability while understanding its role in the etiology of complex traits. As biobank-sized datasets become more prevalent, we anticipate that computationally scalable approaches that leverage information across multiple traits, such as LIT, will become increasingly important to discovering non-additive genetic loci.
4. Methods
4.1. Motivation
Consider the trait for unrelated individuals with measurable traits. Suppose depends on a biallelic locus with genotype denoting the number of minor alleles for the th individual, an unobserved (or latent) environmental variable , and a latent genotype-by-environment (GxE) interaction . These components contribute to expression additively in the following regression model:
(1) |
where is the effect size of the minor allele, is the effect size of the environmental variable, is the effect size of the GxE interaction, and is an independent and identically distributed random error with mean zero and variance . In this simplified setting, our goal is to detect the latent GxE interaction without observing the interacting variable .
Under the above model assumptions, the latent GxE interaction will induce differential trait variance and covariance patterns that differ by genotype. Without loss of generality, assume the environmental variable has mean zero with unit variance. In Appendix 5.1.1, we show that the individual-specific trait variance (ITV) of the th trait conditional on genotype is
(2) |
where , and . We also show that the individual-specific covariance (ITC) between the th and th trait conditional on genotype is
(3) |
where , and . It is evident that a latent GxE interaction in trait not only induces a variance pattern that depends on genotype (Equation 2), but can also induce a covariance pattern between traits and from either a shared interaction or a shared environment involved in the interaction (; Equation 3). These results suggest that we can test for loci with latent interactive effects by assessing whether the individual-specific trait variances (ITV) and covariances (ITC) differ by genotype without specifying or directly modeling the interacting variable . While we assumed the interacting variable is environmental, the above insights are generalizable to any latent interaction(s) involving the SNP (e.g., a genotype-by-genotype interaction).
4.2. Latent Interaction Testing (LIT) framework
Our strategy builds from the above observations and estimates the ITV and ITC to detect latent genetic interactions. To derive estimates of these quantities, we first remove the additive genetic effect from the traits to ensure that any variance and covariance effects are not due to the additive effect. Let us denote the trait residuals as where we assume the effect size is known for simplicity. We can then express the ITV and ITC as a function of these residuals: the ITV of trait and the ITC between traits and is defined as and , respectively (Appendix 5.1.1). Thus, we can estimate the ITV by squaring the residuals, , and estimate the ITC between traits and by the pairwise product of the residuals (i.e., the cross products), . Aggregating the ITV and ITC estimates across all individuals, we denote the cross product (CP) terms in the matrix where the th row vector is , and the squared residual (SQ) terms in the matrix where .
Our inference goal is to assess whether the SNP, , is independent of the squared residuals and cross products,
(4) |
where ‘.’ denotes all the rows (or individuals) and ‘’ denotes statistical independence. In the above regression model, this corresponds to testing the global null hypothesis versus the alternative hypothesis for at least one of the traits. While a regression model can be directly applied to the squared residuals and cross products to test the global null hypothesis (see Appendix for mathematical details), a univariate model approach does not adequately leverage pleiotropy and requires a multiple testing correction which reduces power.
To address these issues, we develop a new multivariate kernel-based framework, Latent Interaction Testing (LIT), that captures pleiotropy across the ITV and ITC terms to increase power for detecting latent interactions. There are three key steps in the LIT framework (Figure 1):
Regress out the additive genetic effects and any other covariates from the traits. Additionally, adjust the traits and genotypes for population structure.
Calculate estimates of the ITV and ITC for each individual using the squared residuals and the cross products of the residuals, respectively.
Test the global null hypothesis of no latent interaction by comparing the adjusted genotype(s) to the ITV and ITC estimates.
We expand on the above steps in detail below.
Step 1: In the first step, LIT standardizes the traits and then regresses out the additive genetic effects, population structure, and any other covariates. This ensures that any differential variance and/or covariance patterns are not due to additive genetic effects or population structure. Suppose there are measured covariates and principal components to control for structure. We denote these variables in the matrix . After regressing out these variables and the additive genetic effects, the matrix of residuals is , where is the standardized trait matrix, is a matrix of effect sizes and is a matrix of coefficients estimated using least squares. We also regress out population structure from the genotypes which we denote by .
The above approach only removes the mean effects and does not correct for variance effects from population structure which can impact type I error rate control [57]. A strategy to adjust for the variance effects is to standardize the genotypes with the estimated individual-specific allele frequencies (IAF), i.e., the allele frequencies given the genetic ancestry of an individual. However, it is computationally costly to standardize the genotypes for biobank-sized datasets as it requires estimating the IAFs of all SNPs using a generalized linear model [58, 59]. Therefore, in this work, we remove the mean effects from structure and then adjust the test statistics with the genomic inflation factor to be conservative. Our software includes an implementation to standardize the genotypes using the IAFs for smaller datasets.
Step 2: The second step uses the residuals, , to reveal any latent interactions by constructing estimates of the ITV and ITC. For the th individual’s set of trait residuals, the ITVs are estimated by squaring the trait residuals while the ITCs are estimated by calculating the cross products of the trait residuals. We express the squared residuals as , and the pairwise cross products as . Importantly, when the studentized residuals are used, then and represent an unbiased estimate of the ITVs and ITCs, respectively. We aggregate these terms across all individuals into the matrix .
Step 3: In the last step, we test for association between the adjusted SNP and the squared residuals and cross products (SQ/CP) using a kernel-based distance covariance framework [31–33]. Specifically, we apply a kernel-based independence test called the Hilbert-Schmidt independence criterion (HSIC), which has been previously used for GWAS data (see, e.g., [35–38]). The HSIC constructs two similarity matrices between individuals using the SQ/CP matrix and genotype matrix, then calculates a test statistic that measures any shared signal between these similarity matrices. To estimate the similarity matrix, a kernel function is specified that captures the similitude between the th and th individual.
Since our primary application is biobank-sized data, we use a linear kernel so that LIT is computationally efficient. The linear similarity matrix is defined as for the genotype matrix and for the SQ/CP matrix. The linear kernel is a scaled version of the covariance matrix and, for this special case, the HSIC is related to the RV coefficient. We note that one can choose other options for a kernel function, such as a polynomial kernel, projection kernel, and a Gaussian radial-basis function that can capture non-linear relationships [34, 35].
Once the similarity matrices and are constructed, we can express the HSIC test statistic as
(5) |
which follows a weighted sum of Chi-squared random variables under the null hypothesis, i.e., where and are the ordered non-zero eigenvalues of the respective matrices and . Intuitively, the test statistic measures the ‘overlap’ between two random matrices where large values of imply the two matrices are similar (i.e., a latent genetic interactive effect) while small values of imply no evidence of similarity (i.e., no latent genetic interactive effects). We can approximate the null distribution of using Davies’ method, which is computationally fast and accurate for large [35, 38, 60].
For the linear kernel considered here, we implement a simple strategy to substantially improve the computational speed of LIT. We first calculate the eigenvectors and eigenvalues of the SQ/CP and genotype matrices to construct the test statistic. Since the number of traits, , is much smaller than the sample size, , we can perform a singular value decomposition to estimate the subset of eigenvectors and eigenvalues in a computationally efficient manner [61–63]. This allows us to circumvent direct calculation and storage of large similarity matrices. Let and be the singular value decomposition (SVD) of the similarity matrices where the matrix is a diagonal matrix of eigenvalues and is a matrix of eigenvectors of the respective kernel matrices. We can then express the test statistic in terms of the SVD components as , where is the outer product between the two eigenvectors. Thus, for a single SNP, the test statistic is , where is the rank of the SQ/CP matrix and is the rank of the genotype matrix such that .
4.2.1. Aggregating different LIT implementations using the Cauchy combination test
We explore an important aspect of the test statistic in Equation 5, namely, the role of the eigenvalues in determining statistical significance. The above equations suggest that the eigenvalues of the kernel matrices are emphasizing the eigenvectors that explain the most variation in the test statistic. While this may be reasonable in some settings, the interaction signal can be captured by eigenvectors that explain the least variation and this can be very difficult to ascertain beforehand [40]. In this case, the testing procedure will be underpowered. Thus, we also consider weighting the eigenvectors equally in LIT, i.e., , where are the eigenvalues of the outer product matrix. In this work, we implement a linear kernel (scaled covariance matrix) and so, in this special case, weighting the eigenvectors equally is equivalent to the projection kernel.
In summary, there are two implementations of the LIT framework. The residuals are first transformed to calculate the SQ and CP to reveal any latent interactive effects. We then calculate the weighted and unweighted eigenvectors in the test statistic which we refer to as weighted LIT (wLIT) and unweighted LIT (uLIT), respectively. We also apply a Cauchy combination test (CCT) [41] to combine the -values from the LIT implementations to maximize the number of discoveries and hedge for various (unknown) settings where one implementation may outperform the other. More specifically, let denote the value for the implementations. In this case, the CCT statistic is , where is a mathematical constant. A corresponding -value is then calculated using the standard Cauchy distribution. Importantly, when applying genome-wide significance levels, the CCT -value provides control of the type I error rate under arbitrary dependence structures. In the Results section, we refer to the CCT -value as aggregate LIT (aLIT).
4.2.2. Incorporating multiple loci in LIT
We can extend LIT to assess latent interactions within a genetic region (e.g., a gene) consisting of multiple SNPs. In the first step, we regress out the joint additive effects from the multiple SNPs along with any other covariates and population structure. In the second step, we calculate the squared residuals and cross products using the corresponding residual matrix. Finally, in the last step, we construct the similarity matrices and perform inference using the HSIC: the linear similarity matrix for the genotype matrix is and our test statistic is where is the rank of the genotype matrix.
Compared to the previous section, this extended version of LIT is a region-based test for interactive effects instead of a SNP-by-SNP test. A region-based test is advantageous to reduce the number of tests compared to a SNP-by-SNP approach. However, in this work, we demonstrate LIT on SNP-by-SNP genome-wide scan to demonstrate the scalability.
4.3. Simulation study
We evaluated the performance of LIT using simulated data with the following assumptions. Let the individual-specific minor allele frequencies of biallelic genotypes be denoted by . Of the SNPs, SNPs had no interacting partner and a minor allele frequency drawn from a Uniform (0.1, 0.4). The SNP with an interacting partner had a minor allele frequency of 0.25. We fixed this MAF to remove stochastic variation in the observed power induced by simulations differing only by the MAF of the interacting SNP. The genotypes were then drawn from a Binomial distribution with parameter , i.e., . In total, there were individuals simulated to reflect biobank-sized GWAS. We simulated the trait expression value for traits under the polygenic trait model with two risk environmental variables and . Specifically, there were traits and genotypes simulated with an additive genetic, environmental, and GxE components:
(6) |
where the intercept, , follows a Normal distribution with a standard deviation of 5; the effect sizes of the GxE interaction, , interacting environment, , and additive genetic component, , follow a Normal distribution with mean zero and standard deviation of 0.01; the two environmental variables were generated from a standard Normal distribution where only one interacts with the risk allele; and the error term was generated from a standard Normal distribution. Using the above model, we considered different types of pleiotropy. First, we assigned the effect size direction of the additive genetic component, interacting environment, and the GxE interaction to be the same in each trait. We then considered cases where the effect size for the shared GxE interaction is in the same direction (i.e., and random directions across traits. These settings represent positive pleiotropy and a mixture of positive and negative pleiotropy, respectively. We also considered a variation of the above settings where the direction of the effect size for the GxE interaction is opposite of the interacting environment.
We transformed the components in the model using the function , which takes a vector and standardizes it by the estimated mean and standard deviation. We scaled each component to set the baseline correlation between traits (ignoring the risk factor, interactive environment, and GxE interaction) as 0.25, 0.50, and 0.75. In particular, the percent variance explained of the non-interactive environment was 15% and the additive genetic component (minus the risk factor) was 10%, 35%, and 60%, which represents a 0.25, 0.50, and 0.75 baseline correlation between traits, respectively. We then assigned the percent variance explained for the additive genetic risk factor as 0.2%, the interactive environment as a uniformly drawn value from 0.5% to 2.0%, the GxE interaction as a uniformly drawn value from 0.1% to 0.15%, and the remaining variation as noise.
In our simulation study, we also varied the proportion of traits with an interaction term. For traits, let denote the proportion of traits with a shared GxE interaction signal. We varied this proportion as . At each combination of baseline trait correlation, number of traits, and proportion of null traits, we generated data from the above polygenic trait model 500 times for each pleiotropy setting. We calculated the empirical power by averaging the total number of times the -values were below a significance threshold of . Under the null hypothesis of no GxE interaction, we assessed the type I error rate at using 50 simulated datasets with 10,000 SNPs where the traits do not have a GxE interaction. We also considered cases where the random error follows a Chi-squared distribution with five degrees of freedom and a -distribution with three degrees of freedom under the null hypothesis.
4.4. UK Biobank
The UK Biobank is a collaborative research effort to gather environmental and genetic information from half a million volunteers 40–69 years old in the United Kingdom. The data was collected across 22 assessment centers from 2006 to 2010 where participants were given a general lifestyle and health questionnaire, a physical examination, and a blood test that provided genetic data [64, 65]. See ref. [66,67] for detailed information on the study design.
We applied LIT to four obesity-related traits, namely, waist circumference, hip circumference, body mass index, and body fat percentage. We restricted our analysis to unrelated individuals with British ancestry and removed any individuals with a sex chromosome aneuploidy. Using the imputed genotypes (autosomes only), SNPs were filtered in PLINK [68] with the following thresholds: a MAF of >0.05, a genotype missingness rate of <0.05, Hardy-Weinberg equilibrium (defined as >10−5), and an INFO score of >0.9. The traits were adjusted for age and the top 20 principal components provided by the UK Biobank to account for ancestry. We removed individuals with measurements that were four standard deviations above the average and then standardized the traits by sex. After filtering, there were 329,146 individuals and 6,186,503 SNPs in our analysis.
4.5. Challenges for latent interaction tests
In the Results section, we addressed a couple of challenges for interaction tests that do not require observing the interactive variable(s). The first is that false positives are possible due to linkage disequilibrium (LD) with a SNP that has a large additive effect (see, e.g., [23]). Intuitively, an imperfect correction of the additive effects creates heteroskedasticity which can be detected after transforming the residuals. Ideally, the SNP(s) in LD with additive effects are regressed out from the traits to avoid false positives. Therefore, we first identified SNPs that were in LD (within 1 Mb and correlation >0.1) with the lead SNPs. We then applied a multivariate testing procedure, GAMuT [35], on the selected SNPs to detect additive effects across traits. Note that GAMuT performs an association test between a SNP and the traits using the same test statistic as LIT. The significant SNPs detected were regressed out from the traits and then these adjusted traits were used in LIT.
The second challenge is that an incorrect scaling of the trait, or a scaling where the polygenic assumption does not hold, will induce a variance effect [17, 24]. This complicates underlying inferences because the latent variables may contain non-interactive genetic effects that will be detected as a non-additive effect; for example, the latent variables may capture non-linear effects such as . However, such effects are indistinguishable from a loci with a dominant effect. To be conservative, we addressed the extent of dominance/scaling issues by fitting a two degree of freedom genotypic model to each trait. This allows us to flexibly capture and estimate non-linear genetic variation as a function of genotype. We regressed these effects out from each trait and then applied LIT to the adjusted traits to test whether a SNP remained statistically significant. Under the assumption that a majority of loci act additively, if the significance results across many loci are driven by non-linear genetic effects then it may be suggestive of trait scaling issues.
Supplementary Material
Acknowledgements
This research has been conducted using the UK Biobank Resource under Applied Number 58259. This work was supported by NIH grants R01 AG071170 (AJB, SB, DJC, MPE), R01 AG072120 (APW, TSW), R01 AG075827 (APW, TSW), and R01 AG079170 (TSW). The authors would like to thank Michael Boehnke for his helpful feedback during the preparation of this manuscript.
Footnotes
Software and data
LIT is publicly available in the R package lit. The package can be downloaded at https://github.com/ajbass/lit. The code to reproduce the results in this work can be found at https://github.com/ajbass/lit_manuscript and access to the UK Biobank data can be requested at https://www.ukbiobank.ac.uk/enable-your-research/apply-for-access.
References
- 1.Visscher PM, Wray NR, Zhang Q, Sklar P, McCarthy MI, Brown MA, et al. 10 years of GWAS discovery: Biology, function, and translation. The American Journal of Human Genetics. 2017;101(1):5–22. doi: 10.1016/j.ajhg.2017.06.005. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.Maher B. Personal genomes: The case of the missing heritability. Nature. 2008;456(7218):18–21. doi: 10.1038/456018a. [DOI] [PubMed] [Google Scholar]
- 3.Locke AE, Kahali B, Berndt SI, Justice AE, Pers TH, Day FR, et al. Genetic studies of body mass index yield new insights for obesity biology. Nature. 2015;518(7538):197–206. doi: 10.1038/nature14177. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Yang J, Bakshi A, Zhu Z, Hemani G, Vinkhuyzen AAE, Lee SH, et al. Genetic variance estimation with imputed variants finds negligible missing heritability for human height and body mass index. Nature Genetics. 2015;47(10):1114–1120. doi: 10.1038/ng.3390. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Yengo L, Sidorenko J, Kemper KE, Zheng Z, Wood AR, Weedon MN, et al. Metaanalysis of genome-wide association studies for height and body mass index in ∼700000 individuals of European ancestry. Human Molecular Genetics. 2018;27(20):3641–3649. doi: 10.1093/hmg/ddy271. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Wainschtein P, Jain D, Zheng Z, Aslibekyan S, Becker D, Bi W, et al. Assessing the contribution of rare variants to complex trait heritability from whole-genome sequence data. Nature Genetics. 2022;54(3):263–273. doi: 10.1038/s41588-021-00997-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Elks C, Den Hoed M, Zhao JH, Sharp S, Wareham N, Loos R, et al. Variability in the heritability of body mass index: A systematic review and meta-regression. Frontiers in Endocrinology. 2012;3. doi: 10.3389/fendo.2012.00029. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Loos RJF, Yeo GSH. The genetics of obesity: from discovery to biology. Nature Reviews Genetics. 2022;23(2):120–133. doi: 10.1038/s41576-021-00414-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Manolio TA, Collins FS, Cox NJ, Goldstein DB, Hindorff LA, Hunter DJ, et al. Finding the missing heritability of complex diseases. Nature. 2009;461(7265):747–753. doi: 10.1038/nature08494. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Yang J, Benyamin B, McEvoy BP, Gordon S, Henders AK, Nyholt DR, et al. Common SNPs explain a large proportion of the heritability for human height. Nature Genetics. 2010;42(7):565–569. doi: 10.1038/ng.608. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Gibson G. Rare and common variants: Twenty arguments. Nature Reviews Genetics. 2012;13(2):135–145. doi: 10.1038/nrg3118. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.López-Cortegano E, Caballero A. Inferring the nature of missing heritability in human traits using data from the GWAS catalog. Genetics. 2019;212(3):891–904. doi: 10.1534/genetics.119.302077. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Zuk O, Hechter E, Sunyaev SR, Lander ES. The mystery of missing heritability: Genetic interactions create phantom heritability. Proceedings of the National Academy of Sciences. 2012;109(4):1193–1198. doi: 10.1073/pnas.1119675109. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Hemani G, Knott S, Haley C. An evolutionary perspective on epistasis and the missing heritability. PLOS Genetics. 2013;9(2):1–11. doi: 10.1371/journal.pgen.1003295. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Robinson MR, English G, Moser G, Lloyd-Jones LR, Triplett MA, Zhu Z, et al. Genotype-covariate interaction effects and the heritability of adult body mass index. Nature Genetics. 2017;49(8):1174–1181. doi: 10.1038/ng.3912. [DOI] [PubMed] [Google Scholar]
- 16.Tyrrell J, Wood AR, Ames RM, Yaghootkar H, Beaumont RN, Jones SE, et al. Gene-obesogenic environment interactions in the UK Biobank study. International Journal of Epidemiology. 2017;46(2):559–575. doi: 10.1093/ije/dyw337. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Sulc J, Mounier N, Günther F, Winkler T, Wood AR, Frayling TM, et al. Quantification of the overall contribution of gene-environment interaction for obesity-related traits. Nature Communications. 2020;11(1):1385. doi: 10.1038/s41467-020-15107-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Nagpal S, Tandon R, Gibson G. Canalization of the polygenic risk for common diseases and traits in the UK Biobank cohort. Molecular Biology and Evolution. 2022;39(4). doi: 10.1093/molbev/msac053. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Aschard H. A perspective on interaction effects in genetic association studies. Genetic Epidemiology. 2016;40(8):678–688. doi: 10.1002/gepi.21989. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Kraft P, Aschard H. Finding the missing gene-environment interactions. European Journal of Epidemiology. 2015;30(5):353–355. doi: 10.1007/s10654-015-0046-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Breusch Trevor S and Pagan Adrian R. A simple test for heteroscedasticity and random coefficient variation. Econometrica. 1979;47(5):1287–1294. [Google Scholar]
- 22.Paré G, Cook NR, Ridker PM, Chasman DI. On the use of variance per genotype as a tool to identify quantitative trait interaction effects: A report from the Women’s Genome Health Study. PLOS Genetics. 2010;6(6):1–10. doi: 10.1371/journal.pgen.1000981. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Wang H, Zhang F, Zeng J, Wu Y, Kemper KE, Xue A, et al. Genotype-by-environment interactions inferred from genetic effects on phenotypic variability in the UK Biobank. Science Advances. 2019;5(8). doi: 10.1126/sciadv.aaw3538. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.Al Young, Wauthier FL, Donnelly P. Identifying loci affecting trait variability and detecting interactions in genome-wide association studies. Nature Genetics. 2018;50(11):1608–1614. doi: 10.1038/s41588-018-0225-6. [DOI] [PubMed] [Google Scholar]
- 25.Barton NH, Turelli M. Evolutionary quantitative genetics: How little do we know? Annual Review of Genetics. 1989;23(1):337–370. doi: 10.1146/annurev.ge.23.120189.002005. [DOI] [PubMed] [Google Scholar]
- 26.Sivakumaran S, Agakov F, Theodoratou E, Prendergast JG, Zgaga L, Manolio T, et al. Abundant pleiotropy in human complex diseases and traits. The American Journal of Human Genetics. 2011;89(5):607–618. doi: 10.1016/j.ajhg.2011.10.004. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27.Gratten J, Visscher PM. Genetic pleiotropy in complex traits and diseases: Implications for genomic medicine. Genome Medicine. 2016;8(1):78. doi: 10.1186/s13073-016-0332-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28.Zhou X, Stephens M. Efficient multivariate linear mixed model algorithms for genome-wide association studies. Nature Methods. 2014;11(4):407–409. doi: 10.1038/nmeth.2848. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29.Aschard H, Vilhjálmsson BJ, Greliche N, Morange PE, Trégouët DA, Kraft P. Maximizing the power of principal-component analysis of correlated phenotypes in genome-wide association studies. The American Journal of Human Genetics. 2014;94(5):662–676. doi: 10.1016/j.ajhg.2014.03.016. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30.Lea A, Subramaniam M, Ko A, Lehtimäki T, Raitoharju E, Kähönen M, et al. Genetic and environmental perturbations lead to regulatory decoherence. eLife. 2019;8:e40538. doi: 10.7554/eLife.40538. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31.Gretton A, Fukumizu K, Teo C, Song L, Schölkopf B, Smola A. A kernel statistical test of independence. In: Platt J, Koller D, Singer Y, Roweis S, editors. Advances in Neural Information Processing Systems. vol. 20. Curran Associates, Inc.; 2007.Available from: https://proceedings.neurips.cc/paper/2007/file/d5cfead94f5350c12c322b5b664544c1-Paper.pdf. [Google Scholar]
- 32.Székely GJ, Rizzo ML, Bakirov NK. Measuring and testing dependence by correlation of distances. The Annals of Statistics. 2007;35(6):2769 – 2794. doi: 10.1214/009053607000000505. [DOI] [Google Scholar]
- 33.Zhang K, Peters J, Janzing D, Schölkopf B. Kernel-based conditional independence test and application in causal discovery. In: Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence. UAl’11. Arlington, Virginia, USA: AUAI Press; 2011. p. 804–813. [Google Scholar]
- 34.Schaid DJ. Genomic similarity and kernel methods II: Methods for genomic information. Human heredity. 2010;70(2):132–140. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35.Broadaway KA, Cutler DJ, Duncan R, Moore JL, Ware EB, Jhun MA, et al. A statistical approach for testing cross-phenotype effects of rare variants. The American Journal of Human Genetics. 2016;98(3):525–540. doi: 10.1016/j.ajhg.2016.01.017. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36.Kwee LC, Liu D, Lin X, Ghosh D, Epstein MP. A powerful and flexible multilocus association test for quantitative traits. The American Journal of Human Genetics. 2008;82(2):386–397. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 37.Wu MC, Kraft P, Epstein MP, Taylor DM, Chanock SJ, Hunter DJ, et al. Powerful SNP-set analysis for case-control genome-wide association studies. The American Journal of Human Genetics. 2010;86(6):929–942. doi: 10.1016/j.ajhg.2010.05.002. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38.Wu MC, Lee S, Cai T, Li Y, Boehnke M, Lin X. Rare-variant association testing for sequencing data with the sequence kernel association test. The American Journal of Human Genetics. 2011;89(1):82–93. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 39.Dutta D, Scott L, Boehnke M, Lee S. Multi-SKAT: General framework to test for rarevariant association with multiple phenotypes. Genetic Epidemiology. 2019;43(1):4–23. doi: 10.1002/gepi.22156. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40.Jolliffe IT. Principal component analysis. Springer Series in Statistics. Springer; 2002. [Google Scholar]
- 41.Liu Y, Xie J. Cauchy combination test: A powerful test with analytic p-value calculation under arbitrary dependency structures. Journal of the American Statistical Association. 2020;115(529):393–402. doi: 10.1080/01621459.2018.1554485. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 42.Liu Y, Chen S, Li Z, Morrison AC, Boerwinkle E, Lin X. ACAT: A fast and powerful p-value combination method for rare-variant analysis in sequencing studies. The American Journal of Human Genetics. 2019;104(3):410–421. doi: 10.1016/j.ajhg.2019.01.002. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 43.Yang J, Weedon MN, Purcell S, Lettre G, Estrada K, Willer CJ, et al. Genomic inflation factors under polygenic inheritance. European Journal of Human Genetics. 2011;19(7):807–812. doi: 10.1038/ejhg.2011.39. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 44.Ochoa D, Hercules A, Carmona M, Suveges D, Gonzalez-Uriarte A, Malangone C, et al. Open Targets Platform: Supporting systematic drug-target identification and prioritisation. Nucleic Acids Research. 2020;49(D1):D1302–D1310. doi: 10.1093/nar/gkaa1027. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 45.Dina C, Meyre D, Gallina S, Durand E, Körner A, Jacobson P, et al. Variation in FTO contributes to childhood obesity and severe adult obesity. Nature Genetics. 2007;39(6):724–726. doi: 10.1038/ng2048. [DOI] [PubMed] [Google Scholar]
- 46.Frayling TM, Timpson NJ, Weedon MN, Zeggini E, Freathy RM, Lindgren CM, et al. A common variant in the FTO gene is associated with body mass index and predisposes to childhood and adult obesity. Science. 2007;316(5826):889–894. doi: 10.1126/science.1141634. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 47.Herrera BM, Keildson S, Lindgren CM. Genetics and epigenetics of obesity. Maturitas. 2011;69(1):41–49. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 48.Morris AP, Voight BF, Teslovich TM, Ferreira T, Segrè AV, Steinthorsdottir V, et al. Large-scale association analysis provides insights into the genetic architecture and pathophysiology of type 2 diabetes. Nature Genetics. 2012;44(9):981–990. doi: 10.1038/ng.2383. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 49.Karaderi T, Drong AW, Lindgren CM. Insights into the genetic susceptibility to type 2 diabetes from genome-wide association studies of obesity-related traits. Current Diabetes Reports. 2015;15(10):83. doi: 10.1007/s11892-015-0648-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 50.Pulit SL, Karaderi T, Lindgren CM. Sexual dimorphisms in genetic loci linked to body fat distribution. Bioscience Reports. 2017;37(1). doi: 10.1042/BSR20160184. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 51.Namjou B, Stanaway IB, Lingren T, Mentch FD, Benoit B, Dikilitas O, et al. Evaluation of the MC4R gene across eMERGE network identifies many unreported obesity-associated variants. International Journal of Obesity. 2021;45(1):155–169. doi: 10.1038/s41366-020-00675-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 52.Heid IM, Jackson AU, Randall JC, Winkler TW, Qi L, Steinthorsdottir V, et al. Meta-analysis identifies 13 new loci associated with waist-hip ratio and reveals sexual dimorphism in the genetic basis of fat distribution. Nature Genetics. 2010;42(11):949–960. doi: 10.1038/ng.685. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 53.Lindgren CM, Heid IM, Randall JC, Lamina C, Steinthorsdottir V, Qi L, et al. Genome-wide association scan meta-analysis identifies three loci influencing adiposity and fat distribution. PLOS Genetics. 2009;5(6):1–13. doi: 10.1371/journal.pgen.1000508. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 54.Voight BF, Scott LJ, Steinthorsdottir V, Morris AP, Dina C, Welch RP, et al. Twelve type 2 diabetes susceptibility loci identified through large-scale association analysis. Nature Genetics. 2010;42(7):579–589. doi: 10.1038/ng.609. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 55.Small KS, Todorčević M, Civelek M, El-Sayed Moustafa JS, Wang X, Simon MM, et al. Regulatory variants at KLF14 influence type 2 diabetes risk via a female-specific effect on adipocyte size and body composition. Nature Genetics. 2018;50(4):572–580. doi: 10.1038/s41588-018-0088-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 56.Norheim F, Hasin-Brumshtein Y, Vergnes L, Chella Krishnan K, Pan C, Seldin MM, et al. Geneby-sex interactions in mitochondrial functions and cardio-metabolic traits. Cell Metabolism. 2019;29(4):932–949.e4. doi: 10.1016/j.cmet.2018.12.013. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 57.Song M, Hao W, Storey JD. Testing for genetic associations in arbitrarily structured populations. Nature Genetics. 2015;47(5):550–554. doi: 10.1038/ng.3244. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 58.Hao W, Song M, Storey JD. Probabilistic models of genetic variation in structured populations applied to global human studies. Bioinformatics. 2015;32(5):713–721. doi: 10.1093/bioinformatics/btv641. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 59.Bass AJ. High-dimensional methods to model biological signal in genome-wide studies. Princeton University; 2021. Available from: http://arks.princeton.edu/ark:/88435/dsp01m326m485q. [Google Scholar]
- 60.Davies RB. Algorithm AS 155: The distribution of a linear combination of random variables. Journal of the Royal Statistical Society Series C (Applied Statistics). 1980;29(3):323–333. [Google Scholar]
- 61.Lanczos C. An iteration method for the solution of the eigenvalue problem of linear differential and integral operators. Journal of Research of the National Bureau of Standards. 1950;45:255–282. [Google Scholar]
- 62.Arnoldi WE. The principle of minimized iterations in the solution of the matrix eigenvalue problem. Quarterly of Applied Mathematics. 1951;9(1):17–29. [Google Scholar]
- 63.Lehoucq RB, Sorensen DC. Deflation techniques for an implicitly restarted Arnoldi iteration. SIAM J Matrix Anal Appl. 1996;17(4):789–821. doi: 10.1137/S0895479895281484. [DOI] [Google Scholar]
- 64.Sudlow C, Gallacher J, Allen N, Beral V, Burton P, Danesh J, et al. UK Biobank: An open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLOS Medicine. 2015;12(3):1–10. doi: 10.1371/journal.pmed.1001779. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 65.Bycroft C, Freeman C, Petkova D, Band G, Elliott LT, Sharp K, et al. The UK Biobank resource with deep phenotyping and genomic data. Nature. 2018;562(7726):203–209. doi: 10.1038/s41586-018-0579-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 66.UK Biobank: Protocol for a large-scale prospective epidemiological resource; 2016. Available from: https://www.ukbiobank.ac.uk/media/gnkeyh2q/study-rationale.pdf.
- 67.Genotyping and quality control of UK Biobank, a large-scale, extensively phenotyped prospective resource; 2015. Available from: https://biobank.ctsu.ox.ac.uk/crystal/crystal/docs/genotyping_qc.pdf.
- 68.Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, et al. PLINK: A tool set for whole-genome association and population-based linkage analyses. The American Journal of Human Genetics. 2007;81(3):559–575. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 69.Nadarajah S, Pogány TK. On the distribution of the product of correlated normal random variables. Comptes Rendus Mathematique. 2016;354(2):201–204. doi: 10.1016/j.crma.2015.10.019. [DOI] [Google Scholar]
- 70.Cui G, Yu X, lommelli S, Kong L. Exact distribution for the product of two correlated Gaussian random variables. IEEE Signal Processing Letters. 2016;23(11):1662–1666. doi: 10.1109/LSP.2016.2614539. [DOI] [Google Scholar]
- 71.Bartlett RF. Linear modelling of Pearson’s product moment correlation coefficient: An application of Fisher’s z-transformation. Journal of the Royal Statistical Society Series D (The Statistician). 1993;42(1):45–53. [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.