Skip to main content
Genetics logoLink to Genetics
. 2024 Feb 5;226(4):iyae018. doi: 10.1093/genetics/iyae018

An expression-directed linear mixed model discovering low-effect genetic variants

Qing Li 1,#, Jiayi Bian 2,#, Yanzhao Qian 3, Pathum Kossinna 4, Cooper Gau 5, Paul M K Gordon 6, Xiang Zhou 7, Xingyi Guo 8, Jun Yan 9,10, Jingjing Wu 11,, Quan Long 12,13,14,15,16,✉,4
Editor: H Zhao
PMCID: PMC11630775  PMID: 38314848

Abstract

Detecting genetic variants with low-effect sizes using a moderate sample size is difficult, hindering downstream efforts to learn pathology and estimating heritability. In this work, by utilizing informative weights learned from training genetically predicted gene expression models, we formed an alternative approach to estimate the polygenic term in a linear mixed model. Our linear mixed model estimates the genetic background by incorporating their relevance to gene expression. Our protocol, expression-directed linear mixed model, enables the discovery of subtle signals of low-effect variants using moderate sample size. By applying expression-directed linear mixed model to cohorts of around 5,000 individuals with either binary (WTCCC) or quantitative (NFBC1966) traits, we demonstrated its power gain at the low-effect end of the genetic etiology spectrum. In aggregate, the additional low-effect variants detected by expression-directed linear mixed model substantially improved estimation of missing heritability. Expression-directed linear mixed model moves precision medicine forward by accurately detecting the contribution of low-effect genetic variants to human diseases.

Keywords: low-effect genetic variants, linear mixed model, gene expression, human diseases


Li et al. propose a new statistical method detecting low-effect genetic variants by adjusting the linear mixed models using weights learned from gene expressions.

Introduction

Discovering genetic variants associated with risk of complex traits is a long-standing theme in the field of genetics. Given complicated genetic background of participating individuals, it is generally difficult to identify common variants that have statistically small effect sizes (e.g. with R2 < 1%) using a moderate sample size (e.g. at the level of thousands) (Zondervan and Cardon 2004). Although their independent contributions are statistically weak in association studies, they may play critical roles in the pathology of individual patients, despite of their seemingly low contribution in the whole population. For instance, the Price lab has articulated that pathologically important variants have to explain low phenotypic variance due to the effect of negative selections (Gazal et al. 2018; O'Connor et al. 2019; Peyrot and Price 2021). Additionally, these low-effect variants might aggregately account for a substantial proportion of the heritability. If the above assumptions are indeed valid, methods discovering low-effect variants using a moderate sample size may uncover risk genetic variants in many applications, including the experimental investigation of pathology, formation of a polygenic risk score (PRS) (Choi et al. 2020), and characterization of the susceptibility genetic architecture of a complex trait.

Linear mixed models (LMMs) serve as primary approaches to discovering genetic variants associated with disease phenotype in association studies (Yu et al. 2006; Kang et al. 2008a; Listgarten et al. 2010; Price et al. 2010; Zhang et al. 2010). This is largely due to the advantage of using a random term modeled by the genetic relationship matrix (GRM), to correct population structure and other confounding effects. In this work, we determined that by accurately modeling this random term in an LMM with the informative weights learned from omics data, (e.g. transcriptome), one can discover risk genetic variants with lower effect size on diseases/traits, leading to more proportion of explainable heritability. From a technical standpoint, we extended an idea from transcriptome-wide associate studies (TWAS) (Gamazon et al. 2015; Gusev et al. 2016) using our own unique angle (Cao et al. 2021b; Cao et al. 2022). Here we provide an overview of our methodology, to understand what exactly LMM and TWAS are.

First, in its simplest form, the LMM can be generally expressed as: yx+u+ε, where y is the phenotype, x is the focal genetic variant under consideration, and u, the key component in LMM is the random term, which distinguishes LMM from a simple linear regression. Here, u follows a multivariate normal distribution MVN(0,K), where K, the variance–covariance matrix of the MVN is estimated by the genome-wide variants (Rousset 2002; Powell et al. 2010; Speed and Balding 2015). K was initially called “kinship matrix” when pedigree data were used, and then called GRM when using seemingly unrelated individuals in a typical association study. Both names emphasize the accounting for “relatedness” among samples, which is the source of population stratification (Freedman et al. 2004; Kang et al. 2010). Another interpretation of u, which is also popular in the field of genomic selection (Xu 2013; Crossa et al. 2017; Xu et al. 2021), is to call it a “polygenic” term (Xu 2003, 2017; Goddard and Hayes 2007; Xie et al. 2021), which captures the variance component explained by contributing variants (other than the focal variant x) in the genome. In this sense, the interpretation of “relatedness”, or “population structure”, is a rough approximation of the “polygenic model” as the chance of 2 individuals carrying the same (combinations of) functional alleles may be proportional to their relatedness. Although the number of potential causal variants may be large, LMMs enjoy the advantage of not running into overfitting because the GRM is calculated in advance so that genetic variants with small contributions will not be modeled individually. However, LMMs may run into the risk of underfitting as the calculation of GRM evenly incorporates the genetic variants without explicitly considering their functional consequence. In this sense, modeling the GRM more accurately will increase the overall fit, leading to higher power in the association mapping (Fisher 1919; Ober et al. 2001; Visscher and Goddard 2019). Indeed, treating different variants selectively in GRM has been shown to improve performance in various applications, e.g. the LDAK model (Speed et al. 2012; Speed et al. 2017). However, incorporating informative weights in an LMM to better model its null distribution has not been explored yet. Specifically, although it is generally believed that the use of a polygenic term modeled by GRM could be equivalent to a form of Bayesian feature selection (Sorensen and Gianola 2002; Zhou et al. 2013) (a mathematical elaboration is enclosed in Supplementary Notes 1), there is no such efforts to improve the null distribution modeling in LMM using functional a priori.

Second, we articulate our alternative understanding of what is TWAS (recently published in Cao et al. (2021a, 2021b, 2022)). The mainstream format of TWAS is a 2-step protocol: first, one forms an expression prediction model using (usually cis) genotype: eΣβixi(REF)+ε in a reference dataset (e.g. GTEx (Lonsdale et al. 2013; Carithers and Moore 2015)), where e is the expression vector of the focal gene (subscript specific for the gene omitted for simplicity), xi(REF) are genotypes in the reference dataset, and βi are the weights to be trained. The predicted expression is called Genetically Regulated eXpression, or GReX. Then, in the second step, in the main dataset for association mapping (that does not contain expression values), one can predict expressions using the genotype: e^=Σβixi(GWAS), where e^ is the predicted expression βi are the previously trained weights, and xi(GWAS) are the corresponding genotype in the GWAS dataset. Then using the predicted expression, one will conduct association mapping ye^. Our recent works have shown that this standard interpretation of “predicting expressions” may need to be revisited. First, theoretical power analysis showed that TWAS could be more powerful than the hypothetical scenario in which expression data are available in the main GWAS dataset (Cao et al. 2021a); additionally, TWAS could be underpowered compared with GWAS when the expression heritability is low (Cao et al. 2021a). Both results question the interpretation of the prediction of expression in TWAS. As such, we proposed to interpret the “prediction” step as a selection of genetic variants directed by expression (Cao et al. 2022). From the perspective of Machine Learning, the first step in TWAS is exactly feature selection and the second being feature aggregation. With this interpretation in mind, one may cancel GReX and instead conduct feature selection and aggregation independently. Indeed, novel methods splitting these 2 steps developed by us (Cao et al. 2021b; Cao et al. 2022) and others (Tang et al. 2021) showed higher power than standard TWAS.

Based on our understanding of LMM and alternative interpretation of TWAS, we developed expression-directed linear mixed model, or edLMM, a synergy between LMM and TWAS, to model the polygenic term (based on a weighted GRM) using expression-based feature selection. In both real data and simulations, we showed that edLMM is more powerful than EMMAX (Kang et al. 2010), a flagship LMM used by many researchers, and it is particularly sensible when the effect size is low. Additionally, it is demonstrated that the explainable proportion of heritability is substantially improved when the low-effect variants discovered by edLMM are incorporated.

Materials and methods

Overview of the edLMM model

edLMM is structured into 2 steps: first, like a typical TWAS protocol, BSLMM (Zhou et al. 2013) is employed to train a cis genotype-based expression prediction model using a reference panel such as GTEx (Lonsdale et al. 2013; Carithers and Moore 2015) (Fig. 1a). Based on our alternative interpretation of TWAS, we do not have to use the predicted expression (GReX) in the next step; instead, the first step is used only for feature selection. We then pull together all the selected genetic variants (from all individual gene prediction models) and their BSLMM-coefficients to form the GRM. Instead of the traditional GRM formation using the product of a centralized genotype matrix and its transpose, we incorporate BSLMM-trained coefficients as functional weights by inserting a diagonal matrix to rescale the product (Fig. 1b). This GRM weighted by expression-relevance will be then used in a standard LMM (Fig. 1c). Note that, the use of weighted GRM is equivalent to adding functional a priori in a Bayesian setting (Supplementary Notes 1). The edLMM protocol enjoys higher power for discovering low-effect variants due to its more accurate estimation under the null model: when the effect size is small, the subtle distance between the actual and the modeled null distributions may be larger than the effect size, masking the genuine signal to be captured (Fig. 1d). In contrast, by better modeling the null using the weighted GRM, edLMM effectively amplified the statistical signal to alleviate this difficulty (Fig. 1e). As such, the same effect size in the fixed term (i.e. the regression coefficient) will be assessed to a more significant P-value in edLMM, increasing statistical power.

Fig. 1.

Fig. 1.

The edLMM protocol. a) The BSLMM tool (Bayesian sparse linear mixed model) is used to train the weights of each genetic variant, using genotype-expression data from a reference panel such as GTEx. b) The calculated weights are used to compute the weighted GRM in a GWAS dataset. c) The weighted GRM is used as the variance–covariance matrix of the multivariate normal distribution in an LMM for the GWAS. Panels d) and e) show the null and alternative distributions of genetic variants under EMMAX (the standard LMM) and edLMM protocol. The newly formed null distribution in panel e) is more accurate, signifying the statistical effect of the focal variant (red arrows).

Mathematical formulation of the standard LMM and edLMM

The standard linear mixed model in genotype–phenotype association mapping is defined as:

yx+u+ε,
uMVNn(0,σg2K),
εMVNn(0,σe2In),

where n is the number of individuals, y is an n×1 vector of quantitative or qualitative phenotypes, x is an n×1 vector of genotype, u is the random effect term with var(u)=σg2K where K is the n×n genetic relationship matrix (GRM), ε is an n×1 vector of residual effects such that var(ε)=σe2In where In is an n×n identity matrix, MVNn denotes multivariate normal distribution with n dimension. The overall phenotypic variance–covariance matrix can be expressed as V=σg2K+σe2In (Fig. 1c).

In standard LMM (Kang et al. 2008b; Kang et al. 2010), the GRM K was calculated without assuming any functional weights, e.g. using the realized relationship matrix (RRM):

K=1pXXT,

where X is the n×p standardized genotype matrix, and p is the number of genetic variants.

In edLMM, we calculated K by weighting the genetic variants according to their contributions to gene expressions. Specifically, the weighted realized relationship matrix (w-RRM) is calculated as (Fig. 1b):

K=1pXwwTXT,

where w is a p×1 vector of weights estimated from BSLMM, a Bayesian sparse linear mixed model that predicts expressions based on genotype (Zhou et al. 2013) (Fig. 1a):

z1mμ+Xw+ν+ϵ,
wπN(0,σa2+σb2)+(1π)N(0,σb2).

Here, z is a vector of gene expressions measured on m individuals, m is the number of individuals in genotype-expression dataset, 1m is an m×1 vector of 1s, μ is a scalar representing the phenotype mean, X is an m×p matrix of genotypes measured on the same individuals at p genetic variants, w is the corresponding effects of variants which come from a mixture of 2 normal distributions, ν is the random effect term, and ϵ is the error term. In the mixture normal distribution, π controls the proportion of w values that are nonzero. σa2 controls the expected magnitude of the nonzero w. σb2 controls the expected magnitude of the random effects ν. The BSLMM model involves μ, π, σa2, and σb2 as hyperparameters and specifies their prior distributions. w can be estimated using Markov chain Monte Carlo (MCMC) given the observed data. In our analysis, all the parameters in BSLMM are based on its default recommendation (Zhou et al. 2013) and the detailed commands we used are listed in Supplementary Notes 3.

As linear model without controlling population structure has also been used frequently in practice, we also including a linear simple regression in the comparison:

yx+ε.

GWAS and gene expression datasets, genotype imputation, and quality controls

To test the performance of edLMM in real data, 2 representative GWAS datasets with around 5,000 samples for binary and quantitative traits, namely WTCCC (The Wellcome Trust Case Control Consortium) (Wellcome Trust Case Control 2007) and NFBC1966 (Northern Finland Birth Cohort 1966) (Sabatti et al. 2009) were used. WTCCC contains ∼2,000 individuals for each of the 7 diseases and a shared set of ∼3,000 controls: type 2 diabetes (T2D), type 1 diabetes (T1D), hypertension (HT), rheumatoid arthritis (RA), coronary artery disease (CAD), Crohn's disease (CD), and bipolar disorder (BD) (Supplementary Table 1). We removed genetic variants with a minor allele frequency (MAF) ≤ 1%, resulting in a pruned set of 392,937 variants carried forward in our analysis. NFBC1966 contains quantitative metabolic traits: triglycerides, high-density lipoprotein (HDL), low-density lipoprotein (LDL), glucose, insulin, C-reactive protein, body mass index, and systolic and diastolic blood pressure (Supplementary Table 2). A full set of 299,681 variants remained after removing variants with MAF ≤ 1%, along with 5,244 individuals. A variant will be reported significant if it reaches the Bonferroni-corrected significance, i.e. 0.05 divided by number of genome-wide genetic variants in the dataset.

Replication studies for significant variants were conducted using 2 independent dbGaP datasets for T1D (Pezzolesi et al. 2009) (dbGaP: phs000018.v2.p1; N = 1,792, #SNP = 364,292) and T2D (phs000237.v1.p1, N = 1,384, #SNP = 1,199,187), respectively. The genomic build for these 2 datasets was hg18 causing few overlapped SNPs with WTCCC after they were lifted over to hg38. Therefore, we conducted imputation to increase SNPs number in a secure manner. Specifically, genotype imputation was conducted using a pipeline involving SHAPEIT4 (Delaneau et al. 2019) for pre-phasing and IMPUTE5 (Rubinacci et al. 2020). Datasets aligned to previous versions of the human genome were lifted over to hg38 using CrossMap (Zhao et al. 2014) and SHAPEIT4 was used with the included maps for hg38 to produce pre-phased VCFs of each chromosome with no missing genotypes. After that, IMPUTE5 was used to output chunked VCFs each containing imputed genotypes using data from the 1000 Genome Project resulting in over 73 million SNPs for both final datasets. Finally, imputed SNPs with the same genomic coordinates as WTCCC (392,937) SNPs were selected for downstream EMMAX and edLMM analyses (390,838 for T1D and 389,577 for T2D). A genetic variant is reported as a being successfully replicated if it meets 2 criteria: (1) its P-value is lower than 0.05 divided by number of significant SNPs in corresponding WTCCC disease dataset; and (2) it locates within 500 K base pair distance to the WTCCC significant variant to be replicated.

Gene expression data generated by RNA-seq were obtained from the Genotype-Tissue Expression Project version 8 (GTEx v8) (GTEx Consortium 2015). As the sample size and data quality vary among different GTEx tissues, the number of genes for each tissue also varies. Here we used whole-blood sample (N = 670) as it is the most representative general dataset to use when the disease-causing tissue is unclear. The target tissue has a total of 20,315 genes. As the focus of edLMM is to collect a large number of genetic variants to form the GRM, in contrast to make an accurate gene prediction model that conventional TWAS do, we did not conduct filtering of genes and used all 20,315 genes. The covariates, including genotyping principal components (PCs), were obtained from GTEx_Analysis_v8_eQTL_covariates.tar.gz, which was downloaded from the GTEx portal. For each gene, the gene expression was adjusted for top 5 genotyping PCs, age, sex, sequencing platform, PCR protocol, and 60 peer confounding factors using PEER, following the GTEx best practices (Stegle et al. 2012).The BSLMM training used GTEx genotype within 1 Mb flanking region of the focal genes.

Simulation study

We conducted simulations to estimate power and Type-I Error of edLMM, in contrast to EMMAX. The genotype data of NFBC (N = 5,244) and whole-blood gene expression data of GTEx (v8, N = 670) were used to simulate phenotypic values. Quantitative traits and binary traits were simulated separately, which are described below.

Quantitative phenotypes were simulated based on an “additive” architecture, combing the contributions from a focal variant, polygenic background, and a noise. During each round of the simulation, we first simulated 1 variant as the target to be tested by the models (i.e. EMMAX and edLMM). Then we randomly sampled 2,500 genes from a total of 20,315 in GTEx v8. For each gene, 4-cis genetic variants were randomly selected. These 2,500 × 5 = 10,000 variants will form the contribution of polygenic term. The noise was simulated by rescaling the phenotypic variance components to ensure that the phenotypes have different prespecified heritability. Finally, phenotypes were calculated by adding the contribution from the focal variant, weighted sum of 10,000 variants from GTEx v8 where weights were from the pretrained BSLMM model (polygenic background), and the noise. We repeated simulations 1,000 times for both quantitative and qualitative phenotypes.

Mathematically, the model is given by:

y=ax+bZ+cε,

where x represents the focal variant, Z=Σi=110,000Σj=12(wij+δij)zij is the weighted sum of 10,000 variants with wijs prespecified by the BSLMM model, δij is a noise added to gene expression effect sizes that is sample from a standard normal distribution δijN(0,1), and ε is the noise sampling from a standard normal distribution (εN(0,1)). The coefficients a,b,c are calculated as follows. Note that although only selected variants contribute to the phenotype in simulations, in the analysis of simulated data, the same protocols analyzing real data are used. That is, we utilized all genome-wide genetic variants to calculate the GRM for LMM (i.e. EMMAX) and all variants that have a real weight to calculate the GRM for edLMM.

The overall heritability (i.e. variance components contributed by all variants, x and Z) is fixed to h12=0.5. The variance component of the focal variant, h22, is set to be 9 different quantities: 0.001, 0.002, 0.003, 0.004, 0.005, 0.006, 0.007, 0.01, and 0.015. The variance component of noise term is 1h12. To ensure var(ax+bZ)/var(y)=h12 and var(ax)/var(y)=h22, we calculate a,b,c as follows. We first set a=1. Given var(ε)=1, b,c can be solved using the equations below:

var(ax+bZ)var(y)=h12,
var(ax)var(y)=h22.

Solving the above equations yields:

b=h12h22h22×var(x)var(Z),
c=var(x)(1h12)h22.

The model for simulating binary traits largely reused the above model for quantitative traits. We first simulated a quantitative trait, and then used a liability cutoff to divide data into cases and controls, keeping the ratio between cases and controls roughly 2:3, mirroring the WTCCC dataset for which we have conducted the real data analysis.

Type-I error estimation and power calculations

The Type-I error of each protocol was estimated around α=0.05. To achieve this, for the binary case, we randomly assigned phenotypes (1 for a case and 2 for a control) to 5,244 subjects to empirically determine the null distribution of each protocol. In the quantitative case, we randomly assigned phenotypes by drawing samples from a standard normal distribution. Then we ranked P-values calculated for each variant and took the top 5% as the cutoff. If this cutoff is around 0.05, then it indicates the Type-I error is under control.

For the above genetic architecture and its associated parameters, 1,000 datasets containing simulated phenotypes were created. In each simulated dataset, we tested the 2 protocols’ (EMMAX and edLMM) ability to successfully identify the causal variant simulated from GWAS dataset. Success is defined as the Bonferroni-corrected P-value of this variant that is lower than a predetermined critical value (0.05). We then calculated the power of each protocol as the number of successes divided by the number of rounds of simulation (1,000).

Protocols for association mapping using edLMM and EMMAX

First, BSLMM (Zhou et al. 2013), a function provided by the GEMMA toolkit (Zhou and Stephens 2012), was applied to the GTEx (v8) dataset to train the weights of each cis genetic variant (in the surrounding 1 Mb flanking region). Using the default setting of GEMMA, the variant effect w can be decomposed into 2 parts: α that captures the minor effects that all variants have, and β that captures the additional effects of some large effect variants. The total effect size for a given genetic variant i is then wi=αi+βi. Based on the formulas described in Mathematical formulation of the standard LMM and edLMM above, we implemented a new function, “kinshipwe”, in our tool Jawamix5 (Xiong et al. 2019) (Long et al. 2013) to calculate the weighted GRM. An existing function, “kinship”, is used to calculate the standard (unweighted) GRM. Then, taking the above 2 different versions of GRM as alternative input (representing edLMM and EMMAX, respectively), the “emmax” function of Jawamix5 was used to conduct the GWAS for both WTCCC and NFBC.

SNPs annotated to genes

Significant SNPs identified by edLMM and EMMAX were annotated to genes using cS2G (combining SNP-to-gene) (Gazal et al. 2022), which utilized multiple combined SNP-to-gene strategies and trained on various diseases.

Genes’ functional validations

To verify functional relevance of statistical results out of the association mapping, we performed gene-disease association analysis using the DisGeNET repository (v7.0) (Pinero et al. 2015; Pinero et al. 2017), a platform containing 1,134,942 gene-disease associations (GDAs), between 21,671 genes and 30,170 traits. For each disease, DisGeNET has its specific summary of gene-disease associations containing information of identified significant genes and their gene-disease associations scores (GDAs). This summary of associations can be searched using the CUI number of each disease. The CUI numbers of 7 diseases in the WTCCC are listed Supplementary Table 3. The gene model file used in this study was downloaded from the GTEx website (https://www.gtexportal.org/home/datasets/gencode.v26.GRCh38.genes.gtf).

Heritability estimation

We aim to quantify the degree to which the low-effect variants bring incremental heritability. We estimated SNP heritability identified by both protocols in the WTCCC and NFBC datasets using GCTA (Yang et al. 2011) (Genome-wide Complex Trait Analysis), a software package developed to estimate the proportion of phenotypic variance explained by all genome-wide variants. In particular, GCTA-GRM was used to calculate the genetic relationship matrix (GRM) from the associated variants and GCTA-GREML was used to estimate the variance explained by them.

Results

Simulations confirm the higher power of edLMM at the low-effect size end of the genetic effects spectrum

We conducted simulations to test the statistical power of edLMM against EMMAX (Kang et al. 2010), based on both quantitative and binary traits (Materials and methods). First, the type-I errors in both binary and quantitative datasets were around 0.05, which are well controlled (Supplementary Table 3). The power comparisons showed that, for both quantitative traits and binary traits (=case/control), edLMM outperforms EMMAX by a substantial margin when the variance component explained by the focal genetic variant is low, i.e. less than 0.01 (Fig. 2a, and 2b; Supplementary Table 4). In contrast, when the variance component is high, e.g. above 0.01, there is no clear distinction between the 2 protocols. This supports our assertion that by fitting random terms using weighted GRM in LMMs, the statistical power is increased for variants with low-effect sizes. Additionally, edLMM also outperforms linear models in a similar pattern.

Fig. 2.

Fig. 2.

Power comparison between edLMM, EMMAX, and linear model based using simulations. a) Binary and b) quantitative traits. The variance component explained by the focal variant, h22, which is depicted in the x-axis, is set to 9 different quantities: 0.001, 0.002, 0.003, 0.004, 0.005, 0.006, 0.007, 0.01, and 0.015.

edLMM improves identification of genetic variants in real data

We applied edLMM and EMMAX to 2 frequently used benchmark datasets that have a moderate sample size of around 5,000 (Materials and methods). Excluding the traits for which none of the protocols called any signal (after Bonferroni correction), edLMM identified more significant variants than EMMAX in both WTCCC (Fig. 3; Supplementary Table 5) and NFBC (Supplementary Table 6). In addition, we replicated EMMAX and edLMM significant discoveries in 2 independent datasets for T2D and T1D using other independent GWAS datasets (Materials and methods). Our results showed that edLMM is more robust than EMMAX in both diseases, as edLMM replication rates are 11.97% (14/117) and 10.16% (31/305) in T2D and T1D, respectively, in contrast to EMMAX's replication rates of 10.53% (6/57) and 4.97% (9/181).

Fig. 3.

Fig. 3.

Manhattan plots of significant genetic variants identified by edLMM of the 7 WTCCC diseases. Unique significant SNPs identified by edLMM are colored in green and the unique ones for EMMAX are colored in red. Red vertical lines stand for the cutoff of Bonferroni correction.

edLMM discovered additional genetic variants with low-effect sizes

We further analyzed whether edLMM's superior performance over EMMAX is indeed due to the advantage of detecting low-effect variants (that is indicated by our design intuition and simulations). To this end, the distribution of effect sizes (=absolute values of regression coefficient of the genetic variant) of significant signals identified by EMMAX and edLMM were visualized for WTCCC (Fig. 4). It is observed that edLMM outperforms EMMAX across the spectrum, and the advantage for low-effect variants (≤0.05) are particularly pronounced.

Fig. 4.

Fig. 4.

Distribution of effect sizes of significant genetic variants identified by edLMM and EMMAX of the 7 WTCCC diseases. Variants identified by both EMMAX and edLMM (yellow), uniquely identified by edLMM (green), and uniquely identified by EMMAX (grey) are depicted.

edLMM-identified variants are functionally relevant

To check the functional relevance of the statistical results, we first mapped each significant variant to genes using cS2G, a state-of-the-art tool to map genetic variants to genes based on various factors (Gazal et al. 2022). Then we queried the role of genes in the related disease in DisGeNET, an established repository of validated genes associated with diseases (Pinero et al. 2017) (Materials and Methods; Supplementary Table 7). We compared edLMM and EMMAX for each disease on their numbers of validated and the rate of validated (number of validated divided by the total number of discovered). Additionally, by removing the variants that were discovered by both methods, we also calculated the number of unique validated as well as unique validated rate. We find that edLMM outperformed EMMAX in all the 7 WTCCC diseases in terms of numbers of validated, although validated rates are slightly lower (Fig. 5; Supplementary Table 7). However, the edLMM validated rates are still good considering the substantially larger number of candidates and their low-effect sizes, as well as the fact that low-effect variants may be less studied and deposited in the databases. We then also calculated the numbers of background genes and assessed the significant level of enrichments using a hypergeometric test (Supplementary Table 21). Evidently, among all 7 diseases, 4 show significance (T1D, HT, RA, and CD) and edLMM outcomes enjoy lower P-values in 3 of them (T1D, HT, and CD). All significant genes identified by any of the 2 protocols for the 7 diseases of WTCCC, along with their P-values, effect size, and MAF, are in Supplementary Tables 8–14. For NFBC dataset, we focused on 2 traits: high-density lipoprotein (HDL) and low-density lipoprotein (LDL), as other traits the number of significant variants identified by EMMAX and edLMM is relatively small and not much difference to analyze. Following the same protocol, we mapped significant variants identified by both EMMAX and edLMM to genes and reported these genes’ DisGeNET validations as well (Supplementary Tables 15 and 16).

Fig. 5.

Fig. 5.

The DisGeNET functional annotations of signals identified by edLMM and EMMAX in the 7 WTCCC diseases. a) Number of validated genes (validated) reported in the DisGeNET database. b) Number of validated genes (unique validated) reported exclusively by 1 of the 2 protocols in the DisGeNET database. c) Proportion (validated rate) of all reported genes which are validated in the DisGeNET database. d) Proportion (unique validated rate) of genes reported exclusively by 1 of the 2 protocols in the DisGeNET database. The number of background genes and quantitative assessment of significant levels using a hypergeometric test are listed in Supplementary Table 21.

Literature support of genes uniquely identified by edLMM

We conducted a literature search on the genes that were discovered by edLMM but not EMMAX. For WTCCC, we focused on T1D and RA, which contain most significant genes among 7 diseases. For each disease, we selected 2 edLMM uniquely discovered genes based on their DisGeNET scores, as well as 2 genes that are not reported by DisGeNET but with large cS2G scores. For NFBC, only 1 gene from HDL was annotated by cS2G but has not been reported by DisGeNET. The gene names and associated P-values and scores are presented in (Supplementary Tables 17). We were able to identify multiple lines of evidence in literature for these genes (Supplementary Notes 2).

Inclusion of edLMM-identified genetic variants leads to higher explainable SNP heritability

To quantify the incremental benefits brought by low-effect variants, we collected variants identified by edLMM and EMMAX to estimate SNP heritability using GCTA, a standard tool for heritability estimation (Yang et al. 2011) (Materials and methods). In the WTCCC dataset, the SNP heritability explained by associated variants increased substantially, ranging from 6.76% (T1D) to 147.61% (HT), with the median being 73.29% (T2D) (Fig. 6; Supplementary Table 18). In the NFBC dataset, SNPs’ heritability increases of 24.44% for HDL and 1.35% for LDL were observed (Supplementary Table 19).

Fig. 6.

Fig. 6.

The proportion of phenotypic variance explained by significant genetic variants (Vg/Vp) identified by edLMM and EMMAX in the WTCCC dataset.

Discussion

Based on our interpretation of LMM and unique understanding of TWAS (Cao et al. 2021b; Cao et al. 2022), we proposed edLMM, a new model that borrows a variant-weighting protocol to refine the modeling of the polygenic term (represented by the weighted GRM) in an LMM model. This design has been shown to bring a substantial power gain when detecting low-effect genetic variants in association studies with moderate samples (i.e. WTCCC and NFBC that have around 5,000 individuals), leading to improved SNP-heritability estimations. This work further suggests that, by skipping the use of GReX, our disentangling of genetic feature selection and aggregation in transcriptome-wide association studies (Cao et al. 2022) may open a door for novel models integrating multi-scale -omics to be developed.

This work has been only focusing on expression data within a single tissue. We do not have a systematic way to decide which tissue to use to differentiate diseases. To investigate the limitation of tissue specificity, we applied edLMM to re-analyze the 7 WTCCC diseases with 14 GTEx tissues that have reasonably large sample sizes, namely adipose, artery aorta, artery coronary, brain cerebellar hemisphere, brain cerebellum, brain cortex, brain hippocampus, brain hypothalamus, colon sigmoid, colon transverse, heart atrial appendage, heart left ventricle, small intestine, and stomach. Some of these tissues may be matched to the related disease better than others. Interestingly, it is observed that tissue-specific edLMM identified a similar number of significant variants comparable to the results from whole blood, with an average overlap of 88% significant variants between whole blood edLMM and tissue-specific edLMM (Supplementary Fig. 1 and Supplementary Table 20). Moreover, the utilization of potentially matched tissue expression does not reveal many additional variants compared to whole blood, which can be attributed to the relatively limited sample size of primary tissues in comparison. These observations indicate that while whole-blood gene expression might provide an aggregation of individuals’ gene expression landscape and could be a reasonable choice if matched tissue for the focal disease is not known or unavailable.

The GCTA estimates of heritability may be sensitive to the structure of genetic relatedness matrix (Kumar et al. 2016). Additionally, under infinitesimal model, more genetic variants naturally tend to explain more heritability. To address these 2 potential problems, we analyzed the per-SNP heritability using LDAK (Speed et al. 2012; Speed et al. 2017), in particular its BLD-LDAK variation that is specifically recommended for human traits (Speed et al. 2020) (Supplementary Note 3). By applying LDAK to the WTCCC diseases, using the same number of top significant variants, we observed that, for most of diseases, per-SNP heritability of variants identified by edLMM is higher than (in T2D, T1D, CAD) or comparable to (in RA, HT, BD) the per-SNP heritability of variants identified by EMMAX, with the exception of CD in which EMMAX-identified is higher (Supplementary Fig. 2). This provides another line of evidence that edLMM indeed identified genetics variants explaining more heritability.

Although the main motivation of the work is to show edLMM's advantage in small samples, to test its performance in larger samples, we applied edLMM to another T1D study with #cases = 5,220 and #controls = 5,571 (Barrett et al. 2009) (dbGaP: phs000911.v1.p1, #SNP = 164,643). The detailed outcomes are listed in Supplementary Table 23. Our analysis reveals that edLMM surpassed EMMAX in detecting significant variants (edLMM = 1,447 vs EMMAX = 694). Notably, nearly all the results from EMMAX (98%, or 680 out of 694) were included in the results of edLMM. Of the significant variants pinpointed by edLMM, approximately 19.9% (or 288 out of 1,447) are characterized as low-effect variants. Intriguingly, 91.3% of these low-effect variants (263 out of 288) were not previously detected by EMMAX. This additional analysis shows that the superiority of edLMM is also present when the sample size increases.

LMM has been shown to be robust to model misspecification (Jiang et al. 2014). To test whether edLMM indeed improves power on top of such property, we carried out simulations by generating phenotype directly from yx+u+ε, where uMVNn(0,σg2K),εMVNn(0,σe2In), and K=1pXwwTXT and then analyzed the data using mis-specified model (in which K=1pXXT) or matched model (in which K=1pXwwTXT) that has a more accurate polygenic term (i.e. GRM estimation). Indeed, we find that the matched model, i.e. edLMM, has higher power consistently across all the spectrum of heritability (Supplementary Table 22). Therefore, we suggest that, although LMM models are robust to the model misspecification, the accurate estimation of polygenetic terms brings more power to detect low-effect variants.

To check whether the gain of power of edLMM is indeed brought by the functional weight, instead of excluding variants not near a gene, we performed a variation of EMMAX analysis, namely cis-EMMAX, which uses a GRM calculated by the genetic variants in cis regions (which are 336,856 out of 392,937 variants, accounting for 85% SNPs in the WTCCC genotype dataset) without including the expression-directed weights. We observed that cis-EMMAX identified a small proportion of edLMM-identified variants, ranging from 3 to 53% for different diseases (Supplementary Fig. 3). These results indicate that simply including cis-region variants does not enhance the power of LMMs, and the integrating functional weights may be the genuine source of the power gain.

Natural extensions include developing refined protocols using multi-tissue-based variants selection (Hu et al. 2019) and the inclusion of other functional -omics data such as DNA methylation (Wu et al. 2020) and CHIP-Seq (Sun et al. 2022). The flexibility of the edLMM framework enables straightforward incorporation of extra -omics data and multiple tissue-based feature selections, which are our future work. Another extension is to use GLMM to handle binary traits, although it has been pointed out that ordinary LMM is also generally appropriate (Kang et al. 2010).

Although there are many interpretations for “missing heritability” including rare variants (Zuk et al. 2012) and nonlinear interactions (Zuk et al. 2014), here we chose to focus on genetic variants with a low-effect size. These alternative interpretations do not have to be mutually exclusive. Indeed, although our additional low-effect variants can explain substantial missing heritability, still most of the heritability is left unexplained. However, our contribution from the perspective of the linear contribution of low-effect variants may be the most straightforward to implement in practical genetic tests using PRS, as it incorporates common variants in a simple way.

Supplementary Material

iyae018_Supplementary_Data

Contributor Information

Qing Li, Department of Biochemistry & Molecular Biology, University of Calgary, Calgary T2N 1N4, Canada.

Jiayi Bian, Department of Mathematics and Statistics, University of Calgary, Calgary T2N 1N4, Canada.

Yanzhao Qian, Department of Mathematics and Statistics, University of Calgary, Calgary T2N 1N4, Canada.

Pathum Kossinna, Department of Biochemistry & Molecular Biology, University of Calgary, Calgary T2N 1N4, Canada.

Cooper Gau, Department of Mathematics and Statistics, University of Calgary, Calgary T2N 1N4, Canada.

Paul M K Gordon, Alberta Children's Hospital Research Institute, University of Calgary, Calgary T2N 1N4, Canada.

Xiang Zhou, School of Public Health, University of Michigan, Ann Arbor 48109, USA.

Xingyi Guo, Department of Medicine & Biomedical Informatics, Vanderbilt University Medical Center, Nashville 37203, USA.

Jun Yan, Physiology and Pharmacology, University of Calgary, Calgary T2N 1N4, Canada; Hotchkiss Brain Institute, University of Calgary, Calgary T2N 1N4, Canada.

Jingjing Wu, Department of Mathematics and Statistics, University of Calgary, Calgary T2N 1N4, Canada.

Quan Long, Department of Biochemistry & Molecular Biology, University of Calgary, Calgary T2N 1N4, Canada; Department of Mathematics and Statistics, University of Calgary, Calgary T2N 1N4, Canada; Alberta Children's Hospital Research Institute, University of Calgary, Calgary T2N 1N4, Canada; Hotchkiss Brain Institute, University of Calgary, Calgary T2N 1N4, Canada; Department of Medical Genetics, University of Calgary, Calgary T2N 1N4, Canada.

Data availability

GTEx gene expression: https://gtexportal.org/home/datasets. GTEx whole genome sequencing data: https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000424.v9.p2. WTCCC genotype: https://www.wtccc.org.uk/. NFBC genotype: https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000276.v2.p1. T1D and T2D dataset used in replication: https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000018.v2.p1, https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000911.v1.p1 and https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000237.v1.p1. The gene model file: https://www.gtexportal.org/home/datasets/gencode.v26.GRCh38.genes.gtf. edLMM is a function in Jawamix5, which is publicly available at: https://github.com/theLongLab/Jawamix5. cS2G code is publicly available at: https://alkesgroup.broadinstitute.org/cS2G/code.

Supplemental material available at GENETICS online.

Funding

New Frontiers in Research Fund grant (NFRFE-2018-00748) (QuL). Alberta Innovates LevMax-Health Program Bridge Funds 222300769 (QuL). Canada Foundation for Innovation 36605 (QuL). NSERC Discovery Grant (RGPIN-2018-04328 JW; RGPIN-2017-04860: QuL). Campbell McLaurin Chair for Hearing Deficiencies (JY). Alberta Innovates Graduate Student Scholarships (JB).

Author contributions

Conceived the project: JW and QuL; supervised the study: JW and QuL; designed statistical model and derived mathematical formulations: JB & QuL; implemented computational tool: QiL, JB and YQ; conducted analyses: QiL, JB, PK; provided advice and tools: PG, XZ, XG, and JY; provided funding support: JY, JW, QuL; wrote the manuscript: QiL, JB, and QuL, with contributions from all coauthors.

Literature cited

  1. Barrett  JC, Clayton  DG, Concannon  P, Akolkar  B, Cooper  JD, Erlich  HA, Julier  C, Morahan  G, Nerup  J, Nierras  C, et al.  2009. Genome-wide association study and meta-analysis find that over 40 loci affect risk of type 1 diabetes. Nat Genet. 41(6):703–707. doi: 10.1038/ng.381. [DOI] [PMC free article] [PubMed] [Google Scholar]
  2. Cao  C, Ding  B, Li  Q, Kwok  D, Wu  J, Long  Q. 2021a. Power analysis of transcriptome-wide association study: implications for practical protocol choice. PLoS Genet. 17(2):e1009405. doi: 10.1371/journal.pgen.1009405. [DOI] [PMC free article] [PubMed] [Google Scholar]
  3. Cao  C, Kossinna  P, Kwok  D, Li  Q, He  J, Su  L, Guo  X, Zhang  Q, Long  Q. 2022. Disentangling genetic feature selection and aggregation in transcriptome-wide association studies. Genetics. 220(2):iyab216. doi: 10.1093/genetics/iyab216. [DOI] [PMC free article] [PubMed] [Google Scholar]
  4. Cao  C, Kwok  D, Edie  S, Li  Q, Ding  B, Kossinna  P, Campbell  S, Wu  J, Greenberg  M, Long  Q. 2021b. kTWAS: integrating kernel machine with transcriptome-wide association studies improves statistical power and reveals novel genes. Brief Bioinform. 22(4):bbaa270. doi: 10.1093/bib/bbaa270. [DOI] [PubMed] [Google Scholar]
  5. Carithers  LJ, Moore  HM. 2015. The genotype-tissue expression (GTEx) project. Biopreserv Biobank. 13(5):307–308. doi: 10.1089/bio.2015.29031.hmm. [DOI] [PMC free article] [PubMed] [Google Scholar]
  6. Choi  SW, Mak  TSH, O'Reilly  PF. 2020. Tutorial: a guide to performing polygenic risk score analyses. Nat Protoc. 15(9):2759–2772. doi: 10.1038/s41596-020-0353-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
  7. Crossa  J, Pérez-Rodríguez  P, Cuevas  J, Montesinos-López  O, Jarquín  D, de los Campos  G, Burgueño  J, González-Camacho  JM, Pérez-Elizalde  S, Beyene  Y, et al.  2017. Genomic selection in plant breeding: methods, models, and perspectives. Trends Plant Sci. 22(11):961–975. doi: 10.1016/j.tplants.2017.08.011. [DOI] [PubMed] [Google Scholar]
  8. Delaneau  O, Zagury  JF, Robinson  MR, Marchini  JL, Dermitzakis  ET. 2019. Accurate, scalable and integrative haplotype estimation. Nat Commun. 10(1):5436. doi: 10.1038/s41467-019-13225-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
  9. Fisher  RA. 1919. XV.—the correlation between relatives on the supposition of Mendelian inheritance. Earth Environ Sci Trans R Soc Edinburgh. 52(2):399–433. doi: 10.1017/S0080456800012163. [DOI] [Google Scholar]
  10. Freedman  ML, Reich  D, Penney  KL, McDonald  GJ, Mignault  AA, Patterson  N, Gabriel  SB, Topol  EJ, Smoller  JW, Pato  CN, et al.  2004. Assessing the impact of population stratification on genetic association studies. Nat Genet. 36(4):388–393. doi: 10.1038/ng1333. [DOI] [PubMed] [Google Scholar]
  11. Gamazon  ER, Wheeler  HE, Shah  KP, Mozaffari  SV, Aquino-Michaels  K, Carroll  RJ, Eyler  AE, Denny  JC, Consortium  GT, Nicolae  DL, et al.  2015. A gene-based association method for mapping traits using reference transcriptome data. Nat Genet. 47(9):1091–1098. doi: 10.1038/ng.3367. [DOI] [PMC free article] [PubMed] [Google Scholar]
  12. Gazal  S, Loh  PR, Finucane  HK, Ganna  A, Schoech  A, Sunyaev  S, Price  AL. 2018. Functional architecture of low-frequency variants highlights strength of negative selection across coding and non-coding annotations. Nat Genet. 50(11):1600–1607. doi: 10.1038/s41588-018-0231-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  13. Gazal  S, Weissbrod  O, Hormozdiari  F, Dey  KK, Nasser  J, Jagadeesh  KA, Weiner  DJ, Shi  H, Fulco  CP, O'Connor  LJ, et al.  2022. Combining SNP-to-gene linking strategies to identify disease genes and assess disease omnigenicity. Nat Genet. 54(6):827–836. doi: 10.1038/s41588-022-01087-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
  14. Goddard  ME, Hayes  BJ. 2007. Genomic selection. J Anim Breed Genet. 124(6):323–330. doi: 10.1111/j.1439-0388.2007.00702.x. [DOI] [PubMed] [Google Scholar]
  15. GTEx Consortium . 2015. Human genomics. The genotype-tissue expression (GTEx) pilot analysis: multitissue gene regulation in humans. Science. 348(6235):648–660. doi: 10.1126/science.1262110. [DOI] [PMC free article] [PubMed] [Google Scholar]
  16. Gusev  A, Ko  A, Shi  H, Bhatia  G, Chung  W, Penninx  BW, Jansen  R, de Geus  EJ, Boomsma  DI, Wright  FA, et al.  2016. Integrative approaches for large-scale transcriptome-wide association studies. Nat Genet. 48(3):245–252. doi: 10.1038/ng.3506. [DOI] [PMC free article] [PubMed] [Google Scholar]
  17. Hu  Y, Li  M, Lu  Q, Weng  H, Wang  J, Zekavat  SM, Yu  Z, Li  B, Gu  J, Muchnik  S, et al.  2019. A statistical framework for cross-tissue transcriptome-wide association analysis. Nat Genet. 51(3):568–576. doi: 10.1038/s41588-019-0345-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  18. Jiang  J, Li  C, Paul  D, Yang  C, Zhao  H.  2014. High-dimensional genome-wide association study and misspecified mixed model analysis. arXiv 1404.2355. 10.48550/arXiv.1404.2355, preprint: not peer reviewed. [DOI]
  19. Kang  HM, Sul  JH, Service  SK, Zaitlen  NA, Kong  SY, Freimer  NB, Sabatti  C, Eskin  E. 2010. Variance component model to account for sample structure in genome-wide association studies. Nat Genet. 42(4):348–354. doi: 10.1038/ng.548. [DOI] [PMC free article] [PubMed] [Google Scholar]
  20. Kang  HM, Ye  C, Eskin  E. 2008a. Accurate discovery of expression quantitative trait loci under confounding from spurious and genuine regulatory hotspots. Genetics. 180(4):1909–1925. doi: 10.1534/genetics.108.094201. [DOI] [PMC free article] [PubMed] [Google Scholar]
  21. Kang  HM, Zaitlen  NA, Wade  CM, Kirby  A, Heckerman  D, Daly  MJ, Eskin  E. 2008b. Efficient control of population structure in model organism association mapping. Genetics. 178(3):1709–1723. doi: 10.1534/genetics.107.080101. [DOI] [PMC free article] [PubMed] [Google Scholar]
  22. Krishna Kumar  S, Feldman  MW, Rehkopf  DH, Tuljapurkar  S. 2016. Limitations of GCTA as a solution to the missing heritability problem. Proc Natl Acad Sci USA. 113(1):E61–E70. doi: 10.1073/pnas.1520109113. [DOI] [PMC free article] [PubMed] [Google Scholar]
  23. Listgarten  J, Kadie  C, Schadt  EE, Heckerman  D. 2010. Correction for hidden confounders in the genetic analysis of gene expression. Proc Natl Acad Sci U S A. 107(38):16465–16470. doi: 10.1073/pnas.1002425107. [DOI] [PMC free article] [PubMed] [Google Scholar]
  24. Long  Q, Zhang  Q, Vilhjalmsson  BJ, Forai  P, Seren  U, Nordborg  M. 2013. JAWAMix5: an out-of-core HDF5-based java implementation of whole-genome association studies using mixed models. Bioinformatics. 29(9):1220–1222. doi: 10.1093/bioinformatics/btt122. [DOI] [PubMed] [Google Scholar]
  25. Lonsdale  J, Thomas  J, Salvatore  M, Phillips  R, Lo  E, Shad  S, Hasz  R, Walters  G, Garcia  F, Young  N, et al.  2013. The genotype-tissue expression (GTEx) project. Nat Genet. 45(6):580–585. doi: 10.1038/ng.2653. [DOI] [PMC free article] [PubMed] [Google Scholar]
  26. Ober  C, Abney  M, McPeek  MS. 2001. The genetic dissection of complex traits in a founder population. Am J Hum Genet. 69(5):1068–1079. doi: 10.1086/324025. [DOI] [PMC free article] [PubMed] [Google Scholar]
  27. O'Connor  LJ, Schoech  AP, Hormozdiari  F, Gazal  S, Patterson  N, Price  AL. 2019. Extreme polygenicity of complex traits is explained by negative selection. Am J Hum Genet. 105(3):456–476. doi: 10.1016/j.ajhg.2019.07.003. [DOI] [PMC free article] [PubMed] [Google Scholar]
  28. Peyrot  WJ, Price  AL. 2021. Identifying loci with different allele frequencies among cases of eight psychiatric disorders using CC-GWAS. Nat Genet. 53(4):445–454. doi: 10.1038/s41588-021-00787-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
  29. Pezzolesi  MG, Poznik  GD, Mychaleckyj  JC, Paterson  AD, Barati  MT, Klein  JB, Ng  DPK, Placha  G, Canani  LH, Bochenski  J, et al.  2009. Genome-wide association scan for diabetic nephropathy susceptibility genes in type 1 diabetes. Diabetes. 58(6):1403–1410. doi: 10.2337/db08-1514. [DOI] [PMC free article] [PubMed] [Google Scholar]
  30. Piñero  J, Bravo  A, Queralt-Rosinach  N, Gutiérrez-Sacristán  A, Deu-Pons  J, Centeno  E, García-García  J, Sanz  F, Furlong  LI. 2017. DisGeNET: a comprehensive platform integrating information on human disease-associated genes and variants. Nucleic Acids Res. 45(D1):D833–D839. doi: 10.1093/nar/gkw943. [DOI] [PMC free article] [PubMed] [Google Scholar]
  31. Pinero  J, Queralt-Rosinach  N, Bravo  A, Deu-Pons  J, Bauer-Mehren  A, Baron  M, Sanz  F, Furlong  LI. 2015. DisGeNET: a discovery platform for the dynamical exploration of human diseases and their genes. Database (Oxford). 2015:bav028. doi: 10.1093/database/bav028. [DOI] [PMC free article] [PubMed] [Google Scholar]
  32. Powell  JE, Visscher  PM, Goddard  ME. 2010. Reconciling the analysis of IBD and IBS in complex trait studies. Nat Rev Genet. 11(11):800–805. doi: 10.1038/nrg2865. [DOI] [PubMed] [Google Scholar]
  33. Price  AL, Zaitlen  NA, Reich  D, Patterson  N. 2010. New approaches to population stratification in genome-wide association studies. Nat Rev Genet. 11(7):459–463. doi: 10.1038/nrg2813. [DOI] [PMC free article] [PubMed] [Google Scholar]
  34. Rousset  F. 2002. Inbreeding and relatedness coefficients: what do they measure?  Heredity (Edinb). 88(5):371–380. doi: 10.1038/sj.hdy.6800065. [DOI] [PubMed] [Google Scholar]
  35. Rubinacci  S, Delaneau  O, Marchini  J. 2020. Genotype imputation using the positional Burrows Wheeler transform. PLoS Genet. 16(11):e1009049. doi: 10.1371/journal.pgen.1009049. [DOI] [PMC free article] [PubMed] [Google Scholar]
  36. Sabatti  C, Service  SK, Hartikainen  AL, Pouta  A, Ripatti  S, Brodsky  J, Jones  CG, Zaitlen  NA, Varilo  T, Kaakinen  M, et al.  2009. Genome-wide association analysis of metabolic traits in a birth cohort from a founder population. Nat Genet. 41(1):35–46. doi: 10.1038/ng.271. [DOI] [PMC free article] [PubMed] [Google Scholar]
  37. Sorensen  D, Gianola  D. 2002. Likelihood, Bayesian and MCMC Methods in Quantitative Genetics. New York: Springer. [Google Scholar]
  38. Speed  D, Balding  DJ. 2015. Relatedness in the post-genomic era: is it still useful?  Nat Rev Genet. 16(1):33–44. doi: 10.1038/nrg3821. [DOI] [PubMed] [Google Scholar]
  39. Speed  D, Cai  N, Consortium  U, Johnson  MR, Nejentsev  S, Balding  DJ. 2017. Reevaluation of SNP heritability in complex human traits. Nat Genet. 49(7):986–992. doi: 10.1038/ng.3865. [DOI] [PMC free article] [PubMed] [Google Scholar]
  40. Speed  D, Hemani  G, Johnson  MR, Balding  DJ. 2012. Improved heritability estimation from genome-wide SNPs. Am J Hum Genet. 91(6):1011–1021. doi: 10.1016/j.ajhg.2012.10.010. [DOI] [PMC free article] [PubMed] [Google Scholar]
  41. Speed  D, Holmes  J, Balding  DJ. 2020. Evaluating and improving heritability models using summary statistics. Nat Genet. 52(4):458–462. doi: 10.1038/s41588-020-0600-y. [DOI] [PubMed] [Google Scholar]
  42. Stegle  O, Parts  L, Piipari  M, Winn  J, Durbin  R. 2012. Using probabilistic estimation of expression residuals (PEER) to obtain increased power and interpretability of gene expression analyses. Nat Protoc. 7(3):500–507. doi: 10.1038/nprot.2011.457. [DOI] [PMC free article] [PubMed] [Google Scholar]
  43. Sun  Q, Liu  Y, Chen  Z, Chua  TS, Schiele  B. 2022. Meta-transfer learning through hard tasks. IEEE Trans Pattern Anal Mach Intell. 44(3):1443–1456. doi: 10.1109/TPAMI.2020.3018506. [DOI] [PubMed] [Google Scholar]
  44. Tang  SZ, Buchman  AS, De Jager  PL, Bennett  DA, Epstein  MP, Yang  JJ. 2021. Novel variance-component TWAS method for studying complex human diseases with applications to Alzheimer's dementia. PLoS Genet.  17(4):e1009482. doi: 10.1371/journal.pgen.1009482. [DOI] [PMC free article] [PubMed] [Google Scholar]
  45. Visscher  PM, Goddard  ME. 2019. From R.A. Fisher's 1918 paper to GWAS a century later. Genetics. 211(4):1125–1130. doi: 10.1534/genetics.118.301594. [DOI] [PMC free article] [PubMed] [Google Scholar]
  46. Wellcome Trust Case Control Consortium . 2007. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 447(7145):661–678. doi: 10.1038/nature05911. [DOI] [PMC free article] [PubMed] [Google Scholar]
  47. Wu  L, Yang  Y, Guo  X, Shu  XO, Cai  Q, Shu  X, Li  B, Tao  R, Wu  C, Nikas  JB, et al.  2020. An integrative multi-omics analysis to identify candidate DNA methylation biomarkers related to prostate cancer risk. Nat Commun. 11(1):3905. doi: 10.1038/s41467-020-17673-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  48. Xie  FJ, Wang  SB, Beavis  WD, Xu  SZ. 2021. Estimation of genetic variance contributed by a quantitative trait locus: correcting the bias associated with significance tests. Genetics. 219(3):iyab115. doi: 10.1093/genetics/iyab115. [DOI] [PMC free article] [PubMed] [Google Scholar]
  49. Xiong  Z, Zhang  Q, Platt  A, Liao  W, Shi  X, de Los Campos  G, Long  Q. 2019. OCMA: fast, memory-efficient factorization of prohibitively large relationship matrices. G3 (Bethesda). 9(1):13–19. doi: 10.1534/g3.118.200908. [DOI] [PMC free article] [PubMed] [Google Scholar]
  50. Xu  SZ. 2003. Estimating polygenic effects using markers of the entire genome. Genetics. 163(2):789–801. doi: 10.1093/genetics/163.2.789. [DOI] [PMC free article] [PubMed] [Google Scholar]
  51. Xu  SZ. 2013. Genetic mapping and genomic selection using recombination breakpoint data. Genetics. 195(3):1103–1115. doi: 10.1534/genetics.113.155309. [DOI] [PMC free article] [PubMed] [Google Scholar]
  52. Xu  SZ. 2017. Predicted residual error sum of squares of mixed models: an application for genomic prediction. G3 (Bethesda). 7(3):895–909. doi: 10.1534/g3.116.038059. [DOI] [PMC free article] [PubMed] [Google Scholar]
  53. Xu  Y, Ma  KX, Zhao  Y, Wang  X, Zhou  K, Yu  GN, Li  C, Li  PC, Yang  ZF, Xu  CW, et al.  2021. Genomic selection: a breakthrough technology in rice breeding. Crop J. 9(3):669–677. doi: 10.1016/j.cj.2021.03.008. [DOI] [Google Scholar]
  54. Yang  J, Lee  SH, Goddard  ME, Visscher  PM. 2011. GCTA: a tool for genome-wide complex trait analysis. Am J Hum Genet. 88(1):76–82. doi: 10.1016/j.ajhg.2010.11.011. [DOI] [PMC free article] [PubMed] [Google Scholar]
  55. Yu  J, Pressoir  G, Briggs  WH, Vroh Bi  I, Yamasaki  M, Doebley  JF, McMullen  MD, Gaut  BS, Nielsen  DM, Holland  JB, et al.  2006. A unified mixed-model method for association mapping that accounts for multiple levels of relatedness. Nat Genet. 38(2):203–208. doi: 10.1038/ng1702. [DOI] [PubMed] [Google Scholar]
  56. Zhang  Z, Ersoz  E, Lai  CQ, Todhunter  RJ, Tiwari  HK, Gore  MA, Bradbury  PJ, Yu  J, Arnett  DK, Ordovas  JM, et al.  2010. Mixed linear model approach adapted for genome-wide association studies. Nat Genet. 42(4):355–360. doi: 10.1038/ng.546. [DOI] [PMC free article] [PubMed] [Google Scholar]
  57. Zhao  H, Sun  ZF, Wang  J, Huang  HJ, Kocher  JP, Wang  LG. 2014. CrossMap: a versatile tool for coordinate conversion between genome assemblies. Bioinformatics. 30(7):1006–1007. doi: 10.1093/bioinformatics/btt730. [DOI] [PMC free article] [PubMed] [Google Scholar]
  58. Zhou  X, Carbonetto  P, Stephens  M. 2013. Polygenic modeling with Bayesian sparse linear mixed models. PLoS Genet. 9(2):e1003264. doi: 10.1371/journal.pgen.1003264. [DOI] [PMC free article] [PubMed] [Google Scholar]
  59. Zhou  X, Stephens  M. 2012. Genome-wide efficient mixed-model analysis for association studies. Nat Genet. 44(7):821–824. doi: 10.1038/ng.2310. [DOI] [PMC free article] [PubMed] [Google Scholar]
  60. Zondervan  KT, Cardon  LR. 2004. The complex interplay among factors that influence allelic association. Nat Rev Genet. 5(2):89–100. doi: 10.1038/nrg1270. [DOI] [PubMed] [Google Scholar]
  61. Zuk  O, Hechter  E, Sunyaev  SR, Lander  ES. 2012. The mystery of missing heritability: genetic interactions create phantom heritability. Proc Natl Acad Sci USA. 109(4):1193–1198. doi: 10.1073/pnas.1119675109. [DOI] [PMC free article] [PubMed] [Google Scholar]
  62. Zuk  O, Schaffner  SF, Samocha  K, Do  R, Hechter  E, Kathiresan  S, Daly  MJ, Neale  BM, Sunyaev  SR, Lander  ES. 2014. Searching for missing heritability: designing rare variant association studies. Proc Natl Acad Sci USA. 111(4):E455–E464. doi: 10.1073/pnas.1322563111. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

iyae018_Supplementary_Data

Data Availability Statement

GTEx gene expression: https://gtexportal.org/home/datasets. GTEx whole genome sequencing data: https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000424.v9.p2. WTCCC genotype: https://www.wtccc.org.uk/. NFBC genotype: https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000276.v2.p1. T1D and T2D dataset used in replication: https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000018.v2.p1, https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000911.v1.p1 and https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000237.v1.p1. The gene model file: https://www.gtexportal.org/home/datasets/gencode.v26.GRCh38.genes.gtf. edLMM is a function in Jawamix5, which is publicly available at: https://github.com/theLongLab/Jawamix5. cS2G code is publicly available at: https://alkesgroup.broadinstitute.org/cS2G/code.

Supplemental material available at GENETICS online.


Articles from Genetics are provided here courtesy of Oxford University Press

RESOURCES