Abstract
Detecting genetic variants with low-effect sizes using a moderate sample size is difficult, hindering downstream efforts to learn pathology and estimating heritability. In this work, by utilizing informative weights learned from training genetically predicted gene expression models, we formed an alternative approach to estimate the polygenic term in a linear mixed model. Our linear mixed model estimates the genetic background by incorporating their relevance to gene expression. Our protocol, expression-directed linear mixed model, enables the discovery of subtle signals of low-effect variants using moderate sample size. By applying expression-directed linear mixed model to cohorts of around 5,000 individuals with either binary (WTCCC) or quantitative (NFBC1966) traits, we demonstrated its power gain at the low-effect end of the genetic etiology spectrum. In aggregate, the additional low-effect variants detected by expression-directed linear mixed model substantially improved estimation of missing heritability. Expression-directed linear mixed model moves precision medicine forward by accurately detecting the contribution of low-effect genetic variants to human diseases.
Keywords: low-effect genetic variants, linear mixed model, gene expression, human diseases
Li et al. propose a new statistical method detecting low-effect genetic variants by adjusting the linear mixed models using weights learned from gene expressions.
Introduction
Discovering genetic variants associated with risk of complex traits is a long-standing theme in the field of genetics. Given complicated genetic background of participating individuals, it is generally difficult to identify common variants that have statistically small effect sizes (e.g. with R2 < 1%) using a moderate sample size (e.g. at the level of thousands) (Zondervan and Cardon 2004). Although their independent contributions are statistically weak in association studies, they may play critical roles in the pathology of individual patients, despite of their seemingly low contribution in the whole population. For instance, the Price lab has articulated that pathologically important variants have to explain low phenotypic variance due to the effect of negative selections (Gazal et al. 2018; O'Connor et al. 2019; Peyrot and Price 2021). Additionally, these low-effect variants might aggregately account for a substantial proportion of the heritability. If the above assumptions are indeed valid, methods discovering low-effect variants using a moderate sample size may uncover risk genetic variants in many applications, including the experimental investigation of pathology, formation of a polygenic risk score (PRS) (Choi et al. 2020), and characterization of the susceptibility genetic architecture of a complex trait.
Linear mixed models (LMMs) serve as primary approaches to discovering genetic variants associated with disease phenotype in association studies (Yu et al. 2006; Kang et al. 2008a; Listgarten et al. 2010; Price et al. 2010; Zhang et al. 2010). This is largely due to the advantage of using a random term modeled by the genetic relationship matrix (GRM), to correct population structure and other confounding effects. In this work, we determined that by accurately modeling this random term in an LMM with the informative weights learned from omics data, (e.g. transcriptome), one can discover risk genetic variants with lower effect size on diseases/traits, leading to more proportion of explainable heritability. From a technical standpoint, we extended an idea from transcriptome-wide associate studies (TWAS) (Gamazon et al. 2015; Gusev et al. 2016) using our own unique angle (Cao et al. 2021b; Cao et al. 2022). Here we provide an overview of our methodology, to understand what exactly LMM and TWAS are.
First, in its simplest form, the LMM can be generally expressed as: where y is the phenotype, x is the focal genetic variant under consideration, and u, the key component in LMM is the random term, which distinguishes LMM from a simple linear regression. Here, u follows a multivariate normal distribution , where , the variance–covariance matrix of the MVN is estimated by the genome-wide variants (Rousset 2002; Powell et al. 2010; Speed and Balding 2015). was initially called “kinship matrix” when pedigree data were used, and then called GRM when using seemingly unrelated individuals in a typical association study. Both names emphasize the accounting for “relatedness” among samples, which is the source of population stratification (Freedman et al. 2004; Kang et al. 2010). Another interpretation of u, which is also popular in the field of genomic selection (Xu 2013; Crossa et al. 2017; Xu et al. 2021), is to call it a “polygenic” term (Xu 2003, 2017; Goddard and Hayes 2007; Xie et al. 2021), which captures the variance component explained by contributing variants (other than the focal variant x) in the genome. In this sense, the interpretation of “relatedness”, or “population structure”, is a rough approximation of the “polygenic model” as the chance of 2 individuals carrying the same (combinations of) functional alleles may be proportional to their relatedness. Although the number of potential causal variants may be large, LMMs enjoy the advantage of not running into overfitting because the GRM is calculated in advance so that genetic variants with small contributions will not be modeled individually. However, LMMs may run into the risk of underfitting as the calculation of GRM evenly incorporates the genetic variants without explicitly considering their functional consequence. In this sense, modeling the GRM more accurately will increase the overall fit, leading to higher power in the association mapping (Fisher 1919; Ober et al. 2001; Visscher and Goddard 2019). Indeed, treating different variants selectively in GRM has been shown to improve performance in various applications, e.g. the LDAK model (Speed et al. 2012; Speed et al. 2017). However, incorporating informative weights in an LMM to better model its null distribution has not been explored yet. Specifically, although it is generally believed that the use of a polygenic term modeled by GRM could be equivalent to a form of Bayesian feature selection (Sorensen and Gianola 2002; Zhou et al. 2013) (a mathematical elaboration is enclosed in Supplementary Notes 1), there is no such efforts to improve the null distribution modeling in LMM using functional a priori.
Second, we articulate our alternative understanding of what is TWAS (recently published in Cao et al. (2021a, 2021b, 2022)). The mainstream format of TWAS is a 2-step protocol: first, one forms an expression prediction model using (usually cis) genotype: in a reference dataset (e.g. GTEx (Lonsdale et al. 2013; Carithers and Moore 2015)), where e is the expression vector of the focal gene (subscript specific for the gene omitted for simplicity), are genotypes in the reference dataset, and are the weights to be trained. The predicted expression is called Genetically Regulated eXpression, or GReX. Then, in the second step, in the main dataset for association mapping (that does not contain expression values), one can predict expressions using the genotype: , where is the predicted expression are the previously trained weights, and are the corresponding genotype in the GWAS dataset. Then using the predicted expression, one will conduct association mapping . Our recent works have shown that this standard interpretation of “predicting expressions” may need to be revisited. First, theoretical power analysis showed that TWAS could be more powerful than the hypothetical scenario in which expression data are available in the main GWAS dataset (Cao et al. 2021a); additionally, TWAS could be underpowered compared with GWAS when the expression heritability is low (Cao et al. 2021a). Both results question the interpretation of the prediction of expression in TWAS. As such, we proposed to interpret the “prediction” step as a selection of genetic variants directed by expression (Cao et al. 2022). From the perspective of Machine Learning, the first step in TWAS is exactly feature selection and the second being feature aggregation. With this interpretation in mind, one may cancel GReX and instead conduct feature selection and aggregation independently. Indeed, novel methods splitting these 2 steps developed by us (Cao et al. 2021b; Cao et al. 2022) and others (Tang et al. 2021) showed higher power than standard TWAS.
Based on our understanding of LMM and alternative interpretation of TWAS, we developed expression-directed linear mixed model, or edLMM, a synergy between LMM and TWAS, to model the polygenic term (based on a weighted GRM) using expression-based feature selection. In both real data and simulations, we showed that edLMM is more powerful than EMMAX (Kang et al. 2010), a flagship LMM used by many researchers, and it is particularly sensible when the effect size is low. Additionally, it is demonstrated that the explainable proportion of heritability is substantially improved when the low-effect variants discovered by edLMM are incorporated.
Materials and methods
Overview of the edLMM model
edLMM is structured into 2 steps: first, like a typical TWAS protocol, BSLMM (Zhou et al. 2013) is employed to train a cis genotype-based expression prediction model using a reference panel such as GTEx (Lonsdale et al. 2013; Carithers and Moore 2015) (Fig. 1a). Based on our alternative interpretation of TWAS, we do not have to use the predicted expression (GReX) in the next step; instead, the first step is used only for feature selection. We then pull together all the selected genetic variants (from all individual gene prediction models) and their BSLMM-coefficients to form the GRM. Instead of the traditional GRM formation using the product of a centralized genotype matrix and its transpose, we incorporate BSLMM-trained coefficients as functional weights by inserting a diagonal matrix to rescale the product (Fig. 1b). This GRM weighted by expression-relevance will be then used in a standard LMM (Fig. 1c). Note that, the use of weighted GRM is equivalent to adding functional a priori in a Bayesian setting (Supplementary Notes 1). The edLMM protocol enjoys higher power for discovering low-effect variants due to its more accurate estimation under the null model: when the effect size is small, the subtle distance between the actual and the modeled null distributions may be larger than the effect size, masking the genuine signal to be captured (Fig. 1d). In contrast, by better modeling the null using the weighted GRM, edLMM effectively amplified the statistical signal to alleviate this difficulty (Fig. 1e). As such, the same effect size in the fixed term (i.e. the regression coefficient) will be assessed to a more significant P-value in edLMM, increasing statistical power.
Mathematical formulation of the standard LMM and edLMM
The standard linear mixed model in genotype–phenotype association mapping is defined as:
where n is the number of individuals, y is an vector of quantitative or qualitative phenotypes, x is an vector of genotype, is the random effect term with where is the genetic relationship matrix (GRM), is an vector of residual effects such that where is an identity matrix, denotes multivariate normal distribution with n dimension. The overall phenotypic variance–covariance matrix can be expressed as (Fig. 1c).
In standard LMM (Kang et al. 2008b; Kang et al. 2010), the GRM was calculated without assuming any functional weights, e.g. using the realized relationship matrix (RRM):
where X is the standardized genotype matrix, and p is the number of genetic variants.
In edLMM, we calculated by weighting the genetic variants according to their contributions to gene expressions. Specifically, the weighted realized relationship matrix (w-RRM) is calculated as (Fig. 1b):
where w is a vector of weights estimated from BSLMM, a Bayesian sparse linear mixed model that predicts expressions based on genotype (Zhou et al. 2013) (Fig. 1a):
Here, z is a vector of gene expressions measured on m individuals, m is the number of individuals in genotype-expression dataset, is an vector of , μ is a scalar representing the phenotype mean, X is an matrix of genotypes measured on the same individuals at p genetic variants, w is the corresponding effects of variants which come from a mixture of 2 normal distributions, ν is the random effect term, and is the error term. In the mixture normal distribution, π controls the proportion of values that are nonzero. controls the expected magnitude of the nonzero w. controls the expected magnitude of the random effects ν. The BSLMM model involves μ, π, , and as hyperparameters and specifies their prior distributions. w can be estimated using Markov chain Monte Carlo (MCMC) given the observed data. In our analysis, all the parameters in BSLMM are based on its default recommendation (Zhou et al. 2013) and the detailed commands we used are listed in Supplementary Notes 3.
As linear model without controlling population structure has also been used frequently in practice, we also including a linear simple regression in the comparison:
GWAS and gene expression datasets, genotype imputation, and quality controls
To test the performance of edLMM in real data, 2 representative GWAS datasets with around 5,000 samples for binary and quantitative traits, namely WTCCC (The Wellcome Trust Case Control Consortium) (Wellcome Trust Case Control 2007) and NFBC1966 (Northern Finland Birth Cohort 1966) (Sabatti et al. 2009) were used. WTCCC contains ∼2,000 individuals for each of the 7 diseases and a shared set of ∼3,000 controls: type 2 diabetes (T2D), type 1 diabetes (T1D), hypertension (HT), rheumatoid arthritis (RA), coronary artery disease (CAD), Crohn's disease (CD), and bipolar disorder (BD) (Supplementary Table 1). We removed genetic variants with a minor allele frequency (MAF) ≤ 1%, resulting in a pruned set of 392,937 variants carried forward in our analysis. NFBC1966 contains quantitative metabolic traits: triglycerides, high-density lipoprotein (HDL), low-density lipoprotein (LDL), glucose, insulin, C-reactive protein, body mass index, and systolic and diastolic blood pressure (Supplementary Table 2). A full set of 299,681 variants remained after removing variants with MAF ≤ 1%, along with 5,244 individuals. A variant will be reported significant if it reaches the Bonferroni-corrected significance, i.e. 0.05 divided by number of genome-wide genetic variants in the dataset.
Replication studies for significant variants were conducted using 2 independent dbGaP datasets for T1D (Pezzolesi et al. 2009) (dbGaP: phs000018.v2.p1; N = 1,792, #SNP = 364,292) and T2D (phs000237.v1.p1, N = 1,384, #SNP = 1,199,187), respectively. The genomic build for these 2 datasets was hg18 causing few overlapped SNPs with WTCCC after they were lifted over to hg38. Therefore, we conducted imputation to increase SNPs number in a secure manner. Specifically, genotype imputation was conducted using a pipeline involving SHAPEIT4 (Delaneau et al. 2019) for pre-phasing and IMPUTE5 (Rubinacci et al. 2020). Datasets aligned to previous versions of the human genome were lifted over to hg38 using CrossMap (Zhao et al. 2014) and SHAPEIT4 was used with the included maps for hg38 to produce pre-phased VCFs of each chromosome with no missing genotypes. After that, IMPUTE5 was used to output chunked VCFs each containing imputed genotypes using data from the 1000 Genome Project resulting in over 73 million SNPs for both final datasets. Finally, imputed SNPs with the same genomic coordinates as WTCCC (392,937) SNPs were selected for downstream EMMAX and edLMM analyses (390,838 for T1D and 389,577 for T2D). A genetic variant is reported as a being successfully replicated if it meets 2 criteria: (1) its P-value is lower than 0.05 divided by number of significant SNPs in corresponding WTCCC disease dataset; and (2) it locates within 500 K base pair distance to the WTCCC significant variant to be replicated.
Gene expression data generated by RNA-seq were obtained from the Genotype-Tissue Expression Project version 8 (GTEx v8) (GTEx Consortium 2015). As the sample size and data quality vary among different GTEx tissues, the number of genes for each tissue also varies. Here we used whole-blood sample (N = 670) as it is the most representative general dataset to use when the disease-causing tissue is unclear. The target tissue has a total of 20,315 genes. As the focus of edLMM is to collect a large number of genetic variants to form the GRM, in contrast to make an accurate gene prediction model that conventional TWAS do, we did not conduct filtering of genes and used all 20,315 genes. The covariates, including genotyping principal components (PCs), were obtained from GTEx_Analysis_v8_eQTL_covariates.tar.gz, which was downloaded from the GTEx portal. For each gene, the gene expression was adjusted for top 5 genotyping PCs, age, sex, sequencing platform, PCR protocol, and 60 peer confounding factors using PEER, following the GTEx best practices (Stegle et al. 2012).The BSLMM training used GTEx genotype within 1 Mb flanking region of the focal genes.
Simulation study
We conducted simulations to estimate power and Type-I Error of edLMM, in contrast to EMMAX. The genotype data of NFBC (N = 5,244) and whole-blood gene expression data of GTEx (v8, N = 670) were used to simulate phenotypic values. Quantitative traits and binary traits were simulated separately, which are described below.
Quantitative phenotypes were simulated based on an “additive” architecture, combing the contributions from a focal variant, polygenic background, and a noise. During each round of the simulation, we first simulated 1 variant as the target to be tested by the models (i.e. EMMAX and edLMM). Then we randomly sampled 2,500 genes from a total of 20,315 in GTEx v8. For each gene, 4-cis genetic variants were randomly selected. These 2,500 × 5 = 10,000 variants will form the contribution of polygenic term. The noise was simulated by rescaling the phenotypic variance components to ensure that the phenotypes have different prespecified heritability. Finally, phenotypes were calculated by adding the contribution from the focal variant, weighted sum of 10,000 variants from GTEx v8 where weights were from the pretrained BSLMM model (polygenic background), and the noise. We repeated simulations 1,000 times for both quantitative and qualitative phenotypes.
Mathematically, the model is given by:
where x represents the focal variant, is the weighted sum of 10,000 variants with prespecified by the BSLMM model, is a noise added to gene expression effect sizes that is sample from a standard normal distribution , and is the noise sampling from a standard normal distribution (). The coefficients are calculated as follows. Note that although only selected variants contribute to the phenotype in simulations, in the analysis of simulated data, the same protocols analyzing real data are used. That is, we utilized all genome-wide genetic variants to calculate the GRM for LMM (i.e. EMMAX) and all variants that have a real weight to calculate the GRM for edLMM.
The overall heritability (i.e. variance components contributed by all variants, x and Z) is fixed to . The variance component of the focal variant, , is set to be 9 different quantities: 0.001, 0.002, 0.003, 0.004, 0.005, 0.006, 0.007, 0.01, and 0.015. The variance component of noise term is . To ensure and , we calculate as follows. We first set . Given , can be solved using the equations below:
Solving the above equations yields:
The model for simulating binary traits largely reused the above model for quantitative traits. We first simulated a quantitative trait, and then used a liability cutoff to divide data into cases and controls, keeping the ratio between cases and controls roughly 2:3, mirroring the WTCCC dataset for which we have conducted the real data analysis.
Type-I error estimation and power calculations
The Type-I error of each protocol was estimated around . To achieve this, for the binary case, we randomly assigned phenotypes (1 for a case and 2 for a control) to 5,244 subjects to empirically determine the null distribution of each protocol. In the quantitative case, we randomly assigned phenotypes by drawing samples from a standard normal distribution. Then we ranked P-values calculated for each variant and took the top 5% as the cutoff. If this cutoff is around 0.05, then it indicates the Type-I error is under control.
For the above genetic architecture and its associated parameters, 1,000 datasets containing simulated phenotypes were created. In each simulated dataset, we tested the 2 protocols’ (EMMAX and edLMM) ability to successfully identify the causal variant simulated from GWAS dataset. Success is defined as the Bonferroni-corrected P-value of this variant that is lower than a predetermined critical value (0.05). We then calculated the power of each protocol as the number of successes divided by the number of rounds of simulation (1,000).
Protocols for association mapping using edLMM and EMMAX
First, BSLMM (Zhou et al. 2013), a function provided by the GEMMA toolkit (Zhou and Stephens 2012), was applied to the GTEx (v8) dataset to train the weights of each cis genetic variant (in the surrounding 1 Mb flanking region). Using the default setting of GEMMA, the variant effect w can be decomposed into 2 parts: α that captures the minor effects that all variants have, and β that captures the additional effects of some large effect variants. The total effect size for a given genetic variant i is then . Based on the formulas described in Mathematical formulation of the standard LMM and edLMM above, we implemented a new function, “kinshipwe”, in our tool Jawamix5 (Xiong et al. 2019) (Long et al. 2013) to calculate the weighted GRM. An existing function, “kinship”, is used to calculate the standard (unweighted) GRM. Then, taking the above 2 different versions of GRM as alternative input (representing edLMM and EMMAX, respectively), the “emmax” function of Jawamix5 was used to conduct the GWAS for both WTCCC and NFBC.
SNPs annotated to genes
Significant SNPs identified by edLMM and EMMAX were annotated to genes using cS2G (combining SNP-to-gene) (Gazal et al. 2022), which utilized multiple combined SNP-to-gene strategies and trained on various diseases.
Genes’ functional validations
To verify functional relevance of statistical results out of the association mapping, we performed gene-disease association analysis using the DisGeNET repository (v7.0) (Pinero et al. 2015; Pinero et al. 2017), a platform containing 1,134,942 gene-disease associations (GDAs), between 21,671 genes and 30,170 traits. For each disease, DisGeNET has its specific summary of gene-disease associations containing information of identified significant genes and their gene-disease associations scores (GDAs). This summary of associations can be searched using the CUI number of each disease. The CUI numbers of 7 diseases in the WTCCC are listed Supplementary Table 3. The gene model file used in this study was downloaded from the GTEx website (https://www.gtexportal.org/home/datasets/gencode.v26.GRCh38.genes.gtf).
Heritability estimation
We aim to quantify the degree to which the low-effect variants bring incremental heritability. We estimated SNP heritability identified by both protocols in the WTCCC and NFBC datasets using GCTA (Yang et al. 2011) (Genome-wide Complex Trait Analysis), a software package developed to estimate the proportion of phenotypic variance explained by all genome-wide variants. In particular, GCTA-GRM was used to calculate the genetic relationship matrix (GRM) from the associated variants and GCTA-GREML was used to estimate the variance explained by them.
Results
Simulations confirm the higher power of edLMM at the low-effect size end of the genetic effects spectrum
We conducted simulations to test the statistical power of edLMM against EMMAX (Kang et al. 2010), based on both quantitative and binary traits (Materials and methods). First, the type-I errors in both binary and quantitative datasets were around 0.05, which are well controlled (Supplementary Table 3). The power comparisons showed that, for both quantitative traits and binary traits (=case/control), edLMM outperforms EMMAX by a substantial margin when the variance component explained by the focal genetic variant is low, i.e. less than 0.01 (Fig. 2a, and 2b; Supplementary Table 4). In contrast, when the variance component is high, e.g. above 0.01, there is no clear distinction between the 2 protocols. This supports our assertion that by fitting random terms using weighted GRM in LMMs, the statistical power is increased for variants with low-effect sizes. Additionally, edLMM also outperforms linear models in a similar pattern.
edLMM improves identification of genetic variants in real data
We applied edLMM and EMMAX to 2 frequently used benchmark datasets that have a moderate sample size of around 5,000 (Materials and methods). Excluding the traits for which none of the protocols called any signal (after Bonferroni correction), edLMM identified more significant variants than EMMAX in both WTCCC (Fig. 3; Supplementary Table 5) and NFBC (Supplementary Table 6). In addition, we replicated EMMAX and edLMM significant discoveries in 2 independent datasets for T2D and T1D using other independent GWAS datasets (Materials and methods). Our results showed that edLMM is more robust than EMMAX in both diseases, as edLMM replication rates are 11.97% (14/117) and 10.16% (31/305) in T2D and T1D, respectively, in contrast to EMMAX's replication rates of 10.53% (6/57) and 4.97% (9/181).
edLMM discovered additional genetic variants with low-effect sizes
We further analyzed whether edLMM's superior performance over EMMAX is indeed due to the advantage of detecting low-effect variants (that is indicated by our design intuition and simulations). To this end, the distribution of effect sizes (=absolute values of regression coefficient of the genetic variant) of significant signals identified by EMMAX and edLMM were visualized for WTCCC (Fig. 4). It is observed that edLMM outperforms EMMAX across the spectrum, and the advantage for low-effect variants (≤0.05) are particularly pronounced.
edLMM-identified variants are functionally relevant
To check the functional relevance of the statistical results, we first mapped each significant variant to genes using cS2G, a state-of-the-art tool to map genetic variants to genes based on various factors (Gazal et al. 2022). Then we queried the role of genes in the related disease in DisGeNET, an established repository of validated genes associated with diseases (Pinero et al. 2017) (Materials and Methods; Supplementary Table 7). We compared edLMM and EMMAX for each disease on their numbers of validated and the rate of validated (number of validated divided by the total number of discovered). Additionally, by removing the variants that were discovered by both methods, we also calculated the number of unique validated as well as unique validated rate. We find that edLMM outperformed EMMAX in all the 7 WTCCC diseases in terms of numbers of validated, although validated rates are slightly lower (Fig. 5; Supplementary Table 7). However, the edLMM validated rates are still good considering the substantially larger number of candidates and their low-effect sizes, as well as the fact that low-effect variants may be less studied and deposited in the databases. We then also calculated the numbers of background genes and assessed the significant level of enrichments using a hypergeometric test (Supplementary Table 21). Evidently, among all 7 diseases, 4 show significance (T1D, HT, RA, and CD) and edLMM outcomes enjoy lower P-values in 3 of them (T1D, HT, and CD). All significant genes identified by any of the 2 protocols for the 7 diseases of WTCCC, along with their P-values, effect size, and MAF, are in Supplementary Tables 8–14. For NFBC dataset, we focused on 2 traits: high-density lipoprotein (HDL) and low-density lipoprotein (LDL), as other traits the number of significant variants identified by EMMAX and edLMM is relatively small and not much difference to analyze. Following the same protocol, we mapped significant variants identified by both EMMAX and edLMM to genes and reported these genes’ DisGeNET validations as well (Supplementary Tables 15 and 16).
Literature support of genes uniquely identified by edLMM
We conducted a literature search on the genes that were discovered by edLMM but not EMMAX. For WTCCC, we focused on T1D and RA, which contain most significant genes among 7 diseases. For each disease, we selected 2 edLMM uniquely discovered genes based on their DisGeNET scores, as well as 2 genes that are not reported by DisGeNET but with large cS2G scores. For NFBC, only 1 gene from HDL was annotated by cS2G but has not been reported by DisGeNET. The gene names and associated P-values and scores are presented in (Supplementary Tables 17). We were able to identify multiple lines of evidence in literature for these genes (Supplementary Notes 2).
Inclusion of edLMM-identified genetic variants leads to higher explainable SNP heritability
To quantify the incremental benefits brought by low-effect variants, we collected variants identified by edLMM and EMMAX to estimate SNP heritability using GCTA, a standard tool for heritability estimation (Yang et al. 2011) (Materials and methods). In the WTCCC dataset, the SNP heritability explained by associated variants increased substantially, ranging from 6.76% (T1D) to 147.61% (HT), with the median being 73.29% (T2D) (Fig. 6; Supplementary Table 18). In the NFBC dataset, SNPs’ heritability increases of 24.44% for HDL and 1.35% for LDL were observed (Supplementary Table 19).
Discussion
Based on our interpretation of LMM and unique understanding of TWAS (Cao et al. 2021b; Cao et al. 2022), we proposed edLMM, a new model that borrows a variant-weighting protocol to refine the modeling of the polygenic term (represented by the weighted GRM) in an LMM model. This design has been shown to bring a substantial power gain when detecting low-effect genetic variants in association studies with moderate samples (i.e. WTCCC and NFBC that have around 5,000 individuals), leading to improved SNP-heritability estimations. This work further suggests that, by skipping the use of GReX, our disentangling of genetic feature selection and aggregation in transcriptome-wide association studies (Cao et al. 2022) may open a door for novel models integrating multi-scale -omics to be developed.
This work has been only focusing on expression data within a single tissue. We do not have a systematic way to decide which tissue to use to differentiate diseases. To investigate the limitation of tissue specificity, we applied edLMM to re-analyze the 7 WTCCC diseases with 14 GTEx tissues that have reasonably large sample sizes, namely adipose, artery aorta, artery coronary, brain cerebellar hemisphere, brain cerebellum, brain cortex, brain hippocampus, brain hypothalamus, colon sigmoid, colon transverse, heart atrial appendage, heart left ventricle, small intestine, and stomach. Some of these tissues may be matched to the related disease better than others. Interestingly, it is observed that tissue-specific edLMM identified a similar number of significant variants comparable to the results from whole blood, with an average overlap of 88% significant variants between whole blood edLMM and tissue-specific edLMM (Supplementary Fig. 1 and Supplementary Table 20). Moreover, the utilization of potentially matched tissue expression does not reveal many additional variants compared to whole blood, which can be attributed to the relatively limited sample size of primary tissues in comparison. These observations indicate that while whole-blood gene expression might provide an aggregation of individuals’ gene expression landscape and could be a reasonable choice if matched tissue for the focal disease is not known or unavailable.
The GCTA estimates of heritability may be sensitive to the structure of genetic relatedness matrix (Kumar et al. 2016). Additionally, under infinitesimal model, more genetic variants naturally tend to explain more heritability. To address these 2 potential problems, we analyzed the per-SNP heritability using LDAK (Speed et al. 2012; Speed et al. 2017), in particular its BLD-LDAK variation that is specifically recommended for human traits (Speed et al. 2020) (Supplementary Note 3). By applying LDAK to the WTCCC diseases, using the same number of top significant variants, we observed that, for most of diseases, per-SNP heritability of variants identified by edLMM is higher than (in T2D, T1D, CAD) or comparable to (in RA, HT, BD) the per-SNP heritability of variants identified by EMMAX, with the exception of CD in which EMMAX-identified is higher (Supplementary Fig. 2). This provides another line of evidence that edLMM indeed identified genetics variants explaining more heritability.
Although the main motivation of the work is to show edLMM's advantage in small samples, to test its performance in larger samples, we applied edLMM to another T1D study with #cases = 5,220 and #controls = 5,571 (Barrett et al. 2009) (dbGaP: phs000911.v1.p1, #SNP = 164,643). The detailed outcomes are listed in Supplementary Table 23. Our analysis reveals that edLMM surpassed EMMAX in detecting significant variants (edLMM = 1,447 vs EMMAX = 694). Notably, nearly all the results from EMMAX (98%, or 680 out of 694) were included in the results of edLMM. Of the significant variants pinpointed by edLMM, approximately 19.9% (or 288 out of 1,447) are characterized as low-effect variants. Intriguingly, 91.3% of these low-effect variants (263 out of 288) were not previously detected by EMMAX. This additional analysis shows that the superiority of edLMM is also present when the sample size increases.
LMM has been shown to be robust to model misspecification (Jiang et al. 2014). To test whether edLMM indeed improves power on top of such property, we carried out simulations by generating phenotype directly from where and and then analyzed the data using mis-specified model (in which ) or matched model (in which ) that has a more accurate polygenic term (i.e. GRM estimation). Indeed, we find that the matched model, i.e. edLMM, has higher power consistently across all the spectrum of heritability (Supplementary Table 22). Therefore, we suggest that, although LMM models are robust to the model misspecification, the accurate estimation of polygenetic terms brings more power to detect low-effect variants.
To check whether the gain of power of edLMM is indeed brought by the functional weight, instead of excluding variants not near a gene, we performed a variation of EMMAX analysis, namely cis-EMMAX, which uses a GRM calculated by the genetic variants in cis regions (which are 336,856 out of 392,937 variants, accounting for 85% SNPs in the WTCCC genotype dataset) without including the expression-directed weights. We observed that cis-EMMAX identified a small proportion of edLMM-identified variants, ranging from 3 to 53% for different diseases (Supplementary Fig. 3). These results indicate that simply including cis-region variants does not enhance the power of LMMs, and the integrating functional weights may be the genuine source of the power gain.
Natural extensions include developing refined protocols using multi-tissue-based variants selection (Hu et al. 2019) and the inclusion of other functional -omics data such as DNA methylation (Wu et al. 2020) and CHIP-Seq (Sun et al. 2022). The flexibility of the edLMM framework enables straightforward incorporation of extra -omics data and multiple tissue-based feature selections, which are our future work. Another extension is to use GLMM to handle binary traits, although it has been pointed out that ordinary LMM is also generally appropriate (Kang et al. 2010).
Although there are many interpretations for “missing heritability” including rare variants (Zuk et al. 2012) and nonlinear interactions (Zuk et al. 2014), here we chose to focus on genetic variants with a low-effect size. These alternative interpretations do not have to be mutually exclusive. Indeed, although our additional low-effect variants can explain substantial missing heritability, still most of the heritability is left unexplained. However, our contribution from the perspective of the linear contribution of low-effect variants may be the most straightforward to implement in practical genetic tests using PRS, as it incorporates common variants in a simple way.
Supplementary Material
Contributor Information
Qing Li, Department of Biochemistry & Molecular Biology, University of Calgary, Calgary T2N 1N4, Canada.
Jiayi Bian, Department of Mathematics and Statistics, University of Calgary, Calgary T2N 1N4, Canada.
Yanzhao Qian, Department of Mathematics and Statistics, University of Calgary, Calgary T2N 1N4, Canada.
Pathum Kossinna, Department of Biochemistry & Molecular Biology, University of Calgary, Calgary T2N 1N4, Canada.
Cooper Gau, Department of Mathematics and Statistics, University of Calgary, Calgary T2N 1N4, Canada.
Paul M K Gordon, Alberta Children's Hospital Research Institute, University of Calgary, Calgary T2N 1N4, Canada.
Xiang Zhou, School of Public Health, University of Michigan, Ann Arbor 48109, USA.
Xingyi Guo, Department of Medicine & Biomedical Informatics, Vanderbilt University Medical Center, Nashville 37203, USA.
Jun Yan, Physiology and Pharmacology, University of Calgary, Calgary T2N 1N4, Canada; Hotchkiss Brain Institute, University of Calgary, Calgary T2N 1N4, Canada.
Jingjing Wu, Department of Mathematics and Statistics, University of Calgary, Calgary T2N 1N4, Canada.
Quan Long, Department of Biochemistry & Molecular Biology, University of Calgary, Calgary T2N 1N4, Canada; Department of Mathematics and Statistics, University of Calgary, Calgary T2N 1N4, Canada; Alberta Children's Hospital Research Institute, University of Calgary, Calgary T2N 1N4, Canada; Hotchkiss Brain Institute, University of Calgary, Calgary T2N 1N4, Canada; Department of Medical Genetics, University of Calgary, Calgary T2N 1N4, Canada.
Data availability
GTEx gene expression: https://gtexportal.org/home/datasets. GTEx whole genome sequencing data: https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000424.v9.p2. WTCCC genotype: https://www.wtccc.org.uk/. NFBC genotype: https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000276.v2.p1. T1D and T2D dataset used in replication: https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000018.v2.p1, https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000911.v1.p1 and https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000237.v1.p1. The gene model file: https://www.gtexportal.org/home/datasets/gencode.v26.GRCh38.genes.gtf. edLMM is a function in Jawamix5, which is publicly available at: https://github.com/theLongLab/Jawamix5. cS2G code is publicly available at: https://alkesgroup.broadinstitute.org/cS2G/code.
Supplemental material available at GENETICS online.
Funding
New Frontiers in Research Fund grant (NFRFE-2018-00748) (QuL). Alberta Innovates LevMax-Health Program Bridge Funds 222300769 (QuL). Canada Foundation for Innovation 36605 (QuL). NSERC Discovery Grant (RGPIN-2018-04328 JW; RGPIN-2017-04860: QuL). Campbell McLaurin Chair for Hearing Deficiencies (JY). Alberta Innovates Graduate Student Scholarships (JB).
Author contributions
Conceived the project: JW and QuL; supervised the study: JW and QuL; designed statistical model and derived mathematical formulations: JB & QuL; implemented computational tool: QiL, JB and YQ; conducted analyses: QiL, JB, PK; provided advice and tools: PG, XZ, XG, and JY; provided funding support: JY, JW, QuL; wrote the manuscript: QiL, JB, and QuL, with contributions from all coauthors.
Literature cited
- Barrett JC, Clayton DG, Concannon P, Akolkar B, Cooper JD, Erlich HA, Julier C, Morahan G, Nerup J, Nierras C, et al. 2009. Genome-wide association study and meta-analysis find that over 40 loci affect risk of type 1 diabetes. Nat Genet. 41(6):703–707. doi: 10.1038/ng.381. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Cao C, Ding B, Li Q, Kwok D, Wu J, Long Q. 2021a. Power analysis of transcriptome-wide association study: implications for practical protocol choice. PLoS Genet. 17(2):e1009405. doi: 10.1371/journal.pgen.1009405. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Cao C, Kossinna P, Kwok D, Li Q, He J, Su L, Guo X, Zhang Q, Long Q. 2022. Disentangling genetic feature selection and aggregation in transcriptome-wide association studies. Genetics. 220(2):iyab216. doi: 10.1093/genetics/iyab216. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Cao C, Kwok D, Edie S, Li Q, Ding B, Kossinna P, Campbell S, Wu J, Greenberg M, Long Q. 2021b. kTWAS: integrating kernel machine with transcriptome-wide association studies improves statistical power and reveals novel genes. Brief Bioinform. 22(4):bbaa270. doi: 10.1093/bib/bbaa270. [DOI] [PubMed] [Google Scholar]
- Carithers LJ, Moore HM. 2015. The genotype-tissue expression (GTEx) project. Biopreserv Biobank. 13(5):307–308. doi: 10.1089/bio.2015.29031.hmm. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Choi SW, Mak TSH, O'Reilly PF. 2020. Tutorial: a guide to performing polygenic risk score analyses. Nat Protoc. 15(9):2759–2772. doi: 10.1038/s41596-020-0353-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Crossa J, Pérez-Rodríguez P, Cuevas J, Montesinos-López O, Jarquín D, de los Campos G, Burgueño J, González-Camacho JM, Pérez-Elizalde S, Beyene Y, et al. 2017. Genomic selection in plant breeding: methods, models, and perspectives. Trends Plant Sci. 22(11):961–975. doi: 10.1016/j.tplants.2017.08.011. [DOI] [PubMed] [Google Scholar]
- Delaneau O, Zagury JF, Robinson MR, Marchini JL, Dermitzakis ET. 2019. Accurate, scalable and integrative haplotype estimation. Nat Commun. 10(1):5436. doi: 10.1038/s41467-019-13225-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Fisher RA. 1919. XV.—the correlation between relatives on the supposition of Mendelian inheritance. Earth Environ Sci Trans R Soc Edinburgh. 52(2):399–433. doi: 10.1017/S0080456800012163. [DOI] [Google Scholar]
- Freedman ML, Reich D, Penney KL, McDonald GJ, Mignault AA, Patterson N, Gabriel SB, Topol EJ, Smoller JW, Pato CN, et al. 2004. Assessing the impact of population stratification on genetic association studies. Nat Genet. 36(4):388–393. doi: 10.1038/ng1333. [DOI] [PubMed] [Google Scholar]
- Gamazon ER, Wheeler HE, Shah KP, Mozaffari SV, Aquino-Michaels K, Carroll RJ, Eyler AE, Denny JC, Consortium GT, Nicolae DL, et al. 2015. A gene-based association method for mapping traits using reference transcriptome data. Nat Genet. 47(9):1091–1098. doi: 10.1038/ng.3367. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Gazal S, Loh PR, Finucane HK, Ganna A, Schoech A, Sunyaev S, Price AL. 2018. Functional architecture of low-frequency variants highlights strength of negative selection across coding and non-coding annotations. Nat Genet. 50(11):1600–1607. doi: 10.1038/s41588-018-0231-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Gazal S, Weissbrod O, Hormozdiari F, Dey KK, Nasser J, Jagadeesh KA, Weiner DJ, Shi H, Fulco CP, O'Connor LJ, et al. 2022. Combining SNP-to-gene linking strategies to identify disease genes and assess disease omnigenicity. Nat Genet. 54(6):827–836. doi: 10.1038/s41588-022-01087-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Goddard ME, Hayes BJ. 2007. Genomic selection. J Anim Breed Genet. 124(6):323–330. doi: 10.1111/j.1439-0388.2007.00702.x. [DOI] [PubMed] [Google Scholar]
- GTEx Consortium . 2015. Human genomics. The genotype-tissue expression (GTEx) pilot analysis: multitissue gene regulation in humans. Science. 348(6235):648–660. doi: 10.1126/science.1262110. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Gusev A, Ko A, Shi H, Bhatia G, Chung W, Penninx BW, Jansen R, de Geus EJ, Boomsma DI, Wright FA, et al. 2016. Integrative approaches for large-scale transcriptome-wide association studies. Nat Genet. 48(3):245–252. doi: 10.1038/ng.3506. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Hu Y, Li M, Lu Q, Weng H, Wang J, Zekavat SM, Yu Z, Li B, Gu J, Muchnik S, et al. 2019. A statistical framework for cross-tissue transcriptome-wide association analysis. Nat Genet. 51(3):568–576. doi: 10.1038/s41588-019-0345-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Jiang J, Li C, Paul D, Yang C, Zhao H. 2014. High-dimensional genome-wide association study and misspecified mixed model analysis. arXiv 1404.2355. 10.48550/arXiv.1404.2355, preprint: not peer reviewed. [DOI]
- Kang HM, Sul JH, Service SK, Zaitlen NA, Kong SY, Freimer NB, Sabatti C, Eskin E. 2010. Variance component model to account for sample structure in genome-wide association studies. Nat Genet. 42(4):348–354. doi: 10.1038/ng.548. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kang HM, Ye C, Eskin E. 2008a. Accurate discovery of expression quantitative trait loci under confounding from spurious and genuine regulatory hotspots. Genetics. 180(4):1909–1925. doi: 10.1534/genetics.108.094201. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kang HM, Zaitlen NA, Wade CM, Kirby A, Heckerman D, Daly MJ, Eskin E. 2008b. Efficient control of population structure in model organism association mapping. Genetics. 178(3):1709–1723. doi: 10.1534/genetics.107.080101. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Krishna Kumar S, Feldman MW, Rehkopf DH, Tuljapurkar S. 2016. Limitations of GCTA as a solution to the missing heritability problem. Proc Natl Acad Sci USA. 113(1):E61–E70. doi: 10.1073/pnas.1520109113. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Listgarten J, Kadie C, Schadt EE, Heckerman D. 2010. Correction for hidden confounders in the genetic analysis of gene expression. Proc Natl Acad Sci U S A. 107(38):16465–16470. doi: 10.1073/pnas.1002425107. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Long Q, Zhang Q, Vilhjalmsson BJ, Forai P, Seren U, Nordborg M. 2013. JAWAMix5: an out-of-core HDF5-based java implementation of whole-genome association studies using mixed models. Bioinformatics. 29(9):1220–1222. doi: 10.1093/bioinformatics/btt122. [DOI] [PubMed] [Google Scholar]
- Lonsdale J, Thomas J, Salvatore M, Phillips R, Lo E, Shad S, Hasz R, Walters G, Garcia F, Young N, et al. 2013. The genotype-tissue expression (GTEx) project. Nat Genet. 45(6):580–585. doi: 10.1038/ng.2653. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ober C, Abney M, McPeek MS. 2001. The genetic dissection of complex traits in a founder population. Am J Hum Genet. 69(5):1068–1079. doi: 10.1086/324025. [DOI] [PMC free article] [PubMed] [Google Scholar]
- O'Connor LJ, Schoech AP, Hormozdiari F, Gazal S, Patterson N, Price AL. 2019. Extreme polygenicity of complex traits is explained by negative selection. Am J Hum Genet. 105(3):456–476. doi: 10.1016/j.ajhg.2019.07.003. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Peyrot WJ, Price AL. 2021. Identifying loci with different allele frequencies among cases of eight psychiatric disorders using CC-GWAS. Nat Genet. 53(4):445–454. doi: 10.1038/s41588-021-00787-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Pezzolesi MG, Poznik GD, Mychaleckyj JC, Paterson AD, Barati MT, Klein JB, Ng DPK, Placha G, Canani LH, Bochenski J, et al. 2009. Genome-wide association scan for diabetic nephropathy susceptibility genes in type 1 diabetes. Diabetes. 58(6):1403–1410. doi: 10.2337/db08-1514. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Piñero J, Bravo A, Queralt-Rosinach N, Gutiérrez-Sacristán A, Deu-Pons J, Centeno E, García-García J, Sanz F, Furlong LI. 2017. DisGeNET: a comprehensive platform integrating information on human disease-associated genes and variants. Nucleic Acids Res. 45(D1):D833–D839. doi: 10.1093/nar/gkw943. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Pinero J, Queralt-Rosinach N, Bravo A, Deu-Pons J, Bauer-Mehren A, Baron M, Sanz F, Furlong LI. 2015. DisGeNET: a discovery platform for the dynamical exploration of human diseases and their genes. Database (Oxford). 2015:bav028. doi: 10.1093/database/bav028. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Powell JE, Visscher PM, Goddard ME. 2010. Reconciling the analysis of IBD and IBS in complex trait studies. Nat Rev Genet. 11(11):800–805. doi: 10.1038/nrg2865. [DOI] [PubMed] [Google Scholar]
- Price AL, Zaitlen NA, Reich D, Patterson N. 2010. New approaches to population stratification in genome-wide association studies. Nat Rev Genet. 11(7):459–463. doi: 10.1038/nrg2813. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Rousset F. 2002. Inbreeding and relatedness coefficients: what do they measure? Heredity (Edinb). 88(5):371–380. doi: 10.1038/sj.hdy.6800065. [DOI] [PubMed] [Google Scholar]
- Rubinacci S, Delaneau O, Marchini J. 2020. Genotype imputation using the positional Burrows Wheeler transform. PLoS Genet. 16(11):e1009049. doi: 10.1371/journal.pgen.1009049. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Sabatti C, Service SK, Hartikainen AL, Pouta A, Ripatti S, Brodsky J, Jones CG, Zaitlen NA, Varilo T, Kaakinen M, et al. 2009. Genome-wide association analysis of metabolic traits in a birth cohort from a founder population. Nat Genet. 41(1):35–46. doi: 10.1038/ng.271. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Sorensen D, Gianola D. 2002. Likelihood, Bayesian and MCMC Methods in Quantitative Genetics. New York: Springer. [Google Scholar]
- Speed D, Balding DJ. 2015. Relatedness in the post-genomic era: is it still useful? Nat Rev Genet. 16(1):33–44. doi: 10.1038/nrg3821. [DOI] [PubMed] [Google Scholar]
- Speed D, Cai N, Consortium U, Johnson MR, Nejentsev S, Balding DJ. 2017. Reevaluation of SNP heritability in complex human traits. Nat Genet. 49(7):986–992. doi: 10.1038/ng.3865. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Speed D, Hemani G, Johnson MR, Balding DJ. 2012. Improved heritability estimation from genome-wide SNPs. Am J Hum Genet. 91(6):1011–1021. doi: 10.1016/j.ajhg.2012.10.010. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Speed D, Holmes J, Balding DJ. 2020. Evaluating and improving heritability models using summary statistics. Nat Genet. 52(4):458–462. doi: 10.1038/s41588-020-0600-y. [DOI] [PubMed] [Google Scholar]
- Stegle O, Parts L, Piipari M, Winn J, Durbin R. 2012. Using probabilistic estimation of expression residuals (PEER) to obtain increased power and interpretability of gene expression analyses. Nat Protoc. 7(3):500–507. doi: 10.1038/nprot.2011.457. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Sun Q, Liu Y, Chen Z, Chua TS, Schiele B. 2022. Meta-transfer learning through hard tasks. IEEE Trans Pattern Anal Mach Intell. 44(3):1443–1456. doi: 10.1109/TPAMI.2020.3018506. [DOI] [PubMed] [Google Scholar]
- Tang SZ, Buchman AS, De Jager PL, Bennett DA, Epstein MP, Yang JJ. 2021. Novel variance-component TWAS method for studying complex human diseases with applications to Alzheimer's dementia. PLoS Genet. 17(4):e1009482. doi: 10.1371/journal.pgen.1009482. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Visscher PM, Goddard ME. 2019. From R.A. Fisher's 1918 paper to GWAS a century later. Genetics. 211(4):1125–1130. doi: 10.1534/genetics.118.301594. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Wellcome Trust Case Control Consortium . 2007. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 447(7145):661–678. doi: 10.1038/nature05911. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Wu L, Yang Y, Guo X, Shu XO, Cai Q, Shu X, Li B, Tao R, Wu C, Nikas JB, et al. 2020. An integrative multi-omics analysis to identify candidate DNA methylation biomarkers related to prostate cancer risk. Nat Commun. 11(1):3905. doi: 10.1038/s41467-020-17673-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Xie FJ, Wang SB, Beavis WD, Xu SZ. 2021. Estimation of genetic variance contributed by a quantitative trait locus: correcting the bias associated with significance tests. Genetics. 219(3):iyab115. doi: 10.1093/genetics/iyab115. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Xiong Z, Zhang Q, Platt A, Liao W, Shi X, de Los Campos G, Long Q. 2019. OCMA: fast, memory-efficient factorization of prohibitively large relationship matrices. G3 (Bethesda). 9(1):13–19. doi: 10.1534/g3.118.200908. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Xu SZ. 2003. Estimating polygenic effects using markers of the entire genome. Genetics. 163(2):789–801. doi: 10.1093/genetics/163.2.789. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Xu SZ. 2013. Genetic mapping and genomic selection using recombination breakpoint data. Genetics. 195(3):1103–1115. doi: 10.1534/genetics.113.155309. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Xu SZ. 2017. Predicted residual error sum of squares of mixed models: an application for genomic prediction. G3 (Bethesda). 7(3):895–909. doi: 10.1534/g3.116.038059. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Xu Y, Ma KX, Zhao Y, Wang X, Zhou K, Yu GN, Li C, Li PC, Yang ZF, Xu CW, et al. 2021. Genomic selection: a breakthrough technology in rice breeding. Crop J. 9(3):669–677. doi: 10.1016/j.cj.2021.03.008. [DOI] [Google Scholar]
- Yang J, Lee SH, Goddard ME, Visscher PM. 2011. GCTA: a tool for genome-wide complex trait analysis. Am J Hum Genet. 88(1):76–82. doi: 10.1016/j.ajhg.2010.11.011. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Yu J, Pressoir G, Briggs WH, Vroh Bi I, Yamasaki M, Doebley JF, McMullen MD, Gaut BS, Nielsen DM, Holland JB, et al. 2006. A unified mixed-model method for association mapping that accounts for multiple levels of relatedness. Nat Genet. 38(2):203–208. doi: 10.1038/ng1702. [DOI] [PubMed] [Google Scholar]
- Zhang Z, Ersoz E, Lai CQ, Todhunter RJ, Tiwari HK, Gore MA, Bradbury PJ, Yu J, Arnett DK, Ordovas JM, et al. 2010. Mixed linear model approach adapted for genome-wide association studies. Nat Genet. 42(4):355–360. doi: 10.1038/ng.546. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Zhao H, Sun ZF, Wang J, Huang HJ, Kocher JP, Wang LG. 2014. CrossMap: a versatile tool for coordinate conversion between genome assemblies. Bioinformatics. 30(7):1006–1007. doi: 10.1093/bioinformatics/btt730. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Zhou X, Carbonetto P, Stephens M. 2013. Polygenic modeling with Bayesian sparse linear mixed models. PLoS Genet. 9(2):e1003264. doi: 10.1371/journal.pgen.1003264. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Zhou X, Stephens M. 2012. Genome-wide efficient mixed-model analysis for association studies. Nat Genet. 44(7):821–824. doi: 10.1038/ng.2310. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Zondervan KT, Cardon LR. 2004. The complex interplay among factors that influence allelic association. Nat Rev Genet. 5(2):89–100. doi: 10.1038/nrg1270. [DOI] [PubMed] [Google Scholar]
- Zuk O, Hechter E, Sunyaev SR, Lander ES. 2012. The mystery of missing heritability: genetic interactions create phantom heritability. Proc Natl Acad Sci USA. 109(4):1193–1198. doi: 10.1073/pnas.1119675109. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Zuk O, Schaffner SF, Samocha K, Do R, Hechter E, Kathiresan S, Daly MJ, Neale BM, Sunyaev SR, Lander ES. 2014. Searching for missing heritability: designing rare variant association studies. Proc Natl Acad Sci USA. 111(4):E455–E464. doi: 10.1073/pnas.1322563111. [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Data Availability Statement
GTEx gene expression: https://gtexportal.org/home/datasets. GTEx whole genome sequencing data: https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000424.v9.p2. WTCCC genotype: https://www.wtccc.org.uk/. NFBC genotype: https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000276.v2.p1. T1D and T2D dataset used in replication: https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000018.v2.p1, https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000911.v1.p1 and https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000237.v1.p1. The gene model file: https://www.gtexportal.org/home/datasets/gencode.v26.GRCh38.genes.gtf. edLMM is a function in Jawamix5, which is publicly available at: https://github.com/theLongLab/Jawamix5. cS2G code is publicly available at: https://alkesgroup.broadinstitute.org/cS2G/code.
Supplemental material available at GENETICS online.