Abstract
The expansive collection of genetic and phenotypic data within biobanks offers an unprecedented opportunity for biomedical research. However, the frequent occurrence of missing phenotypes presents a significant barrier to fully leveraging this potential. In our target application, on one hand, we have only a small and complete dataset with both genotypes and phenotypes to build a genetic prediction model, commonly called a polygenic (risk) score (PGS or PRS); on the other hand, we have a large dataset of genotypes (e.g. from a biobank) without the phenotype of interest. Our goal is to leverage the large dataset of genotypes (but without the phenotype) and a separate genome-wide association study summary dataset of the phenotype to impute the phenotypes, which are then used as an individual-level dataset, along with the small complete dataset, to build a nonlinear model as the PGS. More specifically, we trained several nonlinear models on 7 imputed and observed phenotypes from the UK Biobank data. We then trained an ensemble model to integrate these models for each trait, resulting in higher prediction R² than using only the small complete (observed) dataset. Additionally, for 2 of the 7 traits, we observed that the nonlinear model trained with the imputed traits had higher R² than using the imputed traits directly as the PGS, while for the remaining 5 traits, no improvement was found. These findings demonstrate the potential of leveraging existing genetic data and accounting for nonlinear genetic relationships to improve prediction accuracy for some traits.
Keywords: genome-wide association studies (GWAS), LS-imputation, machine learning, polygenic score (PGS), SNP
Introduction
The emergence of large-scale biobanks and the accumulation of vast amounts of phenotypic and genomic data have significantly advanced the fields of genetics and biomedicine. These extensive databases aim to encompass a wide array of phenotypes within the population. Despite their comprehensive nature, a significant challenge in leveraging these resources is the frequent occurrence of missing phenotypes. Such gaps often result from the prohibitive cost or complexity of acquiring specific phenotypes. It can also be due to the unavailability of a phenotype, for example, a late-onset disease state, such as Alzheimer's disease (AD), in a cohort of young individuals. The consequence is a notable limitation in our capacity to thoroughly study these phenotypes using biobank data. Therefore, accurately imputing the genetic components of any phenotype from genotypic information is of immense value to genetic research. Particularly, as many phenotypes exhibit complex genetic architecture, the importance of learning and predicting complex, potentially highly nonlinear genotype–phenotype relationships has increased. Recent advancements in the field now make completing such tasks with only genome-wide association studies (GWAS) summary statistics feasible. Recently, Ren et al. (2023) proposed a nonparametric method, LS-imputation, which utilizes the individual-level genotypes and a separate GWAS summary dataset to impute the missing individual-level phenotypes while possibly retaining nonlinear genotype–phenotype relationships. By leveraging external GWAS data and adopting a nonparametric approach, LS-imputation offers a significant opportunity to enhance nonlinear analyses using genetic data from large biobanks, effectively addressing the missing phenotype issue.
In particular, we would like to leverage phenotype imputation to improve the performance of building a nonlinear polygenic score (PGS): given a small complete dataset with both genotypes and phenotypes, a large set of genotypes but no phenotypes, and a separate GWAS summary dataset of the phenotype of interest, whether/how we can construct a possibly nonlinear PGS with an improved performance over that built on only the small complete dataset. It is noted that in general, it is impossible to build a nonlinear model using only GWAS summary statistics. One motivating example would be the construction of a PGS for late-onset AD. On one hand, we have a small dataset, such as from the Alzheimer's Disease Sequencing Project (ADSP), with individual-level genotype data and diagnosed AD status. On the other hand, we have a much larger biobank dataset, such as from the UK Biobank (UKB), with individual-level genotype data but an insufficient number of diagnosed AD cases due to its relatively younger demographic. Alongside publicly available GWAS data such as the International Genomics of Alzheimer's Project (IGAP), our aim is to improve the performance of a nonlinear PGS using all 3 datasets, rather than only the ADSP data. In summary, the major contribution here is to evaluate the effectiveness of using trait imputation to improve the prediction performance with nonlinear PGS models with large-scale biobank data.
Although numerous statistical and deep learning imputation methods have been developed to address the missing phenotype problem, compared to our approach, they tend to focus on different issues or suffer from inherent limitations for the purpose of genetic analyses. For instance, some (Dahl et al. 2016; Hormozdiari et al. 2016; An et al. 2023) rely on the correlations between the missing phenotype and other observed covariates/phenotypes for imputation—a correlation that may stem from shared genetic and environmental influences. Yet, these approaches do not use genetic data for imputation and thus fail to distinguish genetic effects from independent environmental influences, which is not really suitable for downstream genetic analyses. Moreover, existing PGS methods that utilize genotypic data for phenotype prediction are often based on linear models with a subset of selected genetic variants, such as Ge et al. (2019), or require complete genotype and phenotype data for training (van Hilten et al. 2021; Elgart et al. 2022; Sehrawat et al. 2023; Georgantas et al. 2024), which is constrained by the very problem of missing phenotypes.
We assessed our proposed method's efficacy using 7 traits, namely Lipoprotein(a) (lipA), HDL, LDL, triglycerides (TRIG), systolic blood pressure (SBP), diastolic blood pressure (DBP), and height, in the UKB dataset. Some of the traits such as lipA and LDL have previously been found to have possibly nonlinear genetic associations (called genetically nonlinear traits) (Elgart et al. 2022; Sigurdsson et al. 2023), and other traits like height will serve as examples of traits with (mostly) linear genetic associations (called genetically linear traits). We trained the models on the observed traits and the imputed traits (by 2 imputation methods) separately. Additionally, we also built an ensemble model integrating the model trained on the observed trait with that trained on the imputed trait. We evaluated the performance of these models on an independent test set. A model's performance was measured by the coefficient of determination (R²) between predicted and observed phenotypes in a test set.
The following sections outline our methodological approach, present the results of our analyses, and discuss the implications of our findings.
Materials and Methods
Our primary objective is 2-fold: firstly, to use imputed traits as training phenotypes and compare the improvement in prediction accuracy with different ML/DL prediction models; secondly, to compare the predictive performance of using traits imputed by LS-imputation and by a more popular linear model-based PGS (Zhou et al. 2023) called PRS-CS (Ge et al. 2019). The workflow begins by mirroring the approach of Ren et al. (2023), examining the GWAS results based on the lipA trait imputed by LS-imputation and by PRS-CS. The aim is to replicate and extend the findings of Ren et al. (2023) for a genetically nonlinear trait, thus underscoring the superior performance of LS-imputation in capturing the intricate genetic architecture revealed by GWAS. Subsequently, we proceed to train prediction models—both linear and nonlinear—using traits imputed via LS-imputation and PRS-CS as outcomes, with GWAS genotypes (SNPs) serving as predictors to elucidate the genotype–phenotype mappings. The efficacy of these models is then evaluated on an independent test set through the coefficient of determination (R²) between the predicted values of the trait and its observed values. This evaluation delineates the extent of nonlinear information that LS-imputation is able to capture beyond what is possible with PRS-CS.
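Throughout, performance is measured by the coefficient of determination between predicted and observed trait values on a test set. A minimal sketch of this metric (the function name is ours), using the standard 1 − SS_res/SS_tot definition:

```python
import numpy as np

def r_squared(y_obs, y_pred):
    """Coefficient of determination between observed and predicted values."""
    y_obs = np.asarray(y_obs, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_obs - y_pred) ** 2)   # residual sum of squares
    ss_tot = np.sum((y_obs - y_obs.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot

# Perfect prediction gives R^2 = 1; predicting the mean gives R^2 = 0.
y = np.array([1.0, 2.0, 3.0, 4.0])
print(r_squared(y, y))                     # -> 1.0
print(r_squared(y, np.full(4, y.mean())))  # -> 0.0
```

Note that this definition can be negative for a model that predicts worse than the mean, unlike a squared correlation.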
Lastly, we focus on the target application where we combine a PGS learned from the imputed trait with that learned from the small complete data (with observed trait) to construct a final PGS with improved performance. Here, ensemble learning techniques are employed to refine prediction accuracy. This involves training distinct models on both observed and imputed traits, followed by utilizing the predicted values from these models on an independent ensemble dataset as input to train the ensemble model. The ensemble model’s performance is subsequently assessed on the test set, providing a comprehensive evaluation of its predictive capability.
In the following sections, we detail the imputation methods, prediction models, and datasets employed in our analysis.
Imputation methods
Consider a genotype matrix X and a quantitative trait Y, akin to the model in Ren et al. (2023):

Y = g(X) + ε,  (1)

where g(X) represents the genetic component of the phenotype and ε is the error term, assumed independent with constant variance. Suppose we have GWAS summary statistics β̂ for p SNPs calculated from an individual-level GWAS dataset. Now assuming Y is not observed, we would like to impute Y given X and β̂.
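The marginal summary statistics β̂ assumed above can be sketched as one simple regression per SNP; for standardized genotypes this reduces to β̂_j = x_j⊤y/n (a sketch with our own function name, not the pipeline's actual GWAS software):

```python
import numpy as np

def marginal_betas(X, y):
    """Marginal GWAS effect-size estimates, one simple regression per SNP.
    For column-standardized genotypes, beta_hat_j = x_j^T (y - mean(y)) / n."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = X.shape[0]
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize each SNP column
    return Xs.T @ (y - y.mean()) / n
```

In practice these statistics would come from published GWAS results rather than being recomputed.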
LS-imputation
LS-imputation (Ren et al. 2023) is a nonparametric method to impute missing traits with genotypes and a GWAS summary dataset of the trait. It assumes the genetic model as in equation (1). The imputed phenotype Ỹ is obtained by minimizing ||β̂ − X⊤Y/n||² over Y, whose minimum-norm solution is Ỹ = n(X⊤)⁺β̂, where (X⊤)⁺ denotes the Moore–Penrose inverse of a matrix.
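The minimum-norm least-squares idea can be sketched in a few lines of NumPy; this is our own reading of the LS-imputation estimator under column-standardized genotypes, and the exact scaling conventions of Ren et al. (2023) may differ:

```python
import numpy as np

def ls_impute(X, beta_hat):
    """Sketch of LS-imputation: impute phenotypes for genotypes X (n x p,
    standardized columns) from GWAS marginal effect estimates beta_hat (p,),
    via the minimum-norm least-squares solution of X^T y / n ~= beta_hat.
    Scaling follows our reading of Ren et al. (2023)."""
    n = X.shape[0]
    # np.linalg.pinv computes the Moore-Penrose inverse
    return n * np.linalg.pinv(X.T) @ beta_hat
```

By construction, the imputed vector reproduces the given marginal statistics when they are consistent with X, which is the property exploited downstream.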
PRS-CS
PRS-CS (Ge et al. 2019) is a polygenic risk score (PRS) method that assumes g in equation (1) to be linear, i.e. g(X) = Xβ with β ∈ R^p. PRS-CS employs a high-dimensional Bayesian regression framework that imposes a continuous shrinkage prior on the marginal effect sizes β. This approach enhances the accuracy of the estimates and facilitates the integration of external linkage disequilibrium (LD) reference panels, improving prediction reliability. The effect sizes used to calculate the additive PRS are the posterior means of β. We use the PRS-CS-auto version in the Python toolkit provided by the authors of Ge et al. (2019) and the UKB EUR reference panel for LD.
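Given posterior mean effect sizes, the additive PRS itself is simply a genotype-weighted sum over SNPs; a minimal sketch (function and variable names are ours):

```python
import numpy as np

def additive_prs(X, beta_post):
    """Additive polygenic score: dosage-weighted sum over SNPs.
    X: n x p genotype (dosage) matrix; beta_post: p posterior mean effects,
    e.g. as output by PRS-CS."""
    return np.asarray(X, dtype=float) @ np.asarray(beta_post, dtype=float)

# Two individuals, three SNPs (dosages 0/1/2)
X = np.array([[0, 1, 2], [2, 0, 1]], dtype=float)
beta = np.array([0.5, -0.2, 0.1])
print(additive_prs(X, beta))  # -> [0.0, 1.1]
```

This linearity is exactly what distinguishes PRS-CS from the nonlinear prediction models considered below.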
Prediction models
Under the same model in equation (1), our objective is to decipher the genotype-to-phenotype mapping g using machine learning or linear models, employing the fitted mapping ĝ for phenotype prediction. In this section, we outline 3 distinct algorithms for constructing trait prediction models, representing leading approaches in machine learning, deep learning, and linear modeling, respectively.
XGBoost
Gradient boosting is a family of ensemble learning methods based on building decision trees in a sequential manner. Their ability to handle large-scale datasets and capture complex nonlinear relationships, together with their robustness against overfitting, makes them well suited for genomic prediction tasks. Among them, the Extreme Gradient Boosting (XGBoost) method often enjoys the advantage of efficiency and accuracy with its advanced algorithmic structure, and is extensively used in trait prediction tasks (Li et al. 2018; Zhou and Gallins 2019; Gill et al. 2022). Specifically, XGBoost incorporates L1 and L2 regularization terms into the loss function, which effectively controls model complexity and helps prevent overfitting. We use the "xgboost" package in Python to implement the method. After simple parameter tuning using the "GridSearchCV" function in the package "sklearn.model_selection" on a validation set, all trait prediction base models are trained with a learning rate of 0.1, maximum depth of 6, subsample fraction of 0.9, various column subsample ratios (0.5 by tree, 0.7 by level, and 0.7 by node), an L1 penalty of 20, and an L2 penalty of 50.
Genome-local-net
Genome-local-net (GLN) is a neural network (NN) architecture proposed by Sigurdsson et al. (2023). It uses multiple residual blocks with locally connected layers to efficiently capture nonadditive and interaction effects (Sigurdsson et al. 2022), and is capable of integrating various types of covariates. It has demonstrated competitive performance in genomic prediction tasks compared to traditional linear models and other NN models (Sigurdsson et al. 2022, 2023).
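The key building block of GLN is the locally connected layer: each output unit sees only a contiguous window of SNPs, and unlike a convolution, each window has its own weights. A pure-NumPy sketch of this idea (not the actual EIR implementation, whose blocks also include residual connections and nonlinearities):

```python
import numpy as np

def locally_connected(x, weights):
    """One locally connected layer: the input x (length p) is split into
    contiguous windows, each with its OWN weight vector (no weight sharing,
    unlike a convolution). weights: (n_windows, window_size)."""
    n_win, w = weights.shape
    assert x.shape[0] == n_win * w, "input length must equal n_windows * window_size"
    windows = x.reshape(n_win, w)
    return np.sum(windows * weights, axis=1)  # one output per window

x = np.arange(6, dtype=float)   # p = 6 SNPs
W = np.ones((3, 2))             # 3 windows of size 2, all-ones weights
print(locally_connected(x, W))  # -> [1. 5. 9.]
```

Keeping weights window-specific lets the model learn locus-specific effects while keeping the parameter count far below a fully connected layer over a million SNPs.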
PRS-CS
We use PRS-CS as a representative linear prediction model. With either imputed or observed trait values, we first perform a GWAS analysis and obtain the estimated marginal effect sizes β̂. Then, using a reference panel for the LD structure, we apply PRS-CS to build a linear prediction model as the PRS.
Phenotype prediction with ensemble learning
In this section, we assume a small portion of the trait is observed and introduce ensemble learning to enhance prediction accuracy. Figure 1a illustrates the ensemble learning steps. Partition the data as D_o = (X_o, Y_o) and D_m = (X_m, Y_m). We assume that Y_o is observed, but Y_m is not and thus needs to be imputed by a method described in section Imputation methods. We train one prediction model, ĝ_o, on the observed data D_o, and another, ĝ_m, on the imputed data (X_m, Ỹ_m), where Ỹ_m is the imputed trait. Then on an independent dataset D_e = (X_e, Y_e), we use the predicted values of the 2 models, ĝ_o(X_e) and ĝ_m(X_e), as inputs to train an ensemble model. Note that we assume that Y_e is observed; we could obtain this ensemble dataset by partitioning the observed data. We use linear regression as the ensemble model. Finally, with the prediction and ensemble models, we can obtain the predicted values of the trait on a test set and calculate the R² between the predicted and observed traits.
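The ensemble step described above amounts to stacking with a linear meta-model fit by ordinary least squares on the independent ensemble set; a pure-NumPy sketch (function names are ours):

```python
import numpy as np

def fit_linear_ensemble(pred_obs, pred_imp, y_ens):
    """Fit the linear ensemble on the independent ensemble set: regress the
    observed trait on the two base-model predictions, plus an intercept."""
    Z = np.column_stack([np.ones_like(pred_obs), pred_obs, pred_imp])
    coef, *_ = np.linalg.lstsq(Z, y_ens, rcond=None)
    return coef  # [intercept, weight_obs_model, weight_imp_model]

def ensemble_predict(coef, pred_obs, pred_imp):
    """Combine base-model predictions with the fitted ensemble weights."""
    Z = np.column_stack([np.ones_like(pred_obs), pred_obs, pred_imp])
    return Z @ coef
```

Fitting the combination weights on a held-out set (rather than on the training data) is what keeps the ensemble from simply overweighting the more overfit base model.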
Fig. 1.
Flow charts illustrating a) the ensemble learning steps and b) the UKB sample splitting.
Data
We used the UKB data (Sudlow et al. 2015) with imputed genotypes and 7 traits, namely lipA, HDL, LDL, TRIG, DBP, SBP, and height. All phenotypes are adjusted for sex, age, and the top 10 genetic PCs. During genotype matrix preprocessing, SNPs were filtered out based on minor allele frequency (≤0.05), missing value proportion (≥10%), and Hardy–Weinberg equilibrium exact test failure. SNPs in high LD were also pruned out, resulting in a final count of 1,200,000 SNPs. The dataset is denoted as (X, Y), with X representing the SNP matrix with individuals on rows and SNPs on columns, and Y the phenotype vector.
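The allele-frequency and missingness filters above can be sketched as follows (a NumPy illustration with our own function name; the HWE exact test and LD pruning, done with standard genetics software in practice, are omitted):

```python
import numpy as np

def snp_qc_mask(G, maf_min=0.05, max_missing=0.10):
    """Return a boolean mask of SNPs passing minor-allele-frequency and
    missingness filters. G: n x p dosage matrix (0/1/2) with np.nan for
    missing calls. Thresholds match those stated in the text."""
    missing = np.isnan(G).mean(axis=0)           # per-SNP missing proportion
    freq = np.nanmean(G, axis=0) / 2.0           # allele frequency from dosages
    maf = np.minimum(freq, 1.0 - freq)           # minor allele frequency
    return (maf > maf_min) & (missing < max_missing)
```

A SNP is kept only if it is sufficiently polymorphic and well genotyped; monomorphic or poorly called SNPs carry little usable signal.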
Figure 1b shows the sample splitting process. We first divided the whole dataset into a GWAS set and an Analysis set. The size of the GWAS set was fixed at 70,000; the rest were allocated to the Analysis set. GWAS summary statistics were computed on the GWAS set, and effect sizes from marginally significant SNPs were used by LS-imputation and PRS-CS to derive the imputed traits. For phenotype prediction, we focused on the marginally significant SNPs, applying a more stringent P-value cutoff to reduce the computational burden.
To train and evaluate the prediction models and ensemble learning, we further divided the Analysis set into Training, Ensemble, and Test subsets of size 30,000, 30,000, and 10,000, respectively. The Training subset was further divided into 10 sets of equal size. We designated 1 set as the observed data and treated the remaining 9 as imputed data, in which the imputed traits, instead of the observed traits, were used. We then trained 1 model on the observed data and 9 models on the imputed data, using nested imputed training sets of increasing size.
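The splitting scheme can be sketched as a shuffled index partition; the subset sizes are from the text, while the total sample size, seed, and function name are our own illustrative choices:

```python
import numpy as np

def split_indices(n, sizes, seed=0):
    """Partition n sample indices into consecutive disjoint blocks of the
    given sizes after a random shuffle."""
    assert sum(sizes) <= n, "requested sizes exceed the sample count"
    idx = np.random.default_rng(seed).permutation(n)
    splits, start = [], 0
    for s in sizes:
        splits.append(idx[start:start + s])
        start += s
    return splits

# GWAS / Training / Ensemble / Test splits with the sizes stated in the paper
# (n = 140,000 is illustrative, not the actual UKB sample count)
gwas, train, ens, test = split_indices(140_000, [70_000, 30_000, 30_000, 10_000])
```

The Training block would then be subdivided further into the 10 equal-size sets described above.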
Results
LS-imputation has much lower false discoveries than PRS-CS in marginal association testing
We first confirm that LS-imputation recapitulates genetic associations better than PRS-CS does, which may have some implications for trait prediction considered here. Figure 2 presents the Manhattan plots for GWAS results calculated on the Analysis set based on the lipA trait imputed by the 2 methods. The top panel delineates the GWAS results derived from the LS-imputed trait alongside those obtained from the observed trait. It is evident that the distribution of significant SNPs is similar between those using the LS-imputed and observed traits, with the LS-imputed trait being slightly more conservative. In contrast, the PRS-CS imputed trait reveals a markedly higher number of significant SNPs beyond those obtained from the observed trait. This disparity suggests a substantial inflation in false discoveries when using PRS-CS for imputation, affirming the propensity of PRS-CS to introduce false positives into GWAS analyses. This observation aligns with the findings reported in Ren et al. (2023).
Fig. 2.
Manhattan plots of the GWAS results for lipA using the LS-imputed trait (top panel) and the PRS-CS-imputed trait (bottom panel) as the trait. The GWAS summary statistics used by LS-imputation and PRS-CS were calculated on the GWAS set and are shown as the mirrored plots in each panel.
Learning from the imputed traits improved prediction accuracy for lipA and LDL but not for other traits
In this section, we compare the prediction accuracy of models trained with different imputation methods to that of using the imputed traits directly as the predicted values. Figure 3 shows the test R² values for all traits. The R² of the LS-imputed traits and that of the PRS-CS-imputed traits are indicated by the dotted and dashed lines, respectively.
Fig. 3.
Performance of models trained on imputed traits evaluated on the test set for all 7 traits: a) lipA, b) HDL, c) LDL, d) TRIG, e) SBP, f) DBP, and g) height. The shaded regions show the point-wise confidence intervals (CIs) of the R² values. The x-axis shows the training sample size. The plot is organized into panels representing distinct prediction models. The test R² between the observed and PRS-CS-imputed traits and that between the observed and LS-imputed traits are indicated by the dashed and dotted lines, respectively.
We observed that for lipA, the models trained on the imputed traits attained higher R² than the imputed traits themselves, especially when the sample size of the imputing dataset became larger, with a near 2-fold increase in R² for XGB trained with the LS-imputed trait at 27,000 imputed samples. Similarly, for LDL, the models trained with the imputed traits had higher R² than the imputed traits themselves. For lipA, the models trained with the LS-imputed trait generally performed better than those trained with the PRS-CS-imputed trait; however, for LDL, XGB and GLN trained with the PRS-CS-imputed trait performed better than those trained with the LS-imputed trait, while when PRS-CS was used as the prediction model, training with the LS-imputed trait was still better. These 2 traits have been previously found to benefit from nonlinear models such as XGB and other deep learning models (Elgart et al. 2022; Georgantas et al. 2024). Note that in these studies, the models were trained on the observed traits with a much larger sample size. Here, we demonstrate that these 2 traits could also benefit from training nonlinear models with imputed traits when a large set of observed traits is not available.
On the other hand, we did not observe performance improvement for the other traits, perhaps because they are mostly genetically linear traits. In addition, the imputed traits were calculated based on a much larger GWAS dataset; whether the performance of the models trained with the imputed traits would keep increasing with the training sample size and eventually match that of the imputed traits is left for future investigation.
Model ensembling improves prediction accuracy over models trained with the observed trait only
We investigate the performance gains of model ensembling over training with the observed trait. Figure 4 shows the R² of ensemble models trained with different imputed traits. For lipA, the XGB model trained with the LS-imputed trait achieved the highest R², surpassing the performance of the best model (XGB) trained solely with the observed trait, especially when the sample size was large. Conversely, XGB trained with the PRS-CS-imputed trait did not show improvement from ensemble learning and exhibited an R² similar to XGB trained on the observed trait. For LDL, both GLN and XGB models trained with the PRS-CS-imputed trait demonstrated superior performance. For the other traits, ensemble learning generally enhanced the R² compared to the best model trained only with the observed trait; however, none of the models exceeded the R² achieved by directly using the PRS-CS-imputed trait as the PGS. Practically, treating an imputation model as a prediction model, we could incorporate it with other models (also trained on various imputed traits) into a single ensemble model, which might result in improved performance. However, this is beyond the scope of this paper and is left for future investigation.
Fig. 4.
Performance of ensemble models evaluated on the test set for all 7 traits: a) lipA, b) HDL, c) LDL, d) TRIG, e) SBP, f) DBP, and g) height. The shaded regions show the point-wise CIs of the R² values. The plot is organized into panels representing distinct imputation methods: LS-imputation (left) and PRS-CS (right). The highest test R² of the models trained with the observed trait only is indicated by the dotted line. The test R² between the observed and PRS-CS-imputed traits is indicated by the dashed line.
Discussion
This study navigates beyond conventional phenotype prediction practices by integrating machine learning techniques, GWAS summary statistics, and genetic data to unveil nonlinear genotype–phenotype relationships through LS-imputation, thereby refining the predictions. By harnessing the extensive genetic information within biobanks and external published GWAS summary statistics, we used statistical and machine learning/deep learning models to impute and predict missing phenotypic data. The employment of XGBoost and GLN was instrumental due to their proficiency in capturing complex, nonlinear associations between genotypes and phenotypes. LS-imputation was included due to its previously reported potential to recapitulate nonlinear genotype–phenotype associations; nevertheless, for comparison, we also included a representative linear PRS method, PRS-CS, for trait imputation.
We first compared the performance of models trained on the imputed traits to directly using the imputed traits. Our results showed that for lipA and LDL, nonlinear models trained with the imputed trait could attain higher R² than the linear PRS-CS model. For the other traits, perhaps because they are mostly genetically linear traits, training with the imputed traits (with either linear or nonlinear models) did not improve the performance over PRS-CS. However, the performance was highly affected by the sample size of the imputed data, as shown by the increasing trend of R². In our analysis, the maximum sample size of the imputed data was 27,000, much smaller than that of the GWAS data. This was due to partitioning the UKB data into the GWAS and the Analysis subsets. In practice, if we use separate GWAS data that are independent of the biobank data, as in the case for AD as discussed earlier, the performance of the models trained on imputed traits might be further improved. This will be a topic for future investigation.
We then compared the performance of ensemble learning using both observed and imputed traits to that of using only observed traits, showing the efficacy of using imputed traits. To test the efficacy of the imputation, we need to compare a model trained on both imputed and observed data to a model trained only on observed data. For this purpose, we employed ensemble learning, instead of training a model on the combined set of the imputed and the observed traits. The reason is that the imputed and the observed traits might have different distributions, as the imputed traits are only the genetic components of the observed traits. The results showed that one could almost always benefit from ensemble learning over using only observed traits if the imputed trait had a sufficiently large sample size.
There are several limitations of our study. First, apart from lipA, models like XGB and GLN trained with the LS-imputed traits generally underperformed compared to those trained with the PRS-CS-imputed traits, despite some traits being previously identified as nonlinearly associated with genotypes (Elgart et al. 2022; Georgantas et al. 2024), though currently linear model-based PRSs are still most popular and effective for these traits. This discrepancy merits further investigation. In particular, we only employed 2 nonlinear modeling methods, XGB and GLN; other nonlinear methods, especially new ones recently developed (Georgantas et al. 2024) or yet to be developed, tailored to high-dimensional SNPs with small effect sizes and relatively small sample sizes, might perform better. Second, we did not incorporate the trait imputation models themselves into the ensemble learning, which might have led to suboptimal performance as compared to treating trait imputation directly as a PGS. We leave this as a topic for future work, as our primary goal was not to find the best-performing model but to show possible gains from using imputed traits. Third, the GWAS, the observed, and the imputed data were from the same study. In practice, these data may come from different sources, which could affect the final outcomes.
Conclusion
Our study illustrates the potential of using large-scale genetic data and trait imputation for phenotype prediction, a methodology that could enhance the utility of biobanks with genetic data but possibly some missing phenotypes. This approach showed some potential in uncovering the genetic underpinnings of complex traits with nonlinear genetic associations. Future research might focus on refining the method, exploring its applicability across diverse datasets, and overcoming the outlined limitations to maximize its impact on genetic research. More generally, trait imputation can be applied to other downstream genetic analyses, such as cross-population genetic prediction (with linear or nonlinear PGSs) (Zhao et al. 2022; Gyawali et al. 2023; Zhou et al. 2023), other applications of PGS (Yan et al. 2022; Li et al. 2024), and nonlinear association testing as in transcriptome-wide association studies (He et al. 2023), though more investigations are warranted.
Acknowledgements
We thank the reviewers for many constructive and helpful comments. This research was supported by the Minnesota Supercomputing Institute at the University of Minnesota. The access to the UKB data was approved through UKB Application #35107.
Contributor Information
Ruoyu He, Division of Biostatistics and Health Data Science, School of Public Health, University of Minnesota, Minneapolis, MN 55414, USA; School of Statistics, University of Minnesota, Minneapolis, MN 55414, USA.
Jinwen Fu, Division of Biostatistics and Health Data Science, School of Public Health, University of Minnesota, Minneapolis, MN 55414, USA; School of Statistics, University of Minnesota, Minneapolis, MN 55414, USA.
Jingchen Ren, Division of Biostatistics and Health Data Science, School of Public Health, University of Minnesota, Minneapolis, MN 55414, USA; School of Statistics, University of Minnesota, Minneapolis, MN 55414, USA.
Wei Pan, Division of Biostatistics and Health Data Science, School of Public Health, University of Minnesota, Minneapolis, MN 55414, USA.
Data availability
The UKB data are available to the approved user at https://www.ukbiobank.ac.uk/. The R and Python code can be found at https://github.com/RuoyuHe/LS-imputation_For_PGS. For GLN, we used the toolkit EIR provided by the original authors of Sigurdsson et al. (2023) (https://github.com/arnor-sigurdsson/EIR).
Funding
This research was supported by NIH grants U01AG073079 and R01AG065636.
Literature cited
- An U, Pazokitoroudi A, Alvarez M, Huang L, Bacanu S, Schork AJ, Kendler K, Pajukanta P, Flint J, Zaitlen N, et al. 2023. Deep learning-based phenotype imputation on population-scale biobank data increases genetic discoveries. Nat Genet. 55(12):2269–2276. doi: 10.1038/s41588-023-01558-w
- Dahl A, Iotchkova V, Baud A, Johansson Å, Gyllensten U, Soranzo N, Mott R, Kranis A, Marchini J. 2016. A multiple-phenotype imputation method for genetic studies. Nat Genet. 48(4):466–472. doi: 10.1038/ng.3513
- Elgart M, Lyons G, Romero-Brufau S, Kurniansyah N, Brody JA, Guo X, Lin HJ, Raffield L, Gao Y, Chen H, et al. 2022. Non-linear machine learning models incorporating SNPs and PRS improve polygenic prediction in diverse human populations. Commun Biol. 5:856. doi: 10.1038/s42003-022-03812-z
- Ge T, Chen CY, Ni Y, Feng YCA, Smoller JW. 2019. Polygenic prediction via Bayesian regression and continuous shrinkage priors. Nat Commun. 10:1776. doi: 10.1038/s41467-019-09718-5
- Georgantas C, Kutalik Z, Richiardi J. 2024. Deep learning for polygenic risk prediction [preprint]. medRxiv. doi: 10.1101/2024.04.19.24306079
- Gill M, Anderson R, Hu H, Bennamoun M, Petereit J, Valliyodan B, Nguyen HT, Batley J, Bayer PE, Edwards D. 2022. Machine learning models outperform deep learning models, provide interpretation and facilitate feature selection for soybean trait prediction. BMC Plant Biol. 22(1):1–8. doi: 10.1186/s12870-022-03559-z
- Gyawali PK, Le Guen Y, Liu X, Belloy ME, Tang H, Zou J, He Z. 2023. Improving genetic risk prediction across diverse population by disentangling ancestry representations. Commun Biol. 6(1):964. doi: 10.1038/s42003-023-05352-6
- He R, Liu M, Lin Z, Zhuang Z, Shen X, Pan W. 2023. DeLIVR: a deep learning approach to IV regression for testing nonlinear causal effects in transcriptome-wide association studies. Biostatistics. 25(2):468–485. doi: 10.1093/biostatistics/kxac051
- Hormozdiari F, Kang EY, Bilow M, Ben-David E, Vulpe C, McLachlan S, Lusis AJ, Han B, Eskin E. 2016. Imputing phenotypes for genome-wide association studies. Am J Hum Genet. 99(1):89–103. doi: 10.1016/j.ajhg.2016.04.013
- Li B, Zhang N, Wang YG, George AW, Reverter A, Li Y. 2018. Genomic prediction of breeding values using a subset of SNPs identified by three machine learning methods. Front Genet. 9:237. doi: 10.3389/fgene.2018.00237
- Li X, Fernandes BS, Liu A, Lu Y, Chen J, Zhao Z, Dai Y. 2024. GRPa-PRS: a risk stratification method to identify genetically-regulated pathways in polygenic diseases [preprint]. medRxiv. doi: 10.1101/2023.06.19.23291621
- Ren J, Lin Z, He R, Shen X, Pan W. 2023. Using GWAS summary data to impute traits for genotyped individuals. Hum Genet Genom Adv. 4(3):100197. doi: 10.1016/j.xhgg.2023.100197
- Sehrawat S, Najafian K, Jin L. 2023. Predicting phenotypes from novel genomic markers using deep learning. Bioinform Adv. 3(1):vbad028. doi: 10.1093/bioadv/vbad028
- Sigurdsson AI, Louloudis I, Banasik K, Westergaard D, Winther O, Lund O, Ostrowski SR, Erikstrup C, Pedersen OBV, Nyegaard M, et al. 2023. Deep integrative models for large-scale human genomics. Nucleic Acids Res. 51(12):e67. doi: 10.1093/nar/gkad373
- Sigurdsson AI, Ravn K, Winther O, Lund O, Brunak S, Vilhjálmsson BJ, Rasmussen S. 2022. Improved prediction of blood biomarkers using deep learning [preprint]. medRxiv. doi: 10.1101/2022.10.27.22281549
- Sudlow C, Gallacher J, Allen N, Beral V, Burton P, Danesh J, Downey P, Elliott P, Green J, Landray M, et al. 2015. UK Biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLOS Med. 12(3):e1001779. doi: 10.1371/journal.pmed.1001779
- van Hilten A, Kushner SA, Kayser M, Ikram MA, Adams HH, Klaver CC, Niessen WJ, Roshchupkin GV. 2021. GenNet framework: interpretable deep learning for predicting phenotypes from genetic data. Commun Biol. 4(1):1094. doi: 10.1038/s42003-021-02622-z
- Yan S, Sha Q, Zhang S. 2022. Gene-based association tests using new polygenic risk scores and incorporating gene expression data. Genes. 13:1120. doi: 10.3390/genes13071120
- Zhao Z, Fritsche LG, Smith JA, Mukherjee B, Lee S. 2022. The construction of cross-population polygenic risk scores using transfer learning. Am J Hum Genet. 109:1998–2008. doi: 10.1016/j.ajhg.2022.09.010
- Zhou G, Chen T, Zhao H. 2023. SDPRX: a statistical method for cross-population prediction of complex traits. Am J Hum Genet. 110(1):13–22. doi: 10.1016/j.ajhg.2022.11.007
- Zhou YH, Gallins P. 2019. A review and tutorial of machine learning methods for microbiome host trait prediction. Front Genet. 10:579. doi: 10.3389/fgene.2019.00579