Summary
We evaluated whether predicted continuous disease representations could enhance genetic discovery beyond case-control genome-wide association study (GWAS) phenotypes across eight complex diseases in up to 485,448 UK Biobank participants. Predicted phenotypes had high genetic correlations with case-control phenotypes (median rg = 0.66) but identified more independent associations (median 306 versus 125). While some predicted phenotype associations were spurious, multi-trait analysis of GWAS-boosted case-control phenotypes identified a median of 46 additional variants per disease, of which a median of 73% replicated in FinnGen, 37% reached genome-wide significance in a UK Biobank/FinnGen meta-analysis, and 45% had supporting evidence. Predicted phenotypes also identified 14 genes targeted by phase I–IV drugs not identified by case-control phenotypes, and combined polygenic risk scores (PRSs) using both phenotypes improved prediction performance, with a median 37% increase in Nagelkerke’s R2. Predicted phenotypes represent composite biomarkers complementing case-control approaches in genetic discovery, drug target prioritization, and risk prediction, though efficacy varies across diseases.
Keywords: machine learning, genome-wide association study, electronic health records
Highlights
• Predicted phenotypes increase genetic discovery over binary disease definitions
• Multi-trait analysis with predicted phenotypes reduces spurious associations
• Predicted phenotypes provide support for additional drug targets
• Combined polygenic scores enhance risk prediction across diverse populations
Motivation
Traditional genome-wide association studies rely on binary case-control phenotypes, which inadequately represent the continuous and heterogeneous nature of complex disease manifestation. To address this, we used machine learning with electronic health record-derived clinical data to generate continuous predicted representations of eight complex diseases in the UK Biobank. Our method increases statistical power for genetic discovery, enables identification of additional therapeutically relevant targets, and improves polygenic risk score performance across diverse ancestry populations.
Chen et al. demonstrate that machine learning-derived continuous disease phenotypes from electronic health records improve genetic discovery, drug target identification, and polygenic risk prediction for complex diseases. These phenotypes complement traditional case-control definitions and, when integrated using multi-trait analysis, uncover additional replicable genetic associations.
Introduction
Genome-wide association studies (GWASs) have revolutionized our understanding of disease pathophysiology and enabled the identification of novel drug targets. These large-scale genetic analyses are facilitated by biobanks containing extensive electronic health records (EHRs) linked to genetic data. However, EHRs encode diseases as binary diagnoses using administrative codes that fail to capture the continuous spectrum of disease presentations and severities observed in patients. Addressing this, recent studies constructed machine learning models leveraging comprehensive phenotypic data in EHR-linked biobanks to represent traditionally binary diagnoses as continuous scores, including for coronary artery disease (CAD),1 autoimmune disease,2 and venous thromboembolism (VTE).3 These scores are composite biomarkers representing both the probability and severity of disease and predict disease complications and mortality.
Simultaneously, multiple studies have demonstrated the ability of machine learning to assist genetic discovery. For example, deep learning models can parse multi-dimensional data into disease-relevant quantitative traits, including spirograms for chronic obstructive pulmonary disease,4 cardiac magnetic resonance imaging (MRI) for valvular regurgitation and right heart function,5,6 and liver MRI for steatotic liver disease.7 This approach often has limited sample sizes due to the difficulty of acquiring such data. Others used machine learning with widely available EHR and survey data to predict neuropsychiatric traits like depression,8 binge eating disorder,9 and creativity,10 which are otherwise challenging to phenotype due to the heterogeneity of and overlap between traits. Both approaches have identified additional variants compared to case-control studies, and the validity of these variants is supported by their consistent effects with case-control and related disease phenotypes.
We recently showed that a digital marker for CAD increased discovery of rare and ultra-rare coding variants11 but did not extend this to genome-wide common variants. For common variant association studies, predicted phenotypes could increase power by addressing under- and misdiagnosis and by ranking individuals by their disease probabilities.12 Predicted phenotypes could also capture orthogonal disease etiologies compared to binary diagnoses, identifying additional disease mechanisms and therapeutic targets and contributing to improved polygenic risk scores (PRSs).
A caveat of using predicted phenotypes for genetic discovery is that they can yield spurious associations even with high prediction accuracy. For example, a GWAS on type 2 diabetes (T2D) predicted using hemoglobin A1c yields variants affecting erythrocyte traits but not glycemic traits.13 While recent methods like SynSurr and post-prediction GWAS reduce these associations,13,14 they are designed for the imputation of missing values, require separate labeled and unlabeled samples, and do not allow the combination of binary (case-control) and continuous (predicted) phenotypes. Other studies tested associations with predicted continuous phenotypes without correcting for possible spurious associations.9,11,15,16,17 Hence, further evaluation of the feasibility and interpretation of genetic association analyses using predicted continuous representations of binary diagnoses is necessary.
Here, we investigate whether predicted phenotypes can complement case-control phenotypes for genetic analyses in diverse ancestry populations. To assess generalizability, we include eight complex diseases representing varied biological processes (atrial fibrillation [AFib], CAD, celiac disease, gallstones, nasal polyps, T2D, varicose veins, and VTE) and construct models using different feature sets and architectures. After assessing the genetic overlap between predicted and case-control phenotypes and the presence of spurious associations with predicted phenotypes, we propose a new application of multi-trait analysis of GWAS (MTAG) that reduces spurious associations.18 Specifically, we analyze predicted and case-control phenotypes in the same dataset where all participants have labels, and because predicted phenotypes distill hundreds of disease-correlated traits into a single disease-representative trait, we use them to increase the power of case-control phenotypes with MTAG. Unlike prior applications of MTAG toward case-control phenotypes,19,20 our approach does not require manual trait selection and avoids including numerous traits. Importantly, we show that MTAG with predicted phenotypes has low false discovery rate (FDR) estimates and identifies additional variants with high replication rates and supporting evidence. Finally, we demonstrate the utility of predicted phenotypes by enhancing drug target prioritization and increasing the prediction performance of PRSs.
Results
Machine learning models accurately predict disease status in the UK Biobank
We constructed machine learning models using comprehensive phenotypic data to predict disease status in up to 481,912 UK Biobank participants (Tables S1A and S1B). Models were well calibrated, as indicated by low Brier scores and model probabilities that closely reflected the proportion of participants with disease (Figure 1A; Table S2A). Model probabilities were right skewed, with only CAD and T2D predictions having bimodal distributions (Figure 1B). Models had high discriminant ability, achieving areas under the receiver operating characteristic curve (AUROCs) greater than 0.80 for all diseases except for varicose veins (0.73) and VTE (0.79) (Figure 2A; Table S2A). Likewise, areas under the precision-recall curve (AUPRCs) were several times greater than the observed proportion of cases for all diseases (Figure 2A), indicating a strong ability to minimize false positives and negatives. A diverse set of the 1,037–1,048 phenotypes used to generate predictions had high feature importances across multiple disease models, including age, creatinine, arm impedance, triglycerides, and testosterone (Table S2B).
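The calibration check described above can be illustrated with a minimal sketch. It assumes model probabilities and binary disease labels are available as plain lists; the function names and the toy cohort are hypothetical and are not taken from the study. Probabilities are rounded to two decimal places and bins with fewer than 25 participants are dropped, mirroring the Figure 1A description.

```python
from collections import defaultdict


def brier_score(probs, labels):
    """Mean squared difference between predicted probabilities and 0/1 labels."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def calibration_table(probs, labels, min_count=25):
    """Map each probability (rounded to two decimals) to the observed case
    proportion, keeping only bins with at least `min_count` participants."""
    bins = defaultdict(list)
    for p, y in zip(probs, labels):
        bins[round(p, 2)].append(y)
    return {
        b: sum(ys) / len(ys)
        for b, ys in sorted(bins.items())
        if len(ys) >= min_count
    }


# Hypothetical toy cohort, not UK Biobank data.
probs = [0.101] * 40 + [0.799] * 30 + [0.5] * 5
labels = [1] * 4 + [0] * 36 + [1] * 24 + [0] * 6 + [1, 0, 1, 0, 1]

print(brier_score(probs, labels))
print(calibration_table(probs, labels))
```

In a well-calibrated model, each rounded probability sits close to its observed case proportion, and a lower Brier score indicates better agreement between probabilities and outcomes.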
Figure 1.
Calibration and distribution of model predictions
(A) Scatterplots showing the correlation between model probabilities and proportion of cases, with probabilities rounded to two decimal places. Only points corresponding to at least 25 participants are shown. The gray dotted line indicates perfect calibration.
(B) Kernel density estimate plots showing the distribution of model probabilities separately for cases and controls. Densities are not normalized between cases and controls.
Figure 2.
Construction and application of predicted phenotypes in the UK Biobank
(A) Areas under the receiver operating characteristic curve (AUROC) and areas under the precision-recall curve (AUPRC) for machine learning models for each disease as well as genetic correlations between case-control and predicted phenotypes estimated using linkage disequilibrium (LD) score regression among European ancestry participants. Data represent means with 95% confidence intervals for AUROCs and AUPRCs and means with standard errors for genetic correlations. Horizontal lines under AUPRCs represent the observed proportion of cases.
(B) Number of LD-independent variants identified by case-control and predicted phenotypes for each disease.
(C) Distribution of variants across Molecular Signatures Database hallmark gene sets (x axis) for case-control and predicted phenotypes for each disease (y axis). The color in each cell represents the proportion of distinct variants corresponding to each gene set-phenotype combination. Vertical cell pairs with gray boxes indicate a Bonferroni-significant difference in proportions (p < 0.001) as determined by a two-proportion Z test.
(D) Distribution of case-control phenotype −log10(p values) for variants identified by predicted phenotypes. Case-control phenotype associations from the UK Biobank, FinnGen, meta-analyses of the UK Biobank and FinnGen, and external meta-analyses are shown. Note that larger −log10(p values) represent greater significance. Dashed lines represent significance thresholds of p = 0.05 (y = 1.3) and p = 5 × 10−8 (y = 7.3).
(E) Association between UK Biobank case-control phenotype −log10(p-values) and machine learning metrics.
See also Figures S1–S3 and Tables S2A–S2G, S3A–S3C, and S5A–S5D.
Predicted phenotypes have variable genetic overlap with case-control phenotypes
We tested common variant associations with case-control and predicted phenotypes in up to 485,448 UK Biobank participants (Table S1B). Predicted phenotypes had high genetic correlations with case-control phenotypes, with maximums of 0.84 for CAD and 0.91 for T2D (Figure 2A; Table S2C), but identified a median of 160% more linkage disequilibrium (LD)-independent variants (median of 306 [predicted] and 125 [case-control] per disease) from 180% more genes (median of 252 [predicted] and 91 [case-control]) (Figure 2B). Although predicted phenotypes had higher genomic inflation factors, this was due to increased polygenicity; among European ancestry participants, predicted phenotypes had similar LD score regression intercepts and significantly higher heritability estimates compared to case-control phenotypes (Table S2D).
We first assessed whether predicted phenotypes captured known disease biology represented in case-control phenotypes. Across diseases, a median of 63% of case-control phenotype variants were identified by predicted phenotypes at nominal significance (Table S2E), and a median of 95% were identified with the same effect direction. For these variants, there was also a significant correlation in effect sizes between the two phenotypes, with a median r2 of 0.82 (Figure S1A). However, eight gene sets contained a significantly greater proportion of case-control compared to predicted phenotype variants (Figure 2C). For example, 20% of VTE case-control phenotype variants corresponded to the coagulation gene set compared to only 1% of predicted phenotype variants. Thus, some disease mechanisms may be underrepresented in predicted phenotypes.
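The gene set comparison relies on a two-proportion Z test (Figure 2C). The sketch below shows one common form of that test, assuming a pooled standard error and a two-sided p value; the source does not state which variant was used, and the counts in the example are hypothetical values chosen only to echo the 20% versus 1% contrast reported for the coagulation gene set.

```python
import math


def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided Z test for a difference between proportions x1/n1 and x2/n2,
    using the pooled proportion for the standard error (an assumption)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    z = (p1 - p2) / se
    # Two-sided p value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Hypothetical counts: 20 of 100 case-control variants versus 3 of 300
# predicted phenotype variants mapped to a gene set.
z, p = two_proportion_z_test(20, 100, 3, 300)
print(z, p)
print("Bonferroni-significant" if p < 0.001 else "not significant")
```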
We next assessed the replication of predicted phenotype variants by case-control phenotypes. In a meta-analysis of up to 933,649 UK Biobank and FinnGen participants, AFib, CAD, and T2D had the highest replication rates: 16%, 35%, and 44% of predicted phenotype variants replicated at genome-wide significance and 79%, 91%, and 90% replicated at nominal significance (Figure 2D; Table S2F), respectively, and 19%, 72%, and 86% of variants were within 500 kb of at least one GWAS Catalog variant (Table S2G). Comparatively, VTE and celiac disease had the lowest replication rates, with 51% and 75% of predicted phenotype variants failing to replicate at nominal significance, respectively (Table S2F). In the largest CAD and T2D meta-analyses to date, with 4 and 10 times more cases than the UK Biobank,21,22 respectively, 35% and 56% of variants replicated at genome-wide significance; however, 12% and 11% of variants still failed to replicate at nominal significance and likely represent spurious associations. Nevertheless, a median of 95% of predicted phenotype variants had consistent effect directions with case-control phenotypes across diseases (Table S2F), and effect sizes for these variants were significantly correlated with a median r2 of 0.54 (Figure S1B). Predicted phenotype variants also had more significant case-control phenotype associations than randomly sampled variants (Figure S2).
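To make the two replication thresholds concrete, the following sketch classifies variants as replicating at genome-wide significance (p < 5 × 10−8) or nominal significance (p < 0.05). It assumes that replication also requires a consistent effect direction, as in the Figure 4D definition; whether the same requirement applies to every rate quoted above is an assumption. The function names and variant tuples are hypothetical.

```python
import math

GENOME_WIDE_P = 5e-8  # -log10 scale: about 7.3
NOMINAL_P = 0.05      # -log10 scale: about 1.3


def neg_log10(p):
    """Convert a p value to the -log10 scale used in the figures."""
    return -math.log10(p)


def replication_status(beta_discovery, beta_replication, p_replication):
    """Classify one variant by replication p value and effect direction."""
    same_direction = (beta_discovery > 0) == (beta_replication > 0)
    if not same_direction:
        return "not replicated"
    if p_replication < GENOME_WIDE_P:
        return "genome-wide"
    if p_replication < NOMINAL_P:
        return "nominal"
    return "not replicated"


def replication_rates(variants):
    """Return (genome-wide rate, nominal-or-better rate) for a list of
    (beta_discovery, beta_replication, p_replication) tuples."""
    statuses = [replication_status(*v) for v in variants]
    n = len(statuses)
    gw = statuses.count("genome-wide")
    nominal_or_better = gw + statuses.count("nominal")
    return gw / n, nominal_or_better / n


# Hypothetical variants, not results from the study.
variants = [
    (0.05, 0.04, 1e-10),   # genome-wide, same direction
    (0.03, 0.02, 0.01),    # nominal, same direction
    (-0.02, 0.01, 1e-9),   # significant but opposite direction
    (0.01, 0.01, 0.3),     # not significant
]
print(replication_rates(variants))
```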
Across diseases, median case-control phenotype −log10(p values) for predicted phenotype variants were significantly associated with model AUPRCs (β = 0.06, p = 1.4 × 10−4) and AUROCs (β = 0.17, p = 0.019) (Figure 2E). Thus, although predicted phenotypes identified more variants across diseases, higher disease prediction metrics corresponded to stronger case-control phenotype associations for these variants.
Spurious associations with predicted phenotypes are explainable
We explored three possible explanations for spurious associations with predicted phenotypes by analyzing variants failing to reach nominal significance in any case-control study, ranging from 65% for celiac disease to 2% for CAD (Table S3A). First, some variants were significantly associated with non-causal disease biomarkers (Figure S3A). For example, 35 of 97 non-replicating celiac disease variants were associated with cholesterol in the opposite direction, possibly due to untreated celiac disease causing hypocholesterolemia.23,24 Second, some variants were associated with known causal factors but not with the disease itself, possibly due to pleiotropy or environmental modifiers.25 For example, among 93 non-replicating gallstone variants, 39 were associated with body fat percentage in the same direction despite body fat being a main gallstone risk factor.26 Third, some variants were associated with disease biomarkers reflecting both non-causal and causal pathways (Figure S3B). Replicating prior findings,13 8 of 13 non-replicating T2D variants were associated with hemoglobin A1c in the same direction, of which 6 were also associated with erythrocytic traits unrelated to diabetes. We note that this discussion of possible causes of spurious associations is speculative and warrants a thorough investigation of root causes.
Because obesity-related traits were important features across all diseases (Table S2B), we assessed the contribution of body mass index (BMI) to predicted phenotype genetic associations. Predicted phenotypes had significantly higher genetic correlations with BMI than did case-control phenotypes (pwilcoxon = 0.008) (Table S3B), although they still had significantly larger correlations with case-control phenotypes than with BMI (pwilcoxon = 0.04), and only the two diseases with the highest predicted phenotype-BMI correlations (gallstones and VTE) had BMI-associated non-replicating variants (Table S3A). Including BMI as a covariate when testing genetic associations significantly reduced the number of LD-independent variants from predicted (pwilcoxon = 0.04) but not case-control phenotypes (pwilcoxon = 0.40) (Table S3C). It also increased the proportion of predicted phenotype variants replicated by case-control phenotypes at nominal significance, but only when case-control phenotypes were also tested with BMI as a covariate (with BMI: pwilcoxon = 0.02; without BMI: pwilcoxon = 0.55) (Table S3C). Overall, these results demonstrate that predicted phenotypes represent both the disease and individual prediction features, which can explain some spurious associations.
Feature selection and model architecture impact associations with predicted phenotypes
We hypothesized that using Mendelian randomization (MR) to exclude non-causal features could reduce spurious associations attributable to non-causal components of predicted phenotypes. We tested 190 phenotypes by MR for each disease and identified positive controls and known causal factors (Tables S4A–S4D). A higher proportion of variants from MR-selected predicted phenotypes showed genome-wide significant associations with selected compared to excluded features for each phenotype (Table S5A). Similarly, a higher proportion of variants were significantly associated with possible causal factors compared to all-feature predicted phenotypes, while a lower proportion were significantly associated with possible non-causal factors (Figures S4A and S4B; Table S5B). However, MR-based feature selection did not improve predicted phenotype performance. Compared to all-feature predicted phenotypes, variants identified by MR-selected predicted phenotypes had lower case-control phenotype replicability (Figure S4C; Table S2F), and MR-selected predicted phenotypes generally identified fewer case-control phenotype variants (Table S2E). Both metrics were significantly associated with model AUROCs and AUPRCs (Figures 3A and 3B; Table S5C), and prediction performance decreased with more stringent MR feature selection (Table S2A). As a negative control, variants from predicted phenotypes using only possible non-causal factors had the least significant case-control phenotype associations (Table S2F).
Figure 3.
Relationship between disease prediction metrics and the case-control replicability of predicted phenotype variants
Each point represents a different model. Models in (A) and (B) include those constructed using all features, MR-selected possibly causal features, or MR-selected non-causal features. Models in (C) and (D) include those constructed using all features, only the most important features from the all-feature model (i.e., top 1, 5, 10, 50, 100, or 500), or only covariates (age, sex, and fasting time). Models in (E) and (F) include those constructed using all features, without the most important features from the all-feature model (i.e., top 1, 5, 10, 50, 100, or 500), or only covariates (age, sex, and fasting time). Models in (G) and (H) include gradient boosting and logistic regression models using all features. Points representing fewer than 10 variants are not shown. See also Figures S4 and S5 and Tables S5E–S5G.
We further explored how prediction performance influences genetic associations by (1) constructing predicted phenotypes using only the most important features from the all-feature model, (2) constructing predicted phenotypes without these important features, and (3) comparing gradient boosting to logistic regression (Table S2A). First, although independent of causality, feature importances were significantly correlated with MR −log10(p-values), and possible causal factors had significantly higher importances than possible non-causal factors (Table S5D). Including additional important features resulted in increases in model performance for all diseases (Table S5E), and both AUROC and AUPRC were significantly associated with the case-control phenotype replicability of predicted phenotype variants and vice versa (Figures 3C and 3D; Table S5C). However, performance gains plateaued beyond the 50–100 most important features (Figure S5). As a negative control, models trained only on covariates identified no genome-wide significant associations. Second, removing important features conversely resulted in decreases in model performance (Table S5F), and both AUROC and AUPRC were likewise significantly associated with the case-control phenotype replicability of predicted phenotype variants and vice versa (Figures 3E and 3F; Table S5C). Nasal polyps were an exception; likely because many spurious associations were attributable to eosinophil and lymphocyte traits (Table S3A), which were also among the most important features (Table S2B), predicted phenotypes missing these features had increased case-control replicability but identified substantially fewer variants (Table S5F). 
Third, gradient boosting models, which capture nonlinear feature-outcome relationships, significantly outperformed logistic regression models in AUROC and AUPRC across all diseases, and their predicted phenotype variants likewise had higher case-control replicability than those from logistic regression models (Figures 3G and 3H; Table S5G).
Together, these results demonstrate a significant positive association between disease prediction metrics and the case-control replicability of predicted phenotype variants, emphasizing the importance of optimizing model performance prior to genetic association testing.
Multi-trait analysis increases case-control phenotype power at low FDRs
To increase genetic discovery while reducing spurious associations, we propose using predicted phenotypes to boost the power of case-control phenotypes using MTAG, which is ideal for highly genetically correlated traits and works at the variant level. Using formulae provided with MTAG,18 we estimated that maximum FDRs had a median of 1.1% and that the median equivalent sample size increased by 63% (Figure 4A; Table S6A). MTAG phenotypes identified a median of 46 additional variants at genome-wide significance compared to standard case-control phenotypes, of which a median of 19 were also novel (i.e., >500 kb from all GWAS Catalog variants) (Figure 4B; Table S6B). There were strong correlations between standard case-control and MTAG phenotype effect sizes (median r2 = 0.96) (Figures S6A and S6B), including for additional variants (Figure 4C).
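The novelty criterion (>500 kb from all GWAS Catalog variants) can be sketched as a simple distance filter. This is an illustration only: it assumes lead variants and catalog variants share the same genome build and are given as chromosome and base-pair position, and the catalog structure, function name, and positions below are hypothetical.

```python
WINDOW_BP = 500_000


def is_novel(chrom, pos, catalog, window=WINDOW_BP):
    """True if no catalog variant on the same chromosome lies within
    `window` base pairs of the given position."""
    return all(abs(pos - known) > window for known in catalog.get(chrom, ()))


# Hypothetical catalog: chromosome -> list of reported variant positions.
catalog = {"1": [1_000_000, 5_000_000], "2": [250_000]}

# Hypothetical LD-independent lead variants as (chromosome, position).
lead_variants = [("1", 1_400_000), ("1", 3_000_000), ("2", 750_000), ("3", 10_000)]

novel = [v for v in lead_variants if is_novel(*v, catalog)]
print(novel)
```

A variant exactly 500 kb from a catalog entry is treated as not novel here because the criterion is a strict inequality.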
Figure 4.
Genetic discovery using MTAG-boosted case-control phenotypes in the UK Biobank
(A) Percent increase in effective sample size from standard case-control phenotypes to MTAG-boosted case-control phenotypes.
(B) Number of additional and novel LD-independent variants identified by MTAG phenotypes. “Additional” refers to variants not identified by standard case-control phenotypes, while “novel” refers to the subset of additional variants that are also >500 kb from all reported variants in the GWAS Catalog.
(C) Scatterplots comparing variant betas (representing effect size and direction) from case-control and MTAG phenotypes for additional variants from MTAG phenotypes. Colors of points indicate the −log10(p value) for case-control phenotypes. The top left corner of each plot shows Pearson’s r2 and p value. The median r2 across diseases is 0.95. The dark gray and light gray lines represent linear regression lines with 95% confidence intervals, respectively.
(D) Proportion of additional variants from predicted phenotypes, additional variants from MTAG phenotypes, and all variants from case-control phenotypes that replicated in FinnGen (i.e., p < 0.05 and same effect direction). Error bars represent 95% confidence intervals calculated using the Wilson method.
(E) Distribution of case-control phenotype −log10(p values) for additional variants from MTAG phenotypes. Case-control phenotype associations from the UK Biobank, FinnGen, meta-analyses of the UK Biobank and FinnGen, and external meta-analyses are shown. Note that larger −log10(p values) represent greater significance. Dashed lines represent significance thresholds of p = 0.05 (y = 1.3) and p = 5 × 10−8 (y = 7.3).
See also Figures S6–S8 and Tables S6A–S6F.
MTAG results are ideally validated through external replication, and a median of 73% of additional variants identified by MTAG replicated in FinnGen at nominal significance (Figure 4D; Table S6B). These replication rates are comparable to prior MTAG and imputation studies16,19,27 and are significantly higher compared to the median of 39% for predicted phenotypes without MTAG (pwilcoxon = 0.008) (Figure 4D), suggesting that MTAG indeed reduces spurious associations with predicted phenotypes. Additionally, 55%, 47%, and 68% of additional AFib, CAD, and T2D variants replicated at genome-wide significance in either a UK Biobank/FinnGen or external meta-analysis, respectively (Figure 4E), supporting the ability of MTAG to increase power. The additional variants identified by MTAG also corresponded to genes with supporting clinical, experimental, and genetic evidence from Open Targets (Table S6C), with 55%, 72%, and 83% of additional AFib, CAD, and T2D variants having support, respectively. Finally, a median of 94% of MTAG additional variants had genome-wide significant associations with at least one possible causal factor (Table S6D), significantly greater than both case-control variants (pwilcoxon = 0.01) and predicted variants (pwilcoxon = 0.02).
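The replication proportions in Figure 4D are reported with Wilson confidence intervals. A minimal sketch of the Wilson score interval follows, assuming a 95% level (z ≈ 1.96); the counts in the example are hypothetical and were chosen only to give a proportion near the reported median of 73%.

```python
import math


def wilson_interval(successes, n, z=1.959964):
    """Wilson score interval for a binomial proportion (95% by default)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2.0 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n + z * z / (4.0 * n * n)) / denom
    return center - half_width, center + half_width


# Hypothetical example: 34 of 46 additional variants replicate.
low, high = wilson_interval(34, 46)
print(34 / 46, low, high)
```

Unlike the simple normal approximation, the Wilson interval stays within [0, 1] and behaves sensibly when the number of variants is small, which is relevant for diseases with few additional variants.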
To assess robustness, we further performed MTAG using predicted phenotypes from logistic regression models. Despite these models having significantly worse prediction metrics compared to gradient boosting models (Table S2A), MTAG-estimated maximum FDRs still had a median of 1.0% (Table S6A). There was no significant difference in the median case-control −log10(p-value) of MTAG-identified additional variants using logistic regression compared to gradient boosting phenotypes (pwilcoxon = 0.46) (Table S6E), and there were still strong correlations between standard case-control and MTAG phenotype effect sizes (median r2 = 0.97) (Figures S7A and S7B). Nevertheless, a significantly greater proportion of additional variants had no significant association with case-control phenotypes (pwilcoxon = 0.008) (Table S6E), suggesting that prediction performance is still important when using MTAG.
In rare cases, MTAG can still yield possibly spurious associations: for 3.7% and 5.7% of MTAG variants using gradient boosting and logistic regression phenotypes, respectively, the standard case-control phenotype association was not significant. For all of these variants, predicted phenotype associations had extremely low p values (p < 5 × 10−13), causing the MTAG association to be significant even when the case-control association was not (Figures S8A–S8D). This is a known limitation of MTAG and emphasizes the need for external replication.18
Finally, we examined the 270 novel MTAG variants from gradient boosting phenotypes across all diseases, of which 159 replicated in FinnGen, 56 reached genome-wide significance in a UK Biobank/FinnGen meta-analysis, and 67 had supporting evidence from Open Targets (Table S6F). Examples of novel variants mapped to genes with substantial experimental support included rs6031435 (JPH2; β = 0.02, pMTAG = 1 × 10−8) for AFib, with reduced JPH2-mediated stabilization of RyR2 promoting atrial arrhythmias28; rs3761549 (FOXP3; β = −0.04, pMTAG = 2 × 10−8) for nasal polyps, which are characterized by both reduced Foxp3+ regulatory T cells and reduced FOXP3 expression29; and rs13161058 (HTR4; β = −0.02, pMTAG = 3 × 10−8) for T2D, with recent evidence supporting HTR4 agonism as a novel mechanism of improving glucose tolerance.30,31 These results support the use of predicted phenotypes to boost case-control phenotype power with MTAG.
Predicted phenotypes provide genetic support for additional drug targets
Genetic association results are useful for selecting targets for drug development, with drugs approved by the US Food and Drug Administration being twice as likely to have supporting genetic evidence.32 We identified 906 gene-disease pairs targeted by phase I–IV drugs across all diseases except gallstones and varicose veins, which typically require surgical management. Of the 906 gene-disease pairs, 39 had corresponding genome-wide significant variants from case-control and/or all-feature predicted phenotypes (Table S7), of which 14 had variants only from predicted phenotypes, 13 only from case-control phenotypes, and 12 from both phenotypes. Notable examples of drugs supported only by predicted phenotypes in the UK Biobank included statins for CAD, anti-interleukin (IL)-4R/anti-IL-5RA antibodies for nasal polyps, glucokinase activators for T2D, and defibrotide for VTE. Results were similar for relaxed significance thresholds; when including LD-independent variants with p < 1 × 10−5, an additional 41 gene-disease pairs had genetic support: 34 only from predicted phenotypes, 6 only from case-control phenotypes, and 1 from both (Table S7). These results show that predicted and case-control phenotypes provide complementary genetic evidence for drug target identification.
Combining case-control and predicted phenotypes improves PRS performance
We constructed three PRSs for each disease: one using case-control phenotypes (“case-control PRS”), one using predicted phenotypes (“predicted PRS”), and one using both phenotypes (“combined PRS”). Among European ancestry participants in All of Us, combined PRSs explained a greater proportion of phenotypic variance than case-control PRSs for all diseases, with a median increase in Nagelkerke’s R2 of 37% (Figure 5A; Table S8A). Predicted PRSs also had larger R2s compared to case-control PRSs for CAD, gallstones, nasal polyps, and T2D. Similarly, combined PRSs had greater odds ratios (ORs) per standard deviation (SD) increase in PRS compared to case-control PRSs for all diseases (Figure 5B; Table S8B), with increases ranging from 2% (celiac disease) to 9% (gallstones), as well as greater ORs for participants ≥95th percentile compared to those between the 25th and 75th percentiles (Figure 5C; Table S8C), with increases ranging from 3% (celiac disease) to 23% (nasal polyps). Additionally, combined PRSs achieved similar ORs as participants ≥95th percentile for case-control PRSs at lower thresholds, increasing the number of participants identified as high risk by a median of 120% (Table S8D). Combined PRSs had variable performance increases compared to case-control PRSs in cross-ancestry and other ancestry-specific analyses in All of Us and BioMe, likely due to small sample sizes of non-European ancestry groups (Figures S9A–S9F; Tables S8A–S8D). However, all PRSs performed consistently worse in non-European compared to European ancestry groups (Table S8E).
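For readers unfamiliar with Nagelkerke’s R2, the sketch below computes it from the log-likelihoods of a null and a fitted logistic regression model and then expresses the gain of one PRS over another as a percent increase. It assumes the log-likelihoods are already available from fitted models; whether the study computed R2 relative to a covariate-only baseline is not stated here, and all numbers in the example are hypothetical.

```python
import math


def nagelkerke_r2(ll_null, ll_full, n):
    """Nagelkerke's R2: the Cox-Snell R2 divided by its maximum attainable
    value, given null and full model log-likelihoods and sample size n."""
    cox_snell = 1.0 - math.exp(2.0 * (ll_null - ll_full) / n)
    max_cox_snell = 1.0 - math.exp(2.0 * ll_null / n)
    return cox_snell / max_cox_snell


def percent_increase(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return 100.0 * (new - old) / old


# Hypothetical log-likelihoods for n = 1000 participants.
n = 1000
ll_null = -500.0
r2_case_control = nagelkerke_r2(ll_null, -490.0, n)
r2_combined = nagelkerke_r2(ll_null, -486.0, n)

print(r2_case_control, r2_combined)
print(percent_increase(r2_combined, r2_case_control))
```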
Figure 5.
Performance of PRSs in All of Us
(A) Nagelkerke’s R2s for case-control, predicted, and combined polygenic risk scores (PRSs).
(B) ORs per standard deviation (SD) increase in PRSs for case-control, predicted, and combined PRSs.
(C) ORs for participants above the 95th percentile compared to those between the 25th and 75th percentiles of the case-control, predicted, and combined PRSs.
(D) ORs per SD increase in PRS for case-control and predicted PRSs in multi-variable regressions.
(E) ORs comparing participants in the 40th to 60th percentile bin of both case-control and predicted PRSs (center box) to participants in different percentile bins of the two PRSs. Only associations with p < 0.05 are shown.
All plots represent participants of European ancestry; results for other ancestry groups are shown in Figure S9. Error bars represent 95% confidence intervals. In (A)–(C), green text represents the percent improvement of the combined PRS over the case-control PRS. In (D), green text represents the percent improvement of the predicted PRS over the case-control PRS and the p value from a Wald test for the difference in log odds between case-control and predicted PRSs. See also Figures S9 and S10 and Tables S8A–S8F.
To explain why combined PRSs outperformed case-control PRSs, we investigated the extent to which case-control and predicted PRSs were complementary. In multi-variable regressions, predicted PRSs remained significantly associated with every disease even in the presence of case-control PRSs (Figure 5D; Table S8F). There was also limited overlap between participants ≥95th percentile of each of the two PRSs, ranging from only 7% (varicose veins) to 23% (T2D). Participants with high scores in one PRS generally had substantially lower scores in the other PRS; those ≥95th percentile of predicted PRSs had an average difference between predicted and case-control PRS Z scores of 0.7 (celiac disease) to 1.7 (nasal polyps), with similar differences for the reverse comparison. We further divided participants into percentile bins for case-control and predicted PRSs and found that each PRS provided additional risk stratification within percentile bins of the other PRS (Figure 5E). Additionally, for CAD, nasal polyps, T2D, varicose veins, and VTE, participants in the top percentile bin (95th to 100th percentile) of both PRSs had the highest risk of disease.
Finally, because MTAG outputs can be used for PRSs,18 we compared combined PRSs to MTAG PRSs. For six diseases, their performance was similar based on Nagelkerke’s R2 and ORs (Figures S10A–S10C). However, for AFib and gallstones, combined PRSs significantly outperformed MTAG PRSs. Thus, while MTAG is useful for genetic discovery, the combined PRS approach may be preferable for PRS analyses.
Discussion
In this study, we used machine learning to convert binary case-control phenotypes into predicted continuous phenotypes for eight complex diseases. We systematically tested genome-wide associations with both phenotypes, finding that their genetic overlap varied substantially in proportion to disease prediction metrics (AUROCs and AUPRCs). We then applied MTAG to boost case-control phenotypes using predicted phenotypes, identifying additional variants not captured by case-control phenotypes for all diseases. Many of these variants replicated in FinnGen and reached genome-wide significance in larger analyses. Predicted phenotypes also increased genetic support for drug targets and, when combined with case-control phenotypes, improved cross-ancestry and ancestry-specific PRS performance. Taken together, case-control and predicted phenotypes represent complementary approaches for genetic discovery, drug target prioritization, and PRS prediction, but caution is needed when applying and interpreting results from predicted phenotypes.
Spurious associations are a major limitation of predicted phenotypes.13 Even for CAD and T2D, which had the highest prediction metrics, 12% and 11% of predicted phenotype variants failed to replicate in much larger case-control meta-analyses, and replication rates were substantially lower for diseases with lower disease prediction metrics. Some spurious associations were likely attributable to features not necessarily causal but still important for predicting case-control status and, thus, represented in the predicted phenotypes. Furthermore, because predicted phenotypes represent composite biomarkers, they inherit limitations from studies performing GWASs on individual biomarkers as disease proxies.33,34 These include non-specificity due to biomarkers capturing different biological processes, confounding from pleiotropic variants that influence biomarkers independent of disease status, and sensitivity to data missingness and measurement factors. All of these limitations emphasize the need for caution when using predicted phenotypes.
There was substantial heterogeneity in the performance of predicted phenotypes both within and across diseases. Explaining this, prediction metrics were significantly associated with the case-control phenotype replicability of predicted phenotype variants and vice versa, both within and across diseases. These metrics reflect how well a model distinguishes cases from controls using available features, and some diseases are better suited for machine learning-assisted genetic discovery based on the phenotypic data available. For example, VTE primarily involves hypercoagulability, yet the UK Biobank lacks coagulation-related phenotypes; consequently, far fewer predicted than case-control variants mapped to the coagulation gene set, and variants in key coagulation genes (F2 and F8–F11) were identified only by the case-control phenotype. In contrast, for CAD and T2D, where several complementary dyslipidemia and insulin resistance markers were available, predicted phenotypes showed strong overlap with case-control phenotypes. For both diseases, models constructed without these features resulted in predicted phenotypes with substantially poorer overlap with case-control phenotypes.
Our results also support maximizing performance metrics when constructing predicted phenotypes, either by using improved architectures (e.g., gradient boosting versus logistic regression) or by including additional disease-relevant features. Predictions from a perfectly accurate model would still increase power over the case-control phenotype by virtue of being a continuous phenotype where participants are ranked by disease probabilities.12,35,36 Importantly, feature selection may be unnecessary for gradient boosting models, with both causally informed and importance-based selection failing to improve predicted phenotype performance. Regarding causally informed selection, causal factors emerged as important features even without selection, and non-causal factors, particularly those correlated with disease severity or outcomes, can help models distinguish cases from controls and rank cases by severity. Causal variants can also be associated with both causal and non-causal factors simultaneously due to pleiotropy. These findings suggest that spurious associations are better addressed post GWAS at the variant level,13 as feature-level filtering can reduce model performance and compromise replicability.
Combining predicted with case-control phenotypes using MTAG can increase power while reducing spurious associations. There were strong correlations between MTAG and case-control effect sizes, and up to 83% of additional variants externally replicated in FinnGen at nominal significance. This approach is likely generalizable across diseases; additional MTAG-identified variants from diseases with low genetic correlations between predicted and case-control phenotypes still had high replication rates, but effective sample size increases for these diseases were small. This aligns with the original MTAG study, which concluded that the greatest power gains with minimal FDR inflation occur for traits with high genetic correlation.18 However, MTAG has a known limitation when a variant is null for one trait but not another,18 which is reflected in its maximum FDR estimates. We indeed observed that, in rare cases, MTAG produced spurious associations for variants with extremely significant predicted phenotype associations but no significant case-control phenotype associations. Thus, even with post-GWAS corrections like MTAG, this approach needs to be used with caution. It is always important to assess the external replicability of identified variants and the experimental and functional evidence supporting them.18,37
Besides genetic discovery, we apply predicted phenotypes under two circumstances where, in addition to causal variants, identifying non-causal but still disease-associated variants may be useful. First, similar to our prior report showing that predicted phenotypes improved drug target prioritization across 112 chronic diseases,38 we found that predicted phenotypes increased identification of genes targeted by existing drugs at multiple significance thresholds, which could broaden target discovery for drug discovery programs. Effective drugs need not target causal genes and instead could target non-causal genes to achieve symptomatic relief or to counteract the effects of causal genes. Second, similar to prior multi-phenotype PRS approaches that used disease-related PRSs to augment a disease-specific PRS,39,40 combined PRSs incorporating both case-control and predicted phenotype effects explained more phenotypic variance and improved risk stratification compared to case-control PRSs for all diseases, including in cross-ancestry and most non-European ancestry analyses. This was likely because case-control and predicted phenotype PRSs were complementary, with different participants identified as high risk and with participants in the top percentile bins of both PRSs generally having the highest disease risk. The advantage of our combined PRS approach is that it does not require manually selecting related PRSs or including PRSs for hundreds of traits.40
In summary, this study demonstrates that predicted phenotypes can complement case-control phenotypes in genetic analyses of complex diseases. Although predicted phenotypes introduce spurious associations, incorporating them into MTAG analyses identified novel disease-associated variants, most of which replicated in FinnGen and many of which had supporting evidence. Predicted phenotypes can also improve the PRS performance of case-control phenotypes. However, the effectiveness of using predicted phenotype genetic associations varies across diseases. Further studies are needed to systematically identify the causes of spurious associations in predicted phenotypes and develop strategies to address them.
Limitations of the study
This study has several limitations. First, we trained machine learning models and performed genetic association testing only in the UK Biobank; results may differ for other biobanks where different sets of phenotypic data are available. Second, we only used tabular data to train machine learning models. Hybrid models combining tabular with multi-dimensional data, including imaging data, electrocardiograms, and spirograms, would likely yield even more accurate predicted phenotypes. Third, despite our use of robust MR methods, MR is not infallible in determining causality due to biases in instrument selection, unmeasured confounding, and genetic pleiotropy. Fourth, we were unable to perform MR for many UK Biobank phenotypes that lacked sufficient variants with genome-wide significant associations. While some of these phenotypes may be non-heritable, others may be underpowered. Fifth, because the UK Biobank consists primarily of European ancestry participants, all PRSs underperformed in non-European ancestry participants in All of Us and BioMe despite our use of cross-population polygenic prediction methodology. Sixth, differences in disease definitions and inclusion criteria between the four datasets in this study may have affected the proportion of cases in each dataset and, thus, variant replication rates (in FinnGen) and PRS performance (in All of Us and BioMe).
Resource availability
Lead contact
Requests for further information, resources, and data should be directed to and fulfilled by the lead contact, Ron Do (ron.do@mssm.edu).
Materials availability
This study did not generate new unique materials.
Data and code availability
- All summary statistics and PRS posterior effect sizes are available as a Zenodo repository (https://doi.org/10.5281/zenodo.14969847).
- Analysis scripts are available at the GitHub repository (https://github.com/robchiral/ML-GWAS), and a permanent version is available at the Zenodo repository. Publicly available data and code are detailed in the key resources table.
- Any additional information required to re-analyze the data reported in this study is available from the lead contact upon request.
Acknowledgments
R.C., I.S.F., and J.K.P. are supported by the National Institute of General Medical Sciences of the NIH (T32-GM007280). R.D. is supported by the National Institute of General Medical Sciences of the NIH (R35-GM124836). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. We gratefully acknowledge All of Us, BioMe, and UK Biobank participants for their contributions, without whom this research would not have been possible. The data in this paper were used in a dissertation as partial fulfillment of the requirements for a PhD degree at the Graduate School of Biomedical Sciences at Mount Sinai.
Author contributions
R.C. and R.D. conceived the idea. R.C. curated the data, conducted the investigation, and created the visualizations. All authors contributed to the methodology. R.C. drafted the original manuscript, and all authors reviewed and edited the paper. R.D. oversaw project administration and supervision and acquired the funding. All authors read and approved the final manuscript.
Declaration of interests
R.D. reports being a scientific cofounder, consultant, and equity holder for Pensieve Health (pending) and being a consultant for Variant Bio and Character Bio.
STAR★Methods
Key resources table
| REAGENT or RESOURCE | SOURCE | IDENTIFIER |
|---|---|---|
| Deposited data | | |
| Summary statistics and polygenic risk score posterior effect sizes | This paper | https://doi.org/10.5281/zenodo.14969847 |
| All of Us Controlled Tier Dataset v7 | The All of Us Research Program Genomics Investigators41 | https://allofus.nih.gov |
| FinnGen Freezes 10 and 11 | Kurki et al.42 | https://www.finngen.fi |
| MSigDB | Liberzon et al.43 | https://www.gsea-msigdb.org/gsea/msigdb |
| Open Targets | Buniello et al.37 | https://platform.opentargets.org/downloads |
| Pan-UK Biobank | Karczewski et al.44 | https://pan.ukbb.broadinstitute.org/downloads/index.html |
| UK Biobank | Bycroft et al.45 | https://www.ukbiobank.ac.uk |
| Software and algorithms | | |
| Analysis scripts | This paper | https://github.com/robchiral/ML-GWAS and https://doi.org/10.5281/zenodo.14969847 |
| LDSC version 1.0.1 | Bulik-Sullivan et al.46 | https://github.com/bulik/ldsc |
| MendelianRandomization version 0.10.0 | Yavorska and Burgess47 | https://cran.r-project.org/web/packages/MendelianRandomization |
| METAL release 2023-03-07 | Willer et al.48 | https://github.com/statgen/METAL |
| MR-PRESSO release 2023-07-25 | Verbanck et al.49 | https://github.com/rondolab/MR-PRESSO |
| PLINK 2.0 release 2024-03-02 | Chang et al.50 | https://www.cog-genomics.org/plink/2.0/ |
| regenie version 3.4.0 | Mbatchou et al.51 | https://github.com/rgcgithub/regenie |
| triple-liftOver release 2022-04-22 | Sheng et al.52 | https://github.com/GraceSheng/triple-liftOver |
| PRS-CS version 1.1.0 | Ge et al.53 | https://github.com/getian107/PRScs |
| PRS-CSx version 1.1.0 | Ruan et al.54 | https://github.com/getian107/PRScsx |
Method details
Ethics statement
Participants voluntarily enrolled and gave informed electronic consent for all datasets used in this study (All of Us, BioMe, and the UK Biobank). We accessed All of Us data under workspace aou-rw-75979bcb using the Controlled Tier Dataset v7. The Institutional Review Board at the Icahn School of Medicine at Mount Sinai approved BioMe access (GCO no. 07–0529; STUDY-11–01139). We accessed UK Biobank data under application ID 16218. This study complied with the Declaration of Helsinki.
Disease definitions and participant selection
We defined cases and controls for each disease consistent with prior large-scale GWASs (Tables S1A and S1B). In the UK Biobank, we identified cases for each disease using diagnostic codes from inpatient diagnoses (category 2000), primary care diagnoses (category 3000), and the death registry (category 100093); self-reported medical conditions (category 100074); inpatient operative procedures (category 2005); and self-reported operative procedures (category 100076). We used the same criteria to define cases and controls in All of Us and BioMe except for cause-of-death data, self-reported diagnoses, and self-reported operations, which were available only in the UK Biobank. FinnGen performed independent phenotyping but likewise used diagnostic codes, cause-of-death data, and operations to define cases.42
We performed participant-level quality control in all datasets. In the UK Biobank, we removed participants with chromosomal sex discordant with self-reported sex (fields 22001 and 31, respectively), presence of sex chromosome aneuploidy (field 22019), excess heterozygosity or missingness (field 22027), and/or ten or more third-degree relatives (field 22021). In All of Us, we excluded participants with fingerprint discordance, chromosomal sex discordant with self-reported sex, call rate ≤98%, cross-individual contamination rate ≥3%, and/or who failed coverage thresholds. In BioMe, we excluded participants with chromosomal sex discordant with self-reported sex, excess heterozygosity (standard deviation ≥6), genotyping rate <95%, and/or genotyping concordance for single nucleotide variants <80%. FinnGen excluded participants with chromosomal sex discordant with self-reported sex, high genotype missingness (>5%), and excess heterozygosity (standard deviation ≥4).42
Ultimately, we performed machine learning and common variant association analyses in up to 485,448 UK Biobank participants and replicated these associations in up to 453,733 FinnGen participants from Freeze 11 (Table S1B). We externally tested PRSs in up to 242,458 All of Us participants and 53,373 BioMe participants.
Machine learning in the UK Biobank
We considered a comprehensive set of phenotypes as features for disease prediction models, including blood biochemistry (category 100002), physical measurements (category 100006), self-reported medications (category 100075), diagnoses from all sources encoded as three-character ICD-10 codes (category 1712), and lifestyle factors (category 100050). For blood biochemistry and physical measurements, we used data as provided without further processing. We included sex, age at enrollment, and fasting time in hours as covariates. Fasting time represents the interval between consumption of food or drink and blood sample collection, which can affect laboratory measurements. To remove redundancies, we converted self-reported medications into Anatomical Therapeutic Chemical (ATC) codes using previously published conversion tables.55 For robustness, we removed binary features (features encoded as “yes” or “no,” including ICD-10 codes) with fewer than 1,000 positive examples or cases as well as continuous features that were unavailable for more than 50% of UK Biobank participants. When constructing models to predict each disease, we also removed ICD-10 codes used to define cases and ATC codes corresponding to disease-associated therapeutics to avoid data leakage (Table S4B). We removed participants who were missing more than 50% of 89 continuous phenotypes representing blood biochemistry and physical measurements. Ultimately, for all-feature models, we included 1,037 features for CAD, 1,039 features for T2D and VTE, 1,045 features for AFib, and 1,048 features for celiac disease, gallstones, nasal polyps, and varicose veins.
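The two feature-level filters described above can be illustrated with a minimal sketch. It is not the study's code: the function name `keep_feature`, the 0/1 coding of binary features, and the use of `None` for missing values are assumptions made for illustration.

```python
def keep_feature(values, is_binary, min_positive=1000, max_missing=0.5):
    """Return True if a feature passes the filter described in the text.

    values: one entry per participant; None marks a missing value (assumed).
    is_binary: True for yes/no features (assumed to be coded 1/0).
    """
    if is_binary:
        # Binary features need at least `min_positive` positive examples.
        return sum(1 for v in values if v == 1) >= min_positive
    # Continuous features are dropped when missing for more than half of participants.
    missing_fraction = sum(1 for v in values if v is None) / len(values)
    return missing_fraction <= max_missing
```

With the thresholds in the text, a binary feature with 999 positive participants would be dropped, while a continuous feature missing for exactly half of participants would be kept.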
We constructed disease prediction models using the LightGBM Python package (version 4.4.0), which remains state-of-the-art for large tabular datasets in terms of training speed and accuracy.56 We trained LightGBM models to minimize log loss when predicting case-control status. To enable inter-disease comparisons and avoid introducing additional variability to genetic association results, we applied a consistent set of hyperparameters identified through grid search that performed well across all diseases (‘data_sample_strategy’: ‘goss’, ‘boosting’: ‘gbdt’, ‘num_iterations’: 1000, ‘learning_rate’: 0.1, ‘num_leaves’: 50, ‘min_data_in_leaf’: 50). We also enabled early stopping after five iterations without improvement in validation log loss to prevent overfitting. We used a 6-fold nested cross-validation procedure where participants were split into six outer folds. Each outer fold served as a holdout set once for the other five folds, which we combined and further divided into six inner folds. We used five of these inner folds for training and the sixth fold as a validation set for early stopping. To calculate feature importances, we enabled SHapley Additive exPlanations (SHAP) when generating predictions for the holdout set and averaged SHAP values across all predictions for each feature. We calculated machine learning metrics for holdout set predictions using the scikit-learn Python package (version 1.5.0).
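The nested splitting scheme can be sketched with the standard library alone. This is an illustration under stated assumptions, not the published pipeline: the function name `nested_folds`, the random seed, and the choice of a single inner split per outer fold are hypothetical, and model fitting with LightGBM is deliberately left out.

```python
import random


def nested_folds(n_participants, n_folds=6, seed=0):
    """Yield (train, validation, holdout) index lists, one triple per outer fold."""
    rng = random.Random(seed)
    indices = list(range(n_participants))
    rng.shuffle(indices)
    # Outer folds: every participant is held out exactly once.
    outer = [indices[i::n_folds] for i in range(n_folds)]
    for k, holdout in enumerate(outer):
        # Pool the other five outer folds, then re-split them into six inner folds.
        pooled = [i for j, fold in enumerate(outer) if j != k for i in fold]
        inner = [pooled[i::n_folds] for i in range(n_folds)]
        # Five inner folds train the model; the sixth monitors early stopping.
        validation = inner[-1]
        train = [i for fold in inner[:-1] for i in fold]
        yield train, validation, holdout
```

Each participant therefore receives exactly one holdout prediction, produced by a model that never saw that participant during training or early stopping.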
As a sensitivity analysis, we compared LightGBM models to logistic regression models using the LogisticRegression function from scikit-learn. For logistic regression models, which did not support early stopping, we performed 6-fold cross-validation where each fold served as a holdout set once for the other five folds, which we combined and used for training.
Prior to genetic association testing, we beta-regressed machine learning probabilities for all participants on age, sex, and 10 principal components (PCs) of genotypes with a logit link function using the betareg R package (version 3.1-4) and transformed the resulting residuals using rank-based inverse normal transformation with the RNOmni R package (version 1.0.1.2). For sensitivity analyses including BMI as an additional covariate, we beta-regressed on age, sex, 10 PCs, and BMI.
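The rank-based inverse normal transformation step can be sketched in Python. This is a minimal sketch, not the RNOmni implementation: the function name is hypothetical, the Blom offset of 0.375 and the averaging of tied ranks are assumptions, and the preceding beta regression is not shown.

```python
from statistics import NormalDist


def rank_inverse_normal(values, offset=0.375):
    """Map values to normal quantiles via their ranks (ties get the average rank)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied values starting at sorted position i.
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # 1-based average rank for the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    normal = NormalDist()
    # Offset ranks so that the resulting probabilities stay strictly inside (0, 1).
    return [normal.inv_cdf((r - offset) / (n - 2 * offset + 1)) for r in ranks]
```

The transformation preserves the ordering of participants while forcing the phenotype onto a standard normal scale, which keeps outlying residuals from dominating the association test.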
Genetic association testing in the UK Biobank
We performed association testing using regenie (version 3.4.0), which consists of two steps: step 1 fits a whole-genome ridge regression model to generate polygenic predictions, and step 2 performs single-variant association testing conditional on these predictions. We included age, sex, age×sex, age², age²×sex, and the first 10 PCs of genotypes as covariates to be consistent with the Pan-UK Biobank project.44 As a sensitivity analysis, we separately performed association testing with BMI as an additional covariate. For step 1 of regenie, we used genotype data to generate ridge regression predictions on blocks of 2,000 single nucleotide variants (SNVs). Using PLINK 2.0 (release 2024-03-02), we filtered genotype data for variants with minor allele count (MAC) > 100, minor allele frequency (MAF) > 0.01, genotyping rate >0.9, and Hardy-Weinberg exact test p-value >1 × 10−15.
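The covariate set can be written out explicitly. The helper below is a sketch for illustration only: the name `covariate_row` and the 0/1 coding of sex are assumptions, and in practice the covariates would be supplied to regenie as a file rather than built row by row.

```python
def covariate_row(age, sex, pcs):
    """Build one participant's covariates: age, sex, age*sex, age^2, age^2*sex, PC1..PC10."""
    if len(pcs) != 10:
        raise ValueError("expected the first 10 genotype principal components")
    # sex is assumed to be coded 0/1 so that the interaction terms are well defined.
    return [age, sex, age * sex, age ** 2, age ** 2 * sex, *pcs]
```

Each participant contributes 15 covariate values under this layout, or 16 in the sensitivity analysis that adds BMI.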
For step 2 of regenie, we performed genome-wide association testing on blocks of 500 SNVs from Haplotype Reference Consortium-imputed genotype data. Using PLINK 2.0, we filtered these data for variants with INFO score >0.8, MAC >100, MAF >0.001, genotyping rate >0.9, and Hardy-Weinberg exact test p-value >1 × 10−15. To determine LD-independent variants that were genome-wide significant, we performed LD clumping using PLINK 2.0 with a significance threshold of 5 × 10−8, r2 threshold of 0.1, and a distance of 1 Mb. We did not consider variants located in the MHC region (chr6:28,477,797-33,448,354 for GRCh37) due to its LD complexity and because MHC fine-mapping studies have previously been performed.
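The clumping logic can be illustrated with a simplified greedy sketch. It is not PLINK's implementation and omits options such as a secondary p-value threshold; the function names, the tuple layout for variants, and the caller-supplied `r2` function are all assumptions.

```python
MHC = ("6", 28_477_797, 33_448_354)  # GRCh37 coordinates quoted in the text


def in_mhc(chrom, pos):
    """Return True if a variant falls inside the excluded MHC interval."""
    return chrom == MHC[0] and MHC[1] <= pos <= MHC[2]


def clump(variants, r2, p_max=5e-8, r2_min=0.1, window=1_000_000):
    """Greedy LD clumping.

    variants: list of (variant_id, chrom, pos, p_value) tuples.
    r2: callable(id_a, id_b) returning the squared correlation between two variants.
    Returns index variant IDs ordered from most to least significant.
    """
    candidates = sorted(
        (v for v in variants if v[3] < p_max and not in_mhc(v[1], v[2])),
        key=lambda v: v[3],
    )
    index_variants, claimed = [], set()
    for vid, chrom, pos, _ in candidates:
        if vid in claimed:
            continue  # already absorbed into a more significant clump
        index_variants.append(vid)
        claimed.add(vid)
        for oid, ochrom, opos, _ in candidates:
            if (oid not in claimed and ochrom == chrom
                    and abs(opos - pos) <= window and r2(vid, oid) >= r2_min):
                claimed.add(oid)
    return index_variants
```

The returned index variants correspond to the LD-independent, genome-wide significant associations counted for each phenotype.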
Following previous pooled ancestry approaches for both common and rare variant association analyses,11,57,58 and because regenie controls for population structure,51 we pooled together all individuals in the UK Biobank into a single model for the primary analyses comparing predicted and case-control phenotypes. For secondary analyses requiring ancestry-specific associations, including LD score regression, MTAG, and PRS construction, we separately performed ancestry-specific association testing using ancestry group assignments (AFR, AMR, CSA, EAS, EUR, MID) from the Pan-UK Biobank project.44 In addition to computational efficiency, the pooled approach resulted in an increase in sample size of approximately 10% due to the inclusion of participants not assigned to a specific ancestry group. As a sensitivity analysis, we compared pooled analyses with meta-analyses of ancestry-specific analyses and observed a high concordance of results (Table S9). For genome-wide significant variants identified in pooled analyses, we observed high correlations between -log10(p-values) from pooled analyses and meta-analyses (Figure S11), with R2s > 0.98 for all case-control phenotypes and >0.95 for all predicted phenotypes. Many instances of discordant results were likely attributable to the rarity (MAF <0.001) of variants among certain ancestry groups, causing them to be excluded from the meta-analysis. There were minimal differences in genomic inflation factors between pooled analyses and meta-analyses, suggesting no significant increase in unaccounted population stratification in the pooled analyses (Table S9).
We mapped variants to genes using a three-stage approach. For missense and protein-truncating variants, we used the gene assignment from Ensembl VEP (release 112). For all other variants, we used the highest scoring gene assignment from the Open Targets Variant-to-Gene pipeline (release 2022-10-06); this pipeline combines eQTL and pQTL datasets, chromatin interaction and conformation datasets, functional predictions, and distance from the canonical transcript start site.59 For remaining variants not present in the Variant-to-Gene pipeline, we assigned the closest gene based on distance from the canonical transcript start site, as this is often the causal gene.60,61,62 We then analyzed gene assignments to the 50 MSigDB hallmark gene sets. Following prior approaches,63,64,65 we defined variants as novel if they were at least 500 kb away from all reported variants in the GWAS Catalog and published literature.
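The three-stage fallback can be expressed as a short function. This is a sketch only: the dictionary keys, the function name, and the particular set of VEP consequence terms treated as missense or protein-truncating are assumptions rather than the study's definitions.

```python
# Sequence Ontology terms assumed here to cover missense and protein-truncating variants.
CODING_CONSEQUENCES = {
    "missense_variant",
    "stop_gained",
    "frameshift_variant",
    "splice_donor_variant",
    "splice_acceptor_variant",
}


def map_variant_to_gene(variant):
    """Return (gene, source) using VEP, then Variant-to-Gene scores, then nearest TSS.

    variant: dict with hypothetical keys 'consequence', 'vep_gene',
    'v2g_scores' (gene -> score), and 'nearest_tss_gene'.
    """
    if variant.get("consequence") in CODING_CONSEQUENCES and variant.get("vep_gene"):
        return variant["vep_gene"], "VEP"
    v2g_scores = variant.get("v2g_scores")
    if v2g_scores:
        # Take the gene with the highest Variant-to-Gene score.
        return max(v2g_scores, key=v2g_scores.get), "V2G"
    return variant["nearest_tss_gene"], "nearest TSS"
```

Returning the source alongside the gene makes it straightforward to audit how many assignments rest on each stage.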
For case-control phenotypes, we performed standard error-based meta-analyses of UK Biobank and FinnGen Freeze 11 associations using METAL (version 2020-05-05), including up to 933,649 individuals. To estimate inflation, heritability, and genetic correlation, we performed LD score regressions using LDSC (version 1.0.1) with default parameters for European ancestry participants. We used pre-computed European ancestry LD scores for HapMap3 variants from 1000 Genomes Project phase 3 data that were supplied with the LDSC package.
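For a single variant, a standard error-based meta-analysis reduces to inverse-variance weighting of the per-study effect sizes. The sketch below illustrates that calculation under a fixed-effect assumption; the function name is hypothetical, and it omits the allele alignment and genomic control options that METAL provides.

```python
import math


def ivw_meta(betas, standard_errors):
    """Fixed-effect inverse-variance weighted meta-analysis for one variant.

    Returns (pooled_beta, pooled_se, two_sided_p).
    """
    weights = [1.0 / se ** 2 for se in standard_errors]
    total_weight = sum(weights)
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    z = pooled_beta / pooled_se
    # Two-sided p value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return pooled_beta, pooled_se, p_value
```

For example, two studies that each report an effect of 0.10 with a standard error of 0.02 yield the same pooled effect with a standard error of about 0.014, which is how a variant that is suggestive in each cohort can reach genome-wide significance in the meta-analysis.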
We performed Multi-Trait Analysis of GWAS (MTAG) of case-control and predicted phenotypes using the mtag package (release 2023-03-07) with default parameters. An ideal scenario where a disease prediction model perfectly predicts case-control status aligns with the special case of “perfect genetic correlation but different heritabilities” described by the MTAG authors, where case-control and predicted phenotypes are perfectly correlated, but predicted phenotypes exhibit significantly higher heritability. Because MTAG relies on LDSC and there are small sample sizes for non-European ancestry participants in the UK Biobank, we performed MTAG analyses only for participants of European ancestry. We used LD scores we calculated for 10,000 randomly selected European ancestry UK Biobank participants with LDSC and analyzed only variants with MAF >0.01 as recommended by the MTAG authors.
Open Targets target-disease association evidence and drug indications
We obtained target-disease association evidence from Open Targets (version 24.06).37 Target-disease association evidence includes genetic evidence, somatic mutations, drug indications, pathways and systems biology, RNA expression, text mining, and animal models. Open Targets assigned each gene-disease pair a numerical score for each source of evidence and calculated a harmonic sum to generate a target-disease association score. Because of different levels of granularity of diseases, Open Targets provided both direct (more specific) and indirect (less specific) associations; for all analyses in this study, we used only direct target-disease associations.
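A harmonic sum of evidence scores can be sketched as follows. This reflects our understanding of the general form (scores sorted in descending order, with the i-th score divided by i squared) and should be read as an assumption: the function name is hypothetical, and any per-source weighting or normalization applied by Open Targets is not reproduced.

```python
def harmonic_sum(scores):
    """Sum scores in descending order, down-weighting the i-th score by 1 / i**2."""
    ordered = sorted(scores, reverse=True)
    return sum(score / (rank ** 2) for rank, score in enumerate(ordered, start=1))
```

Under this form, the strongest piece of evidence dominates the total, and additional weaker evidence adds progressively less.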
Mendelian randomization
We performed Mendelian randomization for 190 UK Biobank phenotypes (Table S4A). We considered all phenotypes used for machine learning models (see “Machine learning”) and removed those without at least three genetic instrumental variables (IVs). To identify IVs, we obtained European ancestry-specific summary statistics from the Pan-UK Biobank project and performed separate GWASs for 39 custom phenotypes on the same European ancestry participants (see “Genetic association testing”).44 For each phenotype, we performed LD clumping of summary statistics using PLINK 2.0 with a significance threshold of 5 × 10−8, r2 threshold of 0.001, and distance of 10 Mb, with imputed genotypes for all European ancestry participants in the UK Biobank serving as the LD reference panel. We obtained outcome summary statistics for all diseases from FinnGen Freeze 10, which had the most European ancestry participants outside of the UK Biobank. As FinnGen variants are assigned GRCh38 coordinates, we used triple-liftOver (release 2022-04-22) to convert UK Biobank coordinates from GRCh37 to GRCh38.52
We performed phenome-wide Mendelian randomization studies for each disease using seven methods: inverse variance weighted (IVW) with random effects, cML-MA, MR-ConMix, MR-Egger, MR-PRESSO, MR-Robust, and weighted median.49,66,67,68,69 We included the latter six methods as they use different methods to increase robustness to invalid IVs.70,71 For MR-PRESSO, we enabled correction of horizontal pleiotropy via outlier removal and used only the corrected results. Except for MR-PRESSO, which is available as a standalone R package (release 2023-07-25), we performed all methods using the MendelianRandomization R package (version 0.10.0).47 We defined possible causal factors as those simultaneously reaching Bonferroni significance (p < 2.6 × 10−4) with IVW MR and nominal significance with at least three robust methods, and possible non-causal factors as those not reaching nominal significance with any MR method (Table S4D). Although there was probable horizontal pleiotropy not detected by the robust methods, with all six robust methods identifying a significant association between HDL cholesterol and CAD (Table S4C), we intended for MR to remove features without evidence of causal relationships for machine learning rather than to conclusively establish causal relationships.
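The classification rule can be made concrete with a short sketch. The Bonferroni threshold follows from the 190 phenotypes tested (0.05 / 190 ≈ 2.6 × 10−4); the function name and the "unclassified" label for factors meeting neither definition are assumptions added for illustration.

```python
BONFERRONI_ALPHA = 0.05 / 190  # 190 phenotypes tested, approximately 2.6e-4


def classify_factor(ivw_p, robust_ps, nominal_alpha=0.05, min_robust=3):
    """Classify a phenotype from its IVW p value and the six robust-method p values."""
    n_robust_significant = sum(p < nominal_alpha for p in robust_ps)
    if ivw_p < BONFERRONI_ALPHA and n_robust_significant >= min_robust:
        return "possible causal"
    if ivw_p >= nominal_alpha and n_robust_significant == 0:
        return "possible non-causal"
    # Hypothetical label for factors that meet neither definition.
    return "unclassified"
```

A factor with a strong IVW signal but support from only one or two robust methods falls into neither group under this rule.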
Polygenic risk scores
We used PRS-CSx (version 1.1.0) to estimate posterior effect sizes using ancestry-specific summary statistics from association testing in the UK Biobank.54 We used default parameters other than phi = 1e-2, n_iter = 3000, and n_burnin = 1500. We constructed PRSs only for case-control and all-feature predicted phenotypes. We used pre-computed LD reference matrices for HapMap3 variants from 1000 Genomes Project phase 3 data that were supplied with the PRS-CSx package. For a sensitivity analysis comparing MTAG PRSs to case-control, predicted, and combined PRSs, we constructed PRSs with PRS-CS (version 1.1.0) using only European ancestry summary statistics.
In All of Us and BioMe, we used these posterior effect sizes to calculate five ancestry-specific scores for each participant. In All of Us, we split participants randomly into two halves and used each half as the validation set for the other. BioMe consisted of two samples that were genotyped separately; for each BioMe sample, participants in the other BioMe sample served as the validation set. In each validation set, we performed L2-penalized logistic regression against case-control status using scikit-learn to determine optimal coefficients for each ancestry-specific score while including age, sex, and 10 PCs of genotypes as covariates. We multiplied these coefficients with ancestry-specific scores in the respective testing set and summed the products to yield final PRSs. To calculate coefficients for combined PRSs, we included ancestry-specific scores for both case-control and predicted phenotypes in the same logistic regressions. In both validation and testing sets, we normalized ancestry-specific scores and final PRSs to a mean of 0 and a standard deviation of 1. We performed this procedure separately for each ancestry group in All of Us and BioMe as well as for all participants combined as a cross-ancestry analysis. For each disease, we excluded ancestry groups where there were fewer than 100 cases in All of Us or each BioMe sample.
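The final weighting step can be sketched as a linear combination of standardized scores. This is a minimal sketch under assumptions: the function names and score labels are hypothetical, the coefficients are taken as given rather than fitted by penalized logistic regression, and the population standard deviation is used for normalization.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale values to mean 0 and standard deviation 1."""
    center, spread = mean(values), pstdev(values)
    return [(v - center) / spread for v in values]


def combine_scores(scores_by_label, coefficients):
    """Weight standardized ancestry-specific scores and standardize the sum.

    scores_by_label: label -> list of per-participant scores (testing set).
    coefficients: label -> weight learned in the validation set.
    """
    z_scores = {label: standardize(vals) for label, vals in scores_by_label.items()}
    n = len(next(iter(z_scores.values())))
    raw = [sum(coefficients[label] * z_scores[label][i] for label in z_scores)
           for i in range(n)]
    return standardize(raw)
```

For a combined PRS, `scores_by_label` would simply hold the ancestry-specific scores from both the case-control and predicted phenotypes, so the same function covers all three PRS types.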
We analyzed PRS performance by calculating Nagelkerke’s R2 values and odds ratios from logistic regressions using the statsmodels Python package (version 0.14.1), including age, sex, and 10 PCs of genotypes as covariates.
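To make the two metrics explicit, the sketch below computes Nagelkerke's R2 and the odds ratio per standard deviation of PRS on simulated data. It substitutes a small NumPy Newton-Raphson fit for statsmodels so that the example is self-contained. It also assumes an intercept-only null model, which the text does not specify, and all names and data are hypothetical.

```python
import numpy as np


def fit_logistic(X, y, n_iter=50, tol=1e-10):
    """Unpenalized logistic regression by Newton-Raphson.

    Returns coefficients (intercept first) and the log-likelihood.
    """
    X = np.column_stack([np.ones(len(y)), X])
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1 / (1 + np.exp(-X @ beta))
        grad = X.T @ (y - p)
        hess = (X * (p * (1 - p))[:, None]).T @ X
        step = np.linalg.solve(hess, grad)
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    p = 1 / (1 + np.exp(-X @ beta))
    loglik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    return beta, loglik


def nagelkerke_r2(ll_full, ll_null, n):
    """Cox-Snell R2 rescaled by its maximum attainable value."""
    cox_snell = 1 - np.exp(2 * (ll_null - ll_full) / n)
    max_r2 = 1 - np.exp(2 * ll_null / n)
    return cox_snell / max_r2


# Simulated stand-in data: a standardized PRS and 3 placeholder covariates
rng = np.random.default_rng(1)
n = 5000
prs = rng.normal(size=n)
covariates = rng.normal(size=(n, 3))
logit = -2.0 + 0.5 * prs + 0.2 * covariates[:, 0]
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(float)

beta, ll_full = fit_logistic(np.column_stack([prs, covariates]), y)

# Intercept-only log-likelihood has a closed form (assumed null model)
p0 = y.mean()
ll_null = n * (p0 * np.log(p0) + (1 - p0) * np.log(1 - p0))

r2 = nagelkerke_r2(ll_full, ll_null, n)
odds_ratio = np.exp(beta[1])  # odds ratio per 1-SD increase in PRS
```

With statsmodels, the same log-likelihoods are exposed as the `llf` and `llnull` attributes of a fitted `Logit` result.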
Quantification and statistical analysis
This study develops statistical genetic procedures outlined in the sections above. No additional statistical tests were performed.
Published: July 25, 2025
Footnotes
Supplemental information can be found online at https://doi.org/10.1016/j.crmeth.2025.101115.
References
- 1. Forrest I.S., Petrazzini B.O., Duffy Á., Park J.K., Marquez-Luna C., Jordan D.M., Rocheleau G., Cho J.H., Rosenson R.S., Narula J., et al. Machine learning-based marker for coronary artery disease: derivation and validation in two longitudinal cohorts. Lancet. 2023;401:215–225. doi: 10.1016/S0140-6736(22)02079-7.
- 2. Forrest I.S., Petrazzini B.O., Duffy Á., Park J.K., O’Neal A.J., Jordan D.M., Rocheleau G., Nadkarni G.N., Cho J.H., Blazer A.D., Do R. A machine learning model identifies patients in need of autoimmune disease testing using electronic health records. Nat. Commun. 2023;14:2385. doi: 10.1038/s41467-023-37996-7.
- 3. Chen R., Petrazzini B.O., Malick W.A., Rosenson R.S., Do R. Prediction of Venous Thromboembolism in Diverse Populations Using Machine Learning and Structured Electronic Health Records. Arterioscler. Thromb. Vasc. Biol. 2024;44:491–504. doi: 10.1161/ATVBAHA.123.320331.
- 4. Cosentino J., Behsaz B., Alipanahi B., McCaw Z.R., Hill D., Schwantes-An T.-H., Lai D., Carroll A., Hobbs B.D., Cho M.H., et al. Inference of chronic obstructive pulmonary disease with deep learning on raw spirograms identifies new genetic loci and improves risk models. Nat. Genet. 2023;55:787–795. doi: 10.1038/s41588-023-01372-4.
- 5. Gomes B., Singh A., O’Sullivan J.W., Schnurr T.M., Goddard P.C., Loong S., Amar D., Hughes J.W., Kostur M., Haddad F., et al. Genetic architecture of cardiac dynamic flow volumes. Nat. Genet. 2024;56:245–257. doi: 10.1038/s41588-023-01587-5.
- 6. Pirruccello J.P., Di Achille P., Nauffal V., Nekoui M., Friedman S.F., Klarqvist M.D.R., Chaffin M.D., Weng L.-C., Cunningham J.W., Khurshid S., et al. Genetic analysis of right heart structure and function in 40,000 people. Nat. Genet. 2022;54:792–803. doi: 10.1038/s41588-022-01090-3.
- 7. Haas M.E., Pirruccello J.P., Friedman S.N., Wang M., Emdin C.A., Ajmera V.H., Simon T.G., Homburger J.R., Guo X., Budoff M., et al. Machine learning enables new insights into genetic contributions to liver fat accumulation. Cell Genom. 2021;1. doi: 10.1016/j.xgen.2021.100066.
- 8. Dahl A., Thompson M., An U., Krebs M., Appadurai V., Border R., Bacanu S.-A., Werge T., Flint J., Schork A.J., et al. Phenotype integration improves power and preserves specificity in biobank-based genetic studies of major depressive disorder. Nat. Genet. 2023;55:2082–2093. doi: 10.1038/s41588-023-01559-9.
- 9. Burstein D., Griffen T.C., Therrien K., Bendl J., Venkatesh S., Dong P., Modabbernia A., Zeng B., Mathur D., Hoffman G., et al. Genome-wide analysis of a model-derived binge eating disorder phenotype identifies risk loci and implicates iron metabolism. Nat. Genet. 2023;55:1462–1470. doi: 10.1038/s41588-023-01464-1.
- 10. Kim H., Ahn Y., Yoon J., Jung K., Kim S., Shim I., Park T.H., Ko H., Jung S.-H., Kim J., et al. Genome-wide association analyses using machine learning-based phenotyping reveal genetic architecture of occupational creativity and overlap with psychiatric disorders. Psychiatry Res. 2024;333. doi: 10.1016/j.psychres.2024.115753.
- 11. Petrazzini B.O., Forrest I.S., Rocheleau G., Vy H.M.T., Márquez-Luna C., Duffy Á., Chen R., Park J.K., Gibson K., Goonewardena S.N., et al. Exome sequence analysis identifies rare coding variants associated with a machine learning-based marker for coronary artery disease. Nat. Genet. 2024;56:1412–1419. doi: 10.1038/s41588-024-01791-x.
- 12. Hujoel M.L.A., Gazal S., Loh P.-R., Patterson N., Price A.L. Liability threshold modeling of case–control status and family history of disease increases association power. Nat. Genet. 2020;52:541–547. doi: 10.1038/s41588-020-0613-6.
- 13. Miao J., Wu Y., Sun Z., Miao X., Lu T., Zhao J., Lu Q. Valid inference for machine learning-assisted genome-wide association studies. Nat. Genet. 2024;56:2361–2369. doi: 10.1038/s41588-024-01934-0.
- 14. McCaw Z.R., Gao J., Lin X., Gronsbell J. Synthetic surrogates improve power for genome-wide association studies of partially missing phenotypes in population biobanks. Nat. Genet. 2024;56:1527–1536. doi: 10.1038/s41588-024-01793-9.
- 15. Garg M., Karpinski M., Matelska D., Middleton L., Burren O.S., Hu F., Wheeler E., Smith K.R., Fabre M.A., Mitchell J., et al. Disease prediction with multi-omics and biomarkers empowers case–control genetic discoveries in the UK Biobank. Nat. Genet. 2024;56:1821–1831. doi: 10.1038/s41588-024-01898-1.
- 16. An U., Pazokitoroudi A., Alvarez M., Huang L., Bacanu S., Schork A.J., Kendler K., Pajukanta P., Flint J., Zaitlen N., et al. Deep learning-based phenotype imputation on population-scale biobank data increases genetic discoveries. Nat. Genet. 2023;55:2269–2276. doi: 10.1038/s41588-023-01558-w.
- 17. Chen R., Petrazzini B.O., Duffy Á., Rocheleau G., Jordan D., Bansal M., Do R. Trans-ancestral rare variant association study with machine learning-based phenotyping for metabolic dysfunction-associated steatotic liver disease. Genome Biol. 2025;26:50. doi: 10.1186/s13059-025-03518-5.
- 18. Turley P., Walters R.K., Maghzian O., Okbay A., Lee J.J., Fontana M.A., Nguyen-Viet T.A., Wedow R., Zacher M., Furlotte N.A., et al. Multi-trait analysis of genome-wide association summary statistics using MTAG. Nat. Genet. 2018;50:229–237. doi: 10.1038/s41588-017-0009-4.
- 19. Emdin C.A., Haas M., Ajmera V., Simon T.G., Homburger J., Neben C., Jiang L., Wei W.-Q., Feng Q., Zhou A., et al. Association of genetic variation with cirrhosis: a multi-trait genome-wide association and gene-environment interaction study. Gastroenterology. 2021;160:1620–1633.e13. doi: 10.1053/j.gastro.2020.12.011.
- 20. Khunsriraksakul C., Li Q., Markus H., Patrick M.T., Sauteraud R., McGuire D., Wang X., Wang C., Wang L., Chen S., et al. Multi-ancestry and multi-trait genome-wide association meta-analyses inform clinical risk prediction for systemic lupus erythematosus. Nat. Commun. 2023;14:668. doi: 10.1038/s41467-023-36306-5.
- 21. Aragam K.G., Jiang T., Goel A., Kanoni S., Wolford B.N., Atri D.S., Weeks E.M., Wang M., Hindy G., Zhou W., et al. Discovery and systematic characterization of risk variants and genes for coronary artery disease in over a million participants. Nat. Genet. 2022;54:1803–1815. doi: 10.1038/s41588-022-01233-6.
- 22. Suzuki K., Hatzikotoulas K., Southam L., Taylor H.J., Yin X., Lorenz K.M., Mandla R., Huerta-Chagoya A., Melloni G.E.M., Kanoni S., et al. Genetic drivers of heterogeneity in type 2 diabetes pathophysiology. Nature. 2024;627:347–357. doi: 10.1038/s41586-024-07019-6.
- 23. Brar P., Kwon G.Y., Holleran S., Bai D., Tall A.R., Ramakrishnan R., Green P.H.R. Change in Lipid Profile in Celiac Disease: Beneficial Effect of Gluten-Free Diet. Am. J. Med. 2006;119:786–790. doi: 10.1016/j.amjmed.2005.12.025.
- 24. Lewis N.R., Sanders D.S., Logan R.F.A., Fleming K.M., Hubbard R.B., West J. Cholesterol profile in people with newly diagnosed coeliac disease: a comparison with the general population and changes following treatment. Br. J. Nutr. 2009;102:509–513. doi: 10.1017/S0007114509297248.
- 25. Gratten J., Visscher P.M. Genetic pleiotropy in complex traits and diseases: implications for genomic medicine. Genome Med. 2016;8:78. doi: 10.1186/s13073-016-0332-x.
- 26. Sekine K., Nagata N., Sakamoto K., Arai T., Shimbo T., Shinozaki M., Okubo H., Watanabe K., Imbe K., Mikami S., et al. Abdominal visceral fat accumulation measured by computed tomography associated with an increased risk of gallstone disease. J. Gastroenterol. Hepatol. 2015;30:1325–1331. doi: 10.1111/jgh.12965.
- 27. Han X., Gharahkhani P., Hamel A.R., Ong J.S., Rentería M.E., Mehta P., Dong X., Pasutto F., Hammond C., Young T.L., et al. Large-scale multitrait genome-wide association analyses identify hundreds of glaucoma risk loci. Nat. Genet. 2023;55:1116–1125. doi: 10.1038/s41588-023-01428-5.
- 28. Beavers D.L., Landstrom A.P., Chiang D.Y., Wehrens X.H.T. Emerging roles of junctophilin-2 in the heart and implications for cardiac diseases. Cardiovasc. Res. 2014;103:198–205. doi: 10.1093/cvr/cvu151.
- 29. Lan F., Zhang N., Zhang J., Krysko O., Zhang Q., Xian J., Derycke L., Qi Y., Li K., Liu S., et al. Forkhead box protein 3 in human nasal polyp regulatory T cells is regulated by the protein suppressor of cytokine signaling 3. J. Allergy Clin. Immunol. 2013;132:1314–1321. doi: 10.1016/j.jaci.2013.06.010.
- 30. Vanslette A.M., Toft P.B., Lund M.L., Moritz T., Arora T. Serotonin receptor 4 agonism prevents high fat diet induced reduction in GLP-1 in mice. Eur. J. Pharmacol. 2023;960. doi: 10.1016/j.ejphar.2023.176181.
- 31. Oh C.-M., Park S., Kim H. Serotonin as a New Therapeutic Target for Diabetes Mellitus and Obesity. Diabetes Metab. J. 2016;40:89–98. doi: 10.4093/dmj.2016.40.2.89.
- 32. Rusina P.V., Falaguera M.J., Romero J.M.R., McDonagh E.M., Dunham I., Ochoa D. Genetic support for FDA-approved drugs over the past decade. Nat. Rev. Drug Discov. 2023;22:864. doi: 10.1038/d41573-023-00158-x.
- 33. Chen V.L., Du X., Chen Y., Kuppa A., Handelman S.K., Vohnoutka R.B., Peyser P.A., Palmer N.D., Bielak L.F., Halligan B., Speliotes E.K. Genome-wide association study of serum liver enzymes implicates diverse metabolic and liver pathology. Nat. Commun. 2021;12:816. doi: 10.1038/s41467-020-20870-1.
- 34. Karjalainen M.K., Karthikeyan S., Oliver-Williams C., Sliz E., Allara E., Fung W.T., Surendran P., Zhang W., Jousilahti P., Kristiansson K., et al. Genome-wide characterization of circulating metabolic biomarkers. Nature. 2024;628:130–138. doi: 10.1038/s41586-024-07148-y.
- 35. Pedersen E.M., Agerbo E., Plana-Ripoll O., Grove J., Dreier J.W., Musliner K.L., Bækvad-Hansen M., Athanasiadis G., Schork A., Bybjerg-Grauholm J., et al. Accounting for age of onset and family history improves power in genome-wide association studies. Am. J. Hum. Genet. 2022;109:417–432. doi: 10.1016/j.ajhg.2022.01.009.
- 36. Pedersen E.M., Agerbo E., Plana-Ripoll O., Steinbach J., Krebs M.D., Hougaard D.M., Werge T., Nordentoft M., Børglum A.D., Musliner K.L., et al. ADuLT: An efficient and robust time-to-event GWAS. Nat. Commun. 2023;14:5553. doi: 10.1038/s41467-023-41210-z.
- 37. Buniello A., Suveges D., Cruz-Castillo C., Llinares M.B., Cornu H., Lopez I., Tsukanov K., Roldán-Romero J.M., Mehta C., Fumis L., et al. Open Targets Platform: facilitating therapeutic hypotheses building in drug discovery. Nucleic Acids Res. 2025;53:D1467–D1475. doi: 10.1093/nar/gkae1128.
- 38. Chen R., Duffy Á., Petrazzini B.O., Vy H.M., Stein D., Mort M., Park J.K., Schlessinger A., Itan Y., Cooper D.N., et al. Expanding drug targets for 112 chronic diseases using a machine learning-assisted genetic priority score. Nat. Commun. 2024;15:8891. doi: 10.1038/s41467-024-53333-y.
- 39. Hill W.D., Marioni R.E., Maghzian O., Ritchie S.J., Hagenaars S.P., McIntosh A.M., Gale C.R., Davies G., Deary I.J. A combined analysis of genetically correlated traits identifies 187 loci and a role for neurogenesis and myelination in intelligence. Mol. Psychiatry. 2019;24:169–181. doi: 10.1038/s41380-017-0001-5.
- 40. Albiñana C., Zhu Z., Schork A.J., Ingason A., Aschard H., Brikell I., Bulik C.M., Petersen L.V., Agerbo E., Grove J., et al. Multi-PGS enhances polygenic prediction by combining 937 polygenic scores. Nat. Commun. 2023;14:4702. doi: 10.1038/s41467-023-40330-w.
- 41. Bick A.G., Metcalf G.A., Mayo K.R., Lichtenstein L., Rura S., Carroll R.J., Musick A., Linder J.E., Jordan I.K., Nagar S.D., et al. Genomic data in the All of Us Research Program. Nature. 2024;627:340–346. doi: 10.1038/s41586-023-06957-x.
- 42. Kurki M.I., Karjalainen J., Palta P., Sipilä T.P., Kristiansson K., Donner K.M., Reeve M.P., Laivuori H., Aavikko M., Kaunisto M.A., et al. FinnGen provides genetic insights from a well-phenotyped isolated population. Nature. 2023;613:508–518. doi: 10.1038/s41586-022-05473-8.
- 43. Liberzon A., Birger C., Thorvaldsdóttir H., Ghandi M., Mesirov J.P., Tamayo P. The Molecular Signatures Database (MSigDB) hallmark gene set collection. Cell Syst. 2015;1:417–425. doi: 10.1016/j.cels.2015.12.004.
- 44. Karczewski K.J., Gupta R., Kanai M., Lu W., Tsuo K., Wang Y., Walters R.K., Turley P., Callier S., Shah N.N., et al. Pan-UK Biobank GWAS improves discovery, analysis of genetic architecture, and resolution into ancestry-enriched effects. Preprint at medRxiv. 2024. doi: 10.1101/2024.03.13.24303864.
- 45. Bycroft C., Freeman C., Petkova D., Band G., Elliott L.T., Sharp K., Motyer A., Vukcevic D., Delaneau O., O’Connell J., et al. The UK Biobank resource with deep phenotyping and genomic data. Nature. 2018;562:203–209. doi: 10.1038/s41586-018-0579-z.
- 46. Bulik-Sullivan B.K., Loh P.-R., Finucane H.K., Ripke S., Yang J., Schizophrenia Working Group of the Psychiatric Genomics Consortium, Patterson N., Daly M.J., Price A.L., Neale B.M. LD Score regression distinguishes confounding from polygenicity in genome-wide association studies. Nat. Genet. 2015;47:291–295. doi: 10.1038/ng.3211.
- 47. Yavorska O.O., Burgess S. MendelianRandomization: an R package for performing Mendelian randomization analyses using summarized data. Int. J. Epidemiol. 2017;46:1734–1739. doi: 10.1093/ije/dyx034.
- 48. Willer C.J., Li Y., Abecasis G.R. METAL: fast and efficient meta-analysis of genomewide association scans. Bioinformatics. 2010;26:2190–2191. doi: 10.1093/bioinformatics/btq340.
- 49. Verbanck M., Chen C.-Y., Neale B., Do R. Detection of widespread horizontal pleiotropy in causal relationships inferred from Mendelian randomization between complex traits and diseases. Nat. Genet. 2018;50:693–698. doi: 10.1038/s41588-018-0099-7.
- 50. Chang C.C., Chow C.C., Tellier L.C., Vattikuti S., Purcell S.M., Lee J.J. Second-generation PLINK: rising to the challenge of larger and richer datasets. GigaScience. 2015;4:7. doi: 10.1186/s13742-015-0047-8.
- 51. Mbatchou J., Barnard L., Backman J., Marcketta A., Kosmicki J.A., Ziyatdinov A., Benner C., O’Dushlaine C., Barber M., Boutkov B., et al. Computationally efficient whole-genome regression for quantitative and binary traits. Nat. Genet. 2021;53:1097–1103. doi: 10.1038/s41588-021-00870-7.
- 52. Sheng X., Xia L., Cahoon J.L., Conti D.V., Haiman C.A., Kachuri L., Chiang C.W.K. Inverted genomic regions between reference genome builds in humans impact imputation accuracy and decrease the power of association testing. HGG Adv. 2023;4. doi: 10.1016/j.xhgg.2022.100159.
- 53. Ge T., Chen C.-Y., Ni Y., Feng Y.-C.A., Smoller J.W. Polygenic prediction via Bayesian regression and continuous shrinkage priors. Nat. Commun. 2019;10:1776. doi: 10.1038/s41467-019-09718-5.
- 54. Ruan Y., Lin Y.-F., Feng Y.-C.A., Chen C.-Y., Lam M., Guo Z., He L., Sawa A., Martin A.R., et al.; Stanley Global Asia Initiatives. Improving polygenic prediction in ancestrally diverse populations. Nat. Genet. 2022;54:573–580. doi: 10.1038/s41588-022-01054-7.
- 55. Wu Y., Byrne E.M., Zheng Z., Kemper K.E., Yengo L., Mallett A.J., Yang J., Visscher P.M., Wray N.R. Genome-wide association study of medication-use and associated disease in the UK Biobank. Nat. Commun. 2019;10:1891. doi: 10.1038/s41467-019-09572-5.
- 56. McElfresh D., Khandagale S., Valverde J., C V.P., Feuer B., Hegde C., Ramakrishnan G., Goldblum M., White C. When Do Neural Nets Outperform Boosted Trees on Tabular Data? Preprint at arXiv. 2024. doi: 10.48550/arXiv.2305.02997.
- 57. Wojcik G.L., Graff M., Nishimura K.K., Tao R., Haessler J., Gignoux C.R., Highland H.M., Patel Y.M., Sorokin E.P., Avery C.L., et al. Genetic analyses of diverse populations improves discovery for complex traits. Nature. 2019;570:514–518. doi: 10.1038/s41586-019-1310-4.
- 58. Jurgens S.J., Wang X., Choi S.H., Weng L.-C., Koyama S., Pirruccello J.P., Nguyen T., Smadbeck P., Jang D., Chaffin M., et al. Rare coding variant analysis for human diseases across biobanks and ancestries. Nat. Genet. 2024;56:1811–1820. doi: 10.1038/s41588-024-01894-5.
- 59. Ghoussaini M., Mountjoy E., Carmona M., Peat G., Schmidt E.M., Hercules A., Fumis L., Miranda A., Carvalho-Silva D., Buniello A., et al. Open Targets Genetics: systematic identification of trait-associated genes using large-scale genetics and functional genomics. Nucleic Acids Res. 2021;49:D1311–D1320. doi: 10.1093/nar/gkaa840.
- 60. Weeks E.M., Ulirsch J.C., Cheng N.Y., Trippe B.L., Fine R.S., Miao J., Patwardhan T.A., Kanai M., Nasser J., Fulco C.P., et al. Leveraging polygenic enrichments of gene features to predict genes underlying complex traits and diseases. Nat. Genet. 2023;55:1267–1276. doi: 10.1038/s41588-023-01443-6.
- 61. Zhou W., Kanai M., Wu K.-H.H., Rasheed H., Tsuo K., Hirbo J.B., Wang Y., Bhattacharya A., Zhao H., Namba S., et al. Global Biobank Meta-analysis Initiative: Powering genetic discovery across human disease. Cell Genom. 2022;2. doi: 10.1016/j.xgen.2022.100192.
- 62. Stacey D., Fauman E.B., Ziemek D., Sun B.B., Harshfield E.L., Wood A.M., Butterworth A.S., Suhre K., Paul D.S. ProGeM: a framework for the prioritization of candidate causal genes at molecular quantitative trait loci. Nucleic Acids Res. 2019;47:e3. doi: 10.1093/nar/gky837.
- 63. Shu X., Long J., Cai Q., Kweon S.-S., Choi J.-Y., Kubo M., Park S.K., Bolla M.K., Dennis J., Wang Q., et al. Identification of novel breast cancer susceptibility loci in meta-analyses conducted among Asian and European descendants. Nat. Commun. 2020;11:1217. doi: 10.1038/s41467-020-15046-w.
- 64. Keaton J.M., Kamali Z., Xie T., Vaez A., Williams A., Goleva S.B., Ani A., Evangelou E., Hellwege J.N., Yengo L., et al. Genome-wide analysis in over 1 million individuals of European ancestry yields improved polygenic risk scores for blood pressure traits. Nat. Genet. 2024;56:778–791. doi: 10.1038/s41588-024-01714-w.
- 65. Downie C.G., Dimos S.F., Bien S.A., Hu Y., Darst B.F., Polfus L.M., Wang Y., Wojcik G.L., Tao R., Raffield L.M., et al. Multi-ethnic GWAS and fine-mapping of glycaemic traits identify novel loci in the PAGE Study. Diabetologia. 2022;65:477–489. doi: 10.1007/s00125-021-05635-9.
- 66. Xue H., Shen X., Pan W. Constrained maximum likelihood-based Mendelian randomization robust to both correlated and uncorrelated pleiotropic effects. Am. J. Hum. Genet. 2021;108:1251–1269. doi: 10.1016/j.ajhg.2021.05.014.
- 67. Burgess S., Foley C.N., Allara E., Staley J.R., Howson J.M.M. A robust and efficient method for Mendelian randomization with hundreds of genetic variants. Nat. Commun. 2020;11:376. doi: 10.1038/s41467-019-14156-4.
- 68. Burgess S., Thompson S.G. Interpreting findings from Mendelian randomization using the MR-Egger method. Eur. J. Epidemiol. 2017;32:377–389. doi: 10.1007/s10654-017-0255-x.
- 69. Rees J.M.B., Wood A.M., Dudbridge F., Burgess S. Robust methods in Mendelian randomization via penalization of heterogeneous causal estimates. PLoS One. 2019;14. doi: 10.1371/journal.pone.0222362.
- 70. Burgess S., Davey Smith G., Davies N.M., Dudbridge F., Gill D., Glymour M.M., Hartwig F.P., Kutalik Z., Holmes M.V., Minelli C., et al. Guidelines for performing Mendelian randomization investigations: update for summer 2023. Wellcome Open Res. 2019;4:186. doi: 10.12688/wellcomeopenres.15555.3.
- 71. Hu X., Cai M., Xiao J., Wan X., Wang Z., Zhao H., Yang C. Benchmarking Mendelian randomization methods for causal inference using genome-wide association study summary statistics. Am. J. Hum. Genet. 2024;111:1717–1735. doi: 10.1016/j.ajhg.2024.06.016.
Data Availability Statement
- All summary statistics and PRS posterior effect sizes are available as a Zenodo repository (https://doi.org/10.5281/zenodo.14969847).
- Analysis scripts are available at the GitHub repository (https://github.com/robchiral/ML-GWAS), and a permanent version is available at the Zenodo repository. Publicly available data and code are detailed in the key resources table.
- Any additional information required to re-analyze the data reported in this study is available from the lead contact upon request.





