Abstract
Accurately quantifying uncertainty in predicted phenotypes from polygenic score (PGS)-based applications is essential for reliable clinical interpretation of PGS, supporting effective disease risk assessment and informed decision-making. Here, we present PredInterval, a nonparametric method for constructing well-calibrated prediction intervals. PredInterval is compatible with any PGS method, takes either individual-level data or summary statistics as input and relies on information from quantiles of phenotypic residuals through cross-validation to achieve well-calibrated coverage of true phenotypic values across diverse genetic architectures. We apply PredInterval to analyze 17 traits in real-data applications, where PredInterval not only represents the sole method achieving well-calibrated prediction coverage across traits, but it also offers a principled approach to identify high-risk individuals using prediction intervals, leading to an average improvement of identification rates by 8.7–830.4% compared with existing approaches. Overall, PredInterval represents a robust and versatile tool for enhancing the clinical utility of PGS.
Subject terms: Genome-wide association studies, Statistics
PredInterval quantifies phenotype prediction uncertainty in polygenic score-based applications, achieving well-calibrated prediction coverage across 17 traits tested and offering a principled approach to identify high-risk individuals.
Main
Polygenic risk assessment using polygenic scores (PGSs) has emerged as a valuable tool for genetic prediction and clinical decision-making, particularly in the context of precision medicine1–3. PGS is often calculated as a weighted summation of genotypes across many single-nucleotide polymorphisms (SNPs)4–8, where the weights are determined by the SNP effect size estimates. By aggregating the contributions of many SNPs toward a complex trait, the PGS effectively captures an individual’s genetic predisposition for the trait and, when combined with nongenetic risk factors, provides an accurate and potentially stable predictive factor for the trait9–11. With the abundant availability of data from large-scale genome-wide association studies (GWAS), PGS has become increasingly popular and has been frequently applied for genetic prediction of complex traits and diseases12–30, individual risk stratification4,31,32, pleiotropic association analysis33–37, genomic selection in animal breeding programs38,39 and transcriptome-wide association studies (TWAS)40–42. Accurate construction of PGSs can facilitate disease screening and prevention, improve diagnosis and intervention at an early stage, and aid in the development of personalized treatment.
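The weighted-summation construction described above amounts to a single matrix–vector product over genotypes and estimated SNP effect sizes. The sketch below illustrates it with hypothetical genotypes and weights (none of the numbers are drawn from any real study):

```python
import numpy as np

def polygenic_score(genotypes, weights):
    """Compute PGSs as weighted sums of SNP genotypes.

    genotypes: (n_individuals, n_snps) array of allele counts (0, 1 or 2).
    weights:   (n_snps,) array of estimated SNP effect sizes.
    """
    return np.asarray(genotypes, dtype=float) @ np.asarray(weights, dtype=float)

# Hypothetical example: 3 individuals genotyped at 4 SNPs.
G = np.array([[0, 1, 2, 0],
              [2, 0, 1, 1],
              [1, 1, 0, 2]])
beta = np.array([0.1, -0.2, 0.05, 0.3])  # illustrative effect-size estimates
scores = polygenic_score(G, beta)         # one score per individual
```

In practice, the weights come from a PGS method fitted to GWAS data; this sketch shows only the aggregation step.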
Many statistical methods have been developed for constructing PGSs43,44. These PGS methods often consist of two analytical steps: selecting variants with nonzero effects and estimating their effect sizes. Because of the finite sample size of GWAS, both of these steps introduce estimation errors, resulting in considerable uncertainty in the PGS estimate itself32. This uncertainty poses a notable challenge to the practical utility of PGS, hindering effective interpretation of PGS results and downstream analyses, such as individual-level risk stratification32. For example, a common strategy for identifying high-risk patients involves selecting individuals with high PGSs that pass the decision criterion of clinical intervention. However, relying solely on the PGS point estimate for decision-making is suboptimal because the PGS estimate may come with substantial uncertainty, which may vary among individuals. Therefore, it is imperative to quantify the uncertainty in PGS-based phenotype prediction to ensure the effective interpretation and practical application of the PGS.
Unfortunately, accurately measuring phenotype prediction uncertainty in PGS applications has proven technically challenging44. The prevailing method in the field for quantifying phenotype prediction uncertainty involves constructing prediction intervals for PGS point estimates based on the particular modeling assumption used in the PGS method. However, this approach inherently restricts the construction of prediction intervals to the particular PGS method used, limiting its applicability, especially for complex traits with diverse genetic architectures that may not align with the assumptions of the PGS method. Moreover, accomplishing accurate prediction uncertainty quantification for even simple PGS methods is challenging. For example, in the case of the standard infinitesimal model used for PGS construction, such as the best linear unbiased predictor (BLUP), the standard error of PGS can be computed in an analytical form32. However, such an analytical form is impractical to compute in large-scale datasets. Subsequently, an approximate analytical form has been suggested, which relies on the independent SNP assumption32. Yet, as we will show in this article, this approximate form does not provide accurate prediction coverage in realistic settings. Additionally, a notable challenge arises with other commonly applied PGS methods that embrace flexible modeling assumptions beyond the infinitesimal assumption. For these methods, analytical computation of the prediction intervals for PGS point estimates is generally not feasible; numerically obtaining these prediction intervals is also challenging13,14,45. For example, approximate numerical algorithms such as Markov chain Monte Carlo, implemented in some of these PGS methods, not only demand substantial computational resources but also encounter convergence difficulties with any realistic number of iterations because they often require exploration of an extremely high-dimensional parameter space17,23.
Consequently, obtaining accurate uncertainty measurements for PGS-based phenotype prediction remains a formidable challenge.
In this article, we present an alternative strategy for assessing phenotypic prediction uncertainty by directly targeting the uncertainty in predicted phenotypes, rather than the uncertainty in PGS point estimates, from a specific PGS method. This strategy circumvents the technically challenging task of quantifying the uncertainty associated with PGS point estimates, which is inherently tied to individual PGS methods. Instead, it redirects the attention toward the central task of quantifying phenotype prediction uncertainty, which is not only broadly applicable to any PGS methods but also, as we will show, turns out to be a potentially more manageable task. Moreover, by directly assessing the uncertainty in the predicted phenotypes, it is feasible to evaluate the accuracy of uncertainty quantification in real datasets. Such evaluation can now be achieved by examining phenotype prediction coverage, effectively addressing the challenges associated with assessing PGS uncertainty in real data due to the unknown true PGS underlying traits.
To enable such a strategy, we developed a statistical method, which we refer to as the PGS-based phenotype prediction interval (PredInterval). PredInterval is nonparametric by nature, leveraging the absolute difference between the observed and predicted phenotypic values to achieve well-calibrated prediction coverage across traits with diverse genetic architectures. The framework of PredInterval is general, takes either individual-level data or summary statistics as input and can be combined with any arbitrary PGS method. Furthermore, PredInterval offers a principled way of identifying high-risk individuals using prediction intervals, allowing us to move beyond simple risk stratification based on PGS point estimates. We illustrate the benefits of PredInterval through comprehensive simulations and applications to 17 traits, including ten quantitative traits and seven binary traits.
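As a simplified illustration of this idea, the sketch below uses the (1 − α) empirical quantile of held-out absolute residuals |y − PGS| as the interval half-width. This is a split-sample, conformal-style approximation written under our own simplifying assumptions, not the full CV-based procedure implemented in PredInterval:

```python
import numpy as np

def interval_from_residuals(y_holdout, pgs_holdout, pgs_new, alpha=0.05):
    """Nonparametric prediction interval sketch: the half-width is the
    (1 - alpha) empirical quantile of held-out absolute residuals |y - PGS|,
    with a conformal-style finite-sample adjustment of the quantile level."""
    resid = np.abs(np.asarray(y_holdout, float) - np.asarray(pgs_holdout, float))
    n = resid.size
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q = np.quantile(resid, level)
    pgs_new = np.asarray(pgs_new, float)
    return pgs_new - q, pgs_new + q

# Synthetic check: phenotype = PGS + noise; intervals should cover ~95%.
rng = np.random.default_rng(1)
pgs_h = rng.normal(size=2000)
y_h = pgs_h + rng.normal(size=2000)
pgs_t = rng.normal(size=2000)
y_t = pgs_t + rng.normal(size=2000)
lo, hi = interval_from_residuals(y_h, pgs_h, pgs_t)
coverage = np.mean((y_t >= lo) & (y_t <= hi))
```

The held-out residuals stand in for the CV residuals that PredInterval collects across folds; the synthetic data are purely illustrative.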
Results
Simulations
A method schematic of PredInterval is shown in Fig. 1, with details provided in the Methods. We conducted simulations to examine the performance of PredInterval and compared it with two other methods, BLUP analytical form and CalPred. Simulation details are provided in the Supplementary Note. Briefly, we simulated quantitative and binary traits with distinct genetic architectures characterized by polygenicity and SNP heritability, and, for binary traits, with different case prevalences. We examined 120 simulation settings, each with ten simulation replicates. In each setting, we used the same PGS point estimates from deterministic Bayesian sparse linear mixed model (DBSLMM) for both PredInterval and CalPred, fitted the three competing methods to construct phenotypic prediction intervals and evaluated their performance in the test data by computing the mean prediction coverage rate across simulation replicates.
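The evaluation metric used above, the prediction coverage rate, is simply the fraction of test-set phenotypes that fall inside their prediction intervals. A minimal sketch, with made-up toy values:

```python
import numpy as np

def coverage_rate(y, lower, upper):
    """Empirical coverage: fraction of observed phenotypes that fall
    inside their individual prediction intervals."""
    y, lower, upper = (np.asarray(a, dtype=float) for a in (y, lower, upper))
    return float(np.mean((y >= lower) & (y <= upper)))

# Toy example: two of the four intervals cover their observed value.
rate = coverage_rate(y=[1.0, 2.0, 3.0, 4.0],
                     lower=[0.0, 2.5, 2.0, 3.0],
                     upper=[2.0, 3.0, 4.0, 3.5])
```

A well-calibrated method should yield a rate close to the targeted confidence level (here, 0.95) on the test data.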
Fig. 1. Schematic of PredInterval.
PredInterval is a statistical method that constructs calibrated prediction intervals for quantifying uncertainty in PGS-based phenotype prediction. Left, PredInterval takes as input either individual-level GWAS data or GWAS summary statistics and can be paired with any arbitrary PGS method. Middle, PredInterval is nonparametric in nature and relies on the absolute difference between the observed phenotypic value and constructed PGS to construct prediction intervals through CV. Right, PredInterval can produce well-calibrated prediction intervals at a targeted confidence level.
Overall, PredInterval consistently generated well-calibrated prediction intervals across diverse simulation settings (Fig. 2a,b and Supplementary Figs. 1–6). Specifically, with a target coverage rate of 95%, PredInterval achieved a mean coverage rate of 96.0% (median = 96.0%, range = 95.1–97.4%) for quantitative traits across simulation settings and replicates. The mean coverage rate from PredInterval was greater than or equal to 95.0% across simulation replicates in all 24 quantitative trait simulation settings. Similarly, PredInterval achieved a mean coverage rate of 96.7% (median = 96.6%, range = 94.9–98.7%) for binary traits across simulation settings and replicates. The mean coverage rate from PredInterval was greater than or equal to 95.0% across simulation replicates in 95 of the 96 binary trait simulation settings.
Fig. 2. Comparison of the prediction coverage rates of different methods in simulations.
Compared methods included PredInterval (blue), BLUP analytical form (orange) and CalPred (green). Simulations were performed with the sample size of the training dataset to be 50,000. a–c, In the box plots, the center line represents the median, the box limits indicate the 25th and 75th percentiles and the whiskers extend to the most extreme values within 1.5 times the interquartile range from the hinges; points beyond this range are plotted individually as outliers. a, Prediction coverage rate versus different polygenicity values for quantitative traits under different simulation settings with heritability (h2) = 0.2, 0.5 and 0.8. The dashed red line represents the targeted confidence level of 0.95. b, Prediction coverage rate versus different polygenicity values for binary traits under different simulation settings with case prevalence (q) = 0.01, 0.05, 0.1 and 0.2, and heritability (h2) set to 0.5. The dashed red line represents the targeted confidence level of 0.95. c, Prediction coverage rate versus different values of the target confidence level for quantitative traits under the baseline simulation setting, where heritability (h2) = 0.5 and polygenicity (ρ) = 1. The data underlying these plots are provided as Source Data.
In contrast, the prediction intervals from BLUP analytical form and CalPred were mis-calibrated across simulation settings. Specifically, the BLUP analytical form achieved a mean coverage rate of only 91.0% (median = 94.5%, range = 75.4–95.3%) for quantitative traits across simulation settings and replicates. Its mean coverage rate was greater than or equal to 95.0% across replicates in only ten of the 24 quantitative trait simulation settings (Fig. 2a and Supplementary Fig. 1). Also, the BLUP analytical form achieved a mean coverage rate of only 83.4% (median = 89.9%, range = 23.8–99.1%) for binary traits across simulation settings and replicates (Fig. 2b and Supplementary Figs. 2–6). Its mean coverage rate was greater than or equal to 95.0% across replicates in only 28 of the 96 binary trait simulation settings, most of which contained a relatively low case prevalence (16 of the 28 settings had a case prevalence of 0.01 and 12 settings had a case prevalence of 0.05). CalPred achieved a mean coverage rate of only 80.2% (median = 84.8%, range = 67.2–91.4%) for quantitative traits across simulation settings and replicates. It failed to achieve a mean coverage rate across replicates greater than or equal to 95.0% in any of the 24 quantitative trait simulation settings. In addition, CalPred achieved a mean coverage rate of only 88.7% (median = 90.3%, range = 68.9–99.0%) for binary traits across simulation settings and replicates. It only achieved a mean coverage rate across replicates greater than or equal to 95.0% in 27 of the 96 binary trait simulation settings, with a similar pattern as the BLUP analytical form; most of these 27 settings contained a relatively low case prevalence (24 of the 27 settings had a case prevalence of 0.01 and three settings had a case prevalence of 0.05).
A detailed examination of the simulation results provides further insights. First, PredInterval produced well-calibrated prediction intervals across a range of targeted coverage rates for both quantitative and binary traits, whereas the other two methods failed to achieve calibration across the targeted coverage rates (Fig. 2c and Supplementary Fig. 7). It is noteworthy that, for binary traits, BLUP analytical form and CalPred achieved reasonable coverage when the targeted coverage rate ranged from 0.2 to 0.8, while their performance reduced substantially when the targeted coverage rate was 0.9 or 0.99. The reduced performance of BLUP analytical form and CalPred is due to these two methods achieving a targeted coverage rate of 0.8 or less by capturing most of the controls, while they failed to capture most of the cases to achieve a targeted overall coverage rate of 0.9 or more. Second, PredInterval achieved stable performance across varying sample sizes in the training data, whereas the performance of CalPred substantially deteriorated with reduced sample sizes. Specifically, for quantitative traits, the mean coverage rate of PredInterval across settings was 96.2% when the sample size was 50,000 and 95.8% when the sample size was 5,000. In contrast, the mean coverage rate of CalPred was 89.1% when the sample size was 50,000 and dropped to 71.4% when the sample size was 5,000. A similar trend was observed for binary traits: the mean coverage rate of PredInterval was 96.8% when the sample size was 50,000 and 96.5% when the sample size was 5,000, while the mean coverage rate of CalPred was 91.9% when the sample size was 50,000 and decreased to 85.5% when the sample size was 5,000. Third, PredInterval achieved calibrated performance consistently across simulation settings with different values of polygenicity, while the performance of BLUP analytical form substantially deteriorated with reduced polygenicity. 
For example, when the sample size was 50,000, PredInterval achieved mean coverage rates of 96.1%, 96.4%, 96.3% and 96.1% for quantitative traits when polygenicity was set to 0.001, 0.01, 0.1 and 1, respectively. However, the mean coverage rates of BLUP analytical form were 89.7%, 89.9%, 94.4% and 93.2%, respectively, which is probably due to its model misspecification in sparse genetic architectures. Fourth, PredInterval achieved accurate and robust performance across simulation settings with varying heritability, with the mean coverage rate increasing slightly with heritability in simulations of both quantitative and binary traits (Fig. 2a,b and Supplementary Figs. 1–6). In contrast, the mean coverage rates of BLUP analytical form and CalPred decreased with increasing heritability for both quantitative and binary traits. Finally, we carefully examined the prediction interval widths from the three competing methods and found that the mean interval widths of PredInterval were larger than those of BLUP analytical form and CalPred across settings and replicates (Supplementary Figs. 8–16 and Supplementary Note).
The PredInterval framework is general and can be applied in conjunction with any PGS method to construct prediction intervals. To illustrate this, we combined PredInterval with three other PGS methods: summary-data-based BLUP (SBLUP), PRS-CS and LDpred, with the absolute prediction accuracy of the four PGS methods for quantitative traits shown in Supplementary Fig. 17a. PredInterval achieved mean coverage rates of 0.963, 0.953 and 0.952 for quantitative traits and 0.951, 0.951 and 0.957 for binary traits when combined with the three methods, respectively (Fig. 3a and Supplementary Fig. 17b). Furthermore, the performance of PredInterval was robust across different numbers of folds in the cross-validation (CV) procedure. Specifically, PredInterval achieved mean coverage rates of 0.965, 0.962 and 0.957 for quantitative traits and 0.972, 0.966 and 0.960 for binary traits when the number of folds was 3, 5 or 10, respectively (Fig. 3b). Both the mean coverage rate and the mean interval width from PredInterval decreased as the number of folds increased, which corresponds to a larger training sample size within each fold: a larger sample size improves the accuracy of the uncertainty estimates, so the resulting coverage rate converges toward the target confidence level of 0.95 with a narrower prediction interval. Additionally, PredInterval outperformed covariate-adjusted CalPred in simulations with covariate effects (Supplementary Fig. 18 and Supplementary Note), probably because of its flexible, nonparametric design built on the CV+ framework.
Fig. 3. Performance of PredInterval in simulation settings paired with different PGS methods and set with a different number of folds in the CV procedure.
Results are shown for simulation settings with the baseline parameter setting, where the sample size of the training data = 50,000, heritability (h2) = 0.5 and polygenicity (ρ) = 1 for both quantitative and binary traits and case prevalence (q) = 0.1 for binary traits. a, Left, Box plots displaying the prediction coverage rate across ten simulation replicates of quantitative traits when PredInterval is paired with four different PGS methods. Right, Violin plots displaying the interval width for individuals in the test set across ten simulation replicates. Compared PGS methods include DBSLMM (purple), SBLUP (pink), PRS-CS (light blue) and LDpred (orange). The dashed red line represents the targeted confidence level of 0.95. In the box plots, the center line represents the median, the box limits indicate the 25th and 75th percentiles and the whiskers extend to the most extreme values within 1.5 times the interquartile range from the hinges; points beyond this range are plotted individually as outliers. b, Bar plots showing the mean prediction coverage rate across ten replicates (y axis, left) and the mean interval width for individuals in the test set across ten replicates (y axis, right) of both quantitative (light blue) and binary (yellow) traits for settings with a different number of folds (x axis) in the CV procedure of PredInterval, with the numeric value displayed above the bar. The dashed red line represents the targeted confidence level of 0.95. Individual data points for ten simulation replicates are also provided for each bar plot, with each data point representing the coverage rate (left) or mean interval width (right) in a single replicate. The data underlying these plots are provided as Source Data.
Finally, the ability of PredInterval to generate accurate phenotypic prediction intervals allows us to identify individuals at high risk in a principled way. Unlike a single point estimate produced by standard PGS methods, the prediction intervals empower us to directly identify individuals potentially at high risk by assessing whether an individual’s prediction interval covers case status or high-risk phenotypic values. To illustrate such benefits, we evaluated the performance of PredInterval and the other two methods in identifying individuals at high risk. To do so, we defined individuals with a phenotypic value in the top 5% quantile as high-risk individuals for quantitative traits and individuals who are cases as high-risk individuals for binary traits. We then applied different methods to construct prediction intervals, with which we identified individuals whose predicted phenotypic interval covers either the top 5% quantile of phenotypic values (for quantitative traits) or case status (for binary traits). Overall, PredInterval achieved the highest success rate of identifying high-risk individuals across most simulation settings compared to the other two methods (Fig. 4a), representing an average improvement of 8.0% (P = 5.12 × 10−13) and 86.3% (P = 7.66 × 10−69) for quantitative traits and 548.1% (P = 1.69 × 10−226) and 1,483.8% (P = 6.26 × 10−262) for binary traits, compared to BLUP analytical form and CalPred, respectively (Fig. 4b). Specifically, for quantitative traits, PredInterval achieved the highest success rate in 20 of 24 simulation settings, with a mean success rate of 94.7% across settings. In comparison, BLUP analytical form achieved the highest success rate in the remaining four settings with a mean success rate of 87.7%, while CalPred failed to achieve the highest success rate in any of the 24 settings with a mean success rate of 50.8%. The advantage of PredInterval is even more evident for binary traits. 
PredInterval achieved the highest success rate in all 96 simulation settings, with a mean success rate of 52.6% across settings. In contrast, BLUP analytical form and CalPred only achieved a mean success rate of 8.1% and 3.3% across settings, respectively.
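The decision rule evaluated above can be sketched as follows: an individual is flagged as potentially high risk when their prediction interval covers the high-risk region, and the success rate is the fraction of truly high-risk individuals that are flagged. The numbers below are hypothetical and purely for illustration:

```python
import numpy as np

def high_risk_success_rate(y, upper, threshold):
    """Flag an individual as potentially high risk when the upper bound of
    their prediction interval reaches the high-risk region [threshold, inf);
    return the fraction of truly high-risk individuals that are flagged."""
    y = np.asarray(y, dtype=float)
    upper = np.asarray(upper, dtype=float)
    truly_high = y >= threshold      # true high-risk status
    flagged = upper >= threshold     # interval covers the high-risk region
    return float(np.mean(flagged[truly_high]))

# Toy numbers: three truly high-risk individuals, two of whom are flagged.
rate = high_risk_success_rate(y=[2.5, 3.0, 1.0, 2.1],
                              upper=[2.2, 1.9, 3.0, 2.0],
                              threshold=2.0)
```

For binary traits, the analogous rule checks whether the interval covers case status rather than a phenotypic quantile.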
Fig. 4. Comparison of success rates in identifying high-risk individuals using different methods in the simulations.
a, Scatter plots showing the success rate of PredInterval (y axis) in identifying high-risk individuals in the test data versus the success rates of the other two methods (x axis) for the 24 simulation settings of quantitative traits (top plots, blue dots) and the 96 simulation settings of binary traits (bottom plots, red dots), with each dot representing the mean success rates across the ten replicates in a single setting for the two competing methods, respectively. b, Bar plots showing the mean success rate in identifying high-risk individuals in the test data across simulation settings and replicates for both quantitative and binary traits, with the numerical value and error bar of the 95% confidence interval displayed above the bar. Individual data points for ten simulation replicates were also provided for each bar plot, with each data point representing the mean success rate across 24 (quantitative, left) or 96 (binary, right) simulation settings for a single replicate. The displayed P values were computed using a two-sided pairwise t-test from n = 240 independent experiments for the quantitative traits and n = 960 independent experiments for the binary traits, with each experiment calculating the success rate for 10,000 individuals for one simulation setting in a single replicate. Compared methods include PredInterval (blue), BLUP analytical form (orange) and CalPred (green). The data underlying these plots are provided as Source Data.
Real-data applications
We applied PredInterval, BLUP analytical form and CalPred to analyze 12 traits in the UK Biobank (UKB), including six quantitative traits (details in Supplementary Table 1) and six binary traits (details in Supplementary Table 2). Details of the analysis are provided in the Methods. Briefly, we identified 361,112 individuals of European ancestry with 1,119,148 SNPs for analysis. We partitioned the total samples into five equally sized subsets and performed fivefold CV, treating each subset in turn as the test data and the remaining four subsets as the training data. For each phenotype in turn, we fitted the three competing methods in the training data and constructed 95% PGS prediction intervals in the test data. We evaluated the performance of the three methods using the mean prediction coverage rate across the test data.
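The fivefold partition used here can be sketched as a random split of sample indices into equally sized subsets, with each subset serving once as the test data; the fold count and seed below are illustrative:

```python
import numpy as np

def kfold_partition(n_samples, k=5, seed=0):
    """Randomly partition sample indices into k (near-)equally sized folds;
    each fold serves once as the test set, the rest as training data."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_samples), k)

folds = kfold_partition(n_samples=100, k=5)
```

Each index appears in exactly one fold, so every individual is used once for testing and four times for training across the five CV rounds.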
Overall, PredInterval produced well-calibrated prediction intervals across all 12 traits (Fig. 5a). Specifically, PredInterval achieved a mean coverage rate of 95.6% (median = 95.5%, range = 95.3–96.0%) across six quantitative traits, with a mean coverage rate greater than or equal to 95% across the five folds for all six traits. PredInterval also achieved a mean coverage rate of 95.9% (median = 95.5%, range = 94.8–97.6%) across the six binary traits, with a mean coverage rate greater than or equal to 95% across the five folds for five of the six traits (94.8% for the remaining trait). In contrast, and consistent with simulations, the prediction intervals from BLUP analytical form and CalPred were mis-calibrated across the 12 traits. Specifically, BLUP analytical form achieved a mean coverage rate of only 90.8% (median = 90.9%, range = 86.7–95.5%) across six quantitative traits, with a mean coverage rate greater than or equal to 95% across the five folds for only one of the six traits. Additionally, BLUP analytical form achieved a mean coverage rate of only 90.5% (median = 89.7%, range = 80.5–98.9%) across six binary traits, with a mean coverage rate greater than or equal to 95% across the five folds for only two of the six traits. CalPred achieved a mean coverage rate of only 91.8% (median = 91.9%, range = 90.6–92.5%) across six quantitative traits, with a mean coverage rate greater than or equal to 95% across the five folds for none of the six traits. Similarly, CalPred achieved a mean coverage rate of only 93.3% (median = 92.2%, range = 88.5–98.9%) across six binary traits, with a mean coverage rate greater than or equal to 95% across the five folds for only two of the six traits.
Fig. 5. Comparison of the prediction coverage rate of different methods for PGS-based phenotypic prediction in the UKB.
a, Jitter plots showing the prediction coverage rate of different methods for 12 traits in the test set across five folds using individual-level UKB data as training data. Compared methods include PredInterval (blue), BLUP analytical form (orange) and CalPred (green). The mean prediction coverage rate across the five folds is displayed above each jitter plot. The dashed red line represents the targeted confidence level of 0.95. b, Jitter plots showing the prediction coverage rate of the summary statistics version of PredInterval for six traits in the test set across five folds using external summary statistics as training data. The results are shown for five quantitative traits and one binary trait: high-density lipoprotein (HDL), low-density lipoprotein (LDL), total cholesterol (TC), triglycerides (TG), BMI and type 2 diabetes (T2D). The mean prediction coverage rate across the five folds is displayed above each jitter plot. The dashed red line represents the targeted confidence level of 0.95. The data underlying these plots are provided as Source Data. FVC, forced vital capacity; LYMPH, lymphocyte count; PLT, platelet count; SBP, systolic blood pressure; SH, standing height.
Careful examination of the UKB results provides us with further insights. First, consistent with the simulations, PredInterval achieved calibrated prediction coverage across traits with varying SNP heritability for both quantitative and binary traits (Fig. 5a). In contrast, the prediction coverage from CalPred exhibited a decreasing trend with increasing heritability. Second, consistent with the simulations, the interval width generated by PredInterval was larger than those produced by BLUP analytical form and CalPred for most quantitative and binary traits (Supplementary Fig. 19 and Supplementary Note). Additionally, PredInterval also achieved greater interval width variability and better coverage across ten deciles of interval width compared to the other two methods (Supplementary Figs. 20 and 21 and Supplementary Note). Third, for binary traits, consistent with the simulations, PredInterval achieved calibrated prediction coverage regardless of case prevalence. In contrast, the prediction coverage achieved by BLUP analytical form and CalPred exhibited a decreasing trend with increasing case prevalence. Indeed, both methods failed to achieve the targeted coverage rate in the four binary traits with a case prevalence equal to or greater than 0.05 (hypertension (HTN) = 0.265, asthma (AS) = 0.117, osteoarthritis (OA) = 0.088 and high cholesterol (HCH) = 0.128). For the two binary traits with a case prevalence of less than 0.05 (0.032 for angina (AG) and 0.011 for rheumatoid arthritis (RA)), both methods achieved the targeted prediction coverage level but with a relatively higher interval width (mean interval width ranging from 0.207 to 0.693) compared with PredInterval (0.419 for AG and 0.142 for RA), supporting the efficiency of PredInterval in achieving the targeted prediction coverage. We also examined the computational efficiency of different PGS methods when pairing with PredInterval. 
The analytical solution-based method DBSLMM compares favorably with the other PGS methods with relatively low computing time and memory cost, which aligns with the observation reported in a previous study46 (Supplementary Table 3).
Finally, we examined the performance of PredInterval in identifying high-risk individuals across traits. Consistent with the simulations, PredInterval was more effective than the other two methods in identifying high-risk individuals for both quantitative and binary traits (Fig. 6a). Specifically, PredInterval achieved a mean success rate of 95.1% in identifying high-risk individuals across the six quantitative traits, representing 8.7%, 20.6% and 416.8% improvement over BLUP analytical form (success rate = 87.5%), CalPred (78.8%) and directly using PGS point estimates (18.4%), respectively. In addition, PredInterval achieved the highest success rate for three of the six quantitative traits, with an average improvement of 14.0% over the second-best method. For the three quantitative traits where PredInterval was not the best, its performance was on average only 1.0% lower than that of the best method. As in the simulations, the advantage of PredInterval over the other methods is more evident for binary traits, especially for traits with high case prevalence (Fig. 6b). Specifically, PredInterval achieved a mean success rate of 41.1% in identifying high-risk individuals across the six binary traits, representing 830.4% and 127.2% improvement over BLUP analytical form (success rate = 4.4%) and CalPred (18.1%), respectively. In particular, for the four binary traits with a case prevalence equal to or greater than 0.05, PredInterval achieved a success rate of 84.5%, 58.5%, 41.8% and 62.1% across the five folds for HTN, AS, OA and HCH, respectively. In contrast, the success rates for the four traits were 26.5%, 0%, 0% and 0% for BLUP analytical form and 74.5%, 2.4%, 0.9% and 30.8% for CalPred, respectively (Fig. 6b). For the remaining two binary traits with a case prevalence of less than 0.05 (AG and RA), all three methods achieved zero success rate in identifying cases across the five folds, underscoring the difficulties posed by low case prevalence. 
The overall results suggest that the well-calibrated prediction coverage achieved by PredInterval translates to the effective identification of high-risk individuals.
Fig. 6. Comparison of success rates in identifying high-risk individuals using different methods in the UKB.
a, Scatter plots showing the success rate of PredInterval (y axis) in identifying high-risk individuals in the test data versus the success rates of the other two methods (x axis) across five folds for six quantitative traits (top plots, blue dots) and six binary traits (bottom plots, red dots), with each dot representing the success rate in a single fold for the two competing methods, respectively. b, Bar plots showing the mean success rate in identifying high-risk individuals in the test data across five folds for six quantitative traits and four binary traits, with the numerical value and error bar of the 95% confidence interval displayed above the bar. Individual data points for five folds are also provided for each bar plot, with each data point representing the success rate for the trait in a single fold. The displayed P values were computed using a two-sided pairwise t-test from n = 5 independent experiments for each trait, with each experiment calculating the success rate for 72,122 individuals in a single fold. Compared methods include PredInterval (blue), BLUP analytical form (orange) and CalPred (green). The data underlying these plots are provided as Source Data.
Besides investigating the calibration of PredInterval using individual-level UKB data as training data, we also applied the summary statistics version of PredInterval to body mass index (BMI) and five additional traits and examined its performance in the UKB. Note that CalPred requires SNP weights in combination with individual-level genotype, phenotype and covariate data during the model fitting step; therefore, it cannot be included in this evaluation, which is based solely on summary statistics. The details of the analysis are provided in the Methods. Briefly, we obtained publicly available external summary statistics independent of the UKB to serve as training data for PredInterval. We partitioned the total UKB samples into five equally sized folds and performed fivefold CV, with one half of each fold serving as the calibration data and the other half as the test data. For each trait in turn, we fitted the summary statistics-based PredInterval in the training data, performed calibration in the calibration data and computed the 95% PGS prediction interval in the test data. The results from the summary statistics version of PredInterval closely match the previous results obtained using the individual-level version of PredInterval (Fig. 5b). Specifically, the summary statistics version achieved a mean coverage rate of 95.0% (median = 95.0%, range = 94.9–95.1%) across five quantitative traits, with a mean coverage rate greater than or equal to 95% across the five folds for four of the five traits (94.9% for the remaining trait). It also achieved a mean coverage rate of 96.3% across the five folds for the binary trait. In addition, we compared the performance of the individual-level and summary statistics versions of PredInterval by fitting both versions to the 12 traits in the UKB, where the summary statistics version achieved calibration performance consistent with that of the individual-level version (Supplementary Fig. 22 and Supplementary Note). These results validate the consistency between the summary statistics and individual-level versions of PredInterval, demonstrating its broad utility. The results also confirm that PredInterval consistently achieves calibrated prediction coverage, aligning with the theoretical guarantee of the CV+ framework on which it is based.
Discussion
We have presented PredInterval, a statistical method designed to produce well-calibrated phenotype prediction intervals. PredInterval uses a general nonparametric framework, takes either individual-level data or summary statistics as input and offers flexibility for pairing with any PGS method. It is tailored not only for quantifying PGS-based phenotype prediction uncertainty but also for effective identification of high-risk individuals. Importantly, PredInterval is linked to the marginal evidence function used in marginal likelihood for scoring and model selection47, potentially enabling further computational improvements. We have demonstrated the benefits of PredInterval through comprehensive simulations and applications to 17 complex traits.
PredInterval is nonparametric in nature and constructs phenotype prediction intervals by leveraging the quantiles of phenotypic residuals obtained from training data through CV. Such a residual-based modeling framework is flexible, capable of incorporating phenotypic prediction uncertainty introduced from various sources, including PGS as focused on in the present study, as well as nongenetic factors and other covariates. Thus, PredInterval can be paired with any arbitrary PGS method, despite their distinct modeling assumptions on SNP effect size distribution and the subsequent variation in prediction accuracy and uncertainty. In addition, PredInterval is applicable to scenarios where nongenetic factors and additional covariates are available alongside the PGS to improve phenotype prediction. In these cases, nongenetic factors and other covariates can be simply aggregated with PGSs into a combined risk score, which can be processed through PredInterval to produce calibrated prediction intervals. Therefore, PredInterval ensures well-calibrated coverage of true phenotypic values across diverse settings and across traits with distinct genetic architectures, facilitating the reliable interpretation of PGS results and clinical decision-making32,48.
PredInterval relies on the phenotypic residuals from training data to construct prediction intervals in the target data. Consequently, its efficacy is contingent on both training and target data containing individuals of similar genetic ancestry. In the present study, we primarily focused on settings where both training and target data consisted of individuals of European ancestry. Challenges may arise when the training and target data include ancestrally diverse populations because allele frequency and linkage disequilibrium (LD) pattern, causal variants and their effect sizes, and the sample size of multi-ancestry GWAS data can all vary substantially across ancestries49,50. Therefore, future extensions of PredInterval are necessary to accommodate cross-ancestry prediction settings.
Methods
Ethics statement
This study used individual-level genotype and phenotype data from the UKB, which had obtained approval from its institutional review board and secured written informed consent from all participants. The research, conducted under UKB application no. 24460, involved the construction of an SNP–SNP correlation matrix using the UKB genotype data, which is accessible to registered researchers through the UKB data access protocol. Our study complies with all pertinent ethical regulations. This study was approved by the University of Michigan institutional review board (no. HUM00156494).
Method overview
Our objective was to compute calibrated prediction intervals for a phenotype of interest, using any PGS method, in GWAS. To do so, we adapted the CV+ framework, which is an extension of jackknife+ designed to quantify general prediction uncertainty51, toward constructing prediction intervals using any PGS method through K-fold CV. To set up notation, we considered a GWAS with N individuals. We denote $Y_i$ as the phenotypic measurement for the i-th individual and $X_i$ as the P-vector of genotypes measured on P SNPs for the same individual, with $i \in \{1, \dots, N\}$. We centered and standardized the phenotype and each SNP genotype to have a mean of zero and an s.d. of one. We partitioned the N individuals equally into K disjoint subsets $S_1, \dots, S_K$, where each subset $S_k$ has a sample size $n_k$ approximately equal to N/K. For each subset in turn, we treated the remaining K − 1 subsets as the training data, fitted a PGS method of choice there to obtain SNP weights and applied the SNP weights to construct the PGS for individuals in the subset of focus. For individuals in the subset $S_k$, their PGS are constructed as:
$$\hat{Y}_{S_k} = \hat{\mu}_{-S_k}\left(X_{S_k}\right), \qquad (1)$$
where $\hat{Y}_{S_k}$ denotes the $n_k$-vector of the PGS constructed for individuals in subset $S_k$ using the training data with the subset $S_k$ removed; $\hat{\mu}_{-S_k}(\cdot)$ denotes the PGS construction procedure described above, which involves first fitting the PGS model on the training data $(X_j, Y_j)$ for $j \in \{1, \dots, N\} \setminus S_k$ and then applying the estimated SNP weights to construct the PGS for individuals in $S_k$. Using this procedure, a PGS is constructed for every individual. Afterwards, we quantified the difference between the constructed PGS and the observed phenotypic value by calculating the absolute residual:
$$R_i = \left| Y_i - \hat{\mu}_{-S_{k(i)}}\left(X_i\right) \right|, \qquad (2)$$
where $S_{k(i)}$ denotes the subset that contains individual i, that is, $i \in S_{k(i)}$; $\hat{\mu}_{-S_{k(i)}}(X_i)$ denotes the constructed PGS for individual i.
Based on these residuals, for a new individual with genotype vector $X_{N+1}$, we first constructed K different PGSs for the individual, represented as $\hat{\mu}_{-S_k}(X_{N+1})$ for $k \in \{1, \dots, K\}$, each derived from the training data with one subset $S_k$ removed. We calculated the mean of these K PGSs to serve as the final PGS for the individual. In addition, from each k-th PGS $\hat{\mu}_{-S_k}(X_{N+1})$, we subtracted the $n_k$ absolute residuals in the subset ($R_i$ for $i \in S_k$), in the form of $\hat{\mu}_{-S_k}(X_{N+1}) - R_i$, to create a set of possible lower values for the individual's phenotypic measurement that could have been observed because of sampling variability. We combined the set of lower values across the K subsets and identified the lower α-th percentile value among them as the lower bound of the prediction interval targeted at a coverage level of 1 − α. Note that the lower α-th percentile of the absolute residuals corresponds to the α/2-th percentile in the lower tail of the residual distribution, thus further corresponding to a targeted coverage level of 1 − α. Similarly, to each of the K PGSs $\hat{\mu}_{-S_k}(X_{N+1})$, we added the $n_k$ absolute residuals in the subset $S_k$ ($R_i$ for $i \in S_k$), in the form of $\hat{\mu}_{-S_k}(X_{N+1}) + R_i$, to create a set of possible higher values for the individual's phenotypic measurement that could have been observed because of sampling variability. We also combined the set of higher values across the K subsets and identified the upper α-th percentile value among them as the upper bound of the prediction interval targeted at a coverage level of 1 − α. Consequently, the resulting prediction interval takes the following form:
$$\left[ \hat{q}^{-}_{\alpha}\left\{ \hat{\mu}_{-S_{k(i)}}\left(X_{N+1}\right) - R_i \right\},\ \hat{q}^{+}_{\alpha}\left\{ \hat{\mu}_{-S_{k(i)}}\left(X_{N+1}\right) + R_i \right\} \right], \qquad (3)$$
where $\hat{q}^{-}_{\alpha}$ and $\hat{q}^{+}_{\alpha}$ denote the lower and upper α-th percentiles, respectively.
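The interval construction in equations (1)–(3) can be sketched in a few lines of code. The snippet below is a minimal illustration of the CV+ recipe only, with ordinary least squares standing in for an actual PGS method such as DBSLMM; all function and variable names are our own and do not come from the PredInterval software.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_weights(X, y):
    # Stand-in for any PGS method: ordinary least squares SNP weights.
    return np.linalg.lstsq(X, y, rcond=None)[0]

def cv_plus(X, y, X_new, K=5, alpha=0.05):
    """Sketch of CV+ prediction intervals following equations (1)-(3)."""
    N = len(y)
    fold_of = np.arange(N) % K           # fold assignment k(i)
    weights, residuals = [], np.empty(N)
    for k in range(K):
        train = fold_of != k
        w = fit_weights(X[train], y[train])   # fit with subset S_k removed
        weights.append(w)
        test = fold_of == k
        residuals[test] = np.abs(y[test] - X[test] @ w)  # equation (2)
    W = np.stack(weights)                 # (K, P) per-fold SNP weights
    pgs_new = X_new @ W.T                 # (n_new, K): K PGSs per new individual
    # Lower/upper candidate sets, one entry per training individual i.
    lower = np.quantile(pgs_new[:, fold_of] - residuals, alpha, axis=1)
    upper = np.quantile(pgs_new[:, fold_of] + residuals, 1 - alpha, axis=1)
    return pgs_new.mean(axis=1), lower, upper   # point PGS = mean of K PGSs

# Toy data: 20 "SNPs", linear genetic signal plus noise.
P = 20
beta = rng.normal(0, 0.3, P)
X, Xn = rng.normal(size=(1000, P)), rng.normal(size=(2000, P))
y = X @ beta + rng.normal(size=1000)
yn = Xn @ beta + rng.normal(size=2000)
point, lo, hi = cv_plus(X, y, Xn, alpha=0.05)
coverage = np.mean((yn >= lo) & (yn <= hi))   # empirical prediction coverage
```

On this toy example the empirical coverage lands near the 95% target, illustrating the calibration property the method inherits from CV+.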
The constructed prediction interval in equation (3) is guaranteed to achieve coverage of the true phenotypic value $Y_{N+1}$ at a level of at least 1 − α (ref. 51). However, the realized coverage level for any practical dataset with a finite sample size tends to be conservative, especially when the sample size of the training data is small. Consequently, opting for a small number of folds K in CV, which is equivalent to a smaller training data size, may lead to an overly conservative prediction coverage level. In contrast, choosing a larger K, while yielding a larger training data size, incurs a heavier computational cost because the PGS model needs to be fitted K times during the K-fold CV. To strike a balance between prediction interval precision and computational efficiency, we primarily used K = 5 throughout the present study.
The above prediction interval construction framework is general and can be applied with any PGS method that accepts either individual-level data or summary statistics as input. For the purpose of this study, we used DBSLMM45 as the primary PGS method for illustrative purposes and examined several other PGS methods to showcase the generalizability of our framework. DBSLMM relies on a flexible distributional assumption for SNP effect sizes, ensuring accurate and robust PGS construction. It also uses a deterministic inference algorithm for scalable computation. In our study, we used DBSLMM to estimate the SNP effect sizes in each training dataset using GWAS summary statistics along with an LD matrix calculated using genotype data from an external reference panel (details in the following sections).
Obtaining calibrated prediction intervals enables the identification of high-risk individuals. Specifically, PGS point estimates allow for the ranking of individuals based on their predicted PGSs, facilitating risk stratification; individuals with high PGSs probably present higher risks than individuals with lower scores. However, risk stratification alone is not sufficient for identifying individuals at high risk because the PGS cutoff used to designate high-risk individuals is often unknown in practice. In contrast, obtaining prediction intervals allows us to move beyond simple risk stratification: we can identify individuals at high risk by examining whether their prediction intervals cover phenotypic values associated with high risk. We developed such a strategy, detailed in the following sections, and explored its use for identifying high-risk individuals in the present study.
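The interval-based selection rule just described can be sketched as follows. The threshold (top 5% of training phenotypic values, as used in the real-data applications later) and the function names are our own illustration, not part of the PredInterval software.

```python
import numpy as np

def flag_high_risk(upper_bounds, y_train, top=0.05):
    # Flag an individual as high risk when its prediction interval covers a
    # phenotypic value above the top `top` quantile of training phenotypes,
    # that is, when the interval's upper bound reaches past that threshold.
    threshold = np.quantile(y_train, 1 - top)
    return upper_bounds >= threshold

def success_rate(flagged, truly_high_risk):
    # Proportion of the selected individuals who are truly high risk.
    if not np.any(flagged):
        return 0.0
    return float(np.mean(truly_high_risk[flagged]))

# Usage: flags = flag_high_risk(hi, y_train); rate = success_rate(flags, truth)
```

The same rule extends to binary traits by checking whether the interval covers the case status (phenotypic value of one).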
We refer to the aforementioned method as PredInterval, which accepts either individual-level phenotype and genotype data or summary statistics (more details below) as inputs. Importantly, the PredInterval framework outputs prediction intervals to quantify prediction uncertainty but does not adjust the PGS, which remains practically unchanged before and after the construction of the intervals. PredInterval is freely available at https://xiangzhou.github.io/software/.
Summary statistics-based PredInterval framework
The above PredInterval framework is described based on individual-level data. In this section, we extend it to use summary statistics for inference. The main idea behind the summary statistics extension of PredInterval is to replace the original CV procedure, which relies on individual-level data, with CV performed directly on summary statistics. To do so, we paired PredInterval with the PUMAS framework19, which can sample marginal association statistics for a subset of individuals based on the complete GWAS summary data. Using this approach, summary statistics for the training subset can be generated without partitioning the samples, allowing these statistics to serve as the input for the PGS model fitting step of PredInterval. Specifically, we denote $x^Ty$ as the observed P × 1 vector of summary statistics for a total of N individuals and P SNPs, and $x^{(tr)T}y^{(tr)}$ as the summary statistics for the training subset of $N_{tr}$ individuals that are sampled from the N individuals. The technical details of the PUMAS framework are provided in ref. 19. Briefly, it can be shown that
$$x^{(tr)T}y^{(tr)} \mid x^Ty \sim \mathrm{MVN}\left( \frac{N_{tr}}{N}\, x^Ty,\ \frac{N_{tr}\left(N - N_{tr}\right)}{N}\, \hat{\Sigma} \right),$$
where $\hat{\Sigma}$ is the observed covariance matrix for $x^Ty$ from the GWAS data. The subsampled GWAS summary statistics for the training set can then be obtained by
$$x_j^{(tr)T}y^{(tr)} \sim \mathcal{N}\left( \frac{N_{tr}}{N}\, N\hat{\sigma}_j^2\hat{\beta}_j,\ \frac{N_{tr}\left(N - N_{tr}\right)}{N}\, \hat{\Sigma}_{jj} \right), \quad \hat{\Sigma}_{jj} = \hat{\sigma}_j^2\left( N\hat{\sigma}_j^2\, \mathrm{se}(\hat{\beta}_j)^2 + \hat{\sigma}_j^2\hat{\beta}_j^2 \right),$$
where $\hat{\sigma}_j^2$ is an estimator of the genotype variance for the j-th SNP, based on its minor allele frequency; $\hat{\beta}_j$ and $\mathrm{se}(\hat{\beta}_j)$ are the regression coefficient and its s.e. for the j-th SNP obtained from the full summary statistics, respectively. Afterwards, we fitted the PGS models to compute the SNP effect size estimates based on the subsampled GWAS summary statistics, computed phenotypic residuals in an independent calibration dataset and then applied PredInterval to construct prediction intervals for individuals in the test set.
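A rough sketch of this subsampling step is given below. It uses a diagonal (independent-SNP) approximation with $\hat{\sigma}_j^2 = 2f_j(1-f_j)$ from the MAF $f_j$, $x_j^Ty \approx N\hat{\sigma}_j^2\hat{\beta}_j$ and a variance term built from $\mathrm{se}(\hat{\beta}_j)$; the exact estimators used by PUMAS differ in detail (ref. 19), so treat this as an assumption-laden illustration rather than the actual implementation.

```python
import numpy as np

def subsample_sumstats(beta, se, maf, N, N_tr, rng):
    """Sample training-subset marginal statistics from full GWAS summary data.

    Sketch of a PUMAS-style subsampling step under a diagonal
    (independent-SNP) approximation; all estimators here are assumptions.
    beta, se, maf: per-SNP arrays of effect estimates, s.e. and MAF.
    """
    var_g = 2 * maf * (1 - maf)                    # genotype variance from MAF
    xty = N * var_g * beta                         # approx. x_j^T y
    var_y = N * var_g * se**2 + var_g * beta**2    # approx. phenotype variance
    sigma_jj = var_g * var_y                       # per-observation variance proxy
    mean = (N_tr / N) * xty
    var = (N_tr * (N - N_tr) / N) * sigma_jj
    xty_tr = rng.normal(mean, np.sqrt(var))        # sampled training statistics
    return xty_tr / (N_tr * var_g)                 # back to per-SNP effect sizes

# Usage: beta_tr = subsample_sumstats(beta, se, maf, N=100_000, N_tr=80_000,
#                                     rng=np.random.default_rng(1))
```

Averaged over many SNPs, the subsampled effect sizes remain centered on the full-data estimates, which is the property the CV-on-summary-statistics procedure relies on.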
Compared methods
We compared our method with two existing methods: the BLUP prediction interval in an approximate analytical form introduced in Ding et al.32 and CalPred52. The analytical form of the BLUP prediction interval is derived under an infinitesimal model and relies on the independent SNP assumption, while CalPred is currently the only existing software for PGS prediction interval construction that can be paired with any arbitrary PGS method. To construct the BLUP phenotype prediction interval, we first obtained the BLUP estimates using the GCTA (Genome-wide Complex Trait Analysis) software (v.1.93.2beta) in the training data and computed the PGS predictors using the ‘score’ function of PLINK53 in the test data. Afterwards, we implemented the following formula from Ding et al.32 in R (v.4.3.2) to estimate the uncertainty of the PGS: $\widehat{\mathrm{var}}(\widehat{\mathrm{PGS}}_i) = \sum_{j=1}^{M} x_{ij}^2\, \frac{\hat{h}^2/M}{1 + N\hat{h}^2/M}$, where $x_{ij}$ is the standardized genotype at SNP j for individual i, M is the number of SNPs, N is the sample size of the training data and $\hat{h}^2$ is the SNP heritability. Finally, we transformed these uncertainty estimates of the PGS into uncertainty estimates of the phenotype by adding the residual error variance, in the form of $1 - \hat{h}^2$ for quantitative traits and $P(1-P)(1-\hat{h}^2)$ for binary traits, where P is the case prevalence in the training data for binary traits. For CalPred, we used the CalPred software (v.0.1.1) with default settings for model fitting. In CalPred, we followed Hou et al.52 and fitted the model with context-specific adjustment using sex, age, age2, genotyping array and top 20 principal components (PCs) as covariates. To ensure a fair comparison, we primarily used the same PGS method, DBSLMM (v.0.3) with default parameter settings, to compute the SNP effect size estimates as input for both PredInterval and CalPred.
In particular, we applied DBSLMM to estimate the SNP effect size in the training data, computed the PGS predictors in both training and test data using the ‘score’ function in PLINK, and applied PredInterval and CalPred to construct prediction intervals in the test data. In both simulations and real-data applications, we assessed the performance of PredInterval, BLUP analytical form and CalPred using the prediction coverage rate as the evaluation metric, defined as the frequency of the true phenotypic value lying within the prediction interval.
In addition to the primary PGS method (DBSLMM), we also examined the performance of PredInterval when combined with three other PGS methods through simulations: SBLUP54, PRS-CS17 and LDpred23. These methods were chosen for their different model fitting algorithms, with SBLUP being a BLUP-based method and PRS-CS and LDpred being Bayesian methods. For SBLUP, we used the GCTA software to fit the model. In SBLUP, as in Maier et al.55, we set the LD window size to 2,000 kb and calculated the input shrinkage parameter as $\lambda = M(1/h^2 - 1)$, where M is the total number of SNPs used in the analysis and $h^2$ is the SNP heritability. For PRS-CS, we used the PRS-CS software (v.1.0.0) for model fitting. In PRS-CS, we obtained SNP weights by using the PRS-CS-auto model, in which the global shrinkage parameter ϕ is inferred automatically using a full Bayesian approach. For LDpred, we used the LDpred software (v.1.0.11) for model fitting. In LDpred, we set the LD radius parameter to the recommended value (m/3,000), with m being the number of SNPs. For the fraction of causal variants parameter ρ, we followed Vilhjálmsson et al.23 and Lloyd-Jones et al.12 and used a validation set to tune ρ by exploring nine different choices: 1, 0.3, 0.1, 0.03, 0.01, 0.003, 0.001, 0.0003 and 0.0001. The ρ value with the highest prediction R2 in the validation set was selected as the optimal fraction parameter.
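The validation-based tuning of ρ can be sketched as follows, where `weights_by_rho` is a hypothetical mapping from each candidate ρ to SNP weights produced by a separate LDpred fit; the selection criterion (validation R2 as the squared correlation between PGS and phenotype) follows the description above.

```python
import numpy as np

def select_rho(weights_by_rho, X_val, y_val):
    # weights_by_rho: dict mapping each candidate rho to its SNP weight vector
    # (hypothetical here; in practice produced by one LDpred fit per rho).
    best_rho, best_r2 = None, -np.inf
    for rho, w in weights_by_rho.items():
        pgs = X_val @ w                              # validation-set PGS
        r2 = np.corrcoef(pgs, y_val)[0, 1] ** 2      # prediction R^2
        if r2 > best_r2:
            best_rho, best_r2 = rho, r2
    return best_rho, best_r2
```

The same grid-search pattern applies to any PGS hyperparameter tuned on a held-out validation set.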
Simulations
We performed comprehensive simulations to evaluate the performance of PredInterval compared with other methods56. Details of the simulations are provided in the Supplementary Note.
UKB data
For both simulations and real-data applications, we used genotype and phenotype data from the UKB57. The UKB data contain ~500,000 participants and ~93 million imputed SNPs, along with comprehensive phenotypic and health-related information for the enrolled individuals. We followed the same quality control (QC) procedures described in the online resources of the Neale laboratory for sample and variant QC. Specifically, for sample QC, we preserved individuals who (1) have White British ancestry, as evidenced by self-reporting ‘White-British’, ‘Irish’ or ‘White’ and being within 7 s.d. of the mean of the first six PCs; (2) are included in the UKB v.3 imputed genotype data; and (3) are included in the genotype PC computation. We further filtered out individuals who (1) have sex chromosome aneuploidy; (2) are outliers in heterozygosity and missing rates; (3) are excluded from the kinship inference procedure; (4) have more than ten putative third-degree relatives in the kinship table; or (5) withdrew their consent from participating in the study. For variant QC, we excluded SNPs with (1) a minor allele frequency < 0.05; (2) an INFO score < 0.8; (3) a missing percentage > 5%; (4) a Hardy–Weinberg equilibrium test P < 10−7; or (5) duplicated entries. Furthermore, we followed Bulik-Sullivan et al.58 and the Neale laboratory and restricted the set of SNPs to those listed in the precomputed LD scores of individuals of European ancestry provided by the LDSC website. After the sample and variant QC steps, 361,112 individuals and 1,119,148 SNPs were retained for the final analysis. These QC procedures were performed using PLINK53.
Real-data applications
We analyzed 12 traits from the UKB to evaluate the performance of PredInterval and compare it with the other methods. Specifically, following the approaches of Ding et al.32, Yang & Zhou45, Morrison et al.59 and Sun et al.60, we selected six quantitative traits with SNP heritability estimates above 0.1 and six binary traits with a case prevalence between 0.01 and 0.3. The six quantitative traits include four physical measurements (SH, BMI, FVC and SBP) and two blood cell traits (PLT and LYMPH). The six binary traits include HTN, AG, AS, HCH, OA and RA. An overview of the phenotypes analyzed in this paper is summarized in Supplementary Table 1 for quantitative traits and Supplementary Table 2 for binary traits. For the quantitative traits, we followed Xu et al.14 to focus on phenotypic values measured during the initial visit among the three UKB assessment center visits for analysis. For the binary traits, we followed Maier et al.55 to define disease status using self-reported non-cancer illness coding (data field 20002), where we assigned a phenotypic value of one for cases and zero for controls.
In the analysis, we first randomly sampled 500 individuals from the UKB data to serve as a reference panel for calculating the SNP LD matrix. Afterwards, we randomly partitioned the remaining individuals into five equally sized and disjoint subsets for CV. For each subset in turn, we treated the subset of focus as the test data and the remaining four subsets as the training data, resulting in 80% training data and 20% test data. We repeated the CV process five times, ensuring that each of the five subsets was used exactly once as the test data. For both quantitative and binary traits, we obtained marginal z-scores in the training data by fitting a standard linear regression using GEMMA. Specifically, for quantitative traits, we first removed the effects of sex, age, age2, genotyping array and top 20 genotype PCs by regressing the phenotype on these covariates, obtaining phenotype residuals. These residuals were then transformed to a standard normal distribution through quantile–quantile normalization and supplied to GEMMA as phenotypic values for obtaining the marginal z-scores. For binary traits, we directly applied standard linear regression in GEMMA using the original binary outcomes, treating sex, age, age2, genotyping array and top 20 genotype PCs as covariates, to obtain the marginal z-scores.
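The covariate adjustment and quantile–quantile normalization step for quantitative traits can be sketched as below. The rank offset of 0.5 is our assumption, as the exact variant of the inverse normal transform used in the pipeline is not stated; function names are illustrative only.

```python
import numpy as np
from statistics import NormalDist

def covariate_residuals(y, C):
    # Regress phenotype y on covariate matrix C (with an intercept added)
    # and return the residuals, removing covariate effects.
    C1 = np.column_stack([np.ones(len(y)), C])
    coef, *_ = np.linalg.lstsq(C1, y, rcond=None)
    return y - C1 @ coef

def inverse_normal_transform(r, offset=0.5):
    # Map residual ranks to standard normal quantiles (offset is an assumed
    # Blom-type constant; ties are assumed absent for continuous residuals).
    ranks = r.argsort().argsort() + 1          # 1-based ranks
    probs = (ranks - offset) / len(r)
    nd = NormalDist()
    return np.array([nd.inv_cdf(p) for p in probs])

# Usage: z = inverse_normal_transform(covariate_residuals(y, C))
```

The transformed residuals are approximately standard normal while preserving the ordering of the original residuals, which is what the GWAS step requires.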
With the marginal z-scores in the training data and the LD matrix from the reference panel, we estimated the SNP heritability for each trait using LDSC61 (v.1.0.1) with the precomputed LD scores from the LDSC website, supplied the SNP heritability estimates as input for PGS model fitting, fitted the PGS models (that is, BLUP for BLUP analytical form and DBSLMM for PredInterval and CalPred) to obtain the SNP effect size estimates and used the ‘score’ function in PLINK to calculate the PGS in both the training and test data. Afterwards, we applied PredInterval, BLUP analytical form and CalPred on the individual-level phenotypes and PGS in the training data to calculate the 95% PGS prediction intervals in the test data. For each phenotype, we computed the mean prediction coverage rate in the test data across the five folds as the evaluation metric for all three competing methods. Besides the overall prediction coverage, we performed an additional analysis to investigate whether the three methods can accurately quantify the individual-level PGS uncertainty. Specifically, we first examined the distribution of the prediction interval widths across individuals and computed the stratified prediction coverage rate for individuals in ten deciles of interval width to assess the calibration of coverage across varying interval sizes. Finally, we evaluated the computational efficiency of four different PGS methods (DBSLMM, SBLUP, PRS-CS and LDpred) when paired with PredInterval, by recording the computing time and memory cost for analyzing the UKB BMI summary statistics (~1.2 M HapMap3 SNPs) in a single fold of fivefold CV.
We also evaluated the performance of different methods in identifying high-risk individuals using the same procedure described in the simulations. Specifically, for quantitative traits, we defined individuals with a phenotypic value in the top 5% quantile as high-risk individuals. We applied each method to identify individuals whose predicted phenotypic interval covered a value above the top 5% quantile of phenotypic values in the training data. Similarly, for binary traits, we defined individuals who are cases as high-risk individuals and identified individuals whose predicted phenotypic interval covered the case status. In addition to the PGS interval methods, we also evaluated a PGS point estimate approach for the six quantitative traits, where we first ranked individuals based on their PGS point estimates and selected the top 5% as the potential high-risk individual cohort. We applied each method to select individuals through such a procedure in each fold of CV. We evaluated the performance of each method by computing its success rate in identifying high-risk individuals in the test data, defined as the proportion of selected individuals who are truly high risk, that is, cases (for binary traits) or individuals with top 5% phenotypic values (for quantitative traits). Then, for each method, we computed its mean success rate across the five folds.
Besides examining the performance of PredInterval using individual-level UKB data as training data, we also applied the summary statistics version of PredInterval to BMI and five additional traits and evaluated its performance in the UKB. Specifically, as in Gao & Zhou62, Zhou et al.63 and Ruan et al.64, we obtained publicly available European ancestry-based summary statistics for five quantitative traits and one binary trait, for which external summary statistics from at least 300,000 samples are available outside the UKB. The external summary statistics were sourced from meta-analyses that excluded UKB samples to ensure no overlap between datasets. We used data for four lipid traits (HDL, LDL, TC, TG) from the Global Lipids Genetics Consortium65, BMI from the Genetic Investigation of Anthropometric Traits (GIANT) consortium66 and T2D from the Diabetes Genetics Replication and Meta-analysis (DIAGRAM) consortium67. In the analysis, we conducted CV in the UKB by partitioning individuals into five equally sized and disjoint subsets to serve as five folds. In each fold, we further partitioned the individuals into two halves, with one set serving as the calibration data and the other set serving as the test data. For each fold of CV, we fitted DBSLMM using the external summary statistics as training data to compute the SNP effect size estimates, obtained phenotypic residuals for calibration in the calibration data and constructed the 95% PGS prediction intervals in the test data. For each phenotype, we computed the mean prediction coverage rate in the test data across the five folds as the evaluation metric for the summary statistics-based PredInterval framework. Finally, we applied the summary statistics version of PredInterval to the 12 traits in the UKB, where both individual-level data and summary statistics were available, and compared its performance to the individual-level version of PredInterval.
Specifically, for each trait, we computed the summary statistics using GEMMA in the UKB training data, supplied them as inputs for the summary statistics version of PredInterval and constructed the 95% PGS prediction intervals in the test data.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Online content
Any methods, additional references, Nature Portfolio reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at 10.1038/s41588-025-02360-6.
Supplementary information
Supplementary Figs. 1–23, Tables 1–3 and Note.
Source data
Statistical source data.
Acknowledgements
This study was supported by National Institutes of Health (NIH) grants nos. R01HG009124 and R01GM144960 to X.Z. S.K.G. was supported by NIH grant nos. R01HL086694 and R35HL161016. This study was conducted using UKB resources under application no. 24460. The UKB was established by the Wellcome Trust, the Medical Research Council, the Department of Health, the Scottish Government and the Northwest Regional Development Agency. It also received funding from the Welsh Assembly Government, the British Heart Foundation and Diabetes UK.
Author contributions
X.Z. conceived the study, designed the methods and supervised the study. C.X. implemented the software, performed the experiments, analyzed the data and interpreted the results with input from X.Z. C.X. and X.Z. wrote and revised the manuscript with substantial input from S.K.G. All authors critically reviewed the manuscript, suggested revisions as needed and approved the final version.
Peer review
Peer review information
Nature Genetics thanks the anonymous reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.
Data availability
Individual-level genotype data from the UKB are available at www.ukbiobank.ac.uk/. Pre-computed LD scores of individuals with European ancestry are available at GitHub (https://github.com/bulik/ldsc). The sample and variant quality control procedures of the UKB from the Neale laboratory are available at www.nealelab.is/uk-biobank. Summary statistics of lipid traits from Global Lipids Genetics Consortium are available at https://csg.sph.umich.edu/willer/public/glgc-lipids2021/. Summary statistics of the BMI from the GIANT consortium are available at https://portals.broadinstitute.org/collaboration/giant/index.php/GIANT_consortium. Summary statistics of type 2 diabetes from DIAGRAM are available at https://diagram-consortium.org/downloads.html. The summary statistics data generated in the present study are available at GitHub (https://github.com/xuchang0201/PredInterval). Source data are provided with this paper.
Code availability
PredInterval (v.1.0) is available at GitHub (https://github.com/xuchang0201/PredInterval) and has been deposited at Zenodo68 10.5281/zenodo.16933469. CalPred (v.0.1.1) is available at GitHub (https://github.com/KangchengHou/calpred). DBSLMM (v.0.3) is available at GitHub (https://github.com/biostat0903/DBSLMM). GCTA (v.1.93.2beta) is available at https://yanglab.westlake.edu.cn/software/gcta/#Overview. GEMMA (v.0.98.1) is available at https://xiangzhou.github.io/software/. LDpred (v.1.0.11) is available at GitHub (https://github.com/bvilhjal/ldpred). LDSC (v.1.0.1) is available at GitHub (https://github.com/bulik/ldsc). Genotype data processing and quality control were performed using PLINK (v.1.9), which is available at www.cog-genomics.org/plink/. PRS-CS (v.1.0.0) is available at GitHub (https://github.com/getian107/PRScs). SBLUP (v.1.93.2beta) is available at https://yanglab.westlake.edu.cn/software/gcta/#SBLUP.
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
The online version contains supplementary material available at 10.1038/s41588-025-02360-6.
References
- 1. Lewis, C. M. & Vassos, E. Polygenic risk scores: from research tools to clinical instruments. Genome Med. 12, 44 (2020).
- 2. Konuma, T. & Okada, Y. Statistical genetics and polygenic risk score for precision medicine. Inflamm. Regen. 41, 18 (2021).
- 3. de los Campos, G., Vazquez, A. I., Hsu, S. & Lello, L. Complex-trait prediction in the era of big data. Trends Genet. 34, 746–754 (2018).
- 4. Visscher, P. M. et al. 10 years of GWAS discovery: biology, function, and translation. Am. J. Hum. Genet. 101, 5–22 (2017).
- 5. Visscher, P. M., Brown, M. A., McCarthy, M. I. & Yang, J. Five years of GWAS discovery. Am. J. Hum. Genet. 90, 7–24 (2012).
- 6. Loos, R. J. F. 15 years of genome-wide association studies and no signs of slowing down. Nat. Commun. 11, 5900 (2020).
- 7. Mavaddat, N. et al. Polygenic risk scores for prediction of breast cancer and breast cancer subtypes. Am. J. Hum. Genet. 104, 21–34 (2019).
- 8. So, H.-C., Kwan, J. S. H., Cherny, S. S. & Sham, P. C. Risk prediction of complex diseases from family history and known susceptibility loci, with applications for cancer screening. Am. J. Hum. Genet. 88, 548–565 (2011).
- 9. Selzam, S. et al. Comparing within- and between-family polygenic score prediction. Am. J. Hum. Genet. 105, 351–363 (2019).
- 10. Dudbridge, F. Power and predictive accuracy of polygenic risk scores. PLoS Genet. 9, e1003348 (2013).
- 11. de los Campos, G., Gianola, D. & Allison, D. B. Predicting genetic predisposition in humans: the promise of whole-genome markers. Nat. Rev. Genet. 11, 880–886 (2010).
- 12. Lloyd-Jones, L. R. et al. Improved polygenic prediction by Bayesian multiple regression on summary statistics. Nat. Commun. 10, 5086 (2019).
- 13. Zeng, P. & Zhou, X. Non-parametric genetic prediction of complex traits with latent Dirichlet process regression models. Nat. Commun. 8, 456 (2017).
- 14. Xu, C., Ganesh, S. K. & Zhou, X. mtPGS: leverage multiple correlated traits for accurate polygenic score construction. Am. J. Hum. Genet. 110, 1673–1689 (2023).
- 15. VanRaden, P. M. Efficient methods to compute genomic predictions. J. Dairy Sci. 91, 4414–4423 (2008).
- 16. Robinson, M. R. et al. Genetic evidence of assortative mating in humans. Nat. Hum. Behav. 1, 0016 (2017).
- 17. Ge, T., Chen, C. Y., Ni, Y., Feng, Y.-C. A. & Smoller, J. W. Polygenic prediction via Bayesian regression and continuous shrinkage priors. Nat. Commun. 10, 1776 (2019).
- 18. Hu, Y. et al. Joint modeling of genetically correlated diseases and functional annotations increases accuracy of polygenic risk prediction. PLoS Genet. 13, e1006836 (2017).
- 19. Zhao, Z. et al. PUMAS: fine-tuning polygenic risk scores with GWAS summary statistics. Genome Biol. 22, 257 (2021).
- 20. Privé, F., Vilhjálmsson, B. J., Aschard, H. & Blum, M. G. B. Making the most of clumping and thresholding for polygenic scores. Am. J. Hum. Genet. 105, 1213–1221 (2019).
- 21. Zhou, X., Carbonetto, P. & Stephens, M. Polygenic modeling with Bayesian sparse linear mixed models. PLoS Genet. 9, e1003264 (2013).
- 22. Euesden, J., Lewis, C. M. & O’Reilly, P. F. PRSice: Polygenic Risk Score software. Bioinformatics 31, 1466–1468 (2015).
- 23. Vilhjálmsson, B. J. et al. Modeling linkage disequilibrium increases accuracy of polygenic risk scores. Am. J. Hum. Genet. 97, 576–592 (2015).
- 24. Hu, Y. et al. Leveraging functional annotations in genetic risk prediction for human complex diseases. PLoS Comput. Biol. 13, e1005589 (2017).
- 25. Márquez-Luna, C. et al. Incorporating functional priors improves polygenic prediction accuracy in UK Biobank and 23andMe data sets. Nat. Commun. 12, 6052 (2021).
- 26. Choi, S. W. & O’Reilly, P. F. PRSice-2: Polygenic Risk Score software for biobank-scale data. Gigascience 8, giz082 (2019).
- 27. Purcell, S. M. et al. Common polygenic variation contributes to risk of schizophrenia and bipolar disorder. Nature 460, 748–752 (2009).
- 28. Zhang, Q., Privé, F., Vilhjálmsson, B. & Speed, D. Improved genetic prediction of complex traits from individual-level data or summary statistics. Nat. Commun. 12, 4192 (2021).
- 29. Zabad, S., Gravel, S. & Li, Y. Fast and accurate Bayesian polygenic risk modeling with variational inference. Am. J. Hum. Genet. 110, 741–761 (2023).
- 30. Wang, D. et al. Incorporating external risk information with the Cox model under population heterogeneity: applications to trans-ancestry polygenic hazard scores. Preprint at arXiv 10.48550/arXiv.2302.11123 (2023).
- 31. Ibanez, L., Farias, F. H. G., Dube, U., Mihindukulasuriya, K. A. & Harari, O. Polygenic risk scores in neurodegenerative diseases: a review. Curr. Genet. Med. Rep. 7, 22–29 (2019).
- 32. Ding, Y. et al. Large uncertainty in individual polygenic risk score estimation impacts PRS-based risk stratification. Nat. Genet. 54, 30–39 (2022).
- 33. Saw, J. et al. Chromosome 1q21.2 and additional loci influence risk of spontaneous coronary artery dissection and myocardial infarction. Nat. Commun. 11, 4432 (2020).
- 34. Katz, A. E. et al. Fibromuscular dysplasia and abdominal aortic aneurysms are dimorphic sex-specific diseases with shared complex genetic architecture. Circ. Genom. Precis. Med. 15, e003496 (2022).
- 35. Fritsche, L. G. et al. Association of polygenic risk scores for multiple cancers in a phenome-wide study: results from the Michigan Genomics Initiative. Am. J. Hum. Genet. 102, 1048–1061 (2018).
- 36. Yang, M. L. et al. Sex-specific genetic architecture of blood pressure. Nat. Med. 30, 818–828 (2024).
- 37. Xu, C. et al. Cross-ancestry associations of spontaneous coronary artery dissection genetic risk with coronary atherosclerosis and migraine headache. J. Am. Heart Assoc. 14, e036525 (2025).
- 38. Meuwissen, T. H., Hayes, B. J. & Goddard, M. E. Prediction of total genetic value using genome-wide dense marker maps. Genetics 157, 1819–1829 (2001).
- 39. Habier, D., Fernando, R. L., Kizilkaya, K. & Garrick, D. J. Extension of the Bayesian alphabet for genomic selection. BMC Bioinformatics 12, 186 (2011).
- 40. Gusev, A. et al. Integrative approaches for large-scale transcriptome-wide association studies. Nat. Genet. 48, 245–252 (2016).
- 41. Gamazon, E. R. et al. A gene-based association method for mapping traits using reference transcriptome data. Nat. Genet. 47, 1091–1098 (2015).
- 42. Nagpal, S. et al. TIGAR: an improved Bayesian tool for transcriptomic data imputation enhances gene mapping of complex traits. Am. J. Hum. Genet. 105, 258–266 (2019).
- 43. Choi, S. W., Mak, T. S.-H. & O’Reilly, P. F. Tutorial: a guide to performing polygenic risk score analyses. Nat. Protoc. 15, 2759–2772 (2020).
- 44. Ma, Y. & Zhou, X. Genetic prediction of complex traits with polygenic scores: a statistical review. Trends Genet. 37, 995–1011 (2021).
- 45. Yang, S. & Zhou, X. Accurate and scalable construction of polygenic scores in large biobank data sets. Am. J. Hum. Genet. 106, 679–693 (2020).
- 46. Pain, O. et al. Evaluation of polygenic prediction methodology within a reference-standardized framework. PLoS Genet. 17, e1009021 (2021).
- 47. Fong, E. & Holmes, C. C. On the marginal likelihood and cross-validation. Biometrika 107, 489–496 (2020).
- 48. Petter, E. et al. Genotype error due to low-coverage sequencing induces uncertainty in polygenic scoring. Am. J. Hum. Genet. 110, 1319–1329 (2023).
- 49. Wojcik, G. L. et al. Genetic analyses of diverse populations improves discovery for complex traits. Nature 570, 514–518 (2019).
- 50. Wang, Y. et al. Theoretical and empirical quantification of the accuracy of polygenic scores in ancestry divergent populations. Nat. Commun. 11, 3865 (2020).
- 51. Barber, R. F., Candès, E. J., Ramdas, A. & Tibshirani, R. J. Predictive inference with the jackknife+. Ann. Stat. 49, 486–507 (2021).
- 52. Hou, K., Xu, Z., Ding, Y., Harpak, A. & Pasaniuc, B. Calibrated prediction intervals for polygenic scores across diverse contexts. Nat. Genet. 56, 1386–1396 (2024).
- 53. Chang, C. C. et al. Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience 4, 7 (2015).
- 54. Yang, J., Lee, S. H., Goddard, M. E. & Visscher, P. M. GCTA: a tool for genome-wide complex trait analysis. Am. J. Hum. Genet. 88, 76–82 (2011).
- 55. Maier, R. M. et al. Improving genetic prediction by leveraging genetic correlations among human diseases and traits. Nat. Commun. 9, 989 (2018).
- 56. Zhou, X. & Stephens, M. Efficient multivariate linear mixed model algorithms for genome-wide association studies. Nat. Methods 11, 407–409 (2014).
- 57. Sudlow, C. et al. UK Biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med. 12, e1001779 (2015).
- 58. Bulik-Sullivan, B. et al. An atlas of genetic correlations across human diseases and traits. Nat. Genet. 47, 1236–1241 (2015).
- 59. Morrison, J., Knoblauch, N., Marcus, J. H., Stephens, M. & He, X. Mendelian randomization accounting for correlated and uncorrelated pleiotropic effects using genome-wide summary statistics. Nat. Genet. 52, 740–747 (2020).
- 60. Sun, J. et al. Translating polygenic risk scores for clinical use by estimating the confidence bounds of risk prediction. Nat. Commun. 12, 5276 (2021).
- 61. Bulik-Sullivan, B. et al. LD score regression distinguishes confounding from polygenicity in genome-wide association studies. Nat. Genet. 47, 291–295 (2015).
- 62. Gao, B. & Zhou, X. MESuSiE enables scalable and powerful multi-ancestry fine-mapping of causal variants in genome-wide association studies. Nat. Genet. 56, 170–179 (2024).
- 63. Zhou, G., Chen, T. & Zhao, H. SDPRX: a statistical method for cross-population prediction of complex traits. Am. J. Hum. Genet. 110, 13–22 (2023).
- 64. Ruan, Y. et al. Improving polygenic prediction in ancestrally diverse populations. Nat. Genet. 54, 573–580 (2022).
- 65. Graham, S. E. et al. The power of genetic diversity in genome-wide association studies of lipids. Nature 600, 675–679 (2021).
- 66. Locke, A. E. et al. Genetic studies of body mass index yield new insights for obesity biology. Nature 518, 197–206 (2015).
- 67. Mahajan, A. et al. Fine-mapping type 2 diabetes loci to single-variant resolution using high-density imputation and islet-specific epigenome maps. Nat. Genet. 50, 1505–1513 (2018).
- 68. Xu, C., Ganesh, S. K. & Zhou, X. Statistical construction of calibrated prediction intervals for polygenic score based phenotype prediction. Zenodo 10.5281/zenodo.16933469 (2025).
Supplementary Materials
Supplementary Figs. 1–23, Tables 1–3 and Note.
Statistical source data.