Abstract
Polygenic scores (PGSs) have limited portability across different groupings of individuals (for example, by genetic ancestries and/or social determinants of health), preventing their equitable use1–3. PGS portability has typically been assessed using a single aggregate population-level statistic (for example, R2)4, ignoring inter-individual variation within the population. Here, using a large and diverse Los Angeles biobank5 (ATLAS, n = 36,778) along with the UK Biobank6 (UKBB, n = 487,409), we show that PGS accuracy decreases individual-to-individual along the continuum of genetic ancestries7 in all considered populations, even within traditionally labelled ‘homogeneous’ genetic ancestries. The decreasing trend is well captured by a continuous measure of genetic distance (GD) from the PGS training data: Pearson correlation of −0.95 between GD and PGS accuracy averaged across 84 traits. When applying PGS models trained on individuals labelled as white British in the UKBB to individuals with European ancestries in ATLAS, individuals in the furthest GD decile have 14% lower accuracy relative to the closest decile; notably, the closest GD decile of individuals with Hispanic Latino American ancestries show similar PGS performance to the furthest GD decile of individuals with European ancestries. GD is significantly correlated with PGS estimates themselves for 82 of 84 traits, further emphasizing the importance of incorporating the continuum of genetic ancestries in PGS interpretation. Our results highlight the need to move away from discrete genetic ancestry clusters towards the continuum of genetic ancestries when considering PGSs.
Subject terms: Genetics, Risk factors, Genome-wide association studies, Statistical methods
Using two large biobank datasets, a study shows that the accuracy of polygenic scores decreases as a function of relatedness at the individual level when modelling genetic ancestry as a continuum.
Main
PGSs—estimates of an individual’s genetic predisposition for complex traits and diseases (that is, genetic liability; also referred to as genetic value)—have garnered tremendous attention recently across a wide range of fields, from personalized genomic medicine4,8–10 to disease risk prediction and prevention11–14 to socio-genomics3,15. However, the variation in PGS performance across different genetic ancestries and/or socio-demographic features (for example, sex, age and social determinants of health)2 poses a critical equity barrier that has prevented widespread adoption of PGSs. Similar portability issues have also been reported for non-genetic clinical models16–18. The interpretation and application of PGSs are further complicated by the conflation of genetic ancestries with social constructs such as nationality, race and/or ethnicity. Here we investigate PGS performance across genetically inferred ancestry (GIA), which describes the genetic similarity of an individual to a reference dataset (for example, 1000 Genomes19) as inferred by methods such as principal component analysis (PCA); GIAs do not represent the full genetic diversity of human populations.
Genetic prediction and its accuracy (or reliability) have been extensively studied in agricultural settings with a focus on breeding programmes20–23. At the population level, PGS accuracy can be expressed as a function of heritability, training sample size and the number of markers used in the predictor in single24–26 or multi-population settings with or without effect size heterogeneity27. At the individual level, accuracy of genetic prediction from pedigree data28–30 can be derived as a function of the inverse of the coefficient matrix of mixed-models equations, whereas accuracy of genetic prediction using whole-genome genetic data can be derived similarly, with the pedigree matrix replaced with the genomic relationships matrix21–23,27,31,32 among training and testing individuals. Simulations guided by dairy breeding programmes showcase that genomic prediction accuracy varies with genetic relatedness of the testing individual to the training data33,34 as well as across generations, owing to the decay of genetic relationships35.
In humans, PGS performance evaluation has traditionally relied on population-level accuracy metrics (for example, R2)2,4. PGS accuracy decays as the target populations become more dissimilar from the training data using either relatedness36,37 or continental or subcontinental ancestry groupings1,38–40; the decay may be explained by differences in linkage disequilibrium, minor allele frequencies and/or heterogeneity in genetic effects due to gene–gene and gene–environment interactions41. However, population-level metrics of accuracy provide only an aggregate (average) metric for all individuals in the population, thus implicitly assuming some level of homogeneity across individuals2,4,42. Homogeneous populations are an idealized concept that only roughly approximate human data; human diversity exists along a genetic ancestry continuum without clearly defined clusters and with various correlations between genetic and socio-environmental factors7,42–46. Grouping individuals into discrete GIA clusters obscures the impact of individual variation on PGS accuracy. This is evident among individuals with recently admixed genomes for which genetic ancestries vary individual-to-individual and locus-to-locus in the genome. For example, a single population-level PGS accuracy estimated across all African Americans overestimates PGS accuracy for African Americans with large proportions of African GIA40; likewise, coronary artery disease PGS performs poorly in Hispanic individuals with high proportions of African GIA47. The genetic ancestry continuum affects PGS accuracy even in traditionally labelled ‘homogeneous’ or ‘non-admixed’ populations. For example, PGS accuracy decays across a gradient of subcontinental ancestries within Europe as the target cohorts become more genetically dissimilar from the PGS training data39,45. Assessing PGS accuracy using population-level metrics is further complicated by technical issues in assigning individuals to discrete clusters of GIA. Different algorithms and/or reference panels may assign the same individual to different clusters39,42,48, leading to different PGS accuracies. Moreover, many individuals are not assigned to any cluster owing to limited reference panels used for genetic ancestry inference5,39, leaving such individuals outside PGS characterization. This poses equity concerns as it limits PGS applications only to individuals within well-defined GIAs.
Here we leverage classical theory28–30 and methods that characterize PGS performance at the level of a single target individual49 to evaluate the impact of the genetic ancestry continuum on PGS accuracy. We use simulations and real-data analyses to show that PGS accuracy decays continuously individual-to-individual across the genetic continuum as a function of GD from the PGS training data; GD is defined as a PCA projection of the target individual on the training data used to estimate the PGS weights. We leverage a large and diverse Los Angeles biobank at the University of California, Los Angeles5 (ATLAS, n = 36,778) along with the UK Biobank6 (UKBB, n = 487,409) to investigate the interplay between genetic ancestries and PGS for 84 complex traits and diseases. The accuracy of PGS models trained on individuals labelled as white British (WB; see Methods for naming convention used in this work) in the UKBB (n = 371,018) is negatively correlated with GD for all considered traits (average Pearson R = −0.95 across 84 traits), demonstrating pervasive individual variation in PGS accuracy. The negative correlation remains significant even when restricted to traditionally defined GIA clusters (ranging from R = −0.43 for East Asian GIA to R = −0.85 for the African American GIA in ATLAS). On average across the 84 traits, when rank-ordering individuals according to distance from training data, PGS accuracy decreases by 14% in the furthest versus closest decile in the European GIA. Notably, the furthest decile of individuals of European ancestries showed similar accuracy to the closest decile of Hispanic Latino individuals. Characterizing PGS accuracy across the continuum allows the inclusion of individuals unassigned to any GIA (6% of all ATLAS), thus allowing more individuals to be included in PGS applications. Finally, we explore the relationship between GD and PGS estimates themselves. Of 84 PGSs, 82 show significant correlation between GD and PGS with 30 showing opposite correlation (GD, trait) versus (GD, PGS); we exemplify the importance of incorporating GD in interpretation of PGSs using height and neutrophils in the ATLAS data. Our results demonstrate the need to incorporate the genetic ancestry continuum in assessing PGS performance and/or bias.
Overview of the study
PGS accuracy has conventionally been assessed at the level of discrete GIA clusters using population-level metrics of accuracy. Individuals from diverse genetic backgrounds are routinely grouped into discrete GIA clusters using computational inference methods such as PCA50 and/or admixture analysis51 (Fig. 1a). Population-level metrics of PGS accuracy are then estimated for each GIA cluster and generalized to everyone in the cluster (Fig. 1b). This approach has three major limitations: the inter-individual variability within each cluster is ignored; the GIA cluster boundary is sensitive to algorithms and reference panels used for clustering; and a substantial proportion of individuals may not be assigned to any GIA owing to a lack of reference panels for genetic ancestry inference (for example, individuals of uncommon or admixed ancestries).
Here we evaluate PGS accuracy across the genetic ancestry continuum at the level of a single target individual. We model the phenotype of individual as , in which is an vector of standardized genotypes for M variants, is an vector of standardized causal effects, and is random noise. Under a random effects model, genetic liability and its PGS estimate are random variables for which the randomness comes from and training data (. We define the individual PGS accuracy as the correlation of an individual’s genetic liability and PGS estimate with the following equation in consistence with classical theory28,32,52:
1 |
Under an infinitesimal assumption for which all variants are causal and drawn from a normal distribution , the analytical form of PGS accuracy can be derived as:
in which is per single nucleotide polymorphism (SNP) heritability; is the variance of residual environmental noise; and are the jth eigenvector and eigenvalues of training genotype data and J is the total number of eigenvectors. is the squared Mahalanobis distance of the testing individual from the centre of the training genotype data on its principal component (PC) space; and is the sum of squared genotypes across all variants. Empirically, the ratio of the squared Mahalanobis distance to the sum of squared genotypes is highly correlated with the Euclidean distance of the individual from the training data on that PC space (R = 1, P < 2.2 × 10−16 in the UKBB). Given that this metric of accuracy is highly dependent on the GD from the training data, we term it the panel distance . In practice, we use LDpred2 to estimate (refs. 49,53) and approximate as the heritability of the phenotype30 (Methods). As a continuous GD, we use with J set to 20 (Fig. 1c,d and Methods). We note two caveats of individual PGS accuracy: first, the genetic effects are assumed to be the same for all individuals regardless of their genetic ancestry background; second, the SNPs used for PGS training may not fully capture trait heritability. Therefore, the metric we proposed here is an upper bound of genetic prediction accuracy (Supplementary Note).
PGS performance is calibrated in simulations
First, we evaluated calibration of the posterior variance of genetic liability estimated by LDpred2 for individuals at various GDs from the UKBB WB training data by checking the calibration of the 90% credible intervals (Fig. 2a). We simulated 100 phenotypes at heritability and proportion of causal variants for all individuals in the UKBB, assuming shared causal variants and homogeneous causal effect sizes for individuals from various genetic backgrounds (see Methods). Overall, the 90% credible intervals are approximately well calibrated (that is, the 90% credible interval overlaps with the true genetic liability across 90 of 100 replicates, for all individuals, regardless of their GD from the training population or GIA labels; Fig. 2a). For example, when individuals are binned into 10 deciles based on their GD from the training population, the average empirical coverage of the 90% credible intervals is 89.7% (s.d. 2.6%) for individuals from the closest decile (composed of 96.9% individuals labelled as WB, 3.1% labelled as PL under a discrete view of ancestries; see detailed naming convention in Methods) compared to the average empirical coverage of 82.4% (s.d. 4.6%) for individuals from the furthest decile (composed of 19.9% individuals labelled as CB and 80.1% labelled as NG).
Next, we investigated the impact of GD on individual-level PGS accuracy. As expected, the width of the credible interval increases linearly with GD, reflecting reduced predictive accuracy for the PGS (Fig. 2b). The average width of the 90% credible interval is 1.83 in the furthest decile of GD, a 1.8-fold increase over the average width in the closest decile of GD. In contrast to the credible interval width, the individual-level PGS accuracy decreases with GD from the training data (Fig. 2c); the average estimated accuracy of individuals in the closest decile GD is fourfold higher than that of individuals in the furthest decile. Even among the most homogeneous grouping of individuals traditionally labelled as WB, we observe a 5% relative decrease in accuracy for individuals at the furthest decile of GD as compared to those in the closest decile. Similar results are observed when using a population-level PGS metric of accuracy, albeit at the expense of binning individuals according to GD; we find a high degree of concordance between the average within the bin and the population-level R2 estimated within the bin (Fig. 2d and Extended Data Fig. 1a). Similarly, we observe a high consistency between average and squared correlation between PGS and simulated phenotypes (R = 0.86, P < 10−10; Extended Data Fig. 1b). Taken together, our results show that the 90% credible intervals remain calibrated for individuals that are genetically distant from the training population at the expense of wider credible intervals, and captures the PGS accuracy decay across GD.
To demonstrate that the continuous accuracy decay is not specific to PGS models trained on European ancestries, we conducted further analyses using a non-European training dataset composed of individuals of NG and CB GIAs (we grouped the two GIAs to attain sufficient sample size for simulations). We simulated a high signal-to-noise trait by setting and proportion of causal variants and 0.1% with 56,539 SNPs on chromosome 10 alone. We trained PGS models on 5,000 individuals from the NG and CB GIA clusters and applied the models to the remaining testing individuals. The coverage of the 90% credible intervals was invariant to GD despite slight miscalibration. The 90% credible interval width increased and individual PGS accuracy decreased when the testing individual was further away from the training data. This trend is consistent with the observed decrease in empirical accuracy computed as squared correlation between PGS and genetic value as GD increases (Extended Data Figs. 2 and 3).
We further evaluated the impact of the number of PCs used for calculating GD on its ability to capture accuracy decay. We varied the number of PCs (J) from 1 to 20 and observed that the correlation between GD and individual accuracy () increases when more PCs are used for computing GD, but no further improvement is observed when J > 15 for any GIA clusters or the whole biobank (Extended Data Fig. 4). Therefore, we set J = 20 for simplicity. We also explored average squared genetic relationship from training data as an alternative metric of GD and found that it is a better prediction of accuracy decay within each GIA clusters (Extended Data Fig. 4). However, because this metric relies on individual-level training data that are usually not available, we choose to use PCA-based GD for convenience.
PGS accuracy varies across the genetic continuum
Having validated our approach in simulations, we next turn to empirical data. For illustration purposes, we use height as an example, focusing on the ATLAS biobank as the target population with PGS trained on the 371,018 WB individuals from the UKBB (Methods); other traits show similar trends and are presented in the next sections. PGS accuracy at the individual level varies with GD across the entire biobank as well as within each GIA cluster (Fig. 3 and Extended Data Fig. 5). For example, GD strongly correlates with PGS accuracy of individuals in the GIA cluster labelled as Hispanic Latino American (HL, R = −0.84) and African American (AA, R = −0.88) in ATLAS. Notably, GD correlates with PGS accuracy even in non-admixed GIA clusters with correlations as −0.66, −0.66 and −0.35, for European American (EA), South Asian American (SAA) or East Asian American (EAA) GIA clusters, respectively. Similar qualitative results are also observed when applying PGS to test data from the UKBB; significant negative correlations are found between GD and individual PGS accuracy in all of the subcontinental GIA clusters in the UKBB (Extended Data Fig. 5), with correlation coefficients ranging from R = −0.031 for the WB cluster to R = −0.62 for the CB cluster.
Next, we focused on the impact of GD on PGS accuracy across all ATLAS individuals regardless of GIA clustering (R = −0.96, P < 10−10; Fig. 3b). Notably, we find a strong overlap of PGS accuracies across individuals from different GIA clusters demonstrating the limitation of using a single cluster-specific metric of accuracy. For example, when rank-ordering by GD, we find that the individuals from the closest GD decile in the HL cluster have similar estimated accuracy to the individuals from the furthest GD decile in EA cluster (average of 0.71 versus 0.71). This shows that GD enables identification of HL individuals with similar PGS performance to the EA cluster thus partly alleviating inequities due to limited access to accurate PGS. Most notably, GD can be used to evaluate PGS performance for individuals that cannot be easily clustered by current genetic inference methods (6% of ATLAS; Fig. 3b) partly owing to limitations of reference panels and algorithms for assigning ancestries. Among this traditionally overlooked group of individuals, we find the GD ranging from 0.02 to 0.64 and their corresponding estimated PGS accuracy ranging from 0.63 to 0.21. In addition to evaluating PGS accuracy with respect to the genetic liability, we also evaluated accuracy with respect to the residual height after regressing out sex, age and PC1–10 on the ATLAS from the actual measured trait. Using equally spaced bins across the GD continuum, we find that correlation between PGS and the measured height tracks significantly with GD (R = −0.92, P = 1.1 × 10−8; Fig. 3c).
PGS accuracy decay is pervasive
Having established the coupling of GD with PGS accuracy in simulations and for height, we next investigate whether this relationship is common across complex traits using PGSs for a broad set of 84 traits (Supplementary Table 1). We find consistent and pervasive correlations of GD with PGS accuracy across all considered traits in both ATLAS and the UKBB (Fig. 4). For example, the correlations between GD and individual PGS accuracy range from −0.71 to −0.97 with an average of −0.95 across the 84 PGSs in ATLAS with similar results observed in the UKBB. Traits with sparser genetic architectures and fewer non-zero weights in the PGS have a lower correlation between GD and PGS accuracy; we reason that this is because GD represents genome-wide genetic variation patterns that may not reflect a limited number of causal SNPs well. For example, PGS for lipoprotein A (log_lipoA) has the lowest estimated polygenicity (0.02%) among the 84 traits and has the lowest correlation in ATLAS (−0.71) and the UKBB (−0.85). By contrast, we observe a high correlation between GD and PGS accuracy (>0.9) for all traits with an estimated polygenicity >0.1%. Next, we show that the fine-scale population structure accountable for the individual PGS accuracy variation is also prevalent within the traditionally defined genetic ancestry group. For example, in ATLAS we find that 501 of 504 (84 traits across 6 GIA clusters) trait–ancestry pairs have a significant associations between GD and individual PGS accuracy after Bonferroni correction. In the UKBB, we find 572 of the 756 (84 traits across 9 subcontinental GIA clusters) trait–ancestry pairs have significant associations between GD and PGS accuracy after Bonferroni correction. We also find that a more stringent definition of homogeneous GIA clusters results in a lower correlation magnitude (Extended Data Fig. 6). Empirical analyses of PGS accuracy show a similar trend. When averaging across 84 traits, we find that the empirical accuracy decreases with increased GD across GIA clusters as reported by previous studies39. Further analyses based on GD bins show the decreasing trend at a finer scale (Extended Data Fig. 7).
PGS varies across the genetic continuum
We have focused so far on investigating the relationship between GD () and PGS accuracy (). Next, we evaluate the impact of GD on PGS estimates () themselves. We find a significant correlation between GD and PGS estimates for 82 of 84 traits, with correlation coefficients ranging from R = −0.52 to R = 0.74 (Extended Data Fig. 8); this broad range of correlations is in stark contrast with the consistently observed negative correlation between GD and PGS accuracy. To better understand whether the coupling of PGS with GD is due to stratification or true signal, we compared the correlation of GD with PGS estimates () to the correlation of GD with measured phenotype values (). We find a wide range of couplings reflecting trait-specific signals; for 30 traits, GD correlates in opposite directions with PGS versus phenotype; for 40 traits, GD correlates in the same directions with PGS versus phenotype but differs in correlation magnitudes (Extended Data Fig. 8). For example, GD shows opposite and significantly different correlations for PGS versus trait for years of education (years_of_edu, = 0.03, = −0.18). Other traits, such as hair colour, show a highly consistent impact of GD on PGS versus trait (darker_hair, = 0.59, = 0.74), whereas for monocyte percentage, GD shows different magnitudes albeit with the same directions (monocyte_perc, = −0.03, = −0.52). Moreover, GD correlates with PGS and phenotype even within the same GIA cluster and the correlation patterns vary across clusters (Extended Data Fig. 9).
The correlation of GD with phenotype and PGS is also observed in ATLAS. For example, both height phenotype and height PGS vary along GD in ATLAS (Fig. 5); this holds true even when restricting analysis to the EA genetic ancestry cluster (Supplementary Fig. 1). This is consistent with genetic liability driving difference in phenotypes but could also be explained by residual population stratification. For neutrophil counts, phenotype and PGS vary in opposite directions with respect to GD across the ATLAS (Fig. 5), although the trend is similar for phenotype and PGS in the EA GIA clusters (Supplementary Fig. 1). This could be explained by genetic liability driving signal in Europeans with stratification for other groups. Neutrophil counts have been reported to vary greatly across ancestry groups with reduced counts in individuals of African ancestries54. In ATLAS, we observe a negative correlation (−0.04) between GD and neutrophil counts in agreement with the previous reports, whereas GD is positively correlated (0.08) with PGS estimates—genetically distant individuals traditionally labelled as African American having higher PGS than average. The opposite directions in phenotype–distance and PGS–distance correlations are partly attributed to the Duffy-null SNP rs2814778 on chromosome 1q23.2. This variant is strongly associated with neutrophil counts among individuals traditionally identified as African ancestry, but it is rare and excluded in our training data. This exemplifies the potential bias in PGS due to non-shared causal variants and emphasizes ancestral diversity in genetic studies. As PGS can vary across GD either as a reflection of true signal (that is, genetic liability varying with ancestry) or owing to biases in PGS estimation ranging from unaccounted residual population stratification to incomplete data (for example, partial ancestry-specific tagging of causal effects), our results emphasize the need to consider GD in PGS interpretation beyond adjusting for PGS .
Discussion
In this work, we have shown that PGS accuracy varies from individual to individual and proposed an approach to personalize PGS metrics of performance. We used a PCA-based GD39 from the centre of training data to describe an individual’s unique location on the genetic ancestry continuum and showed that individual PGS accuracy tracks well with GD. The continuous decay of PGS performance as the target individual becomes further away from the training population is pervasive across traits and ancestries. We highlight the variability in PGS performance along the continuum of genetic ancestries, even within traditionally defined homogeneous populations. As the genetic ancestries are increasingly recognized as continuous rather than discrete7,42–45, the individual-level PGS accuracy provides a powerful tool to study PGS performance across diverse individuals to enhance the utility of PGS. For example, by using individual-level PGS accuracy, we can identify individuals from Hispanic Latino GIA who have similar PGS accuracy to individuals of European GIA, thus partly alleviating inequities due to lack of access to accurate PGS.
Simulation and real-data analyses show that individual PGS accuracy is highly correlated with GD, in alignment with existing works showing that decreased similarity (measured by relatedness, linkage disequilibrium and/or minor allele frequency differences, fixation index (Fst) and so on)41,55 between testing individuals and training data is a major contributor to PGS accuracy decay. However, practical factors that may affect transferability, such as genotype–environment interaction and population-specific causal variants, are not modelled in the calculation of individual PGS accuracy and this is left for future work.
Our results emphasize the importance of PGS training in diverse ancestries56 as it can provide advantages for all individuals. Broadening PGS training beyond European ancestries can lead to improved accuracy in genetic effect estimation particularly for variants with higher frequencies in non-European data. It can also increase PGS portability by reducing the GD from target to training data. However, increased diversity may also bring challenges to statistical modelling; for example, differences in genetic effects may correlate with environment factors and could bias genetic risk prediction. To address these challenges, more sophisticated statistical methods are needed that can effectively leverage ancestrally diverse populations to train PGS3 (for example, PRS-CSx57, vilma58 and CT-SLEB59). Concerted global effort and equitable collaborations are also crucial to increase the sample size of underrepresented individuals as part of an effort to reduce health disparities across ancestries56,60.
We highlight the pervasive correlation between PGS estimates and GD of varying magnitude and sign as compared to the correlation between phenotype and GD. This provides a finer resolution of the mean shift of PGS estimates across genetic ancestry groupings38. The correlation between GD and PGS estimates can arise from bias and/or true biological difference, and more effort is needed to investigate the PGS bias in the context of genetic ancestry continuum.
We note several limitations and future directions of our work. First, our proposed individual PGS accuracy is an upper bound of true accuracy and should be interpreted only in terms of the additive heritability captured by SNPs included in the model. Missing heritability61,62 and misspecification of the heritability model along with population-specific causal variants and effect sizes may further decrease real accuracy. For example, the prediction accuracy for neutrophil count is overestimated among African American individuals because the Duffy-null SNP rs2814778 (ref. 54) is not captured in the UKBB WB training data. Future work could investigate the impact of the population-specific components of genetic architecture on the calibration of PGS accuracy. Second, we approximate the variance of genetic liability in the denominator of equation (1) with heritability and set a fixed value for all individuals. Preliminary results show that replacing the denominator with a Monte Carlo estimation of genetic liability variance recapitulates the accuracy decay in estimated PGS accuracy, albeit the correlation is slightly reduced (Extended Data Fig. 10). Third, individual PGS accuracy evaluates how well the PGS estimates the genetic liability instead of phenotype. Quantifying the individual accuracy of PGS with respect to phenotype can be achieved by also modelling non-genetic factors for proper calibration. Fourth, limited by sample size, we combined GIA groups as a training set in simulation experiments to replicate PGS accuracy decay; this is not an optimal strategy for data analysis as the population structure in the training data may confound the true genetic effects and reduce prediction accuracy. We leave a more comprehensive investigation of non-European PGS training data for future work. Sixth, although we advocate for the use of continuous genetic ancestry, we trained our PGS models on a discrete GIA cluster of WB because current PGS methods rely on discrete genetic ancestry groupings. We leave the development of PGS training methods that are capable of modelling continuous ancestries as future work. Finally, we highlight that, just like PGS, the traditional clinical risk assessment may suffer from limited portability across diverse populations18. For examples, the pooled cohort equation overestimates atherosclerotic cardiovascular disease risk among non-European populations16; and a traditional clinical breast cancer risk model developed in the European population in the USA overestimated the breast cancer risk among older Korean women17. Here we focus on genetic prediction potability owing to the wide interest and attention from both the research community and society. We emphasize that improving the portability of traditional clinical risk factor models in diverse populations is an essential component of health equity and requires thorough investigation.
Methods
Model
We model the phenotype of an individual with a standard linear model , in which is an vector of standardized genotypes (centred and standardized with respect to the allele frequency in the training population for both training and testing individuals), is an vector of standardized genetic effects, and is random noise. Under a random effects model, is a vector of random variable sampled from a prior distribution that differs under different genetic architecture assumptions62 and PGS methods53,63–65. The PGS weights are estimated to be the posterior mean given the observed data ( with access to individual-level genotype, , and phenotype, ; or with access to marginal association statistics and LD matrix , in which GWAS stands for genome-wide association study). The genetic liability () of an individual i is estimated to be , the uncertainty of which is estimated as the posterior variance of genetic liability (ref. 49).
Individual PGS accuracy
We define individual PGS accuracy as the squared correlation between an individual’s genetic liability, , and its PGS estimate, , following the general form in ref. 28:
Here we are interested in the PGS accuracy of a given individual; therefore, the genotype is treated as a fixed variable, and genetic effects are treated as a random variable. We note that a random effects model is essential; otherwise, and are 0. Under a random effects model, both the genetic liability and PGS estimate for individual are random variables. The randomness of comes from the randomness in , and the randomness of comes from the randomness of both and the training data . Individual PGS accuracy measures the correlation between and , which can be computed with the following equation:
2 |
in which is the posterior variance of genetic liability given the training data, and is the genetic variance. The equation is derived as follows.
First, we show that under the random effects model, (in which ) following equation 5.149 in ref. 66:
Multiplying on both sides of the equation, we obtain:
3 |
Equation (3) also implies the slope from regression of observed phenotypic values (or true genetic liability) on the estimated PGS equal to 1 (Supplementary Fig. 2), which offers an alternative way to assess the calibration of PGS as done in refs. 64,67.
Next, by applying the law of total variance, we show that:
4 |
Third, we derive the correlation between and as:
The above equation is widely used in animal breeding theory to compute the reliability of estimated breeding value for each individual30. In this work, we use individual PGS uncertainty as an unbiased estimator of . We also use estimated heritability to approximate in simulations in which the phenotype has unit variance. In real-data analysis, as the phenotype does not necessarily have unit variance, we approximate by scaling the estimated heritability with the residual phenotypic variance in the training population after regressing GWAS covariates including sex, age and precomputed UKBB PC1–16 (Data-Field 22009).
Analytical form of individual PGS accuracy under infinitesimal assumption
Without loss of generality, we assume a prior distribution of genetic effects as follows:
where M is the number of genetic variants. With access to individual genotype, , and phenotype, , data, the likelihood of the data is
where N is the training sample size. The posterior distribution of genetic effects given the data is proportional to the product of the prior and the likelihood:
in which and . This form is equivalent to the solution of random effects in the best linear unbiased prediction with the pedigree matrix or genetic relationship matrix29,32.
For a new target individual, the posterior variance of the genetic liability is:
After carrying out eigendecomposition on , we can rewrite
in which and correspond to the jth eigenvalue and unit-length eigenvector of the training genotype, .
Thus, we can rewrite the posterior variance of genetic liability as
Replacing in equation (2) with the analytical form of , we get
As the eigenvalue of increases linearly with training sample size N (ref. 68), at the UKBB-level sample size (for example, N = 371,018 for our UKBB WB training data), the eigenvalues for the top PCs are usually larger than the ratio of environmental noise variance and genetic variance . Thus, we can further approximate the analytical form with:
The term is the squared Mahalanobis distance of the testing individual from the centre of the training genotype data on its PC space and is the sum of squared genotype across all variants. Empirically, the ratio between the two is highly correlated with the Euclidean distance of the individual from the training data on that PC space (R = 1, P value < 2.2 × 10−16 in the UKBB).
Genetic distance (GD)
The GD is defined as the Euclidean distance between a target individual and the centre of training data on the PC space of training data.
in which is the GD of a testing individual from the training data, is an standardized genotype vector for testing individual , is the jth eigenvector for the genotype matrix of training individuals, is the average genotype in the training population ( given that the genotypes are centred with respect to the allele frequency in the training population), and is set to 20.
Ancestry ascertainment in UKBB
The UKBB individuals are clustered into nine subcontinental GIA clusters—WB (white British), PL (Poland), IR (Iran), IT (Italy), AS (Ashkenazi), IN (India), CH (China), CB (Caribbean) and NG (Nigeria)—based on the top 16 precomputed PCs (Data-Field 22009) as described in ref. 39. First, UKBB participants are grouped by country of origin (Data-Field 20115) and the centre of each country on the PC space is computed as the geometric median for all countries, which serves as a proxy for the centre for each subcontinental ancestry. The centre of Ashkenazi GIA is determined using a dataset from ref. 69. Second, we reassign each individual to one of the nine GIA groups on the basis of their Euclidean distance to the centres on the PC space, as the self-reported country of origin does not necessarily match an individual’s genetic ancestry. The genetic ancestry of an individual is labelled as unknown if its distance to any genetic ancestry centre is larger than one-eighth of the maximum distance between any pairs of subcontinental ancestry clusters. We are able to cluster 91% of the UKBB participants into 411,018 WB, 4,127 PL, 1,169 IR, 6,499 IT, 2,352 AS, 1,798 CH, 2,472 CB and 3,894 NG. GIAs are not necessarily reflective of the full genetic diversity of a particular region but reflect only the diversity present in the UKBB individuals.
Ancestry ascertainment in ATLAS
The ATLAS individuals are clustered into five GIA clusters—European Americans (EA), Hispanic Latino Americans (HL), South Asian Americans (SAA), East Asian Americans (ESA) and African Americans (AA)—as described in ref. 5 on the basis of their proximity to 1000 Genome super populations on the PC space. First, we filter the ATLAS-typed genotypes with plink2 by Mendel error rate (plink --me 1 1 –set-me-missing), founders (--filter-founders), minor allele frequency (–maf 0.15), genotype missing call rate (--geno 0.05) and Hardy–Weinberg equilibrium test P value (–hwe 0.001). Next, ATLAS genotypes were merged with the 1000 Genomes phase 3 dataset. Then, linkage disequilibrium (LD) pruning was carried out on the merged dataset (--indep 200 5 1.15 --indep-pairwise 100 5 0.1). The top 10 PCs were computed with the flashpca2 (ref. 70) software with all default parameters. Next, we use the super population label and PCs of the 1000 Genome individuals to train the K-nearest neighbours model to assign genetic ancestry labels to each ATLAS individual. For each ancestry cluster, we run the K-nearest neighbours model on the pair of PCs that capture the most variation for each genetic ancestry group: the European, East Asian and African ancestry groups use PCs 1 and 2, the Admixed American group uses PCs 2 and 3, and the South Asian group uses PCs 4 and 5. In each analysis, we use tenfold cross-validation to select the k hyper-parameter from k = 5, 10, 15, 20. If an individual is assigned to multiple ancestries with probability larger than 0.5 or is not assigned to any clusters, their ancestry is labelled as unknown. We label the five 1000 Genome super population as EA for Europeans, HL for Admixed Americans, SAA for South Asians, AA for Africans and ESA for East Asians. We can cluster 95% of the ATLAS participants into 22,380 EA, 6,973 HL, 625 SAA, 3,331 EAA and 1,995 AA, and the ancestry of 2,332 individuals is labelled as unknown.
Genotype data
In simulations, we use 1,054,151 UKBB HapMap 3 SNPs for simulating phenotypes, training PGS models and calculating PGS for testing individuals in UKBB. For real-data analysis, we use an intersection of UKBB HapMap 3 SNPs and ATLAS imputed SNPs for the training of PGS in UKBB and calculating PGS for remaining UKBB individuals and ATLAS individuals. We start from 1,054,151 UKBB HapMap 3 SNPs and 8,048,268 ATLAS imputed SNPs. As UKBB is on genome build hg37 and ATLAS is on hg38, we first lift all ATLAS SNPs from hg38 to hg37 with the snp_modifyBuild function in the bigsnpr R package. Next, we match UKBB SNPs and ATLAS SNPs by chromosome and position with the snp_match function in bigsnpr. Then, we recode ATLAS SNPs using UKBB reference alleles with the plink2 --recode flag. In the end, 979,457 SNPs remain for training the LDpred2 models in real-data analysis.
Simulated phenotypes
We use simulations on all UKBB individuals to investigate the impact of GD from training data on the various metrics of PGS. We fix the proportion of causal SNPs and heritability as . The simulated genetic effects and phenotype are generated as follows. First, we randomly sample
in which is the variance of allele counts for SNP among all UKBB individuals. Second, we compute the genetic liability for each individual as and randomly sample environmental noise . Third, we generate phenotype as . We repeat the process 100 times to generate 100 sets of genetic liability and phenotypes.
GD from PGS training data
To compute the GD of testing individuals from the training population, we carry out PCA on the 371,018 UKBB WB training individuals and project the 48,586 UKBB testing individuals and 36,778 ATLAS testing individuals on the PC space. We start from the 979,457 SNPs that are overlapped in UKBB and ATLAS. First, we carry out LD pruning with plink2 (--indep-pairwise 1000 50 0.05) and exclude the long-range LD regions. Next, we carry out PCA analysis with flashpca2 (ref. 70) on the 371,018 UKBB WB training individuals to obtain the top 20 PCs. Then, we project the remaining 48,586 UKBB individuals that are not included in the training data and 36,778 ATLAS individuals onto the PC space of training data by using SNP loadings (--outload loadings.txt) and their means and standard deviations (--outmeansd meansd.txt) output from flashpca2. In the end, we compute the GD for each individual as the Euclidean distance of their PCs from the centre of training data with the equation , in which is the PC of individual .
LDpred2 PGS model training
The PGS models were trained on 371,018 UKBB individuals labelled as WB with the LDpred2 (ref. 53) method for both simulation and real-data analysis. For simulation analysis, we use 1,054,151 UKBB HapMap 3 variants. For real-data analysis, we use 979,457 SNPs that are overlapped in UKBB HapMap 3 variants and ATLAS imputed genotypes.
First, we obtain GWAS summary statistics by carrying out GWAS on the training individuals with plink2 using sex, age and precomputed PC1–16 as covariates. Second, we calculate the in-sample LD matrix with the function snp_cor from the R package bigsnpr71. Next, we use the GWAS summary statistics and LD matrix as input for the snp_ldpred2_auto function in bigsnpr to sample from the posterior distribution of genetic effect sizes. Instead of using a held-out validation dataset to select hyperparameters p (proportion of causal variants) and h2 (heritability), snp_ldpred2_auto estimates the two parameters from data with the Markov chain Monte Carlo (MCMC) method directly. We run 10 chains with different initial sparsity p from 10−4 to 1 equally spaced in log space. For all chains, we set the initial heritability as the LD score regression heritability72 estimated by the built-in function snp_ldsc. We carry out quality control of the 10 chains by filtering out chains with estimated heritability that is smaller than 0.7 times the median heritability of the 10 chains or with estimated sparsity that is smaller than 0.5 times the median sparsity or larger than 2 times the median sparsity. For each chain that passes filtering, we remove the first 100 MCMC iterations as burn-in and thin the next 500 iterations by selecting every fifth iteration to reduce autocorrelation between MCMC samples. In the end, we obtain an M × B matrix , in which each column of the matrix is a sample of posterior causal effects of the M SNPs. Owing to the quality control of MCMC chains, the total number of posterior samples ranges from 500 to 1,000.
Calculation of PGS and accuracy
We use the score function in plink2 to compute the PGS for 48,586 and 36,778 testing individuals in UKBB and ATLAS, respectively. For each , we compute the PGS for each individual i as with plink2 (--score). For each individual with genotype , we compute to approximate its posterior distribution of genetic liability. The genotype is centred to the average allele count (--read-freq) in training data to reduce the uncertainty from the unmodelled intercept. We estimate the PGS with the posterior mean of the genetic liability as . We estimate the individual-level PGS uncertainty as . The individual-level PGS accuracy is calculated as for simulation ( is the heritability estimated by the LDpred2 model) and as for real-data analysis, in which is the variance of residual phenotype in training data after regressing out GWAS covariates.
Calibration of credible interval in simulation
We run the LDpred2 model on 371,018 WB training individuals for the 100 simulation replicates. In each simulation, for individual with genotype , we compute to approximate their posterior distribution of genetic liability, generate a 90% credible interval (90% credible interval of genetic liability of ith individual in rth replication) with 5% and 95% quantile of the distribution and check whether their genetic liability is contained in the credible interval . We compute the empirical coverage for each individual as the mean across the 100 simulation replicates
Ethics declarations
All research carried out in this study conformed with the principles of the Helsinki Declaration. All individuals provided written informed consent to the original recruitment of the UCLA ATLAS Community Health Initiative. Patient Recruitment and Sample Collection for Precision Health Activities at UCLA is an approved study by the UCLA Institutional Review Board (Institutional Review Board number 17-001013). All analyses in this study use de-identified data (without any protected health information) with no possibility of re-identifying any of the participants.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Online content
Any methods, additional references, Nature Portfolio reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at 10.1038/s41586-023-06079-4.
Supplementary information
Acknowledgements
We thank D. Geschwind, Y. Wu, R. Mester, R. Johnson and K. S. Burch for insightful comments. We gratefully acknowledge the resources provided by the Institute for Precision Health and participating patients from the UCLA ATLAS Community Health Initiative. The UCLA ATLAS Community Health Initiative in collaboration with UCLA ATLAS Precision Health Biobank is a programme of the Institute for Precision Health, which directs and supports the biobanking and genotyping of biospecimen samples from participating patients from UCLA in collaboration with the David Geffen School of Medicine, UCLA Clinical and Translational Science Institute and UCLA Health. The ATLAS Community Health Initiative is supported by UCLA Health, the David Geffen School of Medicine and a grant from the UCLA Clinical and Translational Science Institute (UL1TR001881). This research was conducted using the UKBB resource under application 33127. We thank the participants of UKBB for making this work possible. This work was financially supported in part by National Institutes of Health awards U01HG011715, R01HG009120 and R01MH115676. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
Extended data figures and tables
Author contributions
Y.D. and B.P. conceived and designed the experiments. Y.D., K.H., Z.X., A.P., E.P. and K.B. carried out the experiments and statistical analyses. F.P., B.J.V. and L.M.O.L. provided statistical support. Y.D., K.H. and B.P. wrote the manuscript with the participation of all authors.
Peer review
Peer review information
Nature thanks the anonymous reviewers for their contribution to the peer review of this work. Peer reviewer reports are available.
Data availability
The individual-level genotype and phenotype data of UKBB are available by application from http://www.ukbiobank.ac.uk/. Owing to privacy concerns, de-identified individual-level data for UCLA ATLAS are available only to UCLA researchers and can be accessed through the Discovery Data Repository Dashboard (https://it.uclahealth.org/about/ohia/ohia-products/discovery-data-repository-dashboard-0). Summary ATLAS association statistics are publicly available at https://atlas-phewas.mednet.ucla.edu/. PGSs for 84 traits investigated in this manuscript are available at the PGS Catalog (PGP000457, https://www.pgscatalog.org/publication/PGP000457). MCMC samplings of PGSs are available at 10.6084/m9.figshare.22413970.
Code availability
Scripts for simulations and real-data analyses are available at https://github.com/yidingdd/individual-pgs-accuracy (ref. 72).
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Contributor Information
Yi Ding, Email: yiding920@ucla.edu.
Bogdan Pasaniuc, Email: pasaniuc@ucla.edu.
Extended data
is available for this paper at 10.1038/s41586-023-06079-4.
Supplementary information
The online version contains supplementary material available at 10.1038/s41586-023-06079-4.
References
- 1.Martin AR, et al. Clinical use of current polygenic risk scores may exacerbate health disparities. Nat. Genet. 2019;51:584–591. doi: 10.1038/s41588-019-0379-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.Mostafavi H, et al. Variable prediction accuracy of polygenic scores within an ancestry group. Elife. 2020;9:e48376. doi: 10.7554/eLife.48376. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Wang Y, Tsuo K, Kanai M, Neale BM, Martin AR. Challenges and opportunities for developing more generalizable polygenic risk scores. Annu. Rev. Biomed. Data Sci. 2022;5:293–320. doi: 10.1146/annurev-biodatasci-111721-074830. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Lambert SA, Abraham G, Inouye M. Towards clinical utility of polygenic risk scores. Hum. Mol. Genet. 2019;28:R133–R142. doi: 10.1093/hmg/ddz187. [DOI] [PubMed] [Google Scholar]
- 5.Johnson R, et al. Leveraging genomic diversity for discovery in an electronic health record linked biobank: the UCLA ATLAS Community Health Initiative. Genome Med. 2022;14:104. doi: 10.1186/s13073-022-01106-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Bycroft C, et al. The UK Biobank resource with deep phenotyping and genomic data. Nature. 2018;562:203–209. doi: 10.1038/s41586-018-0579-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Lewis ACF, et al. Getting genetic ancestry right for science and society. Science. 2022;376:250–252. doi: 10.1126/science.abm7530. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Torkamani A, Wineinger NE, Topol EJ. The personal and clinical utility of polygenic risk scores. Nat. Rev. Genet. 2018;19:581–590. doi: 10.1038/s41576-018-0018-x. [DOI] [PubMed] [Google Scholar]
- 9.Kullo, I. J. et al. Polygenic scores in biomedical research. Nat. Rev. Genet.10.1038/s41576-022-00470-z (2022). [DOI] [PMC free article] [PubMed]
- 10.Li R, Chen Y, Ritchie MD, Moore JH. Electronic health records and polygenic risk scores for predicting disease risk. Nat. Rev. Genet. 2020;21:493–502. doi: 10.1038/s41576-020-0224-1. [DOI] [PubMed] [Google Scholar]
- 11.Natarajan P, et al. Polygenic risk score identifies subgroup with higher burden of atherosclerosis and greater relative benefit from statin therapy in the primary prevention setting. Circulation. 2017;135:2091–2101. doi: 10.1161/CIRCULATIONAHA.116.024436. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Lee A, et al. BOADICEA: a comprehensive breast cancer risk prediction model incorporating genetic and nongenetic risk factors. Genet. Med. 2019;21:1708–1718. doi: 10.1038/s41436-018-0406-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Khera AV, et al. Polygenic prediction of weight and obesity trajectories from birth to adulthood. Cell. 2019;177:587–596. doi: 10.1016/j.cell.2019.03.028. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Perkins DO, et al. Polygenic risk score contribution to psychosis prediction in a target population of persons at clinical high risk. Am. J. Psychiatry. 2020;177:155–163. doi: 10.1176/appi.ajp.2019.18060721. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Polygenic Risk Score Task Force of the International Common Disease Alliance. Responsible use of polygenic risk scores in the clinic: potential benefits, risks and gaps. Nat. Med. 2021;27:1876–1884. doi: 10.1038/s41591-021-01549-6. [DOI] [PubMed] [Google Scholar]
- 16.Rodriguez F, et al. Atherosclerotic cardiovascular disease risk prediction in disaggregated Asian and Hispanic subgroups using electronic health records. J. Am. Heart Assoc. 2019;8:e011874. doi: 10.1161/JAHA.118.011874. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Jee YH, et al. Validating breast cancer risk prediction models in the Korean Cancer Prevention Study-II Biobank. Cancer Epidemiol. Biomarkers Prev. 2020;29:1271–1277. doi: 10.1158/1055-9965.EPI-19-1478. [DOI] [PubMed] [Google Scholar]
- 18.Myers PD, et al. Identifying unreliable predictions in clinical risk models. NPJ Digit. Med. 2020;3:8. doi: 10.1038/s41746-019-0209-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.1000 Genomes Project Consortium et al. A global reference for human genetic variation. Nature526, 68–74 (2015). [DOI] [PMC free article] [PubMed]
- 20.Meuwissen TH, Hayes BJ, Goddard ME. Prediction of total genetic value using genome-wide dense marker maps. Genetics. 2001;157:1819–1829. doi: 10.1093/genetics/157.4.1819. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.VanRaden PM. Efficient methods to compute genomic predictions. J. Dairy Sci. 2008;91:4414–4423. doi: 10.3168/jds.2007-0980. [DOI] [PubMed] [Google Scholar]
- 22.Goddard M. Genomic selection: prediction of accuracy and maximisation of long term response. Genetica. 2009;136:245–257. doi: 10.1007/s10709-008-9308-0. [DOI] [PubMed] [Google Scholar]
- 23.Goddard ME, Hayes BJ, Meuwissen THE. Using the genomic relationship matrix to predict the accuracy of genomic selection. J. Anim. Breed. Genet. 2011;128:409–421. doi: 10.1111/j.1439-0388.2011.00964.x. [DOI] [PubMed] [Google Scholar]
- 24.Daetwyler HD, Villanueva B, Woolliams JA. Accuracy of predicting the genetic risk of disease using a genome-wide approach. PLoS ONE. 2008;3:e3395. doi: 10.1371/journal.pone.0003395. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.Visscher, P. M., Yang, J. & Goddard, M. E. A commentary on “Common SNPs explain a large proportion of the heritability for human height” by Yang et al. (2010). Twin Res. Hum. Genet.13, 517–524 (2010). [DOI] [PubMed]
- 26.Wray NR, et al. Pitfalls of predicting complex traits from SNPs. Nat. Rev. Genet. 2013;14:507–515. doi: 10.1038/nrg3457. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27.Wientjes YCJ, et al. Empirical and deterministic accuracies of across-population genomic prediction. Genet. Sel. Evol. 2015;47:5. doi: 10.1186/s12711-014-0086-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28.Falconer, D. S. Introduction to Quantitative Genetics (Longman Scientific & Technical, 1989).
- 29.Lynch, M. & Walsh, B. Genetics and Analysis of Quantitative Traits Vol. 1 (Sinauer Sunderland, 1998).
- 30.Walsh, B. & Lynch, M. Evolution and Selection of Quantitative Traits (Oxford Univ. Press, 2018).
- 31.Yang J, et al. Common SNPs explain a large proportion of the heritability for human height. Nat. Genet. 2010;42:565–569. doi: 10.1038/ng.608. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32.Ben Zaabza H, Mäntysaari EA, Strandén I. Using Monte Carlo method to include polygenic effects in calculation of SNP-BLUP model reliability. J. Dairy Sci. 2020;103:5170–5182. doi: 10.3168/jds.2019-17255. [DOI] [PubMed] [Google Scholar]
- 33.Pszczola M, Strabel T, Mulder HA, Calus MPL. Reliability of direct genomic values for animals with different relationships within and to the reference population. J. Dairy Sci. 2012;95:389–400. doi: 10.3168/jds.2011-4338. [DOI] [PubMed] [Google Scholar]
- 34.Wientjes YCJ, Veerkamp RF, Calus MPL. The effect of linkage disequilibrium and family relationships on the reliability of genomic prediction. Genetics. 2013;193:621–631. doi: 10.1534/genetics.112.146290. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35.Habier D, Fernando RL, Dekkers JCM. The impact of genetic relationship information on genome-assisted breeding values. Genetics. 2007;177:2389–2397. doi: 10.1534/genetics.107.081190. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36.Lee SH, Weerasinghe WMSP, Wray NR, Goddard ME, van der Werf JHJ. Using information of relatives in genomic prediction to apply effective stratified medicine. Sci. Rep. 2017;7:42091. doi: 10.1038/srep42091. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 37.Truong B, et al. Efficient polygenic risk scores for biobank scale data by exploiting phenotypes from inferred relatives. Nat. Commun. 2020;11:3074. doi: 10.1038/s41467-020-16829-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38.Martin AR, et al. Human demographic history impacts genetic risk prediction across diverse populations. Am. J. Hum. Genet. 2017;100:635–649. doi: 10.1016/j.ajhg.2017.03.004. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 39.Privé F, et al. Portability of 245 polygenic scores when derived from the UK Biobank and applied to 9 ancestry groups from the same cohort. Am. J. Hum. Genet. 2022;109:12–23. doi: 10.1016/j.ajhg.2021.11.008. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40.Bitarello BD, Mathieson I. Polygenic scores for height in admixed populations. G3. 2020;10:4027–4036. doi: 10.1534/g3.120.401658. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 41.Wang Y, et al. Theoretical and empirical quantification of the accuracy of polygenic scores in ancestry divergent populations. Nat. Commun. 2020;11:3865. doi: 10.1038/s41467-020-17719-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 42.Coop, G. Genetic similarity versus genetic ancestry groups as sample descriptors in human genetics. Preprint at https://arxiv.org/abs/2207.11595 (2022).
- 43.Mathieson I, Scally A. What is ancestry? PLoS Genet. 2020;16:e1008624. doi: 10.1371/journal.pgen.1008624. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 44.Krainc T, Fuentes A. Genetic ancestry in precision medicine is reshaping the race debate. Proc. Natl Acad. Sci. USA. 2022;119:e2203033119. doi: 10.1073/pnas.2203033119. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 45.Belbin GM, et al. Toward a fine-scale population health monitoring system. Cell. 2021;184:2068–2083. doi: 10.1016/j.cell.2021.03.034. [DOI] [PubMed] [Google Scholar]
- 46.Wojcik GL, et al. Genetic analyses of diverse populations improves discovery for complex traits. Nature. 2019;570:514–518. doi: 10.1038/s41586-019-1310-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 47.Clarke SL, et al. Race and ethnicity stratification for polygenic risk score analyses may mask disparities in Hispanics. Circulation. 2022;146:265–267. doi: 10.1161/CIRCULATIONAHA.122.059162. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 48.Johnson R, et al. The UCLA ATLAS Community Health Initiative: promoting precision health research in a diverse biobank. Cell Genomics. 2023;3:100243. doi: 10.1016/j.xgen.2022.100243. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 49.Ding Y, et al. Large uncertainty in individual polygenic risk score estimation impacts PRS-based risk stratification. Nat. Genet. 2021;54:30–39. doi: 10.1038/s41588-021-00961-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 50.Price AL, et al. Principal components analysis corrects for stratification in genome-wide association studies. Nat. Genet. 2006;38:904–909. doi: 10.1038/ng1847. [DOI] [PubMed] [Google Scholar]
- 51.Alexander DH, Lange K. Enhancements to the ADMIXTURE algorithm for individual ancestry estimation. BMC Bioinform. 2011;12:246. doi: 10.1186/1471-2105-12-246. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 52.Gorjanc G, Bijma P, Hickey JM. Reliability of pedigree-based and genomic evaluations in selected populations. Genet. Sel. Evol. 2015;47:65. doi: 10.1186/s12711-015-0145-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 53.Privé, F., Arbel, J. & Vilhjálmsson, B. J. LDpred2: better, faster, stronger. Bioinformatics36, 5424–5431 (2021). [DOI] [PMC free article] [PubMed]
- 54.Reich D, et al. Reduced neutrophil count in people of African descent is due to a regulatory variant in the Duffy antigen receptor for chemokines gene. PLoS Genet. 2009;5:e1000360. doi: 10.1371/journal.pgen.1000360. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 55.Scutari M, Mackay I, Balding D. Using genetic distance to infer the accuracy of genomic prediction. PLoS Genet. 2016;12:e1006288. doi: 10.1371/journal.pgen.1006288. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 56.Fatumo S, et al. A roadmap to increase diversity in genomic studies. Nat. Med. 2022;28:243–250. doi: 10.1038/s41591-021-01672-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 57.Ruan Y, et al. Improving polygenic prediction in ancestrally diverse populations. Nat. Genet. 2022;54:573–580. doi: 10.1038/s41588-022-01054-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 58.Spence, J. P., Sinnott-Armstrong, N., Assimes, T. L. & Pritchard, J. K. A flexible modeling and inference framework for estimating variant effect sizes from GWAS summary statistics. Preprint at bioRxiv10.1101/2022.04.18.488696 (2022).
- 59.Zhang, H. et al. Novel methods for multi-ancestry polygenic prediction and their evaluations in 3.7 million individuals of diverse ancestry. Preprint at bioRxiv10.1101/2022.03.24.485519 (2022).
- 60.Martin AR, et al. Increasing diversity in genomics requires investment in equitable partnerships and capacity building. Nat. Genet. 2022;54:740–745. doi: 10.1038/s41588-022-01095-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 61.Manolio TA, et al. Finding the missing heritability of complex diseases. Nature. 2009;461:747–753. doi: 10.1038/nature08494. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 62.Hou K, et al. Accurate estimation of SNP-heritability from biobank-scale data irrespective of genetic architecture. Nat. Genet. 2019;51:1244–1251. doi: 10.1038/s41588-019-0465-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 63.Ge T, Chen C-Y, Ni Y, Feng Y-CA, Smoller JW. Polygenic prediction via Bayesian regression and continuous shrinkage priors. Nat. Commun. 2019;10:1776. doi: 10.1038/s41467-019-09718-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 64.Zhang, Q., Privé, F., Vilhjálmsson, B. & Speed, D. Improved genetic prediction of complex traits from individual-level data or summary statistics. Nat. Commun. 12, 4192 (2021). [DOI] [PMC free article] [PubMed]
- 65.Lloyd-Jones LR, et al. Improved polygenic prediction by Bayesian multiple regression on summary statistics. Nat. Commun. 2019;10:5086. doi: 10.1038/s41467-019-12653-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 66.Abraham G, Inouye M. Fast principal component analysis of large-scale genome-wide data. PLoS ONE. 2014;9:e93766. doi: 10.1371/journal.pone.0093766. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 67.Sorensen, D. & Gianola, D. Likelihood, Bayesian, and MCMC Methods in Quantitative Genetics (Springer, 2002).
- 68.Behar DM, et al. No evidence from genome-wide data of a Khazar origin for the Ashkenazi Jews. Hum. Biol. 2013;85:859–900. doi: 10.3378/027.085.0604. [DOI] [PubMed] [Google Scholar]
- 69.Abraham G, Qiu Y, Inouye M. FlashPCA2: principal component analysis of biobank-scale genotype datasets. Bioinformatics. 2017;33:2776–2778. doi: 10.1093/bioinformatics/btx299. [DOI] [PubMed] [Google Scholar]
- 70.Privé F, Aschard H, Ziyatdinov A, Blum MGB. Efficient analysis of large-scale genome-wide data with two R packages: bigstatsr and bigsnpr. Bioinformatics. 2018;34:2781–2787. doi: 10.1093/bioinformatics/bty185. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 71.Bulik-Sullivan BK, et al. LD score regression distinguishes confounding from polygenicity in genome-wide association studies. Nat. Genet. 2015;47:291–295. doi: 10.1038/ng.3211. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 72.Ding, Y. yidingdd/individual-pgs-accuracy. Zenodo 10.5281/zenodo.7787462 (2023)
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Data Availability Statement
The individual-level genotype and phenotype data of UKBB are available by application from http://www.ukbiobank.ac.uk/. Owing to privacy concerns, de-identified individual-level data for UCLA ATLAS are available only to UCLA researchers and can be accessed through the Discovery Data Repository Dashboard (https://it.uclahealth.org/about/ohia/ohia-products/discovery-data-repository-dashboard-0). Summary ATLAS association statistics are publicly available at https://atlas-phewas.mednet.ucla.edu/. PGSs for 84 traits investigated in this manuscript are available at the PGS Catalog (PGP000457, https://www.pgscatalog.org/publication/PGP000457). MCMC samplings of PGSs are available at 10.6084/m9.figshare.22413970.
Scripts for simulations and real-data analyses are available at https://github.com/yidingdd/individual-pgs-accuracy (ref. 72).