Human Genetics and Genomics Advances. 2023 Feb 13;4(2):100184. doi: 10.1016/j.xhgg.2023.100184

Low and differential polygenic score generalizability among African populations due largely to genetic diversity

Lerato Majara 1,2,6, Allan Kalungi 1,3,4,5, Nastassja Koen 1,6,7, Kristin Tsuo 8,9,10,11, Ying Wang 8,9,10, Rahul Gupta 8,9,10,11, Lethukuthula L Nkambule 8,9,10, Heather Zar 12, Dan J Stein 6,7, Eugene Kinyanda 5, Elizabeth G Atkinson 13, Alicia R Martin 8,9,10,14
PMCID: PMC9982687  PMID: 36873096

Summary

African populations are vastly underrepresented in genetic studies but have the most genetic variation and face wide-ranging environmental exposures globally. Because systematic evaluations of genetic prediction had not yet been conducted in ancestries that span African diversity, we calculated polygenic risk scores (PRSs) in simulations across Africa and in empirical data from South Africa, Uganda, and the United Kingdom to better understand the generalizability of genetic studies. PRS accuracy improves with ancestry-matched discovery cohorts more than from ancestry-mismatched studies. Within ancestrally and ethnically diverse South African individuals, we find that PRS accuracy is low for all traits but varies across groups. Differences in African ancestries contribute more to variability in PRS accuracy than other large cohort differences considered between individuals in the United Kingdom versus Uganda. We computed PRS in African ancestry populations using existing European-only versus ancestrally diverse genetic studies; the increased diversity produced the largest accuracy gains for hemoglobin concentration and white blood cell count, reflecting large-effect ancestry-enriched variants in genes known to influence sickle cell anemia and the allergic response, respectively. Differences in PRS accuracy across African ancestries originating from diverse regions are as large as across out-of-Africa continental ancestries, requiring commensurate nuance.

Key words: polygenic scores, Africa, GWAS, health disparities, global health, population genetics


Majara et al. show that polygenic scores generalize variably across diverse populations due to genetic differences between target and discovery cohorts. Multi-ancestry discovery GWAS typically improve prediction accuracy in underrepresented populations more than an increase in the European sample size, but not necessarily when sample sizes are imbalanced.

Introduction

Genome-wide association studies (GWASs) have yielded important biological insights into the heritable basis of many complex traits and diseases.1 However, the vast majority of studies have been conducted in populations of European descent, potentially limiting generalizability across diverse populations.2,3,4,5,6 Genome-wide significant SNP associations with phenotypes spanning a wide range of genetic architectures have consistently replicated across populations in both direction and effect size, with few examples of heterogeneous effect sizes.7,8,9 However, previous studies that have compared the association between genetically predicted versus measured phenotypes in diverse populations using polygenic risk scores (PRSs) have found that PRS accuracy decreases with increasing genetic distance between the GWAS discovery and PRS target cohorts.4,10,11 This seeming paradox highlights that while variant-level associations consistently replicate across populations, genome-wide aggregate measures are more predictive but less generalizable.12 Since the earliest applications of PRS in human genetics, these concepts—coupled with Eurocentric study biases—have resulted in PRSs that are most accurate in European ancestry populations and least accurate in African ancestry populations.13 These study biases and phenomena continue to replicate a decade later, with several-fold differences in prediction accuracy of many traits between European and non-European ancestry populations.4

Quantifying PRS generalizability within and among African populations requires considerable nuance, as they represent the most genetically diverse populations globally, with more than a million more genetic variants per person than out-of-Africa populations.14 Populations collected even within the same geographic regions of Africa have complex demographic histories with complicated patterns of admixture and population structure.15,16,17,18 Further, African ancestry populations experience vastly different environments within versus outside continental Africa as well as more locally among diverse communities, countries, and regions of Africa. These differences provide unique epidemiological opportunities to query the impacts of vastly differing environments on PRS accuracy. Previous empirical analyses and theoretical work fundamentally inform how demographic history and environmental variation interplay to produce PRS heterogeneity in traditionally underserved populations.19,20,21,22,23

The inclusion of African ancestry participants in large-scale genetic studies is uniquely important for many reasons. They have the lowest life expectancies globally,24,25 receive the lowest access to and quality of medical care in the United States,26 and are the most underserved by genetic technologies.6,27 A more nuanced understanding of PRS transferability will critically inform which populations are currently the most underserved and thus where building genetic studies and resources will have the biggest benefits globally.

There are also clear benefits to including African populations in statistical genetics efforts. Because humans originated in Africa, populations from Africa have the most genetic diversity among global populations,14,28,29 such that more genotype-phenotype associations are expected in Africa than can be found elsewhere. African American individuals have been shown to contribute disproportionately to GWAS findings,2 making up 2.8% of GWAS participants but contributing 7% of trait associations. African ancestry populations also have shorter blocks of linkage disequilibrium, which improves resolution to fine-map causal variants.30 PRS accuracy is lowest in African ancestry populations due to GWAS study biases,4 but when GWASs include these and other diverse populations, PRS predict traits such as schizophrenia more accurately across all populations compared with single-ancestry GWASs.31

In this study, we have investigated how PRSs generalize within and among diverse African populations in simulations and with empirical genotype-phenotype data for dozens of quantitative traits. We first simulated genetic effects and computed genetic risk prediction accuracy using data from two African datasets: the African Genome Variation Project (AGVP) and the Africa Wits-INDEPTH partnership for Genomic Studies (AWI-Gen) project. We then calculated PRS using publicly available GWAS summary statistics from predominantly European ancestry populations to (1) quantify PRS accuracy for five physical and psychosocial traits among populations in the Drakenstein Child Health Study (DCHS) of South Africa, a birth cohort study; and (2) compare PRS accuracy for 34 quantitative traits across the Ugandan General Population Cohort (GPC) versus ancestrally diverse UK Biobank participants. Our results highlight the disproportionate benefits of genetic studies in diverse African populations to improve trait prediction. Further, while PRSs hold promise as biomarkers in precision medicine, a critical prerequisite is equitable accuracy in diverse populations to avoid exacerbating existing health disparities.

Materials and methods

Genetic and phenotypic data

Total counts of individuals by population and/or study are shown in Table S2.

1000 Genomes Project

1000 Genomes Project data from the phase 3 integrated call set were accessed and used as a reference panel and for phasing and imputation.14

Human Genome Diversity Project

Genotype data for samples from the Human Genome Diversity Project (HGDP) were publicly available on the Illumina HumanHap650K GWAS array on hg18.32 We lifted over the genotype data to the hg19 genome build using hail (http://hail.is).

African Genome Variation Project

As described previously,33 the AGVP data consist of dense genotype data from 1,481 individuals from 18 ethnolinguistic groups from Eastern, Western, and Southern Africa when including the Luhya and Yoruba from the 1000 Genomes Project.14 When accessed from the European Genome-Phenome Archive (EGA:EGAD00010001047), “Ethiopian” is the provided population label encompassing the Oromo, Amhara, and Somali groups. After collapsing these groups and counting the 1000 Genomes data separately, 1,307 individuals from 14 populations are uniquely represented in the AGVP, and 2,504 individuals from 26 populations are represented in the 1000 Genomes Project data (661 individuals from seven populations are in the AFR super population grouping).

Africa Wits-INDEPTH partnership for genomic studies

AWI-Gen is a study that investigates the relationship between genetics and the environment in causing cardiometabolic disease in sub-Saharan Africa with study participants from Burkina Faso, Ghana, Kenya, and South Africa.34,35 It is a partnership between the University of the Witwatersrand (Wits) in Johannesburg, South Africa, and the International Network for the Demographic Evaluation of Populations and Their Health (INDEPTH). Ethical approval was obtained from the Human Research Ethics Committee of the University of the Witwatersrand (Protocol Number: M121029) and from the institutions of the respective centres that are in the international network.

Drakenstein Child Health Study in South Africa

The DCHS is an ongoing, multidisciplinary population-based birth cohort study in the Drakenstein area in Paarl (outside Cape Town, South Africa), that obtained ethical approval from the Faculty of Health Sciences Research Ethics Committee at the University of Cape Town (401/2009) and the Western Cape Provincial Research committee.36,37,38 After providing informed consent, pregnant women were enrolled during their second trimester (20–28 weeks’ gestation); maternal-child dyads were then followed through childbirth and longitudinally thereafter. Enrollment occurred from March 2012 to March 2015 at two primary health care clinics: TC Newman (serving a predominantly mixed ancestry population) and Mbekweni (serving a predominantly Black African population). Women were eligible to participate in the DCHS if they attended one of the study clinics, were at least 18 years of age, and intended to remain residing in the study area. Ancestral diversity computed using PCA with genetic data is shown in Figure S5 with corresponding self-reported ethnicity (“Mixed” versus “Black/African”).

Uganda General Population Cohort

The rural Uganda GPC of the MRC/UVRI & LSHTM Uganda Research Unit was set up in 1989, initially to monitor the HIV epidemic among adults, children, and adolescents, after obtaining ethical approval from the Uganda Virus Research Institute Science and Ethics Committee and the Ugandan National Council of Science and Technology. Its mandate has since expanded to include other medical conditions.39 The “original GPC” is located in the sub-county of Kyamulibwa in rural southwestern Uganda, with activities having recently been expanded to the two neighboring peri-urban townships of Lwabenge and Lukaya. The “original GPC” includes about 10,000 adults and about 10,000 children and adolescents. In 2011, genotype data were generated on more than 5,000 adult participants from nine ethnolinguistic groups using the Illumina HumanOmni2.5 BeadChip at the Wellcome Trust Sanger Institute.39,40

UK Biobank

The UK Biobank enrolled 500,000 people aged between 40 and 69 years in 2006–2010 from across the country, as described previously.41 A more detailed description of the cohort is available on their website: https://www.ukbiobank.ac.uk/. We analyzed phenotypes that overlapped with those studied in the Uganda GPC.

Ancestry analysis in the UK Biobank

As described previously,41 the UK Biobank consists of approximately 500,000 participants of primarily European ancestry who have thousands of measured or reported phenotypes. To assess polygenic score accuracy across diverse ancestries, we identified populations of ancestral groups at two levels: (1) among continental groups, and (2) among regions in Africa. To define continental ancestries, we first combined reference data from the 1000 Genomes Project and HGDP. We combined these reference datasets into continental ancestries according to their corresponding meta-data (Table S5). We then ran PCA on unrelated individuals from the reference dataset. To partition individuals in the UK Biobank based on their continental ancestry, we used the PC loadings from the reference dataset to project UK Biobank individuals into the same PC space. We trained a random forest classifier given continental ancestry meta-data (AFR = African, AMR = admixed American, CSA = Central/South Asian, EAS = East Asian, EUR = European, and MID = Middle Eastern) based on the top six PCs from the reference training data. We applied this random forest to the projected UK Biobank PCA data and assigned initial ancestries if the random forest probability was >50% (similar results obtained for p > 0.9), otherwise individuals were dropped from further analysis.
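The final assignment step, in which an individual keeps a continental label only when the classifier is sufficiently confident, can be illustrated with a small sketch. It assumes the random forest has already produced per-class probabilities; the function name, sample identifiers, and probability values are hypothetical, and only the 0.5 and 0.9 cutoffs come from the text.

```python
def assign_ancestry(class_probs, threshold=0.5):
    """Return the most probable ancestry label if its probability
    exceeds the threshold; otherwise return None (individual dropped)."""
    label, prob = max(class_probs.items(), key=lambda kv: kv[1])
    return label if prob > threshold else None


# Hypothetical classifier outputs for two projected individuals.
samples = {
    "sample_1": {"AFR": 0.92, "EUR": 0.05, "CSA": 0.03},
    "sample_2": {"AFR": 0.40, "AMR": 0.35, "EUR": 0.25},
}

# sample_1 is confidently assigned; sample_2 is dropped (None).
assignments = {sid: assign_ancestry(p) for sid, p in samples.items()}

# The stricter sensitivity check mentioned in the text (p > 0.9).
strict = {sid: assign_ancestry(p, threshold=0.9) for sid, p in samples.items()}
```

Raising the threshold trades sample size for purity of the assigned groups, which is why a comparison of the two cutoffs is informative.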

Next, we further partitioned African ancestry individuals using the same random forest approach as above but without further probability thresholding using African ancestry reference data from AGVP, HGDP, and the 1000 Genomes Project. We partitioned these reference data into UN regional codes with an additional region for Ethiopian populations given their unique population history and collapsing in AGVP data (Admixed, Central, East, Ethiopia, South, and West Africa), as shown in Table S5. PCA with reference data at the continental and subcontinental level within Africa are shown in Figures S10 and S11.

Phasing and imputation

We used the Ricopili pipeline to conduct pre-imputation quality control (QC) and perform phasing and imputation for AGVP and the Uganda GPC.42 This pipeline was also used on the DCHS data, as described previously.43 Briefly, we phased the data using Eagle 2.3.5 and imputed variants using minimac3 in chunks ≥3 Mb. The 1000 Genomes phase 3 haplotypes were used as the reference panel for phasing and imputation. For the AGVP, we used strict best guess genotypes where a variant was called if it had a probability of p > 0.8 and a missing rate less than 0.01 and MAF >5%. Then, variants with MAF <0.001 were excluded from the dataset. For Uganda GPC, we used combined best guess genotypes where a variant was called if it had a probability p > 0.8 or set to missing otherwise. Then, SNPs were filtered to keep sites with missingness <0.01 and MAF >0.05. We used genotype dosages when computing PRS.
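The best-guess calling and variant filters described for the Uganda GPC can be sketched as follows. This is a minimal illustration in plain Python, not the Ricopili implementation; the function names and toy genotype probabilities are hypothetical, and only the thresholds (p > 0.8, missingness <0.01, MAF >0.05) come from the text.

```python
def best_guess(probs, min_prob=0.8):
    """Return the most probable genotype (0, 1, or 2 alternate alleles),
    or None (missing) if its probability does not exceed min_prob."""
    call = max(range(3), key=lambda g: probs[g])
    return call if probs[call] > min_prob else None


def passes_filters(calls, max_missing=0.01, min_maf=0.05):
    """Keep a variant only if missingness < max_missing and MAF > min_maf."""
    observed = [g for g in calls if g is not None]
    if not observed:
        return False
    missing_rate = 1.0 - len(observed) / len(calls)
    alt_freq = sum(observed) / (2.0 * len(observed))
    maf = min(alt_freq, 1.0 - alt_freq)
    return missing_rate < max_missing and maf > min_maf


# Hypothetical genotype probabilities for 200 samples at three variants.
confident = ([(0.95, 0.04, 0.01)] * 150
             + [(0.02, 0.96, 0.02)] * 40
             + [(0.01, 0.02, 0.97)] * 10)
uncertain = [(0.50, 0.45, 0.05)] * 5 + confident[5:]   # 5 calls set to missing
rare = [(0.99, 0.01, 0.00)] * 198 + [(0.02, 0.97, 0.01)] * 2

common_ok = passes_filters([best_guess(p) for p in confident])
too_missing = passes_filters([best_guess(p) for p in uncertain])
too_rare = passes_filters([best_guess(p) for p in rare])
```

In this toy example the first variant is kept, the second fails on missingness (2.5% of calls are uncertain), and the third fails on MAF (0.5%).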

PCA

Only SNPs with high imputation quality (INFO >0.8) were considered for principal-component analysis. We computed the first 20 principal components using plink with the --pca flag for autosomal SNPs MAF >0.05 and individual missingness <0.05.

Simulation setup

We used two independent simulation strategies for two African datasets: AGVP and AWI-Gen. The choice of simulation strategy was informed by the sample size of each dataset. We used the simulation strategy previously used by Scutari et al.11 for AGVP, and the infinitesimal model for simulations in AWI-Gen. To test the PRS prediction accuracy within and across African populations, we simulated four quantitative traits while varying heritabilities (h2 = 0.1, 0.2, 0.4 and 0.8) for both AGVP and AWI-Gen as follows.

AGVP

We randomly assigned an effect size to 5, 20, 100, 2,000, 10,000, and 50,000 causal variants, respectively. We then calculated an individual’s “true” polygenic risk as the sum of all causal effects using the --score flag in PLINK v1.07B.44 True polygenic scores were standardized to a mean of zero and standard deviation of 1. To account for the contribution of environmental risk factors, we assigned environmental effects from a normal random distribution (mean = 0 and SD = 1). The phenotype was generated according to its heritability as the weighted sum of the true polygenic risk and a random environmental effect as below:

$$\text{phenotype} = \sqrt{h^2} \times \text{true polygenic risk} + \sqrt{1 - h^2} \times \text{environmental effect}$$
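A short simulation makes the weighting concrete. The sketch below assumes square-root weights, sqrt(h2) on the standardized polygenic risk and sqrt(1 - h2) on a standard normal environmental effect, because those weights make the genetic component explain a fraction h2 of the phenotypic variance. The function names, the seed, and the stand-in "true" scores are hypothetical; in the study the true scores came from PLINK.

```python
import math
import random


def standardize(values):
    """Rescale values to mean 0 and standard deviation 1."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]


def simulate_phenotypes(true_scores, h2, rng):
    """Weighted sum of standardized polygenic risk and N(0, 1) environment."""
    g = standardize(true_scores)
    e = [rng.gauss(0.0, 1.0) for _ in g]
    return [math.sqrt(h2) * gi + math.sqrt(1.0 - h2) * ei
            for gi, ei in zip(g, e)]


def squared_correlation(x, y):
    """Squared Pearson correlation between two equal-length lists."""
    zx, zy = standardize(x), standardize(y)
    r = sum(a * b for a, b in zip(zx, zy)) / len(x)
    return r * r


rng = random.Random(42)
# Stand-in for true polygenic scores (hypothetical values).
true_scores = [rng.gauss(0.0, 1.0) for _ in range(20000)]
phenotypes = simulate_phenotypes(true_scores, h2=0.4, rng=rng)

# Fraction of phenotypic variance explained by the true polygenic risk.
realized_h2 = squared_correlation(true_scores, phenotypes)
```

With a large sample, `realized_h2` should land close to the requested heritability, which is a quick sanity check on any simulated trait.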

We then conducted GWAS for the simulated phenotypes by splitting the AGVP dataset into three groups broadly representing the three geographical areas from which samples were obtained: East (n = 589), West (n = 517), and South Africa (n = 186, Figure 1). To allow for the quantification of PRS prediction accuracy across the geographical regions, each group was further split into discovery and target cohorts. The size of the target cohorts was maintained at n = 186 across all groups, while the discovery cohort consisted of all remaining individuals (East n = 403, West n = 331, and no South Africans). We conducted a linear regression for all the simulated traits for the East and West discovery datasets, controlling for the first 20 principal components.

Figure 1. Simulation strategy overview

(A) We used AGVP for simulations in West, East, and South African populations that were grouped based on the United Nations geoscheme groupings. Each group was divided into discovery and target subgroups. GWAS discovery cohorts included East (n = 403) and West (n = 331) African individuals, which were independent of each target cohort (n = 186 individuals per region). South African individuals were excluded from the discovery population due to the limited total sample size (two populations and 186 individuals total).

(B) We used AWI-Gen for simulations in Burkina Faso (n = 1,703), Ghana (n = 1,661), Kenya (n = 1,701), and South Africa (n = 4,455). For these simulations we withheld 500 individuals from each of the groups, which were used as the target cohort. The GWAS discovery cohort included the 9,020 individuals who were not in the target cohort. Each figure represents roughly 500 individuals. BF, Burkina Faso; SA, South Africa.

AWI-Gen

We assigned genetic effects to variants based on their minor allele frequency. The effects were calculated based on the relationship between effect size and minor allele frequency as shown by Schoech et al.45 Each individual’s “true” polygenic risk was calculated in the same way as for AGVP, as were the environmental effects and the phenotypes. To conduct GWAS, we split AWI-Gen into discovery and target sets such that each discovery set had 9,020 samples and each target set had 500 (Figure 1). For each discovery-target split, we alternately withheld 500 samples from one of the four countries (Burkina Faso, Ghana, Kenya, and South Africa). We then conducted GWAS for each of the discovery datasets.

For each discovery cohort in AGVP and AWI-Gen, we obtained independent SNP sets by clumping SNPs from the corresponding summary statistics files that had an r2 value greater than 0.1 (using in-sample linkage disequilibrium [LD]) and were within 500 kb of each other in PLINK v1.07. The effect sizes from these SNP sets were used as weights to compute PRSs for all corresponding target datasets for a range of p values (5e-08, 1e-06, 1e-04, 1e-03, 1e-02, 0.05, 0.1, 0.2, 0.5, and all). PRS was calculated as the sum, across all included SNPs, of each SNP’s allele count multiplied by its effect size.
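The scoring step can be illustrated with a minimal sketch: for each p-value threshold, sum dosage times effect size over the clumped SNPs that pass the threshold. This is plain Python rather than the PLINK commands used in the study; the SNP identifiers, effect sizes, and dosages are hypothetical, and treating "all" as a threshold of 1.0 is an assumption of the sketch.

```python
# Thresholds listed in the text; "all" is represented here as 1.0.
P_THRESHOLDS = [5e-8, 1e-6, 1e-4, 1e-3, 1e-2, 0.05, 0.1, 0.2, 0.5, 1.0]


def polygenic_score(dosages, clumped_snps, p_threshold):
    """Sum of dosage * effect size over clumped SNPs with p <= p_threshold.

    dosages: {snp_id: effect-allele dosage in [0, 2]} for one individual.
    clumped_snps: list of (snp_id, effect_size, p_value) after LD clumping.
    """
    return sum(beta * dosages[snp]
               for snp, beta, p in clumped_snps
               if p <= p_threshold and snp in dosages)


# Hypothetical clumped summary statistics and one individual's dosages.
clumped = [("rs_a", 0.30, 1e-9), ("rs_b", -0.20, 5e-5), ("rs_c", 0.05, 0.3)]
dosages = {"rs_a": 2.0, "rs_b": 1.0, "rs_c": 0.5}

# One score per threshold; more SNPs enter as the threshold relaxes.
scores = {t: polygenic_score(dosages, clumped, t) for t in P_THRESHOLDS}
```

Stricter thresholds use fewer, better-supported SNPs, while looser thresholds add many small and noisier effects, which is why accuracy is compared across the whole range.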

Heritability estimation

For the first set of heritability estimation analyses, we relied on heritability estimates of 34 quantitative traits computed previously for the Ugandan GPC dataset.46 For the UK Biobank, we computed heritability estimates for the same traits using LD score regression with the default model (i.e., without any functional annotations)47 and using population-matched LD score references from European populations downloaded from the authors’ website (https://data.broadinstitute.org/alkesgroup/LDSCORE/). Due to the difference in study design and heritability estimation methods used for UK Biobank and Uganda GPC, we could not directly compare heritability estimates between the two cohorts. For more comparable estimates, we computed heritability estimates across the 34 quantitative traits in both the Ugandan GPC and UK Biobank using a randomized multi-component Haseman-Elston estimator (RHE-mc48) over unrelated individuals. This method offers improved power over summary statistic methods (e.g., LD score regression), which show very large standard errors with small sample sizes, with improved computational tractability when operating at biobank scale over GREML-based approaches.

To improve comparability across cohorts, we used parallel QC approaches in both cohorts after restricting to unrelated individuals (N = 2,234 in GPC). Namely, in UK Biobank, we filtered genotypes to those outside the MHC region that were defined with MAF ≥0.01 for which we did not observe significant deviation from Hardy-Weinberg equilibrium (p_hwe ≥ 1e-7) and only used genotypes that passed these criteria across all ancestry groups. For the Ugandan GPC analysis, we applied the same filters as above (i.e., variants outside MHC with MAF ≥0.01 and without significant deviation from Hardy-Weinberg equilibrium). We also restricted the analysis to SNPs with imputation INFO score >0.3 to match the approach taken for the GWAS conducted by Gurdasani et al.,46 resulting in 3,627,507 SNPs passing QC.

To account for differences in heritability as a function of LD and MAF, we performed multi-component analyses with a 4 × 2 grid of LD and MAF bins defined in Ugandan GPC and per-population in UK Biobank. LD scores in the Ugandan GPC were computed in LD score regression using all imputed SNPs with MAF >0.005 from unrelated samples, while those in the UK Biobank were computed using SNPs with INFO >0.8, MAC >20 with subsequent covariate correction for age, sex, age∗sex, age2, age2 ∗sex, and the first 20 genotype PCs in each population. LD score bins in both cohorts were computed as membership in quartiles of the LD score distribution. MAF bins in both cohorts were defined as MAF ≤0.05 and MAF >0.05.
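The 4 × 2 grid can be pictured as assigning each SNP to an LD score quartile and to one of two MAF bins. The sketch below is only an illustration of that binning with hypothetical SNP names and values; it uses the standard library's quantile function, and its exact quartile convention is an assumption rather than the one used by the authors.

```python
import statistics


def ld_quartile_cutpoints(ld_scores):
    """Three cut points that split the LD score distribution into quartiles."""
    return statistics.quantiles(ld_scores, n=4)


def assign_bin(ld_score, maf, cutpoints, maf_cut=0.05):
    """Return (LD quartile index 0-3, MAF bin index 0-1) for one SNP."""
    ld_bin = sum(ld_score > c for c in cutpoints)
    maf_bin = 0 if maf <= maf_cut else 1
    return ld_bin, maf_bin


# Hypothetical SNPs: (name, LD score, MAF).
snps = [
    ("snp1", 1.0, 0.02), ("snp2", 2.0, 0.30),
    ("snp3", 3.0, 0.04), ("snp4", 4.0, 0.12),
    ("snp5", 5.0, 0.01), ("snp6", 6.0, 0.45),
    ("snp7", 7.0, 0.05), ("snp8", 8.0, 0.30),
]

cutpoints = ld_quartile_cutpoints([ld for _, ld, _ in snps])
bins = {name: assign_bin(ld, maf, cutpoints) for name, ld, maf in snps}

# Number of SNPs falling in each LD quartile.
ld_counts = [sum(1 for b in bins.values() if b[0] == q) for q in range(4)]
```

Each of the eight resulting components then receives its own variance parameter in a multi-component heritability model.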

We included standard GWAS covariates as fixed-effects covariates for heritability estimation in both Ugandan GPC and UK Biobank, namely age, sex, age∗sex, age2, age2 ∗sex, and the first 20 genotype PCs in each cohort and in each population in the UK Biobank. We ran RHE-mc with 50 random vectors and 100 jackknife blocks in the Uganda GPC and among UK Biobank non-EUR populations; and used 10 random vectors with UK Biobank European samples due to high computational complexity.

Polygenic score calculation

Pruning and thresholding

All PRSs were calculated in plink2 or in hail using custom scripts. For pruning and thresholding approaches, all clumping was done in plink2 using an LD threshold of r2 = 0.1 and a window size of 500 kb with discovery cohort population-specific reference panels. We calculated PRS using plink2 with the --score and --q-score-range flags for AGVP simulations and DCHS. We wrote custom scripts in hail (http://hail.is) to calculate PRS in the Uganda GPC and UK Biobank data due to the larger sample sizes (see web resources). For imputed genotypes, we used SNP dosages in PRS calculations. We computed 10 PRSs for each analysis using the following p-value thresholds: 1, 0.5, 0.2, 0.1, 0.05, 0.01, 1e-3, 1e-4, 1e-6, 5e-8. The PRS that explained the most phenotypic variance is shown in most figures.

We calculated PRS accuracy for continuous traits with custom scripts in R (web resources). For AGVP simulations and DCHS (because all participants were mothers of a similar age), we included the first 10 PCs as covariates when computing the partial R2 specifically attributable to the PRS. For Uganda GPC data, we included age, sex, and the first 10 PCs when computing partial R2 of the PRS. For consistency with the GWAS that were run in UK Biobank previously49 and here with a holdout target set, we included age, sex, age2, age∗sex, age2∗sex, and the first 10 PCs as covariates when computing the PRS partial R2. (The UK Biobank European GWAS included 20 PCs, but fewer were used here due to the particularly small sample sizes of some other target ancestry groups, Table S5, coupled with minimal population structure observed in PCs lower than PC10.) As described in Table S5, we included up to 351,194 European ancestry participants in a GWAS, withholding up to 9,947 European ancestry participants as a target cohort as well as up to the following numbers of participants with corresponding ancestries: 8,426 African, 1,099 Admixed American, 10,084 Central/South Asian, 2,753 East Asian, and 1,553 Middle Eastern individuals. Other participants in the UK Biobank who were not included in these analyses were either second-degree or closer relatives of participants included in the analysis or were ancestry outliers.

PRS accuracy evaluation (incremental R2)

To evaluate prediction accuracy, we calculated incremental R2. Specifically, we compared two models:

H0: Phenotype ∼ covariates.

H1: Phenotype ∼ PRS + covariates.

The incremental R2 calculates the change in R2 between H1 and H0, indicating the change in model accuracy attributable to the PRS. We used adjusted R2, which ensures that the model that includes PRS does not outperform the model without PRS simply because more terms were included. All error bars show 95% confidence intervals calculated from bootstrapping. Specifically, for each iteration of 100 bootstrap replicates, we resampled with replacement each individual’s full set of phenotypes, covariates, and polygenic scores, then ran the same models described above. The 95% confidence interval was determined by the 2.5% and 97.5% quantiles.
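The comparison of the two models and the bootstrap interval can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' R scripts: it fits ordinary least squares through the normal equations, uses the standard adjusted R2 formula, and takes percentile bounds from 100 resamples. All function names and the simulated data (one covariate standing in for age, effect sizes of 0.5) are hypothetical.

```python
import random
import statistics


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def adjusted_r2(y, predictors):
    """Adjusted R2 of an OLS fit of y on an intercept plus predictor columns."""
    n = len(y)
    cols = [[1.0] * n] + [list(c) for c in predictors]
    k = len(cols) - 1
    xtx = [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]
    xty = [sum(a * b for a, b in zip(ci, y)) for ci in cols]
    beta = solve(xtx, xty)
    fitted = [sum(beta[j] * cols[j][i] for j in range(k + 1)) for i in range(n)]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - (ss_res / ss_tot) * (n - 1) / (n - k - 1)


def incremental_r2(y, prs, covariates):
    """Adjusted R2 of H1 (PRS + covariates) minus that of H0 (covariates)."""
    return adjusted_r2(y, [prs] + covariates) - adjusted_r2(y, covariates)


def bootstrap_ci(y, prs, covariates, rng, n_boot=100):
    """Percentile 95% interval from resampling individuals with replacement."""
    n = len(y)
    estimates = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        estimates.append(incremental_r2(
            [y[i] for i in idx], [prs[i] for i in idx],
            [[c[i] for i in idx] for c in covariates]))
    cuts = statistics.quantiles(estimates, n=40)  # 2.5%, 5%, ..., 97.5%
    return cuts[0], cuts[-1]


# Hypothetical data: one covariate and a PRS with a true effect on the trait.
rng = random.Random(7)
n = 800
age = [rng.gauss(0.0, 1.0) for _ in range(n)]
prs = [rng.gauss(0.0, 1.0) for _ in range(n)]
y = [0.5 * p + 0.5 * a + rng.gauss(0.0, 1.0) for p, a in zip(prs, age)]

point = incremental_r2(y, prs, [age])
lower, upper = bootstrap_ci(y, prs, [age], rng)
```

Resampling whole individuals keeps each person's phenotype, covariates, and score together, so the interval reflects sampling variability in the full model comparison rather than in the PRS alone.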

Relative comparisons of PRS across populations

We compared PRS accuracy across populations by computing relative accuracies (RAs) with respect to a baseline European ancestry PRS. For pruning and thresholding PRS, we computed RA as the ratio between the maximum R2 in the population of interest versus the maximum R2 in the European baseline comparison (i.e., for the same phenotype). Across traits, we computed median absolute deviation (MAD), i.e., the median of the absolute deviations from the median.
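Both summary quantities are simple to compute, as the following sketch shows. The function names and all R2 and RA values are hypothetical and serve only to illustrate the arithmetic.

```python
import statistics


def relative_accuracy(r2_target, r2_european):
    """Ratio of the best R2 across thresholds in the target population
    to the best R2 across thresholds in the European baseline."""
    return max(r2_target) / max(r2_european)


def median_absolute_deviation(values):
    """Median of the absolute deviations from the median."""
    med = statistics.median(values)
    return statistics.median(abs(v - med) for v in values)


# Hypothetical R2 values across p-value thresholds for one trait.
ra_one_trait = relative_accuracy([0.010, 0.020, 0.015], [0.08, 0.10, 0.09])

# Hypothetical relative accuracies for five traits in one population.
ra_across_traits = [0.2, 0.3, 0.5, 0.1, 0.4]
mad = median_absolute_deviation(ra_across_traits)
```

Here the target population reaches one fifth of the European accuracy for the example trait, and the MAD of 0.1 summarizes how much RA varies from trait to trait without being dominated by outlying traits.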

PRS-CS versus pruning and thresholding

We compared the prediction accuracy of the pruning and thresholding method to that of PRS-CS, a Bayesian method that has been shown to improve PRS prediction accuracy across diverse populations.50 To do this, we applied PRS-CS-auto to generate scores for the same 34 quantitative traits that were evaluated using pruning and thresholding. We maintained the same discovery cohort from the UK Biobank, i.e., 351,194 European ancestry individuals, and evaluated prediction accuracy in two target cohorts: (1) continental ancestry groups from the UK Biobank comprising a holdout sample of 9,947 European ancestry individuals as well as ∼24,000 non-European ancestry individuals, and (2) the Ugandan GPC. We used European ancestry individuals from the UK Biobank as the reference panel. Relative accuracy was calculated as the ratio between the maximum R2 for pruning and thresholding or R2 for PRS-CS versus the maximum R2 in the European population for each trait.

Observed versus predicted PRS accuracy

To evaluate the efficacy of PRS accuracy given the varying genetic architecture of the quantitative traits we assessed here, we compared the PRS accuracy we observed with the accuracy that would be predicted from theoretical models.51,52 We calculated the predicted PRS accuracy for the European ancestry individuals from the UK Biobank according to the Daetwyler equation below, where E(R2) is the predicted accuracy, h2 (written $h_M^2$ in the equation) is the heritability estimate, M is the number of independent SNPs (i.e., total number of trait-associated SNPs from LD clumping with p-value <1), and N is the sample size (∼350,000 individuals).

$$E(R^2) \approx \frac{h_M^2}{1 + \dfrac{M}{N h_M^2}}$$
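As a worked illustration, the predicted accuracy can be written as h2 / (1 + M / (N × h2)), which is the form of the Daetwyler equation assumed in this sketch. The function name and the example inputs (h2 = 0.5, M = 50,000) are hypothetical; only N ≈ 350,000 is taken from the text.

```python
def expected_r2(h2, m, n):
    """Predicted PRS accuracy: h2 / (1 + M / (N * h2)).

    h2: heritability estimate; m: number of independent SNPs;
    n: discovery sample size.
    """
    return h2 / (1.0 + m / (n * h2))


# Hypothetical trait with h2 = 0.5 and 50,000 independent SNPs.
predicted = expected_r2(h2=0.5, m=50_000, n=350_000)

# The prediction approaches h2 as the discovery sample grows.
predicted_large_n = expected_r2(h2=0.5, m=50_000, n=350_000_000)
```

For these inputs the predicted R2 is about 0.39, below the heritability of 0.5, because finite sample size leaves estimation error in the SNP effects.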

Meta-analysis

We used plink2 to conduct inverse variance-weighted meta-analysis across GWAS summary statistics with the --meta-analysis option.
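For a single variant, fixed-effect inverse variance weighting reduces to a weighted mean of the per-cohort effect estimates. The sketch below shows that arithmetic only; it is not the PLINK implementation, and the function name and the effect sizes and standard errors are hypothetical.

```python
import math


def ivw_meta(estimates):
    """Fixed-effect inverse variance-weighted meta-analysis.

    estimates: list of (beta, standard_error) pairs, one per cohort.
    Returns the pooled (beta, standard_error).
    """
    weights = [1.0 / se ** 2 for _, se in estimates]
    pooled_beta = sum(w * b for w, (b, _) in zip(weights, estimates)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return pooled_beta, pooled_se


# Hypothetical per-cohort estimates for one SNP: (beta, SE).
pooled_beta, pooled_se = ivw_meta([(0.10, 0.02), (0.20, 0.04)])
```

The cohort with the smaller standard error dominates the pooled estimate (0.12 here), which is why a much larger European cohort can outweigh smaller, more diverse cohorts in a meta-analysis.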

LD reference panels and clumping

All PRS calculations required an LD panel for clumping. Our analyses used in-sample LD where feasible and reference panel data as a proxy with ancestry matching from the 1000 Genomes Project phase 3 data when individual-level data were unavailable. We weighted the ancestral representation of each population per trait matching at the continental level. We matched individuals as follows.

| Cohort | 1000 Genomes phase 3 reference data |
| --- | --- |
| BBJ | East Asian (EAS) |
| UK Biobank | European (EUR) |
| Uganda Genome Resource (UGR) | African (AFR) |
| PAGE | Proportional weighting of AFR, EAS, AMR (depending on trait, see Table S6 description for more detail) |

We then used the maximal number of individuals available when weighting proportionally to construct this reference panel. For example, in the meta-analysis of height across the UK Biobank, Biobank Japan (BBJ), and Population Architecture Using Genomics and Epidemiology (PAGE) cohorts, UK Biobank has the largest sample size in the discovery cohort (n = 350,353), so all Europeans from 1000 Genomes were included in the reference panel (n = 503), then a random sampling of EAS, AFR, and AMR individuals was included proportionally to the overall diversity of the discovery cohorts in the meta-analysis.

Results

Our study uses both simulation-based and empirical approaches to evaluate the generalizability of PRS across diverse African ancestry populations. Abbreviations are in Table S1, and a summary of datasets used in this study is shown in Table S2.

Simulated generalizability within and across diverse African populations

We used two separate simulation strategies for AGVP and AWI-Gen depending on their sample sizes (Figure 1). Given the limited sample size of the AGVP dataset, we opted to use the strategy previously used by Scutari et al.11 We simulated several quantitative traits with varying numbers of causal variants (n = 5; 20; 100; 2,000; 10,000; and 50,000) and heritability rates (h2 = 0.1, 0.2, 0.4, and 0.8), then conducted independent GWASs for each scenario in East and West African ancestry populations (materials and methods, Figures S1–S4). We calculated the prediction accuracy for PRSs derived from the GWAS summary statistics considering 10 different p-value thresholds within and across independent target populations from East, West, and South Africa. In general, ancestry-matched results with the sparsest and most heritable genetic architectures produced the highest prediction accuracy. As expected, prediction accuracy was highest with trait h2 = 0.8 and fewer than 100 causal variants (Figure 2A), as indicated by the highest R2 and the identification of genome-wide significant associations. Conversely, when the number of causal variants exceeded 100, prediction accuracy was negligible (Figure S4), as evidenced by no variants meeting the genome-wide significance threshold (i.e., p < 5e-08). Prediction accuracy was highest with 5 and 20 causal variants (Figure 2A). The within-ancestry prediction accuracies at p-value threshold < 5e-08 and five causal variants were as follows: R2 = 0.86, p = 1.74e-74 for East discovery - East target scores; R2 = 0.85, p = 9.9e-74 for West discovery - West target scores. We observed lower prediction accuracy with ancestry-mismatched discovery versus target cohorts at five causal variants and p-value threshold = 1e-6 (R2 = 0.66, p = 1.79e-42 for West discovery - West target scores, compared with R2 = 0.53, p = 1.29e-74 for East discovery - West target scores). The scores in the South target sample were comparable when using East- or West-derived summary statistics (R2 = 0.86, p = 5.19e-84 for West-derived summary statistics, and R2 = 0.86, p = 1.35e-83 for East-derived summary statistics).

Figure 2. Simulated GWAS and polygenic scores indicate differential prediction accuracy across diverse regions of Africa

(A) Predictive accuracy of the simulated quantitative trait in AGVP at the heritability of 0.8. The predictive accuracy was calculated for six categories of causal variants for the West and East discovery cohorts, across 10 p-value thresholds. Only the top three categories are shown here, the rest can be found in Figures S1–S4.

(B) Predictive accuracy of simulated quantitative traits in AWI-Gen for various trait heritability rates across 10 p-value thresholds. The error bars represent the lower and upper limits of 95% confidence interval.

For AWI-Gen, we used the commonly used infinitesimal simulation strategy for quantitative traits. We simulated quantitative traits by assigning genetic variant effects based on their minor allele frequency in accordance with the relationship between effects and minor allele frequency established by Schoech et al.45 We varied the trait heritability rates similar to the analysis done with AGVP and conducted GWASs for each trait (materials and methods, Figure 1). We calculated the prediction accuracy for PRS derived from the GWAS summary statistics considering 10 different p-value thresholds, as before, across independent target populations from Burkina Faso, Ghana, Kenya, and South Africa. Across heritability rates and target datasets, the PRS prediction accuracy was low and had confidence intervals that included zero (Figure 2B). The lack of PRS transferability in AGVP and AWI-Gen for traits with a polygenic architecture using two independent simulation strategies highlights that large-scale genetic studies in African populations are required to accurately predict phenotypes using genetic data and facilitate a better understanding of how PRS might transfer across African populations given the genetic diversity on the continent.
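To illustrate what a frequency-dependent infinitesimal architecture can look like, the sketch below draws per-allele effects whose variance scales as [2p(1 − p)]^α, a common parameterization of the relationship between effect size and minor allele frequency. The value α = −0.38, the standardized phenotype variance of 1, and the function name are assumptions made for illustration; the exact parameterization used in the study is given in materials and methods and is not reproduced here.

```python
import random


def maf_dependent_effects(mafs, h2, alpha, rng):
    """Draw per-allele effects whose variance scales as [2p(1-p)]^alpha.

    Effects are rescaled so that the expected genetic variance equals h2,
    assuming a phenotype standardized to unit variance.
    """
    raw = [rng.gauss(0.0, (2 * p * (1 - p)) ** (alpha / 2)) for p in mafs]
    # Expected genetic variance under Hardy-Weinberg and linkage equilibrium.
    var_g = sum(2 * p * (1 - p) * b * b for p, b in zip(mafs, raw))
    scale = (h2 / var_g) ** 0.5
    return [b * scale for b in raw]


rng = random.Random(1)
# Hypothetical minor allele frequencies for 1,000 variants.
mafs = [rng.uniform(0.01, 0.5) for _ in range(1000)]
effects = maf_dependent_effects(mafs, h2=0.4, alpha=-0.38, rng=rng)
total_var = sum(2 * p * (1 - p) * b * b for p, b in zip(mafs, effects))
print(f"genetic variance explained: {total_var:.3f}")
```

A negative α gives rarer variants larger per-allele effects on average, while every variant still contributes only a small fraction of the total genetic variance.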

PRS accuracies in South African populations

While our simulations have shown that PRSs generalize poorly across Africa due to substantial genetic diversity and differences across the continent, there is also considerable genetic and environmental diversity within regions and countries. We quantified PRS accuracy for a range of measured phenotypes in mothers genotyped in the DCHS cohort in South Africa, including several sociodemographic, physical/biomedical, and psychosocial risk traits (Table S3). The DCHS cohort comprises participants from multiple ancestry groups, including an admixed population with ancestry from multiple continents as well as a population with almost exclusively African ancestry. These ancestry groups correlate with self-reported “Mixed” and “Black/African” ethnicities, respectively (Figure S5). We computed PRS for maternal height, depression, psychological distress, alcohol consumption, and smoking in DCHS overall, by ethnic group, and by ancestry within the Mixed ethnic group (materials and methods).

Of the phenotypes we attempted to predict genetically, only height was significantly predicted (Figure S6). We predicted height more accurately in the Mixed versus Black/African ethnic groups (R2 = 0.099, 95% bootstrapped CI = 0.012–0.18, p = 1.5e-7 versus R2 = 0.021, 95% CI = −0.031 to 0.043, p = 5.27e-3, respectively). We also expect that PRS accuracy increases with decreasing African ancestry within the Mixed ethnic group, as has been shown previously in admixed African populations53; we find suggestive evidence consistent with this trend when partitioning the Mixed group into two bins along PC1 (R2 = 0.091, 95% CI = −0.04 to 0.17, p = 6.4e-4 in the lower half of PC1 with more African ancestry versus R2 = 0.12, 95% CI = −9.0e-4 to 0.21, p = 5.7e-5 with more out-of-Africa ancestry), although small sample sizes limit definitive comparisons (n = 137 in each PC1 bin). Our results are consistent with variable prediction accuracy among diverse African ancestry groups within South Africa and with non-significant prediction in African populations for all but the most heritable and accurately predicted traits elsewhere. These findings carry several caveats. The sample used for these analyses is relatively small and does not represent the larger South African population. Some of the traits are also greatly impacted by pregnancy; for example, pregnant women are less likely to drink and smoke than the general public. In addition, in contrast to the discovery datasets that include males and females, DCHS is a female-only cohort that has both a lower and narrower age range, which could impact PRS accuracy for the traits where age plays a key role.
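The bootstrapped confidence intervals reported above can be illustrated with a simple percentile bootstrap of the squared correlation between a PRS and a phenotype. This is a sketch under stated assumptions: the data are simulated, the function names are hypothetical, and the percentile interval of a plain R2 is bounded below by zero, whereas the intervals reported in the text extend below zero, so the study evidently used a different accuracy estimator or interval construction (see materials and methods).

```python
import random


def r_squared(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy * sxy / (sxx * syy)


def bootstrap_r2_ci(prs, pheno, n_boot, rng, level=0.95):
    """Percentile bootstrap interval for R2, resampling individuals with replacement."""
    n = len(prs)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(r_squared([prs[i] for i in idx], [pheno[i] for i in idx]))
    stats.sort()
    lo = stats[int((1 - level) / 2 * n_boot)]
    hi = stats[min(n_boot - 1, int((1 + level) / 2 * n_boot))]
    return lo, hi


rng = random.Random(2)
# Simulated example: the score explains roughly half of the phenotypic variance.
prs = [rng.gauss(0.0, 1.0) for _ in range(300)]
pheno = [s + rng.gauss(0.0, 1.0) for s in prs]
point = r_squared(prs, pheno)
lo, hi = bootstrap_r2_ci(prs, pheno, n_boot=500, rng=rng)
print(f"R2 = {point:.3f}, 95% CI = {lo:.3f} to {hi:.3f}")
```

Interval width grows quickly as the sample shrinks, which is consistent with the caution about the small PC1 bins above.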

Variable phenotypic and genetic similarities across the Uganda GPC and UK Biobank

Lower phenotypic correlations in the Uganda GPC suggest higher contributing environmental effects

We next investigated phenotypic similarities within and across the Uganda GPC and UK Biobank participants because these are two of the largest cohorts with dozens of traits measured in African ancestry individuals. We first considered overall cohort differences between these cohorts: the Uganda GPC enrolled participants using a house-to-house study design and generated genetic data on 5,000 adults from rural villages in southwestern Uganda,39 while the UK Biobank enrolled 500,000 people aged between 40 and 69 years in 2006–2010 from across the country (materials and methods41). Previous studies have reported higher rates of infectious diseases (e.g., HIV, hepatitis B and C) in the Uganda GPC than would be expected in the UK Biobank.39 There are many additional potential environmental explanations for mean shifts in phenotypes, such as dietary, food security, and age differences contributing to considerable BMI differences across cohorts (μ = 21.3 and σ = 3.8 in Uganda GPC versus μ = 27.4 and σ = 4.8 in the UK Biobank, p < 2.2e-16). To quantify comparisons while controlling for demographic differences for each of the 34 quantitative traits measured in both cohorts, we first mean centered each phenotype and regressed out the effects of age and sex within each cohort. Next, we compared the distributions and variances of each phenotype across cohorts via Kolmogorov-Smirnov (K-S) and F-tests, respectively (Table S4). Given the large sample sizes, all K-S tests indicated significant differences, with several phenotypes showing distributional and variance differences of considerable magnitude (Figure S7 and Table S4, e.g., Bilirubin, BASO, HbA1c, ALP, EOS, TG, and NEU).
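The adjustment and comparison steps described above can be sketched as follows: residualize a phenotype on age and sex within each cohort, then compare residual distributions with a two-sample Kolmogorov-Smirnov statistic and residual variances with an F ratio. The cohort data here are simulated, loosely echoing the BMI standard deviations quoted above, and the function names are hypothetical. Only the test statistics are computed; p values would come from the corresponding reference distributions in a statistics library.

```python
import random
import statistics


def residualize(y, age, sex):
    """Mean center y and regress out age and sex by ordinary least squares."""
    n = len(y)
    my, ma, ms = sum(y) / n, sum(age) / n, sum(sex) / n
    yc = [v - my for v in y]
    ac = [v - ma for v in age]
    sc = [v - ms for v in sex]
    s_aa = sum(a * a for a in ac)
    s_ss = sum(s * s for s in sc)
    s_as = sum(a * s for a, s in zip(ac, sc))
    s_ay = sum(a * v for a, v in zip(ac, yc))
    s_sy = sum(s * v for s, v in zip(sc, yc))
    # Closed-form solution of the 2x2 normal equations.
    det = s_aa * s_ss - s_as ** 2
    b_age = (s_ay * s_ss - s_sy * s_as) / det
    b_sex = (s_sy * s_aa - s_ay * s_as) / det
    return [v - b_age * a - b_sex * s for v, a, s in zip(yc, ac, sc)]


def ks_statistic(x, y):
    """Two-sample K-S statistic: largest gap between empirical CDFs."""
    xs, ys = sorted(x), sorted(y)
    nx, ny = len(xs), len(ys)
    i = j = 0
    d = 0.0
    while i < nx and j < ny:
        v = min(xs[i], ys[j])
        while i < nx and xs[i] <= v:
            i += 1
        while j < ny and ys[j] <= v:
            j += 1
        d = max(d, abs(i / nx - j / ny))
    return d


def simulate_cohort(n, sd, rng):
    """Hypothetical cohort: phenotype depends on age and sex plus noise."""
    age = [rng.uniform(40, 70) for _ in range(n)]
    sex = [rng.randrange(2) for _ in range(n)]
    y = [0.05 * a + 1.0 * s + rng.gauss(0.0, sd) for a, s in zip(age, sex)]
    return y, age, sex


rng = random.Random(3)
y1, age1, sex1 = simulate_cohort(500, 3.8, rng)
y2, age2, sex2 = simulate_cohort(500, 4.8, rng)
res1 = residualize(y1, age1, sex1)
res2 = residualize(y2, age2, sex2)
f_ratio = statistics.variance(res2) / statistics.variance(res1)
print(f"K-S D = {ks_statistic(res1, res2):.3f}, variance ratio F = {f_ratio:.2f}")
```

Because both residual vectors are centered at zero, the K-S statistic in this setting reflects differences in spread and shape rather than in cohort means.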

We next analyzed how similar the relationships are between phenotypes across datasets. Similar trends emerge overall, with distances across variance-covariance matrices for these cohorts showing evidence of significant correlation (Mantel test Z statistic = 0.73, p < 1e-4). The correlations among phenotypes are slightly higher overall in the Uganda GPC than in UK Biobank, both among related and unrelated individuals (Figures 3B and S8). These findings are expected because of shared genetics and/or shared household environments contributing to more similar phenotypes.54 More specifically, we see consistent correlations among combinations of phenotypes including SBP and DBP; RBC, Hb, and HCT; Cholesterol and LDL; WC, BMI, WT, and HC; MCHC, MCH, and MCV; GGT, ALT, AST, and ALP; and MONO, NEU, and WBC with high overall correlations across these datasets for these traits (Figures 3A and 3B, see abbreviations in Table S1). Some pairs of traits, however, have significantly different correlations across datasets. The largest difference in phenotypic correlations across datasets is between ALP and WT (ρ = 0.11, p < 2.2e-16 in UK Biobank versus ρ = −0.36, p < 2.2e-16 in Uganda GPC).
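A Mantel test of the kind reported above compares two square matrices by correlating their off-diagonal entries and assessing significance by jointly permuting the rows and columns of one matrix. The sketch below uses the standardized (correlation) form of the statistic with simulated distance matrices; the function names, the matrix construction, and the number of permutations are assumptions for illustration and may differ from the implementation used in the study.

```python
import math
import random


def upper_triangle(m):
    """Flatten the strict upper triangle of a square matrix."""
    n = len(m)
    return [m[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def mantel_test(a, b, n_perm, rng):
    """Return the matrix correlation and a one-sided permutation p value."""
    n = len(a)
    a_flat = upper_triangle(a)
    observed = pearson(a_flat, upper_triangle(b))
    hits = 0
    for _ in range(n_perm):
        perm = list(range(n))
        rng.shuffle(perm)
        # Permute rows and columns of b together to preserve its structure.
        permuted = [[b[perm[i]][perm[j]] for j in range(n)] for i in range(n)]
        if pearson(a_flat, upper_triangle(permuted)) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


rng = random.Random(4)
# Hypothetical example: distances among 8 traits in two cohorts,
# where the second cohort is a noisy copy of the first.
points_a = [rng.random() for _ in range(8)]
points_b = [p + rng.gauss(0.0, 0.05) for p in points_a]
dist_a = [[abs(p - q) for q in points_a] for p in points_a]
dist_b = [[abs(p - q) for q in points_b] for p in points_b]
r_obs, p_value = mantel_test(dist_a, dist_b, n_perm=999, rng=rng)
print(f"Mantel r = {r_obs:.2f}, p = {p_value:.3f}")
```

The smallest attainable p value is 1 / (n_perm + 1), so a report of p < 1e-4 implies at least 10,000 permutations under this construction.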

Figure 3.

Phenotype correlations among 33 quantitative traits measured in the Uganda GPC data and the UK Biobank

(A) Phenotypic correlations measured in traits in the Uganda GPC among unrelated individuals.

(B) Phenotypic correlations in the unrelated UK Biobank European ancestry individuals. (A and B) Phenotypes were mean centered and adjusted for age and sex within each cohort prior to correlation analysis. The order of each phenotype correlation is determined by hierarchical clustering in the Uganda GPC.

Our next goal was to compare trait heritability estimates in the UK Biobank versus Uganda GPC data (materials and methods); however, the sample size and study design differences between these cohorts limited comparability using standard scalable approaches. Specifically, the household design of Uganda GPC included smaller sample sizes with more relatives in which family-based heritability estimates are most appropriate, whereas the large sample size and volunteer design in the UK Biobank make SNP-based heritability estimates from unrelated individuals more appropriate. Figure S9 compares heritability estimates across traits in the UK Biobank versus Uganda GPC using these disparate approaches.46 As expected from the differences in the methods, study designs, and sample sizes, we find higher but noisier estimates in Uganda GPC for most traits, consistent with expectation from family-based versus unrelated heritability estimates across these two studies. While all of these factors fundamentally limit comparability of heritability estimates across these cohorts, we also estimated heritability in unrelated individuals from both cohorts with consistent methodology, using multi-component Haseman-Elston regression implemented in RHE-mc, to improve comparability.48 These results showed higher heritability estimates in the Uganda GPC dataset that were not significantly correlated with heritability estimates from any ancestry group in the Pan-UK Biobank Project, consistent with a wide range of differences influencing these phenotypes across cohorts (Table S7, Figure S10). With these heritability estimates, we also compared observed versus predicted PRS accuracy and find that predicted R2 tends to be higher than observed R2 (Figure S11).

African genetic risk predictions from European ancestry GWAS data are remarkably inaccurate

To understand baseline trans-ancestry PRS accuracy using a typical approach, we predicted 32 traits in the Uganda GPC using GWAS summary statistics from the UK Biobank European ancestry individuals. While several traits were significantly predicted across ancestries, prediction accuracy was low for most traits (Figure S12); the most accurate PRS was for MPV (R2 = 0.036, 95% CI = 0.0069–0.063, p = 5.73e-7), while the average variance explained across all traits was less than 1% (mean R2 = 0.007). To assess the relative effects of ancestry versus cohort differences on decreases in prediction accuracy across populations, we next withheld 10,000 European ancestry individuals from UK Biobank for use as a target cohort, reran all GWASs, then used individuals with diverse continental ancestries in the UK Biobank as target populations (EUR = Europeans withheld from the GWAS, AMR = admixed American, MID = Middle Eastern, CSA = Central/South Asian, EAS = East Asian, and AFR = African, Figure S13), subcontinental African ancestries in the UK Biobank (Ethiopian, Admixed, South, East, West African ancestries, Figure S14), as well as the Uganda GPC (Figure 4A, Table S5).

Figure 4.

PRS accuracy and corresponding genetic variant contributions for up to 34 traits within and across diverse ancestries

(A) PRS accuracy relative to European ancestry individuals in diverse target ancestries. Discovery data consisted of GWAS summary statistics from UK Biobank (UKB) European ancestry data. Target data consisted of globally diverse continental ancestries (including withheld European target individuals) and regional African ancestry participants from UKB, or unrelated individuals from the Uganda GPC cohort. Traits were filtered to those with a 95% confidence interval range in PRS accuracy <0.08.

(B) PRS accuracy from a homogeneous versus multi-ancestry discovery dataset. GWAS discovery data consisted of summary statistics from UKB European ancestry data only or from the meta-analysis of UKB, BioBank Japan (BBJ), and Population Architecture using Genomics and Epidemiology (PAGE). Target populations are from the UKB. Lines connect the 10 traits available in both discovery cohorts to indicate how accuracy changed for the same trait in the UKB only versus meta-analyzed discovery data, while half violin plots show the distribution across all phenotypes in each discovery cohort. When lines are missing, the trait is absent in PAGE. Trait outliers are labeled in text and with solid lines. (A and B) Relative PRS accuracies are compared to the maximum for each trait in target samples withheld from discovery consisting of UKB European ancestry individuals. To simplify comparisons, only the polygenic scores with the highest prediction accuracy are shown here. Colors in these two panels correspond to the same continental ancestries.

(C and D) Trait-specific genetic outlier plots. QQ-like plots showing p values in UKB only versus multi-cohort meta-analysis of UKB, BBJ, and PAGE. The 10 regions that are genome-wide significant in both datasets and show the most significant differences are colored and labeled for (C) MCHC, and (D) WBC.

Among continental ancestries, we computed R2 and 95% CIs for each trait (Figure S15), then computed the median relative accuracy (RA) compared with Europeans and the median absolute deviation (MAD) across all traits (materials and methods). We predict these traits most accurately in EUR (RA = 1, MAD = 0), followed by AMR (RA = 0.784, MAD = 0.023), MID (RA = 0.643, MAD = 0.034), CSA (RA = 0.621, MAD = 0.031), EAS (RA = 0.477, MAD = 0.024), and AFR (RA = 0.219, MAD = 0.014) (Figure 4A). Because different PRS methodologies can improve overall prediction accuracy for some traits, we also compared our results using pruning and thresholding with PRS-CS; as described previously, different PRS methods may perform better for some phenotypes than others, but do not generally improve the relative loss of accuracy19,55 (Figure S16). We next compared prediction accuracy within African ancestry populations. Because some PRS accuracy estimates were noisy due to small sample sizes in UK Biobank Africans (especially Ethiopian and South African ancestry individuals, Table S5), we restricted analyses to those traits predicted with a 95% CI range <0.08. Among these traits, prediction was most accurate in individuals with Ethiopian ancestry (RA = 0.511, MAD = 0.059), followed by recently admixed individuals with West African and European ancestry (RA = 0.276, MAD = 0.016), East African ancestry (RA = 0.193, MAD = 0.023), West African ancestry (RA = 0.150, MAD = 0.012), and South African ancestry (RA = 0.083, MAD = 0.014) (Figure 4A). These results track with genetic distance as measured by FST (Table S8) and population history; the highest prediction accuracy identified in Ethiopians is expected given closer genetic proximity to European populations relative to other Africans due to back-to-Africa migrations influencing population structure there.17,56,57 The lowest prediction accuracy is in populations with southern African ancestry, consistent also with higher genetic divergence from European populations and more genetic diversity overall.16,18,58
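The summary statistics used above can be made concrete with a short sketch: for each trait, relative accuracy is the ratio of R2 in a target ancestry to R2 in the withheld European target, and the median and median absolute deviation are then taken across traits. The R2 values below are invented for illustration and are not results from this study; the function names are likewise hypothetical.

```python
import statistics


def relative_accuracy(r2_target, r2_eur):
    """Per-trait ratio of target-ancestry R2 to European-target R2."""
    return {
        trait: r2_target[trait] / r2_eur[trait]
        for trait in r2_target
        if r2_eur.get(trait, 0.0) > 0.0
    }


def median_and_mad(values):
    """Median and median absolute deviation of a collection of numbers."""
    values = list(values)
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    return med, mad


# Invented R2 values for three traits (not taken from the study).
r2_eur = {"height": 0.20, "BMI": 0.10, "MCV": 0.08}
r2_afr = {"height": 0.04, "BMI": 0.025, "MCV": 0.024}

ra = relative_accuracy(r2_afr, r2_eur)
median_ra, mad_ra = median_and_mad(ra.values())
print(f"median RA = {median_ra:.2f}, MAD = {mad_ra:.2f}")
```

Using the median and MAD rather than the mean and standard deviation keeps the summary robust to individual traits with unusually high or low transferability.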

Next, we quantified the proportion of loss of prediction accuracy (LOA, calculated as (1 − RA) × 100%) due to MAF and LD in the subcontinental African ancestry groups in the UK Biobank. As expected, LOA followed an inverse trend to prediction accuracy, i.e., LOA increased with genetic distance between the discovery and target cohort (Figure S17). LOA was lowest in the Ethiopian group (median LOA = 26.22) and highest in the West group (median LOA = 40.11).

Lower prediction accuracy across ancestries than across cohorts

To compare prediction accuracy among similar ancestry participants from different cohorts, we next computed PRSs for 34 traits using GWAS summary statistics from UK Biobank Europeans in two target populations: UK Biobank participants with East African ancestry versus Uganda GPC. As expected, prediction accuracy in these populations is very low across all traits in both cohorts and only slightly higher in the UK East African ancestry individuals than in the Uganda GPC individuals (mean R2 = 0.017, SD = 0.013 versus mean R2 = 0.012, SD = 0.010, respectively, Figure S18). Across traits, the differences in PRS accuracy across cohorts but within the same ancestry (Figure S18A) are much smaller than the differences across ancestries but within the UK Biobank (Figure 4A, left and middle panels), indicating that ancestry has a larger impact on genetic risk prediction than cross-cohort differences analyzed here. Smaller effects on genetic prediction accuracy differences across cohorts may be attributable to environmental differences, such as higher rates of malnutrition and infectious diseases previously reported in Uganda and in the GPC.39,59

Improved African genetic risk prediction accuracy with multi-ethnic GWAS summary statistics

We next maintained the target populations but varied the discovery cohort to determine how more diverse GWAS impacts PRS accuracy for these phenotypes in diverse populations. Specifically, we computed PRS accuracy in diverse target populations in the UK Biobank (Table S5) using one of two discovery cohorts: the UK Biobank European-only cohort versus diverse discovery cohorts combined via meta-analysis (Table S6). Meta-analyzed GWAS summary statistics come from several cohorts, including the UK Biobank, BBJ,60 PAGE Consortium,61 and Uganda Genome Resource (UGR).46 For each trait, discovery cohort, and target cohort combination, we normalized the PRS R2 values from the p-value threshold that explained the maximum phenotypic variance with respect to the prediction accuracy in the European target cohort using UK Biobank summary statistics only, then computed RAs as before.
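Selecting the p-value threshold that explains the most phenotypic variance, as described above, can be sketched by scoring target individuals at each threshold and keeping the threshold with the highest R2. The toy genotypes, effect sizes, and p values below are invented, and the sketch omits LD clumping and covariate adjustment, so it is a simplified illustration rather than the study's scoring pipeline.

```python
def r_squared(x, y):
    """Squared Pearson correlation; 0.0 if either input has no variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy * sxy / (sxx * syy)


def prs_at_threshold(genotypes, betas, pvalues, threshold):
    """Sum effect-weighted allele counts over variants with p below the threshold."""
    keep = [j for j, p in enumerate(pvalues) if p < threshold]
    return [sum(betas[j] * row[j] for j in keep) for row in genotypes]


def best_threshold(genotypes, betas, pvalues, phenotypes, thresholds):
    """Return (best threshold, its R2, R2 for every threshold)."""
    r2_by_threshold = {
        t: r_squared(prs_at_threshold(genotypes, betas, pvalues, t), phenotypes)
        for t in thresholds
    }
    best = max(r2_by_threshold, key=r2_by_threshold.get)
    return best, r2_by_threshold[best], r2_by_threshold


# Invented toy data: six individuals, three variants; only variant 0 is causal.
genotypes = [
    [0, 1, 2],
    [1, 0, 1],
    [2, 2, 0],
    [0, 2, 1],
    [1, 1, 0],
    [2, 0, 2],
]
betas = [0.5, 0.3, -0.2]
pvalues = [1e-9, 1e-3, 0.5]
phenotypes = [0.5 * row[0] for row in genotypes]
thresholds = [5e-8, 1e-2, 1.0]

best, best_r2, r2_by_threshold = best_threshold(
    genotypes, betas, pvalues, phenotypes, thresholds
)
print(f"best threshold = {best:g}, R2 = {best_r2:.2f}")
```

Dividing the best R2 in each target cohort by the best R2 in the withheld European target then gives the relative accuracy for that trait and discovery cohort. Because the threshold is chosen in the target sample itself, the resulting R2 is an optimistic estimate.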

We find that prediction accuracy improves the most across populations when using a discovery cohort consisting of GWAS summary statistics meta-analyzed across the UK Biobank, BBJ, and PAGE cohorts (Figure 4B). To determine whether the improvement in prediction accuracy was due to the increase in sample size or the diversification of the GWAS discovery, we compared prediction accuracy across three discovery cohorts: 100,000 EUR individuals from GWAS summary statistics acquired from Martin et al., 2019a10, 350,000 EUR individuals, and multi-ancestry GWAS comprising UK Biobank, BBJ, and PAGE (Figure S19). We observe that the increase in discovery sample size from 100,000 to 350,000 EUR improves prediction accuracy differentially across populations (Figure S19A). When comparing prediction accuracy across the three discovery cohorts, the results show that increasing the sample size improves prediction accuracy across all ancestries, but more so for the EUR population. The multi-ancestry discovery cohort seemed to improve prediction accuracy in the non-EUR populations more than the increase in sample size in general, with the largest improvement in prediction accuracy observed for BMI in AMR and EAS populations and MCHC and WBC for the AFR population (Figure S19B).

Surprisingly, meta-analyzing the UGR data with UK Biobank did not improve prediction accuracy for any population and most notably decreased accuracy in African ancestry target populations (discovery UK Biobank median RA = 0.22, UGR + UK Biobank median RA = 0.15, Figures 5 and S20). We hypothesize that (1) the relatively small sample size of UGR adds more noise than signal, as indicated by the large error bars, and (2) the difference in effect sizes between UGR and UK Biobank, particularly for the less polygenic traits such as LDL (Figure S21), contributes to the noise. When predicting traits using the UK Biobank, BBJ, and PAGE meta-analysis as a discovery cohort, we find that prediction accuracy increases most for the AMR, EAS, and AFR target populations, which more closely resemble the ancestry patterns of PAGE and BBJ (Figure 4B). The meta-analysis conflates two factors that are known to improve prediction accuracy: increase in sample size and diversity in the discovery cohort. To determine which of these factors drove the gains in prediction accuracy in Figure 4B, we compared the prediction accuracy from 100,000 EUR individuals from UK Biobank, downsampled to match the size of BBJ, to that of the 350,000 EUR individuals (for 17 overlapping phenotypes) and the multi-ancestry discovery (for five overlapping phenotypes). This comparison indicates that increasing the discovery sample size generally improves prediction accuracy; however, it is the inclusion of diverse samples in the discovery cohort that improves prediction accuracy most, especially for the populations represented in that cohort (Figure S19). These findings are consistent with ancestry-matched discovery data disproportionately improving prediction accuracy in the corresponding target population.4,8,31

Figure 5.

Relative PRS accuracy using the same target individuals and varying discovery cohorts

All relative comparisons are with respect to accuracy in withheld EUR when predicting with UKB European GWAS summary statistics alone as the discovery cohort.

Large-effect population-enriched genetic variants drive heterogeneity in polygenic score accuracy for blood panel traits

We find that PRS accuracy improvements from higher diversity in the discovery cohorts vary across traits, with the largest increases seen in MCHC and WBC particularly in AMR and AFR populations (Figure 4B). We searched for specific genetic loci that could explain this pattern by comparing the significance of genetic associations in UK Biobank alone versus the meta-analysis of UK Biobank, BBJ, and PAGE (Table S6). For MCHC and WBC in particular, the genetic variants contributing to these improved PRSs consist of several well-known population-enriched variants (Figures 4C and 4D). For example, genetic variants that disproportionately explain population-specific risk for MCHC include variants previously associated with hemoglobin concentration, including rs9399137 upstream of HBS1L and MYB in a study of sickle cell anemia (p = 5.24e-249 and β = 0.0783 in the meta-analysis),62 rs855791 in TMPRSS6 (p = 3.49e-241, β = 0.0692),63,64 and rs551118 upstream of PIEZO1 and CDT1 (p = 5.18e-100, β = −0.0451)65 (Table S9). Associations with WBC tend to show more population-enriched associations as shown in the meta-analysis (Figure 4D), including rs3936197 in MED24 (p = 5.18e-289, β = −0.0772), rs58650325 near the high affinity immunoglobulin (Ig)E receptor FCER1A that initiates the allergic response (p = 1.57e-163, β = −0.097, also close to OR10J3), and rs11533993 in CDK6 (p = 1.55e-84, β = −0.0799). Thus, genetic architecture and population genetic considerations are important to bear in mind when considering the generalizability of polygenic scores.

Discussion

PRSs have been proposed as genetic biomarkers for use in preventive medicine,66,67 but are currently limited by low accuracy across populations, especially in African ancestry populations.4,6 Through simulations and empirical work, this study has enabled unique insights into PRS transferability within and among diverse continental African populations as well as among African ancestry populations living in considerably different environments. Simulations will continue to play a crucial role in understanding and mitigating biases, but the small sample size of existing genetic studies in African populations has limited the simulation designs that are even possible with realistic population structure across the African continent in this study. The AGVP dataset used for the first set of simulations was too small to use a typical infinitesimal simulation strategy. As a result, we simulated phenotypes with variants with large effects, a scenario that is inconsistent with the genetic architecture of most polygenic traits. While the simulation done with the AWI-Gen dataset represents a scenario that is more realistic for complex traits, the findings re-emphasize that the paucity of large genetic samples in non-European ancestry populations limits simulation designs. Future studies could simulate new individuals from observed allele frequencies or from larger scale genetic datasets as they are made available. Despite these limitations, the simulation work done here provides a framework for simulation designs within the current sample size confines and what can be expected from these simulations in African populations.

We demonstrate looming challenges for applying current PRS in African ancestry populations: because relatively few genetic studies have been conducted in African populations, coupled with the lack of out-of-Africa population bottlenecks, PRS accuracy is low but widely variable. Differences in PRS accuracy across diverse African ancestries from different regions can be larger than across out-of-Africa continents. This is particularly problematic, as widely used algorithms that guide health decisions already have ingrained racial biases,68 warning of compounding challenges with implementation. We demonstrate that there are clear steps the field can take to work against these biases. Specifically, including ancestrally diverse populations in GWAS discovery cohorts at considerably larger sample sizes improves accuracy for all populations, and especially for underrepresented populations, more than conducting similarly sized studies with only European ancestry cohorts.

Another advantage of using GWASs from globally diverse populations to compute PRS is the routine inclusion of population-enriched variants. Clear examples such as African-enriched variants in APOL1 and G6PD have been shown to contribute to especially high risk of chronic kidney disease and to missed diabetes diagnosis, respectively.69,70 These examples highlight the importance of studying diverse populations to predict genetic risk of disease equitably by aggregating variants across the spectrum of allele frequencies and effect sizes in different populations. Relevant to the traits studied in genetic analyses here, hematological differences such as anemia are more common in lower income countries in Africa and in African ancestry populations elsewhere compared with European ancestry populations in high-income countries, particularly among older individuals. These hematological differences potentially arise in part due to genetic variation as well as the higher prevalence of infectious diseases and pathogens, poorer nutritional status, and altitude.71,72 Here, we show that variants influencing risk of beta thalassemia disproportionately increase PRS accuracy for hemoglobin variation, particularly in African ancestry populations. The inclusion of population-enriched variants in PRS could eliminate genetic justifications for race-based medicine, which problematically reinforces implicit racial biases by overemphasizing the link between genetics and race despite the fact that there is more genetic variation within than between ancestral populations.73 However, for this to be possible, genetic data would have to be available for all populations at scale, an ideal that remains distant.

In addition to reduced PRS accuracy with ancestral distance from GWAS cohorts, genetic nurture, social genetic, and environmental effects can also contribute to low portability of PRS across populations,23,74 with some interventions modulating health along PRS strata.75 In this study, however, ancestry appears to have a larger effect on portability than cohort differences overall. An important distinction when comparing the magnitude of these and other non-genetic effects in other studies is that the traits most accurately genetically predicted here were primarily anthropometric and blood panel traits. When analyzing traits with more sociodemographic influences in increasingly diverse populations, population stratification, confounding, and study design considerations are thornier issues.22,76,77 PRS accuracy comparisons across ancestrally similar but environmentally diverse populations are especially important for medically actionable traits. For example, particularly low PRS portability for triglycerides (TG) from European to the Uganda GPC resulted at least in part from effect size heterogeneity that has previously been connected to pleiotropic and gene × environment effects; specifically, most non-transferable genome-wide significant associations with TG showed pleiotropic associations with BMI in European but not Ugandan individuals.78

While PRSs currently have limited portability, increased diversity in genetic studies is already decreasing prediction accuracy gaps across populations.31,78,79 This is consistent with causal genetic effects tending to be similar across populations but with LD and allele frequency differences modifying marginal effect size estimates.4,7,8 This is also consistent with trans-ethnic genetic correlations tending to be close to or not significantly different from 1.80,81 The most rapid path to closing gaps in PRS transferability is to increase the inclusion of GWAS participants from populations most divergent from those already routinely studied. As empirically demonstrated here, when comparing PRS accuracy calculated from diverse cohort meta-analysis versus data from Europeans only, large-scale GWASs with diverse African populations will rapidly reduce portability gaps across global populations because they have the most genetic diversity, most rapid linkage disequilibrium decay, and highest genetic divergence from the best studied populations. Major efforts under way, such as the Human Heredity and Health in Africa Initiative, PAGE, All of Us, and NeuroGAP programs,61,82,83,84,85 are especially promising for rectifying current PRS gaps and missed scientific opportunities by increasing inclusion of diverse African participants.

Beyond expanding on diversity by increasing the number of study participants in large-scale studies, it is equally important to diversify researchers working on genomics studies. Currently, the vast majority of researchers in genomics studies are of European ancestry,86,87,88 paralleling the over-representation of European ancestry individuals in genomic studies. The exclusion of African researchers leads to the disparity in research leadership and reduced scientific output from African researchers.89 Efforts such as the Global Initiative for Neuropsychiatric Genetics Education and Research (GINGER) program,90 which provides mentorship and training for early-career investigators on the African continent (particularly in Uganda, Kenya, Ethiopia, and South Africa, including several of this study’s authors), are important in moving toward a more inclusive and representative research community.

Conclusion

Previous studies that have examined PRS accuracy across globally diverse ancestry groups have demonstrated that accuracy is lowest in African ancestry samples. However, the extent to which this accuracy varies within African ancestry populations has not been previously investigated. Our finding that prediction accuracy varies across African ancestry populations is a clear reflection of the vast genetic diversity of the continent. It is therefore critically important to create well-powered GWASs that reflect the full range of diversity within Africa.

Data and code availability

All data used in this study are publicly available. Data from the African Genome Variation Project were accessed by combining EGAD00010001045, EGAD00010001046, EGAD00010001049, EGAD00010001050, EGAD00010001051, EGAD00010001052, EGAD00010001053, EGAD00010001054, EGAD00010001055, EGAD00010001056, EGAD00010001057, and EGAD00010001058. The Drakenstein Child Health Study is committed to the principle of data sharing. De-identified data will be made available to requesting researchers as appropriate. Requests for collaborations to undertake data analysis are welcome. Uganda GPC genetic data used in this paper were accessed through EGAD00010000965 and phenotype data were accessed via sftp from EGA (reference: DD_PK_050716 gwas_phenotypes_28Oct14.txt). We accessed data from the UK Biobank with application 31063. BioBank Japan summary statistics were accessed from http://jenger.riken.jp/en/result. GWAS summary statistics for the Population Architecture using Genomics and Epidemiology (PAGE) study were accessed through the NHGRI-EBI GWAS Catalog (https://www.ebi.ac.uk/gwas/downloads/summary-statistics).

All code used in analysis is available here: https://github.com/armartin/africa_prs.

Acknowledgments

We thank Lori Chibnik, Bizu Gelaye, Kristi Post, and Courtney White for facilitating the GINGER program and making this work possible. This work was supported by funding from the National Institutes of Health (K99/R00MH117229 to A.R.M.; K01MH121659 and T32MH017119 to E.G.A.). UK Biobank analyses were conducted via application 31063. The DCHS is funded by the Bill and Melinda Gates Foundation (OPP1017641). Additional support for H.J.Z. and D.J.S., and for the research reported in this publication, was provided by the South African Medical Research Council (SAMRC). The SAMRC provided additional support through its Division of Research Capacity Development under the National Health Scholarship program from funding received from the Public Health Enhancement Fund/South African National Department of Health. The views and opinions expressed are those of the authors and do not necessarily represent the official views of the SAMRC. We thank the mothers and their children for participating in the DCHS, and we thank the study staff and the clinical and administrative staff of the Western Cape Government Health Department at Paarl Hospital and at the clinics for their support of the study. We also thank all research participants in the UK Biobank, BioBank Japan, PAGE study, UGR and Uganda GPC, and AGVP studies.

Declaration of interests

The authors declare no competing interests.

Footnotes

Supplemental information can be found online at https://doi.org/10.1016/j.xhgg.2023.100184.

Web resources

European Genome-phenome Archive, https://ega-archive.org.

Drakenstein Child Health Study, http://www.paediatrics.uct.ac.za/scah/dchs.

BioBank Japan summary statistics, http://jenger.riken.jp/en/result.

Population Architecture using Genomics and Epidemiology (PAGE), https://www.ebi.ac.uk/gwas/downloads/summary-statistics.

Supplemental information

Document S1. Figures S1–S21
Data S1. Tables S1–S9
Document S2. Article plus supplemental information

References

  • 1. Visscher P.M., Wray N.R., Zhang Q., Sklar P., McCarthy M.I., Brown M.A., Yang J. 10 Years of GWAS discovery: biology, function, and translation. Am. J. Hum. Genet. 2017;101:5–22. doi: 10.1016/j.ajhg.2017.06.005.
  • 2. Morales J., Welter D., Bowler E.H., Cerezo M., Harris L.W., McMahon A.C., Hall P., Junkins H.A., Milano A., Hastings E., et al. A standardized framework for representation of ancestry data in genomics studies, with application to the NHGRI-EBI GWAS Catalog. Genome Biol. 2018;19:21. doi: 10.1186/s13059-018-1396-2.
  • 3. Popejoy A.B., Fullerton S.M. Genomics is failing on diversity. Nature. 2016;538:161–164. doi: 10.1038/538161a.
  • 4. Martin A.R., Kanai M., Kamatani Y., Okada Y., Neale B.M., Daly M.J. Clinical use of current polygenic risk scores may exacerbate health disparities. Nat. Genet. 2019;51:584–591. doi: 10.1038/s41588-019-0379-x.
  • 5. Manrai A.K., Funke B.H., Rehm H.L., Olesen M.S., Maron B.A., Szolovits P., Margulies D.M., Loscalzo J., Kohane I.S. Genetic misdiagnoses and the potential for health disparities. N. Engl. J. Med. 2016;375:655–665. doi: 10.1056/NEJMsa1507092.
  • 6. Sirugo G., Williams S.M., Tishkoff S.A. The missing diversity in human genetic studies. Cell. 2019;177:26–31. doi: 10.1016/j.cell.2019.02.048.
  • 7. Chen J., Spracklen C.N., Marenne G., Varshney A., Corbin L.J., Luan J., Willems S.M., Wu Y., Zhang X., Horikoshi M., et al. The trans-ancestral genomic architecture of glycemic traits. Nat. Genet. 2021;53:840–860. doi: 10.1038/s41588-021-00852-9.
  • 8. Lam M., Chen C.-Y., Li Z., Martin A.R., Bryois J., Ma X., Gaspar H., Ikeda M., Benyamin B., Brown B.C., et al. Comparative genetic architectures of schizophrenia in East Asian and European populations. Nat. Genet. 2019;51:1670–1678. doi: 10.1038/s41588-019-0512-x.
  • 9. Liu J.Z., van Sommeren S., Huang H., Ng S.C., Alberts R., Takahashi A., Ripke S., Lee J.C., Jostins L., Shah T., et al. Association analyses identify 38 susceptibility loci for inflammatory bowel disease and highlight shared genetic risk across populations. Nat. Genet. 2015;47:979–986. doi: 10.1038/ng.3359.
  • 10. Martin A.R., Gignoux C.R., Walters R.K., Wojcik G.L., Neale B.M., Gravel S., Daly M.J., Bustamante C.D., Kenny E.E. Human demographic history impacts genetic risk prediction across diverse populations. Am. J. Hum. Genet. 2017;100:635–649. doi: 10.1016/j.ajhg.2017.03.004.
  • 11. Scutari M., Mackay I., Balding D. Using genetic distance to infer the accuracy of genomic prediction. PLoS Genet. 2016;12:e1006288. doi: 10.1371/journal.pgen.1006288.
  • 12. Martin A.R., Daly M.J., Robinson E.B., Hyman S.E., Neale B.M. Predicting polygenic risk of psychiatric disorders. Biol. Psychiatr. 2019;86:97–109. doi: 10.1016/j.biopsych.2018.12.015.
  • 13. International Schizophrenia Consortium, Purcell S.M., Wray N.R., Stone J.L., Visscher P.M., O’Donovan M.C., Sullivan P.F., Sklar P. Common polygenic variation contributes to risk of schizophrenia and bipolar disorder. Nature. 2009;460:748–752. doi: 10.1038/nature08185.
  • 14. 1000 Genomes Project Consortium, Auton A., Brooks L.D., Durbin R.M., Garrison E.P., Kang H.M., Korbel J.O., Marchini J.L., McCarthy S., McVean G.A., Abecasis G.R. A global reference for human genetic variation. Nature. 2015;526:68–74. doi: 10.1038/nature15393.
  • 15. Uren C., Kim M., Martin A.R., Bobo D., Gignoux C.R., van Helden P.D., Möller M., Hoal E.G., Henn B.M. Fine-scale human population structure in southern Africa reflects ecogeographic boundaries. Genetics. 2016;204:303–314. doi: 10.1534/genetics.116.187369.
  • 16. Busby G.B., Band G., Si Le Q., Jallow M., Bougama E., Mangano V.D., Amenga-Etego L.N., Enimil A., Apinjoh T., Ndila C.M., et al. Admixture into and within sub-saharan Africa. Elife. 2016;5:e15266. doi: 10.7554/eLife.15266.
  • 17. Pagani L., Schiffels S., Gurdasani D., Danecek P., Scally A., Chen Y., Xue Y., Haber M., Ekong R., Oljira T., et al. Tracing the route of modern humans out of Africa by using 225 human genome sequences from Ethiopians and Egyptians. Am. J. Hum. Genet. 2015;96:986–991. doi: 10.1016/j.ajhg.2015.04.019.
  • 18. Choudhury A., Aron S., Botigué L.R., Sengupta D., Botha G., Bensellak T., Wells G., Kumuthini J., Shriner D., Fakim Y.J., et al. High-depth African genomes inform human migration and health. Nature. 2020;586:741–748. doi: 10.1038/s41586-020-2859-7.
  • 19. Wang Y., Guo J., Ni G., Yang J., Visscher P.M., Yengo L. Theoretical and empirical quantification of the accuracy of polygenic scores in ancestry divergent populations. Nat. Commun. 2020;11:3865. doi: 10.1038/s41467-020-17719-y.
  • 20. de Vlaming R., Okbay A., Rietveld C.A., Johannesson M., Magnusson P.K.E., Uitterlinden A.G., van Rooij F.J.A., Hofman A., Groenen P.J.F., Thurik A.R., Koellinger P.D. Meta-GWAS accuracy and power (MetaGAP) calculator shows that hiding heritability is partially due to imperfect genetic correlations across studies. PLoS Genet. 2017;13:e1006495. doi: 10.1371/journal.pgen.1006495.
  • 21. Wray N.R., Yang J., Hayes B.J., Price A.L., Goddard M.E., Visscher P.M. Pitfalls of predicting complex traits from SNPs. Nat. Rev. Genet. 2013;14:507–515. doi: 10.1038/nrg3457.
  • 22. Zaidi A.A., Mathieson I. 2020. Demographic History Impacts Stratification in Polygenic Scores.
  • 23. Mostafavi H., Harpak A., Agarwal I., Conley D., Pritchard J.K., Przeworski M. Variable prediction accuracy of polygenic scores within an ancestry group. Elife. 2020;9:e48376. doi: 10.7554/eLife.48376.
  • 24. Hero J.O., Zaslavsky A.M., Blendon R.J. The United States leads other Nations in differences by income in perceptions of health and health care. Health Aff. 2017;36:1032–1040. doi: 10.1377/hlthaff.2017.0006.
  • 25. Roser M., Ortiz-Ospina E., Ritchie H. Life expectancy. Our World in Data. 2013.
  • 26. U.S. Department of Health & Human Services, and Agency for Healthcare Research and Quality (2017). 2016 National Healthcare Quality and Disparities Report.
  • 27. Martin A.R., Teferra S., Möller M., Hoal E.G., Daly M.J. The critical needs and challenges for genetic architecture studies in Africa. Curr. Opin. Genet. Dev. 2018;53:113–120. doi: 10.1016/j.gde.2018.08.005.
  • 28. Henn B.M., Cavalli-Sforza L.L., Feldman M.W. The great human expansion. Proc. Natl. Acad. Sci. USA. 2012;109:17758–17764. doi: 10.1073/pnas.1212380109.
  • 29. Campbell M.C., Tishkoff S.A. African genetic diversity: implications for human demographic history, modern human origins, and complex disease mapping. Annu. Rev. Genom. Hum. Genet. 2008;9:403–433. doi: 10.1146/annurev.genom.9.081307.164258.
  • 30. Genovese G., Friedman D.J., Ross M.D., Lecordier L., Uzureau P., Freedman B.I., Bowden D.W., Langefeld C.D., Oleksyk T.K., Uscinski Knob A.L., et al. Association of trypanolytic ApoL1 variants with kidney disease in African Americans. Science. 2010;329:841–845. doi: 10.1126/science.1193032.
  • 31. Bigdeli T.B., Genovese G., Georgakopoulos P., Meyers J.L., Peterson R.E., Iyegbe C.O., Medeiros H., Valderrama J., Achtyes E.D., Kotov R., et al. Contributions of common genetic variants to risk of schizophrenia among individuals of African and Latino ancestry. Mol. Psychiatr. 2020;25:2455–2467. doi: 10.1038/s41380-019-0517-y.
  • 32. Li J.Z., Absher D.M., Tang H., Southwick A.M., Casto A.M., Ramachandran S., Cann H.M., Barsh G.S., Feldman M., Cavalli-Sforza L.L., Myers R.M. Worldwide human relationships inferred from genome-wide patterns of variation. Science. 2008;319:1100–1104. doi: 10.1126/science.1153717.
  • 33. Gurdasani D., Carstensen T., Tekola-Ayele F., Pagani L., Tachmazidou I., Hatzikotoulas K., Karthikeyan S., Iles L., Pollard M.O., Choudhury A., et al. The African genome variation project shapes medical genetics in Africa. Nature. 2015;517:327–332. doi: 10.1038/nature13997.
  • 34. Tamburini F.B., Maghini D., Oduaran O.H., Brewster R., Hulley M.R., Sahibdeen V., Norris S.A., Tollman S., Kahn K., Wagner R.G., et al. Short- and long-read metagenomics of urban and rural South African gut microbiomes reveal a transitional composition and undescribed taxa. Nat. Commun. 2022;13:926. doi: 10.1038/s41467-021-27917-x.
  • 35. Ramsay M., Crowther N., Tambo E., Agongo G., Baloyi V., Dikotope S., Gómez-Olivé X., Jaff N., Sorgho H., Wagner R., et al. H3Africa AWI-Gen Collaborative Centre: a resource to study the interplay between genomic and environmental risk factors for cardiometabolic diseases in four sub-Saharan African countries. Glob. Health Epidemiol. Genom. 2016;1:e20. doi: 10.1017/gheg.2016.17.
  • 36. Zar H.J., Pellowski J.A., Cohen S., Barnett W., Vanker A., Koen N., Stein D.J. Maternal health and birth outcomes in a South African birth cohort study. PLoS One. 2019;14:e0222399. doi: 10.1371/journal.pone.0222399.
  • 37. Stein D.J., Koen N., Donald K.A., Adnams C.M., Koopowitz S., Lund C., Marais A., Myers B., Roos A., Sorsdahl K., et al. Investigating the psychosocial determinants of child health in Africa: the Drakenstein child health study. J. Neurosci. Methods. 2015;252:27–35. doi: 10.1016/j.jneumeth.2015.03.016.
  • 38. Zar H.J., Barnett W., Myer L., Stein D.J., Nicol M.P. Investigating the early-life determinants of illness in Africa: the Drakenstein child health study. Thorax. 2015;70:592–594. doi: 10.1136/thoraxjnl-2014-206242.
  • 39. Asiki G., Murphy G., Nakiyingi-Miiro J., Seeley J., Nsubuga R.N., Karabarinde A., Waswa L., Biraro S., Kasamba I., Pomilla C., et al. The general population cohort in rural south-western Uganda: a platform for communicable and non-communicable disease studies. Int. J. Epidemiol. 2013;42:129–141. doi: 10.1093/ije/dys234.
  • 40. Heckerman D., Gurdasani D., Kadie C., Pomilla C., Carstensen T., Martin H., Ekoru K., Nsubuga R.N., Ssenyomo G., Kamali A., et al. Linear mixed model for heritability estimation that explicitly addresses environmental variation. Proc. Natl. Acad. Sci. USA. 2016;113:7377–7382. doi: 10.1073/pnas.1510497113.
  • 41. Bycroft C., Freeman C., Petkova D., Band G., Elliott L.T., Sharp K., Motyer A., Vukcevic D., Delaneau O., O’Connell J., et al. The UK Biobank resource with deep phenotyping and genomic data. Nature. 2018;562:203–209. doi: 10.1038/s41586-018-0579-z.
  • 42. Lam M., Awasthi S., Watson H.J., Goldstein J., Panagiotaropoulou G., Trubetskoy V., Karlsson R., Frei O., Fan C.-C., De Witte W., et al. RICOPILI: rapid imputation for COnsortias PIpeLIne. Bioinformatics. 2020;36:930–933. doi: 10.1093/bioinformatics/btz633.
  • 43. Duncan L.E., Ratanatharathorn A., Aiello A.E., Almli L.M., Amstadter A.B., Ashley-Koch A.E., Baker D.G., Beckham J.C., Bierut L.J., Bisson J., et al. Largest GWAS of PTSD (N=20 070) yields genetic overlap with schizophrenia and sex differences in heritability. Mol. Psychiatr. 2018;23:666–673. doi: 10.1038/mp.2017.77.
  • 44. Chang C.C., Chow C.C., Tellier L.C., Vattikuti S., Purcell S.M., Lee J.J. Second-generation PLINK: rising to the challenge of larger and richer datasets. GigaScience. 2015;4:7. doi: 10.1186/s13742-015-0047-8.
  • 45. Schoech A.P., Jordan D.M., Loh P.-R., Gazal S., O’Connor L.J., Balick D.J., Palamara P.F., Finucane H.K., Sunyaev S.R., Price A.L. Quantification of frequency-dependent genetic architectures in 25 UK Biobank traits reveals action of negative selection. Nat. Commun. 2019;10:790. doi: 10.1038/s41467-019-08424-6.
  • 46. Gurdasani D., Carstensen T., Fatumo S., Chen G., Franklin C.S., Prado-Martinez J., Bouman H., Abascal F., Haber M., Tachmazidou I., et al. Uganda genome resource enables insights into population history and genomic discovery in Africa. Cell. 2019;179:984–1002.e36. doi: 10.1016/j.cell.2019.10.004.
  • 47. Bulik-Sullivan B.K., Loh P.-R., Finucane H.K., Ripke S., Yang J., Schizophrenia Working Group of the Psychiatric Genomics Consortium, Patterson N., Daly M.J., Price A.L., Neale B.M. LD Score regression distinguishes confounding from polygenicity in genome-wide association studies. Nat. Genet. 2015;47:291–295. doi: 10.1038/ng.3211.
  • 48. Pazokitoroudi A., Wu Y., Burch K.S., Hou K., Zhou A., Pasaniuc B., Sankararaman S. Efficient variance components analysis across millions of genomes. Nat. Commun. 2020;11:4020. doi: 10.1038/s41467-020-17576-9.
  • 49. Howrigan D. 2017. Details and Considerations of the UK Biobank GWAS.
  • 50. Ge T., Chen C.-Y., Ni Y., Feng Y.-C.A., Smoller J.W. Polygenic prediction via Bayesian regression and continuous shrinkage priors. Nat. Commun. 2019;10:1776. doi: 10.1038/s41467-019-09718-5.
  • 51. Daetwyler H.D., Villanueva B., Woolliams J.A. Accuracy of predicting the genetic risk of disease using a genome-wide approach. PLoS One. 2008;3:e3395. doi: 10.1371/journal.pone.0003395.
  • 52. Wray N.R., Kemper K.E., Hayes B.J., Goddard M.E., Visscher P.M. Complex trait prediction from genome data: contrasting EBV in livestock to PRS in humans: genomic prediction. Genetics. 2019;211:1131–1141. doi: 10.1534/genetics.119.301859.
  • 53. Bitarello B.D., Mathieson I. Polygenic scores for height in admixed populations. G3. 2020;10:4027–4036. doi: 10.1534/g3.120.401658.
  • 54. Kong A., Thorleifsson G., Frigge M.L., Vilhjalmsson B.J., Young A.I., Thorgeirsson T.E., Benonisdottir S., Oddsson A., Halldorsson B.V., Masson G., et al. The nature of nurture: effects of parental genotypes. Science. 2018;359:424–428. doi: 10.1126/science.aan6877.
  • 55. Wang Y., Namba S., Lopera E., Kerminen S., Tsuo K., Läll K., Kanai M., Zhou W., Wu K.H., Favé M.J., Bhatta L. Global Biobank analyses provide lessons for developing polygenic risk scores across diverse cohorts. Cell Genom. 2023;3:100241. doi: 10.1016/j.xgen.2022.100241.
  • 56. Hodgson J.A., Mulligan C.J., Al-Meeri A., Raaum R.L. Early back-to-Africa migration into the horn of Africa. PLoS Genet. 2014;10:e1004393. doi: 10.1371/journal.pgen.1004393.
  • 57. Henn B.M., Botigué L.R., Gravel S., Wang W., Brisbin A., Byrnes J.K., Fadhlaoui-Zid K., Zalloua P.A., Moreno-Estrada A., Bertranpetit J., et al. Genomic ancestry of North Africans supports back-to-Africa migrations. PLoS Genet. 2012;8:e1002397. doi: 10.1371/journal.pgen.1002397.
  • 58. Henn B.M., Gignoux C.R., Jobin M., Granka J.M., Macpherson J.M., Kidd J.M., Rodríguez-Botigué L., Ramachandran S., Hon L., Brisbin A., et al. Hunter-gatherer genomic diversity suggests a southern African origin for modern humans. Proc. Natl. Acad. Sci. USA. 2011;108:5154–5162. doi: 10.1073/pnas.1017511108.
  • 59. Nalwanga D., Musiime V., Kizito S., Kiggundu J.B., Batte A., Musoke P., Tumwine J.K. Mortality among children under five years admitted for routine care of severe acute malnutrition: a prospective cohort study from Kampala, Uganda. BMC Pediatr. 2020;20:182. doi: 10.1186/s12887-020-02094-w.
  • 60. Nagai A., Hirata M., Kamatani Y., Muto K., Matsuda K., Kiyohara Y., Ninomiya T., Tamakoshi A., Yamagata Z., Mushiroda T., et al. Overview of the BioBank Japan project: study design and profile. J. Epidemiol. 2017;27:S2–S8. doi: 10.1016/j.je.2016.12.005.
  • 61. Wojcik G.L., Graff M., Nishimura K.K., Tao R., Haessler J., Gignoux C.R., Highland H.M., Patel Y.M., Sorokin E.P., Avery C.L., et al. Genetic analyses of diverse populations improves discovery for complex traits. Nature. 2019;570:514–518. doi: 10.1038/s41586-019-1310-4.
  • 62. Lettre G., Sankaran V.G., Bezerra M.A.C., Araújo A.S., Uda M., Sanna S., Cao A., Schlessinger D., Costa F.F., Hirschhorn J.N., Orkin S.H. DNA polymorphisms at the BCL11A, HBS1L-MYB, and β-globin loci associate with fetal hemoglobin levels and pain crises in sickle cell disease. Proc. Natl. Acad. Sci. USA. 2008;105:11869–11874. doi: 10.1073/pnas.0804799105.
  • 63. Chambers J.C., Zhang W., Li Y., Sehmi J., Wass M.N., Zabaneh D., Hoggart C., Bayele H., McCarthy M.I., Peltonen L., et al. Genome-wide association study identifies variants in TMPRSS6 associated with hemoglobin levels. Nat. Genet. 2009;41:1170–1172. doi: 10.1038/ng.462.
  • 64. Benyamin B., Ferreira M.A.R., Willemsen G., Gordon S., Middelberg R.P.S., McEvoy B.P., Hottenga J.-J., Henders A.K., Campbell M.J., Wallace L., et al. Common variants in TMPRSS6 are associated with iron status and erythrocyte volume. Nat. Genet. 2009;41:1173–1175. doi: 10.1038/ng.456.
  • 65. Astle W.J., Elding H., Jiang T., Allen D., Ruklisa D., Mann A.L., Mead D., Bouman H., Riveros-Mckay F., Kostadima M.A., et al. The allelic landscape of human blood cell trait variation and links to common complex disease. Cell. 2016;167:1415–1429.e19. doi: 10.1016/j.cell.2016.10.042.
  • 66. Knowles J.W., Ashley E.A. Cardiovascular disease: the rise of the genetic risk score. PLoS Med. 2018;15:e1002546. doi: 10.1371/journal.pmed.1002546.
  • 67. Khera A.V., Chaffin M., Aragam K.G., Haas M.E., Roselli C., Choi S.H., Natarajan P., Lander E.S., Lubitz S.A., Ellinor P.T., Kathiresan S. Genome-wide polygenic scores for common diseases identify individuals with risk equivalent to monogenic mutations. Nat. Genet. 2018;50:1219–1224. doi: 10.1038/s41588-018-0183-z.
  • 68. Obermeyer Z., Powers B., Vogeli C., Mullainathan S. Dissecting racial bias in an algorithm used to manage the health of populations. Science. 2019;366:447–453. doi: 10.1126/science.aax2342.
  • 69. Rotimi C.N., Bentley A.R., Doumatey A.P., Chen G., Shriner D., Adeyemo A. The genomic landscape of African populations in health and disease. Hum. Mol. Genet. 2017;26:R225–R236. doi: 10.1093/hmg/ddx253.
  • 70. Wheeler E., Liu C.T., Mf H., Hievert M.F., Strawbridge R., Podmore C. Impact of common genetic determinants of Hemoglobin A1c on type 2 diabetes risk and diagnosis in ancestrally diverse populations: a transethnic genome-wide meta-analysis. PLoS medicine. 2018 doi: 10.1530/ey.15.13.14.
  • 71. Mugisha J.O., Seeley J., Kuper H. Population based haematology reference ranges for old people in rural South-West Uganda. BMC Res. Notes. 2016;9:433. doi: 10.1186/s13104-016-2217-x.
  • 72. Mugisha J.O., Baisley K., Asiki G., Seeley J., Kuper H. Prevalence, types, risk factors and clinical correlates of anaemia in older people in a rural Ugandan population. PLoS One. 2013;8:e78394. doi: 10.1371/journal.pone.0078394.
  • 73. Cerdeña J.P., Plaisime M.V., Tsai J. From race-based to race-conscious medicine: how anti-racist uprisings call us to act. Lancet. 2020;396:1125–1128. doi: 10.1016/S0140-6736(20)32076-6.
  • 74. He Y., Lakhani C.M., Manrai A.K., Patel C.J. Poly-exposure and poly-genomic scores implicate prominent roles of non-genetic and demographic factors in four common diseases in the UK. Cold Spring Harbor Lab. 2019:833632. doi: 10.1101/833632.
  • 75. Barcellos S.H., Carvalho L.S., Turley P. Education can reduce health differences related to genetic risk of obesity. Proc. Natl. Acad. Sci. USA. 2018;115:E9765–E9772. doi: 10.1073/pnas.1802909115.
  • 76. Novembre J., Barton N.H. Tread lightly interpreting polygenic tests of selection. Genetics. 2018;208:1351–1355. doi: 10.1534/genetics.118.300786.
  • 77. Kerminen S., Martin A.R., Koskela J., Ruotsalainen S.E., Havulinna A.S., Surakka I., Palotie A., Perola M., Salomaa V., Daly M.J., et al. Geographic variation and bias in the polygenic scores of complex diseases and traits in Finland. Am. J. Hum. Genet. 2019;104:1169–1181. doi: 10.1016/j.ajhg.2019.05.001.
  • 78. Kuchenbaecker K., Telkar N., Reiker T., Walters R.G., Lin K., Eriksson A., Gurdasani D., Gilly A., Southam L., Tsafantakis E., et al. The transferability of lipid loci across African, Asian and European cohorts. Nat. Commun. 2019;10:4330. doi: 10.1038/s41467-019-12026-7.
  • 79. Graham S.E., Clarke S.L., Wu K.-H.H., Kanoni S., Zajac G.J.M., Ramdas S., Surakka I., Ntalla I., Vedantam S., Winkler T.W., et al. The power of genetic diversity in genome-wide association studies of lipids. Nature. 2021;600:675–679. doi: 10.1038/s41586-021-04064-3.
  • 80. Brown B.C., Asian Genetic Epidemiology Network Type 2 Diabetes Consortium, Ye C.J., Price A.L., Zaitlen N. Transethnic genetic-correlation estimates from summary statistics. Am. J. Hum. Genet. 2016;99:76–88. doi: 10.1016/j.ajhg.2016.05.001.
  • 81. Shi H., Burch K.S., Johnson R., Freund M.K., Kichaev G., Mancuso N., Manuel A.M., Dong N., Pasaniuc B. Localizing components of shared transethnic genetic architecture of complex traits from GWAS summary data. Am. J. Hum. Genet. 2020;106:805–817. doi: 10.1016/j.ajhg.2020.04.012.
  • 82. Hindorff L.A., Bonham V.L., Brody L.C., Ginoza M.E.C., Hutter C.M., Manolio T.A., Green E.D. Prioritizing diversity in human genomics research. Nat. Rev. Genet. 2018;19:175–185. doi: 10.1038/nrg.2017.89.
  • 83. Mulder N., Abimiku A., Adebamowo S.N., de Vries J., Matimba A., Olowoyo P., Ramsay M., Skelton M., Stein D.J. H3Africa: current perspectives. Pharmgenomics Pers. Med. 2018;11:59–66. doi: 10.2147/PGPM.S141546.
  • 84. Stevenson A., Akena D., Stroud R.E., Atwoli L., Campbell M.M., Chibnik L.B., Kwobah E., Kariuki S.M., Martin A.R., de Menil V., et al. Neuropsychiatric genetics of African populations-psychosis (NeuroGAP-Psychosis): a case-control study protocol and GWAS in Ethiopia, Kenya, South Africa and Uganda. BMJ Open. 2019;9:e025469. doi: 10.1136/bmjopen-2018-025469.
  • 85. All of Us Research Program Investigators, Denny J.C., Rutter J.L., Goldstein D.B., Philippakis A., Smoller J.W., Jenkins G., Dishman E. The “all of us” research program. N. Engl. J. Med. 2019;381:668–676. doi: 10.1056/NEJMsr1809937.
  • 86. Ginther D.K., Schaffer W.T., Schnell J., Masimore B., Liu F., Haak L.L., Kington R. Race, ethnicity, and NIH research awards. Science. 2011;333:1015–1019. doi: 10.1126/science.1196783.
  • 87. Hoppe T.A., Litovitz A., Willis K.A., Meseroll R.A., Perkins M.J., Hutchins B.I., Davis A.F., Lauer M.S., Valantine H.A., Anderson J.M., Santangelo G.M. Topic choice contributes to the lower rate of NIH awards to African-American/black scientists. Sci. Adv. 2019;5:eaaw7238. doi: 10.1126/sciadv.aaw7238.
  • 88. Hamrick K. National Science Foundation, National Center for Science and Engineering Statistics (NCSES); 2019. Women, Minorities, and Persons with Disabilities in Science and Engineering: 2019; pp. 19–304. Alexandria, VA, Special Report NSF.
  • 89. Bentley A.R., Callier S.L., Rotimi C.N. Evaluating the promise of inclusion of African ancestry populations in genomics. NPJ Genom. Med. 2020;5:5. doi: 10.1038/s41525-019-0111-x.
  • 90. van der Merwe C., Mwesiga E.K., McGregor N.W., Ejigu A., Tilahun A.W., Kalungi A., Akimana B., Dubale B.W., Omari F., Mmochi J., et al. Advancing neuropsychiatric genetics training and collaboration in Africa. Lancet. Glob. Health. 2018;6:e246–e247. doi: 10.1016/S2214-109X(18)30042-1.



Articles from Human Genetics and Genomics Advances are provided here courtesy of Elsevier
