Abstract
Understanding genetic differences between populations is essential for avoiding confounding in genome-wide association studies and improving polygenic score (PGS) portability. We developed a statistical pipeline to infer fine-scale Ancestry Components and applied it to UK Biobank data. Ancestry Components identify population structure not captured by widely used principal components, improving stratification correction for geographically correlated traits. To estimate the similarity of genetic effect sizes between groups, we developed ANCHOR, which estimates changes in the predictive power of an existing PGS in distinct local ancestry segments. ANCHOR infers highly similar (estimated correlation 0.98 ± 0.07) effect sizes between UK Biobank participants of African and European ancestry for 47 of 53 quantitative phenotypes, suggesting that gene–environment and gene–gene interactions do not play major roles in poor cross-ancestry PGS transferability for these traits in the United Kingdom, and providing optimism that shared causal mutations operate similarly in different populations.
Subject terms: Population genetics, Genome-wide association studies, Software, Medical genetics
This study introduces the concept of Ancestry Components and shows that they can offer improved population stratification correction for geographically correlated traits. By using ancestry-aware polygenic score construction in admixed individuals, the authors find that effect sizes are conserved across ancestry groups.
Main
Genome-wide association studies (GWAS) have uncovered numerous genetic influences on complex human traits, regulated by many loci with small effect sizes. For traits such as height, large sample sizes in European groups have allowed PGS to explain a substantial fraction of heritability by summing effects across many single nucleotide polymorphisms (SNPs) genome-wide1. Because the true causal variants are unknown, SNPs in PGS are expected not to influence a trait directly, but rather correlate with (‘tag’) a true causal mutation. Genetic stratification is a major confounder in GWAS, potentially causing false positives when phenotypes correlate with stratification. Methods including mixed models2,3 and principal component analysis4,5 have proved powerful in solving major issues, but subtle population structure still biases effect size estimates6,7, complicating studies of temporal trait evolution6 and comparisons between human groups. Moreover, PGS accuracy drops in populations with different ancestry from the GWAS8,9 cohort, particularly those with strong genetic differentiation. This lack of portability is partly caused by genetic drift making some causal variants group-specific, and population-specific linkage disequilibrium (LD) patterns affecting tagging accuracy.
As well as such ‘local’ differences, recent studies10–14 have suggested that causal variants have different effect sizes on traits in different groups, because of ‘global’ (nonlocal) factors. Changes in causal SNP effect sizes, defined as their mean effects on a trait of interest, can arise through either gene–gene interactions not captured by the additive model, or gene–environment interactions, where these differ across populations. Varying effect sizes have been documented within populations9, so could also contribute to differences between populations. However, other studies suggest similarities across groups in underlying effect sizes15–19 and, for stronger GWAS hits, in direction of effect at least20.
One recent study14 leverages individuals of mixed African and European ancestry, decomposing local ancestry to test whether causal variants on African versus European chromosomes show different effect sizes. By focusing on within-individual comparisons, this approach eliminated factors including gene–environment interactions14,21 and long-range gene–gene interactions21 that might differ across populations. It instead examines whether, for example, local gene–gene interactions alter effect sizes on different ancestral backgrounds. The study inferred strong sharing of underlying effect sizes at this within-individual level, suggesting local interactions are not major factors for most traits. However, it also inferred strong differences in causal effects across individuals possessing different continental ancestries, possibly because of gene–environment interactions. Resolving the role of such interactions is vital to successfully apply genetic findings across groups, design efficient studies and understand whether evolution or environmental differences drive interpopulation trait variation. Here, we develop an approach to do this, using models in which genetic effect sizes are shared across ancestries within individuals (as observed in ref. 14), but may vary between individuals with mainly European or African ancestry.
To analyze the impacts of population structure, we introduce a fine-scale ancestry pipeline. We use ‘ancestry’ throughout to refer to sharing inferred most-recent ancestors with someone from a particular self-reported ethnicity or geographic region. Expanding on our previous work22–24, we created a pipeline inferring individuals’ ancestry contributions from 127 geographically meaningful and genetically recoverable (Methods) regions worldwide. We applied this pipeline to all 487,409 participating UK Biobank25 (UKB) individuals. We show that using detailed ancestry information in GWAS better corrects for population stratification versus other state-of-the-art methods2,4,26, reduces likely false positives and uncovers previously undiscovered associations, but nonportability remains strong. We also develop ANCHOR, a statistical inference approach to estimating cross-population similarity in causal effect sizes using admixed individuals, complementing approaches including POPCORN10 and XPASS27 that only apply to nonadmixed individuals (‘Discussion’). Notably, ANCHOR requires no prior assumptions about the underlying effect size distribution. As in ref. 14 and other recent studies11–13, ANCHOR decomposes ancestry at fine scales along the genome. It assigns mutations to specific ancestries to quantify the contribution of gene–environment interactions to differences in predictive power between individuals with different ancestries. Application of ANCHOR to 8,003 UKB mixed-ancestry individuals yields different findings from recent studies, which we discuss along with implications.
Results
A statistical pipeline to infer precise individual ancestry
We developed an approach able to decompose the fine-scale ancestry of a genome into a mix of 127 regions, including 23 in the United Kingdom and Ireland (Supplementary Table 1). The UKB dataset comprises many individuals of mainly British ancestry, alongside a substantial fraction with ancestry from elsewhere. Our approach identified 105 regions present in at least 5 individuals for at least 10% of their ancestry. Our pipeline leverages data from previous studies23,28–31 of human genetic variation, and uses methods25,32 to impute these data on a common set of variants for high-quality imputation. This generates a unified ‘reference panel’ of haplotypes (Methods and Fig. 1a), phased using SHAPEIT2 (ref. 32) to ensure consistent data quality. Reference haplotypes are labeled through semi-supervised clustering with ChromoPainter33 and fineSTRUCTURE33, combined with geographic and ethnicity labels, to produce a painting reference panel.
Fig. 1. Schematic diagram of the ancestral inference pipeline and the performance in the UKB BI individuals.
a, Main steps in the ancestral inference pipeline. The pipeline accepts individual genotype data (microarray or sequencing data) as input. The genotype data are phased and imputed against a phasing and imputation reference panel in first ‘Phasing’ and ‘Imputation’ stages, and painted against the painting reference panel including preclustered groupings (5 groups in this example; 127 in our actual analysis). In a final mixture fitting stage, non-negative coefficients summing to one and representing the proportions of ancestries from the labeled groups in the reference panel are inferred. b, Geographical average proportion of DNA in BI individuals, positioned according to their birthplaces (Methods), inferred to come from three regional groupings: North Yorkshire, South Yorkshire and South East England (which correspond to the excess ancestry locations colored red). Pop., population; A, proportion of ancestries from each population.
Our pipeline uses this panel to generate an ancestry decomposition for a new ‘target’ sample (Fig. 1a), extending our previous approaches22,33, by closely mirroring the steps used to generate the reference panel itself. Unified panel markers are phased and imputed in the target, using the phasing and imputation panel. For UKB, haplotype data are prephased. Imputed haplotypes are then matched against the labeled painting reference panel using ChromoPainter to quantify recent ancestor sharing. Finally, a non-negative least squares (NNLS) based approach (Methods)22 is used to infer ancestry coefficients from ChromoPainter output by fitting the observed haplotype-matching vector as a mixture of those from the 127 reference panel groups. This approach leverages haplotype information to infer population structure33 alongside Hidden Markov Model approximations to the coalescent33, analyzing each sample in minutes (Methods).
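As a rough illustration of the mixture-fitting step, the sketch below fits a target haplotype-matching vector as a non-negative mixture of reference-group vectors and rescales the coefficients to sum to one. It is not the pipeline's implementation: the function name `nnls_mixture`, the three-group profiles and the simple projected-gradient solver are assumptions made for illustration, and a dedicated NNLS routine (for example `scipy.optimize.nnls`) would normally be preferred.

```python
# Sketch only: projected gradient descent standing in for a proper NNLS solver.
# All profile values below are invented for illustration.

def nnls_mixture(reference_profiles, target_profile, n_iter=2000):
    """Fit target ~ sum_k coef[k] * reference_profiles[k] subject to coef >= 0,
    then rescale the coefficients so that they sum to one."""
    n_groups = len(reference_profiles)
    n_features = len(target_profile)
    # Safe step size: 1 / (squared Frobenius norm) never exceeds 1 / Lipschitz constant.
    frob_sq = sum(v * v for profile in reference_profiles for v in profile)
    step = 1.0 / frob_sq
    coef = [1.0 / n_groups] * n_groups
    for _ in range(n_iter):
        fitted = [
            sum(coef[k] * reference_profiles[k][j] for k in range(n_groups))
            for j in range(n_features)
        ]
        residual = [fitted[j] - target_profile[j] for j in range(n_features)]
        for k in range(n_groups):
            grad_k = sum(
                reference_profiles[k][j] * residual[j] for j in range(n_features)
            )
            # Gradient step followed by projection onto the constraint coef >= 0.
            coef[k] = max(0.0, coef[k] - step * grad_k)
    total = sum(coef)
    return [c / total for c in coef] if total > 0 else coef


# Hypothetical haplotype-matching profiles for three reference groups.
references = [
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.1, 0.8],
]
# A target built as 70% of group 0 and 30% of group 1.
target = [0.59, 0.31, 0.10]
proportions = nnls_mixture(references, target)
print([round(p, 3) for p in proportions])
```

In this toy case the target is an exact mixture of the first two groups, so the fitted proportions recover that mixture; with real painting output the fit is only approximate.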
We applied our pipeline to 487,409 UKB participants. Because our pipeline uses ‘out-of-sample’ comparisons based on population structure, analyzing individuals independently, it captures population structure information in the United Kingdom and Ireland but not genetic relatedness between samples. Ancestry is defined through subjective groupings of reference individuals using genetic clusters identifiable via fineSTRUCTURE. Because the true genetic ancestry of UKB individuals is unknown, we assessed the pipeline’s effectiveness by examining the relationship between inferred ancestry and individuals’ birthplace and self-reported ethnicity, the ability of Ancestry Components (ACs) to predict principal components (PCs) and the capacity of our ACs to correct for ancestry in GWAS, compared with widely used approaches including PCs and BOLT-LMM4,34. We also developed an expectation-maximization algorithm (Methods and Supplementary Note 1) to estimate allele frequencies across SNPs. This method provides frequency estimates for each of 25,485,700 UKB imputed mutations25 within 127 reference groups, aiding studies of regional allele frequencies within the United Kingdom.
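To make the idea of ancestry-specific allele frequency estimation concrete, here is a minimal expectation-maximization sketch for a single SNP. It assumes a simple model that is not taken from the Methods: each allele copy in an individual is drawn from group k with probability equal to that individual's ancestry proportion, and carries the alternative allele with probability equal to the group frequency. The function name and the toy inputs are hypothetical.

```python
# Sketch only: a generic EM under an assumed mixture model, not the authors' algorithm.

def em_group_allele_freqs(genotypes, ancestry, n_iter=200, eps=1e-9):
    """genotypes[i]   : alternative-allele count (0, 1 or 2) for individual i.
    ancestry[i][k] : fraction of individual i's ancestry assigned to group k.
    Returns the estimated alternative-allele frequency in each group."""
    n_groups = len(ancestry[0])
    freqs = [0.5] * n_groups
    for _ in range(n_iter):
        alt = [0.0] * n_groups  # expected alternative copies drawn from each group
        tot = [0.0] * n_groups  # expected total copies drawn from each group
        for g, q in zip(genotypes, ancestry):
            # E-step: posterior group membership of an alternative / reference copy.
            w_alt = [q[k] * freqs[k] for k in range(n_groups)]
            w_ref = [q[k] * (1.0 - freqs[k]) for k in range(n_groups)]
            s_alt, s_ref = sum(w_alt), sum(w_ref)
            for k in range(n_groups):
                e_alt = g * w_alt[k] / s_alt if s_alt > 0 else 0.0
                e_ref = (2 - g) * w_ref[k] / s_ref if s_ref > 0 else 0.0
                alt[k] += e_alt
                tot[k] += e_alt + e_ref
        # M-step: update frequencies, kept strictly inside (0, 1).
        freqs = [
            min(1.0 - eps, max(eps, alt[k] / tot[k])) if tot[k] > 0 else freqs[k]
            for k in range(n_groups)
        ]
    return freqs


# Toy data: two groups, four unadmixed and two admixed individuals.
genotypes = [2, 1, 0, 1, 1, 2]
ancestry = [[1, 0], [1, 0], [0, 1], [0, 1], [0.5, 0.5], [0.7, 0.3]]
print([round(f, 3) for f in em_group_allele_freqs(genotypes, ancestry)])
```

When every individual has all of their ancestry from one group, the estimate reduces to the ordinary within-group allele frequency; the EM only matters for individuals with mixed ancestry.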
Fine-scale population structure across the United Kingdom
As a first test, the mean ancestries for the 434,781 UKB participants born in the United Kingdom or Republic of Ireland with self-reported white British and/or Irish (WBI) ethnicity were: British–Irish (BI), 94.9%; Dutch, 1.35%; Swiss, 0.79%; Norwegian, 0.49%; Polish, 0.29%; and Danish, 0.19%. For participants whose self-reported ethnicity is ‘other white background’, the BI proportion drops to 25.5%. For individuals born in the Republic of Ireland, inferred Irish ancestry averages 74.2%, within 98.4% BI ancestry, demonstrating the pipeline’s accuracy in capturing geographic and ethnicity information (Extended Data Fig. 1).
Extended Data Fig. 1. Heatmap of regional mean ancestry proportions of 30 UK+ EU ancestries based on birthplace among UK-born and Ireland-born UKB participants.
One panel per ancestry, brightest red colour on maps indicates highest ancestry, darkest blue lowest ancestry.
The average ancestry proportion from 22 British Isles regions (Methods) varies by birthplace (Fig. 2a). These ancestry proportions are based on an out-of-sample dataset of UK individuals with grandparents born within an ~80-km radius in each region23. Mean BI ancestry associated with a region decreases with geographic distance from that region, indicating that DNA information is informative for birthplace (Fig. 1b and Extended Data Fig. 1). Some 41.5% of UK-born individuals have more than 50% of their ancestry from a single region, matching their birthplace 59.2% of the time, rising to 82.7% after expanding to neighboring regions (Supplementary Table 2a). There is a strong correspondence between self-identified ethnicity and birthplace for non-UK ancestries (Fig. 2b and Extended Data Fig. 2). However, regional variation exists: ancestry localization is weaker in southern and eastern England23, and stronger in Scotland, Wales, Northern England and South West England. An entropy-based statistic (Methods) shows regional mixing, with London having the highest entropy and the highest average fraction of ancestry outside the United Kingdom. Other major cities and South East England also show strong mixing (Extended Data Fig. 3). Non-British ancestries also vary geographically, with higher Irish ancestry (>10%) seen in Liverpool, Birmingham, Manchester and London; Dutch ancestry found in the south of England and around Bristol (3.5%); and Polish ancestry peaking near Wrexham in Wales, the site of the second highest concentration of Polish burials in the United Kingdom35.
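The exact entropy-based statistic is defined in the Methods; as a hedged illustration, the sketch below assumes it behaves like the Shannon entropy of an individual's ancestry proportions, averaged over individuals born in a region. The function names and the example proportions are hypothetical.

```python
import math

# Sketch only: Shannon entropy is an assumed stand-in for the Methods' statistic.

def ancestry_entropy(proportions):
    """Shannon entropy (in nats) of one individual's ancestry proportions.
    Zero when all ancestry comes from one region, larger for more even mixtures."""
    return -sum(p * math.log(p) for p in proportions if p > 0)


def regional_mean_entropy(individuals):
    """Mean entropy across the individuals assigned to one birthplace region."""
    return sum(ancestry_entropy(p) for p in individuals) / len(individuals)


# Invented examples: a region with localized ancestry and a highly mixed one.
localized_region = [[0.9, 0.1, 0.0], [1.0, 0.0, 0.0]]
mixed_region = [[0.4, 0.3, 0.3], [0.5, 0.25, 0.25]]
print(round(regional_mean_entropy(localized_region), 3))
print(round(regional_mean_entropy(mixed_region), 3))
```

Under this assumption, a region such as London, where individuals draw ancestry from many reference groups, would score higher than a region where most individuals carry a single dominant ancestry.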
Fig. 2. Ancestry inference for UKB individuals born in the United Kingdom or Ireland and worldwide.
a, Ancestry inference stratified by birthplace region for UKB WBI individuals; for each regionally labeled bar plot, each column shows ancestry decomposition for a single individual, with colors representing regions shown on the map and numbers representing counts of individuals from each area. b, As a, but showing decomposition for Asian, Oceanian and selected East African countries. Colors are as shown on the map, with colors for ancestry from additional regions given in the legend. White lines on the map delineate the borders of different countries. Self-reported ethnicity labels are shown below each bar plot. Color legends differ in a and b. Self-reported ethnicity: Afr, African; Asi, Asian; Bri, British; Chi, Chinese; Ind, Indian; Ire, Irish; Mix, mixed; Other, other ethnic group; Pa, Pakistani (Asia); Whi, white (Europe).
Extended Data Fig. 2. World ancestries in the UKB inferred by the ancestry pipeline.
UKB participants are mapped to their self-reported country of birth. World countries are coloured by different colours if they are present in the pipeline. Ethnic groups are abbreviated as follows in the bar plots: Bri = British, Whi = White, Ire = Irish, Chi = Chinese, Asi = Asian, Pa = Pakistani, Ind = Indian, Mix = Mixed, Car = Caribbean, and Oth = Other Ethnic Group. a). Ancestries in Central Asia; b). Ancestries in Europe; c). Ancestries in Africa; d). Ancestries in the Americas. White lines on the map delineate the borders of different countries.
Extended Data Fig. 3. Heatmap of regional mean entropy statistic across the UK and Republic of Ireland.
Regional mean entropy indicates varying degrees of admixture in the UK. Areas with high entropy are coloured red and areas with low entropy are coloured blue. Selected regions of high or low entropy are highlighted by boundaries, with the barplots detailing ancestry decomposition for each such region.
For non-UK-born individuals, we again achieve fine-scale ancestry resolution (Fig. 2b, Extended Data Fig. 2 and Supplementary Table 2b). Although some countries (for example, Philippines and Japan) show genetic homogeneity, perhaps because of small sample sizes, most exhibit diverse ancestry patterns. Because non-UK-born individuals reflect people who made their home in the United Kingdom, rather than unbiased samples from birthplace countries, we observe widespread BI ancestry. Other patterns are present, such as the over-representation of Gujarat ancestry among UKB individuals born in Uganda and Kenya, which might be explained by the post-colonial exodus of South Asians from East Africa.
ACs aid GWAS stratification for geographically linked traits
To avoid false-positive GWAS associations, correcting for population stratification is crucial. PCs are widely used for this purpose4,5,36, either alone or with mixed-model analysis3,34. However, selecting the number of PCs is nontrivial; too few can result in false positives, whereas too many can cause overcorrection because of extensive LD (for example, beyond the first 40 publicly released UKB PCs36). We compared the use of 127 ACs with PCs in GWAS. First, we predicted the first 16 UKB PCs from these 127 ACs, and conversely predicted 16 common BI ACs from the first 140 UKB PCs (Methods, Fig. 3a,b and Supplementary Figs. 1 and 2). We found a good prediction of most PCs from ACs, but not the converse, indicating that ACs capture additional information (Supplementary Fig. 1). PC-based prediction was also poor for non-UK regions (Supplementary Fig. 2). Second, we performed GWAS on 104 UKB quantitative traits with more than 10,000 unrelated white British individuals (Methods). We compared the effectiveness of ACs versus PCs for correcting population stratification by using LD-score regression (LDSC; Extended Data Fig. 4)37, which measures systematic inflation because of uncorrected stratification. An intercept estimate close to 1 indicates effective correction, although intercepts slightly above 1 can occur for highly heritable traits like height, or large sample sizes in practice37,38.
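The logic of the LDSC intercept can be sketched as a regression of per-SNP χ² statistics on LD scores: polygenic signal scales with LD score, whereas confounding inflates all SNPs equally and so appears in the intercept. The code below is a simplified, unweighted version for illustration only; the LDSC software itself uses a weighted regression with block-jackknife standard errors, and the function name and toy numbers here are assumptions.

```python
# Sketch only: unweighted least squares, not the LDSC software's estimator.

def ldsc_intercept_and_ratio(chi2, ld_scores):
    """Regress chi-square statistics on LD scores.
    Returns (intercept, slope, attenuation_ratio), where the attenuation ratio
    is (intercept - 1) / (mean chi-square - 1)."""
    n = len(chi2)
    mean_l = sum(ld_scores) / n
    mean_c = sum(chi2) / n
    sxx = sum((l - mean_l) ** 2 for l in ld_scores)
    sxy = sum((l - mean_l) * (c - mean_c) for l, c in zip(ld_scores, chi2))
    slope = sxy / sxx
    intercept = mean_c - slope * mean_l
    ratio = (intercept - 1.0) / (mean_c - 1.0)
    return intercept, slope, ratio


# Invented data: chi-square rises with LD score (polygenicity), plus a constant
# inflation of 0.05 shared by all SNPs (residual stratification).
ld_scores = [10.0, 50.0, 100.0, 200.0, 400.0]
chi2 = [1.05 + 0.002 * l for l in ld_scores]
intercept, slope, ratio = ldsc_intercept_and_ratio(chi2, ld_scores)
print(round(intercept, 3), round(slope, 4), round(ratio, 3))
```

An intercept near 1 and a small attenuation ratio indicate that little of the test-statistic inflation is attributable to confounding.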
Fig. 3. Comparison between AC-corrected and PC-corrected GWAS.
a,b, Predictions are based on ‘linear’ combinations of ACs or PCs. a, Prediction of first 16 UKB PCs (x axes) using a linear model-based prediction from the 127 ACs (y axes) shows strong correlations (R2 values). b, As a, but now predicting 16 UK and worldwide ACs from 140 PCs, often showing poor prediction. c–e, Comparison of AC-corrected (x axis) and PC-corrected (y axis) −log10(P values) for SNPs in three exemplar GWAS for labeled traits: birthplace (c), employment score (d) and waist circumference (e). All plots are colored according to the legend shown at the bottom, indicating earlier evidence from GWAS for each SNP in particular phenotypic categories (gray: SNPs show no prior GWAS evidence, perhaps consistent with likely false-positive associations). The horizontal and vertical dark blue lines indicate the genome-wide P-value threshold (P = 5 × 10−8) on a −log10 scale, while the light blue line represents y = x. In each plot, the points show only independent SNPs with P < 5 × 10−8 for one or both approaches.
Extended Data Fig. 4. Comparison of inflation between AC-corrected GWAS and PC-corrected GWAS across 104 quantitative traits by running LD score regression (LDSC).
a). Comparison of the LDSC intercept between AC-corrected GWAS (x-axis) and PC-corrected GWAS (y-axis) where red dots indicate phenotypes with intercept difference between PC-corrected GWAS and AC-corrected GWAS >0.015; b). Comparison of the LDSC attenuation ratio between AC-corrected GWAS (x-axis) and PC-corrected GWAS (y-axis), where red dots indicate phenotypes with difference of attenuation ratio between PC-corrected GWAS and AC-corrected GWAS >0.02.
Birth location is often used as a control phenotype to test for stratification39 because few SNPs are likely to be causally associated with it. For latitude (Fig. 3c), PC-based correction or BOLT-LMM analysis (Fig. 3c, Supplementary Fig. 3a and Supplementary Table 3a,b) yielded many apparent independent hits, and substantial genome-wide inflation (LDSC intercept of 1.6608; Extended Data Fig. 4a,b), even with 100 PCs (Supplementary Fig. 3b). By contrast, AC-based correction reduced the number of association signals from 470 to 7 (Fig. 3c; with five shared hits), implying that fine-scale ancestry information can effectively remove stratification impacts. Similar results were observed for home location (Extended Data Fig. 4a,b and Supplementary Fig. 3c,d). Although any residual birthplace hits may reflect inadequate adjustment for stratification, the few association signals identified using ACs show a modest enrichment in SNPs showing previous GWAS evidence (Supplementary Table 4; odds ratio (OR) = 5.9, P = 0.039 compared with 12% PC-corrected hits with previous GWAS evidence). Specifically, the strongest remaining signal (P < 10−15) after using ACs surrounds rs5743618 on chromosome 4, linked to toll-like-receptor genes involved in innate immunity40,41, including TLR1 and TLR10, one of the strongest hay fever hits in European GWAS cohorts42,43, with weaker asthma risk association. Because the protective allele against hay fever is more common in southern England where hay fever is most prevalent (Extended Data Fig. 5), such geographic differentiation44 might reflect selective migration of people carrying this variant and/or past natural selection.
Extended Data Fig. 5. Heatmap of regional mean allele frequency of the SNP rs5743618.
Allele frequency is calculated based on genotype and birthplace among UK-born and Ireland-born UKB participants.
We also performed a GWAS for ‘employment score England’, a trait defined based on the region in which a person lives. GWAS associations with such traits have been observed6, but such associations can be confounded by regional stratification. Indeed (Fig. 3d), ACs and PCs shared a number of hits, but some signals were only observed in the PC-based analysis, and the AC-based analysis showed a stronger enrichment of previous GWAS signals (Supplementary Table 4; OR = 11, P = 0.0013) compared with PCs. In total, 18% of PC-only associations overlapped with previous GWAS signals, similar to the null trait of latitude (OR = 1.6, P = 0.44), compared with 71% for AC-based associations (OR = 18, P = 1.8 × 10−10 versus latitude). Moreover, many AC-based signals were significant in previous GWAS for educational attainment or socioeconomic status (Fig. 3d and Supplementary Table 3c). These trait classes were further enriched relative to PC-only hits (OR = 16, P = 0.005), suggesting PC-only hits are likely false positives (Fig. 3d). These findings suggest that previous GWAS for ‘regionally defined’ traits can produce false positives, and ACs help correct for this.
Among the other 99 nonregionally defined traits, PC-based and AC-based correction provided similar P values and LDSC intercepts, with subtle improvements for AC (Extended Data Fig. 4a,b). However, differences included five hits unique to the AC-corrected analysis (Methods, Fig. 3e and Supplementary Table 3e), some in strong LD with known GWAS hits for the same traits43,45–47, and thus unlikely to be false positives. For traits including waist circumference (Fig. 3e and Supplementary Table 3e), AC-specific hits occurred at SNPs in regions of strong LD with high regional SNP loadings for particular PCs (Supplementary Table 5a–e and Extended Data Fig. 6). Therefore, applying ACs likely removed false negatives caused by PC overcorrection from PCs strongly correlated with genotypes in particular genomic regions.
Extended Data Fig. 6. SNP loadings of the first 20 PCs along Chromosome 15.
For PC20, high loadings occur in the region 80–90 megabases.
Causal effects are similar across ancestries
To study PGS portability across UKB groups, we created PGS using 343,047 white British individuals for 53 heritable (estimated heritability >5%) UKB-measured quantitative traits, correcting for ancestry using either ACs or PCs, and tested their performance in independent samples representing different ancestries; that is, with more than 50% inferred ancestry (by ACs) from seven respective labeled regions: South Central England, Northumberland, Republic of Ireland, Poland, India, China and West Africa. AC-based versus PC-based correction yielded different group-specific PGS means, but strong within-group correlations (92.7% to 99.9%) (Supplementary Fig. 4 and Supplementary Table 6) so we focus on AC-corrected PGS for the remaining analyses. By regressing the PGS against actual traits in each group, we quantified the increase in trait per unit PGS increase (that is, the regression slope) and denoted it as β, and denoted the variance explained by the PGS as ΔR2, after regressing out covariates (Methods). Both ΔR2 and β decreased with increasing genetic distance from the British ancestry groups (Supplementary Figs. 5 and 6), particularly for sub-Saharan African ancestry, with a >2.2-fold reduction in ΔR2. Fine-scale ancestry showed this occurs even between BI regional ancestries for some traits: for standing height, forced expiratory volume in 1 s (FEV1) and apolipoprotein B, ΔR2 for Northumberland differs from that for Ireland (P < 0.02, P < 0.005 and P < 0.02, respectively).
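The two portability measures can be illustrated with a small two-stage regression: first regress the covariates out of the trait, then regress the residual trait on the PGS to obtain the slope β and the variance explained. The sketch below uses a single covariate and invented values; the function names, and the choice to define ΔR2 relative to the covariate-adjusted trait variance, are assumptions rather than the exact procedure in the Methods.

```python
# Sketch only: one covariate and toy data; the real analysis adjusts for many covariates.

def simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y ~ x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def pgs_slope_and_delta_r2(trait, pgs, covariate):
    """Return (beta, delta_r2) for a PGS after regressing one covariate out of the trait."""
    # Stage 1: remove the covariate's contribution from the trait.
    a0, b0 = simple_ols(covariate, trait)
    resid = [t - (a0 + b0 * c) for t, c in zip(trait, covariate)]
    # Stage 2: regress the residual trait on the PGS.
    a1, beta = simple_ols(pgs, resid)
    mean_r = sum(resid) / len(resid)
    ss_total = sum((r - mean_r) ** 2 for r in resid)
    ss_resid = sum((r - (a1 + beta * p)) ** 2 for r, p in zip(resid, pgs))
    return beta, 1.0 - ss_resid / ss_total


# Invented example: trait = 2 * covariate + 0.5 * PGS + small noise.
covariate = [-1.0, 1.0, -1.0, 1.0]
pgs = [-1.0, -1.0, 1.0, 1.0]
trait = [-2.4, 1.4, -1.6, 2.6]
beta, delta_r2 = pgs_slope_and_delta_r2(trait, pgs, covariate)
print(round(beta, 3), round(delta_r2, 3))
```

A drop in β between groups means each unit of PGS predicts a smaller trait change, while a drop in ΔR2 also reflects differences in PGS and trait variance between groups.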
To partition the drop-off in PGS performance across ancestries into local effects (causal variant tagging) and nonlocal effects (ancestry-specific gene–gene or gene–environment interactions), we developed ANCHOR, which leverages variation in local ancestry along the genome and between admixed individuals (Fig. 4a). ANCHOR takes PGS coefficients from an independent sample as input, and analyzes quantitative phenotype and genotype data for a group of admixed individuals. It produces an estimate of ρ, the mean trait increase in admixed individuals per unit increase in a perfect PGS for nonadmixed (for example, European) individuals, constructed using their (unknown) true effect sizes. ρ equals the correlation in true effect sizes between populations under reasonable assumptions (Supplementary Note 2). For our analysis, we generated PGS coefficients from 343,047 UKB white British samples, and analyzed 8,003 ‘African ancestry’ individuals with varying (mean 83.6%) inferred sub-Saharan African ancestry and BI + Europe (mean 11.5%) ancestry.
Fig. 4. Separation of local and nonlocal factors influencing portability.
a, Test principles: in UKB samples with European (blue) and African (red) ancestries, a causal variant contributing to a trait is captured by a tag SNP whose predictive power (pink arrow thickness) varies by ‘local’ ancestry (upper versus lower chromosomes), or nonlocal factors captured by genome-wide ‘global’ ancestry (left versus right individuals); ANCHOR separates these contributions to PGS portability. b–d, β values refer to the mean increase in phenotype per PGS unit increase for local ancestry j and global ancestry i (see Methods for further details). b, ANCHOR performance for 24 simulated traits and 53 UKB quantitative traits with PGSs constructed using different P-value thresholds (P = 0.05 and P = 0.0001; right). True effect size correlations ρ (x axis) between African and European ancestries are compared with the ANCHOR estimator (y axis). Colors denote African ancestry bins, as defined in c. c, Application of ANCHOR for 53 UKB traits across varying African ancestry binned as shown (x axis; colored regions). For each bin, mean estimates across traits of the ratios βEu/βObs.Eu (blue) and βAf/βObs.Eu (red) are shown. Also shown are ratio estimates for individuals of ~100% European (leftmost point at y = 1) or ~100% African (red horizontal bar) ancestry. CIs crossing y = 1 are consistent with identical effects to ~100% European-ancestry individuals, and similarly for red points or bar. d, Mean increase in standing height per PGS unit increase across populations (seven left-hand columns); alongside corresponding ANCHOR estimates for height (final six columns) labeled by global or local ancestry combinations. Data are presented as (weighted) means (b,c) or as estimated values (d) with 95% central bootstrapped CIs. Error bars indicate 95% bootstrapped CIs. Af, African; Eu, European.
ANCHOR estimates local ancestry along the phased genome, for example, using HAPMIX48 here, masking uncertain and short segments to ensure ancestry segments extend further than LD (Methods and Supplementary Note 2). It calculates separate PGSs for African-like and European-like segments, yielding ‘African PGS’ (APGS) and ‘European PGS’ (EPGS) scores for each individual, each ‘mean-centered’ by ancestry-dependent mean SNP frequencies, a critical step both in theory (Supplementary Note 2) and in simulations. The phenotype is regressed against APGS and EPGS, including other covariates, to estimate coefficients βAf and βEu, which quantify ancestry-specific PGS predictive power as the respective average trait increases in admixed individuals (Supplementary Note 2) per unit APGS or EPGS increase. Finally, βEu is compared with the corresponding βObs.Eu estimate from BI individuals to estimate ρ, as βEu/βObs.Eu (Supplementary Note 2). Importantly, the validity of the ρ estimation requires no assumptions regarding underlying effect size distribution, causal mutation frequencies, LD patterns or selection. Under a local effect only (‘null’) model, causal effect sizes are identical in all European-ancestry regions and ρ = 1. If nonlocal ancestry-specific (gene–gene or gene–environment) interactions occur, ρ ≠ 1, and if the variance of underlying genetic effect sizes is the same in each group, then ρ < 1 (Supplementary Note 2). Therefore, testing for ρ ≠ 1 provides a test for differing effect sizes across groups. Assuming effect sizes scale linearly with genome-wide ancestry, ANCHOR can predict ρ for 100% European or African ancestry without sampling (Methods and Supplementary Note 2). Pooling ρ values among traits helps reduce overall uncertainty, and confidence intervals (CIs) for all coefficients are obtained by bootstrapping (Methods).
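The steps described above can be sketched in a few lines: build mean-centered ancestry-specific scores from phased haplotypes and local ancestry calls, regress the trait jointly on APGS and EPGS, and form the ratio βEu/βObs.Eu. This is a simplified illustration rather than the ANCHOR implementation: covariates are omitted, centering is assumed to subtract the ancestry-specific allele frequency from each haplotype allele, and all function names and numbers are hypothetical.

```python
# Sketch only: no covariates, toy data, assumed form of mean-centering.

def ancestry_specific_pgs(haplotypes, local_ancestry, weights, freqs):
    """Return {'Af': APGS, 'Eu': EPGS} for one individual.

    haplotypes[h][i]     : allele (0 or 1) on haplotype h at SNP i
    local_ancestry[h][i] : 'Af', 'Eu', or None where the segment is masked
    weights[i]           : PGS coefficient for SNP i from the external GWAS
    freqs[label][i]      : frequency of the scored allele on ancestry `label`
    """
    scores = {"Af": 0.0, "Eu": 0.0}
    for alleles, labels in zip(haplotypes, local_ancestry):
        for i, (allele, label) in enumerate(zip(alleles, labels)):
            if label is None:
                continue  # masked (short or uncertain) segment contributes nothing
            # Mean-centering by the frequency on the matching ancestry.
            scores[label] += weights[i] * (allele - freqs[label][i])
    return scores


def fit_two_predictors(y, x1, x2):
    """Least-squares slopes (b1, b2) of y ~ intercept + b1 * x1 + b2 * x2."""
    n = len(y)
    my, m1, m2 = sum(y) / n, sum(x1) / n, sum(x2) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 * s12
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det


def anchor_rho(trait, apgs, epgs, beta_obs_eu):
    """Estimate beta_Af, beta_Eu and rho = beta_Eu / beta_Obs.Eu."""
    beta_af, beta_eu = fit_two_predictors(trait, apgs, epgs)
    return {"beta_af": beta_af, "beta_eu": beta_eu, "rho": beta_eu / beta_obs_eu}


# One toy individual with two SNPs; the second haplotype is masked at SNP 1.
scores = ancestry_specific_pgs(
    haplotypes=[[1, 0], [1, 1]],
    local_ancestry=[["Af", "Eu"], ["Eu", None]],
    weights=[1.0, 2.0],
    freqs={"Af": [0.5, 0.25], "Eu": [0.1, 0.5]},
)
print(scores)

# Toy cohort in which European segments retain 90% of the reference slope.
apgs = [0.0, 1.0, 2.0, 3.0, 4.0]
epgs = [1.0, -1.0, 0.5, 2.0, -0.5]
trait = [3.0 + 0.4 * a + 0.9 * e for a, e in zip(apgs, epgs)]
print(anchor_rho(trait, apgs, epgs, beta_obs_eu=1.0))
```

Fitting APGS and EPGS jointly lets the European-segment slope be read off separately from the African-segment slope, which is expected to be lower because of local tagging differences alone.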
We verified HAPMIX’s48 ability to accurately infer ancestry and construct ancestry-specific PGS by comparing it with trios with known phase (Extended Data Fig. 7 and Supplementary Fig. 7). We then tested ANCHOR by simulating 24 quantitative traits with various settings of heritabilities, clustering of causal mutations and causal marker frequency spectra, using genetic data from the 8,003 African-ancestry individuals (Methods and Supplementary Note 1). We performed GWAS and downstream analyses exactly as for the real phenotypes. In nonadmixed populations, mean-centering of genotypes has no impact on the PGS predictive power, but is crucial for ANCHOR’s validity in admixed populations (Supplementary Note 2). Under the null (ρ = 1), across simulations ρ is correctly estimated from mean-centered EPGS and APGS, but estimates are strongly downward biased without mean-centering (Extended Data Fig. 8 and Supplementary Figs. 8 and 9), even when masking short or uncertain segments (Supplementary Figs. 8a and 9). Therefore, we use mean-centering for all ANCHOR analyses. We also observe significantly reduced ρ estimates for real quantitative traits without mean-centering, fully consistent with the simulation findings (Supplementary Figs. 10–12).
Extended Data Fig. 7. Comparison between PGS in trios calculated using either unphased genotypes (x-axis) or trio-phased haplotypes (y-axis) for 6 individuals from 2 trios across 53 traits.
For details of the normalization performed to enable joint comparison of these PGS, see ‘Using trios to conduct ancestry aware phasing for PGSs’. In this plot, colours represent one of the 53 UKB phenotypes and numbers represent trio ids as shown in the bottom right panel. 1: child in the first trio; 2: father in the first trio; 3: mother in the first trio; 4: child in the second trio; 5: father in the second trio; 6: mother in the second trio. Top left panel: comparison of unphased and phased EPGSs, calculated using only inferred European segments by using the default p-value 0.05 in EPGS construction; Top right panel: comparison of unphased and phased APGSs, calculated using only inferred African segments by using the default p-value 0.05 in APGS construction; Bottom left panel: comparison of unphased and phased EPGSs, calculated using only inferred European segments by using the alternative p-value 0.0001 in EPGS construction; Bottom right panel: comparison of unphased and phased APGSs, calculated using only inferred African segments by using the alternative p-value 0.0001 in APGS construction. Note that across traits, individuals and different p-value thresholds, the PGS are very similar.
Extended Data Fig. 8. Comparison of combined estimates of ρ for 24 simulated traits under the null model, from PGSs constructed with or without mean-centring and with or without masking of uncertain and short segments.
True ρ = 1 under the null model. Data are presented as (weighted) means with 95% central bootstrapped CIs. Dots separated by the vertical line indicate the different ratio estimates shown in the left and right panels. The horizontal line represents the true value of ρ. If the confidence interval for a ratio estimate intersects the ρ = 1 line, the estimate is accurate. The colour of the dots indicates whether mean-centring is applied, and their shape indicates whether uncertain and short segments are masked.
If ρ = 1, 95% bootstrapped CIs for 96% of individual traits contained the true value (ρ = 1), with robustness to different GWAS P-value thresholds (0.05 versus 0.0001) (Supplementary Fig. 13). Averaging ρ estimated by ANCHOR across traits for groups of individuals with similar ancestry levels also yielded ρ = 1 (Extended Data Fig. 9), indicating good performance under the null across both traits and ancestry. In simulations in which ρ declines, ANCHOR estimates of ρ remain well calibrated (Fig. 4b, Extended Data Fig. 9 and Supplementary Fig. 14). In all cases we observe βAf/βObs.Eu < 1 as expected because of local effects (Supplementary Note 2), so African-ancestry segments are less predictive of traits, but βAf values varied similarly to βEu (Extended Data Fig. 9 and Supplementary Fig. 14).
Extended Data Fig. 9. Results of application of ANCHOR for 24 UKB simulated traits in bins including individuals with varying African ancestry.
African ancestry is binned as in main Fig. 4c (x-axis; coloured region). Data are presented as (weighted) means with 95% central bootstrapped CIs. For each bin in each panel, estimates of coefficients βEu/βObs.Eu (blue) and βAf/βObs.Eu (red) are shown with 95% bootstrapped confidence intervals, representing the ratio of PGS within European and African genomic regions to the PGS obtained from external European samples respectively (Methods). Also shown are these estimates from individuals of ~100% European (βObs.Eu; blue horizontal bar) or ~100% African (red horizontal bar) ancestry. From top to bottom, each panel represents descending correlation ρ. The case ρ=1, where only local effects impact portability, corresponds to blue points lying along the blue line, and similarly for red points, as observed. In cases when ρ<1, the dotted horizontal line is plotted at ρ, and the blue points are predicted to lie along this line if βEu/ βObs.Eu provides an accurate estimate of ρ, as predicted by theory (Methods; Supplementary Note 2).
We applied ANCHOR to 53 UKB quantitative phenotypes and found strikingly constant average ρ = βEu/βObs.Eu estimates across genome-wide ancestry bins (Fig. 4c and Methods), overlapping the null value ρ = 1 for all bins. This indicates that European-ancestry segments retain predictive power similar to that in European-ancestry individuals. Thus, a ‘true’ PGS from Europeans predicts similar trait increase, on average, per unit PGS increase, in African-ancestry individuals. This implies either conserved effect sizes between groups at shared causal SNPs, across the broad range of human molecular and quantitative phenotypes we examined or—less parsimoniously—systematically larger average effects across all these phenotypes in African-ancestry individuals, in such a manner as to coincidentally balance the impact of incomplete correlation.
For individual traits, as in the simulations, we combined all individuals to estimate a trait-specific ρ, and extrapolated to estimate effect sizes for European segments in individuals with almost 100% European ancestry or African ancestry genome-wide (Figs. 4d and 5). For standing height (Fig. 4d), we show βAf and βEu estimates alongside βObs.Eu, and β estimates for various AC-defined UKB cohorts. β estimates decline with increasing genetic distance because of LD changes, reducing PGS predictive power. However, ANCHOR shows that European-ancestry segments in African-ancestry genomes have strong predictive power, similar to wholly European-ancestry individuals (blue bars), and does not identify any significant effect size changes across the range of genome-wide ancestry. African-ancestry segments (red) show much lower predictive power, which attributes nonportability to local LD and allele frequency differences rather than to gene–gene or gene–environment interactions.
Fig. 5. ANCHOR results for 53 UKB traits.
Data are presented as estimated values of the ratio of true effect sizes with 95% central bootstrapped CIs. β values refer to the mean increase in phenotype per unit PGS increase for local ancestry j and global ancestry i (see Methods for further details). Colors of β estimates: blue, European; purple, projected to 100% African ancestry; red, African ancestry. Black rows represent individual UKB traits; the first standing height row uses an existing PGS16; the dark green rows show combined estimates. Columns (left to right) estimate ρ for ‘all’ 8,003 African-ancestry individuals, ρ for individuals of 100% projected African ancestry and (as expected, reduced) predictive power for African-ancestry segments. From top to bottom, the rows above and below the first horizontal dashed line represent non-molecular and molecular traits, and rows above and below the second dashed line represent individual traits and their weighted average estimates. Vertical dotted lines: grey lines indicate ρ = 0.5 (left of the red dotted lines) and ρ = 1.5 (right of the red dotted lines); red lines indicate ρ = 1. ALP, alkaline phosphatase; FVC, forced vital capacity; HDL, high-density lipoprotein; IGF1, insulin-like growth factor-1; LDL, low-density lipoprotein.
Across all 53 quantitative and molecular phenotypes (Fig. 5), ρ estimates (first column) were almost indistinguishable from 1, even for individuals with the highest African ancestry (second column), indicating embedded European segments have similar predictive power to nonadmixed Europeans, but with strongly reduced overall PGS power (third column). The joint bootstrap yielded an overall ρ = 0.98 ± 0.07 for these traits, extremely close to 1. Results were consistent using varying P-value thresholds (P value of 0.05 and 0.0001) from the initial GWAS (Supplementary Fig. 15), and correction methods (ACs versus PCs) in the initial GWAS (Supplementary Fig. 16). Results for standing height show little change using a previously published, alternative, PGS16 (Fig. 5). Two lung-function traits: FEV1 and forced vital capacity, and the correlated trait49,50 of sitting height, showed ρ < 1 at nominal significance (P < 0.05) (Fig. 5, Supplementary Fig. 15), with white blood cell count, red blood cell count and albumin showing ρ > 1 (P < 0.05). Only the white blood cell count remained significant (P < 0.05) with a GWAS P-value threshold of 0.0001, and no traits were significant after Bonferroni correction. In future, larger sample sizes may better detect heterogeneity of effect sizes for individual traits.
Discussion
We introduce new approaches to understand human ancestry and its connections to GWAS and PGS prediction. Decomposing UKB individuals’ ancestry into ACs at a subnational level improves confounding correction by capturing information missing from PCs, reducing likely false positives and likely false negatives. ACs offer better control for geographic effects than current methods, within the UKB at least. Many traits show similar P values using ACs versus other popular approaches, but ACs uniquely avoid likely false positives in ‘regional traits’ defined on groups of individuals6. Together with observations of likely false negatives in GWAS using PCs, and the potentially larger sample sizes of future studies, incorporating ACs into GWAS may improve stratification correction by reducing overcorrection linked to local genomic regions, and avoids the need to decide how many PCs to include in GWAS. The complex patterns of ancestral mixing within UK regions or between, for example, cities and rural areas, suggest particular care is needed for traits showing similar regional variation such as educational attainment24. We note that the achievable granularity and interpretation of ACs depend on the size and diversity of the ancestry reference panel used, as well as the related number of identifiable predefined ‘ancestry groups’. A limitation of our approach is, therefore, that current ACs are somewhat United-Kingdom-specific, but region-specific ACs for non-UK GWAS could offer similar benefit. Until then, ACs can complement PCs, though even the best ACs may not completely correct for subtle population stratification.
To compare causal effect sizes across human populations, we consider the mean effect of a causal mutation on a trait12,51 and compare this across groups. We note that the common and often convenient scaling of genotypes by their population standard deviations also scales effect sizes differently across populations, so to avoid downward biases in estimating causal effect correlations, we avoid such scaling14, and instead define identical impacts as a mutation causing the same average increase in phenotype in each group. This biologically natural approach provides an interpretable scale for practical applications.
ANCHOR leverages local ancestry inference to estimate the ratio of underlying effect sizes between an admixed group and a reference group similar to that used for the initial GWAS and PGS construction, by assessing the predictive power of PGS among groups in terms of their relative ‘mean trait increase per unit PGS increase’. This ratio is 1 if effect sizes are identical, and estimates the correlation between true effect sizes in the two groups (Supplementary Note 2). Overall our results indicate clearly that—within the UK at least—using effect sizes from 100% European individuals yields near-identical performance in individuals of African ancestry for various quantitative phenotypes. This suggests PGS utility across populations, at least for causal mutations not private to one group. Although an underlying correlation between effect sizes ρ < 1 is possible (Supplementary Note 2), if coincidentally counterbalanced by larger average effect sizes in African-ancestry individuals, this seems unlikely across diverse quantitative traits a priori. If instead ρ ≈ 1, most causal effect sizes are very similar between individuals of African ancestry and European ancestry, across the range of quantitative phenotypes we studied. Of interest for further study, a few traits such as FEV1 do suggest evidence of differences.
Gene–gene and gene–environment interactions likely, therefore, do not drive the lack of PGS portability in the UKB. Population structure is also an unlikely cause, given consistent effect size estimation between PGSs constructed by AC-corrected and PC-corrected GWAS. Instead, local LD and SNP frequency differences appear to be the main factors. African-ancestry segments show significantly reduced predictive power (about a threefold reduction) compared to European-ancestry segments (Fig. 5). Our (ρ = 1) simulations confirm that PGS coefficients match real data for European segments, but show only a modest reduction (less than twofold for traits with >100 causal markers) for African-ancestry segments. This suggests reduced tagging of causal variants by the PGS in African-ancestry regions, potentially because selection against trait-impacting variants52,53 causes larger frequency differences between populations for true causal variants than for the randomly selected variants in our simulations, producing the observed strong drop-off between groups. Stronger stratification correction by ACs could help investigate such selection in future.
Our results differ from previous findings51,54,55 of often considerably lower correlations between European-ancestry and African-ancestry samples using methods that model genetic variance components and local LD differences. For example, one study51 finds different effects for body mass index in UKB, another55 finds different effects for height, and an estimated correlation of only ~50% across 26 traits in UKB. Because ANCHOR focuses on PGS prediction with minimal assumptions, it is challenging to attribute our near-parity estimates to confounding. These differences might result from how genotypes are scaled and centered, as shown by the importance of appropriate centering in ANCHOR (Extended Data Fig. 8 and Supplementary Figs. 8 and 9). Alternatively, if causal variants have unique evolutionary histories because of natural selection, this might impact some methods more than others. Further study is needed to understand these discrepancies.
Our results do not imply that gene–environment, or even gene–gene, interactions are absent across the traits we studied. Such interactions likely cause variation in effect size across individuals with African ancestry but must largely be shared with other ancestries to avoid differences in overall (mean) effect sizes across populations. Previous work9 has shown effect size variation within UKB individuals of British ancestry, based on age, gender and socioeconomic status. We also observe differences between males and females in mean effect sizes (Extended Data Fig. 10), and subtle differences among UK groups stratified by ACs (Supplementary Fig. 6), likely reflecting gene–environment interactions. However, the strong lack of portability in African-ancestry individuals seems not to be driven by these interactions, apart from specific traits like FEV1. This result encourages joint fine-mapping56 efforts and suggests that improving causal variant tagging is key to applying genetic findings across groups, simplifying the process by avoiding the need to re-estimate effect sizes.
Extended Data Fig. 10. Investigation of impact of gender and UK vs non-UK birthplace on the relationship between PGS performance and ancestry, for standing height.
Data are presented as estimated values with 95% central bootstrapped CIs. For each plot, details are as described in the legend for Fig. 4d unless stated otherwise. Left column: analysis results for only male African-ancestry individuals; right column, female individuals. Top row: as Fig. 4d. Middle row: the last four columns in each plot show the estimated increase in phenotype per unit increase in the EPGS (blue) or APGS (red) polygenic score, separately for UK-born and non-UK-born individuals (note there is no significant difference in either case). Bottom row: as for the first row, but now applying the ANCHOR model to UK-born or non-UK-born individuals of the specified gender and varying ancestry (see section ‘Joint model fitting by accounting for other environment factors’). Once again none of the bootstrapped CIs indicate significant impacts of birthplace on results; and neither gender shows significant evidence of different βEu/βObs.Eu ratios (distinct effect sizes) for African-ancestry (right-hand blue bars) vs UK-ancestry (left-hand two blue bars) individuals, indicating results are consistent with ρ=1, that is equal effect sizes in either gender and regardless of birthplace and ancestry. (There are obvious differences comparing males and females, corresponding to their different mean heights.)
The UKB resource contains diverse ancestries, but collects homogenous data for individuals within a single country, minimizing trait definition differences and environmental effects to better isolate underlying biological impacts. Our results likely rule out differential gene–gene interactions as a major driver of nonportability, even in other settings, because these would still operate strongly within the United Kingdom. However, effect size differences might be stronger between countries with greater environmental differences or differing trait definitions. In future, ANCHOR might be applied to groups outside the United Kingdom, for example, African Americans14, to analyze various traits, or extended to analyze binary disease traits. As GWAS sample sizes in admixed and other populations grow, methods including ANCHOR will likely uncover variable effect sizes across countries or cohorts, whereas other approaches10,52,54 enable comparisons of groups of similar ancestry.
Methods
Our research complies with all relevant ethical regulations. Collection of the UKB data was approved by the Research Ethics Committee of the UKB and this research has been conducted using the UK Biobank Resource under application number 27960.
Statistics and reproducibility
Sample size is clearly disclosed in the paper for each different dataset or sub-dataset. The sample size of ancestry-specific analysis in the section ‘Polygenic score calculations’ was determined by the corresponding genetic ancestry coefficients inferred using our method. Trio individuals with African-background ancestry were detected using the open source software king (v.2.2). No data were excluded unless the participants withdrew their participation from the UKB. No randomization was used; because there are no experimental groups it is not relevant to our study. No blinding is used because it is not relevant to this study; however, group allocation is objective using inferred genetic ancestry.
Ancestry pipeline
To construct an ancestry ‘painting’ reference panel, we merged data from Supplementary Table 7, resulting in 9,129 samples and 2,011,414 distinct mutations. After relatedness pruning in plink1.9 (ref. 57) using the IBS/IBD computation with the ‘--genome’ option and setting a PI_HAT threshold of 0.25 to exclude related samples exceeding this value, we retained 7,775 individuals. Phasing (using SHAPEIT2)32 and imputation (using IMPUTE4)25 were performed using the UK10K58 + 1000 Genomes data as a reference panel, retaining 851,948 sites with a mean IMPUTE4 information score above 0.9, across all the different genotyping platforms simultaneously, as well as imputed sites, in the ‘draft painting panel’.
The remainder of the panel construction process, detailed in Supplementary Note 1, involved a nested sample hierarchy, chromosome painting using ChromoPainterv2 (ref. 33), quality control for relatedness, ‘surrogate’ and ‘donor’ group formation based on sample labels and admixture estimation by NNLS22, and additional SNP quality control, resulting in 677,173 SNPs in the ‘final painting panel’. We also painted UKB individuals who were born in world regions that we believed our reference individuals sampled from poorly, after excluding samples with, for example, mainly BI ancestry, and used their profile as a surrogate group reference vector.
Annotating population labels into genetic groupings is an art, not a science. Our choices led to 206 surrogate groups that were genetically distinct, which we sum into the 127 interpretable ancestry labels reported in the ‘Results’ section. For replicability we include the SNP lists, and the final donor and surrogate group annotations (Supplementary Note 1).
Painting panel processing
To obtain calibrated local ancestry estimates, individuals in the panel must be exchangeable with those we wish to compare. We first construct a SNP and sample list using the process above, excluding UK10K or 1000 Genomes samples; second, phase each sample independently against the UK10K + 1000 Genomes dataset using SHAPEIT2; third, impute each sample independently against the UK10K + 1000 Genomes dataset using IMPUTE4; and fourth, infer best-fit parameters Ne (which controls recombination rate) and m (which controls mutation rate) using ChromoPainterv2 in expectation-maximization mode for 10% of samples randomly chosen for each chromosome, then average these to obtain the final parameters. We then process the panel in three further steps. First, we paint each sample with ChromoPainterv2 using the inferred parameters, following a ‘leave-one-out’ procedure in which reference groups are created with one fewer sample. Second, giving each sample a ‘donor-vector’ indicating the amount of genome shared (in centimorgans) with each donor group, we create a surrogate panel by averaging the donor-vectors within each surrogate group. Third, we estimate admixture by treating each sample’s donor-vector as a mixture of the surrogate vectors in the surrogate panel using NNLS as described in ref. 22, merging or removing any groups not >50% recovered.
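The final admixture step can be illustrated with a minimal sketch. It assumes each surrogate group is summarized by an averaged donor-vector and that a target's donor-vector is modelled as a non-negative mixture of these. A simple projected-gradient solver stands in here for a library NNLS routine, and all names and toy values are hypothetical rather than taken from the pipeline.

```python
def nnls_projected_gradient(surrogates, target, n_iter=2000):
    """Minimise ||sum_k w_k * surrogates[k] - target||^2 subject to w_k >= 0.

    Illustrative solver only: projected gradient descent, not the
    NNLS implementation used in the study.
    """
    n_groups = len(surrogates)
    n_donors = len(target)
    # 1 / ||A||_F^2 is a conservative step size (Frobenius norm bounds the
    # largest eigenvalue of A^T A from above).
    step = 1.0 / sum(v * v for row in surrogates for v in row)
    w = [1.0 / n_groups] * n_groups
    for _ in range(n_iter):
        fitted = [sum(w[k] * surrogates[k][d] for k in range(n_groups))
                  for d in range(n_donors)]
        resid = [fitted[d] - target[d] for d in range(n_donors)]
        grad = [sum(surrogates[k][d] * resid[d] for d in range(n_donors))
                for k in range(n_groups)]
        # Gradient step followed by projection onto the non-negative orthant.
        w = [max(0.0, w[k] - step * grad[k]) for k in range(n_groups)]
    return w


def admixture_proportions(surrogates, target):
    """Rescale the non-negative weights so that they sum to 1."""
    w = nnls_projected_gradient(surrogates, target)
    total = sum(w)
    return [x / total for x in w]


# Hypothetical surrogate panel: three groups, three donor groups.
surrogate_panel = [[0.8, 0.1, 0.1],
                   [0.1, 0.8, 0.1],
                   [0.1, 0.1, 0.8]]
# Toy target built as a 70%/30% mixture of the first two surrogate vectors.
target_vector = [0.7 * a + 0.3 * b
                 for a, b in zip(surrogate_panel[0], surrogate_panel[1])]
proportions = admixture_proportions(surrogate_panel, target_vector)
```

In this toy case the recovered proportions approach the mixture used to build the target, with the third group receiving essentially zero weight.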
UK Biobank data included in the analysis
The UKB study includes more than 500,000 UK residents. We analyzed genotypic and phenotypic data under application 27960. In total, 487,409 UKB participants with available autosomal haplotype and genotype imputation data (field IDs 22438 and 22828) were used in our ancestry inference. Quality control, phasing and imputation for the UKB genetic data have been described previously (http://biobank.ctsu.ox.ac.uk/). Demographic and phenotypic data are listed in Supplementary Table 8. For PGS analysis, we selected 53 quantitative traits on which to run GWAS using the following criteria: trait measured on at least 400,000 individuals; LDSC-estimated trait heritability at least 5%; and the trait must be noncategorical.
Running the ancestry pipeline on UK Biobank
The ancestry pipeline accepts genotype data in various formats as input (Supplementary Note 1). For UKB data, we used available phased haplotypes. We performed initial imputation with IMPUTE4 in batches of 1,000 UKB individuals, then ran the remaining jobs individually: first, remove close relatives and one random individual per donor group from the donor panel; second, paint the target using parameters Ne and m estimated from the panel to obtain a donor-vector; and third, infer global ancestry using NNLS.
Assigning UK counties to predefined UK regions
Within the United Kingdom and Ireland, 23 distinct groupings were identified. Each UK county was assigned to one group, to refine geographic boundaries. We downloaded county-level UK map data (https://gadm.org), mapped the birthplaces of 426,879 UK-born UKB individuals to a county using the R package ‘sp’, and retained individuals with >50% ancestry from one group. We then assigned each county to the group with the most remaining individuals born in the county. ‘Irish’ ancestry was assigned to the Republic of Ireland. Within Cornwall, because both ‘Cornwall’ and ‘Cornwall Tip’ localized to this county, we used the R package ‘raster’ to define ancestry on finer-scale pixels, and assigned locations whose mean ‘Cornwall Tip’ ancestry was at least 0.2 greater than their mean ‘Cornwall’ ancestry to the ‘Cornwall Tip’ group.
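The county assignment rule can be sketched as a simple majority vote. The data layout (a county label plus a dictionary of ancestry fractions per individual) and the group names are hypothetical, and the Cornwall-specific raster step is not covered.

```python
from collections import Counter, defaultdict


def assign_counties(individuals, min_ancestry=0.5):
    """Assign each county to the group with the most qualifying individuals.

    individuals: iterable of (county, {group_name: ancestry_fraction}).
    Only individuals with more than `min_ancestry` from a single group vote.
    """
    votes = defaultdict(Counter)
    for county, ancestry in individuals:
        top_group, top_value = max(ancestry.items(), key=lambda kv: kv[1])
        if top_value > min_ancestry:
            votes[county][top_group] += 1
    # most_common(1) returns the group with the highest count in each county.
    return {county: c.most_common(1)[0][0] for county, c in votes.items()}


# Toy input with hypothetical county and group labels.
toy_individuals = [
    ("county_1", {"group_A": 0.7, "group_B": 0.3}),
    ("county_1", {"group_A": 0.6, "group_B": 0.4}),
    ("county_1", {"group_B": 0.9, "group_A": 0.1}),
    ("county_1", {"group_A": 0.4, "group_B": 0.4, "group_C": 0.2}),  # no vote
    ("county_2", {"group_B": 0.8, "group_A": 0.2}),
]
county_to_group = assign_counties(toy_individuals)
```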
Visualization of ancestry based on UKB birthplace
We used Gaussian kernel smoothing to generate spatial smoothed plots showing average ancestry. For each pixel (p) in the rasterized UK map with birthplace coordinates p = (xp,yp), we calculated the mean of a quantity of interest O (for example, AC) for each of the N = 426,879 UK-born or Irish-born UKB samples, smoothed by the Gaussian kernel:
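$$\bar{O}_p = \frac{\sum_{i=1}^{N} O_i \exp\left(-\frac{(x_i - x_p)^2 + (y_i - y_p)^2}{2q^2}\right)}{\sum_{i=1}^{N} \exp\left(-\frac{(x_i - x_p)^2 + (y_i - y_p)^2}{2q^2}\right)} \tag{1}$$
where Oi is the ancestral object and (xi,yi) is the birthplace coordinate for individual i. $\bar{O}_p$ is the mean of O at pixel p = (xp,yp). We used adaptive bandwidth smoothing for q (Supplementary Note 1).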
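A minimal sketch of the smoothed mean at a single pixel follows. It assumes an isotropic Gaussian kernel with a fixed bandwidth q, whereas the analysis uses an adaptive bandwidth; the function name and toy coordinates are hypothetical.

```python
import math


def kernel_smoothed_mean(pixel, coords, values, q):
    """Gaussian-kernel weighted mean of `values` at `pixel`.

    pixel:  (x_p, y_p) coordinate of the map pixel.
    coords: list of (x_i, y_i) birthplace coordinates.
    values: list of O_i, the quantity to smooth (for example, one AC).
    q:      kernel bandwidth (fixed here; an assumption of this sketch).
    """
    x_p, y_p = pixel
    weighted_sum = 0.0
    weight_total = 0.0
    for (x_i, y_i), o_i in zip(coords, values):
        sq_dist = (x_i - x_p) ** 2 + (y_i - y_p) ** 2
        weight = math.exp(-sq_dist / (2.0 * q * q))
        weighted_sum += weight * o_i
        weight_total += weight
    return weighted_sum / weight_total


# Two toy individuals: a pixel midway between them averages their values,
# while a pixel on top of one individual is dominated by that individual.
toy_coords = [(0.0, 0.0), (2.0, 0.0)]
toy_values = [0.0, 1.0]
midpoint_mean = kernel_smoothed_mean((1.0, 0.0), toy_coords, toy_values, q=1.0)
near_first_mean = kernel_smoothed_mean((0.0, 0.0), toy_coords, toy_values, q=0.5)
```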
To quantify ancestry mixing across the United Kingdom, we calculated ancestry entropy for each individual:
$$H = -\sum_{j=1}^{n_r} a_j \log a_j \tag{2}$$
where nr = 23 is the number of predefined BI regions, and aj(1 ≤ j ≤ nr) are the ACs for that individual. Adaptive bandwidth Gaussian kernel smoothing was used to plot the UK-wide entropy profile.
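As a small illustration, ancestry entropy can be computed per individual as below. The sketch assumes the Shannon form with natural logarithms and treats zero components as contributing nothing; the logarithm base is an assumption.

```python
import math


def ancestry_entropy(components):
    """Shannon entropy of one individual's ancestry components a_j.

    Components equal to zero are skipped, following the convention
    0 * log(0) = 0.
    """
    return -sum(a * math.log(a) for a in components if a > 0)


n_regions = 23
# An individual with all ancestry from one region has zero entropy;
# an individual with equal ancestry from all regions has the maximum, log(23).
single_region_entropy = ancestry_entropy([1.0] + [0.0] * (n_regions - 1))
uniform_entropy = ancestry_entropy([1.0 / n_regions] * n_regions)
```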
Genome-wide association study
Forty PCs (field ID 22009) were downloaded from the UK Biobank. The first 20 PCs were used in GWAS analysis. We additionally calculated the first 200 UKB PCs using the R packages ‘pcapred.largedata’ and ‘pcapred’ (https://github.com/danjlawson/pcapred.largedata, https://github.com/danjlawson/pcapred), which closely match the UKB PCs (>99% correlation for the first 40 PCs) (Supplementary Fig. 17).
Some 127 ACs from the ancestry pipeline were used to regress each of 140 UKB PCs (chosen to exceed the number of ACs) on all the 487,409 UKB samples and predict each PC. Similarly, 140 UKB PCs were used to predict the 127 ACs on the same UKB samples. R2 between the true and predicted PC or AC was then used to evaluate the prediction. In our GWAS study, we analyzed 104 continuous UKB phenotypes with a sample size >10,000 (Supplementary Table 8). In total, 343,047 unrelated white British individuals25 were included in our GWAS. Association testing was performed for UKB imputed SNPs G (minor allele frequency (MAF) >0.001 and information scores >0.3). Covariates in each association test were ‘genotype measurement batch’, ‘age at recruitment’ and ‘sex’ (field IDs 22000, 21022, and 31), as well as nonlinear terms ‘age2’, ‘age × sex’ and ‘age2 × sex’. The full model using ACs to correct stratification is:
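$$Y \sim G + \text{batch} + \text{age} + \text{sex} + \text{age}^2 + \text{age}\times\text{sex} + \text{age}^2\times\text{sex} + \sum_{k}\text{AC}_k \tag{3}$$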
The model instead using 20 PCs (and similarly when using 100 PCs) is:
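$$Y \sim G + \text{batch} + \text{age} + \text{sex} + \text{age}^2 + \text{age}\times\text{sex} + \text{age}^2\times\text{sex} + \sum_{k=1}^{20}\text{PC}_k \tag{4}$$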
Association testing was performed by ‘BGENIE’ v.1.2 (ref. 25) using unnormalized phenotypes, so that estimated effect sizes remain in units of the original phenotypes.
We also performed GWAS using BOLT-LMM34 with 20 PCs, to test performance of a linear mixed model for the place of birth, north coordinates (field ID 129) phenotype. We performed quality control and LD pruning on the UKB chip genotype SNPs using plink1.9 (ref. 57) (www.cog-genomics.org/plink/1.9/; Supplementary Note 1) to generate 142,182 independent SNPs (r2 < 0.1) for null model building in BOLT-LMM. The participants and imputed SNPs for association testing remained as for the other GWAS. The 20 PCs included in BOLT-LMM as covariates were calculated by FlashPCA2 (ref. 59) from the same genetic relationship matrix constructed by the 142,182 independent SNPs and 343,047 UKB white British individuals. Other covariates used in BOLT-LMM were the same as those used in the linear regression based GWAS.
To identify independent genome-wide significant signals, we used a threshold P < 5 × 10−8 and for each SNP more significant than this threshold for either AC or PC, we used LD pruning with a threshold of r2 < 0.01 and window size of 100 kb to keep only the most significant SNP among groups of variants in LD.
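The pruning rule can be sketched as a greedy pass over hits ordered by significance. The sketch assumes each SNP carries a single P value (for example, the smaller of its AC and PC P values) and that pairwise r2 values are supplied in a lookup table; both are simplifications, and the function is not the plink implementation.

```python
def prune_hits(hits, r2, p_threshold=5e-8, r2_threshold=0.01, window_bp=100_000):
    """Keep the most significant SNP among groups of variants in LD.

    hits: list of (snp_id, chrom, pos, p_value).
    r2:   dict mapping frozenset({id_a, id_b}) to pairwise r^2
          (missing pairs are treated as r^2 = 0; an assumption).
    """
    significant = sorted((h for h in hits if h[3] < p_threshold),
                         key=lambda h: h[3])
    kept = []
    for snp_id, chrom, pos, p_value in significant:
        in_ld_with_kept = any(
            chrom == k_chrom
            and abs(pos - k_pos) <= window_bp
            and r2.get(frozenset((snp_id, k_id)), 0.0) >= r2_threshold
            for k_id, k_chrom, k_pos, _ in kept
        )
        if not in_ld_with_kept:
            kept.append((snp_id, chrom, pos, p_value))
    return kept


# Toy example with hypothetical identifiers.
toy_hits = [
    ("rs1", 1, 1_000, 1e-10),
    ("rs2", 1, 50_000, 1e-9),    # within 100 kb of rs1 and in LD with it
    ("rs3", 1, 500_000, 1e-9),   # outside the window
    ("rs4", 2, 1_000, 1e-3),     # not genome-wide significant
]
toy_r2 = {frozenset(("rs1", "rs2")): 0.5}
independent_ids = [h[0] for h in prune_hits(toy_hits, toy_r2)]
```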
We queried AC-significant or PC-significant SNPs in a GWAS database using the R package ‘mrcieu/ieugwasr’ (https://mrcieu.github.io/ieugwasr/). For most traits we applied a genome-wide threshold of 5 × 10−8 for query output P value, relaxed to 1 × 10−6 if no SNP met the initial threshold, and returning ‘No GWAS hits’ if the relaxed threshold was also not met. For traits shown in Fig. 3c–e, queried results containing the same or directly related traits were excluded from categorization. Query results for input SNPs were ranked by ascending GWAS P value across different traits such that the most significant trait was top-ranked (Supplementary Table 3a–e). SNPs without a standard RS catalogue identifier were categorized as ‘Not queried’. Queried traits were assigned into the following categories: ‘Anthropometric’, ‘Cognitive function’, ‘Lipids’, ‘Education attainment’, ‘Autoimmune’, ‘Population density’, ‘Allergy’, ‘Socioeconomics status’, ‘Blood pressure’, ‘Blood counts’ and ‘Other medical condition’ (Supplementary Table 9).
For ‘Waist circumference’, we identified 627 independent genome-wide significant hits for either AC- or PC-corrected GWAS. Only five hits were AC-specific, as judged by a statistic comparing PAC and PPC exceeding 1, where PAC and PPC are P values for AC- and PC-based GWAS for ‘Waist circumference’. We queried four of those five variants (Fig. 3e) using ‘mrcieu/ieugwasr’, while the remaining variant (15:84311431:TA:T) lacked an RS catalogue identifier, so was manually queried using opentargets60 (http://genetics.opentargets.org/). It was strongly associated with many different ‘Anthropometric’ related traits such as height, trunk fat mass and body fat percentage, and so was labeled as ‘Anthropometric’. The other 622 variants possessed shared signals and were therefore not queried.
In assigning categories to SNPs overlapping multiple association types, we ranked as follows to prioritize relevance to the traits studied in a particular GWAS in Fig. 3c–e. For ‘Birthplace (North/South)’, we assigned the category ‘Allergy’ to the queried SNP if it is associated with allergic phenotypes such as ‘Hay fever’; for ‘Employment Score England’, we assigned category ‘Educational attainment’ to the queried SNP if it is associated with ‘Educational attainment’ or related phenotypes, including ‘Age completed full-time education’, and then category ‘Cognitive function’ if associated with such traits as, for example, ‘Mood swing’; for ‘Waist circumference’, we assigned category ‘Anthropometric’ to queried SNPs associated with, for example, ‘Leg fat percentage’. Remaining uncategorized SNPs were categorized by their top-ranked associated trait (Supplementary Table 3a–e).
Heritability estimation and GWAS inflation assessment
We ran LD-score regression37 on both AC-corrected and PC-corrected GWAS summary statistics for 99 UKB traits, using precalculated LD scores from the 1000 Genomes phase 3 Utah residents with Northern and Western European ancestry (CEU) panel provided alongside the LDSC software. We took the intercept and mean χ2 values from the LD-score regression output, bounded each below at 1 to ensure non-negativity, and calculated a trait-wise LDSC attenuation ratio, defined as follows38:
$$\text{attenuation ratio} = \frac{\text{LDSC intercept} - 1}{\text{mean}(\chi^2) - 1} \tag{5}$$
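A minimal sketch of the attenuation ratio calculation, assuming the standard definition (intercept minus 1, divided by mean χ2 minus 1) with both inputs bounded below at 1. Returning NaN when the mean χ2 is at the bound is a choice made for this sketch only.

```python
def attenuation_ratio(intercept, mean_chi2):
    """LDSC attenuation ratio with both inputs bounded below at 1."""
    intercept = max(intercept, 1.0)
    mean_chi2 = max(mean_chi2, 1.0)
    if mean_chi2 == 1.0:
        # No polygenic signal above the null: the ratio is undefined.
        return float("nan")
    return (intercept - 1.0) / (mean_chi2 - 1.0)


# Toy values: a small residual inflation relative to the overall signal,
# and an intercept below 1 that is bounded to give a ratio of 0.
ratio_inflated = attenuation_ratio(1.05, 1.5)
ratio_bounded = attenuation_ratio(0.98, 2.0)
```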
Estimation of allele frequencies for UKB SNPs
Highly differentiated alleles across different regions reflect strong local genetic drift and can reveal natural selection. We derived an expectation-maximization algorithm to estimate maximum-likelihood regional allele frequencies based on our individual ACs (Supplementary Note 1). We applied this to 12,977,776 imputed UKB SNPs with MAF of at least 0.01% and imputation information score above 0.9 in 339,304 white British samples with more than 95% UK + Ireland ancestry. Using ‘qctool2’ (https://www.well.ox.ac.uk/~gav/qctool_v2/index.html), ‘bgenix’61 and plink2.0 (ref. 57), we preprocessed the genotype data, performed quality control and estimated allele frequencies across the 23 British–Irish regions.
Polygenic score calculations
For each of 53 traits, we constructed two PGSs using either ACs or PCs for population stratification correction in a GWAS of 343,047 unrelated white British individuals. PGS construction used the Hapmap3 SNPs62 with a minimum 1% MAF. For each phenotype, we performed LD-clumping using ‘plink1.9’57 with 1000 Genomes phase 3 CEU63 as the reference panel and the following parameters: significance threshold for index SNPs of 0.05, secondary significance threshold for clumped SNPs of 1, LD threshold for clumping of 0.1 and physical distance threshold for clumping of 500 kb, corresponding to the plink command-line options ‘--clump-p1 0.05 --clump-p2 1 --clump-r2 0.1 --clump-kb 500’. We also tested a stricter P value threshold of 0.0001 by setting ‘--clump-p1 0.0001’ in the above plink command. The PGS was evaluated on the remaining 144,362 UKB individuals.
PGS was calculated as a linear sum over SNPs and genotypes:
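$$\mathrm{PGS}_{\mathrm{AC}} = \sum_j G_j \hat{\beta}_j^{\mathrm{AC}} \tag{6}$$
$$\mathrm{PGS}_{\mathrm{PC}} = \sum_j G_j \hat{\beta}_j^{\mathrm{PC}} \tag{7}$$
where for SNP j, Gj is the genotype for the target individual, $\hat{\beta}_j^{\mathrm{AC}}$ is the estimated effect size from the AC-corrected GWAS and $\hat{\beta}_j^{\mathrm{PC}}$ is the effect size from the PC-corrected GWAS. Mean-centering genotypes within each population would yield equivalent results, so it was not done here (Supplementary Note 2).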
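The linear sum, and the reason mean-centering within a single population leaves results equivalent, can be illustrated with a short sketch. Genotypes are coded as allele counts (0, 1 or 2) and centered by twice a single set of allele frequencies; the effect sizes, frequencies and function names are hypothetical toy values.

```python
def polygenic_score(genotypes, effect_sizes):
    """PGS as the sum over SNPs of genotype times estimated effect size."""
    if len(genotypes) != len(effect_sizes):
        raise ValueError("genotypes and effect_sizes must have equal length")
    return sum(g * b for g, b in zip(genotypes, effect_sizes))


def centered_polygenic_score(genotypes, effect_sizes, allele_freqs):
    """PGS after subtracting the expected genotype 2 * f_j at each SNP."""
    return sum((g - 2.0 * f) * b
               for g, f, b in zip(genotypes, allele_freqs, effect_sizes))


toy_effects = [0.5, -0.2, 0.1]
toy_freqs = [0.1, 0.3, 0.5]
individual_a = [0, 1, 2]
individual_b = [2, 0, 1]

raw_a = polygenic_score(individual_a, toy_effects)
raw_b = polygenic_score(individual_b, toy_effects)
# Within one population, centering shifts every individual's score by the
# same constant, so score differences (and regression slopes) are unchanged.
shift_a = raw_a - centered_polygenic_score(individual_a, toy_effects, toy_freqs)
shift_b = raw_b - centered_polygenic_score(individual_b, toy_effects, toy_freqs)
```

When allele frequencies differ by local ancestry, as in admixed individuals, the shift is no longer a shared constant, which is consistent with the importance of mean-centering described for ANCHOR.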
To evaluate PGSAC performance across different ancestries, we identified those individuals among the 144,362 UKB ‘test’ individuals with at least 50% inferred ancestry from each of seven separate AC-labeled groups: ‘South Central England’ (1,491 individuals), ‘Northumberland’ (5,355), ‘Republic of Ireland’ (13,265), ‘Poland’ (1,503), ‘India’ (4,438), ‘China’ (1,266) and ‘West Africa’ (7,108).
For each group and phenotype, we evaluated PGSAC for individual i, labeled PGSi, and fit the model:
$$Y_i = \beta\,\mathrm{PGS}_i + \text{covariates}_i + \varepsilon_i \tag{8}$$
where Yi is the phenotype for individual i, εi is the error, and the model was fit via standard linear regression, with 1,000 bootstrapped sample replicates for evaluating uncertainty. As previously, the covariates included are batch, age, sex, age2, age × sex and age2 × sex, as well as the 127 estimated ACs. The parameter β represents an estimator of the increase in the mean phenotype per unit increase in the PGS. We compared β values among groups to quantify PGS applicability in distinct groups, and used generalized values in our ANCHOR analysis.
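A simplified sketch of the bootstrap procedure for β is given below. For brevity it fits only an intercept and the PGS, omitting the batch, age, sex and AC covariates of the full model, and it uses simulated toy data with a known slope; all names and values are hypothetical.

```python
import random


def ols_slope(x, y):
    """Slope of the least-squares regression of y on x (with intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / sxx


def bootstrap_beta(pgs, pheno, n_boot=1000, seed=0):
    """Point estimate of beta and a 95% central bootstrap interval."""
    rng = random.Random(seed)
    n = len(pgs)
    estimates = []
    for _ in range(n_boot):
        # Resample individuals with replacement and refit the regression.
        idx = [rng.randrange(n) for _ in range(n)]
        estimates.append(ols_slope([pgs[i] for i in idx],
                                   [pheno[i] for i in idx]))
    estimates.sort()
    lower = estimates[round(0.025 * (n_boot - 1))]
    upper = estimates[round(0.975 * (n_boot - 1))]
    return ols_slope(pgs, pheno), (lower, upper)


# Toy data: phenotype increases by 0.5 per unit PGS, plus noise.
toy_rng = random.Random(1)
toy_pgs = [toy_rng.gauss(0.0, 1.0) for _ in range(200)]
toy_pheno = [0.5 * s + toy_rng.gauss(0.0, 0.5) for s in toy_pgs]
beta_hat, (ci_low, ci_high) = bootstrap_beta(toy_pgs, toy_pheno)
```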
For each group, we defined the ‘residual r²’, ΔR², a scale-independent measure of PGS performance, as 1 minus the ratio of the phenotypic variance remaining after fitting the above model to the (larger) variance when β = 0. This represents the fraction of variance explained by the PGS after accounting for confounding and covariates.
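The definition can be made concrete by fitting the model twice, once with the PGS and once with β fixed at 0, and comparing residual variances. The sketch below uses synthetic data and hypothetical function names.

```python
import numpy as np

def residual_variance(y, X):
    """Variance of the residuals after a least-squares fit of y on X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.var(y - X @ coef)

def residual_r2(y, pgs, covariates):
    """1 minus (residual variance with the PGS) / (residual variance with beta = 0)."""
    ones = np.ones(len(y))
    full = np.column_stack([ones, covariates, pgs])
    reduced = np.column_stack([ones, covariates])  # same model without the PGS term
    return 1.0 - residual_variance(y, full) / residual_variance(y, reduced)

# Synthetic example: the PGS explains about 0.25 / 1.25 = 20% of the residual variance.
rng = np.random.default_rng(2)
n = 2000
covariates = rng.normal(size=(n, 3))
pgs = rng.normal(size=n)
y = 0.5 * pgs + covariates @ np.array([0.3, 0.1, -0.2]) + rng.normal(size=n)

delta_r2 = residual_r2(y, pgs, covariates)
```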
We also estimated an overall value βObs.Eu for 19,596 individuals of BI ancestry by fitting the same model. The BI individuals were selected from three BI groups—South Central England, Northumberland and Republic of Ireland—filtering out individuals with the sum of these three ancestries less than 0.9.
Ancestry-aware PGS and correlation in effect sizes
To understand the behavior of the PGS in African-ancestry individuals, we identified 8,003 UKB European–African admixed individuals whose sum of 27 UK or European ancestries and 4 sub-Saharan African ancestries is larger than 90% and sum of 4 sub-Saharan ancestries larger than 10%. We further binned these individuals into five ancestry bins based on their inferred African ancestry: [0.1, 0.35], 373 individuals, mean ancestry 0.207; [0.35, 0.55], 560 individuals, mean ancestry 0.46; [0.55, 0.78], 475 individuals, mean ancestry 0.707; [0.78, 0.95], 2,556 individuals, mean ancestry 0.89; and [0.95, 1], 4,039 individuals, mean ancestry 0.986. Individuals in the bin [0.95, 1] with near 100% African ancestry were only given an ‘APGS’ PGS in our analysis (see below), the results of which are shown in Fig. 4b,c and Extended Data Fig. 9 and Supplementary Figs. 9 and 12.
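The selection and binning rule can be sketched as follows. The inputs are assumed to be per-individual summed ancestry proportions (European and sub-Saharan African); because the text writes each bin as a closed interval, the handling of values that fall exactly on a boundary (assigned here to the upper bin) is an assumption.

```python
import numpy as np

# Bin edges for inferred African ancestry, as listed in the text.
BIN_EDGES = np.array([0.10, 0.35, 0.55, 0.78, 0.95, 1.00])

def select_and_bin(european, african):
    """Flag admixed individuals and assign each to one of five ancestry bins (-1 if excluded)."""
    european = np.asarray(european, dtype=float)
    african = np.asarray(african, dtype=float)
    keep = (european + african > 0.90) & (african > 0.10)
    # Digitizing against the interior edges yields bin indices 0..4.
    bins = np.digitize(african, BIN_EDGES[1:-1])
    return keep, np.where(keep, bins, -1)

# Toy ancestry proportions for five individuals.
european = [0.80, 0.50, 0.02, 0.95, 0.30]
african = [0.15, 0.45, 0.97, 0.04, 0.30]
keep, bin_index = select_and_bin(european, african)
```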
In the ANCHOR approach, we generate and analyze ancestry-specific PGS for a trait. We applied HAPMIX48 to estimate local ancestry genome-wide for each individual, using 1000 Genomes phase 3 (ref. 63) CEU and Yoruba in Ibadan, Nigeria groups, respectively, as European and African ancestry reference panels, with default parameters. For individuals labeled i = 1, 2, … n, the output of HAPMIX at each locus j = 1, 2, … J provides probabilities that the local ancestry is each of the four possibilities, ordering the two chromosomes arbitrarily (for example, ‘EE’ refers to both chromosomes possessing European ancestry). Given genotype Gij at site j for individual i, we estimate allele frequencies for European and African background, respectively, by fitting the model
9 |
where is the mean-zero noise, by least squares. In practice we fit the equivalent model, noting and that the ancestry probabilities sum to 1:
10 |
where after model fitting we obtain estimates and . Given the large sample size, these estimates closely match true SNP frequencies for the specific admixing groups (they also correlate strongly with the frequencies of the same variants in relevant 1000 Genomes cohorts) (Supplementary Fig. 18).
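One way to picture this step is a least-squares fit of genotype on the expected number of European and African chromosome copies at the site. The sketch below is a simplified stand-in and not necessarily the exact parametrization used by the authors: it assumes the expected genotype equals the European frequency times the expected European copy number plus the African frequency times the expected African copy number. All names and the simulated data are illustrative.

```python
import numpy as np

def ancestry_specific_frequencies(genotypes, p_ee, p_ea, p_ae, p_aa):
    """Least-squares allele frequencies on European and African backgrounds at one site."""
    # Expected copies of each ancestry; the four probabilities sum to 1, so these sum to 2.
    eur_copies = 2 * p_ee + p_ea + p_ae
    afr_copies = 2 * p_aa + p_ea + p_ae
    X = np.column_stack([eur_copies, afr_copies])
    coef, *_ = np.linalg.lstsq(X, genotypes, rcond=None)
    return float(coef[0]), float(coef[1])

# Simulated site with known frequencies and perfectly called local ancestry.
rng = np.random.default_rng(0)
n, true_eu, true_af = 5000, 0.2, 0.6
is_african = rng.random((n, 2)) < rng.uniform(0.1, 0.9, size=(n, 1))
alleles = rng.random((n, 2)) < np.where(is_african, true_af, true_eu)
genotypes = alleles.sum(axis=1).astype(float)

p_ee = (~is_african[:, 0] & ~is_african[:, 1]).astype(float)
p_ea = (~is_african[:, 0] & is_african[:, 1]).astype(float)
p_ae = (is_african[:, 0] & ~is_african[:, 1]).astype(float)
p_aa = (is_african[:, 0] & is_african[:, 1]).astype(float)

f_eu_hat, f_af_hat = ancestry_specific_frequencies(genotypes, p_ee, p_ea, p_ae, p_aa)
```

Because the two copy numbers sum to 2, the same fit can be written with an intercept and a single ancestry regressor, which is the kind of equivalence the text alludes to.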
HAPMIX also estimates expected allele counts at this site for European and African ancestry backgrounds (obtained via summation; Supplementary Note 1), which are transformed to their mean-centered version (Supplementary Note 2), conditional on local ancestry probabilities:
11 |
12 |
We apply genomic masking to remove uncertain or short segments (<5 megabases) inferred by HAPMIX. However, simulations show consistent results regardless of masking, provided mean-centering is correctly conducted.
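The overall PGS can be decomposed into the European PGS (EPGS) and African PGS (APGS) for African-ancestry individuals i = 1, 2, … n as follows:
$$\mathrm{EPGS}_i = \sum_{j=1}^{J} \hat{\beta}_j \tilde{G}_{ij}^{\mathrm{Eu}} \tag{13}$$
$$\mathrm{APGS}_i = \sum_{j=1}^{J} \hat{\beta}_j \tilde{G}_{ij}^{\mathrm{Af}} \tag{14}$$
where $\tilde{G}_{ij}^{\mathrm{Eu}}$ and $\tilde{G}_{ij}^{\mathrm{Af}}$ denote the mean-centered European and African expected allele counts of equations (11) and (12), and $\hat{\beta}_j$ is the estimated per-copy effect size of SNP j on the phenotype. To investigate local (for example, LD) and nonlocal factors (that is, interactions) attenuating the PGS performance in African-ancestry individuals, we fitted the following model to real and simulated data (Supplementary Note 2):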
15 |
where for individuals i = 1, 2, … n, Yi is the phenotype and εi is the zero-mean noise; the parameters to be estimated are the intercept I, βEu, βAf and ω; EPGSi and APGSi are the centered EPGS and APGS after regressing out the covariates; and θi is the mean genome-wide European ancestry proportion for individual i. Covariates are as described in the section ‘Polygenic score calculations’.
We fit this model across various individual subsets, combinations of phenotypes and parameter constraints (described below), obtaining parameter estimates via least squares (Supplementary Note 2), and uncertainty estimates by 1,000 bootstrap resamples of individuals and model refitting. The model is linear so trivial to fit unless allowing ω ≠ 0; in this case, conditional on the ratio the model is again linear, so we use a grid search (1,000 values) over this ratio from 0 to 1, fitting the linear model for each possible value and then minimizing the achieved sum of squares.
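The grid-search idea can be sketched under an explicitly assumed model form. The code below assumes, purely for illustration, that the phenotype mean is I + (1 + ω θ)(βEu · EPGS + βAf · APGS) and that the gridded ratio is βAf/βEu; neither assumption is confirmed by the text here, and the function name is hypothetical. Conditional on the ratio the model is linear in two coefficients, so each grid point needs only one least-squares fit.

```python
import numpy as np

def fit_with_ratio_grid(y, epgs, apgs, theta, n_grid=1000):
    """Grid search over r = beta_af / beta_eu in [0, 1]; linear fit for each r."""
    best = None
    for r in np.linspace(0.0, 1.0, n_grid):
        s = epgs + r * apgs                      # combined score for this ratio
        X = np.column_stack([np.ones(len(y)), s, theta * s])
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        rss = float(np.sum((y - X @ coef) ** 2))
        if best is None or rss < best["rss"]:
            best = {
                "rss": rss,
                "ratio": float(r),
                "intercept": float(coef[0]),
                "beta_eu": float(coef[1]),
                "beta_af": float(r * coef[1]),
                "omega": float(coef[2] / coef[1]),  # assumes a nonzero fitted beta_eu
            }
    return best

# Synthetic data generated from the assumed model form.
rng = np.random.default_rng(3)
n = 4000
epgs, apgs = rng.normal(size=n), rng.normal(size=n)
theta = rng.uniform(0.0, 1.0, size=n)
y = 0.2 + (1 + 0.3 * theta) * (1.0 * epgs + 0.6 * apgs) + rng.normal(scale=0.5, size=n)

fit = fit_with_ratio_grid(y, epgs, apgs, theta, n_grid=201)
```

Bootstrap uncertainty would be obtained by repeating the whole search on resampled individuals.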
Parameters βEu and βAf measure the increase in the phenotype per unit increase in the respective PGS, indicating the scores’ predictive power. Local factors mean that we expect βEu > βAf (Supplementary Note 2; Extended Data Fig. 9 shows this via simulation). If no additional nonlocal factors contribute to nonportability, then βEu and βAf should remain constant across ancestry bins (that is, as θi varies), with βEu also shared between African-ancestry individuals and individuals of 100% European ancestry (Supplementary Note 2). With nonlocal factors acting, this no longer holds. To investigate this setting, we allow the predictive power of the two scores to vary with genome-wide European or African ancestry, that is, with θi, with this dependence captured by ω. Because local factors still operate, the model varies the predictive power measured by βEu and βAf jointly through ω; we lack power with current sample sizes to fit separate effects, and so we simply use ω to capture linear effects14 (Supplementary Note 2).
We fit this model to analyze both real and simulated datasets in an identical manner. When jointly analyzing phenotypes and averaging results, we first binned the African-ancestry individuals by their genome-wide ancestry θi (bin boundaries shown on Fig. 4c). Within each bin ancestry varies little, so we set ω = 0 and fit the model independently for each bin for each phenotype. This provides estimates of βEu and βAf for each phenotype-bin combination, averaged within bins for Fig. 4c, allowing comparisons across bins and with the effect size obtained for individuals of mainly European ancestry. Holding ω = 0 fixed but analyzing all individuals jointly for a phenotype produces estimates of βEu and βAf summarizing the predictive power of the PGS across the full African-ancestry sample set, shown in, for example, Figs. 4d and 5. The ratio of the means of the estimates over phenotypes, , estimates the correlation ρ in effect sizes between African-ancestry and European-ancestry groups, averaged over phenotypes (Supplementary Note 2); estimates of this are shown in, for example, Figs. 4b and 5 for simulated and real data. For an individual phenotype, if the 95% bootstrapped confidence interval of the ratio contains 1, we accept the null hypothesis of shared effect sizes. The ratio captures local (for example, LD) effects on prediction, and so as expected is <1 for all simulated and real traits. Finally, we fit the full model allowing ω to vary. In individuals with 100% African [respectively European] ancestry, this fits the unit increase in the phenotype per unit increase in EPGS as ; [], and analogous coefficients correspond to the APGS. These coefficients are shown in, for example, Figs. 4d and 5. Ratios of these quantities to , and their bootstrapped CIs, are interpreted exactly as those for ‘All’ individuals but now projected to estimate properties of individuals (for example, ratios of true effect sizes) whose ancestry is 100% African or even 100% European.
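The decision rule based on a bootstrapped interval can be sketched generically. The helper below takes paired bootstrap replicates of a numerator and a denominator coefficient (which coefficients form the ratio is left to the caller), forms the ratio per replicate and checks whether the percentile interval contains 1. The function name and the simulated replicates are illustrative only.

```python
import numpy as np

def ratio_interval_contains_one(numerator_draws, denominator_draws, level=0.95):
    """Percentile interval of a ratio of paired bootstrap replicates, and whether it covers 1."""
    ratios = np.asarray(numerator_draws, dtype=float) / np.asarray(denominator_draws, dtype=float)
    alpha = (1.0 - level) / 2.0
    low, high = np.quantile(ratios, [alpha, 1.0 - alpha])
    return float(low), float(high), bool(low <= 1.0 <= high)

# Two synthetic cases: a ratio centered near 0.98 and a ratio centered near 0.6.
rng = np.random.default_rng(4)
denominator = rng.normal(1.0, 0.02, size=1000)
similar = ratio_interval_contains_one(rng.normal(0.98, 0.05, size=1000), denominator)
different = ratio_interval_contains_one(rng.normal(0.60, 0.05, size=1000), denominator)
```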
These quantities are defined consistently across analyses and subsets of individuals.
Ancestry-aware simulation
To explore GWAS portability and evaluate the performance of ANCHOR, we simulated phenotypes across the entire UKB cohort and analyzed them as for the real data. The UKB imputed genotype data include 12,690,793 variants after applying the following filters: MAF ≥ 0.001, minor allele count ≥25, genotype missingness ≤5%, Hardy–Weinberg equilibrium P ≥ $1 \times 10^{-10}$ and imputation INFO score ≥0.8. From these, we chose a set of J causal variants (J = 100, 1,000 or 10,000) either at randomly selected genomic positions, or with clustering. For the clustered setting we selected J/10 nonoverlapping 10-kb regions, each containing an average of five causal variants. The number of variants placed in each region is drawn from a multinomial distribution with parameters J/2 (number of clustered variants) and pk, k = 1, 2, … J/10, where all pk are equal. The remaining 50% of causal variants were uniformly distributed along the genome. Effect sizes of the causal variants are drawn independently either from a N(0,1) distribution, or following an LDAK model64 where the effect size of variant j is a draw from a standard normal multiplied by a factor $[2p_j(1 - p_j)]^{-0.25}$, where pj is that variant’s frequency, resulting in larger effects for rarer variants. This generates 3 × 2 × 2 = 12 scenarios; we simulated two different heritabilities (0.3 and 0.6), resulting in 24 simulations in total.
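Two ingredients of this design can be sketched directly: the multinomial allocation of the clustered causal variants to regions, and the two effect-size models. The sketch reads the frequency factor as [2p(1 − p)] raised to the power −0.25; function names and the toy frequencies are assumptions made for illustration.

```python
import numpy as np

def allocate_clustered_variants(n_causal, rng):
    """Spread J/2 clustered causal variants over J/10 regions with equal probabilities."""
    n_regions = n_causal // 10
    n_clustered = n_causal // 2
    probs = np.full(n_regions, 1.0 / n_regions)
    return rng.multinomial(n_clustered, probs)

def draw_effect_sizes(freqs, rng, frequency_scaled=False):
    """Standard normal effects, optionally scaled so that rarer variants get larger effects."""
    freqs = np.asarray(freqs, dtype=float)
    z = rng.standard_normal(len(freqs))
    if frequency_scaled:
        return z * (2.0 * freqs * (1.0 - freqs)) ** -0.25
    return z

counts = allocate_clustered_variants(1000, np.random.default_rng(5))

# Same random draws with and without frequency scaling, to expose the scaling factor.
freqs = np.array([0.01, 0.10, 0.40])
plain = draw_effect_sizes(freqs, np.random.default_rng(7))
scaled = draw_effect_sizes(freqs, np.random.default_rng(7), frequency_scaled=True)
```

With J = 1,000 this places 500 variants across 100 regions, an average of five per region, matching the description.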
Finally, to generate simulated phenotypes Yi for individual i = 1, 2, … N, we apply the following additive model which leverages the actual UKB genotypes and so matches properties of these data, including in particular population stratification:
$$Y_i = \sum_{j=1}^{J} \gamma_j g_{ij} + \varepsilon_i \tag{16}$$
Here γj is the effect size at site j = 1, 2, … J and gij is the genotype of individual i at that site. Noise terms εi are drawn from a normal distribution with mean 0 and variance $\sigma_\varepsilon^2$. To obtain heritability h² of 0.3 or 0.6, we set the variance of εi in equation (16) equal to $\sigma_\varepsilon^2 = \sigma_g^2 (1 - h^2)/h^2$, where $\sigma_g^2$ is the observed variance of the first term.
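The noise scaling follows from the definition of heritability as the genetic share of phenotypic variance: if the genetic term has variance v, a noise variance of v(1 − h²)/h² gives heritability h². A minimal sketch with synthetic genotypes (not UKB data) and hypothetical names:

```python
import numpy as np

def simulate_phenotype(genotypes, effects, h2, rng):
    """Additive phenotype: genetic term plus normal noise scaled to a target heritability."""
    genetic = genotypes @ effects
    var_g = genetic.var()                   # observed variance of the genetic term
    var_e = var_g * (1.0 - h2) / h2         # noise variance giving heritability h2
    noise = rng.normal(0.0, np.sqrt(var_e), size=len(genetic))
    return genetic + noise, genetic

rng = np.random.default_rng(6)
n, n_causal = 5000, 200
freqs = rng.uniform(0.05, 0.5, size=n_causal)
genotypes = rng.binomial(2, freqs, size=(n, n_causal)).astype(float)
effects = rng.standard_normal(n_causal)

phenotype, genetic = simulate_phenotype(genotypes, effects, h2=0.3, rng=rng)
realized_h2 = genetic.var() / phenotype.var()
```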
We also repeated these simulations, now allowing effect sizes to differ in the 8,003 African-ancestry individuals (Supplementary Note 2). For a correlation in effect sizes of ρ, we simulated from the following model:
$$\begin{pmatrix} \gamma_j^{\mathrm{Eu}} \\ \gamma_j^{\mathrm{Af}} \end{pmatrix} \sim N\!\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix},\ \sigma_j^2 \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} \right) \tag{17}$$
where $\sigma_j^2$ is the marginal effect-size variance at site j under the chosen effect-size model. This can be done efficiently without modifying effect sizes for non-African-ancestry individuals, by noting that conditional on the European effect size $\gamma_j^{\mathrm{Eu}}$ at site j = 1, 2, … J, $\gamma_j^{\mathrm{Af}}$ has the following distribution:
$$\gamma_j^{\mathrm{Af}} \mid \gamma_j^{\mathrm{Eu}} \sim N\!\left( \rho\,\gamma_j^{\mathrm{Eu}},\ (1 - \rho^2)\,\sigma_j^2 \right) \tag{18}$$
or equivalently,
$$\gamma_j^{\mathrm{Af}} = \rho\,\gamma_j^{\mathrm{Eu}} + \sqrt{1 - \rho^2}\,\sigma_j Z \tag{19}$$
where Z is a standard normal random variable. We use this to generate $\gamma_j^{\mathrm{Af}}$ for each setting and calculate PGS for African-ancestry individuals. We adjust the phenotypic variance for African-ancestry individuals so as to maintain their heritability at the same value as the UKB samples as a whole. Based on the above scenarios, a simulated phenotype named, for example, ‘#causal:1K(uniform)S:0 h2:0.6’ has heritability 0.6 and 1,000 causal variants in total, uniformly distributed along the genome; S:0 (versus S:0.5) means that no effect size scaling was used.
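Drawing one effect size conditional on another is a standard property of the bivariate normal distribution. The sketch below assumes equal marginal standard deviations in the two groups (passed as `sd`, an illustrative parameter) and checks empirically that the requested correlation is recovered.

```python
import numpy as np

def african_effects_given_european(gamma_eu, rho, rng, sd=1.0):
    """Draw effects with correlation rho to gamma_eu and the same marginal standard deviation."""
    gamma_eu = np.asarray(gamma_eu, dtype=float)
    z = rng.standard_normal(len(gamma_eu))
    # Conditional mean rho * gamma_eu plus independent noise with variance (1 - rho^2) * sd^2.
    return rho * gamma_eu + np.sqrt(1.0 - rho ** 2) * sd * z

rng = np.random.default_rng(8)
gamma_eu = rng.standard_normal(50000)
gamma_af = african_effects_given_european(gamma_eu, rho=0.8, rng=rng)

empirical_rho = np.corrcoef(gamma_eu, gamma_af)[0, 1]
```

Because only the African-ancestry effects are redrawn, the European effect sizes used for the rest of the cohort stay untouched.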
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Online content
Any methods, additional references, Nature Portfolio reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at 10.1038/s41588-024-02035-8.
Supplementary information
Supplementary table legends, figures and Notes 1 and 2.
Table title and legend for each supplementary table. Supplementary Tables 1–9.
Acknowledgements
This research has been conducted using the UK Biobank Resource under application number 27960. We thank all the participants of the UK Biobank project. POPRES was accessed under application number 16508: ‘Admixture and Selection for Fine-Scale Population Genetics of Europe’. This work was supported by the Wellcome Trust grant no. 200186/Z/15/Z awarded to J.M., S.R.M. and G.H., grant no. 224575/Z/21/Z awarded to G.H. and grant no. 212284/Z/18/Z awarded to S.R.M. L.A.F.F. is supported by the Wellcome Trust grant no. 222334/Z/21/Z. Computations used the high-performance computing facilities at the Department of Statistics, University of Oxford and Oxford Biomedical Research Computing facility, a joint development between the Wellcome Centre for Human Genetics and the Big Data Institute supported by Health Data Research UK and the NIHR Oxford Biomedical Research Centre. This work was carried out using the computational facilities of the Advanced Computing Research Centre, University of Bristol—http://www.bris.ac.uk/acrc.
Author contributions
S.R.M., D.J.L., J.M. and G.H. designed the study. S.H., G.H., J.M., D.J.L. and S.R.M. developed the methods. S.H., G.H., J.M., D.J.L. and S.R.M. performed the main analyses with contributions for specific analyses from L.A.F.F. and S.S. S.H., G.H., J.M., D.J.L. and S.R.M. wrote the manuscript.
Peer review
Peer review information
Nature Genetics thanks Hakhamanesh Mostafavi and Benjamin Neale for their contribution to the peer review of this work. Peer reviewer reports are available.
Data availability
UK and world map data can be accessed through https://gadm.org. UK Biobank data can be downloaded by approved researchers through https://www.ukbiobank.ac.uk. Phased haplotype data from 1000 Genomes used as reference panel for HAPMIX can be accessed through https://www.internationalgenome.org/category/data-access/. POBI was accessed using accession no. EGAS00001000672. GWAS summary level data used in this paper can be queried using the interface implemented by ‘mrcieu/ieugwasr’: https://gwas.mrcieu.ac.uk and through Open Target at https://www.opentargets.org. HapMap3 variants list can be accessed at https://ftp.ncbi.nlm.nih.gov/hapmap/. All genetic data used in constructing the ancestry pipeline is provided by third parties and is available for use by others. Variant Frequency information for every SNP in each genetic grouping is available at the University of Bristol data repository, data.bris, at 10.5523/bris.3g5oatl682kz82as80jakjrq91. All other resources were downloaded from their respective websites without registration requirements. The following files have been returned to UK Biobank so that they might be made available to other researchers: (1) ACs on all UK Biobank participants; (2) group-specific allele frequency estimates for 25 M variants; (3) (Mean-centered) European/African genotypes and local ancestry of 8,003 UK Biobank African ancestry individuals (including the variants annotation information); (4) European/African PGS for 8,003 African ancestry individuals across 53 UK Biobank phenotypes. GWAS summary statistics files with population structure corrected by ACs and PCs are available in GWAS catalog (https://www.ebi.ac.uk/gwas/, GCST90310137–GCST90310200 and GCST90429571–GCST90429610).
Code availability
The ANCHOR software package is available via GitHub at https://github.com/MyersGroup/ANCHOR and Zenodo at 10.5281/zenodo.13847648 (ref. 65). Analysis scripts can be downloaded via GitHub at http://github.com/fuopen/UKB_anc and Zenodo at 10.5281/zenodo.14026928 (ref. 66). External software/packages used in this study are available via GitHub at https://github.com/danjlawson/pcapred.largedata (‘pcapred.largedata’) and https://github.com/danjlawson/pcapred (‘pcapred’).
Competing interests
S.H. became a full-time employee of Novo Nordisk Ltd during the drafting of this manuscript. J.M. is a current employee and stockholder of Regeneron Pharmaceuticals. The other authors declare no competing interests.
Footnotes
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
These authors contributed equally: Garrett Hellenthal, Jonathan Marchini, Daniel J. Lawson, Simon R. Myers.
Contributor Information
Sile Hu, Email: leunghom@gmail.com.
Daniel J. Lawson, Email: dan.lawson@bristol.ac.uk
Simon R. Myers, Email: myers@stats.ox.ac.uk
Extended data
Extended data is available for this paper at 10.1038/s41588-024-02035-8.
Supplementary information
The online version contains supplementary material available at 10.1038/s41588-024-02035-8.
References
- 1. Yengo, L. et al. A saturated map of common genetic variants associated with human height. Nature 610, 704–712 (2022).
- 2. Yang, J., Lee, S. H., Goddard, M. E. & Visscher, P. M. GCTA: a tool for genome-wide complex trait analysis. Am. J. Hum. Genet. 88, 76–82 (2011).
- 3. Yang, J. et al. Common SNPs explain a large proportion of the heritability for human height. Nat. Genet. 42, 565–569 (2010).
- 4. Price, A. L. et al. Principal components analysis corrects for stratification in genome-wide association studies. Nat. Genet. 38, 904–909 (2006).
- 5. Novembre, J. & Stephens, M. Interpreting principal component analyses of spatial population genetic variation. Nat. Genet. 40, 646–649 (2008).
- 6. Abdellaoui, A. et al. Genetic correlates of social stratification in Great Britain. Nat. Hum. Behav. 3, 1332–1342 (2019).
- 7. Abdellaoui, A., Dolan, C. V., Verweij, K. J. & Nivard, M. G. Gene–environment correlations across geographic regions affect genome-wide association studies. Nat. Genet. 54, 1345–1354 (2022).
- 8. Duncan, L. et al. Analysis of polygenic risk score usage and performance in diverse human populations. Nat. Commun. 10, 3328 (2019).
- 9. Mostafavi, H. et al. Variable prediction accuracy of polygenic scores within an ancestry group. eLife 9, e48376 (2020).
- 10. Brown, B. C., Ye, C. J., Price, A. L. & Zaitlen, N. Transethnic genetic-correlation estimates from summary statistics. Am. J. Hum. Genet. 99, 76–88 (2016).
- 11. Bitarello, B. D. & Mathieson, I. Polygenic scores for height in admixed populations. G3 (Bethesda) 10, 4027–4036 (2020).
- 12. Patel, R. A. et al. Genetic interactions drive heterogeneity in causal variant effect sizes for gene expression and complex traits. Am. J. Hum. Genet. 109, 1286–1297 (2022).
- 13. Marnetto, D. et al. Ancestry deconvolution and partial polygenic score can improve susceptibility predictions in recently admixed individuals. Nat. Commun. 11, 1628 (2020).
- 14. Hou, H. et al. Causal effects on complex traits are similar for common variants across segments of different continental ancestries within admixed individuals. Nat. Genet. 55, 549–558 (2023).
- 15. Weissbrod, O. et al. Leveraging fine-mapping and multipopulation training data to improve cross-population polygenic risk scores. Nat. Genet. 54, 450–458 (2022).
- 16. Privé, F. et al. Portability of 245 polygenic scores when derived from the UK Biobank and applied to 9 ancestry groups from the same cohort. Am. J. Hum. Genet. 109, 12–23 (2022).
- 17. Márquez‐Luna, C., Loh, P. R., South Asian Type 2 Diabetes (SAT2D) Consortium, SIGMA Type 2 Diabetes Consortium & Price, A. L. Multiethnic polygenic risk scores improve risk prediction in diverse populations. Genet. Epidemiol. 41, 811–823 (2017).
- 18. Martin, A. R. et al. Current clinical use of polygenic scores will risk exacerbating health disparities. Nat. Genet. 51, 584–591 (2019).
- 19. Wang, Y. et al. Theoretical and empirical quantification of the accuracy of polygenic scores in ancestry divergent populations. Nat. Commun. 11, 3865 (2020).
- 20. Carlson, C. S. et al. Generalization and dilution of association results from European GWAS in populations of non-European ancestry: the PAGE study. PLoS Biol. 11, e1001661 (2013).
- 21. Atkinson, E. G. Estimation of cross-ancestry genetic correlations within ancestry tracts of admixed samples. Nat. Genet. 55, 527–529 (2023).
- 22. Hellenthal, G. et al. A genetic atlas of human admixture history. Science 343, 747–751 (2014).
- 23. Leslie, S. et al. The fine-scale genetic structure of the British population. Nature 519, 309–314 (2015).
- 24. Haworth, S. et al. Apparent latent structure within the UK Biobank sample has implications for epidemiological analysis. Nat. Commun. 10, 333 (2019).
- 25. Bycroft, C. et al. The UK Biobank resource with deep phenotyping and genomic data. Nature 562, 203–209 (2018).
- 26. Patterson, N., Price, A. L. & Reich, D. Population structure and eigenanalysis. PLoS Genet. 2, e190 (2006).
- 27. Cai, M. et al. A unified framework for cross-population trait prediction by leveraging the genetic correlation of polygenic traits. Am. J. Hum. Genet. 108, 632–655 (2021).
- 28. Cann, H. M. et al. A human genome diversity cell line panel. Science 296, 261–262 (2002).
- 29. Schlebusch, C. M. et al. Genomic variation in seven Khoe-San groups reveals adaptation and complex African history. Science 338, 374–379 (2012).
- 30. Pagani, L. et al. Ethiopian genetic diversity reveals linguistic stratification and complex influences on the Ethiopian gene pool. Am. J. Hum. Genet. 91, 83–96 (2012).
- 31. Novembre, J. et al. Genes mirror geography within Europe. Nature 456, 98–101 (2008).
- 32. Delaneau, O., Zagury, J.-F. & Marchini, J. Improved whole-chromosome phasing for disease and population genetic studies. Nat. Methods 10, 5–6 (2012).
- 33. Lawson, D. J., Hellenthal, G., Myers, S. & Falush, D. Inference of population structure using dense haplotype data. PLoS Genet. 8, e1002453 (2012).
- 34. Loh, P.-R. et al. Efficient Bayesian mixed-model analysis increases association power in large cohorts. Nat. Genet. 47, 284–290 (2015).
- 35. Krausova, A. & Vargas-Silva, C. Wales: census profile. Migration Observatory at the University of Oxford Briefing https://migrationobservatory.ox.ac.uk/resources/briefings/wales-census-profile/ (2013).
- 36. Privé, F., Luu, K., Blum, M. G., McGrath, J. J. & Vilhjálmsson, B. J. Efficient toolkit implementing best practices for principal component analysis of population genetic data. Bioinformatics 36, 4449–4457 (2020).
- 37. Bulik-Sullivan, B. K. et al. LD Score regression distinguishes confounding from polygenicity in genome-wide association studies. Nat. Genet. 47, 291–295 (2015).
- 38. Loh, P.-R., Kichaev, G., Gazal, S., Schoech, A. P. & Price, A. L. Mixed-model association for biobank-scale datasets. Nat. Genet. 50, 906–908 (2018).
- 39. Cook, J. P., Mahajan, A. & Morris, A. P. Fine-scale population structure in the UK Biobank: implications for genome-wide association studies. Hum. Mol. Genet. 29, 2803–2811 (2020).
- 40. Kawasaki, T. & Kawai, T. Toll-like receptor signaling pathways. Front. Immunol. 5, 461 (2014).
- 41. Mikacenic, C., Reiner, A. P., Holden, T. D., Nickerson, D. A. & Wurfel, M. M. Variation in the TLR10/TLR1/TLR6 locus is the major genetic determinant of interindividual difference in TLR1/2-mediated responses. Genes Immun. 14, 52–57 (2013).
- 42. Waage, J. et al. Genome-wide association and HLA fine-mapping studies identify risk loci and genetic pathways underlying allergic rhinitis. Nat. Genet. 50, 1072–1080 (2018).
- 43. Churchhouse, C. et al. Rapid GWAS of thousands of phenotypes for 337,000 samples in the UK Biobank. Neale Lab Blog www.nealelab.is/blog/2017/7/19/rapid-gwas-of-thousands-of-phenotypes-for-337000-samples-in-the-uk-biobank (2019).
- 44. Bycroft, C. et al. Patterns of genetic differentiation and the footprints of historical migrations in the Iberian Peninsula. Nat. Commun. 10, 551 (2019).
- 45. Martin, S. et al. Genetic evidence for different adiposity phenotypes and their opposing influences on ectopic fat and risk of cardiometabolic disease. Diabetes 70, 1843–1856 (2021).
- 46. Christakoudi, S., Evangelou, E., Riboli, E. & Tsilidis, K. K. GWAS of allometric body-shape indices in UK Biobank identifies loci suggesting associations with morphogenesis, organogenesis, adrenal cell renewal and cancer. Sci. Rep. 11, 10688 (2021).
- 47. Graff, M. et al. Genome-wide physical activity interactions in adiposity―A meta-analysis of 200,452 adults. PLoS Genet. 13, e1006528 (2017).
- 48. Price, A. L. et al. Sensitive detection of chromosomal segments of distinct ancestry in admixed populations. PLoS Genet. 5, e1000519 (2009).
- 49. Torres, L. A., Martinez, F. E. & Manço, J. C. Correlation between standing height, sitting height, and arm span as an index of pulmonary function in 6–10‐year‐old children. Pediatr. Pulmonol. 36, 202–208 (2003).
- 50. Lum, S. et al. Lung function in children in relation to ethnicity, physique and socioeconomic factors. Eur. Respir. J. 46, 1662–1671 (2015).
- 51. Momin, M. M. et al. A method for an unbiased estimate of cross-ancestry genetic correlation using individual-level data. Nat. Commun. 14, 722 (2023).
- 52. Shi, H. et al. Population-specific causal disease effect sizes in functionally important regions impacted by selection. Nat. Commun. 12, 1098 (2021).
- 53. Yair, S. & Coop, G. Population differentiation of polygenic score predictions under stabilizing selection. Philos. Trans. R. Soc. Lond. B Biol. Sci. 377, 20200416 (2022).
- 54. Galinsky, K. J. et al. Estimating cross‐population genetic correlations of causal effect sizes. Genet. Epidemiol. 43, 180–188 (2019).
- 55. Guo, J. et al. Quantifying genetic heterogeneity between continental populations for human height and body mass index. Sci. Rep. 11, 5240 (2021).
- 56. LaPierre, N. et al. Identifying causal variants by fine mapping across multiple studies. PLoS Genet. 17, e1009733 (2021).
- 57. Chang, C. C. et al. Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience 4, s13742-015-0047-8 (2015).
- 58. Walter, K. et al. The UK10K project identifies rare variants in health and disease. Nature 526, 82–90 (2015).
- 59. Abraham, G., Qiu, Y. & Inouye, M. FlashPCA2: principal component analysis of Biobank-scale genotype datasets. Bioinformatics 33, 2776–2778 (2017).
- 60. Mountjoy, E. et al. An open approach to systematically prioritize causal variants and genes at all published human GWAS trait-associated loci. Nat. Genet. 53, 1527–1533 (2021).
- 61. Band, G. & Marchini, J. BGEN: a binary file format for imputed genotype and haplotype data. Preprint at bioRxiv 10.1101/308296 (2018).
- 62. International HapMap 3 Consortium. Integrating common and rare genetic variation in diverse human populations. Nature 467, 52–58 (2010).
- 63. 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature 526, 68–74 (2015).
- 64. Speed, D. et al. Reevaluation of SNP heritability in complex human traits. Nat. Genet. 49, 986–992 (2017).
- 65. Hu, S. & Myers, S. MyersGroup/ANCHOR: ANCHOR v0.1-beta.1. v0.1 edn. Zenodo 10.5281/zenodo.13847648 (2024).
- 66. Hu, S., Lawson, D. & Myers, S. fuopen/UKB_anc: Source code of analysis for reproducing the main results/figures in ANCHOR paper. v0.2 edn. Zenodo 10.5281/zenodo.14026928 (2024).