Summary
Emerging evidence suggests that migration behavior can be selective with respect to individuals’ genotypes, producing genotype-environment correlations that standard methods used in genetic association studies cannot correct. We investigate this phenomenon by examining the spatial dynamics of polygenic scores (PGSs) in Estonia. Our analyses show that contemporary migrations intensify inter-regional differences in PGSs for multiple traits, with educational attainment (EA) PGS showing the strongest effect and largely explaining the inter-regional variation of other PGSs. This differentiation is mainly driven by individuals with higher EA PGS migrating to Estonia’s two largest cities from the rest of the country. Importantly, this pattern replicates within families: individuals migrating to the major cities have, on average, higher EA PGS than their siblings staying elsewhere. This trend has persisted since the mid-20th century, despite significant societal changes. These findings illustrate how migration shapes genetic differentiation within a population and highlight direct genetic effects influencing this process.
Subject areas: Human genetics, Biological sciences, Paleogenetics
Highlights
• Internal migrations increase regional differences in multiple polygenic scores
• Movers to cities have a higher mean education polygenic score than non-migrants
• Within-family replication indicates direct genetic effects on migration behavior
• Selective migration has persisted in Estonia since at least the mid-20th century
Introduction
Spatial population genetic structure, i.e., differences in allele frequencies across geographic locations,1 has been observed in human populations from global2,3 to fine scales.4,5,6,7,8 It is driven by various demographic phenomena, including migrations and admixture as well as isolation due to geographic and cultural factors.9,10,11,12,13 Spatial population structure, in this case referred to as population stratification, may cause spurious genotype-phenotype associations and is routinely corrected for in genetic studies.14,15,16,17 At a fine scale, such structure is being blurred by migration that has largely intensified in the past century.18,19,20,21 If migration is equally likely to happen in any direction, it should mutually randomize environment and allele frequencies, reducing environmental confounding in genome-wide association studies (GWASs) and downstream analyses.17,22,23,24 In practice, however, migration patterns may be associated with individuals’ genotype at certain loci. In the case of directional migrations (for instance, migrations to more economically developed areas) over generations, this will lead to increasing differentiation in allele frequencies at migration-associated loci between regions.
Indeed, Abdellaoui et al. have shown that migrants and non-migrants from the same economically deprived areas in Great Britain differ in their average genetic profiles, with the strongest difference observed for alleles associated with educational attainment (EA).25 As a result, this newly emerging population structure might generate genotype-environment correlations, which are the source of bias for the genetic effect estimates and complicate the interpretability of the results of genetic studies.26 Recent works demonstrate that traits related to socioeconomic status (SES) are particularly prone to such complications.25,26,27
Despite the potential practical implications of such non-random changes in fine-scale spatial population structure due to recent human migrations, little is still known about how widespread and how recent they are. Most of the observations to date come from the UK Biobank,28 raising the question of whether these effects are country- or cohort-specific. Additional analyses are also required to further characterize this phenomenon in terms of affected phenotypes, effects of confounding, and temporal dynamics.
Here, we analyze the Estonian Biobank (EstBB) dataset29,30 to assess the genetic consequences of recent migrations within Estonia, a country that differs from Great Britain in its genetic background and in its demographic and socio-economic characteristics. Estonia has a centuries-old population structure that aligns with the broader European context but carries unique local patterns.7,10 The recruitment strategy of the EstBB also differs from that of the UK Biobank. The EstBB includes data on more than 210,000 participants, which represents approximately 20% of the current adult population, spans all adult ages, and has relatively uniform geographic coverage.30
In this work, we explore how contemporary migrations (defined as the difference between an individual’s place of birth and place of current residence) change the spatial distribution of polygenic scores (PGSs) for 169 complex traits. We show that migrations lead to increasing differentiation between regions of Estonia (specifically the two major cities versus the rest of the country) in most of the tested PGSs, with the strongest effect for PGS for educational attainment. Using the within-family approach, we demonstrate that this association between genotype and migration profile reflects direct genetic effects and cannot be fully accounted for by parental effects or confounding. Next, we reveal that the inter-regional PGS differences accumulate over generations regardless of substantial changes in society. Finally, we discuss the potential implications of such migration-driven population structure for genetic studies.
Results
Data overview
We investigated the distribution of genetic principal components and polygenic scores for complex traits across geographic areas and between different migration groups in the Estonian Biobank (EstBB).29,30 We used genome-wide single-nucleotide polymorphism (SNP) data from 183,576 adults of European genetic ancestry who were born in Estonia, resided there at the time of joining the EstBB, and indicated “Estonian” (172,376) or “Russian” (11,200) when answering a question about ethnicity in the EstBB questionnaire. We treated these two cohorts separately, and we refer to them as Estonians and Russians, respectively. We stress that this division was based on self-identity and is made primarily to control for potential historical and cultural differences in migration patterns between these groups. The cohort of self-reported Estonians was used for all the main analyses. To control for group-specific effects and to provide a comparison across subgroups, we repeated some analyses in partially overlapping subsamples based on demography (sex and age) and time of the biobank enrollment, which occurred in two different recruitment campaigns. We also repeated the analyses on the Estonian subcohort after excluding relatives up to and including the second degree to confirm the observations using independent data points. Most of the analyses were replicated in the cohort of self-reported Russians, the second largest group in the EstBB. Detailed subdivision information and a description of the groups can be found in Note S1.
Effect of recent migrations on regional differences in genome-wide ancestry and polygenic scores
Among all the EstBB participants involved in this study, 41% (75,384 out of 183,576) have their current county-level place of residence (POR) different from their place of birth (POB). Hence, we investigated whether, and to what extent, contemporary migrations affect the present-day population structure in Estonia. In doing so, we compared the extent of genetic differentiation across the 15 Estonian counties when grouping individuals based on POB versus POR. Specifically, we calculated coordinates for the top 100 genomic principal components (PCs) for each individual and then compared Varcounty – the proportion of variance for each principal component explained by POB or POR. POB and POR explain a non-zero proportion of variance for 100 and 98 PCs, respectively (Figure 1A, Table S4), consistent with previous reports of population structure in Estonia.10 Importantly, for all the first 100 PCs, Varcounty for POB is significantly higher than for POR, suggesting that contemporary migrations attenuate population structure on a genome-wide scale.
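The Varcounty comparison can be illustrated with a minimal sketch. It assumes Varcounty is the share of the total variance captured by county means (the R² of a one-way model); the estimator and significance tests actually used in the study may differ, and the function name, county labels, and values below are hypothetical.

```python
from collections import defaultdict

def var_explained_by_group(values, groups):
    """Fraction of the variance of `values` explained by a categorical label.

    Computed as between-group sum of squares over total sum of squares.
    This is an assumed stand-in for Var_county, not the paper's exact estimator.
    """
    n = len(values)
    grand_mean = sum(values) / n
    total_ss = sum((v - grand_mean) ** 2 for v in values)
    if total_ss == 0:
        return 0.0
    by_group = defaultdict(list)
    for v, g in zip(values, groups):
        by_group[g].append(v)
    between_ss = sum(
        len(vs) * (sum(vs) / len(vs) - grand_mean) ** 2
        for vs in by_group.values()
    )
    return between_ss / total_ss

# Toy data: one PC coordinate per person, with hypothetical county labels.
pc = [0.9, 1.1, 1.0, -1.0, -0.9, -1.1]
pob = ["A", "A", "A", "B", "B", "B"]  # county of birth
por = ["A", "B", "A", "B", "A", "B"]  # county of residence after some moves

var_pob = var_explained_by_group(pc, pob)
var_por = var_explained_by_group(pc, por)
```

In this toy case the moves are unrelated to the PC value, so grouping by residence explains less variance than grouping by birthplace, which mirrors the genome-wide attenuation described above. With many groups and small samples this raw ratio is biased upward, so it should be read only as an illustration.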
Figure 1.
Variance in PCs and PGSs explained by county of birth and county of residence
Panels show the fraction of the inter-individual variance of (A) PCs, (B) PGSs, and (C) PGSs additionally adjusted for PGSEA, explained by county of birth (POB) and county of residence (POR).
PGSs are preliminarily adjusted for the top 100 PCs and demographic covariates. The y axis in panel A has a logarithmic scale. The PGSs on the x axis in panels B and C are ordered in the same way according to the difference between POR and POB Varcounty in panel B. Red and green dots refer to the POB and POR, respectively. Estimates significantly different from zero are outlined in yellow. The line connecting the two points is yellow when the variance explained by POB and POR together is significantly larger than the variance explained by only the weaker predictor (if significant) or when the stronger predictor is significant. The significance level is 0.05, after Bonferroni correction.
Certain genetic loci or sets thereof, however, might exhibit patterns different from the genome-wide ones. Indeed, contemporary migrations have been shown to be able to enhance regional differences in PGS for certain traits.25 To verify these findings in the EstBB, we explored the spatial distribution of PGS for 169 diverse phenotypes, enriched with traits related to behavior and SES (Table S1). These PGS were calculated using summary statistics from GWAS in the UK Biobank subcohort of European ancestry28,31 and adjusted for demographic covariates and the first 100 PCs. Regional differences in POB and POR explain a statistically significant non-zero proportion of variance for 73 and 112 PGSs, respectively (Figure 1B, Table S5). Unlike for the PCs, for 107 PGSs, the Varcounty values for POR are significantly higher than for POB. The effect was reversed for only 3 PGSs. Therefore, most PGSs show a geographic structure remaining after regressing out the first 100 PCs, and this structure is enhanced by contemporary migrations.
As the tested PGSs are intercorrelated (Table S6), these results might reflect a single underlying phenomenon. Assuming a set of loci with a concordant pattern of allele frequency change across Estonian counties due to migrations, we expect multiple PGSs to have a non-zero Varcounty, with this statistic being higher for PGSs that better capture alleles with stronger frequency differentiation. To see whether different PGSs capture the same population structure pattern and change thereof, we repeated the analysis after correcting the tested PGSs for the PGS for “College or university degree” (PGSEA), which is the PGS with the highest Varcounty for both POB and POR (0.45% and 1.16%, respectively). Among PGSs with Varcounty significantly greater than zero before the adjustment, the adjustment for PGSEA leads to a decrease of Varcounty in all 72 cases for POB and for 110 out of 111 cases for POR (Figure 1C, Table S7). Varcounty remains significantly higher than zero for 32 and 48 adjusted PGSs, not exceeding 0.07% and 0.19%, for POB and POR, respectively. This result suggests that the loci associated with university education, or more broadly with educational attainment (EA), contribute to a substantial fraction of the signal of non-random distribution of the other PGSs in space.
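The adjustment of one PGS for PGSEA can be sketched as a simple residualization. This is a minimal illustration assuming a single-predictor ordinary least-squares regression with an intercept; the function name and the toy scores are hypothetical, and the study's actual adjustment pipeline is not specified at this level of detail.

```python
def residualize(y, x):
    """Residuals of y after ordinary least-squares regression on x (with intercept)."""
    n = len(y)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    # Each residual is the part of y not linearly predicted by x.
    return [(yi - mean_y) - slope * (xi - mean_x) for xi, yi in zip(x, y)]

# Toy standardized scores for four hypothetical individuals.
pgs_ea = [-1.0, 0.0, 1.0, 2.0]
pgs_other = [-0.5, 0.5, 0.0, 1.0]

pgs_other_adjusted = residualize(pgs_other, pgs_ea)
```

The adjusted score has zero mean and zero sample covariance with PGSEA, so any county-level variance it still shows cannot be attributed to its linear overlap with PGSEA.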
If the signal described above is indeed mostly linked to the EA-associated loci, we expect that correcting for a more powerful PGS for EA will reduce the signal for other PGSs even further. We confirmed this using a PGS based on summary statistics from a meta-analysis GWAS (PGSEA4).32 Varcounty for POB (0.50%) and POR (1.53%), as well as the difference between them, are higher for PGSEA4 than for PGSEA. Adjustment for PGSEA4 reduces the Varcounty for other PGSs to a greater extent than adjustment for PGSEA (21 and 37 remain significant with a maximum of 0.05% and 0.14% for POB and POR, respectively) (Figure S23, Table S8). These results support the hypothesis that most of the inter-regional differences in the PGSs and the increase of these differences due to contemporary migrations can be considered as a single phenomenon well indexed by PGSEA. We note that, as various polygenic scores, including PGSEA, are intercorrelated (Table S6), we focus on PGSEA just as a proxy for a broader genetic component associated with multiple different traits. Specifically, we do not assume any causal role of EA genetics in this stratification. PGSEA4 is more powerful but potentially more confounded than PGSEA as it is based on a meta-analysis of many relatively small cohorts in which the adjustment for population stratification can be less effective.33,34 Thus, throughout the rest of our analyses, we mainly focused on PGSEA, which captures most of the signal of the non-random distribution of the tested PGSs, yet potentially carries less confounding than PGSEA4. At the same time, PGSEA4 may not be excessively confounded if population stratification was independent in the meta-analysed cohorts. We report the results for PGSEA4 in Note S6.
We additionally analyzed PGSs for certain psychiatric conditions as their association with choice of place of living was shown in other studies25,35 (Note S7). We observed that Varcounty is non-zero for four out of seven tested PGSs after adjustment for demographic covariates and the first 100 PCs for POR in the subcohort of unrelated Estonians. Although Varcounty for the psychiatric trait PGSs is substantially lower than for the top PGSs from the main analysis, it is consistently higher for POR compared to POB. Furthermore, Varcounty markedly decreases after adjustment for the PGSEA4, indicating that a substantial fraction of the observed regional differentiation in PGSs can be attributed to the component indexed by PGSEA4 (Figures S29 and S30).
Next, we explored whether the increase of Varcounty for PGSEA could be an artifact of some unaccounted properties of the EstBB sample. We showed that the increase in Varcounty for PGSEA remains significant if we a) stratify the sample by sex, age, or recruitment phase (Figures S12–S14); b) exclude relatives up to and including the second degree (Figure S17); c) repeat the analysis in the cohort of self-reported Russians (Figure S11); or d) adjust PGSEA for the complete genetic relatedness matrix (GRM) in a leave-one-chromosome-out (LOCO) approach (Figure S16). We also showed that adjusting PGSEA for PCs has a minor effect on the difference in Varcounty between POR and POB, and this effect is weakly influenced by the number of PCs used (Figure S15). Additionally, e) the polygenic score derived from the summary statistics of within-sibship GWAS for EA36 also demonstrates a significant increase in Varcounty (Figures S10–S14). See Notes S4 and S5 (Figure S17) for details.
Finally, f) the relatively large number of sibships in the EstBB allowed us to repeat the analysis for PC or PGS deviations from their within-sibship mean values. Such a sibling design randomizes genotypes and environment, allowing the differentiation of genetic effects from associations due to environmental confounding.37,38 As we used only sibships, in which all the members were born in the same county, and thus Varcounty for POB is zero by design, we compared Varcounty for POR to zero and calculated empirical p-values. For the PC coordinates, Varcounty for POR is in the range between 7.3x10−5 and 5.5x10−4 with one out of 100 PCs reaching significance (PC76: Varcounty = 5.5x10−4, p-valueBonf = 0.02; Figure 2A; Table S9). In addition to the 169 UKB-based PGSs, we include PGSEA4 in the analysis as the sibling design is robust to environmental confounding inherited by a PGS from the corresponding GWAS.39 Varcounty for POR is significantly different from zero for 12 PGSs after correction for multiple testing (Figure 2B). These PGSs correspond to the phenotypes related to SES or cognitive skills, with PGSEA4 having the highest Varcounty (Varcounty = 1.4x10−3, p-valueBonf = 0.017). When we repeated the analysis using PGSs adjusted for PGSEA4, Varcounty for POR did not exceed 5x10−4 and never reached the significance level (Figure 2C). The sibling design is less powered than our main analysis because of the smaller sample and the reduced genetic variation within families. In addition, sibships in which all members migrated to the same region do not contribute to this analysis, since in such instances the mean PGS value in their POB and POR remains unchanged. Nevertheless, these results confirm that migration behavior is associated with a specific genetic component that, out of the PGSs we tested, is best captured by PGSs for EA-related phenotypes. 
These associations lead to a non-random spatial distribution of the corresponding alleles and are at least partly driven by direct genetic effects.
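The within-sibship version of the analysis can be sketched as follows. The text states only that empirical p-values were calculated; the permutation scheme here (shuffling residence labels among siblings of the same sibship) is an assumption made for illustration, as are all function names and the toy data.

```python
import random
from collections import defaultdict

def sibship_deviations(values, sibship_ids):
    """Subtract each sibship's mean from its members' values."""
    totals, counts = defaultdict(float), defaultdict(int)
    for v, s in zip(values, sibship_ids):
        totals[s] += v
        counts[s] += 1
    return [v - totals[s] / counts[s] for v, s in zip(values, sibship_ids)]

def var_explained(values, groups):
    """Between-group sum of squares over total sum of squares."""
    mean = sum(values) / len(values)
    total_ss = sum((v - mean) ** 2 for v in values)
    if total_ss == 0:
        return 0.0
    by_group = defaultdict(list)
    for v, g in zip(values, groups):
        by_group[g].append(v)
    between = sum(len(vs) * (sum(vs) / len(vs) - mean) ** 2
                  for vs in by_group.values())
    return between / total_ss

def empirical_p(devs, sibship_ids, por, n_perm=2000, seed=0):
    """Assumed permutation test: shuffle POR labels within each sibship."""
    rng = random.Random(seed)
    members = defaultdict(list)
    for i, s in enumerate(sibship_ids):
        members[s].append(i)
    observed = var_explained(devs, por)
    hits = 0
    for _ in range(n_perm):
        shuffled = list(por)
        for idx in members.values():
            labels = [por[i] for i in idx]
            rng.shuffle(labels)
            for i, lab in zip(idx, labels):
                shuffled[i] = lab
        if var_explained(devs, shuffled) >= observed - 1e-12:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)

# Toy data: 8 sib pairs born in one county; the higher-scoring sibling moved.
pgs, sib, por = [], [], []
for k in range(8):
    pgs += [k + 2.0, k + 0.0]
    sib += [k, k]
    por += ["city", "home"]

devs = sibship_deviations(pgs, sib)
observed, p_value = empirical_p(devs, sib, por)
```

Because family means are removed before grouping by residence, differences between families (including shared environment and parental effects) cannot contribute to the statistic; only the contrast between siblings does.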
Figure 2.
Within-sibship variance in PCs and PGSs explained by county of birth and county of residence
Panels show the fraction of the inter-individual variance of the deviation of an individual’s value from the sibship’s mean for (A) PCs, (B) PGSs, and (C) PGSs additionally adjusted for PGSEA4, explained by county of birth (POB) and county of residence (POR).
PGSs are preliminarily adjusted for the top 100 PCs and demographic covariates. The PGSs on the x axis in panels B and C are ordered as in Figure 1B but with PGSEA4 added as the rightmost datapoint. Red and green dots refer to the POB and POR, respectively. Estimates significantly different from zero are outlined in yellow. Varcounty for POB is always zero by design of the analysis. The significance level is 0.05, after Bonferroni correction.
Geographical distribution of polygenic scores for educational attainment
To explore whether the increasing between-county variability of PGSEA reported above is driven by specific Estonian regions, we mapped the mean values of PGSEA, adjusted for PCs and demographic covariates, for every county in Estonia (Figure 3A and 3B). For both POB and POR, two counties have values significantly higher than the country average: Harju (FDR-adjusted p-values 4.1x10−77 and 1.1x10−168, respectively) and Tartu (FDR-adjusted p-values 4.8x10−12 and 2.8x10−14, respectively). These counties are home to Tallinn and Tartu, the two most populated Estonian cities, which together make up more than 40% of the country’s population.40 Most other counties have values significantly lower than the country’s average.
Figure 3.
PGSEA (PGSEA4) landscape in Estonia
Mean PGSEA of individuals (A) born or (B) residing in each county. (C) Difference between values in panels B and A per county. In panels A-C, PGSEA is adjusted for the top 100 PCs and demographic covariates. (D) Mean value of PGSEA4 adjusted for sibship mean per county of residence. Only siblings born in the same county are included, making the corresponding statistic for the county of birth zero for all counties. Counties with the corresponding value being significantly different from zero after FDR correction at the 0.05 significance level are marked with an asterisk (∗). Error bars correspond to 95% confidence intervals. PGSs are measured in standard deviations.
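The per-county significance marks rely on an FDR correction. The procedure is not named in the text; the sketch below assumes the Benjamini-Hochberg step-up adjustment, which is the most common choice, and uses made-up p-values rather than any values from the study.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw p-values for four counties (illustration only).
raw = [0.001, 0.04, 0.03, 0.20]
adjusted = benjamini_hochberg(raw)
significant = [q < 0.05 for q in adjusted]
```

Here only the first county would be marked with an asterisk at the 0.05 level; the second and third share the same adjusted value because of the monotonicity step.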
To see how the mean PGSEA changed due to contemporary migrations, we subtracted the mean PGSEA of individuals born in each county from the mean PGSEA of that county’s residents (Figure 3C). This change is significantly positive for Harju County, which includes the capital Tallinn, significantly negative for nine counties, and is not significant for the remaining five. Notably, for Tartu County, the estimate has a narrow CI95%, suggesting that recent in- and out-migrations counterbalance each other. These patterns are generally consistent across cohorts of Russians and unrelated Estonians, as well as in subcohorts stratified by sex, year of birth, and year of biobank enrollment (Figures S19, S24, and S39–S42). A significant increase in average PGSEA in Harju County is also observed on the within-sibship level, while the point estimates of change in the average PGSEA in all other regions but Tartu County are negative (Figure 3D).
Polygenic scores for educational attainment values in groups with different migration profiles
Next, we compared the mean PGSEA between groups with different migration profiles (Figure 4). For this, we divided Estonia into three areas: Harju County (including Tallinn), Tartu County (including Tartu City), and other regions of Estonia (referred to as “ORE” later in discussion). All the individuals were classified into 9 groups based on their place of birth and residence. This classification was motivated by the results above and by the fact that Harju and Tartu Counties are the major destinations of migration in Estonia,41,42 see also Note S2. In all cases, migration within the defined areas (for instance, between counties defined as the ORE) was ignored.
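The nine-group classification can be sketched as a mapping from a pair of counties to a pair of areas. The function names are hypothetical, and the county names in the example are used only for illustration.

```python
from itertools import product

AREAS = ("Harju", "Tartu", "ORE")

def area(county):
    """Collapse a county into one of the three areas used in the analysis."""
    return county if county in ("Harju", "Tartu") else "ORE"

def migration_group(pob_county, por_county):
    """Return (area of birth, area of residence); moves inside one area are ignored."""
    return (area(pob_county), area(por_county))

# All possible birth/residence combinations: 3 x 3 = 9 groups.
all_groups = list(product(AREAS, AREAS))

# Illustrative examples with two counties outside Harju and Tartu.
moved_to_capital = migration_group("Pärnu", "Harju")  # ("ORE", "Harju")
moved_within_ore = migration_group("Pärnu", "Võru")   # ("ORE", "ORE")
```

A move between two ORE counties lands in the same group as staying put, which is how migration within the defined areas is ignored.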
Figure 4.
Mean PGSEA (PGSEA4) in migration groups defined by a combination of place of birth (POB) and residence (POR)
(A) County-based analysis where “Tartu” and “Tallinn” refer to Tartu County and Harju County, respectively, while ‘ORE’ refers to other counties.
(B) City-based analysis, where “Tartu” and “Tallinn” refer to the respective cities while ‘ORE’ refers to other counties, as in A. In A and B, PGSEA is adjusted for the top 100 PCs and demographic covariates.
(C) County-based analysis for PGSEA4 adjusted for sibship-average. Error bars correspond to 95% confidence intervals.
Individuals who moved to Harju or Tartu Counties from ORE have higher PGSEA than those who stayed in ORE, explaining the decrease of PGSEA in most ORE counties (Figure 4A). Among individuals born in Harju or Tartu Counties, those migrating to ORE have the lowest PGSEA, while those moving between Tartu and Harju Counties have the highest PGSEA. Tallinn and Tartu are the two biggest cities in Estonia, the main hotspots of urbanization, and centers of education and economic development. Therefore, we asked whether our results are driven by those cities specifically. To check this, we repeated the analysis, keeping only participants born or residing in Tallinn or Tartu City instead of the entire corresponding counties (Figure 4B). The results demonstrate an even larger contrast between those who were born in or moved to Tallinn or Tartu City and those who stayed in ORE. These patterns replicate in unrelated Estonians (Figure S20), in subcohorts divided by sex, year of biobank enrollment, and some of the groups divided by year of birth (Figures S51–S53). They are also consistent with observations coming from the remaining subcohorts (Figures S50 and S52). This supports the hypothesis that cities play an important role in the increasing contrast between the counties. PGSEA adjusted for the sibship average also differs significantly between those who stayed in ORE and those who moved to Harju (p-value = 2.2x10−13) or Tartu County (p-value = 9.3x10−4) (Figure 4C) or to Tallinn (p-value = 9.6x10−11) or Tartu City (p-value = 3.4x10−3) (Figure S26). It is also higher among migrants from Tartu County to Harju County than among those who stayed in Tartu County (p-value = 3.1x10−3).
We next investigated whether the PGSEA of migrants to Tallinn and Tartu City varies based on an individual’s POB in a destination-specific manner. We calculated differences in mean PGSEA between residents of Tallinn and Tartu City born outside those two cities, grouped by their county of birth (Figure 5). Individuals who migrated to Tartu City from counties surrounding Tallinn show, on average, higher PGSEA compared to individuals who were born in the same counties and migrated to Tallinn. On the contrary, individuals who migrated to Tallinn from counties surrounding Tartu City show, on average, higher PGSEA compared to individuals who were born in the same counties and migrated to Tartu City. Subcohort analyses produce results generally in line with these observations, though they suffer from low power (Figures S21 and S55–S58).
Figure 5.
The difference in mean PGSEA and EA (years of education) between residents of Tallinn and Tartu City by county of birth
(A) The value for each county corresponds to the mean PGSEA of individuals born in that county and residing in Tartu City, subtracted from the mean PGSEA of individuals born in the same county and residing in Tallinn. Individuals born in Tallinn or Tartu City are excluded from the analysis. PGSEA is adjusted for the top 100 PCs and demographic covariates.
(B) The same, but for the “years of education” phenotype. Counties with the difference being significant after FDR correction at level 0.05 are marked with an asterisk (∗).
Genetic predictors of other regions of Estonia-to-cities migration
The results above suggest that the pattern observed in Figure 1B and 1C is mostly driven by selective migration out of ORE to Tallinn and Tartu City. Thus, we explored the genetic differences between those moving out of ORE to the two major cities (“cases”) versus those born and staying in ORE (“controls”) in more detail. First, an SNP-based heritability estimate of 0.13 (CI95%: 0.10–0.16) (Table 1) confirmed that there are systematic genetic differences between migrants and non-migrants. Next, we tested the 169 UKB-based PGSs and PGSEA4 as predictors for out-of-ORE migration to Tallinn or Tartu City in unrelated individuals (29,306 cases and 14,028 controls) and in siblings (5,931 cases and 15,281 controls; 11,078 sibships).43 The latter approach allowed us to estimate between- and within-family effects separately. Within-family estimates of PGS effects are not confounded by genotype-environment correlations and parental indirect effects. We note, however, that this approach does not control for indirect genetic effects of siblings, which are expected to correlate positively with direct effects,44 thus introducing downward bias to the within-sibship effect estimates. In this analysis, we applied two models: a) mixed effects logistic regression, with sibship modeled as a random effect and b) fixed effects logistic regression. The mixed effects model explicitly accounts for unexplained inter-sibship variability and is hence more appropriate for within-family comparisons. The fixed-effects model, in contrast, enabled us to compare estimates from the sibling subsample to population-level effects derived from the subcohort of unrelated Estonians. Together, these complementary approaches enable us to assess both within-family and population-level associations. 
The effects of PGSEA4 on the migration phenotype at the within-sibship, between-sibship, and population levels are significantly higher than zero and are the strongest among the tested PGSs, followed by PGSEA (Figure 6A and S33–S35A). All the PGSs with significantly non-zero within-sibship effects are related to SES. Estimates of the population effects for the PGSs from the unrelated Estonians and from siblings are consistent with each other, demonstrating an absence of strong biases in the sibling subsample in comparison with the whole sample (Figures S34 and S35). PCs do not show significant within-sibship effects on ORE-to-cities migration (Figures S36 and S37). Next, we regressed PGSEA4 out of other PGSs and found no significant within-sibship effects for the adjusted PGSs (Figure 6B, S35B). This result supports the idea that the loci associated with EA substantially contribute to the set of loci that directly influence an individual’s probability of ORE-to-city migration.
Table 1.
Genetic aspects of the migration phenotype
| | Estimate, CI95 | P-value |
|---|---|---|
| Population-level logistic regression, ORPGS(EA) | | |
| PGSEA | 1.31 [1.29; 1.33] | 4.9 × 10−258 |
| PGSEA + Years of education | 1.15 [1.13; 1.17] | 5.4 × 10−64 |
| PGSEA + EA (categories) | 1.14 [1.12; 1.16] | 2.0 × 10−57 |
| Within-sibship logistic regression, ORPGS(EA) | | |
| PGSEA | 1.22 [1.15; 1.29] | 2.1 × 10−10 |
| PGSEA + Years of education | 1.11 [1.04; 1.18] | 1.5 × 10−3 |
| PGSEA + EA (categories) | 1.10 [1.03; 1.17] | 4.1 × 10−3 |
| GCTA-GREML | | |
| h2EA, % | 25.9 [23.1; 28.6] | 8.8 × 10−77 |
| h2Migr, % | 12.9 [10.2; 15.6] | 2.7 × 10−21 |
| rg, % | 79.9 [69.9; 89.9] | 2.3 × 10−55 |
The migration phenotype is defined for individuals born in ORE and residing in either Tallinn or Tartu City (cases) or in ORE (controls). The logistic regression section provides the odds ratio for PGSEA as a migration predictor in a model without or with EA. Two models with EA as a covariate were tested: years of education translated from the reported categories of EA (Table S3) and the reported categorical EA. A fixed-effects model was used for within-sibship logistic regression for comparability with population-level effects. The GCTA-GREML section tabulates heritability estimates for binary educational attainment, defined as having a university degree (h2EA), and for migration (h2Migr), as well as the genetic correlation between them in the corresponding cohort.
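For readers who want to relate the odds ratios in Table 1 to logistic regression coefficients, the conversion can be sketched as below. It assumes the reported intervals are Wald 95% intervals that are symmetric on the log-odds scale, which the table does not state; the function names are hypothetical.

```python
import math

def odds_ratio_ci(beta, se, z=1.96):
    """Odds ratio and Wald CI from a log-odds coefficient and its standard error."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)

def beta_se_from_or(or_point, lower, upper, z=1.96):
    """Log-odds coefficient and standard error implied by a reported OR and CI."""
    beta = math.log(or_point)
    se = (math.log(upper) - math.log(lower)) / (2 * z)
    return beta, se

# Population-level PGSEA row of Table 1: 1.31 [1.29; 1.33].
beta, se = beta_se_from_or(1.31, 1.29, 1.33)
or_point, lower, upper = odds_ratio_ci(beta, se)
```

The round trip recovers the reported interval to within rounding, which is consistent with (but does not prove) a Wald-type interval.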
Figure 6.
PGSs as predictors of migration from ORE to the major cities (Tallinn or Tartu)
All the effects are estimated in a sample of siblings, with only siblings born in the same county being included. The estimates are obtained using mixed effects logistic regression with a random intercept for sibship. (A) Effect sizes for PGSs and (B) effect sizes for PGSs additionally adjusted for PGSEA4. All PGSs are preliminarily adjusted for the top 100 PCs and the demographic covariates. Results are shown for PGSs with a significant within-sibship effect before adjusting for PGSEA4 and after Bonferroni correction. The vertical dashed line indicates an odds ratio equal to 1. Error bars correspond to 95% confidence intervals.
Difference in mean polygenic score for educational attainment between cities and other regions of Estonia accumulated over time
Above, we showed that contemporary migration increases the PGSEA differentiation between Tallinn/Tartu City and ORE. We next set out to explore if this effect accumulated over the last century and if there has been any change in the genetic makeup of migrants over this period of time. We compared the mean PGSEA in Estonians grouped by place of birth and residence and the birth decade, while the PGSEA was adjusted and standardized in the entire Estonian cohort (Figure 7). We used wider birth year bins for the oldest and the youngest participants due to their smaller sample sizes. The comparison between groups of individuals born in Tallinn/Tartu City and ORE shows that individuals born in the cities on average have significantly higher PGSEA than those born in ORE starting from the 1940s (p-value 4.2x10−3). Furthermore, the contrast between these groups tends to increase over time (Figure 7A). Consistently, PGSEA is significantly higher in the group of migrants from ORE to the cities than in the group of participants who stayed in ORE. This difference is significant already in the earliest bin (p-value 1.4x10−3) and persists in all subsequent bins (Figure 7B). These patterns persist in the analysis of unrelated subsamples and with PGSEA4 (Figures S22 and S28, respectively).
Figure 7.
Difference in mean PGSEA between cities (Tallinn and Tartu combined) and ORE across birth year bins
(A) Mean PGSEA of individuals born in either ORE or Tallinn/Tartu; (B) mean PGSEA of individuals born in ORE and residing in either ORE or Tallinn/Tartu. PGSEA is adjusted for the top 100 PCs and demographic covariates. Error bars correspond to 95% confidence intervals.
The relation between genetic factors of educational attainment and migration
It has been previously shown that a higher EA is associated with higher migration activity.45,46,47 Hence, the patterns we reported above for PGSEA can merely reflect migration patterns of individuals with various EA levels. This is supported by the observation that EA shows a similar geographic distribution as well as a similar distribution between different migration-profile groups (Figures S43–S48, and S59–S70).
To test whether the results for PGSEA can be entirely explained by the trait itself, we first compared PGSEA between different migration groups after controlling for the EA phenotype. With either binary or continuous measures of EA (university degree and years of education, respectively) regressed out of PGSEA, the differences between the migration groups become less pronounced but remain significant in most cases (Figures S71–S83).
We showed above that PGSEA is a significant predictor for migration out of ORE in a logistic regression model. Here, we used logistic regression to test whether PGSEA predicts migration in a joint effect model including EA as a predictor (Table 1). As above, we estimated both population as well as within-sibship effects. Years of education attenuate the regression coefficient of PGSEA, yet it remains statistically significant, which is in agreement with the results of another recent study on Swedish twins.48 Treating EA as a continuous variable does not allow for different effects of different EA categories in the regression model. To relax this condition, we tested an alternative model with EA included as a categorical covariate. In this case, the effect of PGSEA on migration is close to that with years of education as a covariate and is still significant. Remarkably, despite wider confidence intervals and generally lower regression coefficient estimates than at the population level, within-sibship estimates of PGSEA effects remain significant in the joint models.
We also show using GCTA-GREML that migration to the cities has a genetic correlation of 0.8 (CI95%: 0.7–0.9) with having versus not having a university degree. This suggests the two traits have largely but not fully overlapping genetic backgrounds.
Discussion
In this work, we harness a sample of more than 180 thousand individuals from the Estonian Biobank29,30 to explore the genetic correlates and consequences of contemporary migrations in Estonia. We show that contemporary migrations intensify inter-regional differences in polygenic scores (PGSs) for many traits, especially those related to socioeconomic status (SES). The strongest effect is observed for the PGS for educational attainment (PGSEA), which is consistent with previous observations in the UK.25 Moreover, correlation with PGSEA explains a substantial fraction of the inter-regional differences for other PGSs. We demonstrate that spatial PGS differentiation in Estonia is mainly driven by the migration of individuals with relatively high PGSEA to the two largest cities from the rest of the country. Through sibling comparison, we show that such migration and the resulting increase in inter-regional PGS heterogeneity can be in part explained by direct genetic effects. We also demonstrate that the accumulation of inter-regional PGS differences began no later than the mid-20th century and has continued into the 21st century, despite significant societal changes. Our findings shed light on the interplay between genetics and social factors, providing deeper insights into the processes driving contemporary changes in population structure. Furthermore, they illustrate a type of genotype-environment correlation that is likely to be widespread and should be considered in genetic studies.
First, we demonstrate that contemporary migration in Estonia amplifies inter-regional differences in most of the tested PGSs, contrasting with the trend observed for the genome-wide population structure reflected by PCs. PGSEA exhibits the largest increase in Varcounty and explains a large fraction of the signal for other PGSs. This motivated us to treat PGSEA as the genomic variable that most effectively captures genetic loci associated with migration behavior in the EstBB. The analysis of the geographic distribution of PGSEA and of differences in PGSEA between migration groups, as well as the comparison of the intensity of the migration paths, indicates that selective migration from ORE to Tallinn and Tartu City is the major process driving the genetic differences between the regions (Note S2).
This selective migration might either reflect causal genetic effects on migration behavior or be driven by environmental confounding. For example, culturally driven differences in migration rates from different localities of ORE might co-occur with genetic differences between localities due to the recent fine-scale population structure that is challenging to fully adjust for.17 Indirect parental or dynastic genetic effects, known to particularly influence SES-related traits, may also be considered a form of environmental confounding. However, we accumulate evidence for a substantial role of causal direct genetic effects on migration behavior. First, PGSs for the SES-related traits are associated with migration behavior in Estonians and Russians (this study) as well as in the British population.25 Similarly to our observations, in the UK Biobank, PGS for EA is strongly associated with migration from less to more economically developed areas. It is unlikely that the fine-scale population structure and migration patterns overlap in the same way between those three populations. Second, and most importantly, we observe consistent patterns when exploring this phenomenon in siblings, where we do not expect any association between local or family environment and the genotype at birth. Importantly, even when a causal link between genotype and migration behavior is demonstrated, we still face the problem of genetic confounding, which refers to long-range LD driven mostly by fine-scale population structure and assortative mating.39 Genetic confounding makes it problematic to single out particular traits that share direct genetic effects with migration behavior. Nevertheless, it does not call into question the presence of direct effects.
PC76 also shows a significant increase in Varcounty at the within-sibship level. It may serve as an example of a PC capturing some genetic component with a direct effect on the phenotype.49 Alternatively, it may represent a false positive finding. We correct for multiple testing separately for PCs and PGSs; when correcting jointly for the number of PCs and PGSs tested, the p-value for PC76 increases to 0.054, while the p-value for PGSEA4 (0.027) remains below the threshold of 0.05.
One potential explanation for the observed changes in the population structure may lie in the link between migration and EA. Indeed, there is rich evidence that obtaining an education or applying the acquired qualification is a strong motivation for migration and that education level is associated with migration activity. Moreover, Tallinn and Tartu are the primary centers in Estonia for obtaining a university education and host the highest concentration of high-skilled employment opportunities.50 This might explain part of the association between PGSEA and migration. However, first, these two traits are not perfectly genetically correlated in our sample, thus having some non-overlapping genetic component. Second, in agreement with a study of mobility in Sweden,48 PGSEA is associated with migration behavior even after including the EA phenotype in both population and within-sibship regression models.
In the 20th century, Estonia went through a series of political transitions related to drastic changes in economic and social organization.51 It first gained independence in 1918 and lost it during the Soviet period from 1940 to 1991, which was interrupted by German occupation from 1941 to 1944. Despite this, we observe a consistent trend of increasing differences in PGSEA between the cities and ORE over almost a century, largely caused by genetically biased migration from ORE to the cities. This trend is comparable to the genetically selective migration from coal mining areas in the UK. These findings lead us to suggest that the effect of recent migrations on PGS distribution may be a more general phenomenon for urbanized societies, largely independent of political and economic aspects and probably shared with other countries, at least within Europe. We also replicated the patterns of PGSEA distribution in sex-, age-, and recruitment strategy-based subcohorts and in self-reported Russians, further supporting the view that these patterns are genuine and general. This hypothesis should be verified using data from more countries, including non-European ones.
PGSEA and the EA phenotype are associated not only with the mere fact of migration but also with migration distance: city residents born in more distant regions have, on average, higher PGSEA and EA than those born in nearby regions. This is consistent with previous studies from other countries.24,52 A similar pattern has already been observed on the phenotype level in the early 20th century in Estonia, where students from farther away from Tartu City had on average higher scores on an intelligence test than students born closer to the city.53 Although the test used in that study is considered outdated, factors affecting the result are in line with those currently affecting EA.54 Since EA is largely transmitted through the family environment, the correlation between EA (and, more broadly, SES) and genetic ancestry, once originated from such selective migration, will persist across subsequent generations.27
No matter the relative contribution of causal effects and confounding in the genetic component of the migration behavior we discuss here, it has clear implications: such migration leads to the differentiation between more and less urbanized and economically developed regions in allele frequencies at a specific set of loci. Thus, such migrations create both active (for the migrating individuals) and passive (for their offspring) genotype-environment correlation.55 We show that this correlation is not UK-specific, is amplified over generations, and cannot be corrected for either using the top 100 PCs or the whole common SNPs-derived GRM in a LOCO approach. Thus, it might contribute to the inflation of heritability estimates and marginal SNP effects in genetic studies, especially those focusing on socio-economic traits, including EA itself.26,36 This may also cause inflation in the estimates of assortative mating, confusing it with mating by proximity (Note S8).32 While here we focus on differences between major cities and other areas, the same phenomenon might be observed on the finer geographic (e.g., neighborhood) level or even on the level of professional or SES groups. The latter would happen even if SES is mostly culturally heritable, but there is some genetic component (not necessarily causally) associated with the chance and direction of SES change. A similar process can create correlations between allele frequencies and SES within cities.27
To reduce such confounding on the population level, one would ideally want to control for the familial environment, but at least controlling for place of birth could be a partial solution to the problem.26 Another solution is to move from population-based cohorts toward large-scale family-based studies to be able to separate the different sources of the genetic associations.56
Our findings demonstrate that people’s geographic mobility, particularly that related to urbanization, is accompanied by changes in genetic population structure. The comparison of Estonia and the UK shows that this phenomenon can manifest in countries with different socio-economic systems as well as population sizes. Such migrations, which are non-random with respect to allele frequencies, generate genotype-environment correlations that should be accounted for in genetic studies.
Limitations of the study
First, although the biobank data includes information on approximately 20% of the adult population in Estonia, it has been shown not to be a completely representative population cohort29 (Note S1). Second, the information on the places of birth or residence may contain inaccuracies.57 The reported place of birth may, in some cases, correspond to the settlement where the maternity hospital was located and not to the actual place where the family lived at the time of birth. For this reason, it is safer to consider counties than individual cities, and our main conclusions are not sensitive to this potential issue. The information on the place of residence is updated regularly, synchronizing with the population register. However, people do not always report their movements to the register. Third, in reporting the results for separate age groups, we consider the age of the dead individuals to be fixed at the time of death. Given that migration behavior can change both with an individual’s age and across historical periods, the desynchronization of the year of birth and age can obscure some patterns conditioned on either factor. Still, this effect is expected to be negligible (Note S1).
Additionally, we would like to make a cautionary note about interpreting our results within a broader sociological framework. In most analyses, we used the polygenic score based on population-based GWAS for EA, which captures a substantial fraction of the inter-regional differentiation signal of other PGSs. Many studies have shown that this score is influenced by numerous diverse confounders.25,26,32,36,44,58,59,60,61,62 Thus, PGSEA cannot be interpreted as a cumulative genetic factor directly affecting EA outcome. It is rather a correlate of EA, only in part determined by direct genetic effects. One should also note that the differences in mean PGSEA between migration groups are subtle despite being statistically significant. Moreover, the corresponding distributions strongly overlap for all the migration groups considered (Note S9).
Migration may also influence other aspects of the genetic characteristics of a population, such as runs of homozygosity (ROH).63 While our analyses focused on allele frequency differentiation and polygenic score variation, incorporating ROH would offer a complementary perspective and represents an interesting direction for future work.
Resource availability
Lead contact
Requests for further information and resources should be directed to and will be fulfilled by the lead contact, Ivan A. Kuznetsov (ivan.kuznetsov@ut.ee).
Materials availability
This study did not generate new unique reagents.
Data and code availability
• Access to the Estonian Biobank data (https://genomics.ut.ee/en/content/estonian-biobank) is restricted to approved researchers and can be requested.
• Custom R code used for statistical analyses is publicly available on Zenodo (https://doi.org/10.5281/zenodo.17497533) and GitHub (https://github.com/ivkuz/GeneticMigrationStructureEstonia).
• Any additional information required to reanalyze the data reported in this article is available from the lead contact upon request.
Acknowledgments
We want to acknowledge the participants of the Estonian Biobank. Data analysis was carried out in part in the High-Performance Computing Center of the University of Tartu. We thank Kelli Lehto for sharing EA4 summary statistics, with the Estonian cohort excluded from the meta-analysis. This study was funded by the European Union through the European Regional Development Fund Project No. 2014-2020.4.01.15-0012 GENTRANSMED and by the Ministry of Education and Research of Estonia through the project TK214 Center of Excellence for Personalized Medicine. This project has received funding from the European Union’s Horizon Europe research and innovation program under grant agreement No 101060011. Views and opinions expressed are, however, those of the authors only and do not necessarily reflect those of the European Union or European Research Executive Agency. Neither the European Union nor the granting authority can be held responsible for them. IK was a student of and supported by the Center for Life Sciences, Skolkovo Institute of Science and Technology, Moscow, Russia, during the initial phase of the project. VP and MM were supported by the European Union through Horizon 2020 research and innovation program under grant no 810645 and through the European Regional Development Fund project no. MOBEC008. MM was also supported by the Estonian Research Council grant PUT (PRG1899). UV was supported by the Estonian Research Council grant PUT (PSG759). LP was supported by the Italian Ministry of University and Research (2022B27XYM). FM was supported by Fondazione con il Sud (2018-PDR-01136) and by the Italian Ministry of University and Research (2022P2ZESR). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the article. 
This work was written at writing retreats organized by the University of Tartu Institute of Genomics and the Estonian Doctoral School for Natural and Agricultural Sciences (2021-2027.4.04.24-0003), co-funded by the European Union.
Author contributions
I.K., L.P., F.M., and V.P. conceived and designed the study. I.K. performed all the analyses. I.K. and V.P. wrote the initial draft of the article. All co-authors contributed to the interpretation of the results, reviewed, and approved the submitted version of the article.
Declaration of interests
The authors declare no competing interests.
Declaration of generative AI and AI-assisted technologies in the writing process
The authors used Grammarly and ChatGPT to assist with proofreading and grammar correction. After using these tools, the authors reviewed and edited the content as needed and take full responsibility for the content of the publication.
STAR★Methods
Key resources table
| REAGENT or RESOURCE | SOURCE | IDENTIFIER |
|---|---|---|
| Deposited data | ||
| Estonian Biobank | University of Tartu | https://genomics.ut.ee/en/content/estonian-biobank |
| Pan-UKBB summary statistics | Karczewski et al. 202531 | https://pan.ukbb.broadinstitute.org/downloads/index.html |
| EA4 summary statistics | Okbay et al. 202232 | https://www.thessgac.org/ |
| Within-sibship GWAS summary statistics | Howe et al. 202236 | https://gwas.mrcieu.ac.uk/ |
| Shapefiles for Estonia with county borders | Estonian Land Board64 | Administrative and Settlement Division, 2023.02.01 |
| Software and algorithms | ||
| PLINK2 | Chang et al. 201565 | http://www.cog-genomics.org/plink/2.0/ |
| KING | Manichaikul et al. 201066 | https://www.kingrelatedness.com/Download.shtml |
| SBayesR (gctb_2.02) | Lloyd-Jones et al. 201967 | https://gctbhub.cloud.edu.au/software/gctb/ |
| R 4.0.2 | https://cran.r-project.org/bin/windows/base/old/4.0.2/ | |
| GCTA-GREML | Yang et al. 201168 | https://yanglab.westlake.edu.cn/software/gcta/#GREML |
| Custom code | This study | https://doi.org/10.5281/zenodo.17497533; https://github.com/ivkuz/GeneticMigrationStructureEstonia |
Experimental model and study participant details
Participants
The participants of this study were sourced from the Estonian Biobank (EstBB), which is a volunteer-based cohort of the Estonian resident adult population.29,30 It includes (as of 2022) genetic and diverse phenotype data on 210,438 individuals (72,708 men and 137,730 women) corresponding to ∼20% (∼14% men and ∼24% women) of the contemporary adult population of Estonia.69 Participants' ages range from 18 to 107, determined as of 2022 for living participants or at the year of death. The EstBB is linked with the Estonian national register, so the information on education level and place of residence is regularly updated. The participants were recruited over two decades from 2001 to 2021 across the country, covering all the regions and a variety of different settings, providing socio-economic and cultural heterogeneity. Besides genetic and demographic data, participants provided health data, blood samples, and lifestyle information.
Ethics statement
The activities of the EstBB are regulated by the Human Genes Research Act, which was adopted in 2000 specifically for the operations of the EstBB. Individual level data analysis in the EstBB was carried out under ethical approval “1.1–12/3593” from the Estonian Committee on Bioethics and Human Research (Estonian Ministry of Social Affairs), using data according to release application “6–7/GI/8959” from the Estonian Biobank.
Method details
Genotypes and quality control
Samples were genotyped on the Infinium Global Screening Array (GSA) of different versions (depending on the time of recruitment) with approximately 550,000 overlapping positions. Samples with <95% call rate or mismatch between genetic and self-reported sex were excluded. Before the imputation step, all non-SNP polymorphisms and strand-ambiguous SNPs were filtered out. The final number of SNPs before the imputation step was 309,258. The genotypes were imputed with Beagle 5.470 using the Estonian Reference panel as a reference set.71 To create polygenic scores, we extracted a set of 1,075,599 autosomal HapMap 3 SNPs with a minor allele count >5 and info score >0.7. Unrelated individuals were defined as having less than 2nd-degree relationship (Kinship <0.088) inferred with KING.66
For GREML analysis, the non-imputed genotyping data were used after keeping SNPs with minor allele frequency >0.01, Hardy–Weinberg equilibrium (HWE) p-value >10⁻⁵ and missingness <0.015. Related individuals with a 2nd-degree relationship and closer (Kinship >0.088) were excluded. Relationships were inferred with KING.66
Ancestry and PCA
Genetic ancestry grouping was estimated using imputed genotypes with bigsnpr, following the original workflow.72 For ancestry inference, genotypes were imputed using 1000 Genomes Project phase 3 samples.3 Individuals from ‘Europe (East)’, ‘Europe (North West)’, and ‘Finland’ inferred ancestry groups were kept for further analysis. Next, individuals with self-reported ethnicity other than ‘Estonian’ or ‘Russian’ were excluded from the participants who passed the genetic ancestry filter. These steps were implemented to retrieve a relatively genetically homogeneous set of participants. Only individuals born and residing in Estonia were included in the analysis. In total, 183,576 individuals (63,753 men and 119,823 women) remained after filtering. Next, the resulting set was subdivided into self-reported Estonians (172,376) and Russians (11,200). The purpose of the latter step was to divide the sample by cultural and historical background rather than by genetic profile.73
A principal component analysis (PCA) was conducted separately for self-reported Estonians (182,252 individuals) and Russians (17,954 individuals) regardless of their country of birth or residence to capture population structure within the corresponding groups. Before the analysis, genotypes were filtered for minor allele frequency >0.01, Hardy–Weinberg equilibrium (HWE) p-value >10⁻⁵, and missingness <0.05. Long-range linkage disequilibrium regions were removed.74 Genotypes were pruned for linkage disequilibrium with PLINK265,75 with a window size of 50 kb, a step of 5 kb, and an r² threshold of 0.1. The PCA to construct PCs on self-reported Estonian and Russian individuals was conducted on this SNP set using flashPCA version 2.76
Polygenic score calculations
Polygenic scores were computed for 169 phenotypes using population-based GWAS summary statistics from the UK Biobank (PGSs).28 The PGSs were calculated using summary statistics from GWAS in the European ancestry cohort of the UK Biobank conducted by the Pan-UKBB project.31 The Pan-UKBB project presents an analysis of 7,228 phenotypes, spanning 16,131 studies. The list of traits selected for the analysis included the maximally independent set of 146 phenotypes (with correlation between them <0.1) for which GWAS results passed the quality control. The link to this list is available on the Pan-UKBB project website (https://pan.ukbb.broadinstitute.org/downloads/index.html). Additionally, 23 phenotypes related to education, mental health, fluid intelligence, height, and body mass index (BMI) were added. The complete list of the phenotypes and the numbers of individuals included in the study is presented in Table S1. An additional PGS was calculated using summary statistics from the largest GWAS of educational attainment (EA) currently available, with Estonian individuals and the 23andMe cohort excluded from the meta-analysis.32
Polygenic scores were calculated using SBayesR (gctb_2.02) with default parameters (--pi 0.95,0.02,0.02,0.01 --gamma 0.0,0.01,0.1,1 --chain-length 10000 --burn-in 2000 --out-freq 10), including an LD matrix built using data on 50,000 UK Biobank participants.67 To reduce the effect of the ancestry-related population structure on polygenic scores, the top 100 principal components (PCs) specific to the Estonian or Russian cohort were regressed out. Sex, age, sex×age, and age² were also regressed out of the PGSs to mitigate the influence of potential sex and age bias reported for population volunteer cohorts.59,77 In analyses of PGS adjusted for educational attainment, binary or continuous EA (see educational attainment phenotypes in the STAR Methods) was also regressed out. PGS for EA (PGSEA or PGSEA4) was also regressed out from other PGSs where explicitly mentioned.
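The covariate adjustment described above amounts to taking the residuals of a linear regression. The following is a minimal sketch in plain Python rather than the study's own R code, assuming ordinary least squares with an intercept. The function name `ols_residuals` and the toy values are hypothetical, and only the four demographic terms are shown in place of the full set that also includes 100 PCs.

```python
def ols_residuals(y, covariates):
    """Residuals of y after least-squares regression on covariates plus an intercept."""
    n = len(y)
    X = [[1.0] + list(row) for row in covariates]
    p = len(X[0])
    # Normal equations: (X'X) b = X'y
    A = [[sum(X[k][i] * X[k][j] for k in range(n)) for j in range(p)] for i in range(p)]
    b = [sum(X[k][i] * y[k] for k in range(n)) for i in range(p)]
    # Gaussian elimination with partial pivoting
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, p):
            f = A[r][col] / A[col][col]
            for c in range(col, p):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    # Back-substitution for the regression coefficients
    beta = [0.0] * p
    for i in reversed(range(p)):
        beta[i] = (b[i] - sum(A[i][j] * beta[j] for j in range(i + 1, p))) / A[i][i]
    return [y[k] - sum(X[k][j] * beta[j] for j in range(p)) for k in range(n)]


# Toy data (hypothetical): sex coded 0/1, age in decades, raw PGS values.
sex = [0, 1, 0, 1, 0, 1, 0, 1]
age = [2.5, 3.1, 4.7, 5.2, 3.8, 6.6, 5.9, 4.3]
pgs = [0.12, -0.40, 0.95, 0.33, -1.10, 0.58, -0.07, 0.71]
# Covariates: sex, age, sex x age, age squared
covariates = [[s, a, s * a, a ** 2] for s, a in zip(sex, age)]
adjusted_pgs = ols_residuals(pgs, covariates)
```

The residuals are uncorrelated with every covariate in the model by construction, which is the sense in which those variables are "regressed out" of the score.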
Sources of education and geographic information
Initial information on the highest level of education, place of birth and place of residence was obtained from the questionnaire completed by participants when enrolled in the biobank. The EstBB regularly synchronises its information with the Estonian Population Register on the highest level of education and the municipality of current residence. The data used in this study was last updated in 2022. Participants without information on the counties of birth and residence in Estonia or born outside the country were excluded from the analysis. Participants born or residing in Harju or Tartu Counties and lacking information on the municipality were excluded from the analyses, where it was necessary to distinguish Tallinn/Tartu City from other municipalities of the corresponding counties. After filtering, the analyzed sample included 172,376 self-reported Estonians and 11,200 self-reported Russians.
Educational attainment phenotypes
Continuous and binary traits corresponding to educational attainment were considered. The continuous ‘years of education’ phenotype was derived according to the ISCED 2011 methodology. The link table for the reported level of education, ISCED 2011, and ‘years of education’ is presented in Table S3. Alternatively, attainment of a Bachelor’s degree or higher was used as a binary phenotype (0 - not having a Bachelor’s degree; 1 - having a Bachelor’s degree or higher). The quantitative EA phenotype was adjusted to mitigate possible sampling bias in the corresponding analyses: sex, age, sex×age, age², and 100 genetic PCs were regressed out using linear regression.
Geographic data visualisation
Shapefiles used to plot maps of Estonia with county borders were retrieved from the Estonian Land Board website (Administrative and Settlement Division, 2023.02.01).64 Geographic data were visualized in R78 with the aid of the following packages: ‘sf’,79,80 ‘geos’81 and ‘ggplot2’.82
Quantification and statistical analysis
Inter-regional differences in PC values and polygenic scores
To measure the inter-regional differences in a variable of interest, we calculated the proportion of variance explained by county differences:
$$\mathrm{Var}_{county} = \frac{SSB}{SSB + SSW} \qquad \text{(Equation 1)}$$
where SSB is the sum of squares between counties, and SSW is the sum of squares within counties. P-values were calculated from the ANOVA test. A chi-square test was used to test whether county of birth and county of residence together explain significantly more variance than either of them alone. The base model for this comparison was the reduced model with either county of birth or county of residence as the only independent variable. Statistical significance was determined using a level of 0.05 after the Bonferroni correction for the number of tests (100 for PCs, 169 for PGSs, or 170 when PGSEA4 was included).
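The proportion of variance explained by county can be sketched as follows. This is a minimal illustration in plain Python, not the study's R code. The function name `var_county` and the toy values are hypothetical.

```python
from collections import defaultdict


def var_county(values, counties):
    """Share of total variance attributable to county means: SSB / (SSB + SSW)."""
    groups = defaultdict(list)
    for value, county in zip(values, counties):
        groups[county].append(value)
    grand_mean = sum(values) / len(values)
    ssb = 0.0  # between-county sum of squares
    ssw = 0.0  # within-county sum of squares
    for members in groups.values():
        county_mean = sum(members) / len(members)
        ssb += len(members) * (county_mean - grand_mean) ** 2
        ssw += sum((v - county_mean) ** 2 for v in members)
    return ssb / (ssb + ssw)


# Toy example (hypothetical values): two counties with two individuals each.
example = var_county([0.0, 2.0, 4.0, 6.0], ["A", "A", "B", "B"])
```

In the toy example, the county means are 1 and 5 around a grand mean of 3, so SSB = 16, SSW = 4, and the statistic is 0.8.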
We applied a Z-test to evaluate whether the mean PGS, phenotypic value, or its change due to migration was significantly different from zero. Counties with values that remained significantly different from zero after FDR correction at the 0.05 significance level are indicated with an asterisk (∗) in the figures.
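A minimal sketch of this testing step in plain Python follows. The text specifies an FDR correction without naming the procedure, so the Benjamini-Hochberg step-up adjustment used here is an assumption. The function names are hypothetical, and the Z-test uses the sample standard deviation as the estimate of spread.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def z_test_mean_zero(values):
    """Two-sided Z-test of whether the mean of values differs from zero."""
    z = mean(values) / (stdev(values) / sqrt(len(values)))
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy p-values (hypothetical), e.g., one per county
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
flagged = [q < 0.05 for q in adjusted]
```

Counties whose adjusted value falls below 0.05 would be the ones marked with an asterisk under this assumed procedure.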
Sibling analyses
The sibling design was applied in the analysis of inter-regional variation of PCs and PGSs, the comparison of migration groups and the association between PGSs and migrations from ORE. Siblings were defined among self-reported Estonian individuals by KING66 criteria of first-siblings and an additional threshold of 0.177 < Kinship <0.354 (29,003 sibling pairs, 48,082 unique individuals). Only siblings born in the same county were included in the analysis. If the subgroups of two or more members of the same sibship had different POB, they were treated as separate sibships. The total number of siblings after filtering was 41,081 making up 19,407 sibships, of which 8,318 include members with different POR. PGSs were first adjusted as described above. Then, the sibship mean PGS was subtracted from each individual’s PGS, regardless of the sibship size. There is evidence that birth order may affect EA.83 It could lead to systematic differences in PGSEA between older and younger siblings through participation bias mechanisms. However, we do not observe such significant differences in our data and thus do not adjust PGSs for birth order.
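The grouping and centering steps described above can be sketched as follows in plain Python. This is an illustration only: the record fields (`id`, `sibship`, `pob`, `pgs`) and the function name are hypothetical, and kinship-based sibling inference is assumed to have been done beforehand.

```python
from collections import defaultdict


def within_sibship_center(records):
    """Subtract the sibship mean PGS from each member's PGS.

    Members of one sibship born in different counties are split into
    separate sibships; groups left with a single member are dropped.
    """
    groups = defaultdict(list)
    for rec in records:
        groups[(rec["sibship"], rec["pob"])].append(rec)
    centered = {}
    for members in groups.values():
        if len(members) < 2:
            continue  # no sibling born in the same county
        sibship_mean = sum(r["pgs"] for r in members) / len(members)
        for r in members:
            centered[r["id"]] = r["pgs"] - sibship_mean
    return centered


# Toy records (hypothetical values)
records = [
    {"id": "a", "sibship": "F1", "pob": "Võru", "pgs": 0.5},
    {"id": "b", "sibship": "F1", "pob": "Võru", "pgs": -0.5},
    {"id": "c", "sibship": "F1", "pob": "Pärnu", "pgs": 1.0},
    {"id": "d", "sibship": "F2", "pob": "Pärnu", "pgs": 1.0},
    {"id": "e", "sibship": "F2", "pob": "Pärnu", "pgs": 2.0},
]
centered = within_sibship_center(records)
```

Individual `c` has no sibling born in the same county and is therefore excluded, while the remaining values sum to zero within each sibship.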
By design, all siblings from a given family have the same POB, so all the county averages and, accordingly, Varcounty for POB are zero for all the PGSs and PCs. Because of this property of the model, Varcounty does not follow the F-statistic distribution of the standard ANOVA. To identify the PCs and PGSs whose Varcounty significantly exceeds that expected by chance, we generated 10,140 random variables v ∼ N(0,1) and applied the sibling design to them. The p-value was estimated as (r+1)/(n+1), where n is the number of simulated variables and r is the number of these variables that produce a Varcounty statistic greater than that calculated for the actual PC or PGS.84,85 The number of variables was chosen as a compromise between precision and performance: it allows a p-value of 0.05 after Bonferroni correction for a PGS whose Varcounty exceeds all but two values of the random variables (3/10,141 × 169 = 0.05).
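The arithmetic of this empirical p-value can be checked with a short sketch in plain Python. The function name and the toy null statistics are hypothetical, and the values stand in for Varcounty statistics obtained from the simulated variables.

```python
def empirical_p_value(observed, null_stats):
    """Empirical p-value (r + 1) / (n + 1) from simulated null statistics."""
    r = sum(1 for s in null_stats if s > observed)
    return (r + 1) / (len(null_stats) + 1)


# 10,140 simulated statistics, of which exactly two exceed the observed value
null_stats = [0.0] * 10138 + [1.0, 1.0]
p_raw = empirical_p_value(0.5, null_stats)  # 3 / 10,141
p_bonferroni = p_raw * 169                  # correction for 169 PGSs
```

With two exceedances, the corrected value is 507/10,141, just under 0.05, which matches the threshold arithmetic given in the text.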
PGS effects on migration from ORE to the cities
Other Regions of Estonia (ORE) are defined as the counties of Estonia, excluding Harju and Tartu Counties, which encompass the largest cities, Tallinn and Tartu, respectively. The migration phenotype was specified for individuals born in ORE with POR in ORE or Tallinn/Tartu City. The phenotype value 0 corresponds to POR also in ORE, the value 1 corresponds to POR Tallinn or Tartu City.
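The phenotype coding can be sketched as follows in plain Python. The county and municipality strings and the function name are hypothetical labels chosen for illustration; individuals outside the definition are returned as `None`.

```python
CITY_COUNTIES = {"Harju", "Tartu"}   # counties containing the two largest cities
CITIES = {"Tallinn", "Tartu City"}   # hypothetical municipality labels


def migration_phenotype(pob_county, por_county, por_municipality):
    """Code migration from ORE: 1 = moved to Tallinn/Tartu City, 0 = stayed in ORE."""
    if pob_county in CITY_COUNTIES:
        return None  # not born in ORE, phenotype undefined
    if por_municipality in CITIES:
        return 1
    if por_county not in CITY_COUNTIES:
        return 0
    return None  # resides in Harju or Tartu County outside the two cities


stayed = migration_phenotype("Võru", "Pärnu", "Pärnu")
moved = migration_phenotype("Võru", "Harju", "Tallinn")
```

Note that, under this reading, a move between two ORE counties is still coded 0 because the place of residence remains in ORE.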
For the estimation of the population effects of the PGSs, mixed (Equation 2) and fixed (Equation 3) effects logistic regression models were used:
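$$\operatorname{logit}\big(P(Y_i = 1)\big) = \alpha_0 + \beta\,PGS_i + \zeta^{\top} d_i + \gamma_j + \varepsilon_i \qquad \text{(Equation 2)}$$
$$\operatorname{logit}\big(P(Y_i = 1)\big) = \alpha_0 + \beta\,PGS_i + \zeta^{\top} d_i + \varepsilon_i \qquad \text{(Equation 3)}$$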
Here Yi is the phenotypic outcome of an individual i. P(Yi = 1) is the probability of the outcome Yi being 1. PGSi is the individual’s polygenic score. α0 is the intercept and β is the regression coefficient of the PGS. ζ is the vector of fixed coefficients for the demographic variables sex and age, denoted by vector di. γj is the random intercept for sibship j, with γj ∼ N(0, σ_γ²), accounting for the between-family variation in the intercept. The residual is represented as εi with εi ∼ N(0, σ_ε²). The mixed effects model was applied to the sample of siblings. The fixed-effects model was applied to the subsample of siblings with a single individual picked randomly from each of the sibships, as well as to the subsample of unrelated Estonians.
For the estimation of the within-sibship effects of the PGSs, we used mixed effects logistic regression similar to the models introduced by Selzam et al.43 and Abdellaoui et al.26 (Equation 4) and fixed effects logistic regression without random effect of sibship (Equation 5):
| (Equation 4) |
| (Equation 5) |
PGSij denotes the individual’s polygenic score, and $\overline{\mathrm{PGS}}_{j}$ refers to the mean polygenic score of the family j. βW and βB correspond to within- and between-family PGS effects, respectively. All the other terms correspond to those from the population model.
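The bodies of Equations 4 and 5 are likewise not reproduced in this text. A form consistent with the definitions above and with the within- and between-family decomposition of Selzam et al.,43 offered as an assumption rather than the authors' exact notation, writes the family mean score as $\overline{\mathrm{PGS}}_{j}$:

```latex
% Sketch reconstructed from the term definitions; not the published typesetting.
\begin{align}
\operatorname{logit} P(Y_{ij}=1) &= \alpha_{0}
  + \beta_{W}\bigl(\mathrm{PGS}_{ij} - \overline{\mathrm{PGS}}_{j}\bigr)
  + \beta_{B}\,\overline{\mathrm{PGS}}_{j}
  + \boldsymbol{\zeta}^{\top}\mathbf{d}_{ij} + \gamma_{j}
  && \text{(mixed effects, cf. Equation 4)} \\
\operatorname{logit} P(Y_{ij}=1) &= \alpha_{0}
  + \beta_{W}\bigl(\mathrm{PGS}_{ij} - \overline{\mathrm{PGS}}_{j}\bigr)
  + \beta_{B}\,\overline{\mathrm{PGS}}_{j}
  + \boldsymbol{\zeta}^{\top}\mathbf{d}_{ij}
  && \text{(fixed effects, cf. Equation 5)}
\end{align}
```

In this form, βW is estimated only from differences between siblings, since each individual's score enters as a deviation from the family mean.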
The joint regression analysis of PGSEA and EA was performed in accordance with the formulas given above with EA (years of education or categories) added as a covariate.
Heritability and genetic correlation calculations
Bivariate GREML analysis implemented in GCTA software68,86 was used to estimate heritabilities and genetic correlations for EA and migration from ORE to Tallinn or Tartu City. For this analysis, 43,334 individuals related more distantly than the second degree were used. Sex, age, age², sex×age, sex×age², and 10 genetic PCs were included as covariates in the models. The heritability estimates were transformed from the observed to the liability scale using the Robertson transformation.87
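For a binary trait, the observed-to-liability transformation in its simplest form multiplies the observed-scale estimate by K(1 − K)/z², where K is the prevalence and z is the standard normal density at the liability threshold. The sketch below assumes no ascertainment (the sample case fraction equals the population prevalence) and uses hypothetical input values; it is not the authors' code.

```python
from statistics import NormalDist


def observed_to_liability(h2_observed, prevalence):
    """Convert observed-scale heritability of a 0/1 trait to the liability scale.

    Assumes the sample case fraction equals the population prevalence.
    """
    std_normal = NormalDist()
    threshold = std_normal.inv_cdf(1.0 - prevalence)  # liability threshold
    z = std_normal.pdf(threshold)                     # density at the threshold
    return h2_observed * prevalence * (1.0 - prevalence) / z ** 2


# Hypothetical values for illustration only.
h2_liability = observed_to_liability(h2_observed=0.1, prevalence=0.5)
print(h2_liability)
```

At a prevalence of 0.5 the conversion factor equals π/2, and it grows as the trait becomes rarer.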
Published: November 12, 2025
Footnotes
Supplemental information can be found online at https://doi.org/10.1016/j.isci.2025.114013.
Contributor Information
Ivan A. Kuznetsov, Email: ivan.kuznetsov@ut.ee.
Vasili Pankratov, Email: vasili.pankratov@ut.ee.
Estonian Biobank Research Team:
Andres Metspalu, Lili Milani, Tõnu Esko, Reedik Mägi, Mari Nelis, and Georgi Hudjashov
References
- 1. Charlesworth B., Charlesworth D. W. H. Freeman; 2010. Elements of Evolutionary Genetics.
- 2. Cavalli-Sforza L.L. Genes, peoples, and languages. Proc. Natl. Acad. Sci. USA. 1997;94:7719–7724. doi: 10.1073/pnas.94.15.7719.
- 3. 1000 Genomes Project Consortium, Auton A., Brooks L.D., Durbin R.M., Garrison E.P., Kang H.M., Korbel J.O., Marchini J.L., McCarthy S., McVean G.A., Abecasis G.R. A global reference for human genetic variation. Nature. 2015;526:68–74. doi: 10.1038/nature15393.
- 4. O’Dushlaine C., McQuillan R., Weale M.E., Crouch D.J.M., Johansson A., Aulchenko Y., Franklin C.S., Polašek O., Fuchsberger C., Corvin A., et al. Genes predict village of origin in rural Europe. Eur. J. Hum. Genet. 2010;18:1269–1270. doi: 10.1038/ejhg.2010.92.
- 5. Abdellaoui A., Hottenga J.-J., de Knijff P., Nivard M.G., Xiao X., Scheet P., Brooks A., Ehli E.A., Hu Y., Davies G.E., et al. Population structure, migration, and diversifying selection in the Netherlands. Eur. J. Hum. Genet. 2013;21:1277–1285. doi: 10.1038/ejhg.2013.48.
- 6. Kerminen S., Havulinna A.S., Hellenthal G., Martin A.R., Sarin A.P., Perola M., Palotie A., Salomaa V., Daly M.J., Ripatti S., Pirinen M. Fine-scale genetic structure in Finland. G3 (Bethesda). 2017;7:3459–3468. doi: 10.1534/g3.117.300217.
- 7. Pankratov V., Montinaro F., Kushniarevich A., Hudjashov G., Jay F., Saag L., Flores R., Marnetto D., Seppel M., Kals M., et al. Differences in local population history at the finest level: the case of the Estonian population. Eur. J. Hum. Genet. 2020;28:1580–1591. doi: 10.1038/s41431-020-0699-4.
- 8. Nait Saada J., Kalantzis G., Shyr D., Cooper F., Robinson M., Gusev A., Palamara P.F. Identity-by-descent detection across 487,409 British samples reveals fine scale population structure and ultra-rare variant associations. Nat. Commun. 2020;11:6130. doi: 10.1038/s41467-020-19588-x.
- 9. Leslie S., Winney B., Hellenthal G., Davison D., Boumertit A., Day T., Hutnik K., Royrvik E.C., Cunliffe B., Wellcome Trust Case Control Consortium 2, et al. The fine-scale genetic structure of the British population. Nature. 2015;519:309–314. doi: 10.1038/nature14230.
- 10. Kivisild T., Saag L., Hui R., Biagini S.A., Pankratov V., D’Atanasio E., Pagani L., Saag L., Rootsi S., Mägi R., et al. Patterns of genetic connectedness between modern and medieval Estonian genomes reveal the origins of a major ancestry component of the Finnish population. Am. J. Hum. Genet. 2021;108:1792–1806. doi: 10.1016/j.ajhg.2021.07.012.
- 11. Rodríguez Díaz R., Blanco Villegas M.J. Genetic structure of a rural region in Spain: distribution of surnames and gene flow. Hum. Biol. 2010;82:301–314. doi: 10.3378/027.082.0304.
- 12. Gilbert E., O’Reilly S., Merrigan M., McGettigan D., Molloy A.M., Brody L.C., Bodmer W., Hutnik K., Ennis S., Lawson D.J., et al. The Irish DNA Atlas: Revealing Fine-Scale Population Structure and History within Ireland. Sci. Rep. 2017;7 doi: 10.1038/s41598-017-17124-4.
- 13. Byrne R.P., van Rheenen W., Project MinE ALS GWAS Consortium, van den Berg L.H., Veldink J.H., McLaughlin R.L. Dutch population structure across space, time and GWAS design. Nat. Commun. 2020;11:4556. doi: 10.1038/s41467-020-18418-4.
- 14. Devlin B., Roeder K. Genomic Control for Association Studies. Biometrics. 1999;55:997–1004. doi: 10.1111/j.0006-341X.1999.00997.x.
- 15. Price A.L., Patterson N.J., Plenge R.M., Weinblatt M.E., Shadick N.A., Reich D. Principal components analysis corrects for stratification in genome-wide association studies. Nat. Genet. 2006;38:904–909. doi: 10.1038/ng1847.
- 16. Kang H.M., Sul J.H., Service S.K., Zaitlen N.A., Kong S.-Y., Freimer N.B., Sabatti C., Eskin E. Variance component model to account for sample structure in genome-wide association studies. Nat. Genet. 2010;42:348–354. doi: 10.1038/ng.548.
- 17. Zaidi A.A., Mathieson I. Demographic history mediates the effect of stratification on polygenic scores. eLife. 2020;9 doi: 10.7554/eLife.61548.
- 18. Helgason A., Yngvadóttir B., Hrafnkelsson B., Gulcher J., Stefánsson K. An Icelandic example of the impact of population structure on association studies. Nat. Genet. 2005;37:90–95. doi: 10.1038/ng1492.
- 19. Vitart V., Carothers A.D., Hayward C., Teague P., Hastie N.D., Campbell H., Wright A.F. Increased level of linkage disequilibrium in rural compared with urban communities: a factor to consider in association-study design. Am. J. Hum. Genet. 2005;76:763–772. doi: 10.1086/429840.
- 20. Kerminen S., Cerioli N., Pacauskas D., Havulinna A.S., Perola M., Jousilahti P., Salomaa V., Daly M.J., Vyas R., Ripatti S., Pirinen M. Changes in the fine-scale genetic structure of Finland through the 20th century. PLoS Genet. 2021;17 doi: 10.1371/journal.pgen.1009347.
- 21. Ziyatdinov A., Torres J., Alegre-Díaz J., Backman J., Mbatchou J., Turner M., Gaynor S.M., Joseph T., Zou Y., Liu D., et al. Genotyping, sequencing and analysis of 140,000 adults from Mexico City. Nature. 2023;622:784–793. doi: 10.1038/s41586-023-06595-3.
- 22. Verweij K.J.H., Mosing M.A., Zietsch B.P., Medland S.E. In: Statistical Human Genetics: Methods and Protocols. Elston R.C., Satagopan J.M., Sun S., editors. Humana Press; 2012. Estimating Heritability from Twin Studies; pp. 151–170.
- 23. Richards J.B., Evans D.M. Back to school to protect against coronary heart disease? BMJ. 2017;358 doi: 10.1136/bmj.j3849.
- 24. Lawson D.J., Davies N.M., Haworth S., Ashraf B., Howe L., Crawford A., Hemani G., Davey Smith G., Timpson N.J. Is population structure in the genetic biobank era irrelevant, a challenge, or an opportunity? Hum. Genet. 2020;139:23–41. doi: 10.1007/s00439-019-02014-8.
- 25. Abdellaoui A., Hugh-Jones D., Yengo L., Kemper K.E., Nivard M.G., Veul L., Holtz Y., Zietsch B.P., Frayling T.M., Wray N.R., et al. Genetic correlates of social stratification in Great Britain. Nat. Hum. Behav. 2019;3:1332–1342. doi: 10.1038/s41562-019-0757-5.
- 26. Abdellaoui A., Dolan C.V., Verweij K.J.H., Nivard M.G. Gene-environment correlations across geographic regions affect genome-wide association studies. Nat. Genet. 2022;54:1345–1354. doi: 10.1038/s41588-022-01158-0.
- 27. Haworth S., Mitchell R., Corbin L., Wade K.H., Dudding T., Budu-Aggrey A., Carslake D., Hemani G., Paternoster L., Smith G.D., et al. Apparent latent structure within the UK Biobank sample has implications for epidemiological analysis. Nat. Commun. 2019;10:333. doi: 10.1038/s41467-018-08219-1.
- 28. Bycroft C., Freeman C., Petkova D., Band G., Elliott L.T., Sharp K., Motyer A., Vukcevic D., Delaneau O., O’Connell J., et al. The UK Biobank resource with deep phenotyping and genomic data. Nature. 2018;562:203–209. doi: 10.1038/s41586-018-0579-z.
- 29. Leitsalu L., Haller T., Esko T., Tammesoo M.-L., Alavere H., Snieder H., Perola M., Ng P.C., Mägi R., Milani L., et al. Cohort Profile: Estonian Biobank of the Estonian Genome Center, University of Tartu. Int. J. Epidemiol. 2015;44:1137–1147. doi: 10.1093/ije/dyt268.
- 30. Milani L., Alver M., Laur S., Reisberg S., Haller T., Aasmets O., Abner E., Alavere H., Allik A., Annilo T., et al. The Estonian Biobank’s journey from biobanking to personalized medicine. Nat. Commun. 2025;16:3270. doi: 10.1038/s41467-025-58465-3.
- 31. Karczewski K.J., Gupta R., Kanai M., Lu W., Tsuo K., Wang Y., Walters R.K., Turley P., Callier S., Shah N.N., et al. Pan-UK Biobank genome-wide association analyses enhance discovery and resolution of ancestry-enriched effects. Nat. Genet. 2025;57:2408–2417. doi: 10.1038/s41588-025-02335-7.
- 32. Okbay A., Wu Y., Wang N., Jayashankar H., Bennett M., Nehzati S.M., Sidorenko J., Kweon H., Goldman G., Gjorgjieva T., et al. Polygenic prediction of educational attainment within and between families from genome-wide association analyses in 3 million individuals. Nat. Genet. 2022;54:437–449. doi: 10.1038/s41588-022-01016-z.
- 33. Sohail M., Maier R.M., Ganna A., Bloemendal A., Martin A.R., Turchin M.C., Chiang C.W., Hirschhorn J., Daly M.J., Patterson N., et al. Polygenic adaptation on height is overestimated due to uncorrected stratification in genome-wide association studies. eLife. 2019;8 doi: 10.7554/eLife.39702.
- 34. Berg J.J., Harpak A., Sinnott-Armstrong N., Joergensen A.M., Mostafavi H., Field Y., Boyle E.A., Zhang X., Racimo F., Pritchard J.K., Coop G. Reduced signal for polygenic adaptation of height in UK Biobank. eLife. 2019;8 doi: 10.7554/eLife.39725.
- 35. Colodro-Conde L., Couvy-Duchesne B., Whitfield J.B., Streit F., Gordon S., Kemper K.E., Yengo L., Zheng Z., Trzaskowski M., de Zeeuw E.L., et al. Association between population density and genetic risk for schizophrenia. JAMA Psychiatry. 2018;75:901–910. doi: 10.1001/jamapsychiatry.2018.1581.
- 36. Howe L.J., Nivard M.G., Morris T.T., Hansen A.F., Rasheed H., Cho Y., Chittoor G., Ahlskog R., Lind P.A., Palviainen T., et al. Within-sibship genome-wide association analyses decrease bias in estimates of direct genetic effects. Nat. Genet. 2022;54:581–592. doi: 10.1038/s41588-022-01062-7.
- 37. Pingault J.-B., O’Reilly P.F., Schoeler T., Ploubidis G.B., Rijsdijk F., Dudbridge F. Using genetic data to strengthen causal inference in observational research. Nat. Rev. Genet. 2018;19:566–580. doi: 10.1038/s41576-018-0020-3.
- 38. Young A.I., Benonisdottir S., Przeworski M., Kong A. Deconstructing the sources of genotype-phenotype associations in humans. Science. 2019;365:1396–1400. doi: 10.1126/science.aax3710.
- 39. Veller C., Coop G.M. Interpreting population- and family-based genome-wide association studies in the presence of confounding. PLoS Biol. 2024;22 doi: 10.1371/journal.pbio.3002511.
- 40. https://andmed.stat.ee/en/stat/rahvaloendus__rel2021__rahvastiku_paiknemine__elukoht-ja-soo-vanusjaotus/RL21003
- 41. Sjoberg O., Tammaru T. Transitional statistics: internal migration and urban growth in post-Soviet Estonia. Eur. Asia Stud. 1999;51:821–842. doi: 10.1080/09668139998732.
- 42. Lang T., Burneika D., Noorkõiv R., Plüschke-Altof B., Pociūtė-Sereikienė G., Sechi G. Socio-spatial polarisation and policy response: Perspectives for regional development in the Baltic States. Eur. Urban Reg. Stud. 2022;29:21–44. doi: 10.1177/09697764211023553.
- 43. Selzam S., Ritchie S.J., Pingault J.-B., Reynolds C.A., O’Reilly P.F., Plomin R. Comparing Within- and Between-Family Polygenic Score Prediction. Am. J. Hum. Genet. 2019;105:351–363. doi: 10.1016/j.ajhg.2019.06.006.
- 44. Howe L.J., Evans D.M., Hemani G., Davey Smith G., Davies N.M. Evaluating indirect genetic effects of siblings using singletons. PLoS Genet. 2022;18 doi: 10.1371/journal.pgen.1010247.
- 45. Malamud O., Wozniak A. The Impact of College on Migration: Evidence from the Vietnam Generation. J. Hum. Resour. 2012;47:913–950.
- 46. Haapanen M., Böckerman P. More educated, more mobile? Evidence from post-secondary education reform. Spat. Econ. Rev. 2017;12:8–26. doi: 10.1080/17421772.2017.1244610.
- 47. Xu X., Waltmann B., van der Erve L., Britton J. London calling? Higher education, geographical mobility and early-career earnings (The IFS) 2021.
- 48. Ojalehto E., Finkel D., Russ T.C., Karlsson I.K., Ericsson M. Influences of genetically predicted and attained education on geographic mobility and their association with mortality. Soc. Sci. Med. 2023;324 doi: 10.1016/j.socscimed.2023.115882.
- 49. Sotoudeh R., Trejo S., Harpak A., Conley D. Does standard adjustment for genomic population structure capture direct genetic effects? Preprint at bioRxiv. doi: 10.1101/2024.05.03.592431.
- 50. https://www.studyinestonia.ee/index.php/study/universities
- 51. Meissner B. The change in the social structure of Estonia. J. Balt. Stud. 1987;18:301–322. doi: 10.1080/01629778700000151.
- 52. Niedomysl T. How Migration Motives Change over Migration Distance: Evidence on Variation across Socio-economic and Demographic Groups. Reg. Stud. 2011;45:843–855. doi: 10.1080/00343401003614266.
- 53. Tork J. Eesti laste intelligents [The intelligence of Estonian children] Koolivara; Tartu, Estonia: 1940. pp. 295–307.
- 54. Must O., te Nijenhuis J., Must A., van Vianen A.E.M. Comparability of IQ scores over time. Intelligence. 2009;37:25–33. doi: 10.1016/j.intell.2008.05.002.
- 55. Plomin R., DeFries J.C., Loehlin J.C. Genotype-environment interaction and correlation in the analysis of human behavior. Psychol. Bull. 1977;84:309–322. doi: 10.1037/0033-2909.84.2.309.
- 56. Davies N.M., Hemani G., Neiderhiser J.M., Martin H.C., Mills M.C., Visscher P.M., Yengo L., Young A.S., Keller M.C. The importance of family-based sampling for biobanks. Nature. 2024;634:795–803. doi: 10.1038/s41586-024-07721-5.
- 57. von Hinke S., Vitt N. An analysis of the accuracy of retrospective birth location recall using sibling data. Nat. Commun. 2024;15:2665. doi: 10.1038/s41467-024-46781-z.
- 58. Domingue B.W., Fletcher J., Conley D., Boardman J.D. Genetic and educational assortative mating among US adults. Proc. Natl. Acad. Sci. USA. 2014;111:7996–8000. doi: 10.1073/pnas.1321426111.
- 59. Pirastu N., Cordioli M., Nandakumar P., Mignogna G., Abdellaoui A., Hollis B., Kanai M., Rajagopal V.M., Parolo P.D.B., Baya N., et al. Genetic analyses identify widespread sex-differential participation bias. Nat. Genet. 2021;53:663–671. doi: 10.1038/s41588-021-00846-7.
- 60. Kemper K.E., Yengo L., Zheng Z., Abdellaoui A., Keller M.C., Goddard M.E., Wray N.R., Yang J., Visscher P.M. Phenotypic covariance across the entire spectrum of relatedness for 86 billion pairs of individuals. Nat. Commun. 2021;12:1050. doi: 10.1038/s41467-021-21283-4.
- 61. Young A.I., Nehzati S.M., Benonisdottir S., Okbay A., Jayashankar H., Lee C., Cesarini D., Benjamin D.J., Turley P., Kong A. Mendelian imputation of parental genotypes improves estimates of direct genetic effects. Nat. Genet. 2022;54:897–905. doi: 10.1038/s41588-022-01085-0.
- 62. Young A.S. Estimation of Indirect Genetic Effects and Heritability under Assortative Mating. Preprint at bioRxiv. 2023. doi: 10.1101/2023.07.10.548458.
- 63. Abdellaoui A., Hottenga J.-J., Willemsen G., Bartels M., van Beijsterveldt T., Ehli E.A., Davies G.E., Brooks A., Sullivan P.F., Penninx B.W.J.H., et al. Educational attainment influences levels of homozygosity through migration and assortative mating. PLoS One. 2015;10 doi: 10.1371/journal.pone.0118935.
- 64. https://geoportaal.maaamet.ee/eng/Spatial-Data/Administrative-and-Settlement-Division-p312.html
- 65. Chang C.C., Chow C.C., Tellier L.C., Vattikuti S., Purcell S.M., Lee J.J. Second-generation PLINK: rising to the challenge of larger and richer datasets. GigaScience. 2015;4:7. doi: 10.1186/s13742-015-0047-8.
- 66. Manichaikul A., Mychaleckyj J.C., Rich S.S., Daly K., Sale M., Chen W.-M. Robust relationship inference in genome-wide association studies. Bioinformatics. 2010;26:2867–2873. doi: 10.1093/bioinformatics/btq559.
- 67. Lloyd-Jones L.R., Zeng J., Sidorenko J., Yengo L., Moser G., Kemper K.E., Wang H., Zheng Z., Magi R., Esko T., et al. Improved polygenic prediction by Bayesian multiple regression on summary statistics. Nat. Commun. 2019;10:5086. doi: 10.1038/s41467-019-12653-0.
- 68. Yang J., Lee S.H., Goddard M.E., Visscher P.M. GCTA: a tool for genome-wide complex trait analysis. Am. J. Hum. Genet. 2011;88:76–82. doi: 10.1016/j.ajhg.2010.11.011.
- 69. https://andmed.stat.ee/en/stat/rahvastik__rahvastikunaitajad-ja-koosseis__rahvaarv-ja-rahvastiku-koosseis/RV021
- 70. Browning B.L., Zhou Y., Browning S.R. A One-Penny Imputed Genome from Next-Generation Reference Panels. Am. J. Hum. Genet. 2018;103:338–348. doi: 10.1016/j.ajhg.2018.07.015.
- 71. Mitt M., Kals M., Pärn K., Gabriel S.B., Lander E.S., Palotie A., Ripatti S., Morris A.P., Metspalu A., Esko T., et al. Improved imputation accuracy of rare and low-frequency variants using population-specific high-coverage WGS-based imputation reference panel. Eur. J. Hum. Genet. 2017;25:869–876. doi: 10.1038/ejhg.2017.51.
- 72. Privé F. Using the UK Biobank as a global reference of worldwide populations: application to measuring ancestry diversity from GWAS summary statistics. Bioinformatics. 2022;38:3477–3480. doi: 10.1093/bioinformatics/btac348.
- 73. Vihalemm T. Crystallizing and emancipating identities in post-communist Estonia. Natly. Pap. 2007;35:477–502. doi: 10.1080/00905990701368738.
- 74. Price A.L., Weale M.E., Patterson N., Myers S.R., Need A.C., Shianna K.V., Ge D., Rotter J.I., Torres E., Taylor K.D., et al. Long-range LD can confound genome scans in admixed populations. Am. J. Hum. Genet. 2008;83:132–139. doi: 10.1016/j.ajhg.2008.06.005. author reply 135–139.
- 75. http://www.cog-genomics.org/plink/2.0/
- 76. Abraham G., Qiu Y., Inouye M. FlashPCA2: principal component analysis of Biobank-scale genotype datasets. Bioinformatics. 2017;33:2776–2778. doi: 10.1093/bioinformatics/btx299.
- 77. Schoeler T., Speed D., Porcu E., Pirastu N., Pingault J.-B., Kutalik Z. Participation bias in the UK Biobank distorts genetic associations and downstream analyses. Nat. Hum. Behav. 2023;7:1216–1227. doi: 10.1038/s41562-023-01579-9.
- 78. R Core Team. R Foundation for Statistical Computing; Vienna, Austria: 2020. R: A Language and Environment for Statistical Computing.
- 79. Pebesma E.J., Bivand R. CRC Press, Taylor & Francis Group; 2023. Spatial Data Science: With Applications in R.
- 80. Pebesma E. Simple features for R: Standardized support for spatial vector data. R J. 2018;10:439. doi: 10.32614/rj-2018-009.
- 81. Dunnington D., Pebesma E. 2023. geos: Open Source Geometry Engine (’GEOS’) R API. R package version 0.2.4.
- 82. Wickham H. Springer-Verlag New York; 2016. ggplot2: Elegant Graphics for Data Analysis.
- 83. Barclay K.J. The birth order paradox: Sibling differences in educational attainment. Res. Soc. Stratif. Mobil. 2018;54:56–65. doi: 10.1016/j.rssm.2018.02.001.
- 84. North B.V., Curtis D., Sham P.C. A note on the calculation of empirical P values from Monte Carlo procedures. Am. J. Hum. Genet. 2002;71:439–441. doi: 10.1086/341527.
- 85. Davison A.C., Hinkley D.V. Cambridge University Press; 2013. Cambridge Series in Statistical and Probabilistic Mathematics: Bootstrap Methods and Their Application Series Number 1.
- 86. Lee S.H., Yang J., Goddard M.E., Visscher P.M., Wray N.R. Estimation of pleiotropy between complex diseases using single-nucleotide polymorphism-derived genomic relationships and restricted maximum likelihood. Bioinformatics. 2012;28:2540–2542. doi: 10.1093/bioinformatics/bts474.
- 87. Dempster E.R., Lerner I.M. Heritability of Threshold Characters. Genetics. 1950;35:212–236. doi: 10.1093/genetics/35.2.212.
Data Availability Statement
- Access to the Estonian Biobank data (https://genomics.ut.ee/en/content/estonian-biobank) is restricted to approved researchers and can be requested.
- Custom R code used for statistical analyses is publicly available on Zenodo (https://doi.org/10.5281/zenodo.17497533) and GitHub (https://github.com/ivkuz/GeneticMigrationStructureEstonia).
- Any additional information required to reanalyze the data reported in this article is available from the lead contact upon request.







