PLOS Genetics. 2021 Mar 4;17(3):e1009347. doi: 10.1371/journal.pgen.1009347

Changes in the fine-scale genetic structure of Finland through the 20th century

Sini Kerminen 1, Nicola Cerioli 2, Darius Pacauskas 2,3, Aki S Havulinna 1,4, Markus Perola 4, Pekka Jousilahti 4, Veikko Salomaa 4, Mark J Daly 1,5, Rupesh Vyas 2, Samuli Ripatti 1,5,6, Matti Pirinen 1,6,7,*
Editor: Simon Gravel 8
PMCID: PMC7932171  PMID: 33661898

Abstract

Information about individual-level genetic ancestry is central to population genetics, forensics and genomic medicine. So far, studies have typically considered genetic ancestry on a broad continental level, and there is much less understanding of how more detailed genetic ancestry profiles can be generated and how accurate and reliable they are. Here, we assess these questions by developing a framework for individual-level ancestry estimation within a single European country, Finland, and we apply the framework to track changes in the fine-scale genetic structure throughout the 20th century. We estimate the genetic ancestry for 18,463 individuals from the National FINRISK Study with respect to up to 10 genetically and geographically motivated Finnish reference groups and illustrate the annual changes in the fine-scale genetic structure over the decades from 1920s to 1980s for 12 geographic regions of Finland. We detected major changes after a sudden, internal migration related to World War II from the region of ceded Karelia to the other parts of the country as well as the effect of urbanization starting from the 1950s. We also show that while the level of genetic heterogeneity in general increases towards the present day, its rate of change has considerable differences between the regions. To our knowledge, this is the first study that estimates annual changes in the fine-scale ancestry profiles within a relatively homogeneous European country and demonstrates how such information captures a detailed spatial and temporal history of a population. We provide an interactive website for the general public to examine our results.

Author summary

We have inherited our genomes from our parents, who, in turn, inherited their genomes from their parents, and so on. Hence, a comparison between genomes of present-day individuals reveals genetic population structure due to the varying levels of genetic relatedness among the individuals. We have utilized over 18,000 Finnish samples to characterize the fine-scale genetic population structure in Finland starting from a binary East-West division and ending up with 10 Finnish source populations. Furthermore, we have applied the resulting ancestry information to generate records of how the population structure has evolved each year between 1923 and 1987 in 12 geographical regions of Finland. For example, the war-related evacuation of Karelians from Southeast Finland to other parts of the country shows up as a clear, sudden increase in the Evacuated ancestry elsewhere in Finland between 1939 and 1945. Additionally, different regions of Finland show very different levels of genetic mixing in the 1900s, from little-mixed regions like Ostrobothnia to highly mixed regions like Southwestern Finland. To distribute the results to the general public, we provide an interactive website for browsing the municipality- and region-level genetic ancestry profiles at https://geneviz.aalto.fi/genetic_ancestry_finland/

Introduction

A genetic ancestry profile of an individual tells which proportion of the individual’s genome originates from each of the available reference populations. Such a profile provides a unique view to the individual’s personal history, is a crucial component in emerging genomic medicine[1] and is central for forensic genetics[2]. Therefore, it is of great interest to determine how detailed an ancestry profile we are able to robustly generate by current data resources and computational methods.

A variety of methods have been proposed to estimate genetic structure and individual-level ancestry[3–5]. The most popular methods, such as f-statistics (ADMIXTOOLS)[6], principal component analysis (PCA) (e.g. EIGENSOFT)[7] or ADMIXTURE[8], are based on genotype frequencies of independent variants and explore ancestry on a time-scale of hundreds of generations. Thus, these methods may not be optimal for the studies of recent past or fine-scale relationships between subpopulations. In contrast, current haplotype-based methods, such as FineSTRUCTURE[9], detect population genetic differences in striking detail, for example, in Britain and Ireland[10–13], Japan[14], Italy[15] and France[16], and provide better resolution for genetic structure in the recent past. So far, the haplotype-based methods, such as GLOBETROTTER[17] and SOURCEFIND[18], have been applied to estimate ancestry and date admixture from relatively broad geographic areas, for example, in Europe[11, 13, 19, 20], Africa[21–23] and Eurasia[24, 25]. Consequently, there remains limited information about the accuracy and robustness of individual-level ancestry estimation using fine-scale source populations. In this study, we assess these questions within a single European country, Finland.

The Finnish population has been widely used in human genetic studies[26], and due to an ongoing large-scale biobank collection of 500,000 samples by the FinnGen project (https://www.finngen.fi/en), Finland will likely remain one of the most accessible and best characterized populations for future research in human genetics. We have previously characterized the fine-scale genetic structure of Finland that prevailed during the first half of the 20th century[27, 28], before urbanization and the large-scale migration events related to World War II. During and after the war (from 1939 to 1945), over 400,000 individuals (11% of the total population) left their homes as Finland lost parts of its eastern territories to the Soviet Union (Fig 1). Almost 70% of the evacuees were relocated to the southern or western parts of Finland, while around 25% were relocated to Eastern Finland and 5% were assigned to the northern parts of the country[29]. Later, starting from the 1950s, urbanization has shaped the population distribution within Finland[30], and now the biggest cities are located in the southern and western parts of the country (Fig 1). These two well-documented migration events, together with over 18,000 samples from the National FINRISK Study with genotype, birth year (range 1923–1987) and birth place information at the municipality level, provide us with an opportunity to study how demographic changes are reflected in genetic population structure at an unprecedented spatial and temporal detail.

Fig 1. Map of Finland with the study regions.

Map of Finland showing historically motivated regions that we use to track changes in genetic ancestry. Dotted lines show the regions ceded to the Soviet Union (Ceded Karelia (CKA), Petsamo, and Salla) after World War II. Today these three regions are part of Russia. Purple arrows show the main patterns of how evacuees from CKA were relocated to other parts of Finland. Diamonds mark the four largest metropolitan areas in Finland: Helsinki, Turku, Tampere and Oulu. ÅLA: Åland islands, SOF: Southern Finland, SWF: Southwestern Finland, TAV: Tavastia, SKA: South Karelia, CKA: Ceded Karelia, NKA: North Karelia, SSA: South Savo, NSA: North Savo, CNF: Central Finland, OST: Ostrobothnia, NOS: North Ostrobothnia, KAI: Kainuu, LAP: Lapland.

In this study, we will first define genetically and geographically homogeneous reference groups within Finland at three levels of detail and evaluate their ability to detect ancestry from various geographic regions via simulations. Second, we estimate the genetic ancestry profiles of over 18,000 Finnish individuals with respect to the reference groups using a haplotype-based method SOURCEFIND[18]. By stratifying the ancestry profiles based on birth year and birth place, we track the annual changes in the fine-scale genetic structure of Finland through the 20th century. The results demonstrate high accuracy to detect both the fine-scale individual-level ancestry profiles, as well as the sudden internal migration of the World War II evacuees and the region-specific rates in genetic diversification due to urbanization in Finland.

To our knowledge, this is the first study to track geographically the annual genetic contributions of subpopulations within a single European country. Our work concretely demonstrates continuous genetic mixing within current human populations. To convey our results to the general public, we provide a website for interactive examination of our results.

Results

To create and test a framework for individual-level ancestry estimation within Finland, we apply the haplotype-based computational methods ChromoPainter, FineSTRUCTURE[9] and SOURCEFIND[18] to a geographically and genetically informed data set from the National FINRISK Study[31]. The general validity of these computational methods has been evaluated previously (e.g. [9, 10] for ChromoPainter/FineSTRUCTURE and [18] for SOURCEFIND). Therefore, we focus on the particular application of these methods to create a reliable and easily interpretable individual-level ancestry estimation framework within Finland. We will first introduce a procedure to identify suitable reference groups, then we use simulations to test the performance of our reference groups in detecting ancestry, and finally apply them to estimate the ancestry of 18,463 FINRISK samples to characterize the fine-scale changes in the genetic structure of Finland through the 20th century. S1 Fig describes the workflow of the study.

Identifying reference groups for ancestry estimation within Finland

Our first goal was to define geographically and genetically motivated reference groups within Finland to serve as a basis for robust individual-level ancestry estimation. We did this on several levels of detail, starting from the binary division between eastern and western Finland, and further refining the sources of genetic ancestry to 6 and finally to 10 reference groups within Finland, as explained below. We refer to these three sets of reference groups with the term “refset” as an abbreviation for “reference set”.

We used the parents’ municipalities of birth to identify 2,741 geographically precisely located and evenly distributed unrelated individuals (see Materials and methods for detailed description). We call these individuals “reference candidates” as we later further filtered them down to form our final reference groups. The reference candidates were analyzed with ChromoPainter and FineSTRUCTURE to identify groups that represented the fine-scale population structure in Finland (Materials and methods). The fine-scale population structure can be studied by cutting the hierarchical tree from FineSTRUCTURE (FS-tree) at different levels. In S3 Fig, we confirm that our results closely match previous results on the fine-scale structure in Finland[27, 32], showing a main division between the east and the west that further splits into dozens of geographically well-defined and fairly equal-sized subpopulations throughout the country.

To identify statistically separable reference groups, we evaluated the fine-scale populations with the following procedure:

  1. Starting level

    Choose K, the starting number of FineSTRUCTURE populations from the FS-tree.

  2. Initial ancestry

    Estimate the genetic ancestry of the reference candidates with respect to the K populations using SOURCEFIND.

  3. Identity proportions

    For each of the K populations, calculate the population’s identity proportion as the average proportion of ancestry in that particular population across the individuals assigned to that population by FineSTRUCTURE.

  4. Population exclusions

    Exclude the populations with low identity proportions (< 50%) from the reference candidates, decrease K accordingly, and repeat from step 2. If no population is excluded, proceed to step 5.

  5. Candidate exclusions

    Exclude the reference candidates who show low levels of ancestry from the population they were assigned to by FineSTRUCTURE. (Thresholds used either < 70% or < 95%.)

  6. Geographic outliers

    Exclude possible geographic outliers manually (S4 Fig).
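Steps 3 to 5 of the procedure can be illustrated with a minimal sketch. It assumes that the FineSTRUCTURE assignments and the SOURCEFIND ancestry estimates are already available as plain Python data; the function names, the data layout and the toy numbers are hypothetical and are not part of the study's pipeline. The sketch does not rerun SOURCEFIND after a population is excluded, which step 4 requires in practice.

```python
from collections import defaultdict

def identity_proportions(assigned, ancestry):
    """Step 3: mean ancestry in one's own population, per population.

    assigned: population label given to each individual (hypothetical input).
    ancestry: one dict per individual mapping population label -> proportion.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for pop, profile in zip(assigned, ancestry):
        totals[pop] += profile.get(pop, 0.0)
        counts[pop] += 1
    return {pop: totals[pop] / counts[pop] for pop in counts}

def low_identity_populations(assigned, ancestry, threshold=0.50):
    """Step 4: populations whose identity proportion is below the threshold."""
    props = identity_proportions(assigned, ancestry)
    return sorted(pop for pop, value in props.items() if value < threshold)

def low_identity_candidates(assigned, ancestry, threshold=0.70):
    """Step 5: indices of candidates with low ancestry from their own population."""
    return [i for i, (pop, profile) in enumerate(zip(assigned, ancestry))
            if profile.get(pop, 0.0) < threshold]

# Toy data (invented for illustration only).
assigned = ["A", "A", "B", "B"]
ancestry = [
    {"A": 0.9, "B": 0.1},
    {"A": 0.7, "B": 0.3},
    {"A": 0.6, "B": 0.4},
    {"A": 0.7, "B": 0.3},
]
props = identity_proportions(assigned, ancestry)
excluded_pops = low_identity_populations(assigned, ancestry)
excluded_candidates = low_identity_candidates(assigned, ancestry)
```

In a full run, the populations returned by `low_identity_populations` would be dropped, the ancestry re-estimated against the remaining populations, and the check repeated until no population falls below the threshold.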

As the East-West division in genetic structure is relatively strong in Finland[33], we first focused on the top level of the FS-tree which divided our reference candidates into 926 western and 1,815 eastern individuals. Both populations showed a high identity proportion (> 87%), and we excluded the reference candidates whose both eastern and western ancestry components were below 95% or who were geographically located on the opposite part of Finland compared to their dominant genetic component (S4 Fig). A comparison between Figs 2A and S3A shows that this procedure excluded reference candidates from the borders of the two populations and resulted in geographically more tightly defined reference groups. The final sample sizes of the reference groups of our refset 2 were 497 in western and 975 in eastern Finland (Fig 2A).

Fig 2. Geographic location of reference individuals.

The geographic location of the reference individuals in the reference groups of A) refset 2 (n = 1,472), B) refset 6 (n = 1,026) and C) refset 10 (n = 1,236) marked with colors. The names of the reference groups are shown on the right. The locations were determined as the mean of the parents’ municipalities of birth.

To test whether we can detect more detailed ancestry within Finland, we continued by considering the FS-tree of the reference candidates at the level of the first 15 populations. The 15 populations showed differing identity proportions (S5A Fig), suggesting that some populations were more mixed and/or so closely related to some other populations that they could not be reliably distinguished from the other populations. To ensure robust ancestry estimation, we excluded the populations whose identity proportion was below 50%. This left us with 10 populations. We reran the ancestry estimation for the remaining reference candidates using the 10 populations as reference groups and recalculated the identity proportions. The 10 populations were clearly split into two groups: 6 populations with identity proportion around 80% and 4 populations with identity proportion around 70% (S5B Fig). In what follows, we will study the detailed ancestry estimation using both the more clearly distinguishable 6 populations (refset 6), and also the 10 populations (refset 10) that provide a larger geographic coverage (Fig 2B and 2C). The reference individuals at refset 10 are included in their corresponding population at refset 6 (except for the four refset 10 populations that were not present at refset 6). Furthermore, from the reference groups in refset 6 or in refset 10, we excluded the individuals who showed an identity proportion below 70%, as well as 11 refset-6 and 16 refset-10 individuals who were geographically outside of their own population (S4 Fig). S6 Table shows the number of excluded or included reference candidates. The distributions of birth years for the reference groups are shown in S6 Fig and S4 Table.

To avoid confusion between the names of the geographic regions and the names of the genetic ancestry groups we use the following conventions. For the names of the geographic regions, we use their full names or the three letter abbreviations (e.g. North Karelia or NKA) given in Fig 1. When referring to the genetic ancestry from a reference group, we use the prefix R with the index of a refset and a name describing the geographic location of the reference group, as given in Fig 2 (e.g. R2-East or R10-Evacuated). Finally, when we evaluate the framework through simulations, we refer to the “ancestor groups” similarly to the reference groups but using the prefix A instead of R (e.g. A-East), as described in S9 Fig.

Table 1 shows that genetically the most distant reference groups are R10-Bothnia and R10-East_Lapland with FST ~0.007 (computed with EIGENSOFT[7]). The smallest difference (FST ~0.002) appears between R10-Evacuated and R10-Southwest, R10-Central_Finland and R10-Savo-Karelia highlighting the status of R10-Evacuated between the east and the west. Also, R10-Savo-Karelia and R10-Kainuu are closely related groups.

Table 1. Pairwise-FST values (×10^5) between reference groups of refset 10 (lower triangular) and the corresponding standard errors (upper triangular).

| | R10-Southwest | R10-Bothnia | R10-Kokkola | R10-Evacuated | R10-Central_Finland | R10-Savo-Karelia | R10-Kainuu | R10-Kuusamo | R10-West_Lapland | R10-East_Lapland |
| R10-Southwest | - | 5 | 5 | 5 | 6 | 5 | 8 | 8 | 6 | 12 |
| R10-Bothnia | 251 | - | 8 | 7 | 8 | 8 | 10 | 9 | 8 | 14 |
| R10-Kokkola | 267 | 338 | - | 7 | 9 | 7 | 10 | 10 | 9 | 14 |
| R10-Evacuated | 238 | 371 | 343 | - | 6 | 4 | 7 | 7 | 7 | 11 |
| R10-Central_Finland | 268 | 397 | 367 | 160 | - | 6 | 8 | 9 | 9 | 14 |
| R10-Savo-Karelia | 404 | 498 | 405 | 185 | 226 | - | 5 | 6 | 7 | 11 |
| R10-Kainuu | 546 | 637 | 542 | 362 | 380 | 219 | - | 8 | 10 | 13 |
| R10-Kuusamo | 538 | 639 | 544 | 392 | 423 | 302 | 363 | - | 9 | 12 |
| R10-West_Lapland | 369 | 505 | 472 | 415 | 454 | 497 | 632 | 602 | - | 13 |
| R10-East_Lapland | 557 | 661 | 582 | 503 | 545 | 514 | 618 | 538 | 497 | - |

Identifiability of ancestry from reference groups

We tested the identifiability of ancestry from different reference groups using simulations where 2^G individuals were sampled to represent the ancestors from G generations back in time (G varied between 1 and 5). We simulated the meioses within these ancestors, and within their subsequent descendants in generations G-1, G-2, …, 1, to determine the genotypes of the target individual at generation 0. The ancestry of the target individual was then estimated and compared to the expected ancestry groups of the sampled ancestors based on their geographic and genetic origin (see S7 Fig for a schematic representation).
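The pedigree logic of such a simulation can be sketched with a toy model. The code below is not the recombination model of the study (that is described in its Materials and methods); it assumes, purely for illustration, a fixed crossover probability between adjacent markers and haplotypes stored as Python lists. All names and parameter values are hypothetical.

```python
import random

def simulate_gamete(hap_a, hap_b, crossover_prob, rng):
    """Copy one recombinant haplotype from a parent's two haplotypes."""
    if len(hap_a) != len(hap_b):
        raise ValueError("haplotypes must have equal length")
    sources = (hap_a, hap_b)
    current = rng.randrange(2)  # start copying from a random haplotype
    gamete = []
    for i in range(len(hap_a)):
        # Toy assumption: the same crossover probability between all markers.
        if i > 0 and rng.random() < crossover_prob:
            current = 1 - current
        gamete.append(sources[current][i])
    return gamete

def simulate_descendant(founders, crossover_prob, rng):
    """Mate 2^G founders pairwise, generation by generation, down to one target.

    Each individual is a pair (haplotype, haplotype).
    """
    n = len(founders)
    if n < 1 or n & (n - 1):
        raise ValueError("number of founders must be a power of two")
    generation = list(founders)
    while len(generation) > 1:
        generation = [
            (simulate_gamete(*generation[i], crossover_prob, rng),
             simulate_gamete(*generation[i + 1], crossover_prob, rng))
            for i in range(0, len(generation), 2)
        ]
    return generation[0]

# G = 2: three founders labelled "W" and one labelled "E" at every marker.
rng = random.Random(1)
west = (["W"] * 100, ["W"] * 100)
east = (["E"] * 100, ["E"] * 100)
target = simulate_descendant([west, west, west, east], 0.02, rng)
east_fraction = sum(h.count("E") for h in target) / 200
```

Labelling every marker by founder origin, as done here, makes the realized ancestry of the target directly countable, so it can be compared with the expectation from the pedigree.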

Refset 2

First, we tested the identifiability of eastern vs. western ancestry by SOURCEFIND using our refset 2. We simulated individuals using four scenarios where either all ancestors came from the same region (All-West, All-East) or one ancestor came from A-East and the remaining ancestors came from A-West (Almost-West) or vice versa (Almost-East). The set of ancestor candidates was disjoint from our reference individuals, and they were chosen by their parents’ geographic location and a standard principal component analysis (S8 Fig), with their locations shown in S9A Fig. For each scenario, and for each number of generations from 1 to 5, we randomly chose the ancestors (2, 4, 8, 16 or 32 ancestors depending on the number of generations), simulated the meioses from the ancestors to the target individual by sampling a recombination process in the ancestral haplotypes (Materials and methods), and estimated the ancestry profile of the target individual. Fig 3 shows the average ancestry profiles of 20 simulated individuals (S10 Fig shows all the 20 individual ancestry profiles) and demonstrates that individuals with a single origin (All-West or All-East) show high levels of ancestry (>96%) from the expected reference group. For the individuals with mixed background (Almost-West, Almost-East), the estimated ancestry in the minor reference group decreases approximately as expected, that is, as 1/2^G when 1 out of 2^G ancestors G generations back in time comes from the minor reference group. We noticed that we always estimate at least some small proportion (~3%) from both of the refset-2 reference groups, even when all ancestors were chosen from a single ancestor group. While this may well reflect a small but real ancestry proportion in our ancestor candidates, it also suggests that an upper limit of a reliable detection of direct refset-2 ancestry is 4 generations (6.25% of ancestry) rather than 5 generations (3.13% of ancestry) back in time. In S1 Table, we explicitly tested whether we can distinguish when an individual has all ancestors G generations back in time from a single source (e.g. All-West) from the case when one of the ancestors is from the different source (e.g. Almost-West). The results verify that we can identify correctly the east-west origin of ancestors 4 generations back in time for over 75% of the individuals but that we cannot reliably do the same 5 generations back in time.
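The expected minor ancestry in the Almost-West and Almost-East scenarios follows from simple arithmetic, sketched below. The function name and the `noise_floor` variable are illustrative; the 3% value is the approximate background proportion reported above, used here only as a rough comparison threshold.

```python
def expected_minor_fraction(generations, minor_ancestors=1):
    """Expected ancestry share when `minor_ancestors` of the 2^G ancestors
    G generations back come from the minor group."""
    return minor_ancestors / 2 ** generations

# Approximate background proportion seen even in single-origin simulations.
noise_floor = 0.03

expected = {g: expected_minor_fraction(g) for g in range(1, 6)}
# Generations at which the expected signal is at least twice the background.
clearly_above_noise = [g for g, frac in expected.items() if frac >= 2 * noise_floor]

for g, frac in expected.items():
    print(f"G={g}: expected minor ancestry = {frac:.4%}")
```

With one minor ancestor, the expected share at G = 5 (1/32) is about the same size as the background proportion, which is consistent with the limit of 4 generations discussed above. The factor of two in `clearly_above_noise` is an arbitrary illustrative choice, not a criterion from the study.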

Fig 3. Average ancestry profiles for simulation scenarios between East and West.

Average ancestry profiles for four simulation scenarios and up to 5 generations back in time (G1–G5). In All-West and All-East scenarios all ancestors from G = 1…5 generations back in time originate from a single ancestor candidate group, A-West or A-East, respectively. In Almost-West and Almost-East scenarios a single ancestor, G generations back in time, originates from the other group, A-East or A-West, respectively. Each bar shows the average over 20 simulated individuals (individually shown in S10 Fig) with 95% confidence intervals. The colors denote the ancestry profile with respect to refset 2 (Fig 2A): R2-West in red and R2-East in blue.

Refsets 6 and 10

To test the accuracy of the detailed ancestry estimation using either refset 6 or refset 10, we simulated individuals with the ancestor candidates shown in S9B Fig. These ancestor candidates had both of their parents born within 80 km, had similar genetic background based on PCA (S8 Fig), and originated approximately from the same geographical regions as the individuals in the reference groups (Materials and methods). The ancestor candidates covered well the geographic regions of the refset 6 reference groups and, in addition, the region of the R10-Evacuated reference group. For the remaining three reference groups in refset 10 (R10-Central_Finland, R10-Kokkola and R10-East_Lapland), we did not have enough ancestor candidates outside the reference group to perform simulations.

Fig 4A shows the mean of the estimated ancestry for the simulated individuals whose both parents originated from one geographic region (single origin) using refset 2, refset 6 or refset 10 (S12 Fig shows the individual ancestry profiles). For the single origin individuals from A-Southwest, A-Bothnia, A-N_Karelia, A-Kainuu and A-Kuusamo our approach estimated high ancestry proportions (~80%) for their closest reference group at all three refsets. For A-Lapland, however, we estimated more varying ancestry proportions with respect to refset-6 or refset-10, and, on average, only 60% of ancestry was from the reference groups geographically located in Lapland (LAP in Fig 1). A comparison of pairwise-FST between the pairs of reference groups and groups of ancestor candidates (S2 Table) showed that the pairs of reference-ancestor groups in Lapland were significantly more distant from each other than other geographically close reference-ancestor pairs, suggesting that the lower ancestry estimates are likely due to real genetic differences between the reference groups and the ancestor candidates in Lapland. The individuals simulated with ancestors from the A-Evacuated group showed mixed ancestry with refsets 2 and 6, whereas with refset 10 they showed about 75% ancestry from the R10-Evacuated reference group. This demonstrates that the interpretation of ancestry profiles depends crucially on the reference groups available. In addition, we saw that, with both refset 6 and refset 10, each reference group included in the analysis showed on average at least ~2% contribution in the single-origin simulation settings (S3 Table). We noticed that by shrinking the ancestry proportions below 5% to zero and by rescaling the remaining ancestry proportions to 100%, the total ancestry from the expected reference groups increased and the average unexpected ancestry proportions reduced considerably (S11 Fig).
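The shrinking and rescaling step described at the end of the paragraph can be written as a short function. This is a minimal sketch under the assumption that "below 5%" means strictly less than 0.05; the function name and the example profile are hypothetical and the numbers are invented.

```python
def shrink_and_rescale(profile, threshold=0.05):
    """Set ancestry proportions below the threshold to zero and rescale
    the remaining proportions so that they sum to 1."""
    kept = {group: p for group, p in profile.items() if p >= threshold}
    total = sum(kept.values())
    if total == 0:
        raise ValueError("no ancestry proportion reaches the threshold")
    return {group: kept.get(group, 0.0) / total for group in profile}

# Invented profile with two small components of the kind seen in simulations.
raw = {
    "R10-Southwest": 0.80,
    "R10-Bothnia": 0.12,
    "R10-Evacuated": 0.04,
    "R10-Kainuu": 0.04,
}
cleaned = shrink_and_rescale(raw)
```

Because every small component is removed, the share of the dominant groups can only stay the same or grow after rescaling, which matches the direction of the effect reported in S11 Fig.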

Fig 4. Average ancestry profiles for detailed simulation scenarios.

The average ancestry for 20 simulated individuals whose parents originate from the ancestor candidate groups described in S9 Fig. Panel A presents individuals all of whose ancestors come from one group (single origin) estimated using refset 2, 6 or 10. Panel B presents individuals for whom 2^G - 1 ancestors, where G = 1…4 is the number of generations, originate from A-Southwest and 1 ancestor originates from the ancestor group in the title, estimated using refset 10. The colors correspond to the reference groups in Fig 2.

To examine how well our approach can detect one ancestor coming from one region while all other ancestors come from another region, we simulated additional generations back in time, analogously to the earlier simulations with refset-2. Fig 4B shows the results when all but one ancestor come from A-Southwest, and the single remaining ancestor originates from one of the seven ancestor groups (see S13 Fig for results for refsets 2 and 6 and when the majority of ancestors come from A-N_Karelia and S14 Fig for individual-level results for the major ancestry component). The results demonstrate that the major source of ancestry is estimated as expected, i.e., corresponds to the geographically closest reference group, for generations 1, 2 and 3 back in time, but remains smaller than expected for generation 4. The minor source of ancestry is always underestimated, and starting from generation 3, the original source of ancestry is not clearly distinguishable from other more geographically distant reference groups.

We observe that when simulating individuals that have one ancestor from A-Southwest and one from A-N_Karelia (Fig 4B), we detect almost 14% ancestry in R10-Evacuated reference group, which is more than expected based on the simulated individuals with a single origin in either A-Southwest or A-N_Karelia, who showed on average about 7% ancestry in R10-Evacuated. This suggests that we may overestimate the R10-Evacuated component for an individual who happens to be mixed between eastern and western Finland, possibly because the R10-Evacuated group itself shows some mixing between R2-East and R2-West (Fig 4A A-Evacuated). This observation needs to be kept in mind later when we interpret the patterns of ancestry from R10-Evacuated in the regions of SWF and NKA. Other pairs of regions do not show similar results of mixing (S15 Fig).

The results averaged over individuals demonstrated that our reference groups are able to accurately detect ancestry all around Finland. With refset 2, we can identify Eastern and Western ancestry up to an accuracy of 6% (4 generations back). With refsets 6 and 10, the major source of ancestry is accurately detected 3 generations back in time and while the proportion of the minor ancestry is underestimated, the source of it can be identified 2 generations back. On the other hand, at the level of individual, the ancestry estimates show increasing variance with more heterogeneous genetic background (S10, S12 and S14 Figs), which makes precise conclusions about genetic ancestry challenging for any one individual. Additionally, our ancestor candidates are likely to be less genetically mixed than an average individual with Finnish ancestry; hence our results do not necessarily directly apply to the Finnish individuals whose ancestors are more mixed.

Changes in the genetic structure of Finland during the 20th century

We then applied our ancestry estimation approach to generate ancestry profiles for 18,463 individuals in the FINRISK Study using the refsets 2, 6, and 10. We also included the reference individuals in the analysis but each reference individual itself was excluded from its reference group when estimating its own ancestry profile.

To study the genetic ancestry patterns across the country, we grouped the individuals into 12 groups based on their region of birth. These 12 regions covered the whole of mainland Finland (Fig 1) but not the Åland islands (ÅLA). For each region, we averaged the ancestry profiles over individuals’ birth years by fitting a local regression (LOESS) curve (Materials and methods). Fig 5 shows the temporal changes in the ancestry profile of six regions using refset 10 (see S17–S19 Figs for all regions with refsets 2, 6 and 10). First, we detected that the regions are often genetically dominated by their geographically closest reference group, but, in many regions, the proportion of that reference group decreased towards the present, indicating increasing levels of genetic mixing. The most dramatic changes occurred in SOF (23 percentage point decrease in the closest reference group between 1930 and 1980), SWF (22 p.p.), TAV (28 p.p.) and CNF (21 p.p.), while SKA, KAI and NOS did not show considerable changes in their dominant components. The same trends could be detected also with respect to refset 2 (S17 Fig) or refset 6 (S18 Fig).
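The idea behind the local regression can be sketched as a locally weighted linear fit evaluated at a single birth year. The study's actual LOESS settings are given in its Materials and methods; the tricube kernel, the local linear degree and the span `frac=0.3` used here are assumptions for illustration, and the function names are hypothetical.

```python
import math

def tricube(u):
    """Tricube kernel: positive for |u| < 1 and zero otherwise."""
    u = abs(u)
    return (1 - u ** 3) ** 3 if u < 1 else 0.0

def local_linear_fit(years, values, year0, frac=0.3):
    """Smoothed value at year0 from a weighted straight-line fit to the
    nearest fraction `frac` of the observations."""
    n = len(years)
    k = min(n, max(3, math.ceil(frac * n)))
    # Bandwidth: distance to the k-th nearest observation.
    bandwidth = sorted(abs(x - year0) for x in years)[k - 1] or 1.0
    weights = [tricube((x - year0) / bandwidth) for x in years]
    sw = sum(weights)
    if sw == 0:
        raise ValueError("no observation receives positive weight")
    mx = sum(w * x for w, x in zip(weights, years)) / sw
    my = sum(w * y for w, y in zip(weights, values)) / sw
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(weights, years))
    sxy = sum(w * (x - mx) * (y - my)
              for w, x, y in zip(weights, years, values))
    slope = sxy / sxx if sxx > 0 else 0.0
    return my + slope * (year0 - mx)

# Toy series: an ancestry proportion declining linearly with birth year.
years = list(range(1923, 1988))
values = [0.5 - 0.002 * (y - 1923) for y in years]
smoothed_1950 = local_linear_fit(years, values, 1950)
# Percentage point change between two smoothed birth years.
change_pp = 100 * (local_linear_fit(years, values, 1980)
                   - local_linear_fit(years, values, 1930))
```

Evaluating the fit over every birth year gives one curve per reference group and region. The difference between two smoothed values, as in `change_pp`, corresponds to the percentage point changes quoted above.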

Fig 5. Development of genetic ancestry through the 20th century.

The map names 14 regions and shows the locations of the individuals who form the 10 reference groups. Each panel corresponds to one region and shows 10 curves corresponding to the 10 reference groups. Each curve shows the estimated proportion of ancestry from the reference group as a function of the birth years of individuals from 1923 to 1987. The bands around the curves show 95% confidence intervals.

World War II evacuees from Ceded Karelia

We also identified some rapid changes which, in some cases, can be dated with an accuracy of one year. The most prominent example is the rapid increase in the R10-Evacuated ancestry in SWF and in TAV in the 1940s, which is a result of the massive evacuation and relocation caused by World War II. Figs 6 and S20 show that the increase in R10-Evacuated ancestry could be detected in all regions except in SKA (whose dominant group was already closely genetically related to R10-Evacuated) and in KAI, which, according to the historical records, did not noticeably gain evacuees[29]. Moreover, we could detect widely varying regional patterns of the subsequent movements of the evacuees after the war (Fig 6). In many regions, such as in SWF, TAV and NSA, the increase of the R10-Evacuated ancestry in the 1940s was followed by a constant proportion of that ancestry in the following decades. A striking exception is OST, where we detected a rapid decrease in the R10-Evacuated ancestry after 1945, shrinking almost to its pre-war level already by 1950. This suggests that the evacuees did not settle in OST. This observation is supported by the historical records which show that while in 1944, 28% of the evacuees were located in OST, in 1950, the proportion was only 9% (corresponding to 5% of the total population of OST, Table 1 in [29]). A completely opposite pattern is present in SOF, where we detected a continuous increase in the R10-Evacuated ancestry until 1960. This is also supported by the historical records which show that while in 1945 about 17% of the evacuees were located in SOF, by 1960 the proportion had increased to 25%[29].

Fig 6. Regional development of genetic ancestry from R10-Evacuated reference group.

The development of the genetic ancestry from the R10-Evacuated reference group in four regions, Tavastia (TAV), Ostrobothnia (OST), Southern Finland (SOF) and Kainuu (KAI), suggests very different migration patterns of war-time evacuees in different parts of Finland during the war (1939–1945) and after the war.

We also compared the estimated genetic ancestry proportions of R10-Evacuated to the proportion of the Karelian evacuees in different regions in 1950 (Table 2). The comparison is restricted to the regions of SOF, SWF, TAV, OST, NOS+KAI and LAP because, for these regions, the administrative borders have remained fairly stable over the years and their dominant genetic group is distinguishable from the R10-Evacuated ancestry. The comparison showed that the estimated proportion of R10-Evacuated ancestry in individuals born around 1950 was slightly higher in all regions compared to the proportion of evacuees in the regions. This is unlikely to result from a higher fertility rate of the Karelian evacuees compared to the rest of the population since the fertility of evacuees has been reported lower than in the general population[34]. While we cannot fully exclude the possibility of overestimation of R10-Evacuated ancestry in our approach, another likely explanation also remains. The R10-Evacuated does not only capture the ancestry of the actual war-time evacuees but captures also other than war-related migration and ancestry from the nearby region of SKA. For example, in SOF, that contains the capital region of Helsinki and has been a target of incoming migration through the 20th century, the R10-Evacuated ancestry clearly started to grow already in the 1920s, well before World War II.

Table 2. Proportion of Karelian evacuees (from CKA) and the estimated genetic ancestry proportion of R10-Evacuated in different regions in year 1950.

Region* | Total population[35] | Number of evacuees[29] | Proportion of evacuees | Genetic ancestry estimate (95% CI)
SOF | 667 500 | 70 686 | 0.11 | 0.19 (0.18–0.21)
SWF | 631 000 | 57 834 | 0.09 | 0.11 (0.10–0.12)
TAV | 553 300 | 68 187 | 0.12 | 0.17 (0.15–0.20)
OST | 607 200 | 32 487 | 0.05 | 0.07 (0.04–0.10)
NOS+KAI | 359 800 | 12 495 | 0.03 | 0.07 (0.07–0.08)
LAP | 167 100 | 6 069 | 0.04 | 0.08 (0.07–0.09)

*In 1960, CNF was separated from OST (and small parts of CNF also from TAV and SSA).
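
The "Proportion of evacuees" column in Table 2 is the number of evacuees divided by the total population of the region. The short sketch below recomputes that column from the table values and checks that the point estimate of genetic ancestry exceeds it in every region. The variable and function names are illustrative and are not part of the study's code.

```python
# Values copied from Table 2 (year 1950):
# region -> (total population, number of evacuees, genetic ancestry point estimate)
TABLE2 = {
    "SOF": (667_500, 70_686, 0.19),
    "SWF": (631_000, 57_834, 0.11),
    "TAV": (553_300, 68_187, 0.17),
    "OST": (607_200, 32_487, 0.07),
    "NOS+KAI": (359_800, 12_495, 0.07),
    "LAP": (167_100, 6_069, 0.08),
}


def evacuee_proportion(total_population, evacuees):
    """Share of a region's population made up of evacuees."""
    return evacuees / total_population


proportions = {
    region: evacuee_proportion(total, evacuees)
    for region, (total, evacuees, _) in TABLE2.items()
}

for region, (_, _, genetic) in TABLE2.items():
    # Print the recomputed proportion next to the genetic estimate.
    print(f"{region}: evacuees {proportions[region]:.2f}, genetic {genetic:.2f}")
```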

Increase in genetic heterogeneity

To quantify whether the level of heterogeneity of the ancestry profiles changed through the 20th century, we used entropy as a measure of heterogeneity with larger entropy meaning a more heterogeneous profile. With this measure, we do not quantify the absolute heterogeneity in each region, as measured, e.g., by the average inbreeding coefficient of individuals (S23 Fig), but rather we quantify the regional diversity of ancestral backgrounds with respect to our reference groups. Figs 7 and S21 show the regression slopes of heterogeneity of ancestry on birth years in each region. We see that, on average, the genetic profiles diversified towards the present, but also showed notable differences between the regions. The largest changes were observed in SWF and no change at all in OST and NOS. In addition, there was no change in the refset 2 heterogeneity in KAI, CNF, LAP or SKA, suggesting that the diversification in these regions has happened locally between nearby regions rather than more broadly across the main East-West split.

Fig 7. Rate of change in heterogeneity.

The rate of change in heterogeneity in each region of Finland (see Fig 1) between 1923 and 1987. The rate of change is calculated by regressing the entropy of ancestry profiles of the individuals born in the region on their year of birth. Entropy was calculated with respect to refsets 2, 6, and 10.

We also compared the rate of change before and after the year 1950, corresponding to the situations before and after the war (S22 Fig). This comparison shows that, on average, the war-related migration had a more diversifying effect than the later events, such as the urbanization starting from the 1950s. After the war, increasing heterogeneity is seen in SOF, SWF and TAV, which contain the largest metropolitan areas in Western Finland (Helsinki, Turku and Tampere, respectively), and in NKA and NSA in Eastern Finland. In other regions, urbanization may have gathered people from nearby rural areas into the nearest cities rather than resulted in incoming migration across the country, but traces of such local urbanization within the regions are not detectable in our analysis.

We have made the municipality-level ancestry profiles and the regional changes in ancestry profiles publicly available via an interactive web application.

Discussion

Individual-level genetic ancestry information has important scientific applications in genome-wide association studies[36], in genomic medicine[1] and in forensics[2], and it is also of great interest to the general public, as demonstrated by tens of millions of individuals who have taken direct-to-consumer tests motivated by genetic ancestry[37]. The ancestry profiles have so far been estimated at a fairly broad scale by separating ancestry components between continents or countries. In the current phase of rapidly developing biobanks, we have already seen implementations of large-scale genetic studies within individual countries, such as the FinnGen project (www.finngen.fi) in Finland that aims to cover 10% of the Finnish population by 2023. Here we have developed and evaluated an ancestry estimation framework for such a targeted, within-country setting. This way we answer the need for accurate ancestry profiles both when implementing genomic medicine via biobank resources and when conveying the potential of large-scale, country-wide genomic resources to the general public. For example, in our work we have been able to track the major demographic events within Finland at the accuracy of one year, and released these results for the general public to browse via an interactive website. While our work uses Finnish data, our analysis framework could be applied to other populations with similar data available.

By grouping individual-level ancestry profiles by birth year and region, we showed how demographic events have affected the fine-scale genetic population structure in Finland. The results verified the prominent effect of the migration within Finland that was caused by the Second World War. While the war-related migration happened within a relatively short period from 1939 to 1945, its diversifying effect on the genetic heterogeneity was larger than the effect of urbanization, which started in the 1950s and has continued to this day. We expect that this result mirrors the pattern of urbanization in Finland: the number of towns and cities has increased rapidly but their size has remained relatively small[30], suggesting that, at least up to the 1990s, people have largely moved to nearby cities rather than migrated long distances across the country. When such a local urbanization event happens within a region dominated by one of our genetic reference groups, it does not change the ancestry profile of the region, and consequently it does not show up in our results.

While our genetic analysis results match well with the known history, the exact interpretation of ancestry and genetic mixing is complicated because it always depends on the available reference groups[38, 39]. An intuitive interpretation of results would require independent reference groups that form a comprehensive collection of all genetic variation present in the region of interest. Unfortunately, this is rarely fully possible in natural populations whose structure is continuous rather than discrete and whose subpopulations are related in various ways to each other. Still, in many cases, useful approximations of discrete reference populations can be found. First, we started with geographically comprehensive and evenly distributed data and used an unsupervised clustering method (FineSTRUCTURE) to identify statistically separable genetic populations. The results were highly consistent with the existing knowledge of the Finnish population structure[27], and we had good geographic coverage for all parts of mainland Finland except Lapland, where the sample size is relatively small and several municipalities cover vast geographic areas. Second, we evaluated the identifiability of the populations with respect to each other by estimating the populations’ identity proportions. For each population, we defined the identity proportion as the average ancestry proportion of the members of the population in the population itself. A low identity proportion indicates that the population is closely related to one or more other populations included in the analysis. By excluding the populations that showed low identity proportions, we were left with reference groups that showed similar levels of independence and covered most parts of Finland well.

The regions lacking their own reference groups were located near the city of Oulu (NOS) and near the capital region of Helsinki (SOF, TAV), both of which may have gathered recent and genetically heterogeneous migration from elsewhere in Finland. The reference groups were not selected based on the birth year. The mean birth year, which was 1953 across all samples, did not vary significantly between the reference groups, except for R10-Evacuated (mean birth year 1946, Mann-Whitney p-value 4.3e-7) (S6 Fig and S4 Table), whose region of origin was significantly affected by the Second World War.
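
The identity proportion defined above can be illustrated with a minimal sketch. It assumes that each individual's ancestry profile is available as a mapping from population name to estimated proportion; the data layout and function names are hypothetical, and the default threshold of 0.50 follows the caption of S5 Fig.

```python
def identity_proportions(profiles, membership):
    """Average ancestry proportion that members of each population
    have in their own population.

    profiles:   {individual: {population: ancestry proportion}}
    membership: {individual: population the individual belongs to}
    """
    sums, counts = {}, {}
    for individual, population in membership.items():
        own_share = profiles[individual].get(population, 0.0)
        sums[population] = sums.get(population, 0.0) + own_share
        counts[population] = counts.get(population, 0) + 1
    return {pop: sums[pop] / counts[pop] for pop in sums}


def keep_identifiable(identity, threshold=0.50):
    """Names of the populations whose identity proportion exceeds the threshold."""
    return sorted(pop for pop, value in identity.items() if value > threshold)


# Toy input (invented numbers, for illustration only).
toy_profiles = {
    "a1": {"A": 0.8, "B": 0.2},
    "a2": {"A": 0.6, "B": 0.4},
    "b1": {"A": 0.6, "B": 0.4},
    "b2": {"A": 0.5, "B": 0.5},
}
toy_membership = {"a1": "A", "a2": "A", "b1": "B", "b2": "B"}

toy_identity = identity_proportions(toy_profiles, toy_membership)
toy_kept = keep_identifiable(toy_identity)
```

In this toy input, population B has a low identity proportion because its members are assigned mostly to A, so only A would be retained as a reference group.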

It is important to acknowledge that the Finnish population today contains a considerable proportion of recent genetic ancestry from outside of Finland as well as from several minority groups who have lived in Finland for centuries (e.g. Roma people) or for millennia (Sami people). A comprehensive ancestry profile of a Finnish individual should therefore include appropriate reference data also from outside of Finland as well as from the minority groups within Finland. Detailed reference data from the neighboring countries would also provide additional information about the relationships between our Finnish reference groups and the populations outside of Finland and open up possibilities to probe history further back in time through genetic analyses.

The ancestor candidates that we used in our simulations were chosen by parental birthplace information and PCA of genetic data, and hence they are expected to be less genetically heterogeneous than an average individual with Finnish ancestry. Consequently, our results about the identifiability of an ancestor with certain genetic background are valid when the ancestor was approximately equally representative of their ancestry group as our ancestor candidates. We have not studied more complex scenarios, where an individual has a considerable proportion of genetic ancestry in a certain reference group, but that ancestry originates from many heterogeneous ancestors rather than one (or a few) homogeneous ancestor(s).

Our simulation results between A-East and A-West, as well as between more detailed ancestor groups, showed more variance in the ancestry estimates for individuals with a more heterogeneous background than for individuals with ancestors from a single origin. We noticed that our estimation procedure always introduced small proportions of ancestry from each of the available reference groups. Based on this, we assume that small proportions of ancestry (~3–6%) should not be interpreted as reliable evidence of direct ancestry in the reference group but rather as natural variation in a continuous measure. Shrinking away such small ancestry proportions, for example by thresholding at 5%, could improve the interpretability of the results, as was shown for the simulated results in S11 Fig. This, and other technical ways to improve the accuracy and interpretation of the results, warrant a more detailed study in the future.
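
As an illustration of the thresholding mentioned above, the following sketch sets ancestry proportions below 5% to zero and rescales the remaining proportions to sum to one, as described in the caption of S11 Fig. The function name and the example values are hypothetical; only the group names are taken from the text.

```python
def shrink_small_proportions(profile, threshold=0.05):
    """Set proportions below the threshold to zero and rescale the rest to sum to one.

    profile: {reference group: ancestry proportion}
    """
    kept = {group: p for group, p in profile.items() if p >= threshold}
    total = sum(kept.values())
    if total == 0:
        raise ValueError("no ancestry proportion reaches the threshold")
    return {group: kept.get(group, 0.0) / total for group in profile}


# Invented example profile for one individual.
example = {
    "R10-Southwest": 0.90,
    "R10-Evacuated": 0.06,
    "R10-Savo-Karelia": 0.04,
}
shrunk = shrink_small_proportions(example)
```

Here the 4% component is removed and the two remaining components are divided by 0.96, so the profile still sums to one.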

SOURCEFIND is a software tool that works directly on the output of the haplotype-based methods ChromoPainter and FineSTRUCTURE[9]. It has been previously utilized to capture genetic ancestry within Europe [12, 16, 40] and in Latin America[18]. We have previously shown that ChromoPainter and FineSTRUCTURE work well in our target population[27]. Therefore, SOURCEFIND was a natural candidate for testing how well our reference groups identify genetic ancestry. Previously, SOURCEFIND has been shown to give highly concordant continental-level ancestry estimates with a standard ADMIXTURE-analysis[8] and to outperform a ChromoPainter-based NNLS method [18]. We are not aware of a direct comparison between SOURCEFIND and other haplotype-based methods, such as RFmix[41].

An individual’s genetic ancestry has also been shown to crucially affect the interpretability of polygenic risk scores, which are currently heavily studied for medical genetics applications[42, 43]. The challenges of polygenic risk scores manifest not only between genetically distant, intercontinental populations[42, 44] but also between subpopulations within a country[45], demonstrating the need for characterization of genetic structure at finer scales. In fact, incorporating genetic ancestry information in polygenic risk scores has been shown to improve genetic risk estimation[1]. We consider this topic an important future application area of the framework that we have presented here.

To conclude, this work demonstrates the power of ancestry estimation to reveal detailed demographic events as well as the continuous, gradually changing and mixing nature of real-world human populations. The work serves as a basis for detailed ancestry estimation within Finland, and we also provide a website for the general public to interactively examine our results. We expect that our results will help personalize future genomic medicine in the Finnish population and promote the participation of the general public in large-scale biobank collections that provide unprecedented opportunities for human genetic research, in Finland as well as elsewhere in the world.

Materials and methods

Data

Data originate from the National FINRISK Study, a population-based random sample of individuals aged 25–74 years[31]. The study has been conducted every five years since 1972. DNA was first collected in the 1992 study, and here we used individuals sampled in the years 1992, 1997, 2002, 2007 and 2012. In 1992, the study was conducted in four study areas: North Karelia and North Savo provinces in eastern Finland, the Turku-Loimaa area in southwestern Finland, and the cities of Helsinki and Vantaa in southern Finland. In 1997, a fifth study area in northern Finland, the provinces of North Ostrobothnia and Kainuu (in 2002 also the province of Lapland), was included. Thus, the study covers different geographic areas of Finland well, and nearly half of the Finnish population lives in the study area. For each individual, the data included information on their genotypes, birth place and birth year. For a large part of the individuals, the data also included their parents’ birth place information. All study individuals had given written consent.

Data quality control

All individuals were genotyped with Illumina’s HumanCoreExome genotyping chip. Starting from 23,431 individuals and 538,840 genetic variants, we first excluded variants with minor allele frequency (MAF) below 5%, Hardy-Weinberg equilibrium p-value below 1e-6 and variants that were missing in over 1% of the samples. We also excluded variants with duplicated genotype positions. These statistics were calculated using PLINK 1.9[46, 47]. After these filters, the data included 229,844 variants.

We excluded individual samples if their heterozygosity was below -0.04 or above 0.04 or if their variant missingness was above 0.005, both calculated with PLINK 1.9[46, 47]. We calculated individual relatedness with KING 2.1.4[48] and excluded 3,635 individuals with a 3rd degree relative (kinship over 0.0442) in the data. Further, we excluded the individuals who were themselves born abroad or who had a parent born abroad (no Finnish birth municipality available), and the individuals on two bad-quality genotyping plates identified in our previous study[27]. For the haplotype-based ancestry analyses, we also excluded the individuals with a high proportion (over 10%) of missing variants in chromosome 21. For the analyses using region of birth, we further excluded the individuals recorded as born in the municipality of Karjala (municipality code 222), located in SWF, because it was impossible to know whether they were truly born in that municipality or rather in the larger region of Karelia, which in Finnish is called by the same name (Karjala). Additionally, we ran PCA of our samples together with the non-Finnish samples from the 1000 Genomes Project[49]. Using the K-nearest neighbors method (K = 21), we excluded 31 individuals who showed closer relatedness to the 1000 Genomes samples than to our Finnish reference candidates (S16 Fig). After these steps, our data included 18,463 individuals with good-quality genotypes and location information. S5 Table shows the number of excluded or included samples after each filtering step.

Identification and spatial sampling of reference candidates

To define geographically motivated reference groups, we first identified 8,187 individuals whose parents were born within 80 km from each other and calculated their geographic location as the mean of their parents’ birth places (available at the level of municipality). Most of these individuals were geographically clustered in either the eastern or the southwestern corner of Finland. Because we had previously discovered that an uneven geographic sampling density can affect the identification of subpopulations[27], we applied the following spatial sampling procedure to thin down the individuals from the densest areas.

For each individual, we calculated how many neighbors they had in their immediate proximity (local neighbors, within 5 km) and in their wider proximity (global neighbors, within 30 km). Then, we excluded individuals in two steps. First, we found all individuals that had more than 15 local neighbors and randomly excluded one such individual, repeating this until all individuals had 15 or fewer local neighbors. This reduced the number of samples in large cities and ensured more even sampling for the next step. Second, we found all individuals that had more than 40 global neighbors and, among them, we identified those who had the most local neighbors and randomly excluded one of those. This was repeated until no individual had more than 40 global neighbors. S2 Fig demonstrates the impact of the procedure on the geographic distribution of samples. The procedure resulted in 2,754 individuals from whom we further excluded 10 geographic outliers and 3 outliers in PCA on ChromoPainter’s coancestry matrix as described in [9], resulting in 2,741 individuals. These 2,741 individuals were used as the donors in ChromoPainter analyses and they formed our set of reference candidates. The 5,446 individuals who were excluded during this spatial sampling procedure were later used as ancestor candidates in simulations to test the ancestry estimation.
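
The two-step thinning described above can be sketched as follows. This is a minimal illustration, not the code used in the study: it assumes planar coordinates in kilometers, counts an individual as a neighbor when the distance is at most the radius, and uses a brute-force neighbor search that is only practical for small inputs. All function and parameter names are hypothetical.

```python
import math
import random


def neighbor_counts(points, radius_km):
    """Number of other individuals within radius_km of each individual.

    points: {individual id: (x_km, y_km)}
    """
    ids = list(points)
    counts = {i: 0 for i in ids}
    for a in range(len(ids)):
        for b in range(a + 1, len(ids)):
            if math.dist(points[ids[a]], points[ids[b]]) <= radius_km:
                counts[ids[a]] += 1
                counts[ids[b]] += 1
    return counts


def thin_samples(points, local_km=5.0, global_km=30.0,
                 max_local=15, max_global=40, seed=0):
    """Return the individuals kept after the two-step spatial thinning."""
    rng = random.Random(seed)
    kept = dict(points)

    # Step 1: remove one random individual with too many local neighbors
    # at a time, until nobody has more than max_local local neighbors.
    while True:
        local = neighbor_counts(kept, local_km)
        crowded = sorted(i for i, n in local.items() if n > max_local)
        if not crowded:
            break
        del kept[rng.choice(crowded)]

    # Step 2: among individuals with too many global neighbors, remove one
    # random individual from those with the most local neighbors, and repeat.
    while True:
        wide = neighbor_counts(kept, global_km)
        crowded = [i for i, n in wide.items() if n > max_global]
        if not crowded:
            break
        local = neighbor_counts(kept, local_km)
        most = max(local[i] for i in crowded)
        del kept[rng.choice(sorted(i for i in crowded if local[i] == most))]

    return kept
```

For example, 20 individuals placed at the same location are thinned to 16, because each remaining individual then has exactly 15 local neighbors.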

Chromosome painting with ChromoPainter

All individuals were first phased together with the SHAPEIT2 software[50], using 229,844 genetic variants, an average European effective population size of 11,418 and the HapMap phase II recombination map. Then, both for identifying reference groups and for further estimating ancestry, we identified pairwise haplotype-sharing patterns, so-called chromosome paintings, using ChromoPainter v 2.0[9]. ChromoPainter estimates the number and the cumulative length of haplotype blocks shared between the test individual and all donor individuals using dense genotype data. We ran ChromoPainter using the 2,741 reference candidates as donors, and the average switch rate (-n 3720.27) and global mutation probability (-M 0.00014), estimated over chromosomes 1, 9, 15 and 22, and with 28 donor individuals using 10 EM-iterations. Other parameters were kept at their default values.

Identifying genetic populations with FineSTRUCTURE

To identify genetically homogeneous reference groups within the 2,741 geographically defined reference candidates, we ran FineSTRUCTURE 2.0.1[9]. First, we performed the Markov chain Monte Carlo (MCMC) run using 1,000,000 burn-in iterations and 2,000,000 sample iterations, saving every 10,000th iteration. The MCMC run was followed by an additional re-assignment where the population assignments were assessed against the individuals’ maximum assignments during the MCMC iterations following Leslie et al. 2015[10]. Finally, the clusters were merged with the FineSTRUCTURE’s tree-building method maintaining the tree order and likelihoods. The procedure was repeated with a different seed for MCMC run to check the convergence.

Ancestry estimation with SOURCEFIND

To estimate the haplotype-based ancestry within Finland, we used SOURCEFINDv2[18], a software specifically implemented to work with ChromoPainter. SOURCEFIND uses an averaged chromosome painting of the reference groups to find the ancestry proportions for the test individual/population utilizing an MCMC method. In our analyses, we used 50,000 burn-in iterations, 200,000 sample iterations and recorded the results from every 5,000th iteration. The copyvector, used to compare chromosome painting patterns between the test individual and the reference groups, was defined as an average over 20 FineSTRUCTURE populations (S3C Fig). We used 2, 6 and 10 reference groups (called surrogate groups in SOURCEFIND) to estimate ancestry. For individuals in the reference groups, ancestry was estimated by leaving the individual itself out from the reference group in the SOURCEFIND analysis. For other individuals, all reference individuals were used.

Selection of ancestor candidates

To test the estimation of genetic ancestry within Finland, we simulated individuals using real data. We started by identifying ancestor candidates that would be geographically motivated similarly to our reference candidates. We reused the 5,446 individuals who were excluded from the reference candidates in the spatial sampling procedure but whose parents were born within 80 km from each other.

For testing refset 2, we selected two ancestor candidate sets, one from West and one from East. The western ancestor candidates (A-West) were those whose parents were born either in Southwestern Finland (SWF) or in Ostrobothnia (OST) and the eastern ancestor candidates (A-East) had their parents from Ceded Karelia (CKA), North Karelia (NKA), Kainuu (KAI), North Ostrobothnia (NOS) or Lapland (LAP). The region of North Savo (NSA) was not included as it would have caused a significant imbalance in the geographic distribution of the ancestor candidates. The candidates were further filtered down based on PC1 coordinate: 80% of western (eastern) ancestor candidates with the smallest (largest) PC1 coordinate were kept (S8 Fig).

For testing refsets 6 and 10, we selected 7 groups of ancestor candidates in such a way that both of the parents were born in the same region, either in SWF, OST, LAP, KAI, NKA, CKA or in the municipalities of Kuusamo, Taivalkoski and Pudasjärvi of NOS (referred to as A-Kuusamo). In addition, for each of the 7 sets, we estimated a two-dimensional normal density function in PC1-PC2 space and kept the individuals who were within the highest 50% density region (S8 Fig). Neither the genetic data of our reference individuals nor the SOURCEFIND estimates were used in the selection process of the ancestor candidates.

Principal component analyses

We performed principal component analyses both within the FINRISK data and together with the samples of the 1000 Genomes Project[49] using PLINK 1.9[46, 47].

Principal component analysis within the FINRISK samples was performed for 18,719 individuals and 56,661 LD-independent variants. These data included 256 individuals who were not part of the haplotype-based analyses as they were only later excluded due to ambiguous or missing location data or as outliers of the 1000 Genomes PCA (S16 Fig). LD-independent variants (56,661) were defined using the command --indep-pairwise with 1500 kb window size, 500 kb step size and 0.2 as r2 threshold in PLINK 1.9, and by further excluding the long-range LD regions described in [51].

Principal component analyses together with the 1000 Genomes data were performed on 18,715 FINRISK samples and 1,536 non-Finnish samples of the 1000 Genomes phase 3 data using 49,423 LD-independent variants. We performed 6 separate PCA runs: one with all five super populations (314 Africans, 264 Americans, 480 East Asians, 380 Europeans, and 98 South Asians) together with the FINRISK samples, and 5 runs where each super population was separately analyzed with the FINRISK samples (S16 Fig).

Simulating individuals from ancestor candidates

For each simulation scenario, we simulated 20 individuals by first randomly sampling ancestors from the ancestor candidate groups (S9 Fig) and then simulating recombination between the ancestor haplotypes. Each ancestor candidate was used only once in one simulation run. We simulated new haplotypes for chromosomes 1–22 in R (see Data Availability Statement) by first sampling crossing-over events within the ancestors’ phased chromosomes and then sampling one new haplotype from each ancestor. The crossing-over events were sampled between two loci according to the probabilities from the HapMap phase II recombination map. When simulating several generations, we generated new haplotypes iteratively by starting from the eldest ancestors (S7 Fig).
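
The study's simulation was implemented in R (see Data Availability Statement). The Python sketch below only illustrates the idea described above under simplifying assumptions: each individual is a pair of phased haplotypes stored as lists of alleles, and `crossover_prob[j]` is an assumed per-generation probability of a crossover between loci j and j+1 (in the study, these probabilities come from the HapMap phase II recombination map). All names are hypothetical.

```python
import random


def recombine(hap_a, hap_b, crossover_prob, rng):
    """Return one new haplotype as a mosaic of a parent's two phased haplotypes."""
    if not (len(hap_a) == len(hap_b) == len(crossover_prob) + 1):
        raise ValueError("need one crossover probability per adjacent locus pair")
    sources = (hap_a, hap_b)
    current = rng.choice((0, 1))          # haplotype copied at the first locus
    child = [sources[current][0]]
    for j, p in enumerate(crossover_prob):
        if rng.random() < p:              # crossover between locus j and j+1
            current = 1 - current
        child.append(sources[current][j + 1])
    return child


def simulate_child(parent1, parent2, crossover_prob, rng):
    """A child receives one recombined haplotype from each parent."""
    return (recombine(*parent1, crossover_prob, rng),
            recombine(*parent2, crossover_prob, rng))


def simulate_descendant(ancestors, crossover_prob, rng):
    """Iterate from the eldest ancestors down to a single target individual.

    ancestors: list of 2**G individuals (G >= 1); adjacent entries form couples.
    """
    n = len(ancestors)
    if n < 2 or n & (n - 1):
        raise ValueError("number of ancestors must be a power of two (>= 2)")
    generation = list(ancestors)
    while len(generation) > 1:
        generation = [
            simulate_child(generation[k], generation[k + 1], crossover_prob, rng)
            for k in range(0, len(generation), 2)
        ]
    return generation[0]
```

With G = 2, for example, four ancestors are passed in and the function returns the two haplotypes of their grandchild.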

Local regression for ancestry curves

We estimated the average ancestry proportions for each region by using local regression method, LOESS (locally estimated scatterplot smoothing), in R[52]. We regressed the individuals’ ancestry profiles on individuals’ birth years by fitting LOESS using value 0.5 for the spatial smoothing parameter.

Changes in genetic heterogeneity

We estimated genetic heterogeneity of ancestry profiles using entropy measure to quantify the change in heterogeneity over time. For each region and for each yearly ancestry profile averaged over individuals born in that year, we measured entropy as

$$H = -\sum_{i=1}^{k} p_i \log_2(p_i),$$

where $p_i$ is the proportion of the ancestry profile assigned to reference group i = 1,…,k, and the number of reference groups in the profile was k = 2, 6 or 10. Instead of comparing the absolute values of entropy, which can depend strongly on the reference groups, we quantified the change by predicting entropy with birth year by a simple linear regression where each yearly observation was weighted by the number of samples used to calculate the yearly ancestry profile. The slope of the model is then interpreted as the rate of change in genetic heterogeneity, with negative values indicating a decrease in heterogeneity and positive values indicating diversifying profiles. We estimated the rate of change for the whole time period from 1923 to 1987, as well as separately for the periods before and after the year 1950, corresponding to the period including the Second World War and the post-war period, respectively. S21 Fig shows the yearly entropy and the fitted model for each region with refset 10.
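
A minimal sketch of the two computations described above is given below: the entropy of an ancestry profile in bits and the slope of a weighted simple linear regression of yearly entropy on birth year. The function names and the example numbers are illustrative only, and terms with a zero proportion are taken to contribute zero to the entropy.

```python
import math


def entropy(profile):
    """Entropy (in bits) of an ancestry profile given as a list of proportions."""
    return -sum(p * math.log2(p) for p in profile if p > 0)


def weighted_slope(years, values, weights):
    """Slope of a weighted least-squares line of values on years."""
    w_total = sum(weights)
    x_mean = sum(w * x for w, x in zip(weights, years)) / w_total
    y_mean = sum(w * y for w, y in zip(weights, values)) / w_total
    s_xy = sum(w * (x - x_mean) * (y - y_mean)
               for w, x, y in zip(weights, years, values))
    s_xx = sum(w * (x - x_mean) ** 2 for w, x in zip(weights, years))
    return s_xy / s_xx


# Invented yearly average profiles for one region (k = 2 reference groups)
# and the number of individuals born in each year.
yearly_profiles = {1948: [0.95, 0.05], 1949: [0.90, 0.10], 1950: [0.80, 0.20]}
yearly_counts = {1948: 12, 1949: 20, 1950: 15}

years = sorted(yearly_profiles)
rate_of_change = weighted_slope(
    years,
    [entropy(yearly_profiles[y]) for y in years],
    [yearly_counts[y] for y in years],
)
```

A positive `rate_of_change` corresponds to a diversifying ancestry profile, as in this invented example where the minor component grows from year to year.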

Average inbreeding coefficient, F

To complement the measure of change in heterogeneity of the ancestry profile per region, we also computed the average inbreeding coefficient, F, for each study region. The individual inbreeding coefficients were first calculated using PLINK 1.9[46, 47] and then averaged over the individuals born in the region.

Pairwise-FST

We calculated pairwise-FST between the reference groups (Fig 2) and the ancestor candidate groups (S9 Fig) using SmartPCA of EIGENSOFT package[7] (fstonly: YES, fsthiprecision: YES) and 56,661 LD-independent variants.

Average ancestry profiles over municipalities

In an interactive web application, we present the average ancestry profiles for the Finnish municipalities. These profiles have been estimated by weighting the individual profiles by the inverse of a squared distance between the individual and the center of the municipality as

$$A_m = \frac{1}{r_{\mathrm{Tot}}} \sum_{i=1}^{N} \frac{a_i}{r_{im}^{2}},$$

where $A_m$ is the ancestry profile of municipality m, $a_i$ is the ancestry profile of individual i, $r_{im}$ is the distance between the birth place of individual i and municipality m, N is the number of individuals, and $r_{\mathrm{Tot}}$ is the sum of $1/r_{im}^{2}$ over all individuals. Thus, the ancestry profile of a municipality does not represent the average profile of only the individuals born in that municipality but may also be affected by the individuals from the neighboring municipalities. The minimum distance between an individual and a municipality was set to 5 kilometers to avoid very high weights in the calculation.
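
The weighting above can be written as a short sketch. It assumes that the distance from each individual to the center of the municipality has already been computed in kilometers; the function name and the input layout are hypothetical.

```python
def municipality_profile(individuals, min_distance_km=5.0):
    """Inverse-squared-distance weighted average of individual ancestry profiles.

    individuals: list of (distance_km, profile) pairs, where profile is a list
                 of ancestry proportions in a fixed order of reference groups.
    """
    # Distances below the minimum are raised to it to avoid very high weights.
    weights = [1.0 / max(r, min_distance_km) ** 2 for r, _ in individuals]
    r_tot = sum(weights)
    k = len(individuals[0][1])
    return [
        sum(w * profile[j] for w, (_, profile) in zip(weights, individuals)) / r_tot
        for j in range(k)
    ]


# Two invented individuals: one born in the municipality itself (distance 0,
# treated as 5 km) and one born 10 km away.
example_profile = municipality_profile([(0.0, [1.0, 0.0]), (10.0, [0.0, 1.0])])
```

In this example the weights are 1/25 and 1/100, so the closer individual contributes 80% of the municipality's profile.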

Maps

Maps in figures were generated by using data from geoBoundaries[53].

Supporting information

S1 Fig. Workflow of the study.

(TIF)

S2 Fig. Geographic distribution of study samples chosen by parents’ birth places.

Geographic distribution of samples whose parents were born within 80 km from each other A) before the spatial sampling procedure, B) after excluding the individuals with a high number of local neighbors and C) after excluding the individuals with a high number of global neighbors. The individuals are displayed at the mean of their parents’ municipalities of birth after adding some jitter to the points to ensure anonymity. Geographic outliers are excluded.

(TIF)

S3 Fig. FineSTRUCTURE tree and populations.

FineSTRUCTURE tree at level 20 and the corresponding populations on a map when the tree is cut at A) level 2, B) level 15 and C) level 20.

(TIF)

S4 Fig. Geographic outliers.

Maps show the locations of individuals excluded as geographic outliers of A) refset 2 (3 individuals), B) refset 6 (11 individuals) and C) refset 10 (16 individuals). The excluded outliers are highlighted with black X-marks. The included individuals are shown with pale colors.

(TIF)

S5 Fig. Identity proportions for the FineSTRUCTURE populations.

Panel A) shows the identity proportions when all 15 populations were used as reference populations. Panel B) shows the identity proportions using only the 10 populations that show an identity proportion above 0.50 in panel A. Colors correspond to the populations in S3B Fig.

(TIF)

S6 Fig. Age distributions of reference groups and ancestor candidates.

Age distributions of the reference groups of A) refset 2, B) refset 6, C) refset 10 and D) the ancestor candidates. The boxplot whiskers show the range, the boxes show the interquartile range and the dark line shows the median of the birth years.

(TIF)

S7 Fig. Schematic representation of our simulation strategy.

In each simulation, 2^G individuals were sampled to represent the ancestors from G generations back in time (black box), where G varied between 1 and 5. All the subsequent descendants in generations G-1, G-2, …, were simulated to determine the genotypes of the target individual at generation 0 (grey box). In this example simulation, 1 ancestor was sampled from A-West (red) and the remaining 2^G – 1 ancestors were sampled from A-East (blue). The two adjacent bars correspond to the two haplotypes of an individual and the color corresponds to the ancestor candidate group.

(TIF)

S8 Fig. Location of ancestor candidates on genetic principal component space.

The location of ancestor candidates on a plane defined by principal components (PC) 1 and 2 of the genetic structure for A) simulation settings for refset 2 and B) simulation settings for refsets 6 and 10.

(TIF)

S9 Fig. Geographic location of ancestor candidates.

The geographic location of the ancestor candidates in simulation settings A) for refset 2 and B) for refsets 6 and 10. The names of the ancestor candidate groups are shown on right.

(TIF)

S10 Fig. Individual ancestry profiles for simulation scenarios between East and West.

Ancestry profiles for 20 individuals in simulation settings involving ancestry groups R2_East and R2_West: A) for setting All-West, B) for All-East, C) for Almost-East and D) for Almost-West. Blue denotes the estimated proportion in reference group R2_East and red denotes the proportion in reference group R2_West. Gen (1,…,5) refers to the number of generations considered in the simulation. ‘Mean’ shows the average over the 20 simulated individuals.

(TIF)

S11 Fig. Average ancestry profiles for detailed simulation scenarios without below 5% proportions.

Average ancestry for the simulation results in Fig 4 when the ancestry proportions below 5% were shrunk to zero and the remaining proportions were scaled to sum to one. Panel A) presents individuals all of whose ancestors come from one group (single origin) shown in the title, estimated using refsets 2, 6 or 10. Panel B) presents individuals for whom 2^G-1 ancestors, where G = 1,…,4 is the number of generations, originate from A-Southwest and 1 ancestor originates from the ancestor group in the title, estimated using refset 10. The colors correspond to the reference groups in Fig 2.

(TIF)

S12 Fig. Individual ancestry profiles for detailed single origin simulation scenarios.

The individual ancestry profiles for 20 individuals whose both parents originate from the ancestor candidate group of A) A-Southwest, B) A-Bothnia, C) A-N_Karelia, D) A-Kainuu, E) A-Kuusamo, F) A-Lapland and G) A-Evacuated (see Fig 4A for the mean values). The colors correspond to the reference groups in Fig 2.

(TIF)

S13 Fig. Average ancestry profiles for detailed simulation scenarios.

Detailed simulation results for mixed ancestry from ancestor groups A-Southwest and A-N_Karelia. Panel A) presents individuals for whom 2^G-1 ancestors, where G is the number of generations, originate from A-Southwest and 1 ancestor originates from the region in the title. Top row shows the ancestry profiles estimated using refset 2, middle row shows the same for refset 6 and bottom row shows them for refset 10. Panel B) shows the same quantities for a simulation setting where all but one of the ancestors originate from A-N_Karelia.

(TIF)
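
As a rough guide to the mixed-ancestry scenarios of S13 Fig: G generations back an individual has 2^G ancestors, so when all but one of them come from the same group, the remaining ancestor is expected to contribute 1/2^G of the genome on average. The Python sketch below tabulates that expectation. The function name is illustrative and not from the study, and realized proportions vary around these values because of recombination and segregation.

```python
def expected_minor_fraction(generations):
    """Expected genome fraction from a single ancestor `generations` back (1 / 2**G)."""
    if generations < 1:
        raise ValueError("generations must be at least 1")
    return 1 / 2 ** generations


# For each G: ancestors from the major group, ancestors from the minor group,
# and the expected share of the genome inherited from the single minor ancestor.
for g in range(1, 5):
    n_ancestors = 2 ** g
    print(g, n_ancestors - 1, 1, expected_minor_fraction(g))
```

For G = 4 the expected minor contribution is 6.25%, only slightly above the 5% cut-off used in S11 Fig.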

S14 Fig. Individual estimates of the major ancestry component in detailed simulation results.

Detailed simulation results for mixed ancestry from ancestor groups A-Southwest and A-N_Karelia (corresponding to S13 Fig). Panel A) presents individuals for whom 2^G − 1 ancestors, where G is the number of generations, originate from A-Southwest and 1 ancestor originates from the region in the title. The top row shows estimated ancestry in R2-West, the middle row shows the same for R6-Southwest and the bottom row shows the same for R10-Southwest. Panel B) shows the same quantities for a simulation setting where all but one of the ancestors originate from A-N_Karelia, and the reference groups whose estimates are shown are R2-East (top), R6-Savo-Karelia (middle) and R10-Savo-Karelia (bottom).

(TIF)

S15 Fig. Ancestry profiles for simulated first-generation mixed individuals.

Average ancestry profiles for 20 simulated individuals whose parents come from different geographic regions are shown. Title describes the ancestor candidate groups of the parents used in the simulation.

(TIF)

S16 Fig. Principal component analyses with 1000 Genomes samples and FINRISK samples.

A) PCA of 5 super populations of the 1000 Genomes (Phase 3) samples and our Finnish FINRISK (FIN-FR) samples. PCA of the FINRISK samples together with the B) non-Finnish European (EUR), C) East Asian (EAS), D) American (AMR), E) African (AFR) and F) South Asian (SAS) samples of the 1000 Genomes Phase 3. The FINRISK samples circled with red were identified to show admixture with one or more super populations and were excluded from the regional ancestry analyses. None of our reference individuals was among the excluded.

(TIF)

S17 Fig. Development of genetic ancestry profile in 12 regions using refset 2.

The map on the left shows the regions and the locations of the individuals who form the 2 reference groups. The curves show the estimated ancestry proportion in each reference group as a function of the birth years of individuals born in each region (name of the region in the title).

(TIF)

S18 Fig. Development of genetic ancestry profile in 12 regions using refset 6.

The map on the left shows the regions and the locations of the individuals who form the 6 reference groups. The curves show the estimated ancestry proportion in each reference group as a function of the birth years of individuals born in each region (name of the region in the title).

(TIF)

S19 Fig. Development of genetic ancestry profile in 12 regions using refset 10.

The map on the left shows the regions and the locations of the individuals who form the 10 reference groups. The curves show the estimated ancestry proportion in each reference group as a function of the birth years of individuals born in each region (name of the region in the title).

(TIF)

S20 Fig. Regional changes in genetic ancestry from the R10-Evacuated reference group.

Changes in the genetic ancestry proportion within mainland Finland using refset 10 but showing only the ancestry proportion from R10-Evacuated. The map on the left shows the location of the reference individuals; R10-Evacuated is located at the southeast corner of the map, overlapping the region of Ceded Karelia (CKA).

(TIF)

S21 Fig. Entropy values regressed on birth years for the 12 study regions.

The points represent entropy of the mean profile of individuals born during one year in a particular region estimated with respect to refset 10. The size of the point represents the number of individuals averaged in that yearly profile. The purple line is the linear regression line fitted to the data. The slope estimates the rate of change in heterogeneity of the ancestry profile.

(TIF)
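
A minimal Python sketch of the quantity plotted in S21 Fig, under stated assumptions: entropy is taken to be Shannon entropy with the natural logarithm, the regression is an unweighted least-squares fit, and the per-year profiles are invented for illustration. The caption does not specify the logarithm base or whether the fit was weighted by the number of individuals, so both choices are assumptions here.

```python
import math


def entropy(profile):
    """Shannon entropy (natural log) of ancestry proportions that sum to one."""
    return -sum(p * math.log(p) for p in profile if p > 0)


def mean_profile(profiles):
    """Component-wise mean of individual ancestry profiles."""
    n = len(profiles)
    return [sum(column) / n for column in zip(*profiles)]


def ols_slope(xs, ys):
    """Slope of the ordinary least-squares line of ys regressed on xs."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Hypothetical two-group ancestry profiles of individuals born in one region,
# keyed by birth year (all values are invented).
by_year = {
    1930: [[1.0, 0.0], [0.9, 0.1]],
    1950: [[0.8, 0.2], [0.7, 0.3]],
    1970: [[0.6, 0.4], [0.5, 0.5]],
}
years = sorted(by_year)
entropies = [entropy(mean_profile(by_year[year])) for year in years]
slope = ols_slope(years, entropies)  # estimated rate of change in heterogeneity
print(slope)
```

A positive slope in this toy example means that the mean profile becomes more mixed for later birth years, which is how the slope is interpreted in the caption.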

S22 Fig. Rate of change in heterogeneity at each study region before (1923–1950) and after (1951–1987) the year 1950.

The rate of change is calculated by regressing the mean entropy over the years with refsets 2, 6, and 10 on the year of birth.

(TIF)

S23 Fig. Average inbreeding coefficient, F, per study region.

The values at the bottom report the average F per region with its standard error in parentheses. Whiskers show the 95% confidence interval.

(TIF)
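
The summary reported in S23 Fig can be sketched in Python as follows, assuming that the standard error is the sample standard deviation divided by the square root of n and that the whiskers are a normal-approximation interval (mean ± 1.96 × SE). Both are assumptions, since the caption does not state how the interval was computed, and the F values below are invented.

```python
import math
import statistics


def mean_se_ci(values, z=1.96):
    """Return the mean, its standard error and a normal-approximation 95% interval."""
    mean = statistics.fmean(values)
    se = statistics.stdev(values) / math.sqrt(len(values))
    return mean, se, (mean - z * se, mean + z * se)


# Hypothetical inbreeding coefficients for individuals from one region.
f_values = [0.000, 0.010, 0.020]
mean_f, se_f, (ci_low, ci_high) = mean_se_ci(f_values)
print(mean_f, se_f, ci_low, ci_high)
```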

S1 Table. Number of incorrectly assigned individuals in simulations between East and West.

The number of individuals incorrectly assigned to a single origin, out of 20, based on whether their A) West ancestry component or B) East ancestry component was above the threshold. The threshold was defined as the second largest value in A) All-West or B) All-East simulation setting (corresponding to the 95% quantile in the simulation setting).

(PDF)

S2 Table. Pairwise FST values (×10^5) between ancestor candidate groups and reference groups of refset 10 (lower triangular) and their standard errors (upper triangular).

(PDF)

S3 Table. Ancestry proportions for single-origin simulations categorized into expected and unexpected ancestry.

Column ‘Expected ancestry’ shows which reference groups were considered the closest to the corresponding ancestor group, and the ‘Total expected ancestry’ sums over the expected ancestries. Unexpected ancestry was defined as all ancestry other than the expected, and the average contribution of those groups is shown in column ‘Average unexpected ancestry’. Column ‘No shrink’ shows the values of the raw ancestry estimates and column ‘Shrink <5%’ shows the results after shrinking the individual ancestry estimates below 5% to zero and rescaling the remaining non-zero ancestry proportions back to 100%.

(PDF)

S4 Table. Range of the birth years (Min and Max) and the mean birth years of the reference groups.

Mann-Whitney p-value corresponds to a test between the focal group and the union of the rest of the groups at that refset.

(PDF)

S5 Table. Number of study individuals excluded and included after each filtering step.

(PDF)

S6 Table. Number of reference candidates excluded and included after the steps of reference group identification process.

The numbers in parentheses refer to the number of populations excluded or included.

(PDF)

Acknowledgments

The data used for the research were obtained from THL Biobank under the project BB2019_44. We thank all study participants for their generous participation at THL Biobank and the National FINRISK study.

Data Availability

All data underlying the findings are available via THL Biobank (https://thl.fi/en/web/thl-biobank/for-researchers) for research projects that are of high scientific quality and impact, are ethically conducted, and that correspond with the research areas of THL Biobank. The script for generating haplotypes for the simulations is available at https://github.com/sinikerm/OffspringSimulator.

Funding Statement

This work was supported by the Academy of Finland (https://www.aka.fi/en/) (Grants 288509 and 319181 to M.Pi.), the Academy of Finland Center of Excellence in Complex Disease Genetics (312076 to M.Pi.; 312062 to S.R.), by the Sigrid Juselius Foundation (https://sigridjuselius.fi/en/) (to M.Pi. and S.R.), by the University of Helsinki (https://www.helsinki.fi/en) HiLIFE Fellow and Grand Challenge grants (M.Pi. and S.R.), by the Finnish Foundation for Cardiovascular Research (https://www.sydantutkimussaatio.fi/en) (S.R.) and by Professor Package A80202-921072-Vyas from Aalto University (https://www.aalto.fi/en) to R.V. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Decision Letter 0

Hua Tang, Simon Gravel

28 Sep 2020

* Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out. *

Dear Dr Pirinen,

Thank you very much for submitting your Research Article entitled 'Changes in the fine-scale genetic structure of Finland through the 20th century' to PLOS Genetics. Your manuscript was fully evaluated at the editorial level and by independent peer reviewers. The reviewers appreciated the attention to an important topic but identified some aspects of the manuscript that should be improved.

We therefore ask you to modify the manuscript according to the review recommendations before we can consider your manuscript for acceptance. Your revisions should address the specific points made by each reviewer.

In addition we ask that you:

1) Provide a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript.

2) Upload a Striking Image with a corresponding caption to accompany your manuscript if one is available (either a new image or an existing one from within your manuscript). If this image is judged to be suitable, it may be featured on our website. Images should ideally be high resolution, eye-catching, single panel square images. For examples, please browse our archive. If your image is from someone other than yourself, please ensure that the artist has read and agreed to the terms and conditions of the Creative Commons Attribution License. Note: we cannot publish copyrighted images.

We hope to receive your revised manuscript within the next 30 days. If you anticipate any delay in its return, we would ask you to let us know the expected resubmission date by email to plosgenetics@plos.org.

If present, accompanying reviewer attachments should be included with this email; please notify the journal office if any appear to be missing. They will also be available for download from the link below. You can use this link to log into the system when you are ready to submit a revised version, having first consulted our Submission Checklist.

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Please be aware that our data availability policy requires that all numerical data underlying graphs or summary statistics are included with the submission, and you will need to provide this upon resubmission if not already present. In addition, we do not permit the inclusion of phrases such as "data not shown" or "unpublished results" in manuscripts. All points should be backed up by data provided with the submission.

PLOS has incorporated Similarity Check, powered by iThenticate, into its journal-wide submission system in order to screen submitted content for originality before publication. Each PLOS journal undertakes screening on a proportion of submitted articles. You will be contacted if needed following the screening process.

To resubmit, you will need to go to the link below and 'Revise Submission' in the 'Submissions Needing Revision' folder.

[LINK]

Please let us know if you have any questions while making these revisions.

Yours sincerely,

Simon Gravel, Ph.D.

Guest Editor

PLOS Genetics

Hua Tang

Section Editor: Natural Variation

PLOS Genetics

Both reviewers appreciated the attention to an important problem, and the relevance of the analyses.

Both reviewers also expressed some confusion about the clustering and filtering steps, and I think that this would be important to address. Perhaps a graphical representation, as suggested by Reviewer 2, would be helpful.

I would also find it important to report the number of participants excluded through each filtering step. E.g., on line 206, the manuscript lists the number of outliers, but not the proportion of individuals excluded for low identity proportions. This is important to ascertain how representative the analysis is of the actual population of Finland (see also point 2 of reviewer 2).

On line 241, it was unclear to me how the candidate ancestors were selected, in particular whether they were selected after the filtering steps (in which case the ancestry analysis would be over-confident, since the simulations used individuals that cluster exceptionally well.)

Figure 4 shows average ancestry over multiple simulated individuals. As I understand things, this would provide an estimate of the systematic bias in assignment, which is a relevant metric for the time-dependence analysis, but not of the uncertainty in assignment. I think this could be made clearer in the discussion of the results.

For all these reasons, the conclusions on lines 342-347 seem to overstate the accuracy of the regional inferences at the individual level. This is particularly important given that the manuscript highlights forensics as a possible application of this type of research.

Minor points:

“homogeneous” and “distinct” reference groups – this has not quite been shown, and I would expect that a better powered or more detailed study would reveal that these groups are neither quite homogeneous nor quite distinct. Given that this work has a public outreach component, I would advocate for more careful language given how humans like to overinterpret genetic differences across groups.

This is especially true here since the approach used extensive filtering to reach the “homogeneous” groups, and therefore the figures (such as Fig. 2) give an exaggerated idea of the divergence between populations.

Similarly, I would encourage the authors to avoid expressions such as “genetically intact”, which suggest a positive connotation to lack of mixing. (e.g., “genetically isolated” would be preferable).

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: This is an excellent paper that I thoroughly enjoyed reading. The methodology is sound and is in-line with the state-of-the-art in ancestry estimation. The results are very interesting, and the level of detail and precision on population movements within Finland are unprecedented. It is well-written, and well structured. I have only minor comments on the exposition to offer.

- My only criticism of this paper is that I found the terminology around the three "levels" of detail in which the population was studied very confusing. This made sense when discussed in the introduction, when talking about 2, 6 and 10 reference groups of ancestry. However, when you speak later (say, line 188) about level-2 reference groups, and then around line 203 of level 10 and level 6, I thought you were talking about different heights in the FS-tree. It became clearer what you mean when I read the methods, but I think the terminology could be a lot simpler and clearer here. Maybe use 2-way, 6-way and 10-way instead? This gives a better intuition to the reader that it's about splitting into groups, and "level" just seems misleading. I think "level" is being used in a few different ways, so it would be good to go through the text with this in mind.

- Line 68: "A variety of methods"

- Line 75: The term "British Isles" is politically charged and regarded as offensive by many Irish people (please see the wikipedia page for a good summary: https://en.wikipedia.org/wiki/British_Isles_naming_dispute). Please consider avoiding this contentious terminology by using a more neutral term such as "Britain and Ireland".

- Line 87. Delete "also"

- Line 91 (and elsewhere). Should it be "the Soviet Union"?

- Line 197: rerun -> reran

- Line 337. This should be 2^{G - 1}, shouldn't it?

- Line 489. I couldn't parse this sentence - what is it trying to say?

- Line 497. Swap "rather continuous" to "continuous rather".

Reviewer #2: In this paper, Kerminen and collaborators use state-of-the-art population genetics tools to investigate the population structure of Finland, specifically to see how major events in the 20th century affected this structure. FineSTRUCTURE was used to partition a reference population into discrete subgroups, while SOURCEFIND was used to estimate the proportion of ancestry from each subgroup for each individual in a testing population, using entropy measurement and year of birth data to quantify the changes in ancestry over time.

The authors’ findings matched what is known from the historical and demographic record, and further showed that the migration of people from regions of Finland that were annexed by the USSR (and continue to be part of Russia to this day) has had the largest detectable effect on the population structure of Finland. Events such as the urbanization of the population in the latter half of the 20th century show much smaller, more local effects. Additionally, the study demonstrates the technical limits of SOURCEFIND to reliably distinguish ancestry proportions < ~5% from background population genetic variation. This has implications for inference of contributions further back in an individual’s lineage.

Generally, I think the work done is of excellent quality and that the conclusions are supported by the results shown. I have three main comments on the methodology, and several minor suggestions:

Main comments:

1. Date of birth of “reference candidates”.

Given the question asked, I would have thought that choosing the “reference candidates” based on earliest year of birth would have made sense, so that these reference groups really reflect the genetic background of the reference group early in the 20th century. I am not sure why the authors decided not to choose reference individuals based on this information (l.514-518). In the >8000 potential candidates they reported, they made sure that they had good coverage across the country, but I feel that they could have considered selecting the people with the earliest date of birth as well. My concern is that, if a certain reference group is biased towards early ancestors (as reported l.516) and another is biased towards later ancestors, this could potentially lead to strange effects when looking at admixture proportions in the regional subgroups. Similarly, were the ancestors in the simulations selected based on a logical date of birth scheme (i.e., first ancestors are the oldest, with G1>G2>G3>G4)? If not, what is the expected impact on their results of overlooking this aspect (which must hold in reality)?

2. Admixture from outside Finland.

The authors mention in the discussion the fact that the individuals could have had ancestors from outside of Finland, and I am not really sure what the impact of that could be on the results presented, especially if the distribution of admixture is uneven between reference candidates or the tested regional subgroups. A solution would have been to "mask out" the chunks from distant ancestry in the genomes of individuals. And even leaving out recent immigration to Finland from countries all over the world, there has always been gene flow between Finland and Sweden (also probably Russia?). Could the authors show that these kinds of admixture events would not (or only negligibly) bias their results?

3. SOURCEFIND

I did not know much about SOURCEFIND before reading this manuscript and wish that there had been more justification for why it was used over, for instance, RFMix or other alternative methods. I would have liked to see more discussion of the implications of the fact that SOURCEFIND only seems to make accurate inferences about recent ancestry. How does this compare to other software? If someone were interested in making inferences about more remote ancestors, is there any existing software suitable to that question? How much of SOURCEFIND’s uncertainty is a function of the specific population history of Finland? Would it be more or less accurate in a more heterogeneous population?

Minor suggestions:

- I had several questions on the simulation strategy while reading the results and I think that a figure, showing the simulation scheme graphically, would be beneficial to the reader. For example, it could clarify the fact that in the Almost-East/West simulations, the foreign ancestor was always drawn as a G1 ancestor (I think? from the results in Figure 3.. although I am not sure because legend says "a single ancestor, G generations back in time" l.271, suggesting it could be any G value?)

- I am a bit confused by the wording "location of individuals" - does it refer to birth place, or where these individuals live/were sampled (e.g. Figure 2)? Similarly, "parents’ geographic location" and "parental birthplace information" are used... I think this wording should be clarified throughout.

Similarly, I was a bit confused by the term "newborns" in several places (l.410,458,488) - what does this mean?

- l. 647 "For individuals in the reference groups, the ancestry was estimated by leaving the individual itself out from the reference group." I am not really sure I understand how this is done in practice? Is FineSTRUCTURE rerun on the entire dataset by leaving one individual out?

- l.432-433. Absolute genetic diversity in regions is defined in opposition to average heterozygosity. I was curious as to whether average heterozygosity in the different regions has been computed as well, and if there are notable differences between regions or if it is quite homogeneous (maybe this has been done in a previous study; if so, please cite).

- In my opinion, the term "admixture" generally refers more to the mixing of genetic material from distantly related populations. I don't know what a better term would be though (gene flow? genetic mixing?)... or maybe it could simply be explicitly defined in the introduction that the terms admixed/admixture (normally used for more distantly related populations) will refer here to genetic mixing from/gene flow between closely related populations.

Website:

The website is great and very appealing! However, on Safari I see a truncated version (the right side goes outside the page and we can't slide the page - at least on two different computers). Also, on iPhone, the ancestry/tree panel is not displayed at all (might be too heavy for mobile - but just wanted to report it).

Typos

- l.140-142 "We will first introduce a procedure to identify suitable reference groups, then we test their performance to detect ancestry via simulations, and finally apply them to estimate the ancestry of 18,494 FINRISK samples to characterize" >> I am not sure what "their performance" and "them" refer to, probably to the methods?

- l.588 "For birth region analyses, we further excluded individuals marked to had been born in the municipality of Karjala" >> ... to have been born ...

- l.708 "We estimated the rate of change for whole time period from 1923 to 1987" >> ... for the whole time period ...

- Some figures are missing labels on axes (eg. Figures 3,4, S7-10)

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Genetics data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

Reviewer #2: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Jerome Kelleher

Reviewer #2: Yes: Julie Hussin

Decision Letter 1

Hua Tang, Simon Gravel

15 Dec 2020

* Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out. *

Dear Dr Pirinen,

Thank you very much for submitting your Research Article entitled 'Changes in the fine-scale genetic structure of Finland through the 20th century' to PLOS Genetics.

The manuscript was fully evaluated at the editorial level and by independent peer reviewers. The reviewers appreciated the attention to an important topic but identified some concerns that we ask you address in a revised manuscript.

Most of these concerns are requests and suggestions for clarification. We therefore ask you to take these into account in submitting a revised manuscript. 

In addition we ask that you:

1) Provide a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript.

2) Upload a Striking Image with a corresponding caption to accompany your manuscript if one is available (either a new image or an existing one from within your manuscript). If this image is judged to be suitable, it may be featured on our website. Images should ideally be high resolution, eye-catching, single panel square images. For examples, please browse our archive. If your image is from someone other than yourself, please ensure that the artist has read and agreed to the terms and conditions of the Creative Commons Attribution License. Note: we cannot publish copyrighted images.

We hope to receive your revised manuscript within the next 30 days. If you anticipate any delay in its return, we would ask you to let us know the expected resubmission date by email to plosgenetics@plos.org.

If present, accompanying reviewer attachments should be included with this email; please notify the journal office if any appear to be missing. They will also be available for download from the link below. You can use this link to log into the system when you are ready to submit a revised version, having first consulted our Submission Checklist.

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Please be aware that our data availability policy requires that all numerical data underlying graphs or summary statistics are included with the submission, and you will need to provide this upon resubmission if not already present. In addition, we do not permit the inclusion of phrases such as "data not shown" or "unpublished results" in manuscripts. All points should be backed up by data provided with the submission.

PLOS has incorporated Similarity Check, powered by iThenticate, into its journal-wide submission system in order to screen submitted content for originality before publication. Each PLOS journal undertakes screening on a proportion of submitted articles. You will be contacted if needed following the screening process.

To resubmit, you will need to go to the link below and 'Revise Submission' in the 'Submissions Needing Revision' folder.

[LINK]

Please let us know if you have any questions while making these revisions.

Yours sincerely,

Simon Gravel, Ph.D.

Guest Editor

PLOS Genetics

Hua Tang

Section Editor: Natural Variation

PLOS Genetics

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: The authors have addressed all of my points well, I have no further comments.

Reviewer #2: I thank the authors for answering my questions and addressing them in the manuscript. I think the changes they made have greatly improved the readability of the paper and have made it more comprehensible.

I have two final minor comments:

- Line 448: “whence” means “from which” or “from where”. I think the authors mean “where” or "for which" here.

- On the issue of the word “newborns” : When this term is used, it’s usually referring to people who are currently newborns or to discuss some facet of infancy where being newborn is relevant (for example, “newborns can’t focus their eyes”). It’s not generally used the way the authors have used it in this paper. I would use the clunkier, but more accurate “individuals born in 19XX” in its place.

Reviewer #3: This paper by Kerminen et al. simultaneously addresses the genetic structure of Finland, a question that is not totally new, and the properties of the SOURCEFIND algorithm when the source populations show little differentiation (at a fine geographical scale, within one country, here Finland).

The paper revisits the genetic structure of Finland and brings a very interesting approach: the analysis of rapid changes over time in this structure caused by dramatic events. In this regard, the approach is original and worth reporting.

The introduction/title could provide a clearer description of the study’s goal. This work basically tests, using simulation from realistic data, whether SOURCEFIND can correctly identify origin when the source populations display only limited differences.

Moreover, in my opinion, this introduction should stress earlier one of the original points, the possibility to stratify the changes in genetic composition within a short period of time.

I don’t find the references to the use of Globetrotter and SOURCEFIND very clear. The message seems to be that the methods have been applied to populations that are large, whereas this study focuses on the fine-scale structure of supposedly less broad populations. However, the referenced studies focus, for instance, on Ireland. I guess here it would be clearer to explicitly state that you are testing these methods in a fine-scale context where source populations and target populations are very close, and where source populations are not heterogeneous.

In terms of novelty, the description of the genetic structure of Finland in present times has already been addressed, and I find that the authors don't separate enough, even in the second part, what is really new from what is not. I think that the contribution of refugees from Soviet Karelia is new. Also, the evolution of the genetic structure in a short period of time is something important - actually, this is in my opinion the most important point. This is also quite new compared to previous papers where we could only estimate the change in population size from current genomes in the different clusters.

Concerning the simulations, I find it very interesting to start from existing chromosomes representative of a region in order to see Globetrotter's ability to estimate the proportions of origin of each population. They start from existing chromosomes (estimated in any case) as "founders" and simulate transmission. Somewhat in the spirit of HapGen which was distributed with the 1000 genomes. The "critics" asking to take founders with an older date of birth miss the fact that the problem is just theoretical: let's take ancestors representing populations with an Fst close to the classical Fsts between provinces.

The simulations seem to me to be valid in relation to the question posed. One thing has not been accounted for, however. This is the fact that they only capture a fraction of the haplotype diversity because they take the chromosomes as they are observed, whereas one could imagine generating founder haplotypes using the observed “source population” haplotypes but allowing for recombination at this stage, in the ancestors. Thus, they could have captured a wider and more accurate haplotype diversity while still relying on the observed structure. This is however a limited criticism as it is still a matter of simulating given an Fst and testing the consequences in lower generations. The simulation process (including the algorithm to identify “seed” founder populations) is clearer in this revised version, as asked by the editor and reviewers in the first round.

This paper (which seems to me very good and very pro) seems to chase two hares at the same time. This is a bit annoying because it mixes a practical problem (the structure and history of Finland) with a theoretical one, which uses very realistic (because real) data and is therefore tied to a specific context, the structure of Finland.

The problem is interesting because it seems to me that Globetrotter was made to find admixture (and date it) from much more differentiated source populations. So it is interesting to see the properties of the method pushed to the extreme.

Even if this paper is a bit confusing because it piles up two (nearly three) topics, it has this novelty of following genetic structures on several generations and therefore comparing the impacts of internal migration (in the sense of the same people) and urbanization. Results from simulations can also give useful guidelines for interpreting SOURCEFIND results from other populations.

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Genetics data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: None

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: Yes: Julie Hussin

Reviewer #3: No

Decision Letter 2

Hua Tang, Simon Gravel

6 Jan 2021

Dear Dr Pirinen,

We are pleased to inform you that your manuscript entitled "Changes in the fine-scale genetic structure of Finland through the 20th century" has been editorially accepted for publication in PLOS Genetics. Congratulations!

Before your submission can be formally accepted and sent to production you will need to complete our formatting changes, which you will receive in a follow up email. Please be aware that it may take several days for you to receive this email; during this time no action is required by you. Please note: the accept date on your published article will reflect the date of this provisional acceptance, but your manuscript will not be scheduled for publication until the required changes have been made.

Once your paper is formally accepted, an uncorrected proof of your manuscript will be published online ahead of the final version, unless you’ve already opted out via the online submission form. If, for any reason, you do not want an earlier version of your manuscript published online or are unsure if you have already indicated as such, please let the journal staff know immediately at plosgenetics@plos.org.

In the meantime, please log into Editorial Manager at https://www.editorialmanager.com/pgenetics/, click the "Update My Information" link at the top of the page, and update your user information to ensure an efficient production and billing process. Note that PLOS requires an ORCID iD for all corresponding authors. Therefore, please ensure that you have an ORCID iD and that it is validated in Editorial Manager. To do this, go to ‘Update my Information’ (in the upper left-hand corner of the main menu), and click on the Fetch/Validate link next to the ORCID field.  This will take you to the ORCID site and allow you to create a new iD or authenticate a pre-existing iD in Editorial Manager.

If you have a press-related query, or would like to know about making your underlying data available (as you will be aware, this is required for publication), please see the end of this email. If your institution or institutions have a press office, please notify them about your upcoming article at this point, to enable them to help maximise its impact. Inform journal staff as soon as possible if you are preparing a press release for your article and need a publication date.

Thank you again for supporting open-access publishing; we are looking forward to publishing your work in PLOS Genetics!

Yours sincerely,

Simon Gravel, Ph.D.

Guest Editor

PLOS Genetics

Hua Tang

Section Editor: Natural Variation

PLOS Genetics

www.plosgenetics.org

Twitter: @PLOSGenetics

----------------------------------------------------

Comments from the reviewers (if applicable):

----------------------------------------------------

Data Deposition

If you have submitted a Research Article or Front Matter that has associated data that are not suitable for deposition in a subject-specific public repository (such as GenBank or ArrayExpress), one way to make that data available is to deposit it in the Dryad Digital Repository. As you may recall, we ask all authors to agree to make data available; this is one way to achieve that. A full list of recommended repositories can be found on our website.

The following link will take you to the Dryad record for your article, so you won't have to re‐enter its bibliographic information, and can upload your files directly: 

http://datadryad.org/submit?journalID=pgenetics&manu=PGENETICS-D-20-01240R2

More information about depositing data in Dryad is available at http://www.datadryad.org/depositing. If you experience any difficulties in submitting your data, please contact help@datadryad.org for support.

Additionally, please be aware that our data availability policy requires that all numerical data underlying display items are included with the submission, and you will need to provide this before we can formally accept your manuscript, if not already present.

----------------------------------------------------

Press Queries

If you or your institution will be preparing press materials for this manuscript, or if you need to know your paper's publication date for media purposes, please inform the journal staff as soon as possible so that your submission can be scheduled accordingly. Your manuscript will remain under a strict press embargo until the publication date and time. This means an early version of your manuscript will not be published ahead of your final version. PLOS Genetics may also choose to issue a press release for your article. If there's anything the journal should know or you'd like more information, please get in touch via plosgenetics@plos.org.

Acceptance letter

Hua Tang, Simon Gravel

11 Feb 2021

PGENETICS-D-20-01240R2

Changes in the fine-scale genetic structure of Finland through the 20th century

Dear Dr Pirinen,

We are pleased to inform you that your manuscript entitled "Changes in the fine-scale genetic structure of Finland through the 20th century" has been formally accepted for publication in PLOS Genetics! Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out or your manuscript is a front-matter piece, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Genetics and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Alice Ellingham

PLOS Genetics

On behalf of:

The PLOS Genetics Team

Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom

plosgenetics@plos.org | +44 (0) 1223-442823

plosgenetics.org | Twitter: @PLOSGenetics

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Fig. Workflow of the study.

    (TIF)

    S2 Fig. Geographic distribution of study samples chosen by parents’ birth places.

    Geographic distribution of samples whose parents were born within 80 km from each other A) before the spatial sampling procedure, B) after excluding the individuals with a high number of local neighbors and C) after excluding the individuals with a high number of global neighbors. The individuals are displayed at the mean of their parents’ municipalities of birth after adding some jitter to the points to ensure the anonymity. Geographic outliers are excluded.

    (TIF)

    S3 Fig. FineSTRUCTURE tree and populations.

    FineSTRUCTURE tree at level 20 and the corresponding populations on a map when the tree is cut at A) level 2, B) level 15 and C) level 20.

    (TIF)

    S4 Fig. Geographic outliers.

    Maps show the location of individuals excluded as geographic outliers of A) refset 2 (3 individuals), B) refset 6 (11 individuals) and C) refset 10 (16 individuals). The excluded outliers are highlighted with black X-marks. The included individuals are shown with pale colors.

    (TIF)

    S5 Fig. Identity proportions for the FineSTRUCTURE populations.

    Panel A) shows the identity proportions when all 15 populations were used as reference populations. Panel B) shows the identity proportions using only the 10 populations that show identity proportion above 0.50 in panel A. Colors correspond to the populations in S3B Fig.

    (TIF)

    S6 Fig. Age distributions of reference groups and ancestor candidates.

    Age distributions of the reference groups of A) refset 2, B) refset 6, C) refset 10 and D) the ancestor candidates. The boxplot whiskers show the range, the boxes show the interquartile range and the dark line shows the median of the birth years.

    (TIF)

    S7 Fig. Schematic representation of our simulation strategy.

    In each simulation, 2^G individuals were sampled to represent the ancestors from G generations back in time (black box), where G varied between 1 and 5. All the subsequent descendants in generations G-1, G-2, …, were simulated to determine the genotypes of the target individual at generation 0 (grey box). In this example simulation, 1 ancestor is sampled from A-West (red) and the remaining 2^G – 1 ancestors were sampled from A-East (blue). The two adjacent bars correspond to the two haplotypes of an individual and the color corresponds to the ancestor candidate group.

    (TIF)

    S8 Fig. Location of ancestor candidates on genetic principal component space.

    The location of ancestor candidates on a plane defined by principal components (PC) 1 and 2 of the genetic structure for A) simulation settings for refset 2 and B) simulation settings for refsets 6 and 10.

    (TIF)

    S9 Fig. Geographic location of ancestor candidates.

    The geographic location of the ancestor candidates in simulation settings A) for refset 2 and B) for refsets 6 and 10. The names of the ancestor candidate groups are shown on right.

    (TIF)

    S10 Fig. Individual ancestry profiles for simulation scenarios between East and West.

    Ancestry profiles for 20 individuals in simulation settings involving ancestry groups R2_East and R2_West: A) for setting All-West, B) for All-East, C) for Almost-East and D) for Almost-West. Blue denotes the estimated proportion in reference group R2_East and red denotes the proportion in reference group R2_West. Gen (1,…,5) refers to the number of generations considered in the simulation. ‘Mean’ shows the average over the 20 simulated individuals.

    (TIF)

    S11 Fig. Average ancestry profiles for detailed simulation scenarios without below 5% proportions.

    Average ancestry for the simulation results in Fig 4 when the ancestry proportions below 5% were shrunk to zero and the remaining proportions were scaled to one. Panel A) presents individuals whose all ancestors come from one group (single origin) shown in the title estimated using refsets 2, 6 or 10. Panel B) presents individuals whose 2^G-1 ancestors, where G = 1…4 is the number of generations, originate from A-Southwest and 1 ancestor originates from the ancestor group in the title, estimated using refset 10. The colors correspond to the reference groups in Fig 2.

    (TIF)

    S12 Fig. Individual ancestry profiles for detailed single origin simulation scenarios.

    The individual ancestry profiles for 20 individuals whose both parents originate from the ancestor candidate group of A) A-Southwest, B) A-Bothnia, C) A-N_Karelia, D) A-Kainuu, E) A-Kuusamo, F) A-Lapland and G) A-Evacuated (see Fig 4A for the mean values). The colors correspond to the reference groups in Fig 2.

    (TIF)

    S13 Fig. Average ancestry profiles for detailed simulation scenarios.

    Detailed simulation results for mixed ancestry from ancestor groups A-Southwest and A-N_Karelia. Panel A) presents individuals whose 2^G-1 ancestors, where G is the number of generations, originate from A-Southwest and 1 ancestor originates from the region in the title. Top row shows the ancestry profiles estimated using refset 2, middle row shows the same for refset 6 and bottom row shows them for refset 10. Panel B) shows the same quantities for a simulation setting where all but one of the ancestors originate from A-N_Karelia.

    (TIF)

    S14 Fig. Individual estimates of the major ancestry component in detailed simulation results.

    Detailed simulation results for mixed ancestry from ancestor groups A-Southwest and A-N_Karelia (corresponding to S13 Fig). Panel A) presents individuals whose 2^G-1 ancestors, where G is the number of generations, originate from A-Southwest and 1 ancestor originates from the region in the title. Top row shows estimated ancestry in R2-West, middle row shows the same for R6-Southwest and bottom row shows them for R10-Southwest. Panel B) shows the same quantities for a simulation setting where all but one of the ancestors originate from A-N_Karelia and the reference groups whose estimates are shown are R2-East (top), R6-Savo-Karelia (middle) and R10-Savo-Karelia (bottom).

    (TIF)

    S15 Fig. Ancestry profiles for simulated first-generation mixed individuals.

    Average ancestry profiles for 20 simulated individuals whose parents come from different geographic regions are shown. Title describes the ancestor candidate groups of the parents used in the simulation.

    (TIF)

    S16 Fig. Principal component analyses with 1000 Genomes samples and FINRISK samples.

    A) PCA of 5 super populations of the 1000 Genomes (Phase 3) samples and our Finnish FINRISK (FIN-FR) samples. PCA of the FINRISK samples together with the B) non-Finnish European (EUR), C) East Asian (EAS), D) American (AMR), E) African (AFR) and F) South Asian (SAS) samples of the 1000 Genomes Phase 3. The FINRISK samples circled with red were identified to show admixture with one or more super populations and were excluded from the regional ancestry analyses. None of our reference individuals was among the excluded.

    (TIF)

    S17 Fig. Development of genetic ancestry profile in 12 regions using refset 2.

    The map on the left shows the regions and the locations of the individuals who form the 2 reference groups. The curves show the estimated ancestry proportion in each reference group as a function of the birth years of individuals born in each region (name of the region in the title).

    (TIF)

    S18 Fig. Development of genetic ancestry profile in 12 regions using refset 6.

    The map on the left shows the regions and the locations of the individuals who form the 6 reference groups. The curves show the estimated ancestry proportion in each reference group as a function of the birth years of individuals born in each region (name of the region in the title).

    (TIF)

    S19 Fig. Development of genetic ancestry profile in 12 regions using refset 10.

    The map on the left shows the regions and the locations of the individuals who form the 10 reference groups. The curves show the estimated ancestry proportion in each reference group as a function of the birth years of individuals born in each region (name of the region in the title).

    (TIF)

    S20 Fig. Regional changes in genetic ancestry from the R10-Evacuated reference group.

    Changes in the genetic ancestry proportion within mainland Finland using refset 10 but showing only the ancestry proportion from R10-Evacuated. The map on the left shows the location of reference individuals and R10-Evacuated is located at southeast corner of the map overlapping the region of Ceded Karelia (CKA).

    (TIF)

    S21 Fig. Entropy values regressed on birth years for the 12 study regions.

    The points represent entropy of the mean profile of individuals born during one year in a particular region estimated with respect to refset 10. The size of the point represents the number of individuals averaged in that yearly profile. The purple line is the linear regression line fitted to the data. The slope estimates the rate of change in heterogeneity of the ancestry profile.

    (TIF)

    S22 Fig. Rate of change in heterogeneity at each study region before (1923–1950) and after (1951–1987) the year 1950.

    The rate of change is calculated by regressing the mean entropy over the years with refsets 2, 6, and 10 on the year of birth.

    (TIF)

    S23 Fig. Average inbreeding coefficient, F, per study region.

    The values at the bottom report the average F per region with its standard error in parentheses. Whiskers show the 95% confidence interval.

    (TIF)

    S1 Table. Number of incorrectly assigned individuals in simulations between East and West.

    The number of individuals incorrectly assigned to a single origin, out of 20, based on whether their A) West ancestry component or B) East ancestry component was above the threshold. The threshold was defined as the second largest value in A) All-West or B) All-East simulation setting (corresponding to the 95% quantile in the simulation setting).

    (PDF)
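One reading of the thresholding rule in S1 Table can be sketched as below. The helper names and the numbers are hypothetical; the only elements taken from the caption are that the threshold is the second largest of the 20 values in the single-origin setting and that individuals whose ancestry component is above it are counted.

```python
def second_largest(values):
    """Second largest value; with 20 values this is roughly the 95% quantile."""
    if len(values) < 2:
        raise ValueError("need at least two values")
    return sorted(values, reverse=True)[1]

def count_above(values, threshold):
    """Number of values strictly above the threshold."""
    return sum(1 for v in values if v > threshold)

# Hypothetical component values for 20 simulated individuals in the
# single-origin setting used to define the threshold.
reference_setting = [0.90 + 0.005 * i for i in range(20)]
threshold = second_largest(reference_setting)

# Hypothetical component values for 20 individuals in another setting.
other_setting = [0.50 + 0.025 * i for i in range(20)]
n_flagged = count_above(other_setting, threshold)
```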

    S2 Table. Pairwise-FST values (×10^5) between ancestor candidate groups and reference groups of refset 10 (lower triangular) and their standard errors (upper triangular).

    (PDF)

    S3 Table. Ancestry proportions for single-origin simulations categorized into expected and unexpected ancestry.

    Column ‘Expected ancestry’ shows which reference groups were considered the closest to the corresponding ancestor group, and column ‘Total expected ancestry’ sums over the expected ancestries. Unexpected ancestry was defined as all ancestry other than the expected, and the average contribution of the unexpected groups is shown in column ‘Average unexpected ancestry’. Column ‘No shrink’ shows the raw ancestry estimates, and column ‘Shrink <5%’ shows the results after shrinking individual ancestry estimates below 5% to zero and rescaling the remaining non-zero ancestry proportions to sum to 100%.

    (PDF)
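The "Shrink <5%" step described for S3 Table (set estimates below 5% to zero, then rescale the rest to sum to 100%) can be illustrated with a short sketch. The function name and the example profile are hypothetical, and proportions are written as fractions of 1 rather than percentages.

```python
def shrink_profile(profile, threshold=0.05):
    """Zero out components below the threshold and rescale the rest to sum to 1."""
    kept = [p if p >= threshold else 0.0 for p in profile]
    total = sum(kept)
    if total == 0:
        raise ValueError("all components are below the threshold")
    return [p / total for p in kept]

# Hypothetical raw ancestry estimates for one individual (four reference groups).
raw = [0.60, 0.33, 0.04, 0.03]
shrunk = shrink_profile(raw)
```

Here the two components below 5% are dropped, and the remaining two are divided by their sum (0.93) so that the profile again sums to 1.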

    S4 Table. Range of the birth years (Min and Max) and the mean birth years of the reference groups.

    Mann-Whitney p-value corresponds to a test between the focal group and the union of the rest of the groups at that refset.

    (PDF)

    S5 Table. Number of study individuals excluded and included after each filtering step.

    (PDF)

    S6 Table. Number of reference candidates excluded and included after the steps of reference group identification process.

    The numbers in parentheses refer to the number of populations excluded or included.

    (PDF)

    Attachment

    Submitted filename: Responses_PlosGen_review_round1_v5.pdf

    Attachment

    Submitted filename: Responses_PlosGen_review_round2_v1.pdf

    Data Availability Statement

    All data underlying the findings are available via THL Biobank (https://thl.fi/en/web/thl-biobank/for-researchers) for research projects that are of high scientific quality and impact, are ethically conducted, and that correspond with the research areas of THL Biobank. The script for generating haplotypes for simulations is available at https://github.com/sinikerm/OffspringSimulator.


