Summary
The Finnish population is a unique example of a genetic isolate affected by a recent founder event. Previous studies have suggested that the ancestors of Finnic-speaking Finns and Estonians reached the circum-Baltic region by the 1st millennium BC. However, high linguistic similarity points to a more recent split of their languages. To study genetic connectedness between Finns and Estonians directly, we first assessed the efficacy of imputation of low-coverage ancient genomes by sequencing a medieval Estonian genome to high depth (23×) and evaluated the performance of its down-sampled replicas. We find that ancient genomes imputed from >0.1× coverage can be reliably used in principal-component analyses without projection. By searching for long shared allele intervals (LSAIs; similar to identity-by-descent segments) in unphased data for >143,000 present-day Estonians, 99 Finns, and 14 imputed ancient genomes from Estonia, we find unexpectedly high levels of individual connectedness between Estonians and Finns for the last eight centuries in contrast to their clear differentiation by allele frequencies. High levels of sharing of these segments between Estonians and Finns predate the demographic expansion and late settlement process of Finland. One plausible source of this extensive sharing is the 8th–10th centuries AD migration event from North Estonia to Finland that has been proposed to explain uniquely shared linguistic features between the Finnish language and the northern dialect of Estonian and shared Christianity-related loanwords from Slavic. These results suggest that LSAI detection provides a computationally tractable way to detect fine-scale structure in large cohorts.
Keywords: ancient DNA, population structure, imputation, community structure, medieval Estonian genomes, Finnish bottleneck, Estonian Biobank, long shared allele intervals, identity by descent
Introduction
Evidence derived from archaeology and genome-scale studies of ancient human remains explains high genetic homogeneity across present-day Europe in a world context by massive population movements associated with Steppe ancestry in the Late Neolithic and Early Bronze Age.1 Underneath this overarching homogeneity of allele frequencies, substantial regional differences can be revealed through the study of long identical-by-descent (IBD) segments that are sensitive to signals of regional mating patterns within the last millennia.2 While ancient DNA work has become pivotal for addressing questions about the genetic ancestry in European prehistory, the use of IBD-based methods has been limited so far because they require good-quality genotype calls, which can be made directly only from high-quality data. A study of a late-medieval 11.3× genome from Barcelona3 showed, intriguingly, an excess of IBD sharing locally with the present-day Spanish population, highlighting the potential of IBD sharing measures to be informative in ancient DNA analyses at historical time depths. However, most ancient genomes that are currently available have low coverage and are routinely assessed via haploid genotype calls. Yet, accurate imputation methods4, 5, 6 have been shown to enable the recovery of usable diploid genotype calls from ancient DNA,7, 8, 9 including from samples with coverage as low as 0.1×, with accuracy > 0.95 for common variants.10 In parallel, fast methods for IBD estimation from tens to hundreds of thousands of individuals have been recently developed for phased11, 12, 13, 14 and unphased15,16 genomic data along with scalable clustering methods for the detection of fine-scale community structure.17,18
Late Bronze and Early Iron Age migrations have been argued to be responsible for the spread of Finnic languages together with a minor Siberian genetic component (Figure 1D) in the circum-Baltic region;19,20,21 however, it has been less clear how much gene flow and contact over the Gulf of Finland has occurred in the last 2,000 years (Figures 1E and 1F). Linguistic studies have suggested that the differentiation of the Finnic from Finno-Volgaic languages dates back to 3,000–4,000 years ago.22,23,24 Numerous Baltic loan words in Finnic and archaeological evidence of metal work and proximity of fortified settlements point to extensive local contacts between the Finnic and Baltic speakers in the Late Bronze and Early Iron Ages,25 while the divergence of the Finnic and split of Estonian and Finnish may have occurred more recently between 1,000 and 2,000 years ago.22,23 The time gap between these two split dates means that the divergence of Finnic languages likely postdates the first arrival of Finnic languages in the region. Numerous Slavic loan words in Finnish related to the spread of Christianity,24 the similarities between North Estonian and Finnish, and the lack of record for historically attested migration events to Finland from the south since the 12th century point to a possible prehistoric migration event from North Estonia to Finland after the second wave of Slavic expansion (8th–10th centuries) potentially related to the intensification of agriculture in the region.22,26,27 The origin of the modern Finnish population with its unique “disease heritage” has been ascribed to founder events and population range expansions, from relatively small coastal distribution of ∼50,000 people to more than 5 million, within the last millennium.28,29 Further significant founder events most likely postdate the reforms introduced by the Swedish King Gustav Vasa in the 16th century.30
Finns and Estonians can be clearly distinguished in genetic distance-based analyses of modern genomes.36 Both Estonia and Finland show internally high levels of sub-structure,36, 37, 38 which in the case of Finland, seems to reflect geographic divisions and founder events during the late settlement process and its long-term isolation in the last 100 generations. There is no historical record for the last eight centuries of significant migration events across the Gulf of Finland apart from the accounts of Finnish settlements in Northeast Estonia in the 17th and 18th centuries, which could account for local patterns of Finnish IBD sharing in Northeast Estonia.38 However, the analyses of modern genomes cannot offer conclusive answers about the time depth and directionality of migration events that have caused regional and inter-regional patterns of genetic differentiation and similarity between Estonians and Finns. In this study, we focus on the potential of applying IBD-based methods on ancient genomic sequence data to address these questions. Some of our key results leverage an unphased IBD detector,16 which is typically viewed as reliable for segments ≥7 cM long, yet we find meaningful genetic signals by using shorter segments (>5 cM). Because stretches of shared alleles in an unphased context at these lengths are unlikely to always correspond to shared haplotypes,39 we refer to them as long shared allele intervals (LSAIs).
Material and methods
Present-day populations: Estonian Biobank data merged with 1000 Genomes Project European data
Illumina Global Screening Array data for 707,385 SNVs genotyped in 150,415 individuals of the Estonian Biobank (EstBB) were merged with the 1000 Genomes Project (1000 GP) Phase 3 data.40 After applying --maf 0.05, --geno 0.005, and --mind 0.01 filters in PLINK 1.9.0,41 we retained data for 143,774 EstBB individuals (representing >10% of the total population of Estonia), 503 individuals of European ancestry (CEU, Utah residents with Northern and Western European ancestry; GBR, British in England and Scotland; FIN, Finnish in Finland; IBS, Iberian populations in Spain; and TSI, Toscani in Italy) from the 1000 GP, and 254,325 overlapping SNVs with genetic map coordinates (build 37) as input for downstream analyses.
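The logic of the three quality filters can be illustrated with a minimal Python sketch. It is not the PLINK implementation: the function name, the genotype coding (alternate-allele counts with None for missing calls), and the order of operations (individual missingness, then variant missingness, then minor allele frequency) are assumptions made for illustration only.

```python
from typing import List, Optional, Tuple

Genotype = Optional[int]  # 0, 1, or 2 alternate alleles; None = missing call


def filter_genotypes(matrix: List[List[Genotype]],
                     maf: float = 0.05,
                     geno: float = 0.005,
                     mind: float = 0.01) -> Tuple[List[int], List[int]]:
    """matrix[i][j] is the genotype of individual i at variant j.

    Returns (kept_individual_indices, kept_variant_indices).
    Thresholds mirror the --maf, --geno, and --mind values in the text.
    """
    n_var = len(matrix[0]) if matrix else 0
    if n_var == 0:
        return [], []

    # 1) Drop individuals whose missing-call rate exceeds `mind`.
    keep_ind = [i for i, row in enumerate(matrix)
                if sum(g is None for g in row) / n_var <= mind]

    keep_var = []
    for j in range(n_var):
        calls = [matrix[i][j] for i in keep_ind]
        observed = [g for g in calls if g is not None]
        # 2) Drop variants whose missing-call rate exceeds `geno`.
        if not observed or 1 - len(observed) / len(calls) > geno:
            continue
        # 3) Drop variants whose minor allele frequency is below `maf`.
        alt_freq = sum(observed) / (2 * len(observed))
        if min(alt_freq, 1 - alt_freq) >= maf:
            keep_var.append(j)
    return keep_ind, keep_var
```

With the stringent thresholds used here, a single missing call in a small toy matrix is enough to remove an individual, so looser hypothetical thresholds are more convenient when experimenting with the sketch.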
Ancient DNA extraction and sequencing
As part of this study, DNA was extracted from medieval human remains of two individuals from Estonia: TPM003 from Tartu Püha Maarja Kirik, Tartu County and TUD001 from Tudulinna, Ida-Viru County. In addition, we analyzed a medieval tooth sample, PSN177, from the cemetery of the Hospital of St John, Cambridge, UK as a control to test the effect of local Estonian reference panel on imputation results and their effect on downstream population genetic analyses. The teeth used for DNA extraction were obtained with relevant institutional permissions from the Institute of History and Archaeology, University of Tartu and the Cambridge Archaeological Unit, Department of Archaeology, University of Cambridge, which excavated the remains from the cemetery of the Hospital of St John on behalf of St John’s College. All laboratory work was performed in dedicated ancient DNA laboratories of the Institute of Genomics, University of Tartu and the Department of Archaeology, University of Cambridge. The library quantification and sequencing were performed at the Institute of Genomics Core Facility, University of Tartu.
For extraction, we broke off or cut off apical tooth roots by using a drill and used them whole to avoid heat damage during powdering with a drill and to reduce the risk of cross-contamination between samples. Contaminants were removed from the surface of tooth roots by soaking in 6% bleach, rinsing with milli-Q water (Millipore) and 70% ethanol, and drying under a UV light. Next, we added EDTA and proteinase K and left the samples to digest on a rotating mixer at 20°C for 72 h to compensate for the smaller surface area of the whole root compared to powder. The DNA solution was concentrated to 250 μL (Vivaspin Turbo 15, 30,000 molecular weight cut-off [MWCO] polyethersulfone [PES], Sartorius) and purified in large volume columns (High Pure Viral Nucleic Acid Large Volume Kit, Roche) with the MinElute PCR Purification Kit (QIAGEN).
We built sequencing libraries by using NEBNext DNA Library Prep Master Mix Set for 454 (E6070, New England Biolabs) and Illumina-specific adaptors42 following established protocols.19,42 The samples were purified between steps with the MinElute PCR Purification Kit (QIAGEN). The libraries were amplified and both the indexed and universal primer (NEBNext Multiplex Oligos for Illumina, New England Biolabs) were added by PCR with HGS Diamond Taq DNA polymerase (Eurogentec). We implemented three verification steps to make sure library preparation was successful and to measure the concentration of double-stranded DNA sequencing libraries—fluorometric quantitation (Qubit, Thermo Fisher Scientific), parallel capillary electrophoresis (Fragment Analyzer, Agilent Technologies), and qPCR.
We sequenced DNA by using the Illumina NextSeq 500 platform with the 75 bp single-end method. First, we multiplexed samples to gain low-coverage data. Later, we generated an additional four full runs of data for TPM003 to increase coverage.
Ancient sequence data processing and authentication
Before mapping, the adaptor sequences and poly-G tails were cut from the ends of DNA sequences via cutadapt 1.11.43 We removed sequences shorter than 28 bp to avoid random mapping of sequences from other species. The sequences were mapped to reference sequence GRCh37 (hs37d5) with the Burrows-Wheeler Aligner (BWA 0.7.12)44 algorithm mem with re-seeding disabled. After mapping, the sequences were converted to BAM format and only sequences that mapped to the human genome were kept with samtools 1.3.45 Next, data from all flow cell lanes for the same sample were merged and duplicates were removed with picard 2.12. Indels were realigned with GATK 3.5,46 and reads with mapping quality under 10 were filtered out with samtools 1.3.
Because of post-mortem degradation, ancient DNA can be distinguished from modern DNA by shorter fragments and a high frequency of cytosine deamination at the ends of sequences. We used the program mapDamage2.047 to estimate the frequency of deamination damage with results for the three newly reported genomes shown in Figure S1. mtDNA contamination was estimated with contamMix.48 This included calling an mtDNA consensus sequence based on reads with mapping quality of at least 30 and positions with at least 5× coverage, aligning the consensus with a panel of 311 human mtDNA sequences, mapping the mtDNA reads to the consensus sequence, and running contamMix 1.0-10 with the reads mapping to the consensus and the 312 aligned mtDNA sequences with the option trimBases to trim seven bases from the ends of reads. For male individuals, X chromosome contamination was also estimated via the two methods in the script contamination.R incorporated in ANGSD.49
A detailed summary of the sequence data of all 14 ancient samples from Estonia used in this study, including 12 published genomes, and of the PSN177 genome from Cambridge is provided in Table S1.
Genotype calling of the high-coverage genome
The genotype calls of the high-coverage TPM003 genome were determined with GATK 3.5 HaplotypeCaller50 with Build37 reference and --genotyping_mode GENOTYPE_GIVEN_ALLELES, --output_mode EMIT_ALL_SITES, and --alleles variant.list options. In total, 12.6 million SNVs that had minor allele frequency (MAF) higher than 0.1% in a subset of 2,076 high-coverage whole-genome sequences51 were used in variant.list. Called variants were filtered with genotype quality (GQ) > 30, read depth (DP) > 10, and genotype probability (GP) > 0.99. Details of the down-sampling of the high-coverage genome and the imputation of the low-coverage replicas are given in the supplemental methods.
Principal-component analyses
We used FlashPCA252 to perform principal-component analysis (PCA) on high-coverage and imputed ancient genomes in the context of 69,218 EstBB samples and 503 Europeans (CEU, GBR, FIN, IBS, and TSI) from the 1000 GP data.40 After merging genotype data of 15 ancient (including 14 Estonian samples and one British sample) and 69,713 modern individuals, we thinned the data by excluding variants in linkage disequilibrium with the PLINK41 --indep-pairwise 1000 50 0.5 option and excluded the recommended52 range of likely non-neutral regions with --exclude range exclusion_regions_hg19.txt. After thinning, 153,813 variants remained available for PCA, which was performed with default settings of FlashPCA2.
We assessed the performance of imputed ancient genomes by comparing the placement of TPM003 high coverage (23×) and its down-sampled (to 0.1×) and imputed replicas in PCA performed with FlashPCA2 and smartpca in analyses together with a sub-sample of 1,040 modern genomes (including 503 Europeans of the 1000 GP and 537 EstBB samples). We confirmed first that the two methods produce highly correlated PC1 (r = 0.999997) and PC2 (r = 0.999979) values and performed further analyses including the projections of the five 0.1× replicas of TPM003, which were haploid-called with smartpca that offers the option of projection with the least-squares equations. We observe minor shifts between the position of projected and unprojected copies of TPM003 relative to their most proximate neighbors in the EstBB data on the plot (Figure S2). Similar shifts were observed in additional analyses with different sets of modern references, different sets of SNPs, and different ancient samples (data not shown). Because the positional shifts were relatively minor and would not affect the conclusions drawn from the analyses, these were not followed up further. The data were converted to EIGENSTRAT format with the program convertf from the EIGENSOFT 7.2.0 package.53 The results were plotted in R with ggplot2.54
Y chromosome analyses
In total, 113,217 haplogroup informative Y chromosome variants from regions that uniquely map to the Y chromosome51,55,56 were called as haploid from the BAM file of the high-coverage TPM003 with --doHaploCall function in ANGSD.49 Derived and ancestral allele as well as haplogroup annotations for each of the called variants were added via BEDTools 2.19.057 intersect option. Haplogroup R1a-YP578 assignment received the highest support of informative positions called in the derived state in TPM003. Further fine-level phylogenetic assignment was contextualized (Figure S8, Table S11) within the Y chromosome variation of Estonian high-coverage genomes51 and the phylogenetic tree of Yfull YTree v.8.10.0.
LSAI sharing and individual connectedness inference
LSAI segments and kinship coefficients were estimated from merged PLINK files of EstBB samples, 503 Europeans from the 1000 GP, and 15 ancient Estonian genomes with IBIS16 (identical by descent via identical by state) v.1.20.6 using -min_L 5 cM and -c 0.0005 kinship coefficient cut-offs (corresponding to a minimum requirement of one shared segment of >5 cM length and total sharing of at least 0.1% of the genome, ∼6.6 cM) for most of the analyses (except for Table S4 comparing 2, 5, 7, and 10 cM thresholds) and -maxDist 0.1 and -mt 300 parameters. The total number of SNPs used varied between 244,643 and 254,326 MAF > 0.05 variants.
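The correspondence between the kinship cut-off and the total shared length can be checked with a short calculation. The sketch below assumes that the kinship coefficient equals one quarter of the fraction of the genome shared on one haplotype (IBD1), with no sharing on both haplotypes, and that the autosomal genetic map is about 3,300 cM long; neither assumption is stated in the text, and the function name is hypothetical.

```python
def min_shared_cm(kinship: float, map_length_cm: float = 3300.0) -> float:
    """Total IBD1 length (cM) implied by a kinship coefficient.

    Assumes kinship = (IBD1 fraction) / 4 with no IBD2 sharing, and a
    hypothetical autosomal map length of `map_length_cm`.
    """
    return 4.0 * kinship * map_length_cm


shared = min_shared_cm(0.0005)              # about 6.6 cM
diploid_fraction = shared / (2 * 3300.0)    # about 0.001, i.e. 0.1%
```

Under these assumptions a cut-off of 0.0005 corresponds to roughly 6.6 cM, or 0.1% of the diploid map length, in line with the values quoted above.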
Although IBIS has the highest IBD inference accuracy for >7 cM segments,16 we use >5 cM threshold in our diachronic inferences because our focus is on relationships at generational distances > 15 at which longer IBD block sharing expectations become relatively low,58 particularly in combination with the loss of sensitivity to detect long IBD segments from imputed ancient DNA sequences, as shown by the fragmented nature of TPM003’s self-sharing in Table S6. Because true IBD segments of this length are not expected to be common at these generational distances, we need to consider the detected segments as “long shared allele intervals” (LSAIs) rather than IBD segments sensu stricto. Because they are inferred from unphased data after removal of rare variants (which cannot be imputed with sufficient accuracy), the LSAIs are likely to include undetected recombination points and smaller IBD segments residing on different haplotypes. To control for the potential effect of differences in sites with missing data on the LSAI inference, we analyzed all low-coverage (<0.3×) samples individually (in order to avoid cumulative loss of SNP numbers) against the EstBB and 1000 GP data by using the -setIndexEnd option in IBIS after filtering out, on individual basis, variants for which the low-coverage genome had missing data.
Further details on the LSAI inference parameter choice are given in the supplemental methods.
The probability of individual connectedness (PiC) score for individual x in group Z was estimated as the proportion of individuals from group Z with whom individual x shared IBD above the given threshold. In practice, we estimated the count of connected individuals from group Z from sorted IBIS .coef output files by using the Linux “join” function to add group codes to individual identifiers and by using the “crosstab” function of datamash59 to generate the table of counts, each of which we divided by the total number of individuals in group Z to obtain the individual connectedness proportions by groups (the PiC scores).
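The PiC definition can be restated as a minimal Python sketch. It does not reproduce the join and datamash pipeline described above; the function name and the input structures (a list of connected pairs that already pass the thresholds, and a dictionary mapping individuals to group codes) are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def pic_scores(pairs: Iterable[Tuple[str, str]],
               group_of: Dict[str, str]) -> Dict[str, Dict[str, float]]:
    """Return PiC scores as {individual: {group: proportion}}.

    pairs    -- individual pairs passing the segment-length and kinship cut-offs
    group_of -- group code for every analysed individual
    """
    # Group sizes are the denominators of the PiC scores.
    group_size = defaultdict(int)
    for group in group_of.values():
        group_size[group] += 1

    # Collect the distinct partners of each individual.
    partners = defaultdict(set)
    for a, b in pairs:
        if a != b:
            partners[a].add(b)
            partners[b].add(a)

    scores = {}
    for ind in group_of:
        counts = defaultdict(int)
        for partner in partners.get(ind, ()):
            counts[group_of[partner]] += 1
        # Count of connected individuals divided by the total group size.
        scores[ind] = {g: counts[g] / n for g, n in group_size.items()}
    return scores
```

As in the text, the denominator is the total number of individuals in group Z, so the score of an individual for its own group includes that individual in the denominator.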
Simulations of LSAI sharing under different demographic models
To investigate the patterns of IBD sharing between contemporary and ancient samples expected under different demographic scenarios, we simulated eight different demographic scenarios, described in Table S9, by using msprime.60 In all simulations, we used the discrete time Wright-Fisher model (model = “dtwf”) to simulate generations 0 to 1,000 and then switched to the Hudson model (model = “hudson”) as advised by msprime documentation for simulations with large sample size and multiple chromosomes. We used a recombination map obtained by concatenating two 1000 GP maps for chromosome 1 (GRCh37) separated by a region of 50 cM to increase analyzed sequence length. Mutation rate was set to 1.25 × 10⁻⁸. In each simulation, we sampled 400 haplotypes (200 diploid samples) per time point per population (in the case with two populations simulated) at six time points: 0, 10, 20, 30, 50, and 100 generations ago. We filtered out positions falling into telomeric or centromeric regions of the chromosome 1 recombination map or in the junction between the two maps as well as positions with derived allele frequency less than 5% in the simulated dataset to match the filtering scheme applied on empirical data. The LSAI segments were detected via IBIS16 with the same thresholds (at least one >5 cM shared segment and kinship coefficient > 0.0005) as used in the analyses of the empirical data.
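The construction of the doubled recombination map can be sketched as follows. The representation of a map as (position in bp, cumulative cM) pairs, the function name, and the physical width of the junction (gap_bp) are assumptions for illustration; the text specifies only that the two copies are separated by 50 cM.

```python
from typing import List, Tuple

GeneticMap = List[Tuple[int, float]]  # (position in bp, cumulative cM)


def concatenate_maps(gmap: GeneticMap,
                     gap_cm: float = 50.0,
                     gap_bp: int = 1) -> GeneticMap:
    """Append a shifted copy of `gmap` to itself.

    The second copy starts `gap_bp` base pairs and `gap_cm` centimorgans
    after the end of the first copy (gap_bp is a hypothetical choice).
    """
    first_bp, first_cm = gmap[0]
    last_bp, last_cm = gmap[-1]
    bp_shift = last_bp + gap_bp - first_bp
    cm_shift = last_cm + gap_cm - first_cm
    second = [(bp + bp_shift, cm + cm_shift) for bp, cm in gmap]
    return list(gmap) + second
```

The total genetic length of the result is twice that of the input map plus the gap, and positions falling in the junction can then be excluded in the same way as telomeric and centromeric regions.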
Unsupervised community extraction analyses
We used the list of individual pairs sharing at least one >5 cM LSAI segment and having a kinship coefficient > 0.0005 from the IBIS (.coef) results as an input (with three columns: id1, id2, kinship coefficient) for community extraction analyses. This list was passed to a custom R script that runs a hierarchical clustering method for “community detection,” known as the Louvain algorithm,17 that is implemented in the R library “igraph.”61 We introduced an additional step to quantify the significance of the extracted communities. Five nested cycles of the Louvain algorithm were run on each community passing the Wilcoxon rank-sum significance test implemented in the R library exactRankTests.62
In our pipeline, the igraph algorithm first detects all possible level-one communities and then each community undergoes a Wilcoxon rank-sum test that weighs the internal and external degrees of the community connections in order to quantify its significance. In cases of significantly (p value < 0.05) more internal than external connections, the communities are accepted and passed on to the analyses at the next level. All individuals from the communities that do not pass significance testing are excluded from further steps. Every next cycle of community extraction begins with modularity detection followed by testing of statistical significance before moving to the following cycle. We let this process continue up to the fourth level. A fifth cycle is internally implemented only for testing the statistical significance of the level-four communities. By the end of this process, a network of connections will include all those communities statistically supported at each level and a per community list of included individuals. At this point, using a series of custom scripts, we combined all the community levels in order to assign each individual to a community defined by a unique alphanumeric code, summarizing a sample’s complete path from one level to another. Based on the significance test results, each sample’s last community level assignment can be confirmed (the sample maintains its position) or be changed (sample is reassigned to the previous statistically significant level). Finally, connectivity scores are estimated for individuals of the extracted communities.
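The significance step for a single community can be sketched in Python as follows. This is not the custom R script used in the study: instead of the exact test from exactRankTests, the sketch uses a normal approximation to the one-sided rank-sum test with tie and continuity corrections, and all function names are hypothetical.

```python
import math
from collections import defaultdict


def degrees(edges, community):
    """Internal and external degree of each member of `community` (a list)."""
    members = set(community)
    internal, external = defaultdict(int), defaultdict(int)
    for a, b in edges:
        for u, v in ((a, b), (b, a)):
            if u in members:
                if v in members:
                    internal[u] += 1
                else:
                    external[u] += 1
    return ([internal[m] for m in community],
            [external[m] for m in community])


def rank_sum_p_greater(x, y):
    """One-sided p value that values in x tend to exceed values in y."""
    pooled = sorted((v, src) for src, vals in ((0, x), (1, y)) for v in vals)
    n, n1, n2 = len(pooled), len(x), len(y)
    ranks, tie_term, i = [0.0] * n, 0, 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        for k in range(i, j):
            ranks[k] = (i + 1 + j) / 2.0  # average rank within a tie block
        tie_term += (j - i) ** 3 - (j - i)
        i = j
    w = sum(r for r, (_, src) in zip(ranks, pooled) if src == 0)
    u = w - n1 * (n1 + 1) / 2.0
    var = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var == 0:
        return 1.0
    z = (u - n1 * n2 / 2.0 - 0.5) / math.sqrt(var)  # continuity correction
    return 0.5 * math.erfc(z / math.sqrt(2))


def community_is_significant(edges, community, alpha=0.05):
    """Accept a community with significantly more internal than external links."""
    internal, external = degrees(edges, community)
    return rank_sum_p_greater(internal, external) < alpha
```

A community accepted by such a test would be passed to the next cycle of modularity detection, whereas members of a rejected community would be excluded from further steps, as described above.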
The PiC score was used for the outlier detection process in each extracted community. Within each community, we screened individuals by their PiC scores and identified as “outlier candidates” individuals below the lower whisker of the boxplot distribution of the PiC scores in the community they were assigned to. Each list of “outlier candidates” was tested for significance against the overall distribution of the PiC scores in that same community via a custom R script. Communities with more than 25 individuals were tested with Rosner’s test, whereas communities with 25 or fewer members were tested with Dixon’s test. Individuals with p values < 0.05 were marked as significant outliers and removed from further analyses. Eventually, 330 samples out of 4,852 were removed from the intensity score matrix as outliers, including 281/320 Slavic- or Baltic-speaking EstBB participants and 49/4,419 ethnic Estonians (Table S15). Community membership proportions were then plotted as pie charts in R with the ggplot2 package.
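The first screening step, the selection of outlier candidates, can be sketched in Python. The sketch assumes the conventional boxplot rule in which the lower whisker extends no further than 1.5 interquartile ranges below the first quartile, and it computes quartiles with the “inclusive” method of the standard library, which may differ slightly from the hinges used by R. The subsequent Rosner’s and Dixon’s tests are not reproduced, and the function name is hypothetical.

```python
import statistics
from typing import Dict, List


def outlier_candidates(pic_by_individual: Dict[str, float],
                       k: float = 1.5) -> List[str]:
    """Individuals whose PiC score lies below Q1 - k * IQR in their community."""
    values = list(pic_by_individual.values())
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    lower_fence = q1 - k * (q3 - q1)
    return sorted(ind for ind, score in pic_by_individual.items()
                  if score < lower_fence)
```

Candidates returned by such a function would then be passed to the formal outlier tests before any individual is removed.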
Phenotype prediction analyses
We used vcftools63 (--snp option) to extract the genotype information at 104 phenotype informative markers already analyzed,64 after excluding nine SNPs absent in our Estonian reference panel, from the high-coverage TPM003 and for its down-sampled (0.1× and 0.3×) and imputed copies. We then filtered the genotypes to keep only variants with GP ≥ 0.99 and recorded them as the number of effective alleles by using PLINK (--recode A --recode-allele option). For the HIrisPlex-S set for the pigmentation prediction,65 we uploaded the genotype data to the HIrisPlex-S webtool after reformatting by using the “merge” function in R to combine information from all informative SNPs. We interpreted the results of the webtool according to its manual to obtain the pigmentation prediction (Table S13).
Further details of the phenotype prediction concordance estimation and analyses of the SLC24A5 region are given in the supplemental methods.
Results
While IBD-based methods can offer high-resolution insights into the recent phases of our demographic history, the accuracy and robustness of shared IBD inferences (or of the related signal we explore here, LSAI) from low-coverage ancient genomes have not yet been determined. To address this, we sequenced the genome of a 15th century male individual (TPM003) from Tartu Püha Maarja (St. Mary) parish cemetery (Estonia) to an average depth of 23× (Table S1). We determined the genotype calls of the high-coverage genome of TPM003 directly and then compared the results against genotype calls from five down-sampled copies of 0.1× coverage, each of which we imputed by using a panel of 2,076 Estonian high-coverage sequences.51 We estimated the average proportion of matching heterozygote calls between the imputed and high-coverage data, our primary estimator of imputation accuracy, at 98.6% for common (MAF > 0.05) variant sites and observed a marked drop in accuracy to <95% and <80% for variants with MAF < 0.05 and MAF < 0.01, respectively (Table S2). In further analyses, we used only variants with MAF > 0.05.
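One way to compute a heterozygote concordance of this kind is sketched below. The exact definition used for Table S2 is not given in this section, so the sketch makes an assumption: concordance is the fraction of sites called heterozygous in the high-coverage genome, and present in the imputed replica, at which the imputed genotype is also heterozygous. The function name and the genotype coding are hypothetical.

```python
from typing import Dict


def het_concordance(high_cov: Dict[str, int], imputed: Dict[str, int]) -> float:
    """Fraction of high-coverage heterozygous sites recovered as heterozygous.

    Genotypes are coded as alternate-allele counts (0, 1, or 2), keyed by
    a site identifier. Sites absent from `imputed` are ignored.
    """
    het_sites = [site for site, gt in high_cov.items()
                 if gt == 1 and site in imputed]
    if not het_sites:
        return float("nan")
    matches = sum(imputed[site] == 1 for site in het_sites)
    return matches / len(het_sites)
```

Averaging this quantity over the five down-sampled replicas, separately for each allele frequency bin, would give per-bin accuracy estimates comparable in spirit to those reported above.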
We next analyzed the imputed low-coverage ancient samples from Estonia and one medieval sample from the UK together in context of genotype data from 69,218 individuals from the EstBB and 503 Europeans from the 1000 GP40 by using FlashPCA252 and smartpca53 (Figure 2 and Figure S1). We observed a clear distinction of Estonians from other European populations, including Finns (Figure 2A). By the fixation index (FST) statistic, Estonians are more differentiated from neighboring Finns than, for example, 1000 GP Italians from Tuscany are from the Iberians (Figure 2B). We found that all 14 imputed Bronze Age, Iron Age, and medieval samples from Estonia cluster together with present-day Estonians approximately within the same broad geographic regions of Estonia in which they were buried, although the resolution afforded by these analyses did not allow for finer county-level assignments (Figure 2D). Similarly, we found that the medieval British sample that we imputed together with five Estonian medieval genomes maps close to the GBR cohort (Figures 2A and 2C). We confirmed the robustness of the placement in PCA of ancient imputed genomes directly without the use of projection by comparative analyses of high-coverage, imputed, and haploid-called and projected data (Figure S2). Notably, the down-sampled replica of TPM003, imputed from 0.1× coverage, mapped next to the closest neighbors from the EstBB in the PCA constructed with the high-coverage sample without imputation, suggesting that high accuracy (with less variance than from projections of haploid-called genotype data) of individual ancestry mapping is possible from imputed data at this coverage (Figure S2).
To explore regional LSAI sharing patterns, we used IBIS16 to extract pairs of individuals who share long unphased (>5 cM) LSAI segments and estimated kinship coefficient > 0.0005. We introduce PiC, the probability of individual connectedness, as a simple measure to explore patterns of LSAI sharing within and among populations by user-defined segment length (L) and kinship coefficient k (as a measure of total genome-wide IBD sharing) thresholds. We first compared the outgroup-f3 statistic as a measure of drift sharing against PiC among Estonians, Baltic and Slavic speakers from EstBB, and Finns from 1000 GP and observed that PiC offers high resolution in distinguishing local differences (Figure S3). There is a notable decline of within-region connectedness across year of birth cohorts (Figure S4), which most likely reflects higher mobility within Estonia in the last few generations. Consistent with the lack of major geographic barriers and proximity, Estonians share most drift with Baltic-speaking Latvians, while by the PiC statistic, Finns are the most closely connected group to Estonians. The differentiation of Estonians and Finns by drift-sensitive statistics, such as f3, can be explained by the founder effects in the Finnish demographic history, a finding that is consistent with the higher genetic FST differentiation between Estonians and Finns than among Tuscan and Iberian genomes (Figure 2B); the higher level of Estonian connectedness with Finns than with Baltic-speaking neighbors by PiC requires, however, further scrutiny with regards to the time depth of these connections. Analysis of ancient genomes can provide answers about whether the LSAI segment sharing reflects recent gene flow or some other aspects of shared demographic history.
We assessed IBD sharing between ancient and modern genomes from Estonia and found that the proportion of individuals from the EstBB with whom ancient genomes share LSAI segments increased from Bronze to Iron Age and medieval periods (Table S3). We observed significantly higher PiC scores between EstBB and medieval (12th–16th centuries) than EstBB and Iron (p = 0.02, two-tailed t test) or Bronze Age (p = 0.017) samples when we used LSAI length > 5 cM threshold. These results stand in contrast to the lack of clear differences between the diachronic samples in their >2 cM LSAI sharing (Table S4). Further, we observe that the sharing of >5 cM and >7 cM LSAI is at comparable levels for modern-modern and modern-medieval pairs of samples, while >10 cM segments can be detected more abundantly in modern-modern than modern-ancient pairs, most likely because of the excess of distant genealogical relationships among the modern samples that would be absent in modern-ancient pairs.
To explore further regional details of LSAI sharing patterns between Estonians and their geographic neighbors in light of evidence from ancient imputed genomes, we focused on PiC scores (L > 5 cM, k > 0.0005) in a subset of Estonians of the EstBB born before 1940 for whom county- and parish-level information of birthplace was available (Figure 3, Figure S9). Under realistic scenarios of human population densities and dispersal rates, virtually all pairwise shared IBD blocks longer than 4 cM are expected to coalesce to common ancestors within the last 100 generations66 or approximately 3,000 years. Consistent with this prediction, we observe marginally low levels (1%–2%) of >5 cM LSAI sharing between Estonians and other East European populations (Poles, Belarusians, Lithuanians) (Figure 3) with whom they share Steppe ancestry31 through Late Neolithic dispersals from a common Corded Ware culture source (Figure 1C).
Estonian Iron and Bronze Age individuals sampled at a 2,400–2,800 years time depth show >10× higher connectivity with modern Finnic- and Baltic-speaking populations than with West Europeans (Table S3), while their PiC scores with present-day Estonians are lower than with Baltic-speaking Latvians and Lithuanians (Table S3, Figure 3). These observations are in line with common ancestry sharing in a broader area of Corded Ware culture before the arrival of Finnic speakers: consistent with this model (Figure 1C), Belarusians and Poles show more LSAI sharing with the Bronze Age than with present-day individuals from Estonia (Figure 3). We observe no significant excess of PiC scores between EstBB Estonians and Iron or Bronze Age genomes sampled from the same Estonian counties (Table S10, Figure 3). Neither do we see higher regional affinity between North Estonian Bronze and Iron Age samples in relation to Iron Age samples from Saaremaa. Overall, these results suggest that the present-day county-level LSAI sharing patterns were not yet fixed in the Iron Age.
In contrast to individuals sampled from earlier periods, six Estonian medieval genomes from the 13th–16th centuries share significantly more >5 cM LSAIs with present-day individuals born in the same county in Estonia (Figure 3, Figure S9). Furthermore, all ancient genomes from Estonia studied here show high affinity not only to present-day Estonians but also to present-day Finns at levels up to an order of magnitude higher than to Swedes (Table S10), including Late Bronze Age (average 5%) and Iron Age (average 5%, range 3%–8%) genomes. Estonians share more Finnish LSAIs than 82 EstBB Latvians and Lithuanians (average 2%, range 0%–7%) or a medieval low-coverage genome from Cambridgeshire, UK imputed with Estonian medieval samples (Figure 4). Medieval Estonians, however, share significantly more >5 cM LSAIs with Finns (10.1% average, Table S3) than Iron Age (average 4.7%, p = 0.006, two-tailed t test) or modern Estonians (8.7% on average), suggesting that recent (17th–18th centuries) localized migration events cannot explain the excess of Finnish LSAI sharing that we observe across Estonia today. Instead, these findings point to a migration event across the Gulf of Finland earlier than the 13th century as being responsible for the observed patterns.
We observe higher consistency in regional LSAI sharing patterns between the high-coverage genome and its down-sampled replicas (Table S8, for further details see supplemental methods). To summarize the compound effect of imputation errors on LSAI-based ancestry mapping of the ancient samples, we applied the uniform manifold approximation and projection (UMAP) dimension reduction method on the regional (county-based) PiC scores (Figure 5) and observed that the imputed TPM003i0.1× mapped closely to its high-coverage version that had not been imputed. This, along with regional clustering of other medieval genomes among EstBB individuals from the same geographic context suggests that (1) LSAI inference through imputation from ancient low-coverage genomes can be achieved sufficiently accurately for addressing questions about regional ancestry; (2) >5 cM LSAI segments persist and remain regionally informative for at least 800 years, over which time the regional genetic identities in Estonia have remained relatively distinct from one another; and (3) considering the fact that local Iron and Bronze Age populations do not show region-specific affinity to present-day local communities—although most likely being genetically ancestral to these in a broader geographic sense—these regional LSAI sharing patterns that unite medieval and present-day Estonians were most likely created between the Iron Age and the 12th/13th centuries AD whence our earliest medieval samples derive.
The geographic patterns of connectedness we estimated from PiC scores of 15 present-day counties of Estonia rely on administrative divisions that may not have been meaningful in the past. To further test the robustness of the inference of geographic patterns in our data, we used an unsupervised modularity optimization technique, called the Louvain method,17 that clusters individuals into modular units (communities) by their LSAI connectivity among individuals without the use of any geographic or other sample pooling criteria. We extracted communities by using a nested application of the Louvain algorithm, allowing each detected community to undergo a further cycle of community identification on the basis of significant excess of internal as opposed to external connections. We applied the Louvain method on the IBIS results for 4,739 EstBB donors born before 1940, 14 ancient genomes from Estonia, and 99 Finns from the 1000 GP. The Louvain method revealed four first-order and 20 second-order communities, which roughly corresponded to the main geographic regions of Estonia (Figure 6). Notably, all Finns of the 1000 GP data clustered together in one of the 3rd-level sub-clusters, I7b, of a 2nd-level cluster, I7, that has predominantly Northeast Estonian provenance (Figure 6, Table S15). The Louvain method places all six medieval genomes from Estonia into communities containing modern genomes from the same geographic region as their burial place while lumping all eight Iron and Bronze Age genomes, regardless of their geography, to community I4 (Figure 6, Table S15). The I4 community contains a small number of modern counterparts and is characterized by low connectedness both internally and to other communities, which is uncharacteristic of modern and medieval genomes (Table S15).
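The quantity that the Louvain method greedily optimizes is the modularity of a partition. The following is a minimal sketch of that score, assuming a toy undirected graph in which nodes are individuals and an edge means that a pair shares at least one LSAI above the chosen length threshold. The sample labels, the edge list, and the function name are hypothetical and are not taken from the community extraction scripts of this study.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Newman modularity Q of a partition of an unweighted, undirected graph.

    edges: iterable of (u, v) pairs, each undirected edge listed once.
    community_of: dict mapping every node to its community label.
    """
    edges = list(edges)
    m = len(edges)
    if m == 0:
        raise ValueError("graph has no edges")

    degree = defaultdict(int)
    internal = defaultdict(int)  # edges with both ends in the same community
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        if community_of[u] == community_of[v]:
            internal[community_of[u]] += 1

    degree_sum = defaultdict(int)  # summed degree of each community
    for node, d in degree.items():
        degree_sum[community_of[node]] += d

    # Q = sum over communities c of [ L_c / m - (D_c / 2m)^2 ]
    return sum(
        internal[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum
    )


# Hypothetical toy data: two tightly connected trios joined by one edge.
toy_edges = [
    ("a", "b"), ("b", "c"), ("a", "c"),
    ("d", "e"), ("e", "f"), ("d", "f"),
    ("c", "d"),
]
two_groups = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 2, "f": 2}
one_group = {node: 1 for node in two_groups}

q_split = modularity(toy_edges, two_groups)
q_single = modularity(toy_edges, one_group)
print(q_split, q_single)
```

In this toy case the two-community partition scores higher than the single-community one, which is the kind of excess of internal over external connections that a nested application would test again inside each detected community.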
We next ran simulations with msprime60 in order to better understand the observed patterns of extensive LSAI sharing between ancient and modern Estonian genomes and how these are affected by demographic history (Figures S5 and S6). The simulation models were inspired by effective population size (Ne) trajectories obtained by applying IBDNe67 to modern high-coverage Estonian genomes.38 We show (Table S9) that Ne and its changes over time significantly affect the pattern of IBD sharing between individuals. First, unsurprisingly, the results of the simulations show that the fraction of the population that an average individual is connected to is inversely and linearly dependent on Ne (compare Figures S6A and S6C versus Figures S6F and S6H), resulting in little expected connectedness in large populations. Second, under a population model with a recent exponential growth, modern individuals can have a higher LSAI sharing with ancient individuals sampled from periods of small population size preceding the growth compared to IBD sharing with present-day individuals; the specific pattern is dependent on the duration of the growth period and the growth rate (Figure S6). Third, under scenarios realistic for Estonian subpopulations, we find that present-day individuals are expected to have similarly high levels of LSAI sharing with their contemporaries and with ancient individuals from up to 30 generations (∼900 years) ago at maximum, and there is a notable drop deeper in time (Figures S5A and S5C, Figures S5B and S5D, and Figures S5E and S5G). This means that our simulations do not support a model by which the high connectedness between Finns and Estonians could derive from Iron Age migrations circa 100 generations ago (Figure 1A). 
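The inverse dependence on Ne can be illustrated with a back-of-the-envelope calculation that is separate from the msprime simulations reported here. The sketch below assumes an idealized constant-size, randomly mating diploid population, in which two lineages at a locus coalesce in any given generation with probability 1/(2Ne). It gives a per-locus probability of recent shared ancestry, not a PiC score, and the function name and Ne values are illustrative only.

```python
def prob_coalesce_within(generations, ne):
    """Probability that two lineages at one locus find a common ancestor
    within `generations` generations, for a constant diploid size `ne`."""
    if ne <= 0 or generations < 0:
        raise ValueError("ne must be positive and generations non-negative")
    per_generation = 1.0 / (2.0 * ne)
    return 1.0 - (1.0 - per_generation) ** generations


# Illustrative (assumed) effective sizes, all evaluated at 30 generations.
for ne in (5_000, 50_000, 500_000):
    p = prob_coalesce_within(30, ne)
    print(f"Ne = {ne:>7}: P(coalescence within 30 generations) = {p:.2e}")

# For generations much smaller than 2 * Ne the probability is close to
# generations / (2 * Ne), so a tenfold larger Ne gives roughly a tenfold
# smaller chance of recent shared ancestry.
ratio = prob_coalesce_within(30, 5_000) / prob_coalesce_within(30, 50_000)
```

Under these assumptions the probability scales almost exactly as 1/Ne, which matches the qualitative expectation of little connectedness in large populations.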
Finally, under a simplistic model of a clean population split with no subsequent gene flow, present-day individuals from one of the populations are expected to show an increasing level of IBD sharing with individuals from the other population as we sample from time points successively closer to the split time (Figures S5F and S5H, blue boxes). The latter observation explains why present-day Finns can have higher IBD sharing with medieval rather than contemporary Estonians from certain Estonian regions.
The high-coverage genome of TPM003 allowed us to examine his Y chromosome at high resolution in the context of a large reference set of 1,160 high-coverage sequences from Estonia.51 Consistent with the autosomal LSAI sharing results, we detect a signal of regional clustering of TPM003 together with lineages from Southeast Estonia in a newly defined R1a1c-B2153 clade (Figure S8, Table S11), which is nested within a broader set of Y chromosomes in a clade R1a1c-YP578. According to the YFull tree, this clade has been estimated to have a coalescent date of 2,100 (confidence interval [CI] 95% 2,300–1,800) years and a geographic distribution mainly in present-day Russia and Finland. Although R1a1c-YP578 is widely spread across Estonia (3.1% on average), the newly defined R1a1c-B2153 has, to our knowledge, not yet been found outside of Estonia. Among 1,160 Estonian Y chromosomes, it is only found in six individuals from Tartu and Põlva counties, including a grandfather-father-son trio from Tartumaa. Unsurprisingly, considering the generational distance between the ancient and modern genomes, we find no evidence of triangular autosomal IBD transmission of the medieval TPM003 shared segments from the modern grandparent to his grandchild.
Because imputation of genotypes at loci that have been targets of selection can be problematic68 and we had not filtered out such variants from our analyses, we assessed the accuracy of imputation at 104 functionally informative positions widely used for phenotype inference, including those affected by recent selection (Table S12). We observed high (>0.98) match rate between imputed and high-coverage genome for genotype calls at 90 variants that were sufficiently well covered in the high-coverage TPM003 genome (Table S13).
Interestingly, we found TPM003 to be heterozygous at rs1426654 (A/G) in the SLC24A5 gene, which has been identified among the 22 strongest signals of selection in the human genome.69 rs1426654 is a variant that explains a major part of skin pigmentation differences between Africans and Europeans and differences among South Asians.70, 71, 72 The derived A allele at this variant, associated with lighter pigmentation, has been shown to have been introduced to Europe by Neolithic farmers followed by its virtual fixation in most European populations today.73 The highest frequency of the ancestral G allele in the 1000 GP Europeans appears to be in Finns (2%).
Because the accuracy of genotype imputation is reduced at variants with low MAF and is potentially problematic in regions of the genome targeted by natural selection,68 we further assessed the accuracy of our LSAI inference from imputed data by comparing TPM003's LSAI sharing, with a >2 cM threshold, around the rs1426654 variant in the local Estonian reference panel that was used in imputation against LSAI matches for TPM003 in this locus in the Haplotype Reference Consortium (HRC) and the 1000 GP panels not used in our imputation. We found that both directly called and imputed copies of TPM003 share IBD segments, both for the A and G allele, with Estonian and Finnish samples from different haplotype panels (Figure S7, Table S14). Among segments longer than 5 cM, there are both G- and A-allele-carrying haplotypes shared between TPM003 and four Estonians (three with the A allele and one with the G allele) and one Finn (with the G allele) (Figure S7A). On average, G-allele-carrying haplotypes are significantly longer (3.44 cM versus 2.42 cM; p = 6.63 × 10⁻⁹, one-tailed t test), suggesting more recent common ancestry in Finns and Estonians of the ancestral than the derived allele, which has been highlighted as one of the strongest targets of positive selection in human populations.69 The core 200 kb G haplotype observed in TPM003 is distinct from Asian and African G haplotypes and observed in the given sample only among Estonians and Finns (with the exception of a single Japanese [JPT] individual from 1000 GP) (Figure S7B).
Discussion
We have shown that shotgun sequencing of ancient DNA at low (0.1–1×) coverage enables sufficiently accurate genotype data imputation for ancestry and IBD/LSAI-based community structure analyses. Our estimated imputation accuracy of 0.99 (Table S1) for heterozygote calls of common variants from a medieval Estonian genome is higher than the 0.93 estimate we previously obtained by applying the same approach10 on a high-coverage Neolithic genome8 and higher than comparable estimates of accuracy of <0.90 for other approaches.6,74 The increased accuracy can potentially be explained by (1) the temporal (and genetic ancestry) proximity of our medieval genome compared to the Neolithic sample considered in Hui et al.10 and (2) the use of a large, ethnically/regionally matched reference panel. Although our analyses at phenotype-informative variants, including those highlighted previously as selection targets, did not reveal a notable drop in imputation accuracy, we caution against generalizations to cases without ethnically/regionally matched large reference panels and longer time gaps between the imputed sample and the imputation panel and note that these results are based on a limited number of observations of a diverse set of 90 different variants taken between a single high-coverage genome and its down-sampled replicas.
Downstream PCA and LSAI analyses showed sufficiently high precision for fine-scale mapping of the genetic ancestry of the imputed samples. Our analyses showed relatively lower accuracy of IBD1 and IBD2 recovery from imputed low-coverage (0.1×) genomes, suggesting that detection of sample identity, as well as of twins and 1st-degree relatives, from imputed low-coverage genomes via an IBD/LSAI approach can be challenging; however, at 0.3× coverage, we were able to correctly recover 92.6% of IBD2 and 96.5% of total IBD segments of the down-sampled ancient genome (Table S6). Furthermore, other methods such as READ75 or GRUPS76 offer accurate estimation of close relatives from low (>0.05×) coverage data.
Kinship coefficients are a measure of the proportion of genome-wide IBD in a pair of individuals. The abundance of long IBD segments can be a robust indicator of close relatedness in a large unstructured population and is therefore widely used by direct-to-consumer genetic testing for inferring matches in genealogical relationships (up to 5th cousins). However, our simulations (Figures S5 and S6) show that IBD sharing patterns are strongly influenced by effective population size history. The kinship coefficients estimated here between modern and medieval samples are clearly not interpretable in terms of meaningful genealogical relationships given that the pairs are separated by more than 15 generations in time. Hence, the signals of elevated LSAI sharing (comparable in their intensity to the levels of 4th- to 5th-cousin relationships in large populations) between present-day individuals and those sampled 10–20 generations ago can be best explained, in line with our simulation results, by relatively low historic Ne and recent exponential growth in Estonia. Additionally, a large fraction of the segments we detect may not correspond to shared haplotypes because of their unphased nature and their length,39 and as such, represent a series of very short segments that coalesce, on average, longer ago than a true IBD segment of the same length would. Consistent with this, among the triangular cases, where a medieval genome shares LSAI with two modern individuals who are themselves closely related (grandparent-parent-offspring sequence), we observe no excess of LSAI sharing in the grandparent (Figure S8), which would be expected under genealogical relationships. Thus, it is more likely that most cases of diachronic LSAI sharing that we describe are explainable by cumulative long-term maintenance of community-specific chunks of IBD through marriages involving distant (cryptic) relatedness within the same parish- or county-level community.
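To make the comparison with cousin-level relationships concrete, the sketch below uses the standard relation between kinship and IBD proportions, φ = k1/4 + k2/2, where k1 and k2 are the fractions of the genome shared IBD1 and IBD2, together with the pedigree expectation of (1/2)^(2d+2) for full d-th cousins. The genome length of 3,500 cM and the function names are assumptions made for illustration, and the comparison with k > 0.0005 assumes that k in the regional analysis denotes this kinship coefficient.

```python
def kinship_from_ibd(ibd1_cm, ibd2_cm, genome_cm):
    """Kinship coefficient from total IBD1 and IBD2 lengths (in cM)."""
    if genome_cm <= 0:
        raise ValueError("genome_cm must be positive")
    k1 = ibd1_cm / genome_cm  # fraction of genome sharing one haplotype
    k2 = ibd2_cm / genome_cm  # fraction of genome sharing both haplotypes
    return k1 / 4.0 + k2 / 2.0


def expected_cousin_kinship(degree):
    """Expected kinship of full d-th cousins in an outbred pedigree."""
    if degree < 1:
        raise ValueError("degree must be at least 1")
    return 0.5 ** (2 * degree + 2)


GENOME_CM = 3500.0  # assumed autosomal map length, for illustration only

# A single 7 cM IBD1 segment and no IBD2 sharing.
single_segment = kinship_from_ibd(7.0, 0.0, GENOME_CM)
fourth_cousins = expected_cousin_kinship(4)
fifth_cousins = expected_cousin_kinship(5)
print(single_segment, fourth_cousins, fifth_cousins)
```

Under these assumptions a single 7 cM segment already yields a kinship value between the 5th- and 4th-cousin expectations, which illustrates why such values cannot be read as genealogical relationships when the samples are separated by many generations.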
We suggest that diachronic LSAI sharing patterns can be informative for resolving complex demographic scenarios involving recent population splits and subsequent gene flow. Most shared IBD blocks longer than 4 cM are expected to be less than 1,500 years old,2 and virtually all IBD blocks of this length are expected to derive from the last 3,000 years.66 The large-scale >5 cM LSAI sharing between Estonians and Finns would thus be expected to reflect primarily historical gene flow across the Gulf of Finland (Figure 1F) while not necessarily standing in conflict with contributions from earlier migration events predicted on linguistic grounds (Figure 1E) or synthesis of archaeological and genetic evidence (Figure 1D).19
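The link between segment length and age can be sketched with a common approximation, which is not the spatial model of the cited studies: a segment inherited by two individuals from a common ancestor g generations ago has passed through 2g meioses, so its length is roughly exponentially distributed with mean 100/(2g) cM. The function name and the generation values below are illustrative, and the calculation ignores population size, chromosome ends, and detection error.

```python
import math


def prob_segment_longer_than(length_cm, generations):
    """P(an IBD segment from an ancestor `generations` ago exceeds
    `length_cm`), assuming exponential lengths with rate 2g/100 per cM."""
    if length_cm < 0 or generations <= 0:
        raise ValueError("length_cm must be >= 0 and generations > 0")
    rate_per_cm = 2.0 * generations / 100.0
    return math.exp(-rate_per_cm * length_cm)


# Chance that a single inherited segment is still longer than 5 cM.
for g in (10, 30, 100):
    mean_cm = 100.0 / (2 * g)
    p = prob_segment_longer_than(5.0, g)
    print(f"g = {g:>3}: mean length = {mean_cm:.2f} cM, P(>5 cM) = {p:.2e}")
```

Under this approximation the chance that any one segment survives above 5 cM drops steeply with the age of the common ancestor, which is consistent with long shared segments being dominated by recent ancestry.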
The high levels of LSAI sharing with Finns that we observe in present-day Northeast Estonians could, at least partly, be explained38 (Figure 1F) by historically attested Finnish settlements in Northeast Estonia in the 17th–18th centuries. However, our ancient DNA evidence (Figures 3 and 4) from the 12th–16th centuries points to a deeper time depth for this relationship across Estonia. According to the current synthesis of genetic and archaeological evidence, the earliest migration event that could account for genetic ancestry sharing and unique connectedness among Finnic-speaking Finns and Estonians dates back to the Pre-Roman Iron Age (Figure 1D).19 However, the Nganasan-related autosomal component that appears in the circum-Baltic region in this time period as a signature of possibly the first arrival of Finno-Ugric speakers is likely to have reached Fennoscandia and Estonia by different routes and is relatively minor (3%–5% of total autosomal ancestry).19,32,20 Yet, our analyses of ancient genomes through the transect of time show that the levels of LSAI sharing with present-day Finns have been higher among Estonian genomes than those observed in present-day Latvians and Lithuanians (Table S3) since not only the Iron Age but also the Bronze Age (Figures 3 and 4), suggesting they have been generated in situ in Estonia for a long period of time rather than being introduced to Estonia from external sources recently.
The minor Nganasan-related component in the Pre-Roman Iron Age migrations (Figure 1D) could explain the specific G-allele-carrying haplotype distribution of SLC24A5 among Finns and Estonians (Figure S7), but the genome-wide sharing patterns, in which individuals from the 12th–14th centuries AD show the highest connectedness to present-day Finnish genomes (Figure 4), argue against the Pre-Roman Iron Age time depth for the main connectedness signal we observe. Furthermore, the results of our community extraction analyses (Figure 6) suggest that the patterns of region-specific connectedness within Estonia postdate our Iron Age samples and that all 99 Finnish samples we explored were assigned to a single community primarily composed of North Estonians. Notably, in simulations, this diachronic pattern of extensive sharing between past and present populations is consistent not only with the outcome of a population split model (Figure S6F) but also observable under a range of panmictic cases that consider realistic demographic scenarios of population history in Estonia (Figure S6E). In sum, these results suggest that informative LSAI signals can persist in structured populations at least for dozens of generations and that the high level of connectedness between North Estonian and Finnish genomes is older than our earliest medieval and younger than our latest Pre-Roman Iron Age samples.
The high level of Finnish LSAI sharing observed in individuals who lived in Estonia during the 12th–14th centuries AD represents the first direct evidence that a significant proportion of these relationships date back to the time before the expansion of the Finnish population, the Finnish founder event,29 i.e., to the time when the total population size of Finland is estimated to have been very small. Because we observe nearly identical and highly correlated (r > 0.999) levels of LSAI sharing between Estonian counties and Finnish samples from two independent datasets (Table S10), we consider our results to be robust and representative for the Finnish population in general. However, considering the existence of significant population substructure in Finland,36, 37, 38 further research would be required for determining regional and temporal details of the connectedness patterns revealed here within the context of temporal changes of ancestry and substructure of the Finnish population.
In sum, the results of our analyses on genetic data are consistent with the linguistic model (Figure 1E) that ascribes the language affinities and innovations shared between the Finnish language and North Estonian dialects to a migration event from North Estonia to Finland at the end of the first millennium. Because LSAIs are expected to decay in time because of recombination and admixture, the fact that present-day Finns still show genetic connectedness with medieval and modern North Estonians at levels comparable to internal connectedness in Estonia suggests that these uniquely shared long allele intervals are abundantly present across the genome as a feature that characterizes a major part of Finnish genetic ancestry. However, more precise quantification of the impact of the migration event and its timing would require ancient DNA evidence from Finland before and after the event as well as modeling of Finnish effective population size history in the context of its local regional diversity and admixture sources.
Acknowledgments
We would like to thank the University of Tartu Development Fund for support to the Collegium for Transdisciplinary Studies in Archaeology, Genetics, and Linguistics. Analyses were carried out with the facilities of the High-Performance Computing Center of the University of Tartu. This work was funded by the Estonian Research Council grants PRG243 (M. Metspalu, Lauri Saag, C.L.S., Lehti Saag), PRG1027 (K.T.), PRG1071 (S.R.), and PRG29 (M. Malve); KU Leuven startup grant STG/18/021 (T.K.); KU Leuven BOF-C24 grant ZKD6488 C24M/19/075 (T.K. and S.A.B.); and The Wellcome Trust award no. 2000368/Z/15/Z (T.K., R.H., C.L.S.). D.N.S. and A.L.W. were supported by NIH grant R35 GM133805. E.D’A. was supported by Sapienza University of Rome fellowship “borsa di studio per attività di perfezionamento all’estero 2017,” C.L.S. was supported by European Regional Development Fund 2014–2020.4.01.16–0030, and K.T. was supported by UT Institute of Genomics grant PP1GI19936. This research has been conducted with the UK Biobank resource under application numbers 54698 and 19947.
Declaration of interests
A.L.W. is a paid consultant for 23andMe and the owner of HAPI-DNA LLC. All other authors declare no competing interests.
Published: August 18, 2021
Footnotes
Supplemental information can be found online at https://doi.org/10.1016/j.ajhg.2021.07.012.
Data and code availability
The community extraction analysis scripts generated during this study are available at https://github.com/SABiagini/Louvain. The ancient genomic data generated during this study are available at https://www.ebi.ac.uk/ena/browser/view/PRJEB46155 (accession code ENA: PRJEB46155) and the data depository of the EBC (https://evolbio.ut.ee/). The Estonian Biobank (EstBB) data used in this study are available under restricted access. The procedure of applying for access to the data can be found under the following link: https://genomics.ut.ee/en/biobank.ee/data-access.
References
- 1. Lazaridis I. The evolutionary history of human populations in Europe. Curr. Opin. Genet. Dev. 2018;53:21–27. doi: 10.1016/j.gde.2018.06.007.
- 2. Ralph P., Coop G. The geography of recent genetic ancestry across Europe. PLoS Biol. 2013;11:e1001555. doi: 10.1371/journal.pbio.1001555.
- 3. Ferrando-Bernal M., Morcillo-Suarez C., de-Dios T., Gelabert P., Civit S., Diaz-Carvajal A., Ollich-Castanyer I., Allentoft M.E., Valverde S., Lalueza-Fox C. Mapping co-ancestry connections between the genome of a Medieval individual and modern Europeans. Sci. Rep. 2020;10:6843. doi: 10.1038/s41598-020-64007-2.
- 4. Browning B.L., Browning S.R. Genotype Imputation with Millions of Reference Samples. Am. J. Hum. Genet. 2016;98:116–126. doi: 10.1016/j.ajhg.2015.11.020.
- 5. Browning B.L., Zhou Y., Browning S.R. A One-Penny Imputed Genome from Next-Generation Reference Panels. Am. J. Hum. Genet. 2018;103:338–348. doi: 10.1016/j.ajhg.2018.07.015.
- 6. Rubinacci S., Ribeiro D.M., Hofmeister R., Delaneau O. Efficient phasing and imputation of low-coverage sequencing data using large reference panels. Nat. Genet. 2020;53:120–126. doi: 10.1038/s41588-020-00756-0.
- 7. Cassidy L.M., Maoldúin R.O., Kador T., Lynch A., Jones C., Woodman P.C., Murphy E., Ramsey G., Dowd M., Noonan A. A dynastic elite in monumental Neolithic society. Nature. 2020;582:384–388. doi: 10.1038/s41586-020-2378-6.
- 8. Gamba C., Jones E.R., Teasdale M.D., McLaughlin R.L., Gonzalez-Fortes G., Mattiangeli V., Domboróczki L., Kővári I., Pap I., Anders A. Genome flux and stasis in a five millennium transect of European prehistory. Nat. Commun. 2014;5:5257. doi: 10.1038/ncomms6257.
- 9. Martiniano R., Cassidy L.M., Ó’Maoldúin R., McLaughlin R., Silva N.M., Manco L., Fidalgo D., Pereira T., Coelho M.J., Serra M. The population genomics of archaeological transition in west Iberia: Investigation of ancient substructure using imputation and haplotype-based methods. PLoS Genet. 2017;13:e1006852. doi: 10.1371/journal.pgen.1006852.
- 10. Hui R.Y., D’Atanasio E., Cassidy L.M., Scheib C.L., Kivisild T. Evaluating genotype imputation pipeline for ultra-low coverage ancient genomes. Sci. Rep. 2020;10:18542. doi: 10.1038/s41598-020-75387-w.
- 11. Gusev A., Lowe J.K., Stoffel M., Daly M.J., Altshuler D., Breslow J.L., Friedman J.M., Pe’er I. Whole population, genome-wide mapping of hidden relatedness. Genome Res. 2009;19:318–326. doi: 10.1101/gr.081398.108.
- 12. Naseri A., Liu X., Tang K., Zhang S., Zhi D. RaPID: ultra-fast, powerful, and accurate detection of segments identical by descent (IBD) in biobank-scale cohorts. Genome Biol. 2019;20:143. doi: 10.1186/s13059-019-1754-8.
- 13. Shemirani R., Belbin G.M., Avery C.L., Kenny E.E., Gignoux C.R., Ambite J.L. Rapid detection of identity-by-descent tracts for mega-scale datasets. Nat. Commun. 2021;12:3546. doi: 10.1038/s41467-021-22910-w.
- 14.Zhou Y., Browning S.R., Browning B.L. A Fast and Simple Method for Detecting Identity-by-Descent Segments in Large-Scale Data. Am. J. Hum. Genet. 2020;106:426–437. doi: 10.1016/j.ajhg.2020.02.010. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Dimitromanolakis A., Paterson A.D., Sun L. Fast and Accurate Shared Segment Detection and Relatedness Estimation in Un-phased Genetic Data via TRUFFLE. Am. J. Hum. Genet. 2019;105:78–88. doi: 10.1016/j.ajhg.2019.05.007. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Seidman D.N., Shenoy S.A., Kim M., Babu R., Woods I.G., Dyer T.D., Lehman D.M., Curran J.E., Duggirala R., Blangero J., Williams A.L. Rapid, Phase-free Detection of Long Identity-by-Descent Segments Enables Effective Relationship Classification. Am. J. Hum. Genet. 2020;106:453–466. doi: 10.1016/j.ajhg.2020.02.012. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Blondel V.D., Guillaume J.L., Lambiotte R., Lefebvre E. J Stat Mech-Theory E.; 2008. Fast unfolding of communities in large networks. [Google Scholar]
- 18.Saada J.N., Kalantzis G., Shyr D., Robinson M., Gusev A., Palamara P. Identity-by-descent detection across 487,409 British samples reveals fine-scale population structure, evolutionary history, and trait associations. Eur. J. Hum. Genet. 2020;28:2–3. doi: 10.1038/s41467-020-19588-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Saag L., Laneman M., Varul L., Malve M., Valk H., Razzak M.A., Shirobokov I.G., Khartanovich V.I., Mikhaylova E.R., Kushniarevich A. The Arrival of Siberian Ancestry Connecting the Eastern Baltic to Uralic Speakers further East. Curr. Biol. 2019;29:1701–1711.e16. doi: 10.1016/j.cub.2019.04.026. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Tambets K., Yunusbayev B., Hudjashov G., Ilumäe A.M., Rootsi S., Honkola T., Vesakoski O., Atkinson Q., Skoglund P., Kushniarevich A. Genes reveal traces of common recent demographic history for most of the Uralic-speaking populations. Genome Biol. 2018;19:139. doi: 10.1186/s13059-018-1522-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Lang V. Helsinki: Suomalaisen kirjallisuuden seura; 2020. Homo Fennicus. [Google Scholar]
- 22.Honkola T., Vesakoski O., Korhonen K., Lehtinen J., Syrjänen K., Wahlberg N. Cultural and climatic changes shape the evolutionary history of the Uralic languages. J. Evol. Biol. 2013;26:1244–1253. doi: 10.1111/jeb.12107. [DOI] [PubMed] [Google Scholar]
- 23.Janhunen J. Proto-Uralic—what, where, and when? Suomalais-Ugrilaisen Seuran Toimituksia. 2009;258:57–78. [Google Scholar]
- 24.Kallio P. On the Earliest Slavic Loanwords in Finnic. Slavica Helsingiensia. 2006;27:154–166. [Google Scholar]
- 25.Lang V. Early Finnic-Baltic contacts as evidenced by archaeological and linguistic data. ESUKA-JEFUL. 2016;7:11–38. [Google Scholar]
- 26.Bjørnflaten J.I. In: Nuorluoto J., editor. Department of Slavonic and Baltic Languages and Literatures, University of Helsinki; Helsinki: 2006. Chronologies of the Slavicization of Northern Russia Mirrored by Slavic Loanwords in Finnic and Baltic; pp. 50–77. (The Slavicization of the Russian North Mechanisms and Chronology). [Google Scholar]
- 27.Maurits L., de Heer M., Honkola T., Dunn M., Vesakoski O. Best practices in justifying calibrations for dating language families. J. Lang. Evol. 2020;5:17–38. [Google Scholar]
- 28.Nevanlinna H.R. The Finnish population structure. A genetic and genealogical study. Hereditas. 1972;71:195–236. doi: 10.1111/j.1601-5223.1972.tb01021.x. [DOI] [PubMed] [Google Scholar]
- 29.Peltonen L., Jalanko A., Varilo T. Molecular genetics of the Finnish disease heritage. Hum. Mol. Genet. 1999;8:1913–1923. doi: 10.1093/hmg/8.10.1913. [DOI] [PubMed] [Google Scholar]
- 30.Norio R. Finnish Disease Heritage I: characteristics, causes, background. Hum. Genet. 2003;112:441–456. doi: 10.1007/s00439-002-0875-3. [DOI] [PubMed] [Google Scholar]
- 31.Saag L., Varul L., Scheib C.L., Stenderup J., Allentoft M.E., Saag L., Pagani L., Reidla M., Tambets K., Metspalu E. Extensive Farming in Estonia Started through a Sex-Biased Migration from the Steppe. Curr. Biol. 2017;27:2185–2193.e6. doi: 10.1016/j.cub.2017.06.022. [DOI] [PubMed] [Google Scholar]
- 32.Lamnidis T.C., Majander K., Jeong C., Salmela E., Wessman A., Moiseyev V., Khartanovich V., Balanovsky O., Ongyerth M., Weihmann A. Ancient Fennoscandian genomes reveal origin and spread of Siberian ancestry in Europe. Nat. Commun. 2018;9:5018. doi: 10.1038/s41467-018-07483-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 33.Skoglund P., Malmström H., Omrak A., Raghavan M., Valdiosera C., Günther T., Hall P., Tambets K., Parik J., Sjögren K.G. Genomic diversity and admixture differs for Stone-Age Scandinavian foragers and farmers. Science. 2014;344:747–750. doi: 10.1126/science.1253448. [DOI] [PubMed] [Google Scholar]
- 34.Skoglund P., Malmström H., Raghavan M., Storå J., Hall P., Willerslev E., Gilbert M.T., Götherström A., Jakobsson M. Origins and genetic legacy of Neolithic farmers and hunter-gatherers in Europe. Science. 2012;336:466–469. doi: 10.1126/science.1216304. [DOI] [PubMed] [Google Scholar]
- 35.Mittnik A., Wang C.C., Pfrengle S., Daubaras M., Zarina G., Hallgren F., Allmae R., Khartanovich V., Moiseyev V., Torv M. The genetic prehistory of the Baltic Sea region. Nat. Commun. 2018;9:442. doi: 10.1038/s41467-018-02825-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36.Martin A.R., Karczewski K.J., Kerminen S., Kurki M.I., Sarin A.P., Artomov M., Eriksson J.G., Esko T., Genovese G., Havulinna A.S. Haplotype Sharing Provides Insights into Fine-Scale Population History and Disease in Finland. Am. J. Hum. Genet. 2018;102:760–775. doi: 10.1016/j.ajhg.2018.03.003. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 37.Kerminen S., Havulinna A.S., Hellenthal G., Martin A.R., Sarin A.P., Perola M., Palotie A., Salomaa V., Daly M.J., Ripatti S. Fine-Scale Genetic Structure in Finland. G3. 2017;7:3459–3468. doi: 10.1534/g3.117.300217. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38.Pankratov V., Montinaro F., Kushniarevich A., Hudjashov G., Jay F., Saag L., Flores R., Marnetto D., Seppel M., Kals M. Differences in local population history at the finest level: the case of the Estonian population. Eur. J. Hum. Genet. 2020;28:1580–1591. doi: 10.1038/s41431-020-0699-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 39.Freyman W.A., McManus K.F., Shringarpure S.S., Jewett E.M., Bryc K., Auton A., 23andMe Research Team. Fast and Robust Identity-by-Descent Inference with the Templated Positional Burrows-Wheeler Transform. Mol. Biol. Evol. 2021;38:2131–2151. doi: 10.1093/molbev/msaa328.
- 40.Auton A., Brooks L.D., Durbin R.M., Garrison E.P., Kang H.M., Korbel J.O., Marchini J.L., McCarthy S., McVean G.A., Abecasis G.R., 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature. 2015;526:68–74. doi: 10.1038/nature15393.
- 41.Chang C.C., Chow C.C., Tellier L.C.A.M., Vattikuti S., Purcell S.M., Lee J.J. Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience. 2015;4:7. doi: 10.1186/s13742-015-0047-8.
- 42.Meyer M., Kircher M. Illumina sequencing library preparation for highly multiplexed target capture and sequencing. Cold Spring Harb. Protoc. 2010;2010:t5448. doi: 10.1101/pdb.prot5448.
- 43.Martin M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal. 2011;17:10–12.
- 44.Li H., Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009;25:1754–1760. doi: 10.1093/bioinformatics/btp324.
- 45.Li H., Handsaker B., Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R., 1000 Genome Project Data Processing Subgroup. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009;25:2078–2079. doi: 10.1093/bioinformatics/btp352.
- 46.McKenna A., Hanna M., Banks E., Sivachenko A., Cibulskis K., Kernytsky A., Garimella K., Altshuler D., Gabriel S., Daly M., DePristo M.A. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010;20:1297–1303. doi: 10.1101/gr.107524.110.
- 47.Jónsson H., Ginolhac A., Schubert M., Johnson P.L., Orlando L. mapDamage2.0: fast approximate Bayesian estimates of ancient DNA damage parameters. Bioinformatics. 2013;29:1682–1684. doi: 10.1093/bioinformatics/btt193.
- 48.Fu Q., Mittnik A., Johnson P.L.F., Bos K., Lari M., Bollongino R., Sun C., Giemsch L., Schmitz R., Burger J. A revised timescale for human evolution based on ancient mitochondrial genomes. Curr. Biol. 2013;23:553–559. doi: 10.1016/j.cub.2013.02.044.
- 49.Korneliussen T.S., Albrechtsen A., Nielsen R. ANGSD: Analysis of Next Generation Sequencing Data. BMC Bioinformatics. 2014;15:356. doi: 10.1186/s12859-014-0356-4.
- 50.Poplin R., Ruano-Rubio V., DePristo M.A., Fennell T.J., Carneiro M.O., Van der Auwera G.A., Kling D.E., Gauthier L.D., Levy-Moonshine A., Roazen D. Scaling accurate genetic variant discovery to tens of thousands of samples. bioRxiv. 2018. doi: 10.1101/201178.
- 51.Mitt M., Kals M., Pärn K., Gabriel S.B., Lander E.S., Palotie A., Ripatti S., Morris A.P., Metspalu A., Esko T. Improved imputation accuracy of rare and low-frequency variants using population-specific high-coverage WGS-based imputation reference panel. Eur. J. Hum. Genet. 2017;25:869–876. doi: 10.1038/ejhg.2017.51.
- 52.Abraham G., Qiu Y., Inouye M. FlashPCA2: principal component analysis of Biobank-scale genotype datasets. Bioinformatics. 2017;33:2776–2778. doi: 10.1093/bioinformatics/btx299.
- 53.Patterson N., Price A.L., Reich D. Population structure and eigenanalysis. PLoS Genet. 2006;2:e190. doi: 10.1371/journal.pgen.0020190.
- 54.Wickham H. ggplot2: Elegant Graphics for Data Analysis. Springer; 2009.
- 55.Karmin M., Saag L., Vicente M., Wilson Sayres M.A., Järve M., Talas U.G., Rootsi S., Ilumäe A.M., Mägi R., Mitt M. A recent bottleneck of Y chromosome diversity coincides with a global change in culture. Genome Res. 2015;25:459–466. doi: 10.1101/gr.186684.114.
- 56.Poznik G.D., Xue Y., Mendez F.L., Willems T.F., Massaia A., Wilson Sayres M.A., Ayub Q., McCarthy S.A., Narechania A., Kashin S., 1000 Genomes Project Consortium. Punctuated bursts in human male demography inferred from 1,244 worldwide Y-chromosome sequences. Nat. Genet. 2016;48:593–599. doi: 10.1038/ng.3559.
- 57.Quinlan A.R. BEDTools: The Swiss-Army Tool for Genome Feature Analysis. Curr. Protoc. Bioinformatics. 2014;47:1–34. doi: 10.1002/0471250953.bi1112s47.
- 58.Speed D., Balding D.J. Relatedness in the post-genomic era: is it still useful? Nat. Rev. Genet. 2015;16:33–44. doi: 10.1038/nrg3821.
- 59.Free Software Foundation. GNU Datamash. 2014. https://www.gnu.org/software/datamash/
- 60.Kelleher J., Etheridge A.M., McVean G. Efficient Coalescent Simulation and Genealogical Analysis for Large Sample Sizes. PLoS Comput. Biol. 2016;12:e1004842. doi: 10.1371/journal.pcbi.1004842.
- 61.Csardi G., Nepusz T. The igraph software package for complex network research. InterJournal. Complex Syst. 2006;1695:1–9.
- 62.Hothorn T., Hornik K. exactRankTests: Exact Distributions for Rank and Permutation Tests. R package version 0.8-31. The R Project; 2019.
- 63.Danecek P., Auton A., Abecasis G., Albers C.A., Banks E., DePristo M.A., Handsaker R.E., Lunter G., Marth G.T., Sherry S.T., 1000 Genomes Project Analysis Group. The variant call format and VCFtools. Bioinformatics. 2011;27:2156–2158. doi: 10.1093/bioinformatics/btr330.
- 64.Saag L., Vasilyev S.V., Varul L., Kosorukova N.V., Gerasimov D.V., Oshibkina S.V., Griffith S.J., Solnik A., Saag L., D’Atanasio E. Genetic ancestry changes in Stone to Bronze Age transition in the East European plain. Sci. Adv. 2021;7:1–17. doi: 10.1126/sciadv.abd6535.
- 65.Chaitanya L., Breslin K., Zuñiga S., Wirken L., Pośpiech E., Kukla-Bartoszek M., Sijen T., Knijff P., Liu F., Branicki W. The HIrisPlex-S system for eye, hair and skin colour prediction from DNA: Introduction and forensic developmental validation. Forensic Sci. Int. Genet. 2018;35:123–135. doi: 10.1016/j.fsigen.2018.04.004.
- 66.Ringbauer H., Coop G., Barton N.H. Inferring Recent Demography from Isolation by Distance of Long Shared Sequence Blocks. Genetics. 2017;205:1335–1351. doi: 10.1534/genetics.116.196220.
- 67.Browning S.R., Browning B.L. Accurate Non-parametric Estimation of Recent Effective Population Size from Segments of Identity by Descent. Am. J. Hum. Genet. 2015;97:404–418. doi: 10.1016/j.ajhg.2015.07.012.
- 68.Burger J., Link V., Blöcher J., Schulz A., Sell C., Pochon Z., Diekmann Y., Žegarac A., Hofmanová Z., Winkelbach L. Low Prevalence of Lactase Persistence in Bronze Age Europe Indicates Ongoing Strong Selection over the Last 3,000 Years. Curr. Biol. 2020;30:4307–4315.e13. doi: 10.1016/j.cub.2020.08.033.
- 69.Sabeti P.C., Varilly P., Fry B., Lohmueller J., Hostetter E., Cotsapas C., Xie X., Byrne E.H., McCarroll S.A., Gaudet R., International HapMap Consortium. Genome-wide detection and characterization of positive selection in human populations. Nature. 2007;449:913–918. doi: 10.1038/nature06250.
- 70.Lamason R.L., Mohideen M.A.P.K., Mest J.R., Wong A.C., Norton H.L., Aros M.C., Jurynec M.J., Mao X., Humphreville V.R., Humbert J.E. SLC24A5, a putative cation exchanger, affects pigmentation in zebrafish and humans. Science. 2005;310:1782–1786. doi: 10.1126/science.1116238.
- 71.Basu Mallick C., Iliescu F.M., Möls M., Hill S., Tamang R., Chaubey G., Goto R., Ho S.Y.W., Gallego Romero I., Crivellaro F. The light skin allele of SLC24A5 in South Asians and Europeans shares identity by descent. PLoS Genet. 2013;9:e1003912. doi: 10.1371/journal.pgen.1003912.
- 72.Norton H.L., Kittles R.A., Parra E., McKeigue P., Mao X., Cheng K., Canfield V.A., Bradley D.G., McEvoy B., Shriver M.D. Genetic evidence for the convergent evolution of light skin in Europeans and East Asians. Mol. Biol. Evol. 2007;24:710–722. doi: 10.1093/molbev/msl203.
- 73.Mathieson I., Lazaridis I., Rohland N., Mallick S., Patterson N., Roodenberg S.A., Harney E., Stewardson K., Fernandes D., Novak M. Genome-wide patterns of selection in 230 ancient Eurasians. Nature. 2015;528:499–503. doi: 10.1038/nature16152.
- 74.Davies R.W., Flint J., Myers S., Mott R. Rapid genotype imputation from sequence without reference panels. Nat. Genet. 2016;48:965–969. doi: 10.1038/ng.3594.
- 75.Monroy Kuhn J.M., Jakobsson M., Günther T. Estimating genetic kin relationships in prehistoric populations. PLoS ONE. 2018;13:e0195491. doi: 10.1371/journal.pone.0195491.
- 76.Martin M.D., Jay F., Castellano S., Slatkin M. Determination of genetic relatedness from low-coverage human genome sequences using pedigree simulations. Mol. Ecol. 2017;26:4145–4157. doi: 10.1111/mec.14188.
Data Availability Statement
The community extraction analysis scripts generated during this study are available at https://github.com/SABiagini/Louvain. The ancient genomic data generated during this study are available from the European Nucleotide Archive at https://www.ebi.ac.uk/ena/browser/view/PRJEB46155 (accession code ENA: PRJEB46155) and from the data repository of the EBC (https://evolbio.ut.ee/). The Estonian Biobank (EstBB) data used in this study are available under restricted access; the procedure for applying for access is described at https://genomics.ut.ee/en/biobank.ee/data-access.