Abstract
Due to postmortem DNA degradation and microbial colonization, most ancient genomes have low depth of coverage, hindering genotype calling. Genotype imputation can improve genotyping accuracy for low-coverage genomes. However, it is unknown how accurate ancient DNA imputation is and whether imputation introduces bias to downstream analyses. Here we re-sequence an ancient trio (mother, father, son) and downsample and impute a total of 43 ancient genomes, including 42 high-coverage (above 10x) genomes. We assess imputation accuracy across ancestries, time, depth of coverage, and sequencing technology. We find that ancient and modern DNA imputation accuracies are comparable. When downsampled at 1x, 36 of the 42 genomes are imputed with low error rates (below 5%) while African genomes have higher error rates. We validate imputation and phasing results using the ancient trio data and an orthogonal approach based on Mendel’s rules of inheritance. We further compare the downstream analysis results between imputed and high-coverage genomes, notably principal component analysis, genetic clustering, and runs of homozygosity, observing similar results starting from 0.5x coverage, except for the African genomes. These results suggest that, for most populations and depths of coverage as low as 0.5x, imputation is a reliable method that can improve ancient DNA studies.
Subject terms: Population genetics, Haplotypes
For most ancient genomes, low sequencing depth restricts genotyping, limiting their study. Here, the authors test imputation performance of ancient human genomes by estimating error rates and potential bias introduced in downstream analyses.
Introduction
Ancient DNA (aDNA) is characterized by pervasive postmortem damage, including fragmentation and deamination1. Moreover, ubiquitous microbial contamination results, in most cases, in low amounts of endogenous DNA. Contamination with DNA from related species is an even bigger challenge, as the endogenous and contaminant DNA cannot be easily deconvolved, and thus highly contaminated genomes are often discarded from analyses2. As a result, most ancient genomes have low breadth and depth of coverage, hindering confident genotype calling. Instead, pseudo-haploid data are commonly generated by sampling one allele per variant site3,4. An increasing number of methods and tools are being developed to study population structure, including diploid genetic properties such as runs of homozygosity (ROH)5, using pseudo-haploid data. However, methods designed for diploid and haplotypic data cannot be easily applied to pseudo-haploid data, and these data come with increased bias toward the reference genome6.
One alternative to downsampling the data to pseudo-haploid is to impute low-coverage ancient genomes. The goal of imputation is to infer genotypes at missing sites, usually by using reference panels of haplotypes. Most imputation tools employ a hidden Markov model (HMM) that determines which mosaic of reference haplotype chunks best represents the target. The Li and Stephens model of linkage disequilibrium (LD)7 and haplotype sharing is at the core of this HMM. This model describes LD in terms of the underlying recombination rates. In particular, it estimates the probability of observing a chromosome (or haplotype) given the previously sampled haplotypes from a population by considering the new haplotype as a copy of different parts of the sampled haplotypes while allowing mutations to arise. The transition rate between copied haplotypes is proportional to the recombination rate and decreases with the number of available haplotypes to copy from.
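The copying model described above can be illustrated with a toy forward algorithm. This is a minimal sketch for a haploid target under assumed parameters, not the implementation used by GLIMPSE or Beagle: the function names, the fixed `error_prob`, and the `scale` constant (standing in for four times an effective population size) are all hypothetical choices made for illustration.

```python
import math

def switch_probability(genetic_distance, n_haplotypes, scale=4.0e4):
    """Probability of re-drawing the copied haplotype between two sites.

    Assumed parameterization: it grows with the recombination distance and
    shrinks as more reference haplotypes are available.
    """
    return 1.0 - math.exp(-scale * genetic_distance / n_haplotypes)

def copying_likelihood(target, panel, switch_probs, error_prob=0.01):
    """Forward-algorithm likelihood of a haploid target copied from a panel.

    target: list of 0/1 alleles; panel: list of reference haplotypes of the
    same length; switch_probs: one switch probability per interval between
    consecutive sites (len(target) - 1 values).
    """
    k = len(panel)

    def emit(h, m):
        # A mismatch between the copied allele and the target counts as a mutation/error.
        return 1.0 - error_prob if panel[h][m] == target[m] else error_prob

    forward = [emit(h, 0) / k for h in range(k)]
    for m in range(1, len(target)):
        rho = switch_probs[m - 1]
        total = sum(forward)
        forward = [
            emit(h, m) * ((1.0 - rho) * forward[h] + rho * total / k)
            for h in range(k)
        ]
    return sum(forward)

# Toy panel: a target that follows one haplotype is more likely than one
# that requires a switch at every site.
panel = [[0, 0, 1, 1], [1, 1, 0, 0]]
rho = [0.1, 0.1, 0.1]
lik_no_switch = copying_likelihood([0, 0, 1, 1], panel, rho)
lik_many_switches = copying_likelihood([0, 1, 0, 1], panel, rho)
```

Real imputation tools extend this idea to diploid targets and genotype likelihoods, which the sketch does not attempt.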
SNP-array imputation is applied when genomes are genotyped at a subset of variant sites8. SNP-array imputation of modern DNA is often implemented to increase sample sizes for genome-wide association studies while reducing sequencing costs9. It is also possible to impute low-coverage genomes whose genotypes cannot be determined with certainty, in which case genotype uncertainty is captured by likelihoods10–15. This second type of method is suitable for imputing low-coverage ancient genomes. Present-day genotypes have been imputed with increasing accuracy owing to improved imputation methods on the one hand, and to larger and more diverse reference panels on the other, such as the Haplotype Reference Consortium (HRC)16, the 1000 Genomes Project17 and TOPMed18. These advances have also been exploited to impute low-coverage ancient genomes, using present-day haplotypes, assuming matching ancestry (e.g., Martiniano et al.19, Haber et al.20, Saupe et al.21, Clemente et al.22, Cox et al.23, Allentoft et al.24).
However, aDNA introduces extra challenges, including damage and potential contamination25, and it is not clear whether ancient individuals’ ancestries are well captured by reference panels of present-day individuals. Moreover, a precise quantification of possible imputation biases and errors is lacking. Hui et al.26 proposed a two-step imputation pipeline to be applied to ancient genomes. This pipeline first imputes based on genotype likelihoods using Beagle4.111, and then removes sites based on their maximum genotype probability (GP), a measure of how likely each possible genotype at a site is after imputation. The resulting genotype calls are again imputed with Beagle527, followed by a final GP filtering step. When compared to the first imputation step alone, this pipeline yielded larger proportions of heterozygous sites that pass the specified GP threshold. Nonetheless, only a single downsampled ancient European genome was used to validate these results. Cox et al.23 further compared the pipeline proposed by Hui et al. with imputation using Beagle4.1 alone and with GLIMPSE, using the same ancient genome for downsampling experiments. Precision was highest with GLIMPSE, but the Hui et al. pipeline yielded the highest recall. Another recent study28 assessed the performance of ancient genome imputation by downsampling (0.1–2.0×) and imputing five high-coverage ancient European genomes using Beagle4.029 and various reference panel and sample size configurations. The authors measured genotype concordance and bias towards the reference panel, and compared projections of the high-coverage, low-coverage and imputed 1x data onto principal component analysis (PCA) of present-day data. Imputation accuracy improved when i) all populations in the 1000 Genomes reference panel were used instead of European genotypes alone and ii) the ancient genomes were imputed simultaneously.
They found no bias increase towards the most common reference panel allele for ancient genome coverages as low as 0.75x.
These studies23,26,28 suggest that aDNA imputation performs well under specific conditions. However, in their assessment of imputation accuracy they used a limited sample of ancient genomes (one26 or five28) and of only European descent. Furthermore, more accurate and efficient low-coverage imputation methods are available, e.g., GLIMPSE13, than the methods they tested, i.e., Beagle4.0 and 4.1.
Here, we make use of 43 ancient genomes24,30–50, including an ancient trio and 42 high-coverage (>10x) genomes, from four different continents and different time spans to assess i) imputation accuracy of low-coverage ancient genomes and ii) how imputation affects downstream analyses. Our overall goal is to give users a sense of the performance of imputation and to measure whether large biases are introduced in standard downstream population genetic analyses; we do not compare the pseudo-haploid and the imputation strategies in this study as we believe they can be used in a complementary fashion. To this end, we downsampled this diverse dataset of ancient genomes to low coverage, which allowed us to quantify imputation performance across different ancestries, unlike, to our knowledge, any other previous study. We imputed the downsampled ancient genomes with GLIMPSE13, a state-of-the-art imputation and phasing tool that was shown to accurately impute low-coverage present-day genomes when relying on 1000 Genomes17 as a reference panel. In the next sections, we show that imputation yields accurate genotypes at common variants (minor allele frequency (MAF) > 5%) starting at 0.5x, and that transition and transversion sites are imputed with similar accuracy. We obtain low imputation error rates for 1x non-African ancient genomes and we observe a decrease in imputation accuracy at rare variants for the ancient genomes dated to more than 30,000 years before present. We further assess imputation and phasing performance in the case of the ancient trio. We test different post-imputation filtering stringency levels and we find that more stringent filtering results in a higher number of lost alternative-allele variant sites. We show that imputation of in-solution capture (1240K) sequenced genomes produces more accurate genotypes at the capture sites, with a small reduction in accuracy at the non-targeted common variants.
To address our second goal, we study the effects of imputation not only on PCA, but also on genetic clustering and ROH analyses. Comparing to the high-coverage genomes, we obtain similar results for these downstream applications when depth of coverage is at least 0.5x.
Results
The approach we followed in this study is schematically described in Fig. 1a. We generated two datasets: imputed genotypes from downsampled genomes and corresponding validation genotypes called from the high-coverage ancient genomes, that is, the ground truth. We started by sampling fractions of the sequencing reads from the 43 ancient genomes to obtain genomes with average depths of coverage between 0.1x and 2.0x. Then, using bcftools51 (see Supplementary Note 1 and Supplementary Fig. 1 on the choice of genotype caller prior to imputation), we generated genotype likelihoods at biallelic sites of the 1000 Genomes phase 3 v5 data17 phased with TOPMed18, the imputation reference panel, including all transition sites, in contrast to other studies28. We then imputed the data with GLIMPSE, following the steps described in the methods section. Next, we called genotypes for the high-coverage genomes and filtered out low-quality calls (methods, Supplementary Note 2 and Supplementary Fig. 2), thus reducing the impact of deamination. Finally, we assessed imputation performance and compared the results of downstream analyses.
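The read-sampling step amounts to simple arithmetic: to reach a target mean depth, the fraction of reads to keep is the ratio of the target depth to the original depth. The following is a minimal sketch under that assumption; the function name and the grid of target depths are illustrative (the grid collects depths mentioned in the text and may not be the complete set used).

```python
def downsample_fraction(original_depth, target_depth):
    """Fraction of reads to keep so that the expected mean depth equals target_depth."""
    if original_depth <= 0 or target_depth <= 0:
        raise ValueError("depths must be positive")
    if target_depth > original_depth:
        raise ValueError("target depth exceeds the original depth")
    return target_depth / original_depth

# Illustrative grid of target depths (an assumption, see the lead-in).
TARGET_DEPTHS = [0.1, 0.25, 0.5, 0.75, 1.0, 2.0]

# Example with a 27.5x genome, the depth reported for the trio mother (RISE1159).
fractions = {depth: downsample_fraction(27.5, depth) for depth in TARGET_DEPTHS}
```

Because reads are sampled at random, the realized depth will fluctuate slightly around the target.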
Three out of the 43 ancient genomes in this study constitute a trio (mother, father and son) that was re-sequenced in this study24,47, in contrast to the remaining 40 genomes. These 43 ancient genomes were published in different studies and relate to different epochs and continents. In total, the data include 22 individuals from Europe, five from Africa, eight from Asia and eight from the Americas (Fig. 1b). For five of the individual samples, we had access to both high-coverage shotgun and capture data. Information concerning location and age of remains, and genome coverage is included in Supplementary Table 1 and Supplementary Table 4. To increase readability and to be able to summarize the results in a more straightforward way, we split the individual samples into categories that reflect their geographical origin and/or the period they lived in: Africa, Americas, Prehistoric Europe, Historic Europe, Western Asia, South Asia and Siberia (Supplementary Table 2). While we refer to these categories throughout the text, we recognize that these labels can be vague and are not fully descriptive, as discussed in Coop52.
Accuracy of low-coverage ancient DNA imputation
We started by examining how imputation quality changes with average depth of coverage, and whether transversions are more accurately imputed than transitions, since the latter are affected by postmortem DNA deamination, i.e., C-to-T substitutions, which might wrongly increase the number of called heterozygous sites. We further compared imputation performance using two different state-of-the-art imputation methods, GLIMPSE and Beagle4.111, where the latter is a widely used imputation method and was also considered in ref. 26. For that, we calculated imputation accuracy, r2, that is, the squared Pearson correlation between genotype dosage in the aggregate of the 42 high-coverage and imputed datasets, as a function of minor allele frequency (MAF) as determined from the 1000 Genomes reference panel.
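The accuracy measure defined above can be sketched as follows: validation and imputed dosages are pooled across genomes, grouped by the reference-panel MAF of each site, and the squared Pearson correlation is computed per bin. This is a minimal illustration, not the evaluation code used in the study; the record layout, bin edges and function names are assumptions.

```python
import math

def squared_pearson(x, y):
    """Squared Pearson correlation; NaN when it is undefined."""
    n = len(x)
    if n < 2:
        return math.nan
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        return math.nan
    return (sxy * sxy) / (sxx * syy)

def r2_by_maf_bin(records, bin_edges):
    """records: (maf, validation_dosage, imputed_dosage) tuples pooled over genomes.

    Bins are half-open intervals (lower, upper]; sites outside all bins are ignored.
    """
    bins = list(zip(bin_edges[:-1], bin_edges[1:]))
    pooled = {}
    for maf, truth, imputed in records:
        for lower, upper in bins:
            if lower < maf <= upper:
                xs, ys = pooled.setdefault((lower, upper), ([], []))
                xs.append(truth)
                ys.append(imputed)
                break
    return {b: squared_pearson(xs, ys) for b, (xs, ys) in pooled.items()}

# Toy records: three rare and three common sites.
records = [
    (0.005, 0, 0.1), (0.008, 1, 0.9), (0.006, 2, 1.8),
    (0.20, 0, 0.0), (0.30, 1, 1.0), (0.40, 2, 2.0),
]
accuracy = r2_by_maf_bin(records, [0.001, 0.01, 0.05, 0.5])
```

Bins with no variation in either the validation or the imputed dosages return NaN, since the correlation is undefined there.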
We found that imputation accuracy of ancient genomes was similar to the accuracy reported for present-day genomes when using the same imputation method (Supplementary Note 4 and Supplementary Fig. 3). Accuracy was higher at common variants (MAF ≥ 5%) (Fig. 2a), as rare variants are more challenging to impute9,53. Imputation accuracy was also higher for genomes with higher coverage, as these have more data. In particular, for depths equal to or greater than 0.75x, we obtained r2 > 0.90 at sites with MAF > 2%, and r2 > 0.70 and r2 > 0.95 for rare (0.1% < MAF ≤ 1%) and common variants (MAF ≥ 10%), respectively. We then found that GLIMPSE outperformed Beagle4.1 for 1x ancient genomes, particularly at rare variants (Supplementary Fig. 4), similarly to the case of present-day genomes13. Finally, there were small differences in accuracy between imputed transversion and transition sites at rare variants (0.1% < MAF ≤ 1%, r2 = 0.75 and r2 = 0.77 for transitions and transversions, respectively), but these differences disappeared for more common variants (Supplementary Fig. 4).
Fixing depth of coverage at 1x, we evaluated how imputation performs across the 42 high-coverage genomes of different ancestries and times. In addition to imputation accuracy as a function of MAF, we quantified genotyping error rates for homozygous reference and alternative allele and heterozygous sites. We also report the non-reference discordance (NRD), that is, the ratio of the number of incorrectly imputed sites and the total number of imputed sites, excluding correctly imputed homozygous reference allele sites.
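The NRD definition above translates directly into a few lines of code. The sketch below encodes genotypes as alternative-allele counts (0, 1 or 2) and also reports error rates per genotype class; stratifying by the validation genotype is an assumption made here, and the function names are illustrative.

```python
import math

def non_reference_discordance(truth, imputed):
    """Errors divided by all sites except correctly imputed homozygous-reference sites."""
    errors = 0
    correct_hom_ref = 0
    total = 0
    for t, i in zip(truth, imputed):
        total += 1
        if t != i:
            errors += 1
        elif t == 0:
            correct_hom_ref += 1
    denominator = total - correct_hom_ref
    return errors / denominator if denominator else math.nan

def error_rate_by_class(truth, imputed):
    """Fraction of mismatches among sites whose validation genotype is 0, 1 or 2."""
    counts = {0: [0, 0], 1: [0, 0], 2: [0, 0]}  # genotype -> [sites, errors]
    for t, i in zip(truth, imputed):
        counts[t][0] += 1
        if t != i:
            counts[t][1] += 1
    return {g: (e / n if n else math.nan) for g, (n, e) in counts.items()}

# Toy example: eight sites, two of them imputed incorrectly.
truth = [0, 0, 0, 1, 1, 2, 2, 0]
imputed = [0, 0, 1, 1, 0, 2, 2, 0]
nrd = non_reference_discordance(truth, imputed)   # 2 errors / (8 - 3) sites
per_class = error_rate_by_class(truth, imputed)
```

Excluding correctly imputed homozygous-reference sites keeps the large number of easy sites from diluting the error rate.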
The imputation of European, Western Asian, and most Native American genomes yielded similar accuracy curves starting with lower values for rare variants (0.5 < r2 ≤ 0.9) and converging to r2 ≳ 0.90 from MAF ≥ 2% (Fig. 2a). The African ancient genomes were the least accurately imputed with only two out of five imputed genomes reaching r2 > 0.90, and error rates as high as 18% at heterozygous sites (Supplementary Fig. 5), the most challenging to impute, and NRD between 4% and 29% (Fig. 2c). In contrast, most non-African imputed genomes yielded NRD rates below 5%. This difference in imputation performance is likely due to underrepresentation of the different African populations in the reference panel. Indeed, the ancient African individuals in this study have much larger pairwise genetic distances to the reference panel than non-African individuals (Fig. 2b). Although the 1000 Genomes reference panel contains individuals of African origin, mostly from West Africa (Mende Sierra Leone (MSL), Gambian Mandinka (GWD), Esan Nigeria (ESN), Yoruba (YRI) and Luhya Kenya (LWK)), the genetic diversity in Africa54 is not well represented in this panel. And yet, Native American genomes were also accurately imputed, even though the populations in the reference panel show different admixture moieties, ranging from low (e.g., Puerto Rican (PUR)) to high Native American (e.g., Peruvian (PEL)) admixture proportions17. In fact, Fig. 2b shows that the South American reference individuals tend to be genetically close to the Native American genomes (small pairwise allelic differences). We further confirmed the contribution of reference haplotypes from the Americas to the imputation of ancient Native American genomes by removing one continental group at a time from the reference panel. We found that imputation performance was only affected when using a reference panel without the American populations (Supplementary Note 7 and Supplementary Fig. 6). 
Imputation accuracy dropped to 0.46 from 0.78 at variants in the lowest MAF bin (0.1–1.0%), while it was only slightly smaller (~0.97 vs. ~0.98) at common variants (MAF > 5%).
Sample age is also expected to affect imputation performance, as long time spans to the present could translate into large coalescent times between the reference populations and the ancient individuals55. While overall imputation performance seems to be unaffected by sample age (Fig. 2c), imputation accuracy at rare variants (MAF < 2%) is considerably lower for the three oldest individuals, i.e., Yana (~32,000 ybp)40, SIII (~34,000 ybp)48, and Ust’-Ishim (~45,000 ybp)39, as shown in Fig. 2a. We found significant (at the 5% threshold) negative correlations between sample age and imputation accuracy at the two lowest MAF bins, i.e., 0.1–1% and 1–2%: rs = −0.465 (p value = 0.003) and rs = −0.324 (p value = 0.033), respectively (Supplementary Note 8 and Supplementary Fig. 7).
The newly re-sequenced ancient trio (mother, father, son) allowed us to use an orthogonal approach based on Mendel’s rules of inheritance to measure imputation and phasing quality. This trio was sampled in a Late Neolithic mass burial at Koszyce24,47 and was re-sequenced in our study to a depth of coverage of 27.5x (mother, RISE1159), 18.9x (father, RISE1168), 5.4x (son, RISE1160). In this analysis, imputation errors corresponded to sites where parental and offspring genotypes disagreed with Mendel transmission rules. Here, we excluded sites that are homozygous for the reference allele in the three genomes as these positions are easier to impute. We estimated phasing accuracy in terms of switch error rate. The switch error rate is assessed for every two consecutive heterozygous sites by verifying if the alleles for the two sites are located on the correct haplotypes following the expected configuration from the trio. Mendel error rates ranged from 1.3% at 4x to 12.2% at 0.1x (Fig. 3a). For 1x data, in particular, Mendel error rates were between 1.5% and 2.9% across the 22 autosomes. These error rates agree with previously estimated imputation errors (Fig. 2c and Supplementary Fig. 5). Switch error rates varied between 1.6% at 4.0x and 8.2% at 0.1x, with errors for 1x data in the range 1.6–3.0% (Fig. 3b). For present-day genomes and small sample sizes, switch error rates are typically between 1% and 5%56–58, and we achieved similar accuracy when imputing and phasing the genomes downsampled to a minimum coverage of 0.25x.
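The two trio-based measures can be sketched as follows. Genotypes are encoded as alternative-allele counts, and phase at each heterozygous site is encoded as 0 or 1 according to which haplotype carries the alternative allele. This is a simplified illustration under those assumptions, not the code used in the study; in particular, the trio-derived phase is taken as given.

```python
import math

# Alternative-allele counts that a parent with genotype 0, 1 or 2 can transmit.
TRANSMITTABLE = {0: {0}, 1: {0, 1}, 2: {1}}

def is_mendel_consistent(mother, father, child):
    """True if the child genotype can arise from one allele of each parent."""
    possible = {m + f for m in TRANSMITTABLE[mother] for f in TRANSMITTABLE[father]}
    return child in possible

def mendel_error_rate(trio_sites):
    """trio_sites: (mother, father, child) genotypes; all-reference sites are skipped."""
    considered = 0
    errors = 0
    for mother, father, child in trio_sites:
        if mother == father == child == 0:
            continue
        considered += 1
        if not is_mendel_consistent(mother, father, child):
            errors += 1
    return errors / considered if considered else math.nan

def switch_error_rate(true_phase, inferred_phase):
    """Fraction of consecutive heterozygous-site pairs whose relative phase is wrong."""
    pairs = len(true_phase) - 1
    if pairs < 1:
        return math.nan
    switches = 0
    for k in range(1, len(true_phase)):
        agrees_before = true_phase[k - 1] == inferred_phase[k - 1]
        agrees_here = true_phase[k] == inferred_phase[k]
        if agrees_before != agrees_here:
            switches += 1
    return switches / pairs

trio_sites = [(0, 0, 0), (0, 1, 1), (0, 0, 1), (2, 2, 1), (1, 1, 2), (2, 0, 1)]
mendel_rate = mendel_error_rate(trio_sites)                          # 2 of 5 sites
switch_rate = switch_error_rate([0, 0, 1, 1, 0], [0, 0, 0, 0, 1])    # 1 of 4 pairs
```

Only the relative phase of consecutive sites matters, so an inferred haplotype pair that is globally flipped with respect to the truth yields no switch errors.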
After imputation, we can filter the data based on the maximum genotype probability (GP) for a site. GP is a measure of how likely each genotype is to be true and takes values between 0 and 1 that sum to 1 across the possible genotypes. To determine which GP value we would use to filter the imputed data prior to downstream analyses, we applied GP filters starting at 0.70 and up to 0.99 to four different imputed ancient genomes downsampled to 0.1x and 1.0x (RISE116824,47, SIII48, Ust’-Ishim39 and Mota34). We then quantified imputation accuracy and genotype discordance. We observed a greater boost in accuracy as the GP filter becomes stricter for 0.1x imputed data than for 1x data (Fig. 4a). In the case of 1x data, accuracy slightly improved for sites with MAF > 5%. The exception was the individual sample Mota (Africa), where the gain in accuracy for a specific GP filter had similar magnitude across sites with different MAF values. This African genome yielded the second lowest imputation accuracy amongst the 42 ancient high-coverage genomes downsampled and imputed in this study. Genotype discordance followed the same trend (Fig. 4b). Genotyping error rates were higher for 0.1x than for 1x imputed genomes, for which error rates remained below 5%, except for Mota. Increasing GP filtering values decreased these error rates in all instances. Then, we looked at how GP filtering affects the number of correctly imputed heterozygous sites (Fig. 4c). The proportion of lost heterozygous sites was much higher in the case of 0.1x data, explained by the lower imputation accuracy for this coverage. For 0.1x data, filtering out sites with GP < 0.70 removed at least around 15% of correct heterozygous sites. When GP ≥ 0.99, only between 20% and 43% of correct heterozygous sites remained. In contrast, the imputed 1.0x genomes lost a small fraction of their heterozygous sites as stricter GP filters were applied.
This fraction was smallest amongst the genomes of European ancestry (<8%, RISE1168 and SIII) and largest for Mota (22%), a reflection of how accurately these genomes were imputed. In the end, a trade-off must be made between loss of heterozygous sites and imputation accuracy. Based on these results, we chose to remove sites with MAF < 5% and set to missing imputed sites with GP < 0.80, for most of the downstream analyses, thus keeping most heterozygous sites for 0.1x data while controlling for imputation accuracy.
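The chosen post-imputation filters can be sketched as below: sites with MAF below 5% are dropped, and at the remaining sites a genotype whose maximum GP is below 0.80 is set to missing. The record layout and function name are assumptions for illustration; in practice such filters would typically be applied to VCF/BCF files with dedicated tools.

```python
def apply_post_imputation_filters(sites, min_maf=0.05, min_gp=0.80):
    """Filter imputed sites by MAF and maximum genotype probability.

    sites: iterable of (site_id, maf, gp), where gp holds the probabilities
    of the genotypes 0, 1 and 2 (alternative-allele counts).
    Returns (site_id, genotype) pairs; genotype is None when set to missing.
    """
    kept = []
    for site_id, maf, gp in sites:
        if maf < min_maf:
            continue  # rare site: removed entirely
        best = max(range(3), key=lambda g: gp[g])
        genotype = best if gp[best] >= min_gp else None
        kept.append((site_id, genotype))
    return kept

sites = [
    ("site1", 0.01, (0.99, 0.01, 0.00)),  # removed: MAF < 5%
    ("site2", 0.20, (0.10, 0.85, 0.05)),  # kept as heterozygous
    ("site3", 0.30, (0.40, 0.50, 0.10)),  # kept but set to missing: max GP < 0.80
]
filtered = apply_post_imputation_filters(sites)
```

Raising `min_gp` lowers the error rate among retained genotypes but, as reported above, removes a growing share of correct heterozygous calls, especially at 0.1x.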
Ancient DNA studies often resort to hybridization-capture sequencing, which increases the depth of coverage at pre-specified captured sites59–62. Capture data using the widely used 1240K array63–65 were previously generated for five of the 42 high-coverage ancient genomes, with depths of coverage at the capture sites between 1.5x and 11.1x (Supplementary Table 4). We found that imputation accuracy was higher at the intersection of 1240K and 1000 Genomes sites at variants with MAF < 5%, but common variants were imputed with similar accuracy at the capture sites and outside of these (Fig. 5 and Supplementary Fig. 8). To study the effect of depth of coverage on imputation of capture genomes, we imputed downsampled genomes with depths of coverage between 0.1x and 2.0x and 0.1x and 5.0x at the capture sites for BOT2016 and I10871 and Stuttgart, respectively. Imputation accuracy reached 0.90 only at 2x at common variants (MAF > 5%) for BOT2016 and Stuttgart, whereas imputation performance was lower for I10871, an African individual (Fig. 5). For the Stuttgart genome, the gain in imputation accuracy was small when increasing depth of coverage from 4.0x to 5.0x (r2 ≥ 0.94 to r2 ≥ 0.95 for MAF > 5%). Moreover, for the same individual samples, the imputation performances of 1x capture and shotgun-sequenced data with depth of coverage between 0.1x and 0.5x were equivalent (Supplementary Fig. 9).
Imputation effect on downstream analyses
In order to detect and quantify potential bias introduced by imputation, we compared the results of downstream analyses, namely, principal component analysis (PCA) and genetic clustering analyses, performed with the high-coverage and imputed genomes, after filtering for MAF and GP (imputed data). These methods are broadly used in population genetics to investigate population structure and demography. PCA is a dimension reduction technique that helps visualize patterns of population structure. In the genetic clustering analyses, ancestries are estimated as the sum of K different clusters determined from the data in an unsupervised fashion. We further explore the potential of imputing low-coverage ancient genomes by estimating ROH, whose classical applications require diploid data. ROH segments are unbroken homozygous regions of the genome that contain information about past and recent breeding patterns66. ROH have been found in all populations, but their number and size vary, depending on demographic histories.
For the PCA, we calculated the first ten principal components of the 1000G reference panel and projected both the high-coverage and corresponding imputed ancient genomes onto those. We have included both transition and transversion sites in this analysis.
Both the imputed 1x and high-coverage ancient genomes were in the expected continental groups as defined by present-day individuals in the first two principal components (Fig. 6a). They also tended to colocalize, which was particularly the case for ancient individuals clustering with present-day Europeans, suggesting limited bias is introduced by imputation in the PCA results. To further verify whether imputation introduced bias in this analysis, we took the difference in coordinates between validation and corresponding imputed 1x genomes for each principal component. As shown in Fig. 6b, the normalized differences between the two datasets were small and did not deviate significantly from 0 (t-test p values > 0.01). Additionally, we found that only genomes with coverage as low as 0.1x and 0.25x show some significant deviation from 0 (Fig. 6c) for some principal components; however, the imputed data were still placed in the expected continental clusters in the PCA space (Supplementary Fig. 10). This is particularly clear for European ancient genomes. These results show that the differences between imputed and high-coverage coordinates tended to be centered on 0 for the first principal components, in particular for genomes with coverage above 0.25x, suggesting that imputation did not introduce a significant bias to the PCA.
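The bias check on PCA coordinates can be illustrated with a one-sample t statistic on the per-genome differences for a given principal component. This is a sketch only: the normalization applied to the differences is not specified in the text and is not reproduced here, the data layout and function names are assumptions, and converting the statistic to a p value (for example with a statistics library) is left out.

```python
import math
import statistics

def pc_differences(validation, imputed, pc):
    """Validation minus imputed coordinate on one principal component, per genome.

    validation, imputed: dicts mapping genome name -> tuple of PC coordinates.
    """
    return [validation[name][pc] - imputed[name][pc] for name in validation]

def one_sample_t(differences, mu0=0.0):
    """t statistic for the null hypothesis that the mean difference equals mu0."""
    n = len(differences)
    if n < 2:
        raise ValueError("at least two differences are required")
    sd = statistics.stdev(differences)
    if sd == 0:
        raise ValueError("differences have zero variance")
    return (statistics.mean(differences) - mu0) / (sd / math.sqrt(n))

# Toy coordinates on two principal components for three genomes.
validation = {"g1": (0.10, 0.02), "g2": (0.20, -0.01), "g3": (0.15, 0.03)}
imputed = {"g1": (0.09, 0.02), "g2": (0.22, -0.02), "g3": (0.15, 0.05)}
t_pc1 = one_sample_t(pc_differences(validation, imputed, pc=0))
```

A statistic close to 0 indicates that imputed genomes are not systematically shifted along that component relative to their high-coverage counterparts.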
For the genetic clustering analyses, we focused on the European genomes. Present-day Europeans can generally be modeled with three ancestral populations: western hunter-gatherers, early European farmers and Steppe pastoralists41. Ancient European individual samples tend to exhibit different distributions of these three ancestries across time and space. We asked whether imputation of European ancient genomes artificially increases the amount of inferred Steppe-like ancestry for these individuals, since most present-day European individuals have Steppe ancestry, including the European populations in the 1000 Genomes reference panel. For instance, we assessed whether the Steppe-like component increases in imputed western hunter-gatherer genomes like Loschbour41. To this end, we performed unsupervised admixture analyses with the software ADMIXTURE67, including transitions and transversions. We used as a reference panel the genetic data of 61 ancient individuals65,68–71 present in the 1240K dataset63, including nine western hunter-gatherers, 26 Anatolian farmers and 26 individuals with Steppe-like ancestry (see Supplementary Table 5). We estimated ancestry proportions for the imputed and validation data separately, varying the number of clusters (K) between two and five. For K = 2, 4, and 5, we observe qualitatively similar results for imputed and high-coverage data (see Supplementary Note 11). Here we show the results obtained with K = 3 (Fig. 7a), as these clusters seemingly capture the three aforementioned ancestries. The admixture proportions are qualitatively similar between the high-coverage ancient genomes and the corresponding imputed ones, and, in the particular case of Loschbour, the only western hunter-gatherer imputed in this study, we estimated 100% western hunter-gatherer-like ancestry with both imputed 1x and high-coverage data (Fig. 7b).
In order to compare the admixture results across imputed data with different depths of coverage, we took the difference between ancestry proportions estimated for the validation and imputed genomes for each ancestry component and each coverage (Fig. 7c). We observed larger differences with imputed 0.1x and 0.25x data. For the remaining depths of coverage, the small differences distributed around 0 show that imputation introduced limited bias towards a particular ancestry in this analysis.
We first quantified ROH using transversions only to minimize the impact of aDNA damage on the validation estimates. We examined how well the imputed and the validation ROH overlapped in chromosome 10 for each depth of coverage and for four different individuals, namely Ust’-Ishim39 (Siberia), Rathlin172 (Europe), A46038 (Americas), and Mota34 (Africa) (Fig. 8a). The imputed 0.1x data had an excess of ROH when compared to the high-coverage data. This likely results from i) reduced imputation accuracy and ii) removal of a large proportion of heterozygous sites when applying post-imputation filters (Fig. 4c). As the depth of coverage increased, the number of falsely identified ROH tended to decrease, while most validation ROH were also found amongst the imputation ROH. We then compared the total ROH lengths, stratified by segment size, measured in the imputed data with the validation data for the different depths of coverage and the same four individuals (Fig. 8b). Again, we found the largest discrepancies between validation and imputed 0.1x data, with an excess of ROH segments, particularly of the shortest kind (0.5–1.0 Mb). For coverages above 0.1x, the total ROH lengths in the imputed genomes were close to the validation ROH, particularly for A460 (5% difference) and Ust’-Ishim (0.7% difference). Lastly, restricting to imputed 1x data, we contrasted the total length of small ROH (<1.6 Mb) with the total length of longer ROH (⩾1.6 Mb) obtained with transversions only (Fig. 8c) and all sites (Fig. 8d). When using transversions only, the total ROH lengths estimated for high-coverage and corresponding imputed 1x genomes were similar, particularly for the European genomes. Furthermore, the ROH trends for the ancient individuals mostly agreed with documented ROH for their present-day counterparts, with Africans having the smallest total ROH lengths and Native Americans the longest66.
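The two ROH comparisons described above, overlap between imputed and validation segments and total length split at 1.6 Mb, can be sketched with plain interval arithmetic. The code assumes half-open (start, end) coordinates in base pairs on a single chromosome and non-overlapping segments within each list; the function names and toy segments are illustrative and do not come from the study.

```python
def total_length_by_class(segments, threshold=1_600_000):
    """Total length of short (< threshold) and long (>= threshold) segments."""
    short_total = sum(end - start for start, end in segments if end - start < threshold)
    long_total = sum(end - start for start, end in segments if end - start >= threshold)
    return short_total, long_total

def overlap_length(segments_a, segments_b):
    """Number of base pairs covered by both segment lists (two-pointer sweep)."""
    a = sorted(segments_a)
    b = sorted(segments_b)
    i = j = 0
    shared = 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            shared += end - start
        # Advance whichever segment ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return shared

validation_roh = [(1_000_000, 3_000_000), (10_000_000, 10_800_000)]
imputed_roh = [(1_200_000, 3_100_000), (5_000_000, 5_600_000), (10_000_000, 10_700_000)]

shared_bp = overlap_length(validation_roh, imputed_roh)
validation_short, validation_long = total_length_by_class(validation_roh)
imputed_short, imputed_long = total_length_by_class(imputed_roh)
```

In this toy example the imputed segment at 5.0–5.6 Mb has no validation counterpart, which is the kind of falsely identified short ROH reported for very low coverage.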
When we added transitions to estimate ROH, the distance between imputed and validation ROH increased for some genomes (Fig. 8d). In the case of the ancient Native American Sumidouro538, this distance dramatically increased. The high-coverage estimate for Sumidouro5 was now located between the African and European values, but the imputed estimate remained close to both the high-coverage and imputed values obtained with transversions only. For this genome, we found major differences between high-coverage ROH sizes obtained with transversions only and all sites, whereas the corresponding imputed ROH were highly consistent (Supplementary Fig. 18). This indicates that the discordance between validation and imputed ROH, when transitions were included, originated from the validation data. Indeed, Sumidouro5 is a very damaged genome (40% deamination rate at read termini)38, which likely led to an excess of heterozygous calls in the high-coverage data, despite the quality filtering (see methods).
Discussion
In aDNA studies, pseudo-haploid data generation is the standard procedure for handling low-coverage genomes. Ancestry information can be recovered from analyses of these data, such as PCA and genetic clustering analyses, which work well at coverages as low as 0.1x73. Compared with pseudo-haploid data, imputed genotypes make it possible to work with diploid genomes and to directly apply some population genetic tools developed for modern data.
Here we showed that low-coverage ancient genomes can be imputed with accuracy similar to that of modern genomes. In particular, for shotgun-sequenced data, we obtained accurate results at common variants for coverages starting at 0.5x for MAF > 5% (or at 0.75x for MAF > 2%). However, we observed that this threshold is dependent on the ancient genomes’ ancestry. The representation of a given population in the reference panel can have a profound impact on imputation accuracy, with genotyping errors at alternative allele sites above 5% and up to 25% among African 1x genomes. Despite the absence of 100% Native American reference populations, most Native American ancient genomes were accurately imputed. The presence of haplotypes with partial Native American ancestry in the reference panel allowed us to recover variants private to Native American populations. Moreover, we found that age can negatively impact imputation accuracy of rare variants, which was the case for three non-African individual samples older than 30,000 years. These results have far-reaching implications for the potential of imputing ancient genomes, since it is not guaranteed that there will be a present-day population that directly descends from the ancient individual’s population without having admixed. Our results suggest that, on the one hand, using admixed reference populations that share recent ancestry with the ancient genomes can be enough to attain accurate imputation, even at rare variants, and, on the other hand, we can still impute common variants well in the case of non-African genomes that are either very old, such as Ust’-Ishim, or that are poorly represented in the reference panel, such as Andaman, likely owing to their common history.
Furthermore, using five genomes that were obtained via both in-solution capture and shotgun sequencing (>10x for the latter), we found that imputation performance of capture-sequenced data was higher at the capture sites than outside of these, particularly for rare variants. Imputing a 1x (target sites) capture genome and a 0.25x shotgun-sequenced genome resulted in similar error rates. Moreover, imputation accuracy was below 0.90 for coverages below 2x at the target sites. We therefore recommend a minimum depth of coverage of 2x at capture sites, but ideally higher than that (imputation accuracy levelled off at around 4x for the Stuttgart genome), to attain accurate imputed calls, in the case of well represented ancestries.
For most genomes, we obtained similar results with high-coverage and imputed data with coverages as low as 0.5x for the downstream analyses we carried out, i.e., PCA, admixture clustering and ROH estimation. Imputation did not introduce major bias for the first principal components, nor did it considerably increase the proportion of any of the three main ancestry components found in Europeans. The similarity of validation and imputed ROH segments is worthy of note, since ROH estimation typically requires reliable knowledge of genotypes, which is only available for high-coverage genomes. This means that ROH estimation methods designed for diploid data can be applied to low-coverage ancient genomes after imputation.
Although we did not remove transition sites prior to imputation, we found that transversion and transition sites were imputed with comparable accuracy. In fact, when we compared ROH estimates obtained with transversions only and with all sites, we observed that imputation corrected ROH in the case of Sumidouro5, which has a 40% C-to-T mismatch frequency at the end of the reads. Given this observation, imputation of ancient genomes has the potential to correct genotypes that are affected by damage and other sources of error. Whether imputation can help reduce the effect of contamination remains to be assessed.
We did not explore numerous genotype and haplotype-based applications that can greatly benefit from imputation of low-coverage ancient genomes, such as temporal selection scans and local ancestry inference. Moreover, genotype imputation, in general, is expected to improve as more and larger reference datasets become available. The recent release of 200K whole-genome sequences in the UK Biobank74, which can be used as a reference panel for imputation, offers an opportunity to improve imputation performance in the case of low-coverage European genomes, including ancient genomes, especially at rare variants and lower depths of coverage75. In the case of ancient DNA, when the target genome is not well represented by modern reference populations or when a boost in imputation accuracy is required, additional reference panels can be assembled with high-quality ancient genomes of individuals with more closely shared ancestry. Furthermore, the number of sequenced ancient genomes has been growing exponentially and with no sign of slowing down. This means that more and more ancient genomes will be available with different ancestries and from different time periods and with that comes the opportunity to expand existing reference panels with ancient genomes and to implement imputation in a more standardized way.
Methods
In this section, we describe how the methods were implemented, starting with the Koszyce ancient trio data generation, followed by imputation, which includes all the file processing and imputation using GLIMPSE and Beagle4.1, then the three downstream applications (PCA, genetic clustering analyses and ROH), and finishing with the reference datasets used in this study. All post-imputation analyses and corresponding plots were produced using python v3.6.12 and R v4.0.3.
The Koszyce ancient trio data generation
The Koszyce ancient trio (mother, father and son) was originally sequenced in ref. 47 and re-sequenced to higher coverage in the context of this study. The DNA was extracted from petrous bone excavated from a Late Neolithic mass grave in Koszyce, in what is today Poland.
Using the same DNA extracts as ref. 47, two additional double-stranded libraries per sample were constructed based on ref. 76. This was followed by enzymatic USER treatment to remove DNA damaged sites in the form of uracils. The optimal PCR cycle number was determined by qPCR. Indexed and amplified libraries were purified, quantified on an Agilent Bioanalyzer 2100, and pooled equimolarly. The pooled libraries were then sequenced on two Novaseq lanes (150 paired-end reads).
Read mapping was performed as in ref. 24. The sequenced reads were aligned to both the human reference genome build 37 and the mitochondrial genome (rCRS). After alignment, reads were filtered based on a mapping quality threshold of 30 and sorted using Picard v.1.127 (http://broadinstitute.github.io/picard/) and samtools51. The resulting data were merged at the library level and duplicates were removed using Picard MarkDuplicates v.1.127. The merged data were then consolidated at the sample level. To improve the accuracy of the alignment, sample-level BAM files were realigned using GATK77 v.3.3.0. Subsequently, the md-tag was updated and extended base alignment qualities (BAQs) were calculated using samtools calmd v.1.10.
Estimating damage patterns
The frequency of C-to-T mismatches at the 5′ end of the aligned reads was estimated using bamdamage73.
Trimming the reads’ ends in bam files
To test the effect of trimming the ends of the aligned reads on imputation accuracy (Supplementary Note 2), we used BamUtil78 v1.0.14 to trim five base pairs from each end of the reads.
Imputation
File processing prior to imputation
We downsampled high-coverage (10x-59x range) ancient genomes to coverages 0.1x, 0.25x, 0.5x, 0.75x, 1.0x, and 2.0x, using samtools51 v1.10. The subsampling fraction was determined by first calculating the average coverage across the variant sites in the 1000 Genomes phase 3 reference panel17 phased with TOPMed18 (see Datasets section) so that the resulting downsampled genome had the intended coverage at those sites. Then, we computed genotype likelihoods for the downsampled and the original high-coverage genomes for the abovementioned variant sites.
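The subsampling fraction described above amounts to a simple ratio of target to observed depth. The following is a minimal sketch under that reading; the function name and the 20x starting depth are hypothetical and are not taken from the authors' scripts.

```python
def subsampling_fraction(mean_depth_at_sites, target_depth):
    """Fraction of reads to keep so that the mean depth at the panel sites
    drops from `mean_depth_at_sites` to `target_depth`."""
    if mean_depth_at_sites <= 0:
        raise ValueError("mean depth must be positive")
    if target_depth > mean_depth_at_sites:
        raise ValueError("cannot downsample to a depth above the original")
    return target_depth / mean_depth_at_sites


# Target depths used in the study; the 20x starting depth is a made-up example.
TARGETS = [0.1, 0.25, 0.5, 0.75, 1.0, 2.0]
fractions = {t: subsampling_fraction(20.0, t) for t in TARGETS}
```

A fraction of this kind is what `samtools view -s` expects, with the integer part of the argument acting as the random seed.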
To generate the genotype calls and genotype likelihoods, we used bcftools51 v1.10 and, as default, the command bcftools mpileup with parameters -I -E -a “FORMAT/DP” --ignore-RG, followed by bcftools call -Aim -C alleles. To call genotypes from the high-coverage genomes, we have applied additional parameters for quality control (more details below).
We also generated both genotype calls from the high-coverage genomes and genotype likelihoods for the downsampled data (1x) with ATLAS79 v0.9.9 (see Supplementary Note 1 and Supplementary Note 2) using the MLE caller and the empirical post-mortem damage pattern observed across reads, as described in https://bitbucket.org/wegmannlab/atlas/wiki. To save time, we skipped the first step, splitMerge, which separates single-end alignments by length, merges the mates of paired-end reads, and requires specification of the different libraries contained in a bam file. It is often the case that an ancient genome is obtained from a mixture of paired-end and single-end libraries. We observed that skipping this first step did not have much impact when the bam files only had single-end libraries, but the genotype calling was seemingly less accurate when there were paired-end libraries in the bam files. We therefore do not report here the results obtained from ATLAS calls for ancient genomes that were sequenced from paired-end libraries.
To obtain a trimmed validation dataset (Supplementary Note 2), we trimmed five base pairs at both ends of the reads using the command trimBam from the package bamutil78 v1.0.14. Then, we called genotypes using bcftools v1.10, as previously described.
The final validation dataset was obtained by implementing the following filtering approach:38 i) genotype calling with bcftools v1.10 with mapping and base quality filters of 30 and 20 (-q 30 -Q 20), respectively, and with the parameter -C 50, as recommended by the SAMtools developers for BWA mapped data to reduce mapping quality for reads with an excess of mismatches; ii) exclusion of the sites that are not in the 1000 Genomes accessible genome strict mask;80 iii) removal of sites located in regions known to contain repeats (RepeatMask regions in UCSC Table Browser81, http://genome.ucsc.edu/); iv) filtering out sites with extreme values of depth of coverage when comparing to the average genome coverage: below the maximum of one third of the mean depth of coverage ($\overline{\mathrm{DoC}}$) and eight, that is, $\max(\overline{\mathrm{DoC}}/3,\ 8)$, and depth above twice the average depth; v) filtering out of sites with the field QUAL below 30.
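The depth and QUAL criteria in steps iv and v can be written as a small predicate. This is a sketch only: the function names are hypothetical, and it assumes that sites exactly at a threshold are kept, which the text does not state explicitly.

```python
def depth_bounds(mean_depth):
    """Return (min_depth, max_depth) for a genome with the given mean depth:
    the lower bound is max(mean/3, 8), the upper bound is twice the mean."""
    return max(mean_depth / 3.0, 8.0), 2.0 * mean_depth


def keep_site(depth, qual, mean_depth, min_qual=30.0):
    """True if a site passes the depth filter (step iv) and QUAL filter (step v)."""
    lo, hi = depth_bounds(mean_depth)
    return lo <= depth <= hi and qual >= min_qual
```

For a 12x genome the lower bound is set by the fixed floor of eight reads, whereas for a 30x genome it is set by one third of the mean depth (10 reads).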
Imputation using GLIMPSE
We imputed the downsampled genomes using GLIMPSE13 v1.1.1. First, we used GLIMPSE_chunk to split chromosomes into chunks of sizes in the range 1–2 Mb and included a 200-kb buffer region at each side of a chunk. Second, imputation was performed with GLIMPSE_phase on the chunks with parameters --burn 10, --main 15 and --pbwt-depth 2, with 1000 Genomes as the reference panel. Finally, we ligated the imputed chunks with GLIMPSE_ligate.
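To illustrate the idea of chunks with flanking buffers, here is a simplified sketch that cuts a chromosome into fixed 2-Mb output windows and extends each by 200 kb on both sides. It is not the algorithm used by GLIMPSE_chunk, which chooses window sizes adaptively; the function name and the fixed window size are assumptions made for illustration.

```python
def make_chunks(chrom_length, window=2_000_000, buffer=200_000):
    """Split [1, chrom_length] into consecutive output windows, each with an
    input region extended by `buffer` bp on both sides (clipped to the chromosome)."""
    chunks = []
    start = 1
    while start <= chrom_length:
        end = min(start + window - 1, chrom_length)
        chunks.append({
            "output": (start, end),
            "input": (max(1, start - buffer), min(chrom_length, end + buffer)),
        })
        start = end + 1
    return chunks
```

Imputation is run on each input region, and only the output region is retained, so that sites near a chunk boundary still benefit from haplotype information on both sides before the chunks are ligated.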
Imputation using Beagle4.1
To evaluate how GLIMPSE performs compared to Beagle4.111 regarding imputation of low-coverage ancient genomes, we imputed the same data, but restricted to 1.0x, with Beagle4.1 with parameters --modelscale 2 and --niterations 0, which represent a trade-off between accuracy and running time.
Imputation accuracy evaluation
We used GLIMPSE_concordance to quantify imputation accuracy and genotype concordance, having the high-coverage data as validation. Only sites that were covered by at least eight reads and whose genotypes have a posterior probability of 0.9999 or more were used in validation. With GLIMPSE_concordance we obtained (i) imputation accuracy, that is, the squared correlation between dosage fields VCF/DS (DS varies between 0 and 2 and can be seen as a mean genotype value obtained from the genotype probabilities: $DS = \sum_{g=0}^{2} g \cdot GP_g$, where $GP_g$ is the genotype probability for genotype $g$) in imputed and validation datasets, divided in MAF bins, and (ii) genotype discordance, i.e., proportion of sites for which the most likely imputed genotype is different from the corresponding validation genotype for homozygous reference allele (RR), heterozygous (RA) and homozygous alternative allele sites (AA). We also estimated non-reference-discordance, NRD, defined as $NRD = (e_{RR} + e_{RA} + e_{AA}) / (e_{RR} + e_{RA} + e_{AA} + m_{RA} + m_{AA})$, where $e_X$ and $m_X$ stand for the number of errors and matches at sites of type X, respectively. NRD is an error rate which excludes the number of correctly imputed homozygous reference allele sites, which are the majority, thus giving more weight to imputation errors at alternative allele sites.
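The three quantities above (dosage, squared correlation of dosages, and NRD) can be illustrated in a few lines. This is a sketch of the definitions, not of GLIMPSE_concordance itself: it ignores MAF binning, assumes genotypes are coded 0 (RR), 1 (RA), 2 (AA), and the function names are hypothetical.

```python
def dosage(gp):
    """Expected alternative-allele count from (P(RR), P(RA), P(AA))."""
    return gp[1] + 2.0 * gp[2]


def squared_correlation(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


def non_reference_discordance(truth, imputed):
    """NRD: errors at any site type, divided by errors plus matches at RA and AA
    sites. Correctly imputed RR sites do not enter the ratio."""
    errors = sum(1 for t, i in zip(truth, imputed) if t != i)
    nonref_matches = sum(1 for t, i in zip(truth, imputed) if t == i and t != 0)
    return errors / (errors + nonref_matches)
```

Because matching RR sites are left out of the denominator, adding more correctly imputed homozygous reference sites leaves NRD unchanged, unlike an overall discordance rate.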
Testing significance of Spearman correlation between sample age and imputation accuracy
We calculated Spearman correlation using the function spearmanr from the python package scipy.stats. We performed a two-sided permutation test with 10,000 permutations to test whether the estimated correlation was significantly different from zero.
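A two-sided permutation test of this kind can be sketched with the standard library alone. The study used scipy.stats.spearmanr for the correlation itself; the rank and correlation helpers below are swapped in only to keep the example dependency-free. The add-one correction to the p-value and the fixed seed are assumptions of this sketch.

```python
import random


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman_permutation_test(x, y, n_perm=10_000, seed=0):
    """Spearman's rho and a two-sided permutation p-value."""
    rx, ry = average_ranks(x), average_ranks(y)
    rho = pearson(rx, ry)
    rng = random.Random(seed)
    shuffled = list(ry)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break the pairing between x and y
        if abs(pearson(rx, shuffled)) >= abs(rho) - 1e-12:
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return rho, (extreme + 1) / (n_perm + 1)
```

Shuffling one variable while holding the other fixed generates the null distribution of the correlation under no association, and the p-value is the share of permutations at least as extreme in absolute value as the observed one.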
Downstream analyses
File processing
We filtered the imputed data by requiring that, for each variant site, the genotype probability (VCF/GP) of the most confidently imputed genotype be at least 0.80. Then, we generated two datasets with different minor allele frequency (MAF) filters: MAF > 5% (6,550,734 SNPs) for the data used in PCA and ROH analyses, and MAF > 1% (11,553,877 SNPs) for admixture analysis, since with stricter MAF filters we would lose sites that distinguish the different populations. We used PLINK82 v1.90 to merge 1000 Genomes, high-coverage and imputed data into one file. In the case of PCA and admixture analyses, we intersected the resulting sites with the ones present in the Allen Ancient DNA Resource (AADR) data genotyped at the 1240K array sites63, which we refer to as the “1240K dataset” hereafter.
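The genotype probability filter can be expressed as follows. This is a sketch with hypothetical function names; marking filtered genotypes as missing (None) rather than dropping the site is an assumption of the example.

```python
def passes_gp_filter(gp, threshold=0.80):
    """True if the most likely genotype has probability >= threshold."""
    return max(gp) >= threshold


def mask_uncertain(genotypes, gps, threshold=0.80):
    """Replace genotypes whose best GP is below the threshold with None."""
    return [g if passes_gp_filter(gp, threshold) else None
            for g, gp in zip(genotypes, gps)]
```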
PCA
We performed PCA with smartpca (eigensoft83 package v7.2.1) without outlier removal (outliermode: 2). The 10 first principal components (numoutevec: 10) were calculated using the 1000 Genomes genetic data and both the imputed and high-coverage data were projected onto the resulting components (lsqproject: YES).
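The smartpca settings named above would typically be supplied through a parameter file. The helper below renders such a file as a sketch; the file names derived from `prefix` and the population list file are placeholders, not the authors' paths.

```python
def smartpca_parfile(prefix, poplist, n_pcs=10):
    """Return the text of a smartpca parameter file.

    `prefix` and `poplist` are placeholder file names for illustration.
    """
    lines = [
        "genotypename: {}.geno".format(prefix),
        "snpname: {}.snp".format(prefix),
        "indivname: {}.ind".format(prefix),
        "evecoutname: {}.evec".format(prefix),
        "evaloutname: {}.eval".format(prefix),
        # Populations listed in this file define the components;
        # all other samples are projected onto them.
        "poplistname: {}".format(poplist),
        "lsqproject: YES",
        "numoutevec: {}".format(n_pcs),
        "outliermode: 2",  # setting given in the text for no outlier removal
    ]
    return "\n".join(lines) + "\n"
```

Listing only the 1000 Genomes populations in the `poplistname` file is what makes the ancient genomes projected samples rather than contributors to the components.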
To test whether the coordinates of validation and corresponding 1x imputed genomes differed significantly for the first 10 principal components, we used the default R function t.test, running it in paired mode to test whether the mean of the differences was significantly different from 0 with a two-sided alternative hypothesis.
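A test on the mean of the per-genome differences reduces to a one-sample t statistic computed on those differences. The sketch below computes the statistic and its degrees of freedom only; the p-value requires the t distribution, which R's t.test provides. The function name is hypothetical.

```python
import math


def paired_t_statistic(a, b):
    """t statistic and degrees of freedom for H0: mean(a_i - b_i) = 0."""
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    mean_d = sum(d) / n
    var_d = sum((v - mean_d) ** 2 for v in d) / (n - 1)  # sample variance
    return mean_d / math.sqrt(var_d / n), n - 1
```

Here `a` would hold the coordinates of the validation genomes on one principal component and `b` the coordinates of the matching imputed genomes, in the same order.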
Admixture analysis
We estimated admixture proportions for 21 ancient Europeans with the software ADMIXTURE67 v1.3.0 in unsupervised mode. For the reference panel, we used a subset of the 1240K dataset containing nine western hunter gatherers, 26 Anatolian farmers and 26 individuals of Steppe ancestry63 (see Supplementary Table 5). Contrary to the imputed and high-coverage genomes, the reference data are pseudo-haploid. We merged the reference panel with each of the imputation datasets (different coverages) with plink v1.90. We removed sites that were missing in more than 30% of the individuals. We proceeded similarly for the high-coverage dataset. We ran ADMIXTURE on seven configurations: merged reference panel and high-coverage individuals, and merged reference panel with each of the six imputed data sets (with initial coverage between 0.1x and 2.0x). For each configuration and number of clusters, we ran ADMIXTURE for K between two and five with 20 replicates (20 different seeds) and chose the replicate that yielded the largest log-likelihood value. In the final run, we obtained the standard error and bias of the admixture estimates using the option --B 1000 that calculates these quantities with bootstrapping and 1000 replicates.
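Selecting, for each K, the replicate with the largest log-likelihood is a simple reduction over the runs. The sketch below assumes the log-likelihoods have already been parsed from the ADMIXTURE output into (K, seed, log-likelihood) tuples; that parsing step and the function name are not part of the original description.

```python
def best_replicates(runs):
    """runs: iterable of (K, seed, loglik) tuples.
    Returns {K: (seed, loglik)} keeping the largest log-likelihood per K."""
    best = {}
    for k, seed, loglik in runs:
        if k not in best or loglik > best[k][1]:
            best[k] = (seed, loglik)
    return best
```

With K between two and five and 20 seeds each, this reduces 80 runs per configuration to the four that are carried forward.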
Runs of homozygosity (ROH)
We estimated ROH with plink v1.90 with the parameters72 --homozyg, --homozyg-density 50, --homozyg-gap 100, --homozyg-kb 500, --homozyg-snp 50, --homozyg-window-het 1, --homozyg-window-snp 50 and --homozyg-window-threshold 0.05. We estimated ROH twice: i) using transversion sites only, thus excluding sites that can be affected by aDNA damage, and ii) using both transversions and transitions.
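The ROH parameters and the transversion-only restriction can be sketched as follows. The executable name, the use of --bfile and --out, and both function names are assumptions for illustration; the ROH flags and values are those listed above.

```python
def plink_roh_command(bfile, out):
    """Argument list for a PLINK 1.9 ROH run with the parameters from the text."""
    return [
        "plink", "--bfile", bfile,
        "--homozyg",
        "--homozyg-density", "50",
        "--homozyg-gap", "100",
        "--homozyg-kb", "500",
        "--homozyg-snp", "50",
        "--homozyg-window-het", "1",
        "--homozyg-window-snp", "50",
        "--homozyg-window-threshold", "0.05",
        "--out", out,
    ]


PURINES = {"A", "G"}


def is_transversion(ref, alt):
    """True if a SNP swaps a purine (A/G) for a pyrimidine (C/T) or vice versa."""
    return (ref.upper() in PURINES) != (alt.upper() in PURINES)
```

C-to-T and G-to-A changes, the signature of deamination damage, are transitions, which is why restricting to transversions removes the sites most exposed to aDNA damage.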
Datasets
Ancient genomes in this study
The 43 downsampled and imputed ancient genomes (Supplementary Table 1) were obtained from the “Ancient Genomes dataset” that was compiled in the context of the study of ref. 24.
Reference panel for imputation
We used a version of 1000 Genomes v5 phase 3 (2504 genomes)17, where the genomes were re-sequenced at 30x, and subsequently phased using TOPMed18, and with sites present in TOPMed. These data are available in European Nucleotide Archive, under project PRJEB31736 and secondary study accession ERP114329. Only biallelic sites were retained (~90 million SNPs). This panel was lifted over from build 38 to hg19 reference genome assembly using Picard liftoverVCF v1.18.11 (https://gatk.broadinstitute.org/hc/en-us/articles/360037060932-LiftoverVcf-Picard-), with hg38ToHg19 chain from the University of California, Santa Cruz liftOver tool (http://hgdownload.cse.ucsc.edu/goldenpath/hg38/liftOver/).
Present-day European genomes
This dataset consists of a subset of 23 European genomes from the Simons Genome Diversity Project (SGDP)84, as specified in Supplementary Table 3. We downloaded the corresponding bam files aligned to the hg19 reference genome from the Seven Bridges Cancer Genomics Cloud (https://www.cancergenomicscloud.org). We downsampled the data to 1x and imputed as before.
Reference panel for genetic clustering analyses
The Allen Ancient DNA Resource (AADR)85, which we refer to as the “1240K dataset”, is publicly available at https://reich.hms.harvard.edu/allen-ancient-dna-resource-aadr-downloadable-genotypes-present-day-and-ancient-dna-data.
We extracted a subset of the 1240K dataset63 containing ancient individuals of the three ancestries we were interested in: 26 Anatolian farmers (Anatolia_N), 26 Steppe individuals (Steppe_EMBA), and nine western hunter-gatherers (WHG), as specified in Supplementary Table 5, to the exclusion of Loschbour, a genome that was also included in the dataset of 42 high-coverage genomes that we downsampled and imputed. We converted this subset from eigenstrat format to plink bed using the convertf command (eigensoft package v7.2.1). After that, we used plink v1.90 to do all of the data handling, such as merging plink bed files and filtering out sites with high missingness.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Acknowledgements
We are thankful to Isabel Alves, Samuel Neuenschwander and J. Víctor Moreno-Mayar for fruitful discussions that contributed to improving this study. B.S.d.M. was supported by a Swiss National Science Foundation (SNSF) project grant (PP00P3_176977) to O.D. and by a European Research Council grant (grant agreement no. 679330) to A.-S.M. S.R. was supported by Swiss National Science Foundation (SNSF) project grant (PP00P3_176977). D.I.C.D. was supported by the European Research Council grant (grant agreement no. 679330) to A.-S.M. C.E.G.A. was supported by the National Institute of General Medical Sciences of the National Institutes of Health under award number R35GM142939. N.N.J. was supported by Aarhus University Research Foundation. H.S. was supported by the European Research Council (grant agreement no. 101045643).
Author contributions
B.S.d.M., A.-S.M., and O.D. designed the study and drafted the paper. B.S.d.M. and O.D. performed the experiments. S.R. helped with imputation. D.I.C.D., C.E.G.A., M.E.A., M.S. and E.W. helped with the population genetics analyses. H.S., M.E.A., N.N.J., M.H.S., P.W., A.S. and M.M.P. generated and provided the ancient trio data. This work has been supervised by O.D. and A.-S.M. All authors helped with interpretation and reviewed the final manuscript.
Peer review
Peer review information
Nature Communications thanks Carl Nettelblad and Harald Ringbauer for their contribution to the peer review of this work. A peer review file is available.
Data availability
All data supporting the findings described in this manuscript are available in the article and its Supplementary Information files, public repositories and from the corresponding author upon request. The Koszyce ancient trio data (RISE1159, RISE1160, RISE1168) generated in this study have been deposited in the European Nucleotide Archive (ENA) database under accession code PRJEB61632. The unfiltered imputed ancient genomes (original genomes were downsampled to depths of coverage in the range 0.1x–2.0x) are available in Zenodo (10.5281/zenodo.7993392). The 1000 Genomes Project phase 3: 30X coverage whole genome sequencing data is available at the European Nucleotide Archive, under project PRJEB31736 and secondary study accession ERP114329 (https://www.ebi.ac.uk/ena/browser/view/PRJEB31736). The SGDP bam files aligned to hg19 reference genome were downloaded from Seven Bridges Cancer Genomics Cloud. The AADR85 dataset is publicly available at https://reich.hms.harvard.edu/allen-ancient-dna-resource-aadr-downloadable-genotypes-present-day-and-ancient-dna-data. 
The remaining 40 ancient human genomes in this study originate from the following studies: atp01630 (10.1073/pnas.1717762115); Stuttgart & Loschbour41 (10.1038/nature13673); Ballynahatty & Rathlin144 (10.1073/pnas.1518445113); sf1245 (10.1371/journal.pbio.2003703); NE1 & BR246 (10.1038/ncomms6257); SIII48 (10.1126/science.aao1807); SSG-A-2, HSJ-A-1 & STT-A-249 (10.1126/science.aar2625); VK150 (10.1038/s41586-020-2688-8); SZ15, SZ3, SZ4, SZ45, SZ43 & SZ131 (10.1038/s41467-018-06024-4); baa01, ela01 & new0132 (10.1126/science.aao6266); I1087133 (10.1038/s41586-020-1929-1); Mota34 (10.1126/science.aad2879); KK135 (10.1038/ncomms9912); WC136 (10.1126/science.aaf7943); BOT2016 & Yamnaya37 (10.1126/science.aar7711); Andaman, AHUR_2064, Lovelock2, Lovelock3, Clovis, Sumidouro5, A46038 (10.1126/science.aav2621); USR142 (10.1038/nature25173); Saqqaq43 (10.1038/nature08835); Ust’-Ishim39 (10.1038/nature13810); Kolyma_River & Yana40 (10.1038/s41586-019-1279-z).
Code availability
The scripts we used to impute the ancient genomes, as well as the pre- and post-processing steps, can be found in the following github repository:86 https://github.com/bsmota/aDNA_imputation.
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
These authors contributed equally: Anna-Sapfo Malaspinas, Olivier Delaneau.
Contributor Information
Anna-Sapfo Malaspinas, Email: annasapfo.malaspinas@unil.ch.
Olivier Delaneau, Email: olivier.delaneau@unil.ch.
Supplementary information
The online version contains supplementary material available at 10.1038/s41467-023-39202-0.
References
- 1. Briggs AW, et al. Patterns of damage in genomic DNA sequences from a Neandertal. Proc. Natl Acad. Sci. USA. 2007;104:14616–14621. doi: 10.1073/pnas.0704665104.
- 2. Peyrégne S, Prüfer K. Present-day DNA contamination in ancient DNA datasets. BioEssays. 2020;42:1–11. doi: 10.1002/bies.202000081.
- 3. Patterson N, et al. Ancient admixture in human history. Genetics. 2012;192:1065–1093. doi: 10.1534/genetics.112.145037.
- 4. Günther, T. & Jakobsson, M. Population genomic analyses of DNA from ancient remains. In: Handbook of Statistical Genomics 1, 295–324 (Wiley, 2019).
- 5. Ringbauer H, Novembre J, Steinrücken M. Parental relatedness through time revealed by runs of homozygosity in ancient DNA. Nat. Commun. 2021;12:5425. doi: 10.1038/s41467-021-25289-w.
- 6. Günther T, Nettelblad C. The presence and impact of reference bias on population genomic studies of prehistoric human populations. PLoS Genet. 2019;15:e1008302. doi: 10.1371/journal.pgen.1008302.
- 7. Li N, Stephens M. Modeling linkage disequilibrium and identifying recombination hotspots using single-nucleotide polymorphism data. Genetics. 2003;165:2213–2233. doi: 10.1093/genetics/165.4.2213.
- 8. Marchini J, et al. A comparison of phasing algorithms for trios and unrelated individuals. Am. J. Hum. Genet. 2006;78:437–450. doi: 10.1086/500808.
- 9. Marchini J, Howie B. Genotype imputation for genome-wide association studies. Nat. Rev. Genet. 2010;11:499–511. doi: 10.1038/nrg2796.
- 10. Howie BN, Donnelly P, Marchini J. A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genet. 2009;5:e1000529. doi: 10.1371/journal.pgen.1000529.
- 11. Browning BL, Browning SR. Genotype imputation with millions of reference samples. Am. J. Hum. Genet. 2016;98:116–126. doi: 10.1016/j.ajhg.2015.11.020.
- 12. Spiliopoulou A, Colombo M, Orchard P, Agakov F, McKeigue P. GeneImp: Fast imputation to large reference panels using genotype likelihoods from ultralow coverage sequencing. Genetics. 2017;206:91–104. doi: 10.1534/genetics.117.200063.
- 13. Rubinacci S, Ribeiro DM, Hofmeister RJ, Delaneau O. Efficient phasing and imputation of low-coverage sequencing data using large reference panels. Nat. Genet. 2021;53:120–126. doi: 10.1038/s41588-020-00756-0.
- 14. Wasik K, et al. Comparing low-pass sequencing and genotyping for trait mapping in pharmacogenetics. BMC Genom. 2021;22:1–7. doi: 10.1186/s12864-021-07508-2.
- 15. Davies RW, et al. Rapid genotype imputation from sequence with reference panels. Nat. Genet. 2021;53:1104–1111. doi: 10.1038/s41588-021-00877-0.
- 16. McCarthy S, et al. A reference panel of 64,976 haplotypes for genotype imputation. Nat. Genet. 2016;48:1279–1283. doi: 10.1038/ng.3643.
- 17. Auton A, et al. A global reference for human genetic variation. Nature. 2015;526:68–74. doi: 10.1038/nature15393.
- 18. Taliun D, et al. Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program. Nature. 2021;590:290–299. doi: 10.1038/s41586-021-03205-y.
- 19. Martiniano R, et al. The population genomics of archaeological transition in west Iberia: Investigation of ancient substructure using imputation and haplotype-based methods. PLoS Genet. 2017;13:e1006852. doi: 10.1371/journal.pgen.1006852.
- 20. Haber M, et al. A genetic history of the near east from an aDNA time course sampling eight points in the past 4000 years. Am. J. Hum. Genet. 2020;107:149–157. doi: 10.1016/j.ajhg.2020.05.008.
- 21. Saupe T, et al. Ancient genomes reveal structural shifts after the arrival of Steppe-related ancestry in the Italian Peninsula. Curr. Biol. 2021;31:2576–2591.e12. doi: 10.1016/j.cub.2021.04.022.
- 22. Clemente F, et al. The genomic history of the Aegean palatial civilizations. Cell. 2021;184:2565–2586.e21. doi: 10.1016/j.cell.2021.03.039.
- 23. Cox SL, et al. Predicting skeletal stature using ancient DNA. Am. J. Biol. Anthropol. 2022;177:162–174. doi: 10.1002/ajpa.24426.
- 24. Allentoft, M. E. et al. Population Genomics of Stone Age Eurasia. bioRxiv 36, 2022.05.04.490594 (2022).
- 25. Hofreiter M, Serre D, Pääbo S. Ancient DNA. Nat. Rev. Genet. 2001;2:353–359. doi: 10.1038/35072071.
- 26. Hui R, D’Atanasio E, Cassidy LM, Scheib CL, Kivisild T. Evaluating genotype imputation pipeline for ultra-low coverage ancient genomes. Sci. Rep. 2020;10:1–8. doi: 10.1038/s41598-020-75387-w.
- 27. Browning BL, Tian X, Zhou Y, Browning SR. Fast two-stage phasing of large-scale sequence data. Am. J. Hum. Genet. 2021;108:1880–1890. doi: 10.1016/j.ajhg.2021.08.005.
- 28. Ausmees K, Sanchez-Quinto F, Jakobsson M, Nettelblad C. An empirical evaluation of genotype imputation of ancient DNA. G3 Genes|Genomes|Genet. 2022;12:jkac089. doi: 10.1093/g3journal/jkac089.
- 29. Browning SR, Browning BL. Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering. Am. J. Hum. Genet. 2007;81:1084–1097. doi: 10.1086/521987.
- 30. Valdiosera C, et al. Four millennia of Iberian biomolecular prehistory illustrate the impact of prehistoric migrations at the far end of Eurasia. Proc. Natl Acad. Sci. USA. 2018;115:3428–3433. doi: 10.1073/pnas.1717762115.
- 31. Amorim CEG, et al. Understanding 6th-century barbarian social organization and migration through paleogenomics. Nat. Commun. 2018;9:3547. doi: 10.1038/s41467-018-06024-4.
- 32. Schlebusch CM, et al. Southern African ancient genomes estimate modern human divergence to 350,000 to 260,000 years ago. Science. 2017;358:652–655. doi: 10.1126/science.aao6266.
- 33. Lipson M, et al. Ancient West African foragers in the context of African population history. Nature. 2020;577:665–670. doi: 10.1038/s41586-020-1929-1.
- 34. Gallego Llorente M, et al. Ancient Ethiopian genome reveals extensive Eurasian admixture throughout the African continent. Science. 2015;350:820–822. doi: 10.1126/science.aad2879.
- 35. Jones ER, et al. Upper Palaeolithic genomes reveal deep roots of modern Eurasians. Nat. Commun. 2015;6:1–8. doi: 10.1038/ncomms9912.
- 36. Broushaki F, et al. Early Neolithic genomes from the eastern Fertile Crescent. Science. 2016;353:499–503. doi: 10.1126/science.aaf7943.
- 37. de Barros Damgaard P, et al. The first horse herders and the impact of early Bronze Age steppe expansions into Asia. Science. 2018;360:eaar7711. doi: 10.1126/science.aar7711.
- 38. Moreno-Mayar JV, et al. Early human dispersals within the Americas. Science. 2018;362:eaav2621. doi: 10.1126/science.aav2621.
- 39. Fu Q, et al. Genome sequence of a 45,000-year-old modern human from western Siberia. Nature. 2014;514:445–449. doi: 10.1038/nature13810.
- 40. Sikora M, et al. The population history of northeastern Siberia since the Pleistocene. Nature. 2019;570:182–188. doi: 10.1038/s41586-019-1279-z.
- 41. Lazaridis I, et al. Ancient human genomes suggest three ancestral populations for present-day Europeans. Nature. 2014;513:409–413. doi: 10.1038/nature13673.
- 42. Moreno-Mayar JV, et al. Terminal Pleistocene Alaskan genome reveals first founding population of Native Americans. Nature. 2018;553:203–207. doi: 10.1038/nature25173.
- 43. Rasmussen M, et al. Ancient human genome sequence of an extinct Palaeo-Eskimo. Nature. 2010;463:757–762. doi: 10.1038/nature08835.
- 44. Cassidy, L. M. et al. Neolithic and Bronze Age migration to Ireland and establishment of the insular Atlantic genome. Proc. Natl Acad. Sci. 113, 2021 (2016).
- 45. Günther T, et al. Population genomics of Mesolithic Scandinavia: Investigating early postglacial migration routes and high-latitude adaptation. PLoS Biol. 2018;16:e2003703. doi: 10.1371/journal.pbio.2003703.
- 46. Gamba C, et al. Genome flux and stasis in a five millennium transect of European prehistory. Nat. Commun. 2014;5:1–9. doi: 10.1038/ncomms6257.
- 47. Schroeder H, et al. Unraveling ancestry, kinship, and violence in a Late Neolithic mass grave. Proc. Natl Acad. Sci. USA. 2019;116:10705–10710. doi: 10.1073/pnas.1820210116.
- 48.Sikora M, et al. Ancient genomes show social and reproductive behavior of early Upper Paleolithic foragers. Science. 2017;358:659–662. doi: 10.1126/science.aao1807. [DOI] [PubMed] [Google Scholar]
- 49.Ebenesersdóttir SS, et al. Ancient genomes from Iceland reveal the making of a human population. Science. 2018;360:1028–1032. doi: 10.1126/science.aar2625. [DOI] [PubMed] [Google Scholar]
- 50.Margaryan A, et al. Population genomics of the Viking world. Nature. 2020;585:390–396. doi: 10.1038/s41586-020-2688-8. [DOI] [PubMed] [Google Scholar]
- 51.Li H, et al. The sequence Alignment/Map format and SAMtools. Bioinforma. 2009;25:2078–2079. doi: 10.1093/bioinformatics/btp352. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 52.Coop, G. Genetic similarity versus genetic ancestry groups as sample descriptors in human genetics. 1–19 (2022).
- 53.Das S, Abecasis GR, Browning BL. Genotype imputation from large reference panels. Annu. Rev. Genom. Hum. Genet. 2018;19:73–96. doi: 10.1146/annurev-genom-083117-021602.
- 54.Bergström A, et al. Insights into human genetic variation and population history from 929 diverse genomes. Science. 2020;367:eaay5012. doi: 10.1126/science.aay5012.
- 55.Biddanda A, Steinrücken M, Novembre J. Properties of 2-locus genealogies and linkage disequilibrium in temporally structured samples. Genetics. 2022;221:iyac038. doi: 10.1093/genetics/iyac038.
- 56.Browning SR, Browning BL. Haplotype phasing: Existing methods and new developments. Nat. Rev. Genet. 2011;12:703–714. doi: 10.1038/nrg3054.
- 57.Delaneau O, Zagury JF, Marchini J. Improved whole-chromosome phasing for disease and population genetic studies. Nat. Methods. 2013;10:5–6. doi: 10.1038/nmeth.2307.
- 58.Delaneau O, Zagury JF, Robinson MR, Marchini JL, Dermitzakis ET. Accurate, scalable and integrative haplotype estimation. Nat. Commun. 2019;10:5436. doi: 10.1038/s41467-019-13225-y.
- 59.Maricic T, Whitten M, Pääbo S. Multiplexed DNA sequence capture of mitochondrial genomes using PCR products. PLoS One. 2010;5:9–13. doi: 10.1371/journal.pone.0014004.
- 60.Burbano HA, et al. Targeted investigation of the Neandertal genome by array-based sequence capture. Science. 2010;328:723–725. doi: 10.1126/science.1188046.
- 61.Fu Q, et al. DNA analysis of an early modern human from Tianyuan Cave, China. Proc. Natl Acad. Sci. USA. 2013;110:2223–2227. doi: 10.1073/pnas.1221359110.
- 62.Castellano S, et al. Patterns of coding variation in the complete exomes of three Neandertals. Proc. Natl Acad. Sci. USA. 2014;111:6666–6671. doi: 10.1073/pnas.1405138111.
- 63.Fu Q, et al. An early modern human from Romania with a recent Neanderthal ancestor. Nature. 2015;524:216–219. doi: 10.1038/nature14558.
- 64.Haak W, et al. Massive migration from the steppe was a source for Indo-European languages in Europe. Nature. 2015;522:207–211. doi: 10.1038/nature14317.
- 65.Mathieson I, et al. Genome-wide patterns of selection in 230 ancient Eurasians. Nature. 2015;528:499–503. doi: 10.1038/nature16152.
- 66.Ceballos FC, Joshi PK, Clark DW, Ramsay M, Wilson JF. Runs of homozygosity: windows into population history and trait architecture. Nat. Rev. Genet. 2018;19:220–234. doi: 10.1038/nrg.2017.109.
- 67.Alexander DH, Novembre J, Lange K. Fast model-based estimation of ancestry in unrelated individuals. Genome Res. 2009;19:1655–1664. doi: 10.1101/gr.094052.109.
- 68.Allentoft ME, et al. Population genomics of Bronze Age Eurasia. Nature. 2015;522:167–172. doi: 10.1038/nature14507.
- 69.Hofmanová Z, et al. Early farmers from across Europe directly descended from Neolithic Aegeans. Proc. Natl Acad. Sci. USA. 2016;113:6886–6891. doi: 10.1073/pnas.1523951113.
- 70.Mathieson I, et al. The genomic history of southeastern Europe. Nature. 2018;555:197–203. doi: 10.1038/nature25778.
- 71.Narasimhan VM, et al. The formation of human populations in South and Central Asia. Science. 2019;365:eaat7487. doi: 10.1126/science.aat7487.
- 72.Cassidy LM, et al. Neolithic and Bronze Age migration to Ireland and establishment of the insular Atlantic genome. Proc. Natl Acad. Sci. USA. 2016;113:368–373. doi: 10.1073/pnas.1518445113.
- 73.Malaspinas AS, et al. bammds: a tool for assessing the ancestry of low-depth whole-genome data using multidimensional scaling (MDS). Bioinformatics. 2014;30:2962–2964. doi: 10.1093/bioinformatics/btu410.
- 74.Bycroft C, et al. The UK Biobank resource with deep phenotyping and genomic data. Nature. 2018;562:203–209. doi: 10.1038/s41586-018-0579-z.
- 75.Rubinacci, S., Hofmeister, R., Sousa Da Mota, B. & Delaneau, O. Imputation of low-coverage sequencing data from 150,119 UK Biobank genomes. Nat Genet 2022.11.28.518213 (2022).
- 76.Meyer M, Kircher M. Illumina sequencing library preparation for highly multiplexed target capture and sequencing. Cold Spring Harb. Protoc. 2010;5:pdb.prot5448. doi: 10.1101/pdb.prot5448.
- 77.McKenna A, et al. The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010;20:1297–1303. doi: 10.1101/gr.107524.110.
- 78.Jun G, Wing MK, Abecasis GR, Kang HM. An efficient and scalable analysis framework for variant extraction and refinement from population scale DNA sequence data. Genome Res. 2015;25:gr.176552.114. doi: 10.1101/gr.176552.114.
- 79.Link, V. et al. ATLAS: analysis tools for low-depth and ancient samples. bioRxiv 105346 10.1101/105346 (2017).
- 80.Altshuler DM, et al. Integrating common and rare genetic variation in diverse human populations. Nature. 2010;467:52–58. doi: 10.1038/nature09298.
- 81.Karolchik D, et al. The UCSC Table Browser data retrieval tool. Nucleic Acids Res. 2004;32:D493–D496. doi: 10.1093/nar/gkh103.
- 82.Purcell S, et al. PLINK: A tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 2007;81:559–575. doi: 10.1086/519795.
- 83.Price AL, et al. Principal components analysis corrects for stratification in genome-wide association studies. Nat. Genet. 2006;38:904–909. doi: 10.1038/ng1847.
- 84.Mallick S, et al. The Simons Genome Diversity Project: 300 genomes from 142 diverse populations. Nature. 2016;538:201–206. doi: 10.1038/nature18964.
- 85.Mallick, S. et al. The Allen Ancient DNA Resource (AADR): A curated compendium of ancient human genomes. bioRxiv 2023.04.06.535797 10.1101/2023.04.06.535797 (2023).
- 86.Sousa Da Mota, B. et al. Imputation of ancient human genomes. GitHub 10.5281/zenodo.7836943 (2023).
Data Availability Statement
All data supporting the findings described in this manuscript are available in the article and its Supplementary Information files, in public repositories, and from the corresponding author upon request. The Koszyce ancient trio data (RISE1159, RISE1160, RISE1168) generated in this study have been deposited in the European Nucleotide Archive (ENA) database under accession code PRJEB61632. The unfiltered imputed ancient genomes (original genomes were downsampled to depths of coverage in the range 0.1x–2.0x) are available in Zenodo (10.5281/zenodo.7993392). The 1000 Genomes Project phase 3 30x coverage whole-genome sequencing data are available at the European Nucleotide Archive under project PRJEB31736 and secondary study accession ERP114329 (https://www.ebi.ac.uk/ena/browser/view/PRJEB31736). The SGDP BAM files aligned to the hg19 reference genome were downloaded from the Seven Bridges Cancer Genomics Cloud. The AADR85 dataset is publicly available at https://reich.hms.harvard.edu/allen-ancient-dna-resource-aadr-downloadable-genotypes-present-day-and-ancient-dna-data.
The remaining 40 ancient human genomes in this study originate from the following studies: atp016 (ref. 30; 10.1073/pnas.1717762115); Stuttgart & Loschbour (ref. 41; 10.1038/nature13673); Ballynahatty & Rathlin1 (ref. 44; 10.1073/pnas.1518445113); sf12 (ref. 45; 10.1371/journal.pbio.2003703); NE1 & BR2 (ref. 46; 10.1038/ncomms6257); SIII (ref. 48; 10.1126/science.aao1807); SSG-A-2, HSJ-A-1 & STT-A-2 (ref. 49; 10.1126/science.aar2625); VK1 (ref. 50; 10.1038/s41586-020-2688-8); SZ15, SZ3, SZ4, SZ45, SZ43 & SZ1 (ref. 31; 10.1038/s41467-018-06024-4); baa01, ela01 & new01 (ref. 32; 10.1126/science.aao6266); I10871 (ref. 33; 10.1038/s41586-020-1929-1); Mota (ref. 34; 10.1126/science.aad2879); KK1 (ref. 35; 10.1038/ncomms9912); WC1 (ref. 36; 10.1126/science.aaf7943); BOT2016 & Yamnaya (ref. 37; 10.1126/science.aar7711); Andaman, AHUR_2064, Lovelock2, Lovelock3, Clovis, Sumidouro5 & A460 (ref. 38; 10.1126/science.aav2621); USR1 (ref. 42; 10.1038/nature25173); Saqqaq (ref. 43; 10.1038/nature08835); Ust’-Ishim (ref. 39; 10.1038/nature13810); Kolyma_River & Yana (ref. 40; 10.1038/s41586-019-1279-z).
The scripts we used to impute the ancient genomes, as well as the pre- and post-processing steps, can be found in the following GitHub repository86: https://github.com/bsmota/aDNA_imputation.