Abstract
The ability to sequence ancient genomes has revolutionized the way we study evolutionary history by providing access to the most important aspect of evolution—time. Until recently, studying human demography, ecology, biology, and history using population genomic inference relied on contemporary genomic datasets. Over the past decade, the availability of human ancient DNA (aDNA) has increased rapidly, almost doubling every year, opening the way for spatiotemporal studies of ancient human populations. However, the multidimensionality of aDNA, with genotypes having temporal, spatial and genomic coordinates, and integrating multiple sources of data, poses a challenge for developing meta-analyses pipelines. To address this challenge, we developed a publicly-available interactive tool, DORA, which integrates multiple data types, genomic and non-genomic, in a unified interface. This web-based tool enables browsing sample metadata alongside additional layers of information, such as population structure, climatic data, and unpublished samples. Users can perform analyses on genotypes of these samples, or export sample subsets for external analyses. DORA integrates analyses and visualizations in a single intuitive interface, resolving the technical issues of combining datasets from different sources and formats, and allowing researchers to focus on the scientific questions that can be addressed through analysis of aDNA datasets.
Graphical Abstract
Introduction
Until recently, answering questions regarding evolution and demographics of human history has relied on inference from contemporary sequenced genomes and a small number of sequenced ancient genomes (1). Over the past decade, the availability of human ancient DNA (aDNA) has increased rapidly, almost doubling every year (2). These data include high-quality whole genome shotgun sequencing for a very small number of samples of interest, low coverage whole genome shotgun sequencing, and sequence capture targeting SNPs in specific SNP panels (e.g. Human Origins, 1240K (3,4)). These resources now allow for testing evolutionary questions directly, rather than relying on indirect inference, by tracking changes in allele frequencies, genomic compositions, and phenotypic predictions over many millennia (5–8).
The visualization and analysis of aDNA is challenging, partly owing to the multi-dimensionality of the individual data points. For every genotype from an individual sample in a dataset, there is a (i) genomic position, (ii) geographic location, and (iii) temporal position (sample dating, which also varies between dating methods). Different analyses can consider a subset of genotypes along windows in each of these dimensions. Selection of such windows can be based on geographic regions of interest, time periods of interest, and also on accessory data such as sample quality, specified regions of the genome, climatic databases, geographical boundaries, and other considerations relevant to the specific research question. One of the challenges of meta-analyses of ancient human genomes is the integration and visualization of these multiple sources of data, which is necessary to allow selection of the specific genotypes to be analyzed. Such selection procedures are often informed by exploratory analyses of the dataset. Tools have been developed that are specifically designed to visualize the results of analyses of aDNA (9), but these require additional tools in order to generate results. Ideally, the first steps of such meta-analyses would be conducted within a single framework that would allow exploration of the dataset by allowing intuitive selection of spatial, temporal, and genomic windows of genotypes, along with population genetic analyses of the selected genotypes.
Here, we present an interactive map of human aDNA samples, called Data Overlays for Research in Archaeogenomics (DORA). This tool visualizes aDNA samples on a geographic map, and allows selection of samples by visualizing the characteristics of the samples (e.g. metadata of the sample, such as coverage). It also includes additional data layers that can be displayed alongside the samples, such as climatic data and results of downstream population structure analyses. Importantly, the tool is designed to allow users to integrate these data layers in order to select subsets of samples according to geographic regions that can be drawn on the map, and temporal windows that can be selected using the slider under the map. These subsets can be exported, or used in meta-analyses through the DORA interface. The interface is web-based and its resources are stored on Amazon Web Services Simple Storage Service (AWS S3), while private user data added to the tool will be available only on the local machine and not uploaded to AWS S3. DORA can facilitate a range of meta-analyses of ancient genomes and will allow users to seamlessly integrate published resources with their unpublished genomic samples or other data layers.
To demonstrate the usefulness of this tool, we have pre-loaded a curated aDNA dataset, the Allen Ancient DNA Resource (AADR) (2), along with the TraCE21K climatic data from the database Climatologies at high resolution for the earth’s land surface areas (CHELSA) (10), and ADMIXTURE results for ancient samples in the AADR. We also developed an interface for loading polygenic scores (PGSs), which can be used to predict the phenotypes of the ancient samples based on their genomes, from the PGSCatalog database (11); DORA is designed to directly apply PGSs to the aDNA samples loaded into the tool. We demonstrate some of the analyses that can be performed with the tool, including tracking allele frequencies in geographically defined populations, computing temporal trends in genetic differentiation between populations using pairwise FST and PCA, and tracking polygenic score values of individuals and demarcated populations over time. DORA integrates this variety of data types, analyses and visualizations, in a single intuitive interface, resolving the technical issues of combining datasets from different sources and formats, and allowing researchers to focus on the analysis and the scientific questions of interest.
Features of DORA
The main display of DORA is a world map on top of which data is visualized. By default, DORA loads a topographical map of the world in equirectangular projection (12) and sample metadata (a screenshot of the main display with metadata loaded from AADR is shown in Figure 1). Samples are represented as dots on the map, with the color of the dots reflecting additional sample metadata, such as coverage or dating, as defined in the ‘Colormap’ box on the right. The map can be browsed spatially by zooming in or out, or dragging the map with the cursor. At the bottom of the map is a timeline that shows the number of available samples in predefined temporal bins (in 100s of years). A slider on the timeline with a blue transparent background defines which samples will be displayed on the map; users can shift the sides of the slider to define this window. In addition, a ‘variants panel’ can be opened by clicking on the ‘Variants’ tab at the bottom right of the screen. This displays selected variants according to their genomic regions; new genomic regions can be added by using the variants form that appears when clicking on the variant tab, while selected genomic regions can be removed by highlighting them in the variants form and clicking ‘Delete’. Variant selection in this panel is used for analyses that are run on multiple loci, such as FST or PCA (see the Example analyses section below).
To select samples according to their geographic positions, users can define ‘Regions’, which are colored polygons that demarcate samples according to their geographic coordinates for subsequent analyses or export. Geographic regions are created by clicking ‘Add region’ on the main menu. Once clicked, the next successive left-clicks on the map will create vertices of the polygon. Right-clicking will save the polygon with the marked vertices. Geographic regions are stored in the browser cache until deleted; these regions can be deleted by clicking a region and pressing the delete key. Clicking on a geographic region will also allow the user to edit the region label, export sample metadata, or exclude samples in the region from analyses.
Adding datasets
DORA overlays multiple data sources onto a single map display, allowing users to combine these sources when selecting samples for subsequent analyses (Figure 2). Layers that were pre-loaded to demonstrate the features of DORA will load by default, including the map (based on Natural Earth (12)), aDNA metadata and genotypes from the AADR (2), CHELSA climate model data for the past 20 000 years (10), and ADMIXTURE results for ancient samples in the AADR. Additional datasets can be added by clicking the ‘Add dataset’ button; these can either add samples to the map (primary datasets; see scheme in Figure 2B), or add metadata for existing samples, according to their sample IDs (Table 1). If samples have multiple possible attributes to display, the user is able to select the desired attribute to color the samples using the ‘Colormap’ menu near the top right corner of the map.
Table 1.
Data type | Use case | Formats |
---|---|---|
Samples | Displaying additional samples on the map with associated metadata | CSV/TSV |
Metadata | Displaying additional metadata for samples on the map, such as climate information or archaeological data | CSV/TSV |
Genotypes | Analysis of unpublished genetic data, adding imputed genotypes | PLINK BED |
Structure results | Displaying population structure of AADR or additional samples on the map | PLINK FAM+Q |
PGS | Predicting traits of loaded samples | PGSCatalog API |
Analyses
DORA allows users to conduct meta-analyses on the genetic data loaded into the interface, which includes the AADR genotypes loaded into DORA by default, and any other data loaded into the browser (in PLINK BED format). One of the main benefits of running analyses from the map interface is that users can integrate multiple data sources when defining subsets of samples, and run a variety of analyses on these subsets from the same interface. Analyses can be initiated by clicking the ‘New analysis’ button. Analyses are run on the selected geographic regions, in temporal windows assigned according to the timeline or in the analysis dialog box field ‘Temporal range (in years)’, and on genomic regions as defined in the variants panel (except the allele frequency analysis, which runs on a single variant specified in the analysis dialog box). Additional filtering criteria for samples and variants can be specified in the analysis dialog box. The available analyses include: (i) allele frequency analysis, which shows the temporal trajectory of the allele frequency of a specific variant; (ii) expected heterozygosity of selected genetic variants; (iii) pairwise FST between all the pairs of geographic regions (using the implementation of Hudson FST estimation (13) from scikit-allel (14)); (iv) PCA of samples from all geographic regions; and (v) computed polygenic scores of samples in all geographic regions, over the chosen temporal windows (PGSs must first be imported from PGSCatalog).
Analyses are run on AWS Lambda for publicly available data, and on the local machine for locally stored data. DORA uses Pyodide (15) to provide the same Python analysis environment in the browser and on AWS Lambda. The analysis script is loaded by the specific environment, and the BED file is read either from AWS S3 (in the case of AWS Lambda) or from the browser cache (in the case of local data). The user receives a link to the analysis results when the analysis is run; this link will display the results as soon as they are ready. When the analysis is complete, the results are also processed and plotted using Python code run using Pyodide. Each analysis comes with a number of pre-defined plotting formats, such as line plots or scatter plots; this Python code can be manipulated by the user by clicking ‘Edit code’, and the figure updated using the ‘Update’ button (Figure 1). All analysis plots presented here were generated by the analysis feature of DORA.
Example analyses
To illustrate some potential uses of DORA, we present here example analyses that utilize different features of the tool. These tasks did not require the preparation of data with external tools and analyses were completed within 1–5 min. The results are also processed and plotted by DORA, so that the entire task is completed using the same interface. In these examples, we used a previously conducted population structure analysis to select geographic regions, conducted with ADMIXTURE (16). The ADMIXTURE results are represented in Figure 3 as colored samples on the map, where each individual is colored by its main ancestry component. At the bottom of the map, above the timeline, a ‘STRUCTURE-plot’ is shown for all displayed samples. Based on these results, we drew three polygons representing Northwest Europe, the Kazakh Steppe and East Asia.
Allele frequency fluctuations over time in different populations
Tracking allele frequencies over time is a simple analysis that is important in order to identify evolutionary processes. Such analyses can be performed by DORA in the selected geographic and temporal windows. We selected two variants that are at intermediate frequencies in modern populations: a variant that was suggested to have undergone positive selection during the Black Plague (8,17) (rs2549794), and an allele that has been suggested to confer lactase persistence (rs4988235). In the temporal range selected, the frequency of the non-reference allele of the rs2549794 variant shows no clear trend in any geographic region (Figure 4A), whereas rs4988235 increases monotonically in the NW Europe and East Asia geographic regions (Figure 4B), an increase previously attributed to positive selection (18). Such analyses may be useful to examine whether observed trends are common or if they are indeed consistent with selection processes.
PCA of samples in selected geographic regions
DORA can run PCA on a subset of samples, for specified genomic regions. By default, DORA pools samples from all windows and runs the PCA on all selected variants. To deal with missing data, DORA includes different methods that can be selected in the analysis dialog box. By default DORA uses EMU, which is designed for datasets with high levels of missing data (19); alternatively, missing genotypes can be imputed using the allele frequency of all subsets or of the same subset, or setting the frequency as 0. We show here two possible displays of PCA results generated in DORA using EMU: a scatter plot of the first two principal components (Figure 4C), and a scatter plot of the temporal position of the samples and the first principal component (Figure 4D).
In the standard PCA plot, geographic region selection captures genetic differences between the samples located in these regions (Figure 4C). Overlap can be seen mainly between the Kazakh Steppe and Northwest Europe, but also between East Asia and the Kazakh Steppe. To see whether this overlap occurs in specific temporal windows, we can plot the dating of individual samples against the first principal component: here the separation of samples from different geographic regions is apparent in each selected period, with increasing overlap between East Asia and the Kazakh Steppe closer to the present (Figure 4D).
Computing pairwise FST values for geographic region pairs
We can also visualize genetic differentiation at a population-level using the FST statistic. DORA can compute pairwise FST values for groups of samples defined by geographic region, temporal window, and subsets of genetic variants. As an analog to (Figure 4D), we computed pairwise FST values for pairs of geographic regions within each temporal window (Figure 4E).
The results show high genetic differentiation between the three geographic regions that decreases towards the present, with a drastic decrease in FST around the period of the Yamnaya expansion (4000–5000ya) (20). It is important to note that FST values are also affected by the aspects of the scenario such as the number of samples used to calculate the FST value in each window, population structure, and within-population variance (21), which may also affect the FST values. The higher FST value for the Kazakh Steppe pairwise values in the 1000–2000ya temporal window can most likely be attributed to different sample sizes in this window. Such pairwise FST analyses can be used together with population structure analyses as a means to define distinct geographic regions that show genetic differentiation, before running downstream analyses.
Applying polygenic scores (PGSs) to aDNA samples
Polygenic scores (PGSs) are used to compute predicted trait values for an individual by weighting variants according to their contribution to the trait. Applying PGSs that have been developed using contemporary GWAS to aDNA samples provides an interesting opportunity to predict phenotypically meaningful genetic changes beyond single variants. In some cases, predicted traits can be correlated with measurable phenotypes, if they can be inferred from skeletal remains (7). To allow crosstalk between the growing number of PGSs in the curated PGSCatalog (11), and aDNA samples, we added a feature to DORA that allows users to import PGSs directly from PGSCatalog, using a simple search interface. Users can search for a PGS according to the trait of interest, and the selected PGS will be processed to be compatible with genotypic data loaded into DORA.
To demonstrate this feature, we present the results of applying a PGS of height, which was developed based on the UK Biobank (22), to aDNA samples in the three geographic regions. As with PCA results, the results of the PGS analysis can be displayed alongside other metadata of the samples, including their dating, coverage, climate, or additional user-defined metadata. We see no clear differences in the prediction of height between geographic regions or temporal windows according to this PGS analysis (Figure 4F). The application of PGSs to aDNA samples is not trivial due to the low transferability of PGSs between genetically distant populations (23). Nevertheless, the exploration of temporal changes in predicted trait values for different traits in different geographic regions could provide an important lens through which human evolution can be studied; DORA not only simplifies conducting such analyses, it also allows integrating it with environmental data, potentially allowing for identification of drivers of such phenotypic changes.
Discussion
We present here DORA, a web-based tool for exploratory analysis of aDNA data. To demonstrate the features of this tool, we pre-loaded publicly available datasets and conducted a number of example analyses. These analyses were completed using the same interface and without using external tools. More elaborate analyses can be conducted with additional data layers.
The goal of DORA is to make the growing body of published human aDNA datasets more accessible for visualization and analysis, and to make the integration of accessory datasets for selection of samples and correlation with genetic analyses more intuitive. DORA combines serverless cloud-compute for analysis of publicly available datasets with local, browser-based analysis of private data. The results of these analyses are displayed in the browser, and their display can be manipulated by the user. Using the same Python environment for cloud-based and local computation, in addition to the generation of plots, makes the analysis pipelines on DORA flexible, transparent and accessible to users, and supports straightforward development of additional analyses.
Interface-based tools are particularly important for aDNA studies, because the multidimensionality of the data can pose a challenge to defining comprehensive and coherent criteria for selecting samples and genotypes for analyses. Downstream analyses also regularly involve multiple computational tools, each of which can have different requirements for the input format of the data. With currently available tools, such as PLINK and Python map plotting libraries, the visualization and analysis of aDNA involves multiple steps and libraries, and is not interactive; in DORA, these can be combined into a single step, and performed from the same interface.
We envision this tool as an important resource for researchers in the field of human aDNA by making accessible analyses that combine multiple existing resources with population genetics tools. This will allow rapid and intuitive definition of analyses and exploration of published aDNA data alongside unpublished samples.
Acknowledgements
We would like to thank Shai Carmi, David Gokhman, Viviane Slon, Fabrizio Mafessoni and Liran Carmel for comments, and Merav Bruria Yefe Nof and Erez Sonnenblick from the Alpha Program in The Hebrew University Youth Division, in addition to members of the Greenbaum, Carmel, Slon, and Gokhman labs, for testing and commenting on features of the tool. We would also like to thank the reviewers for their comprehensive evaluation of the tool and suggestions for improvements.
Contributor Information
Keith D Harris, Department of Ecology, Evolution and Behavior, The Hebrew University of Jerusalem, Givat Ram, 9190401 Jerusalem, Israel.
Gili Greenbaum, Department of Ecology, Evolution and Behavior, The Hebrew University of Jerusalem, Givat Ram, 9190401 Jerusalem, Israel.
Data availability
DORA can be accessed at https://dora.modelrxiv.org, and its code is available at https://github.com/carrowkeel/dora and https://doi.org/10.6084/m9.figshare.25304920.v2. The latest version of the AADR is accessible at doi.org/10.7910/DVN/FFIDCW. The topographic map can be downloaded from Natural Earth (12). The environmental variables used to compute the CHELSA PCs can be downloaded from the CHELSA website (https://chelsa-climate.org/chelsa-trace21k). PGSs used can be downloaded from PGSCatalog (https://www.pgscatalog.org).
Funding
Funding for open access charge: Personal university funding.
Conflict of interest statement. None declared.
References
- 1. Stoneking M., Krause J. Learning about human population history from ancient and modern genomes. Nat. Rev. Genet. 2011; 12:603–614. [DOI] [PubMed] [Google Scholar]
- 2. Mallick S., Micco A., Mah M., Ringbauer H., Lazaridis I., Olalde I., Patterson N., Reich D. The Allen Ancient DNA Resource (AADR) a curated compendium of ancient human genomes. Sci. Data. 2024; 11:182. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3. Patterson N., Moorjani P., Luo Y., Mallick S., Rohland N., Zhan Y., Genschoreck T., Webster T., Reich D. Ancient admixture in human history. Genetics. 2012; 192:1065–1093. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4. Mathieson I., Lazaridis I., Rohland N., Mallick S., Patterson N., Roodenberg S.A., Harney E., Stewardson K., Fernandes D., Novak M., et al. Genome-wide patterns of selection in 230 ancient Eurasians. Nature. 2015; 528:499–503. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5. Skoglund P., Mathieson I. Ancient genomics of modern humans: the first decade. Annu. Rev. Genom. Hum. Genet. 2018; 19:381–404. [DOI] [PubMed] [Google Scholar]
- 6. Kerner G., Laval G., Patin E., Boisson-Dupuis S., Abel L., Casanova J.-L., Quintana-Murci L. Human ancient DNA analyses reveal the high burden of tuberculosis in Europeans over the last 2,000 years. Am. J. Hum. Genet. 2021; 108:517–524. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7. Cox S.L., Moots H.M., Stock J.T., Shbat A., Bitarello B.D., Nicklisch N., Alt K.W., Haak W., Rosenstock E., Ruff C.B. et al. Predicting skeletal stature using ancient DNA. Am. J. Biol. Anthropol. 2022; 177:162–174. [Google Scholar]
- 8. Barton A.R., Santander C.G., Skoglund P., Moltke I., Reich D., Mathieson I. Insufficient evidence for natural selection associated with the Black Death. 2023; bioRxiv doi:15 March 2023, preprint: not peer reviewed 10.1101/2023.03.14.532615. [DOI]
- 9. Liu X., Orlando L. mapDATAge: a ShinyR package to chart ancient DNA data through space and time. Bioinformatics. 2022; 38:3992–3994. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10. Karger D.N., Nobis M.P., Normand S., Graham C.H., Zimmermann N.E. CHELSA-TraCE21k–high-resolution (1 km) downscaled transient temperature and precipitation data since the Last Glacial Maximum. Clim. Past. 2023; 19:439–456. [Google Scholar]
- 11. Lambert S.A., Gil L., Jupp S., Ritchie S.C., Xu Y., Buniello A., McMahon A., Abraham G., Chapman M., Parkinson H. et al. The Polygenic Score Catalog as an open database for reproducibility and systematic evaluation. Nat. Genet. 2021; 53:420–425. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12. Natural Earth Natural Earth. 2023; https://www.naturalearthdata.com/.
- 13. Hudson R.R., Slatkin M., Maddison W.P. Estimation of levels of gene flow from DNA sequence data. Genetics. 1992; 132:583–589. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14. Miles A.pyup.io bot pyup.io bot Murillo R., Ralph P., Harding N., Pisupati R., Rae S., Millar T. cggh/scikit-allel: v1.3.3. 2021;
- 15. The Pyodide Development Team pyodide/pyodide. 2021;
- 16. Alexander D.H., Novembre J., Lange K. Fast model-based estimation of ancestry in unrelated individuals. Genome Res. 2009; 19:1655–1664. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17. Klunk J., Vilgalys T.P., Demeure C.E., Cheng X., Shiratori M., Madej J., Beau R., Elli D., Patino M.I., Redfern R. et al. Evolution of immune genes is associated with the Black Death. Nature. 2022; 611:312–319. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18. Patterson N., Isakov M., Booth T., Büster L., Fischer C.-E., Olalde I., Ringbauer H., Akbari A., Cheronet O., Bleasdale M. et al. Large-scale migration into Britain during the Middle to Late Bronze Age. Nature. 2022; 601:588–594. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19. Meisner J., Liu S., Huang M., Albrechtsen A. Large-scale inference of population structure in presence of missingness using PCA. Bioinformatics. 2021; 37:1868–1875. [DOI] [PubMed] [Google Scholar]
- 20. Haak W., Lazaridis I., Patterson N., Rohland N., Mallick S., Llamas B., Brandt G., Nordenfelt S., Harney E., Stewardson K. et al. Massive migration from the steppe was a source for Indo-European languages in Europe. Nature. 2015; 522:207–211. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21. Alcala N., Rosenberg N.A. Mathematical constraints on FST: biallelic markers in arbitrarily many populations. Genetics. 2017; 206:1581–1600. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22. Bycroft C., Freeman C., Petkova D., Band G., Elliott L.T., Sharp K., Motyer A., Vukcevic D., Delaneau O., O’Connell J. et al. The UK Biobank resource with deep phenotyping and genomic data. Nature. 2018; 562:203–209. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23. Privé F., Aschard H., Carmi S., Folkersen L., Hoggart C., O’Reilly P.F., Vilhjálmsson B.J. Portability of 245 polygenic scores when derived from the UK Biobank and applied to 9 ancestry groups from the same cohort. Am. J. Hum. Genet. 2022; 109:12–23. [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Data Availability Statement
DORA can be accessed at https://dora.modelrxiv.org, and its code is available at https://github.com/carrowkeel/dora and https://doi.org/10.6084/m9.figshare.25304920.v2. The latest version of the AADR is accessible at doi.org/10.7910/DVN/FFIDCW. The topographic map can be downloaded from Natural Earth (12). The environmental variables used to compute the CHELSA PCs can be downloaded from the CHELSA website (https://chelsa-climate.org/chelsa-trace21k). PGSs used can be downloaded from PGSCatalog (https://www.pgscatalog.org).